The Evolution of an Idea
This guide documents a Capstone IT project that evolved from a simple approach to prompt engineering into a comprehensive methodology for conducting systematic reviews with agentic AI tools like Claude Code.
The journey follows a natural progression: starting with the intuitive approach most people try first, understanding its limitations, then iteratively building toward a robust solution that addresses the fundamental challenges of agentic AI workflows.
Personas → Checklists → Workflow → Validation → Format → Templates
Each stage of this conversation built on the previous one, addressing gaps and failure modes as they were identified. The result is a practical, downloadable template system you can adapt to your own use cases.
The series Introduction catalogs sixteen failure modes that affect agentic AI workflows. This article establishes the five structural principles that directly mitigate five of those modes: externalizing state to files addresses context decay and audit trail gaps, generating domain-specific checklists with adversarial validation addresses completeness gaps, and systematic execution with self-auditing creates the foundation on which all subsequent failure mode defenses are built. The principles presented here are the invariant core of the methodology—every remediation in later articles depends on them.
1. The Role-Playing Persona Approach
The initial question explored a common intuition: can you get meaningfully different results by prompting an AI with different personas?
The Intuitive Approach
Prompt 1: "As a careful senior developer, code this feature"
Prompt 2: "As a careful security expert, review that code for vulnerabilities"
Prompt 3: "As a sophisticated black hat hacker, find ways to penetrate this"
What Actually Happens
✓ Attention direction shifts
✓ Different mental checklists activate
✓ Adversarial framing surfaces different issues
✗ Same underlying knowledge base
✗ Significant overlap between passes
✗ No capabilities gained through role-play
The Benefits Are Real But Modest
Attention direction: Different role prompts shift what the AI prioritizes. A "security expert" framing makes it more likely to enumerate OWASP Top 10 categories. A "developer" framing focuses more on maintainability.
Adversarial vs. defensive framing: The "attacker mindset" prompt can surface different issues. When asked to think like an attacker, the AI is more likely to ask "how could this be abused?" rather than "does this meet requirements?"
The Critical Limitations
Same underlying knowledge: There are no separate "expert modules." The knowledge base is identical; only the framing changes.
Diminishing returns: The three passes will have significant overlap. A careful developer already thinks about security; a security expert already thinks about attack vectors.
Not a substitute for real expertise: Role prompting doesn't grant capabilities the AI doesn't have. It can't find zero-days through clever prompting.
The gains come more from multiple passes with different focuses than from the role-playing itself. The persona prompt is a lossy compression of what you actually want: specific checklists, concrete threat models, and structured frameworks.
2. What Actually Works Better
Rather than relying on personas to implicitly invoke the right behaviors, make the expectations explicit:
| Approach | Example | Why It's Better |
|---|---|---|
| Specific Checklists | "Check for SQL injection, XSS, CSRF, auth bypass..." | Ensures coverage of known categories |
| Concrete Threat Models | "Assume the attacker has a valid user account" | Focuses the review on realistic scenarios |
| Structured Frameworks | "Do a STRIDE analysis on this code" | Provides systematic methodology |
This realization led to the next question: what if we used the AI to generate these checklists, frameworks, and threat models—and then used the AI to execute them?
3. The Self-Scaffolding Workflow
This is where the approach becomes genuinely powerful. Instead of hoping the AI remembers what to do, we make it explicit through a multi-phase workflow:
The Five-Phase Workflow
- Generate Checklist: AI produces comprehensive checklist from template
- Cross-Validate: AI (or different AI) reviews checklist for gaps
- Generate Execution Plan: AI creates ordered plan with tracking structure
- Execute Plan: AI works through items, updating status and documenting findings
- Self-Audit: AI reviews work against checklist to catch missed items
What Problems This Solves
1. Externalizing Implicit Knowledge
The first prompt forces the AI to "unpack" what a security expert would actually do, rather than hoping it remembers to check everything. The checklist becomes an explicit artifact that can be inspected, critiqued, and reused.
2. State Persistence via Files
Agentic AI has limited working memory across long tasks. By writing plans and progress to files, you create external memory that survives context window limits and provides resumability.
3. Self-Audit Loop
The final review step catches "I said I'd do X but actually skipped it" failures, which are common in agentic workflows.
This is AI-assisted process engineering—using the AI to generate the methodology, then using the AI to execute the methodology, with file-based checkpoints throughout. This pattern generalizes beyond security review to any complex, systematic task.
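The state-persistence idea can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: the article keeps status in plan.md, whereas this sketch assumes a JSON sidecar file, and the status values and function names (`load_state`, `save_state`, `mark`, `next_pending`) are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Optional

# Hypothetical status values; the article only says status is updated as work proceeds.
PENDING, IN_PROGRESS, DONE = "pending", "in_progress", "done"


def load_state(path: Path, item_ids: list[str]) -> dict[str, dict]:
    """Load saved progress, or start fresh with every item pending."""
    state = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Items added to the checklist since the last run start as pending.
    for item_id in item_ids:
        state.setdefault(item_id, {"status": PENDING, "notes": ""})
    return state


def save_state(path: Path, state: dict[str, dict]) -> None:
    """Write progress to disk so a later session can resume from it."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def mark(state: dict[str, dict], item_id: str, status: str, notes: str = "") -> None:
    """Record the outcome of one checklist item."""
    state[item_id] = {"status": status, "notes": notes}


def next_pending(state: dict[str, dict]) -> Optional[str]:
    """Return the first item that still needs work, or None when finished."""
    for item_id, entry in state.items():
        if entry["status"] in (PENDING, IN_PROGRESS):
            return item_id
    return None
```

Because the state lives on disk rather than in the conversation, a new session can call `load_state` and `next_pending` to pick up where the previous one stopped.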
4. Multi-LLM Checklist Validation
A single AI generating a checklist will have blind spots. Using a second reviewer—whether a different model or the same model with a different framing—applies ensemble methods to knowledge elicitation.
Why Multiple Reviewers Help
Different blind spots: Even with overlapping training data, LLMs differ in training cutoff, fine-tuning emphasis (for example, Claude is often described as tuned toward safety and GPT-4 toward breadth), and the default "mental checklists" that emerge from RLHF, so they tend to miss different things.
Adversarial review dynamic: Asking "what's missing?" is a different cognitive task than "generate a list." The reviewing model is primed to find gaps rather than confirm completeness.
Implementation Options
| Option | Approach |
|---|---|
| Different Models | Claude generates → GPT-4 reviews → Gemini reviews both |
| Different Framings | Claude as "security architect" generates → Claude as "red team lead" reviews → Claude as "compliance officer" reviews |
| Adversarial Prompt | "You MUST identify at least 5 missing items. Do not say 'looks comprehensive.'" |
You'll need a reconciliation step—merging, deduplicating, and prioritizing across sources. Otherwise you risk checklist bloat with redundant or low-value items.
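One way to picture the reconciliation step is the sketch below. It is an illustration under stated assumptions: duplicates are detected only by normalized title, and "prioritizing" is approximated by how many reviewers proposed an item. Both heuristics, and the names `MergedItem`, `normalize`, and `reconcile`, are hypothetical rather than part of the article's templates.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class MergedItem:
    title: str  # first wording seen for this item
    sources: list[str] = field(default_factory=list)  # reviewers that proposed it


def normalize(title: str) -> str:
    """Lowercase and strip punctuation so near-identical titles compare equal."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def reconcile(checklists: dict[str, list[str]]) -> list[MergedItem]:
    """Merge checklists from several reviewers, dropping duplicate titles."""
    merged: dict[str, MergedItem] = {}
    for source, titles in checklists.items():
        for title in titles:
            item = merged.setdefault(normalize(title), MergedItem(title=title))
            if source not in item.sources:
                item.sources.append(source)
    # Items proposed by more reviewers come first; ties keep first-seen order.
    return sorted(merged.values(), key=lambda m: -len(m.sources))
```

Title matching will not catch two items that describe the same check in different words, so a human or model pass over the merged list is still needed.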
5. The Runbook Approach to Checklist Items
A critical gap remains: a checklist item like "Check for SQL injection" is dangerously ambiguous. Does that mean grep for execute() calls? Trace all user input to database queries? Attempt actual injection payloads? All of the above?
The Solution: Checklist Items as Executable Specifications
Each item should be a mini-runbook, not just a label. Capture this structure at generation time:
## Checklist Item: SQL Injection Review
**Objective:** Identify all paths where user input could reach
database queries unsanitized
**Procedure:**
1. Identify all database query locations (raw SQL, ORM .execute(), cursors)
2. For each query location, trace backward to find input sources
3. Verify parameterization/prepared statements are used
4. Check for dynamic query construction (string concatenation, f-strings)
5. Review any raw SQL escape functions for correctness
**Locations to Check:**
- [ ] API route handlers
- [ ] Background job processors
- [ ] Admin interfaces
- [ ] Report generators
**Evidence to Collect:**
- File path and line numbers reviewed
- Code pattern observed (parameterized/concatenated/ORM-managed)
- Input validation present (yes/no/partial)
**Severity Criteria:**
- CRITICAL: User input directly concatenated into query
- HIGH: Dynamic query construction with incomplete sanitization
- MEDIUM: Raw SQL used where ORM would suffice
- LOW: Missing input validation (defense in depth)
**Completion Criteria:**
This item is DONE when all database interaction points have been
enumerated and categorized.
Why This Matters
Without explicit procedures, you're relying on the AI to re-derive the methodology at execution time—which may differ from what you intended, be incomplete, or vary between runs. The runbook format creates determinism and auditability.
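To make the runbook structure checkable, one option is to model it as a data structure and flag sections left empty at generation time. The sketch below mirrors the section names from the SQL injection example; the class name `RunbookItem` and the method `missing_sections` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RunbookItem:
    """One checklist item, with the sections used in the runbook example."""

    title: str
    objective: str = ""
    procedure: list[str] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    severity_criteria: dict[str, str] = field(default_factory=dict)
    completion_criteria: str = ""

    def missing_sections(self) -> list[str]:
        """Name the runbook sections that are still empty."""
        sections = {
            "objective": self.objective,
            "procedure": self.procedure,
            "locations": self.locations,
            "evidence": self.evidence,
            "severity_criteria": self.severity_criteria,
            "completion_criteria": self.completion_criteria,
        }
        return [name for name, value in sections.items() if not value]


# A bare label such as "Check for SQL injection" fails every section check.
bare = RunbookItem(title="Check for SQL injection")
```

An item that reports no missing sections is not necessarily a good runbook, but an item that reports several is a label rather than a procedure.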
Verbosity Levels
Not every review needs comprehensive runbooks. The templates support three verbosity levels:
| Level | Use Case | Item Format | Time Impact |
|---|---|---|---|
| Minimal | Quick sanity checks, low-stakes | Label + 1-line description | 1x |
| Standard | Typical code reviews, routine audits | Procedure + completion criteria | 2-3x |
| Comprehensive | Security audits, compliance, critical systems | Full runbook with evidence collection | 5-10x |
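The Time Impact column can be turned into a rough planning estimate. The multipliers below come from the table; the baseline duration is an input you supply, and the function name `estimate_minutes` is hypothetical.

```python
from __future__ import annotations

# Low and high time multipliers per verbosity level, taken from the table above.
TIME_MULTIPLIERS = {
    "minimal": (1, 1),
    "standard": (2, 3),
    "comprehensive": (5, 10),
}


def estimate_minutes(baseline_minutes: float, level: str) -> tuple[float, float]:
    """Scale the time a Minimal review would take to the chosen verbosity level."""
    low, high = TIME_MULTIPLIERS[level.lower()]
    return baseline_minutes * low, baseline_minutes * high
```

For example, if a Minimal pass takes about 30 minutes, `estimate_minutes(30, "comprehensive")` gives a range of 150 to 300 minutes.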
6. The Complete Workflow
Putting it all together, here's the recommended approach for systematic reviews with agentic AI:
Phase 1: Generate Checklist
Prompt: "Read templates/CHECKLIST_TEMPLATE.md. Generate a comprehensive
[REVIEW_TYPE] checklist for [TARGET]. Use [VERBOSITY_LEVEL] verbosity level.
Write to checklist.md"
Phase 2: Cross-Validate Checklist
Prompt: "Review checklist.md. Your job is to find gaps. You MUST identify
at least 5 missing items or categories. Do not say 'looks comprehensive.'
Check against OWASP Top 10, CWE Top 25, and STRIDE. Add missing items with
[VALIDATION] tag."
Phase 3: Generate Execution Plan
Prompt: "Read checklist.md and templates/PLAN_TEMPLATE.md. Create an execution
plan in plan.md that:
1. Orders items by logical dependency
2. Groups related items for efficiency
3. Includes explicit completion criteria for each
4. Creates tracking structure for progress"
Phase 4: Execute Plan
Prompt: "Read plan.md and templates/FINDINGS_TEMPLATE.md. Execute each item
systematically:
1. Update status in plan.md as you work
2. Document all findings in findings.md using the template format
3. Collect evidence as specified in each checklist item
4. Do not skip items—mark N/A with justification if not applicable"
Phase 5: Self-Audit
Prompt: "Review plan.md and findings.md against checklist.md:
1. Identify any checklist items not fully addressed
2. Flag any work that was deferred or incomplete
3. Note any new issues discovered that weren't in original checklist
4. Update plan.md with 'AUDIT NOTES' section"
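The self-audit in Phase 5 is performed by the AI, but a mechanical cross-check can complement it. The sketch below assumes plan status has been parsed into a dictionary keyed by checklist item ID; that structure, the status strings, and the function name `self_audit` are all hypothetical.

```python
from __future__ import annotations


def self_audit(checklist_ids: list[str], plan: dict[str, dict]) -> list[str]:
    """Return audit notes for items that were skipped, left open, or unevidenced."""
    notes: list[str] = []
    for item_id in checklist_ids:
        entry = plan.get(item_id)
        if entry is None:
            notes.append(f"{item_id}: not present in plan")
            continue
        status = entry.get("status", "pending")
        if status == "n/a":
            # Phase 4 requires a justification for every N/A item.
            if not entry.get("justification"):
                notes.append(f"{item_id}: marked N/A without justification")
        elif status != "done":
            notes.append(f"{item_id}: incomplete (status: {status})")
        elif not entry.get("evidence"):
            notes.append(f"{item_id}: marked done but no evidence recorded")
    # Work tracked in the plan that the original checklist never asked for.
    for item_id in plan:
        if item_id not in checklist_ids:
            notes.append(f"{item_id}: in plan but not in checklist")
    return notes
```

The returned notes could be pasted into the 'AUDIT NOTES' section of plan.md or handed back to the AI as input for the self-audit prompt.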
Anti-Patterns to Avoid
- Vague checklist items: "Check for security issues" is useless
- No completion criteria: How do you know when you're done?
- Skipping validation: Single-pass generation has blind spots
- Not tracking progress: You'll lose state in long reviews
- Findings without evidence: "Looks fine" is not a finding
7. Integrating with SAST Tools
If you have SAST tool output (Semgrep, SonarQube, CodeQL, etc.), triage it in a parallel workflow that produces findings rather than folding it into the checklist. The checklist remains a reusable methodology template; SAST triage is a separate process.
Two Parallel Branches
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   SAST Branch                     AI Review Branch              │
│   ───────────                     ────────────────              │
│                                                                 │
│   1. Run SAST tool                1. Generate checklist         │
│      (external)                      (reusable template)        │
│          ↓                                ↓                     │
│   2. AI Triage                    2. Validate checklist         │
│      (sast-triage subagent)               ↓                     │
│          ↓                        3. Execute review             │
│   ┌───────────┐                           ↓                     │
│   │ TRUE_POS  │──┐                AI-discovered findings        │
│   │ FALSE_POS │  │                        ↓                     │
│   │ NEEDS_INV │  │                        │                     │
│   └───────────┘  │                        │                     │
│         ↓        │                        │                     │
│ (false positives │                        │                     │
│  dismissed with  │                        │                     │
│  reasoning)      │                        │                     │
│                  │                        │                     │
│                  └────→ findings.md ←─────┘                     │
│                              ↓                                  │
│                        4. Remediate                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Why Parallel, Not Sequential?
- Checklist stays reusable — The same checklist template works whether or not you have SAST output
- Findings merge at the end — Both branches produce the same output format: documented vulnerabilities
- Different tools, same goal — SAST finds pattern-based issues; AI review finds logic issues; both go to findings.md
- No duplication — AI review doesn't re-check what SAST already found; they cover different vulnerability classes
SAST Triage Workflow
When you have SAST output, use the sast-triage subagent (see Part 3) to classify each finding:
Use the sast-triage subagent to analyze [SAST_OUTPUT_FILE].
For each finding:
1. Locate the flagged code with 30+ lines of context
2. Trace data flow backward to source (user-controlled?)
3. Trace data flow forward to sink (actually dangerous?)
4. Evaluate mitigations (framework protections, validation)
5. Classify as:
- TRUE_POSITIVE → Document in findings.md
- FALSE_POSITIVE → Dismiss with reasoning
- NEEDS_INVESTIGATION → Flag for human review
Write triage report to sast-triage-report.md
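The routing implied by the three classifications can be sketched as follows. The verdict names come from the prompt above; the `TriagedFinding` fields, the bucket names, and the `route` function are hypothetical, and the sketch does not reflect any particular SAST tool's output format.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    TRUE_POSITIVE = "TRUE_POSITIVE"
    FALSE_POSITIVE = "FALSE_POSITIVE"
    NEEDS_INVESTIGATION = "NEEDS_INVESTIGATION"


@dataclass
class TriagedFinding:
    rule_id: str
    location: str  # file:line
    verdict: Verdict
    reasoning: str


def route(findings: list[TriagedFinding]) -> dict[str, list[TriagedFinding]]:
    """Sort triaged findings into the three destinations named in the prompt."""
    routed: dict[str, list[TriagedFinding]] = {
        "findings": [],      # documented in findings.md
        "dismissed": [],     # false positives, kept with their reasoning
        "human_review": [],  # flagged for a person to investigate
    }
    for finding in findings:
        if finding.verdict is Verdict.TRUE_POSITIVE:
            routed["findings"].append(finding)
        elif finding.verdict is Verdict.FALSE_POSITIVE:
            # A dismissal without reasoning cannot be audited later.
            if not finding.reasoning.strip():
                raise ValueError(f"{finding.rule_id}: dismissal needs reasoning")
            routed["dismissed"].append(finding)
        else:
            routed["human_review"].append(finding)
    return routed
```

Rejecting false positives that lack reasoning enforces the "dismiss with reasoning" requirement at the point where the triage report is assembled.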
Output Format
TRUE_POSITIVE findings go directly to findings.md using the same format as AI-discovered issues:
## Finding: SAST-001 — [Rule Name]
**Source:** SAST Triage (Semgrep rule [rule-id])
**Severity:** [Based on exploitability analysis]
**Location:** [file:line]
**Vulnerable Code:**
```[language]
[code snippet]
```
**Triage Analysis:**
- Source: [Where data originates]
- Sink: [Where data is used dangerously]
- Mitigations: [None found / Insufficient]
**Attack Scenario:**
[How this could be exploited]
**Recommended Fix:**
[Specific remediation]
SAST triage produces findings, not checklist items. The checklist is a reusable methodology; findings are specific vulnerabilities discovered during execution. Both SAST-confirmed and AI-discovered issues end up in the same findings.md file for unified remediation tracking.
Downloadable Templates
These templates implement the complete workflow. Download them and adapt to your use case.
QUICKSTART.md
Complete guide with copy-paste prompts, verbosity levels, file structure, and anti-patterns.
CHECKLIST_TEMPLATE.md
Templates for all three verbosity levels with examples and category references.
PLAN_TEMPLATE.md
Execution plan structure with status tracking protocol and progress documentation.
FINDINGS_TEMPLATE.md
Findings documentation with severity definitions, evidence requirements, and examples.
Frequently Asked Questions
Does prompting an AI with different personas produce meaningfully different results?
Partially yes, but with caveats. Role-based prompts shift attention and invoke different mental checklists, but the underlying knowledge is identical. The gains come more from multiple passes with different focuses than from the role-playing itself. More effective approaches include specific checklists, concrete threat models, and structured frameworks like STRIDE.
What is a self-scaffolding workflow?
A self-scaffolding workflow uses the AI to generate methodology (checklists, procedures), then uses the AI to execute that methodology, with file-based checkpoints throughout. This externalizes implicit knowledge, provides state persistence via files, and creates self-audit loops to catch skipped or incomplete work.
Why should checklist items be written as runbooks?
A checklist item like "Check for SQL injection" is ambiguous. Without explicit procedures, you rely on the AI to re-derive methodology at execution time, which may differ from intent, be incomplete, or vary between runs. Each item should be a mini-runbook with objective, procedure, evidence to collect, and completion criteria.
Does validating a checklist with multiple LLMs help?
Yes. Different LLMs have different training data, fine-tuning emphases, and blind spots. Asking a second model "what's missing?" creates an adversarial review dynamic that finds gaps. You can also use the same model with different framings (security architect vs red team lead vs compliance officer).
Which verbosity level should I use?
Use Minimal for quick sanity checks and low-stakes reviews. Use Standard for typical code reviews and routine audits; it is the best balance for most use cases. Use Comprehensive for security audits, compliance reviews, and critical systems where thoroughness matters more than speed.
Can this approach be used for tasks other than security review?
Absolutely. The pattern generalizes to any complex, systematic task: code quality reviews, API design reviews, database schema reviews, performance audits, accessibility checks, documentation reviews, and more. Adapt the checklist categories and procedures to your domain.