Agentic AI Security Review: From Role-Playing Prompts to Systematic Workflows

The Evolution of an Idea

This guide documents a Capstone IT project that evolved from a simple approach to prompt engineering into a comprehensive methodology for conducting systematic reviews with agentic AI tools like Claude Code.

The journey follows a natural progression: starting with the intuitive approach most people try first, understanding its limitations, then iteratively building toward a robust solution that addresses the fundamental challenges of agentic AI workflows.

Role-Playing Personas → Explicit Checklists → Multi-Pass Workflow → Multi-LLM Validation → Runbook Format → Complete Templates

Each stage of this conversation built on the previous one, addressing gaps and failure modes as they were identified. The result is a practical, downloadable template system you can adapt to your own use cases.

The series Introduction catalogs sixteen failure modes that affect agentic AI workflows. This article establishes the five structural principles that directly mitigate five of those modes: externalizing state to files addresses context decay and audit trail gaps, generating domain-specific checklists with adversarial validation addresses completeness gaps, and systematic execution with self-auditing creates the foundation on which all subsequent failure mode defenses are built. The principles presented here are the invariant core of the methodology—every remediation in later articles depends on them.

1. The Role-Playing Persona Approach

The initial question explored a common intuition: can you get meaningfully different results by prompting an AI with different personas?

The Intuitive Approach

Prompt 1: "As a careful senior developer, code this feature"

Prompt 2: "As a careful security expert, review that code for vulnerabilities"

Prompt 3: "As a sophisticated black hat hacker, find ways to penetrate this"

What Actually Happens

✓ Attention direction shifts
✓ Different mental checklists activate
✓ Adversarial framing surfaces different issues

✗ Same underlying knowledge base
✗ Significant overlap between passes
✗ No capabilities gained through role-play

The Benefits Are Real But Modest

Attention direction: Different role prompts shift what the AI prioritizes. A "security expert" framing makes it more likely to enumerate OWASP Top 10 categories. A "developer" framing focuses more on maintainability.

Adversarial vs. defensive framing: The "attacker mindset" prompt can surface different issues. When asked to think like an attacker, the AI is more likely to ask "how could this be abused?" rather than "does this meet requirements?"

The Critical Limitations

Same underlying knowledge: There are no separate "expert modules." The knowledge base is identical; only the framing changes.

Diminishing returns: The three passes will have significant overlap. A careful developer already thinks about security; a security expert already thinks about attack vectors.

Not a substitute for real expertise: Role prompting doesn't grant capabilities the AI doesn't have. It can't find zero-days through clever prompting.

Key Insight

The gains come more from multiple passes with different focuses than from the role-playing itself. The persona prompt is a lossy compression of what you actually want: specific checklists, concrete threat models, and structured frameworks.

2. What Actually Works Better

Rather than relying on personas to implicitly invoke the right behaviors, make the expectations explicit:

Approach	Example	Why It's Better
Specific Checklists	"Check for SQL injection, XSS, CSRF, auth bypass..."	Ensures coverage of known categories
Concrete Threat Models	"Assume the attacker has a valid user account"	Focuses the review on realistic scenarios
Structured Frameworks	"Do a STRIDE analysis on this code"	Provides systematic methodology

This realization led to the next question: what if we used the AI to generate these checklists, frameworks, and threat models—and then used the AI to execute them?

3. The Self-Scaffolding Workflow

This is where the approach becomes genuinely powerful. Instead of hoping the AI remembers what to do, we make it explicit through a multi-phase workflow:

The Five-Phase Workflow

1. Generate Checklist: AI produces comprehensive checklist from template
2. Cross-Validate: AI (or different AI) reviews checklist for gaps
3. Generate Execution Plan: AI creates ordered plan with tracking structure
4. Execute Plan: AI works through items, updating status and documenting findings
5. Self-Audit: AI reviews work against checklist to catch missed items

What Problems This Solves

1. Externalizing Implicit Knowledge

The first prompt forces the AI to "unpack" what a security expert would actually do, rather than hoping it remembers to check everything. The checklist becomes an explicit artifact that can be inspected, critiqued, and reused.

2. State Persistence via Files

Agentic AI has limited working memory across long tasks. By writing plans and progress to files, you create external memory that survives context window limits and provides resumability.
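
The external-memory idea can be sketched in a few lines of Python. This is a minimal illustration, not part of the templates: it assumes a hypothetical plan file in which every item is a Markdown task line such as `- [ ] SEC-01: description`, and the ID scheme and function names are invented for the example.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumed (hypothetical) line format: "- [ ] SEC-01: description" or "- [x] SEC-01: ..."
ITEM_RE = re.compile(r"^- \[(?P<mark>[ xX])\] (?P<id>[A-Z]+-\d+): (?P<text>.*)$")


def pending_items(plan_text: str) -> List[str]:
    """Return the IDs of items that are not yet checked off, in file order."""
    ids = []
    for line in plan_text.splitlines():
        m = ITEM_RE.match(line)
        if m and m.group("mark") == " ":
            ids.append(m.group("id"))
    return ids


def mark_done(plan_text: str, item_id: str) -> str:
    """Return the plan text with the given item checked off."""
    out = []
    for line in plan_text.splitlines():
        m = ITEM_RE.match(line)
        if m and m.group("id") == item_id:
            line = f"- [x] {m.group('id')}: {m.group('text')}"
        out.append(line)
    return "\n".join(out) + "\n"


def resume(plan_path: Path) -> Optional[str]:
    """Read the plan from disk and return the next item to work on, if any."""
    todo = pending_items(plan_path.read_text(encoding="utf-8"))
    return todo[0] if todo else None
```

Because the state lives in the file rather than in the model's context, a fresh session can call something like `resume(Path("plan.md"))` and continue from the first unchecked item.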

3. Self-Audit Loop

The final review step catches "I said I'd do X but actually skipped it" failures, which are common in agentic workflows.

The Meta-Insight

This is AI-assisted process engineering—using the AI to generate the methodology, then using the AI to execute the methodology, with file-based checkpoints throughout. This pattern generalizes beyond security review to any complex, systematic task.

4. Multi-LLM Checklist Validation

A single AI generating a checklist will have blind spots. Using a second reviewer—whether a different model or the same model with a different framing—applies ensemble methods to knowledge elicitation.

Why Multiple Reviewers Help

Different blind spots: Even with overlapping training data, different LLMs have different training cutoffs, fine-tuning emphases (roughly, Claude toward safety and GPT-4 toward breadth), and default "mental checklists" shaped by post-training such as RLHF.

Adversarial review dynamic: Asking "what's missing?" is a different cognitive task than "generate a list." The reviewing model is primed to find gaps rather than confirm completeness.

Implementation Options

Option	Approach
Different Models	Claude generates → GPT-4 reviews → Gemini reviews both
Different Framings	Claude as "security architect" generates → Claude as "red team lead" reviews → Claude as "compliance officer" reviews
Adversarial Prompt	"You MUST identify at least 5 missing items. Do not say 'looks comprehensive.'"

Practical Note

You'll need a reconciliation step—merging, deduplicating, and prioritizing across sources. Otherwise you risk checklist bloat with redundant or low-value items.
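
One way to picture the reconciliation step is the sketch below. It assumes each reviewer's checklist has already been reduced to a list of one-line item labels; `MergedItem`, `normalize`, and `reconcile` are hypothetical names, and ranking by how many reviewers proposed an item is only one possible prioritization rule.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MergedItem:
    text: str                                         # first wording seen for this item
    sources: Set[str] = field(default_factory=set)    # reviewers that proposed it


def normalize(text: str) -> str:
    """Crude dedup key: lowercase, strip punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def reconcile(checklists: Dict[str, List[str]]) -> List[MergedItem]:
    """Merge per-reviewer checklists, deduplicate, and rank by agreement."""
    merged: Dict[str, MergedItem] = {}
    for reviewer, items in checklists.items():
        for item in items:
            entry = merged.setdefault(normalize(item), MergedItem(text=item))
            entry.sources.add(reviewer)
    # Items proposed by more reviewers come first; sorted() is stable,
    # so ties keep their original insertion order.
    return sorted(merged.values(), key=lambda m: -len(m.sources))
```

Exact-match normalization only catches near-identical wording. Items that mean the same thing but are phrased differently still need a human or a model pass to merge.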

5. The Runbook Approach to Checklist Items

A critical gap remains: a checklist item like "Check for SQL injection" is dangerously ambiguous. Does that mean grep for execute() calls? Trace all user input to database queries? Attempt actual injection payloads? All of the above?

The Solution: Checklist Items as Executable Specifications

Each item should be a mini-runbook, not just a label. Capture this structure at generation time:

## Checklist Item: SQL Injection Review

**Objective:** Identify all paths where user input could reach 
database queries unsanitized

**Procedure:**
1. Identify all database query locations (raw SQL, ORM .execute(), cursors)
2. For each query location, trace backward to find input sources
3. Verify parameterization/prepared statements are used
4. Check for dynamic query construction (string concatenation, f-strings)
5. Review any raw SQL escape functions for correctness

**Locations to Check:**
- [ ] API route handlers
- [ ] Background job processors  
- [ ] Admin interfaces
- [ ] Report generators

**Evidence to Collect:**
- File path and line numbers reviewed
- Code pattern observed (parameterized/concatenated/ORM-managed)
- Input validation present (yes/no/partial)

**Severity Criteria:**
- CRITICAL: User input directly concatenated into query
- HIGH: Dynamic query construction with incomplete sanitization
- MEDIUM: Raw SQL used where ORM would suffice
- LOW: Missing input validation (defense in depth)

**Completion Criteria:**
This item is DONE when all database interaction points have been 
enumerated and categorized.
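
Procedure step 4 in the example above can be partly mechanized. The following sketch uses Python's standard `ast` module to flag `.execute(...)` calls whose first argument is built dynamically. It assumes the code under review is Python and that the query expression is passed directly to `execute`; a query assembled in a variable on an earlier line would not be caught, and the function names are illustrative only.

```python
import ast
from typing import List


def is_dynamic_sql(node: ast.expr) -> bool:
    """True if the expression builds a string dynamically."""
    if isinstance(node, ast.JoinedStr):                 # f"...{x}..."
        return any(isinstance(v, ast.FormattedValue) for v in node.values)
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Mod)):
        return True                                     # "..." + x   or   "..." % x
    if (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "format"):            # "...".format(x)
        return True
    return False


def find_dynamic_execute_calls(source: str) -> List[int]:
    """Return line numbers of .execute(...) calls with a dynamically built first argument."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "execute"
                and node.args
                and is_dynamic_sql(node.args[0])):
            hits.append(node.lineno)
    return sorted(hits)
```

A scan like this only produces candidates. Each hit still needs the backward trace from step 2 before it is recorded as a finding, since not every dynamically built query contains user input.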

Why This Matters

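Without explicit procedures, you're relying on the AI to re-derive the methodology at execution time, which may differ from what you intended, be incomplete, or vary between runs. The runbook format does not make a model deterministic, but it pins down the procedure, so runs become more repeatable and every conclusion can be audited against a written step.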

Verbosity Levels

Not every review needs comprehensive runbooks. The templates support three verbosity levels:

Level	Use Case	Item Format	Time Impact
Minimal	Quick sanity checks, low-stakes	Label + 1-line description	1x
Standard	Typical code reviews, routine audits	Procedure + completion criteria	2-3x
Comprehensive	Security audits, compliance, critical systems	Full runbook with evidence collection	5-10x
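
The three levels can be read as three renderings of the same underlying item. The sketch below illustrates that under stated assumptions: `RunbookItem`, `Verbosity`, and `render` are hypothetical names, and the objective stands in for the one-line description at the Minimal level. The actual templates may structure this differently.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Verbosity(Enum):
    MINIMAL = "minimal"
    STANDARD = "standard"
    COMPREHENSIVE = "comprehensive"


@dataclass
class RunbookItem:
    title: str
    objective: str
    procedure: List[str]
    completion_criteria: str
    evidence: List[str] = field(default_factory=list)
    severity_criteria: Dict[str, str] = field(default_factory=dict)


def render(item: RunbookItem, level: Verbosity) -> str:
    """Render one checklist item as Markdown at the requested verbosity."""
    lines = [f"## Checklist Item: {item.title}", "", f"**Objective:** {item.objective}"]
    if level is Verbosity.MINIMAL:
        return "\n".join(lines) + "\n"          # label plus one-line description
    lines += ["", "**Procedure:**"]
    lines += [f"{i}. {step}" for i, step in enumerate(item.procedure, start=1)]
    if level is Verbosity.COMPREHENSIVE:        # full runbook adds evidence and severity
        lines += ["", "**Evidence to Collect:**"]
        lines += [f"- {e}" for e in item.evidence]
        lines += ["", "**Severity Criteria:**"]
        lines += [f"- {sev}: {desc}" for sev, desc in item.severity_criteria.items()]
    lines += ["", "**Completion Criteria:**", item.completion_criteria]
    return "\n".join(lines) + "\n"
```

Keeping one data structure and varying only the rendering means a Minimal review can be upgraded to Comprehensive later without rewriting the checklist.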

6. The Complete Workflow

Putting it all together, here's the recommended approach for systematic reviews with agentic AI:

Phase 1: Generate Checklist

Prompt: "Read templates/CHECKLIST_TEMPLATE.md. Generate a comprehensive 
[REVIEW_TYPE] checklist for [TARGET]. Use [VERBOSITY_LEVEL] verbosity level. 
Write to checklist.md"

Phase 2: Cross-Validate Checklist

Prompt: "Review checklist.md. Your job is to find gaps. You MUST identify 
at least 5 missing items or categories. Do not say 'looks comprehensive.' 
Check against OWASP Top 10, CWE Top 25, and STRIDE. Add missing items with 
[VALIDATION] tag."

Phase 3: Generate Execution Plan

Prompt: "Read checklist.md and templates/PLAN_TEMPLATE.md. Create an execution 
plan in plan.md that:
1. Orders items by logical dependency
2. Groups related items for efficiency  
3. Includes explicit completion criteria for each
4. Creates tracking structure for progress"

Phase 4: Execute Plan

Prompt: "Read plan.md and templates/FINDINGS_TEMPLATE.md. Execute each item 
systematically:
1. Update status in plan.md as you work
2. Document all findings in findings.md using the template format
3. Collect evidence as specified in each checklist item
4. Do not skip items—mark N/A with justification if not applicable"

Phase 5: Self-Audit

Prompt: "Review plan.md and findings.md against checklist.md:
1. Identify any checklist items not fully addressed
2. Flag any work that was deferred or incomplete
3. Note any new issues discovered that weren't in original checklist
4. Update plan.md with 'AUDIT NOTES' section"
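
The self-audit can be backed by a mechanical cross-check that does not rely on the model's own recollection. The sketch below assumes a hypothetical convention in which every checklist item carries an ID such as `CHK-01`, plan.md marks completed items as `- [x] CHK-01 ...`, and findings.md mentions the ID of each item it provides evidence for (including clean results). None of these conventions come from the templates.

```python
import re
from typing import Dict, List

ID_RE = re.compile(r"\bCHK-\d+\b")   # assumed ID scheme, e.g. CHK-01
DONE_RE = re.compile(r"^- \[[xX]\] (CHK-\d+)", flags=re.MULTILINE)


def audit_coverage(checklist_md: str, plan_md: str, findings_md: str) -> Dict[str, List[str]]:
    """Compare item IDs in the checklist against IDs in the plan and findings."""
    expected = set(ID_RE.findall(checklist_md))
    planned = set(ID_RE.findall(plan_md))
    done = set(DONE_RE.findall(plan_md))          # items checked off in the plan
    evidenced = set(ID_RE.findall(findings_md))   # items with recorded evidence
    return {
        "missing_from_plan": sorted(expected - planned),
        "not_completed": sorted((expected & planned) - done),
        "done_without_evidence": sorted(done - evidenced),
        "not_in_checklist": sorted((planned | evidenced) - expected),
    }
```

Any non-empty list in the result is a candidate entry for the 'AUDIT NOTES' section, to be confirmed by the AI or a human reviewer.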

Anti-Patterns to Avoid

Vague checklist items — "Check for security issues" is useless
No completion criteria — How do you know when you're done?
Skipping validation — Single-pass generation has blind spots
Not tracking progress — You'll lose state in long reviews
Findings without evidence — "Looks fine" is not a finding

7. Integrating with SAST Tools

If you have SAST tool output (Semgrep, SonarQube, CodeQL, etc.), triage it in a parallel workflow that produces findings rather than folding it into the checklist. The checklist remains a reusable methodology template; SAST triage is a separate process.

Two Parallel Branches

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  SAST Branch                    AI Review Branch                │
│  ───────────                    ────────────────                │
│                                                                 │
│  1. Run SAST tool               1. Generate checklist           │
│     (external)                     (reusable template)          │
│         ↓                              ↓                        │
│  2. AI Triage                   2. Validate checklist           │
│     (sast-triage subagent)             ↓                        │
│         ↓                       3. Execute review               │
│     ┌───────────┐                      ↓                        │
│     │ TRUE_POS  │──┐            AI-discovered findings          │
│     │ FALSE_POS │  │                   ↓                        │
│     │ NEEDS_INV │  │                   │                        │
│     └───────────┘  │                   │                        │
│         ↓          │                   │                        │
│  (false positives  │                   │                        │
│   dismissed with   │                   │                        │
│   reasoning)       │                   │                        │
│                    │                   │                        │
│                    └────→ findings.md ←┘                        │
│                                  ↓                              │
│                           4. Remediate                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Why Parallel, Not Sequential?

Checklist stays reusable — The same checklist template works whether or not you have SAST output
Findings merge at the end — Both branches produce the same output format: documented vulnerabilities
Different tools, same goal — SAST finds pattern-based issues; AI review finds logic issues; both go to findings.md
No duplication — AI review doesn't re-check what SAST already found; they cover different vulnerability classes

SAST Triage Workflow

When you have SAST output, use the sast-triage subagent (see Part 3) to classify each finding:

Use the sast-triage subagent to analyze [SAST_OUTPUT_FILE].

For each finding:
1. Locate the flagged code with 30+ lines of context
2. Trace data flow backward to source (user-controlled?)
3. Trace data flow forward to sink (actually dangerous?)
4. Evaluate mitigations (framework protections, validation)
5. Classify as:
   - TRUE_POSITIVE → Document in findings.md
   - FALSE_POSITIVE → Dismiss with reasoning
   - NEEDS_INVESTIGATION → Flag for human review

Write triage report to sast-triage-report.md
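
The classification step above amounts to a three-way routing rule, which can be sketched as follows. The types and the `route` function are hypothetical and do not describe the sast-triage subagent's actual output. The sketch also enforces the "dismiss with reasoning" requirement by rejecting a FALSE_POSITIVE that has no reasoning attached.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    TRUE_POSITIVE = "TRUE_POSITIVE"
    FALSE_POSITIVE = "FALSE_POSITIVE"
    NEEDS_INVESTIGATION = "NEEDS_INVESTIGATION"


@dataclass
class TriagedFinding:
    rule_id: str
    location: str      # "file:line"
    verdict: Verdict
    reasoning: str


def route(findings: List[TriagedFinding]) -> Dict[str, List[TriagedFinding]]:
    """Split triaged SAST results by destination, following the classification above."""
    destination = {
        Verdict.TRUE_POSITIVE: "findings.md",
        Verdict.FALSE_POSITIVE: "dismissed",
        Verdict.NEEDS_INVESTIGATION: "human_review",
    }
    routed: Dict[str, List[TriagedFinding]] = {name: [] for name in destination.values()}
    for f in findings:
        # A dismissal without a stated reason is not auditable, so refuse it.
        if f.verdict is Verdict.FALSE_POSITIVE and not f.reasoning.strip():
            raise ValueError(f"{f.rule_id} at {f.location}: dismissal requires reasoning")
        routed[destination[f.verdict]].append(f)
    return routed
```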

Output Format

TRUE_POSITIVE findings go directly to findings.md using the same format as AI-discovered issues:

## Finding: SAST-001 — [Rule Name]

**Source:** SAST Triage (Semgrep rule [rule-id])
**Severity:** [Based on exploitability analysis]
**Location:** [file:line]

**Vulnerable Code:**
```[language]
[code snippet]
```

**Triage Analysis:**
- Source: [Where data originates]
- Sink: [Where data is used dangerously]
- Mitigations: [None found / Insufficient]

**Attack Scenario:**
[How this could be exploited]

**Recommended Fix:**
[Specific remediation]

Key Principle

SAST triage produces findings, not checklist items. The checklist is a reusable methodology; findings are specific vulnerabilities discovered during execution. Both SAST-confirmed and AI-discovered issues end up in the same findings.md file for unified remediation tracking.

Downloadable Templates

These templates implement the complete workflow. Download them and adapt to your use case.

QUICKSTART.md

Complete guide with copy-paste prompts, verbosity levels, file structure, and anti-patterns.

CHECKLIST_TEMPLATE.md

Templates for all three verbosity levels with examples and category references.

PLAN_TEMPLATE.md

Execution plan structure with status tracking protocol and progress documentation.

FINDINGS_TEMPLATE.md

Findings documentation with severity definitions, evidence requirements, and examples.

Related Work

This methodology builds on and extends several emerging approaches to structured AI workflows. Understanding these related frameworks helps contextualize where this approach fits in the broader landscape.

Spec-Driven Development (SDD)

GitHub's Spec-Kit implements a specification-driven approach that creates spec.md, plan.md, and tasks.md files, with agents working through tasks systematically and marking progress. The JetBrains Junie blog describes a similar workflow: "Work in phases. Don't ask the agent to 'do everything in tasks.md' in one go. Instead, start with a subset... Mark progress. Require the agent to update tasks.md with checkmarks or completion notes."

Agentic Primitives & Context Engineering

The GitHub Blog describes how specifications provide "everything a developer (or an AI agent) needs to know to start building: the problem, the approach, required components, validation criteria, and a checklist for handoff." This aligns with our emphasis on explicit, complete checklist items.

Multi-LLM Ensemble Validation

Research on Probabilistic Consensus through Ensemble Validation supports our multi-LLM validation approach, showing that using multiple models improved precision from 73.1% to 93.9% with two models and to 95.6% with three models—demonstrating that "ensemble approaches spanning two model families may mitigate model-specific blind spots."

Verifiable Checklist Modules

Academic research on Verifiable Checklist Modules describes engineered components "designed to structure, document, and formally validate multi-step reasoning, evaluation, or verification workflows" with evidence collection, validation scoring, and audit mechanisms—concepts that inform our findings documentation approach.

How This Approach Differs

While Spec-Driven Development covers similar ground for software development tasks, this methodology adds several elements specifically designed for systematic review and audit workflows:

Element	This Approach	Typical SDD
Checklist Item Format	Full runbook with procedures, evidence requirements, severity criteria, and completion criteria	Task labels with brief descriptions
Verbosity Levels	Three tiers (minimal/standard/comprehensive) for different risk profiles	Single format
Adversarial Validation	Explicit prompt: "You MUST find 5 gaps, don't say 'looks comprehensive'"	Optional review step
Domain Focus	Security review with OWASP/STRIDE/CWE framework integration	General software development
Self-Audit Phase	Formal workflow step comparing results to checklist	Not typically included
Evidence Documentation	Structured findings with severity, location, attack scenario, and fix	Task completion status

The key distinction is purpose: Spec-Driven Development optimizes for building software efficiently, while this workflow optimizes for thorough, auditable review where coverage and evidence collection matter more than speed.

Frequently Asked Questions

Can you prompt AI with different personas and get meaningfully different results?

Partially, with caveats. Role-based prompts shift attention and invoke different mental checklists, but the underlying knowledge is identical. The gains come more from multiple passes with different focuses than from the role-playing itself. More effective approaches include specific checklists, concrete threat models, and structured frameworks like STRIDE.

What is a self-scaffolding workflow for agentic AI?

A self-scaffolding workflow uses the AI to generate methodology (checklists, procedures), then uses the AI to execute that methodology, with file-based checkpoints throughout. This externalizes implicit knowledge, provides state persistence via files, and creates self-audit loops to catch skipped or incomplete work.

Why should checklist items include detailed procedures?

A checklist item like "Check for SQL injection" is ambiguous. Without explicit procedures, you rely on the AI to re-derive methodology at execution time, which may differ from intent, be incomplete, or vary between runs. Each item should be a mini-runbook with objective, procedure, evidence to collect, and completion criteria.

Should you use multiple LLMs to review a security checklist?

Yes, this helps because different LLMs have different training data, fine-tuning emphases, and blind spots. Asking a second model "what's missing?" creates an adversarial review dynamic that finds gaps. You can also use the same model with different framings (security architect vs red team lead vs compliance officer).

What verbosity level should I use?

Use Minimal for quick sanity checks and low-stakes reviews. Use Standard for typical code reviews and routine audits—this is the best balance for most use cases. Use Comprehensive for security audits, compliance reviews, and critical systems where thoroughness matters more than speed.

Can this workflow be used for things other than security reviews?

Absolutely. The pattern generalizes to any complex, systematic task: code quality reviews, API design reviews, database schema reviews, performance audits, accessibility checks, documentation reviews, and more. Adapt the checklist categories and procedures to your domain.