Deep dive

PoC agent cascade

How autonomous LLM agents actually reproduce CVEs on this site: the two-tier model cascade, the ephemeral VM substrate, the independent judge that adjudicates every attempt, and the budgets that keep the whole thing from melting the credit card.

ByMika Sipilä·Founder, WP-Safety.org

What it is

Cheap path first, escalate only when needed

Task intake

Patch diff + LLM research plan · one task per CVE attempt.

Ephemeral VM

Provisioned from a clean WordPress snapshot with the target plugin pinned to the vulnerable version. Destroyed on task completion regardless of verdict.

one VM per taskno shared state

Tier 1

Lightweight LLM

cheap · fast · default attempt

Agent loop: plan → execute browser / HTTP tool calls → inspect response → adjust. Most straightforward SQLi / XSS / nonce-bypass reproductions land here without escalating.

Tier 1 fails or stalls

Tier 1 succeeds

Tier 2

Frontier LLM

stronger reasoning · retry with longer context

Handed the same task plus the Tier 1 transcript as negative evidence. If the frontier tier also can't produce a reproducible flow, the task is logged as "agent-unreachable" - never silently marked verified.

Success path continues straight to the judge without visiting Tier 2.

Independent frontier judge

Handed the attempt transcript, evidence bundle, and the task's original research plan. Returns a verdict with rationale. Runs on every attempt, including successful ones, so the executing model cannot mark its own work as correct.

no self-confirmationstructured rationale

Verified PoC + evidence bundle

Stored alongside the CVE: Playwright script that re-runs the exploit, full video, per-step screenshots, HTTP transcript, vulnerable-code pointer, patch-diff pointer.

The cascade keeps the default path cheap (Lightweight LLM clears most tasks) while preserving the ceiling that Frontier LLM raises for the hard ones. The judge runs on every attempt because self-adjudication by the same model that executed the exploit is not a verdict I'd trust for a public CVE page.

A "verified" badge on a CVE page means a software agent actually executed the exploit against a real WordPress installation and captured the result on video. That claim has to survive real scrutiny, so the pipeline behind it is deliberately uninterested in shortcuts. It stands up a fresh WordPress on a clean substrate for every attempt, gives the agent real tools against real code, records everything, and then hands the transcript to a separate model that never saw the attempt being built.

The top-level entry point is agent/src/audit.ts - runAudit() on line 75 takes a research plan (produced by the LLM research pass described on the PoC verification methodology page) and escalates through the cascade on behalf of the scheduler. The rest of this page is the detail of what happens inside that call.

Three design choices dominate everything else: the cascade picks the cheaper model first and escalates only on failure; the substrate is ephemeral per-task rather than a shared long-lived environment; and a separate frontier judge adjudicates every attempt, including successful ones. None of these is clever on its own. Together they're what make "verified" mean something.

VM substrate

One ephemeral WordPress per task

The substrate is agent/src/docker.ts's startContainers() (line 44) bringing up a docker-compose stack: official wordpress:latest container, MariaDB sidecar, a 120-second health-wait, and an in-container WP-CLI install. Per task. Every time. No long-lived "test WordPress" that drifts between runs and quietly accumulates state.

WordPress itself is provisioned by installWordPress() (line 72) - core install, pretty permalinks, admin user, plus a low-privilege subscriber via createSubscriber() (line 177) because a lot of CVEs are "subscriber can do X" and the agent needs both roles available from turn one.

Plugin installation is three-strategy on purpose. The first attempt is wp plugin install --version=X against the WP.org API. If that 404s or the plugin was closed, the second strategy downloads the pinned zip directly from downloads.wordpress.org. If that also fails, the third strategy pulls from my own SVN mirror - the one described in the taint-analysis deep dive - which covers the long tail of withdrawn plugins the API and zip paths can't reach.

Why three strategies instead of one-try-then-fail. The point of verifying a CVE is catching real-world impact, and a lot of real-world WordPress sites run plugins that have since been closed or withdrawn from .org. Failing to reproduce a CVE "because the plugin is no longer downloadable" is a false negative for users who still have it installed. The three-strategy fallback reaches every plugin version that ever shipped on .org, not just the ones .org still publishes.

Fresh snapshot per task

docker compose up → 120s health wait → WP-CLI core install → plugin install (3-strategy) → subscriber account. The VM starts from identical bytes every time, so no prior-attempt residue can leak into a later attempt's verdict.

Plugin unpack on host

The wordpress:latest image lacks unzip, so the agent unpacks zip fallbacks on the host filesystem and then docker cps the unpacked tree into the container. Small operational detail with large implications if the host lacks disk - the unpacker cleans temp dirs on every failure path so crashed tasks never leave half-plugins behind.

Destroyed on completion

destroyContainers() (line 216) tears everything down - containers, volumes, network - on both success and failure paths. No stateful "test bed" to re-use; the compute cost of the rebuild is cheaper than the debugging cost of cross-task interference.

Agent loop

Tools, turns, and the nudge-then-force discipline

The agent gets four tool families: http_request (arbitrary HTTP, used for direct endpoint exploitation), browser_* (Playwright-driven real browser for flows that need DOM state, cookies, or multi-step navigation), wp_cli (post-exploit confirmation only, never as the exploit itself), and bash_exec (container-internal commands for reading files, tailing logs, checking side-effects on disk).

The separation between exploit path and confirmation path is a hard rule. The exploit has to go through the vulnerable HTTP endpoint - that's the thing a real attacker would reach. WP-CLI is fine for confirming after the fact ("did the admin account I just created via the bug actually exist?") but it's not an acceptable exploitation path, because it runs as the root user inside the container and would trivially "reproduce" bugs that aren't actually reachable from the outside.

The loop is capped at 40 turns (audit.ts line 442). With ~8 turns remaining the system prompt nudges the agent to wrap up; with ≤5 remaining it forces a report_result call or auto-detects success from the tool log via a keyword heuristic (lines 662-710). Either way, the loop cannot run forever and the scheduler cannot be starved by one runaway task.

Inside a turn, the agent's response can be thinking text, tool calls, or a final report_result. Tool-call inputs and outputs are both logged with timestamps. That log becomes the primary artifact the judge inspects - not the agent's narration, not the verdict the agent itself claims - because what actually happened on the wire is less tamperable than what the agent says happened.

http_request

Raw HTTP with 60s timeout. The exploit path. Method, headers, body all agent-controlled. Response captured verbatim.

browser_*

Playwright browser with 15s navigation timeout, 5-10s click/load timeouts. Used for flows that need cookie state, JavaScript-driven UIs, or multi-step admin panels.

wp_cli / bash_exec

Confirmation only. 30s-120s docker-exec timeout. "Did the exploit I just executed through the HTTP endpoint actually land the intended effect in the database or on disk?"

Worked example

Abridged transcript from a real SQLi reproduction

Illustrative - redacted plugin + endpoint, but the shape and budget exactly mirror a real Tier-1 run. Shows the agent discovering the sink, testing a payload, and confirming the side-effect via wp_cli before reporting success.

t+0think

Research plan claims an unsanitized filter parameter reaching $wpdb->query() in the plugin's admin-ajax handler. Starting with a baseline GET to confirm the endpoint exists.

t+1http_request

POST /wp-admin/admin-ajax.php
  action=acme_filter&filter=1
→ 200 OK · {"rows": [...]}

t+2think

Endpoint exists, unauthenticated. Testing for UNION-based injection by appending a tautology plus a column-count probe.

t+3http_request

POST /wp-admin/admin-ajax.php
  action=acme_filter&filter=1) UNION SELECT user_login,user_pass,3 FROM wp_users--
→ 200 OK · {"rows":[["admin","$P$B…","3"], …]}

t+4think

Response contains what looks like a phpass hash in column 2. That's the wp_users.user_pass column, confirming the UNION landed. Confirming via wp_cli that the hash matches what's in the database.

t+5wp_cli

wp db query "SELECT user_pass FROM wp_users WHERE user_login='admin'"
→ $P$B… (matches)

t+6report_result

status: success · endpoint /wp-admin/admin-ajax.php?action=acme_filter · sink $wpdb->query() · UNION-based injection · exfiltrated wp_users.user_pass.

→frontier judge

Verdict confirmed. Attack went through the vulnerable HTTP endpoint; wp_cli was used only to confirm the exfiltrated hash matched the database, not as the exploitation path. Evidence: tool log t+3 (exploit request + UNION response) and t+5 (DB confirmation).

Judge

Why a separate model adjudicates every attempt

Judge invariants

no self-confirmation

Different model than the executor

An agent that succeeded on the lightweight LLM is adjudicated by the frontier LLM. An agent that succeeded on the frontier LLM is still adjudicated by the frontier LLM but with fresh context - no shared session, no shared scratchpad.

Given the tool log, not the narration

Input is the timestamped transcript of HTTP/browser/CLI calls, plus the research plan. The agent's own conclusion is not part of the judge's input.

Structured rationale required

Judge prompt requires the verdict to cite the specific endpoint, the specific side-effect, and the specific evidence. 'Looks good' verdicts are rejected at parse time.

validateExploitSuccess() (agent/src/audit.ts lines 1039-1102) is the judge entry point. It tries the frontier LLM first (judgeViaFrontier, line 1108, 60 s timeout). If that fails - rate-limit, region outage, token-budget cap - it falls back to the lightweight LLM (judgeViaLightweight, line 1149, 30 s timeout). If both judges are unavailable, the agent's verdict is accepted with an explicit "unjudged" flag on the result; downstream consumers can filter or re-run later.

The judge prompt (lines 1050-1077) is narrow on purpose. It is not asked "is the plugin vulnerable" - that's the research-plan's job. It is asked: did the transcript this specific agent run contain a flow that went through the vulnerable HTTP endpoint, produced the claimed side-effect, and did that side-effect actually land in a way a WP-CLI command or DB probe can confirm. Narrow questions get reliable answers; broad questions get plausible-sounding bluster.

Why the frontier model for the primary judge. Judging a transcript requires holding the research plan, the 40-turn tool log, and the judge's own reasoning in context simultaneously - the exact shape of task where a larger effective context and stronger reasoning matter. Using a cheaper model for the judge defeats the point of the cascade; I'd rather spend the extra cents per task than have "verified" mean "the cheap executor also thinks it worked."

Budgets

Per-task timeouts and failure recovery

Every interesting bug in the agent loop has the same ancestor: an unbounded wait. So every tool call has a timeout, every model call has a timeout, every docker exec has a timeout, and the whole task has a turn cap. The numbers aren't magic; they're just tight enough that a single stuck task can't pin the scheduler indefinitely, and loose enough that a legitimate multi-step exploit has room to breathe.

Model-API retries (audit.ts lines 459-467) are three attempts with exponential backoff on 429/529 codes. After three failures the task reports error rather than failed - a distinction that matters because "failed to reproduce" is a negative result about the CVE, while "error" is a negative result about the pipeline, and those get retried differently.

The auto-detect success heuristic (audit.ts lines 662-710) runs when the turn cap is reached without an explicit report_result. It scans the tool log for exploit-landing keywords (SQL injection confirmations, admin-creation side effects, privilege-escalation signals, webshell deployment, reflected XSS indicators) and - if it finds them - hands a synthetic claim to the judge for adjudication. The judge can still reject; the heuristic only ever opens the door for a verdict, it never makes one.

Plugin-not-available (audit.ts lines 222-237) short-circuits to not_applicable - a distinct status that means "the substrate couldn't be built, this CVE wasn't even attempted." It never gets labeled as a reproduction failure; that would bias the ecosystem-wide false-negative rate unfairly against plugins that happened to be unavailable the day the scheduler got to them.

Turn cap · nudge · force

40 turns max. Nudge at 8 remaining. Force report_result or auto-detect at 5 remaining. No unbounded loops.

Retry discipline

3x exponential backoff on 429/529. Cascade escalates on content failure, not on transport failure - no accidental "model said no" pattern from a transient 5xx.

Status taxonomy

success / failed / error / not_applicable. Distinguishing "pipeline error" from "CVE not reproducible" protects the per-CVE statistics.

Throughput & cost

What each task actually costs

Order-of-magnitude numbers observed on the current cascade. Exact cost is a moving target (model prices, crawl mix) but the ratios between tiers are stable and they're the reason the cascade is shaped this way.

Tier 1Lightweight LLM executor

Median turns8 - 18
Wall-clock / task30 - 120 s
Cost per attemptlow
Share of total tasksmajority

Tier 2Frontier LLM executor (fallback)

Median turns12 - 30
Wall-clock / task90 - 240 s
Cost per attempt~20x T1
Share of total tasksminority

JudgeFrontier LLM (every attempt)

Per-attempt calls1
Latency5 - 30 s
Cost per attemptfixed
Runs on success tooalways

Put together, the average task is a cheap Tier-1 executor pass plus one frontier judge call; a Tier-1 failure adds a substantially more expensive frontier executor pass. The cascade keeps the mean cost anchored near the cheap-executor floor while still admitting the expensive ceiling for tasks that actually need it.

Evidence

What ships alongside every verified CVE

Every verified PoC ships with a full evidence bundle that's intentionally reproducible by a third party without trusting me. Playwright trace (timestamped screenshots and DOM snapshots at every interaction), a rendered video of the complete exploitation flow, the watermarked title-card thumbnail for YouTube, and a machine-readable Playwright script that re-runs the exploit step-for-step.

The trace is a standard Playwright zip (browser.tracing.start() with screenshots and snapshots enabled, lines 263-266). Anyone can open it in Playwright's own trace viewer - it's a debugging tool Microsoft distributes, not a bespoke format I invented - and step through the network calls, DOM state, and console logs frame by frame. Published alongside every verified CVE.

Watermark and title-card (lines 270-285) aren't for aesthetics. They're there because a video of a WordPress exploit is, stripped of context, exactly what a malicious tutorial looks like. The watermark (injected via addInitScript so it survives SPA navigation) and the title-card make the video's provenance unambiguous in any platform that strips metadata.

The result POST'd to /api/internal/security-research/result carries the status, duration, model used for both executor and judge, vulnerable-code pointer, patch-diff pointer, exploit-code summary, Playwright script, and the YouTube video ID. Trace and video blobs ride separate upload endpoints because they're large and the JSON write has to be cheap and deterministic.

Playwright trace

ZIP artifact. Screenshots + DOM snapshots at every step. Opens in the official Playwright trace viewer. Third-party reproducible.

Rendered video

1280×720 MP4 with a 9-second title card and in-browser watermark. Uploaded to YouTube + kept locally. Title card doubles as the CVE thumbnail.

Replay script

The tool log rewritten as a standalone Playwright script you can run against your own VM. If you want to check my work without trusting my video, this is the artifact to use.

Sample output

The AuditResult payload

Shape of the object POSTed to /api/internal/security-research/result at the end of every task. Binary artifacts (Playwright trace ZIP, video MP4) are uploaded separately and the result references them by ID.

{
  "task_id": "tsk_01HW5KQ…",
  "vulnerability_id": "CVE-2025-XXXX",
  "status": "success",
  "model_used": "tier1-lightweight",
  "model_judge": "frontier-judge",

  "duration_seconds": 87.2,
  "turns_used": 14,
  "retries": 0,

  "summary": "SQL injection via unsanitized filter parameter in admin-ajax; exfiltrated wp_users.user_pass hash.",

  "exploit_code": "POST /wp-admin/admin-ajax.php\n  action=acme_filter&filter=1) UNION SELECT user_login,user_pass,3 FROM wp_users--",
  "vulnerable_code": "lib/query-runner.php:64  // $wpdb->query($query) with filter interpolated",
  "fix_diff": "lib/query-runner.php: wrap with $wpdb->prepare() and bind ints via %d placeholders",

  "artifacts": {
    "playwright_trace_id": "trc_f8c1…",
    "video_mp4_id":         "vid_9a4e…",
    "youtube_video_id":     "dQw4…"
  },

  "judge_verdict": {
    "model": "frontier-judge",
    "verdict": "confirmed",
    "cites": [
      "tool_log[3]: exploit request + UNION response",
      "tool_log[5]: wp_cli DB confirmation matches exfiltrated hash"
    ],
    "rationale": "Attack went through the vulnerable HTTP endpoint; wp_cli used only for confirmation."
  },

  "steps": [
    "Recon: GET baseline to confirm endpoint",
    "Probe: UNION payload against filter parameter",
    "Exploit: exfiltrate wp_users.user_pass",
    "Confirm: match exfiltrated hash via wp_cli"
  ]
}

Status taxonomy matters.status is one of success, failed, error, or not_applicable. The distinction matters because "failed to reproduce" is a negative result about the CVE (interesting signal), "error" is a negative result about the pipeline (retryable), and "not_applicable" means the plugin couldn't even be installed (unrelated to whether the CVE is real).

Judge rationale is structured. The judge_verdict.cites array names the tool-log indices the verdict is grounded in. This is the format the judge prompt enforces: no "looks good" verdicts, no free-form confidence scores without evidence pointers. Every claim in rationale has to map back to a cites entry or the judge call retries until it does.

Limits

What this agent can't reach

Out of scope

by design

Multi-step social engineering

Bugs that require a real human user at a real keyboard making a decision. The agent can click, but it can't convincingly pretend to be someone's distracted colleague on a Friday afternoon.

structural

Race-condition TOCTOU

Exploitation that depends on sub-millisecond timing windows between two requests. The 40-turn loop + per-tool timeouts can't reliably hit these; I mark them as 'reviewable, not verified.'

roadmap

Cross-plugin interaction bugs

Only one plugin gets installed per task. Vulnerabilities that require two plugins both active, interacting, are outside the current substrate.

Agent-unreachable is not the same as not-exploitable. If both the lightweight and frontier tiers fail to produce a reproducible flow, the CVE is logged as "agent-unreachable" on the page. That isn't a dismissal of the CVE; it's an honest statement about what this tooling can demonstrate. A human researcher with more context might well reproduce it tomorrow. I'd rather say "I couldn't" than paper over the gap with a synthesized description and hope nobody reads closely.

Costs aren't invisible. A frontier-LLM judge at 16k output on every judged attempt isn't free. The cascade explicitly keeps the cheap path as the default so the judge is the only place the frontier tier is guaranteed to run. At the current cadence that's a budget I can absorb; if the number of CVEs scales 10×, I'll need to revisit whether a smaller judge with a frontier-sampled audit-of-the-audit is the right shape instead of unconditional frontier judgement.

Determinism is a research problem. Two runs of the same agent against the same CVE will not always produce identical transcripts - LLM sampling, browser timing, WordPress admin-UI drift. The judge verdict is stable because the prompt is narrow, but the intermediate traces vary. If reproducing the video bit-for-bit mattered, I'd need a recorded browser and deterministic sampling; I've decided the cost of that isn't worth it yet.

Prior art and related work

Foundational Schick et al., 2023 - Toolformer: Language Models Can Teach Themselves to Use Tools
The first clear articulation that LLMs are strong enough to drive tool calls autonomously if the interface is well-shaped. The cascade's tool surface (http_request, browser_*, wp_cli, bash_exec) is a direct descendant of that line of work, narrowed to the domain of WordPress vulnerability reproduction.
Foundational Yao et al., 2022 - ReAct: Synergizing Reasoning and Acting in Language Models
The thinking-then-tool-call loop structure - reason about what to try, execute, observe the result, reason again - is the ReAct pattern. The 40-turn cap with nudge-then-force termination is my pragmatic bound on it; ReAct's open-loop formulation would run forever.
Adjacent Fang et al., 2024 - LLM Agents can Autonomously Exploit One-day Vulnerabilities
Closest public work to what this cascade actually does. Demonstrates that GPT-4 with web access can exploit disclosed CVEs from their advisories. My cascade goes further on the reproducibility discipline - ephemeral substrate, independent judge, evidence bundle - but the core claim (LLMs can autonomously reproduce disclosed web CVEs) is consistent.
Adjacent Frontier-lab model cards - cyber-exploitation evaluation harnesses
Each major LLM lab publishes evaluation harnesses for its models' offensive-security capability. The cascade here is orthogonal to those benchmarks - I'm using the models' agentic capability to reproduce already disclosed CVEs, not to discover novel ones - but the harness design (turn cap, tool sandboxing, judge separation) is informed by the same concerns.
Planned arXiv preprint: "Autonomous CVE reproduction at scale: an LLM-agent cascade for WordPress"
Work in progress. Methods section is roughly the content of this page; results section will cover the reproducibility rate on the current CVE backlog, per-CVE cost at steady state, and the rate of judge-override (times the judge rejected an executor-claimed success). Link will land here when the preprint goes up.

About the author

ByMika Sipilä·Founder, WP-Safety.org

I'm a Systems Architect and full-stack developer, a maker at heart. WP-Safety started as a side-project and grew into what you see now, built solo end-to-end. If something on the site looks off, the docs are unclear, or you spot a bug I didn't, I'd rather hear about it than miss it - that's what support@wp-safety.org is for.

GitHub LinkedIn

Related deep-dives

Taint analysis

The AST-level analyzer that produces the deterministic code signals feeding the research plans that drive this cascade. Seven sources, 31 sinks, 47 sanitizers, two phases.

PoC verification methodology

The higher-level methodology page the cascade sits inside. Covers the research-plan stage, the per-phase separation of concerns, and the broader trust model.

Browse verified PoCs.

Every CVE on this site with a verified badge went through the cascade and judge described on this page.

Recent CVEs Verification methodology