PoC agent cascade
How autonomous LLM agents actually reproduce CVEs on this site: the two-tier model cascade, the ephemeral VM substrate, the independent judge that adjudicates every attempt, and the budgets that keep the whole thing from melting the credit card.
Cheap path first, escalate only when needed
A "verified" badge on a CVE page means a software agent actually executed the exploit against a real WordPress installation and captured the result on video. That claim has to survive real scrutiny, so the pipeline behind it is deliberately uninterested in shortcuts. It stands up a fresh WordPress on a clean substrate for every attempt, gives the agent real tools against real code, records everything, and then hands the transcript to a separate model that never saw the attempt being built.
The top-level entry point is agent/src/audit.ts - runAudit() on line 75 takes a research plan (produced by the LLM research pass described on the PoC verification methodology page) and escalates through the cascade on behalf of the scheduler. The rest of this page is the detail of what happens inside that call.
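The escalation policy at the heart of that call can be sketched as a small pure function. This is an illustrative reconstruction, not the real `audit.ts`: the `Executor` interface, the `runCascade` name, and the verdict strings are assumptions made for the sketch.

```typescript
type Verdict = "success" | "failed" | "error";

interface Executor {
  name: string;
  run: (plan: string) => Promise<Verdict>;
}

// Try the cheap tier first; escalate to the frontier tier only when the
// cheap tier fails to reproduce. A pipeline "error" does not escalate --
// the scheduler retries those separately.
async function runCascade(
  plan: string,
  tiers: Executor[],
): Promise<{ tier: string; verdict: Verdict }> {
  let last: { tier: string; verdict: Verdict } = { tier: "none", verdict: "failed" };
  for (const tier of tiers) {
    const verdict = await tier.run(plan);
    last = { tier: tier.name, verdict };
    if (verdict !== "failed") break; // success or pipeline error: stop escalating
  }
  return last;
}
```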
Three design choices dominate everything else: the cascade picks the cheaper model first and escalates only on failure; the substrate is ephemeral per-task rather than a shared long-lived environment; and a separate frontier judge adjudicates every attempt, including successful ones. None of these is clever on its own. Together they're what make "verified" mean something.
One ephemeral WordPress per task
The substrate is agent/src/docker.ts's startContainers() (line 44) bringing up a docker-compose stack: official wordpress:latest container, MariaDB sidecar, a 120-second health-wait, and an in-container WP-CLI install. Per task. Every time. No long-lived "test WordPress" that drifts between runs and quietly accumulates state.
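The 120-second health wait reduces to a bounded polling loop. A minimal sketch with an injected probe, clock, and sleep so the policy is testable; `waitHealthy` and its parameters are hypothetical names, not the real `docker.ts` API.

```typescript
// Poll until WordPress answers or the deadline passes. A timeout here
// surfaces as a pipeline "error", never as a reproduction "failed".
async function waitHealthy(
  probe: () => Promise<boolean>,
  timeoutMs = 120_000,
  intervalMs = 2_000,
  now: () => number = Date.now,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): Promise<void> {
  const deadline = now() + timeoutMs;
  while (now() < deadline) {
    if (await probe()) return; // substrate is up
    await sleep(intervalMs);
  }
  throw new Error(`substrate not healthy after ${timeoutMs} ms`);
}
```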
WordPress itself is provisioned by installWordPress() (line 72) - core install, pretty permalinks, admin user, plus a low-privilege subscriber via createSubscriber() (line 177) because a lot of CVEs are "subscriber can do X" and the agent needs both roles available from turn one.
Plugin installation is three-strategy on purpose. The first attempt is wp plugin install --version=X against the WP.org API. If that 404s or the plugin was closed, the second strategy downloads the pinned zip directly from downloads.wordpress.org. If that also fails, the third strategy pulls from my own SVN mirror - the one described in the taint-analysis deep dive - which covers the long tail of withdrawn plugins the API and zip paths can't reach.
Why three strategies instead of one-try-then-fail. The point of verifying a CVE is catching real-world impact, and a lot of real-world WordPress sites run plugins that have since been closed or withdrawn from .org. Failing to reproduce a CVE "because the plugin is no longer downloadable" is a false negative for users who still have it installed. The three-strategy fallback reaches every plugin version that ever shipped on .org, not just the ones .org still publishes.
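The fallback chain is ordinary sequential error handling. A sketch under the assumption that each strategy throws on failure and the array is ordered [WP.org API, pinned zip, SVN mirror]; `installPlugin` and `InstallStrategy` are illustrative names, not the real code.

```typescript
type InstallStrategy = (slug: string, version: string) => Promise<void>;

// Walk the strategies in order; remember which one landed so the audit log
// can record the provenance of the installed plugin bytes.
async function installPlugin(
  slug: string,
  version: string,
  strategies: InstallStrategy[],
): Promise<string> {
  const errors: string[] = [];
  for (let i = 0; i < strategies.length; i++) {
    try {
      await strategies[i](slug, version);
      return `strategy-${i + 1}`;
    } catch (e) {
      errors.push(String(e));
    }
  }
  // All three paths exhausted: the task short-circuits to not_applicable.
  throw new Error(`plugin ${slug}@${version} unavailable: ${errors.join("; ")}`);
}
```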
docker compose up → 120 s health wait → WP-CLI core install → plugin install (3-strategy) → subscriber account. The VM starts from identical bytes every time, so no prior-attempt residue can leak into a later attempt's verdict.

The wordpress:latest image lacks unzip, so the agent unpacks zip fallbacks on the host filesystem and then docker cps the unpacked tree into the container. It's a small operational detail with large implications if the host runs out of disk: the unpacker cleans its temp dirs on every failure path, so crashed tasks never leave half-installed plugins behind.

destroyContainers() (line 216) tears everything down - containers, volumes, network - on both success and failure paths. There is no stateful "test bed" to reuse; the compute cost of the rebuild is cheaper than the debugging cost of cross-task interference.

Tools, turns, and the nudge-then-force discipline
The agent gets four tool families: http_request (arbitrary HTTP, used for direct endpoint exploitation), browser_* (Playwright-driven real browser for flows that need DOM state, cookies, or multi-step navigation), wp_cli (post-exploit confirmation only, never as the exploit itself), and bash_exec (container-internal commands for reading files, tailing logs, checking side-effects on disk).
The separation between exploit path and confirmation path is a hard rule. The exploit has to go through the vulnerable HTTP endpoint - that's the thing a real attacker would reach. WP-CLI is fine for confirming after the fact ("did the admin account I just created via the bug actually exist?") but it's not an acceptable exploitation path, because it runs as the root user inside the container and would trivially "reproduce" bugs that aren't actually reachable from the outside.
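The hard rule can be enforced mechanically over the tool log. A sketch under the assumption that each log entry records which tool family ran; `exploitPathIsValid` and the `ToolCall` shape are illustrative, and the check is deliberately strict (a real implementation might allow benign wp_cli recon before the exploit step).

```typescript
interface ToolCall {
  tool: "http_request" | "browser" | "wp_cli" | "bash_exec";
}

// The exploit must reach the vulnerable HTTP endpoint via http_request or
// the browser; any wp_cli call that precedes the first such step would have
// run as root inside the container and is not an acceptable exploit path.
function exploitPathIsValid(calls: ToolCall[]): boolean {
  const firstExploit = calls.findIndex(
    (c) => c.tool === "http_request" || c.tool === "browser",
  );
  const firstWpCli = calls.findIndex((c) => c.tool === "wp_cli");
  if (firstExploit === -1) return false; // no outside-reachable step at all
  return firstWpCli === -1 || firstWpCli > firstExploit;
}
```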
The loop is capped at 40 turns (audit.ts line 442). With ~8 turns remaining the system prompt nudges the agent to wrap up; with ≤5 remaining it forces a report_result call or auto-detects success from the tool log via a keyword heuristic (lines 662-710). Either way, the loop cannot run forever and the scheduler cannot be starved by one runaway task.
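The nudge-then-force discipline is a tiny policy function over remaining turns. A sketch with the thresholds from the prose (nudge at ~8 remaining, force at ≤5); the name `turnPolicy` is an illustration, not the real code.

```typescript
type TurnPolicy = "continue" | "nudge" | "force_report";

// Decide what the loop does at a given turn. The 40-turn default mirrors
// the cap in audit.ts.
function turnPolicy(turn: number, maxTurns = 40): TurnPolicy {
  const remaining = maxTurns - turn;
  if (remaining <= 5) return "force_report"; // force report_result / auto-detect
  if (remaining <= 8) return "nudge";        // system prompt asks the agent to wrap up
  return "continue";
}
```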
Inside a turn, the agent's response can be thinking text, tool calls, or a final report_result. Tool-call inputs and outputs are both logged with timestamps. That log becomes the primary artifact the judge inspects - not the agent's narration, not the verdict the agent itself claims - because what actually happened on the wire is less tamperable than what the agent says happened.
Abridged transcript from a real SQLi reproduction
Illustrative - redacted plugin + endpoint, but the shape and budget exactly mirror a real Tier-1 run. Shows the agent discovering the sink, testing a payload, and confirming the side-effect via wp_cli before reporting success.
[agent] Target: a filter parameter reaching $wpdb->query() in the plugin's admin-ajax handler. Starting with a baseline request to confirm the endpoint exists.
[http_request] POST /wp-admin/admin-ajax.php
  action=acme_filter&filter=1
  → 200 OK · {"rows": [...]}
[http_request] POST /wp-admin/admin-ajax.php
  action=acme_filter&filter=1) UNION SELECT user_login,user_pass,3 FROM wp_users--
  → 200 OK · {"rows":[["admin","$P$B…","3"], …]}
[agent] The response exposes the wp_users.user_pass column, confirming the UNION landed. Confirming via wp_cli that the hash matches what's in the database.
[wp_cli] wp db query "SELECT user_pass FROM wp_users WHERE user_login='admin'"
  → $P$B… (matches)
[report_result] status: success · endpoint /wp-admin/admin-ajax.php?action=acme_filter · sink $wpdb->query() · UNION-based injection · exfiltrated wp_users.user_pass. wp_cli was used only to confirm the exfiltrated hash matched the database, not as the exploitation path. Evidence: tool log t+3 (exploit request + UNION response) and t+5 (DB confirmation).

Why a separate model adjudicates every attempt
validateExploitSuccess() (agent/src/audit.ts lines 1039-1102) is the judge entry point. It tries the frontier LLM first (judgeViaFrontier, line 1108, 60 s timeout). If that fails - rate-limit, region outage, token-budget cap - it falls back to the lightweight LLM (judgeViaLightweight, line 1149, 30 s timeout). If both judges are unavailable, the agent's verdict is accepted with an explicit "unjudged" flag on the result; downstream consumers can filter or re-run later.
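The fallback chain is straightforward to sketch. `withTimeout`, the `judge` signature, and the injected judge functions are illustrative stand-ins for the real `validateExploitSuccess()` internals; only the timeout budgets and the "unjudged" marker come from the prose.

```typescript
interface JudgeResult {
  verdict: string;
  judgedBy: "frontier" | "lightweight" | "unjudged";
}

// Reject the promise if it doesn't settle within ms; clear the timer either
// way so no dangling timeout keeps the process alive.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timeout ${ms}ms`)), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

async function judge(
  transcript: string,
  frontier: (t: string) => Promise<string>,
  lightweight: (t: string) => Promise<string>,
  agentClaim: string,
): Promise<JudgeResult> {
  try {
    return { verdict: await withTimeout(frontier(transcript), 60_000), judgedBy: "frontier" };
  } catch { /* rate limit, outage, or timeout: fall through */ }
  try {
    return { verdict: await withTimeout(lightweight(transcript), 30_000), judgedBy: "lightweight" };
  } catch { /* both judges down */ }
  return { verdict: agentClaim, judgedBy: "unjudged" }; // downstream can filter or re-run
}
```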
The judge prompt (lines 1050-1077) is narrow on purpose. It is not asked "is the plugin vulnerable" - that's the research-plan's job. It is asked: did the transcript this specific agent run contain a flow that went through the vulnerable HTTP endpoint, produced the claimed side-effect, and did that side-effect actually land in a way a WP-CLI command or DB probe can confirm. Narrow questions get reliable answers; broad questions get plausible-sounding bluster.
Why the frontier model for the primary judge. Judging a transcript requires holding the research plan, the 40-turn tool log, and the judge's own reasoning in context simultaneously - the exact shape of task where a larger effective context and stronger reasoning matter. Using a cheaper model for the judge defeats the point of the cascade; I'd rather spend the extra cents per task than have "verified" mean "the cheap executor also thinks it worked."
Per-task timeouts and failure recovery
Every interesting bug in the agent loop has the same ancestor: an unbounded wait. So every tool call has a timeout, every model call has a timeout, every docker exec has a timeout, and the whole task has a turn cap. The numbers aren't magic; they're just tight enough that a single stuck task can't pin the scheduler indefinitely, and loose enough that a legitimate multi-step exploit has room to breathe.
Model-API retries (audit.ts lines 459-467) are three attempts with exponential backoff on 429/529 codes. After three failures the task reports error rather than failed - a distinction that matters because "failed to reproduce" is a negative result about the CVE, while "error" is a negative result about the pipeline, and those get retried differently.
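The retry policy is three attempts with exponential backoff, gated on throttling-style errors. A sketch with injected sleep so the backoff is testable; `callWithRetry` and its delays are illustrative, not the exact `audit.ts` values.

```typescript
async function callWithRetry<T>(
  call: () => Promise<T>,
  isRetryable: (e: unknown) => boolean, // e.g. HTTP 429/529
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
  attempts = 3,
  baseDelayMs = 1_000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await call();
    } catch (e) {
      lastError = e;
      // Non-retryable faults and the final attempt surface as a pipeline
      // "error", which the scheduler handles differently from "failed".
      if (!isRetryable(e) || i === attempts - 1) throw e;
      await sleep(baseDelayMs * 2 ** i); // 1s, 2s between attempts
    }
  }
  throw lastError;
}
```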
The auto-detect success heuristic (audit.ts lines 662-710) runs when the turn cap is reached without an explicit report_result. It scans the tool log for exploit-landing keywords (SQL injection confirmations, admin-creation side effects, privilege-escalation signals, webshell deployment, reflected XSS indicators) and - if it finds them - hands a synthetic claim to the judge for adjudication. The judge can still reject; the heuristic only ever opens the door for a verdict, it never makes one.
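The heuristic's shape is a keyword scan that produces a claim, never a verdict. The signal patterns below are illustrative examples of the categories the prose names, not the real keyword list.

```typescript
// Illustrative exploit-landing signals; the real list is richer.
const SUCCESS_SIGNALS: RegExp[] = [
  /union select .* from wp_users/i,   // SQLi landing
  /user .* created.*administrator/i,  // admin-creation side effect
  /role changed to administrator/i,   // privilege-escalation signal
  /<script>.*<\/script>.*reflected/i, // reflected XSS indicator
];

// Scan the tool log; on a hit, return a synthetic claim for the judge to
// adjudicate. Returning null means no signal: the task reports failed.
function autoDetectClaim(toolLog: string[]): string | null {
  for (const entry of toolLog) {
    const hit = SUCCESS_SIGNALS.find((re) => re.test(entry));
    if (hit) return `auto-detected: ${hit.source}`;
  }
  return null;
}
```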
Plugin-not-available (audit.ts lines 222-237) short-circuits to not_applicable - a distinct status that means "the substrate couldn't be built, this CVE wasn't even attempted." It never gets labeled as a reproduction failure; that would bias the ecosystem-wide false-negative rate unfairly against plugins that happened to be unavailable the day the scheduler got to them.
report_result is forced (or auto-detect runs) at 5 turns remaining, so there are no unbounded loops. Status is one of success / failed / error / not_applicable, and distinguishing "pipeline error" from "CVE not reproducible" protects the per-CVE statistics.

What each task actually costs
Order-of-magnitude numbers observed on the current cascade. Exact cost is a moving target (model prices, crawl mix) but the ratios between tiers are stable and they're the reason the cascade is shaped this way.
Tier 1 (lightweight executor)
- Median turns: 8 - 18
- Wall-clock / task: 30 - 120 s
- Cost per attempt: low
- Share of total tasks: majority

Tier 2 (frontier executor)
- Median turns: 12 - 30
- Wall-clock / task: 90 - 240 s
- Cost per attempt: ~20x T1
- Share of total tasks: minority

Judge (frontier, adjudicates every attempt)
- Per-attempt calls: 1
- Latency: 5 - 30 s
- Cost per attempt: fixed
- Runs on success too: always
Put together, the average task is a cheap Tier-1 executor pass plus one frontier judge call; a Tier-1 failure adds a substantially more expensive frontier executor pass. The cascade keeps the mean cost anchored near the cheap-executor floor while still admitting the expensive ceiling for tasks that actually need it.
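That sentence is a one-line expected-cost formula. A sketch with illustrative numbers (Tier 1 at unit cost, Tier 2 at ~20x, a fixed judge cost); the function name and parameters are assumptions for the example.

```typescript
// Every task pays Tier 1 plus the judge; only the escalated fraction adds
// the ~20x frontier executor pass.
function meanTaskCost(
  tier1Cost: number,
  tier2Multiplier: number,
  judgeCost: number,
  tier1FailureRate: number, // fraction of tasks escalated to Tier 2
): number {
  return tier1Cost + judgeCost + tier1FailureRate * tier1Cost * tier2Multiplier;
}
```

With a 10% escalation rate the mean stays close to the cheap floor: the Tier 2 term contributes only a fifth of its headline 20x multiplier.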
What ships alongside every verified CVE
Every verified PoC ships with a full evidence bundle that's intentionally reproducible by a third party without trusting me. Playwright trace (timestamped screenshots and DOM snapshots at every interaction), a rendered video of the complete exploitation flow, the watermarked title-card thumbnail for YouTube, and a machine-readable Playwright script that re-runs the exploit step-for-step.
The trace is a standard Playwright trace zip (context.tracing.start() with screenshots and snapshots enabled, lines 263-266). Anyone can open it in Playwright's own trace viewer - a debugging tool Microsoft distributes, not a bespoke format I invented - and step through the network calls, DOM state, and console logs frame by frame. It's published alongside every verified CVE.
Watermark and title-card (lines 270-285) aren't for aesthetics. They're there because a video of a WordPress exploit is, stripped of context, exactly what a malicious tutorial looks like. The watermark (injected via addInitScript so it survives SPA navigation) and the title-card make the video's provenance unambiguous in any platform that strips metadata.
The result POST'd to /api/internal/security-research/result carries the status, duration, model used for both executor and judge, vulnerable-code pointer, patch-diff pointer, exploit-code summary, Playwright script, and the YouTube video ID. Trace and video blobs ride separate upload endpoints because they're large and the JSON write has to be cheap and deterministic.
The AuditResult payload
Shape of the object POSTed to /api/internal/security-research/result at the end of every task. Binary artifacts (Playwright trace ZIP, video MP4) are uploaded separately and the result references them by ID.
{
"task_id": "tsk_01HW5KQ…",
"vulnerability_id": "CVE-2025-XXXX",
"status": "success",
"model_used": "tier1-lightweight",
"model_judge": "frontier-judge",
"duration_seconds": 87.2,
"turns_used": 14,
"retries": 0,
"summary": "SQL injection via unsanitized filter parameter in admin-ajax; exfiltrated wp_users.user_pass hash.",
"exploit_code": "POST /wp-admin/admin-ajax.php\n action=acme_filter&filter=1) UNION SELECT user_login,user_pass,3 FROM wp_users--",
"vulnerable_code": "lib/query-runner.php:64 // $wpdb->query($query) with filter interpolated",
"fix_diff": "lib/query-runner.php: wrap with $wpdb->prepare() and bind ints via %d placeholders",
"artifacts": {
"playwright_trace_id": "trc_f8c1…",
"video_mp4_id": "vid_9a4e…",
"youtube_video_id": "dQw4…"
},
"judge_verdict": {
"model": "frontier-judge",
"verdict": "confirmed",
"cites": [
"tool_log[3]: exploit request + UNION response",
"tool_log[5]: wp_cli DB confirmation matches exfiltrated hash"
],
"rationale": "Attack went through the vulnerable HTTP endpoint; wp_cli used only for confirmation."
},
"steps": [
"Recon: GET baseline to confirm endpoint",
"Probe: UNION payload against filter parameter",
"Exploit: exfiltrate wp_users.user_pass",
"Confirm: match exfiltrated hash via wp_cli"
]
}

Status taxonomy matters. status is one of success, failed, error, or not_applicable. The distinction matters because "failed to reproduce" is a negative result about the CVE (interesting signal), "error" is a negative result about the pipeline (retryable), and "not_applicable" means the plugin couldn't even be installed (unrelated to whether the CVE is real).
Judge rationale is structured. The judge_verdict.cites array names the tool-log indices the verdict is grounded in. This is the format the judge prompt enforces: no "looks good" verdicts, no free-form confidence scores without evidence pointers. Every claim in rationale has to map back to a cites entry or the judge call retries until it does.
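The structural side of that enforcement is mechanical: every cites entry must resolve to a real tool-log index. A sketch assuming the `tool_log[N]: ...` format shown in the payload above; `verdictIsGrounded` is an illustrative name, not the real validator.

```typescript
interface JudgeVerdict {
  verdict: string;
  cites: string[];
  rationale: string;
}

// Accept a verdict only if it cites at least one tool-log entry and every
// cite points inside the actual log. Otherwise the judge call is retried.
function verdictIsGrounded(v: JudgeVerdict, toolLogLength: number): boolean {
  if (v.cites.length === 0) return false; // no "looks good" verdicts
  return v.cites.every((c) => {
    const m = /^tool_log\[(\d+)\]:/.exec(c);
    return m !== null && Number(m[1]) < toolLogLength;
  });
}
```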
What this agent can't reach
Agent-unreachable is not the same as not-exploitable. If both the lightweight and frontier tiers fail to produce a reproducible flow, the CVE is logged as "agent-unreachable" on the page. That isn't a dismissal of the CVE; it's an honest statement about what this tooling can demonstrate. A human researcher with more context might well reproduce it tomorrow. I'd rather say "I couldn't" than paper over the gap with a synthesized description and hope nobody reads closely.
Costs aren't invisible. A frontier-LLM judge at 16k output on every judged attempt isn't free. The cascade explicitly keeps the cheap path as the default so the judge is the only place the frontier tier is guaranteed to run. At the current cadence that's a budget I can absorb; if the number of CVEs scales 10×, I'll need to revisit whether a smaller judge with a frontier-sampled audit-of-the-audit is the right shape instead of unconditional frontier judgement.
Determinism is a research problem. Two runs of the same agent against the same CVE will not always produce identical transcripts - LLM sampling, browser timing, WordPress admin-UI drift. The judge verdict is stable because the prompt is narrow, but the intermediate traces vary. If reproducing the video bit-for-bit mattered, I'd need a recorded browser and deterministic sampling; I've decided the cost of that isn't worth it yet.
Prior art and related work
- Foundational Schick et al., 2023 - Toolformer: Language Models Can Teach Themselves to Use Tools
The first clear articulation that LLMs are strong enough to drive tool calls autonomously if the interface is well-shaped. The cascade's tool surface (http_request, browser_*, wp_cli, bash_exec) is a direct descendant of that line of work, narrowed to the domain of WordPress vulnerability reproduction.
- Foundational Yao et al., 2022 - ReAct: Synergizing Reasoning and Acting in Language Models
The thinking-then-tool-call loop structure - reason about what to try, execute, observe the result, reason again - is the ReAct pattern. The 40-turn cap with nudge-then-force termination is my pragmatic bound on it; ReAct's open-loop formulation would run forever.
- Adjacent Fang et al., 2024 - LLM Agents can Autonomously Exploit One-day Vulnerabilities
Closest public work to what this cascade actually does. Demonstrates that GPT-4 with web access can exploit disclosed CVEs from their advisories. My cascade goes further on the reproducibility discipline - ephemeral substrate, independent judge, evidence bundle - but the core claim (LLMs can autonomously reproduce disclosed web CVEs) is consistent.
- Adjacent Frontier-lab model cards - cyber-exploitation evaluation harnesses
Each major LLM lab publishes evaluation harnesses for its models' offensive-security capability. The cascade here is orthogonal to those benchmarks - I'm using the models' agentic capability to reproduce already disclosed CVEs, not to discover novel ones - but the harness design (turn cap, tool sandboxing, judge separation) is informed by the same concerns.
- Planned arXiv preprint: "Autonomous CVE reproduction at scale: an LLM-agent cascade for WordPress"
Work in progress. Methods section is roughly the content of this page; results section will cover the reproducibility rate on the current CVE backlog, per-CVE cost at steady state, and the rate of judge-override (times the judge rejected an executor-claimed success). Link will land here when the preprint goes up.
Related deep-dives
The AST-level analyzer that produces the deterministic code signals feeding the research plans that drive this cascade. Seven sources, 31 sinks, 47 sanitizers, two phases.
The higher-level methodology page the cascade sits inside. Covers the research-plan stage, the per-phase separation of concerns, and the broader trust model.
Browse verified PoCs.
Every CVE on this site with a verified badge went through the cascade and judge described on this page.
