AI-Powered Vulnerability Hunting in WordPress Plugins/Themes

| <7 days spare time | 100 plugins scanned | 524 candidate findings | 16 confirmed vulns | 5 scanner patches |
|---|
This is not a vulnerability disclosure. It's a methodology. I want to share how to build an AI pipeline that lets a single part-time researcher cover dozens to hundreds of WordPress plugins at once, so the "16" today becomes "60" next month as the farm keeps learning.
Quick setup: I pulled 100 random plugins from WordPress.org with ≥10,000 active installs (via the plugins/info/1.2 API, sorted by popularity), threw them into the pipeline, walked away. In under a week of spare time, the farm returned:
| 12 SQL Injection | 1 Path Traversal | 1 Insecure Deserialization | 1 Broken Authentication | 1 Stored XSS |
|---|
16 confirmed vulnerabilities across 15 different plugins. They're sitting in the responsible-disclosure pipeline right now, so no plugin names in this post. What matters more than the count: I don't grep manually anymore. I type /scan-targets, /triage-findings, /verify-vulnerabilities, then go do other work. Next morning I have verdicts, cross-file taint chains, and HTTP PoCs ready to replay through Burp.
Why "just ask Claude to scan and find bugs" doesn't scale
The naive first attempt everyone tries: dump the whole plugin directory into the prompt with "find vulnerabilities". I tried this. It fails in three ways:
The thing that finally clicked: enumeration and judgment are two completely different problems. Listing every spot that calls $wpdb->query() with concatenation is a job for a static scanner. Cheap, fast, deterministic. But "does this variable actually reach the sink, is there sanitization, is this hook externally reachable, what role triggers it" — that's judgment, and an LLM does it well once you put concrete evidence in front of it.
The pipeline has 4 layers, each with its own processor, connected by strict JSON schemas:
| Layer | Name | Description |
|---|---|---|
| Layer 1 | Static Scanner | 10 deterministic Python scanners. High recall, low precision — FPs are fine. |
| Layer 2 | AI Triage Fleet | 8 parallel subagents with budget caps. Precision filter: discards the ~97% of candidates that are FPs. |
| Layer 3 | Dynamic Verify | HTTP probes through Burp + WP sandbox. Role-priority probing. |
| Layer 4 | Feedback Loop | AI patches the scanner. Each confirmed bug fixes a whole pattern family. |
4-layer pipeline — tracker.db is the single source of truth, every step is resumable
Layer 1 — Deterministic static scanners (high recall, low precision)
10 Python scanners, one per OWASP-style vulnerability class: sqli, xss, csrf, broken-access-control, lfi, rce, ssrf, php-object-injection, file-upload, arbitrary-file-deletion. Each scanner registers sink patterns in _shared/php_patterns.py:
| Class | CWE | Typical sinks | Severity |
|---|---|---|---|
| sqli | CWE-89 | $wpdb->query / get_results / get_var | HIGH |
| xss | CWE-79 | echo, print, wp_send_json* | MEDIUM |
| csrf | CWE-352 | wp_ajax_* without nonce | MEDIUM |
| broken-access-control | CWE-862 | Handler missing current_user_can | HIGH |
| lfi | CWE-22 | include, require, file_get_contents | HIGH |
| rce | CWE-94 | eval, assert, system, exec | CRITICAL |
| ssrf | CWE-918 | wp_remote_*, curl_exec | HIGH |
| php-object-injection | CWE-502 | unserialize, maybe_unserialize | HIGH |
| file-upload | CWE-434 | move_uploaded_file, wp_handle_upload | HIGH |
| arbitrary-file-deletion | CWE-73 | unlink, rmdir, rename | HIGH |
I deliberately designed these scanners to accept high false positive rates. If a file has echo $foo and the scanner can't trace where $foo is sanitized (say, through a helper method in another file), it flags. Target: recall ≈ 100%. Mediocre precision is fine. The next layer is where judgment happens.
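A minimal sketch of what a high-recall sink scanner in this style might look like. The regexes, the `Candidate` record, and `scan_php_source` are illustrative assumptions, not the contents of the actual `_shared/php_patterns.py`. It flags any `$wpdb` sink whose argument looks dynamically built and makes no attempt to trace sanitization.

```python
import re
from dataclasses import dataclass

# Assumed sink pattern: a $wpdb call to one of the query-style methods.
SQLI_SINK = re.compile(r"\$wpdb->(query|get_results|get_var|get_row)\s*\(")
# Assumed hint that the argument is dynamic: concatenation or interpolation.
DYNAMIC_ARG = re.compile(r'\.\s*\$|\$\w+\s*\.|"[^"]*\{?\$\w+')

@dataclass
class Candidate:
    line: int
    sink: str
    code: str

def scan_php_source(source: str) -> list[Candidate]:
    """Flag every sink line that looks dynamically built; no taint tracing."""
    hits = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        match = SQLI_SINK.search(text)
        # Only inspect what follows the sink call on the same line.
        if match and DYNAMIC_ARG.search(text[match.end():]):
            hits.append(Candidate(lineno, match.group(1), text.strip()))
    return hits
```

A line-based regex like this will also flag calls that are in fact safe, which is the intended trade-off at this layer: the triage step decides, the scanner only enumerates.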
Tracker.db — the farm's state machine
A CHECK constraint in SQLite enforces transitions — you cannot skip a step
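A sketch of how such a state machine could be enforced in SQLite. The table name, the four status values, and the `advance` helper are assumptions for illustration; the real tracker.db schema is not shown in this post. Note that a CHECK constraint can only restrict which values are legal, since it cannot see the previous value, so this sketch pairs it with a trigger to reject skipped steps.

```python
import sqlite3

SCHEMA = """
CREATE TABLE findings (
    id     INTEGER PRIMARY KEY,
    status TEXT NOT NULL DEFAULT 'scanned'
           CHECK (status IN ('scanned', 'triaged', 'verified', 'reported'))
);
-- The trigger compares OLD and NEW, which a CHECK constraint cannot do.
CREATE TRIGGER findings_no_skip
BEFORE UPDATE OF status ON findings
WHEN NOT (
    (OLD.status = 'scanned'  AND NEW.status = 'triaged')  OR
    (OLD.status = 'triaged'  AND NEW.status = 'verified') OR
    (OLD.status = 'verified' AND NEW.status = 'reported')
)
BEGIN
    SELECT RAISE(ABORT, 'illegal status transition');
END;
"""

def open_tracker(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) tracker schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn

def advance(conn: sqlite3.Connection, finding_id: int, new_status: str) -> bool:
    """Attempt a transition; return False if the database rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("UPDATE findings SET status = ? WHERE id = ?",
                         (new_status, finding_id))
        return True
    except sqlite3.DatabaseError:
        return False
```

Because the rule lives in the database rather than in the orchestration code, a crashed or restarted batch cannot leave a finding in a state it never legitimately reached.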
Diff scan: When a plugin gets a new version, the farm defaults to a diff scan between version_current and version_previous (only scans changed files). For a 20k-line plugin, a patch usually only touches a few files, so diff scan cuts cost 5-10x. That's what makes the pipeline scale over time. Adding one plugin to the watch list costs one full scan upfront, then only deltas going forward.
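The file-selection step of a diff scan can be sketched as follows, assuming both versions are unpacked into local directories. The function names and the choice of SHA-256 content hashing are assumptions; the farm may well diff differently (for example via SVN or unified diffs).

```python
import hashlib
from pathlib import Path

def _digests(root: Path) -> dict[str, str]:
    """Map each PHP file's relative path to a SHA-256 of its bytes."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*.php"))
    }

def changed_files(previous: Path, current: Path) -> list[str]:
    """Return files that are new or modified in `current`.

    Files deleted in the new version are ignored: there is nothing left to scan.
    """
    old, new = _digests(previous), _digests(current)
    return [rel for rel, digest in new.items() if old.get(rel) != digest]
```

Only the returned paths would be handed to the Layer 1 scanners, which is where the cost reduction comes from.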
Layer 2 — AI triage subagent fleet
This is the layer I invested the most prompt-engineering effort in. Each candidate from Layer 1 is dispatched to an independent triage-finding subagent in Claude Code. Up to 8 subagents run in parallel.
The subagent is defined in .claude/agents/triage-finding.md:
---
name: triage-finding
description: Triage a single static scanner finding from a WordPress
  plugin/theme scan. Reads the source around the sink, traces the
  tainted variable, checks for defenses, and returns a structured
  verdict (true_positive | false_positive | needs_dynamic_verify |
  needs_more_context | out_of_scope). Read-only.
tools: Read, Grep, Bash, WebFetch
---
Five design choices. This is the "force the AI to do the right thing" part:
(1) Hard cap at 10 tool calls
The subagent has a budget of at most 10 tool calls. When that's spent, it must return a verdict. Why? With unlimited budget, the agent tends to "keep researching" and drift. It reads files outside scope, guesses edge conditions, builds up a story. A cap of 10 forces it straight to a decision. If 10 calls don't yield enough evidence, the right answer is needs_more_context, not a guess.
Prefer needs_more_context over guessing. A wrong true_positive wastes verifier cost; a wrong false_positive hides a real bug.
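In this post the cap is a rule in the subagent prompt. If one wanted to enforce the same rule in orchestration code, a sketch could look like the following. `ToolBudget`, `BudgetExhausted`, and `triage_with_budget` are hypothetical names, and the "steps" stand in for whatever evidence-gathering calls an agent would make.

```python
class BudgetExhausted(Exception):
    """Raised when a caller tries to exceed the tool-call allowance."""

class ToolBudget:
    def __init__(self, max_calls: int = 10):
        self.max_calls = max_calls
        self.used = 0

    def call(self, tool, *args, **kwargs):
        """Run `tool` if budget remains, otherwise refuse."""
        if self.used >= self.max_calls:
            raise BudgetExhausted(f"{self.max_calls} tool calls already spent")
        self.used += 1
        return tool(*args, **kwargs)

def triage_with_budget(steps, budget: ToolBudget) -> str:
    """Run steps until one yields a verdict or the budget runs out.

    Each step is a zero-argument callable returning a verdict string or None.
    """
    for step in steps:
        try:
            verdict = budget.call(step)
        except BudgetExhausted:
            return "needs_more_context"  # out of calls: do not guess
        if verdict is not None:
            return verdict
    return "needs_more_context"
```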
(2) Ground every claim in source quotes
Every assertion must include file:line + the exact quoted code. The subagent cannot say "there's sanitization somewhere"; it must quote it. When I read a verdict, I verify in 5 seconds by opening the file at that line. This rule alone eliminated most hallucinations I saw in the naive approach.
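The manual "open the file at that line" check can also be automated. The sketch below assumes quotes shaped like the `evidence_quotes` entries in the verdict schema (`file`, `line`, `code`); the helper names are hypothetical and whitespace-insensitive matching is a design choice of this sketch, not something the post specifies.

```python
from pathlib import Path

def _squash(text: str) -> str:
    """Collapse whitespace so indentation differences do not matter."""
    return " ".join(text.split())

def quote_matches(plugin_root: Path, quote: dict) -> bool:
    """True if the quoted code appears on the cited line of the cited file."""
    root = plugin_root.resolve()
    path = (root / quote["file"]).resolve()
    # Refuse paths that escape the plugin directory or do not exist.
    if not path.is_relative_to(root) or not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()
    idx = quote["line"] - 1
    if not 0 <= idx < len(lines):
        return False
    needle = _squash(quote["code"])
    return bool(needle) and needle in _squash(lines[idx])

def ungrounded_quotes(plugin_root: Path, verdict: dict) -> list[dict]:
    """Return the evidence quotes that do not match the source."""
    return [q for q in verdict.get("evidence_quotes", [])
            if not quote_matches(plugin_root, q)]
```

A verdict with any ungrounded quote could be downgraded or re-queued before a human ever reads it.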
(3) Prefer ast-grep over plain grep
I bundle the ast-grep binary at ./bin/ast-grep and force the subagent to use it for any structural query. Reason: grep matches inside comments, string literals, and reformatted code. ast-grep matches on the AST, so it's structurally correct.
# Find all calls to a sink method, regardless of object name:
./bin/ast-grep --pattern '$X->query($_)' --lang php <plugin_root>
# Find calls where the argument is a string concatenation:
./bin/ast-grep --pattern '$X->query($_ . $_)' --lang php <plugin_root>
# Find direct superglobal taint sources:
./bin/ast-grep --pattern '$_POST[$_]' --lang php <plugin_root>
# Find hook registration to locate the AJAX callback
# ($ is escaped because the pattern is double-quoted and the shell would expand $_):
./bin/ast-grep --pattern "add_action('wp_ajax_\$_', \$_)" --lang php <plugin_root>
(4) JSON-only output
The subagent must emit exactly one JSON object matching a strict schema:
{
"finding_id": "<from input>",
"verdict": "true_positive | false_positive | needs_dynamic_verify | needs_more_context | out_of_scope",
"confidence": "high | medium | low",
"reason": "1-2 sentences, grounded in source quotes",
"evidence_quotes": [{"file": "<rel>", "line": 123, "code": "<exact>"}],
"defense_observed": ["sanitize_text_field", "current_user_can"],
"external_reachability": "via_endpoint | internal_only | unknown",
"exploitability_role": "unauthenticated | subscriber | ... | admin | unknown",
"suggested_next_step": "<one line>"
}
No markdown, no preamble, no "Here's my analysis:". This output gets piped directly into the scanner_accuracy table. Any extra text breaks the parser.
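A sketch of the kind of strict parsing this implies. The key names and enum values come from the schema above; `parse_verdict` and the decision to reject (rather than repair) malformed replies are assumptions of this example.

```python
import json

VERDICTS = {"true_positive", "false_positive", "needs_dynamic_verify",
            "needs_more_context", "out_of_scope"}
CONFIDENCE = {"high", "medium", "low"}
REQUIRED = {"finding_id", "verdict", "confidence", "reason", "evidence_quotes",
            "defense_observed", "external_reachability", "exploitability_role",
            "suggested_next_step"}

def parse_verdict(raw: str) -> dict:
    """Parse a subagent reply; raise ValueError unless it is one valid object."""
    try:
        data = json.loads(raw)  # fails on preambles and markdown fences
    except json.JSONDecodeError as exc:
        raise ValueError(f"not pure JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    missing = REQUIRED.difference(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["verdict"] not in VERDICTS:
        raise ValueError(f"unknown verdict: {data['verdict']!r}")
    if data["confidence"] not in CONFIDENCE:
        raise ValueError(f"unknown confidence: {data['confidence']!r}")
    for quote in data["evidence_quotes"]:
        if not isinstance(quote, dict) or not {"file", "line", "code"}.issubset(quote):
            raise ValueError("each evidence quote needs file, line and code")
    return data
```

Rejecting outright keeps the tracker clean: a reply that fails here can simply be re-dispatched instead of being half-parsed into the table.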
(5) Pre-written decision tree, no "let AI think for itself"
Decision tree applied in order, stops at first matching outcome:
1. Does the tainted variable actually reach the sink? (Read ±25 lines around the sink, trace assignments; if the variable is rebound or the tainted branch doesn't reach the sink → false_positive)
2. Is there a sanitizer/escape/cap-check/prepare applied to that variable before the sink? (sanitize_*, esc_*, $wpdb->prepare, current_user_can, check_ajax_referer)
3. Is the wrapping function reachable from the claimed endpoint? (ast-grep for add_action/register_rest_route; if orphan → out_of_scope: dead_handler)
4. What role reaches the endpoint? (wp_ajax_nopriv_X → unauthenticated; permission_callback => '__return_true' → unauthenticated)
5. Final verdict: high confidence taint reaches sink + no defense + externally reachable → true_positive
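The ordered, stop-at-first-match structure of the tree can be sketched as a plain function. The `Evidence` fields are hypothetical, and two mappings are assumptions of this sketch rather than rules stated above: an effective defense yields false_positive, and an undetermined role yields needs_dynamic_verify.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Evidence:
    taint_reaches_sink: Optional[bool]   # None means "could not determine"
    defense_on_variable: Optional[bool]
    handler_registered: Optional[bool]
    role: str                            # e.g. "unauthenticated", "admin", "unknown"

def decide(ev: Evidence) -> str:
    """Apply the checks in order and stop at the first decisive outcome."""
    if ev.taint_reaches_sink is False:
        return "false_positive"
    if ev.defense_on_variable is True:      # assumption: defended means FP
        return "false_positive"
    if ev.handler_registered is False:
        return "out_of_scope"               # dead handler
    # Any unanswered question means the evidence is incomplete.
    if None in (ev.taint_reaches_sink, ev.defense_on_variable, ev.handler_registered):
        return "needs_more_context"
    if ev.role == "unknown":                # assumption: let Layer 3 find the role
        return "needs_dynamic_verify"
    return "true_positive"
```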
Forbidden anti-patterns. This matters because AI tends to pattern-match shallowly:
- Do not assume an intval() somewhere in the file protects an unrelated SQL query. A defense only counts when applied to THAT specific tainted variable.
- Do not pattern-match keywords like "allowed" or "safe" in code text. Read the actual statement.
- Do not trust the scanner's evidence field. Verify by reading source.
Cluster propagation — spread verdict to the same family
Often the scanner produces 20-30 candidates with the same pattern in a single plugin (e.g., 20 callsites of the same un-sanitized helper). Rather than run 30 identical subagents, I run one on a representative and propagate the verdict to the cluster. The tracker records decided_by for audit: triage-finding-agent (321), triage-finding-cluster-propagated (80), deep-dive-manual (92), dynamic-verify (11), dynamic-verify-cluster-propagated (20).
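A sketch of the propagation step. The candidate fields (`plugin`, `vuln_type`, `helper`) and the cluster key built from them are assumptions; only the two `decided_by` labels are taken from the tracker values listed above.

```python
from collections import defaultdict

def cluster_key(candidate: dict) -> tuple:
    """Group candidates sharing plugin, vuln class and the helper that reaches the sink."""
    return (candidate["plugin"], candidate["vuln_type"], candidate["helper"])

def propagate(candidates: list[dict], triage) -> list[dict]:
    """Triage one representative per cluster and copy its verdict to the rest."""
    clusters = defaultdict(list)
    for cand in candidates:
        clusters[cluster_key(cand)].append(cand)

    results = []
    for members in clusters.values():
        representative, *rest = members
        verdict = triage(representative)  # the only expensive call per cluster
        results.append({**representative, "verdict": verdict,
                        "decided_by": "triage-finding-agent"})
        for cand in rest:
            results.append({**cand, "verdict": verdict,
                            "decided_by": "triage-finding-cluster-propagated"})
    return results
```

The quality of this shortcut depends entirely on the cluster key: if two callsites share a helper but pass differently sanitized arguments, a propagated verdict would be wrong, which is why recording `decided_by` for audit matters.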
Layer 3 — Dynamic verify: from verdict to working PoC
A text verdict still isn't evidence. Layer 3 is real HTTP probing: the /verify-vulnerabilities skill installs the plugin into a local WordPress sandbox via wp-cli, activates it, then fires requests per role.
Tier-1 probes for each vuln class:
- SQLi: Differential timing probes. A SLEEP(5)-true / SLEEP(0)-false pair (latency delta) works for both blind boolean and time-based. Caught all 12 SQLi cases, including ones where the payload had to travel through a webhook body parser or REST request param.
- Path Traversal: /etc/passwd baseline + null-byte / encoding variants + diff against a blank baseline response.
- BAC / Broken Auth: Request the same endpoint with unauth vs subscriber cookies, diff the JSON structure to detect missing capability checks.
- Stored XSS: Inject payload via admin endpoint, re-render on front-end shortcode/widget, parse HTML response for un-escaped payload.
- Insecure Deserialization: Marker object O:8:"stdClass"... + canary callback to confirm unserialize actually fires (and isn't just stringified).
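The decision side of the differential timing probe can be sketched without any networking: given latencies measured against the local sandbox for the delayed and non-delayed variants, decide whether the delta is convincing. The function name, the minimum of 3 samples, and the 0.8 tolerance are assumptions of this sketch.

```python
from statistics import median

def timing_verdict(true_latencies: list[float], false_latencies: list[float],
                   sleep_seconds: float = 5.0, tolerance: float = 0.8) -> bool:
    """Decide whether the delayed probe was consistently slower.

    Medians damp one-off jitter; `tolerance` is the fraction of the injected
    delay that must show up in the latency delta.
    """
    if len(true_latencies) < 3 or len(false_latencies) < 3:
        raise ValueError("need at least 3 samples per arm")
    delta = median(true_latencies) - median(false_latencies)
    return delta >= sleep_seconds * tolerance
```

Comparing medians of repeated pairs, rather than a single slow response, avoids promoting a finding because one request happened to hit a cold cache.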
Burp proxy routing + Tier-1/Tier-2 separation
Every probe routes through local Burp (127.0.0.1:8080) so I can audit request/response later and use raw transcripts as evidence. Tier-1 (automated probes) runs on Apache :80; Tier-2 (manual PoC crafting) runs php -S on :8082. The two installs use different DB prefixes so they reset independently and don't mix in Burp history. Lesson I learned the painful way: once you've confirmed a bug with Tier-1 and want to craft a cleaner PoC for the report, you really don't want Tier-1 (dozens of probes) and Tier-2 (a few clean PoC requests) sitting next to each other in the same Burp history.
Layer 4 — Feedback loop: AI patches its own scanner
This is the layer where the leverage actually compounds. Every confirmed finding from Layer 3 becomes training data for the scanner. A 2-agent pipeline:
2-agent feedback loop with 3 guard rails — patches always go through human review before merge
Agent A — vuln-root-cause (read-only, cap 15 calls)
Receives one confirmed finding, fully analyzes the taint flow, cross-checks against the existing scanner rules. Output is a JSON proposal:
{
"vuln_type": "sqli",
"root_cause": "<3-sentence narrative grounded in quotes>",
"taint_trace": [
{"step": 1, "file": "...", "line": 28, "code": "...", "role": "source"},
{"step": 2, "file": "...", "line": 32, "code": "...", "role": "cross_file_dispatch"}
],
"scanner_gap": {
"current_behavior": "MISSED | PARTIAL_CATCH | NO_GAP",
"root_pattern_missing": "...",
"false_positive_risk": "low | medium | high"
},
"proposed_change": {
"verdict": "add_sink | add_source | narrow_defense | architectural | no_change | not_worth_it",
"approach": "<concrete description>",
"ast_grep_pattern": "<structural pattern>",
"files_to_touch": [".claude/skills/sqli/scripts/scan_sqli.py"],
"risk_level": "low | medium | high"
},
"regression_fixture": {
"case_name": "snake_case_id",
"must_detect_as": "<vuln_type> vulnerability",
"minimal_php": "<?php\n..."
}
}
Distinguish "this specific bug" from "the generic pattern". A scanner rule that only matches one helper-method name is worse than no rule (training-data leak). Estimate FP risk for any proposed pattern. If it's high and the bug is rare, recommend not_worth_it rather than adding garbage rules.
5 proposals produced so far (anonymized)
1. REST_PARAM_UNAUTH source recognizer: a register_rest_route with permission_callback => '__return_true' whose callback feeds $request->get_param() straight into an interpolated $wpdb->*. A common bug. Developers think sanitize_text_field is enough, but it doesn't strip apostrophes.
2. RAW_HTTP_BODY bypass: when a plugin reads file_get_contents('php://input') instead of $_POST, the body doesn't pass through wp_magic_quotes(), so the attacker's quote goes straight into SQL.
3. Cross-file taint resolver (architectural): a 2-pass plugin-scope registry. Pass A builds a registry of methods that are direct sinks or passthroughs; pass B walks back from every callsite. Chain depth capped at 2 hops to avoid blowup.
4. Magic-quotes bypass + sink correlation: the scanner already detected urldecode($_REQUEST['x']) but emitted medium severity. Agent A proposed correlating the bypass site with a specific string-quoted unprepared $wpdb->* slot, bumping severity to HIGH.
5. IN-clause loop concat sink: the pattern for { $ids_str .= "'" . $arr[$i] . "'"; } $sql = "... IN {$ids_str}" is extremely common but the scanner had no rule.
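The 2-pass resolver idea can be illustrated on a toy call graph. Everything here is an assumption for illustration: the data shape, the `wpdb_query` stand-in sink, and the reading of "2 hops" as "at most two passthrough wrappers above a function that calls the sink directly". The actual patch is not shown in this post.

```python
SINKS = {"wpdb_query"}  # stand-in for $wpdb->query and friends
MAX_HOPS = 2

def build_registry(functions: dict) -> dict[str, set[int]]:
    """Pass A: map each function to the parameter indexes that reach a sink.

    `functions` maps name -> {"params": [...], "calls": [(callee, [arg names])]}.
    Round 0 finds direct sinks; each later round adds one hop of passthrough.
    """
    registry: dict[str, set[int]] = {}
    for _ in range(MAX_HOPS + 1):
        # Work from a snapshot so each round extends chains by exactly one hop.
        snapshot = {name: set(idx) for name, idx in registry.items()}
        for name, fn in functions.items():
            for callee, args in fn["calls"]:
                if callee in SINKS:
                    dangerous = range(len(args))
                else:
                    dangerous = snapshot.get(callee, ())
                for pos in dangerous:
                    if pos < len(args) and args[pos] in fn["params"]:
                        registry.setdefault(name, set()).add(
                            fn["params"].index(args[pos]))
    return registry

def tainted_callsites(callsites: list[dict], registry: dict[str, set[int]]) -> list[dict]:
    """Pass B: keep callsites that pass a tainted argument into a dangerous slot."""
    return [cs for cs in callsites
            if any(i in registry.get(cs["callee"], ())
                   for i in cs["tainted_arg_indexes"])]
```

The hop cap trades recall for bounded cost: a wrapper three levels above the sink is deliberately left out of the registry.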
Agent B — scanner-rule-evolve (Edit-scoped, cap 20 calls)
Implements Agent A's proposal, with 3 guard rails:
- edit_allowlist: Hard-check that the agent only edits files in that specific scanner (e.g., .claude/skills/sqli/**). No spillover into _shared/ or other scanners.
- pattern_linter: Rejects substring-heuristic patterns. I got burned once by a rule checking "allow" in text. It matched the comment // allow CORS in a plugin and caused a real LFI miss. The linter blocks that class of pattern up front, forcing word-boundary regex or AST patterns.
- Regression test must pass. If the test suite breaks, the patch is rejected.
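Two of these guard rails are easy to illustrate. The sketch below is an assumption about how they could be checked, not the farm's implementation: `edits_allowed` tests changed paths against the scanner's glob, and the two rule functions show why a substring heuristic misfires on the `// allow CORS` comment while a word-bounded pattern does not. The identifier `is_allowed_path` is hypothetical.

```python
import re
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

def edits_allowed(changed_paths: list[str], allowlist_glob: str) -> bool:
    """True only if every edited path stays inside the scanner's own glob."""
    def ok(path: str) -> bool:
        # Reject ".." segments so a path cannot climb out of the allowed tree.
        return (".." not in PurePosixPath(path).parts
                and fnmatchcase(path, allowlist_glob))
    return all(ok(p) for p in changed_paths)

def naive_rule(text: str) -> bool:
    """Substring heuristic: fires on any text containing 'allow'."""
    return "allow" in text

def bounded_rule(text: str) -> bool:
    """Word-bounded pattern: fires only on a call to one specific identifier."""
    return re.search(r"\bis_allowed_path\s*\(", text) is not None
```

Note that `fnmatchcase` lets `*` cross directory separators, so the `**` in the glob behaves like a recursive match here.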
Patches commit to a branch improve/<vuln_type>-<id>. Never auto-merged to main. State in tracker.db.scanner_improvements: analyzed → patched_pending_review → merged | rejected | superseded. At post time: 5 patches, 3 at patched_pending_review, 2 at analyzed.
What actually matters: each patch closes an entire pattern family. Once merged, rescanning the original 100 plugins plus the next 100 will automatically flag every similar instance.
Farm results: 524 candidates → 16 confirmed across 15 plugins
Stats pulled from tracker.db at post time:
| Layer | Unit | Number |
|---|---|---|
| Layer 0 | plugin corpus | 100 random plugins (≥10K active installs) |
| Layer 1 | candidates produced | ~500+ (mostly from sqli/lfi/rce/file-upload/xss/poi scanners) |
| Layer 2 | triaged via subagent | 524 records in scanner_accuracy |
| Layer 2 | true positives | 16 (sqli=12, path-traversal=1, deserialization=1, broken-auth=1, stored-xss=1) |
| Layer 2 | false positives | 508 |
| Layer 2 | deep-dive manual | 92 (handling needs_more_context) |
| Layer 3 | dynamic verify confirmed | 13 (+20 cluster-propagated) |
| Layer 4 | scanner rule patches | 5 (3 pending review, 2 analyzed) |
TP distribution by class and root pattern
| Class | # | Plugin category (anonymized) | Root pattern |
|---|---|---|---|
| SQLi | 12 | e-commerce, payment, downloads/membership (1 plugin with 2 sinks), analytics, SEO, image optimization, form builder, appointments, shipping | Mix: sanitize_text_field doesn't strip apostrophes, magic-quotes bypass, cross-file taint, IN-clause loop concat |
| Path Traversal | 1 | order tracking | User-controlled filename concatenated into file_get_contents without realpath()/whitelist |
| POI | 1 | background task scheduler | unserialize() on a DB-stored task payload that another endpoint can inject — POP gadget chain |
| Broken Auth | 1 | backup / migration | Sensitive endpoint (export/restore) missing nonce + capability check during a state-reset window |
| Stored XSS | 1 | page builder addon | Widget setting (admin role) stores raw payload, front-end shortcode renders without escape — cross-role impact |
Zooming out, 5 root patterns showed up over and over in the SQLi findings. These are exactly what the scanner needs to learn (Layer 4 has produced patches for all 5):
1. REST routes with permission_callback => '__return_true' interpolating $request->get_param() straight into raw SQL.
2. file_get_contents('php://input') reader in a webhook handler hooked into init. Fires on every request, raw body bypasses wp_magic_quotes().
3. Cross-file taint through a shared helper method (e.g., plugin_get_value($payload, 'a/b/c')) without a $validate argument: propagates from webhook source through 2-3 files into $wpdb->get_row() interpolated.
4. IN-clause loop concatenation: pattern for { $ids_str .= "'" . $arr[$i] . "'"; } piped into $wpdb->get_results("... IN {$ids_str}").
5. Magic-quotes bypass via urldecode($_REQUEST[...]) then sanitize_text_field (useless for SQL context) before going into a string-quoted slot in an unprepared $wpdb->*.
An FP rate of ~97% (508 of 524 candidates) sounds bad, but it's by design. The static scanner is intentionally high recall; the AI subagent is the precision filter. If Layer 1 were tuned to flag only 16 findings, it would likely miss 5-10 more. I'd rather have AI filter 508 false positives than miss one critical unauth SQLi on a 100K-install plugin.
Proof of Work — Sample Verification Evidence
Example findings from a prompt run with the skills mentioned above across a batch of 10 plugins.
Based on that output, I re-verified the finding and confirmed it in WPCargo Track & Trace.
I've submitted it and am now waiting for Patchstack's review.
Another SQL injection turned up in WPDM Premium Packages.
One of us was rewarded with a bounty for the vulnerability we found.
Honest evaluation: what works, what doesn't, what's next
Separating enumeration from judgment kills almost all hallucination. The 10-call subagent budget enforces focus. Parallel dispatch of 8 agents triages a batch in minutes. Cluster propagation saves cost on duplicate patterns. The feedback loop produces architectural patches (cross-file resolver) that manual work rarely reaches.
That said, cross-file taint through many helpers is still weak. There's no JavaScript scanner yet. Cost isn't tightly controlled; a full pass over 50 plugins can burn a few dozen USD in tokens. Theme scanning has worse signal-to-noise than plugins.
What I want to build next: a generator for exploit scripts, an MCP server exposing tracker.db for natural language queries, a per-function taint summary cache, and expanding dynamic verify to Tier-2 (POI/RCE chains).
Closing
The whole pipeline runs on Claude Code (CLI) with Opus 4.x for both the parent skill and every subagent. No Codex mixed in, no Sonnet. I need enough reasoning depth at the subagent layer to judge cross-file taint flow, especially when taint travels through dynamic dispatch or hook chains. Sonnet is cheaper, but I tried it and it missed noticeably on helper-method indirection cases.
If you want to build a similar farm: don't make the LLM do the static analyzer's job, and don't make the static analyzer do the LLM's job. Let each side do what it's good at. Static enumerates sinks, sources and defenses. The LLM judges on quoted evidence. Wire them together with strict JSON schemas, set budget caps on the AI to force focus. The rest is patiently tinkering with prompts and building resumable state machines.
The farm is still running. Each new plugin version in the watch list auto-runs a diff scan; each newly confirmed pattern feeds back into the scanner. That's the part that scales. Not the "16" today (which will be stale in a week), but the rate at which the scanner learns new patterns, plus near-zero marginal cost to add one more plugin to the watch list.
If you're building something similar, three things I wish I'd done sooner:
(1) build a resumable state machine from day one, not after crashing midway through a 50-plugin batch;
(2) force subagents into a strict JSON schema with budget caps, because that alone kills most hallucination;
(3) invest in the scanner feedback loop, because confirming one bug without feeding the rule back throws away the most valuable part of the whole exercise.




