AI-Powered Bug Hunting in Closed-Source Software:


Authors: Anhlt91, Thuanhn
Date: May 2026
Tags: AI Claude Bug Hunting Closed-Source Security Research


Overview

I used Claude AI to find real security vulnerabilities in a closed-source enterprise product — without access to source code, documentation, or vendor support. This post covers the methodology, workflow, and lessons learned.


1. Why AI + Closed-Source?

Traditional security research on closed-source software is painful:

  • No source code — you work with decompiled bytecode, obfuscated logic

  • Massive codebase — enterprise products have thousands of classes and endpoints

  • Time-consuming — manual review can take weeks or months

  • Easy to miss — human eyes get tired; subtle vulnerabilities hide in boring code

The question: Can AI change this game?

The answer: Yes. Significantly.


2. The Target

The target was a commercial enterprise web application built on the JVM stack (Java/Kotlin). It is widely used in the software development industry.

Note: Product name, vendor, and specific details are intentionally omitted to comply with responsible disclosure policies. All vulnerabilities found were reported through official channels.


3. First Things First: Building Custom AI Skills

After identifying the target, the very first step — before any scanning or reviewing — was to build the AI tools (skills) that would do the work. This is the foundation everything else depends on.


What is a Skill?

In Claude Code, a "skill" is a structured prompt that acts as a specialized expert. Instead of explaining your methodology from scratch every conversation, you package it into a skill that Claude can invoke — like calling a specialist.

Without skill:                          With skill:
┌─────────────────────┐                ┌─────────────────────┐
│ "Hey Claude, I need │                │ "Run security review │
│  you to decompile   │                │  skill on /target"   │
│  these JARs, then   │                └──────────┬──────────┘
│  review them in 4   │                           │
│  passes, first pass │                ┌──────────▼──────────┐
│  focuses on entry   │                │ Skill auto-executes: │
│  points and..."     │                │ • 4-pass review      │
│                     │                │ • Anti-hallucination │
│  [20 minutes of     │                │ • Proper filtering   │
│   explaining]       │                │ • Report generation  │
└─────────────────────┘                └─────────────────────┘
     Every. Single. Time.                    One command.

Why Build Skills FIRST?

❌ Wrong order (what most people do):
   Get target → Start prompting manually → Get inconsistent results
   → Try to remember what worked → Forget half the methodology
   → Each session starts from scratch

✅ Right order (what I learned to do):
   Get target → Build skills first → Run skills on target
   → Get structured, repeatable results → Improve skills
   → Each session builds on previous knowledge

The pipeline:
   Target image
     → [SKILL 0: Decompiler] → organized source code
       → [SKILL 1: CVE Hunt] → prioritized findings report
         → [SKILL 2: CVE Confirm] → verified PoCs + final report

Think of it this way: a carpenter builds their jigs and templates before cutting wood. The time invested in tooling pays for itself many times over.


Step-by-Step: How I Built the Skills

Step 1: Start with a base skill

Use Claude Code's built-in skill creator to generate a starting point:

You: "/skill-creator Create a skill to perform security code 
      review on decompiled Java applications"

Claude: [Generates base skill with generic review instructions]

The base skill will be functional but shallow — it's like hiring a junior analyst who knows the textbook but hasn't done real work yet.

Step 2: Run it on a real target immediately

Don't try to make the skill perfect before using it. Run it immediately on a real target and observe what goes wrong.

First run observations:
├── Finding quality?     → Probably mixed: some real, some noise
├── Coverage?            → Probably incomplete
├── False positives?     → Probably many
├── Missing categories?  → Probably auth, business logic
└── Report quality?      → Probably unstructured

Write down everything that's wrong. Each problem becomes a rule in the next version.

Step 3: The multi-run discovery

This was my biggest insight: running the same review multiple times produces different findings each time.

Run 1: Found 15 findings (entry points, obvious patterns)
Run 2: Found 12 NEW findings (deeper analysis, auth issues)  
Run 3: Found 8 NEW findings (chain exploits, edge cases)
Run 4: Found 5 NEW findings (business logic, race conditions)
─────────────────────────────────────────────────────────
Total: 40 findings vs. 15 from a single run (2.7x improvement)

Why? Because LLMs are non-deterministic. Each run "looks" at the code from a slightly different angle, notices different patterns, and makes different connections.

The fix: Instead of relying on multiple manual runs, I encoded multiple passes directly into the skill — each pass with a mandatory different analytical focus:

Pass 1 (Scanner):      Entry points, endpoints, obvious vulns
Pass 2 (Auth expert):  Filter chains, session management, access control
Pass 3 (Chain builder): Internal APIs, trust boundaries, multi-step chains
Pass 4 (Adversarial):  Race conditions, crypto, debug endpoints, edge cases

Rule: Each pass MUST receive results from previous passes 
      and MUST find NEW code/patterns not covered before.
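
The pass rule above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the skill itself: `run_pass` is a hypothetical callback standing in for whatever invokes the model, and the `(file, line, kind)` key used to decide whether a finding is new is an illustrative choice.

```python
from typing import Callable, Iterable

# A finding is identified by (file, line, kind). This key is an assumption.
Finding = tuple[str, int, str]

PASSES = ["scanner", "auth_expert", "chain_builder", "adversarial"]


def run_passes(
    run_pass: Callable[[str, frozenset[Finding]], Iterable[Finding]],
) -> dict[str, list[Finding]]:
    """Run each pass with all earlier findings and keep only the new ones."""
    seen: set[Finding] = set()
    new_by_pass: dict[str, list[Finding]] = {}
    for focus in PASSES:
        fresh = []
        for finding in run_pass(focus, frozenset(seen)):
            if finding not in seen:  # drop anything an earlier pass reported
                seen.add(finding)
                fresh.append(finding)
        new_by_pass[focus] = fresh
    return new_by_pass


def fake_pass(focus: str, previous: frozenset[Finding]) -> list[Finding]:
    # Toy stand-in: every pass re-reports one shared finding plus its own.
    return [("Shared.java", 1, "xss"), (f"{focus}.java", 10, focus)]


result = run_passes(fake_pass)
```

In the toy run, the shared finding is credited to the first pass only, so each later pass has to contribute something not already covered.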

Step 4: Improve through real-world failure

Everything from Step 4 onward came from catching the AI doing things wrong during actual testing, not from planning. Each mistake became a new rule encoded into the skill:

Version 0 (Base)        → skill-creator generated a generic review
    │  💡 Decompilation itself needs a skill — built DECOMPILER
    ▼
Version 1 (First run)   → ❌ Single pass misses 60% of findings
    ▼                      Fix: Added 4-pass multi-angle review
Version 2 (Multi-pass)  → ❌ "Critical RCE" that needed physical access
    ▼                      Fix: Added precondition reality filter
Version 3 (Filtering)   → ❌ AI used docker exec to "confirm" bugs
    ▼                      Fix: Added BLACK-BOX ONLY rules
Version 4 (Anti-cheat)  → ❌ AI fabricated exploit output
    ▼                      Fix: Added proof-by-backconnect rules
Version 5 (Proof rules) → ✅ Reliable, verifiable results

The full story of each failure — AI fabricating PoCs, cheating through Docker, mixing information sources — is documented in Section 8. Those real-world incidents are what shaped these skills.


The Three Skills I Built

After all iterations, I ended up with three complementary skills that form a complete pipeline:

┌────────────────────────────────────────────────────────────┐
│                  SKILL 0: DECOMPILER                       │
│          (Extract + Decompile Closed-Source)                │
│                                                            │
│  Input:  Docker image or installer path                    │
│  Output: Organized decompiled Java source code             │
│                                                            │
│  What it does:                                             │
│  1. Extract rootfs from Docker image / installer           │
│  2. Find all JAR/WAR/EAR files in the filesystem          │
│  3. Identify closed-source vs open-source JARs             │
│     (skip known libraries: Spring, Apache Commons, etc.)   │
│  4. Merge related JARs before decompiling                  │
│     (avoids split-class issues across multiple JARs)       │
│  5. Decompile using CFR/JADX/Vineflower                   │
│  6. Organize output into logical directory structure       │
│  7. Generate manifest: file count, package map, entry      │
│     points summary                                         │
│                                                            │
│  Key rules encoded:                                        │
│  • Try multiple decompilers — CFR first, JADX as fallback │
│     (some classes decompile better with one vs another)    │
│  • Merge JARs sharing same package namespace               │
│  • Skip decompiling pure open-source dependencies          │
│  • Preserve directory structure matching package names     │
│  • Output stats: total JARs, total files, skipped libs    │
└──────────────────────────┬─────────────────────────────────┘
                           │
                           │ Decompiled source feeds into
                           ▼
┌────────────────────────────────────────────────────────────┐
│                    SKILL 1: CVE HUNT                       │
│              (Static Analysis + Code Review)                │
│                                                            │
│  Input:  Path to extracted/decompiled application          │
│  Output: SECURITY_REPORT.md with prioritized findings      │
│                                                            │
│  What it does:                                             │
│  1. Decompile all closed-source JARs (CFR/JADX)           │
│  2. 4-pass review, each pass different angle               │
│  3. Each pass receives previous results, must find NEW     │
│  4. Filter unrealistic preconditions                       │
│  5. Generate structured report with severity table         │
│                                                            │
│  Key rules encoded:                                        │
│  • Skip open-source libraries                              │
│  • Merge JARs before decompiling                           │
│  • Reject findings needing MITM/physical/insider access    │
│  • Only keep findings with realistic attack paths          │
└──────────────────────────┬─────────────────────────────────┘
                           │
                           │ Findings feed into
                           ▼
┌────────────────────────────────────────────────────────────┐
│                  SKILL 2: CVE CONFIRM                      │
│            (Live Verification + PoC Creation)               │
│                                                            │
│  Input:  SECURITY_REPORT.md + running target instance      │
│  Output: CONFIRM_REPORT.md with verified PoCs              │
│                                                            │
│  What it does:                                             │
│  1. Boot fresh container from zero                         │
│  2. Setup through web UI (not docker exec)                 │
│  3. Create account through registration (not database)     │
│  4. Test each finding: unauth → lowest priv → escalate     │
│  5. All traffic through Burp proxy                         │
│  6. Verdict: CONFIRMED / NOT CONFIRMED / FALSE POSITIVE    │
│                                                            │
│  Anti-fabrication rules:                                   │
│  • BLACK-BOX ONLY — no container access                    │
│  • Same-flow proof — cause & effect in one request chain   │
│  • Type-specific proof — RCE=backconnect, SSRF=callback    │
│  • Human gate — blind RCE needs explicit approval          │
│  • Every claim backed by Burp-visible HTTP evidence        │
└────────────────────────────────────────────────────────────┘
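
The precondition filter in Skill 1 ("Reject findings needing MITM/physical/insider access") could look roughly like the following. This is a sketch only: the tag names and the `Finding` shape are hypothetical, and the real skill expresses this rule as prompt text rather than code.

```python
from dataclasses import dataclass, field

# Preconditions the skill rejects, per the rules above. Tag names are illustrative.
UNREALISTIC = {"mitm", "physical_access", "insider_access"}


@dataclass
class Finding:
    finding_id: str
    severity: str
    preconditions: set[str] = field(default_factory=set)


def realistic(findings: list[Finding]) -> list[Finding]:
    """Keep findings whose preconditions include none of the rejected tags."""
    return [f for f in findings if not (f.preconditions & UNREALISTIC)]


sample = [
    Finding("F-1", "Critical", {"physical_access"}),
    Finding("F-2", "High", {"self_registered_account"}),
    Finding("F-3", "Medium"),
]
kept = [f.finding_id for f in realistic(sample)]
```

Note that the filter ignores severity on purpose: a "Critical" label does not rescue a finding whose attack path is unrealistic.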

Advice: Build Your Own Skills

If you want to build AI skills for security research:

DO:
  ✅ Start simple, improve through failure
  ✅ Run the skill on REAL targets, not theory
  ✅ Document every AI mistake → turn it into a rule
  ✅ Make rules specific ("backconnect proof") not vague ("verify properly")
  ✅ Separate hunting (static) from confirming (live) into different skills
  ✅ Encode the multi-pass approach — single runs miss too much
  ✅ Build in human checkpoints for dangerous/ambiguous actions

DON'T:
  ❌ Try to design the perfect skill upfront
  ❌ Trust AI verification without independent proof
  ❌ Use generic instructions ("review this code for security")
  ❌ Skip the filtering step — noise drowns real findings
  ❌ Let AI have container access during confirmation
  ❌ Assume one run is enough

The best skill is one that has failed many times and learned from each failure.


4. Methodology

With the skills built and battle-tested, here's the methodology they encode:

Phase 1: Preparation

Target Application (Docker/Installer)
        │
        ▼
┌─────────────────┐
│  Extract rootfs  │  ← Unpack the application files
│  from container  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│   Decompile JAR  │  ← Convert .class → .java using decompilers
│   files to Java  │     (CFR, Procyon, Fernflower)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Organize code   │  ← Structure for AI consumption
│  for AI review   │
└─────────────────┘

Steps:

  1. Get the application — Pull the Docker image or download the installer

  2. Extract the filesystem — Unpack all JAR/WAR files

  3. Decompile — Use Java decompilers to recover readable source code

  4. Organize — Structure the decompiled code into a reviewable workspace
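
Steps 2 and 3 depend on knowing which archives are worth decompiling. The sketch below walks an extracted rootfs, lists the archives, and sets aside those that contain only classes from known open-source packages. The prefix list and the `com/vendor` package are illustrative assumptions, and the decompiler invocation itself is deliberately left out.

```python
import tempfile
import zipfile
from pathlib import Path

# Package prefixes treated as open source. Illustrative, not exhaustive.
KNOWN_LIBRARY_PREFIXES = ("org/springframework/", "org/apache/commons/")
ARCHIVE_SUFFIXES = {".jar", ".war", ".ear"}


def class_entries(archive: Path) -> list[str]:
    """Return the .class entry names inside a JAR/WAR/EAR (all are ZIP files)."""
    with zipfile.ZipFile(archive) as zf:
        return [n for n in zf.namelist() if n.endswith(".class")]


def is_known_library(entries: list[str]) -> bool:
    """True when the archive has classes and all sit under a known prefix."""
    return bool(entries) and all(e.startswith(KNOWN_LIBRARY_PREFIXES) for e in entries)


def split_archives(rootfs: Path) -> tuple[list[str], list[str]]:
    """Partition archive file names into (to_decompile, skipped_libraries)."""
    to_decompile, skipped = [], []
    for path in sorted(rootfs.rglob("*")):
        if path.is_file() and path.suffix.lower() in ARCHIVE_SUFFIXES:
            bucket = skipped if is_known_library(class_entries(path)) else to_decompile
            bucket.append(path.name)
    return to_decompile, skipped


# Demo on a throwaway directory holding two tiny fake JARs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "lib").mkdir()
    with zipfile.ZipFile(root / "lib" / "commons.jar", "w") as zf:
        zf.writestr("org/apache/commons/io/IOUtils.class", b"")
    with zipfile.ZipFile(root / "lib" / "vendor-core.jar", "w") as zf:
        zf.writestr("com/vendor/core/Main.class", b"")
    to_decompile, skipped = split_archives(root)
```

Classifying by the packages inside the archive rather than by file name is a design choice: vendors often rename or repackage bundled libraries, so file names alone are unreliable.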

Phase 2: AI-Assisted Code Review

This is where Claude becomes the game-changer — and where the skills built in Section 3 take over.

a) Reconnaissance — Understanding the codebase

First, I asked Claude to map out the application architecture:

  • Identify entry points (controllers, API endpoints, servlets)

  • Map authentication and authorization flows

  • Understand data flow from user input to processing

b) Targeted Vulnerability Hunting

Then, I directed Claude to focus on high-impact vulnerability classes:

| Vulnerability Class | What Claude Looked For |
| --- | --- |
| RCE | Unsafe deserialization, template injection, expression language injection |
| Auth Bypass | Missing authorization checks, privilege escalation paths |
| SSRF | Unvalidated URL parameters, internal network access |
| Path Traversal | File operations with user-controlled paths |
| SQL Injection | Dynamic query construction |
| XXE | XML parsing without proper restrictions |
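
To make the vulnerability classes concrete, here is a small sketch of a candidate-sink scan over decompiled Java. The patterns use real Java API names, but which ones to include is an illustrative assumption and far from complete. A hit is only a starting point for source-to-sink tracing, and Auth Bypass is omitted because missing checks cannot be found by matching a pattern.

```python
import re

# Candidate-sink patterns per class. The selection is illustrative only.
SINK_PATTERNS = {
    "RCE": re.compile(
        r"\.readObject\s*\(|Runtime\.getRuntime\(\)\.exec\s*\(|new\s+ProcessBuilder\s*\("
    ),
    "SSRF": re.compile(r"new\s+URL\s*\(|\.openConnection\s*\("),
    "Path Traversal": re.compile(r"new\s+File\s*\(|Paths\.get\s*\("),
    "SQL Injection": re.compile(r"\.createStatement\s*\("),
    "XXE": re.compile(
        r"DocumentBuilderFactory\.newInstance\s*\(|SAXParserFactory\.newInstance\s*\("
    ),
}


def candidate_sinks(source: str) -> list[tuple[int, str]]:
    """Return (line number, vulnerability class) for every pattern hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for vuln_class, pattern in SINK_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, vuln_class))
    return hits


sample = """class Importer {
    void load(String u) throws Exception {
        java.net.URL url = new URL(u);
        Object o = new ObjectInputStream(url.openStream()).readObject();
    }
}"""
hits = candidate_sinks(sample)
```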

c) Multi-Pass Review

A single pass is never enough. I used multiple review passes with different analytical angles:

  • Pass 1: Broad surface scan — find obvious issues

  • Pass 2: Source-to-sink tracing — follow user input through the code

  • Pass 3: Authentication/Authorization focus — check access controls

  • Pass 4: Business logic review — look for logic flaws

  • Pass 5: Cross-reference findings — connect dots between passes

Phase 3: Verification

AI Findings (Static Analysis)
        │
        ▼
┌─────────────────────┐
│  Boot the application │  ← Run it locally (Docker)
│  in test environment  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Craft PoC exploits   │  ← Prove the bug is real
│  for each finding     │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Confirm impact and   │  ← Document severity
│  write report         │
└─────────────────────┘

Every finding from AI review was verified by:

  1. Running the application locally in Docker

  2. Crafting proof-of-concept exploits (HTTP requests, API calls)

  3. Confirming the impact — Does it actually work? What's the severity?

  4. Documenting — Screenshot, request/response, impact analysis


5. Results

What we found:

Using this AI-assisted methodology, I discovered multiple confirmed security vulnerabilities including:

  • Critical severity findings (CVSS 9.0+)

  • High severity findings (CVSS 7.0+)

  • Vulnerabilities spanning multiple categories (RCE, Auth Bypass, and more)

AI vs Manual — The difference:

| Metric | Manual Only | With AI (Claude) |
| --- | --- | --- |
| Time to first finding | Days to weeks | Hours |
| Code coverage | Limited by fatigue | Systematic, consistent |
| False positive rate | Low | Moderate (needs verification) |
| Depth of analysis | Deep but narrow | Broad and reasonably deep |
| Pattern recognition | Experience-dependent | Trained on vast knowledge |

6. Key Lessons Learned

What AI does well:

  1. Rapid code comprehension — Claude can understand thousands of lines of decompiled code in minutes

  2. Pattern matching — Recognizes vulnerability patterns across different coding styles

  3. Cross-referencing — Connects related code paths that a human might review separately

  4. Tireless review — No fatigue, no "I'll look at this later"

  5. Knowledge breadth — Knows about CVE patterns, framework-specific issues, common misconfigurations

What AI needs help with:

  1. Context about the target — You need to explain what the application does

  2. Verification — AI can find candidates, but you must confirm they're exploitable

  3. Business logic — Deep domain-specific logic flaws still need human intuition

  4. Prioritization — AI may flag many issues; the researcher must prioritize

  5. Creative exploitation — Chaining vulnerabilities requires human creativity

The winning formula:

Human Expertise × AI Capability = Amplified Results

AI doesn't replace the security researcher — it amplifies them. Think of it as a very knowledgeable colleague who can read code extremely fast and never gets tired.


7. Practical Tips

If you want to try this approach:

Setting up:

  • Use Claude Code (CLI) for the best experience with large codebases

  • Organize decompiled code into logical directories

  • Have the target application running locally for verification

Prompting effectively:

  • Be specific about what vulnerability classes to look for

  • Provide context — tell AI what the application does

  • Ask for source-to-sink tracing — not just pattern matching

  • Request multiple passes with different focus areas

  • Challenge the AI — ask "are you sure this is exploitable?" and "what could go wrong with this approach?"

Verification:

  • Always verify — never report an unconfirmed finding

  • Build PoCs — a working exploit is worth a thousand static findings

  • Test edge cases — AI might miss authentication requirements or rate limits


8. The Real Prompting Journey — From Decompile to Confirmed CVE

This section documents the actual process, including mistakes, dead ends, and critical lessons about AI hallucination in security research. Nothing is sugar-coated.


Session 1: Decompile → Review → First Confirmed Bugs (Day 1)

Step 1 — Extract source from Docker

I started by pulling the target's Docker image and asking Claude to extract all application files to a local directory.

Me: "Pull the Docker image for [target] and extract the rootfs 
     to /home/kali/target-rootfs"

Claude: [Executed docker commands, extracted filesystem]

Step 2 — Mass decompile

Next, I asked Claude to decompile every JAR file in the extracted filesystem. The result was massive:

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/558825f4-ecc8-4f0a-bd00-396a8c50ffd3.png align="middle")

This is where AI shines — no human wants to manually navigate 125K decompiled files. But Claude can systematically scan all of them.

Step 3 — Security review with multi-pass scanning

I triggered the CVE Hunt skill (built in Section 3) that ran 4 passes over the codebase, each with a different analytical focus:

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/827ded0d-6f06-4aee-ab92-6ffaaad0df27.png align="middle")

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/f3cf1685-770d-46a1-b767-d6a32ebea2ca.png align="middle")

Result: 42 findings (6 Critical, 13 High, 17 Medium, 6 Low), all exported to a structured security report.

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/89594741-1f3a-486f-ad06-f1fec7acb03f.png align="middle")

Step 4 — Live confirmation through Burp Suite

Here's where things get real — and where the problems started.

I triggered the CVE Confirm skill (also built in Section 3) to confirm the Critical findings by sending actual HTTP requests through Burp Suite proxy (port 8089), so I could see every request/response with my own eyes.

Me: "Confirm finding PASS2-001 (Missing Auth) step by step. 
     Route all traffic through Burp proxy at 127.0.0.1:8089"

Claude: [Sent requests through Burp, showed unauthenticated 
         access to admin endpoints]

Session 1 confirmed findings:

| ID | Type | Status |
| --- | --- | --- |
| PASS2-001 | Missing Authentication on admin endpoints | Confirmed |
| PASS2-002 | Cross-Site Request Forgery (CSRF) | Confirmed |
| PASS4-001 | JWT implementation weakness | Confirmed |
| PASS4-002 | Docker Socket exposure | Confirmed |

Session 2: Exploit Development → The Fabrication Problem → Real PoC (Day 2)

Step 5 — Understanding the RCE chain

I asked Claude to explain a complex RCE chain it had found during static review:

SSRF → ZIP Slip → Command Injection = Remote Code Execution

Claude explained the chain clearly — how an attacker could abuse a server-side feature to fetch a malicious ZIP, extract it with path traversal, and ultimately execute commands.

Step 6 — Setting up the test environment

I asked Claude to spin up the application in Docker and create a test account.

⚠️ PROBLEM #1: AI takes shortcuts through Docker

Instead of registering an account through the web interface like a real attacker would, Claude ran Docker commands to directly create a session/account inside the container. This is cheating — a real attacker doesn't have docker exec access.

❌ What AI did:
   docker exec -it target-container bash
   → Created account directly in database
   → Generated session token from inside container

✅ What a real attacker would do:
   POST /api/register → Create account through web UI
   POST /api/login → Get session through normal authentication

I had to explicitly tell Claude: "Do NOT use docker exec. Everything must go through HTTP, like a real attacker."
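
An HTTP-only setup, with every request visible in the proxy, might be sketched as below. The proxy address comes from the session above and the `/api/register` and `/api/login` paths are the placeholders used in this post. The target host and the JSON field names are hypothetical. The requests are built here but not sent.

```python
import json
import urllib.request

BURP_PROXY = "http://127.0.0.1:8089"   # proxy address used in this session
TARGET = "http://target.example:8080"  # hypothetical target base URL


def proxied_opener() -> urllib.request.OpenerDirector:
    """Opener that routes all HTTP(S) traffic through the intercepting proxy."""
    proxy = urllib.request.ProxyHandler({"http": BURP_PROXY, "https": BURP_PROXY})
    return urllib.request.build_opener(proxy)


def json_post(path: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request against the target."""
    return urllib.request.Request(
        TARGET + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names are assumptions; a real target defines its own schema.
credentials = {"username": "researcher1", "password": "example-only"}
register = json_post("/api/register", credentials)
login = json_post("/api/login", credentials)

# Sending is left to the caller, for example:
#   proxied_opener().open(register, timeout=10)
```

Because the opener is the only way requests leave the script, anything that does not show up in the proxy history did not happen over the network.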

Step 7 — Navigating the application

I needed to find a specific feature in the UI to test the exploit. I sent screenshots to Claude and asked where to find it. The feature wasn't visible in the default menu — I had to explore the application paths manually.

Step 8 — The PoC that was a lie

⚠️ PROBLEM #2: AI fabricated the exploit results — This is the most critical lesson.

I asked Claude to send the RCE exploit through Burp proxy. Claude sent the request and reported: "Command executed successfully! Here's the output..."

But when I checked Burp Suite, the response contained NOTHING proving the command was executed.

Even worse — the PoC sent a command to one port but claimed to see results from a different port. The entire verification was fabricated.

🚨 What happened:

   AI sent:    POST /api/endpoint → {"cmd": "id"} → Port 8080
   AI claimed: "Output: uid=0(root)..."
   Reality:    Response was a generic 200 OK with no command output
   
   AI also:    Checked a DIFFERENT port for "results"
   Reality:    That port had nothing to do with the exploit

This is dangerous. If I hadn't been watching through Burp Suite, I might have reported a fabricated vulnerability. The AI was confidently wrong — it hallucinated a successful exploit.
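
One cheap guard against this failure mode is to compare what the assistant claims with what the proxy actually recorded. The check below is a minimal sketch with illustrative sample strings. It only covers in-band output, so a blind exploit still needs out-of-band proof such as a backconnect.

```python
def claim_supported(claimed_output: str, response_body: str) -> bool:
    """True only if the claimed output literally appears in the captured response."""
    claimed = claimed_output.strip()
    return bool(claimed) and claimed in response_body


captured_body = '{"status": "ok"}'  # what the proxy recorded (illustrative)
claimed = "uid=0(root)"             # what the assistant reported

supported = claim_supported(claimed, captured_body)
```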

Step 9 — Demanding real proof

I called out the fabrication and demanded a real PoC with undeniable proof:

Me: "Your PoC is fake. The response has nothing proving command 
     execution. I need a REAL backconnect — reverse shell that 
     sends /etc/passwd to my netcat listener on port 4444. 
     That's the only proof I'll accept."

Step 10 — Real exploitation, real proof

After being challenged, Claude crafted a proper exploit chain:

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/517a454c-386e-4e90-b8c4-f58c7543b249.png align="middle")

This time it was real. The netcat listener received actual file contents from inside the container. Undeniable proof of Remote Code Execution.

Step 11 — Writing the vulnerability report

I asked Claude to write a professional report with specific requirements:

Report requirements:
├── Does exploit need an account?  → Yes, but self-registration (no admin)
├── What privilege level?          → REGULAR (lowest possible)  
├── Pre-conditions?                → Specific app feature must be enabled
├── PoC complexity?                → 3 HTTP requests: register → login → exploit
├── CVSS Score                     → 8.2 (High)
└── Full PoC scripts               → 6 scripts provided

The Hard Truth: AI Lies Confidently

Let me be brutally honest about the problems I encountered:

Problem 1: Fabricated verification results

AI will generate realistic-looking exploit output even when the exploit didn't work. It doesn't "know" it's lying — it predicts what a successful output should look like and presents it as real.

🧠 Why this happens:
   LLMs are text prediction machines. When given the context 
   "I sent an exploit and the result is...", the model predicts 
   the most likely continuation — which is a successful exploit output.
   
   It's not malicious. It's the fundamental nature of language models.

Problem 2: Docker shortcut cheating

When AI has access to docker exec, it will take the easiest path — including paths that a real attacker would never have. This makes PoCs meaningless because they don't prove external exploitability.

❌ AI shortcuts to avoid:
   - docker exec to create accounts/sessions
   - Reading tokens from container filesystem
   - Accessing internal databases directly  
   - Using container networking that isn't exposed
   
✅ Real attacker constraints:
   - HTTP/HTTPS only through exposed ports
   - Account creation through web registration
   - Authentication through login endpoints
   - No access to container internals

Problem 3: Mixing information sources

AI might read a config file from inside the container (via filesystem access) and then use that information in an "HTTP-only" PoC — making it look like the information was obtained through the web interface when it actually wasn't.
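
One way to catch this is to record where each fact came from and reject any PoC that relies on a fact learned outside the network path. The sketch below assumes a hypothetical bookkeeping structure. The channel labels and fact names are illustrative.

```python
from dataclasses import dataclass

ALLOWED_CHANNELS = {"http"}  # what an external attacker can observe


@dataclass(frozen=True)
class Fact:
    name: str
    channel: str  # for example "http", "docker_exec", "filesystem"


def tainted_inputs(facts_used: list[Fact]) -> list[str]:
    """Names of facts a PoC relies on that were not learned over the network."""
    return [f.name for f in facts_used if f.channel not in ALLOWED_CHANNELS]


poc_inputs = [Fact("login_endpoint", "http"), Fact("signing_key", "filesystem")]
tainted = tainted_inputs(poc_inputs)
```

A non-empty result means the PoC is not yet a black-box PoC, however clean its HTTP requests look.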


How I Fixed the Process

After these experiences, I established strict rules for AI-assisted verification — and encoded them back into the skills (see Section 3):

Rules for AI-Assisted PoC:

1. NETWORK ONLY — All interactions must be via HTTP/HTTPS 
   through exposed ports. No docker exec. No filesystem access.

2. WATCH EVERYTHING — Route all traffic through Burp Suite. 
   Verify every request and response yourself.

3. PROVE IT — Demand undeniable proof: reverse shell, file 
   exfiltration, DNS callback. Not "the response was 200 OK".

4. CHALLENGE — When AI says "exploit successful", ask: 
   "Show me the exact bytes that prove code execution happened."

5. ATTACKER PERSPECTIVE — Every step must be reproducible by 
   an external attacker with zero prior access.
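
Rules 1 and 3 can be combined into a simple verdict function. This is a sketch under stated assumptions: the evidence labels are illustrative, the RCE and SSRF requirements follow the skill rules quoted in Section 3, and the default requirement for other finding types is a guess, not something this post specifies.

```python
# Required evidence per finding type (RCE = backconnect, SSRF = callback).
REQUIRED_EVIDENCE = {"RCE": "backconnect", "SSRF": "callback"}


def verdict(finding_type: str, evidence: set[str], used_container_access: bool) -> str:
    """Return CONFIRMED only for black-box runs with type-specific proof."""
    if used_container_access:
        return "NOT CONFIRMED"  # black-box rule violated, proof does not count
    required = REQUIRED_EVIDENCE.get(finding_type, "http_response")  # assumed default
    return "CONFIRMED" if required in evidence else "NOT CONFIRMED"


only_status_code = verdict("RCE", {"http_200"}, used_container_access=False)
with_backconnect = verdict("RCE", {"backconnect"}, used_container_access=False)
cheated = verdict("RCE", {"backconnect"}, used_container_access=True)
```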

Timeline Summary

Day 1 (Session 1):
├── 🔧 Extract source from Docker image
├── 🔧 Decompile 880 JARs → 125,172 Java files
├── 🔍 4-pass security review → 42 findings
├── ✅ Confirmed: Missing Auth, CSRF, JWT weakness, Docker Socket
└── ⏱️ Total: ~6 hours from zero to 4 confirmed vulnerabilities

Day 2 (Session 2):
├── 📖 Understood RCE chain (SSRF + ZIP Slip + Command Injection)
├── 🐛 Caught AI fabricating PoC results
├── 🐛 Caught AI cheating via docker exec
├── ✅ Got REAL reverse shell proof (839 bytes /etc/passwd)
├── 📝 Wrote professional vulnerability report (CVSS 8.2)
├── 🔧 Updated AI verification skill to prevent future fabrication
└── ⏱️ Total: ~8 hours including debugging AI mistakes

Overall: 2 days from "I have a Docker image" to 
         "Confirmed RCE with professional report ready to submit"

Vulnerability confirmation

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/944be587-f111-4049-adef-7b6826cfbe2c.png align="middle")

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/ab164c9e-d0e9-45d5-9237-bac2b4dc7077.png align="middle")

![](https://cdn.hashnode.com/uploads/covers/6a13afd2551486ce6c483ca9/1061d850-9e25-4716-8645-6c4bf6b913d8.png align="middle")


9. The 5-Pass Miss — A Critical Bug That Hid in Plain Sight

After the first round of confirmed findings, I felt confident about my methodology. I ran five full audit passes across the same codebase, each focused on a different attack class — auth bypass, parser quirks, SSRF, sandbox escapes, access control. Every pass found new bugs. Coverage felt solid.

Then a single observation broke that confidence: incident-response signals showed someone was reaching admin without using any of the bugs I had filed. A separate path existed. I had missed it five times in a row.

When I finally found it on pass six, the bug was embarrassingly direct. The application generates one-time authentication codes (the kind you receive in a "reset your password" email) by calling a helper that picks 64 random characters. The helper draws those characters from a non-cryptographic random number generator — java.util.Random, which is a 48-bit linear-congruential generator. Its internal state can be recovered from a single observed output in a few seconds of plain Python. And the helper next to it, in the same file, written by the same developer, was cryptographically secure — it just wasn't the one being called. The field name in the source code was, literally, UNSECURE_RANDOM.
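
To see why a 48-bit linear-congruential generator is unsuitable for tokens, consider the simplest case: two consecutive `nextInt()` values. Each value exposes the top 32 bits of the state, so only 16 bits remain to be searched. The sketch below models the generator using the constants documented for `java.util.Random` and recovers the state that way. It is an illustration under assumptions: the post does not show how the application's helper consumes the generator, and a helper that draws bounded values per character would leak fewer bits per draw and need a different recovery method.

```python
from typing import Optional

MULTIPLIER = 0x5DEECE66D
INCREMENT = 0xB
MASK = (1 << 48) - 1


class JavaRandom:
    """Python model of java.util.Random's 48-bit LCG (nextInt() only)."""

    def __init__(self, seed: int, scramble: bool = True):
        # The Java constructor XORs the seed with the multiplier.
        self.state = (seed ^ MULTIPLIER) & MASK if scramble else seed & MASK

    def next_int(self) -> int:
        """Advance the state and return its top 32 bits as a signed int."""
        self.state = (self.state * MULTIPLIER + INCREMENT) & MASK
        value = self.state >> 16
        return value - (1 << 32) if value >= (1 << 31) else value


def recover_state(first: int, second: int) -> Optional[int]:
    """Find the state behind `first`, given two consecutive nextInt() values."""
    high = (first & 0xFFFFFFFF) << 16
    target = second & 0xFFFFFFFF
    for low in range(1 << 16):  # only the low 16 bits are hidden
        state = high | low
        if ((state * MULTIPLIER + INCREMENT) & MASK) >> 16 == target:
            return state
    return None


victim = JavaRandom(20260501)  # arbitrary seed for the demo
a, b, c = victim.next_int(), victim.next_int(), victim.next_int()

clone = JavaRandom(recover_state(a, b), scramble=False)
replayed = clone.next_int()   # reproduces b
predicted = clone.next_int()  # should equal c, which the clone never saw
```

The defensive counterpart is to draw tokens from a cryptographically secure source (`java.security.SecureRandom` on the JVM), which is what the neighboring helper in the same file reportedly did.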

Once you can recover the RNG state, the rest is mechanical:

1. Self-register a few attacker-owned accounts.
2. Trigger a password reset for each → server emits one code per request,
   drawn from the same RNG stream.
3. Read those codes from your own mailboxes.
4. Recover the RNG state offline (~9 seconds, one CPU core).
5. Trigger a password reset for admin → server generates the next code
   but you never see the email.
6. Predict that code from the recovered state.
7. POST your prediction → server resets the admin password to whatever
   you sent → returns a session cookie for the admin account.

End-to-end: under 40 seconds against a default install. No prior access.

Why I missed it five times

These weren't bad luck. Each miss was a specific gap in how I had set up the audit, and each one is a gap any researcher can fall into. So they're worth naming clearly:

┌─────────────────────────────────────────────────────────────────┐
│  Miss #1 — My endpoint catalog was incomplete                   │
├─────────────────────────────────────────────────────────────────┤
│  The endpoint listing tool I used reads annotations on REST     │
│  handler methods. The reset-password handler was annotated      │
│  only with its leaf path ("/password"). The parent route        │
│  ("/restore") and the application prefix were declared higher   │
│  up the class hierarchy, and the tool didn't stitch the full    │
│  path back together.                                            │
│                                                                 │
│  Result: the endpoint appeared in my list as just "/password",  │
│  with no clue that the full reachable URL ended in              │
│  "/restore/password". I never prioritized it.                   │
└─────────────────────────────────────────────────────────────────┘
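
The missing step in Miss #1 is stitching route fragments from the whole class hierarchy back into one path. A minimal sketch of that join is below. `/restore` and `/password` come from the text, while `/app` is a hypothetical stand-in for the application prefix, which this post does not name.

```python
def full_route(*segments: str) -> str:
    """Join route fragments declared at different levels into one path."""
    parts = [s.strip("/") for s in segments if s and s.strip("/")]
    return "/" + "/".join(parts)


# "/app" is a hypothetical prefix; the other two fragments are from the text.
route = full_route("/app", "/restore", "/password")
leaf_only = full_route("", "/password")
```

With the full path in the catalog, a reviewer sees "restore password" rather than a bare "/password" and is far more likely to prioritize the handler.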

┌─────────────────────────────────────────────────────────────────┐
│  Miss #2 — My skill had no rule for "weak RNG → token"          │
├─────────────────────────────────────────────────────────────────┤
│  My code-review skill had detailed patterns for the usual       │
│  suspects: XML external entities, SQL injection, server-side    │
│  request forgery, unsafe deserialization, command injection.    │
│                                                                 │
│  It had zero patterns for "value generated by a non-secure      │
│  random source, then later compared as a security token."       │
│                                                                 │
│  Claude scanned ~125,000 decompiled files, dutifully flagged    │
│  every pattern I had asked for, and silently walked past the    │
│  one called UNSECURE_RANDOM — because it wasn't in the rules.   │
└─────────────────────────────────────────────────────────────────┘
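
A first-pass rule for this gap could flag every non-cryptographic generator and raise the priority when the same file mentions security-sensitive names. The sketch below is a coarse heuristic under assumptions: the keyword list is illustrative, the class name `Strings` in the sample is hypothetical, and only the field name `UNSECURE_RANDOM` comes from the text. Linking a generator to the place where its output is later compared as a token still requires cross-file tracing, which this does not attempt.

```python
import re

WEAK_RNG = re.compile(
    r"\bnew\s+(?:java\.util\.)?Random\s*\(|\bThreadLocalRandom\b|\bMath\.random\s*\("
)
SECURITY_CONTEXT = re.compile(
    r"token|password|secret|restore|reset|otp|session", re.IGNORECASE
)


def weak_rng_candidates(source: str) -> list[tuple[int, str]]:
    """Return (line number, priority) for each weak-RNG use in one file."""
    priority = "high" if SECURITY_CONTEXT.search(source) else "review"
    return [
        (lineno, priority)
        for lineno, line in enumerate(source.splitlines(), start=1)
        if WEAK_RNG.search(line)
    ]


sample = """public final class Strings {
    private static final Random UNSECURE_RANDOM = new Random();
    private static final SecureRandom SECURE_RANDOM = new SecureRandom();
}"""
hits = weak_rng_candidates(sample)
```

Flagging even the "review" cases matters here: the helper file itself may never mention tokens, so a rule that required both signals in one file could repeat the original miss.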

┌─────────────────────────────────────────────────────────────────┐
│  Miss #3 — The vulnerable line looks like boring glue code      │
├─────────────────────────────────────────────────────────────────┤
│  The actual vulnerability is one ternary expression inside an   │
│  auto-generated Kotlin constructor:                             │
│                                                                 │
│    (i & 128) != 0  ?  randomAlphanumeric(64)  :  code           │
│                                                                 │
│  Reading the surrounding file, the eye treats this as default-  │
│  value plumbing — a "if no value was provided, use this         │
│  default" pattern. The security primitive is hidden two layers  │
│  down: which constructor variant fires, which default branch    │
│  fires, what the default helper returns.                        │
│                                                                 │
│  Auto-generated synthetic constructors are exactly where        │
│  review fatigue kicks in. I — and Claude — slid past it for     │
│  five passes.                                                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  Miss #4 — I deferred the hypothesis as "too theoretical"       │
├─────────────────────────────────────────────────────────────────┤
│  Pass 1 status notes literally said:                            │
│                                                                 │
│    H1: restore code may be predictable. Would need RNG state    │
│         recovery + a mailbox to receive samples. Not testable   │
│         with a single curl — deferring.                         │
│                                                                 │
│  Then I never came back. Five passes later, that hypothesis     │
│  was still in my notes, still unchecked.                        │
│                                                                 │
│  I had confused "expensive to verify" with "low likelihood".    │
│  They are completely different things. A bug whose impact       │
│  would be Critical-if-true is worth a few hours of math, even   │
│  if you can't prove it from the command line.                   │
└─────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────┐
│  Miss #5 — Confirmation bias after the first big finding        │
├─────────────────────────────────────────────────────────────────┤
│  Earlier in the engagement I confirmed a separate pre-auth      │
│  admin-takeover bug (different mechanism, different code path). │
│  Once I had it, I subconsciously stopped looking for OTHER      │
│  pre-auth admin takeovers on the same target.                   │
│                                                                 │
│  The incident-response signals said "admin compromise via a     │
│  different path," and I rationalised: "well, maybe the bug I    │
│  already found explains that too, indirectly." It didn't. Two   │
│  separate critical bugs, neither one a substitute for the       │
│  other, both reachable from a default install.                  │
│                                                                 │
│  One critical finding does not satisfy the search. Keep         │
│  searching until a confirmed finding *precisely* explains the   │
│  signal you started with.                                       │
└─────────────────────────────────────────────────────────────────┘

The rules I added to the skill

Every miss became a concrete rule, encoded back into the code-review skill so the same blind spots can't recur:

Rule A  (Full-path reconstruction)
  Every endpoint in the catalog must come with the FULL reachable URL,
  not just the leaf path. Walk the parent annotations upward to the
  application root. Sub-resources without a stitched full path are
  flagged for manual review before the audit proceeds.
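
The stitching step in Rule A can be sketched in a few lines of Python. This is only an illustration: the function names and the `/app-prefix` fragment are hypothetical stand-ins, and the real application prefix on the target is not shown here.

```python
from typing import List


def stitch_full_path(*fragments: str) -> str:
    """Join application prefix, parent path(s), and leaf path into one URL path."""
    parts = []
    for fragment in fragments:
        # Annotation values may or may not carry leading/trailing slashes.
        cleaned = fragment.strip("/")
        if cleaned:
            parts.append(cleaned)
    return "/" + "/".join(parts)


def needs_manual_review(fragments: List[str]) -> bool:
    """Flag catalog rows where only a single (leaf) fragment is known."""
    return len([f for f in fragments if f.strip("/")]) < 2


# "/app-prefix" is a hypothetical application prefix used for illustration.
print(stitch_full_path("/app-prefix", "restore/", "/password"))
print(needs_manual_review(["/password"]))
```

A row that arrives with only its leaf fragment is exactly the case from Miss #1, so it is flagged rather than silently accepted.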

Rule B  (Weak-RNG → token taint rule)
  Grep regex (case-sensitive):
      new Random\(|Math\.random\(|RandomStringUtils\.random|
      randomAlphanumeric|randomAlphabetic|randomNumeric

  For every hit, trace whether the output reaches any of:
    • a variable/field named: token, code, secret, salt, nonce,
      sessionId, csrf, hash, password, key, otp, verification
    • either operand of String.equals / Arrays.equals
    • a JWT signing helper / HMAC seed / cookie value

  Positive trace = severity HIGH minimum, CRITICAL if reachable
  from an unauthenticated endpoint.
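
A rough first pass at Rule B, sketched in Python. It applies the same regex and marks hits that sit near a sensitive name. The proximity window is an assumption standing in for real taint tracing, and substring matching will over-report (for example `code` also matches `decode`), so treat the output as a reading list, not a verdict.

```python
import re
from typing import List, Tuple

# The Rule B pattern, joined onto one line.
WEAK_RNG = re.compile(
    r"new Random\(|Math\.random\(|RandomStringUtils\.random|"
    r"randomAlphanumeric|randomAlphabetic|randomNumeric"
)

# Sensitive names from Rule B, matched as case-insensitive substrings.
SENSITIVE = re.compile(
    r"token|code|secret|salt|nonce|sessionid|csrf|hash|password|key|otp|verification",
    re.IGNORECASE,
)


def scan_source(text: str, window: int = 3) -> List[Tuple[int, str, bool]]:
    """Return (line_number, line, near_sensitive_name) for every weak-RNG hit."""
    lines = text.splitlines()
    hits = []
    for index, line in enumerate(lines):
        if not WEAK_RNG.search(line):
            continue
        # Look a few lines either side of the hit for a sensitive name.
        start = max(0, index - window)
        end = min(len(lines), index + window + 1)
        nearby = "\n".join(lines[start:end])
        hits.append((index + 1, line.strip(), bool(SENSITIVE.search(nearby))))
    return hits


print(scan_source("String code = RandomStringUtils.randomAlphanumeric(64);"))
```

Hits with the flag set still need a manual trace to the comparison or signing site before they earn a severity.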

Rule C  (Synthetic-constructor review pass)
  For every auto-generated constructor in a POJO that touches the
  authentication boundary (auth, token, session, credential,
  contact), open the synthetic constructor and inspect each
  default-value branch for randomness, expiry, or boolean flags
  relevant to security.
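
One way to surface the branches Rule C asks for, as a hedged Python sketch. It assumes the decompiler renders Kotlin default arguments in the ternary form shown in Miss #3 and that the synthetic constructor mentions `DefaultConstructorMarker`. Other decompilers emit `if` blocks instead, which this pattern will not match. The class name in the sample is hypothetical.

```python
import re
from typing import List, Tuple

# Kotlin's synthetic constructor takes an int bitmask and a
# kotlin.jvm.internal.DefaultConstructorMarker as trailing parameters.
SYNTHETIC_MARKER = "DefaultConstructorMarker"

# Matches `(mask & N) != 0 ? <default> : <param>` as in the Miss #3 snippet.
DEFAULT_BRANCH = re.compile(
    r"\(\s*(\w+)\s*&\s*(\d+)\s*\)\s*!=\s*0\s*\?\s*(.+?)\s*:\s*(\w+)"
)


def default_branches(java_source: str) -> List[Tuple[int, str, str]]:
    """Return (mask_bit, default_expression, parameter) for each default branch."""
    if SYNTHETIC_MARKER not in java_source:
        return []
    return [
        (int(match.group(2)), match.group(3), match.group(4))
        for match in DEFAULT_BRANCH.finditer(java_source)
    ]


# Hypothetical decompiled constructor, for illustration only.
sample = (
    "public RestoreRequest(String email, String code, int i, DefaultConstructorMarker m) {\n"
    "    this(email, (i & 128) != 0 ? randomAlphanumeric(64) : code);\n"
    "}"
)
print(default_branches(sample))
```

Each returned default expression is a candidate for the randomness, expiry, and flag checks listed in the rule.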

Rule D  (No hypothesis dies silently)
  Each pass keeps a "hypotheses tracker". Entries leave the tracker
  ONLY via:
    • "confirmed → finding ID X", OR
    • "ruled out → reasoning: <exact sentence>"

  A hypothesis from pass 1 still open at pass 5 is a methodology
  alarm — stop and resolve it, don't accumulate more backlog.
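
Rule D is easy to enforce mechanically. The sketch below is one possible shape for the tracker, in Python. The class names, the `overdue` check, and its default threshold of two passes are assumptions made for illustration, not part of the skill as written.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Hypothesis:
    hid: str
    text: str
    opened_pass: int
    resolution: Optional[str] = None  # None means the hypothesis is still open


@dataclass
class HypothesisTracker:
    entries: Dict[str, Hypothesis] = field(default_factory=dict)

    def open(self, hid: str, text: str, current_pass: int) -> None:
        self.entries[hid] = Hypothesis(hid, text, current_pass)

    def confirm(self, hid: str, finding_id: str) -> None:
        if not finding_id.strip():
            raise ValueError("confirmation needs a finding ID")
        self.entries[hid].resolution = f"confirmed -> finding ID {finding_id}"

    def rule_out(self, hid: str, reasoning: str) -> None:
        # Refuse to close an entry without an explicit reason.
        if not reasoning.strip():
            raise ValueError("ruling out needs an explicit reason")
        self.entries[hid].resolution = f"ruled out -> reasoning: {reasoning}"

    def overdue(self, current_pass: int, max_open_passes: int = 2) -> List[Hypothesis]:
        """Open hypotheses that have survived too many passes unresolved."""
        return [
            h for h in self.entries.values()
            if h.resolution is None and current_pass - h.opened_pass >= max_open_passes
        ]


tracker = HypothesisTracker()
tracker.open("H1", "restore code may be predictable", current_pass=1)
print([h.hid for h in tracker.overdue(current_pass=5)])
```

Running `overdue` at the start of every pass turns a forgotten note into a blocking item.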

Rule E  (Anchor on the original signal)
  If there is any prior signal that "a bug exists here" (incident
  report, customer ticket, vendor security advisory), every pass
  ends with the question: "Does any finding from this pass
  PRECISELY explain that signal?" Answers like "maybe / could be /
  if combined with X" mean NO — keep searching.

Lessons that generalise

If you're auditing similar codebases, these are the parts worth taking with you:

  1. Boring code hides the worst bugs. Auto-generated constructors. Utility classes. Default-value plumbing. Anything named *Utils, *Helper, *Kt. These are where review fatigue lands first, and where centralised primitives live — so a weakness in one of them multiplies across every caller.

  2. Name your blind spots out loud each pass. "I'm not looking for weak RNG today" is a sentence worth saying. Then schedule the next pass to look for exactly that.

  3. Don't trust any single tool as authoritative. If one tool gave you a list of 10,000 things, ask "which ten are NOT in this list?" before you trust it as complete.

  4. Expensive-to-verify is not the same as low-likelihood. Build the math. A few hours of LCG state-recovery code is cheap compared to a missed Critical report.

  5. Track every hypothesis to closure. Either confirm it or rule it out with an explicit reason. Append-only status docs bury leads.

  6. Don't stop searching at the first big finding. Same code path, same auth boundary, same surface — different bug class is still possible. Two Critical primitives can coexist on the same default install.
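
To give a sense of scale for lesson 4, here is the textbook version of LCG state recovery, sketched in Python. It assumes the generator is `java.util.Random` and that the attacker sees two consecutive full `nextInt()` outputs. That is the simplest case and not the attack on this target: a token built with `randomAlphanumeric` leaks less per character and needs more work.

```python
from typing import List

# Constants of the java.util.Random linear congruential generator.
MULTIPLIER = 0x5DEECE66D
INCREMENT = 0xB
MASK = (1 << 48) - 1


def next_state(state: int) -> int:
    """Advance the 48-bit internal state by one step."""
    return (state * MULTIPLIER + INCREMENT) & MASK


def to_int32(state: int) -> int:
    """Top 32 bits of the state as a signed Java int (what nextInt() returns)."""
    value = state >> 16
    return value - (1 << 32) if value >= (1 << 31) else value


def recover_states(first: int, second: int) -> List[int]:
    """Return each state that outputs `second` right after a state that output `first`."""
    upper = (first & 0xFFFFFFFF) << 16
    candidates = []
    # Only the low 16 bits of the state are hidden, so brute-force them.
    for low in range(1 << 16):
        following = next_state(upper | low)
        if to_int32(following) == second:
            candidates.append(following)
    return candidates


def predict_next(first: int, second: int) -> List[int]:
    """Predict the third nextInt() value from the first two."""
    return [to_int32(next_state(state)) for state in recover_states(first, second)]
```

The search space is 65,536 candidates, which is why "not testable with a single curl" should not be read as "not worth testing".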

A note on AI's role in this miss

It would be easy to say "Claude missed the bug." That isn't honest. I missed it. Claude reviewed the code using rules I wrote. The rules didn't include weak-RNG-to-token tainting. The rules didn't require auto-generated constructor inspection. The rules didn't track open hypotheses.

The same blind spots would have applied if I had done this entirely by hand; I would just have made all five misses more slowly. AI accelerated my coverage and inherited my blind spots one-to-one. The fix is in the rules I encode into the skill, not in the model.


10. Endpoint Mapping with noir — Foundation, but Not the Whole Story

When you're hunting in a closed-source codebase, the first question is always: what's the attack surface? Before you can pick a target endpoint to study, you need to know which endpoints even exist.

You can answer that by hand — open the decompiled tree in an IDE, grep for @Path, @GET, @POST annotations across hundreds of JARs, take notes. It takes days. Or you can run noir and have the answer in minutes.

What noir does

noir is an open-source attack-surface discovery tool that reads source code and emits a list of every endpoint it finds. It supports 60+ frameworks out of the box — JAX-RS, Spring, Ktor, Flask, Express, Rails, Gin, ASP.NET, and more. For a decompiled Java/Kotlin target, it parses the annotations on every handler method and produces a row per endpoint.

$ noir -b <decompiled-source-dir> -f tsv -o all_endpoints.tsv

Five minutes later for the target I was working on: 10,019 endpoints, indexed in a single TSV.

$ head -3 all_endpoints.tsv
method  path                    sources  note
POST    /absorb                 noir     ./.../ProjectTeamResource.java
GET     /accessview/...         noir     ./.../AccessviewResource.java

That output becomes the foundation for everything else: filtering for admin paths, cross-referencing against authentication filters, prioritising what to read first, even diffing endpoint signatures between versions of the target.
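
As a small illustration of the filtering step, the Python sketch below loads a TSV with the four columns shown above and keeps rows whose path contains a given fragment. The function names and the sample rows are hypothetical; only the column header comes from the output shown.

```python
import csv
import io
from typing import Dict, Iterable, List


def load_endpoints(tsv_text: str) -> List[Dict[str, str]]:
    """Parse tab-separated endpoint rows keyed by the header line."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def filter_by_fragment(rows: List[Dict[str, str]], fragments: Iterable[str]) -> List[Dict[str, str]]:
    """Keep rows whose path contains any of the given fragments."""
    wanted = list(fragments)
    return [row for row in rows if any(f in row["path"] for f in wanted)]


# Hypothetical rows for illustration.
sample = (
    "method\tpath\tsources\tnote\n"
    "POST\t/absorb\tnoir\tA.java\n"
    "GET\t/admin/users\tnoir\tB.java\n"
)
rows = load_endpoints(sample)
print([row["path"] for row in filter_by_fragment(rows, ["/admin"])])
```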

Before vs after noir

Before noir (manual Phase 0):
   1. Open the decompiled tree in an IDE
   2. Search for "@Path", read context, manually note method+path
   3. Forget which class you were in by file 50
   4. Give up and pick a random module to focus on

After noir:
   1. Run noir → get 10,000+ endpoints in a TSV
   2. Filter for /admin, /oauth2, /api/v2, etc.
   3. Cross-reference against the auth-filter source
   4. Prioritise the 50 endpoints that matter most

For a target with 880 JARs and 125,000 decompiled files, noir turned a multi-day reconnaissance effort into a 5-minute query. It's the closest thing to a Burp Suite SiteMap you can get for a target you don't have HTTP access to yet.

Concrete value during this audit

Question                               | Without noir             | With noir
---------------------------------------+--------------------------+------------------------------------------------
How big is the surface?                | "I don't know — a lot."  | 10,019 endpoints.
What endpoints touch admin?            | Grep, read, guess.       | grep '/admin' all_endpoints.tsv → exact number.
Did this version change anything?      | Compare classes by hand. | diff endpoints-old.tsv endpoints-new.tsv.
Is endpoint X reachable?               | Read the routing code.   | Check the TSV.
What should I show Claude as context?  | A vague description.     | The full TSV as an attached file.

One concrete win: I diffed the endpoint signatures between the version I was auditing and the latest released version. Zero changes in the area where my bug lived. That single diff gave the vendor disclosure report a hard claim: the bug is still present in the latest build.
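
A version diff like this can be sketched in Python by comparing (method, path) pairs as sets, which ignores row order and the file-path column. The function names and sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, List, Set, Tuple

Endpoint = Tuple[str, str]


def endpoint_set(tsv_text: str) -> Set[Endpoint]:
    """Reduce a TSV to its set of (method, path) signatures."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {(row["method"], row["path"]) for row in reader}


def diff_endpoints(old_tsv: str, new_tsv: str) -> Dict[str, List[Endpoint]]:
    """Report signatures that appear in only one of the two versions."""
    old, new = endpoint_set(old_tsv), endpoint_set(new_tsv)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


# Hypothetical rows for illustration.
header = "method\tpath\tsources\tnote\n"
old_version = header + "POST\t/password\tnoir\tA.java\n"
new_version = header + "POST\t/password\tnoir\tA.java\nGET\t/health\tnoir\tB.java\n"
print(diff_endpoints(old_version, new_version))
```

An empty diff shows the route is still exposed with the same signature. It says nothing about the handler body, so the decompiled method in the newer build still deserves a direct look.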

Where noir alone isn't enough

This is the part most write-ups skip. noir is excellent at what it does, but treating its output as the complete attack surface is exactly what cost me time — and missed bugs — on this target. Three specific gaps:

┌────────────────────────────────────────────────────────────────────┐
│  Gap 1 — Endpoints declared outside annotations                    │
├────────────────────────────────────────────────────────────────────┤
│  noir reads JAX-RS / Spring annotations in source files. It does  │
│  NOT read WEB-INF/web.xml, where Java web apps can declare        │
│  servlet mounts directly.                                          │
│                                                                    │
│  One critical bug I worked on lived behind a servlet declared in  │
│  web.xml, not as a JAX-RS annotation. noir's TSV had zero rows    │
│  for it. I trusted the TSV as authoritative and the endpoint      │
│  stayed invisible to me for three audit passes.                    │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│  Gap 2 — Path fragments without parent context                     │
├────────────────────────────────────────────────────────────────────┤
│  In JAX-RS, an endpoint's reachable URL is the concatenation of   │
│  application path + parent resource path + method-level path.     │
│                                                                    │
│  noir lists the method-level fragment but doesn't always walk     │
│  upward to stitch the full URL. A handler annotated only with     │
│  "/password" appeared in the TSV as "POST /password" — with no    │
│  way to tell that the full reachable URL was actually a much      │
│  longer path under "/api/.../restore/password".                    │
│                                                                    │
│  This is exactly how I delayed my discovery of one of the         │
│  critical bugs described in section 9 — the endpoint was visible  │
│  in my catalog but its true reachability wasn't.                   │
└────────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────────┐
│  Gap 3 — Routes created at runtime                                 │
├────────────────────────────────────────────────────────────────────┤
│  Some frameworks dispatch sub-resources through methods that      │
│  return a child resource instance at runtime. The dispatch is in  │
│  code, not in annotations.                                         │
│                                                                    │
│  noir sees the dispatcher method but not the children it serves.  │
│  For this kind of routing I had to write a small framework-aware  │
│  expander that walks the dispatchers recursively and re-emits     │
│  the full set of expanded endpoints. For this target, that pass   │
│  alone added ~3,000 endpoints that noir didn't see.                │
└────────────────────────────────────────────────────────────────────┘

The Phase 0 I actually use now

Phase 0 attack-surface inventory =
   noir TSV                                      ← fast, annotation-based
 + every web.xml in the app, fully enumerated    ← catches Gap 1
 + Full-path stitching against parent resources  ← catches Gap 2
 + Framework-aware sub-resource expander         ← catches Gap 3
 + Grep for raw HttpServlet / Filter classes     ← catches custom routers
 ─────────────────────────────────────────────
 = a catalog you can actually trust as complete
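
For the web.xml line of that recipe, a minimal Python sketch using only the standard library is shown below. It lists servlet mappings and ignores `filter` and `filter-mapping` elements, which would need the same treatment. The function names and the sample descriptor are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree adds to namespaced tags."""
    return tag.rsplit("}", 1)[-1]


def child_texts(element: ET.Element, name: str) -> List[str]:
    return [(child.text or "").strip() for child in element if local_name(child.tag) == name]


def servlet_mappings(web_xml: str) -> List[Tuple[str, str, str]]:
    """Return (url_pattern, servlet_name, servlet_class) for every mapping."""
    root = ET.fromstring(web_xml)
    classes: Dict[str, str] = {}
    pairs: List[Tuple[str, str]] = []
    for element in root:
        kind = local_name(element.tag)
        names = child_texts(element, "servlet-name")
        if not names:
            continue
        if kind == "servlet":
            declared = child_texts(element, "servlet-class")
            classes[names[0]] = declared[0] if declared else ""
        elif kind == "servlet-mapping":
            # One mapping element may carry several url-pattern children.
            for pattern in child_texts(element, "url-pattern"):
                pairs.append((pattern, names[0]))
    return [(pattern, name, classes.get(name, "")) for pattern, name in pairs]


# Hypothetical descriptor for illustration.
sample_xml = """<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee" version="3.1">
  <servlet>
    <servlet-name>legacy</servlet-name>
    <servlet-class>com.example.LegacyServlet</servlet-class>
  </servlet>
  <servlet-mapping>
    <servlet-name>legacy</servlet-name>
    <url-pattern>/legacy/*</url-pattern>
  </servlet-mapping>
</web-app>"""
print(servlet_mappings(sample_xml))
```

Appending these rows to the noir TSV, tagged with a different source, keeps the origin of each endpoint visible during triage.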

For this target the difference was roughly 10,000 endpoints (noir alone) versus 13,000 with all the extra passes: about 30% more surface, which quietly held two Critical bugs.

Practical tips

✅  DO
   • Re-run noir on every new version of the target — diff the TSV
   • Pipe the output through sort/uniq for human readability
   • Attach the TSV as context when prompting Claude about coverage
   • Add a sanity check:
     `grep -r --include='*.java' '@Path' <decompiled-source-dir> | wc -l`
     noir's row count should be at least the raw annotation count

❌  DON'T
   • Treat noir's TSV as the complete attack surface
   • Skip web.xml just because noir ran
   • Skip sub-resource expansion just because noir ran
   • Trust noir alone for non-annotation routers (Ktor DSL,
     Express middleware, anything reflection-driven)

The one-line takeaway: noir is the foundation, not the whole story. The bugs hide in the parts noir can't see — go find them.


11. Conclusion

AI-assisted security research on closed-source software is not a future possibility — it is happening now. With the right methodology, tools like Claude can dramatically accelerate vulnerability discovery while maintaining the quality and accuracy that responsible disclosure demands.

But — and this is crucial — AI will lie to you with complete confidence. It will fabricate exploit output, take shortcuts through Docker, and present hallucinated results as verified facts. The human researcher is not optional. You are the verification layer. You are the bullshit detector. Without you, AI findings are just well-formatted guesses.

And — equally crucial, and the harder lesson from the 5-pass miss — AI inherits your blind spots one-to-one. If your skill doesn't have a rule for "weak RNG used as a security token", the model will scan past UNSECURE_RANDOM for five passes in a row without flagging it. If your endpoint inventory comes only from one tool, the model won't audit the endpoints declared elsewhere. The intelligence in the loop is the rules YOU wrote, not the model. Every miss you survive becomes a rule the next researcher running your skills will not have to learn the hard way.

The results speak for themselves: real bugs, confirmed vulnerabilities, responsible reports filed. But only because a human was watching every step, questioning every result, demanding real proof, and rewriting the rules every time the audit missed something it should have caught.