Skip to main content

Command Palette

Search for a command to run...

Reasoning-First vulnerability research: How I built an AI Agent that found multiples bugs in Open Source project

Updated
11 min read
Reasoning-First vulnerability research:  How I built an AI Agent that found multiples bugs in Open Source project

Subtitle: A practical look at building an AI-assisted vulnerability research workflow that reasons through code, traces trust boundaries, and helps discover real security issues responsibly on multiple open source projects (MLflow, OpenEMR, JetBrains/kotlin-web-site, fasten-onprem, teletraan,...)


Introduction

Last month, I have been building a Reasoning-First vulnerability research agent: an AI-assisted system designed to help security researchers understand unfamiliar codebases, form hypotheses, trace data flow, and identify security-relevant logic flaws.

The goal was not to create a tool that blindly scans repositories and floods maintainers with noisy reports.

Instead, I wanted to build an agent that behaves more like a careful human researcher:

  • research project,

  • reading code,

  • asking why something exists,

  • connecting small clues,

  • checking assumptions,

  • and turning uncertainty into focused investigation.

During testing across multiple open source projects, the agent found several vulnerabilities, including issues that were assessed as high and critical severity.

This post explains the motivation, the research workflow, what made the approach useful, and the lessons I learned from reporting vulnerabilities in real projects.

Disclosure note: This article intentionally avoids exploit-ready technical details. Vulnerabilities mentioned here were handled through responsible disclosure, and technical specifics should only be shared after maintainers have had time to patch and publish advisories.


Why “Reasoning-First” Matters

Most automated security tools are good at pattern matching.

They can find known dangerous functions, suspicious dependencies, exposed secrets, or common misconfigurations. That is useful, but many serious vulnerabilities do not look obvious from a single line of code.

Some bugs require understanding questions like:

  • Who controls this input?

  • Which trust boundary does this data cross?

  • What assumptions does this function make?

  • Can this state be reached by an untrusted user?

  • Does this authorization check protect every path, or only the common path?

  • What happens when two safe-looking features interact in an unexpected way?

That is where a reasoning-first approach becomes valuable.

So I built a coordinated-disclosure workflow that does the opposite of the usual pipeline: it uses no static analysis at all. No SAST, no fuzzing.

The agent finds vulnerabilities the way a good human researcher does — by building a mental model of the system and reasoning about how its trust assumptions can be broken.


The Core Idea

The workflow is a five-phase pipeline, each phase feeding an artifact into the next:

  1. Scope & setup — confirm the project actually welcomes reports, stand up an isolated build, and map the attack surface.

  2. Discovery — find real flaws by manual, hypothesis-driven reasoning. This is the heart of it.

  3. Triage — kill the false positives, confirm survivors with a minimal proof-of-concept, score them, and cluster by root cause.

  4. Private report — write a confidential advisory and route it through the right channel.

  5. Coordinated fix — co-develop the patch with maintainers in private, then disclose on release. Wrapped around all five is a set of non-negotiable rules of engagement: only test authorized targets, only test in a sandbox you control, keep everything private until a fix ships, build the minimum proof needed, report only what you can demonstrate, and let the maintainer control the disclosure timeline. These aren't decoration. As we'll see, the interesting engineering is in making them binding rather than aspirational.


My Agent Workflow

Why reason instead of scan

The case against leaning on a scanner is simple: it's a flashlight, not a detective. It illuminates the patterns it already knows and leaves everything else dark.

A reasoning agent inverts the relationship. Instead of starting from "here's a suspicious line, is it exploitable?", it starts from "here's what this code promises to do; how could an attacker make it break that promise?" The core loop looks like this:

MODEL       → build a threat model from intent vs. implementation
HYPOTHESIZE → "attacker controls X, violates assumption Y, achieves Z"
INVESTIGATE → trace the dataflow and read the code by hand
CONFIRM     → build a minimal local PoC; real, or discard
RECORD      → log the finding, or log the killed hypothesis and why

Tools never set the agenda; a search through the code serves a hypothesis the agent already holds. The agent reads whole modules — auth, parsing, tenant isolation — rather than sampling, and it traces a tainted input across files and callbacks the way a person would, asking at each sanitizer not "is there a check?" but "is this check correct, complete, and applied on every path?"

A concrete trace makes this tangible. Suppose a handler joins a base directory with a request-supplied filename and reads the file. The hypothesis writes itself: an attacker controls the filename and escapes the directory to read arbitrary files. Tracing backward, you find a check that rejects the literal string .. — but does it run before or after URL-decoding? Does it catch ..%2f, or an absolute path, or a symlink? If the decode happens after the check, the check is theater. That's an inconsistent-parsing bug, and no rule in a stock ruleset would have found it, because the code "validates input." The agent found it by being suspicious of the validation, not the input.

The failure mode that nearly kills the idea

Here's the catch nobody mentions when they romanticize "the AI just reasons about it": remove the scanner and you remove the thing that forced a real line of code into evidence. An LLM with no tool to ground it will confidently assert that an ownership check is missing from code it only skimmed, or declare a path reachable it never actually traced. Reasoning-first discovery has exactly one dominant failure mode, and it's confabulation — plausible-sounding fiction that wastes a maintainer's time and burns your credibility.

So the workflow's backbone isn't the clever hunting heuristics. It's evidence discipline:

  • Every claim in a finding cites a specific file:line the agent actually read. Not "the handler probably doesn't check ownership" — "the handler at orders.py:142 loads by primary key with no ownership check."

  • Absence requires a search, not an assumption. Before claiming a mitigation is missing, grep the whole path for it and confirm it's truly gone — not merely absent from the one function you happened to read.

  • Running beats reasoning. "I built the PoC and observed the data returned" outranks "this should work." Anything unrun is, at most, plausible-unverified — and the workflow forbids rounding that up to confirmed.

  • Killing a hypothesis is a success, not a failure. That last point gets its own worked example in the playbook: a candidate XSS that the agent is supposed to kill. It traces the user-controlled field into the template, reads the engine config, finds autoescaping is on and no raw filter is applied, and records the hypothesis as dead — with the evidence. The lesson baked into the example is to go in trying to disprove your own idea. An agent that goes in trying to confirm will stop at "user data in a template" and file a false positive. The discipline of disconfirmation is what makes a scanner-free workflow trustworthy at all.

Aim for impact, and don't stop at the first kill

Two more behaviors separate a useful agent from a noisy one.

First, impact is a gate on effort, not a label applied afterward. Reasoning budget is finite, so before investing in a hypothesis the agent asks "what's the realistic worst case if I'm right?" Remote code execution, authentication bypass, privilege escalation, an IDOR exposing other users' data at scale — those get the deep dig. Self-XSS, a missing security header, a verbose error message — those get one line in the case file and a shrug, unless they chain into something serious. The agent judges by the end consequence of the whole chain, not the severity of the weakest link.

Second, a finding is a checkpoint, not a finish line. The naive failure is to find one critical bug and declare victory. But one missing ownership check is rarely alone — it's a signal about where the developer's assumptions are soft. So the agent sweeps for siblings of the bug it just found (same root cause, more instances), then keeps working the rest of the attack surface. You hand the maintainer one complete report, not a drip-feed of criticals that each restart the disclosure clock. The single exception: a live, currently-valid leaked credential or a bug that looks exploited in the wild gets reported immediately, because that's an active incident, not a 90-day embargo.

Making the rules binding, not aspirational

A workflow that runs untrusted code and handles unfixed vulnerabilities can't rely on the agent simply intending to behave. The guardrails have to be enforced by the environment.

The isolation model is a disposable VM as the boundary, Docker as the engine inside it. Snapshot the VM so you can reset between PoC attempts.

"Private by default" gets the same treatment. When the agent runs inside a coding tool, the tooling's own permission system forbids the commands that would leak an unfixed bug — opening a public PR or issue, pushing to a public remote — and prompts a human before anything crosses the sandbox. The rule isn't a sentence in a prompt the model might forget; it's a gate the model can't open.


Operationalizing it for an agent

The workflow is just markdown — tool-agnostic by design — so porting it to a specific agent is a matter of wiring, not rewriting.

On Claude Code, the rules of engagement live in CLAUDE.md (always-loaded context), each phase becomes an invocable skill, the permission settings and a pre-tool hook enforce "nothing public," and triage runs as an isolated subagent — a fresh context that can't be biased by the discovery session's enthusiasm to confirm its own finds.

On Codex, the same content maps onto AGENTS.md, a config.toml, and execpolicy rules. Codex's sandbox is actually a cleaner fit: its workspace keeps network access off by default, so the egress boundary is enforced by the runtime rather than a script, and the "no public push" backstop is expressed as forbidden command prefixes. The isolated triage review, which Claude Code does with a subagent, becomes a separate read-only codex exec invocation. Same design, different primitives.

The common thread across both: the manuals tell the agent what to do, the per-phase tool scoping limits what it can reach, and the sandbox plus command policy is the last line that catches a dangerous action if the first two somehow fail. Defense in depth — the same principle the workflow applies to the target, applied to the agent running it.


Projects Reviewed

The higher-impact findings came from reviewing several real-world open source projects:

Project Findings
MLflow 1 High-severity vulnerability
OpenEMR 1 High-severity vulnerability
FastenEMR 1 Critical vulnerability and 1 High-severity vulnerability
kotlin-web-site by JetBrains 1 high-severity vulnerability
teletraan by Pinterest 1 high-severity vulnerability

I also tested the agent on other open source repositories. Some of those reviews produced valid findings too, but most were lower-impact issues.

I think this distinction is important. Not every bug needs to be presented as critical, and not every finding has the same security impact. A useful vulnerability research workflow should help separate serious risks from minor issues.


Conclusion

With the right workflow, it can become a research partner that helps uncover deeper logic bugs and improve the security of real-world projects.

There is still a lot to improve, especially optimized the agent for big size project.

But the early results are promising.

For me, the most exciting part is not just that the agent found vulnerabilities.

It is that it helped turn vulnerability research into a more structured, explainable, and repeatable process.


Disclosure Timeline

Date Project Disclosure Channel Current Status
30 Apr 2026 FastenEMR GitHub Security Advisory Reported
05 May 2026 OpenEMR GitHub Security Advisory Reported
23 May 2026 kotlin-web-site by JetBrains GitHub Security Advisory Reported
23 May 2026 teletraan Bugcrowd Reported
24 May 2026 MLflow GitHub Security Advisory Reported
V
Varsha4h ago

This is a good reminder that security research is not just tool output. The reasoning part matters most: understanding the flow, asking what can break, and following weak signals until they turn into real bugs.

S

This is a useful direction because AI assisted vulnerability research gets much more interesting when it is reasoning through trust boundaries, not just pattern matching for suspicious code.

A lot of tools can flag risky functions or known bug patterns. The harder part is understanding how data moves, where assumptions break, what input becomes trusted, and whether a bug is actually exploitable.

I also like the emphasis on responsible discovery. If AI agents make bug finding faster, the next bottleneck becomes validation, triage, disclosure, and giving maintainers reports they can actually act on.

The strongest AI security workflows probably won’t be fully autonomous. They’ll be researcher guided systems that help trace paths, test hypotheses, reduce manual review time, and document findings clearly.