Introducing EVMbench

Open source crypto contracts routinely hold more than $100B in assets. As LLMs rapidly improve at finding exploits, it is important that we have visibility into, and influence over, the risks they could create for crypto.

Together with OpenAI, we built EVMbench to measure those capabilities. EVMbench is an open evaluation framework that tests AI agents on three tasks: detecting, patching, and exploiting vulnerabilities.

The benchmark uses real vulnerabilities from open code audits as well as custom tasks from unreleased contracts. Each task runs in its own container so agents operate in realistic environments. We include an “answer key” for each task to verify that the benchmark itself is solvable.
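To illustrate the idea behind the answer key, here is a minimal sketch of a solvability check. It is not EVMbench's actual harness: the `Task` class, its fields, and `unsolvable_tasks` are hypothetical names, and the toy string graders stand in for whatever containerized grading a real task would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are assumptions for illustration."""
    task_id: str
    mode: str                     # "detect", "patch", or "exploit"
    grade: Callable[[str], bool]  # stand-in grader: True if a submission passes
    answer_key: str               # reference submission shipped with the task


def unsolvable_tasks(tasks: List[Task]) -> List[str]:
    """Return IDs of tasks whose own answer key fails the task's grader."""
    return [t.task_id for t in tasks if not t.grade(t.answer_key)]


if __name__ == "__main__":
    # Toy graders compare strings; a real grader would run code in a container.
    tasks = [
        Task("example-exploit", "exploit", lambda s: s == "drain()", "drain()"),
        Task("example-patch", "patch", lambda s: s == "add-check", "no-op"),
    ]
    print(unsolvable_tasks(tasks))
```

Running every answer key through its own grader before evaluating any agent separates "the agent failed" from "the task was broken", which is the property the answer key is meant to guarantee.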

We’ve also extended the benchmark harness into an auditing agent that can be found at https://paradigm.xyz/evmbench.

When we started working on this project, top models could exploit fewer than 20% of the critical, fund-draining Code4rena bugs. Today, GPT-5.3-Codex exploits over 70%. The rate of improvement is incredible.

It’s now clear to us that a growing portion of audits will be done by agents. Hopefully this benchmark, harness, and agent serve as both a preview of and an accelerant towards that future.

(Also, thank you to OtterSec for significant support with implementing the frontend!)

For more details, read OpenAI's research summary and our joint academic paper.