OpenAI Develops Benchmark for AI Agents to Tackle Smart Contract Security Challenges

OpenAI is stepping up efforts to evaluate how AI agents perform in economically significant environments as their use becomes more widespread. The company has introduced a benchmark that assesses how well AI models can identify, fix, and potentially exploit security flaws in cryptocurrency smart contracts.

The benchmark, described in a paper titled “EVMbench: Evaluating AI Agents on Smart Contract Security,” was released in collaboration with Paradigm, a crypto investment firm, and OtterSec, a crypto security company. It tests AI agents against 120 smart contract vulnerabilities to gauge how far they could, in theory, exploit them.

Among the AI models tested, Anthropic’s Claude Opus 4.6 was the top performer, with an average “detect award” of $37,824. It was followed by OpenAI’s OC-GPT-5.2 at $31,623 and Google’s Gemini 3 Pro at $25,112. The results point to the growing competence of AI agents at foundational tasks, and OpenAI emphasized the importance of evaluating their capabilities in environments where significant economic value is at stake.

OpenAI noted that smart contracts secure billions of dollars in assets and said it expects AI agents to be transformative for both attackers and defenders. The company also predicts growth in agentic stablecoin payments, which it said would help ground AI agents in a domain of emerging practical importance.

Circle CEO Jeremy Allaire recently forecast that within five years, billions of AI agents will be using stablecoins to make routine payments on behalf of users. Similarly, former Binance CEO Changpeng “CZ” Zhao has suggested that cryptocurrency could become the “native currency for AI agents.”

The need to measure how well AI can find security vulnerabilities is pressing: attackers stole $3.4 billion in crypto funds in 2025, a slight increase from the previous year. EVMbench draws its 120 vulnerabilities from 40 smart contract audits, many of them sourced from open-source audit competitions. OpenAI hopes the benchmark will help track AI progress in detecting and addressing smart contract vulnerabilities.

In related commentary, Haseeb Qureshi, managing partner at Dragonfly, said crypto’s initial promise of replacing property rights and legal contracts has not come to fruition because the technology was not designed for human intuition. Qureshi pointed out that large crypto transactions remain daunting because of threats such as drainer wallets, whereas bank transfers feel relatively secure.

Qureshi envisions a future in which AI-intermediated, self-driving wallets handle these threats and manage complex operations for users. He compared this potential evolution to the way GPS and smartphones, or TCP/IP and browsers, complemented each other, suggesting that AI agents might be the missing piece for crypto’s broader adoption.