The Benchmark Is the Vulnerability: How AI Agents Are Being Tested to Attack the Real Web
CVE-Bench and CAIBench reveal a troubling gap in how AI benchmarks measure offensive cybersecurity capability, with consequences for every enterprise running LLM agents.