The Open-Source AI Stack

Grants · Project grant · Global

Reward Hacking Benchmark

Open benchmark that systematically tests LLM agents for reward hacking and deceptive behavior, providing a public-good evaluation rather than relying on lab-internal red teams.

Cosmos Institute awarded Kunvar Thaman a grant on August 15, 2025 for the Reward Hacking Benchmark, with funding in the $1K to $10K range plus compute. The grant came through the inaugural AI x Truth-Seeking cohort of 27 winners, co-funded by Cosmos Institute and FIRE under a $1M program.

Thaman is an independent AI safety researcher based in Chandigarh, India. He previously published "Finding Deception in Language Models" on LessWrong, which situates his work within the AI safety community's literature on deception and scheming.

The Reward Hacking Benchmark (RHB), described in arXiv 2605.02964, is a suite of multi-step tasks that require sequential tool operations and offer naturalistic shortcut opportunities: skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB-v1 contains five code-optimization tasks (sorting, regex, compression, matrix multiplication, prime checking), each run in both a deliberately vulnerable sandbox and a hardened comparison sandbox. Models interact through a structured harness that accepts actions in fields such as "bash", "cot", and "code" and executes them in a loop. Across 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek, exploit rates ranged from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero). A controlled sibling comparison between DeepSeek-V3 and DeepSeek-R1-Zero showed that RL post-training is associated with substantially higher reward-hacking rates.
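
As a rough illustration of the harness loop described above, the following Python sketch dispatches structured actions in a loop and tallies an exploit rate over flagged episodes. It is not the RHB implementation: the JSON action shape, the "submit" action, and the names run_episode, exploit_rate, scripted_policy, and demo_handlers are assumptions made for this example, and the handlers are stubs that execute nothing.

```python
import json
from typing import Callable, Dict, List

Handler = Callable[[str], str]


def run_episode(policy: Callable[[str], str],
                handlers: Dict[str, Handler],
                max_steps: int = 10) -> List[dict]:
    """Ask the policy for a JSON action, dispatch it, and feed back the observation."""
    transcript: List[dict] = []
    observation = "task start"
    for _ in range(max_steps):
        action = json.loads(policy(observation))
        kind, content = action["action"], action.get("content", "")
        if kind == "submit":  # hypothetical terminating action, not taken from the paper
            transcript.append({"action": kind, "content": content})
            break
        handler = handlers.get(kind)
        observation = handler(content) if handler else f"unknown action: {kind}"
        transcript.append({"action": kind, "content": content,
                           "observation": observation})
    return transcript


def exploit_rate(flags: List[bool]) -> float:
    """Fraction of episodes flagged as exploits; 0.0 when there are no episodes."""
    return sum(flags) / len(flags) if flags else 0.0


def scripted_policy(steps: List[dict]) -> Callable[[str], str]:
    """Stand-in for a model: replays a fixed list of actions as JSON strings."""
    it = iter(steps)
    return lambda _observation: json.dumps(next(it))


# Stub handlers for the three action fields named in the text.
# A real harness would run these inside a sandbox; here nothing is executed.
demo_handlers: Dict[str, Handler] = {
    "cot": lambda text: "noted",
    "bash": lambda cmd: f"(stub) would run: {cmd}",
    "code": lambda src: f"(stub) received {len(src)} chars of code",
}

if __name__ == "__main__":
    steps = [
        {"action": "cot", "content": "plan the optimization"},
        {"action": "bash", "content": "ls"},
        {"action": "submit", "content": "done"},
    ]
    for entry in run_episode(scripted_policy(steps), demo_handlers):
        print(entry)
    print(exploit_rate([True, False, False, False]))
```

Keeping the handlers in a plain dictionary means the same loop could, in principle, be pointed at either a vulnerable or a hardened sandbox by swapping the handler set, which mirrors the paired-sandbox comparison only in spirit.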

Within the cohort, the project sits in the evaluation and safety-guardrails layers, paired thematically with Steven Molotnikov's Perplex as model-facing probes of closed systems. The lock-in vector is the same one Perplex addresses: public benchmarks for reward-hacking behavior are scarce because most reported results come from frontier labs' own scaffolding, which limits independent comparison across providers.

Recipient

Kunvar Thaman

Funder

Cosmos Institute · foundation · US

Backs philosopher-builders making prototypes, essays, and projects at the intersection of AI and human flourishing, with emphasis on reason, decentralization, and individual autonomy.

Primary source

https://blog.cosmos-institute.org/p/introducing-the-first-cohort-of-ai
