Cosmos Institute granted Kunvar Thaman on August 15, 2025 for the Reward Hacking Benchmark, with funding in the $1K to $10K range plus compute. The grant came through the inaugural AI x Truth-Seeking cohort of 27 winners co-funded by Cosmos Institute and FIRE under a $1M program.
Thaman is an independent AI safety researcher based in Chandigarh, India. He previously published "Finding Deception in Language Models" on LessWrong, situating his work inside the AI safety community's deception and scheming literature.
The Reward Hacking Benchmark (RHB), described in arXiv 2605.02964, is a suite of multi-step tasks requiring sequential tool operations with naturalistic shortcut opportunities: skipping verification steps, inferring answers from task-adjacent metadata, or tampering with evaluation-relevant functions. RHB-v1 contains five code-optimization tasks (sorting, regex, compression, matrix multiplication, prime checking) tested in a deliberately vulnerable sandbox and a hardened comparison sandbox. Models communicate through a structured harness with fields for "bash", "cot", "code", and other actions, executed in a loop. Across 13 frontier models from OpenAI, Anthropic, Google, and DeepSeek, exploit rates ranged from 0% (Claude Sonnet 4.5) to 13.9% (DeepSeek-R1-Zero), with a controlled sibling comparison between DeepSeek-V3 and DeepSeek-R1-Zero showing that RL post-training is associated with substantially higher reward-hacking rates.
Within the cohort the project sits in the evaluation and safety-guardrails layers, paired thematically with Steven Molotnikov's Perplex as model-facing probes of closed systems. The lock-in vector is the same one Perplex addresses: public benchmarks for reward-hacking behavior are scarce, since most reported results come from frontier labs' own scaffolding, which limits independent comparison across providers.
Recipient
Kunvar Thaman
Funder
Cosmos Institute · foundation · US
Backs philosopher-builders making prototypes, essays, and projects at the intersection of AI and human flourishing, with emphasis on reason, decentralization, and individual autonomy.
Primary source
https://blog.cosmos-institute.org/p/introducing-the-first-cohort-of-ai
Additional sources
More from Cosmos Institute
- Institute for Decentralized AI launch 2025-09 · 5 fully funded research positions
New institute combining research and infrastructure to support AI that operates without centralized gatekeepers. Initial cohort of five fully funded positions across Oxford and Stanford.
- Metalens 2025-08-15 · $1K-$10K plus compute
Open-source scientific research platform combining AI summarization with structured human checks, intended to help researchers see beyond hype and identify reliable claims in dense literature.
- Perplex 2025-08-15 · $1K-$10K plus compute
Tooling for surfacing hidden goals in closed AI systems by probing them with open-weight reference models. Aims to give independent reviewers a way to audit proprietary deployments.
- Policy Explorer 2025-08-15 · $1K-$10K plus compute
AI tool for analyzing the impacts and assumptions embedded in regulatory and policy proposals, geared toward helping civil-society analysts surface what a rule actually does.
- Argument Debugger 2025-08-15 · $1K-$10K plus compute
AI assistant that finds gaps in chains of reasoning and suggests repairs, framed as a tool for better arguments rather than persuasive ones. Built by a Harvard CS faculty member.