news

2025.12.11 🍫 Introducing CocoaBench, a framework for evaluating general agents’ compositional cognitive abilities.
2025.10.15 🏟️ Check out BigCodeArena, a human-in-the-loop platform for evaluating code through execution.
2025.09.20 🧙🏻 Guru, our exploration of cross-domain RL for LLM reasoning, is accepted to NeurIPS 2025!
2025.06.20 🧙🏻 Check out Guru: how cross-domain RL supercharges LLM reasoning.
2024.10.10 🤖 We pre-release Decentralized Arena for automated, scalable, and transparent LLM evaluation.
2024.09.20 🎉 DRPO is accepted to the main conference of EMNLP 2024!
2024.07.10 🎉 LLM Reasoners is accepted to COLM 2024!
2024.02.28 💫 We release StarCoder 2, a family of open LLMs for code.
2024.01.16 🎉 RepoBench gets accepted to ICLR 2024!
2023.11.18 🥳 ToolkenGPT receives best paper award at SoCal NLP 2023!
2023.09.22 🎉 ToolkenGPT gets accepted to NeurIPS 2023 as an oral presentation!