Programming Language Benchmarks

The Anoa-L01 Benchmark: Prompt-Based Zero-Shot Evaluation for Sulawesi’s Regional Languages Detection in LLMs

Abstract: In recent years, large language models (LLMs) have demonstrated impressive performance across a wide range of natural language processing tasks. However, their performance on low-resource ...

Tech Times

Claude AI Beats Human Robotics Teams 20x: Anthropic Marks Physical AI Turn

Claude AI robotics benchmark shows Opus 4.7 finishing physical robot programming in 9 minutes, against 181 minutes for ...

Tech Times

Autonomous AI Coding Clears 60,000-Line Ceiling: MirrorCode Benchmark Released

AI coding benchmark MirrorCode published its full results on June 26, showing that Claude Opus 4.7 autonomously rebuilt a 60,000-line interpreter and scored 56% overall, completing tasks that take human ...

GitHub

LangArena: A Balanced Programming Language Benchmark Suite

The suite started with my original implementation in Crystal. AI tools assisted in translating it to other languages. Throughout this process, I reviewed and edited the implementation for semantic ...
