Clémentine Fourrier of Hugging Face on why you should stop using LLMs as Judges, what comes after MMLU, how prompt formatting sways benchmark results, and why leaderboards are GPU poor
Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge