AI Model Leaderboards
Comprehensive collection of AI model leaderboards and benchmarks for evaluating large language models, multimodal models, and specialized AI systems across various domains and capabilities.
General Performance
Artificial Analysis Leaderboard
Ranks 30+ LLMs on quality, price, output speed (tokens/s), latency, and context window. Top models: o3, o4-mini, Gemini 2.5 Pro, DeepSeek R1.
Hugging Face Open LLM Leaderboard
Evaluates open-source models (e.g., Mixtral, Yi, Smaug, Qwen) on benchmarks such as ARC, MMLU, and HellaSwag. Excludes proprietary models.
LMArena Leaderboard (Chatbot Arena)
Uses Elo-style ratings computed from 3M+ community votes to rank 400+ text and vision models. Top models: Gemini-2.5-Pro-Grounding, Perplexity-Sonar-Reasoning-Pro, o3, o4-mini.
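Arena-style rankings are derived from pairwise human votes between anonymized models. As a conceptual illustration only, here is a minimal Elo-update sketch in Python; the live leaderboard uses more elaborate statistical modeling, and the K-factor, starting rating, and model names below are assumptions, not LMArena's implementation.

```python
# Minimal Elo-rating sketch for pairwise model battles.
# Illustrative only: K-factor, starting rating, and vote format are assumptions,
# not LMArena's actual methodology.
from collections import defaultdict

K = 32            # assumed update step (K-factor)
START = 1000.0    # assumed starting rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings: dict, model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after one head-to-head vote ('a', 'b', or 'tie')."""
    ra, rb = ratings[model_a], ratings[model_b]
    ea = expected_score(ra, rb)
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + K * (score_a - ea)
    ratings[model_b] = rb + K * ((1.0 - score_a) - (1.0 - ea))

# Three hypothetical votes between made-up model names
ratings = defaultdict(lambda: START)
for a, b, w in [("model-x", "model-y", "a"),
                ("model-y", "model-z", "tie"),
                ("model-x", "model-z", "a")]:
    update(ratings, a, b, w)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```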
Scale AI SEAL Leaderboards
Ranks models on coding, instruction following, math, and multilinguality using private datasets. Top models: GPT-4o, Gemini 1.5 Pro, Claude 3 Opus.
LiveBench
Frequently refreshed benchmark designed to limit test-set contamination; particularly reliable for reasoning tasks. Top models: Claude 3.5 Sonnet, GPT-4o.
Azure AI Foundry Leaderboards
Compares models on quality, cost, throughput, and scenario-specific performance.
Vellum LLM Leaderboard
Comprehensive evaluation platform comparing LLM performance across multiple domains and use cases.
Coding
EvalPlus Leaderboard
Evaluates code-generation models on the HumanEval+ and MBPP+ benchmarks, which extend HumanEval and MBPP with roughly 80x and 35x more test cases, respectively.
Big Code Models Leaderboard
Compares performance of base multilingual code generation models on HumanEval and MultiPL-E benchmarks across 18 programming languages.
HumanEval Benchmark
Widely used code-generation benchmark of 164 hand-written programming problems; measures functional correctness of generated code, typically reported as pass@k.
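pass@k is the probability that at least one of k sampled completions passes all unit tests for a problem. A small sketch of the unbiased estimator popularized by the HumanEval paper, 1 - C(n-c, k)/C(n, k), where n samples were drawn and c of them passed; the sample counts in the example are made up.

```python
# pass@k estimator: probability that at least one of k samples passes,
# given n total samples of which c passed all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:   # fewer failing samples than k: every size-k subset contains a pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 37 pass the tests
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185, the raw pass rate
print(round(pass_at_k(n=200, c=37, k=10), 3))  # much higher with 10 attempts
```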
BigCodeBench
Benchmark of 1,140 diverse, realistic programming tasks that require composing complex function calls across many libraries.
SWE-Bench
Software engineering benchmark built from 2,294 real GitHub issues, testing repository-level code understanding and bug-fixing ability.
Multimodal
MEGA-Bench
Evaluates multimodal models on 505 diverse tasks. Top model: Gemini-2.0.
Open VLM Leaderboard
Comprehensive evaluation toolkit for large vision-language models, supporting 220+ models and 80+ benchmarks.
MMMU Benchmark
Massive Multi-discipline Multimodal Understanding benchmark with college-level problems across 30 subjects requiring expert knowledge.
MME Leaderboard
Comprehensive evaluation of multimodal large language models on perception and cognition tasks across 14 subtasks.
Reasoning
GSM8K Benchmark
Benchmark of 8.5K grade-school math word problems testing multi-step arithmetic reasoning.
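GSM8K is typically scored by exact match on the final numeric answer; reference solutions end with a "#### <number>" line. A minimal scoring sketch, assuming the model is prompted to finish with its numeric answer; the regex and prompt convention are assumptions rather than a fixed standard.

```python
# Minimal GSM8K-style exact-match scorer.
# Assumption: both the reference solution and the model output end with a final
# numeric answer; the reference uses GSM8K's "#### <number>" convention.
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, stripping commas and '####' markers."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text.replace("####", " "))
    return matches[-1].replace(",", "") if matches else None

def is_correct(model_output: str, reference_solution: str) -> bool:
    pred = extract_final_number(model_output)
    gold = extract_final_number(reference_solution)
    return pred is not None and gold is not None and float(pred) == float(gold)

# Hypothetical example
reference = "Natalia sold 48 clips in April and half as many in May. ... #### 72"
output = "In May she sold 24, so the total is 48 + 24 = 72. The answer is 72."
print(is_correct(output, reference))  # True
```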
ARC Challenge
AI2 Reasoning Challenge (ARC) with 7,787 grade-school science questions (2,590 in the harder Challenge split), testing scientific reasoning and commonsense knowledge.
MathArena
Platform using fresh math competition and Olympiad problems to assess mathematical reasoning without data contamination.
AI Agents
Agent Leaderboard by RunGalileo
Evaluates LLMs on real-world tool-calling for AI agents. Top model: GPT-4.5.
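Tool-calling evaluation generally means checking whether the model selects the expected tool and supplies the expected arguments for a given turn. A generic scoring sketch is shown below; the ToolCall schema, the partial-credit rule, and the example tool are illustrative assumptions, not Galileo's actual methodology.

```python
# Generic sketch of scoring one tool-calling turn: did the model pick the expected
# tool and supply the required arguments? Schema and scoring rule are illustrative.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    arguments: dict

def score_tool_call(predicted: ToolCall, expected: ToolCall) -> float:
    """1.0 if tool name and all expected arguments match; partial credit otherwise."""
    if predicted.name != expected.name:
        return 0.0
    if not expected.arguments:
        return 1.0
    matched = sum(1 for k, v in expected.arguments.items()
                  if predicted.arguments.get(k) == v)
    return matched / len(expected.arguments)

# Hypothetical example: right tool, one of two arguments correct
expected = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
predicted = ToolCall("get_weather", {"city": "Paris", "unit": "fahrenheit"})
print(score_tool_call(predicted, expected))  # 0.5
```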
Enterprise
Kearney LLM Leaderboard
Focuses on enterprise readiness and business performance. Top models: GPT-4o, GPT-4 Turbo, Gemini Pro, Qwen Max.
BytePlus ModelArk Leaderboard
Tracks LLMs on performance, ethics, and enterprise use.
Conversational AI
Chai Research Leaderboard
Community-driven leaderboard for conversational AI models, with a focus on engagement and personality.
Comprehensive
HELM (Stanford)
Holistic Evaluation of Language Models (HELM), providing transparent, standardized, and multi-dimensional evaluation.