A curated list of AI agent frameworks, launchpads, tools, tutorials, & resources.

AI Model Leaderboards

Comprehensive collection of AI model leaderboards and benchmarks for evaluating large language models, multimodal models, and specialized AI systems across various domains and capabilities.

General Performance

Artificial Analysis Leaderboard
Ranks 30+ LLMs on quality, price, speed (tokens/s), latency, and context window. Top models: o3, o4-mini, Gemini 2.5 Pro, DeepSeek R1.
Visit Leaderboard
Hugging Face Open LLM Leaderboard
Evaluates open-source models (e.g., Mixtral, Yi, Smaug, Qwen) on benchmarks like ARC, MMLU, HellaSwag. Excludes proprietary models.
Visit Leaderboard
LMArena Leaderboard (Chatbot Arena)
Uses Elo-style ratings derived from 3M+ community votes to rank 400+ text and vision models (a minimal rating-update sketch follows this subsection). Top models: Gemini-2.5-Pro-Grounding, Perplexity-Sonar-Reasoning-Pro, o3, o4-mini.
Visit Leaderboard
Scale AI SEAL Leaderboards
Ranks models on coding, instruction following, math, and multilinguality using private datasets. Top models: GPT-4o, Gemini 1.5 Pro, Claude 3 Opus.
Visit Leaderboard
LiveBench
Frequently updated with new questions to limit test-set contamination; a reliable gauge of reasoning performance. Top models: Claude 3.5 Sonnet, GPT-4o.
Visit Leaderboard
Azure AI Foundry Leaderboards
Compares models on quality, cost, throughput, and scenario-specific performance.
Visit Leaderboard
Vellum LLM Leaderboard
Comprehensive evaluation platform comparing LLM performance across multiple domains and use cases.
Visit Leaderboard
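
Arena-style leaderboards such as LMArena turn millions of pairwise human votes into a ranking. As an intuition aid only (the live leaderboard fits ratings with a more robust statistical model), here is a minimal sketch of a classic Elo update over hypothetical votes; the K-factor, base rating, and model names are assumed values.

```python
# Minimal Elo-style rating update from pairwise votes (illustration only;
# the real leaderboard fits ratings with a more robust statistical model).
from collections import defaultdict

K = 32          # update step size (assumed value, not the leaderboard's)
BASE = 1000.0   # starting rating for every model

ratings = defaultdict(lambda: BASE)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after one human preference vote."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = 1.0 if winner == model_a else 0.5 if winner == "tie" else 0.0
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical votes:
record_vote("model-x", "model-y", winner="model-x")
record_vote("model-x", "model-z", winner="tie")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```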

Coding

EvalPlus Leaderboard
Evaluates code-generation models with rigorous tests on HumanEval+ and MBPP+, which extend the original benchmarks with 80x and 35x more test cases, respectively.
Visit Leaderboard
Big Code Models Leaderboard
Compares performance of base multilingual code generation models on HumanEval and MultiPL-E benchmarks across 18 programming languages.
Visit Leaderboard
HumanEval Benchmark
Standard code generation benchmark of 164 hand-written programming problems, scored on functional correctness via pass@k (see the estimator sketch after this subsection).
Visit Leaderboard
BigCodeBench
Benchmark of 1,140 diverse, realistic programming tasks requiring compositional use of library function calls to test practical coding ability.
Visit Leaderboard
SWE-Bench
Software engineering benchmark with 2,294 real GitHub issues testing repository-level code understanding and bug-fixing ability.
Visit Leaderboard
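
Most of the coding leaderboards above report functional correctness as pass@k: generate n candidate solutions per problem, run the unit tests, and estimate the chance that at least one of k samples passes. Below is a minimal sketch of the standard unbiased estimator; the sample counts are illustrative.

```python
# Unbiased pass@k estimator for functional-correctness benchmarks such as
# HumanEval: n samples generated per problem, c of them pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 4))   # per-sample pass rate
print(round(pass_at_k(n=200, c=37, k=10), 4))  # chance a best-of-10 attempt passes
```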

Multimodal

MEGA-Bench
Evaluates multimodal models on 505 diverse real-world tasks. Top model: Gemini-2.0.
Visit Leaderboard
Open VLM Leaderboard
Comprehensive evaluation toolkit for large vision-language models, supporting 220+ models and 80+ benchmarks.
Visit Leaderboard
MMMU Benchmark
Massive Multi-discipline Multimodal Understanding benchmark with college-level problems across 30 subjects requiring expert knowledge.
Visit Leaderboard
MME Leaderboard
Comprehensive evaluation of multimodal large language models on perception and cognition tasks across 14 subtasks.
Visit Leaderboard

Reasoning

GSM8K Benchmark
Benchmark of 8.5K grade-school math word problems testing multi-step arithmetic reasoning (a scoring sketch follows this subsection).
Visit Leaderboard
ARC Challenge
AI2 Reasoning Challenge with 7,787 grade-school science questions testing scientific reasoning and commonsense knowledge.
Visit Leaderboard
MathArena
Platform using fresh math competition and Olympiad problems to assess mathematical reasoning without data contamination.
Visit Leaderboard
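
Reasoning benchmarks such as GSM8K are typically scored by exact match on the final answer. A minimal sketch, assuming the Hugging Face gsm8k dataset id and its published answer format (reference solutions end with a "#### <number>" line); the helper names are illustrative.

```python
# Sketch of exact-match scoring for GSM8K-style math word problems.
# Assumes the Hugging Face dataset id "gsm8k" (config "main") and that
# reference answers end with a line of the form "#### <final number>".
import re
from datasets import load_dataset  # pip install datasets

def final_answer(text: str) -> str:
    """Pull the final numeric answer out of a GSM8K-style solution string."""
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else ""

def exact_match(prediction: str, reference: str) -> bool:
    return final_answer(prediction) == final_answer(reference)

test_set = load_dataset("gsm8k", "main", split="test")
example = test_set[0]
print(example["question"][:80], "...")
print("gold answer:", final_answer(example["answer"]))
# A full evaluation would generate a solution ending in "#### <answer>" for
# each question and report the fraction of exact matches.
```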

AI Agents

Agent Leaderboard by RunGalileo
Evaluates LLMs on real-world tool-calling for AI agents (see the scoring sketch after this subsection). Top model: GPT-4.5.
Visit Leaderboard
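
Tool-calling evaluations like the one above check whether a model selects the right tool and supplies the right arguments. Here is a framework-agnostic sketch of that comparison; the get_weather tool, its schema, and the sample model response are hypothetical.

```python
# Framework-agnostic sketch of scoring a tool call: compare the emitted
# function name and arguments against a gold call. The tool schema and the
# model output below are hypothetical examples, not a specific vendor API.
import json

weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
        "required": ["city"],
    },
}

def score_tool_call(model_output: str, expected_name: str, expected_args: dict) -> bool:
    """Return True if the model picked the right tool with the right arguments."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a failed call
    return call.get("name") == expected_name and call.get("arguments") == expected_args

# Hypothetical model response serialized as JSON:
response = '{"name": "get_weather", "arguments": {"city": "Lisbon", "unit": "celsius"}}'
print(score_tool_call(response, "get_weather", {"city": "Lisbon", "unit": "celsius"}))
```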

Enterprise

Kearney LLM Leaderboard
Focuses on enterprise readiness and business performance. Top models: GPT-4o, GPT-4 Turbo, Gemini Pro, Qwen Max.
Visit Leaderboard
BytePlus ModelArk Leaderboard
Tracks LLMs on performance, ethics, and enterprise use.
Visit Leaderboard

Conversational AI

Chai Research Leaderboard
Community-driven leaderboard for conversational AI models with focus on engagement and personality.
Visit Leaderboard

Comprehensive

HELM (Stanford)
Holistic Evaluation of Language Models, providing transparent, standardized, multi-dimensional assessment across a broad range of scenarios.
Visit Leaderboard