A curated list of AI agent frameworks, launchpads, tools, tutorials, & resources.

AI Model Leaderboards

Comprehensive collection of AI model leaderboards and benchmarks for evaluating large language models, multimodal models, and specialized AI systems across various domains and capabilities.

General Performance

Artificial Analysis Leaderboard
Ranks 30+ LLMs on quality, price, speed (tokens/s), latency, and context window. Top models: o3, o4-mini, Gemini 2.5 Pro, DeepSeek R1.
Visit Leaderboard
Hugging Face Open LLM Leaderboard
Evaluates open-source models (e.g., Mixtral, Yi, Smaug, Qwen) on benchmarks like ARC, MMLU, HellaSwag. Excludes proprietary models.
Visit Leaderboard
LMArena Leaderboard (Chatbot Arena)
Uses Elo-style ratings derived from 3M+ community votes to rank 400+ text and vision models (a minimal rating-update sketch follows this subsection). Top models: Gemini-2.5-Pro-Grounding, Perplexity-Sonar-Reasoning-Pro, o3, o4-mini.
Visit Leaderboard
Scale AI SEAL Leaderboards
Ranks models on coding, instruction following, math, and multilinguality using private datasets. Top models: GPT-4o, Gemini 1.5 Pro, Claude 3 Opus.
Visit Leaderboard
LiveBench
Frequently updated with new questions to limit test-set contamination; a reliable gauge of reasoning performance. Top models: Claude 3.5 Sonnet, GPT-4o.
Visit Leaderboard
Azure AI Foundry Leaderboards
Compares models on quality, cost, throughput, and scenario-specific performance.
Visit Leaderboard
Vellum LLM Leaderboard
Comprehensive evaluation platform comparing LLM performance across multiple domains and use cases.
Visit Leaderboard
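
Arena-style leaderboards such as LMArena turn millions of pairwise human votes into a ranking. As an intuition aid only (the live leaderboard fits ratings with a more robust statistical model), here is a minimal sketch of a classic Elo update over hypothetical votes; the K-factor, base rating, and model names are assumed values.

```python
# Minimal Elo-style rating update from pairwise votes (illustration only;
# the real leaderboard fits ratings with a more robust statistical model).
from collections import defaultdict

K = 32          # update step size (assumed value, not the leaderboard's)
BASE = 1000.0   # starting rating for every model

ratings = defaultdict(lambda: BASE)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def record_vote(model_a: str, model_b: str, winner: str) -> None:
    """Update both ratings after one human preference vote."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    score_a = 1.0 if winner == model_a else 0.5 if winner == "tie" else 0.0
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical votes:
record_vote("model-x", "model-y", winner="model-x")
record_vote("model-x", "model-z", winner="tie")
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```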

Coding

EvalPlus Leaderboard
Evaluates code-generation models with rigorous tests on HumanEval+ and MBPP+, which extend the original benchmarks with 80x and 35x more test cases, respectively.
Visit Leaderboard
Big Code Models Leaderboard
Compares performance of base multilingual code generation models on HumanEval and MultiPL-E benchmarks across 18 programming languages.
Visit Leaderboard
HumanEval Benchmark
Standard code generation benchmark of 164 hand-written programming problems, scored on functional correctness via pass@k (see the estimator sketch after this subsection).
Visit Leaderboard
BigCodeBench
Benchmark of 1,140 diverse, realistic programming tasks requiring compositional use of library function calls to test practical coding ability.
Visit Leaderboard
SWE-Bench
Software engineering benchmark with 2,294 real GitHub issues testing repository-level code understanding and bug-fixing ability.
Visit Leaderboard
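
Most of the coding leaderboards above report functional correctness as pass@k: generate n candidate solutions per problem, run the unit tests, and estimate the chance that at least one of k samples passes. Below is a minimal sketch of the standard unbiased estimator; the sample counts are illustrative.

```python
# Unbiased pass@k estimator for functional-correctness benchmarks such as
# HumanEval: n samples generated per problem, c of them pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 4))   # per-sample pass rate
print(round(pass_at_k(n=200, c=37, k=10), 4))  # chance a best-of-10 attempt passes
```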

Multimodal

MEGA-Bench
Evaluates multimodal models on 505 diverse real-world tasks. Top model: Gemini-2.0.
Visit Leaderboard
Open VLM Leaderboard
Comprehensive evaluation toolkit for large vision-language models, supporting 220+ models and 80+ benchmarks.
Visit Leaderboard
MMMU Benchmark
Massive Multi-discipline Multimodal Understanding benchmark with college-level problems across 30 subjects requiring expert knowledge.
Visit Leaderboard
MME Leaderboard
Comprehensive evaluation of multimodal large language models on perception and cognition tasks across 14 subtasks.
Visit Leaderboard

Reasoning

GSM8K Benchmark
Benchmark of 8.5K grade-school math word problems testing multi-step arithmetic reasoning (a scoring sketch follows this subsection).
Visit Leaderboard
ARC Challenge
AI2 Reasoning Challenge with 7,787 grade-school science questions testing scientific reasoning and commonsense knowledge.
Visit Leaderboard
MathArena
Platform using fresh math competition and Olympiad problems to assess mathematical reasoning without data contamination.
Visit Leaderboard
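
Reasoning benchmarks such as GSM8K are typically scored by exact match on the final answer. A minimal sketch, assuming the Hugging Face gsm8k dataset id and its published answer format (reference solutions end with a "#### <number>" line); the helper names are illustrative.

```python
# Sketch of exact-match scoring for GSM8K-style math word problems.
# Assumes the Hugging Face dataset id "gsm8k" (config "main") and that
# reference answers end with a line of the form "#### <final number>".
import re
from datasets import load_dataset  # pip install datasets

def final_answer(text: str) -> str:
    """Pull the final numeric answer out of a GSM8K-style solution string."""
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else ""

def exact_match(prediction: str, reference: str) -> bool:
    return final_answer(prediction) == final_answer(reference)

test_set = load_dataset("gsm8k", "main", split="test")
example = test_set[0]
print(example["question"][:80], "...")
print("gold answer:", final_answer(example["answer"]))
# A full evaluation would generate a solution ending in "#### <answer>" for
# each question and report the fraction of exact matches.
```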

AI Agents

Agent Leaderboard by RunGalileo
Evaluates LLMs on real-world tool-calling for AI agents (see the scoring sketch after this subsection). Top model: GPT-4.5.
Visit Leaderboard
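
Tool-calling evaluations like the one above check whether a model selects the right tool and supplies the right arguments. Here is a framework-agnostic sketch of that comparison; the get_weather tool, its schema, and the sample model response are hypothetical.

```python
# Framework-agnostic sketch of scoring a tool call: compare the emitted
# function name and arguments against a gold call. The tool schema and the
# model output below are hypothetical examples, not a specific vendor API.
import json

weather_tool = {
    "name": "get_weather",
    "description": "Look up current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
        "required": ["city"],
    },
}

def score_tool_call(model_output: str, expected_name: str, expected_args: dict) -> bool:
    """Return True if the model picked the right tool with the right arguments."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # unparseable output counts as a failed call
    return call.get("name") == expected_name and call.get("arguments") == expected_args

# Hypothetical model response serialized as JSON:
response = '{"name": "get_weather", "arguments": {"city": "Lisbon", "unit": "celsius"}}'
print(score_tool_call(response, "get_weather", {"city": "Lisbon", "unit": "celsius"}))
```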

Enterprise

Kearney LLM Leaderboard
Focuses on enterprise readiness and business performance. Top models: GPT-4o, GPT-4 Turbo, Gemini Pro, Qwen Max.
Visit Leaderboard
BytePlus ModelArk Leaderboard
Tracks LLMs on performance, ethics, and enterprise use.
Visit Leaderboard

Conversational AI

Chai Research Leaderboard
Community-driven leaderboard for conversational AI models with focus on engagement and personality.
Visit Leaderboard

Comprehensive

HELM (Stanford)
Holistic Evaluation of Language Models, providing transparent, standardized, multi-dimensional assessment across a broad range of scenarios.
Visit Leaderboard