CodeBlue is a comprehensive benchmark for evaluating LLM agents on data analytics tasks. Multi-turn conversations, complex joins, and adversarial challenges.
2,847
Tasks
95
Databases
10+
Models Tested
80-10-10
Weighted Scoring
From benchmarking to deployment, CodeBlue covers the full lifecycle of analytics agents.
Comprehensive leaderboard comparing models on analytics tasks. Cost vs performance, quadrant analysis, and category breakdowns.
See how our environment improves open-source models. Before/after comparisons with percentage improvements and training details.
Upload your CSV and get instant analytics. Powered by our fine-tuned model. See natural language queries turn into insights.
Agents explore data iteratively, saving notes and building context across turns.
80% correctness, 10% efficiency, 10% token cost. Partial credit for close answers.
Prompt injections, ambiguous queries, and edge cases that catch failures others miss.
Submit your model to the leaderboard or run evaluations locally.