CodeBlue

Open-source benchmark for analytics agents

Evaluate. Improve. Deploy.

CodeBlue is a comprehensive benchmark for evaluating LLM agents on data analytics tasks. Multi-turn conversations, complex joins, and adversarial challenges.

View Leaderboard Star on GitHub

2,847

Tasks

Databases

10+

Models Tested

80-10-10

Weighted Scoring

Three Core Products

From benchmarking to deployment, CodeBlue covers the full lifecycle of analytics agents.

Agentic Benchmark

Comprehensive leaderboard comparing models on analytics tasks. Cost vs performance, quadrant analysis, and category breakdowns.

View Leaderboard

Fine-tuning Gains

See how our environment improves open-source models. Before/after comparisons with percentage improvements and training details.

View Results

CSV Analytics

Upload your CSV and get instant analytics. Powered by our fine-tuned model. See natural language queries turn into insights.

Try Demo

How CodeBlue Works

Multi-Turn Conversations

Agents explore data iteratively, saving notes and building context across turns.

Weighted Scoring (80-10-10)

80% correctness, 10% efficiency, 10% token cost. Partial credit for close answers.

Adversarial Testing

Prompt injections, ambiguous queries, and edge cases that catch failures others miss.

Ready to evaluate your model?

Submit your model to the leaderboard or run evaluations locally.

Get Started