Open-source benchmark for analytics agents

Evaluate. Improve. Deploy.

CodeBlue is a comprehensive benchmark for evaluating LLM agents on data analytics tasks. Multi-turn conversations, complex joins, and adversarial challenges.

2,847

Tasks

95

Databases

10+

Models Tested

80-10-10

Weighted Scoring

Three Core Products

From benchmarking to deployment, CodeBlue covers the full lifecycle of analytics agents.

How CodeBlue Works

01

Multi-Turn Conversations

Agents explore data iteratively, saving notes and building context across turns.

02

Weighted Scoring (80-10-10)

80% correctness, 10% efficiency, 10% token cost. Partial credit for close answers.

03

Adversarial Testing

Prompt injections, ambiguous queries, and edge cases that catch failures others miss.

Ready to evaluate your model?

Submit your model to the leaderboard or run evaluations locally.