State-of-the-art evaluation pipeline

Put frontier models in the crucible.

Benchmark, battle and red-team any model across psychology, trading, software, business, marketing and logic — with live scoring, an Elo arena and full observability.


Runs

Model calls

0 tokens

Total cost: $0.00

Avg latency: 0 ms (0.0% errors)

Benchmark domains

Curated suites across the areas that matter

Top models

Aggregate capability across all runs


No results yet — start a run to populate the leaderboard.

Recent activity


No runs yet.