State-of-the-art evaluation pipeline

Put frontier models in the crucible.

Benchmark, battle and red-team any model across psychology, trading, software, business, marketing and logic, with live scoring, an Elo arena and full observability.

Runs: 0
Model calls: 0 (0 tokens)
Total cost: $0.00
Avg latency: 0 ms (0.0% errors)

Benchmark domains

Curated suites across the areas that matter

Top models

Aggregate capability across all runs

Full board
No results yet. Start a run to populate the leaderboard.

Recent activity

All runs
No runs yet.