New evaluation run
Build a benchmark run.
Pick your contenders, choose the suites, tune the parameters — then launch and watch the scores stream in live.
Pick models
Every selected model runs against every chosen benchmark
1 Contenders
Pick benchmarks
Toggle suites, or select a whole domain at once
2 0 suites chosen
Parameters
Dial in scale, sampling and the judge
3 Tuning
Items per benchmark: 8
150
Temperature: 0.2
precise / creative
Total items to evaluate
0
0 models × 0 benchmarks × 8 items
Est. cost
$0.000
Model calls
0
Cost is a rough forecast (~700 input / 350 output tokens per item) based on the selected models' pricing. Actual cost depends on prompt length and output length.
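For a sense of how such a forecast could be computed, here is a minimal sketch. It assumes one model call per item and pricing quoted in USD per million tokens. The `ModelPrice` record, the model names, and the prices are hypothetical placeholders, not values from this page; only the 700 / 350 token budget and the models × benchmarks × items product come from the form above.

```python
from dataclasses import dataclass

# Per-item token budget taken from the forecast note (~700 in / 350 out).
TOKENS_IN_PER_ITEM = 700
TOKENS_OUT_PER_ITEM = 350


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical pricing record, in USD per one million tokens."""
    name: str
    usd_per_m_input: float
    usd_per_m_output: float


def total_items(n_models: int, n_benchmarks: int, items_per_benchmark: int) -> int:
    """Every selected model runs every item of every chosen benchmark."""
    return n_models * n_benchmarks * items_per_benchmark


def estimate_cost(models: list[ModelPrice], n_benchmarks: int,
                  items_per_benchmark: int) -> float:
    """Rough USD forecast, assuming one call per item (an assumption)."""
    items_per_model = n_benchmarks * items_per_benchmark
    cost = 0.0
    for m in models:
        per_item_usd = (
            TOKENS_IN_PER_ITEM * m.usd_per_m_input
            + TOKENS_OUT_PER_ITEM * m.usd_per_m_output
        ) / 1_000_000
        cost += items_per_model * per_item_usd
    return cost


if __name__ == "__main__":
    # Placeholder models and prices, for illustration only.
    picked = [
        ModelPrice("model-a", usd_per_m_input=1.0, usd_per_m_output=2.0),
        ModelPrice("model-b", usd_per_m_input=3.0, usd_per_m_output=15.0),
    ]
    n_items = total_items(len(picked), n_benchmarks=2, items_per_benchmark=8)
    print(f"{n_items} items, est. cost ${estimate_cost(picked, 2, 8):.3f}")
```

With nothing selected, both functions return zero, which matches the empty-state totals shown on the form. Any judge calls would add to the total and are not modeled here.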
0 models · 0 benchmarks · 0 items
Pick at least one model and benchmark