New evaluation run

Build a benchmark run.

Pick your contenders, choose the suites, tune the parameters — then launch and watch the scores stream in live.

Pick models

Every selected model runs against every chosen benchmark.

0 selected

1 Contenders

Toggle suites, or select a whole domain at once

2 0 suites chosen

Dial in scale, sampling and the judge

3 Tuning

Run name

Items per benchmark: 8 (up to 150)

Temperature: 0.2 (scale from precise to creative)

Judge model: Grades open-ended responses where scoring isn't deterministic.

Total items to evaluate

0 models × 0 benchmarks × 8 items
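The total above is a plain cross product: each selected model is paired with each chosen benchmark, and each pair contributes the configured number of items. A minimal sketch of that count, assuming hypothetical model and suite names (`plan_run` is an illustrative helper, not part of the tool):

```python
from itertools import product


def plan_run(models, benchmarks, items_per_benchmark=8):
    """Pair every model with every benchmark and count the items to evaluate.

    All names here are hypothetical; the default of 8 mirrors the form's
    current "Items per benchmark" value.
    """
    pairs = list(product(models, benchmarks))
    total_items = len(pairs) * items_per_benchmark
    return pairs, total_items


# Example: 2 models × 3 benchmarks × 8 items
pairs, total = plan_run(["model-a", "model-b"], ["suite-x", "suite-y", "suite-z"])
print(len(pairs), total)  # 6 48
```

With nothing selected, the same function returns 0 items, which matches the empty state shown on the form.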

Est. cost

$0.000

Model calls

Cost is a rough forecast (~700 input / 350 output tokens per item) based on selected model pricing. Actuals depend on prompt length and outputs.
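One way such a forecast could be computed is sketched below. It assumes per-model prices quoted in USD per million tokens and ignores any judge-model calls; the prices, model names, and the `estimate_cost` helper are all hypothetical, and only the 700/350 token figures come from the note above.

```python
# Token assumptions taken from the forecast note on the form.
IN_TOKENS_PER_ITEM = 700
OUT_TOKENS_PER_ITEM = 350


def estimate_cost(prices, n_benchmarks, items_per_benchmark=8):
    """Rough cost forecast in USD.

    prices: {model_name: (usd_per_million_input, usd_per_million_output)}
            (hypothetical pricing format, an assumption of this sketch)
    """
    items_per_model = n_benchmarks * items_per_benchmark
    total = 0.0
    for in_price, out_price in prices.values():
        # Cost of one item for this model at the assumed token counts.
        per_item = (IN_TOKENS_PER_ITEM * in_price
                    + OUT_TOKENS_PER_ITEM * out_price) / 1_000_000
        total += per_item * items_per_model
    return total


# Made-up prices for two made-up models, 3 benchmarks, 8 items each.
prices = {"model-a": (1.0, 2.0), "model-b": (3.0, 6.0)}
cost = estimate_cost(prices, n_benchmarks=3)
print(f"${cost:.3f}")  # $0.134
```

Because the token counts are fixed averages, the estimate scales linearly with the number of items; real prompts and outputs that are longer or shorter will shift the actual cost accordingly.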

0 models · 0 benchmarks · 0 items

Pick at least one model and benchmark