New evaluation run

Build a benchmark run.

Pick your contenders, choose the suites, and tune the parameters, then launch and watch the scores stream in live.

Pick models

Every selected model runs against every chosen benchmark
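As a rough illustration of what "every model against every benchmark" implies, the run plan is the Cartesian product of the two selections. The sketch below is an assumption about how such a plan could be enumerated, not the tool's actual scheduler; `plan_runs` and the model and suite names are hypothetical.

```python
from itertools import product


def plan_runs(models, benchmarks):
    """Return every (model, benchmark) pair, one per scheduled run.

    Hypothetical helper: pairs each selected model with each chosen benchmark.
    """
    return list(product(models, benchmarks))


if __name__ == "__main__":
    # Placeholder names, not real model or suite identifiers.
    for model, benchmark in plan_runs(["model-a", "model-b"], ["suite-x", "suite-y"]):
        print(f"{model} -> {benchmark}")
```

With 2 models and 3 benchmarks this yields 6 pairs, and an empty selection on either side yields no runs at all.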

Step 1: Contenders (0 selected)

Pick benchmarks

Toggle suites, or select a whole domain at once

Step 2: 0 suites chosen

Parameters

Dial in scale, sampling and the judge

Step 3: Tuning
Items per benchmark: 8
150
Temperature: 0.2 (scale runs from precise to creative)
Total items to evaluate: 0 (0 models × 0 benchmarks × 8 items)
Est. cost: $0.000
Model calls: 0

Cost is a rough forecast based on the selected models' pricing, assuming about 700 input and 350 output tokens per item. Actual cost depends on prompt length and output length.
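The forecast arithmetic can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual formula: it assumes pricing is quoted in USD per million tokens, that each item costs one call per model, and that judge calls are not included. `ModelPrice`, `total_items`, and `estimate_cost` are hypothetical names; only the 700 and 350 token figures come from the note above.

```python
from dataclasses import dataclass

# Per-item token counts quoted in the forecast note.
INPUT_TOKENS_PER_ITEM = 700
OUTPUT_TOKENS_PER_ITEM = 350


@dataclass(frozen=True)
class ModelPrice:
    """Hypothetical pricing record, in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def total_items(n_models, n_benchmarks, items_per_benchmark):
    """Total items to evaluate: models x benchmarks x items per benchmark."""
    return n_models * n_benchmarks * items_per_benchmark


def estimate_cost(prices, n_benchmarks, items_per_benchmark):
    """Rough cost forecast in USD, summed over the selected models.

    Assumes one call per item per model and ignores any judge calls.
    """
    items_per_model = n_benchmarks * items_per_benchmark
    cost = 0.0
    for price in prices:
        input_tokens = items_per_model * INPUT_TOKENS_PER_ITEM
        output_tokens = items_per_model * OUTPUT_TOKENS_PER_ITEM
        cost += input_tokens * price.input_per_mtok / 1_000_000
        cost += output_tokens * price.output_per_mtok / 1_000_000
    return cost


if __name__ == "__main__":
    # Made-up prices for two placeholder models.
    selected = [ModelPrice(1.0, 2.0), ModelPrice(3.0, 6.0)]
    print(total_items(len(selected), 3, 8))
    print(f"${estimate_cost(selected, 3, 8):.3f}")
```

With nothing selected, both functions return zero, which matches the empty-state figures shown on this page.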

0 models · 0 benchmarks · 0 items