Benchmarking explained
Why benchmark LLMs?
Benchmarks help us compare models fairly. In real products you care about quality, cost, speed and context. A good choice balances all four.
What to measure
- Utility: one score that blends task results (chat / reason / code / vision), normalized to 0–100 (see the sketch after this list).
- Latency: time to first token and total completion time. Keep decoding parameters fixed when you compare.
- Cost: input and output tokens for a typical reply, for example 200 in / 800 out.
- Context: the maximum window and the effective part left after system prompts, tools and safety preamble.
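As a rough illustration, here is a minimal Python sketch of the Utility and Cost arithmetic above. The task names, weights, and per-1k-token prices are assumptions chosen for the example, not real provider rates.

```python
# Minimal sketch: blend per-task scores into one 0-100 utility number and
# estimate the cost of a typical reply. Task names, weights, and per-1k-token
# prices are illustrative assumptions, not real provider rates.

TASK_WEIGHTS = {"chat": 0.4, "reason": 0.3, "code": 0.2, "vision": 0.1}

def utility(scores: dict[str, float]) -> float:
    """Weighted average of per-task scores, each already on a 0-100 scale."""
    total_weight = sum(TASK_WEIGHTS[t] for t in scores)
    return sum(scores[t] * TASK_WEIGHTS[t] for t in scores) / total_weight

def cost_per_reply(in_tokens: int, out_tokens: int,
                   usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Cost of one exchange, e.g. 200 input / 800 output tokens."""
    return in_tokens / 1000 * usd_per_1k_in + out_tokens / 1000 * usd_per_1k_out

scores = {"chat": 82, "reason": 74, "code": 68, "vision": 55}
print(f"Utility: {utility(scores):.1f} / 100")
# Hypothetical prices: $0.50 per 1k input tokens, $1.50 per 1k output tokens.
print(f"Cost for 200 in / 800 out: ${cost_per_reply(200, 800, 0.50, 1.50):.2f}")
```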
Popular public benchmarks
- MMLU: broad knowledge QA. Repo · Paper
- GSM8K: grade school math reasoning. Repo · Paper
- HumanEval: code generation accuracy. Repo
- HellaSwag: commonsense inference. Site
- TruthfulQA: factuality under pressure. Site
- BIG-bench: a wide set of tasks and abilities. Repo
- HELM: a holistic evaluation framework. Site
- MT-Bench: multi-turn chat scored by an LLM judge (GPT-4), with pairwise and single-answer grading. FastChat
- lm-eval-harness: a standard runner for many tasks (usage sketched after this list). Repo
- Open LLM Leaderboard: community comparisons. HF Space
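If you adopt a harness rather than rolling your own scripts, a run with lm-eval-harness looks roughly like the sketch below. This assumes the harness's Python entry point simple_evaluate and a Hugging Face model backend; argument names can shift between versions, so treat it as a sketch and check the repo's docs for the current interface.

```python
# Rough sketch of an lm-eval-harness run; verify the current API in the repo,
# as argument names have changed between versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1b",  # any HF model id
    tasks=["hellaswag", "gsm8k"],                  # tasks from the list above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics
```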
How to run good evaluations
- Fix randomness: set temperature to 0 or fix a seed, and keep decoding parameters constant (see the config sketch after this list).
- Control prompts: pin the system prompt and instructions; keep examples the same across models.
- Isolate variables: change one thing at a time (model, context length, tool use) and write it down.
- Watch for contamination: public test sets may appear in pretraining; prefer private or freshly generated evals for high-stakes work.
- Report cost and latency together with quality so results are easy to compare.
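A minimal sketch of what "hold everything constant" can look like in code. The EvalConfig fields and the call_model wrapper are hypothetical placeholders for whatever SDK you use; the point is that the decoding parameters and system prompt are pinned once and reused for every model, and that each run records the one variable that changed.

```python
# Minimal sketch of "hold everything constant" evaluation settings.
# `call_model` is a hypothetical wrapper; swap in your provider's SDK.
from dataclasses import dataclass

SYSTEM_PROMPT = "You are a careful assistant. Answer concisely."  # pinned across models

@dataclass(frozen=True)
class EvalConfig:
    model: str
    temperature: float = 0.0  # deterministic-as-possible decoding
    top_p: float = 1.0
    max_tokens: int = 512
    seed: int = 1234          # fix a seed where the provider supports one
    notes: str = ""           # record the one variable you changed

def call_model(cfg: EvalConfig, prompt: str) -> str:
    """Hypothetical: send SYSTEM_PROMPT + prompt with cfg's decoding
    parameters to your provider and return the completion text."""
    raise NotImplementedError("plug in your provider SDK here")

# Change one thing at a time and write it down.
runs = [
    EvalConfig(model="model-a", notes="baseline"),
    EvalConfig(model="model-b", notes="same prompts and settings, only the model changed"),
]
```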
Quick start
- Pick two or three tasks you care about, such as Q&A, reasoning or code review.
- Write 10–20 short, realistic prompts for each task.
- Run the same prompts on two or three models with temperature 0 and the same system prompt.
- Score with a simple rubric: correct, helpful, safe. Average the results.
- Record token counts and time. Build a basic Utility / $ / ms view to compare options.
Tip: a simple spreadsheet is fine to start; you can adopt a formal harness later.
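If you prefer code to a spreadsheet, here is a minimal sketch of the scoring loop above. The models, rubric scores, token counts, latencies, and prices are made-up placeholders; the structure is what matters: average the rubric per model, then show Utility / $ / ms side by side.

```python
# Minimal sketch of the quick-start scoring loop. All numbers below are
# made-up placeholders; in practice they come from your own runs and logs.
from statistics import mean

# One record per (model, prompt): rubric scores are 0 or 1, plus measured
# token counts and wall-clock time.
records = {
    "model-a": [
        {"correct": 1, "helpful": 1, "safe": 1, "in_tok": 180, "out_tok": 640, "ms": 2100},
        {"correct": 0, "helpful": 1, "safe": 1, "in_tok": 210, "out_tok": 720, "ms": 2400},
    ],
    "model-b": [
        {"correct": 1, "helpful": 0, "safe": 1, "in_tok": 190, "out_tok": 500, "ms": 1500},
        {"correct": 1, "helpful": 1, "safe": 1, "in_tok": 205, "out_tok": 530, "ms": 1600},
    ],
}

# Hypothetical prices in USD per 1k input / output tokens, for illustration only.
PRICES = {"model-a": (0.50, 1.50), "model-b": (0.25, 0.75)}

print(f"{'model':10} {'utility':>8} {'$/reply':>8} {'ms':>6}")
for model, rows in records.items():
    utility = mean((r["correct"] + r["helpful"] + r["safe"]) / 3 for r in rows) * 100
    p_in, p_out = PRICES[model]
    cost = mean(r["in_tok"] / 1000 * p_in + r["out_tok"] / 1000 * p_out for r in rows)
    latency = mean(r["ms"] for r in rows)
    print(f"{model:10} {utility:8.1f} {cost:8.4f} {latency:6.0f}")
```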
Glossary
- Utility: a single number that summarizes performance across the tasks you weight.
- Latency: how quickly you see the first token and the full answer (measured in the sketch after this glossary).
- Context window: how much text the model can consider at once.
- Token: a short chunk of text; providers bill per input and output token.
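To make the two latency numbers concrete, here is a small sketch of how you might measure them against a streaming API. stream_tokens is a hypothetical stand-in for your provider's streaming call.

```python
# Minimal sketch of measuring time to first token and total completion time.
# `stream_tokens` is a hypothetical stand-in for your provider's streaming API.
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical streaming call; replace with your provider's SDK."""
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate network / decoding delay
        yield tok

def measure(prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first = 0.0
    for i, _tok in enumerate(stream_tokens(prompt)):
        if i == 0:
            first = time.perf_counter() - start  # time to first token
    total = time.perf_counter() - start          # total completion time
    return first, total

ttft, total = measure("Say hello")
print(f"time to first token: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```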