Benchmarking explained
Why benchmark LLMs?
Benchmarks help us compare models fairly. In real products you care about quality, cost, speed and context. A good choice balances all four.
What to measure
- Utility: one score that blends task results (chat / reason / code / vision), normalized to 0–100 (see the sketch after this list).
- Latency: time to first token and total completion time. Keep decoding parameters fixed when you compare.
- Cost: input and output tokens for a typical reply, for example 200 in / 800 out.
- Context: the maximum window and the effective part left after system prompts, tools and safety preamble.
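As a rough illustration, here is a minimal Python sketch of the Utility and Cost arithmetic above. The task names, weights, and per-1k-token prices are assumptions chosen for the example, not real provider rates.

```python
# Minimal sketch: blend per-task scores into one 0-100 utility number and
# estimate the cost of a typical reply. Task names, weights, and per-1k-token
# prices are illustrative assumptions, not real provider rates.

TASK_WEIGHTS = {"chat": 0.4, "reason": 0.3, "code": 0.2, "vision": 0.1}

def utility(scores: dict[str, float]) -> float:
    """Weighted average of per-task scores, each already on a 0-100 scale."""
    total_weight = sum(TASK_WEIGHTS[t] for t in scores)
    return sum(scores[t] * TASK_WEIGHTS[t] for t in scores) / total_weight

def cost_per_reply(in_tokens: int, out_tokens: int,
                   usd_per_1k_in: float, usd_per_1k_out: float) -> float:
    """Cost of one exchange, e.g. 200 input / 800 output tokens."""
    return in_tokens / 1000 * usd_per_1k_in + out_tokens / 1000 * usd_per_1k_out

scores = {"chat": 82, "reason": 74, "code": 68, "vision": 55}
print(f"Utility: {utility(scores):.1f} / 100")
# Hypothetical prices: $0.50 per 1k input tokens, $1.50 per 1k output tokens.
print(f"Cost for 200 in / 800 out: ${cost_per_reply(200, 800, 0.50, 1.50):.2f}")
```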
Popular public benchmarks
- MMLU: broad knowledge QA. Repo · Paper
- GSM8K: grade school math reasoning. Repo · Paper
- HumanEval: code generation accuracy. Repo
- HellaSwag: commonsense inference. Site
- TruthfulQA: factuality under pressure. Site
- BIG-bench: a wide set of tasks and abilities. Repo
- HELM: a holistic evaluation framework. Site
- MT-Bench: multi-turn chat scored by an LLM judge (GPT-4), with pairwise and single-answer grading. FastChat
- lm-eval-harness: a standard runner for many tasks (usage sketched after this list). Repo
- Open LLM Leaderboard: community comparisons. HF Space
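If you adopt a harness rather than rolling your own scripts, a run with lm-eval-harness looks roughly like the sketch below. This assumes the harness's Python entry point simple_evaluate and a Hugging Face model backend; argument names can shift between versions, so treat it as a sketch and check the repo's docs for the current interface.

```python
# Rough sketch of an lm-eval-harness run; verify the current API in the repo,
# as argument names have changed between versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=EleutherAI/pythia-1b",  # any HF model id
    tasks=["hellaswag", "gsm8k"],                  # tasks from the list above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metrics
```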
How to run good evaluations
- Fix randomness: set temperature to 0 or fix a seed, and keep decoding parameters constant (see the config sketch after this list).
- Control prompts: pin the system prompt and instructions; keep examples the same across models.
- Isolate variables: change one thing at a time (model, context length, tool use) and write it down.
- Watch for contamination: public test sets may appear in pretraining; prefer private or freshly generated evals for high-stakes work.
- Report cost and latency together with quality so results are easy to compare.
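A minimal sketch of what "hold everything constant" can look like in code. The EvalConfig fields and the call_model wrapper are hypothetical placeholders for whatever SDK you use; the point is that the decoding parameters and system prompt are pinned once and reused for every model, and that each run records the one variable that changed.

```python
# Minimal sketch of "hold everything constant" evaluation settings.
# `call_model` is a hypothetical wrapper; swap in your provider's SDK.
from dataclasses import dataclass

SYSTEM_PROMPT = "You are a careful assistant. Answer concisely."  # pinned across models

@dataclass(frozen=True)
class EvalConfig:
    model: str
    temperature: float = 0.0  # deterministic-as-possible decoding
    top_p: float = 1.0
    max_tokens: int = 512
    seed: int = 1234          # fix a seed where the provider supports one
    notes: str = ""           # record the one variable you changed

def call_model(cfg: EvalConfig, prompt: str) -> str:
    """Hypothetical: send SYSTEM_PROMPT + prompt with cfg's decoding
    parameters to your provider and return the completion text."""
    raise NotImplementedError("plug in your provider SDK here")

# Change one thing at a time and write it down.
runs = [
    EvalConfig(model="model-a", notes="baseline"),
    EvalConfig(model="model-b", notes="same prompts and settings, only the model changed"),
]
```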
Quick start
- Pick two or three tasks you care about, such as Q&A, reasoning or code review.
- Write 10–20 short, realistic prompts for each task.
- Run the same prompts on two or three models with temperature 0 and the same system prompt.
- Score with a simple rubric: correct, helpful, safe. Average the results.
- Record token counts and time. Build a basic Utility / $ / ms view to compare options.
Tip: a simple spreadsheet is fine to start; you can adopt a formal harness later.
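If you prefer code to a spreadsheet, here is a minimal sketch of the scoring loop above. The models, rubric scores, token counts, latencies, and prices are made-up placeholders; the structure is what matters: average the rubric per model, then show Utility / $ / ms side by side.

```python
# Minimal sketch of the quick-start scoring loop. All numbers below are
# made-up placeholders; in practice they come from your own runs and logs.
from statistics import mean

# One record per (model, prompt): rubric scores are 0 or 1, plus measured
# token counts and wall-clock time.
records = {
    "model-a": [
        {"correct": 1, "helpful": 1, "safe": 1, "in_tok": 180, "out_tok": 640, "ms": 2100},
        {"correct": 0, "helpful": 1, "safe": 1, "in_tok": 210, "out_tok": 720, "ms": 2400},
    ],
    "model-b": [
        {"correct": 1, "helpful": 0, "safe": 1, "in_tok": 190, "out_tok": 500, "ms": 1500},
        {"correct": 1, "helpful": 1, "safe": 1, "in_tok": 205, "out_tok": 530, "ms": 1600},
    ],
}

# Hypothetical prices in USD per 1k input / output tokens, for illustration only.
PRICES = {"model-a": (0.50, 1.50), "model-b": (0.25, 0.75)}

print(f"{'model':10} {'utility':>8} {'$/reply':>8} {'ms':>6}")
for model, rows in records.items():
    utility = mean((r["correct"] + r["helpful"] + r["safe"]) / 3 for r in rows) * 100
    p_in, p_out = PRICES[model]
    cost = mean(r["in_tok"] / 1000 * p_in + r["out_tok"] / 1000 * p_out for r in rows)
    latency = mean(r["ms"] for r in rows)
    print(f"{model:10} {utility:8.1f} {cost:8.4f} {latency:6.0f}")
```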
Glossary
- Utility: a single number that summarizes performance across the tasks you weight.
- Latency: how quickly you see the first token and the full answer (measured in the sketch after this glossary).
- Context window: how much text the model can consider at once.
- Token: a short chunk of text; providers bill per input and output token.
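To make the two latency numbers concrete, here is a small sketch of how you might measure them against a streaming API. stream_tokens is a hypothetical stand-in for your provider's streaming call.

```python
# Minimal sketch of measuring time to first token and total completion time.
# `stream_tokens` is a hypothetical stand-in for your provider's streaming API.
import time
from typing import Iterator

def stream_tokens(prompt: str) -> Iterator[str]:
    """Hypothetical streaming call; replace with your provider's SDK."""
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate network / decoding delay
        yield tok

def measure(prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    first = 0.0
    for i, _tok in enumerate(stream_tokens(prompt)):
        if i == 0:
            first = time.perf_counter() - start  # time to first token
    total = time.perf_counter() - start          # total completion time
    return first, total

ttft, total = measure("Say hello")
print(f"time to first token: {ttft * 1000:.0f} ms, total: {total * 1000:.0f} ms")
```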