🌐 MODEL ATLAS

Overview

Axes

We plot each model by Utility (X, 0–100) and Expected cost per 100 replies (Y, USD). Utility is a weighted blend of four task scores (chat, reason, code, vision), each normalized to 0–1 and scaled to 0–100. A small baseline offset keeps very low-cost points readable; axis labels always show real USD.

Speed

Speed icons: 🚀🚀🚀 ≤150 ms (ultra-fast), 🚀🚀 ≤300 ms (fast), 🚀 ≤600 ms (moderate), ⏳ >600 ms (slow). Latency is the median time-to-first-token under light load on typical provider GPUs; measure on your own infrastructure for production SLOs.
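
A minimal sketch of the bucketing (TypeScript; the function name is ours, the thresholds are the ones listed above):

// Map median time-to-first-token (ms) to a speed icon.
// Thresholds match the buckets listed above.
function speedIcon(ttftMs: number): string {
  if (ttftMs <= 150) return "🚀🚀🚀"; // ultra-fast
  if (ttftMs <= 300) return "🚀🚀";   // fast
  if (ttftMs <= 600) return "🚀";     // moderate
  return "⏳";                        // slow
}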

Context

Context windows come from provider docs. “Effective context” is a conservative estimate of what remains after system prompts, tool definitions, and safety preambles.
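
There is no single formula here; a minimal sketch, assuming a fixed token budget is reserved for overhead (the budget value is illustrative, not from provider docs):

// Subtract an assumed overhead budget (system prompt, tools, safety preamble)
// from the documented window. overheadTokens is an illustrative default.
function effectiveContext(windowTokens: number, overheadTokens = 2_000): number {
  return Math.max(0, windowTokens - overheadTokens);
}

// effectiveContext(128_000) → 126_000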

Sources

Pricing and context are taken from official provider pages. The chart tooltips and pinned details include direct links when available.


How “Utility” is computed

0–100 scaled

Utility summarizes task performance into a single score. We blend four task buckets (chat, reason, code, vision) using one of several weight presets (Overall, Chat, Reason, Code, Vision). Raw metrics are normalized to 0–1 and then scaled to 0–100 for display.

Inputs

  • Task scores per model: chat, reason, code, vision
  • Display bounds: utility_min=70, utility_max=100 (chart axis range; see notes below)
  • Preset weights (examples below)

Formula

// normalized task scores in [0,1]
utility_raw =
  w_chat   * s_chat   +
  w_reason * s_reason +
  w_code   * s_code   +
  w_vision * s_vision

// scale to 0–100 for display (bounded)
utility_display = clamp( 100 * utility_raw, 0, 100 )

Weights in each preset sum to 1. The Overall preset balances all four buckets, while the other tabs reweight toward their focus (e.g., Code or Vision); see the presets below and the sketch after them.

Presets (illustrative)

Overall: chat 0.30 · reason 0.35 · code 0.20 · vision 0.15
Chat:    chat 0.60 · reason 0.20 · code 0.10 · vision 0.10
Reason:  chat 0.20 · reason 0.60 · code 0.15 · vision 0.05
Code:    chat 0.15 · reason 0.30 · code 0.50 · vision 0.05
Vision:  chat 0.20 · reason 0.15 · code 0.05 · vision 0.60
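
A minimal sketch of the blend (TypeScript; the type names and clamp helper are ours, weights from the Overall preset above):

// Task scores are assumed pre-normalized to [0, 1].
type TaskScores = { chat: number; reason: number; code: number; vision: number };
type Weights = TaskScores; // same shape; entries sum to 1

// Overall preset from the list above.
const OVERALL: Weights = { chat: 0.30, reason: 0.35, code: 0.20, vision: 0.15 };

const clamp = (x: number, lo: number, hi: number) =>
  Math.min(hi, Math.max(lo, x));

// Blend the four task buckets, then scale to 0–100 for display.
function utilityDisplay(s: TaskScores, w: Weights = OVERALL): number {
  const raw =
    w.chat * s.chat + w.reason * s.reason + w.code * s.code + w.vision * s.vision;
  return clamp(100 * raw, 0, 100);
}

// Example: utilityDisplay({ chat: 0.82, reason: 0.91, code: 0.88, vision: 0.70 })
// → 0.30*0.82 + 0.35*0.91 + 0.20*0.88 + 0.15*0.70 = 0.8455 → 84.55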

We keep the presets visible in the UI tabs so the blend is transparent.

How it’s used with ranking

Utility feeds the optional rank score alongside cost and latency:

rank_score =
  wU * norm(utility_display) -
  wC * norm(cost_per_100)    -
  wL * norm(latency)

Default weights: wU 0.55 · wC 0.30 · wL 0.15. The chart shows Utility vs Cost; rank_score is for internal sorting or future views.
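
A minimal sketch, assuming norm() is min–max scaling across the displayed model set (the exact normalizer is our assumption; the formula above does not pin it down):

// Assumption: norm() = min–max scaling over the displayed models.
function minMaxNorm(values: number[]): number[] {
  const lo = Math.min(...values);
  const hi = Math.max(...values);
  const span = hi - lo || 1; // guard against identical values
  return values.map((v) => (v - lo) / span);
}

type ModelRow = { utilityDisplay: number; costPer100: number; latencyMs: number };

// Default weights from above: wU 0.55 · wC 0.30 · wL 0.15.
function rankScores(rows: ModelRow[], wU = 0.55, wC = 0.30, wL = 0.15): number[] {
  const u = minMaxNorm(rows.map((r) => r.utilityDisplay));
  const c = minMaxNorm(rows.map((r) => r.costPer100));
  const l = minMaxNorm(rows.map((r) => r.latencyMs));
  return rows.map((_, i) => wU * u[i] - wC * c[i] - wL * l[i]);
}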

Notes on normalization & updates
  • Source signals include public benchmarks and provider guidance; we map them into a stable 0–1 space to avoid overreacting to small paper deltas.
  • The displayed range 70–100 helps visual separation in the higher band where most current models sit; raw 0–1 values are preserved internally.
  • We refresh weights and mappings together with data updates; see Changelog.

How we calculate cost per 100 replies

USD · per-1M-token pricing

Prices are taken from official provider pages. We compute the displayed Y value using a simple, reproducible scenario (you can adapt it to your workload).

Assumptions

  • Default tokens/reply: 200 input / 800 output
  • Batch size: 100 replies
  • Units: provider pricing per 1M tokens (USD)

Formula

cost_per_100 =
  100 * (input_tokens_per_reply  / 1_000_000) * price_in_per_1M
+ 100 * (output_tokens_per_reply / 1_000_000) * price_out_per_1M

If cache or tiered pricing applies, we use standard on-demand rates unless noted.

Worked example (default scenario)

For a model priced at input = $1.25 / 1M and output = $10.00 / 1M:

input  cost = 100 * (200 / 1_000_000) * 1.25  = $0.025
output cost = 100 * (800 / 1_000_000) * 10.00 = $0.800
-----------------------------------------------
total cost per 100 replies                 = $0.825

Change these assumptions in code to match your workload (e.g., longer outputs, cache hits, batch APIs).
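
A minimal sketch with the default scenario baked in (TypeScript; names are illustrative):

// Cost per batch of replies from per-1M-token USD prices.
// Defaults match the scenario above and reproduce the worked example.
function costPer100(
  priceInPer1M: number,
  priceOutPer1M: number,
  inputTokensPerReply = 200,
  outputTokensPerReply = 800,
  replies = 100,
): number {
  const inputCost = replies * (inputTokensPerReply / 1_000_000) * priceInPer1M;
  const outputCost = replies * (outputTokensPerReply / 1_000_000) * priceOutPer1M;
  return inputCost + outputCost;
}

// costPer100(1.25, 10.00) → 0.825 (the worked example above)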

Tip: the pinned details card in the chart links here and also deep-links to provider pricing and docs when available.

Caveats

  • Provider prices and context limits change; provider pages are the source of truth.
  • Utility is directional; it is not a substitute for task-specific evaluations.
  • Latency varies by region, batch size, network, and provider load.

Disclaimer

No warranty

The information on this site is provided “as is” and “as available”, without warranties of any kind, express or implied, including but not limited to accuracy, completeness, fitness for a particular purpose, or non-infringement. No guarantee is made that prices, capabilities, context limits, latencies, or links are current, error-free, or applicable to your use case. This is not advice (technical, legal, financial, or otherwise) and does not constitute an endorsement of any provider or model. You are solely responsible for evaluating suitability and risks for your workload. Always validate with your own evaluations. We disclaim all liability for any loss or damage arising from reliance on this content.

Questions or corrections? Contact & Feedback