Model Scaling Law Performance Calculator

Editorial review by: JJ Ben-Joseph

Empirical scaling laws summarize how training loss tends to improve as you increase training resources. This calculator focuses on the common “data scaling” relationship where loss decreases as a power law in the number of training tokens.

Introduction: What this calculator estimates

Given a baseline training run with token count N₀ and observed training loss L₀, plus a scaling exponent α and an irreducible loss floor B, the calculator:

Fits the constant A implied by the baseline point.
Projects the expected loss at a new token count N₁.
Optionally estimates how many tokens would be required to reach a target loss.

Definitions and units

Tokens (N): total training tokens consumed (often “tokens seen,” including repeats across epochs). Use the same definition for N₀ and N₁.
Loss (L): typically training cross-entropy / negative log-likelihood (NLL) averaged per token. It is not accuracy.
Irreducible loss (B): a floor the power law approaches as N grows; it captures limits from data quality, architecture, optimization, label noise, and evaluation setup.
Scaling exponent (α): positive number controlling diminishing returns. Larger α means faster improvement with added tokens.

The scaling-law formula

The calculator uses the common form:

L(N) = A × N^−α + B

Solving for A from the baseline

Using the baseline observation (N₀, L₀):

A = (L₀ − B) × N₀^α

This requires L₀ > B. If L₀ is less than or equal to B, the fitted A is non-positive and the model no longer represents a diminishing-loss curve.
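A minimal Python sketch of this step, assuming plain float inputs. The function name fit_constant_a and its argument names are illustrative and are not taken from the calculator's own code:

```python
def fit_constant_a(n0: float, l0: float, alpha: float, b: float) -> float:
    """Return A such that A * n0**(-alpha) + b equals the observed loss l0."""
    if n0 <= 0:
        raise ValueError("n0 must be a positive token count")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if l0 <= b:
        # A would be zero or negative, so the curve would not decrease toward b.
        raise ValueError("baseline loss l0 must be greater than the floor b")
    return (l0 - b) * n0 ** alpha


a = fit_constant_a(n0=1_000_000, l0=2.5, alpha=0.1, b=1.0)
print(round(a, 3))  # 5.972
```

Rejecting L₀ ≤ B up front is one possible design choice; it keeps a non-positive A from silently flowing into later projections.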

Projecting loss at N₁

Once A is known, the projected loss at N₁ is:

L(N₁) = A × N₁^−α + B
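The projection can be sketched in Python as follows. The name project_loss is hypothetical, and the constant A is recomputed inline from an assumed baseline (N₀ = 1,000,000, L₀ = 2.5, α = 0.1, B = 1.0) so the snippet runs on its own:

```python
def project_loss(a: float, n: float, alpha: float, b: float) -> float:
    """Evaluate L(n) = a * n**(-alpha) + b for a positive token count n."""
    if n <= 0:
        raise ValueError("n must be a positive token count")
    return a * n ** (-alpha) + b


# Assumed baseline, used only to obtain a value for A.
alpha, b = 0.1, 1.0
a = (2.5 - b) * 1_000_000 ** alpha

for n in (1_000_000, 5_000_000, 25_000_000):
    print(f"N = {n:>11,}  projected loss = {project_loss(a, n, alpha, b):.3f}")
```

Each 5× increase in tokens removes a smaller slice of loss than the one before it, which is the diminishing-returns behavior the formula encodes.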

Solving for tokens needed to reach a target loss

If you provide a target loss L_target (must satisfy L_target > B), then:

N_target = (A / (L_target − B))^(1/α)
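A hedged Python sketch of the inversion, with the hypothetical name tokens_for_target. The value of A comes from an assumed baseline (N₀ = 1,000,000, L₀ = 2.5, α = 0.1, B = 1.0), chosen only so the example is self-contained:

```python
def tokens_for_target(a: float, l_target: float, alpha: float, b: float) -> float:
    """Invert L(N) = a * N**(-alpha) + b to find N for a given target loss."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if l_target <= b:
        # The curve only approaches b asymptotically, so no finite N reaches it.
        raise ValueError("target loss must be greater than the floor b")
    return (a / (l_target - b)) ** (1.0 / alpha)


alpha, b = 0.1, 1.0
a = (2.5 - b) * 1_000_000 ** alpha  # assumed baseline point

for target in (2.0, 1.5, 1.1):
    print(f"target {target}: {tokens_for_target(a, target, alpha, b):.3e} tokens")
```

The required token count grows steeply as the target moves toward B, because the gap (L_target − B) sits in the denominator and is raised to the power 1/α.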

How to interpret the results

Projected loss is an estimate of training loss after convergence under a similar training recipe. It does not automatically translate to downstream task performance.
Marginal gains shrink as N increases because N^−α flattens; the curve asymptotically approaches B.
Sensitivity to α is high: small changes in α can cause large changes in N_target, especially when you aim close to B.
Extrapolation risk rises the further N₁ is from N₀. Scaling laws often hold over ranges, not indefinitely.
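To make the sensitivity to α concrete, the sketch below holds one baseline point and one target fixed and varies only α. It uses the form N_target = N₀ × ((L₀ − B) / (L_target − B))^(1/α), which follows from substituting the fitted A into the target formula. The function name and the specific numbers are assumptions chosen for illustration:

```python
def tokens_needed(n0: float, l0: float, l_target: float,
                  alpha: float, b: float) -> float:
    """Tokens to reach l_target when the curve is pinned to the point (n0, l0)."""
    if not (l0 > b and l_target > b):
        raise ValueError("both l0 and l_target must exceed the floor b")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return n0 * ((l0 - b) / (l_target - b)) ** (1.0 / alpha)


# Illustrative inputs: same baseline and target, alpha shifted by +/- 20%.
for alpha in (0.08, 0.10, 0.12):
    n = tokens_needed(n0=1_000_000, l0=2.5, l_target=1.5, alpha=alpha, b=1.0)
    print(f"alpha = {alpha:.2f}  ->  N_target = {n:.2e}")
```

With these inputs the three estimates come out near 9.2 × 10¹¹, 5.9 × 10¹⁰, and 9.5 × 10⁹ tokens, so a 20% change in α moves the answer by roughly an order of magnitude.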

Worked example (matches the default inputs)

Suppose you observed:

N₀ = 1,000,000 tokens
L₀ = 2.5
α = 0.1
B = 1.0
N₁ = 5,000,000 tokens

First compute A:

N₀^α = (1,000,000)^0.1 = 10^0.6 ≈ 3.981
A = (2.5 − 1.0) × 3.981 ≈ 5.972

Now project loss at N₁:

N₁^−α = (5,000,000)^−0.1 ≈ 0.214
L(N₁) ≈ 5.972 × 0.214 + 1.0 ≈ 2.28

If you also set L_target = 1.5:

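N_target = (5.972 / (1.5 − 1.0))^(1/0.1) = (11.944)¹⁰ ≈ 5.9 × 10¹⁰ tokens

This illustrates a common takeaway: pushing training loss close to the floor B can require enormous increases in tokens. Here, cutting the reducible loss (L − B) from 1.5 to 0.5 takes about 59,000 times the baseline token count.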
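As a cross-check on the arithmetic, substituting A = (L₀ − B) × N₀^α into the two formulas removes A entirely. The derivation below is a sketch in LaTeX (assuming the amsmath package for the aligned environment) using the same example inputs:

```latex
\[
\begin{aligned}
L(N_1) &= B + (L_0 - B)\left(\frac{N_0}{N_1}\right)^{\alpha}
        = 1.0 + 1.5 \times 5^{-0.1} \approx 2.28, \\
N_{\text{target}} &= N_0 \left(\frac{L_0 - B}{L_{\text{target}} - B}\right)^{1/\alpha}
        = 10^{6} \times 3^{10} = 5.9049 \times 10^{10}.
\end{aligned}
\]
```

Written this way, only the ratios N₀/N₁ and (L₀ − B)/(L_target − B) matter, which avoids rounding error from the intermediate value of A.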

Quick comparison: what changes when you scale data?

Change	What happens to L(N)?	Practical implication
Increase N (more tokens)	L decreases roughly as N^−α until it nears B	Diminishing returns; biggest gains are earlier
Increase α	Curve falls faster with N	Fewer extra tokens needed for the same loss drop
Increase B	Loss floor rises; all projections shift upward	Data/architecture/label noise may be limiting
Increase L₀ with same N₀	Implied A increases	Worse baseline implies higher losses at all N unless you change B or α

Assumptions & limitations (read before acting on projections)

Single-factor scaling: This page only scales with token count. If you also change model size, context length, optimizer, batch size, training schedule, data mixture, or augmentation, the fitted A, α, and B may change.
Training vs validation: The formula is most often reported for training loss (or validation loss under fixed evaluation). If your L₀ is validation loss, keep that consistent for all comparisons, and expect different parameters than training loss.
Domain and data quality dependence: Scaling behavior depends strongly on dataset distribution and cleanliness. Adding low-quality tokens can reduce effective gains; in extreme cases it can worsen outcomes even if N increases.
Requirement that L > B: You must choose B smaller than your observed and target losses. If B is at or above L₀, the fitted A is zero or negative; if B is at or above L_target, N_target is undefined.
Extrapolation limits: Power-law fits are empirical. Predictions far beyond the baseline (orders of magnitude) can be unreliable; use them as planning heuristics, not guarantees.
Not a downstream KPI forecast: A lower cross-entropy does not map linearly to metrics like accuracy, win rate, or task-specific scores; improvements may saturate differently.
Token accounting ambiguity: “Tokens” may mean unique tokens, total tokens seen, or effective tokens after deduplication/filtering. Mixing conventions will distort results.

Practical tips

If you have multiple runs, fit α and B (or at least α) from data rather than borrowing a generic value.
Use N₁/N₀ as a sanity check: if you only scale tokens by 2× and α is small (e.g., 0.05–0.1), expect modest loss changes.
Treat N_target near B as a warning sign: it often implies that improving data quality or architecture may be more cost-effective than brute-force scaling.
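One simple way to fit α from several runs, sketched here under the assumption that B is already fixed: taking logs gives log(L − B) = log A − α × log N, so an ordinary least-squares line through the (log N, log(L − B)) points yields −α as the slope and log A as the intercept. The function name is hypothetical, and the data points are synthetic values generated from the formula itself, not real measurements:

```python
import math


def fit_alpha_and_a(runs, b):
    """Least-squares fit of log(L - b) = log(A) - alpha * log(N).

    runs: sequence of (tokens, loss) pairs; b: assumed loss floor.
    Returns (alpha, a).
    """
    xs, ys = [], []
    for n, loss in runs:
        if n <= 0 or loss <= b:
            raise ValueError("each run needs n > 0 and loss > b")
        xs.append(math.log(n))
        ys.append(math.log(loss - b))
    if len(xs) < 2:
        raise ValueError("need at least two runs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("runs must use at least two distinct token counts")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic, noise-free points from L(N) = 5.972 * N**-0.1 + 1.0 (illustration only).
synthetic_runs = [(n, 5.972 * n ** -0.1 + 1.0) for n in (1e6, 2e6, 5e6, 1e7)]
alpha_hat, a_hat = fit_alpha_and_a(synthetic_runs, b=1.0)
print(f"alpha = {alpha_hat:.4f}, A = {a_hat:.3f}")
```

On real runs the points will be noisy and B itself is uncertain, so it is worth repeating the fit for a few plausible values of B and checking how much α moves.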

How to use this calculator

Enter Baseline Dataset Tokens (N₀) as a total token count, using the same counting convention you will use for N₁.
Enter Observed Baseline Loss (L₀) as the average per-token loss measured at N₀.
Enter Scaling Exponent (α) as a positive, dimensionless number.
Enter the irreducible loss (B) and the new token count (N₁), and optionally a target loss. Both L₀ and the target loss must be greater than B.
Run the calculation and compare the output with a second scenario before acting on it.
