Algorithmic Fairness Bias Metric Calculator

Editorial review by: JJ Ben-Joseph

What this calculator measures

Algorithmic fairness discussions often sound abstract until you look at the actual decisions a model made. This calculator turns that audit into something concrete. You enter the confusion matrix counts for two groups, and the page computes three comparison metrics: demographic parity difference, the positive prediction rate ratio for group B divided by group A, and equal opportunity difference. Those numbers do not settle every policy or legal question, but they do show whether the model is selecting people at different rates and whether truly positive cases are being recognized at different rates across groups.

That focus matters because a model can appear reasonable in the aggregate while still producing noticeably different outcomes for different populations. A single overall accuracy score can hide that one group receives more positive predictions, or that one group's qualified cases are missed more often. By splitting the confusion matrix by group, this calculator makes those differences visible. It is useful for model audits, threshold reviews, classroom exercises, and sanity checks before a more detailed fairness assessment.

It is also important to be precise about what the output means. The calculator does not infer protected classes, moral intent, or regulatory compliance. It simply measures outcome patterns from the counts you provide. Think of it as a diagnostic instrument: it helps you see whether the current decision rule is balanced in one particular way, and it helps you compare scenarios consistently when you adjust data, thresholds, or model settings.

How to read the inputs

Each group needs four counts from a confusion matrix built on the same prediction task. The labels are standard, but they are easy to mix up when teams use different reporting conventions, so it is worth pausing here. A true positive is a case the model predicted as positive and that really was positive. A false positive is a case the model predicted as positive that was actually negative. A false negative is a positive case the model missed by predicting negative. A true negative is a case the model predicted as negative and that really was negative. Enter raw counts, not percentages, and make sure both groups come from the same dataset slice, time window, and threshold.

True positives: positive decisions that were correct.
False positives: positive decisions that were incorrect.
False negatives: missed positive cases.
True negatives: negative decisions that were correct.
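
If your counts come from row-level records rather than a finished report, a tally like the one below can help keep the four cells straight. It is a minimal sketch in Python, assuming labels and predictions are coded as 0 and 1; the function name `confusion_counts` is illustrative and not part of the calculator.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Tally TP, FP, FN, TN from parallel sequences of 0/1 labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must have the same length")
    cells = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            # Positive decision: correct if the case really was positive.
            cells["TP" if actual == 1 else "FP"] += 1
        else:
            # Negative decision: a missed positive is a false negative.
            cells["FN" if actual == 1 else "TN"] += 1
    # Counter returns 0 for cells that never occurred.
    return {name: cells[name] for name in ("TP", "FP", "FN", "TN")}
```

Run it once per group on the same evaluation slice, then enter the four resulting counts for each group.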

The calculator treats these values as simple counts. There are no hidden units. That simplicity is helpful, but it also means you must supply internally consistent numbers. If one group uses monthly counts and the other uses annual counts, the fairness comparison will be meaningless. If you only have rates, convert them back to counts before using this tool. Counts make edge cases easier to spot too, such as a group with no positive ground-truth examples, which makes the true positive rate undefined.

The default values in the form are only a worked example. They are not recommended targets and they are not evidence of what a fair model should look like. Replace them with your own confusion matrix counts. Once you do, the result panel becomes a compact summary of how the model is behaving for the two groups you care about.

Formulas behind the results

At the broadest level, any calculator is a function that maps inputs to outputs. The two expressions below describe that general idea: first, a result built from several inputs, and then a weighted sum showing how different components can contribute differently to a total. In fairness auditing, those general patterns become specific rate formulas built from confusion matrix counts.

R = f (x_{1}, x_{2}, \dots, x_{n})

T = \sum_{i = 1}^{n} w_{i} \cdot x_{i}

For this calculator, the most important group-level rate is the positive prediction rate. That is the share of all cases in a group that received a positive prediction, whether the prediction was right or wrong. It is the quantity used in demographic parity calculations.

PositiveRate = \frac{TP + FP}{TP + FP + FN + TN}
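
The positive prediction rate formula can be sketched in Python as follows. The function name and the choice to return `None` for an empty group are assumptions made for illustration, not a description of the page's own code.

```python
def positive_rate(tp, fp, fn, tn):
    """Share of all cases in a group that received a positive prediction."""
    total = tp + fp + fn + tn
    if total == 0:
        # Assumption: an empty group has no defined rate, so signal that with None.
        return None
    return (tp + fp) / total
```

For example, `positive_rate(50, 10, 20, 120)` divides 60 positive predictions by 200 cases.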

The second rate is the true positive rate, sometimes called recall or sensitivity on the positive class. It asks a narrower question: among the cases that truly were positive, how many did the model correctly identify? Equal opportunity focuses on the gap in this rate between groups.

TPR = \frac{TP}{TP + FN}
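
A matching sketch for the true positive rate is shown below. As before, the function name is hypothetical, and returning `None` when a group has no actual positives is an assumed way of representing an undefined rate.

```python
def true_positive_rate(tp, fn):
    """Share of truly positive cases that the model predicted as positive."""
    actual_positives = tp + fn
    if actual_positives == 0:
        # Assumption: with no actual positives the rate is undefined, not zero.
        return None
    return tp / actual_positives
```

Note that false positives and true negatives are not arguments at all, which mirrors the formula.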

The calculator then computes the difference in positive prediction rates between group A and group B, expressed in percentage points. A result near zero means the two groups receive positive predictions at similar overall rates. A positive sign means group A is being predicted positive more often than group B. A negative sign means the opposite.

DemographicParityDifference = ({PositiveRate}_{A} - {PositiveRate}_{B}) \times 100\%

Next, it reports the positive rate ratio as group B divided by group A. A ratio of 1.00 means the groups have equal positive prediction rates. Ratios below 1 mean group B receives fewer positive predictions relative to group A. Ratios above 1 mean group B receives more.

PositiveRateRatio = \frac{{PositiveRate}_{B}}{{PositiveRate}_{A}}
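
The two parity comparisons can be sketched together in Python. The function name `parity_metrics` is illustrative, and returning `None` for the ratio when group A has a zero positive rate is an assumption; the page does not state how that case is handled.

```python
def parity_metrics(rate_a, rate_b):
    """Return (difference in percentage points, ratio B/A) for two positive prediction rates."""
    # Difference is group A minus group B, scaled to percentage points.
    difference_pp = (rate_a - rate_b) * 100
    # Ratio is group B divided by group A; undefined if group A's rate is zero.
    ratio = rate_b / rate_a if rate_a > 0 else None
    return difference_pp, ratio
```

Because the difference is A minus B while the ratio is B over A, a positive difference goes with a ratio below 1.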

Finally, the equal opportunity difference compares true positive rates. This metric does not use true negatives or false positives; it focuses on whether truly positive (qualified) cases are being found at similar rates in both groups. Again, a value near zero indicates closer alignment, while the sign shows which group has the higher true positive rate: positive means group A, negative means group B.

EqualOpportunityDifference = ({TPR}_{A} - {TPR}_{B}) \times 100\%
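
A short Python sketch of the equal opportunity difference follows. The function name is hypothetical, and returning `None` when either group lacks actual positives is an assumed convention for the undefined case.

```python
def equal_opportunity_difference(tp_a, fn_a, tp_b, fn_b):
    """TPR of group A minus TPR of group B, in percentage points."""
    positives_a = tp_a + fn_a
    positives_b = tp_b + fn_b
    if positives_a == 0 or positives_b == 0:
        # Assumption: if either TPR is undefined, so is the difference.
        return None
    return (tp_a / positives_a - tp_b / positives_b) * 100
```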

One practical detail is worth stressing: these metrics can move independently. A model can have almost no demographic parity gap and still have a noticeable equal opportunity gap. That is not a bug in the calculator. It reflects the fact that fairness metrics ask different questions about the same model behavior.

Worked example using the default values

Suppose group A has 50 true positives, 10 false positives, 20 false negatives, and 120 true negatives. The total number of cases for group A is 200. Its positive prediction rate is therefore 60 divided by 200, which equals 0.30 or 30.00%. Its true positive rate is 50 divided by 70, which is about 0.7143 or 71.43%.

Now suppose group B has 40 true positives, 15 false positives, 30 false negatives, and 100 true negatives. The total is 185. Its positive prediction rate is 55 divided by 185, or about 29.73%. Its true positive rate is 40 divided by 70, or about 57.14%.

With those two group summaries, the calculator reports a demographic parity difference of about 0.27 percentage points, which is 30.00% minus 29.73%. The positive rate ratio, computed as group B divided by group A, is about 0.99. Both figures suggest that overall selection rates are nearly the same. But the equal opportunity difference is much larger, at about 14.29 percentage points (71.43% minus 57.14%), because the model correctly identifies truly positive cases in group A more often than in group B. That contrast is exactly why people use more than one fairness metric.

Default example broken into intermediate rates
Group	Total cases	Positive prediction rate	True positive rate
A	200	30.00%	71.43%
B	185	29.73%	57.14%
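
The arithmetic in this worked example can be reproduced with a few lines of Python. This is a sketch for checking the numbers by hand, not the page's actual implementation; the dictionary layout and variable names are assumptions.

```python
# Default example counts from the worked example.
group_a = {"TP": 50, "FP": 10, "FN": 20, "TN": 120}
group_b = {"TP": 40, "FP": 15, "FN": 30, "TN": 100}


def rates(counts):
    """Return (positive prediction rate, true positive rate) for one group."""
    total = sum(counts.values())
    positive_rate = (counts["TP"] + counts["FP"]) / total
    tpr = counts["TP"] / (counts["TP"] + counts["FN"])
    return positive_rate, tpr


pr_a, tpr_a = rates(group_a)
pr_b, tpr_b = rates(group_b)

summary = {
    # Differences are group A minus group B, in percentage points.
    "demographic_parity_difference_pp": round((pr_a - pr_b) * 100, 2),
    # The ratio is group B divided by group A.
    "positive_rate_ratio_b_over_a": round(pr_b / pr_a, 2),
    "equal_opportunity_difference_pp": round((tpr_a - tpr_b) * 100, 2),
}
print(summary)
```

With the default counts, the three values come out to 0.27, 0.99, and 14.29, matching the figures above.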

Interpreting that result in plain language, the model appears to select the two groups at almost the same overall rate, but it is better at capturing actual positives in group A than in group B. If you were reviewing a threshold, that would tell you not to stop at parity alone. You would probably inspect score distributions, calibration, data quality, and threshold sensitivity next.

How to interpret the result panel

When you press the calculate button, the result area returns three compact outputs. The first is demographic parity difference in percentage points. Values near zero mean the overall positive decision rates are similar. The second is the positive rate ratio for group B divided by group A. Values near 1.00 mean the rates are similar on a multiplicative scale. The third is equal opportunity difference in percentage points, which highlights how differently the model treats truly positive cases across groups.

There is no universal numeric threshold that proves fairness in every domain. Hiring, lending, insurance, education, healthcare, and criminal justice all involve different stakes and sometimes different legal standards. Use the numbers as evidence about behavior, not as a complete verdict. A small observed gap may still deserve attention if the affected population is large or the harm from missed positives is severe. A larger gap may sometimes be explained by tiny sample sizes, but it should still prompt investigation rather than dismissal.

The sign convention matters too. Because the difference metrics are computed as group A minus group B, a positive value means group A has the higher rate and a negative value means group B has the higher rate. The ratio goes the other way on purpose: it is group B divided by group A. That is why a ratio below 1 and a positive difference can appear together. They are simply two ways of describing the same directional comparison.

The copy button is useful when you are running scenario analysis. You can change one assumption at a time, copy each summary, and keep a small record of how the fairness metrics respond. That kind of disciplined comparison is far more informative than guessing from memory after several threshold tweaks.

Assumptions, limits, and good practice

This calculator assumes that each case is counted in exactly one confusion matrix cell and that both groups were evaluated under the same model and decision policy. If you compare counts from different thresholds, different time periods, or different labeling standards, the metrics will still compute, but the comparison will not be reliable. Consistency of measurement is part of fairness analysis.

Another limit is sample size. Rates derived from very small groups can swing sharply when only a few records change. If one group has only a handful of positive cases, the equal opportunity difference can jump around dramatically. In a formal audit, you would usually pair these point estimates with confidence intervals, uncertainty analysis, or at least a note that the denominator is small. This page does not estimate uncertainty; it reports the direct metrics from the counts you enter.
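
Since the page itself reports only point estimates, the sketch below shows one common way an audit might attach an interval to a rate: the Wilson score interval for a proportion. It is offered as an illustration under stated assumptions (independent cases, a normal-approximation multiplier of 1.96 for roughly 95% coverage), and the function name is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if trials == 0:
        # Assumption: no interval can be formed without observations.
        return None
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width
```

For group B's true positive rate in the default example, `wilson_interval(40, 70)` gives a range of roughly 45% to 68% around the 57.14% point estimate, which shows how much uncertainty 70 positive cases still leave.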

It also helps to remember that fairness criteria can conflict. You may be able to reduce the parity gap by changing the decision threshold, only to increase the true positive rate gap or the false positive rate gap. That tension is a common feature of real systems, especially when base rates and score distributions differ across groups. The right response is not to hunt for a single magic metric. It is to decide which harms matter most in your domain, document the tradeoffs, and review them transparently.
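
The tension described above can be seen in a toy threshold sweep. Everything in this sketch is hypothetical: the scores and labels are invented for illustration, the two groups are given different base rates on purpose, and a case is treated as a positive prediction when its score is at or above the threshold.

```python
# Hypothetical (score, actual_label) pairs; group A has 4 of 8 positives, group B has 2 of 8.
group_a = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 0), (0.3, 1), (0.2, 1), (0.15, 0), (0.1, 0)]
group_b = [(0.95, 1), (0.55, 0), (0.45, 1), (0.42, 0), (0.3, 0), (0.25, 0), (0.2, 0), (0.1, 0)]


def rates_at_threshold(cases, threshold):
    """Return (positive prediction rate, true positive rate) at a given threshold."""
    selected_labels = [actual for score, actual in cases if score >= threshold]
    actual_positives = sum(actual for _, actual in cases)
    positive_rate = len(selected_labels) / len(cases)
    tpr = sum(selected_labels) / actual_positives
    return positive_rate, tpr


def gaps(threshold):
    """Return (parity gap, TPR gap) as group A minus group B, in percentage points."""
    pr_a, tpr_a = rates_at_threshold(group_a, threshold)
    pr_b, tpr_b = rates_at_threshold(group_b, threshold)
    return (pr_a - pr_b) * 100, (tpr_a - tpr_b) * 100


for t in (0.5, 0.4):
    print(t, gaps(t))
```

In this invented data, a threshold of 0.5 gives a 25 point parity gap with no true positive rate gap, while lowering it to 0.4 closes the parity gap and opens a 50 point true positive rate gap in favor of group B.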

Counts must be non-negative: negative values have no interpretation here.
Each group needs observations: if a group total is zero, rates cannot be computed.
Equal opportunity needs actual positives: if TP plus FN is zero for a group, its true positive rate is undefined.
One metric is not the whole story: combine these results with broader model review and domain context.
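
The first three checks in this list are mechanical and can be sketched as a small validation step. The function name and message wording below are assumptions for illustration; the page's own validation may differ.

```python
def validate_group(name, tp, fp, fn, tn):
    """Return a list of problems found in one group's counts (empty if none)."""
    problems = []
    counts = {"TP": tp, "FP": fp, "FN": fn, "TN": tn}
    for cell, value in counts.items():
        if value < 0:
            problems.append(f"{name}: {cell} is negative")
    if problems:
        # Totals are not meaningful until negative values are fixed.
        return problems
    if sum(counts.values()) == 0:
        problems.append(f"{name}: group total is zero, so rates cannot be computed")
    elif tp + fn == 0:
        problems.append(f"{name}: no actual positives, so the true positive rate is undefined")
    return problems
```

Calling `validate_group("A", 50, 10, 20, 120)` returns an empty list, while a group with only negative ground-truth cases yields a message about the undefined true positive rate.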

Used thoughtfully, this calculator is a fast way to move a fairness conversation from vague impressions to explicit, checkable arithmetic. That is often the most valuable first step: everyone can see the same inputs, the same formulas, and the same consequences of changing them.

Enter non-negative confusion matrix counts for both groups. Use counts from the same model, the same threshold, and the same evaluation slice so the comparison is meaningful.

Mini-game: Fairness Threshold Tuner

This optional mini-game turns the calculator idea into a short live simulation. Applicants from group A and group B stream toward a decision beam with a model score attached. Drag the horizontal threshold line up or down before they cross the beam. Cases above the line receive a positive prediction. The goal is not just raw accuracy. You score best when you keep both the positive-rate gap and the true-positive-rate gap under control while the score distributions shift over time.

Click to play. Drag or tap to move the threshold line. Applicants above the line become positive predictions when they hit the beam. Aim for high accuracy, but remember that a line that feels good for one group can widen parity and opportunity gaps for the other.

Pointer or touch: drag anywhere on the canvas.
Keyboard fallback: use the up and down arrow keys after focusing the game.
Watch the HUD: score rewards good decisions and fairness control, while trust drains when mistakes and gaps pile up.

Optional practice mode: use it to feel how the same threshold can keep one metric steady while another drifts.
