Confusion Matrix Pro

Edit any cell in the matrix on the right and every metric updates instantly. Rows are actual classes, columns are predicted. Optionally rename the two classes, or load a quick preset to see how the metrics shift at the extremes.

Class Labels

Positive class

Negative class

Optional. Renames the matrix axes only (e.g. Dog / Cat). All metrics keep their standard names; the positive class is the one that drives Sensitivity, Precision, and the rest.

Quick Presets

Predictions & Threshold

Have raw model outputs instead of counts? Paste predicted probabilities (0–1) and the true labels (0 or 1), or upload a CSV. The matrix is built for you, and you get a decision-threshold slider to tune the cut-off live and compare any two thresholds.

📁 Upload CSV

Wide CSVs are fine: pick the prediction and label columns after upload. A header row is optional.

Predicted Probabilities

Actual Labels (0 or 1)

Edit any cell in the matrix on the right and every metric updates instantly. Rows are actual classes, columns are predicted. Set the number of classes (3–10) and optionally name them below.

Class Setup

Number of classes

Class labels

Optional. Comma-separated names rename the matrix rows and columns. Anything missing falls back to Class 1, Class 2…

Confusion Matrix

	Predicted +	Predicted −	Total
Actual +	TP	FN	2008
Actual −	FP	TN	1961
Total	1945	2024	3969

Core Metrics

Accuracy

88.5%

(TP+TN)/Total

Sensitivity

88.5%

TP/(TP+FN)

also: Recall · TPR · Power (1−β)

Specificity

88.5%

TN/(TN+FP)

also: TNR

Precision

88.5%

TP/(TP+FP)

also: PPV

NPV

89.3%

TN/(TN+FN)

Prevalence

48.0%

(TP+FN)/Total

Composite Scores

F1-Score

88.5%

2(P×R)/(P+R)

Balanced Acc

88.5%

(Sens+Spec)/2

MCC

0.77

Correlation

Kappa

0.77

Agreement

Errors & Effectiveness

False Positive Rate (FPR)

11.5%

FP/(TN+FP)

also: Type I Error (α)

False Negative Rate (FNR)

11.5%

FN/(TP+FN)

also: Type II Error (β)

Youden's J

0.77

Sens+Spec−1

Markedness

0.77

PPV+NPV−1

Ratios & Complements

LR+

7.69

TPR/FPR

LR−

0.13

FNR/TNR

Diagnostic Odds Ratio (DOR)

59.00

LR+ / LR−

Error Rate

11.5%

(FP+FN)/Total

False Discovery Rate (FDR)

12.4%

FP/(TP+FP)

False Omission Rate (FOR)

10.7%

FN/(TN+FN)

Multi-class Metrics

Accuracy

Macro F1

Weighted F1

Cohen's Kappa

Class	Precision	Recall	F1	Support

Metric Definitions & Terminology

Core Metrics

Accuracy

The number everyone asks for first, and the one most likely to fool you. It is simply the share of predictions you got right, lumping correct positives and correct negatives together. When the two classes are roughly balanced, that is an honest summary. But when one class dominates, accuracy flatters lazy models: if 95% of cases are negative, a model that blindly answers "negative" every time scores 95% while telling you nothing. Read it first, then immediately check a metric that accounts for the imbalance.

Formula: (TP + TN) / (TP + TN + FP + FN)
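To see how accuracy flatters a lazy model, here is a minimal sketch of the 95%-negative scenario described above. The counts are hypothetical (a made-up test set of 1,000 cases), and the helper names are illustrative rather than anything the tool exposes.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Share of genuinely positive cases that were caught."""
    return tp / (tp + fn)


# Hypothetical test set: 1,000 cases, 95% of them negative.
# A model that always answers "negative" never makes a positive call.
tp, fn = 0, 50     # all 50 real positives are missed
tn, fp = 950, 0    # all 950 negatives are "right" by default

always_negative_accuracy = accuracy(tp, tn, fp, fn)
always_negative_sensitivity = sensitivity(tp, fn)

print(f"accuracy    = {always_negative_accuracy:.1%}")     # 95.0%
print(f"sensitivity = {always_negative_sensitivity:.1%}")  # 0.0%
```

The headline number looks excellent while the model catches nothing, which is why accuracy should be read next to a metric that looks at each class separately.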

Sensitivity · Recall · TPR · Power (1 − β)

Of everything that was genuinely positive, how much did the model actually catch? The same calculation travels under several names: machine learning calls it recall, ROC analysis calls it the true positive rate (TPR), and classical statistics calls it power, written 1 − β. Think of it as the model's thoroughness. It is the metric you care about most when a miss is the expensive kind of mistake: an undetected tumour, a fraud that slips through, a search-and-rescue target left behind. It says nothing about false alarms, so it is best read alongside specificity or precision.

Formula: TP / (TP + FN)

Specificity · TNR

The mirror image of sensitivity: of everything that was genuinely negative, how much did the model correctly leave alone? ROC analysis calls the same quantity the true negative rate (TNR). High specificity means the model rarely cries wolf. You lean on it whenever a false alarm is costly or disruptive: sending a healthy patient for invasive tests, or dropping a real email into the spam folder.

Formula: TN / (TN + FP)

Precision · PPV

Sensitivity asks "did we catch everything?" Precision asks the trust question instead: "when the model says positive, should I believe it?" In clinical fields the same quantity is called positive predictive value, or PPV. It is the fraction of positive calls that turned out to be correct, and high precision means few false alarms. One catch worth remembering: this number depends heavily on how common the positive class is. The rarer the positive class, the harder it becomes to keep precision high, no matter how good the model is.

Formula: TP / (TP + FP)
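The prevalence effect can be made concrete with Bayes' rule. The sketch below holds a hypothetical model fixed at 90% sensitivity and 90% specificity and only changes how common the positive class is; the function name `ppv` and the chosen numbers are assumptions for illustration.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (PPV) implied by sensitivity, specificity, and prevalence."""
    true_positive_share = sensitivity * prevalence
    false_positive_share = (1 - specificity) * (1 - prevalence)
    return true_positive_share / (true_positive_share + false_positive_share)


# Same hypothetical model each time; only the prevalence changes.
precision_by_prevalence = {
    prevalence: ppv(0.90, 0.90, prevalence) for prevalence in (0.50, 0.10, 0.01)
}

for prevalence, precision in precision_by_prevalence.items():
    print(f"prevalence {prevalence:.0%}: precision {precision:.1%}")
# prevalence 50%: precision 90.0%
# prevalence 10%: precision 50.0%
# prevalence 1%: precision 8.3%
```

Nothing about the model changed between the three rows; the drop in precision comes entirely from the positive class becoming rarer.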

Negative Predictive Value (NPV)

The counterpart of precision on the other side of the matrix: given a negative result, how likely is it to be genuinely negative? This is the number that tells you whether an "all clear" can actually be trusted.

Formula: TN / (TN + FN)

Prevalence

Before judging any other metric, ask how common the positive class actually is; that is prevalence. It is not a score of the model at all; it is a property of your data. But it is the context that decides whether accuracy can be trusted and how high precision could realistically ever reach.

Formula: (TP + FN) / (TP + TN + FP + FN)

Advanced Metrics

F1-Score

Precision and recall usually pull in opposite directions; push one up and the other tends to sag. F1 is the single number that refuses to let you ignore either. It is their harmonic mean, which behaves like a stricter average: it punishes imbalance, so a brilliant precision cannot paper over a dismal recall. If both matter and you want one figure to optimise, this is the one to watch.

Formula: 2 × (Precision × Recall) / (Precision + Recall)
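A quick comparison shows why the harmonic mean is the "stricter average". The values are hypothetical, and returning 0.0 when precision and recall are both zero is a common convention assumed here, not something the formula itself defines.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention: the formula is undefined here
    return 2 * precision * recall / (precision + recall)


# Hypothetical lopsided model: brilliant precision, dismal recall.
precision, recall = 0.95, 0.10

arithmetic_mean = (precision + recall) / 2
harmonic_mean = f1_score(precision, recall)

print(f"arithmetic mean = {arithmetic_mean:.3f}")  # 0.525
print(f"F1              = {harmonic_mean:.3f}")    # 0.181
```

The plain average still looks passable; F1 stays close to the weaker of the two numbers.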

FPR · Type I Error (α)

The false-alarm rate. Of all the genuinely negative cases, what fraction did the model wrongly flag as positive? Classical statistics calls the same quantity the Type I error rate, α, and it is the false-alarm cost you agree to tolerate before a hypothesis test even begins. Convicting an innocent defendant is the textbook example. It is also the horizontal axis of every ROC curve, and lower is better.

Formula: FP / (TN + FP)

FNR · Type II Error (β)

The miss rate. Of all the genuinely positive cases, what fraction did the model wave through as negative? Classical statistics calls the same quantity the Type II error rate, β. Where a Type I error is a false accusation, a Type II error is a real problem walking quietly out the door. It is one minus sensitivity, and you watch it closely whenever a miss is the dangerous outcome.

Formula: FN / (TP + FN)

Balanced Accuracy

Plain accuracy can be hijacked by a dominant class. Balanced accuracy closes that loophole by averaging sensitivity and specificity: judging the model on each class separately, then giving the two equal weight. On a lopsided dataset, this is the more honest "overall" score.

Formula: (Sensitivity + Specificity) / 2
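As a rough illustration, the sketch below scores one hypothetical lopsided test set both ways. The counts are invented for the example.

```python
# Hypothetical counts: 50 real positives, 950 real negatives.
tp, fn = 10, 40    # the model catches only 10 of the 50 positives
tn, fp = 940, 10   # it handles the dominant negative class well

sensitivity = tp / (tp + fn)                 # 0.20
specificity = tn / (tn + fp)                 # about 0.989
plain_accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"accuracy          = {plain_accuracy:.1%}")     # 95.0%
print(f"balanced accuracy = {balanced_accuracy:.1%}")  # 59.5%
```

Plain accuracy is carried by the 950 negatives; balanced accuracy gives the 50 positives an equal say and exposes the weak sensitivity.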

Matthews Correlation Coefficient (MCC)

If you are allowed to trust just one number on an imbalanced dataset, make it this one. MCC is a correlation coefficient between predictions and reality: +1 is perfect, 0 is no better than random guessing, −1 means every prediction is inverted. Its strength is that it only scores well when the model performs across all four cells of the matrix at once; there is nowhere to hide.

Formula: (TP×TN − FP×FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
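The formula translates directly into code. This is a sketch, not the tool's implementation: in particular, returning 0.0 when a row or column of the matrix is empty (which makes the denominator zero) is an assumed convention, and the counts are hypothetical.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from the four matrix cells."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column total is zero
    return (tp * tn - fp * fn) / denominator


mcc_perfect = mcc(tp=50, tn=50, fp=0, fn=0)            # 1.0
mcc_inverted = mcc(tp=0, tn=0, fp=50, fn=50)           # -1.0
mcc_decent = mcc(tp=45, tn=45, fp=5, fn=5)             # 0.8
mcc_always_negative = mcc(tp=0, tn=950, fp=0, fn=50)   # 0.0 by the convention above

print(mcc_perfect, mcc_inverted, mcc_decent, mcc_always_negative)
```

The always-negative model that scored 95% accuracy on a 95%-negative dataset lands at zero here, because it never produces a single positive call.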

Cohen's Kappa

Some agreement happens by luck alone. Kappa asks how much your model agrees with the truth beyond what random guessing would already have produced. A value of 0 means "no better than chance," 1 means perfect agreement, and a negative value means the agreement is worse than chance. It is the honest way to discount a score that class imbalance has quietly inflated.

Formula: (Observed − Expected agreement) / (1 − Expected agreement)
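For a 2×2 matrix, the expected agreement comes from the row and column totals: it is the chance that prediction and truth land on the same class if the two were independent. A minimal sketch with hypothetical counts (returning NaN when expected agreement is exactly 1 is an assumption, since kappa is undefined there):

```python
import math


def cohens_kappa(tp: int, tn: int, fp: int, fn: int) -> float:
    """Cohen's kappa for a binary confusion matrix."""
    total = tp + tn + fp + fn
    observed = (tp + tn) / total
    # Chance agreement: both say "positive" plus both say "negative".
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    if expected == 1:
        return math.nan  # undefined: only one class appears anywhere
    return (observed - expected) / (1 - expected)


# Hypothetical lopsided test set: 50 positives, 950 negatives.
kappa = cohens_kappa(tp=10, tn=940, fp=10, fn=40)
print(f"observed agreement = 95.0%, kappa = {kappa:.2f}")  # kappa = 0.26
```

Observed agreement is 95%, but 93.2% would be expected by chance on these totals, so only a small part of the score reflects real skill.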

Youden's J Statistic

A compact summary of diagnostic skill: sensitivity plus specificity, minus one. Zero means the test does no better than chance; 1 means it is flawless. Its second job is practical: the threshold that maximises J is a common and defensible choice for a decision cut-off.

Formula: Sensitivity + Specificity − 1
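Picking the threshold that maximises J can be sketched as a simple sweep over candidate cut-offs. The scores and labels below are invented, and the rule "predict positive when the score is at or above the threshold" is an assumption about the cut-off convention.

```python
def youden_j(scores, labels, threshold):
    """Sensitivity + specificity - 1 at a given decision threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1


# Hypothetical predicted probabilities and true labels.
scores = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

# Every distinct score is a candidate cut-off.
candidates = sorted(set(scores))
best_threshold = max(candidates, key=lambda t: youden_j(scores, labels, t))
best_j = youden_j(scores, labels, best_threshold)

print(f"best threshold = {best_threshold}, J = {best_j:.2f}")
# best threshold = 0.6, J = 0.80
```

Maximising J weights a missed positive and a false alarm equally; if one mistake is costlier than the other, a different cut-off may serve better.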

Markedness

Youden's J judged the model from the side of the actual classes; markedness judges it from the side of the predictions. It combines how trustworthy your positive calls and your negative calls are. Strong markedness means both kinds of prediction are genuinely worth believing.

Formula: PPV + NPV − 1

Rate Complements

Error Rate (Misclassification Rate)

Accuracy told from the pessimist's chair: the share of predictions that came out wrong. It is exactly one minus accuracy, and it carries the same blind spot: on imbalanced data, a reassuringly low error rate can still hide a model that misses every case that matters.

Formula: (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy

False Discovery Rate (FDR)

Of all the cases your model called positive, what fraction were false alarms? That is the False Discovery Rate, the complement of precision. It is the natural language for fields that make many positive calls at once, such as screening thousands of genes, where you want a firm bound on how many "discoveries" are really just noise.

Formula: FP / (TP + FP) = 1 − Precision

False Omission Rate (FOR)

The quieter counterpart to FDR: of all the cases called negative, what fraction were actually positive? A high False Omission Rate means your "all clear" is concealing real cases, precisely the failure you most fear in screening triage.

Formula: FN / (TN + FN) = 1 − NPV
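Each of these three rates is the complement of a metric defined earlier, which is easy to confirm on any set of counts. The numbers below are hypothetical.

```python
import math

# Hypothetical counts for a test set of 100 cases.
tp, fp, tn, fn = 45, 5, 40, 10
total = tp + fp + tn + fn

accuracy = (tp + tn) / total               # 0.85
error_rate = (fp + fn) / total             # 0.15
precision = tp / (tp + fp)                 # 0.90
false_discovery_rate = fp / (tp + fp)      # 0.10
npv = tn / (tn + fn)                       # 0.80
false_omission_rate = fn / (tn + fn)       # 0.20

# Each pair shares a denominator, so the two rates sum to 1.
assert math.isclose(error_rate, 1 - accuracy)
assert math.isclose(false_discovery_rate, 1 - precision)
assert math.isclose(false_omission_rate, 1 - npv)
```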

Diagnostic Ratios

Positive Likelihood Ratio (LR+)

How much should a positive result move your belief? LR+ compares how often a positive result turns up among genuinely positive cases versus genuinely negative ones. A value of 1 means the result carries no information; the higher it climbs, the more a positive result genuinely shifts the odds toward "real." It can run anywhere from 0 to infinity.

Formula: TPR / FPR = Sensitivity / (1 − Specificity)

Negative Likelihood Ratio (LR−)

The same idea applied to a negative result: how much should a negative finding lower your suspicion? Here you want a small number: the closer to 0, the more confidently a negative result rules the condition out. A value of 1 again means the result told you nothing.

Formula: FNR / TNR = (1 − Sensitivity) / Specificity

Diagnostic Odds Ratio (DOR)

One number to rank overall discriminative power: the positive likelihood ratio divided by the negative one. A DOR of 1 means the test cannot tell the classes apart at all; the larger it grows, the cleaner the separation. Meta-analyses favour it because it compresses a test's quality into a single, comparable figure.

Formula: LR+ / LR− = (TP × TN) / (FP × FN)
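The three ratios can be sketched together, since DOR is just one divided by the other. The counts are hypothetical, and the handling of zero cells (infinity when only the denominator is zero, NaN when the ratio is 0/0) is an assumption rather than the tool's documented behaviour.

```python
import math


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning inf or NaN instead of raising on a zero denominator."""
    if denominator > 0:
        return numerator / denominator
    return math.inf if numerator > 0 else math.nan


def diagnostic_ratios(tp: int, tn: int, fp: int, fn: int):
    """Return (LR+, LR-, DOR) from the four matrix cells."""
    tpr = tp / (tp + fn)
    fnr = fn / (tp + fn)
    tnr = tn / (tn + fp)
    fpr = fp / (tn + fp)
    lr_pos = _ratio(tpr, fpr)
    lr_neg = _ratio(fnr, tnr)
    dor = _ratio(tp * tn, fp * fn)   # same as LR+ / LR- when both are defined
    return lr_pos, lr_neg, dor


# Hypothetical test: sensitivity 90%, specificity 80%.
lr_pos, lr_neg, dor = diagnostic_ratios(tp=90, tn=80, fp=20, fn=10)
print(f"LR+ = {lr_pos:.3f}, LR- = {lr_neg:.3f}, DOR = {dor:.1f}")
# LR+ = 4.500, LR- = 0.125, DOR = 36.0
```

Computing DOR from the raw cell counts avoids dividing two rounded ratios and makes the zero-cell case explicit.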

Confusion Matrix Components

True Positive (TP)

The model said positive and reality agreed: a clean hit.

True Negative (TN)

The model said negative and reality agreed: a correct pass.

False Positive (FP)

The model raised a flag that should never have gone up. This is the false alarm, known in hypothesis testing as a Type I error.

False Negative (FN)

The model stayed quiet on a case that was genuinely real. This is the miss, known in hypothesis testing as a Type II error.

Confusion Matrix Pro is a free, browser-based tool for evaluating binary classification models. Enter your true positives, true negatives, false positives, and false negatives, or paste raw predicted probabilities and labels, to instantly calculate 26 performance metrics, including accuracy, precision, recall, sensitivity, specificity, F1-score, Matthews correlation coefficient (MCC), and Cohen's kappa, alongside an interactive confusion matrix, ROC and precision-recall curves, and a live decision-threshold slider. It is designed for data scientists, machine learning engineers, students, and researchers who want a fast, no-install way to interpret model results, tune classification thresholds, and compare diagnostic performance across medical screening, fraud detection, spam filtering, and other binary classification problems.

This tool is free for anyone to use, with no sign-up or download required. All results are provided for informational and educational purposes only. Isik & Co. makes no warranty regarding the accuracy of any calculations and accepts no liability for decisions or outcomes arising from use of this tool.

Site Guide

What it does

Confusion Matrix Pro takes the four counts behind any binary classifier (true and false positives and negatives) and turns them into 26 named metrics, a visual matrix, and threshold curves. Everything is computed in your browser; nothing leaves the page.

Entering data

Three routes; pick whichever matches what you have on hand:

Binary: type the TP, TN, FP, FN counts directly. Use this when your model is already scored.
Predictions & Threshold: paste predicted probabilities and true labels (or upload a CSV), and the matrix is built for you at a decision threshold you can drag.
Multi-class: for 3 or more classes, enter the full N×N confusion matrix and get per-class precision, recall, F1, plus macro and weighted averages and Cohen's κ.
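For the multi-class route, the per-class figures fall out of the row and column totals. The sketch below assumes rows are actual classes and columns are predicted, that macro F1 is the unweighted mean of per-class F1, and that weighted F1 weights each class by its support; the 3×3 matrix is invented, and undefined ratios are reported as 0.0 by assumption.

```python
def per_class_metrics(matrix):
    """Precision, recall, F1, and support for each class of an NxN matrix."""
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        support = sum(matrix[k])                          # row total: actual class k
        predicted = sum(matrix[i][k] for i in range(n))   # column total: predicted k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": support})
    return results


# Hypothetical 3-class matrix (rows = actual, columns = predicted).
matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 18],
]

rows = per_class_metrics(matrix)
total = sum(r["support"] for r in rows)

accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total
macro_f1 = sum(r["f1"] for r in rows) / len(rows)
weighted_f1 = sum(r["f1"] * r["support"] for r in rows) / total

for k, r in enumerate(rows, start=1):
    print(f"Class {k}: P={r['precision']:.3f} R={r['recall']:.3f} "
          f"F1={r['f1']:.3f} support={r['support']}")
print(f"accuracy={accuracy:.3f} macro F1={macro_f1:.3f} weighted F1={weighted_f1:.3f}")
```

Macro F1 treats every class as equally important, so a weak small class drags it down; weighted F1 follows the class sizes and sits closer to accuracy.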

Reading the results

Each metric shows a value, a small 95% CI line, and its formula. The confidence interval is the range of values consistent with sampling variability alone; a wide CI is a warning that your test set is too small to pin the metric down. CIs are computed by the Wilson method for proportions and the Simel / Altman log-normal method for the likelihood ratios.
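Both interval methods have standard textbook forms, sketched below. This is an assumption about what the page computes, not its actual code: the function names are illustrative, the counts are hypothetical, and the likelihood-ratio version does not handle zero cells.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wilson_interval(successes: int, n: int, z: float = Z_95):
    """Wilson score interval for a proportion such as sensitivity."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - half_width, centre + half_width)


def lr_positive_interval(tp: int, tn: int, fp: int, fn: int, z: float = Z_95):
    """Log-method interval for LR+ (assumes no zero cells)."""
    lr_pos = (tp / (tp + fn)) / (fp / (fp + tn))
    se_log = math.sqrt(1 / tp - 1 / (tp + fn) + 1 / fp - 1 / (fp + tn))
    return (lr_pos * math.exp(-z * se_log), lr_pos * math.exp(z * se_log))


# Same observed sensitivity (90%) on a small and a large test set.
small = wilson_interval(45, 50)
large = wilson_interval(450, 500)
lr_low, lr_high = lr_positive_interval(tp=90, tn=80, fp=20, fn=10)

print(f"n=50:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n=500: {large[0]:.3f} to {large[1]:.3f}")
print(f"LR+ 4.5: {lr_low:.2f} to {lr_high:.2f}")
```

The point estimate is identical in the first two cases; only the sample size changes, and the interval narrows accordingly.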

Hover any small "i" for a one-line recap, and find the full write-ups at the foot of the page. The presets on the Binary tab load reference scenarios so you can watch how the metrics behave at the extremes.

Comparing thresholds

In the Predictions & Threshold tab, click Set current as baseline after loading your predictions. Move the slider afterwards and every metric grows a Δ line showing how it changed: green is an improvement for that metric, red is a regression, and percentages report as percentage points. Clear the baseline any time to return to plain readings.
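The baseline comparison amounts to building the matrix at two thresholds and subtracting. A rough sketch with invented predictions, assuming a case is called positive when its probability is at or above the threshold:

```python
def confusion_counts(probabilities, labels, threshold):
    """Return (tp, tn, fp, fn) at the given decision threshold."""
    tp = tn = fp = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0   # assumed cut-off convention
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def sensitivity_and_precision(tp, tn, fp, fn):
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical model outputs and true labels.
probabilities = [0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

baseline = confusion_counts(probabilities, labels, threshold=0.5)
current = confusion_counts(probabilities, labels, threshold=0.3)

base_sens, base_prec = sensitivity_and_precision(*baseline)
curr_sens, curr_prec = sensitivity_and_precision(*current)

# Differences between percentages are reported in percentage points (pp).
delta_sens_pp = (curr_sens - base_sens) * 100
delta_prec_pp = (curr_prec - base_prec) * 100

print(f"sensitivity: {delta_sens_pp:+.1f} pp")  # +20.0 pp
print(f"precision:   {delta_prec_pp:+.1f} pp")  # -8.6 pp
```

Lowering the threshold catches the one positive that was being missed, at the price of an extra false alarm, so sensitivity improves while precision regresses.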

Naming the classes

The class-label boxes rename the matrix axes (say, Dog and Cat) so the layout reads in your own terms. Metric names stay standard, since the positive class is still what defines sensitivity, precision, and the rest.

Keeping your work

Export the values and metrics as JSON, save a PNG, JPG, or PDF snapshot, or copy a share link that reopens the page exactly as you left it. Your choice of light or dark theme is remembered between visits.

Embedding on your site

Click </> Copy embed code in the top toolbar to get an <iframe> snippet you can drop into a blog post, doc, or course page. The embed strips the header, sidebar, and footer, leaving just the matrix and metrics, and it preserves the state you have configured.
