Confusion Matrix Pro
Enter your model's results to instantly calculate accuracy, precision, and many other performance metrics.
Class Labels
Optional. Renames the matrix axes only (e.g. Dog / Cat). All metrics keep their standard names; the positive class is the one that drives Sensitivity, Precision, and the rest.
Class Setup
Optional. Comma-separated names rename the matrix rows and columns. Anything missing falls back to Class 1, Class 2…
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
Metric Definitions & Terminology
Core Metrics
Accuracy
The number everyone asks for first, and the one most likely to fool you. It is simply the share of predictions you got right, lumping correct positives and correct negatives together. When the two classes are roughly balanced, that is an honest summary. But when one class dominates, accuracy flatters lazy models: if 95% of cases are negative, a model that blindly answers "negative" every time scores 95% while telling you nothing. Read it first, then immediately check a metric that accounts for the imbalance.
Formula: (TP + TN) / (TP + TN + FP + FN)
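The imbalance trap described above is easy to reproduce. The sketch below assumes a hypothetical dataset of 950 negatives and 50 positives and a model that always answers "negative"; the function name `accuracy` is illustrative and not part of this tool.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else float("nan")

# Hypothetical counts: the model never predicts "positive".
lazy = dict(tp=0, tn=950, fp=0, fn=50)

acc = accuracy(**lazy)
# Recall for the same counts: how many real positives were caught.
recall = lazy["tp"] / (lazy["tp"] + lazy["fn"])

print(f"accuracy={acc:.2f} recall={recall:.2f}")
```

With these counts accuracy comes out at 0.95 while recall is 0.0, which is exactly the gap that a second, imbalance-aware metric is meant to expose.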
Sensitivity · Recall · TPR · Power (1 − β)
Of everything that was genuinely positive, how much did the model actually catch? The same calculation travels under several names: machine learning calls it recall, ROC analysis calls it the true positive rate (TPR), and classical statistics calls it power, written 1 − β. Think of it as the model's thoroughness. It is the metric you care about most when a miss is the expensive kind of mistake: an undetected tumour, a fraud that slips through, a search-and-rescue target left behind. It says nothing about false alarms, so it is best read alongside specificity or precision.
Formula: TP / (TP + FN)
Specificity · TNR
The mirror image of sensitivity: of everything that was genuinely negative, how much did the model correctly leave alone? ROC analysis calls the same quantity the true negative rate (TNR). High specificity means the model rarely cries wolf. You lean on it whenever a false alarm is costly or disruptive: sending a healthy patient for invasive tests, or dropping a real email into the spam folder.
Formula: TN / (TN + FP)
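Sensitivity and specificity each use only one row of the matrix, which is why they are usually reported as a pair. A minimal sketch with hypothetical counts (the function names are illustrative):

```python
def sensitivity(tp, fn):
    """TP / (TP + FN): share of real positives that were caught."""
    return tp / (tp + fn) if (tp + fn) else float("nan")

def specificity(tn, fp):
    """TN / (TN + FP): share of real negatives that were left alone."""
    return tn / (tn + fp) if (tn + fp) else float("nan")

# Hypothetical counts: 100 real positives, 100 real negatives.
tp, fn, tn, fp = 80, 20, 90, 10

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Both functions return NaN when their class is absent from the data, since the rate is undefined in that case; other conventions are possible.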
Precision · PPV
Sensitivity asks "did we catch everything?" Precision asks the trust question instead: "when the model says positive, should I believe it?" In clinical fields the same quantity is called positive predictive value, or PPV. It is the fraction of positive calls that turned out to be correct, and high precision means few false alarms. One catch worth remembering: this number depends heavily on how common the positive class is. For any model short of perfect specificity, the rarer the positive class, the lower precision falls, because even a small false-alarm rate applied to a large pool of negatives produces many false positives.
Formula: TP / (TP + FP)
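The dependence on prevalence can be made concrete by rewriting precision in terms of sensitivity, specificity, and prevalence (Bayes' rule). The sketch below is illustrative: `ppv_from_rates` is a hypothetical helper, and the 90% sensitivity and specificity are assumed values.

```python
def ppv_from_rates(sens, spec, prevalence):
    """Precision implied by sensitivity, specificity and prevalence."""
    true_pos = sens * prevalence                # expected share of TP
    false_pos = (1 - spec) * (1 - prevalence)   # expected share of FP
    denom = true_pos + false_pos
    return true_pos / denom if denom else float("nan")

# Same hypothetical model, two different populations.
ppv_balanced = ppv_from_rates(0.9, 0.9, 0.50)
ppv_rare = ppv_from_rates(0.9, 0.9, 0.01)

print(f"prevalence 50%: PPV={ppv_balanced:.3f}")
print(f"prevalence  1%: PPV={ppv_rare:.3f}")
```

The model itself is unchanged between the two calls, yet precision drops from 0.90 at 50% prevalence to roughly 0.08 at 1% prevalence.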
Negative Predictive Value (NPV)
The counterpart of precision on the other side of the matrix: given a negative result, how likely is it to be genuinely negative? This is the number that tells you whether an "all clear" can actually be trusted.
Formula: TN / (TN + FN)
Prevalence
Before judging any other metric, ask how common the positive class actually is; that is prevalence. It is not a score of the model at all; it is a property of your data. But it is the context that decides whether accuracy can be trusted and how high precision could realistically ever reach.
Formula: (TP + FN) / (TP + TN + FP + FN)
Advanced Metrics
F1-Score
Precision and recall usually pull in opposite directions; push one up and the other tends to sag. F1 is the single number that refuses to let you ignore either. It is their harmonic mean, which behaves like a stricter average: it punishes imbalance, so a brilliant precision cannot paper over a dismal recall. If both matter and you want one figure to optimise, this is the one to watch.
Formula: 2 × (Precision × Recall) / (Precision + Recall)
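To see how the harmonic mean punishes imbalance, compare it with the ordinary average on a lopsided pair. A small sketch with assumed values (precision 0.95, recall 0.10); returning 0.0 when both inputs are zero is one common convention, not the only one.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention: no correct positive calls at all
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.95, 0.10  # hypothetical, deliberately lopsided

arithmetic = (precision + recall) / 2
f1 = f1_score(precision, recall)

print(f"arithmetic mean={arithmetic:.3f} F1={f1:.3f}")
```

The arithmetic mean is 0.525, which looks respectable; F1 is about 0.181, much closer to the weaker of the two inputs.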
FPR · Type I Error (α)
The false-alarm rate. Of all the genuinely negative cases, what fraction did the model wrongly flag as positive? Classical statistics calls the same quantity the Type I error rate, α, and it is the false-alarm cost you agree to tolerate before a hypothesis test even begins. Convicting an innocent defendant is the textbook example. It is also the horizontal axis of every ROC curve, and lower is better.
Formula: FP / (TN + FP)
FNR · Type II Error (β)
The miss rate. Of all the genuinely positive cases, what fraction did the model wave through as negative? Classical statistics calls the same quantity the Type II error rate, β. Where a Type I error is a false accusation, a Type II error is a real problem walking quietly out the door. It is one minus sensitivity, and you watch it closely whenever a miss is the dangerous outcome.
Formula: FN / (TP + FN)
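The two error rates are the complements of specificity and sensitivity, which a quick check on hypothetical counts confirms (the function name is illustrative):

```python
def error_rates(tp, tn, fp, fn):
    """Return (FPR, FNR); each is NaN if its class is absent."""
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr

# Hypothetical counts.
tp, fn, tn, fp = 80, 20, 90, 10
fpr, fnr = error_rates(tp, tn, fp, fn)

specificity = tn / (tn + fp)
sensitivity = tp / (tp + fn)

print(f"FPR={fpr:.2f} (1 - specificity = {1 - specificity:.2f})")
print(f"FNR={fnr:.2f} (1 - sensitivity = {1 - sensitivity:.2f})")
```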
Balanced Accuracy
Plain accuracy can be hijacked by a dominant class. Balanced accuracy closes that loophole by averaging sensitivity and specificity: judging the model on each class separately, then giving the two equal weight. On a lopsided dataset, this is the more honest "overall" score.
Formula: (Sensitivity + Specificity) / 2
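A short sketch of the contrast with plain accuracy, using hypothetical counts from a dataset where only 5% of cases are positive and the model catches few of them:

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity and specificity (both classes must be present)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return (sens + spec) / 2

# Hypothetical counts: 50 positives, 950 negatives.
tp, fn, tn, fp = 5, 45, 940, 10

plain = (tp + tn) / (tp + tn + fp + fn)
balanced = balanced_accuracy(tp, tn, fp, fn)

print(f"accuracy={plain:.3f} balanced accuracy={balanced:.3f}")
```

Plain accuracy is 0.945, while balanced accuracy is about 0.545, barely above the 0.5 a model with no skill would earn.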
Matthews Correlation Coefficient (MCC)
If you are allowed to trust just one number on an imbalanced dataset, make it this one. MCC is a correlation coefficient between predictions and reality: +1 is perfect, 0 is no better than a coin flip, −1 is perfectly wrong. Its strength is that it only scores well when the model performs across all four cells of the matrix at once; there is nowhere to hide.
Formula: (TP×TN − FP×FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
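The formula translates directly into code. The one practical wrinkle is the denominator: if any row or column of the matrix sums to zero, the square root is zero and MCC is undefined. The sketch below returns 0.0 in that case, which is a common convention but an assumption here, not something this page specifies.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for an empty row or column
    return (tp * tn - fp * fn) / denom

# Hypothetical, reasonably good model.
good = mcc(tp=90, tn=85, fp=15, fn=10)

# Hypothetical model that always predicts "negative" on imbalanced data.
lazy = mcc(tp=0, tn=950, fp=0, fn=50)

print(f"good model MCC={good:.3f}, always-negative MCC={lazy:.3f}")
```

The always-negative model, which would score 95% accuracy on these counts, earns an MCC of 0.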
Cohen's Kappa
Some agreement happens by luck alone. Kappa asks how much your model agrees with the truth beyond what random guessing would already have produced. A value of 0 means "no better than chance," 1 means perfect agreement. It is the honest way to discount a score that class imbalance has quietly inflated.
Formula: (Observed − Expected agreement) / (1 − Expected agreement)
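For a binary matrix, observed agreement is simply accuracy, and expected agreement can be computed from the row and column totals. The sketch below spells that out; the function name and counts are hypothetical.

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement: both say "positive" or both say "negative",
    # assuming predictions and truth were independent.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    if expected == 1:
        return float("nan")  # undefined: only one class, fully predicted
    return (observed - expected) / (1 - expected)

kappa_good = cohens_kappa(tp=90, tn=85, fp=15, fn=10)
kappa_lazy = cohens_kappa(tp=0, tn=950, fp=0, fn=50)

print(f"good model kappa={kappa_good:.3f}, always-negative kappa={kappa_lazy:.3f}")
```

For the always-negative model, observed and expected agreement are both 0.95, so kappa is 0: all of that apparent accuracy is what chance would have delivered anyway.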
Youden's J Statistic
A compact summary of diagnostic skill: sensitivity plus specificity, minus one. Zero means the test is no use; 1 means it is flawless. Its second job is practical: the threshold that maximises J is a common and defensible choice for a decision cut-off.
Formula: Sensitivity + Specificity − 1
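Choosing a cut-off by maximising J can be sketched as a simple sweep over candidate thresholds. The scores and labels below are made up for illustration, a case is called positive when its score is at or above the threshold, and both classes are assumed to be present.

```python
def youden_j(labels, scores, threshold):
    """Sensitivity + specificity - 1 at a given threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn) + tn / (tn + fp) - 1

# Hypothetical predicted probabilities and true labels.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0,   0,   0,   1,   0,   1,   1,   1,   1]

# Try every distinct score as a threshold and keep the one with the largest J.
candidates = sorted(set(scores))
best_t = max(candidates, key=lambda t: youden_j(labels, scores, t))
best_j = youden_j(labels, scores, best_t)

print(f"best threshold={best_t} J={best_j:.2f}")
```

On this toy data the sweep settles on a threshold of 0.6, where sensitivity is 0.8 and specificity is 1.0. J weights both kinds of error equally, so it is a reasonable default rather than the right answer when misses and false alarms carry different costs.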
Markedness
Youden's J judged the model from the side of the actual classes; markedness judges it from the side of the predictions. It combines how trustworthy your positive calls and your negative calls are. Strong markedness means both kinds of prediction are genuinely worth believing.
Formula: PPV + NPV − 1
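Markedness and Youden's J are computed the same way from opposite margins of the matrix, and for a 2×2 table their product equals MCC squared. The sketch below checks that identity on hypothetical counts; function names are illustrative and all four margins are assumed to be non-zero.

```python
import math

def markedness(tp, tn, fp, fn):
    """PPV + NPV - 1 (prediction-side summary)."""
    return tp / (tp + fp) + tn / (tn + fn) - 1

def youden_j(tp, tn, fp, fn):
    """Sensitivity + specificity - 1 (truth-side summary)."""
    return tp / (tp + fn) + tn / (tn + fp) - 1

tp, tn, fp, fn = 90, 85, 15, 10  # hypothetical counts

mk = markedness(tp, tn, fp, fn)
j = youden_j(tp, tn, fp, fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"markedness={mk:.4f} J={j:.4f} MCC={mcc:.4f}")
print(f"sqrt(J * markedness)={math.sqrt(j * mk):.4f}")
```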
Rate Complements
Error Rate (Misclassification Rate)
Accuracy told from the pessimist's chair: the share of predictions that came out wrong. It is exactly one minus accuracy, and it carries the same blind spot: on imbalanced data, a reassuringly low error rate can still hide a model that misses every case that matters.
Formula: (FP + FN) / (TP + TN + FP + FN) = 1 − Accuracy
False Discovery Rate (FDR)
Of all the cases your model called positive, what fraction were false alarms? That is the False Discovery Rate, the complement of precision. It is the natural language for fields that make many positive calls at once, such as screening thousands of genes, where you want a firm bound on how many "discoveries" are really just noise.
Formula: FP / (TP + FP) = 1 − Precision
False Omission Rate (FOR)
The quieter counterpart to FDR: of all the cases called negative, what fraction were actually positive? A high False Omission Rate means your "all clear" is concealing real cases, precisely the failure you most fear in screening triage.
Formula: FN / (TN + FN) = 1 − NPV
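All three rates in this group are one minus a metric defined earlier, which the following sketch verifies on hypothetical counts (assuming the model makes at least one positive and one negative call):

```python
tp, tn, fp, fn = 90, 85, 15, 10  # hypothetical counts
total = tp + tn + fp + fn

error_rate = (fp + fn) / total   # complement of accuracy
fdr = fp / (tp + fp)             # complement of precision
f_omission = fn / (tn + fn)      # complement of NPV ("FOR")

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
npv = tn / (tn + fn)

print(f"error rate={error_rate:.3f} (1 - accuracy = {1 - accuracy:.3f})")
print(f"FDR={fdr:.3f} (1 - precision = {1 - precision:.3f})")
print(f"FOR={f_omission:.3f} (1 - NPV = {1 - npv:.3f})")
```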
Diagnostic Ratios
Positive Likelihood Ratio (LR+)
How much should a positive result move your belief? LR+ compares how often a positive result turns up among genuinely positive cases versus genuinely negative ones. A value of 1 is useless noise; the higher it climbs, the more a positive result genuinely shifts the odds toward "real." It can run anywhere from 0 to infinity.
Formula: TPR / FPR = Sensitivity / (1 − Specificity)
Negative Likelihood Ratio (LR−)
The same idea applied to a negative result: how much should a negative finding lower your suspicion? Here you want a small number: the closer to 0, the more confidently a negative result rules the condition out. A value of 1 again means the result told you nothing.
Formula: FNR / TNR = (1 − Sensitivity) / Specificity
Diagnostic Odds Ratio (DOR)
One number to rank overall discriminative power: the positive likelihood ratio divided by the negative one. A DOR of 1 means the test cannot tell the classes apart at all; the larger it grows, the cleaner the separation. Meta-analyses favour it because it compresses a test's quality into a single, comparable figure.
Formula: LR+ / LR− = (TP × TN) / (FP × FN)
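The three ratios can be computed together from the four cells. Because each is a quotient, a zero denominator has to be handled explicitly; the sketch below returns infinity when only the denominator is zero and NaN when numerator and denominator are both zero. That handling, like the helper names, is an assumption for illustration, and both classes are assumed to be present.

```python
import math

def _ratio(num, den):
    """num / den, with inf for x/0 (x > 0) and NaN for 0/0."""
    if den == 0:
        return math.inf if num > 0 else math.nan
    return num / den

def diagnostic_ratios(tp, tn, fp, fn):
    """Return (LR+, LR-, DOR) for a 2x2 confusion matrix."""
    tpr = tp / (tp + fn)
    fnr = fn / (tp + fn)
    tnr = tn / (tn + fp)
    fpr = fp / (tn + fp)
    lr_pos = _ratio(tpr, fpr)
    lr_neg = _ratio(fnr, tnr)
    dor = _ratio(tp * tn, fp * fn)
    return lr_pos, lr_neg, dor

# Hypothetical counts: sensitivity 0.90, specificity 0.85.
lr_pos, lr_neg, dor = diagnostic_ratios(tp=90, tn=85, fp=15, fn=10)
print(f"LR+={lr_pos:.2f} LR-={lr_neg:.3f} DOR={dor:.1f}")
```

With these counts LR+ is 6, LR− is about 0.118, and DOR is 51, matching LR+ divided by LR−.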
Confusion Matrix Components
True Positive (TP)
The model said positive and reality agreed: a clean hit.
True Negative (TN)
The model said negative and reality agreed: a correct pass.
False Positive (FP)
The model raised a flag that should never have gone up. This is the false alarm, known in hypothesis testing as a Type I error.
False Negative (FN)
The model stayed quiet on a case that was genuinely real. This is the miss, known in hypothesis testing as a Type II error.
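If you are starting from raw labels rather than counts, the four cells can be tallied with a single pass over paired true and predicted values. A minimal sketch, assuming labels are encoded as 1 for positive and 0 for negative (the function name and data are hypothetical):

```python
def confusion_counts(y_true, y_pred):
    """Tally TP, TN, FP, FN from paired 0/1 labels."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # hit
        elif truth == 0 and pred == 0:
            tn += 1   # correct pass
        elif truth == 0 and pred == 1:
            fp += 1   # false alarm (Type I error)
        else:
            fn += 1   # miss (Type II error)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```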
Confusion Matrix Pro is a free, browser-based tool for evaluating binary classification models. Enter your true positives, true negatives, false positives, and false negatives, or paste raw predicted probabilities and labels, to instantly calculate 26 performance metrics, including accuracy, precision, recall, sensitivity, specificity, F1-score, Matthews correlation coefficient (MCC), and Cohen's kappa, alongside an interactive confusion matrix, ROC and precision-recall curves, and a live decision-threshold slider. It is designed for data scientists, machine learning engineers, students, and researchers who want a fast, no-install way to interpret model results, tune classification thresholds, and compare diagnostic performance across medical screening, fraud detection, spam filtering, and other binary classification problems.
This tool is free for anyone to use, with no sign-up or download required. All results are provided for informational and educational purposes only. Isik & Co. makes no warranty regarding the accuracy of any calculations and accepts no liability for decisions or outcomes arising from use of this tool.