Confusion Matrix Calculator

Enter the four cells of your binary confusion matrix (True Positives, False Positives, True Negatives, and False Negatives) and this calculator instantly computes every standard classification performance metric: accuracy, precision, recall, specificity, F1 score, Matthews Correlation Coefficient, balanced accuracy, and more. Results update as you type, with a step-by-step breakdown of every formula.

Your details

True Positives (TP): Cases where the model correctly predicted the positive class.
False Positives (FP): Cases where the model predicted positive but the actual class was negative (Type I error).
False Negatives (FN): Cases where the model predicted negative but the actual class was positive (Type II error).
True Negatives (TN): Cases where the model correctly predicted the negative class.
Accuracy: 88% (Good)

Fraction of all predictions that are correct: (TP + TN) / total

Precision (PPV): 89%
Recall (Sensitivity / TPR): 85%
Specificity (TNR): 90%
F1 Score: 87%
Matthews Correlation Coefficient (MCC): 0.7509
Balanced Accuracy: 88%
Negative Predictive Value (NPV): 86%
False Positive Rate (FPR): 10%
False Negative Rate (FNR): 15%
False Discovery Rate (FDR): 11%
Prevalence: 50%
Total Samples: 200
F2 Score: 86%

Accuracy: 87.5% across 200 samples.

  • The model correctly classified 175 out of 200 samples (87.5% accuracy).
  • Precision (89.5%) and recall (85.0%) are well balanced.
  • The Matthews Correlation Coefficient is 0.751, indicating strong predictive agreement. MCC is widely regarded as one of the most balanced single-number summaries for binary classifiers.

Next step: Consider tuning the decision threshold, resampling for class balance, or adjusting the relative cost of false positives and false negatives to improve this result (F1 = 87.2%).

What is a confusion matrix?

A confusion matrix is a table that summarises the performance of a binary classification model by comparing its predictions against the actual class labels. It has four cells: True Positives (TP) where the model correctly predicts the positive class, True Negatives (TN) where it correctly predicts the negative class, False Positives (FP) where it wrongly predicts positive (Type I error), and False Negatives (FN) where it wrongly predicts negative (Type II error). Every metric this calculator reports can be derived from these four numbers, which makes the confusion matrix one of the most important diagnostic tables in machine learning.
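As a minimal sketch of how the four cells are tallied, the snippet below counts them from two lists of labels. The function name `confusion_counts`, the `positive` parameter, and the toy data are illustrative assumptions, not part of this calculator.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for two equal-length sequences of labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # correctly predicted positive
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative (Type I error)
        elif actual == positive:
            fn += 1  # predicted negative, actually positive (Type II error)
        else:
            tn += 1  # correctly predicted negative
    return tp, fp, fn, tn


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```

The four counts always sum to the number of samples, which is a quick sanity check before entering them above.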

How to read your results: the key metrics explained

Accuracy is the fraction of all predictions that are correct. It can be misleading when classes are imbalanced: for example, a model that always predicts "negative" on a dataset that is 95% negative achieves 95% accuracy while being completely useless. Precision (also called Positive Predictive Value) answers "how many of the items flagged positive are actually positive?", and is the right priority when false alarms are costly - spam detection or fraud flagging, for instance. Recall (Sensitivity) answers "how many of the actual positives did the model catch?", and is the right priority when missing a positive is costly - cancer screening or fault detection. The F1 Score is the harmonic mean of precision and recall and gives a single number that balances both. The Matthews Correlation Coefficient (MCC) goes further: it accounts for all four matrix cells and is widely considered the most informative single metric for imbalanced binary classification, ranging from -1 (perfectly wrong) through 0 (random) to +1 (perfect).

Precision vs recall: choosing the right trade-off

Raising the decision threshold of a classifier typically increases precision but reduces recall, and lowering it does the opposite. This trade-off is fundamental and the right balance depends entirely on the cost asymmetry of errors in your problem. In medical diagnosis, a missed cancer (FN) is far more costly than an extra biopsy (FP), so you optimise for recall. In email spam filtering, blocking a legitimate email (FP) may matter more than the occasional spam getting through (FN), so you optimise for precision. The F2 score (beta = 2) weights recall twice as heavily as precision and is useful in the first type of scenario. Plot a Precision-Recall curve or an ROC curve to visualise the full trade-off space and choose a threshold that matches your cost structure.
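The trade-off can be seen by sweeping a threshold over a handful of scores. This is a minimal sketch: the scores, labels, and the helper name `precision_recall_at` are made up for illustration, and real data will not always move this cleanly.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical model scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 1, 0, 1, 0]

for t in (0.25, 0.50, 0.85):
    p, r = precision_recall_at(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
# threshold=0.25 precision=0.71 recall=1.00
# threshold=0.50 precision=0.80 recall=0.80
# threshold=0.85 precision=1.00 recall=0.40
```

Each threshold yields a different confusion matrix, so the four cells you enter above describe the model at one operating point only.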

Imbalanced datasets and why accuracy can lie

When the positive class is rare - credit card fraud, rare disease screening, structural failures - a naive classifier that always predicts "negative" can achieve very high accuracy while detecting nothing useful. In these situations, balanced accuracy (the arithmetic mean of sensitivity and specificity), MCC, and the F1 score are far more informative than raw accuracy. If your prevalence is below 10% or above 90%, pay close attention to the MCC and F1 results this calculator provides, and consider whether resampling techniques (such as SMOTE or random oversampling) or cost-sensitive training are needed.

Classification metric quick reference

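| Metric               | Formula                                                         | Ideal value | Range    |
|----------------------|-----------------------------------------------------------------|-------------|----------|
| Accuracy             | (TP+TN) / total                                                 | 1.0 (100%)  | 0 to 1   |
| Precision (PPV)      | TP / (TP+FP)                                                    | 1.0 (100%)  | 0 to 1   |
| Recall (Sensitivity) | TP / (TP+FN)                                                    | 1.0 (100%)  | 0 to 1   |
| Specificity (TNR)    | TN / (TN+FP)                                                    | 1.0 (100%)  | 0 to 1   |
| F1 Score             | 2PR / (P+R)                                                     | 1.0 (100%)  | 0 to 1   |
| F2 Score             | 5PR / (4P+R)                                                    | 1.0 (100%)  | 0 to 1   |
| MCC                  | (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))            | +1.0        | -1 to +1 |
| Balanced Accuracy    | (Recall + Specificity) / 2                                      | 1.0 (100%)  | 0 to 1   |
| NPV                  | TN / (TN+FN)                                                    | 1.0 (100%)  | 0 to 1   |
| FPR (Fall-out)       | FP / (FP+TN)                                                    | 0 (0%)      | 0 to 1   |
| FNR (Miss rate)      | FN / (FN+TP)                                                    | 0 (0%)      | 0 to 1   |
| FDR                  | FP / (FP+TP)                                                    | 0 (0%)      | 0 to 1   |
| Prevalence           | (TP+FN) / total                                                 | varies      | 0 to 1   |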

Standard interpretation ranges for binary classifier performance metrics.
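The formulas in the table can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculator's actual source: the function name `classification_metrics` is hypothetical, undefined ratios are returned as NaN by choice, and the example cell counts (TP = 85, FP = 10, TN = 90, FN = 15) are an assumption chosen to be consistent with the 200-sample results shown at the top of this page.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute the quick-reference metrics from the four confusion-matrix cells."""

    def ratio(num, den):
        # Assumption: report NaN rather than raise when a denominator is zero.
        return num / den if den else float("nan")

    total = tp + fp + tn + fn
    recall = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, total),
        "precision": ratio(tp, tp + fp),
        "recall": recall,
        "specificity": specificity,
        # Count forms, algebraically equal to 2PR/(P+R) and 5PR/(4P+R).
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "f2": ratio(5 * tp, 5 * tp + 4 * fn + fp),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "balanced_accuracy": (recall + specificity) / 2,
        "npv": ratio(tn, tn + fn),
        "fpr": ratio(fp, fp + tn),
        "fnr": ratio(fn, fn + tp),
        "fdr": ratio(fp, fp + tp),
        "prevalence": ratio(tp + fn, total),
    }


# Assumed example cells (see the note above).
metrics = classification_metrics(tp=85, fp=10, tn=90, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these cells the sketch gives an accuracy of 0.8750, an F1 of about 0.8718, and an MCC of about 0.7509.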

Frequently asked questions

What are true positives, false positives, true negatives, and false negatives?

In binary classification, a True Positive (TP) is a case where the model predicted "positive" and the actual label is positive. A False Positive (FP) is a case where the model predicted "positive" but the actual label is negative (Type I error, also called a false alarm). A True Negative (TN) is a case where the model predicted "negative" and the actual label is negative. A False Negative (FN) is a case where the model predicted "negative" but the actual label is positive (Type II error, also called a miss). Every metric on this page is a different arithmetic combination of these four counts.

What is the difference between precision and recall?

Precision is the fraction of the model's positive predictions that are actually correct: TP / (TP + FP). It measures how trustworthy a positive prediction is. Recall (also called sensitivity or true positive rate) is the fraction of actual positives the model correctly identified: TP / (TP + FN). It measures how thorough the model is. A spam filter with high precision rarely marks legitimate emails as spam. A cancer screening tool with high recall rarely misses a true case. There is an inherent trade-off: increasing one often decreases the other.

What is MCC and why is it better than accuracy for imbalanced data?

The Matthews Correlation Coefficient (MCC) is calculated as (TP * TN - FP * FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). It produces a value between -1 and +1 that takes all four cells of the confusion matrix into account equally. A score of +1 means perfect predictions, 0 means the model is no better than random guessing, and -1 means the model always predicts the wrong class. Because MCC factors in both classes symmetrically, it cannot be inflated by a class imbalance the way accuracy can, making it the preferred single-number summary when your dataset is imbalanced.
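A minimal sketch of the formula, showing the two extremes and one edge case. When any of the four sums in the denominator is zero (for example, when the model never predicts positive), the ratio is undefined; returning 0.0 in that case is a common convention adopted here as an assumption, and this calculator may handle it differently.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from the four confusion-matrix cells."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for the undefined case
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=50, fp=0, tn=50, fn=0))   # 1.0  (every prediction correct)
print(mcc(tp=0, fp=50, tn=0, fn=50))   # -1.0 (every prediction wrong)
print(mcc(tp=0, fp=0, tn=95, fn=5))    # 0.0  (always predicts negative)
```

The last case scores 95% accuracy yet an MCC of 0, which is the behaviour described above.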

What is specificity and how is it different from precision?

Specificity (True Negative Rate) is TN / (TN + FP) - the fraction of actual negatives the model correctly identifies. Precision is TP / (TP + FP) - the fraction of predicted positives that are actually positive. Both deal with false positives, but from different perspectives: specificity looks at what fraction of real negatives were safely rejected, while precision looks at what fraction of the model's positive flags can be trusted. In medical testing, specificity tells you how good the test is at ruling out the disease; precision tells you how likely it is that a positive test result means the disease is present (this depends heavily on prevalence).
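The dependence on prevalence can be made concrete with Bayes' rule. In this sketch the sensitivity and specificity are held fixed at 85% and 90% (illustrative values), and only the prevalence changes; the function name `ppv` is a hypothetical helper.

```python
def ppv(sensitivity, specificity, prevalence):
    """Precision (PPV) implied by a test's sensitivity, specificity and prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


for prevalence in (0.50, 0.10, 0.01):
    print(f"prevalence={prevalence:.2f} precision={ppv(0.85, 0.90, prevalence):.3f}")
# prevalence=0.50 precision=0.895
# prevalence=0.10 precision=0.486
# prevalence=0.01 precision=0.079
```

Specificity stays at 90% in all three rows, while precision falls from roughly 89% to under 8% as the condition becomes rarer.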

How do I choose between F1 and F2 score?

Both are weighted harmonic means of precision and recall. F1 weights them equally. F2 (beta = 2) gives recall twice the weight of precision, making it the right choice when missing a positive (a false negative) is costlier than a false alarm. Use F1 when precision and recall errors are equally costly; use F2 in scenarios like cancer screening or fault detection where failing to catch a true positive causes more harm than an occasional false alarm.
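A short sketch of the general F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), makes the difference visible. The two models and their precision and recall values are hypothetical, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favours recall."""
    beta_sq = beta * beta
    denominator = beta_sq * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when precision and recall are both zero
    return (1 + beta_sq) * precision * recall / denominator


# Two hypothetical models with mirrored precision and recall.
cautious = {"precision": 0.9, "recall": 0.6}
thorough = {"precision": 0.6, "recall": 0.9}

for name, m in (("cautious", cautious), ("thorough", thorough)):
    f1 = f_beta(m["precision"], m["recall"], beta=1)
    f2 = f_beta(m["precision"], m["recall"], beta=2)
    print(f"{name}: F1={f1:.3f} F2={f2:.3f}")
# cautious: F1=0.720 F2=0.643
# thorough: F1=0.720 F2=0.818
```

F1 cannot separate the two models, while F2 prefers the one with higher recall.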

What is balanced accuracy and when should I use it?

Balanced accuracy is the average of sensitivity (recall) and specificity: (TPR + TNR) / 2. Unlike raw accuracy, it is not inflated by class imbalance because it gives equal weight to both classes regardless of how many samples are in each. Use it when your dataset is imbalanced and you want a simple percentage metric that remains interpretable, as an alternative to MCC, which is harder to explain to a non-technical audience.

Can accuracy be high while the model is still bad?

Yes. If 95% of your samples are negative, a classifier that always predicts "negative" achieves 95% accuracy while having 0% recall - it never detects a positive case. This is why you should always check MCC, F1, recall, and balanced accuracy alongside accuracy, especially when the positive class is rare. This calculator computes all of these automatically so you can spot the problem at a glance.
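The arithmetic for that scenario, as a minimal sketch with an assumed dataset of 1,000 samples (950 negative, 50 positive) and a classifier that always predicts "negative":

```python
# Always-negative classifier on an assumed 95% negative dataset.
tp, fp, tn, fn = 0, 0, 950, 50

accuracy = (tp + tn) / (tp + fp + tn + fn)
recall = tp / (tp + fn)            # no positives are ever caught
specificity = tn / (tn + fp)       # every negative is correctly rejected
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy={accuracy:.2f}")                    # accuracy=0.95
print(f"recall={recall:.2f}")                        # recall=0.00
print(f"balanced_accuracy={balanced_accuracy:.2f}")  # balanced_accuracy=0.50
```

Balanced accuracy drops to 50%, the same as guessing, even though raw accuracy reads 95%.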

Written by Dr. Hannah Brandt, PhD Statistician · Munich, Germany

Applied statistician translating rigorous probability theory into clear, accurate tools for researchers and practitioners.
