Conf_Matrix

Does high accuracy imply that a test is useful? Well, it depends. Let's take a look at why accuracy can be misleading in certain situations. We will use a confusion matrix, which most basic statistics courses introduce in some fashion.

	Malignant	Benign
Test Malignant	TP	FP	$$ PPV = \dfrac{TP}{TP+FP}$$ (Precision)
Test Benign	FN	TN	$$ NPV= \dfrac{TN}{FN+TN}$$
	$$ TPR=\dfrac{TP}{TP+FN}$$ (Sensitivity, Recall)	$$ TNR=\dfrac{TN}{FP+TN}$$ (Specificity)	$$ F2 = \dfrac{5 \cdot PPV \cdot TPR}{4 \cdot PPV+TPR} $$
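
The definitions in the table can be written as a short Python helper. This is a minimal sketch: the function name `confusion_metrics` and the example counts are hypothetical, chosen only for easy arithmetic. Accuracy is computed as (TP + TN) divided by the total number of cases.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Return the metrics from the table for one 2x2 confusion matrix."""

    def ratio(numerator, denominator):
        # A ratio with an empty denominator is undefined; report it as NaN.
        return numerator / denominator if denominator else math.nan

    ppv = ratio(tp, tp + fp)   # precision
    npv = ratio(tn, fn + tn)
    tpr = ratio(tp, tp + fn)   # sensitivity, recall
    tnr = ratio(tn, fp + tn)   # specificity
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f2 = ratio(5 * ppv * tpr, 4 * ppv + tpr)
    return {"ppv": ppv, "npv": npv, "tpr": tpr, "tnr": tnr,
            "accuracy": accuracy, "f2": f2}


# Hypothetical counts, not taken from any study.
metrics = confusion_metrics(tp=80, fp=20, fn=20, tn=880)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Returning NaN for undefined ratios is one possible design choice; raising an error would be equally reasonable.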

First, let's examine what happens with a perfect test, which obviously does not exist in any real-world scenario.

	Malignant	Benign
Test Malignant	100	0	$$ PPV = \dfrac{100}{100+0}=1$$ (Precision)
Test Benign	0	100	$$ NPV= \dfrac{100}{0+100}=1$$
	$$ TPR=\dfrac{100}{100+0}=1$$ (Sensitivity, Recall)	$$ TNR=\dfrac{100}{0+100}=1$$ (Specificity)	$$ Accuracy = \dfrac{100+100}{100+0+0+100}=1$$

As you can see, everything is perfect in this scenario: accuracy, sensitivity, specificity, PPV, and NPV are all at 100%.

Now, let's make this slightly more realistic and throw in one false positive and one false negative.

	Malignant	Benign
Test Malignant	99	1	$$ PPV = \dfrac{99}{99+1}=0.99$$ (Precision)
Test Benign	1	99	$$ NPV= \dfrac{99}{1+99}=0.99$$
	$$ TPR=\dfrac{99}{99+1}=0.99$$ (Sensitivity, Recall)	$$ TNR=\dfrac{99}{1+99}=0.99$$ (Specificity)	$$ Accuracy = \dfrac{99+99}{99+1+1+99}=0.99$$

All of the metrics dropped by one percentage point, to 99%.

Let's start moving toward real-world scenarios. Real prevalences are usually not 1:1. We will keep our "test" near perfect, with false positive and false negative rates of only 1% each. But the condition will have an asymmetric prevalence of 1:9. For example, for every malignant soft tissue sarcoma, there are approximately nine lipomas, neuromas, ganglion cysts, hemangiomas, and other benign lesions.

In this example, we are testing 1000 cases. So, using some gold standard, 100 will be diagnosed as malignant and 900 will be diagnosed as benign. Our test will still be very accurate, sensitive, and specific. How does a test that gets only 1% wrong perform on the different metrics when prevalence is not symmetric?

	Malignant	Benign
Test Malignant	99	9	$$ PPV = \dfrac{99}{99+9}=0.92$$ (Precision)
Test Benign	1	891	$$ NPV= \dfrac{891}{1+891}>0.99$$
	$$ TPR=\dfrac{99}{99+1}=0.99$$ (Sensitivity, Recall)	$$ TNR=\dfrac{891}{9+891}=0.99$$ (Specificity)	$$ Accuracy = \dfrac{99+891}{99+1+9+891}=0.99$$
Prevalence Ratio	1	9

The accuracy, specificity, and sensitivity stayed the same, but precision dropped from 99% to 92%.
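
One way to see the effect of prevalence is to rebuild the expected counts from sensitivity, specificity, and prevalence. The following is a minimal sketch; the function names `expected_matrix` and `ppv_at` are hypothetical, and the counts are expected values, so they can be fractional in general.

```python
def expected_matrix(sensitivity, specificity, prevalence, n=1000):
    """Expected (tp, fp, fn, tn) for n cases at a given prevalence."""
    malignant = n * prevalence
    benign = n - malignant
    tp = sensitivity * malignant
    fn = malignant - tp
    tn = specificity * benign
    fp = benign - tn
    return tp, fp, fn, tn


def ppv_at(sensitivity, specificity, prevalence):
    """Precision of the same test when only the prevalence changes."""
    tp, fp, _, _ = expected_matrix(sensitivity, specificity, prevalence)
    return tp / (tp + fp)


# The same 99% sensitive, 99% specific test at 1:1 and at 1:9 prevalence.
for prevalence in (0.5, 0.1):
    print(prevalence, round(ppv_at(0.99, 0.99, prevalence), 2))
```

Sensitivity and specificity are inputs here, so they cannot change; only PPV moves when the prevalence does.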

If we take one further step in the direction of reality, we realize that many research studies report that their test is about 90% accurate, sensitive and specific. Let's see:

	Malignant	Benign
Test Malignant	90	90	$$ PPV = \dfrac{90}{90+90}=0.50$$ (Precision)
Test Benign	10	810	$$ NPV= \dfrac{810}{10+810}=0.99$$
	$$ TPR=\dfrac{90}{90+10}=0.90$$ (Sensitivity, Recall)	$$ TNR=\dfrac{810}{90+810}=0.90$$ (Specificity)	$$ Accuracy = \dfrac{90+810}{90+10+90+810}=0.90$$
Prevalence Ratio	1	9

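The PPV (precision) dropped to 50%. The accuracy, sensitivity, and specificity are at 90%, and NPV is 99%. What is going on here? By aiming for high accuracy, we settled on minimizing the total number of false positives and false negatives. But a 10% error rate applied to 900 benign cases produces 90 false positives, as many as the 90 true positives found among the 100 malignant cases. Furthermore, since the benign condition was more prevalent than the malignant condition, NPV benefited from the large number of TNs. The test is no longer precise: a positive result is right only half the time.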
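
The same result follows from the definitions. As a sketch, write p for the prevalence of the malignant condition (a symbol introduced here, not used in the tables). Dividing TP and FP by the total number of cases gives PPV in terms of sensitivity, specificity, and prevalence:

$$ PPV = \dfrac{TPR \cdot p}{TPR \cdot p + (1-TNR)(1-p)} $$

With TPR = TNR = 0.9 and p = 0.1 (a prevalence ratio of 1:9):

$$ PPV = \dfrac{0.9 \cdot 0.1}{0.9 \cdot 0.1 + 0.1 \cdot 0.9} = \dfrac{0.09}{0.18} = 0.5 $$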

Here is a question: how would you like to get into a commercial plane that is 99% safe? It is highly doubtful that aviation would survive as an industry if every 100th plane was not safe. When 2 of 387 737-MAX8 planes crashed (0.5%), all 387 planes were grounded until the problem was fixed. There are more than 25,000 commercial planes in the world. In 2018, there were 13 fatal incidents (roughly 0.05%). Roughly 99.95% of planes were safe from a fatal incident.

Using similar logic, we would like to have a 100% (or at least extremely close to 100%) sensitive test when it comes to detecting malignancies. In practice, such high sensitivity can only be achieved at the expense of precision. Why at the expense of precision? Because as false negatives drop, false positives rise. Numerically, FPs will rise much more, because there are nine benign cases for every malignant one in our example.
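
The trade-off can be illustrated with a toy example. Assume, purely for illustration, that the test produces a suspicion score from 0 to 100 and calls a case malignant when the score reaches a cutoff. The scores below are made up (10 malignant and 90 benign cases, matching the 1:9 ratio) and do not come from any study; `errors_at` is a hypothetical helper.

```python
# Hypothetical scores on a 0-100 scale; higher means "more suspicious".
malignant_scores = [30, 45, 55, 65, 70, 75, 80, 85, 90, 95]
benign_scores = list(range(90))  # 0, 1, ..., 89


def errors_at(cutoff):
    """Return (false negatives, false positives) for a given cutoff."""
    fn = sum(score < cutoff for score in malignant_scores)
    fp = sum(score >= cutoff for score in benign_scores)
    return fn, fp


# Lowering the cutoff catches more malignant cases but flags more benign ones.
for cutoff in (80, 60, 40, 20):
    fn, fp = errors_at(cutoff)
    print(f"cutoff={cutoff}: FN={fn}, FP={fp}")
```

In this toy data, moving the cutoff from 80 to 20 removes 6 false negatives and adds 60 false positives.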

To achieve 100% or very close to 100% sensitivity, a real-world scenario would look something like this:

	Malignant	Benign
Test Malignant	100	300	$$ PPV = \dfrac{100}{100+300}=0.25$$ (Precision)
Test Benign	0	600	$$ NPV= \dfrac{600}{0+600}=1$$
	$$ TPR=\dfrac{100}{100+0}=1$$ (Sensitivity, Recall)	$$ TNR=\dfrac{600}{300+600}=0.67$$ (Specificity)	$$ Accuracy = \dfrac{100+600}{100+0+300+600}=0.7$$
Prevalence Ratio	1	9

To be certain (or nearly certain) that the test catches every malignant case (TPR = 100%), accuracy, specificity, and precision have to drop in real-world scenarios.

At this point, our discussion becomes philosophical rather than scientific. What is an acceptable and reasonable sensitivity goal for a test used in health care? Many research publications are happy to report a new test that is 90% or 95% sensitive. This translates to missing 1 in 10 or 1 in 20 malignant cases. Would you board a plane whose safety was checked with an instrument that was wrong 1 in 10 times? Continuing with our airplane analogy, we see that humans accept a 99.95% safety record as reasonable for air travel. For a sensitivity calculation, this translates to 5 false negatives in 10,000 malignant cases, or 1 in 2,000. To achieve 99.95% sensitivity with, say, about 33% precision, here is how the confusion matrix for a test of a condition with asymmetric prevalence would look:

	Malignant	Benign
Test Malignant	1999	3998	$$ PPV = \dfrac{1999}{1999+3998}=0.33$$ (Precision)
Test Benign	1	14002	$$ NPV= \dfrac{14002}{14002+1}=0.9999$$
	$$ TPR=\dfrac{1999}{1999+1}=0.9995$$ (Sensitivity, Recall)	$$ TNR=\dfrac{14002}{14002+3998}=0.7779$$ (Specificity)	$$ Accuracy = \dfrac{1999+14002}{1999+1+3998+14002}=0.8$$
Prevalence Ratio	1	9
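
The arithmetic behind a matrix like the one above can be sketched in Python. The helper `matrix_for_targets` and its parameter names are hypothetical, introduced only for illustration. It fixes the number of malignant and benign cases (2,000 and 18,000, the 1:9 ratio) and solves PPV = TP / (TP + FP) for FP. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def matrix_for_targets(n_malignant, n_benign, sensitivity, precision):
    """Counts (tp, fp, fn, tn) that hit a target sensitivity and precision."""
    tp = sensitivity * n_malignant
    fn = n_malignant - tp
    # From PPV = TP / (TP + FP):  FP = TP * (1 - PPV) / PPV
    fp = tp * (1 - precision) / precision
    tn = n_benign - fp
    if tn < 0:
        raise ValueError("targets are not reachable with this many benign cases")
    return tp, fp, fn, tn


sensitivity = Fraction(1999, 2000)  # 99.95%
for precision in (Fraction(1, 2), Fraction(1, 3)):
    tp, fp, fn, tn = matrix_for_targets(2000, 18000, sensitivity, precision)
    specificity = tn / (fp + tn)
    print(f"precision={float(precision):.2f}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"specificity={float(specificity):.4f}")
```

With a precision target of 1/3, this reproduces the counts in the table (FP = 3998, TN = 14002). A precision target of 1/2 would instead require FP = 1999 and TN = 16001.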

When we come across a paper that reports 95% accuracy, we should look under the hood and see at what cost to sensitivity, specificity, and precision that accuracy was obtained.