
A Kappa correlation coefficient of 0.85 is considered "almost perfect". Why? Because two authors (Landis and Koch) said so in an article published in the 1970s. Let's take a look at how "perfect" this statistical test is to start with. And, are there alternatives?

First, why even bother with the Kappa statistic when we can simply calculate percent agreement between two observers? Because "simple" is not good enough. At least in some folks' opinion. You see, there is a chance that the two observers agreed not because they really agreed, but because they "randomly" checked the same box for whatever random reason. Ok, sounds convoluted, but no one would deny that randomness is always present around us. So, a chance exists that the agreement was "random," and not real. But let's keep that simple measure in mind, for now:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | A | B |
| Green | C | D |

$$ Percent Agreement = \dfrac{A+D}{A+B+C+D}$$
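
For anyone who prefers code to notation, here is a minimal Python sketch of that formula; the name `percent_agreement` and the cell labels `a, b, c, d` simply mirror the table above.

```python
def percent_agreement(a, b, c, d):
    """Fraction of cases where both readers checked the same box.

    a = both said Blue, d = both said Green, b and c are the disagreements.
    """
    return (a + d) / (a + b + c + d)


print(percent_agreement(500, 500, 500, 500))  # 0.5 -- the first numeric example below
```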

So, how do we account for that "chance agreement"? Well, that is what Cohen's Kappa coefficient is for.

$$ \kappa = 1 - \dfrac{1-p_{o}}{1-p_{e}}$$

Are you ready for some complexity? $p_{o}$ is the probability of observed agreement and $p_{e}$ is the probability of chance agreement. $p_{o}$ is easy: it is the same as our Percent Agreement above.
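
If that form looks unfamiliar, one line of algebra turns it into the way Kappa is usually written: the agreement observed beyond chance, divided by the maximum agreement beyond chance that was available.

$$ \kappa = 1 - \dfrac{1-p_{o}}{1-p_{e}} = \dfrac{(1-p_{e})-(1-p_{o})}{1-p_{e}} = \dfrac{p_{o}-p_{e}}{1-p_{e}}$$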

What about $p_{e}$? Think of it this way: what is the combined probability that both Readers would choose "blue" at random? What is the probability that they both would choose "green" at random? Now, what is the overall random chance that they agree on either "blue" or "green"?

$$p_{e} = p_{blue} + p_{green}$$

$$p_{blue} = \dfrac{A+B}{A+B+C+D}\times\dfrac{A+C}{A+B+C+D}$$

$$p_{green} = \dfrac{D+B}{A+B+C+D}\times\dfrac{D+C}{A+B+C+D}$$

$$p_{e} = \dfrac{(A+B)\times(A+C)}{(A+B+C+D)^{2}}+\dfrac{(D+B)\times(D+C)}{(A+B+C+D)^{2}}$$
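
Putting $p_{o}$ and $p_{e}$ together, here is a minimal Python sketch of the whole calculation, assuming a plain 2x2 table of counts; the helper name `cohens_kappa` and the cell names `a, b, c, d` are just labels matching the table above.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table of counts (two readers, two labels)."""
    n = a + b + c + d
    p_o = (a + d) / n                        # observed agreement
    p_blue = ((a + b) / n) * ((a + c) / n)   # chance that both pick Blue
    p_green = ((d + b) / n) * ((d + c) / n)  # chance that both pick Green
    p_e = p_blue + p_green                   # total chance agreement
    return (p_o - p_e) / (1 - p_e)


print(round(cohens_kappa(500, 10, 10, 500), 2))  # 0.96 -- the balanced example below
```

Note the denominator $1-p_{e}$: when one category dominates, $p_{e}$ creeps toward 1, the denominator shrinks, and every stray disagreement gets magnified. That is exactly where the trouble in the examples below comes from.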

This got pretty complicated and philosophical quickly, didn't it?

Here is an example where Kappa makes a good point about randomness being important:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 500 |
| Green | 500 | 500 |

$$ Percent Agreement = 50\%$$

$$ \kappa = 0$$

If two observers randomly check off "blue" and "green", then with a large enough sample they will happen to choose the same color about 50% of the time.
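
Working through the numbers from that table: each reader calls half the cases "blue" and half "green", so chance agreement is already 50%, and there is nothing left over for Kappa to credit.

$$p_{o} = \dfrac{500+500}{2000} = 0.5,\qquad p_{e} = \dfrac{1000\times1000+1000\times1000}{2000^{2}} = 0.5,\qquad \kappa = \dfrac{0.5-0.5}{1-0.5} = 0$$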

Here is the problem with Kappa: it works reasonably well when there is a balanced number of "green" and "blue" observations. For example:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 10 |
| Green | 10 | 500 |

$$ Percent Agreement = 98\%$$

$$ \kappa = 0.96$$
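
The same arithmetic on this table: both readers still split their calls roughly 50/50, so chance agreement stays at 0.5 and nearly all of the observed agreement counts as real.

$$p_{o} = \dfrac{500+500}{1020} \approx 0.98,\qquad p_{e} = \dfrac{510\times510+510\times510}{1020^{2}} = 0.5,\qquad \kappa = \dfrac{0.98-0.5}{1-0.5} \approx 0.96$$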

But if there is a bunch of one color and hardly any of the other, we run into some issues:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 10 |
| Green | 10 | 20 |

$$ Percent Agreement = 96\%$$

$$ \kappa = 0.64$$

And, a more extreme example to illustrate the difference between Percent Agreement and Kappa:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 1 |
| Green | 1 | 1 |

$$ Percent Agreement = 99.6\%$$

$$ \kappa = 0.49$$
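
Working it through (values rounded): almost every case is "blue", so chance agreement is already above 0.99, the denominator $1-p_{e}$ is tiny, and a couple of stray disagreements are enough to cut Kappa roughly in half.

$$p_{o} = \dfrac{501}{503} \approx 0.9960,\qquad p_{e} = \dfrac{501\times501+2\times2}{503^{2}} \approx 0.9921,\qquad \kappa \approx \dfrac{0.0039}{0.0079} \approx 0.49$$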

How is this "moderate agreement" when the two readers agreed on almost all observations? You get the point.