
A Kappa correlation coefficient of 0.85 is considered "almost perfect". Why? Because two authors (Landis and Koch) said so in an article published in the 1970s. Let's take a look at how "perfect" this statistical test is to start with. And, are there alternatives?

First, why even bother with the Kappa statistic when we can simply calculate percent agreement between two observers? Because "simple" is not good enough. At least in some folks' opinion. You see, there is a chance that the two observers agreed not because they really agreed, but because they "randomly" checked the same box for whatever random reason. Ok, sounds convoluted, but no one would deny that randomness is always present around us. So, a chance exists that the agreement was "random," and not real. But let's keep that simple measure in mind, for now:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | A | B |
| Green | C | D |

$$ Percent Agreement = \dfrac{A+D}{A+B+C+D}$$
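
For anyone who prefers code to notation, here is a minimal Python sketch of that formula; the name `percent_agreement` and the cell labels `a, b, c, d` simply mirror the table above.

```python
def percent_agreement(a, b, c, d):
    """Fraction of cases where both readers checked the same box.

    a = both said Blue, d = both said Green, b and c are the disagreements.
    """
    return (a + d) / (a + b + c + d)


print(percent_agreement(500, 500, 500, 500))  # 0.5 -- the first numeric example below
```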

So, how do we account for that "chance agreement"? Well, that is what Cohen's Kappa coefficient is for.

$$ \kappa = 1 - \dfrac{1-p_{o}}{1-p_{e}}$$

Are you ready for some complexity? $p_{o}$ is the probability of observed agreement and $p_{e}$ is the probability of chance agreement. $p_{o}$ is easy: it is the same as our Percent Agreement above.
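
If that form looks unfamiliar, one line of algebra turns it into the way Kappa is usually written: the agreement observed beyond chance, divided by the maximum agreement beyond chance that was available.

$$ \kappa = 1 - \dfrac{1-p_{o}}{1-p_{e}} = \dfrac{(1-p_{e})-(1-p_{o})}{1-p_{e}} = \dfrac{p_{o}-p_{e}}{1-p_{e}}$$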

What about $p_{e}$? Think of it this way: what is the combined probability that both Readers would choose "blue" at random? What is the probability that they both would choose "green" at random? Now, what is the overall random chance that they agree on either "blue" or "green"?

$$p_{e} = p_{blue} + p_{green}$$

$$p_{blue} = \dfrac{A+B}{A+B+C+D}\times\dfrac{A+C}{A+B+C+D}$$

$$p_{green} = \dfrac{D+B}{A+B+C+D}\times\dfrac{D+C}{A+B+C+D}$$

$$p_{e} = \dfrac{(A+B)\times(A+C)}{(A+B+C+D)^{2}}+\dfrac{(D+B)\times(D+C)}{(A+B+C+D)^{2}}$$
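
Putting $p_{o}$ and $p_{e}$ together, here is a minimal Python sketch of the whole calculation, assuming a plain 2x2 table of counts; the helper name `cohens_kappa` and the cell names `a, b, c, d` are just labels matching the table above.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table of counts (two readers, two labels)."""
    n = a + b + c + d
    p_o = (a + d) / n                        # observed agreement
    p_blue = ((a + b) / n) * ((a + c) / n)   # chance that both pick Blue
    p_green = ((d + b) / n) * ((d + c) / n)  # chance that both pick Green
    p_e = p_blue + p_green                   # total chance agreement
    return (p_o - p_e) / (1 - p_e)


print(round(cohens_kappa(500, 10, 10, 500), 2))  # 0.96 -- the balanced example below
```

Note the denominator $1-p_{e}$: when one category dominates, $p_{e}$ creeps toward 1, the denominator shrinks, and every stray disagreement gets magnified. That is exactly where the trouble in the examples below comes from.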

This got pretty complicated and philosophical quickly, didn't it?

Here is an example where Kappa makes a good point about randomness being important:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 500 |
| Green | 500 | 500 |

$$ Percent Agreement = 50\%$$

$$ \kappa = 0$$

If two observers randomly check off "blue" and "green", then with a large enough sample they will happen to choose the same color about 50% of the time.
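
Working through the numbers from that table: each reader calls half the cases "blue" and half "green", so chance agreement is already 50%, and there is nothing left over for Kappa to credit.

$$p_{o} = \dfrac{500+500}{2000} = 0.5,\qquad p_{e} = \dfrac{1000\times1000+1000\times1000}{2000^{2}} = 0.5,\qquad \kappa = \dfrac{0.5-0.5}{1-0.5} = 0$$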

Here is the problem with Kappa: it works reasonably well when there is a balanced number of "green" and "blue" observations. For example:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 10 |
| Green | 10 | 500 |

$$ Percent Agreement = 98\%$$

$$ \kappa = 0.96$$
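
The same arithmetic on this table: both readers still split their calls roughly 50/50, so chance agreement stays at 0.5 and nearly all of the observed agreement counts as real.

$$p_{o} = \dfrac{500+500}{1020} \approx 0.98,\qquad p_{e} = \dfrac{510\times510+510\times510}{1020^{2}} = 0.5,\qquad \kappa = \dfrac{0.98-0.5}{1-0.5} \approx 0.96$$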

But if there is a bunch of one color and hardly any of the other, we run into some issues:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 10 |
| Green | 10 | 20 |

$$ Percent Agreement = 96\%$$

$$ \kappa = 0.64$$

And, a more extreme example to illustrate the difference between Percent Agreement and Kappa:

| Reader 1 \ Reader 2 | Blue | Green |
| --- | --- | --- |
| Blue | 500 | 1 |
| Green | 1 | 1 |

$$ Percent Agreement = 99.6\%$$

$$ \kappa = 0.49$$
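
Working it through (values rounded): almost every case is "blue", so chance agreement is already above 0.99, the denominator $1-p_{e}$ is tiny, and a couple of stray disagreements are enough to cut Kappa roughly in half.

$$p_{o} = \dfrac{501}{503} \approx 0.9960,\qquad p_{e} = \dfrac{501\times501+2\times2}{503^{2}} \approx 0.9921,\qquad \kappa \approx \dfrac{0.0039}{0.0079} \approx 0.49$$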

How is this "moderate agreement" when the two readers agreed on almost all observations? You get the point.