# Tail Inequalities

The empirical rule from the previous section tells us how data are
concentrated for *normal* distributions. But what if we don't know that our distribution is normal, or
if we know that it isn't?

As it happens, there are additional "rules" we can apply in such situations. While they aren't quite as informative as the empirical rule, at least they give us some handle on things.

In this section we'll cover two such rules: the Markov inequality and the Chebyshev inequality. In a nutshell, these rules establish upper bounds on how much of the data can be in the distribution tails.

Tail inequalities are particularly useful for anomaly detection (also known as outlier detection). In this discussion we'll focus on this particular application. We'll begin with a discussion of noise vs. anomalies.

## Noise vs. anomalies

Almost all anomaly detection methods involve scoring data points with an outlier score, even if only implicitly. Then we decide where on that scale we want to put the discrimination threshold, and classification is just a matter of checking whether a given score crosses the threshold.

One of the things that makes anomaly detection challenging is that there's no bright line separating scores generated by noise from those generated by true anomalies. The data-generating process typically produces some amount of noise. Then random sampling produces noise in the form of sampling error. Noise can generate scores that exceed those associated with smaller anomalies:

We can make it easier to distinguish anomalies from noise by having a sense for how much "ambient noise" to expect. Then we decide where to place our anomaly detection threshold relative to that noise, placing it somewhere at the "tail" of that action. But it's a balancing act. Placing the threshold closer to the noise gives us better recall (ability to capture real anomalies) but worse precision (ability to exclude false anomalies), whereas placing it further from the noise gives us better precision but worse recall. (See my posts on PR curves and ROC curves for more information.)

Now that we better understand the conceptual difference between noise and anomalies, we're ready to explore the tail inequality approach to limiting our false positives.

## Understanding tail inequalities

Tail inequalities (a.k.a. concentration inequalities) allow us to bound the proportion of false positives in
the tail. Given a variable `X` and a threshold `α`, a tail inequality has the general
form

*P* stands for "probability", but we haven't covered that yet, so you can just think of it as
representing a proportion in the range [0, 1]. The formula says that the proportion of the data in the tail
past `α` is no greater than some bound.

Visually, we're looking at something like this:

A bound can be weaker or stronger, depending on how much information we have about the variable `X`.
In general we prefer strong (or "tight") bounds because they better approximate the true probability
\(P(X > \alpha)\), but establishing a strong bound requires more information. Either way, having a bound in
place offers some sense for how problematic noise might be for a given threshold `α`.

Now let's look at a couple of important tail inequalities.

## Markov inequality

The first tail inequality we'll consider is called the *Markov inequality*. This bound on
\(P(X > \alpha)\) applies when there's a single threshold `α` such that the anomalies all live
beyond that threshold.

The Markov inequality requires the following assumptions:

`X`is non-negative. (Actually it's sufficient for X to be "almost surely" non-negative, but we'll ignore that advanced detail here.)`X`has a finite mean. (This will always be true for finite datasets. Again the more general case is advanced and out of scope for this discussion.)

Then for any constant threshold `α` > 0:

where \(\mu_X\) is the mean of `X`.

Let's see how this bound performs against a hypothetical variable \(X \sim N(100, 15^2)\) (i.e., normally distributed with mean 100 and standard deviation 15—think IQ scores), truncated to eliminate negative values, which are virtually nonexistent anyway. Note that we just happen to be using a normal distribution here, but the Markov inequality applies to any distribution satisfying the assumptions above.

\(\alpha\) | \(z(\alpha)\) | \(P(X > \alpha)\) | Markov bound | Diff |
---|---|---|---|---|

95 | -0.33 | -0.6293 | 1.0526 | 0.4233 |

100 | 0.00 | 0.5000 | 1.0000 | 0.5000 |

105 | 0.33 | 0.3707 | 0.9524 | 0.5817 |

110 | 0.67 | 0.2514 | 0.9091 | 0.6577 |

115 | 1.00 | 0.1587 | 0.8696 | 0.7109 |

120 | 1.33 | 0.0918 | 0.8333 | 0.7415 |

125 | 1.67 | 0.0475 | 0.8000 | 0.7525 |

130 | 2.00 | 0.0228 | 0.7692 | 0.7464 |

There are a couple things to observe here:

- Only
`α`'s greater than \(\mu_X\) produce anything useful at all, given the way the inequality is defined. - Clearly the bound is very weak, and converges too slowly to be of much practical use. This is because it
doesn't incorporate any information about the shape of the probability distribution or the form of
`X`. But it's a start.

In many cases it's useful to relax the assumption that `X` is non-negative. The next tail
inequality—the Chebyshev inequality—allows us to do just that.

## Chebyshev inequality

The Chebyshev inequality allows us to bound probabilities attaching to the distance between an outlier score
and its mean. The outlier scores `X` themselves can be on either side of the mean, which is useful
when we want anomaly detection to be a matter of being too far away from that mean, whether too high or too
low.

The trick to getting the Chebyshev inequality is to recognize that the distance is itself a non-negative
function of `X`, and therefore the Markov inequality applies to the distance.

The Chebyshev inequality makes the following assumptions:

`X`has a finite mean.`X`has a finite variance.

Then for any constant threshold `α` > 0:

where \(\mu_X\) is the mean and \(\sigma_X^2\) is the variance.

Using the same hypothetical variable \(X \sim N(100, 15^2)\) as before, let's repeat the exercise of plugging in
some example `α`'s (which here are distances from `μ` = 100 rather than direct thresholds on
`α`) to see how the bound performs:

\(\alpha\) | \(z(\alpha)\) | \(P(X > \alpha)\) | Chebyshev bound | Diff |
---|---|---|---|---|

5 | 0.33 | 0.3707 | 9.0000 | 8.6293 |

10 | 0.67 | 0.2514 | 2.2500 | 1.9986 |

15 | 1.00 | 0.1587 | 1.0000 | 0.8413 |

20 | 1.33 | 0.0918 | 0.5625 | 0.4707 |

25 | 1.67 | 0.0475 | 0.3600 | 0.3125 |

30 | 2.00 | 0.0028 | 0.2500 | 0.2272 |

As was the case with Markov inequalities, only sufficiently large `α`'s produce meaningful Chebyshev
bounds. From the table above and the inequality itself it's clear "sufficiently large" means that `α`
is greater than the standard deviation `σ` = 15.

The Chebyshev inequality incorporates the variance of `X`, which allows it to converge faster than
the Markov inequality since the latter incorporates only the mean of `X`. But it's still a weak bound
since even the variance isn't really much information.

Additionally, our example involves a normal distribution, which has an exponential drop-off. The Markov and Chebyshev bounds are polynomial, so their convergence is weak here.

There are additional rules with faster convergence, including Chernoff bounds and the Hoeffding inequality. We won't cover them here as they are more advanced, but I've included links to those in the Additional resources below should you want to learn more about them.