Ethics and Statistics

Statistics is an important field because it helps us to understand the world so we can make decisions. Some of these decisions are micro-level decisions, like those surrounding the creditworthiness or the insurability of individual applicants. Others are larger decisions, like the US Federal Reserve assessing macroeconomic trends and making adjustments to monetary policy.

Given the importance of data and statistics in everyday life, it's no surprise that ethical issues arise. Decisions backed by statistics and machine learning (ML) impact the entire world population on a daily basis. It's important for both consumers and producers of statistical and ML methods to understand the associated ethical issues to avoid problems ranging from invalid conclusions to immoral and even illegal actions.

This is a large topic, and I can't hope to do it justice here. But I'll touch upon a couple of key topics: misleading uses of statistics and biased datasets.

Misleading uses of statistics

Statistics is complicated. The underlying data are often hard to see, and the analyses and conclusions can be unintuitive. Yet the appetite for statistical analyses is such that nearly everybody consumes them on a regular basis, and many people, both laypeople and experts, produce analyses too. This creates a situation where we need to be on guard against misleading uses of statistics.

More often than not, the improper use is unintended: beginners, being beginners, make rookie mistakes, such as confusing correlation with causation. But even experienced practitioners make mistakes. Statistics is a very human endeavor.

But in some cases, people produce intentionally deceptive statistics. Some examples:

  • a student excluding the summer class she failed at the local community college when reporting her grade point average on a scholarship application
  • a researcher repeating an experiment a large number of times, with possibly minor variations, and then reporting on the one time it succeeded (also known as "data dredging" or "p-hacking")
  • a survey author asking questions in such a way as to nudge respondents in a certain direction (e.g., "Would you agree that...")
  • a media organization using misleading charts to promote a political opinion

Let's start with a classic example from xkcd.

Example: P-hacking

"Significant." Used with kind permission from xkcd.

Here are some hypothetical examples of misleading charts. They're similar to real-world examples, but I'm creating my own, since the real ones are often political in nature, and I want to avoid that here.

Example: Misleading bar charts

Here's a bar chart showing the popularity of two hypothetical political candidates with an electorate:

Candidate popularity

Candidate A is much more popular than candidate B, right? But do you see anything odd?

Look at the y-axis. Notice how it starts at 46.5%. What happens if we change the axis so that it starts at 0%?

Candidate popularity

This presents a totally different picture. Candidate A is still more popular, but just barely.

But even with this improvement, this second chart is still somewhat misleading, because it looks like both candidates are pretty popular. That's because the top value on the y-axis is 60%. When we change it to 100%, a clearer picture emerges:

Candidate popularity

Now we can readily see that each candidate is popular with roughly half the electorate, with candidate A enjoying a small advantage.

I've seen misleading bar charts like the ones above many times, even in the "official" news media. It isn't always intentional. In fact, Excel's default auto-scaling produced the first of the three charts. But sometimes it's absolutely intentional. Either way, it's important to pay attention to bar chart scales to avoid being duped.
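
If you'd like to reproduce this effect yourself, here's a minimal matplotlib sketch. The popularity numbers (52% and 48%) are made up for illustration; the only thing that changes between the three panels is the y-axis range:

    # Draw the same two bars three times, varying only the y-axis limits.
    # The data values are hypothetical.
    import matplotlib.pyplot as plt

    candidates = ["Candidate A", "Candidate B"]
    popularity = [52.0, 48.0]   # hypothetical percentages

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    for ax, (lo, hi) in zip(axes, [(46.5, 53), (0, 60), (0, 100)]):
        ax.bar(candidates, popularity)
        ax.set_ylim(lo, hi)   # the only difference between the panels
        ax.set_title(f"y-axis: {lo}% to {hi}%")
        ax.set_ylabel("Popularity (%)")
    plt.tight_layout()
    plt.show()

Identical data, three very different impressions.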

Let's look at one more example of a misleading chart, this time a pictograph.

Example: A misleading pictograph

In 2016, the year-end incarceration rate per 100,000 adults for Texas was 1,050, and for Washington state it was 530. (Source) Here's a pictograph that presents these data:

Incarceration rates for Texas and Washington state

See any issues?

Well, one major issue is the following: even though the rate for Texas is essentially twice that of Washington state, the Texas prisoner icon has four times the area! The reason is that doubling both the width and the height of an icon quadruples its area. (We double the height of the Texas icon because the rate is twice that of Washington state, and we double the width to keep the icon's proportions.) The result compares Texas and Washington in a highly misleading way.
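
As a quick sanity check, here's the arithmetic in Python. If an icon's area is meant to track the data, each linear dimension should be scaled by the square root of the ratio, not by the ratio itself:

    import math

    texas_rate = 1050        # incarcerations per 100,000 adults (2016)
    washington_rate = 530

    ratio = texas_rate / washington_rate     # about 1.98

    # Naive approach: scale BOTH width and height by the ratio.
    naive_area_ratio = ratio ** 2            # about 3.92 -- four times larger!

    # Correct approach: scale each dimension by sqrt(ratio),
    # so the AREA ends up proportional to the data.
    correct_scale = math.sqrt(ratio)         # about 1.41
    correct_area_ratio = correct_scale ** 2  # about 1.98, as intended

    print(f"naive area ratio:   {naive_area_ratio:.2f}")
    print(f"correct area ratio: {correct_area_ratio:.2f}")
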

These examples give you a sense of how we can mislead one another with statistics, whether intentionally or not. This happens all the time, and it's important to be vigilant when both producing and consuming statistics.

Now let's consider a second area in the intersection between ethics and statistics, which is biased datasets.

Biased datasets

A second area of concern in statistical (and machine learning) methods is biased datasets. This is where a dataset either includes, or else fails to include, information in a way that leads to problematic decisions: misleading, unfair, or even illegal ones. Such bias can additionally lead to public relations nightmares for the organizations involved.

One common case is a dataset that doesn't contain sufficiently diverse data. Statistical and machine learning models depend crucially on the data they consume, and if certain types of data are limited in number or missing, or include dubious correlations, the model's performance will generally suffer.
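
As a concrete (if heavily simplified) illustration, here's a sketch of the kind of representation check one might run before training a model. The column name and toy data are made up:

    # Check how well each group is represented before training a model.
    # A sketch; the "group" column and its values are hypothetical.
    import pandas as pd

    df = pd.DataFrame({
        "group":   ["A"] * 9 + ["B"],
        "outcome": [1, 0, 1, 1, 0, 1, 0, 1, 1, 1],
    })

    shares = df["group"].value_counts(normalize=True)
    print(shares)   # group A: 0.9, group B: 0.1

    # Flag groups that fall below an (arbitrary) representation threshold.
    THRESHOLD = 0.2
    underrepresented = shares[shares < THRESHOLD]
    print("Underrepresented groups:", list(underrepresented.index))

A model trained on data like this may look accurate overall while performing poorly for group B, which is exactly the kind of failure the examples below describe.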

Here is a very partial set of examples of the phenomena in question:

  • In 2015, Google Photos' automatic image labeling tagged photos of Black people as "gorillas," prompting a public apology from Google.
  • In 2018, Amazon reportedly scrapped an experimental résumé-screening tool after discovering that, having been trained largely on résumés submitted by men, it penalized résumés containing the word "women's."

Obviously, the errors above were unintentional and deeply embarrassing to the companies in question. Misbehavior of this sort is hurtful and can perpetuate negative stereotypes. In other cases, incorrect model performance can lead to wrongful and potentially illegal rejection of lending or insurance applications, improper assessment of the risk of criminal recidivism, and so forth.

Users of statistical and machine learning methods should be aware of the issues surrounding biased datasets, and be on guard against such bias.