The San Francisco-Oakland Bay Bridge. Source

Datasets

The central object of study in statistics is the dataset. A dataset can be a simple list of values, like this:

14, 85, 12, 32, 104

More commonly, a dataset has a tabular structure. The columns are called variables and the rows are called elements.

Different people use different terminology for these concepts. Sometimes a tabular dataset is called a data table or a data frame. Other names for the elements of a dataset are cases, members, or instances.

Let's look at an example of a dataset with a tabular structure. Here are the first 20 rows in a dataset for San Francisco crime in 2016 (source):

IncidntNum | Category | Description | DayOfWeek | Date | Time | PdDistrict
120058272 | WEAPON LAWS | POSS OF PROHIBITED WEAPON | Friday | 1/29/16 0:00 | 11:00 | SOUTHERN
120058272 | WEAPON LAWS | FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE | Friday | 1/29/16 0:00 | 11:00 | SOUTHERN
141059263 | WARRANTS | WARRANT ARREST | Monday | 4/25/16 0:00 | 14:59 | BAYVIEW
160013662 | NON-CRIMINAL | LOST PROPERTY | Tuesday | 1/5/16 0:00 | 23:50 | TENDERLOIN
160002740 | NON-CRIMINAL | LOST PROPERTY | Friday | 1/1/16 0:00 | 0:30 | MISSION
160002869 | ASSAULT | BATTERY | Friday | 1/1/16 0:00 | 21:35 | NORTHERN
160003130 | OTHER OFFENSES | PAROLE VIOLATION | Saturday | 1/2/16 0:00 | 0:04 | SOUTHERN
160003259 | NON-CRIMINAL | FIRE REPORT | Saturday | 1/2/16 0:00 | 1:02 | TENDERLOIN
160003970 | WARRANTS | WARRANT ARREST | Saturday | 1/2/16 0:00 | 12:21 | SOUTHERN
160003641 | MISSING PERSON | FOUND PERSON | Friday | 1/1/16 0:00 | 10:06 | BAYVIEW
160086863 | LARCENY/THEFT | ATTEMPTED THEFT FROM LOCKED VEHICLE | Friday | 1/29/16 0:00 | 22:30 | TARAVAL
160004053 | NON-CRIMINAL | AIDED CASE, MENTAL DISTURBED | Saturday | 1/2/16 0:00 | 13:30 | TARAVAL
160073014 | OTHER OFFENSES | RESISTING ARREST | Monday | 1/25/16 0:00 | 23:20 | BAYVIEW
140776777 | ASSAULT | AGGRAVATED ASSAULT WITH A GUN | Thursday | 9/15/16 0:00 | 7:40 | INGLESIDE
160004069 | BURGLARY | BURGLARY,STORE UNDER CONSTRUCTION, FORCIBLE ENTRY | Saturday | 1/2/16 0:00 | 1:43 | CENTRAL
160004150 | STOLEN PROPERTY | STOLEN CHECKS, POSSESSION | Saturday | 1/2/16 0:00 | 11:54 | SOUTHERN
160004241 | ROBBERY | ROBBERY, ARMED WITH A KNIFE | Saturday | 1/2/16 0:00 | 14:11 | MISSION
160004558 | ASSAULT | BATTERY WITH SERIOUS INJURIES | Saturday | 1/2/16 0:00 | 16:40 | MISSION
160004655 | ASSAULT | BATTERY | Saturday | 1/2/16 0:00 | 17:05 | INGLESIDE
160004837 | LARCENY/THEFT | PETTY THEFT SHOPLIFTING | Saturday | 1/2/16 0:00 | 17:39 | SOUTHERN

In this table, the rows represent individual incidents, and the columns represent individual variables for which the San Francisco Police Department collected data. The full dataset contains 150,500 rows and 13 columns. Here I'm showing only the first 20 rows and the first seven columns to keep things reasonable.
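To make the row and column vocabulary concrete, here is a minimal Python sketch that holds three of the rows shown above as a list of dictionaries. The layout (one dict per row, keyed by column name) and the names `incidents`, `variables` and `districts` are my own illustrative choices, not the format the San Francisco Police Department uses to store the data.

```python
# Three rows copied from the table above. Each dict is one element (row);
# each key is one variable (column). The list-of-dicts layout is an
# illustrative assumption, not the format of the original data.
incidents = [
    {"IncidntNum": "141059263", "Category": "WARRANTS",
     "Description": "WARRANT ARREST", "DayOfWeek": "Monday",
     "Date": "4/25/16 0:00", "Time": "14:59", "PdDistrict": "BAYVIEW"},
    {"IncidntNum": "160002869", "Category": "ASSAULT",
     "Description": "BATTERY", "DayOfWeek": "Friday",
     "Date": "1/1/16 0:00", "Time": "21:35", "PdDistrict": "NORTHERN"},
    {"IncidntNum": "160004241", "Category": "ROBBERY",
     "Description": "ROBBERY, ARMED WITH A KNIFE", "DayOfWeek": "Saturday",
     "Date": "1/2/16 0:00", "Time": "14:11", "PdDistrict": "MISSION"},
]

# The variables are the column names shared by every row.
variables = list(incidents[0].keys())

# Reading "down" one column gives that variable's value for every element.
districts = [row["PdDistrict"] for row in incidents]

print(len(incidents), "elements,", len(variables), "variables")
print(variables)
print(districts)
```

Reading across one dict gives everything recorded about a single element; reading one key across all dicts gives a single variable.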

The first thing that jumps out at me when I see this data is "that's a lot to digest". I can see what happened with respect to any individual incident, but it's not easy to see whatever big picture there may be.

But this tabular view is still useful for thinking about the big picture, because it immediately suggests a number of questions that we might ask. For instance:

  • How many violent crimes are there vs. property crimes?
  • Are there more crimes at certain times of the day?
  • Are there more crimes on certain days of the week?
  • Which districts have higher crime?

We could come up with several other questions without too much effort, but you get the point. Knowing the answers would help inform decisions about the type, level, location and scheduling of police resources. Over time the answers would also help to identify trends that need addressing, such as possible increases in crime related to drug addiction or homelessness.
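Several of the questions listed above come down to counting rows by the value of one variable. The sketch below does that for the 20 rows shown in the table, using `collections.Counter` from the Python standard library. The names `rows`, `by_day`, `by_district` and `by_hour` are hypothetical, and 20 rows are far too few to say anything about the full dataset, so treat this as an illustration of the method only. The violent vs. property question is left out because it needs a mapping from category to crime type, which the table doesn't provide.

```python
from collections import Counter

# (DayOfWeek, Time, PdDistrict) for the 20 rows shown in the table above.
# Note that the first two rows share an IncidntNum, so this counts rows,
# not distinct incident numbers.
rows = [
    ("Friday", "11:00", "SOUTHERN"), ("Friday", "11:00", "SOUTHERN"),
    ("Monday", "14:59", "BAYVIEW"), ("Tuesday", "23:50", "TENDERLOIN"),
    ("Friday", "0:30", "MISSION"), ("Friday", "21:35", "NORTHERN"),
    ("Saturday", "0:04", "SOUTHERN"), ("Saturday", "1:02", "TENDERLOIN"),
    ("Saturday", "12:21", "SOUTHERN"), ("Friday", "10:06", "BAYVIEW"),
    ("Friday", "22:30", "TARAVAL"), ("Saturday", "13:30", "TARAVAL"),
    ("Monday", "23:20", "BAYVIEW"), ("Thursday", "7:40", "INGLESIDE"),
    ("Saturday", "1:43", "CENTRAL"), ("Saturday", "11:54", "SOUTHERN"),
    ("Saturday", "14:11", "MISSION"), ("Saturday", "16:40", "MISSION"),
    ("Saturday", "17:05", "INGLESIDE"), ("Saturday", "17:39", "SOUTHERN"),
]

# Count rows per day of the week and per police district.
by_day = Counter(day for day, _, _ in rows)
by_district = Counter(district for _, _, district in rows)

# Count rows per hour of the day (the part of Time before the colon).
by_hour = Counter(int(time.split(":")[0]) for _, time, _ in rows)

print(by_day.most_common())       # days, most frequent first
print(by_district.most_common())  # districts, most frequent first
print(sorted(by_hour.items()))    # (hour, count) pairs in hour order
```

The same pattern would apply to the full dataset once it is loaded: pick a variable, then count how often each of its values occurs.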

Where do datasets come from?

Datasets come from a variety of sources. Most of the time they are proprietary: companies generate data as a normal part of business operations, and the data isn't available for public consumption. In some cases this is because the data is sensitive (for example, it may contain customer or sales data). But in many cases the data is closely held simply because it's a valuable business asset, and it doesn't make business sense to give it away to competitors.

There are also many public datasets, though. Government studies are taxpayer-funded, so in many cases their datasets are public.

Finally, with so much interest in data science and machine learning, lots of sites publish datasets for educational purposes. For an awesome, curated list of public datasets, check out awesome-public-datasets.

Exercises

Exercise 1. Can you think of other questions that you might ask about the crime dataset above?

Exercise 2. Download one of the public datasets from awesome-public-datasets. What are some of its variables? What are some of its elements?
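If the dataset you download for Exercise 2 is a CSV file with a header row, a small helper like the one below can list its variables and preview a few elements. This is a minimal sketch under those assumptions: `summarize_csv` is a hypothetical name of my own, and the inline sample merely stands in for a real downloaded file, using a few columns from the crime table above.

```python
import csv
import io

def summarize_csv(lines, preview=3):
    """Return (variables, first rows, row count) for CSV text with a header row.

    `lines` is an open text file or any iterable of CSV lines.
    """
    reader = csv.DictReader(lines)
    first_rows = []
    n_rows = 0
    for row in reader:
        if n_rows < preview:
            first_rows.append(row)
        n_rows += 1
    # DictReader takes the variable names from the header row.
    variables = list(reader.fieldnames or [])
    return variables, first_rows, n_rows

# Stand-in for a downloaded file: two rows and four columns from the
# crime table above.
sample = io.StringIO(
    "IncidntNum,Category,DayOfWeek,PdDistrict\n"
    "160002869,ASSAULT,Friday,NORTHERN\n"
    "160004241,ROBBERY,Saturday,MISSION\n"
)

variables, first_rows, n_rows = summarize_csv(sample)
print("variables:", variables)
print("elements:", n_rows)
for row in first_rows:
    print(row)
```

For a real file, open it with `open(path, newline="")` and pass the file object to `summarize_csv`, where `path` is wherever you saved the download.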