The central object of study in statistics is the dataset. A dataset can be a simple list of values, like this:
14, 85, 12, 32, 104
More commonly, a dataset has a tabular structure. The columns are called variables and the rows are called elements.
Different people use different terminology for these concepts. Sometimes a tabular dataset is called a data table or a data frame. Other names for the elements of a dataset are cases, members, or instances.
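To make the terminology concrete, here is a minimal Python sketch of both shapes: the simple list from above, and a small tabular dataset stored as a list of rows. The table's variable names and values are invented for illustration and do not come from any real dataset.

```python
# A dataset as a simple list of values (the list shown earlier).
values = [14, 85, 12, 32, 104]

# A tabular dataset as a list of rows. Each row maps a variable name
# to that row's value. Names and values here are made up.
table = [
    {"name": "Ana", "age": 34, "city": "Lisbon"},
    {"name": "Ben", "age": 27, "city": "Oslo"},
    {"name": "Chloe", "age": 41, "city": "Lima"},
]

variables = list(table[0].keys())  # the columns
elements = table                   # the rows

# Reading down one column gives every element's value for that variable.
ages = [row["age"] for row in elements]

print(variables)
print(len(elements), "elements")
print(ages)
```

Libraries that work with data frames store the same idea more efficiently, but the vocabulary carries over: each key is a variable and each row is an element.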
Let's look at an example of a dataset with a tabular structure. Here are the first 20 rows of a dataset on San Francisco crime in 2016:
| Incident number | Category | Description | Day of week | Date | Time | Police district |
|---|---|---|---|---|---|---|
| 120058272 | WEAPON LAWS | POSS OF PROHIBITED WEAPON | Friday | 1/29/16 0:00 | 11:00 | SOUTHERN |
| 120058272 | WEAPON LAWS | FIREARM, LOADED, IN VEHICLE, POSSESSION OR USE | Friday | 1/29/16 0:00 | 11:00 | SOUTHERN |
| 141059263 | WARRANTS | WARRANT ARREST | Monday | 4/25/16 0:00 | 14:59 | BAYVIEW |
| 160013662 | NON-CRIMINAL | LOST PROPERTY | Tuesday | 1/5/16 0:00 | 23:50 | TENDERLOIN |
| 160002740 | NON-CRIMINAL | LOST PROPERTY | Friday | 1/1/16 0:00 | 0:30 | MISSION |
| 160003130 | OTHER OFFENSES | PAROLE VIOLATION | Saturday | 1/2/16 0:00 | 0:04 | SOUTHERN |
| 160003259 | NON-CRIMINAL | FIRE REPORT | Saturday | 1/2/16 0:00 | 1:02 | TENDERLOIN |
| 160003970 | WARRANTS | WARRANT ARREST | Saturday | 1/2/16 0:00 | 12:21 | SOUTHERN |
| 160003641 | MISSING PERSON | FOUND PERSON | Friday | 1/1/16 0:00 | 10:06 | BAYVIEW |
| 160086863 | LARCENY/THEFT | ATTEMPTED THEFT FROM LOCKED VEHICLE | Friday | 1/29/16 0:00 | 22:30 | TARAVAL |
| 160004053 | NON-CRIMINAL | AIDED CASE, MENTAL DISTURBED | Saturday | 1/2/16 0:00 | 13:30 | TARAVAL |
| 160073014 | OTHER OFFENSES | RESISTING ARREST | Monday | 1/25/16 0:00 | 23:20 | BAYVIEW |
| 140776777 | ASSAULT | AGGRAVATED ASSAULT WITH A GUN | Thursday | 9/15/16 0:00 | 7:40 | INGLESIDE |
| 160004069 | BURGLARY | BURGLARY,STORE UNDER CONSTRUCTION, FORCIBLE ENTRY | Saturday | 1/2/16 0:00 | 1:43 | CENTRAL |
| 160004150 | STOLEN PROPERTY | STOLEN CHECKS, POSSESSION | Saturday | 1/2/16 0:00 | 11:54 | SOUTHERN |
| 160004241 | ROBBERY | ROBBERY, ARMED WITH A KNIFE | Saturday | 1/2/16 0:00 | 14:11 | MISSION |
| 160004558 | ASSAULT | BATTERY WITH SERIOUS INJURIES | Saturday | 1/2/16 0:00 | 16:40 | MISSION |
| 160004837 | LARCENY/THEFT | PETTY THEFT SHOPLIFTING | Saturday | 1/2/16 0:00 | 17:39 | SOUTHERN |
In this table, the rows represent individual incidents, and the columns represent individual variables for which the San Francisco Police Department collected data. The full dataset contains 150,500 rows and 13 columns. Here I'm showing only the first 20 rows and the first seven columns to keep things reasonable.
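Trimming a large table down to a readable preview is a simple slicing operation. Here is a rough sketch using only Python's standard library. It parses a few of the rows above from in-memory CSV text; the header names (`incident`, `category`, and so on) are assumptions made for this example, and the real file may label its columns differently.

```python
import csv
import io

# Four rows copied from the table above. The header line is hypothetical.
CSV_TEXT = """incident,category,description,day,date,time,district
141059263,WARRANTS,WARRANT ARREST,Monday,4/25/16 0:00,14:59,BAYVIEW
160013662,NON-CRIMINAL,LOST PROPERTY,Tuesday,1/5/16 0:00,23:50,TENDERLOIN
160002740,NON-CRIMINAL,LOST PROPERTY,Friday,1/1/16 0:00,0:30,MISSION
160003130,OTHER OFFENSES,PAROLE VIOLATION,Saturday,1/2/16 0:00,0:04,SOUTHERN
"""

def load_rows(text):
    """Parse CSV text into a header (variable names) and a list of rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return header, list(reader)

def preview(header, rows, n_rows, n_cols):
    """Keep only the first n_rows rows and the first n_cols columns."""
    return header[:n_cols], [row[:n_cols] for row in rows[:n_rows]]

header, rows = load_rows(CSV_TEXT)
print(len(rows), "rows x", len(header), "columns")

small_header, small_rows = preview(header, rows, n_rows=2, n_cols=3)
print(small_header)
for row in small_rows:
    print(row)
```

With the full file, the same two numbers printed first would be the 150,500 rows and 13 columns mentioned above.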
The first thing that jumps out at me when I see this data is "that's a lot to digest". I can see what happened with respect to any individual incident, but it's not easy to see whatever big picture there may be.
But this tabular view is still useful for thinking about the big picture, because it immediately suggests a number of questions that we might ask. For instance:
- How many violent crimes are there vs. property crimes?
- Are there more crimes at certain times of the day?
- Are there more crimes on certain days of the week?
- Which districts have higher crime?
We could come up with several other questions without too much effort, but you get the point. Knowing the answers would help inform decisions about the type, level, location and scheduling of resources. Over time, the answers would also help identify trends that need addressing, such as possible increases in crime related to drug addiction or homelessness.
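Questions like "which days?" and "which districts?" come down to counting rows by the value of one variable. As a small sketch, the code below tallies the day and district of each row shown in the table above, using Python's `collections.Counter`. Two caveats: this is only the handful of rows shown, so the counts say nothing reliable about the full dataset, and it counts rows rather than distinct incidents (the first two rows share an incident number).

```python
from collections import Counter

# (day of week, police district) for each row of the table above.
rows = [
    ("Friday", "SOUTHERN"),
    ("Friday", "SOUTHERN"),
    ("Monday", "BAYVIEW"),
    ("Tuesday", "TENDERLOIN"),
    ("Friday", "MISSION"),
    ("Saturday", "SOUTHERN"),
    ("Saturday", "TENDERLOIN"),
    ("Saturday", "SOUTHERN"),
    ("Friday", "BAYVIEW"),
    ("Friday", "TARAVAL"),
    ("Saturday", "TARAVAL"),
    ("Monday", "BAYVIEW"),
    ("Thursday", "INGLESIDE"),
    ("Saturday", "CENTRAL"),
    ("Saturday", "SOUTHERN"),
    ("Saturday", "MISSION"),
    ("Saturday", "MISSION"),
    ("Saturday", "SOUTHERN"),
]

# Count how many rows fall on each day and in each district.
by_day = Counter(day for day, _ in rows)
by_district = Counter(district for _, district in rows)

print(by_day.most_common(1))       # [('Saturday', 9)]
print(by_district.most_common(1))  # [('SOUTHERN', 6)]
```

The same counting approach would work on the full dataset. Comparing violent and property crimes would additionally require deciding which categories belong to each group, which is a judgment call the data itself doesn't settle.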
Where do datasets come from?
Datasets come from a variety of sources. Most of the time they are proprietary: companies generate data as a normal part of business operations, and the data isn't available for public consumption. In some cases this can be because the data is sensitive (for example, it may contain customer data, sales data and so forth). But in many cases the data is closely held simply because it's a valuable business asset, and it doesn't make business sense to give it away to competitors.
There are also many public datasets, though. Government studies are taxpayer-funded, so in many cases their datasets are public.
Finally, with so much interest in data science and machine learning, lots of sites publish datasets for educational purposes. For an awesome, curated list of public datasets, check out awesome-public-datasets.
Exercise 1. Can you think of other questions that you might ask about the crime dataset above?
Exercise 2. Download one of the public datasets from awesome-public-datasets. What are some of its variables? What are some of its elements?