In the previous section, Datasets, we noted that a dataset's columns represent variables. There are different types of variable, and the approach to visualizing and analyzing them depends upon the type. Here's a common scheme for classifying variables:
[Diagram: variables divide into numerical variables (discrete or continuous) and categorical variables (nominal or ordinal).]
This section offers an overview of the types of variables in the diagram above.
Numerical variables are variables whose values are, well, numbers. These could be counts, temperatures, birth orders, test scores and so forth. We further divide numerical variables into two categories: discrete numerical variables and continuous numerical variables.
Discrete variables are numerical variables whose possible values have "gaps" between them. For example, counts are discrete, because counts are numbers like 0, 1, 2, ..., and between consecutive counts there are gaps: values like 0.5 or 0.62 can never be counts. Birth years are another example of a discrete variable.
A continuous variable is a numerical variable whose possible values are "gapless": for any pair of values, there's always some possible value in between them. Examples would be distances, temperatures, durations and so forth.
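One informal way to check for "gaps" is to take the midpoint of two values and ask whether it's still a possible value. Here's a minimal sketch of that idea. The helper name `midpoint` and the sample values are made up for illustration, and Python floats only approximate a truly continuous scale.

```python
def midpoint(a, b):
    """Return the value halfway between a and b."""
    return (a + b) / 2

# Discrete: counts are whole numbers, so the midpoint of two neighboring
# counts lands in a gap and is not itself a possible count.
count_mid = midpoint(1, 2)
print(count_mid, count_mid.is_integer())

# Continuous: the midpoint of two distances is itself a valid distance,
# and we could keep halving the interval as often as we like.
distance_mid = midpoint(1.2, 1.3)
print(distance_mid)
```

The midpoint of the counts 1 and 2 is not a whole number, so it can't be a count, whereas the midpoint of the two distances is just another distance.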
The second general type of variable is the categorical variable: a variable whose values are categories or labels. The values are usually text, such as a variable whose values are color names like RED and so forth. There are two types of categorical variable: nominal variables and ordinal variables.
A nominal variable is a categorical variable whose values are categories or labels that have no particular order. The color names that we just mentioned are an example, since there's no specific order to the colors. A variable whose values are fruit names is another example.
An ordinal variable is a categorical variable whose values are ordered. For example, a level variable with values like LOW would be an ordinal variable. T-shirt sizes like XS are another example since they are ordered by size.
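To make the nominal/ordinal distinction concrete, here's a small sketch in Python. The fruit list and the size scale `SIZE_ORDER` are assumptions chosen for illustration, not data from a real dataset. The point is that an ordinal variable carries an order we have to state explicitly, because sorting the labels as plain text gets it wrong.

```python
# Nominal: fruit names have no inherent order, so the only natural
# comparison between two values is "same" or "different".
fruits = ["apple", "banana", "apple", "cherry"]
distinct_fruits = set(fruits)
print(distinct_fruits)

# Ordinal: sizes have a meaningful order, which we spell out explicitly.
SIZE_ORDER = ["XS", "S", "M", "L", "XL"]  # assumed scale, for illustration
rank = {size: position for position, size in enumerate(SIZE_ORDER)}

shirts = ["L", "XS", "M", "XL", "S"]
by_size = sorted(shirts, key=lambda size: rank[size])  # uses the size order
alphabetical = sorted(shirts)                          # ignores the size order

print(by_size)
print(alphabetical)
```

Sorting with the explicit ranking puts the shirts in size order, while the alphabetical sort treats the sizes as if they were nominal labels.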
Not every variable fits neatly into the scheme we describe above. Sometimes it's more of a judgment call. Take for instance a Year variable. Years are numbers, and so in many contexts it might make sense to treat them as numerical variables. On the other hand, they sometimes behave more like labels. For example, we might want to count celebrity deaths per year. In such a case there's not much reason to assume some deeper numerical relationship between the year and the number of deaths: the year is just a label for a certain period of time, and a certain number of celebrity deaths occurred during that time. In this context, it would make more sense to treat the year as categorical data.
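Here's a rough sketch of what "treating the year as a label" can look like in practice, using `collections.Counter` from Python's standard library. The list `event_years` is invented for illustration and is not real data.

```python
from collections import Counter

# Invented records, one year per event (not real data).
event_years = [2015, 2016, 2016, 2014, 2016, 2015]

# Counter uses each distinct year as a label and tallies how often it appears.
events_per_year = Counter(event_years)

# Sorting here is only for display; no arithmetic is done on the years.
for year in sorted(events_per_year):
    print(year, events_per_year[year])
```

Nothing in the tally depends on the years being numbers. Swapping them for text labels would give the same counts, which is a good sign that the variable is acting as categorical here.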
That wraps up our quick tour of datasets and variables. In the next section we'll look at a key piece of context around the datasets we study: populations vs. samples.
Exercise 1. Consider geographical data that includes Latitude and Longitude variables. How would you classify these?
Exercise 2. Consider a user feedback scale that includes values like TERRIBLE. What kind of variable is this?
Exercise 3. Suppose you have an employee dataset that includes a variable EmployeeID, with values like 16087 or 32104. What kind of variable is this? Explain.
Exercise 4. What kind of variable is a US ZIP code (i.e., a postal code)? Explain.