5.1 Creating Dummy Variables for Unordered Categories
The most basic approach to representing categorical values as numeric data is to create dummy or indicator variables. These are artificial numeric variables that capture some aspect of one (or more) of the categorical values. There are many methods for doing this; to illustrate, consider a simple example using the day of the week. If we take the seven possible values and convert them into binary dummy variables, the mathematical function that makes the translation is often referred to as a contrast or parameterization function. One example is the “reference cell” or “treatment” contrast, where one of the values of the predictor is left unaccounted for in the resulting dummy variables. Using Sunday as the reference cell, the contrast function would create six dummy variables:
Day | Mon | Tues | Wed | Thurs | Fri | Sat |
---|---|---|---|---|---|---|
Sun | 0 | 0 | 0 | 0 | 0 | 0 |
Mon | 1 | 0 | 0 | 0 | 0 | 0 |
Tues | 0 | 1 | 0 | 0 | 0 | 0 |
Wed | 0 | 0 | 1 | 0 | 0 | 0 |
Thurs | 0 | 0 | 0 | 1 | 0 | 0 |
Fri | 0 | 0 | 0 | 0 | 1 | 0 |
Sat | 0 | 0 | 0 | 0 | 0 | 1 |
These six numeric predictors would take the place of the original categorical variable.
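As a minimal sketch of the reference cell contrast in the table above, the mapping from a category to its dummy variables could be written in Python as follows. The function name `treatment_contrast` and its arguments are illustrative and do not come from any particular library.

```python
def treatment_contrast(levels, reference):
    """Map each level to its C-1 dummy values; `reference` maps to all zeros.

    Hypothetical helper for illustration only.
    """
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")
    # One dummy column per non-reference level, in the original order.
    columns = [lvl for lvl in levels if lvl != reference]
    encoding = {
        lvl: [1 if lvl == col else 0 for col in columns] for lvl in levels
    }
    return columns, encoding


days = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]
columns, encoding = treatment_contrast(days, reference="Sun")

print(columns)          # names of the six dummy variables
print(encoding["Sun"])  # reference cell: all zeros
print(encoding["Wed"])  # a single 1 in the "Wed" column
```

Choosing a different `reference` changes which row is all zeros but not the number of dummy variables.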
Why only six? There are two related reasons. First, if the values of the six dummy variables are known, then the seventh can be directly inferred. The second reason is more technical. When fitting linear models, the design matrix \(X\) is created. When the model has an intercept, an additional initial column of ones for all rows is included. Estimating the parameters for a linear model (as well as other similar models) involves inverting the matrix \((X'X)\). If the model includes an intercept and contains dummy variables for all seven days, then the seven day columns would add up (row-wise) to the intercept column, and this linear dependency would prevent the matrix inverse from being computed (since \((X'X)\) is singular). When this occurs, the design matrix is said to be less than full rank or overdetermined. When there are \(C\) possible values of the predictor and only \(C-1\) dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization (Timm and Carlson 1975; Haase 2011). Less than full rank encodings are sometimes called “one-hot” encodings. Generating the full set of indicator variables may be advantageous for some models that are insensitive to linear dependencies (such as the glmnet model described in Section 7.3.2).
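The rank argument can be checked numerically. The sketch below is a standalone illustration, not code from the text: it builds two small design matrices for fourteen hypothetical days and computes their ranks with a simple exact Gaussian elimination. The helper `matrix_rank` is written here for the example.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


n_days = 7
# Two weeks of observations; row i falls on day (i mod 7).
one_hot = [[1 if j == i % n_days else 0 for j in range(n_days)] for i in range(14)]

# Intercept plus all seven indicators: 8 columns.
x_full = [[1] + row for row in one_hot]
# Intercept plus six indicators (first day dropped as the reference): 7 columns.
x_ref = [[1] + row[1:] for row in one_hot]

print(len(x_full[0]), matrix_rank(x_full))  # 8 columns, rank 7
print(len(x_ref[0]), matrix_rank(x_ref))    # 7 columns, rank 7
```

With all seven indicators plus an intercept, the rank falls one short of the number of columns, which is exactly the dependency that makes \((X'X)\) singular. Dropping one indicator restores full column rank.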
What is the interpretation of the dummy variables? That depends on what type of model is being used. Consider a linear model for the Chicago transit data that only uses the day of the week in the model with the reference cell parameterization above. Using the training set to fit the model, the intercept value estimates the mean of the reference cell, which is the average number of Sunday riders in the training set, and was estimated to be 3.84K people. The second model parameter, for Monday, is estimated to be 12.61K. In the reference cell model, the coefficient for each dummy variable represents the mean value above and beyond the reference cell mean. In this case, the estimate indicates that there were 12.61K more riders on Monday than on Sunday. The overall estimate of Monday ridership adds the estimates from the intercept and dummy variable (16.45K rides).
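Written out with the estimates quoted above, the predicted Monday ridership is the sum of the two parameters. The symbols \(\hat{\beta}_0\) (intercept) and \(\hat{\beta}_{\text{Mon}}\) (Monday coefficient) are notation introduced here for illustration:

\[
\hat{y}_{\text{Mon}} = \hat{\beta}_0 + \hat{\beta}_{\text{Mon}} = 3.84\text{K} + 12.61\text{K} = 16.45\text{K}
\]

For Sunday, every dummy variable is zero, so the prediction is just \(\hat{\beta}_0 = 3.84\text{K}\).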
When there is more than one categorical predictor, the reference cell becomes multidimensional. Suppose there was a predictor for the weather that has only a few values: “clear”, “cloudy”, “rain”, and “snow”. Let’s consider “clear” to be the reference cell. If this variable were included in the model, the intercept would correspond to the mean of the Sundays with a clear sky. However, the interpretation of each set of dummy variables does not change. The average ridership for a cloudy Monday would augment the average clear Sunday ridership with the average incremental effect of cloudy and the average incremental effect of Monday.
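In symbols, and assuming a model with only the two main effects and no interaction between day and weather, the cloudy Monday prediction can be sketched as follows. The coefficient names are notation introduced here, with \(\hat{\beta}_0\) standing for the clear Sunday mean:

\[
\hat{y}_{\text{cloudy},\,\text{Mon}} = \hat{\beta}_0 + \hat{\beta}_{\text{cloudy}} + \hat{\beta}_{\text{Mon}}
\]

A clear Monday would drop the \(\hat{\beta}_{\text{cloudy}}\) term, and a cloudy Sunday would drop the \(\hat{\beta}_{\text{Mon}}\) term.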
There are other contrast functions for dummy variables. The “cell means” parameterization (Timm and Carlson 1975) would create a dummy variable for each day of the week and would omit the intercept to avoid the problem of singularity of the design matrix. In this case, the estimate for each dummy variable would correspond to the average value for that category. For the Chicago data, the parameter estimate for the Monday dummy variable would simply be 16.45K.
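To see why the cell means estimates are simply the category averages, the following sketch uses a handful of made-up ridership values (they are hypothetical and are not the Chicago data). With one indicator per level and no intercept, \(X'X\) is diagonal, so least squares reduces to a per-level mean.

```python
from collections import defaultdict

# Hypothetical ridership values (in thousands), for illustration only.
observations = [
    ("Sun", 4.0), ("Sun", 3.0),
    ("Mon", 16.0), ("Mon", 17.0),
    ("Tues", 18.0), ("Tues", 16.0),
]

# With one indicator per level and no intercept, X'X is diagonal (the
# per-level counts) and X'y holds the per-level sums, so the least squares
# estimates are the per-level means.
counts = defaultdict(int)
sums = defaultdict(float)
for day, riders in observations:
    counts[day] += 1
    sums[day] += riders

cell_means = {day: sums[day] / counts[day] for day in counts}

# The reference cell parameterization carries the same information:
# intercept = reference mean, each dummy coefficient = difference from it.
reference = "Sun"
intercept = cell_means[reference]
offsets = {
    day: mean - intercept for day, mean in cell_means.items() if day != reference
}

print(cell_means)
print(intercept, offsets)
```

Both parameterizations give the same fitted values; they differ only in whether each coefficient is read as a category mean or as a difference from the reference category.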
There are several other contrast methods for constructing dummy variables. Another, based on polynomial functions, is discussed below for ordinal data.