4.4 Post Modeling Exploratory Visualizations
As previously mentioned in Section 3.2.2, predictions on the assessment data produced during resampling can be used to understand the performance of a model. Through visualization and analysis, they can also guide the modeler toward the next set of improvements that could be made. Here, this process is illustrated using the train ridership data.
Multiple linear regression has a rich set of diagnostics based on model residuals that aid in understanding the model fit and in identifying relationships that may be useful to include in the model. While multiple linear regression often lags behind other modeling techniques in predictive performance, its diagnostics are valuable tools for uncovering predictors and predictor relationships that can benefit more complex modeling techniques.
One tool from regression diagnosis that is helpful for identifying useful predictors is the partial regression plot (Neter et al. 1996). This plot utilizes residuals from two distinct linear regression models to unearth the potential usefulness of a predictor in a model. To begin the process, we start by fitting the following model and computing the residuals (\(\epsilon_i\)):
\[ y_i = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon_i \]
Next, we select a predictor that was not in the model, but may contain additional predictive information with respect to the response. For the potential predictor, we then fit:
\[ x_{new_i} = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \eta_i \]
The residuals (\(\epsilon_i\) and \(\eta_i\)) from the models are then plotted against each other in a simple scatter plot. A linear or curvilinear pattern between these sets of residuals is an indication that the new predictor, as a linear or quadratic (or some other non-linear variant) term, would be a useful addition to the model.
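The two-model procedure above can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not the analysis used for the ridership example: the function names, the simulated predictors, and the inclusion of an intercept column in `X` are all assumptions made for the sketch.

```python
import numpy as np

def ols_residuals(X, target):
    """Residuals from a least-squares fit of `target` on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

def partial_regression_residuals(X, y, x_new):
    """Return (eta, eps): the x- and y-axis values of a partial regression plot."""
    eps = ols_residuals(X, y)      # response, adjusted for the existing predictors
    eta = ols_residuals(X, x_new)  # candidate predictor, adjusted the same way
    return eta, eps

# Simulated data (hypothetical): two existing predictors plus an intercept column,
# and a candidate predictor that is correlated with the first existing predictor.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
x_new = 0.5 * X[:, 1] + rng.normal(size=n)
y = X @ np.array([1.0, 2.0, -1.0]) + 3.0 * x_new + rng.normal(scale=0.1, size=n)

eta, eps = partial_regression_residuals(X, y, x_new)

# Slope of eps on eta; for ordinary least squares this matches the coefficient
# that x_new would receive if it were added to the original model.
slope = (eta @ eps) / (eta @ eta)
print(f"partial regression slope: {slope:.3f}")
```

Plotting `eps` against `eta` gives the partial regression plot. In this simulation the relationship is linear by construction, so the slope should land near the simulated coefficient of 3; a curved pattern would instead suggest a non-linear term.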
As previously mentioned, it is problematic to examine the model fit via residuals when those values are determined by simply re-predicting the training set data. A better strategy is to use the residuals from the various assessment sets created during resampling.
For the Chicago data, a rolling forecast origin scheme (Section 3.4.4) was used for resampling. There are 5697 data points in the training set, each representing a single day. The resampling used here contains a base set of samples before September 01, 2014, and the analysis/assessment splits begin at this date. In doing this, the analysis set grows cumulatively; once an assessment set is evaluated, it is put into the analysis set on the next iteration of resampling. Each assessment set contains the 14 days immediately after the last value in the analysis set. As a result, there are 52 resamples and each assessment set is a mutually exclusive collection of the latest dates. This scheme is meant to mimic how the data would be repeatedly analyzed; once a new set of data is captured, the previous set is used to train the model and the new data are used as a test set. Figure 4.18 shows an illustration of the first few resamples. The arrows on the left-hand side indicate that the full analysis set starts on January 22, 2001.
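The cumulative rolling forecast origin scheme can be sketched as a small index generator. The function name and the toy sizes below (100 rows, 58 initial rows) are hypothetical and chosen only to keep the output small; the sketch does not attempt to reproduce the exact 52 splits of the Chicago data, which depend on the actual dates in the training set.

```python
def rolling_origin_splits(n_rows, initial, assess):
    """Yield (analysis, assessment) index ranges for a cumulative rolling origin.

    The first analysis set holds rows 0..initial-1. Each assessment set holds the
    next `assess` rows, which are then absorbed into the following analysis set.
    """
    start = initial
    while start + assess <= n_rows:
        yield range(0, start), range(start, start + assess)
        start += assess

# Toy example (illustrative sizes only): 14-row assessment blocks.
splits = list(rolling_origin_splits(n_rows=100, initial=58, assess=14))

for i, (analysis, assessment) in enumerate(splits, start=1):
    print(
        f"resample {i}: analysis rows 0-{analysis[-1]}, "
        f"assessment rows {assessment[0]}-{assessment[-1]}"
    )
```

Because each new block starts where the previous one ended, the assessment sets never overlap, and every analysis set contains all rows that precede its assessment block.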
For any model fit to these data using such a scheme, the collection of 52 sets of residuals (one per assessment set, each covering 14 days) can be used to understand the strengths and weaknesses of the model with minimal risk of overfitting. Since the assessment sets move over blocks of time, they also allow the analyst to understand whether there are specific times of year when the model does poorly.
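As a rough sketch of how hold-out residuals can be collected block by block and then inspected over time, the following Python example fits an ordinary least squares model on each growing analysis set and scores the next 14 rows. Everything here is an assumption made for illustration: the simulated series, the function name, and the artificial shift added to one block to stand in for a period that the model predicts poorly.

```python
import numpy as np

def holdout_residuals(X, y, initial, assess):
    """Fit OLS on each growing analysis set; return residuals per assessment set."""
    blocks = []
    start = initial
    while start + assess <= len(y):
        stop = start + assess
        coef, *_ = np.linalg.lstsq(X[:start], y[:start], rcond=None)
        blocks.append(y[start:stop] - X[start:stop] @ coef)
        start = stop
    return blocks

# Simulated daily series (hypothetical): a linear trend plus noise.
rng = np.random.default_rng(1)
n = 300
X = np.column_stack([np.ones(n), np.arange(n) / n])
y = 5.0 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
y[250:264] += 3.0  # an artificial block of days that the model cannot anticipate

resids = holdout_residuals(X, y, initial=180, assess=14)

# Summarize each assessment block to see where in time the model struggles.
rmse_by_block = [float(np.sqrt(np.mean(r ** 2))) for r in resids]
worst = int(np.argmax(rmse_by_block))
print("RMSE by assessment block:", [round(v, 2) for v in rmse_by_block])
print("worst block index:", worst)
```

In practice, the per-block summaries would be indexed by calendar date rather than block number, so that poorly predicted periods can be tied to particular times of year.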
The response for the regression model is the ridership at the Clark/Lake station, and our initial model will contain the predictors of week, month, and year. The distribution of the hold-out residuals from this model is provided in Figure 4.19(a). As we saw earlier in this chapter, the distribution has two peaks, which we found were due to the part of the week (weekday versus weekend). To investigate the importance of part of the week, we then regress part of the week on the base predictors and compute the hold-out residuals from this model. The relationship between the two sets of hold-out residuals is provided in Figure 4.19(b), which shows a clear non-random relationship, indicating that part of the week contains additional predictive information for the Clark/Lake station ridership. Including part of the week in the model narrows the residual distribution, as illustrated in the histogram labeled Base + Part of Week.
Next, let’s explore the importance of the 14-day lag of ridership at the Clark/Lake station, shown in part (c) of the figure. Here we see a strong linear relationship between the model residuals in the mainstream of the data, with a handful of days lying outside of the overall pattern. These days happen to be holidays for this time period. The potential predictive importance of holidays is reflected in part (d). In this panel, the reader’s eye may be drawn to one holiday that lies far away from the rest of the holiday samples. It turns out that this sample is July 4, 2015, the only day in the training data that is both a holiday and a weekend day. Because the model already accounted for part of the week, the additional information that this day is a holiday has virtually no effect on the predicted value for this sample.