4.2 Visualizations for Numeric Data: Exploring Train Ridership Data
4.2.1 Box Plots, Violin Plots, and Histograms
Univariate visualizations are used to understand the distribution of a single variable. A few common univariate visualizations are box-and-whisker plots (i.e., box plots), violin plots, and histograms. While these are simple graphical tools, they provide great value in comprehending characteristics of the quantity of interest.
Because the foremost goal of modeling is to understand variation in the response, the first step should be to understand the distribution of the response. For a continuous response such as the ridership at the Clark/Lake station, it is important to understand if the response has a symmetric distribution, if the distribution has a decreasing frequency of larger observations (i.e., the distribution is skewed), if the distribution appears to be made up of two or more individual distributions (i.e., the distribution has multiple peaks or modes), or if there appear to be unusually low or high observations (i.e., outliers).
Understanding the distribution of the response as well as its variation provides a lower bound on the expectations of model performance. That is, if a model contains meaningful predictors, then the residuals from a model that contains these predictors should have less variation than the variation of the response. Furthermore, the distribution of the response may indicate that the response should be transformed prior to analysis. For example, responses that have a distribution where the frequency of response proportionally decreases with larger values may indicate that the response follows a log-normal distribution. In this case, log-transforming the response would induce a normal (bell-shaped, symmetric) distribution and often will enable a model to have better predictive performance. A third reason to understand the response is that its distribution may provide clues for including or creating features that help explain the response.
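The effect of a log transformation on a right-skewed response can be sketched numerically. The example below is only an illustration under stated assumptions: the data are simulated from a log-normal distribution (they are not the ridership data), and the `skewness` helper is a hypothetical name introduced here.

```python
import math
import random
import statistics


def skewness(values):
    """Population skewness: mean cubed deviation over the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


random.seed(42)  # fixed seed so the sketch is reproducible

# Simulated right-skewed response (an assumption for illustration only).
response = [random.lognormvariate(2.0, 0.8) for _ in range(5000)]
log_response = [math.log(v) for v in response]

print(f"skewness, natural units: {skewness(response):.2f}")
print(f"skewness, log scale:     {skewness(log_response):.2f}")
```

A skewness near zero on the log scale is consistent with the roughly symmetric, bell-shaped distribution described above.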
As a simple example of the importance of understanding the response distribution, consider Figure 4.2 which displays a box plot of the response for the ridership at the Clark/Lake station. The box plot was originally developed by John Tukey as a quick way to assess a variable’s distribution (Tukey 1977), and consists of the minimum, lower quartile, median, upper quartile and maximum of the data. Alternative versions of the box plot extend the whiskers to a value beyond which samples would be considered unusually high (or low) (Frigge, Hoaglin, and Iglewicz 1989). A variable that has a symmetric distribution has equal spacing across the quartiles making the box and whiskers also appear symmetric. Alternatively, a variable that has fewer values in a wider range of space will not appear symmetric.
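The quantities behind a box plot are easy to compute directly. The sketch below uses made-up values (in thousands of rides) and assumes the common convention of placing the whisker limits 1.5 interquartile ranges beyond the quartiles; the function names are hypothetical and other box plot variants use different rules.

```python
import statistics


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def tukey_fences(values, k=1.5):
    """Limits k * IQR beyond the quartiles (k = 1.5 is an assumed default)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_unusual(values, k=1.5):
    """Values that fall outside the fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]


# Hypothetical daily ridership in thousands, NOT the Clark/Lake data.
daily_rides = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

print("five-number summary:", five_number_summary(daily_rides))
print("fences:", tukey_fences(daily_rides))
print("unusual values:", flag_unusual(daily_rides))
```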
A drawback of the box plot is that it is not effective at identifying distributions that have multiple peaks or modes. As an example, consider the distribution of ridership at the Clark/Lake station (Figure 4.3). Part (a) of this figure is a histogram of the data. To create a histogram, the data are binned into equal regions of the variable’s value. The number of samples in each region is counted, and a bar is drawn whose height is the frequency (or percentage) of samples in that region. Like box plots, histograms are simple to create, and these figures offer the ability to see additional distributional characteristics. In the ridership distribution, there are two peaks, which could represent two different mechanisms that affect ridership. The box plot (b) is unable to capture this important nuance. To achieve a compact visualization of the distribution that retains histogram-like characteristics, Hintze and Nelson (1998) developed the violin plot. This plot is created by estimating the density of the data and drawing it alongside its mirror image. Figure 4.3 (c) is the violin plot, where we can now see the two distinct peaks in the ridership distribution. The lower quartile, median, and upper quartile can be added to a violin plot so that this information is also part of the overall assessment of the distribution.
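The binning step of a histogram can be sketched in a few lines. The values below are invented to have two clusters and are not the ridership data; `histogram_counts` is a hypothetical helper name.

```python
def histogram_counts(values, n_bins):
    """Split the range of the data into equal-width bins and count samples per bin."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical bimodal data (thousands of rides): a low cluster and a high cluster.
rides = [3, 4, 4, 5, 5, 5, 6, 17, 18, 19, 19, 20, 20, 20, 21, 21, 22, 23]

edges, counts = histogram_counts(rides, n_bins=5)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f})  {'#' * count}")
```

With bimodal data, the text bars show two separated groups of bins, which a five-number summary of the same values would not reveal.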
These data will be analyzed in several chapters. Given the range of the daily ridership numbers, there was some question as to whether the outcome should be modeled in the natural units or on the log scale. On one hand, the natural units make interpretation of the results easier since the RMSE would be in terms of riders. However, if the outcome were transformed prior to modeling, it would ensure that negative ridership could not be predicted. The bimodal nature of these data, as well as distributions of ridership for each year that have a longer tail on the right, made this decision difficult. In the end, a handful of models were fit both ways to make the determination. The models computed in the natural units appeared to have slightly better performance and, for this reason, all models were analyzed in the natural units.
Examining the distribution of each predictor can help to guide our decisions about the need to engineer the features through transformation prior to analysis. When we have a moderate number of predictors (fewer than about 100) and when the predictors are on the same order of magnitude, we can visualize the distributions simultaneously using side-by-side box or violin plots. Consider again the ridership data with the two-week lag in ridership as predictors. The distributions across selected stations for weekday ridership for 2016 are provided in Figure 4.4. Here, our focus is on the variability and range of ridership across stations. Despite the earlier warning, box plots are good at characterizing these aspects of data. To help see patterns more clearly, stations are ordered from the largest median ridership (left) to the smallest (right). Several characteristics stand out: variability in ridership increases with the median ridership, there are a number of unusually low and unusually high values for each station, and a few stations have distinctly large variation. One station, about one-quarter of the way from the left, particularly stands out. This happens to be the Addison station, which is the nearest stop to Wrigley Field. The wider distribution is due to ridership associated with the weekday home games for the Chicago Cubs, with ridership at its peak approaching that of the most frequently traveled stations. If the goal were to predict ridership at the Addison station, then the Cubs’ home game schedule would be important information for any model. The unusually low values for the majority of the stations will be discussed next.
As the number of predictors grows, the ability to visualize the individual distributions lessens and may be practically impossible. In this situation, a subset of predictors that are thought to be important can be examined using these techniques.
4.2.2 Augmenting Visualizations through Faceting, Colors, and Shapes
Additional dimensions can be added to almost any figure by using faceting, colors, and shapes. Faceting refers to creating the same type of plot (e.g., a scatter plot) and splitting the plot into different panels based on some variable [39]. Figure 3.2 is a good example. While this is a simple approach, these types of augmentation can be powerful tools for seeing important patterns that can be used to direct the engineering of new features. The Clark/Lake station ridership distribution is a prime candidate for adding another dimension. As shown above, the distribution in Figure 4.3 has two distinct peaks. A reasonable explanation for this would be that ridership is different for weekdays than for weekends. Figure 4.5 partitions the ridership distribution by part of the week through color and faceting (for ease of visualization). Part of the week was not a predictor in the original data set; by using intuition and carefully examining the distribution, we have found a feature that is important and necessary for explaining ridership and should be included in a model.
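Deriving a part-of-the-week feature from the calendar date is straightforward. The following is a minimal sketch, assuming each record is a (date, rides) pair; the records are invented and `part_of_week` is a hypothetical helper name.

```python
import statistics
from collections import defaultdict
from datetime import date


def part_of_week(day):
    """Label a calendar date as 'Weekday' (Mon-Fri) or 'Weekend' (Sat-Sun)."""
    return "Weekend" if day.weekday() >= 5 else "Weekday"


# Hypothetical (date, rides) records, NOT the actual Clark/Lake data.
records = [
    (date(2016, 1, 4), 19800), (date(2016, 1, 5), 20500),
    (date(2016, 1, 6), 21000), (date(2016, 1, 7), 20700),
    (date(2016, 1, 8), 19900), (date(2016, 1, 9), 5200),
    (date(2016, 1, 10), 3900),
]

# Group the response by the new feature, as a faceted plot would.
by_part = defaultdict(list)
for day, rides in records:
    by_part[part_of_week(day)].append(rides)

for part, values in by_part.items():
    print(part, "n =", len(values), "median =", statistics.median(values))
```

Summarizing (or plotting) each group separately is the tabular analogue of faceting: each panel shows the distribution for one level of the new feature.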
Figure 4.5 invites us to pursue understanding of these data further. Careful viewing of the weekday ridership distribution should draw our eye to a long tail on the left which is a result of a number of days with lower ridership similar to the range of ridership on weekends. What would cause weekday ridership to be low? If the cause can be uncovered, then a feature can be engineered. A model that has the ability to explain these lower values will have better predictive performance than a model that does not.
The use of colors and shapes for elucidating predictive information will be illustrated in several of the following sections.
4.2.3 Scatter Plots
Augmenting visualizations through the use of faceting, color, or shapes is one way to incorporate an additional dimension in a figure. Another approach is to directly add another dimension to a graph. When working with two numeric variables, this type of graph is called a scatter plot. A scatter plot arranges one variable on the x-axis and another on the y-axis. Each sample is then plotted in this coordinate space. We can use this type of figure to assess the relationship between a predictor and the response, to uncover relationships between pairs of predictors, and to understand if a new predictor may be useful to include in a model. These simple relationships are the first to provide clues to assist in engineering characteristics that may not be directly available in the data.
If the goal is to predict ridership at the Clark/Lake station then we could anticipate that recent past ridership information should be related to current ridership. That is to say another potential predictor to consider would be the previous day’s or previous week’s ridership information. Because we know that weekday and weekend have different distributions, a one-day lag would be less useful for predicting ridership on Monday or Saturday. A week-based lag would not have this difficulty (although it would be further apart in time) since the information occurs on the same day of the week. Because the primary interest is in predicting ridership two weeks in advance, we will create the 14-day lag in ridership for the Clark/Lake station.
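A lagged predictor can be constructed by looking up, for each day, the value recorded a fixed number of days earlier. The sketch below assumes the data are held in a dictionary keyed by date; the series is synthetic and `add_lag` is a hypothetical helper name.

```python
from datetime import date, timedelta


def add_lag(series, lag_days=14):
    """Pair each day's value with the value from `lag_days` earlier.

    `series` maps date -> ridership. Days whose lagged date is missing
    (for example, the first `lag_days` days) are dropped.
    """
    rows = []
    for day in sorted(series):
        past = day - timedelta(days=lag_days)
        if past in series:
            rows.append((day, series[day], series[past]))
    return rows


# Synthetic series for illustration: high on weekdays, low on weekends.
start = date(2016, 1, 1)
demo = {}
for i in range(28):
    day = start + timedelta(days=i)
    demo[day] = 5000 if day.weekday() >= 5 else 20000

for day, current, lagged in add_lag(demo)[:5]:
    print(day, "current =", current, "14-day lag =", lagged)
```

Because 14 is a multiple of 7, each day is paired with the same day of the week, which avoids the weekday/weekend mismatch described above for a one-day lag.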
In this case, the relationship between these variables can be directly understood by creating a scatter plot (Figure 4.6). This figure highlights several characteristics that we need to know: there is a strong linear relationship between the 14-day lag and current-day ridership, there are two distinct groups of points (due to the part of the week), and there are many 14-day lag/current-day pairs that lie far from the overall scatter of points. These results indicate that the 14-day lag will be a crucial predictor for explaining current-day ridership. Moreover, uncovering the explanation for samples that are far from the overall pattern visualized here will lead to a new feature that will be useful as an input to models.
4.2.4 Heatmaps
Low weekday ridership as illustrated in Figure 4.5 might be due to annual occurrences; to investigate this hypothesis, the data will need to be augmented. The first step is to create an indicator variable for weekdays with ridership less than 10,000 or greater than or equal to 10,000. We then need a visualization that allows us to see when these unusual values occur. A visualization that would elucidate annual patterns in this context is a heatmap. A heatmap is a versatile plot that can be created utilizing almost any type of predictor and displays one predictor on the x-axis and another predictor on the y-axis. In this figure the x- and y-axis predictors must be able to be categorized. The categorized predictors then form a grid, and the grid is filled by another variable. The filling variable can be either continuous or categorical. If continuous, then the boxes in the grid are colored on a continuous scale from the lowest value of the filling predictor to the highest value. If the filling variable is categorical, then the boxes have distinct colors for each category.
For the ridership data, we will create a month and day predictor, a year predictor, and an indicator of weekday ridership less than 10,000 rides.
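The three derived features can be sketched as a mapping from grid position to fill value. The 10,000-ride threshold comes from the text; the records are invented and `heatmap_cells` is a hypothetical helper name.

```python
from datetime import date

LOW_THRESHOLD = 10_000  # weekday ridership below this is flagged as low


def heatmap_cells(daily):
    """Build heatmap inputs from a dict mapping date -> rides.

    Returns a dict keyed by (month-day, year) whose value is True when
    weekday ridership is below the threshold. Weekend days are skipped.
    """
    cells = {}
    for day, rides in daily.items():
        if day.weekday() >= 5:  # Saturday or Sunday
            continue
        cells[(day.strftime("%m-%d"), day.year)] = rides < LOW_THRESHOLD
    return cells


# Hypothetical records, NOT the actual Clark/Lake data.
daily = {
    date(2015, 7, 3): 6000,   # Friday with low ridership
    date(2015, 7, 4): 5000,   # Saturday, excluded
    date(2015, 7, 6): 20000,  # Monday with typical ridership
}

for (month_day, year), is_low in sorted(heatmap_cells(daily).items()):
    print(year, month_day, "low" if is_low else "typical")
```

Each key identifies one box in the grid (month and day on one axis, year on the other), and the Boolean value is the categorical fill.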
These new features are the inputs to the heatmap (Figure 4.7). In this figure, the x-axis represents the year and the y-axis represents the month and day. Red boxes indicate weekdays that have ridership less than 10,000 for the Clark/Lake station. The heat map of the data in this form brings out some clear trends. Low ridership occurs on or around the beginning of the year, mid-January, mid-February until 2007, late-May, early-July, early-September, late-November, and late-December. Readers in the US would recognize these patterns as regularly observed holidays. Because holidays are known in advance, adding a feature for common weekday holidays will be beneficial for models to explain ridership.
Carefully observing the heatmap points to two days that do not follow the annual patterns: February 2, 2011 and January 6, 2014. These anomalies were due to extreme weather. On February 2, 2011, a blizzard dumped 21.2 inches of snow on the region. Then on January 6, 2014, Chicago set a record low temperature of -16°F. Extreme weather instances are infrequent, so adding this predictor will have limited usefulness in a model. If the frequency of extreme weather increases in the future, then using forecast data could become a valuable predictor for explaining ridership.
Now that we understand the effect of major US holidays, these values will be excluded from the scatter plot of 14-day lag versus current-day ridership (Figure 4.8). Most of the points that fell off the diagonal of Figure 4.6 are now gone. However, a couple of points remain. The day associated with these points was June 11, 2010, which was the city’s celebration for the Chicago Blackhawks winning the Stanley Cup. While these types of celebrations are infrequent, engineering a feature to anticipate these unusual events will aid in reducing the prediction error for a model [40].
4.2.5 Correlation Matrix Plots
An extension of the scatter plot is the correlation matrix plot. In this plot, the correlation between each pair of variables is plotted in the form of a matrix. Every variable is represented on the outer x-axis and outer y-axis of the matrix, and the strength of the correlation is represented by the color in the respective location in the matrix. This visualization was first shown in Figure 2.3. Here we will construct a similar image for the 14-day lag in ridership across stations for non-holiday weekdays in 2016 for the Chicago data. The correlation structure of these 14-day lagged predictors is almost the same as that of the original (unlagged) predictors; using the lagged versions ensures the correct number of rows are used.
The correlation matrix in Figure 4.9 leads to additional understandings. First, ridership across stations is positively correlated (red) for nearly all pairs of stations; this means that low ridership at one station corresponds to relatively low ridership at another station, and high ridership at one station corresponds to relatively high ridership at another station. Second, the correlation for many pairs of stations is extremely high. In fact, 18.7% of the predictor pairs have a correlation greater than 0.90 and 3.1% have a correlation greater than 0.95. The high degree of correlation is a clear indicator that the information present across the stations is redundant and could be eliminated or reduced. Filtering techniques as discussed in Chapter 3 could be used to eliminate predictors. Also, feature engineering through dimension reduction (Chapter 6) could be an effective alternative representation of the data in settings like these. We will address dimension reduction as an exploratory visualization technique in Section 4.2.7.
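The share of predictor pairs above a correlation cutoff can be computed by enumerating all pairs. The sketch below uses four invented stations rather than the Chicago data; `pearson` and `share_above` are hypothetical helper names, and columns with zero variance are not handled.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def share_above(columns, threshold):
    """Fraction of distinct column pairs whose correlation exceeds `threshold`."""
    pairs = list(combinations(columns, 2))
    high = sum(1 for a, b in pairs if pearson(columns[a], columns[b]) > threshold)
    return high / len(pairs)


# Hypothetical lagged ridership for four stations (arbitrary units).
stations = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 4, 3, 2, 1],
    "D": [1, 3, 2, 5, 4],
}

print(f"pairs with correlation > 0.90: {share_above(stations, 0.90):.1%}")
```

With p predictors there are p(p - 1)/2 distinct pairs, so the denominator grows quickly as stations are added.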
This version of the correlation plot includes an organization structure of the rows and columns based on hierarchical cluster analysis (Dillon and Goldstein 1984). The overarching goal of cluster analysis is to arrange samples in a way that those that are ‘close’ in the measurement space are also nearby in their location on the axis. For these data, the distances between any two stations are based on the stations’ vectors of correlation values. Therefore, stations that have similar correlation vectors will be nearby in their arrangement on each axis, whereas stations that have dissimilar correlation vectors will be located far away. The tree-like structure on the x- and y-axes is called a dendrogram and connects the samples based on their correlation vector proximity. This organizational structure helps to elucidate some visually distinct groupings of stations. These groupings, in turn, may point to features that could be important to include in models for explaining ridership.
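The reordering idea can be sketched with a small agglomerative clustering routine. This is a simplified illustration under stated assumptions: Euclidean distance between correlation vectors and complete linkage are choices made here (the text does not specify either), the 4 x 4 correlation matrix is invented, and `cluster_order` is a hypothetical helper name.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two correlation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_order(corr, labels):
    """Return labels in the leaf order of a complete-linkage agglomeration.

    Each station is described by its row of the correlation matrix.
    Clusters are lists of row indices; merging concatenates the lists,
    so the final list places similar stations next to each other.
    """
    clusters = [[i] for i in range(len(labels))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(euclidean(corr[i], corr[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        _, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [labels[i] for i in clusters[0]]


# Invented correlation matrix: A resembles C, and B resembles D.
labels = ["A", "B", "C", "D"]
corr = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.9],
    [0.9, 0.2, 1.0, 0.1],
    [0.2, 0.9, 0.1, 1.0],
]

order = cluster_order(corr, labels)
print("axis order:", order)
```

Stations with similar correlation vectors end up adjacent in the returned order, which is what makes blocks of similar color visible in the reordered matrix.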
As an example, consider the stations shown on the very left-hand side of the x-axis where there are some low and/or negative correlations. One station in this group has a median correlation of 0.23 with the others. This station services O’Hare airport, which is one of the two major airports in the area. One could imagine that ridership at this station has a different driver than for other stations. Ridership here is likely influenced by incoming and outgoing plane schedules, while the other stations are not. For example, this station has a negative correlation (-0.46) with the UIC-Halsted station on the same line. The second-most dissimilar station is the previously mentioned Addison station as it is driven by game attendance.
4.2.6 Line Plots
A variable that is collected over time presents unique challenges in the modeling process. This type of variable is likely to have trends or patterns that are associated incrementally with time. This means that a variable’s current value is more related to recent values than to values further apart in time. Therefore, knowing what a variable’s value is today will be more predictive of tomorrow’s value than last week, last month, or last year’s value. We can assess the relationship between time and the value of a variable by creating a line plot, which is an extension of a scatter plot. In a line plot, time is on the x-axis, and the value of the variable is on the y-axis. The value of the variable at adjacent time points is connected with a line. Identifying time trends can lead to other features or to engineering other features that are associated with the response. The Chicago data provide a good illustration of line plots.
The Chicago data are collected over time, so we should also look for potential trends due to time. To look for these trends, the mean of ridership per month during weekdays and separately during the weekend was computed (Figure 4.10). Here we see that since 2001, ridership has steadily increased during weekdays and weekends. This would make sense because the population of the Chicago metropolitan area has increased during this time period (United States Census Bureau 2017). The line plot additionally reveals that within each year ridership generally increases from January through October then decreases through December. These findings imply that the time proximity of ridership information should be useful to a model. That is, understanding ridership information within the last week or month will be more useful in predicting future ridership.
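The monthly averages behind such a line plot can be sketched as a grouped mean. The daily values below are synthetic, and `monthly_means` is a hypothetical helper name.

```python
import statistics
from collections import defaultdict
from datetime import date, timedelta


def monthly_means(daily):
    """Mean ridership per (year, month, part of week) from a dict of date -> rides."""
    groups = defaultdict(list)
    for day, rides in daily.items():
        part = "Weekend" if day.weekday() >= 5 else "Weekday"
        groups[(day.year, day.month, part)].append(rides)
    return {key: statistics.fmean(values) for key, values in sorted(groups.items())}


# Synthetic data for January and February 2016, for illustration only.
daily = {}
day = date(2016, 1, 1)
while day < date(2016, 3, 1):
    base = 5000 if day.weekday() >= 5 else 20000
    daily[day] = base + 500 * (day.month - 1)  # small month-to-month increase
    day += timedelta(days=1)

for (year, month, part), mean in monthly_means(daily).items():
    print(year, month, part, round(mean))
```

Plotting these means against month, with one line per part of the week, gives the kind of display described for Figure 4.10.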
Weekend ridership also shows annual trends but exhibits more variation within the trends for some years. Uncovering predictors associated with this increased variation could help reduce forecasting error. Specifically, the Weekend line plots have the highest variation during 2008, with much higher ridership in the summer. A potential driver for increased ridership on public transportation is gasoline prices. Weekly gas prices in the Chicago area were collected from the U.S. Energy Information Administration, and the monthly average price by year is shown in the line plot of Figure 4.11.
Next, let’s see if a relationship can be established between gas prices and ridership. To do this, we will calculate the monthly average 2-week lag in gas price and the geometric mean of ridership at the Clark/Lake station. Figure 4.12 illustrates this relationship; there is a positive association between the 2-week lag in gas price and the geometric mean of ridership. From 2001 through 2014, the higher the gas price, the higher the ridership, with 2008 data appearing at the far right of both the Weekday and Weekend scatter plots. This trend is slightly different for 2015 and 2016 when oil prices dropped due to a marked increase in supply (U.S. Energy Information Administration 2017b). Here we see that by digging into characteristics of the original line plot, another feature can be found that will be useful for explaining some of the variation in ridership.
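The two monthly summaries can be sketched as follows. This is an illustration under stated assumptions: gas prices are taken to be already expanded to one value per day (the source data are weekly), all numbers are invented, and `monthly_summary` is a hypothetical helper name.

```python
import statistics
from collections import defaultdict
from datetime import date, timedelta


def monthly_summary(rides, gas, lag_days=14):
    """Per (year, month): mean of the lagged gas price and geometric mean of rides.

    `rides` and `gas` map date -> value. A day is used only when the gas
    price from `lag_days` earlier is available.
    """
    prices, counts = defaultdict(list), defaultdict(list)
    for day, n in rides.items():
        lagged = day - timedelta(days=lag_days)
        if lagged in gas:
            key = (day.year, day.month)
            prices[key].append(gas[lagged])
            counts[key].append(n)
    return {
        key: (statistics.fmean(prices[key]), statistics.geometric_mean(counts[key]))
        for key in sorted(prices)
    }


# Invented inputs: daily gas prices for January 2016 and rides for Jan 15-28.
start = date(2016, 1, 1)
gas = {start + timedelta(days=i): (2.0 if i < 14 else 3.0) for i in range(31)}
rides = {
    start + timedelta(days=i): (4000 if i % 2 == 0 else 9000)
    for i in range(14, 28)
}

for month, (price, gmean) in monthly_summary(rides, gas).items():
    print(month, f"lagged price = {price:.2f}", f"geometric mean rides = {gmean:.0f}")
```

The geometric mean is the exponential of the average log ridership, so it is less affected by a few very large daily values than the arithmetic mean.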
4.2.7 Principal Components Analysis
It is possible to visualize five or six dimensions of data in a two-dimensional figure by using colors, shapes, and faceting. But almost any data set today contains many more than just a handful of variables. Being able to visualize many dimensions in the physical space that we can actually see is crucial to understanding the data and to understanding if there are characteristics of the data that point to the need for feature engineering. One way to condense many dimensions into just two or three is to use projection techniques such as principal components analysis (PCA), partial least squares (PLS), or multidimensional scaling (MDS). Dimension reduction techniques for the purpose of feature engineering will be more fully addressed in Section 6.3. Here we will highlight PCA, and how this method can be used to engineer features that effectively condense the original predictors’ information.
Predictors that are highly correlated, like the station ridership illustrated in Figure 4.9, can be thought of as existing in a lower dimensional space than the original data. That is, the data represented here could be approximately represented by combinations of similar stations. Principal components analysis finds combinations of the variables that best summarize the variability in the original data (Dillon and Goldstein 1984). The combinations are a simpler representation of the data and often identify underlying characteristics within the data that will help guide the feature engineering process.
PCA will now be applied to the 14-day lagged station ridership data. Because the objective of PCA is to optimally summarize variability in the data, the cumulative percentage of variation is summarized by the top components. For these data, the first component captures 76.7% of the overall variability, while the first two components capture 83.1%. This is a large percentage of variation given that there are 125 total stations, indicating that station ridership information is redundant and can likely be summarized in a more condensed fashion.
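The cumulative percentage of variation can be illustrated on a small simulated data set. The sketch below is not the analysis used for the Chicago data: it finds components by power iteration with deflation on the covariance matrix (a simple substitute for a full eigendecomposition), the six simulated predictors share one common signal by construction, and all function names are hypothetical.

```python
import math
import random
import statistics
from itertools import accumulate


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, p = len(rows), len(rows[0])
    means = [statistics.fmean(r[j] for r in rows) for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def top_eigen(matrix, iterations=500):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    p = len(matrix)
    rng = random.Random(0)  # seeded random start avoids an unlucky fixed start vector
    vec = [rng.gauss(0, 1) for _ in range(p)]
    for _ in range(iterations):
        prod = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in prod))
        if norm == 0:
            return 0.0, vec
        vec = [x / norm for x in prod]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec


def explained_variance(rows, n_components):
    """Share of total variance captured by each of the first components."""
    cov = covariance_matrix(rows)
    p = len(cov)
    total = sum(cov[i][i] for i in range(p))
    shares = []
    for _ in range(n_components):
        value, vec = top_eigen(cov)
        shares.append(value / total)
        # Deflation: subtract the component just found before searching again.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)]
               for i in range(p)]
    return shares


# Simulated predictors: one shared signal plus independent noise per column.
rng = random.Random(1)
rows = []
for _ in range(200):
    shared = rng.gauss(0, 1)
    rows.append([shared + rng.gauss(0, 0.3) for _ in range(6)])

shares = explained_variance(rows, 3)
for k, total in enumerate(accumulate(shares), start=1):
    print(f"first {k} component(s): {total:.1%} of variance")
```

When the predictors are highly correlated, as simulated here, the first component accounts for most of the variation and the cumulative curve flattens quickly.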
Figure 4.13 provides a summary of the analysis. Part (a) displays the cumulative amount of variation summarized across the first 50 components. This type of plot is used to visually determine how many components are required to summarize a sufficient amount of variation in the data. Examining a scatter plot of the first two components (b) reveals that PCA is focusing on variation due to the part of the week with the weekday samples with lower component 1 scores and weekend samples with higher component 1 scores. The second component focuses on variation due to change over time with earlier samples receiving lower component 2 scores and later samples receiving higher component 2 scores. These patterns are revealed more clearly in parts (c) and (d) where the first and second component are plotted against the underlying variables that appear to affect them the most.
Other visualizations earlier in this chapter already alerted us to the importance of part of the week and year relative to the response. PCA now helps to confirm these findings, and will enable the creation of new features that simplify our data while retaining crucial predictive information. We will delve into this technique and other similar techniques later in Section 6.3.
[39] This visualization approach is also referred to as “trellising” or “conditioning” in different software.
[40] This does lead to an interesting dilemma for this analysis: should such an aberrant instance be allowed to potentially influence the model? We have left the value untouched but a strong case could be made for imputing what the value would be from previous data and using this value in the analysis.