1.2 Important Concepts
Before proceeding to specific strategies and methods, there are some key concepts that should be discussed. These concepts involve theoretical aspects of modeling as well as the practice of creating a model. A number of these aspects are discussed here; additional details are provided in later chapters, and references are given throughout this work.
1.2.1 Overfitting
Overfitting is the situation where a model fits very well to the current data but fails when predicting new samples. It typically occurs when the model has relied too heavily on patterns and trends in the current data set that do not occur otherwise. Since the model only has access to the current data set, it has no ability to understand that such patterns are anomalous. For example, in the housing data shown in Figure 1.1, one could determine that properties that had square footage between 1,267.5 and 1,277 and contained three bedrooms could have their sale prices predicted within $1,207 of the true values. However, the accuracy on other houses (not in this data set) that satisfy these conditions would be much worse. This is an example of a trend that does not generalize to new data.
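To make the idea concrete, the following sketch (using simulated data and scikit-learn, not the housing data from Figure 1.1) shows a fully grown regression tree that fits its training set almost perfectly but predicts held-out samples much less accurately:

```python
# Overfitting sketch: a very flexible model memorizes the training data but
# generalizes poorly. All values here are simulated for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)  # only the first predictor matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

# A fully grown tree can split until every training sample is fit exactly.
tree = DecisionTreeRegressor(max_depth=None).fit(X_tr, y_tr)
print("training MAE:", mean_absolute_error(y_tr, tree.predict(X_tr)))  # near zero
print("test MAE:    ", mean_absolute_error(y_te, tree.predict(X_te)))  # noticeably larger
```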
Often, models that are very flexible (called “low-bias models” in Section 1.2.5) have a higher likelihood of overfitting the data. It is not difficult for these models to do extremely well on the data set used to create the model and, without some preventative mechanism, they can easily fail to generalize to new data. As will be seen in the coming chapters, especially Section 3.5, overfitting is one of the primary risks in modeling and should be a concern for practitioners.
While models can overfit to the data points, such as with the housing data shown above, feature selection techniques can overfit to the predictors. This occurs when a variable appears relevant in the current data set but shows no real relationship with the outcome once new data are collected. The risk of this type of overfitting is especially dangerous when the number of data points, denoted as \(n\), is small and the number of potential predictors (\(p\)) is very large. As with overfitting to the data points, this problem can be mitigated using a validation methodology that signals when it is occurring.
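This scenario can be sketched directly: with pure-noise predictors and \(n\) much smaller than \(p\), a simple correlation screen will still “find” predictors that look important, none of which hold up on new data. The sketch below uses simulated data and a p-value cutoff chosen only for illustration:

```python
# Selection overfitting sketch: screening many noise predictors on few samples.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n, p = 30, 1000                                      # small n, very large p
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                               # outcome unrelated to any predictor

# Screen each predictor by its correlation p-value with the outcome.
pvals = np.array([pearsonr(X[:, j], y)[1] for j in range(p)])
selected = np.flatnonzero(pvals < 0.01)
print("predictors passing p < 0.01 on the current data:", len(selected))  # typically ~10

# The same predictors show no relationship when the screen is repeated on new data.
X_new, y_new = rng.normal(size=(n, p)), rng.normal(size=n)
pvals_new = np.array([pearsonr(X_new[:, j], y_new)[1] for j in selected])
print("still 'significant' on new data:", int((pvals_new < 0.01).sum()))  # usually 0
```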
1.2.2 Supervised and Unsupervised Procedures
Supervised data analysis involves identifying patterns between predictors and an identified outcome that is to be modeled or predicted, while unsupervised techniques are focused solely on identifying patterns among the predictors.
Both types of analyses would typically involve some amount of exploration. Exploratory data analysis (EDA) (Tukey 1977) is used to understand the major characteristics of the predictors and outcome so that any particular challenges associated with the data can be discovered prior to modeling. This can include investigations of correlation structures in the variables, patterns of missing data, and/or anomalous motifs in the data that might challenge the initial expectations of the modeler.
Obviously, predictive models are strictly supervised since there is a direct focus on finding relationships between the predictors and the outcome. Unsupervised analyses include methods such as cluster analysis, principal component analysis, and similar tools for discovering patterns in data.
Both supervised and unsupervised analyses are susceptible to overfitting, but supervised analyses are particularly inclined to discovering erroneous patterns in the data for predicting the outcome. In short, we can use these techniques to create a self-fulfilling predictive prophecy. For example, it is not uncommon for an analyst to conduct a supervised analysis of data to detect which predictors are significantly associated with the outcome. These significant predictors are then used in a visualization (such as a heat map or cluster analysis) on the same data. Not surprisingly, the visualization reliably demonstrates clear patterns between the outcomes and predictors and appears to provide evidence of their importance. However, since the same data are shown, the visualization is essentially cherry-picking results that hold only for these data and are unlikely to generalize to new data. This issue is discussed at various points in this text but most pointedly in Section 3.8.
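A minimal simulation of this trap is sketched below (illustrative data only): predictors are screened for large class differences and then “visualized” on the same data, where they appear to separate the classes, while the apparent signal disappears on new data.

```python
# Self-fulfilling prophecy sketch: screening and visualizing on the same data.
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 2000
X = rng.normal(size=(n, p))
y = np.repeat([0, 1], n // 2)                        # outcome unrelated to the predictors

# "Supervised" screen: keep the 25 predictors with the largest class differences.
diffs = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
keep = np.argsort(np.abs(diffs))[-25:]

# On the SAME data the kept predictors show large class differences ...
same = np.abs(diffs[keep]).mean()
# ... but on new data the apparent separation shrinks to chance levels.
X_new = rng.normal(size=(n, p))
diffs_new = X_new[y == 1][:, keep].mean(axis=0) - X_new[y == 0][:, keep].mean(axis=0)
print(f"mean |class difference|, same data: {same:.2f}, new data: {np.abs(diffs_new).mean():.2f}")
```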
1.2.3 No Free Lunch
The “No Free Lunch” Theorem (Wolpert 1996) is the idea that, without any specific knowledge of the problem or data at hand, no one predictive model can be said to be the best. There are many models that are optimized for some data characteristics (such as missing values or collinear predictors). In these situations, it might be reasonable to assume that they would do better than other models (all other things being equal). In practice, things are not so simple. One model that is optimized for collinear predictors might be constrained to modeling linear trends and might be sensitive to missing data. It is very difficult to predict the best model, especially before the data are in hand.
There have been experiments to judge which models tend to do better than others on average, notably Demsar (2006) and Fernandez-Delgado et al. (2014). These analyses show that some models tend to be the most accurate more often than others, but the rate of “winning” is not high enough to enact a strategy of “always use model X.”
In practice, it is wise to try a number of disparate types of models to probe which ones will work well with your particular data set.
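One hedged way to act on this advice is to resample several disparate model types on the same data and compare, rather than assuming one family will always win. The sketch below uses scikit-learn and a simulated data set purely for illustration:

```python
# "Try several disparate models" sketch: compare model families by resampling.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

models = {
    "linear regression": LinearRegression(),
    "nearest neighbors": KNeighborsRegressor(n_neighbors=5),
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    print(f"{name:>18}: cross-validated RMSE = {-scores.mean():.1f}")
```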
1.2.4 The Model versus the Modeling Process
The process of developing an effective model is both iterative and heuristic. It is difficult to know the needs of any data set prior to working with it and it is common for many approaches to be evaluated and modified before a model can be finalized. Many books and resources solely focus on the modeling technique but this activity is often a small part of the overall process. Figure 1.4 shows an illustration of the overall process for creating a model for a typical problem.
The initial activity begins at marker (a) where exploratory data analysis is used to investigate the data. After initial explorations, marker (b) indicates where early data analysis might take place. This could include evaluating simple summary measures or identifying predictors that have strong correlations with the outcome. The process might iterate between visualization and analysis until the modeler feels confident that the data are well understood. At milestone (c), the first draft for how the predictors will be represented in the models is created based on the previous analysis.
At this point, several different modeling methods might be evaluated with the initial feature set. However, many models can contain hyperparameters that require tuning. This is represented at marker (d) where four clusters of models are shown as thin red marks. These represent four distinct models that are being evaluated, each one evaluated multiple times over a set of candidate hyperparameter values. This model tuning process is discussed in Section 3.6 and is illustrated several times in later chapters. Once the four models have been tuned, they are numerically evaluated on the data to understand their performance characteristics (e). Summary measures for each model, such as model accuracy, are used to understand the level of difficulty for the problem and to determine which models appear to best suit the data. Based on these results, more EDA can be conducted on the model results (f), such as residual analysis. For the previous example of predicting the sale prices of houses, the properties that are poorly predicted can be examined to understand whether there are any systematic issues with the model. As an example, there may be particular ZIP codes that are difficult to accurately assess. Consequently, another round of feature engineering (g) might be used to compensate for these obstacles. By this point, it may be apparent which models tend to work best for the problem at hand, and another, more extensive, round of model tuning can be conducted on fewer models (h). After more tuning and modification of the predictor representation, the two candidate models (#2 and #4) have been finalized. These models can be evaluated on an external test set as a final “bake off” between the models (i). The final model is then chosen (j) and this fitted model will be used going forward to predict new samples or to make inferences.
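The tuning step at markers (d) and (e) can be sketched as a grid search over candidate hyperparameter values for each model type, with resampled summary scores used to compare them. The data, model types, and grids below are illustrative assumptions, not the book's example:

```python
# Model tuning sketch: each candidate model is evaluated over a hyperparameter grid.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=400, n_features=15, noise=5.0, random_state=2)

candidates = {
    "nearest neighbors": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 11, 21]}),
    "boosted trees": (GradientBoostingRegressor(random_state=2),
                      {"learning_rate": [0.01, 0.1], "n_estimators": [100, 500]}),
}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5,
                          scoring="neg_root_mean_squared_error").fit(X, y)
    print(f"{name}: best RMSE = {-search.best_score_:.1f}, settings = {search.best_params_}")
```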
The point of this schematic is to illustrate that there are far more activities in the process than simply fitting a single mathematical model. For most problems, it is common to have feedback loops that evaluate and reevaluate how well any model/feature set combination performs.
1.2.5 Model Bias and Variance
Variance is a well-understood concept. When used in regard to data, it describes the degree to which the values can fluctuate. If the same object is measured multiple times, the observed measurements will differ to some degree. In statistics, bias is generally thought of as the degree to which something deviates from its true underlying value. For example, when trying to estimate public opinion on a topic, a poll could be systematically biased if the people surveyed over-represent a particular demographic. The bias would occur as a result of the poll incorrectly estimating the desired target.
Models can also be evaluated in terms of variance and bias (Geman, Bienenstock, and Doursat 1992). A model has high variance if small changes to the underlying data used to estimate the parameters cause a sizable change in those parameters (or in the structure of the model). For example, the sample mean of a set of data points is more sensitive to perturbations of the data than the sample median. The median is driven by the values in the center of the data distribution and, for this reason, is insensitive to moderate changes in the values. A few examples of models with low variance are linear regression, logistic regression, and partial least squares. High-variance models include those that strongly rely on individual data points to define their parameters, such as classification or regression trees, nearest neighbor models, and neural networks. To contrast low-variance and high-variance models, consider linear regression and, alternatively, nearest neighbor models. Linear regression uses all of the data to estimate slope parameters and, while it can be sensitive to outliers, it is much less sensitive than a nearest neighbor model.
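This sensitivity framing can be checked with a few lines of simulation (values chosen only for illustration): perturbing a handful of points moves the sample mean noticeably while the sample median barely changes.

```python
# Sensitivity sketch: the mean reacts to a few perturbed values; the median does not.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)

x_perturbed = x.copy()
x_perturbed[:5] += 10                                # shift five points substantially

print("change in mean:  ", round(abs(x_perturbed.mean() - x.mean()), 3))          # ~0.5
print("change in median:", round(abs(np.median(x_perturbed) - np.median(x)), 3))  # near zero
```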
Model bias reflects the ability of a model to conform to the underlying theoretical structure of the data. A low-bias model is one that can be highly flexible and has the capacity to fit a variety of different shapes and patterns. A high-bias model would be unable to estimate values close to their true theoretical counterparts. Linear methods often have high bias since, without modification, they cannot describe nonlinear patterns in the predictor variables. Tree-based models, support vector machines, neural networks, and others can be very adaptable to the data and have low bias.
As one might expect, model bias and variance can often be in opposition to one another; in order to achieve low bias, models tend to demonstrate high variance (and vice versa). The variance-bias trade-off is a common theme in statistics. In many cases, models have parameters that control the flexibility of the model and thus affect the variance and bias properties of the results. Consider a simple sequence of data points such as a daily stock price. A moving average model would estimate the stock price on a given day by the average of the data points within a certain window around that day. The size of the window modulates the variance and bias here. For a small window, the average is much more responsive to the data and has a high potential to match the underlying trend. However, it also inherits a high degree of sensitivity to the data in the window, and this increases variance. Widening the window averages more points and reduces the variance in the model, but it also risks over-smoothing the data and missing the underlying trend (thus increasing bias).
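The window-size trade-off can be sketched with a simulated series (the trend, noise level, and window sizes below are assumptions made for illustration): roughness drops as the window widens, while the widest window over-smooths and drifts away from the underlying trend.

```python
# Moving average sketch: window size trades variance (roughness) against bias.
import numpy as np

rng = np.random.default_rng(3)
trend = np.sin(np.linspace(0, 3 * np.pi, 200))       # the underlying pattern
series = trend + rng.normal(scale=0.3, size=200)     # observed, noisy values

def moving_average(x, window):
    """Simple moving average over the given window size."""
    return np.convolve(x, np.ones(window) / window, mode="valid")

for window in (3, 15, 51):
    fit = moving_average(series, window)
    centered_trend = trend[(window - 1) // 2 : len(trend) - (window // 2)]
    roughness = np.std(np.diff(fit))                 # proxy for variance of the fit
    miss = np.mean(np.abs(fit - centered_trend))     # distance from the true trend
    print(f"window = {window:2d}: roughness = {roughness:.3f}, |fit - trend| = {miss:.3f}")
```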
Consider the example in Figure 1.5a that contains a single predictor and outcome where their relationship is nonlinear. The right-hand panel (b) shows two model fits. First, a simple three-point moving average is used (in green). This trend line is bumpy but does a good job of tracking the nonlinear trend in the data. The purple line shows the results of a standard linear regression model that includes a term for the predictor value and a term for the square of the predictor value. Linear regression is linear in the model parameters, and adding polynomial terms to the model can be an effective way of allowing the model to identify nonlinear patterns. Since the data points start low on the y-axis, reach an apex near a predictor value of 0.3, and then decrease, a quadratic regression model would be a reasonable first attempt at modeling these data. This model is very smooth (showing low variance) but does not do a very good job of fitting the nonlinear trend seen in the data (i.e., high bias).
To accentuate this point further, the original data were “jittered” multiple times by adding small amounts of random noise to their values. This was done twenty times and, for each version of the data, the same two models were fit to the jittered data. The fitted curves are shown in Figure 1.6. The moving average shows a significant degree of noise in the regression predictions but, on average, manages to track the data patterns well. The quadratic model was not confused by the extra noise and generated very similar (although inaccurate) model fits.
The notions of model bias and variance are central to the ideas in this text. As previously described, simplicity is an important characteristic of a model. One method of creating a low-variance, low-bias model is to augment a low-variance model with appropriate representations of the data to decrease the bias. The analysis in Section 1.1 is a simple example of this process; a logistic regression (high bias, low variance) was improved by modifying the predictor variables and was able to show results on par with a neural network model (low bias). As another example, the data in Figure 1.5(a) were generated using the following equation:
\[y = x^3 + \left[\beta_1 \exp\left(\beta_2 (x-\beta_3)^2\right)\right] + \epsilon\]
Theoretically, if this functional form could be determined from the data, then the best possible model would be a nonlinear regression model (low variance, low bias). We revisit the variance-bias relationship in Section 3.4.6 in the context of measuring performance using resampling.
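A brief sketch of this idea, assuming arbitrary values for the \(\beta\) parameters (the true values used for Figure 1.5 are not reproduced here), simulates data from the equation above and fits a nonlinear regression with the matching functional form:

```python
# Nonlinear regression sketch: fit the generating equation directly with curve_fit.
import numpy as np
from scipy.optimize import curve_fit

def f(x, b1, b2, b3):
    """Functional form of the generating equation, without the noise term."""
    return x**3 + b1 * np.exp(b2 * (x - b3) ** 2)

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 100)
y = f(x, 2.0, -20.0, 0.3) + rng.normal(scale=0.1, size=100)  # assumed beta values

# Estimates should land near the assumed betas (convergence depends on start values).
est, _ = curve_fit(f, x, y, p0=[1.0, -10.0, 0.5])
print("estimated parameters:", np.round(est, 2))
```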
In a similar manner, models can have reduced performance due to irrelevant predictors causing excess model variation. Feature selection techniques improve models by reducing the unwanted noise of extra variables.
1.2.6 Experience-Driven Modeling and Empirically Driven Modeling
Projects may arise where no modeling has previously been applied to the data. For example, suppose that a new customer database becomes available and this database contains a large number of fields that are potential predictors. Subject matter experts may have a good sense of what features should be in the model based on previous experience. This knowledge allows experts to be prescriptive about exactly which variables are to be used and how they are represented. Their reasoning should be strongly considered given their expertise. However, since the models estimate parameters from the data, there can be a strong desire to be data-driven rather than experience-driven.
Many types of models have the ability to empirically discern which predictors should be in the model and can derive the representation of the predictors that can maximize performance (based on the available data). The perceived (and often real) danger in this approach is twofold. First, as previously discussed, data-driven approaches run the risk of overfitting to false patterns in the data. Second, they might yield models that are highly complex and may not have any obvious rational explanation. In the latter case, a circular argument may arise in which practitioners only accept models that quantify what they already know, yet expect better results than a human's manual assessment can provide. For example, if an unexpected, novel predictor is found that has a strong relationship with the outcome, this may challenge the current conventional wisdom and be viewed with suspicion.
It is common to have some conflict between experience-driven modeling and empirically driven modeling. Each approach has its advantages and disadvantages. In practice, we have found that a combination of the two approaches works best as long as both sides see the value in the contrasting approaches. The subject matter expert may have more confidence in a novel model feature if they feel that the methodology used to discover the feature is rigorous enough to avoid spurious results. Also, an empirical modeler might find benefit in an expert’s recommendations to initially whittle down a large number of predictors or at least to help prioritize them in the modeling process. In addition, the process of feature engineering requires some level of expertise related to what is being modeled. It is difficult to make recommendations on how predictors should be represented in a vacuum, without knowing the context of the project. For example, in the simple example in Section 1.1, the inverse transformation for the predictors might have seemed obvious to an experienced practitioner.
1.2.7 Big Data
The definition of Big Data is somewhat nebulous. Typically, this term implies a large number of data points (as opposed to variables), and it is worth noting that the effective sample size might be smaller than the actual data size. For example, if there is a severe class imbalance or rare event rate, the number of events in the data might be fairly small. Click-through rate on online ads is a good example of this. Another example is when one particular region of the predictor space is abundantly sampled. Suppose a data set has billions of records but most correspond to white males within a certain age range. The number of distinct samples might be low, resulting in a data set that is not diverse.
One situation where a large data set probably does not help is when samples are added within the mainstream of the data. This simply increases the granularity of the distribution of the variables and, after a certain point, may not help in the data analysis. More rows of data can be helpful when new areas of the population are being accrued. In other words, big data does not necessarily mean better data.
While the benefits of big data have been widely espoused, there are some potential drawbacks. First, it simply might not solve problems being encountered in the analysis. Big data cannot automatically induce a relationship between the predictors and outcome when none exists. Second, there are often computational ramifications to having large amounts of data. Many high-variance/low-bias models tend to be very complex and computationally demanding; the time to fit these models can increase with data size and, in some cases, the increase can be nonlinear. Adding more data allows these models to more accurately reflect the complexity of the data but would require specialized solutions to be feasible. This, in itself, is not problematic unless the solutions have the effect of restricting the types of models that can be utilized. It is better for the problem to dictate the type of model that is needed.
Additionally, not all models can exploit large data volumes. For high-bias, low-variance models, big data tends to simply drive down the standard errors of the parameter estimates. For example, in a linear regression created on a million data records, doubling or tripling the amount of training data is unlikely to improve the parameter estimates to any practical degree (all other things being equal).
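A quick sketch (simulated data, assumed coefficients) illustrates the diminishing returns: once \(n\) is already in the millions, doubling the number of rows leaves the linear regression coefficients essentially unchanged.

```python
# Diminishing returns sketch: more rows barely move the coefficient estimates.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(8)

def fit_coefs(n):
    X = rng.normal(size=(n, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)  # assumed true coefficients
    return LinearRegression().fit(X, y).coef_

for n in (1_000_000, 2_000_000):
    print(f"n = {n:>9,}: coefficients =", np.round(fit_coefs(n), 4))
```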
However, there are models that can effectively leverage large data sets. In some domains, there can be large amounts of unlabeled data where the outcome is unknown but the predictors have been measured or computed. The classic examples are images and text, but unlabeled data can occur in other situations. For example, pharmaceutical companies have large databases of chemical compounds that have been designed but whose important characteristics have not been measured (which can be expensive). Other examples include public governmental databases where there is an abundance of data that have not been connected to a specific outcome.
Unlabeled data can be used to solve some specific modeling problems. For models that require formal probability specifications, determining multivariate distributions can be extremely difficult. Copious amounts of predictor data can help estimate or specify these distributions. Autoencoders, discussed in Section 6.3.2, are models that can denoise or smooth the predictor values. The outcome is not required to create an autoencoder, so unlabeled data can potentially improve the situation.
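As a rough stand-in for a true autoencoder (and not the implementation discussed in Section 6.3.2), the sketch below fits a bottlenecked neural network to reproduce its own inputs using only unlabeled, simulated predictor data; the reconstructions are typically closer to the underlying signal than the raw noisy values.

```python
# Autoencoder-style denoising sketch: the model is trained on predictors only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(9)
signal = rng.normal(size=(2000, 1)) @ rng.normal(size=(1, 20))    # correlated predictors
X_unlabeled = signal + rng.normal(scale=0.5, size=signal.shape)   # plus measurement noise

# A bottleneck network trained to reproduce its inputs -- no outcome is required.
autoencoder = MLPRegressor(hidden_layer_sizes=(4,), max_iter=3000,
                           random_state=9).fit(X_unlabeled, X_unlabeled)
X_denoised = autoencoder.predict(X_unlabeled)

print("mean squared error vs signal, raw:     ", round(float(np.mean((X_unlabeled - signal) ** 2)), 3))
print("mean squared error vs signal, denoised:", round(float(np.mean((X_denoised - signal) ** 2)), 3))
```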
Overall, when encountering (or being offered) large amounts of data, one might think to ask:
- What are you using it for? Does it solve some unmet need?
- Will it get in the way?