3.5 Tuning Parameters and Overfitting
Many models include parameters that, while important, cannot be directly estimated from the data. These tuning parameters (sometimes called hyperparameters27) are important since they often control the complexity of the model and thus also affect any variance-bias trade-off that can be made.
As an example, the K-nearest neighbor model stores the training set data and, when predicting new samples, locates the K training set points that are in the closest proximity to the new sample. A prediction is then made from the training set outcomes of those neighbors. The number of neighbors controls the complexity and, in a manner very similar to the moving average discussion in Section 1.2.5, controls the variance and bias of the model. When K is very small, there is the most potential for overfitting: only a few values are used for prediction, so the prediction is highly susceptible to changes in the data. However, if K is too large, too many potentially irrelevant data points are used for prediction, resulting in an underfit model. A minimal sketch of this prediction step appears below.
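To make the mechanics concrete, here is a small illustration in Python, assuming numeric predictors stored in hypothetical numpy arrays X_train and y_train; it is a sketch of the idea, not a production implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Average the training set outcomes of the k nearest neighbors."""
    # Euclidean distance from the new sample to every training set point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k training set points closest to the new sample
    nearest = np.argsort(dists)[:k]
    # For regression, the prediction is the mean outcome of the neighbors
    return y_train[nearest].mean()
```

With k = 1 the prediction simply copies the outcome of the single closest training point; as k grows, the average smooths over a wider, and potentially less relevant, neighborhood.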
To illustrate this, consider Figure 3.10, where a single test set sample is shown as the blue circle and its five closest (geographic) neighbors from the training set are shown in red. The test set sample's sale price is $176K and the neighbors' prices, from closest to farthest, are: $175K, $128K, $100K, $120K, and $125K. Using K = 1, the model would miss the true house price by only $0.9K. An almost exact match like this is the signature of the overfitting introduced in Section 1.2.1: with a single neighbor, the model effectively memorizes the training set, and a prediction that hinges on one training point is highly sensitive to which point happens to be closest. Increasing the number of neighbors trades that sensitivity for bias; here the four more distant neighbors are much cheaper houses, so averaging all K = 5 points increases the error for this sample to -$46.4K. Neither extreme is reliable on its own, which is why K must be chosen by judging performance across many samples rather than one. The arithmetic for this sample is verified below.
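The two errors quoted above follow directly from the listed prices. A few lines of Python confirm the arithmetic, using the prices rounded to the nearest $K as printed (the $0.9K figure presumably comes from the unrounded sale prices):

```python
# Neighbor sale prices in $K, from closest to farthest (from Figure 3.10)
neighbors = [175, 128, 100, 120, 125]
truth = 176  # true sale price of the test set sample, in $K

pred_k1 = neighbors[0]        # K = 1 uses only the nearest point
pred_k5 = sum(neighbors) / 5  # K = 5 averages all five neighbors

print(pred_k1 - truth)  # -1   -> about -$1K ($0.9K with unrounded prices)
print(pred_k5 - truth)  # -46.4 -> the -$46.4K error for this sample
```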
This example illustrates the effect that a tuning parameter can have on the quality of the model, and some models have more than one tuning parameter. Again for the nearest neighbor model, a different distance metric could be used, as well as different schemes for weighting the neighbors so that more distant points have less of an effect on the prediction. Both options are shown in the sketch below.
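As one illustration, scikit-learn's K-nearest neighbors regressor exposes both of these choices directly; the particular settings below are arbitrary examples, not recommendations:

```python
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(
    n_neighbors=5,
    metric="manhattan",  # a distance metric other than the default Euclidean
    weights="distance",  # closer neighbors get more weight in the average
)
```

Each additional parameter of this kind enlarges the space of candidate models that must be searched.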
To make sure that proper values of the tuning parameters are used, some sort of search procedure is required, along with a method for obtaining good, generalizable measures of performance. For the latter, repeatedly using the test set is problematic since it would lose its impartiality; instead, resampling is commonly used. The next section will describe a few methods for determining optimal values of these types of parameters; one common recipe is sketched below.
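As a sketch of how such a search might look in practice, the following pairs a grid of candidate parameter values with repeated 10-fold cross-validation on the training set, leaving the test set untouched; X_train and y_train are placeholders for the training data:

```python
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.neighbors import KNeighborsRegressor

# Candidate tuning parameter values to evaluate
grid = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
}

# Each candidate is scored by resampling (repeated 10-fold CV),
# so the test set is never consulted during tuning.
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid=grid,
    cv=RepeatedKFold(n_splits=10, n_repeats=5, random_state=1),
    scoring="neg_root_mean_squared_error",
)
# search.fit(X_train, y_train)
# search.best_params_  # the parameter values chosen by resampling
```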
27. Not to be confused with the hyperparameters of a prior distribution in Bayesian analysis.