1.5 An Outline of the Book
The goal of this text is to provide effective tools for uncovering relevant and predictively useful representations of predictors. These tools will be the bookends of the predictive modeling process. At the beginning of the process we will explore techniques for augmenting the predictor set. Then at the end of the process we will provide methods for filtering the enhanced predictor set to ultimately produce better models. These concepts will be detailed in Chapters 2-12 as follows.
We begin by providing a short illustration of the interplay between the modeling and feature engineering processes (Chapter 2). In this example, we use feature engineering and feature selection methods to improve the ability of a model to predict the risk of ischemic stroke.
In Chapter 3 we provide a review of the process for developing predictive models, which will include an illustration of the steps of data splitting, validation approach selection, model tuning, and performance estimation for future predictions. This chapter will also include guidance on how to use feedback loops when cycling through the model building process across multiple models.
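For readers who prefer to see these steps in code, a minimal sketch follows; the language (Python with scikit-learn), data set, model, and tuning grid are illustrative assumptions, not examples from this book:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Step 1: split the data, reserving a test set for final performance estimation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Step 2: choose a validation approach (here, 10-fold cross-validation)
# and tune the model on the training set only
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression(max_iter=5000))])
grid = GridSearchCV(pipe, {"model__C": [0.01, 0.1, 1.0, 10.0]}, cv=10)
grid.fit(X_train, y_train)

# Step 3: estimate performance on future predictions using the held-out test set
print("Best C:", grid.best_params_["model__C"])
print("Held-out accuracy:", grid.score(X_test, y_test))
```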
Exploratory visualizations of the data are crucial for understanding relationships among predictors and between predictors and the response, especially for high-dimensional data. In addition, visualizations can be used to assist in understanding the nature of individual predictors including predictors’ skewness and missing data patterns. Chapter 4 will illustrate useful visualization techniques to explore relationships within and between predictors. Graphical methods for evaluating model lack-of-fit will also be presented.
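A small sketch of this kind of exploration, using simulated data and assumed tooling (pandas and matplotlib), illustrates checking skewness, missingness, and a predictor-response relationship:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: one right-skewed predictor, one with missing values
df = pd.DataFrame({"x1": rng.lognormal(size=200),
                   "x2": rng.normal(size=200)})
df.loc[rng.choice(200, 20, replace=False), "x2"] = np.nan
df["y"] = 2 * np.log(df["x1"]) + rng.normal(scale=0.5, size=200)

print(df.skew(numeric_only=True))  # flag skewed predictors
print(df.isna().mean())            # fraction missing per column

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(df["x1"], bins=30)          # distribution of a predictor
axes[0].set_title("x1 (right-skewed)")
axes[1].scatter(df["x1"], df["y"], s=8)  # predictor vs. response
axes[1].set_title("x1 vs. y")
plt.tight_layout()
plt.show()
```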
Chapter 5 will focus on approaches for encoding discrete, or categorical, predictors. Here we will summarize standard techniques for representing categorical predictors and introduce feature engineering methods, such as feature hashing, that use existing information to create new predictors that better uncover meaningful relationships. This chapter will also provide guidance on practical issues such as how to handle rare levels within a categorical predictor and the impact of creating dummy variables for tree- and rule-based models. Date-based predictors, which are present in many data sets, can also be viewed as categorical predictors and will be addressed here.
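As a brief preview, here is a minimal Python sketch contrasting dummy variables with feature hashing; the data, the use of scikit-learn's FeatureHasher, and the choice of eight hash columns are illustrative assumptions rather than the book's own examples:

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

colors = pd.Series(["red", "blue", "green", "blue", "chartreuse"])

# Standard dummy (one-hot) encoding: one column per observed level
dummies = pd.get_dummies(colors, prefix="color")
print(dummies.shape)  # (5, 4) -- grows with the number of levels

# Feature hashing: a fixed number of columns regardless of how many
# levels exist, which also accommodates rare or unseen categories
hasher = FeatureHasher(n_features=8, input_type="string")
hashed = hasher.transform([[c] for c in colors]).toarray()
print(hashed.shape)   # (5, 8)
```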
Engineering numeric predictors will be discussed in Chapter 6. As mentioned above, numeric predictors as collected in the original data may not be optimal for predicting the response. Univariate and multivariate transformations are a first step to finding better forms of numeric predictors. A more advanced approach is to use basis expansions (i.e., splines) to create better representations of the original predictors. In certain situations, transforming continuous predictors to categorical or ordinal bins reduces variation and helps to improve predictive model performance. Caveats to binning numerical predictors will also be provided.
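The following sketch shows, under the same illustrative assumptions (Python with scikit-learn, simulated data), one way each of these ideas can be expressed: a univariate Box-Cox transformation, a spline basis expansion, and quantile binning:

```python
import numpy as np
from sklearn.preprocessing import (PowerTransformer, SplineTransformer,
                                   KBinsDiscretizer)

rng = np.random.default_rng(1)
x = rng.lognormal(size=(100, 1))  # a right-skewed numeric predictor

# Univariate transformation (Box-Cox requires strictly positive values)
x_bc = PowerTransformer(method="box-cox").fit_transform(x)

# Basis expansion: represent x with a cubic B-spline basis
# (SplineTransformer is available in scikit-learn >= 1.0)
x_spline = SplineTransformer(degree=3, n_knots=5).fit_transform(x)
print(x_spline.shape)  # one column per basis function

# Binning: simple but lossy; see the caveats discussed in Chapter 6
x_binned = KBinsDiscretizer(n_bins=4, encode="ordinal").fit_transform(x)
```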
Up to this point in the book, a feature has been considered as one of the observed predictors in the data. In Chapter 7 we will illustrate that important features for a predictive model could also be the interaction between two or more of the original predictors. Quantitative tools for determining which predictors interact with one another will be explored along with graphical methods to evaluate the importance of these types of effects. This chapter will also discuss the concept of estimability of interactions.
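To make the idea concrete, here is a small simulated sketch in which the response depends on the product of two predictors; the correlation screen at the end is a deliberately simple stand-in for the more rigorous tools the chapter develops:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
# The response depends on the product of predictors 0 and 1 (an interaction)
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=100)

# Expand the predictor set with all pairwise interaction terms
inter = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_int = inter.fit_transform(X)
print(inter.get_feature_names_out())  # ['x0' 'x1' 'x2' 'x0 x1' 'x0 x2' 'x1 x2']

# A quick screen: correlation of each candidate column with the response;
# only the x0 x1 interaction should stand out
for name, col in zip(inter.get_feature_names_out(), X_int.T):
    print(f"{name}: {np.corrcoef(col, y)[0, 1]: .2f}")
```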
Every practitioner working with real-world data will encounter missing data at some point. While some predictive models (e.g., trees) have novel ways of handling missing data, other models do not and require complete data. Chapter 8 explores mechanisms that cause missing data and provides visualization tools for investigating missing data patterns. Traditional and modern tools for removing or imputing missing data are provided. In addition, the imputation methods are evaluated for continuous and categorical predictors.
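A minimal sketch of both styles of imputation, again using simulated data and scikit-learn's imputers as stand-ins for the tools discussed in the chapter:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

num = pd.DataFrame({"x1": [1.0, 2.0, np.nan, 4.0, 5.0],
                    "x2": [1.1, 1.9, 3.0, 4.2, np.nan]})
cat = pd.DataFrame({"color": ["red", "blue", np.nan, "blue", "red"]})

# Traditional: mean imputation for continuous predictors,
# most-frequent-level imputation for categorical predictors
num_mean = SimpleImputer(strategy="mean").fit_transform(num)
cat_mode = SimpleImputer(strategy="most_frequent").fit_transform(cat)

# Modern, model-based: K-nearest-neighbors imputation, which borrows
# information from the other predictors when filling in a missing value
num_knn = KNNImputer(n_neighbors=2).fit_transform(num)
print(num_knn)
```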
Working with profile data, such as time series (longitudinal), cellular-to-wellular, and image data, will be addressed in Chapter 9. These kinds of data are commonly collected in fields such as finance, pharmaceuticals, intelligence, transportation, and weather forecasting, and this particular data structure poses unique challenges for many models. Some modern predictive modeling tools, such as partial least squares, can naturally handle data in this format, but many other powerful modeling techniques cannot work with such data directly and require that profiles be summarized or collapsed prior to modeling. This chapter will illustrate techniques for working with profile data in ways that strive to preserve the predictive information while creating a format that can be used across predictive models.
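For instance, one common way to collapse profile data is to summarize each subject's measurements into a fixed-length feature vector; the sketch below (simulated data, with illustrative summary statistics chosen for brevity) shows the idea:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical profile data: repeated measurements per subject over time
long = pd.DataFrame({
    "subject": np.repeat(["s1", "s2", "s3"], 10),
    "time": np.tile(np.arange(10), 3),
    "value": rng.normal(size=30).cumsum(),
})

def slope(g):
    # Simple within-subject trend via a least-squares line
    return np.polyfit(g["time"], g["value"], deg=1)[0]

# Collapse each profile to a fixed-length feature vector
features = long.groupby("subject").agg(
    mean_value=("value", "mean"),
    sd_value=("value", "std"),
)
features["trend"] = long.groupby("subject").apply(slope)
print(features)
```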
The feature engineering process described in Chapters 5-9 can lead to many more predictors than were contained in the original data. While some of the additional predictors will likely enhance model performance, not all of the original and new predictors will be useful for prediction. The final chapters will discuss feature selection as an overall strategy for improving model predictive performance. Important aspects include the goals of feature selection, the consequences of irrelevant predictors, comparisons with selection via regularization, and how to avoid overfitting during the feature selection process.
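One of those aspects, avoiding overfitting in the selection process, can be previewed in code: the key is to redo the selection inside each resampling fold rather than selecting once on the full data set. A minimal scikit-learn sketch (simulated data, with a simple univariate filter standing in for the selection methods discussed later) follows:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Five informative predictors buried among 100 mostly irrelevant ones
X, y = make_classification(n_samples=300, n_features=100,
                           n_informative=5, random_state=0)

# Placing the filter INSIDE the pipeline means the selection is redone
# within each resampling fold, avoiding selection bias (i.e., overfitting
# the feature selection process to the full data set)
pipe = Pipeline([("filter", SelectKBest(f_classif, k=10)),
                 ("model", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=10)
print("CV accuracy: %.3f" % scores.mean())
```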