Building predictive models that successfully generalize to new data is a challenging process full of potential pitfalls. During the model building process, cross-validation is routinely used to optimize parameters and, importantly, to provide an independent assessment of trained model performance using the held-out test set partitions. Violation of this training/test set independence occurs when information from the test sets influences the training set models, referred to here as information leakage. Historically, one common place where this has occurred is the feature selection step, where features are initially selected prior to cross-validation using all of the input data. This can result in an overestimation of the cross-validation models' performance and, ultimately, a final model that might not generalize as well when applied to new data. Here, we use a simulated data set to demonstrate how feature selection, when performed in a leaky manner, results in cross-validation performance estimates that are grossly over-optimistic. However, when feature selection is performed properly within the cross-validation folds, the resulting performance estimates are in line with the expected properties of the data set.
A simulated data set, designed with no inherent signal, was generated to exemplify how a leaky feature selection process can adversely affect cross-validation performance estimates and suggest the presence of a discriminatory signal when one may not exist at all. This data set was produced by randomly generating 10,000 feature values for each of 100 samples using a Normal distribution. Binary class labels were then randomly assigned to the samples, creating a data set with no intrinsic discrimination between the two classes. Next, classification models were built and assessed using feature down selection to 5 features (using elastic net regression) and 10-fold cross-validation repeated 100 times, with two procedures that differed only in when the feature selection step was performed. In the first procedure, feature selection was performed within the cross-validation folds prior to model fitting (SVM, linear kernel), while in the second, feature selection was performed upfront using all the training data before cross-validation model fitting was performed. The first approach represents a non-leaky method for feature selection, while the second approach is leaky. The average cross-validation test set performance for the two procedures was then calculated and compared against the expected performance consistent with no class discrimination in the data set (i.e. ROC AUC = 0.5).
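The two procedures can be sketched in a few lines of scikit-learn. This is an illustrative simplification, not the authors' exact code: univariate SelectKBest stands in for the elastic-net down-selection, and a single 10-fold cross-validation run replaces the 100 repeats to keep the example fast.

```python
# Sketch of the leaky vs. non-leaky feature-selection comparison on
# signal-free simulated data. Assumptions (not from the source): SelectKBest
# replaces the elastic-net down-selection; one CV run replaces 100 repeats.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10_000))   # 100 samples, 10,000 random features
y = rng.integers(0, 2, size=100)     # random binary labels: no real signal

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Non-leaky: the selector is re-fit inside every training fold because it
# is part of the pipeline that cross_val_score clones and fits per fold.
pipe = make_pipeline(SelectKBest(f_classif, k=5), SVC(kernel="linear"))
auc_clean = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()

# Leaky: features are selected once using ALL the data, before CV, so the
# test partitions have already influenced which features survive.
X_leaky = SelectKBest(f_classif, k=5).fit_transform(X, y)
auc_leaky = cross_val_score(SVC(kernel="linear"), X_leaky, y,
                            cv=cv, scoring="roc_auc").mean()

print(f"non-leaky AUC: {auc_clean:.2f}")  # typically near 0.5
print(f"leaky AUC:     {auc_leaky:.2f}")  # typically far above 0.5
```

Despite the data containing no signal, the leaky estimate comes out well above chance, mirroring the 0.90 vs. 0.54 contrast reported below.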
For the leaky cross-validation assessment, the resulting average test set ROC AUC is 0.90 [95% CI: 0.89 - 0.90], suggesting a real discriminatory signal is present in the data when in fact none exists. However, when feature selection is properly embedded within the cross-validation folds, independence of the training and testing sets is retained and the resulting average test set ROC AUC is 0.54 [95% CI: 0.53 - 0.55]. This estimate is much closer to the expected result for data containing no signal.
Because feature selection is a critical component of the model building procedure, information leakage occurs if this step is not embedded within the cross-validation folds. Performing feature selection upfront using all the data effectively transfers (i.e. leaks) information from the to-be partitioned testing sets into the selected features. During cross-validation, this gives the test set samples an unfair advantage that can never be realized by a truly independent set of new data, and therefore can result in over-optimistic performance estimates. Information leakage is particularly problematic when the number of candidate features is much larger than the number of samples, as is typical in 'omics data sets, because machine learning models can more easily pick up on noise patterns present in the data. The improper use of feature selection is just one example of information leakage, which can occur in a variety of ways that are often complex and subtle. As a general rule of thumb, any step in the model building process that affects how the final model is determined should be embedded within the cross-validation procedure, with the use of the modeling outcomes (i.e. the class labels or values being predicted) restricted to the training partitions only.
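The rule of thumb above has a direct practical expression in scikit-learn: wrap every model-affecting step in a Pipeline so each step is re-fit on the training folds only. This is a minimal sketch under assumed details; the StandardScaler step is an illustrative extra preprocessing step, not something described in the source.

```python
# Sketch: every step that touches the outcome labels or learns from the
# data (scaling, selection, the classifier itself) lives inside one
# Pipeline, so cross-validation refits all of them per training fold.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 1_000))  # signal-free data, as in the simulation
y = rng.integers(0, 2, size=100)

pipe = Pipeline([
    ("scale", StandardScaler()),             # fit on training folds only
    ("select", SelectKBest(f_classif, k=5)), # selection inside the folds
    ("clf", SVC(kernel="linear")),
])

# cross_val_score clones and refits the whole pipeline for each fold, so
# no information from a fold's test partition reaches any fitted step.
aucs = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC: {aucs.mean():.2f}")  # typically near 0.5 for random labels
```

Fitting the same scaler or selector once on the full data before splitting would reintroduce exactly the leak the simulation demonstrates.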
Simulated data set:
- 100 samples, 10,000 random features, binary class labels

Feature down selection:
- elastic net regression

Classification model:
- SVM, linear kernel

Cross-validation (CV) was used to estimate the model performance under two different scenarios:
1. Feature selection performed within the CV folds
2. Feature selection performed prior to CV using all the data

Averaged test set ROC AUCs were used to estimate model performance.
As expected, performing feature selection prior to CV results in over-optimistic model performance estimates and, in this particular case, suggests the presence of a discriminatory signal when none actually exists.
Figure 1. When feature selection is performed in a leaky manner, the resulting cross-validation performance estimate is over-optimistic and does not accurately reflect the performance one would expect when the model is applied to new data. Here, while no signal was expected, the average test set AUC was 0.90.
Figure 2. When feature selection is properly embedded within the cross-validation folds, a more accurate assessment of the model's generalization performance is obtained. In this case the average test set performance was 0.54.