Technical Note

Discovering The Hidden Signal Among the Noise: An In Silico Example

Ryan Benz, Jeff Jones





Highlights



IN SILICO

250 Subjects
2 Cohorts, 1:1 relationship

5,000 Features
random numerical values
2 features allowed to interact

MODELING

Repeated 10-fold Cross Validation

Feature Selection: Random Forest, T-Test, Elastic Net, Genetic, Non-parametric Exploration; choose 2-5

Classification: SVM

RESULTS

Univariate feature selection by t-test fails to yield a useful classifier, even when the top 5 features by p-value are combined.

Multivariate feature selection exposes a relationship between two features. When these two features are modeled with an SVM, a high-performance classifier is identified.



Top 5 Univariate Performance




Abstract

Biomarkers are often reported as having been identified by fold change or significance testing, usually by taking the top N features ranked by p-value. While some preliminary conclusions can be drawn from these types of analyses, they typically fail in larger data sets where between-sample variability is 15% or greater, such as LC-MS label-free proteomics.

We demonstrate here an approach to identifying combinations of biomarkers that would each fail in a univariate model but can be identified through various exploratory and machine learning methods. This example can be viewed as a method for identifying molecular features that interact or that reflect the effects of a network or pathway.



Methods

An in silico data set resembling a proteomics experiment was generated with 250 individuals, split equally between control and disease, and 5,000 distinct features, each assigned a typical distribution of response values. From that set, two features were selected and their values randomly adjusted such that, when plotted against each other on a Cartesian coordinate graph, there is a clear relationship, yet each feature on its own retains a unimodal distribution and significance testing of either feature alone fails to reject the null hypothesis. Because class assignments were randomized, some of the remaining features in this simulated data set pass significance by random chance, and some of them show up as plausible biomarkers in a univariate analysis. However, the individual predictive value of either of the two simulated in silico features yielded an AUC near 0.5 (AUC not shown).
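The construction described above can be sketched in a few lines. This is a minimal illustration, not the procedure used to generate the data in this note: the exact adjustment applied to the two features is not specified, so the sketch assumes one simple mechanism (the two features are positively correlated in disease and negatively correlated in control, with identical standard normal marginals in both classes). The function names `simulate` and `auc`, the correlation strength `rho`, and the placement of the interacting pair in columns 0 and 1 are all hypothetical.

```python
import random


def simulate(n_subjects=250, n_features=5000, rho=0.9, seed=0):
    """Return (X, y): X[i][j] is feature j of subject i; y[i] is 0 (control) or 1 (disease)."""
    rng = random.Random(seed)
    # Balanced cohorts in random order.
    y = [0] * (n_subjects // 2) + [1] * (n_subjects - n_subjects // 2)
    rng.shuffle(y)
    # Background features: pure noise, independent of class.
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_subjects)]
    # Assumed interaction: features 0 and 1 are correlated +rho in disease and
    # -rho in control. Each marginal stays N(0, 1) in both classes, so neither
    # feature carries a univariate signal.
    scale = (1.0 - rho * rho) ** 0.5
    for i, label in enumerate(y):
        sign = 1.0 if label == 1 else -1.0
        a = rng.gauss(0.0, 1.0)
        b = rng.gauss(0.0, 1.0)
        X[i][0] = a
        X[i][1] = sign * rho * a + scale * b
    return X, y


def auc(scores, labels):
    """Rank-based ROC AUC: fraction of (disease, control) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


X, y = simulate()
for j in (0, 1):
    print(f"feature {j}: univariate AUC = {auc([row[j] for row in X], y):.3f}")
```

Under these assumptions, each interacting feature taken alone should score close to 0.5, while a scatter plot of feature 0 against feature 1 separates the cohorts into two crossing diagonal bands.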


The simulated data set was then fed into a feature discovery and classification pipeline that utilizes several machine learning methods, such as Random Forest, Elastic Net, Genetic Combination and various forms of Non-parametric Exploration, to estimate the set of features needed to provide a classification without overfitting. Both the biomarker down-selection and the classification are done inside a 10-fold cross validation, with a radial-kernel SVM as the final machine learning method for classification. Given the dimensions of this particular example, there are approximately 2.6E16 combinations of any 2 through 5 features, too large a space to search exhaustively. Therefore, two models were explored: one defined by the top 5 features by p-value and the other defined by the 2 features with the greatest interaction potential.
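The size of the search space quoted above can be checked directly by summing the number of ways to choose 2, 3, 4, or 5 features out of 5,000. The short sketch below uses only the Python standard library; the variable names are illustrative.

```python
from math import comb

n_features = 5000

# Number of distinct feature subsets of each size from 2 through 5.
counts = {k: comb(n_features, k) for k in range(2, 6)}
total = sum(counts.values())

for k, c in counts.items():
    print(f"subsets of {k} features: {c:.3e}")
print(f"total for sizes 2 through 5: {total:.3e}")  # about 2.6e16
```

The total is dominated by the 5-feature subsets, which is why exhaustive evaluation is impractical even though the number of feature pairs alone (roughly 12.5 million) is comparatively small.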



Results

A volcano plot was constructed to highlight the differences between the two feature sets explored. At the top, boxed in blue, are the obvious choices: they have the typical greater fold change and pass a given significance threshold. At the bottom, boxed in red, are the two features with the greatest interaction potential, buried deep among other features that are typically deemed not worth exploring in classification model building. The top 5 features identified by p-value, when combined into a multivariate classifier and modeled in an SVM using 10-fold cross validation repeated 10 times, yielded a ROC AUC of 0.617 on average, indicating only marginal classification performance. Most of these features have equivalent or better performance individually, indicating that they contribute little complementary information when combined.
SoCal Bioinformatics Inc.


[Figure legend: feature selection by univariate performance vs. interaction]


The top 2 features identified by the classification pipeline as having the greatest interaction potential were indeed the two features simulated in silico to interact. These features, combined into a multivariate classifier and modeled using the same SVM and 10-fold cross validation repeated 10 times, yielded a ROC AUC of 0.925 on average, substantially better than the 5 features selected by top univariate performance.
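The comparison between the two models can be sketched with scikit-learn. This is an illustration under stated assumptions, not the pipeline used in this note: the data generator `make_data`, the correlation strength `rho`, and the helper `mean_cv_auc` are hypothetical, the interaction is assumed to be a class-dependent correlation between columns 0 and 1, and the univariate ranking uses the ANOVA F-test (`f_classif`), which for two classes ranks features the same way as a two-sample t-test. Because the simulated data differ from the data set described here, the printed values are not expected to reproduce the reported AUCs.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def make_data(n_subjects=250, n_features=5000, rho=0.9, seed=0):
    """Noise matrix with one assumed interacting pair in columns 0 and 1 (n_subjects must be even)."""
    rng = np.random.default_rng(seed)
    y = rng.permutation(np.repeat([0, 1], n_subjects // 2))
    X = rng.standard_normal((n_subjects, n_features))
    # Correlation is +rho in disease (1) and -rho in control (0); marginals stay N(0, 1).
    sign = np.where(y == 1, 1.0, -1.0)
    X[:, 1] = sign * rho * X[:, 0] + np.sqrt(1 - rho**2) * rng.standard_normal(n_subjects)
    return X, y


def mean_cv_auc(model, X, y, n_repeats=10, seed=0):
    """Mean ROC AUC over 10-fold stratified cross validation repeated n_repeats times."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=n_repeats, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()


X, y = make_data()

# Model 1: top 5 features by univariate test. Selection sits inside the pipeline,
# so it is refit on the training portion of every fold.
top5 = make_pipeline(SelectKBest(f_classif, k=5), SVC(kernel="rbf"))
print("top 5 univariate features:", round(mean_cv_auc(top5, X, y), 3))

# Model 2: the two interacting features, same radial-kernel SVM and CV scheme.
pair = SVC(kernel="rbf")
print("two interacting features: ", round(mean_cv_auc(pair, X[:, [0, 1]], y), 3))
```

Placing `SelectKBest` inside the pipeline is the design choice that matters here: selecting features on the full data set before cross validation would leak class information into the held-out folds and inflate the estimated AUC.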