Biomarkers are often reported as having been identified by fold change or significance testing, typically by taking the top N features ranked by a calculated p-value. While some preliminary conclusions can be drawn from such analyses, they tend to fail in larger data sets where between-sample variability is 15% or greater, as is common in label-free LC-MS proteomics.
We demonstrate here an approach to identifying combinations of biomarkers that would individually fail in a univariate model but can be recovered through a combination of exploratory and machine learning methods. This example can be viewed as a method for identifying molecular features that interact, or that reflect the effects of a network or pathway.
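As a minimal sketch of the kind of interaction described above (a hypothetical XOR-style simulation, not necessarily the one used in this study), two features can share the same marginal distribution in both cohorts while their product determines cohort membership:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 200  # 100 samples per cohort, 1:1 relationship

# Two features whose SIGN PRODUCT defines the cohort (XOR-style
# interaction): each feature alone has the same marginal distribution
# in both cohorts, so a univariate test sees nothing.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = (x1 * x2 > 0).astype(int)

# Univariate t-tests on each feature individually
p1 = stats.ttest_ind(x1[y == 0], x1[y == 1]).pvalue
p2 = stats.ttest_ind(x2[y == 0], x2[y == 1]).pvalue
print(p1, p2)  # typically non-significant
```

Only the joint distribution of the two features carries the class signal, which is exactly the scenario a top-N-by-p-value screen misses.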
The simulated data set was then fed into a feature discovery and classification pipeline that uses several machine learning methods (Random Forest, Elastic Net, Genetic Combination, and various forms of non-parametric exploration) to estimate the set of features needed for classification without overfitting. Both the biomarker down-selection and the classification are performed inside a 10-fold cross-validation, with a radial-kernel SVM as the final classification method. Given the dimensions of this particular example, there are approximately 2.6E16 combinations of any 2 through 5 features, far too large a space to search exhaustively. Therefore, two models were explored: one defined by the top 5 features by p-value, the other by the 2 features with the greatest interaction potential.
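The key structural point, keeping feature down-selection inside the cross-validation loop, can be sketched with scikit-learn (only a Random Forest selector is shown here; the data, selector, and parameters are illustrative, not the study's actual configuration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # features 0 and 1 interact

aucs = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train, test in cv.split(X, y):
    # Feature selection happens INSIDE the fold to avoid leakage
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X[train], y[train])
    top = np.argsort(rf.feature_importances_)[-5:]  # keep top 5

    # Radial-kernel SVM as the final classifier on the selected features
    clf = SVC(kernel="rbf").fit(X[train][:, top], y[train])
    aucs.append(roc_auc_score(y[test], clf.decision_function(X[test][:, top])))
print(np.mean(aucs))
```

Selecting features on the full data set before splitting would leak test information into the selection step and inflate the apparent performance.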
The top 2 features identified by the classification pipeline as having the greatest interaction potential were indeed the two simulated in silico. Combined into a multivariate classifier and modeled with the same SVM under 10x10-fold cross-validation, they yielded an average ROC AUC of 0.925, significantly better than the model built from the top 5 features by univariate performance.
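The repeated cross-validation estimate can be reproduced in spirit with scikit-learn's RepeatedStratifiedKFold (synthetic data here; the numbers are illustrative, not the study's reported values):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # two interacting features

# 10x10-fold: 10 repeats of 10-fold CV -> 100 AUC estimates
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=1)
aucs = cross_val_score(SVC(kernel="rbf"), X, y, cv=cv, scoring="roc_auc")
print(round(float(aucs.mean()), 3))
```

Averaging over repeats reduces the variance introduced by any single random fold assignment, which is why the pipeline reports a mean AUC rather than a single-split figure.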
2 Cohorts, 1:1 relationship
random numerical values
2 features allowed to interact
Repeated 10-fold Cross Validation
Feature Selection: Random Forest, T-Test, Elastic Net, Genetic, Non-parametric Exploration; choose 2-5
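A consensus ranking across several selectors can be sketched as follows (only three of the listed methods are shown, with a logistic regression under an elastic-net penalty standing in for Elastic Net; all data and parameters are illustrative):

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n, p = 100, 30
X = rng.normal(size=(n, p))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Score features with each selector
tstat = np.abs(stats.ttest_ind(X[y == 0], X[y == 1]).statistic)
rf = RandomForestClassifier(n_estimators=300, random_state=7).fit(X, y)
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, max_iter=5000).fit(X, y)

# Rank per selector (rank 1 = best), then average into a consensus
ranks = np.vstack([
    stats.rankdata(-tstat),
    stats.rankdata(-rf.feature_importances_),
    stats.rankdata(-np.abs(enet.coef_[0])),
])
consensus = ranks.mean(axis=0)
top5 = np.argsort(consensus)[:5]  # choose 2-5 features from this ranking
print(sorted(top5.tolist()))
```

Averaging ranks rather than raw scores keeps selectors with very different score scales on an equal footing.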
Univariate results by t-test fail to expose a usable classifier, even when the top 5 features by p-value significance are combined.
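This failure mode can be demonstrated on synthetic interaction-only data: selecting the top 5 features by t-test inside each fold yields near-chance performance (the data and parameters are hypothetical):

```python
import numpy as np
from scipy import stats
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n, p = 200, 100
X = rng.normal(size=(n, p))
y = (X[:, 0] * X[:, 1] > 0).astype(int)  # only the interaction carries signal

aucs = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)
for train, test in cv.split(X, y):
    # Top 5 by univariate t-test p-value, computed on the training fold only
    pvals = stats.ttest_ind(X[train][y[train] == 0],
                            X[train][y[train] == 1]).pvalue
    top5 = np.argsort(pvals)[:5]
    clf = SVC(kernel="rbf").fit(X[train][:, top5], y[train])
    aucs.append(roc_auc_score(y[test], clf.decision_function(X[test][:, top5])))
print(np.mean(aucs))  # near chance: the interacting pair is rarely selected
```

Because neither interacting feature separates the cohorts on its own, the t-test ranks them no better than noise, so the selected panel carries no real signal.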
Multivariate feature selection exposes a relationship between two features; applied to an SVM model, this yields a high-performance classifier.
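At toy scale, the interacting pair can even be found by brute-force pairwise search with the same radial-kernel SVM (a sketch only; the real space of roughly 2.6E16 combinations obviously cannot be enumerated this way, which is why the pipeline relies on the selection methods above):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(5)
n, p = 150, 12  # small p keeps the search cheap: only 66 pairs
X = rng.normal(size=(n, p))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

# Score every feature pair by cross-validated AUC of an rbf SVM
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
best_pair, best_auc = None, -1.0
for i, j in combinations(range(p), 2):
    auc = cross_val_score(SVC(kernel="rbf"), X[:, [i, j]], y,
                          cv=cv, scoring="roc_auc").mean()
    if auc > best_auc:
        best_pair, best_auc = (i, j), auc
print(best_pair, round(float(best_auc), 3))
```

The pair with the strongest cross-validated interaction stands out sharply against the noise pairs, mirroring how the pipeline's two interaction features dominated the final model.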