In biomarker discovery studies, uncertainty associated with case and control labels is often overlooked. By omitting to take into account label uncertainty, model parameters and the predictive risk can become biased, sometimes severely. The most common situation is when the control set contains an unknown number of undiagnosed, or future, cases. This has a marked impact in situations where the model needs to be well-calibrated, e.g., when the prediction performance of a biomarker panel is evaluated. Failing to account for class label uncertainty may lead to underestimation of classification performance and bias in parameter estimates. This can further impact on meta-analysis for combining evidence from multiple studies. Using a simulation study, we outline how conventional statistical models can be modified to address class label uncertainty leading to well-calibrated prediction performance estimates and reduced bias in meta-analysis. We focus on the problem of mislabeled control subjects in case-control studies, i.e., when some of the control subjects are undiagnosed cases, although the procedures we report are generic. The uncertainty in control status is a particular situation common in biomarker discovery studies in the context of genomic and molecular epidemiology, where control subjects are commonly sampled from the general population with an established expected disease incidence rate.

Original publication




Journal article


J Proteome Res

Publication Date





5562 - 5567


Algorithms, Bias (Epidemiology), Biomarkers, Case-Control Studies, Computer Simulation, Humans, Logistic Models, Meta-Analysis as Topic, ROC Curve, Reproducibility of Results, Risk Factors, Uncertainty