Disease-specific variant pathogenicity prediction significantly improves variant interpretation in inherited cardiac conditions
Zhang X., Walsh R., Whiffin N., Buchan R., Midwinter W., Wilk A., Govind R., Li N., Ahmad M., Mazzarotto F., Roberts A., Theotokis P., Mazaika E., Allouba M., de Marvao A., Pua CJ., Day S., Ashley E., Colan S., Michels M., Pereira A., Jacoby D., Ho C., Olivotto I., Gunnarsson G., Jefferies J., Semsarian C., Ingles J., O’Regan D., Aguib Y., Yacoub M., Cook S., Barton PJR., Bottolo L., Ware J.
Background Accurate discrimination of benign and pathogenic rare variation remains a priority for clinical genome interpretation. State-of-the-art machine learning tools are useful for genome-wide variant prioritisation but remain imprecise. Since the relationship between molecular consequence and likelihood of pathogenicity varies between genes with distinct molecular mechanisms, we hypothesised that a disease-specific classifier may outperform existing genome-wide tools.
Methods We present CardioBoost, a novel disease-specific variant classification tool that is trained with variants of known clinical effect and estimates the probability of pathogenicity for rare missense variants in inherited cardiomyopathies and arrhythmias. To benchmark against state-of-the-art genome-wide pathogenicity classification tools, we assessed classification of hold-out test variants using both overall performance metrics and metrics of high-confidence (>90%) classifications relevant to variant interpretation. We further evaluated the prioritisation of variants associated with disease and patient clinical outcomes, providing validations that are robust to potential mis-classification in gold-standard reference datasets.
Results On overall classification performance measures, CardioBoost discriminates between pathogenic and benign variants better than published genome-wide variant classification tools, achieving the highest area under the Precision-Recall Curve: 91% for cardiomyopathies and 96% for inherited arrhythmias. When assessed at high-confidence (>90%) classification thresholds, prediction accuracy is improved by at least 120% over existing tools for both cardiomyopathies and arrhythmias, with significantly improved sensitivity and specificity. Finally, CardioBoost improves prioritisation of variants significantly associated with disease, and stratifies survival of patients with cardiomyopathies, confirming biologically relevant variant classification.
Conclusions We demonstrate that a disease-specific variant pathogenicity prediction tool outperforms state-of-the-art genome-wide tools for the classification of rare missense variants of uncertain significance for inherited cardiac conditions. To facilitate evaluation of CardioBoost, we provide pre-computed pathogenicity scores for all possible rare missense variants in genes associated with cardiomyopathies and arrhythmias (https://www.cardiodb.org/cardioboost/). Our results also highlight the need to develop and evaluate variant classification tools focused on specific diseases and clinical application contexts. Our proposed model for assessing variants in known disease genes, and the use of application-specific evaluations, are broadly applicable to improve variant interpretation across a wide range of Mendelian diseases.
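The high-confidence evaluation mentioned in the Methods can be illustrated with a minimal sketch. It is not the CardioBoost implementation. It assumes, as a reading of the ">90%" threshold, that a variant is called pathogenic when its predicted probability is at least 0.9, benign when it is at most 0.1, and indeterminate otherwise, and that accuracy is counted over all variants so that indeterminate calls are not rewarded. The names `summarise_high_confidence` and `HighConfidenceSummary` are hypothetical, as are the example probabilities.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HighConfidenceSummary:
    accuracy: float       # correct high-confidence calls / all variants
    sensitivity: float    # pathogenic variants called pathogenic / all pathogenic
    specificity: float    # benign variants called benign / all benign
    indeterminate: float  # variants without a high-confidence call / all variants


def summarise_high_confidence(
    probs: Sequence[float],
    is_pathogenic: Sequence[bool],
    threshold: float = 0.9,  # assumed cut-off, taken from the ">90%" in the text
) -> HighConfidenceSummary:
    """Score predictions using only high-confidence calls (illustrative only)."""
    if len(probs) != len(is_pathogenic):
        raise ValueError("probs and is_pathogenic must have the same length")
    if not probs:
        raise ValueError("at least one variant is required")

    true_pos = true_neg = called = 0
    for p, truth in zip(probs, is_pathogenic):
        if p >= threshold:            # high-confidence pathogenic call
            called += 1
            if truth:
                true_pos += 1
        elif p <= 1.0 - threshold:    # high-confidence benign call
            called += 1
            if not truth:
                true_neg += 1
        # anything in between is left indeterminate

    n = len(probs)
    n_path = sum(1 for t in is_pathogenic if t)
    n_benign = n - n_path
    nan = float("nan")
    return HighConfidenceSummary(
        accuracy=(true_pos + true_neg) / n,
        sensitivity=true_pos / n_path if n_path else nan,
        specificity=true_neg / n_benign if n_benign else nan,
        indeterminate=(n - called) / n,
    )


# Hypothetical hold-out set: three pathogenic and three benign variants.
example = summarise_high_confidence(
    probs=[0.97, 0.92, 0.55, 0.08, 0.03, 0.95],
    is_pathogenic=[True, True, True, False, False, False],
)
```

Counting accuracy over all variants, rather than only over those that received a call, means a tool cannot improve its score simply by declining to classify difficult variants. Under the same assumptions, any genome-wide tool that outputs a probability-like score could be passed through the same function for comparison.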