Application of machine-learning algorithms to identify the key determinants of risk for HIV, hepatitis C and hepatitis B in primary care settings.

Manley H., Leber W., Smith K., Farooq HZ., Pareek M., Baggaley RF., Anderson J., Loman L., Griffiths C., Robson J., Panovska-Griffiths J.

Background

Testing for blood-borne viruses (BBVs) such as human immunodeficiency virus (HIV), hepatitis C virus (HCV) and hepatitis B virus (HBV) is generally focused on specialist settings. However, people with undiagnosed infections are also present within the general population. We explore whether machine-learning algorithms (MLAs) can identify people at heightened risk of HIV, HBV, HCV, or a composite 'any BBV' outcome (defined as positivity for one or more of the three infections) in primary care settings.

Methods

From de-identified electronic health record data from 165 general practices in North East London, we extracted risk factors for HIV, HCV and HBV and used them to train (75% of the data) and test (25% of the data) three MLAs: Logistic Regression (LR), AdaBoost with random undersampling (RUSBoost) and a Balanced Random Forest classifier (BRFC). Model performance was quantified using ROC curves, ROC AUC, sensitivity and specificity. The key features for each outcome were identified across the models.

Results

A total of 1,987,954 patients were included in the study, with no inclusion or exclusion criteria. From these patients, 75 predictive features were selected for HIV, 24 for HCV, 37 for HBV and 88 for the any-BBV outcome. The optimal model for classifying positivity for each individual BBV depended on the accuracy metric. As a single infection, HCV was predicted most accurately across models and accuracy metrics. For the any-BBV outcome, LR had the highest AUC, BRFC was the most sensitive model and RUSBoost was the most specific model. The key features identified were similar across models, with age the strongest predictor for both individual positivity and the composite outcome. Several features were important for two of the BBV-positive groups: Black African ethnicity (HIV and HBV), liver disease (HBV and HCV) and opiate and cocaine use (HBV and HCV). Other features were important for positivity for a single BBV.

Conclusion

Our findings illustrate that combining digital technology with routinely available general practice data has promise for improving case-finding through targeted BBV testing. There are, however, challenges in identifying the optimal MLAs and accuracy metrics when targeting HIV, HCV and HBV positivity together. This underscores the importance of evaluating different models and applying a broad set of accuracy criteria when utilising digital technology for precision medicine.

Clinical trial number

Not applicable.
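The evaluation step described in the Methods (a 75%/25% train/test split, then ROC AUC, sensitivity and specificity) can be illustrated with a minimal, standard-library-only Python sketch. It does not reproduce the study's models or data: the labels and risk scores are made-up toy values, the function names are hypothetical, and both the 0.5 decision threshold and the stratification by outcome are assumptions not stated in the abstract.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), keeping the class ratio in both parts.

    Stratifying by outcome is an assumption made here because positives are rare.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity (true-positive rate) and specificity (true-negative rate)."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, y_score):
    """ROC AUC: probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = BBV positive, 0 = negative; scores are model risk outputs.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.2, 0.1, 0.1, 0.05]

train_idx, test_idx = stratified_split(y_true)            # 75% train, 25% test
sens, spec = sensitivity_specificity(y_true, y_score)     # at the assumed 0.5 threshold
auc = roc_auc(y_true, y_score)                            # threshold-free ranking quality
print(f"train={len(train_idx)} test={len(test_idx)} "
      f"sensitivity={sens:.3f} specificity={spec:.3f} auc={auc:.3f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas ROC AUC summarises ranking quality over all thresholds. This is one reason a single model can lead on one metric but not on another, as reported for LR, BRFC and RUSBoost in the Results.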

More information Original publication

DOI

10.1186/s12879-026-13247-0

Type

Journal article

Publication Date

2026-05-01

Addresses

UK Health Security Agency, London, UK.
