Search results
Found 8755 matches.
Observational and genetic associations between cardiorespiratory fitness and cancer: a UK Biobank and international consortia study.
Background: The association of fitness with cancer risk is not clear.
Methods: We used Cox proportional hazards models to estimate hazard ratios (HRs) and 95% confidence intervals (CIs) for risk of lung, colorectal, endometrial, breast, and prostate cancer in a subset of UK Biobank participants who completed a submaximal fitness test in 2009-12 (N = 72,572). We also investigated relationships using two-sample Mendelian randomisation (MR); odds ratios (ORs) were estimated using the inverse-variance weighted method.
Results: After a median of 11 years of follow-up, 4290 cancers of interest were diagnosed. A 3.5 ml O2⋅min-1⋅kg-1 total-body mass increase in fitness (equivalent to 1 metabolic equivalent of task (MET), approximately 0.5 standard deviations (SD)) was associated with lower risks of endometrial (HR = 0.81, 95% CI: 0.73-0.89), colorectal (0.94, 0.90-0.99), and breast cancer (0.96, 0.92-0.99). In MR analyses, a 0.5 SD increase in genetically predicted fitness (O2⋅min-1⋅kg-1 fat-free mass) was associated with a lower risk of breast cancer (OR = 0.92, 95% CI: 0.86-0.98). After adjusting for adiposity, both the observational and genetic associations were attenuated.
Discussion: Higher fitness levels may reduce risks of endometrial, colorectal, and breast cancer, though relationships with adiposity are complex, and adiposity may partly mediate these associations. Increasing fitness, including via changes in body composition, may be an effective strategy for cancer prevention.
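The inverse-variance weighted (IVW) method used in the MR analysis above combines per-variant Wald ratio estimates, weighting each by its (first-order) precision. A minimal sketch in Python; the summary statistics in the usage example are made-up illustrative values, not figures from the study:

```python
import numpy as np

def ivw_mr(beta_exp, beta_out, se_out):
    """Inverse-variance weighted two-sample MR estimate.

    beta_exp: per-variant SNP-exposure effects
    beta_out: per-variant SNP-outcome effects
    se_out:   standard errors of the SNP-outcome effects
    """
    ratio = beta_out / beta_exp            # Wald ratio per variant
    w = (beta_exp / se_out) ** 2           # 1 / se(ratio)^2, first-order approximation
    estimate = np.sum(w * ratio) / np.sum(w)
    se = 1.0 / np.sqrt(np.sum(w))
    return estimate, se

# Toy example with two variants (hypothetical summary statistics)
est, se = ivw_mr(np.array([0.1, 0.2]), np.array([-0.02, -0.05]), np.array([0.01, 0.02]))
```

Exponentiating the estimate (and estimate ± 1.96 × se) gives an OR with its 95% CI, as reported in the abstract.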
Machine learning approaches to the identification of children affected by prenatal alcohol exposure: A narrative review
Fetal alcohol spectrum disorders (FASDs) affect at least 0.8% of the population globally. The diagnosis of FASD is uniquely complex, with a heterogeneous physical and neurobehavioral presentation that requires multidisciplinary expertise for diagnosis. Many researchers have begun to incorporate machine learning approaches into FASD research to identify children who are affected by prenatal alcohol exposure, including those with FASD. This narrative review highlights these efforts. Following an introduction to machine learning, we summarize examples from the literature of neurobehavioral screening tools and physiologic markers of exposure. We discuss individual efforts, including models that classify FASD based on parent-reported neurocognitive or behavioral questionnaires, 3D facial imaging, brain imaging, DNA methylation patterns, microRNA profiles, cardiac orienting response, and dysmorphic facial features. We highlight model performance and discuss the limitations of these approaches. We conclude by considering the scalability of these approaches and how these machine learning models, largely developed from clinical samples or highly exposed birth cohorts, may perform in the general population.
Lightweight transformers for clinical natural language processing
Specialised pre-trained language models are becoming more frequent in Natural Language Processing (NLP) since they can potentially outperform models trained on generic texts. BioBERT (Lee et al., 2020) and BioClinicalBERT (Alsentzer et al., Publicly available clinical BERT embeddings, Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72-78, 2019) are two examples of such models that have shown promise in medical NLP tasks. Many of these models are overparametrised and resource-intensive, but thanks to techniques like knowledge distillation (Sanh et al., DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108, 2019), it is possible to create smaller versions that perform almost as well as their larger counterparts. In this work, we specifically focus on the development of compact language models for processing clinical texts (e.g. progress notes, discharge summaries). We developed a number of efficient lightweight clinical transformers using knowledge distillation and continual learning, with the number of parameters ranging from million to million. These models performed comparably to larger models such as BioBERT and BioClinicalBERT and significantly outperformed other compact models trained on general or biomedical data. Our extensive evaluation was done across several standard datasets and covered a wide range of clinical text-mining tasks, including natural language inference, relation extraction, named entity recognition and sequence classification. To our knowledge, this is the first comprehensive study specifically focused on creating efficient and compact transformers for clinical NLP tasks. The models and code used in this study can be found on our Hugging Face profile at https://huggingface.co/nlpie and GitHub page at https://github.com/nlpie-research/Lightweight-Clinical-Transformers, respectively, promoting reproducibility of our results.
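Knowledge distillation, the central compression technique here, trains a small student to match a large teacher's temperature-softened output distribution. A minimal numpy sketch of the standard Hinton-style distillation loss (the logits in the example are hypothetical, and real training would combine this with a hard-label loss):

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax with temperature T."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)  # soft teacher targets
    q = softmax(student_logits, T)  # student predictions
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1).mean()
    return float(kl * T * T)
```

When the student exactly matches the teacher, the loss is zero; any mismatch in the softened distributions makes it strictly positive.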
MRI economics: Balancing sample size and scan duration in brain wide association studies.
A pervasive dilemma in neuroimaging is whether to prioritize sample size or scan duration given fixed resources. Here, we systematically investigate this trade-off in the context of brain-wide association studies (BWAS) using resting-state functional magnetic resonance imaging (fMRI). We find that total scan duration (sample size × scan duration per participant) robustly explains individual-level phenotypic prediction accuracy via a logarithmic model, suggesting that sample size and scan duration are broadly interchangeable. The returns of scan duration eventually diminish relative to sample size, which we explain with principled theoretical derivations. When accounting for fixed costs associated with each participant (e.g., recruitment, non-imaging measures), we find that prediction accuracy in small-scale BWAS might benefit from much longer scan durations (>50 min) than typically assumed. Most existing large-scale studies might also have benefited from smaller sample sizes with longer scan durations. Both logarithmic and theoretical models of the relationships among sample size, scan duration and prediction accuracy explain well-predicted phenotypes better than poorly-predicted phenotypes. The logarithmic and theoretical models are also undermined by individual differences in brain states. These results replicate across phenotypic domains (e.g., cognition and mental health) from two large-scale datasets with different algorithms and metrics. Overall, our study emphasizes the importance of scan time, which is ignored in standard power calculations. Standard power calculations inevitably maximize sample size at the expense of scan duration. The resulting prediction accuracies are likely lower than would be produced with alternate designs, thus impeding scientific discovery. Our empirically informed reference is available for future study design: WEB_APPLICATION_LINK.
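The study's central quantity, total scan duration = sample size × scan duration per participant, enters a logarithmic model of prediction accuracy. A sketch of fitting such a model by least squares; the numbers in the example are synthetic and purely illustrative:

```python
import numpy as np

def fit_log_model(n_participants, scan_minutes, accuracy):
    """Fit accuracy ~ a + b * log(total scan duration), where total scan
    duration is sample size times scan minutes per participant."""
    total = np.asarray(n_participants, float) * np.asarray(scan_minutes, float)
    x = np.log(total)
    b, a = np.polyfit(x, np.asarray(accuracy, float), 1)  # slope, intercept
    return a, b

# Synthetic designs trading off sample size against scan duration
n = np.array([100, 200, 400, 800])
t = np.array([10, 10, 20, 20])
acc = 0.1 + 0.05 * np.log(n * t)  # data generated exactly from the model
a, b = fit_log_model(n, t, acc)
```

Under such a model, doubling per-participant scan time and doubling sample size raise accuracy by the same amount, which is the interchangeability the abstract describes (before per-participant fixed costs are added).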
Martingale posterior distributions
The prior distribution is the usual starting point for Bayesian uncertainty. In this paper, we present a different perspective that focuses on missing observations as the source of statistical uncertainty, with the parameter of interest being known precisely given the entire population. We argue that the foundation of Bayesian inference is to assign a distribution on missing observations conditional on what has been observed. In the i.i.d. setting with an observed sample of size n, the Bayesian would thus assign a predictive distribution on the missing Yn+1:∞ conditional on Y1:n, which then induces a distribution on the parameter. We utilize Doob’s theorem, which relies on martingales, to show that choosing the Bayesian predictive distribution returns the conventional posterior as the distribution of the parameter. Taking this as our cue, we relax the predictive machine, avoiding the need for the predictive to be derived solely from the usual prior to posterior to predictive density formula. We introduce the martingale posterior distribution, which returns Bayesian uncertainty on any statistic via the direct specification of the joint predictive. To that end, we introduce new predictive methodologies for multivariate density estimation, regression and classification that build upon recent work on bivariate copulas.
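The martingale posterior idea, with uncertainty coming from imputing the missing Y(n+1:∞) from a predictive, can be sketched by predictive resampling: repeatedly extend the observed sample with draws from a predictive rule and recompute the statistic. Below is a toy Pólya-urn (Dirichlet-process-style) predictive with a standard normal base measure; all settings are illustrative assumptions, not the paper's copula-based constructions:

```python
import numpy as np

def martingale_posterior_mean(y, n_future=500, n_draws=200, alpha=1.0, seed=None):
    """Approximate a martingale posterior for the mean by predictive resampling."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        pool = list(y)
        for _ in range(n_future):
            n = len(pool)
            # Polya-urn predictive: with prob alpha/(alpha+n) draw a new value
            # from the N(0,1) base measure, otherwise resample a past value.
            if rng.random() < alpha / (alpha + n):
                pool.append(rng.normal())
            else:
                pool.append(pool[rng.integers(n)])
        draws.append(np.mean(pool))  # statistic of the completed "population"
    return np.array(draws)
```

The spread of the returned draws quantifies uncertainty about the mean without ever specifying a prior or likelihood, which is the paper's point: the predictive rule alone induces the posterior.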
Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers
Recent work has reported that respiratory audio-trained AI classifiers can accurately predict SARS-CoV-2 infection status. However, it has not yet been determined whether such model performance is driven by latent audio biomarkers with true causal links to SARS-CoV-2 infection or by confounding effects, such as recruitment bias, present in observational studies. Here we undertake a large-scale study of audio-based AI classifiers as part of the UK government’s pandemic response. We collect a dataset of audio recordings from 67,842 individuals, with linked metadata, of whom 23,514 had positive polymerase chain reaction tests for SARS-CoV-2. In an unadjusted analysis, similar to that in previous works, AI classifiers predict SARS-CoV-2 infection status with high accuracy (ROC–AUC = 0.846 [0.838–0.854]). However, after matching on measured confounders, such as self-reported symptoms, performance is much weaker (ROC–AUC = 0.619 [0.594–0.644]). Upon quantifying the utility of audio-based classifiers in practical settings, we find them to be outperformed by predictions on the basis of user-reported symptoms. We make best-practice recommendations for handling recruitment bias, and for assessing audio-based classifiers by their utility in relevant practical settings. Our work provides insights into the value of AI audio analysis and the importance of study design and treatment of confounders in AI-enabled diagnostics.
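The headline metric in this study, ROC-AUC, equals the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A minimal rank-based implementation (the labels and scores in the example are toy values):

```python
import numpy as np

def roc_auc(labels, scores):
    """ROC-AUC via the Mann-Whitney U formulation: the fraction of
    (positive, negative) pairs where the positive scores higher (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]
    wins = (diffs > 0).sum() + 0.5 * (diffs == 0).sum()
    return float(wins / (len(pos) * len(neg)))

# Toy example: two positives, two negatives
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs correctly ordered
```

An AUC of 0.5 is chance level, which makes the drop from 0.846 to 0.619 after confounder matching the study's key result.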
VertXNet: an ensemble method for vertebral body segmentation and identification from cervical and lumbar spinal X-rays.
Accurate annotation of vertebral bodies is crucial for automating the analysis of spinal X-ray images. However, manual annotation of these structures is a laborious and costly process due to their complex nature, including small sizes and varying shapes. To address this challenge and expedite the annotation process, we propose an ensemble pipeline called VertXNet. This pipeline currently combines two segmentation mechanisms, semantic segmentation with U-Net and instance segmentation with Mask R-CNN, to automatically segment and label vertebral bodies in lateral cervical and lumbar spinal X-ray images. VertXNet enhances its effectiveness by adopting a rule-based strategy (termed the ensemble rule) for effectively combining segmentation outcomes from U-Net and Mask R-CNN. It determines vertebral body labels by recognizing specific reference vertebral instances, such as cervical vertebra 2 ('C2') in cervical spine X-rays and sacral vertebra 1 ('S1') in lumbar spine X-rays. These reference vertebrae are usually relatively easy to identify at the edges of the spine. To assess the performance of our proposed pipeline, we conducted evaluations on three spinal X-ray datasets, including two in-house datasets and one publicly available dataset. The ground truth annotations were provided by radiologists for comparison. Our experimental results have shown that the proposed pipeline outperformed two state-of-the-art (SOTA) segmentation models on our test dataset with a mean Dice of 0.90, vs. a mean Dice of 0.73 for Mask R-CNN and 0.72 for U-Net. We also demonstrated that VertXNet is a modular pipeline that enables substituting in other SOTA models, such as nnU-Net, to further improve its performance. Furthermore, to evaluate the generalization ability of VertXNet on spinal X-rays, we directly tested the pre-trained pipeline on two additional datasets. A consistently strong performance was observed, with mean Dice coefficients of 0.89 and 0.88, respectively.
In summary, VertXNet demonstrated significantly improved performance in vertebral body segmentation and labeling for spinal X-ray imaging. Its robustness and generalization were presented through the evaluation of both in-house clinical trial data and publicly available datasets.
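The mean Dice coefficients reported above measure the overlap between predicted and ground-truth segmentation masks. A minimal sketch (the binary masks in the example are toy arrays, not real X-ray annotations):

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for binary segmentation masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

# Toy 1D "masks": identical masks score 1.0, half-overlapping masks score 0.5
a = np.array([1, 1, 0, 0])
b = np.array([0, 1, 1, 0])
```

A per-vertebra Dice averaged over all vertebrae and images yields the mean Dice figures (0.90, 0.89, 0.88) quoted in the abstract.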
Association between health insurance cost-sharing and choice of hospital tier for cardiovascular diseases in China: a prospective cohort study.
BACKGROUND: Hospitals in China are classified into tiers (1, 2 or 3), with the largest (tier 3) having more equipment and specialist staff. Differential health insurance cost-sharing by hospital tier (lower deductibles and higher reimbursement rates in lower tiers) was introduced to reduce overcrowding in higher tier hospitals, promote use of lower tier hospitals, and limit escalating healthcare costs. However, little is known about the effects of differential cost-sharing in health insurance schemes on choice of hospital tiers. METHODS: In a 9-year follow-up of a prospective study of 0.5 M adults from 10 areas in China, we examined the associations between differential health insurance cost-sharing and choice of hospital tiers for patients with a first hospitalisation for stroke or ischaemic heart disease (IHD) in 2009-2017. Analyses were performed separately in urban areas (stroke: n = 20,302; IHD: n = 19,283) and rural areas (stroke: n = 21,130; IHD: n = 17,890), using conditional logit models and adjusting for individual socioeconomic and health characteristics. FINDINGS: About 64-68% of stroke and IHD cases in urban areas and 27-29% in rural areas chose tier 3 hospitals. In urban areas, higher reimbursement rates in each tier and lower tier 3 deductibles were associated with a greater likelihood of choosing their respective hospital tiers. In rural areas, the effects of cost-sharing were modest, suggesting a greater contribution of other factors. Higher socioeconomic status and greater disease severity were associated with a greater likelihood of seeking care in higher tier hospitals in urban and rural areas. INTERPRETATION: Patient choice of hospital tiers for treatment of stroke and IHD in China was influenced by differential cost-sharing in urban areas, but not in rural areas. Further strategies are required to incentivise appropriate health seeking behaviour and promote more efficient hospital use. 
FUNDING: Wellcome Trust, Medical Research Council, British Heart Foundation, Cancer Research UK, Kadoorie Charitable Foundation, China Ministry of Science and Technology, and National Natural Science Foundation of China.
Causal association between snoring and stroke: a Mendelian randomization study in a Chinese population.
Background: Previous observational studies established a positive relationship between snoring and stroke. We aimed to investigate the causal effect of snoring on stroke.
Methods: Based on 82,339 unrelated individuals of Asian descent with qualified genotyping data from the China Kadoorie Biobank (CKB), we conducted a Mendelian randomization (MR) analysis of snoring and stroke. Genetic variants identified in genome-wide association analyses (GWAS) of snoring in CKB and UK Biobank (UKB) were selected for constructing genetic risk scores (GRS). A two-stage method was applied to estimate the associations of genetically predicted snoring with stroke and its subtypes. MR analyses in the non-obese group (body mass index, BMI < 24.0 kg/m2), as well as multivariable MR (MVMR), were also performed to control for potential pleiotropy from BMI. In addition, the inverse-variance weighted (IVW) method was applied to estimate the causal association with genetic variants identified in the CKB GWAS.
Findings: Positive associations were found between snoring and total stroke, hemorrhagic stroke (HS), and ischemic stroke (IS). With the GRS of CKB, the corresponding HRs (95% CIs) were 1.56 (1.15, 2.12), 1.50 (0.84, 2.69), and 2.02 (1.36, 3.01); the corresponding HRs (95% CIs) using the GRS of UKB were 1.78 (1.30, 2.43), 1.94 (1.07, 3.52), and 1.74 (1.16, 2.61). The associations remained stable in the MR analysis in the non-obese group, the MVMR analysis, and the MR analysis using the IVW method.
Interpretation: This study suggests that, among Chinese adults, genetically predicted snoring could increase the risk of total stroke, IS, and HS, and that the causal effect is independent of BMI.
Funding: National Natural Science Foundation of China, Kadoorie Charitable Foundation Hong Kong, UK Wellcome Trust, National Key R&D Program of China, Chinese Ministry of Science and Technology.
Prevalence of persistent SARS-CoV-2 in a large community surveillance study.
Persistent SARS-CoV-2 infections may act as viral reservoirs that could seed future outbreaks [1-5], give rise to highly divergent lineages [6-8] and contribute to cases with post-acute COVID-19 sequelae (long COVID) [9,10]. However, the population prevalence of persistent infections, their viral load kinetics and evolutionary dynamics over the course of infections remain largely unknown. Here, using viral sequence data collected as part of a national infection survey, we identified 381 individuals with SARS-CoV-2 RNA at high titre persisting for at least 30 days, of which 54 had viral RNA persisting at least 60 days. We refer to these as 'persistent infections' as available evidence suggests that they represent ongoing viral replication, although the persistence of non-replicating RNA cannot be ruled out in all. Individuals with persistent infection had more than 50% higher odds of self-reporting long COVID than individuals with non-persistent infection. We estimate that 0.1-0.5% of infections may become persistent with typically rebounding high viral loads and last for at least 60 days. In some individuals, we identified many viral amino acid substitutions, indicating periods of strong positive selection, whereas others had no consensus change in the sequences for prolonged periods, consistent with weak selection. Substitutions included mutations that are lineage defining for SARS-CoV-2 variants, at target sites for monoclonal antibodies and/or are commonly found in immunocompromised people [11-14]. This work has profound implications for understanding and characterizing SARS-CoV-2 infection, epidemiology and evolution.
Assessing the importance of primary care diagnoses in the UK Biobank.
The UK Biobank has made general practitioner (GP) data (censoring date 2016-2017) available for approximately 45% of the cohort, whilst hospital inpatient and death registry (referred to as "HES/Death") data are available cohort-wide through 2018-2022, depending on whether the data come from England, Wales or Scotland. We assessed the importance of case ascertainment via different data sources in UKB for three diseases that are usually first diagnosed in primary care: Parkinson's disease (PD), type 2 diabetes (T2D), and all-cause dementia. Including GP data at least doubled the number of incident cases in the subset of the cohort with primary care data (e.g. from 619 to 1390 for dementia). Among the 786 dementia cases that were only captured in the GP data before the GP censoring date, only 421 (54%) were subsequently recorded in HES. Therefore, estimates of absolute or risk-stratified incidence are misleadingly low when based only on the HES/Death data. For incident cases present in both HES/Death and GP data during the full follow-up period (i.e. until the HES censoring date), the median time difference between an incident diagnosis of dementia being recorded in GP and HES/Death was 2.25 years (i.e. recorded 2.25 years earlier in the GP records). Similar lag periods were also observed for PD (median 2.31 years earlier) and T2D (median 2.82 years earlier). For participants with an incident GP diagnosis, only 65.6% of dementia cases, 69.0% of PD cases, and 58.5% of T2D cases had their diagnosis recorded in HES/Death within 7 years of the GP diagnosis. The effect estimates (hazard ratios, HR) of established risk factors for the three health outcomes mostly remained in the same direction and with similar strength of association whether cases were ascertained using HES only or by further adding GP data. The confidence intervals of the HRs became narrower when GP data were added, due to the increased statistical power from the additional cases.
In conclusion, it is desirable to extend both the coverage and follow-up period of GP data to allow researchers to maximise case ascertainment of chronic health conditions in the UK.
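Multi-source case ascertainment, as described above, amounts to taking each participant's earliest recorded diagnosis across the GP and HES/Death sources. A minimal sketch with a hypothetical record layout (the participant ids and dates below are made up for illustration):

```python
def earliest_diagnosis(*sources):
    """Merge diagnosis sources, keeping each participant's earliest record.

    Each source maps participant id -> (ISO date string, source name);
    ISO dates compare correctly as plain strings.
    """
    merged = {}
    for source in sources:
        for eid, record in source.items():
            if eid not in merged or record[0] < merged[eid][0]:
                merged[eid] = record
    return merged

# Hypothetical records: participant 1 appears in both sources,
# with the GP diagnosis predating the hospital record.
gp = {1: ("2012-03-01", "GP"), 2: ("2014-06-15", "GP")}
hes = {1: ("2014-06-01", "HES"), 3: ("2013-01-10", "HES")}
combined = earliest_diagnosis(gp, hes)
```

For participants present in both sources, the GP record typically predates the HES record by around two to three years, matching the median lags reported above.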