Differences in 5' untranslated regions highlight the importance of translational regulation of dosage-sensitive genes.
Untranslated regions (UTRs) are important mediators of post-transcriptional regulation. The length of UTRs and the composition of regulatory elements within them are known to vary substantially across genes, but little is known about the reasons for this variation in humans. Here, we set out to determine whether this variation, specifically in 5'UTRs, correlates with gene dosage sensitivity. We investigate 5'UTR length, the number of alternative transcription start sites, the potential for alternative splicing, the number and type of upstream open reading frames (uORFs) and the propensity of 5'UTRs to form secondary structures. We explore how these elements vary by gene tolerance to loss-of-function (LoF; using the LOEUF metric), and in genes where changes in dosage are known to cause disease. We show that LOEUF correlates with 5'UTR length and complexity. Genes that are most intolerant to LoF have longer 5'UTRs, greater TSS diversity, and more upstream regulatory elements than their LoF-tolerant counterparts. We show that these differences are evident in disease gene sets, but not in recessive developmental disorder genes where LoF of a single allele is tolerated. Our results confirm the importance of post-transcriptional regulation through 5'UTRs in the tight regulation of mRNA and protein levels, particularly for genes where changes in dosage are deleterious and lead to disease. Finally, to support gene-based investigation, we release a web-based browser tool, VuTR, for exploring the composition of individual 5'UTRs and the impact of genetic variation within them.
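As a minimal illustration of the kind of analysis described above, the sketch below correlates LoF intolerance with 5'UTR length. The toy table stands in for real gene annotations; the column names ("loeuf", "utr5_len") are hypothetical placeholders, not the paper's data release.

```python
# A minimal sketch, assuming a per-gene table of LOEUF scores and 5'UTR lengths.
# The toy rows and column names are hypothetical, not the paper's data.
import pandas as pd
from scipy.stats import spearmanr

genes = pd.DataFrame({
    "gene":     ["A", "B", "C", "D", "E", "F"],
    "loeuf":    [0.10, 0.25, 0.60, 0.90, 1.20, 1.50],  # lower = more LoF-intolerant
    "utr5_len": [450,  380,  220,  150,  120,  90],    # 5'UTR length in nucleotides
})

# A negative rank correlation would indicate that LoF-intolerant genes
# (low LOEUF) tend to carry longer 5'UTRs.
rho, p = spearmanr(genes["loeuf"], genes["utr5_len"])
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```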
CamTrapAsia: A dataset of tropical forest vertebrate communities from 239 camera trapping studies.
Information on tropical Asian vertebrates has traditionally been sparse, particularly when it comes to cryptic species inhabiting the dense forests of the region. Vertebrate populations are declining globally due to land-use change and hunting, the latter frequently referred to as "defaunation." This is especially true in tropical Asia, where there is extensive land-use change and high human density. Robust monitoring requires that large volumes of vertebrate population data be made available for use by the scientific and applied communities. Camera traps have emerged as an effective, non-invasive, widespread, and common approach to surveying vertebrates in their natural habitats. However, camera-derived datasets remain scattered across a wide array of sources, including published scientific literature, gray literature, and unpublished works, making it challenging for researchers to harness the full potential of cameras for ecology, conservation, and management. In response, we collated and standardized observations from 239 camera trap studies conducted in tropical Asia. There were 278,260 independent records of 371 distinct species, comprising 232 mammals, 132 birds, and seven reptiles. The total trapping effort accumulated in this data paper consisted of 876,606 trap nights, distributed among Indonesia, Singapore, Malaysia, Bhutan, Thailand, Myanmar, Cambodia, Laos, Vietnam, Nepal, and far eastern India. The relatively standardized deployment methods in the region provide a consistent, reliable, and rich count data set relative to other large-scale presence-only data sets, such as the Global Biodiversity Information Facility (GBIF) or citizen science repositories (e.g., iNaturalist); in this respect it is most similar to eBird. To facilitate the use of these data, we also provide mammalian species trait information and 13 environmental covariates calculated at three spatial scales around the camera survey centroids (within 10-, 20-, and 30-km buffers). We will update the dataset to include broader coverage of temperate Asia and add newer surveys and covariates as they become available. This dataset unlocks immense opportunities for single-species ecological or conservation studies as well as applied ecology, community ecology, and macroecology investigations. The data are fully available to the public for utilization and research. Please cite this data paper when utilizing the data.
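A minimal sketch of how such a standardized record table might be summarised (detections per species and a naive detection rate per trap night); the toy rows and column names are hypothetical, not the released schema.

```python
# A minimal sketch, assuming one row per independent detection plus a
# per-study effort table. Columns and species rows are invented examples.
import pandas as pd

records = pd.DataFrame({
    "study_id": ["S1", "S1", "S2", "S2", "S2"],
    "class":    ["Mammalia", "Aves", "Mammalia", "Mammalia", "Reptilia"],
    "species":  ["Sus scrofa", "Lophura nycthemera", "Sus scrofa",
                 "Muntiacus muntjak", "Varanus salvator"],
    "count":    [3, 1, 5, 2, 1],
})
effort = pd.DataFrame({"study_id": ["S1", "S2"], "trap_nights": [1200, 3400]})

# Independent detections per species, split by taxonomic class.
print(records.groupby(["class", "species"])["count"].sum())

# Naive detection rate per 100 trap nights, per study.
rate = (records.groupby("study_id")["count"].sum()
        / effort.set_index("study_id")["trap_nights"] * 100)
print(rate.round(2))
```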
Recommendations for laboratory workflow that better support centralised amalgamation of genomic variant data: findings from CanVIG-UK national molecular laboratory survey.
Background: National and international amalgamation of genomic data offers opportunities for research and audit, including analyses enabling improved classification of variants of uncertain significance. Review of individual-level data from National Health Service (NHS) testing of cancer susceptibility genes (2002-2023) submitted to the National Disease Registration Service revealed heterogeneity across participating laboratories regarding (1) the structure, quality and completeness of submitted data, and (2) the ease with which those data could be assembled locally for submission. Methods: In May 2023, we undertook a closed online survey of 51 clinical scientists who provided consensus responses representing all 17 NHS molecular genetic laboratories in England and Wales that undertake NHS diagnostic analyses of cancer susceptibility genes. The survey included 18 questions relating to 'next-generation sequencing workflow' (11), 'variant classification' (3) and 'phenotypical context' (4). Results: Widely differing processes were reported for the transfer of variant data into the local LIMS (Laboratory Information Management System), for the format in which variants are stored in the LIMS, and for which classes of variants are retained in the local LIMS. Differing local provisions and workflows for variant classification were also reported, including the resources provided and the mechanisms by which classifications are stored. Conclusion: The survey responses illustrate heterogeneous laboratory workflows for preparation of genomic variant data from local LIMS for centralised submission. Workflows are often labour-intensive and inefficient, involving multiple manual steps that introduce opportunities for error. These survey findings, and adoption of the concomitant recommendations, may support improvement in laboratory dataflows, better facilitating submission of data for central amalgamation.
Image-based consensus molecular subtyping in rectal cancer biopsies and response to neoadjuvant chemoradiotherapy.
The development of deep learning (DL) models to predict the consensus molecular subtypes (CMS) from histopathology images (imCMS) is a promising and cost-effective strategy to support patient stratification. Here, we investigate whether imCMS calls generated from whole slide histopathology images (WSIs) of rectal cancer (RC) pre-treatment biopsies are associated with pathological complete response (pCR) to neoadjuvant long course chemoradiotherapy (LCRT) with single agent fluoropyrimidine. DL models were trained to classify WSIs of colorectal cancers stained with hematoxylin and eosin into one of the four CMS classes using a multi-centric dataset of resection and biopsy specimens (n = 1057 WSIs) with paired transcriptional data. Classifiers were tested on a held-out RC biopsy cohort (ARISTOTLE), and imCMS calls were correlated with pCR to LCRT in an independent dataset merging two RC cohorts (ARISTOTLE, n = 114 and SALZBURG, n = 55 patients). DL models predicted CMS with high classification performance in multiple comparative analyses. In the independent cohorts (ARISTOTLE, SALZBURG), cases with WSIs classified as imCMS1 had a significantly higher likelihood of achieving pCR (OR = 2.69, 95% CI 1.01-7.17, p = 0.048). Conversely, imCMS4 was associated with lack of pCR (OR = 0.25, 95% CI 0.07-0.88, p = 0.031). Classification maps demonstrated pathologist-interpretable associations with high stromal content in imCMS4 cases, which is associated with poor outcome. No significant association with pCR was found for imCMS2 or imCMS3. imCMS classification of pre-treatment biopsies is a fast and inexpensive solution for identifying patient groups that could benefit from neoadjuvant LCRT. The significant associations of imCMS1 and imCMS4 with pCR suggest the existence of predictive morphological features that could enhance standard pathological assessment.
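For readers unfamiliar with the reported statistics, the sketch below shows how an odds ratio with a 95% CI can be computed from a 2x2 table of imCMS1 status against pCR. The counts are invented for illustration and do not reproduce the study's results.

```python
# A minimal sketch of an odds-ratio calculation from a 2x2 contingency table.
# The cell counts below are made up, purely to show the mechanics.
import numpy as np
from statsmodels.stats.contingency_tables import Table2x2

#                  pCR   no pCR
table = np.array([[12,   20],    # imCMS1
                  [18,   81]])   # not imCMS1   (illustrative counts only)

t = Table2x2(table)
lo, hi = t.oddsratio_confint()
print(f"OR = {t.oddsratio:.2f}, 95% CI {lo:.2f}-{hi:.2f}, "
      f"p = {t.oddsratio_pvalue():.3f}")
```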
A Unified Framework for U-Net Design and Analysis
U-Nets are a go-to neural architecture across numerous tasks for continuous signals on a square, such as images and Partial Differential Equations (PDEs); however, their design and architecture are understudied. In this paper, we provide a framework for designing and analysing general U-Net architectures. We present theoretical results which characterise the role of the encoder and decoder in a U-Net, their high-resolution scaling limits, and their conjugacy to ResNets via preconditioning. We propose Multi-ResNets: U-Nets with a simplified, wavelet-based encoder without learnable parameters. Further, we show how to design novel U-Net architectures which encode function constraints, natural bases, or the geometry of the data. In diffusion models, our framework enables us to identify that high-frequency information is dominated by noise exponentially faster, and to show how U-Nets with average pooling exploit this. In our experiments, we demonstrate how Multi-ResNets achieve competitive and often superior performance compared to classical U-Nets in image segmentation, PDE surrogate modelling, and generative modelling with diffusion models. Our U-Net framework paves the way to study the theoretical properties of U-Nets and to design natural, scalable neural architectures for a multitude of problems beyond the square.
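As a concrete reference point for the encoder/decoder/skip structure and the role of average pooling discussed above, here is a minimal illustrative U-Net in PyTorch. It is a toy sketch, not the paper's Multi-ResNet or any specific published design.

```python
# A minimal sketch of a U-Net-style encoder/decoder with one skip connection.
# Average pooling on the encoder path damps high-frequency content, which is
# the mechanism the abstract links to noise dominance in diffusion models.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())
        self.pool = nn.AvgPool2d(2)                    # downsample coarse branch
        self.enc2 = nn.Sequential(nn.Conv2d(ch, 2 * ch, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.dec = nn.Sequential(nn.Conv2d(3 * ch, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(ch, 1, 1))

    def forward(self, x):
        s = self.enc1(x)                               # high-resolution skip branch
        h = self.enc2(self.pool(s))                    # coarse, low-frequency branch
        return self.dec(torch.cat([self.up(h), s], dim=1))

x = torch.randn(1, 1, 32, 32)
print(TinyUNet()(x).shape)                             # torch.Size([1, 1, 32, 32])
```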
Venous thromboembolism risk in amyotrophic lateral sclerosis: a hospital record-linkage study.
Background: Venous thromboembolism (VTE) can occur in amyotrophic lateral sclerosis (ALS), and pulmonary embolism causes death in a minority of cases. The benefits of preventing VTE must be weighed against the risks. An accurate estimate of the incidence of VTE in ALS is crucial to assessing this balance. Methods: This retrospective record-linkage cohort study derived data from the Hospital Episode Statistics database, covering admissions to England's hospitals from 1 April 2003 to 31 December 2019, and included 21 163 patients with ALS and 17 425 337 controls. Follow-up began at index admission and ended at VTE admission, death or 2 years (whichever came sooner). Adjusted HRs (aHRs) for VTE were calculated, controlling for confounders. Results: The incidence of VTE in the ALS cohort was 18.8/1000 person-years. The relative risk of VTE in ALS was significantly greater than in controls (aHR 2.7, 95% CI 2.4 to 3.0). The relative risk of VTE in patients with ALS under 65 years was five times higher than controls (aHR 5.34, 95% CI 4.6 to 6.2), and higher than that of patients over 65 years compared with controls (aHR 1.86, 95% CI 1.62 to 2.12). Conclusions: Patients with ALS are at a higher risk of developing VTE, but this is similar in magnitude to that reported in other chronic neurological conditions associated with immobility, such as multiple sclerosis, which do not routinely receive VTE prophylaxis. Those with ALS below the median age of symptom onset have a notably higher relative risk. A reappraisal of the case for routine antithrombotic therapy in those diagnosed with ALS now requires a randomised controlled trial.
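The adjusted hazard ratios above are the output of survival modelling; a minimal sketch of fitting a Cox proportional-hazards model and reading off adjusted HRs is shown below, using lifelines' bundled demo dataset as a stand-in for the (non-public) HES record-linkage data.

```python
# A minimal sketch of Cox proportional-hazards fitting with lifelines.
# load_rossi() is a small bundled demo survival dataset, used here only as a
# stand-in; it is unrelated to the ALS/VTE study data.
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

df = load_rossi()                      # one row per subject: covariates + outcome
cph = CoxPHFitter()
cph.fit(df, duration_col="week", event_col="arrest")  # follow-up time + event flag
cph.print_summary()                    # exp(coef) column gives the adjusted HRs
```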
The Hidden Hand of Asymptomatic Infection Hinders Control of Neglected Tropical Diseases: A Modeling Analysis
Background: Neglected tropical diseases are responsible for considerable morbidity and mortality in low-income populations. International efforts have reduced their global burden, but transmission is persistent and case-finding-based interventions rarely target asymptomatic individuals. Methods: We develop a generic mathematical modeling framework for analyzing the dynamics of visceral leishmaniasis in the Indian sub-continent (VL), gambiense sleeping sickness (gHAT), and Chagas disease and use it to assess the possible contribution of asymptomatics who later develop disease (pre-symptomatics) and those who do not (non-symptomatics) to the maintenance of infection. Plausible interventions, including active screening, vector control, and reduced time to detection, are simulated for the three diseases. Results: We found that the high asymptomatic contribution to transmission for Chagas and gHAT and the apparently high basic reproductive number of VL may undermine long-term control. However, the ability to treat some asymptomatics for Chagas and gHAT should make them more controllable, albeit over relatively long time periods due to the slow dynamics of these diseases. For VL, the toxicity of available therapeutics means the asymptomatic population cannot currently be treated, but combining treatment of symptomatics and vector control could yield a quick reduction in transmission. Conclusions: Despite the uncertainty in natural history, it appears there is already a relatively good toolbox of interventions to eliminate gHAT, and it is likely that Chagas will need improvements to diagnostics and their use to better target pre-symptomatics. The situation for VL is less clear, and model predictions could be improved by additional empirical data. However, interventions may have to improve to successfully eliminate this disease.
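A minimal toy sketch of the kind of model described, with separate symptomatic and asymptomatic transmitting compartments; the rates are assumed values for illustration, not the paper's fitted three-disease framework.

```python
# A minimal SIS-style sketch where both symptomatic (i_s) and asymptomatic (i_a)
# individuals transmit. All parameter values are assumptions for illustration.
import numpy as np
from scipy.integrate import solve_ivp

beta_s, beta_a = 0.30, 0.12    # transmission rates (assumed)
gamma_s, gamma_a = 0.10, 0.05  # recovery/detection rates (assumed)
p_sym = 0.4                    # fraction of infections that become symptomatic

def rhs(t, y):
    s, i_s, i_a = y
    foi = (beta_s * i_s + beta_a * i_a) * s          # force of infection
    return [-foi + gamma_s * i_s + gamma_a * i_a,
            p_sym * foi - gamma_s * i_s,
            (1 - p_sym) * foi - gamma_a * i_a]

sol = solve_ivp(rhs, (0, 365), [0.99, 0.01, 0.0])
print("final prevalence (symptomatic, asymptomatic):",
      sol.y[1, -1].round(3), sol.y[2, -1].round(3))

# Toy basic reproduction number: each route weighted by its probability,
# transmission rate, and mean infectious duration.
r0 = p_sym * beta_s / gamma_s + (1 - p_sym) * beta_a / gamma_a
print("R0 =", r0)   # the asymptomatic term shows their hidden contribution
```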
Device-Measured Physical Activity in 3,506 Individuals with Knee or Hip Arthroplasty.
Purpose: Hip and knee arthroplasty aims to reduce joint pain and increase functional mobility in patients with osteoarthritis; however, the degree to which arthroplasty is associated with higher physical activity is unclear. The current study sought to assess the association of hip and knee arthroplasty with objectively measured physical activity. Methods: This cross-sectional study analysed wrist-worn accelerometer data collected in 2013-2016 from UK Biobank participants (aged 43-78). From a cohort of 94,707 participants with valid accelerometer wear time and complete self-reported data, electronic health records were used to identify 3,506 participants who had undergone primary or revision hip or knee arthroplasty and 68,389 non-arthritic controls. Multivariable linear regression was performed to compare step count, cadence, overall acceleration, and activity behaviours between the non-arthritic control, end-stage arthritic, and postoperative cohorts, controlling for demographic and behavioural confounders. Results: End-stage hip or knee arthritis was associated with taking 1,129 fewer steps/day [95% CI: 811, 1,447] (p < 0.001) and 5.8 fewer minutes/day [95% CI: 3.0, 8.7] (p < 0.001) of moderate-to-vigorous activity compared with non-arthritic controls. Unilateral primary hip and knee arthroplasty were associated with 877 [95% CI: 284, 1,471] (p = 0.004) and 893 [95% CI: 232, 1,554] (p = 0.008) more steps than end-stage osteoarthritic participants, respectively. Postoperative unilateral hip arthroplasty participants demonstrated levels of moderate-to-vigorous physical activity and daily step count equivalent to non-arthritic controls. No difference was observed between any cohorts in overall acceleration or in time spent in daily light activity, sedentary behaviour, or sleep. Conclusions: Hip and knee arthroplasty are associated with higher levels of physical activity compared with end-stage arthritis. Unilateral hip arthroplasty patients, in particular, demonstrate equivalence to non-arthritic peers at more than 1 year after surgery.
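A minimal sketch of the adjusted comparison named in the methods: linear regression of daily step count on cohort membership with confounder adjustment. The variable names and the simulated data are hypothetical, not UK Biobank fields.

```python
# A minimal sketch of a multivariable linear regression comparing step counts
# across cohorts, adjusting for confounders. All data below are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "cohort": rng.choice(["control", "end_stage", "post_op"], n),
    "age": rng.uniform(43, 78, n),
    "sex": rng.choice(["F", "M"], n),
})
# Simulated outcome: end-stage arthritis lowers steps; age has a small effect.
df["steps"] = (9000 - 1100 * (df.cohort == "end_stage") - 40 * (df.age - 60)
               + rng.normal(0, 1500, n))

# Cohort coefficients are adjusted step differences vs the control group.
fit = smf.ols("steps ~ C(cohort, Treatment('control')) + age + sex", df).fit()
print(fit.summary().tables[1])
```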
Digital health technologies and machine learning augment patient reported outcomes to remotely characterise rheumatoid arthritis.
Digital measures of health status captured during daily life could greatly augment current in-clinic assessments for rheumatoid arthritis (RA), enabling better assessment of disease progression and impact. This work presents results from weaRAble-PRO, a 14-day observational study which aimed to investigate how digital health technologies (DHT), such as smartphones and wearables, could augment patient reported outcomes (PRO) to determine RA status and severity in a study of 30 moderate-to-severe RA patients compared with 30 matched healthy controls (HC). Sensor-based measures of health status, mobility, dexterity, fatigue, and other RA-specific symptoms were extracted from daily iPhone guided tests (GT), as well as from actigraphy and heart rate sensor data, which were passively recorded from patients' Apple smartwatches continuously over the study duration. We subsequently developed a machine learning (ML) framework to distinguish RA status and to estimate RA severity. Daily wearable sensor outcomes robustly distinguished RA from HC participants (F1, 0.807). Furthermore, by day 7 of the study (half-way), a sufficient volume of data had been collected to reliably capture the characteristics of RA participants. In addition, we observed that the detection of RA severity levels could be improved by augmenting standard patient reported outcomes with sensor-based features (F1, 0.833) in comparison to using PRO assessments alone (F1, 0.759), and that the combination of modalities could reliably measure continuous RA severity, as determined by the clinician-assessed RAPID-3 score at baseline (r2, 0.692; RMSE, 1.33). The ability to measure the impact of the disease during daily life, through objective and remote digital outcomes, paves the way for the development of more patient-centric and personalised measurements for use in RA clinical trials.
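A minimal sketch of the classification step: distinguishing RA from HC on sensor-derived features and scoring with F1. The features here are simulated placeholders, not the weaRAble-PRO feature set or its modelling pipeline.

```python
# A minimal sketch of a binary RA-vs-HC classifier evaluated with F1.
# Features and group separation are simulated; not the study's data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 60                                    # 30 RA + 30 HC, mirroring the design
y = np.repeat([1, 0], n // 2)             # 1 = RA, 0 = healthy control
X = rng.normal(size=(n, 12)) + 0.8 * y[:, None]   # synthetic sensor features

clf = RandomForestClassifier(n_estimators=200, random_state=0)
y_hat = cross_val_predict(clf, X, y, cv=5)  # a real study would split by subject
print("F1:", round(f1_score(y, y_hat), 3))
```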
Epidemiology and Economics of Deworming
Global access to deworming is one of the public health success stories of the twenty-first century and was the key catalyst for creating the neglected tropical disease (NTD) agenda. Human worm infections appear to have been with us since the domestication of household animals, some 10,500 years ago, and putative treatments are known from the earliest pharmacopoeias, but it has only been in the last 100 years that we have sought a public health solution and only in the last 5 years that real success at scale has been achieved. This is a success that depends on donated drugs and targeted treatment campaigns outside of the traditional health system. In this chapter, we explore the scientific foundations for this success and what it implies for the future management of soil-transmitted helminths (STHs) by health systems. The chapter describes the evolution of public health approaches to reducing the prevalence and morbidity of STHs and the evidence of impact of mass drug administration on target populations, and provides context for the debate that has surrounded these results. It also details the costs of delivering these interventions and how future delivery approaches can align with Universal Health Care objectives.
Mapping cell-to-tissue graphs across human placenta histology whole slide images using deep learning with HAPPY.
Accurate placenta pathology assessment is essential for managing maternal and newborn health, but the placenta's heterogeneity and temporal variability pose challenges for histology analysis. To address this issue, we developed the 'Histology Analysis Pipeline.PY' (HAPPY), a hierarchical deep learning method for quantifying the variability of cells and micro-anatomical tissue structures across placenta histology whole slide images. HAPPY differs from patch-based feature and segmentation approaches by following an interpretable biological hierarchy, representing cells and cellular communities within tissues at single-cell resolution across whole slide images. We present a set of quantitative metrics from healthy term placentas as a baseline for future assessments of placenta health, and we show how these metrics deviate in placentas with clinically significant placental infarction. HAPPY's cell and tissue predictions closely replicate those from independent clinical experts and the placental biology literature.
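One building block of such a cell-to-tissue hierarchy is connecting detected cell centroids into a spatial graph so that cellular neighbourhoods can be grouped into tissue structures. A minimal sketch using a k-nearest-neighbour graph is shown below; the coordinates are simulated and this is not HAPPY's actual graph construction.

```python
# A minimal sketch of building a spatial k-NN graph over cell centroids,
# a common first step before classifying cellular communities into tissues.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
cells = rng.uniform(0, 1000, size=(500, 2))   # (x, y) centroids from a WSI region

# Sparse adjacency: each cell linked to its 5 nearest neighbours.
adj = kneighbors_graph(cells, n_neighbors=5, mode="connectivity")
print(adj.shape, "edges:", adj.nnz)

# Mean neighbour distance per cell: a simple neighbourhood-scale feature.
dist = kneighbors_graph(cells, n_neighbors=5, mode="distance")
print("mean neighbour distance:", round(dist.sum() / dist.nnz, 1))
```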
Development of an enhanced scoring system to predict ICU readmission or in-hospital death within 24 hours using routine patient data from two NHS Foundation Trusts.
Rationale: Intensive care units (ICUs) admit the most severely ill patients. Once these patients are discharged from the ICU to a step-down ward, they continue to have their vital signs monitored by nursing staff, with Early Warning Score (EWS) systems being used to identify those at risk of deterioration. Objectives: We report the development and validation of an enhanced continuous scoring system for predicting adverse events, which combines vital signs measured routinely on acute care wards (as used by most EWS systems) with a risk score of a future adverse event calculated on discharge from the ICU. Design: A modified Delphi process identified candidate variables commonly available in electronic records as the basis for a 'static' score of the patient's condition immediately after discharge from the ICU. L1-regularised logistic regression was used to estimate the in-hospital risk of a future adverse event. We then constructed a model of physiological normality using vital sign data from the day of hospital discharge. This is combined with the static score and used continuously to quantify and update the patient's risk of deterioration throughout their hospital stay. Setting: Data from two National Health Service (NHS) Foundation Trusts (UK) were used to develop and (externally) validate the model. Participants: A total of 12 394 vital sign measurements were acquired from 273 patients after ICU discharge for the development set, and 4831 from 136 patients in the validation cohort. Results: Validation of our model yielded an area under the receiver operating characteristic curve of 0.724 for predicting ICU readmission or in-hospital death within 24 hours, an improvement over other competing risk-scoring systems, including the National EWS (0.653). Conclusions: A scoring system incorporating data from a patient's stay in the ICU performs better than commonly used EWS systems based on vital signs alone. Trial registration number: ISRCTN32008295.
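A minimal sketch of the 'static' score component named above: L1-regularised logistic regression over candidate discharge variables. The data are simulated; in the study the candidate variables came from the Delphi process, not random features.

```python
# A minimal sketch of an L1-regularised logistic regression "static" risk score.
# X stands in for candidate discharge-day variables; all values are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))                   # 20 candidate variables
y = (X[:, 0] + 0.5 * X[:, 1]                     # simulated adverse-event signal
     + rng.normal(size=400) > 1).astype(int)

# The L1 penalty drives uninformative coefficients to exactly zero,
# yielding a sparse, interpretable score.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
model.fit(X, y)
print("variables retained:", int((model.coef_ != 0).sum()))
static_risk = model.predict_proba(X)[:, 1]       # later combined with vital signs
```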
Distinct patterns of vital sign and inflammatory marker responses in adults with suspected bloodstream infection.
Objectives: To identify patterns in inflammatory marker and vital sign responses in adults with suspected bloodstream infection (BSI) and to define expected trends in normal recovery. Methods: We included patients aged ≥16 years from Oxford University Hospitals with a blood culture taken between 1 January 2016 and 28 June 2021. We used linear and latent class mixed models to estimate trajectories in C-reactive protein (CRP), white blood count, heart rate, respiratory rate and temperature, and to identify CRP response subgroups. Centile charts for expected CRP responses were constructed via the lambda-mu-sigma method. Results: Of 88,348 suspected BSI episodes, 6,908 (7.8%) were culture-positive with a probable pathogen, 4,309 (4.9%) contained potential contaminants, and 77,131 (87.3%) were culture-negative. CRP levels generally peaked 1-2 days after blood culture collection, with varying responses for different pathogens and infection sources (p<0.0001). We identified five CRP trajectory subgroups: peak on day 1 (36,091; 46.3%) or day 2 (4,529; 5.8%), slow recovery (10,666; 13.7%), peak on day 6 (743; 1.0%), and low response (25,928; 33.3%). Centile reference charts tracking normal responses were constructed from episodes peaking on day 1 or 2. Conclusions: CRP and other infection response markers rise and recover differently depending on the clinical syndrome and pathogen involved. However, centile reference charts that account for these differences can be used to track whether patients are recovering as expected and to help personalise infection management.
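A crude sketch of the centile-chart idea: empirical CRP centiles by day since blood culture, on simulated data. The paper fits smoothed curves with the lambda-mu-sigma method; plain group-wise quantiles are shown here only to convey the concept.

```python
# A simplified sketch of a centile reference chart: empirical CRP centiles per
# day since culture. Simulated data; the real charts use smoothed LMS fits.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
day = rng.integers(0, 10, n)                            # days since blood culture
crp = rng.lognormal(mean=4.5 - 0.25 * day, sigma=0.6)   # peak early, then recover

df = pd.DataFrame({"day": day, "crp": crp})
centiles = df.groupby("day")["crp"].quantile([0.05, 0.5, 0.95]).unstack()
print(centiles.round(1))   # a patient tracking above the 95th centile is flagged
```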
Predicting the future risk of lung cancer: development, and internal and external validation of the CanPredict (lung) model in 19·67 million people and evaluation of model performance against seven other risk prediction models.
Background: Lung cancer is the second most common cancer in incidence and the leading cause of cancer deaths worldwide. Lung cancer screening with low-dose CT can reduce mortality. The UK National Screening Committee recommended targeted lung cancer screening on Sept 29, 2022, and asked for more modelling work to help refine the recommendation. This study aims to develop and validate a risk prediction model, the CanPredict (lung) model, for lung cancer screening in the UK and to compare its performance against seven other risk prediction models. Methods: For this retrospective, population-based, cohort study, we used linked electronic health records from two English primary care databases: QResearch (Jan 1, 2005-March 31, 2020) and Clinical Practice Research Datalink (CPRD) Gold (Jan 1, 2004-Jan 1, 2015). The primary study outcome was an incident diagnosis of lung cancer. We used a Cox proportional-hazards model in the derivation cohort (12·99 million individuals aged 25-84 years from the QResearch database) to develop the CanPredict (lung) model in men and women. We used discrimination measures (Harrell's C statistic, D statistic, and the explained variation in time to diagnosis of lung cancer [R2D]) and calibration plots to evaluate model performance by sex and ethnicity, using data from QResearch (4·14 million people for internal validation) and CPRD (2·54 million for external validation). Seven models for predicting lung cancer risk (Liverpool Lung Project [LLP]v2, LLPv3, Lung Cancer Risk Assessment Tool [LCRAT], Prostate, Lung, Colorectal, and Ovarian [PLCO]M2012, PLCOM2014, Pittsburgh, and Bach) were selected for comparison with the CanPredict (lung) model using two approaches: (1) in ever-smokers aged 55-74 years (the population recommended for lung cancer screening in the UK), and (2) in the populations defined by each model's eligibility criteria. Findings: There were 73 380 incident lung cancer cases in the QResearch derivation cohort, 22 838 cases in the QResearch internal validation cohort, and 16 145 cases in the CPRD external validation cohort during follow-up. The predictors in the final model included sociodemographic characteristics (age, sex, ethnicity, Townsend score), lifestyle factors (BMI, smoking and alcohol status), comorbidities, family history of lung cancer, and personal history of other cancers. Some predictors differed between the models for women and men, but model performance was similar between sexes. The CanPredict (lung) model showed excellent discrimination and calibration in both internal and external validation of the full model, by sex and ethnicity. The model explained 65% of the variation in time to diagnosis of lung cancer (R2D) in both sexes in the QResearch validation cohort and 59% in the CPRD validation cohort. Harrell's C statistics were 0·90 in the QResearch (validation) cohort and 0·87 in the CPRD cohort, and the D statistics were 2·8 and 2·4, respectively. Compared with seven other lung cancer prediction models, the CanPredict (lung) model had the best performance in discrimination, calibration, and net benefit across three prediction horizons (5, 6, and 10 years) in the two approaches. The CanPredict (lung) model also had higher sensitivity than the currently recommended UK models (LLPv2 and PLCOM2012), identifying more lung cancer cases while screening the same number of individuals at high risk. Interpretation: The CanPredict (lung) model was developed, and internally and externally validated, using data from 19·67 million people in two English primary care databases. Our model has potential utility for risk stratification of the UK primary care population and for selecting individuals at high risk of lung cancer for targeted screening. If our model is recommended for implementation in primary care, each individual's risk can be calculated from information already in the primary care electronic health record, and people at high risk can be identified for the lung cancer screening programme. Funding: Innovate UK (UK Research and Innovation). Translation: For the Chinese translation of the abstract see the Supplementary Materials section.
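A minimal sketch of the headline discrimination metric, Harrell's C statistic, computed with lifelines on simulated risk scores and survival times; none of this reflects the study's data or model.

```python
# A minimal sketch of Harrell's C statistic for a survival risk score.
# Risk scores and event times are simulated so that higher risk = shorter time.
import numpy as np
from lifelines.utils import concordance_index

rng = np.random.default_rng(0)
n = 1000
risk = rng.normal(size=n)                          # model risk score (higher = worse)
time = rng.exponential(scale=np.exp(-risk) * 10)   # shorter times for higher risk
event = rng.random(n) < 0.7                        # 1 = diagnosis observed

# concordance_index treats higher scores as predicting *longer* survival,
# so a risk score must be negated before being passed in.
print("Harrell's C:", round(concordance_index(time, -risk, event), 3))
```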