
Improved disease prediction through biobank datasets

Improving disease prediction: How the MILTON framework uses multi-omics data to transform healthcare insights.

Study: Disease prediction with multi-omics and biomarkers enables case-control gene discoveries in the UK Biobank

Image credit: Xray Computer/Shutterstock.com

In a study recently published in Nature Genetics, a group of researchers developed and applied an ensemble machine learning framework (MILTON) to predict disease and improve genetic association analyses using multi-omics data from the United Kingdom Biobank (UKB).

Background

Identifying individuals at high risk for disease is critical for preventive medicine, but traditional risk assessment tools that rely on factors such as age and family history may not fully capture the complexity of disease biology.

Large biobanks such as the UKB integrate multi-omics data such as blood tests, proteomics and metabolomics, offering the opportunity to discover new biomarkers.

These comprehensive datasets enable the identification of biomarker combinations that improve disease prediction beyond individual markers. Further research is needed to better understand the biological processes underlying complex diseases and improve predictive models.

About the study

The UKB cohort includes 502,226 participants aged 37 to 73 years, with a mean age of 58 years. Of these, 54.4% are female. The data provide comprehensive information such as diagnostic records, blood biochemistry, body measurements, genomics and proteomics data. All participants gave informed consent and participated voluntarily.

The FinnGen cohort consists of 412,181 individuals, of whom 55.9% are female, with an average age of 63 years. These participants also gave informed consent and participated voluntarily.

FinnGen data were not accessed at the patient level; only summary statistics from the Genome-Wide Association Study (GWAS) were used. The research complied with all ethical requirements and approvals were obtained from the relevant ethics committees.

The UKB study received approval from the North West Multi-centre Research Ethics Committee, and the Coordinating Ethics Committee of the Helsinki and Uusimaa Hospital District approved the FinnGen study.

The Finnish Institute for Health and Welfare, the Digital and Population Data Services Agency, the Social Insurance Institution and Statistics Finland have granted additional permits to FinnGen.

Both studies carefully processed the data and ensured accurate case and control definitions. Cases and controls were extensively filtered so that the distributions of age, sex, and other baseline characteristics remained consistent.

Study results

Clinical biomarkers play a crucial role in the diagnosis and evaluation of diseases, as they provide measurable evidence of the presence and severity of a disease. In the context of phenome-wide association studies (PheWAS), biomarkers also offer the opportunity to identify misclassified or cryptic cases.

MILTON, a machine learning method, was introduced to predict disease status for 3,213 disease phenotypes using quantitative biomarkers. The technique first learns a disease-specific signature from diagnosed patients and then predicts potential new cases among the original control subjects. The resulting expanded cohorts are used for rare variant analysis, and the results are compared with those from the baseline cohorts.
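The cohort expansion step described above can be illustrated with a minimal sketch. It assumes that a per-participant case probability has already been produced by some trained model. The function name `augment_cohort`, the participant IDs, and the 0.7 threshold are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from typing import Dict, Set, Tuple


def augment_cohort(
    cases: Set[str],
    controls: Set[str],
    p_case: Dict[str, float],
    threshold: float,
) -> Tuple[Set[str], Set[str]]:
    """Move controls whose predicted case probability meets the threshold
    into the case group, and return (augmented cases, remaining controls).

    p_case maps participant ID -> predicted probability of being a case.
    Controls without a score are left in the control group (an assumption).
    """
    predicted = {pid for pid in controls if p_case.get(pid, 0.0) >= threshold}
    return cases | predicted, controls - predicted


# Hypothetical toy data: one diagnosed case and three original controls.
cases = {"p1"}
controls = {"p2", "p3", "p4"}
scores = {"p2": 0.90, "p3": 0.20, "p4": 0.75}

augmented_cases, remaining_controls = augment_cohort(cases, controls, scores, 0.7)
```

The augmented case set and the reduced control set can then be passed to the same association test used for the baseline cohort, so that the two sets of results are directly comparable.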

MILTON's disease prediction models take into account the time lag between biomarker sample collection and diagnosis. In the UKB, samples may have been collected up to 16.5 years before or up to 50 years after diagnosis.

MILTON was trained under three time models: prognostic (diagnosis up to 10 years after sample collection), diagnostic (diagnosis up to 10 years before sample collection), and time-agnostic (all diagnosed cases, regardless of timing). A sensitivity analysis of 400 randomly selected International Classification of Diseases, 10th Revision (ICD10) codes identified the 10-year cutoff as optimal.
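As a rough illustration of how the three time models partition diagnosed cases, the sketch below assigns a case to models based on the lag between sample collection and diagnosis. The function name `time_models` and the handling of the boundaries (for example, a diagnosis on the day of sampling counting as diagnostic) are assumptions, not details taken from the study.

```python
from typing import List

CUTOFF_YEARS = 10.0  # the 10-year cutoff reported in the article


def time_models(lag_years: float, cutoff: float = CUTOFF_YEARS) -> List[str]:
    """Return the time models a diagnosed case would count toward.

    lag_years = (diagnosis date - sample collection date) in years.
    Positive: diagnosed after sampling. Negative: diagnosed before sampling.
    """
    models = ["time-agnostic"]  # every diagnosed case is included here
    if 0 < lag_years <= cutoff:
        models.append("prognostic")  # diagnosed within the cutoff after sampling
    elif -cutoff <= lag_years <= 0:
        models.append("diagnostic")  # diagnosed within the cutoff before sampling
    return models


later = time_models(3.0)      # diagnosed 3 years after the sample was taken
earlier = time_models(-4.0)   # diagnosed 4 years before the sample was taken
distant = time_models(15.0)   # outside the 10-year window in either direction
```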

MILTON was trained on 67 features, including blood biochemistry and blood counts, urinalysis, height, blood pressure, sex, age, spirometry, and fasting time. The model's performance was evaluated using the area under the curve (AUC). MILTON achieved AUC ≥ 0.7 for 1,091 ICD10 codes, AUC ≥ 0.8 for 384 codes, and AUC ≥ 0.9 for 121 codes across all time models and ancestries.
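For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen control. The sketch below computes it directly from that definition on made-up scores; the function name `auc` and the toy data are hypothetical, and the pairwise loop is meant for clarity rather than biobank-scale use.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for cases, 0 for controls. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two cases and three controls.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)  # 5 of 6 case-control pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the thresholds of 0.7, 0.8, and 0.9 are commonly used as tiers of predictive performance.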

For 1,466 ICD10 codes, diagnostic models generally performed better than prognostic models. For example, among participants of European (EUR) ancestry, diagnostic models had higher median AUC (0.668 versus 0.647) and sensitivity (0.586 versus 0.570).

MILTON also showed stable performance for European and African ancestry, while performance for South Asian diagnostic models improved with increasing number of cases.

The ability of MILTON to predict diseases before their onset was further confirmed. Among individuals with a high predicted probability of being a case (0.7 ≤ Pcase ≤ 1), participants who were subsequently diagnosed with the corresponding disease were significantly enriched for 97.41% of ICD10 codes. These results support the effectiveness of MILTON in identifying emerging cases and complementing genetic association analyses.
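One common way to quantify this kind of enrichment is a 2×2 table comparing later diagnoses among high-probability and lower-probability participants. The article does not say which statistical test was used, so the one-sided Fisher's exact test below is an assumption, and the counts are invented purely for illustration.

```python
from math import comb


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Odds ratio for the table [[a, b], [c, d]]; inf if b * c is zero."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact p-value for enrichment, P(X >= a).

    Table layout (rows: high Pcase / lower Pcase,
                  columns: later diagnosed / not diagnosed):
        [[a, b],
         [c, d]]
    """
    row1 = a + b          # participants with high Pcase
    col1 = a + c          # participants later diagnosed
    n = a + b + c + d
    upper = min(row1, col1)
    total = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return total / comb(n, col1)


# Invented counts: 3 of 4 high-Pcase participants were later diagnosed,
# compared with 1 of 4 lower-Pcase participants.
toy_or = odds_ratio(3, 1, 1, 3)
toy_p = fisher_one_sided(3, 1, 1, 3)
```

With counts this small the p-value is far from significant; at biobank scale, the same calculation would normally be delegated to a statistics library rather than computed from exact binomial coefficients.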

Conclusions

In summary, MILTON predicts disease using multi-omics and biomarkers, improving case-control studies across five UKB ancestries. Despite the broad, non-disease-specific feature set, MILTON achieved high predictive power for numerous phenotypes, with AUC ≥ 0.7 for 1,091 ICD10 codes, AUC ≥ 0.8 for 384, and AUC ≥ 0.9 for 121.

However, for some diseases, predictive power remained low, indicating that more informative features are needed.

MILTON frequently outperformed polygenic risk scores (PRS), but underperformed for diseases such as melanoma and breast cancer. Adding proteomic data improved predictions for 52 phenotypes. MILTON also identified 182 putatively novel gene-disease signals that require further validation.
