Significant Sparse Polygenic Risk Scores across 813 traits in UK Biobank
Published in PLOS Genetics, 2022
Using the dense phenotypic information in UK Biobank, we systematically characterized polygenic risk score (or PRS) across more than 1,500 traits. We evaluated the predictive performance of the models and compared that with the baseline models that only consider covariates, which are age, sex, and top 10 genotype principal components. We then assessed the statistical significance of the PRS in improving the predictive performance. When we look at the incremental R2 or incremental AUC, we find 813 traits have significant incremental predictive performance, after accounting for multiple hypothesis testing.
Across 813 traits with significant PRS, we asked whether the number of genetic variants in the sparse PRS models is correlated with the incremental predictive performance. We found those two quantities are significantly correlated only in both quantitative traits and binary traits.
We assessed the trans-ethnic predictive performance using the additional test sets of different ancestry groups in UK Biobank.
If you’re interested, please look at our PRS map page at the Global Biobank Engine. If you’re interested in the BASIL algorithm and the R snpnet
package that we used in this application, please also look at our paper (J. Qian, Tanigawa, et al, 2020)
Revision history
2022.03.24
Our paper is published at PLOS genetics. We thank insightful feedback from the reviewers.
2022.01.27
We revised the manuscript based on the feedback from colleagues. The major changes in this revision are the following three points.
- Given the feedback from colleagues, we removed the sentences that inappropriately mentioned genetic architecture in binary traits. We instead clarified a power difference between quantitative traits and binary traits.
- Given the concerns regarding the lack of theoretical basis in using incremental ROC-AUC for assessing linear relationship (estimated SNP-based heritabilities and transferability assessment), we now use Nagelkerke’s pseudo-R2 as the primary evaluation metric of predictive performance for binary traits in the current version of the manuscript.
- As we change the evaluation metric for binary traits, we now observe a significant rank-based correlation between the effect size (incremental Nagelkerke’s pseudo-R2) and the model size (number of genetic variants with non-zero coefficients) of the sparse PRS model.
2021.11.26
We have updated the manuscript based on the feedback from the colleagues. The major changes in this revision are the following six points.
- We reported the significant polygenic risk score (PRS) models for 428 traits in the original manuscript. We realized that there was a critical mistake when evaluating the significance of the PRS models (specifically, the coefficients of covariates were mistakenly estimated on the test set individuals, not on the score development set). We fixed this issue and now present 813 significant PRS models in the latest version of the manuscript.
- Following the feedback from colleagues, we included the types of genotyping array in the covariates when evaluating the predictive performance of PRS models. We updated the predictive performance metrics.
- We added Table 1 and Supplementary Table 3 and described the effects of prioritization of medically relevant alleles.
- We added a new figure (Fig 2) comparing the estimated SNP-based heritability vs. the predictive performance of PRS models.
- We deposit the significant PRS models on the PGS catalog. We list the score IDs in Supplementary Table 1.
- We improved the clarity of the methods section and added a few additional analyses (Supplementary Figure 1 - Supplementary Figure 3).
2021.09.02
We posted the original version on medRxiv.
Data and Code availability
- The sparse PRS model weights generated from this study are available on the Global Biobank Engine.
- The significant PRS models are also available at the PGS catalog (PGP000244 and PGP000128, score IDs are listed in S1 Table).
- The BASIL algorithm implemented in the R snpnet package was used in the PRS analysis.
- The analyses presented in this study were based on data accessed through the UK Biobank
Reference: Y. Tanigawa, J. Qian, G. R. Venkataraman, J. M. Justesen, R. Li, R. Tibshirani, T. Hastie, M. A. Rivas, Significant Sparse Polygenic Risk Scores across 813 traits in UK Biobank. PLOS Genet. 18(3), e1010105 (2022). https://doi.org/10.1371/journal.pgen.1010105