Here we consider that there are 400 individuals to be used as a validation set to tune the hyper-parameters of LDpred2-grid.

# Attach the "bigSNP" object in R session
# takes several minutes if you do not have many cores

Basic Tutorial for Polygenic Risk Score Analyses. You should verify whether the chains “converged”. LDpred2: better, faster, stronger.

# Extract SNPs that are included in the chromosome
# We assume the fam order is the same across different chromosomes
# Assuming the file naming is EUR_chr#.bed
# Reformat the phenotype file such that y is of the same order as the
# (will also need the fmsb package to calculate the pseudo R2)

The genotype file after performing some basic filtering
This file contains the SNPs that passed the basic filtering
This file contains the samples that passed the basic filtering
This file contains the phenotype of the samples
This file contains the covariates of the samples
This file contains the PCs of the samples

Here we show how to compute polygenic risk scores using LDpred2. You can see there how we generated these data from the 1000 Genomes project.

lassosum is a dedicated PRS program: an R package that uses penalised regression (LASSO) to calculate PRS.
LDpred-2 is a dedicated PRS program: an R package that uses a Bayesian approach to polygenic risk scoring. If no or few variants are actually flipped, you might want to disable the strand flipping option. In practice, until we find a better set of variants, we recommend using the HapMap3 variants used in the PRS-CS and LDpred2 papers. For more information, please refer to this paper.

"https://github.com/privefl/bigsnpr/raw/master/data-raw/hm3_variants.rds"

# LDpred 2 requires the header to follow the exact naming
# Initialize variables for storing the LD score and LD matrix
# We want to know the ordering of samples in the bed file
# preprocess the bed file (only need to do once for each data set)
# extract the SNP information from the genotype
# Assign the genotype to a variable for easier downstream analysis
# get the CM information from 1000 Genomes
# will download the 1000G file to the current directory (".

Here, we use the Z-score from the regression of the phenotype on the PRS, since we have found it more robust than using the AUC. The script used here is based on LDpred 2 as implemented in bigsnpr version 1.4.7. For more details, please refer to LDpred 2's homepage. We recommend running many of them in parallel with different initial values for p (e.g. Some quality control on summary statistics is highly recommended (see paper). Note that we now recommend running LDpred2 genome-wide, contrary to what was shown in the first versions of this tutorial.

You can find the accompanying tutorial here. PRSice-2: Polygenic Risk Score software. PRSice (pronounced 'precise') is Polygenic Risk Score software for calculating, applying, evaluating and plotting the results of polygenic risk score (PRS) analyses.
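The Z-score criterion mentioned above (the Z-score from regressing the phenotype on the PRS) can be illustrated with a minimal base-R sketch. Everything in it is an assumption made for illustration: the phenotype and the three candidate scores are simulated, the names `model_a` to `model_c` are hypothetical, and covariates are left out for brevity.

```r
# Simulated stand-ins (assumptions): a binary phenotype and three candidate PRS
set.seed(1)
n <- 400
signal <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(signal))
prs <- cbind(
  model_a = signal + rnorm(n, sd = 0.5),  # informative score
  model_b = signal + rnorm(n, sd = 3),    # noisy score
  model_c = rnorm(n)                      # score unrelated to the phenotype
)

# Z-score of the PRS coefficient in a logistic regression of y on each score
z_scores <- apply(prs, 2, function(x) {
  fit <- glm(y ~ x, family = "binomial")
  summary(fit)$coefficients["x", "z value"]
})

# Keep the candidate with the largest Z-score
best <- names(which.max(z_scores))
print(z_scores)
print(best)
```

With real data, the covariates and PCs would typically be added to the right-hand side of the regression formula, and the scores would come from the validation set only.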
On this page, you will compute PRS using the popular genetic analysis tool plink. While plink is not dedicated PRS software, you can perform every required step of the C+T approach with it.

The tutorial is separated into four main sections and reflects the structure of our guide paper: the first two sections on QC corres…

Polygenic risk score tutorial. Sarah Medland, Quantitative Genetics, QIMR Berghofer, 16/07/2014.

We split the genotype data, using one part to choose hyper-parameters and another part to evaluate statistical properties of the polygenic risk score, such as AUC.

# Read from bed/bim/fam, it generates .bk and .rds files.

We have seen how to run 3 versions of LDpred2 (“-inf”, “-grid” and “-auto”) for one chromosome. In practice, if you do not really care about sparsity, you could choose the best LDpred2-grid model among all sparse and non-sparse models. Thus one must rename the columns according to their actual ordering. Scripts for binary trait analysis only serve as a reference, as we have not simulated any binary traits.

First, you need to read genotype data from the PLINK files (or BGEN files) as well as the text file containing summary statistics. length.out = 30). Here, we have built the LD matrix using variants from one chromosome only. This tutorial only uses fake data for educational purposes. The script used here is based on lassosum version 0.4.4. For more details, please refer to lassosum's homepage. BioRxiv. First, you need to compute correlations between variants.
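The split described above, one part of the samples for choosing hyper-parameters and another for evaluation, can be sketched in base R as follows. The sizes follow the 400 validation and 159 test individuals mentioned in this tutorial; the total of 559 is simply their sum, and the variable names are illustrative assumptions.

```r
set.seed(1)                       # make the random split reproducible
n_total <- 559                    # assumed total: 400 validation + 159 test

# Row indices of the validation set (used to tune hyper-parameters)
ind_val <- sort(sample(n_total, 400))

# Remaining rows form the test set (used only for the final evaluation)
ind_test <- setdiff(seq_len(n_total), ind_val)

length(ind_val)
length(ind_test)
```

Keeping the two index vectors disjoint is what prevents the performance estimate on the test set from being inflated by the tuning step.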
You can install lassosum and its dependencies in R with the following command:

Again, we assume that we have the following files (or you can download them from here):

# Prefer to work with data.table as it speeds up file reading
# For multi-threading, you can use the parallel package and
# invoke cl which is then passed to lassosum.pipeline
# Need as.data.frame here as lassosum doesn't handle data.table
# We will need the EUR.hg19 file provided by lassosum.

In addition, Nagelkerke \(R^2\) is biased when there is ascertainment of samples.

To match variants contained in the genotype data and the summary statistics, the variables "chr" (chromosome number), "pos" (physical position), "a0" (reference allele) and "a1" (derived allele) should be available in both the summary statistics and the genotype data. These 4 variables are used to match variants between the two data frames. Information about these variants can be retrieved with.

The only difference it makes is when building the SFBM (the sparse LD matrix on disk): you need to build it so that it contains all variants genome-wide (see e.g. this code).

You can install LDpred and its dependencies in R with the following command: For Mac users, you might need to follow the guide here to be able to install LDpred2.

Privé, F., Arbel, J., & Vilhjálmsson, B. J.

The other 159 individuals are used as a test set to evaluate the final models. In the paper, we propose an automatic way to filter bad chains by comparing the scale of the resulting predictions (see this code, reproduced below).
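The original snippet is not part of this text, so the following is only a base-R sketch of the idea of comparing the scale of predictions across chains. All inputs are simulated assumptions: `G` stands in for a genotype matrix, `beta_auto` for a matrix with one column of effect estimates per chain, and the cutoff of three median absolute deviations around the median is an illustrative choice rather than a confirmed setting.

```r
set.seed(1)
n_ind <- 159; n_var <- 50; n_chain <- 10

# Simulated genotypes (0/1/2) and per-chain effect estimates (assumptions)
G <- matrix(rbinom(n_ind * n_var, size = 2, prob = 0.3), n_ind, n_var)
beta_true <- rnorm(n_var, sd = 0.1)
beta_auto <- sapply(seq_len(n_chain), function(i) beta_true + rnorm(n_var, sd = 0.01))
beta_auto[, 3] <- beta_auto[, 3] * 50   # mimic a chain that diverged
beta_auto[, 8] <- 0                     # mimic a chain that collapsed to zero

# One column of predictions per chain, then the scale (SD) of each column
pred_auto <- G %*% beta_auto
sc <- apply(pred_auto, 2, sd)

# Keep chains whose prediction scale is close to the median scale
keep <- abs(sc - median(sc)) < 3 * mad(sc)

# Average the effect estimates of the retained chains
final_beta <- rowMeans(beta_auto[, keep, drop = FALSE])
print(which(!keep))
```

The two deliberately broken chains are discarded because their prediction scale is far from the median; the averaged `final_beta` is then computed from the remaining chains only.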
However, in many cases, the ordering of the summary statistics differs, thus one must rename the columns according to their actual ordering. It also enables adjusting for covariates in this step (using parameter covar.train in big_univLogReg() or big_univLinReg()).

# Remove P-value = 0, which causes problem in the transformation
# Transform the P-values into correlation
# The cluster parameter is used for multi-threading
# You can ignore that if you do not wish to perform multi-threaded processing

We define polygenic risk scores, or polygenic scores, as a single value estimate of an individual’s propensity to a phenotype, calculated as a sum of their genome-wide genotypes weighted by corresponding genotype effect sizes – potentially scaled or shrunk – from summary statistic GWAS data.

You should also probably look at the code of the paper, particularly at the code to prepare summary statistics (including performing the quality control presented in the Methods section “Quality control of summary statistics” of the paper), at the code to read BGEN files into the data format used by bigsnpr, at the code to prepare LD matrices and at the code to run LDpred2 (genome-wide). We recommend using a window size of 3 cM (see ref).

The aim of this tutorial is to provide a simple introduction to PRS analyses to those new to PRS, while equipping existing users with a better understanding of the processes and implementation "underneath the hood" of popular PRS software. (2020).
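The code comments above mention removing P-values equal to 0 before transforming P-values into correlations. lassosum provides `p2cor()` for this step; the helper below is a base-R sketch of the standard t-statistic-to-correlation conversion, not lassosum's own code. The summary statistics, the SNP identifiers and the sample size of 5000 are made up for illustration.

```r
# Convert two-sided P-values to signed correlations via the t-statistic:
#   t = qt(p / 2, df = n - 2, upper tail),  r = t / sqrt(n - 2 + t^2)
p_to_cor <- function(p, n, sign = rep(1, length(p))) {
  t_stat <- qt(p / 2, df = n - 2, lower.tail = FALSE) * sign
  t_stat / sqrt(n - 2 + t_stat^2)
}

# Hypothetical summary statistics (assumption)
ss <- data.frame(
  snp  = c("rs1", "rs2", "rs3", "rs4"),
  beta = c(0.12, -0.08, 0.30, 0.00),
  p    = c(1e-3, 0.04, 0, 1)
)
n_gwas <- 5000

# A P-value of exactly 0 maps to an infinite t-statistic and a NaN correlation,
# so such rows are removed before the transformation
ss <- ss[ss$p > 0, ]
ss$cor <- p_to_cor(ss$p, n = n_gwas, sign = sign(ss$beta))
print(ss)
```

The sign of the effect size is carried through so that the correlation keeps the direction of effect, which the P-value alone does not encode.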
This is not the case here, which is probably because the data is so small.

# which are LD regions defined in Berisa and Pickrell (2015) for the European population and the hg19 genome.

In practice, you need to build it for variants from all chromosomes.

We assume that you have the following files (or you can download them from here):

While we do provide a rough guide on how to perform LDpred on bed files separated into individual chromosomes, this script is untested and extra caution is required. On some servers, you might need to first use the following code in order to run LDpred with multi-threading. LDpred2 authors recommend restricting the analysis to only the HapMap3 SNPs. Here, we know the exact ordering of the summary statistics file.

Please look at the code linked at the beginning. Also note that we separate both sparse and non-sparse models here (and in the paper) to show that their predictive performance is similar. Alternatively, we also provide an LD reference to be used directly, along with an example script on how to use it at https://doi.org/10.6084/m9.figshare.13034123. You can look at the path of the chains, as shown below. Here, these are simulated data so all variants use the same strand and the same reference.

In practice, we recommend testing multiple values for h2 and p. You can then choose the best model according to your preferred criterion (e.g. max AUC).

This tutorial provides a step-by-step guide to performing basic polygenic risk score (PRS) analyses and accompanies our PRS Guide paper. You can download data and unzip files in R. We store those files in a directory called "tmp-data" here. Note that these data are for educational purposes only, not for use as a reference panel.
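The recommendation above to test multiple values for h2 and p amounts to building a grid of hyper-parameters. A minimal base-R sketch is given below; the heritability estimate of 0.3, the multipliers, the range of p and the number of grid points are all assumptions chosen for illustration, and the log-spaced sequence is written with base R rather than a package helper.

```r
h2_est <- 0.3   # hypothetical heritability estimate (assumption)

# A few heritability values around the estimate
h2_seq <- round(h2_est * c(0.7, 1, 1.4), 4)

# Proportion of causal variants, spaced evenly on the log scale
p_seq <- signif(exp(seq(log(1e-4), log(1), length.out = 17)), 2)

# Every combination of p, h2 and the sparse option
grid <- expand.grid(p = p_seq, h2 = h2_seq, sparse = c(FALSE, TRUE))

nrow(grid)
head(grid)
```

Each row of `grid` corresponds to one candidate model; the candidates can then be compared on the validation set with whichever criterion is preferred.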