GWAS and genetic analyses with PLINK2 and pgenlibr
Published:
PLINK is a well-established software for genetic analysis. In many projects, we use plink2
for genome-wide association studies (GWAS) and other genetic analyses using the raw genotype matrix.
One of the key advantages of plink2
is its pgen
/pvar
/psam
format that efficiently stores large-scale genotype dataset. As a part of plink2
software, there is an R binding called pgenlibr
that allows us to read genotype data from pgen file.
Here, we list several tips regarding the use of plink2
, pgen
format, and pgenlibr
.
The pgen/psam/pvar format used in PLINK2 and pgenlibr
PLINK2 introduced pgen/pvar/psam format to efficiently store the genetic data. In statistical genetics, genetic data is stored as a (compressed) matrix of size #(genetic variants) x #(individuals)
. A pgen
binary file stores each entry of this matrix with 2 bits, and psam
(text) file and pvar
(text) file are the row (individuals) and column (variants) indices. Because it is not uncommon to have hundreds of thousands of genetic variants, PLINK2 supports zstd
(ZStandard) compression of the pvar
file.
While PLINK2 binary software supports many file formats used in genetic analysis software, its corresponding R binding pgenlibr
works only with this pgen/pvar/psam format.
How to convert your genetic data into plink format
One can use plink2
binary software to convert your genetic dataset into this format. For example, the following --make-pgen
(link to reference) command will convert your input.vcf
file into binary_fileset.{pgen,pvar,psam}
. Many other input formats are also supported in plink2.
plink2 --vcf input.vcf --make-pgen --out binary_fileset
Similarly, you can convert your genetic dataset into plink1.9 binary format (i.e. the triple of bed,bim,fam) with --make-bed
(link to reference) command.
plink2 --vcf input.vcf --make-bed --out binary_fileset
Note that the file extension bed
here refers to the binary PED format. It is nothing to do with UCSC’s BED (Browser Extensible Data) file format.
pgenlibr
to read plink2 pgen files from R
To read genotype data stored in plink2
pgen/pvar/psam files, one can use an R package called pgenlibr
. Here, we show some example usage of pgenlibr
.
How to install pgenlibr
in your R environment?
The source code of the pgenlibr
package is available on GitHub. You can use devtools
package to install pgenlibr
as follows:
devtools::install_github('chrchang/plink-ng', subdir='2.0/pgenlibr');
If you don’t have devtools
installed in your R environment, you can clone the repo and install it from there.
$ git clone https://github.com/chrchang/plink-ng.git
$ R
> install.packages(
'plink-ng/2.0/pgenlibr',
repos = NULL, type='source'
)
Some basic usages of pgenlibr
One can use the sample genotype data hosted here in the R snpnet package, which fits a large-scale Lasso regression using the genetic data (stored in the pgen format).
require(pgenlibr)
pfile <- '@@@/sample_genotype_data'
pvar <- NewPvar(paste0(pfile, '.pvar.zst'))
pgen <- NewPgen(paste0(pfile, '.pgen'), pvar=pvar)
GetVariantId(pvar, 1)
GetVariantId(pvar, 10)
GetVariantCt(pvar)
GetRawSampleCt(pgen)
# read a single variant
buf <- pgenlibr::Buf(pgen)
pgenlibr::Read(pgen, buf, 1)
# read a list of variants
## var.idxs (list of variants you would like to read)
geno_mat <- ReadList(pgen, var.idxs, meanimpute=F)
It seems like the package is still an early development version without a full documentation. You can see the list of supported functions in its source code on GitHub.
Genome-wide association analysis (GWAS) with plink2
Here, we show some example usages of plink2
as well as some tips.
GWAS of binary or quantitative traits
You may run GWAS of binary or quantitative traits using genotype in genotype.{pgen,pvar.zst,psam}
, phenotype in phenotype.phe.gz
, and covariates in covariates.phe
, using command like this:
plink2 \
--memory 16000 \
--threads 6 \
--out out/gwas_sumstats \
--keep list_of_individuals_to_keep.phe \
--pfile genotype vzs \
--pheno phenotype.phe.gz \
--pheno-quantile-normalize \
--chr 1-22,X,XY,Y,MT \
--covar covariates.phe \
--covar-name age sex PC1-PC10 \
--covar-variance-standardize \
--extract variants_of_interest.txt \
--glm skip-invalid-pheno firth-fallback hide-covar omit-ref no-x-sex
GWAS of multiple quantitative traits
Starting plink2/20190401
, the program supports efficient linear regression of multiple quantitative phenotypes.
–glm without –adjust now detects groups of quantitative phenotypes with the same “missingness pattern”, and processes them together
One should use this feature for efficient computation.
Running the recessive model on chrX
As illustrated in PLINK2’s website, --glm recessive
will be the modifier to run the GWAS scan with the recessive model. However, this does not work well for the X chromosome (chrX). To mitigate this limitation, we can record the chromosome X in the pvar file and pretend as if another autosome first and run the association analysis.
- Use
--output-chr 26
to change the.pvar
to encode chrX as “26” - Run the original
--glm
command with e.g.--chr-set 40
(which specifies 40 instead of 22 autosomes).
Computing polygenic scores for each individual with plink2
Given coefficients of polygenic scores (PGS), you can compute the polygenic score for each individual in your individual-level genotype dataset. Let’s say you have a coefficient file called PGS.betas.tsv
and it contains the variant ID, the effect allele, and the PGS effect size (BETA) in 3rd, 5th, and 6th columns, respectively. You can download some example PGS coefficient files from our recent study (Tanigawa et al., 2022) hosted at Global Biobank Engine.
plink2 \
--memory 16000 \
--threads 6 \
--out out/pgs_out \
--pfile genotype vzs \
--score \
PGS.betas.tsv \
3 5 6 header-read
- If you would like to focus on the subset of individuals, you may specify the list of individuals to consider using
--keep
flag. - You need to make sure the variant IDs in the score file matches with the variant IDs in your genotype dataset (i.e. the ones listed in
pvar
file). You can use your scripting langage of your choice (Python/R) to convert IDs.
Other miscellaneous tips
How to extract a pre-defined set of variants
Using plink2’s --extract
subcommand, you can extract individual-level genotype data. For example, you may run something like this:
plink2 \
--pfile <plink2 pgen/pvar/psam> \
--extract <list of variants> \
--export vcf \
--out subset.vcf
Note: the example above assumes that you have converted your genotype dataset into plink2 file format (which I highly recommend), but the –extract subcommand should work for other input file formats.
References
- PLINK 2.0 website
- PLINK 2.0 GitHub / pgenlibr
- The pgen/pvar/psam format specification
- Example genetic datasets in
snpent
packagesnpnet
usespgenlibr
to read large-scale genetic data and fit penalized regression models. Please check out our snpnet paper for more information.