GWAS and genetic analyses with PLINK2 and pgenlibr

6 minute read

Published:

PLINK is a well-established software for genetic analysis. In many projects, we use plink2 for genome-wide association studies (GWAS) and other genetic analyses using the raw genotype matrix.

One of the key advantages of plink2 is its pgen/pvar/psam format that efficiently stores large-scale genotype dataset. As a part of plink2 software, there is an R binding called pgenlibr that allows us to read genotype data from pgen file.

Here, we list several tips regarding the use of plink2, pgen format, and pgenlibr.

The pgen/psam/pvar format used in PLINK2 and pgenlibr

PLINK2 introduced pgen/pvar/psam format to efficiently store the genetic data. In statistical genetics, genetic data is stored as a (compressed) matrix of size #(genetic variants) x #(individuals). A pgen binary file stores each entry of this matrix with 2 bits, and psam (text) file and pvar (text) file are the row (individuals) and column (variants) indices. Because it is not uncommon to have hundreds of thousands of genetic variants, PLINK2 supports zstd (ZStandard) compression of the pvar file.

While PLINK2 binary software supports many file formats used in genetic analysis software, its corresponding R binding pgenlibr works only with this pgen/pvar/psam format.

One can use plink2 binary software to convert your genetic dataset into this format. For example, the following --make-pgen (link to reference) command will convert your input.vcf file into binary_fileset.{pgen,pvar,psam}. Many other input formats are also supported in plink2.

plink2 --vcf input.vcf --make-pgen --out binary_fileset

Similarly, you can convert your genetic dataset into plink1.9 binary format (i.e. the triple of bed,bim,fam) with --make-bed (link to reference) command.

plink2 --vcf input.vcf --make-bed --out binary_fileset

Note that the file extension bed here refers to the binary PED format. It is nothing to do with UCSC’s BED (Browser Extensible Data) file format.

pgenlibr to read plink2 pgen files from R

To read genotype data stored in plink2 pgen/pvar/psam files, one can use an R package called pgenlibr. Here, we show some example usage of pgenlibr.

How to install pgenlibr in your R environment?

The source code of the pgenlibr package is available on GitHub. You can use devtools package to install pgenlibr as follows:

devtools::install_github('chrchang/plink-ng', subdir='2.0/pgenlibr');

If you don’t have devtools installed in your R environment, you can clone the repo and install it from there.

$ git clone https://github.com/chrchang/plink-ng.git
$ R
> install.packages(
    'plink-ng/2.0/pgenlibr',
    repos = NULL, type='source'
  )

Some basic usages of pgenlibr

One can use the sample genotype data hosted here in the R snpnet package, which fits a large-scale Lasso regression using the genetic data (stored in the pgen format).

require(pgenlibr)

pfile <- '@@@/sample_genotype_data'

pvar <- NewPvar(paste0(pfile, '.pvar.zst'))
pgen <- NewPgen(paste0(pfile, '.pgen'), pvar=pvar)

GetVariantId(pvar, 1)
GetVariantId(pvar, 10)
GetVariantCt(pvar)
GetRawSampleCt(pgen)

# read a single variant
buf <- pgenlibr::Buf(pgen)
pgenlibr::Read(pgen, buf, 1)

# read a list of variants
## var.idxs (list of variants you would like to read)
geno_mat <- ReadList(pgen, var.idxs, meanimpute=F)

It seems like the package is still an early development version without a full documentation. You can see the list of supported functions in its source code on GitHub.

Genome-wide association analysis (GWAS) with plink2

Here, we show some example usages of plink2 as well as some tips.

GWAS of binary or quantitative traits

You may run GWAS of binary or quantitative traits using genotype in genotype.{pgen,pvar.zst,psam}, phenotype in phenotype.phe.gz, and covariates in covariates.phe, using command like this:

plink2 \
  --memory 16000 \
  --threads 6 \
  --out out/gwas_sumstats \
  --keep list_of_individuals_to_keep.phe \
  --pfile genotype vzs \
  --pheno phenotype.phe.gz \
  --pheno-quantile-normalize \
  --chr 1-22,X,XY,Y,MT \
  --covar covariates.phe \
  --covar-name age sex PC1-PC10 \
  --covar-variance-standardize \
  --extract variants_of_interest.txt \
  --glm skip-invalid-pheno firth-fallback hide-covar omit-ref no-x-sex

GWAS of multiple quantitative traits

Starting plink2/20190401, the program supports efficient linear regression of multiple quantitative phenotypes.

–glm without –adjust now detects groups of quantitative phenotypes with the same “missingness pattern”, and processes them together

One should use this feature for efficient computation.

Running the recessive model on chrX

As illustrated in PLINK2’s website, --glm recessive will be the modifier to run the GWAS scan with the recessive model. However, this does not work well for the X chromosome (chrX). To mitigate this limitation, we can record the chromosome X in the pvar file and pretend as if another autosome first and run the association analysis.

  1. Use --output-chr 26 to change the .pvar to encode chrX as “26”
  2. Run the original --glm command with e.g. --chr-set 40 (which specifies 40 instead of 22 autosomes).

Computing polygenic scores for each individual with plink2

Given coefficients of polygenic scores (PGS), you can compute the polygenic score for each individual in your individual-level genotype dataset. Let’s say you have a coefficient file called PGS.betas.tsv and it contains the variant ID, the effect allele, and the PGS effect size (BETA) in 3rd, 5th, and 6th columns, respectively. You can download some example PGS coefficient files from our recent study (Tanigawa et al., 2022) hosted at Global Biobank Engine.

plink2 \
  --memory 16000 \
  --threads 6 \
  --out out/pgs_out \
  --pfile genotype vzs \
  --score \
    PGS.betas.tsv \
    3 5 6 header-read
  • If you would like to focus on the subset of individuals, you may specify the list of individuals to consider using --keep flag.
  • You need to make sure the variant IDs in the score file matches with the variant IDs in your genotype dataset (i.e. the ones listed in pvar file). You can use your scripting langage of your choice (Python/R) to convert IDs.

Other miscellaneous tips

How to extract a pre-defined set of variants

Using plink2’s --extract subcommand, you can extract individual-level genotype data. For example, you may run something like this:

plink2 \
  --pfile <plink2 pgen/pvar/psam> \
  --extract <list of variants> \
  --export vcf \
  --out subset.vcf

Note: the example above assumes that you have converted your genotype dataset into plink2 file format (which I highly recommend), but the –extract subcommand should work for other input file formats.

References