Some tips when working with PLINK2

4 minute read

Published:

PLINK is a well-established software for genetic analysis. In many projects, we use plink2 for genome-wide association study (GWAS) and other computations related to the raw genotype matrix. Here, we list several tips on the use of this software.

The pgen/psam/pvar format used in PLINK2 and pgenlibr

PLINK2 introduced pgen/pvar/psam format to efficiently store the genetic data. In statistical genetics, the genetic data is stored as a (compressed) matrix of size #(genetic variants) x #(individuals). The pgen binary file stores each entry of this matrix with 2 bits, and psam (text) file and pvar (text) file are the row (individuals) and column (variants) indices. Because it is common to have hundreds of thousands of genetic variants, PLINK2 supports zstd (ZStandard) compression of the pvar file.

While PLINK2 binary software supports many file formats used in genetic analysis software, the corresponding R binding pgenlibr works only with this pgen/pvar/psam format.

One can use plink2 binary software to convert your genetic dataset into this format. For example, the following --make-pgen (link to reference) command will convert your input.vcf file into binary_fileset.{pgen,pvar,psam}. Many other input formats are also supported in plink2.

plink2 --vcf input.vcf --make-pgen --out binary_fileset

Similarly, you can convert your genetic dataset into plink1.9 binary format (i.e. the triple of bed,bim,fam) with --make-bed (link to reference) command.

plink2 --vcf input.vcf --make-bed --out binary_fileset

Note that the file extension bed here refers to the binary PED format. It is nothing to do with UCSC’s BED (Browser Extensible Data) file format.

pgenlibr to read plink2 pgen files from R

To read genotype data stored in plink2 pgen/pvar/psam files, one can use an R package called pgenlibr. Here, we show some example usage of pgenlibr.

How to install pgenlibr in your R environment?

$ git clone https://github.com/chrchang/plink-ng.git
$ R
> install.packages('plink-ng/2.0/pgenlibr', repos = NULL, type='source')

Some basic usages of pgenlibr

One can use the sample genotype data hosted here in the R snpnet package, which fits a large-scale Lasso regression using the genetic data (stored in the pgen format).

require(pgenlibr)

pfile <- '@@@/sample_genotype_data'

pvar <- NewPvar(paste0(pfile, '.pvar.zst'))
pgen <- NewPgen(paste0(pfile, '.pgen'), pvar=pvar)

GetVariantId(pvar, 1)
GetVariantId(pvar, 10)
GetVariantCt(pvar)
GetRawSampleCt(pgen)

# read a single variant
buf <- pgenlibr::Buf(pgen)
pgenlibr::Read(pgen, buf, 1)

# read a list of variants
## var.idxs (list of variants you would like to read)
geno_mat <- ReadList(pgen, var.idxs, meanimpute=F)

It seems like the package is still an early development version without a full documentation. You can see the list of supported functions in its source code on GitHub.

Running the recessive model on chrX

As illustrated in the PLINK2 website, --glm recessive will be the modifier to run the GWAS scan with the recessive model. However, this does not work well for the X chromosome (chrX). To mitigate this limitation, we can record the chromosome X in the pvar file and pretend as if another autosome first and run the association analysis.

  1. Use --output-chr 26 to change the .pvar to encode chrX as “26”
  2. Run the original --glm command with e.g. --chr-set 40 (which specifies 40 instead of 22 autosomes).

GWAS of multiple quantitative traits

Starting plink2/20190401, the program supports efficient linear regression of multiple quantitative phenotypes.

–glm without –adjust now detects groups of quantitative phenotypes with the same “missingness pattern”, and processes them together

One should use this feature for efficient computation.

How to extract a pre-defined set of variants

Using plink2’s --extract subcommand, you can extract individual-level genotype data. For example, you may run something like this:

plink2 --pfile <plink2 pgen/pvar/psam> --extract <list of variants> --export vcf --out subset.vcf

Note: the example above assumes that you have converted your genotype dataset into plink2 file format (which I highly recommend), but the –extract subcommand should work for other input file formats.

Reference