Some tips when working with PLINK2
Published:
PLINK is a well-established software for genetic analysis. In many projects, we use plink2
for genome-wide association study (GWAS) and other computations related to the raw genotype matrix. Here, we list several tips on the use of this software.
The pgen/psam/pvar format used in PLINK2 and pgenlibr
PLINK2 introduced pgen/pvar/psam format to efficiently store the genetic data. In statistical genetics, the genetic data is stored as a (compressed) matrix of size #(genetic variants) x #(individuals)
. The pgen
binary file stores each entry of this matrix with 2 bits, and psam
(text) file and pvar
(text) file are the row (individuals) and column (variants) indices. Because it is common to have hundreds of thousands of genetic variants, PLINK2 supports zstd
(ZStandard) compression of the pvar
file.
While PLINK2 binary software supports many file formats used in genetic analysis software, the corresponding R binding pgenlibr
works only with this pgen/pvar/psam format.
How to convert your genetic data into plink format
One can use plink2
binary software to convert your genetic dataset into this format. For example, the following --make-pgen
(link to reference) command will convert your input.vcf
file into binary_fileset.{pgen,pvar,psam}
. Many other input formats are also supported in plink2.
plink2 --vcf input.vcf --make-pgen --out binary_fileset
pgenlibr
to read plink2 pgen files from R
To read genotype data stored in plink2
pgen/pvar/psam files, one can use an R package called pgenlibr
. Here, we show some example usage of pgenlibr
.
How to install pgenlibr
in your R environment?
$ git clone https://github.com/chrchang/plink-ng.git
$ R
> install.packages('plink-ng/2.0/pgenlibr', repos = NULL, type='source')
Some of the basic usages of pgenlibr
One can use the sample genotype data hosted here in the R snpnet package, which fits a large-scale Lasso regression using the genetic data (stored in the pgen format).
require(pgenlibr)
pfile <- '@@@/sample_genotype_data'
pvar <- NewPvar(paste0(pfile, '.pvar.zst'))
pgen <- NewPgen(paste0(pfile, '.pgen'), pvar=pvar)
GetVariantId(pvar, 1)
GetVariantId(pvar, 10)
GetVariantCt(pvar)
GetRawSampleCt(pgen)
# read a single variant
buf <- pgenlibr::Buf(pgen)
pgenlibr::Read(pgen, buf, 1)
# read a list of variants
## var.idxs (list of variants you would like to read)
geno_mat <- ReadList(pgen, var.idxs, meanimpute=F)
It seems like the package is still an early development version without a full documentation. You can see the list of supported functions in its source code on GitHub.
Running the recessive model on chrX
As illustrated in the PLINK2 website, --glm recessive
will be the modifier to run the GWAS scan with the recessive model. However, this does not work well for the X chromosome (chrX). To mitigate this limitation, we can record the chromosome X in the pvar file and pretend as if another autosome first and run the association analysis.
- Use
--output-chr 26
to change the.pvar
to encode chrX as “23” - Run the original
--glm
command with e.g.--chr-set 40
(which specifies 40 instead of 22 autosomes).
GWAS of multiple quantitative traits
Starting plink2/20190401
, the program supports efficient linear regression of multiple quantitative phenotypes.
–glm without –adjust now detects groups of quantitative phenotypes with the same “missingness pattern”, and processes them together
One should use this feature for efficient computation.
How to extract a pre-defined set of variants
Using plink2’s --extract
subcommand, you can extract individual-level genotype data. For example, you may run something like this:
plink2 --pfile <plink2 pgen/pvar/psam> --extract <list of variants> --export vcf --out subset.vcf
Note: the example above assumes that you have converted your genotype dataset into plink2 file format (which I highly recommend), but the –extract subcommand should work for other input file formats.
Reference
- PLINK 2.0 website
- PLINK 2.0 GitHub / pgenlibr
- The pgen/pvar/psam format specification
- Example genetic datasets in
snpent
packagesnpnet
usespgenlibr
to read large-scale genetic data and fit penalized regression models. Please check out our snpnet paper for more information.