vcfR documentation

by
Brian J. Knaus and Niklaus J. Grünwald

Extracting data matrices

The vcfR function extract.gt() is used to extract matrices of data from the GT portion of VCF data. The function extract.gt() provides a link between VCF data and R. Much of R is designed to operate on matrices of data, and once extract.gt() provides this matrix the universe of R becomes available.

Querying the meta data

As an example of how to use extract.gt() we will extract the depth (DP) data. Note that we use the ‘as.numeric=TRUE’ option here. We should only use this option when we are certain that we have numeric data. If you use it on non-numeric data, R will do its best to do something, which is not likely to be what you expect. We can use the queryMETA() function to remind us what this element is.

queryMETA(vcf, element = 'FORMAT.+DP')
## [[1]]
## [1] "FORMAT=ID=DP"                                                 
## [2] "Number=1"                                                     
## [3] "Type=Integer"                                                 
## [4] "Description=Read Depth (only filtered reads used for calling)"

The queryMETA() function reports a description to tell us what the acronym ‘DP’ means. It also reports the type of data this is. Here we see that ‘DP’ is integer data. Because integers are a form of numerics we can safely use as.numeric = TRUE.
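The check described above can also be made in code. The following is a minimal base R sketch, assuming the lines reported by queryMETA() have been copied by hand into a character vector; the names `meta`, `type` and `is_numeric_type` are illustrative and are not part of vcfR.

```r
# The lines reported by queryMETA() above, copied into a character vector.
meta <- c("FORMAT=ID=DP",
          "Number=1",
          "Type=Integer",
          "Description=Read Depth (only filtered reads used for calling)")

# Find the 'Type=' entry and strip the key, leaving the declared type.
type <- sub("^Type=", "", grep("^Type=", meta, value = TRUE))

# Integer and Float are the numeric types defined by the VCF specification.
is_numeric_type <- type %in% c("Integer", "Float")
```

The ‘Number=1’ entry is worth checking as well, because an element that holds several comma-separated values per sample (such as ‘AD’ in the genotype strings) is not a single number even when its type is numeric.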

Extract depth (DP)

The GT portion of VCF data is not strictly tabular. We can observe this by accessing the @gt slot of the vcfR object.

vcf@gt[1:4,1:4]
##      FORMAT           1                       10                       
## [1,] "GT:AD:DP:GQ:PL" "0/0:5,0:5:96:0,15,180" "1/1:0,10:10:99:255,30,0"
## [2,] "GT:AD:DP:GQ:PL" "1/1:0,1:1:66:36,3,0"   "0/0:4,0:4:94:0,12,144"  
## [3,] "GT:AD:DP:GQ:PL" NA                      NA                       
## [4,] "GT:AD:DP:GQ:PL" "1/1:0,5:5:96:180,15,0" "0/0:8,0:8:99:0,24,255"  
##      13                     
## [1,] "0/0:9,0:9:99:0,27,255"
## [2,] "0/0:1,0:1:66:0,3,36"  
## [3,] "0/0:3,0:3:88:0,9,108" 
## [4,] NA

The first column reports the format for subsequent columns. This is a colon-delimited string of abbreviations for the data in the subsequent columns, listed in the same order in which the data appear. Different variants (rows) can have different formats, so each row needs to be processed independently. The extract.gt() function finds the position of the element you are interested in within the format string and extracts the value at that position from each of the subsequent columns. In the extract.gt() call for the DP data we also use the option to convert the data to numeric. The default is to leave the data as character data.
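The position lookup described above can be illustrated in base R. This is a minimal sketch of the idea only and is not how vcfR implements it; the helper name `extract_field` is hypothetical, and the format and genotype strings are copied from the first row of the output above.

```r
# Hypothetical helper (not part of vcfR): pull one element out of a
# colon-delimited genotype string, using the FORMAT string to find its position.
extract_field <- function(format, genotype, element) {
  # Missing genotypes stay missing.
  if (is.na(genotype)) return(NA_character_)
  # Split the FORMAT string into its abbreviations, e.g. "GT" "AD" "DP" "GQ" "PL".
  keys <- strsplit(format, ":", fixed = TRUE)[[1]]
  # Position of the requested element; NA if this row does not contain it.
  pos <- match(element, keys)
  if (is.na(pos)) return(NA_character_)
  # Split the genotype string the same way and take the value at that position.
  values <- strsplit(genotype, ":", fixed = TRUE)[[1]]
  values[pos]
}

fmt <- "GT:AD:DP:GQ:PL"
gt1 <- "0/0:5,0:5:96:0,15,180"

dp_chr <- extract_field(fmt, gt1, "DP")  # character value
dp_num <- as.numeric(dp_chr)             # numeric conversion, as a separate step
```

Because the format string is read for every row, the sketch still works when variants list their elements in different orders. With vcfR itself, all of this is handled by a single call to extract.gt().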

dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
dp[1:4,1:3]
##          1 10 13
## S1_4509  5 10  9
## S1_4657  1  4  1
## S1_5193 NA NA  3
## S1_5647  5  8 NA

We now have a matrix of numerical ‘DP’ data with the sample names as column names and the variant names as row names. Samples (columns) or variants (rows) can be accessed with the square brackets ([,]). If you need a matrix where the samples are in rows and the variants are in columns you can use the transpose function (t()). We have now taken our VCF data and extracted it into a form that makes it available to much of the broad spectrum of existing R packages and functions.
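As a short illustration of the access patterns described above, the sketch below rebuilds the four rows of depth data shown earlier as an ordinary base R matrix, so that it runs without the VCF file. The summary with colMeans() is only an example of a base R function that operates on matrices and is not part of vcfR.

```r
# Rebuild the small DP matrix shown above so the example is self-contained.
dp <- matrix(
  c( 5, 10,  9,
     1,  4,  1,
    NA, NA,  3,
     5,  8, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("S1_4509", "S1_4657", "S1_5193", "S1_5647"),
                  c("1", "10", "13"))
)

dp[, "10"]       # one sample (column), selected by name
dp["S1_4657", ]  # one variant (row), selected by name

dp_t <- t(dp)    # transpose: samples in rows, variants in columns
dim(dp_t)

# Mean depth per sample, ignoring missing genotypes.
mean_dp <- colMeans(dp, na.rm = TRUE)
```

Selecting by name rather than by numeric index keeps the code correct if the order of samples or variants changes.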


Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.

USDA Agricultural Research Service, Horticultural Crops Research Lab.