The vcfR function extract.gt() extracts matrices of data from the GT
portion of VCF data, providing a link between VCF data and R. Much of R
is designed to operate on matrices of data, so once extract.gt()
provides this matrix the universe of R becomes available.
As an example of how to use extract.gt() we will extract the depth (DP)
data. Note that we use the ‘as.numeric = TRUE’ option here. We should
only use this option when we are certain that we have numeric data. If
you use it on non-numeric data R will do its best to do something, which
is not likely to be what you expect. We can use the queryMETA() function
to remind us what this element is.
queryMETA(vcf, element = 'FORMAT.+DP')
## [[1]]
## [1] "FORMAT=ID=DP"
## [2] "Number=1"
## [3] "Type=Integer"
## [4] "Description=Read Depth (only filtered reads used for calling)"
The queryMETA() function reports a description to tell us what the
acronym ‘DP’ means. It also reports the type of data this is. Here we
see that ‘DP’ is integer data. Because integers are a form of numerics
we can safely use as.numeric = TRUE.
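When the metadata is missing or ambiguous, one cautious approach is to extract the element as character data first and check that it looks numeric before converting. The following is a minimal base R sketch of that idea. The toy matrix and the helper looks_numeric() are assumptions made for illustration and are not part of vcfR.

```r
# Toy character matrix standing in for what extract.gt() returns when
# as.numeric is left at its default. The values are made up.
dp_chr <- matrix(c("5", "1", NA, "10", "4", NA),
                 nrow = 3,
                 dimnames = list(c("v1", "v2", "v3"), c("s1", "s2")))

# Hypothetical helper: TRUE when every non-missing entry is a plain number.
looks_numeric <- function(x) {
  vals <- x[!is.na(x)]
  all(grepl("^-?[0-9]+(\\.[0-9]+)?$", vals))
}

looks_numeric(dp_chr)           # depth-like values pass the check
looks_numeric(c("0/0", "1/1"))  # genotype strings do not

# Convert only when the check passes. Changing the storage mode keeps
# the dimensions and the row and column names of the matrix.
if (looks_numeric(dp_chr)) {
  dp_num <- dp_chr
  storage.mode(dp_num) <- "double"
}
```

A check like this is only a safeguard. Reading the Type field reported by queryMETA() remains the more direct way to decide whether conversion is appropriate.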
The GT portion of VCF data is not strictly tabular. We can observe this
by accessing the @gt slot of the vcfR object.
vcf@gt[1:4,1:4]
##      FORMAT           1                       10
## [1,] "GT:AD:DP:GQ:PL" "0/0:5,0:5:96:0,15,180" "1/1:0,10:10:99:255,30,0"
## [2,] "GT:AD:DP:GQ:PL" "1/1:0,1:1:66:36,3,0"   "0/0:4,0:4:94:0,12,144"
## [3,] "GT:AD:DP:GQ:PL" NA                      NA
## [4,] "GT:AD:DP:GQ:PL" "1/1:0,5:5:96:180,15,0" "0/0:8,0:8:99:0,24,255"
##      13
## [1,] "0/0:9,0:9:99:0,27,255"
## [2,] "0/0:1,0:1:66:0,3,36"
## [3,] "0/0:3,0:3:88:0,9,108"
## [4,] NA
The first column reports the format for the subsequent columns. This is
a colon-delimited string of abbreviations for the data fields, listed in
the same order as the values appear in each sample column. Different
variants (rows) can have different formats, so each row needs to be
processed independently. The extract.gt() function finds the position of
the requested element in the format string and extracts the value at
that position from each subsequent column. Here we’ve also used the
option to convert the data to numerical data. The default is to leave
the data as character data.
dp <- extract.gt(vcf, element = "DP", as.numeric=TRUE)
dp[1:4,1:3]
##          1 10 13
## S1_4509  5 10  9
## S1_4657  1  4  1
## S1_5193 NA NA  3
## S1_5647  5  8 NA
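The position lookup described above can be sketched in a few lines of base R. This is a simplified illustration of the idea and not vcfR's actual implementation. The toy matrix copies a few cells from the @gt output shown earlier, and extract_field() is a hypothetical helper introduced only for this example.

```r
# Toy copy of three rows and two sample columns from the @gt slot.
gt <- matrix(c("GT:AD:DP:GQ:PL", "0/0:5,0:5:96:0,15,180", "1/1:0,10:10:99:255,30,0",
               "GT:AD:DP:GQ:PL", "1/1:0,1:1:66:36,3,0",   "0/0:4,0:4:94:0,12,144",
               "GT:AD:DP:GQ:PL", NA,                      NA),
             nrow = 3, byrow = TRUE,
             dimnames = list(NULL, c("FORMAT", "1", "10")))

# Hypothetical helper, not part of vcfR: returns a character matrix
# holding the requested element for every variant and sample.
extract_field <- function(gt, element) {
  out <- matrix(NA_character_, nrow = nrow(gt), ncol = ncol(gt) - 1,
                dimnames = list(rownames(gt), colnames(gt)[-1]))
  for (i in seq_len(nrow(gt))) {
    # Each row is handled on its own because FORMAT can differ by variant.
    keys <- strsplit(gt[i, "FORMAT"], ":", fixed = TRUE)[[1]]
    pos <- match(element, keys)
    if (is.na(pos)) next  # element absent from this variant
    for (j in seq_len(ncol(out))) {
      cell <- gt[i, j + 1]
      if (is.na(cell)) next  # missing genotype stays NA
      fields <- strsplit(cell, ":", fixed = TRUE)[[1]]
      if (pos <= length(fields)) out[i, j] <- fields[pos]
    }
  }
  out
}

dp_chr <- extract_field(gt, "DP")
dp_chr
```

The sketch returns character data, mirroring the default described above. The extract.gt() call shown earlier does this work for every variant and, with as.numeric = TRUE, also converts the result to numbers.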
We now have a matrix of numerical ‘DP’ data with the sample names as
column names. Samples (columns) or variants (rows) can be accessed with
the square brackets ([,]). If you need a matrix where the samples are in
rows and the variants are in columns you can use the transpose function
(t()). We have now taken our VCF data and extracted it into a form that
makes it available to much of the broad spectrum of existing R packages
and functions.
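As a brief illustration of what becomes possible once the data are in a matrix, the sketch below rebuilds the four rows and three columns printed above as a toy matrix, so that it runs without vcfR, and applies a few base R functions to it. The summaries chosen here are examples only and are not prescribed by vcfR.

```r
# Toy matrix reproducing the dp[1:4, 1:3] output shown above.
dp <- matrix(c( 5, 10,  9,
                1,  4,  1,
               NA, NA,  3,
                5,  8, NA),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("S1_4509", "S1_4657", "S1_5193", "S1_5647"),
                             c("1", "10", "13")))

dp[, "10"]       # all variants for one sample (column)
dp["S1_4657", ]  # all samples for one variant (row)

# Mean depth per sample, ignoring missing values.
sample_means <- colMeans(dp, na.rm = TRUE)

# Number of samples with missing depth at each variant.
missing_per_variant <- rowSums(is.na(dp))

# Transpose so that samples are in rows and variants are in columns.
dp_t <- t(dp)
dim(dp_t)
```

Because the result is an ordinary numeric matrix, the same pattern extends to apply(), plotting functions, and other packages that expect matrix input.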
Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.
USDA Agricultural Research Service, Horticultural Crops Research Lab.