Extracting matrices from vcfR objects

vcfR documentation

by
Brian J. Knaus and Niklaus J. Grünwald

The data contained in VCF files presents a challenge to analysis in that the data are not strictly tabular and are large. Because the data are not strictly tabular we need to find ways to coerce the data into tables or matrices. Once the data is stored as a matrix it can be operated on by numerous R functions and packages. Because the data are large, this process needs to happen in a computationally efficent manner. Here we explore how vcfR facilitates these tasks.

Extracting data matrices

Typical VCF datasets are so large that it is not practical to view them in their entirety. Here we use the example dataset vcfR_test which comes with vcfR. This is a nice dataset because it is small enough to visualize.

library(vcfR)
data("vcfR_test")
head(vcfR_test)

## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.3"
## [1] "##fileDate=20090805"
## [1] "##source=myImputationProgramV3.1"
## [1] "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"
## [1] "##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d [Truncated]"
## [1] "##phasing=partial"
## [1] "First 6 rows."
## [1] 
## [1] "***** Fixed section *****"
##      CHROM POS       ID          REF   ALT      QUAL FILTER
## [1,] "20"  "14370"   "rs6054257" "G"   "A"      "29" "PASS"
## [2,] "20"  "17330"   NA          "T"   "A"      "3"  "q10" 
## [3,] "20"  "1110696" "rs6040355" "A"   "G,T"    "67" "PASS"
## [4,] "20"  "1230237" NA          "T"   NA       "47" "PASS"
## [5,] "20"  "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
## [1] 
## [1] "***** Genotype section *****"
##      FORMAT        NA00001          NA00002          NA00003       
## [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
## [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
## [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2"   "2/2:35:4"    
## [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"    
## [5,] "GT:GQ:DP"    "0/1:35:4"       "0/2:17:2"       "1/1:40:3"    
## [1] 
## [1] "Unique GT formats:"
## [1] "GT:GQ:DP:HQ" "GT:GQ:DP"   
## [1]

In the genotype section we see four columns: a FORMAT column that indicates the contents of subsequent columns and three sample columns. Variants one through four contain four pieces of information with each genotype (GT:GQ:DP:HQ). The meaning of these acronyms should be defined in the meta section. Each element of the genotype matrix therefore contains a colon delimeted list that we need to parse before we can work with it. For example, if we wanted to analyze teh genotypes we would need to isolate them from the other information. Thsi can be accomplished using the function extract.gt().

gt <- extract.gt(vcfR_test)
gt

##            NA00001 NA00002 NA00003
## rs6054257  "0|0"   "1|0"   "1/1"  
## 20_17330   "0|0"   "0|1"   "0/0"  
## rs6040355  "1|2"   "2|1"   "2/2"  
## 20_1230237 "0|0"   "0|0"   "0/0"  
## microsat1  "0/1"   "0/2"   "1/1"

When a genotype section is included in VCF data the one mandatory field is the genotype (GT) and it must be the first. The genotype is the default field for extract.gt() to work on. Note that the samplers names have been used to name the columns. The variant names (row names) are a little more complicated. The ‘ID’ column of the fixed section contains optional unique identifiers. These typically come from some sort of database of variants. If this column were entirely populated with unique identifiers we could use it to name our variants. However, all variants do not always have an ID. For example, variants 2 and 4 have missing IDs (NA). The function extract.gt() tries to name these variants by combining the chromosome (CHROM) and the position (POS) with the delimiter ’_’ to create a name. We see in our matrix of genotypes that they all have a name.

We can also extract numeric data from our VCF data. We can do this by selecting a numeric element and setting the parameter as.numeric = TRUE.

gt <- extract.gt(vcfR_test, element = 'DP', as.numeric = TRUE)
gt

##            NA00001 NA00002 NA00003
## rs6054257        1       8       5
## 20_17330         3       5       3
## rs6040355        6       0       4
## 20_1230237       7       4       2
## microsat1        4       2       3

Here we’ve extracted the sequence depth (DP) of each variant and converted it to a numeric. Because they are numeric they may be used for quantitative comparisons. Much of R is designed to work with matrices of numbers. Be careful setting the parameter as.numeric = TRUE. It will do its best to convert the data to a numeric, even if this may not make sense. For example, we could convert the genotype to numeric.

gt <- extract.gt(vcfR_test, element = 'GT', as.numeric = TRUE)
gt

##            NA00001 NA00002 NA00003
## rs6054257        0       1       1
## 20_17330         0       0       0
## rs6040355        1       2       2
## 20_1230237       0       0       0
## microsat1        0       0       1

This operation did not throw an error, but the resulting matrix doesn’t appear to make much sense. Only ask to convert numeric data to numeric data or you will probably get unexpected results.

Matrix parsing

Some elements fromthe genotype section may require further parsing. In our example the haplotype quality (HQ) includes a quality for each haplotype and these values are comma delimited.

gt <- extract.gt(vcfR_test, element = 'HQ')
gt

##            NA00001 NA00002 NA00003
## rs6054257  "51,51" "51,51" ".,."  
## 20_17330   "58,50" "65,3"  NA     
## rs6040355  "23,27" "18,2"  NA     
## 20_1230237 "56,60" "51,51" NA     
## microsat1  NA      NA      NA

Note that some elements contain data while in others it is missing.

myHQ1 <- masplit(gt[,1:2], sort = 0)
myHQ1

##            NA00001 NA00002
## rs6054257       51      51
## 20_17330        58      65
## rs6040355       23      18
## 20_1230237      56      51
## microsat1       NA      NA

The function masplit() is fairly flexible in that you can choose which element to return and you can also sort the data prior to selection. These functions should help make VCF data accessible to the wealth of existing R functions and packages.

USDA Agricultural Research Service, Horticultural Crops Research Lab.