The data contained in VCF files presents a challenge to analysis in that the data are not strictly tabular and are large. Because the data are not strictly tabular we need to find ways to coerce the data into tables or matrices. Once the data is stored as a matrix it can be operated on by numerous R functions and packages. Because the data are large, this process needs to happen in a computationally efficent manner. Here we explore how vcfR facilitates these tasks.
Typical VCF datasets are so large that it is not practical to view
them in their entirety. Here we use the example dataset
vcfR_test
which comes with vcfR. This is a nice dataset
because it is small enough to visualize.
library(vcfR)
data("vcfR_test")
head(vcfR_test)
## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.3"
## [1] "##fileDate=20090805"
## [1] "##source=myImputationProgramV3.1"
## [1] "##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta"
## [1] "##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d [Truncated]"
## [1] "##phasing=partial"
## [1] "First 6 rows."
## [1]
## [1] "***** Fixed section *****"
## CHROM POS ID REF ALT QUAL FILTER
## [1,] "20" "14370" "rs6054257" "G" "A" "29" "PASS"
## [2,] "20" "17330" NA "T" "A" "3" "q10"
## [3,] "20" "1110696" "rs6040355" "A" "G,T" "67" "PASS"
## [4,] "20" "1230237" NA "T" NA "47" "PASS"
## [5,] "20" "1234567" "microsat1" "GTC" "G,GTCT" "50" "PASS"
## [1]
## [1] "***** Genotype section *****"
## FORMAT NA00001 NA00002 NA00003
## [1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
## [2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3" "0/0:41:3"
## [3,] "GT:GQ:DP:HQ" "1|2:21:6:23,27" "2|1:2:0:18,2" "2/2:35:4"
## [4,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"
## [5,] "GT:GQ:DP" "0/1:35:4" "0/2:17:2" "1/1:40:3"
## [1]
## [1] "Unique GT formats:"
## [1] "GT:GQ:DP:HQ" "GT:GQ:DP"
## [1]
In the genotype section we see four columns: a FORMAT column that
indicates the contents of subsequent columns and three sample columns.
Variants one through four contain four pieces of information with each
genotype (GT:GQ:DP:HQ). The meaning of these acronyms should be defined
in the meta section. Each element of the genotype matrix therefore
contains a colon delimeted list that we need to parse before we can work
with it. For example, if we wanted to analyze teh genotypes we would
need to isolate them from the other information. Thsi can be
accomplished using the function extract.gt()
.
gt <- extract.gt(vcfR_test)
gt
## NA00001 NA00002 NA00003
## rs6054257 "0|0" "1|0" "1/1"
## 20_17330 "0|0" "0|1" "0/0"
## rs6040355 "1|2" "2|1" "2/2"
## 20_1230237 "0|0" "0|0" "0/0"
## microsat1 "0/1" "0/2" "1/1"
When a genotype section is included in VCF data the one mandatory
field is the genotype (GT) and it must be the first. The genotype is the
default field for extract.gt()
to work on. Note that the
samplers names have been used to name the columns. The variant names
(row names) are a little more complicated. The ‘ID’ column of the fixed
section contains optional unique identifiers. These typically come from
some sort of database of variants. If this column were entirely
populated with unique identifiers we could use it to name our variants.
However, all variants do not always have an ID. For example, variants 2
and 4 have missing IDs (NA). The function extract.gt()
tries to name these variants by combining the chromosome (CHROM) and the
position (POS) with the delimiter ’_’ to create a name. We see in our
matrix of genotypes that they all have a name.
We can also extract numeric data from our VCF data. We can do this by
selecting a numeric element and setting the parameter
as.numeric = TRUE
.
gt <- extract.gt(vcfR_test, element = 'DP', as.numeric = TRUE)
gt
## NA00001 NA00002 NA00003
## rs6054257 1 8 5
## 20_17330 3 5 3
## rs6040355 6 0 4
## 20_1230237 7 4 2
## microsat1 4 2 3
Here we’ve extracted the sequence depth (DP) of each variant and
converted it to a numeric. Because they are numeric they may be used for
quantitative comparisons. Much of R is designed to work with matrices of
numbers. Be careful setting the parameter
as.numeric = TRUE
. It will do its best to convert the data
to a numeric, even if this may not make sense. For example, we could
convert the genotype to numeric.
gt <- extract.gt(vcfR_test, element = 'GT', as.numeric = TRUE)
gt
## NA00001 NA00002 NA00003
## rs6054257 0 1 1
## 20_17330 0 0 0
## rs6040355 1 2 2
## 20_1230237 0 0 0
## microsat1 0 0 1
This operation did not throw an error, but the resulting matrix doesn’t appear to make much sense. Only ask to convert numeric data to numeric data or you will probably get unexpected results.
Some elements fromthe genotype section may require further parsing. In our example the haplotype quality (HQ) includes a quality for each haplotype and these values are comma delimited.
gt <- extract.gt(vcfR_test, element = 'HQ')
gt
## NA00001 NA00002 NA00003
## rs6054257 "51,51" "51,51" ".,."
## 20_17330 "58,50" "65,3" NA
## rs6040355 "23,27" "18,2" NA
## 20_1230237 "56,60" "51,51" NA
## microsat1 NA NA NA
Note that some elements contain data while in others it is missing.
myHQ1 <- masplit(gt[,1:2], sort = 0)
myHQ1
## NA00001 NA00002
## rs6054257 51 51
## 20_17330 58 65
## rs6040355 23 18
## 20_1230237 56 51
## microsat1 NA NA
The function masplit()
is fairly flexible in that you
can choose which element to return and you can also sort the data prior
to selection. These functions should help make VCF data accessible to
the wealth of existing R functions and packages.
Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.
USDA Agricultural Research Service, Horticultural Crops Research Lab.