The package vcfR uses two objects to contain data: vcfR and chromR. The chromR object will be covered in a later section. The vcfR object is intended to contain data read in from a variant call format (VCF) file. Once stored in this object the data may be manipulated in a number of ways. It may also be used as part of the more complicated chromR object. Here we provide an overview of the vcfR object.
A first step in understanding the vcfR object is to create one so we can explore it.
library(vcfR)
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
vcf <- read.vcfR( vcf_file, verbose = FALSE )
Here we start by loading the library vcfR to make its functions
available to us. The function system.file()
locates teh
file ‘pinf_sc50.vcf.gz’ in the R package ‘pinfsc50.’ This library may be
in different locations on different system, so this function helps us
find our example file. In practice, you’ll substitute this step by using
the name of your data file and any relevant path information (if the
file is not in your working directory). The function
read.vcfR()
reads in a VCF file and returns a vcfR object.
Here we’ve called this vcfR object ‘vcf.’ We’ve also set ‘verbose =
FALSE’ in order to improve the formating of this page. In practice, tou
will typically leave ‘verbose = TRUE’ (it’s default) so that progress is
reported. Large files may require some patience to import, so progress
provides important feedback during this step.
Once we’re imorted our VCF data, we’ll want to explore our vcfR
object to validate that its contents are what we expect. Two relevant
tools are the object’s show
method as wellas the
head()
command.
vcf
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 22,031 variants
## Object size: 22.4 Mb
## 7.929 percent missing data
## ***** ***** *****
When we execute the object’s name with no function we implement the
object’s show
method. The show
method for vcfR
objects reports a summary of the object’s contents. Here we see that we
have 18 samples and 22,031 variants. We should know how many samples
were in our experiment. This number should be compared to the value
reported here. We may similarly have some information about how many
variants should be in the file. That number can also be compared to the
value reported here.
head(vcf)
## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.1"
## [1] "##source=\"GATK haplotype Caller, phased with beagle4\""
## [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">"
## [1] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths fo [Truncated]"
## [1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read [Truncated]"
## [1] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"
## [1] "First 6 rows."
## [1]
## [1] "***** Fixed section *****"
## CHROM POS ID REF ALT QUAL FILTER
## [1,] "Supercontig_1.50" "41" NA "AT" "A" "4784.43" NA
## [2,] "Supercontig_1.50" "136" NA "A" "C" "550.27" NA
## [3,] "Supercontig_1.50" "254" NA "T" "G" "774.44" NA
## [4,] "Supercontig_1.50" "275" NA "A" "G" "714.53" NA
## [5,] "Supercontig_1.50" "386" NA "T" "G" "876.55" NA
## [6,] "Supercontig_1.50" "462" NA "T" "G" "1301.07" NA
## [1]
## [1] "***** Genotype section *****"
## FORMAT BL2009P4_us23 DDR7602
## [1,] "GT:AD:DP:GQ:PL" "1|1:0,7:7:21:283,21,0" "1|1:0,6:6:18:243,18,0"
## [2,] "GT:AD:DP:GQ:PL" "0|0:12,0:12:36:0,36,427" "0|0:20,0:20:60:0,60,819"
## [3,] "GT:AD:DP:GQ:PL" "0|0:27,0:27:81:0,81,1117" "0|0:26,0:26:78:0,78,1077"
## [4,] "GT:AD:DP:GQ:PL" "0|0:29,0:29:87:0,87,1243" "0|0:27,0:27:81:0,81,1158"
## [5,] "GT:AD:DP:GQ:PL" "0|0:26,0:26:78:0,78,1034" "0|0:30,0:30:90:0,90,1242"
## [6,] "GT:AD:DP:GQ:PL" "0|0:23,0:23:69:0,69,958" "0|0:36,0:36:99:0,108,1556"
## IN2009T1_us22 LBUS5
## [1,] "1|1:0,8:8:24:324,24,0" "1|1:0,6:6:18:243,18,0"
## [2,] "0|0:16,0:16:48:0,48,650" "0|0:20,0:20:60:0,60,819"
## [3,] "0|0:23,0:23:69:0,69,946" "0|0:26,0:26:78:0,78,1077"
## [4,] "0|0:32,0:32:96:0,96,1299" "0|0:27,0:27:81:0,81,1158"
## [5,] "0|0:41,0:41:99:0,122,1613" "0|0:30,0:30:90:0,90,1242"
## [6,] "0|0:35,0:35:99:0,105,1467" "0|0:36,0:36:99:0,108,1556"
## NL07434
## [1,] "1|1:0,12:12:36:486,36,0"
## [2,] "0|0:28,0:28:84:0,84,948"
## [3,] "0|1:19,20:39:99:565,0,559"
## [4,] "0|1:19,19:38:99:523,0,535"
## [5,] "0|1:22,22:44:99:593,0,651"
## [6,] "0|1:29,25:54:99:723,0,876"
## [1] "First 6 columns only."
## [1]
## [1] "Unique GT formats:"
## [1] "GT:AD:DP:GQ:PL"
## [1]
The head()
function reports the head, or the top, of an
object. A vcfR object consists of three slots: meta, fix and gt. Here we
see the first few lines or rows of each slot. More information on VCF
data can be found in the section vcf_data.
Once we have validated that our vcfR object contains the data we expect we may want to explore it further.
head(is.polymorphic(vcf, na.omit = TRUE))
## Supercontig_1.50_41 Supercontig_1.50_136 Supercontig_1.50_254
## FALSE TRUE TRUE
## Supercontig_1.50_275 Supercontig_1.50_386 Supercontig_1.50_462
## TRUE TRUE TRUE
head(is.biallelic(vcf))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
VCF files report only variable, or polymorphic, positions. This means that all positions in your VCF file should have been polymorphic. The package vcfR allows you to manipulate VCF data. For example, you may want to omit certain samples. If the samples you omit were the only ones polymorphic for certain positions you may have rendered these positions unpolymorphic. Note that columns of the ‘fix’ slot contains summaries over all samples. If you remove samples then these summaries are no longer accurate, so make sure you keep this in mind when making these changes. This function allows you to query the positions to validate that they are polymorphic.
The function is.biallelic()
queries positions to
determine if they contain no more than two alleles. Some downstream
analyses can only handle biallelic loci. Some R object from other
packages, or file formats for other softwares, can only handle biallelic
loci. This funciton helps validate if loci contain more than two alleles
and can help manage this situation if desired.
vcf2 <- extract.indels(vcf)
Variants contained in VCF data may include single nucleotide
polymorphisms (SNPs) as well as indels or more complicated features.
Some analyses may require that only SNPs are used (e.g., when a mutation
model is used). In these cases it may be useful to subset the data to
only the SNPs. The function extract.indels()
may be used
for this. Note that it is different from the previous queries in that it
subsets the vcfR object for you instead of just returning an index. This
allows the rapid creation of vcfR object that should only contain
SNPs.
The package vcfR provides the ability to manipulte VCF data. With this functionality comes the ability to create invalide VCF files. When done with forethought this provides valuable options. When not done with forethought this is likely to create problems. Please familiarize yourself with the VCF specification if your goal is to output valid VCF files. If your goal is to create different formats of your data then these tools may be helpful.
vcf[,1:4]
## ***** Object of Class vcfR *****
## 3 samples
## 1 CHROMs
## 22,031 variants
## Object size: 11.9 Mb
## 2.875 percent missing data
## ***** ***** *****
vcf[1:4,]
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 4 variants
## Object size: 0 Mb
## 6.944 percent missing data
## ***** ***** *****
vcf[is.biallelic(vcf),]
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 21,719 variants
## Object size: 21.8 Mb
## 7.991 percent missing data
## ***** ***** *****
The square brackets ([]
) have been implemented for vcfR
objects in order to allow their manipulation appear similar to other R
objects. The vcfR object’s ‘fix’ and ‘gt’ slots consist of matrices.
When columns are selected, by subsetting after the comma, only columns
from the ‘gt’ slot are manipulated and the ‘fix’ slot is maintained.
Note that the first column of the ‘gt’ slot contains the format
information for all of the subsequent columns. This means you will
typically want to include the first column. When selecting rows, by
indexing before the comma, both the ‘fix’ and ‘gt’ slots are subset.
Subsetting can be performed in typical R manners such as the numeric
sequences provided above. They can also be combined with the query
functions presented above to facilitate more complex operations.
Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.
USDA Agricultural Research Service, Horticultural Crops Research Lab.