vcfR documentation

by
Brian J. Knaus and Niklaus J. Grünwald

The package vcfR uses two objects to contain data: vcfR and chromR. The chromR object will be covered in a later section. The vcfR object is intended to contain data read in from a variant call format (VCF) file. Once stored in this object the data may be manipulated in a number of ways. It may also be used as part of the more complicated chromR object. Here we provide an overview of the vcfR object.

Creation

A first step in understanding the vcfR object is to create one so we can explore it.

library(vcfR)
vcf_file <- system.file("extdata", "pinf_sc50.vcf.gz", package = "pinfsc50")
vcf <- read.vcfR( vcf_file, verbose = FALSE )

Here we start by loading the library vcfR to make its functions available to us. The function system.file() locates teh file ‘pinf_sc50.vcf.gz’ in the R package ‘pinfsc50.’ This library may be in different locations on different system, so this function helps us find our example file. In practice, you’ll substitute this step by using the name of your data file and any relevant path information (if the file is not in your working directory). The function read.vcfR() reads in a VCF file and returns a vcfR object. Here we’ve called this vcfR object ‘vcf.’ We’ve also set ‘verbose = FALSE’ in order to improve the formating of this page. In practice, tou will typically leave ‘verbose = TRUE’ (it’s default) so that progress is reported. Large files may require some patience to import, so progress provides important feedback during this step.

Summarization

Once we’re imorted our VCF data, we’ll want to explore our vcfR object to validate that its contents are what we expect. Two relevant tools are the object’s show method as wellas the head() command.

vcf
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 22,031 variants
## Object size: 22.4 Mb
## 7.929 percent missing data
## *****        *****         *****

When we execute the object’s name with no function we implement the object’s show method. The show method for vcfR objects reports a summary of the object’s contents. Here we see that we have 18 samples and 22,031 variants. We should know how many samples were in our experiment. This number should be compared to the value reported here. We may similarly have some information about how many variants should be in the file. That number can also be compared to the value reported here.

head(vcf)
## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.1"
## [1] "##source=\"GATK haplotype Caller, phased with beagle4\""
## [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">"
## [1] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths fo [Truncated]"
## [1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read  [Truncated]"
## [1] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"
## [1] "First 6 rows."
## [1] 
## [1] "***** Fixed section *****"
##      CHROM              POS   ID REF  ALT QUAL      FILTER
## [1,] "Supercontig_1.50" "41"  NA "AT" "A" "4784.43" NA    
## [2,] "Supercontig_1.50" "136" NA "A"  "C" "550.27"  NA    
## [3,] "Supercontig_1.50" "254" NA "T"  "G" "774.44"  NA    
## [4,] "Supercontig_1.50" "275" NA "A"  "G" "714.53"  NA    
## [5,] "Supercontig_1.50" "386" NA "T"  "G" "876.55"  NA    
## [6,] "Supercontig_1.50" "462" NA "T"  "G" "1301.07" NA    
## [1] 
## [1] "***** Genotype section *****"
##      FORMAT           BL2009P4_us23              DDR7602                    
## [1,] "GT:AD:DP:GQ:PL" "1|1:0,7:7:21:283,21,0"    "1|1:0,6:6:18:243,18,0"    
## [2,] "GT:AD:DP:GQ:PL" "0|0:12,0:12:36:0,36,427"  "0|0:20,0:20:60:0,60,819"  
## [3,] "GT:AD:DP:GQ:PL" "0|0:27,0:27:81:0,81,1117" "0|0:26,0:26:78:0,78,1077" 
## [4,] "GT:AD:DP:GQ:PL" "0|0:29,0:29:87:0,87,1243" "0|0:27,0:27:81:0,81,1158" 
## [5,] "GT:AD:DP:GQ:PL" "0|0:26,0:26:78:0,78,1034" "0|0:30,0:30:90:0,90,1242" 
## [6,] "GT:AD:DP:GQ:PL" "0|0:23,0:23:69:0,69,958"  "0|0:36,0:36:99:0,108,1556"
##      IN2009T1_us22               LBUS5                      
## [1,] "1|1:0,8:8:24:324,24,0"     "1|1:0,6:6:18:243,18,0"    
## [2,] "0|0:16,0:16:48:0,48,650"   "0|0:20,0:20:60:0,60,819"  
## [3,] "0|0:23,0:23:69:0,69,946"   "0|0:26,0:26:78:0,78,1077" 
## [4,] "0|0:32,0:32:96:0,96,1299"  "0|0:27,0:27:81:0,81,1158" 
## [5,] "0|0:41,0:41:99:0,122,1613" "0|0:30,0:30:90:0,90,1242" 
## [6,] "0|0:35,0:35:99:0,105,1467" "0|0:36,0:36:99:0,108,1556"
##      NL07434                    
## [1,] "1|1:0,12:12:36:486,36,0"  
## [2,] "0|0:28,0:28:84:0,84,948"  
## [3,] "0|1:19,20:39:99:565,0,559"
## [4,] "0|1:19,19:38:99:523,0,535"
## [5,] "0|1:22,22:44:99:593,0,651"
## [6,] "0|1:29,25:54:99:723,0,876"
## [1] "First 6 columns only."
## [1] 
## [1] "Unique GT formats:"
## [1] "GT:AD:DP:GQ:PL"
## [1]

The head() function reports the head, or the top, of an object. A vcfR object consists of three slots: meta, fix and gt. Here we see the first few lines or rows of each slot. More information on VCF data can be found in the section vcf_data.

Queries

Once we have validated that our vcfR object contains the data we expect we may want to explore it further.

head(is.polymorphic(vcf, na.omit = TRUE))
##  Supercontig_1.50_41 Supercontig_1.50_136 Supercontig_1.50_254 
##                FALSE                 TRUE                 TRUE 
## Supercontig_1.50_275 Supercontig_1.50_386 Supercontig_1.50_462 
##                 TRUE                 TRUE                 TRUE
head(is.biallelic(vcf))
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

VCF files report only variable, or polymorphic, positions. This means that all positions in your VCF file should have been polymorphic. The package vcfR allows you to manipulate VCF data. For example, you may want to omit certain samples. If the samples you omit were the only ones polymorphic for certain positions you may have rendered these positions unpolymorphic. Note that columns of the ‘fix’ slot contains summaries over all samples. If you remove samples then these summaries are no longer accurate, so make sure you keep this in mind when making these changes. This function allows you to query the positions to validate that they are polymorphic.

The function is.biallelic() queries positions to determine if they contain no more than two alleles. Some downstream analyses can only handle biallelic loci. Some R object from other packages, or file formats for other softwares, can only handle biallelic loci. This funciton helps validate if loci contain more than two alleles and can help manage this situation if desired.

vcf2 <- extract.indels(vcf)

Variants contained in VCF data may include single nucleotide polymorphisms (SNPs) as well as indels or more complicated features. Some analyses may require that only SNPs are used (e.g., when a mutation model is used). In these cases it may be useful to subset the data to only the SNPs. The function extract.indels() may be used for this. Note that it is different from the previous queries in that it subsets the vcfR object for you instead of just returning an index. This allows the rapid creation of vcfR object that should only contain SNPs.

Subsetting

The package vcfR provides the ability to manipulte VCF data. With this functionality comes the ability to create invalide VCF files. When done with forethought this provides valuable options. When not done with forethought this is likely to create problems. Please familiarize yourself with the VCF specification if your goal is to output valid VCF files. If your goal is to create different formats of your data then these tools may be helpful.

vcf[,1:4]
## ***** Object of Class vcfR *****
## 3 samples
## 1 CHROMs
## 22,031 variants
## Object size: 11.9 Mb
## 2.875 percent missing data
## *****        *****         *****
vcf[1:4,]
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 4 variants
## Object size: 0 Mb
## 6.944 percent missing data
## *****        *****         *****
vcf[is.biallelic(vcf),]
## ***** Object of Class vcfR *****
## 18 samples
## 1 CHROMs
## 21,719 variants
## Object size: 21.8 Mb
## 7.991 percent missing data
## *****        *****         *****

The square brackets ([]) have been implemented for vcfR objects in order to allow their manipulation appear similar to other R objects. The vcfR object’s ‘fix’ and ‘gt’ slots consist of matrices. When columns are selected, by subsetting after the comma, only columns from the ‘gt’ slot are manipulated and the ‘fix’ slot is maintained. Note that the first column of the ‘gt’ slot contains the format information for all of the subsequent columns. This means you will typically want to include the first column. When selecting rows, by indexing before the comma, both the ‘fix’ and ‘gt’ slots are subset. Subsetting can be performed in typical R manners such as the numeric sequences provided above. They can also be combined with the query functions presented above to facilitate more complex operations.


Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.

USDA Agricultural Research Service, Horticultural Crops Research Lab.