tidy vcfR

vcfR documentation

by
Brian J. Knaus and Niklaus J. Grünwald

Much of the functionality of vcfR was built around R and its base graphics system. More recently, the concept of ‘tidy data’ has been described in the tidy data article and implemented in the CRAN package tidyr (and an increasing number of other packages). To accommodate users who want to work with vcfR in a ‘tidy’ framework, several functions have been added. (Thank you Eric!)

For our example we will load a test data set provided with vcfR.

library(vcfR)
data("vcfR_test")
vcfR_test

## ***** Object of Class vcfR *****
## 3 samples
## 1 CHROMs
## 5 variants
## Object size: 0 Mb
## 0 percent missing data
## *****        *****         *****

By default, the function vcfR2tidy() converts all of the data in the vcfR object to a ‘tibble.’ A ‘tibble’ is a trimmed down version of a data.frame. (See ?tibble::tibble for more information.) Converting everything could create a large data structure, so we can instead specify which parts of the VCF data we want converted into our tibble. The function vcf_field_names() can remind us of what data are contained in our vcfR object.

vcf_field_names(vcfR_test, tag = "FORMAT")

## # A tibble: 4 × 5
##   Tag    ID    Number Type    Description      
##   <chr>  <chr> <chr>  <chr>   <chr>            
## 1 FORMAT GT    1      String  Genotype         
## 2 FORMAT GQ    1      Integer Genotype Quality 
## 3 FORMAT DP    1      Integer Read Depth       
## 4 FORMAT HQ    2      Integer Haplotype Quality
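The same approach can be used for the INFO fields. The following is a minimal sketch, not part of the original example: it lists the INFO fields with tag = "INFO" and then passes both info_fields and format_fields to vcfR2tidy() so that only the requested columns are converted. The object name Z_small is a hypothetical name chosen for illustration.

```r
library(vcfR)
data("vcfR_test")

# List the INFO fields declared in the meta region.
vcf_field_names(vcfR_test, tag = "INFO")

# Convert only two INFO fields and one FORMAT field.
# 'Z_small' is an illustrative name, not used elsewhere in this document.
Z_small <- vcfR2tidy(
  vcfR_test,
  info_fields   = c("DP", "AF"),
  format_fields = c("GT")
)

# The fix element should now carry only the requested INFO columns
# alongside the standard VCF columns (CHROM, POS, ID, and so on).
names(Z_small$fix)
names(Z_small$gt)
```

Restricting the fields this way keeps the resulting tibbles small, which matters for VCF files with many annotations or samples.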

Z <- vcfR2tidy(vcfR_test, format_fields = c("GT", "DP"))

## Extracting gt element GT

## Extracting gt element DP

names(Z)

## [1] "fix"  "gt"   "meta"

The result is a list containing three elements named ‘fix,’ ‘gt’ and ‘meta.’ This is analogous to the three slots in the vcfR object. Each element of the list is a tibble that we can examine as we would any other list element.

Z$meta

## # A tibble: 8 × 5
##   Tag    ID    Number Type    Description                
##   <chr>  <chr> <chr>  <chr>   <chr>                      
## 1 INFO   NS    1      Integer Number of Samples With Data
## 2 INFO   DP    1      Integer Total Depth                
## 3 INFO   AF    A      Float   Allele Frequency           
## 4 INFO   AA    1      String  Ancestral Allele           
## 5 INFO   DB    0      Flag    dbSNP membership, build 129
## 6 INFO   H2    0      Flag    HapMap2 membership         
## 7 FORMAT gt_GT 1      String  Genotype                   
## 8 FORMAT gt_DP 1      Integer Read Depth

Z$fix

## # A tibble: 5 × 14
##   ChromKey CHROM     POS ID     REF   ALT    QUAL FILTER    NS    DP AF    AA   
##      <int> <chr>   <int> <chr>  <chr> <chr> <dbl> <chr>  <int> <int> <chr> <chr>
## 1        1 20      14370 rs605… G     A        29 PASS       3    14 0.5   <NA> 
## 2        1 20      17330 <NA>   T     A         3 q10        3    11 0.017 <NA> 
## 3        1 20    1110696 rs604… A     G,T      67 PASS       2    10 0.33… T    
## 4        1 20    1230237 <NA>   T     <NA>     47 PASS       3    13 <NA>  T    
## 5        1 20    1234567 micro… GTC   G,GT…    50 PASS       3     9 <NA>  G    
## # ℹ 2 more variables: DB <lgl>, H2 <lgl>

Z$gt

## # A tibble: 15 × 6
##    ChromKey     POS Indiv   gt_GT gt_DP gt_GT_alleles
##       <int>   <int> <chr>   <chr> <int> <chr>        
##  1        1   14370 NA00001 0|0       1 G|G          
##  2        1   17330 NA00001 0|0       3 T|T          
##  3        1 1110696 NA00001 1|2       6 G|T          
##  4        1 1230237 NA00001 0|0       7 T|T          
##  5        1 1234567 NA00001 0/1       4 GTC/G        
##  6        1   14370 NA00002 1|0       8 A|G          
##  7        1   17330 NA00002 0|1       5 T|A          
##  8        1 1110696 NA00002 2|1       0 T|G          
##  9        1 1230237 NA00002 0|0       4 T|T          
## 10        1 1234567 NA00002 0/2       2 GTC/GTCT     
## 11        1   14370 NA00003 1/1       5 A/A          
## 12        1   17330 NA00003 0/0       3 T/T          
## 13        1 1110696 NA00003 2/2       4 T/T          
## 14        1 1230237 NA00003 0/0       2 T/T          
## 15        1 1234567 NA00003 1/1       3 G/G

Note that the fix and gt elements have a ‘ChromKey’ to help coordinate the variants in both structures. Also, the information from the meta region has been used to assign a type to each column (e.g., integer, character, etc.). These data structures should now be in a format that other packages in the ‘tidyverse’ can work with. More information about vcfR2tidy() can be found in its manual page (?vcfR2tidy).
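As an illustration of how these tibbles can be used with other tidyverse packages, the sketch below assumes that dplyr is installed. It joins the gt and fix elements on ‘ChromKey’ and ‘POS’ and then summarizes read depth per sample. The object names joined and depth_by_sample are hypothetical names chosen for this example.

```r
library(vcfR)
library(dplyr)
data("vcfR_test")

Z <- vcfR2tidy(vcfR_test, format_fields = c("GT", "DP"))

# Attach the variant-level columns (fix) to every genotype row (gt).
# ChromKey and POS together identify a variant in both tibbles.
joined <- left_join(Z$gt, Z$fix, by = c("ChromKey", "POS"))
dim(joined)

# Mean genotype read depth (gt_DP) for each sample.
depth_by_sample <- Z$gt %>%
  group_by(Indiv) %>%
  summarise(mean_DP = mean(gt_DP, na.rm = TRUE))
depth_by_sample
```

Because the INFO field DP and the FORMAT field DP are stored as ‘DP’ and ‘gt_DP’ respectively, the join does not produce conflicting column names in this data set.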

USDA Agricultural Research Service, Horticultural Crops Research Lab.