Much of the functionality of vcfR was built around R and its base graphics system. More recently, the concept of ‘tidy data’ has been described in the tidy data article and implemented in the CRAN package tidyr (and an increasing number of other packages). To accommodate users who want to work with vcfR in a ‘tidy’ framework, several functions have been added. (Thank you Eric!)
For our example, we will load a data set provided with vcfR.
library(vcfR)
data("vcfR_test")
vcfR_test
## ***** Object of Class vcfR *****
## 3 samples
## 1 CHROMs
## 5 variants
## Object size: 0 Mb
## 0 percent missing data
## ***** ***** *****
The function vcfR2tidy() will convert all of the data in the vcfR object to a ‘tibble.’ A ‘tibble’ is a trimmed down version of a data.frame. (See ?tibble::tibble for more information.) Converting everything could result in the creation of a large data structure, so we can also specify which parts of the VCF data we want to have converted into our tibble. The function vcf_field_names() can remind us of what data are contained in our vcfR object.
vcf_field_names(vcfR_test, tag = "FORMAT")
## # A tibble: 4 × 5
## Tag ID Number Type Description
## <chr> <chr> <chr> <chr> <chr>
## 1 FORMAT GT 1 String Genotype
## 2 FORMAT GQ 1 Integer Genotype Quality
## 3 FORMAT DP 1 Integer Read Depth
## 4 FORMAT HQ 2 Integer Haplotype Quality
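The same approach can be applied to the INFO fields. The following is a minimal sketch, assuming that vcfR is installed and that the `info_fields` argument of vcfR2tidy() is used to restrict which INFO fields are converted; the object names `info_meta` and `Z_info` are only illustrative.

```r
library(vcfR)
data("vcfR_test")

# List the INFO fields described in the meta region of the VCF data.
info_meta <- vcf_field_names(vcfR_test, tag = "INFO")
info_meta$ID

# Convert only a subset of the INFO and FORMAT fields to keep the result small.
Z_info <- vcfR2tidy(
  vcfR_test,
  info_fields = c("DP", "AF"),
  format_fields = c("GT", "DP")
)

# The requested INFO fields appear as columns in the 'fix' element.
names(Z_info$fix)
```

Choosing fields up front, rather than subsetting afterwards, should avoid building columns that are never used.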
Z <- vcfR2tidy(vcfR_test, format_fields = c("GT", "DP"))
## Extracting gt element GT
## Extracting gt element DP
names(Z)
## [1] "fix" "gt" "meta"
The result is a list containing three elements named ‘fix’, ‘gt’ and ‘meta’. This is analogous to the three slots in the vcfR object. Each element of the list is a tibble that we can examine as we would any other list element.
Z$meta
## # A tibble: 8 × 5
## Tag ID Number Type Description
## <chr> <chr> <chr> <chr> <chr>
## 1 INFO NS 1 Integer Number of Samples With Data
## 2 INFO DP 1 Integer Total Depth
## 3 INFO AF A Float Allele Frequency
## 4 INFO AA 1 String Ancestral Allele
## 5 INFO DB 0 Flag dbSNP membership, build 129
## 6 INFO H2 0 Flag HapMap2 membership
## 7 FORMAT gt_GT 1 String Genotype
## 8 FORMAT gt_DP 1 Integer Read Depth
Z$fix
## # A tibble: 5 × 14
## ChromKey CHROM POS ID REF ALT QUAL FILTER NS DP AF AA
## <int> <chr> <int> <chr> <chr> <chr> <dbl> <chr> <int> <int> <chr> <chr>
## 1 1 20 14370 rs605… G A 29 PASS 3 14 0.5 <NA>
## 2 1 20 17330 <NA> T A 3 q10 3 11 0.017 <NA>
## 3 1 20 1110696 rs604… A G,T 67 PASS 2 10 0.33… T
## 4 1 20 1230237 <NA> T <NA> 47 PASS 3 13 <NA> T
## 5 1 20 1234567 micro… GTC G,GT… 50 PASS 3 9 <NA> G
## # … with 2 more variables: DB <lgl>, H2 <lgl>
Z$gt
## # A tibble: 15 × 6
## ChromKey POS Indiv gt_GT gt_DP gt_GT_alleles
## <int> <int> <chr> <chr> <int> <chr>
## 1 1 14370 NA00001 0|0 1 G|G
## 2 1 17330 NA00001 0|0 3 T|T
## 3 1 1110696 NA00001 1|2 6 G|T
## 4 1 1230237 NA00001 0|0 7 T|T
## 5 1 1234567 NA00001 0/1 4 GTC/G
## 6 1 14370 NA00002 1|0 8 A|G
## 7 1 17330 NA00002 0|1 5 T|A
## 8 1 1110696 NA00002 2|1 0 T|G
## 9 1 1230237 NA00002 0|0 4 T|T
## 10 1 1234567 NA00002 0/2 2 GTC/GTCT
## 11 1 14370 NA00003 1/1 5 A/A
## 12 1 17330 NA00003 0/0 3 T/T
## 13 1 1110696 NA00003 2/2 4 T/T
## 14 1 1230237 NA00003 0/0 2 T/T
## 15 1 1234567 NA00003 1/1 3 G/G
Note that the fix and gt elements both have a ‘ChromKey’ column to help coordinate the variants in both structures. Also, the information from the meta region has been used to assign a type to each column (e.g., integer, character, etc.). These data structures should now be in a format that other packages in the ‘tidyverse’ can work with. More information about vcfR2tidy() can be found in its manual page (?vcfR2tidy).
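As an illustration of working with these tibbles in the ‘tidyverse’, the sketch below joins the gt and fix elements on their shared ‘ChromKey’ and ‘POS’ columns and then summarises read depth per sample. It assumes the dplyr package is installed; the object names `variants`, `gt_annotated` and `depth_by_sample` are only illustrative, and this is one possible way to combine the tables rather than a prescribed workflow.

```r
library(vcfR)
library(dplyr)

data("vcfR_test")
Z <- vcfR2tidy(vcfR_test, format_fields = c("GT", "DP"))

# Keep a few variant-level columns, including the keys used for joining.
variants <- select(Z$fix, ChromKey, POS, CHROM, REF, ALT, FILTER)

# Attach the variant-level columns to every genotype row.
gt_annotated <- inner_join(Z$gt, variants, by = c("ChromKey", "POS"))

# Mean read depth per sample, using only variants whose FILTER is "PASS".
depth_by_sample <- gt_annotated %>%
  filter(FILTER == "PASS") %>%
  group_by(Indiv) %>%
  summarise(mean_DP = mean(gt_DP, na.rm = TRUE))

depth_by_sample
```

Joining on both keys, rather than on ‘POS’ alone, keeps the result correct when a file contains more than one chromosome.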
Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.
USDA Agricultural Research Service, Horticultural Crops Research Lab.