One of the weaknesses of R is that loops can be relatively slow to
execute. The apply family of functions attempts to address this. Use
?apply or ?lapply for examples of other flavors of the apply command.
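As a rough illustration of those other flavors, the sketch below runs the same function over a list with lapply(), sapply() and vapply(). The list `vals` and the result names are made up for this example; they are not part of the tutorial's data.

```r
# Hypothetical example data: a named list of numeric vectors of different lengths.
vals <- list(a = 1:3, b = c(2, 4, 6, 8), c = 10)

# lapply() always returns a list, with one element per input element.
means_list <- lapply(vals, mean)

# sapply() simplifies the result to a vector when it can.
means_vec <- sapply(vals, mean)

# vapply() works like sapply() but checks each result against a template,
# here a single numeric value.
means_checked <- vapply(vals, mean, FUN.VALUE = numeric(1))

means_vec
```

Which flavor to use mostly depends on the shape of the input (matrix or list) and on the shape of result you want back.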
Create a test matrix.
tmp <- matrix(rep(1:3, times=3), ncol=3)
tmp
##      [,1] [,2] [,3]
## [1,]    1    1    1
## [2,]    2    2    2
## [3,]    3    3    3
‘Apply’ the function ‘sum’ over rows.
apply(tmp, MARGIN=1, sum)
## [1] 3 6 9
‘Apply’ the function ‘sum’ over columns.
apply(tmp, MARGIN=2, sum)
## [1] 6 6 6
If the operation we wish to apply to a data structure exists as an R function, we can call it from the apply command. We can also define our own functions to apply over a data structure.
In practice, if we wanted averages over a matrix, we should use existing functions such as rowMeans() and colMeans(). Here we’ll create our own as an example.
myMean <- function(x){
  sum(x)/length(x)  # arithmetic mean of the values in x
}
apply(tmp, MARGIN=1, myMean)
## [1] 1 2 3
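For comparison, here is a small sketch of the existing functions referred to above. It assumes base R's rowMeans() and colMeans() are the ones intended, and it rebuilds the test matrix so that it runs on its own.

```r
# Rebuild the test matrix so this block is self-contained.
tmp <- matrix(rep(1:3, times=3), ncol=3)

# Row averages with apply() and the built-in mean().
row_avg_apply <- apply(tmp, MARGIN=1, mean)

# The same averages from the dedicated base R functions.
row_avg_builtin <- rowMeans(tmp)
col_avg_builtin <- colMeans(tmp)

# Both approaches should agree on the row averages.
all.equal(row_avg_apply, row_avg_builtin)
```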
Through defining our own functions we can extract all sorts of summaries from data in a fairly efficient manner.
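A custom function can also return more than one value per row. The sketch below uses a hypothetical helper, `row_summary`, that is not part of the original tutorial. Note that apply() then returns a matrix with one column per row of the input.

```r
# Rebuild the test matrix so this block is self-contained.
tmp <- matrix(rep(1:3, times=3), ncol=3)

# Hypothetical helper: several summaries of one row at once.
row_summary <- function(x){
  c(min = min(x), max = max(x), total = sum(x))
}

# Each row of tmp becomes one column of the result;
# the result's row names are "min", "max" and "total".
res <- apply(tmp, MARGIN=1, row_summary)
res
```

If one summary per row is preferred as rows rather than columns, the result can be transposed with t(res).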
A large part of quality control of data sets is finding and mitigating missing data. Here I’ll create a larger toy data set and add some missing data. As homework, create a custom function that will help you identify missing data.
toy <- matrix(ncol=10, nrow=12)
toy[] <- rnorm(length(toy))
colnames(toy) <- paste("sample", 1:ncol(toy), sep="_")
rownames(toy) <- paste("variant", 1:nrow(toy), sep="_")
set.seed(999)
is.na(toy[round(runif(n=30, min=1, max=length(toy)))]) <- TRUE
toy
##              sample_1   sample_2   sample_3     sample_4    sample_5   sample_6
## variant_1  -0.2817402  0.9387494 -1.1252685 -0.370527471  0.58226648 -0.9233114
## variant_2  -1.3125596         NA  0.6422657  0.522867793 -0.03472639  1.1649540
## variant_3          NA  0.9576504 -1.1067376  0.517805536 -0.11666415  1.0420687
## variant_4   0.2700705         NA         NA -1.402510873 -0.64498209         NA
## variant_5  -0.2773064         NA -1.5540951 -0.485636726          NA         NA
## variant_6  -0.5660237  0.1006576         NA  0.008498139  0.36609447 -1.1469577
## variant_7  -1.8786583  0.9013448  2.3826642 -1.282113287          NA -1.4081795
## variant_8          NA -2.0743571  0.6012761           NA  0.28261247 -0.2823287
## variant_9  -0.9677497 -1.2285633  0.1793613  0.300665411          NA -0.4177700
## variant_10 -1.1210094  0.6430443  1.0805315  0.276478845 -1.27921590         NA
## variant_11         NA         NA         NA -2.050877659  0.43536881 -0.1062858
## variant_12  0.1339774  0.2940356 -2.1137370  0.014190211 -0.56550098         NA
##               sample_7    sample_8   sample_9  sample_10
## variant_1   0.94970110 -1.20409383         NA -1.5100543
## variant_2           NA -0.37684776         NA -0.6772986
## variant_3   0.97400041  1.36364858 -0.2417764 -0.2979716
## variant_4   0.06229143 -0.25288275         NA -1.5191194
## variant_5   0.53842205          NA -1.6509552 -0.9118353
## variant_6  -2.06482325  0.43714914  0.4782007 -0.8358807
## variant_7           NA          NA -0.8052824 -0.2171495
## variant_8  -0.16022669  0.02768521         NA -1.0710323
## variant_9  -0.64292273          NA         NA  0.9450480
## variant_10  0.98529855  1.28372914 -2.5954909  1.1279968
## variant_11 -1.22857333 -1.12974161  0.2901482 -1.2786429
## variant_12  0.08522467  1.04665773  1.3836599  0.4576313
Now we create a custom function to process our matrix.
my_fun <- function(x){
  length(x)  # number of elements in x, including any NA
}
Lastly, we use apply to iterate the function over the matrix.
apply(toy, MARGIN=1, my_fun)
##  variant_1  variant_2  variant_3  variant_4  variant_5  variant_6  variant_7
##         10         10         10         10         10         10         10
##  variant_8  variant_9 variant_10 variant_11 variant_12
##         10         10         10         10         10
Can you modify the function my_fun() so that it counts missing data in each sample and variant?
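One possible approach, offered as a sketch rather than the intended answer, is to sum the logical vector returned by is.na(). The matrix `demo` and the function name `count_na` are made up here so that the block runs on its own. The positions of the missing values were chosen arbitrarily.

```r
# Hypothetical stand-in for the toy data: a 4 x 5 matrix with a few NAs.
demo <- matrix(as.numeric(1:20), nrow=4)
demo[c(2, 7, 8, 15)] <- NA  # linear (column-major) positions, chosen arbitrarily

# Count the missing values in a vector: is.na() gives TRUE/FALSE,
# and sum() counts each TRUE as 1.
count_na <- function(x){
  sum(is.na(x))
}

# MARGIN=1 counts per row (variant), MARGIN=2 per column (sample).
na_per_row <- apply(demo, MARGIN=1, count_na)
na_per_col <- apply(demo, MARGIN=2, count_na)

na_per_row
na_per_col
```

The same function could be applied to the toy matrix with MARGIN=1 for variants and MARGIN=2 for samples. Base R's rowSums(is.na(demo)) and colSums(is.na(demo)) should give the same counts without a custom function.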
Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.
USDA Agricultural Research Service, Horticultural Crops Research Lab.