One of the weaknesses of R is that loops can be relatively slow to execute. The apply family of functions attempts to address this. Use ?apply or ?lapply for examples of other flavors of the apply command.
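As a rough illustration of those other flavors, the sketch below runs lapply(), sapply() and vapply() over a small list. The list `x` and the result names are made up for this example and are not part of the original material.

```r
# A small list with elements of different lengths (hypothetical example data).
x <- list(a = 1:3, b = 4:6, c = 7:10)

# lapply() always returns a list, with one element per input element.
as_list <- lapply(x, mean)

# sapply() simplifies the result to a named vector when it can.
as_vector <- sapply(x, mean)

# vapply() also declares the type and length expected from each call.
checked <- vapply(x, mean, FUN.VALUE = numeric(1))

as_vector
```

Here apply() itself is not used because `x` is a list rather than a matrix; apply() works over the margins of a matrix or array.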
Create a test matrix.
tmp <- matrix(rep(1:3, times=3), ncol=3)
tmp
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 3 3
‘Apply’ the function ‘sum’ over rows.
apply(tmp, MARGIN=1, sum)
## [1] 3 6 9
‘Apply’ the function ‘sum’ over columns (MARGIN=2).
apply(tmp, MARGIN=2, sum)
## [1] 6 6 6
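For comparison, here is a minimal sketch of the explicit loop that the row-wise apply call stands in for, along with the built-in rowSums() and colSums() shortcuts. The variable name `row_sums` is introduced only for this example.

```r
tmp <- matrix(rep(1:3, times=3), ncol=3)

# Pre-allocate the result, then fill it one row at a time.
row_sums <- numeric(nrow(tmp))
for (i in seq_len(nrow(tmp))) {
  row_sums[i] <- sum(tmp[i, ])
}
row_sums
## [1] 3 6 9

# rowSums() and colSums() cover these two common cases directly.
rowSums(tmp)
## [1] 3 6 9
colSums(tmp)
## [1] 6 6 6
```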
If the operation we wish to apply to a data structure exists as an R function, we can call it from the apply command. We can also define our own functions to apply over a data structure.
In practice, if we wanted to get averages over a matrix, existing functions such as rowMeans() and colMeans() should be used. Here we’ll create our own as an example.
myMean <- function(x){
  sum(x)/length(x)
}
apply(tmp, MARGIN=1, myMean)
## [1] 1 2 3
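A quick way to gain confidence in a custom function is to compare it against a built-in equivalent. The sketch below does that for myMean, and also shows that a function can be written inline in the apply call. The names `custom` and `col_means` are introduced only for this example.

```r
tmp <- matrix(rep(1:3, times=3), ncol=3)
myMean <- function(x){
  sum(x)/length(x)
}

# The custom function and the built-in rowMeans() should agree.
custom <- apply(tmp, MARGIN=1, myMean)
all.equal(custom, rowMeans(tmp))
## [1] TRUE

# A function can also be defined inline, without giving it a name.
col_means <- apply(tmp, MARGIN=2, function(x) sum(x)/length(x))
col_means
## [1] 2 2 2
```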
Through defining our own functions we can extract all sorts of summaries from data in a fairly efficient manner.
A large part of quality control of data sets is finding and mitigating missing data. Here I’ll create a larger toy data set and add some missing data. As homework, create a custom function that will help you identify missing data.
toy <- matrix(ncol=10, nrow=12)
set.seed(999)
toy[] <- rnorm(length(toy))
colnames(toy) <- paste("sample", 1:ncol(toy), sep="_")
rownames(toy) <- paste("variant", 1:nrow(toy), sep="_")
set.seed(999)
# Set randomly chosen cells to NA. Indices can repeat, so at most 30 cells are affected.
is.na(toy[round(runif(n=30, min=1, max=length(toy)))]) <- TRUE
toy
## sample_1 sample_2 sample_3 sample_4 sample_5 sample_6
## variant_1 -0.2817402 0.9387494 -1.1252685 -0.370527471 0.58226648 -0.9233114
## variant_2 -1.3125596 NA 0.6422657 0.522867793 -0.03472639 1.1649540
## variant_3 NA 0.9576504 -1.1067376 0.517805536 -0.11666415 1.0420687
## variant_4 0.2700705 NA NA -1.402510873 -0.64498209 NA
## variant_5 -0.2773064 NA -1.5540951 -0.485636726 NA NA
## variant_6 -0.5660237 0.1006576 NA 0.008498139 0.36609447 -1.1469577
## variant_7 -1.8786583 0.9013448 2.3826642 -1.282113287 NA -1.4081795
## variant_8 NA -2.0743571 0.6012761 NA 0.28261247 -0.2823287
## variant_9 -0.9677497 -1.2285633 0.1793613 0.300665411 NA -0.4177700
## variant_10 -1.1210094 0.6430443 1.0805315 0.276478845 -1.27921590 NA
## variant_11 NA NA NA -2.050877659 0.43536881 -0.1062858
## variant_12 0.1339774 0.2940356 -2.1137370 0.014190211 -0.56550098 NA
## sample_7 sample_8 sample_9 sample_10
## variant_1 0.94970110 -1.20409383 NA -1.5100543
## variant_2 NA -0.37684776 NA -0.6772986
## variant_3 0.97400041 1.36364858 -0.2417764 -0.2979716
## variant_4 0.06229143 -0.25288275 NA -1.5191194
## variant_5 0.53842205 NA -1.6509552 -0.9118353
## variant_6 -2.06482325 0.43714914 0.4782007 -0.8358807
## variant_7 NA NA -0.8052824 -0.2171495
## variant_8 -0.16022669 0.02768521 NA -1.0710323
## variant_9 -0.64292273 NA NA 0.9450480
## variant_10 0.98529855 1.28372914 -2.5954909 1.1279968
## variant_11 -1.22857333 -1.12974161 0.2901482 -1.2786429
## variant_12 0.08522467 1.04665773 1.3836599 0.4576313
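The line that introduced the missing values uses the assignment form of is.na(), which may be unfamiliar. The sketch below shows the same idiom on a short vector; `v` is made-up data for this example only.

```r
# A short vector to experiment on (hypothetical example data).
v <- c(10, 20, 30, 40, 50)

# Assigning TRUE through is.na() sets the selected elements to NA.
is.na(v[c(2, 4)]) <- TRUE
v
## [1] 10 NA 30 NA 50

# Called on its own, is.na() reports which elements are missing.
is.na(v)
## [1] FALSE  TRUE FALSE  TRUE FALSE

# anyNA() answers whether there is any missing value at all.
anyNA(v)
## [1] TRUE
```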
Now we create a custom function to process our matrix.
my_fun <- function(x){
  length(x)
}
Lastly, we use apply to iterate the function over the matrix.
apply(toy, MARGIN=1, my_fun)
## variant_1 variant_2 variant_3 variant_4 variant_5 variant_6 variant_7
## 10 10 10 10 10 10 10
## variant_8 variant_9 variant_10 variant_11 variant_12
## 10 10 10 10 10
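The call above summarizes each variant (row). As a sketch of how the MARGIN argument switches between variants and samples, the example below uses a stand-in matrix `m` with the same shape as the toy data; `m`, `by_variant` and `by_sample` are names assumed for this example.

```r
# A stand-in matrix with the same shape as 'toy' (12 variants by 10 samples).
m <- matrix(0, nrow=12, ncol=10)

my_fun <- function(x){
  length(x)
}

# MARGIN=1 passes each row (variant) to my_fun; each row has 10 elements.
by_variant <- apply(m, MARGIN=1, my_fun)

# MARGIN=2 passes each column (sample) to my_fun; each column has 12 elements.
by_sample <- apply(m, MARGIN=2, my_fun)
by_sample
## [1] 12 12 12 12 12 12 12 12 12 12
```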
Can you modify the function my_fun() so that it counts missing data in each sample and variant?
Copyright © 2017, 2018 Brian J. Knaus. All rights reserved.
USDA Agricultural Research Service, Horticultural Crops Research Lab.