This document is based on a presentation I did for the grad student organization for the Department of Integrative Biology, UW–Madison in Fall 2018. I’ve made a few changes to hopefully make it more transparent as a stand-alone document.
Why use functions?
Two main advantages over copy and paste:
- Create fewer errors
- Improve readability of code
Consider the following example
Small errors are easy to make and can be annoying to find.
lm_mpg <- lm(mpg ~ factor(cyll), mtcars)
## Error in factor(cyll): object 'cyll' not found
lm_hp <- lm(hp ~ factor(cyl), mrcars)
## Error in is.data.frame(data): object 'mrcars' not found
# Runs without error, but fits mpg instead of disp:
lm_disp <- lm(mpg ~ factor(cyl), mtcars)
The problem is even worse when you have lots of copying.
lm(mpg ~ cyl + disp + hp + drat, mtcars)
lm(mpg ~ cyl + disp + hp + wt, mtcars)
lm(mpg ~ cyl + disp + drat + wt, mtcars)
lm(mpg ~ cyl + hp + drat + wt, mtcars)
lm(mpg ~ disp + hp + drat + wt, mtcars)
lm(disp ~ mpg + cyl + hp + drat, mtcars)
lm(disp ~ mpg + cyl + hp + wt, mtcars)
lm(disp ~ mpg + cyl + drat + wt, mtcars)
lm(disp ~ mpg + hp + drat + wt, mtcars)
lm(disp ~ cyl + hp + drat + wt, mtcars)
lm(hp ~ mpg + cyl + disp + drat, mtcars)
lm(hp ~ mpg + cyl + disp + wt, mtcars)
lm(hp ~ mpg + cyl + drat + wt, mtcars)
lm(hp ~ mpg + disp + drat + wt, mtcars)
lm(hp ~ cyl + disp + drat + wt, mtcars)
Which is better?
lm_mpg <- lm(mpg ~ factor(cyl), mtcars)
lm_disp <- lm(disp ~ factor(cyl), mtcars)
lm_hp <- lm(hp ~ factor(cyl), mtcars)
lm_drat <- lm(drat ~ factor(cyl), mtcars)
lm_wt <- lm(wt ~ factor(cyl), mtcars)
lm_qsec <- lm(qsec ~ factor(cyl), mtcars)
lm_vs <- lm(vs ~ factor(cyl), mtcars)
lm_am <- lm(am ~ factor(cyl), mtcars)
lm_gear <- lm(gear ~ factor(cyl), mtcars)
lm_carb <- lm(carb ~ factor(cyl), mtcars)
or
y_pars <- c("mpg", "disp", "hp", "drat",
            "wt", "qsec", "vs", "am",
            "gear", "carb")
all_lm <- lapply(y_pars, cyl_model)
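The second version relies on a function called cyl_model that is not defined above. Here is a minimal sketch of what such a helper might look like. It assumes the helper takes the response name as a string and regresses it on factor(cyl) using mtcars. The body is an assumption for illustration, not the original definition.

```r
# Hypothetical helper (assumed; not shown in the original):
# fit one response variable against factor(cyl) using mtcars.
cyl_model <- function(y) {
  form <- as.formula(paste(y, "~ factor(cyl)"))
  lm(form, data = mtcars)
}

y_pars <- c("mpg", "disp", "hp", "drat",
            "wt", "qsec", "vs", "am",
            "gear", "carb")
all_lm <- lapply(y_pars, cyl_model)

# Optional: name the list so a fit can be looked up by response, e.g. all_lm$hp
names(all_lm) <- y_pars
```

With this layout, a typo in the formula or data set name can only occur in one place, and fixing it fixes all ten models.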
Some R basics
Basics of functions in R
subtract <- function(x, y = 1) {
  z <- x - y
  return(z)
}
subtract(1:3, 4:6)
## [1] -3 -3 -3
subtract(1:3)
## [1] 0 1 2
subtract(y = 1:3, x = 4:6)
## [1] 3 3 3
Flexibility of lists
x <- numeric(2)
x[[1]] <- matrix(0, 0, 0)
## Error in x[[1]] <- matrix(0, 0, 0): replacement has length zero
x <- as.list(numeric(3))
x[[1]] <- matrix(0, 0, 0)
x[[2]] <- data.frame()
x[[3]] <- runif(3)
x
## [[1]]
## <0 x 0 matrix>
##
## [[2]]
## data frame with 0 columns and 0 rows
##
## [[3]]
## [1] 0.2085866 0.9925219 0.7026601
The apply functions
- Allow you to apply a function to multiple inputs.
- lapply always outputs a list; sapply simplifies the result to a vector or array when it can.
lapply(4:5, function(i) 1 + i)
## [[1]]
## [1] 5
##
## [[2]]
## [1] 6
sapply(4:5, function(i) 1 + i)
## [1] 5 6
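To illustrate the difference a bit further, here is a small sketch in which each call returns more than one value. The names squares and squares_list are made up for this example.

```r
# Each call returns a length-2 vector, so sapply() binds the results
# into a matrix with one column per input.
squares <- sapply(4:5, function(i) c(value = i, square = i^2))
dim(squares)
## [1] 2 2

# lapply() on the same inputs keeps the results as a list of vectors.
squares_list <- lapply(4:5, function(i) c(value = i, square = i^2))
length(squares_list)
## [1] 2
```

Because the shape of sapply output depends on what the function returns, lapply is the more predictable choice when the results are not all single values.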
For loops
- Especially useful when one iteration’s result depends on the previous iteration.
- Changes existing object(s).
x <- numeric(100)
x[1] <- 10
for (t in 2:length(x)) {
  x[t] <- x[t-1] + rnorm(1)
}
plot(x, type = "l", lwd = 2)
General process to “functionalize” code
- Break problem into smaller sub-problems.
- For each sub-problem, write a function.
- For writing each function…
  - The main function code will include the commonalities between all situations.
  - Features that aren’t common should be input to the function as arguments.
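As a rough sketch of this process, the random-walk loop from the previous section could be turned into a function. The loop is the common part and goes in the body. The things that might differ between uses (length, starting value, step size) become arguments. The name random_walk and its argument names are assumptions chosen for this illustration.

```r
# Common part: the loop that builds the walk.
# Varying parts: number of steps, starting value, and step standard deviation.
random_walk <- function(n, start = 10, step_sd = 1) {
  x <- numeric(n)
  x[1] <- start
  # seq_len(n)[-1] is empty when n is 1, so the loop is skipped safely.
  for (t in seq_len(n)[-1]) {
    x[t] <- x[t - 1] + rnorm(1, sd = step_sd)
  }
  return(x)
}

walk_short <- random_walk(100)
walk_noisy <- random_walk(500, start = 0, step_sd = 5)
```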
Example #1: Cleaning weird files
Suppose we have a folder full of CSV files like this:
## ## Data provided by X
##
## Ozone,Solar.R,Wind,Temp,Month,Day
## 41,190,7.4,67,5,1
## NA,NA,14.3,56,5,5
## --- instrument error
## 28,NA,14.9,66,5,6
## 23,299,8.6,65,5,7
## --- instrument error
## NA,194,8.6,69,5,10
##
## ## Year observed: 1990
Problems:
- Remove unnecessary lines from each file.
- Create a single data frame from multiple cleaned files.
Input information:
- Vector of file names (file_names)
file_names <- c("file1.csv", "file2.csv")
Clean a single CSV file to a string
clean_str <- function(file_name) {
  lines <- readLines(file_name)
  lines <- lines[!grepl("^\\#\\#|^--", lines) &
                   lines != ""]
  cleaned_str <- paste(lines, collapse = "\n")
  return(cleaned_str)
}
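To see what this step does on its own, here is a small self-contained check. It writes a temporary file in the same layout as the example (the file contents are made up for the test) and runs the cleaning function on it.

```r
clean_str <- function(file_name) {
  lines <- readLines(file_name)
  # Drop "##" comment lines, "--" error markers, and blank lines.
  lines <- lines[!grepl("^\\#\\#|^--", lines) & lines != ""]
  paste(lines, collapse = "\n")
}

# Write a small throwaway file in the same layout as the example above.
tmp <- tempfile(fileext = ".csv")
writeLines(c("## Data provided by X",
             "",
             "Ozone,Solar.R,Wind,Temp,Month,Day",
             "41,190,7.4,67,5,1",
             "--- instrument error",
             "28,NA,14.9,66,5,6",
             "",
             "## Year observed: 1990"),
           tmp)

cat(clean_str(tmp))
## Ozone,Solar.R,Wind,Temp,Month,Day
## 41,190,7.4,67,5,1
## 28,NA,14.9,66,5,6
```

Only the header and data rows survive, which is exactly what a CSV parser needs.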
Clean multiple files then combine them into a single data frame:
clean_df <- function(file_names) {
  cleaned_strs <- lapply(file_names, clean_str)
  data_frames <- lapply(cleaned_strs, readr::read_csv)
  combined_df <- dplyr::bind_rows(data_frames)
  return(as.data.frame(combined_df))
}
head(clean_df(file_names))
## Rows: 5 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Ozone, Solar.R, Wind, Temp, Month, Day
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 7 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Ozone, Solar.R, Wind, Temp, Month, Day
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    NA      NA 14.3   56     5   5
## 3    28      NA 14.9   66     5   6
## 4    23     299  8.6   65     5   7
## 5    NA     194  8.6   69     5  10
## 6     7      NA  6.9   74     5  11
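If readr and dplyr are not available, the same two-step structure can be sketched with base R only. This is an alternative to the version above, not what produced the output shown. The names clean_df_base and make_file, and the file contents, are made up for the illustration.

```r
clean_str <- function(file_name) {
  lines <- readLines(file_name)
  lines <- lines[!grepl("^\\#\\#|^--", lines) & lines != ""]
  paste(lines, collapse = "\n")
}

# Base-R variant: parse each cleaned string, then stack the pieces.
clean_df_base <- function(file_names) {
  cleaned_strs <- lapply(file_names, clean_str)
  data_frames <- lapply(cleaned_strs, function(s) read.csv(text = s))
  do.call(rbind, data_frames)
}

# Hypothetical helper that writes one messy file and returns its path.
make_file <- function(rows) {
  path <- tempfile(fileext = ".csv")
  writeLines(c("## Data provided by X", "",
               "Ozone,Solar.R,Wind,Temp,Month,Day",
               rows, "", "## Year observed: 1990"),
             path)
  path
}

files <- c(make_file(c("41,190,7.4,67,5,1",
                       "--- instrument error",
                       "28,NA,14.9,66,5,6")),
           make_file("7,NA,6.9,74,5,11"))
combined <- clean_df_base(files)
```

One difference to keep in mind: rbind needs every file to have the same columns, whereas dplyr::bind_rows fills columns missing from some files with NA.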
Example #2: Fitting lots of models
How can we simplify this?
lm(mpg ~ cyl + disp + hp + drat, mtcars)
lm(mpg ~ cyl + disp + hp + wt, mtcars)
lm(mpg ~ cyl + disp + drat + wt, mtcars)
lm(mpg ~ cyl + hp + drat + wt, mtcars)
lm(mpg ~ disp + hp + drat + wt, mtcars)
lm(disp ~ mpg + cyl + hp + drat, mtcars)
lm(disp ~ mpg + cyl + hp + wt, mtcars)
lm(disp ~ mpg + cyl + drat + wt, mtcars)
lm(disp ~ mpg + hp + drat + wt, mtcars)
lm(disp ~ cyl + hp + drat + wt, mtcars)
lm(hp ~ mpg + cyl + disp + drat, mtcars)
lm(hp ~ mpg + cyl + disp + wt, mtcars)
lm(hp ~ mpg + cyl + drat + wt, mtcars)
lm(hp ~ mpg + disp + drat + wt, mtcars)
lm(hp ~ cyl + disp + drat + wt, mtcars)
Problems:
- Create all necessary formulas for each of the multiple Ys.
- Fit lm based on each of the created formulas.
Input information:
- Vector of Y variables (Ys)
- Vector of possible X variables (Xs)
- Number of X variables to include in each model (n_Xs)
Ys <- c("mpg", "disp", "hp")
Xs <- c("mpg", "cyl", "disp", "hp", "drat", "wt")
n_Xs <- 4
Make vector of all necessary formulas:
make_forms <- function(y, Xs, n_Xs) {
  poss_Xs <- Xs[Xs != y]
  n_poss_Xs <- length(poss_Xs)
  # All possible combinations:
  combs <- combn(n_poss_Xs, n_Xs, simplify = FALSE)
  # Change to names:
  names_ <- lapply(combs, function(x) poss_Xs[x])
  # Combine each set to single RHS of formula:
  rhs <- sapply(names_, paste, collapse = " + ")
  # Whole formulas as strings:
  form_strings <- paste(y, "~", rhs)
  # Convert to formulas:
  forms <- sapply(form_strings, as.formula,
                  USE.NAMES = FALSE)
  return(forms)
}
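The combination step is the least obvious part, so here is a small stand-alone sketch of it, using three candidate predictors and choosing two at a time. The inputs are picked just for illustration.

```r
poss_Xs <- c("cyl", "disp", "hp")

# Index combinations: every way to choose 2 of the 3 positions.
combs <- combn(length(poss_Xs), 2, simplify = FALSE)

# Map positions back to names and collapse each set into a formula right-hand side.
rhs <- sapply(combs, function(idx) paste(poss_Xs[idx], collapse = " + "))
rhs
## [1] "cyl + disp" "cyl + hp"   "disp + hp"

# paste() recycles the response across every right-hand side.
form_strings <- paste("mpg", "~", rhs)
```

Setting simplify = FALSE makes combn return a list with one element per combination, which is convenient to pass to lapply or sapply.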
Fit lm() based on a single formula:
single_mod <- function(form) {
  # Fit lm:
  mod <- lm(form, mtcars)
  # Make model print prettier:
  mod$call$formula <- as.formula(mod$terms)
  return(mod)
}
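The "print prettier" line may look odd, so here is a short sketch of what it changes. When a formula is passed to lm through a variable, the stored call records the variable name rather than the formula itself. The object names below are made up for the example.

```r
form <- mpg ~ cyl + disp
mod <- lm(form, mtcars)

# The stored call records the argument as written: the symbol `form`.
before <- deparse(mod$call$formula)
before
## [1] "form"

# Overwriting it with the model's terms makes the printed call
# show the actual formula instead of the variable name.
mod$call$formula <- as.formula(mod$terms)
mod$call
```

Without this step, every model in a long list would print with the same uninformative call, making the fits hard to tell apart.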
Put both steps together:
fit_models <- function(Ys, Xs, n_Xs) {
  forms <- lapply(Ys, make_forms,
                  Xs = Xs, n_Xs = n_Xs)
  forms <- c(forms, recursive = TRUE)
  lms <- lapply(forms, single_mod)
  return(lms)
}
model_fits <- fit_models(Ys, Xs, n_Xs)
model_fits[[1]]
##
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat, data = mtcars)
##
## Coefficients:
## (Intercept)          cyl         disp           hp         drat
##    23.98524     -0.81402     -0.01390     -0.02317      2.15405
More information
- T Mailund (2017). Functional Programming in R. doi: 10.1007/978-1-4842-2746-6
- Free through UW
- Functional Programming in R with purrr (towardsdatascience.com)
- Functional Programming (in Advanced R)