This document is based on a presentation I did for the grad student organization for the Department of Integrative Biology, UW–Madison in Fall 2018. I’ve made a few changes to hopefully make it more transparent as a stand-alone document.

Why use functions?

Two main advantages over copy and paste:

  1. Create fewer errors
  2. Improve readability of code

Consider the following example

Small errors are easy to make and can be annoying to find.

lm_mpg <- lm(mpg ~ factor(cyll), mtcars)
## Error in eval(predvars, data, env): object 'cyll' not found
lm_hp <- lm(hp ~ factor(cyl), mrcars)
## Error in eval(mf, parent.frame()): object 'mrcars' not found
lm_disp <- lm(mpg ~ factor(cyl), mtcars)

The problem is even worse when you have lots of copying.

lm(mpg ~ cyl + disp + hp + drat, mtcars)
lm(mpg ~ cyl + disp + hp + wt, mtcars)
lm(mpg ~ cyl + disp + drat + wt, mtcars)
lm(mpg ~ cyl + hp + drat + wt, mtcars)
lm(mpg ~ disp + hp + drat + wt, mtcars)
lm(disp ~ mpg + cyl + hp + drat, mtcars)
lm(disp ~ mpg + cyl + hp + wt, mtcars)
lm(disp ~ mpg + cyl + drat + wt, mtcars)
lm(disp ~ mpg + hp + drat + wt, mtcars)
lm(disp ~ cyl + hp + drat + wt, mtcars)
lm(hp ~ mpg + cyl + disp + drat, mtcars)
lm(hp ~ mpg + cyl + disp + wt, mtcars)
lm(hp ~ mpg + cyl + drat + wt, mtcars)
lm(hp ~ mpg + disp + drat + wt, mtcars)
lm(hp ~ cyl + disp + drat + wt, mtcars)

Which is better?

lm_mpg <- lm(mpg ~ factor(cyl), mtcars)
lm_disp <- lm(disp ~ factor(cyl), mtcars)
lm_hp <- lm(hp ~ factor(cyl), mtcars)
lm_drat <- lm(drat ~ factor(cyl), mtcars)
lm_wt <- lm(wt ~ factor(cyl), mtcars)
lm_qsec <- lm(qsec ~ factor(cyl), mtcars)
lm_vs <- lm(vs ~ factor(cyl), mtcars)
lm_am <- lm(am ~ factor(cyl), mtcars)
lm_gear <- lm(gear ~ factor(cyl), mtcars)
lm_carb <- lm(carb ~ factor(cyl), mtcars)

or

y_pars <- c("mpg", "disp", "hp", "drat",
            "wt", "qsec", "vs", "am",
            "gear", "carb")
all_lm <- lapply(y_pars, cyl_model)

Some R basics

Basics of functions in R

subtract <- function(x, y = 1) {
  z <- x - y
  return(z)
}
subtract(1:3, 4:6)
## [1] -3 -3 -3
subtract(1:3)
## [1] 0 1 2
subtract(y = 1:3, x = 4:6)
## [1] 3 3 3

Flexibility of lists

x <- numeric(2)
x[[1]] <- matrix(0, 0, 0)
## Error in x[[1]] <- matrix(0, 0, 0): replacement has length zero
x <- as.list(numeric(3))
x[[1]] <- matrix(0, 0, 0)
x[[2]] <- data.frame()
x[[3]] <- runif(3)
x
## [[1]]
## <0 x 0 matrix>
## 
## [[2]]
## data frame with 0 columns and 0 rows
## 
## [[3]]
## [1] 0.1701993 0.3846910 0.3334314

The apply functions

  • Allows you to apply a function to multiple inputs.
  • lapply outputs a list, sapply coerces to an array.
lapply(4:5, function(i) 1 + i)
## [[1]]
## [1] 5
## 
## [[2]]
## [1] 6
sapply(4:5, function(i) 1 + i)
## [1] 5 6

For loops

  • Especially useful when one iteration’s result depends on the previous iteration.
  • Changes existing object(s).
x <- numeric(100)
x[1] <- 10
for (t in 2:length(x)) {
  x[t] <- x[t-1] + rnorm(1)
}
plot(x, type = "l", lwd = 2)

General process to “functionalize” code

  1. Break problem into smaller sub-problems.
  2. For each sub-problem, write a function.
  3. For writing each function…
  4. The main function code will include the commonalities between all situations.
  5. Features that aren’t common should be input to the function as arguments.

Example #1: Cleaning weird files

Suppose we have a folder full of CSV files like this:

## ## Data provided by X
## 
## Ozone,Solar.R,Wind,Temp,Month,Day
## 41,190,7.4,67,5,1
## NA,NA,14.3,56,5,5
## --- instrument error
## 28,NA,14.9,66,5,6
## 23,299,8.6,65,5,7
## --- instrument error
## NA,194,8.6,69,5,10
## 
## ## Year observed: 1990

Problems:

  1. Remove unnecessary lines from each file.
  2. Create a single data frame from multiple cleaned files.

Input information:

  1. Vector of file names (file_names)
file_names <- c("file1.csv", "file2.csv")

Clean a single CSV file to a string

clean_str <- function(file_name) {
  lines <- readLines(file_name)
  lines <- lines[!grepl("^\\#\\#|^--", lines) &
                   lines != ""]
  cleaned_str <- paste(lines, collapse = "\n")
  return(cleaned_str)
}

Clean multiple files then combine them into a single data frame:

clean_df <- function(file_names) {
  cleaned_strs <- lapply(file_names, clean_str)
  data_frames <- lapply(cleaned_strs, readr::read_csv)
  combined_df <- dplyr::bind_rows(data_frames)
  return(as.data.frame(combined_df))
}
head(clean_df(file_names))
## Rows: 5 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Ozone, Solar.R, Wind, Temp, Month, Day
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 7 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): Ozone, Solar.R, Wind, Temp, Month, Day
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    NA      NA 14.3   56     5   5
## 3    28      NA 14.9   66     5   6
## 4    23     299  8.6   65     5   7
## 5    NA     194  8.6   69     5  10
## 6     7      NA  6.9   74     5  11

Example #2: Fitting lots of models

How can we simplify this?

lm(mpg ~ cyl + disp + hp + drat, mtcars)
lm(mpg ~ cyl + disp + hp + wt, mtcars)
lm(mpg ~ cyl + disp + drat + wt, mtcars)
lm(mpg ~ cyl + hp + drat + wt, mtcars)
lm(mpg ~ disp + hp + drat + wt, mtcars)
lm(disp ~ mpg + cyl + hp + drat, mtcars)
lm(disp ~ mpg + cyl + hp + wt, mtcars)
lm(disp ~ mpg + cyl + drat + wt, mtcars)
lm(disp ~ mpg + hp + drat + wt, mtcars)
lm(disp ~ cyl + hp + drat + wt, mtcars)
lm(hp ~ mpg + cyl + disp + drat, mtcars)
lm(hp ~ mpg + cyl + disp + wt, mtcars)
lm(hp ~ mpg + cyl + drat + wt, mtcars)
lm(hp ~ mpg + disp + drat + wt, mtcars)
lm(hp ~ cyl + disp + drat + wt, mtcars)

Problems:

  1. Create all necessary formulas for each of the multiple Ys.
  2. Fit lm based on each of the created formulas.

Input information:

  1. Vector of Y variables (Ys)
  2. Vector of possible X variables (Xs)
  3. Number of X variables to include in each model (n_Xs)
Ys <- c("mpg", "disp", "hp")
Xs <- c("mpg", "cyl", "disp", "hp", "drat", "wt")
n_Xs <- 4

Make vector of all necessary formulas:

make_forms <- function(y, Xs, n_Xs) {
  poss_Xs <- Xs[Xs != y]
  n_poss_Xs <- length(poss_Xs)
  # All possible combinations:
  combs <- combn(n_poss_Xs, n_Xs, simplify = FALSE)
  # Change to names:
  names_ <- lapply(combs, function(x) poss_Xs[x])
  # Combine each set to single RHS of formula:
  rhs <- sapply(names_, paste, collapse = " + ")
  # Whole formulas as strings:
  form_strings <- paste(y, "~", rhs)
  # Convert to formulas:
  forms <- sapply(form_strings, as.formula,
                  USE.NAMES = FALSE)
  return(forms)
}

Fit lm() based on a single formula:

single_mod <- function(form) {
  # Fit lm:
  mod <- lm(form, mtcars)
  # Make model print prettier:
  mod$call$formula <- as.formula(mod$terms)
  return(mod)
}

Put both steps together:

fit_models <- function(Ys, Xs, n_Xs) {
  forms <- lapply(Ys, make_forms,
                  Xs = Xs, n_Xs = n_Xs)
  forms <- c(forms, recursive = TRUE)
  lms <- lapply(forms, single_mod)
  return(lms)
}
model_fits <- fit_models(Ys, Xs, n_Xs)
model_fits[[1]]
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl         disp           hp         drat  
##    23.98524     -0.81402     -0.01390     -0.02317      2.15405