Parallelization Example

Author

Prof. Tiffany Tang

Published

May 8, 2026

Show Code
knitr::opts_chunk$set(echo = TRUE)

library(future)
library(furrr)

# load in helper functions
for (fname in list.files(here::here("R"), pattern = "*.R")) {
  source(here::here(file.path("R", fname)))
}

# helper variables
DATA_PATH <- here::here("data")
n <- 10  # do leave-one-out for first 10 samples for illustration
Show Code
import sys
sys.path.append("..")

import os
from os.path import join as oj
import pandas as pd
import numpy as np
import time
from joblib import Parallel, delayed
import inspect

sys.path.append('..') # add parent directory to path
from python.load import load_brca_data, load_subtype_data
from python.clean import apply_log_transform
from python.fit import fit_rf_loo

# Helper variables
DATA_PATH = oj("..", "data")
n = 10  # Do leave-one-out for the first 10 samples for illustration

Load Data

In this parallelization example, we will be working with gene expression data from women with breast cancer from The Cancer Genome Atlas (TCGA) (and also used in lab 2). In particular, we will be using the gene expressions to predict their breast cancer subtype (Luminal A, Luminal B, Basal, Her2, Normal-like). Let’s first load in this data.

Show Code
# load in TCGA breast cancer data
X <- load_brca_data(DATA_PATH) |> 
  apply_log_transform()
y <- as.factor(load_subtype_data(DATA_PATH)$subtype)
Show Code
head(X[, 1:5])
# A tibble: 6 × 5
   NAT1  NAT2 RP11.986E7.7 AADAC  ABAT
  <dbl> <dbl>        <dbl> <dbl> <dbl>
1  7.40 0.933        10.9  0.572  7.72
2  9.38 3.73         10.1  1.39   7.46
3  7.76 3.12         10.7  0      8.46
4  6.16 1.41          7.34 0      6.22
5  4.40 1.23          8.20 0.710  5.88
6  4.34 0.367         9.68 0      7.38
Show Code
head(y)
[1] LumA LumB LumA LumB Her2 LumA
Levels: Basal Her2 LumA LumB
Show Code
# Load in TCGA breast cancer data
X = load_brca_data(DATA_PATH).values  # convert to numpy array
X = apply_log_transform(X)
y = load_subtype_data(DATA_PATH)['subtype'].values  # convert to numpy array
Show Code
X
array([[ 7.40176299,  0.93318718, 10.85901081, ...,  3.50921137,
         3.23065419,  3.27872085],
       [ 9.37559946,  3.73175933, 10.12843209, ...,  2.3830677 ,
         4.51640127,  5.09027854],
       [ 7.75682541,  3.12374283, 10.69036948, ...,  3.75331797,
         0.89559811,  3.57832492],
       ...,
       [ 8.13355076,  3.06390481, 10.27647796, ...,  4.01444421,
         3.06390481,  4.02652065],
       [ 5.66191514,  0.96397552,  8.57902627, ...,  3.5520512 ,
         0.        ,  4.66282664],
       [ 7.82862208,  3.25703829,  9.9021779 , ...,  2.34617126,
         4.57316545,  5.00141895]])
Show Code
y
array(['LumA', 'LumB', 'LumA', ..., 'LumA', 'LumA', 'LumA'], dtype=object)

Fitting leave-one-out models

For the sake of this demonstration, suppose that we want to evaluate the performance of a random forest model on this data using leave-one-out cross-validation (also known as the jackknife in statistics). The jackknife is a useful technique for estimating the bias and variance of a statistical model/estimator and can also be used to identify influential observations (i.e., those observations that have a large impact on the model’s performance).

To do so, there is a helper function fit_rf_loo() in R/fit.R or python/fit.py that takes in the covariate data X, the response variable y, and the index of the observation to leave out i. This function fit_rf_loo() fits a random forest model on the data with the i-th observation left out and returns the predicted class of the left-out i-th observation.

Show Code
fit_rf_loo
function (i, x, y, ...) 
{
    fit <- ranger::ranger(x = x[-i, , drop = FALSE], y = y[-i], 
        num.threads = 1, ...)
    preds <- as.character(predict(fit, x[i, , drop = FALSE])$predictions)
    return(preds)
}
Show Code
print(inspect.getsource(fit_rf_loo))
def fit_rf_loo(i, x, y, **kwargs):
    """
    Fit a random forest model with sample `i` left out

    Parameters:
    -----------
    i: int
        The sample to leave out
    x: np.array
        Covariate matrix
    y: np.array
        Response vector
    **kwargs:
        Arbitrary keyword arguments for RandomForestClassifier

    Returns:
    --------
    str
        Predicted class of the left out sample `i`
    """
    # Remove the ith observation for leave-one-out
    x_train = np.delete(x, i, axis=0)
    y_train = np.delete(y, i)

    # Train the RandomForestClassifier
    rf = RandomForestClassifier(**kwargs)
    rf.fit(x_train, y_train)

    # Make a prediction on the left-out observation
    preds = rf.predict(x[i:(i+1), :])[0]
    return preds

Without Parallelization

Let’s first run the leave-one-out cross-validation without parallelization to see how long it takes. We will only run this on the first n = 10 observations for demonstration purposes.

Show Code
set.seed(242)
start_time <- Sys.time()
preds <- rep(NA, n) # vector of leave-one-out predictions
for (i in 1:n) {
  preds[i] <- fit_rf_loo(i, X, y)
}
end_time <- Sys.time()
execution_time <- end_time - start_time
print(execution_time)
Time difference of 45.83143 secs
Show Code
start_time = time.time()
preds = np.empty(n, dtype=object)  # Vector of leave-one-out predictions
for i in range(n):
    preds[i] = fit_rf_loo(i, X, y)
end_time = time.time()
execution_time = end_time - start_time
print("Execution time without parallelization:", execution_time)
Execution time without parallelization: 21.470791816711426

With Parallelization

Next, let’s run the leave-one-out cross-validation with parallelization using the future library in R and the joblib library in Python. But before implementing this, let’s check how many cores are available on this machine using future::availableCores() in R and os.cpu_count() in Python.

Show Code
availableCores()
system 
    11 
Show Code
os.cpu_count()
11

Now, to parallelize this code, there are two steps:

Step 1: Setting up the parallel backend.

  • In R: This is done using the future::plan() function, which specifies the parallel backend to use. Here, we will use the multisession backend, which creates multiple (independent) background R sessions to parallelize the computation.
  • In Python: This is done using Parallel(n_jobs=...), where we specify the number of cores we would like to use.

Step 2: Re-write the code using futures (or delayed()) A future is an object that represents the result of an expression that will be evaluated in the future.

  • In R: We can create a future using the future::future() function and extract the value of the future using the future::value() function.
  • In Python: We need to put the code that we want to run in parallel into a single function (luckily, this is already done) and wrap that function call using delayed(). This creates futures that can be evaluated in parallel.
Show Code
set.seed(242)

# step 1: set up parallel backend
n_workers <- 5
plan(multisession, workers = n_workers)

start_time <- Sys.time()
futures <- list()
for (i in 1:n) {
  # step 2: wrap code in a "future" object
  futures[[i]] <- future({
    fit_rf_loo(i, X, y)
  }, seed = TRUE)
  # (optional) checking whether the future we created has been resolved or unresolved
  print(resolved(futures))
  print(paste("Intermediary Time Checkpoint:", Sys.time() - start_time))
}
preds_future <- lapply(futures, value)
end_time <- Sys.time()
execution_time <- end_time - start_time
print(paste("Final Time Checkpoint:", execution_time))
[1] FALSE
[1] "Intermediary Time Checkpoint: 0.100641012191772"
[1] FALSE FALSE
[1] "Intermediary Time Checkpoint: 0.214888095855713"
[1] FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 0.32613205909729"
[1] FALSE FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 0.454898118972778"
[1] FALSE FALSE FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 0.710055112838745"
[1]  TRUE FALSE FALSE FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 5.57519102096558"
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 5.72342705726624"
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 5.85366606712341"
[1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 6.01137018203735"
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[1] "Intermediary Time Checkpoint: 6.16991114616394"
[1] "Final Time Checkpoint: 10.9680399894714"
Show Code
start_time = time.time()
preds_parallel = Parallel(n_jobs=5)(delayed(fit_rf_loo)(i, X, y) for i in range(n))
end_time = time.time()
execution_time = end_time - start_time
print("Execution time with joblib parallelization:", execution_time)
Execution time with joblib parallelization: 7.436609983444214

Note:

  • The futures are “non-blocking” so that the code continues to run (while the futures are being computed in the background) until it reaches a point where the value of the future is needed. This is why the print statement shows that the futures are at first unresolved and there is a time gap between the intermediary checkpoint and the final checkpoint.
  • A future can be resolved or unresolved. In R, we can check whether a future is resolved using the future::resolved() function.
  • If the value is queried while the future is still unresolved, the current process is blocked until the future is resolved. Otherwise, the code continues to execute (i.e., it’s unblocking).
  • The futures are resolved when the value() function is called, and the code waits for the future to be resolved before continuing.

Additional Notes for R Users

Using furrr

Sometimes, it can be more convenient to rewrite the for loop using the furrr package, which provides a functional programming interface for parallel programming in R. The furrr::future_map() function is similar to the purrr::map() function, but it runs the function in parallel using futures.

Show Code
set.seed(242)
# step 1: set up parallel backend
plan(multisession, workers = n_workers)
start_time <- Sys.time()
# step 2: wrap code in a "future" object
preds_furrr <- furrr::future_map(
  1:n,
  function(i) {
    fit_rf_loo(i, X, y)
  },
  .options = furrr::furrr_options(seed = TRUE)
)
end_time <- Sys.time()
execution_time <- end_time - start_time
print(execution_time)
Time difference of 10.5588 secs

Parallelization backends

In short, use multisession if running R locally in RStudio, and use multicore for simple parallelization workflows on the CRC.