Chapter 9 High performance computing

9.1 Rcpp: using C++ within R

While R is a relatively new language essentially tailored for statisticians and data analysts, C++ is a well-established general-purpose programming language. This chapter does not aim at providing an overview of programming in C++ but rather how it can be used to address specific problems related to R. The point of view taken here is that of an R programmer with no notions of C++ or similar languages. Therefore we will not go into details concerning fundamental notions of C++ but rather provide some necessary concepts in order to start programming with this language and use it within R.

Having said this, R is an interpreted language which makes it particularly inefficient when performing tasks that require a repetition of operations such as for and while loops, or recursive calling of functions. On the other hand, C++ is a compiled language which means that, before being able to use your code, it is translated (compiled) to lower-level machine language which allows to considerably accelerate the code running time.

Note that C++ is just one of many solutions that can make the execution of R code more efficient such as vectorization and memory management. Profiling also helps to identify which parts of the code are the slowest to execute and therefore provides useful information to understand where to improve code efficiency.

Another note that is of interest is that C++ compiles ahead of time (before using the code), but there are other approaches that compile just in time (at runtime) such as the Julia language.

9.1.1 Installation

Working with C++ in R does not come “out of the box”, but it is nevertheless a well handled procedure due to the latest contributions to R. In order to make use of these tools, you basically need two elements: a C++ compiler and a program that permits to connect R and C++. For the latter, you need to install the Rcpp package:

# For CRAN official release
install.packages("Rcpp")

# For development version
devtools::install_github("RcppCore/Rcpp")

For the former, it depends on your OS:
1. On Windows, install RTools.
2. On Mac, install Xcode from the App Store.
3. On Linux, you need to install r-bas-dev (or similar) from your package manager.

Verify that your installation works by running:

require(Rcpp)
cppFunction("double sumC(double x, double y){
  double s;
  s = x + y;
  return s;
} ")
sumC(3.5, 6.5)
In C++, a line of code must end with a semicolon ; ottherwise the compiler will return an error.

The function sumC(x,y) takes two doubles (numeric) values (x and y) and returns a double (the sum of the two values). Can you predict what happens if you run the following lines? Can you explain why?

sumC(3L, 6L)
sumC(TRUE, FALSE)
sumC(c(3,4), c(5,6))

9.1.2 A first program

The above cppFunction() creates an inline C++ function (within your R code). However, it is a better practice to create a separate file dedicated to C++ functions. It is possible then to compile your C++ functions and call them within R.

In RStudio, you can create a C++ file by clicking on File > New File > C++ File. RStudio provides you with an example. At the top of C++ file, you should see:

#include <Rcpp.h>
using namespace Rcpp;

The first line tells your program that you can access functionalities from the Rcpp library. The second line tells your program that you can access those functionalities without specifying/loading the Rcpp library. For example, suppose there is function mean() in Rcpp. With the first line, you could use this function only by typing Rcpp::mean(). Adding the second line allows you to directly write mean() in your program. This is somehow equivalent to using library(Rcpp) in your R code.

Once the C++ code is written, it should be saved with the appropriate extension which is either *.cc or *.cpp. Now you can compile your first C++ program and for this type the following command in your R file:

sourceCpp("file_name.cpp")

To test this, try to write the sumC() function within a separate C++ file and compile it in R.

By default, C++ functions are not accessible within your R program. One key element is to use the attribute [[Rcpp::export]] in comment just before the function definition. For example, the separate C++ file could look something like this:

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
double sumC(
    double x,
    double y
) {
  double s;
  s = x + y;
  return s;
}
In C++, the symbol // is used for single line comment. The symbol /* starts a comment which can possibly cover multiple lines and */ ends the comment.

9.1.3 Data structures

One difference you can immediately notice while working with C++ is that you need to tell the program what type of data structure you intend to use before using it. This makes the writing (and execution) of the code “safer” since the programmer pay more attention to the implementation details thereby preventing the code from being used in a way for which it was not intended. On the other hand, a drawback of this characteristic is that the programmer will have to spend (at least at the beginning) more time conceiving the code.

Here is a list of primitive data types with their keywords:
- int (4) : integer,
- float (4) : single precision floating number (numeric),
- double (8) : double precision floating number (numeric),
- bool : logical (true or false),
- char : character,
- void : without any value.

The numbers in parenthesis represents typical memory (in bytes) required to store the variable. Single and double precision floating numbers represent how the computer stores real numbers in the memory. float requires less memory than double but is less precise. As a general rule, we only use type double since this is the default type in R:

R has no single precision data type. All real numbers are stored in double precision format.

help(double) in R

These data types can be modified by using one (or more) of the following keywords:
- long : allocate more space in memory,
- short : allocate less space in memory,
- unsigned : numbers take only positive value.

For more complex data structures such as vectors and matrices, the story is slightly more complicated. Luckily for R programmers, Rcpp provides many predefined structures that make the transition to C++ easier. Here is a non-exhaustive list:

Table : Common data structure in R with Rcpp equivalent.

Rcpp R Role
NumericVector c(...) or vector(mode = "double", ...) A vector of double
IntegerVector c(...) or vector(mode = "integer", ...) A vector of integer
LogicalVector c(...) or vector(mode = "bool", ...) A vector of boolean
NumericMatrix matrix() A matrix of double
IntegerMatrix matrix() A matrix of integer
LogicalMatrix matrix() A matrix of boolean
List list A list
DataFrame data.frame() A data frame

9.1.3.1 Vectors

Before using a vector in C++, you need to create one. For example,

NumericVector v(10);

creates an empty vector of double of length 10.

To access elements of a vector, the syntax is the same as in R with one notable exception: accessor starts at 0! So if you want to access the first element of a vector v, you should write v[0], the ith element, v[i-1]. Note that you can equivalently write v(0) and v(i-1).

The length of the vector v can be obtained as follows

int n = v.length();

Here the length of the vector is saved into an integer n.

Note that length() is a method for the class NumericVector and uses the syntax ‘’dot-function’’. Theses concepts are beyond the scope of this introduction and the only aspect to underline is that the above notation gives the same result as if you wrote length(v) in R.

9.1.3.2 Matrices

Creating a matrix is similar to creating a vector. The command

NumericMatrix m(3,4);

creates an empty matrix with 3 rows and 4 columns. This is equivalent to writing the following in R

m <- matrix(nr=3,nc=4)

The syntax for accessing matrix elements is slightly different. For example, accessing a column or a row can be done as follows

// Copy first column into a vector
NumericVector v = m( _ , 0);

// Copy first row into a vector
NumericVector v = m(0, _);

// Access the ith row, jth column
double x = m(i-1, j-1); // i and j are positive integers already defined

The number of columns/rows of a matrix can be retrieved using

// number of rows
int nr = m.nrow();

// number of columns
int nc = m.ncol();

9.1.3.3 List

List creation can be accomplished as follows:

// Create a list with elements e1 and e2
List L = List::create(e1, e2);

// Create a list with elements e1 and e2 with names
List L = List::create(Named("name1") e1, Named("name2") e2);

The elements e1 and e2 can be of any type (e.g. vector, matrix, etc.)

9.1.4 Control structures

Logical operators in R and C++ are similar. Here is a list where x and y are scalar (not vectors!).

Common logical operators for C++.
Operator Description
x > y x greater than y
x >= y x greater than or equal to y
x < y x less than y
x <= y x less than or equal to y
x == y x equals to y
x != y x not equal to y
x && y x and y
x || y x or y
!x not x

9.1.4.1 if/else/elseif statements

R and C++ share the same syntax also for if/else/elseif statements with the exception that else if has a space in C++ syntax. For example, the following expression is valid within both R and C++:

if (condition){
  plan A;
}else{
  plan B;
}

9.1.4.2 loops

Loops are much faster in C++ than in R and the syntax is slightly different. For example, the following R for loop

for(i in 1:10){
  print("The value is: ", i)
}

can be written in C++ with the following code

for(int i(1); i<=10; i++){
  Rcpp::Rcout << "The value is: " << i << std::endl;
}

Three elements need to be defined:
1. The initialization statement: an iterator (i here), generally of type integer, needs to be initialized (here i is initialized to 1).
2. The test expression: the expression is evaluated and should return a boolean. If false the loop stops and if true, the body of the for loop is evaluated.
3. The update statement: it defines the increment of the iterator, usually set to i++.

Note that we can also allow the iterators to decrease via the decrement operator i--.

C++ also offers variants of assignment operators.

Assignment operator in C++. x and y are already defined.
Operator Equivalence Description
x = y; Assign the value of y to x
x += y; x = x + y; Take the value of x and add y, then assign to x
x -= y; x = x - y; Take the value of x and substract y, then assign to x
x *= y; x = x * y; Take the value of x and multiply by y, then assign to x
x /= y; x = x / y; Take the value of x and divide by y, then assign to x

Therefore, i++ is equivalent to i+=1. If one therefore wanted to increment by 2 at each loop, one would use i+=2.

The C++ while loop has the same syntax as R. The following expression

while( condition ) {
  body;
}

is the same in both languages.

Note that C++ is also equipped with a do ... while loop. This may be useful, for example, in the situation where you need a first run of the body prior to checking the condition.

do{
  body;
}
while(condition);

9.1.5 Functions

Creating functions in C++ is fundamentally not much different from writing them in R. Neverthelss, C++ requires an additional layer of precision. A function can follow this pattern:

type function_name(type arg1, ..., type argn){
  body of the function;
  
  return obj;
}

First, the type of return of the function should be specified. The obj that is returned should be of the same type as the one specified at the start of the function declaration and each argument should have a type. Otherwise, as in R, the body of the function contains statements that define what the function does.

As mentioned earlier, by default a C++ function cannot be used within R and the attribute [[Rcpp::export]] in comment just before the function definition needs to be specified. For example

// [[Rcpp::export]]
type function_name(type arg1, ..., type argn){
  body of the function;
  
  return obj;
}

will make function_name available within R.

9.1.6 Comparing with R

Making an effort of writing C++ code instead of R code should deliver advantages. One way to measure these advantages is by comparing the computational efficiency of a C++ function with the equivalent R function. This can be achieved, for example, using the microbenchmark package.

For example, suppose you create a function that computes the factorial number of a positive integer input. In R, you can write

my_factorialR <- function(n){
  factorial <- 1
  for(i in 2:n){
    factorial <- factorial * i
  }
  return(factorial)
}

An equivalent C++ function could be

int my_factorialC(int n){
  int factorial = 1;
  for(int i(2); i<=n; ++i){
    factorial *= i;
  }
  return factorial;
}
require(microbenchmark)
## Loading required package: microbenchmark
## Warning: package 'microbenchmark' was built under R version 3.6.2
microbenchmark(my_factorialR(500), my_factorialC(500))
## Unit: microseconds
##                expr  min   lq   mean median    uq  max neval
##  my_factorialR(500) 11.5 11.6 12.383   11.7 11.95 31.8   100
##  my_factorialC(500)  1.4  1.6  2.068    1.7  2.10 16.4   100

It is clear that the computational performance is considerably different.

9.1.7 Example: Buffon’s needle (con’t)

Let’s reconsider Buffon’s needle problem. We propose to recode the R functions cast_needle() and buffon_experiment() into C++.

For this purpose, first create a C++ file with header

#include <Rcpp.h>
using namespace Rcpp;

The cast_needle() function was defined as

cast_needle <- function(plane_width = 20){
  available_range <- plane_width/2 - 1 # where 1 is the length of the needle (unit)
  x_start <- runif(2, -available_range, available_range)
  angle <- runif(1, 0, 2*pi)
  x_end <- c(cos(angle), sin(angle)) + x_start # where the angles are multiplied by the needle length which is 1 in this example
  cross <- floor(x_start[2]) != floor(x_end[2])
  out <- list(start = x_start, end = x_end, cross = cross)
  out
}

A possible C++ implementation is

// [[Rcpp::export]]
List cast_needle(
  unsigned int plane_width = 20
){
  // Variable declaration
  int available_range;
  double angle,my_pi;
  NumericVector x_start(2), x_end(2), tmp_angle(1);
  bool cross;
  
  // Computation
  my_pi = atan(1) * 4; // defines pi (for simplicity)
  available_range = plane_width / 2 - 1;
  x_start = runif(2, -available_range, available_range);
  tmp_angle = runif(1, 0, 2 * my_pi);
  angle = tmp_angle[0];
  x_end[0] = cos(angle) + x_start[0];
  x_end[1] = sin(angle) + x_start[1];
  cross = floor(x_start[2]) != floor(x_end[2]);
  
  return List::create(
    Named("start") = x_start,
    Named("end") = x_end,
    Named("cross") = cross
  );
}

Based on the above function, we can make a few remarks:

  • For better readibility, it is good practice to separate variable declaration and computation.
  • The value of \(\pi\) is not accessible by default so in this case we simply approximate it (but better solutions exist).
  • Basic functions such as atan(), runif(), floor(), cos(), sin() are made available by Rcpp (check for Rcpp sugar).
  • To avoid an error we have first to save the result of runif() as NumericVector (its default result type).

The buffon_needle() function was declared in R as follows:

buffon_experiment <- function(B = 2084, plane_width = 10, seed = NULL){
  
  if (!is.null(seed)){
    set.seed(seed)
  }
  
  X_start <- X_end <- matrix(NA, B, 2) 
  cross <- rep(NA, B)
  
  for (i in 1:B){
    inter <- cast_needle(plane_width = plane_width)
    X_start[i, ] <- inter$start
    X_end[i, ] <- inter$end
    cross[i] <- inter$cross
  }
  
  out <- list(start = X_start, end = X_end, cross = cross, plane = plane_width)
  class(out) <- "buffon_experiment"
  out
}

The C++ alternative we propose is the following

// [[Rcpp::export]]
List buffon_experiment(
  int B = 2084,
  int plane_width = 10
){
  // Variable declaration
  NumericMatrix X_start(B,2), X_end(B,2);
  LogicalVector cross(B);
  List L;
  NumericVector tmp(2);
  
  // Computation
  for(int i(0); i<B; ++i){
    L = cast_needle(plane_width);
    tmp = L["start"];
    X_start(i,_) = tmp; 
    tmp = L["end"];
    X_end(i,_) = tmp;
    cross(i) = L["cross"];
  }
  
  return List::create(
    Named("start") = X_start,
    Named("end") = X_end,
    Named("cross") = cross,
    Named("plane") = plane_width
  );
}

Also in this case we have few remarks:

  • C++ does not know the content of L["start"] in advance, so for this reason it is first saved into a vector tmp of the correct dimension (the same for L["end"]).
  • To avoid complexity, the list returned by the C++ implementationis not of the class buffon_experiment. This should be added within R.

9.1.8 C++ in an R package

Writing an R package with C++ code follows essentially the same steps as a regular R package. Here we underline the main differences:

  1. Create an “empty” R package: the option R package using Rcpp should be selected as Project Type.
  2. Edit description file: Rcpp has automatically been added in Imports and LinkingTo fields.
  3. Move your R scripts into the R folder. In addition, move the C++ files into the src folder; src stands for source.The C++ function must have the attribute // [[Rcpp::export]] prior to function definition in order to make it available within R.
  4. Documentation: follow the exact same steps as for R files. For C++ files, documentation uses the same syntax except that # should be replaced by // (comments in C++).
  5. The steps that follow are identical.