When and how to write low-level (C/C++) instead of high-level (R/Python) code?
This is an introduction to interfacing low-level code with high-level languages. Below, I explain situations when you should and should not use low-level code.
- You can express your computation in terms of matrix-vector or vector-scalar operations (or other R/Python functions which are already implemented in low-level C code). In that case you should just write R code or python numpy code. For example the R code in my WeightedROC package uses the vectorized cumsum function to compute a weighted Receiver Operating Characteristic (ROC) curve faster than the pROC package (which actually uses C code in some cases, but has a lot of overhead in R code that slows it down).
- You are waiting for your high-level R/Python code to compute. In that case you should use a code profiler, which will tell you which functions in your code are taking the most time. Try using the Rprof function and reading Hadley Wickham’s guide to Profiling R code. For Python take a look at profile/cProfile. Usually, there is one function that takes 90% of the time, and you can just re-write that function in terms of vector operations in high-level R/Python code. If your code is still slow after that, then you should start thinking about re-writing that one slow function in low-level code.
- Your program uses all available memory and starts swapping.
In that case maybe you can break your problem into separate tasks that can be computed independently?
If so, then just read the first data subset into memory,
perform the first computation,
write the first result to disk,
erase the first data subset from memory,
read the second data subset into memory,
etc.
It may be useful to know how much memory your R objects are using, try
object.size(your_object)
and reading Circle 2 Growing Objects of The R Inferno by Patrick Burns.
In general the answer is: use a low-level language when there is no efficient high-level implementation. Some examples:
- You need complex data structures, for example that keep a bunch of stuff sorted in a certain way. In that case you should definitely use C++ and the STL. For example the STL multimap is a red-black tree which I used in clusterpath and binseg. Another example is mmit which uses the map container.
- You need to perform scalar operations that can’t be expressed without performing some redundant/inefficient computations using vector or matrix operations. Sometimes such operations can be coded efficiently in R using data.table, but otherwise you should use arrays in C. For example hclust() result (hierarchical clustering tree) is represented as a 2-column matrix, and then cutree() is implemented in C (converts tree into an integer vector of cluster labels). Another example is penaltyLearning::modelSelection() which uses C code to compute the exact model selection function.
- There is some high quality library code in a low-level language that is already written and you just want to use it in your high-level language. For example the revector package uses the Perl-compatible regular expressions C library; my student Qin Wenfeng’s re2r package uses Google’s RE2 C++ library.
See my video tutorials: [Make an R package with C++ code](https://www.youtube.com/playlist?list=PLwc48KSH3D1OkObQ22NHbFwEzof2CguJJ)
There are several different options. The pages linked below use either .C, .Call, or Rcpp. Which is best depends on what you want to do.
- .C is the simplest method for most numerical algorithms, and the C++ code ends up being easily portable to other languages (e.g. Python). The .C interface is limited to input/output of char/integer/double arrays, but that is not a big drawback for numerical algorithms (which mainly use int/double).
- I recommend using the Rcpp interface, which requires learning some additional C++ data types, but results in interfaces that are easy to modify/debug. It is useful for algorithms that need to input/output R internal data structures such as list and data.frame, but a drawback is that if you use the R data types then the C++ code ends up being very specific to R (and not easily portable to other languages). That drawback can be avoided by writing an interface function similar to the .C interface, see binseg example for discussion.
- In general I do not recommend .Call for casual useRs, because these days Rcpp provides more user-friendly ways to do anything that .Call can. (However power users should note that Rcpp was written using the .Call interface)
More detailed examples / case studies:
- Hello world example for .C by my student Anuraag Srivastava
- Comparing .C and Rcpp implementations of binary segmentation
- SegAnnot: a simple C library with bindings in both R and Python
- breakpointError: an R package that uses both .C and .Call
- clusterpath (standard Rinternals.h functions) vs clusterpathRcpp
- penaltyLearning: an R package with registered C and C++ functions