The basic idea behind speeding up R with a C script is to write a C script, compile it, load it into the environment (ask it over to the house, so to speak) and call it from R with a built-in R function. There are some choices here but for our test case, we're going to use the .C() function.
A couple ground rules first. The C function must be of type void (does not return nothing) and needs to accept pointer arguments. You define the argument in R as an object, and it gets passed as a pointer to the C function. R objects are basically pointers anyway so this is not terribly difficult. Let's illustrate before this becomes a lecture on pointers and objects.
Here is the code for an R function that uses an R loop to print out permutations of a double loop. It's the guts of a brute force search but only returns a string saying that it was able to locate and print each permutation. You can either copy and paste it into R or save it and source() it.
twoloop <- function(){
for(i in seq(10, 50, 1 ))
for(j in seq(80, 240, 4))
cat("Parameter one is", i, "and Parameter two is",j,"\n" )
}
We are stepping from 10 to 50 in increments of 1 (40 events) and also from 80 to 240 in steps of 4 (40 events again). The permutation total is those two multiplied, which comes to 1600. The size of a large sofa basically. R can do it, but it takes some time because of the explicit looping. Here is the tail of the output along with performance statistics. The system.time() function is used to calculate performance.
system.time(twoloop())
.
.
.
Parameter one is 50 and Parameter two is 224
Parameter one is 50 and Parameter two is 228
Parameter one is 50 and Parameter two is 232
Parameter one is 50 and Parameter two is 236
Parameter one is 50 and Parameter two is 240
user system elapsed
0.070 0.031 0.139
To get some help from C, we write the following script:
#include < R.h >
#include < stdio.h >
void twoloop(int *startOne, int *stopOne, int *stepOne, int *startTwo, int *stopTwo, int *stepTwo)
{
int i,j;
for(i = *startOne; i < *stopOne+1; i = i + *stepOne)
for(j = *startTwo; j < *stopTwo+1; j = j + *stepTwo)
Rprintf("Parameter one is %d and Parameter two is %d\n", i, j);
}
Notice the pointer arguments. Also, notice we include R.h header file. Save this script as twoloop.c and then from command line, use the following instruction to compile it.
R CMD SHLIB twoloop.c
This creates a new file called twoloop.so that will be loaded into R with the following function inside of an R session.
dyn.load("twoloop.so")
Great, now we use the .C() function to call the function. But first we need to create some R objects. There are six variables so here is what we'll do.
a <- 10
b <- 50
c <- 1
p <- 180
q <- 240
r <- 4
Now we simply pass those objects in as integers. Remember, that is what the C function is expecting so let's not mess around. It is here to help us after all.
.C("twoloop", as.integer(a), as.integer(b), as.integer(c), as.integer(p), as.integer(q), as.integer(r))
Of course if you're like me, you run the function first because you're so excited to see if it actually works. It does and then you remember that you were supposed to time it. No worry here, hit the up arrow to prompt R to display the last command (the monstrosity above) and hit Ctrl-A to get to the beginning of the command. Insert system.time() like thus:
system.time(.C("twoloop", as.integer(a), as.integer(b), as.integer(c), as.integer(p), as.integer(q), as.integer(r)))
Here is the tail of the C function's output along with its performance.
Parameter one is 50 and Parameter two is 224
Parameter one is 50 and Parameter two is 228
Parameter one is 50 and Parameter two is 232
Parameter one is 50 and Parameter two is 236
Parameter one is 50 and Parameter two is 240
user system elapsed
0.002 0.002 0.017
Nice. Time to crack a beer on the new sofa. With our best friend C.
library(compiler)
ReplyDeletetwoloop_cmp <- cmpfun(twoloop)
system.time(twoloop_cmp())
> user system elapsed
0.05 0.00 0.03
Not as fast as the C version, but better than the uncompiled R version.
Also, this function (which is vectorized) is faster than either the R function or the C function you wrote, and is very useful for doing brute-force parameter searches. I know you're just showing an example of how to write C function (thanks, it's a great tutorial btw), but if you can find a way to vectorize your operations, R is surprisingly fast.
ReplyDeletesystem.time(expand.grid(seq(10, 50, 1 ),seq(80, 240, 4)))
> user system elapsed
0 0 0
@Zach I didn't know about the compiler package, thanks.
ReplyDeleteI agree with you that vectorizing is a first choice. Then C, then the apply() family (which are basically pretty R for loops in disguise) and lastly the dreaded R for loop.
I was quite frankly surprised at how easily R and C get along. I'm wondering how common it is to have a small set of C scripts load up with .First(). It's on my short list of experiments, probably after getting a grip on the .Call() function.
I think that's one of the huge advantages of R-- a lot of it has been written in C/C++/Fortran and it plays very nicely with all of those languages. Also check out Rcpp: http://dirk.eddelbuettel.com/code/rcpp.html
ReplyDeletehttp://cran.r-project.org/web/packages/Rcpp/index.html