We are going to prepare some information nuggets for libsvm, the C/C++ library of SVM algorithms. It is open-source and award-winning, so it's a good brain for our purposes. It aims to predict an outcome given a specific set of conditions. Before it can start on the business of prediction, it needs to learn from examples. It needs to eat some chicken nuggets. And it needs those nuggets presented in a specific way. The libsvm format for data is:
label index1:value1 index2:value2 ...
The label is the answer. This of course depends on what you want to predict. Let's say we want to predict whether a particular stock is going to be up or down from yesterday. In this case, the label will be either 1 or -1. This is what it looks like when the answer is an up day:
1 index1:value1 index2:value2 ...
We could say TRUE or FALSE, but libsvm needs a numerical representation. The index:value pairs that follow the label are what you think the brain needs to get the answer correct. They follow this format:
1 1: "value of 50-day SMA" 2: "value of RSI" ...
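To make the format concrete, here is a minimal sketch in base R that turns one label and a vector of feature values into a single libsvm line. The helper name to_libsvm_line is made up for illustration and is not part of libsvm or quantmod, and the feature values are just two numbers borrowed from the example row later in this post.

```r
# Hypothetical helper (not part of libsvm): build one line in libsvm format
# from a class label and a numeric vector of feature values.
to_libsvm_line <- function(label, features) {
  # Indices start at 1 and there is no space after the colon.
  pairs <- paste0(seq_along(features), ":", features)
  paste(c(label, pairs), collapse = " ")
}

# An up day with two features.
print(to_libsvm_line(1, c(44.71, 60.54)))
```

Very large or very small values may be printed in scientific notation by R, so it is worth eyeballing the output before handing it to libsvm.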
We need to replace each quoted string with an actual number, and that's where R comes in. I chose a Dow 30 stock as my example. The following R code gets us most of the way there.
require("quantmod")
getSymbols("MCD")
MCD$Cl.sma_10 <- Lag(SMA(Cl(MCD), n=10)) #yesterday's value
MCD$Cl.sma_30 <- Lag(SMA(Cl(MCD), n=30)) #yesterday's value
MCD$Vo.sma_10 <- Lag(SMA(Vo(MCD), n=10)) #yesterday's value
MCD$Vo.sma_30 <- Lag(SMA(Vo(MCD), n=30)) #yesterday's value
MCD$Cl.rsi <- Lag(RSI(Cl(MCD))) #yesterday's value
MCD$Cl.return.daily <- Lag(Delt(Cl(MCD))) #yesterday's value
MCD$Cl.return.10 <- Lag(Delt(Cl(MCD), k=10)) #yesterday's value
MCD$Cl.return.30 <- Lag(Delt(Cl(MCD), k=30)) #yesterday's value
MCD$pre_answer <- Delt(Cl(MCD)) #today's pre_answer
# Map a return to a class label: 1 for an up day, -1 otherwise.
squish <- function(x) {
  if (x > 0)
    return(1)
  else
    return(-1)
}
MCD <- na.locf(MCD, na.rm=TRUE)
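# x[15] is the pre_answer column (6 price/volume columns + 8 indicators + pre_answer).
# Note: assigning the cbind() result into MCD$answer appends a second copy of
# every column, so each row ends up with 31 fields.
MCD$answer <- cbind(MCD, apply(MCD,1, function(x)squish(x[15])))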
write.table(MCD, "~/Desktop/goo", row.names=FALSE, col.names=FALSE)
Try this yourself. You should get a text file on your desktop called goo. Here is what the first row looks like, but a warning first. It's not pretty. Remember what we're making here.
45 45.38 44.86 45.32 6806600 39.83 44.71 44.2873333333333 4683760 6206100 60.5405839422285 -0.000888494002665663 0.0112410071942446 0.0253020287212218 0.00755891507336592 45 45.38 44.86 45.32 6806600 39.83 44.71 44.2873333333333 4683760 6206100 60.5405839422285 -0.000888494002665663 0.0112410071942446 0.0253020287212218 0.00755891507336592 1
I would have preferred doing a little more prep in the R code, but some mysterious goings-on created new columns when I tried indexing out data I wasn't interested in. I suppose it has to do with not being able to delete a column whose value creates a column you want to keep. Not sure about this one. I've turned to a little awk wizardry to get the values I truly want, and to get the format just so. Here, we convert the goo file to a paste file. This is all on one line from the command line in the directory where your goo file is located.
$ awk '{print $31 " 1:" $7 " 2:" $8 " 3:" $9 " 4:" $10 " 5:" $11 " 6:" $12 " 7:" $13 " 8:" $14}' goo > paste
This is the first line of the paste file. Still not appropriate for visually sensitive people, so be careful.
1 1:44.71 2:44.2873333333333 3:4683760 4:6206100 5:60.5405839422285 6:-0.000888494002665663 7:0.0112410071942446 8:0.0253020287212218
I didn't mention Python earlier because I didn't want to make this sound too complicated right off the bat. But there is actually a little Python script that checks to see if the format is satisfactory. The program comes with libsvm.
$ python checkdata.py paste
No error.
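For a feel of what such a format check involves, here is a rough sketch in base R. It is not a port of checkdata.py, and the function name is_libsvm_line is hypothetical. It only assumes the rules implied by the format shown earlier: a numeric label, followed by index:value pairs with integer indices in increasing order and no space after the colon.

```r
# Hypothetical checker (not checkdata.py): TRUE if a line looks like
# "label index1:value1 index2:value2 ..." with increasing integer indices.
is_libsvm_line <- function(line) {
  tokens <- strsplit(trimws(line), "[ \t]+")[[1]]
  if (length(tokens) < 1 || tokens[1] == "") return(FALSE)
  # The label must parse as a number.
  if (is.na(suppressWarnings(as.numeric(tokens[1])))) return(FALSE)
  if (length(tokens) == 1) return(TRUE)
  # Every remaining token must split into exactly index and value.
  pairs <- strsplit(tokens[-1], ":", fixed = TRUE)
  if (any(lengths(pairs) != 2)) return(FALSE)
  idx_text <- vapply(pairs, `[`, "", 1)
  val <- suppressWarnings(as.numeric(vapply(pairs, `[`, "", 2)))
  if (!all(grepl("^[0-9]+$", idx_text)) || anyNA(val)) return(FALSE)
  # Indices must be strictly increasing.
  all(diff(as.integer(idx_text)) > 0)
}

print(is_libsvm_line("1 1:44.71 2:44.2873333333333 3:4683760"))
print(is_libsvm_line("1 1: 44.71 2: 44.2873333333333"))
```

The second call fails the check because the space after each colon splits every pair into two tokens.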
There is still a little work to do with scaling the data, so be careful not to feed this raw paste to your beloved algorithm just yet. You still need to change the bubblegum color and pasteurize it. The README that comes with libsvm explains it well.
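As an illustration of what scaling means here, the sketch below linearly maps each feature column onto the range -1 to 1 in base R. libsvm ships its own svm-scale tool for this job, so treat this as a picture of the idea rather than a replacement. The function name scale_to_range and the sample numbers are made up for the example.

```r
# Hypothetical helper: linearly rescale a numeric vector to [lower, upper].
scale_to_range <- function(x, lower = -1, upper = 1) {
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    # A constant column carries no information; this sketch maps it to the midpoint.
    return(rep((lower + upper) / 2, length(x)))
  }
  lower + (upper - lower) * (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up sample: a price-like column and a volume-like column on very different scales.
features <- cbind(sma = c(44.0, 44.5, 45.0),
                  volume = c(4000000, 5000000, 6000000))
scaled <- apply(features, 2, scale_to_range)
print(scaled)
```

After scaling, the volume column no longer dwarfs the price column. If you later score new data, the minimum and maximum found on the training data should be reused rather than recomputed.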
The elegance of this approach to feeding your brain is that not only do you control the ingredients, but you can experiment with those ingredients to find the best chicken nugget recipe of all time. Ever.
UPDATE ON CODE:
A suggestion by a reader below replaces lines 13-25 of the example above with the following:
MCD$answer <- sign(Delt(Cl(MCD)))
MCD <- na.locf(MCD, na.rm=TRUE)
MCD <- cbind(MCD$answer, MCD[,-ncol(MCD)])
I timed my verbose version that uses an explicit loop in R and got the following speed data:
proc.time() - ptm
user system elapsed
0.394 0.004 0.415
Then running the more compact version suggested by @jro below:
proc.time() - ptm
user system elapsed
0.181 0.002 0.187
Much faster, compact and the proper way to write code in R.
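The gap comes from calling an R function once per row versus making one vectorized call. Here is a self-contained sketch of the same comparison on synthetic returns (random numbers, not MCD data). Timings will vary by machine, so none are quoted here. One difference worth knowing: sign() returns 0 for an unchanged close, while squish returns -1, so the vectorized version can produce a third label.

```r
# Synthetic daily returns stand in for Delt(Cl(MCD)); these are random numbers.
set.seed(1)
returns <- rnorm(100000)

# Row-at-a-time version, same logic as the squish function in the post.
squish <- function(x) if (x > 0) 1 else -1

ptm <- proc.time()
looped <- sapply(returns, squish)
print(proc.time() - ptm)

# Vectorized version: one call over the whole vector.
ptm <- proc.time()
vectorized <- sign(returns)
print(proc.time() - ptm)

# The two agree whenever no return is exactly zero.
print(all(looped == vectorized))

# They differ on a flat day.
print(squish(0))  # -1
print(sign(0))    # 0
```

If flat days matter for your data, decide up front whether they count as down days or get dropped, since libsvm will treat a 0 label as its own class.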
libsvm is accessible via the CRAN package e1071 -- that may provide a more direct way for you.
@edd thanks for the comment. I'm likely going that route long-term, but wanted to test drive the parent library. Also, I think the file formatting exercise may have other benefits to separate problems. I made things look a little more complicated than need be in a polished solution to a specific problem. There is an excellent article (sorry no link) published by Journal of Statistical Software (April 2006, Volume 15, Issue 9) in which the authors compare kernlab, e1071, klaR and svmpath R packages. kernlab and e1071 come out on top in their comparison and as you mentioned, e1071 is a robust interface to the libsvm library.
Fyi, you could use the sign() function instead of writing the custom squish func. E.g.,
require("quantmod")
getSymbols("MCD")
MCD$Cl.sma_10 <- Lag(SMA(Cl(MCD), n=10)) #yesterday's value
MCD$Cl.sma_30 <- Lag(SMA(Cl(MCD), n=30)) #yesterday's value
MCD$Vo.sma_10 <- Lag(SMA(Vo(MCD), n=10)) #yesterday's value
MCD$Vo.sma_30 <- Lag(SMA(Vo(MCD), n=30)) #yesterday's value
MCD$Cl.rsi <- Lag(RSI(Cl(MCD))) #yesterday's value
MCD$Cl.return.daily <- Lag(Delt(Cl(MCD))) #yesterday's value
MCD$Cl.return.10 <- Lag(Delt(Cl(MCD), k=10)) #yesterday's value
MCD$Cl.return.30 <- Lag(Delt(Cl(MCD), k=30)) #yesterday's value
MCD$answer <- sign(Delt(Cl(MCD)))
MCD <- na.locf(MCD, na.rm=TRUE)
MCD <- cbind(MCD$answer, MCD[,-ncol(MCD)])
output <- apply(MCD, 1, function(x) paste(1:length(x), x, sep=": ", collapse=" "))
write(output, "test.txt")
@jro Thanks for taking the time to look over the code. I spent almost an hour looking for that function and finally gave up in favor of my own version. I'm going to test my R loop approach against your vectorized method that calls sign(). Also thanks for the formatted output code. The first 'column', though, should not have an integer attached. It's all alone. So the first line would be printed in the following manner.
1 1:value 2:value
I'm curious if you had any luck predicting stock direction using libsvm? Can you please post an example?
@kapler, I haven't actually tried to predict market direction and I'm still pondering on what is the best thing to actually predict. For example, is it helpful to know tomorrow will be an up day? Or next week is likely to experience increased volatility? Or some such thing. I will post my attempts to answer this bewildering question.
I came across the following posts by Quantum Financier that show how to fit an SVM using R:
http://quantumfinancier.wordpress.com/2010/06/10/svm-classification-using-rsi-from-various-lengths/
http://quantumfinancier.wordpress.com/2010/06/26/support-vector-machine-rsi-system/
I thought you might find these interesting,
Michael