Chapter 6. Trading Using Machine Learning

In the capital market, machine learning-based algorithmic trading is quite popular these days, and many companies are investing heavily in machine learning-based algorithms, either proprietary or for clients. Machine learning algorithms are programmed to learn continuously and change their behavior automatically, which helps them identify new patterns as they emerge in the market. Some patterns in the capital market are so complex that humans cannot capture them, and even when a human does manage to find a pattern, they rarely do so efficiently. This complexity forces people to look for alternative mechanisms that can identify such patterns accurately and efficiently.

In the previous chapter, you got a feel for momentum- and pairs-trading-based algorithmic trading, and for portfolio construction. In this chapter, I will explain, step by step, a few supervised and unsupervised machine learning algorithms that are used in algorithmic trading:

  • Logistic regression neural network
  • Neural network
  • Deep neural network
  • K-means algorithm
  • K-nearest neighbors
  • Support vector machine
  • Decision tree
  • Random forest

A few of the packages used in this chapter are quantmod, nnet, genalg, caret, PerformanceAnalytics, deepnet, h2o, clue, e1071, randomForest, and party.
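
If some of these packages are missing from your installation, it saves time to install them all up front. The following snippet is a convenience sketch of mine, not from the book; it assumes a configured CRAN mirror and an internet connection:

>pkgs<- c("quantmod","nnet","genalg","caret","PerformanceAnalytics",
+         "deepnet","h2o","clue","e1071","randomForest","party")
># install only the packages that are not already present
>missing_pkgs<- setdiff(pkgs, rownames(installed.packages()))
>if (length(missing_pkgs) > 0) install.packages(missing_pkgs)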

Logistic regression neural network

Market direction is very important to investors and traders. Predicting market direction is quite a challenging task, as market data contains a lot of noise. The market moves either upward or downward, so the nature of market movement is binary. A logistic regression model helps us fit a model to this binary behavior and forecast market direction. Logistic regression is a probabilistic model that assigns a probability to each event. I am assuming you are well versed in extracting data from Yahoo, as you studied this in previous chapters. Here again, I am going to use the quantmod package. The next three commands load the package into the workspace, import data into R from the Yahoo repository, and extract only the closing price from the data:

>library("quantmod")
>getSymbols("^DJI",src="yahoo")
>dji<- DJI[,"DJI.Close"]

The input data to the logistic regression is constructed using different indicators, such as moving average, standard deviation, RSI, MACD, Bollinger Bands, and so on, which have some predictive power over market direction, that is, Up or Down. These indicators can be constructed using the following commands:

>avg10<- rollapply(dji,10,mean)
>avg20<- rollapply(dji,20,mean)
>std10<- rollapply(dji,10,sd)
>std20<- rollapply(dji,20,sd)
>rsi5<- RSI(dji,5,"SMA")
>rsi14<- RSI(dji,14,"SMA")
>macd12269<- MACD(dji,12,26,9,"SMA")
>macd7205<- MACD(dji,7,20,5,"SMA")
>bbands<- BBands(dji,20,"SMA",2)

The following commands create the variable direction, which takes either the Up direction (1) or the Down direction (0). The Up direction is recorded when the current price is greater than the price 20 days earlier, and the Down direction when the current price is less than the price 20 days earlier:

>direction<- NULL
>direction[dji> Lag(dji,20)] <- 1
>direction[dji< Lag(dji,20)] <- 0
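
As an aside, the same classification can be written in a single step with ifelse(). This is an equivalent sketch, not the book's code; like the original, it leaves NA in the first 20 rows, where no 20-day lagged price exists:

>direction<- ifelse(dji> Lag(dji,20), 1, 0)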

Now we bind all the columns, consisting of the price and the indicators, as shown in the following command:

>dji<-cbind(dji,avg10,avg20,std10,std20,rsi5,rsi14,macd12269,macd7205,bbands,direction)
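
Note that the rolling indicators leave NAs in the leading rows of the merged object (for example, MACD(12,26,9) needs at least 26 observations before it produces a value). These rows fall before the 2010 in-sample window used later in this section, but it is worth verifying; a quick check, as a sketch:

># count rows containing at least one NA from the rolling windows
>sum(!complete.cases(coredata(dji)))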

The dimension of the dji object can be calculated using dim(). I applied dim() to dji and saved the output in dm. dm stores two values: the first is the number of rows and the second is the number of columns in dji. Column names can be extracted using colnames(). The third command extracts the name of the last column. Next, I replaced that column name with a meaningful one, Direction:

>dm<- dim(dji)
>dm
[1] 2493   16
>colnames(dji)[dm[2]] 
[1] "..11"
>colnames(dji)[dm[2]] <- "Direction"
>colnames(dji)[dm[2]] 
[1] "Direction"

We have extracted the Dow Jones Index (DJI) data into the R workspace. Now, to implement logistic regression, we should divide the data into two parts. The first part is in-sample data and the second part is out-sample data.

In-sample data is used for the model building process and out-sample data is used for evaluation purposes. This process also helps to control the variance and bias in the model. The next four lines are for in-sample start, in-sample end, out-sample start, and out-sample end dates:

>issd<- "2010-01-01"
>ised<- "2014-12-31"
>ossd<- "2015-01-01"
>osed<- "2015-12-31"

The following two commands get the row numbers for these dates; that is, the variable isrow holds the row numbers for the in-sample date range and osrow holds the row numbers for the out-sample date range:

>isrow<- which(index(dji) >= issd & index(dji) <= ised)
>osrow<- which(index(dji) >= ossd & index(dji) <= osed)

The variables isdji and osdji are the in-sample and out-sample datasets respectively:

>isdji<- dji[isrow,]
>osdji<- dji[osrow,]

If you look at the in-sample data, that is, isdji, you will notice that the columns are on very different scales: a few columns are on the scale of 100, a few on the scale of 10,000, and a few on the scale of 1. Such differences in scale can distort your results, as larger weights are effectively assigned to variables with larger scales. So, before moving ahead, you should standardize the dataset. I will use the following formula:

X_standardized = (X - mean(X)) / sd(X)                (6.1)

The mean and standard deviation of each column are computed using apply(), as shown here:

>isme<- apply(isdji,2,mean)
>isstd<- apply(isdji,2,sd)

A matrix of ones, with dimensions equal to those of the in-sample data, is generated using the following command; it is used for the normalization:

>isidn<- matrix(1,dim(isdji)[1],dim(isdji)[2])

Use formula 6.1 to standardize the data:

>norm_isdji<-  (isdji - t(isme*t(isidn))) / t(isstd*t(isidn))
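
For reference, base R's scale() performs the same standardization in one call. This sketch is an alternative to the book's code; note that it returns a plain matrix rather than an xts object:

>norm_isdji2<- scale(isdji, center = isme, scale = isstd)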

The preceding line also standardizes the direction column, that is, the last column. We don't want direction to be standardized, so I replace the last column again with the variable direction for the in-sample date range:

>dm<- dim(isdji)
>norm_isdji[,dm[2]] <- direction[isrow]

Now we have created all the data required for model building. You should build a logistic regression model, which will help you predict market direction based on the in-sample data. First, I created a formula which has Direction as the dependent variable and all other columns as independent variables. Then I used the generalized linear model function, glm(), to fit the model, supplying the formula, the family, and the dataset:

>formula<- paste("Direction ~ .",sep="")
>model<- glm(formula,family="binomial",norm_isdji)

A summary of the model can be viewed using the following command:

>summary(model)

Next, use predict() on the same dataset to obtain the fitted values:

>pred<- predict(model,norm_isdji)

Once you have the fitted values, you should convert them to probabilities using the following command. This transforms the output into probabilistic form, with values in the range [0, 1]:

>prob<- 1 / (1+exp(-(pred)))
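
Equivalently, for a binomial glm, predict() with type = "response" returns probabilities directly, so the manual logistic transform can be skipped; a one-line alternative:

>prob<- predict(model, norm_isdji, type = "response")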

Figure 6.1 is plotted using the following commands. The first line of code divides the plotting area into two rows and one column, where the first plot shows the predictions of the model and the second shows the probabilities:

>par(mfrow=c(2,1))
>plot(pred,type="l")
>plot(prob,type="l")

head() can be used to look at the first few values of the variable:

>head(prob)
2010-01-04 2010-01-05 2010-01-06 2010-01-07
 0.8019197  0.4610468  0.7397603  0.9821293

The following figure shows the variable pred defined above, which takes real values, and its transformation into the interval between 0 and 1, which represents the probability, that is, prob, using the preceding transformation:


Figure 6.1: Prediction and probability distribution of DJI

As probabilities lie in the range (0, 1), so do the values of our vector prob. Now, to classify them into one of the two classes, I assigned the Up direction (1) when prob is greater than 0.5 and the Down direction (0) when prob is less than or equal to 0.5. This assignment can be done using the following commands: prob > 0.5 generates TRUE for the points where the probability is greater than 0.5, and pred_direction[prob > 0.5] assigns 1 to all such points. Similarly, the next statement assigns 0 where the probability is less than or equal to 0.5:

>pred_direction<- NULL
>pred_direction[prob> 0.5] <- 1
>pred_direction[prob<= 0.5] <- 0
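
The same thresholding can also be written in a single line with ifelse(), as an equivalent alternative to the two assignments above:

>pred_direction<- ifelse(prob> 0.5, 1, 0)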

Once we have figured out the predicted directions, we should check the model's accuracy: how often our model predicted the Up direction as Up and Down as Down. There might be scenarios where it predicted the opposite of the actual outcome, such as predicting Down when the market actually moved Up, and vice versa. We can use confusionMatrix() from the caret package, which produces a matrix as output. All diagonal elements are correctly predicted cases and the off-diagonal elements are errors, or wrong predictions. One should aim to reduce the off-diagonal elements of a confusion matrix:

>install.packages('caret')
>library(caret)
>matrix<- confusionMatrix(pred_direction,norm_isdji$Direction)
>matrix
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 362  35
         1  42 819

               Accuracy : 0.9388
                 95% CI : (0.9241, 0.9514)
    No Information Rate : 0.6789
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.859
 Mcnemar's Test P-Value : 0.4941

            Sensitivity : 0.8960
            Specificity : 0.9590
         Pos Pred Value : 0.9118
         Neg Pred Value : 0.9512
             Prevalence : 0.3211
         Detection Rate : 0.2878
   Detection Prevalence : 0.3156
      Balanced Accuracy : 0.9275
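
Note that recent versions of caret require both arguments of confusionMatrix() to be factors with the same levels. If the call above fails with an error saying the data and reference should be factors, a conversion such as the following sketch should work; the last line recomputes the overall accuracy from the table, mirroring the arithmetic discussed next:

>matrix<- confusionMatrix(factor(pred_direction, levels = c(0,1)),
+                factor(norm_isdji$Direction, levels = c(0,1)))
>sum(diag(matrix$table)) / sum(matrix$table)   # overall accuracy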

The confusion matrix shows that we achieved 94% correct prediction, as 362 + 819 = 1181 predictions out of 1258 (the sum of all four values) are correct. Prediction above 80% on in-sample data is generally considered good; however, 80% is not a fixed threshold, and one has to judge this value based on the dataset and the industry. Now that you have implemented the logistic regression model, which predicted 94% of cases correctly, you need to test it for generalization power. One should test this model on the out-sample data and check its accuracy. The first step is to standardize the out-sample data using formula (6.1). Here, the mean and standard deviation should be the same as those used for the in-sample normalization:

>osidn<- matrix(1,dim(osdji)[1],dim(osdji)[2])
>norm_osdji<-  (osdji - t(isme*t(osidn))) / t(isstd*t(osidn))
>norm_osdji[,dm[2]] <- direction[osrow]

Next, we use predict() on the out-sample data and use the result to calculate the probabilities:

>ospred<- predict(model,norm_osdji)
>osprob<- 1 / (1+exp(-(ospred)))

Once the probabilities are determined for the out-sample data, you should classify each point into either the Up or Down class using the following commands. confusionMatrix() here will generate a matrix for the out-sample data:

>ospred_direction<- NULL
>ospred_direction[osprob> 0.5] <- 1
>ospred_direction[osprob<= 0.5] <- 0
>osmatrix<- confusionMatrix(ospred_direction,norm_osdji$Direction)
>osmatrix
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 115  26
         1  12  99

               Accuracy : 0.8492
                 95% CI : (0.7989, 0.891)

This shows 85% accuracy on the out-sample data. Judging the quality of this accuracy is beyond the scope of the book, so I am not going to cover whether the out-sample accuracy is good or bad, or what techniques could improve it. A realistic trading model also accounts for trading costs and market slippage, which decrease the winning odds significantly. The next thing to be done is to devise a trading strategy using the predicted directions. I will explain how to implement an automated trading strategy using predicted signals in the next section; the short sketch below gives a flavor of the idea.
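
Here is a minimal sketch of mine, not the book's strategy: go long one unit when the model predicts Up, stay flat otherwise, and lag the signal by one day to avoid look-ahead bias. Trading costs and slippage are ignored, and PerformanceAnalytics (from the package list at the start of the chapter) plots the resulting equity curve:

>library(PerformanceAnalytics)
>signal<- xts(ospred_direction, order.by = index(osdji))
>ret<- diff(log(osdji[,1]))                 # daily log returns of DJI close
>strat_ret<- na.omit(lag(signal,1) * ret)   # trade on yesterday's signal
>charts.PerformanceSummary(strat_ret)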
