CS229 Project Report: Automated Stock Trading Using Machine Learning Algorithms

Tianxin Dai, Arpan Shah, Hongxia Zhong
[email protected], [email protected], [email protected]
1. Introduction
The use of algorithms to make trading decisions has become a prevalent practice in the major stock exchanges of the world. Algorithmic trading, sometimes called high-frequency trading, is the use of automated systems to identify true signals among massive amounts of data that capture the underlying stock market dynamics. Machine learning has therefore been central to algorithmic trading, because it provides powerful tools for extracting patterns from seemingly chaotic market trends. This project, in particular, learns models from Bloomberg stock data to predict stock price changes, with the aim of making a profit over time.
In this project, we examine two separate algorithms and methodologies for investigating stock market trends, and then iteratively improve each model to achieve higher profitability as well as higher predictive accuracy.

2. Methods

2.1. Stock Selection
Stock ticker data on prices, volumes, and quotes is available to academic institutions through the Bloomberg terminal, and Stanford has an easily accessible one in its engineering library.
When collecting stock data for this project, we adopted a conservative universe selection to ensure that we mined a good universe a priori and avoided stocks that were likely to be outliers and confuse our results. We shortlisted by the following criteria:
- price between 10 and 30 dollars
- membership in the last 300 of the S&P 500
- average daily volume (ADV) in the middle 33-percentile band
- variety of stock sectors
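The screening above can be sketched as a simple filter. This is an illustrative sketch only: the data layout (a list of dicts with `price`, `adv`, and an S&P 500 membership flag) is a hypothetical stand-in for the Bloomberg terminal data the selection was actually done against.

```python
# Sketch of the universe screening: price band, index membership, and
# middle-tercile average daily volume. Field names are hypothetical.

def select_universe(stocks):
    """Filter candidate stocks by the report's screening criteria.

    `stocks`: list of dicts with keys ticker, price, adv,
    in_sp500_last300 (bool), sector.
    """
    # Price band and index-membership filters.
    candidates = [s for s in stocks
                  if 10 <= s["price"] <= 30 and s["in_sp500_last300"]]
    # Keep only stocks whose ADV rank falls in the middle third.
    ranked = sorted(candidates, key=lambda s: s["adv"])
    n = len(ranked)
    return ranked[n // 3:(2 * n) // 3] if n >= 3 else ranked
```

A final manual pass would still be needed to enforce sector variety, which is a qualitative criterion.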
According to the listed criteria, we obtained a universe of 23 stocks for this project (see Appendix).
The data we focused on was the price and volume movements of each stock throughout the day on a tick-by-tick basis. This data was further preprocessed so that it could be read into Matlab and integrated into the machine learning algorithms.
2.2. Preprocessing
Before using the data in the learning algorithms, the following preprocessing steps were taken.
2.2.1 Discretization
Since the tick-by-tick entries retrieved from Bloomberg arrive at non-deterministic timestamps, we standardized the stock data by discretizing the continuous time domain from 9:00 AM to 5:00 PM, when the market closes. Specifically, the time domain was separated into 1-minute buckets; we discarded all finer granularity within each bucket and treated the buckets as the basic units in our learning algorithms.
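The discretization step can be sketched as follows. This is an illustrative Python version (the report's actual pipeline fed Matlab); only the 9:00 AM to 5:00 PM window is taken from the text, and the timestamp handling is an assumption.

```python
from datetime import datetime, time

# Sketch of the discretization step: each tick is assigned to a 1-minute
# bucket between 9:00 AM and 5:00 PM, and any finer-grained timing inside
# a bucket is discarded.

MARKET_OPEN = time(9, 0)
TRADING_MINUTES = 8 * 60  # 9:00 AM to 5:00 PM

def bucket_index(ts: datetime) -> int:
    """Index of the 1-minute bucket containing `ts`, counted from
    9:00 AM (bucket 0 covers 9:00:00-9:00:59)."""
    minutes = (ts.hour - MARKET_OPEN.hour) * 60 + (ts.minute - MARKET_OPEN.minute)
    if not (0 <= minutes < TRADING_MINUTES):
        raise ValueError("tick outside the 9:00 AM - 5:00 PM trading window")
    return minutes
```

Grouping all ticks that share a bucket index yields the basic units used by the learning algorithms.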
2.2.2 Bucket Description
For each 1-minute bucket, we extracted 8 identifiers to heuristically describe the price and volume change of that minute. We discussed the identifier selection with an experienced veteran of the algorithmic trading industry (Keith Siilats). Based on his suggestions, we chose the following 4 identifiers to describe the price change:
- open price: price at the beginning of each 1-minute bucket
- close price: price at the end of each 1-minute bucket
- high price: highest price within each 1-minute bucket
- low price: lowest price within each 1-minute bucket
Similarly, we chose open volume, close volume, high volume, and low volume to describe the volume change.
With this set of identifiers, we can formulate algorithms that predict the change in the closing price of each 1-minute bucket given the remaining seven identifiers (open price/volume, high price/volume, low price/volume, and close volume) observed prior to that minute. The identifiers help capture the trend of the data in a given minute.
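The eight identifiers can be computed per bucket directly from the ticks it contains. The sketch below assumes a simplified tick representation (a time-ordered list of (price, volume) pairs), which is our own convenience, not the Bloomberg format.

```python
# Sketch of the per-bucket identifier extraction: open/close/high/low
# for both price and volume, as described in Section 2.2.2.

def bucket_identifiers(ticks):
    """`ticks`: non-empty list of (price, volume) tuples in time order
    within one 1-minute bucket."""
    prices = [p for p, _ in ticks]
    volumes = [v for _, v in ticks]
    return {
        "open_price": prices[0],     "close_price": prices[-1],
        "high_price": max(prices),   "low_price": min(prices),
        "open_volume": volumes[0],   "close_volume": volumes[-1],
        "high_volume": max(volumes), "low_volume": min(volumes),
    }
```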
2.3. Metrics
To evaluate the learning algorithms, we simulate a real-time trading process on a single day using the models obtained from each algorithm. Again, we discretize the continuous time domain into 1-minute buckets. For each bucket at time t, a model invests 1 share in each stock if it predicts an uptrend in price, i.e. P_close^(t) > P_open^(t). If a model invested in a stock at time t, it always sells that stock at the end of that minute. To estimate profit, we calculate the price difference P_close^(t) - P_open^(t) to update the rolling profit. If, on the other hand, the model predicts a downtrend, it does nothing. This rolling profit, denoted concisely as just "profit" in this report, is one of our metrics for evaluating an algorithm's performance.

In addition to profit, we also use the standard evaluation metrics accuracy, precision, and recall to judge the performance of our models. Specifically,

    accuracy  = (# correct predictions) / (# total predictions)
    precision = (# accurate uptick predictions) / (# uptick predictions)
    recall    = (# accurate uptick predictions) / (# actual upticks)

To conclude, each time we evaluate a specific model or algorithm, we take the average precision, average recall, average accuracy, and average profit over all 23 stocks in our universe. These are the performance metrics used in this report.

3. Models & Results

3.1. Logistic Regression

3.1.1 Feature Optimization and Dimensionality Constraint

To predict the stock price trends, our goal was to predict

    1{P_close^(t) > P_open^(t)}

based on the discussion above.

The first model we tried was logistic regression (our implementation uses the MNRFIT library in Matlab). Initially, we attempted to fit logistic regression with the following six features: 1) percentage change in open price, 2) percentage change in high price, 3) percentage change in low price, 4) percentage change in open volume, 5) percentage change in high volume, and 6) percentage change in low volume.

Note that although the change in the "open" variables is between the current and previous 1-minute buckets, the high and low variables for the current 1-minute bucket are unobserved so far, so for them we can only consider the change between the previous two buckets as an indicator of the trend. Formally, these features can be expressed using the formulas below; we will denote features by the numbering of these equations for the rest of this report, e.g. feature (1) is (P_open^(t) - P_open^(t-1)) / P_open^(t-1):

    (P_open^(t) - P_open^(t-1)) / P_open^(t-1)      (1)
    (P_high^(t-1) - P_high^(t-2)) / P_high^(t-2)    (2)
    (P_low^(t-1) - P_low^(t-2)) / P_low^(t-2)       (3)
    (V_open^(t) - V_open^(t-1)) / V_open^(t-1)      (4)
    (V_high^(t-1) - V_high^(t-2)) / V_high^(t-2)    (5)
    (V_low^(t-1) - V_low^(t-2)) / V_low^(t-2)       (6)

The results, however, showed that a logistic regression model could not be applied well to this set of high-dimensional features. Intuitively, this behavior can be explained by the significant noise introduced by the high-dimensional features, which makes it difficult to fit weights for our model. More specifically, certain features may obscure patterns captured by other features.

In an attempt to reduce the dimensionality of our feature space, we used cross-validation to eliminate less effective features. We found that a logistic regression model on stock data can reliably fit at most a two-dimensional feature space. The results of the cross-validation suggested that feature (1) and feature (4) provide optimal results.

In addition to optimizing the feature set, we also used cross-validation to obtain an optimal training set, defined here as the training duration. Figure 1 plots the variation of the metrics over training durations from a 30-minute period to a 120-minute period (the heuristic assumption is that training begins at 9:30 AM, and testing lasts for 30 minutes right after training finishes). We observe that the logistic regression model achieves maximal performance when the training duration is set to 60 minutes.

Figure 1: Performance over different training durations

Hence, we train the logistic regression model with feature (1) and feature (4) from 9:30 AM to 10:30 AM, and the obtained model achieves precision 55.07%, recall 30.05%, accuracy 38.39%, and profit 0.0123 when testing for the rest of the day.
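As an illustration of the baseline pipeline, the sketch below computes features (1) and (4), fits a small logistic regression by stochastic gradient ascent (a self-contained stand-in for the Matlab MNRFIT routine the report actually used), and accumulates the rolling profit defined in Section 2.3. The bucket dictionaries are a hypothetical layout.

```python
import math

# Sketch of the baseline model: features (1) and (4), a minimal logistic
# regression trained by stochastic gradient ascent, the uptick label
# 1{P_close(t) > P_open(t)}, and the rolling-profit metric.

def features(buckets, t):
    """Features (1) and (4) for bucket t (requires t >= 1).
    `buckets`: list of dicts with open_price/close_price/open_volume."""
    prev, cur = buckets[t - 1], buckets[t]
    return [
        (cur["open_price"] - prev["open_price"]) / prev["open_price"],    # (1)
        (cur["open_volume"] - prev["open_volume"]) / prev["open_volume"], # (4)
    ]

def train_logreg(X, y, lr=0.5, epochs=1000):
    """Fit weights [bias, w1, ..., wd] by maximizing log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = yi - p  # gradient of the log-likelihood w.r.t. z
            w[0] += lr * g
            for j, xj in enumerate(xi):
                w[j + 1] += lr * g * xj
    return w

def predict_uptick(w, x):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

def rolling_profit(buckets, w):
    """Invest 1 share whenever the model predicts an uptick and sell at
    the bucket's close (the rolling-profit metric of Section 2.3)."""
    profit = 0.0
    for t in range(1, len(buckets)):
        if predict_uptick(w, features(buckets, t)):
            profit += buckets[t]["close_price"] - buckets[t]["open_price"]
    return profit
```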
3.1.2 Improvements based on Time Locality
While logistic regression achieved reasonable performance with the two-dimensional feature set consisting of (1) and (4), making a profit of 0.0123, we attempted to further improve our results. Based on the earlier discussion, our logistic regression model is constrained to a low-dimensional feature space. As a result, we must either select more descriptive features within a low-dimensional feature space or use a different model that can learn from a higher-dimensional feature space.
We started by constructing more descriptive features. We hypothesized that the stock market exhibits significant time-locality of price trends, based on the fact that it is often influenced by group decision making and other time-bound events in the marketplace. The signals of these events are usually visible over a time frame longer than a minute, since in the very short term these trends are masked by the inherent volatility of stock prices. For example, if the market enters a mode of general rise with high fluctuation at a certain time, then large 1-minute percentage changes in price or volume become less significant in comparison to the general trend.
We attempted to address these concerns by formulating new features based on the δ-minute high-low model [1], inspired by CS 246 (Winter 2011-2012) HW4, Problem 1. Professionals in the algorithmic trading field (Keith Siilats, a former CS 246 TA) recommended the heuristic choice of δ = 5. The δ-minute high-low model tracks the high price, low price, high volume, and low volume across all the ticks in any δ-minute span. For the most recent δ-minute span w.r.t. any 1-minute bucket of
time t, we define PH(t), PL(t), VH(t), and VL(t) as follows:

    PH(t) = max_{t-δ ≤ i ≤ t-1} P_high^(i)    (7)
    PL(t) = min_{t-δ ≤ i ≤ t-1} P_low^(i)     (8)
    VH(t) = max_{t-δ ≤ i ≤ t-1} V_high^(i)    (9)
    VL(t) = min_{t-δ ≤ i ≤ t-1} V_low^(i)     (10)
Under the δ-minute high-low model, we choose our features to be the following:
    (P_open^(t) - P_open^(t-1)) / (PH(t) - PL(t))    (11)
    (V_open^(t) - V_open^(t-1)) / (VH(t) - VL(t))    (12)

Specifically, these are the ratios of the open-price and open-volume changes to the most recent "δ-minute high-low spread", respectively.
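Features (11) and (12) can be computed directly from equations (7) through (10). The sketch below reuses the hypothetical per-bucket dictionary layout from earlier; it is illustrative, not the report's Matlab code.

```python
# Sketch of the delta-minute high-low features: the 1-minute open
# price/volume change normalized by the high-low spread over the
# preceding delta minutes (equations (7)-(12)).

def high_low_features(buckets, t, delta=5):
    """Features (11) and (12) for bucket t (requires t >= delta)."""
    window = buckets[t - delta:t]                  # buckets t-delta .. t-1
    ph = max(b["high_price"] for b in window)      # PH(t), eq. (7)
    pl = min(b["low_price"] for b in window)       # PL(t), eq. (8)
    vh = max(b["high_volume"] for b in window)     # VH(t), eq. (9)
    vl = min(b["low_volume"] for b in window)      # VL(t), eq. (10)
    prev, cur = buckets[t - 1], buckets[t]
    return [
        (cur["open_price"] - prev["open_price"]) / (ph - pl),    # (11)
        (cur["open_volume"] - prev["open_volume"]) / (vh - vl),  # (12)
    ]
```

Note that a degenerate window (zero high-low spread) would divide by zero; real tick data makes this rare, but production code would need to guard against it.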
Considering that our stock universe may be different, we use cross-validation to determine the optimal value of δ. Figure 2 suggests that δ = 5 leads to maximal precision, while δ = 10 gives maximal profit and recall. For the purposes of this project, we chose δ = 5, because higher precision leads to a more conservative strategy.
Figure 2: Performance over different values of δ
We also set the training duration to 60 minutes based on another cross-validation analysis with δ = 5. Our δ-minute high-low logistic regression model finally achieves precision 59.39%, recall 27.43%, accuracy 41.58%, and profit 0.0186.
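The cross-validation used to pick δ can be sketched as a simple grid search over candidate values: train on a fixed window and score precision on the following 30 minutes, keeping the δ with the highest precision (mirroring the report's conservative preference). Here `fit` and `precision_of` are placeholders supplied by the caller, not functions from the report.

```python
# Sketch of delta selection by cross-validation: for each candidate
# delta, train on the first `train_minutes` buckets and evaluate
# precision on the next `test_minutes` buckets.

def select_delta(buckets, fit, precision_of,
                 candidates=(5, 10, 15), train_minutes=60, test_minutes=30):
    """Return the candidate delta with the highest validation precision.

    `fit(train_buckets, delta)` -> model
    `precision_of(model, test_buckets, delta)` -> float
    """
    best_delta, best_precision = None, -1.0
    for delta in candidates:
        model = fit(buckets[:train_minutes], delta)
        test = buckets[train_minutes:train_minutes + test_minutes]
        p = precision_of(model, test, delta)
        if p > best_precision:
            best_delta, best_precision = delta, p
    return best_delta
```

The same loop, with the training duration as the swept parameter, covers the 60-minute training-window selection described above.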
Table 1: Comparison between the two logistic regression models

    Model      Profit   Precision   Recall   Accuracy
    Baseline   0.0123   55.07%      30.05%   38.39%
    δ-HL       0.0186   59.39%      27.43%   41.58%

By comparing the performance of the two logistic regression models in Table 1, we clearly see that the δ-minute high-low model is superior to the baseline model. This result validates our hypothesis on the time-locality characteristic of stock data and suggests that time-locality lasts around 5 minutes.

3.2. Support Vector Machine

As we discussed earlier, further improvement of the results may still be possible by exploring a new machine learning model. The previous model we explored constrained us to a low-dimensional feature space; to overcome this constraint, we experimented with an SVM using ℓ1 regularization with C = 1.

3.2.1 Feature & Parameter Selection

We tried different combinations of the 8 features defined by equations (1) to (6), equation (11), and equation (12). Since there are a large number of feature combinations to consider, we used forward search to incrementally add features to our existing feature set, and chose the best set based on our 4 metrics.

Table 2: Performance over different feature sets

    Features                                  Profit   Precision   Recall   Accuracy
    (1), (4)                                  0.3066   44.72%      52.11%   42.85%
    (11), (12)                                0.3706   42.81%      57.64%   40.34%
    (1), (4), (11), (12)                      0.3029   42.48%      47.54%   39.42%
    (1), (4), (11), (12), (2), (5)            0.3627   45.22%      56.25%   42.60%
    (1), (4), (11), (12), (2), (5), (3), (6)  0.3484   46.43%      55.66%   42.91%

We chose the last feature set, since it leads to the highest precision along with very high profit, recall, and accuracy. In addition, we set the training duration to 60 minutes using cross-validation. Similarly, we chose the optimal δ = 10 and C = 0.1 using cross-validation. We also compared a linear kernel with a Gaussian kernel, and the linear kernel tends to give better results.

The SVM model trained with the chosen training duration, δ, and C finally achieves precision 47.13%, recall 53.96%, accuracy 42.30%, and profit 0.3066. Comparing the δ-minute high-low regression model with the SVM model, we see that the SVM model significantly improves recall, by almost 100%, while sacrificing only a small percentage of precision, around 20%.

3.2.2 Time-Locality Revisited

Recall that the δ-minute high-low model is based on our hypothesis that there exists a rolling correlation between trades within a certain period of time, and that by cross-validation we chose δ = 10 for the SVM model. To further substantiate this hypothesis, we conducted an experiment in which we trained an SVM using the optimal parameters from the previous section, and then evaluated the accuracy of the model by testing it on different periods of time.

Specifically, the performance statistics of an SVM model trained from 9:30 AM to 10:30 AM are listed in Table 3. A close inspection shows a downtrend in performance as the delay between the testing period and the training period becomes larger. In fact, it is not surprising to see even better performance from this model within 10 minutes after training completes, as we chose δ = 10: testing on that window yields precision 68.84%, recall 36.88%, and accuracy 44.84%, which tops all other results in Table 3.

Table 3: Performance over different periods of time

    Period            Profit   Precision   Recall   Accuracy
    10:30-11:00 AM    0.0926   56.45%      38.10%   43.92%
    10:45-11:15 AM    0.0684   42.49%      38.32%   42.15%
    11:00-11:30 AM    0.0775   54.29%      41.09%   43.07%
    11:15-11:45 AM    0.0726   48.68%      36.68%   38.68%
    11:30-12:00 PM    0.0632   32.74%      29.77%   40.44%

4. Conclusion and Further Work

Predicting stock market trends using machine learning algorithms is a challenging task, due to the trends being masked by various factors such as noise and volatility. In addition, the market operates in various local modes that change from time to time, making it necessary to capture those changes in order to remain profitable while trading.
Although our algorithms and models were simplified, we were able to meet our expectation of reaching modest profitability. Our sequential analysis made it clear that factoring in time-locality, and capturing the features after smoothing to reduce volatility, improves profitability and precision substantially.
Factoring in carefully selected high-dimensional features can also contribute significantly to improving the results, and our comparison of the SVM with logistic regression was able to capture this. We expect that this is because higher dimensionality increases the likelihood that the dataset is linearly separable.
Finally, iterative improvements achieved through sequential optimizations, in the form of discretization, recognition of time-locality, and smoothing, improved the results significantly. Cross-validation and forward search were also powerful tools for making the algorithms perform better.
In conclusion, our experience in this project suggests that machine learning has great potential in this field, and we hope to continue working on this project to explore more nuances in improving performance via better algorithms and further optimizations.
A few interesting questions worth investigating include exploring other international stock markets, to find venues where algorithmic trading can perform better. It would also be interesting to investigate other algorithms, such as reinforcement learning, to compare with the models discussed in this report. Feature selection has been key, and more work on discovering descriptive features promises to make the results even better.
5. Acknowledgements
We would like to thank Professor Andrew Ng and the TAs of the class for their feedback and input on the project. We would also like to thank Keith Siilats for his generous advice and for sharing valuable personal experience in the field, which helped inform our decisions.
References
[1] J. Leskovec (instructor), K. Siilats (TA). CS 246 (Winter 2011-2012), Homework 4, Problem 1.
A. Appendix

    Stock Ticker    Origin
    APOL            US Equity
    CBG             US Equity
    CMA             US Equity
    CMS             US Equity
    CVS             US Equity
    GCI             US Equity
    GME             US Equity
    GT              US Equity
    JBL             US Equity
    KIM             US Equity
    LNC             US Equity
    NFX             US Equity
    NI              US Equity
    NWL             US Equity
    NYX             US Equity
    PWR             US Equity
    QEP             US Equity
    SEE             US Equity
    TER             US Equity
    THC             US Equity
    TIE             US Equity
    TXT             US Equity
    ZION            US Equity