CS229 Project Report: Automated Stock Trading Using Machine Learning Algorithms

Tianxin Dai, Arpan Shah, Hongxia Zhong
[email protected], [email protected], [email protected]
1. Introduction
The use of algorithms to make trading decisions has become a prevalent practice in the major stock exchanges of the world. Algorithmic trading, sometimes called high-frequency trading, is the use of automated systems to identify true signals among massive amounts of data that capture the underlying stock market dynamics. Machine learning has therefore been central to algorithmic trading, because it provides powerful tools for extracting patterns from seemingly chaotic market trends. This project, in particular, learns models from Bloomberg stock data to predict stock price changes, with the aim of making a profit over time.
In this project, we examine two separate algorithms and methodologies for investigating stock market trends, and then iteratively improve each model to achieve higher profitability as well as higher predictive accuracy.

2. Methods

2.1. Stock Selection
Stock ticker data on prices, volumes, and quotes is available to academic institutions through the Bloomberg terminal, and Stanford has an easily accessible one in its engineering library.
When collecting stock data for this project, we adopted a conservative universe selection to ensure that we mined a good universe a priori and avoided stocks that were likely to be outliers and confuse our results. We shortlisted by the following criteria:
- price between 10 and 30 dollars
- membership in the last 300 of the S&P 500
- average daily volume (ADV) in the middle 33-percentile band
- variety of stock sectors
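The screening above can be sketched as a simple filter. This is an illustrative sketch only: the data layout (a list of dicts with `price`, `adv`, and an S&P 500 membership flag) is a hypothetical stand-in for the Bloomberg terminal data the selection was actually done against.

```python
# Sketch of the universe screening: price band, index membership, and
# middle-tercile average daily volume. Field names are hypothetical.

def select_universe(stocks):
    """Filter candidate stocks by the report's screening criteria.

    `stocks`: list of dicts with keys ticker, price, adv,
    in_sp500_last300 (bool), sector.
    """
    # Price band and index-membership filters.
    candidates = [s for s in stocks
                  if 10 <= s["price"] <= 30 and s["in_sp500_last300"]]
    # Keep only stocks whose ADV rank falls in the middle third.
    ranked = sorted(candidates, key=lambda s: s["adv"])
    n = len(ranked)
    return ranked[n // 3:(2 * n) // 3] if n >= 3 else ranked
```

A final manual pass would still be needed to enforce sector variety, which is a qualitative criterion.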
According to the listed criteria, we obtained a universe of 23 stocks for this project (see Appendix).
The data we focused on was the price and volume movements of each stock throughout the day on a tick-by-tick basis. This data was further preprocessed so that it could be read into Matlab and integrated into the machine learning algorithms.
2.2. Preprocessing
Before using the data in the learning algorithms, the following preprocessing steps were taken.
2.2.1 Discretization
Since the tick-by-tick entries retrieved from Bloomberg arrive at non-deterministic timestamps, we standardized the stock data by discretizing the continuous time domain from 9:00 AM to 5:00 PM, when the market closes. Specifically, the time domain was separated into 1-minute buckets; we discarded all finer granularity within each bucket and treated the buckets as the basic units in our learning algorithms.
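The discretization step can be sketched as follows. This is an illustrative Python version (the report's actual pipeline fed Matlab); only the 9:00 AM to 5:00 PM window is taken from the text, and the timestamp handling is an assumption.

```python
from datetime import datetime, time

# Sketch of the discretization step: each tick is assigned to a 1-minute
# bucket between 9:00 AM and 5:00 PM, and any finer-grained timing inside
# a bucket is discarded.

MARKET_OPEN = time(9, 0)
TRADING_MINUTES = 8 * 60  # 9:00 AM to 5:00 PM

def bucket_index(ts: datetime) -> int:
    """Index of the 1-minute bucket containing `ts`, counted from
    9:00 AM (bucket 0 covers 9:00:00-9:00:59)."""
    minutes = (ts.hour - MARKET_OPEN.hour) * 60 + (ts.minute - MARKET_OPEN.minute)
    if not (0 <= minutes < TRADING_MINUTES):
        raise ValueError("tick outside the 9:00 AM - 5:00 PM trading window")
    return minutes
```

Grouping all ticks that share a bucket index yields the basic units used by the learning algorithms.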
2.2.2 Bucket Description
For each 1-minute bucket, we extracted 8 identifiers to heuristically describe the price and volume change of that minute. We discussed the identifier selection with an experienced veteran of the algorithmic trading industry (Keith Siilats). Based on his suggestions, we chose the following 4 identifiers to describe the price change:
- open price: price at the beginning of each 1-minute bucket
- close price: price at the end of each 1-minute bucket
- high price: highest price within each 1-minute bucket
- low price: lowest price within each 1-minute bucket
Similarly, we chose open volume, close volume, high volume, and low volume to describe the volume change.
With this set of identifiers, we can formulate algorithms that predict the change in the closing price of each 1-minute bucket given the remaining seven identifiers (open price/volume, high price/volume, low price/volume, and close volume) observed prior to that minute. The identifiers help capture the trend of the data in a given minute.
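The eight identifiers can be computed per bucket directly from the ticks it contains. The sketch below assumes a simplified tick representation (a time-ordered list of (price, volume) pairs), which is our own convenience, not the Bloomberg format.

```python
# Sketch of the per-bucket identifier extraction: open/close/high/low
# for both price and volume, as described in Section 2.2.2.

def bucket_identifiers(ticks):
    """`ticks`: non-empty list of (price, volume) tuples in time order
    within one 1-minute bucket."""
    prices = [p for p, _ in ticks]
    volumes = [v for _, v in ticks]
    return {
        "open_price": prices[0],     "close_price": prices[-1],
        "high_price": max(prices),   "low_price": min(prices),
        "open_volume": volumes[0],   "close_volume": volumes[-1],
        "high_volume": max(volumes), "low_volume": min(volumes),
    }
```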
2.3. Metrics
To evaluate the learning algorithms, we simulate a real-time trading process on a single day using the models obtained from each algorithm. Again, we discretize the continuous time domain into 1-minute buckets. For each bucket at time t, a model invests 1 share in each stock if it predicts an uptrend in price, i.e. P_close^(t) > P_open^(t). If a model invested in a stock at time t, it always sells that stock at the end of that minute. To estimate profit, we calculate the price difference P_close^(t) - P_open^(t) to update the rolling profit. If, on the other hand, the model predicts a downtrend, it does nothing. This rolling profit, denoted concisely as just "profit" in this report, is one of our metrics for evaluating an algorithm's performance.

In addition to profit, we also use the standard evaluation metrics accuracy, precision, and recall to judge the performance of our models. Specifically,

    accuracy  = (# correct predictions) / (# total predictions)
    precision = (# accurate uptick predictions) / (# uptick predictions)
    recall    = (# accurate uptick predictions) / (# actual upticks)

To conclude, each time we evaluate a specific model or algorithm, we take the average precision, average recall, average accuracy, and average profit over all 23 stocks in our universe. These are the performance metrics used in this report.

3. Models & Results

3.1. Logistic Regression

3.1.1 Feature Optimization and Dimensionality Constraint

To predict the stock price trends, our goal was to predict

    1{P_close^(t) > P_open^(t)}

based on the discussion above.

The first model we tried was logistic regression (our implementation uses the MNRFIT library in Matlab). Initially, we attempted to fit logistic regression with the following six features: 1) percentage change in open price, 2) percentage change in high price, 3) percentage change in low price, 4) percentage change in open volume, 5) percentage change in high volume, and 6) percentage change in low volume.

Note that although the change in the "open" variables is between the current and previous 1-minute buckets, the high and low variables for the current 1-minute bucket are unobserved so far, so for them we can only consider the change between the previous two buckets as an indicator of the trend. Formally, these features can be expressed using the formulas below; we will denote features by the numbering of these equations for the rest of this report, e.g. feature (1) is (P_open^(t) - P_open^(t-1)) / P_open^(t-1):

    (P_open^(t) - P_open^(t-1)) / P_open^(t-1)      (1)
    (P_high^(t-1) - P_high^(t-2)) / P_high^(t-2)    (2)
    (P_low^(t-1) - P_low^(t-2)) / P_low^(t-2)       (3)
    (V_open^(t) - V_open^(t-1)) / V_open^(t-1)      (4)
    (V_high^(t-1) - V_high^(t-2)) / V_high^(t-2)    (5)
    (V_low^(t-1) - V_low^(t-2)) / V_low^(t-2)       (6)

The results, however, showed that a logistic regression model could not be applied well to this set of high-dimensional features. Intuitively, this behavior can be explained by the significant noise introduced by the high-dimensional features, which makes it difficult to fit weights for our model. More specifically, certain features may obscure patterns captured by other features.

In an attempt to reduce the dimensionality of our feature space, we used cross-validation to eliminate less effective features. We found that a logistic regression model on stock data can reliably fit at most a two-dimensional feature space. The results of the cross-validation suggested that feature (1) and feature (4) provide optimal results.

In addition to optimizing the feature set, we also used cross-validation to obtain an optimal training set, defined here as the training duration. Figure 1 plots the variation of the metrics over training durations from a 30-minute period to a 120-minute period (the heuristic assumption is that training begins at 9:30 AM, and testing lasts for 30 minutes right after training finishes). We observe that the logistic regression model achieves maximal performance when the training duration is set to 60 minutes.

Figure 1: Performance over different training durations

Hence, we train the logistic regression model with feature (1) and feature (4) from 9:30 AM to 10:30 AM, and the obtained model achieves precision 55.07%, recall 30.05%, accuracy 38.39%, and profit 0.0123 when testing for the rest of the day.
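As an illustration of the baseline pipeline, the sketch below computes features (1) and (4), fits a small logistic regression by stochastic gradient ascent (a self-contained stand-in for the Matlab MNRFIT routine the report actually used), and accumulates the rolling profit defined in Section 2.3. The bucket dictionaries are a hypothetical layout.

```python
import math

# Sketch of the baseline model: features (1) and (4), a minimal logistic
# regression trained by stochastic gradient ascent, the uptick label
# 1{P_close(t) > P_open(t)}, and the rolling-profit metric.

def features(buckets, t):
    """Features (1) and (4) for bucket t (requires t >= 1).
    `buckets`: list of dicts with open_price/close_price/open_volume."""
    prev, cur = buckets[t - 1], buckets[t]
    return [
        (cur["open_price"] - prev["open_price"]) / prev["open_price"],    # (1)
        (cur["open_volume"] - prev["open_volume"]) / prev["open_volume"], # (4)
    ]

def train_logreg(X, y, lr=0.5, epochs=1000):
    """Fit weights [bias, w1, ..., wd] by maximizing log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
            p = 1.0 / (1.0 + math.exp(-z))
            g = yi - p  # gradient of the log-likelihood w.r.t. z
            w[0] += lr * g
            for j, xj in enumerate(xi):
                w[j + 1] += lr * g * xj
    return w

def predict_uptick(w, x):
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

def rolling_profit(buckets, w):
    """Invest 1 share whenever the model predicts an uptick and sell at
    the bucket's close (the rolling-profit metric of Section 2.3)."""
    profit = 0.0
    for t in range(1, len(buckets)):
        if predict_uptick(w, features(buckets, t)):
            profit += buckets[t]["close_price"] - buckets[t]["open_price"]
    return profit
```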
3.1.2 Improvements based on Time Locality
While logistic regression achieved reasonable performance with the two-dimensional feature set consisting of (1) and (4), making a profit of 0.0123, we attempted to further improve our results. Based on the earlier discussion, our logistic regression model is constrained to a low-dimensional feature space. As a result, we must either select more descriptive features within a low-dimensional feature space or use a different model that can learn from a higher-dimensional feature space.
We started by constructing more descriptive features. We hypothesized that the stock market exhibits significant time-locality of price trends, based on the fact that it is often influenced by group decision making and other time-bound events in the marketplace. The signals of these events are usually visible over a time frame longer than a minute, since in the very short term these trends are masked by the inherent volatility of stock prices. For example, if the market enters a mode of general rise with high fluctuation at a certain time, then large 1-minute percentage changes in price or volume become less significant in comparison to the general trend.
We attempted to address these concerns by formulating new features based on the δ-minute high-low model [1], inspired by CS 246 (Winter 2011-2012) HW4, Problem 1. Professionals in the algorithmic trading field (Keith Siilats, a former CS 246 TA) recommended the heuristic choice of δ = 5. The δ-minute high-low model tracks the high price, low price, high volume, and low volume across all the ticks in any δ-minute span. For the most recent δ-minute span w.r.t. any 1-minute bucket of
time t, we define PH(t), PL(t), VH(t), and VL(t) as follows:

    PH(t) = max_{t-δ ≤ i ≤ t-1} P_high^(i)    (7)
    PL(t) = min_{t-δ ≤ i ≤ t-1} P_low^(i)     (8)
    VH(t) = max_{t-δ ≤ i ≤ t-1} V_high^(i)    (9)
    VL(t) = min_{t-δ ≤ i ≤ t-1} V_low^(i)     (10)
Under the δ-minute high-low model, we choose our features to be the following:
    (P_open^(t) - P_open^(t-1)) / (PH(t) - PL(t))    (11)
    (V_open^(t) - V_open^(t-1)) / (VH(t) - VL(t))    (12)

Specifically, these are the ratios of the open-price and open-volume changes to the most recent "δ-minute high-low spread", respectively.
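Features (11) and (12) can be computed directly from equations (7) through (10). The sketch below reuses the hypothetical per-bucket dictionary layout from earlier; it is illustrative, not the report's Matlab code.

```python
# Sketch of the delta-minute high-low features: the 1-minute open
# price/volume change normalized by the high-low spread over the
# preceding delta minutes (equations (7)-(12)).

def high_low_features(buckets, t, delta=5):
    """Features (11) and (12) for bucket t (requires t >= delta)."""
    window = buckets[t - delta:t]                  # buckets t-delta .. t-1
    ph = max(b["high_price"] for b in window)      # PH(t), eq. (7)
    pl = min(b["low_price"] for b in window)       # PL(t), eq. (8)
    vh = max(b["high_volume"] for b in window)     # VH(t), eq. (9)
    vl = min(b["low_volume"] for b in window)      # VL(t), eq. (10)
    prev, cur = buckets[t - 1], buckets[t]
    return [
        (cur["open_price"] - prev["open_price"]) / (ph - pl),    # (11)
        (cur["open_volume"] - prev["open_volume"]) / (vh - vl),  # (12)
    ]
```

Note that a degenerate window (zero high-low spread) would divide by zero; real tick data makes this rare, but production code would need to guard against it.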
Considering that our stock universe may be different, we use cross-validation to determine the optimal value of δ. Figure 2 suggests that δ = 5 leads to maximal precision, while δ = 10 gives maximal profit and recall. For the purposes of this project, we chose δ = 5, because higher precision leads to a more conservative strategy.
Figure 2: Performance over different values of δ
We also set the training duration to 60 minutes based on another cross-validation analysis with δ = 5. Our δ-minute high-low logistic regression model finally achieves precision 59.39%, recall 27.43%, accuracy 41.58%, and profit 0.0186.
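The cross-validation used to pick δ can be sketched as a simple grid search over candidate values: train on a fixed window and score precision on the following 30 minutes, keeping the δ with the highest precision (mirroring the report's conservative preference). Here `fit` and `precision_of` are placeholders supplied by the caller, not functions from the report.

```python
# Sketch of delta selection by cross-validation: for each candidate
# delta, train on the first `train_minutes` buckets and evaluate
# precision on the next `test_minutes` buckets.

def select_delta(buckets, fit, precision_of,
                 candidates=(5, 10, 15), train_minutes=60, test_minutes=30):
    """Return the candidate delta with the highest validation precision.

    `fit(train_buckets, delta)` -> model
    `precision_of(model, test_buckets, delta)` -> float
    """
    best_delta, best_precision = None, -1.0
    for delta in candidates:
        model = fit(buckets[:train_minutes], delta)
        test = buckets[train_minutes:train_minutes + test_minutes]
        p = precision_of(model, test, delta)
        if p > best_precision:
            best_delta, best_precision = delta, p
    return best_delta
```

The same loop, with the training duration as the swept parameter, covers the 60-minute training-window selection described above.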
Table 1: Comparison between the two logistic regression models

    Model      Profit   Precision   Recall   Accuracy
    Baseline   0.0123   55.07%      30.05%   38.39%
    δ-HL       0.0186   59.39%      27.43%   41.58%

By comparing the performance of the two logistic regression models in Table 1, we clearly see that the δ-minute high-low model is superior to the baseline model. This result validates our hypothesis on the time-locality characteristic of stock data and suggests that time-locality lasts around 5 minutes.

3.2. Support Vector Machine

As we discussed earlier, further improvement of the results may still be possible by exploring a new machine learning model. The previous model we explored constrained us to a low-dimensional feature space; to overcome this constraint, we experimented with an SVM using ℓ1 regularization with C = 1.

3.2.1 Feature & Parameter Selection

We tried different combinations of the 8 features defined by equations (1) to (6), equation (11), and equation (12). Since there are a large number of feature combinations to consider, we used forward search to incrementally add features to our existing feature set, and chose the best set based on our 4 metrics.

Table 2: Performance over different feature sets

    Features                                  Profit   Precision   Recall   Accuracy
    (1), (4)                                  0.3066   44.72%      52.11%   42.85%
    (11), (12)                                0.3706   42.81%      57.64%   40.34%
    (1), (4), (11), (12)                      0.3029   42.48%      47.54%   39.42%
    (1), (4), (11), (12), (2), (5)            0.3627   45.22%      56.25%   42.60%
    (1), (4), (11), (12), (2), (5), (3), (6)  0.3484   46.43%      55.66%   42.91%

We chose the last feature set, since it leads to the highest precision along with very high profit, recall, and accuracy. In addition, we set the training duration to 60 minutes using cross-validation. Similarly, we chose the optimal δ = 10 and C = 0.1 using cross-validation. We also compared a linear kernel with a Gaussian kernel, and the linear kernel tends to give better results.

The SVM model trained with the chosen training duration, δ, and C finally achieves precision 47.13%, recall 53.96%, accuracy 42.30%, and profit 0.3066. Comparing the δ-minute high-low regression model with the SVM model, we see that the SVM model significantly improves recall, by almost 100%, while sacrificing only a small percentage of precision, around 20%.

3.2.2 Time-Locality Revisited

Recall that the δ-minute high-low model is based on our hypothesis that there exists a rolling correlation between trades within a certain period of time, and that by cross-validation we chose δ = 10 for the SVM model. To further substantiate this hypothesis, we conducted an experiment in which we trained an SVM using the optimal parameters from the previous section, and then evaluated the accuracy of the model by testing it on different periods of time.

Specifically, the performance statistics of an SVM model trained from 9:30 AM to 10:30 AM are listed in Table 3. A close inspection shows a downtrend in performance as the delay between the testing period and the training period becomes larger. In fact, it is not surprising to see even better performance from this model within 10 minutes after training completes, as we chose δ = 10: testing on that window yields precision 68.84%, recall 36.88%, and accuracy 44.84%, which tops all other results in Table 3.

Table 3: Performance over different periods of time

    Period            Profit   Precision   Recall   Accuracy
    10:30-11:00 AM    0.0926   56.45%      38.10%   43.92%
    10:45-11:15 AM    0.0684   42.49%      38.32%   42.15%
    11:00-11:30 AM    0.0775   54.29%      41.09%   43.07%
    11:15-11:45 AM    0.0726   48.68%      36.68%   38.68%
    11:30-12:00 PM    0.0632   32.74%      29.77%   40.44%

4. Conclusion and Further Work

Predicting stock market trends using machine learning algorithms is a challenging task, due to the trends being masked by various factors such as noise and volatility. In addition, the market operates in various local modes that change from time to time, making it necessary to capture those changes in order to remain profitable while trading.
Although our algorithms and models were simplified, we were able to meet our expectation of reaching modest profitability. Our sequential analysis made it clear that factoring in time-locality, and capturing the features after smoothing to reduce volatility, improves profitability and precision substantially.
Factoring in carefully selected high-dimensional features can also contribute significantly to improving the results, and our comparison of the SVM with logistic regression was able to capture this. We expect that this is because higher dimensionality increases the likelihood that the dataset is linearly separable.
Finally, iterative improvements achieved through sequential optimizations, in the form of discretization, recognition of time-locality, and smoothing, improved the results significantly. Cross-validation and forward search were also powerful tools for making the algorithms perform better.
In conclusion, our experience in this project suggests that machine learning has great potential in this field, and we hope to continue working on this project to explore more nuances in improving performance via better algorithms and further optimizations.
A few interesting questions worth investigating include exploring other international stock markets, to find venues where algorithmic trading can perform better. It would also be interesting to investigate other algorithms, such as reinforcement learning, to compare with the models discussed in this report. Feature selection has been key, and more work on discovering descriptive features promises to make the results even better.
5. Acknowledgements
We would like to thank Professor Andrew Ng and the TAs of the class for their feedback and input on the project. We would also like to thank Keith Siilats for his generous advice and for sharing valuable personal experience in the field, which helped inform our decisions.
References
[1] J. Leskovec (instructor), K. Siilats (TA). CS 246 (Winter 2011-2012), Homework 4, Problem 1.
A. Appendix

    Stock Ticker    Origin
    APOL            US Equity
    CBG             US Equity
    CMA             US Equity
    CMS             US Equity
    CVS             US Equity
    GCI             US Equity
    GME             US Equity
    GT              US Equity
    JBL             US Equity
    KIM             US Equity
    LNC             US Equity
    NFX             US Equity
    NI              US Equity
    NWL             US Equity
    NYX             US Equity
    PWR             US Equity
    QEP             US Equity
    SEE             US Equity
    TER             US Equity
    THC             US Equity
    TIE             US Equity
    TXT             US Equity
    ZION            US Equity