Deep Reinforcement Learning for Automated Stock Trading

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.


This blog is based on our paper: Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy, presented at ICAIF 2020: ACM International Conference on AI in Finance.


Our codes are available on Github.


Our paper will be available on arXiv soon.


If you want to cite our paper, the reference format is as follows:


Hongyang Yang, Xiao-Yang Liu, Shan Zhong, and Anwar Walid. 2020. Deep Reinforcement Learning for Automated Stock Trading: An Ensemble Strategy. In ICAIF ’20: ACM International Conference on AI in Finance, Oct. 15–16, 2020, Manhattan, NY. ACM, New York, NY, USA.


Overview

One can hardly overestimate the crucial role stock trading strategies play in investment.


A profitable automated stock trading strategy is vital to investment companies and hedge funds. It is applied to optimize capital allocation and maximize investment performance, such as expected return. Return maximization can be based on estimates of potential return and risk. However, it is challenging to design a profitable strategy in a complex and dynamic stock market.


Every player wants a winning strategy. Needless to say, a profitable strategy in such a complex and dynamic stock market is not easy to design.


Yet, here we reveal a deep reinforcement learning scheme that automatically learns a stock trading strategy by maximizing investment return.


Photo by Suhyeon on Unsplash

Our Solution: Ensemble Deep Reinforcement Learning Trading Strategy. This strategy includes three actor-critic based algorithms: Proximal Policy Optimization (PPO), Advantage Actor Critic (A2C), and Deep Deterministic Policy Gradient (DDPG). It combines the best features of the three algorithms, thereby robustly adjusting to different market conditions.


The performance of the trading agent with different reinforcement learning algorithms is evaluated using Sharpe ratio and compared with both the Dow Jones Industrial Average index and the traditional min-variance portfolio allocation strategy.


Copyright by AI4Finance LLC

Part 1. Why do you want to use Deep Reinforcement Learning (DRL) for stock trading?

Existing works are not satisfactory. The Deep Reinforcement Learning approach has many advantages.


1.1 DRL and Modern Portfolio Theory (MPT)

  1. MPT does not perform well on out-of-sample data.


  2. MPT is very sensitive to outliers.


  3. MPT is calculated based only on stock returns. If we want to take other relevant factors into account, for example technical indicators such as Moving Average Convergence Divergence (MACD) and Relative Strength Index (RSI), MPT may not be able to combine this information well.


1.2 DRL and supervised machine learning prediction models

  1. DRL doesn’t need large labeled training datasets. This is a significant advantage: since the amount of data grows exponentially today, labeling a large dataset is very time- and labor-consuming.


  2. DRL uses a reward function to optimize future rewards, in contrast to an ML regression/classification model that predicts the probability of future outcomes.


1.3 The rationale of using DRL for stock trading

  1. The goal of stock trading is to maximize returns, while avoiding risks. DRL solves this optimization problem by maximizing the expected total reward from future actions over a time period.


  2. Stock trading is a continuous process of testing new ideas, getting feedback from the market, and trying to optimize trading strategies over time. We can model the stock trading process as a Markov Decision Process (MDP), which is the very foundation of Reinforcement Learning.


1.4 The advantages of deep reinforcement learning

  1. Deep reinforcement learning algorithms can outperform human players in many challenging games. For example, in March 2016, DeepMind’s AlphaGo program, a deep reinforcement learning algorithm, beat the world champion Lee Sedol at the game of Go.


  2. Return maximization as trading goal: by defining the reward function as the change of the portfolio value, Deep Reinforcement Learning maximizes the portfolio value over time.


  3. The stock market provides sequential feedback. DRL can sequentially increase the model performance during the training process.


  4. The exploration-exploitation technique balances trying out different new things and taking advantage of what has already been figured out. This is different from other learning algorithms. Also, there is no requirement for a skilled human to provide training examples or labeled samples. Furthermore, during the exploration process, the agent is encouraged to explore areas uncharted by human experts.


  5. Experience replay: is able to overcome the correlated-samples issue. Learning from a batch of consecutive samples may suffer from high variance and hence be inefficient; experience replay addresses this by randomly sampling mini-batches of transitions from a pre-saved replay memory (a minimal sketch appears after this list).


  6. Multi-dimensional data: by using a continuous action space, DRL can handle high-dimensional data.


  7. Computational power: Q-learning is a very important RL algorithm; however, it fails to handle large state spaces. DRL, empowered by neural networks as efficient function approximators, can handle extremely large state and action spaces.

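As a concrete illustration of the experience-replay idea in point 5, here is a minimal, generic replay-buffer sketch. It is not taken from our repository; the class and method names are purely illustrative.

import random
from collections import deque

class ReplayBuffer:
    """Stores transitions and returns uncorrelated random mini-batches."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # random sampling breaks the temporal correlation of consecutive transitions
        return random.sample(self.buffer, batch_size)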

Photo by Kevin on Unsplash

Part 2: What is Reinforcement Learning? What is Deep Reinforcement Learning? What are some of the related works that use Reinforcement Learning for stock trading?

2.1 Concepts

Reinforcement Learning is one of the three approaches of machine learning, and it trains an agent to interact with the environment by sequentially receiving states and rewards from the environment and taking actions to reach better rewards.


Deep Reinforcement Learning approximates the Q value with a neural network. Using a neural network as a function approximator would allow reinforcement learning to be applied to large data.


Bellman Equation is the guiding principle to design reinforcement learning algorithms.

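In its standard textbook form (generic notation, not specific to any implementation here), the Bellman optimality equation for the action-value function, with discount factor γ, reads:

    Q*(s, a) = E_{s'}[ r(s, a, s') + γ max_{a'} Q*(s', a') ]

Q-learning-style algorithms, including the DQN variants discussed below, repeatedly push their Q estimates toward the right-hand side of this equation.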

Markov Decision Process (MDP) is used to model the environment.


2.2 Related works

Recent applications of deep reinforcement learning in financial markets consider discrete or continuous state and action spaces, and employ one of these learning approaches: the critic-only approach, the actor-only approach, or the actor-critic approach.


1. Critic-only approach: the critic-only learning approach, which is the most common, solves a discrete action space problem using, for example, Q-learning, Deep Q-learning (DQN) and its improvements, and trains an agent on a single stock or asset. The idea of the critic-only approach is to use a Q-value function to learn the optimal action-selection policy that maximizes the expected future reward given the current state. Instead of calculating a state-action value table, DQN minimizes the mean squared error between the target Q-values and the predicted Q-values, and uses a neural network to perform function approximation. The major limitation of the critic-only approach is that it only works with discrete and finite state and action spaces, which is not practical for a large portfolio of stocks, since the prices are of course continuous.


  • Q-learning: a value-based Reinforcement Learning algorithm that is used to find the optimal action-selection policy using a Q function.


  • DQN: In deep Q-learning, we use a neural network to approximate the Q-value function. The state is given as the input and the Q-value of allowed actions is the predicted output.


2. Actor-only approach: The idea here is that the agent directly learns the optimal policy itself. Instead of having a neural network to learn the Q-value, the neural network learns the policy. The policy is a probability distribution that is essentially a strategy for a given state, namely the likelihood to take an allowed action. The actor-only approach can handle the continuous action space environments.


  • Policy Gradient: aims to maximize the expected total reward by directly learning the optimal policy itself.


3. Actor-Critic approach: The actor-critic approach has recently been applied in finance. The idea is to simultaneously update the actor network that represents the policy and the critic network that represents the value function. The critic estimates the value function, while the actor updates the policy probability distribution, guided by the critic, with policy gradients. Over time, the actor learns to take better actions and the critic gets better at evaluating those actions. The actor-critic approach has proven to be able to learn and adapt to large and complex environments, and has been used to play popular video games, such as Doom. Thus, the actor-critic approach fits well in trading with a large stock portfolio. (The standard update rules behind the critic-only and actor-only families are sketched after this list.)


  • A2C: A2C is a typical actor-critic algorithm. A2C uses copies of the same agent working in parallel to update gradients with different data samples. Each agent works independently to interact with the same environment.


  • PPO: PPO is introduced to control the policy gradient update and ensure that the new policy will not be too different from the previous one.


  • DDPG: DDPG combines the frameworks of both Q-learning and policy gradient, and uses neural networks as function approximators.

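For reference, the two building blocks come in the following standard textbook forms (generic notation, not tied to our code). The tabular Q-learning update behind the critic-only family, with learning rate α and discount factor γ, is

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ],

while the actor-only family directly ascends the policy gradient

    ∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a_t | s_t) Q^{π_θ}(s_t, a_t) ].

The actor-critic algorithms below combine the two: the critic estimates the value term, and the actor uses that estimate inside the gradient.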

Part 3: How to use DRL to trade stocks?

Photo by Markus on Unsplash

3.1 Data

We track and select the Dow Jones 30 stocks and use historical daily data from 01/01/2009 to 05/08/2020 to train the agent and test the performance. The dataset is downloaded from Compustat database accessed through Wharton Research Data Services (WRDS).


The whole dataset is split in the following figure. Data from 01/01/2009 to 12/31/2014 is used for training, and the data from 10/01/2015 to 12/31/2015 is used for validation and tuning of parameters. Finally, we test our agent’s performance on trading data, which is the unseen out-of-sample data from 01/01/2016 to 05/08/2020. To better exploit the trading data, we continue training our agent while in the trading stage, since this will help the agent to better adapt to the market dynamics.

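A minimal pandas sketch of such a date-based split, assuming a dataframe df with a datadate column in YYYYMMDD integer format (the column name is an assumption for illustration, not necessarily the repository's):

# df: daily data for the Dow 30 constituents, with a 'datadate' column (assumed, e.g. 20090102)
train      = df[(df.datadate >= 20090101) & (df.datadate <= 20141231)]
validation = df[(df.datadate >= 20151001) & (df.datadate <= 20151231)]
trade      = df[(df.datadate >= 20160101) & (df.datadate <= 20200508)]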

Copyright by AI4Finance LLC

3.2 MDP model for stock trading

State s = [p, h, b]: a vector that includes the stock prices p ∈ R+^D, the stock shares h ∈ Z+^D, and the remaining balance b ∈ R+, where D denotes the number of stocks and Z+ denotes non-negative integers.


Action a: a vector of actions over the D stocks. The allowed actions on each stock include selling, buying, or holding, which result in decreasing, increasing, and no change of the stock shares h, respectively.


Reward r(s, a, s′): the direct reward of taking action a at state s and arriving at the new state s′.


Policy π(s): the trading strategy at state s, which is the probability distribution of actions at state s.


Q-value Qπ(s, a): the expected reward of taking action a at state s following policy π.


The state transition of our stock trading process is shown in the following figure. At each state, one of three possible actions is taken on stock d (d = 1, …, D) in the portfolio.


  • Selling k[d] ∈ [1, h[d]] shares results in h_{t+1}[d] = h_t[d] − k[d], where k[d] ∈ Z+ and d = 1, …, D.


  • Holding: h_{t+1}[d] = h_t[d].


  • Buying k[d] shares results in h_{t+1}[d] = h_t[d] + k[d].


At time t, an action is taken and the stock prices update at time t+1; accordingly, the portfolio value may change from “portfolio value 0” to “portfolio value 1”, “portfolio value 2”, or “portfolio value 3”, respectively, as illustrated in Figure 2. Note that the portfolio value is p_t · h_t + b_t.


Copyright by AI4Finance LLC

3.3 Constraints

  • Market liquidity: The orders can be rapidly executed at the close price. We assume that the stock market will not be affected by our reinforcement trading agent.


  • Nonnegative balance: the allowed actions should not result in a negative balance.


  • Transaction cost: transaction costs are incurred for each trade. There are many types of transaction costs, such as exchange fees, execution fees, and SEC fees, and different brokers charge different commission fees. Despite these variations, we assume our transaction cost to be 1/1000 of the value of each trade (either buy or sell); a minimal sketch appears after this list.


  • Risk-aversion for market crash: there are sudden events that may cause stock market crash, such as wars, collapse of stock market bubbles, sovereign debt default, and financial crisis. To control the risk in a worst-case scenario like 2008 global financial crisis, we employ the financial turbulence index that measures extreme asset price movements.

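A minimal sketch of how such a proportional transaction cost can be applied when executing a buy at the close price (illustrative only; the constant and function names are not from the repository):

TRANSACTION_FEE_RATE = 0.001  # 1/1000 of the trade value, as assumed above

def execute_buy(balance, close_price, shares):
    """Deduct the trade value plus the proportional transaction cost from the cash balance."""
    trade_value = close_price * shares
    cost = trade_value * TRANSACTION_FEE_RATE
    assert balance >= trade_value + cost, "non-negative balance constraint"
    return balance - trade_value - cost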

3.4 Return maximization as trading goal

We define our reward function as the change of the portfolio value when action a is taken at state s_t, arriving at the new state s_{t+1}.

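Written with the notation of Section 3.2, where the portfolio value is p · h + b, the reward is simply the change in portfolio value between consecutive states (transaction costs from Section 3.3 enter through the reduced balance b):

    r(s_t, a_t, s_{t+1}) = (b_{t+1} + p_{t+1} · h_{t+1}) − (b_t + p_t · h_t)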

The goal is to design a trading strategy that maximizes the change of the portfolio value r(s_t, a_t, s_{t+1}) in the dynamic environment, and we employ the deep reinforcement learning method to solve this problem.


Image by Isaac on Unsplash

3.5 Environment for multiple stocks

State Space: We use a 181-dimensional vector (30 stocks × 6 + 1) that consists of seven parts of information to represent the state space of the multiple-stock trading environment:


  1. Balance: available amount of money left in the account at current time step


  2. Price: current adjusted close price of each stock.


  3. Shares: shares owned of each stock.


  4. MACD: Moving Average Convergence Divergence (MACD) is calculated using the close price (a minimal pandas sketch appears after this list).


  5. RSI: Relative Strength Index (RSI) is calculated using close price.


  6. CCI: Commodity Channel Index (CCI) is calculated using high, low and close price.


  7. ADX: Average Directional Index (ADX) is calculated using high, low and close price.

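For illustration, MACD can be derived from the close-price series with a few lines of pandas. This is the textbook 12/26-period EMA definition, shown only as an assumption of how such an indicator is computed; the project itself may rely on a technical-indicator library.

import pandas as pd

def macd(close: pd.Series) -> pd.Series:
    """MACD line = 12-period EMA minus 26-period EMA of the close price."""
    ema_fast = close.ewm(span=12, adjust=False).mean()
    ema_slow = close.ewm(span=26, adjust=False).mean()
    return ema_fast - ema_slow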

Action Space:


  1. For a single stock, the action space is defined as {-k, …, -1, 0, 1, …, k}, where k and -k represent the number of shares we can buy and sell, and k ≤ h_max, where h_max is a predefined parameter that sets the maximum number of shares for each buying action.


  2. For multiple stocks (here 30), the size of the entire action space is therefore (2k+1)^30.


  3. The action space is then normalized to [-1, 1], since the RL algorithms A2C and PPO define the policy directly on a Gaussian distribution, which needs to be normalized and symmetric. (A sketch of mapping a normalized action back to share counts appears after this list.)

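A minimal sketch of that mapping, assuming h_max = 100 shares per trade (the names here are illustrative, not the repository's):

import numpy as np

H_MAX = 100  # assumed maximum number of shares per buy/sell action

def to_share_counts(normalized_actions: np.ndarray) -> np.ndarray:
    """Map policy outputs in [-1, 1] to integer share counts in [-H_MAX, H_MAX]."""
    return (normalized_actions * H_MAX).astype(int)

# example: shares = to_share_counts(agent_action)  # negative = sell, positive = buy, 0 = hold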

import gym
import numpy as np
from gym import spaces

# STOCK_DIM and INITIAL_ACCOUNT_BALANCE are constants defined in the project's config
# (the number of stocks, 30, and the starting cash amount, respectively).

class StockEnvTrain(gym.Env):
    """A stock trading environment for OpenAI gym"""
    metadata = {'render.modes': ['human']}

    def __init__(self, df, day=0):
        self.day = day
        self.df = df

        # Action space: normalized to [-1, 1], one dimension per stock (shape = STOCK_DIM)
        self.action_space = spaces.Box(low=-1, high=1, shape=(STOCK_DIM,))

        # State space, shape = 181:
        # [Current Balance] + [prices 1-30] + [owned shares 1-30]
        # + [macd 1-30] + [rsi 1-30] + [cci 1-30] + [adx 1-30]
        self.observation_space = spaces.Box(low=0, high=np.inf, shape=(181,))

        # load data from a pandas dataframe
        self.data = self.df.loc[self.day, :]
        self.terminal = False

        # initialize state
        self.state = [INITIAL_ACCOUNT_BALANCE] + \
                     self.data.adjcp.values.tolist() + \
                     [0] * STOCK_DIM + \
                     self.data.macd.values.tolist() + \
                     self.data.rsi.values.tolist() + \
                     self.data.cci.values.tolist() + \
                     self.data.adx.values.tolist()

        # initialize reward and transaction cost
        self.reward = 0
        self.cost = 0

        # memorize all the total balance change
        self.asset_memory = [INITIAL_ACCOUNT_BALANCE]
        self.rewards_memory = []
        self.trades = 0
        # self.reset()
        self._seed()
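
For reference, a minimal sketch of how such an environment is typically wrapped for the stable-baselines agents used below; df_train is assumed to be the preprocessed training dataframe.

from stable_baselines.common.vec_env import DummyVecEnv

# df_train: preprocessed training dataframe (assumed to exist)
env_train = DummyVecEnv([lambda: StockEnvTrain(df_train)])
obs = env_train.reset()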

3.6 Trading agent based on deep reinforcement learning

A2C

Copyright by AI4Finance LLC

A2C is a typical actor-critic algorithm which we use as a component in the ensemble method. A2C is introduced to improve the policy gradient updates. A2C utilizes an advantage function to reduce the variance of the policy gradient. Instead of only estimating the value function, the critic network estimates the advantage function. Thus, the evaluation of an action not only depends on how good the action is, but also considers how much better it can be. This reduces the high variance of the policy network and makes the model more robust.

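In standard notation (again generic, not implementation-specific), the advantage function and the resulting policy gradient used by A2C are:

    A(s_t, a_t) = Q(s_t, a_t) − V(s_t)
    ∇_θ J(θ) = E[ ∇_θ log π_θ(a_t | s_t) A(s_t, a_t) ]

Subtracting the state value V(s_t) as a baseline is what reduces the variance of the gradient estimate.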

A2C uses copies of the same agent working in parallel to update gradients with different data samples. Each agent works independently to interact with the same environment. After all of the parallel agents finish calculating their gradients, A2C uses a coordinator to pass the average gradients over all the agents to a global network, so that the global network can update the actor and the critic networks. The presence of a global network increases the diversity of training data. The synchronized gradient update is more cost-effective, faster, and works better with large batch sizes. A2C is a great model for stock trading because of its stability.


DDPG

Copyright by AI4Finance LLC

DDPG is an actor-critic based algorithm which we use as a component in the ensemble strategy to maximize the investment return. DDPG combines the frameworks of both Q-learning and policy gradient, and uses neural networks as function approximators. In contrast with DQN, which learns indirectly through Q-value tables and suffers from the curse of dimensionality, DDPG learns directly from the observations through policy gradient. It deterministically maps states to actions to better fit the continuous action space environment.

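In its standard form (textbook notation, not code from our repository), DDPG learns a deterministic policy a = μ_θ(s) together with a critic Q_φ(s, a); the critic is trained to minimize the Bellman error against target networks, and the actor is updated with the deterministic policy gradient:

    ∇_θ J(θ) ≈ E_s[ ∇_a Q_φ(s, a) |_{a = μ_θ(s)} ∇_θ μ_θ(s) ]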

PPO

We explore and use PPO as a component in the ensemble method. PPO is introduced to control the policy gradient update and ensure that the new policy will not be too different from the older one. PPO tries to simplify the objective of Trust Region Policy Optimization (TRPO) by introducing a clipping term to the objective function.


The objective function of PPO takes the minimum of the clipped and normal objectives. PPO discourages large policy changes that move outside of the clipped interval. Therefore, PPO improves the stability of the policy network training by restricting the policy update at each training step. We select PPO for stock trading because it is stable, fast, and simpler to implement and tune.

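For reference, the clipped surrogate objective has the standard form

    L^{CLIP}(θ) = E_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],

where r_t(θ) is the probability ratio between the new and old policies, Â_t is the advantage estimate, and ε is the clipping parameter.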

Ensemble strategy

Our purpose is to create a highly robust trading strategy. So we use an ensemble method to automatically select the best performing agent among PPO, A2C, and DDPG to trade based on the Sharpe ratio. The ensemble process is described as follows:


Step 1. We use a growing window of n months to retrain our three agents concurrently. In this paper, we retrain our three agents every three months.


Step 2. We validate all three agents using a 3-month rolling validation window that follows the training window, and pick the best performing agent, i.e., the one with the highest Sharpe ratio. We also adjust risk-aversion by using the turbulence index in our validation stage.


Step 3. After validation, we only use the best model with the highest Sharpe ratio to predict and trade for the next quarter.

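A minimal sketch of the selection logic in these steps, assuming the validation daily returns of each agent are already available as arrays (the variable names are illustrative, not from the repository):

import numpy as np

def sharpe_ratio(daily_returns):
    """Annualized Sharpe ratio of a daily-return series (risk-free rate taken as zero)."""
    return np.sqrt(252) * np.mean(daily_returns) / np.std(daily_returns)

# hypothetical validation daily-return arrays for the three agents over the 3-month window
validation_returns = {"A2C": a2c_val_returns, "PPO": ppo_val_returns, "DDPG": ddpg_val_returns}

sharpes = {name: sharpe_ratio(r) for name, r in validation_returns.items()}
best_agent = max(sharpes, key=sharpes.get)  # this agent trades the next quarter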

import time  # added: needed for the timing calls below

from stable_baselines import SAC
from stable_baselines import PPO2
from stable_baselines import A2C
from stable_baselines import DDPG
from stable_baselines import TD3
from stable_baselines.ddpg.policies import DDPGPolicy
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
# `config` is the project's configuration module, providing TRAINED_MODEL_DIR

def train_A2C(env_train, model_name, timesteps=10000):
    """A2C model"""
    start = time.time()
    model = A2C('MlpPolicy', env_train, verbose=0)
    model.learn(total_timesteps=timesteps)
    end = time.time()
    model.save(f"{config.TRAINED_MODEL_DIR}/{model_name}")
    print('Training time (A2C): ', (end - start) / 60, ' minutes')
    return model

def train_DDPG(env_train, model_name, timesteps=10000):
    """DDPG model"""
    start = time.time()
    model = DDPG('MlpPolicy', env_train)
    model.learn(total_timesteps=timesteps)
    end = time.time()
    model.save(f"{config.TRAINED_MODEL_DIR}/{model_name}")
    print('Training time (DDPG): ', (end - start) / 60, ' minutes')
    return model

def train_PPO(env_train, model_name, timesteps=50000):
    """PPO model"""
    start = time.time()
    model = PPO2('MlpPolicy', env_train)
    model.learn(total_timesteps=timesteps)
    end = time.time()
    model.save(f"{config.TRAINED_MODEL_DIR}/{model_name}")
    print('Training time (PPO): ', (end - start) / 60, ' minutes')
    return model

def DRL_prediction(model, test_data, test_env, test_obs):
    """make a prediction"""
    start = time.time()
    for i in range(len(test_data.index.unique())):
        action, _states = model.predict(test_obs)
        test_obs, rewards, dones, info = test_env.step(action)
        # test_env.render()
    end = time.time()

3.7 Performance evaluations

We use Quantopian’s pyfolio to do the backtesting. The charts look pretty good, and it takes literally one line of code to implement it. You just need to convert everything into daily returns.


import pyfolio

with pyfolio.plotting.plotting_context(font_scale=1.1):
    pyfolio.create_full_tear_sheet(returns=ensemble_strat,
                                   benchmark_rets=dow_strat,
                                   set_context=False)
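
Here ensemble_strat and dow_strat are assumed to be daily-return series. A minimal sketch of obtaining such a series from a daily account-value (or index-level) pandas Series:

# account_value: pandas Series of daily portfolio values indexed by date (assumed)
ensemble_strat = account_value.pct_change().dropna()
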
Copyright by AI4Finance LLC
Copyright by AI4Finance LLC

A2C:Volodymyr Mnih, Adrià Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. The 33rd International Conference on Machine Learning (02 2016). https://arxiv.org/abs/1602.01783


DDPG:Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR) 2016 (09 2015). https://arxiv.org/abs/1509.02971


PPO:John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. 2015. Trust region policy optimization. In The 31st International Conference on Machine Learning. https://arxiv.org/abs/1502.05477


John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv:1707.06347 (07 2017). https://arxiv.org/abs/1707.06347


Translated from: https://towardsdatascience.com/deep-reinforcement-learning-for-automated-stock-trading-f1dad0126a02
