Original article No. 114, focused on "personal growth and financial freedom, the logic of how the world works, and AI quantitative investing".
Today's core task is to integrate a reinforcement learning environment into our AI quant platform.
Much of the code you find online bundles data loading and preprocessing into the reinforcement learning environment itself. From the perspective of the overall quant platform, that hurts code reuse. We have already implemented the dataloader, so here we only need to implement the reinforcement learning gym environment on its own.
01 A reinforcement learning environment for finance
A reinforcement learning environment has to define four things: the state space, the action space, the reward function, and the state observation.
State space and action space
The state space is the part of the environment the agent can observe. For a financial reinforcement learning environment, this is the factor dimensions (i.e., the feature dimensions).
import random

# State space
class observation_space:
    def __init__(self, n):
        self.shape = (n,)

# Action space
class action_space:
    def __init__(self, n):
        self.n = n

    def seed(self, seed):
        pass

    def sample(self):
        return random.randint(0, self.n - 1)
The action space is the set of actions the agent can take in response to an observed state; for example, "buy" and "close position" are two actions.
Environment initialization:
class FinanceEnv:
    def __init__(self, symbols, features, df_features):
        self.symbols = symbols
        self.features = features
        self.df_features = df_features
        self.observation_space = observation_space(4)
        self.rows = self.observation_space.shape[0]  # rows of data per observation
        self.action_space = action_space(2)          # number of actions
        self.min_accuracy = 0.475                    # minimum required accuracy
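The environment expects df_features to be the DataFrame produced by our dataloader: one column per entry in features, plus a binary label column that step scores the action against. Below is a minimal sketch of that assumed format; the column names, symbol code, and label rule are purely illustrative, not the platform's actual dataloader output:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataloader output.
rng = np.random.default_rng(0)
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500)))
df_features = pd.DataFrame({'close': close})
df_features['ret_1'] = df_features['close'].pct_change()      # 1-bar return factor
df_features['mom_10'] = df_features['close'].pct_change(10)   # 10-bar momentum factor
# Binary label: 1 if the next bar's return is positive, else 0.
df_features['label'] = (df_features['ret_1'].shift(-1) > 0).astype(int)
df_features = df_features.dropna()

features = ['ret_1', 'mom_10']
symbols = ['510300.SH']   # illustrative ETF code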
reset restores the environment to its initial state:
    def _get_state(self):
        # The most recent `rows` rows of the feature columns, as a numpy array.
        state = self.df_features[self.features].iloc[
            self.index - self.rows:self.index]
        return state.values

    def reset(self):
        self.treward = 0
        self.accuracy = 0
        self.index = self.rows
        # Initial state: `rows` rows by len(features) columns.
        state = self._get_state()
        return state   # _get_state() already returns a numpy array
The environment has two main working methods: reset and step.
step is the environment's most important piece of functionality. The agent observes the environment state, chooses an action, executes it, and receives feedback from the environment. step also tracks the cumulative reward and checks whether the episode should end, for example because the task has failed.
    def step(self, action):
        # Reward: 1 if the action matches the label at the current index, else 0.
        correct = action == self.df_features['label'].iloc[self.index]
        reward = 1 if correct else 0
        # Update the cumulative reward and running accuracy.
        self.treward += reward
        self.index += 1
        self.accuracy = self.treward / (self.index - self.rows)
        # Episode termination checks.
        if self.index >= len(self.df_features):
            # End of data reached.
            done = True
        elif reward == 1:
            done = False
        elif (self.accuracy < self.min_accuracy and
              self.index > self.rows + 10):
            # Accuracy has fallen below the threshold: the task has failed.
            done = True
        else:
            done = False
        state = self._get_state()
        info = {}
        return state, reward, done, info
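Before wiring in an agent, the gym-style interface can be sanity-checked by driving the environment with random actions. A minimal sketch, reusing the hypothetical df_features, features, and symbols from the sketch above:

env = FinanceEnv(symbols, features, df_features)
state = env.reset()     # shape: (rows, len(features)) = (4, 2) here
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()             # random 0/1 action
    state, reward, done, info = env.step(action)   # gym-style 4-tuple
    total_reward += reward
print('random-policy accuracy:', env.accuracy)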
02 The deep Q-learning agent: DQLAgent
from collections import deque
import random

import numpy as np
from keras import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


class DQLAgent:
    def __init__(self, env, gamma=0.95, hu=24, opt=Adam,
                 lr=0.001, finish=False):
        self.env = env
        self.finish = finish
        self.epsilon = 1.0            # initial exploration rate
        self.epsilon_min = 0.01       # exploration floor
        self.epsilon_decay = 0.995    # multiplicative decay per replay
        self.gamma = gamma            # discount factor
        self.batch_size = 32
        self.max_treward = 0
        self.averages = list()
        self.memory = deque(maxlen=2000)   # experience replay buffer
        self.osn = env.observation_space.shape[0]
        self.model = self._build_model(hu, opt, lr)

    def _build_model(self, hu, opt, lr):
        # Small MLP mapping a state to one Q-value per action.
        model = Sequential()
        model.add(Dense(hu, input_dim=self.osn, activation='relu'))
        model.add(Dense(hu, activation='relu'))
        model.add(Dense(self.env.action_space.n, activation='linear'))
        # 'learning_rate' is the current Keras argument name (formerly 'lr').
        model.compile(loss='mse', optimizer=opt(learning_rate=lr))
        return model

    def act(self, state):
        # Epsilon-greedy action selection.
        if random.random() <= self.epsilon:
            return self.env.action_space.sample()
        action = self.model.predict(state)[0]
        return np.argmax(action)

    def replay(self):
        # Train on a random mini-batch from the replay buffer.
        batch = random.sample(self.memory, self.batch_size)
        for state, action, reward, next_state, done in batch:
            if not done:
                reward += self.gamma * np.amax(
                    self.model.predict(next_state)[0])
            target = self.model.predict(state)
            target[0, action] = reward
            self.model.fit(state, target, epochs=1, verbose=False)
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

    def learn(self, episodes):
        trewards = []
        for e in range(1, episodes + 1):
            state = self.env.reset()
            state = np.reshape(state, [1, self.osn])
            for _ in range(5000):
                action = self.act(state)
                next_state, reward, done, info = self.env.step(action)
                next_state = np.reshape(next_state, [1, self.osn])
                self.memory.append([state, action, reward,
                                    next_state, done])
                state = next_state
                if done:
                    treward = _ + 1
                    trewards.append(treward)
                    av = sum(trewards[-25:]) / 25
                    self.averages.append(av)
                    self.max_treward = max(self.max_treward, treward)
                    templ = 'episode: {:4d}/{} | treward: {:4d} | '
                    templ += 'av: {:6.1f} | max: {:4d}'
                    print(templ.format(e, episodes, treward, av,
                                       self.max_treward), end='\r')
                    break
            if av > 195 and self.finish:
                print()
                break
            if len(self.memory) > self.batch_size:
                self.replay()

    def test(self, episodes):
        trewards = []
        for e in range(1, episodes + 1):
            state = self.env.reset()
            for _ in range(5001):
                state = np.reshape(state, [1, self.osn])
                # Greedy policy: always pick the action with the highest Q-value.
                action = np.argmax(self.model.predict(state)[0])
                next_state, reward, done, info = self.env.step(action)
                state = next_state
                if done:
                    treward = _ + 1
                    trewards.append(treward)
                    print('episode: {:4d}/{} | treward: {:4d}'
                          .format(e, episodes, treward), end='\r')
                    break
        return trewards
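To close the loop, here is a minimal sketch of training and evaluating the agent on the environment above. One shape constraint to note: the agent reshapes each state to [1, osn] with osn = observation_space.shape[0] (4 here), so this sketch passes a single feature column so that the 4 lookback rows flatten to 4 values. The hyperparameters and episode counts are illustrative only:

# Single-feature setup so the reshape to [1, 4] inside DQLAgent works.
env = FinanceEnv(symbols, ['ret_1'], df_features)
agent = DQLAgent(env, gamma=0.95, hu=24, lr=0.001, finish=False)

agent.learn(episodes=30)            # epsilon-greedy exploration + experience replay
trewards = agent.test(episodes=5)   # greedy (argmax) policy, no exploration
print('\naverage test episode length:', sum(trewards) / len(trewards))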
Summary:
Today we got a deep reinforcement learning application working end to end on financial data and built the financial trading environment.
Further targeted optimization is still needed in follow-up work.
The code has been uploaded to the Knowledge Planet.
ETF rotation + RSRS timing, plus a Kalman filter: 48.41% annualized return, Sharpe ratio 1.89
My open-source projects and Knowledge Planet