Paper: Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning
Paper translation & notes: [Paper notes] Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning
Code: https://github.com/le-liang/MARLspectrumSharingV2X
Visio flowcharts used in this post (drawn by the author; corrections are welcome): https://download.csdn.net/download/m0_37495408/12353933
(From the original authors' GitHub README)
Environment_marl.py defines the four basic classes of the framework: V2Vchannels, V2Ichannels, Vehicle, and Environ. Environ has by far the most methods, Vehicle has only a few attributes and no methods, and the other two classes each have two methods (one for path loss, one for shadowing).
Vehicle: initialization takes three arguments, the start position, the start direction, and the velocity. The constructor also creates two lists, neighbors and destinations, which hold the neighbours and the V2V receivers respectively (numerically identical here, since each V2V link is set up toward a neighbour).
class Vehicle:
    # Vehicle simulator: include all the information for a vehicle
    def __init__(self, start_position, start_direction, velocity):
        self.position = start_position
        self.direction = start_direction
        self.velocity = velocity
        self.neighbors = []
        self.destinations = []
The meaning of destinations can be seen from the code below:
    def renew_neighbor(self):   # this method is defined in class Environ
        """ Determine the neighbors of each vehicles """
        for i in range(len(self.vehicles)):
            self.vehicles[i].neighbors = []
            self.vehicles[i].actions = []
        z = np.array([[complex(c.position[0], c.position[1]) for c in self.vehicles]])
        Distance = abs(z.T - z)
        for i in range(len(self.vehicles)):
            sort_idx = np.argsort(Distance[:, i])
            for j in range(self.n_neighbor):
                self.vehicles[i].neighbors.append(sort_idx[j + 1])
            destination = self.vehicles[i].neighbors
            self.vehicles[i].destinations = destination
V2Vchannels internal parameters: the BS and MS antenna heights are both set to 1.5 m and the shadowing standard deviation to 3 dB, both taken from TR 36.885, Table A.1.4-1; the carrier frequency fc is 2 (in GHz) and the decorrelation distance is 10 m.
class V2Vchannels:
    # Simulator of the V2V Channels
    def __init__(self):
        self.t = 0
        self.h_bs = 1.5
        self.h_ms = 1.5
        self.fc = 2
        self.decorrelation_distance = 10
        self.shadow_std = 3
It contains two methods.
Path loss:
    def get_path_loss(self, position_A, position_B):
        d1 = abs(position_A[0] - position_B[0])
        d2 = abs(position_A[1] - position_B[1])
        d = math.hypot(d1, d2) + 0.001  # sqrt(x*x + y*y)
        d_bp = 4 * (self.h_bs - 1) * (self.h_ms - 1) * self.fc * (10 ** 9) / (3 * 10 ** 8)  # effective break-point distance

        def PL_Los(d):
            if d <= 3:
                return 22.7 * np.log10(3) + 41 + 20 * np.log10(self.fc / 5)
            else:
                if d < d_bp:
                    return 22.7 * np.log10(d) + 41 + 20 * np.log10(self.fc / 5)
                else:
                    return 40.0 * np.log10(d) + 9.45 - 17.3 * np.log10(self.h_bs) - 17.3 * np.log10(self.h_ms) + 2.7 * np.log10(self.fc / 5)

        def PL_NLos(d_a, d_b):
            n_j = max(2.8 - 0.0024 * d_b, 1.84)
            return PL_Los(d_a) + 20 - 12.5 * n_j + 10 * n_j * np.log10(d_b) + 3 * np.log10(self.fc / 5)

        if min(d1, d2) < 7:
            PL = PL_Los(d)
        else:
            PL = min(PL_NLos(d1, d2), PL_NLos(d2, d1))
        return PL  # + self.shadow_std * np.random.normal()
Note: the code above follows the stochastic channel model described in [2], p. 328.
The path loss uses the Manhattan-grid LOS model (WINNER II B1):

$$PL_{LOS}(d) = 22.7\log_{10}(d) + 41 + 20\log_{10}(f_c/5), \quad 3\,\mathrm{m} < d < d'_{BP}$$

$$PL_{LOS}(d) = 40\log_{10}(d) + 9.45 - 17.3\log_{10}(h'_{BS}) - 17.3\log_{10}(h'_{MS}) + 2.7\log_{10}(f_c/5), \quad d'_{BP} < d < 5\,\mathrm{km}$$

(the code clamps distances below 3 m to 3 m). The path-loss exponents before and after the break point are thus n_1 = 2.27 (the 22.7 coefficient) and n_2 = 4.0, and d' is the effective break-point distance (d_bp in the code), computed from the effective antenna heights h' = h - 1 m:

$$d'_{BP} = 4\,h'_{BS}\,h'_{MS}\,f_c/c.$$

The Manhattan-grid NLOS model is

$$PL_{NLOS}(d_A, d_B) = PL_{LOS}(d_A) + 20 - 12.5\,n_j + 10\,n_j\log_{10}(d_B) + 3\log_{10}(f_c/5), \qquad n_j = \max(2.8 - 0.0024\,d_B,\ 1.84).$$

The min() over the two argument orders near the end of the code is described on p. 344 of [2]: it estimates the path loss when the receiver may sit on the perpendicular street, by taking the smaller of the two NLOS evaluations.
These formulas come from the IST-4-027756 WINNER II D1.1.2 V1.2 deliverable, whose parameter table matches the values used in the code exactly.
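To make the formulas concrete, here is a small self-contained sketch (my own, not from the repo) that re-implements the LOS/NLOS path loss with the same parameters as V2Vchannels (h_bs = h_ms = 1.5 m, fc = 2 GHz) and evaluates it for one example geometry; the numbers are only illustrative.

import math
import numpy as np

# Standalone re-implementation of the WINNER II B1 path loss used in V2Vchannels.
h_bs = h_ms = 1.5
fc = 2.0  # GHz

# Effective break-point distance d'_BP = 4 (h_bs - 1)(h_ms - 1) fc / c
d_bp = 4 * (h_bs - 1) * (h_ms - 1) * fc * 1e9 / 3e8  # ~ 6.67 m

def pl_los(d):
    if d <= 3:
        return 22.7 * np.log10(3) + 41 + 20 * np.log10(fc / 5)
    if d < d_bp:
        return 22.7 * np.log10(d) + 41 + 20 * np.log10(fc / 5)
    return 40.0 * np.log10(d) + 9.45 - 17.3 * np.log10(h_bs) - 17.3 * np.log10(h_ms) + 2.7 * np.log10(fc / 5)

def pl_nlos(d_a, d_b):
    n_j = max(2.8 - 0.0024 * d_b, 1.84)
    return pl_los(d_a) + 20 - 12.5 * n_j + 10 * n_j * np.log10(d_b) + 3 * np.log10(fc / 5)

# Example: transmitter and receiver separated by 30 m along x and 40 m along y
d1, d2 = 30.0, 40.0
d = math.hypot(d1, d2)
print(f"d_bp = {d_bp:.2f} m")
print(f"LOS  PL({d:.0f} m) = {pl_los(d):.1f} dB")
print(f"NLOS PL = {min(pl_nlos(d1, d2), pl_nlos(d2, d1)):.1f} dB")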
Shadowing update:
    def get_shadowing(self, delta_distance, shadowing):
        return np.exp(-1 * (delta_distance / self.decorrelation_distance)) * shadowing \
               + math.sqrt(1 - np.exp(-2 * (delta_distance / self.decorrelation_distance))) * np.random.normal(0, 3)  # standard dev is 3 dB
This update rule comes from [1], the text following the channel-model table in Annex A.1.4.
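It is a first-order (exponentially correlated) update: the old shadowing value is scaled by exp(-Δd/d_corr) and an independent Gaussian innovation is added with exactly the weight that keeps the overall standard deviation at 3 dB. A quick sanity check (my own sketch, not repo code):

import numpy as np

# Iterating the V2V shadowing update many times should keep the marginal std near 3 dB.
rng = np.random.default_rng(0)
decorrelation_distance = 10.0   # m, as in V2Vchannels
shadow_std = 3.0                # dB
delta_distance = 2.5            # m moved between two slow-fading updates (illustrative)

shadowing = rng.normal(0, shadow_std, 10000)   # start from the stationary distribution
a = np.exp(-delta_distance / decorrelation_distance)
for _ in range(100):
    shadowing = a * shadowing + np.sqrt(1 - a ** 2) * rng.normal(0, shadow_std, shadowing.size)
print(np.std(shadowing))        # stays close to 3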
V2Ichannels contains the same two methods as V2Vchannels, but the path loss no longer distinguishes LOS from NLOS:
    def get_path_loss(self, position_A):
        d1 = abs(position_A[0] - self.BS_position[0])
        d2 = abs(position_A[1] - self.BS_position[1])
        distance = math.hypot(d1, d2)
        return 128.1 + 37.6 * np.log10(math.sqrt(distance ** 2 + (self.h_bs - self.h_ms) ** 2) / 1000)  # + self.shadow_std * np.random.normal()

    def get_shadowing(self, delta_distance, shadowing):
        nVeh = len(shadowing)
        self.R = np.sqrt(0.5 * np.ones([nVeh, nVeh]) + 0.5 * np.identity(nVeh))
        return np.multiply(np.exp(-1 * (delta_distance / self.Decorrelation_distance)), shadowing) \
               + np.sqrt(1 - np.exp(-2 * (delta_distance / self.Decorrelation_distance))) * np.random.normal(0, 8, nVeh)
Both methods implement the V2I model of [1], Table A.1.4-2 and the accompanying text.
Environ: initialization takes four lists with the lane positions (down_lane, up_lane, left_lane, right_lane), the map width and height, the number of vehicles, and the number of neighbors. Beyond these, the constructor defines many internal parameters, as follows:
class Environ:
    def __init__(self, down_lane, up_lane, left_lane, right_lane, width, height, n_veh, n_neighbor):
        self.V2Vchannels = V2Vchannels()
        self.V2Ichannels = V2Ichannels()
        self.vehicles = []

        self.demand = []
        self.V2V_Shadowing = []
        self.V2I_Shadowing = []
        self.delta_distance = []
        self.V2V_channels_abs = []
        self.V2I_channels_abs = []

        self.V2I_power_dB = 23  # dBm
        self.V2V_power_dB_List = [23, 15, 5, -100]  # the power levels
        self.V2I_power = 10 ** (self.V2I_power_dB)
        self.sig2_dB = -114
        self.bsAntGain = 8
        self.bsNoiseFigure = 5
        self.vehAntGain = 3
        self.vehNoiseFigure = 9
        self.sig2 = 10 ** (self.sig2_dB / 10)

        self.n_RB = n_veh
        self.n_Veh = n_veh
        self.n_neighbor = n_neighbor
        self.time_fast = 0.001
        self.time_slow = 0.1  # update slow fading/vehicle position every 100 ms
        self.bandwidth = int(1e6)  # bandwidth per RB, 1 MHz
        # self.bandwidth = 1500
        self.demand_size = int((4 * 190 + 300) * 8 * 2)  # V2V payload: 1060 Bytes every 100 ms
        # self.demand_size = 20

        self.V2V_Interference_all = np.zeros((self.n_Veh, self.n_neighbor, self.n_RB)) + self.sig2
Adding vehicles: there are two methods, add_new_vehicles (takes start position, direction, and velocity) and add_new_vehicles_by_number(n). The latter is interesting: it takes a single argument n but adds 4n vehicles, one per driving direction (up/down/left/right) in each of the n iterations, at random positions.
Updating positions: renew_positions() iterates over the vehicles and moves each one according to its direction and velocity; at an intersection the vehicle turns with a certain probability, and when it reaches the map boundary it is turned so that it stays inside the map (a simplified sketch of this update for one direction follows below).
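As a rough illustration of the idea (my own simplified sketch, not the repo's renew_positions; the real method handles all four directions, its own turning probabilities, and the boundary handling), one step of the update for a vehicle driving "up" might look like this:

import numpy as np

# Simplified sketch of one position-update step for an 'up'-driving vehicle.
# turn_prob and the left-turn sign convention are placeholder assumptions;
# the authoritative logic is Environ.renew_positions in Environment_marl.py.
def step_up(vehicle, left_lanes, time_slow, turn_prob=0.4, rng=np.random.default_rng()):
    delta = vehicle.velocity * time_slow              # distance travelled in one slow-fading slot
    new_y = vehicle.position[1] + delta
    for lane_y in left_lanes:                         # does the move cross a horizontal lane?
        if vehicle.position[1] <= lane_y < new_y and rng.random() < turn_prob:
            # turn onto that lane, keeping the overshoot as progress along the new direction
            vehicle.position = [vehicle.position[0] - (new_y - lane_y), lane_y]
            vehicle.direction = 'l'
            return
    vehicle.position = [vehicle.position[0], new_y]   # otherwise keep driving up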
Updating neighbors: renew_neighbor(), already described under Vehicle above.
Updating the channel: renew_channel() defines an important quantity, channels_abs, the sum of path loss and shadowing (in dB).
    def renew_channel(self):
        """ Renew slow fading channel """
        self.V2V_pathloss = np.zeros((len(self.vehicles), len(self.vehicles))) + 50 * np.identity(len(self.vehicles))
        self.V2I_pathloss = np.zeros((len(self.vehicles)))
        self.V2V_channels_abs = np.zeros((len(self.vehicles), len(self.vehicles)))
        self.V2I_channels_abs = np.zeros((len(self.vehicles)))
        for i in range(len(self.vehicles)):
            for j in range(i + 1, len(self.vehicles)):
                self.V2V_Shadowing[j][i] = self.V2V_Shadowing[i][j] = self.V2Vchannels.get_shadowing(self.delta_distance[i] + self.delta_distance[j], self.V2V_Shadowing[i][j])
                self.V2V_pathloss[j, i] = self.V2V_pathloss[i][j] = self.V2Vchannels.get_path_loss(self.vehicles[i].position, self.vehicles[j].position)
        self.V2V_channels_abs = self.V2V_pathloss + self.V2V_Shadowing

        self.V2I_Shadowing = self.V2Ichannels.get_shadowing(self.delta_distance, self.V2I_Shadowing)
        for i in range(len(self.vehicles)):
            self.V2I_pathloss[i] = self.V2Ichannels.get_path_loss(self.vehicles[i].position)
        self.V2I_channels_abs = self.V2I_pathloss + self.V2I_Shadowing
Fast-fading update: renew_channels_fastfading() replicates channels_abs onto every RB and then subtracts a Rayleigh fast-fading term in dB, i.e. 20·log10(|h|) with h ~ CN(0, 1):
    def renew_channels_fastfading(self):
        """ Renew fast fading channel """
        # 1 2, 3 4 --> 1 1 2 2 3 3 4 4 (replicate each element n_RB times along a new axis)
        V2V_channels_with_fastfading = np.repeat(self.V2V_channels_abs[:, :, np.newaxis], self.n_RB, axis=2)
        # A - 20 log10(|h|), h ~ CN(0, 1)
        self.V2V_channels_with_fastfading = V2V_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2V_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2V_channels_with_fastfading.shape)) / math.sqrt(2))

        V2I_channels_with_fastfading = np.repeat(self.V2I_channels_abs[:, np.newaxis], self.n_RB, axis=1)
        self.V2I_channels_with_fastfading = V2I_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2I_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2I_channels_with_fastfading.shape)) / math.sqrt(2))
Computing the reward: Compute_Performance_Reward_Train(self, actions_power). The input here is important: it is the RL action, defined in main_marl_train.py as a three-dimensional array of shape (vehicle, neighbor, 2), i.e. one layer per vehicle, one row per neighbor, and two columns holding the RB choice (RB index) and the power choice (index into V2V_power_dB_List), as shown below:
for i in range(n_veh):
    for j in range(n_neighbor):
        state_old = get_state(env, [i, j], 1, epsi_final)
        action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
        action_all_testing[i, j, 0] = action % n_RB  # chosen RB
        action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level
The computation proceeds in the following steps: (1) for every RB, accumulate the interference that the active V2V links sharing it cause to the V2I link on that RB; (2) compute the V2I SINR and rate; (3) for every RB, compute each sharing V2V link's received signal, the interference it receives from the V2I link, and the mutual interference between V2V links on the same RB; (4) compute the V2V rates; (5) decrement the remaining payload (demand) and the remaining time budget; (6) build reward_elements (V2V_Rate/10 for unfinished links, 1 for links whose payload is delivered) and deactivate the finished links. The code:
    def Compute_Performance_Reward_Train(self, actions_power):
        actions = actions_power[:, :, 0]  # the channel_selection_part
        power_selection = actions_power[:, :, 1]  # power selection

        # ------------ Compute V2I rate --------------------
        V2I_Rate = np.zeros(self.n_RB)
        V2I_Interference = np.zeros(self.n_RB)  # V2I interference
        for i in range(len(self.vehicles)):
            for j in range(self.n_neighbor):
                if not self.active_links[i, j]:
                    continue
                V2I_Interference[actions[i][j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]] - self.V2I_channels_with_fastfading[i, actions[i, j]]
                                                           + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        self.V2I_Interference = V2I_Interference + self.sig2
        V2I_Signals = 10 ** ((self.V2I_power_dB - self.V2I_channels_with_fastfading.diagonal() + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        V2I_Rate = np.log2(1 + np.divide(V2I_Signals, self.V2I_Interference))  # V2I channel capacity

        # ------------ Compute V2V rate -------------------------
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor))
        V2V_Signal = np.zeros((len(self.vehicles), self.n_neighbor))
        actions[(np.logical_not(self.active_links))] = -1  # inactive links will not transmit regardless of selected power levels
        for i in range(self.n_RB):  # scanning all bands
            indexes = np.argwhere(actions == i)  # find spectrum-sharing V2Vs
            for j in range(len(indexes)):
                receiver_j = self.vehicles[indexes[j, 0]].destinations[indexes[j, 1]]
                V2V_Signal[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                   - self.V2V_channels_with_fastfading[indexes[j][0], receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                # V2I links interference to V2V links
                V2V_Interference[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i, receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                # V2V interference
                for k in range(j + 1, len(indexes)):  # spectrum-sharing V2Vs
                    receiver_k = self.vehicles[indexes[k][0]].destinations[indexes[k][1]]
                    V2V_Interference[indexes[j, 0], indexes[j, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[k, 0], indexes[k, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[k][0]][receiver_j][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                    V2V_Interference[indexes[k, 0], indexes[k, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[j][0]][receiver_k][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference = V2V_Interference + self.sig2
        V2V_Rate = np.log2(1 + np.divide(V2V_Signal, self.V2V_Interference))

        self.demand -= V2V_Rate * self.time_fast * self.bandwidth
        self.demand[self.demand < 0] = 0  # eliminate negative demands

        self.individual_time_limit -= self.time_fast

        reward_elements = V2V_Rate / 10
        reward_elements[self.demand <= 0] = 1

        self.active_links[np.multiply(self.active_links, self.demand <= 0)] = 0  # transmission finished, turned to "inactive"

        return V2I_Rate, V2V_Rate, reward_elements
Note: the method returns three values, and the last one is not yet the final reward; the final reward is a weighted combination of these quantities.
Training step: act_for_training(self, actions) takes the actions, calls Compute_Performance_Reward_Train, and combines the results into the final reward:
    def act_for_training(self, actions):
        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
        lambdda = 0.
        reward = lambdda * np.sum(V2I_Rate) / (self.n_Veh * 10) + (1 - lambdda) * np.sum(reward_elements) / (self.n_Veh * self.n_neighbor)
        return reward
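Written out, the reward returned by act_for_training (read directly off the code above) is

$$r = \lambda \cdot \frac{\sum_m C^{V2I}_m}{10\,N_{veh}} + (1-\lambda)\cdot\frac{\sum_k L_k}{N_{veh}\,N_{neighbor}},$$

where the $C^{V2I}_m$ are the V2I rates, the $L_k$ are the reward_elements (V2V rate / 10, or 1 once a link's payload is delivered), and $\lambda$ is lambdda. With lambdda = 0, as in the code, only the V2V part matters; choosing a value in (0, 1) restores the weighted V2I/V2V trade-off described in the paper.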
Testing step: act_for_testing(self, actions) is similar and also calls Compute_Performance_Reward_Train, but it returns V2I_rate, V2V_success (the fraction of V2V links that have delivered their payload), and V2V_rate.
    def act_for_testing(self, actions):
        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
        V2V_success = 1 - np.sum(self.active_links) / (self.n_Veh * self.n_neighbor)  # V2V success rates
        return V2I_Rate, V2V_success, V2V_Rate
These three quantities are produced for every single step within an episode; this can be seen in the testing part of main_marl_train.py, an excerpt of which follows:
for test_step in range(n_step_per_episode):
    # trained models
    action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
    for i in range(n_veh):
        for j in range(n_neighbor):
            state_old = get_state(env, [i, j], 1, epsi_final)
            action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
            action_all_testing[i, j, 0] = action % n_RB  # chosen RB
            action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

    action_temp = action_all_testing.copy()
    V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
    V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps
    rate_marl[idx_episode, test_step, :, :] = V2V_rate
    demand_marl[idx_episode, test_step + 1, :, :] = env.demand
Computing interference: Compute_Interference(self, actions) accumulates V2V_Interference_all with +=, as follows:
    def Compute_Interference(self, actions):
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor, self.n_RB)) + self.sig2

        channel_selection = actions.copy()[:, :, 0]  # column 0 of every layer: chosen RB
        power_selection = actions.copy()[:, :, 1]    # column 1 of every layer: power level
        channel_selection[np.logical_not(self.active_links)] = -1  # mark inactive links with -1

        # interference from V2I links
        for i in range(self.n_RB):
            for k in range(len(self.vehicles)):
                for m in range(len(channel_selection[k, :])):
                    V2V_Interference[k, m, i] += 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)

        # interference from peer V2V links
        for i in range(len(self.vehicles)):
            for j in range(len(channel_selection[i, :])):
                for k in range(len(self.vehicles)):
                    for m in range(len(channel_selection[k, :])):
                        # if i == k or channel_selection[i, j] >= 0:
                        if i == k and j == m or channel_selection[i, j] < 0:
                            continue
                        V2V_Interference[k, m, channel_selection[i, j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]]
                                                                                   - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][channel_selection[i, j]] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference_all = 10 * np.log10(V2V_Interference)
It is used in get_state of main_marl_train.py to form the V2V interference component of the state:
def get_state(env, idx=(0, 0), ind_episode=1., epsi=0.02):
    """ Get state from the environment """
    # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel gains (PL + shadowing),
    # remaining time, remaining payload
    # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
    V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10) / 35
    # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
    V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10) / 35
    V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60
    V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
    V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80) / 60.0
    load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
    time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])
    # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    # all quantities of interest appear here: V2V_fast, V2I_fast, V2V_interference, V2I_abs, V2V_abs
Some readers may wonder why V2V interference is computed again here when it was already needed for V2V_rate. The earlier computation inside Compute_Performance_Reward_Train only produces, for each link, the interference on the RB it actually selected; Compute_Interference instead iterates over every vehicle pair and records the aggregate interference each receiver would experience on every RB, which is what the state observation needs.
The following comes from replay_memory.py, which defines a single class, ReplayMemory. Note that every agent owns its own memory, as can be seen in class Agent in main_marl_train.py:
class Agent(object):
    def __init__(self, memory_entry_size):
        self.discount = 1
        self.double_q = True
        self.memory_entry_size = memory_entry_size
        self.memory = ReplayMemory(self.memory_entry_size)
Initialization takes entry_size, the length of one stored state vector (the capacity itself is fixed at memory_size = 200000):
class ReplayMemory:
    def __init__(self, entry_size):
        self.entry_size = entry_size
        self.memory_size = 200000
        self.actions = np.empty(self.memory_size, dtype=np.uint8)
        self.rewards = np.empty(self.memory_size, dtype=np.float64)
        self.prestate = np.empty((self.memory_size, self.entry_size), dtype=np.float16)
        self.poststate = np.empty((self.memory_size, self.entry_size), dtype=np.float16)
        self.batch_size = 2000
        self.count = 0
        self.current = 0
Adding a transition: add(self, prestate, poststate, reward, action) stores (previous state, next state, reward, action):
    def add(self, prestate, poststate, reward, action):
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.prestate[self.current] = prestate
        self.poststate[self.current] = poststate
        self.count = max(self.count, self.current + 1)
        self.current = (self.current + 1) % self.memory_size
Each agent records its own transition at every time step. The use of add can be seen in the training part of main_marl_train.py; the for loop below sits inside an outer loop over episodes, so in every step of every episode each agent appends one transition (last line):
for i_step in range(n_step_per_episode):  # n_step_per_episode = 0.1 / 0.001 = 100
    time_step = i_episode * n_step_per_episode + i_step  # global step counter
    state_old_all = []
    action_all = []
    action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
    for i in range(n_veh):
        for j in range(n_neighbor):
            state = get_state(env, [i, j], i_episode / (n_episode - 1), epsi)
            state_old_all.append(state)
            action = predict(sesses[i*n_neighbor+j], state, epsi)
            action_all.append(action)
            action_all_training[i, j, 0] = action % n_RB  # chosen RB
            action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level

    # All agents take actions simultaneously, obtain shared reward, and update the environment.
    action_temp = action_all_training.copy()
    train_reward = env.act_for_training(action_temp)
    record_reward[time_step] = train_reward

    env.renew_channels_fastfading()
    env.Compute_Interference(action_temp)

    for i in range(n_veh):
        for j in range(n_neighbor):
            state_old = state_old_all[n_neighbor * i + j]
            action = action_all[n_neighbor * i + j]
            state_new = get_state(env, [i, j], i_episode / (n_episode - 1), epsi)
            agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory
Sampling: sample(). After many calls to add, each agent's memory holds many transitions; at training time a batch of up to batch_size of them is drawn:
    def sample(self):
        if self.count < self.batch_size:
            indexes = range(0, self.count)
        else:
            indexes = random.sample(range(0, self.count), self.batch_size)
        prestate = self.prestate[indexes]
        poststate = self.poststate[indexes]
        actions = self.actions[indexes]
        rewards = self.rewards[indexes]
        return prestate, poststate, actions, rewards
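A minimal usage sketch (my own, assuming replay_memory.py from the repo is importable; the entry size 33 is the state length of the default 4-vehicle / 1-neighbor setup, counted in the get_state section below):

import numpy as np
from replay_memory import ReplayMemory   # the repo file discussed above

entry_size = 33                  # state dimension for the default setup (see get_state below)
mem = ReplayMemory(entry_size)
for t in range(10):              # store a few dummy transitions
    s_old = np.random.rand(entry_size)
    s_new = np.random.rand(entry_size)
    mem.add(s_old, s_new, reward=0.5, action=3)

prestate, poststate, actions, rewards = mem.sample()
print(prestate.shape, actions.shape)   # (10, 33) (10,): fewer than batch_size entries, so all are returned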
Class Agent: Agent(object) takes memory_entry_size and stores a few algorithm parameters; note that its memory is a ReplayMemory instance, introduced above.
class Agent(object):
    def __init__(self, memory_entry_size):
        self.discount = 1
        self.double_q = True
        self.memory_entry_size = memory_entry_size
        self.memory = ReplayMemory(self.memory_entry_size)
Parameter initialization: this part is written directly at module level, not in a function. It covers the map layout (lane coordinates, map size), the numbers of vehicles, neighbors, RBs and episodes, plus a few algorithm parameters:
To understand the map parameters up_lanes / down_lanes / left_lanes / right_lanes, note that the system model follows the urban case of 3GPP TR 36.885: every street has four lanes (two per direction), each lane is 3.5 m wide, the road grid measured between the yellow lines is 433 m x 250 m, and the overall simulation area is 1299 m x 750 m. The simulation shrinks everything by a factor of 2 (hence width and height are divided by 2), which shows up in the lane lists as the i / 2.0.
Take up_lanes as an example. A lane is 3.5 m wide and vehicles are treated as points moving along the lane centre, so the first entry inside the brackets is 3.5/2; the second entry (3.5 + 3.5/2) is the centre of the second lane in the same direction; the third entry (+250) is the first same-direction lane on the other side of the building block, and so on. (A small numerical check follows right after the parameter listing below.)
up_lanes = [i/2.0 for i in [3.5/2, 3.5 + 3.5/2, 250 + 3.5/2, 250 + 3.5 + 3.5/2, 500 + 3.5/2, 500 + 3.5 + 3.5/2]]
down_lanes = [i/2.0 for i in [250 - 3.5 - 3.5/2, 250 - 3.5/2, 500 - 3.5 - 3.5/2, 500 - 3.5/2, 750 - 3.5 - 3.5/2, 750 - 3.5/2]]
left_lanes = [i/2.0 for i in [3.5/2, 3.5/2 + 3.5, 433 + 3.5/2, 433 + 3.5 + 3.5/2, 866 + 3.5/2, 866 + 3.5 + 3.5/2]]
right_lanes = [i/2.0 for i in [433 - 3.5 - 3.5/2, 433 - 3.5/2, 866 - 3.5 - 3.5/2, 866 - 3.5/2, 1299 - 3.5 - 3.5/2, 1299 - 3.5/2]]

width = 750 / 2
height = 1298 / 2

IS_TRAIN = 1
IS_TEST = 1 - IS_TRAIN

label = 'marl_model'
n_veh = 4
n_neighbor = 1
n_RB = n_veh

env = Environment_marl.Environ(down_lanes, up_lanes, left_lanes, right_lanes, width, height, n_veh, n_neighbor)
env.new_random_game()  # initialize parameters in env

# n_episode = 3000
n_episode = 600
n_step_per_episode = int(env.time_slow / env.time_fast)  # slow = 0.1, fast = 0.001
epsi_final = 0.02
epsi_anneal_length = int(0.8 * n_episode)
mini_batch_step = n_step_per_episode
target_update_step = n_step_per_episode * 4

n_episode_test = 100  # test episodes
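A quick check of the lane geometry described before the listing: evaluating up_lanes reproduces the (halved) x-coordinates of the centres of the upward lanes.

up_lanes = [i / 2.0 for i in [3.5/2, 3.5 + 3.5/2, 250 + 3.5/2, 250 + 3.5 + 3.5/2, 500 + 3.5/2, 500 + 3.5 + 3.5/2]]
print(up_lanes)   # [0.875, 2.625, 125.875, 127.625, 250.875, 252.625]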
Getting the state: get_state(env, idx=(0,0), ind_episode=1., epsi=0.02) reads from env and returns the concatenation of the V2I/V2V fast fading, the V2V interference, the V2I/V2V large-scale channel gains (PL + shadowing), the remaining time, the remaining payload, and finally the episode index and the current epsilon:
def get_state(env, idx=(0, 0), ind_episode=1., epsi=0.02):
    """ Get state from the environment """
    # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel gains (PL + shadowing),
    # remaining time, remaining payload
    # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
    V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10) / 35
    # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
    V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10) / 35
    V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60
    V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
    V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80) / 60.0
    load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
    time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])
    # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
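For the default configuration (n_veh = 4, n_RB = 4, n_neighbor = 1), the length of this state vector can be counted directly from the concatenation above (my own bookkeeping, not repo code); the size of the action space follows from predict below:

n_veh, n_RB = 4, 4
n_input = (
    n_RB            # V2I_fast
    + n_veh * n_RB  # V2V_fast, flattened
    + n_RB          # V2V_interference
    + 1             # V2I_abs
    + n_veh         # V2V_abs
    + 1 + 1         # time_remaining, load_remaining
    + 2             # ind_episode, epsi
)
print(n_input)       # 33
n_output = n_RB * 4  # 4 entries in V2V_power_dB_List -> 16 discrete actions per agent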
Defining the NN:
with g.as_default():
    # ============== Training network ========================
    x = tf.placeholder(tf.float32, [None, n_input])  # input: state
    w_1 = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
    w_2 = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
    w_3 = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
    w_4 = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
    b_1 = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
    b_2 = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
    b_3 = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
    b_4 = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, w_1), b_1))
    layer_1_b = tf.layers.batch_normalization(layer_1)
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1_b, w_2), b_2))
    layer_2_b = tf.layers.batch_normalization(layer_2)
    layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2_b, w_3), b_3))
    layer_3_b = tf.layers.batch_normalization(layer_3)
    y = tf.nn.relu(tf.add(tf.matmul(layer_3_b, w_4), b_4))
    g_q_action = tf.argmax(y, axis=1)

    # compute loss
    g_target_q_t = tf.placeholder(tf.float32, None, name="target_value")
    g_action = tf.placeholder(tf.int32, None, name='g_action')
    action_one_hot = tf.one_hot(g_action, n_output, 1.0, 0.0, name='action_one_hot')
    q_acted = tf.reduce_sum(y * action_one_hot, reduction_indices=1, name='q_acted')
    g_loss = tf.reduce_mean(tf.square(g_target_q_t - q_acted), name='g_loss')  # squared TD error
    optim = tf.train.RMSPropOptimizer(learning_rate=0.001, momentum=0.95, epsilon=0.01).minimize(g_loss)  # gradient descent

    # ==================== Prediction network ========================
    x_p = tf.placeholder(tf.float32, [None, n_input])  # input: next state
    w_1_p = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
    w_2_p = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
    w_3_p = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
    w_4_p = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))
    b_1_p = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
    b_2_p = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
    b_3_p = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
    b_4_p = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

    layer_1_p = tf.nn.relu(tf.add(tf.matmul(x_p, w_1_p), b_1_p))
    layer_1_p_b = tf.layers.batch_normalization(layer_1_p)
    layer_2_p = tf.nn.relu(tf.add(tf.matmul(layer_1_p_b, w_2_p), b_2_p))
    layer_2_p_b = tf.layers.batch_normalization(layer_2_p)
    layer_3_p = tf.nn.relu(tf.add(tf.matmul(layer_2_p_b, w_3_p), b_3_p))
    layer_3_p_b = tf.layers.batch_normalization(layer_3_p)
    y_p = tf.nn.relu(tf.add(tf.matmul(layer_3_p_b, w_4_p), b_4_p))

    g_target_q_idx = tf.placeholder('int32', [None, None], 'output_idx')  # input: an (n, 2) list of (batch index, action) pairs
    target_q_with_idx = tf.gather_nd(y_p, g_target_q_idx)  # gather the selected entries of y_p

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()
Only the overall structure is described here; for details see the "Sampling and computing the loss" part below, where the network structure is explained together with the algorithm.
Overall there are three parts: the training network, the loss computation, and the prediction (target) network, denoted N1, N2, and N3. N1 and N3 have identical structure; they are the DQN Q-networks that output Q-values. The difference is that N1 is updated at every training step while N3 is refreshed only once in a while. N2 takes N1's output and the target value and produces the loss that drives N1's updates.
Prediction: predict(sess, s_t, ep, test_ep=False) drives the NN (epsilon-greedily during training) and produces the action:
def predict(sess, s_t, ep, test_ep=False):
    n_power_levels = len(env.V2V_power_dB_List)
    if np.random.rand() < ep and not test_ep:
        pred_action = np.random.randint(n_RB * n_power_levels)
    else:
        pred_action = sess.run(g_q_action, feed_dict={x: [s_t]})[0]
    return pred_action
The returned action is a single int, but it encodes both the RB and the power level. It is decoded the same way in both the training and the testing loops:
action = predict(sesses[i*n_neighbor+j], state, epsi)
action_all.append(action)
action_all_training[i, j, 0] = action % n_RB  # chosen RB
action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level
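The mapping between the flat integer and the (RB, power level) pair is a simple base-n_RB decomposition; a tiny round-trip check (my own sketch):

n_RB = 4
n_power_levels = 4                       # len(env.V2V_power_dB_List)
for action in range(n_RB * n_power_levels):
    rb = action % n_RB                   # chosen RB
    power_level = action // n_RB         # index into V2V_power_dB_List
    assert power_level * n_RB + rb == action   # the mapping is invertible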
Sampling and computing the loss: q_learning_mini_batch(current_agent, current_sess) takes a single agent, uses the memory's sample method described above, and is also where double Q-learning is switched on:
def q_learning_mini_batch(current_agent, current_sess):
    """ Training a sampled mini-batch """
    batch_s_t, batch_s_t_plus_1, batch_action, batch_reward = current_agent.memory.sample()

    if current_agent.double_q:  # double q-learning
        pred_action = current_sess.run(g_q_action, feed_dict={x: batch_s_t_plus_1})
        q_t_plus_1 = current_sess.run(target_q_with_idx, {x_p: batch_s_t_plus_1, g_target_q_idx: [[idx, pred_a] for idx, pred_a in enumerate(pred_action)]})
        batch_target_q_t = current_agent.discount * q_t_plus_1 + batch_reward
    else:
        q_t_plus_1 = current_sess.run(y_p, {x_p: batch_s_t_plus_1})
        max_q_t_plus_1 = np.max(q_t_plus_1, axis=1)
        batch_target_q_t = current_agent.discount * max_q_t_plus_1 + batch_reward

    _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})
    return loss_val
Update (Apr. 23): this function is best read together with the network structure; it is somewhat involved. The if branch distinguishes plain DQN from double Q-learning, and note that both branches only compute the target value (the target-network side); the training network's forward pass and its update are performed by the final line:

    _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})

This is easiest to follow against the algorithm diagram and the code flowchart (drawn by the author in Visio, not in any standard notation; corrections welcome).
Plain DQN
Double DQN
The difference lies in how the target is formed: plain DQN builds the target directly from the prediction network (the "predict / refreshed once in a while" box in the figure) followed by a max, whereas double DQN cascades the training network and the prediction network: the training network picks the action and the prediction network evaluates it.
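In formulas (matching q_learning_mini_batch above; discount = 1 in the code plays the role of $\gamma$), the plain-DQN target is

$$y = r + \gamma \max_{a'} Q_{target}(s', a'),$$

while the double-DQN target is

$$y = r + \gamma\, Q_{target}\big(s',\ \arg\max_{a'} Q_{train}(s', a')\big),$$

where $Q_{train}$ is the training network (y in the graph) and $Q_{target}$ is the prediction/target network (y_p), refreshed only every target_update_step steps.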
Training loop
for each episode:
    initialize state_old_all, action_all, action_all_training
    for each agent, obtain its action via predict (encodes both RB and power)
    decode the actions into action_all_training = [vehicle, neighbor, RB/power]
    obtain the shared reward via act_for_training
    append the reward to record_reward
    renew the fast fading
    compute the interference for the chosen actions
    for every agent, compute the new state and add (state_old, state_new, train_reward, action) to its memory [last line of the inner loop]
    every mini_batch_step steps: run q_learning_mini_batch to obtain the loss
    every target_update_step steps: update the target Q-network
record_reward = np.zeros([n_episode * n_step_per_episode, 1])
record_loss = []

if IS_TRAIN:
    for i_episode in range(n_episode):
        print("-------------------------")
        print('Episode:', i_episode)
        if i_episode < epsi_anneal_length:
            epsi = 1 - i_episode * (1 - epsi_final) / (epsi_anneal_length - 1)  # epsilon decreases over each episode
        else:
            epsi = epsi_final

        # every 100 episodes: renew positions, neighbors, slow fading, and fast fading
        if i_episode % 100 == 0:
            env.renew_positions()  # update vehicle position
            env.renew_neighbor()
            env.renew_channel()  # update channel slow fading
            env.renew_channels_fastfading()  # update channel fast fading

        env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        for i_step in range(n_step_per_episode):  # n_step_per_episode = 0.1 / 0.001 = 100
            time_step = i_episode * n_step_per_episode + i_step  # global step counter
            state_old_all = []
            action_all = []
            action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state = get_state(env, [i, j], i_episode / (n_episode - 1), epsi)
                    state_old_all.append(state)
                    action = predict(sesses[i*n_neighbor+j], state, epsi)
                    action_all.append(action)
                    action_all_training[i, j, 0] = action % n_RB  # chosen RB
                    action_all_training[i, j, 1] = int(np.floor(action / n_RB))  # power level

            # All agents take actions simultaneously, obtain shared reward, and update the environment.
            action_temp = action_all_training.copy()
            train_reward = env.act_for_training(action_temp)
            record_reward[time_step] = train_reward

            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = state_old_all[n_neighbor * i + j]
                    action = action_all[n_neighbor * i + j]
                    state_new = get_state(env, [i, j], i_episode / (n_episode - 1), epsi)
                    agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory

                    # training this agent
                    if time_step % mini_batch_step == mini_batch_step - 1:
                        loss_val_batch = q_learning_mini_batch(agents[i*n_neighbor+j], sesses[i*n_neighbor+j])
                        record_loss.append(loss_val_batch)
                        if i == 0 and j == 0:
                            print('step:', time_step, 'agent', i*n_neighbor+j, 'loss', loss_val_batch)
                    if time_step % target_update_step == target_update_step - 1:
                        update_target_q_network(sesses[i*n_neighbor+j])
                        if i == 0 and j == 0:
                            print('Update target Q network...')

    print('Training Done. Saving models...')
    for i in range(n_veh):
        for j in range(n_neighbor):
            model_path = label + '/agent_' + str(i * n_neighbor + j)
            save_models(sesses[i * n_neighbor + j], model_path)

    current_dir = os.path.dirname(os.path.realpath(__file__))
    reward_path = os.path.join(current_dir, "model/" + label + '/reward.mat')
    scipy.io.savemat(reward_path, {'reward': record_reward})
    record_loss = np.asarray(record_loss).reshape((-1, n_veh*n_neighbor))
    loss_path = os.path.join(current_dir, "model/" + label + '/train_loss.mat')
    scipy.io.savemat(loss_path, {'train_loss': record_loss})
Testing loop
First load the models saved during training.
for each test episode:
    initialize state_old_all, action_all, action_all_testing
    for each agent, obtain its action via predict (encodes both RB and power)
    decode the actions into action_all_testing = [vehicle, neighbor, RB/power]
    obtain V2I_rate, V2V_success, V2V_rate via act_for_testing
    sum V2I_rate and append it to V2I_rate_per_episode
    store V2V_rate into rate_marl
    update demand
if IS_TEST:
    print("\nRestoring the model...")
    for i in range(n_veh):
        for j in range(n_neighbor):
            model_path = label + '/agent_' + str(i * n_neighbor + j)
            load_models(sesses[i * n_neighbor + j], model_path)

    V2I_rate_list = []
    V2V_success_list = []
    V2I_rate_list_rand = []
    V2V_success_list_rand = []
    rate_marl = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
    rate_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
    demand_marl = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
    demand_rand = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
    power_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])

    for idx_episode in range(n_episode_test):
        print('----- Episode', idx_episode, '-----')

        env.renew_positions()
        env.renew_neighbor()
        env.renew_channel()
        env.renew_channels_fastfading()

        env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        env.demand_rand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit_rand = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links_rand = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        V2I_rate_per_episode = []
        V2I_rate_per_episode_rand = []

        for test_step in range(n_step_per_episode):
            # trained models
            action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = get_state(env, [i, j], 1, epsi_final)
                    action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                    action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                    action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

            action_temp = action_all_testing.copy()
            V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
            V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps
            rate_marl[idx_episode, test_step, :, :] = V2V_rate
            demand_marl[idx_episode, test_step+1, :, :] = env.demand

            # random baseline
            action_rand = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            action_rand[:, :, 0] = np.random.randint(0, n_RB, [n_veh, n_neighbor])  # band
            action_rand[:, :, 1] = np.random.randint(0, len(env.V2V_power_dB_List), [n_veh, n_neighbor])  # power

            V2I_rate_rand, V2V_success_rand, V2V_rate_rand = env.act_for_testing_rand(action_rand)
            V2I_rate_per_episode_rand.append(np.sum(V2I_rate_rand))  # sum V2I rate in bps
            rate_rand[idx_episode, test_step, :, :] = V2V_rate_rand
            demand_rand[idx_episode, test_step+1, :, :] = env.demand_rand

            for i in range(n_veh):
                for j in range(n_neighbor):
                    power_rand[idx_episode, test_step, i, j] = env.V2V_power_dB_List[int(action_rand[i, j, 1])]

            # update the environment and compute interference
            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            if test_step == n_step_per_episode - 1:
                V2V_success_list.append(V2V_success)
                V2V_success_list_rand.append(V2V_success_rand)

        V2I_rate_list.append(np.mean(V2I_rate_per_episode))
        V2I_rate_list_rand.append(np.mean(V2I_rate_per_episode_rand))

        print(round(np.average(V2I_rate_per_episode), 2), 'rand', round(np.average(V2I_rate_per_episode_rand), 2))
        print(V2V_success_list[idx_episode], 'rand', V2V_success_list_rand[idx_episode])
[1] 3GPP TR 36.885 technical report.
[2] "5G Mobile Communication Technology" (《5G移动通信技术》).