[Code Walkthrough] Spectrum Sharing in Vehicular Networks Based on Multi-Agent RL: Python Implementation

Original paper: Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning

Paper translation & commentary: [Paper Notes] Spectrum Sharing in Vehicular Networks Based on Multi-Agent Reinforcement Learning

Code: https://github.com/le-liang/MARLspectrumSharingV2X

Visio flowcharts used in this post (drawn by the author; corrections and discussion are welcome): https://download.csdn.net/download/m0_37495408/12353933


How to use:

(from the original authors' GitHub README)

  • To train the multi-agent RL model: main_marl_train.py + Environment_marl.py + replay_memory.py
  • To train the benchmark single-agent RL model: main_sarl_train.py + Environment_marl.py + replay_memory.py
  • To test all models in the same environment: main_test.py + Environment_marl_test.py + replay_memory.py + '/model'
    • Figures 3 and 4 in the paper can be reproduced directly by running main_test.py. Change the V2V payload size via self.demand_size in Environment_marl_test.py.
    • Figure 5 can only be obtained from the returns recorded during training.
    • Figures 6-7 show the performance of an arbitrary episode (one in which the random baseline fails while the MARL transmission succeeds). In fact, most such episodes exhibit some interesting behavior suggesting multi-agent cooperation; interpretation is left to the reader.
    • The "test" mode in main_marl_train.py is not recommended.

Basic class definitions

Environment_marl.py defines the four basic classes of the framework: V2Vchannels, V2Ichannels, Vehicle, and Environ. Environ has by far the most methods; Vehicle has no methods, only a few attributes; the other two classes each have two methods (one computing path loss, one computing shadowing).

Vehicle

The constructor takes three arguments: start position, start direction, and velocity. Internally it defines two lists, neighbors and destinations, which hold the neighbors and the V2V receivers respectively (numerically identical here, since each V2V link is set up toward a neighbor).

class Vehicle:
    # Vehicle simulator: include all the information for a vehicle

    def __init__(self, start_position, start_direction, velocity):
        self.position = start_position
        self.direction = start_direction
        self.velocity = velocity
        self.neighbors = []
        self.destinations = []

The meaning of destinations can be seen from the code below.

    def renew_neighbor(self):  # note: this method belongs to class Environ
        """ Determine the neighbors of each vehicles """

        for i in range(len(self.vehicles)):
            self.vehicles[i].neighbors = []
            self.vehicles[i].actions = []
        z = np.array([[complex(c.position[0], c.position[1]) for c in self.vehicles]])
        Distance = abs(z.T - z)

        for i in range(len(self.vehicles)):
            sort_idx = np.argsort(Distance[:, i])
            for j in range(self.n_neighbor):
                self.vehicles[i].neighbors.append(sort_idx[j + 1])
            destination = self.vehicles[i].neighbors

            self.vehicles[i].destinations = destination

V2Vchannels

Internal parameters: the BS and MS antenna heights are both set to 1.5 m and the shadowing standard deviation to 3 dB, both taken from TR 36.885 Table A.1.4-1; the carrier frequency is 2 GHz.

class V2Vchannels:
    # Simulator of the V2V Channels

    def __init__(self):
        self.t = 0
        self.h_bs = 1.5
        self.h_ms = 1.5
        self.fc = 2
        self.decorrelation_distance = 10
        self.shadow_std = 3

It contains two methods:

Computing the path loss

    def get_path_loss(self, position_A, position_B):
        d1 = abs(position_A[0] - position_B[0])
        d2 = abs(position_A[1] - position_B[1])
        d = math.hypot(d1, d2) + 0.001  # sqrt(x*x + y*y)
        # the next line defines the effective breakpoint (BP) distance
        d_bp = 4 * (self.h_bs - 1) * (self.h_ms - 1) * self.fc * (10 ** 9) / (3 * 10 ** 8)

        def PL_Los(d):
            if d <= 3:
                return 22.7 * np.log10(3) + 41 + 20 * np.log10(self.fc / 5)
            else:
                if d < d_bp:
                    return 22.7 * np.log10(d) + 41 + 20 * np.log10(self.fc / 5)
                else:
                    return 40.0 * np.log10(d) + 9.45 - 17.3 * np.log10(self.h_bs) - 17.3 * np.log10(self.h_ms) + 2.7 * np.log10(self.fc / 5)

        def PL_NLos(d_a, d_b):
            n_j = max(2.8 - 0.0024 * d_b, 1.84)
            return PL_Los(d_a) + 20 - 12.5 * n_j + 10 * n_j * np.log10(d_b) + 3 * np.log10(self.fc / 5)

        if min(d1, d2) < 7:
            PL = PL_Los(d)
        else:
            PL = min(PL_NLos(d1, d2), PL_NLos(d2, d1))
        return PL  # + self.shadow_std * np.random.normal()

Note: the code above follows the stochastic channel model described in [2], p. 328.

The path loss uses the Manhattan-grid-layout LOS model:

$$PL_{\mathrm{LOS}}(d)\big|_{\mathrm{dB}} = 10\,n_1\lg\!\left(\frac{d}{1\,\mathrm{m}}\right) + 28.0 + 20\lg\!\left(\frac{f_c}{1\,\mathrm{GHz}}\right) + PL_1\big|_{\mathrm{dB}},\qquad 10\,\mathrm{m} < d < d'_{BP}$$

and

$$PL_{\mathrm{LOS}}(d)\big|_{\mathrm{dB}} = 10\,n_2\lg\!\left(\frac{d}{d'_{BP}}\right) + PL_{\mathrm{LOS}}(d'_{BP})\big|_{\mathrm{dB}},\qquad d'_{BP} < d < 500\,\mathrm{m}$$

where $n_1 = 2.2$ and $n_2 = 4.0$ are the path-loss exponents before and after the breakpoint, and $d'_{BP}$ is the effective breakpoint distance, written as d_bp in the code.

Manhattan-grid-layout NLOS model:

$$PL_{\mathrm{NLOS}} = PL_{\mathrm{LOS}}(d_1)\big|_{\mathrm{dB}} + 17.9 - 12.5\,n_j + 10\,n_j\lg\!\left(\frac{d_2}{1\,\mathrm{m}}\right) + 3\lg\!\left(\frac{f_c}{1\,\mathrm{GHz}}\right) + PL_2\big|_{\mathrm{dB}}$$

The min over the two NLOS evaluations in the latter half of the code is described on p. 344 of [2]; it estimates the path loss under the assumption that the receiver may lie on the perpendicular street.
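As a quick standalone sanity check of the branch structure above (illustrative only, not part of the repo; parameters copied from V2Vchannels), the effective breakpoint distance for h_bs = h_ms = 1.5 m and f_c = 2 GHz is only about 6.7 m, so most V2V distances fall into the second LOS branch:

import math

# Re-computation of the LOS branch above, for illustration.
h_bs = h_ms = 1.5   # antenna heights (m), as in V2Vchannels
fc = 2.0            # carrier frequency (GHz)

d_bp = 4 * (h_bs - 1) * (h_ms - 1) * fc * 1e9 / 3e8  # effective breakpoint distance

def pl_los(d):
    if d <= 3:
        return 22.7 * math.log10(3) + 41 + 20 * math.log10(fc / 5)
    if d < d_bp:
        return 22.7 * math.log10(d) + 41 + 20 * math.log10(fc / 5)
    return (40.0 * math.log10(d) + 9.45
            - 17.3 * math.log10(h_bs) - 17.3 * math.log10(h_ms)
            + 2.7 * math.log10(fc / 5))

print(round(d_bp, 2))         # 6.67 (m)
print(round(pl_los(100), 1))  # 82.3 (dB), i.e. well beyond the breakpoint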

The formulas in the code come from IST-4-027756 WINNER II D1.1.2 V1.2 (the WINNER II channel models), which contains the following table; its parameters match those in the code exactly:

[Figure: WINNER II path-loss parameter table]

Updating the shadowing

    def get_shadowing(self, delta_distance, shadowing):
        return np.exp(-1 * (delta_distance / self.decorrelation_distance)) * shadowing \
               + math.sqrt(1 - np.exp(-2 * (delta_distance / self.decorrelation_distance))) * np.random.normal(0, 3)  # standard dev is 3 db

This update formula comes from [1], the text following the channel-model table in Section A.1.4, shown below:

[Figure: shadowing update description from TR 36.885, Section A.1.4]
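In equation form, the update implemented by get_shadowing above is a first-order Gauss-Markov process:

$$S_{k+1} = e^{-\Delta d / d_{\mathrm{corr}}}\, S_k + \sqrt{1 - e^{-2\Delta d / d_{\mathrm{corr}}}}\; N(0, \sigma^2),\qquad d_{\mathrm{corr}} = 10\ \mathrm{m},\ \sigma = 3\ \mathrm{dB}$$

where $\Delta d$ is the distance moved since the last update (for V2V, the code passes the sum of the two vehicles' delta_distance).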

V2Ichannels

It contains the same two methods as V2Vchannels, but the path-loss computation no longer distinguishes LOS from NLOS.

    def get_path_loss(self, position_A):
        d1 = abs(position_A[0] - self.BS_position[0])
        d2 = abs(position_A[1] - self.BS_position[1])
        distance = math.hypot(d1, d2)
        return 128.1 + 37.6 * np.log10(math.sqrt(distance ** 2 + (self.h_bs - self.h_ms) ** 2) / 1000) # + self.shadow_std * np.random.normal()

    def get_shadowing(self, delta_distance, shadowing):
        nVeh = len(shadowing)
        self.R = np.sqrt(0.5 * np.ones([nVeh, nVeh]) + 0.5 * np.identity(nVeh))
        return np.multiply(np.exp(-1 * (delta_distance / self.Decorrelation_distance)), shadowing) \
               + np.sqrt(1 - np.exp(-2 * (delta_distance / self.Decorrelation_distance))) * np.random.normal(0, 8, nVeh)
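In equation form, the V2I path loss above is the familiar macro-cell model

$$PL_{\mathrm{V2I}}(d)\big|_{\mathrm{dB}} = 128.1 + 37.6\lg\!\left(\frac{d}{1\,\mathrm{km}}\right)$$

with $d$ the distance to the BS including the antenna-height difference, and the V2I shadowing update has the same Gauss-Markov form as in the V2V case, only with $\sigma = 8$ dB and the V2I decorrelation distance.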

Both methods above implement the contents of Table A.1.4-2 in [1] and the notes that follow it, shown below:

[Figures: Table A.1.4-2 of TR 36.885 and the accompanying notes]

Environ

[Figure: env data-flow diagram]

The constructor takes four lists (the coordinates of the down/up/left/right lanes): down_lane, up_lane, left_lane, right_lane; the map width and height; and the number of vehicles and the number of neighbors. Besides these, it defines quite a few internal parameters, as follows:

class Environ:
    def __init__(self, down_lane, up_lane, left_lane, right_lane, width, height, n_veh, n_neighbor):
        self.V2Vchannels = V2Vchannels()
        self.V2Ichannels = V2Ichannels()
        self.vehicles = []

        self.demand = []
        self.V2V_Shadowing = []
        self.V2I_Shadowing = []
        self.delta_distance = []
        self.V2V_channels_abs = []
        self.V2I_channels_abs = []

        self.V2I_power_dB = 23  # dBm
        self.V2V_power_dB_List = [23, 15, 5, -100]  # the power levels
        self.V2I_power = 10 ** (self.V2I_power_dB)
        self.sig2_dB = -114
        self.bsAntGain = 8
        self.bsNoiseFigure = 5
        self.vehAntGain = 3
        self.vehNoiseFigure = 9
        self.sig2 = 10 ** (self.sig2_dB / 10)

        self.n_RB = n_veh
        self.n_Veh = n_veh
        self.n_neighbor = n_neighbor
        self.time_fast = 0.001
        self.time_slow = 0.1  # update slow fading/vehicle position every 100 ms
        self.bandwidth = int(1e6)  # bandwidth per RB, 1 MHz
        # self.bandwidth = 1500
        self.demand_size = int((4 * 190 + 300) * 8 * 2)  # V2V payload: 1060 Bytes every 100 ms
        # self.demand_size = 20

        self.V2V_Interference_all = np.zeros((self.n_Veh, self.n_neighbor, self.n_RB)) + self.sig2

Adding vehicles: there are two methods, add_new_vehicles (which takes a start position, direction and velocity) and add_new_vehicles_by_number(n). The latter is interesting: it takes a single argument n, yet it adds 4n vehicles rather than n, one for each of the four directions per round, at random positions; a sketch of the idea follows below.
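A minimal sketch of that idea (not the repo code; the lane attribute names on env, the sampling of the free coordinate, and the speed range are illustrative assumptions):

import random

def add_new_vehicles_by_number_sketch(env, n):
    # Each round spawns one vehicle per direction, so 4*n vehicles in total.
    for _ in range(n):
        for direction, lanes in (('d', env.down_lanes), ('u', env.up_lanes),
                                 ('l', env.left_lanes), ('r', env.right_lanes)):
            lane = random.choice(lanes)                 # random lane of this direction
            if direction in ('u', 'd'):                 # vertical lanes: fixed x, random y
                start_position = [lane, random.uniform(0, env.height)]
            else:                                       # horizontal lanes: random x, fixed y
                start_position = [random.uniform(0, env.width), lane]
            speed = random.uniform(10, 15)              # illustrative speed range
            env.vehicles.append(Vehicle(start_position, direction, speed))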

Updating vehicle positions: renew_positions() takes no arguments. It iterates over all vehicles and moves each one according to its direction and velocity, turning clockwise with a certain probability at intersections, and turning clockwise at the map boundary so that the vehicle stays on the map.

Updating neighbors: renew_neighbor(self), already described under Vehicle.

Updating the channel: renew_channel(self). This defines a very important quantity, channel_abs, which is the sum (in dB) of path loss and shadowing.

    def renew_channel(self):
        """ Renew slow fading channel """

        self.V2V_pathloss = np.zeros((len(self.vehicles), len(self.vehicles))) + 50 * np.identity(len(self.vehicles))
        self.V2I_pathloss = np.zeros((len(self.vehicles)))
        self.V2V_channels_abs = np.zeros((len(self.vehicles), len(self.vehicles)))
        self.V2I_channels_abs = np.zeros((len(self.vehicles)))
        for i in range(len(self.vehicles)):
            for j in range(i + 1, len(self.vehicles)):
                self.V2V_Shadowing[j][i] = self.V2V_Shadowing[i][j] = self.V2Vchannels.get_shadowing(self.delta_distance[i] + self.delta_distance[j], self.V2V_Shadowing[i][j])
                self.V2V_pathloss[j,i] = self.V2V_pathloss[i][j] = self.V2Vchannels.get_path_loss(self.vehicles[i].position, self.vehicles[j].position)
        self.V2V_channels_abs = self.V2V_pathloss + self.V2V_Shadowing

        self.V2I_Shadowing = self.V2Ichannels.get_shadowing(self.delta_distance, self.V2I_Shadowing)
        for i in range(len(self.vehicles)):
            self.V2I_pathloss[i] = self.V2Ichannels.get_path_loss(self.vehicles[i].position)
        self.V2I_channels_abs = self.V2I_pathloss + self.V2I_Shadowing

Updating the fast-fading channel: renew_channels_fastfading(self). The "fast fading" simply subtracts a random term from channels_abs; concretely, it is a Rayleigh fading term in dB (20·lg of the magnitude of a normalized complex Gaussian), drawn independently for every link and every RB.

    def renew_channels_fastfading(self):
        """ Renew fast fading channel """

        # 1 2, 3 4 --> 1 1 2 2 3 3 4 4: repeat each element along a new RB axis
        V2V_channels_with_fastfading = np.repeat(self.V2V_channels_abs[:, :, np.newaxis], self.n_RB, axis=2)
        # channels_abs minus the fast-fading term, 20*log10(|CN(0,1)| / sqrt(2))
        self.V2V_channels_with_fastfading = V2V_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2V_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2V_channels_with_fastfading.shape)) / math.sqrt(2))

        # 1 2, 3 4 --> 1 1 2 2, 3 3 4 4
        V2I_channels_with_fastfading = np.repeat(self.V2I_channels_abs[:, np.newaxis], self.n_RB, axis=1)
        self.V2I_channels_with_fastfading = V2I_channels_with_fastfading - 20 * np.log10(
            np.abs(np.random.normal(0, 1, V2I_channels_with_fastfading.shape) + 1j * np.random.normal(0, 1, V2I_channels_with_fastfading.shape))/ math.sqrt(2))
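Written out, the per-RB fast-fading term subtracted above corresponds to Rayleigh fading with unit average power:

$$h = \tfrac{1}{\sqrt{2}}\,(x + jy),\quad x, y \sim \mathcal{N}(0,1),\qquad A_{\mathrm{fast}}\big|_{\mathrm{dB}} = A_{\mathrm{abs}}\big|_{\mathrm{dB}} - 20\lg|h|$$

so channels_with_fastfading is the slow-fading channel (path loss + shadowing) plus an independent Rayleigh draw per link and per RB.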

Computing the reward: Compute_Performance_Reward_Train(self, actions_power). The input here is very important: it is the RL action, defined in main_marl_train.py as a 3-D array. Describing it as (layer, row, column): one layer per vehicle, one row per neighbor, and two columns holding the RB selection (the RB index) and the power selection (also an index, into V2V_power_dB_List), as shown below:

            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = get_state(env, [i, j], 1, epsi_final)
                    action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                    action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                    action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

The computation proceeds as follows:

  1. Extract the RB selection and the power selection from the action
  2. Compute the V2I channel capacity V2I_Rate
  3. Compute the V2V channel capacity V2V_Rate:
    1. For each RB, find from actions the vehicles sharing that RB
    2. Compute the capacity in two steps: interference from V2I links onto V2V links, then interference among V2V links
  4. Update the remaining demand and the remaining time within time_limit
  5. Generate the reward elements (reward_elements = V2V_Rate/10, with entries whose demand has reached 0 set to 1)
  6. Set active_links to 0 wherever the remaining demand is 0 (one of only two places active_links is modified; the other is its initialization to all ones)
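In formula form, the two spectral efficiencies computed in the code below are (antenna gains and noise figures are folded into the dB channel terms):

$$C^{\mathrm{V2I}}_m = \log_2\!\left(1 + \frac{P^{\mathrm{V2I}}\, g_{m,B}}{\sigma^2 + \sum_{k \in \mathcal{K}_m} P^{\mathrm{V2V}}_k\, \tilde g_{k,B}}\right),\qquad
C^{\mathrm{V2V}}_k = \log_2\!\left(1 + \frac{P^{\mathrm{V2V}}_k\, g_k}{\sigma^2 + I^{\mathrm{V2I}}_k + I^{\mathrm{V2V}}_k}\right)$$

where $\mathcal{K}_m$ is the set of V2V links sharing RB $m$, $g_{m,B}$ is the V2I channel gain to the BS, $\tilde g_{k,B}$ the interfering V2V-to-BS gain, $g_k$ the desired V2V gain, and $I^{\mathrm{V2I}}_k$, $I^{\mathrm{V2V}}_k$ the interference received from the V2I link and from the other V2V links on the same RB. The remaining payload is then drained via demand -= V2V_Rate * time_fast * bandwidth.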

The code:

    def Compute_Performance_Reward_Train(self, actions_power):

        actions = actions_power[:, :, 0]  # the channel_selection_part
        power_selection = actions_power[:, :, 1]  # power selection

        # ------------ Compute V2I rate --------------------
        V2I_Rate = np.zeros(self.n_RB)
        V2I_Interference = np.zeros(self.n_RB)  # V2I interference
        for i in range(len(self.vehicles)):
            for j in range(self.n_neighbor):
                if not self.active_links[i, j]:
                    continue
                V2I_Interference[actions[i][j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]] - self.V2I_channels_with_fastfading[i, actions[i, j]]
                                                           + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        self.V2I_Interference = V2I_Interference + self.sig2
        V2I_Signals = 10 ** ((self.V2I_power_dB - self.V2I_channels_with_fastfading.diagonal() + self.vehAntGain + self.bsAntGain - self.bsNoiseFigure) / 10)
        V2I_Rate = np.log2(1 + np.divide(V2I_Signals, self.V2I_Interference))  # V2I channel capacity (bit/s/Hz)

        # ------------ Compute V2V rate -------------------------
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor))
        V2V_Signal = np.zeros((len(self.vehicles), self.n_neighbor))
        actions[(np.logical_not(self.active_links))] = -1 # inactive links will not transmit regardless of selected power levels
        for i in range(self.n_RB):  # scanning all bands
            indexes = np.argwhere(actions == i)  # find spectrum-sharing V2Vs
            for j in range(len(indexes)):
                receiver_j = self.vehicles[indexes[j, 0]].destinations[indexes[j, 1]]
                V2V_Signal[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                   - self.V2V_channels_with_fastfading[indexes[j][0], receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                # V2I links interference to V2V links
                V2V_Interference[indexes[j, 0], indexes[j, 1]] = 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i, receiver_j, i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)

                #  V2V interference
                for k in range(j + 1, len(indexes)):  # spectrum-sharing V2Vs
                    receiver_k = self.vehicles[indexes[k][0]].destinations[indexes[k][1]]
                    V2V_Interference[indexes[j, 0], indexes[j, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[k, 0], indexes[k, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[k][0]][receiver_j][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
                    V2V_Interference[indexes[k, 0], indexes[k, 1]] += 10 ** ((self.V2V_power_dB_List[power_selection[indexes[j, 0], indexes[j, 1]]]
                                                                              - self.V2V_channels_with_fastfading[indexes[j][0]][receiver_k][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference = V2V_Interference + self.sig2
        V2V_Rate = np.log2(1 + np.divide(V2V_Signal, self.V2V_Interference))

        self.demand -= V2V_Rate * self.time_fast * self.bandwidth
        self.demand[self.demand < 0] = 0 # eliminate negative demands

        self.individual_time_limit -= self.time_fast

        reward_elements = V2V_Rate/10
        reward_elements[self.demand <= 0] = 1

        self.active_links[np.multiply(self.active_links, self.demand <= 0)] = 0 # transmission finished, turned to "inactive"

        return V2I_Rate, V2V_Rate, reward_elements

Note: three quantities are returned here, and the last one is not yet the final reward; the final reward is a weighted combination of the returned quantities (see act_for_training below).

Acting for training: act_for_training(self, actions) takes the actions and computes the final reward via Compute_Performance_Reward_Train:

    def act_for_training(self, actions):

        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)

        lambdda = 0.
        reward = lambdda * np.sum(V2I_Rate) / (self.n_Veh * 10) + (1 - lambdda) * np.sum(reward_elements) / (self.n_Veh * self.n_neighbor)

        return reward
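Written out, with $N$ vehicles and $K$ neighbors the shared training reward is

$$r_t = \lambda\,\frac{\sum_m C^{\mathrm{V2I}}_m}{10\,N} + (1-\lambda)\,\frac{\sum_{i,j} r^{\mathrm{V2V}}_{i,j}}{N K},\qquad r^{\mathrm{V2V}}_{i,j} = \begin{cases} C^{\mathrm{V2V}}_{i,j}/10, & \text{payload still remaining}\\ 1, & \text{payload delivered} \end{cases}$$

Note that lambdda is set to 0 in this version of the code, so the reward actually used for training is the V2V term only.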

Acting for testing: act_for_testing(self, actions) is similar to the above and also calls Compute_Performance_Reward_Train, but it returns V2I_Rate, V2V_success and V2V_Rate.

    def act_for_testing(self, actions):

        action_temp = actions.copy()
        V2I_Rate, V2V_Rate, reward_elements = self.Compute_Performance_Reward_Train(action_temp)
        V2V_success = 1 - np.sum(self.active_links) / (self.n_Veh * self.n_neighbor)  # V2V success rates

        return V2I_Rate, V2V_success, V2V_Rate

The three quantities above are the final per-step results within an episode, as can be seen in the testing part of main_marl_train.py, excerpted below:

        for test_step in range(n_step_per_episode):
            # trained models
            action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = get_state(env, [i, j], 1, epsi_final)
                    action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                    action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                    action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

            action_temp = action_all_testing.copy()
            V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
            V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps
            rate_marl[idx_episode, test_step,:,:] = V2V_rate
            demand_marl[idx_episode, test_step+1,:,:] = env.demand

Computing interference: Compute_Interference(self, actions) accumulates V2V_Interference_all with +=:

    def Compute_Interference(self, actions):
        V2V_Interference = np.zeros((len(self.vehicles), self.n_neighbor, self.n_RB)) + self.sig2

        channel_selection = actions.copy()[:, :, 0]  # column 0 of every layer: channel selection
        power_selection = actions.copy()[:, :, 1]  # column 1 of every layer: power selection
        channel_selection[np.logical_not(self.active_links)] = -1  # mark inactive links as -1

        # interference from V2I links
        for i in range(self.n_RB):
            for k in range(len(self.vehicles)):
                for m in range(len(channel_selection[k, :])):
                    V2V_Interference[k, m, i] += 10 ** ((self.V2I_power_dB - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][i] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)

        # interference from peer V2V links
        for i in range(len(self.vehicles)):
            for j in range(len(channel_selection[i, :])):
                for k in range(len(self.vehicles)):
                    for m in range(len(channel_selection[k, :])):
                        # if i == k or channel_selection[i,j] >= 0:
                        if i == k and j == m or channel_selection[i, j] < 0:
                            continue
                        V2V_Interference[k, m, channel_selection[i, j]] += 10 ** ((self.V2V_power_dB_List[power_selection[i, j]]
                                                                                   - self.V2V_channels_with_fastfading[i][self.vehicles[k].destinations[m]][channel_selection[i,j]] + 2 * self.vehAntGain - self.vehNoiseFigure) / 10)
        self.V2V_Interference_all = 10 * np.log10(V2V_Interference)

It is used in get_state in main_marl_train.py to build the V2V_interference component of the state:

def get_state(env, idx=(0,0), ind_episode=1., epsi=0.02):
    """ Get state from the environment """
    # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel info (PL + shadowing),
    # remaining time, remaining payload

    # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
    V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10)/35

    # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
    V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10)/35

    V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60

    V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
    V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80)/60.0

    load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
    time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])

    # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    # all quantities of interest appear here: V2V_fast, V2I_fast, V2V_interference, V2I_abs, V2V_abs

Some readers may be puzzled here: why compute V2V interference again? Didn't we already compute it? Yes: computing V2V_Rate already required a V2V interference term, but from what I can tell that one is computed per RB following the RB allocation, whereas this one iterates directly over every vehicle (and neighbor and RB) to fill the interference tensor used in the state.

ReplayMemory

This comes from replay_memory.py. There is not much here: it defines a single class, ReplayMemory. Note that each agent has its own memory, as can be seen from class Agent in main_marl_train.py:

class Agent(object):
    def __init__(self, memory_entry_size):
        self.discount = 1
        self.double_q = True
        self.memory_entry_size = memory_entry_size
        self.memory = ReplayMemory(self.memory_entry_size)

Initialization: the constructor takes entry_size, the length of each stored state vector (the buffer capacity itself is hard-coded as memory_size = 200000):

class ReplayMemory:
    def __init__(self, entry_size):
        self.entry_size = entry_size
        self.memory_size = 200000
        self.actions = np.empty(self.memory_size, dtype = np.uint8)
        self.rewards = np.empty(self.memory_size, dtype = np.float64)
        self.prestate = np.empty((self.memory_size, self.entry_size), dtype = np.float16)
        self.poststate = np.empty((self.memory_size, self.entry_size), dtype = np.float16)
        self.batch_size = 2000
        self.count = 0
        self.current = 0

Adding a transition: add(self, prestate, poststate, reward, action). As the signature shows, each stored entry consists of (previous state, next state, reward, action):

    def add(self, prestate, poststate, reward, action):
        self.actions[self.current] = action
        self.rewards[self.current] = reward
        self.prestate[self.current] = prestate
        self.poststate[self.current] = poststate
        self.count = max(self.count, self.current + 1)
        self.current = (self.current + 1) % self.memory_size

Every agent must record this transition information at each time step. The use of add can be seen in the training part of main_marl_train.py, shown below; this for loop sits inside an outer loop over episodes, so at every step of every episode each agent appends a transition to its own memory (last line).

        for i_step in range(n_step_per_episode):  # number of steps = time_slow / time_fast = 0.1 / 0.001 = 100
            time_step = i_episode*n_step_per_episode + i_step  # global step index
            state_old_all = []
            action_all = []
            action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    state_old_all.append(state)
                    action = predict(sesses[i*n_neighbor+j], state, epsi)
                    action_all.append(action)

                    action_all_training[i, j, 0] = action % n_RB  # chosen RB
                    action_all_training[i, j, 1] = int(np.floor(action / n_RB)) # power level

            # All agents take actions simultaneously, obtain shared reward, and update the environment.
            action_temp = action_all_training.copy()
            train_reward = env.act_for_training(action_temp)
            record_reward[time_step] = train_reward

            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = state_old_all[n_neighbor * i + j]
                    action = action_all[n_neighbor * i + j]
                    state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory

Sampling: sample(self). After many calls to add, each agent's memory holds many transitions, but training draws batch_size of them at a time:

    def sample(self):

        if self.count < self.batch_size:
            indexes = range(0, self.count)
        else:
            indexes = random.sample(range(0,self.count), self.batch_size)
        prestate = self.prestate[indexes]
        poststate = self.poststate[indexes]
        actions = self.actions[indexes]
        rewards = self.rewards[indexes]
        return prestate, poststate, actions, rewards

Main script: main_marl_train.py

Class Agent: Agent(object) takes memory_entry_size as its only constructor argument and stores a few algorithm parameters; note that its memory is implemented by ReplayMemory, which was just described above.

class Agent(object):
    def __init__(self, memory_entry_size):
        self.discount = 1
        self.double_q = True
        self.memory_entry_size = memory_entry_size
        self.memory = ReplayMemory(self.memory_entry_size)

Parameter initialization: this part is written directly in the script, not inside a function. It roughly covers the map attributes (lane/intersection coordinates, overall map size), the numbers of vehicles, neighbors, RBs and episodes, and a few algorithm parameters, as follows:

To understand the map parameters up_lanes / down_lanes / left_lanes / right_lanes, note that the system model follows the urban case of 3GPP TR 36.885: every street has four lanes (two in each direction), each lane is 3.5 m wide, the road grid measured between the yellow center lines is 433 m * 250 m, and the simulation area is 1299 m * 750 m. The simulation scales everything down to 1/2 of the original size (which is why width and height are divided by 2), and this shows up in the lane coordinates as the i / 2.0 in the list comprehensions.

[Figure: urban road-grid layout from 3GPP TR 36.885]

Take up_lanes as an example. Each lane is 3.5 m wide, and a vehicle, modeled as a point, travels along the center of its lane; hence the 3.5/2 in the first entry of the inner list. The second entry, 3.5 + 3.5/2, is the center of the second lane in the same direction; the third entry adds 250 to jump across the building block to the first same-direction lane of the next street, and so on.

up_lanes = [i/2.0 for i in [3.5/2, 3.5 + 3.5/2, 250+3.5/2, 250+3.5+3.5/2, 500+3.5/2, 500+3.5+3.5/2]]
down_lanes = [i/2.0 for i in [250-3.5-3.5/2,250-3.5/2,500-3.5-3.5/2,500-3.5/2,750-3.5-3.5/2,750-3.5/2]]
left_lanes = [i/2.0 for i in [3.5/2,3.5/2 + 3.5,433+3.5/2, 433+3.5+3.5/2, 866+3.5/2, 866+3.5+3.5/2]]
right_lanes = [i/2.0 for i in [433-3.5-3.5/2,433-3.5/2,866-3.5-3.5/2,866-3.5/2,1299-3.5-3.5/2,1299-3.5/2]]

width = 750/2
height = 1298/2

IS_TRAIN = 1
IS_TEST = 1-IS_TRAIN

label = 'marl_model'

n_veh = 4
n_neighbor = 1
n_RB = n_veh

env = Environment_marl.Environ(down_lanes, up_lanes, left_lanes, right_lanes, width, height, n_veh, n_neighbor)
env.new_random_game()  # initialize parameters in env

# n_episode = 3000
n_episode = 600
n_step_per_episode = int(env.time_slow/env.time_fast)  # slow = 0.1, fast = 0.001
epsi_final = 0.02
epsi_anneal_length = int(0.8*n_episode)
mini_batch_step = n_step_per_episode
target_update_step = n_step_per_episode*4

n_episode_test = 100  # test episodes
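Returning to the lane lists defined above: evaluating up_lanes makes the scaling concrete; these are the fixed coordinates (in meters, already halved) of the six upward lanes.

up_lanes = [i / 2.0 for i in [3.5/2, 3.5 + 3.5/2, 250 + 3.5/2,
                              250 + 3.5 + 3.5/2, 500 + 3.5/2, 500 + 3.5 + 3.5/2]]
print(up_lanes)  # [0.875, 2.625, 125.875, 127.625, 250.875, 252.625]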

Getting the state: get_state(env, idx=(0,0), ind_episode=1., epsi=0.02). The input is the environment env, and the returned state concatenates:

  1. V2V_fast: the fast-fading component, i.e. the channel with fast fading minus channel_abs (see "Basic class definitions -- Environ -- Updating the fast-fading channel" above)
  2. V2I_fast: likewise
  3. V2V_interference (see "Basic class definitions -- Environ -- Computing interference" above)
  4. V2I_abs (PL + shadowing)
  5. V2V_abs (PL + shadowing)

plus the remaining payload, the remaining time, and the two scalars ind_episode and epsi.

def get_state(env, idx=(0,0), ind_episode=1., epsi=0.02):
    """ Get state from the environment """
    # includes V2I/V2V fast fading, V2V interference, V2I/V2V channel info (PL + shadowing),
    # remaining time, remaining payload

    # V2I_channel = (env.V2I_channels_with_fastfading[idx[0], :] - 80) / 60
    V2I_fast = (env.V2I_channels_with_fastfading[idx[0], :] - env.V2I_channels_abs[idx[0]] + 10)/35

    # V2V_channel = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - 80) / 60
    V2V_fast = (env.V2V_channels_with_fastfading[:, env.vehicles[idx[0]].destinations[idx[1]], :] - env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] + 10)/35

    V2V_interference = (-env.V2V_Interference_all[idx[0], idx[1], :] - 60) / 60

    V2I_abs = (env.V2I_channels_abs[idx[0]] - 80) / 60.0
    V2V_abs = (env.V2V_channels_abs[:, env.vehicles[idx[0]].destinations[idx[1]]] - 80)/60.0

    load_remaining = np.asarray([env.demand[idx[0], idx[1]] / env.demand_size])
    time_remaining = np.asarray([env.individual_time_limit[idx[0], idx[1]] / env.time_slow])

    # return np.concatenate((np.reshape(V2V_channel, -1), V2V_interference, V2I_abs, V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
    return np.concatenate((V2I_fast, np.reshape(V2V_fast, -1), V2V_interference, np.asarray([V2I_abs]), V2V_abs, time_remaining, load_remaining, np.asarray([ind_episode, epsi])))
 

Defining the NN:

with g.as_default():
    # ============== Training network ========================
    x = tf.placeholder(tf.float32, [None, n_input])  # input placeholder

    w_1 = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
    w_2 = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
    w_3 = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
    w_4 = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))

    b_1 = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
    b_2 = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
    b_3 = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
    b_4 = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

    layer_1 = tf.nn.relu(tf.add(tf.matmul(x, w_1), b_1))
    layer_1_b = tf.layers.batch_normalization(layer_1)
    layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1_b, w_2), b_2))
    layer_2_b = tf.layers.batch_normalization(layer_2)
    layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2_b, w_3), b_3))
    layer_3_b = tf.layers.batch_normalization(layer_3)
    y = tf.nn.relu(tf.add(tf.matmul(layer_3_b, w_4), b_4))

    g_q_action = tf.argmax(y, axis=1)

    # compute loss
    g_target_q_t = tf.placeholder(tf.float32, None, name="target_value")

    g_action = tf.placeholder(tf.int32, None, name='g_action')
    action_one_hot = tf.one_hot(g_action, n_output, 1.0, 0.0, name='action_one_hot')
    q_acted = tf.reduce_sum(y * action_one_hot, reduction_indices=1, name='q_acted')

    g_loss = tf.reduce_mean(tf.square(g_target_q_t - q_acted), name='g_loss')  # squared TD error
    optim = tf.train.RMSPropOptimizer(learning_rate=0.001, momentum=0.95, epsilon=0.01).minimize(g_loss)  # gradient descent (RMSProp)

    # ==================== Prediction network ========================
    x_p = tf.placeholder(tf.float32, [None, n_input])  # input placeholder

    w_1_p = tf.Variable(tf.truncated_normal([n_input, n_hidden_1], stddev=0.1))
    w_2_p = tf.Variable(tf.truncated_normal([n_hidden_1, n_hidden_2], stddev=0.1))
    w_3_p = tf.Variable(tf.truncated_normal([n_hidden_2, n_hidden_3], stddev=0.1))
    w_4_p = tf.Variable(tf.truncated_normal([n_hidden_3, n_output], stddev=0.1))

    b_1_p = tf.Variable(tf.truncated_normal([n_hidden_1], stddev=0.1))
    b_2_p = tf.Variable(tf.truncated_normal([n_hidden_2], stddev=0.1))
    b_3_p = tf.Variable(tf.truncated_normal([n_hidden_3], stddev=0.1))
    b_4_p = tf.Variable(tf.truncated_normal([n_output], stddev=0.1))

    layer_1_p = tf.nn.relu(tf.add(tf.matmul(x_p, w_1_p), b_1_p))
    layer_1_p_b = tf.layers.batch_normalization(layer_1_p)
    layer_2_p = tf.nn.relu(tf.add(tf.matmul(layer_1_p_b, w_2_p), b_2_p))
    layer_2_p_b = tf.layers.batch_normalization(layer_2_p)
    layer_3_p = tf.nn.relu(tf.add(tf.matmul(layer_2_p_b, w_3_p), b_3_p))
    layer_3_p_b = tf.layers.batch_normalization(layer_3_p)
    y_p = tf.nn.relu(tf.add(tf.matmul(layer_3_p_b, w_4_p), b_4_p))

    g_target_q_idx = tf.placeholder('int32', [None, None], 'output_idx')  # input: an (n, 2) list of indices
    target_q_with_idx = tf.gather_nd(y_p, g_target_q_idx)  # gather the selected (row, action) entries of y_p

    init = tf.global_variables_initializer()
    saver = tf.train.Saver()

Here I only describe the overall structure; for the concrete meaning see the "Sampling and computing the loss" section below, which explains the network structure together with the algorithm.

The graph is split into three parts: the training network, the loss computation, and the prediction network, denoted N1, N2 and N3. N1 and N3 have identical structure; they are the Q-networks of the DQN algorithm and output Q-values. The difference is that N1 is updated at every training step, whereas N3 is updated only once in a while. N2 takes N1's output and computes the loss used to update N1 iteratively.

Prediction: predict(sess, s_t, ep, test_ep=False) runs the network to generate an action (epsilon-greedy during training):

def predict(sess, s_t, ep, test_ep = False):

    n_power_levels = len(env.V2V_power_dB_List)
    if np.random.rand() < ep and not test_ep:
        pred_action = np.random.randint(n_RB*n_power_levels)
    else:
        pred_action = sess.run(g_q_action, feed_dict={x: [s_t]})[0]
    return pred_action

The action here is a single int, but it encodes both the RB and the power level; it appears again later in both the training and testing loops and is decoded as follows:

                    action = predict(sesses[i*n_neighbor+j], state, epsi)
                    action_all.append(action)

                    action_all_training[i, j, 0] = action % n_RB  # chosen RB
                    action_all_training[i, j, 1] = int(np.floor(action / n_RB)) # power level
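A tiny sanity check of this mapping (the action value is purely illustrative), with n_RB = 4 and the four power levels defined in Environ:

import numpy as np

n_RB = 4
V2V_power_dB_List = [23, 15, 5, -100]        # dBm, as in Environ

action = 9                                   # an integer in [0, n_RB * len(V2V_power_dB_List))
rb = action % n_RB                           # -> 1
power_level = int(np.floor(action / n_RB))   # -> 2 (same as action // n_RB)
print(rb, V2V_power_dB_List[power_level])    # 1 5, i.e. RB 1, transmit at 5 dBm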

Sampling and computing the loss: q_learning_mini_batch(current_agent, current_sess) takes a single agent (and its session) and uses the sample method of the ReplayMemory class described above. Double Q-learning is also configured here.

def q_learning_mini_batch(current_agent, current_sess):
    """ Training a sampled mini-batch """

    batch_s_t, batch_s_t_plus_1, batch_action, batch_reward = current_agent.memory.sample()

    if current_agent.double_q:  # double q-learning
        pred_action = current_sess.run(g_q_action, feed_dict={x: batch_s_t_plus_1})
        q_t_plus_1 = current_sess.run(target_q_with_idx, {x_p: batch_s_t_plus_1, g_target_q_idx: [[idx, pred_a] for idx, pred_a in enumerate(pred_action)]})
        batch_target_q_t = current_agent.discount * q_t_plus_1 + batch_reward
    else:
        q_t_plus_1 = current_sess.run(y_p, {x_p: batch_s_t_plus_1})
        max_q_t_plus_1 = np.max(q_t_plus_1, axis=1)
        batch_target_q_t = current_agent.discount * max_q_t_plus_1 + batch_reward

    _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})
    return loss_val

Update (Apr. 23): this function should be read together with the network structure; I find it somewhat involved. As the if suggests, it implements both vanilla DQN and double Q-learning. Note that both branches only compute the target value from the target network; feeding the training network (upper-left of the algorithm diagram) and its gradient update are done by the last line:

    _, loss_val = current_sess.run([optim, g_loss], {g_target_q_t: batch_target_q_t, g_action: batch_action, x: batch_s_t})

This code is best understood alongside the figures in this post; the algorithm diagrams and the code flowcharts are attached below (the code flowcharts were drawn by the author in Visio, do not follow a standard notation, and may contain mistakes).

Vanilla DQN

[Figure: policy update flow of vanilla DQN (from Hung-yi Lee's slides)]

[Figure: data flow of the vanilla DQN in the code (red labels are the actual arguments)]

Double DQN

[Figure: illustration of the Double DQN algorithm (from another blog post)]

[Figure: data flow of the Double DQN in the code]

The difference from vanilla DQN lies in how the target is formed: vanilla DQN builds the target directly from the prediction network (labelled "predict / updated periodically" in the figure) plus a max, whereas Double DQN cascades the training network and the prediction network to build the target (the training network selects the action, the prediction network evaluates it).
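In equation form, the two targets built in q_learning_mini_batch are (with $\theta$ the training network N1, $\theta^-$ the prediction/target network N3, and $\gamma$ = discount, set to 1 in this code):

$$y^{\mathrm{DQN}} = r + \gamma \max_{a'} Q_{\theta^-}(s', a'),\qquad
y^{\mathrm{Double\ DQN}} = r + \gamma\, Q_{\theta^-}\!\big(s',\ \arg\max_{a'} Q_{\theta}(s', a')\big)$$

and the final line minimizes $\big(y - Q_\theta(s, a)\big)^2$ over the sampled mini-batch.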

Training loop

for each episode (one full episode iteration):

  • Determine epsi from the episode index (annealed downward, then held at epsi_final)
  • Every 100 episodes, update the positions, neighbors, fast fading and slow-fading channel
  • Reset demand, time_limit and active_links (all ones)
  • for each step within the episode:
    • Initialize state_old_all, action_all, action_all_training

    • Obtain an action from predict (encodes both the RB and the power level)

    • Decode the action into action_all_training = [vehicle, neighbor, RB/power]

    • Obtain the shared reward via act_for_training

    • Append the reward to record_reward

    • Update the fast fading

    • Compute the interference from the chosen actions

    • Loop over all agents: compute each new state and add (state_old, state_new, train_reward, action) to that agent's memory

      • Every mini_batch_step steps: compute the loss via q_learning_mini_batch

      • Every target_update_step steps: update the target Q-network

record_reward = np.zeros([n_episode*n_step_per_episode, 1])
record_loss = []
if IS_TRAIN:
    for i_episode in range(n_episode):
        print("-------------------------")
        print('Episode:', i_episode)
        if i_episode < epsi_anneal_length:
            epsi = 1 - i_episode * (1 - epsi_final) / (epsi_anneal_length - 1)  # epsilon decreases over each episode
        else:
            epsi = epsi_final

        # every 100 episodes, update positions, neighbors, slow fading and fast fading
        if i_episode%100 == 0:
            env.renew_positions() # update vehicle position
            env.renew_neighbor()
            env.renew_channel() # update channel slow fading
            env.renew_channels_fastfading() # update channel fast fading

        env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        for i_step in range(n_step_per_episode):  # number of steps = time_slow / time_fast = 0.1 / 0.001 = 100
            time_step = i_episode*n_step_per_episode + i_step  # global step index
            state_old_all = []
            action_all = []
            action_all_training = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    state_old_all.append(state)
                    action = predict(sesses[i*n_neighbor+j], state, epsi)
                    action_all.append(action)

                    action_all_training[i, j, 0] = action % n_RB  # chosen RB
                    action_all_training[i, j, 1] = int(np.floor(action / n_RB)) # power level

            # All agents take actions simultaneously, obtain shared reward, and update the environment.
            action_temp = action_all_training.copy()
            train_reward = env.act_for_training(action_temp)
            record_reward[time_step] = train_reward

            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = state_old_all[n_neighbor * i + j]
                    action = action_all[n_neighbor * i + j]
                    state_new = get_state(env, [i, j], i_episode/(n_episode-1), epsi)
                    agents[i * n_neighbor + j].memory.add(state_old, state_new, train_reward, action)  # add entry to this agent's memory

                    # training this agent
                    if time_step % mini_batch_step == mini_batch_step-1:
                        loss_val_batch = q_learning_mini_batch(agents[i*n_neighbor+j], sesses[i*n_neighbor+j])
                        record_loss.append(loss_val_batch)
                        if i == 0 and j == 0:
                            print('step:', time_step, 'agent',i*n_neighbor+j, 'loss', loss_val_batch)
                    if time_step % target_update_step == target_update_step-1:
                        update_target_q_network(sesses[i*n_neighbor+j])
                        if i == 0 and j == 0:
                            print('Update target Q network...')

    print('Training Done. Saving models...')
    for i in range(n_veh):
        for j in range(n_neighbor):
            model_path = label + '/agent_' + str(i * n_neighbor + j)
            save_models(sesses[i * n_neighbor + j], model_path)

    current_dir = os.path.dirname(os.path.realpath(__file__))
    reward_path = os.path.join(current_dir, "model/" + label + '/reward.mat')
    scipy.io.savemat(reward_path, {'reward': record_reward})

    record_loss = np.asarray(record_loss).reshape((-1, n_veh*n_neighbor))
    loss_path = os.path.join(current_dir, "model/" + label + '/train_loss.mat')
    scipy.io.savemat(loss_path, {'train_loss': record_loss})

Testing loop

First, load the models saved during training.

for each episode (one full episode iteration):

  • Update the positions, neighbors, fast fading and slow-fading channel
  • Reset demand, time_limit and active_links (all ones)
  • for each step within the episode:
    • Initialize state_old_all, action_all, action_all_testing

    • Obtain an action from predict (encodes both the RB and the power level)

    • Decode the action into action_all_testing = [vehicle, neighbor, RB/power]

    • Obtain V2I_rate, V2V_success, V2V_rate via act_for_testing

    • Sum V2I_rate and append it to V2I_rate_per_episode

    • Store V2V_rate into rate_marl

    • Record the updated env.demand into demand_marl

if IS_TEST:
    print("\nRestoring the model...")

    for i in range(n_veh):
        for j in range(n_neighbor):
            model_path = label + '/agent_' + str(i * n_neighbor + j)
            load_models(sesses[i * n_neighbor + j], model_path)

    V2I_rate_list = []
    V2V_success_list = []
    V2I_rate_list_rand = []
    V2V_success_list_rand = []
    rate_marl = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
    rate_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
    demand_marl = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
    demand_rand = env.demand_size * np.ones([n_episode_test, n_step_per_episode+1, n_veh, n_neighbor])
    power_rand = np.zeros([n_episode_test, n_step_per_episode, n_veh, n_neighbor])
    for idx_episode in range(n_episode_test):
        print('----- Episode', idx_episode, '-----')

        env.renew_positions()
        env.renew_neighbor()
        env.renew_channel()
        env.renew_channels_fastfading()

        env.demand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        env.demand_rand = env.demand_size * np.ones((env.n_Veh, env.n_neighbor))
        env.individual_time_limit_rand = env.time_slow * np.ones((env.n_Veh, env.n_neighbor))
        env.active_links_rand = np.ones((env.n_Veh, env.n_neighbor), dtype='bool')

        V2I_rate_per_episode = []
        V2I_rate_per_episode_rand = []
        for test_step in range(n_step_per_episode):
            # trained models
            action_all_testing = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            for i in range(n_veh):
                for j in range(n_neighbor):
                    state_old = get_state(env, [i, j], 1, epsi_final)
                    action = predict(sesses[i*n_neighbor+j], state_old, epsi_final, True)
                    action_all_testing[i, j, 0] = action % n_RB  # chosen RB
                    action_all_testing[i, j, 1] = int(np.floor(action / n_RB))  # power level

            action_temp = action_all_testing.copy()
            V2I_rate, V2V_success, V2V_rate = env.act_for_testing(action_temp)
            V2I_rate_per_episode.append(np.sum(V2I_rate))  # sum V2I rate in bps
            rate_marl[idx_episode, test_step,:,:] = V2V_rate
            demand_marl[idx_episode, test_step+1,:,:] = env.demand

            # random baseline
            action_rand = np.zeros([n_veh, n_neighbor, 2], dtype='int32')
            action_rand[:, :, 0] = np.random.randint(0, n_RB, [n_veh, n_neighbor]) # band
            action_rand[:, :, 1] = np.random.randint(0, len(env.V2V_power_dB_List), [n_veh, n_neighbor]) # power

            V2I_rate_rand, V2V_success_rand, V2V_rate_rand = env.act_for_testing_rand(action_rand)
            V2I_rate_per_episode_rand.append(np.sum(V2I_rate_rand))  # sum V2I rate in bps
            rate_rand[idx_episode, test_step, :, :] = V2V_rate_rand
            demand_rand[idx_episode, test_step+1,:,:] = env.demand_rand
            for i in range(n_veh):
                for j in range(n_neighbor):
                    power_rand[idx_episode, test_step, i, j] = env.V2V_power_dB_List[int(action_rand[i, j, 1])]

            # update the environment and compute interference
            env.renew_channels_fastfading()
            env.Compute_Interference(action_temp)

            if test_step == n_step_per_episode - 1:
                V2V_success_list.append(V2V_success)
                V2V_success_list_rand.append(V2V_success_rand)

        V2I_rate_list.append(np.mean(V2I_rate_per_episode))
        V2I_rate_list_rand.append(np.mean(V2I_rate_per_episode_rand))

        print(round(np.average(V2I_rate_per_episode), 2), 'rand', round(np.average(V2I_rate_per_episode_rand), 2))
        print(V2V_success_list[idx_episode], 'rand', V2V_success_list_rand[idx_episode])

References

[1] 3GPP TR 36.885 technical report.

[2] 5G移动通信技术 (5G Mobile Communication Technology).
