Training!!!

Contents

Reading the parameters

Training

Setting the loss function

The training process

net.train(): entering training mode


Reading the parameters

        All parameter settings are stored in a .conf file and read with argparse and configparser. This makes the parameters easy to modify and keeps the code clean and clear.

import argparse
import configparser


parser = argparse.ArgumentParser()
parser.add_argument("--config", default='configurations/PEMS04_astgcn.conf',
                    type=str, help="configuration file path")
args = parser.parse_args()
config = configparser.ConfigParser()
print('Read configuration file: %s' % (args.config))
config.read(args.config)
data_config = config['Data']
training_config = config['Training']

dataset_name = data_config['dataset_name']
model_name = training_config['model_name']

Create a folder to store the parameters produced during the experiment:

import os

folder_dir = '%s_h%dd%dw%d_channel%d_%e' % (model_name, num_of_hours, num_of_days, num_of_weeks, in_channels, learning_rate)
print('folder_dir:', folder_dir)
params_path = os.path.join('experiments', dataset_name, folder_dir)
print('params_path:', params_path)

Load the data. The first call returns the node-feature data (train/validation/test loaders plus normalization statistics); the second returns edge information, e.g. the adjacency matrix and a distance-weighted adjacency matrix:

train_loader, train_target_tensor, val_loader, val_target_tensor, test_loader, test_target_tensor, _mean, _std = load_graphdata_channel1(
    graph_signal_matrix_filename, num_of_hours,  num_of_days, num_of_weeks, batch_size)
adj_mx, distance_mx = get_adjacency_matrix(adj_filename, num_of_vertices, id_filename)
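The loader implementations are not shown here, but the shape of what get_adjacency_matrix returns can be sketched roughly as follows (a hedged guess: build_adjacency, the (from, to, cost) edge tuples, and the 0/1 convention are illustrative assumptions, not the project's actual code, which parses the PEMS distance CSV):

```python
import numpy as np

def build_adjacency(edges, num_of_vertices):
    """Sketch: build a 0/1 adjacency matrix and a distance-weighted one
    from (from_node, to_node, cost) edge records."""
    adj_mx = np.zeros((num_of_vertices, num_of_vertices), dtype=np.float32)
    distance_mx = np.zeros((num_of_vertices, num_of_vertices), dtype=np.float32)
    for i, j, dist in edges:
        adj_mx[i, j] = 1          # unweighted connectivity
        distance_mx[i, j] = dist  # edge weight, e.g. road distance
    return adj_mx, distance_mx
```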

Instantiate the model net, passing in the model hyperparameters:

net = make_model(nb_block, in_channels, K, nb_chev_filter, nb_time_filter, time_strides, adj_mx,
                 num_for_predict, len_input, num_of_vertices)

Training

        If the folder for storing parameters does not exist yet, create it recursively. If it already exists and training is starting from scratch, delete it recursively and recreate it. If training is resuming from a checkpoint and the folder exists, just print a message; any other combination is an error.

    if (start_epoch == 0) and (not os.path.exists(params_path)):
        os.makedirs(params_path)
        print('create params directory %s' % (params_path))
    elif (start_epoch == 0) and (os.path.exists(params_path)):
        shutil.rmtree(params_path)
        os.makedirs(params_path)
        print('delete the old one and create params directory %s' % (params_path))
    elif (start_epoch > 0) and (os.path.exists(params_path)):
        print('train from params directory %s' % (params_path))
    else:
        raise SystemExit('Wrong type of model!')

Setting the loss function

The paper uses MSE as the loss that is backpropagated, with RMSE and MAE as evaluation metrics. But how is this implemented?

MSE is the loss that gets backpropagated to update the parameters during training on the training set (it is also computed, without updates, on the validation set). In the code, criterion is this MSE loss, evaluated for every batch:

if masked_flag:
    loss = criterion_masked(outputs, labels, missing_value)
else:
    loss = criterion(outputs, labels)
loss.backward()
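For reference, a minimal sketch of what criterion and criterion_masked might look like (assumptions: criterion is plain nn.MSELoss(), and masked_mse below is a hypothetical stand-in for criterion_masked that ignores positions whose label equals the missing-value marker):

```python
import torch
import torch.nn as nn

# Plain MSE, assumed to be what `criterion` is in the original code.
criterion = nn.MSELoss()

def masked_mse(preds, labels, null_val):
    """Hypothetical masked MSE: only positions whose label differs from
    null_val contribute to the loss (useful when missing readings are
    encoded as a sentinel value)."""
    mask = (labels != null_val).float()
    mask = mask / mask.mean()  # rescale so the loss magnitude stays comparable
    return torch.mean((preds - labels) ** 2 * mask)
```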

On the test set the model is only run forward: no backpropagation is performed. The outputs are collected, every metric is computed and printed, and together they form the complete prediction result.

        for batch_index, batch_data in enumerate(data_loader):
            encoder_inputs, labels = batch_data
            input.append(encoder_inputs[:, :, 0:1].cpu().numpy())  # (batch, T', 1)
            outputs = net(encoder_inputs)
            prediction.append(outputs.detach().cpu().numpy())
            if batch_index % 100 == 0:
                print('predicting data set batch %s / %s' % (batch_index + 1, loader_length))

        input = np.concatenate(input, 0)
        input = re_normalization(input, _mean, _std)
        prediction = np.concatenate(prediction, 0)  # (batch, T', 1)

        print('input:', input.shape)
        print('prediction:', prediction.shape)
        print('data_target_tensor:', data_target_tensor.shape)
        output_filename = os.path.join(params_path, 'output_epoch_%s_%s' % (global_step, type))
        np.savez(output_filename, input=input, prediction=prediction, data_target_tensor=data_target_tensor)

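re_normalization itself is not shown above. Assuming the loader applied z-score normalization (x - mean) / std, undoing it is one line (a sketch, not necessarily the project's actual implementation):

```python
import numpy as np

def re_normalization(x, mean, std):
    """Invert an assumed z-score normalization so predictions are back
    in the original units (e.g. traffic flow counts)."""
    return x * std + mean
```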
        # compute errors for each prediction horizon
        excel_list = []
        prediction_length = prediction.shape[2]

        for i in range(prediction_length):
            assert data_target_tensor.shape[0] == prediction.shape[0]
            print('current epoch: %s, predict %s points' % (global_step, i))
            if metric_method == 'mask':
                mae = masked_mae_test(data_target_tensor[:, :, i], prediction[:, :, i], 0.0)
                rmse = masked_rmse_test(data_target_tensor[:, :, i], prediction[:, :, i], 0.0)
                mape = masked_mape_np(data_target_tensor[:, :, i], prediction[:, :, i], 0)
            else:
                mae = mean_absolute_error(data_target_tensor[:, :, i], prediction[:, :, i])
                rmse = mean_squared_error(data_target_tensor[:, :, i], prediction[:, :, i]) ** 0.5
                mape = masked_mape_np(data_target_tensor[:, :, i], prediction[:, :, i], 0)
            print('MAE: %.2f' % (mae))
            print('RMSE: %.2f' % (rmse))
            print('MAPE: %.2f' % (mape))
            excel_list.extend([mae, rmse, mape])

        # print overall results
        if metric_method == 'mask':
            mae = masked_mae_test(data_target_tensor.reshape(-1, 1), prediction.reshape(-1, 1), 0.0)
            rmse = masked_rmse_test(data_target_tensor.reshape(-1, 1), prediction.reshape(-1, 1), 0.0)
            mape = masked_mape_np(data_target_tensor.reshape(-1, 1), prediction.reshape(-1, 1), 0)
        else:
            mae = mean_absolute_error(data_target_tensor.reshape(-1, 1), prediction.reshape(-1, 1))
            rmse = mean_squared_error(data_target_tensor.reshape(-1, 1), prediction.reshape(-1, 1)) ** 0.5
            mape = masked_mape_np(data_target_tensor.reshape(-1, 1), prediction.reshape(-1, 1), 0)
        print('all MAE: %.2f' % (mae))
        print('all RMSE: %.2f' % (rmse))
        print('all MAPE: %.2f' % (mape))
        excel_list.extend([mae, rmse, mape])
        print(excel_list)
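masked_mape_np is used in both branches but never defined above. A hedged sketch of what such a masked MAPE typically computes (the masking convention and the *100 percentage scaling are assumptions):

```python
import numpy as np

def masked_mape_np(y_true, y_pred, null_val):
    """Mean absolute percentage error, computed only where the ground
    truth differs from null_val (zeros would otherwise divide by zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != null_val
    return np.mean(np.abs((y_pred[mask] - y_true[mask]) / y_true[mask])) * 100
```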

Print the keys and values stored in the model's state_dict:

    print('Net\'s state_dict:')
    total_param = 0
    for param_tensor in net.state_dict():
        print(param_tensor, '\t', net.state_dict()[param_tensor].size())
        total_param += np.prod(net.state_dict()[param_tensor].size())
    print('Net\'s total params:', total_param)

    print('Optimizer\'s state_dict:')
    for var_name in optimizer.state_dict():
        print(var_name, '\t', optimizer.state_dict()[var_name])

The training process

Reference: 深度学习模型训练全流程! (Datawhale's blog on CSDN: "The full workflow of training a deep learning model")

Training set (Train Set): used to train the model and fit its parameters.
Validation set (Validation Set): used to check model accuracy and tune hyperparameters.
Test set (Test Set): used to verify the model's generalization ability.
Because the training and validation sets are disjoint, the model's accuracy on the validation set reflects its generalization ability to some extent. When splitting off a validation set, its distribution should match the test set as closely as possible; otherwise validation accuracy loses its guiding value.
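A minimal sketch of a chronological split along these lines (the 6:2:2 ratios are an assumption, though they are common for traffic datasets such as PEMS04):

```python
import numpy as np

def split_dataset(data, train_ratio=0.6, val_ratio=0.2):
    """Split a time series chronologically into train/val/test, so the
    validation set is drawn from the same distribution as the test set
    (later time steps) rather than shuffled randomly."""
    n = len(data)
    train_end = int(n * train_ratio)
    val_end = int(n * (train_ratio + val_ratio))
    return data[:train_end], data[train_end:val_end], data[val_end:]
```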

    for epoch in range(start_epoch, epochs):

        params_filename = os.path.join(params_path, 'epoch_%s.params' % epoch)

        if masked_flag:
            val_loss = compute_val_loss_mstgcn(net, val_loader, criterion_masked, masked_flag, missing_value, sw, epoch)
        else:
            val_loss = compute_val_loss_mstgcn(net, val_loader, criterion, masked_flag, missing_value, sw, epoch)


        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_epoch = epoch
            torch.save(net.state_dict(), params_filename)
            print('save parameters to file: %s' % params_filename)

net.train(): entering training mode

Calling net.train() switches the model into training mode (enabling dropout and batch-norm statistics updates). Training then proceeds batch by batch over the training set; according to the paper, the training loss is MSE.

Every epoch runs through both the training set and the validation set, batch by batch.

        for batch_index, batch_data in enumerate(train_loader):
            encoder_inputs, labels = batch_data
            optimizer.zero_grad()
            outputs = net(encoder_inputs)

            if masked_flag:
                loss = criterion_masked(outputs, labels, missing_value)
            else:
                loss = criterion(outputs, labels)

            loss.backward()
            optimizer.step()

            training_loss = loss.item()
            global_step += 1
            sw.add_scalar('training_loss', training_loss, global_step)

            if global_step % 1000 == 0:
                print('global step: %s, training loss: %.2f, time: %.2fs' % (global_step, training_loss, time() - start_time))

My own understanding of the whole training procedure:

        For each epoch, we first iterate over every batch of the training set and update the model on each one. We then iterate over every batch of the validation set, without updating, and print the validation results to monitor how training is going; this is why the validation results matter more than the training ones. The validation set should be split the same way as the test set, so it acts as a mock exam on held-out data. Running validation every epoch lets us judge from the printed losses whether the number of epochs and other settings are appropriate, and adjust accordingly. After all epochs comes the real exam: evaluation on the test set. The test set is passed through only once, with no epochs, since nothing is being trained; all metrics describing the predictions (e.g. MAE, RMSE, MAPE) are printed. These numbers are what go into the paper: the final result of the whole project.
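The flow described above can be condensed into a skeleton (train_one_epoch, validate, and test are hypothetical placeholders standing in for the real training, validation, and test routines):

```python
def run_training(epochs, train_one_epoch, validate, test):
    """Overall flow: train and validate every epoch, track the best
    validation loss (the real code also saves a checkpoint there), and
    evaluate on the test set exactly once at the end."""
    best_val_loss, best_epoch = float('inf'), -1
    for epoch in range(epochs):
        train_one_epoch(epoch)       # batch-wise updates on the training set
        val_loss = validate(epoch)   # forward passes only, no backprop
        if val_loss < best_val_loss:
            best_val_loss, best_epoch = val_loss, epoch
    return best_epoch, test()        # the "real exam": run once, report metrics
```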

 
