Logging model training with Wandb

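References: "Wandb: PyTorch model metric visualization and hyperparameter search", 鹿枫's blog on CSDN

"Wandb: the best assistant for model training", Zhihu

These are my own notes; typing everything out once makes it easier to understand. Please go and read the original authors!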

1 Introduction

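Wandb is short for "Weights & Biases".

It is a platform that helps developers build machine learning models better and faster.

It is lightweight and interactive, and lets you track versions, iterate on datasets, evaluate model performance, reproduce models, visualize results, and spot regressions.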

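What it does:

        Saves the hyperparameters used in each training run;

        Lets you search, compare, and visualize training runs;

        Monitors system hardware while a run is in progress, for example CPU and GPU utilization;

        Keeps a permanent, accessible record of your experiments;

        Lets you share training data within a team.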

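Dashboard: records the experiment process and visualizes the results;

Reports: saves and shares reproducible findings and conclusions;

Sweeps: optimizes the model by varying hyperparameters;

Artifacts: lets you build your own pipeline for storing datasets, models, and evaluation results.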
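To make the Sweeps idea concrete, here is a minimal sketch of a sweep configuration written as a plain Python dict. The parameter names (`learning_rate`, `batch_size`), their ranges, the project name, and `train_fn` are illustrative assumptions and do not come from the referenced posts; `method`, `metric`, and `parameters` are the top-level keys W&B expects in a sweep configuration.

```python
# A hypothetical sweep configuration; all values are illustrative only.
sweep_config = {
    "method": "random",  # search strategy: "grid", "random", or "bayes"
    "metric": {"name": "loss", "goal": "minimize"},  # metric the sweep optimizes
    "parameters": {
        "learning_rate": {"min": 0.0001, "max": 0.1},  # sampled from a range
        "batch_size": {"values": [64, 128, 256]},      # sampled from a list
    },
}

# With an account configured, the dict would be handed to wandb roughly like
# this (not executed here, because it needs network access):
#   sweep_id = wandb.sweep(sweep_config, project="pytorch-sweeps-demo")
#   wandb.agent(sweep_id, function=train_fn, count=10)  # train_fn is hypothetical
print(sorted(sweep_config["parameters"]))
```

The metric name has to match a key that the training function passes to wandb.log, otherwise the sweep has nothing to optimize.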

2 Related links:

wandb website: https://wandb.ai/site

wandb documentation in Chinese: https://docs.wandb.ai/v/zh-hans/

Common errors and fixes: https://docs.wandb.ai/guides/sweeps/faq

Colab example for visualizing model parameters: http://wandb.me/pytorch-colab

Colab example for hyperparameter search: https://colab.research.google.com/github/wandb/examples/blob/master/colabs/pytorch/Organizing_Hyperparameter_Sweeps_in_PyTorch_with_W%26B.ipynb

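3 Using wandb in code

Basic API:

wandb.init: initializes a new run at the top of the training script;

wandb.config: tracks hyperparameters;

wandb.log: logs changing metrics continuously inside the training loop;

wandb.save: saves files associated with the run, such as model weights;

wandb.restore: restores the state of the code when re-running a given run.

For any framework, the code for using wandb looks like this: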

# Flexible integration for any Python script
import wandb

# 1. Start a W&B run
wandb.init(project='gpt3')

# 2. Save model inputs and hyperparameters
config = wandb.config
config.learning_rate = 0.01

# Model training here; it is expected to produce `loss`

# 3. Log metrics over time to visualize performance
wandb.log({"loss": loss})

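4 Steps for using wandb in a project that contains a model

The code can be roughly divided into the following steps:
① import the packages;
② initialize a project;
③ set the parameters;
④ set up the model and the dataset;
⑤ track and log the model parameters;
⑥ save the model.

step 0:
Import the packages needed later and choose whether the device is a GPU or the CPU: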

import os
import random

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms
from tqdm.notebook import tqdm
import wandb

# Device configuration
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

step 1:

Set up the model's hyperparameters:

config = dict(
    epochs=5,
    classes=10,
    kernels=[16, 32],
    batch_size=128,
    learning_rate=0.005,
    dataset="MNIST",
    architecture="CNN")

step 2:

Define the overall pipeline:

def model_pipeline(hyperparameters):

    # tell wandb to get started
    with wandb.init(project="pytorch-demo", config=hyperparameters):
        # access all HPs through wandb.config, so logging matches execution!
        config = wandb.config

        # make the model, data, and optimization problem
        model, train_loader, test_loader, criterion, optimizer = make(config)
        print(model)

        # and use them to train the model
        train(model, train_loader, criterion, optimizer, config)

        # and test its final performance
        test(model, test_loader)

    return model
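The pipeline hands `wandb.config` to the other functions, which read hyperparameters as attributes such as `config.batch_size`. A minimal sketch, assuming you want to exercise those functions without starting a run: wrap the dict in `types.SimpleNamespace`. This is only a stand-in with the same attribute-style access, not wandb's own config class, and `describe` is a hypothetical helper.

```python
from types import SimpleNamespace

hyperparameters = dict(epochs=5, batch_size=128, learning_rate=0.005)

# Stand-in for wandb.config (an assumption for offline testing):
# it only mimics attribute-style access to the hyperparameters.
config = SimpleNamespace(**hyperparameters)


def describe(config):
    """Hypothetical helper that reads hyperparameters the way make/train do."""
    return (f"{config.epochs} epochs, batch size {config.batch_size}, "
            f"lr {config.learning_rate}")


summary = describe(config)
print(summary)
```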

step 3:

Define the concrete helper functions

Data preprocessing:

def get_data(slice=5, train=True):
    full_dataset = torchvision.datasets.MNIST(root=".",
                                              train=train, 
                                              transform=transforms.ToTensor(),
                                              download=True)
    #  equiv to slicing with [::slice] 
    sub_dataset = torch.utils.data.Subset(
      full_dataset, indices=range(0, len(full_dataset), slice))
    
    return sub_dataset


def make_loader(dataset, batch_size):
    loader = torch.utils.data.DataLoader(dataset=dataset,
                                         batch_size=batch_size, 
                                         shuffle=True,
                                         pin_memory=True, num_workers=2)
    return loader
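To see what the `slice=5` subsetting does to the dataset size, the arithmetic can be checked without downloading MNIST. This is a sketch that assumes the standard MNIST split (60,000 training and 10,000 test images) and the batch size of 128 from the config; `subset_indices` is a hypothetical helper that mirrors the `range(...)` call in get_data.

```python
import math


def subset_indices(n_items, slice_step=5):
    """Indices kept by range(0, n_items, slice_step), as in get_data."""
    return range(0, n_items, slice_step)


train_idx = subset_indices(60_000)  # standard MNIST training set size
test_idx = subset_indices(10_000)   # standard MNIST test set size

# The range-based selection matches extended slicing with [::5].
assert list(train_idx) == list(range(60_000))[::5]

# DataLoader keeps the last partial batch by default (drop_last=False).
batches_per_epoch = math.ceil(len(train_idx) / 128)
print(len(train_idx), len(test_idx), batches_per_epoch)
```

So each epoch trains on 12,000 images in 94 batches, the last of which is smaller than 128.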

step 4: define the model class:

# Conventional and convolutional neural network

class ConvNet(nn.Module):
    def __init__(self, kernels, classes=10):
        super(ConvNet, self).__init__()
        
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, kernels[0], kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(kernels[0], kernels[1], kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7 * 7 * kernels[-1], classes)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out
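The `7 * 7 * kernels[-1]` input size of the final linear layer follows from the standard output-size formula for convolution and pooling layers (dilation 1). A short sketch that traces a 28x28 MNIST image through the two blocks; `conv_out` is a hypothetical helper, not part of PyTorch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28  # MNIST images are 28x28
for _ in range(2):  # layer1 and layer2 use the same kernel/stride/padding
    size = conv_out(size, kernel=5, stride=1, padding=2)  # Conv2d keeps the size
    size = conv_out(size, kernel=2, stride=2)             # MaxPool2d halves it

kernels = [16, 32]
fc_in_features = size * size * kernels[-1]
print(size, fc_in_features)
```

The spatial size goes 28, 14, 7, so with 32 output channels the flattened feature vector has 1568 entries.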

step 5: build the data loaders, the model, the loss, and the optimizer:

def make(config):
    # Make the data
    train, test = get_data(train=True), get_data(train=False)
    train_loader = make_loader(train, batch_size=config.batch_size)
    test_loader = make_loader(test, batch_size=config.batch_size)

    # Make the model
    model = ConvNet(config.kernels, config.classes).to(device)

    # Make the loss and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(
        model.parameters(), lr=config.learning_rate)
    
    return model, train_loader, test_loader, criterion, optimizer

step 6: define the training log function:

def train_log(loss, example_ct, epoch):
    # Where the magic happens
    loss = float(loss)  # convert the 0-dim loss tensor to a plain Python float
    wandb.log({"epoch": epoch, "loss": loss}, step=example_ct)
    print(f"Loss after {example_ct:05d} examples: {loss:.3f}")

      

step 7: train the model:

def train(model, loader, criterion, optimizer, config):
    # Tell wandb to watch what the model gets up to: gradients, weights, and more!
    wandb.watch(model, criterion, log="all", log_freq=10)

    # Run training and track with wandb
    total_batches = len(loader) * config.epochs
    example_ct = 0  # number of examples seen
    batch_ct = 0
    for epoch in tqdm(range(config.epochs)):
        for _, (images, labels) in enumerate(loader):

            loss = train_batch(images, labels, model, optimizer, criterion)
            example_ct +=  len(images)
            batch_ct += 1

            # Report metrics every 25th batch
            if ((batch_ct + 1) % 25) == 0:
                train_log(loss, example_ct, epoch)


def train_batch(images, labels, model, optimizer, criterion):
    images, labels = images.to(device), labels.to(device)
    
    # Forward pass ➡
    outputs = model(images)
    loss = criterion(outputs, labels)
    
    # Backward pass ⬅
    optimizer.zero_grad()
    loss.backward()

    # Step with optimizer
    optimizer.step()

    return loss
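It can help to know exactly when the training loop above calls train_log and which `step` value it passes to wandb. The sketch below replays only the counters of that loop, with no model or data, assuming 12,000 training examples per epoch (MNIST subsampled with `slice=5`), a batch size of 128, and 5 epochs; `logging_schedule` is a hypothetical helper.

```python
def logging_schedule(n_examples, batch_size, epochs, every=25):
    """Return the (batch_ct, example_ct) pairs at which train would log."""
    logged = []
    example_ct = 0
    batch_ct = 0
    for _ in range(epochs):
        remaining = n_examples
        while remaining > 0:
            this_batch = min(batch_size, remaining)  # last batch may be smaller
            remaining -= this_batch
            example_ct += this_batch
            batch_ct += 1
            if (batch_ct + 1) % every == 0:  # same condition as in train
                logged.append((batch_ct, example_ct))
    return logged


schedule = logging_schedule(n_examples=12_000, batch_size=128, epochs=5)
print(schedule[:2], len(schedule))
```

Because the counter is incremented before `(batch_ct + 1) % 25` is tested, the first report lands on batch 24 rather than 25; testing `batch_ct % 25 == 0` instead would report on batches 25, 50, and so on.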

step 8: evaluate on the test set and log the result:

def test(model, test_loader):
    model.eval()

    # Run the model on some test examples
    with torch.no_grad():
        correct, total = 0, 0
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        print(f"Accuracy of the model on the {total} " +
              f"test images: {100 * correct / total}%")
        
        wandb.log({"test_accuracy": correct / total})

    # Save the model in the exchangeable ONNX format
    torch.onnx.export(model, images, "model.onnx")
    wandb.save("model.onnx")
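The accuracy computation in test takes, for each image, the class with the highest score and compares it with the label. A pure-Python sketch of the same idea on a tiny made-up batch (the scores and labels are invented for illustration; `accuracy` is a hypothetical helper):

```python
def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    correct = 0
    for row, label in zip(logits, labels):
        # Index of the largest score, like the second output of torch.max(..., 1).
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up scores for 4 examples and 3 classes.
logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.4, 0.3, 0.2]]
labels = [1, 0, 2, 1]
acc = accuracy(logits, labels)
print(acc)
```

Note that test prints the accuracy as a percentage but logs `correct / total` to wandb as a fraction between 0 and 1.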

step 9: define the parameters for training and testing (the same config as in step 1):

config = dict(
    epochs=5,
    classes=10,
    kernels=[16, 32],
    batch_size=128,
    learning_rate=0.005,
    dataset="MNIST",
    architecture="CNN")

step 10: run the pipeline; the run page then shows every logged metric

# Build, train and analyze the model with the pipeline
model = model_pipeline(config)
