PYTORCH DISTRIBUTED OVERVIEW
Distributed Data-Parallel Training (DDP) is a widely adopted single-program multiple-data training paradigm. With DDP, the model is replicated on every process, and every model replica will be fed with a different set of input data samples. DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
PyTorch provides several options for data-parallel training. For applications that gradually grow from simple to complex and from prototype to production, the common development trajectory is to start on a single GPU, move to DataParallel for single-machine multi-GPU training, and then to DistributedDataParallel for multi-GPU and multi-node training.
Checking the CPU resources available on one node of the cluster (ns3185995):
#SBATCH --gres=gpu:2
#SBATCH --nodelist=ns3185995
import psutil
print(psutil.cpu_count(logical=False))  # 32 physical cores
import multiprocessing
print('number of logical CPUs =', multiprocessing.cpu_count())  # 64
How many CPUs (#SBATCH -c) can be allocated per task on one node:
1 task on one node: max 60
4 tasks on one node: 16 + 16 + 16 + 10
3 tasks on one node: 20 + 20 + 20
2 tasks on one node: 30 + 30
OPTIONAL: DATA PARALLELISM
The DataParallel package enables single-machine multi-GPU parallelism with the lowest coding hurdle. It only requires a one-line change to the application code. Although DataParallel is very easy to use, it usually does not offer the best performance because it replicates the model in every forward pass, and its single-process multi-thread parallelism naturally suffers from GIL contention. To get better performance, consider using DistributedDataParallel.
It is natural to execute the forward and backward passes on multiple GPUs. However, PyTorch only uses one GPU by default. You can easily run your operations on multiple GPUs by making the model run in parallel with DataParallel:
model = model.cuda()
model = nn.DataParallel(model)
Although DataParallel is easier to use (simply wrap a single-GPU model), a single process computes the model weights and then distributes them to every GPU on each batch, so communication quickly becomes a bottleneck and GPU utilization is usually very low. Also, nn.DataParallel requires all GPUs to be on the same node.
DistributedDataParallel works differently from DataParallel: only gradients are exchanged between the processes.
Taking 2 GPUs as an example, GPU 0 computes the gradient of the data assigned to it and passes that gradient to GPU 1; GPU 1 computes the gradient of its own data and passes it to GPU 0.
As a result, each GPU ends up holding the gradient of the whole batch, and each GPU then performs its own gradient-descent step. Because the parameters, gradients, and optimizer state are identical everywhere, the parameters remain consistent even though each GPU updates them independently.
Scatter-reduce: first, the data to be reduced (the gradients) is divided into N chunks, and neighbouring GPUs exchange different chunks. After N-1 steps, each GPU holds the fully accumulated sum of one chunk.
All-gather: once each chunk has been accumulated, one more round of N-1 passes synchronizes the accumulated chunks to all GPUs.
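To make the synchronization step concrete, here is a minimal runnable sketch (not DDP's internal code) of the collective it relies on: every process contributes its local gradient, an all-reduce sums it across processes, and the result is averaged so all replicas take the same step. It assumes the CPU-only gloo backend and a free local port 29512.
# all_reduce_demo.py - illustration of the gradient all-reduce that DDP performs
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29512'          # assumed free port
    dist.init_process_group('gloo', rank=rank, world_size=world_size)
    grad = torch.full((4,), float(rank))         # stand-in for this replica's local gradient
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # sum the gradients across all processes
    grad /= world_size                           # DDP averages gradients across replicas
    print(f'rank {rank}: synchronized gradient {grad.tolist()}')
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
On GPUs, NCCL carries out the same all-reduce using the scatter-reduce and all-gather passes described above.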
DISTRIBUTED COMMUNICATION PACKAGE - TORCH.DISTRIBUTED
torch.distributed supports three built-in backends (Gloo, MPI, and NCCL), each with different capabilities for CPU and CUDA tensors. MPI supports CUDA only if the implementation used to build PyTorch supports it.
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, reduce-scatter as well as point-to-point send and receive that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox Network across nodes.
Leading deep learning frameworks such as Caffe2, Chainer, MxNet, PyTorch and TensorFlow have integrated NCCL to accelerate deep learning training on multi-GPU multi-node systems.
NCCL is available for download as part of the NVIDIA HPC SDK and as a separate package for Ubuntu and Red Hat (see "Installing NCCL on Ubuntu" in the NVIDIA documentation).
Note: torch.cuda.nccl.version() reports the NCCL version PyTorch was built with, and setting os.environ["NCCL_DEBUG"] = "INFO" makes NCCL print diagnostic messages that help track down errors.
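For example, a quick sanity check along these lines can be run on the node before launching a job (a small sketch; it only prints version and availability information):
import os
import torch
import torch.distributed as dist

os.environ["NCCL_DEBUG"] = "INFO"   # set before NCCL is first used so its logs are printed
print("NCCL version:", torch.cuda.nccl.version())
print("NCCL backend available:", dist.is_nccl_available())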
The first and most complicated thing you need to deal with is process initialization.
There are as many synchronized copies of the training script as there are GPUs in the cluster, and each GPU runs in its own process.
DDP relies on c10d ProcessGroup for communications. Hence, applications must create ProcessGroup instances before constructing DDP.
# multi_init.py
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import os

def init_process(rank, size, backend='gloo'):
    """ Initialize the distributed environment. """
    dist.init_process_group(backend, init_method='tcp://127.0.0.1:23456',
                            rank=rank, world_size=size)

def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    print(
        f"Rank {rank + 1}/{world_size} process initialized.\n"
    )
    # rest of the training script goes here!

if __name__ == "__main__":
    WORLD_SIZE = torch.cuda.device_count()
    NUM_EPOCHS = 5
    mp.spawn(
        train, args=(NUM_EPOCHS, WORLD_SIZE),
        nprocs=torch.cuda.device_count(), join=True
    )
mp.spawn launches one process per GPU (four on this node), each with a rank of 0, 1, 2 or 3. The rank 0 process is given some additional responsibilities and is therefore called the master process.
Let’s have a look at the init_process function. It ensures that every process will be able to coordinate through a master, using the same ip address and port. Note that we used the gloo backend but other backends are available.
The package needs to be initialized using the torch.distributed.init_process_group() function before calling any other methods. This blocks until all processes have joined.
torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None)
Currently three initialization methods are supported: TCP, a shared file system, and environment variables.
import torch.distributed as dist
# Use address of one of the machines
dist.init_process_group(backend, init_method='tcp://10.1.1.20:23456',
rank=args.rank, world_size=4)
host_name (str) – The hostname or IP Address the server store should run on.
port (int) – The port on which the server store should listen for incoming requests.
If we use multiple nodes, the IP must be the address of the rank 0 node of the cluster; for example, if the rank 0 node is ns3186000, then init_method='tcp://135.125.87.167:23500'. Any unused port can be chosen.
If we use a single node, the IP and port can be chosen freely.
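As a small aside (an addition, not part of the original notes), one way to check that a candidate port is actually unused on the machine before putting it into the tcp:// URL:
import socket

def port_is_free(port, host='127.0.0.1'):
    # connect_ex returns 0 only if something is already listening on host:port
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

print(port_is_free(23500))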
import torch.distributed as dist
# rank should always be specified
dist.init_process_group(backend, init_method='file:///mnt/nfs/sharedfile',
world_size=4, rank=args.rank)
Our nodes share the /home path, so we can specify a non-existing file on a path we have write access to, for example file:///home/yihao/sharedfile.
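The rendezvous file must not be left over from a previous run (the multi-node section below notes that it has to be deleted manually after each execution). A tiny helper script along these lines, run once before resubmitting the jobs, is one way to handle that; the path is the example path above:
# cleanup_sharedfile.py - remove the stale rendezvous file left by a previous run
import os

SHARED_FILE = '/home/yihao/sharedfile'

if os.path.exists(SHARED_FILE):
    os.remove(SHARED_FILE)
    print(f'removed stale {SHARED_FILE}')
else:
    print('nothing to clean up')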
Environment variable initialization
This method will read the configuration from environment variables, allowing one to fully customize how the information is obtained. The variables to be set are:
MASTER_PORT - required; has to be a free port on machine with rank 0
MASTER_ADDR - required (except for rank 0); address of rank 0 node
WORLD_SIZE - required; can be set either here, or in a call to init function
RANK - required; can be set either here, or in a call to init function
import os
import torch.distributed as dist

def init_process(rank, size, backend='gloo'):
    """ Initialize the distributed environment. """
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group(backend, rank=rank, world_size=size)
Now that we understand the initialization process, we can start to complete the body of the train method.
def train(rank, num_epochs, world_size):
    init_process(rank, world_size)
    print(
        f"{rank + 1}/{world_size} process initialized.\n"
    )
    # rest of the training script goes here!
Each of our four training processes runs this function until completion, then exits when done. If we run this code now (via python multi_init.py) we will see something like this on the console:
$ python multi_init.py
1/4 process initialized.
3/4 process initialized.
2/4 process initialized.
4/4 process initialized.
These processes run independently, and there is no guarantee of what state any one of them is in at a given point in the training loop, so some modifications to the training procedure are required.
Please note that since DDP broadcasts the model state from the rank 0 process to all other processes in the DDP constructor, you don't need to worry about different DDP processes starting from different initial parameter values.
# import torch.distributed as dist
if rank == 0:
    downloading_dataset()
    downloading_model_weights()

dist.barrier()
print(
    f"Rank {rank + 1}/{world_size} training process passed data download barrier.\n"
)
The dist.barrier() call in this sample blocks every process until the master process (rank == 0) has finished downloading_dataset() and downloading_model_weights(). This isolates all network I/O in one process and prevents the other processes from moving on before the data is available. Two further changes are needed: each process wraps its dataset in a DistributedSampler so that every replica sees a different shard of the data, and each process pins itself to its own GPU (local_rank) before moving the model, images, and labels there.
>>> sampler = DistributedSampler(dataset) if is_distributed else None
>>> loader = DataLoader(dataset, shuffle=(sampler is None),
...                     sampler=sampler)
>>> for epoch in range(start_epoch, n_epochs):
...     if is_distributed:
...         sampler.set_epoch(epoch)
...     train(loader)
torch.cuda.set_device(args.local_rank)
device = torch.device("cuda", args.local_rank)
model = model.cuda(args.local_rank)
image = image.cuda(args.local_rank)
label = label.cuda(args.local_rank)
if rank == 0:
    filepath = args.weights_dir
    # create the weights folder if it does not exist
    if not os.path.exists(filepath):
        os.makedirs(filepath)
    path = './weights/model' + str(epoch) + '.pth'
    # torch.save(model.state_dict(), path)
    torch.save(model.module.state_dict(), path)  # multi-GPU: save the wrapped module's weights
model = DistributedDataParallel(model)
#SBATCH --gres=gpu
#SBATCH --nodelist=ns3185995
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, input_x):
        x = self.pool(F.relu(self.conv1(input_x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        output_x = self.fc3(x)
        print("\tIn Model: input size", input_x.size(),
              "output size", output_x.size())
        return output_x
if __name__ == "__main__":
    print('nccl version')
    print(torch.cuda.nccl.version())
    print('pytorch version')
    print(torch.__version__)
    print('cuda', torch.cuda.is_available())
    print('gpu number', torch.cuda.device_count())
    for i in range(torch.cuda.device_count()):
        print(torch.cuda.get_device_name(i))

    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    batch_size = 64
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=2)
    testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                           download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=2)

    net = Net()
    net = net.cuda()

    print('Start Training')
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
    for epoch in range(2):  # loop over the dataset multiple times
        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs = inputs.cuda()
            labels = labels.cuda()
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # print statistics
            running_loss += loss.item()
            if i % 200 == 0:  # print every 200 mini-batches
                print(f'[{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}')
    print('Finished Training')

    PATH = './cifar_net.pth'
    torch.save(net.state_dict(), PATH)
nccl version
(2, 12, 9)
pytorch version
1.12.0a0+2c916ef
cuda True
gpu number 1
Tesla V100S-PCIE-32GB
Files already downloaded and verified
Files already downloaded and verified
Start Training
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
[1, 1] loss: 0.001
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([64, 3, 32, 32]) output size torch.Size([64, 10])
In Model: input size torch.Size([16, 3, 32, 32]) output size torch.Size([16, 10])
Finished Training
#SBATCH --gres=gpu:2
#SBATCH --nodelist=ns3185995
net = Net()
net = net.cuda()
net = nn.DataParallel(net)
filepath = './weights/model'
# create the weights folder if it does not exist
if not os.path.exists(filepath):
    os.makedirs(filepath)
path = './weights/model' + str(epoch) + '.pth'
# torch.save(net.state_dict(), path)
torch.save(net.module.state_dict(), path)  # multi-GPU: save the wrapped module's weights
nccl version
(2, 12, 9)
pytorch version
1.12.0a0+2c916ef
cuda True
gpu number 2
Tesla V100S-PCIE-32GB
Tesla V100S-PCIE-32GB
Files already downloaded and verified
Files already downloaded and verified
Start Training
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
[1, 1] loss: 0.001
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
In Model: input size torch.Size([32, 3, 32, 32]) output size torch.Size([32, 10])
.......
In Model: input size torch.Size([8, 3, 32, 32]) output size torch.Size([8, 10])
In Model: input size torch.Size([8, 3, 32, 32]) output size torch.Size([8, 10])
Finished Training
Two methods: the first script launches one process per GPU on each node with mp.spawn; the second launches a single process per node and lets DistributedDataParallel drive all of that node's GPUs.
import argparse
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
import numpy as np
import os
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
from torch.optim.lr_scheduler import StepLR
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--init-method', type=str, default='tcp://127.0.0.1:23456')
parser.add_argument('--rank', type=int)
parser.add_argument('--world-size',type=int)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, input_x):
        x = self.pool(F.relu(self.conv1(input_x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)  # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        output_x = self.fc3(x)
        # print("\tIn Model: input size", input_x.size(),
        #       "output size", output_x.size())
        return output_x
def main():
    random.seed(76)
    torch.manual_seed(76)
    cudnn.deterministic = True
    torch.backends.cudnn.deterministic = True
    np.random.seed(76)

    args = parser.parse_args()
    ngpus_per_node = torch.cuda.device_count()
    print('###############################################')
    print('Total node number =', args.world_size)
    print('Total GPU number =', ngpus_per_node * args.world_size)
    print(f"Node {args.rank + 1}/{args.world_size} for training.\n")
    print("Use GPU: {} for training".format(torch.cuda.device_count()))
    print('###############################################')
    if not args.world_size == 1:
        print('now waiting another node start job ......')
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
def main_worker(gpu, ngpus_per_node, args):
    world_size = ngpus_per_node * args.world_size
    rank = args.rank * ngpus_per_node + gpu
    # rank : global rank
    # gpu : local rank
    dist.init_process_group(init_method=args.init_method, backend="nccl",
                            world_size=world_size, rank=rank, group_name="pytorch_test")
    print(f"Rank {rank + 1}/{world_size} training process initialized.\n")
    # For multiprocessing distributed, the DistributedDataParallel constructor
    # should always set the single device scope, otherwise
    # DistributedDataParallel will use all available devices.
    torch.cuda.set_device(gpu)
    model = Net()
    model.cuda(gpu)
    # When using a single GPU per process and per
    # DistributedDataParallel, we need to divide the batch size
    # ourselves based on the total number of GPUs of the current node.
    batch_size = 64
    batch_size = int(batch_size / ngpus_per_node)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu], output_device=gpu)
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Data loading code
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    if rank == 0:
        # only rank 0 downloads the dataset; the other processes wait at the barrier
        torchvision.datasets.CIFAR10(root='./data', train=True,
                                     download=True, transform=transform)
    dist.barrier()
    train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                                 download=True, transform=transform)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=(train_sampler is None),
        num_workers=0, sampler=train_sampler)

    num_epochs = 10
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)
        # switch to train mode
        model.train()
        running_loss = []
        for i, (images, target) in enumerate(train_loader):
            images = images.cuda(gpu)
            target = target.cuda(gpu)
            output = model(images)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss.append(loss.item())
        if rank == 0:
            if not os.path.exists('./checkpoints/'):
                os.mkdir('./checkpoints/')
            torch.save(
                model.state_dict(),
                f'./checkpoints/model_{epoch}.pth'
            )
        print(
            f'Finished epoch {epoch}, rank {rank + 1}/{world_size}. '
            f'Avg Loss: {np.mean(running_loss)}; Median Loss: {np.median(running_loss)}.\n'
        )
        dist.barrier()
if __name__ == "__main__":
    main()
    print('finish.....')
#!/bin/bash
#SBATCH -J LIYIHAO
#SBATCH --output=cataract_RL_%j.out
#SBATCH --gres=gpu:4
#SBATCH --nodelist=ns3186001
srun singularity exec --bind /home/shared:/home/shared --nv /home/yihao/2022/docker/monai_new.simg python3 test.py --init-method tcp://127.0.0.1:23500 --rank 0 --world-size 1
world-size : total number of nodes
global rank : which node (the --rank argument)
local rank : the gpu variable in the code
init-method : one of the two initialization methods described below
(A small sketch of how these ranks combine is shown after this explanation.)
There are two initialization methods; we can choose either one.
1. by file : file:///home/yihao/sharedfile # manually delete this file after each execution
2. by tcp : tcp://IP_OF_NODE0:FREEPORT, e.g. tcp://135.125.87.167:23500
For multi-node training we need to launch the job on each node. The nodes may be started in any order; if the other node's task has not started yet, the script prints "now waiting another node start job ......".
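To make the rank bookkeeping concrete, here is a small sketch (assuming 2 nodes with 4 GPUs each, matching the mp.spawn launch above) of how the node rank, local rank, and global rank fit together:
nodes = 2                                    # --world-size (number of nodes)
ngpus_per_node = 4                           # torch.cuda.device_count() on each node
ddp_world_size = nodes * ngpus_per_node      # world_size passed to init_process_group
for node_rank in range(nodes):               # --rank given to each node's job script
    for gpu in range(ngpus_per_node):        # local rank of a process on that node
        global_rank = node_rank * ngpus_per_node + gpu
        print(f'node {node_rank}, local rank {gpu} -> global rank {global_rank}/{ddp_world_size}')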
#!/bin/bash
#SBATCH -J LIYIHAO
#SBATCH --output=cataract_RL_%j.out
#SBATCH --gres=gpu:4
#SBATCH --nodelist=ns3186001
srun singularity exec --bind /home/shared:/home/shared --nv /home/yihao/2022/docker/monai_new.simg python3 test0.py --init-method file:///home/yihao/sharedfile --rank 0 --world-size 2
#!/bin/bash
#SBATCH -J LIYIHAO
#SBATCH --output=cataract_RL_%j.out
#SBATCH --gres=gpu:4
#SBATCH --nodelist=ns3186002
srun singularity exec --bind /home/shared:/home/shared --nv /home/yihao/2022/docker/monai_new.simg python3 test0.py --init-method file:///home/yihao/sharedfile --rank 1 --world-size 2
import argparse
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import random
import numpy as np
import os
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
from torch.optim.lr_scheduler import StepLR
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--init-method', type=str, default='tcp://127.0.0.1:23456')
parser.add_argument('--rank', type=int)
parser.add_argument('--world-size',type=int)
def main():
    random.seed(76)
    torch.manual_seed(76)
    cudnn.deterministic = True
    torch.backends.cudnn.deterministic = True
    np.random.seed(76)

    args = parser.parse_args()
    ngpus_per_node = torch.cuda.device_count()
    print('###############################################')
    print('Total node number =', args.world_size)
    print('Total GPU number =', ngpus_per_node * args.world_size)
    print(f"Node {args.rank + 1}/{args.world_size} for training.\n")
    print("Use GPU: {} for training".format(torch.cuda.device_count()))
    print('###############################################')
    if not args.world_size == 1:
        print('now waiting another node start job ......')
    main_worker(args.rank, ngpus_per_node, args)
def main_worker(gpu, ngpus_per_node, args):
    world_size = args.world_size
    rank = args.rank
    dist.init_process_group(init_method=args.init_method, backend="nccl",
                            world_size=world_size, rank=rank, group_name="pytorch_test")
    print(f"Rank {rank + 1}/{world_size} training process initialized.\n")
    # Here a single process drives all of the node's GPUs, so we do not set a
    # single device scope; DistributedDataParallel will use all available devices.
    # torch.cuda.set_device(gpu)
    model = models.resnet18()
    # change the fully connected layer to output 10 classes
    num_ftrs = model.fc.in_features
    model.fc = nn.Linear(num_ftrs, 10)
    model.cuda()
    # When using a single GPU per process and per
    # DistributedDataParallel, we need to divide the batch size
    # ourselves based on the total number of GPUs of the current node.
    batch_size = 64
    batch_size = int(batch_size / ngpus_per_node)
    model = torch.nn.parallel.DistributedDataParallel(model)
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    # Data loading code
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    if rank == 0:
        torchvision.datasets.CIFAR10(root='./data', train=True,
                                     download=True, transform=transform)
    dist.barrier()
    train_dataset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                                 download=True, transform=transform)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=batch_size, shuffle=(train_sampler is None),
        num_workers=0, sampler=train_sampler)

    # print('before train is ok')
    num_epochs = 1
    for epoch in range(num_epochs):
        train_sampler.set_epoch(epoch)
        # switch to train mode
        model.train()
        running_loss = []
        for i, (images, target) in enumerate(train_loader):
            images = images.cuda()
            target = target.cuda()
            output = model(images)
            loss = criterion(output, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running_loss.append(loss.item())
        print(
            f'Finished epoch {epoch}, rank {rank + 1}/{world_size}. '
            f'Avg Loss: {np.mean(running_loss)}; Median Loss: {np.median(running_loss)}.\n'
        )
if __name__ == "__main__":
    main()
    print('finish.....')
#!/bin/bash
#SBATCH -J LIYIHAO
#SBATCH --output=cataract_RL_%j.out
#SBATCH --gres=gpu:4
#SBATCH --nodelist=ns3186001
srun singularity exec --bind /home/shared:/home/shared --nv /home/yihao/2022/docker/monai_new.simg python3 test1.py --init-method tcp://127.0.0.1:23500 --rank 0 --world-size 1
#!/bin/bash
#SBATCH -J LIYIHAO
#SBATCH --output=cataract_RL_%j.out
#SBATCH --gres=gpu:4
#SBATCH --nodelist=ns3186001
srun singularity exec --bind /home/shared:/home/shared --nv /home/yihao/2022/docker/monai_new.simg python3 test1.py --init-method file:///home/yihao/sharedfile33 --rank 0 --world-size 2
#!/bin/bash
#SBATCH -J LIYIHAO
#SBATCH --output=cataract_RL_%j.out
#SBATCH --gres=gpu:4
#SBATCH --nodelist=ns3186002
srun singularity exec --bind /home/shared:/home/shared --nv /home/yihao/2022/docker/monai_new.simg python3 test1.py --init-method file:///home/yihao/sharedfile33 --rank 1 --world-size 2
cifar10
| | No DP | DP | DDP (1 node) | DDP (2 nodes) |
| --- | --- | --- | --- | --- |
| GPU | 1 | 4 | 4 | 8 |
| CPU | 1 | 1 | 1 | 1 |
| batch_size | 64 | 64*4 | 64*4 | 64*8 |
| loss (epoch 0) | 2.285884 | 2.304621 | 2.306401 | 2.304969 |
| loss (epoch 24) | 0.963238 | 1.521069 | 1.445795 | |
| loss (epoch 49) | 0.644783 | 1.226753 | 1.176290 | |
| time (50 epochs) | 576.94s | 555.67s | 1438.51s | more than 3000s |
OCT-A
| | TEST1 | TEST2 | TEST3 | DP | DDP |
| --- | --- | --- | --- | --- | --- |
| GPU | 1 | 1 | 1 | 4 | 4 |
| CPU | 1 | 8 | 12 | 1 | 1 |
| batch_size | 4 | 4 | 4 | 12 | 12 |
| loss (epoch 0) | 0.611296 | 0.675859 | 0.611652 | 0.600055 | 0.725467 |
| loss (epoch 9) | 0.549468 | 0.543123 | 0.569846 | 0.544406 | 0.523935 |
| time (10 epochs) | 55248.91s | 44117.22s | 43300.80s | 17612.83s | 20745.87s |
| | TEST4 | TEST5 | DP | DDP |
| --- | --- | --- | --- | --- |
| GPU | 1 | 1 | 4 | 4 |
| SBATCH -c | 58 | 30 | 56 | 60 |
| num_worker | 50 | 28 | 12 | 14 |
| batch_size | 4 | 4 | 12 | 12 |
| loss (epoch 0) | 0.650601 | 0.623308 | 0.620667 | 0.719268 |
| loss (epoch 9) | 0.504250 | 0.546000 | 0.545852 | 0.514747 |
| time (10 epochs) | 2666.15s | 4547.34s | 3185.96s | 1696.70s |
| DDP | TEST1 | TEST2 | TEST3 | TEST4 | TEST5 | TEST6 | TEST7 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPU | 4 | 2 | 4 | 4 | 4 | 4 | 4 |
| SBATCH -c | 60 | 60 | 60 | 60 | 60 | 60 | 60 |
| num_worker | 14 | 28 | 8 | 12 | 20 | 32 | 54 |
| time (10 epochs) | 1696s | 2201s | 2297s | 1582s | 1850s | 3014s | 3112s |
Thanks to this exploration of the OVH cluster, we managed to reduce the training time considerably. The training time for 10 epochs drops from 55000s to about 1700s (more than 30x).
The current best configuration is #SBATCH -c 60 + DistributedDataParallel (4 GPUs). It takes full advantage of almost all of the CPU and GPU resources of one node.
I believe there are two bottlenecks in the training time: one is data processing (loading and augmentation), and the other is training the network itself.
Increasing the number of CPUs/threads (i.e. raising the upper limit on num_worker) greatly reduces the data-processing time, while distributed training speeds up the network training.
Problem: a summary of hangs in PyTorch multi-node multi-GPU training
Problem: issues with PyTorch training getting stuck or hanging
Launch methods: PyTorch distributed training basics – using DDP
Single-node multi-GPU code: distributed multi-GPU model training in PyTorch with DistributedDataParallel
Example project code: ImageNet training in PyTorch
Code example: PyTorch: Multi-GPU and multi-node data parallelism
Code example: Distributed data parallel training using Pytorch on the multiple nodes of CSC and Narvi clusters
Explanation: PyTorch distributed training
DDP principles: [Distributed training] the right way to do single-node multi-GPU (part 1): theoretical basics
Pitfalls: notes on pitfalls of PyTorch distributed multi-GPU training with DistributedDataParallel