mmsegmentation Cultivation Journey: Bugs (2)

Series contents
  1. mmsegmentation Cultivation Journey: Bugs (1)
  2. mmsegmentation Cultivation Journey: Bugs (2)
  3. mmsegmentation Cultivation Journey: Bugs (3)

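With the release of mmcv 2.0, more and more algorithm libraries have moved to the 2.0 architecture, which is built on mmengine. The overall workflow has changed substantially and is now clearer and simpler, so many of the workflow-related errors seen with earlier versions no longer occur.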

RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

 File "D:\ProgramData\Anaconda3\envs\openmmlab\lib\multiprocessing\spawn.py", line 134, in _check_not_importing_main
    raise RuntimeError('''
RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

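Cause:
PyTorch's multi-process data loading behaves differently on Linux than on Windows/macOS. On Linux, worker processes are created with fork, so each child gets a copy of the parent's address space and can reach the dataset and the Python function arguments directly. On Windows, worker processes are created with spawn: each child starts a fresh interpreter and imports the running script again. If the script has no guarded entry point, the child executes all of its top-level code in order, including the code that starts the worker processes. multiprocessing does not allow a new process to be started while the current one is still bootstrapping, so it raises the error above.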

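Solutions:
1. Wrap the main program in if __name__ == '__main__': to define the entry point. The main program's namespace is then '__main__', and multi-process loading keeps working.
2. Set num_workers=0 together with persistent_workers=False to load data in the main process only.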
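Both fixes can be sketched in a few lines. This is a minimal illustration that assumes a toy TensorDataset in place of a real segmentation dataset; build_loader and main are hypothetical names, not part of mmsegmentation.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def build_loader(num_workers):
    # Toy tensors standing in for a real dataset (illustrative data only).
    images = torch.randn(8, 3, 4, 4)
    labels = torch.randint(0, 2, (8,))
    dataset = TensorDataset(images, labels)
    return DataLoader(
        dataset,
        batch_size=4,
        num_workers=num_workers,
        # persistent_workers is only valid when num_workers > 0.
        persistent_workers=num_workers > 0,
    )


def main():
    # Fix 1: keep worker processes. This is safe because main() is only
    # called under the guard below, so spawned children do not re-run it.
    loader = build_loader(num_workers=2)
    for images, labels in loader:
        print(images.shape, labels.shape)


if __name__ == "__main__":
    main()
```

Calling build_loader(num_workers=0) instead corresponds to fix 2: all batches are produced in the main process, so no child process is started at all.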

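An analysis of what the if __name__ == "__main__" statement does when starting processes with multiprocessing.Process
How to understand if __name__ == '__main__' in Python
Official Python documentation
Official PyTorch documentation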

RuntimeError: Found dtype Float but expected Half / RuntimeError: Found dtype Double but expected Float

Traceback (most recent call last):
  File "F:/train_amp.py", line 214, in <module>
    train_net(net=net, device=device, train_data_path=train_data_path, val_data_path=val_data_path,
  File "F:/train_amp.py", line 101, in train_net
    loss = criterion(pred, label)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\modules\loss.py", line 613, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "C:\Users\Administrator\.conda\envs\mmlab\lib\site-packages\torch\nn\functional.py", line 3083, in binary_cross_entropy
    return torch._C._nn.binary_cross_entropy(input, target, weight, reduction_enum)
RuntimeError: Found dtype Float but expected Half

Process finished with exit code 1

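Both errors have the same cause: check the data types at the line that raised the error. In my case, pred and label had different dtypes when the loss was computed, and calling pred.float() made the error go away.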
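The check and the cast can be sketched as follows. The tensors are made up for illustration, and align_to_float32 is a hypothetical helper that casts both inputs to float32, which covers the Half/Float case as well as the Double/Float case.

```python
import torch
import torch.nn as nn


def align_to_float32(pred, label):
    """Cast both tensors to float32 so the loss sees a single dtype."""
    return pred.float(), label.float()


criterion = nn.BCELoss()

# Illustrative inputs: a half-precision prediction (as mixed-precision
# training can produce) and a double-precision label (as a NumPy array
# converted to a tensor often is).
pred = torch.rand(4, 1, 8, 8).half()
label = torch.randint(0, 2, (4, 1, 8, 8)).double()

# Inspect the dtypes first; a mismatch here is what triggers the error.
print(pred.dtype, label.dtype)

pred, label = align_to_float32(pred, label)
loss = criterion(pred, label)
print(loss.dtype)
```

Casting up to float32 right before the loss is a simple choice for a sketch; which side to cast in a real training loop depends on the precision you want the loss computed in.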

TypeError: 'tuple' object is not callable

# Code that raises the error
import torch.nn as nn

class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super(DoubleConv, self).__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1),
        self.batch = nn.BatchNorm2d(out_ch),
        self.ReLU = nn.ReLU(inplace=True),
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1),

    def forward(self, input):
        x = self.conv1(input)
        x = self.batch(x)
        x = self.ReLU(x)
        x = self.conv2(x)
        x = self.batch(x)
        x = self.ReLU(x)

        return x

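Check that the network is defined correctly. In the failing code, every layer definition ends with a stray comma, so Python stores each layer as a one-element tuple. Calling self.conv1(input) then tries to call a tuple, which raises the error.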
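For comparison, here is the same block with the trailing commas removed. The layer layout is kept exactly as in the failing code; the input shape at the bottom is an arbitrary choice for the sketch.

```python
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # No trailing commas: each attribute is the layer itself, not a tuple.
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.batch = nn.BatchNorm2d(out_ch)
        self.ReLU = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, input):
        x = self.conv1(input)
        x = self.batch(x)
        x = self.ReLU(x)
        x = self.conv2(x)
        x = self.batch(x)
        x = self.ReLU(x)
        return x


# A trailing comma turns the right-hand side into a one-element tuple.
with_comma = nn.ReLU(),
print(type(with_comma))

# The corrected block runs on an arbitrary 2 x 3 x 16 x 16 input.
out = DoubleConv(3, 8)(torch.randn(2, 3, 16, 16))
print(out.shape)
```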

OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.

OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program.
 That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is 
 to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of
  the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set 
  the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, 
  but that may cause crashes or silently produce incorrect results. For more information,
   please see http://www.intel.com/software/products/support/.

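This happens mainly because the Anaconda installation contains more than one libiomp5md.dll, and the extra copy conflicts with the libiomp5md.dll already linked in the runtime environment. One copy has to be removed so that only a single one is linked.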

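Solutions:
1. Add the following code to suppress the check. The program may still fail later, and its results become less trustworthy. Set the variable at the top of the script, before importing torch or numpy, so that it is in place when the OpenMP runtime loads.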

import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

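2. Delete the libiomp5md.dll in the current environment

  • If you are in Anaconda's base environment: delete …\Anaconda3\Library\bin\libiomp5md.dll
  • If you are in a named env (for example one called torch): delete …\Anaconda3\envs\torch\Library\bin\libiomp5md.dll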
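Before deleting anything, it can help to list every copy of the DLL in the environment. The sketch below uses only the standard library; find_openmp_dlls is a hypothetical helper name, and it only reports paths without modifying any file.

```python
from pathlib import Path


def find_openmp_dlls(root, name="libiomp5md.dll"):
    """Return every file called `name` below `root`, sorted by path."""
    return sorted(p for p in Path(root).rglob(name) if p.is_file())
```

Calling find_openmp_dlls(sys.prefix) after import sys, from inside the activated environment, lists the copies that environment can see. If more than one path comes back, that matches the duplicate-runtime situation described in the OMP hint.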

mmengine TypeError: train() missing 1 required positional argument: 'self'

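The Runner imported from mmengine has to be instantiated before train(), test() and similar methods can be called. In my case the code was simply wrong: the instance created from Runner was named runner, but the training call below it still used Runner. Check whether you have made the same mistake.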

# Incorrect code
runner = Runner.from_cfg(cfg)
Runner.train()
# Correct code
runner = Runner.from_cfg(cfg)
runner.train()
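The same mistake can be reproduced without mmengine. ToyRunner below is a hypothetical stand-in written for this sketch, not the real Runner class; it only shows why calling an instance method on the class itself fails.

```python
class ToyRunner:
    """Hypothetical stand-in for mmengine's Runner (not the real class)."""

    def __init__(self, cfg):
        self.cfg = cfg

    @classmethod
    def from_cfg(cls, cfg):
        # Build an instance from a config, mirroring the call in the post.
        return cls(cfg)

    def train(self):
        # An instance method: it needs `self`, i.e. an instance.
        return f"training with {self.cfg}"


# Calling the method on the class passes no instance, so `self` is missing.
try:
    ToyRunner.train()
except TypeError as err:
    message = str(err)
    print(message)

# Calling it on the instance works.
runner = ToyRunner.from_cfg({"max_epochs": 1})
result = runner.train()
print(result)
```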
