PyTorch Learning (15) ------ Miscellaneous Knowledge Roundup

Overview

Jotting down some small bits of knowledge for easy lookup. (Continuously updated)

  1. How to write a custom RNN: Implementation of Multiplicative LSTM
  2. How to write custom second-order gradients: How to implement customized layer with second order derivatives /
    How to customize the double backward
  3. Feeding a Custom Gradient into LSTMs
  4. Discussion of retain_graph
  5. Defining the parameters that need to be updated ------ 2019.3.13

Notes on defining the parameters that need to be updated

Simply put, the parameters we need to update are usually defined in the __init__ function. If an existing layer such as Conv2d or Linear fits, just use it directly: these layers are all Module subclasses that already define weights, bias, and so on as Parameter objects, with requires_grad=True by default. The important point is that only Parameter objects are picked up by the optimizer. This works because the network we define itself inherits from the nn.Module class.
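As a minimal sketch of this (TinyNet and its attribute names are made up for illustration), only attributes registered as nn.Parameter or as sub-Modules show up in model.parameters(); a plain tensor attribute does not:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.fc = nn.Linear(4, 2)                  # sub-Module: its weight/bias are Parameters
        self.scale = nn.Parameter(torch.ones(1))   # registered, visible to the optimizer
        self.offset = torch.zeros(1)               # plain tensor: NOT registered

    def forward(self, x):
        return self.fc(x) * self.scale + self.offset

net = TinyNet()
# prints 'scale', 'fc.weight', 'fc.bias' -- but not 'offset'
print([name for name, _ in net.named_parameters()])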

optimizer

In general, model.parameters() from the Module class returns all parameters in the network that are registered as Parameter objects, while model.zero_grad() zeroes the gradients of all of those Parameter objects.
A manual parameter update therefore typically looks like:

with torch.no_grad():
   # the simplest gradient descent
   for p in model.parameters(): p -= p.grad * lr
   model.zero_grad()

To simplify the parameter update further, PyTorch provides optimizer classes, so the update becomes:

my_optimizer.step()
my_optimizer.zero_grad()
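For example (a minimal sketch; model, loss_fn, data and target are placeholders assumed to exist), a full training step with torch.optim.SGD looks like:

import torch.optim as optim

my_optimizer = optim.SGD(model.parameters(), lr=0.01)

output = model(data)
loss = loss_fn(output, target)

my_optimizer.zero_grad()   # clear the old gradients of every registered Parameter
loss.backward()            # accumulate fresh gradients
my_optimizer.step()        # apply the SGD update in-place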

Looking at the definition of zero_grad in optimizer.py, we can see that it zeroes the gradients of every parameter that was handed to the optimizer.

    def zero_grad(self):
        r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    p.grad.detach_()
                    p.grad.zero_()

The .step() function, in contrast, is implemented separately by each optimization method.
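As an illustration only (a hedged sketch, not the actual torch.optim.SGD source; PlainSGD is a made-up name), a bare-bones optimizer could implement .step() like this:

import torch
from torch.optim.optimizer import Optimizer

class PlainSGD(Optimizer):
    def __init__(self, params, lr=0.01):
        super(PlainSGD, self).__init__(params, dict(lr=lr))

    def step(self, closure=None):
        with torch.no_grad():
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is not None:
                        p -= group['lr'] * p.grad   # plain gradient descent

# usage: opt = PlainSGD(model.parameters(), lr=0.01)

Back to the question of which parameters the optimizer can actually see, consider the following network definition: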

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNet3(nn.Module):
    def __init__(self):
        super(MyNet3, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.relu1 = nn.ReLU(True)
     #  move this to forward
     #  self.conv2 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, input):
        # created anew on every forward pass
        self.conv2_filters = nn.Parameter(torch.randn(1, 64, 3, 3))
        return F.conv2d(self.relu1(self.conv1(input)), self.conv2_filters, padding=1)

Although conv2_filters also receives a gradient here, the optimizer never had access to conv2_filters when it was constructed, so this parameter will never be updated.
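A hedged sketch of the fix (MyNet3Fixed is a hypothetical name): register conv2_filters as an nn.Parameter inside __init__, so that model.parameters(), and hence the optimizer, can see it:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNet3Fixed(nn.Module):
    def __init__(self):
        super(MyNet3Fixed, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.relu1 = nn.ReLU(True)
        # registered once in __init__, so the optimizer can update it
        self.conv2_filters = nn.Parameter(torch.randn(1, 64, 3, 3))

    def forward(self, input):
        return F.conv2d(self.relu1(self.conv1(input)), self.conv2_filters, padding=1)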

Discussion of retain_graph

In PyTorch, every time you perform a computation with Variables, you create a graph; if you then call backward on the last Variable, it will traverse this graph to compute the gradients for everything in it (and delete the graph as it goes through it if retain_graph=False). So in your command line you created a single graph and tried to backprop through it twice (without retain_graph), so it fails.
If, on the other hand, you redo the forward computation inside your loop, then the Variable on which you call backward is not the same one, and the graph attached to it is not the same as in the previous iteration. So no error there.

The common mistake that raises the mentioned error (even though you never intended to share a graph) is performing some computation just before the loop: even though you create new graphs inside the loop, they all share a common part built outside the loop, like below:

import torch

# requires_grad is needed, otherwise no graph is built at all
a = torch.rand(3, 3, requires_grad=True)

# This part is shared by both iterations and will make the second backward fail!
b = a * a

for i in range(10):
    d = (b * b).sum()
    # the first backward here works, but the second one will not!
    d.backward()
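A hedged sketch of the two usual fixes: either keep the shared sub-graph alive with retain_graph=True, or move the shared computation into the loop so that every iteration builds its own graph:

import torch

a = torch.rand(3, 3, requires_grad=True)

# Fix 1: keep the graph of the shared part alive across backward calls
b = a * a
for i in range(10):
    d = (b * b).sum()
    d.backward(retain_graph=True)

# Fix 2: rebuild the shared part inside the loop
for i in range(10):
    b = a * a
    d = (b * b).sum()
    d.backward()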

Autograd

import torch

# unbroken gradient, backward goes all the way to x
x = torch.ones(2, 2, requires_grad=True)
y = 2 * x + 2
z = y * y * 3
out = z.mean()
out.backward()
print(x.grad)

# broken gradient, ends at _y
x = torch.ones(2, 2, requires_grad=True)
y = 2 * x + 2

_y = y.detach().clone().requires_grad_(True)   # cut the graph: _y is a new leaf
z = _y * _y * 3
out = z.mean()
out.backward()
print(x.grad)
print(_y.grad)

# we can however, take the grad of _y and put it in manually!
y.backward(_y.grad)
print(x.grad)

We can see that if the tensor holds a single value (i.e. it is a scalar), then no grad_tensors argument is needed; this is the case of differentiating a scalar with respect to a vector, and it is why the docs always call mean() before backward(). Of course a loss is also a scalar, so it too can call .backward() directly.
Another point: look at the signature of torch.autograd.backward:
torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None)

Computes the sum of gradients of given tensors w.r.t. graph leaves.

The graph is differentiated using the chain rule. If any of tensors are non-scalar (i.e. their data has more than one element) and require gradient, the function additionally requires specifying grad_tensors. It should be a sequence of matching length, that contains gradient of the differentiated function w.r.t. corresponding tensors (None is an acceptable value for all tensors that don't need gradient tensors).

This function accumulates gradients in the leaves - you might need to zero them before calling it.
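As a small hedged sketch of this, when the output tensor is non-scalar we have to pass grad_tensors (the vector in the vector-Jacobian product) ourselves:

import torch

x = torch.ones(2, 2, requires_grad=True)
z = 3 * (2 * x + 2) ** 2           # z is 2x2, i.e. non-scalar

# z.backward() alone would raise "grad can be implicitly created only for scalar outputs"
torch.autograd.backward(z, grad_tensors=torch.ones_like(z) / z.numel())
print(x.grad)                      # same gradient as z.mean().backward() would give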
