Jotting down some small bits of knowledge for easy reference. (Continuously updated.)
Simply put, the parameters we want to update usually go in the `__init__` method. If an existing layer such as `Conv2d` or `Linear` fits the need, just use it directly: these layers are `Module` subclasses that define `weight`, `bias`, etc. as `Parameter` objects, which default to `requires_grad=True`. Note that only `Parameter`-typed attributes are picked up by the optimizer. This works because any network we define inherits from the `nn.Module` class.
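As a quick illustration (a minimal sketch; `TinyNet` and its attribute names are made up for this example), any `nn.Parameter` assigned as an attribute in `__init__` is registered automatically and shows up alongside the built-in layers' parameters:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.fc = nn.Linear(4, 2)                 # its weight and bias are Parameters
        self.scale = nn.Parameter(torch.ones(2))  # a hand-made Parameter, registered too

    def forward(self, x):
        return self.fc(x) * self.scale

model = TinyNet()
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# fc.weight (2, 4) True
# fc.bias (2,) True
# scale (2,) True
```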
Normally, the `Module` method `model.parameters()` returns every parameter in the network that is registered as a `Parameter`, and `model.zero_grad()` zeroes the gradients of all of those `Parameter`s.
So a typical manual parameter update looks like:
```python
with torch.no_grad():
    # simplest form of gradient descent
    for p in model.parameters():
        p -= p.grad * lr
    model.zero_grad()
```
To simplify parameter updates further, PyTorch provides the optimizer classes, reducing the update to:
```python
my_optimizer.step()
my_optimizer.zero_grad()
```
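Putting it together, a typical training step looks roughly like this (a sketch; the model, data, and learning rate here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
my_optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

loss = ((model(x) - target) ** 2).mean()
loss.backward()           # fills in .grad for every Parameter
my_optimizer.step()       # updates all parameters handed to the optimizer
my_optimizer.zero_grad()  # clears the gradients for the next iteration
```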
From the definition of `zero_grad` in optimizer.py we can see that it zeroes the gradients of all parameters that were passed into the optimizer:
```python
def zero_grad(self):
    r"""Clears the gradients of all optimized :class:`torch.Tensor` s."""
    for group in self.param_groups:
        for p in group['params']:
            if p.grad is not None:
                p.grad.detach_()
                p.grad.zero_()
```
The `.step()` function, on the other hand, is implemented differently by each optimization method.
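For instance, a bare-bones SGD could implement `.step()` like this (a minimal sketch of the pattern, not the actual `torch.optim.SGD` source):

```python
import torch
from torch.optim import Optimizer

class PlainSGD(Optimizer):
    def __init__(self, params, lr=0.01):
        super(PlainSGD, self).__init__(params, dict(lr=lr))

    def step(self, closure=None):
        with torch.no_grad():
            for group in self.param_groups:
                for p in group['params']:
                    if p.grad is not None:
                        # vanilla gradient descent: p <- p - lr * grad
                        p.add_(p.grad, alpha=-group['lr'])
```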
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNet3(nn.Module):
    def __init__(self):
        super(MyNet3, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.relu1 = nn.ReLU(True)
        # move this to forward
        # self.conv2 = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, input):
        # created inside forward: the optimizer never sees this Parameter
        self.conv2_filters = nn.Parameter(torch.randn(1, 64, 3, 3))
        return F.conv2d(self.relu1(self.conv1(input)), self.conv2_filters, padding=1)
```
Although `conv2_filters` here does receive gradients, the optimizer was never given `conv2_filters` when it was constructed, so it can never be updated.
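The fix is to register the filters as a `Parameter` in `__init__`, so that the optimizer receives them through `model.parameters()` (a sketch along the lines of the code above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyNet3Fixed(nn.Module):
    def __init__(self):
        super(MyNet3Fixed, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.relu1 = nn.ReLU(True)
        # registered in __init__, so it appears in model.parameters()
        self.conv2_filters = nn.Parameter(torch.randn(1, 64, 3, 3))

    def forward(self, input):
        return F.conv2d(self.relu1(self.conv1(input)), self.conv2_filters, padding=1)
```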
In PyTorch, every time you perform a computation with Variables, you create a graph; if you then call backward on the last Variable, it traverses this graph to compute the gradients for everything in it (and deletes the graph as it goes through it if retain_graph=False). So in your command line you created a single graph and tried to backprop through it twice (without retain_graph), so it fails.

If inside your training loop you redo the forward computation, then the Variable on which you call backward is not the same, and the graph attached to it is not the same as the previous iteration's. So no error there.

The common mistake (that would raise the mentioned error when you are not supposed to share a graph) is to perform some computation just before the loop, so that even though you create new graphs inside the loop, they share a common part from outside the loop, like below:
```python
import torch

a = torch.rand(3, 3, requires_grad=True)
# this will be shared by both iterations and will make the second backward fail!
b = a * a
for i in range(10):
    d = b * b
    # the first backward here will work, but the second will not,
    # because the shared part of the graph (b = a * a) was freed by the first call
    d.sum().backward()
```
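There are two ways to avoid the error (a sketch): either move the shared computation into the loop so each iteration builds a fresh graph, or keep the graph alive with retain_graph=True:

```python
import torch

a = torch.rand(3, 3, requires_grad=True)

# Option 1: recompute b inside the loop, so each iteration has its own graph
for i in range(10):
    b = a * a
    d = b * b
    d.sum().backward()

# Option 2: keep the shared graph alive across backward calls
b = a * a
for i in range(10):
    d = b * b
    d.sum().backward(retain_graph=True)
```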
```python
import torch

# unbroken gradient: backward goes all the way to x
x = torch.ones(2, 2, requires_grad=True)
y = 2 * x + 2
z = y * y * 3
out = z.mean()
out.backward()
print(x.grad)   # tensor([[12., 12.], [12., 12.]])

# broken gradient: the chain ends at _y
x = torch.ones(2, 2, requires_grad=True)
y = 2 * x + 2
_y = y.detach().clone().requires_grad_(True)  # cut off from the graph of x
z = _y * _y * 3
out = z.mean()
out.backward()
print(x.grad)   # None: backward stopped at _y
print(_y.grad)  # tensor([[6., 6.], [6., 6.]])

# we can, however, take the grad of _y and feed it in manually!
y.backward(_y.grad)
print(x.grad)   # tensor([[12., 12.], [12., 12.]]) again
```
As we can see, if the tensor holds a single value, then no `grad_tensors` argument is needed: this is the case of differentiating a scalar with respect to a vector. That is why the docs always take `mean` first and only then call `backward`. Likewise, a loss is a scalar, so it too can call `.backward()` directly.
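A minimal demonstration of the difference (the values in the comments follow from z = 3x², so dz/dx = 6x):

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
z = 3 * x * x                    # non-scalar output

# z.backward() would raise: "grad can be implicitly created only for scalar outputs"
z.backward(torch.ones_like(z))   # pass the gradient w.r.t. z explicitly
print(x.grad)                    # tensor([[6., 6.], [6., 6.]])

# scalar output: no gradient argument needed
x.grad.zero_()
z = 3 * x * x
z.mean().backward()
print(x.grad)                    # tensor([[1.5000, 1.5000], [1.5000, 1.5000]])
```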
One more point: the full signature of `torch.autograd.backward`:

```python
torch.autograd.backward(tensors, grad_tensors=None, retain_graph=None, create_graph=False, grad_variables=None)
```
Computes the sum of gradients of given tensors w.r.t. graph leaves. The graph is differentiated using the chain rule. If any of tensors are non-scalar (i.e. their data has more than one element) and require gradient, the function additionally requires specifying grad_tensors. It should be a sequence of matching length, that contains the gradient of the differentiated function w.r.t. corresponding tensors (None is an acceptable value for all tensors that don't need gradient tensors). This function accumulates gradients in the leaves - you might need to zero them before calling it.
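For example (a small sketch), note how the gradients accumulate in the leaf across two calls unless you zero them in between:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
z = 3 * x * x

# equivalent to z.backward(torch.ones_like(z)); grad_tensors is required
# because z is non-scalar, and retain_graph keeps the graph for the second call
torch.autograd.backward([z], grad_tensors=[torch.ones_like(z)], retain_graph=True)
print(x.grad)  # tensor([[6., 6.], [6., 6.]])

# a second call accumulates into x.grad instead of overwriting it
torch.autograd.backward([z], grad_tensors=[torch.ones_like(z)])
print(x.grad)  # tensor([[12., 12.], [12., 12.]])
```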