How does DARTS search a single cell? Figure 1 illustrates the basic idea:
A network is built from 8 cells of two kinds: reduction cells and normal cells. The cells at one third and two thirds of the network depth are reduction cells; all others are normal cells. All reduction cells share one set of architecture weights $\alpha_{reduction}$, and all normal cells share $\alpha_{normal}$.
A cell consists of 7 nodes: 2 input nodes, 4 intermediate nodes, and 1 output node.
class Cell(nn.Module):

    def __init__(self, steps, multiplier, C_prev_prev, C_prev, C, reduction, reduction_prev):
        super(Cell, self).__init__()
        self.reduction = reduction
        # The input nodes have a fixed structure and are not part of the search.
        # The first input node takes the output of cell k-2 (C_prev_prev channels);
        # its preprocessing depends on whether the previous cell was a reduction cell.
        if reduction_prev:
            self.preprocess0 = FactorizedReduce(C_prev_prev, C, affine=False)
        else:
            self.preprocess0 = ReLUConvBN(C_prev_prev, C, 1, 1, 0, affine=False)
        # The second input node takes the output of cell k-1.
        self.preprocess1 = ReLUConvBN(C_prev, C, 1, 1, 0, affine=False)
        self._steps = steps  # number of intermediate nodes whose connections are searched (4)
        self._multiplier = multiplier

        self._ops = nn.ModuleList()  # one MixedOp per edge
        self._bns = nn.ModuleList()
        # Build the mixed operations for the 4 intermediate nodes.
        for i in range(self._steps):
            # Intermediate node i has 2+i predecessors: the two input nodes
            # plus the intermediate nodes that come before it in this cell.
            for j in range(2+i):
                stride = 2 if reduction and j < 2 else 1
                op = MixedOp(C, stride)  # mixed operation on the edge j -> i
                self._ops.append(op)

    def forward(self, s0, s1, weights):
        s0 = self.preprocess0(s0)
        s1 = self.preprocess1(s1)

        states = [s0, s1]  # predecessors available to the current node
        offset = 0
        # Compute the output of each intermediate node in turn.
        for i in range(self._steps):
            # Apply the MixedOp of every incoming edge (MixedOp.forward) and sum the results.
            s = sum(self._ops[offset+j](h, weights[offset+j]) for j, h in enumerate(states))
            offset += len(states)
            states.append(s)
        # states is now [s0, s1, b1, b2, b3, b4], where b1..b4 are the intermediate outputs.
        # The cell output is the channel-wise concatenation of the intermediate outputs.
        return torch.cat(states[-self._multiplier:], dim=1)
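The `offset` bookkeeping in `forward` can be checked in isolation with a small pure-Python sketch. The helper name `edge_offsets` is hypothetical and not part of the DARTS code; it only replays how edges are laid out in the flat `_ops` list.

```python
def edge_offsets(steps=4, num_inputs=2):
    """Return, for each intermediate node, the flat indices of its incoming edges.

    Mirrors the `offset` logic of Cell.forward: node i has num_inputs + i
    predecessors, and their edges are stored contiguously.
    """
    offsets, offset = [], 0
    for i in range(steps):
        n_prev = num_inputs + i  # number of predecessor states of node i
        offsets.append(list(range(offset, offset + n_prev)))
        offset += n_prev
    return offsets


offsets = edge_offsets()
total_edges = sum(len(o) for o in offsets)
for node, edges in enumerate(offsets):
    print(node, edges)
print(total_edges)
```

With 4 intermediate nodes this gives 2 + 3 + 4 + 5 = 14 edges, so `weights` needs one row of operation weights per edge.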
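To make the search space continuous, every candidate operation on an edge is given a weight $\alpha$, and the weights are passed through a softmax. The search task then reduces to learning the weights $\alpha$. At the end of the search, each mixed operation is replaced by its most likely operation:
$$o^{(i,j)} = \operatorname{argmax}_{o \in \mathcal{O}} \; \alpha_o^{(i,j)}$$
Here $\operatorname{argmax} f(x)$ denotes the point x (or set of points) at which f(x) attains its maximum.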
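The relaxation and the final argmax can be illustrated on a single edge with scalar inputs. This is only a sketch: the three operations and the α values below are made up for illustration and are not the DARTS primitives.

```python
import math


def softmax(alphas):
    """Numerically stable softmax over a list of floats."""
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations on one edge, applied to a scalar input.
ops = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}
names = list(ops)
alphas = [0.5, 2.0, -1.0]  # made-up architecture weights for this edge

weights = softmax(alphas)
x = 3.0
# Continuous relaxation: softmax-weighted sum of all candidate outputs.
mixed = sum(w * ops[n](x) for w, n in zip(weights, names))
# Discretization: keep only the operation with the largest alpha.
best = names[max(range(len(alphas)), key=lambda k: alphas[k])]
print(weights, mixed, best)
```

During search the edge outputs `mixed`; after search only `best` is kept.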
class MixedOp(nn.Module):

    def __init__(self, C, stride):
        super(MixedOp, self).__init__()
        self._ops = nn.ModuleList()
        for primitive in PRIMITIVES:  # PRIMITIVES lists the 8 candidate operations
            op = OPS[primitive](C, stride, False)  # OPS maps each operation name to its constructor
            if 'pool' in primitive:
                # Follow every pooling operation with batch normalization.
                op = nn.Sequential(op, nn.BatchNorm2d(C, affine=False))
            self._ops.append(op)

    def forward(self, x, weights):
        # Weighted sum over all candidates: w1*op1(x) + w2*op2(x) + ... + w8*op8(x)
        return sum(w * op(x) for w, op in zip(weights, self._ops))
After relaxation, our goal is to jointly learn the architecture α and the weights w within all the mixed operations (e.g. weights of the convolution filters). Analogous to architecture search using RL or evolution where the validation set performance is treated as the reward or fitness, DARTS aims to optimize the validation loss, but using gradient descent.
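In other words, once the choice of operation has been relaxed, both $\alpha$ and $w$ have to be learned, and DARTS learns them by optimizing the validation loss with gradient descent.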
Denote by $L_{train}$ and $L_{val}$ the training and the validation loss, respectively. Both losses are determined not only by the architecture α, but also the weights w in the network.
The goal for architecture search is to find $\alpha^*$ that minimizes the validation loss $L_{val}(w^*, \alpha^*)$, where the weights $w^*$ associated with the architecture are obtained by minimizing the training loss $w^* = \operatorname{argmin}_w L_{train}(w, \alpha^*)$.
This implies a bilevel optimization problem with α as the upper-level variable and w as the lower-level variable:
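That is, the goal of architecture search is to find the α that minimizes the validation loss $L_{val}(w^*, \alpha^*)$, while $w^*$ is itself obtained by minimizing the training loss, $w^* = \operatorname{argmin}_w L_{train}(w, \alpha^*)$. This gives the following bilevel formulation:
$$\min_{\alpha} \; L_{val}(w^*(\alpha), \alpha) \tag{3}$$
$$\text{s.t.} \quad w^*(\alpha) = \operatorname{argmin}_w \; L_{train}(w, \alpha) \tag{4}$$
This subsection refines equations (3) and (4). The authors first propose the following approximation scheme:
$$\nabla_\alpha L_{val}(w^*(\alpha), \alpha) \tag{5}$$
$$\approx \nabla_\alpha L_{val}(w - \xi \nabla_w L_{train}(w, \alpha), \alpha) \tag{6}$$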
where w denotes the current weights maintained by the algorithm, and ξ is the learning rate for a step of inner optimization.
The idea is to approximate $w^*(\alpha)$ by adapting w using only a single training step, without solving the inner optimization (equation 4) completely by training until convergence.
That is, $w^*(\alpha)$ is approximated by $w - \xi \nabla_w L_{train}(w, \alpha)$: w takes just one training step, so equation 4 does not have to be optimized to convergence before α is updated.
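A toy example makes the single-step approximation concrete. The quadratic training loss below is an assumption chosen so that the exact minimizer $w^*(\alpha)$ is known; it is not taken from the paper.

```python
def grad_w_train(w, alpha):
    """Gradient w.r.t. w of the toy training loss 0.5 * (w - alpha)**2."""
    return w - alpha


def one_step_weights(w, alpha, xi):
    """Single-step approximation w' = w - xi * grad_w L_train(w, alpha)."""
    return w - xi * grad_w_train(w, alpha)


w, alpha, xi = 0.0, 1.0, 0.1
w_star = alpha  # exact inner minimizer of the toy loss
w_prime = one_step_weights(w, alpha, xi)
print(w_prime, w_star)
```

One step moves w toward $w^*(\alpha)$ without reaching it, and with ξ = 0 the weights are left unchanged, which corresponds to evaluating the validation gradient at the current w.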
def train(train_queue, valid_queue, model, architect, criterion, optimizer, lr):
    objs = utils.AvgrageMeter()  # running average of the loss
    top1 = utils.AvgrageMeter()  # running top-1 accuracy
    top5 = utils.AvgrageMeter()  # running top-5 accuracy

    for step, (input, target) in enumerate(train_queue):  # one batch per step (batch size 64)
        model.train()
        n = input.size(0)
        input = Variable(input, requires_grad=False).cuda()
        target = Variable(target, requires_grad=False).cuda(non_blocking=True)

        # get a random minibatch from the search queue with replacement
        # iter(dataloader) returns an iterator and next() pulls one batch from it;
        # this batch is used to update the architecture parameters.
        input_search, target_search = next(iter(valid_queue))
        input_search = Variable(input_search, requires_grad=False).cuda()
        target_search = Variable(target_search, requires_grad=False).cuda(non_blocking=True)

        # Update the architecture parameters alpha.
        architect.step(input, target, input_search, target_search, lr, optimizer, unrolled=args.unrolled)

        # Update the network weights w.
        optimizer.zero_grad()
        logits = model(input)
        loss = criterion(logits, target)  # loss between the predictions and the ground truth
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)  # gradient clipping
        optimizer.step()  # apply the gradients

        prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
        objs.update(loss.item(), n)
        top1.update(prec1.item(), n)
        top5.update(prec5.item(), n)

        if step % args.report_freq == 0:
            logging.info('train %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

    return top1.avg, objs.avg
We also note that when momentum is enabled for weight optimisation, the one-step unrolled learning objective in equation 6 is modified accordingly and all of our analysis still applies.
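Applying chain rule to the approximate architecture gradient (equation 6) yields
$$\nabla_\alpha L_{val}(w', \alpha) - \xi \nabla^2_{\alpha, w} L_{train}(w, \alpha) \nabla_{w'} L_{val}(w', \alpha) \tag{7}$$
where $w' = w - \xi \nabla_w L_{train}(w, \alpha)$.
Equation 6 is in effect the derivative of a composite function of $\alpha$. Formally, it can be written as $\nabla_\alpha f(g_1(\alpha), g_2(\alpha))$ with $f = L_{val}$, $g_1(\alpha) = w - \xi \nabla_w L_{train}(w, \alpha)$ and $g_2(\alpha) = \alpha$. The chain rule then gives $\nabla_\alpha g_1(\alpha) \cdot \nabla_{w'} L_{val}(w', \alpha) + \nabla_\alpha g_2(\alpha) \cdot \nabla_\alpha L_{val}(w', \alpha)$, where $\nabla_\alpha g_1(\alpha) = -\xi \nabla^2_{\alpha, w} L_{train}(w, \alpha)$ and $\nabla_\alpha g_2(\alpha) = 1$, which is exactly equation 7.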
The expression above contains an expensive matrix-vector product in its second term. Fortunately, the complexity can be substantially reduced using the finite difference approximation.
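Because the second term of equation 7 contains an expensive matrix-vector product, it is replaced by a finite difference approximation, which gives equation 8:
$$\nabla^2_{\alpha, w} L_{train}(w, \alpha) \nabla_{w'} L_{val}(w', \alpha) \approx \frac{\nabla_\alpha L_{train}(w^+, \alpha) - \nabla_\alpha L_{train}(w^-, \alpha)}{2\epsilon} \tag{8}$$
where $w^\pm = w \pm \epsilon \nabla_{w'} L_{val}(w', \alpha)$.
This is the central difference $f'(x_0) A \approx \frac{f(x_0 + hA) - f(x_0 - hA)}{2h}$ with $h$ replaced by $\epsilon$, $A$ by $\nabla_{w'} L_{val}(w', \alpha)$, $x_0$ by $w$, and $f$ by $\nabla_\alpha L_{train}(\cdot, \cdot)$.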
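The finite difference in equation 8 can be checked numerically on a toy loss whose mixed derivative is known in closed form. The loss $L_{train}(w, \alpha) = \alpha (w_0^2 + 3 w_1)$ and all numbers below are assumptions made for this sketch.

```python
def grad_alpha_train(w, alpha):
    """d/d(alpha) of the toy loss L_train(w, alpha) = alpha * (w[0]**2 + 3*w[1])."""
    return w[0] ** 2 + 3.0 * w[1]


def hvp_finite_difference(w, alpha, v, eps=1e-2):
    """Central-difference estimate of (d^2 L_train / d alpha d w) . v, as in equation 8."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    return (grad_alpha_train(w_plus, alpha) - grad_alpha_train(w_minus, alpha)) / (2 * eps)


w, alpha = [1.0, 2.0], 0.7
v = [0.5, -1.0]  # stands in for grad_{w'} L_val(w', alpha)
exact = 2.0 * w[0] * v[0] + 3.0 * v[1]  # analytic mixed derivative [2*w0, 3] times v
approx = hvp_finite_difference(w, alpha, v)
print(exact, approx)
```

Only two extra evaluations of $\nabla_\alpha L_{train}$ are needed, so the mixed second derivative never has to be formed explicitly.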
From the code above we know that $\alpha$ is updated by calling architect.step(). How is this function implemented, that is, how are all the formulas above actually used? Let us look at architect.py.
architect.step(input, target, input_search, target_search, lr, optimizer, unrolled=args.unrolled)
import torch
import numpy as np
import torch.nn as nn
from torch.autograd import Variable
def _concat(xs):
    # Flatten every tensor and concatenate them into a single 1-D vector.
    return torch.cat([x.view(-1) for x in xs])


class Architect(object):

    def __init__(self, model, args):
        self.network_momentum = args.momentum
        self.network_weight_decay = args.weight_decay
        self.model = model
        # Optimizer that updates the architecture parameters alpha.
        self.optimizer = torch.optim.Adam(self.model.arch_parameters(),
            lr=args.arch_learning_rate, betas=(0.5, 0.999), weight_decay=args.arch_weight_decay)

    # The update replicated below is theta' = theta - eta * (moment + dtheta), where
    #   dtheta = dL_train/dtheta + weight_decay * theta  (gradient plus regularization term)
    #   moment = momentum * v, with v the momentum buffer of the network optimizer
    #            (zero if no momentum buffer exists yet)
    # This copies exactly how the outer Network updates w and corresponds to
    # w - xi * dw L_train(w, alpha) inside equation 6.
    def _compute_unrolled_model(self, input, target, eta, network_optimizer):
        loss = self.model._loss(input, target)  # L_train
        theta = _concat(self.model.parameters()).data  # all weights flattened into one vector
        try:
            # momentum * v, using the momentum buffer of the optimizer that updates w
            moment = _concat(network_optimizer.state[v]['momentum_buffer'] for v in self.model.parameters()).mul_(self.network_momentum)
        except Exception:
            moment = torch.zeros_like(theta)  # no momentum buffer available
        # Gradient of the loss w.r.t. theta plus the weight-decay (regularization) term.
        dtheta = _concat(torch.autograd.grad(loss, self.model.parameters())).data + self.network_weight_decay*theta
        unrolled_model = self._construct_model_from_theta(theta.sub(moment + dtheta, alpha=eta))  # w - xi * dw L_train(w, alpha)
        return unrolled_model
    def step(self, input_train, target_train, input_valid, target_valid, eta, network_optimizer, unrolled):
        self.optimizer.zero_grad()  # clear the gradients of alpha from the previous step
        if unrolled:  # second-order approximation with the unrolled step
            self._backward_step_unrolled(input_train, target_train, input_valid, target_valid, eta, network_optimizer)
        else:  # first-order approximation: differentiate L_val w.r.t. alpha at the current w
            self._backward_step(input_valid, target_valid)
        # Apply the gradients that the backward step stored on the alpha parameters.
        self.optimizer.step()

    def _backward_step(self, input_valid, target_valid):
        loss = self.model._loss(input_valid, target_valid)
        loss.backward()  # backpropagate to compute the gradients

    def _backward_step_unrolled(self, input_train, target_train, input_valid, target_valid, eta, network_optimizer):
        # Equation 6: d_alpha L_val(w', alpha), where w' = w - xi * dw L_train(w, alpha)
        # The weights of unrolled_model have already taken one update step, i.e. they are w'.
        unrolled_model = self._compute_unrolled_model(input_train, target_train, eta, network_optimizer)
        # Validation loss L_val of the unrolled model, used to update alpha.
        unrolled_loss = unrolled_model._loss(input_valid, target_valid)

        unrolled_loss.backward()
        # d_alpha L_val(w', alpha)
        dalpha = [v.grad for v in unrolled_model.arch_parameters()]
        # d_w' L_val(w', alpha); unrolled_model.parameters() are w'
        vector = [v.grad.data for v in unrolled_model.parameters()]
        # Equation 8: (d_alpha L_train(w+, alpha) - d_alpha L_train(w-, alpha)) / (2*epsilon)
        # with w+ = w + epsilon * d_w' L_val(w', alpha) and w- = w - epsilon * d_w' L_val(w', alpha)
        implicit_grads = self._hessian_vector_product(vector, input_train, target_train)

        # Equation 7: d_alpha L_val(w', alpha) - xi * (equation 8)
        for g, ig in zip(dalpha, implicit_grads):
            g.data.sub_(ig.data, alpha=eta)

        # Copy the result into the gradients of the real model's alpha.
        for v, g in zip(self.model.arch_parameters(), dalpha):
            if v.grad is None:
                v.grad = Variable(g.data)
            else:
                v.grad.data.copy_(g.data)
    def _construct_model_from_theta(self, theta):
        model_new = self.model.new()
        model_dict = self.model.state_dict()  # dictionary containing the whole state of the module

        params, offset = {}, 0
        for k, v in self.model.named_parameters():  # k is the parameter name, v the parameter
            v_length = np.prod(v.size())
            params[k] = theta[offset: offset+v_length].view(v.size())  # slice of theta that belongs to parameter k
            offset += v_length

        assert offset == len(theta)
        model_dict.update(params)  # the weights now hold the values after one update step (w')
        model_new.load_state_dict(model_dict)  # load them into the newly created model
        return model_new.cuda()

    # Equation 8: (d_alpha L_train(w+, alpha) - d_alpha L_train(w-, alpha)) / (2*epsilon)
    # with w+ = w + epsilon * d_w' L_val(w', alpha) and w- = w - epsilon * d_w' L_val(w', alpha)
    def _hessian_vector_product(self, vector, input, target, r=1e-2):  # vector is d_w' L_val(w', alpha)
        R = r / _concat(vector).norm()  # epsilon
        for p, v in zip(self.model.parameters(), vector):
            p.data.add_(v, alpha=R)  # w+ = w + epsilon * vector
        loss = self.model._loss(input, target)
        grads_p = torch.autograd.grad(loss, self.model.arch_parameters())

        for p, v in zip(self.model.parameters(), vector):
            p.data.sub_(v, alpha=2*R)  # w- = w+ - 2 * epsilon * vector = w - epsilon * vector
        loss = self.model._loss(input, target)
        grads_n = torch.autograd.grad(loss, self.model.arch_parameters())

        for p, v in zip(self.model.parameters(), vector):
            p.data.add_(v, alpha=R)  # restore w = w- + epsilon * vector

        return [(x-y).div_(2*R) for x, y in zip(grads_p, grads_n)]
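The whole unrolled architecture step can be replayed on scalars to see how the pieces of `architect.py` fit together. The two quadratic losses and the function name `arch_gradient` are assumptions for this sketch; they were chosen so that the exact gradient is easy to derive by hand.

```python
def grad_w_train(w, alpha):
    """d/dw of the toy training loss L_train = 0.5 * (w - alpha)**2."""
    return w - alpha


def grad_alpha_train(w, alpha):
    """d/d(alpha) of the same toy training loss."""
    return alpha - w


def grad_w_val(w, target):
    """d/dw of the toy validation loss L_val = 0.5 * (w - target)**2."""
    return w - target


def arch_gradient(w, alpha, target, xi, eps=1e-2):
    """Architecture gradient following equations 6 to 8 on the toy losses."""
    w_prime = w - xi * grad_w_train(w, alpha)  # one unrolled step, w'
    dalpha = 0.0  # direct term: the toy L_val does not depend on alpha explicitly
    v = grad_w_val(w_prime, target)  # d_w' L_val(w', alpha)
    # Equation 8: central difference of d_alpha L_train around w along v.
    implicit = (grad_alpha_train(w + eps * v, alpha)
                - grad_alpha_train(w - eps * v, alpha)) / (2 * eps)
    return dalpha - xi * implicit  # equation 7


w, alpha, target, xi = 0.0, 1.0, 2.0, 0.1
g = arch_gradient(w, alpha, target, xi)
# Reference: differentiate 0.5 * (w - xi*(w - alpha) - target)**2 w.r.t. alpha by hand.
w_prime = w - xi * (w - alpha)
g_exact = (w_prime - target) * xi
print(g, g_exact)
```

For these losses the finite difference is exact up to rounding, so both values agree.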
To form each node in the discrete architecture, we retain the top-2 strongest operations (from distinct nodes) among all non-zero candidate operations collected from all the previous nodes.
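genotype = model.genotype()  # Section 2.4 of the paper: keep the two predecessors with the largest weights and store them as (operation, predecessor) pairs
The details are in the genotype method of class Network in model_search.py: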
    def genotype(self):

        def _parse(weights):
            gene = []
            n = 2
            start = 0
            for i in range(self._steps):
                end = start + n
                W = weights[start:end].copy()
                # Find the two incoming edges with the largest weights.
                # x ranges over 0, 1, ..., i+1, the indices of the predecessor nodes, so W[x]
                # holds the weights [alpha_0, alpha_1, ..., alpha_7] of all operations on that edge.
                # The key takes the largest weight among the operations other than 'none';
                # negating it sorts the edges in descending order, and [:2] keeps the top 2.
                edges = sorted(range(i + 2), key=lambda x: -max(W[x][k] for k in range(len(W[x])) if k != PRIMITIVES.index('none')))[:2]
                # For each of the two edges, find its operation with the largest weight.
                for j in edges:
                    k_best = None
                    for k in range(len(W[j])):
                        if k != PRIMITIVES.index('none'):
                            if k_best is None or W[j][k] > W[j][k_best]:
                                k_best = k
                    gene.append((PRIMITIVES[k_best], j))  # (operation, predecessor index), e.g. [('sep_conv_3x3', 1), ...]
                start = end
                n += 1
            return gene

        gene_normal = _parse(F.softmax(self.alphas_normal, dim=-1).data.cpu().numpy())  # final selection for the normal cell
        gene_reduce = _parse(F.softmax(self.alphas_reduce, dim=-1).data.cpu().numpy())  # final selection for the reduction cell

        concat = range(2+self._steps-self._multiplier, self._steps+2)  # [2, 3, 4, 5]: nodes 2, 3, 4, 5 are concatenated
        genotype = Genotype(
            normal=gene_normal, normal_concat=concat,
            reduce=gene_reduce, reduce_concat=concat
        )
        return genotype
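A worked example of the top-2 selection, written without PyTorch. The shortened `PRIMITIVES` list, the function name `parse` and all weights are made up for illustration; the selection logic follows `_parse` above.

```python
PRIMITIVES = ['none', 'skip_connect', 'sep_conv_3x3']  # shortened, illustrative list


def parse(weights, steps):
    """For each intermediate node keep the 2 strongest incoming edges and their best non-'none' op."""
    none_idx = PRIMITIVES.index('none')
    gene, start, n = [], 0, 2
    for i in range(steps):
        W = weights[start:start + n]

        def strength(x):
            # Largest weight on edge x, ignoring the 'none' operation.
            return max(W[x][k] for k in range(len(W[x])) if k != none_idx)

        edges = sorted(range(i + 2), key=lambda x: -strength(x))[:2]
        for j in edges:
            candidates = [k for k in range(len(W[j])) if k != none_idx]
            k_best = max(candidates, key=lambda k: W[j][k])
            gene.append((PRIMITIVES[k_best], j))
        start += n
        n += 1
    return gene


# Made-up softmax outputs: 2 rows for the first intermediate node, 3 rows for the second.
weights = [
    [0.2, 0.5, 0.3],  # predecessor 0 -> first intermediate node
    [0.1, 0.3, 0.6],  # predecessor 1 -> first intermediate node
    [0.8, 0.1, 0.1],  # predecessor 0 -> second node ('none' dominates, so the edge is weak)
    [0.2, 0.7, 0.1],  # predecessor 1 -> second node
    [0.3, 0.3, 0.4],  # predecessor 2 (first intermediate node) -> second node
]
gene = parse(weights, steps=2)
print(gene)
```

The edge from predecessor 0 to the second node is dropped because its strongest non-'none' weight (0.1) is the smallest of the three candidates.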
In the first stage, we search for the cell architectures using DARTS, and determine the best cells based on their validation performance.
In the second stage, we use these cells to construct larger architectures, which we train from scratch and report their performance on the test set.
class Network(nn.Module):

    def __init__(self, C, num_classes, layers, criterion, steps=4, multiplier=4, stem_multiplier=3):
        super(Network, self).__init__()
        self._C = C  # initial number of channels
        self._num_classes = num_classes
        self._layers = layers
        self._criterion = criterion
        self._steps = steps  # each cell has 4 intermediate nodes whose operations are searched
        self._multiplier = multiplier

        C_curr = stem_multiplier*C  # number of output channels of the stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, C_curr, 3, padding=1, bias=False),  # arguments: input image channels, number of filters, kernel size
            nn.BatchNorm2d(C_curr)  # batch normalization over a 4-D (N, C, H, W) input; num_features is C
        )

        C_prev_prev, C_prev, C_curr = C_curr, C_curr, C
        self.cells = nn.ModuleList()  # empty ModuleList that will hold the cells
        reduction_prev = False  # whether the preceding cell is a reduction cell
        for i in range(layers):  # 8 layers; reduction cells (stride 2) sit at 1/3 and 2/3 of the depth
            # "Cells located at the 1/3 and 2/3 of the total depth of the network are reduction cells"
            if i in [layers//3, 2*layers//3]:
                C_curr *= 2
                reduction = True
            else:
                reduction = False
            # The input nodes of each cell are the outputs of the two preceding cells.
            cell = Cell(steps, multiplier, C_prev_prev, C_prev, C_curr, reduction, reduction_prev)
            reduction_prev = reduction
            self.cells += [cell]
            # A cell concatenates its 4 intermediate nodes along the channel dimension,
            # so its output has multiplier*C_curr channels.
            C_prev_prev, C_prev = C_prev, multiplier*C_curr

        self.global_pooling = nn.AdaptiveAvgPool2d(1)  # average pooling to an output size of 1x1
        self.classifier = nn.Linear(C_prev, num_classes)  # linear classifier
cell = Cell(steps, multiplier, C_prev_prev, C_prev, C_curr, reduction, reduction_prev)
With layers = 8 and C = 16, cells[2] and cells[5] are the reduction cells:
cells[0]: cell = Cell(4, 4, 48, 48, 16, False, False) output [N,16*4,h,w]
cells[1]: cell = Cell(4, 4, 48, 64, 16, False, False) output [N,16*4,h,w]
cells[2]: cell = Cell(4, 4, 64, 64, 32, True, False) output [N,32*4,h/2,w/2]
cells[3]: cell = Cell(4, 4, 64, 128, 32, False, True) output [N,32*4,h/2,w/2]
cells[4]: cell = Cell(4, 4, 128, 128, 32, False, False) output [N,32*4,h/2,w/2]
cells[5]: cell = Cell(4, 4, 128, 128, 64, True, False) output [N,64*4,h/4,w/4]
cells[6]: cell = Cell(4, 4, 128, 256, 64, False, True) output [N,64*4,h/4,w/4]
cells[7]: cell = Cell(4, 4, 256, 256, 64, False, False) output [N,64*4,h/4,w/4]
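The arguments passed to each `Cell` can be reproduced by replaying only the channel bookkeeping of `Network.__init__`. The helper name `cell_configs` is hypothetical, and no modules are built in this sketch.

```python
def cell_configs(C=16, layers=8, multiplier=4, stem_multiplier=3):
    """Replay the channel and reduction bookkeeping of Network.__init__.

    Returns one (C_prev_prev, C_prev, C_curr, reduction, reduction_prev)
    tuple per cell.
    """
    C_curr = stem_multiplier * C
    C_prev_prev, C_prev, C_curr = C_curr, C_curr, C
    reduction_prev = False
    configs = []
    for i in range(layers):
        reduction = i in [layers // 3, 2 * layers // 3]
        if reduction:
            C_curr *= 2
        configs.append((C_prev_prev, C_prev, C_curr, reduction, reduction_prev))
        reduction_prev = reduction
        C_prev_prev, C_prev = C_prev, multiplier * C_curr
    return configs


configs = cell_configs()
for i, cfg in enumerate(configs):
    print(i, cfg)
```

Note that `reduction_prev` is True for the cell that directly follows a reduction cell, which is what selects `FactorizedReduce` for its first input node.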
train_transform, valid_transform = utils._data_transforms_cifar10(args)
train_data = dset.CIFAR10(root=args.data, train=True, download=True, transform=train_transform)

num_train = len(train_data)
indices = list(range(num_train))
split = int(np.floor(args.train_portion * num_train))

train_queue = torch.utils.data.DataLoader(
    train_data, batch_size=args.batch_size,
    # Custom sampling strategy: with train_portion=0.5 the first half of the data is used for training.
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[:split]),
    pin_memory=True, num_workers=2)

valid_queue = torch.utils.data.DataLoader(
    train_data, batch_size=args.batch_size,
    # The second half of the data is used for validation.
    sampler=torch.utils.data.sampler.SubsetRandomSampler(indices[split:num_train]),
    pin_memory=True, num_workers=2)

optimizer = torch.optim.SGD(
    model.parameters(),
    args.learning_rate,
    momentum=args.momentum,  # 0.9
    weight_decay=args.weight_decay)

# Cosine annealing schedule for the learning rate of each parameter group.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, float(args.epochs), eta_min=args.learning_rate_min)
<2> Optimizer for α
self.optimizer = torch.optim.Adam(self.model.arch_parameters(),
    lr=args.arch_learning_rate, betas=(0.5, 0.999), weight_decay=args.arch_weight_decay)  # optimizer that updates α
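The cosine annealing schedule used for the weight optimizer has a simple closed form, sketched below in plain Python. The function name and the numeric values are assumptions for illustration; the real values come from `args.learning_rate`, `args.learning_rate_min` and `args.epochs`.

```python
import math


def cosine_annealing_lr(epoch, base_lr, eta_min, t_max):
    """Closed-form cosine annealing from base_lr (epoch 0) down to eta_min (epoch t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


# Assumed values for illustration only.
base_lr, eta_min, t_max = 0.025, 0.001, 50
schedule = [cosine_annealing_lr(e, base_lr, eta_min, t_max) for e in range(t_max + 1)]
print(schedule[0], schedule[t_max // 2], schedule[-1])
```

The learning rate falls slowly at first, fastest around the middle of training, and flattens out again as it approaches `eta_min`.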
model = Network(args.init_channels, CIFAR_CLASSES, args.layers, criterion)  # build the network
architect = Architect(model, args)

for epoch in range(args.epochs):
    lr = scheduler.get_last_lr()[0]  # learning rate for this epoch
    logging.info('epoch %d lr %e', epoch, lr)

    # Section 2.4 of the paper: keep the two predecessors with the largest weights. The result has the form
    # Genotype(normal=[(op, i), ...], normal_concat=[], reduce=[], reduce_concat=[])
    genotype = model.genotype()
    logging.info('genotype = %s', genotype)

    print(F.softmax(model.alphas_normal, dim=-1))
    print(F.softmax(model.alphas_reduce, dim=-1))

    # training
    train_acc, train_obj = train(train_queue, valid_queue, model, architect, criterion, optimizer, lr)
    logging.info('train_acc %f', train_acc)

    # validation
    valid_acc, valid_obj = infer(valid_queue, model, criterion)
    logging.info('valid_acc %f', valid_acc)

    utils.save(model, os.path.join(args.save, 'weights.pt'))
    scheduler.step()  # advance the cosine schedule once per epoch
def train(train_queue, valid_queue, model, architect, criterion, optimizer, lr):
    objs = utils.AvgrageMeter()  # running average of the loss
    top1 = utils.AvgrageMeter()  # running top-1 accuracy
    top5 = utils.AvgrageMeter()  # running top-5 accuracy

    for step, (input, target) in enumerate(train_queue):  # one batch per step (batch size 64)
        model.train()
        n = input.size(0)
        input = Variable(input, requires_grad=False).cuda()  # requires_grad=False: no gradient w.r.t. the input
        target = Variable(target, requires_grad=False).cuda(non_blocking=True)

        # get a random minibatch from the search queue with replacement
        # alpha is updated on the validation set, so every step pulls one batch
        # from valid_queue and passes it to architect.step().
        input_search, target_search = next(iter(valid_queue))  # iter(dataloader) returns an iterator; next() reads one batch
        input_search = Variable(input_search, requires_grad=False).cuda()
        target_search = Variable(target_search, requires_grad=False).cuda(non_blocking=True)

        # With unrolled=True, alpha is updated with the second-order formula from the paper.
        architect.step(input, target, input_search, target_search, lr, optimizer, unrolled=args.unrolled)

        optimizer.zero_grad()
        logits = model(input)
        loss = criterion(logits, target)  # loss between the predictions and the ground truth
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), args.grad_clip)  # gradient clipping
        optimizer.step()  # apply the gradients

        prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
        objs.update(loss.item(), n)
        top1.update(prec1.item(), n)
        top5.update(prec5.item(), n)

        if step % args.report_freq == 0:
            logging.info('train %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

    return top1.avg, objs.avg
<2> Validation: infer
def infer(valid_queue, model, criterion):
    objs = utils.AvgrageMeter()
    top1 = utils.AvgrageMeter()
    top5 = utils.AvgrageMeter()
    model.eval()

    with torch.no_grad():  # no gradients are needed during evaluation
        for step, (input, target) in enumerate(valid_queue):
            input = input.cuda()
            target = target.cuda(non_blocking=True)

            logits = model(input)
            loss = criterion(logits, target)

            prec1, prec5 = utils.accuracy(logits, target, topk=(1, 5))
            n = input.size(0)
            objs.update(loss.item(), n)
            top1.update(prec1.item(), n)
            top5.update(prec5.item(), n)

            if step % args.report_freq == 0:
                logging.info('valid %03d %e %f %f', step, objs.avg, top1.avg, top5.avg)

    return top1.avg, objs.avg
def accuracy(output, target, topk=(1,)):  # output: (bs, num_class), e.g. 64 x 10; target: (bs,); topk=(1, 5)
    maxk = max(topk)  # 5
    batch_size = target.size(0)

    _, pred = output.topk(maxk, 1, True, True)  # the maxk=5 highest-scoring classes along dim=1, i.e. per row
    # pred: (bs, 5), entries are class indices 0, 1, ..., 9
    pred = pred.t()  # transpose, pred: (5, bs)
    # target.view(1, -1) reshapes target into a single row and expand_as(pred) repeats it
    # to 5 rows and bs columns; eq is True where pred and target agree and False elsewhere.
    correct = pred.eq(target.view(1, -1).expand_as(pred))

    res = []
    for k in topk:  # k=1 and k=5
        correct_k = correct[:k].reshape(-1).float().sum(0)
        res.append(correct_k.mul_(100.0/batch_size))
    return res  # two values: the top-1 and the top-5 accuracy in percent
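The same top-k computation can be written in plain Python, which makes the logic easy to check by hand. The function name `topk_accuracy`, the scores and the targets are made up for illustration.

```python
def topk_accuracy(scores, targets, topk=(1,)):
    """Percentage of samples whose target is among the k highest-scoring classes."""
    results = []
    for k in topk:
        hits = 0
        for row, t in zip(scores, targets):
            # Class indices sorted by descending score, truncated to the best k.
            ranked = sorted(range(len(row)), key=lambda c: -row[c])[:k]
            hits += t in ranked
        results.append(100.0 * hits / len(targets))
    return results


# Made-up scores for 4 samples and 3 classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
targets = [1, 1, 0, 2]
top1, top2 = topk_accuracy(scores, targets, topk=(1, 2))
print(top1, top2)
```

Only the first sample is right at top-1; the second sample becomes a hit once the two best classes are considered.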
To evaluate the selected architecture, we randomly initialize its weights (weights learned during the search process are discarded), train it from scratch, and report its performance on the test set.
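The architecture evaluation stage takes the cells found during architecture search (the learned normal cell and reduction cell) and trains the resulting network from scratch. This is much like ordinary training, except that the network structure is now fixed.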
def main():
    if not torch.cuda.is_available():
        logging.info('no gpu device available')
        sys.exit(1)

    cudnn.benchmark = True
    logging.info('gpu device = %d' % args.gpu)
    logging.info("args = %s", args)

    # Load the normal cell and reduction cell learned by train_search; genotypes.DARTS is the learned DARTS_V2:
    # DARTS_V2 = Genotype(
    #     normal=[('sep_conv_3x3', 0), ('sep_conv_3x3', 1), ('sep_conv_3x3', 0), ('sep_conv_3x3', 1),
    #             ('sep_conv_3x3', 1), ('skip_connect', 0), ('skip_connect', 0), ('dil_conv_3x3', 2)],
    #     normal_concat=[2, 3, 4, 5],
    #     reduce=[('max_pool_3x3', 0), ('max_pool_3x3', 1), ('skip_connect', 2), ('max_pool_3x3', 1),
    #             ('max_pool_3x3', 0), ('skip_connect', 2), ('skip_connect', 2), ('max_pool_3x3', 1)],
    #     reduce_concat=[2, 3, 4, 5])
    genotype = eval("genotypes.%s" % args.arch)
    model = Network(args.init_channels, CIFAR_CLASSES, args.layers, args.auxiliary, genotype)
    model = model.cuda()

    logging.info("param size = %fMB", utils.count_parameters_in_MB(model))

    criterion = nn.CrossEntropyLoss()
    criterion = criterion.cuda()
    optimizer = torch.optim.SGD(
        model.parameters(),
        args.learning_rate,
        momentum=args.momentum,
        weight_decay=args.weight_decay)

    train_transform, valid_transform = utils._data_transforms_cifar10(args)
    train_data = dset.CIFAR10(root=args.data, train=True, download=True, transform=train_transform)
    valid_data = dset.CIFAR10(root=args.data, train=False, download=True, transform=valid_transform)

    train_queue = torch.utils.data.DataLoader(
        train_data, batch_size=args.batch_size, shuffle=True, pin_memory=True, num_workers=2)

    valid_queue = torch.utils.data.DataLoader(
        valid_data, batch_size=args.batch_size, shuffle=False, pin_memory=True, num_workers=2)

    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, float(args.epochs))

    for epoch in range(args.epochs):
        logging.info('epoch %d lr %e', epoch, scheduler.get_last_lr()[0])
        model.drop_path_prob = args.drop_path_prob * epoch / args.epochs

        train_acc, train_obj = train(train_queue, model, criterion, optimizer)
        logging.info('train_acc %f', train_acc)

        valid_acc, valid_obj = infer(valid_queue, model, criterion)
        logging.info('valid_acc %f', valid_acc)

        utils.save(model, os.path.join(args.save, 'weights.pt'))
        scheduler.step()  # advance the cosine schedule once per epoch
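The difference between model and model_search lies in the cell: model builds the network directly from the learned genotype instead of from mixed operations.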
class Cell(nn.Module):

    def __init__(self, genotype, C_prev_prev, C_prev, C, reduction, reduction_prev):
        super(Cell, self).__init__()
        print(C_prev_prev, C_prev, C)

        if reduction_prev:
            self.preprocess0 = FactorizedReduce(C_prev_prev, C)
        else:
            self.preprocess0 = ReLUConvBN(C_prev_prev, C, 1, 1, 0)
        self.preprocess1 = ReLUConvBN(C_prev, C, 1, 1, 0)

        # Depending on whether this is a reduction cell or a normal cell,
        # look up the selected operations and the predecessor nodes they read from.
        if reduction:
            op_names, indices = zip(*genotype.reduce)
            concat = genotype.reduce_concat
        else:
            op_names, indices = zip(*genotype.normal)
            concat = genotype.normal_concat
        self._compile(C, op_names, indices, concat, reduction)

    def _compile(self, C, op_names, indices, concat, reduction):
        assert len(op_names) == len(indices)
        self._steps = len(op_names) // 2  # two incoming edges per intermediate node
        self._concat = concat
        self.multiplier = len(concat)

        self._ops = nn.ModuleList()
        for name, index in zip(op_names, indices):
            stride = 2 if reduction and index < 2 else 1
            op = OPS[name](C, stride, True)
            self._ops += [op]
        self._indices = indices
A large network of 20 cells is trained for 600 epochs with batch size 96. The initial number of channels is increased from 16 to 36 to ensure our model size is comparable with other baselines in the literature (around 3M).
Other hyperparameters remain the same as the ones used for architecture search.