SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks
As FPGAs are intrinsically parallel devices, they are well suited to neural network implementations, and several studies have analyzed their application [3], [4], [5]. FPGA neural network applications can be broadly classified according to whether they include the learning process (“on-chip implementations”) [6], [5] or whether the neural network model is trained externally on a personal computer (PC) and the FPGA acts as a hardware accelerator (“off-chip implementations”) [7], [8]. Programming an FPGA is not a trivial task: FPGAs are predominantly programmed in hardware description languages such as VHDL or Verilog, which are complex and make development very time-consuming in most cases. An important aspect of implementing an algorithm on an FPGA is the data type representation. FPGAs favor a fixed-point representation because it is more efficient. A floating-point representation can be used, but it requires specific cores [9], [10]. On this issue, Savich et al. (2007) [11] present an interesting analysis of implementing floating-point neural algorithms in fixed-point arithmetic.
Driven by the availability of massive data and the computational capability to process it, deep learning has recently emerged as a critical tool for solving complex problems across a wide range of domains, including image recognition [22], speech processing [3, 14, 18], natural language processing [9], language translation [11], and autonomous vehicles [23]. Convolutional neural networks (CNNs) have become the most popular algorithmic approach for deep learning for many of these domains. Employing CNNs can be decomposed into two tasks: (1) training — in which the parameters of a neural network are learned by observing massive numbers of training examples, and (2) inference — in which a trained neural network is deployed in the
field and classifies the observed data. Today, training is often done on GPUs [27] or farms of GPUs, while inference depends on the application and can employ CPUs, GPUs, FPGAs or specially-built ASICs.
During the training process, a deep learning expert will typically architect the network, establishing the number of layers, the operation performed by each layer, and the connectivity between layers. Many layers have parameters, typically filter weights, which determine their exact computation. The objective of the training process is to learn these weights, usually via a stochastic gradient descent-based excursion through the space of weights. This process typically employs a forward-propagation calculation for each training example, a measurement of the error between the computed and desired output, and then back-propagation through the network to update the weights. Inference has similarities, but only includes the forward-propagation calculation. Nonetheless, the computation requirements for inference can be enormous, particularly with the emergence of deeper networks (hundreds of layers [19, 20, 29]) and larger input sets, such as high-definition video. Furthermore, the energy efficiency of this computation is important, especially for mobile platforms, such as autonomous vehicles, cameras, and electronic personal assistants. Recently published works have shown that common networks have significant redundancy and can be pruned dramatically during training without substantively affecting accuracy [17]. Our experience shows that the number of weights that can be eliminated varies widely across the layers but typically ranges from 20% to 80% [16, 17]. Eliminating weights results in a network with a substantial number of zero values, which can potentially reduce the computational requirements of inference. The inference computation also offers a further optimization opportunity, as many networks employ as their non-linear operator the ReLU (rectified linear unit) function, which clamps all negative activation values to zero. The activations are the output values of an individual layer that are passed as inputs to the next layer. Our
experience shows that for typical data sets, 50–70% of the activations are clamped to zero. Since the multiplication of weights and activations is the key computation for inference, the combination of these two factors can reduce the amount of computation required by over an order of magnitude. Additional benefits can be achieved by a compressed encoding for zero weights and activations, thus allowing more to fit in on-chip RAM and eliminating energy-costly DRAM accesses.
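The "order of magnitude" claim can be checked with a quick back-of-the-envelope calculation. The sketch below assumes that zeros in the weights and in the activations are statistically independent, which the text does not state; the function name and the chosen densities are illustrative, picked from within the ranges quoted above.

```python
def useful_multiply_fraction(weight_density: float, activation_density: float) -> float:
    """Fraction of multiplies whose two operands are both non-zero.

    Assumes zeros in weights and activations are independent (an assumption
    made for this sketch, not a claim from the paper).
    """
    return weight_density * activation_density


# Illustrative point: 80% of weights pruned (density 0.2) and
# 70% of activations clamped to zero by ReLU (density 0.3).
useful = useful_multiply_fraction(0.2, 0.3)
ideal_reduction = 1.0 / useful
print(f"useful multiplies: {useful:.1%}, ideal reduction: {ideal_reduction:.1f}x")
```

At the less sparse end of the quoted ranges the reduction is much smaller, so the benefit depends strongly on the layer.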
This paper introduces the Sparse CNN (SCNN) accelerator architecture, a new CNN inference architecture that exploits both weight and activation sparsity to improve the performance and power of DNNs. Our SCNN accelerator is designed to optimize the computation of the convolutional layers as state-of-the-art DNNs for computer vision are primarily dominated by these compute-intensive layers [24, 31]. Previous works have employed techniques for exploiting sparsity, including saving computation energy for zero-valued activations and compressing weights and activations stored in DRAM [7, 8]. Other works have used either a compressed encoding of activations [1] or compressed weights [34] in parts of their dataflow to reduce data transfer bandwidth and save time for computations of some multiplications with a zero operand. While these prior architectures have largely focused on eliminating computations and exploiting some data compression, SCNN is the first sparse CNN accelerator that effectively handles both the ineffectual activations and weights at the same time. Furthermore, SCNN employs both an algorithmic dataflow that eliminates all multiplications with a zero and a compressed representation of both weights and
activations through almost the entire computation. At the heart of the SCNN design is a processing element (PE) with a multiplier array that accepts a vector of weights and a vector of activations. Unlike previous convolutional dataflows [6, 8, 12, 28], the SCNN dataflow only delivers to the multiplier array weights and activations that can all be multiplied with one another in the manner of a Cartesian product. To reduce data accesses, the activation vectors are reused in an input stationary [7] fashion while being multiplied with a series of weight vectors. Finally, only non-zero weights and activations are fetched from the input storage arrays and delivered to the multiplier array. As with any CNN accelerator,
SCNN must accumulate the partial products generated by the multipliers. However, since the products generated by the multiplier array cannot be directly summed together, SCNN tracks the output coordinates associated with each multiplication and sends the coordinate and product to a scatter accumulator array for summation.
To increase performance and capacity beyond a single PE, multiple PEs can run in parallel, each working on a disjoint 3D tile of input activations. The compression and tiling of the CNN data enable two energy-saving optimizations. First, maintaining the weights and activations in a compressed form throughout the pipeline reduces energy-hungry data staging and transmission costs. Second, the entire volume of activations of larger CNNs can remain in on-die buffers between layers, entirely eliminating expensive cross-layer DRAM references for a large number of networks. Overall, this design provides efficient compressed storage and delivery of input operands, exploits high reuse of the input operands in the multiplier array, and spends no time on multiplications with zero operands. To evaluate SCNN, we developed a cycle-level performance model and a validated analytical model that allows us to quickly explore the design space of different types of accelerators. We also implemented an SCNN PE in synthesizable SystemC and compiled the design into gates using a combination of commercial high-level synthesis (HLS) tools and a traditional Verilog compiler. Our results show that a 64 PE SCNN implementation with 16 multipliers per PE (1,024 multipliers in total) can be implemented in approximately 7.4 mm² in a 16 nm technology, which is a bit larger than an equivalently provisioned dense accelerator architecture due to the overheads of managing the sparse dataflow. On a range of networks, SCNN provides a 2.7× speedup and a 2.3× energy reduction relative to a comparably provisioned dense CNN accelerator.
Table 1: Network characteristics. Weights and activations assume a data-type size of two bytes.
Convolutional neural network (CNN) algorithms are essentially a cascaded set of pattern recognition filters that need to be trained [23]. A CNN consists of a series of layers, which include convolutional layers, non-linear scalar operator layers, and layers that downsample the intermediate data, for example by pooling. The convolutional layers represent the core of the CNN computation and are characterized by a set of filters that are usually 1×1 or 3×3, and occasionally 5×5 or larger. The values of these filters are the weights that are learned using a training set for the network. Some deep neural networks (DNNs) also include fully-connected layers, typically toward the end of the DNN. During inference, a new image (in the case of image recognition) is presented to the network, which classifies it into the training categories by computing each of the layers in the network in succession. The intermediate data between the layers are called activations, and the output activations of one layer become the input activations of the next layer. In this paper, we focus on
accelerating the convolutional layers as they constitute the majority of the computation.
Table 2 describes how several recent CNN accelerator architectures exploit sparsity. Eyeriss [7] exploits sparsity in activations by storing them in compressed form in DRAM and by gating computation cycles for zero-valued activations to save energy. Cnvlutin [1] is more aggressive: the architecture moves and stages sparse activations in compressed form and skips computation cycles for zero-valued activations to improve both performance and energy efficiency. Both these architectures are also able to partially elide inner-buffer accesses for weights if those weights were only to be multiplied with a zero-valued activation. Conversely, the Cambricon-X [34] architecture exploits sparsity by compressing the pruned weights and skipping computation cycles for zero-valued weights, but it still suffers from wasted computation cycles when a non-zero weight is to be multiplied with a zero-valued activation.
In addition to the different approaches of exploiting sparsity, these architectures also employ distinct dataflows [7] to execute a sparse convolutional layer.
The most relevant distinction among these architectures’ dataflows is how the innermost computation datapath exploits spatial reuse and sparsity patterns.
Eyeriss uses a row-stationary dataflow, multicasting weights and activations across
multiple scalar processing elements (PEs), with each PE independently performing zero-activation detection.
Cnvlutin multiplies a single scalar non-zero activation across a vector of weights (organized by output-channel), and then reduces these output vectors across different input-channels.
Cambricon-X fetches activation vectors across input-channels based on non-zero weight vectors and computes their dot product, including unnecessary work for zero-valued elements of the activation vector.
SCNN’s objective is to exploit sparseness in both activations and pruned weights to eliminate as many computation cycles and data movement and storage operations as possible.
SCNN employs a dense encoding of both sparse weights and activations so that
only non-zero data values are retrieved from DRAM and on-chip buffers.
Unfortunately, orchestrating a dataflow to deliver these sparse datasets to an array of multipliers while maximizing data reuse and multiplier utilization is non-trivial.
Instead of coercing any of the previously proposed dataflows to suit our purpose, we
employ a novel Cartesian product dataflow that exploits both weight and activation reuse while delivering only non-zero weights and activations to the multipliers.
This dataflow performs an all-to-all multiply of non-zero weight and activation vector elements that can avoid any arithmetic based on zero-valued operands and achieve full multiplier utilization in steady-state.
3 SCNN DATAFLOW
While the inner core of the dataflow in SCNN is based on a spatial Cartesian product, the complete dataflow requires a deep nested loop structure, mapped both spatially and temporally across multiple processing elements.
We call the full dataflow PlanarTiledInputStationary-CartesianProduct-sparse, or PT-IS-CP-sparse.
This section first describes a simple CNN convolutional layer to provide context for a detailed discussion of the construction of PT-IS-CP-sparse.
The core operation in a CNN convolutional layer is a 2-dimensional sliding-window convolution of an R× S element filter over a W × H element input activation plane to produce a W × H element output activation plane.
The data can include multiple (C) input activation planes, which are referred to as input channels.
A distinct filter is applied to each input activation channel, and the filter outputs for each of the C channels are accumulated together element-wise into a single output activation plane or output channel.
Multiple filters (K) can be applied to the same volume of input activations to produce K output channels. Finally, a batch of N groups of C channels of input activation planes can be applied to the same volume of filter weights.
All permutations of these 7 loop variables are legal. Figure 3 shows an example loop nest based on one such permutation. We can concisely describe this nest as N → K → C → W → H → R → S. Each point in the 7-dimensional space formed from these variables represents a single multiply-accumulate operation. For the remainder
of this paper, we assume a batch size of 1 (N = 1), which is common for inferencing tasks.
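The loop nest N → K → C → W → H → R → S can be written out directly. The following is a minimal sketch rather than a reproduction of Figure 3: it assumes the input planes are already padded to (W + R − 1) × (H + S − 1) so that the output planes are W × H, and the function name and array layout are illustrative choices.

```python
def conv_layer_reference(inputs, weights):
    """Dense convolution as the 7-deep loop nest N -> K -> C -> W -> H -> R -> S.

    inputs:  nested lists [N][C][W + R - 1][H + S - 1] (assumed pre-padded)
    weights: nested lists [K][C][R][S]
    returns: nested lists [N][K][W][H]
    """
    N, C = len(inputs), len(inputs[0])
    K, R, S = len(weights), len(weights[0][0]), len(weights[0][0][0])
    W = len(inputs[0][0]) - R + 1
    H = len(inputs[0][0][0]) - S + 1
    out = [[[[0.0] * H for _ in range(W)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for c in range(C):
                for w in range(W):
                    for h in range(H):
                        for r in range(R):
                            for s in range(S):
                                # One point of the 7-dimensional space:
                                # a single multiply-accumulate.
                                out[n][k][w][h] += (
                                    inputs[n][c][w + r][h + s] * weights[k][c][r][s]
                                )
    return out


# Tiny check: a 3x3 plane of ones convolved with a 2x2 filter of ones.
ones_in = [[[[1, 1, 1], [1, 1, 1], [1, 1, 1]]]]
ones_w = [[[[1, 1], [1, 1]]]]
print(conv_layer_reference(ones_in, ones_w))
```

Because the body is a pure accumulate, any reordering of the seven loops produces the same result, which is what makes the choice of dataflow a free design parameter.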
This simple loop nest can be transformed in numerous ways to capture different reuse patterns of the activations and weights and to map the computation to a hardware accelerator implementation. A CNN’s dataflow defines how the loops are ordered, partitioned, and parallelized [7]. Prior work has shown that the choice of dataflow has a significant effect on the area and energy efficiency of an architecture [7]. In fact, the choice of dataflow is perhaps the single most significant differentiator between many prior works on CNN architectures.
While the concept of dataflow has been studied for dense architectures, sparse architectures can also employ various alternative dataflows, each with its own set of trade-offs. While an exhaustive enumeration of sparse dataflows is beyond the scope of this paper, we present a specific dataflow called PlanarTiled-InputStationary-CartesianProduct-sparse, or PT-IS-CP-sparse. After examining a range of different dataflows, we selected PT-IS-CP-sparse because it enables reuse patterns that exploit the characteristics of sparse weights and activations. This section first presents an equivalent dense dataflow (PT-IS-CP-dense) to explain the decomposition of the computations and then adds the specific features for PT-IS-CP-sparse.
3.1 The PT-IS-CP-dense Dataflow
Single-multiplier temporal dataflow.
The IS term in PT-IS-CP-dense describes the temporal component of the dataflow. First, consider the operation of a scalar processing element (PE) with a single multiply-accumulate unit. We employ an input-stationary (IS) computation order in which an input activation is held stationary at the computation units as it is multiplied by all of the filter weights needed to make all of its contributions to each of the K output channels (a K × R × S sub-volume). Thus each input activation will contribute to a volume of K × W × H output activations. This order maximizes the reuse of the input activations, while paying a cost to stream the weights to the computation units. Accommodating multiple input channels (C) adds an additional outer loop and results in the loop nest C → W → H → K → R → S.
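The input-stationary order C → W → H → K → R → S can be sketched as follows. This is an illustrative single-multiplier version, assuming a batch size of 1 and an unpadded ("valid") convolution; here the W and H loops walk over input positions, and each held activation scatters its products to the output coordinates (w − r, h − s). The function name and data layout are assumptions made for the example.

```python
def conv_input_stationary(inputs, weights):
    """Input-stationary loop nest C -> W -> H -> K -> R -> S (batch size 1).

    inputs:  nested lists [C][Win][Hin]
    weights: nested lists [K][C][R][S]
    returns: nested lists [K][Win - R + 1][Hin - S + 1]
    """
    C, K = len(inputs), len(weights)
    R, S = len(weights[0][0]), len(weights[0][0][0])
    w_in, h_in = len(inputs[0]), len(inputs[0][0])
    w_out, h_out = w_in - R + 1, h_in - S + 1
    out = [[[0.0] * h_out for _ in range(w_out)] for _ in range(K)]
    for c in range(C):
        for w in range(w_in):
            for h in range(h_in):
                activation = inputs[c][w][h]  # held stationary in a register
                for k in range(K):
                    for r in range(R):
                        for s in range(S):
                            x, y = w - r, h - s  # output coordinate of this product
                            if 0 <= x < w_out and 0 <= y < h_out:
                                out[k][x][y] += activation * weights[k][c][r][s]
    return out


plane = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
diag_filter = [[[[1, 0], [0, 1]]]]
print(conv_input_stationary(plane, diag_filter))
```

Each activation is read from the input buffer once and then reused K × R × S times, while the weights are re-streamed for every activation, which is the trade-off the paragraph describes.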
The PT-IS-CP-dense dataflow requires input buffers for weights and input activations, and an accumulator buffer to store the partial sums of the output activations. The accumulator buffer must perform a read-add-write operation for every access to a previously-written index. We call this accumulator buffer along with the attached adder an accumulation unit.
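Note: I did not understand how these two steps (the input-stationary loop nest and the accumulation unit) fit together.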
One of the objectives of the SCNN architecture is to maximize opportunities to store compressed activations on-die between network layers. This requires a moderately large input buffer, which can be energy-expensive to access. The input-stationary temporal loop nest amortizes the energy cost of accessing the input buffer over multiple weight and accumulator buffer accesses. More precisely, the register in which the stationary input is held over K × R × S iterations serves as an inner buffer that filters accesses to the larger input buffer.
Unfortunately, the stationarity of input activations comes at the cost of more streaming accesses to the weights and to the partial sums in the accumulator buffer. Blocking the weights and partial sums in the output channel (K) dimension can increase reuse of these data structures and improve energy efficiency. We therefore factor the K output channels into K/Kc output-channel groups of size Kc, and only store weights and outputs for a single output-channel group at a time inside the weight and accumulator buffers. Thus the sub-volumes that are housed in buffers at the computation unit are:
• Weights: C × Kc × R× S
• Inputs: C ×W × H
• Partial Sums: Kc ×W × H
An outer loop over all the K/Kc output-channel groups results in the complete loop nest K/Kc → C → W → H → Kc → R → S. Each iteration of this outer loop will require the weight buffer to be refilled and the accumulator buffer to be drained and cleared, while the contents of the input buffer will be fully reused because the same input activations are used across all output channels.
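To get a feel for the sub-volume sizes listed above, the sketch below evaluates the three expressions for a given layer shape. The layer dimensions are hypothetical and chosen only for illustration; the two-byte element size follows the Table 1 caption, and the sizes are for dense, uncompressed storage.

```python
def buffer_subvolumes(C, K, Kc, R, S, W, H, bytes_per_value=2):
    """Element counts and byte sizes of the buffered sub-volumes when the
    K output channels are blocked into groups of Kc (dense, uncompressed)."""
    if K % Kc != 0:
        raise ValueError("this sketch assumes Kc divides K evenly")
    elements = {
        "weights": C * Kc * R * S,
        "inputs": C * W * H,
        "partial_sums": Kc * W * H,
    }
    return {
        "groups": K // Kc,  # iterations of the outer K/Kc loop
        "elements": elements,
        "bytes": {name: count * bytes_per_value for name, count in elements.items()},
    }


# Hypothetical layer shape (not taken from the paper).
sizes = buffer_subvolumes(C=64, K=128, Kc=8, R=3, S=3, W=56, H=56)
print(sizes)
```

A smaller Kc shrinks the weight and accumulator buffers but increases the number of times the outer loop must refill the weights and drain the accumulators.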
Intra-PE parallelism
The CP term in PT-IS-CP-dense describes how parallelism of many multipliers within a PE can be exploited while maximizing spatial reuse. A vector of F filter weights fetched from the weight buffer and a vector of I inputs fetched from the input activation buffer are delivered to an array of F×I multipliers to compute a full Cartesian product (CP) of output partial sums.
This all-to-all operation has two useful properties. First, each fetched weight is reused (via wire-based multicast) over all I activations; each activation is reused over all F weights. Second, each product yields a useful partial sum such that no extraneous fetches or computations are performed. PT-IS-CP-sparse will exploit these same properties to make computation efficient on compressed-sparse weights and input activations.
The multiplier outputs are sent to the accumulation unit, which updates the partial sums of the output activation. Each multiplier output is accumulated with a partial sum at the matching output coordinates in the output activation space. These coordinates are computed in parallel with the multiplications. The accumulation unit must employ at least F×I adders to match the throughput of the multipliers.
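One step of the F×I multiplier array can be sketched as an all-to-all product over two short vectors. In this illustration every operand carries its coordinates explicitly, which is an assumption made for clarity (in the dense dataflow the coordinates would be derived from loop indices); all weights and activations are assumed to belong to the same input channel, and the names are hypothetical.

```python
from itertools import product


def cartesian_product_step(weight_vec, act_vec):
    """One step of an F x I multiplier array.

    weight_vec: list of F entries (value, (k, r, s))
    act_vec:    list of I entries (value, (x, y)), same input channel as the weights
    returns:    F * I entries (partial_sum, (k, x - r, y - s))
    """
    return [
        (w_val * a_val, (k, x - r, y - s))
        for (w_val, (k, r, s)), (a_val, (x, y)) in product(weight_vec, act_vec)
    ]


# F = 2 weights and I = 2 activations give 4 partial sums.
weights = [(2.0, (0, 0, 0)), (3.0, (1, 1, 0))]
activations = [(5.0, (1, 1)), (7.0, (2, 1))]
for partial_sum, coord in cartesian_product_step(weights, activations):
    print(coord, partial_sum)
```

Every weight meets every activation exactly once, so each fetched operand is reused I or F times and every multiplier produces a partial sum destined for some output coordinate.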
Inter-PE parallelism
Finally, the PT term in PT-IS-CP-dense describes how to scale beyond the practical limits of multiplier count and buffer sizes within a PE. We employ a spatial tiling strategy to spread the work across an array of PEs so that each PE can operate independently. The W×H element activation plane is partitioned into smaller Wt×Ht element planar tiles (PT) that are distributed across the PEs. Each tile extends fully into the input-channel dimension C, resulting in an input-activation volume of C×Wt×Ht assigned to each PE. Weights are broadcast to the PEs, and each PE operates on
its own subset of the input and output activation space.
Unfortunately, strictly partitioning both input and output activations into Wt × Ht tiles does not work because the sliding-window nature of the convolution operation introduces cross-tile dependencies at tile edges. These data halos [13] can be resolved in one of two ways:
Note to self: read this article next.
Input halos: The input buffers at each PE are sized to be slightly larger than C ×Wt × Ht to accommodate the halos. These halo input values are replicated across adjacent PEs, but outputs are strictly private to each PE. Replicated input values can be multicast when they are being fetched into the buffers.
Output halos: The accumulation buffers at each PE are sized to be slightly larger than Kc ×Wt × Ht to accommodate the halos. The halos now contain incomplete partial sums that must be communicated to neighbor PEs for accumulation, which occurs at the end of computing each output-channel group.
Our PT-IS-CP-dense dataflow uses output halos, though the efficiency difference between the two approaches is minimal.
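The output-halo approach can be illustrated with a deliberately simplified one-dimensional sketch (the paper tiles a 2D plane). Each "PE" owns a disjoint slice of the input, accumulates into a private buffer that is R − 1 entries larger than its tile, and the halo entries are added into the neighbor's outputs when the buffer is drained. The function names, the 1D reduction, and the sequential loop standing in for parallel PEs are all assumptions of this example.

```python
def conv1d_direct(inp, filt):
    """Reference: out[x] = sum over r of inp[x + r] * filt[r]."""
    n_out = len(inp) - len(filt) + 1
    return [sum(inp[x + r] * filt[r] for r in range(len(filt))) for x in range(n_out)]


def conv1d_output_halos(inp, filt, tile):
    """Input-stationary 1D convolution over disjoint input tiles with output halos."""
    R = len(filt)
    n_out = len(inp) - R + 1
    out = [0.0] * n_out
    for t0 in range(0, len(inp), tile):  # one iteration per PE
        t1 = min(t0 + tile, len(inp))
        base = t0 - (R - 1)  # lowest output coordinate this PE can touch
        acc = [0.0] * (t1 - t0 + R - 1)  # tile-sized buffer plus R - 1 halo entries
        for w in range(t0, t1):
            for r in range(R):
                acc[(w - r) - base] += inp[w] * filt[r]
        # Drain: entries below t0 are incomplete partial sums that belong to
        # the neighbouring PE; here they are simply added into its outputs.
        for i, value in enumerate(acc):
            x = base + i
            if 0 <= x < n_out:
                out[x] += value
    return out


signal = [1, 2, 3, 4, 5, 6]
taps = [1, 1, 1]
print(conv1d_output_halos(signal, taps, tile=2))
```

With input halos the extra storage would instead sit on the input side, and no partial sums would need to be exchanged after the computation.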
Figure 4 shows pseudo-code for a single PE’s loop nest in the PT-IS-CP-dense dataflow, including blocking in the K dimension (A, C), fetching vectors of input activations and weights (B, D), and computing the Cartesian product in parallel (E, F). X and Y coordinates for the accumulation buffer that are either negative or greater than Wt − 1 and Ht − 1, respectively, correspond to the locations of incomplete partial sums in the halo regions. Communication of these halos to neighboring PEs is not shown in the figure. The Kcoord(), Xcoord(), and Ycoord() functions compute the k, x, and y coordinates of the uncompressed output volume using a de-linearization of the temporal loop indices a and w, the spatial loop indices i and f, and the known filter width and height. Overall, this PT-IS-CP-dense dataflow is simply a reordered, partitioned, and parallelized version of Figure 3.
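Summary: keep the feature map stationary, split the input weights and the input feature map into tiles, compute on each tile, and stitch the results back together at the end.
The sparse dataflow, by contrast, computes using the indices of the non-zero values.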
3.2 PT-IS-CP-sparse Dataflow
PT-IS-CP-sparse is a natural extension of PT-IS-CP-dense that exploits sparsity in the weights and input activations. The dataflow is specifically designed to operate on compressed-sparse encodings of the weights and input activations and produces a compressed-sparse encoding of the output activations. At a CNN layer boundary, the output activations of the previous layer become the input activations of the next layer. While prior work has proposed a number of compressed-sparse representations [1, 15, 34], the specific format used is orthogonal to the sparse architecture itself. The key feature is that decoding a sparse format ultimately yields a non-zero data value and an index indicating the coordinates of the value in the weight or input activation matrices.
Even though calculating output coordinates is trivial, the multiplier outputs are not typically contiguous as they are in PT-IS-CP-dense. Thus the F×I multiplier outputs must be scattered to discontiguous addresses within the Kc×Wt×Ht output range. Because any value in the output range can be non-zero, the accumulation buffer must be kept in an uncompressed format. In fact, output activations will probabilistically have high density even with a very low density of weights and input activations, until they pass through a ReLU operation.
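The claim that outputs are dense even when both operands are sparse follows from the number of products summed into each output. A rough estimate, assuming independent zeros and ignoring exact cancellation between terms (both assumptions of this sketch, not statements from the paper), is 1 − (1 − d_w · d_a)^(C·R·S):

```python
def output_density(weight_density, activation_density, C, R, S):
    """Estimated probability that a pre-ReLU output value is non-zero.

    Each output sums C * R * S products; a product is non-zero only when
    both operands are. Assumes independent zeros and no exact cancellation.
    """
    p_product_nonzero = weight_density * activation_density
    terms = C * R * S
    return 1.0 - (1.0 - p_product_nonzero) ** terms


# Hypothetical layer: 64 input channels, 3x3 filters, 10% dense operands.
print(output_density(0.1, 0.1, C=64, R=3, S=3))
```

Even when only about one product in a hundred is non-zero, an output that sums several hundred products is very likely to be non-zero, which is why the accumulation buffer is kept uncompressed.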
To accommodate the needs of accumulation of sparse partial sums, we modify the monolithic Kc×Wt×Ht accumulation buffer from the PT-IS-CP-dense dataflow into a distributed array of smaller accumulation buffers accessed via a scatter network, which can be implemented as a crossbar switch. The scatter network routes an array of F×I partial sums to an array of A accumulator banks based on the output index associated with each partial sum. Taken together, the complete accumulator array still maps the same Kc×Wt×Ht address range, though the address space is now split across a distributed set of banks. PT-IS-CP-sparse can be implemented via small adjustments of Figure 4. Instead of dense vector fetches, (B) and (D) fetch the compressed-sparse input activations and weights, respectively. In addition, the coordinates of the non-zero values in the compressed-sparse form of these data structures must be fetched from their respective buffers (not shown). Then the accumulator buffer (F) must be indexed with the computed output coordinates from the sparse weights and activations. Finally, when the computation for the output-channel group has been completed, the accumulator buffer is drained and compressed into the output buffer.
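The sparse variant can be sketched by combining the pieces: keep only non-zero operands with their coordinates, form the Cartesian product in chunks of I activations and F weights, and scatter each product to an accumulator bank selected from its output address. The compressed format (a list of value/coordinate pairs), the address linearization, and the modulo bank mapping are all assumptions of this example; halo partial sums are simply dropped rather than exchanged.

```python
from itertools import product


def compress(entries):
    """Keep only non-zero (value, coordinate) pairs."""
    return [(value, coord) for value, coord in entries if value != 0]


def sparse_pe_channel(acts, weights, Wt, Ht, F=2, I=2, num_banks=4):
    """Process one input channel of one PE tile.

    acts:    (value, (x, y)) entries for the tile
    weights: (value, (k, r, s)) entries for the same input channel
    returns: list of banks, each a dict {linear_address: partial_sum}
    """
    banks = [dict() for _ in range(num_banks)]
    acts, weights = compress(acts), compress(weights)
    for a0 in range(0, len(acts), I):          # hold I activations stationary
        for w0 in range(0, len(weights), F):   # stream F weights at a time
            pairs = product(acts[a0:a0 + I], weights[w0:w0 + F])
            for (a_val, (x, y)), (w_val, (k, r, s)) in pairs:
                ox, oy = x - r, y - s
                if not (0 <= ox < Wt and 0 <= oy < Ht):
                    continue  # halo partial sum; neighbour exchange omitted here
                addr = (k * Wt + ox) * Ht + oy     # address in the Kc x Wt x Ht range
                bank = banks[addr % num_banks]     # assumed interleaved bank mapping
                bank[addr] = bank.get(addr, 0.0) + a_val * w_val
    return banks


tile_acts = [(1.0, (0, 0)), (0.0, (0, 1)), (2.0, (1, 1))]
tile_weights = [(3.0, (0, 0, 0)), (0.0, (0, 1, 0)), (4.0, (0, 1, 1))]
print(sparse_pe_channel(tile_acts, tile_weights, Wt=2, Ht=2))
```

Two products issued in the same step can map to the same bank; a real scatter network has to arbitrate such conflicts, which this sequential sketch does not model.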
4 SCNN ACCELERATOR ARCHITECTURE
CNNs typically consist of a series of layers, including convolution, non-linear, pooling, and fully-connected layers. As the convolution layers typically dominate both arithmetic and computation time, the SCNN architecture is optimized for efficiency on these layers. For example, on GoogLeNet, the number of multiplies in the fully connected layers only accounts for 1% of the total computation. However, SCNN also includes dedicated logic for the simple localized non-linear and pooling layers. The non-linear layer is applied on a per-element basis at the end of a convolution or fully-connected layer, and is often implemented using the ReLU operator. A pooling layer can be applied after the ReLU layer; a typical 2×2 max pooling operator retains the maximum value in a window of four elements, thus reducing the volume of data passed to the next layer. While fully connected layers are similar in nature to the convolution layers, they do require much larger weight matrices. However, recent work has demonstrated effective DNNs without fully connected layers [24]. Section 4.3 describes further how FC layers can be processed by SCNN.
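The two simple post-convolution operators mentioned above can be sketched in a few lines. The example assumes non-overlapping 2×2 windows (a stride of 2) and even plane dimensions, neither of which is stated in the text; the function names are illustrative.

```python
def relu(plane):
    """Clamp negative activations to zero, element by element."""
    return [[max(0.0, value) for value in row] for row in plane]


def max_pool_2x2(plane):
    """Keep the maximum of each non-overlapping 2x2 window (stride 2 assumed)."""
    return [
        [
            max(plane[i][j], plane[i][j + 1], plane[i + 1][j], plane[i + 1][j + 1])
            for j in range(0, len(plane[0]) - 1, 2)
        ]
        for i in range(0, len(plane) - 1, 2)
    ]


conv_out = [
    [-1, 2, 3, -4],
    [5, -6, -7, 8],
    [-1, -2, -3, -4],
    [0, -1, -5, -6],
]
print(max_pool_2x2(relu(conv_out)))
```

ReLU is what reintroduces zeros into the otherwise dense convolution outputs, and pooling then shrinks the plane by a factor of four before it is passed to the next layer.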
4.1 Tiled Architecture
A full SCNN accelerator employing the PT-IS-CP-sparse dataflow of Section 3 consists of multiple SCNN processing elements (PEs) connected via simple interconnections. Figure 5 shows an array of PEs, with each PE including channels for receiving weights and input activations, and channels delivering output activations. The PEs are connected to their nearest neighbors to exchange halo values during the processing of each CNN layer. The PE array is driven by a layer sequencer that orchestrates the movement of weights and activations and is connected to a DRAM controller that can broadcast weights to the PEs and stream activations to/from the PEs. SCNN can use an arbitrated bus as the global network to facilitate the weight broadcasts, the point-to-point delivery of input activations (IA) from DRAM, and the return of output activations (OA) back to DRAM. The figure omits these links for simplicity.
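Summary: the PEs are joined by simple interconnects, and a layer sequencer dispatches weights and activations to each PE through the DRAM controller. SCNN uses an arbitrated bus to manage the distribution of weights across the whole array, as well as the delivery of input data from DRAM and the write-back of results to DRAM.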
4.2 Processing Element (PE) Architecture
Figure 6 shows the microarchitecture of an SCNN PE, including a weight buffer, input/output activation RAMs (IARAM and OARAM), a multiplier array, a scatter crossbar, a bank of accumulator buffers, and a post-processing unit (PPU). To process the first CNN layer, the layer sequencer streams a portion of the input image into the IARAM of each PE and broadcasts the compressed-sparse weights into the weight buffer of each PE. Upon completion of the layer, the sparse-compressed output activation is distributed across the OARAMs of the PEs. When possible, the activations are held in the IARAMs/OARAMs and are never swapped out to DRAM. If the output activation volume of a layer can serve as the input activation volume for the next layer, the IARAMs and OARAMs are logically swapped between the two layers’ computation sequences. Each layer of the CNN has a set of parameters that configure the controllers in the layer sequencer, the weight FIFO, the IARAMs/OARAMs, and the PPU to execute the required computations.
图6显示了SCNN PE的微体系结构,包括权重缓冲器,输入/输出激活RAM(IARAM和OARAM),乘法器阵列,分散交叉开关,一组累加器缓冲器和后处理单元(PPU)。为了处理第一CNN层,层排序器将输入图像的一部分流入每个PE的IARAM,并将压缩稀疏权重广播到每个PE的权重缓冲器中。完成该层后,稀疏压缩输出激活值分布在PE的OARAM中。在可能的情况下,激活值被保存在IARAMs / OARAMs中,并且永远不会换出到DRAM。如果一个图层的输出激活量可以作为下一个图层的输入激活量,则IARAM和OARAM在两个图层的计算序列之间进行逻辑交换。 CNN的每一层都有一组参数,用于配置层顺序器,权重FIFO,IARAM / OARAM和PPU中的控制器以执行所需的计算。
Input weights and activations
Each PE’s state machine operates on the weight and input activations in the order defined by the PT-IS-CP-sparse dataflow to produce an output-channel group of Kc×Wt×Ht partial sums inside the accumulation buffers. First, a vector F of compressed weights and a vector I of compressed input activations are fetched from their respective buffers. These vectors are distributed into the F×I multiplier array, which computes a form of the Cartesian product of the vectors, i.e., every input activation is multiplied by every weight to form a partial sum. At the same time, the indices from the sparse-compressed weights and activations are processed to compute the output coordinates in the dense output activation space.
每个PE的状态机按照由PT-IS-CP-稀疏数据流定义的顺序对权重和输入激活进行操作,以在累积缓冲器内部产生Kc×Wt×Ht部分和的输出通道组。 首先,从其各自的缓冲器中取出压缩权重矢量F和压缩输入激活矢量I. 这些矢量被分配到F×I乘法器阵列中,该乘法器阵列计算矢量的笛卡尔积的形式,即,每个输入激活被乘以每个权重以形成部分和。 与此同时,处理稀疏压缩权重和激活的索引以计算密集输出激活空间中的输出坐标。
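The all-to-all multiplication and the parallel coordinate computation can be sketched as follows. This is a minimal model under stated assumptions, not the hardware design: the tuple layouts, the function name `cartesian_product_step`, and the unit-stride convention that an activation at (x, y) and a weight at (r, s) contribute to output (x - r, y - s) are all choices made for this illustration.

```python
from collections import defaultdict


def cartesian_product_step(weights, activations, out_w, out_h):
    """Multiply every non-zero weight by every non-zero activation.

    weights:     list of (value, k, r, s) for one input channel
    activations: list of (value, x, y) from the same input channel
    Returns partial sums keyed by the output coordinate (k, out_x, out_y).
    """
    partial_sums = defaultdict(float)
    for w_val, k, r, s in weights:
        for a_val, x, y in activations:
            # Assumed coordinate rule (unit stride, no padding).
            out_x, out_y = x - r, y - s
            if 0 <= out_x < out_w and 0 <= out_y < out_h:
                partial_sums[(k, out_x, out_y)] += w_val * a_val
    return dict(partial_sums)


# Two non-zero weights and two non-zero activations: 2 x 2 = 4 products.
weights = [(2.0, 0, 0, 0), (3.0, 0, 1, 0)]
activations = [(1.0, 1, 0), (4.0, 2, 0)]
result = cartesian_product_step(weights, activations, out_w=3, out_h=1)
print(result)
```

In this example two of the four products land on the same output coordinate (0, 1, 0) and are summed there, which is the job the accumulator buffers perform in the PE.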
Accumulation
The F×I products are delivered to an array of A accumulator banks, indexed by the output coordinates. To reduce contention among products that hash to the same accumulator bank, A is set to be larger than F×I. Our results show that A = 2×F×I sufficiently reduces accumulator bank contention. Each accumulator bank includes adders and a small set of entries for the output channels associated with the output-channel group being processed. The accumulation buffers are double-buffered so that one set of banks can be updated by incoming partial sums while the second set of banks is drained out by the PPU.
F×I点乘被传送到一个A累加器组的阵列通过输出坐标索引。 为了减少拼凑到同一个累加器组的点乘之间的争夺,A被设置为大于F×I。 我们的结果显示A = 2×F×I充分降低了累加器组的争夺。 每个累加器组包括加法器和与正被处理的输出通道组相关联的输出通道的小组条目。 累加缓冲器是双缓冲的,从而一系列组可以通过进入的部分和进行更新,而第二组bank被PPU排出。
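The effect of the bank count on contention can be illustrated with a toy model. The assumptions here are loud ones: the bank is chosen by taking a linearized output index modulo the number of banks (the text does not specify the hash), and each bank is assumed to accept one product per cycle. The function name is hypothetical.

```python
from collections import Counter


def serialization_cycles(linear_indices, num_banks):
    """Cycles needed to absorb one batch of products.

    Assumes bank = index % num_banks and one write per bank per cycle,
    so the busiest bank determines the latency of the batch.
    """
    per_bank = Counter(idx % num_banks for idx in linear_indices)
    return max(per_bank.values())


# Hypothetical linearized output coordinates of six products.
batch = [0, 16, 32, 48, 5, 21]
cycles_a16 = serialization_cycles(batch, num_banks=16)  # A = F*I for a 4x4 array
cycles_a32 = serialization_cycles(batch, num_banks=32)  # A = 2*F*I
print(cycles_a16, cycles_a32)
```

For this hand-picked batch, doubling the banks halves the worst-case queue. Real behavior depends on the actual hash and on the distribution of output coordinates.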
Post-processing
When the output-channel group is complete, the PPU performs the following tasks: (1) exchange partial sums with neighbor PEs for the halo regions at the boundary of the PE’s output activations, (2) apply the non-linear activation (e.g., ReLU), pooling, and dropout functions, and (3) compress the output activations into the compressed-sparse form and write them into the OARAM. Aside from the neighbor halo exchange, these operations are confined to the data values produced locally by the PE.
当输出通道组完成时,PPU执行以下任务:(1)相邻的PE交换部分和在PE输出激活值的边界处用于halo区域(2)应用非线性激活(例如ReLU),池化,和排出功能,(3)压缩输出激活压缩稀疏形式并将其写入OARAM。 除了从邻居halo交换,这些操作都是限制在由PE本地产生的数据值。
Compression
To compress the weights and activations, we use variants of previously proposed compressed sparse matrix representations [15, 33]. Figure 7 shows an example of SCNN’s compressed sparse encoding for R = S = 3 and K = 2 with 6 non-zero elements. The encoding includes a data vector consisting of the non-zero values and an index vector that includes the number of non-zero values followed by the number of zeros before each value. The 3-dimensional R×S×K volume is effectively linearized, enabling full compression across the dimension transitions. The activations are encoded in a similar fashion, but across the H×W×C dimensions. As the activations are divided among the PEs, each tile of compressed activations is actually Ht×Wt×C.
为了压缩权重和激活,我们使用先前提出的压缩稀疏矩阵表示的变体[15,33]。 图7显示了对于具有6个非零元素的R = S = 3和K = 2的SCNN的压缩稀疏编码的示例。 编码包括一个由非零值组成的数据向量和一个索引向量,索引向量包括非零值的数量,后面跟着每个值之前的零的数量。 三维R×S×K体积实际上是线性化的,可在维度转换中进行完全压缩。 激活以类似的方式被编码,但在H×W×C维上。 由于激活在PE之间被分开,所以压缩激活的每个块实际上是Ht×Wt×C。
We use four bits per index to allow for up to 15 zeros to appear between any two non-zero elements. Non-zero elements that are further apart can have a zero-value placeholder without incurring any noticeable degradation in compression efficiency. The original (dense) coordinates of the weights and activations are recomputed by keeping a running sum of the number of zero and non-zero elements and then dividing by the appropriate dimension. Determining the coordinates in the accumulator buffer for each multiplier output requires reconstructing the coordinates from index vectors for F and I and combining them with the coordinates of the portion of the output activation space currently being processed. The encoding scheme can be enhanced with optimizations such as rounding up the dimension of each weight filter to a power of two to make division easier or using the four index bits per entry to encode a non-uniform number of zeros. While these optimizations would increase density somewhat, they do not substantively affect the SCNN architecture or our observed results.
我们使用每个索引用四位表示允许出现多达15个零在任何两个非零元素之间。非零元素是进一步分开,可以有一个零值的占位符而不会发生压缩效率的明显降低。原本的(稠密)权重和激活的坐标被重新计算,通过保持零和非零元素的数量的总和运行,通过合适的维度进行分割。决定每个乘法器输出的累加器缓冲区中的坐标需要从F和I的索引向量中重建坐标,用当前正在处理的部分输出激活空间的坐标结合起来。编码方案可以通过优化,例如四舍五入来增强每个重量的维度,以2的幂作划分更容易或者使用每个条目的四个索引位来编码非均匀的零的数量。虽然这些优化会增加密度有些时候,它们并没有实质性地影响SCNN架构或我们的观察结果。
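The encoding and the coordinate recovery can be sketched in a few lines. This is an illustrative reading of the description, not a reproduction of Figure 7: the sample values, the function names, the choice to count placeholders in the leading element count, and the linearization order pos = (k·R + r)·S + s are all assumptions.

```python
MAX_RUN = 15  # largest zero count that fits in a 4-bit index


def encode(dense):
    """Return (data, index): stored values, and [count, zeros-before-each-value...]."""
    data, zero_runs, run = [], [], 0
    for v in dense:
        if v == 0 and run < MAX_RUN:
            run += 1
            continue
        # Either a non-zero value, or a zero stored as a placeholder
        # because the preceding run already holds 15 zeros.
        data.append(v)
        zero_runs.append(run)
        run = 0
    return data, [len(data)] + zero_runs


def decode(data, index, length):
    """Rebuild the dense list; trailing zeros are restored from `length`."""
    dense = []
    for v, run in zip(data, index[1:]):
        dense.extend([0] * run)
        dense.append(v)
    dense.extend([0] * (length - len(dense)))
    return dense


def coordinates(index, R, S):
    """Recover (k, r, s) per stored value from a running position sum."""
    pos, out = -1, []
    for run in index[1:]:
        pos += run + 1
        k, rem = divmod(pos, R * S)
        r, s = divmod(rem, S)
        out.append((k, r, s))
    return out


# Hypothetical R = S = 3, K = 2 weight volume with 6 non-zero elements.
dense = [0, 0, 23, 0, 0, 0, 18, 0, 42, 77, 0, 0, 0, 24, 0, 0, 8, 0]
data, index = encode(dense)
coords = coordinates(index, R=3, S=3)
print(data, index, coords)
```

Because the volume is linearized before encoding, a run of zeros can cross from one filter (k = 0) into the next (k = 1) without any special handling, which is what the text means by compression across dimension transitions.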
4.3 Fully-connected Layers
While our Cartesian product-based SCNN dataflow dramatically improves its efficiency at the convolutional layers, it does impose some challenges when handling fully-connected layers. Concretely, unlike a convolution filter, an individual weight connection in a fully connected layer is not reused across multiple input activations. Thus, the Cartesian product approach of SCNN does not automatically align non-zero weights and activations that must be multiplied. As a result, the SCNN 4×4 multiplier array can only operate at a peak rate of 4 multiplies per cycle (25% of peak throughput) because the 4 input activations and 4 weights can produce only 4 useful products. Another challenge with fully-connected layers is aligning the sparse weights and the sparse activations so that the appropriate non-zero values are delivered into the multiplier array at the same time. The SCNN’s logic for processing the activation and weight indices can be reused to determine the alignment, but some additional multiplexing hardware would be required to move the non-zero weights into position.
虽然我们基于笛卡尔乘积的SCNN数据流有很大的程度上提高了卷积层的效率,但是当处理全连接层时加强了一些挑战。具体而言,与卷积过滤器不同,全连接层中的单个权重连接不会在多个输入激活之间重复使用。从而,SCNN的笛卡尔乘积法不会自动完成对齐非零权重和必须相乘的激活。如结果,SCNN 4×4乘法器阵列只能在峰值下工作每个周期4个乘法的速率(峰值吞吐量的25%),因为4个输入激活和4个权重只能产生4个有用的乘积。全连接层的另一个挑战是对齐稀疏权重和稀疏激活使适当的非零值同时传递到乘法器数组中。该SCNN的处理激活和重量指标的逻辑可以是重复使用以确定对齐,但是需要一些额外的复用硬件将被要求移动非零权重位置。
While the loss of average throughput at the fully-connected layers would make SCNN unattractive for networks that are dominated by fully-connected layers, state-of-the-art CNNs for image classification, detection, and segmentation are primarily dominated by the convolutional layers; in the networks we used in our study, the fully-connected layers accounted for only 8%, 1%, and 2% of the multiplication operations in AlexNet, GoogLeNet, and VGGNet, respectively. In addition, networks for computer vision are decreasing their reliance on fully-connected layers and, in fact, recent networks eliminate these layers completely [24]. Furthermore, the fully-connected layers are generally memory-bandwidth limited as they spend most of their execution time delivering network weights from DRAM. While SCNN’s noticeable throughput reduction on fully-connected layers is not ideal, it is not a significant performance limiter for these memory-hungry layers. We argue that systems desiring optimal efficiency for both convolution and fully-connected layers should consider employing both SCNN and an architecture such as EIE that is optimized for fully-connected layers [15].
尽管全连接层的平均吞吐量的损失将使得SCNN对于由全连接层支配的网络不具吸引力,但是用于图像分类,检测和分割的最新CNN主要由卷积层主导;在我们研究中使用的网络中,全连通层分别仅占AlexNet,GoogLeNet和VGGNet中乘法运算的8%,1%和2%。另外,计算机视觉网络正在减少它们对全连接网络的依赖,事实上,最近的网络完全消除了这些层[24]。此外,全连接的层通常受限于存储器带宽,因为它们大部分的执行时间都是从DRAM提供网络权重。尽管SCNN在完全连接的层上显着的吞吐量降低并不理想,但对于这些需要内存的层来说,它并不是一个显着的性能限制器。我们认为,要求卷积和全连接层的最佳效率的系统应该考虑采用SCNN和EIE这样的架构,EIE这种架构已经针对完全连接层进行了优化[15]
4.4 Temporal Tiling for Large Models
SCNN compresses weights and activations to reduce both arithmetic operations and data movement. Ideally, the degree of compression and the capacity of the IARAMs and OARAMs are large enough so that the activations are never evicted to outer layers of the memory hierarchy. While we size our activation RAMs to capture the capacity requirements of nearly all of the layers in the networks we examined, a few layers of VGGNet require activations to be saved to and restored from DRAM. Like other accelerator architectures, SCNN can temporally tile the activation space so that the collection of PEs operates on a sub-volume of the activations at a time. This temporal tiling can be applied in addition to the spatial tiling that SCNN already employs to partition the activation volume across the PEs.
SCNN压缩权重和激活以减少算术操作和数据移动。 理想情况下,压缩的程度和IARAM和OARAM的容量足够大才使得激活不会被驱逐到内存架构的外层。 而我们调整我们的激活RAM来捕捉容量我们检查的网络中几乎所有层的要求,几层VGGNet需要激活能被保存和从DRAM中恢复。 像其他加速器体系结构一样,SCNN可以临时平铺激活空间,以便收集PE在一次激活的子卷上操作。
While a temporally tiled convolution layer still broadcasts weights to all of the PEs, the input activation planes are partitioned into coarse-grained tiles (across all channels) that each fit into the total IARAM capacity of the accelerator. The output activation tile is offloaded to DRAM and then reloaded into the IARAM when the data is needed as the input activation to the next layer. This type of tiling leads to a small halo at the edge of each input activation tile, resulting in a few additional input activation fetches from DRAM. The temporally tiled PT-IS-CP-sparse dataflow still exploits the reuse of each input activation value from the IARAM R×S×Kc times.
虽然时间平铺的卷积层仍然广播权重对于所有的PE,输入激活平面被分割成
粗糙的瓦片(横跨所有渠道),每一个适应加速器总的IARAM容量。 输出激活不用加载到内存,然后当需要作为输入激活到下一层时重新加载到IARAM中。
In the networks we analyzed, only 9 of the 72 total layers fail to fit entirely within the IARAM/OARAM structures, with all of them coming from VGGNet. Our analysis shows that for both dense and sparse architectures, the DRAM accesses for one temporal tile can be hidden by pipelining them in tandem with the computation of another tile. Our nominal DRAM bandwidth configuration of 50 GB/s provides ample bandwidth to absorb the additional activation traffic. Only when DRAM bandwidth drops to around 4 GB/s does performance degrade. On these 9 layers, the per-layer energy penalty of activation data transfer ranges from 5–62%, with a mean of 18%. This overhead is fairly low and will be borne by all CNN architectures with similar levels of on-chip RAM for activations; overheads will be higher for accelerators that do not compress activations. While the tiling approach is attractive, we expect some low-power deployment scenarios to motivate neural network designers to size their networks so that they fit completely in the on-chip SRAM capacity provided by the accelerator implementation.
在我们分析的网络中,72个总层中只有9个完全不能完全适应IARAM / OARAM结构,全部来自VGGNet。我们的分析表明,对于密集和稀疏的体系结构,一个时间片的DRAM访问可以通过与另一个片的计算一起流水线来隐藏。我们标称的50GB / s的DRAM带宽配置提供了足够的带宽来吸收额外的激活流量。只有当DRAM带宽下降到4 GB / s左右时,性能才会下降。在这9层中,激活数据传输的每层能量损失范围为5-62%,平均值为18%。这种开销相当低,用片上RAM激活值的类似级别的的所有CNN架构都能产生这样的损失;不压缩激活的加速器的开销将会更高。虽然平铺方法很有吸引力,但我们期望一些低功耗部署方案能够激励神经网络设计人员对其进行调整,以使其完全符合加速器实施提供的片上SRAM容量。
4.5 SCNN Architecture Configuration
While the SCNN architecture can be scaled across a number of dimensions, Table 3 lists the key parameters of the SCNN design we explore in this paper. The design employs an 8×8 array of PEs, each with a 4×4 multiplier array, and an accumulator buffer with 32 banks. We chose a design point of 1,024 multipliers to match the expected computation throughput required to process HD video in real-time at acceptable frame rates. The IARAM and OARAM are sized so that the sparse activations of AlexNet and GoogLeNet can fit entirely within these RAMs so that activations need not spill to DRAM. The weight FIFO and the activation RAMs each carry a 4-bit overhead for each 16-bit value to encode the coordinates in the compressed-sparse format. In total, the SCNN design includes 1,024 multipliers and 1MB of activation RAM. At the synthesized clock speed of the PE of slightly more than 1 GHz in a 16nm technology, this design achieves a peak throughput of 2 Tera-ops (16-bit multiplies plus 24-bit adds).
而SCNN体系结构可以扩展到许多尺寸,表3列出了SCNN设计的关键参数
我们在本文中探讨。该设计采用8×8阵列的PE,每个都有一个4×4乘法器阵列和一个累加器缓冲器32个bank。我们选择了1,024个乘法器的设计点进行匹配处理高清视频所需的预期计算吞吐量实时以可接受的帧速率。 IARAM和OARAM调整,以便AlexNet和GoogLeNet的稀少激活可以完全在这些RAM中,以便激活不必溢出到DRAM。权重FIFO和激活RAM各自承载每个16位值的4位开销用于对坐标进行编码压缩稀疏格式。 SCNN的总体设计包括共有1,024个乘法器和1MB的激活RAM。在综合的时钟速度稍微超过1 GHz的PE in采用16纳米技术,该设计实现了2的峰值吞吐量Tera-ops(16位乘加24位加)
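The headline numbers of this configuration follow from simple arithmetic, which the sketch below reproduces. The variable names are hypothetical, the clock is rounded down to exactly 1 GHz (the text says "slightly more than 1 GHz"), and counting one multiply plus one add per multiplier per cycle is an assumption about how the 2 Tera-ops figure is obtained.

```python
# Design parameters quoted in the text.
pe_rows, pe_cols = 8, 8          # 8x8 array of PEs
f, i = 4, 4                      # 4x4 multiplier array per PE
value_bits, index_bits = 16, 4   # 4-bit index per 16-bit value

multipliers = pe_rows * pe_cols * f * i   # total multipliers on chip
accumulator_banks = 2 * f * i             # A = 2 x F x I per PE

clock_hz = 1_000_000_000                  # assumed: rounded down to 1 GHz
ops_per_multiplier = 2                    # assumed: one multiply + one add
peak_ops_per_s = multipliers * ops_per_multiplier * clock_hz

# Storage overhead of the index vector relative to the data vector.
index_overhead = index_bits / value_bits

print(multipliers, accumulator_banks, peak_ops_per_s, index_overhead)
```

The 25% index overhead means the compressed format only saves storage once density falls below 80%, since a value stored with its index costs 20 bits instead of 16.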
Area Analysis
To prototype the SCNN architecture, we designed an SCNN PE in synthesizable SystemC and then used the Catapult high-level synthesis (HLS) tool [25, 26] to generate Verilog RTL. During this step, we used HLS design constraints to optimize the design by mapping different memory structures to synchronous RAMs and latch arrays and pipelining the design to achieve full throughput. We then used Synopsys Design Compiler to perform placement-aware logic synthesis and obtain post-synthesis area estimates in a TSMC 16nm FinFET technology. Table 4 summarizes the area of the major structures of the SCNN PE. A significant fraction of the PE area is contributed by memories (IARAM, OARAM, accumulator buffers), which consume 57% of the PE area, while the multiplier array only consumes 6%. IARAM and OARAM are large in size and consume 25% of the PE area. Accumulator buffers, though smaller in size compared to IARAM/OARAM, are heavily banked (32 banks), contributing to their large area.
为了建立SCNN架构的原型,我们设计了一个可综合SystemC中的SCNN PE,然后使用弹射器高级综合(HLS)工具[25,26]来生成Verilog RTL。在这一步中,我们使用HLS设计约束来通过将不同的存储器结构映射到同步RAM来优化设计并锁存阵列和流水线设计,以达到完全的吞吐量。然后,我们使用Synopsys Design Compiler执行placementaware逻辑综合,并获得后处理面积估计值用台积电16nm FinFET技术。 表4总结了该SCNN PE的主要结构的面积情况。 PE面积的重要部分是由内存(IARAM,OARAM,累加器)贡献的缓冲区),占用57%PE面积,而乘数数组只消耗6%。 IARAM和OARAM规模很大并消耗PE面积的25%。 累加器缓冲区,但与IARAM / OARAM相比,规模较小,大量存放(32bank),为他们的大面积做出了贡献。
5 EXPERIMENTAL METHODOLOGY
CNN performance and power measurements
To model the performance of the SCNN architecture, we rely primarily on a custom-built cycle-level simulator. This simulator is parameterizable across dimensions including number of processing element (PE) tiles, RAM capacity, multiplier array dimensions (F and I), and accumulator buffers (A). The SCNN simulator is driven by the pruned weights and sparse input activation maps extracted from the Caffe Python interface (pycaffe) [4] and executes each layer of the network one at a time. As a result, the simulator precisely captures the effects of the sparsity of the data and its effect on load balancing within the SCNN architecture.
为了模拟SCNN体系结构的性能,我们主要依赖于一个定制的周期级仿真器。 这个仿真器可以在包括处理单元(PE)瓦片的数量,RAM容量,乘法器数组尺寸(F和I)以及累加器缓冲器(A)的维度上参数化。 SCNN模拟器是由从Caffe Python接口(pycaffe)[4]中提取的修剪权重和稀疏输入激活地图驱动的,并一次执行一个网络的每一层。 因此,模拟器能够精确地捕捉SCNN体系结构内数据的稀疏性及其对负载平衡的影响。
Architecture configurations
Table 5 summarizes the major accelerator configurations that we explore, including both dense and sparse accelerators. All of the accelerators employ the same number of multipliers so that we can compare the performance of the accelerators with the same computational resources. The dense DCNN accelerator operates solely on dense weights and activations and employs a dot-product dataflow called PT-IS-DP-dense. Dot products are usually efficient for dense accelerators because of reduced accumulation-buffer accesses, although this comes at the cost of reduced spatial reuse of weights and input activations. The optimized DCNN-opt architectures have the same configuration as DCNN but employ two optimizations: (1) compression/decompression of activations as they are transferred out of/into DRAM, and (2) multiply ALU gating to save energy when a multiplier input is zero. The DCNN architectures are configured with 2MB of SRAM for holding inter-layer activations, and can hold all of them for AlexNet and GoogLeNet. The SCNN configuration matches the architecture described in Section 4, and includes a total of 1MB of IARAM + OARAM. Because the activations are compressed, this capacity enables all of the activation data for the two networks to be held on chip, without requiring DRAM transfers for activations. The larger VGGNet requires the activation data to be transferred in and out of DRAM. The last column of the table lists the area required for each accelerator; for simplicity, we total only the area for the PE array and SRAM banks, and omit any area for wiring among the PEs. As the interconnect bandwidth requirements for SCNN are less than for dense architectures due to the compression, including interconnect area for both dense and sparse architectures would close the area gap somewhat between the two. While SCNN has smaller activation RAM capacity, its larger size is due to the banked accumulator buffers, as described in Section 4.
表5总结了我们探索的主要加速器配置,包括稠密和稀疏的加速器。所有加速器都使用相同数量的乘法器,以便我们可以将加速器的性能与相同的计算资源进行比较。密集的DCNN加速器仅在密集的权重和激活下运行,并采用称为PT-IS-DP密集型的点积数据流。点积对于密集加速器通常是有效的,因为减少了累积 - 缓冲区访问,尽管这是以减少权重和输入激活的空间重用为代价的。优化的DCNN-opt体系结构与DCNN具有相同的配置,但是采用两种优化方式:(1)当它们被转移到DRAM之外时对其进行压缩/解压缩,以及(2)乘法器输入为零乘法器ALU关闭,来节省能源。 DCNN架构配有2MB的SRAM用于保存层间激活,并可以全部包含用于AlexNet和GoogLeNet的中间层激活值。 SCNN配置与体系结构相匹配在第四章描述,并包含总共1MB的IARAM + OARAM。因为激活被压缩,有能力使得两个网络的所有激活数据都被保持在芯片上,而不需要用于DRAM传输激活值。较大的VGGNet需要将激活数据传入和传出DRAM。表中的最后一列列出了每个加速器所需的面积;为简单起见,我们只对PE阵列进行总计和SRAM bank,并省略PE之间的任何接线面积。由于SCNN的互连带宽要求比密集体系结构要低,包括密集体系结构和稀疏体系结构的互连面积都会缩小两者之间的面积差距。虽然SCNN具有较小的激活RAM容量,但其较大的规模是由于分区累加器缓冲区,如第4节所述。
Benchmarks
As described in Section 2, we use AlexNet and GoogLeNet for the bulk of our experiments. For GoogLeNet, we primarily focus on the convolutional layers that are within the inception modules [31]. VGGNet is known to be over-parameterized, which results in an excessively large amount of inter-layer activation data (6 MB or about 4× the largest GoogLeNet layer). Nonetheless, we use VGGNet as a proxy for large input data (such as high-resolution images) to explore the implications of coarse-grained temporal tiling on accelerator architectures. We leverage two different types of benchmarks to evaluate SCNN’s efficiency. First, we developed synthetic network models where we can adjust the degree of sparsity of both weights and activations. These synthetic models are used to explore the sensitivity of the architectures to sparsity parameters (detailed in Section 6.1). Second, we generate the actual sparse network models using the CNN pruning algorithm proposed by Han et al. [17], which we employ in cycle-level performance simulation. The pruned models have been retrained to achieve the same level of classification accuracy provided by the dense model, and we use these pruned models to obtain the post-ReLU activation maps that we feed into our performance simulator.
如第2节所述,我们使用AlexNet和GoogLeNet大部分的实验。对于GoogLeNet,我们主要集中在初始模块内的卷积层[31]。 VGGNet是超参数化的,这导致层间激活数值量数据过大(6 MB或大约4×最大的GoogLeNet层)。尽管如此,
我们使用VGGNet作为大量输入数据(如高分辨率图像)的代理来探索粗粒度
在加速器体系结构上进行时间平铺。我们利用两种不同类型的基准来评估SCNN的效率。首先,我们开发了可以调整程度的合成网络模型权重和激活的稀疏性。这些合成模型用于探索架构对稀疏性的敏感性参数(详见6.1节)。其次,我们生成实际的使用CNN修剪算法提出的稀疏网络模型Han等人[17],我们在周期级表现中使用模拟。修剪的模型已经被重新训练以达到目的密集模型提供的分类精度水平相同,我们使用这个修剪后的模型来获得ReLU的激活映射到我们的性能模拟器
6 EVALUATION
This section first evaluates the sensitivity of SCNN to the sparseness of weights and activations using a synthetic CNN benchmark. We then measure the performance and energy-efficiency of SCNN versus a dense CNN accelerator, using AlexNet, GoogLeNet, and VGGNet. For brevity, all the inception modules in GoogLeNet are denoted as IC_id in all of the figures discussed in this section.
本节首先使用综合CNN基准评估测量权重和激活稀疏性的SCNN的敏感性。 我们然后测量SCNN和DCNN加速器的性能和能源效率,使用AlexNet,GoogLeNet和VGGNet。为简洁起见,GoogLeNet中的所有初始模块都表示为IC_id在本节讨论的所有图标中
6.1 Sensitivity to CNN Sparsity
We first compare the performance and energy-efficiency of the SCNN, DCNN, and DCNN-opt architectures as we artificially sweep the weight and activation densities in GoogLeNet’s layers from 100% (fully dense) down to 10%. The X-axis of Figure 8 simultaneously scales both weight and activation density. The 0.5/0.5 point corresponds to 50% weight density, 50% activation density, and 25% of the multiplication operations relative to the fully-dense 1.0/1.0 point. Figure 8a shows that at full density, SCNN only achieves about 79% of the performance of DCNN/DCNN-opt because SCNN’s dataflow is more susceptible to certain multiplier underutilization effects than DCNN’s dot-product dataflow. As density decreases to about 0.85/0.85, SCNN starts to perform better than DCNN, ultimately reaching a 24× improvement at the sparsest evaluated point with a density of 0.1/0.1.
我们首先比较SCNN、DCNN和DCNN-opt体系结构的性能和能源效率,,因为我们人为地扫除了重量和GoogLeNet的层激活密度从100%(完全密集)下降到10%。 图8的X轴同时缩放权重量和激活密度。 0.5 / 0.5点对应到50%的权重密度,50%的激活密度,和25%的乘法运算相对于全密集的1.0 / 1.0点。图8a显示,在全密度下,SCNN仅达到DCNN / DCNN-opt1的性能的约79%,由于SCNN的数据流相比DCNN的点积数据流,更容易受到某种乘数效应不足的影响。 当密度降低到大约0.85 / 0.85时,SCNN开始比DCNN表现更好,最终达到24×在最稀疏的评估点,密度为0.1 / 0.1
Figure 8b first shows that DCNN-opt’s energy optimizations of zero gating and DRAM traffic compression enable it to be better than DCNN at every level of density. These energy optimizations are surprisingly effective despite their minimal effect on the design of the accelerator. At full density, SCNN consumes 33% more energy than either dense architecture due to the overheads of storing and maintaining the sparse data structures. SCNN becomes more efficient than DCNN at about 0.83/0.83 density and more efficient than DCNN-opt at 0.6/0.6 density. At the sparsest evaluated point of 0.1/0.1 density, SCNN consumes 6% of the energy of DCNN and 23% of the energy of DCNN-opt. Given the density measurements of the networks in Figure 1, we expect SCNN to (a) significantly outperform the dense architectures on nearly all the layers of the networks we examined, (b) surpass the energy efficiency of DCNN on a majority of layers, and (c) stay roughly competitive with the energy-efficiency of DCNN-opt across most layers.
图8b首先显示了DCNN-opt的能量优化零门控和DRAM交易压缩使它比DCNN在每个密度级别。这些能源优化是惊人的有效,尽管他们的设计最小的影响
加速器。在全密度下,SCNN消耗33%的能量由于存储和维护稀疏的数据结构而导致的高密度体系结构。 SCNN变得更有效率DCNN的密度约为0.83 / 0.83,比DCNN-opt的效率更高密度为0.6 / 0.6。在0.1 / 0.1密度的最稀疏评估点处,
SCNN消耗DCNN的6%的能量和23%的能量DCNN-opt。考虑到网络中的密度测量图1,我们预计SCNN(a)明显优于密集型在我们所检查的网络的几乎所有层上的架构,(b)在大多数层上超过DCNN的能源效率(c)与DCNN-opt的能源效率保持大致相当的竞争力跨越大多数层。
6.2 SCNN Performance and Energy
Performance
We compare the performance of SCNN to the baseline dense DCNN accelerator and to an oracular SCNN design (SCNN(oracle)) that represents an upper bound on performance. The performance of SCNN(oracle) is derived by dividing the number of multiplication operations required for a Cartesian product-based convolution (Section 3) by 1,024, the number of multipliers in the architectures we examine. Figure 9 summarizes the speedups offered by SCNN versus a dense CNN accelerator. Overall, SCNN consistently outperforms the DCNN design across all the layers of AlexNet, GoogLeNet, and VGGNet, achieving an average 2.37×, 2.19×, and 3.52× network-wide performance improvement, respectively.
我们将SCNN的性能与基准稠密DCNN加速器以及SCNN的设计进行比较(SCNN(oracle)),它代表了性能的上限。SCNN(oracle)的性能是通过将笛卡尔乘积卷积(第3节)所需的乘法运算次数除以1,024而得到的,乘数在我们研究的架构中。 图9总结了加速由SCNN提供的是一个密集的CNN加速器。 总的来说,SCNN在所有层面上始终优于DCNN设计AlexNet,GoogLeNet和VGGNet,平均达到2.37×,2.19倍,3.52倍的网络性能提升。
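The oracle bound is a one-line calculation, sketched here under stated assumptions. The function name and the example layer size are hypothetical, and estimating the useful multiplies as the dense count scaled by the product of the two densities follows the 0.5/0.5 example from Section 6.1 rather than a per-layer measurement.

```python
import math


def oracle_cycles(dense_multiplies, weight_density, activation_density,
                  multipliers=1024):
    """Lower bound on cycles: useful multiplies spread perfectly over all multipliers."""
    useful = dense_multiplies * weight_density * activation_density
    return math.ceil(useful / multipliers)


# Hypothetical layer needing 1,024,000 multiplies when fully dense.
dense_layer = 1_024_000
cycles_dense = oracle_cycles(dense_layer, 1.0, 1.0)
cycles_half = oracle_cycles(dense_layer, 0.5, 0.5)
print(cycles_dense, cycles_half)
```

The bound ignores fragmentation inside a PE and load imbalance between PEs, which is why the measured SCNN results fall short of it.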
The performance gap between SCNN and SCNN(oracle) widens in later layers of the network, i.e., the rightmost layers on the x-axis of Figure 9. SCNN suffers from two forms of inefficiency that cause this gap. First, the working set allocated to each PE tends to be smaller in the later layers (e.g., IC_5b) than in the earlier layers (e.g., IC_3a). As a result, assigning enough non-zero activations and weights in the later layers to fully utilize a PE’s multiplier array becomes difficult. In other words, SCNN can suffer from intra-PE fragmentation when layers do not have enough useful work to fully populate the vectorized arithmetic units.
SCNN与SCNN(oracle)之间的性能差距在网络的后续层(即,图9的x轴上的最右边的层)中扩大.SCNN遭受两种导致这种差距的低效率形式。 首先,分配给每个PE的工作集在后面的层(例如,IC_5b)中比在先前的层(例如IC_3a)中倾向于更小。 因此,在后面的层次中分配足够的非零激活和权重以充分利用PE的乘法器数组变得困难。 换句话说,当层没有足够的有用工作来完全填充向量化的算术单元时,SCNN会遭受内部PE碎裂。
The second source of inefficiency stems from the way the PT-IS-CP-sparse dataflow partitions work across the array of PEs, which could lead to load imbalance among the PEs. Load imbalance results in under-utilization because the work corresponding to the next output-channel group Kc+1 can only start after the PEs complete the current output-channel group Kc. The PEs effectively perform an inter-PE synchronization barrier at the boundaries of output-channel groups, which can cause early-finishing PEs to idle while waiting for laggards.
效率低下的第二个原因源于PT-ISCP稀疏数据流分区在PE阵列中的工作方式,
可能导致PE之间的负载不平衡。 加载不平衡会导致低的利用率,因为下一个工作对应输出通道组Kc + 1只能在PE完成当前输出通道组Kc后才能启动。 PE有效地执行在输出信道的边界处的PE间同步屏障,在等待的时候可能导致早期PE的闲置落伍者
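The cost of the barrier can be quantified with a small sketch: every PE waits for the slowest one, so the idle fraction is the gap between the slowest PE's time and the average. The function name and the per-PE cycle counts below are made up for illustration and are not measurements from the paper.

```python
def barrier_idle_fraction(cycles_per_pe):
    """Fraction of PE-cycles spent idle at one output-channel-group barrier."""
    slowest = max(cycles_per_pe)
    total = slowest * len(cycles_per_pe)   # every PE is held until the slowest finishes
    return (total - sum(cycles_per_pe)) / total


# Hypothetical work for four PEs on one output-channel group.
idle = barrier_idle_fraction([100, 80, 60, 100])
print(idle)
```

Sparse data makes this imbalance data-dependent: two PEs holding tiles of the same size can have very different numbers of non-zero activations to process.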
Figure 10 quantitatively demonstrates the intra-PE fragmentation in the multiplier arrays. Fragmentation is severe in the last two inception modules of GoogLeNet, with average multiplier utilization at less than 20%. In this layer, three out of the six convolutional sublayers within the inception module have a filter size of 1×1, resulting in a maximum of 8 non-zero weights within an output-channel group with a Kc value of 8. Nonetheless, later layers generally account for a small portion of the overall execution time as the input activation volume (i.e., H×W×C) gradually diminishes across the layers.
图10定量地演示了乘法器阵列中的PE内部碎片。 GoogLeNet的最后两个启动模块中的碎片严重,平均乘法器利用率低于20%。 在该层中,初始模块中的六个卷积子层中的三个具有1×1大小的滤波器,导致Kc值为8的输出信道组内最多有8个非零权重。但是,稍后 随着输入激活体积(即,H×W×C)逐层逐渐减小,后面的层总体上占整个执行时间的一小部分。
The right y-axis of Figure 10 demonstrates the effect of load imbalance across the PEs by showing the fraction of cycles spent waiting at an inter-PE barrier. Although the inter-PE global barriers and intra-PE fragmentation prevent SCNN from reaching the speedups offered by SCNN(oracle), it still provides an average 2.7× network-wide performance boost over DCNN across the three CNNs we examined.
图10的右侧y轴显示了负载的影响通过显示所花费的周期的比例来衡量整个PEs之间的不平衡等待在一个PE之间的障碍。 虽然PE之间的全球性障碍并且PE内碎片防止SCNN达到相似由SCNN(oracle)提供的加速,它仍然提供了一个平均值整个三网络的DCNN网络性能提升了2.7倍我们审查了CNN
Energy-efficiency
Figure 11 compares the energy of the three accelerator architectures across the layers of the three networks. On average, DCNN-opt improves energy-efficiency by 2.0× over DCNN, while SCNN improves efficiency by 2.3×. SCNN’s effectiveness varies widely across layers depending on the layer density, ranging from 0.89× to 4.7× improvement over DCNN and 0.76× to 1.9× improvement over DCNN-opt. Input layers such as VGGNet_conv1_1 and AlexNet_conv1 usually present a challenge for sparse architectures because of their 100% input activation density. In such cases, the overheads of SCNN’s structures such as the crossbar and distributed accumulation RAMs overshadow any benefits from fewer arithmetic operations and data movement.
图11比较了三个网络中三个加速器体系结构的能量。 平均而言,DCNN-opt比DCNN提高了2.0倍的能源效率,而SCNN提高了2.3倍的效率。 SCNN的效果因层密度而有很大的不同,从DCNN改善0.89倍到4.7倍,DCNN-opt改善0.76倍到1.9倍。 输入层(如VGGNet_conv1_1和AlexNet_conv1)通常对稀疏体系结构提出挑战,因为它们具有100%的输入激活密度。在这种情况下,交叉开关和分布式累加RAM等SCNN结构的开销会掩盖算术运算和数据移动所带来的好处。
These results reveal that although the straightforward DCNN-opt architecture is unable to improve performance, it is remarkably effective at achieving good energy-efficiency on moderately sparse network layers. Nonetheless, SCNN is on average even more energy-efficient across our benchmark networks while providing a tremendous performance advantage over both DCNN and DCNN-opt.
这些结果表明,虽然直截了当的DCNN-opt架构无法提高性能,在适度稀疏的情况下达到较好的节能效果非常显着网络层。 尽管如此,在我们的基准测试网络中,SCNN平均更加节能,同时提供了超过DCNN和DCNN-opt的巨大性能优势
6.3 PE Granularity
As outlined in Section 6.2, both cross-PE global barriers and intra-PE multiplier array fragmentation can contribute to degradation in the performance of SCNN. We quantify the effects of both of these factors on system performance by conducting the following sensitivity study. Assuming a fixed 1,024 multipliers for the accelerator, we sweep the total number of PEs on-chip from 64 (8×8 PEs, 16 multipliers per PE) down to 4 (2×2 PEs, 256 multipliers per PE). Clearly, an SCNN with 4 PEs can better sustain the effects of the global barriers than an SCNN with 64 PEs. However, the 4 PE configuration is also more likely to suffer from intra-PE fragmentation because each PE must now process a larger working set to fully utilize the math units. When evaluated on GoogLeNet, SCNN with 64 PEs achieves an 11% speedup over the one with 4 PEs as it does a better job utilizing the math arrays (average 59% math utilization versus 35%). We observed similar trends for AlexNet and VGGNet, concluding that addressing intra-PE fragmentation is more critical than inter-PE barriers for system-wide performance with the PT-IS-CP-sparse dataflow.
如第6.2节所述,跨PE整体障碍和intraPE乘数阵列碎片都可能导致SCNN性能的下降。我们通过进行以下敏感性研究来量化这两个因素对系统性能的影响。假设一个加速器的1024个乘法器被固定,我们从64个(8×8个PE,每个PE 16个乘法器)到4个(2×2个PE,每个PE有256个乘法器),将片上PE的总数扫描出来。显然,SCNN 4个PE比64个PE的SCNN能够更好地支撑整体障碍的影响。然而,4 PE配置也更容易受到PE内部碎片的困扰,因为每个PE现在必须处理更大的工作集以充分利用数学单元。当使用GoogLeNet进行评估时,使用64个PE的SCNN比使用数学阵列(平均59%的数学利用率和35%)做得更好。我们观察到了AlexNet和VGGNet类似的趋势,得出结论认为,使用PT-IS-CP稀疏数据流解决intra-PE碎裂比内部PE障碍对提高广泛的系统性能更重要。
6.4 Effects of Weight and Activation Sparsity
While Figure 1 shows that sparseness is abundant in both activations and pruned weights, isolating the effect of sparsity provides insight into different accelerator architecture trade-offs. We run the density-sweep experiments from Section 6.1 on two architectures derived from the SCNN design. The SCNN-SparseA architecture only takes advantage of sparsity in activations and is similar in spirit to Cnvlutin [1]. The SCNN-SparseW architecture only takes advantage of sparsity in weights and is similar in spirit to Cambricon-X [34].
虽然图1显示在这两个激活中稀疏是丰富的并修剪权量,隔离稀疏效应提供见解
进入不同的加速器体系结构权衡。 我们运行第6.1节中介绍的两种体系结构的密度扫描实验来自SCNN的设计。 SCNN-SparseA架构只需要激活稀疏的优点,在精神上类似于Cnvlutin [1]。 SCNN-SparseW架构只能利用权重稀疏,精神上与Cambricon-X相似[34]。
Table 6 tabulates the characteristics of these new architectures alongside our baseline SCNN, DCNN, and DCNN-opt architectures. These five architectures together cover a broad design space of sparse architectures, and also encompass the types of sparsity explored in prior research, as described in Table 2. However, because of significant differences in dataflow, buffer sizing/organization, and implementation choices (such as the use of eDRAM), our evaluated architectures cannot precisely represent those prior proposals.
表6列出了这些新体系结构的特征以及我们的基准SCNN,DCNN和DCNN-opt体系结构。这五种结构共同覆盖了广泛的设计空间稀疏的体系结构,也包括在以前的研究中探索的稀疏类型,如表2所述。然而,因为在数据流,缓冲区大小/组织等和实施选择(如使用eDRAM)有着重大的不同,我们评估架构不能精确地代表那些先前的提议。
Figure 12 demonstrates that SCNN is consistently superior to the SCNN-SparseA and SCNN-SparseW configurations in both performance and energy across the entire density range. The only exception is that at very high density levels (weight/activation density greater than 0.9/0.9), SCNN-SparseA is slightly more energy-efficient because of the removal of overheads to manage sparse weights. The input-stationary temporal loop around the Cartesian product makes these architectures extremely effective at filtering IARAM accesses, resulting in the IARAM consuming less than 1% of the total energy. The weight FIFO is accessed more frequently in SCNN, resulting in the weight FIFO consuming around 6.7% of total energy. Therefore, removing the weight encoding overheads in SCNN-SparseA shows a far greater benefit than removing the activation encoding overheads in SCNN-SparseW. However, as density is reduced, the filtering advantage of the input-stationary loop starts diminishing relative to the weight FIFO. At a density of 0.8/0.8, SCNN-SparseW surpasses the energy-efficiency of SCNN-SparseA, ultimately reaching a 2.5× advantage at 0.1/0.1. For a nominal density of 0.4/0.4, SCNN achieves performance advantages of 1.7× and 2.6× over SCNN-SparseW and SCNN-SparseA, respectively; SCNN achieves energy-efficiency advantages of 1.6× and 2.1× over SCNN-SparseW and SCNN-SparseA, respectively.
7 RELATED WORK
Previous efforts to exploit sparsity in CNN accelerators have focused on reducing energy or saving time, which will invariably also save energy. Eliminating the multiplication when an input operand is zero is a natural way to save energy. Eyeriss [8] gates the multiplier when it sees an input activation of zero. Because the sparsity of weights is not significant for non-pruned networks, Eyeriss opted not to gate the multiplier on zero weights. This gating approach will save energy, but not execution time. Not only does SCNN obtain energy efficiency by eliminating unnecessary multiplications, it also reduces execution time by eliminating the dead multiplier cycles inherent in zero-gating approaches.
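The difference between zero-gating and zero-skipping can be illustrated with a toy count of multiplier cycles. This is only a sketch under simplifying assumptions: the function name `multiplier_stats` is hypothetical, operands are paired element by element in a single stream, and neither Eyeriss nor SCNN schedules work in exactly this way.

```python
def multiplier_stats(activations, weights):
    """Compare zero-gating with zero-skipping for one stream of operand pairs.

    Returns {scheme: (multiplier_cycles, multiplies_actually_performed)}.
    """
    pairs = list(zip(activations, weights))

    # Zero-gating on activations only: every pair still occupies a multiplier
    # cycle; the multiply is merely suppressed when the activation is zero.
    gated_cycles = len(pairs)
    gated_multiplies = sum(1 for a, _ in pairs if a != 0)

    # Zero-skipping on both operands: a pair with any zero operand is never
    # issued, so it costs neither a cycle nor a multiply.
    skipped_cycles = sum(1 for a, w in pairs if a != 0 and w != 0)

    return {
        "gating": (gated_cycles, gated_multiplies),
        "skipping": (skipped_cycles, skipped_cycles),
    }


acts = [0, 3, 0, 5, 2, 0, 0, 1]
wts = [4, 0, 7, 2, 0, 1, 0, 6]
stats = multiplier_stats(acts, wts)
print(stats)
```

In this toy stream, gating still spends one cycle per pair and only saves the energy of the suppressed multiplies, whereas skipping shortens the schedule itself.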
Another approach to reducing energy is to reduce data transfer costs when the data is sparse. Eyeriss uses a run-length encoding scheme when transferring activations to and from DRAM. This approach saves energy (and time) by reducing the number of DRAM accesses. However, because the data is kept in an expanded form in the on-chip memory hierarchy, such architectures cannot completely eliminate energy on the data transfers from one internal buffer to another internal buffer or to the multipliers. Eyeriss does, however, save the energy for accessing the weight buffer when the associated activation is zero. SCNN also uses a compressed representation for all data coming from DRAM, but also maintains that compressed representation in all the on-die buffers.
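As a rough illustration of run-length encoding for sparse activations, the sketch below stores each value together with the number of zeros that precede it. The pair layout, the function names, and the `max_run` field limit are assumptions made for this example; they are not the exact bit format used by Eyeriss or SCNN.

```python
def rle_encode(values, max_run=31):
    """Encode a sequence as (zero_run, value) pairs.

    zero_run is the number of zeros preceding value. max_run is a
    hypothetical limit imposed by a fixed-width run field.
    """
    encoded = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            # Either a nonzero value, or a zero written explicitly
            # because the run field is already full.
            encoded.append((run, v))
            run = 0
    if run:
        # Trailing zeros: store them as a run that ends in an explicit zero.
        encoded.append((run - 1, 0))
    return encoded


def rle_decode(encoded):
    """Expand (zero_run, value) pairs back into the dense sequence."""
    out = []
    for run, v in encoded:
        out.extend([0] * run)
        out.append(v)
    return out


dense = [0, 0, 5, 0, 0, 0, 7, 0]
packed = rle_encode(dense)
print(packed, rle_decode(packed) == dense)
```

A design that decodes this stream as soon as it arrives on chip saves only DRAM traffic; keeping the packed form in the on-die buffers, as the paragraph above describes for SCNN, extends the saving to internal transfers.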
Other sparse CNN accelerator architectures do not exploit all of the sparsity opportunities leveraged by SCNN. For example, Cnvlutin compresses activation values based on the ReLU operator, but it does not employ pruning to exploit weight sparsity [1]. Cambricon-X employs weight sparsity to keep only non-zero weights in its internal buffers [34]. However, it does not compress activation data between DRAM and the accelerator. Nor does it keep activations in a compressed form in the internal buffers, except in the queues directly delivering activations to the multipliers. In contrast, SCNN keeps both weights and activations in a compressed form in both DRAM and internal buffers. This approach saves data transfer time and energy on all data transfers and allows the chip to hold larger models for a given amount of internal storage.
Avoiding delivery of zero-valued activations or weights to the multipliers can save time by eliminating ineffectual multiplier cycles. Cnvlutin selects only non-zero activation values for delivery as multiplier operands, but does occupy a multiplier with zero-valued weights. Cambricon-X can save a compute cycle for zero-valued weights, but still wastes time computing multiplications for operands with zero-valued activations. The Deep Learning Accelerator Core (DLAC) architecture paper mentions that it can skip computation on zero-valued operands to improve performance, but the architecture does not appear to employ zero compression [32]. SCNN does not deliver either zero activations or weights to the multipliers and maximally exploits opportunities of CNN sparsity.
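A back-of-the-envelope model shows why skipping both operand types matters. The sketch below assumes that zeros in weights and activations are independent and that a skipped operand costs nothing; the function name and the model itself are illustrative assumptions, not measurements from the paper, and real designs lose some of this ideal gain to load imbalance and encoding overheads.

```python
def occupied_multiplier_fraction(act_density, wt_density, skip_acts, skip_wts):
    """Fraction of dense multiplier cycles still occupied (idealized model)."""
    fraction = 1.0
    if skip_acts:
        fraction *= act_density  # only non-zero activations are issued
    if skip_wts:
        fraction *= wt_density   # only non-zero weights are issued
    return fraction


# Nominal 0.4/0.4 weight/activation density.
acts_only = occupied_multiplier_fraction(0.4, 0.4, skip_acts=True, skip_wts=False)
wts_only = occupied_multiplier_fraction(0.4, 0.4, skip_acts=False, skip_wts=True)
both = occupied_multiplier_fraction(0.4, 0.4, skip_acts=True, skip_wts=True)
print(acts_only, wts_only, both)
```

Under these assumptions, skipping a single operand type leaves about 40% of the dense cycles occupied, while skipping both leaves about 16%.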
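These further examples compare how each design handles zero values, and they reinforce the point that SCNN both avoids delivering zero-valued weights and activations and makes the fullest use of CNN sparsity. The paper is thorough and well argued, with ample experiments.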
The EIE CNN accelerator uses a compressed representation of both activations and weights, and only delivers non-zero operands to the multipliers [15]. However, EIE is designed for the fully connected layers of a CNN model, while SCNN targets the convolutional layers, which encompass the vast majority of the computations in CNNs [10, 24]. Finally, Alwani et al. propose to fuse adjacent layers in a dense CNN accelerator so that intermediate activations between layers can be kept on chip [2]. By compressing the activations, SCNN can typically keep all of the activations on-chip, without requiring a more complicated fused algorithm.
8 CONCLUSION
This paper presents the Sparse CNN (SCNN) accelerator architecture for inference in convolutional neural networks. SCNN exploits sparsity in both weights and activations using the sparse planar-tiled input-stationary Cartesian product (PT-IS-CP-sparse) dataflow. This approach enables SCNN to use a novel Cartesian product-based computation architecture that maximizes reuse of weights and activations within a set of distributed processing elements. In addition, it allows a dense compressed representation of both weights and activations to be used through almost the entire processing flow. This reduces data movement and increases on-die storage capacity compared with alternative approaches. Our results show that with equivalent area, the SCNN architecture starts to beat an energy-optimized dense architecture on energy efficiency when the weights and activations are each less than 85% dense. On three contemporary networks (AlexNet, GoogLeNet, and VGGNet) SCNN achieves performance improvements over a comparably provisioned dense CNN accelerator by a factor of 2.7×, while still being 2.3× more energy-efficient.
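The core idea of the Cartesian product dataflow can be sketched in a few lines: every non-zero activation is multiplied by every non-zero weight, and the output position of each product is derived from the operands' coordinates. This is a minimal single-channel illustration with hypothetical helper names; planar tiling, multiple channels, halo exchange, and the accumulator banks of the actual design are all left out.

```python
def compress(matrix):
    """Return [(row, col, value)] for the non-zero entries of a 2D list."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row) if v != 0]


def cartesian_product_conv(acts, wts):
    """'Valid' 2D cross-correlation of one input plane with one filter,
    computed only from non-zero operands."""
    out_h = len(acts) - len(wts) + 1
    out_w = len(acts[0]) - len(wts[0]) + 1
    out = [[0] * out_w for _ in range(out_h)]
    nz_wts = compress(wts)
    for y, x, a in compress(acts):      # input-stationary: hold one activation
        for r, s, w in nz_wts:          # pair it with every non-zero weight
            oy, ox = y - r, x - s       # output coordinate from operand coordinates
            if 0 <= oy < out_h and 0 <= ox < out_w:
                out[oy][ox] += a * w    # scatter-accumulate the product
    return out


def dense_conv(acts, wts):
    """Dense reference: every operand pair is multiplied, zeros included."""
    out_h = len(acts) - len(wts) + 1
    out_w = len(acts[0]) - len(wts[0]) + 1
    return [[sum(acts[oy + r][ox + s] * wts[r][s]
                 for r in range(len(wts)) for s in range(len(wts[0])))
             for ox in range(out_w)]
            for oy in range(out_h)]


acts = [[0, 2, 0, 0],
        [1, 0, 0, 3],
        [0, 0, 4, 0],
        [0, 5, 0, 0]]
wts = [[1, 0],
       [0, -2]]
print(cartesian_product_conv(acts, wts) == dense_conv(acts, wts))
```

The sparse version issues one multiply per pair of non-zero operands, whereas the dense reference performs a multiply for every weight at every output position.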
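This paper is a model of how to write up a mid-sized idea: the line of argument is excellent, the supporting evidence is thorough, the skeleton is solid, and the substance is rich. Impressive.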