In scientific computing, OpenBLAS is an open-source implementation of the BLAS (Basic Linear Algebra Subprograms) API with many hand-crafted optimizations for specific processor types. It is developed at the Lab of Parallel Software and Computational Science, ISCAS.
OpenBLAS adds optimized implementations of linear algebra kernels for several processor architectures, including Intel Sandy Bridge[2] and Loongson.[3] It claims to achieve performance comparable to the Intel MKL.
OpenBLAS is a fork of GotoBLAS2, which was created by Kazushige Goto at the Texas Advanced Computing Center.
Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. They are the de facto standard low-level routines for linear algebra libraries; the routines have bindings for both C and Fortran. Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions.
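To make the C binding concrete, here is a minimal sketch of calling the CBLAS interface that OpenBLAS ships: it multiplies two small matrices with cblas_dgemm. The header name (cblas.h) and the link flag (-lopenblas) are assumptions that depend on how OpenBLAS is installed on a given system.

```c
/* Minimal CBLAS sketch: C = alpha*A*B + beta*C in double precision.
 * Assumed build command: gcc gemm_example.c -lopenblas */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* 2x3 * 3x2 = 2x2, all matrices stored in row-major order */
    double A[2 * 3] = {1, 2, 3,
                       4, 5, 6};
    double B[3 * 2] = {7,  8,
                       9, 10,
                      11, 12};
    double C[2 * 2] = {0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,        /* M, N, K */
                1.0, A, 3,      /* alpha, A, lda */
                B, 2,           /* B, ldb */
                0.0, C, 2);     /* beta, C, ldc */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```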
Single Instruction, Multiple Data (SIMD) is, as the name suggests, a parallel processing technique in which one instruction operates on multiple data elements (usually a power-of-two number of them); compared with handling a single data element per instruction, this greatly increases throughput. It is one of the four computer architecture classes defined by Michael J. Flynn in 1966 according to the relationship between instruction streams and data streams (the others being SISD, MISD, and MIMD).
Many programs operate on large data sets whose elements are stored in fewer than 32 bits: 8-bit pixel data in video, graphics, and image processing, 16-bit samples in audio coding, and so on. Workloads like these tend to consist of large numbers of simple, repetitive operations with little control code, which is exactly where SIMD excels. Typical examples include the following (a sketch follows the list):
* Block-based data processing.
* Audio, video, and image processing codecs.
* 2D graphics based on rectangular blocks of pixels.
* 3D graphics.
* Color-space conversion.
* Physics simulations.
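To make the 8-bit pixel case concrete, the sketch below uses NEON intrinsics (introduced later in this note) to brighten a pixel buffer 16 elements at a time. The function name and the assumption that the buffer length is a multiple of 16 are purely illustrative.

```c
/* Illustrative sketch: saturating-add a constant to 8-bit pixels,
 * 16 pixels per NEON instruction. Assumes len is a multiple of 16. */
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void brighten_u8(uint8_t *pixels, size_t len, uint8_t delta) {
    uint8x16_t vdelta = vdupq_n_u8(delta);      /* broadcast delta to 16 lanes */
    for (size_t i = 0; i < len; i += 16) {
        uint8x16_t v = vld1q_u8(pixels + i);    /* load 16 pixels */
        v = vqaddq_u8(v, vdelta);               /* saturating add, clamps at 255 */
        vst1q_u8(pixels + i, v);                /* store 16 pixels */
    }
}
```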
On processors with a 32-bit core, such as the Cortex-A series, a program that does not use SIMD spends a large amount of time handling 8-bit or 16-bit data even though the processor's ALU, registers, and data path are designed primarily for 32-bit operations. NEON was created to address this.
NEON is an ARM technology built on the SIMD idea. Compared with ARMv6 and earlier architectures, NEON combines 64-bit and 128-bit SIMD instruction sets and provides 128-bit wide vector operations. NEON was introduced with ARMv7 and is currently available in ARM Cortex-A and Cortex-R series processors.
NEON is standard on the Cortex-A7, Cortex-A12, and Cortex-A15, but optional on the other ARMv7 Cortex-A processors. NEON shares its register file with VFP but has its own independent execution pipeline.
Data types supported by NEON:
* 32-bit single-precision floating-point numbers;
* 8, 16, 32, and 64-bit unsigned and signed integers;
* 8 and 16-bit polynomials.
NEON data type specifiers:
* Unsigned integer U8 U16 U32 U64
* Signed integer S8 S16 S32 S64
* Integer of unspecified type I8 I16 I32 I64
* Floating-point number F16 F32
* Polynomial over {0,1} P8
Note: F16 is not used for data-processing operations, only for data conversion, and only on systems that implement the half-precision architecture extension.
Polynomial arithmetic is useful when implementing certain cryptographic and data-integrity algorithms.
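As a rough illustration of how these specifiers map onto C, the declarations below use a few of the vector types defined by arm_neon.h; the variable names are placeholders, and only a subset of the available types is shown.

```c
/* A few NEON vector types from arm_neon.h.
 * The "x8"/"x16" suffix gives the lane count; D-register types are
 * 64 bits wide, Q-register types are 128 bits wide. */
#include <arm_neon.h>

uint8x8_t   u8_d;   uint8x16_t  u8_q;    /* U8  : unsigned 8-bit lanes    */
int16x4_t   s16_d;  int16x8_t   s16_q;   /* S16 : signed 16-bit lanes     */
int32x2_t   s32_d;  int32x4_t   s32_q;   /* S32 : signed 32-bit lanes     */
float32x2_t f32_d;  float32x4_t f32_q;   /* F32 : single-precision floats */
poly8x8_t   p8_d;   poly8x16_t  p8_q;    /* P8  : polynomials over {0,1}  */
```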
Floating-point (VFP)
VFP (Vector Floating Point) technology is an FPU (Floating-Point Unit) coprocessor extension to the ARM architecture[72] (implemented differently in ARMv8 - coprocessors not defined there). It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture was intended to support execution of short "vector mode" instructions but these operated on each vector element sequentially and thus did not offer the performance of true single instruction, multiple data (SIMD) vector parallelism. This vector mode was therefore removed shortly after its introduction,[73] to be replaced with the much more powerful NEON Advanced SIMD unit.
Some devices such as the ARM Cortex-A8 have a cut-down VFPLite module instead of a full VFP module, and require roughly ten times more clock cycles per float operation.[74] Pre-ARMv8 architecture implemented floating-point/SIMD with the coprocessor interface. Other floating-point and/or SIMD units found in ARM-based processors using the coprocessor interface include FPA, FPE, iwMMXt, some of which were implemented in software by trapping but could have been implemented in hardware. They provide some of the same functionality as VFP but are not opcode-compatible with it.
There are quite a few differences between the two. NEON is a SIMD (Single Instruction, Multiple Data) accelerator that is part of the ARM core: during the execution of one instruction, the same operation is applied to up to 16 data elements in parallel. Because of this internal parallelism, you can get more MIPS or FLOPS out of NEON than out of a standard SISD processor running at the same clock rate.
NEON's biggest benefit comes when you execute operations on vectors of data, e.g. video encoding/decoding. It can also perform single-precision floating-point (float) operations in parallel.
VFP is a classic floating-point hardware accelerator. It is not a parallel architecture like NEON: it performs one operation on one set of inputs and returns one output. Its purpose is to speed up floating-point calculations, and it supports both single- and double-precision floating point.
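A small sketch of that difference, assuming an ARMv7 target with NEON available: the first function is plain scalar C that the compiler maps onto VFP instructions one element at a time, while the second uses NEON intrinsics to process four single-precision floats per operation. The function names and the multiple-of-4 length assumption are illustrative only.

```c
#include <arm_neon.h>
#include <stddef.h>

/* Scalar version: one multiply-add per element (maps to VFP instructions). */
void scale_add_scalar(float *dst, const float *src, float k, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}

/* NEON version: four single-precision lanes per instruction.
 * Assumes n is a multiple of 4 for brevity. */
void scale_add_neon(float *dst, const float *src, float k, size_t n) {
    float32x4_t vk = vdupq_n_f32(k);            /* broadcast k to all 4 lanes */
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t s = vld1q_f32(src + i);     /* load 4 source floats */
        float32x4_t d = vld1q_f32(dst + i);     /* load 4 destination floats */
        d = vmlaq_f32(d, s, vk);                /* d += s * k, lane-wise */
        vst1q_f32(dst + i, d);                  /* store 4 results */
    }
}
```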
You have three ways to use NEON:
* use intrinsic functions (#include "arm_neon.h");
* write inline assembly code;
* let gcc do the vectorization for you by passing -mfpu=neon as an argument (gcc 4.5 and later handle this well); a sketch of this option follows.
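As a sketch of the third option, the loop below is left as plain C for gcc to auto-vectorize. The compile flags in the comment are an assumed invocation for an ARMv7 target; because NEON is not fully IEEE 754 compliant, gcc may additionally require -funsafe-math-optimizations (or -ffast-math) before it will vectorize floating-point loops into NEON instructions.

```c
/* Plain C loop left for the compiler to vectorize.
 * Assumed invocation for an ARMv7 target, e.g.:
 *   gcc -O3 -mfpu=neon -mfloat-abi=hard -funsafe-math-optimizations -c saxpy.c
 * (restrict tells the compiler the arrays do not overlap,
 *  which makes vectorization easier to prove safe) */
void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```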