AI chip

I. https://www.youtube.com/watch?v=fm0kxnj3DuM
1. AI chips have no fixed architecture; different architectures suit different algorithms. Algorithms evolve quickly, so the chips must be updated quickly as well.
2. Although architectures vary, the main components stay the same: compute units, the memory hierarchy, and I/O (e.g. PCIe).
3. Everyone is solving the same problem: how to compute fast.
4. Different needs lead to different chip designs, because designers must weigh many factors, such as how data flows through the chip, latency, and power consumption, and find a balance among them; the guiding principle is fitness for the target need.
5. The software running on the chip matters a great deal, namely the runtime library. Together, the chip and its runtime library form the foundation for the workload.
II. https://www.youtube.com/watch?v=d32jtdFwpcE
1. Training and inference chips differ: training involves huge amounts of data, while inference devices are resource-constrained, so whether a chip targets training or inference must be settled before design begins.
2. Most AI chips boil down to scalar, vector, and matrix compute units.
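Those three unit types correspond directly to familiar array operations; a minimal sketch (NumPy here is my software stand-in for the hardware units, not something the talk uses):

```python
import numpy as np

# Scalar unit: one number at a time
a, b = 2.0, 3.0
scalar_result = a * b                      # 6.0

# Vector unit: one operation applied across a whole vector at once
v = np.array([1.0, 2.0, 3.0])
vector_result = v * 2.0                    # [2. 4. 6.]

# Matrix unit: a full matrix multiply in one step
M = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([4.0, 5.0])
matrix_result = M @ x                      # [4. 5.]
```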
III. https://www.youtube.com/watch?v=2uZeVYjofiM
1. On a GPU, control logic takes up only a small fraction of the die; computation takes up most of it.
IV. AI Chips: What They Are and Why They Matter.pdf
1. Our definition of “AI chips” includes graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and certain types of application-specific integrated circuits (ASICs) specialized for AI calculations. Our definition also includes a GPU, FPGA, or AI-specific ASIC implemented as a core on a system-on-a-chip (SoC).
2. The transistors used in today’s state-of-the-art chips are only a few atoms wide.
3. Specialized AI chips are taking market share from CPUs; as AI advances, CPUs are becoming less and less useful for it.
4. Like general-purpose CPUs, AI chips gain speed and efficiency by incorporating huge numbers of smaller and smaller transistors, which run faster and consume less energy than larger transistors.
Greater transistor density improved speed primarily via “frequency scaling,” i.e. transistors switching between ones and zeros faster to allow more calculations per second by a given execution unit. Because smaller transistors use less power than larger ones, switching speeds could be increased without increasing total power consumption.
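The frequency-scaling argument above follows the classic CMOS dynamic-power relation (standard background, not spelled out in the report):

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V^2 \, f
```

where $\alpha$ is the switching activity, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. Shrinking a transistor lowers $C$ and permits a lower $V$, so $f$ can rise while total power stays roughly flat.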
5. Advantages of AI chips over CPUs:
calculating numbers with low precision in a way that successfully implements AI algorithms but reduces the number of transistors needed for the same calculation;
speeding up memory access by, for example, storing an entire AI algorithm in a single AI chip;
and using programming languages built specifically to efficiently translate AI computer code for execution on an AI chip.
6. Chinese companies depend on U.S. EDA software to design chips.
7. U.S., Taiwanese, and South Korean firms control the large majority of chip fabrication factories (“fabs”). U.S., Dutch, and Japanese firms together control the market for semiconductor manufacturing equipment (SME) used by fabs.
8. Moore’s Law was first observed in the 1960s, and it held until the 2010s.
9. As transistors shrink and density increases, new chip designs become possible, further improving efficiency and speed.
First, CPUs can include more and different types of execution units optimized for different functions.
Second, more on-chip memory can reduce the need to access slower off-chip memory. Memory chips such as DRAM can likewise pack in more memory.
Third, CPUs can have more space for architectures that implement parallel rather than serial computation.
10. Various physics problems at small scales also make further shrinkage more technically challenging. The first significant change arrived in the 2000s, when the transistor’s insulating layer became so thin that electrical current started leaking across it.
11. Parallelism
More transistors could theoretically enable a CPU to include more circuits to perform a larger number of calculations in parallel. However, speedups from parallelism are commonly limited by the percentage of time spent on serial computations.
Unfortunately, most applications require at least some serial computation, and processor energy waste becomes too high as the serialization percentage increases.
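This limit is Amdahl's law; a quick numeric sketch (the 95%-parallel figure is an illustrative assumption, not from the report):

```python
def amdahl_speedup(parallel_fraction: float, n_units: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work can be
    spread across `n_units` execution units (Amdahl's law)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

# With a 5% serial share, even unlimited units cap the speedup at 1/0.05 = 20x
ten_units = amdahl_speedup(0.95, 10)        # ~6.9x
many_units = amdahl_speedup(0.95, 10**9)    # just under 20x
```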
12. How AI chips outperform CPUs:
AI chips execute a much larger number of calculations in parallel than CPUs.
They also calculate numbers with low precision in a way that still successfully implements AI algorithms.
They also speed up memory access, for example by storing an entire AI algorithm in a single AI chip.
Finally, AI chips use programming languages specialized to efficiently translate AI computer code for execution on an AI chip.
13. Different AI chips, different designs
While general-purpose chips include a small number of popular designs, particularly the CPU, AI chips are more diverse and vary widely in design.
14. Training vs. inference
Training virtually always benefits from data parallelism; inference often does not.
15. ASICs are less flexible
Because ASICs are narrowly optimized for specific algorithms, design engineers consider far fewer variables.
To design a circuit meant for only one calculation, an engineer can simply translate the calculation into a circuit optimized for that calculation. But to design a circuit meant for many types of calculations, the engineer must predict which circuit will perform well on a wide variety of tasks, many of which are unknown in advance.
16. An AI chip’s commercialization has depended on its degree of general-purpose capability.
17. ASIC chips
The AI ASIC market, especially for inference, is more distributed, with lower barriers to entry, as ASICs and inference chips are easier to design.
Google, Tesla, and Amazon have begun designing AI ASICs specialized for their own AI applications.
18. GPUs and FPGAs
U.S. companies Nvidia and AMD have a duopoly over the world GPU design market.
U.S. companies Xilinx and Intel dominate the global FPGA market.
19. Chip design
“Chip design” refers to the layout and structure of these electrical devices and their interconnections.
20. CPUs vs. other chips
CPUs are general-purpose processors suitable for a wide variety of computing tasks but not specialized for any given task.
GPUs, FPGAs, and ASICs are specialized for improved efficiency and speed for specific applications—such as AI—at the expense of worse-than-CPU efficiency and speed on other applications.
A system-on-a-chip (SoC) is a single chip that includes all necessary computer functions, including logic functions and memory.
21. AI chips mainly perform addition and multiplication.
AI chip designs typically include large numbers of “multiply-accumulate circuits” (MACs) on a single chip.
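In software terms, a MAC is simply the inner step of a dot product; a minimal sketch in plain Python (illustrative only, not how a MAC circuit is built):

```python
# A multiply-accumulate (MAC) step: acc <- acc + a * b.
# AI chips tile thousands of these circuits; in software, the same pattern
# is just the inner loop of a dot product.
def dot_via_macs(xs, ws):
    acc = 0.0
    for a, b in zip(xs, ws):
        acc += a * b          # one MAC operation per pair
    return acc

result = dot_via_macs([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])  # 32.0
```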
22. Data parallelism
Data parallelism, the most common form of parallelism, splits the input dataset into different “batches,” such that computations are performed on each batch in parallel. These batches can be split across different execution units of an AI chip or across different AI chips connected in parallel.
Data parallelism using hundreds to thousands of batches during training achieves the same model accuracy without increasing the total number of required computations. However, larger batch counts eventually start requiring more compute to achieve the same model accuracy. Beyond a certain number of batches—for some DNNs, over a million—increasing data parallelism requires more compute without any decrease in time spent training the model, thereby imposing a limit on useful data parallelism.
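A toy sketch of the batch-splitting idea (the linear-model "gradient" and the two-device split are illustrative assumptions, not from the report):

```python
import numpy as np

# Each "device" handles one batch independently, then results are
# averaged — the essence of data parallelism.
def grad_on_example(example, w):
    # Hypothetical gradient of (x*w - y)^2 for a linear model, illustrative only
    x, y = example
    return 2 * x * (x * w - y)

w = 0.5
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
batches = [dataset[0:2], dataset[2:4]]     # split across two "devices"

per_device = []
for batch in batches:                      # runs in parallel on real hardware
    grads = [grad_on_example(ex, w) for ex in batch]
    per_device.append(np.mean(grads))

# Combine the per-device results (as an all-reduce would on real hardware);
# equal batch sizes make the average of averages exact.
global_grad = float(np.mean(per_device))
```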
Model parallelism
Model parallelism splits the model into multiple parts on which computations are performed in parallel on different execution units of an AI chip or across different AI chips connected in parallel.
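A minimal sketch of the model-splitting idea (the two-layer model and the device assignment are assumptions for illustration):

```python
import numpy as np

# Model parallelism: split one model's layers across two "devices".
W1 = np.array([[1.0, 0.0], [0.0, 2.0]])   # layer 1 lives on device 0
W2 = np.array([[0.5, 0.5]])               # layer 2 lives on device 1

x = np.array([2.0, 3.0])
h = W1 @ x      # computed on device 0
y = W2 @ h      # activations cross to device 1, which runs layer 2
```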
23. Reducing the precision of model parameters does not affect results
First, trained DNNs are often impervious to noise, such that rounding off numbers in inference calculations does not affect results.
Second, certain numerical parameters in DNNs are known in advance to have values falling within only a small numerical range—precisely the type of data that can be stored with a low number of bits.
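A minimal sketch of that low-bit storage idea, assuming symmetric int8 quantization over a known [-1, 1] range (the scheme and the weight values are illustrative, not from the report):

```python
import numpy as np

# Hypothetical trained weights, known in advance to lie in [-1, 1]
weights = np.array([0.12, -0.53, 0.88, -0.07], dtype=np.float32)

# Symmetric 8-bit quantization: map [-1, 1] onto signed int8 [-127, 127]
scale = 1.0 / 127.0
q = np.round(weights / scale).astype(np.int8)   # stored with 8 bits each

# Dequantize to check that the rounding noise is small
restored = q.astype(np.float32) * scale
max_error = float(np.abs(restored - weights).max())   # bounded by ~scale/2
```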
24. Benefits of low-precision computation
Lower-bit data calculations can be performed with execution units containing fewer transistors.
This produces two benefits. First, chips can include more parallel execution units if each execution unit requires fewer transistors.
Second, lower-bit calculations are more efficient and require fewer operations. An 8-bit execution unit uses 6x less circuit area and 6x less energy than a 16-bit execution unit.
25. A transistor stores a bit, which can take a value of 1 or 0.
26. On programming languages
A computer program called a compiler (or an interpreter) then translates this code into a form directly readable and executable by a processor. Different computer languages operate at various levels of abstraction.
For example, a high-level programming language like Python is simplified for human accessibility, but when executed, Python code is often relatively slow because of the complexity of converting high-level, human-oriented instructions into machine code optimized for a specific processor.
By contrast, programming languages like C that operate at a lower level of abstraction require more complex code (and effort by programmers), but that code often executes more efficiently because it is easier to convert into machine code optimized for a specific processor.
27. DSLs (domain-specific languages)
By contrast, DSLs are specialized to efficiently program for and execute on specialized chips.
Sometimes, the advantages of DSLs can be delivered by specialized code libraries like PyTorch: these code libraries package knowledge of specialized AI-processors in functions that can be called by general-purpose languages (such as Python in this case).
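A small sketch of that pattern, using NumPy as the stand-in for such a library (the report names PyTorch; NumPy is my substitution so the example stays self-contained):

```python
import numpy as np

# A plain-Python loop vs. the same computation through a library call that
# dispatches to optimized, hardware-aware kernels — the caller stays in a
# general-purpose language either way.
def python_dot(xs, ws):
    return sum(a * b for a, b in zip(xs, ws))

xs = [1.0, 2.0, 3.0]
ws = [4.0, 5.0, 6.0]

slow = python_dot(xs, ws)        # interpreted loop, one step at a time
fast = float(np.dot(xs, ws))     # packaged, optimized library kernel
```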
V. IP
In the chip industry, “IP” usually refers to IP cores. An IP core is a mature, proven design for a circuit module with a self-contained function. That module design can be reused in other chip projects that need the same function, which reduces design effort, shortens the design cycle, and raises the odds that the chip design succeeds.
Generally, a complex chip consists of circuitry the design team creates itself plus several purchased IP cores, all connected together. To design a chip with this structure, a company can purchase all of the chip’s IP cores, design only its own creative, in-house portion (shown in green in the source’s diagram), and then wire the parts together.
VI.
1. Dojo relies more on fast data movement than on local data storage.
2. Tesla’s chips are highly scalable.
3. Tesla writes the chips’ software stack itself.
4. Tesla previously used Nvidia chips for training.
