A Practical Implementation of Redundant Multithreading on GPUs

A brief report on the paper "Real-World Design and Evaluation of Compiler-Managed GPU Redundant Multithreading".


This paper uses Redundant Multithreading (RMT) to provide an efficient software solution for error detection on GPUs. The modifications are made at the kernel level and follow two strategies: Intra-Group RMT and Inter-Group RMT. The paper also shows that GPU RMT performance depends on the unique behavior of each kernel and on the required sphere of replication (SoR).


The authors' method differs from other work in three ways. First, it assumes that storage and transfers to off-chip resources are already protected, and it targets an on-chip protection domain. Second, it focuses on detection on the GPU rather than on the CPU (RMT was originally used on CPUs). Third, it is a software-only GPU reliability solution rather than a hardware one, since hardware is expensive to implement and inflexible, and GPU simulators may not reflect real hardware accurately.


So what is RMT? Redundant Multithreading is a little like RAID 1, except that threads are duplicated instead of data. GPU hardware is subject to different fault modes: faults can be permanent (hard) or transient (soft). Faults originate at the physical level and are hard to avoid. With RMT, however, the probability that two faults produce identical errors at the same time is small enough to ignore. RMT replicates every value that enters the sphere of replication (SoR) and compares the outputs before a single correct copy leaves the SoR, just like RAID 1. In this paper RMT is implemented in GPU kernels written in OpenCL, so OpenCL kernels are transformed into RMT programs for error detection.
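The replicate-then-compare idea can be sketched in a few lines of plain Python. This is only a CPU-side illustration, not the paper's OpenCL transformation: the names `rmt_execute`, `flip_bit`, `FaultDetected`, and `square_plus_one` are hypothetical, and a soft fault is modeled, as an assumption, by flipping one bit in one copy's result.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies disagree at the SoR boundary."""


def flip_bit(value, bit):
    """Model a transient (soft) fault as a single bit flip in an integer."""
    return value ^ (1 << bit)


def rmt_execute(compute, value, faulty_copy=None, bit=0):
    """Run `compute` twice on a replicated input and compare before output.

    faulty_copy: None for a fault-free run, or 0/1 to corrupt that copy's result.
    """
    # Entering the SoR: replicate the input so each copy works on its own value.
    inputs = [value, value]
    results = [compute(x) for x in inputs]

    # Hypothetical fault injection, used only to exercise the detection path.
    if faulty_copy is not None:
        results[faulty_copy] = flip_bit(results[faulty_copy], bit)

    # Leaving the SoR: only a value that both copies agree on may be written out.
    if results[0] != results[1]:
        raise FaultDetected(f"mismatch: {results[0]} != {results[1]}")
    return results[0]


def square_plus_one(x):
    """Stand-in for the work done by one thread."""
    return x * x + 1
```

Calling `rmt_execute(square_plus_one, 3)` returns the agreed result, while passing `faulty_copy=1` raises `FaultDetected`. Note that this scheme only detects the error; it does not say which copy was wrong.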


The first strategy is Intra-Group RMT, which comes in two flavors: Intra-Group+LDS and Intra-Group-LDS. +LDS means the local data share (LDS) is inside the SoR, so it is duplicated and protected; -LDS means it is outside the SoR and is not. Other structures, such as the scalar register file (SRF), the scalar unit (SU), and the instruction fetch, decode, and scheduling logic, are all outside the SoR and are not protected by Intra-Group RMT. Intra-Group RMT makes three kernel modifications: work-item IDs are modified to create pairs of identical, redundant work-items; when the LDS is included in the SoR, its allocation is doubled and redundant loads and stores are mapped to their own copies; and communication and output comparison are added. The Intra-Group flavors do not perform very well on kernels such as DCT and MM, where memory operations take a lot of time. For some applications, inter-work-item communication is also costly, and doubling the size of the work-groups takes time as well. Although RMT executes twice as many work-items, the increase in power consumption is small, less than 2%. In Intra-Group RMT, the cost of redundant computation can sometimes be hidden behind existing latency. With Inter-Group RMT, by contrast, the instruction fetch, scheduling, and decode logic of each CU can be considered inside the SoR.
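The work-item ID change and the output comparison can be illustrated with a small sequential simulation of one doubled work-group. It is a sketch under stated assumptions, not the paper's compiler pass: pairing adjacent physical IDs (`pid // 2`) is an assumed mapping, the `comm` list merely stands in for the communication buffer, and `intra_group_rmt`, `double_element`, and the `inject` parameter are hypothetical names.

```python
def double_element(lid, data):
    """Stand-in for the original kernel body: one output per logical work-item."""
    return data[lid] * 2


def intra_group_rmt(kernel, data, local_size, inject=None):
    """Simulate one work-group under Intra-Group RMT.

    inject: physical work-item ID whose result is corrupted (None = no fault).
    Returns (outputs, logical IDs where the redundant pair disagreed).
    """
    physical_size = 2 * local_size   # the work-group is doubled
    comm = [None] * local_size       # communication buffer (assumed layout)
    out = [None] * local_size
    results = []

    # Phase 1: every physical work-item runs the kernel on its logical ID.
    for pid in range(physical_size):
        lid = pid // 2               # logical ID seen by the original kernel
        is_redundant = pid % 2 == 1  # odd physical IDs act as the redundant copy
        r = kernel(lid, data)
        if pid == inject:
            r ^= 1                   # hypothetical single-bit soft fault
        results.append(r)
        if is_redundant:
            comm[lid] = r            # redundant copy publishes its result

    # Phase 2: the primary copy compares before the value leaves the SoR.
    mismatches = []
    for pid in range(0, physical_size, 2):
        lid = pid // 2
        if results[pid] == comm[lid]:
            out[lid] = results[pid]  # only agreed values are stored
        else:
            mismatches.append(lid)
    return out, mismatches
```

With `data = [1, 2, 3, 4]` and `local_size = 4`, a fault-free run stores every output, while `inject=5` corrupts the redundant copy of logical work-item 2, so that output is withheld and reported as a mismatch.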


Inter-Group RMT makes the following kernel modifications: explicit synchronization is added to coordinate communication between work-items; work-item IDs are modified to avoid deadlock; and communication buffers are placed in global memory, which makes communication between work-items more expensive than in Intra-Group RMT. The poor performance of Inter-Group RMT is caused by the use of global memory for inter-work-item communication, whose cost is extremely high. CU under-utilization is related to the work-groups launched.
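The explicit synchronization through a global buffer can be mimicked with Python threads, one thread per physical work-group. This is a rough model only: `threading.Event` stands in for a flag in global memory plus a wait loop, pairing adjacent physical groups is an assumed mapping, and `inter_group_rmt`, `add_ten`, and `inject` are hypothetical names. Because every thread here is started before any is joined, the deadlock that the ID modification guards against on a real GPU cannot occur in this sketch.

```python
import threading


def add_ten(gid, data):
    """Stand-in for the original kernel body: one output per global ID."""
    return data[gid] + 10


def inter_group_rmt(kernel, data, num_groups, local_size, inject=None):
    """Simulate Inter-Group RMT with one thread per physical work-group.

    inject: (physical_group, local_id) whose result is corrupted, or None.
    Returns (outputs, sorted global IDs where the redundant pair disagreed).
    """
    n = num_groups * local_size
    comm_value = [None] * n                             # "global memory" buffer
    comm_ready = [threading.Event() for _ in range(n)]  # one ready flag per slot
    out = [None] * n
    mismatches = []
    lock = threading.Lock()

    def work_group(physical_group):
        logical_group = physical_group // 2     # assumed pairing of groups
        is_redundant = physical_group % 2 == 1
        for lid in range(local_size):
            gid = logical_group * local_size + lid
            r = kernel(gid, data)
            if inject == (physical_group, lid):
                r ^= 1                          # hypothetical single-bit fault
            if is_redundant:
                comm_value[gid] = r             # publish the value, then the flag
                comm_ready[gid].set()
            else:
                comm_ready[gid].wait()          # explicit synchronization
                if r == comm_value[gid]:
                    out[gid] = r                # only agreed values are stored
                else:
                    with lock:
                        mismatches.append(gid)

    threads = [threading.Thread(target=work_group, args=(g,))
               for g in range(2 * num_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out, sorted(mismatches)
```

Every comparison here waits on a flag and reads a shared buffer, which gives a feel for why routing this traffic through global memory is the costly part of the Inter-Group approach.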
