NVIDIA introduced Unified Memory in CUDA 6 and has since strengthened the feature further, most recently in CUDA 8.
CUDA 6 introduced Unified Memory, which creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The CUDA system software automatically migrates data allocated in Unified Memory between GPU and CPU, so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.
In essence, it frees the programmer from writing explicit CPU-GPU memory copies and leaves that work to the CUDA runtime.
Usage: replace cudaMalloc with cudaMallocManaged, as in the sketch below.
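A minimal sketch (the kernel name and sizes are illustrative): one managed allocation replaces the usual cudaMalloc plus cudaMemcpy in both directions, and the same pointer is used on both sides.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: scales an array in place.
__global__ void scale(float *x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

int main() {
  const int N = 1 << 20;
  float *x;
  // One managed allocation instead of cudaMalloc + two cudaMemcpy calls.
  cudaMallocManaged(&x, N * sizeof(float));
  for (int i = 0; i < N; ++i) x[i] = 1.0f;     // CPU writes through the same pointer
  scale<<<(N + 255) / 256, 256>>>(x, N, 2.0f); // GPU reads/writes the same pointer
  cudaDeviceSynchronize();                     // synchronize before the CPU reads again
  printf("x[0] = %f\n", x[0]);                 // prints 2.000000
  cudaFree(x);
  return 0;
}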
CUDA 6 Unified Memory was limited by the features of the Kepler and Maxwell GPU architectures: all managed memory touched by the CPU had to be synchronized with the GPU before any kernel launch; the CPU and GPU could not simultaneously access a managed memory allocation; and the Unified Memory address space was limited to the size of the GPU physical memory.
Pascal GP100 Unified Memory: two main hardware features enable these improvements, support for large address spaces and page faulting capability.
Before the CPU accesses Unified Memory, the code must call
cudaDeviceSynchronize();
to synchronize, preventing the CPU and GPU from accessing Unified Memory at the same time (the sketch above already does this before the printf).
On devices of compute capability 6.x and above, streams support concurrent CPU/GPU access to Unified Memory: the CPU does not have to wait for a GPU kernel to finish before accessing managed memory.
On devices of compute capability 5.x and below, concurrent CPU/GPU access to Unified Memory is not supported: while a kernel is running, the CPU must not touch Unified Memory. Whether a device supports concurrent access can be queried at runtime, as shown below.
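A small sketch of that query, using the cudaDevAttrConcurrentManagedAccess device attribute:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device = 0, concurrent = 0;
  // 1 on compute capability 6.x+ (Pascal and later); 0 on 5.x and below,
  // where the CPU must not access managed memory while a kernel runs.
  cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);
  printf("Concurrent CPU/GPU managed access: %s\n", concurrent ? "yes" : "no");
  return 0;
}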
Unified memory really shines with C++ data structures. C++ simplifies the deep copy problem by using classes with copy constructors. A copy constructor is a function that knows how to create an object of a class, allocate space for its members, and copy their values from another object. C++ also allows the new and delete memory management operators to be overloaded.
Thanks to Unified Memory, the deep copies, pass by value and pass by reference all just work. This provides tremendous value in running C++ code on the GPU.
By using cudaMallocManaged in the copy constructor (and in overloaded new/delete), objects can be passed to GPU kernels both by value and by reference, as in the sketch below.
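A sketch along the lines of NVIDIA's published example (class names are illustrative): a base class whose overloaded new/delete allocate managed memory, plus a string class whose copy constructor deep-copies its buffer with cudaMallocManaged, so the copy made for pass-by-value is visible on the GPU.

#include <cstddef>
#include <cstring>
#include <cuda_runtime.h>

// Any class derived from Managed is allocated in Unified Memory.
class Managed {
public:
  void *operator new(size_t len) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    cudaDeviceSynchronize();
    return ptr;
  }
  void operator delete(void *ptr) {
    cudaDeviceSynchronize();
    cudaFree(ptr);
  }
};

// The copy constructor deep-copies the character buffer into managed
// memory, so pass-by-value to a kernel "just works".
class String : public Managed {
  int length;
  char *data;
public:
  String(const char *s) : length((int)strlen(s) + 1) {
    cudaMallocManaged(&data, length);
    memcpy(data, s, length);
  }
  String(const String &s) : length(s.length) {
    cudaMallocManaged(&data, length);   // deep copy lives in Unified Memory
    memcpy(data, s.data, length);
  }
  ~String() { cudaFree(data); }
  __host__ __device__ const char *c_str() const { return data; }
};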
To summarize the benefits of Unified Memory:
Simpler programming and memory model
Unified Memory lowers the bar of entry to parallel programming on GPUs, by making explicit device memory management an optimization, rather than a requirement. Unified Memory lets programmers focus on developing parallel code without getting bogged down in the details of allocating and copying device memory. This makes it easier to learn to program GPUs and simpler to port existing code to the GPU. But it’s not just for beginners; Unified Memory also makes complex data structures and C++ classes much easier to use on the GPU. With GP100, applications can operate out-of-core on data sets that are larger than the total memory size of the system. On systems that support Unified Memory with the default system allocator, any hierarchical or nested data structure can automatically be accessed from any processor in the system.
It reduces the complexity of GPU memory management and makes it easier to port existing code to the GPU.
Performance through data locality
By migrating data on demand between the CPU and GPU, Unified Memory can offer the performance of local data on the GPU, while providing the ease of use of globally shared data. The complexity of this functionality is kept under the covers of the CUDA driver and runtime, ensuring that application code is simpler to write. The point of migration is to achieve full bandwidth from each processor; the 750 GB/s of HBM2 memory bandwidth is vital to feeding the compute throughput of a GP100 GPU. With page faulting on GP100, locality can be ensured even for programs with sparse data access, where the pages accessed by the CPU or GPU cannot be known ahead of time, and where the CPU and GPU access parts of the same array allocations simultaneously.
An important point is that CUDA programmers still have the tools they need to explicitly optimize data management and CPU-GPU concurrency where necessary: CUDA 8 introduces useful APIs for providing the runtime with memory usage hints (cudaMemAdvise()) and for explicit prefetching (cudaMemPrefetchAsync()). These tools allow the same capabilities as explicit memory copy and pinning APIs without reverting to the limitations of explicit GPU memory allocation.
Better performance: the CUDA driver and runtime optimize globally shared data under the hood, which keeps the application code simple.
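A hedged sketch of these two CUDA 8 APIs (the kernel and sizes are illustrative, and a device with concurrent managed access, such as GP100, is assumed): advise the runtime about the data's preferred location, prefetch the pages to the GPU before the launch, then prefetch them back for CPU post-processing.

#include <cuda_runtime.h>

__global__ void touch(float *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] += 1.0f;
}

void process() {
  const int N = 1 << 20;
  const size_t bytes = N * sizeof(float);
  int device = 0;
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float *x;
  cudaMallocManaged(&x, bytes);

  // Hint: keep these pages resident on the GPU by preference.
  cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, device);

  // Migrate pages to the GPU up front instead of faulting them in on demand.
  cudaMemPrefetchAsync(x, bytes, device, stream);
  touch<<<(N + 255) / 256, 256, 0, stream>>>(x, N);

  // Bring the data back to host memory before the CPU reads it.
  cudaMemPrefetchAsync(x, bytes, cudaCpuDeviceId, stream);
  cudaStreamSynchronize(stream);

  cudaFree(x);
  cudaStreamDestroy(stream);
}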
In simple terms, Unified Memory eliminates the need for explicit data movement via the cudaMemcpy*() routines without the performance penalty incurred by placing all data into zero-copy memory. Data movement, of course, still takes place, so a program’s run time typically does not decrease; Unified Memory instead enables the writing of simpler and more maintainable code.
Note: data migration still has to happen, so Unified Memory does not by itself reduce a program's run time.
Unified Memory has three basic requirements:
‣ a GPU with SM architecture 3.0 or higher (Kepler class or newer)
‣ a 64-bit host application and operating system, except Android
‣ Linux or Windows
In short, Unified Memory requires a Kepler-class or newer GPU and a 64-bit application running on Linux or Windows.
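Support can also be checked at runtime; a minimal sketch using the cudaDevAttrManagedMemory device attribute:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device = 0, managed = 0;
  // 1 if the device and platform meet the requirements above.
  cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);
  printf("Unified Memory supported: %s\n", managed ? "yes" : "no");
  return 0;
}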