Nvidia releases CUDA 2.1 beta

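While answering forum questions this evening, OpenHero suddenly noticed that Nvidia had posted the 2.1 beta update. I hurried into the thread and saw that only 6 people had viewed it so far.

Excited, I downloaded the 2.1 beta and started with the changes to the programming guide. Here they are, with my notes: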

Version 2.1 Beta
--------------------------------------------------------------------------------

  1 - Section 4.2.3
    - Dg.z must be equal to 1

Dg is of type dim3 (see Section 4.3.1.2) and specifies the dimension and size of the grid, such that Dg.x * Dg.y equals the number of blocks being launched; Dg.z must be equal to 1

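The third grid dimension, z, must now be 1. Take note: in earlier versions this parameter had no effect, so code ported from them may run into small problems if it sets z to anything else.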
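
To make the constraint concrete, here is a minimal sketch in plain C (no CUDA headers needed) of choosing a two-dimensional grid that covers a given number of blocks while keeping z at 1. The `grid_dims` struct and the `make_grid` helper are hypothetical stand-ins for `dim3` and are not part of CUDA; the 65535 per-dimension limit is the grid limit on compute capability 1.x devices.

```c
/* Hypothetical stand-in for CUDA's dim3; not the real type. */
typedef struct {
    unsigned int x, y, z;
} grid_dims;

/* Per-dimension grid limit on compute capability 1.x devices. */
#define MAX_GRID_DIM 65535u

/* Spread total_blocks over x and y so that x * y >= total_blocks; z stays 1.
   Assumes 1 <= total_blocks <= MAX_GRID_DIM * MAX_GRID_DIM. */
static grid_dims make_grid(unsigned long total_blocks)
{
    grid_dims g;
    g.x = (unsigned int)(total_blocks < MAX_GRID_DIM ? total_blocks : MAX_GRID_DIM);
    g.y = (unsigned int)((total_blocks + g.x - 1) / g.x);  /* ceiling division */
    g.z = 1;  /* must be exactly 1 as of the 2.1 beta */
    return g;
}
```

For example, `make_grid(100000)` gives x = 65535, y = 2, z = 1. Because x * y can exceed the requested count, a kernel launched this way still has to skip the surplus blocks.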

  2 - Sections 4.2.5, 4.5.3.4
    - PTX code can now be compiled through the driver API

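PTX code can now be handed straight to the driver API, so you can write PTX by hand. That opens the door to low-level programming and gives one more route to tuning and better performance. My advice, though, is to compile the .cu file to PTX with nvcc first and then modify the generated code. I have seen what nvcc can do, and its optimizer is already very strong :) Look at the optimizations it performs when turning a .cu file into PTX and you will see why I suggest this :)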

This code sample compiles and loads a new module from ptx code and parses compilation errors:
#define ERROR_BUFFER_SIZE 100
CUmodule cuModule;
CUptxas_option options[3];
void* values[3];
char* ptxCode = "some ptx code";
// Buffer that receives the compiler's error log
options[0] = CU_ASM_ERROR_LOG_BUFFER;
values[0] = (void*)malloc(ERROR_BUFFER_SIZE);
options[1] = CU_ASM_ERROR_LOG_BUFFER_SIZE_BYTES;
values[1] = (void*)ERROR_BUFFER_SIZE;
// Compile for the device of the current context
options[2] = CU_ASM_TARGET_FROM_CUCONTEXT;
values[2] = 0;
cuModuleLoadDataEx(&cuModule, ptxCode, 3, options, values);
// Loop bound reconstructed: values[1] is taken to hold the log size in bytes
for (int i = 0; i < (int)(size_t)values[1]; ++i) {
    // Parse error string here
}
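
The loop body in the guide's sample is left as a comment. As one possible way to fill it in, the sketch below, in plain C, walks a log buffer and counts its non-empty lines without ever reading past the buffer size. The helper name `count_log_lines` is hypothetical, and the assumption that the log is NUL-terminated text with one message per line is mine, not something the guide states.

```c
#include <stddef.h>

/* Count the non-empty lines in an error log.
   Reads at most `size` bytes and stops at the first NUL. */
static size_t count_log_lines(const char *log, size_t size)
{
    size_t lines = 0;
    size_t i;
    int in_line = 0;  /* non-zero while inside a line that has content */

    for (i = 0; i < size && log[i] != '\0'; ++i) {
        if (log[i] == '\n') {
            if (in_line) ++lines;
            in_line = 0;
        } else {
            in_line = 1;
        }
    }
    if (in_line) ++lines;  /* last line without a trailing newline */
    return lines;
}
```

With the sample above it could be called as `count_log_lines((const char*)values[0], ERROR_BUFFER_SIZE)` after `cuModuleLoadDataEx` returns.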

  3 - Sections 4.5.1.4, 4.5.2.8, 4.5.3.11
    - Updated with Direct3D 10 interoperability

4.5.1.4: Direct3D interoperability is supported for Direct3D 9.0 and Direct3D 10.0.
CUDA 2.0 only supported Direct3D 9.0 directly. Direct3D 10.0 is now supported as well, so those of you on 10.0 can give it a try.

4.5.2.8: this section shows how to use Direct3D 10.0 resources from the runtime API
void* devPtr;
cudaD3D10ResourceGetMappedPointer(&devPtr, buffer, 0);
size_t size;
cudaD3D10ResourceGetMappedSize(&size, buffer, 0);
cudaMemset(devPtr, 0, size);

// host code
void* devPtr;
cudaD3D10ResourceGetMappedPointer(&devPtr, surface, 0);
size_t pitch;
cudaD3D10ResourceGetMappedPitch(&pitch, 0, surface, 0);
dim3 Db = dim3(16, 16);
// Round up so that partial tiles at the edges are still covered
dim3 Dg = dim3((width+Db.x-1)/Db.x, (height+Db.y-1)/Db.y);
myKernel<<<Dg, Db>>>((unsigned char*)devPtr,
width, height, pitch);
// device code
__global__ void myKernel(unsigned char* surface,
int width, int height, size_t pitch)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= width || y >= height) return;
float* pixel = (float*)(surface + y * pitch) + 4 * x;
}
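
The addressing in the kernel's last line is easy to get wrong, so here is a minimal host-side sketch of the same arithmetic in plain C. It assumes 4 floats per pixel, as the `4 * x` in the guide's kernel implies; the helper name `pixel_at` is hypothetical.

```c
#include <stddef.h>

/* Address of pixel (x, y) in a pitched surface of 4-float pixels.
   Rows start `pitch` bytes apart, so the row offset is applied in bytes
   before the pointer is reinterpreted as float*. */
static float *pixel_at(unsigned char *surface, size_t pitch, int x, int y)
{
    return (float *)(surface + (size_t)y * pitch) + 4 * x;
}
```

With an illustrative pitch of 64 bytes, pixel (2, 1) sits at byte offset 1 * 64 + 2 * 16 = 96 from the start of the surface. The pitch is whatever the mapping call reports, which may be larger than width * 16.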

4.5.3.11: the same procedure for Direct3D 10.0 through the driver API
CUdeviceptr devPtr;
cuD3D10ResourceGetMappedPointer(&devPtr, buffer, 0);
size_t size;
cuD3D10ResourceGetMappedSize(&size, buffer, 0);
cuMemsetD8(devPtr, 0, size);

// host code
CUdeviceptr devPtr;
cuD3D10ResourceGetMappedPointer(&devPtr, surface, 0);
size_t pitch;
cuD3D10ResourceGetMappedPitch(&pitch, 0, surface, 0);
cuModuleGetFunction(&cuFunction, cuModule, "myKernel");
cuFuncSetBlockShape(cuFunction, 16, 16, 1);
// Each argument is written at the running offset, not at 0
int offset = 0;
cuParamSeti(cuFunction, offset, devPtr);
offset += sizeof(devPtr);
cuParamSeti(cuFunction, offset, width);
offset += sizeof(width);
cuParamSeti(cuFunction, offset, height);
offset += sizeof(height);
cuParamSeti(cuFunction, offset, pitch);
offset += sizeof(pitch);
cuParamSetSize(cuFunction, offset);
// Db: the 16x16 block shape, as in the runtime API example
cuLaunchGrid(cuFunction,
(width+Db.x-1)/Db.x, (height+Db.y-1)/Db.y);
// device code
__global__ void myKernel(unsigned char* surface,
int width, int height, size_t pitch)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= width || y >= height) return;
float* pixel = (float*)(surface + y * pitch) + 4 * x;
}
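
With the driver API the kernel arguments are laid out by hand, one after another, and each `cuParamSet*` call needs the offset at which its argument starts. The sketch below imitates that bookkeeping in plain C against an ordinary byte buffer. The helper `param_put` is hypothetical, and aligning each scalar to its own size is an assumption that holds for common scalar types rather than a rule taken from the guide.

```c
#include <stddef.h>
#include <string.h>

/* Copy one kernel argument into buf at the next suitably aligned offset
   and return the offset just past it. Assumes `size` is a power of two
   and that each scalar is aligned to its own size. */
static size_t param_put(unsigned char *buf, size_t offset,
                        const void *value, size_t size)
{
    offset = (offset + size - 1) & ~(size - 1);  /* round up to alignment */
    memcpy(buf + offset, value, size);
    return offset + size;
}
```

Packing a 4-byte device pointer, `width`, `height` and a 4-byte pitch this way places them at offsets 0, 4, 8 and 12, and the final return value, 16, is the total that would go to `cuParamSetSize`.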

  4 - Section 4.5.2.2
    - Any subsequent explicit call to cudaSetDevice() will now fail

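In earlier versions, calling cudaSetDevice() again simply had no effect. In the 2.1 beta a repeated call fails. PS: I have only just downloaded the 2.1 beta and have not tested this yet; presumably the call returns an error.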

  5 - Section 4.5.2.7
    - cudaGLSetGLDevice() must be called for proper OpenGL interoperability

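Before using OpenGL you now have to tell CUDA to enable OpenGL interoperability by calling cudaGLSetGLDevice(); earlier versions did not require this. My guess is that this is groundwork for tying the GPU's compute mode and graphics mode together more closely, so that compute results can eventually be handed straight to the display side instead of being copied out as they are today. This is probably the first step, and those of you doing real-time rendering will surely look forward to that kind of support.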

  6 - Section 4.5.3.10
    - Context must be created with cuGLCtxCreate() for OpenGL interoperability

Same as 4.5.2.7, except that this applies to the driver API.

  7 - Section 4.6
    - Mode switches cause runtime calls to fail

4.6 Mode Switches
GPUs dedicate some DRAM memory to the so-called primary surface, which is used
to refresh the display device whose output is viewed by the user. When users initiate
a mode switch of the display by changing the resolution or bit depth of the display
(using NVIDIA control panel or the Display control panel on Windows), the
amount of memory needed for the primary surface changes. For example, if the user
changes the display resolution from 1280x1024x32-bit to 1600x1200x32-bit, the
system must dedicate 7.68 MB to the primary surface rather than 5.24 MB. (Full-screen
graphics applications running with anti-aliasing enabled may require much
more display memory for the primary surface.) On Windows, other events that may
initiate display mode switches include launching a full-screen DirectX application,
hitting Alt+Tab to task switch away from a full-screen DirectX application, or
hitting Ctrl+Alt+Del to lock the computer.
If a mode switch increases the amount of memory needed for the primary surface,
the system may have to cannibalize memory allocations dedicated to CUDA
applications. Therefore, a mode switch results in any call to the CUDA runtime to
fail and return an invalid context error.

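The key point of this passage is really its last sentence. When a display mode switch happens, GPU memory may be reassigned to the display side, which affects the CUDA compute side: calls to the CUDA runtime then fail with an invalid context error, and the program will most likely exit.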
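
The figures in the guide's example are easy to reproduce. The sketch below, in plain C, computes the size of a primary surface as width times height times bytes per pixel; it assumes a single unpadded buffer, and the function name `primary_surface_bytes` is hypothetical.

```c
/* Bytes needed for one primary surface: width * height * bytes per pixel.
   Assumes bits_per_pixel is a multiple of 8 and no row padding. */
static unsigned long primary_surface_bytes(unsigned long width,
                                           unsigned long height,
                                           unsigned long bits_per_pixel)
{
    return width * height * (bits_per_pixel / 8);
}
```

1280x1024 at 32 bits comes to 5,242,880 bytes (about 5.24 MB) and 1600x1200 at 32 bits to 7,680,000 bytes (7.68 MB), matching the guide, so this mode switch needs roughly 2.4 MB more.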

  8 - Section A.1
    - Updated with latest GPUs

The list now includes the latest CUDA-capable Nvidia GPUs.

------------------------------------------------------------------------------------------------------------------------------------------------

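Having gone through the changes above:

The one I think needs attention is the first: the grid's z must be 1.

The one I find most exciting is the second: PTX compilation is supported directly, so we can write PTX code by hand :)

Actually there is something even more exciting: the 2.1 beta on Linux supports a debugger for device code, so debugging will no longer be such a headache.

Looking forward to a Windows version of the device debugger.

Let's look at this passage, which opens the Linux debugger documentation: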

CUDA-GDB: The NVIDIA CUDA Debugger
CUDA-GDB is a ported version of GDB: The GNU Debugger, version 6.6. The
goal of its design is to present the user with an all-in-one debugging environment
that is capable of debugging native host code as well as CUDA code. Therefore, it
is an extension to the standard i386 port that is provided in the GDB release. As a
result, standard debugging features are inherently supported for host code, and
additional features have been provided to support debugging CUDA code. CUDA-GDB
is supported on 32-bit Linux in the 2.1 Beta release.
All information contained within this document is subject to change.

The debugger will surely change a lot in the future, so I won't go into detail here; using it is much like using gdb :) Looking forward to the debugger on Windows.

----------------------------------------------------------------------------------------------------------------------------

For VS2008 users the 2.1 beta has also been worth the wait: it supports VS2008. That means my CUDA VS Wizard 2.0beta is due for an upgrade.

However, VS2003 is no longer supported.

------------------------------------------------------------------------------------------------------------------------------------------------------

One bug has been fixed. If you have run into this problem when using OpenGL, check whether this was the cause :)

--------------------------------------------------------------------------------
Major Bug Fixes
--------------------------------------------------------------------------------

o OpenGL interoperability will now only copy shared buffers through
host memory when CUDA and OpenGL are running on different GPUs.

------------------------------------------------------------------------------------------------------------------------------------------------------

These are the issues to watch out for:

--------------------------------------------------------------------------------
Known Issues
--------------------------------------------------------------------------------

Vista Specific Issues:

o In order to run CUDA on a non-TESLA GPU, either the Windows desktop
must be extended onto the GPU, or the GPU must be selected as the
PhysX GPU.

o Individual kernels are limited to a 2-second runtime by Windows
Vista. Kernels that run for longer than 2 seconds will trigger
the Timeout Detection and Recovery (TDR) mechanism. For more
information, see
http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx.

o The CUDA Profiler does not support performance counter events
on Windows Vista. All profiler configuration regarding
performance counter events is ignored.

o On Windows Vista, asynchronous memory copies do not support
GPU overlap. CU_DEVICE_ATTRIBUTE_GPU_OVERLAP will be 0 for all
devices.

o The maximum size of a single allocation created by cudaMalloc
or cuMemAlloc is limited to
( System Memory Size in MB - 512 MB ) / 2.

XP Specific Issues:

o Individual GPU program launches are limited to a run time
of less than 5 seconds on a GPU with a display attached.
Exceeding this time limit usually causes a launch failure
reported through the CUDA driver or the CUDA runtime. GPUs
without a display attached are not subject to the 5 second
runtime restriction. For this reason it is recommended that
CUDA be run on a GPU that is NOT attached to a display and
does not have the Windows desktop extended onto it. In this
case, the system must contain at least one NVIDIA GPU that
serves as the primary graphics adapter.

Issues Common to XP and Vista:

o GPU enumeration order on multi-GPU systems is non-deterministic and
may change with this or future releases. Users should make sure to
enumerate all CUDA-capable GPUs in the system and select the most
appropriate one(s) to use.

o Applications that try to use too much memory may cause a
CUDA memcopy or kernel to fail with the error
CUDA_ERROR_OUT_OF_MEMORY. If this happens, the CUDA Context is
placed into an error state and must be destroyed and recreated
if the application wants to continue using CUDA.

o Malloc may fail due to running out of virtual memory space.
The address space limitation is fixed by a Microsoft issued
hotfix. Please install the patch located at
http://support.microsoft.com/kb/940105 if this is an issue.
Windows Vista SP1 includes this hotfix.

o When two GPUs are run in SLI mode, only one of the GPUs will be
available to the user for executing CUDA programs.

o When using Microsoft Visual Studio 8.0, it is required that Service Pack 1
be installed. Certain Windows C++ header files will cause a crash in cudafe
without it.

o The default compilation mode for host code is now C++. To restore the old
behavior, use the option --host-compilation=c

o For maximum performance when using multiple byte sizes to access the
same data, coalesce adjacent loads and stores when possible rather
than using a union or individual byte accesses. Accessing the data via
a union may result in the compiler reserving extra memory for the object,
and accessing the data as individual bytes may result in non-coalesced
accesses. This will be improved in a future compiler release.
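
The Vista allocation limit listed above, ( System Memory Size in MB - 512 MB ) / 2, is worth working through for your own machine. Here is a minimal sketch in plain C; the function name `vista_max_alloc_mb` is hypothetical, and returning 0 for systems with 512 MB or less is my own assumption, since the release notes do not cover that case.

```c
/* Largest single cudaMalloc / cuMemAlloc allocation on Vista, in MB,
   per the release-note formula (system memory in MB - 512 MB) / 2.
   Returns 0 when system memory is 512 MB or less (an assumption). */
static unsigned long vista_max_alloc_mb(unsigned long system_memory_mb)
{
    if (system_memory_mb <= 512) return 0;
    return (system_memory_mb - 512) / 2;
}
```

A machine with 2 GB of system memory is therefore limited to 768 MB per allocation, and one with 4 GB to 1792 MB, regardless of how much memory the GPU itself has.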

It should be available for download directly from Nvidia's website tomorrow or the day after; I got my copy from Nvidia's CUDA forum :)
