扫地的小何尚

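CUDA Cooperative Groups Explained

Cooperative Groups in CUDA

1. Introduction to Cooperative Groups

Cooperative Groups is an extension to the CUDA programming model, introduced in CUDA 9, for organizing groups of communicating threads. It lets developers express the granularity at which threads communicate, which enables richer and more efficient parallel decompositions.

Historically, the CUDA programming model provided a single, simple construct for synchronizing cooperating threads: a barrier across all threads of a thread block, implemented with the __syncthreads() intrinsic. Programmers, however, want to define and synchronize groups of threads at other granularities, and to expose them as "collective" group-wide function interfaces, in order to gain performance, design flexibility, and software reuse. To express these broader patterns of parallel interaction, many performance-oriented programmers have resorted to writing their own ad hoc, unsafe primitives for synchronizing threads within a single warp, or across sets of thread blocks running on a single GPU. The performance gains were often valuable, but the result was a growing body of brittle code that is expensive to write, tune, and maintain over time and across GPU generations. Cooperative Groups addresses this by providing a safe and future-proof mechanism for writing performant code.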

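2. What's New in CUDA 11.0

Separate compilation is no longer required to use grid-wide groups, and synchronizing such a group is now 30% faster. Cooperative launches are additionally enabled on the latest Windows platforms and are supported when running under MPS.
grid_group is now convertible to thread_group.
New collectives for thread block tiles and coalesced groups: reduce and memcpy_async.
New partition operations for thread block tiles and coalesced groups: labeled_partition and binary_partition.
New APIs, meta_group_rank and meta_group_size, which provide information about the partitioning that led to the creation of a group.
Thread block tiles can now encode their parent in the type, which allows better compile-time optimization of the emitted code.
Interface change: grid_group must be constructed with this_grid() at the time of declaration. The default constructor has been removed.

Note: This release moves toward requiring C++11 for new features. In a future release, C++11 will be required for all existing APIs as well.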

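3. Cooperative Groups Programming Model

The Cooperative Groups programming model describes synchronization patterns both within and across CUDA thread blocks. It gives applications the means to define their own groups of threads and the interfaces to synchronize them. It also provides new launch APIs that enforce certain restrictions and can therefore guarantee that the synchronization works. These primitives enable new patterns of cooperative parallelism within CUDA, including producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire grid.

The Cooperative Groups programming model consists of the following elements:

Data types for representing groups of cooperating threads;
Operations to obtain implicit groups defined by the CUDA launch API (e.g., thread blocks);
Collectives for partitioning existing groups into new groups;
Collective algorithms for data movement and manipulation (e.g., memcpy_async, reduce, scan);
An operation to synchronize all threads within a group;
Operations to inspect group properties;
Collectives that expose low-level, group-specific, and often hardware-accelerated operations.

The central concept in Cooperative Groups is an object that names the set of threads belonging to it. Representing groups as first-class program objects improves software composition, because collective functions can receive an explicit object that represents the group of participating threads. The object also makes the programmer's intent explicit, which eliminates unsound architectural assumptions that lead to brittle code, removes undesirable restrictions on compiler optimizations, and improves compatibility with new GPU generations.

To write efficient code, prefer specialized groups (generic groups lose many compile-time optimizations), and pass group objects by reference to the functions that intend to use those threads cooperatively.

Cooperative Groups requires CUDA 9.0 or later. To use it, include the header file: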

// Primary header is compatible with pre-C++11, collective algorithm headers require C++11
#include <cooperative_groups.h>
// Optionally include for memcpy_async() collective
#include <cooperative_groups/memcpy_async.h>
// Optionally include for reduce() collective
#include <cooperative_groups/reduce.h>
// Optionally include for inclusive_scan() and exclusive_scan() collectives
#include <cooperative_groups/scan.h>

and use the Cooperative Groups namespace:

using namespace cooperative_groups;
// Alternatively use an alias to avoid polluting the namespace with collective algorithms
namespace cg = cooperative_groups;

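The code can be compiled in the normal way with nvcc. However, if you want to use memcpy_async, reduce, or scan and your host compiler's default dialect is not C++11 or later, you must add --std=c++11 to the command line.

3.1. Composition Example

To illustrate the concept of groups, this example performs a block-wide sum reduction. Previously, code like this carried a hidden constraint on its implementation: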

__device__ int sum(int *x, int n) {
    // ...
    __syncthreads();
    return total;
}

__global__ void parallel_kernel(float *x) {
    // ...
    // Entire thread block must call sum
    sum(x, n);
}

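All threads in the thread block must reach the __syncthreads() barrier, but this constraint is hidden from a developer who simply wants to call sum(...). With Cooperative Groups, a better way to write this is: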

__device__ int sum(const thread_block& g, int *x, int n) {
    // ...
    g.sync();
    return total;
}

__global__ void parallel_kernel(...) {
    // ...
    // Entire thread block must call sum
    thread_block tb = this_thread_block();
    sum(tb, x, n);
    // ...
}
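
The fragments above elide the body of sum. As a minimal sketch of how the group-aware version might be completed, the following assumes a single block of kBlockSize threads, a shared-memory tree reduction, and a host helper named run_block_sum. These names and the reduction strategy are illustrative assumptions, not part of the original example.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Assumed block size for this sketch; must be a power of two.
constexpr int kBlockSize = 256;

// Block-wide sum. The thread_block parameter documents that every thread
// of the block has to call this function.
__device__ int sum(const cg::thread_block& g, const int* x, int n) {
    __shared__ int scratch[kBlockSize];
    const int rank = static_cast<int>(g.thread_rank());
    const int size = static_cast<int>(g.size());  // legacy alias of num_threads()

    // Each thread first accumulates a strided slice of the input.
    int partial = 0;
    for (int i = rank; i < n; i += size) {
        partial += x[i];
    }
    scratch[rank] = partial;
    g.sync();

    // Shared-memory tree reduction; all threads reach every g.sync().
    for (int stride = size / 2; stride > 0; stride /= 2) {
        if (rank < stride) {
            scratch[rank] += scratch[rank + stride];
        }
        g.sync();
    }
    const int total = scratch[0];
    g.sync();  // keep scratch intact until every thread has read the result
    return total;
}

__global__ void parallel_kernel(const int* x, int n, int* out) {
    cg::thread_block tb = cg::this_thread_block();
    int total = sum(tb, x, n);  // called by the entire thread block
    if (tb.thread_rank() == 0) {
        *out = total;
    }
}

// Hypothetical host helper: sums 0..n-1 on the device, returns -1 on error.
int run_block_sum(int n) {
    std::vector<int> h_x(n);
    for (int i = 0; i < n; ++i) h_x[i] = i;

    int* d_x = nullptr;
    int* d_out = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) { cudaFree(d_x); return -1; }
    cudaMemcpy(d_x, h_x.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    parallel_kernel<<<1, kBlockSize>>>(d_x, n, d_out);

    int result = -1;
    cudaError_t err = cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_out);
    return err == cudaSuccess ? result : -1;
}
```

Calling run_block_sum(1000) from main should return the sum of 0..999. Because sum takes the group as an argument, a caller cannot invoke it without first obtaining a thread_block handle, which makes the "whole block must call" requirement visible at the call site.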

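4. Group Types

4.1. Implicit Groups

Implicit groups represent the launch configuration of the kernel. Regardless of how the kernel is written, it always has a set number of threads, blocks and block dimensions, and a single grid with its grid dimensions. In addition, if the multi-device cooperative launch API is used, it can have multiple grids (one per device). These groups provide a starting point for decomposition into finer-grained groups, which are typically hardware accelerated and more specialized for the problem the developer is solving.

Although you can create an implicit group anywhere in the code, it is dangerous to do so. Creating a handle for an implicit group is a collective operation: all threads in the group must participate. If the group is created in a conditional branch that not all threads reach, the result can be a deadlock or data corruption. For this reason, create the handle for an implicit group up front (as early as possible, before any branching has occurred) and use that handle throughout the kernel. For the same reason, group handles must be initialized at declaration time (there is no default constructor), and copy-constructing them is discouraged.

4.1.1. Thread Block Group

Every CUDA programmer is already familiar with one particular group of threads: the thread block. The Cooperative Groups extension introduces a new data type, thread_block, to represent this concept explicitly within the kernel.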

class thread_block;

thread_block g = this_thread_block();

Public Member Functions:

`static void sync()`:	Synchronize the threads named in the group
`static unsigned int thread_rank()`:	Rank of the calling thread within [0, num_threads)
`static dim3 group_index()`:	3-Dimensional index of the block within the launched grid
`static dim3 thread_index()`:	3-Dimensional index of the thread within the launched block
`static dim3 dim_threads()`:	Dimensions of the launched block in units of threads
`static unsigned int num_threads()`:	Total number of threads in the group

Legacy member functions (aliases):

`static unsigned int size()`:	Total number of threads in the group (alias of num_threads())
`static dim3 group_dim()`:	Dimensions of the launched block (alias of dim_threads())

Example:

/// Loading an integer from global into shared memory
__global__ void kernel(int *globalInput) {
    __shared__ int x;
    thread_block g = this_thread_block();
    // Choose a leader in the thread block
    if (g.thread_rank() == 0) {
        // load from global into shared for all threads to work with
        x = (*globalInput);
    }
    // After loading data into shared memory, you want to synchronize
    // if all threads in your thread block need to see it
    g.sync(); // equivalent to __syncthreads();
}

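Note: All threads in the group must participate in collective operations, otherwise the behavior is undefined.

Related: The thread_block data type is derived from the more generic thread_group data type, which can be used to represent a wider class of groups.

4.1.2. Grid Group

This group object represents all the threads launched in a single grid. APIs other than sync() are available at all times, but to synchronize across the grid you need to use the cooperative launch API.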

class grid_group;
grid_group g = this_grid();

Public Member Functions:

`bool is_valid() const`:	Returns whether the grid_group can synchronize
`void sync() const`:	Synchronize the threads named in the group
`static unsigned long long thread_rank()`:	Rank of the calling thread within [0, num_threads)
`static unsigned long long block_rank()`:	Rank of the calling block within [0, num_blocks)
`static unsigned long long num_threads()`:	Total number of threads in the group
`static unsigned long long num_blocks()`:	Total number of blocks in the group
`static dim3 dim_blocks()`:	Dimensions of the launched grid in units of blocks
`static dim3 block_index()`:	3-Dimensional index of the block within the launched grid

Legacy member functions (aliases):

`static unsigned long long size()`:	Total number of threads in the group (alias of num_threads())
`static dim3 group_dim()`:	Dimensions of the launched grid (alias of dim_blocks())

4.1.3. Multi Grid Group

This group object represents all the threads launched across all devices of a multi-device cooperative launch. Unlike grid_group, all of its APIs require that the kernel was launched with the appropriate launch API.

class multi_grid_group;

Constructed via:

// Kernel must be launched with the cooperative multi-device API
multi_grid_group g = this_multi_grid();

Public Member Functions:

`bool is_valid() const`:	Returns whether the multi_grid_group can be used
`void sync() const`:	Synchronize the threads named in the group
`unsigned long long num_threads() const`:	Total number of threads in the group
`unsigned long long thread_rank() const`:	Rank of the calling thread within [0, num_threads)
`unsigned int grid_rank() const`:	Rank of the grid within [0, num_grids)
`unsigned int num_grids() const`:	Total number of grids launched

Legacy member functions (aliases):

`unsigned long long size() const`:	Total number of threads in the group (alias of `num_threads()`)

4.2. Explicit Groups

4.2.1. Thread Block Tile

A templated version of the tile group, in which a template parameter specifies the tile size. Because the size is known at compile time, more optimal execution is possible.

template <unsigned int Size, typename ParentT = void>
class thread_block_tile;

Constructed via:

template <unsigned int Size, typename ParentT>
_CG_QUALIFIER thread_block_tile<Size, ParentT> tiled_partition(const ParentT& g)

Size must be a power of 2 and less than or equal to 32.

ParentT is the parent type from which this group was partitioned. It is inferred automatically, but a value of void stores this information in the group handle rather than in the type.

Public Member Functions:

`void sync() const`:	Synchronize the threads named in the group
`unsigned long long num_threads() const`:	Total number of threads in the group
`unsigned long long thread_rank() const`:	Rank of the calling thread within [0, num_threads)
`unsigned long long meta_group_size() const`:	Returns the number of groups created when the parent group was partitioned.
`unsigned long long meta_group_rank() const`:	Linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta_group_size)
`T shfl(T var, unsigned int src_rank) const`:	Refer to Warp Shuffle Functions
`T shfl_up(T var, int delta) const`:	Refer to Warp Shuffle Functions
`T shfl_down(T var, int delta) const`:	Refer to Warp Shuffle Functions
`T shfl_xor(T var, int delta) const`:	Refer to Warp Shuffle Functions
`int any(int predicate) const`:	Refer to Warp Vote Functions
`int all(int predicate) const`:	Refer to Warp Vote Functions
`unsigned int ballot(int predicate) const`:	Refer to Warp Vote Functions
`unsigned int match_any(T val) const`:	Refer to Warp Match Functions
`unsigned int match_all(T val, int &pred) const`:	Refer to Warp Match Functions

Legacy member functions (aliases):

`unsigned long long size() const`:	Total number of threads in the group (alias of num_threads())

Notes:

The shfl, shfl_up, shfl_down, and shfl_xor functions accept objects of any type when compiled with C++11 or later. This means non-integral types can be shuffled as long as they satisfy the following constraints:

The type is trivially copyable, i.e., is_trivially_copyable<T>::value == true
sizeof(T) <= 32

Example:

/// The following code will create two sets of tiled groups, of size 32 and 4 respectively:
/// The latter has the provenance encoded in the type, while the first stores it in the handle
thread_block block = this_thread_block();
thread_block_tile<32> tile32 = tiled_partition<32>(block);
thread_block_tile<4, thread_block> tile4 = tiled_partition<4>(block);

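Note: The thread_block_tile templated data structure is used here, and the size of the group is passed to the tiled_partition call as a template parameter rather than as a function argument.

4.2.1.1. Warp-Synchronous Code Pattern

Developers may have existing warp-synchronous code that made implicit assumptions about the warp size and was written around that number. Now the size has to be specified explicitly.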

__global__ void cooperative_kernel(...) {
    // obtain default "current thread block" group
    thread_block my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    auto my_tile = tiled_partition<32>(my_block);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_tile.meta_group_rank() == 0) {
        // ...
        my_tile.sync();
    }
}
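
To make the pattern concrete, here is a minimal sketch of a warp-level sum written against an explicit 32-thread tile instead of an implicit warp-size assumption. It uses shfl_down from the member list above; the kernel name, the host helper run_tile_sum, and the choice of summing thread ranks are assumptions made only for illustration.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Sums val across a 32-thread tile. The full result is valid in tile rank 0.
__device__ int tile_sum(const cg::thread_block_tile<32>& tile, int val) {
    for (int offset = static_cast<int>(tile.size()) / 2; offset > 0; offset /= 2) {
        // Each thread adds the value held by the thread offset ranks above it.
        val += tile.shfl_down(val, offset);
    }
    return val;
}

// Every tile sums the block-wide ranks of its own threads.
__global__ void tile_sum_kernel(int* per_tile_out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int total = tile_sum(tile, static_cast<int>(block.thread_rank()));
    if (tile.thread_rank() == 0) {
        per_tile_out[block.thread_rank() / 32] = total;
    }
}

// Hypothetical host helper: launches one block of 64 threads (two tiles)
// and returns the sum computed by the requested tile, or -1 on error.
int run_tile_sum(int tile_index) {
    if (tile_index < 0 || tile_index > 1) return -1;
    int* d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(int)) != cudaSuccess) return -1;

    tile_sum_kernel<<<1, 64>>>(d_out);

    int h_out[2] = {-1, -1};
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out[tile_index] : -1;
}
```

The tile width appears only in the thread_block_tile<32> type, so the shuffle loop does not hard-code a warp size and reads the group size from the handle instead.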

4.2.1.2. Single thread group

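A group representing the current thread can be obtained from the this_thread function:

thread_block_tile<1> this_thread();

The following memcpy_async call uses a thread_group to copy an int element from source to destination:

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

cooperative_groups::memcpy_async(cooperative_groups::this_thread(), dest, src, sizeof(int));

More detailed examples of using this_thread to perform asynchronous copies can be found in the sections "Single-Stage Asynchronous Data Copies using cuda::pipeline" and "Multi-Stage Asynchronous Data Copies using cuda::pipeline" of the CUDA C++ Programming Guide.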

4.2.1.3. Thread Block Tile of size larger than 32

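Thread block tiles of size 64, 128, 256, or 512 can be obtained with the new APIs in the cooperative_groups::experimental namespace. To use them, _CG_ABI_EXPERIMENTAL must be defined in the source code. A small amount of memory must be reserved for the thread_block_tile before partitioning. This is done with the cooperative_groups::experimental::block_tile_memory struct template, which must reside in either shared or global memory.

template <unsigned int TileCommunicationSize = 8, unsigned int MaxBlockSize = 1024>
struct block_tile_memory;

TileCommunicationSize determines how much memory is reserved for collective operations. If such an operation is performed on a type larger than the specified communication size, the collective may involve multiple transfers and take longer to complete.

MaxBlockSize specifies the maximum number of threads in the current thread block. This parameter can be used to minimize the shared memory usage of block_tile_memory in kernels that are only launched with smaller thread counts.

The block_tile_memory then needs to be passed into cooperative_groups::experimental::this_thread_block, which allows the resulting thread_block to be partitioned into tiles larger than 32 threads. The overload of this_thread_block that accepts a block_tile_memory argument is a collective operation and must be called by all threads of the thread block. The returned thread_block can be partitioned with the experimental::tiled_partition function template, which accepts the same arguments as the regular tiled_partition.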

#define _CG_ABI_EXPERIMENTAL // enable experimental API

__global__ void cooperative_kernel(...) {
    // reserve shared memory for thread_block_tile usage.
    __shared__ experimental::block_tile_memory<4, 256> shared;
    thread_block thb = experimental::this_thread_block(shared);

    auto tile = experimental::tiled_partition<128>(thb);

    // ...
}

Public Member Functions:

`void sync() const`:	Synchronize the threads named in the group
`unsigned long long num_threads() const`:	Total number of threads in the group
`unsigned long long thread_rank() const`:	Rank of the calling thread within [0, num_threads)
`unsigned long long meta_group_size() const`:	Returns the number of groups created when the parent group was partitioned.
`unsigned long long meta_group_rank() const`:	Linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta_group_size)
`T shfl(T var, unsigned int src_rank) const`:	Refer to Warp Shuffle Functions. Note: All threads in the group have to specify the same src_rank, otherwise the behavior is undefined.
`int any(int predicate) const`:	Refer to Warp Vote Functions
`int all(int predicate) const`:	Refer to Warp Vote Functions

Legacy member functions (aliases):

`unsigned long long size() const`:	Total number of threads in the group (alias of num_threads())

4.2.2. Coalesced Groups

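In CUDA's SIMT architecture, at the hardware level the multiprocessor executes threads in groups of 32 called warps. If the application code contains a data-dependent conditional branch that causes threads within a warp to diverge, the warp executes each branch serially and disables the threads that are not on that path. The threads that remain active on a path are referred to as coalesced. Cooperative Groups can discover and create a group containing all coalesced threads.

Constructing the group handle via coalesced_threads() is opportunistic. It returns the set of threads that are active at that point in time, and makes no guarantee about which threads are returned (as long as they are active) or that they stay coalesced throughout execution (they are brought back together for the execution of a collective, but can diverge again afterwards).

class coalesced_group;

Constructed via:

coalesced_group active = coalesced_threads();

Public Member Functions: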

`void sync() const`:	Synchronize the threads named in the group
`unsigned long long num_threads() const`:	Total number of threads in the group
`unsigned long long thread_rank() const`:	Rank of the calling thread within [0, num_threads)
`unsigned long long meta_group_size() const`:	Returns the number of groups created when the parent group was partitioned. If this group was created by querying the set of active threads, e.g. coalesced_threads() the value of meta_group_size() will be 1.
`unsigned long long meta_group_rank() const`:	Linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta_group_size). If this group was created by querying the set of active threads, e.g. coalesced_threads() the value of meta_group_rank() will always be 0.
`T shfl(T var, unsigned int src_rank) const`:	Refer to Warp Shuffle Functions
`T shfl_up(T var, int delta) const`:	Refer to Warp Shuffle Functions
`T shfl_down(T var, int delta) const`:	Refer to Warp Shuffle Functions
`int any(int predicate) const`:	Refer to Warp Vote Functions
`int all(int predicate) const`:	Refer to Warp Vote Functions
`unsigned int ballot(int predicate) const`:	Refer to Warp Vote Functions
`unsigned int match_any(T val) const`:	Refer to Warp Match Functions
`unsigned int match_all(T val, int &pred) const`:	Refer to Warp Match Functions

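Legacy member functions (aliases):

`unsigned long long size() const`:	Total number of threads in the group (alias of `num_threads()`)

Note: The shfl, shfl_up, and shfl_down functions accept objects of any type when compiled with C++11 or later. This means non-integral types can be shuffled as long as they satisfy the following constraints:

The type is trivially copyable, i.e., is_trivially_copyable<T>::value == true
sizeof(T) <= 32

Example: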

/// Consider a situation whereby there is a branch in the
/// code in which only the 2nd, 4th and 8th threads in each warp are
/// active. The coalesced_threads() call, placed in that branch, will create (for each
/// warp) a group, active, that has three threads (with
/// ranks 0-2 inclusive).
__global__ void kernel(int *globalInput) {
    // Lets say globalInput says that threads 2, 4, 8 should handle the data
    if (threadIdx.x == *globalInput) {
        coalesced_group active = coalesced_threads();
        // active contains 0-2 inclusive
        active.sync();
    }
}

4.2.2.1. Discovery Pattern

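Commonly, developers need to work with the current set of active threads. No assumption is made about which threads are present; the developer works with whichever threads happen to be there. This is shown in the following "aggregating atomic increment across threads in a warp" example (written using the correct CUDA 9.0 set of intrinsics):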

{
    unsigned int writemask = __activemask();
    unsigned int total = __popc(writemask);
    unsigned int prefix = __popc(writemask & __lanemask_lt());
    // Find the lowest-numbered active lane
    int elected_lane = __ffs(writemask) - 1;
    int base_offset = 0;
    if (prefix == 0) {
        base_offset = atomicAdd(p, total);
    }
    base_offset = __shfl_sync(writemask, base_offset, elected_lane);
    int thread_offset = prefix + base_offset;
    return thread_offset;
}

This can be rewritten with Cooperative Groups as follows:

{
    cg::coalesced_group g = cg::coalesced_threads();
    int prev;
    if (g.thread_rank() == 0) {
        prev = atomicAdd(p, g.num_threads());
    }
    prev = g.thread_rank() + g.shfl(prev, 0);
    return prev;
}
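
The fragment above is only a function body. As a sketch of how it might be used end to end, the following wraps it in a device function and applies it to a simple filter that keeps the positive elements of an array. The names atomic_agg_inc, filter_positive, and run_filter, as well as the test data, are assumptions for illustration.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Warp-aggregated increment: one atomicAdd per group of coalesced threads.
// Returns a distinct slot index for every calling thread.
__device__ int atomic_agg_inc(int* counter) {
    cg::coalesced_group g = cg::coalesced_threads();
    int prev = 0;
    if (g.thread_rank() == 0) {
        // The leader reserves one slot for every active thread.
        prev = atomicAdd(counter, static_cast<int>(g.size()));
    }
    // Broadcast the leader's base offset and add this thread's rank.
    return static_cast<int>(g.thread_rank()) + g.shfl(prev, 0);
}

// Copies the positive elements of in to out (in unspecified order).
__global__ void filter_positive(const int* in, int n, int* out, int* count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0) {
        // Only the threads that took this branch are part of the group.
        out[atomic_agg_inc(count)] = in[i];
    }
}

// Hypothetical host helper: filters the values -n/2 .. n/2-1 and returns
// how many elements were kept, or -1 on error.
int run_filter(int n) {
    std::vector<int> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = i - n / 2;

    int *d_in = nullptr, *d_out = nullptr, *d_count = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) { cudaFree(d_in); return -1; }
    if (cudaMalloc(&d_count, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in); cudaFree(d_out); return -1;
    }
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));

    const int threads = 256;
    filter_positive<<<(n + threads - 1) / threads, threads>>>(d_in, n, d_out, d_count);

    int count = -1;
    cudaError_t err = cudaMemcpy(&count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_out); cudaFree(d_count);
    return err == cudaSuccess ? count : -1;
}
```

Because coalesced_threads() is called inside the branch, the group contains exactly the threads that passed the predicate, so the number of atomics drops from one per kept element to one per group of active threads in a warp.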

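5. Group Partitioning

5.1. tiled_partition

template <unsigned int Size, typename ParentT>
thread_block_tile<Size, ParentT> tiled_partition(const ParentT& g);

thread_group tiled_partition(const thread_group& parent, unsigned int tilesz);

The tiled_partition method is a collective operation that partitions the parent group into a one-dimensional, row-major tiling of subgroups. A total of (size(parent) / tilesz) subgroups are created, so the parent group size must be evenly divisible by the tile size. The allowed parent groups are thread_block and thread_block_tile.

The implementation may cause the calling thread to wait until all members of the parent group have invoked the operation before resuming execution. Functionality is limited to native hardware sizes, 1/2/4/8/16/32, and cg::size(parent) must be greater than the Size parameter. The experimental version in the cooperative_groups::experimental namespace supports sizes 64/128/256/512.

Codegen requirements: Compute Capability 3.5 minimum, C++11 for sizes larger than 32

Example: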

/// The following code will create a 32-thread tile
thread_block block = this_thread_block();
thread_block_tile<32> tile32 = tiled_partition<32>(block);

We can partition each of these groups into even smaller groups, each of 4 threads:

auto tile4 = tiled_partition<4>(tile32);
// or using a general group
// thread_group tile4 = tiled_partition(tile32, 4);

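If, for instance, we then included the following line of code:

if (tile4.thread_rank()==0) printf("Hello from tile4 rank 0\n");

then the statement would be printed by every fourth thread in the block: the threads of rank 0 in each tile4 group, which correspond to the threads with ranks 0, 4, 8, 12, ... in the block group.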

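5.2. labeled_partition

coalesced_group labeled_partition(const coalesced_group& g, int label);
template <unsigned int Size>
coalesced_group labeled_partition(const thread_block_tile<Size>& g, int label);

The labeled_partition method is a collective operation that partitions the parent group into one-dimensional subgroups within which the threads are coalesced. The implementation evaluates the label and assigns threads that have the same label value to the same group.

The implementation may cause the calling thread to wait until all members of the parent group have invoked the operation before resuming execution.

Note: This functionality is still being evaluated and may change slightly in the future.

Codegen requirements: Compute Capability 7.0 minimum, C++11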
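
As a minimal sketch of labeled_partition, the kernel below splits one 32-thread tile into groups by a per-thread label and records the size of the group each thread lands in. It assumes a device of Compute Capability 7.0 or higher and compilation for that architecture (for example with -arch=sm_70); the names bucket_sizes and run_bucket_sizes and the label pattern rank % 3 are illustrative assumptions.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Each thread joins the group of tile threads that carry the same label
// and writes the size of that group.
__global__ void bucket_sizes(const int* labels, int* group_size) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    const int rank = static_cast<int>(block.thread_rank());
    // Collective call: every thread of the tile must reach it.
    cg::coalesced_group bucket = cg::labeled_partition(tile, labels[rank]);
    group_size[rank] = static_cast<int>(bucket.size());
}

// Hypothetical host helper: labels the 32 threads with rank % 3 and returns
// the group size seen by the given thread, or -1 on error.
int run_bucket_sizes(int thread) {
    if (thread < 0 || thread >= 32) return -1;
    int h_labels[32];
    for (int i = 0; i < 32; ++i) h_labels[i] = i % 3;

    int *d_labels = nullptr, *d_sizes = nullptr;
    if (cudaMalloc(&d_labels, sizeof(h_labels)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_sizes, sizeof(h_labels)) != cudaSuccess) { cudaFree(d_labels); return -1; }
    cudaMemcpy(d_labels, h_labels, sizeof(h_labels), cudaMemcpyHostToDevice);

    bucket_sizes<<<1, 32>>>(d_labels, d_sizes);

    int h_sizes[32];
    cudaError_t err = cudaMemcpy(h_sizes, d_sizes, sizeof(h_sizes), cudaMemcpyDeviceToHost);
    cudaFree(d_labels);
    cudaFree(d_sizes);
    return err == cudaSuccess ? h_sizes[thread] : -1;
}
```

With labels 0, 1, 2 repeating across 32 threads, labels 0 and 1 each occur 11 times and label 2 occurs 10 times, so the group sizes differ even though all groups come from the same tile.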

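5.3. binary_partition

coalesced_group binary_partition(const coalesced_group& g, bool pred);
template <unsigned int Size>
coalesced_group binary_partition(const thread_block_tile<Size>& g, bool pred);

The binary_partition() method is a collective operation that partitions the parent group into one-dimensional subgroups within which the threads are coalesced. The implementation evaluates the predicate and assigns threads that have the same value to the same group. It is a specialized form of labeled_partition() in which the label can only be 0 or 1.

The implementation may cause the calling thread to wait until all members of the parent group have invoked the operation before resuming execution.

Note: This functionality is still being evaluated and may change slightly in the future.

Codegen requirements: Compute Capability 7.0 minimum, C++11

Example:

/// This example divides a 32-sized tile into a group with odd
/// numbers and a group with even numbers
__global__ void oddEven(int *inputArr) {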
    cg::thread_block cta = cg::this_thread_block();
    cg::thread_block_tile<32> tile32 = cg::tiled_partition<32>(cta);

    // inputArr contains random integers
    int elem = inputArr[cta.thread_rank()];
    // after this, tile32 is split into 2 groups,
    // a subtile where elem&1 is true and one where its false
    auto subtile = cg::binary_partition(tile32, (elem & 1));
}

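6. Group Collectives

6.1. Synchronization

6.1.1. sync

cooperative_groups::sync(T& group);

sync synchronizes the threads named in the group. T can be any of the existing group types, since all of them support synchronization. If the group is a grid_group or a multi_grid_group, the kernel must have been launched with the appropriate cooperative launch API.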

6.2. Data Transfer

6.2.1. memcpy_async

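memcpy_async is a group-wide collective memcpy that uses hardware-accelerated support for non-blocking memory transactions from global to shared memory. Given the set of threads named in the group, memcpy_async moves the specified number of bytes, or elements of the input type, through a single pipeline stage. For best performance with the memcpy_async API, both shared memory and global memory should be 16-byte aligned. Note that although this is a memcpy in the general case, it is only asynchronous if the source is global memory, the destination is shared memory, and both can be addressed with 16-, 8-, or 4-byte alignment. Asynchronously copied data should only be read after a call to wait or wait_prior, which signals that the corresponding stage has finished moving data to shared memory.

Having to wait on all outstanding requests costs some flexibility (but gains simplicity). To overlap data transfer and execution efficiently, it is important to be able to kick off memcpy_async request N+1 while waiting on and operating on request N. To do so, use memcpy_async and wait on it with the collective, stage-based wait_prior API. See wait and wait_prior for details.

Usage 1:

template <typename TyGroup, typename TyElem, typename TyShape>
void memcpy_async(
  const TyGroup &group,
  TyElem *__restrict__ _dst,
  const TyElem *__restrict__ _src,
  const TyShape &shape
);

Performs a copy of shape bytes.

Usage 2:

template <typename TyGroup, typename TyElem, typename TyDstLayout, typename TySrcLayout>
void memcpy_async(
  const TyGroup &group,
  TyElem *__restrict__ dst,
  const TyDstLayout &dstLayout,
  const TyElem *__restrict__ src,
  const TySrcLayout &srcLayout
);

Performs a copy of min(dstLayout, srcLayout) elements. If the layouts are of type cuda::aligned_size_t<N>, both must specify the same alignment.

Errata

The memcpy_async API introduced in CUDA 11.1 that takes both src and dst layouts expects the layouts to be given in elements rather than bytes. The element type is inferred from TyElem and has size sizeof(TyElem). If a cuda::aligned_size_t<N> type is used as the layout, the number of elements specified times sizeof(TyElem) must be a multiple of N, and it is recommended to use std::byte or char as the element type.

If the specified shape or layout of the copy is of type cuda::aligned_size_t<N>, the alignment is guaranteed to be at least min(16, N). In that case, both the dst and src pointers need to be aligned to N bytes and the number of bytes copied needs to be a multiple of N.

Codegen requirements: Compute Capability 3.5 minimum, Compute Capability 8.0 for asynchronicity, C++11

The cooperative_groups/memcpy_async.h header needs to be included.

Example: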

/// This example streams elementsPerThreadBlock worth of data from global memory
/// into a limited sized shared memory (elementsInShared) block to operate on.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

__global__ void kernel(int* global_data) {
    cg::thread_block tb = cg::this_thread_block();
    const size_t elementsPerThreadBlock = 16 * 1024;
    const size_t elementsInShared = 128;
    __shared__ int local_smem[elementsInShared];

    size_t copy_count;
    size_t index = 0;
    while (index < elementsPerThreadBlock) {
        cg::memcpy_async(tb, local_smem, elementsInShared, global_data + index, elementsPerThreadBlock - index);
        copy_count = min(elementsInShared, elementsPerThreadBlock - index);
        cg::wait(tb);
        // Work with local_smem
        index += copy_count;
    }
}

6.2.2. wait and wait_prior

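template <typename TyGroup>
void wait(TyGroup & group);

template <unsigned int NumStages, typename TyGroup>
void wait_prior(TyGroup & group);

The wait and wait_prior collectives synchronize the designated threads and thread blocks until all outstanding memcpy_async requests (in the case of wait) or the first NumStages requests (in the case of wait_prior) have completed.

Codegen requirements: Compute Capability 3.5 minimum, Compute Capability 8.0 for asynchronicity, C++11

The cooperative_groups/memcpy_async.h header needs to be included.

Example: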

/// This example streams elementsPerThreadBlock worth of data from global memory
/// into a limited sized shared memory (elementsInShared) block to operate on in
/// multiple (two) stages. As stage N is kicked off, we can wait on and operate on stage N-1.
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

__global__ void kernel(int* global_data) {
    cg::thread_block tb = cg::this_thread_block();
    const size_t elementsPerThreadBlock = 16 * 1024 + 64;
    const size_t elementsInShared = 128;
    __align__(16) __shared__ int local_smem[2][elementsInShared];
    int stage = 0;
    // First kick off an extra request
    size_t copy_count = elementsInShared;
    size_t index = copy_count;
    cg::memcpy_async(tb, local_smem[stage], elementsInShared, global_data, elementsPerThreadBlock - index);
    while (index < elementsPerThreadBlock) {
        // Now we kick off the next request...
        cg::memcpy_async(tb, local_smem[stage ^ 1], elementsInShared, global_data + index, elementsPerThreadBlock - index);
        // ... but we wait on the one before it
        cg::wait_prior<1>(tb);

        // Its now available and we can work with local_smem[stage] here
        // (...)
        //

        // Calculate the amount of data that was actually copied, for the next iteration.
        copy_count = min(elementsInShared, elementsPerThreadBlock - index);
        index += copy_count;

        // A cg::sync(tb) might be needed here depending on whether
        // the work done with local_smem[stage] can release threads to race ahead or not
        // Wrap to the next stage
        stage ^= 1;
    }
    cg::wait(tb);
    // The last local_smem[stage] can be handled here
}

6.3. Data manipulation

6.3.1. reduce

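template <typename TyArg, typename TyOp, typename TyGroup>
auto reduce(const TyGroup& group, TyArg&& val, TyOp&& op) -> decltype(op(val, val));

reduce performs a reduction operation on the data provided by each thread named in the group passed in. It takes advantage of hardware acceleration (on devices with Compute Capability 8.0 and higher) for the arithmetic add, min, and max operations and the logical AND, OR, and XOR, and provides a software fallback on older-generation hardware. Only 4-byte types are hardware accelerated.

group: Valid group types are coalesced_group and thread_block_tile.

val: Any type that satisfies the following requirements:

The type is trivially copyable, i.e., is_trivially_copyable<TyArg>::value == true
sizeof(TyArg) <= 32
Has suitable arithmetic or comparison operators for the given function object.

op: Valid function objects that provide hardware acceleration with integral types are plus(), less(), greater(), bit_and(), bit_xor(), and bit_or(). These must be constructed, hence the TyVal template argument is required, e.g., plus<int>(). reduce also supports lambdas and other function objects that can be invoked with operator().

Codegen requirements: Compute Capability 3.5 minimum, Compute Capability 8.0 for hardware acceleration, C++11.

The cooperative_groups/reduce.h header needs to be included.

Example: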

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg=cooperative_groups;

/// The following example accepts input in *A and outputs a result into *sum
/// It spreads the data within the block, one element per thread
#define blocksz 256
__global__ void block_reduce(const int *A, int *sum) {
    __shared__ int reduction_s[blocksz];

    cg::thread_block cta = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(cta);

    const int tid = cta.thread_rank();
    int beta = A[tid];
    // reduce across the tile
    // cg::plus allows cg::reduce() to know it can use hardware acceleration for addition
    reduction_s[tid] = cg::reduce(tile, beta, cg::plus<int>());
    // synchronize the block so all data is ready
    cg::sync(cta);
    // single leader accumulates the result
    if (cta.thread_rank() == 0) {
        beta = 0;
        for (int i = 0; i < blocksz; i += tile.num_threads()) {
            beta += reduction_s[i];
        }
        sum[blockIdx.x] = beta;
    }
}

6.3.2. Reduce Operators

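Below are the prototypes of function objects for some of the basic operations that can be performed with reduce:

namespace cooperative_groups {
  template <typename Ty>
  struct cg::plus;

  template <typename Ty>
  struct cg::less;

  template <typename Ty>
  struct cg::greater;

  template <typename Ty>
  struct cg::bit_and;

  template <typename Ty>
  struct cg::bit_xor;

  template <typename Ty>
  struct cg::bit_or;
}

reduce is limited to the information available to the implementation at compile time. Thus, in order to make use of the intrinsics introduced in CC 8.0, the cg:: namespace exposes several function objects that mirror the hardware. These objects look similar to those in the C++ STL, with the exception of less/greater. The reason for any difference from the STL is that these function objects are designed to mirror the operation of the hardware intrinsics.

Functional description:

cg::plus: Accepts two values and returns the sum of both using operator+.
cg::less: Accepts two values and returns the lesser using operator<. This differs in that the lower value is returned rather than a Boolean.
cg::greater: Accepts two values and returns the greater using operator<. This differs in that the greater value is returned rather than a Boolean.
cg::bit_and: Accepts two values and returns the result of operator&.
cg::bit_xor: Accepts two values and returns the result of operator^.
cg::bit_or: Accepts two values and returns the result of operator|.

Example: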

{
    // cg::plus<int> is specialized within cg::reduce and calls __reduce_add_sync(...) on CC 8.0+
    cg::reduce(tile, (int)val, cg::plus<int>());

    // cg::plus<float> fails to match with an accelerator and instead performs a standard shuffle based reduction
    cg::reduce(tile, (float)val, cg::plus<float>());

    // While individual components of a vector are supported, reduce will not use hardware intrinsics for the following
    // It will also be necessary to define a corresponding operator for vector and any custom types that may be used
    int4 vec = {...};
    cg::reduce(tile, vec, cg::plus<int4>());

    // Finally lambdas and other function objects cannot be inspected for dispatch
    // and will instead perform shuffle based reductions using the provided function object.
    cg::reduce(tile, (int)val, [](int l, int r) -> int {return l + r;});
}

6.3.3. inclusive_scan and exclusive_scan

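template <typename TyGroup, typename TyVal, typename TyFn>
auto inclusive_scan(const TyGroup& group, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template <typename TyGroup, typename TyVal>
TyVal inclusive_scan(const TyGroup& group, TyVal&& val);

template <typename TyGroup, typename TyVal, typename TyFn>
auto exclusive_scan(const TyGroup& group, TyVal&& val, TyFn&& op) -> decltype(op(val, val));

template <typename TyGroup, typename TyVal>
TyVal exclusive_scan(const TyGroup& group, TyVal&& val);

inclusive_scan and exclusive_scan perform a scan operation on the data provided by each thread named in the group passed in. For exclusive_scan, the result in each thread is a reduction of the data from threads with a lower thread_rank than that thread. The result of inclusive_scan also includes the calling thread's data in the reduction.

group: Valid group types are coalesced_group and thread_block_tile.

val: Any type that satisfies the following requirements:

The type is trivially copyable, i.e., is_trivially_copyable<TyVal>::value == true
sizeof(TyVal) <= 32
Has suitable arithmetic or comparison operators for the given function object.

op: The function objects defined for convenience are plus(), less(), greater(), bit_and(), bit_xor(), and bit_or(), as described in Reduce Operators. These must be constructed, hence the TyVal template argument is required, e.g., plus<int>(). inclusive_scan and exclusive_scan also support lambdas and other function objects that can be invoked with operator().

Codegen requirements: Compute Capability 3.5 minimum, C++11.

The cooperative_groups/scan.h header needs to be included.

Example: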

#include <stdio.h>
#include <cooperative_groups.h>
#include <cooperative_groups/scan.h>
namespace cg = cooperative_groups;

__global__ void kernel() {
    auto thread_block = cg::this_thread_block();
    auto tile = cg::tiled_partition<8>(thread_block);
    unsigned int val = cg::inclusive_scan(tile, tile.thread_rank());
    printf("%u: %u\n", tile.thread_rank(), val);
}

/*  prints for each group:
    0: 0
    1: 1
    2: 3
    3: 6
    4: 10
    5: 15
    6: 21
    7: 28
*/

Example of dynamic buffer space allocation using exclusive_scan:

#include <cooperative_groups.h>
#include <cooperative_groups/scan.h>
namespace cg = cooperative_groups;

// Buffer partitioning is static to make the example easier to follow,
// but any arbitrary dynamic allocation scheme can be implemented by replacing this function.
__device__ int calculate_buffer_space_needed(cg::thread_block_tile<32>& tile) {
    return tile.thread_rank() % 2 + 1;
}

__device__ int my_thread_data(int i) {
    return i;
}

__global__ void kernel() {
    __shared__ int buffer_used;
    extern __shared__ int buffer[];
    auto thread_block = cg::this_thread_block();
    auto tile = cg::tiled_partition<32>(thread_block);

    buffer_used = 0;
    thread_block.sync();

    // each thread calculates buffer size it needs and its offset within the allocation
    int buf_needed = calculate_buffer_space_needed(tile);
    int buf_offset = cg::exclusive_scan(tile, buf_needed);

    // last thread in the tile allocates buffer space with an atomic operation
    int alloc_offset = 0;
    if (tile.thread_rank() == tile.num_threads() - 1) {
        alloc_offset = atomicAdd(&buffer_used, buf_offset + buf_needed);
    }
    // that thread shares the allocation start with other threads in the tile
    alloc_offset = tile.shfl(alloc_offset, tile.num_threads() - 1);
    buf_offset += alloc_offset;

    // each thread fill its part of the buffer with thread specific data
    for (int i = 0 ; i < buf_needed ; ++i) {
        buffer[buf_offset + i] = my_thread_data(i);
    }

    // buffer is {0, 0, 1, 0, 0, 1 ...};
}

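7. Grid Synchronization

Prior to the introduction of Cooperative Groups, the CUDA programming model only allowed synchronization between thread blocks at a kernel completion boundary. The kernel boundary carries with it an implicit invalidation of state, and with it, potential performance implications.

For example, in certain use cases, applications have a large number of small kernels, each representing a stage in a processing pipeline. The current CUDA programming model requires these kernels to ensure that the thread blocks operating on one pipeline stage have produced their data before the thread blocks operating on the next stage are ready to consume it. In such cases, the ability to synchronize globally across thread blocks would allow the application to be restructured around persistent thread blocks that synchronize on the device when a given stage is complete.

To synchronize across the grid from within a kernel, simply use the grid.sync() function:

grid_group grid = this_grid();
grid.sync();

And when launching the kernel, it is necessary to use the cudaLaunchCooperativeKernel CUDA runtime launch API, or its CUDA driver equivalent, instead of the <<<...>>> execution configuration syntax.

Example:

To guarantee co-residency of the thread blocks on the GPU, the number of blocks launched needs to be considered carefully. For example, as many blocks as there are SMs can be launched as follows: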

int dev = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
// initialize, then launch
cudaLaunchCooperativeKernel((void*)my_kernel, deviceProp.multiProcessorCount, numThreads, args);

Alternatively, you can maximize the exposed parallelism by using the occupancy calculator to compute how many blocks can reside simultaneously on each SM, as follows:

/// This will launch a grid that can maximally fill the GPU, on the default stream with kernel arguments
int numBlocksPerSm = 0;
 // Number of threads my_kernel will be launched with
int numThreads = 128;
int dev = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, dev);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, my_kernel, numThreads, 0);
// launch
void *kernelArgs[] = { /* add kernel args */ };
dim3 dimBlock(numThreads, 1, 1);
dim3 dimGrid(deviceProp.multiProcessorCount*numBlocksPerSm, 1, 1);
cudaLaunchCooperativeKernel((void*)my_kernel, dimGrid, dimBlock, kernelArgs);

It is good practice to first make sure the device supports cooperative launches by querying the device attribute cudaDevAttrCooperativeLaunch:

int dev = 0;
int supportsCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);

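This sets supportsCoopLaunch to 1 if the property is supported on device 0. Only devices with Compute Capability 6.0 and higher are supported. In addition, you need to be running on one of the following:

A Linux platform without MPS
A Linux platform with MPS, on a device with Compute Capability 7.0 or higher
The latest Windows platform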
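
The pieces above (attribute query, occupancy-based grid sizing, cooperative launch, grid.sync()) can be combined into one small program. The following is a minimal sketch under stated assumptions: the kernel two_phase, the helper run_grid_sync_demo, and the mirror-read access pattern are invented for illustration, and toolkits older than CUDA 11.0 would additionally need -rdc=true for grid.sync().

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Phase 1 writes data[i] = i. After a grid-wide barrier, phase 2 reads the
// mirrored element, which was in general written by a different block.
__global__ void two_phase(int* data, int* out, int n) {
    cg::grid_group grid = cg::this_grid();
    const long long tid = static_cast<long long>(grid.thread_rank());
    const long long stride = static_cast<long long>(grid.size());

    for (long long i = tid; i < n; i += stride) data[i] = static_cast<int>(i);
    grid.sync();  // all phase-1 writes are visible past this point
    for (long long i = tid; i < n; i += stride) out[i] = data[n - 1 - i];
}

// Hypothetical host helper. Returns 1 if the result is correct, 0 if it is
// wrong, and -1 if cooperative launch is unsupported or a CUDA call failed.
int run_grid_sync_demo(int n) {
    int dev = 0, supportsCoopLaunch = 0;
    cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
    if (!supportsCoopLaunch) return -1;

    // Size the grid so that all blocks can be resident at the same time.
    int numThreads = 128, numBlocksPerSm = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, two_phase, numThreads, 0);
    dim3 dimBlock(numThreads, 1, 1);
    dim3 dimGrid(deviceProp.multiProcessorCount * numBlocksPerSm, 1, 1);

    int *d_data = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) { cudaFree(d_data); return -1; }

    void* kernelArgs[] = { &d_data, &d_out, &n };
    cudaError_t err = cudaLaunchCooperativeKernel((void*)two_phase, dimGrid, dimBlock,
                                                  kernelArgs, 0, 0);
    std::vector<int> h_out(n, -1);
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_data);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    for (int i = 0; i < n; ++i) {
        if (h_out[i] != n - 1 - i) return 0;
    }
    return 1;
}
```

Without the grid.sync() between the two loops, phase 2 could read elements that another block has not written yet; splitting the work into two kernel launches would be the traditional alternative.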

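8. Multi-Device Synchronization

To enable synchronization across multiple devices with Cooperative Groups, use the cudaLaunchCooperativeKernelMultiDevice CUDA API. This is a significant departure from existing CUDA APIs: it allows a single host thread to launch a kernel across multiple devices. In addition to the constraints and guarantees made by cudaLaunchCooperativeKernel, this API has additional semantics:

This API ensures that a launch is atomic, i.e., if the API call succeeds, the provided number of thread blocks will launch on all specified devices.
The functions launched via this API must be identical. The driver performs no explicit checks in this regard because doing so is largely infeasible. It is up to the application to ensure this.
No two entries in the provided cudaLaunchParams may map to the same device.
All devices targeted by the launch must be of the same Compute Capability, both major and minor versions.
The block size, grid size, and amount of shared memory per grid must be the same across all devices. Note that this means the maximum number of blocks that can be launched per device is limited by the device with the fewest SMs.
Any user-defined __device__, __constant__, or __managed__ device global variables present in the module that owns the CUfunction being launched are instantiated independently on every device. The user is responsible for initializing such device global variables appropriately.

Deprecation notice: cudaLaunchCooperativeKernelMultiDevice has been deprecated in CUDA 11.3 for all devices. An example of an alternative approach can be found in the multi-device conjugate gradient sample.

Optimal performance in multi-device synchronization is achieved by enabling peer access via cuCtxEnablePeerAccess or cudaDeviceEnablePeerAccess for all participating devices.

The launch parameters should be defined using an array of structs (one per device) and launched with cudaLaunchCooperativeKernelMultiDevice.

Example: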

int numGpus = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceCount(&numGpus);

// Per device launch parameters
cudaLaunchParams *launchParamsList = (cudaLaunchParams*)malloc(sizeof(cudaLaunchParams) * numGpus);
cudaStream_t *streams = (cudaStream_t*)malloc(sizeof(cudaStream_t) * numGpus);

// The kernel arguments are copied over during launch
// It's also possible to have individual copies of kernel arguments per device, but
// the signature and name of the function/kernel must be the same.
void *kernelArgs[] = { /* Add kernel arguments */ };

for (int i = 0; i < numGpus; i++) {
    cudaSetDevice(i);
    // Per device stream, but it's also possible to use the default NULL stream of each device
    cudaStreamCreate(&streams[i]);
    // Loop over other devices and cudaDeviceEnablePeerAccess to get a faster barrier implementation
}
// Since all devices must be of the same compute capability and have the same launch configuration
// it is sufficient to query device 0 here
cudaGetDeviceProperties(&deviceProp, 0);
dim3 dimBlock(numThreads, 1, 1);
dim3 dimGrid(deviceProp.multiProcessorCount, 1, 1);
for (int i = 0; i < numGpus; i++) {
    launchParamsList[i].func = (void*)my_kernel;
    launchParamsList[i].gridDim = dimGrid;
    launchParamsList[i].blockDim = dimBlock;
    launchParamsList[i].sharedMem = 0;
    launchParamsList[i].stream = streams[i];
    launchParamsList[i].args = kernelArgs;
}
cudaLaunchCooperativeKernelMultiDevice(launchParamsList, numGpus);

Also, as with grid-wide synchronization, the resulting device code looks very similar:

multi_grid_group multi_grid = this_multi_grid();
multi_grid.sync();

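However, the code needs to be compiled with separate compilation by passing -rdc=true to nvcc.

It is good practice to first make sure the device supports multi-device cooperative launches by querying the device attribute cudaDevAttrCooperativeMultiDeviceLaunch: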

int dev = 0;
int supportsMdCoopLaunch = 0;
cudaDeviceGetAttribute(&supportsMdCoopLaunch, cudaDevAttrCooperativeMultiDeviceLaunch, dev);

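This sets supportsMdCoopLaunch to 1 if device 0 supports the property. Only devices with compute capability 6.0 and higher are supported. In addition, you need to be running on Linux (without MPS), or on a current version of Windows with the device in TCC mode.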
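The query above covers only device 0. Since a multi-device launch involves every participating GPU, one reasonable extension is to check all of them before attempting the launch. The sketch below does that; the function name allDevicesSupportMdCoopLaunch is hypothetical, and it assumes a toolkit version in which the cudaDevAttrCooperativeMultiDeviceLaunch attribute is still defined.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the original article): returns true only if
// at least one device is visible and every visible device reports support
// for multi-device cooperative launch.
bool allDevicesSupportMdCoopLaunch() {
    int numGpus = 0;
    if (cudaGetDeviceCount(&numGpus) != cudaSuccess || numGpus == 0) {
        return false;
    }
    for (int dev = 0; dev < numGpus; dev++) {
        int supported = 0;
        cudaError_t err = cudaDeviceGetAttribute(
            &supported, cudaDevAttrCooperativeMultiDeviceLaunch, dev);
        if (err != cudaSuccess || !supported) {
            return false;
        }
    }
    return true;
}
```

A caller could use the result to choose between the multi-device launch path and a fallback such as independent per-device launches.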

See the cudaLaunchCooperativeKernelMultiDevice API documentation for more information.

