图波列夫

OneFlow 中的 Softmax

Softmax 是深度学习模型中的常见算子。PyTorch 的 Softmax 算子直接调用 cuDNN 的接口。而 OneFlow 内部针对输入数据的类别数量，采用3个 kernel 来分别处理，在多数情况下都可以获得比 cuDNN 更优的性能表现。测试结果可见如何实现一个高效的Softmax CUDA kernel？——OneFlow 性能优化分享。下面对其实现进行介绍。OneFlow 的静态分层结构如下图所示：

C++

python

Functor

Kernel

Primitive

Functional

Module

softmax

oneflow.nn.functional.softmax 直接调用了 C++的实现。


# ref https://github.com/pytorch/pytorch/blob/master/torch/nn/functional.py
def softmax(input: Tensor, dim: Optional[int] = None, dtype=None) -> Tensor:
    r"""Applies a softmax function.
    Softmax is defined as:
    :math:`\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}`
    It is applied to all slices along dim, and will re-scale them so that the elements
    lie in the range `[0, 1]` and sum to 1.

    See :class:`~oneflow.nn.Softmax` for more details.

    Args:
        input (Tensor): input
        dim (int): A dimension along which softmax will be computed.
        dtype (:class:`oneflow.dtype`, optional): the desired data type of returned tensor.
            If specified, the input tensor is casted to :attr:`dtype` before the operation
            is performed. This is useful for preventing data type overflows. Default: None.

    .. note::
        This function doesn't work directly with NLLLoss,
        which expects the Log to be computed between the Softmax and itself.
        Use log_softmax instead (it's faster and has better numerical properties).
    """
    if dtype is None:
        ret = flow._C.softmax(input, dim)
    else:
        ret = flow._C.softmax(input.to(dtype), dim)
    return ret

在 OneFlow 系统中存在两类算子（op）：系统 op 和 user op。OneFlow user op 的定义及 kernel 实现分别在 oneflow/user/ops 和 oneflow/user/kernels 目录下。

OneFlow_SoftmaxOp

def OneFlow_SoftmaxOp : OneFlow_BaseOp<"softmax", [NoMemoryEffect, DeclareOpInterfaceMethods<UserOpCompatibleInterface>]> {
  let input = (ins
    OneFlow_Tensor:$in
  );
  let output = (outs
    OneFlow_Tensor:$out
  );
  let has_logical_tensor_desc_infer_fn = 1;
  let has_physical_tensor_desc_infer_fn = 1;
  let has_get_sbp_fn = 1;
  let has_data_type_infer_fn = 1;
  let has_compute_complexity_fn = 1;
}

Functor 层作为 OneFlow 的基础设施，为 python 端和 C++端提供了 op 操作的统一入口。各种 op 在 Functor 层需要完成对输入tensor 的 shape、dtype、维度、元素个数等各种 check，以及对 op 特有的逻辑进行解析和处理。

oneflow/core/functional/impl/activation_functor.cpp 文件中 ONEFLOW_FUNCTION_LIBRARY 将 SoftmaxFunctor 注册为 Softmax 的实现。

SoftmaxFunctor

SoftmaxFunctorBase

LogSoftmaxFunctor

OpBuilder 用于构建 UserOp。

class SoftmaxFunctor : public SoftmaxFunctorBase {
 public:
  SoftmaxFunctor() {
    op_ = CHECK_JUST(one::OpBuilder("softmax").Input("in").Output("out").Build());
  }
};

SoftmaxFunctorBase

Tensor::shape 返回输入的 Shape。
Shape::NumAxes 调用 oneflow::Shape::NumAxes 返回维数。


class SoftmaxFunctorBase {
 public:
  Maybe<Tensor> operator()(const std::shared_ptr<one::Tensor>& input,
                           const Optional<int64_t>& dim) const {
    const auto input_shape = input->shape();
    const int64_t num_axes = input_shape->NumAxes();

get_dim函数判断输入形状是否为2维。

    const auto get_dim = [num_axes]() -> int64_t {
      const int64_t ndim = num_axes;
      if (ndim == 0 || ndim == 1 || ndim == 3) {
        return 0;
      } else {
        return 1;
      }
    };

JUST 宏能够得到表达式。
如果dim为0直接赋值给dim_，否则调用get_dim函数。
maybe_wrap_dim 的处理是支持正负数的。
如果dim_不是最后一维，在其前后调用 Transpose 进行转换。
sequence_function 宏创建一个 SequenceFunction 对象。
TransposeFunctor 为 Transpose 算子的实现。
OpInterpUtil::Dispatch 调度算子到设备上进行计算。

    int64_t dim_ = dim ? JUST(dim) : get_dim();
    dim_ = JUST(maybe_wrap_dim(dim_, num_axes));
    if (dim_ != num_axes - 1) {
      std::vector<int> input_perm(input_shape->dim_vec().size(), 0);
      for (size_t i = 1; i < input_perm.size(); ++i) { input_perm[i] = i; }
      input_perm[dim_] = input_perm[input_perm.size() - 1];
      input_perm[input_perm.size() - 1] = dim_;

      return sequence_function(functional::Transpose)
          .then([&](const std::shared_ptr<one::Tensor>& x) {
            return OpInterpUtil::Dispatch<Tensor>(*op_, {x});
          })
          .then(std::bind(functional::Transpose, std::placeholders::_1, input_perm))
          .call(input, input_perm);
    }

    return OpInterpUtil::Dispatch<Tensor>(*op_, {input});
  }

包含一个 OpExpr 指针。后者维护了op_name、input arg、output arg 信息。

 protected:
  SoftmaxFunctorBase() = default;
  virtual ~SoftmaxFunctorBase() = default;

  std::shared_ptr<OpExpr> op_;
};

SoftmaxKernel

REGISTER_USER_KERNEL 可以注册 softmax kernel。

class SoftmaxKernel final : public user_op::OpKernel, public user_op::CudaGraphSupport {
 public:
  SoftmaxKernel() = default;
  ~SoftmaxKernel() override = default;

 private:
  using user_op::OpKernel::Compute;

SoftmaxKernel::Compute

SoftmaxImpl::Launch

UserKernelComputeContext::Tensor4ArgNameAndIndex 根据参数名和索引返回对应的 Tensor。
BlobTensorView::shape_view 得到 ShapeView。
ConstShapeMixIn::NumAxes 得到维数。
NewSoftmaxPrimitive 函数创建一个 Softmax 对象。
调用 SoftmaxImpl::Launch 函数。

  void Compute(user_op::KernelComputeContext* ctx) const override {
    const user_op::Tensor* in = ctx->Tensor4ArgNameAndIndex("in", 0);
    user_op::Tensor* out = ctx->Tensor4ArgNameAndIndex("out", 0);
    const ShapeView& in_shape = in->shape_view();
    const int64_t cols = in_shape.At(in_shape.NumAxes() - 1);
    const int64_t rows = in_shape.Count(0, in_shape.NumAxes() - 1);
    std::unique_ptr<ep::primitive::Softmax> primitive = NewSoftmaxPrimitive(ctx);
    CHECK(primitive);
    primitive->Launch(ctx->stream(), rows, cols, in->dptr(), out->mut_dptr());
  }
  bool AlwaysComputeWhenAllOutputsEmpty() const override { return false; }
};

NewSoftmaxPrimitive

NewPrimitive 使用 SoftmaxFactory 抽象工厂来创建。

template<typename Context>
std::unique_ptr<ep::primitive::Softmax> NewSoftmaxPrimitive(Context* ctx) {
  const DataType data_type = ctx->TensorDesc4ArgNameAndIndex("in", 0)->data_type();
  return ep::primitive::NewPrimitive<ep::primitive::SoftmaxFactory>(ctx->device_type(), data_type);
}

SoftmaxFactory

REGISTER_PRIMITIVE_FACTORY 宏会调用 REGISTER_CLASS 实例化一个 AutoRegistrationFactory::RawRegisterType 对象。
工厂的实现为 SoftmaxFactoryImpl，即 GenericSoftmaxFactoryImpl 模板类。GenericSoftmaxFactoryImpl 会调用 NewSoftmax 函数创建一个 SoftmaxImpl 对象。SoftmaxImpl 即 Softmax 的实现。

class SoftmaxFactory : public Factory<Softmax> {
 public:
  OF_DISALLOW_COPY_AND_MOVE(SoftmaxFactory);
  SoftmaxFactory() = default;
  ~SoftmaxFactory() override = default;

  virtual std::unique_ptr<Softmax> New(DataType data_type) = 0;
};

SoftmaxImpl

SoftmaxImpl::Launch

SoftmaxGpu

SoftmaxImpl::Launch 函数调用 SoftmaxGpu

template<typename SoftmaxBase, Algorithm algorithm, typename T>
class SoftmaxImpl : public SoftmaxBase {
 public:
  OF_DISALLOW_COPY_AND_MOVE(SoftmaxImpl);
  SoftmaxImpl() = default;
  ~SoftmaxImpl() override = default;

  void Launch(Stream* stream, size_t rows, size_t cols, const void* x, void* y) override {
    cudaStream_t cuda_stream = stream->As<CudaStream>()->cuda_stream();
    SoftmaxGpu<algorithm, T>(cuda_stream, rows, cols, reinterpret_cast<const T*>(x),
                             reinterpret_cast<T*>(y));
  }
};

SoftmaxGpu

DispatchSoftmax

DispatchLogSoftmax

DefaultComputeType 默认使用原有类型，对于half和bfloat16使用float类型。
DirectLoad 和 DirectStore 结构体封装了源和目的片外地址。DirectLoad 首先加载源数据到临时变量中，然后转换为片上计算类型。
如果计算类型和源数据类型一致，是否还有必要使用临时变量来存储呢？
根据算法选择调用 DispatchSoftmax 或者 DispatchLogSoftmax 函数。

template<Algorithm algorithm, typename T>
void SoftmaxGpu(cudaStream_t cuda_stream, size_t rows, size_t cols, const T* x, T* y) {
  using ComputeType = typename cuda::softmax::DefaultComputeType<T>::type;
  oneflow::cuda::softmax::DirectLoad<T, ComputeType> load(x, cols);
  oneflow::cuda::softmax::DirectStore<ComputeType, T> store(y, cols);
  if (algorithm == Algorithm::kSoftmax) {
    OF_CUDA_CHECK((cuda::softmax::DispatchSoftmax<decltype(load), decltype(store), ComputeType>(
        cuda_stream, load, store, rows, cols)));
  } else if (algorithm == Algorithm::kLogSoftmax) {
    OF_CUDA_CHECK((cuda::softmax::DispatchLogSoftmax<decltype(load), decltype(store), ComputeType>(
        cuda_stream, load, store, rows, cols)));
  } else {
    UNIMPLEMENTED();
  }
}

DirectLoad

Pack 是一个联合体。
一次性读取N个元素，然后逐个按照DST类型赋值到dst中。

template<typename SRC, typename DST>
struct DirectLoad {
  DirectLoad(const SRC* src, int64_t row_size) : src(src), row_size(row_size) {}
  template<int N>
  __device__ void load(DST* dst, int64_t row, int64_t col) const {
    Pack<SRC, N> pack;
    const int64_t offset = (row * row_size + col) / N;
    pack.storage = *(reinterpret_cast<const PackType<SRC, N>*>(src) + offset);
#pragma unroll
    for (int i = 0; i < N; ++i) { dst[i] = static_cast<DST>(pack.elem[i]); }
  }
  const SRC* src;
  int64_t row_size;
};

Pack

PackType 为 GetPackType

template<typename T, int N>
union Pack {
  static_assert(sizeof(PackType<T, N>) == sizeof(T) * N, "");
  __device__ Pack() {
    // do nothing
  }
  PackType<T, N> storage;
  T elem[N];
};

GetPackType

类std::aligned_storage对象构造完成时，即分配了长度为Len个字节的内存，且该内存满足大小为 Align 的对齐要求。

template<typename T, int N>
struct GetPackType {
  using type = typename std::aligned_storage<N * sizeof(T), N * sizeof(T)>::type;
};

DirectStore

template<typename SRC, typename DST>
struct DirectStore {
  DirectStore(DST* dst, int64_t row_size) : dst(dst), row_size(row_size) {}
  template<int N>
  __device__ void store(const SRC* src, int64_t row, int64_t col) {
    Pack<DST, N> pack;
    const int64_t offset = (row * row_size + col) / N;
#pragma unroll
    for (int i = 0; i < N; ++i) { pack.elem[i] = static_cast<DST>(src[i]); }
    *(reinterpret_cast<PackType<DST, N>*>(dst) + offset) = pack.storage;
  }
  DST* dst;
  int64_t row_size;
};

DispatchSoftmax

cols < 1024

1024 <= cols

fallback

DispatchSoftmax

DispatchSoftmaxWarpImpl

TryDispatchSoftmaxBlockSMemImpl

DispatchSoftmaxBlockUncachedImpl

DispatchSoftmaxWarpImplPackSize

DispatchSoftmaxWarpImplCols

DispatchSoftmaxWarpImplPadding

LaunchSoftmaxWarpImpl

SoftmaxWarpImpl

TryDispatchSoftmaxBlockSMemImplPackSize

TryDispatchSoftmaxBlockSMemImplBlockSize

LaunchSoftmaxBlockSMemImpl

SoftmaxBlockSMemImpl

DispatchSoftmaxBlockUncachedImplPackSize

LaunchSoftmaxBlockUncachedImpl

SoftmaxBlockUncachedImpl

计算类型不是double的实现：

列数小于1024，则调用 DispatchSoftmaxWarpImpl；
否则调用 TryDispatchSoftmaxBlockSMemImpl；
如果 share memory 版本失败则调用 DispatchSoftmaxBlockUncachedImpl。

3种实现首先根据cols的奇偶确定pack_size为1或2。
DispatchSoftmaxWarpImplCols 根据cols的大小确定cols_per_thread、thread_group_width和rows_per_access3个参数的取值：

类别数量较小时，Warp 内对线程进一步分组来处理，每组处理1或2行数据，每个线程处理pack_size个类别；
类别数量较大时，Warp 内所有线程处理1行数据，每个线程处理cols_per_thread个类别。

DispatchSoftmaxWarpImplPadding 根据cols是否能对齐 Warp 处理宽度确定是否需要对cols进行padding处理。
LaunchSoftmaxWarpImpl 中设置block_size为128，再根据不同硬件设置grid_size。

TryDispatchSoftmaxBlockSMemImplPackSize 调用 TryDispatchSoftmaxBlockSMemImplBlockSize 函数，cols为偶数时单次次处理两列，否则处理1列。
TryDispatchSoftmaxBlockSMemImplBlockSize 在给定硬件约束下确保 SM 可调度线程块数量最大，然后优选较大的block_size。

DispatchSoftmaxBlockUncachedImplPackSize 根据cols为奇数或者偶数调用 LaunchSoftmaxBlockUncachedImpl。

template<typename LOAD, typename STORE, typename ComputeType>
inline typename std::enable_if<!std::is_same<ComputeType, double>::value, cudaError_t>::type
DispatchSoftmax(cudaStream_t stream, LOAD load, STORE store, const int64_t rows,
                const int64_t cols) {
  if (cols <= 1024) {
    return DispatchSoftmaxWarpImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
        stream, load, store, rows, cols);
  } else {
    bool dispatch_smem_impl_success;
    {
      cudaError_t err =
          TryDispatchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
              stream, load, store, rows, cols, &dispatch_smem_impl_success);
      if (err != cudaSuccess) { return err; }
    }
    if (!dispatch_smem_impl_success) {
      return DispatchSoftmaxBlockUncachedImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
          stream, load, store, rows, cols);
    }
    return cudaSuccess;
  }
}

DispatchSoftmax

DispatchSoftmaxBlockUncachedImpl

计算类型为double，则直接调用 DispatchSoftmaxBlockUncachedImpl。

template<typename LOAD, typename STORE, typename ComputeType>
inline typename std::enable_if<std::is_same<ComputeType, double>::value, cudaError_t>::type
DispatchSoftmax(cudaStream_t stream, LOAD load, STORE store, const int64_t rows,
                const int64_t cols) {
  return DispatchSoftmaxBlockUncachedImpl<LOAD, STORE, ComputeType, Algorithm::kSoftmax>(
      stream, load, store, rows, cols);
}

DispatchSoftmaxWarpImplCols

DispatchSoftmaxWarpImplPadding

pack_size为1的版本。

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
typename std::enable_if<pack_size == 1, cudaError_t>::type DispatchSoftmaxWarpImplCols(
    cudaStream_t stream, LOAD load, STORE store, const int64_t rows, const int64_t cols) {
  if (cols <= 0) { return cudaErrorInvalidValue; }

DEFINE_ONE_ELIF宏尝试找到一个刚好能处理cols的thread_group_width值，然后以此调用 DispatchSoftmaxWarpImplPadding 函数。

#define DEFINE_ONE_ELIF(thread_group_width)                                                        \
  else if (cols <= (thread_group_width)*pack_size) {                                               \
    if (rows % 2 == 0) {                                                                           \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 2, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    } else {                                                                                       \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 1, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    }                                                                                              \
  }
  DEFINE_ONE_ELIF(1)
  DEFINE_ONE_ELIF(2)
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF

kWarpSize 为线程束大小。
如果超过了单个 Warp 的处理能力，则找到刚好可以处理cols时每个 Warp 内的列数，然后仍然调用 DispatchSoftmaxWarpImplPadding 函数。

#define DEFINE_ONE_ELIF(col)                                                                      \
  else if (cols <= (col)*kWarpSize) {                                                             \
    return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, col, kWarpSize, 1, \
                                          algorithm>(stream, load, store, rows, cols);            \
  }
  DEFINE_ONE_ELIF(2)
  DEFINE_ONE_ELIF(3)
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(5)
  DEFINE_ONE_ELIF(6)
  DEFINE_ONE_ELIF(7)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(9)
  DEFINE_ONE_ELIF(10)
  DEFINE_ONE_ELIF(11)
  DEFINE_ONE_ELIF(12)
  DEFINE_ONE_ELIF(13)
  DEFINE_ONE_ELIF(14)
  DEFINE_ONE_ELIF(15)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(17)
  DEFINE_ONE_ELIF(18)
  DEFINE_ONE_ELIF(19)
  DEFINE_ONE_ELIF(20)
  DEFINE_ONE_ELIF(21)
  DEFINE_ONE_ELIF(22)
  DEFINE_ONE_ELIF(23)
  DEFINE_ONE_ELIF(24)
  DEFINE_ONE_ELIF(25)
  DEFINE_ONE_ELIF(26)
  DEFINE_ONE_ELIF(27)
  DEFINE_ONE_ELIF(28)
  DEFINE_ONE_ELIF(29)
  DEFINE_ONE_ELIF(30)
  DEFINE_ONE_ELIF(31)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF
  else {
    return cudaErrorInvalidValue;
  }
}

LaunchSoftmaxWarpImpl

GetNumBlocks

SoftmaxWarpImpl

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int cols_per_thread,
         int thread_group_width, int rows_per_access, bool padding, Algorithm algorithm>
inline cudaError_t LaunchSoftmaxWarpImpl(cudaStream_t stream, LOAD load, STORE store,
                                         const int64_t rows, const int64_t cols) {

block_dim是固定的128大小。可以参考如何设置CUDA Kernel中的grid_size和block_size？中的介绍。
waves是期望的作业次数。
thread_group_width为代表处理元素的线程组的宽度，是 kWarpSize 的因数。
thread_groups_per_block为 block 内部划分的线程组数量。
num_blocks为任务支持的最大分块数。
rows_per_access是单次处理的 batch 数。

  constexpr int block_size = 128;
  constexpr int waves = 32;
  static_assert(block_size % thread_group_width == 0, "");
  constexpr int thread_groups_per_block = block_size / thread_group_width;
  dim3 block_dim(thread_group_width, thread_groups_per_block);
  const int64_t num_blocks =
      (rows / rows_per_access + thread_groups_per_block - 1) / thread_groups_per_block;

GetNumBlocks 函数查询设备的 SM 数量以及每个 SM 支持的最大线程数计算 block 数量。

  int grid_dim_x;
  {
    cudaError_t err = GetNumBlocks(block_size, num_blocks, waves, &grid_dim_x);
    if (err != cudaSuccess) { return err; }
  }

Grid 是一维的，Block 是两维的。
启动 SoftmaxWarpImpl 在 Warp 内完成一行的计算。

  SoftmaxWarpImpl<LOAD, STORE, ComputeType, pack_size, cols_per_thread, thread_group_width,
                  rows_per_access, padding, algorithm>
      <<<grid_dim_x, block_dim, 0, stream>>>(load, store, rows, cols);
  return cudaPeekAtLastError();
}

GetNumBlocks

cudaGetDevice 返回当前正在使用的设备。

inline cudaError_t GetNumBlocks(int64_t block_size, int64_t max_blocks, int64_t waves,
                                int* num_blocks) {
  int dev;
  {
    cudaError_t err = cudaGetDevice(&dev);
    if (err != cudaSuccess) { return err; }
  }

cudaDeviceGetAttribute 返回有关设备的信息。
得到设备上的多处理器数量和每个多处理器的最大常驻线程数。

  int sm_count;
  {
    cudaError_t err = cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, dev);
    if (err != cudaSuccess) { return err; }
  }
  int tpm;
  {
    cudaError_t err = cudaDeviceGetAttribute(&tpm, cudaDevAttrMaxThreadsPerMultiProcessor, dev);
    if (err != cudaSuccess) { return err; }
  }

根据整个 GPU 上的最大常驻线程数计算出 block 块数。

  *num_blocks =
      std::max<int>(1, std::min<int64_t>(max_blocks, sm_count * tpm / block_size * waves));
  return cudaSuccess;
}

SoftmaxWarpImpl

WarpAllReduce

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int cols_per_thread,
         int thread_group_width, int rows_per_access, bool padding, Algorithm algorithm>
__global__ void SoftmaxWarpImpl(LOAD load, STORE store, const int64_t rows, const int64_t cols) {
  static_assert(cols_per_thread % pack_size == 0, "");
  static_assert(thread_group_width <= kWarpSize, "");
  static_assert(kWarpSize % thread_group_width == 0, "");

num_packs是每个线程需要处理的数据包的个数。
rows_per_access是单次处理的 batch 数。
buf用于存储输入 $x$ 以及分子项 $e^{x_i}$ 、 $e^{x_i -\alpha}$ 。
blockIdx.x为 block 在行方向上的索引。blockDim.y为 Block 内划分的线程组数量。global_thread_group_id为全局线程组索引，在 Block 内编号是连续的。
num_global_thread_group为全局线程组数量。
lane_id为线程束内的线程 id。
step是 GPU 所有线程单次可处理的 batch 数。令该数值尽可能大，从而利于合并访存。

  constexpr int num_packs = cols_per_thread / pack_size;
  assert(cols <= cols_per_thread * thread_group_width);
  ComputeType buf[rows_per_access][cols_per_thread];
  const int global_thread_group_id = blockIdx.x * blockDim.y + threadIdx.y;
  const int num_global_thread_group = gridDim.x * blockDim.y;
  const int lane_id = threadIdx.x;
  const int64_t step = num_global_thread_group * rows_per_access;

thread_max用于存储最大类别概率。首先初始化为 -inf。
Inf 能够返回不同类型的 inf 值。
col为当前需要处理的列。
如果不需要padding或者col在cols范围内，则调用 DirectLoad::load 加载输入数据。求出线程负责列的最大值。
否则，将row_buf设置为-inf。

  for (int64_t row = global_thread_group_id * rows_per_access; row < rows; row += step) {
    ComputeType thread_max[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      thread_max[row_id] = -Inf<ComputeType>();
      ComputeType* row_buf = buf[row_id];
#pragma unroll
      for (int pack_id = 0; pack_id < num_packs; ++pack_id) {
        const int pack_offset = pack_id * pack_size;
        const int col = (pack_id * thread_group_width + lane_id) * pack_size;
        if (!padding || col < cols) {
          load.template load<pack_size>(row_buf + pack_offset, row + row_id, col);
#pragma unroll
          for (int i = 0; i < pack_size; ++i) {
            thread_max[row_id] = max(thread_max[row_id], row_buf[pack_offset + i]);
          }
        } else {
#pragma unroll
          for (int i = 0; i < pack_size; ++i) { row_buf[pack_offset + i] = -Inf<ComputeType>(); }
        }
      }
    }

WarpAllReduce 函数调用 MaxOp 规约得到线程组内的最大值。
thread_group_width参数可以实现 Warp 内的线程组分组处理。

    ComputeType warp_max[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      warp_max[row_id] = WarpAllReduce<MaxOp, ComputeType, thread_group_width>(thread_max[row_id]);
    }

row_buf中保存 $e^{x_i -\alpha}$ 。
线程内求和，thread_sum保存线程内的 $\sum_j e^{x_j -\alpha}$ 。
可以通过从任何设备线程调用__trap()函数来启动 trap 操作。内核的执行被中止并在主机程序中引发中断。

    ComputeType thread_sum[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      thread_sum[row_id] = 0;
      ComputeType* row_buf = buf[row_id];
#pragma unroll
      for (int i = 0; i < cols_per_thread; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          row_buf[i] = Exp(row_buf[i] - warp_max[row_id]);
          thread_sum[row_id] += row_buf[i];
        } else if (algorithm == Algorithm::kLogSoftmax) {
          row_buf[i] -= warp_max[row_id];
          thread_sum[row_id] += Exp(row_buf[i]);
        } else {
          __trap();
        }
      }
    }

调用 WarpAllReduce 函数得到各行的warp_sum。

    ComputeType warp_sum[rows_per_access];
#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      warp_sum[row_id] = WarpAllReduce<SumOp, ComputeType, thread_group_width>(thread_sum[row_id]);
    }

Div 和 Log 有快速计算实现。
计算 $\text{Softmax}(x_i) = \frac{e^{x_i -\alpha}}{\sum_j e^{x_j -\alpha}}$ 或者 $\text{LogSoftmax}(x_{i}) = \log\left(\frac{e^{x_i-\alpha} }{ \sum_j e^{x_j-\alpha}} \right) = x_i-\alpha - \log({ \sum_j e^{x_j-\alpha}})$
DirectStore::store 保存结果。

#pragma unroll
    for (int row_id = 0; row_id < rows_per_access; ++row_id) {
      ComputeType* row_buf = buf[row_id];
#pragma unroll
      for (int i = 0; i < cols_per_thread; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          row_buf[i] = Div(row_buf[i], warp_sum[row_id]);
        } else if (algorithm == Algorithm::kLogSoftmax) {
          row_buf[i] -= Log(warp_sum[row_id]);
        } else {
          __trap();
        }
      }
#pragma unroll
      for (int i = 0; i < num_packs; ++i) {
        const int col = (i * thread_group_width + lane_id) * pack_size;
        if (!padding || col < cols) {
          store.template store<pack_size>(row_buf + i * pack_size, row + row_id, col);
        }
      }
    }
  }
}

WarpAllReduce

__shfl_xor_sync基于自身通道 ID 的按位异或从通道复制。
不断减半mask，实现蝶型归约。实现为下图的逆序过程。

template<template<typename> class ReductionOp, typename T, int thread_group_width = kWarpSize>
__inline__ __device__ T WarpAllReduce(T val) {
  for (int mask = thread_group_width / 2; mask > 0; mask /= 2) {
    val = ReductionOp<T>()(val, __shfl_xor_sync(0xffffffff, val, mask));
  }
  return val;
}

DispatchSoftmaxWarpImplCols

DispatchSoftmaxWarpImplPadding

pack_size为2的版本。

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
typename std::enable_if<pack_size == 2, cudaError_t>::type DispatchSoftmaxWarpImplCols(
    cudaStream_t stream, LOAD load, STORE store, const int64_t rows, const int64_t cols) {
  if (cols <= 0) { return cudaErrorInvalidValue; }

尝试找到一个刚好能处理cols的thread_group_width值，然后以此调用 DispatchSoftmaxWarpImplPadding 函数。奇数处理1行，偶数处理两行。

#define DEFINE_ONE_ELIF(thread_group_width)                                                        \
  else if (cols <= (thread_group_width)*pack_size) {                                               \
    if (rows % 2 == 0) {                                                                           \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 2, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    } else {                                                                                       \
      return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, pack_size,        \
                                            thread_group_width, 1, algorithm>(stream, load, store, \
                                                                              rows, cols);         \
    }                                                                                              \
  }
  DEFINE_ONE_ELIF(1)
  DEFINE_ONE_ELIF(2)
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF

如果cols较大，Warp 内所有线程处理1行数据，每个线程处理cols_per_thread个类别。

#define DEFINE_ONE_ELIF(col)                                                                      \
  else if (cols <= (col)*kWarpSize) {                                                             \
    return DispatchSoftmaxWarpImplPadding<LOAD, STORE, ComputeType, pack_size, col, kWarpSize, 1, \
                                          algorithm>(stream, load, store, rows, cols);            \
  }
  DEFINE_ONE_ELIF(4)
  DEFINE_ONE_ELIF(6)
  DEFINE_ONE_ELIF(8)
  DEFINE_ONE_ELIF(10)
  DEFINE_ONE_ELIF(12)
  DEFINE_ONE_ELIF(14)
  DEFINE_ONE_ELIF(16)
  DEFINE_ONE_ELIF(18)
  DEFINE_ONE_ELIF(20)
  DEFINE_ONE_ELIF(22)
  DEFINE_ONE_ELIF(24)
  DEFINE_ONE_ELIF(26)
  DEFINE_ONE_ELIF(28)
  DEFINE_ONE_ELIF(30)
  DEFINE_ONE_ELIF(32)
#undef DEFINE_ONE_ELIF
  else {
    return cudaErrorInvalidValue;
  }
}

TryDispatchSoftmaxBlockSMemImplBlockSize

SoftmaxBlockSMemImpl

LaunchSoftmaxBlockSMemImpl

根据cols计算出需要的 Shared Memory 大小。

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
inline cudaError_t TryDispatchSoftmaxBlockSMemImplBlockSize(cudaStream_t stream, LOAD load,
                                                            STORE store, const int64_t rows,
                                                            const int64_t cols, bool* success) {
  constexpr int block_size_conf_1 = 128;
  constexpr int block_size_conf_2 = 256;
  constexpr int block_size_conf_3 = 512;
  constexpr int block_size_conf_4 = 1024;
  const size_t smem = cols * sizeof(ComputeType);

cudaOccupancyMaxActiveBlocksPerMultiprocessor 返回设备函数的占用。
SoftmaxBlockSMemImpl 为 kernel 函数。
smem超过 SM 内 Shared Memory 的大小时，kernel 会无法启动。
优先让 SM 同时调度的 block 数量达到最大，其次让 block_size 达到最大。从而提高硬件的利用率。

  int max_active_blocks_conf_1;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_1,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_1, algorithm>,
        block_size_conf_1, smem);
    if (err != cudaSuccess) { return err; }
  }
  if (max_active_blocks_conf_1 <= 0) {
    *success = false;
    return cudaSuccess;
  }
  int max_active_blocks_conf_4;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_4,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_4, algorithm>,
        block_size_conf_4, smem);
    if (err != cudaSuccess) { return err; }
  }

如果block_size_conf_1和block_size_conf_4获得的占用相等，则选择较大的max_active_blocks_conf_4。
LaunchSoftmaxBlockSMemImpl 启动在 Block 内处理一行的 kernel。

  if (max_active_blocks_conf_4 == max_active_blocks_conf_1) {
    *success = true;
    return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_4,
                                      algorithm>(stream, load, store, smem, rows, cols);
  }

依次向下尝试max_active_blocks_conf_3和max_active_blocks_conf_2。

  int max_active_blocks_conf_3;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_3,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_3, algorithm>,
        block_size_conf_3, smem);
    if (err != cudaSuccess) { return err; }
  }
  if (max_active_blocks_conf_3 == max_active_blocks_conf_1) {
    *success = true;
    return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_3,
                                      algorithm>(stream, load, store, smem, rows, cols);
  }
  int max_active_blocks_conf_2;
  {
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_active_blocks_conf_2,
        SoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_2, algorithm>,
        block_size_conf_2, smem);
    if (err != cudaSuccess) { return err; }
  }
  if (max_active_blocks_conf_2 == max_active_blocks_conf_1) {
    *success = true;
    return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_2,
                                      algorithm>(stream, load, store, smem, rows, cols);
  }
  *success = true;
  return LaunchSoftmaxBlockSMemImpl<LOAD, STORE, ComputeType, pack_size, block_size_conf_1,
                                    algorithm>(stream, load, store, smem, rows, cols);
}
```## [SoftmaxBlockSMemImpl](https://github.com/Oneflow-Inc/oneflow/blob/master/oneflow/core/cuda/softmax.cuh#L482)
```mermaid
graph TD
SoftmaxBlockSMemImpl-->BlockAllReduce

一个 Block 处理一行元素。
使用动态分配的共享内存。以double类型来对齐，即64-bit。ComputeType仅可能是float。

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int block_size,
         Algorithm algorithm>
__global__ void SoftmaxBlockSMemImpl(LOAD load, STORE store, const int64_t rows,
                                     const int64_t cols) {
  extern __shared__ __align__(sizeof(double)) unsigned char shared_buf[];
  auto* buf = reinterpret_cast<ComputeType*>(shared_buf);
  const int tid = threadIdx.x;
  assert(cols % pack_size == 0);
  const int num_packs = cols / pack_size;

每次处理gridDim.x行。
thread_max用于存储最大类别概率。首先初始化为 -inf。
先将输入加载到pack，再拷贝到buf共享内存中。
buf的形状为[pack_size, num_packs]，这样pack_size内不是连续的，需要逐个存。

  for (int64_t row = blockIdx.x; row < rows; row += gridDim.x) {
    ComputeType thread_max = -Inf<ComputeType>();
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) {
        buf[i * num_packs + pack_id] = pack[i];
        thread_max = max(thread_max, pack[i]);
      }
    }

BlockAllReduce 函数调用 cub::BlockReduce 需要两块内存。
row_max为类别最大值 $\alpha$ 。buf中保存 $e^{x_i -\alpha}$ 。
线程内求和，thread_sum保存线程内的 $\sum_j e^{x_j -\alpha}$ 。

    const ComputeType row_max = BlockAllReduce<MaxOp, ComputeType, block_size>(thread_max);
    ComputeType thread_sum = 0;
    for (int col = tid; col < cols; col += block_size) {
      if (algorithm == Algorithm::kSoftmax) {
        const ComputeType exp_x = Exp(buf[col] - row_max);
        buf[col] = exp_x;
        thread_sum += exp_x;
      } else {
        const ComputeType x = buf[col] - row_max;
        buf[col] = x;
        thread_sum += Exp(x);
      }
    }

再次调用 BlockAllReduce 函数得到所有类别的和。
计算 $\text{Softmax}(x_i) = \frac{e^{x_i -\alpha}}{\sum_j e^{x_j -\alpha}}$ 或者 $\text{LogSoftmax}(x_{i}) = \log\left(\frac{e^{x_i-\alpha} }{ \sum_j e^{x_j-\alpha}} \right) = x_i-\alpha - \log({ \sum_j e^{x_j-\alpha}})$

    const ComputeType row_sum = BlockAllReduce<SumOp, ComputeType, block_size>(thread_sum);
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
#pragma unroll
      for (int i = 0; i < pack_size; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          pack[i] = Div(buf[i * num_packs + pack_id], row_sum);
        } else if (algorithm == Algorithm::kLogSoftmax) {
          pack[i] = buf[i * num_packs + pack_id] - Log(row_sum);
        } else {
          __trap();
        }
      }
      store.template store<pack_size>(pack, row, pack_id * pack_size);
    }
  }
}

LaunchSoftmaxBlockUncachedImpl

GetNumBlocks

SoftmaxBlockUncachedImpl

不使用 Shared Memory 时，需要多次访问 Global Memory。函数设置较大的block_size。因为block_size越大，SM 中能同时并行执行的 Block 数就越少，对 cache 的请求次数就越少，就有更多机会命中 Cache。

GetNumBlocks 函数查询设备的 SM 数量以及每个 SM 支持的最大线程数计算 block 数量。

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, Algorithm algorithm>
inline cudaError_t LaunchSoftmaxBlockUncachedImpl(cudaStream_t stream, LOAD load, STORE store,
                                                  const int64_t rows, const int64_t cols) {
  constexpr int block_size = 1024;
  constexpr int waves = 32;
  int grid_dim_x;
  {
    cudaError_t err = GetNumBlocks(block_size, rows, waves, &grid_dim_x);
    if (err != cudaSuccess) { return err; }
  }

启动 SoftmaxBlockUncachedImpl kernel 函数。

  SoftmaxBlockUncachedImpl<LOAD, STORE, ComputeType, pack_size, block_size, algorithm>
      <<<grid_dim_x, block_size, 0, stream>>>(load, store, rows, cols);
  return cudaPeekAtLastError();
}

SoftmaxBlockUncachedImpl

BlockAllReduce

对于cols没有任何限制。

template<typename LOAD, typename STORE, typename ComputeType, int pack_size, int block_size,
         Algorithm algorithm>
__global__ void SoftmaxBlockUncachedImpl(LOAD load, STORE store, const int64_t rows,
                                         const int64_t cols) {
  const int tid = threadIdx.x;
  assert(cols % pack_size == 0);
  const int num_packs = cols / pack_size;

每个 Block 处理一行元素。
每个线程处理pack_size个类别。
首先求出线程内的最大值，然后调用 BlockAllReduce 归约得到全局最大值。

  for (int64_t row = blockIdx.x; row < rows; row += gridDim.x) {
    ComputeType thread_max = -Inf<ComputeType>();
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) { thread_max = max(thread_max, pack[i]); }
    }
    const ComputeType row_max = BlockAllReduce<MaxOp, ComputeType, block_size>(thread_max);

分两步得到 $\sum_j e^{x_j -\alpha}$ 。

    ComputeType thread_sum = 0;
    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) { thread_sum += Exp(pack[i] - row_max); }
    }
    const ComputeType row_sum = BlockAllReduce<SumOp, ComputeType, block_size>(thread_sum);

计算 $\text{Softmax}(x_i) = \frac{e^{x_i -\alpha}}{\sum_j e^{x_j -\alpha}}$ 或者 $\text{LogSoftmax}(x_{i}) = \log\left(\frac{e^{x_i-\alpha} }{ \sum_j e^{x_j-\alpha}} \right) = x_i-\alpha - \log({ \sum_j e^{x_j-\alpha}})$

    for (int pack_id = tid; pack_id < num_packs; pack_id += block_size) {
      ComputeType pack[pack_size];
      load.template load<pack_size>(pack, row, pack_id * pack_size);
#pragma unroll
      for (int i = 0; i < pack_size; ++i) {
        if (algorithm == Algorithm::kSoftmax) {
          pack[i] = Div(Exp(pack[i] - row_max), row_sum);
        } else if (algorithm == Algorithm::kLogSoftmax) {
          pack[i] = (pack[i] - row_max) - Log(row_sum);
        } else {
          __trap();
        }
      }
      store.template store<pack_size>(pack, row, pack_id * pack_size);
    }
  }
}

参考资料：

CUDA优化之LayerNorm性能优化实践
如何实现一个高效的Softmax CUDA kernel？
如何实现一个高效的Softmax CUDA kernel？——OneFlow 性能优化分享
用Welford算法实现LN的方差更新
OneFlow是如何做到世界最快深度学习框架的
OneFlow源码解析：基础计算接口Primitive
【BBuf的CUDA笔记】八，对比学习OneFlow 和 FasterTransformer 的 Softmax Cuda实现
【BBuf的CUDA笔记】九，使用newbing（chatgpt）解析oneflow softmax相关的fuse优化
【oneflow】算子在深度学习框架中的执行及interpreter
计算机视觉大型攻略 —— CUDA(3)内存模型（一）CUDA内存
计算机视觉大型攻略 —— CUDA(3)内存模型（二）Aligned and Coalesced内存访问
CUDA Data Alignment
CUDA编程入门之 Grid-Stride Loops
【BBuf 的 CUDA 笔记】一，解析 OneFlow Element-Wise 算子实现
【BBuf的CUDA笔记】三，reduce优化入门学习笔记
简单谈谈CUDA Reduce
cuda的shared momery
C++11的模板类型判断——std::is_same和std::decay
深入了解 | 内存对齐之 alignof、alignas 、aligned_storage、align 深度剖析
CUDA小妙招：这种快捷查询设备属性的方法你知道吗？
pybind11使用指南
Pybind11 理解
PyBind11：基本用法和底层实现
pybind笔记_入门
一文理解 PyTorch 中的 SyncBatchNorm
CUDA笔记线程束洗牌函数
Building a Numerically Stable Softmax
How to assign INFINITY to variables in CUDA code?
Way to get floating-point special values in CUDA?
CUDA C++ Programming Guide
CUDA编程入门之Parallel Reductions
CUDA编程入门之Warp-Level Primitives
CUDA中的Warp Shuffle
附录B – 对C++扩展的详细描述
在LLVM后端实现跨通道数据搬移
CUDA 编程手册系列附录B – 对C++扩展的详细描述（三）
Lecture 4 Warp shuffles, reduction and scan operations
一文理解 PyTorch 中的 SyncBatchNorm
6.CUDA编程手册中文版—附录A&B
CUDA编程笔记——chapter5 共享内存和常量内存
在 CUDA C / C ++ 中使用共享内存
cub 库(七)BlockReduce 类共享内存申请方法和Global Mem to 寄存器数组
CUDA Pro Tip: Increase Performance with Vectorized Memory Access
Optimizing CUDA
Advanced CUDA Programming
cuda编程中，转为float4是什么？
【CUDA编程】OneFlow Softmax 算子源码解读之WarpSoftmax
【CUDA编程】OneFlow Softmax算子源码解读之BlockSoftmax
CUDA 编程手册系列第五章: 性能指南

你可能感兴趣的:(OneFlow,GPU,DeepLearning,oneflow,人工智能,深度学习)

[论文阅读] 人工智能 + 软件工程 | 揭秘ChatGPT在软件开发问题解决中的有效性：一项实证研究张较瘦_ 前沿技术论文阅读人工智能软件工程
揭秘ChatGPT在软件开发问题解决中的有效性：一项实证研究论文：WhatMakesChatGPTEffectiveforSoftwareIssueResolution?AnEmpiricalStudyofDeveloper-ChatGPTConversationsinGitHubarXiv:2506.22390WhatMakesChatGPTEffectiveforSoftwareIssueRe
[论文阅读] 人工智能 + 软件工程 | 代码注释不一致问题研究：从数据革新到端到端解决方案张较瘦_ 前沿技术论文阅读人工智能软件工程
代码注释不一致问题研究：从数据革新到端到端解决方案原文：CCISOLVER:End-to-EndDetectionandRepairofMethod-LevelCode-CommentInconsistencyarXiv:2506.20558CCISolver:End-to-EndDetectionandRepairofMethod-LevelCode-CommentInconsistencyRe
【深度学习|学习笔记】如何在深度学习中使用正则化技术进行模型压缩、稀疏建模和迁移学习调优？努力毕业的小土博^_^ 机器学习基础算法优质笔记2 深度学习学习笔记迁移学习人工智能机器学习
【深度学习|学习笔记】如何在深度学习中使用正则化技术进行模型压缩、稀疏建模和迁移学习调优？【深度学习|学习笔记】如何在深度学习中使用正则化技术进行模型压缩、稀疏建模和迁移学习调优？文章目录【深度学习|学习笔记】如何在深度学习中使用正则化技术进行模型压缩、稀疏建模和迁移学习调优？✅一、使用正则化进行模型压缩（ModelCompression）目标：方法：L1正则化促使权重稀疏化代码示例：后续压缩步骤
数字孪生：未来城市管理的革命性技术大有数据可视化信息可视化
一、数字孪生技术概述数字孪生技术是一种通过创建虚拟模型与物理实体之间实时交互的技术。它借助物联网、大数据、云计算、人工智能等前沿技术，实现对物理实体的精准映射与动态仿真。数字孪生的核心在于构建一个与物理世界相对应的虚拟模型，该模型能够实时反映物理实体的状态，并通过数据分析与模拟优化其性能。在城市管理领域，数字孪生技术为城市管理者提供了一种全新的视角和工具。城市是一个复杂的巨系统，涉及基础设施、交通
人类编程时代即将终结？OpenAI首席产品官预测AI将在今年底全面超越人类程序员前端javascript
ReactHook深入浅出CSS技巧与案例详解vue2与vue3技巧合集VueUse源码解读近日，OpenAI首席产品官KevinWeil在接受采访时表示，人工智能的发展速度远超预期，今年底就有可能在编程领域永久性地超越人类程序员。这一观点立即引发了行业热议，也让程序员们对未来产生了深刻的思考。人工智能的进展速度远超想象在与VarunMayya和TanmayBhat共同主持的YouTube节目《O
Python大数据分析&人工智能教程 - Django-Celery异步处理（深入解析与实战案例） AI_DL_CODE python 数据分析 Django Celery异步处理 Celery
文章目录1.概念介绍1.1Django框架概述1.2Celery异步任务队列1.3AMQP协议与消息路由2.环境搭建2.1安装Django和Celery2.2配置Redis作为消息代理3.Celery架构与工作原理3.1Celery组件介绍3.2任务生命周期3.3任务调度与执行3.3.1定时任务3.3.2异步任务调用3.3.3任务结果查询4.Django与Celery集成4.1创建Celery实例
智能之火，重塑创造：大模型如何点燃新一代开发引擎？黑巧克力可减脂 AIGC 人工智能 AIGC
导言：普罗米修斯之火再现在科技演进的长河中，每一次生产力的跃迁都伴随着工具的质变。从蒸汽机轰鸣到电力普及，再到信息高速公路的铺就，人类驾驭能量的能力不断突破。今天，我们站在一个崭新的临界点上：大语言模型（LLM）正将人工智能的“普罗米修斯之火”引入软件开发的核心腹地。这不再仅仅是效率的优化，更是对开发者角色、开发流程乃至软件本质的深度重塑。GitHubCEOThomasDohmke曾断言：“Cop
Python大数据分析&人工智能教程 - Django-RestFramework框架（深入解析+实操案例） AI_DL_CODE python 数据分析 django RestFramework框架
文章目录1.Django-RestFramework基础1.1Django-RestFramework概述1.2安装与配置1.3构建第一个API1.3.1定义模型1.3.2创建序列化器1.3.3定义视图1.3.4配置URL路由1.4进阶功能1.4.1权限控制1.4.2限流1.5实战案例1.5.1创建图书1.5.2查询图书1.5.3更新图书1.5.4删除图书2.序列化器(Serializers)2.
Python从0到100完整学习指南（必看导航）是Dream呀 Python python 人工智能爬虫 web 神经网络算法深度学习
前言：零基础学Python：Python从0到100最新最全教程。想做这件事情很久了，这次我更新了自己所写过的所有博客，汇集成了Python从0到100，共一百节课，帮助大家一个月时间里从零基础到学习Python基础语法、Python爬虫、Web开发、计算机视觉、机器学习、神经网络以及人工智能相关知识，成为学业升学和工作就业的先行者！【优惠信息】•新专栏订阅前1000名享9.9元优惠•订阅量破10
【机器学习&深度学习】模型微调的基本概念与流程一叶千舟深度学习【理论】机器学习深度学习人工智能
目录前言一、什么是模型微调（Fine-tuning）？二、预训练vs微调：什么关系？三、微调的基本流程（以BERT为例）1️⃣准备数据2️⃣加载预训练模型和分词器3️⃣数据编码与加载4️⃣定义优化器5️⃣开始训练6️⃣评估与保存模型四、是否要冻结BERT层？五、完整训练示例代码5.1环境依赖5.2执行代码总结：微调的优势前言在自然语言处理（NLP）快速发展的今天，预训练模型如BERT成为了众多任务
FastGPT与MCP：解锁AI新时代的技术密码挑战者666888 AI模型应用实战迁移学习集成学习文心一言
一、AI浪潮中的新星：FastGPT与MCP登场在当今科技飞速发展的时代，人工智能（AI）已成为推动各行业变革的核心力量。从智能语音助手到复杂的图像识别系统，AI的应用无处不在，而其中的关键技术——语言模型和集成平台，更是备受关注。FastGPT和MCP（Multi-ComponentPlatform）作为这一领域的新兴代表，正逐渐崭露头角，为AI的发展注入新的活力。FastGPT，以其高效的推理
前沿技术推动机器人的智能化升级 AI天才研究院 AI大模型企业级应用开发实战 Agentic AI 实战 AI人工智能与大数据机器人 ai
前沿技术推动机器人的智能化升级关键词：机器人智能化、人工智能、机器学习、计算机视觉、自主导航、人机交互、边缘计算摘要：本文深入探讨了前沿技术如何推动机器人从传统自动化向智能化升级的演进过程。文章首先分析了机器人技术发展的历史脉络和当前挑战，然后详细阐述了人工智能、机器学习、计算机视觉等关键技术如何赋能机器人智能化。通过算法原理分析、数学模型构建和实际项目案例，展示了智能机器人的核心技术实现路径。最
linux深度学习问题汇总不想改代码备忘录 linux python 深度学习 pytorch 人工智能 1024程序员节
目录一、异常问题1.segementationfault(coredump)2.Illegalinstruction(coredumped)3.死锁4.掉卡二、通用方法1.查看重启记录2.系统性能监控3.后台执行命令4.异常日志三、深度学习技术1.普通网络改DDP训练，单机多卡，pytorch四、专业内容方法1.微调diffusion类模型本文记录一些在使用linux服务器进行深度学习时遇到的问题
提升首屏加载的秘密武器：一文讲透 CDN 加速核心逻辑网罗开发实战源码前端 json javascript
网罗开发（小红书、快手、视频号同名）大家好，我是展菲，目前在上市企业从事人工智能项目研发管理工作，平时热衷于分享各种编程领域的软硬技能知识以及前沿技术，包括iOS、前端、HarmonyOS、Java、Python等方向。在移动端开发、鸿蒙开发、物联网、嵌入式、云原生、开源等领域有深厚造诣。图书作者：《ESP32-C3物联网工程开发实战》图书作者：《SwiftUI入门，进阶与实战》超级个体：CO
量化AI价值的30个关键指标 mao_feng 人工智能 AI
摘要：量化AI的战略价值人工智能（AI）成功集成到业务运营中超越了单纯的技术部署;它需要一种严格、可量化的方法来展示其价值。本报告系统地分类并解释了评估AI优势的基本指标，从核心模型性能到总体战略和道德考虑因素。必须制定多方面的衡量策略，将技术AI指标与运营效率、客户体验、财务绩效、战略优势和负责任的AI实践等有形业务成果直接联系起来。稳健的关键绩效指标（KPI）不仅仅是问责制的工具;它们是持续改
Mac mini 跑 DeepSeek R1 及 QwQ-32B模型实测报告强哥之神 GPT macos GPU deepseek 人工智能语言模型 LLM
测试对象：2025款Macmini（M4/M4Pro芯片）测试模型：DeepSeek-R1（14B/32B）、QwQ-32B（原版/量化版）测试目标：硬件性能适配性、推理速度、内存占用及优化方案一、Macmini硬件配置概览配置项M4基础款（16GB）M4Pro高配（32GB/64GB）芯片M4（10核CPU/10核GPU）M4Pro（14核CPU/20核GPU）内存16GB统一内存32GB/64
【AI】AI大模型发展史：从理论探索到技术爆发不想当程序汪的第N天 AI 人工智能
一、早期探索阶段—理论与技术奠基1.1符号主义与连接主义的博弈20世纪50-70年代，符号主义AI主导研究方向，通过专家系统模拟人类逻辑推理，但受限于计算能力和数据规模。80年代连接主义AI兴起，以神经网络为核心，反向传播算法的提出为深度学习奠定基础。1.2神经网络初步实践1980年：卷积神经网络（CNN）雏形诞生1998年：LeNet-5模型成功应用于手写数字识别，成为首个商用深度学习模型关键局
【AI大模型】23、构建你的西部世界：AI小镇具身智能实战指南无心水 AI大模型人工智能 AI小镇搭建具身智能实战智能体系统架构提示语工程优化虚拟社会构建 AI大模型
引言：从代码到虚拟社会的奇妙旅程在人工智能领域，具身智能的发展正引领着一场新的革命。当我们谈论构建一个类似《西部世界》的虚拟社会时，我们不仅在创造一个数字游乐场，更是在探索智能体如何在模拟环境中展现出类似人类的认知、社交和决策能力。本文将带领你踏上一段激动人心的旅程，从底层架构到上层应用，全面解析如何利用提示语工程构建一个充满活力的AI小镇。想象一下，你将成为这个虚拟世界的造物主，通过精心设计的提
九章数学体系：定义域无界化——AI鲁棒性的“隐形杀手“ 九章数学体系数学建模拓扑学人工智能神经网络
九章数学体系：定义域无界化——AI鲁棒性的"隐形杀手"摘要传统人工智能模型在面对边缘场景时常常表现出鲁棒性不足的问题，本文深入分析发现，这种现象的本质根源在于模型缺乏显式的定义域约束，导致无界化假设成为影响AI鲁棒性的"隐形杀手"。文章系统阐述了无界假设如何引发对抗样本脆弱性和数值不稳定等核心问题，并引入九章数学体系的定义域约束理论，为解决这些问题提供了全新的数学视角和工程实现路径。研究表明，通过
从单一设备到万物互联：鸿蒙生态崛起的未来之路王子良. 经验分享 harmonyos 华为
目录一、引言：开启智能时代的钥匙二、鸿蒙生态概述：跨设备协同的核心价值三、开发者机遇与挑战：抓住鸿蒙崛起的机会四、鸿蒙生态崛起的前景：万物互联的未来五、开发者在鸿蒙生态中的实践机遇与挑战1.跨设备开发的机遇2.与人工智能和物联网结合的创新空间3.持续创新与生态完善的挑战六、鸿蒙生态未来的多维发展：智能硬件与大数据的深度结合1.智能硬件与大数据的结合2.在智能家居与城市管理中的应用3.行业领域的深度
OpenCV让Python实现人脸特征点检测 Python编程之道 Python编程之道 opencv python 人工智能 ai
OpenCV让Python实现人脸特征点检测关键词：OpenCV、Python、人脸检测、特征点定位、计算机视觉、Dlib、深度学习摘要：本文将深入探讨如何使用OpenCV和Python实现人脸特征点检测。我们将从基础概念开始，逐步介绍人脸检测和特征点定位的核心算法原理，包括传统的Haar级联检测器和基于深度学习的Dlib面部特征点检测器。文章将提供详细的代码实现和数学原理讲解，并通过实际项目案例
考取华为HCIE-AI有什么用？博睿谷IT99_ 华为人工智能华为认证职业规划
在人工智能技术重塑各行各业的浪潮中，掌握核心AI能力成为专业人士的制胜关键。华为推出的HCIE-AISolutionArchitect（华为认证ICT专家-AI解决方案架构师），正是面向这一领域顶尖人才设立的最高级别认证。主要是为了培养和认证掌握人工智能解决方案架构、设计与应用知识，具备大模型业务场景分析、大模型训练与微调、模型推理部署能力的专家级人才。一、HCIE-AI：专家级能力的权威认证HC
多模态实操第一弹：多模态AI是什么？能做什么？江凯吴杰多模态的尝试人工智能
多模态AI专栏第一期：多模态人工智能概述与应用你是否想过，AI如何像人一样同时"看、听、说"？本期专栏将带你深入了解多模态AI的核心原理、发展脉络、关键技术、典型应用，并为后续实战打下坚实基础。最后，我们将详细介绍本系列所用的ERIT数据集及其任务背景。目录1.什么是多模态AI？2.多模态AI的发展历程3.多模态AI的核心技术4.多模态AI的应用场景5.多模态AI的挑战与机遇6.专栏预告与ERIT
ChatGPT、DeepSeek等大语言模型助力高效办公、论文与项目撰写、数据分析、机器学习与深度学习建模等深度科研 Yolo566Q chatgpt 语言模型数据分析
随着人工智能技术的快速发展，大语言模型如ChatGPT和DeepSeek在科研领域的应用正在为科研人员提供强大的支持。这些模型通过深度学习和大规模语料库训练，能够帮助科研人员高效地筛选文献、生成论文内容、进行数据分析和优化机器学习模型。ChatGPT和DeepSeek能够快速理解和生成复杂的语言，帮助研究人员在撰写论文时提高效率，不仅生成高质量的文章内容，还能优化论文结构和语言表达。在数据分析方面
大语言模型助力高效办公、论文与项目撰写、数据分析、机器学习与深度学习建模等 xiao5kou4chang6kai4 人工智能深度学习机器学习 rnn 语言模型 lstm 深度学习机器学习人工智能 DeepSeek
随着人工智能技术的快速发展，大语言模型如ChatGPT和DeepSeek在科研领域的应用正在为科研人员提供强大的支持。这些模型通过深度学习和大规模语料库训练，能够帮助科研人员高效地筛选文献、生成论文内容、进行数据分析和优化机器学习模型。ChatGPT和DeepSeek能够快速理解和生成复杂的语言，帮助研究人员在撰写论文时提高效率，不仅生成高质量的文章内容，还能优化论文结构和语言表达。在数据分析方面
十分钟了解人工智能的过去、现在与未来 ithadoop 人工智能人工智能
十分钟了解人工智能的过去、现在与未来人工智能(AI)作为重塑人类社会的技术革命，正以前所未有的速度改变着我们的工作方式、生活方式和思维方式。从1943年人工神经元模型的提出，到2025年AI应用场景的全面爆发，AI发展经历了多个关键阶段。在接下来的十分钟里，我们将通过图文解说，快速了解AI从萌芽到现在的历程，以及未来可能带来的机遇与挑战。一、人工智能的过去：从理论奠基到技术突破1.萌芽阶段(194
ChatGPT、DeepSeek等大语言模型助力高效办公、论文与项目撰写、数据分析、机器学习与深度学习建模 asyxchenchong888 chatgpt 语言模型机器学习
随着人工智能技术的快速发展，大语言模型如ChatGPT和DeepSeek在科研领域的应用正在为科研人员提供强大的支持。这些模型通过深度学习和大规模语料库训练，能够帮助科研人员高效地筛选文献、生成论文内容、进行数据分析和优化机器学习模型。ChatGPT和DeepSeek能够快速理解和生成复杂的语言，帮助研究人员在撰写论文时提高效率，不仅生成高质量的文章内容，还能优化论文结构和语言表达。在数据分析方面
ChatGPT、DeepSeek等大语言模型助力高效办公、论文与项目撰写、数据分析、机器学习与深度学习建模等科研应用科研的力量人工智能 ChatGPT chatgpt 语言模型数据分析
随着人工智能技术的快速发展，大语言模型如ChatGPT和DeepSeek在科研领域的应用正在为科研人员提供强大的支持。这些模型通过深度学习和大规模语料库训练，能够帮助科研人员高效地筛选文献、生成论文内容、进行数据分析和优化机器学习模型。ChatGPT和DeepSeek能够快速理解和生成复杂的语言，帮助研究人员在撰写论文时提高效率，不仅生成高质量的文章内容，还能优化论文结构和语言表达。在数据分析方面
探索 AI 系统提示与模型资源库：`system-prompts-and-models-of-ai-tools` 几道之旅人工智能智能体及数字员工人工智能
在当今的人工智能领域，系统提示和工具模型的优化与应用对于提升AI助手的性能和响应质量至关重要。x1xhlol开源的system-prompts-and-models-of-ai-tools仓库为开发者们提供了一个丰富的资源集合，涵盖了多种AI工具的系统提示、工具和模型。仓库概述这个仓库包含了超过7500行的代码和文档，详细介绍了多个知名AI工具的系统提示和相关模型，其中包括FULLv0、Curso
口扫系统软件的架构设计流程老猿的春天三维 c++口扫三维重建
[结构光图像流]↓解码结构光图案↓三角测量计算深度↓点云生成并去噪滤波↓实时配准/拼接(可选ICP/Odometry)↓网格重建（如MarchingCubes或BallPivoting）↓GPU显示（OpenGL/Open3D/VTK）
戴尔笔记本win8系统改装win7系统 sophia天雪 win7 戴尔改装系统 win8
戴尔win8 系统改装win7 系统详述第一步：使用U盘制作虚拟光驱： 1）下载安装UltraISO：注册码可以在网上搜索。 2）启动UltraISO，点击“文件”—》“打开”按钮，打开已经准备好的ISO镜像文
BeanUtils.copyProperties使用笔记 bylijinnan java
BeanUtils.copyProperties VS PropertyUtils.copyProperties 两者最大的区别是： BeanUtils.copyProperties会进行类型转换，而PropertyUtils.copyProperties不会。既然进行了类型转换，那BeanUtils.copyProperties的速度比不上PropertyUtils.copyProp
MyEclipse中文乱码问题 0624chenhong MyEclipse
一、设置新建常见文件的默认编码格式，也就是文件保存的格式。在不对MyEclipse进行设置的时候，默认保存文件的编码，一般跟简体中文操作系统（如windows2000，windowsXP）的编码一致，即GBK。在简体中文系统下，ANSI 编码代表 GBK编码;在日文操作系统下，ANSI 编码代表 JIS 编码。 Window-->Preferences-->General -
发送邮件不懂事的小屁孩 send email
import org.apache.commons.mail.EmailAttachment; import org.apache.commons.mail.EmailException; import org.apache.commons.mail.HtmlEmail; import org.apache.commons.mail.MultiPartEmail;
动画合集换个号韩国红果果 html css
动画指一种样式变为另一种样式 keyframes应当始终定义0 100 过程 1 transition 制作鼠标滑过图片时的放大效果 css .wrap{ width: 340px;height: 340px; position: absolute; top: 30%; left: 20%; overflow: hidden; bor
网络最常见的攻击方式竟然是SQL注入蓝儿唯美 sql注入
NTT研究表明，尽管SQL注入（SQLi）型攻击记录详尽且为人熟知，但目前网络应用程序仍然是SQLi攻击的重灾区。信息安全和风险管理公司NTTCom Security发布的《2015全球智能威胁风险报告》表明，目前黑客攻击网络应用程序方式中最流行的，要数SQLi攻击。报告对去年发生的60亿攻击行为进行分析，指出SQLi攻击是最常见的网络应用程序攻击方式。全球网络应用程序攻击中，SQLi攻击占
java笔记2 a-john java
类的封装： 1，java中，对象就是一个封装体。封装是把对象的属性和服务结合成一个独立的的单位。并尽可能隐藏对象的内部细节（尤其是私有数据） 2，目的：使对象以外的部分不能随意存取对象的内部数据（如属性），从而使软件错误能够局部化，减少差错和排错的难度。 3，简单来说，“隐藏属性、方法或实现细节的过程”称为——封装。 4，封装的特性： 4.1设置
[Andengine]Error：can't creat bitmap form path “gfx/xxx.xxx” aijuans 学习Android遇到的错误
最开始遇到这个错误是很早以前了，以前也没注意，只当是一个不理解的bug，因为所有的texture，textureregion都没有问题，但是就是提示错误。昨天和美工要图片，本来是要背景透明的png格式，可是她却给了我一个jpg的。说明了之后她说没法改，因为没有png这个保存选项。我就看了一下，和她要了psd的文件，还好我有一点
自己写的一个繁体到简体的转换程序 asialee java 转换繁体 filter 简体
今天调研一个任务，基于java的filter实现繁体到简体的转换，于是写了一个demo，给各位博友奉上，欢迎批评指正。实现的思路是重载request的调取参数的几个方法，然后做下转换。
android意图和意图监听器技术百合不是茶 android 显示意图隐式意图意图监听器
Intent是在activity之间传递数据;Intent的传递分为显示传递和隐式传递显式意图：调用Intent.setComponent() 或 Intent.setClassName() 或 Intent.setClass()方法明确指定了组件名的Intent为显式意图，显式意图明确指定了Intent应该传递给哪个组件。隐式意图;不指明调用的名称,根据设
spring3中新增的@value注解 bijian1013 java spring @Value
在spring 3.0中，可以通过使用@value，对一些如xxx.properties文件中的文件，进行键值对的注入，例子如下： 1.首先在applicationContext.xml中加入： <beans xmlns="http://www.springframework.
Jboss启用CXF日志 sunjing log jboss CXF
1. 在standalone.xml配置文件中添加system-properties： <system-properties> <property name="org.apache.cxf.logging.enabled" value=&
【Hadoop三】Centos7_x86_64部署Hadoop集群之编译Hadoop源代码 bit1129 centos
编译必需的软件 Firebugs3.0.0 Maven3.2.3 Ant JDK1.7.0_67 protobuf-2.5.0 Hadoop 2.5.2源码包 Firebugs3.0.0 http://sourceforge.jp/projects/sfnet_findbug
struts2验证框架的使用和扩展白糖_ 框架 xml bean struts 正则表达式
struts2能够对前台提交的表单数据进行输入有效性校验，通常有两种方式： 1、在Action类中通过validatexx方法验证，这种方式很简单，在此不再赘述； 2、通过编写xx-validation.xml文件执行表单验证，当用户提交表单请求后，struts会优先执行xml文件，如果校验不通过是不会让请求访问指定action的。本文介绍一下struts2通过xml文件进行校验的方法并说
记录-感悟 braveCS 感悟
再翻翻以前写的感悟，有时会发现自己很幼稚，也会让自己找回初心。 2015-1-11 1. 能在工作之余学习感兴趣的东西已经很幸福了； 2. 要改变自己，不能这样一直在原来区域，要突破安全区舒适区，才能提高自己，往好的方面发展； 3. 多反省多思考；要会用工具，而不是变成工具的奴隶； 4. 一天内集中一个定长时间段看最新资讯和偏流式博
编程之美-数组中最长递增子序列 bylijinnan 编程之美
import java.util.Arrays; import java.util.Random; public class LongestAccendingSubSequence { /** * 编程之美数组中最长递增子序列 * 书上的解法容易理解 * 另一方法书上没有提到的是，可以将数组排序（由小到大）得到新的数组， * 然后求排序后的数组与原数
读书笔记5 chengxuyuancsdn 重复提交 struts2的token验证
1、重复提交 2、struts2的token验证 3、用response返回xml时的注意 1、重复提交 (1)应用场景 (1-1)点击提交按钮两次。 (1-2)使用浏览器后退按钮重复之前的操作，导致重复提交表单。 (1-3)刷新页面 (1-4)使用浏览器历史记录重复提交表单。 (1-5)浏览器重复的 HTTP 请求。 (2)解决方法 (2-1)禁掉提交按钮 (2-2)
[时空与探索]全球联合进行第二次费城实验的可能性 comsci
二次世界大战前后,由爱因斯坦参加的一次在海军舰艇上进行的物理学实验 -费城实验至今给我们大家留下很多迷团..... 关于费城实验的详细过程,大家可以在网络上搜索一下,我这里就不详细描述了在这里,我的意思是,现在
easy connect 之 ORA-12154: TNS: 无法解析指定的连接标识符 daizj oracle ORA-12154
用easy connect连接出现“tns无法解析指定的连接标示符”的错误，如下： C:\Users\Administrator>sqlplus username/[email protected]:1521/orcl SQL*Plus: Release 10.2.0.1.0 – Production on 星期一 5月 21 18:16:20 2012 Copyright (c) 198
简单排序:归并排序 dieslrae 归并排序
public void mergeSort(int[] array){ int temp = array.length/2; if(temp == 0){ return; } int[] a = new int[temp]; int
C语言中字符串的\0和空格 dcj3sjt126com c
\0 为字符串结束符，比如说： abcd (空格)cdefg；存入数组时，空格作为一个字符占有一个字节的空间，我们
解决Composer国内速度慢的办法 dcj3sjt126com Composer
用法：有两种方式启用本镜像服务： 1 将以下配置信息添加到 Composer 的配置文件 config.json 中（系统全局配置）。见“例1” 2 将以下配置信息添加到你的项目的 composer.json 文件中（针对单个项目配置）。见“例2” 为了避免安装包的时候都要执行两次查询，切记要添加禁用 packagist 的设置，如下 1 2 3 4 5
高效可伸缩的结果缓存 shuizhaosi888 高效可伸缩的结果缓存
/** * 要执行的算法，返回结果v */ public interface Computable<A, V> { public V comput(final A arg); } /** * 用于缓存数据 */ public class Memoizer<A, V> implements Computable<A,
三点定位的算法 haoningabc c 算法
三点定位，已知a,b,c三个顶点的x,y坐标和三个点都z坐标的距离，la，lb,lc 求z点的坐标原理就是围绕a,b,c 三个点画圆，三个圆焦点的部分就是所求但是，由于三个点的距离可能不准，不一定会有结果，所以是三个圆环的焦点，环的宽度开始为0，没有取到则加1 运行 gcc -lm test.c test.c代码如下 #include "stdi
epoll使用详解 jimmee c linux 服务端编程 epoll
epoll - I/O event notification facility在linux的网络编程中，很长的时间都在使用select来做事件触发。在linux新的内核中，有了一种替换它的机制，就是epoll。相比于select，epoll最大的好处在于它不会随着监听fd数目的增长而降低效率。因为在内核中的select实现中，它是采用轮询来处理的，轮询的fd数目越多，自然耗时越多。并且，在linu
Hibernate对Enum的映射的基本使用方法 linzx0212 enum Hibernate
枚举 /** * 性别枚举 */ public enum Gender { MALE(0), FEMALE(1), OTHER(2); private Gender(int i) { this.i = i; } private int i; public int getI
第10章高级事件（下） onestopweb 事件
index.html <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/
孙子兵法 roadrunners 孙子兵法
始计第一孙子曰：兵者，国之大事，死生之地，存亡之道，不可不察也。故经之以五事，校之以计，而索其情：一曰道，二曰天，三曰地，四曰将，五曰法。道者，令民于上同意，可与之死，可与之生，而不危也；天者，阴阳、寒暑、时制也；地者，远近、险易、广狭、死生也；将者，智、信、仁、勇、严也；法者，曲制、官道、主用也。凡此五者，将莫不闻，知之者胜，不知之者不胜。故校之以计，而索其情，曰
MySQL双向复制 tomcat_oracle mysql
本文包括: 主机配置从机配置建立主-从复制建立双向复制背景按照以下简单的步骤: 参考一下：在机器A配置主机(192.168.1.30) 在机器B配置从机(192.168.1.29) 我们可以使用下面的步骤来实现这一点步骤1：机器A设置主机在主机中打开配置文件 ,
zoj 3822 Domination(dp) 阿尔萨斯 Mina
题目链接：zoj 3822 Domination 题目大意：给定一个N∗M的棋盘，每次任选一个位置放置一枚棋子，直到每行每列上都至少有一枚棋子，问放置棋子个数的期望。解题思路：大白书上概率那一张有一道类似的题目，但是因为时间比较久了，还是稍微想了一下。dp[i][j][k]表示i行j列上均有至少一枚棋子，并且消耗k步的概率（k≤i∗j）,因为放置在i+1~n上等价与放在i+1行上，同理