Conv2DOp in TensorFlow

2D convolution in TensorFlow relies mainly on external libraries such as cuDNN, cuBLAS, ROCm (MIOpen), and hfp/libxsmm; only DeepConv2D is implemented directly in the TensorFlow source tree.

Conv2DOp

The Conv2DOp kernel inherits from BinaryOp.

InitConv2DParameters reads the attributes from OpKernelConstruction into Conv2DParameters and validates them.
CudnnUseAutotune indicates whether autotuning is enabled.

template <typename Device, typename T>
class Conv2DOp : public BinaryOp<T> {
 public:
  explicit Conv2DOp(OpKernelConstruction* context) : BinaryOp<T>(context) {
    OP_REQUIRES_OK(context, InitConv2DParameters(context, &params_));

    OP_REQUIRES_OK(context, context->GetAttr("use_cudnn_on_gpu", &use_cudnn_));
    cudnn_use_autotune_ = CudnnUseAutotune();
  }

Conv2DOp::Compute

Conv2DOp::Compute first tries LaunchXsmmConvOp and LaunchDeepConvOp, and otherwise falls back to LaunchConv2DOp.

ComputeConv2DDimension validates the inputs and fills in the dimension information for the 2D convolution.
ShapeFromFormat builds the output shape for the given data format.
OpKernelContext::allocate_output allocates the output tensor with that shape.
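As a quick illustration of ShapeFromFormat, a standalone sketch (not the TensorFlow implementation, which lives in tensor_format.h): the shape is simply the four dimensions ordered according to the data format.

#include <array>
#include <cstdint>
#include <iostream>

enum class Format { NHWC, NCHW };

// Order (batch, rows, cols, depth) according to the data format.
std::array<int64_t, 4> ShapeFromFormatSketch(Format format, int64_t n,
                                             int64_t h, int64_t w, int64_t c) {
  if (format == Format::NCHW) return {n, c, h, w};
  return {n, h, w, c};  // NHWC
}

int main() {
  auto shape = ShapeFromFormatSketch(Format::NCHW, 32, 28, 28, 64);
  for (int64_t d : shape) std::cout << d << ' ';  // prints: 32 64 28 28
  std::cout << '\n';
}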

  void Compute(OpKernelContext* context) override {
    // Input tensor is of the following dimensions:
    // [ batch, in_rows, in_cols, in_depth ]
    const Tensor& input = context->input(0);

    // Input filter is of the following dimensions:
    // [ filter_rows, filter_cols, in_depth, out_depth]
    const Tensor& filter = context->input(1);

    Conv2DDimensions dimensions;
    OP_REQUIRES_OK(context,
                   ComputeConv2DDimension(params_, input, filter, &dimensions));

    TensorShape out_shape = ShapeFromFormat(
        params_.data_format, dimensions.batch, dimensions.out_rows,
        dimensions.out_cols, dimensions.out_depth);

    // Output tensor is of the following dimensions:
    // [ in_batch, out_rows, out_cols, out_depth ]
    Tensor* output = nullptr;
    OP_REQUIRES_OK(context, context->allocate_output(0, out_shape, &output));

    VLOG(2) << "Conv2D: in_depth = " << dimensions.in_depth
            << ", patch_depth = " << dimensions.patch_depth
            << ", input_cols = " << dimensions.input_cols
            << ", filter_cols = " << dimensions.filter_cols
            << ", input_rows = " << dimensions.input_rows
            << ", filter_rows = " << dimensions.filter_rows
            << ", stride_rows = " << dimensions.stride_rows
            << ", stride_cols = " << dimensions.stride_cols
            << ", dilation_rows = " << dimensions.dilation_rows
            << ", dilation_cols = " << dimensions.dilation_cols
            << ", out_depth = " << dimensions.out_depth;

    // If there is nothing to compute, return.
    if (out_shape.num_elements() == 0) {
      return;
    }

If the hfp/libxsmm library is enabled and the padding mode is not EXPLICIT, LaunchXsmmConvOp::Run is tried first.
It only takes effect in CPU builds; the GPU variant returns false.

#ifdef TENSORFLOW_USE_LIBXSMM_CONVOLUTIONS
    if (params_.padding != EXPLICIT &&
        LaunchXsmmConvOp<Device, T>::Run(
            context, input, filter, dimensions.batch, dimensions.input_rows,
            dimensions.input_cols, dimensions.in_depth, dimensions.filter_rows,
            dimensions.filter_cols, dimensions.pad_rows_before,
            dimensions.pad_cols_before, dimensions.out_rows,
            dimensions.out_cols, dimensions.out_depth, dimensions.dilation_rows,
            dimensions.dilation_cols, dimensions.stride_rows,
            dimensions.stride_cols, output, params_.data_format)) {
      return;
    }
#endif

For non-explicit padding, LaunchDeepConvOp::Run is tried next. The default LaunchDeepConvOp returns false; only the CPU float specialization is implemented.

    if (params_.padding != EXPLICIT &&
        LaunchDeepConvOp<Device, T>::Run(
            context, input, filter, dimensions.batch, dimensions.input_rows,
            dimensions.input_cols, dimensions.in_depth, dimensions.filter_rows,
            dimensions.filter_cols, dimensions.pad_rows_before,
            dimensions.pad_cols_before, dimensions.out_rows,
            dimensions.out_cols, dimensions.out_depth, dimensions.dilation_rows,
            dimensions.dilation_cols, dimensions.stride_rows,
            dimensions.stride_cols, output, params_.data_format)) {
      return;
    }
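This dispatch relies on template specialization: the primary template's Run simply returns false so that Compute falls through to the next launcher, and only the CPU float specialization does real work. A minimal standalone sketch of the pattern (illustrative names, not the actual deep_conv2d code):

#include <iostream>

struct CPUDevice {};
struct GPUDevice {};

// Primary template: no specialized implementation, report "not handled".
template <typename Device, typename T>
struct LaunchDeepConvOpSketch {
  static bool Run(/* context, tensors, dims ... */) { return false; }
};

// Only the <CPUDevice, float> specialization actually computes something.
template <>
struct LaunchDeepConvOpSketch<CPUDevice, float> {
  static bool Run(/* context, tensors, dims ... */) {
    // A Winograd-style deep convolution would run here.
    return true;
  }
};

int main() {
  std::cout << LaunchDeepConvOpSketch<GPUDevice, float>::Run() << '\n';  // 0: fall through
  std::cout << LaunchDeepConvOpSketch<CPUDevice, float>::Run() << '\n';  // 1: handled
}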

Otherwise launcher_, i.e. LaunchConv2DOp::operator(), is invoked; its generic GPU version is implemented with cuDNN.

    launcher_(context, use_cudnn_, cudnn_use_autotune_, input, filter,
              dimensions.dilation_rows, dimensions.dilation_cols,
              dimensions.stride_rows, dimensions.stride_cols, params_.padding,
              params_.explicit_paddings, output, params_.data_format);
  }

The LaunchConv2DOp member launcher_ is the actual executor.

 private:
  Conv2DParameters params_;
  bool use_cudnn_;
  bool cudnn_use_autotune_;

  LaunchConv2DOp<Device, T> launcher_;

  TF_DISALLOW_COPY_AND_ASSIGN(Conv2DOp);
};

InitConv2DParameters

InitConv2DParameters reads the attributes from OpKernelConstruction into Conv2DParameters and validates them.
ShapeFromFormat builds a shape for the given data format.
GetTensorDim retrieves a dimension value by its character tag ('N', 'H', 'W', 'C').
CheckValidPadding validates the padding values.

Status InitConv2DParameters(const OpKernelConstruction* context,
                            Conv2DParameters* params) {
  TF_RETURN_IF_ERROR(context->GetAttr("dilations", &params->dilations));
  TF_RETURN_IF_ERROR(context->GetAttr("strides", &params->strides));
  TF_RETURN_IF_ERROR(context->GetAttr("padding", &params->padding));
  if (context->HasAttr("explicit_paddings")) {
    TF_RETURN_IF_ERROR(
        context->GetAttr("explicit_paddings", &params->explicit_paddings));
  }
  string data_format_string;
  TF_RETURN_IF_ERROR(context->GetAttr("data_format", &data_format_string));
  TF_REQUIRES(FormatFromString(data_format_string, &params->data_format),
              errors::InvalidArgument("Invalid data format"));

  const auto& strides = params->strides;
  const auto& dilations = params->dilations;
  const auto& data_format = params->data_format;

  TF_REQUIRES(dilations.size() == 4,
              errors::InvalidArgument("Sliding window dilations field must "
                                      "specify 4 dimensions"));
  TF_REQUIRES(strides.size() == 4,
              errors::InvalidArgument("Sliding window strides field must "
                                      "specify 4 dimensions"));
  const int64_t stride_n = GetTensorDim(strides, data_format, 'N');
  const int64_t stride_c = GetTensorDim(strides, data_format, 'C');
  const int64_t stride_h = GetTensorDim(strides, data_format, 'H');
  const int64_t stride_w = GetTensorDim(strides, data_format, 'W');
  TF_REQUIRES(
      stride_n == 1 && stride_c == 1,
      errors::Unimplemented("Current implementation does not yet support "
                            "strides in the batch and depth dimensions."));
  TF_REQUIRES(stride_h > 0 && stride_w > 0,
              errors::InvalidArgument(
                  "Row and column strides should be larger than 0."));

  const int64_t dilation_n = GetTensorDim(dilations, data_format, 'N');
  const int64_t dilation_c = GetTensorDim(dilations, data_format, 'C');
  const int64_t dilation_h = GetTensorDim(dilations, data_format, 'H');
  const int64_t dilation_w = GetTensorDim(dilations, data_format, 'W');
  TF_REQUIRES(
      dilation_n == 1 && dilation_c == 1,
      errors::Unimplemented("Current implementation does not yet support "
                            "dilations in the batch and depth dimensions."));
  TF_REQUIRES(
      dilation_h > 0 && dilation_w > 0,
      errors::InvalidArgument("Dilated rates should be larger than 0."));

  TF_RETURN_IF_ERROR(CheckValidPadding(params->padding,
                                       params->explicit_paddings,
                                       /*num_dims=*/4, data_format));

  return Status::OK();
}
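The GetTensorDim calls above index the 4-element strides/dilations attributes by the position that the character ('N', 'H', 'W', 'C') occupies in the data format. A standalone sketch of the mapping (the real helper lives in tensor_format.h):

#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

enum class Format { NHWC, NCHW };

// Index of dimension `dim` ('N', 'H', 'W' or 'C') for the given format.
int DimIndexSketch(Format format, char dim) {
  if (format == Format::NHWC) {
    switch (dim) { case 'N': return 0; case 'H': return 1;
                   case 'W': return 2; case 'C': return 3; }
  } else {  // NCHW
    switch (dim) { case 'N': return 0; case 'C': return 1;
                   case 'H': return 2; case 'W': return 3; }
  }
  assert(false && "unknown dimension");
  return -1;
}

int64_t GetTensorDimSketch(const std::vector<int64_t>& attr, Format format, char dim) {
  return attr[DimIndexSketch(format, dim)];
}

int main() {
  std::vector<int64_t> strides = {1, 2, 2, 1};  // written down in NHWC order
  std::cout << GetTensorDimSketch(strides, Format::NHWC, 'H') << '\n';  // 2
  std::cout << GetTensorDimSketch(strides, Format::NHWC, 'C') << '\n';  // 1
}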

LaunchConv2DOp

Both CUDA and ROCm use the LaunchConv2DOp<Eigen::GpuDevice, T> specialization.

template <typename Device, typename T>
struct LaunchConv2DOp {
  void operator()(OpKernelContext* ctx, bool use_cudnn, bool cudnn_use_autotune,
                  const Tensor& input, const Tensor& filter, int row_dilation,
                  int col_dilation, int row_stride, int col_stride,
                  const Padding& padding,
                  const std::vector<int64_t>& explicit_paddings, Tensor* output,
                  TensorFormat data_format);
};

#if GOOGLE_CUDA || TENSORFLOW_USE_ROCM
template <typename T>
struct LaunchConv2DOp<Eigen::GpuDevice, T> {
  void operator()(OpKernelContext* ctx, bool use_cudnn, bool cudnn_use_autotune,
                  const Tensor& input, const Tensor& filter, int row_dilation,
                  int col_dilation, int row_stride, int col_stride,
                  const Padding& padding,
                  const std::vector<int64_t>& explicit_paddings, Tensor* output,
                  TensorFormat data_format);
};
#endif  // GOOGLE_CUDA || TENSORFLOW_USE_ROCM

LaunchConv2DOp::operator()

[Flowchart: LaunchConv2DOp::operator() — ctx, input_param, filter → 1x1 or full-size filter? → Stream::ThenBlasGemm → End; otherwise ComputeInNhwcEnabled → explicit padding? → GetExplicitPaddingForDim → GetWindowedOutputSizeVerboseV2 → symmetric padding? → allocate_temp / PadInput → compute in NCHW? → NHWCToNCHW → transform_filter → GetDnnWorkspaceLimitOrDefault → AutotuneUnfusedConv → LaunchAutotunedConv → NCHWToNHWC]

The generic GPU version is implemented with cuDNN.

template <typename T>
void LaunchConv2DOp<GPUDevice, T>::operator()(
    OpKernelContext* ctx, bool use_cudnn, bool cudnn_use_autotune,
    const Tensor& input_param, const Tensor& filter, int row_dilation,
    int col_dilation, int row_stride, int col_stride, const Padding& padding,
    const std::vector<int64_t>& explicit_paddings, Tensor* output,
    TensorFormat data_format) {
  using se::dnn::AlgorithmConfig;
  using se::dnn::AlgorithmDesc;
  using se::dnn::ProfileResult;
  auto* stream = ctx->op_device_context()->stream();
  OP_REQUIRES(ctx, stream, errors::Internal("No GPU stream available."));

  if (!use_cudnn) {
    ctx->SetStatus(
        errors::Unimplemented("Conv2D for GPU is not currently supported "
                              "without cudnn"));
    return;
  }

The input dimension information is read as int64.

  Tensor input = input_param;
  const int64_t in_batch = GetTensorDim(input, data_format, 'N');
  int64_t in_rows = GetTensorDim(input, data_format, 'H');
  int64_t in_cols = GetTensorDim(input, data_format, 'W');
  const int64_t in_depths = GetTensorDim(input, data_format, 'C');
  const int64_t patch_rows = filter.dim_size(0);
  const int64_t patch_cols = filter.dim_size(1);
  const int64_t patch_depths = filter.dim_size(2);

  OP_REQUIRES(
      ctx, filter.NumElements() > 0,
      errors::InvalidArgument("filter must not have zero elements "
                              "(i.e. all dimensions must be non-zero)"));

If the filter depth (patch_depths) is 1 and smaller than the input depth, this is a depthwise convolution; more generally, if the filter depth divides but is smaller than the input depth, it is a grouped convolution.
For a 1x1 convolution in NHWC format (with unit strides and dilations and no grouping), Stream::ThenBlasGemm is called directly.
AsDeviceMemory maps a tensor to a DeviceMemory object that wraps a buffer of the given type.
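To see why the 1x1 case can go straight to cuBLAS: in NHWC every spatial position's channel vector is multiplied by the same in_depth x out_depth filter matrix, so the whole convolution is a single GEMM with m = batch * rows * cols, k = in_depth, n = out_depth. A naive standalone sketch of that equivalence (plain loops, illustrative only):

#include <cstdint>
#include <iostream>
#include <vector>

// output[p, o] = sum_c input[p, c] * filter[c, o], where p runs over every
// (batch, row, col) position: exactly C = A * B with
// m = batch*rows*cols, k = in_depth, n = out_depth.
std::vector<float> Conv1x1AsGemm(const std::vector<float>& input,   // [m, k] row-major (flattened NHWC)
                                 const std::vector<float>& filter,  // [k, n] row-major (HWIO with H=W=1)
                                 int64_t m, int64_t k, int64_t n) {
  std::vector<float> output(m * n, 0.0f);
  for (int64_t p = 0; p < m; ++p)
    for (int64_t c = 0; c < k; ++c)
      for (int64_t o = 0; o < n; ++o)
        output[p * n + o] += input[p * k + c] * filter[c * n + o];
  return output;
}

int main() {
  const int64_t batch = 1, rows = 2, cols = 2, in_depth = 3, out_depth = 4;
  std::vector<float> input(batch * rows * cols * in_depth, 1.0f);
  std::vector<float> filter(in_depth * out_depth, 0.5f);
  auto out = Conv1x1AsGemm(input, filter, batch * rows * cols, in_depth, out_depth);
  std::cout << out[0] << '\n';  // 0.5 summed over 3 input channels = 1.5
}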

  // If the filter in-depth (patch_depths) is 1 and smaller than the input
  // depth, it's a depthwise convolution. More generally, if the filter in-depth
  // divides but is smaller than the input depth, it is a grouped convolution.
  bool is_grouped_convolution = patch_depths != in_depths;
  if (patch_rows == 1 && patch_cols == 1 && !is_grouped_convolution &&
      row_dilation == 1 && col_dilation == 1 && row_stride == 1 &&
      col_stride == 1 && data_format == FORMAT_NHWC &&
      (padding == VALID || padding == SAME)) {
    // 1x1 filter, so call cublas directly.
    const uint64 m = in_batch * in_rows * in_cols;
    const uint64 k = patch_depths;
    const uint64 n = filter.dim_size(3);

    auto a_ptr = AsDeviceMemory(input.template flat<T>().data(),
                                input.template flat<T>().size());
    auto b_ptr = AsDeviceMemory(filter.template flat<T>().data(),
                                filter.template flat<T>().size());
    auto c_ptr = AsDeviceMemory(output->template flat<T>().data(),
                                output->template flat<T>().size());

    auto no_transpose = se::blas::Transpose::kNoTranspose;
    OP_REQUIRES_OK(
        ctx, stream->ThenBlasGemm(no_transpose, no_transpose, n, m, k, b_ptr, n,
                                  a_ptr, k, &c_ptr, n,
                                  se::blas::kDefaultComputePrecision));
    return;

If the filter has exactly the same height and width as the input, with VALID padding and NHWC data format, Stream::ThenBlasGemm is likewise called directly.

  } else if (patch_rows == in_rows && patch_cols == in_cols &&
             !is_grouped_convolution && row_dilation == 1 &&
             col_dilation == 1 && padding == VALID &&
             data_format == FORMAT_NHWC) {
    // The input data and filter have the same height/width, so call cublas
    // directly.
    const uint64 m = in_batch;
    const uint64 k = patch_rows * patch_cols * patch_depths;
    const uint64 n = filter.dim_size(3);

    auto a_ptr = AsDeviceMemory(input.template flat<T>().data(),
                                input.template flat<T>().size());
    auto b_ptr = AsDeviceMemory(filter.template flat<T>().data(),
                                filter.template flat<T>().size());
    auto c_ptr = AsDeviceMemory(output->template flat<T>().data(),
                                output->template flat<T>().size());

    auto no_transpose = se::blas::Transpose::kNoTranspose;
    OP_REQUIRES_OK(
        ctx, stream->ThenBlasGemm(no_transpose, no_transpose, n, m, k, b_ptr, n,
                                  a_ptr, k, &c_ptr, n,
                                  se::blas::kDefaultComputePrecision));
    return;
  }

ComputeInNhwcEnabled decides based on the data type, the GPU compute capability, and the cuDNN version.
Tensor Cores support efficient convolution in the NHWC layout for fp16 on NVIDIA Volta+ GPUs and for tf32 on Ampere+ GPUs. In all other configurations it is more efficient to run the computation in the NCHW data format.
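A hedged sketch of the decision just described (the real ComputeInNhwcEnabled also consults the cuDNN version and environment overrides; the compute-capability thresholds below simply restate the Volta/Ampere rule):

#include <iostream>

enum class DType { kHalf, kFloat };

// NHWC is preferred only when Tensor Cores can be used for it:
// fp16 on Volta (sm_70) and newer, tf32 (float) on Ampere (sm_80) and newer.
bool ComputeInNhwcSketch(DType dtype, int cc_major, bool tf32_allowed) {
  const bool fp16_tensor_cores = (dtype == DType::kHalf && cc_major >= 7);
  const bool tf32_tensor_cores =
      (dtype == DType::kFloat && tf32_allowed && cc_major >= 8);
  return fp16_tensor_cores || tf32_tensor_cores;
}

int main() {
  std::cout << ComputeInNhwcSketch(DType::kHalf, 7, true) << '\n';   // 1: Volta fp16
  std::cout << ComputeInNhwcSketch(DType::kFloat, 7, true) << '\n';  // 0: pre-Ampere float -> NCHW
}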

#if GOOGLE_CUDA
  const bool compute_in_nhwc = ComputeInNhwcEnabled(DataTypeToEnum<T>::value,
                                                    stream, /*is_conv2d=*/true);
#else
  // fast NHWC implementation is a CUDA only feature
  const bool compute_in_nhwc = false;
#endif

  // We only do one directional conversion: NHWC->NCHW. We never convert in the
  // other direction. Grappler layout optimizer selects preferred layout and
  // adds necessary annotations to the graph.
  // TODO(ezhulenev): Convert in other direction for fp16?
  const TensorFormat compute_data_format =
      (compute_in_nhwc && data_format == FORMAT_NHWC) ? FORMAT_NHWC
                                                      : FORMAT_NCHW;

  VLOG(3) << "Compute Conv2D with cuDNN:"
          << " data_format=" << ToString(data_format)
          << " compute_data_format=" << ToString(compute_data_format);

Next, the output dimension information is obtained.
GetExplicitPaddingForDim reads the per-dimension padding values from explicit_paddings.
GetWindowedOutputSizeVerboseV2 computes the output size and the padding values.

  const int64_t out_batch = GetTensorDim(*output, data_format, 'N');
  const int64_t out_rows = GetTensorDim(*output, data_format, 'H');
  const int64_t out_cols = GetTensorDim(*output, data_format, 'W');
  const int64_t out_depths = GetTensorDim(*output, data_format, 'C');
  int64_t padding_top = -1, padding_bottom = -1;
  int64_t padding_left = -1, padding_right = -1;
  if (padding == EXPLICIT) {
    GetExplicitPaddingForDim(explicit_paddings, data_format, 'H', &padding_top,
                             &padding_bottom);
    GetExplicitPaddingForDim(explicit_paddings, data_format, 'W', &padding_left,
                             &padding_right);
  }
  int64_t out_rows_check, out_cols_check;
  Status status = GetWindowedOutputSizeVerboseV2(
      in_rows, patch_rows, row_dilation, row_stride, padding, &out_rows_check,
      &padding_top, &padding_bottom);
  // The status is guaranteed to be OK because we checked the output and padding
  // was valid earlier.
  TF_CHECK_OK(status);
  DCHECK_EQ(out_rows, out_rows_check);
  status = GetWindowedOutputSizeVerboseV2(in_cols, patch_cols, col_dilation,
                                          col_stride, padding, &out_cols_check,
                                          &padding_left, &padding_right);
  TF_CHECK_OK(status);
  DCHECK_EQ(out_cols, out_cols_check);
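For reference, a standalone sketch of the size and padding arithmetic that GetWindowedOutputSizeVerboseV2 performs for SAME and VALID padding (following the usual TensorFlow formulas; EXPLICIT padding just plugs in the supplied amounts):

#include <algorithm>
#include <cstdint>
#include <iostream>

// Returns the output size and fills pad_before/pad_after for SAME/VALID.
int64_t WindowedOutputSizeSketch(int64_t input, int64_t filter, int64_t dilation,
                                 int64_t stride, bool same_padding,
                                 int64_t* pad_before, int64_t* pad_after) {
  const int64_t effective_filter = (filter - 1) * dilation + 1;
  if (!same_padding) {  // VALID: no padding at all
    *pad_before = *pad_after = 0;
    return (input - effective_filter + stride) / stride;
  }
  const int64_t output = (input + stride - 1) / stride;  // ceil(input / stride)
  const int64_t needed =
      std::max<int64_t>(0, (output - 1) * stride + effective_filter - input);
  *pad_before = needed / 2;           // note: possibly asymmetric
  *pad_after = needed - *pad_before;
  return output;
}

int main() {
  int64_t before, after;
  int64_t out = WindowedOutputSizeSketch(4, 3, 1, 2, /*same_padding=*/true, &before, &after);
  std::cout << out << " " << before << " " << after << '\n';  // 2 0 1
}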

cuDNN only supports symmetric padding, so OpKernelContext::allocate_temp allocates a temporary buffer transformed_input.
The extra amounts needed on the four sides to make the padding symmetric are computed as input_pad_top, input_pad_bottom, input_pad_left, and input_pad_right.
PadInput::operator() pads input_param to produce transformed_input.

  const int64_t common_padding_rows = std::min(padding_top, padding_bottom);
  const int64_t common_padding_cols = std::min(padding_left, padding_right);
  if (padding_top != padding_bottom || padding_left != padding_right) {
    // cuDNN only supports padding the same amount on the left and right sides,
    // and on the top and bottom sides. So we manually create a new padded
    // input tensor such that we can pass it to cuDNN.
    VLOG(4) << "Pad input tensor:"
            << " padding_top=" << padding_top
            << " padding_bottom=" << padding_bottom
            << " padding_left=" << padding_left
            << " padding_right=" << padding_right;

    // TODO(reedwm): In some cases, we can avoid an allocation even if the two
    // padding sides are different. For example, if the input is 2x2, the filter
    // is 1x1, the stride is 2, and the padding is (1, 0, 1, 0), the result is
    // equivalent to as if the padding is (1, 1, 1, 1). Changing the padding in
    // such a way would allow us to avoid the allocation.
    Tensor transformed_input;
    const int64_t padding_rows_diff = std::abs(padding_bottom - padding_top);
    const int64_t padding_cols_diff = std::abs(padding_right - padding_left);
    const int64_t new_in_rows = in_rows + padding_rows_diff;
    const int64_t new_in_cols = in_cols + padding_cols_diff;
    OP_REQUIRES_OK(ctx, ctx->allocate_temp(
                            DataTypeToEnum<T>::value,
                            ShapeFromFormat(data_format, in_batch, new_in_rows,
                                            new_in_cols, in_depths),
                            &transformed_input));

    const int64_t input_pad_top = padding_top - common_padding_rows;
    const int64_t input_pad_bottom = padding_bottom - common_padding_rows;
    const int64_t input_pad_left = padding_left - common_padding_cols;
    const int64_t input_pad_right = padding_right - common_padding_cols;
    bool in_bounds =
        FastBoundsCheck(input_pad_top, std::numeric_limits<int>::max()) &&
        FastBoundsCheck(input_pad_bottom, std::numeric_limits<int>::max()) &&
        FastBoundsCheck(input_pad_left, std::numeric_limits<int>::max()) &&
        FastBoundsCheck(input_pad_right, std::numeric_limits<int>::max());
    if (!in_bounds) {
      ctx->SetStatus(errors::InvalidArgument("Padding is too large."));
      return;
    }
    functor::PadInput<GPUDevice, T, int, 4>()(
        ctx->eigen_device<GPUDevice>(), To32Bit(input_param.tensor<T, 4>()),
        {{static_cast<int>(input_pad_top), static_cast<int>(input_pad_left)}},
        {{static_cast<int>(input_pad_bottom),
          static_cast<int>(input_pad_right)}},
        To32Bit(transformed_input.tensor<T, 4>()), data_format, T{});

    input = transformed_input;
    in_rows = new_in_rows;
    in_cols = new_in_cols;
  }
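A small worked example of this adjustment (standalone sketch): with in_rows = 4, a 3x3 filter, stride 2 and SAME padding, the row padding is top = 0, bottom = 1, which cuDNN cannot express. The common part is min(0, 1) = 0, so one extra row is padded into the input manually and cuDNN then runs with zero vertical padding:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <iostream>

int main() {
  const int64_t in_rows = 4, padding_top = 0, padding_bottom = 1;

  // Part of the padding cuDNN can express (symmetric)...
  const int64_t common_padding_rows = std::min(padding_top, padding_bottom);
  // ...and the remainder that has to be baked into the input tensor.
  const int64_t input_pad_top = padding_top - common_padding_rows;
  const int64_t input_pad_bottom = padding_bottom - common_padding_rows;
  const int64_t new_in_rows = in_rows + std::abs(padding_bottom - padding_top);

  std::cout << "cuDNN zero_padding_height = " << common_padding_rows << '\n';  // 0
  std::cout << "manual pad (top, bottom)  = (" << input_pad_top << ", "
            << input_pad_bottom << ")\n";                                      // (0, 1)
  std::cout << "padded in_rows            = " << new_in_rows << '\n';          // 5
}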

If the input format is NHWC but the compute format is NCHW, NHWCToNCHW is called to convert input.

  if (data_format == FORMAT_NHWC && compute_data_format == FORMAT_NCHW) {
    VLOG(4) << "Convert the input tensor from NHWC to NCHW.";

    TensorShape nchw_shape =
        ShapeFromFormat(FORMAT_NCHW, in_batch, in_rows, in_cols, in_depths);
    if (in_depths > 1) {
      Tensor transformed_input;
      OP_REQUIRES_OK(ctx, ctx->allocate_temp(DataTypeToEnum<T>::value,
                                             nchw_shape, &transformed_input));
      functor::NHWCToNCHW<GPUDevice, T, 4>()(
          ctx->eigen_device<GPUDevice>(),
          const_cast<const Tensor&>(input).tensor<T, 4>(),
          transformed_input.tensor<T, 4>());
      input = transformed_input;
    } else {
      // If depth <= 1, then just reshape.
      CHECK(input.CopyFrom(input, nchw_shape));
    }
  } else {
    CHECK(data_format == compute_data_format)  // Crash OK
        << "Illegal data and compute format pair:"
        << " data_format=" << ToString(data_format)
        << " compute_data_format=" << ToString(compute_data_format);
  }

  CHECK(common_padding_rows >= 0 && common_padding_cols >= 0)  // Crash OK
      << "Negative row or col paddings: (" << common_padding_rows << ", "
      << common_padding_cols << ")";

  constexpr auto kComputeInNHWC =
      std::make_tuple(se::dnn::DataLayout::kBatchYXDepth,
                      se::dnn::FilterLayout::kOutputYXInput);
  constexpr auto kComputeInNCHW =
      std::make_tuple(se::dnn::DataLayout::kBatchDepthYX,
                      se::dnn::FilterLayout::kOutputInputYX);

  se::dnn::DataLayout compute_data_layout;
  se::dnn::FilterLayout filter_layout;

  std::tie(compute_data_layout, filter_layout) =
      compute_data_format == FORMAT_NHWC ? kComputeInNHWC : kComputeInNCHW;

  se::dnn::BatchDescriptor input_desc;
  input_desc.set_count(in_batch)
      .set_feature_map_count(in_depths)
      .set_height(in_rows)
      .set_width(in_cols)
      .set_layout(compute_data_layout);
  se::dnn::BatchDescriptor output_desc;
  output_desc.set_count(out_batch)
      .set_height(out_rows)
      .set_width(out_cols)
      .set_feature_map_count(out_depths)
      .set_layout(compute_data_layout);
  se::dnn::FilterDescriptor filter_desc;
  filter_desc.set_input_filter_height(patch_rows)
      .set_input_filter_width(patch_cols)
      .set_input_feature_map_count(patch_depths)
      .set_output_feature_map_count(filter.dim_size(3))
      .set_layout(filter_layout);
  se::dnn::ConvolutionDescriptor conv_desc;
  conv_desc.set_vertical_dilation_rate(row_dilation)
      .set_horizontal_dilation_rate(col_dilation)
      .set_vertical_filter_stride(row_stride)
      .set_horizontal_filter_stride(col_stride)
      .set_zero_padding_height(common_padding_rows)
      .set_zero_padding_width(common_padding_cols)
      .set_group_count(in_depths / patch_depths);

The filter tensor is then transformed from HWIO into the layout required by the compute format.

  Tensor transformed_filter;

  const auto transform_filter = [&](FilterTensorFormat dst_format) -> Status {
    VLOG(4) << "Transform filter tensor from " << ToString(FORMAT_HWIO)
            << " to " << ToString(dst_format);

    TensorShape dst_shape =
        dst_format == FORMAT_OIHW
            ? TensorShape({filter.dim_size(3), filter.dim_size(2),
                           filter.dim_size(0), filter.dim_size(1)})
            : TensorShape({filter.dim_size(3), filter.dim_size(0),
                           filter.dim_size(1), filter.dim_size(2)});

    TF_RETURN_IF_ERROR(ctx->allocate_temp(DataTypeToEnum<T>::value, dst_shape,
                                          &transformed_filter));
    functor::TransformFilter<GPUDevice, T, int, 4>()(
        ctx->eigen_device<GPUDevice>(), dst_format,
        To32Bit(filter.tensor<T, 4>()),
        To32Bit(transformed_filter.tensor<T, 4>()));

    return Status::OK();
  };

  if (compute_data_format == FORMAT_NCHW) {
    OP_REQUIRES_OK(ctx, transform_filter(FORMAT_OIHW));
  } else if (compute_data_format == FORMAT_NHWC) {
    OP_REQUIRES_OK(ctx, transform_filter(FORMAT_OHWI));
  } else {
    ctx->SetStatus(errors::InvalidArgument("Invalid compute data format: ",
                                           ToString(compute_data_format)));
    return;
  }
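The index mapping performed by TransformFilter is a plain 4-D transpose of the HWIO filter into OIHW (for NCHW compute) or OHWI (for NHWC compute). A standalone sketch of the OIHW case:

#include <cstdint>
#include <iostream>
#include <vector>

// dst[o][i][h][w] = src[h][w][i][o]; both buffers are dense row-major.
std::vector<float> HwioToOihw(const std::vector<float>& src,
                              int64_t H, int64_t W, int64_t I, int64_t O) {
  std::vector<float> dst(H * W * I * O);
  for (int64_t h = 0; h < H; ++h)
    for (int64_t w = 0; w < W; ++w)
      for (int64_t i = 0; i < I; ++i)
        for (int64_t o = 0; o < O; ++o)
          dst[((o * I + i) * H + h) * W + w] = src[((h * W + w) * I + i) * O + o];
  return dst;
}

int main() {
  // 2x2 filter, 1 input channel, 3 output channels.
  std::vector<float> hwio(2 * 2 * 1 * 3);
  for (size_t n = 0; n < hwio.size(); ++n) hwio[n] = static_cast<float>(n);
  auto oihw = HwioToOihw(hwio, 2, 2, 1, 3);
  std::cout << oihw[0] << " " << oihw[1] << '\n';  // (o=0,h=0,w=0)=0, (o=0,h=0,w=1)=3
}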

If the output format differs from the compute format, a temporary buffer transformed_output is allocated.

  Tensor transformed_output;
  if (data_format != compute_data_format) {
    VLOG(4) << "Allocate temporary memory for output in compute data format";
    OP_REQUIRES_OK(
        ctx, ctx->allocate_temp(DataTypeToEnum<T>::value,
                                ShapeFromFormat(compute_data_format, out_batch,
                                                out_rows, out_cols, out_depths),
                                &transformed_output));
  } else {
    transformed_output = *output;
  }

GetDnnWorkspaceLimit reads the workspace size limit from an environment variable.
ConvAutotuneMap is an AutotuneSingleton; AutotuneSingleton::GetInstance returns the AutotuneMap object.
AutotuneUnfusedConv uses cuDNN autotuning to obtain an AutotuneEntry.
LaunchAutotunedConv executes the ConvRunner.
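ConvAutotuneMap itself is a process-wide singleton around an AutotuneMap keyed by ConvParameters. A simplified standalone sketch of the singleton pattern (hypothetical simplified types; the real AutotuneSingleton lives in gpu_utils.h):

#include <iostream>
#include <map>
#include <string>

// One lazily created, never destroyed map per "group" type.
template <class Group, class Parameters, class Config>
class AutotuneSingletonSketch {
 public:
  static std::map<Parameters, Config>* GetInstance() {
    static auto* instance = new std::map<Parameters, Config>();  // leaked on purpose
    return instance;
  }
};

struct ConvAutotuneGroup {};
using ConvAutotuneMapSketch =
    AutotuneSingletonSketch<ConvAutotuneGroup, std::string, int>;

int main() {
  (*ConvAutotuneMapSketch::GetInstance())["conv_3x3"] = 42;
  std::cout << ConvAutotuneMapSketch::GetInstance()->at("conv_3x3") << '\n';  // 42
}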

  auto input_ptr = AsDeviceMemory(input.template flat<T>().data(),
                                  input.template flat<T>().size());
  auto filter_ptr =
      AsDeviceMemory(transformed_filter.template flat<T>().data(),
                     transformed_filter.template flat<T>().size());
  auto output_ptr =
      AsDeviceMemory(transformed_output.template flat<T>().data(),
                     transformed_output.template flat<T>().size());

  static int64_t ConvolveScratchSize = GetDnnWorkspaceLimit(
      // default value is in bytes despite the name of the environment variable
      "TF_CUDNN_WORKSPACE_LIMIT_IN_MB", 1LL << 32  // 4GB
  );

  int device_id = stream->parent()->device_ordinal();
  DataType dtype = input.dtype();
  ConvParameters conv_parameters = {in_batch,             // batch
                                    in_depths,            // in_depths
                                    {{in_rows,            // in_rows
                                      in_cols}},          // in_cols
                                    compute_data_format,  // compute_data_format
                                    out_depths,           // out_depths
                                    {{patch_rows,         // filter_rows
                                      patch_cols,         // filter_cols
                                      patch_depths}},     // filter_depths
                                    {{row_dilation,       // dilation_rows
                                      col_dilation}},     // dilation_cols
                                    {{row_stride,         // stride_rows
                                      col_stride}},       // stride_cols
                                    {{common_padding_rows,    // padding_rows
                                      common_padding_cols}},  // padding_cols
                                    dtype,                    // tensor datatype
                                    device_id,                // device_id
                                    conv_desc.group_count()};

  auto entry_or = AutotuneUnfusedConv(
      cudnn_use_autotune, ConvAutotuneMap::GetInstance(), conv_parameters, ctx,
      se::dnn::ConvolutionKind::FORWARD, input_desc, input_ptr, filter_desc,
      filter_ptr, conv_desc, output_desc, output_ptr, ConvolveScratchSize);
  OP_REQUIRES_OK(ctx, entry_or.status());
  auto autotune_entry = entry_or.ConsumeValueOrDie();

  DnnScratchAllocator scratch_allocator(ConvolveScratchSize, ctx);
  Status cudnn_launch_status = LaunchAutotunedConv(
      autotune_entry, &scratch_allocator, se::dnn::ConvolutionKind::FORWARD,
      stream, input_desc, input_ptr, filter_desc, filter_ptr, conv_desc,
      output_desc, output_ptr);
  if (!cudnn_launch_status.ok()) {
    ctx->SetStatus(cudnn_launch_status);
    return;
  }

  if (data_format == FORMAT_NHWC && compute_data_format == FORMAT_NCHW) {
    VLOG(4) << "Convert the output tensor back from NCHW to NHWC.";
    functor::NCHWToNHWC<GPUDevice, T, 4>()(
        ctx->eigen_device<GPUDevice>(),
        const_cast<const Tensor&>(transformed_output).tensor<T, 4>(),
        output->tensor<T, 4>());
  }
}

AutotuneUnfusedConv

[Flowchart: AutotuneUnfusedConv — autotune_map, conv_parameters, ctx → AutotuneMap::Find hit? yes → return autotune_entry; no → WrapRedzoneBestEffort → AutotuneConvImpl → LogConvAutotuneResults → BestCudnnConvAlgorithm → AutotuneMap::Insert]

se::dnn::ConvOp implements the concept that LazyOpRunner requires for ConvRunner.
AutotuneMap is a helper class that looks up the best autotuned config from parameters.
BatchDescriptor describes the dimensions a layer consumes or produces.
FilterDescriptor does the same for the filter.
AutotuneEntry is an autotuning-map entry that supports the cuDNN frontend API; ROCm still uses the legacy API and needs an AlgorithmConfig.
ConvParameters uniquely identifies a convolution operation running on a particular device model.
AutotuneMap::Find looks up a config keyed by the parameters.
ScopedAnnotation annotates all activities during the instance's lifetime via the currently registered TraceCollector.

template <typename T>
StatusOr<AutotuneEntry<se::dnn::ConvOp>> AutotuneUnfusedConv(
    bool cudnn_use_autotune,
    AutotuneMap<ConvParameters, AutotuneEntry<se::dnn::ConvOp>>* autotune_map,
    const ConvParameters& conv_parameters, OpKernelContext* ctx,
    se::dnn::ConvolutionKind kind, const se::dnn::BatchDescriptor& input_desc,
    se::DeviceMemory<T> input_ptr, const se::dnn::FilterDescriptor& filter_desc,
    se::DeviceMemory<T> filter_ptr,
    const se::dnn::ConvolutionDescriptor& conv_desc,
    const se::dnn::BatchDescriptor& output_desc, se::DeviceMemory<T> output_ptr,
    int64_t scratch_size_limit) {
  AutotuneEntry<se::dnn::ConvOp> autotune_entry;

  auto* stream = ctx->op_device_context()->stream();

  if (!autotune_map->Find(conv_parameters, &autotune_entry)) {
    profiler::ScopedAnnotation annotation("cudnn_autotuning");

TfAllocatorAdapter is an adapter class that wraps a TensorFlow allocator.
RedzoneAllocator allocates a little extra memory at the start and end of each allocation and can check that this memory was not modified.
WrapRedzoneBestEffort calls RedzoneAllocator::AllocateBytes to allocate the DeviceMemory.

#if GOOGLE_CUDA
    se::TfAllocatorAdapter tf_allocator_adapter(ctx->device()->GetAllocator({}),
                                                stream);
    se::RedzoneAllocator rz_allocator(stream, &tf_allocator_adapter,
                                      se::GpuAsmOpts());

    // TODO(awpr): second-guess whether it's okay that this profiles
    // convolutions on uninitialized memory.
    switch (kind) {
      case se::dnn::ConvolutionKind::FORWARD:
      case se::dnn::ConvolutionKind::FORWARD_BIAS_ACTIVATION:
        output_ptr = se::DeviceMemory<T>(
            WrapRedzoneBestEffort(&rz_allocator, output_ptr));
        break;
      case se::dnn::ConvolutionKind::BACKWARD_DATA:
        input_ptr = se::DeviceMemory<T>(
            WrapRedzoneBestEffort(&rz_allocator, input_ptr));
        break;
      case se::dnn::ConvolutionKind::BACKWARD_FILTER:
        filter_ptr = se::DeviceMemory<T>(
            WrapRedzoneBestEffort(&rz_allocator, filter_ptr));
        break;
      default:
        return errors::InvalidArgument(
            absl::StrFormat("Unknown ConvolutionKind %d", kind));
    }

launch_func invokes the se::dnn::ConvRunner backend.
AutotuneConvImpl executes the passed-in launch_func and yields a set of AutotuneResults.
LogConvAutotuneResults records them in an AutotuningLog.

    const auto element_type = se::dnn::ToDataType<T>::value;
    std::vector<std::unique_ptr<const se::dnn::ConvRunner>> runners;
    TF_RETURN_IF_ERROR(stream->parent()->GetConvolveRunners(
        CudnnUseFrontend(), kind, element_type, element_type, stream,
        input_desc, input_ptr, filter_desc, filter_ptr, output_desc, output_ptr,
        conv_desc, /*use_fallback=*/false, &rz_allocator, &runners));
    auto launch_func =
        [&](se::ScratchAllocator* allocator_used,
            const std::unique_ptr<const se::dnn::ConvRunner>& runner,
            se::dnn::ProfileResult* profile_result) -> Status {
      TF_ASSIGN_OR_RETURN(auto scratch, allocator_used->AllocateBytes(
                                            runner->GetWorkspaceSize()));
      return (*runner)(stream, profile_result, scratch, input_ptr, filter_ptr,
                       output_ptr);
    };
    SE_ASSIGN_OR_RETURN(
        auto results,
        AutotuneConvImpl(ctx, runners, cudnn_use_autotune, launch_func,
                         scratch_size_limit, rz_allocator));

    LogConvAutotuneResults(kind, se::dnn::ToDataType<T>::value, input_ptr,
                           filter_ptr, output_ptr, input_desc, filter_desc,
                           output_desc, conv_desc, stream->parent(), results);

Two-level autotuning: the cuDNN frontend provides two engine lists, heuristics and fallback. Heuristics engines are normally faster, so to reduce autotuning time the fallback engines are only evaluated when none of the heuristics engines work.
BestCudnnConvAlgorithm creates an AutotuneEntry from the best result.

    // Two-level autotuning: Cudnn frontend supports two engine lists:
    // heuristics and fallback. Heuristics engines are normally faster.
    // To reduce autotuning time, we evaluate the fallback engines only when
    // none of the heuristics engines work.
    bool found_working_engine = false;
    for (auto& result : results) {
      if (!result.has_failure()) {
        found_working_engine = true;
        break;
      }
    }

    if (!CudnnUseFrontend() || found_working_engine) {
      SE_ASSIGN_OR_RETURN(
          autotune_entry,
          BestCudnnConvAlgorithm<se::dnn::ConvOp>(results, std::move(runners)));
    } else {
      LOG(WARNING)
          << "None of the algorithms provided by cuDNN frontend heuristics "
             "worked; trying fallback algorithms.  Conv: "
          << conv_parameters.ToString();
      std::vector<std::unique_ptr<const se::dnn::ConvRunner>> fallback_runners;
      TF_RETURN_IF_ERROR(stream->parent()->GetConvolveRunners(
          CudnnUseFrontend(), kind, element_type, element_type, stream,
          input_desc, input_ptr, filter_desc, filter_ptr, output_desc,
          output_ptr, conv_desc, /*use_fallback=*/true, &rz_allocator,
          &fallback_runners));

      SE_ASSIGN_OR_RETURN(
          auto fallback_results,
          AutotuneConvImpl(ctx, fallback_runners, cudnn_use_autotune,
                           launch_func, scratch_size_limit, rz_allocator));

      LogConvAutotuneResults(kind, se::dnn::ToDataType<T>::value, input_ptr,
                             filter_ptr, output_ptr, input_desc, filter_desc,
                             output_desc, conv_desc, stream->parent(),
                             fallback_results);

      SE_ASSIGN_OR_RETURN(autotune_entry,
                          BestCudnnConvAlgorithm<se::dnn::ConvOp>(
                              fallback_results, std::move(fallback_runners)));
    }

The ROCm implementation:

#elif TENSORFLOW_USE_ROCM
    DnnScratchAllocator scratch_allocator(scratch_size_limit, ctx);

    std::vector<se::dnn::ProfileResult> algorithms;
    if (!stream->parent()->GetMIOpenConvolveAlgorithms(
            kind, se::dnn::ToDataType<T>::value, stream, input_desc, input_ptr,
            filter_desc, filter_ptr, output_desc, output_ptr, conv_desc,
            &scratch_allocator, &algorithms)) {
      return errors::Unknown(
          "Failed to get convolution algorithm. This is probably "
          "because MIOpen failed to initialize, so try looking to "
          "see if a warning log message was printed above.");
    }

    std::vector<tensorflow::AutotuneResult> results;
    if (algorithms.size() == 1) {
      auto profile_result = algorithms[0];
      results.emplace_back();
      auto& result = results.back();
      *result.mutable_algorithm() = profile_result.algorithm().ToProto();

      result.set_scratch_bytes(profile_result.scratch_size());
      *result.mutable_run_time() = proto_utils::ToDurationProto(
          absl::Milliseconds(profile_result.elapsed_time_in_ms()));
    } else {
      for (auto miopen_algorithm : algorithms) {
        auto profile_algorithm = miopen_algorithm.algorithm();
        se::dnn::ProfileResult profile_result;
        auto miopen_launch_status = stream->ConvolveWithAlgorithm(
            kind, input_desc, input_ptr, filter_desc, filter_ptr, output_desc,
            output_ptr, conv_desc, &scratch_allocator,
            se::dnn::AlgorithmConfig(profile_algorithm,
                                     miopen_algorithm.scratch_size()),
            &profile_result);
        if (miopen_launch_status.ok() && profile_result.is_valid()) {
          results.emplace_back();
          auto& result = results.back();
          *result.mutable_algorithm() = profile_algorithm.ToProto();

          result.set_scratch_bytes(scratch_allocator.TotalByteSize());
          *result.mutable_run_time() = proto_utils::ToDurationProto(
              absl::Milliseconds(profile_result.elapsed_time_in_ms()));
        }
      }
    }
    LogConvAutotuneResults(kind, se::dnn::ToDataType<T>::value, input_ptr,
                           filter_ptr, output_ptr, input_desc, filter_desc,
                           output_desc, conv_desc, stream->parent(), results);

    SE_ASSIGN_OR_RETURN(auto algo_desc, BestCudnnConvAlgorithm(results));
    autotune_entry = AutotuneEntry<se::dnn::ConvOp>(algo_desc);
#endif

AutotuneMap::Insert inserts the convolution parameters and the corresponding AutotuneEntry into the AutotuneMap.

    autotune_map->Insert(conv_parameters, autotune_entry);
  }

  return autotune_entry;
}

AutotuneMap


// A helper class that looks up the best autotuned config from parameters.
// Due to the noisy nature of autotune, especially with multiple devices, it
// only accepts a config if its margin exceeds a threshold.
// For the same shape configs, if a new best config matches the previous best,
// they get promoted; otherwise, the winner gets demoted. This process stops
// when the winner's score exceeds the threshold.
// In a bad case when two configs are very close to each other and flips
// back and forth randomly, the expected number of experiments before autotune
// settles is O(threshold ^ 2). So we recommend that number of warmup runs
// for any benchmarks.
template <typename Parameters, typename Config>
class AutotuneMap {
 private:
  // Retrieves the hash code of Parameters class.
  struct Hasher {
    std::size_t operator()(const Parameters& parameter) const {
      return parameter.hash();
    }
  };

AutotuneMap::Find

If the score is below the minimum threshold and the maximum autotune count has not yet been reached, Find returns false. Under this mechanism the number of autotuning runs is not fixed.

 public:
  bool Find(const Parameters& params, Config* config) const {
    mutex_lock lock(mu_);
    auto iter = params_config_map_.find(params);
    if (iter == params_config_map_.end() ||
        (iter->second.score < min_score_threshold_ &&
         iter->second.count <= max_autotune_count_)) {
      return false;
    }
    *config = iter->second.config;
    return true;
  }

AutotuneMap::Insert

The score is an internally defined mechanism: a new entry starts with a score of 1, and min_score_threshold_ defaults to 1, so does that mean only the old entry is kept? (With the default threshold of 1, a freshly created entry is in fact accepted immediately by Find.)
First, params_config_map_ is checked for the parameters.
If there is no entry yet, one is created with a default score of 1;
otherwise, if the existing score is below the minimum threshold and the autotune count has not reached its limit, a different config demotes the current winner and an identical config promotes it.
If min_score_threshold_ is 2, only stable values are retained: a new config starts at score 1, a matching follow-up insertion promotes it to 2 and it is accepted, while a mismatching one demotes it to 0 and erases the entry.

  void Insert(const Parameters& params, const Config& config) {
    mutex_lock lock(mu_);
    auto iter = params_config_map_.find(params);
    int new_score = 0;
    if (iter == params_config_map_.end()) {
      // Create a new entry if params is new.
      VLOG(1) << GetActionSummary("creates", params, config);
      params_config_map_.insert(
          std::make_pair(params, ValueType{config, 1, 1}));
      new_score = 1;
    } else if (iter->second.score < min_score_threshold_ &&
               iter->second.count <= max_autotune_count_) {
      DCHECK_GT(iter->second.score, 0);
      if (iter->second.config != config) {
        // If it is different from the current winner, demotes the winner.
        VLOG(1) << GetActionSummary("demotes", params, config);
        new_score = --iter->second.score;
        ++iter->second.count;
        if (new_score <= 0) {
          VLOG(1) << GetActionSummary("erases", params, config);
          params_config_map_.erase(iter);
        }
      } else {
        // If it is the same as the current winner, promotes the winner.
        VLOG(1) << GetActionSummary("promotes", params, config);
        new_score = ++iter->second.score;
        ++iter->second.count;
      }
    }

If new_score is below the minimum threshold but the global autotune count has already exceeded its limit, the current config (or the winner already present in the map) is accepted and its score is set to min_score_threshold_.

    if (new_score >= min_score_threshold_) {
      VLOG(1) << GetActionSummary("accepts", params, config);
    } else if (autotune_global_count_ >= max_autotune_global_count_) {
      // The autotuning exceeds the max iteration threshold and we accept the
      // the winner if it exists in the map, otherwise we accept the current
      // winner.
      auto winner = params_config_map_.find(params);
      if (winner == params_config_map_.end()) {
        VLOG(1) << GetActionSummary("creates", params, config);
        for (int i = 0; i < min_score_threshold_; ++i) {
          VLOG(1) << GetActionSummary("promotes", params, config);
        }
        params_config_map_.insert(
            std::make_pair(params, ValueType{config, min_score_threshold_, 1}));
      } else {
        int promotes_times = min_score_threshold_ - winner->second.score;
        for (int i = 0; i < promotes_times; ++i) {
          VLOG(1) << GetActionSummary("promotes", params, config);
        }
        winner->second.score = min_score_threshold_;
      }
      VLOG(1) << GetActionSummary("accepts", params, config);
    }
    autotune_global_count_++;
  }
  std::unordered_map<Parameters, Config, Hasher> GetMap() const {
    mutex_lock lock(mu_);
    std::unordered_map<Parameters, Config, Hasher> map;
    for (const auto& entry : params_config_map_) {
      map.insert(std::make_pair(entry.first, entry.second.config));
    }
    return map;
  }
  // Only for testing
  void ClearMap() {
    mutex_lock lock(mu_);
    params_config_map_.clear();
  }

 private:
  // Underlying data structure of values in the map.
  struct ValueType {
    Config config;
    int32 score;
    int32 count;
  };

If min_score_threshold_ is left at its default, max_autotune_count_ is at least min_warmup_iterations (it is max(5 * threshold^2, min_warmup_iterations)), and max_autotune_global_count_ is twice max_autotune_count_.

  AutotuneMap(const std::string& name) : name_(name) {
    min_score_threshold_ = 1;
    int min_warmup_iterations = 10;
    const char* threshold_str = getenv("TF_AUTOTUNE_THRESHOLD");
    if (threshold_str != nullptr) {
      VLOG(1) << "TF_AUTOTUNE_THRESHOLD = " << threshold_str;
      strings::safe_strto32(threshold_str, &min_score_threshold_);
    }
    const char* min_warmup_iteration_str =
        getenv("TF_AUTOTUNE_MIN_WARMUP_ITERATIONS");
    if (min_warmup_iteration_str != nullptr) {
      strings::safe_strto32(min_warmup_iteration_str, &min_warmup_iterations);
    }
    min_score_threshold_ = std::max(min_score_threshold_, 1);
    max_autotune_count_ = std::max(
        5 * min_score_threshold_ * min_score_threshold_, min_warmup_iterations);
    max_autotune_global_count_ = 2 * max_autotune_count_;
    autotune_global_count_ = 0;
  }

  template <class Group, class Params, class Cfg>
  friend class AutotuneSingleton;

  std::string GetActionSummary(StringPiece action, const Parameters& params,
                               const Config& config) {
    return strings::Printf("autotune_map %s %s: %s -> (%s)", name_.c_str(),
                           string(action).c_str(), params.ToString().c_str(),
                           config.ToString().c_str());
  }

  mutable mutex mu_;

  std::unordered_map<Parameters, ValueType, Hasher> params_config_map_
      TF_GUARDED_BY(mu_);
  std::string name_;
  int32 min_score_threshold_;
  int32 max_autotune_count_;
  int32 max_autotune_global_count_;
  int32 autotune_global_count_;

  TF_DISALLOW_COPY_AND_ASSIGN(AutotuneMap);
};
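To make the scoring concrete, here is a small standalone simulation of the promote/demote path described above (the global-count escape hatch is omitted; the types are hypothetical, with a string standing in for the config):

#include <iostream>
#include <map>
#include <string>

struct Value { std::string config; int score; int count; };

class ScoreMapSketch {
 public:
  ScoreMapSketch(int threshold, int max_count)
      : threshold_(threshold), max_count_(max_count) {}

  // Fails while the entry's score is still below the threshold.
  bool Find(const std::string& key, std::string* config) const {
    auto it = map_.find(key);
    if (it == map_.end() ||
        (it->second.score < threshold_ && it->second.count <= max_count_))
      return false;
    *config = it->second.config;
    return true;
  }

  void Insert(const std::string& key, const std::string& config) {
    auto it = map_.find(key);
    if (it == map_.end()) {
      map_[key] = Value{config, 1, 1};            // new entry starts at score 1
    } else if (it->second.score < threshold_ && it->second.count <= max_count_) {
      if (it->second.config != config) {
        int new_score = --it->second.score;       // different winner: demote
        ++it->second.count;
        if (new_score <= 0) map_.erase(it);
      } else {
        ++it->second.score;                       // same winner: promote
        ++it->second.count;
      }
    }
  }

 private:
  std::map<std::string, Value> map_;
  int threshold_, max_count_;
};

int main() {
  ScoreMapSketch m(/*threshold=*/2, /*max_count=*/10);
  std::string best;
  m.Insert("conv_3x3", "algo_A");
  std::cout << m.Find("conv_3x3", &best) << '\n';  // 0: score 1 < threshold 2
  m.Insert("conv_3x3", "algo_A");
  std::cout << m.Find("conv_3x3", &best) << '\n';  // 1: promoted to 2, accepted
}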

AutotuneConvImpl

A device may subclass DeviceContext to pass device-specific context to OpKernels.
se::TfAllocatorAdapter is an adapter class that wraps a TensorFlow allocator.

template <typename LaunchFunc, typename Sig>
StatusOr<std::vector<tensorflow::AutotuneResult>> AutotuneConvImpl(
    OpKernelContext* ctx,
    std::vector<std::unique_ptr<const se::dnn::OpRunner<Sig>>>& runners,
    bool actually_do_autotune, const LaunchFunc& launch_func,
    size_t scratch_size_limit, const se::RedzoneAllocator& rz_allocator) {
  auto* stream = ctx->op_device_context()->stream();

  se::TfAllocatorAdapter tf_allocator_adapter(ctx->device()->GetAllocator({}),
                                              stream);

se::dnn::OpRunner is an abstract class that owns cached state for a particular op configuration. Its main motivation is that cuDNN backend execution plans (ExecutionPlan) are expensive to recreate; every OpRunner must outlive its parent Stream.
RedzoneAllocator is the redzone-checking allocator described above.
CudnnLegacyConvRunner::ToAlgorithmDesc calls CudnnLegacyConvRunner::MakeAlgorithmDesc to create a dnn::AlgorithmDesc object.
se::dnn::ProfileResult describes the result of a perf experiment.
If autotuning is actually requested, launch_func is called; otherwise the fields of profile_result are filled in manually.

  std::vector<tensorflow::AutotuneResult> results;
  // TODO(reedwm): Warn if determinism is enabled after autotune is run
  for (auto& runner : runners) {
    // TODO(zhengxq): profile each algorithm multiple times to better
    // accuracy.
    se::RedzoneAllocator rz_scratch_allocator(
        stream, &tf_allocator_adapter, se::GpuAsmOpts(),
        /*memory_limit=*/scratch_size_limit);
    DnnScratchAllocator scratch_allocator(scratch_size_limit, ctx);
    se::ScratchAllocator* allocator_used =
        !RedzoneCheckDisabled()
            ? static_cast<se::ScratchAllocator*>(&rz_scratch_allocator)
            : static_cast<se::ScratchAllocator*>(&scratch_allocator);

    SE_ASSIGN_OR_RETURN(auto desc, runner->ToAlgorithmDesc());
    se::dnn::ProfileResult profile_result;
    Status cudnn_launch_status =
        actually_do_autotune
            ? launch_func(allocator_used, runner, &profile_result)
            : OkStatus();
    if (!actually_do_autotune) {
      // Make the result valid according to `is_valid`.
      profile_result.set_algorithm(desc);
      profile_result.set_elapsed_time_in_ms(0);
    }

Can the runner fail at run time?
ProfileResult::is_valid checks that an AlgorithmDesc is present and the elapsed time is sane.
RedzoneCheckDisabled reads the TF_DISABLE_RZ_CHECK environment variable.
RedzoneAllocator::TotalAllocatedBytesExcludingRedzones returns the number of bytes allocated.
DnnScratchAllocator::TotalByteSize returns the total scratch size.
proto_utils::ToDurationProto converts an absl::Duration into a google::protobuf::Duration.

    // We need to make sure the profiling results are one-to-one with the
    // "runners". So, we insert dummy results when the execution fails.
    results.emplace_back();
    auto& result = results.back();
    *result.mutable_algorithm() = desc.ToProto();
    if (cudnn_launch_status.ok() && profile_result.is_valid()) {
      result.set_scratch_bytes(
          !RedzoneCheckDisabled()
              ? rz_scratch_allocator.TotalAllocatedBytesExcludingRedzones()
              : scratch_allocator.TotalByteSize());
      *result.mutable_run_time() = proto_utils::ToDurationProto(
          absl::Milliseconds(profile_result.elapsed_time_in_ms()));

      CheckRedzones(rz_scratch_allocator, &result);
      CheckRedzones(rz_allocator, &result);
    } else {
      result.mutable_failure()->set_kind(AutotuneResult::UNKNOWN);
      result.mutable_failure()->set_msg(
          absl::StrCat("Profiling failure on CUDNN engine ", desc.ToString(),
                       ": ", cudnn_launch_status.ToString()));
    }
  }

  return results;
}

LogConvAutotuneResults

AutotuningLog contains the AutotuneResults together with software and hardware information.
ConvolutionProto records the convolution information.

void LogConvAutotuneResults(se::dnn::ConvolutionKind kind,
                            se::dnn::DataType element_type,
                            se::DeviceMemoryBase input_buffer,
                            se::DeviceMemoryBase filter_buffer,
                            se::DeviceMemoryBase output_buffer,
                            const se::dnn::BatchDescriptor& input_desc,
                            const se::dnn::FilterDescriptor& filter_desc,
                            const se::dnn::BatchDescriptor& output_desc,
                            const se::dnn::ConvolutionDescriptor& conv_desc,
                            se::StreamExecutor* stream_exec,
                            absl::Span<const AutotuneResult> results) {
  AutotuningLog log;
  {
    ConvolutionProto instr;
    instr.set_kind(kind);
    *instr.mutable_input() = input_desc.ToProto(element_type);
    *instr.mutable_filter() = filter_desc.ToProto(element_type);
    *instr.mutable_output() = output_desc.ToProto(element_type);
    *instr.mutable_conv_desc() = conv_desc.ToProto();
    instr.set_conv_scale(1);
    instr.set_side_value_scale(0);
    instr.set_input_address(reinterpret_cast<uint64>(input_buffer.opaque()));
    instr.set_filter_address(reinterpret_cast<uint64>(filter_buffer.opaque()));
    instr.set_output_address(reinterpret_cast<uint64>(output_buffer.opaque()));
    log.mutable_instr()->PackFrom(std::move(instr));
  }

GetCudnnVersion and GetComputeCapability query the cuDNN version and the GPU compute capability from the StreamExecutor.

  *log.mutable_cudnn_version() = GetCudnnVersion(stream_exec);
  *log.mutable_compute_capability() = GetComputeCapability(stream_exec);
  log.set_device_pci_bus_id(stream_exec->GetDeviceDescription().pci_bus_id());
  {
    string blas_version;
    if (auto* blas = stream_exec->AsBlas()) {
      if (blas->GetVersion(&blas_version).ok()) {
        log.set_blas_version(blas_version);
      }
    }
  }
  for (const auto& result : results) {
    *log.add_results() = result;
  }
  VLOG(2) << log.DebugString();
  Logger::GetSingleton()->LogProto(log);
}

BestCudnnConvAlgorithm

BestCudnnConvAlgorithmIndices finds the index of the fastest algorithm.
AutotuneEntry::FromOpRunners initializes an entry from pre-cached OpRunners, e.g. during autotuning.
TF_ASSIGN_OR_RETURN evaluates an expression and assigns the value to a variable on success; otherwise it returns the error status.

template <typename Op>
StatusOr<AutotuneEntry<Op>> BestCudnnConvAlgorithm(
    absl::Span<const AutotuneResult> results,
    std::vector<
        std::unique_ptr<const se::dnn::OpRunner<typename Op::Signature>>>
        runners) {
  if (runners.size() != results.size()) {
    return errors::Internal(
        "Mismatched size of autotune results and runners vectors.");
  }
  int idx;
  int idx_no_scratch;
  TF_ASSIGN_OR_RETURN(std::tie(idx, idx_no_scratch),
                      BestCudnnConvAlgorithmIndices(results));
  VLOG(2) << "fastest algorithm: "
          << proto_utils::FromDurationProto(results[idx].run_time())
          << " with algo " << runners[idx]->ToString() << ", workspace bytes "
          << results[idx].scratch_bytes();
  return AutotuneEntry<Op>::FromOpRunners(
      std::move(runners[idx]), idx_no_scratch == -1 || idx_no_scratch == idx
                                   ? nullptr
                                   : std::move(runners[idx_no_scratch]));
}

BestCudnnConvAlgorithmIndices

compare_run_times compares the run times of two results.

StatusOr<std::tuple<int, int>> BestCudnnConvAlgorithmIndices(
    absl::Span<const AutotuneResult> results) {
  auto compare_run_times = [](const AutotuneResult& lhs,
                              const AutotuneResult& rhs) {
    return proto_utils::FromDurationProto(lhs.run_time()) <
           proto_utils::FromDurationProto(rhs.run_time());
  };

Iterate over the results and find the index of the fastest one.

  int idx = -1;
  int idx_no_scratch = -1;
  for (int i = 0; i < results.size(); i++) {
    if (!results[i].has_failure()) {
      if (OpDeterminismRequired()) {
        // When determinism is enabled, choose first working algorithm, and
        // don't choose a no_scratch algorithm.
        idx = i;
        break;
      }
      if (idx == -1 || compare_run_times(results[i], results[idx])) {
        idx = i;
      }
      if (results[i].scratch_bytes() == 0 &&
          (idx_no_scratch == -1 ||
           compare_run_times(results[i], results[idx_no_scratch]))) {
        idx_no_scratch = i;
      }
    }
  }

If no working algorithm was found, return an error.

  if (idx == -1) {
    std::ostringstream msg;
    msg << "No algorithm worked!  Error messages:";
    // TODO(awpr): identify the algorithm as part of this error message, too.
    for (const auto& result : results) {
      msg << "\n  " << result.failure().msg();
    }
    return errors::NotFound(msg.str());
  }

  return std::make_tuple(idx, idx_no_scratch);
}

LaunchAutotunedConv

[Flowchart: LaunchAutotunedConv — autotune_entry, scratch_allocator, kind, stream → is_algorithm_config? yes → ConvolveWithAlgorithm → End; no → AutotuneEntry::GetOpRunners → LazyOpRunner::GetOrCreateRunner → CudnnExecutionPlanRunner::operator()]

AutotuneEntry::is_algorithm_config checks whether an AlgorithmConfig is used.
AutotuneEntry::GetOpRunners returns the AutotuneEntry::OpRunners struct.
se::dnn::ConvOp::Config holds the convolution's data types and descriptors.
LazyOpRunner::GetOrCreateRunner returns an already-initialized OpRunner if one is available, otherwise creates one.
AllocateScratchOrFallback returns a pointer to the primary OpRunner of runners with scratch memory allocated when possible; otherwise it returns a pointer to the no-scratch fallback runner together with an empty DeviceMemoryBase.
ConvRunner is the runner signature with three inputs.
CudnnExecutionPlanRunner::operator() invokes the cuDNN frontend operation.

template <typename T>
Status LaunchAutotunedConv(const AutotuneEntry<se::dnn::ConvOp>& autotune_entry,
                           DnnScratchAllocator* scratch_allocator,
                           se::dnn::ConvolutionKind kind, se::Stream* stream,
                           const se::dnn::BatchDescriptor& input_desc,
                           se::DeviceMemory<T> in_ptr,
                           const se::dnn::FilterDescriptor& filter_desc,
                           se::DeviceMemory<T> filter_ptr,
                           const se::dnn::ConvolutionDescriptor& conv_desc,
                           const se::dnn::BatchDescriptor& output_desc,
                           se::DeviceMemory<T> out_ptr) {
  if (!autotune_entry.is_algorithm_config()) {
    const auto& runners = autotune_entry.GetOpRunners();
    se::dnn::DataType element_type = se::dnn::ToDataType<T>::value;
    se::dnn::ConvOp::Config config{kind,       element_type, element_type,
                                   input_desc, filter_desc,  output_desc,
                                   conv_desc};
    TF_ASSIGN_OR_RETURN(auto* primary,
                        runners.primary->GetOrCreateRunner(config, stream));

    const se::dnn::ConvRunner* no_scratch_fallback = nullptr;
    if (runners.no_scratch_fallback) {
      TF_ASSIGN_OR_RETURN(
          no_scratch_fallback,
          runners.no_scratch_fallback->GetOrCreateRunner(config, stream));
    }

    TF_ASSIGN_OR_RETURN(auto runner_and_scratch,
                        AllocateScratchOrFallback<se::dnn::ConvOp::Signature>(
                            scratch_allocator, primary, no_scratch_fallback));
    auto& runner = *std::get<const se::dnn::ConvRunner*>(runner_and_scratch);
    return runner(stream, nullptr,
                  std::get<se::DeviceMemoryBase>(runner_and_scratch), in_ptr,
                  filter_ptr, out_ptr);

Otherwise Stream::ConvolveWithAlgorithm is called.

  } else {
    return stream->ConvolveWithAlgorithm(
        kind, input_desc, in_ptr, filter_desc, filter_ptr, output_desc, out_ptr,
        conv_desc, scratch_allocator, autotune_entry.GetAlgorithmConfig(),
        nullptr);
  }
}

