fengyuzaitu

SSE2 Vectorization of Alphablend Code

Introduction
Structure of Arrays
Explanation of the Code
Benchmark Results
Conclusion
References
History

Introduction

In this article, we will explore vectorizing the pixel alphablending code from my earlier article. I believe most C/C++ programmers have tinkered with SIMD instructions such as MMX, Streaming SIMD Extensions 2 (SSE2), or even Advanced Vector Extensions (AVX). Most programmers, I believe, gave up after their first SIMD attempt because it performed no better, or even worse, than the original non-vectorized code. The main reason for the slow performance is that CPU cycles are typically wasted packing the primitive data into a vector before SIMD execution and unpacking the results into their proper place in the structure afterwards. We will dive deeper into this in the next section. The main reason programmers around the world have been slow to adopt SIMD is that SIMD code is harder to write, and the resulting code, even with intrinsics, is hard to read and harder still to understand.

In this article, we will focus only on SSE2. MMX is a 64-bit SIMD technology which does not offer a significant speedup compared to SSE2 (mostly 128-bit SIMD from Intel) on today's 64-bit processors, and Advanced Vector Extensions (AVX) (256-bit SIMD from Intel) is not chosen for this article because not many mainstream users own the latest Intel processors that can exploit AVX, whereas SSE2 has been around since 2001. (Okay, I admit, the real reason is that I do not own an AVX-capable processor. I promise to update the article when I have access to a PC with an AVX-enabled processor.) All but the oldest personal computers should house an x86/x64 processor which supports SSE2. Rather than using inline assembly code, we will use the SSE2 intrinsics provided by Visual Studio to call the SSE2 instructions, because the Microsoft 64-bit C/C++ compiler currently does not permit inline assembly code.

Structure of Arrays

Let us imagine we have a structure type named SomeStruct, as shown below, and an array of SomeStruct: essentially, an array of structures.

struct SomeStruct
{
  int    aInteger;
  double aDouble;
  short  aShort;
};

To use SSE2 arithmetic instructions on aInteger, we must first pack four aInteger into a 128 bit vector of integer type, __m128i.

Packing before SSE2 execution

After we have got the computation result, we must unpack the four aIntegers into the individual structure objects in the array.

Unpacking after SSE2 execution
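The packing and unpacking steps can be sketched as follows. This is a minimal sketch, not code from the demo application: AddToIntegers is a hypothetical function, and it assumes the element count is a multiple of 4. Notice that a single line does the arithmetic while everything else moves data in and out of the vector.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

struct SomeStruct
{
    int    aInteger;
    double aDouble;
    short  aShort;
};

// Hypothetical helper: adds `delta` to aInteger in every structure.
// This sketch only handles counts that are a multiple of 4.
void AddToIntegers(SomeStruct* items, std::size_t count, int delta)
{
    assert(count % 4 == 0);
    const __m128i vDelta = _mm_set1_epi32(delta);
    for (std::size_t i = 0; i < count; i += 4)
    {
        // Pack: gather four scattered ints into one vector.
        // _mm_set_epi32 takes its arguments from the highest lane to the lowest.
        __m128i v = _mm_set_epi32(items[i + 3].aInteger, items[i + 2].aInteger,
                                  items[i + 1].aInteger, items[i + 0].aInteger);

        v = _mm_add_epi32(v, vDelta); // the only SIMD arithmetic in the loop

        // Unpack: spill the vector to a temporary and scatter the lanes back.
        alignas(16) int lanes[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(lanes), v);
        for (int k = 0; k < 4; ++k)
            items[i + k].aInteger = lanes[k];
    }
}
```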

This is usually what kills the performance! To enhance performance, programmers are advised to pack their data into a structure of arrays rather than an array of structures; the same advice applies to GPGPU programming, where it allows memory coalescing. It is often not possible, however, to alter an existing data model to suit the SIMD data packing requirement when that layout is unnatural for the domain. In the structure shown below, arrInteger, arrDouble, and arrShort would usually be pointers to dynamically allocated arrays whose size is only known at runtime, but I show them here as arrays to emphasize that they hold arrays of values.

struct SomeStruct
{
  int    arrInteger[1000];
  double arrDouble[1000];
  short  arrShort[1000];
};
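With a structure of arrays, four neighbouring integers are already contiguous in memory, so they can be loaded and stored directly. The sketch below illustrates this under a few assumptions: AddToIntegers is a hypothetical function, the count must be a multiple of 4, and unaligned loads and stores are used because the array inside the structure is not guaranteed to start on a 16-byte boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

struct SomeStruct
{
    int    arrInteger[1000];
    double arrDouble[1000];
    short  arrShort[1000];
};

// Hypothetical helper: adds `delta` to `count` contiguous ints, four at a time.
void AddToIntegers(int* values, std::size_t count, int delta)
{
    assert(count % 4 == 0); // this sketch handles whole vectors only
    const __m128i vDelta = _mm_set1_epi32(delta);
    for (std::size_t i = 0; i < count; i += 4)
    {
        __m128i* p = reinterpret_cast<__m128i*>(values + i);
        // No packing or unpacking: load, add, and store in place.
        _mm_storeu_si128(p, _mm_add_epi32(_mm_loadu_si128(p), vDelta));
    }
}
```

Compared with the array of structures, the loop body shrinks to a load, an add, and a store.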

As the reader may have noticed, the integer, double, and short data types have different sizes. Four integers would fit into __m128i, two doubles would fit into __m128d (the 128-bit vector type for doubles), and eight shorts would fit into __m128i. This might present difficulties when the programmer writes a loop over the vector arrays, because each array advances by a different number of elements per 128-bit step. For our case, this is not a problem, because we only use the unsigned short type in this alphablend article.

How many data fit into a 128 bit

Explanation of the Code

For our SSE2 alphablend code, we will, as I have just said, use the unsigned short type to store each color channel. The reader may ask: why use unsigned short to store a value which only ranges from 0 to 255? The answer lies in the multiplication: the intermediate values will most likely exceed 255, even though the final result stays within 255. In ordinary scalar code, we can do the same computation with byte variables, without a 16-bit variable, because each byte value is automatically promoted to int before any computation takes place and is converted back to a byte when the result is stored. A SIMD lane has a fixed width and gets no such promotion, so the lane itself must be wide enough to hold the intermediate values.

We have declared a __m128i array to store the unsigned shorts for each color channel of the two images being alphablended. m_fullsize stores the full size of the array in bytes. m_arraysize is the array size counted in 128-bit elements. m_remainder holds the number of valid unsigned shorts in the last 128-bit element when the number of unsigned shorts required is not cleanly divisible by 8 (8 x 16 bits = 128 bits). For example, four integers fit into __m128i; if we have six integers to compute, an array of two __m128i elements is needed, but only two integers are used in the last __m128i element, so m_remainder would be 2.

__m128i* m_pSrc1R;
__m128i* m_pSrc1G;
__m128i* m_pSrc1B;

__m128i* m_pSrc2R;
__m128i* m_pSrc2G;
__m128i* m_pSrc2B;

// the full size of array in bytes
int m_fullsize;
// the full size of array in 128-bit chunks
int m_arraysize;
// the remainder of last 128-bit element
// which should be filled.
int m_remainder;
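To see why 16 bits per channel are enough but 8 bits are not, the lane arithmetic can be mimicked in scalar code by truncating after every step. This is only an illustration: BlendChannel16 and BlendChannel8 are hypothetical names, and the explicit casts stand in for the fixed lane width. The worst case, 255 * Alpha + 255 * (255 - Alpha) = 65025, still fits in an unsigned 16-bit lane.

```cpp
#include <cassert>
#include <cstdint>

// Blends one channel as a 16-bit SIMD lane would: every intermediate
// value is truncated to 16 bits. All inputs must be in the range 0..255.
std::uint16_t BlendChannel16(std::uint16_t src, std::uint16_t dest, std::uint16_t alpha)
{
    assert(src <= 255 && dest <= 255 && alpha <= 255);
    const std::uint16_t a = static_cast<std::uint16_t>(src * alpha);
    const std::uint16_t b = static_cast<std::uint16_t>(dest * (255 - alpha));
    const std::uint16_t sum = static_cast<std::uint16_t>(a + b); // at most 65025
    return static_cast<std::uint16_t>(sum >> 8);
}

// The same steps with 8-bit lanes: the products lose their upper bits,
// and shifting an 8-bit sum right by 8 always leaves 0.
std::uint8_t BlendChannel8(std::uint8_t src, std::uint8_t dest, std::uint8_t alpha)
{
    const std::uint8_t a = static_cast<std::uint8_t>(src * alpha);
    const std::uint8_t b = static_cast<std::uint8_t>(dest * (255 - alpha));
    const std::uint8_t sum = static_cast<std::uint8_t>(a + b);
    return static_cast<std::uint8_t>(sum >> 8);
}
```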

With the GetUShortElement method, we can get an unsigned short element from the __m128i array, using an index which counts in terms of unsigned short elements.

unsigned short& CAlphablendDlg::GetUShortElement(
  __m128i* ptr, 
  int index)
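{
  // which 128-bit element holds the value (index / 8)
  int bigIndex = index>>3;
  // position of the value inside that element (index % 8)
  int smallIndex = index - (bigIndex<<3);
  return ptr[bigIndex].m128i_u16[smallIndex];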
}
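The index arithmetic in GetUShortElement can be checked on its own. The sketch below uses a hypothetical LaneIndex type and SplitIndex function, and it computes the small index with a bit mask, which gives the same result as the subtraction above for non-negative indices. Note that the m128i_u16 member is specific to the Microsoft compiler; other compilers do not expose __m128i as a union, so this sketch deliberately stays away from it.

```cpp
#include <cassert>

// Hypothetical helper type: position of a 16-bit value inside a __m128i array.
struct LaneIndex
{
    int bigIndex;   // which 128-bit element
    int smallIndex; // which of the 8 lanes inside that element
};

LaneIndex SplitIndex(int index)
{
    assert(index >= 0);
    LaneIndex result;
    result.bigIndex = index >> 3;  // divide by 8
    result.smallIndex = index & 7; // remainder of the division by 8
    return result;
}
```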

With the Get128BitElement method, we can get a __m128i element from the __m128i array, using a bigIndex which counts in terms of __m128i elements. bigSize is the size of the __m128i array in 128-bit chunks. smallRemainder is the number of valid unsigned short elements in the last __m128i element. When we detect that it is the last element and the number of unsigned shorts is not cleanly divisible by 8, we set the unused shorts in the last __m128i element to 1, so as to prevent a division-by-zero error. It is good practice to do this for the last element regardless of whether it holds integer or floating-point values.

__m128i& CAlphablendDlg::Get128BitElement(
  __m128i* ptr, 
  int bigIndex, 
  int bigSize, 
  int smallRemainder)
{
  if(bigIndex != bigSize-1) // not last element
    return ptr[bigIndex];
  else if(smallRemainder==0) // last element
    return ptr[bigIndex];
  else // last element
  {
    for(int i=smallRemainder; i<8; ++i)
      ptr[bigIndex].m128i_u16[i] = 1;
    return ptr[bigIndex];
  }
}
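An alternative worth considering is to pad the tail once, right after the array has been populated, instead of on every access. The sketch below is an assumption of mine rather than what the demo application does: PadTailWithOnes is a hypothetical function that works on the array viewed as plain 16-bit lanes, and the buffer must have room for a whole number of 8-lane chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sets the unused lanes of the last 8-lane chunk to 1.
// `lanes` must have room for validCount rounded up to a multiple of 8.
// Returns the number of lanes that were padded.
std::size_t PadTailWithOnes(std::uint16_t* lanes, std::size_t validCount)
{
    assert(lanes != nullptr);
    const std::size_t remainder = validCount % 8;
    if (remainder == 0)
        return 0; // the last chunk is completely filled
    const std::size_t paddedCount = validCount + (8 - remainder);
    for (std::size_t i = validCount; i < paddedCount; ++i)
        lanes[i] = 1;
    return paddedCount - validCount;
}
```

Padding once keeps the inner blending loop free of the last-element check, at the cost of having to remember to call the helper after every repopulation.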

We must take care to allocate our __m128i arrays with the _aligned_malloc function so that they are aligned to a 16-byte boundary, as required by the 128-bit SSE2 instructions. We calculate and store the remainder if the number of unsigned shorts required is not cleanly divisible by 8, as explained before. (Note: there are 8 unsigned shorts (a 16-bit data type) in 128 bits.)

// Calculate the size of 128 bit array
m_fullsize = m_BmpData1.Width * m_BmpData1.Height;
// divide by 8, to get the size in 128bit.
m_arraysize = (m_fullsize) >> 3;
// calculate the remainder
m_remainder = m_fullsize%8;
// if there is remainder,
// add 1 to the 128bit m_arraysize
if(m_remainder!=0)
  ++m_arraysize;

// since we are using unsigned short (16bit) elements,
// let's multiply the total number of elements (m_arraysize * 8)
// by 2 to get the number of bytes needed.
m_fullsize = (m_arraysize) * 8 * 2;

// Allocate 128bit arrays
m_pSrc1R = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16));
m_pSrc1G = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16));
m_pSrc1B = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16));

m_pSrc2R = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16));
m_pSrc2G = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16));
m_pSrc2B = (__m128i *)(_aligned_malloc((size_t)(m_fullsize), 16));
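The size calculation can be factored into a small function and checked in isolation. In the sketch below, ChannelLayout, ComputeLayout, and AllocateChannel are hypothetical names. I use _mm_malloc and _mm_free here because, as far as I know, they are available from the intrinsics headers of the common x86 compilers; with Visual Studio, _aligned_malloc and _aligned_free as in the listing above do the same job.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // __m128i; _mm_malloc/_mm_free come in via the intrinsics headers

// Hypothetical helper type mirroring m_arraysize, m_remainder and m_fullsize.
struct ChannelLayout
{
    std::size_t chunkCount; // number of __m128i elements
    std::size_t remainder;  // valid lanes in the last element, 0 means all 8
    std::size_t byteCount;  // bytes to allocate for one channel
};

ChannelLayout ComputeLayout(std::size_t pixelCount)
{
    ChannelLayout layout;
    layout.remainder = pixelCount % 8;
    layout.chunkCount = pixelCount / 8 + (layout.remainder != 0 ? 1 : 0);
    layout.byteCount = layout.chunkCount * sizeof(__m128i); // 16 bytes per element
    return layout;
}

// Returns 16-byte aligned storage for one channel; release it with _mm_free.
__m128i* AllocateChannel(const ChannelLayout& layout)
{
    assert(layout.byteCount > 0);
    return static_cast<__m128i*>(_mm_malloc(layout.byteCount, 16));
}
```

For example, a 10-pixel image needs two 128-bit elements (32 bytes) per channel, and only two lanes of the second element hold valid data.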

After allocating our __m128i arrays, we will populate them with the color information from the image object. We use GetUShortElement to get our unsigned short element from the __m128i array.

// Populate the SSE2 pointers to array
UINT col = 0;
// Stride is in bytes; divide by 4 to count 32-bit pixels per row
UINT stride = m_BmpData1.Stride >> 2;
for(UINT row = 0; row < m_BmpData1.Height; ++row)
{
  for(col = 0; col < m_BmpData1.Width; ++col)
  {
    UINT index = row * stride + col;
    BYTE r = (m_pPixels1[index] & 0xff0000) >> 16;
    BYTE g = (m_pPixels1[index] & 0xff00) >> 8; 
    BYTE b = (m_pPixels1[index] & 0xff); 

    // assign the pixel color to the 128bit arrays
    GetUShortElement(m_pSrc1R, (int)(index)) = r;
    GetUShortElement(m_pSrc1G, (int)(index)) = g;
    GetUShortElement(m_pSrc1B, (int)(index)) = b;
  }
}
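The channel extraction inside the loop, and its inverse used later when the results are written back, can be expressed as two small functions. This is a sketch with hypothetical names (Rgb16, SplitPixel, JoinPixel); it assumes the same 0xAARRGGBB pixel layout as the masks and shifts in the listing above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: one pixel with each channel widened to 16 bits.
struct Rgb16
{
    std::uint16_t r;
    std::uint16_t g;
    std::uint16_t b;
};

// Splits a 0xAARRGGBB pixel into 16-bit channels; alpha is dropped.
Rgb16 SplitPixel(std::uint32_t argb)
{
    Rgb16 c;
    c.r = static_cast<std::uint16_t>((argb >> 16) & 0xff);
    c.g = static_cast<std::uint16_t>((argb >> 8) & 0xff);
    c.b = static_cast<std::uint16_t>(argb & 0xff);
    return c;
}

// Rebuilds an opaque 0xFFRRGGBB pixel from channels in the range 0..255.
std::uint32_t JoinPixel(const Rgb16& c)
{
    assert(c.r <= 255 && c.g <= 255 && c.b <= 255);
    return 0xff000000u |
           (static_cast<std::uint32_t>(c.r) << 16) |
           (static_cast<std::uint32_t>(c.g) << 8) |
           static_cast<std::uint32_t>(c.b);
}
```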

Lastly, we must remember to de-allocate the __m128i arrays with _aligned_free in the destructor of the parent class. It is a good practice to remember to write deallocation code before we begin our main coding.

if( m_pSrc1R )
{
  _aligned_free(m_pSrc1R);
  m_pSrc1R = NULL;
}
if( m_pSrc1G )
{
  _aligned_free(m_pSrc1G);
  m_pSrc1G = NULL;
}
if( m_pSrc1B )
{
  _aligned_free(m_pSrc1B);
  m_pSrc1B = NULL;
}
if( m_pSrc2R )
{
  _aligned_free(m_pSrc2R);
  m_pSrc2R = NULL;
}
if( m_pSrc2G )
{
  _aligned_free(m_pSrc2G);
  m_pSrc2G = NULL;
}
if( m_pSrc2B )
{
  _aligned_free(m_pSrc2B);
  m_pSrc2B = NULL;
}
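If the project can use C++11, the repetitive cleanup above could be handed to a smart pointer with a custom deleter, so that each array is released automatically. The sketch below is one possible arrangement, not the demo application's code: AlignedFree, ChannelPtr, and MakeChannel are hypothetical names, the storage is exposed as 16-bit lanes, and _mm_malloc/_mm_free are used in place of _aligned_malloc/_aligned_free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <emmintrin.h> // brings in _mm_malloc/_mm_free on the common x86 compilers

// Hypothetical deleter: releases memory obtained from _mm_malloc.
struct AlignedFree
{
    void operator()(std::uint16_t* p) const { _mm_free(p); }
};

// Owning pointer to one color channel, viewed as 16-bit lanes.
using ChannelPtr = std::unique_ptr<std::uint16_t[], AlignedFree>;

// Allocates `chunkCount` 128-bit elements (8 lanes each), 16-byte aligned.
// The returned pointer is empty if the allocation fails.
ChannelPtr MakeChannel(std::size_t chunkCount)
{
    assert(chunkCount > 0);
    void* raw = _mm_malloc(chunkCount * 16, 16);
    return ChannelPtr(static_cast<std::uint16_t*>(raw));
}
```

A vector can then be read with _mm_load_si128(reinterpret_cast<const __m128i*>(channel.get() + 8 * i)), and no destructor code is needed for the six arrays.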

Before we run any SSE2 code, we should first ensure that SSE2 is supported on the processor. We can use the SIMD Detector class, written by the author, to do the check.

#include "SIMD.h"

SIMD m_simd;

if(m_simd.HasSSE2()==false)
{
  MessageBox(
    "Sorry, your processor does not support SSE2",
    "Error",
    MB_ICONERROR);
  return;
}
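For readers who do not want to pull in the SIMD Detector class, the check itself is small. The sketch below is a stand-in, not the class's actual implementation: CpuHasSSE2 and TestBit are hypothetical names. It relies on the documented CPUID behaviour that leaf 1 reports SSE2 support in bit 26 of EDX, and it uses __cpuid on the Microsoft compiler and __get_cpuid from cpuid.h on GCC and Clang.

```cpp
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h> // __cpuid
#else
#include <cpuid.h>  // __get_cpuid (GCC and Clang)
#endif

// Returns true when bit number `bit` of `value` is set.
bool TestBit(unsigned int value, int bit)
{
    assert(bit >= 0 && bit < 32);
    return ((value >> bit) & 1u) != 0;
}

// Stand-in for SIMD::HasSSE2(): queries CPUID leaf 1 directly.
bool CpuHasSSE2()
{
    unsigned int edx = 0;
#if defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0}; // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    edx = static_cast<unsigned int>(regs[3]);
#else
    unsigned int eax = 0, ebx = 0, ecx = 0;
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0)
        return false; // CPUID leaf 1 is not available
#endif
    return TestBit(edx, 26); // CPUID.1:EDX bit 26 = SSE2
}
```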

Now we have come to the meat of the article: using SSE2 to do alphablending. The formula used is ((SrcColor * Alpha) + (DestColor * InverseAlpha)) >> 8. The code listing below is heavily commented. We use the _mm_set1_epi16 intrinsic to set the same short value (for example, 0) in every short element of a __m128i element. The _mm_mullo_epi16 intrinsic multiplies the unsigned shorts and retains only the lower 16 bits of each 32-bit product. We use the _mm_add_epi16 intrinsic to add unsigned shorts. The _mm_srli_epi16 intrinsic shifts right by 8 bits to simulate division by 256. Getting away without a division instruction is a blessing, because SSE2 does not support division for integers. After the SSE2 operations, we need to unpack the data back to ARGB format. This unpacking offsets some of the performance gains.

// set alpha 128 bit
__m128i nAlpha128 = _mm_set1_epi16 ((short)(Alpha));
// set inverse alpha 128 bit
__m128i nInvAlpha128 = _mm_set1_epi16 ((short)(255-Alpha));

// initialize the __m128i variables
// so that the compiler shut up about
// uninitialized variables.
__m128i src1R = _mm_set1_epi16 ((short)(0));
__m128i src1G = _mm_set1_epi16 ((short)(0));
__m128i src1B = _mm_set1_epi16 ((short)(0));

__m128i src2R = _mm_set1_epi16 ((short)(0));
__m128i src2G = _mm_set1_epi16 ((short)(0));
__m128i src2B = _mm_set1_epi16 ((short)(0));

__m128i rFinal = _mm_set1_epi16 ((short)(0));
__m128i gFinal = _mm_set1_epi16 ((short)(0));
__m128i bFinal = _mm_set1_epi16 ((short)(0));

for(int i=0;i<m_arraysize;++i)
{
  // multiply and retain the result in lower 16bits
  src1R = _mm_mullo_epi16 (
    Get128BitElement(m_pSrc1R,i,m_arraysize,m_remainder), 
    nAlpha128);
  src1G = _mm_mullo_epi16 (
    Get128BitElement(m_pSrc1G,i,m_arraysize,m_remainder), 
    nAlpha128);
  src1B = _mm_mullo_epi16 (
    Get128BitElement(m_pSrc1B,i,m_arraysize,m_remainder), 
    nAlpha128);

  // multiply and retain the result in lower 16bits
  src2R = _mm_mullo_epi16 (
    Get128BitElement(m_pSrc2R,i,m_arraysize,m_remainder), 
    nInvAlpha128);
  src2G = _mm_mullo_epi16 (
    Get128BitElement(m_pSrc2G,i,m_arraysize,m_remainder), 
    nInvAlpha128);
  src2B = _mm_mullo_epi16 (
    Get128BitElement(m_pSrc2B,i,m_arraysize,m_remainder), 
    nInvAlpha128);

  // Add a and b together
  rFinal = _mm_add_epi16 (src1R, src2R);
  gFinal = _mm_add_epi16 (src1G, src2G);
  bFinal = _mm_add_epi16 (src1B, src2B);

  // Use shift right by 8 to do division by 256
  rFinal = _mm_srli_epi16 (rFinal, 8);
  gFinal = _mm_srli_epi16 (gFinal, 8);
  bFinal = _mm_srli_epi16 (bFinal, 8);

  // unpack the final results into the 8 pixels
  // if possible
  int step = i << 3; // multiply by 8
  if(i!=m_arraysize-1) // not the last element
  {
    pixels[step+0] = 
      0xff000000                | 
      rFinal.m128i_u16[0] << 16 | 
      gFinal.m128i_u16[0] << 8  | 
      bFinal.m128i_u16[0];
    pixels[step+1] = 
      0xff000000                | 
      rFinal.m128i_u16[1] << 16 | 
      gFinal.m128i_u16[1] << 8  | 
      bFinal.m128i_u16[1];
    pixels[step+2] = 
      0xff000000                | 
      rFinal.m128i_u16[2] << 16 | 
      gFinal.m128i_u16[2] << 8  | 
      bFinal.m128i_u16[2];
    pixels[step+3] = 
      0xff000000                | 
      rFinal.m128i_u16[3] << 16 | 
      gFinal.m128i_u16[3] << 8  | 
      bFinal.m128i_u16[3];
    pixels[step+4] = 
      0xff000000                | 
      rFinal.m128i_u16[4] << 16 | 
      gFinal.m128i_u16[4] << 8  | 
      bFinal.m128i_u16[4];
    pixels[step+5] = 
      0xff000000                | 
      rFinal.m128i_u16[5] << 16 | 
      gFinal.m128i_u16[5] << 8  | 
      bFinal.m128i_u16[5];
    pixels[step+6] = 
      0xff000000                | 
      rFinal.m128i_u16[6] << 16 | 
      gFinal.m128i_u16[6] << 8  | 
      bFinal.m128i_u16[6];
    pixels[step+7] = 
      0xff000000                | 
      rFinal.m128i_u16[7] << 16 | 
      gFinal.m128i_u16[7] << 8  | 
      bFinal.m128i_u16[7];

  }
  // last 128 bit element, not all 16 bit element are filled with valid value
  else if(m_remainder!=0)
  {
    for(int smallIndex=0; smallIndex<m_remainder; ++smallIndex)
    {
      pixels[step+smallIndex] =
        0xff000000                         | 
        rFinal.m128i_u16[smallIndex] << 16 | 
        gFinal.m128i_u16[smallIndex] << 8  | 
        bFinal.m128i_u16[smallIndex];
    }
  }
  // last 128 bit element, all 16 bit element are filled with valid value
  // Because the remainder is zero
  else
  {
    pixels[step+0] = 
      0xff000000                | 
      rFinal.m128i_u16[0] << 16 | 
      gFinal.m128i_u16[0] << 8  | 
      bFinal.m128i_u16[0];
    pixels[step+1] = 
      0xff000000                | 
      rFinal.m128i_u16[1] << 16 | 
      gFinal.m128i_u16[1] << 8  | 
      bFinal.m128i_u16[1];
    pixels[step+2] = 
      0xff000000                | 
      rFinal.m128i_u16[2] << 16 | 
      gFinal.m128i_u16[2] << 8  | 
      bFinal.m128i_u16[2];
    pixels[step+3] = 
      0xff000000                | 
      rFinal.m128i_u16[3] << 16 | 
      gFinal.m128i_u16[3] << 8  | 
      bFinal.m128i_u16[3];
    pixels[step+4] = 
      0xff000000                | 
      rFinal.m128i_u16[4] << 16 | 
      gFinal.m128i_u16[4] << 8  | 
      bFinal.m128i_u16[4];
    pixels[step+5] = 
      0xff000000                | 
      rFinal.m128i_u16[5] << 16 | 
      gFinal.m128i_u16[5] << 8  | 
      bFinal.m128i_u16[5];
    pixels[step+6] = 
      0xff000000                | 
      rFinal.m128i_u16[6] << 16 | 
      gFinal.m128i_u16[6] << 8  | 
      bFinal.m128i_u16[6];
    pixels[step+7] = 
      0xff000000                | 
      rFinal.m128i_u16[7] << 16 | 
      gFinal.m128i_u16[7] << 8  | 
      bFinal.m128i_u16[7];
  }
}
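The unpacking step in the listing above extracts each 16-bit lane individually. One alternative, which I have not benchmarked, is to let SSE2 interleave the lanes into ARGB pixels with _mm_unpacklo_epi16 and _mm_unpackhi_epi16. The sketch below blends one group of eight pixels with the same formula and writes eight 0xFFRRGGBB values. Channels, BlendLanes, and Blend8 are hypothetical names, the channels are assumed to be stored as plain arrays of 16-bit values, and unaligned loads and stores are used so that the sketch does not depend on how the memory was allocated.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper type: three planes of 16-bit lanes (8 lanes are read).
struct Channels
{
    const std::uint16_t* r;
    const std::uint16_t* g;
    const std::uint16_t* b;
};

// ((src1 * alpha) + (src2 * invAlpha)) >> 8 on eight 16-bit lanes.
static __m128i BlendLanes(const std::uint16_t* src1, const std::uint16_t* src2,
                          __m128i alpha, __m128i invAlpha)
{
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src1));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src2));
    const __m128i sum = _mm_add_epi16(_mm_mullo_epi16(a, alpha),
                                      _mm_mullo_epi16(b, invAlpha));
    return _mm_srli_epi16(sum, 8);
}

// Blends eight pixels and stores them as 0xFFRRGGBB into out[0..7].
void Blend8(const Channels& src1, const Channels& src2, int alpha, std::uint32_t* out)
{
    assert(alpha >= 0 && alpha <= 255);
    const __m128i vAlpha = _mm_set1_epi16(static_cast<short>(alpha));
    const __m128i vInvAlpha = _mm_set1_epi16(static_cast<short>(255 - alpha));

    const __m128i r = BlendLanes(src1.r, src2.r, vAlpha, vInvAlpha);
    const __m128i g = BlendLanes(src1.g, src2.g, vAlpha, vInvAlpha);
    const __m128i b = BlendLanes(src1.b, src2.b, vAlpha, vInvAlpha);

    // Low half of each pixel: (g << 8) | b in every 16-bit lane.
    const __m128i gb = _mm_or_si128(_mm_slli_epi16(g, 8), b);
    // High half of each pixel: 0xFF00 | r in every 16-bit lane.
    const __m128i ar = _mm_or_si128(_mm_set1_epi16(static_cast<short>(0xFF00)), r);

    // Interleave the halves: lanes gb0, ar0, gb1, ar1, ... form 32-bit pixels.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_unpacklo_epi16(gb, ar));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out + 4), _mm_unpackhi_epi16(gb, ar));
}
```

Whether this actually beats the lane-by-lane unpacking would have to be measured; the tail element with fewer than eight valid pixels still needs separate handling.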

Benchmark Results

Application screenshot

Below are listed the formulae used in each benchmark:

Unoptimized code formula used

Unoptimised

Optimized code formula used

Optimised

Optimized code with shift formula used

Optimised With Shift

SSE2 optimized code formula used

Optimised With Shift

This is the timing of doing alphablending 1000 times on an Intel i7 870 at 2.93 GHz. The performance characteristics differ from those posted in the earlier article because the benchmarks were run on different PCs. The earlier PC has 'retired' from useful service in this world.

Benchmark results

VS2010 Release mode (optimization disabled)
(Lower is better)

Author code
Unoptimized Code          : 4679 milliseconds
Optimized Code            : 2108 milliseconds
Optimized Code with Shift : 1931 milliseconds
SSE2 Optimized Code       :  862 milliseconds (works for all image width)

Yves Daoust's code
C version                 : 1923 milliseconds
MMX 1                     : 1888 milliseconds
MMX 2                     : 1355 milliseconds (image width must be even)
SSE                       :  762 milliseconds (image width must be multiple of 4)

VS2010 Release mode (optimization enabled)
(Lower is better)

Author code
Unoptimized Code          : 3324 milliseconds
Optimized Code            : 1378 milliseconds
Optimized Code with Shift :  802 milliseconds
SSE2 Optimized Code       :  406 milliseconds (works for all image width)

Yves Daoust's code
C version                 :  604 milliseconds
MMX 1                     :  213 milliseconds
MMX 2                     :  167 milliseconds (image width must be even)
SSE                       :  104 milliseconds (image width must be multiple of 4)

Below is the speedup of each version over the unoptimized version, computed from the timings taken with optimization disabled. Compared with the runner-up (Optimized Code with Shift), however, the SSE2 speedup is only 2.2 times, not the 8 times we might have expected from processing eight 16-bit values per instruction. So what is the memory cost of this SSE2 speedup? Each color channel takes 2 bytes instead of 1 byte. The original implementation needs 4 bytes to store an ARGB color pixel. (4 bytes are used because the processor can load and store 32-bit data faster than 24-bit data, which may need to be shifted to a proper 32-bit boundary before being loaded into a 32-bit register.) The SSE2 version needs 6 bytes to store an RGB color pixel, because two bytes are used for each color channel. Individual alpha channels are not stored because they are not used. The SSE2 version therefore needs 50% more memory to perform its magic.

(Higher is better)
Unoptimized Code          : 1x
Optimized Code            : 2.2x
Optimized Code with Shift : 2.4x
SSE2 Optimized Code       : 5.4x
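The figures in this table follow directly from the timings measured with optimization disabled, as the small sketch below shows. Speedup and SpeedupTenths are hypothetical helper names. The same calculation puts the SSE2 version about 2.2 times ahead of Optimized Code with Shift with optimization disabled (1931 / 862) and about 2.0 times ahead with optimization enabled (802 / 406).

```cpp
#include <cassert>
#include <cmath>

// How many times faster the candidate is than the baseline.
double Speedup(double baselineMs, double candidateMs)
{
    assert(baselineMs > 0.0 && candidateMs > 0.0);
    return baselineMs / candidateMs;
}

// The speedup in tenths, rounded to the nearest tenth (22 means 2.2x).
long SpeedupTenths(double baselineMs, double candidateMs)
{
    return std::lround(Speedup(baselineMs, candidateMs) * 10.0);
}
```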

Conclusion

We have used SSE2 to roughly double the previous fastest performance. The speedup is not dramatic, and it is probably not worth the extra effort and the unreadable code. To achieve better readability, we could use the vector classes declared in the dvec.h header, which support overloaded operators such as -, +, *, and /. Any suggestion to improve the performance further is welcome. We could ramp up performance by using OpenMP or the Parallel Patterns Library (PPL) in tandem with SSE2, but that is an article for another day. Any reader interested in a future article about using OpenCL to do alphablending, please drop me a message below. OpenCL is a heterogeneous technology: an OpenCL kernel can run on either CPUs or GPUs.

References

SIMD Detector
Optimizing software in C++ by Agner Fog
Auto-vectorization in the Visual Studio 11 Express preview
Auto-vectorizer in Visual C++ 11 (MSDN)
A Guide to Auto-vectorization with Intel C++ Compilers
