Introduction to Intel® Advanced Vector Extensions (Intel® AVX)

Intel® Advanced Vector Extensions (Intel® AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel® architecture CPUs. These instructions extend previous SIMD offerings (MMX™ instructions and Intel® Streaming SIMD Extensions (Intel® SSE)) by adding the following new features:
  • The 128-bit SIMD registers have been expanded to 256 bits. Intel® AVX is designed to support 512 or 1024 bits in the future.
  • Three-operand, nondestructive operations have been added. Previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand; the new instructions can perform operations like A = B + C, leaving the original source operands unchanged.
  • A few instructions take four-register operands, allowing smaller and faster code by removing unnecessary instructions.
  • Memory alignment requirements for operands are relaxed.
  • A new extension coding scheme (VEX) has been designed to make future additions easier and to make instruction encodings smaller and faster to execute.
Closely related to these advances are the new Fused Multiply-Add (FMA) instructions, which allow faster and more accurate specialized operations such as the single-instruction computation A = A * B + C. The FMA instructions are expected in a CPU generation following the second-generation Intel® Core™ CPUs. Other features include new instructions for Advanced Encryption Standard (AES) encryption and decryption, a packed carry-less multiplication operation (PCLMULQDQ) useful for certain encryption primitives, and some reserved slots for future instructions, such as a hardware random number generator.


Instruction Set Overview

The new instructions are encoded using what Intel calls a VEX prefix, a two- or three-byte prefix designed to reduce the complexity of current and future x86/x64 instruction encoding. The two new VEX prefixes are formed from two obsolete 32-bit instructions: Load Pointer Using DS (LDS, 0xC4, three-byte form) and Load Pointer Using ES (LES, 0xC5, two-byte form), which load the DS and ES segment registers in 32-bit mode. In 64-bit mode, the LDS and LES opcodes generate an invalid-opcode exception, so under Intel® AVX they are repurposed for encoding the new instruction prefixes. (In 32-bit mode, the VEX prefixes are distinguished from LDS and LES by bit patterns that would be invalid forms of those instructions.) The prefixes allow encoding more registers than previous x86 instructions and are required for accessing the new 256-bit SIMD registers or using the three- and four-operand syntax. As a user, you do not need to worry about this encoding (unless you are writing assemblers or disassemblers).


Note:  The rest of this article assumes operation in 64-bit mode.


SIMD instructions allow processing of multiple pieces of data in a single step, speeding up throughput for many tasks, from video encoding and decoding to image processing to data analysis to physics simulations. Intel® AVX instructions work on Institute of Electrical and Electronics Engineers (IEEE)-754 floating-point values in 32-bit length (called  single precision) and in 64-bit length (called  double precision). IEEE-754 is the standard defining reproducible, robust floating-point operation and is the standard for most mainstream numerical computations.

The older, related Intel® SSE instructions also support various signed and unsigned integer sizes, including signed and unsigned byte (B, 8-bit), word (W, 16-bit), doubleword (DW, 32-bit), quadword (QW, 64-bit), and doublequadword (DQ, 128-bit) lengths. Not all instructions are available in all size combinations; for details, see the links provided in "For More Information." See Figure 2 later in this article for a graphical representation of the data types.

The register state supporting Intel® AVX (and FMA) consists of the sixteen 256-bit YMM registers, YMM0-YMM15, and a 32-bit control/status register called MXCSR. The YMM registers are aliased over the older 128-bit XMM registers used for Intel SSE, treating each XMM register as the lower half of the corresponding YMM register, as shown in Figure 1.
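As a code-level view of this aliasing, the AVX cast intrinsics reinterpret a register without moving any data. A minimal sketch (these are standard AVX intrinsics; note that casting 128 bits back up leaves the upper bits undefined):

#include <immintrin.h>

void XmmYmmAliasSketch()
{
    __m256 y = _mm256_set1_ps(3.0f);            // fill all eight lanes of a YMM register
    __m128 low = _mm256_castps256_ps128(y);     // view the lower 128 bits (the aliased XMM register);
                                                // a reinterpretation only, no instruction is emitted
    __m256 back = _mm256_castps128_ps256(low);  // cast back up; the upper 128 bits are undefined
    (void)back;
}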

Bits 0-5 of MXCSR indicate SIMD floating-point exceptions; they are "sticky" bits, meaning that after being set, they remain set until cleared using LDMXCSR or FXRSTOR. Bits 0-5 represent invalid operation, denormal, divide by zero, overflow, underflow, and precision, respectively. Bits 7-12 mask the corresponding individual exceptions when set; these mask bits are initially set by a power-up or reset. For details, see the links in "For More Information."
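The sticky flags and masks can be inspected and modified from C++ via the MXCSR access intrinsics. A minimal sketch using the bit positions above (assuming _mm_getcsr/_mm_setcsr, the standard intrinsics for reading and writing MXCSR):

#include <immintrin.h>

// bit 2 of MXCSR is the sticky divide-by-zero flag
bool DivideByZeroOccurred()
{
    return (_mm_getcsr() & (1u << 2)) != 0;
}

// clear all six sticky exception flags (bits 0-5)
void ClearExceptionFlags()
{
    _mm_setcsr(_mm_getcsr() & ~0x3Fu);
}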

Figure 1.  XMM registers overlay the YMM registers.

Figure 2 illustrates the data types used in the Intel® SSE and Intel® AVX instructions. Roughly, for Intel AVX, any multiple of a 32-bit or 64-bit floating-point type that adds up to 128 or 256 bits is allowed, as well as any multiple of an integer type that adds up to 128 bits.

Figure 2.  Intel® AVX and Intel® SSE data types

Instructions often come in scalar and vector versions, as illustrated in Figure 3. Vector versions operate by treating data in the registers in parallel "SIMD" mode; the scalar version only operates on one entry in each register. This distinction allows less data movement for some algorithms, providing better overall throughput.

Figure 3. SIMD versus scalar operations

Data is  memory aligned when the data to be operated on as an n-byte chunk is stored on an n-byte memory boundary. For example, when loading 256-bit data into YMM registers, if the source address is 32-byte (256-bit) aligned, the access is called  aligned.

For Intel® SSE operations, memory alignment was required unless explicitly stated otherwise. For example, Intel SSE provides specific instructions for memory-aligned and memory-unaligned operations, such as MOVAPD (move aligned packed double) and MOVUPD (move unaligned packed double). Instructions not split in two like this required aligned accesses.

Intel® AVX has relaxed some memory alignment requirements, so Intel AVX by default allows unaligned access; however, this access may come with a performance penalty, so the old rule of designing your data to be memory aligned is still good practice (16-byte aligned for 128-bit access and 32-byte aligned for 256-bit access). The main exceptions are the VEX-extended versions of the Intel SSE instructions that explicitly required memory-aligned data: These instructions still require aligned data. Other specific instructions requiring aligned access are listed in Table 2.4 of the  Intel® Advanced Vector Extensions Programming Reference (see "For More Information" for a link).
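A sketch of the aligned-versus-unaligned distinction at the intrinsic level (the __declspec(align(32)) syntax is Microsoft-specific; GCC uses __attribute__((aligned(32)))):

#include <immintrin.h>

void AlignmentSketch()
{
    __declspec(align(32)) float data[16];   // 32-byte aligned storage
    for (int i = 0; i < 16; ++i)
        data[i] = (float)i;
    __m256 v = _mm256_load_ps(data);        // aligned load: the address must be 32-byte aligned
    __m256 u = _mm256_loadu_ps(data + 1);   // unaligned load: any address is legal, possibly slower
    (void)v; (void)u;
}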

Another performance concern besides unaligned data is that mixing legacy XMM-only (non-VEX) instructions and newer Intel AVX (VEX-encoded) instructions causes state-transition delays, so minimize transitions between the two; where mixing is unavoidable, group instructions of the same VEX/non-VEX class together. Alternatively, there is no transition penalty if the upper YMM bits are set to zero via VZEROUPPER or VZEROALL, which compilers should insert automatically. Because this insertion costs an extra instruction, profiling is recommended.
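A sketch of clearing the upper YMM state by hand before calling legacy code (LegacySSECode is a hypothetical, separately compiled function containing non-VEX Intel SSE instructions):

#include <immintrin.h>

void LegacySSECode(); // hypothetical: compiled without VEX encoding

void TransitionSketch(const float *p)
{
    __m256 v = _mm256_loadu_ps(p);
    // ... 256-bit Intel AVX work using v ...
    (void)v;
    _mm256_zeroupper();  // emits VZEROUPPER: zero the upper halves of the YMM registers
    LegacySSECode();     // no VEX/non-VEX transition penalty now
}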


Intel® AVX Instruction Classes

As mentioned, Intel® AVX adds support for many new instructions and extends current Intel SSE instructions to the new 256-bit registers, with most old Intel SSE instructions having a V-prefixed Intel AVX version for accessing new register sizes and three-operand forms. Depending on how instructions are counted, there are up to a few hundred new Intel AVX instructions.

For example, the old two-operand Intel SSE instruction  ADDPS xmm1, xmm2/m128 can now be expressed in three-operand syntax as  VADDPS xmm1, xmm2, xmm3/m128, or for the 256-bit registers as  VADDPS ymm1, ymm2, ymm3/m256. (Here,  m128 is a 128-bit memory location,  xmm1 is a 128-bit register, and so on.) A few instructions allow four operands, such as  VBLENDVPS ymm1, ymm2, ymm3/m256, ymm4, which conditionally copies single-precision floating-point values from  ymm2 or  ymm3/m256 to  ymm1 based on masks in  ymm4. This is an improvement on the previous form, where  xmm0 was implicitly required, forcing compilers to free up  xmm0. With all registers explicit, there is more freedom for register allocation.

Some new instructions are VEX only (not Intel SSE extensions), including many ways to move data into and out of the YMM registers. Examples are the useful  VBROADCASTS[S/D], which loads a single value into all elements of an XMM or YMM register, and  VPERMILP[S/D], which shuffles data around within a register. (The bracket notation is explained in Appendix A.)
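A minimal sketch of these two at the intrinsic level (standard AVX intrinsics; the immediate 0x1B reverses the four lanes within each 128-bit half):

#include <immintrin.h>

void BroadcastPermuteSketch(const float *src)
{
    __m256 all = _mm256_broadcast_ss(src);      // VBROADCASTSS: one float into all eight lanes
    __m256 rev = _mm256_permute_ps(all, 0x1B);  // VPERMILPS: reorder lanes within each 128-bit half
    (void)rev;
}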

Intel® AVX adds arithmetic instructions for variants of add, subtract, multiply, divide, square root, compare, min, max, and round on single- and double-precision packed and scalar floating-point data. It also adds many new comparison predicates (32 comparison types in all), which are available to the VEX-encoded 128-bit Intel SSE forms as well. Intel AVX further promotes instructions from previous SIMD generations covering logical, blend, convert, test, pack, unpack, shuffle, load, and store operations. Entirely new instructions include non-strided fetching (broadcast of single or multiple data into a 256-bit destination and masked-move primitives for conditional load and store), insert and extract of multiple SIMD data to and from 256-bit SIMD registers, permute primitives for manipulating data within a register, branch handling, and packed testing instructions.


Future Additions
The Intel® AVX manual also lists some proposed future instructions, covered here for completeness. This is not a guarantee that these instructions will materialize as written.

Two instructions ( VCVTPH2PS and  VCVTPS2PH) are reserved for supporting conversions between 16-bit and single-precision floating-point values. The 16-bit format is called  half-precision and has a 10-bit mantissa (with an implied leading 1 for non-denormalized numbers, giving 11 bits of precision), a 5-bit exponent (biased by 15), and a 1-bit sign.
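For a normalized half-precision value (sign bit s, exponent field e with 0 < e < 31, and mantissa field m), the decoding is:

    x = (-1)^s * 2^(e-15) * (1 + m/1024)

For example, the bit pattern 0x3C00 (s = 0, e = 15, m = 0) decodes to 2^0 * 1.0 = 1.0.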

The proposed  RDRAND instruction uses a cryptographically secure hardware digital random bit generator to generate random numbers into 16-, 32-, or 64-bit registers. On success, the carry flag is set to 1 ( CF=1); if not enough entropy is available, the carry flag is cleared ( CF=0).

Finally, there are four instructions ( RDFSBASE, RDGSBASE, WRFSBASE, and WRGSBASE) to read and write the FS and GS segment base registers at all privilege levels in 64-bit mode.

Another future addition is the FMA instructions, which perform operations of the form A = ±A * B ± C, where each sign on the right can independently be plus or minus and the three operands on the right can be used in any order. There are also forms for interleaved addition and subtraction. Packed FMA instructions can perform eight single-precision or four double-precision FMA operations with 256-bit vectors.

FMA operations such as A = A * B + C are better than performing the two steps separately, because the intermediate result is treated as infinite precision, with rounding performed only on the final store, and they are thus more accurate; this single rounding is what gives the "fused" prefix. They are also faster than performing the computation in separate steps.

Each instruction comes in three forms for ordering the operands A, B, and C, with the ordering corresponding to a three-digit extension: form  132 computes A = A*C + B, form  213 computes A = B*A + C, and form  231 computes A = B*C + A. The ordering number is simply the order in which the operands appear on the right side of the expression.
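Later compilers expose FMA through intrinsics in <immintrin.h>, and the compiler then chooses among the 132/213/231 encodings during register allocation. A sketch, assuming FMA-capable hardware and compiler support (for example, the _mm256_fmadd_ps intrinsic):

#include <immintrin.h>

// computes a*b + c in each of the eight single-precision lanes,
// with a single rounding at the end
__m256 FmaSketch(__m256 a, __m256 b, __m256 c)
{
    return _mm256_fmadd_ps(a, b, c); // VFMADD{132,213,231}PS, chosen by the compiler
}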


Availability and Support

Detecting availability of the Intel® AVX features requires using the  CPUID instruction to query support in both the CPU and the operating system, as detailed later. Second-generation Intel® Core™ processors (Intel® microarchitecture code name Sandy Bridge), released in Q1 2011, are the first from Intel to support Intel® AVX technology. These processors do not have the new FMA instructions. For development and testing without hardware support, the free Intel® Software Development Emulator (see "For More Information" for a link) includes support for all of these features, including the Intel AVX, FMA, PCLMULQDQ, and AES instructions.

To use the Intel AVX extensions reliably in most settings, the operating system must support saving and loading the new registers (with  XSAVE/XRSTOR) on thread context switches to prevent data corruption. To help avoid such errors, operating systems supporting Intel AVX-aware context switches explicitly set a CPU bit enabling the new instructions; otherwise, an undefined opcode ( #UD) exception is generated when Intel AVX instructions are used.

Microsoft Windows* 7 with Service Pack 1 (SP1) and Microsoft Windows* Server 2008 R2 with SP1 (both 32- and 64-bit versions), as well as later versions of Windows*, support Intel AVX state save and restore in thread and process switches. Linux* kernels from 2.6.30 (June 2009) onward support Intel AVX as well.


Detecting Availability and Support
Detection of support for the four areas (Intel® AVX, FMA, AES, and PCLMULQDQ) is similar in each case and consists of checking for hardware and operating system support for the desired feature (see Table 1). The steps are (counting bits starting at bit 0):

  1. Verify that the operating system supports XGETBV using CPUID.1:ECX.OSXSAVE bit 27 = 1.
  2. At the same time, verify that CPUID.1:ECX bit 28=1 (Intel AVX supported), bit 25=1 (AES supported), bit 12=1 (FMA supported), and/or bit 1=1 (PCLMULQDQ supported), as desired.
  3. Issue XGETBV, and verify that bits 1 and 2 of the feature-enabled mask are 11b (XMM state and YMM state enabled by the operating system).
Table 1.  Feature-detection Masks

Feature        Bits to check      Constant
Intel® AVX     28, 27             018000000H
VAES           28, 27, and 25     01A000000H
VPCLMULQDQ     28, 27, and 1      018000002H
FMA            28, 27, and 12     018001000H

Example code implementing this process is provided in Listing 1, where the  CONSTANT is the value from Table 1. A Microsoft* Visual Studio* C++ intrinsic version is given later.

Listing 1.  Feature Detection

INT Supports_Feature()
    {
    ; result returned in eax
    mov eax, 1
    cpuid
    and ecx, CONSTANT
    cmp ecx, CONSTANT       ; check desired feature flags
    jne not_supported
    ; processor supports features
    mov ecx, 0              ; specify 0 for XFEATURE_ENABLED_MASK register
    xgetbv                  ; result in EDX:EAX
    and eax, 06H
    cmp eax, 06H            ; check OS has enabled both XMM and YMM state support
    jne not_supported
    mov eax, 1              ; mark as supported
    jmp done
not_supported:
    mov eax, 0              ; mark as not supported
done:
    }

Usage

At the lowest programming level, most common x86 assemblers now support Intel® AVX, FMA, AES, and the VPCLMULQDQ instructions, including Microsoft MASM* (Microsoft Visual Studio* 2010 version), NASM*, FASM*, and YASM*. See their respective documentation for details.

For language compilers, the Intel® C++ Compiler version 11.1 and later and the Intel® Fortran Compiler support Intel® AVX through compiler switches, and both support automatic vectorization of floating-point loops. The Intel C++ Compiler also supports Intel AVX intrinsics (use  #include <immintrin.h> to access them) and inline assembly, and it even supports Intel AVX intrinsics emulation via  #include "avxintrin_emu.h".

Microsoft Visual Studio* C++ 2010 with SP1 and later supports Intel AVX (see "For More Information") when compiling 64-bit code (use the  /arch:AVX compiler switch). It supports the intrinsics through the  <immintrin.h> header but not inline assembly. Intel AVX support also appears in MASM*, the disassembly view of code, and the debugger views of registers (giving full YMM support).

In the GNU Compiler Collection* (GCC*), version 4.4 supports Intel AVX intrinsics through the same header, <immintrin.h>. Other GNU toolchain support is found in Binutils 2.20.51.0.1 and later, gdb 6.8.50.20090915 and later, recent GNU Assembler (GAS) versions, and  objdump. If your compiler does not support Intel AVX, you can emit the required bytes under many circumstances, but first-class support makes your life easier.

Each of the three C++ compilers mentioned supports the same intrinsic operations to simplify using Intel® AVX from C or C++ code.  Intrinsics are functions that the compiler replaces with the proper assembly instructions. Most Intel AVX intrinsic names follow this format:

_mm256_op_suffix(data_type param1, data_type param2, data_type param3)
where  _mm256 is the prefix for working on the new 256-bit registers;  _op is the operation, like  add for addition or sub for subtraction; and  _suffix denotes the type of data to operate on, with the first letters denoting packed (p), extended packed (ep), or scalar (s). The remaining letters are the types in Table 2.


Table 2.  Intel® AVX Suffix Markings

Marking       Meaning
[s/d]         Single- or double-precision floating point
[i/u]nnn      Signed or unsigned integer of bit size nnn, where nnn is 128, 64, 32, 16, or 8
[ps/pd/sd]    Packed single, packed double, or scalar double
epi32         Extended packed 32-bit signed integer
si256         Scalar 256-bit integer

Data types are in Table 3. The first two parameters are source registers, and the third parameter (when present) is an integer mask, selector, or offset value.

Table 3.  Intel® AVX Intrinsics Data Types

Type       Meaning
__m256     256 bits as eight single-precision floating-point values, representing a YMM register or memory location
__m256d    256 bits as four double-precision floating-point values, representing a YMM register or memory location
__m256i    256 bits as integers (bytes, words, etc.)
__m128     128 bits as four single-precision floating-point values (32 bits each)
__m128d    128 bits as two double-precision floating-point values (64 bits each)

Some intrinsics live in other headers; for example, the AES and PCLMULQDQ intrinsics are in  <wmmintrin.h>. Consult your compiler documentation or the web to track down where various intrinsics live.
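As a concrete illustration of the naming scheme and the types in Tables 2 and 3, here is a minimal sketch using two standard AVX intrinsics:

#include <immintrin.h>

// _mm256 + add + ps: add packed singles in 256-bit registers
__m256 AddEightFloats(__m256 a, __m256 b)
{
    return _mm256_add_ps(a, b);  // eight single-precision additions at once
}

// _mm256 + add + pd: add packed doubles in 256-bit registers
__m256d AddFourDoubles(__m256d a, __m256d b)
{
    return _mm256_add_pd(a, b);  // four double-precision additions at once
}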


Microsoft Visual Studio* 2010
For conciseness, the rest of this article uses Microsoft Visual Studio* 2010 with SP1; similar code should work on the Intel® compiler or GCC*. Microsoft Visual Studio* 2010 with SP1 can automatically generate Intel® AVX code if you click  Project Properties > Configuration > Code Generation, select  Not Set under  Enable Enhanced Instruction Set, and then manually add  /arch:AVX to the command line under the  Command Line entry. As an example of using intrinsics, Listing 2 offers an intrinsic-based Intel AVX feature-detection routine.


Listing 2.  Intrinsic-based Feature Detection

// get AVX intrinsics
#include <immintrin.h>
// get CPUID capability
#include <intrin.h>

// written for clarity, not conciseness
#define OSXSAVEFlag (1UL<<27)
#define AVXFlag     ((1UL<<28)|OSXSAVEFlag)
#define VAESFlag    ((1UL<<25)|AVXFlag|OSXSAVEFlag)
#define FMAFlag     ((1UL<<12)|AVXFlag|OSXSAVEFlag)
#define CLMULFlag   ((1UL<< 1)|AVXFlag|OSXSAVEFlag)

bool DetectFeature(unsigned int feature)
    {
    int CPUInfo[4];
    __cpuid(CPUInfo, 1);            // read feature information (CPUID leaf 1)
    unsigned int ECX = CPUInfo[2];  // the output of CPUID in the ECX register
    if ((ECX & feature) != feature) // missing feature
        return false;
    __int64 val = _xgetbv(0);       // read XFEATURE_ENABLED_MASK register
    if ((val & 6) != 6)             // check OS has enabled both XMM and YMM support
        return false;
    return true;
    }
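A caller can then gate a fast path on the detection routine. A sketch (MandelbrotAVX is a hypothetical Intel AVX implementation; MandelbrotCPU is the plain C++ version from Listing 4 later in this article):

void ComputeMandelbrot(float x1, float y1, float x2, float y2,
                       int width, int height, int maxIters, unsigned short *image)
{
    if (DetectFeature(AVXFlag))
        MandelbrotAVX(x1, y1, x2, y2, width, height, maxIters, image); // hypothetical AVX path
    else
        MandelbrotCPU(x1, y1, x2, y2, width, height, maxIters, image); // plain C++ fallback
}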

Mandelbrot Example

To demonstrate the new instructions, we compute Mandelbrot set images using straight C/C++ code (checking to ensure that the compiler did not convert the code to Intel® AVX instructions!) and using the new Intel AVX instructions as intrinsics, then compare their performance. A Mandelbrot set computation is a computationally intensive operation on complex numbers, defined in pseudocode as shown in Listing 3.


Listing 3.  Mandelbrot Pseudocode

z, p are complex numbers
for each point p on the complex plane
    z = 0
    for count = 0 to max_iterations
        if abs(z) > 2.0
            break
        z = z*z + p
    set color at p based on count reached
The usual image covers the portion of the complex plane in the rectangle ( -2,-1) to ( 1,1). Coloring can be done in many ways (not covered here). Zooming into portions of the set requires raising the maximum iteration count to determine whether a value "escapes" over time.

To really stress the CPU, we zoom in and draw the box ( 0.29768, 0.48364) to ( 0.29778, 0.48354), computing the grid of counts at multiple sizes with a maximum iteration count of 4096. The resulting grid of counts, when colored appropriately, is shown in Figure 4.

Figure 4.  Mandelbrot set (0.29768, 0.48364) to (0.29778, 0.48354), with max iterations of 4096


A basic C++ implementation to compute the iteration counts is provided in Listing 4. The comparison of the complex number's absolute value against 2 is replaced with a comparison of its norm against 4.0, almost doubling the speed by removing a square root. All versions use single-precision floats to pack as many elements into the YMM registers as possible; this is faster but loses precision relative to doubles when zooming in further.
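The equivalence being exploited is that for z = x + yi,

    abs(z) > 2.0  if and only if  norm(z) = x*x + y*y > 4.0

because abs(z) = sqrt(x*x + y*y), so squaring both sides removes the square root.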


Listing 4.  Simple Mandelbrot C++ Code

// simple code to compute Mandelbrot in C++
#include <complex>
using std::complex;

void MandelbrotCPU(float x1, float y1, float x2, float y2,
                   int width, int height, int maxIters, unsigned short * image)
{
    float dx = (x2-x1)/width, dy = (y2-y1)/height;
    for (int j = 0; j < height; ++j)
        for (int i = 0; i < width; ++i)
        {
            complex<float> c(x1+dx*i, y1+dy*j), z(0,0);
            int count = -1;
            while ((++count < maxIters) && (norm(z) < 4.0f))
                z = z*z+c;
            *image++ = count;
        }
}
Four versions are tested for performance: the basic one in Listing 4, a similar CPU version made by expanding the complex type into explicit floats, an intrinsic-based Intel® SSE version, and the intrinsic-based Intel® AVX version shown in Listing 5. Each version is tested on image sizes of 128×128, 256×256, 512×512, 1024×1024, 2048×2048, and 4096×4096. The performance of each implementation could likely be improved with more work while retaining its underlying instruction set constraints, but the results should be representative of what you can obtain.

The Intel AVX version has been carefully crafted to fit as much as possible into the 16 YMM registers. To help track how you want them allocated, the variables are named  ymm0 through  ymm15. Of course, the compiler allocates registers as it sees fit, but by being careful, you can try to make all computations stay in registers this way. (Actually, from looking at the disassembly, the compiler does not allocate them nicely, and recasting this in assembly code would be a good exercise for anyone learning Intel AVX.)


Listing 5.  Intel® AVX-intrinsic Mandelbrot Implementation

float dx = (x2-x1)/width;
float dy = (y2-y1)/height;
// round up width to next multiple of 8
int roundedWidth = (width+7) & ~7UL;

float constants[] = {dx, dy, x1, y1, 1.0f, 4.0f};
__m256 ymm0 = _mm256_broadcast_ss(constants);   // all dx
__m256 ymm1 = _mm256_broadcast_ss(constants+1); // all dy
__m256 ymm2 = _mm256_broadcast_ss(constants+2); // all x1
__m256 ymm3 = _mm256_broadcast_ss(constants+3); // all y1
__m256 ymm4 = _mm256_broadcast_ss(constants+4); // all 1's (iter increments)
__m256 ymm5 = _mm256_broadcast_ss(constants+5); // all 4's (comparisons)

// used to reset the i position when j increases; 32-byte aligned for _mm256_load_ps
__declspec(align(32)) float incr[8] = {0.0f,1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f};
__m256 ymm6 = _mm256_xor_ps(ymm0,ymm0); // zero out j counter (ymm0 is just a dummy)

for (int j = 0; j < height; j+=1)
{
    __m256 ymm7 = _mm256_load_ps(incr);  // i counter set to 0,1,2,..,7
    for (int i = 0; i < roundedWidth; i+=8)
    {
        __m256 ymm8 = _mm256_mul_ps(ymm7, ymm0);  // x0 = (i+k)*dx
        ymm8 = _mm256_add_ps(ymm8, ymm2);         // x0 = x1+(i+k)*dx
        __m256 ymm9 = _mm256_mul_ps(ymm6, ymm1);  // y0 = j*dy
        ymm9 = _mm256_add_ps(ymm9, ymm3);         // y0 = y1+j*dy
        __m256 ymm10 = _mm256_xor_ps(ymm0,ymm0);  // zero out iteration counter
        __m256 ymm11 = ymm10, ymm12 = ymm10;      // set initial xi=0, yi=0

        unsigned int test = 0;
        int iter = 0;
        do
        {
            __m256 ymm13 = _mm256_mul_ps(ymm11,ymm11); // xi*xi
            __m256 ymm14 = _mm256_mul_ps(ymm12,ymm12); // yi*yi
            __m256 ymm15 = _mm256_add_ps(ymm13,ymm14); // xi*xi+yi*yi

            // xi*xi+yi*yi < 4 in each slot
            ymm15 = _mm256_cmp_ps(ymm15,ymm5, _CMP_LT_OQ);
            // now ymm15 has all 1s in the non-overflowed locations
            test = _mm256_movemask_ps(ymm15)&255;      // lower 8 bits are comparisons
            ymm15 = _mm256_and_ps(ymm15,ymm4);
            // get 1.0f or 0.0f in each field as counters
            // counters for each pixel iteration
            ymm10 = _mm256_add_ps(ymm10,ymm15);

            ymm15 = _mm256_mul_ps(ymm11,ymm12);        // xi*yi
            ymm11 = _mm256_sub_ps(ymm13,ymm14);        // xi*xi-yi*yi
            ymm11 = _mm256_add_ps(ymm11,ymm8);         // xi <- xi*xi-yi*yi+x0 done!
            ymm12 = _mm256_add_ps(ymm15,ymm15);        // 2*xi*yi
            ymm12 = _mm256_add_ps(ymm12,ymm9);         // yi <- 2*xi*yi+y0

            ++iter;
        } while ((test != 0) && (iter < maxIters));

        // convert iterations to output values
        __m256i ymm10i = _mm256_cvtps_epi32(ymm10);

        // write only where needed
        int top = (i+7) < width? 8: width&7;
        for (int k = 0; k < top; ++k)
            image[i+k+j*width] = ymm10i.m256i_i16[2*k];

        // next i position - increment each slot by 8 (two adds of the all-4's register)
        ymm7 = _mm256_add_ps(ymm7, ymm5);
        ymm7 = _mm256_add_ps(ymm7, ymm5);
    }
    ymm6 = _mm256_add_ps(ymm6,ymm4); // increment j counter
}
The full code for all versions, along with a Microsoft Visual Studio* 2010 with SP1 project including a testing harness, is available from the links in the "For More Information" section.

The results are shown in Figures 5 and 6. To avoid tying the numbers too closely to a specific CPU speed, Figure 5 shows the performance of each version relative to the CPU version, which represents a straightforward non-SIMD C/C++ implementation of the algorithm. As expected, the Intel® SSE version performs almost 4 times as well, because it processes 4 pixels per pass, and the Intel® AVX version performs almost 8 times as well as the CPU version. Because there is overhead from loops, memory access, less-than-perfect instruction ordering, and other factors, 4- and 8-fold improvements are about the best possible, so this is quite good for a first try.


Figure 5.  Relative performance across sizes

The second graph, in Figure 6, shows that the pixels computed per millisecond are fairly constant across image sizes; again, the algorithms show an almost quadrupling of performance from the CPU version to the Intel® SSE version and another near doubling from the Intel SSE version to the Intel® AVX version.


Figure 6.  Absolute performance across sizes

Conclusion

This article provided a mid-level overview of the new Intel® Advanced Vector Extensions (Intel® AVX). These extensions are similar to previous Intel® SSE instructions but offer a much larger register space and add some new instructions. The Mandelbrot example shows performance gains over previous technology in the amount expected. For full details, be sure to check out the Intel Advanced Vector Extensions Programming Reference (see "For More Information" for a link).

Happy hacking!


For More Information

Intel® Advanced Vector Extensions Programming Reference at  /en-us/avx/

Federal Information Processing Standards Publication 197, "Announcing the Advanced Encryption Standard," at /sites/default/files/m/4/d/d/fips-197.pdf

The IEEE 754-2008 floating-point format standard at  http://en.wikipedia.org/wiki/IEEE_754-2008

Floating-Point Support for 64-Bit Drivers at  http://msdn.microsoft.com/en-us/library/ff545910.aspx

Wikipedia's entry on the Mandelbrot set at  http://en.wikipedia.org/wiki/Mandelbrot_set

Intel® Software Development Emulator at  /en-us/articles/intel-software-development-emulator

The complete Mandelbrot Intel® AVX implementation for download at  http://www.lomont.org


About the Author

Chris Lomont works as a research engineer at Cybernet Systems on projects as diverse as quantum computing algorithms, image processing for NASA, security hardware for United States Homeland Security, and computer forensics. Before that, he obtained a PhD in math from Purdue and three bachelor's degrees in physics, math, and computer science; he has worked as a game programmer and done brief stints in financial modeling, robotics, and various consulting roles. The rest of his time is spent hiking with his wife, watching movies, giving talks, recreational programming, doing math research, learning more physics, playing music, and performing various experiments. Visit his website, www.lomont.org, or his electronic gadget site, www.hypnocube.com.


Appendix A: Instruction Set Reference

Many instructions come in packed or scalar form, meaning that they work on multiple parallel elements or on a single element in the register, a distinction marked as [P/S]. Entry lengths come in double or single precision for floating point (doubles and singles, for brevity), marked [D/S], and in the integer forms byte, word, doubleword, and quadword, marked [B/W/D/Q]. Integer forms also sometimes come in signed or unsigned variants, marked [S/U]. Some instructions work on high or low portions of registers, marked as [H/L]; other optional components appear in the tables. Instructions coming in both Intel® SSE form and Intel® AVX form are prefixed with a (V) for the Intel AVX form, which allows three operands and 256-bit register support. Entries in square brackets ([]) are required; entries in parentheses (()) are optional.

Examples:
  • (V)ADD[P/S][D/S] is the addition of packed or scalar, double or single, with eight possible forms: VADDPD, VADDPS, VADDSD, VADDSS, and the four versions without the leading V.
  • (V)[MIN/MAX][P/S][D/S] represents 16 different instructions for a min or max of packed or scalar of double or single precision.
The next table lists the comparison types. VEX-prefixed instructions have 32 comparison types; non-VEX-prefixed comparisons allow only the eight types shown in parentheses. Each comparison type comes in multiple flavors, where O = ordered, U = unordered, S = signaling, and Q = non-signaling. Ordered/unordered tells whether the comparison is false or true if one operand is NaN (Not-a-Number in floating point, which arises when something failed during a computation, such as dividing by 0 or taking the square root of a negative number). Signaling/non-signaling states whether an exception is fired when at least one operand is QNaN (Quiet Not-a-Number, useful for error trapping).
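A sketch of how NaN interacts with ordered versus unordered predicates, using the VEX-encoded 128-bit compare intrinsic (standard AVX intrinsics; quiet_NaN comes from the C++ <limits> header):

#include <immintrin.h>
#include <limits>

void ComparePredicateSketch()
{
    const float qnan = std::numeric_limits<float>::quiet_NaN();
    __m128 a = _mm_set_ps(4.0f, qnan, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(4.0f, 3.0f, 0.0f, 1.0f);
    __m128 eqOrdered    = _mm_cmp_ps(a, b, _CMP_EQ_OQ);  // the NaN lane compares false (ordered)
    __m128 neqUnordered = _mm_cmp_ps(a, b, _CMP_NEQ_UQ); // the NaN lane compares true (unordered)
    (void)eqOrdered; (void)neqUnordered;
}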

Type     Flavors             Meaning
EQ       (OQ), UQ, OS, US    Equal
LT       (OS), OQ            Less than
LE       (OS), OQ            Less than or equal to
UNORD    (Q), S              Tests for unordered (NaN)
NEQ      (UQ), US, OQ, OS    Not equal
NLT      (US), UQ            Not less than
NLE      (US), UQ            Not less than or equal to
ORD      (Q), S              Tests for ordered (not NaN)
NGE      US, UQ              Not greater than or equal to
NGT      US, UQ              Not greater than
FALSE    OQ, OS              Comparison is always false
GE       OS, OQ              Greater than or equal to
GT       OS, OQ              Greater than
TRUE     UQ, US              Comparison is always true

Finally, here are all the Intel® AVX instructions:

Arithmetic Description
(V)[ADD/SUB/MUL/DIV][P/S][D/S] Add/subtract/multiply/divide packed/scalar double/single
(V)ADDSUBP[D/S] Packed double/single add and subtract alternating indices
(V)DPP[D/S] Dot product, based on immediate mask
(V)HADDP[D/S] Horizontally add
(V)[MIN/MAX][P/S][D/S] Min/max packed/scalar double/single
(V)MOVMSKP[D/S] Extract double/single sign mask
(V)PMOVMSKB Make a mask consisting of the most significant bits
(V)MPSADBW Multiple sum of absolute differences
(V)PABS[B/W/D] Packed absolute value on bytes/words/doublewords
(V)P[ADD/SUB][B/W/D/Q] Add/subtract packed bytes/words/doublewords/quadwords
(V)PADD[S/U]S[B/W] Add packed signed/unsigned with saturation bytes/words
(V)PAVG[B/W] Average packed bytes/words
(V)PCLMULQDQ Carry-less multiplication quadword
(V)PH[ADD/SUB][W/D] Packed horizontal add/subtract word/doubleword
(V)PH[ADD/SUB]SW Packed horizontal add/subtract with saturation
(V)PHMINPOSUW Min horizontal unsigned word and position
(V)PMADDWD Multiply and add packed integers
(V)PMADDUBSW Multiply unsigned bytes and signed bytes into signed words
(V)P[MIN/MAX][S/U][B/W/D] Min/max of packed signed/unsigned integers
(V)PMUL[H/L][S/U]W Multiply packed signed/unsigned integers and store high/low result
(V)PMULHRSW Multiply packed signed integers with round and scale, store high result
(V)PMULHW Multiply packed integers and store high result
(V)PMULL[W/D] Multiply packed integers and store low result
(V)PMUL(U)DQ Multiply packed (un)signed doubleword integers and store quadwords
(V)PSADBW Compute sum of absolute differences of unsigned bytes
(V)PSIGN[B/W/D] Change the sign on each element in one operand based on the sign in the other operand
(V)PS[L/R]LDQ Byte shift left/right amount in operand
(V)PS[LL/RA/RL][W/D/Q] Bit shift left/arithmetic right/logical right
(V)PSUB(U)S[B/W] Packed (un)signed subtract with (un)signed saturation
(V)RCP[P/S]S Compute approximate reciprocal of packed/scalar single precision
(V)RSQRT[P/S]S Compute approximate reciprocal of square root of packed/scalar single precision
(V)ROUND[P/S][D/S] Round packed/scalar double/single
(V)SQRT[P/S][D/S] Square root of packed/scalar double/single
VZERO[ALL/UPPER] Zero all/upper half of YMM registers

Comparison Description
(V)CMP[P/S][D/S] Compare packed/scalar double/single
(V)COMIS[S/D] Compare scalar double/single, set EFLAGS
(V)PCMP[EQ/GT][B/W/D/Q] Compare packed integers for equality/greater than
(V)PCMP[E/I]STR[I/M] Compare explicit/implicit length strings, return index/mask

Control Description
V[LD/ST]MXCSR Load/store MXCSR control/status register
XSAVEOPT Save processor extended states optimized

Conversion Description
(V)CVTx2y Convert type x to type y, where x and y are chosen from
DQ and P[D/S],
[P/S]S and [P/S]D, or
S[D/S] and SI.

Load/store Description
VBROADCAST[SS/SD/F128] Load with broadcast (loads single value into multiple locations)
VEXTRACTF128 Extract 128-bit floating-point values
(V)EXTRACTPS Extract packed single precision
VINSERTF128 Insert packed floating-point values
(V)INSERTPS Insert packed single-precision values
(V)PINSR[B/W/D/Q] Insert integer
(V)LDDQU Load unaligned double quadword integer
(V)MASKMOVDQU Store selected bytes of double quadword with NT Hint
VMASKMOVP[D/S] Conditional SIMD packed load/store
(V)MOV[A/U]P[D/S] Move aligned/unaligned packed double/single
(V)MOV[D/Q] Move doubleword/quadword
(V)MOVDQ[A/U] Move aligned/unaligned double quadword
(V)MOV[HL/LH]P[D/S] Move high-to-low/low-to-high packed double/single
(V)MOV[H/L]P[D/S] Move high/low packed double/single
(V)MOVNT[DQ/PD/PS] Move packed integers/doubles/singles using a non-temporal hint
(V)MOVNTDQA Move packed integers using a non-temporal hint, aligned
(V)MOVS[D/S] Move or merge scalar double/single
(V)MOVS[H/L]DUP Duplicate odd/even-indexed single-precision values
(V)PACK[U/S]S[WB/DW] Pack words to bytes / doublewords to words with unsigned/signed saturation
(V)PALIGNR Byte align
(V)PEXTR[B/W/D/Q] Extract integer
(V)PMOV[S/Z]X[B/W/D][W/D/Q] Packed move with sign/zero extend (only up in length, DD, DW, etc. disallowed)

Logical Description
(V)[AND/ANDN/OR]P[D/S] Bitwise logical AND/AND NOT/OR of packed double/single values
(V)PAND(N) Logical AND (NOT)
(V)P[OR/XOR] Bitwise logical OR/exclusive OR
(V)PTEST Packed bit test, set zero flag if bitwise AND is all 0
(V)UCOMIS[D/S] Unordered compare scalar doubles/singles and set EFLAGS
(V)XORP[D/S] Bitwise logical XOR of packed double/single

Shuffle Description
(V)BLENDP[D/S] Blend packed double/single; selects elements based on mask
(V)BLENDVP[D/S] Blend values
(V)MOVDDUP Duplicate even-indexed double-precision values
(V)PBLENDVB Variable blend packed bytes
(V)PBLENDW Blend packed words
VPERMILP[D/S] Permute double/single values
VPERM2F128 Permute floating-point values
(V)PSHUF[B/D] Shuffle packed bytes/doublewords based on immediate value
(V)PSHUF[H/L]W Shuffle packed high/low words
(V)PUNPCK[H/L][BW/WD/DQ/QDQ] Unpack high/low data
(V)SHUFP[D/S] Shuffle packed double/single
(V)UNPCK[H/L]P[D/S] Unpack and interleave packed/scalar doubles/singles

AES Description
AESENC/AESENCLAST Perform one round of AES encryption
AESDEC/AESDECLAST Perform one round of AES decryption
AESIMC Perform the AES InvMixColumn transformation
AESKEYGENASSIST AES Round Key Generation Assist

Future Instructions Description
[RD/WR][F/G]SBASE Read/write FS/GS register
RDRAND Read random number (into r16, r32, r64)
VCVTPH2PS Convert 16-bit floats to single precision floating-point values
VCVTPS2PH Convert single-precision values to 16-bit floating-point values

FMA: Each [z] is the string 132, 213, or 231, giving the order in which the operands A, B, C are used:
132 is A=AC+B
213 is A=AB+C
231 is A=BC+A
VFMADD[z][P/S][D/S] Fused multiply add A = r1 * r2 + r3 for packed/scalar of double/single
VFMADDSUB[z]P[D/S] Fused multiply alternating add/subtract of packed double/single A = r1 * r2 + r3 for odd index, A = r1 * r2-r3 for even
VFMSUBADD[z]P[D/S] Fused multiply alternating subtract/add of packed double/single A = r1 * r2-r3 for odd index, A = r1 * r2+r3 for even
VFMSUB[z][P/S][D/S] Fused multiply subtract A = r1 * r2-r3 of packed/scalar double/single
VFNMADD[z][P/S][D/S] Fused negative multiply add of packed/scalar double/single A = -r1 * r2+r3
VFNMSUB[z][P/S][D/S] Fused negative multiply subtract of packed/scalar double/single A = -r1 * r2-r3
