Memcpy(), a fast and portable implementation

原文:http://www.vik.cc/daniel/portfolio/memcpy.htm

注: 本文由于包含原文和译文,故请勿转载,否则视为侵权。

本文地址:http://blog.csdn.net/linyt/archive/2009/04/07/4053164.aspx

 

Memcpy(), a fast and portable implementation

Memcpy(), 一个快速的和可移植的实现

1. Introduction

1. 导言

A co-worker of mine, Fredrik Bredberg, once made an implementation of memcpy() and he was very proud of the result. His implementation was faster than many standardized C library routines found in the embedded market. When looking at his code, I found several places where improvements could be made. I made an implementation, which was quite a lot faster than Fredrik's and this started a friendly competition to make the fastest portable C implementation of memcpy(). Both our implementations got better and better and looked more alike and finally we had an implementation that was very fast and that beats both the native library routines in Windows and Linux, especially when the memory to be copied is not aligned on a 32 bit boundary.

我的同事Fredrik Bredberg曾实现过memcpy,并为此结果感到非常自豪。他的实现比嵌入式市场上很多标准C库例程(memcpy)还要快。查阅他的代码时,我发现代码中有几个地方可以进一步改善。接着我实现一个比Fredrik更快的memcpy,于是为打造最快的,可移植的memcpy C实现,我们开始了友好的竞赛。我们的两个实现变得越来越快,看起来也很相似,最终我们提供了一个非常快的实现。尤其当复制的内存不是32位边界对齐时,该实现优于Windows和Linux下的本地库例程。

The following paragraphs contain descriptions to some of the techniques used in the final implementation.

下面几段介绍最终实现所使用的一些技术。

 

2. Mimic the CPU's Architecture

2. 模拟CPU架构

One of the biggest advantages in the original Bredberg implementation was that the C code was made to imitate the instruction sets on the target processor. He discovered that different processors had different instructions for handling memory pointers. On a Motorola 68K processor the code

Bredberg原先实现的最大优点之一是C代码模拟目标处理器的指令集。他发现处理内存指针的指令因目标处理器而异。在摩托罗拉68K处理器上,代码

        *dst8++ = *src8++;

that copies one byte from the address src8 to the address dst8 and increases both pointers, compiled into a single instruction:

从地址src8上复制一个字节到地址dst8上,并增加这两指针之值,被编译成一条指令:

        MOV.B (A0)+, (A2)+
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

This piece of code can be put into a while loop and will copy memory from the address src to the address dest:

这代码片可放到while循环内,并能从地址src复制内存到地址dest:

        void *memcpy(void *dest, const void *src, size_t count) {
            char *dst8 = (char *)dest;
            char *src8 = (char *)src;
           
            while (count--) {
                *dst8++ = *src8++;
            }
            return dest;
}
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

While this is pretty good for the Motorola processor, it is not very efficient on a PowerPC that does not have any instructions for post incrementing pointers. The PowerPC uses four instructions for the same task that only required one instruction on the Motorola processor. However, the PowerPC has a set of instructions to load and store data that utilize pre increment of pointers which means that the following code only results in two instructions when compiled on the PowerPC:

虽然在摩托罗拉处理器上,这相当不错。但在PowerPC处理器上并不高效,由于它没有任何后置自增指针指令。摩托罗拉处理器需要一条指令实现的任务,在PowerPC上却需要4条指令来实现。然而,PowerPC提供一套利用前自增指针来加载以及储存数据的指令。这意味意下面代码在PowerPC下编译只生成两条指令:

        *++dst8++ = *++src8;
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

In order to use this construction, the pointers have to be decreased before the loop begins and the final code becomes:

为了利用PowerPC这一特点,在循环前先将指针自减,最终代码变成:

        void *memcpy(void *dest, const void *src, size_t count) {
            char *dst8 = (char *)dest;
            char *src8 = (char *)src;
           
            --src8;
            --dst8;
           
            while (count--) {
                *++dst8 = *++src8;
            }
            return dest;
        }
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

Unfortunately the ARM processor has no instructions for either pre increment or post increment of pointers. This means that the pointer needs to be incremented manually. If the example above is compiled to an ARM processor, the while loop would actually look something like:

不幸的是,ARM处理器没有任何指针前置或后置自增,必须使用额外指令来增加指针值。如果上面例子在ARM处理器上编译,while循环的代码与下面代码相当。

        while (count--) {
            dst8[0] = src8[0];
            ++dst8;
            ++src8;
        }
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

The ARM processor luckily has another feature. It can read and write to memory at a fixed offset from a base pointer. This resulted in a third way of implementing the same task:

幸运地,ARM处理器有另一个特性,那就是它可以通过基址和因定偏移量来指定读写内存。那这任务的第三种实现方法就顺理成章了:

        void *memcpy(void *dest, const void *src, size_t count) {
            char *dst8 = (char *)dest;
            char *src8 = (char *)src;
           
            if (count & 1) {
                dst8[0] = src8[0];
                dst8 += 1;
                src8 += 1;
            }
           
            count /= 2;
            while (count--) {
                dst8[0] = src8[0];
                dst8[1] = src8[1];
               
                dst8 += 2;
                src8 += 2;
            }
            return dest;
        }
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

Here the number of turns the loop has to be executed is half of what it was in the earlier examples and the pointers are only updated half as often.

与最前的例子相比,这里循环执行的次数减少一半,同时指针值更改的次数也减半。

3. Optimizing memory accesses

3. 优化内存访问

In most systems, the CPU clock runs at much higher frequency than the speed of the memory bus. My first improvement to the Bredberg original was to read 32 bits at the time from the memory. It is of course possible to read larger chunks of data on some targets with wider data bus and wider data registers. The goal with the C implementation of memcpy() was to get portable code mainly for embedded systems. On such systems it is often expensive to use data types like double and some systems doesn't have a FPU (Floating Point Unit).

大多数系统中,CPU时钟频率远高于内存总线的速度。我在Bredberg原先代码上的第一个改进就是从内存一次读取32位数据。当然,在某些有更宽数据总线和数据寄存上的目标处理器上,可以一次读取更大的数据块。Memcpy()的C实现的目标主要是在嵌入式系统上实现可移植代码。在这些嵌入式系统上,使用诸如double这些数据类型是非常昂贵的,并且某些系统没有FPU(Floating Point Unit)。

By trying to read and write memory in 32 bit blocks as often as possible, the speed of the implementation is increased dramatically, especially when copying data that is not aligned on a 32-bit boundary.

尽可能按32位块来读写内存,实现的速度急剧增大,特别拷贝那些非32位边界对齐的数据时。

It is however quite tricky to do this. The accesses to memory need to be aligned on 32-bit addresses. The implementation needs two temporary variables that implement a 64-bit sliding window where the source data is kept temporary while being copied into the destination. The example below shows how this can be done when the destination buffer is aligned on a 32-bit address and the source buffer is 8 bits off the alignment:

然而,实现这点是相当需要技巧的,因为被访问的内存必须是32位地址对齐的。实现需要两个用于实现64位滑动窗口的临时变量,源数据存放于该滑动窗口,然后再拷贝到目标地址上。下面例子展示了当目标缓冲区为32位对齐和源缓冲区为偏8位对齐时,如何做到这一点。

        srcWord = *src32++;
       
        while (len--) {
            dstWord  = srcWord << 8;
            srcWord  = *src32++;
            dstWord |= srcWord >> 24;
            *dst32++ = dstWord;
        }

<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

4. Optimizing branches

4. 优化分支

Another improvement is to make it easier for the compiler to generate code that utilizes the processors compare instructions in an efficient way. This means creating loops that terminates when a compare value is zero. The loop

另一个提升是使编译器最容易生成那些高效地利用处理器比较指令的代码,即编写比较值为0时结束的循环。循环

        while (++i > count)
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

often generates more complicated and inefficient code than

生成的代码通常复杂于和低效于循环

        while (count--)
生成的代码。

Another thing that makes the code more efficient is when the CPU's native loop instructions can be used. A compiler often generates better code for the loop above than for the following loop expression

另一个提高效率的方法是使用CPU本地的loop指令。编译器为上述循环生成的机器代码通常要比下面这个要高效。

        while (count -= 2)
<!--[if !supportLineBreakNewLine]-->
<!--[endif]-->

5. Conclusion

5. 总结

The techniques described here makes the C implementation of memcpy() a lot faster and in many cases faster than commercial ones. The implementation can probably be improved even more, especially by using wider data types when available. If the target and the compiler supports 64-bit arithmetic operations such as the shift operator, these techniques can be used to implement a 64-bit version as well. I tried to find a compiler with this support for SPARC but I didn't find one. If 64-bit operations can be made in one instruction, the implementation will be faster than the native Solaris memcpy() which is probably written in assembly.

本文描述的技术使memcpy()的C实现快了很多,在大多数情况比商业产品的实现要快。此实现还可能得到改善,特别是,如果有的话,使用更宽的数据类型。如果目标处理器和编译器支持64位的算术操作,如位移操作,这些技术同样可以用来实现64位的版本。我尝试在SPARC上寻找支持64位算术操作的编译,但以失败告终。若(找到这样的编译器,)64位的操作可以由一条指令来完成,memcpy实现将比本地的Solaris memcpy()要快,尽管Solaris memcpy()很可能是由汇编来编写的。

Another improvement that can be to some 32-bit architectures made is that if both the source and the destination buffer are 64-bit aligned or have the same alignment, a double type can be used to speed up the copying. I tried this with the gcc compiler for x86 and the result was a memcpy() that was faster in almost all cases than the native Linux memcpy().

对于一些32位架构,如果源和目标缓冲区都是64位对齐的,可以使用double来提高拷贝的速度,此为另一个改善。在x86的gcc编译器上,我尝试使用double类型,结果几乎所有情况下都快于Linux本地的memcpy()。

6. The complete source code

6. 完整的源代码

The following code contains the complete memcpy() implementation. The code is configured for an intel x86 target but it is easy to change configuration as desired.

下面代码包含完整的memcpy()实现。代码已配置为intel x86了,但很容易更改成如你所愿的配置。

/********************************************************************
 ** File:     memcpy.c
 **
 ** Copyright (C) 2005 Daniel Vik
 **
 ** This software is provided 'as-is', without any express or implied
 ** warranty. In no event will the authors be held liable for any
 ** damages arising from the use of this software.
 ** Permission is granted to anyone to use this software for any
 ** purpose, including commercial applications, and to alter it and
 ** redistribute it freely, subject to the following restrictions:
 **
 ** 1. The origin of this software must not be misrepresented; you
 **    must not claim that you wrote the original software. If you
 **    use this software in a product, an acknowledgment in the
 **    use this software in a product, an acknowledgment in the
 **    product documentation would be appreciated but is not
 **    required.
 **
 ** 2. Altered source versions must be plainly marked as such, and
 **    must not be misrepresented as being the original software.
 **
 ** 3. This notice may not be removed or altered from any source
 **    distribution.
 **
 **
 ** Description: Implementation of the standard library function memcpy.
 **             This implementation of memcpy() is ANSI-C89 compatible.
 **
 **             The following configuration options can be set:
 **
 **           LITTLE_ENDIAN   - Uses processor with little endian
 **                             addressing. Default is big endian.
 **
 **           PRE_INC_PTRS    - Use pre increment of pointers.
 **                             Default is post increment of
 **                             pointers.
 **
 **           INDEXED_COPY    - Copying data using array indexing.
 **                             Using this option, disables the
 **                             PRE_INC_PTRS option.
 **
 **
 ** Best Settings:
 **
 ** Intel x86:  LITTLE_ENDIAN and INDEXED_COPY
 **
 *******************************************************************/

 

/********************************************************************
 ** Configuration definitions.
 *******************************************************************/

#define LITTLE_ENDIAN
#define INDEXED_COPY

 

/********************************************************************
 ** Includes for size_t definition
 *******************************************************************/

#include <stddef.h>

 

/********************************************************************
 ** Typedefs
 *******************************************************************/

typedef unsigned char  u8;
typedef unsigned short u16;
typedef unsigned long  u32;

 

/********************************************************************
 ** Remove definitions when INDEXED_COPY is defined.
 *******************************************************************/

#if defined (INDEXED_COPY)
#if defined (PRE_INC_PTRS)
#undef PRE_INC_PTRS
#endif /*PRE_INC_PTRS*/
#endif /*INDEXED_COPY*/

 

/********************************************************************
 ** Definitions for pre and post increment of pointers.
 *******************************************************************/

#if defined (PRE_INC_PTRS)

#define INC_VAL(x) *++(x)
#define START_VAL(x) (x)--
#define CAST_32_TO_8(p, o)       (u8 *)((u32)p + o + 4)
#define WHILE_DEST_BREAK         3
#define PRE_LOOP_ADJUST        - 3
#define PRE_SWITCH_ADJUST      + 1

#else /*PRE_INC_PTRS*/

#define START_VAL(x)
#define INC_VAL(x) *(x)++
#define CAST_32_TO_8(p, o)       (u8 *)((u32)p + o)
#define WHILE_DEST_BREAK         0
#define PRE_LOOP_ADJUST
#define PRE_SWITCH_ADJUST

#endif /*PRE_INC_PTRS*/

 

/********************************************************************
 ** Definitions for endians
 *******************************************************************/

#if defined (LITTLE_ENDIAN)

#define SHL >>
#define SHR <<

#else /*LITTLE_ENDIAN*/

#define SHL <<
#define SHR >>

#endif /*LITTLE_ENDIAN*/

 

/********************************************************************
 ** Macros for copying 32 bit words of  different alignment.
 ** Uses incremening pointers.
 *******************************************************************/

#define CP32_INCR() {                       /
    INC_VAL(dst32) = INC_VAL(src32);        /
}

#define CP32_INCR_SH(shl, shr) {            /
    dstWord   = srcWord SHL shl;            /
    srcWord   = INC_VAL(src32);             /
    dstWord  |= srcWord SHR shr;            /
    INC_VAL(dst32) = dstWord;               /
}

 

/********************************************************************
 ** Macros for copying 32 bit words of  different alignment.
 ** Uses array indexes.
 *******************************************************************/

#define CP32_INDEX(idx) {                   /
    dst32[idx] = src32[idx];                /
}

#define CP32_INDEX_SH(x, shl, shr) {        /
    dstWord   = srcWord SHL shl;            /
    srcWord   = src32[x];                   /
    dstWord  |= srcWord SHR shr;            /
    dst32[x] = dstWord;                     /
}

 

/********************************************************************
 ** Macros for copying 32 bit words of different alignment.
 ** Uses incremening pointers or array indexes depending on
 ** configuration.
 *******************************************************************/

#if defined (INDEXED_COPY)

#define CP32(idx)               CP32_INDEX(idx)
#define CP32_SH(idx, shl, shr)  CP32_INDEX_SH(idx, shl, shr)

#define INC_INDEX(p, o)         ((p) += (o))

#else /*INDEXED_COPY*/

#define CP32(idx)               CP32_INCR()
#define CP32_SH(idx, shl, shr)  CP32_INCR_SH(shl, shr)

#define INC_INDEX(p, o)

#endif /*INDEXED_COPY*/

 

/********************************************************************
 **
 ** void *memcpy(void *dest, const void *src, size_t count)
 **
 ** Args:     dest    - pointer to destination buffer
 **           src     - pointer to source buffer
 **           count   - number of bytes to copy
 **
 ** Return:   A pointer to destination buffer
 **
 ** Purpose:  Copies count bytes from src to dest. No overlap check
 **           is performed.
 **
 *******************************************************************/

void *memcpy(void *dest, const void *src, size_t count) {
    u8 *dst8 = (u8 *)dest;
    u8 *src8 = (u8 *)src;

    if (count < 8) {
        if (count >= 4 && ((((u32)src8 | (u32)dst8)) & 3) == 0) {
            *((u32 *)dst8) = *((u32 *)src8);
            dst8  += 4;
            src8  += 4;
            count -= 4;
        }

        START_VAL(dst8);
        START_VAL(src8);

        while (count--) {
            INC_VAL(dst8) = INC_VAL(src8);
        }

        return dest;
    }

    START_VAL(dst8);
    START_VAL(src8);

    while (((u32)dst8 & 3L) != WHILE_DEST_BREAK) {
        INC_VAL(dst8) = INC_VAL(src8);
        count--;
    }

    switch ((((u32)src8) PRE_SWITCH_ADJUST) & 3L) {
    default:
        {
            u32 *dst32 = (u32 *)(((u32)dst8) PRE_LOOP_ADJUST);
            u32 *src32 = (u32 *)(((u32)src8) PRE_LOOP_ADJUST);
            u32 length = count / 4;

            while (length & 7) {
                CP32_INCR();
                length--;
            }

            length /= 8;

            while (length--) {
                CP32(0);
                CP32(1);
                CP32(2);
                CP32(3);
                CP32(4);
                CP32(5);
                CP32(6);
                CP32(7);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, 0);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }

    case 1:
        {
            u32 *dst32  = (u32 *)((((u32)dst8) PRE_LOOP_ADJUST) & ~3L);
            u32 *src32  = (u32 *)((((u32)src8) PRE_LOOP_ADJUST) & ~3L);
            u32 length  = count / 4;
            u32 srcWord = INC_VAL(src32);
            u32 dstWord;

            while (length & 7) {
                CP32_INCR_SH(8, 24);
                length--;
            }

            length /= 8;

            while (length--) {
                CP32_SH(0, 8, 24);
                CP32_SH(1, 8, 24);
                CP32_SH(2, 8, 24);
                CP32_SH(3, 8, 24);
                CP32_SH(4, 8, 24);
                CP32_SH(5, 8, 24);
                CP32_SH(6, 8, 24);
                CP32_SH(7, 8, 24);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, -3);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }

    case 2:
        {
            u32 *dst32  = (u32 *)((((u32)dst8) PRE_LOOP_ADJUST) & ~3L);
            u32 *src32  = (u32 *)((((u32)src8) PRE_LOOP_ADJUST) & ~3L);
            u32 length  = count / 4;
            u32 srcWord = INC_VAL(src32);
            u32 dstWord;

            while (length & 7) {
                CP32_INCR_SH(16, 16);
                length--;
            }

            length /= 8;

            while (length--) {
                CP32_SH(0, 16, 16);
                CP32_SH(1, 16, 16);
                CP32_SH(2, 16, 16);
                CP32_SH(3, 16, 16);
                CP32_SH(4, 16, 16);
                CP32_SH(5, 16, 16);
                CP32_SH(6, 16, 16);
                CP32_SH(7, 16, 16);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, -2);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }

    case 3:
        {
            u32 *dst32  = (u32 *)((((u32)dst8) PRE_LOOP_ADJUST) & ~3L);
            u32 *src32  = (u32 *)((((u32)src8) PRE_LOOP_ADJUST) & ~3L);
            u32 length  = count / 4;
            u32 srcWord = INC_VAL(src32);
            u32 dstWord;

            while (length & 7) {
                CP32_INCR_SH(24, 8);
                length--;
            }

            length /= 8;

            while (length--) {
                CP32_SH(0, 24, 8);
                CP32_SH(1, 24, 8);
                CP32_SH(2, 24, 8);
                CP32_SH(3, 24, 8);
                CP32_SH(4, 24, 8);
                CP32_SH(5, 24, 8);
                CP32_SH(6, 24, 8);
                CP32_SH(7, 24, 8);

                INC_INDEX(dst32, 8);
                INC_INDEX(src32, 8);
            }

            src8 = CAST_32_TO_8(src32, -1);
            dst8 = CAST_32_TO_8(dst32, 0);

            if (count & 2) {
                *dst8++ = *src8++;
                *dst8++ = *src8++;
            }

            if (count & 1) {
                *dst8 = *src8;
            }

            return dest;
        }
    }
}

 

本文来自CSDN博客,转载请标明出处:http://blog.csdn.net/linyt/archive/2009/04/07/4053164.aspx

你可能感兴趣的:(Memcpy(), a fast and portable implementation)