Hands-on zero-copy file operations with libevent on Linux

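1. Preface

Over the past couple of days I went back through the libevent source and noticed that its file-handling code contains sendfile- and mmap-related paths. Usage notes online turned out to be scarce, so, since practice is the best teacher, I decided to try it out myself.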

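2. Introduction

2.1 A first look at zero-copy

mmap is a way of memory-mapping a file: it maps a file (or another object) into the process's address space, establishing a one-to-one correspondence between the file's on-disk addresses and a range of virtual addresses in the process [2].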

#include <sys/mman.h>
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);
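To make the prototypes above concrete, here is a minimal sketch that maps a file read-only and scans it in place, without a read() into a user-space buffer. The helper name count_newlines_mmap is made up for this illustration and is not part of any library.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file by mapping it read-only.
 * Returns the count, or -1 on error. */
long count_newlines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *p = (char *)mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long n = 0;
    for (size_t i = 0; i < len; i++)
        if (p[i] == '\n')
            n++;

    munmap(p, len);
    return n;
}
```

Closing the descriptor right after mmap is deliberate: the mapping keeps its own reference to the file, and the strace output in section 3.2 shows the same pattern.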

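sendfile is another form of zero-copy. It is typically used to send a file's contents to a socket: in_fd must be a file that supports mmap-like operations (it cannot be a socket), and since Linux 2.6.33 out_fd may be any file, not only a socket.

#include <sys/sendfile.h>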
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
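As a rough sketch of how this call is used, the helper below copies one regular file to another entirely inside the kernel. It assumes Linux 2.6.33 or later (so that out_fd may be a regular file); copy_with_sendfile is a name invented for this example.

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy src to dst with sendfile(2).
 * Returns the number of bytes copied, or -1 on error. Linux-specific. */
long long copy_with_sendfile(const char *src, const char *dst)
{
    int in_fd = open(src, O_RDONLY);
    if (in_fd < 0)
        return -1;
    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd < 0) {
        close(in_fd);
        return -1;
    }

    struct stat st;
    long long total = -1;
    if (fstat(in_fd, &st) == 0) {
        off_t offset = 0;           /* sendfile advances this for us */
        total = 0;
        while (offset < st.st_size) {
            ssize_t n = sendfile(out_fd, in_fd, &offset,
                                 (size_t)(st.st_size - offset));
            if (n <= 0) {           /* error, or unexpected end of input */
                total = -1;
                break;
            }
            total += n;
        }
    }
    close(in_fd);
    close(out_fd);
    return total;
}
```

The loop is there because sendfile, like write, may transfer fewer bytes than requested.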

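The advantage of zero-copy is that it reduces the number of copies between user space and kernel space.
In practice, though, roughly 80% of code never needs to think about zero-copy; the question is what to choose for the remaining 20%.

2.2 File operations in evbuffer

Looking through libevent's file interfaces, you will be pleasantly surprised: you just call the API, and if your system supports zero-copy, libevent will prefer it.

Basic usage, adding a file's contents to an evbuffer: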

/**
  Copy data from a file into the evbuffer for writing to a socket.

  This function avoids unnecessary data copies between userland and
  kernel.  If sendfile is available and the EVBUFFER_FLAG_DRAINS_TO_FD
  flag is set, it uses those functions.  Otherwise, it tries to use
  mmap (or CreateFileMapping on Windows).

  The function owns the resulting file descriptor and will close it
  when finished transferring data.

  The results of using evbuffer_remove() or evbuffer_pullup() on
  evbuffers whose data was added using this function are undefined.

  For more fine-grained control, use evbuffer_add_file_segment.

  @param outbuf the output buffer
  @param fd the file descriptor
  @param offset the offset from which to read data
  @param length how much data to read, or -1 to read as much as possible.
    (-1 requires that 'fd' support fstat.)
  @return 0 if successful, or -1 if an error occurred
*/

EVENT2_EXPORT_SYMBOL
int evbuffer_add_file(struct evbuffer *outbuf, int fd, ev_off_t offset, ev_off_t length);

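Writing data to a file descriptor: internally it moves as much of the evbuffer's contents as it can. Note that this interface does not guarantee that everything is written in a single call.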

/**
  Write the contents of an evbuffer to a file descriptor.

  The evbuffer will be drained after the bytes have been successfully written.

  @param buffer the evbuffer to be written and drained
  @param fd the file descriptor to be written to
  @return the number of bytes written, or -1 if an error occurred
  @see evbuffer_read()
 */
EVENT2_EXPORT_SYMBOL
int evbuffer_write(struct evbuffer *buffer, evutil_socket_t fd);

2.3 Digging deeper

The internal write path:
evbuffer_write >> evbuffer_write_atmost >> evbuffer_write_sendfile
If sendfile cannot be used, it falls back to writev, and only as a last resort to pullup + write.

buffer.c +2510
int
evbuffer_write_atmost(struct evbuffer *buffer, evutil_socket_t fd,
    ev_ssize_t howmuch)
{
    int n = -1;

    EVBUFFER_LOCK(buffer);

    if (buffer->freeze_start) {
        goto done;
    }    

    if (howmuch < 0 || (size_t)howmuch > buffer->total_len)
        howmuch = buffer->total_len;

    if (howmuch > 0) { 
#ifdef USE_SENDFILE
        struct evbuffer_chain *chain = buffer->first;
        if (chain != NULL && (chain->flags & EVBUFFER_SENDFILE))
            n = evbuffer_write_sendfile(buffer, fd, howmuch);
        else {
#endif
#ifdef USE_IOVEC_IMPL
        n = evbuffer_write_iovec(buffer, fd, howmuch);
#elif defined(_WIN32)
        /* XXX(nickm) Don't disable this code until we know if
         * the WSARecv code above works. */
        void *p = evbuffer_pullup(buffer, howmuch);
        EVUTIL_ASSERT(p || !howmuch);
        n = send(fd, p, howmuch, 0);
#else
        void *p = evbuffer_pullup(buffer, howmuch);
        EVUTIL_ASSERT(p || !howmuch);
        n = write(fd, p, howmuch);
#endif
#ifdef USE_SENDFILE
        }    
#endif
    }    
    if (n > 0) 
        evbuffer_drain(buffer, n);
done:
    EVBUFFER_UNLOCK(buffer);
    return (n);
}

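In evbuffer_write_atmost, the EVBUFFER_SENDFILE branch is taken when the evbuffer's first chain carries a source file descriptor.
That brings in another interface: evbuffer_file_segment_new

First, look at how evbuffer_add_file is implemented: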

int evbuffer_add_file(struct evbuffer *buf, int fd, ev_off_t offset, ev_off_t length)
{
    struct evbuffer_file_segment *seg;
    unsigned flags = EVBUF_FS_CLOSE_ON_FREE;
    int r;

    seg = evbuffer_file_segment_new(fd, offset, length, flags);
    if (!seg)
        return -1;
    r = evbuffer_add_file_segment(buf, seg, 0, length);
    if (r == 0)
        evbuffer_file_segment_free(seg);
    return r;
}

As you can see, it wraps this family of interfaces:
evbuffer_file_segment_new,
evbuffer_add_file_segment,
evbuffer_file_segment_free.

/**                                                                                                                                                 
   Create and return a new evbuffer_file_segment for reading data from a                                                                            
   file and sending it out via an evbuffer.

   This function avoids unnecessary data copies between userland and                                                                                
   kernel.  Where available, it uses sendfile or splice.                                                                                            

   The file descriptor must not be closed so long as any evbuffer is using                                                                          
   this segment.                                                                                                                                    

   The results of using evbuffer_remove() or evbuffer_pullup() or any other                                                                         
   function that reads bytes from an evbuffer on any evbuffer containing                                                                            
   the newly returned segment are undefined, unless you pass the                                                                                    
   EVBUF_FS_DISABLE_SENDFILE flag to this function.                                                                                                 

   @param fd an open file to read from.                                                                                                             
   @param offset an index within the file at which to start reading                                                                                 
   @param length how much data to read, or -1 to read as much as possible.                                                                          
      (-1 requires that 'fd' support fstat.)                                                                                                        
   @param flags any number of the EVBUF_FS_* flags
   @return a new evbuffer_file_segment, or NULL on failure.                                                                                         
 **/                                                                                                                                                
EVENT2_EXPORT_SYMBOL                                                                                                                                
struct evbuffer_file_segment *evbuffer_file_segment_new(
    int fd, ev_off_t offset, ev_off_t length, unsigned flags);                                                                                      
                                                                                                                                                    
/**                                                                                                                                                 
   Free an evbuffer_file_segment

   It is safe to call this function even if the segment has been added to
   one or more evbuffers.  The evbuffer_file_segment will not be freed                                                                              
   until no more references to it exist.
 */
EVENT2_EXPORT_SYMBOL                                                                                                                                
void evbuffer_file_segment_free(struct evbuffer_file_segment *seg); 

/**                                                                                                                                                 
   Insert some or all of an evbuffer_file_segment at the end of an evbuffer

   Note that the offset and length parameters of this function have a
   different meaning from those provided to evbuffer_file_segment_new: When
   you create the segment, the offset is the offset _within the file_, and
   the length is the length _of the segment_, whereas when you add a
   segment to an evbuffer, the offset is _within the segment_ and the
   length is the length of the _part of the segment you want to use.

   In other words, if you have a 10 KiB file, and you create an
   evbuffer_file_segment for it with offset 20 and length 1000, it will
   refer to bytes 20..1019 inclusive.  If you then pass this segment to
   evbuffer_add_file_segment and specify an offset of 20 and a length of
   50, you will be adding bytes 40..99 inclusive.

   @param buf the evbuffer to append to
   @param seg the segment to add
   @param offset the offset within the segment to start from
   @param length the amount of data to add, or -1 to add it all.
   @return 0 on success, -1 on failure.
 */
EVENT2_EXPORT_SYMBOL
int evbuffer_add_file_segment(struct evbuffer *buf,
    struct evbuffer_file_segment *seg, ev_off_t offset, ev_off_t length);
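The two kinds of offset described in that comment are easy to mix up, so here is a small arithmetic sketch. The struct and function below are hypothetical helpers written for this article, not libevent APIs; they only compute which file bytes end up in the evbuffer.

```c
/* Hypothetical helper, not part of libevent: the inclusive range of file
 * bytes covered when `add_len` bytes starting at `add_off` *within a segment*
 * are added, for a segment that begins at `seg_off` *within the file*. */
struct byte_range {
    long long first;
    long long last;
};

struct byte_range segment_file_range(long long seg_off, long long add_off,
                                     long long add_len)
{
    struct byte_range r;
    r.first = seg_off + add_off;        /* both offsets simply add up */
    r.last = r.first + add_len - 1;     /* inclusive upper bound */
    return r;
}
```

With the numbers from the quoted comment (segment at file offset 20, then offset 20 and length 50 within the segment), this gives bytes 40..89. A 50-byte range starting at byte 40 cannot end at 99, so the "40..99" in the comment looks like a slip in the upstream text.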

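3. Practice

3.1 Copying file contents

Read the whole file into an evbuffer in one go; open and close are still the usual calls.
(This is only a test. In real code, mind the memory cost of large files and do not load everything at once.)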

#define SIZE_MB (1024 * 1024)

void ev_readv_file(const char *name, struct evbuffer *evbuf, off_t f_offset)
{
    off_t f_size = 0;
    off_t s_len = 0;    // segment length: bytes from f_offset to end of file
    off_t s_offset = 0; // offset within the segment

    int fd = -1;
    int ret = -1;
    struct evbuffer_file_segment *seg = NULL;

    fd = open(name, O_RDONLY);
    assert(fd >= 0);

    f_size = __get_filesize(fd); // passing -1 as the length would let libevent fstat() itself
    assert(f_size > 0 && f_offset >= 0 && f_offset < f_size);
    s_len = f_size - f_offset;   // the segment must not extend past the end of the file
    seg = evbuffer_file_segment_new(fd, f_offset, s_len, 0);
    assert(seg != NULL);

    // add the segment to the evbuffer in pieces of at most 1 MiB
    while (s_offset < s_len) {
        off_t rlen = (s_len - s_offset > SIZE_MB) ?
                     SIZE_MB : s_len - s_offset;
        ret = evbuffer_add_file_segment(evbuf, seg, s_offset, rlen);
        assert(ret == 0); // 0 means success, -1 means failure
        s_offset += rlen;
    }

    // drops our reference; the evbuffer keeps its own until the data is drained
    evbuffer_file_segment_free(seg);
    // closing here is only safe because the segment ends up mmap'ed (section 3.2);
    // on the sendfile path the fd must stay open while an evbuffer uses the segment
    close(fd);
}

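The write function is a bit simpler, but there is still one pitfall: evbuffer_write_atmost only promises to write as much as it can, not to finish everything in a single call (a short write may be rare, but it can happen), so we wrap it in a loop.

void ev_write_file(struct evbuffer *evbuf, const char *name)
{
    int fd = -1;

    fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    assert(fd >= 0);

    /* write the whole buffer to the file */
    while (evbuffer_get_length(evbuf) > 0) {
        int wlen = evbuffer_write_atmost(evbuf, fd, -1);
        if (wlen <= 0) {
            /* -1 is an error; 0 means no progress, so stop rather than spin */
            fprintf(stderr, "evbuffer_write_atmost returned %d\n", wlen);
            break;
        }
        printf("wlen: %d\n", wlen);
    }
    close(fd);
}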
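The same "loop until everything is out" pattern applies to plain write(2), which is what libevent falls back to at the bottom of the chain. Below is a minimal sketch of such a loop; write_all is a hypothetical helper name, not something libevent provides.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write(2) until all `len` bytes are out.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = (const char *)buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any byte was written */
            return -1;
        }
        if (n == 0)
            return -1;          /* no progress: give up instead of spinning */
        p += n;                 /* short write: advance and try again */
        len -= (size_t)n;
    }
    return 0;
}
```

Treating a zero-byte result as an error is a design choice made here to guarantee the loop terminates.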

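The main function takes the source file name, the destination file name, and an optional offset:

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <event2/buffer.h>

#define SIZE_MB (1024 * 1024)
static inline off_t __get_filesize(int fd)
{
    struct stat st = {0};
    if (0 != fstat(fd, &st)) {
        return 0; // callers treat 0 as failure
    }
    return st.st_size;
}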

int main(int argc, char *argv[])
{
    struct evbuffer *evbuf = evbuffer_new();

    if (argc < 3) {
        printf("%s <src_file> <dst_file> [r_offset]\n", argv[0]);
        exit(EXIT_FAILURE);
    }

    ev_readv_file(argv[1], evbuf, argv[3] ? atol(argv[3]) : 0);
    ev_write_file(evbuf, argv[2]);
    evbuffer_free(evbuf);
    exit(EXIT_SUCCESS);
}

Build:

g++ -o evfile -Wall -g2 -O3 -Os -D_FILE_OFFSET_BITS=64 evfile.cc -levent

3.2 Results

First, verify that the program logic is correct. Create a 10 MB file of random data:

# dd if=/dev/urandom of=test.dat bs=1MB count=10
10+0 records in
10+0 records out
10000000 bytes (10 MB, 9.5 MiB) copied, 0.0620712 s, 161 MB/s

Full copy, comparing checksums with md5sum:

./evfile test.dat test.new && md5sum test.*

wlen: 10000000
15bbbb439d89ea50b35af903a2830b72  test.dat
15bbbb439d89ea50b35af903a2830b72  test.new

Offset copy, using hexdump to check that the bytes match:

./evfile test.dat test.off 1048576

hexdump -s 1048576 -n 64 test.dat
0100000 4a41 a5d6 d3e7 683c c8b3 a0cc 0354 322e
0100010 d53d c425 672b 93e5 9588 177d f118 b80d
0100020 d3a2 01f0 5bcb 5a01 399e 7500 1e4d b062
0100030 b2f1 a9ce 460e b20a 8635 4c4f f251 30bc
0100040
hexdump -n 64 test.off
0000000 4a41 a5d6 d3e7 683c c8b3 a0cc 0354 322e
0000010 d53d c425 672b 93e5 9588 177d f118 b80d
0000020 d3a2 01f0 5bcb 5a01 399e 7500 1e4d b062
0000030 b2f1 a9ce 460e b20a 8635 4c4f f251 30bc
0000040

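OK, functionally everything checks out.
Next, let's run it under strace to see what the program actually does. The key lines are excerpted below: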

openat(AT_FDCWD, "test.dat", O_RDONLY)  = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=10000000, ...}) = 0
mmap(NULL, 10000000, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7ff39e3d9000
close(3)                                = 0
openat(AT_FDCWD, "test.new", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 3
writev(3, [{iov_base="\322=\221\210aE9\245d\v\224\247\232I\242\242\303h\350\373\243\341\371\262\10\10\250j\243S\222t"..., iov_len=1048576}, {iov_base="P4\2316A\245g6~f\276\247\0275\33\335]\316\"\f\311e\fU0\347\310\301\377\356\253\343"..., iov_len=1048576}, {iov_base="\305p\216\17\257i\207\v\t6\177\0355\265I\230\363\307xW\227\26\33\27\272\310\272r~\32\363\223"..., iov_len=1048576}, {iov_base="\336\263\265\0223g\340\203!\347\r\224#R\r{\374Q\275\315\\}\222RO\376\30\266\177\30(a"..., iov_len=1048576}, {iov_base="Wz\274\374Jk\306!\213p\265\16b\301\256\213G\342CX\332\344:\341\35K\32204C\302z"..., iov_len=1048576}, {iov_base="\6\366Q4\4UWm\312n\236\244\204\256H\27\331\210f\205d}Wv\4y\304d\266\233|l"..., iov_len=1048576}, {iov_base="\332E\227\303\277\24u,Ji\177|\207\",\264\"\223\377y\242m#\7\324\33\371F\3035\212\363"..., iov_len=1048576}, {iov_base=" \344\221o\265\0\2259\272a\256*YU\21>\253\255\16&%&\276<\32\24\27\326\301hFg"..., iov_len=1048576}, {iov_base="+\370j-\347.*\376\6P\v\263\211\331\326\243\22a\214\302|\307\206B!\255\27l\334\236\230o"..., iov_len=1048576}, {iov_base="\253\370\230\343\6\271e\177\3466\372\78`\363\226-\245\244o\347\252\307\322\236\31\t>\200\233N\203"..., iov_len=562816}], 10) = 10000000
munmap(0x7ff39e3d9000, 10000000)        = 0
close(3)                                = 0

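As the trace shows, reading the file means mmap-ing its contents into the evbuffer, after which the file can be closed right away; only once all of the evbuffer's contents have been processed does the evbuffer munmap the mapping internally.
The write in between goes straight through writev, and you can see that it uses exactly 10 iov entries.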
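The count of 10 is no coincidence. ev_readv_file adds the segment in pieces of at most 1 MiB, and, judging by the trace, each piece shows up as one iov entry. The sketch below reproduces that arithmetic; chunk_count is a hypothetical helper and CHUNK_SIZE mirrors SIZE_MB from the code above.

```c
#include <stddef.h>

#define CHUNK_SIZE (1024 * 1024)   /* same value as SIZE_MB in the article */

/* Hypothetical helper: how many chunks of CHUNK_SIZE are needed to cover
 * `total` bytes, and how long the final chunk is (stored in *last_len). */
size_t chunk_count(size_t total, size_t *last_len)
{
    size_t full = total / CHUNK_SIZE;   /* chunks that are completely full */
    size_t rest = total % CHUNK_SIZE;   /* bytes left over for a partial one */

    if (last_len)
        *last_len = rest ? rest : (full ? CHUNK_SIZE : 0);
    return full + (rest ? 1 : 0);
}
```

For the 10,000,000-byte test file this gives 9 full chunks of 1,048,576 bytes plus a final chunk of 562,816 bytes, which matches the iov_len values in the writev line.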

3.3 Improvement

Section 3.1 explored the evbuffer_file_segment_xxx family. What if we use the higher-level interface directly?
So we wrap a new ev_readv_file2 and test it:

void ev_readv_file2(const char *name, struct evbuffer *evbuf, off_t f_offset)
{
    int fd = open(name, O_RDONLY);
    assert(fd >= 0);
    // length -1: take everything from f_offset to the end of the file
    int ret = evbuffer_add_file(evbuf, fd, f_offset, -1);
    assert(ret == 0);
}

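Because evbuffer_add_file passes the EVBUF_FS_CLOSE_ON_FREE flag internally, the caller no longer needs to close the fd.

The strace output is essentially the same, except that the close now happens automatically, after the munmap: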

openat(AT_FDCWD, "test.dat", O_RDONLY)  = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=10000000, ...}) = 0
mmap(NULL, 10000000, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f7e0b86f000
openat(AT_FDCWD, "test.new", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 4
writev(4, [{iov_base="\360\267\261\251\223\226\377\333-.\23\314V\354\250\342\377\276\2W\227\323\272tOHe\366<|\262\16"..., iov_len=10000000}], 1) = 10000000
munmap(0x7f7e0b86f000, 10000000)        = 0
close(3)                                = 0
close(4)                                = 0

3.4 Improvement 2

Still not satisfied, let's see how to get sendfile to kick in:

/** If this flag is set, then we will not use evbuffer_peek(),
 * evbuffer_remove(), evbuffer_remove_buffer(), and so on to read bytes
 * from this buffer: we'll only take bytes out of this buffer by
 * writing them to the network (as with evbuffer_write_atmost), by
 * removing them without observing them (as with evbuffer_drain),
 * or by copying them all out at once (as with evbuffer_add_buffer).
 *
 * Using this option allows the implementation to use sendfile-based
 * operations for evbuffer_add_file(); see that function for more
 * information.
 *
 * This flag is on by default for bufferevents that can take advantage
 * of it; you should never actually need to set it on a bufferevent's
 * output buffer.
 */
#define EVBUFFER_FLAG_DRAINS_TO_FD 1

/** Change the flags that are set for an evbuffer by adding more.
 *
 * @param buffer the evbuffer that the callback is watching.
 * @param cb the callback whose status we want to change.
 * @param flags One or more EVBUFFER_FLAG_* options
 * @return 0 on success, -1 on failure.
 */
EVENT2_EXPORT_SYMBOL
int evbuffer_set_flags(struct evbuffer *buf, ev_uint64_t flags);

So, in main, right below evbuffer_new, we add one line:

evbuffer_set_flags(evbuf, EVBUFFER_FLAG_DRAINS_TO_FD);

At last, sendfile shows its true face:

openat(AT_FDCWD, "test.dat", O_RDONLY)  = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=10000000, ...}) = 0
openat(AT_FDCWD, "test.new", O_WRONLY|O_CREAT|O_TRUNC, 0644) = 4
sendfile(4, 3, [0] => [10000000], 10000000) = 10000000
close(3)                                = 0
close(4)                                = 0

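4. Conclusion

In summary, using evbuffer for file operations is fairly simple from the caller's side.
libevent wraps the complicated mapping and zero-copy handling inside its interfaces, with a clear bias toward squeezing out performance first.

Perhaps before long you, too, will be able to brag to your colleagues that your code uses zero-copy,
and then quietly add -levent to your Makefile and keep the credit to yourself.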

References:
[0] https://linux.die.net/man/3/mmap
[1] https://linux.die.net/man/2/sendfile
[2] https://www.cnblogs.com/huxiao-tee/p/4660352.html
