前文已经介绍了IO复用的select模型,并列举了在并发数量比较大的时候select的缺点,本文将介绍另外一种IO模型,epoll模型,该模型能够高效解决数百万计的并发量,因此在大规模的网络编程中应用很广,在大规模网络编程中Linux中选择epoll,windows下选择IOCP。
epoll的相关系统调用只有三个,使用上比较简单,但是要想能够很好的使用epoll需要如何了解epoll各种模式比如水平触发(LT),边缘触发(ET)。
1、int epoll_create(int size)
在Linux 2.6.8后参数size是忽略的,调用该接口将创建一个epoll的句柄,在使用完该句柄后必须当调用close()关闭该文件句柄。
2、int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)
epoll的事件注册接口,调用该函数将向epoll注册一个事件,事件的类型由event结构体制定。
参数epfd是epoll_create的返回值,第二个参数OP是一个epoll的枚举,第三个参数是一个socke文件描述符:
EPOLL_CTL_ADD //将epoll注册一个事件
EPOLL_CTL_DEL //删除一个事件
EPOLL_CTL_MOD //修改已经注册的事件。
第四个参数是一个epoll_event结构体类型指针,定义如下:
typedef union epoll_data
{
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;
struct epoll_event
{
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
} __EPOLL_PACKED;
等待已注册的事件发生,参数__events是已经分配好的epoll_event数组,内核在注册事件发生后会将相应的event拷贝到events数组中,__maxevent是最大的event数量,__timeout是epoll_wait函数阻塞事件,若timeout==-1则函数一直阻塞直到有注册的事件发生,若timeout == 0则epoll_wait扫描完注册的事件集后立即返回,若是其他整数,则epoll_wait在等待该指定的事件后还没有事件发生,则函数超时返回。
epoll相关的源码可参考系统的
/* Copyright (C) 2002-2012 Free Software Foundation, Inc.
This file is part of the GNU C Library.
The GNU C Library is free software; you can redistribute it and/or
modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either
version 2.1 of the License, or (at your option) any later version.
The GNU C Library is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public
License along with the GNU C Library; if not, see
. */
#ifndef _SYS_EPOLL_H
#define _SYS_EPOLL_H 1
#include
#include
/* Get __sigset_t. */
#include
#ifndef __sigset_t_defined
# define __sigset_t_defined
typedef __sigset_t sigset_t;
#endif
/* Get the platform-dependent flags. */
#include
#ifndef __EPOLL_PACKED
# define __EPOLL_PACKED
#endif
enum EPOLL_EVENTS
{
EPOLLIN = 0x001,
#define EPOLLIN EPOLLIN
EPOLLPRI = 0x002,
#define EPOLLPRI EPOLLPRI
EPOLLOUT = 0x004,
#define EPOLLOUT EPOLLOUT
EPOLLRDNORM = 0x040,
#define EPOLLRDNORM EPOLLRDNORM
EPOLLRDBAND = 0x080,
#define EPOLLRDBAND EPOLLRDBAND
EPOLLWRNORM = 0x100,
#define EPOLLWRNORM EPOLLWRNORM
EPOLLWRBAND = 0x200,
#define EPOLLWRBAND EPOLLWRBAND
EPOLLMSG = 0x400,
#define EPOLLMSG EPOLLMSG
EPOLLERR = 0x008,
#define EPOLLERR EPOLLERR
EPOLLHUP = 0x010,
#define EPOLLHUP EPOLLHUP
EPOLLRDHUP = 0x2000,
#define EPOLLRDHUP EPOLLRDHUP
EPOLLWAKEUP = 1u << 29,
#define EPOLLWAKEUP EPOLLWAKEUP
EPOLLONESHOT = 1u << 30,
#define EPOLLONESHOT EPOLLONESHOT
EPOLLET = 1u << 31
#define EPOLLET EPOLLET
};
/* Valid opcodes ( "op" parameter ) to issue to epoll_ctl(). */
#define EPOLL_CTL_ADD 1 /* Add a file descriptor to the interface. */
#define EPOLL_CTL_DEL 2 /* Remove a file descriptor from the interface. */
#define EPOLL_CTL_MOD 3 /* Change file descriptor epoll_event structure. */
typedef union epoll_data
{
void *ptr;
int fd;
uint32_t u32;
uint64_t u64;
} epoll_data_t;
struct epoll_event
{
uint32_t events; /* Epoll events */
epoll_data_t data; /* User data variable */
} __EPOLL_PACKED;
__BEGIN_DECLS
/* Creates an epoll instance. Returns an fd for the new instance.
The "size" parameter is a hint specifying the number of file
descriptors to be associated with the new instance. The fd
returned by epoll_create() should be closed with close(). */
extern int epoll_create (int __size) __THROW;
/* Same as epoll_create but with an FLAGS parameter. The unused SIZE
parameter has been dropped. */
extern int epoll_create1 (int __flags) __THROW;
/* Manipulate an epoll instance "epfd". Returns 0 in case of success,
-1 in case of error ( the "errno" variable will contain the
specific error code ) The "op" parameter is one of the EPOLL_CTL_*
constants defined above. The "fd" parameter is the target of the
operation. The "event" parameter describes which events the caller
is interested in and any associated user data. */
extern int epoll_ctl (int __epfd, int __op, int __fd,
struct epoll_event *__event) __THROW;
/* Wait for events on an epoll instance "epfd". Returns the number of
triggered events returned in "events" buffer. Or -1 in case of
error with the "errno" variable set to the specific error code. The
"events" parameter is a buffer that will contain triggered
events. The "maxevents" is the maximum number of events to be
returned ( usually size of "events" ). The "timeout" parameter
specifies the maximum wait time in milliseconds (-1 == infinite).
This function is a cancellation point and therefore not marked with
__THROW. */
extern int epoll_wait (int __epfd, struct epoll_event *__events,
int __maxevents, int __timeout);
/* Same as epoll_wait, but the thread's signal mask is temporarily
and atomically replaced with the one provided as parameter.
This function is a cancellation point and therefore not marked with
__THROW. */
extern int epoll_pwait (int __epfd, struct epoll_event *__events,
int __maxevents, int __timeout,
const __sigset_t *__ss);
__END_DECLS
#endif /* sys/epoll.h */
Epoll的工作原理
与select类似,epoll也是扫描用户注册的事件,然后等待事件的发生,但是与select不同的是epoll_wait函数返回的不是文件描述符,而是文件描述符的数量,用户需要去events数组中查看当前文件描述符发生了什么事件,此外select需要用户去全部扫描注册的事件,而epoll仅将已经发生了的事件放在events数组中,在这里epoll使用了内存映射技术(mmap),节省了文件描述符从内核复制到用户空间的开销,更主要的是epoll采用了一种类似Rector的模式,用户通过epoll_ctl向内核注册事件,一旦事件发生epoll通过类似于回调的形式通知用户,而select需要主动去扫描。
关于epoll的ET和LT
ET(Edge trigger,边缘触发),当epoll处于该中事件触发通知模式下,当有数据到达时,内核只通知一次,此后内核不再通知,若用户在事件发生后未把数据一次性读完,则内核会抛弃该部分未读数据。
LT(level trigger,水平触发),与ET相对,当epoll工作在该种模式下,当有数据到达时,内核会一直通知,直到用户读取完缓冲区的数据。
内核默认工作在何种模式随系统而定,用户可通过epoll_cntl注册事件时将事件设置为ET或者LT,
event.data.fd = listenfd;
event.events = EPOLLIN | EPOLLET;
epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &event)
epoll实现的ECHO服务器实例:
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#define MAX_EVENT_NUM 1024
int make_socket_nonblock(int fd)
{
int flags;
if((flags = fcntl(fd,F_GETFL,0))< 0 )
{
perror("get flags failed");
return -1;
}
flags |= O_NONBLOCK;
if(fcntl(fd, F_SETFL,flags) < 0)
{
perror("set flags failed");
return -1;
}
return 0;
}
int make_socket_reuseable(int fd)
{
int reuse = 1;
if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &reuse, sizeof(reuse)) < 0)
return -1;
return 0;
}
int doAccept(int fd)
{
struct sockaddr_in cliAddr;
char ipaddr[32] = {0};
bzero(&cliAddr, sizeof(cliAddr));
socklen_t len = sizeof(cliAddr);
int connfd = accept(fd, (struct sockaddr*)&cliAddr, &len);
if(connfd < 0)
return -1;
inet_ntop(AF_INET,&(cliAddr.sin_addr),ipaddr,32);
printf("accept client connect %s:%d.\n", ipaddr,ntohs(cliAddr.sin_port));
return connfd;
}
void doRead(int fd)
{
char recvBuf[1024] = {0};
while(recv(fd,recvBuf, 1024, 0) > 0)
{
printf("msg recv is %s\n",recvBuf);
send(fd, recvBuf, sizeof(recvBuf), 0);
char * temp = recvBuf;
while(*temp != '\r' && *temp != '\n')
temp++;
*temp == '\0';
if(strcmp(recvBuf, "quit") == 0)
exit(0);
}
}
int main()
{
int efd;
int connfd;
struct epoll_event event;
struct epoll_event * events = NULL;
struct sockaddr_in SerAddr;
bzero(&SerAddr, sizeof(SerAddr));
SerAddr.sin_addr.s_addr = htonl(INADDR_ANY);
SerAddr.sin_port = htons(12345);
SerAddr.sin_family = AF_INET;
int listenfd = socket(AF_INET, SOCK_STREAM,0);
if(listenfd < 0)
{
perror("listenfd < 0");
return -1;
}
if(make_socket_nonblock(listenfd)< 0)
return -1;
if(make_socket_reuseable(listenfd) < 0)
return -1;
if(bind(listenfd, (struct sockaddr*)&SerAddr, sizeof(SerAddr))< 0)
{
perror("bind error");
return -1;
}
efd = epoll_create(32000);
if(efd < 0)
{
perror("epoll create");
return -1;
}
event.data.fd = listenfd;
event.events = EPOLLIN | EPOLLET;
if(epoll_ctl(efd, EPOLL_CTL_ADD, listenfd, &event) < 0)
{
perror("epoll-cntl");
return -1;
}
events = (epoll_event*)calloc(MAX_EVENT_NUM, sizeof(event));
while(1)
{
int n = epoll_wait(efd, events, MAX_EVENT_NUM, -1);
for(int i=0; i < n; i++)
{
if(events[i].data.fd == listenfd)
{
if((connfd = doAccept(listenfd)) < 0)
{
perror("accept failed.\n");
continue;
}
if(make_socket_nonblock(connfd) < 0)
{
perror("make non block failed");
return -1;
}
event.data.fd = connfd;
event.events = EPOLLIN | EPOLLET | EPOLLOUT ;
if(epoll_ctl(efd, EPOLL_CTL_ADD, connfd, &event) < 0)
{
perror("epoll add");
return -1;
}
}
else if(events[i].events & EPOLLIN)
{
doRead(listenfd);
}
else if(events[i].events & EPOLLOUT)
{
}
else
{
perror("epoll error");
close(events[i].data.fd);
}
}
}
return 0;
}
epoll能够支持进程打开支持的最大数目的描述符,因此能够支持数目较大的并发量,与select,多线程/多进程相比,epoll的这个优势无疑是明显,可以通过cat /proc/sys/fs/file-max 查看当前进程支持的最大文件描述符数量,通常1G内存是10W左右,这个数字和内存相关。
IO效率不随文件描述符的增长而变差,select的一个缺点是当有一个很大的描述符集时,不管当前有多少文件描述符处于活动状态,select都会全部扫描,导致IO效率下降,而epoll不存在这个问题,epoll关注活动的的描述符,当某一个描述符处于活动状态后,它类似于回调一样告诉内核该描述符准备就绪,因此对于那些没有处于活动状态的描述符,epoll不关注,因此不影响IO效率。
使用内存映射(mmap)加速内核与用户空间数据传递,epoll不像select将所有描述符都从内核空间拷贝到用户空间,而是采用了内存映射计数,减少了内核数据拷贝,当描述符越大,这种效率的提升就越明显。
以上epoll的特点就是epoll能够支持数百万计并发的理由,因此当前Linux服务器的实现大多都采用epoll的机制。