While doing maintenance and support for large enterprise application delivery deployments at the F5 Networks headquarters in Seattle, I frequently ran into the various bottlenecks BIGIP can hit when customer applications reach millions to tens of millions of concurrent TCP connections: memory allocation and usage, CPU load, TCP stack performance, and how well TCP software syncookies and TCP hardware (FPGA) assisted syncookie acceleration work together, among others. When you need to simulate millions or even tens of millions of concurrent TCP connections in the lab, an ordinary Linux client machine comes nowhere near that load. The performance-testing group in R&D has dedicated Ixia test equipment, but product developers and pre-/post-sales support engineers rarely get much access to it, for various reasons. For those reasons, I want to share how the open-source projects mTCP and DPDK can be used to meet this need, along with some relevant technical background.
1 Linux Kernel Network Processing Bottlenecks
The following link gives a good overview of the Linux kernel's network processing bottlenecks:
The Secret to 10 Million Concurrent Connections - The Kernel is the Problem, Not the Solution
A simplified view of the Linux network packet processing path:
NIC RX/TX queues<-->Ring Buffers<-->Driver<-->Socket<-->App
The actual Linux network packet processing path (see the kernel data-flow diagram):
http://www.linuxfoundation.org/images/thumb/1/1c/Network_data_flow_through_kernel.png/100px-Network_data_flow_through_kernel.png
The main bottlenecks are the following (leaving aside other areas such as skbuff handling, the netfilter framework, memory allocation, IP routing, and so on); a minimal sketch of this conventional socket path follows the list:
Expensive system calls
Context switching on blocking I/O
Packet data copying between kernel and user space
Interrupt handling in the kernel, both hard and soft interrupts
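To make these costs concrete, here is a minimal, hypothetical sketch (not code from BIGIP or mTCP) of the conventional kernel socket path: every chunk of received data costs at least one recv() system call plus a kernel-to-user copy, and with blocking I/O each connection also ties up a thread that the kernel must context-switch in and out.

/* Conventional kernel TCP receive path (illustrative sketch): every read
 * traps into the kernel and copies data from kernel to user space. */
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);           /* system call */
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));   /* system call */
    listen(lfd, 1024);                                   /* system call */
    for (;;) {
        int cfd = accept(lfd, NULL, NULL);               /* system call, blocks */
        char buf[4096];
        /* each recv() is another system call and another kernel-to-user copy */
        while (recv(cfd, buf, sizeof(buf), 0) > 0)
            ;                                            /* ... handle data ... */
        close(cfd);                                      /* system call */
    }
}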
With these bottlenecks in mind, let us start at the NIC driver level and look at how the open-source DPDK addresses them.
2 Packet Processing in DPDK
NIC RX/TX queues<-->Ring Buffers<-->DPDK<-->App
More precisely, DPDK is not just a NIC PMD (Poll Mode Driver) library; it also includes optimized memory management, thread management, and more.
Its key features include (a minimal poll-loop sketch follows this list):
Processor affinity
Huge pages, which improve TLB/CPU-cache efficiency and are never swapped out (huge pages: no swap, fewer TLB misses)
Kernel-provided UIO (user-space I/O), which removes packet copying between kernel and user space (no copying from kernel)
Poll Mode Driver (PMD), which removes the overhead of interrupt-driven packet handling (polling: no interrupt overhead)
Lockless synchronization (avoid waiting)
Batched packet handling
Use of CPU SSE instructions
CPU NUMA awareness (keeping packet data in the processor's local memory)
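To make the poll-mode, batched model concrete, below is a minimal sketch of a DPDK receive loop. It assumes EAL initialization and port/queue setup (rte_eth_dev_configure, rte_eth_rx_queue_setup, rte_eth_dev_start) have already been done, and error handling is omitted. The core busy-polls the RX ring and pulls packets in bursts, so there are no interrupts and no per-packet system calls or kernel copies.

/* Minimal DPDK poll-mode RX loop (sketch).  Assumes the port and queue
 * have already been configured and started; error handling omitted. */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Called on a dedicated lcore: busy-polls one RX queue forever. */
static void rx_loop(uint16_t port_id, uint16_t queue_id)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* busy-poll the hardware RX ring: returns 0..BURST_SIZE packets */
        uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* packet data already sits in user-space hugepage memory */
            /* ... hand the mbuf to the TCP/application layer here ... */
            rte_pktmbuf_free(bufs[i]);   /* return the mbuf to its mempool */
        }
    }
}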
Now let us look at how mTCP addresses the Linux TCP/IP stack and socket system call bottlenecks.
3 mTCP Design
First, the kernel has the following limitations:
Lack of connection locality: the CPU that services a packet's interrupt may not be the CPU running the application
Shared file descriptor space within a process, which limits the number of usable descriptors
Inefficient per-packet processing
High system call overhead
mTCP design features:
Batching of packet I/O, TCP processing, and user-application events to reduce system call overhead
Connection locality on multicore systems: the same connection is always handled on the same core, avoiding cache pollution
No file descriptor sharing between mTCP threads
mTCP packet processing path (a trimmed per-core client sketch follows below):
DPDK/Netmap<-->mTCP thread<-->mTCP socket/epoll<-->mTCP App thread
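To give a feel for the mTCP socket/epoll layer from the application's point of view, here is a heavily trimmed, epwget-style sketch of one per-core client thread. Function and constant names (mtcp_create_context, mtcp_epoll_wait, MTCP_EPOLLIN, ...) follow the mTCP API headers, but exact signatures can differ between mTCP versions, and all connection setup and error handling is omitted. The point is that each core owns its own mTCP context, epoll instance, and descriptor space, so nothing is shared or locked across threads.

/* Per-core mTCP client thread (sketch, epwget-style): each core owns a
 * private mTCP context, epoll instance, and descriptor space. */
#include <mtcp_api.h>
#include <mtcp_epoll.h>

#define MAX_EVENTS 8192

static void *client_thread(void *arg)
{
    int core = *(int *)arg;

    mtcp_core_affinitize(core);                /* pin this thread to its core */
    mctx_t mctx = mtcp_create_context(core);   /* per-core mTCP context       */
    int ep = mtcp_epoll_create(mctx, MAX_EVENTS);
    struct mtcp_epoll_event events[MAX_EVENTS];

    /* ... create non-blocking sockets with mtcp_socket()/mtcp_connect()
     * and register them with mtcp_epoll_ctl() ... */

    for (;;) {
        int n = mtcp_epoll_wait(mctx, ep, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int sock = events[i].data.sockid;
            (void)sock;   /* used by mtcp_write()/mtcp_read() in the real tool */
            if (events[i].events & MTCP_EPOLLOUT) {
                /* connection established: send the HTTP GET via mtcp_write() */
            } else if (events[i].events & MTCP_EPOLLIN) {
                /* response ready: drain it via mtcp_read(), then
                 * mtcp_close() and open the next connection */
            }
        }
    }
    mtcp_destroy_context(mctx);
    return NULL;
}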
I made some small modifications to mTCP and its sample application epwget, and from that base derived several useful stress-test tools, such as a SYN flood test, an SSL DoS test, and an ApacheBench SSL port to mTCP. The sections below are examples of these mTCP application tests.
4 mTCP Client Application Test Examples
Hardware spec:
mTCP+DPDK client: Dell PowerEdge R710, 72 GB RAM, 16 cores, Intel 82599 NIC
BIGIP DUT: Victoria B2250, Intel(R) Xeon(R) CPU E5-2658 v2 @ 2.40GHz, 20 cores, 64 GB RAM
4a Ten Million Concurrent HTTP Connection Test
mTCP client application:
#epwget 10.3.3.249/ 160000000 -N 16 -c 10000000
[CPU 0] dpdk0 flows: 625000, RX: 96382(pps) (err: 0), 0.10(Gbps), TX: 413888(pps), 0.64(Gbps)
[CPU 1] dpdk0 flows: 625000, RX: 101025(pps) (err: 0), 0.10(Gbps), TX: 398592(pps), 0.61(Gbps)
.................................................CUT.....................
[CPU15] dpdk0 flows: 625000, RX: 103412(pps) (err: 0), 0.11(Gbps), TX: 391296(pps), 0.60(Gbps)
[ ALL ] dpdk0 flows: 10010366, RX: 1634404(pps) (err: 0), 1.69(Gbps), TX: 6489408(pps), 9.96(Gbps)
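The per-core numbers above (625,000 flows on each of the 16 cores) are simply the 10 million connections requested with -c spread over the 16 mTCP threads requested with -N. Reaching that level also requires the mTCP run-time configuration (the mtcp.conf each tool loads) to allow enough per-core flows and buffers. The fragment below is only an illustration of the kind of settings involved; parameter names follow the sample configs shipped with mTCP, and the values are chosen for this explanation rather than copied from the actual test.

# illustrative mtcp.conf fragment (names follow the mTCP sample configs;
# available options depend on the mTCP version)
num_cores = 16            # one mTCP thread per CPU core
port = dpdk0              # DPDK-bound interface
max_concurrency = 625000  # per-core flow limit (16 x 625000 = 10M)
max_num_buffers = 625000  # per-core socket buffer pool
rcvbuf = 1024             # small buffers keep 10M flows within RAM
sndbuf = 1024
tcp_timeout = 30          # seconds
tcp_timewait = 0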
F5 BIGIP CPU usage:
top - 15:25:26 up 23:57, 1 user, load average: 0.16, 0.33, 0.43
Tasks: 778 total, 17 running, 761 sleeping, 0 stopped, 0 zombie
Cpu(s): 45.1%us, 30.6%sy, 0.0%ni, 24.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 66080376k total, 62855960k used, 3224416k free, 136316k buffers
Swap: 5242872k total, 0k used, 5242872k free, 1182216k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17283 root 1 -19 57.9g 145m 123m R 94.1 0.2 1322:36 tmm.0 -T 10 --tmid
.......................................CUT..................
17285 root 1 -19 57.9g 145m 123m R 92.1 0.2 1322:37 tmm.0 -T 10 --tmid
31507 root RT 0 0 0 0 R 32.0 0.0 0:00.97 [enforcer/19]
................................CUT.....................
31515 root RT 0 0 0 0 S 16.8 0.0 0:00.51 [enforcer/1]
F5 BIGIP memory usage:
[root@localhost:/S1-green-P:Active:Standalone] config # tail -f /var/log/ltm
Nov 4 15:25:29 slot1/bigip1 warning tmm7[17043]: 011e0003:4: Aggressive mode sweeper: /Common/default-eviction-policy (70000000002d6) (global memory) 9864 Connections killed
Nov 4 15:25:29 slot1/bigip1 warning tmm7[17043]: 011e0002:4:sweeper_policy_bind_deactivation_update: Aggressive mode /Common/default-eviction-policy deactivated (70000000002d6) (global memory). (12793204/15051776 pages)
Every 1.0s: tmsh show ltm virtual vs_http_10g Wed Nov 4 15:27:15 2015
CMP : enabled
CMP Mode : all-cpus
Destination : 10.3.3.249:80
PVA Acceleration : none
Traffic ClientSide Ephemeral General
Packets In 287.6M 0 -
Packets Out 150.2M 0 -
Current Connections 6.1M 0 - <- current concurrent TCP connections
Maximum Connections 6.7M 0 -
Total Connections 39.8M 0 -
mTCP per-function CPU usage; roughly 70% of CPU cycles are spent in mTCP user space (perf top output, ~70% cycles in userspace):
Samples: 1M of event 'cycles', Event count (approx.): 441906428558
8.25% epwget [.] SendTCPPacket
7.93% [kernel] [k] _raw_spin_lock
7.16% epwget [.] GetRSSCPUCore
7.15% epwget [.] IPOutput
4.26% libc-2.19.so [.] memset
4.10% epwget [.] ixgbe_xmit_pkts
3.62% [kernel] [k] clear_page_c
3.26% epwget [.] WriteTCPControlList
3.24% [vdso] [.] 0x0000000000000cf9
2.95% epwget [.] AddtoControlList
2.70% epwget [.] MTCPRunThread
2.66% epwget [.] HandleRTO
2.51% epwget [.] CheckRtmTimeout
2.10% libpthread-2.19.so [.] pthread_mutex_unlock
1.83% epwget [.] dpdk_send_pkts
1.68% epwget [.] HTInsert
1.65% epwget [.] CreateTCPStream
1.42% epwget [.] MPAllocateChunk
1.29% epwget [.] TCPCalcChecksum
1.24% epwget [.] dpdk_recv_pkts
1.20% epwget [.] mtcp_getsockopt
1.12% epwget [.] rx_recv_pkts
4b mTCP SSL DoS Client Application Test Example
mTCP client application:
#brute-shake 10.3.3.249/ 160000000 -N 16 -c 3200
F5 BIGIP CPU usage:
top - 09:10:21 up 22:58, 1 user, load average: 10.45, 4.43, 1.67
Tasks: 782 total, 19 running, 763 sleeping, 0 stopped, 0 zombie
Cpu(s): 50.6%us, 40.1%sy, 0.1%ni, 9.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 66080376k total, 62923192k used, 3157184k free, 138624k buffers
Swap: 5242872k total, 0k used, 5242872k free, 1259132k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21480 root 1 -19 57.9g 145m 123m R 100.0 0.2 81:24.41 tmm
............................CUT...................
21511 root 1 -19 57.9g 145m 123m R 100.0 0.2 47:06.64 tmm
1670 root RT 0 0 0 0 R 80.2 0.0 2:07.03 enforcer/15
...............................................CUT........................
1672 root RT 0 0 0 0 R 79.9 0.0 2:07.02 enforcer/5
4c mTCP ApacheBench SSL Port Client Application Test Example
#ab -n 16000 -N 16 -c 8000 -L 64 https://10.3.3.249/
---------------------------------------------------------------------------------
Loading mtcp configuration from : /etc/mtcp/config/mtcp.conf
Loading interface setting
EAL: Detected lcore 0 as core 0 on socket 0
................................................. Checking link statusdone
Port 0 Link Up - speed 10000 Mbps - full-duplex
Benchmarking 10.3.3.249 (be patient)
CPU6 connecting to port 443
.............................CUT.............
CPU0 connecting to port 443
.......................................
[ ALL ] dpdk0 flows: 5016, RX: 9651(pps) (err: 0), 0.04(Gbps), TX: 14784(pps), 0.02(Gbps)