st(state-threads) https://github.com/winlinvip/state-threads
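and the RTMP/HLS server built on st: https://github.com/winlinvip/simple-rtmp-server
st implements coroutines, that is, user-space threads. It turns non-blocking epoll code (async, nonblocking sockets) into coroutine-style code: all state lives on each coroutine's stack, which avoids the big asynchronous event loop and the state checks that go with it.
For a detailed introduction to st, see the translation: http://blog.csdn.net/win_lin/article/details/8242653
This article mainly explains how coroutines are implemented on top of setjmp and longjmp.
I simplified st by dropping every other OS and keeping only Linux, with four CPU families: i386, x86_64, arm and mips. See: https://github.com/winlinvip/simple-rtmp-server/tree/master/trunk/research/st
The most critical point in st is that the stack has to be reallocated, for example on the heap, which is what makes very high concurrency possible.
On mips and arm the stack can be set directly. On i386 and x86_64, glibc 2.4 and later hide the layout of jmp_buf for security reasons, so setting the sp inside jmp_buf directly does not work: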
/*
 * Starting with glibc 2.4, JB_SP definitions are not public anymore.
 * They, however, can still be found in glibc source tree in
 * architecture-specific "jmpbuf-offsets.h" files.
 * Most importantly, the content of jmp_buf is mangled by setjmp to make
 * it completely opaque (the mangling can be disabled by setting the
 * LD_POINTER_GUARD environment variable before application execution).
 * Therefore we will use built-in _st_md_cxt_save/_st_md_cxt_restore
 * functions as a setjmp/longjmp replacement wherever they are available
 * unless USE_LIBC_SETJMP is defined.
 */
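MIPS is the simplest case. The jmp_buf filled in by setjmp exposes sp and pc, so it is enough to set the sp in jmp_buf to the allocated stack and the pc to the address of _main.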
#if defined(__mips__)
    #define MD_STACK_GROWS_DOWN

    #define MD_INIT_CONTEXT(_thread, _sp, _main)               \
        ST_BEGIN_MACRO                                         \
        MD_SETJMP((_thread)->context);                         \
        _thread->context[0].__jmpbuf[0].__pc = (__ptr_t) _main;  \
        _thread->context[0].__jmpbuf[0].__sp = _sp;            \
        ST_END_MACRO
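Because MD_STACK_GROWS_DOWN is defined, the _sp passed to MD_INIT_CONTEXT has to point at the high end of the allocated block, since pushes move toward lower addresses. A minimal sketch of that calculation follows. The helper name stack_top and the 16-byte alignment are assumptions made for illustration; they are not taken from st.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of st: given a block of `size` bytes that
 * starts at `base`, return an initial stack pointer for a CPU whose stack
 * grows down. The result is the highest 16-byte-aligned address that does
 * not lie beyond the end of the block. */
static void *stack_top(void *base, size_t size)
{
    uintptr_t top = (uintptr_t)base + size;  /* one past the last byte */
    top &= ~(uintptr_t)15;                   /* round down to 16 bytes */
    return (void *)top;
}
```

The returned address would play the role of _sp; the block itself can come from malloc or mmap.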
On ARM, st also uses glibc's setjmp and longjmp. The layout of jmp_buf in the ARM glibc is evidently known; see the ARM setjmp header:
/**
/usr/arm-linux-gnueabi/include/bits/setjmp.h
#ifndef _ASM
The exact set of registers saved may depend on the particular core
   in use, as some coprocessor registers may need to be saved.  The C
   Library ABI requires that the buffer be 8-byte aligned, and
   recommends that the buffer contain 64 words.  The first 28 words
   are occupied by v1-v6, sl, fp, sp, pc, d8-d15, and fpscr.  (Note
   that d8-15 require 17 words, due to the use of fstmx.)
typedef int __jmp_buf[64] __attribute__((__aligned__ (8)));

the layout of setjmp for arm:
    0-5: v1-v6
    6: sl
    7: fp
    8: sp
    9: pc
    10-26: d8-d15 17words
    27: fpscr
*/
/**
For example, on raspberry-pi, armv6 cpu:
(gdb) x /64 env_func1[0].__jmpbuf
    v1, 0:  0x00    0x00    0x00    0x00
    v2, 1:  0x00    0x00    0x00    0x00
    v3, 2:  0x2c    0x84    0x00    0x00
    v4, 3:  0x00    0x00    0x00    0x00
    v5, 4:  0x00    0x00    0x00    0x00
    v6, 5:  0x00    0x00    0x00    0x00
    sl, 6:  0x00    0xf0    0xff    0xb6
    fp, 7:  0x9c    0xfb    0xff    0xbe
    sp, 8:  0x88    0xfb    0xff    0xbe
    pc, 9:  0x08    0x85    0x00    0x00
(gdb) p /x $sp
$5 = 0xbefffb88
(gdb) p /x $pc
$4 = 0x850c
*/
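The gdb dump prints every jmp_buf word byte by byte in memory order, so on this little-endian ARM the bytes have to be read back to front. The sketch below decodes one slot of such a dump. The enum names and the functions le32 and jmpbuf_word are hypothetical; only the slot indices come from the layout quoted above.

```c
#include <stdint.h>

/* Word indices into the ARM __jmp_buf, taken from the layout quoted above.
 * The names are made up for this sketch. */
enum { ARM_JB_SL = 6, ARM_JB_FP = 7, ARM_JB_SP = 8, ARM_JB_PC = 9 };

/* Assemble a 32-bit little-endian word from 4 bytes in memory order. */
static uint32_t le32(const unsigned char *b)
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Read word number `index` out of a raw byte dump of a jmp_buf. */
static uint32_t jmpbuf_word(const unsigned char *dump, int index)
{
    return le32(dump + 4 * index);
}
```

For the dump above, slot 8 (bytes 0x88 0xfb 0xff 0xbe) decodes to 0xbefffb88, which is exactly the value gdb prints for $sp.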
On both i386 and x86_64, st defines the macro MD_USE_BUILTIN_SETJMP, which means it uses the setjmp and longjmp that st implements itself in md.S:
/*
 * Starting with glibc 2.4, JB_SP definitions are not public anymore.
 * They, however, can still be found in glibc source tree in
 * architecture-specific "jmpbuf-offsets.h" files.
 * Most importantly, the content of jmp_buf is mangled by setjmp to make
 * it completely opaque (the mangling can be disabled by setting the
 * LD_POINTER_GUARD environment variable before application execution).
 * Therefore we will use built-in _st_md_cxt_save/_st_md_cxt_restore
 * functions as a setjmp/longjmp replacement wherever they are available
 * unless USE_LIBC_SETJMP is defined.
 */
#if defined(__i386__)
    #define MD_STACK_GROWS_DOWN
    #define MD_USE_BUILTIN_SETJMP

    #if defined(__GLIBC__) && __GLIBC__ >= 2
        #ifndef JB_SP
            #define JB_SP 4
        #endif
        #define MD_GET_SP(_t) (_t)->context[0].__jmpbuf[JB_SP]
    #else
        /* not an error but certainly cause for caution */
        #error "Untested use of old glibc on i386"
        #define MD_GET_SP(_t) (_t)->context[0].__jmpbuf[0].__sp
    #endif
#elif defined(__amd64__) || defined(__x86_64__)
    #define MD_STACK_GROWS_DOWN
    #define MD_USE_BUILTIN_SETJMP

    #ifndef JB_RSP
        #define JB_RSP 6
    #endif
    #define MD_GET_SP(_t) (_t)->context[0].__jmpbuf[JB_RSP]
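The reason is stated clearly: from glibc 2.4 on, the sp in jmp_buf can no longer be manipulated, so st has to fall back on its built-in setjmp and longjmp.
If you use glibc's setjmp and longjmp instead, by defining the macro USE_LIBC_SETJMP (see the next section on st macro definitions), the program dies with a segmentation fault. Inspecting the jmp_buf from setjmp in gdb: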
(gdb) x /64xb env_func1[0].__jmpbuf
0x600ca0 <env_func1>:       0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
0x600ca8 <env_func1+8>:     0xf8  0xc1  0x71  0xe5  0xa8  0x88  0xb4  0x15
0x600cb0 <env_func1+16>:    0xa0  0x05  0x40  0x00  0x00  0x00  0x00  0x00
0x600cb8 <env_func1+24>:    0x90  0xe4  0xff  0xff  0xff  0x7f  0x00  0x00
0x600cc0 <env_func1+32>:    0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
0x600cc8 <env_func1+40>:    0x00  0x00  0x00  0x00  0x00  0x00  0x00  0x00
0x600cd0 <env_func1+48>:    0xf8  0xc1  0x51  0xe5  0xa8  0x88  0xb4  0x15
0x600cd8 <env_func1+56>:    0xf8  0xc1  0xd9  0x2f  0xd7  0x77  0x4b  0xea
(gdb) p /x $sp
$4 = 0x7fffffffe380
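Slot 6 (offset 48, JB_RSP) in this dump looks nothing like the $sp that gdb prints, which is the mangling at work. To my understanding of the glibc x86_64 sources, the saved pointer is XORed with a per-process guard value and then rotated left by 17 bits; treat that as an assumption to verify against your glibc version. The sketch below only illustrates the arithmetic. The function names are hypothetical and the guard is an arbitrary constant, because the real guard is private to the C library.

```c
#include <stdint.h>

/* Rotate a 64-bit value left or right by n bits (0 < n < 64). */
static uint64_t rol64(uint64_t v, unsigned n) { return (v << n) | (v >> (64 - n)); }
static uint64_t ror64(uint64_t v, unsigned n) { return (v >> n) | (v << (64 - n)); }

/* Assumed x86_64 scheme: XOR with the guard, then rotate left by 17.
 * setjmp would store the mangled value in jmp_buf. */
static uint64_t ptr_mangle(uint64_t ptr, uint64_t guard)
{
    return rol64(ptr ^ guard, 17);
}

/* longjmp would undo it: rotate right by 17, then XOR with the guard. */
static uint64_t ptr_demangle(uint64_t mangled, uint64_t guard)
{
    return ror64(mangled, 17) ^ guard;
}
```

If code writes a raw stack address into the sp slot, longjmp demangles it as if it had been mangled and ends up with a meaningless stack pointer, which is consistent with the segmentation fault described above.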
Macros can be defined when building st with make, simply by passing EXTRA_CFLAGS. See the notes in the Makefile:
##########################
# Other possible defines:
# To use poll(2) instead of select(2) for events checking:
# DEFINES += -DUSE_POLL
# You may prefer to use select for applications that have many threads
# using one file descriptor, and poll for applications that have many
# different file descriptors. With USE_POLL poll() is called with at
# least one pollfd per I/O-blocked thread, so 1000 threads sharing one
# descriptor will poll 1000 identical pollfds and select would be more
# efficient. But if the threads all use different descriptors poll()
# may be better depending on your operating system's implementation of
# poll and select. Really, it's up to you. Oh, and on some platforms
# poll() fails with more than a few dozen descriptors.
#
# Some platforms allow to define FD_SETSIZE (if select() is used), e.g.:
# DEFINES += -DFD_SETSIZE=4096
#
# To use malloc(3) instead of mmap(2) for stack allocation:
# DEFINES += -DMALLOC_STACK
#
# To provision more than the default 16 thread-specific-data keys
# (but not too many!):
# DEFINES += -DST_KEYS_MAX=<n>
#
# To start with more than the default 64 initial pollfd slots
# (but the table grows dynamically anyway):
# DEFINES += -DST_MIN_POLLFDS_SIZE=<n>
#
# Note that you can also add these defines by specifying them as
# make/gmake arguments (without editing this Makefile). For example:
#
# make EXTRA_CFLAGS=-DUSE_POLL <target>
#
# (replace make with gmake if needed).
#
# You can also modify the default selection of an alternative event
# notification mechanism. E.g., to enable kqueue(2) support (if it's not
# enabled by default):
#
# gmake EXTRA_CFLAGS=-DMD_HAVE_KQUEUE <target>
#
# or to disable default epoll(4) support:
#
# make EXTRA_CFLAGS=-UMD_HAVE_EPOLL <target>
#
##########################
make linux-debug EXTRA_CFLAGS="-DMALLOC_STACK"
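Alternatively, change the default DEFINES in the Makefile.
Only after reading through st's thread scheduling and life cycle did I fully understand how setjmp and longjmp switch stacks, and how st switches between stacks that it allocated itself. See: http://blog.csdn.net/win_lin/article/details/40978665
Consider a single-threaded program. It executes as one straight flow: it starts at main, enters various functions and returns from them. See: https://github.com/winlinvip/simple-rtmp-server/blob/master/trunk/research/arm/jmp_flow.cpp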
/*
# for all supports setjmp and longjmp:
    g++ -g -O0 -o jmp_flow jmp_flow.cpp
*/
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>

jmp_buf context_level_0;

void func_level_0()
{
    const char* level_0_0 = "stack variables for func_level_0";
    int ret = setjmp(context_level_0);
    printf("func_level_0 ret=%d\n", ret);
    if (ret != 0) {
        printf("call by longjmp.\n");
        exit(0);
    }
}

int main(int argc, char** argv)
{
    func_level_0();
    longjmp(context_level_0, 1);
    return 0;
}
On the first pass, when setjmp returns 0, the locals of func_level_0 are intact:
(gdb) f 0
#0  func_level_0 () at jmp_flow.cpp:16
16          if (ret != 0) {
(gdb) bt
#0  func_level_0 () at jmp_flow.cpp:16
#1  0x0000000000400725 in main (argc=1, argv=0x7fffffffe4b8) at jmp_flow.cpp:24
(gdb) i locals
level_0_0 = 0x400838 "stack variables for func_level_0"
ret = 0
After main calls longjmp, execution is back at the same line, but the local variable is now garbage:
(gdb) f 0
#0  func_level_0 () at jmp_flow.cpp:16
16          if (ret != 0) {
(gdb) bt
#0  func_level_0 () at jmp_flow.cpp:16
#1  0x0000000000400734 in main (argc=1, argv=0x7fffffffe4b8) at jmp_flow.cpp:25
(gdb) i locals
level_0_0 = 0x1 <error: Cannot access memory at address 0x1>
ret = 1
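The reason is that once the function has returned, its stack frame has been released. Jumping back into it gives the right execution position (PC) and the right stack pointer, but the contents of the stack are no longer the same.
So after a longjmp into such a function, its stack frame is effectively invalid: local variables and the return address cannot be trusted. The only way out is another longjmp. This is why st has _st_thread_main, a function that never returns.
Put another way, after a longjmp into a function you may call other functions from it, but you can only get back to anything outside of it via longjmp. The function first entered by longjmp (the thread's thread_main) can never return; it can only jump away with longjmp.
Put yet another way, as long as sp is not changed, the target of a longjmp must live on the same stack. Jumping back and forth the way st does requires allocating the stacks on the heap, so that every thread has its own private sp.
Since a function cannot return after a longjmp, what happens if it then longjmps to another thread? The stack is shared, so the threads should end up corrupting each other's stack.
Let's check with code. See: https://github.com/winlinvip/simple-rtmp-server/blob/master/trunk/research/arm/jmp_2flow.cpp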
/*
# for all supports setjmp and longjmp:
    g++ -g -O0 -o jmp_2flow jmp_2flow.cpp
*/
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>

jmp_buf context_thread_0;
jmp_buf context_thread_1;

void thread0_functions()
{
    int ret = setjmp(context_thread_0);
    // when ret is 0, create thread,
    // when ret is not 0, longjmp to this thread.
    if (ret == 0) {
        return;
    }

    int age = 10000;
    const char* name = "winlin";
    printf("[thread0] age=%d, name=%s\n", age, name);
    if (!setjmp(context_thread_0)) {
        printf("[thread0] switch to thread1\n");
        longjmp(context_thread_1, 1);
    }

    // crash, for the stack is modified by thread1.
    // name = 0x2b67004009c8 <error: Cannot access memory at address 0x2b67004009c8>
    printf("[thread0] terminated, age=%d, name=%s\n", age, name);
    exit(0);
}

void thread1_functions()
{
    int ret = setjmp(context_thread_1);
    // when ret is 0, create thread,
    // when ret is not 0, longjmp to this thread.
    if (ret == 0) {
        return;
    }

    int age = 11111;
    printf("[thread1] age=%d\n", age);
    if (!setjmp(context_thread_1)) {
        printf("[thread1] switch to thread0\n");
        longjmp(context_thread_0, 1);
    }

    printf("[thread1] terminated, age=%d\n", age);
    exit(0);
}

int main(int argc, char** argv)
{
    thread0_functions();
    thread1_functions();

    // kickstart
    longjmp(context_thread_0, 1);

    return 0;
}
The first time thread0 runs, its locals are fine:
Breakpoint 1, thread0_functions () at jmp_2flow.cpp:23
23          printf("[thread0] age=%d, name=%s\n", age, name);
(gdb) i locals
ret = 1
age = 10000
name = 0x400908 "winlin"
After thread1 has run and jumped back to thread0, name has been overwritten:
Breakpoint 2, thread0_functions () at jmp_2flow.cpp:31
31          printf("[thread0] terminated, age=%d, name=%s\n", age, name);
(gdb) i locals
ret = 1
age = 10000
name = 0x2b6700000001 <error: Cannot access memory at address 0x2b6700000001>
So when the stacks are not allocated on the heap, longjmp must not jump to a stack frame that has already been destroyed. For example:
main(setjmp) => func1 => func2 (longjmp to main)
If func2 longjmps to main, there is no problem: func2's stack frame becomes unusable, but main's is still intact.
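The safe direction can be sketched in a few lines of C. Everything here is illustrative: run_safe_jump plays the role of main in the diagram, and the names env_main, func1 and func2 are assumptions made for this example. The local variable is volatile because it is modified between setjmp and longjmp.

```c
#include <setjmp.h>

static jmp_buf env_main;

/* Deepest frame: jump straight back to the frame that called setjmp. */
static void func2(void)
{
    longjmp(env_main, 1);
}

static void func1(void)
{
    func2();
}

/* Plays the role of main in the diagram: its frame is still active when
 * func2 calls longjmp, so its locals remain valid after the jump. */
static int run_safe_jump(void)
{
    volatile int local = 7;   /* volatile: changed after setjmp */

    if (setjmp(env_main) == 0) {
        local = 8;            /* modified before the jump happens */
        func1();              /* does not return normally */
        return -1;            /* not reached */
    }
    return local;             /* the frame is intact, so this is 8 */
}
```

The jump only unwinds toward a caller that has not returned yet, which is why the value written before the jump is still there afterwards.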
Now consider the following path:
main => func1 => func2 (setjmp)
=> func3 (longjmp to func2)
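func2 has returned, and then func3 longjmps to func2. The stack pointer is the same as when func2 called setjmp, but the contents have changed. This will almost certainly end in a segmentation fault.
In other words, unless stacks are allocated on the heap so that each thread has its own, a function that called setjmp and then returned cannot be re-entered with longjmp, because by then its stack has certainly been overwritten.
The final conclusion: st has to allocate stacks itself, one stack per thread.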
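To see the conclusion in action without patching jmp_buf, here is a minimal sketch that reruns the two-thread experiment with one heap-allocated stack per thread. It deliberately swaps in the POSIX ucontext functions (getcontext, makecontext, swapcontext) in place of setjmp/longjmp; this is not how st does it, since st uses its own md.S routines or the libc setjmp. All names (make_thread, run_two_threads, trace) and the 64 KB stack size are assumptions made for this example.

```c
#include <stdlib.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t ctx_main, ctx_t0, ctx_t1;
static void *stack_t0, *stack_t1;
static int trace[8];      /* values observed by the threads, in order */
static int trace_len;

static void thread0(void)
{
    int age = 10000;                  /* lives on thread0's own stack */
    trace[trace_len++] = age;
    swapcontext(&ctx_t0, &ctx_t1);    /* switch to thread1 */
    trace[trace_len++] = age;         /* unchanged after switching back */
    /* returning here resumes uc_link, which is ctx_main */
}

static void thread1(void)
{
    int age = 11111;                  /* lives on thread1's own stack */
    trace[trace_len++] = age;
    swapcontext(&ctx_t1, &ctx_t0);    /* switch back to thread0 */
    /* never resumed in this sketch */
}

/* Create a context that runs fn on a freshly malloc'ed stack. */
static int make_thread(ucontext_t *ctx, void **stack, void (*fn)(void))
{
    *stack = malloc(STACK_SIZE);
    if (*stack == NULL)
        return -1;
    if (getcontext(ctx) != 0) {
        free(*stack);
        *stack = NULL;
        return -1;
    }
    ctx->uc_stack.ss_sp = *stack;
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = &ctx_main;         /* where to go when fn returns */
    makecontext(ctx, fn, 0);
    return 0;
}

/* Returns the number of trace entries recorded, or -1 on failure. */
static int run_two_threads(void)
{
    trace_len = 0;
    if (make_thread(&ctx_t0, &stack_t0, thread0) != 0)
        return -1;
    if (make_thread(&ctx_t1, &stack_t1, thread1) != 0) {
        free(stack_t0);
        return -1;
    }
    swapcontext(&ctx_main, &ctx_t0);  /* kickstart thread0 */
    free(stack_t0);                   /* both threads are finished with */
    free(stack_t1);
    return trace_len;
}
```

Because each thread runs on its own block of memory, thread1 cannot overwrite thread0's locals, so thread0 still sees age = 10000 after the round trip, unlike name in jmp_2flow.cpp.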