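In some small projects of mine I have recently been mixing Lua with the C API. For event handling there are two possible designs. In the first, the C layer receives messages from a message queue, calls the Lua function that matches each message type, and exposes a registration function such as AddListener to Lua. In the second, the queue operations themselves (for example PushEvent and GetEvent) are exposed to Lua, and the event handling code is written in Lua. I started with the first design, but noticed some stuttering once the message volume grew, so I wondered whether the design itself carried a performance penalty. I used the following code to check: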
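#include <ctime>
#include <iostream>
#include <lua.hpp>
using namespace std;

// C function exposed to Lua: returns the sum of its two integer arguments.
int test_function(lua_State* L)
{
    int a = (int)lua_tointeger(L, 1);
    int b = (int)lua_tointeger(L, 2);
    int c = a + b;
    lua_pushinteger(L, c);
    return 1; // one return value
}

// Plain C equivalent, used as the native baseline.
int test_in_c(int a, int b)
{
    return a + b;
}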
int benchmark()
{
    // LuaVM is my own wrapper around lua_State (not shown here). It converts
    // implicitly to lua_State* and has the standard libraries opened.
    LuaVM L;
    lua_pushcfunction(L, test_function);
    lua_setglobal(L, "ctestfn");

    // Loop in Lua, call the C function; chunk run in protected mode.
    luaL_loadstring(L, "for i=1, 10000000 do ctestfn(1, 2) end");
    clock_t before = clock();
    lua_pcall(L, 0, 0, 0); // msgh = 0: no message handler
    clock_t after = clock();
    cout << "Loop in Lua, call into C: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    // Same chunk, run unprotected with lua_call.
    luaL_loadstring(L, "for i=1, 10000000 do ctestfn(1, 2) end");
    before = clock();
    lua_call(L, 0, 0);
    after = clock();
    cout << "Loop in Lua, call into C (unprotected): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    luaL_loadstring(L, "for i=1, 10000000 do pcall(ctestfn, 1, 2) end");
    before = clock();
    lua_call(L, 0, 0);
    after = clock();
    cout << "Loop in Lua, call into C: (pcall) " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    luaL_loadstring(L, "for i=1, 10000000 do xpcall(ctestfn, function() print(debug.traceback()) end, 1, 2) end");
    before = clock();
    lua_call(L, 0, 0);
    after = clock();
    cout << "Loop in Lua, call into C: (xpcall) " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    // Lua, Lua
    luaL_loadstring(L, "function testfn(a,b) return a+b end for i=1, 10000000 do testfn(1, 2) end");
    before = clock();
    lua_call(L, 0, 0);
    after = clock();
    cout << "Loop in Lua, call in Lua: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    luaL_loadstring(L, "for i=1, 10000000 do pcall(testfn, 1, 2) end");
    before = clock();
    lua_call(L, 0, 0);
    after = clock();
    cout << "Loop in Lua, call in Lua (with pcall): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    luaL_loadstring(L, "for i=1, 10000000 do xpcall(testfn, function() print(debug.traceback()) end, 1, 2) end");
    before = clock();
    lua_call(L, 0, 0);
    after = clock();
    cout << "Loop in Lua, call in Lua (with xpcall): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    // One resume/yield round trip per iteration.
    luaL_loadstring(L, "x=coroutine.create(function(a,b) while true do a,b=coroutine.yield(a+b) end end) for i=1, 10000000 do coroutine.resume(x, 1, 2) end");
    before = clock();
    lua_call(L, 0, 0);
    after = clock();
    cout << "Loop in Lua, call in Lua (with coroutine): " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    // Loop in C, call the plain C function (the result is discarded).
    before = clock();
    for (int i = 0; i < 10000000; i++)
    {
        test_in_c(1, 2);
    }
    after = clock();
    cout << "Loop in C, call in C: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    // Loop in C, call the Lua function testfn defined by the chunk above.
    lua_getglobal(L, "testfn");
    before = clock();
    for (int i = 0; i < 10000000; i++)
    {
        lua_pushvalue(L, -1); // copy the function; lua_call pops it
        lua_pushinteger(L, 1);
        lua_pushinteger(L, 2);
        lua_call(L, 2, 0);    // discard the result
    }
    after = clock();
    lua_pop(L, 1);
    cout << "Loop in C, call into Lua: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;

    // Loop in C, call the C function ctestfn through lua_call.
    lua_getglobal(L, "ctestfn");
    before = clock();
    for (int i = 0; i < 10000000; i++)
    {
        lua_pushvalue(L, -1);
        lua_pushinteger(L, 1);
        lua_pushinteger(L, 2);
        lua_call(L, 2, 0);
    }
    after = clock();
    lua_pop(L, 1);
    cout << "Loop in C, call into Lua, then call into C: " << ((double)after - before) / CLOCKS_PER_SEC << "s" << endl;
    return 0;
}
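The test itself is simple: a function takes two arguments a and b and returns a+b. It ignores metamethods and Lua's automatic string-to-number coercion, and measures only the cost of calling a+b.
The following call styles are tested:
Loop written in Lua: calling a C function, calling a C function through pcall, calling a C function through xpcall, calling a Lua function, calling a Lua function through pcall, calling a Lua function through xpcall, and calling a Lua function through coroutine.resume/coroutine.yield. (When Lua calls a C function that yields, the next resume actually enters the continuation function passed to lua_yieldk, so there is little point in testing that case.)
Loop written in C: calling a C function, calling a Lua function, and calling a C function through Lua.
Each loop runs ten million iterations. The results are as follows: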
Built with Visual Studio 2019 in Debug mode:
Loop in Lua, call into C: 3.146s
Loop in Lua, call into C (unprotected): 3.123s
Loop in Lua, call into C: (pcall) 8.4s
Loop in Lua, call into C: (xpcall) 9.562s
Loop in Lua, call in Lua: 1.84s
Loop in Lua, call in Lua (with pcall): 8.417s
Loop in Lua, call in Lua (with xpcall): 9.348s
Loop in Lua, call in Lua (with coroutine): 12.166s
Loop in C, call in C: 0.138s
Loop in C, call into Lua: 3.964s
Loop in C, call into Lua, then call into C: 3.965s
Built with Visual Studio 2019 in Release mode:
Loop in Lua, call into C: 0.423s
Loop in Lua, call into C (unprotected): 0.372s
Loop in Lua, call into C: (pcall) 0.803s
Loop in Lua, call into C: (xpcall) 0.929s
Loop in Lua, call in Lua: 0.489s
Loop in Lua, call in Lua (with pcall): 0.966s
Loop in Lua, call in Lua (with xpcall): 1.086s
Loop in Lua, call in Lua (with coroutine): 1.942s
Loop in C, call in C: 0s
Loop in C, call into Lua: 0.261s
Loop in C, call into Lua, then call into C: 0.194s
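There is clearly a sizable gap between native C (0.138s / 0s) and native Lua (1.84s / 0.489s). The 0s for the C function in Release is probably the compiler optimizing the loop away, since the result of test_in_c is never used, though I can't rule out that it really is just that fast.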
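One way to check the "optimized away" suspicion is to make the loop's result observable so the compiler has to keep it. The sketch below is only an illustration, not part of the original benchmark: time_c_loop is a hypothetical helper name, and it uses std::chrono::steady_clock instead of clock().

```cpp
#include <chrono>

// Same a + b function as in the benchmark.
static int test_in_c(int a, int b)
{
    return a + b;
}

// Hypothetical helper: runs the loop and returns the elapsed seconds.
// Writing each result to a volatile variable forces the compiler to
// actually produce it, so the loop cannot be removed as dead code.
static double time_c_loop(int iterations, long long* total_out)
{
    volatile int sink = 0;
    long long total = 0;
    auto before = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
    {
        sink = test_in_c(i, 2); // volatile store: must happen every iteration
        total += sink;          // volatile load: must happen every iteration
    }
    auto after = std::chrono::steady_clock::now();
    *total_out = total;
    return std::chrono::duration<double>(after - before).count();
}
```

Calling time_c_loop(10000000, &total) and printing both the time and total should give a non-zero figure in a Release build if the original 0s really was dead-code elimination. The compiler may still inline test_in_c, so this measures the loop body rather than true call overhead.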
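For cross-language calls in the Debug build, Lua calling C (3.146s) is a bit faster than C calling Lua (3.964s). pcall and xpcall are much slower because of the extra work of setting up protected mode, and the coroutine version is slower still: on top of protection, it has to save the execution stack on yield and restore it on resume. The Release numbers are harder to explain. My guess is that the first Lua-to-C case (0.423s) is slower than the later C-to-Lua case (0.261s) because of warm-up in the program, and that the C-to-Lua-to-C case (0.194s) beating plain C-to-Lua (0.261s) is even more likely a warm-up effect in the Lua VM. These are only guesses, though; I haven't found a convincing explanation yet.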
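The warm-up guess could be tested by running each case a few times before timing it, and by timing it more than once. Below is a minimal sketch of such a harness, assuming each case can be wrapped in a callable. BenchResult and measure are hypothetical names introduced here; they are not part of the benchmark above.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical result type for this sketch.
struct BenchResult
{
    double best_seconds;   // fastest measured run
    double median_seconds; // middle sample (upper middle for even counts)
};

// Runs `work` warmup_runs times without timing it, then times it
// measured_runs times and reports the best and median sample.
static BenchResult measure(const std::function<void()>& work,
                           int warmup_runs, int measured_runs)
{
    if (measured_runs < 1)
        measured_runs = 1; // always take at least one sample

    for (int i = 0; i < warmup_runs; i++)
        work();

    std::vector<double> samples;
    for (int i = 0; i < measured_runs; i++)
    {
        auto before = std::chrono::steady_clock::now();
        work();
        auto after = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(after - before).count());
    }

    std::sort(samples.begin(), samples.end());
    return { samples.front(), samples[samples.size() / 2] };
}
```

Each case from the benchmark would go into a lambda that loads and runs its chunk. If the Release ordering stays the same after warm-up, or when the cases are run in a different order, then warm-up is probably not the explanation.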
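After these tests I've decided to switch to the second design for now: hand control of the event queue to the Lua layer, but wrap it in a library written in Lua that user code goes through. That way the C layer doesn't have to deal with much Lua-specific logic; it only needs to push each message onto the Lua stack in an agreed format and return, and there is no risk of user code corrupting the event queue directly. Another benefit is that the Lua layer gets more room to maneuver: since it owns the queue, it can, for example, schedule and run suspended coroutines during idle time when no events have arrived.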
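As a rough illustration of the C side of this design, here is a minimal sketch of a queue with a non-blocking poll. Everything in it is an assumption made for the example: the Event layout, the EventQueue class, and the use of a mutex (which only matters if events are produced on another thread). The Lua bindings are left out; a GetEvent binding would call try_pop and push the event's fields onto the Lua stack.

```cpp
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical event record; the real message layout is not shown in this post.
struct Event
{
    int type;
    std::string payload;
};

// Hypothetical queue owned by the C layer and polled from Lua.
class EventQueue
{
public:
    // Called by producers on the C side.
    void push(Event ev)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        events_.push_back(std::move(ev));
    }

    // Non-blocking poll: returns false when the queue is empty, so the
    // caller can spend the idle time on other work instead of waiting.
    bool try_pop(Event& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (events_.empty())
            return false;
        out = std::move(events_.front());
        events_.pop_front();
        return true;
    }

private:
    std::mutex mutex_;
    std::deque<Event> events_;
};
```

The non-blocking try_pop is what would let the Lua-side loop resume suspended coroutines when there is nothing to handle. A blocking pop would stall the Lua layer until the next event arrives.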