LuaJIT 2 beta 3 is out: Support both x32 & x64 (Why is it so fast?)

LuaJIT's interpreter is fast, because:

•It's written in assembler.
•It keeps all important state in registers. No C compiler manages to do that on x86.
•It uses indirect threading (aka labeled goto in C).
•It has a very small I-cache footprint (the core of the interpreter fits in 6K).
•The parser generates a register-based bytecode.
•The bytecode is really a word-code (32 bit/ins) and designed for fast decoding.
•Bytecode decode and dispatch is heavily optimized for superscalar CPUs.
•The bytecode is type-specialized and patched on-the-fly.
•The dispatch table is patched to allow for debug hooks and trace recording. No need to check for these cases in the fast paths.
•It uses NaN tagging for object references. This allows unboxed FP numbers with a minimal cache footprint for stacks/arrays. FP stores are auto-tagging.
•It inlines all fast paths.
•It uses special calling conventions for built-ins (fast functions).
•Tons more tuning in the VM ... and the JIT compiler has its own bag of tricks.

E.g. x=x+1 is turned into the ADDVN instruction. This means it's specialized for the 2nd operand being a number constant. Here's the x86 code (with SSE2 enabled) for this instruction:

// Prologue for type ABC instructions (others have a zero prologue).
movzx  ebp, ah                  Decode RB (split of RD)
movzx  eax, al                  Decode RC (split of RD)

// The instruction itself.
cmp    [edx+ebp*8+0x4], -13     Type check of [RB]
ja     ->lj_vmeta_arith_vn
movsd  xmm0, [edx+ebp*8]        Load of [RB]
addsd  xmm0, [edi+eax*8]        Add constant [RC]
movsd  [edx+ecx*8], xmm0        Store in [RA]

// Standard epilogue: decode + dispatch the next instruction.
mov    eax, [esi]               Load next bytecode
movzx  ecx, ah                  Decode RA
movzx  ebp, al                  Decode opcode
add    esi, 0x4                 Increment PC
shr    eax, 0x10                Decode RD
jmp    [ebx+ebp*4]              Dispatch to next instruction

Yes, that's all of it. I don't think you can do this with fewer instructions. This code reaches up to 2.5 IPC on a Core2 and takes 5-6 cycles (about 2 nanoseconds on a 3 GHz machine).
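
For readers who think in C rather than assembler, the decode-and-dispatch pattern in the listing can be approximated as follows. This is only a sketch under stated assumptions, not LuaJIT's source: the opcode numbers and the names INS_ABC, NEXT and run are made up for illustration, the field layout (opcode in bits 0-7, A in bits 8-15, D in the upper 16 bits, split into B and C) is read off the epilogue above, and the `&&label` / `goto *` syntax is the GNU C labels-as-values extension (GCC, Clang). Slots are plain doubles here, so there is no type check.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical opcode numbers for this sketch only. */
enum { OP_ADDVN, OP_MOV, OP_HALT };

/* Pack one 32-bit instruction: OP bits 0-7, A bits 8-15, C bits 16-23, B bits 24-31. */
#define INS_ABC(op, a, b, c) \
  ((uint32_t)(op) | ((uint32_t)(a) << 8) | ((uint32_t)(c) << 16) | ((uint32_t)(b) << 24))

/* Execute `code` on the slot array `base` with constant array `k`; returns base[0]. */
double run(const uint32_t *code, double *base, const double *k)
{
  /* Dispatch table indexed by opcode (GNU C extension). */
  static void *const dispatch[] = { &&do_addvn, &&do_mov, &&do_halt };
  const uint32_t *pc = code;
  uint32_t ins, ra, rd;

  assert(code && base && k);

  /* Shared epilogue: load the next word, decode A and D, bump PC, jump via the table. */
#define NEXT() do {              \
    ins = *pc++;                 \
    ra = (ins >> 8) & 0xff;      \
    rd = ins >> 16;              \
    goto *dispatch[ins & 0xff];  \
  } while (0)

  NEXT();

do_addvn: {
    /* ABC prologue: split D into B (high byte) and C (low byte). */
    uint32_t rb = rd >> 8, rc = rd & 0xff;
    base[ra] = base[rb] + k[rc];   /* slot A = slot B + constant C */
    NEXT();
  }
do_mov:
  base[ra] = base[rd];             /* AD format: D is used whole, no prologue */
  NEXT();
do_halt:
  return base[0];
#undef NEXT
}
```

Every handler ends in its own copy of the decode and indirect jump, just like the standard epilogue in the listing. A C compiler is free to spill pc, base or the table pointer to the stack, which is the problem the hand-written assembler avoids.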

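
The NaN tagging bullet and the type check in the ADDVN listing (an unsigned compare of the upper 32 bits of a slot against -13) can be illustrated with a small C sketch. It rests on assumptions: the names Slot, slot_* and add_vn are invented for this example, the tag values for nil and strings are placeholders, and only the -13 threshold is taken from the listing. It also assumes a 32-bit payload, as on x86, and that arithmetic only ever produces the canonical NaN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One 64-bit stack/array slot: either a raw double, or a tag in the
   upper 32 bits plus a 32-bit payload in the lower 32 bits. */
typedef uint64_t Slot;

#define TAG_ISNUM ((uint32_t)-13)  /* threshold used by the cmp in the listing */
#define TAG_NIL   ((uint32_t)-1)   /* placeholder tag values for this sketch */
#define TAG_STR   ((uint32_t)-5)

/* Storing a double writes all 64 bits, so the "tag" half comes for free. */
static inline Slot slot_from_num(double d) { Slot s; memcpy(&s, &d, sizeof s); return s; }
static inline double slot_to_num(Slot s) { double d; memcpy(&d, &s, sizeof d); return d; }

/* Tags above the threshold have the form 0xFFFFFFFx, which is a NaN bit
   pattern that ordinary floating-point arithmetic does not produce. */
static inline Slot slot_from_ref(uint32_t tag, uint32_t payload)
{
  assert(tag > TAG_ISNUM);
  return ((uint64_t)tag << 32) | payload;
}

static inline uint32_t slot_tag(Slot s) { return (uint32_t)(s >> 32); }
static inline uint32_t slot_payload(Slot s) { return (uint32_t)s; }

/* Same test as "cmp hi, -13; ja slow": unsigned compare of the upper word. */
static inline int slot_is_num(Slot s) { return slot_tag(s) <= TAG_ISNUM; }

/* ADDVN-style fast path: returns 0 if the caller must take the slow path. */
static inline int add_vn(Slot *dst, Slot src, double k)
{
  if (!slot_is_num(src)) return 0;
  *dst = slot_from_num(slot_to_num(src) + k);
  return 1;
}
```

Numbers need no boxing and no separate type field, so a stack of such slots is 8 bytes per entry. A single compare on the upper word separates numbers (including real NaNs) from everything else.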