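The previous chapter covered Lua's lexical analysis; this chapter discusses Lua's syntax analysis. Reading through the Lua source, though, the two are not sharply separated, even though lexing nominally lives in llex.c and parsing in lparser.c.
When Lua loads a piece of code or a Lua file (collectively called a buffer), it first opens a function, sets up that function's parameters and upvalues, and then parses the contents of the buffer as the body of that function. The following code, quoted from lparser.c, is the mainfunc that is called when a file buffer is parsed: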
/*
** compiles the main function, which is a regular vararg function with an
** upvalue named LUA_ENV
*/
static void mainfunc (LexState *ls, FuncState *fs) {
  BlockCnt bl;
  expdesc v;
  open_func(ls, fs, &bl);
  fs->f->is_vararg = 2;  /* main function is always vararg */
  init_exp(&v, VLOCAL, 0);  /* create and... */
  newupvalue(fs, ls->envn, &v);  /* ...set environment upvalue; it refers to stack offset 0 */
  luaX_next(ls);  /* read first token */
  statlist(ls);  /* parse main body */
  check(ls, TK_EOS);
  close_func(ls);
}
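Lua code is made up of statements, which differs slightly from C: a function definition can be regarded as a statement, and so can a call to require. In short, all Lua code can be viewed as a sequence of statements.
Statements are parsed top-down (recursive descent), which conceptually produces a syntax tree. No clear tree structure is visible in the Lua source, however. A careful reading of lparser.c shows that the parser in effect builds only temporary fragments of the tree while it calls into lcode.c to generate code (code generation is not covered in this chapter).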
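To make the "parse and emit at the same time" idea concrete, here is a minimal sketch in C. It is not Lua's code: `ToyParser`, `toy_emit`, `toy_term`, `toy_expr` and `toy_compile` are hypothetical names, and the "language" is just sums of single digits. The point is only that postfix instructions are appended while parsing, with no tree kept around.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy, not part of Lua: parses "1+2+3" and emits postfix
   code into 'out' while parsing, without building a syntax tree. */
typedef struct ToyParser {
  const char *src;   /* remaining input */
  char out[64];      /* emitted "instructions" */
  int n;             /* number of emitted characters */
} ToyParser;

static void toy_emit (ToyParser *p, char c) {
  assert(p->n < (int)sizeof(p->out) - 1);  /* toy limit on output size */
  p->out[p->n++] = c;
  p->out[p->n] = '\0';
}

/* term -> DIGIT ; emits the digit as a "push" instruction */
static int toy_term (ToyParser *p) {
  if (*p->src < '0' || *p->src > '9') return 0;  /* syntax error */
  toy_emit(p, *p->src++);
  return 1;
}

/* expr -> term { '+' term } ; emits '+' right after its second operand */
static int toy_expr (ToyParser *p) {
  if (!toy_term(p)) return 0;
  while (*p->src == '+') {
    p->src++;  /* skip '+' */
    if (!toy_term(p)) return 0;
    toy_emit(p, '+');
  }
  return *p->src == '\0';
}

/* Convenience wrapper: compiles 'src' into 'buf'; returns 1 on success */
static int toy_compile (const char *src, char *buf, size_t bufsize) {
  ToyParser p;
  p.src = src; p.n = 0; p.out[0] = '\0';
  if (!toy_expr(&p) || (size_t)p.n + 1 > bufsize) return 0;
  memcpy(buf, p.out, (size_t)p.n + 1);
  return 1;
}
```

Calling `toy_compile("1+2+3", buf, sizeof buf)` fills `buf` with the postfix form of the sum. Lua does the same kind of thing at a much larger scale, with expdesc values standing in for the pieces that have been parsed but not yet fully encoded.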
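Back to the topic: this chapter focuses on the parser. Reportedly the earliest Lua parser was generated with YACC and was later replaced by the handwritten lparser, which is much more readable. The Lua parser closely follows the grammar described in extended BNF: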
chunk ::= block
block ::= {stat} [retstat]
stat ::= ‘;’ |
    varlist ‘=’ explist |
    functioncall |
    label |
    break |
    goto Name |
    do block end |
    while exp do block end |
    repeat block until exp |
    if exp then block {elseif exp then block} [else block] end |
    for Name ‘=’ exp ‘,’ exp [‘,’ exp] do block end |
    for namelist in explist do block end |
    function funcname funcbody |
    local function Name funcbody |
    local namelist [‘=’ explist]
retstat ::= return [explist] [‘;’]
label ::= ‘::’ Name ‘::’
funcname ::= Name {‘.’ Name} [‘:’ Name]
varlist ::= var {‘,’ var}
var ::= Name | prefixexp ‘[’ exp ‘]’ | prefixexp ‘.’ Name
namelist ::= Name {‘,’ Name}
explist ::= exp {‘,’ exp}
exp ::= nil | false | true | Numeral | LiteralString | ‘...’ | functiondef |
    prefixexp | tableconstructor | exp binop exp | unop exp
prefixexp ::= var | functioncall | ‘(’ exp ‘)’
functioncall ::= prefixexp args | prefixexp ‘:’ Name args
args ::= ‘(’ [explist] ‘)’ | tableconstructor | LiteralString
functiondef ::= function funcbody
funcbody ::= ‘(’ [parlist] ‘)’ block end
parlist ::= namelist [‘,’ ‘...’] | ‘...’
tableconstructor ::= ‘{’ [fieldlist] ‘}’
fieldlist ::= field {fieldsep field} [fieldsep]
field ::= ‘[’ exp ‘]’ ‘=’ exp | Name ‘=’ exp | exp
fieldsep ::= ‘,’ | ‘;’
binop ::= ‘+’ | ‘-’ | ‘*’ | ‘/’ | ‘//’ | ‘^’ | ‘%’ |
    ‘&’ | ‘~’ | ‘|’ | ‘>>’ | ‘<<’ | ‘..’ |
    ‘<’ | ‘<=’ | ‘>’ | ‘>=’ | ‘==’ | ‘~=’ |
    and | or
unop ::= ‘-’ | not | ‘#’ | ‘~’
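The BNF above uses a few notational symbols that are worth introducing:
Symbol | Meaning |
---|---|
::= | is defined as (derivation) |
{} | zero or more repetitions |
[] | zero or one occurrence (optional) |
\| | alternative (or) |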
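In a handwritten recursive-descent parser these symbols map directly onto control flow: `{}` becomes a loop and `[]` becomes an `if`. The sketch below illustrates this for the `fieldlist` rule. It is a hypothetical toy, not Lua's code: a field is simplified to a single digit, whitespace is not handled, and the names `is_fieldsep` and `toy_fieldlist` are made up for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical toy, not Lua's code: recognizes
     fieldlist ::= field {fieldsep field} [fieldsep]
   where a field is simplified to one digit. Returns the number of
   fields, or -1 on a syntax error. */
static int is_fieldsep (char c) { return c == ',' || c == ';'; }

static int toy_fieldlist (const char *s) {
  int nfields = 0;
  assert(s != NULL);
  if (!isdigit((unsigned char)*s)) return -1;  /* first field is mandatory */
  s++; nfields++;
  /* {fieldsep field}: a loop that runs zero or more times */
  while (is_fieldsep(s[0]) && isdigit((unsigned char)s[1])) {
    s += 2;
    nfields++;
  }
  /* [fieldsep]: an 'if' that consumes at most one trailing separator */
  if (is_fieldsep(*s)) s++;
  return (*s == '\0') ? nfields : -1;
}
```

With this sketch, `toy_fieldlist("1,2;3")` counts three fields and `toy_fieldlist("1,2,")` accepts the trailing separator, while `toy_fieldlist("1,,2")` is rejected.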
Take the derivation of a function as an example:
functiondef ::= function funcbody
funcbody ::= ‘(’ [parlist] ‘)’ block end
parlist ::= namelist [‘,’ ‘...’] | ‘...’
block ::= {stat} [retstat]
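A function definition starts with the function keyword, followed by the parameter list and then a block, and ends with the end keyword:
function(a,b) block end
The block itself consists of a number of statements (recursion again).
The next section takes the parsing of a function as an example to walk through the lparser source.
How is a function parsed in full? Look at the following code: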
static void funcstat (LexState *ls, int line) {
  /* funcstat -> FUNCTION funcname body */
  int ismethod;
  expdesc v, b;
  luaX_next(ls);  /* skip FUNCTION */
  ismethod = funcname(ls, &v);  /* parse the function name, result goes to v */
  body(ls, &b, ismethod, line);  /* parse the function body, result goes to b */
  luaK_storevar(ls->fs, &v, &b);  /* generate code for the assignment v = b */
  luaK_fixline(ls->fs, line);  /* definition "happens" in the first line */
}
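expdesc is the struct used to describe an expression. That a function name can be described by it is easy to understand; a function body can be described by it too. Imagine that the body, once compiled, is stored in some corner, and the expdesc then refers to that corner :). Walking through the code involves quite a few details, but the details covered here also help with understanding how the other BNF rules are parsed.
How is the function name parsed?
Start with legal function names, which generally take one of the following forms: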
function A() end
function A.a() end
function A.a.b() end
function A:a() end
Now look at the source of funcname:
/*
 * Parse a function name and store the result in v
 * 1. function A or function A.a
 * 2. function A:a
 */
static int funcname (LexState *ls, expdesc *v) {
  /* funcname -> NAME {fieldsel} [':' NAME] */
  int ismethod = 0;
  /* first parse it as a simple variable name */
  singlevar(ls, v);
  while (ls->t.token == '.')
    fieldsel(ls, v);
  if (ls->t.token == ':') {
    ismethod = 1;
    fieldsel(ls, v);
  }
  return ismethod;
}

static void singlevar (LexState *ls, expdesc *var) {
  TString *varname = str_checkname(ls);
  FuncState *fs = ls->fs;
  if (singlevaraux(fs, varname, var, 1) == VVOID) {  /* global name? */
    expdesc key;
    singlevaraux(fs, ls->envn, var, 1);  /* get environment variable */
    lua_assert(var->k == VLOCAL || var->k == VUPVAL);
    codestring(ls, &key, varname);  /* key is variable name */
    luaK_indexed(fs, var, &key);  /* env[varname] */
  }
}

/*
  Find variable with given name 'n'. If it is an upvalue, add this
  upvalue into all intermediate functions.
*/
static int singlevaraux (FuncState *fs, TString *n, expdesc *var, int base) {
  if (fs == NULL)  /* no more levels? */
    return VVOID;  /* default is global */
  else {
    int v = searchvar(fs, n);  /* look up locals at current level */
    if (v >= 0) {  /* found? */
      init_exp(var, VLOCAL, v);  /* variable is local */
      if (!base)
        markupval(fs, v);  /* local will be used as an upval */
      return VLOCAL;
    }
    else {  /* not found as local at current level; try upvalues */
      int idx = searchupvalue(fs, n);  /* try existing upvalues */
      if (idx < 0) {  /* not found? */
        if (singlevaraux(fs->prev, n, var, 0) == VVOID)  /* try upper levels */
          return VVOID;  /* not found; is a global */
        /* else was LOCAL or UPVAL */
        idx = newupvalue(fs, n, var);  /* will be a new upvalue */
      }
      init_exp(var, VUPVAL, idx);
      return VUPVAL;
    }
  }
}
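First, funcname parses the name as a single variable, which is the job of singlevar. singlevar calls singlevaraux and checks whether it returns VVOID. A VVOID result means that singlevaraux found neither a local nor an upvalue with that name and assigned nothing to var, so the caller, singlevar, has to handle this branch itself: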
if (singlevaraux(fs, varname, var, 1) == VVOID) {  /* global name? */
  expdesc key;
  singlevaraux(fs, ls->envn, var, 1);  /* get environment variable */
  lua_assert(var->k == VLOCAL || var->k == VUPVAL);
  codestring(ls, &key, varname);  /* key is variable name */
  luaK_indexed(fs, var, &key);  /* env[varname] */
}
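It first resolves ls->envn, the name of the global environment variable, into var, and then calls luaK_indexed to encode var[key], that is, env[varname]. Logically, luaK_indexed records in var the operation "look up the key varname in the hash table env".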
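PS:
singlevaraux looks up the variable named n and stores the result in var. Its lookup order is:
1. If fs is NULL there are no more enclosing functions, so it returns VVOID (the name is a global).
2. It searches the locals of the current function. If the name is found, var becomes VLOCAL and, unless base is set, the local is marked as being used as an upvalue.
3. Otherwise it searches the upvalues the current function already has.
4. If the name is not there either, it recurses into the enclosing function fs->prev. A VVOID result is passed straight back; any other result makes it create a new upvalue in the current function.
5. In both upvalue cases var becomes VUPVAL.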
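The lookup order of singlevaraux can also be sketched as a small standalone model. This is a hypothetical toy, not Lua's data structures: `ToyFunc`, `toy_find` and `toy_lookup` are invented names, variables are plain C strings, and each function holds at most 8 locals and 8 upvalues.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy model of the lookup order, not Lua's data structures. */
enum ToyKind { TOY_GLOBAL, TOY_LOCAL, TOY_UPVAL };

typedef struct ToyFunc {
  struct ToyFunc *prev;      /* enclosing function, NULL for outermost */
  const char *locals[8];     /* names of active locals */
  int nlocals;
  const char *upvals[8];     /* names of upvalues created so far */
  int nupvals;
} ToyFunc;

static int toy_find (const char **names, int n, const char *name) {
  int i;
  for (i = n - 1; i >= 0; i--)   /* search latest declaration first */
    if (strcmp(names[i], name) == 0) return i;
  return -1;
}

/* Mirrors the shape of singlevaraux: local, then existing upvalue,
   then enclosing functions (creating an upvalue on the way back). */
static enum ToyKind toy_lookup (ToyFunc *fs, const char *name) {
  if (fs == NULL) return TOY_GLOBAL;           /* ran out of levels */
  if (toy_find(fs->locals, fs->nlocals, name) >= 0) return TOY_LOCAL;
  if (toy_find(fs->upvals, fs->nupvals, name) < 0) {
    if (toy_lookup(fs->prev, name) == TOY_GLOBAL) return TOY_GLOBAL;
    assert(fs->nupvals < 8);                   /* toy capacity limit */
    fs->upvals[fs->nupvals++] = name;          /* remember new upvalue */
  }
  return TOY_UPVAL;
}
```

With an outer function that declares `x` and an inner function that declares `y`, looking up `x` from the inner function reports an upvalue and records it once; a second lookup reuses the recorded entry, and a name found nowhere is reported as global.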
Next, the following tokens are examined. If there is another '.', the function name is a key in the table var; see fieldsel:
static void fieldsel (LexState *ls, expdesc *v) {
  /* fieldsel -> ['.' | ':'] NAME */
  FuncState *fs = ls->fs;
  expdesc key;
  luaK_exp2anyregup(fs, v);
  luaX_next(ls);  /* skip the dot or colon */
  checkname(ls, &key);
  luaK_indexed(fs, v, &key);
}
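As the code shows, it first makes sure v is in a register (or stays an upvalue), then parses the name that follows (the key), and finally calls luaK_indexed again to complete the v[key] operation.
Once the function name has been parsed, funcname returns. Next comes the function body.
As usual, the code first: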
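The overall shape of funcname (one name, any number of '.' selections, then at most one ':' selection) can be tried out in isolation. The following is a hypothetical toy that works on a plain string rather than on tokens; `toy_name` and `toy_funcname` are invented names and no code is generated.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical toy, not Lua's code: skips one identifier and returns a
   pointer just past it, or NULL if no identifier starts at 's'. */
static const char *toy_name (const char *s) {
  if (!isalpha((unsigned char)*s) && *s != '_') return NULL;
  while (isalnum((unsigned char)*s) || *s == '_') s++;
  return s;
}

/* funcname -> NAME {'.' NAME} [':' NAME]
   Returns 1 for a method name, 0 for a plain name, -1 on error;
   '*nfields' receives the number of '.'/':' selections performed. */
static int toy_funcname (const char *s, int *nfields) {
  int ismethod = 0;
  assert(s != NULL && nfields != NULL);
  *nfields = 0;
  if ((s = toy_name(s)) == NULL) return -1;
  while (*s == '.') {                      /* zero or more '.' NAME */
    if ((s = toy_name(s + 1)) == NULL) return -1;
    (*nfields)++;
  }
  if (*s == ':') {                         /* at most one ':' NAME */
    if ((s = toy_name(s + 1)) == NULL) return -1;
    (*nfields)++;
    ismethod = 1;
  }
  return (*s == '\0') ? ismethod : -1;
}
```

For the four forms listed earlier, `toy_funcname` accepts `A`, `A.a`, `A.a.b` and `A:a`, and only the last one is reported as a method. A name such as `A:a.b` is rejected because nothing may follow the ':' part.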
static void body (LexState *ls, expdesc *e, int ismethod, int line) {
  /* body -> '(' parlist ')' block END */
  FuncState new_fs;
  BlockCnt bl;
  new_fs.f = addprototype(ls);
  new_fs.f->linedefined = line;
  open_func(ls, &new_fs, &bl);
  checknext(ls, '(');
  if (ismethod) {
    new_localvarliteral(ls, "self");  /* create 'self' parameter */
    adjustlocalvars(ls, 1);
  }
  parlist(ls);
  checknext(ls, ')');
  statlist(ls);
  new_fs.f->lastlinedefined = ls->linenumber;
  check_match(ls, TK_END, TK_FUNCTION, line);
  codeclosure(ls, e);
  close_func(ls);
}
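addprototype creates a new Proto and adds it to the ls->fs->f->p array of the enclosing function. As in mainfunc, open_func is then called.
Note that when ismethod is true, a local variable named self is added. Adding a local variable works like this: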
/*
 * Add a new local variable, allocate space for it, and return its index
 * in the f->locvars array.
 * Note: the allocated size of f->locvars is f->sizelocvars, but the number
 * of entries in use is tracked by fs->nlocvars.
 */
static int registerlocalvar (LexState *ls, TString *varname) {
  FuncState *fs = ls->fs;
  Proto *f = fs->f;
  int oldsize = f->sizelocvars;
  luaM_growvector(ls->L, f->locvars, fs->nlocvars, f->sizelocvars,
                  LocVar, SHRT_MAX, "local variables");
  while (oldsize < f->sizelocvars) f->locvars[oldsize++].varname = NULL;
  f->locvars[fs->nlocvars].varname = varname;
  luaC_objbarrier(ls->L, f, varname);
  return fs->nlocvars++;
}

static void new_localvar (LexState *ls, TString *name) {
  FuncState *fs = ls->fs;
  Dyndata *dyd = ls->dyd;
  int reg = registerlocalvar(ls, name);
  checklimit(fs, dyd->actvar.n + 1 - fs->firstlocal,
                  MAXVARS, "local variables");
  luaM_growvector(ls->L, dyd->actvar.arr, dyd->actvar.n + 1,
                  dyd->actvar.size, Vardesc, MAX_INT, "local variables");
  dyd->actvar.arr[dyd->actvar.n++].idx = cast(short, reg);
}

static void new_localvarliteral_ (LexState *ls, const char *name, size_t sz) {
  new_localvar(ls, luaX_newstring(ls, name, sz));
}

#define new_localvarliteral(ls,v) \
	new_localvarliteral_(ls, "" v, (sizeof(v)/sizeof(char))-1)
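The new_localvarliteral macro relies on two C details that are easy to miss, so here is a small standalone sketch of the same trick. `toy_newstring` and `toy_literal` are hypothetical stand-ins, not Lua functions; the stand-in only records its arguments so that they can be inspected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for luaX_newstring: just records the arguments. */
static const char *toy_last_name;
static size_t toy_last_len;

static void toy_newstring (const char *name, size_t sz) {
  assert(name != NULL);
  toy_last_name = name;
  toy_last_len = sz;
}

/* Same shape as new_localvarliteral: "" v only compiles when v is a
   string literal, and sizeof counts the trailing '\0', hence the -1. */
#define toy_literal(v) \
  toy_newstring("" v, (sizeof(v)/sizeof(char)) - 1)
```

After `toy_literal("self")`, the recorded length is the length of the literal without its terminator, computed at compile time rather than with strlen. Passing a `char *` variable instead of a literal would fail to compile because of the `"" v` concatenation.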
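Looking at new_localvar, the actvar information kept in dyd is only the index idx; the local variable itself is reached through fs->f->locvars[idx].
adjustlocalvars merely activates the newly added local variables: it increases fs->nactvar and sets the startpc of each new variable, the pc from which that variable is visible.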
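The split between "declared" and "active" locals is easier to see in a stripped-down model. The sketch below is a hypothetical toy, not Lua's structures: `ToyState`, `toy_new_localvar` and `toy_adjustlocalvars` are invented, arrays are fixed-size, and nested functions (fs->firstlocal) are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model, not Lua's structures: 'locvars' plays the role
   of f->locvars (debug info for every local ever declared) and 'actvar'
   the role of dyd->actvar (indices of locals declared so far). */
typedef struct ToyLocVar { const char *name; int startpc; } ToyLocVar;

typedef struct ToyState {
  ToyLocVar locvars[16];
  int nlocvars;       /* entries used in locvars */
  short actvar[16];   /* each entry is an index into locvars */
  int ndeclared;      /* locals declared (like dyd->actvar.n) */
  int nactvar;        /* locals already activated (like fs->nactvar) */
  int pc;             /* next instruction to be emitted */
} ToyState;

/* Like new_localvar: register the name, remember only its index. */
static void toy_new_localvar (ToyState *st, const char *name) {
  assert(st->nlocvars < 16 && st->ndeclared < 16);
  st->locvars[st->nlocvars].name = name;
  st->actvar[st->ndeclared++] = (short)st->nlocvars++;
}

/* Like adjustlocalvars: activate the next 'nvars' declared locals by
   recording the pc from which they become visible. */
static void toy_adjustlocalvars (ToyState *st, int nvars) {
  assert(st->nactvar + nvars <= st->ndeclared);
  for (; nvars > 0; nvars--) {
    int idx = st->actvar[st->nactvar++];
    st->locvars[idx].startpc = st->pc;
  }
}
```

Declaring `self` with `toy_new_localvar` leaves `nactvar` unchanged; only the following `toy_adjustlocalvars(&st, 1)` makes it active and stamps it with the current pc, which is the same two-step sequence body uses for the self parameter.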
Next, the formal parameter list is parsed by parlist(ls):
static void parlist (LexState *ls) {
  /* parlist -> [ param { ',' param } ] */
  FuncState *fs = ls->fs;
  Proto *f = fs->f;
  int nparams = 0;
  f->is_vararg = 0;
  if (ls->t.token != ')') {  /* is 'parlist' not empty? */
    do {
      switch (ls->t.token) {
        case TK_NAME: {  /* param -> NAME */
          new_localvar(ls, str_checkname(ls));
          nparams++;
          break;
        }
        case TK_DOTS: {  /* param -> '...' */
          luaX_next(ls);
          f->is_vararg = 2;  /* declared vararg */
          break;
        }
        default: luaX_syntaxerror(ls, "<name> or '...' expected");
      }
    } while (!f->is_vararg && testnext(ls, ','));
  }
  adjustlocalvars(ls, nparams);
  f->numparams = cast_byte(fs->nactvar);
  /* the locals in the parameter list must be kept in registers */
  luaK_reserveregs(fs, fs->nactvar);  /* reserve register for parameters */
}
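As you can see, if ls->t.token is a name, a new local variable is added. If it is '...', the function takes variable arguments, so f->is_vararg is set to 2 and the case breaks. Since '...' can only be the last item in a parameter list, the loop has to exit once '...' has been read.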
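The control flow of parlist can be reproduced on a plain string to see why nothing may follow '...'. This is a hypothetical toy, not Lua's code: `toy_parlist` is an invented name, it parses only the text between the parentheses, and it does not handle whitespace.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical toy, not Lua's code: parses the text between '(' and ')'.
   parlist -> [ param { ',' param } ], param -> NAME | '...'
   Returns the number of named parameters, or -1 on a syntax error;
   '*is_vararg' is set to 1 when the list ends with '...'. */
static int toy_parlist (const char *s, int *is_vararg) {
  int nparams = 0;
  assert(s != NULL && is_vararg != NULL);
  *is_vararg = 0;
  if (*s == '\0') return 0;                 /* empty parameter list */
  for (;;) {
    if (strncmp(s, "...", 3) == 0) {        /* param -> '...' */
      s += 3;
      *is_vararg = 1;
    }
    else if (isalpha((unsigned char)*s) || *s == '_') {  /* param -> NAME */
      while (isalnum((unsigned char)*s) || *s == '_') s++;
      nparams++;
    }
    else return -1;                         /* NAME or '...' expected */
    /* stop after '...' or when no ',' follows, as the real loop does */
    if (*is_vararg || *s != ',') break;
    s++;                                    /* skip ',' */
  }
  return (*s == '\0') ? nparams : -1;
}
```

`toy_parlist("a,...", &va)` reports one named parameter and sets `va`, whereas `toy_parlist("...,a", &va)` fails: the loop has already stopped at '...', so the remaining `,a` is left over, which corresponds to the missing ')' the real parser would complain about.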
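The last statement of parlist, the call to luaK_reserveregs, keeps the local variables in registers. But wait: how can registers show up during parsing? Aren't registers a run-time concept?
That is right: registers only mean something while the Lua virtual machine is running. luaK_reserveregs does not use any register, though. It merely advances the counter, fs->freereg += n. In other words, it only reads and updates register indices, and at run time the virtual machine accesses a register as base address + index.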
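A minimal sketch of this bookkeeping, under the assumption that only two counters matter: the first free index and the largest frame size seen so far. `ToyRegs`, `toy_reserveregs` and `toy_register` are hypothetical names, not Lua's API.

```c
#include <assert.h>

/* Hypothetical toy, not Lua's FuncState: only the two counters that
   matter for register bookkeeping at compile time. */
typedef struct ToyRegs {
  int freereg;        /* index of the first free register */
  int maxstacksize;   /* largest frame size needed so far */
} ToyRegs;

/* Like luaK_reserveregs: no register is touched, an index advances. */
static void toy_reserveregs (ToyRegs *r, int n) {
  assert(n >= 0);
  r->freereg += n;
  if (r->freereg > r->maxstacksize)
    r->maxstacksize = r->freereg;   /* frame must be at least this big */
}

/* What the VM conceptually does later: base address plus index. */
static int *toy_register (int *base, int index) {
  return base + index;
}
```

At compile time only `toy_reserveregs` runs and nothing but integers change. The array that `toy_register` indexes into exists only at run time, when the frame is allocated with at least `maxstacksize` slots.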
Next comes the statement list:
static void statlist (LexState *ls) {
  /* statlist -> { stat [';'] } */
  /* keep parsing statements until the current token ends the block */
  while (!block_follow(ls, 1)) {
    if (ls->t.token == TK_RETURN) {
      statement(ls);
      return;  /* 'return' must be last statement */
    }
    statement(ls);
  }
}
static void statement (LexState *ls) {
  int line = ls->linenumber;  /* may be needed for error messages */
  enterlevel(ls);
  switch (ls->t.token) {
    case ';': {  /* stat -> ';' (empty statement) */
      luaX_next(ls);  /* skip ';' */
      break;
    }
    case TK_IF: {  /* stat -> ifstat */
      ifstat(ls, line);
      break;
    }
    case TK_WHILE: {  /* stat -> whilestat */
      whilestat(ls, line);
      break;
    }
    case TK_DO: {  /* stat -> DO block END */
      luaX_next(ls);  /* skip DO */
      block(ls);
      check_match(ls, TK_END, TK_DO, line);
      break;
    }
    case TK_FOR: {  /* stat -> forstat */
      forstat(ls, line);
      break;
    }
    case TK_REPEAT: {  /* stat -> repeatstat */
      repeatstat(ls, line);
      break;
    }
    case TK_FUNCTION: {  /* stat -> funcstat */
      funcstat(ls, line);
      break;
    }
    case TK_LOCAL: {  /* stat -> localstat */
      luaX_next(ls);  /* skip LOCAL */
      if (testnext(ls, TK_FUNCTION))  /* local function? */
        localfunc(ls);
      else
        localstat(ls);
      break;
    }
    case TK_DBCOLON: {  /* stat -> label */
      luaX_next(ls);  /* skip double colon */
      labelstat(ls, str_checkname(ls), line);
      break;
    }
    case TK_RETURN: {  /* stat -> retstat */
      luaX_next(ls);  /* skip RETURN */
      retstat(ls);
      break;
    }
    case TK_BREAK:   /* stat -> breakstat */
    case TK_GOTO: {  /* stat -> 'goto' NAME */
      gotostat(ls, luaK_jump(ls->fs));
      break;
    }
    default: {  /* stat -> func | assignment */
      exprstat(ls);
      break;
    }
  }
  lua_assert(ls->fs->f->maxstacksize >= ls->fs->freereg &&
             ls->fs->freereg >= ls->fs->nactvar);
  ls->fs->freereg = ls->fs->nactvar;  /* free registers */
  leavelevel(ls);
}
statlist mirrors the BNF: it calls statement to parse one statement at a time. Let's review the BNF for a statement:
stat ::= ‘;’ |
    varlist ‘=’ explist |
    functioncall |
    label |
    break |
    goto Name |
    do block end |
    while exp do block end |
    repeat block until exp |
    if exp then block {elseif exp then block} [else block] end |
    for Name ‘=’ exp ‘,’ exp [‘,’ exp] do block end |
    for namelist in explist do block end |
    function funcname funcbody |
    local function Name funcbody |
    local namelist [‘=’ explist]
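The statement function likewise follows the BNF above and is split into many cases: the ';' empty statement, if statements, do blocks, while loops, function statements, and so on.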
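The relationship between statlist and statement (a loop that stops at a block-ending token, around a switch with one case per alternative) can be condensed into a few lines. This is a hypothetical toy, not Lua's code: the token enum and the `toy_` functions are invented, and every statement is simplified to a single token.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy tokens, not Lua's RESERVED enum. */
enum ToyTok { T_SEMI, T_IF, T_WHILE, T_RETURN, T_NAME, T_END, T_EOS };

/* Like block_follow: tokens that terminate a statement list. */
static int toy_block_follow (enum ToyTok t) {
  return t == T_END || t == T_EOS;
}

/* Like statement: one 'case' per alternative of the 'stat' rule.
   Here every statement is simplified to a single token. */
static const char *toy_statement (enum ToyTok t) {
  switch (t) {
    case T_SEMI:   return "empty";
    case T_IF:     return "if";
    case T_WHILE:  return "while";
    case T_RETURN: return "return";
    default:       return "call-or-assignment";
  }
}

/* Like statlist: parse statements until a block-ending token, and stop
   right after 'return'. Returns the number of statements consumed. */
static int toy_statlist (const enum ToyTok *toks) {
  int n = 0;
  assert(toks != NULL);
  while (!toy_block_follow(toks[n])) {
    int was_return = (toks[n] == T_RETURN);
    (void)toy_statement(toks[n]);
    n++;
    if (was_return) break;   /* 'return' must be the last statement */
  }
  return n;
}
```

Given the tokens `if ; name end`, the toy consumes three statements and stops at `end`. Given `name return name <eos>`, it stops after `return` and leaves the extra token for the caller to reject, which is how the real parser enforces that return is the last statement of a block.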
How each case is parsed and how it is encoded will be described in detail in part three of this series.
Finally, the last lines of body:
check_match(ls, TK_END, TK_FUNCTION, line);
codeclosure(ls, e);
close_func(ls);
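check_match verifies the "end" keyword that closes the function. codeclosure then adds an OP_CLOSURE instruction to the code of the parent function, and close_func finishes the new function and switches back to the parent.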
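check_match takes the opening token and its line so that a missing "end" can be reported together with the place where the construct was opened. The sketch below imitates that idea on plain strings. It is a hypothetical toy: `toy_check_match` and its parameters are invented, and the message wording is only modeled on Lua's, not copied from its error path.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical toy, not Lua's code: mimics the idea of check_match.
   'current' is the token at hand, 'what' the expected closing token,
   'who' the opening token seen at line 'where', 'line' the current line.
   Returns 1 on a match; otherwise writes a message into 'msg', returns 0. */
static int toy_check_match (const char *current, const char *what,
                            const char *who, int where, int line,
                            char *msg, size_t msgsize) {
  assert(msg != NULL && msgsize > 0);
  if (strcmp(current, what) == 0) return 1;
  if (where == line)   /* same line: the short message is enough */
    snprintf(msg, msgsize, "%s expected", what);
  else                 /* otherwise point back at the opening token */
    snprintf(msg, msgsize, "%s expected (to close %s at line %d)",
             what, who, where);
  return 0;
}
```

For a function opened on line 3 whose body runs into the end of the file on line 9, the toy produces a message that names both the missing `end` and the `function` on line 3, which is far more useful than a bare "end expected".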