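I have recently been doing development work around the Session Initiation Protocol (SIP). SIP is very widely used in telecom VoIP and is a text-based protocol. Its grammar is specified in ABNF; readers interested in the SIP grammar can turn to the "Augmented BNF for the SIP Protocol" section of the SIP specification (RFC 3261). ABNF is itself a grammar specification, defined in "Augmented BNF for Syntax Specifications: ABNF" (RFC 5234), and it can formally be defined in terms of itself; interested readers can see Section 4 of that document, "ABNF Definition of ABNF".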
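So if you want to build a SIP stack, the first thing you need is a SIP parser, and that parser is a parser for a grammar written in ABNF. Searching the web for "ABNF parser generator" turns up quite a few. From the standpoint of learning compiler principles, though, we would rather write an ABNF parser generator ourselves: once we have written one by hand, we will understand the open-source generators far better when we use them later.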
The grammar definition of ABNF is short and falls into two parts: the Core Rules and the main body of ABNF. The Core Rules define the most basic symbols:
    ALPHA  = %x41-5A / %x61-7A   ; A-Z / a-z
    BIT    = "0" / "1"
    CHAR   = %x01-7F             ; any 7-bit US-ASCII character, excluding NUL
    CR     = %x0D                ; carriage return
    CRLF   = CR LF               ; Internet standard newline
    CTL    = %x00-1F / %x7F      ; controls
    DIGIT  = %x30-39             ; 0-9
    DQUOTE = %x22                ; " (Double Quote)
    HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"
    HTAB   = %x09                ; horizontal tab
    LF     = %x0A                ; linefeed
    LWSP   = *(WSP / CRLF WSP)   ; linear white space (past newline)
    OCTET  = %x00-FF             ; 8 bits of data
    SP     = %x20                ; space
    VCHAR  = %x21-7E             ; visible (printing) characters
    WSP    = SP / HTAB           ; white space
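As a rough illustration of how small these rules are, a few of them can be written as single-character predicates. This is only a sketch: the function names (`is_alpha`, `is_digit`, and so on) are made up for this example and are not part of any library. One detail worth noting is that quoted strings in ABNF are case-insensitive, so `HEXDIG` also accepts `a` to `f`.

```python
def is_alpha(ch: str) -> bool:
    """ALPHA = %x41-5A / %x61-7A"""
    return len(ch) == 1 and (0x41 <= ord(ch) <= 0x5A or 0x61 <= ord(ch) <= 0x7A)


def is_digit(ch: str) -> bool:
    """DIGIT = %x30-39"""
    return len(ch) == 1 and 0x30 <= ord(ch) <= 0x39


def is_bit(ch: str) -> bool:
    """BIT = "0" / "1" """
    return ch in ("0", "1")


def is_hexdig(ch: str) -> bool:
    """HEXDIG = DIGIT / "A" / "B" / "C" / "D" / "E" / "F"

    ABNF string literals are case-insensitive, so lowercase a-f match too.
    """
    return is_digit(ch) or (len(ch) == 1 and ch in "ABCDEFabcdef")


def is_wsp(ch: str) -> bool:
    """WSP = SP / HTAB"""
    return ch in (" ", "\t")


def is_vchar(ch: str) -> bool:
    """VCHAR = %x21-7E"""
    return len(ch) == 1 and 0x21 <= ord(ch) <= 0x7E
```

Predicates like these are the leaves of the parser; every other rule is built by combining them.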
The main body of ABNF is defined as follows:

    rulelist      = 1*( rule / (*c-wsp c-nl) )
    rule          = rulename defined-as elements c-nl
                      ; continues if next line starts
                      ;  with white space
    rulename      = ALPHA *(ALPHA / DIGIT / "-")
    defined-as    = *c-wsp ("=" / "=/") *c-wsp
                      ; basic rules definition and
                      ;  incremental alternatives
    elements      = alternation *c-wsp
    c-wsp         = WSP / (c-nl WSP)
    c-nl          = comment / CRLF
                      ; comment or newline
    comment       = ";" *(WSP / VCHAR) CRLF
    alternation   = concatenation *(*c-wsp "/" *c-wsp concatenation)
    concatenation = repetition *(1*c-wsp repetition)
    repetition    = [repeat] element
    repeat        = 1*DIGIT / (*DIGIT "*" *DIGIT)
    element       = rulename / group / option / char-val / num-val / prose-val
    group         = "(" *c-wsp alternation *c-wsp ")"
    option        = "[" *c-wsp alternation *c-wsp "]"
    char-val      = DQUOTE *(%x20-21 / %x23-7E) DQUOTE
                      ; quoted string of SP and VCHAR
                      ;  without DQUOTE
    num-val       = "%" (bin-val / dec-val / hex-val)
    bin-val       = "b" 1*BIT [ 1*("." 1*BIT) / ("-" 1*BIT) ]
                      ; series of concatenated bit values
                      ;  or single ONEOF range
    dec-val       = "d" 1*DIGIT [ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ]
    hex-val       = "x" 1*HEXDIG [ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ]
    prose-val     = "<" *(%x20-3D / %x3F-7E) ">"
                      ; bracketed string of SP and VCHAR
                      ;  without angles
                      ; prose description, to be used as
                      ;  last resort
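To get a feel for how directly these rules translate into code, here is a minimal sketch of two of them, `rulename` and `repeat`. The function names, the `(value, next_offset)` return convention, and the use of `None` for "no upper bound" are all assumptions made for this example, not something the specification prescribes.

```python
from typing import Optional, Tuple


def _is_alpha(ch: str) -> bool:
    return "A" <= ch <= "Z" or "a" <= ch <= "z"


def _is_digit(ch: str) -> bool:
    return "0" <= ch <= "9"


def _digits(text: str, pos: int) -> Tuple[str, int]:
    """Consume *DIGIT starting at pos; may return an empty string."""
    end = pos
    while end < len(text) and _is_digit(text[end]):
        end += 1
    return text[pos:end], end


def parse_rulename(text: str, pos: int = 0) -> Tuple[str, int]:
    """rulename = ALPHA *(ALPHA / DIGIT / "-")"""
    if pos >= len(text) or not _is_alpha(text[pos]):
        raise SyntaxError(f"rulename expected at offset {pos}")
    end = pos + 1
    while end < len(text) and (
        _is_alpha(text[end]) or _is_digit(text[end]) or text[end] == "-"
    ):
        end += 1
    return text[pos:end], end


def parse_repeat(text: str, pos: int = 0) -> Tuple[Tuple[int, Optional[int]], int]:
    """repeat = 1*DIGIT / (*DIGIT "*" *DIGIT)

    Returns ((minimum, maximum), next_offset); maximum is None when unbounded.
    """
    low, after_low = _digits(text, pos)
    if after_low < len(text) and text[after_low] == "*":
        high, after_high = _digits(text, after_low + 1)
        minimum = int(low) if low else 0
        maximum = int(high) if high else None
        return (minimum, maximum), after_high
    if not low:
        raise SyntaxError(f"repeat expected at offset {pos}")
    count = int(low)  # a bare number means "exactly n times"
    return (count, count), after_low
```

For instance, `parse_repeat("1*DIGIT")` yields a minimum of 1 with no upper bound and stops right before `DIGIT`, which is where a parser for `element` would take over.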
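In the Dragon Book ("Compilers: Principles, Techniques, and Tools" by Alfred V. Aho and colleagues), recursive-descent parsing is the easiest algorithm to write by hand. The form used here is predictive parsing: the parser looks ahead at a few input characters to decide which grammar rule (production) to apply, each nonterminal is written as a function, and those functions simply call one another recursively. For the grammar of ABNF itself, looking ahead one or two characters is usually enough, although this has limitations that we will analyze later. The predictive parser in the Dragon Book targets context-free grammars written as BNF-style productions, and ABNF differs from BNF in a few ways: for example, it has no empty symbol (epsilon), and it adds operators such as repetition. On the whole, though, the two are very similar.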
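To make the lookahead idea concrete, consider the `element` rule of ABNF: each of its six alternatives starts with a different character, so one character is enough to pick the alternative, and a second character tells the three kinds of `num-val` apart. The sketch below only predicts which alternative applies and does not parse it; the function name `predict_element` is invented for this example.

```python
def predict_element(text: str, pos: int = 0) -> str:
    """Predict which alternative of

        element = rulename / group / option / char-val / num-val / prose-val

    starts at text[pos], using at most two characters of lookahead.
    """
    if pos >= len(text):
        raise SyntaxError("unexpected end of input")
    ch = text[pos]
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return "rulename"      # rulename = ALPHA ...
    if ch == "(":
        return "group"
    if ch == "[":
        return "option"
    if ch == '"':
        return "char-val"      # char-val = DQUOTE ...
    if ch == "<":
        return "prose-val"
    if ch == "%":
        # num-val = "%" (bin-val / dec-val / hex-val): a second character
        # is needed. ABNF literals are case-insensitive, hence lower().
        kinds = {"b": "bin-val", "d": "dec-val", "x": "hex-val"}
        second = text[pos + 1:pos + 2].lower()
        if second in kinds:
            return kinds[second]
        raise SyntaxError(f"'b', 'd' or 'x' expected at offset {pos + 1}")
    raise SyntaxError(f"no element can start with {ch!r} at offset {pos}")
```

A real parser would call the matching parse function instead of returning a name, but the decision logic would look much the same.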
For example, take the following production:

    stmt -> for ( optexpr ; optexpr ; optexpr ) stmt

The parser for it can be written like this:
    void stmt() {
        match(for); match('(');
        optexpr(); match(';');
        optexpr(); match(';');
        optexpr(); match(')');
        stmt();
    }
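The pseudocode leaves out `match`, `optexpr`, and the way the recursion ends. Here is a runnable sketch that fills those in under some simplifying assumptions of my own: the input is already split into tokens, `expr` and `other` are treated as single terminal tokens, and the grammar is extended with `stmt -> other` and `optexpr -> expr | (empty)` so that the recursion can terminate. The `Parser` class and the `parse` helper are hypothetical names for this example.

```python
class Parser:
    """Predictive recursive-descent parser for the toy grammar

        stmt    -> for ( optexpr ; optexpr ; optexpr ) stmt
                 | other
        optexpr -> expr | (empty)
    """

    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    @property
    def lookahead(self):
        # The current token, or None once the input is exhausted.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        # Consume one token, or fail if it is not the expected one.
        if self.lookahead != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.lookahead!r}")
        self.pos += 1

    def stmt(self):
        # One token of lookahead selects the production.
        if self.lookahead == "for":
            self.match("for"); self.match("(")
            self.optexpr(); self.match(";")
            self.optexpr(); self.match(";")
            self.optexpr(); self.match(")")
            self.stmt()
        elif self.lookahead == "other":
            self.match("other")
        else:
            raise SyntaxError(f"statement expected, found {self.lookahead!r}")

    def optexpr(self):
        if self.lookahead == "expr":
            self.match("expr")
        # Otherwise take the empty production and consume nothing.


def parse(tokens) -> bool:
    """Return True if the whole token sequence is exactly one stmt."""
    parser = Parser(tokens)
    try:
        parser.stmt()
    except SyntaxError:
        return False
    return parser.pos == len(parser.tokens)
```

With this in place, `parse("for ( ; expr ; expr ) other".split())` accepts the input, while a token list with a missing `)` is rejected.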
How about that? Pretty simple, right? In the next post we will start writing the ABNF parser by hand.