See the resource of the same name for the full implementation code.
After studying Lexical Analysis, the third chapter of “Principles of Compilation”, I understand how a lexical analyzer is constructed and how it works. As the saying goes, true knowledge comes from practice, so implementing a lexical analyzer myself should deepen that understanding.
As the ancestor of many other high-level languages, C spread rapidly around the world thanks to its rich feature set, expressive power, flexibility, and wide range of applications. This experiment therefore takes a subset of the C lexical grammar as its target.
To meet the experimental objectives, the lexer has to satisfy the following lexical requirements:
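I consulted the C lexical grammar reference published on the Microsoft website (C lexical grammar | Microsoft Learn) and found that tokens can be classified as follows:
I then analyzed each category in turn: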
The website also provides descriptions of most of the keywords.
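An identifier can be regarded as any variable name that is not a keyword.
A constant can be regarded as an integer such as 1112 or a decimal such as 11.21.
A string can be regarded as a sequence of characters enclosed in “ ” or ‘ ’.
The website also provides descriptions of most of the punctuators.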
Because this experiment targets only a subset of the C lexical grammar, I made the following assumptions to keep the experiment feasible:
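header-file inclusion with include
comments of the single form // this is a comment
In addition, I made some assumptions about the different types in part c):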
_, and otherwise only 0-9, a-z, and A-Z
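Assuming the character set above describes identifiers, a minimal sketch of scanning a word and deciding whether it is a keyword might look as follows. The class `IdentifierSketch` and its methods are hypothetical names, not part of the original analyzer; the numeric codes are copied from the sample output shown later, and only the keywords that appear there are listed. The sketch also assumes, as in standard C, that an identifier cannot start with a digit.

```java
import java.util.Map;

// Hypothetical helper for illustration; not the original analyzer code.
final class IdentifierSketch {
    // Codes taken from the sample output in this write-up (partial keyword table).
    static final Map<String, Integer> KEYWORDS = Map.of(
            "int", 18, "return", 22, "static", 26, "void", 32, "while", 34);
    static final int IDENTIFIER = 35;

    // Assumption: an identifier starts with a letter or underscore.
    static boolean isIdentStart(char c) {
        return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // After the first character, digits are allowed as well.
    static boolean isIdentPart(char c) {
        return isIdentStart(c) || (c >= '0' && c <= '9');
    }

    /** Returns "(code,lexeme)" for the word starting at start, or null if no word starts there. */
    static String scanWord(String src, int start) {
        if (start >= src.length() || !isIdentStart(src.charAt(start))) {
            return null;
        }
        int end = start + 1;
        while (end < src.length() && isIdentPart(src.charAt(end))) {
            end++;
        }
        String lexeme = src.substring(start, end);
        // A word is a keyword if it is in the table, otherwise an identifier.
        int code = KEYWORDS.getOrDefault(lexeme, IDENTIFIER);
        return "(" + code + "," + lexeme + ")";
    }
}
```

Scanning the whole word first and looking it up afterwards keeps the automaton for words small: keywords do not need states of their own.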
Because this experiment implements a large number of lexical rules, only some of the finite automata (FA) are listed here.
* indicates that the lexer takes a step forward after reaching the accepting state
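As one concrete illustration of such an automaton, the numeric constants described earlier (integers such as 1112, decimals such as 11.21) could be recognized by a small DFA. This is only a sketch: the state names, the class `NumberDfaSketch`, and the longest-match strategy are assumptions made here, not the automaton from the original listing.

```java
// Hypothetical DFA sketch for integer and decimal constants.
final class NumberDfaSketch {
    enum State { START, INT, DOT, FRAC, DEAD }

    // Transition function: digits drive the automaton, a single '.' is allowed after digits.
    static State step(State s, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (s) {
            case START: return digit ? State.INT : State.DEAD;
            case INT:   return digit ? State.INT : (c == '.' ? State.DOT : State.DEAD);
            case DOT:   return digit ? State.FRAC : State.DEAD;
            case FRAC:  return digit ? State.FRAC : State.DEAD;
            default:    return State.DEAD;
        }
    }

    // INT accepts integers, FRAC accepts decimals; DOT ("11.") is not accepting.
    static boolean accepting(State s) {
        return s == State.INT || s == State.FRAC;
    }

    /** Returns the end index (exclusive) of the longest numeric constant starting at start, or -1. */
    static int scanNumber(String src, int start) {
        State s = State.START;
        int lastAccept = -1;
        for (int i = start; i < src.length(); i++) {
            s = step(s, src.charAt(i));
            if (s == State.DEAD) {
                break;
            }
            if (accepting(s)) {
                lastAccept = i + 1; // remember the last position where a token could end
            }
        }
        return lastAccept;
    }
}
```

Remembering the last accepting position means that an input such as `11.x` still yields the integer `11`, with scanning resuming at the dot.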
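Because the way a lexical analyzer runs closely resembles the data-flow style covered in this semester's Software Architecture course, I made heavy use of conditional control statements such as if else and switch case when building it.
Overall, the following do ... while loop lets the Scanner keep reading character input and analyzes it by calling analyzer().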
do {
    analyzer();
    switch (syn) {
        case 36: // numeric constant
            System.out.println("(" + syn + "," + sum + ")");
            break;
        case 37: // string literal
            System.out.println("(" + syn + "," + literal_string + ")");
            break;
        case -1: // lexical error
            System.out.println("Error in row " + row + "! error:" + error);
            break;
        case -2: // comment: nothing to print
            break;
        default: // every other token is printed from the token buffer
            System.out.println("(" + syn + "," + token + ")");
    }
} while (!(index > storage.length() - 1));
The core of the lexical analyzer is implemented in analyzer().
First, the lexical analyzer has to ignore spaces and comments.
while (ch == ' ') {
    ch = storage.charAt(index++); // skip space characters
}
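The loop above handles the space character only. A slightly more defensive variant, sketched here under the assumption that tabs, carriage returns, and newlines should be skipped too and that `row` counts lines, could look like this. `WhitespaceSketch` and `skipBlanks` are hypothetical names.

```java
// Hypothetical helper: skips blanks with a bounds check and keeps the row counter in sync.
final class WhitespaceSketch {
    /**
     * Returns {nextIndex, newRow}: the index of the first non-blank character at or
     * after index, and row advanced by the number of '\n' characters skipped.
     */
    static int[] skipBlanks(String src, int index, int row) {
        while (index < src.length()) {
            char c = src.charAt(index);
            if (c == '\n') {
                row++; // a newline is skipped like a blank but also starts a new row
            } else if (c != ' ' && c != '\t' && c != '\r') {
                break; // first character that belongs to a token
            }
            index++;
        }
        return new int[] { index, row };
    }
}
```

Checking `index < src.length()` before every read means trailing blanks at the end of the input cannot run past the end of the string.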
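For //, the point to watch is that / is also an operator. My implementation first reads one /, then reads one more character. If that character is also /, the rest of the line is a comment; if not, the first character is the operator /, and index-- is applied to simulate backtracking.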
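case '/':
    token.append(ch);
    ch = storage.charAt(index++); // look one character ahead
    if (ch == '/') {
        // "//" starts a comment: skip to the end of the line (or of the input)
        while (ch != '\n' && index < storage.length()) {
            ch = storage.charAt(index++);
        }
        row++;
        syn = -2; // nothing to print for a comment
        break;
    } else {
        syn = 50; // a lone '/' is the division operator
        index--;  // back up so the look-ahead character is scanned again
    }
    break;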
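An alternative to reading one character and then backing up is to peek at the next character without consuming it. The sketch below is a swapped-in variant of the branch above, not the original code: `SlashSketch` and `scanSlash` are hypothetical names, and the codes 50 and -2 are reused from that branch.

```java
// Hypothetical peek-based variant of the '/' handling.
final class SlashSketch {
    static final int DIVIDE = 50;  // code used for '/' in the original branch
    static final int SKIPPED = -2; // code used there for "nothing to print"

    /**
     * Precondition: src.charAt(index) == '/'.
     * Returns {syn, nextIndex}, where nextIndex is the position at which scanning resumes.
     */
    static int[] scanSlash(String src, int index) {
        // Peek at the following character without advancing the index.
        boolean isComment = index + 1 < src.length() && src.charAt(index + 1) == '/';
        if (!isComment) {
            return new int[] { DIVIDE, index + 1 };
        }
        // Line comment: resume after the next newline, or at the end of the input.
        int end = src.indexOf('\n', index);
        int next = (end == -1) ? src.length() : end + 1;
        return new int[] { SKIPPED, next };
    }
}
```

Because nothing is consumed before the decision is made, no `index--` is needed, and a comment on the last line without a trailing newline is handled by the `end == -1` case.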
Second, because keywords and strings
For strings
Test case a.txt
int main()
{
//定义小编兜里的钱
int money =12;
//定义打车回家的费用
int cost =11;
printf("小编能不能打车回家呢:");
//输出y小编就打车回家了,输出n小编就不能打车回家
printf("%c\n",money>=cost?'y':'n');
return 0;
}
Experimental results
(18,int)
(35,main)
(54,()
(55,))
(56,{)
(18,int)
(35,money)
(49,=)
(36,12)
(53,;)
(18,int)
(35,cost)
(49,=)
(36,11)
(53,;)
(35,printf)
(54,()
(37,"小编能不能打车回家呢:")
(55,))
(53,;)
(35,printf)
(54,()
(37,"%c\n")
(60,,)
(35,money)
(42,>=)
(35,cost)
(61,?)
(38,)
(47,:')
(38,)
(55,))
(53,;)
(22,return)
(36,0)
(53,;)
(57,})
Process finished with exit code 0
Test case b.txt
void func1(void);
static int count=10;
int main()
{
while (count--) {
func1();
}
return 0;
}
void func1(void)
{
static int thingy=5;
thingy++;
printf(" thingy 为 %d , count 为 %d\n", thingy, count);
}
Experimental results
(32,void)
(35,func1)
(54,()
(32,void)
(55,))
(53,;)
(26,static)
(18,int)
(35,count)
(49,=)
(36,10)
(53,;)
(18,int)
(35,main)
(54,()
(55,))
(56,{)
(34,while)
(54,()
(35,count)
(51,--)
(55,))
(56,{)
(35,func1)
(54,()
(55,))
(53,;)
(57,})
(22,return)
(36,0)
(53,;)
(57,})
(32,void)
(35,func1)
(54,()
(32,void)
(55,))
(56,{)
(26,static)
(18,int)
(35,thingy)
(49,=)
(36,5)
(53,;)
(35,thingy)
(51,++)
(53,;)
(35,printf)
(54,()
(37,"thingy 为 %d , count 为 %d\n")
(60,,)
(35,thingy)
(60,,)
(35,count)
(55,))
(53,;)
(57,})
Process finished with exit code 0
Since this implementation covers only a subset of the C lexical grammar, the larger the subset, the harder the experiment becomes. Even though I thought I had covered more than half of the C lexical grammar, something went wrong when I tested a simple C program as a use case.
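Take "This is a \" cat" as an example. My first approach to strings was that once the lexical analyzer reached a ", it entered the else if branch and kept reading until it met another ", at which point the string was considered finished. For a string that contains an escape, such as "This is a \" cat", this approach no longer works.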
else if (ch == '"') {
    /*
        Literal String
    */
    literal_string = "\"";
    ch = storage.charAt(index++);
    while (ch != '"') {
        literal_string += ch;
        ch = storage.charAt(index++);
    }
    literal_string += "\"";
    syn = 37;
}
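After realizing this problem, I modified the code, which enlarged the lexical subset supported by the analyzer in this experiment.
At first I planned that, after reading \, I would read the second character without analyzing it and move straight on to the third. I then found that sequences such as \g exist, which are not escapes, so I added a further if else branch after that point.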
else if (ch == '"') {
    /*
        Literal String
    */
    literal_string = "\"";
    ch = storage.charAt(index++);
    while (ch != '"') {
        literal_string += ch;
        if (ch == '\\') {
            // keep the character after a backslash, so \" does not end the string
            ch = storage.charAt(index++);
            literal_string += ch;
        }
        ch = storage.charAt(index++);
    }
    literal_string += "\"";
    syn = 37;
}
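The escape handling can also be tried in isolation. The following is a self-contained sketch under a few assumptions: `StringLiteralSketch` and `scanString` are hypothetical names, the character after a backslash is copied without checking that it forms a valid escape, and a missing closing quote is reported by returning `null` instead of reading past the end of the input.

```java
// Hypothetical standalone version of the string-literal branch.
final class StringLiteralSketch {
    /**
     * Precondition: src.charAt(start) == '"'.
     * Returns the literal including both quotes, or null if the closing quote is missing.
     */
    static String scanString(String src, int start) {
        StringBuilder literal = new StringBuilder("\"");
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i++);
            literal.append(c);
            if (c == '\\' && i < src.length()) {
                // Copy the escaped character as-is so that \" does not end the literal.
                literal.append(src.charAt(i++));
            } else if (c == '"') {
                return literal.toString(); // unescaped quote closes the literal
            }
        }
        return null; // input ended before the closing quote
    }
}
```

For the source text `"This is a \" cat"` the method returns the whole literal, while an unterminated input such as `"abc` yields `null`, which a caller could map to the error code -1.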
Since this implementation covers only a subset of the C lexical grammar, the larger the subset, the harder the experiment becomes. Even though I thought the subset I had chosen was simple enough, unexpected problems still appeared when I tested a C program that fit that subset.
In the process of writing the lexer, I gained a deeper understanding of the constructions that convert between NFAs and DFAs.
A complete lexer is therefore ready for real use only after repeated rounds of debugging and iteration.