Lexical Analyzer Implementation (Java)

For the complete implementation, see the resource with the same name.

Project: Lexical Analyzer Programming

a)Motivation/Aim

After studying Chapter 3, Lexical Analysis, of “Principles of Compilation”, I understand how a lexical analyzer is constructed and how it works. As the saying goes, true knowledge comes from practice: implementing a lexical analyzer myself helps me deepen my understanding of lexical analysis.

b)Content description

As the ancestor of many other high-level languages, C spread rapidly around the world thanks to its rich functionality, expressive power, flexibility, convenience, and wide range of applications. This experiment therefore takes a subset of the C lexical grammar as its target for lexical analysis.

To meet the experimental objectives, the lexer must support the following lexical features:

  • Relational comparison
  • Constant value declarations
  • Identifier recognition
  • Basic arithmetic operations
  • Control keywords
    • if
    • then
    • else
    • do
    • while

c)Ideas/Methods

I consulted Microsoft's official reference on the C lexical grammar (C lexical grammar | Microsoft Learn) and found that tokens can be classified as follows:

[Figure: token classification]

I then analyzed each category in turn:

  • Keywords
[Figure: keywords]

The website also provides descriptions of most of the keywords.

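  • Identifiers

An identifier can be regarded as any name, such as a variable name, that is not a keyword.

  • Constants

A constant is either an integer such as 1112 or a decimal number such as 11.21.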
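The constant rule above can be sketched as a small scanning routine. This is only an illustrative sketch, not the code from this experiment: the class name `NumberScanner` and the method `scanNumber` are hypothetical, and it assumes unsigned constants made of ASCII digits with an optional fractional part.

```java
// Hypothetical helper, not part of the original lexer.
final class NumberScanner {
    // Only ASCII digits count, matching constants such as 1112 or 11.21.
    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    /**
     * Returns the index just past the numeric constant that starts at
     * {@code start}, or {@code start} itself if no digit is found there.
     */
    static int scanNumber(String src, int start) {
        int i = start;
        while (i < src.length() && isAsciiDigit(src.charAt(i))) {
            i++; // integer part
        }
        if (i == start) {
            return start; // no constant here
        }
        // Accept a fractional part only when '.' is followed by a digit.
        if (i + 1 < src.length() && src.charAt(i) == '.'
                && isAsciiDigit(src.charAt(i + 1))) {
            i++; // consume '.'
            while (i < src.length() && isAsciiDigit(src.charAt(i))) {
                i++; // fractional part
            }
        }
        return i;
    }
}
```

For example, `scanNumber("11.21;", 0)` stops just before the `;`, so the lexeme is `src.substring(0, 5)`.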

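  • Strings

A string can be regarded as a sequence of characters enclosed in " " or ' '.

  • Symbols (punctuators)
[Figure: punctuators]

The website also describes most of the punctuators.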

d)Assumptions

Because this experiment targets only a subset of the C lexical grammar, I made the following assumptions to keep it feasible:

  1. Because the compilation phase comes before the linking phase, the test cases are assumed not to contain header inclusion directives such as #include.
  2. Comments are assumed to take only one form: // this is a comment

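In addition, I made some assumptions about the token types described in part c):

  1. Keywords and identifiers do not contain _; they consist only of 0-9, a-z, and A-Z.
  2. Constants include only floating-point numbers and integers, with no distinction between positive and negative.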
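Under these assumptions, separating keywords from identifiers can be sketched as a table lookup after scanning a run of letters and digits. This is a hypothetical sketch, not the author's code: the names `WordClassifier`, `scanWord`, and `classify` are invented, the keyword table is deliberately partial, and its numeric codes are the ones that appear in the sample output later in this post (35 for identifiers). Requiring the first character to be a letter is an extra assumption borrowed from C.

```java
import java.util.Map;

// Hypothetical helper, not part of the original lexer.
final class WordClassifier {
    static final int IDENTIFIER = 35; // code seen in the sample output

    // Partial keyword table; codes copied from the sample output.
    static final Map<String, Integer> KEYWORDS = Map.of(
            "int", 18,
            "return", 22,
            "static", 26,
            "void", 32,
            "while", 34);

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isLetterOrDigit(char c) {
        return isLetter(c) || (c >= '0' && c <= '9');
    }

    /** Returns the index just past the word at start, or start if none. */
    static int scanWord(String src, int start) {
        if (start >= src.length() || !isLetter(src.charAt(start))) {
            return start; // assumption: a word must begin with a letter
        }
        int i = start + 1;
        while (i < src.length() && isLetterOrDigit(src.charAt(i))) {
            i++;
        }
        return i;
    }

    /** Keywords get their own code; everything else is an identifier. */
    static int classify(String word) {
        return KEYWORDS.getOrDefault(word, IDENTIFIER);
    }
}
```

With this table, `classify("int")` yields 18 and `classify("money")` yields 35, which matches the (18,int) and (35,money) pairs in the results.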

e)Related FA descriptions

Because this experiment implements a large number of lexical rules, only some of the FAs are listed here.

* indicates that the lexer takes a step forward after reaching the accepting state

[Figure: finite automata for part of the lexical rules]
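To make the idea of an FA concrete, here is a minimal sketch of one automaton written as explicit states in Java. It is not a transcription of the diagram; it is an assumed DFA for the operators <, <=, >, >=, =, == and !=, and the names `RelOpDfa`, `State`, and `match` are hypothetical. When the second character does not extend the operator, the automaton accepts without consuming it.

```java
// Hypothetical DFA sketch, not taken from the original figure.
final class RelOpDfa {
    enum State { START, SEEN_LT, SEEN_GT, SEEN_EQ, SEEN_BANG }

    /**
     * Returns the longest operator among < <= > >= = == != that begins
     * at start, or the empty string if none begins there.
     */
    static String match(String src, int start) {
        State state = State.START;
        int i = start;
        while (i < src.length()) {
            char c = src.charAt(i);
            switch (state) {
                case START:
                    if (c == '<') state = State.SEEN_LT;
                    else if (c == '>') state = State.SEEN_GT;
                    else if (c == '=') state = State.SEEN_EQ;
                    else if (c == '!') state = State.SEEN_BANG;
                    else return ""; // no transition from START
                    i++;
                    break;
                default:
                    // Second character: '=' extends the operator.
                    if (c == '=') {
                        return src.substring(start, i + 1);
                    }
                    // Otherwise accept the one-character operator without
                    // consuming c; a lone '!' is not covered by this sketch.
                    return state == State.SEEN_BANG
                            ? "" : src.substring(start, i);
            }
        }
        // Input ended right after the first character (or was empty).
        return (state == State.START || state == State.SEEN_BANG)
                ? "" : src.substring(start, i);
    }
}
```

On the test input `money>=cost`, calling `match` at the position of `>` returns `>=` rather than stopping at `>`, which is the longest-match behavior a lexer needs.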

f)Description of important Software Architecture

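Because the way a lexical analyzer runs closely resembles the data-flow style covered in this semester's Software Architecture course, I made extensive use of conditional control statements such as if/else and switch/case when building it.

Overall, the following do-while loop lets the scanner keep reading character input and analyzes it by calling analyzer().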

do {
    analyzer();
    switch (syn) {
        case 36:
            System.out.println("(" + syn + "," + sum + ")");
            break;
        case 37:
            System.out.println("(" + syn + "," + literal_string + ")");
            break;
        case -1:
            System.out.println("Error in row " + row + "! error:" + error);
            break;
        case -2:
            break;
        default:
            System.out.println("(" + syn + "," + token + ")");
    }
} while (!(index > storage.length() - 1));

g)Description of core Algorithms

The core of the lexical analyzer is implemented in analyzer().

First, the lexical analyzer needs to ignore spaces and comments.

while (ch == ' ') {
    ch = storage.charAt(index++);      // skip space characters
}

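For //, note that / is also an operator. My implementation first reads one /, then reads one more character. If that character is also /, the rest of the line is a comment; if not, the first / is the division operator, and index-- is applied to simulate backtracking.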

case '/':
    token.append(ch);
    ch = storage.charAt(index++);
    if (ch == '/') {
        while (ch != '\n' && index < storage.length()) {
            ch = storage.charAt(index++);  // skip the comment up to the end of the line
        }
        row++;
        syn = -2;
        break;
    } else {
        syn = 50;
        index--;
    }
    break;
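The same decision can also be made by peeking at the next character rather than reading it and then stepping index back. The sketch below shows that alternative; it is not the code used in this experiment. The names `SlashScanner` and `scanSlash` are hypothetical, the codes -2 and 50 are the ones used above, and the bounds checks are an addition so that a comment on the last line without a trailing newline does not run past the end of the input.

```java
// Hypothetical helper showing a peek-based variant of the '/' handling.
final class SlashScanner {
    static final int COMMENT = -2; // same code as in the snippet above
    static final int DIVIDE = 50;  // same code as in the snippet above

    /**
     * Precondition: src.charAt(start) == '/'.
     * Returns {tokenCode, nextIndex}.
     */
    static int[] scanSlash(String src, int start) {
        int next = start + 1;
        // Peek: a second '/' means the rest of the line is a comment.
        if (next < src.length() && src.charAt(next) == '/') {
            int i = next + 1;
            while (i < src.length() && src.charAt(i) != '\n') {
                i++; // skip the comment body
            }
            if (i < src.length()) {
                i++; // consume the newline itself
            }
            return new int[] {COMMENT, i};
        }
        // Nothing was consumed beyond the '/', so no backtracking is needed.
        return new int[] {DIVIDE, next};
    }
}
```

Because the second character is only inspected and never consumed, the division case needs no index-- step; the caller would still be responsible for incrementing row after a comment.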

Next, because keywords and strings

As for strings

h)Use cases on running

Test case a.txt

int main()
{
    //定义小编兜里的钱
    int money =12;
    //定义打车回家的费用
    int cost =11;
    printf("小编能不能打车回家呢:");
    //输出y小编就打车回家了,输出n小编就不能打车回家
    printf("%c\n",money>=cost?'y':'n');
    return 0;
}

Experiment results

(18,int)
(35,main)
(54,()
(55,))
(56,{)
(18,int)
(35,money)
(49,=)
(36,12)
(53,;)
(18,int)
(35,cost)
(49,=)
(36,11)
(53,;)
(35,printf)
(54,()
(37,"小编能不能打车回家呢:")
(55,))
(53,;)
(35,printf)
(54,()
(37,"%c\n")
(60,,)
(35,money)
(42,>=)
(35,cost)
(61,?)
(38,)
(47,:')
(38,)
(55,))
(53,;)
(22,return)
(36,0)
(53,;)
(57,})

Process finished with exit code 0

b.txt

void func1(void);

static int count=10;

int main()
{
  while (count--) {
      func1();
  }
  return 0;
}

void func1(void)
{
  static int thingy=5;
  thingy++;
  printf(" thingy 为 %d , count 为 %d\n", thingy, count);
}

Test case experiment results

(32,void)
(35,func1)
(54,()
(32,void)
(55,))
(53,;)
(26,static)
(18,int)
(35,count)
(49,=)
(36,10)
(53,;)
(18,int)
(35,main)
(54,()
(55,))
(56,{)
(34,while)
(54,()
(35,count)
(51,--)
(55,))
(56,{)
(35,func1)
(54,()
(55,))
(53,;)
(57,})
(22,return)
(36,0)
(53,;)
(57,})
(32,void)
(35,func1)
(54,()
(32,void)
(55,))
(56,{)
(26,static)
(18,int)
(35,thingy)
(49,=)
(36,5)
(53,;)
(35,thingy)
(51,++)
(53,;)
(35,printf)
(54,()
(37,"thingy 为 %d , count 为 %d\n")
(60,,)
(35,thingy)
(60,,)
(35,count)
(55,))
(53,;)
(57,})

Process finished with exit code 0

i)Problems occurred and related solutions

Since this implementation covers only a subset of the C lexical grammar, the larger the subset, the harder the experiment becomes. Even though I thought I had covered more than half of the C lexical grammar, something went wrong when I tested a simple C program as a use case.

Take "This is a \" cat" as an example. My initial approach to strings was that once the lexer reached a ", it entered the else if branch and kept reading until it found another ", at which point the string was considered finished. For a string containing an escape, such as "This is a \" cat", this approach does not work.

else if (ch == '"') {
    /*
    Literal String
     */
    literal_string = "\"";
    ch = storage.charAt(index++);
    while (ch != '"') {
        literal_string += ch;
        ch = storage.charAt(index++);
    }
    literal_string += "\"";
    syn = 37;
}

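After realizing this problem, I modified the code, which also expanded the lexical subset supported by the lexical analyzer implemented in this experiment.

At first, I planned that after reading a \, the lexer would read the second character without analyzing it and move straight on to the third character. However, I found that sequences such as \g exist that are not escapes, so I added a further if/else branch after that point.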

else if (ch == '"') {
    /*
    Literal String
     */
    literal_string = "\"";
    ch = storage.charAt(index++);
    while (ch != '"') {
        literal_string += ch;
        if (ch == '\\') {
            // also consume the character that follows the backslash
            ch = storage.charAt(index++);
            literal_string += ch;
        }
        ch = storage.charAt(index++);
    }
    literal_string += "\"";
    syn = 37;
}
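As a self-contained illustration of the escape handling, the sketch below scans a string literal over a plain String and reports where it ends. It is a hedged sketch rather than the code of this experiment: `StringLiteralScanner` and `scanString` are hypothetical names, returning -1 for an unterminated literal is an assumed error convention, and treating a raw newline as an error is an assumption borrowed from C. Like the snippet above, it does not check whether the character after the backslash forms a valid escape.

```java
// Hypothetical helper, not part of the original lexer.
final class StringLiteralScanner {
    /**
     * Precondition: src.charAt(start) == '"'.
     * Returns the index just past the closing quote, or -1 if the
     * literal is not terminated.
     */
    static int scanString(String src, int start) {
        int i = start + 1; // skip the opening quote
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                i += 2; // skip the backslash and the escaped character
                continue;
            }
            if (c == '"') {
                return i + 1; // closing quote found
            }
            if (c == '\n') {
                return -1; // assumption: no raw newline inside a literal
            }
            i++;
        }
        return -1; // ran off the end of the input
    }
}
```

For the input `"This is a \" cat"`, the scan skips the escaped quote and stops after the final quote, so `src.substring(start, end)` is the whole literal.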

j) Your feelings and comments

Since this implementation covers only a subset of the C lexical grammar, the larger the subset, the harder the experiment becomes. Even though I thought my chosen subset of the C lexical grammar was simple enough, some unexpected problems occurred when I tested a C program that fit that subset as a use case.

In the process of writing the lexer, I gained a deeper understanding of the conversion between ‘DFA’ and ‘NFA’.

Thus, a complete lexer can only be used for development after continuous debugging iterations.
