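According to the Javadoc, the StreamTokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time. The parsing process is controlled by a table and a number of flags that can be set to various states. The stream tokenizer can recognize identifiers, numbers, quoted strings, and various comment styles.
Put simply, it is a class that breaks a source file into tokens, each of which belongs to a category such as number, word, end of line, or end of file.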
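As a quick illustration of these token categories, here is a minimal sketch that tokenizes a short string with the default syntax table and labels each token. The class name TokenKinds and the method describe are made up for this example, and the label strings are arbitrary.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original article.
class TokenKinds {

    /** Tokenizes text with the default syntax table and labels each token. */
    static List<String> describe(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        tokenizer.eolIsSignificant(true); // report line ends as TT_EOL tokens
        List<String> result = new ArrayList<>();
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                switch (token) {
                    case StreamTokenizer.TT_NUMBER:
                        result.add("NUMBER " + tokenizer.nval);
                        break;
                    case StreamTokenizer.TT_WORD:
                        result.add("WORD " + tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_EOL:
                        result.add("EOL");
                        break;
                    default:
                        if (token == '"' || token == '\'') {
                            // for quoted strings the token type is the quote character
                            result.add("STRING " + tokenizer.sval);
                        } else {
                            // any other value is an ordinary single character
                            result.add("CHAR " + (char) token);
                        }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        describe("int number = 30;\nString s = \"hi\";").forEach(System.out::println);
    }
}
```

Numbers are always reported as double values in nval, which is why 30 shows up as 30.0.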
This article uses the following source file as the sample input:
package com.iteye.liugang594.java.thread;

import java.util.concurrent.CountDownLatch;

public class TestCountDownLatch {

    private int number = 30;
    private long seconds = 40000L;

    /**
     * @param args
     * @throws InterruptedException
     */
    public static void main(String[] args) throws InterruptedException {

        //create countdown
        final CountDownLatch countDownLatch = new CountDownLatch(10);

        for(int i = 0;i< 10;i++){
            new Thread("Thread "+i){
                public void run() {
                    try {
                        //wait
                        countDownLatch.await();
                    } catch (InterruptedException e) {
                        e.printStackTrace();
                    }
                    System.out.println(getName()+" started");
                }
            }.start();
            Thread.sleep(50);
            //count down
            countDownLatch.countDown();
        }
    }
}
1. Extracting numbers
First, let's extract the numbers from the sample file. There are six of them: 30, 40000L, 10, 0, 10, and 50.
// reader is a java.io.Reader opened on the sample source file
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.parseNumbers(); // already the default; kept here for clarity
int nextToken = tokenizer.nextToken();
while (nextToken != StreamTokenizer.TT_EOF) {
    if (nextToken == StreamTokenizer.TT_NUMBER) {
        // nval holds the numeric value as a double
        System.out.println("number " + tokenizer.nval + " on line " + tokenizer.lineno());
    }
    nextToken = tokenizer.nextToken();
}
Output:
number 30.0 on line 7
number 40000.0 on line 8
number 10.0 on line 17
number 0.0 on line 19
number 10.0 on line 19
number 0.0 on line 30
number 50.0 on line 31
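There is one problem: line 30 contains no number, yet 0.0 is reported. At first this looks like a bug in StreamTokenizer, but it follows from how number parsing works: parseNumbers() gives '.' the numeric attribute, so the dot right after the closing brace in }.start() begins a number token, and since no digits follow, its value is 0.0. Removing that dot from the source gives the expected result. Alternatively, treat the dot as an ordinary character, for example: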
// reader is a java.io.Reader opened on the sample source file
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.parseNumbers();
tokenizer.ordinaryChar('.'); // a '.' can no longer start a number token
int nextToken = tokenizer.nextToken();
while (nextToken != StreamTokenizer.TT_EOF) {
    if (nextToken == StreamTokenizer.TT_NUMBER) {
        System.out.println("number " + tokenizer.nval + " on line " + tokenizer.lineno());
    }
    nextToken = tokenizer.nextToken();
}
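The effect is easier to see on a single line than on the whole file. The following sketch, which assumes a hypothetical helper class DotAsNumber, collects the number tokens from a short string, once with the default syntax table and once with '.' made ordinary.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original article.
class DotAsNumber {

    /** Returns the values of all TT_NUMBER tokens found in text. */
    static List<Double> numbers(String text, boolean dotIsOrdinary) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        if (dotIsOrdinary) {
            tokenizer.ordinaryChar('.'); // '.' is returned as a plain character token
        }
        List<Double> found = new ArrayList<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_NUMBER) {
                    found.add(tokenizer.nval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "}.start(); Thread.sleep(50);";
        System.out.println(numbers(line, false)); // prints [0.0, 50.0]
        System.out.println(numbers(line, true));  // prints [50.0]
    }
}
```

Only the dot that begins a token is affected. In Thread.sleep the dot sits inside a word under the default table, so it never reaches the number parser.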
2. Removing comments
The sample source file contains some comments. This section shows how to strip them out:
// reader is a java.io.Reader opened on the sample source file
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.resetSyntax(); // reset all chars to ordinary chars
tokenizer.slashSlashComments(true); // recognize // comments
tokenizer.slashStarComments(true); // recognize /* */ comments
// treat every char as part of a word, so whitespace and line breaks are preserved
tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
// make '/' a comment char so it is no longer a word char;
// note that a lone '/' (e.g. a division operator) then also starts a comment
tokenizer.commentChar('/');
// the quote char must be set, otherwise a / inside a string would be treated as a comment
tokenizer.quoteChar('"');
int token = tokenizer.nextToken();
while (token != StreamTokenizer.TT_EOF) { // continue until the end of the file
    if (token == '"') { // for a quoted string, print " + content + "
        System.out.print((char) token);
    }
    if (tokenizer.sval != null) { // print the token content if there is any
        System.out.print(tokenizer.sval);
    }
    if (token == '"') {
        System.out.print((char) token);
    }
    token = tokenizer.nextToken();
}
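Note: quoted strings must be recognized as well; otherwise the //hello" part of "//hello" would be treated as a comment and dropped.
Checking the printed output shows that all comments have been removed and only the body of the source code remains.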
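The same configuration can be wrapped in a method that returns the stripped text instead of printing it, which makes it easy to try on small inputs. This is only a sketch: the class name CommentStripper and the method strip are invented for illustration, and it inherits the limitations of the approach above (a lone '/' is treated as a comment start, and escape sequences inside strings are decoded by the tokenizer rather than copied verbatim).

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper class, not part of the original article.
class CommentStripper {

    /** Returns source with // and slash-star comments removed. */
    static String strip(String source) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(source));
        tokenizer.resetSyntax();
        tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
        tokenizer.commentChar('/');         // take '/' out of the word characters
        tokenizer.quoteChar('"');           // keep comment markers inside strings intact
        tokenizer.slashSlashComments(true);
        tokenizer.slashStarComments(true);

        StringBuilder out = new StringBuilder();
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == '"') {
                    // sval holds the string body without the surrounding quotes
                    out.append('"').append(tokenizer.sval).append('"');
                } else if (tokenizer.sval != null) {
                    out.append(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "int a = 1; // one\nString s = \"//x\"; /* two */\n";
        System.out.print(strip(source));
    }
}
```

In the sample input, the two comments disappear while the string literal "//x" is kept as is.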
3. Extracting strings
This is similar to the previous section, except that only the content of string tokens is printed:
// reader is a java.io.Reader opened on the sample source file
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.resetSyntax(); // reset all chars to ordinary chars
tokenizer.slashSlashComments(true); // recognize // comments
tokenizer.slashStarComments(true); // recognize /* */ comments
// treat every char as part of a word
tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
tokenizer.commentChar('/'); // set the comment char
tokenizer.quoteChar('"'); // set the quote char
int token = tokenizer.nextToken();
while (token != StreamTokenizer.TT_EOF) { // continue until the end of the file
    if (token == '"') { // a quoted string: sval holds its content
        System.out.println(tokenizer.sval);
    }
    token = tokenizer.nextToken();
}
Note that comment recognition must stay enabled here; otherwise strings that appear inside comments would be extracted too, for example: // "hello world"
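To see why comment recognition matters, the sketch below runs the extraction twice on the same input, with and without comment handling. The class StringExtractor and its skipComments flag are hypothetical names introduced only for this comparison.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original article.
class StringExtractor {

    /** Returns the contents of all double-quoted strings in source. */
    static List<String> extract(String source, boolean skipComments) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(source));
        tokenizer.resetSyntax();
        tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
        if (skipComments) {
            tokenizer.commentChar('/');
            tokenizer.slashSlashComments(true);
            tokenizer.slashStarComments(true);
        }
        tokenizer.quoteChar('"');

        List<String> strings = new ArrayList<>();
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                if (token == '"') {
                    strings.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return strings;
    }

    public static void main(String[] args) {
        String source = "String a = \"x\"; // \"hidden\"\nprintln(\"y\" + a);";
        System.out.println(extract(source, true));  // prints [x, y]
        System.out.println(extract(source, false)); // prints [x, hidden, y]
    }
}
```

Without comment handling, the '/' characters are just part of a word, so the quoted text inside the comment is picked up like any other string.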
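4. Removing blank lines
Removing blank lines reduces the size of a file, which is useful, for example, before sending it over a network. The following snippet removes blank lines: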
// reader is a java.io.Reader opened on the sample source file
StreamTokenizer tokenizer = new StreamTokenizer(reader);
tokenizer.resetSyntax(); // clear the syntax table before continuing
// assume all characters are word characters
tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
// report line ends as tokens; this requires \r and \n to be whitespace
tokenizer.eolIsSignificant(true);
tokenizer.whitespaceChars('\r', '\r');
tokenizer.whitespaceChars('\n', '\n');
int nextToken = tokenizer.nextToken();
while (nextToken != StreamTokenizer.TT_EOF) {
    // print the content between line ends if it is not blank
    if (nextToken != StreamTokenizer.TT_EOL) {
        if (tokenizer.sval != null && !"".equals(tokenizer.sval.trim())) {
            System.out.println(tokenizer.sval);
        }
    }
    nextToken = tokenizer.nextToken();
}
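The code first resets the syntax table and then declares every value in the Character range to be part of a word. Next it makes the end of line a token of its own; for this to take effect, \r and \n must be registered as whitespace characters. During the scan, every token that is not an end of line is skipped if it is blank (empty or whitespace only) and printed otherwise.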
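The steps just described can be packaged as a small method that works on a string, which also shows that both \n and \r\n line endings are handled. The class name BlankLineRemover is an assumption for this sketch, and joining the kept lines with \n is a choice made here, not something the original snippet does.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original article.
class BlankLineRemover {

    /** Returns text without empty or whitespace-only lines, joined with \n. */
    static String removeBlankLines(String text) {
        StreamTokenizer tokenizer = new StreamTokenizer(new StringReader(text));
        tokenizer.resetSyntax();
        tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
        tokenizer.eolIsSignificant(true);
        tokenizer.whitespaceChars('\r', '\r'); // line terminators must be whitespace
        tokenizer.whitespaceChars('\n', '\n'); // so that TT_EOL is reported

        List<String> lines = new ArrayList<>();
        try {
            int token;
            while ((token = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                // each word token is the full content of one line
                if (token != StreamTokenizer.TT_EOL
                        && tokenizer.sval != null
                        && !tokenizer.sval.trim().isEmpty()) {
                    lines.add(tokenizer.sval);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        System.out.println(removeBlankLines("a\n\n  \nb\r\n\r\nc"));
    }
}
```

Because spaces and tabs are word characters here, the indentation of the kept lines is left untouched.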