Using StreamTokenizer in Java

As the Javadoc describes it: the StreamTokenizer class takes an input stream and parses it into "tokens", allowing the tokens to be read one at a time. The parsing process is controlled by a table and a number of flags that can be set to various states. The tokenizer can recognize identifiers, numbers, quoted strings, and various comment styles.


Simply put, it is a class that breaks a source file into individual tokens, each of which belongs to a category such as number, word, end of line, or end of file.
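
To make these token categories concrete, here is a minimal self-contained sketch (not part of the original article) that tokenizes a short string and prints the category of each token it sees:

		StreamTokenizer demo = new StreamTokenizer(new java.io.StringReader("int count = 42; // ignored comment\nString name;"));
		demo.eolIsSignificant(true); // also report end-of-line as a token of its own
		
		int t = demo.nextToken();
		while(t != StreamTokenizer.TT_EOF){ // TT_EOF marks the end of the input
			if(t == StreamTokenizer.TT_WORD){
				System.out.println("word:   " + demo.sval);
			} else if(t == StreamTokenizer.TT_NUMBER){
				System.out.println("number: " + demo.nval);
			} else if(t == StreamTokenizer.TT_EOL){
				System.out.println("end of line");
			} else { // an ordinary character such as '=' or ';'
				System.out.println("char:   " + (char)t);
			}
			t = demo.nextToken();
		}

Note that the comment is skipped entirely, because the default configuration already treats '/' as a comment character.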


This article uses the following source file for all of the demonstrations:


package com.iteye.liugang594.java.thread;

import java.util.concurrent.CountDownLatch;

public class TestCountDownLatch {

	private int number = 30;
	private long seconds = 40000L;
	
	/**
	 * @param args
	 * @throws InterruptedException 
	 */
	public static void main(String[] args) throws InterruptedException {
		
		//create countdown
		final CountDownLatch countDownLatch = new CountDownLatch(10);
		
		for(int i = 0;i< 10;i++){
			new Thread("Thread "+i){
				public void run() {
					try {
						//wait
						countDownLatch.await();
					} catch (InterruptedException e) {
						e.printStackTrace();
					}
					System.out.println(getName()+" started");
				}
			}.start();
			Thread.sleep(50);
			
			//count down
			countDownLatch.countDown();
		}

	}

}
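
All of the snippets in the sections below read this demo file through a variable named reader. The article does not show how it is created; a minimal sketch, assuming the demo file has been saved in the working directory (the path is an assumption), could look like this:

		// assumed setup for the snippets below; the file path is an assumption
		java.io.Reader reader = new java.io.BufferedReader(
				new java.io.FileReader("TestCountDownLatch.java"));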

1. Extracting numbers

First, let's see how to extract the numbers from the content above; they are: 30, 40000L, 10, 0, 10 and 50.

		StreamTokenizer tokenizer = new StreamTokenizer(reader);
		tokenizer.parseNumbers(); // digits, '.' and '-' make up numbers (this is already the default)
		
		int nextToken = tokenizer.nextToken();
		while(nextToken != StreamTokenizer.TT_EOF){ // loop until end of file
			if(nextToken == StreamTokenizer.TT_NUMBER){
				// nval holds the numeric value, lineno() the line it was found on
				System.out.println("number "+ tokenizer.nval+" on line "+tokenizer.lineno());
			}
			nextToken = tokenizer.nextToken();
		}

The output is:

number 30.0 on line 7
number 40000.0 on line 8
number 10.0 on line 17
number 0.0 on line 19
number 10.0 on line 19
number 0.0 on line 30
number 50.0 on line 31

There is one problem: there is actually no number on line 30, yet 0.0 is printed for it. I think this is a bug in StreamTokenizer: it parses the dot that follows the closing brace (the '.' in }.start();) as a number. You can either delete that dot, which gives the correct result, or treat the dot as an ordinary character so that it is returned as a single-character token instead of being parsed as 0.0 (after this, decimal points inside numbers are no longer recognized either, which does not matter for this demo). For example:

		StreamTokenizer tokenizer = new StreamTokenizer(reader);
		tokenizer.parseNumbers();
		tokenizer.ordinaryChar('.'); // '.' becomes an ordinary single-character token, never part of a number
		
		int nextToken = tokenizer.nextToken();
		while(nextToken != StreamTokenizer.TT_EOF){
			if(nextToken == StreamTokenizer.TT_NUMBER){
				System.out.println("number "+ tokenizer.nval+" on line "+tokenizer.lineno());
			}
			nextToken = tokenizer.nextToken();
		}


2. Removing comments

The demo source file contains a few comments; this section shows how to strip them out:

		StreamTokenizer tokenizer = new StreamTokenizer(reader);
		tokenizer.resetSyntax(); // reset all chars to ordinary chars
		tokenizer.slashSlashComments(true); // recognize // comments
		tokenizer.slashStarComments(true); // recognize /* */ comments
		tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE); // every char is considered part of a word
		tokenizer.commentChar('/'); // set the comment char
		tokenizer.quoteChar('"'); // set the quote char; this must be set, otherwise a / inside a string literal would be taken as the start of a comment

		int token = tokenizer.nextToken();
		while (token != StreamTokenizer.TT_EOF) { // continue if not the end of file
			if(token == '"'){  // re-emit the opening quote of a string literal
				System.out.print((char)token);
			}
			if(tokenizer.sval != null){		// print the token text, if any (word or string contents)
				System.out.print(tokenizer.sval);
			}
			if(token == '"'){  // re-emit the closing quote
				System.out.print((char)token);
			}
			token = tokenizer.nextToken();
		}

Note: the quote character must be configured as well; otherwise, in a string literal such as "//hello", the // inside it would be taken as the start of a comment and the rest of the line would be discarded.

Looking at the printed output, all comments have been removed and only the body of the source code remains.


3. Extracting string literals

Similar to the previous section, except this time we only print the contents of the quoted-string tokens:

		StreamTokenizer tokenizer = new StreamTokenizer(reader);
		tokenizer.resetSyntax(); // reset all chars as ordinary char
		tokenizer.slashSlashComments(true); // recognize //
		tokenizer.slashStarComments(true); // recognize /**/
		tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE); // all chars will be considered as a part of word
		tokenizer.commentChar('/'); //set the comment char
		tokenizer.quoteChar('"');

		int token = tokenizer.nextToken();
		while (token != StreamTokenizer.TT_EOF) { // continue if not the end of file
			if(token == '"'){ // a quoted-string token: sval holds its text without the quotes
				System.out.println(tokenizer.sval);
			}
			token = tokenizer.nextToken();
		}

Note that comment handling still has to be part of the tokenizing here; otherwise string literals that appear inside comments would also be extracted, for example: // "hello world"


4. Removing blank lines

Removing blank lines shrinks the file, which is useful, for example, before sending it over the network. The following snippet removes the blank lines:

		StreamTokenizer tokenizer = new StreamTokenizer(reader);
		tokenizer.resetSyntax(); // remove all special meanings before configuring
		// treat every character as part of a word
		tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
		// report end-of-line as its own token; for this to work, \r and \n must be whitespace chars
		tokenizer.eolIsSignificant(true);
		tokenizer.whitespaceChars('\r', '\r');
		tokenizer.whitespaceChars('\n', '\n');
		
		int nextToken = tokenizer.nextToken();
		while(nextToken != StreamTokenizer.TT_EOF){
			// print the line's content only if it is not blank
			if(nextToken != StreamTokenizer.TT_EOL){
				if(tokenizer.sval != null && !"".equals(tokenizer.sval.trim())){
					System.out.println(tokenizer.sval);
				}
			}
			
			nextToken = tokenizer.nextToken();
		}

First all syntax settings are reset, and every character in the Character range is declared to be part of a word. End-of-line is then made a token of its own, which requires \r and \n to be declared whitespace characters so that EOL reporting takes effect. During the scan, every non-EOL token whose content is empty (or all whitespace) is skipped; everything else is printed on its own line.
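
Since the point of stripping blank lines is to make the file smaller, for instance before sending it over the network, the same loop can write the result into a new file instead of printing to the console. A sketch of that variation, with both file names being assumptions:

		// variation of the snippet above: write non-blank lines to a new file (file names are assumptions)
		try (java.io.Reader in = new java.io.BufferedReader(new java.io.FileReader("TestCountDownLatch.java"));
				java.io.PrintWriter out = new java.io.PrintWriter(new java.io.FileWriter("TestCountDownLatch.stripped.java"))) {
			StreamTokenizer tokenizer = new StreamTokenizer(in);
			tokenizer.resetSyntax();
			tokenizer.wordChars(Character.MIN_VALUE, Character.MAX_VALUE);
			tokenizer.eolIsSignificant(true);
			tokenizer.whitespaceChars('\r', '\r');
			tokenizer.whitespaceChars('\n', '\n');
			
			int nextToken = tokenizer.nextToken();
			while(nextToken != StreamTokenizer.TT_EOF){
				if(nextToken != StreamTokenizer.TT_EOL
						&& tokenizer.sval != null && !"".equals(tokenizer.sval.trim())){
					out.println(tokenizer.sval); // keep only the non-blank lines
				}
				nextToken = tokenizer.nextToken();
			}
		}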
