CharTokenizer is an abstract class designed primarily for tokenizing Western text. In English, words are separated by spaces and punctuation, so tokenization simply splits the input at those separator characters.
```java
package org.apache.lucene.analysis;

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.AttributeSource;

// CharTokenizer is abstract, so it is only ever instantiated through a
// concrete subclass, e.g. CharTokenizer token5 = new LetterTokenizer(input);
public abstract class CharTokenizer extends Tokenizer {
  public CharTokenizer(Reader input) {
    super(input);
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(TermAttribute.class);
  }

  public CharTokenizer(AttributeSource source, Reader input) {
    super(source, input);
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(TermAttribute.class);
  }

  public CharTokenizer(AttributeFactory factory, Reader input) {
    super(factory, input);
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(TermAttribute.class);
  }

  private int offset = 0, bufferIndex = 0, dataLen = 0;
  private static final int MAX_WORD_LEN = 255;
  private static final int IO_BUFFER_SIZE = 4096;
  private final char[] ioBuffer = new char[IO_BUFFER_SIZE];

  private TermAttribute termAtt;
  private OffsetAttribute offsetAtt;

  // Decides whether a character belongs to a token; implemented by subclasses.
  protected abstract boolean isTokenChar(char c);

  // Per-character normalization (e.g. lowercasing); subclasses may override.
  protected char normalize(char c) {
    return c;
  }

  @Override
  public final boolean incrementToken() throws IOException {
    clearAttributes();
    int length = 0;
    int start = bufferIndex;
    char[] buffer = termAtt.termBuffer();
    while (true) {
      if (bufferIndex >= dataLen) {
        offset += dataLen;
        dataLen = input.read(ioBuffer);
        if (dataLen == -1) {
          dataLen = 0; // so next offset += dataLen won't decrement offset
          if (length > 0)
            break;
          else
            return false;
        }
        bufferIndex = 0;
      }
      final char c = ioBuffer[bufferIndex++];
      if (isTokenChar(c)) { // if it's a token char
        if (length == 0) // start of token
          start = offset + bufferIndex - 1;
        else if (length == buffer.length)
          buffer = termAtt.resizeTermBuffer(1 + length);
        buffer[length++] = normalize(c); // buffer it, normalized
        if (length == MAX_WORD_LEN) // buffer overflow!
          break;
      } else if (length > 0) // at non-Letter w/ chars
        break; // return 'em
    }
    termAtt.setTermLength(length);
    offsetAtt.setOffset(correctOffset(start), correctOffset(start + length));
    return true;
  }

  @Override
  public final void end() {
    // set final offset
    int finalOffset = correctOffset(offset);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }

  @Override
  public void reset(Reader input) throws IOException {
    super.reset(input);
    bufferIndex = 0;
    offset = 0;
    dataLen = 0;
  }
}
```
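To make the `incrementToken()` buffering loop easier to follow, here is a standalone sketch of the same algorithm with no Lucene dependency. The class name `SimpleCharTokenizer` and the `String`-returning `next()` method are made up for illustration; the real class writes into the reusable `TermAttribute` buffer instead of allocating strings.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone re-implementation of CharTokenizer's buffering loop.
public class SimpleCharTokenizer {
    private static final int MAX_WORD_LEN = 255;   // same limit as CharTokenizer
    private static final int IO_BUFFER_SIZE = 4096;

    private final Reader input;
    private final char[] ioBuffer = new char[IO_BUFFER_SIZE];
    private int offset = 0, bufferIndex = 0, dataLen = 0;

    public SimpleCharTokenizer(Reader input) { this.input = input; }

    // Corresponds to CharTokenizer.isTokenChar; letters only, like LetterTokenizer.
    protected boolean isTokenChar(char c) { return Character.isLetter(c); }

    // Corresponds to CharTokenizer.normalize; identity by default.
    protected char normalize(char c) { return c; }

    // Returns the next term, or null when the input is exhausted.
    public String next() throws IOException {
        StringBuilder term = new StringBuilder();
        while (true) {
            if (bufferIndex >= dataLen) {          // refill the I/O buffer
                offset += dataLen;
                dataLen = input.read(ioBuffer);
                if (dataLen == -1) {               // end of input
                    dataLen = 0;
                    return term.length() > 0 ? term.toString() : null;
                }
                bufferIndex = 0;
            }
            char c = ioBuffer[bufferIndex++];
            if (isTokenChar(c)) {
                term.append(normalize(c));
                if (term.length() == MAX_WORD_LEN) // max length reached: emit as-is
                    return term.toString();
            } else if (term.length() > 0) {        // separator ends the current term
                return term.toString();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        SimpleCharTokenizer t = new SimpleCharTokenizer(
                new StringReader("what are you doing,man?"));
        List<String> terms = new ArrayList<>();
        for (String s; (s = t.next()) != null; ) terms.add(s);
        System.out.println(terms);  // [what, are, you, doing, man]
    }
}
```

The key point the sketch preserves: separators are consumed but never emitted, and a term ends either at the first non-token character or at the 255-character cap.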
Among the concrete subclasses of CharTokenizer in the core package is LetterTokenizer.
Let's look at the LetterTokenizer class:
```java
package org.apache.lucene.analysis;

import java.io.Reader;

// Splits the input whenever a non-letter character is read.
public class LetterTokenizer extends CharTokenizer {
  public LetterTokenizer(Reader in) {
    super(in);
  }

  @Override
  protected boolean isTokenChar(char c) {
    return Character.isLetter(c);
  }
}
```
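Here `isTokenChar` is simply `Character.isLetter`, which is why an apostrophe acts as a separator. A quick check over the characters of "It's":

```java
public class LetterCheck {
    public static void main(String[] args) {
        for (char c : "It's".toCharArray()) {
            // Character.isLetter is the same test LetterTokenizer.isTokenChar uses
            System.out.println(c + " -> " + Character.isLetter(c));
        }
        // The apostrophe is not a letter, so "It's" splits into "It" and "s".
    }
}
```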
A quick test makes this concrete:
```java
package com.fpi.lucene.studying.test;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.LetterTokenizer;

public class JustTest {
  public static void main(String[] args) {
    Reader read = new StringReader("what are you doing,man?It's none of your business!");
    LetterTokenizer token5 = new LetterTokenizer(read);
    try {
      while (token5.incrementToken()) {
        // AttributeSource.toString() prints the token's attributes
        System.out.println(token5.toString());
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
```
The output is:
(startOffset=0,endOffset=4,term=what)
(startOffset=5,endOffset=8,term=are)
(startOffset=9,endOffset=12,term=you)
(startOffset=13,endOffset=18,term=doing)
(startOffset=19,endOffset=22,term=man)
(startOffset=23,endOffset=25,term=It)
(startOffset=26,endOffset=27,term=s)
(startOffset=28,endOffset=32,term=none)
(startOffset=33,endOffset=35,term=of)
(startOffset=36,endOffset=40,term=your)
(startOffset=41,endOffset=49,term=business)
See? Not only does the tokenizer split at the comma and the question mark, it even breaks It's into It and s.
Any maximal run of letters with no non-letter character in it becomes a single term. A term's length is capped at 255 characters, as defined in the CharTokenizer abstract class:

```java
private static final int MAX_WORD_LEN = 255;
```
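As a sanity check on the offsets listed above, the same terms and offsets can be recomputed with plain JDK code, assuming the same letter-run rule (the class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheck {
    // Emit "(startOffset=s,endOffset=e,term=t)" for each maximal run of letters.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length() && Character.isLetter(text.charAt(i));
            if (letter && start < 0) {
                start = i;                      // start of a letter run
            } else if (!letter && start >= 0) { // run ended: emit the term
                out.add("(startOffset=" + start + ",endOffset=" + i
                        + ",term=" + text.substring(start, i) + ")");
                start = -1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        tokenize("what are you doing,man?It's none of your business!")
                .forEach(System.out::println);
    }
}
```

Running this reproduces the eleven lines shown above, confirming that endOffset is the exclusive index just past the last letter of each term.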