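The BreakIterator class in java.text locates boundaries in text. It defines a protocol for objects that split natural-language text according to a set of rules. Instances or subclasses of BreakIterator can be provided, for example, to break a piece of text into words, sentences, or logical characters according to the conventions of a particular language or group of languages. There are four built-in types of BreakIterator, for character, word, line, and sentence boundaries; they are returned by getCharacterInstance(), getWordInstance(), getLineInstance(), and getSentenceInstance().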
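The four built-in iterator types can be compared side by side with a small sketch. It is written for a plain JVM rather than Android, and the class name BuiltInBreakIterators, the helper split, and the sample sentence are assumptions made for illustration; exact boundaries may differ between runtimes and locales.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BuiltInBreakIterators {
    // Splits text into the segments delimited by consecutive boundaries
    // reported by the given iterator.
    static List<String> split(BreakIterator it, String text) {
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye now.";
        // One factory method per built-in type.
        System.out.println(split(BreakIterator.getCharacterInstance(Locale.US), text));
        System.out.println(split(BreakIterator.getWordInstance(Locale.US), text));
        System.out.println(split(BreakIterator.getLineInstance(Locale.US), text));
        System.out.println(split(BreakIterator.getSentenceInstance(Locale.US), text));
    }
}
```

Passing an explicit Locale keeps the result independent of the default locale of the machine running the sketch.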
BreakIterator accesses the text it analyzes through a CharacterIterator, so it can be used to analyze text held in any storage mechanism that provides a CharacterIterator interface.
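As a minimal sketch of the CharacterIterator route, the example below hands the text to the iterator as a StringCharacterIterator instead of a String. The class name CharacterIteratorDemo and the helper countSegments are hypothetical; any other CharacterIterator implementation could be passed in the same way.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;
import java.util.Locale;

class CharacterIteratorDemo {
    // Counts the segments between word boundaries in text supplied
    // through the CharacterIterator interface.
    static int countSegments(CharacterIterator text) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText(text); // setText(CharacterIterator) overload
        int count = 0;
        words.first();
        while (words.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "one", " " and "two" are three segments.
        System.out.println(countSegments(new StringCharacterIterator("one two")));
    }
}
```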
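Note: creating some types of BreakIterator can take a relatively long time, and the system currently does not cache BreakIterator instances. For best performance, keep the instance around for as long as it will be needed. For example, when line-wrapping a document, do not create and destroy a new BreakIterator for each line. Instead, create one BreakIterator for the whole document (or whatever stretch of text you are wrapping) and use it for the entire job.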
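The reuse advice can be sketched as follows: one line-break iterator is created up front and re-pointed at each paragraph with setText. The class name ReuseBreakIterator and the helper lineSegmentsPerParagraph are assumptions made for illustration, and counting segments stands in for real line-wrapping work.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ReuseBreakIterator {
    // Counts line-break segments in every paragraph using a single
    // BreakIterator instance for the whole job.
    static List<Integer> lineSegmentsPerParagraph(List<String> paragraphs) {
        BreakIterator lines = BreakIterator.getLineInstance(Locale.US); // created once
        List<Integer> counts = new ArrayList<>();
        for (String paragraph : paragraphs) {
            lines.setText(paragraph); // reuse: only the text changes
            int n = 0;
            lines.first();
            while (lines.next() != BreakIterator.DONE) {
                n++;
            }
            counts.add(n);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(lineSegmentsPerParagraph(Arrays.asList("a b c", "hello world")));
    }
}
```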
Here is some sample code.
Creating and using text boundaries:
    import android.app.Activity;
    import android.os.Bundle;
    import android.util.Log;
    import android.widget.TextView;

    import java.text.BreakIterator;
    import java.util.Locale;

    public class QRCodeActivity extends Activity {
        private static final String TAG = "QRCodeActivity";

        private TextView mQRBitmapTextView;

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_qrcode);

            mQRBitmapTextView = (TextView) findViewById(R.id.qr_bitmapview);

            String text = getString(R.string.text_examined);

            // print each word in order
            BreakIterator boundary = BreakIterator.getWordInstance();
            boundary.setText(text);
            Log.w(TAG, "print each word forward");
            printEachForward(boundary, text);

            // print each sentence in reverse order
            boundary = BreakIterator.getSentenceInstance(Locale.US);
            boundary.setText(text);
            Log.w(TAG, "print each sentence backward");
            printEachBackward(boundary, text);

            Log.w(TAG, "print first sentence");
            printFirst(boundary, text);
            Log.w(TAG, "print last sentence");
            printLast(boundary, text);

            // print each line-break segment in order
            boundary = BreakIterator.getLineInstance();
            boundary.setText(text);
            Log.w(TAG, "print each line forward");
            printEachForward(boundary, text);
        }

        // The helper methods listed below belong inside this class.
    }

Method that prints each element in order:
    public static void printEachForward(BreakIterator boundary, String source) {
        int start = boundary.first();
        // Walk forward; each (start, end) pair delimits one element.
        for (int end = boundary.next();
                end != BreakIterator.DONE;
                start = end, end = boundary.next()) {
            Log.v(TAG, source.substring(start, end));
        }
    }

Method that prints each element in reverse order:
    public static void printEachBackward(BreakIterator boundary, String source) {
        int end = boundary.last();
        // Walk backward from the last boundary to the first.
        for (int start = boundary.previous();
                start != BreakIterator.DONE;
                end = start, start = boundary.previous()) {
            Log.v(TAG, source.substring(start, end));
        }
    }

Method that prints the first element:
    public static void printFirst(BreakIterator boundary, String source) {
        int start = boundary.first();
        int end = boundary.next();
        Log.v(TAG, source.substring(start, end));
    }

Method that prints the last element:
    public static void printLast(BreakIterator boundary, String source) {
        int end = boundary.last();
        int start = boundary.previous();
        Log.v(TAG, source.substring(start, end));
    }

Method that prints the element at a given position:
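    public static void printAt(BreakIterator boundary, int pos, String source) {
        // following(pos) returns the first boundary after pos;
        // previous() then steps back to the boundary before it.
        int end = boundary.following(pos);
        int start = boundary.previous();
        Log.v(TAG, source.substring(start, end));
    }

Method that finds the start of the next word: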
    public static int nextWordStartAfter(int pos, String text) {
        BreakIterator wb = BreakIterator.getWordInstance();
        wb.setText(text);
        int last = wb.following(pos);
        int current = wb.next();
        while (current != BreakIterator.DONE) {
            // A segment that contains at least one letter is treated as a word.
            for (int p = last; p < current; ++p) {
                if (Character.isLetter(text.charAt(p))) {
                    return last;
                }
            }
            last = current;
            current = wb.next();
        }
        return BreakIterator.DONE;
    }
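The iterator returned by BreakIterator.getWordInstance() is unique in that the break positions it returns do not represent both the start and the end of the thing being iterated over. A sentence-break iterator, by contrast, returns breaks that each mark the end of one sentence and the beginning of the next. With the word-break iterator, the characters between two boundaries might be a word, or they might be punctuation or the whitespace between two words. The code above uses a simple heuristic to decide which boundary is the start of a word: if the characters between this boundary and the next include at least one letter (an alphabetic letter, a CJK ideograph, a Hangul syllable, a Kana character, and so on), the text between the two boundaries is a word; otherwise it is the material between words.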
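The same letter-based heuristic can also be used to keep only the real words from a word-break pass. The following is a minimal sketch for a plain JVM; the class name WordExtractor and the method words are hypothetical, and it assumes Java 8 or later for String.codePoints().

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordExtractor {
    // Returns only the segments that contain at least one letter,
    // dropping punctuation, whitespace and digit-only segments.
    static List<String> words(String text) {
        BreakIterator wb = BreakIterator.getWordInstance(Locale.US);
        wb.setText(text);
        List<String> result = new ArrayList<>();
        int start = wb.first();
        for (int end = wb.next(); end != BreakIterator.DONE; start = end, end = wb.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetter)) {
                result.add(segment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! 42")); // prints [Hello, world]
    }
}
```

Checking code points rather than chars means letters outside the Basic Multilingual Plane are also recognized.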
The text_examined string resource used by the sample is:
<string name="text_examined">ก้ำเปลี่ยนภาษาเมนูใน某些类型.BreakIterator的创, your computer must be set up for Thai</string>
After running the code, the output looks like this:
    05-04 09:54:48.239: W/QRCodeActivity(3360): print each word forward
    05-04 09:54:48.246: V/QRCodeActivity(3360): ก้ำ
    05-04 09:54:48.246: V/QRCodeActivity(3360): เปลี่ยน
    05-04 09:54:48.246: V/QRCodeActivity(3360): ภาษา
    05-04 09:54:48.246: V/QRCodeActivity(3360): เมนู
    05-04 09:54:48.246: V/QRCodeActivity(3360): ใน
    05-04 09:54:48.246: V/QRCodeActivity(3360): 某
    05-04 09:54:48.246: V/QRCodeActivity(3360): 些
    05-04 09:54:48.246: V/QRCodeActivity(3360): 类
    05-04 09:54:48.246: V/QRCodeActivity(3360): 型
    05-04 09:54:48.246: V/QRCodeActivity(3360): .
    05-04 09:54:48.246: V/QRCodeActivity(3360): BreakIterator
    05-04 09:54:48.246: V/QRCodeActivity(3360): 的
    05-04 09:54:48.246: V/QRCodeActivity(3360): 创
    05-04 09:54:48.246: V/QRCodeActivity(3360): ,
    05-04 09:54:48.246: V/QRCodeActivity(3360):
    05-04 09:54:48.246: V/QRCodeActivity(3360): your
    05-04 09:54:48.246: V/QRCodeActivity(3360):
    05-04 09:54:48.246: V/QRCodeActivity(3360): computer
    05-04 09:54:48.246: V/QRCodeActivity(3360):
    05-04 09:54:48.246: V/QRCodeActivity(3360): must
    05-04 09:54:48.246: V/QRCodeActivity(3360):
    05-04 09:54:48.254: V/QRCodeActivity(3360): be
    05-04 09:54:48.254: V/QRCodeActivity(3360):
    05-04 09:54:48.254: V/QRCodeActivity(3360): set
    05-04 09:54:48.254: V/QRCodeActivity(3360):
    05-04 09:54:48.254: V/QRCodeActivity(3360): up
    05-04 09:54:48.254: V/QRCodeActivity(3360):
    05-04 09:54:48.254: V/QRCodeActivity(3360): for
    05-04 09:54:48.254: V/QRCodeActivity(3360):
    05-04 09:54:48.254: V/QRCodeActivity(3360): Thai
    05-04 09:54:48.254: W/QRCodeActivity(3360): print each sentence backward
    05-04 09:54:48.254: V/QRCodeActivity(3360): BreakIterator的创, your computer must be set up for Thai
    05-04 09:54:48.254: V/QRCodeActivity(3360): ก้ำเปลี่ยนภาษาเมนูใน某些类型.
    05-04 09:54:48.254: W/QRCodeActivity(3360): print first sentence
    05-04 09:54:48.254: V/QRCodeActivity(3360): ก้ำเปลี่ยนภาษาเมนูใน某些类型.
    05-04 09:54:48.254: W/QRCodeActivity(3360): print last sentence
    05-04 09:54:48.254: V/QRCodeActivity(3360): BreakIterator的创, your computer must be set up for Thai
    05-04 09:54:48.262: W/QRCodeActivity(3360): print each line forward
    05-04 09:54:48.262: V/QRCodeActivity(3360): ก้ำ
    05-04 09:54:48.262: V/QRCodeActivity(3360): เปลี่ยน
    05-04 09:54:48.262: V/QRCodeActivity(3360): ภาษา
    05-04 09:54:48.262: V/QRCodeActivity(3360): เมนู
    05-04 09:54:48.262: V/QRCodeActivity(3360): ใน
    05-04 09:54:48.262: V/QRCodeActivity(3360): 某
    05-04 09:54:48.262: V/QRCodeActivity(3360): 些
    05-04 09:54:48.262: V/QRCodeActivity(3360): 类
    05-04 09:54:48.270: V/QRCodeActivity(3360): 型.BreakIterator
    05-04 09:54:48.270: V/QRCodeActivity(3360): 的
    05-04 09:54:48.270: V/QRCodeActivity(3360): 创,
    05-04 09:54:48.270: V/QRCodeActivity(3360): your
    05-04 09:54:48.270: V/QRCodeActivity(3360): computer
    05-04 09:54:48.270: V/QRCodeActivity(3360): must
    05-04 09:54:48.270: V/QRCodeActivity(3360): be
    05-04 09:54:48.270: V/QRCodeActivity(3360): set
    05-04 09:54:48.270: V/QRCodeActivity(3360): up
    05-04 09:54:48.270: V/QRCodeActivity(3360): for
    05-04 09:54:48.270: V/QRCodeActivity(3360): Thai
Done.