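The performance tests in the previous post surfaced some patterns in running time and memory usage. For an open-source library, though, the most efficient way to understand them is to read the source. In this post we dig into the ICU source code and take a brief look at how word segmentation is implemented.
Take the case from our example, obtaining a WordInstance for the current default Locale: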
java.text.BreakIterator.getWordInstance() ->
java.text.BreakIterator.getWordInstance(Locale.getDefault()) ->
java.text.IcuIteratorWrapper new IcuIteratorWrapper() ->
android.icu.text.BreakIterator.getWordInstance(ULocale where) ->
android.icu.text.BreakIterator.getBreakInstance(where, KIND_WORD)
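For reference, here is a minimal sketch of the call that sits at the top of this chain. It uses only the standard java.text.BreakIterator API; the class name WordBoundaryDemo and the sample text are made up for illustration. On a desktop JDK the backing implementation is not ICU, so only the entry point matches the chain above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only java.text.BreakIterator comes from the platform.
class WordBoundaryDemo {
    // Collects every boundary offset the word iterator reports for the text.
    static List<Integer> boundaries(String text) {
        BreakIterator it = BreakIterator.getWordInstance(); // uses the default Locale
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("Hello world"));
    }
}
```

Each pair of adjacent offsets delimits one segment, which may be a word, a space, or punctuation.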
OK, this brings us to the first piece of key logic:
android.icu.text.BreakIterator
/**
* Returns a particular kind of BreakIterator for a locale.
* Avoids writing a switch statement with getXYZInstance(where) calls.
* @internal
* @deprecated This API is ICU internal only.
*/
@Deprecated
public static BreakIterator getBreakInstance(ULocale where, int kind) {
if (where == null) {
throw new NullPointerException("Specified locale is null");
}
if (iterCache[kind] != null) {
BreakIteratorCache cache = (BreakIteratorCache)iterCache[kind].get();
if (cache != null) {
if (cache.getLocale().equals(where)) {
return cache.createBreakInstance();
}
}
}
// sigh, all to avoid linking in ICULocaleData...
BreakIterator result = getShim().createBreakIterator(where, kind);
BreakIteratorCache cache = new BreakIteratorCache(where, result);
iterCache[kind] = new SoftReference<BreakIteratorCache>(cache);
if (result instanceof RuleBasedBreakIterator) {
RuleBasedBreakIterator rbbi = (RuleBasedBreakIterator)result;
rbbi.setBreakType(kind);
}
return result;
}
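A few things stand out in this initialization logic:
(1) A soft-reference array, iterCache, indexed by the KIND_* constants, serves as the cache: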
/**
* {@icu}
* @stable ICU 2.4
*/
public static final int KIND_CHARACTER = 0;
/**
* {@icu}
* @stable ICU 2.4
*/
public static final int KIND_WORD = 1;
/**
* {@icu}
* @stable ICU 2.4
*/
public static final int KIND_LINE = 2;
/**
* {@icu}
* @stable ICU 2.4
*/
public static final int KIND_SENTENCE = 3;
/**
* {@icu}
* @stable ICU 2.4
*/
public static final int KIND_TITLE = 4;
/**
* @since ICU 2.8
*/
private static final int KIND_COUNT = 5;
private static final SoftReference<?>[] iterCache = new SoftReference<?>[5];
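As we know, a softly referenced object stays usable while the VM heap has room to spare; when memory runs short, the GC may reclaim it, and all soft references are guaranteed to be cleared before an OutOfMemoryError is thrown.
(2) The cached objects are instances of BreakIteratorCache, a private static nested class of BreakIterator: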
private static final class BreakIteratorCache {
private BreakIterator iter;
private ULocale where;
BreakIteratorCache(ULocale where, BreakIterator iter) {
this.where = where;
this.iter = (BreakIterator) iter.clone();
}
ULocale getLocale() {
return where;
}
BreakIterator createBreakInstance() {
return (BreakIterator) iter.clone();
}
}
(3) Cache miss. The following code creates a cache object and stores it:
BreakIterator result = getShim().createBreakIterator(where, kind);
BreakIteratorCache cache = new BreakIteratorCache(where, result);
iterCache[kind] = new SoftReference<BreakIteratorCache>(cache);
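As the cache class source shows, the iter field in the cache holds a clone of the BreakIterator produced by getShim().createBreakIterator(), while the BreakIterator produced by that call is itself returned directly to the caller of the initialization method.
(4) Cache hit. If both kind and Locale match, the cached BreakIteratorCache object is used: BreakIteratorCache.createBreakInstance(), which is simply (BreakIterator) iter.clone(), clones a BreakIterator and returns it.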
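The four points above can be condensed into a small sketch. This is not the ICU code: DemoIterator, SoftCloneCache, and the misses counter are hypothetical names introduced here, and a Supplier stands in for getShim().createBreakIterator(). What it tries to show is the shape of the cache: one softly referenced slot per kind, a private clone stored on a miss, and a fresh clone handed out on a hit.

```java
import java.lang.ref.SoftReference;
import java.util.Locale;
import java.util.function.Supplier;

// Hypothetical stand-in for BreakIterator: any cloneable prototype works.
class DemoIterator implements Cloneable {
    final String label;
    DemoIterator(String label) { this.label = label; }
    @Override public DemoIterator clone() {
        try {
            return (DemoIterator) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

// One softly referenced slot per kind; each slot remembers only the last locale.
class SoftCloneCache {
    private static final class Entry {
        final Locale locale;
        final DemoIterator prototype;
        Entry(Locale locale, DemoIterator created) {
            this.locale = locale;
            this.prototype = created.clone(); // private copy, as BreakIteratorCache does
        }
    }

    private final SoftReference<?>[] slots;
    int misses = 0; // counts how often the (expensive) factory ran

    SoftCloneCache(int kinds) { slots = new SoftReference<?>[kinds]; }

    DemoIterator get(Locale locale, int kind, Supplier<DemoIterator> factory) {
        SoftReference<?> ref = slots[kind];
        Entry entry = (ref == null) ? null : (Entry) ref.get();
        if (entry != null && entry.locale.equals(locale)) {
            return entry.prototype.clone(); // hit: hand out a fresh clone
        }
        misses++;
        DemoIterator created = factory.get(); // miss: build, then overwrite the slot
        slots[kind] = new SoftReference<>(new Entry(locale, created));
        return created;
    }
}
```

One consequence of this shape is that each kind remembers a single locale, so alternating between two locales for the same kind misses every time.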
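At this point the code raises quite a few questions: why is it written this way? Let's keep them in mind and read on. Next comes the part of initialization that does the real work: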
getShim().createBreakIterator(where, kind);
First, getShim():
private static BreakIteratorServiceShim shim;
private static BreakIteratorServiceShim getShim() {
// Note: this instantiation is safe on loose-memory-model configurations
// despite lack of synchronization, since the shim instance has no state--
// it's all in the class init. The worst problem is we might instantiate
// two shim instances, but they'll share the same state so that's ok.
if (shim == null) {
try {
Class<?> cls = Class.forName("com.ibm.icu.text.BreakIteratorFactory");
shim = (BreakIteratorServiceShim)cls.newInstance();
}
catch (MissingResourceException e)
{
throw e;
}
catch (Exception e) {
///CLOVER:OFF
if(DEBUG){
e.printStackTrace();
}
throw new RuntimeException(e.getMessage());
///CLOVER:ON
}
}
return shim;
}
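This is lazy initialization, with two notable traits: first, there is no synchronization to make it thread-safe; second, it uses reflection to look up the class BreakIteratorFactory and create the object.
The comment explains the missing thread safety: the initialization logic lives in class loading (in class initialization), the object itself is stateless, and all state is in the class, so even in the worst case, where the object is not a singleton, nothing breaks. In other words, thread safety is achieved indirectly by placing the key logic in class initialization, which the JVM runs only once per class under its own locking.
Why use reflection, when BreakIteratorFactory sits in the same package? Because the initialization logic lives in BreakIteratorFactory's class loading, and initialization has to be lazy, the class must not be loaded before the first call to getShim(). As we know, the JVM specification does not fix the moment at which a class is loaded; it only requires that loading be finished by the time the class is used (at which point class initialization is complete as well). A JVM is therefore free to choose when to load according to its own strategy, which means the class could be preloaded, and that would defeat lazy initialization. With reflection, the JVM has no advance knowledge that the class is needed, so the class is loaded only when getShim() runs. The IDE reporting BreakIteratorFactory as "never used" supports this reading. The comment in getBreakInstance() ("all to avoid linking in ICULocaleData") suggests that avoiding a static link-time dependency is also part of the motivation.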
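The lazy, reflection-based pattern can be tried out in isolation. The sketch below is an illustration under assumptions, not ICU code: LazyShimDemo, HypotheticalFactory, and initCount are invented names, and getDeclaredConstructor().newInstance() replaces the deprecated Class.newInstance(). It shows that a class named only by a string is initialized when Class.forName runs, and that its static initializer runs once no matter how often getShim() is called.

```java
// Hypothetical demo; none of these names come from ICU.
class LazyShimDemo {
    static int initCount = 0;     // incremented by HypotheticalFactory's static initializer
    private static Object shim;   // intentionally unsynchronized, as in getShim()

    static class HypotheticalFactory {
        static { initCount++; }   // stands in for "all state lives in class init"
        HypotheticalFactory() {}
    }

    static Object getShim() {
        if (shim == null) {
            try {
                // The class is named only by a string, so nothing here links to it statically.
                String name = LazyShimDemo.class.getName() + "$HypotheticalFactory";
                Class<?> cls = Class.forName(name); // loads and initializes the class
                shim = cls.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new RuntimeException(e);
            }
        }
        return shim;
    }

    public static void main(String[] args) {
        System.out.println("static initializer runs before getShim(): " + initCount);
        getShim();
        getShim();
        System.out.println("static initializer runs after two calls: " + initCount);
    }
}
```

If two threads raced through the null check, two factory objects could be created, but the static initializer would still run only once, which is the property the original comment relies on.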
Next, BreakIteratorFactory. What does its class initialization do?
(1) It initializes a constant array
/** KIND_NAMES are the resource key to be used to fetch the name of the
* pre-compiled break rules. The resource bundle name is "boundaries".
* The value for each key will be the rules to be used for the
* specified locale - "word" -> "word_th" for Thai, for example.
*/
private static final String[] KIND_NAMES = {
"grapheme", "word", "line", "sentence", "title"
};
(2) It initializes service
static final ICULocaleService service = new BFService();
The static member service is important. As discussed above, the core of initialization is:
getShim().createBreakIterator(where, kind);
And in BreakIteratorFactory, service is involved in creating instances:
public BreakIterator createBreakIterator(ULocale locale, int kind) {
// TODO: convert to ULocale when service switches over
if (service.isDefault()) {
return createBreakInstance(locale, kind);
}
ULocale[] actualLoc = new ULocale[1];
BreakIterator iter = (BreakIterator)service.get(locale, kind, actualLoc);
iter.setLocale(actualLoc[0], actualLoc[0]); // services make no distinction between actual & valid
return iter;
}
Now look at the private static nested class BFService:
private static class BFService extends ICULocaleService {
BFService() {
super("BreakIterator");
class RBBreakIteratorFactory extends ICUResourceBundleFactory {
protected Object handleCreate(ULocale loc, int kind, ICUService srvc) {
return createBreakInstance(loc, kind);
}
}
registerFactory(new RBBreakIteratorFactory());
markDefault();
}
/**
* createBreakInstance() returns an appropriate BreakIterator for any locale.
* It falls back to root if there is no specific data.
*
* Without this override, the service code would fall back to the default locale
* which is not desirable for an algorithm with a good Unicode default,
* like break iteration.
*/
@Override
public String validateFallbackLocale() {
return "";
}
}
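In its constructor, it registers an instance of the local class RBBreakIteratorFactory, which wraps BreakIteratorFactory.createBreakInstance(), in its own set of factories and marks the service as default. As long as no other factory is registered, BreakIteratorFactory.createBreakInstance() is used by default. See the BreakIteratorFactory.createBreakIterator() method shown earlier.
One more note: the remark in the comment quoted earlier, that the object itself is stateless and all state is in the class, most likely refers to service.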
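The interplay between markDefault(), isDefault(), and the direct call can be sketched with a toy registry. Everything here is hypothetical (MiniService, MiniFactory, the string results), and tracking "default" by remembering the factory count is an assumption about one plausible way to implement it, not a statement about ICUService internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical miniature of a factory registry with a "still default?" check.
class MiniService {
    private final List<Function<String, String>> factories = new ArrayList<>();
    private int defaultSize = -1;

    void registerFactory(Function<String, String> f) { factories.add(0, f); } // newest first
    void markDefault() { defaultSize = factories.size(); }
    boolean isDefault() { return factories.size() == defaultSize; }

    // Full lookup: the first factory that returns non-null wins.
    String get(String locale) {
        for (Function<String, String> f : factories) {
            String r = f.apply(locale);
            if (r != null) return r;
        }
        return null;
    }
}

class MiniFactory {
    static final MiniService service = new MiniService();
    static {
        // Mirrors the BFService constructor: register the built-in factory, then mark default.
        service.registerFactory(MiniFactory::createDirect);
        service.markDefault();
    }

    static String createDirect(String locale) { return "builtin:" + locale; }

    static String create(String locale) {
        if (service.isDefault()) {
            return createDirect(locale); // shortcut: nothing else was registered
        }
        return service.get(locale);      // otherwise go through the registry
    }
}
```

With this shape, the common case skips the registry lookup entirely, and registering any additional factory automatically switches create() over to the full lookup.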