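As the saying goes, Java is an open ecosystem: to implement a given feature, you can usually just download the matching jar and be done.
I used a toolkit called pinyin.jar at a company in Shenzhen a while back, so today I'll write a bit about the source code inside that jar.
Open it in a decompiler and you may be surprised: pinyin.jar contains only five simple classes. The format package holds four of them (three type classes and one format class), and the root package holds only the core PinyinHelper class, plus a mapping table file, unicode_to_hanyu_pinyin.txt, that maps Unicode code points to pinyin.
Enough preamble, let's get started!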
#########Mapping table file unicode_to_hanyu_pinyin.txt##############
This file stores the mapping between Unicode code points and Hanyu Pinyin, in the following form:
3007 (ling2)
4E00 (yi1)
4E01 (ding1,zheng1)
4E02 (kao3)
4E03 (qi1)
4E04 (shang4,shang3)
4E05 (xia4)
4E06 (none0)
4E07 (wan4,mo4)
4E08 (zhang4)
4E09 (san1)
4E0A (shang4,shang3)
4E0B (xia4)
4E0C (ji1)
4E0D (bu4,bu2,fou3)
4E0E (yu3,yu4,yu2)
4E0F (mian3)
4E10 (gai4)
4E11 (chou3)
4E12 (chou3)
4E13 (zhuan1)
4E14 (qie3,ju1)
4E15 (pi1)
4E16 (shi4)
4E17 (shi4)
4E18 (qiu1)
4E19 (bing3)
4E1A (ye4)
4E1B (cong2)
4E1C (dong1)
4E1D (si1)
4E1E (cheng2)
4E1F (diu1)
4E20 (qiu1)
4E21 (liang3)
4E22 (diu1)
4E23 (you3)
4E24 (liang3)
4E25 (yan2)
4E26 (bing4)
4E27 (sang1,sang4,sang5)
4E28 (shu4)
4E29 (jiu1)
4E2A (ge4,ge3)
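Because the table is later loaded with java.util.Properties, each line is split at the first whitespace: the hex code point becomes the key and the parenthesized list becomes the value. The sketch below shows this on two lines copied from the excerpt above; the class name MappingLineDemo and the method parse are made up for illustration and are not part of the library.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class MappingLineDemo {
    // Loads mapping-table text into a Properties object (hypothetical helper).
    static Properties parse(String text) {
        Properties table = new Properties();
        try {
            table.load(new StringReader(text));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return table;
    }

    public static void main(String[] args) {
        Properties table = parse("4E00 (yi1)\n4E0D (bu4,bu2,fou3)\n");
        System.out.println(table.getProperty("4E00")); // prints (yi1)
        System.out.println(table.getProperty("4E0D")); // prints (bu4,bu2,fou3)
    }
}
```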
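Let's start with the four classes in the format package.
#########1 HanyuPinyinCaseType: the pinyin case type class. It holds two static instances of itself, which you reference directly, e.g. HanyuPinyinCaseType.UPPERCASE##############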
public class HanyuPinyinCaseType
{
public static final HanyuPinyinCaseType UPPERCASE = new HanyuPinyinCaseType();
public static final HanyuPinyinCaseType LOWERCASE = new HanyuPinyinCaseType();
}
#########2 HanyuPinyinToneType: the tone type class, which decides whether the tone digits [1-5] are stripped from the pinyin#############
public class HanyuPinyinToneType
{
public static final HanyuPinyinToneType WITH_TONE_NUMBER = new HanyuPinyinToneType();
public static final HanyuPinyinToneType WITHOUT_TONE = new HanyuPinyinToneType();
}
#########3 HanyuPinyinVCharType: the v-character type class, which defines how "u:" (the letter ü) is written in the output pinyin##############
public class HanyuPinyinVCharType
{
public static final HanyuPinyinVCharType WITH_U_AND_COLON = new HanyuPinyinVCharType();// the default: keep "u:"
public static final HanyuPinyinVCharType WITH_V = new HanyuPinyinVCharType();// replace "u:" with "v"
public static final HanyuPinyinVCharType WITH_U_UNICODE = new HanyuPinyinVCharType();// replace "u:" with "ü"
}
#########4 HanyuPinyinOutputFormat: the format class##############
public final class HanyuPinyinOutputFormat
{
private HanyuPinyinVCharType vCharType;
private HanyuPinyinCaseType caseType;
private HanyuPinyinToneType toneType;
public HanyuPinyinOutputFormat()
{
restoreDefault();
}
public void restoreDefault()
{
this.vCharType = HanyuPinyinVCharType.WITH_U_AND_COLON;
this.caseType = HanyuPinyinCaseType.LOWERCASE;
this.toneType = HanyuPinyinToneType.WITH_TONE_NUMBER;
}
// ... getters and setters omitted
}
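All four classes work on the same principle; the fourth simply bundles the first three together. The static instances are created once, when the class is first initialized, and are then shared by every caller.
For example, the first use of HanyuPinyinCaseType creates the following two shared static instances (UPPERCASE and LOWERCASE):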
public static final HanyuPinyinCaseType UPPERCASE = new HanyuPinyinCaseType();
public static final HanyuPinyinCaseType LOWERCASE = new HanyuPinyinCaseType();
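This is the "typesafe enum" style that was common before Java 5: a fixed set of shared instances that callers compare with ==. As a rough sketch of the same idea using a modern enum, assuming a made-up CaseType that stands in for HanyuPinyinCaseType (it is not part of the library):

```java
import java.util.Locale;

class CaseTypeDemo {
    // Hypothetical stand-in for HanyuPinyinCaseType, written as an enum.
    enum CaseType { UPPERCASE, LOWERCASE }

    // Applies the case rule with an identity comparison, like the library's == checks.
    static String applyCase(String pinyin, CaseType caseType) {
        return caseType == CaseType.UPPERCASE ? pinyin.toUpperCase(Locale.ROOT) : pinyin;
    }

    public static void main(String[] args) {
        System.out.println(applyCase("yi1", CaseType.UPPERCASE)); // prints YI1
        System.out.println(applyCase("yi1", CaseType.LOWERCASE)); // prints yi1
    }
}
```

Either way there is exactly one object per constant, which is why identity comparison is safe.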
#################The core class PinyinHelper######################################
The PinyinHelper class has only one field, unicodeToHanyuPinyinTable, and one static block:
private static Properties unicodeToHanyuPinyinTable = null;
static {
if (unicodeToHanyuPinyinTable == null)
initializeTable();
}
The code is easy to follow: if the object that holds the mapping table is still null, the following initialization runs:
private static void initializeTable()
{
try
{
String resourceName = "/pinyindb/unicode_to_hanyu_pinyin.txt";
unicodeToHanyuPinyinTable = new Properties();
unicodeToHanyuPinyinTable.load(new BufferedInputStream(PinyinHelper.class.getResourceAsStream("/pinyindb/unicode_to_hanyu_pinyin.txt")));
} catch (FileNotFoundException ex) {
ex.printStackTrace();
} catch (IOException ex) {
ex.printStackTrace();
}
}
The code above reads the file from the pinyindb directory at the root of the classpath, so that unicodeToHanyuPinyinTable is no longer null.
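One detail worth knowing: Class.getResourceAsStream returns null when the resource is missing; it does not throw FileNotFoundException. A minimal sketch of a loader with an explicit null check follows. The class TableLoaderDemo and its load method are hypothetical, and an in-memory stream stands in for the real file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class TableLoaderDemo {
    // Loads a mapping table from a stream; a null stream means the resource was not found.
    static Properties load(InputStream in) {
        if (in == null) {
            throw new IllegalStateException("mapping table resource not found");
        }
        Properties table = new Properties();
        try (InputStream stream = in) { // closes the stream when loading is done
            table.load(stream);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return table;
    }

    public static void main(String[] args) {
        // In the library the stream would come from PinyinHelper.class.getResourceAsStream(...).
        byte[] data = "4E00 (yi1)\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(load(new ByteArrayInputStream(data)).getProperty("4E00")); // prints (yi1)
    }
}
```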
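PinyinHelper has only one core method, toHanyuPinyinString; the other methods are helpers. The comments below explain how it works.
public static String toHanyuPinyinString(String str, HanyuPinyinOutputFormat outputFormat, String seperater)// takes a string containing Chinese characters, a format object, and a separator (any string you choose)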
{
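StringBuffer resultPinyinStrBuf = new StringBuffer(); // buffer that accumulates the result
for (int i = 0; i < str.length(); i++) {// walk through the string
int codepointOfChar = str.codePointAt(i);// integer Unicode code point at this index
String mainPinyinStrOfChar = getFirstHanyuPinyinString(codepointOfChar, outputFormat);// look up the code point in the mapping table and return the first pinyin string, formatted according to the three settings of the format object
if (mainPinyinStrOfChar != null) {
resultPinyinStrBuf.append(mainPinyinStrOfChar);// append the pinyin to the buffer
if (i != str.length() - 1)
resultPinyinStrBuf.append(seperater);// append the separator, except after the last character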
}
else {
resultPinyinStrBuf.append(str.charAt(i));
}
}
return resultPinyinStrBuf.toString();//return
}
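Note that the loop above advances one char at a time, while codePointAt can return a code point that occupies two chars (a surrogate pair). For text outside the Basic Multilingual Plane, a loop that steps by Character.charCount visits each character exactly once. The sketch below only illustrates that stepping; CodePointWalkDemo and hexCodePoints are made-up names, and the hex formatting is just for display.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointWalkDemo {
    // Returns the hex code point of every character, treating a surrogate pair as one unit.
    static List<String> hexCodePoints(String str) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < str.length(); ) {
            int codePoint = str.codePointAt(i);
            result.add(Integer.toHexString(codePoint).toUpperCase());
            i += Character.charCount(codePoint); // 1 for BMP characters, 2 for supplementary ones
        }
        return result;
    }

    public static void main(String[] args) {
        // "\uD840\uDC00" is the single supplementary character U+20000.
        System.out.println(hexCodePoints("\u4E00\uD840\uDC00\u4E09")); // prints [4E00, 20000, 4E09]
    }
}
```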
Next, the getFirstHanyuPinyinString method called above:
private static String getFirstHanyuPinyinString(int codepoint, HanyuPinyinOutputFormat outputFormat)
{
String[] pinyinStrArray = getHanyuPinyinStringArray(codepoint, outputFormat);
if ((pinyinStrArray != null) && (pinyinStrArray.length > 0)) {// look up the mapping table; return the first element of the pinyin array, or null
return pinyinStrArray[0];
}
return null;
}
The getHanyuPinyinStringArray method it calls fetches the pinyin array for a code point from the mapping table. Its source is as follows:
private static String[] getHanyuPinyinStringArray(int codepoint, HanyuPinyinOutputFormat outputFormat)
{
String pinyinRecord = getHanyuPinyinRecord(codepoint);// fetch the pinyin record for this code point, still in parentheses; e.g. for the line 4E36 (zhu3,dian3) this is the string "(zhu3,dian3)"
if (pinyinRecord != null) {
int indexOfLeftBracket = pinyinRecord.indexOf("(");
int indexOfRightBracket = pinyinRecord.lastIndexOf(")");
String stripedString = pinyinRecord.substring(indexOfLeftBracket + 1, indexOfRightBracket);// strip the surrounding parentheses; for the example above this gives "zhu3,dian3"
// next, apply the replacements required by the outputFormat object
if (HanyuPinyinVCharType.WITH_V == outputFormat.getVCharType())
stripedString = stripedString.replaceAll("u:", "v");
else if (HanyuPinyinVCharType.WITH_U_UNICODE == outputFormat.getVCharType()) {
stripedString = stripedString.replaceAll("u:", "ü");
}
if (HanyuPinyinToneType.WITHOUT_TONE == outputFormat.getToneType()) {
stripedString = stripedString.replaceAll("[1-5]", "");
}
if (HanyuPinyinCaseType.UPPERCASE == outputFormat.getCaseType()) {
stripedString = stripedString.toUpperCase();
}
return stripedString.split(",");// split into a string array and return it
}
return null;
}
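To see those formatting steps in isolation, here is a small self-contained sketch that applies the same sequence (strip parentheses, replace "u:", drop tone digits, uppercase, split) with plain booleans in place of the library's format objects. RecordFormatDemo and format are hypothetical names, and "(lu:4)" is an illustrative record rather than a line quoted from the real table.

```java
import java.util.Arrays;
import java.util.Locale;

class RecordFormatDemo {
    // Mirrors the steps above, using booleans instead of HanyuPinyinOutputFormat.
    static String[] format(String record, boolean useV, boolean stripTone, boolean upperCase) {
        String stripped = record.substring(record.indexOf('(') + 1, record.lastIndexOf(')'));
        if (useV) {
            stripped = stripped.replace("u:", "v");
        }
        if (stripTone) {
            stripped = stripped.replaceAll("[1-5]", "");
        }
        if (upperCase) {
            stripped = stripped.toUpperCase(Locale.ROOT);
        }
        return stripped.split(",");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(format("(zhu3,dian3)", false, true, false))); // prints [zhu, dian]
        System.out.println(Arrays.toString(format("(lu:4)", true, false, true)));        // prints [LV4]
    }
}
```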
So how do you call it?
Like this:
PinyinHelper.toHanyuPinyinString("中华人民共和国",new HanyuPinyinOutputFormat(),"#");// uses the default format, new HanyuPinyinOutputFormat()
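To try the separator behavior without the jar, the following sketch reproduces the loop shape of toHanyuPinyinString over a tiny in-memory table. The three entries are copied from the mapping file excerpt earlier; MiniPinyinDemo, TABLE, and convert are made-up names, and this is not the library's implementation.

```java
import java.util.HashMap;
import java.util.Map;

class MiniPinyinDemo {
    // Tiny in-memory table with three entries from the mapping file excerpt.
    static final Map<Integer, String> TABLE = new HashMap<>();
    static {
        TABLE.put(0x4E00, "yi1");
        TABLE.put(0x4E03, "qi1");
        TABLE.put(0x4E09, "san1");
    }

    // Same loop shape as toHanyuPinyinString: known characters become pinyin, others pass through.
    static String convert(String str, String separator) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < str.length(); i++) {
            String pinyin = TABLE.get(str.codePointAt(i));
            if (pinyin != null) {
                out.append(pinyin);
                if (i != str.length() - 1) {
                    out.append(separator); // no separator after the last character
                }
            } else {
                out.append(str.charAt(i)); // unmapped characters are copied unchanged
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("\u4E00\u4E03\u4E09", "#")); // prints yi1#qi1#san1
        System.out.println(convert("\u4E00a\u4E09", "#"));      // prints yi1#asan1
    }
}
```

The second call shows that the separator is added only after a converted character, so an unmapped character such as "a" runs straight into the pinyin that follows it.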