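In our systems we often need to sort items by their first letter (for example, the alphabetical brand list on Taobao Mall). To do that, we need a way to look up the pinyin of a Chinese character and then take the first letter of that pinyin.
We use the open-source pinyin4j package (hosted on SourceForge, Java package net.sourceforge.pinyin4j) for this.
Usage is simple:
The utility class it provides is PinyinHelper.java, shown below. It exposes the entire public API; several of its methods convert to different romanization systems. For background on these systems, see http://wenku.baidu.com/view/28dda445b307e87101f696f9.html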
/**
 * This file is part of pinyin4j (http://sourceforge.net/projects/pinyin4j/)
 * and distributed under GNU GENERAL PUBLIC LICENSE (GPL).
 *
 * pinyin4j is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * pinyin4j is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with pinyin4j.
 */
package net.sourceforge.pinyin4j;

import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

/**
 * A class provides several utility functions to convert Chinese characters
 * (both Simplified and Tranditional) into various Chinese Romanization
 * representations
 *
 * @author Li Min ([email protected])
 */
public class PinyinHelper {

    /**
     * Get all unformmatted Hanyu Pinyin presentations of a single Chinese
     * character (both Simplified and Tranditional)
     *
     * <p>
     * For example, <br/> If the input is '间', the return will be an array with
     * two Hanyu Pinyin strings: <br/> "jian1" <br/> "jian4" <br/> <br/> If the
     * input is '李', the return will be an array with single Hanyu Pinyin
     * string: <br/> "li3"
     *
     * <p>
     * <b>Special Note</b>: If the return is "none0", that means the input
     * Chinese character exists in Unicode CJK talbe, however, it has no
     * pronounciation in Chinese
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformmatted Hanyu Pinyin
     *         presentations with tone numbers; null for non-Chinese character
     *
     */
    static public String[] toHanyuPinyinStringArray(char ch) {
        return getUnformattedHanyuPinyinStringArray(ch);
    }

    /**
     * Get all Hanyu Pinyin presentations of a single Chinese character (both
     * Simplified and Tranditional)
     *
     * <p>
     * For example, <br/> If the input is '间', the return will be an array with
     * two Hanyu Pinyin strings: <br/> "jian1" <br/> "jian4" <br/> <br/> If the
     * input is '李', the return will be an array with single Hanyu Pinyin
     * string: <br/> "li3"
     *
     * <p>
     * <b>Special Note</b>: If the return is "none0", that means the input
     * Chinese character is in Unicode CJK talbe, however, it has no
     * pronounciation in Chinese
     *
     * @param ch
     *            the given Chinese character
     * @param outputFormat
     *            describes the desired format of returned Hanyu Pinyin String
     *
     * @return a String array contains all Hanyu Pinyin presentations with tone
     *         numbers; return null for non-Chinese character
     *
     * @throws BadHanyuPinyinOutputFormatCombination
     *             if certain combination of output formats happens
     *
     * @see HanyuPinyinOutputFormat
     * @see BadHanyuPinyinOutputFormatCombination
     *
     */
    static public String[] toHanyuPinyinStringArray(char ch, HanyuPinyinOutputFormat outputFormat)
            throws BadHanyuPinyinOutputFormatCombination {
        return getFormattedHanyuPinyinStringArray(ch, outputFormat);
    }

    /**
     * Return the formatted Hanyu Pinyin representations of the given Chinese
     * character (both in Simplified and Tranditional) in array format.
     *
     * @param ch
     *            the given Chinese character
     * @param outputFormat
     *            Describes the desired format of returned Hanyu Pinyin string
     * @return The formatted Hanyu Pinyin representations of the given codepoint
     *         in array format; null if no record is found in the hashtable.
     */
    static private String[] getFormattedHanyuPinyinStringArray(char ch, HanyuPinyinOutputFormat outputFormat)
            throws BadHanyuPinyinOutputFormatCombination {
        String[] pinyinStrArray = getUnformattedHanyuPinyinStringArray(ch);

        if (null != pinyinStrArray) {
            for (int i = 0; i < pinyinStrArray.length; i++) {
                pinyinStrArray[i] = PinyinFormatter.formatHanyuPinyin(pinyinStrArray[i], outputFormat);
            }
            return pinyinStrArray;
        } else
            return null;
    }

    /**
     * Delegate function
     *
     * @param ch
     *            the given Chinese character
     * @return unformatted Hanyu Pinyin strings; null if the record is not found
     */
    private static String[] getUnformattedHanyuPinyinStringArray(char ch) {
        return ChineseToPinyinResource.getInstance().getHanyuPinyinStringArray(ch);
    }

    /**
     * Get all unformmatted Tongyong Pinyin presentations of a single Chinese
     * character (both Simplified and Tranditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformmatted Tongyong Pinyin
     *         presentations with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toTongyongPinyinStringArray(char ch) {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.TONGYONG_PINYIN);
    }

    /**
     * Get all unformmatted Wade-Giles presentations of a single Chinese
     * character (both Simplified and Tranditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformmatted Wade-Giles presentations
     *         with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toWadeGilesPinyinStringArray(char ch) {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.WADEGILES_PINYIN);
    }

    /**
     * Get all unformmatted MPS2 (Mandarin Phonetic Symbols 2) presentations of
     * a single Chinese character (both Simplified and Tranditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformmatted MPS2 (Mandarin Phonetic
     *         Symbols 2) presentations with tone numbers; null for non-Chinese
     *         character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toMPS2PinyinStringArray(char ch) {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.MPS2_PINYIN);
    }

    /**
     * Get all unformmatted Yale Pinyin presentations of a single Chinese
     * character (both Simplified and Tranditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformmatted Yale Pinyin
     *         presentations with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toYalePinyinStringArray(char ch) {
        return convertToTargetPinyinStringArray(ch, PinyinRomanizationType.YALE_PINYIN);
    }

    /**
     * @param ch
     *            the given Chinese character
     * @param targetPinyinSystem
     *            indicates target Chinese Romanization system should be
     *            converted to
     * @return string representations of target Chinese Romanization system
     *         corresponding to the given Chinese character in array format;
     *         null if error happens
     *
     * @see PinyinRomanizationType
     */
    private static String[] convertToTargetPinyinStringArray(char ch, PinyinRomanizationType targetPinyinSystem) {
        String[] hanyuPinyinStringArray = getUnformattedHanyuPinyinStringArray(ch);

        if (null != hanyuPinyinStringArray) {
            String[] targetPinyinStringArray = new String[hanyuPinyinStringArray.length];

            for (int i = 0; i < hanyuPinyinStringArray.length; i++) {
                targetPinyinStringArray[i] = PinyinRomanizationTranslator.convertRomanizationSystem(
                        hanyuPinyinStringArray[i], PinyinRomanizationType.HANYU_PINYIN, targetPinyinSystem);
            }

            return targetPinyinStringArray;
        } else
            return null;
    }

    /**
     * Get all unformmatted Gwoyeu Romatzyh presentations of a single Chinese
     * character (both Simplified and Tranditional)
     *
     * @param ch
     *            the given Chinese character
     *
     * @return a String array contains all unformmatted Gwoyeu Romatzyh
     *         presentations with tone numbers; null for non-Chinese character
     *
     * @see #toHanyuPinyinStringArray(char)
     *
     */
    static public String[] toGwoyeuRomatzyhStringArray(char ch) {
        return convertToGwoyeuRomatzyhStringArray(ch);
    }

    /**
     * @param ch
     *            the given Chinese character
     *
     * @return Gwoyeu Romatzyh string representations corresponding to the given
     *         Chinese character in array format; null if error happens
     *
     * @see PinyinRomanizationType
     */
    private static String[] convertToGwoyeuRomatzyhStringArray(char ch) {
        String[] hanyuPinyinStringArray = getUnformattedHanyuPinyinStringArray(ch);

        if (null != hanyuPinyinStringArray) {
            String[] targetPinyinStringArray = new String[hanyuPinyinStringArray.length];

            for (int i = 0; i < hanyuPinyinStringArray.length; i++) {
                targetPinyinStringArray[i] =
                        GwoyeuRomatzyhTranslator.convertHanyuPinyinToGwoyeuRomatzyh(hanyuPinyinStringArray[i]);
            }

            return targetPinyinStringArray;
        } else
            return null;
    }

    /**
     * Get a string which all Chinese characters are replaced by corresponding
     * main (first) Hanyu Pinyin representation.
     *
     * <p>
     * <b>Special Note</b>: If the return contains "none0", that means that
     * Chinese character is in Unicode CJK talbe, however, it has not
     * pronounciation in Chinese. <b> This interface will be removed in next
     * release. </b>
     *
     * @param str
     *            A given string contains Chinese characters
     * @param outputFormat
     *            Describes the desired format of returned Hanyu Pinyin string
     * @param seperater
     *            The string is appended after a Chinese character (excluding
     *            the last Chinese character at the end of sentence). <b>Note!
     *            Seperater will not appear after a non-Chinese character</b>
     * @return a String identical to the original one but all recognizable
     *         Chinese characters are converted into main (first) Hanyu Pinyin
     *         representation
     *
     * @deprecated DO NOT use it again because the first retrived pinyin string
     *             may be a wrong pronouciation in a certain sentence context.
     *             <b> This interface will be removed in next release. </b>
     */
    static public String toHanyuPinyinString(String str, HanyuPinyinOutputFormat outputFormat, String seperater)
            throws BadHanyuPinyinOutputFormatCombination {
        StringBuffer resultPinyinStrBuf = new StringBuffer();

        for (int i = 0; i < str.length(); i++) {
            String mainPinyinStrOfChar = getFirstHanyuPinyinString(str.charAt(i), outputFormat);

            if (null != mainPinyinStrOfChar) {
                resultPinyinStrBuf.append(mainPinyinStrOfChar);
                if (i != str.length() - 1) { // avoid appending at the end
                    resultPinyinStrBuf.append(seperater);
                }
            } else {
                resultPinyinStrBuf.append(str.charAt(i));
            }
        }

        return resultPinyinStrBuf.toString();
    }

    /**
     * Get the first Hanyu Pinyin of a Chinese character <b> This function will
     * be removed in next release. </b>
     *
     * @param ch
     *            The given Unicode character
     * @param outputFormat
     *            Describes the desired format of returned Hanyu Pinyin string
     * @return Return the first Hanyu Pinyin of given Chinese character; return
     *         null if the input is not a Chinese character
     *
     * @deprecated DO NOT use it again because the first retrived pinyin string
     *             may be a wrong pronouciation in a certain sentence context.
     *             <b> This function will be removed in next release. </b>
     */
    static private String getFirstHanyuPinyinString(char ch, HanyuPinyinOutputFormat outputFormat)
            throws BadHanyuPinyinOutputFormatCombination {
        String[] pinyinStrArray = getFormattedHanyuPinyinStringArray(ch, outputFormat);

        if ((null != pinyinStrArray) && (pinyinStrArray.length > 0)) {
            return pinyinStrArray[0];
        } else {
            return null;
        }
    }

    // ! Hidden constructor
    private PinyinHelper() {
    }
}
/**
 * This file is part of pinyin4j (http://sourceforge.net/projects/pinyin4j/)
 * and distributed under GNU GENERAL PUBLIC LICENSE (GPL).
 *
 * pinyin4j is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * pinyin4j is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with pinyin4j.
 */

/**
 *
 */
package net.sourceforge.pinyin4j;

/**
 * The class describes variable Chinese Pinyin Romanization System
 *
 * @author Li Min ([email protected])
 *
 */
class PinyinRomanizationType {

    /**
     * Hanyu Pinyin system
     */
    static final PinyinRomanizationType HANYU_PINYIN = new PinyinRomanizationType("Hanyu");

    /**
     * Wade-Giles Pinyin system
     */
    static final PinyinRomanizationType WADEGILES_PINYIN = new PinyinRomanizationType("Wade");

    /**
     * Mandarin Phonetic Symbols 2 (MPS2) Pinyin system
     */
    static final PinyinRomanizationType MPS2_PINYIN = new PinyinRomanizationType("MPSII");

    /**
     * Yale Pinyin system
     */
    static final PinyinRomanizationType YALE_PINYIN = new PinyinRomanizationType("Yale");

    /**
     * Tongyong Pinyin system
     */
    static final PinyinRomanizationType TONGYONG_PINYIN = new PinyinRomanizationType("Tongyong");

    /**
     * Gwoyeu Romatzyh system
     */
    static final PinyinRomanizationType GWOYEU_ROMATZYH = new PinyinRomanizationType("Gwoyeu");

    /**
     * Constructor
     */
    protected PinyinRomanizationType(String tagName) {
        setTagName(tagName);
    }

    /**
     * @return Returns the tagName.
     */
    String getTagName() {
        return tagName;
    }

    /**
     * @param tagName
     *            The tagName to set.
     */
    protected void setTagName(String tagName) {
        this.tagName = tagName;
    }

    protected String tagName;
}
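PinyinRomanizationType uses the hand-written type-safe constant pattern that predates Java 5 enums. Purely as an illustration, the same idea could be written as an enum. RomanizationSystem and fromTagName below are hypothetical names and are not part of pinyin4j; only the constant names and tag strings are taken from the class above.

```java
import java.util.Optional;

/** Hypothetical enum counterpart of PinyinRomanizationType (not part of pinyin4j). */
enum RomanizationSystem {
    HANYU_PINYIN("Hanyu"),
    WADEGILES_PINYIN("Wade"),
    MPS2_PINYIN("MPSII"),
    YALE_PINYIN("Yale"),
    TONGYONG_PINYIN("Tongyong"),
    GWOYEU_ROMATZYH("Gwoyeu");

    // Tag string copied from the constants in the class shown above.
    private final String tagName;

    RomanizationSystem(String tagName) {
        this.tagName = tagName;
    }

    String getTagName() {
        return tagName;
    }

    /** Finds the system for a tag such as "Wade"; empty if the tag is unknown. */
    static Optional<RomanizationSystem> fromTagName(String tagName) {
        for (RomanizationSystem system : values()) {
            if (system.tagName.equals(tagName)) {
                return Optional.of(system);
            }
        }
        return Optional.empty();
    }
}
```

Unlike the original class, an enum makes the tag immutable, so there is no counterpart to setTagName here.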
package demo;

import net.sourceforge.pinyin4j.PinyinHelper;
import net.sourceforge.pinyin4j.format.HanyuPinyinCaseType;
import net.sourceforge.pinyin4j.format.HanyuPinyinOutputFormat;
import net.sourceforge.pinyin4j.format.HanyuPinyinToneType;
import net.sourceforge.pinyin4j.format.HanyuPinyinVCharType;
import net.sourceforge.pinyin4j.format.exception.BadHanyuPinyinOutputFormatCombination;

public class MyPinyinDemo {

    /**
     * @param args
     * @throws BadHanyuPinyinOutputFormatCombination
     */
    public static void main(String[] args) throws BadHanyuPinyinOutputFormatCombination {
        char chineseCharacter = "绿".charAt(0);
        HanyuPinyinOutputFormat outputFormat = new HanyuPinyinOutputFormat();

        // Tone as a digit: 1 = first tone, 2 = second, 3 = third, 4 = fourth, e.g. lu:4
        outputFormat.setToneType(HanyuPinyinToneType.WITH_TONE_NUMBER);
        // outputFormat.setToneType(HanyuPinyinToneType.WITHOUT_TONE);   // no tone, e.g. lu:
        // outputFormat.setToneType(HanyuPinyinToneType.WITH_TONE_MARK); // tone mark on the letter, e.g. lǜ

        // How 'ü' is written: here as "u:"
        outputFormat.setVCharType(HanyuPinyinVCharType.WITH_U_AND_COLON);
        // outputFormat.setVCharType(HanyuPinyinVCharType.WITH_U_UNICODE); // 'ü' as the Unicode character "ü"
        // outputFormat.setVCharType(HanyuPinyinVCharType.WITH_V);         // 'ü' as "v"

        // Upper-case output
        outputFormat.setCaseType(HanyuPinyinCaseType.UPPERCASE);
        // outputFormat.setCaseType(HanyuPinyinCaseType.LOWERCASE); // lower-case output

        // Hanyu Pinyin of a single character; a polyphonic character yields one entry per reading
        String[] pinyinArray = PinyinHelper.toHanyuPinyinStringArray(chineseCharacter, outputFormat);
        for (String str : pinyinArray) {
            System.out.println(str);
        }

        // Whole string, first reading of each character (this method is deprecated, see PinyinHelper above)
        String pinyinStr = PinyinHelper.toHanyuPinyinString("绿色", outputFormat, "|");
        System.out.println(pinyinStr);

        // Output in the other romanization systems; again one entry per reading
        String[] gwoyeuRomatzyhStringArray = PinyinHelper.toGwoyeuRomatzyhStringArray(chineseCharacter);
        for (String str : gwoyeuRomatzyhStringArray) {
            System.out.println(str);
        }

        String[] mps2PinyinStringArray = PinyinHelper.toMPS2PinyinStringArray(chineseCharacter);
        for (String str : mps2PinyinStringArray) {
            System.out.println(str);
        }

        String[] tongyongPinyinStringArray = PinyinHelper.toTongyongPinyinStringArray(chineseCharacter);
        for (String str : tongyongPinyinStringArray) {
            System.out.println(str);
        }

        String[] wadeGilesPinyinStringArray = PinyinHelper.toWadeGilesPinyinStringArray(chineseCharacter);
        for (String str : wadeGilesPinyinStringArray) {
            System.out.println(str);
        }

        String[] yalePinyinStringArray = PinyinHelper.toYalePinyinStringArray(chineseCharacter);
        for (String str : yalePinyinStringArray) {
            System.out.println(str);
        }
    }
}
Output:
LU:4
LU4
LU:4|SE4
liuh
luh
liu4
lu4
lyu4
lu4
lu:4
lu4
lyu4
lu4
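Coming back to the original goal, an index grouped by first letter, the following is a minimal sketch of how the lookup result might be used. To keep it runnable without the library, the pinyin lookup is passed in as a function and a tiny stub table stands in for pinyin4j; with the library on the classpath, a lambda such as ch -> PinyinHelper.toHanyuPinyinStringArray(ch) could be supplied instead. InitialIndex, initialOf, groupByInitial and stubLookup are hypothetical names. The sketch simply takes the first reading of a polyphonic character, which may be the wrong one in context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical helper (not part of pinyin4j): groups names by the first letter of their pinyin. */
class InitialIndex {

    /** Returns the upper-case index letter for a name, or '#' if none can be derived. */
    static char initialOf(String name, Function<Character, String[]> lookup) {
        if (name == null || name.isEmpty()) {
            return '#';
        }
        char first = name.charAt(0);
        String[] readings = lookup.apply(first);
        if (readings != null && readings.length > 0 && !readings[0].isEmpty()) {
            // Assumption: the first reading is good enough for indexing purposes.
            return Character.toUpperCase(readings[0].charAt(0));
        }
        if ((first >= 'a' && first <= 'z') || (first >= 'A' && first <= 'Z')) {
            return Character.toUpperCase(first); // ASCII letters index under themselves
        }
        return '#';
    }

    /** Builds a letter -> names index, sorted by letter; names keep their input order. */
    static Map<Character, List<String>> groupByInitial(List<String> names,
                                                       Function<Character, String[]> lookup) {
        Map<Character, List<String>> index = new TreeMap<>();
        for (String name : names) {
            index.computeIfAbsent(initialOf(name, lookup), k -> new ArrayList<>()).add(name);
        }
        return index;
    }

    /** Stub standing in for the library; readings are the ones quoted in this article. */
    static String[] stubLookup(Character ch) {
        switch (ch) {
            case '\u7EFF': return new String[] {"lu:4", "lu4"};     // character from the demo above
            case '\u674E': return new String[] {"li3"};             // example from the Javadoc
            case '\u95F4': return new String[] {"jian1", "jian4"};  // example from the Javadoc
            default: return null;                                   // mimics "null for non-Chinese character"
        }
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the file compiles regardless of source encoding.
        List<String> names = Arrays.asList("\u7EFF\u8272", "\u674E\u5B50", "\u95F4\u9694", "Nike");
        System.out.println(groupByInitial(names, InitialIndex::stubLookup));
    }
}
```

A TreeMap keeps the letters in alphabetical order, which is what an A to Z brand list needs; anything that cannot be mapped to a letter is collected under '#'.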
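The package also ships with its own demo, Pinyin4jAppletDemo.java.
As for the implementation, it is quite simple: everything is driven by dictionary files that map Chinese characters to pinyin. unicode_to_hanyu_pinyin.txt maps the Unicode code point of each Chinese character to its pinyin; pinyin_mapping.xml is the cross-reference table between Hanyu Pinyin and the other romanization systems; pinyin_Gwoyeu_mapping.xml is the cross-reference table between Hanyu Pinyin and Gwoyeu Romatzyh. Excerpts of the two XML formats follow. Once these tables have been compiled, the rest is easy to implement.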
<?xml version="1.0"?>
<pinyin_mapping>
  <item>
    <Hanyu>a</Hanyu>
    <Wade>a</Wade>
    <MPSII>a</MPSII>
    <Yale>a</Yale>
    <Tongyong>a</Tongyong>
  </item>
  <item>
    <Hanyu>ai</Hanyu>
    <Wade>ai</Wade>
    <MPSII>ai</MPSII>
    <Yale>ai</Yale>
    <Tongyong>ai</Tongyong>
  </item>
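To show how a table like pinyin_mapping.xml can drive the conversion, here is a minimal sketch. It hard-codes a few rows instead of parsing the XML, and it assumes that the tone is a single trailing digit that is carried over unchanged. The "a" and "ai" rows come from the excerpt above; the "lu:" row is inferred from the demo output earlier in this article. RomanizationTable, addRow and convert are hypothetical names and do not reflect how pinyin4j's PinyinRomanizationTranslator is actually written.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical table-driven converter (a sketch, not pinyin4j's implementation). */
class RomanizationTable {

    // Toneless Hanyu Pinyin syllable -> (system tag -> spelling in that system)
    private final Map<String, Map<String, String>> rows = new HashMap<>();

    void addRow(String hanyu, String wade, String mps2, String yale, String tongyong) {
        Map<String, String> row = new HashMap<>();
        row.put("Hanyu", hanyu);
        row.put("Wade", wade);
        row.put("MPSII", mps2);
        row.put("Yale", yale);
        row.put("Tongyong", tongyong);
        rows.put(hanyu, row);
    }

    /** Converts e.g. "lu:4" to the target system; null if the syllable or tag is unknown. */
    String convert(String hanyuWithTone, String targetTag) {
        if (hanyuWithTone == null || hanyuWithTone.isEmpty()) {
            return null;
        }
        int last = hanyuWithTone.length() - 1;
        // Assumption: an optional tone number is the final character.
        boolean hasTone = Character.isDigit(hanyuWithTone.charAt(last));
        String syllable = hasTone ? hanyuWithTone.substring(0, last) : hanyuWithTone;
        String tone = hasTone ? hanyuWithTone.substring(last) : "";

        Map<String, String> row = rows.get(syllable);
        if (row == null) {
            return null;
        }
        String target = row.get(targetTag);
        return target == null ? null : target + tone;
    }

    /** Sample rows: two from the XML excerpt, one inferred from the demo output. */
    static RomanizationTable sample() {
        RomanizationTable table = new RomanizationTable();
        table.addRow("a", "a", "a", "a", "a");
        table.addRow("ai", "ai", "ai", "ai", "ai");
        table.addRow("lu:", "lu:", "liu", "lyu", "lyu");
        return table;
    }

    public static void main(String[] args) {
        RomanizationTable table = sample();
        System.out.println(table.convert("lu:4", "Yale"));
        System.out.println(table.convert("lu:4", "MPSII"));
    }
}
```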
<pinyin_gwoyeu_mapping>
  <item>
    <Hanyu>a</Hanyu>
    <Gwoyeu_I>a</Gwoyeu_I>
    <Gwoyeu_II>ar</Gwoyeu_II>
    <Gwoyeu_III>aa</Gwoyeu_III>
    <Gwoyeu_IV>ah</Gwoyeu_IV>
    <Gwoyeu_V>.a</Gwoyeu_V>
  </item>
  <item>
    <Hanyu>ai</Hanyu>
    <Gwoyeu_I>ai</Gwoyeu_I>
    <Gwoyeu_II>air</Gwoyeu_II>
    <Gwoyeu_III>ae</Gwoyeu_III>
    <Gwoyeu_IV>ay</Gwoyeu_IV>
    <Gwoyeu_V>.ai</Gwoyeu_V>
  </item>
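The first of the three files, unicode_to_hanyu_pinyin.txt, can be sketched in the same spirit. The example below assumes that each line consists of an upper-case hexadecimal code point followed by a parenthesised, comma-separated list of readings, for example "7EFF (lu:4,lu4)". That format is an assumption made for illustration, so check the actual file before relying on it. PinyinDictionary, fromText and lookup are hypothetical names; the sample readings are the ones quoted in this article.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical code-point -> readings dictionary (a sketch under the assumed file format). */
class PinyinDictionary {

    private final Properties table = new Properties();

    /** Loads dictionary text; each line is assumed to look like "7EFF (lu:4,lu4)". */
    static PinyinDictionary fromText(String text) {
        PinyinDictionary dictionary = new PinyinDictionary();
        try {
            // Properties splits each line at the first whitespace: key "7EFF", value "(lu:4,lu4)".
            dictionary.table.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dictionary;
    }

    /** Returns all readings of a character, or null if it is not in the table. */
    String[] lookup(char ch) {
        String key = Integer.toHexString(ch).toUpperCase();
        String value = table.getProperty(key);
        if (value == null) {
            return null;
        }
        value = value.trim();
        if (value.startsWith("(") && value.endsWith(")")) {
            value = value.substring(1, value.length() - 1); // strip the parentheses
        }
        return value.isEmpty() ? null : value.split(",");
    }

    /** Three sample entries, written with the assumed line format. */
    static PinyinDictionary sample() {
        return fromText("7EFF (lu:4,lu4)\n674E (li3)\n95F4 (jian1,jian4)\n");
    }

    public static void main(String[] args) {
        String[] readings = sample().lookup('\u7EFF');
        System.out.println(String.join(" ", readings));
    }
}
```

Returning null for an unknown character mirrors the behaviour documented for PinyinHelper, so a caller can treat "not in the dictionary" and "not a Chinese character" the same way.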