An Analysis of the Algorithm for Converting Chinese to Full Pinyin

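A recent project required me to look into algorithms for converting Simplified Chinese to pinyin. Most of what I found on Google only returns the first letter of each character's pinyin. After some digging I came across a class called sunrise.spell, written specifically for converting Chinese to full pinyin. It is well done, so I analyzed its algorithm. Overall it is fairly simple, and I would like to share it here.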

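Let's start with some background. GB2312 is an encoding that Chinese developers know very well, so I will only sketch its rules briefly. GB2312 covers symbols, digits, letters, Japanese kana, box-drawing characters and so on, but its main part is of course the Chinese characters. It uses a 16-bit (double-byte) encoding, and the Simplified Chinese characters occupy the range B0A1 through F7FE. The full code table is available at http://ash.jp/code/cn/gb2312tbl.htm. Writing each 8-bit half in decimal turns that range into [176 | 161] through [247 | 254], so every Chinese character can be described by two values: "啊" is [176 | 161] and "我" is [206 | 210].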
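The byte-pair view can be checked with a few lines of arithmetic. The following is only a minimal sketch: the class name `Gb2312Code` and its two helper methods are hypothetical names chosen for illustration and are not part of sunrise.spell.

```csharp
using System;

public static class Gb2312Code
{
    // Split a 16-bit GB2312 code such as 0xCED2 into its high and low bytes.
    public static (int High, int Low) Split(int code)
    {
        return ((code >> 8) & 0xFF, code & 0xFF);
    }

    // True when the byte pair lies inside the Chinese-character range
    // B0A1..F7FE described above (high byte B0..F7, low byte A1..FE).
    public static bool IsChinese(int high, int low)
    {
        return high >= 0xB0 && high <= 0xF7 && low >= 0xA1 && low <= 0xFE;
    }

    public static void Main()
    {
        var a = Split(0xB0A1);   // "啊"
        var wo = Split(0xCED2);  // "我"
        Console.WriteLine($"[{a.High} | {a.Low}]");    // [176 | 161]
        Console.WriteLine($"[{wo.High} | {wo.Low}]");  // [206 | 210]
        Console.WriteLine(IsChinese(wo.High, wo.Low)); // True
    }
}
```

The range check is deliberately coarse: it only tests the rectangle of byte values and says nothing about whether a particular code point is actually assigned.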

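With this scheme we can locate every Chinese character by a two-dimensional coordinate, and so build a two-dimensional table that maps characters to pinyin. We ignore some special cases along the way, such as characters with more than one reading. Because one pinyin syllable can correspond to many characters, and the number of distinct syllables is small, we first build a syllable table. The code is shown below. It lists every possible syllable and is a one-dimensional array.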

// Order matters: the two-dimensional table stores indices into this array.
static readonly string[] _spellMusicCode = new string[] {
    "a", "ai", "an", "ang", "ao", "ba", "bai", "ban", "bang", "bao",
    "bei", "ben", "beng", "bi", "bian", "biao", "bie", "bin", "bing", "bo",
    "bu", "ca", "cai", "can", "cang", "cao", "ce", "ceng", "cha", "chai",
    "chan", "chang", "chao", "che", "chen", "cheng", "chi", "chong", "chou", "chu",
    "chuai", "chuan", "chuang", "chui", "chun", "chuo", "ci", "cong", "cou", "cu",
    "cuan", "cui", "cun", "cuo", "da", "dai", "dan", "dang", "dao", "de",
    "deng", "di", "dian", "diao", "die", "ding", "diu", "dong", "dou", "du",
    "duan", "dui", "dun", "duo", "e", "en", "er", "fa", "fan", "fang",
    "fei", "fen", "feng", "fu", "fou", "ga", "gai", "gan", "gang", "gao",
    "ge", "ji", "gen", "geng", "gong", "gou", "gu", "gua", "guai", "guan",
    "guang", "gui", "gun", "guo", "ha", "hai", "han", "hang", "hao", "he",
    "hei", "hen", "heng", "hong", "hou", "hu", "hua", "huai", "huan", "huang",
    "hui", "hun", "huo", "jia", "jian", "jiang", "qiao", "jiao", "jie", "jin",
    "jing", "jiong", "jiu", "ju", "juan", "jue", "jun", "ka", "kai", "kan",
    "kang", "kao", "ke", "ken", "keng", "kong", "kou", "ku", "kua", "kuai",
    "kuan", "kuang", "kui", "kun", "kuo", "la", "lai", "lan", "lang", "lao",
    "le", "lei", "leng", "li", "lia", "lian", "liang", "liao", "lie", "lin",
    "ling", "liu", "long", "lou", "lu", "luan", "lue", "lun", "luo", "ma",
    "mai", "man", "mang", "mao", "me", "mei", "men", "meng", "mi", "mian",
    "miao", "mie", "min", "ming", "miu", "mo", "mou", "mu", "na", "nai",
    "nan", "nang", "nao", "ne", "nei", "nen", "neng", "ni", "nian", "niang",
    "niao", "nie", "nin", "ning", "niu", "nong", "nu", "nuan", "nue", "yao",
    "nuo", "o", "ou", "pa", "pai", "pan", "pang", "pao", "pei", "pen",
    "peng", "pi", "pian", "piao", "pie", "pin", "ping", "po", "pou", "pu",
    "qi", "qia", "qian", "qiang", "qie", "qin", "qing", "qiong", "qiu", "qu",
    "quan", "que", "qun", "ran", "rang", "rao", "re", "ren", "reng", "ri",
    "rong", "rou", "ru", "ruan", "rui", "run", "ruo", "sa", "sai", "san",
    "sang", "sao", "se", "sen", "seng", "sha", "shai", "shan", "shang", "shao",
    "she", "shen", "sheng", "shi", "shou", "shu", "shua", "shuai", "shuan", "shuang",
    "shui", "shun", "shuo", "si", "song", "sou", "su", "suan", "sui", "sun",
    "suo", "ta", "tai", "tan", "tang", "tao", "te", "teng", "ti", "tian",
    "tiao", "tie", "ting", "tong", "tou", "tu", "tuan", "tui", "tun", "tuo",
    "wa", "wai", "wan", "wang", "wei", "wen", "weng", "wo", "wu", "xi",
    "xia", "xian", "xiang", "xiao", "xie", "xin", "xing", "xiong", "xiu", "xu",
    "xuan", "xue", "xun", "ya", "yan", "yang", "ye", "yi", "yin", "ying",
    "yo", "yong", "you", "yu", "yuan", "yue", "yun", "za", "zai", "zan",
    "zang", "zao", "ze", "zei", "zen", "zeng", "zha", "zhai", "zhan", "zhang",
    "zhao", "zhe", "zhen", "zheng", "zhi", "zhong", "zhou", "zhu", "zhua", "zhuai",
    "zhuan", "zhuang", "zhui", "zhun", "zhuo", "zi", "zong", "zou", "zu", "zuan",
    "zui", "zun", "zuo", "", "ei", "m", "n", "dia", "cen", "nou",
    "jv", "qv", "xv", "lv", "nv"
};

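In the two-dimensional table described above, each cell stores the index of a syllable in the syllable table (the index of "bao", for example) rather than the syllable string itself, which cuts memory use considerably. Every value in the two-dimensional array is therefore an index: the element for [176 | 161] should hold 0, the index of "a". Note that the character set is large and half-width characters need no conversion, so the sunrise class leaves the half-width part out entirely. Its table therefore actually starts at [129 | 64], that is at 8140, which is where the double-byte range of GBK (a superset of GB2312) begins, and the array subscripts are offset by those two values.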
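To make the offset concrete, here is a minimal sketch of how a byte pair maps to a cell of such a table. The offsets 129 and 64 come from the description above. Everything else is an assumption made for illustration: the class name `SpellTableLayout`, the `short` element type, and the table size (both bytes assumed to run up to 254). The real sunrise.spell class may be laid out differently.

```csharp
using System;

public static class SpellTableLayout
{
    // Offsets from the text: the table starts at [129 | 64] (0x8140).
    public const int HighOffset = 129;
    public const int LowOffset = 64;

    // ASSUMPTION: both bytes run up to 0xFE (254).
    public const int Rows = 0xFE - HighOffset + 1; // 126
    public const int Cols = 0xFE - LowOffset + 1;  // 191

    // Translate a GB byte pair into (row, column) subscripts of the table.
    public static (int Row, int Col) ToCell(byte high, byte low)
    {
        return (high - HighOffset, low - LowOffset);
    }

    public static void Main()
    {
        // ASSUMPTION: a 16-bit element is enough, since there are only
        // about 400 syllable indices to store.
        var table = new short[Rows, Cols];

        var cell = ToCell(176, 161);   // "啊"
        table[cell.Row, cell.Col] = 0; // index of "a" in the syllable table

        Console.WriteLine($"{cell.Row},{cell.Col}");    // 47,97
        Console.WriteLine(Rows * Cols * sizeof(short)); // 48132 (bytes)
    }
}
```

Under these assumptions the whole index table takes roughly 47 KB, whereas storing a string reference per cell would cost several times that before counting the strings themselves.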
That covers the principle. It may not be easy to follow in the abstract, so let's walk through the program flow with an example.

Take the character "我" as input. The program first creates a GB2312 encoding object:

System.Text.Encoding encoding = System.Text.Encoding.GetEncoding("GB2312");

It then uses that object to get the encoded bytes of "我":

byte[] local = encoding.GetBytes("我");

The values in local should be local[0]=206; local[1]=210
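The two steps so far can be tried in isolation. This sketch assumes a modern runtime (.NET Core 3.0 or later, or .NET 5+), where code-page encodings such as GB2312 are not available until `CodePagesEncodingProvider` is registered. On the classic .NET Framework the registration line should not be needed. The class name `GbBytesDemo` is hypothetical.

```csharp
using System;
using System.Text;

public static class GbBytesDemo
{
    public static byte[] GetGbBytes(string text)
    {
        // Needed on .NET Core / .NET 5+, where code-page encodings are
        // not registered by default. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding = Encoding.GetEncoding("GB2312");
        return encoding.GetBytes(text);
    }

    public static void Main()
    {
        // "\u6211" is the character 我, escaped so that the result does not
        // depend on the encoding of the source file.
        byte[] local = GetGbBytes("\u6211");
        Console.WriteLine($"local[0]={local[0]}; local[1]={local[1]}");
    }
}
```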

Suppose our two-dimensional array is named _spellCodeIndex. Then _spellCodeIndex[local[0]-129, local[1]-64], that is _spellCodeIndex[77, 146], gives the syllable index for "我", which is 327.

Looking up index 327 in the syllable table gives "wo", which completes the conversion from Chinese to pinyin.
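The whole flow can be put together in a small, self-contained sketch. It is not the sunrise.spell implementation. The class `PinyinSketch` is hypothetical, the syllable table is cut down to the two indices mentioned in this article, and the index table is a stand-in for _spellCodeIndex that holds only "啊" and "我" (with -1 meaning "no entry", which is my own convention). It works on already-encoded GB bytes so that it does not depend on any encoding API.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PinyinSketch
{
    // Hypothetical stand-in for the syllable table: only indices 0 and 327.
    static readonly Dictionary<int, string> Syllables = new Dictionary<int, string>
    {
        { 0, "a" },
        { 327, "wo" },
    };

    // Hypothetical stand-in for _spellCodeIndex; -1 marks "no entry".
    static readonly short[,] CodeIndex = BuildIndex();

    static short[,] BuildIndex()
    {
        var table = new short[126, 191];
        for (int r = 0; r < 126; r++)
            for (int c = 0; c < 191; c++)
                table[r, c] = -1;
        table[176 - 129, 161 - 64] = 0;   // "啊" -> "a"
        table[206 - 129, 210 - 64] = 327; // "我" -> "wo"
        return table;
    }

    // Convert GB-encoded bytes to pinyin, one syllable per double-byte character.
    public static string ToPinyin(byte[] gb)
    {
        var sb = new StringBuilder();
        int i = 0;
        while (i < gb.Length)
        {
            byte high = gb[i];
            if (high < 0x80)
            {
                sb.Append((char)high); // half-width character: copy unchanged
                i += 1;
                continue;
            }

            string syllable = null;
            if (i + 1 < gb.Length && high >= 129 && high <= 254)
            {
                byte low = gb[i + 1];
                if (low >= 64 && low <= 254)
                {
                    short index = CodeIndex[high - 129, low - 64];
                    if (index >= 0) Syllables.TryGetValue(index, out syllable);
                }
            }
            sb.Append(syllable ?? "?"); // unknown pair: emit a placeholder
            i += 2;
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToPinyin(new byte[] { 206, 210 }));                 // wo
        Console.WriteLine(ToPinyin(new byte[] { 0x41, 206, 210, 176, 161 })); // Awoa
    }
}
```

Syllables are appended without separators here. A real converter would probably add a space or another delimiter between them, and would need a strategy for characters with more than one reading, which this table-driven approach ignores.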

The complete C# class can be downloaded here.
