1. Preparation:
Download the pinyin reading table from the official Unicode site:
http://www.unicode.org/Public/6.0.0/ucd/
Unihan.zip
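After unzipping, extract from Unihan_Readings.txt the lines for code points 0x4e00-0x9fa5, keeping only the kMandarin (Mandarin reading) field; kHanyuPinlu covers too few characters. The lines look like this: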
U+4E00 kMandarin YI1
U+4E01 kMandarin DING1 ZHENG1
U+4E02 kMandarin KAO3 QIAO3 YU2
U+4E03 kMandarin QI1
U+4E04 kMandarin SHANG4 SHANG3
...
The Linux shell commands are as follows:
grep 'kMandarin' Unihan_Readings.txt > putong.txt
Replace kMandarin with a space:
sed 's/kMandarin/ /g' putong.txt > pug.txt
wc -l pug.txt
25550 pug.txt
Next, delete the lines that do not start with a code point in 0x4e00-0x9fa5:
grep -n "U+4E00" pug.txt
5060:U+4E00 YI1
sed '1,5059d' pug.txt > tmp.txt
grep -n "U+9FA5" tmp.txt
20252:U+9FA5 YU4
sed '20253,$d' tmp.txt > pugd.txt
Now only the lines starting with 0x4e00-0x9fa5 remain.
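The line numbers used above (5059, 20253) depend on the exact contents of one Unihan release. As a rough alternative sketch, the same filtering can be done in one pass by comparing the code point field itself. The function name `filter_mandarin` is made up for illustration, and the sketch assumes Unihan_Readings.txt is tab-separated (code point, field name, readings).

```shell
# Sketch: keep only kMandarin lines for U+4E00..U+9FA5, without hard-coded line numbers.
# Assumption: input columns are separated by tabs (code point, field name, readings).
# Usage (not run here): filter_mandarin Unihan_Readings.txt > pugd.txt
filter_mandarin() {
    # length($1) == 6 drops code points with more than four hex digits (e.g. U+20000),
    # so a plain byte-wise string comparison is enough for the range check.
    LC_ALL=C awk -F'\t' '
        $2 == "kMandarin" && length($1) == 6 && $1 >= "U+4E00" && $1 <= "U+9FA5" {
            print $1 " " $3
        }' "$1"
}
```

The output uses a single space between the code point and the readings, so the later `sort -t" " -k2` and `cut -c1-6` steps should work on it unchanged.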
Then sort by pinyin:
sort -t" " -k2 pugd.txt > pugds.txt
U+5416 A1
U+9515 A1
U+963F A1 A4 A5 E1 E3 A3
U+9312 A1 KE1
U+55C4 A2 SHA4
U+554A A5
U+54C0 AI1
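Personally, I think the characters should be ordered by pinyin and tone, then by frequency, using only the first reading of a polyphonic character. Some existing orderings look like the list below, where 阿 clearly ought to come first. I could not find a document that can be used directly, so I let it go; rarely used characters seldom show up anyway. The goal is an ordering for 0x4e00-0x9fa5 by pinyin with frequency: pinyin and tone first, then frequency, with the first reading for polyphonic characters (arrange the rest to your own taste, but the pinyin order is a must).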
[a1]吖腌錒锕阿
[a2]嗄
[a3]阿
[a4]阿
[a5]啊阿
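The "first reading only" rule can be sketched as follows. This is only an illustration, not the ordering used in the rest of these notes: it sorts by the syllable of the first reading, then by its tone digit, then by code point. Frequency is left out because no frequency data is at hand. The name `sort_by_first_reading` is hypothetical, and the sketch assumes every reading ends in a tone digit, as in the kMandarin lines shown above.

```shell
# Sketch: order code points by the FIRST reading only (syllable, then tone, then code point).
# Assumption: each line is "U+XXXX READING1 [READING2 ...]" and readings end in a tone digit.
# Usage (not run here): sort_by_first_reading pugd.txt > order.txt
sort_by_first_reading() {
    LC_ALL=C awk '{
        r = $2                              # first reading, e.g. "A1"
        tone = substr(r, length(r))         # trailing tone digit
        syl = substr(r, 1, length(r) - 1)   # syllable without the tone
        print syl "\t" tone "\t" $1
    }' "$1" | LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k2,2n -k3,3 | cut -f3
}
```

Ties within the same syllable and tone fall back to code point order here; a frequency column, if one were available, could take that place instead.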
--------------------------------------------------------------------------------------------
Then extract the first column:
cut -c1-6 pugds.txt > pugdsc.txt
U+5416
U+9515
U+963F
// Lowercase to uppercase:
//sed 's/.*/\U&/g' file
// Uppercase to lowercase:
sed 's/.*/\L&/g' pugdsc.txt > tmp.txt
// Replace u+ with a tab followed by [0x; the final array-entry format is [0x0616] = 0x0000,
sed 's/u+/\t[0x/' tmp.txt > tmp2.txt
// Append "] = 0x" to the end of each line:
sed 's/$/] = 0x/' tmp2.txt > tmp3.txt
// Append the running sequence number (the character's rank in pinyin order) to each line:
awk '/$/{gsub(/$/,sprintf("%.04x,",++i))}1' tmp3.txt > tmp4.txt
// Insert the array declaration at the top:
sed -i '1 i\static const unsigned short buf[] =\n{' tmp4.txt
// Append the array terminator:
sed '$ a\};' tmp4.txt > pugdscb.txt
--------------------------------------------------------------------------------
The result is as follows:
static const unsigned short buf[] =
{
[0x5416] = 0x0001,
[0x9515] = 0x0002,
[0x963f] = 0x0003,
[0x9312] = 0x0004,
[0x55c4] = 0x0005,
[0x554a] = 0x0006,
[0x54c0] = 0x0007,
[0x54ce] = 0x0008,
...
};
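The chain of sed and awk steps above can also be collapsed into a single awk pass. A minimal sketch, assuming the input is the sorted file (pugds.txt in these notes) whose lines begin with U+XXXX; `make_rank_table` is a hypothetical name.

```shell
# Sketch: turn a sorted "U+XXXX ..." list into the C lookup table in one pass.
# Assumption: the line order of the input is already the desired pinyin order.
# Usage (not run here): make_rank_table pugds.txt > pugdscb.txt
make_rank_table() {
    awk 'BEGIN { print "static const unsigned short buf[] =\n{" }
         {
             cp = tolower(substr($1, 3))               # drop "U+", lowercase the hex digits
             printf "\t[0x%s] = 0x%04x,\n", cp, NR     # rank = 1-based line number
         }
         END { print "};" }' "$1"
}
```

The table relies on C99 designated initializers, so code points that never appear in the input keep the value 0 and can be told apart from ranked characters.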
2. Register the collation with SQLite:
sqlite3_create_collation(db,"pinyin",SQLITE_UTF8,NULL,binCollFuncUtf8);
Reference:
https://groups.google.com/forum/#!topic/sunpinyin-developers/Q0z5J_LG7Ag