Chinese Pinyin Segmentation

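Lately at work I keep running into user queries typed as unbroken pinyin, such as handuyishe, chuntianxingganchangqun, and leisisiwa.

I couldn't find a good off-the-shelf solution and didn't want to segment against a pinyin dictionary, so after reading some material online I wrote a regular expression that matches one syllable at a time: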

[^aoeiuv]?h?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?
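
To make the pattern easier to read, here is a small sketch that splits it into three pieces and matches a single syllable at the start of a string. The class name PinyinPatternParts, the constant names, and the helper firstSyllable are made up for illustration, and the reading of v as a stand-in for ü is my assumption based on common pinyin typing habits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PinyinPatternParts {
    // Optional initial: any single character that is not one of a, o, e, i, u, v,
    // optionally followed by 'h' (this is what lets zh, ch, sh through).
    static final String INITIAL = "[^aoeiuv]?h?";
    // Optional medial: i, u, or v (assumed here to stand in for ü).
    static final String MEDIAL = "[iuv]?";
    // Optional final: multi-letter finals such as ai and ang are tried
    // before the single vowels, and a bare ng or n comes last.
    static final String FINAL = "(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?";

    // Concatenating the three pieces gives back the original expression.
    static final Pattern SYLLABLE = Pattern.compile(INITIAL + MEDIAL + FINAL);

    /** Returns the text matched at the very start of the input (possibly empty). */
    static String firstSyllable(String input) {
        Matcher m = SYLLABLE.matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        System.out.println(firstSyllable("chuntian")); // prints "chun"
        System.out.println(firstSyllable("xinggan"));  // prints "xing"
    }
}
```

Every piece is optional and every quantifier is greedy, so the pattern always grabs the longest syllable it can at the current position.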

The full program:

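import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

	// One pinyin syllable: optional initial, optional medial (i/u/v), optional final.
	public static String regEx = "[^aoeiuv]?h?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?";

	public static void main(String[] args) {

		String s = "chuntianxingganchangqun";
		List<String> tokenResult = new LinkedList<String>();
		// Compile the pattern once instead of on every iteration.
		Pattern pat = Pattern.compile(regEx);
		while (s.length() > 0) {
			Matcher matcher = pat.matcher(s);
			// lookingAt() anchors the match at the start of the remaining text.
			// A failed or empty match is treated as one character so the loop always advances.
			int tag = matcher.lookingAt() ? Math.max(matcher.end(), 1) : 1;
			String token = s.substring(0, tag);
			System.out.println(token);
			// Store the whole syllable, not just its first character.
			tokenResult.add(token);
			s = s.substring(tag);
		}
		System.out.println(tokenResult);
	}
}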
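
As a rough sketch of how the same loop could be wrapped in a reusable method and run over all three sample queries from the top of the post, consider the following. The class name PinyinSegmenter and the method segment are hypothetical, and the one-character fallback is a safety net I added rather than part of the original idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PinyinSegmenter {
    private static final Pattern SYLLABLE = Pattern.compile(
            "[^aoeiuv]?h?[iuv]?(ai|ei|ao|ou|er|ang?|eng?|ong|a|o|e|i|u|ng|n)?");

    /** Splits an unbroken pinyin string into syllables, greedily from left to right. */
    static List<String> segment(String query) {
        List<String> syllables = new ArrayList<>();
        Matcher m = SYLLABLE.matcher(query);
        int pos = 0;
        while (pos < query.length()) {
            // Restrict matching to the unconsumed tail instead of copying substrings.
            m.region(pos, query.length());
            if (m.lookingAt() && m.end() > pos) {
                syllables.add(m.group());
                pos = m.end();
            } else {
                // Safety net: emit one character so the loop can never stall.
                syllables.add(query.substring(pos, pos + 1));
                pos++;
            }
        }
        return syllables;
    }

    public static void main(String[] args) {
        String[] queries = {"handuyishe", "chuntianxingganchangqun", "leisisiwa"};
        for (String q : queries) {
            System.out.println(q + " -> " + segment(q));
        }
        // handuyishe -> [han, du, yi, she]
        // chuntianxingganchangqun -> [chun, tian, xing, gan, chang, qun]
        // leisisiwa -> [lei, si, si, wa]
    }
}
```

Because the match is greedy and never looks ahead, ambiguous input is always resolved toward the longer first syllable: segment("fangan") returns [fang, an], even when the user meant fan gan. Resolving that kind of ambiguity would take a dictionary or a language model, which is exactly what this approach tries to avoid.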



For the idea behind the expression, see

http://www.pkucn.com/viewthread.php?tid=189570&page=1
