variancePool = loadDensityFile(dataLocation + "variances", varianceFloor);
mixtureWeightsPool = loadMixtureWeights(dataLocation + "mixture_weights", mixtureWeightFloor);
transitionsPool = loadTransitionMatrices(dataLocation + "transition_matrices");
transformMatrix = loadTransformMatrix(dataLocation + "feature_transform");
senonePool = createSenonePool(distFloor, varianceFloor);
================ wsg-0.xml model format ================
senones: 4147
Gaussians per senone: 8
means: 33176 (= 4147 * 8)
variances: 33176
streams: 1
================ male_result (my) model format ================
senones: 186
Gaussians per senone: 256
means: 1024
variances: 1024
streams: 4
Analysis: it appears that for models loaded by sphinx4, some parameters are fixed, such as the number of streams and the number of Gaussians.
[Solution]
In the sphinx training configuration file, change the model type from semi to cont. Note the comment next to this setting: pocketsphinx uses the semi (semi-continuous) format, while sphinx3 uses the cont (continuous) format, which corresponds to 1 stream and 8 Gaussians per senone. Retraining with this setting produced a cont model that loaded into the sphinx4 environment and compiled fine. It ran successfully with my own model; testing with my own voice gave the results below:
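In sphinxtrain's sphinx_train.cfg this corresponds to the $CFG_HMM_TYPE variable; a sketch of the relevant lines (the exact surrounding comments vary by sphinxtrain version):

```perl
# Model type: '.semi.' for PocketSphinx, '.cont.' for Sphinx III (loadable by Sphinx4)
#$CFG_HMM_TYPE = '.semi.'; # PocketSphinx
$CFG_HMM_TYPE  = '.cont.'; # Sphinx III / Sphinx4
```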
Start speaking. Press Ctrl-C to quit.
resultList.size=1 bestFinalToken=0050 -6.8291255E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[长城][</s>]} 50 </s> -10008.177 0.0 50 长城 68886.47 0.0 4 <sil> 0.0 0.0 0 <s> 0.0 0.0 0
result=<s> <sil> 长城 </s>
resultList.size=1 bestFinalToken=0050 -6.8291255E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[长城][</s>]} 50 </s> -10008.177 0.0 50 长城 68886.47 0.0 4 <sil> 0.0 0.0
resultList.size=2 bestFinalToken=0077 -7.2286605E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[长城][</s>]} 77 </s> -10008.177 0.0 59 长城 68886.47 0.0 4 <sil> 0.0 0.0 0 <s> 0.0 0.0 1
result=<s> <sil> 长城 </s>
resultList.size=2 bestFinalToken=0077 -7.2286605E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[长城][</s>]}
best token=0077 -7.2286605E06 0.0000000E00 -1.0008177E04 lt-WordNode </s>(*SIL ) p 0.0 -10008.177{[长城][</s>]} 77 </s> -10008.177 0.0 59 长城 68886.47 0.0 4 <sil> 0.0 0.0 0 <s> 0.0 0.0
You said: [长城]
Start speaking. Press Ctrl-C to quit.
Sphinx4 code walkthrough
javax.sound.sampled provides interfaces and classes for capturing, processing, and playing back sampled audio data.
Field:
protected AudioFormat AudioInputStream.format -- the format of the audio data contained in the stream.
Methods:
AudioFormat AudioFileFormat.getFormat() -- obtains the format of the audio data contained in the audio file.
AudioFormat AudioInputStream.getFormat() -- obtains the audio format of the sound data in this audio input stream.
AudioFormat DataLine.getFormat() -- obtains the current format (encoding, sample rate, number of channels, etc.) of the data line's audio data.
AudioFormat[] DataLine.Info.getFormats() -- obtains the set of audio formats supported by the data line.
static AudioFormat[] AudioSystem.getTargetFormats(AudioFormat.Encoding targetEncoding, AudioFormat sourceFormat) -- obtains, via the installed format converters, the formats with the specified encoding that the system can derive from a stream of the specified source format.
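A minimal runnable sketch of the AudioFormat API listed above; the 16 kHz / 16-bit / mono / signed-PCM parameters match the format Sphinx4 expects, and the class name here is my own invention:

```java
import javax.sound.sampled.AudioFormat;

public class FormatDemo {
    public static void main(String[] args) {
        // 16 kHz, 16-bit, mono, signed PCM, little-endian
        AudioFormat fmt = new AudioFormat(16000f, 16, 1, true, false);
        System.out.println("sampleRate=" + fmt.getSampleRate()); // 16000.0
        System.out.println("bits=" + fmt.getSampleSizeInBits()); // 16
        System.out.println("channels=" + fmt.getChannels());     // 1
        System.out.println("encoding=" + fmt.getEncoding());     // PCM_SIGNED
    }
}
```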
Reference links:
Uses of class javax.sound.sampled.AudioFormat: http://www.766.com/doc/javax/sound/sampled/class-use/AudioFormat.html
Capturing and playing audio in pure Java: http://www.cnblogs.com/haore147/p/3662536.html
Introduction to javax.sound: http://blog.csdn.net/kangojian/article/details/4449956
Sphinx4 demo walkthrough
I. Model training requirements
1. Must train any kind (refers to tying) of model
2. Variable number of states per model
3. Mixtures with different numbers of Gaussians
4. Different kinds of densities at different states
5. Any kind of context
6. Simultaneous training of multiple feature streams or models
II. Language model training notes
[Language model steps]
1. Create a generic NGram language model interface.
2. Create an ARPA LM loader that can load language models in the ARPA format. This loader will not be optimized for efficiency, but for simplicity.
3. Create a language model for the alphabet.
4. Incorporate use of the language model into the Linguist/Search.
5. Test the natural spell with the LM to see what kind of improvement we get.
6. Create a more efficient version of the LM loader that can load very large language models.
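For reference, an ARPA-format language model is a plain text file; the toy bigram model below shows the general layout (the words and log-probabilities are made up for illustration, values are base-10 logs, and the optional trailing number on a 1-gram line is a backoff weight):

```
\data\
ngram 1=4
ngram 2=3

\1-grams:
-0.3010 <s>	-0.2
-0.6021 hello	-0.2
-0.6021 world	-0.2
-0.3010 </s>

\2-grams:
-0.1761 <s> hello
-0.3010 hello world
-0.1761 world </s>

\end\
```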
III. Acoustic model format
IV. Decoding
(1) Search strategies: breadth-first search, Viterbi decoding, depth-first search, A* decoding.
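Of these, Viterbi decoding is the workhorse of HMM-based recognizers. A self-contained toy sketch of the algorithm (the two-state model below is the classic healthy/fever textbook example, not anything taken from Sphinx4):

```java
public class Viterbi {
    // Returns the most likely hidden-state sequence for the observations.
    static int[] decode(double[] start, double[][] trans, double[][] emit, int[] obs) {
        int n = start.length, t = obs.length;
        double[][] logp = new double[t][n]; // best log-probability ending in state s at time i
        int[][] back = new int[t][n];       // backpointers for path recovery
        for (int s = 0; s < n; s++)
            logp[0][s] = Math.log(start[s]) + Math.log(emit[s][obs[0]]);
        for (int i = 1; i < t; i++) {
            for (int s = 0; s < n; s++) {
                double best = Double.NEGATIVE_INFINITY; int arg = 0;
                for (int p = 0; p < n; p++) {
                    double v = logp[i - 1][p] + Math.log(trans[p][s]);
                    if (v > best) { best = v; arg = p; }
                }
                logp[i][s] = best + Math.log(emit[s][obs[i]]);
                back[i][s] = arg;
            }
        }
        int[] path = new int[t];
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < n; s++)
            if (logp[t - 1][s] > best) { best = logp[t - 1][s]; path[t - 1] = s; }
        for (int i = t - 1; i > 0; i--) path[i - 1] = back[i][path[i]];
        return path;
    }

    public static void main(String[] args) {
        double[] start = {0.6, 0.4};                          // P(healthy), P(fever)
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};          // state transitions
        double[][] emit  = {{0.5, 0.4, 0.1}, {0.1, 0.3, 0.6}}; // P(normal|s), P(cold|s), P(dizzy|s)
        int[] obs = {0, 1, 2};                                // observed: normal, cold, dizzy
        System.out.println(java.util.Arrays.toString(decode(start, trans, emit, obs))); // prints [0, 0, 1]
    }
}
```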
V. Sphinx4 usage guide
1. File recognition, transcribing a recorded file:
public class TranscriberDemo {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Set the paths of the acoustic model, dictionary, and language model
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream(new File("test.wav")); // the file to recognize
        recognizer.startRecognition(stream); // start recognition
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) { // fetch results with getResult
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition(); // stop recognition
    }
}
2. Live recognition captures audio from the microphone:
LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(configuration);
3. Stream recognition reads audio from a stream:
StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
Note that the input audio stream must be in one of the following formats:
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 16000 Hz
or
RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
In the latter case the sample rate must be set explicitly: configuration.setSampleRate(8000);
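A WAV file's header can be checked against these requirements with the javax.sound.sampled API before feeding it to the recognizer. A self-contained sketch (it builds a short silent 16 kHz mono WAV in memory so it runs without any external file; the class name is my own):

```java
import javax.sound.sampled.*;
import java.io.*;

public class WavCheck {
    public static void main(String[] args) throws Exception {
        // Build a 16 kHz, 16-bit, mono, signed little-endian PCM WAV (1000 zero samples) in memory
        AudioFormat fmt = new AudioFormat(16000f, 16, 1, true, false);
        AudioInputStream in = new AudioInputStream(
                new ByteArrayInputStream(new byte[2000]), fmt, 1000);
        ByteArrayOutputStream wav = new ByteArrayOutputStream();
        AudioSystem.write(in, AudioFileFormat.Type.WAVE, wav);

        // Inspect the header the way one would before handing a file to Sphinx4
        AudioFileFormat aff = AudioSystem.getAudioFileFormat(
                new ByteArrayInputStream(wav.toByteArray()));
        AudioFormat f = aff.getFormat();
        System.out.println(f.getSampleRate() + " Hz, "
                + f.getSampleSizeInBits() + " bit, "
                + f.getChannels() + " channel(s)"); // 16000.0 Hz, 16 bit, 1 channel(s)
    }
}
```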
4. The sphinx4 API ships with four demos:
Transcriber - demonstrates how to transcribe a file (speech-to-text)
Dialog - demonstrates how to lead a dialog with a user
SpeakerID - speaker identification (who is speaking?)
Aligner - demonstration of audio-to-transcription timestamping (aligning audio with text)
[References]
A summary of factors that affect performance: http://www.codes51.com/article/detail_159118.html
Discussion threads on CMU Sphinx issues: http://sourceforge.net/p/cmusphinx/discussion/sphinx4/