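In short, the other party needs to train their own language model (LM) with SRILM. To let them start training quickly and to cut down on back-and-forth, we are providing a docker image with the SRILM training environment already set up.
I have used docker, and tools like virtual environments, but I have little experience building an image from scratch, so this time I decided to build one from scratch.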
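You need the archive of the latest SRILM release, currently version 1.7.3, from the download page (http://www.speech.sri.com/projects/srilm/download.html). SRI's site does not support downloading with tools such as wget,
so you have to fill in a form on the web page before you can download it. Of course, you can also look for another download link.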
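Also, following the installation instructions that ship with SRILM is rather tedious, so I install it with the install_srilm.sh script from KALDI's tools folder instead. The script can be modified a little first: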
Delete the following:
if [ "tools" != "$current_dir" ]; then
echo "You should run this script in tools/ directory!!"
exit 1
fi
and
# http://www.speech.sri.com/projects/srilm/download.html
if [ ! -f srilm.tgz ]; then
echo This script cannot install SRILM in a completely automatic
echo way because you need to put your address in a download form.
echo Please download SRILM from http://www.speech.sri.com/projects/srilm/download.html
echo put it in ./srilm.tgz, then run this script.
exit 1
fi
and
mkdir -p srilm
as well as
major=`awk -F. '{ print $1 }' RELEASE`
minor=`awk -F. '{ print $2 }' RELEASE`
micro=`awk -F. '{ print $3 }' RELEASE`

if [ $major -le 1 ] && [ $minor -le 7 ] && [ $micro -le 1 ]; then
echo "Detected version 1.7.1 or earlier. Applying patch."
patch -p0 < ../extras/srilm.patch
fi
Change the following:
bash extras/install_liblbfgs.sh || exit 1
to
bash ./install_liblbfgs.sh || exit 1
and
tar -xvzf ../srilm.tgz
to
tar -xvzf ../srilm-1.7.3.tar.gz
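The two substitutions above could also be scripted rather than made by hand. The following is only a sketch: `patch_install_script` is a hypothetical helper name, and the demo runs on a two-line stand-in rather than the real install_srilm.sh. The deletions listed earlier would still need to be done manually.

```shell
#!/bin/bash
# Hypothetical helper: copy a script from $1 to $2, applying the two
# path substitutions described above. It does not perform the deletions.
patch_install_script() {
    sed -e 's#bash extras/install_liblbfgs\.sh#bash ./install_liblbfgs.sh#' \
        -e 's#\.\./srilm\.tgz#../srilm-1.7.3.tar.gz#' \
        "$1" > "$2"
}

# Demo on a two-line stand-in for the real script.
demo_dir=$(mktemp -d)
printf '%s\n' 'bash extras/install_liblbfgs.sh || exit 1' \
              'tar -xvzf ../srilm.tgz' > "$demo_dir/in.sh"
patch_install_script "$demo_dir/in.sh" "$demo_dir/out.sh"
cat "$demo_dir/out.sh"
```

Writing to a separate output file keeps the original script untouched, which makes it easy to diff the two before copying the result into the image.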
One more library is needed: liblbfgs. It can likewise be installed with KALDI's tools/extras/install_liblbfgs.sh script.
build_sri_lm.sh
This script simply trains on the training corpus with ngram-count. Here is the training script for reference:
#!/bin/bash
# Usage: build_sri_lm.sh <corpus> <lm_suffix>
corpus=$1
lm_suffix=$2
# count 3-gram and 4-gram occurrences in the corpus
ngram-count -text "$corpus" -order 3 -write train.3.count
ngram-count -text "$corpus" -order 4 -write train.4.count
# estimate interpolated models with Kneser-Ney discounting from the counts
ngram-count -read train.3.count -order 3 -lm "3-gram.arpa.${lm_suffix}" -interpolate -kndiscount
ngram-count -read train.4.count -order 4 -lm "4-gram.arpa.${lm_suffix}" -interpolate -kndiscount
# prune the 3-gram model at two thresholds
ngram -order 3 -lm "3-gram.arpa.${lm_suffix}" -prune 1e-7 -write-lm "3-gram.arpa.pruned.1e-7.${lm_suffix}"
ngram -order 3 -lm "3-gram.arpa.${lm_suffix}" -prune 3e-7 -write-lm "3-gram.arpa.pruned.3e-7.${lm_suffix}"
# compress all models
gzip "3-gram.arpa.${lm_suffix}"
gzip "3-gram.arpa.pruned.1e-7.${lm_suffix}"
gzip "3-gram.arpa.pruned.3e-7.${lm_suffix}"
gzip "4-gram.arpa.${lm_suffix}"
#rm train.3.count
#rm train.4.count
# move the compressed models to the data directory
mv *arpa*.gz /opt/data/
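Because the script reads its two arguments without checking them and moves its results into /opt/data/, a missing argument or a missing data directory only shows up late in the run. A pre-flight check along the following lines could catch that early. This is a sketch, not part of the original script: the function name `preflight` and its return codes are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical pre-flight check for build_sri_lm.sh.
# Arguments: <corpus> <lm_suffix> [output_dir, default /opt/data]
# Returns 0 if everything looks usable, a distinct non-zero code otherwise.
preflight() {
    pf_corpus=$1
    pf_suffix=$2
    pf_out=${3:-/opt/data}
    if [ -z "$pf_corpus" ] || [ -z "$pf_suffix" ]; then
        echo "usage: build_sri_lm.sh <corpus> <lm_suffix>" >&2
        return 2
    fi
    if [ ! -s "$pf_corpus" ]; then
        echo "corpus missing or empty: $pf_corpus" >&2
        return 3
    fi
    if [ ! -d "$pf_out" ]; then
        echo "output directory not found: $pf_out" >&2
        return 4
    fi
    return 0
}
```

Placed near the top of the training script, it could be called as `preflight "$corpus" "$lm_suffix" || exit 1` before the first ngram-count invocation.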
The default file name is Dockerfile, but I wanted the file clearly labeled, so I use Dockerfile.Srilm as the file name.
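The original install_srilm.sh ends by telling you to source env.sh to set up the environment variables. Doing that inside a Dockerfile has no effect on the image, though, because each RUN instruction executes in its own shell and exported variables are gone once that shell exits. So in the end I set the variables with instructions of the form ENV FOO=FOO1.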
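The behaviour can be reproduced outside docker. In the sketch below, each `sh -c` stands in for one RUN instruction, that is, a fresh shell process. This is an analogy rather than what docker does internally, and `FOO` is just the placeholder name used above.

```shell
#!/bin/bash
# Make sure the placeholder variable is not already set in this shell.
unset FOO

# First "RUN": export the variable and read it back in the same shell.
first=$(sh -c 'export FOO=FOO1; echo "${FOO:-unset}"')

# Second "RUN": a new shell that never saw the export.
second=$(sh -c 'echo "${FOO:-unset}"')

echo "first shell sees:  $first"
echo "second shell sees: $second"
```

The second shell reports the variable as unset, which is the situation a later RUN instruction, or a container started from the image, would be in after a `source env.sh` in an earlier RUN.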
FROM debian:9.8
LABEL tagVer="Appen-Srilm-Tool"
#This is the Srilm-1.7.3 language model training tool image for Appen's ASR training
# fix the bug of apt-get in debian:9.8
#COPY ./badproxy /etc/apt/apt.conf.d/99fixbadproxy
# install tools and packages
RUN apt-get update
RUN apt-get install -y --no-install-recommends g++ gawk make git automake autoconf bzip2 unzip wget sox libtool \
python2.7 python3 ca-certificates zlib1g-dev gfortran subversion ffmpeg patch vim
# install tools and packages
RUN apt-get install -y procps \
libtool-bin python-pip python-yaml python-simplejson python-gi python-dev build-essential
# install extra tools. clean after installing. Link python2.7 to python, bash to /bin/sh
RUN apt-get clean autoclean && \
apt-get autoremove -y && \
rm -rf /var/lib/apt/lists/* && \
rm /usr/bin/python && \
ln -s /usr/bin/python2.7 /usr/bin/python && \
ln -s -f bash /bin/sh
# build srilm
RUN mkdir /opt/srilm
COPY srilm-1.7.3.tar.gz /opt/
COPY install_liblbfgs.sh /opt/
COPY install_srilm.sh /opt/
COPY build_sri_lm.sh /opt/
RUN cd /opt/ && \
bash ./install_liblbfgs.sh && \
bash ./install_srilm.sh
ENV LIBLBFGS=/opt/liblbfgs-1.10
ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}:${LIBLBFGS}/lib/.libs
ENV SRILM=/opt/srilm
ENV PATH=${PATH}:${SRILM}/bin:${SRILM}/bin/i686-m64
RUN chmod -R 777 /opt/
WORKDIR /opt
docker build -f Dockerfile.Srilm -t srilm:latest ./
The build ends with:
Successfully built 068133e78c2e
Successfully tagged srilm:latest
List the image that was just built:
docker images
Output:
REPOSITORY   TAG      IMAGE ID       CREATED          SIZE
srilm        latest   068133e78c2e   10 minutes ago   893MB
Do a trial run; the container stops as soon as the shell exits:
docker run -it srilm:latest /bin/bash
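Once inside the container, it may be worth confirming that the PATH set through ENV actually exposes the SRILM binaries. The helper below is a sketch; `check_tools` is a hypothetical name, and it relies only on the standard `command -v` lookup.

```shell
#!/bin/bash
# Hypothetical helper: report whether each named tool can be found on PATH.
# Returns 0 if all tools are found, 1 if at least one is missing.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if command -v "$ct_tool" >/dev/null 2>&1; then
            echo "ok       $ct_tool"
        else
            echo "missing  $ct_tool"
            ct_missing=1
        fi
    done
    return "$ct_missing"
}
```

Inside the container, the call would be `check_tools ngram ngram-count`, the two commands the training script uses.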
To keep the container running in the background:
docker run -dit srilm:latest /bin/bash
To get into a container running in the background, first find its container ID:
docker ps
Output:
CONTAINER ID   IMAGE          COMMAND       CREATED         STATUS         PORTS   NAMES
3448cf685b08   srilm:latest   "/bin/bash"   6 seconds ago   Up 5 seconds           cranky_curran
Then run:
docker exec -it 3448cf685b08 /bin/bash
and you are inside the container.
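Copying the container ID by hand gets tedious. As a sketch, the ID can be picked out of the `docker ps` listing by matching the IMAGE column; `ids_for_image` is a hypothetical helper, and the sample listing below is a stand-in so the snippet runs without docker.

```shell
#!/bin/bash
# Hypothetical helper: read `docker ps` output on stdin and print the
# container ID of every row whose IMAGE column equals $1.
ids_for_image() {
    awk -v img="$1" 'NR > 1 && $2 == img { print $1 }'
}

# Demo on a stand-in listing (header line plus two rows).
found=$(ids_for_image srilm:latest <<'EOF'
CONTAINER ID   IMAGE          COMMAND       CREATED         STATUS         PORTS   NAMES
3448cf685b08   srilm:latest   "/bin/bash"   6 seconds ago   Up 5 seconds           cranky_curran
9f2c11aa0b7d   debian:9.8     "/bin/bash"   2 minutes ago   Up 2 minutes           other_box
EOF
)
echo "$found"
```

With docker available, this would be used as `docker ps | ids_for_image srilm:latest`. Docker's own `docker ps -q --filter ancestor=srilm:latest` gives a similar result without any parsing.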
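With the training tool set up, having to enter the container every time just to train would make the process a bit cumbersome, so I decided to mount a local folder into the container and run from there.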
docker run -v /Users/czhang/images/local_data:/opt/data -dit srilm:latest /bin/bash
Note that the local folder must be given as an absolute path.
To try it out: write a file to /opt/data inside the container, exit, and the file shows up in the local folder /Users/czhang/images/local_data.
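Since the host side of the mount has to be an absolute path, a relative folder name can be resolved first. The sketch below uses hypothetical helper names (`abs_dir`, `mount_arg`) and a temporary directory for the demo; only `cd` and `pwd` are involved.

```shell
#!/bin/bash
# Hypothetical helper: print the absolute path of an existing directory.
# Runs in a subshell so the caller's working directory is unchanged.
abs_dir() {
    ( cd "$1" 2>/dev/null && pwd )
}

# Hypothetical helper: build the host:container argument for `docker run -v`.
mount_arg() {
    md_host=$(abs_dir "$1") || return 1
    printf '%s:/opt/data\n' "$md_host"
}

# Demo: resolve a relative folder name inside a temporary directory.
demo_base=$(mktemp -d)
mkdir "$demo_base/local_data"
demo_arg=$(cd "$demo_base" && mount_arg local_data)
echo "$demo_arg"
```

The result could then be passed on as `docker run -v "$(mount_arg local_data)" -dit srilm:latest /bin/bash`.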
docker run -v /Users/czhang/images/local_data:/opt/data -dit srilm:latest /bin/bash build_sri_lm.sh data/corpus.txt testlm
Running this command directly is all it takes.
Use docker ps to list the running containers.
CONTAINER ID   IMAGE          COMMAND       CREATED         STATUS         PORTS   NAMES
7e629e509555   srilm:latest   "/bin/bash"   6 seconds ago   Up 5 seconds           epic_cartwright
Then enter it with docker exec:
docker exec -it 7e629e509555 /bin/bash
Then run the training script:
./build_sri_lm.sh data/corpus.txt test2
docker ps -aq
docker stop $(docker ps -aq)
docker rm $(docker ps -aq)
docker rmi $(docker images -q)