word2vec Getting-Started Tutorial

The full word2vec installation process is as follows:

1. Download the word2vec source code

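Note: many articles tell you to run svn checkout http://word2vec.googlecode.com/svn/trunk/. But with the network situation in China, well, you know how that goes.
So we take a different route:
https://github.com/dav/word2vec
That is the GitHub address. The GFW showed its power once again, and git clone could not pull the repository down either, which was frustrating enough to make me want to swear.
In the end I downloaded the zip archive directly from the GitHub page and copied it to the server with scp.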
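
The download-and-copy workaround can be sketched roughly as follows. This is only an illustration: it assumes GitHub's usual branch-archive URL pattern and that curl is available on the machine that can reach GitHub; user@server and the remote directory are placeholders, not values from this post.

```shell
# Assumption: GitHub's standard archive URL pattern for a branch.
REPO=dav/word2vec
BRANCH=master
ZIP_URL="https://github.com/$REPO/archive/$BRANCH.zip"

# Download the zip locally, then copy it to the destination given as $1.
fetch_and_copy() {
  # -L follows the redirect GitHub issues for archive downloads
  curl -L -o "word2vec-$BRANCH.zip" "$ZIP_URL" &&
  scp "word2vec-$BRANCH.zip" "$1"
}

echo "archive URL: $ZIP_URL"
# On a machine that can reach GitHub (placeholder destination):
#   fetch_and_copy user@server:~/word2vec/
# Then on the server:
#   unzip word2vec-master.zip    # unpacks into word2vec-master/
```

Nothing is downloaded until fetch_and_copy is actually called, so the snippet is safe to paste and adjust first.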

2. Take a look at what is inside

[webopa@hive001 word2vec-master]$ tree -L 1
.
├── bin
├── data
├── LICENSE
├── README.md
├── scripts
└── src

4 directories, 2 files

3. cd into src and build with make

Note: the makefile in the zip downloaded from GitHub reads as follows:

1 SCRIPTS_DIR=../scripts
2 BIN_DIR=../bin
3
4 CC = gcc
5 #The -Ofast might not work with older versions of gcc; in that case, use -O2
6 CFLAGS = -lm -pthread -O2 -Wall -funroll-loops
7
8 all: word2vec word2phrase distance word-analogy compute-accuracy
9
10 word2vec : word2vec.c
11     $(CC) word2vec.c -o ${BIN_DIR}/word2vec $(CFLAGS)
12 word2phrase : word2phrase.c
13     $(CC) word2phrase.c -o ${BIN_DIR}/word2phrase $(CFLAGS)
14 distance : distance.c
15     $(CC) distance.c -o ${BIN_DIR}/distance $(CFLAGS)
16 word-analogy : word-analogy.c
17     $(CC) word-analogy.c -o ${BIN_DIR}/word-analogy $(CFLAGS)
18 compute-accuracy : compute-accuracy.c
19     $(CC) compute-accuracy.c -o ${BIN_DIR}/compute-accuracy $(CFLAGS)
20     chmod +x ${SCRIPTS_DIR}/*.sh
21
22 clean:
23     pushd ${BIN_DIR} && rm -rf word2vec word2phrase distance word-analogy compute-accuracy; popd

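Many articles online say you have to change the flag after -pthread on line 6 to -O2 (the letter O, not a zero), because older versions of gcc may not accept -Ofast, as the comment on line 5 explains. The version downloaded from GitHub already uses -O2, so no change is needed.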
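
If you would rather let the machine decide between -Ofast and -O2, a small probe like the one below is one way to do it. It is a sketch, not part of the word2vec repository: pick_opt_flag is a made-up helper name, and it assumes the compiler rejects an unsupported -O level even when asked only for a syntax check of an empty file.

```shell
# Print -Ofast if the given compiler (default: gcc) accepts it, else -O2.
# pick_opt_flag is a hypothetical helper, not something the repo provides.
pick_opt_flag() {
  cc=${1:-gcc}
  # Compile an empty C translation unit (/dev/null), syntax check only,
  # so no output file is written.
  if "$cc" -Ofast -fsyntax-only -x c /dev/null >/dev/null 2>&1; then
    echo -Ofast
  else
    echo -O2
  fi
}

echo "optimization flag: $(pick_opt_flag)"
# A variable given on the make command line overrides the one in the makefile:
#   make CFLAGS="-lm -pthread $(pick_opt_flag) -Wall -funroll-loops"
```

Overriding CFLAGS on the command line leaves the makefile itself untouched, which is handy when the same checkout is built on machines with different gcc versions.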

4. Once make has finished successfully, the previously empty bin directory contains the following files:

[webopa@hive001 word2vec-master]$ tree bin
bin
├── compute-accuracy
├── distance
├── word2phrase
├── word2vec
└── word-analogy

0 directories, 5 files

These are the binaries we just compiled.

5. Run a demo

cd into scripts and take a look at demo-classes.sh:
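
Instead of eyeballing the tree output, the same check can be scripted. The sketch below uses the five tool names from the listing above; check_bins is a hypothetical helper name, and the ../bin path assumes you are running it from the src directory.

```shell
# Report any of the five expected tools that is missing or not executable.
# Returns 0 when all are present, 1 otherwise.
check_bins() {
  bin_dir=$1
  missing=0
  for tool in word2vec word2phrase distance word-analogy compute-accuracy; do
    if [ ! -x "$bin_dir/$tool" ]; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Assumes the current directory is src, so the binaries live in ../bin.
if check_bins ../bin; then
  echo "all five tools are built"
fi
```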

DATA_DIR=../data
SRC_DIR=../src
BIN_DIR=../bin

TEXT_DATA=$DATA_DIR/text8
CLASSES_DATA=$DATA_DIR/classes.txt

pushd ${SRC_DIR} && make; popd


if [ ! -e $CLASSES_DATA ]; then

  if [ ! -e $TEXT_DATA ]; then
    wget http://mattmahoney.net/dc/text8.zip -O $DATA_DIR/text8.gz
    gzip -d $DATA_DIR/text8.gz -f
  fi
  echo -----------------------------------------------------------------------------------------------------
  echo -- Training vectors...
  time $BIN_DIR/word2vec -train $TEXT_DATA -output $CLASSES_DATA -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500

fi

sort $CLASSES_DATA -k 2 -n > $DATA_DIR/classes.sorted.txt
echo The word classes were saved to file $DATA_DIR/classes.sorted.txt
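
The script hard-codes -threads 12. If your machine has a different number of cores, one possible tweak is to derive the value at run time. The sketch below is not part of the repository; cpu_count is a made-up helper, and it assumes that either nproc (GNU coreutils) or getconf is available, falling back to 1 otherwise.

```shell
# Print the number of online CPUs, or 1 if it cannot be determined.
cpu_count() {
  nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

THREADS=$(cpu_count)
echo "would train with -threads $THREADS"
# In demo-classes.sh, "-threads 12" could then be written as "-threads $THREADS".
```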

The first time you run this script, it downloads the text8.gz file.
Once that file has been downloaded, run the script again:

[webopa@hive001 scripts]$ ./demo-classes.sh
~/lei.wang/word2vec/word2vec-master/src ~/lei.wang/word2vec/word2vec-master/scripts
gcc word2vec.c -o ../bin/word2vec -lm -pthread -O2 -Wall -funroll-loops
gcc word2phrase.c -o ../bin/word2phrase -lm -pthread -O2 -Wall -funroll-loops
gcc distance.c -o ../bin/distance -lm -pthread -O2 -Wall -funroll-loops
gcc word-analogy.c -o ../bin/word-analogy -lm -pthread -O2 -Wall -funroll-loops
gcc compute-accuracy.c -o ../bin/compute-accuracy -lm -pthread -O2 -Wall -funroll-loops
chmod +x ../scripts/*.sh
~/lei.wang/word2vec/word2vec-master/scripts
-----------------------------------------------------------------------------------------------------
-- Training vectors...
Starting training using file ../data/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000122  Progress: 99.58%  Words/thread/sec: 14.48k
real    2m52.606s
user    20m24.567s
sys 0m1.328s
The word classes were saved to file ../data/classes.sorted.txt
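
With -classes set, word2vec writes one "word class_id" pair per line, which is why the script can sort on column 2. A couple of awk helpers make the result easier to explore. This is a sketch: the function names are hypothetical, and the three-line sample is made up purely to show the layout, not taken from a real run.

```shell
# Print every word assigned to the given class id (column 2).
words_in_class() {
  awk -v id="$2" '$2 == id { print $1 }' "$1"
}

# Print "class_id word_count" per class, largest classes first.
class_sizes() {
  awk '{ n[$2]++ } END { for (c in n) print c, n[c] }' "$1" | sort -k2,2nr -k1,1n
}

# Made-up sample in the same "word class_id" layout (not real output).
sample=$(mktemp)
printf '%s\n' 'paris 0' 'london 0' 'apple 1' > "$sample"
words_in_class "$sample" 0
class_sizes "$sample"
rm -f "$sample"
```

On the real output you would call, for example, words_in_class ../data/classes.sorted.txt 42 to list the words that ended up in class 42.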
