Getting Started with NLP Using fastText

Recommendation algorithms are a dead end with no way out: first-tier companies are saturated, second-tier ones don't need them, and third-tier ones need them even less. NLP, by contrast, has openings in both first and second tier. The two aren't in conflict; NLP can even help build better recommendation, but NLP's road is much wider. Second-tier companies want CV and NLP, yet I've never heard of one hiring for recommendation, not even search. It's been a rough stretch. As Li Bai wrote: I walk out the door laughing at the sky; how could I be one to rot among the weeds?

1. Take the text8 dataset as an example. Its contents are pure text, with no punctuation in sight:

$ head text8 
 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institutions anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist
...
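If you don't have text8 locally, it is Matt Mahoney's 100 MB cleaned Wikipedia excerpt; assuming wget and unzip are available, it can be fetched like this:

$ wget http://mattmahoney.net/dc/text8.zip
$ unzip text8.zip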

fastText training is a single command:

$ fasttext skipgram -input text8  -output text8_ft
Read 17M words
Number of words:  71290
Number of labels: 0

Progress:  99.9% words/sec/thread:    7369 lr:  0.000071 avg.loss:  1.769715 ETA:   0h 0m

Training produces two files, text8_ft.bin and text8_ft.vec. The former is the full binary model (parameters plus hyperparameters, reusable later for inference); the latter is a plain-text file of word vectors:

$ head -n 3 text8_ft.vec 
71290 100
the -0.12909 0.055616 -0.3696 -0.18719 -0.12103 0.2775 0.01854 0.20936 0.09752 -0.19813 -0.21285 0.083418 0.016162 0.14121 -0.24223 -0.0094061 0.028332 0.37123 -0.11781 -0.19176 0.37214 -0.039365 -0.097866 -0.0050016 -0.28954 -0.45515 0.014649 0.23746 0.047788 0.20878 0.03758 0.058175 -0.14702 -0.059107 -0.056708 0.18481 -0.24313 -0.077047 -0.052451 -0.14229 0.44761 -0.078969 -0.34184 0.23616 0.13514 0.10314 -0.0066068 0.43448 0.65234 0.47869 0.4127 0.41117 0.03521 -0.10341 -0.19444 0.28458 0.011618 0.086314 -0.31977 -0.14575 0.11906 0.009967 0.21598 0.1113 0.020127 -0.10383 0.066471 -0.22908 -0.1199 0.25135 0.19716 -0.072622 0.23 -0.025951 0.042014 0.021876 0.039729 -0.55375 0.095557 0.46388 -0.37981 0.00016889 0.2524 -0.32383 0.32751 -0.44859 -0.10094 -0.12716 -0.40568 -0.060579 -0.12642 0.17714 -0.079242 -0.13409 0.058547 0.13197 -0.015039 0.025645 -0.066081 0.12708 

of -0.16715 -0.0010567 -0.19993 -0.093281 -0.085323 0.16803 0.02827 0.21193 -0.18461 0.030626 0.061337 0.20897 -0.048829 0.072872 -0.24804 0.14995 -0.10427 0.25362 -0.25377 -0.046909 0.23103 -0.13958 0.096698 -0.13873 -0.18816 -0.33925 -0.12769 0.025515 0.1927 -0.23886 -0.096003 0.12565 -0.38746 -0.17257 -0.016184 0.11388 -0.11505 -0.135 0.18531 -0.31078 0.25641 -0.21784 -0.23305 0.48851 0.29054 -0.093619 -0.088168 0.40308 0.49952 0.4213 0.18668 0.26579 -0.10406 -0.0013798 -0.15389 0.31486 0.036097 0.032645 -0.11297 0.26994 -0.031791 0.034534 0.0045391 -0.082605 0.16027 -0.1163 0.045438 -0.18456 -0.033046 0.14392 0.38028 0.00054076 0.17435 0.008556 0.19375 -0.020889 0.17603 -0.48627 0.0014847 0.23283 -0.18314 -0.071 -0.028154 -0.34701 0.20839 -0.21952 -0.1269 -0.01303 -0.34134 -0.018452 -0.088293 0.1442 -0.010917 -0.18804 0.029666 0.12227 -0.059641 -0.099701 0.080151 0.098683 

The header line records a vocabulary of 71,290 words with an embedding dimension of 100.
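As a quick sanity check, the .vec line count should equal the vocabulary size plus the one-line header:

$ wc -l text8_ft.vec
# expect 71291 lines: 1 header + 71290 word vectors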

2. The OOV problem (out-of-vocabulary words): how do we get vectors for words never seen in training? We can infer them from the .bin model above.

$ cat testQuery.txt 
what the fuck
out of bug

For arbitrarily chosen words we get the results below, saved into the file results:

$ fasttext print-word-vectors text8_ft.bin < testQuery.txt >results
$ head results 
what -0.057545 -0.48528 -0.20754 -0.15859 -0.14724 0.039533 0.23823 -0.010322 -0.11841 0.2602 -0.071378 0.045908 -0.1794 0.13509 0.42207 -0.073658 -0.085075 -0.010533 -0.30685 -0.23157 -0.0038759 -0.22726 0.11984 0.097364 -0.32854 -0.12644 0.10312 0.05729 0.0088756 -0.12448 0.12922 0.16195 0.22631 -0.14809 0.015782 0.88848 -0.22506 0.31695 -0.017969 0.067788 0.022775 -0.30599 0.10087 0.57101 0.32064 0.16622 -0.17665 -0.064036 0.79752 0.46684 0.43368 0.36142 0.076338 0.21368 0.051775 -0.24059 0.34093 0.19272 -0.43182 -0.10237 -0.07673 0.081198 0.030859 -0.30472 -0.072027 -0.049737 0.025858 0.20029 0.23727 0.21938 0.40949 -0.066096 0.21677 -0.35277 0.12356 -0.26148 0.34904 -0.2038 -0.20233 -0.11801 -0.24752 0.33782 0.0098645 -0.38913 -0.19182 0.11744 -0.065232 -0.13656 -0.4755 0.10589 -0.20734 0.033725 -0.092295 0.083127 -0.26734 0.29432 0.2051 -0.1562 -0.041519 0.1008 
the -0.12909 0.055616 -0.3696 -0.18719 -0.12103 0.2775 0.01854 0.20936 0.09752 -0.19813 -0.21285 0.083418 0.016162 0.14121 -0.24223 -0.0094061 0.028332 0.37123 -0.11781 -0.19176 0.37214 -0.039365 -0.097866 -0.0050016 -0.28954 -0.45515 0.014649 0.23746 0.047788 0.20878 0.03758 0.058175 -0.14702 -0.059107 -0.056708 0.18481 -0.24313 -0.077047 -0.052451 -0.14229 0.44761 -0.078969 -0.34184 0.23616 0.13514 0.10314 -0.0066068 0.43448 0.65234 0.47869 0.4127 0.41117 0.03521 -0.10341 -0.19444 0.28458 0.011618 0.086314 -0.31977 -0.14575 0.11906 0.009967 0.21598 0.1113 0.020127 -0.10383 0.066471 -0.22908 -0.1199 0.25135 0.19716 -0.072622 0.23 -0.025951 0.042014 0.021876 0.039729 -0.55375 0.095557 0.46388 -0.37981 0.00016889 0.2524 -0.32383 0.32751 -0.44859 -0.10094 -0.12716 -0.40568 -0.060579 -0.12642 0.17714 -0.079242 -0.13409 0.058547 0.13197 -0.015039 0.025645 -0.066081 0.12708 
fuck -0.058611 0.057183 -0.041783 -0.37217 0.14209 0.34844 -0.63363 -0.36179 -0.072163 0.91156 0.03035 0.11818 0.67802 0.081026 0.64936 -0.12426 0.22982 -0.23246 0.040846 0.041818 0.27794 -0.0099458 -0.19554 0.54899 -0.44809 -0.31202 -0.22453 0.10881 -0.036528 -0.12731 0.40714 0.065295 0.57494 0.034111 0.3151 -0.031521 0.71399 -0.014006 -0.12132 0.23345 0.70018 -0.050306 0.36475 0.52981 0.25617 -0.3498 -0.25729 -0.19234 0.39339 0.050153 0.59596 -0.41099 -0.16302 -0.37753 -0.31371 -0.1496 0.19898 -0.33186 -1.0232 0.22755 0.71151 -0.025874 -0.10878 -0.76363 -0.80891 -0.10293 0.61912 0.5186 0.30178 0.032113 0.50403 0.14278 0.35163 -0.37008 -0.40752 -0.62272 0.50291 -0.096062 -0.23859 0.21181 0.49698 0.71006 0.25118 -0.61219 -0.16518 -0.083687 0.2768 -0.13805 -0.71201 0.40129 -0.080268 -0.15334 0.21017 0.075741 -0.5743 -0.15687 0.84504 -0.74026 0.51993 0.20547 
out 0.095084 -0.34668 -0.29661 0.36503 -0.049586 0.52637 0.21526 0.0082911 -0.33428 0.26074 -0.11496 0.40547 -0.0020223 0.29337 0.039203 0.10698 -0.37423 0.22085 -0.037315 0.092291 0.21265 -0.11413 -0.1042 0.047826 0.083402 -0.1864 0.1972 -0.35872 0.071064 -0.32934 -0.14132 0.26032 -0.00452 0.039306 0.21692 0.28521 0.11242 0.32081 0.0083984 -0.32079 0.25809 -0.52832 -0.032795 0.31803 0.361 0.081924 -0.32014 0.039908 0.6 0.47681 0.13996 0.11896 0.059675 -0.33345 -0.10751 0.089404 0.37752 -0.07873 -0.16767 0.1458 -0.10502 -0.18125 0.24368 0.1482 -0.41592 0.13236 0.22565 -0.0059395 0.1614 0.046295 0.45359 -0.12962 0.33642 -0.21669 -0.27091 -0.16509 0.18419 -0.27586 0.12269 -0.012149 -0.23497 0.20923 0.43814 -0.32106 -0.17071 -0.0025727 -0.025948 -0.071002 -0.2163 0.12129 0.17356 -0.159 -0.26937 0.21498 0.11852 -0.014236 0.28358 -0.30305 0.20611 -0.20913 
of -0.16715 -0.0010567 -0.19993 -0.093281 -0.085323 0.16803 0.02827 0.21193 -0.18461 0.030626 0.061337 0.20897 -0.048829 0.072872 -0.24804 0.14995 -0.10427 0.25362 -0.25377 -0.046909 0.23103 -0.13958 0.096698 -0.13873 -0.18816 -0.33925 -0.12769 0.025515 0.1927 -0.23886 -0.096003 0.12565 -0.38746 -0.17257 -0.016184 0.11388 -0.11505 -0.135 0.18531 -0.31078 0.25641 -0.21784 -0.23305 0.48851 0.29054 -0.093619 -0.088168 0.40308 0.49952 0.4213 0.18668 0.26579 -0.10406 -0.0013798 -0.15389 0.31486 0.036097 0.032645 -0.11297 0.26994 -0.031791 0.034534 0.0045391 -0.082605 0.16027 -0.1163 0.045438 -0.18456 -0.033046 0.14392 0.38028 0.00054076 0.17435 0.008556 0.19375 -0.020889 0.17603 -0.48627 0.0014847 0.23283 -0.18314 -0.071 -0.028154 -0.34701 0.20839 -0.21952 -0.1269 -0.01303 -0.34134 -0.018452 -0.088293 0.1442 -0.010917 -0.18804 0.029666 0.12227 -0.059641 -0.099701 0.080151 0.098683 
bug -0.16104 0.22345 -0.52171 -0.049254 0.36398 -0.03377 -0.51757 -0.13128 -0.033654 0.44559 -0.73595 -0.17421 -0.061673 0.15399 -0.17079 -0.35185 0.2719 0.58866 -0.18934 -0.38255 -0.55436 5.6939e-05 0.47935 0.79757 -0.21634 -0.27231 -0.7705 -0.27486 -0.080184 -0.13623 0.25086 0.55783 0.23359 0.079897 0.24158 0.45196 -0.034684 0.070867 -0.47792 -0.44604 -0.17802 -0.40082 0.16075 0.36177 0.85764 -0.13079 -0.21857 -0.24954 -0.1655 0.20273 0.028715 0.54311 -0.16729 0.041986 -0.14236 -0.022988 0.77909 0.038478 -0.59859 -0.084233 0.39918 -0.36386 -0.12653 -0.41765 -0.28527 0.25547 0.1974 0.17408 0.28804 0.79494 -0.016819 -0.025348 0.3845 -0.35161 -0.3202 0.48525 0.01959 0.32804 -0.31761 0.44232 0.13141 0.17387 0.0097161 0.052898 0.24716 0.050469 0.073792 0.026017 -0.72611 0.41077 0.25149 0.16558 -0.12419 -0.86742 0.26589 -0.42548 0.26709 0.061441 0.24726 0.25026
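Why does this work even for unseen words? fastText represents a word as the sum of its character n-gram vectors, so even a misspelling gets a usable vector. One way to eyeball this is the interactive nn subcommand; querying a misspelled word (a sketch, exact neighbors will vary from run to run) should surface the correctly spelled word among the nearest neighbors:

$ fasttext nn text8_ft.bin
Query word? anarchim
# prints the 10 nearest neighbors; the shared subwords of the typo
# "anarchim" should place it close to "anarchism"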

Of course, if the dim above (i.e., the embedding size, default 100) is too large, you can set it yourself:

$ fasttext skipgram -input text8 -dim 10  -output text8_ft10
$ head -n 3 text8_ft10.vec 
71290 10
the -0.69471 -0.35273 0.18617 -0.3283 0.28874 0.35978 -0.50711 -0.11573 -0.30905 -0.58648 
of -0.87699 -0.46422 0.10984 -0.15627 0.49961 0.22101 -0.40932 -0.24884 -0.20546 -0.54027 
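Besides skipgram, fastText also offers the cbow model (predicting a word from its context rather than the reverse); the invocation is symmetric, e.g.:

$ fasttext cbow -input text8 -dim 10 -output text8_cbow10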

3. Text classification

The dataset is the DBpedia ontology classification dataset: 14 classes, with 40k samples per class for training and 5k for testing, so 560k training samples and 70k test samples in total.

$ cat classes.txt 
Company
EducationalInstitution
Artist
Athlete
OfficeHolder
MeanOfTransportation
Building
NaturalPlace
Village
Animal
Plant
Album
Film
WrittenWork

The labels run from 1 to 14, and the dataset format is as follows:

1,"Bergan Mercy Medical Center"," Bergan Mercy Medical Center is a hospital located in Omaha Nebraska. It is part of the Alegent Health System."
1,"The Unsigned Guide"," The Unsigned Guide is an online contacts directory and careers guide for the UK music industry. Founded in 2003 and first published as a printed directory The Unsigned Guide became an online only resource in November 2011."
...
14,"The Blithedale Romance"," The Blithedale Romance (1852) is Nathaniel Hawthorne's third major romance. In Hawthorne (1879) Henry James called it the lightest the brightest the liveliest of Hawthorne's unhumorous fictions."
14,"Razadarit Ayedawbon"," Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေးတော်ပုံ) is a Burmese chronicle covering the history of Ramanya from 1287 to 1421. The chronicle consists of accounts of court intrigues rebellions diplomatic missions wars etc. About half of the chronicle is devoted to the reign of King Razadarit (r."
14,"The Vinyl Cafe Notebooks"," Vinyl Cafe Notebooks: a collection of essays from The Vinyl Cafe (2010) is Stuart McLean's ninth book and each one has been a Canadian bestseller. McLean has sold over 1 million books in Canada. Unlike the other Vinyl Cafe books these are not Dave and Morley stories.Selected from 15 years of radio-show archives and re-edited by the author this eclectic essay collection provides a glimpse into the thoughtful mind at work behind The Vinyl Cafe."

Preprocess this data with the following shell functions:

# shuffle the lines of stdin
myshuf() {
  perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@";
}

# lowercase, prefix each line with __label__, space out punctuation,
# strip quotes/semicolons/colons, squeeze spaces, then shuffle
normalize_text() {
  tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
    sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
        -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
        -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " " | myshuf
}
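To produce the dbpedia.train and dbpedia.test files used below, pipe the raw CSVs through normalize_text (the dbpedia_csv/train.csv and dbpedia_csv/test.csv paths assume the dataset tarball was extracted in place):

$ cat dbpedia_csv/train.csv | normalize_text > dbpedia.train
$ cat dbpedia_csv/test.csv  | normalize_text > dbpedia.test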

The preprocessed result is shown below. The script shuffles lines, but I've paired them up here so you can see what the preprocessing did: every line gets the __label__ prefix, uppercase becomes lowercase, quotes are stripped, and punctuation such as commas, periods, and parentheses is set off with spaces.

1,"TY KU"," TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits. The privately-held company was founded in 2004 and is headquartered in New York City New York. While based in New York TY KU's beverages are made in Japan through a joint venture with two sake breweries. Since 2011 TY KU's growth has extended its products into all 50 states."
1,"Odd Lot Entertainment"," OddLot Entertainment founded in 2001 by longtime producers Gigi Pritzker and Deborah Del Prete (The Wedding Planner) is a film production and financing company based in Culver City California.OddLot produced the film version of Orson Scott Card's sci-fi novel Ender's Game. A film version of this novel had been in the works in one form or another for more than a decade by the time of its release."

# after preprocessing
__label__1 , ty ku , ty ku /taɪkuː/ is an american alcoholic beverage company that specializes in sake and other spirits . the privately-held company was founded in 2004 and is headquartered in new york city new york . while based in new york ty ku ' s beverages are made in japan through a joint venture with two sake breweries . since 2011 ty ku ' s growth has extended its products into all 50 states . 
__label__1 , odd lot entertainment , oddlot entertainment founded in 2001 by longtime producers gigi pritzker and deborah del prete ( the wedding planner ) is a film production and financing company based in culver city california . oddlot produced the film version of orson scott card ' s sci-fi novel ender ' s game . a film version of this novel had been in the works in one form or another for more than a decade by the time of its release . 

With the preprocessed training (dbpedia.train) and test (dbpedia.test) files, run supervised training:

$ fasttext supervised -input dbpedia.train -output trainout -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4

The training output is again a .bin and a .vec file. As shown below, we get 800k+ token vectors at dim=10, and even commas and periods have vectors:

$ head -n 6 trainout.vec 
802981 10
the 0.48158 0.13413 -0.5119 0.62694 0.089501 -0.024228 -0.13503 0.23139 0.041772 0.081158 
. -0.61252 -0.32307 0.78123 -0.56232 -0.0014737 -0.019952 0.22725 0.065144 -0.23527 -0.053442 
, -0.38554 -0.35668 0.071955 0.54615 -0.041367 -0.010555 -0.11941 0.3101 -0.077714 -0.35903 
in 0.159 -0.21333 0.048756 -0.058684 1.0204 0.54013 1.2182 -0.02415 -0.004165 0.6187 
of -0.078618 -0.11361 -0.32771 0.63844 -0.79154 0.32892 -0.55461 -0.47428 -0.6273 0.51869 

Then evaluate on the test set (P@1 and R@1 are precision and recall at rank 1) and run prediction:

$ fasttext test trainout.bin dbpedia.test 
N	70000
P@1	0.985
R@1	0.985
$ fasttext predict trainout.bin dbpedia.test >dbpedia.test.predict
$ head dbpedia.test.predict 
__label__3
__label__6
__label__6
__label__4
__label__7
__label__2
__label__14
__label__9
__label__13
__label__3
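If you want the classifier's confidence rather than just the hard label, the predict-prob subcommand prints the top-k labels together with their probabilities, e.g. the top 3 per test sample:

$ fasttext predict-prob trainout.bin dbpedia.test 3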

This kind of text classification is useful in recommendation, e.g. classifying articles into entertainment gossip, finance, fashion, current politics, and so on, to tag crawled news. Those tags then become a feature of the item-side content profile.
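And when a dense feature is more useful than a hard tag (say, as input to a downstream ranking model), the same supervised model can emit a document embedding via print-sentence-vectors; the piped-in text here is just a placeholder for a crawled article:

$ echo "some crawled news article text" | fasttext print-sentence-vectors trainout.bin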

4. For steps 1-2 above you can also use the enwik9 dataset, a Wikipedia dump in XML format:

$ head enwik9
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" ... xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.6alpha</generator>
    <case>first-letter</case>
      <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />

It needs preprocessing; the preprocessing script is here (wikifil.pl). It filters a Wikipedia XML dump down to "clean" text consisting only of lowercase letters (a-z, converted from A-Z) and nonconsecutive spaces; all other characters become spaces. Only text that would normally render in a web browser is kept: tables are removed, image captions are preserved, links are converted to plain text, and numbers are spelled out.
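For reference, enwik9 is the first 10^9 bytes of English Wikipedia from the Hutter Prize. It and the filter script can be fetched as follows (the wikifil.pl URL points at the copy shipped in the fastText repo; treat that location as my assumption):

$ wget http://mattmahoney.net/dc/enwik9.zip && unzip enwik9.zip
$ wget https://raw.githubusercontent.com/facebookresearch/fastText/master/wikifil.pl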

$ perl wikifil.pl enwik9 > file9

The resulting file9 is plain text with no commas or periods at all. Slick.

$ fasttext skipgram -input file9 -output file9out -lr 0.025 -dim 10 -ws 5 -epoch 3 -minCount 5 -neg 5 -loss ns -bucket 2000000 -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100

The parameters mean the following (defaults in brackets):

  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]
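To make -minn/-maxn concrete: a word's vector is the sum of the vectors of its character n-grams plus the word itself, with < and > marking word boundaries. Following the example in the fastText paper, the word where with n=3 contributes the n-grams:

<wh, whe, her, ere, re>

plus the special whole-word sequence <where>. With -minn 3 -maxn 6 as above, all n-grams of length 3 through 6 are included, and this subword composition is exactly what lets the model build vectors for OOV words in section 2.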

Follow this column for more.

May we meet again one day, and may you still remember the topics we once discussed.
