head text8
anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic institutions anarchists advocate social relations based upon voluntary association of autonomous individuals mutual aid and self governance while anarchism is most easily defined by what it is against anarchists also offer positive visions of what they believe to be a truly free society however ideas about how an anarchist
$ fasttext skipgram -input text8 -output text8_ft
Read 17M words
Number of words: 71290
Number of labels: 0
Progress: 99.9% words/sec/thread: 7369 lr: 0.000071 avg.loss: 1.769715 ETA: 0h 0
$ head -n 3 text8_ft.vec
71290 100
the -0.12909 0.055616 -0.3696 -0.18719 -0.12103 0.2775 0.01854 0.20936 0.09752 -0.19813 -0.21285 0.083418 0.016162 0.14121 -0.24223 -0.0094061 0.028332 0.37123 -0.11781 -0.19176 0.37214 -0.039365 -0.097866 -0.0050016 -0.28954 -0.45515 0.014649 0.23746 0.047788 0.20878 0.03758 0.058175 -0.14702 -0.059107 -0.056708 0.18481 -0.24313 -0.077047 -0.052451 -0.14229 0.44761 -0.078969 -0.34184 0.23616 0.13514 0.10314 -0.0066068 0.43448 0.65234 0.47869 0.4127 0.41117 0.03521 -0.10341 -0.19444 0.28458 0.011618 0.086314 -0.31977 -0.14575 0.11906 0.009967 0.21598 0.1113 0.020127 -0.10383 0.066471 -0.22908 -0.1199 0.25135 0.19716 -0.072622 0.23 -0.025951 0.042014 0.021876 0.039729 -0.55375 0.095557 0.46388 -0.37981 0.00016889 0.2524 -0.32383 0.32751 -0.44859 -0.10094 -0.12716 -0.40568 -0.060579 -0.12642 0.17714 -0.079242 -0.13409 0.058547 0.13197 -0.015039 0.025645 -0.066081 0.12708
of -0.16715 -0.0010567 -0.19993 -0.093281 -0.085323 0.16803 0.02827 0.21193 -0.18461 0.030626 0.061337 0.20897 -0.048829 0.072872 -0.24804 0.14995 -0.10427 0.25362 -0.25377 -0.046909 0.23103 -0.13958 0.096698 -0.13873 -0.18816 -0.33925 -0.12769 0.025515 0.1927 -0.23886 -0.096003 0.12565 -0.38746 -0.17257 -0.016184 0.11388 -0.11505 -0.135 0.18531 -0.31078 0.25641 -0.21784 -0.23305 0.48851 0.29054 -0.093619 -0.088168 0.40308 0.49952 0.4213 0.18668 0.26579 -0.10406 -0.0013798 -0.15389 0.31486 0.036097 0.032645 -0.11297 0.26994 -0.031791 0.034534 0.0045391 -0.082605 0.16027 -0.1163 0.045438 -0.18456 -0.033046 0.14392 0.38028 0.00054076 0.17435 0.008556 0.19375 -0.020889 0.17603 -0.48627 0.0014847 0.23283 -0.18314 -0.071 -0.028154 -0.34701 0.20839 -0.21952 -0.1269 -0.01303 -0.34134 -0.018452 -0.088293 0.1442 -0.010917 -0.18804 0.029666 0.12227 -0.059641 -0.099701 0.080151 0.098683
2. For out-of-vocabulary (OOV) words, a word vector can still be obtained by running inference with the .bin model trained above:
$ cat testQuery.txt
what the fuck
out of bug
$ fasttext print-word-vectors text8_ft.bin < testQuery.txt >results
$ head results
what -0.057545 -0.48528 -0.20754 -0.15859 -0.14724 0.039533 0.23823 -0.010322 -0.11841 0.2602 -0.071378 0.045908 -0.1794 0.13509 0.42207 -0.073658 -0.085075 -0.010533 -0.30685 -0.23157 -0.0038759 -0.22726 0.11984 0.097364 -0.32854 -0.12644 0.10312 0.05729 0.0088756 -0.12448 0.12922 0.16195 0.22631 -0.14809 0.015782 0.88848 -0.22506 0.31695 -0.017969 0.067788 0.022775 -0.30599 0.10087 0.57101 0.32064 0.16622 -0.17665 -0.064036 0.79752 0.46684 0.43368 0.36142 0.076338 0.21368 0.051775 -0.24059 0.34093 0.19272 -0.43182 -0.10237 -0.07673 0.081198 0.030859 -0.30472 -0.072027 -0.049737 0.025858 0.20029 0.23727 0.21938 0.40949 -0.066096 0.21677 -0.35277 0.12356 -0.26148 0.34904 -0.2038 -0.20233 -0.11801 -0.24752 0.33782 0.0098645 -0.38913 -0.19182 0.11744 -0.065232 -0.13656 -0.4755 0.10589 -0.20734 0.033725 -0.092295 0.083127 -0.26734 0.29432 0.2051 -0.1562 -0.041519 0.1008
the -0.12909 0.055616 -0.3696 -0.18719 -0.12103 0.2775 0.01854 0.20936 0.09752 -0.19813 -0.21285 0.083418 0.016162 0.14121 -0.24223 -0.0094061 0.028332 0.37123 -0.11781 -0.19176 0.37214 -0.039365 -0.097866 -0.0050016 -0.28954 -0.45515 0.014649 0.23746 0.047788 0.20878 0.03758 0.058175 -0.14702 -0.059107 -0.056708 0.18481 -0.24313 -0.077047 -0.052451 -0.14229 0.44761 -0.078969 -0.34184 0.23616 0.13514 0.10314 -0.0066068 0.43448 0.65234 0.47869 0.4127 0.41117 0.03521 -0.10341 -0.19444 0.28458 0.011618 0.086314 -0.31977 -0.14575 0.11906 0.009967 0.21598 0.1113 0.020127 -0.10383 0.066471 -0.22908 -0.1199 0.25135 0.19716 -0.072622 0.23 -0.025951 0.042014 0.021876 0.039729 -0.55375 0.095557 0.46388 -0.37981 0.00016889 0.2524 -0.32383 0.32751 -0.44859 -0.10094 -0.12716 -0.40568 -0.060579 -0.12642 0.17714 -0.079242 -0.13409 0.058547 0.13197 -0.015039 0.025645 -0.066081 0.12708
fuck -0.058611 0.057183 -0.041783 -0.37217 0.14209 0.34844 -0.63363 -0.36179 -0.072163 0.91156 0.03035 0.11818 0.67802 0.081026 0.64936 -0.12426 0.22982 -0.23246 0.040846 0.041818 0.27794 -0.0099458 -0.19554 0.54899 -0.44809 -0.31202 -0.22453 0.10881 -0.036528 -0.12731 0.40714 0.065295 0.57494 0.034111 0.3151 -0.031521 0.71399 -0.014006 -0.12132 0.23345 0.70018 -0.050306 0.36475 0.52981 0.25617 -0.3498 -0.25729 -0.19234 0.39339 0.050153 0.59596 -0.41099 -0.16302 -0.37753 -0.31371 -0.1496 0.19898 -0.33186 -1.0232 0.22755 0.71151 -0.025874 -0.10878 -0.76363 -0.80891 -0.10293 0.61912 0.5186 0.30178 0.032113 0.50403 0.14278 0.35163 -0.37008 -0.40752 -0.62272 0.50291 -0.096062 -0.23859 0.21181 0.49698 0.71006 0.25118 -0.61219 -0.16518 -0.083687 0.2768 -0.13805 -0.71201 0.40129 -0.080268 -0.15334 0.21017 0.075741 -0.5743 -0.15687 0.84504 -0.74026 0.51993 0.20547
out 0.095084 -0.34668 -0.29661 0.36503 -0.049586 0.52637 0.21526 0.0082911 -0.33428 0.26074 -0.11496 0.40547 -0.0020223 0.29337 0.039203 0.10698 -0.37423 0.22085 -0.037315 0.092291 0.21265 -0.11413 -0.1042 0.047826 0.083402 -0.1864 0.1972 -0.35872 0.071064 -0.32934 -0.14132 0.26032 -0.00452 0.039306 0.21692 0.28521 0.11242 0.32081 0.0083984 -0.32079 0.25809 -0.52832 -0.032795 0.31803 0.361 0.081924 -0.32014 0.039908 0.6 0.47681 0.13996 0.11896 0.059675 -0.33345 -0.10751 0.089404 0.37752 -0.07873 -0.16767 0.1458 -0.10502 -0.18125 0.24368 0.1482 -0.41592 0.13236 0.22565 -0.0059395 0.1614 0.046295 0.45359 -0.12962 0.33642 -0.21669 -0.27091 -0.16509 0.18419 -0.27586 0.12269 -0.012149 -0.23497 0.20923 0.43814 -0.32106 -0.17071 -0.0025727 -0.025948 -0.071002 -0.2163 0.12129 0.17356 -0.159 -0.26937 0.21498 0.11852 -0.014236 0.28358 -0.30305 0.20611 -0.20913
of -0.16715 -0.0010567 -0.19993 -0.093281 -0.085323 0.16803 0.02827 0.21193 -0.18461 0.030626 0.061337 0.20897 -0.048829 0.072872 -0.24804 0.14995 -0.10427 0.25362 -0.25377 -0.046909 0.23103 -0.13958 0.096698 -0.13873 -0.18816 -0.33925 -0.12769 0.025515 0.1927 -0.23886 -0.096003 0.12565 -0.38746 -0.17257 -0.016184 0.11388 -0.11505 -0.135 0.18531 -0.31078 0.25641 -0.21784 -0.23305 0.48851 0.29054 -0.093619 -0.088168 0.40308 0.49952 0.4213 0.18668 0.26579 -0.10406 -0.0013798 -0.15389 0.31486 0.036097 0.032645 -0.11297 0.26994 -0.031791 0.034534 0.0045391 -0.082605 0.16027 -0.1163 0.045438 -0.18456 -0.033046 0.14392 0.38028 0.00054076 0.17435 0.008556 0.19375 -0.020889 0.17603 -0.48627 0.0014847 0.23283 -0.18314 -0.071 -0.028154 -0.34701 0.20839 -0.21952 -0.1269 -0.01303 -0.34134 -0.018452 -0.088293 0.1442 -0.010917 -0.18804 0.029666 0.12227 -0.059641 -0.099701 0.080151 0.098683
bug -0.16104 0.22345 -0.52171 -0.049254 0.36398 -0.03377 -0.51757 -0.13128 -0.033654 0.44559 -0.73595 -0.17421 -0.061673 0.15399 -0.17079 -0.35185 0.2719 0.58866 -0.18934 -0.38255 -0.55436 5.6939e-05 0.47935 0.79757 -0.21634 -0.27231 -0.7705 -0.27486 -0.080184 -0.13623 0.25086 0.55783 0.23359 0.079897 0.24158 0.45196 -0.034684 0.070867 -0.47792 -0.44604 -0.17802 -0.40082 0.16075 0.36177 0.85764 -0.13079 -0.21857 -0.24954 -0.1655 0.20273 0.028715 0.54311 -0.16729 0.041986 -0.14236 -0.022988 0.77909 0.038478 -0.59859 -0.084233 0.39918 -0.36386 -0.12653 -0.41765 -0.28527 0.25547 0.1974 0.17408 0.28804 0.79494 -0.016819 -0.025348 0.3845 -0.35161 -0.3202 0.48525 0.01959 0.32804 -0.31761 0.44232 0.13141 0.17387 0.0097161 0.052898 0.24716 0.050469 0.073792 0.026017 -0.72611 0.41077 0.25149 0.16558 -0.12419 -0.86742 0.26589 -0.42548 0.26709 0.061441 0.24726 0.25026
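A word missing from the training vocabulary still gets a vector because fastText composes it from character n-grams: the word is wrapped in `<` and `>`, and the substrings of length `-minn` to `-maxn` are looked up. The sketch below only lists those n-grams for one word. It assumes ASCII input and the lengths 3 to 6 that the skipgram command later in this post passes explicitly; the helper name `char_ngrams` is made up for illustration, and the hashing into buckets and the combination of the n-gram vectors are not shown.

```shell
# Sketch only: list the character n-grams of one word.
# char_ngrams is a hypothetical helper, not part of the fasttext CLI.
char_ngrams() {
  # $1 = word, $2 = min n-gram length, $3 = max n-gram length
  awk -v w="<$1>" -v minn="$2" -v maxn="$3" 'BEGIN {
    for (n = minn; n <= maxn; n++)
      for (i = 1; i + n - 1 <= length(w); i++)
        print substr(w, i, n)
  }'
}

char_ngrams bug 3 6
```

For `bug` this prints `<bu`, `bug`, `ug>`, `<bug`, `bug>` and `<bug>`, one per line. Because a misspelled or unseen word shares most of these pieces with known words, its vector is not random.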
If the default dim (the embedding size, 100 above) is too large, you can set it yourself:
$ fasttext skipgram -input text8 -dim 10 -output text8_ft10
$ head -n 3 text8_ft10.vec
71290 10
the -0.69471 -0.35273 0.18617 -0.3283 0.28874 0.35978 -0.50711 -0.11573 -0.30905 -0.58648
of -0.87699 -0.46422 0.10984 -0.15627 0.49961 0.22101 -0.40932 -0.24884 -0.20546 -0.54027
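The `.vec` files shown above share one layout: a header line with the vocabulary size and the dimension, then one line per word with that many numbers. As a quick sanity check after changing `-dim`, the sketch below compares the header's dimension with the number of values on each line. The function name `check_vec` and the two-word sample file are assumptions for illustration; the sample header says `2 10` only because the sample holds two words.

```shell
# Sketch only: check that every line of a .vec file has "dim" values.
# check_vec is a hypothetical helper, not part of the fasttext CLI.
check_vec() {
  awk 'NR == 1 { dim = $2; next }
       NF != dim + 1 { bad++ }
       END { print (bad ? "mismatch" : "ok") }' "$1"
}

# Two-word sample in the same layout as text8_ft10.vec.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2 10
the -0.69471 -0.35273 0.18617 -0.3283 0.28874 0.35978 -0.50711 -0.11573 -0.30905 -0.58648
of -0.87699 -0.46422 0.10984 -0.15627 0.49961 0.22101 -0.40932 -0.24884 -0.20546 -0.54027
EOF
check_vec "$sample"
```

On a real run the same call would be `check_vec text8_ft10.vec`.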
Dataset: the DBpedia ontology classification dataset. It has 14 classes; 40k samples per class are used for training and 5k for testing, which gives 560k training samples and 70k test samples in total.
cat classes.txt
1,"Bergan Mercy Medical Center"," Bergan Mercy Medical Center is a hospital located in Omaha Nebraska. It is part of the Alegent Health System."
1,"The Unsigned Guide"," The Unsigned Guide is an online contacts directory and careers guide for the UK music industry. Founded in 2003 and first published as a printed directory The Unsigned Guide became an online only resource in November 2011."
...
14,"The Blithedale Romance"," The Blithedale Romance (1852) is Nathaniel Hawthorne's third major romance. In Hawthorne (1879) Henry James called it the lightest the brightest the liveliest of Hawthorne's unhumorous fictions."
14,"Razadarit Ayedawbon"," Razadarit Ayedawbon (Burmese: ရာဇာဓိရာဇ် အရေးတော်ပုံ) is a Burmese chronicle covering the history of Ramanya from 1287 to 1421. The chronicle consists of accounts of court intrigues rebellions diplomatic missions wars etc. About half of the chronicle is devoted to the reign of King Razadarit (r."
14,"The Vinyl Cafe Notebooks"," Vinyl Cafe Notebooks: a collection of essays from The Vinyl Cafe (2010) is Stuart McLean's ninth book and each one has been a Canadian bestseller. McLean has sold over 1 million books in Canada. Unlike the other Vinyl Cafe books these are not Dave and Morley stories.Selected from 15 years of radio-show archives and re-edited by the author this eclectic essay collection provides a glimpse into the thoughtful mind at work behind The Vinyl Cafe."
# Shuffle the input lines.
myshuf() {
  perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@";
}

# Lowercase, prefix each line with __label__, pad punctuation with spaces,
# squeeze repeated spaces, and shuffle the lines.
normalize_text() {
  tr '[:upper:]' '[:lower:]' | sed -e 's/^/__label__/g' | \
    sed -e "s/'/ ' /g" -e 's/"//g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' \
        -e 's/,/ , /g' -e 's/(/ ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
        -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' | tr -s " " | myshuf
}
1,"TY KU"," TY KU /taɪkuː/ is an American alcoholic beverage company that specializes in sake and other spirits. The privately-held company was founded in 2004 and is headquartered in New York City New York. While based in New York TY KU's beverages are made in Japan through a joint venture with two sake breweries. Since 2011 TY KU's growth has extended its products into all 50 states."
1,"Odd Lot Entertainment"," OddLot Entertainment founded in 2001 by longtime producers Gigi Pritzker and Deborah Del Prete (The Wedding Planner) is a film production and financing company based in Culver City California.OddLot produced the film version of Orson Scott Card's sci-fi novel Ender's Game. A film version of this novel had been in the works in one form or another for more than a decade by the time of its release."
# after processing
__label__1 , ty ku , ty ku /taɪkuː/ is an american alcoholic beverage company that specializes in sake and other spirits . the privately-held company was founded in 2004 and is headquartered in new york city new york . while based in new york ty ku ' s beverages are made in japan through a joint venture with two sake breweries . since 2011 ty ku ' s growth has extended its products into all 50 states .
__label__1 , odd lot entertainment , oddlot entertainment founded in 2001 by longtime producers gigi pritzker and deborah del prete ( the wedding planner ) is a film production and financing company based in culver city california . oddlot produced the film version of orson scott card ' s sci-fi novel ender ' s game . a film version of this novel had been in the works in one form or another for more than a decade by the time of its release .
fasttext supervised -input dbpedia.train -output trainout -dim 10 -lr 0.1 -wordNgrams 2 -minCount 1 -bucket 10000000 -epoch 5 -thread 4
$ head -n 6 trainout.vec
802981 10
the 0.48158 0.13413 -0.5119 0.62694 0.089501 -0.024228 -0.13503 0.23139 0.041772 0.081158
. -0.61252 -0.32307 0.78123 -0.56232 -0.0014737 -0.019952 0.22725 0.065144 -0.23527 -0.053442
, -0.38554 -0.35668 0.071955 0.54615 -0.041367 -0.010555 -0.11941 0.3101 -0.077714 -0.35903
in 0.159 -0.21333 0.048756 -0.058684 1.0204 0.54013 1.2182 -0.02415 -0.004165 0.6187
of -0.078618 -0.11361 -0.32771 0.63844 -0.79154 0.32892 -0.55461 -0.47428 -0.6273 0.51869
$ fasttext test trainout.bin dbpedia.test
N 70000
P@1 0.985
R@1 0.985
$ fasttext predict trainout.bin dbpedia.test >dbpedia.test.predict
$ head dbpedia.test.predict
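The P@1 figure reported by `fasttext test` can be recomputed from the prediction file: it is the share of test lines whose first token (the gold label) equals the predicted label. Since every example here has exactly one gold label and one prediction, precision and recall at 1 coincide, which is why both read 0.985 above. The sketch below assumes one predicted label per line, as written by `fasttext predict`; the function name `precision_at_1` and the four-line toy files are made up for illustration.

```shell
# Sketch only: recompute P@1 from a test file and a predictions file.
# precision_at_1 is a hypothetical helper, not part of the fasttext CLI.
precision_at_1() {
  # $1 = test file (label is the first token), $2 = one prediction per line
  cut -d' ' -f1 "$1" | paste - "$2" |
    awk -F'\t' '{ n++; if ($1 == $2) hit++ }
                END { if (n) printf "%.3f\n", hit / n }'
}

# Toy data: 3 of the 4 predictions match the gold label.
tmp=$(mktemp -d)
printf '%s\n' '__label__1 , a , b' '__label__2 , c , d' \
              '__label__3 , e , f' '__label__1 , g , h' > "$tmp/test.txt"
printf '%s\n' '__label__1' '__label__2' '__label__1' '__label__1' > "$tmp/pred.txt"
precision_at_1 "$tmp/test.txt" "$tmp/pred.txt"   # prints 0.750
```

On the real files the call would be `precision_at_1 dbpedia.test dbpedia.test.predict`.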
$ head enwik9
MediaWiki 1.6alpha
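The dump needs preprocessing first. The preprocessing script (wikifil.pl) filters a Wikipedia XML dump into "clean" text made up only of lowercase letters (a-z, converted from A-Z) and spaces (never consecutive). All other characters are converted to spaces. Only text that would normally be visible in a web browser is kept. Tables are removed. Image captions are preserved. Links are converted to normal text. Digits are spelled out.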
perl wikifil.pl enwik9 > file9
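To make the character-level part of that cleanup concrete, here is a rough shell approximation. It is not wikifil.pl: it skips the XML and markup handling (tables, links, captions) and only lowercases, spells out digits, turns every other non-letter into a space, and squeezes repeated spaces. The function name `clean_text` is made up for illustration.

```shell
# Sketch only: approximate the character-level steps of the cleanup.
# clean_text is a hypothetical helper; it does not parse Wikipedia XML.
clean_text() {
  LC_ALL=C tr '[:upper:]' '[:lower:]' |
    sed -e 's/0/ zero /g' -e 's/1/ one /g' -e 's/2/ two /g' \
        -e 's/3/ three /g' -e 's/4/ four /g' -e 's/5/ five /g' \
        -e 's/6/ six /g' -e 's/7/ seven /g' -e 's/8/ eight /g' \
        -e 's/9/ nine /g' |
    LC_ALL=C tr -c 'a-z' ' ' | tr -s ' '
}

printf 'Hello, World! 42\n' | clean_text
```

The sample prints `hello world four two ` on a single line. Newlines are also turned into spaces, so the result is one long line of words, the same shape as the text8 sample at the top.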
fasttext skipgram -input file9 -output file9out -lr 0.025 -dim 10 -ws 5 -epoch 3 -minCount 5 -neg 5 -loss ns -bucket 2000000 -minn 3 -maxn 6 -thread 4 -t 1e-4 -lrUpdateRate 100
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]