http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
Last night I wrote two articles and was sure I had saved them, but this morning they were gone, so here is a brief rewrite...
1. tf:
First we need to deal with high-dimensional sparse datasets. scipy.sparse matrices are exactly this kind of data structure, and scikit-learn has built-in support for these structures.
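To make "sparse" concrete, here is a minimal sketch (a toy matrix of my own, not from the tutorial) showing that a CSR matrix stores only the non-zero entries while still round-tripping to the full dense array:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A toy document-term count matrix: 3 documents x 5 terms, mostly zeros.
dense = np.array([[1, 0, 0, 2, 0],
                  [0, 0, 3, 0, 0],
                  [0, 1, 0, 0, 1]])

sparse = csr_matrix(dense)   # CSR keeps only the 5 non-zero entries
print(sparse.nnz)            # 5
print((sparse.toarray() == dense).all())  # True: lossless round-trip
```

With real text data the savings are far larger: a bag-of-words matrix typically has tens of thousands of columns, almost all zero in any given row.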
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
rawData_counts = count_vect.fit_transform(rawData.data)

rawData_counts
Out[18]: <4x15 sparse matrix of type '<type 'numpy.int64'>'
        with 25 stored elements in Compressed Sparse Row format>

rawData_counts.shape
Out[19]: (4, 15)

print rawData_counts
  (0, 12)	1
  (0, 8)	1
  (0, 13)	1
  (0, 3)	1
  (1, 12)	1
  (1, 8)	1
  (1, 13)	1
  (1, 3)	1
  (1, 6)	1
  (1, 5)	1
  (1, 7)	1
  (2, 12)	1
  (2, 8)	1
  (2, 7)	1
  (2, 10)	1
  (2, 2)	1
  (2, 4)	1
  (2, 1)	1
  (2, 0)	1
  (3, 12)	1
  (3, 8)	1
  (3, 4)	1
  (3, 14)	1
  (3, 11)	1
  (3, 9)	1

count_vect.vocabulary_.get(u'like')
Out[22]: 8
# As you can see, 'like' is mapped to the 9th feature in the vocabulary (indices start at 0).
# Referring back to the previous article, every document contains one 'like', so this looks correct.
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(rawData_counts)

X_tfidf.shape
Out[29]: (4, 15)

print X_tfidf
  (0, 3)	0.589645181731
  (0, 13)	0.589645181731
  (0, 8)	0.390280104107
  (0, 12)	0.390280104107
  (1, 7)	0.375458941152
  (1, 5)	0.47622205886
  (1, 6)	0.47622205886
  (1, 3)	0.375458941152
  (1, 13)	0.375458941152
  (1, 8)	0.248512426084
  (1, 12)	0.248512426084
  (2, 0)	0.415663989294
  (2, 1)	0.415663989294
  (2, 4)	0.327714263528
  (2, 2)	0.415663989294
  (2, 10)	0.415663989294
  (2, 7)	0.327714263528
  (2, 8)	0.216910713171
  (2, 12)	0.216910713171
  (3, 9)	0.489923636451
  (3, 11)	0.489923636451
  (3, 14)	0.489923636451
  (3, 4)	0.386261422303
  (3, 8)	0.255662477672
  (3, 12)	0.255662477672
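Two properties of this output are worth verifying. First, `TfidfTransformer` L2-normalizes each row by default (`norm='l2'`), which is why the values per document are below 1. Second, `TfidfVectorizer` performs both steps (counting and tf-idf weighting) in one shot. A sketch with a hypothetical toy corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)

# Hypothetical toy corpus (not the tutorial's data)
docs = ["I like apples", "I like ripe apples", "like it"]

# Two-step pipeline: counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
X_tfidf = TfidfTransformer().fit_transform(counts)

# Each row has unit L2 norm under the default norm='l2'
row_norms = np.sqrt(X_tfidf.multiply(X_tfidf).sum(axis=1))
print(np.allclose(row_norms, 1.0))  # True

# TfidfVectorizer combines both steps and gives identical results
X2 = TfidfVectorizer().fit_transform(docs)
print(np.allclose(X_tfidf.toarray(), X2.toarray()))  # True
```

The unit norm matters downstream: with L2-normalized rows, a linear classifier's dot product between documents is exactly their cosine similarity.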