scikit-learn: Extracting Features from Text Files (tf, idf)

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

I wrote two posts last night and was sure I had saved them, but they were gone this morning, so here is a brief recap.


1. tf:

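The first problem to solve is that bag-of-words features form high-dimensional sparse datasets: most documents use only a small part of the vocabulary, so most entries are zero. scipy.sparse matrices are data structures designed for exactly this, and scikit-learn has built-in support for these structures. CountVectorizer tokenizes the documents and returns the raw term counts (tf) as such a sparse matrix.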
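
To make the sparse-matrix idea concrete, here is a minimal sketch with a small made-up array (the values are purely illustrative and are not taken from the data in this post). It only shows that a CSR matrix stores the non-zero entries and can be converted back to a dense array.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x4 matrix in which most entries are zero.
dense = np.array([[0, 0, 1, 0],
                  [2, 0, 0, 0],
                  [0, 0, 0, 3]])

# Compressed Sparse Row format keeps only the non-zero entries.
sparse = csr_matrix(dense)

stored = sparse.nnz   # number of explicitly stored entries
total = dense.size    # number of cells in the dense layout
print(sparse.shape, stored, total)
```

With a real vocabulary of tens of thousands of terms, the gap between `stored` and `total` is what makes the sparse representation worthwhile.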

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
rawData_counts = count_vect.fit_transform(rawData.data)

rawData_counts
Out[18]: 
<4x15 sparse matrix of type '<type 'numpy.int64'>'
	with 25 stored elements in Compressed Sparse Row format>
rawData_counts.shape
Out[19]: (4, 15)
print(rawData_counts)
  (0, 12)       1
  (0, 8)        1
  (0, 13)       1
  (0, 3)        1
  (1, 12)       1
  (1, 8)        1
  (1, 13)       1
  (1, 3)        1
  (1, 6)        1
  (1, 5)        1
  (1, 7)        1
  (2, 12)       1
  (2, 8)        1
  (2, 7)        1
  (2, 10)       1
  (2, 2)        1
  (2, 4)        1
  (2, 1)        1
  (2, 0)        1
  (3, 12)       1
  (3, 8)        1
  (3, 4)        1
  (3, 14)       1
  (3, 11)       1
  (3, 9)        1

count_vect.vocabulary_.get(u'like')
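Out[22]: 8  # 'like' is mapped to index 8 in the vocabulary, i.e. the 9th feature (indices start at 0). According to the file contents in the previous post, every file contains 'like' exactly once, and column 8 is 1 in all four rows, so this looks correct.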
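
Since rawData comes from the previous post, here is a self-contained sketch of the same step on a tiny corpus. The two sentences and the names `docs`, `vectorizer` and `counts` are my own placeholders, not the original data. It assumes the default CountVectorizer settings, which lowercase the text and drop single-character tokens such as "I".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, only for illustration.
docs = ["I like apples",
        "I like bananas and apples"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each term to its column index (terms are sorted alphabetically).
like_index = vectorizer.vocabulary_["like"]

print(counts.shape)       # (number of documents, vocabulary size)
print(like_index)
print(counts.toarray())   # dense view, fine for a corpus this small
```

Reading the column at `like_index` gives the count of "like" in each document, which is the same check done above with column 8.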

2. tf-idf:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(rawData_counts)

X_tfidf.shape
Out[29]: (4, 15)
print(X_tfidf)
  (0, 3)        0.589645181731
  (0, 13)       0.589645181731
  (0, 8)        0.390280104107
  (0, 12)       0.390280104107
  (1, 7)        0.375458941152
  (1, 5)        0.47622205886
  (1, 6)        0.47622205886
  (1, 3)        0.375458941152
  (1, 13)       0.375458941152
  (1, 8)        0.248512426084
  (1, 12)       0.248512426084
  (2, 0)        0.415663989294
  (2, 1)        0.415663989294
  (2, 4)        0.327714263528
  (2, 2)        0.415663989294
  (2, 10)       0.415663989294
  (2, 7)        0.327714263528
  (2, 8)        0.216910713171
  (2, 12)       0.216910713171
  (3, 9)        0.489923636451
  (3, 11)       0.489923636451
  (3, 14)       0.489923636451
  (3, 4)        0.386261422303
  (3, 8)        0.255662477672
  (3, 12)       0.255662477672
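
As a sanity check, the first row above can be reproduced by hand. The sketch below assumes the TfidfTransformer defaults (smoothed idf, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The document frequencies are read off the count matrix printed earlier: columns 8 and 12 occur in all 4 documents, columns 3 and 13 in 2 of them. The variable names are mine.

```python
import numpy as np

n_docs = 4

# Document 0 contains columns 3, 8, 12 and 13, each with a count of 1.
tf = np.array([1.0, 1.0, 1.0, 1.0])

# Number of documents containing each of those columns (from the count matrix).
df = np.array([2.0, 4.0, 4.0, 2.0])

# Smoothed idf, assuming the transformer's default settings.
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Weight the counts by idf, then scale the row to unit Euclidean length.
weights = tf * idf
tfidf = weights / np.linalg.norm(weights)

print(tfidf)
```

The result should agree with the printed values for row 0: about 0.5896 for columns 3 and 13, and about 0.3903 for columns 8 and 12. Terms that occur in every document (such as 'like') get the lowest weight.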
