sklearn CountVectorizer, TfidfVectorizer, and TfidfTransformer Explained

  • sklearn CountVectorizer explained
from sklearn.feature_extraction.text import CountVectorizer

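texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
# Learn the vocabulary and build the document-term count matrix
cv_fit = cv.fit_transform(texts)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())  # vocabulary, in column order
print(cv_fit.toarray())                     # dense count matrix
print(cv_fit)                               # sparse form: (row, column) count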
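  • The result is returned as a sparse matrix. The output below shows the feature names, the dense array, and then the sparse (row, column) count entries.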
['bird', 'cat', 'dog', 'fish']
[[0 1 1 1]
 [0 2 1 0]
 [1 0 0 1]
 [1 0 0 0]]
  (0, 3)    1
  (0, 1)    1
  (0, 2)    1
  (1, 1)    2
  (1, 2)    1
  (2, 0)    1
  (2, 3)    1
  (3, 0)    1
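To make the sparse output easier to read, here is a small sketch (not from the original post) that maps column indices back to tokens through the fitted vocabulary_ attribute, then applies the fitted vectorizer to a new sentence. The sentence "cat cat horse" and the helper names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)

# vocabulary_ maps each token to its column index in the count matrix
vocab = {token: int(idx) for token, idx in cv.vocabulary_.items()}
print(sorted(vocab.items(), key=lambda kv: kv[1]))

# Each stored entry of the sparse matrix reads as
# "document `row` contains the token in column `col` `count` times"
index_to_token = {idx: token for token, idx in vocab.items()}
coo = cv_fit.tocoo()
for row, col, count in zip(coo.row, coo.col, coo.data):
    print(f"doc {int(row)}: {index_to_token[int(col)]!r} x {int(count)}")

# transform() reuses the fitted vocabulary; unseen tokens such as
# the hypothetical "horse" are silently ignored
new_counts = cv.transform(["cat cat horse"]).toarray()
print(new_counts)
```

The last line shows that only "cat" is counted for the new sentence, because "horse" was never seen during fitting.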
  • sklearn TfidfTransformer explained
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

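texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig 中国"]
cv = CountVectorizer()
# Step 1: raw term counts per document
cv_fit = cv.fit_transform(texts)

# Step 2: reweight the counts with TF-IDF (each row is L2-normalized by default)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(cv_fit)
tfidf.toarray()  # in an interactive session this displays the array below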
array([[ 0.64043405,  0.42389674,  0.64043405,  0.        ,  0.        ],
       [ 0.94936136,  0.31418628,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.55193942,  0.83388421,  0.        ,  0.        ],
       [ 0.        ,  0.22726773,  0.        ,  0.8710221 ,  0.43551105]])
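To see where these numbers come from, the weights can be recomputed by hand. This is a minimal sketch assuming TfidfTransformer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), where idf = ln((1 + n) / (1 + df)) + 1. The variable names are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig 中国"]
count_matrix = CountVectorizer().fit_transform(texts)
counts = count_matrix.toarray().astype(float)

n_docs = counts.shape[0]
# Document frequency: in how many documents each term appears
df = (counts > 0).sum(axis=0)
# Smoothed idf, as used when smooth_idf=True (the default)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Raw term count times idf, then L2-normalize every row
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Compare against scikit-learn's own result
reference = TfidfTransformer().fit_transform(count_matrix).toarray()
print(np.allclose(manual, reference))
```

For example, "dog" appears in all four documents, so its idf is ln(5/5) + 1 = 1, the smallest possible value. That is why its weight in each row is lower than that of rarer terms with the same count.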
  • sklearn TfidfVectorizer explained
  • TfidfVectorizer does the same job as the four lines below, i.e. CountVectorizer followed by TfidfTransformer:
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(cv_fit)
  • The same computation with TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(max_features=100,
                     ngram_range=(1, 1),
                     stop_words='english')
X_description = tv.fit_transform(texts)
print(X_description.toarray())
[[ 0.64043405  0.42389674  0.64043405  0.          0.        ]
 [ 0.94936136  0.31418628  0.          0.          0.        ]
 [ 0.          0.55193942  0.83388421  0.          0.        ]
 [ 0.          0.22726773  0.          0.8710221   0.43551105]]
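Rather than comparing the printed arrays by eye, the equivalence can be checked numerically. The sketch below is an illustration with my own variable names. It compares the two-step route with TfidfVectorizer, once with default settings and once with the parameters used in this post.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig 中国"]

# Two-step route: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(texts)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# One-step route with default settings
one_step = TfidfVectorizer().fit_transform(texts).toarray()

# One-step route with the settings used in this post
post_settings = TfidfVectorizer(
    max_features=100, ngram_range=(1, 1), stop_words="english"
).fit_transform(texts).toarray()

print(np.allclose(two_step, one_step))
print(np.allclose(two_step, post_settings))
```

Note that stop_words='english' and max_features=100 leave the result unchanged here only because none of the example tokens is an English stop word and the vocabulary has far fewer than 100 terms. On real text these options would change the matrix.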
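  • The output is identical to the CountVectorizer + TfidfTransformer result above.
  • ngram_range=(1, 1) can be changed. For example, (2, 2) extracts only 2-grams, while (2, 3) extracts both 2-grams and 3-grams.
  • For stop_words, 'english' is the only built-in string value. For other languages, pass your own list of stop words.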
