- sklearn CountVectorizer explained
from sklearn.feature_extraction.text import CountVectorizer
texts=["dog cat fish","dog cat cat","fish bird", 'bird']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
print(cv.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(cv_fit.toarray())
print(cv_fit)
['bird', 'cat', 'dog', 'fish']
[[0 1 1 1]
[0 2 1 0]
[1 0 0 1]
[1 0 0 0]]
(0, 3) 1
(0, 1) 1
(0, 2) 1
(1, 1) 2
(1, 2) 1
(2, 0) 1
(2, 3) 1
(3, 0) 1
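Each `(row, column) value` line in the sparse printout above means "document `row` contains the term in vocabulary column `column` that many times". The following is a minimal sketch of how that mapping could be read back into words. The names `vocab`, `index_to_token`, `entries` and `new_counts`, and the extra sentence `"cat cat horse"`, are illustrative and not part of the original example.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
counts = cv.fit_transform(texts)

# vocabulary_ maps each token to its column index in the count matrix.
vocab = {token: int(col) for token, col in cv.vocabulary_.items()}
print(sorted(vocab.items(), key=lambda kv: kv[1]))

# Walk the non-zero entries of the sparse matrix in coordinate form
# and translate each column index back into its token.
index_to_token = {col: token for token, col in vocab.items()}
coo = counts.tocoo()
entries = sorted(
    (int(r), index_to_token[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
)
for row, token, count in entries:
    print(f"document {row}: {token!r} appears {count} time(s)")

# Tokens that were not seen during fit ("horse" here) are ignored by transform().
new_counts = cv.transform(["cat cat horse"]).toarray()
print(new_counts)
```

Note that `transform()` reuses the vocabulary learned by `fit_transform()`, so new documents always map to the same four columns.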
- sklearn TfidfTransformer explained
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
texts=["dog cat fish","dog cat cat","dog fish", 'dog pig pig 中国']
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(cv_fit)
tfidf.toarray()
array([[ 0.64043405, 0.42389674, 0.64043405, 0. , 0. ],
[ 0.94936136, 0.31418628, 0. , 0. , 0. ],
[ 0. , 0.55193942, 0.83388421, 0. , 0. ],
[ 0. , 0.22726773, 0. , 0.8710221 , 0.43551105]])
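To see where these numbers come from, the sketch below recomputes them by hand with NumPy. It assumes the TfidfTransformer defaults (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which idf = ln((1 + n) / (1 + df)) + 1 and every row is scaled to unit L2 length. The names `df`, `idf`, `weighted`, `manual` and `reference` are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig 中国"]
sparse_counts = CountVectorizer().fit_transform(texts)
counts = sparse_counts.toarray()

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf (assumes smooth_idf=True)

weighted = counts * idf                      # raw term count times idf
# Scale each row to unit L2 norm (assumes norm='l2').
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfTransformer().fit_transform(sparse_counts).toarray()
print(np.allclose(manual, reference))
```

For example, "dog" appears in all four documents, so its idf is ln(5/5) + 1 = 1, the lowest weight in the vocabulary. This is why its column has the smallest non-zero values in the first, third and fourth rows.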
- sklearn TfidfVectorizer explained
- TfidfVectorizer does the same job as the following four lines, i.e. CountVectorizer followed by TfidfTransformer:
cv = CountVectorizer()
cv_fit=cv.fit_transform(texts)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(cv_fit)
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(max_features=100,
                     ngram_range=(1, 1),
                     stop_words='english')
X_description = tv.fit_transform(texts)
print(X_description.toarray())
[[ 0.64043405 0.42389674 0.64043405 0. 0. ]
[ 0.94936136 0.31418628 0. 0. 0. ]
[ 0. 0.55193942 0.83388421 0. 0. ]
[ 0. 0.22726773 0. 0.8710221 0.43551105]]
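Rather than comparing the printed matrices by eye, the match with the two-step pipeline can be checked numerically. This is a minimal sketch; the variable names `two_step`, `one_step` and `same` are illustrative. The match depends on this corpus containing no English stop words and fewer than 100 distinct terms, so `stop_words` and `max_features` change nothing here.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig 中国"]

# Two-step pipeline: raw counts first, then tf-idf weighting.
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(texts))

# One-step equivalent, using the same settings as the snippet above.
tv = TfidfVectorizer(max_features=100, ngram_range=(1, 1), stop_words="english")
one_step = tv.fit_transform(texts)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

In general the two approaches only agree when the tokenization options (`stop_words`, `ngram_range`, `max_features` and so on) are also passed to CountVectorizer.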
- The output is exactly the same as the CountVectorizer + TfidfTransformer result above.
- ngram_range=(1, 1) can be changed: (2, 2) extracts 2-grams only, while (2, 3) extracts both 2-grams and 3-grams.
- The only built-in stop word list for stop_words is "english"; for other languages, pass your own list of words instead.