Natural Language Processing (NLP), Part 3: Bag-of-Words Model + Text Classification

1. Bag-of-Words Model

The bag-of-words (BOW) model uses a term-frequency matrix as the feature representation of each sample: each feature is simply the number of times a word appears in that sample.

import nltk.tokenize as tk 
import sklearn.feature_extraction.text as ft 
# ft: sklearn's text feature-extraction module
doc = 'the brown dog is running. The black dog is in the black room. Running in the room is forbidden.'
print(doc)
print('-'*72)
sentences = tk.sent_tokenize(doc)
print(sentences)
print('-'*72)

cv = ft.CountVectorizer()
tfmat = cv.fit_transform(sentences).toarray()
words = cv.get_feature_names_out()  # get_feature_names() in older sklearn versions
print(words)
print(tfmat)
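Each column of the matrix printed above is one vocabulary word and each row is one sentence. The counts can be reproduced by hand; the sketch below (using `collections.Counter` and a regex approximating CountVectorizer's default tokenization, which lowercases and keeps tokens of two or more word characters) is an illustration, not sklearn's actual implementation.

```python
from collections import Counter
import re

sentence = 'The black dog is in the black room.'
# Lowercase and extract tokens of 2+ word characters,
# mirroring CountVectorizer's default token pattern
tokens = re.findall(r'\b\w\w+\b', sentence.lower())
counts = Counter(tokens)
print(counts['black'])  # 2
print(counts['dog'])    # 1
```

This matches the corresponding row of the term-frequency matrix for that sentence.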

2. Text Classification

TF-IDF: Term Frequency - Inverse Document Frequency. In essence it is a text feature built on top of the bag-of-words model: raw term counts are re-weighted so that words appearing in many documents contribute less than words concentrated in a few.
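To make the weighting concrete, here is a small worked sketch on an assumed toy count matrix (3 documents, 2 terms), using the smoothed idf formula and L2 row normalization that sklearn's TfidfTransformer applies by default:

```python
import numpy as np

# Assumed toy term-frequency matrix (rows: documents, cols: terms)
tf = np.array([[2.0, 0.0],
               [1.0, 1.0],
               [0.0, 3.0]])

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                  # document frequency of each term
# Smoothed idf, as in TfidfTransformer(smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + df)) + 1
tfidf = tf * idf
# L2-normalize each row so documents of different lengths are comparable
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)
print(tfidf)
```

Running `ft.TfidfTransformer().fit_transform(tf)` on the same matrix should produce the same values.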

import sklearn.datasets as sd 
import sklearn.feature_extraction.text as ft 
import sklearn.naive_bayes as nb 
cld = {'misc.forsale':'SALES',
        'rec.motorcycles':'MOTORCYCLES',
        'rec.sport.baseball':'BASEBALL',
        'sci.crypt':'CRYPTOGRAPHY',
        'sci.space':'SPACE'}
train = sd.fetch_20newsgroups(subset='train',categories=cld.keys(),shuffle=True,random_state=7)
train_data = train.data 
print(len(train_data))
train_y = train.target
print('train_y',train_y)
categories = train.target_names
print(len(categories))

cv = ft.CountVectorizer()
train_tfmat = cv.fit_transform(train_data) 
print(train_tfmat.shape)

tf = ft.TfidfTransformer()
train_x = tf.fit_transform(train_tfmat)
print(train_x.shape)

model = nb.MultinomialNB()
model.fit(train_x,train_y)
test_data =['The curveballs of right handed pitchers tend to curve to the left',
'Caesar cipher is an ancient form of encryption',
'This two-wheeler is really good on slippery roads']
test_tfmat = cv.transform(test_data)
test_x = tf.transform(test_tfmat)
pred_test_y = model.predict(test_x)
for sentence,index in zip(test_data,pred_test_y):
    print(sentence,'->',cld[categories[index]])
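The CountVectorizer + TfidfTransformer pair above can also be collapsed into a single `TfidfVectorizer` and chained with the classifier in a `Pipeline`, which avoids having to call `transform` on two objects at prediction time. A minimal sketch on an assumed in-memory corpus (the documents and labels below are made up for illustration, not taken from the 20newsgroups data):

```python
import sklearn.feature_extraction.text as ft
import sklearn.naive_bayes as nb
from sklearn.pipeline import make_pipeline

# Assumed tiny corpus with hand-picked labels
docs = ['the pitcher threw a fastball',
        'the catcher missed the pitch',
        'this cipher uses a secret key',
        'encryption hides the message']
labels = ['BASEBALL', 'BASEBALL', 'CRYPTOGRAPHY', 'CRYPTOGRAPHY']

# TfidfVectorizer = CountVectorizer + TfidfTransformer in one step
model = make_pipeline(ft.TfidfVectorizer(), nb.MultinomialNB())
model.fit(docs, labels)
print(model.predict(['a cipher for secret messages']))  # ['CRYPTOGRAPHY']
```

The same pipeline pattern applies unchanged to the full 20newsgroups training data.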
