Sentiment Prediction on Movie Reviews (CountVectorizer, RandomForest)

I took part in a Kaggle competition whose task is to predict the sentiment of movie reviews. Below is my baseline approach.

Packages used: CountVectorizer, RandomForestClassifier.

Imports

# Import the required libraries
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup  # package for parsing HTML content
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import nltk
from nltk.corpus import stopwords

Loading the data

datafile=os.path.join("..","data","labeledTrainData.tsv")
df=pd.read_csv(datafile,sep="\t",escapechar="\\")
print("Number of reviews: {}".format(len(df)))
df.head()






# Output:


Number of reviews: 25000


   id      sentiment     review
0  5814_8      1        With all this stuff going down at the moment w...
1  2381_9      1        "The Classic War of the Worlds" by Timothy Hin...
2  7759_3      0        The film starts with a manager (Nicholas Bell)...
3  3630_4      0        It must be assumed that those who praised this...
4  9495_8      1        Superbly trashy and wondrously unpretentious 8...


 

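Preprocess the review data. The main steps are:

1. Remove HTML tags (using the BeautifulSoup package)

2. Remove punctuation (using a regular expression)

3. Split the text into words/tokens

4. Remove stop words

5. Join the remaining words back into a new sentence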

def display(text,title):
    print(title)
    print("\n----------separator-------------\n")
    print(text)

raw_example=df["review"][1]  # work with the second review first
display(raw_example,"Raw data")




# Result:

Raw data

----------separator-------------

"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that
 obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic 
book. Mr. Hines succeeds in doing so. I, and those who watched his film with 
me, appreciated the fact that it was not the standard, predictable Hollywood 
fare that comes out every year, e.g. the Spielberg version with Tom Cruise 
that had only the slightest resemblance to the book. Obviously, everyone 
looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate 
a movie on more important bases,like being entertained, which is why most 
people never agree with the "critics". We enjoyed the effort Mr. Hines put
 into being faithful to H.G. Wells' classic novel, and we found it to be 
very entertaining. This made it easy to overlook what the "critics" perceive 
to be its shortcomings.

example=BeautifulSoup(raw_example,"html.parser").get_text()
display(example,"Data with HTML tags removed")



# Output:
Data with HTML tags removed

----------separator-------------

"The Classic War of the Worlds" by Timothy Hines is a very entertaining film 
that obviously goes to great effort and lengths to faithfully recreate 
H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those
 who watched his film with me, appreciated the fact that it was not the
 standard, predictable Hollywood fare that comes out every year, e.g. 
the Spielberg version with Tom Cruise that had only the slightest 
resemblance to the book. Obviously, everyone looks for different 
things in a movie. Those who envision themselves as amateur "critics"
 look only to criticize everything they can. Others rate a movie on 
more important bases,like being entertained, which is why most people
 never agree with the "critics". We enjoyed the effort Mr. Hines put
 into being faithful to H.G. Wells' classic novel, and we found it to 
be very entertaining. This made it easy to overlook what the "critics"
 perceive to be its shortcomings.
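
The review shown here contains no markup, so the text before and after this step looks identical. As a minimal sketch of what the step does when tags are present, here is the same call on a made-up string (the string `sample` is hypothetical and not taken from the dataset):

```python
from bs4 import BeautifulSoup

# Hypothetical review text containing HTML line breaks and a bold tag
sample = "Great film.<br /><br />Would watch <b>again</b>."

# get_text() drops every tag and keeps only the text nodes
stripped = BeautifulSoup(sample, "html.parser").get_text()
print(stripped)  # Great film.Would watch again.
```

The two sentences end up glued together at the removed line breaks, but the next step replaces the period with a space, so the words are still split correctly.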

example_letters=re.sub(r'[^a-zA-Z]',' ',example)  # replace every non-letter with a space
display(example_letters,"Data with punctuation removed")




# Result:
Data with punctuation removed

----------separator-------------

 The Classic War of the Worlds  by Timothy Hines is a very entertaining film 
that obviously goes to great effort and lengths to faithfully recreate H  G  
Wells  classic book  Mr  Hines succeeds in doing so  I  and those who watched his 
film with me  appreciated the fact that it was not the standard  predictable 
Hollywood fare that comes out every year  e g  the Spielberg version with Tom 
Cruise that had only the slightest resemblance to the book  Obviously  everyone 
looks for different things in a movie  Those who envision themselves as amateur 
 critics  look only to criticize everything they can  Others rate a movie on 
more important bases like being entertained  which is why most people never 
agree with the  critics   We enjoyed the effort Mr  Hines put into being 
faithful to H G  Wells  classic novel  and we found it to be very entertaining 
 This made it easy to overlook what the  critics  perceive to be its shortcomings 
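
words=example_letters.lower().split()  # lowercase the text and split it into tokens
# Load the stop word list, one word per line (the path ../stopwords.txt is assumed).
# Note: this name shadows the nltk.corpus.stopwords import above.
stopwords={}.fromkeys([line.rstrip() for line in open("../stopwords.txt")])
words_nostop=[w for w in words if w not in stopwords]
display(words_nostop,"Data with stop words removed")


# Result:
Data with stop words removed

----------separator-------------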

[u'classic', u'war', u'worlds', u'timothy', u'hines', u'entertaining', u'film', 
u'effort', u'lengths', u'faithfully', u'recreate', u'classic', u'book', 
u'hines', u'succeeds', u'watched', u'film', u'appreciated', u'standard',
 u'predictable', u'hollywood', u'fare', u'spielberg', u'version', u'tom',
 u'cruise', u'slightest', u'resemblance', u'book', u'movie', u'envision',
 u'amateur', u'critics', u'criticize', u'rate', u'movie', u'bases', 
u'entertained', u'people', u'agree', u'critics', u'enjoyed', u'effort',
 u'hines', u'faithful', u'classic', u'entertaining', u'easy', u'overlook',
 u'critics', u'perceive', u'shortcomings']

Combine the cleaning steps above into a single function.

eng_stopwords = set(stopwords)

def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]
    return ' '.join(words)

clean_text(raw_example)

# Result:




u'classic war worlds timothy hines entertaining film effort lengths faithfully 
recreate classic book hines succeeds watched film appreciated standard 
predictable hollywood fare spielberg version tom cruise slightest resemblance
 book movie envision amateur critics criticize rate movie bases entertained 
people agree critics enjoyed effort hines faithful classic entertaining easy
 overlook critics perceive shortcomings'

Add the cleaned data to the DataFrame

df["clean_review"]=df.review.apply(clean_text)
df.head()



# Result:
   id      sentiment                review                                     clean_review
0  5814_8      1        With all this stuff going down at the moment w...      stuff moment mj ve started listening music wat...
1  2381_9      1        "The Classic War of the Worlds" by Timothy Hin...      classic war worlds timothy hines entertaining ...
2  7759_3      0        The film starts with a manager (Nicholas Bell)...      film starts manager nicholas bell investors ro...
3  3630_4      0        It must be assumed that those who praised this...      assumed praised film filmed opera didn read do...
4  9495_8      1        Superbly trashy and wondrously unpretentious 8...      superbly trashy wondrously unpretentious explo...

Extract bag-of-words features (using sklearn's CountVectorizer)

vectorizer=CountVectorizer(max_features=5000)  # keep the 5000 most frequent words
train_data_features=vectorizer.fit_transform(df.clean_review).toarray()
train_data_features.shape


# Output:

(25000, 5000)
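
To make the shape above more concrete, here is a small sketch of what CountVectorizer produces on two made-up cleaned reviews. The strings in `toy_reviews` and the names `toy_vectorizer` and `toy_features` are illustrative only and are not part of the competition code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical cleaned reviews (not from the dataset)
toy_reviews = [
    "classic war worlds entertaining film",
    "film starts manager film",
]

toy_vectorizer = CountVectorizer(max_features=5000)
toy_features = toy_vectorizer.fit_transform(toy_reviews).toarray()

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
# ['classic', 'entertaining', 'film', 'manager', 'starts', 'war', 'worlds']
print(toy_features)
# [[1 1 1 0 0 1 1]
#  [0 0 2 1 1 0 0]]
```

Each row is one review and each column is one vocabulary word, so the real training matrix has one row per review and one column for each of the 5000 retained words.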

Train the classifier and make predictions

forest=RandomForestClassifier(n_estimators=100)
forest=forest.fit(train_data_features,df.sentiment)

# Read the test data and apply the same cleaning steps
datafile = os.path.join('..', 'data', 'testData.tsv')
df = pd.read_csv(datafile, sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
df['clean_review'] = df.review.apply(clean_text)

# Convert the test data into bag-of-words vectors with the already fitted vectorizer
test_data_features = vectorizer.transform(df.clean_review).toarray()

# Predict

result=forest.predict(test_data_features)
output=pd.DataFrame({"id":df.id,"sentiment":result})
output.head()

# Output:

     id        sentiment
0   12311_10       1
1   8348_2         0
2   5828_4         1
3   7186_2         1
4   12128_7        1
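
The test file has no sentiment column, so these predictions cannot be scored locally. One possible way to get a rough estimate is to hold out part of the labeled training data and look at a confusion matrix, which is what the otherwise unused `confusion_matrix` import is for. The sketch below is an assumption about how that could look, not part of the original baseline; it uses synthetic arrays `X` and `y` as stand-ins for `train_data_features` and `df.sentiment`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bag-of-words matrix and labels (not the real data)
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(200, 20))
y = (X[:, 0] + X[:, 1] > 2).astype(int)

# Keep 20% of the labeled rows aside for validation
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels, in the order [0, 1]
cm = confusion_matrix(y_val, clf.predict(X_val), labels=[0, 1])
print(cm)
```

With the real data, `X` and `y` would be replaced by `train_data_features` and the training labels before the DataFrame is overwritten by the test file.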

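That is the end of the program.

The approach uses CountVectorizer to turn the text into vectors and a random forest to make the predictions.

CountVectorizer has a drawback: the words in a text are related to their context, and representing the text purely by word counts ignores these relationships. word2vec is therefore a natural choice for improving the model.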
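
As a small illustration of this drawback, the sketch below feeds CountVectorizer two made-up sentences with opposite meanings that contain exactly the same words. The sentences and variable names are hypothetical examples, not data from the competition.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical sentences: same words, opposite sentiment
sentences = [
    "the film was good not bad",
    "the film was bad not good",
]

order_vectorizer = CountVectorizer()
vectors = order_vectorizer.fit_transform(sentences).toarray()

# Word order is discarded, so both rows hold the same counts
same = bool((vectors[0] == vectors[1]).all())
print(same)  # True
```

Because both sentences map to the same vector, no classifier trained on these features can tell them apart.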
