I took part in a Kaggle competition on predicting the sentiment of movie reviews. Below is my baseline approach.
Tools used: CountVectorizer and RandomForestClassifier.
# Import the required libraries
import os
import re
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup  # strips HTML markup from the reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import nltk
from nltk.corpus import stopwords
# Load the labeled training data (tab-separated)
datafile = os.path.join("..", "data", "labeledTrainData.tsv")
df = pd.read_csv(datafile, sep="\t", escapechar="\\")
print("Number of reviews: {}".format(len(df)))
df.head()
# Output:
Number of reviews: 25000
id sentiment review
0 5814_8 1 With all this stuff going down at the moment w...
1 2381_9 1 "The Classic War of the Worlds" by Timothy Hin...
2 7759_3 0 The film starts with a manager (Nicholas Bell)...
3 3630_4 0 It must be assumed that those who praised this...
4 9495_8 1 Superbly trashy and wondrously unpretentious 8...
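Cleaning each review takes five steps:
1. Strip HTML tags (using the BeautifulSoup package)
2. Remove punctuation (using a regular expression)
3. Split into words/tokens
4. Remove stop words
5. Rejoin the words into a new sentence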
def display(text, title):
    print(title)
    print("\n----------- divider -----------\n")
    print(text)
raw_example = df["review"][1]  # work with the second review as an example
display(raw_example, "Raw data")
# Output:
Raw data
----------- divider -----------
"The Classic War of the Worlds" by Timothy Hines is a very entertaining film that
obviously goes to great effort and lengths to faithfully recreate H. G. Wells' classic
book. Mr. Hines succeeds in doing so. I, and those who watched his film with
me, appreciated the fact that it was not the standard, predictable Hollywood
fare that comes out every year, e.g. the Spielberg version with Tom Cruise
that had only the slightest resemblance to the book. Obviously, everyone
looks for different things in a movie. Those who envision themselves as amateur "critics" look only to criticize everything they can. Others rate
a movie on more important bases,like being entertained, which is why most
people never agree with the "critics". We enjoyed the effort Mr. Hines put
into being faithful to H.G. Wells' classic novel, and we found it to be
very entertaining. This made it easy to overlook what the "critics" perceive
to be its shortcomings.
example = BeautifulSoup(raw_example, "html.parser").get_text()
display(example, "Data with HTML tags removed")
# Output:
Data with HTML tags removed
----------- divider -----------
"The Classic War of the Worlds" by Timothy Hines is a very entertaining film
that obviously goes to great effort and lengths to faithfully recreate
H. G. Wells' classic book. Mr. Hines succeeds in doing so. I, and those
who watched his film with me, appreciated the fact that it was not the
standard, predictable Hollywood fare that comes out every year, e.g.
the Spielberg version with Tom Cruise that had only the slightest
resemblance to the book. Obviously, everyone looks for different
things in a movie. Those who envision themselves as amateur "critics"
look only to criticize everything they can. Others rate a movie on
more important bases,like being entertained, which is why most people
never agree with the "critics". We enjoyed the effort Mr. Hines put
into being faithful to H.G. Wells' classic novel, and we found it to
be very entertaining. This made it easy to overlook what the "critics"
perceive to be its shortcomings.
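# Replace every character that is not an ASCII letter with a space
example_letters = re.sub(r'[^a-zA-Z]', ' ', example)
display(example_letters, "Data with punctuation removed")
# Output:
Data with punctuation removed
----------- divider -----------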
The Classic War of the Worlds by Timothy Hines is a very entertaining film
that obviously goes to great effort and lengths to faithfully recreate H G
Wells classic book Mr Hines succeeds in doing so I and those who watched his
film with me appreciated the fact that it was not the standard predictable
Hollywood fare that comes out every year e g the Spielberg version with Tom
Cruise that had only the slightest resemblance to the book Obviously everyone
looks for different things in a movie Those who envision themselves as amateur
critics look only to criticize everything they can Others rate a movie on
more important bases like being entertained which is why most people never
agree with the critics We enjoyed the effort Mr Hines put into being
faithful to H G Wells classic novel and we found it to be very entertaining
This made it easy to overlook what the critics perceive to be its shortcomings
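To see what this pattern does on a smaller input, here is a minimal sketch. The sample sentence is made up for illustration and is not from the dataset. Note that digits are dropped along with punctuation, and a contraction such as "didn't" is split into "didn" and "t".

```python
import re

# Hypothetical sample text, not taken from the dataset.
sample = "I didn't like it: 2/10, e.g. the ending!"

# Every character that is not an ASCII letter becomes a single space.
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# Splitting on whitespace collapses the runs of spaces left behind.
tokens = letters_only.lower().split()
print(tokens)
```

If ratings such as "2/10" carry signal for sentiment, a pattern that also keeps digits might be worth trying, though that goes beyond this baseline.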
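# Step 3: lowercase the text and split it into tokens
words = example_letters.lower().split()
# Step 4: load the stop-word list (one word per line) and filter the tokens
stopwords = {}.fromkeys([line.rstrip() for line in open("..stopwords.txt")])
words_nostop = [w for w in words if w not in stopwords]
display(words_nostop, "Data with stop words removed")
# Output:
Data with stop words removed
----------- divider -----------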
[u'classic', u'war', u'worlds', u'timothy', u'hines', u'entertaining', u'film',
u'effort', u'lengths', u'faithfully', u'recreate', u'classic', u'book',
u'hines', u'succeeds', u'watched', u'film', u'appreciated', u'standard',
u'predictable', u'hollywood', u'fare', u'spielberg', u'version', u'tom',
u'cruise', u'slightest', u'resemblance', u'book', u'movie', u'envision',
u'amateur', u'critics', u'criticize', u'rate', u'movie', u'bases',
u'entertained', u'people', u'agree', u'critics', u'enjoyed', u'effort',
u'hines', u'faithful', u'classic', u'entertaining', u'easy', u'overlook',
u'critics', u'perceive', u'shortcomings']
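The tokenizing and filtering steps can also be tried without the stop-word file. The sketch below uses a tiny hand-written stop-word set, which is an assumption for illustration only; the real list is much longer.

```python
# A tiny, hypothetical stop-word set; the post loads a much longer list from a file.
toy_stopwords = {"the", "of", "by", "is", "a", "to", "and"}

text = "The Classic War of the Worlds by Timothy Hines is a very entertaining film"

# Lowercase first so that "The" and "the" are treated as the same token.
words = text.lower().split()

# Keep only the tokens that are not in the stop-word set.
words_nostop = [w for w in words if w not in toy_stopwords]
print(words_nostop)
```

With this short list "very" survives, whereas the full list used in the post removes it.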
The cleaning steps above can be combined into a single function.
eng_stopwords = set(stopwords)
def clean_text(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    words = text.lower().split()
    words = [w for w in words if w not in eng_stopwords]
    return ' '.join(words)
clean_text(raw_example)
# Output:
u'classic war worlds timothy hines entertaining film effort lengths faithfully
recreate classic book hines succeeds watched film appreciated standard
predictable hollywood fare spielberg version tom cruise slightest resemblance
book movie envision amateur critics criticize rate movie bases entertained
people agree critics enjoyed effort hines faithful classic entertaining easy
overlook critics perceive shortcomings'
# Apply the cleaning function to every review
df["clean_review"] = df.review.apply(clean_text)
df.head()
# Output:
id sentiment review clean_review
0 5814_8 1 With all this stuff going down at the moment w... stuff moment mj ve started listening music wat...
1 2381_9 1 "The Classic War of the Worlds" by Timothy Hin... classic war worlds timothy hines entertaining ...
2 7759_3 0 The film starts with a manager (Nicholas Bell)... film starts manager nicholas bell investors ro...
3 3630_4 0 It must be assumed that those who praised this... assumed praised film filmed opera didn read do...
4 9495_8 1 Superbly trashy and wondrously unpretentious 8... superbly trashy wondrously unpretentious explo...
vectorizer = CountVectorizer(max_features=5000)  # keep the 5,000 most frequent words
train_data_features = vectorizer.fit_transform(df.clean_review).toarray()
train_data_features.shape
# Output:
(25000,5000)
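Each row of this matrix is one review and each column counts how often one vocabulary word occurs in it. The idea can be sketched in pure Python as below. This is only a rough illustration, not scikit-learn's implementation: the function names are hypothetical, and the tokenization and tie-breaking rules may differ from CountVectorizer's.

```python
from collections import Counter

def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent words, returned in alphabetical order."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Sort by descending count, breaking ties alphabetically (a choice made for this sketch).
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])

def transform(docs, vocabulary):
    """Turn each document into a list of counts, one entry per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

docs = ["classic war film", "classic book classic film", "boring film"]
vocab = fit_vocabulary(docs, max_features=3)
matrix = transform(docs, vocab)
print(vocab)   # ['book', 'classic', 'film']
print(matrix)  # [[0, 1, 1], [1, 2, 1], [0, 0, 1]]
```

Words outside the kept vocabulary ("war" and "boring" here) are simply ignored, which is what capping the vocabulary at 5,000 words does to rare words in the reviews.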
# Train a random forest with 100 trees on the bag-of-words features
forest = RandomForestClassifier(n_estimators=100)
forest = forest.fit(train_data_features, df.sentiment)
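Because the test set has no labels, a rough local estimate of model quality needs a held-out slice of the training data. The sketch below shows one way to do that with the confusion_matrix function imported earlier. It runs on synthetic data so that it is self-contained; with the real data, X and y would presumably be train_data_features and df.sentiment. The 20% split size and the random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bag-of-words matrix and the sentiment labels.
rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(200, 20))
y = (X[:, 0] + X[:, 1] > 2).astype(int)

# Hold out 20% of the labeled rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_val, forest.predict(X_val), labels=[0, 1])
print(cm)
```

The diagonal of the matrix counts correct predictions, and the off-diagonal cells show which class the model confuses with the other.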
# Read the test data and apply the same cleaning
datafile = os.path.join('..', 'data', 'testData.tsv')
df = pd.read_csv(datafile, sep='\t', escapechar='\\')
print('Number of reviews: {}'.format(len(df)))
df['clean_review'] = df.review.apply(clean_text)
# Vectorize the test data with the vectorizer fitted on the training data
test_data_features = vectorizer.transform(df.clean_review).toarray()
# Predict
result = forest.predict(test_data_features)
output = pd.DataFrame({"id": df.id, "sentiment": result})
output.head()
# Output:
id sentiment
0 12311_10 1
1 8348_2 0
2 5828_4 1
3 7186_2 1
4 12128_7 1
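That concludes the program.
The approach uses CountVectorizer to turn the text into feature vectors and a random forest to make the predictions.
CountVectorizer has a drawback: words in a text are related to their context, and representing a review purely by word counts ignores those relationships.
word2vec is therefore a natural next step for improving the model.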
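The loss of context is easy to demonstrate with a small sketch. The two reviews below are made up for illustration; they have opposite meanings but contain exactly the same words, so a pure word-count representation cannot tell them apart.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each word occurs, discarding word order."""
    return Counter(text.lower().split())

# Two made-up reviews with opposite meanings but identical word counts.
review_a = "good acting not bad plot"
review_b = "bad acting not good plot"

same_counts = bag_of_words(review_a) == bag_of_words(review_b)
print(same_counts)  # True: the two reviews are indistinguishable by counts alone
```

Any classifier trained on such counts receives the same input for both reviews and must give them the same prediction.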