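Sentiment analysis is a challenging task in machine learning. The dataset contains 50,000 IMDB movie reviews. The 25,000 reviews in the training set are labeled with a binary sentiment: reviews with an IMDB rating < 5 have sentiment 0, and reviews with a rating >= 7 have sentiment 1. The remaining 25,000 reviews form the test set and carry no labels.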
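The labeling rule can be written as a small function. This is only a sketch of the rule as stated above: the function name is hypothetical, and returning `None` for ratings of 5 or 6 is an assumption, since the text does not say how those ratings are treated.

```python
from typing import Optional

def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB rating (1-10) to the binary label described in the text."""
    if rating < 5:
        return 0
    if rating >= 7:
        return 1
    # Ratings of 5 or 6 are not covered by the stated rule (assumption: no label).
    return None

labels = [rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)]
print(labels)
```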
import os
print(os.listdir("./input"))
['testData.csv', 'labeledTrainData.csv']
import pandas as pd
# Load the data (the files are tab-separated despite the .csv extension)
train = pd.read_csv('./input/labeledTrainData.csv',delimiter = '\t')
test = pd.read_csv('./input/testData.csv',delimiter = '\t')
train.shape, test.shape
train.head()
train['review'][0]
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.
Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.
The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.
Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.
Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
test.head()
test['review'][0]
"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty."
Check how the sentiment labels are distributed.
print("number of rows for sentiment 1: {}".format(len(train[train.sentiment == 1])))
print("number of rows for sentiment 0: {}".format(len(train[train.sentiment == 0])))
number of rows for sentiment 1: 12500
number of rows for sentiment 0: 12500
train.groupby('sentiment').describe().transpose()
# Create a new column holding the review length in characters
train['length'] = train['review'].apply(len)
train.head()
# Import the plotting library
import matplotlib.pyplot as plt
%matplotlib inline
# Histogram of review lengths
train['length'].plot.hist(bins = 100)
train.length.describe()
count 25000.000000
mean 1327.710560
std 1005.239246
min 52.000000
25% 703.000000
50% 981.000000
75% 1617.000000
max 13708.000000
Name: length, dtype: float64
train[train['length'] == 13708]['review'].iloc[0]
'Match 1: Tag Team Table Match Bubba Ray and Spike Dudley vs Eddie Guerrero and Chris Benoit Bubba Ray and Spike Dudley started things off with a Tag Team Table Match against Eddie Guerrero and Chris Benoit. According to the rules of the match, both opponents have to go through tables in order to get the win. Benoit and Guerrero heated up early on by taking turns hammering first Spike and then Bubba Ray. A German suplex by Benoit to Bubba took the wind out of the Dudley brother. Spike tried to help his brother, but the referee restrained him while Benoit and Guerrero ganged up on him in the corner. With Benoit stomping away on Bubba, Guerrero set up a table outside. Spike dashed into the ring and somersaulted over the top rope onto Guerrero on the outside! After recovering and taking care of Spike, Guerrero slipped a table into the ring and helped the Wolverine set it up. The tandem then set up for a double superplex from the middle rope which would have put Bubba through the table, but Spike knocked the table over right before his brother came crashing down! Guerrero and Benoit propped another table in the corner and tried to Irish Whip Spike through it, but Bubba dashed in and blocked his brother. Bubba caught fire and lifted both opponents into back body drops! Bubba slammed Guerrero and Spike stomped on the Wolverine from off the top rope. Bubba held Benoit at bay for Spike to soar into the Wassup! headbutt! Shortly after, Benoit latched Spike in the Crossface, but the match continued even after Spike tapped out. Bubba came to his brother\'s rescue and managed to sprawl Benoit on a table. Bubba leapt from the middle rope, but Benoit moved and sent Bubba crashing through the wood! But because his opponents didn\'t force him through the table, Bubba was allowed to stay in the match. The first man was eliminated shortly after, though, as Spike put Eddie through a table with a Dudley Dawg from the ring apron to the outside! 
Benoit put Spike through a table moments later to even the score. Within seconds, Bubba nailed a Bubba Bomb that put Benoit through a table and gave the Dudleys the win! Winner: Bubba Ray and Spike Dudley
Match 2: Cruiserweight Championship Jamie Noble vs Billy Kidman Billy Kidman challenged Jamie Noble, who brought Nidia with him to the ring, for the Cruiserweight Championship. Noble and Kidman locked up and tumbled over the ring, but raced back inside and grappled some more. When Kidman thwarted all Noble\'s moves, Noble fled outside the ring where Nidia gave him some encouragement. The fight spread outside the ring and Noble threw his girlfriend into the challenger. Kidman tossed Nidia aside but was taken down with a modified arm bar. Noble continued to attack Kidman\'s injured arm back in the ring. Kidman\'s injured harm hampered his offense, but he continued to battle hard. Noble tried to put Kidman away with a powerbomb but the challenger countered into a facebuster. Kidman went to finish things with a Shooting Star Press, but Noble broke up the attempt. Kidman went for the Shooting Star Press again, but this time Noble just rolled out of harm\'s way. Noble flipped Kidman into a power bomb soon after and got the pin to retain his WWE Cruiserweight Championship! Winner: Jamie Noble
Match 3: European Championship William Regal vs Jeff Hardy William Regal took on Jeff Hardy next in an attempt to win back the European Championship. Jeff catapulted Regal over the top rope then took him down with a hurracanrana off the ring apron. Back in the ring, Jeff hit the Whisper in the wind to knock Regal for a loop. Jeff went for the Swanton Bomb, but Regal got his knees up to hit Jeff with a devastating shot. Jeff managed to surprise Regal with a quick rollup though and got the pin to keep the European Championship! Regal started bawling at seeing Hardy celebrate on his way back up the ramp. Winner: Jeff Hardy
Match 4: Chris Jericho vs John Cena Chris Jericho had promised to end John Cena\'s career in their match at Vengeance, which came up next. Jericho tried to teach Cena a lesson as their match began by suplexing him to the mat. Jericho continued to knock Cena around the ring until his cockiness got the better of him. While on the top rope, Jericho began to showboat and allowed Cena to grab him for a superplex! Cena followed with a tilt-a-whirl slam but was taken down with a nasty dropkick to the gut. The rookie recovered and hit a belly to belly suplex but couldn\'t put Y2J away. Jericho launched into the Lionsault but Cena dodged the move. Jericho nailed a bulldog and then connected on the Lionsault, but did not go for the cover. He goaded Cena to his feet so he could put on the Walls of Jericho. Cena had other ideas, reversing the move into a pin attempt and getting the 1-2-3! Jericho went berserk after the match. Winner: John Cena
Match 5: Intercontinental Championship RVD vs Brock Lesnar via disqualification The Next Big Thing and Mr. Pay-Per-View tangled with the Intercontinental Championship on the line. Brock grabbed the title from the ref and draped it over his shoulder momentarily while glaring at RVD. Van Dam \'s quickness gave Brock fits early on. The big man rolled out of the ring and kicked the steel steps out of frustration. Brock pulled himself together and began to take charge. With Paul Heyman beaming at ringside, Brock slammed RVD to the hard floor outside the ring. From there, Brock began to overpower RVD, throwing him with ease over the top rope. RVD landed painfully on his back, then had to suffer from having his spine cracked against the steel ring steps. The fight returned to the ring with Brock squeezing RVD around the ribs. RVD broke away and soon after leveled Brock with a kick to the temple. RVD followed with the Rolling Thunder but Brock managed to kick out after a two-count. The fight looked like it might be over soon as RVD went for a Five-Star Frog Splash. Brock, though, hoisted Van Dam onto his shoulder and went for the F-5, but RVD whirled Brock into a DDT and followed with the Frog Splash! He went for the pin, but Heyman pulled the ref from the ring! The ref immediately called for a disqualification and soon traded blows with Heyman! After, RVD leapt onto Brock from the top rope and then threatened to hit the Van Terminator! Heyman grabbed RVD\'s leg and Brock picked up the champ and this time connected with the F-5 onto a steel chair! Winner: RVD
Match 6: Booker T vs the Big Show Booker T faced the Big Show one-on-one next. Show withstood Booker T\'s kicks and punches and slapped Booker into the corner. After being thrown from the ring, Booker picked up a chair at ringside, but Big Show punched it back into Booker\'s face. Booker tried to get back into the game by choking Show with a camera cable at ringside. Booker smashed a TV monitor from the Spanish announcers\' position into Show\'s skull, then delivered a scissors kick that put both men through the table! Booker crawled back into the ring and Big Show staggered in moments later. Show grabbed Booker\'s throat but was met by a low blow and a kick to the face. Booker climbed the top rope and nailed a somersaulting leg drop to get the pin! Winner: Booker T
Announcement: Triple H entered the ring to a thunderous ovation as fans hoped to learn where The Game would end up competing. Before he could speak, Eric Bishoff stopped The Game to apologize for getting involved in his personal business. If Triple H signed with RAW, Bischoff promised his personal life would never come into play again. Bischoff said he\'s spent the past two years networking in Hollywood. He said everyone was looking for the next breakout WWE Superstar, and they were all talking about Triple H. Bischoff guaranteed that if Triple H signed with RAW, he\'d be getting top opportunities coming his way. Stephanie McMahon stepped out to issue her own pitch. She said that because of her personal history with Triple H, the two of them know each other very well. She said the two of them were once unstoppable and they can be again. Bischoff cut her off and begged her to stop. Stephanie cited that Triple H once told her how Bischoff said Triple H had no talent and no charisma. Bischoff said he was young at the time and didn\'t know what he had, but he still has a lot more experience that Stephanie. The two continued to bicker back and forth, until Triple H stepped up with his microphone. The Game said it would be easy to say \\screw you\\" to either one of them. Triple H went to shake Bischoff\'s hand, but pulled it away. He said he would rather go with the devil he knows, rather than the one he doesn\'t know. Before he could go any further, though, Shawn Michaels came out to shake things up. HBK said the last thing he wanted to do was cause any trouble. He didn\'t want to get involved, but he remembered pledging to bring Triple H to the nWo. HBK said there\'s nobody in the world that Triple H is better friends with. HBK told his friend to imagine the two back together again, making Bischoff\'s life a living hell. Triple H said that was a tempting offer. He then turned and hugged HBK, making official his switch to RAW! 
Triple H and HBK left, and Bischoff gloated over his victory. Bischoff said the difference between the two of them is that he\'s got testicles and she doesn\'t. Stephanie whacked Bischoff on the side of the head and left!
Match 7: Tag Team Championship Match Christian and Lance Storm vs Hollywood Hogan and Edge The match started with loud \\"USA\\" chants and with Hogan shoving Christian through the ropes and out of the ring. The Canadians took over from there. But Edge scored a kick to Christian\'s head and planted a facebuster on Storm to get the tag to Hogan. Hogan began to Hulk up and soon caught Christian with a big boot and a leg drop! Storm broke up the count and Christian tossed Hogan from the ring where Storm superkicked the icon. Edge tagged in soon after and dropped both opponents. He speared both of them into the corner turnbuckles, but missed a spear on Strom and hit the ref hard instead. Edge nailed a DDT, but the ref was down and could not count. Test raced down and took down Hogan then leveled Edge with a boot. Storm tried to get the pin, but Edge kicked out after two. Riksihi sprinted in to fend off Test, allowing Edge to recover and spear Storm. Christian distracted the ref, though, and Y2J dashed in and clocked Edge with the Tag Team Championship! Storm rolled over and got the pinfall to win the title! Winners and New Tag Team Champions: Christian and Lance Storm
Match 8: WWE Undisputed Championship Triple Threat Match. The Rock vs Kurt Angle and the Undertaker Three of WWE\'s most successful superstars lined up against each other in a Triple Threat Match with the Undisputed Championship hanging in the balance. Taker and The Rock got face to face with Kurt Angle begging for some attention off to the side. He got attention in the form of a beat down form the two other men. Soon after, Taker spilled out of the ring and The Rock brawled with Angle. Angle gave a series of suplexes that took down Rock, but the Great One countered with a DDT that managed a two-count. The fight continued outside the ring with Taker coming to life and clotheslining Angle and repeatedly smacking The Rock. Taker and Rock got into it back into the ring, and Taker dropped The Rock with a sidewalk slam to get a two-count. Rock rebounded, grabbed Taker by the throat and chokeslammed him! Angle broke up the pin attempt that likely would have given The Rock the title. The Rock retaliated by latching on the ankle lock to Kurt Angle. Angle reversed the move and Rock Bottomed the People\'s Champion. Soon after, The Rock disposed of Angle and hit the People\'s Elbow on the Undertaker. Angle tried to take advantage by disabling the Great One outside the ring and covering Taker, who kicked out after a two count. Outside the ring, Rock took a big swig from a nearby water bottle and spewed the liquid into Taker\'s face to blind the champion. Taker didn\'t stay disabled for long, and managed to overpower Rock and turn his attention to Angle. Taker landed a guillotine leg drop onto Angle, laying on the ring apron. The Rock picked himself up just in time to break up a pin attempt on Kurt Angle. Taker nailed Rock with a DDT and set him up for a chokeslam. ANgle tried sneaking up with a steel chair, but Taker caught on to that tomfoolery and smacked it out of his hands. 
The referee got caught in the ensuing fire and didn\'t see Angle knock Taker silly with a steel chair. Angle went to cover Taker as The Rock lay prone, but the Dead Man somehow got his shoulder up. Angle tried to pin Rock, but he too kicked out. The Rock got up and landed Angle in the sharpshooter! Angle looked like he was about to tap, but Taker kicked The Rock out of the submission hold. Taker picked Rock up and crashed him with the Last Ride. While the Dead Man covered him for the win, Angle raced in and picked Taker up in the ankle lock! Taker went delirious with pain, but managed to counter. He picked Angle up for the last ride, but Angle put on a triangle choke! It looked like Taker was about to pass out, but The Rock broke Angle\'s hold only to find himself caught in the ankle lock. Rock got out of the hold and watched Taker chokeslam Angle. Rocky hit the Rock Bottom, but Taker refused to go down and kicked out. Angle whirled Taker up into the Angle Slam but was Rock Bottomed by the Great One and pinned! Winner and New WWE Champion: The Rock
~Finally there is a decent PPV! Lately the PPV weren\'t very good, but this one was a winner. I give this PPV a A-
"'
train.hist(column='length', by='sentiment', bins=100,figsize=(12,4))
If the library is not installed, you can install it with: sudo pip install BeautifulSoup4
If the stopwords corpus is missing, run: import nltk; nltk.download("stopwords")
# Import the packages needed for preprocessing
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
# Strip HTML tags
raw_text = BeautifulSoup(train["review"][0],"lxml").get_text()
print(raw_text)
With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.
# Remove all non-letter characters
letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
print(letters_only)
# Convert to lowercase
letters_lowercase = letters_only.lower()
print(letters_lowercase)
# Split the document into tokens
words = letters_lowercase.split()
print(words)
# Remove stop words
# 1. Build the stop-word set
# 2. Filter the stop words out of the token list
stops = set(stopwords.words("english"))
#stops.add("stuff")
clean_review1 = [w for w in words if w not in stops]
print(clean_review1)
def clean_text(raw_text):
    # Strip HTML tags, keep letters only, lowercase, tokenize, drop stop words
    raw_text = BeautifulSoup(raw_text, "lxml").get_text()
    letters_only = re.sub("[^a-zA-Z]", " ", raw_text)
    words = letters_only.lower().split()
    stops = set(stopwords.words("english"))
    return [w for w in words if w not in stops]
# Apply the cleaning function to the review column
train['clean_review'] = train['review'].apply(clean_text)
#train['clean_review'] = [clean_text(e,stops)for e in train['review']]
# Add a column: number of tokens left after cleaning
train['length_clean_review'] = train['clean_review'].apply(len)
train.head()
train.describe()
#Checking the smallest review
print(train[train['length_clean_review'] == 4]['review'].iloc[0])
print('------After Cleaning------')
print(train[train['length_clean_review'] == 4]['clean_review'].iloc[0])
This movie is terrible but it has some good effects.
------After Cleaning------
['movie', 'terrible', 'good', 'effects']
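The cleaning steps above can be tried without BeautifulSoup or NLTK. The following is a minimal sketch under stated assumptions: it swaps in a regular expression for HTML tag removal and a small hand-written stop-word set (`TOY_STOPS`, hypothetical and far shorter than NLTK's English list), so its output will differ from `clean_text` on most real reviews.

```python
import re

# Hypothetical, deliberately tiny stop-word set; NLTK's English list is much longer.
TOY_STOPS = {"this", "is", "but", "it", "has", "some", "a", "an", "the", "and", "of"}

def clean_text_sketch(raw_text):
    """Approximate the cleaning pipeline using only the standard library."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_text)        # crude HTML tag removal
    letters_only = re.sub(r"[^a-zA-Z]", " ", no_tags)  # keep letters only
    words = letters_only.lower().split()               # lowercase and tokenize
    return [w for w in words if w not in TOY_STOPS]    # drop stop words

tokens = clean_text_sketch("This movie is terrible<br /> but it has some good effects.")
print(tokens)
```

On this short review the sketch keeps the same four tokens as the NLTK-based function shown above.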
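### Word cloud
# Draw a word cloud from the raw reviews (join with a space so words at review boundaries do not merge)
from wordcloud import WordCloud
word_cloud = WordCloud(width = 1000, height = 500, background_color = 'black').generate(
    ' '.join(train['review']))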
plt.figure(figsize = (15,8))
plt.imshow(word_cloud)
plt.axis('off')
plt.show()
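# Draw a word cloud from the cleaned reviews; each entry is a token list, so flatten before joining
# (str(train['clean_review']) would only yield the truncated printed summary of the column)
word_cloud = WordCloud(width = 1000, height = 500, background_color = 'black').generate(
    ' '.join(w for tokens in train['clean_review'] for w in tokens))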
plt.figure(figsize = (15,8))
plt.imshow(word_cloud)
plt.axis('off')
plt.show()
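A word cloud sizes each word by how often it occurs, so the same information can be checked numerically. Below is a small sketch with `collections.Counter`, assuming token lists shaped like the `clean_review` column; `toy_reviews` is made-up data, not taken from the dataset.

```python
from collections import Counter

# Made-up token lists standing in for train['clean_review'] (assumption).
toy_reviews = [
    ["movie", "terrible", "good", "effects"],
    ["good", "movie", "great", "cast"],
    ["good", "story"],
]

# Flatten the per-review token lists and count every token.
counts = Counter(token for review in toy_reviews for token in review)
top_two = counts.most_common(2)
print(top_two)
```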
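Next, we need to convert the cleaned text into a form that machine learning models can consume. Here we vectorize the text in several different ways, which can be broken down into the following steps.
Boolean vectors
TF
IDF
We train models on boolean features, TF features, and TF-IDF features in turn.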
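To make the difference between boolean and TF features concrete, here is a from-scratch sketch on a toy corpus. It is not how scikit-learn implements `CountVectorizer`; `toy_docs` and the helper names are illustrative assumptions.

```python
from collections import Counter

# Toy, already-tokenized documents (assumption; not from the dataset).
toy_docs = [
    ["good", "movie", "good", "cast"],
    ["bad", "movie"],
]

# Vocabulary: sorted unique tokens, each mapped to a column index.
vocab = {word: idx for idx, word in enumerate(sorted({w for d in toy_docs for w in d}))}

def tf_vector(doc):
    """Term-frequency vector: how many times each vocabulary word occurs."""
    counts = Counter(doc)
    return [counts.get(word, 0) for word in vocab]

def bool_vector(doc):
    """Boolean vector: 1 if the word occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in tf_vector(doc)]

print(vocab)                     # column index of each word
print(tf_vector(toy_docs[0]))    # counts
print(bool_vector(toy_docs[0]))  # presence only
```

The two vectors differ only where a word repeats within a document, which is what the `binary` flag controls.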
4.1.1 In CountVectorizer we specify analyzer, max_features, and binary
from sklearn.feature_extraction.text import CountVectorizer
# This may take a while to run
bool_transformer = CountVectorizer(analyzer=clean_text,binary = True,max_features=5000).fit(train['review'])
# Print the vocabulary
# print(bool_transformer.vocabulary_)
# Print the first 100 (word, column index) pairs
a = list(bool_transformer.vocabulary_.items())
print(a[:100])
[('stuff', 4258), ('going', 1902), ('moment', 2859), ('started', 4173), ('listening', 2590), ('music', 2916), ('watching', 4838), ('odd', 3051), ('documentary', 1270), ('watched', 4836), ('maybe', 2747), ('want', 4814), ('get', 1872), ('certain', 674), ('insight', 2270), ('guy', 1978), ('thought', 4474), ('really', 3538), ('cool', 947), ('eighties', 1393), ('make', 2686), ('mind', 2819), ('whether', 4880), ('guilty', 1972), ('innocent', 2266), ('part', 3159), ('feature', 1646), ('film', 1687), ('remember', 3617), ('see', 3865), ('cinema', 757), ('originally', 3097), ('released', 3598), ('subtle', 4278), ('messages', 2789), ('feeling', 1653), ('towards', 4550), ('press', 3352), ('also', 146), ('obvious', 3041), ('message', 2788), ('drugs', 1332), ('bad', 330), ('visually', 4780), ('impressive', 2220), ('course', 977), ('michael', 2799), ('jackson', 2347), ('unless', 4687), ('remotely', 3624), ('like', 2569), ('anyway', 208), ('hate', 2026), ('find', 1697), ('boring', 485), ('may', 2746), ('call', 590), ('making', 2691), ('movie', 2895), ('fans', 1615), ('would', 4960), ('say', 3812), ('made', 2670), ('true', 4610), ('nice', 2979), ('actual', 54), ('bit', 431), ('finally', 1695), ('starts', 4175), ('minutes', 2830), ('smooth', 4059), ('criminal', 1019), ('sequence', 3899), ('joe', 2378), ('convincing', 943), ('powerful', 3329), ('drug', 1331), ('lord', 2625), ('wants', 4817), ('dead', 1092), ('beyond', 417), ('plans', 3260), ('character', 697), ('wanted', 4815), ('people', 3188), ('know', 2458), ('etc', 1486), ('hates', 2028), ('lots', 2635), ('things', 4465), ('turning', 4622), ('car', 617), ('robot', 3724), ('whole', 4884), ('speed', 4121), ('demon', 1137), ('director', 1225), ('must', 2920), ('patience', 3178), ('saint', 3789)]
Use the fitted bool_transformer to process a single review.
4.1.2 Inspect the vector representation
review1 = train['review'][0]
bow1 = bool_transformer.transform([review1])
print(bow1)
print(bow1.toarray())
print(bow1.shape)
print(bow1.nnz)
TF feature vectors
4.2.1 In CountVectorizer we specify analyzer, max_features, and binary
from sklearn.feature_extraction.text import CountVectorizer
# This may take a while to run
bow_transformer = CountVectorizer(analyzer=clean_text,binary=False,max_features=5000).fit(train['review'])
4.2.2 In CountVectorizer we specify analyzer and max_features (binary defaults to False, so this is equivalent to 4.2.1)
# This may take a while to run
bow_transformer = CountVectorizer(analyzer=clean_text,max_features=5000).fit(train['review'])
# Print the vocabulary size
print(len(bow_transformer.vocabulary_))
Use the fitted bow_transformer to process a single review.
4.2.3 Inspect the vector representation
review1 = train['review'][0]
bow1 = bow_transformer.transform([review1])
print(bow1.toarray())
print(bow1)
print(bow1.shape)
print(bow1.nnz)
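# Every column index maps back to a word
# (get_feature_names() was removed in scikit-learn 1.2; older versions use it instead)
print(bow_transformer.get_feature_names_out()[480])
print(bow_transformer.get_feature_names_out()[4947])
# Vectorize the whole training set into a sparse matrix
review_bow = bow_transformer.transform(train['review'])
4.2.4 Check the share of non-zero entries in the sparse matrix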
print('Shape of Sparse Matrix: ', review_bow.shape)
print('Amount of Non-Zero occurences: ', review_bow.nnz)
Shape of Sparse Matrix: (25000, 5000)
Amount of Non-Zero occurences: 1980474
# Check how sparse the matrix is: this ratio is the fraction of non-zero entries (the density)
sparsity = (review_bow.nnz / (review_bow.shape[0] * review_bow.shape[1]))
print('sparsity: {}'.format(sparsity))
sparsity: 0.015843792
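The ratio printed above is the fraction of cells that are non-zero. As a small worked check, the arithmetic can be reproduced from the reported shape and non-zero count alone; the per-review average is an extra figure derived from the same numbers.

```python
# Figures reported in the output above.
n_rows, n_cols = 25000, 5000
nnz = 1980474

density = nnz / (n_rows * n_cols)    # fraction of non-zero cells
avg_terms_per_review = nnz / n_rows  # distinct vocabulary words per review, on average

print(density)
print(round(avg_terms_per_review, 1))
```

So roughly 1.6% of the cells are filled, which is why a sparse matrix is used rather than a dense array.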
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer(norm="l2",smooth_idf=True).fit(review_bow)
tfidf1 = tfidf_transformer.transform(bow1)
print(tfidf1)
print(tfidf1.shape)
print(tfidf1.nnz)
4.2.5 Inspect the computed IDF values
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['well']])
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['book']])
2.1996224552246644
3.8577509748566374
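With `smooth_idf=True`, scikit-learn documents the IDF as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. Below is a sketch of that formula in plain Python; the document frequencies used here are made-up values for illustration, not the real counts for 'well' or 'book'.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """IDF with add-one smoothing, as documented for TfidfTransformer(smooth_idf=True)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 25000
# Hypothetical document frequencies (assumption).
common = smooth_idf(n_docs, 7500)
rare = smooth_idf(n_docs, 1400)
everywhere = smooth_idf(n_docs, n_docs)  # a term in every document gets the minimum, 1.0

print(common, rare, everywhere)
```

A rarer word receives a larger weight, which matches the ordering of the two values printed above.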
# Convert the BOW matrix to TF-IDF
review_tfidf = tfidf_transformer.transform(review_bow)
print(review_tfidf.shape)
(25000, 5000)
from sklearn.decomposition import TruncatedSVD
LSA = TruncatedSVD(n_components=300, n_iter=7, random_state=42)
LSA.fit(review_tfidf)
#print(LSA.explained_variance_ratio_)
#print(LSA.explained_variance_ratio_.sum())
#print(LSA.singular_values_)
TruncatedSVD(algorithm='randomized', n_components=300, n_iter=7,
random_state=42, tol=0.0)
from sklearn.metrics import classification_report
# Define a helper that evaluates a model's predictions
def pred(predicted, compare):
    # Rows are the true labels, columns are the predicted labels
    cm = pd.crosstab(compare, predicted)
    TN = cm.iloc[0,0]
    FN = cm.iloc[1,0]
    TP = cm.iloc[1,1]
    FP = cm.iloc[0,1]
    print("CONFUSION MATRIX ------->> ")
    print(cm)
    print()

    # Compute the accuracy and the error rates, in percent
    print('Classification paradox :------->>')
    print('Accuracy :- ', round(((TP+TN)*100)/(TP+TN+FP+FN),2))
    print()
    print('False Negative Rate :- ',round((FN*100)/(FN+TP),2))
    print()
    print('False Postive Rate :- ',round((FP*100)/(FP+TN),2))
    print()
    print(classification_report(compare,predicted))
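The code that follows uses `X_train`, `y_train`, `X_test`, and `y_test`, but the split that creates them is not shown. Here is a plausible sketch, assuming scikit-learn's `train_test_split` with `test_size=0.4` (inferred from the 15,000 / 10,000 support counts in the reports); the `random_state` value and the toy frame are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the labeled training frame (assumption).
toy_train = pd.DataFrame({
    "review": ["review number {}".format(i) for i in range(10)],
    "sentiment": [0, 1] * 5,
})

# Hold out 40% of the labeled reviews for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    toy_train["review"], toy_train["sentiment"],
    test_size=0.4, random_state=42,  # random_state is an arbitrary choice
)
print(len(X_train), len(X_test))
```

On the real data the call would presumably take `train['review']` and `train['sentiment']` in place of the toy columns.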
6.3.1 Random forest
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
tuned_parameters = {'n_estimators': [100,150], 'max_depth': [5,10]}
rfc = RandomForestClassifier(random_state=42)
pipeline_bool = Pipeline([
    ('bool', CountVectorizer(analyzer=clean_text, binary=True, max_features=5000)),
    ("LSA", TruncatedSVD(n_components=300, n_iter=7, random_state=42)),
    # Note: 'r2' is a regression metric; 'accuracy' is the more conventional scoring for a classifier
    ("classifier", GridSearchCV(rfc, tuned_parameters, cv=5, scoring='r2', n_jobs=4, verbose=1))
])
pipeline_bool.fit(X_train,y_train)
# These predictions are on the training split, so the scores below are optimistic
predictions = pipeline_bool.predict(X_train)
pred(predictions,y_train)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 1.3min finished
CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          7074   360
1           257  7309
Classification paradox :------->>
Accuracy :- 95.89
False Negative Rate :- 3.4
False Postive Rate :- 4.84
             precision    recall  f1-score   support
          0       0.96      0.95      0.96      7434
          1       0.95      0.97      0.96      7566
avg / total       0.96      0.96      0.96     15000
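The three rates in this report follow directly from the four cells of the confusion matrix. Below is a minimal sketch that recomputes them from the counts printed above; the function name `rates` is an illustrative assumption and is not part of the original notebook.

```python
def rates(tn, fp, fn, tp):
    """Return accuracy, false negative rate and false positive rate, in percent."""
    total = tn + fp + fn + tp
    accuracy = round((tp + tn) * 100 / total, 2)
    fnr = round(fn * 100 / (fn + tp), 2)  # share of true positives that were missed
    fpr = round(fp * 100 / (fp + tn), 2)  # share of true negatives flagged as positive
    return accuracy, fnr, fpr

# Counts from the confusion matrix above (rows: true label, columns: prediction).
result = rates(tn=7074, fp=360, fn=257, tp=7309)
print(result)
```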
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
tuned_parameters = {'n_estimators': [100,150], 'max_depth': [5,10]}
rfc = RandomForestClassifier(random_state=42)
pipeline_tf = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_text, max_features=5000)),
    ("LSA", TruncatedSVD(n_components=300, n_iter=7, random_state=42)),
    # Note: 'r2' is a regression metric; 'accuracy' is the more conventional scoring for a classifier
    ("classifier", GridSearchCV(rfc, tuned_parameters, cv=5, scoring='r2', n_jobs=4, verbose=1))
])
pipeline_tf.fit(X_train,y_train)
# These predictions are on the training split, so the scores below are optimistic
predictions = pipeline_tf.predict(X_train)
pred(predictions,y_train)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 1.3min finished
CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          7030   404
1           319  7247
Classification paradox :------->>
Accuracy :- 95.18
False Negative Rate :- 4.22
False Postive Rate :- 5.43
             precision    recall  f1-score   support
          0       0.96      0.95      0.95      7434
          1       0.95      0.96      0.95      7566
avg / total       0.95      0.95      0.95     15000
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
tuned_parameters = {'n_estimators': [100,150], 'max_depth': [5,10]}
rfc = RandomForestClassifier(random_state=42)
pipeline_tfidf = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_text, max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ("LSA", TruncatedSVD(n_components=300, n_iter=7, random_state=42)),
    # Note: 'r2' is a regression metric; 'accuracy' is the more conventional scoring for a classifier
    ("classifier", GridSearchCV(rfc, tuned_parameters, cv=5, scoring='r2', n_jobs=4, verbose=1))
])
pipeline_tfidf.fit(X_train,y_train)
# These predictions are on the training split, so the scores below are optimistic
predictions = pipeline_tfidf.predict(X_train)
pred(predictions,y_train)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=4)]: Done 20 out of 20 | elapsed: 1.3min finished
CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          6900   534
1           389  7177
Classification paradox :------->>
Accuracy :- 93.85
False Negative Rate :- 5.14
False Postive Rate :- 7.18
             precision    recall  f1-score   support
          0       0.95      0.93      0.94      7434
          1       0.93      0.95      0.94      7566
avg / total       0.94      0.94      0.94     15000
# Evaluate the boolean-feature pipeline on the held-out split
predictions = pipeline_bool.predict(X_test)
pred(predictions,y_test)
CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          3810  1256
1           915  4019
Classification paradox :------->>
Accuracy :- 78.29
False Negative Rate :- 18.54
False Postive Rate :- 24.79
             precision    recall  f1-score   support
          0       0.81      0.75      0.78      5066
          1       0.76      0.81      0.79      4934
avg / total       0.78      0.78      0.78     10000
# Evaluate the TF pipeline on the held-out split
predictions = pipeline_tf.predict(X_test)
pred(predictions,y_test)
CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          3690  1376
1          1000  3934
Classification paradox :------->>
Accuracy :- 76.24
False Negative Rate :- 20.27
False Postive Rate :- 27.16
             precision    recall  f1-score   support
          0       0.79      0.73      0.76      5066
          1       0.74      0.80      0.77      4934
avg / total       0.76      0.76      0.76     10000
# Evaluate the TF-IDF pipeline on the held-out split
predictions = pipeline_tfidf.predict(X_test)
pred(predictions,y_test)
CONFUSION MATRIX ------->>
col_0         0     1
sentiment
0          3910  1156
1           897  4037
Classification paradox :------->>
Accuracy :- 79.47
False Negative Rate :- 18.18
False Postive Rate :- 22.82
             precision    recall  f1-score   support
          0       0.81      0.77      0.79      5066
          1       0.78      0.82      0.80      4934
avg / total       0.80      0.79      0.79     10000
Finally, we apply the best model we obtained to predict labels for the unlabeled dataset:
test['sentiment'] = pipeline_tfidf.predict(test['review'])
output = test[['id','sentiment']]
print(output)
id sentiment
0 12311_10 1
1 8348_2 0
2 5828_4 1
3 7186_2 0
4 12128_7 1
5 2913_8 1
6 4396_1 0
7 395_2 0
8 10616_1 0
9 9074_9 1
10 9252_3 0
11 9896_9 0
12 574_4 1
13 11182_8 1
14 11656_4 0
15 2322_4 1
16 8703_1 1
17 7483_1 1
18 6007_10 1
19 12424_4 0
20 4672_1 0
21 10841_3 0
22 8954_7 0
23 7392_1 0
24 10288_8 1
25 5343_4 0
26 4950_1 0
27 9257_4 0
28 8689_3 0
29 4480_2 1
... ... ...
24970 6857_10 1
24971 11091_8 1
24972 4167_2 1
24973 679_4 1
24974 10147_1 0
24975 6875_1 0
24976 923_10 1
24977 6200_8 0
24978 7208_8 1
24979 5363_8 1
24980 4067_8 0
24981 1773_7 1
24982 1498_10 1
24983 10497_10 0
24984 3444_10 1
24985 588_2 0
24986 9678_9 1
24987 1983_9 0
24988 5012_3 1
24989 12240_2 1
24990 5071_2 0
24991 5078_2 0
24992 10069_3 0
24993 7407_8 1
24994 7207_1 0
24995 2155_10 1
24996 59_10 1
24997 2531_1 1
24998 7772_8 1
24999 11465_10 1
[25000 rows x 2 columns]