Teddy Cup (泰迪杯) Problem C, Question 3 [Text Validity Analysis] (1)

Import libraries

import re # regular expressions
import collections # word-frequency counting
import numpy as np # numerical arrays
import jieba # Chinese word segmentation
import pandas as pd # tabular data
import wordcloud # word-cloud rendering
from PIL import Image # image processing
import matplotlib.pyplot as plt # plotting

Load the data

jingqu = pd.read_excel(r'F:\桌面\研一\上课\数据挖掘\泰迪杯比赛数据1\data\附件1\景区评论.xlsx')
jiudian = pd.read_excel(r'F:\桌面\研一\上课\数据挖掘\泰迪杯比赛数据1\data\附件1\酒店评论.xlsx')
jingqu
景区名称 评论日期 评论内容
0 A01 2020-06-16 是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十...
1 A01 2020-01-23 **景区差不多,票价偏贵了。大马戏比较精彩,八点的场次,6点40才能检票进入,我们6点多看看...
2 A01 2020-03-22 很有**特色的亲子酒店,房间里的装修很可爱,小孩子特别喜欢,洗漱用品也很有特色,对应的房间还...
3 A01 2020-12-25 有园区的工作人员在那,他会主动给你园区里的地图和表演的时间安排,很周到,上接驳车大概也是34...
4 A01 2020-11-28 周五逃课跟朋友在广州集合!终于如愿以偿的到达欢乐世界。学生票198 需要出示相关证件(校卡或...
... ... ... ...
59101 A50 2015-02-25 还好吧。我们刚刚到瀑布楼遇到一点小意外,打电话到景区办公室要求帮助,景区值班领导马上行动,在...
59102 A50 2015-02-25 山高路远,走的很辛苦。景色宜人爬山很累。
59103 A50 2015-02-22 环境很好,空气非常棒,很适合全家旅游,特别是避暑
59104 A50 2015-02-16 都很方便,价格实惠吧,可以预早就订好。
59105 A50 2015-02-22 旅行社不负责任 到了景点没有与门票售票协调好 等了很久

59106 rows × 3 columns

The problem statement

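For various reasons, online reviews often contain irrelevant content, simple copy-and-modify duplicates, or no useful content at all. This keeps travellers from getting valuable information out of online reviews and creates challenges for the platforms that host them. From a text-analysis perspective, build a reasonable model to analyse the validity of the scenic-spot and hotel reviews in Attachment 1.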

An approach that looks feasible

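My first impression was that this is a data-cleaning task, but it looks more like building a model that filters out and deletes spam reviews.
For example, when we browse Taobao, the Taobao community automatically hides useless reviews and shows shoppers only the useful ones.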

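The main ingredient is text deduplication. It can be based on the similarity between texts, for example edit-distance deduplication or the simhash algorithm, but these methods also remove legitimately similar wording, so some reviews may be deleted by mistake. The comparison-and-delete approach is recommended.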
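
To make the simhash idea mentioned above concrete, here is a minimal sketch using only the standard library. It is not the code used in this post: the character-bigram features, the 64-bit width and the choice of `hashlib.blake2b` are all assumptions made for illustration.

```python
import hashlib


def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams (no word segmentation)."""
    text = "".join(text.split())  # drop all whitespace
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, bits=64):
    """Return a `bits`-wide SimHash fingerprint of text as an int."""
    weights = [0] * bits
    for gram in char_ngrams(text):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for b in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


original = "环境很好,空气非常棒,很适合全家旅游,特别是避暑"
copied = "环境很好,空气非常棒,很适合全家旅游,特别是避暑!"
other = "旅行社不负责任 到了景点没有与门票售票协调好 等了很久"

print(hamming(simhash(original), simhash(original)))  # identical text gives 0
print(hamming(simhash(original), simhash(copied)))    # near copy: expected to be small
print(hamming(simhash(original), simhash(other)))     # unrelated: expected to be larger
```

Two reviews would then be treated as near-duplicates when the Hamming distance falls under a small cut-off. That cut-off has to be tuned on the data, which is exactly where the risk of deleting legitimate reviews comes from.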

First, handle plainly copied content [scenic spots]

jingqu = jingqu.set_index('景区名称') 

contents = jingqu['评论内容']
print(contents) # a Series
print('去重前条数:',len(contents)) # label: "count before deduplication"
景区名称
A01    是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十...
A01    **景区差不多,票价偏贵了。大马戏比较精彩,八点的场次,6点40才能检票进入,我们6点多看看...
A01    很有**特色的亲子酒店,房间里的装修很可爱,小孩子特别喜欢,洗漱用品也很有特色,对应的房间还...
A01    有园区的工作人员在那,他会主动给你园区里的地图和表演的时间安排,很周到,上接驳车大概也是34...
A01    周五逃课跟朋友在广州集合!终于如愿以偿的到达欢乐世界。学生票198 需要出示相关证件(校卡或...
                             ...                        
A50    还好吧。我们刚刚到瀑布楼遇到一点小意外,打电话到景区办公室要求帮助,景区值班领导马上行动,在...
A50                                 山高路远,走的很辛苦。景色宜人爬山很累。
A50                             环境很好,空气非常棒,很适合全家旅游,特别是避暑
A50                                  都很方便,价格实惠吧,可以预早就订好。
A50                          旅行社不负责任 到了景点没有与门票售票协调好 等了很久
Name: 评论内容, Length: 59106, dtype: object
contents = contents.drop_duplicates() # keep the first occurrence of each duplicate, drop the later ones
print(contents)
print('去重后条数:',len(contents)) # label: "count after deduplication"
景区名称
A01    是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十...
A01    **景区差不多,票价偏贵了。大马戏比较精彩,八点的场次,6点40才能检票进入,我们6点多看看...
A01    很有**特色的亲子酒店,房间里的装修很可爱,小孩子特别喜欢,洗漱用品也很有特色,对应的房间还...
A01    有园区的工作人员在那,他会主动给你园区里的地图和表演的时间安排,很周到,上接驳车大概也是34...
A01    周五逃课跟朋友在广州集合!终于如愿以偿的到达欢乐世界。学生票198 需要出示相关证件(校卡或...
                             ...                        
A50    还好吧。我们刚刚到瀑布楼遇到一点小意外,打电话到景区办公室要求帮助,景区值班领导马上行动,在...
A50                                 山高路远,走的很辛苦。景色宜人爬山很累。
A50                             环境很好,空气非常棒,很适合全家旅游,特别是避暑
A50                                  都很方便,价格实惠吧,可以预早就订好。
A50                          旅行社不负责任 到了景点没有与门票售票协调好 等了很久
Name: 评论内容, Length: 58486, dtype: object
去重后条数: 58486
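
`drop_duplicates()` only removes reviews that are character-for-character identical, so a copy that differs by a space or a trailing punctuation mark survives. The sketch below shows one possible pre-step: compare reviews on a normalized form and keep the first occurrence. The helper names and the punctuation set are assumptions, not part of the original notebook.

```python
import re

# Characters ignored when comparing reviews: whitespace, ASCII punctuation
# and a few common full-width marks. This set is an assumption.
_NOISE = re.compile(r"[\s!-/:-@\[-`{-~,。!?、;:…]+")


def normalize(text):
    """Strip whitespace/punctuation and lower-case; used for comparison only."""
    return _NOISE.sub("", text).lower()


def dedup_normalized(reviews):
    """Keep the first review for each normalized form, preserving order."""
    seen = set()
    kept = []
    for review in reviews:
        key = normalize(review)
        if key not in seen:
            seen.add(key)
            kept.append(review)
    return kept


reviews = [
    "都很方便,价格实惠吧,可以预早就订好。",
    "都很方便, 价格实惠吧,可以预早就订好",  # same text, spacing/punctuation changed
    "山高路远,走的很辛苦。景色宜人爬山很累。",
]
print(dedup_normalized(reviews))  # the second review is dropped as a copy of the first
```

The original strings are kept untouched in the result; normalization only decides which rows count as the same review.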

Short-text similarity comparison [scenic spots], method 1

from fuzzywuzzy import fuzz
from difflib import SequenceMatcher
C:\Users\kingS\anaconda3\lib\site-packages\fuzzywuzzy\fuzz.py:11: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')

Start with a deduplication pass

jingqu = pd.read_excel(r'F:\桌面\研一\上课\数据挖掘\泰迪杯比赛数据1\data\附件1\景区评论.xlsx')
# deduplicate on the review text first
print('去重前:',len(jingqu['评论内容'])) # label: "before deduplication"
contents = jingqu.drop_duplicates('评论内容')
print('去重后:',len(contents['评论内容'])) # label: "after deduplication"
print(contents)
去重前: 59106
去重后: 58486
      景区名称        评论日期                                               评论内容
0      A01  2020-06-16  是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十...
1      A01  2020-01-23  **景区差不多,票价偏贵了。大马戏比较精彩,八点的场次,6点40才能检票进入,我们6点多看看...
2      A01  2020-03-22  很有**特色的亲子酒店,房间里的装修很可爱,小孩子特别喜欢,洗漱用品也很有特色,对应的房间还...
3      A01  2020-12-25  有园区的工作人员在那,他会主动给你园区里的地图和表演的时间安排,很周到,上接驳车大概也是34...
4      A01  2020-11-28  周五逃课跟朋友在广州集合!终于如愿以偿的到达欢乐世界。学生票198 需要出示相关证件(校卡或...
...    ...         ...                                                ...
59101  A50  2015-02-25  还好吧。我们刚刚到瀑布楼遇到一点小意外,打电话到景区办公室要求帮助,景区值班领导马上行动,在...
59102  A50  2015-02-25                               山高路远,走的很辛苦。景色宜人爬山很累。
59103  A50  2015-02-22                           环境很好,空气非常棒,很适合全家旅游,特别是避暑
59104  A50  2015-02-16                                都很方便,价格实惠吧,可以预早就订好。
59105  A50  2015-02-22                        旅行社不负责任 到了景点没有与门票售票协调好 等了很久

[58486 rows x 3 columns]
jingqu = pd.read_excel(r'F:\桌面\研一\上课\数据挖掘\泰迪杯比赛数据1\data\附件1\景区评论.xlsx')
# pull out the ID column and the review-text column (no deduplication happens in this cell)
jingqu_IDs = jingqu['景区名称']
print(jingqu_IDs)
jingqu_contents = jingqu['评论内容']
print(jingqu_contents)
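sequenceMatcher = SequenceMatcher() # instantiate the similarity matcher
j = 1 # progress counter: which review is being processed
print('阈值低于0.01的为无关文本,以及大于0.99的相似文本') # "ratio below 0.01 = unrelated text, above 0.99 = similar text"
print('你就开始睡觉吧!') # "you may as well go to sleep!"
print('开始了!') # "starting!"
for jingqu_ID,jingqu_content in zip(jingqu_IDs,jingqu_contents):
    
    #print(jingqu_ID,':',jingqu_content)
    print('第%d条'%j) # "review no. j"
    j += 1
    for i in range(len(jingqu_IDs)):
        sequenceMatcher.set_seqs(jingqu_content, jingqu_contents[i])
        threshold_value = sequenceMatcher.ratio() # similarity ratio in [0, 1]
        print(threshold_value)
        if (threshold_value < 0.01 or threshold_value > 0.99):
            print(jingqu_ID) # ID of the outer-loop review
            print(jingqu_contents[i]) # text of the review it was compared with
    print('='*30) # separator line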
0        A01
1        A01
2        A01
3        A01
4        A01
        ... 
59101    A50
59102    A50
59103    A50
59104    A50
59105    A50
Name: 景区名称, Length: 59106, dtype: object
0        是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十...
1        **景区差不多,票价偏贵了。大马戏比较精彩,八点的场次,6点40才能检票进入,我们6点多看看...
2        很有**特色的亲子酒店,房间里的装修很可爱,小孩子特别喜欢,洗漱用品也很有特色,对应的房间还...
3        有园区的工作人员在那,他会主动给你园区里的地图和表演的时间安排,很周到,上接驳车大概也是34...
4        周五逃课跟朋友在广州集合!终于如愿以偿的到达欢乐世界。学生票198 需要出示相关证件(校卡或...
                               ...                        
59101    还好吧。我们刚刚到瀑布楼遇到一点小意外,打电话到景区办公室要求帮助,景区值班领导马上行动,在...
59102                                 山高路远,走的很辛苦。景色宜人爬山很累。
59103                             环境很好,空气非常棒,很适合全家旅游,特别是避暑
59104                                  都很方便,价格实惠吧,可以预早就订好。
59105                          旅行社不负责任 到了景点没有与门票售票协调好 等了很久
Name: 评论内容, Length: 59106, dtype: object
阈值低于0.01的为无关文本,以及大于0.99的相似文本
你就开始睡觉吧!
开始了!
第1条
1.0
A01
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
0.12459016393442623
0.10344827586206896
0.04700854700854701
0.04247104247104247
0.04504504504504504
0.02390438247011952
0.12234042553191489
0.0966183574879227
0.0759493670886076
0.060836501901140684
0.04878048780487805
0.09859154929577464
0.06130268199233716
0.02882882882882883
0.10796915167095116
0.10410958904109589
0.04790419161676647
0.028846153846153848
0.047619047619047616
0.026905829596412557
0.06451612903225806
0.028846153846153848
0.018957345971563982
0.05511811023622047
0.047619047619047616
0.07255139056831923
0.08436724565756824
0.06666666666666667
0.038461538461538464
0.02926829268292683
0.10948905109489052
0.1
0.09486166007905138
0.07092198581560284
0.06779661016949153
0.05172413793103448
0.07003891050583658
0.058577405857740586
0.054901960784313725
0.08201892744479496
0.06769230769230769
0.07947019867549669
0.04456094364351245
0.06722689075630252
0.08270676691729323
0.046948356807511735
0.11152416356877323
0.06646525679758308
0.0660377358490566
0.09523809523809523
0.0273972602739726
0.10031347962382445
0.14388489208633093
0.04807692307692308
0.0380952380952381
0.06153846153846154
0.046948356807511735
0.02952029520295203
0.019230769230769232
0.05653710247349823
0.1238390092879257
0.04716981132075472
0.07511737089201878
0.1037463976945245
0.028708133971291867
0.05405405405405406
0.036036036036036036
0.04240282685512368
0.037209302325581395
0.1254355400696864
0.13548387096774195
0.01818181818181818
0.019230769230769232
0.009615384615384616
A01
非常划算,平日来的首选
0.05660377358490566
0.14450867052023122
0.04608294930875576
0.11255411255411256
0.09219858156028368
0.0694980694980695
0.0912863070539419
0.09584664536741214
0.08496732026143791
0.05223880597014925
0.12811387900355872
0.018691588785046728
0.10194174757281553
0.018779342723004695
0.028846153846153848
0.02843601895734597
0.034482758620689655
0.04048582995951417
0.05687203791469194
0.030303030303030304
0.07317073170731707
0.03636363636363636
0.07604562737642585
0.13529411764705881
0.03587443946188341
0.035398230088495575
0.08196721311475409
0.07344632768361582
0.047619047619047616
0.09395973154362416
0.05054151624548736
0.1095890410958904
0.1232876712328767
0.0380952380952381
0.0877742946708464
0.05555555555555555
0.03864734299516908
0.08602150537634409
0.05177993527508091
0.1145374449339207
0.08358208955223881
0.06926406926406926
0.09243697478991597
0.05084745762711865
0.10380622837370242
0.024193548387096774
0.0728744939271255
0.09961685823754789
0.04819277108433735
0.059322033898305086
0.018779342723004695
0.018867924528301886
0.1016949152542373
0.03088803088803089
0.04484304932735426
0.03755868544600939
0.056074766355140186
0.027777777777777776
0.03636363636363636
0.05622489959839357
0.05660377358490566
0.05172413793103448
0.05785123966942149
0.06369426751592357
0.05454545454545454
0.04807692307692308
0.07079646017699115
0.08178438661710037
0.04524886877828054
0.04741379310344827
0.03773584905660377
0.055944055944055944
0.03636363636363636
0.009569377990430622
A01
4o分免费观光车好,值得
0.04819277108433735
0.14652014652014653
0.05622489959839357
0.057692307692307696
0.0963855421686747
0.0380952380952381
0.009345794392523364
A01
很期待非常期待去那里玩。轻松加刺激
0.037209302325581395
0.10452961672473868
0.10687022900763359
0.08560311284046693
0.0
A01
爽爽爽
0.08637873754152824
0.033003300330033
0.07003891050583658
0.08982035928143713
0.06997084548104957
0.07441860465116279
0.03619909502262444
0.027906976744186046
0.009569377990430622
A01
服务好 速度快 还不错,
0.035555555555555556
0.05982905982905983
0.054982817869415807
0.03686635944700461
0.08148148148148149
0.08035714285714286
0.027649769585253458
0.11371237458193979
0.10582010582010581
0.018779342723004695
0.028708133971291867
0.0380952380952381
0.05454545454545454
0.07449856733524356
0.05863192182410423
0.06837606837606838
0.10865191146881288
0.01834862385321101
0.07659574468085106
0.056338028169014086
0.05574912891986063
0.09147609147609148
0.009615384615384616
A01
人少享受了VIP的待遇
0.02643171806167401
0.09876543209876543
0.046511627906976744
0.09259259259259259
0.046511627906976744
0.0703125
0.07964601769911504
0.019138755980861243
0.05357142857142857
0.018957345971563982
0.05309734513274336
0.0796812749003984
0.07792207792207792
0.07142857142857142
0.09411764705882353
0.016771488469601678
0.03773584905660377
0.06363636363636363
0.07756232686980609
0.028037383177570093
0.08771929824561403
0.06542056074766354
0.08849557522123894
0.038461538461538464
0.056910569105691054
0.04716981132075472
0.0380952380952381
0.07052896725440806
0.09565217391304348
0.038461538461538464
0.09014084507042254
0.10256410256410256
0.11428571428571428
0.07272727272727272
0.1271186440677966
0.046511627906976744
0.018691588785046728
0.08
0.0975609756097561
0.06572769953051644
0.0380952380952381
0.06722689075630252
0.05286343612334802
0.060897435897435896
0.1409921671018277
0.042643923240938165
0.12
0.05761316872427984
0.08426966292134831
0.05144694533762058
0.05750798722044728
0.11650485436893204
0.11371237458193979
0.07142857142857142
0.06716417910447761
0.018604651162790697
0.05286343612334802
0.08490566037735849
0.10828025477707007
0.05660377358490566
0.13745704467353953
0.07511737089201878
0.15053763440860216
0.03669724770642202
0.018604651162790697
0.05668016194331984
0.046296296296296294
0.06306306306306306
0.019230769230769232
0.046296296296296294
0.07894736842105263
0.042735042735042736
0.08403361344537816
0.09665427509293681
0.019230769230769232
0.06870229007633588
0.06542056074766354
0.07662835249042145
0.009433962264150943
A01
还行吧很好的体验*******
0.018957345971563982
0.05737704918032787
0.045662100456621
0.0
A01
The Chimelong Happy World is worth a visit. There are more than 20 game facilities in the park, suitable for a family and a couple, and the holidays are relaxing.Theme park, water park, fabulous zoo and safari park, crocodile park, circus and hotel all in this area and they are all well worth the trip.This is a good place to go. However, the queue time is also about an hour. I have met many animals that I have never seen before. The environment is beautiful, the scenery is pleasant, the animals are very cute, the service staff is very good, and there is a chance to come back.
0.0
A01
The entire resort area is quite large, including the animal world, the big circus, the joy world, the water park, the hotel and even the bird park. There is also a shuttle bus in the park. There are quite a lot of items for friends of all ages to play. Buying a set of tickets, cost-effective, no need to queue to buy tickets, you only need to pick up tickets at the automatic ticket machine.
0.05901639344262295
0.0
A01
The entire resort area is quite large, including the animal world, the big circus, the joy world, the water park, the hotel and even the bird park. There is also a shuttle bus in the park. There are quite a lot of items for friends of all ages to play. Buying a set of tickets, cost-effective, no need to queue to buy tickets, you only need to pick up tickets at the automatic ticket machine.
0.0
A01
The entire resort area is quite large, including the animal world, the big circus, the joy world, the water park, the hotel and even the bird park. There is also a shuttle bus in the park. There are quite a lot of items for friends of all ages to play. Buying a set of tickets, cost-effective, no need to queue to buy tickets, you only need to pick up tickets at the automatic ticket machine.
0.04672897196261682
0.0364963503649635
0.0
A01
The Chimelong Happy World is worth a visit. There are more than 20 game facilities in the park, suitable for a family and a couple, and the holidays are relaxing.Theme park, water park, fabulous zoo and safari park, crocodile park, circus and hotel all in this area and they are all well worth the trip.This is a good place to go. However, the queue time is also about an hour. I have met many animals that I have never seen before. The environment is beautiful, the scenery is pleasant, the animals are very cute, the service staff is very good, and there is a chance to come back.
0.046948356807511735
0.0
A01
The entire resort area is quite large, including the animal world, the big circus, the joy world, the water park, the hotel and even the bird park. There is also a shuttle bus in the park. There are quite a lot of items for friends of all ages to play. Buying a set of tickets, cost-effective, no need to queue to buy tickets, you only need to pick up tickets at the automatic ticket machine.
0.0
A01
The Chimelong Happy World is worth a visit. There are more than 20 game facilities in the park, suitable for a family and a couple, and the holidays are relaxing.Theme park, water park, fabulous zoo and safari park, crocodile park, circus and hotel all in this area and they are all well worth the trip.This is a good place to go. However, the queue time is also about an hour. I have met many animals that I have never seen before. The environment is beautiful, the scenery is pleasant, the animals are very cute, the service staff is very good, and there is a chance to come back.
0.0
A01
The entire resort area is quite large, including the animal world, the big circus, the joy world, the water park, the hotel and even the bird park. There is also a shuttle bus in the park. There are quite a lot of items for friends of all ages to play. Buying a set of tickets, cost-effective, no need to queue to buy tickets, you only need to pick up tickets at the automatic ticket machine.
0.0
A01
The entire resort area is quite large, including the animal world, the big circus, the joy world, the water park, the hotel and even the bird park. There is also a shuttle bus in the park. There are quite a lot of items for friends of all ages to play. Buying a set of tickets, cost-effective, no need to queue to buy tickets, you only need to pick up tickets at the automatic ticket machine.
0.018779342723004695
0.0
A01
The Chimelong Happy World is worth a visit. There are more than 20 game facilities in the park, suitable for a family and a couple, and the holidays are relaxing.Theme park, water park, fabulous zoo and safari park, crocodile park, circus and hotel all in this area and they are all well worth the trip.This is a good place to go. However, the queue time is also about an hour. I have met many animals that I have never seen before. The environment is beautiful, the scenery is pleasant, the animals are very cute, the service staff is very good, and there is a chance to come back.
0.0
A01
The Chimelong Happy World is worth a visit. There are more than 20 game facilities in the park, suitable for a family and a couple, and the holidays are relaxing.Theme park, water park, fabulous zoo and safari park, crocodile park, circus and hotel all in this area and they are all well worth the trip.This is a good place to go. However, the queue time is also about an hour. I have met many animals that I have never seen before. The environment is beautiful, the scenery is pleasant, the animals are very cute, the service staff is very good, and there is a chance to come back.
0.05217391304347826
0.037209302325581395
0.0
A01
The Chimelong Happy World is worth a visit. There are more than 20 game facilities in the park, suitable for a family and a couple, and the holidays are relaxing.Theme park, water park, fabulous zoo and safari park, crocodile park, circus and hotel all in this area and they are all well worth the trip.This is a good place to go. However, the queue time is also about an hour. I have met many animals that I have never seen before. The environment is beautiful, the scenery is pleasant, the animals are very cute, the service staff is very good, and there is a chance to come back.
0.046296296296296294
0.018604651162790697
0.05555555555555555
0.037209302325581395
0.018779342723004695
0.04716981132075472
0.045871559633027525
0.02830188679245283
0.10270270270270271
0.07531380753138076
0.03827751196172249
0.15104166666666666
0.037037037037037035
0.08583690987124463
0.09803921568627451
0.043478260869565216
0.1157556270096463
0.11965811965811966
0.10344827586206896
0.104
0.09411764705882353
0.07028753993610223
0.008733624454148471
A01
玩垂直极限的时候,感觉呼吸都停止了。还有在温水池里游泳感觉超棒。
0.056338028169014086
0.03614457831325301
0.01904761904761905
0.03614457831325301
0.03524229074889868
0.1095890410958904
0.043859649122807015
0.055299539170506916
0.05555555555555555
0.06222222222222222
0.046153846153846156
0.07973421926910298
0.046296296296296294
0.06734006734006734
0.027777777777777776
0.09734513274336283
0.044444444444444446
0.03773584905660377
0.09216589861751152
0.04739336492890995
0.05286343612334802
0.07373271889400922
0.00847457627118644
A01
绝对震撼,票有所值,带你在欢乐中体验刺激惊险,身临其中,感受到的慢慢的都是快乐
0.08943089430894309
0.01904761904761905
0.07142857142857142
0.0660377358490566
0.05785123966942149

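This method is far too inefficient. Every short text is compared against every other short text in nested for loops, which for 59106 reviews means roughly 3.5 billion comparisons, so it will clearly take at least several hours to finish.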
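
One way to cut the cost without changing the similarity measure is sketched below, under stated assumptions. It visits each unordered pair once instead of twice (and never compares a review with itself). It also uses `SequenceMatcher.real_quick_ratio()` and `quick_ratio()`, which are cheap upper bounds on `ratio()`, to skip pairs that cannot reach the threshold. The function name and the 0.9 threshold are hypothetical, and the loop is still quadratic in the number of reviews.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicate_pairs(texts, threshold=0.9):
    """Return (i, j, ratio) for unordered pairs whose similarity >= threshold."""
    pairs = []
    for i, j in combinations(range(len(texts)), 2):  # each pair once, no self-pairs
        matcher = SequenceMatcher(None, texts[i], texts[j], autojunk=False)
        # Both quick ratios are upper bounds on ratio(), so failing either one
        # rules the pair out before the expensive comparison runs.
        if matcher.real_quick_ratio() < threshold:
            continue
        if matcher.quick_ratio() < threshold:
            continue
        ratio = matcher.ratio()
        if ratio >= threshold:
            pairs.append((i, j, ratio))
    return pairs


texts = [
    "环境很好,空气非常棒,很适合全家旅游",
    "环境很好,空气非常棒,很适合全家旅游!",
    "旅行社不负责任,等了很久",
]
pairs = near_duplicate_pairs(texts)
print(pairs)  # only the pair (0, 1) passes the threshold
```

The length-based `real_quick_ratio()` check alone rejects any pair whose lengths differ too much, which should remove a large share of the comparisons when review lengths vary widely.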

Short-text similarity comparison [scenic spots], method 2

import jieba.analyse
# extract the keywords first
jingqu = pd.read_excel(r'F:\桌面\研一\上课\数据挖掘\泰迪杯比赛数据1\data\附件1\景区评论.xlsx')
# pull out the ID column and the review-text column (no deduplication happens in this cell)
jingqu_IDs = jingqu['景区名称']
print(jingqu_IDs)
jingqu_contents = jingqu['评论内容']
print(jingqu_contents)

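# similarity function: Jaccard index of two keyword lists
def compute_sim(word1,word2):
    jiaoji = set(word1).intersection(set(word2)) # intersection
    bingji = set(word1).union(set(word2)) # union
    if not bingji: # both keyword lists are empty: avoid ZeroDivisionError
        return 0.0
    return len(jiaoji)/len(bingji)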
    
a = []
for jingqu_ID,jingqu_content in zip(jingqu_IDs,jingqu_contents):
    keyword = jieba.analyse.extract_tags(jingqu_content,50)
    a.append(keyword)
for i in range(len(a)):
    for j in range(len(a)):
        sim_va = compute_sim(a[i],a[j])
        if sim_va < 0.0001 or sim_va > 0.99:
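            print('第%d条:'%i) # "review no. i"
            print(jingqu_contents[i])
            print('第%d条:'%j) # "review no. j"
            print(jingqu_contents[j])
            # banner reads "similar or duplicate", but it also fires for pairs
            # with almost no keyword overlap (sim_va < 0.0001)
            print('='*30,'相似或者重复','='*30)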
0        A01
1        A01
2        A01
3        A01
4        A01
        ... 
59101    A50
59102    A50
59103    A50
59104    A50
59105    A50
Name: 景区名称, Length: 59106, dtype: object
0        是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十...
1        **景区差不多,票价偏贵了。大马戏比较精彩,八点的场次,6点40才能检票进入,我们6点多看看...
2        很有**特色的亲子酒店,房间里的装修很可爱,小孩子特别喜欢,洗漱用品也很有特色,对应的房间还...
3        有园区的工作人员在那,他会主动给你园区里的地图和表演的时间安排,很周到,上接驳车大概也是34...
4        周五逃课跟朋友在广州集合!终于如愿以偿的到达欢乐世界。学生票198 需要出示相关证件(校卡或...
                               ...                        
59101    还好吧。我们刚刚到瀑布楼遇到一点小意外,打电话到景区办公室要求帮助,景区值班领导马上行动,在...
59102                                 山高路远,走的很辛苦。景色宜人爬山很累。
59103                             环境很好,空气非常棒,很适合全家旅游,特别是避暑
59104                                  都很方便,价格实惠吧,可以预早就订好。
59105                          旅行社不负责任 到了景点没有与门票售票协调好 等了很久
Name: 评论内容, Length: 59106, dtype: object
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
============================== 相似或者重复 ==============================
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
第6条:
最好提前到。 我从这位代理商这里订的。 在官方的取票点拿不到。马戏非常精彩,来广州必须要看。看了终身忘不掉。
============================== 相似或者重复 ==============================
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
第11条:
夜间水上世界没那么多人,每个项目排队不超过15分钟,很爽。而且不冷。超级造浪池晚上友电音节,好玩。
============================== 相似或者重复 ==============================
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
第13条:
18点就清场了,各位小哥哥小姐姐要玩就早点去。 网上买票换成二维码就不用取票出来,直接刷二维码进园区。 想下个月再去一起。组队中
============================== 相似或者重复 ==============================
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
第22条:
不错,推荐带孩子一块去
============================== 相似或者重复 ==============================
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
第1134条:
很好玩,过山车太棒了!就是排队太久了、设施都没有玩完,都浪费在排队上了,在欢乐世界找个卖男士拖鞋的地方太难了,好不容易找到品种也很单一(男盆友鞋子坏了)。天气很热,如果凉快些,玩的估计会更尽兴,总之很开心,很难忘的体验,网票比现场和酒店代售的便宜太多了,取票也方便,很好很满意!关于园内水上设施穿雨衣一事,之前看到有评论,说不需要购买,雨衣回收站就有,可是我们找了很久没找到,之后买了雨衣玩耍完在海岛船外出口处才看到回收箱?在海盗船旁边附近有个卖纪念品的地方,从那个商店进去,进去后从另外一个门出去,就是海盗船的出口,有个雨衣回收箱,都是穿了一次丢弃的,除了水,并不脏,雨衣质量还不错,可以重复使用,当然也可自带,游园里15元一个,外面商店5元,小建议,分享给大家。关于大马戏,要坐好位置都要另外加钱(50200元不等),入住**酒店看大马戏有特定位置,网票(一日联票)买的都是又偏又靠后的位置,没办法和男朋友买补票,补票也是按次序卖,17:30开园,19:30才开演!我是开场前1个小时过去的,补票都已经没有好位置了,人们都一早过去占位置,有条件的不差钱的建议网票购票时,选择“套票”,或者入住**大酒店,贵有贵的道理,可以坐在靠前靠中间的位置,省的现场来回跑补票占座
============================== 相似或者重复 ==============================
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完了去坐的,到了下午四点半刚好小火车到站就下雨了,地铁站到男北门都有接驳车,很方便的,总之这次还算满意,就是有些动物表演没有看到,只看到了大象表演
第1135条:
非常棒的体验!自驾车过去的,景区服务一流,节目安排也非常紧凑,以后还会再来的。
============================== 相似或者重复 ==============================
第0条:
是亲子游的绝佳场所,门票就是有点贵,不过可以接受,爷爷奶奶不放心小朋友也跟上来了,当天我们十点多就到了,错过了节假日,人也还是多,不过错峰出行我们一天是把动物园逛完了,两种路线都逛完了,早上我们先坐的缆车,缆车人多,排了半小时队,小火车是把步行区转完

It still feels very slow!

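I wanted to assemble the similarities into a matrix and take the mean of each row, but that still means iterating over every pair, so I reduced the number of keywords to see whether it runs any faster.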
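
The row means can be obtained without filling the whole matrix, because any pair of reviews that shares no keyword has a Jaccard similarity of exactly 0 and adds nothing to the sum. The sketch below uses an inverted index (keyword to review ids) to visit only the pairs that overlap. This is a swapped-in technique rather than the approach used in this post, the function names are hypothetical, and it remains quadratic in the worst case where one keyword appears in every review.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_similarity(keyword_lists):
    """Mean Jaccard similarity of each item to all items (itself included)."""
    sets = [set(k) for k in keyword_lists]
    index = defaultdict(set)  # keyword -> ids of the items that contain it
    for i, s in enumerate(sets):
        for word in s:
            index[word].add(i)

    n = len(sets)
    means = []
    for s in sets:
        # Only items sharing at least one keyword can have non-zero similarity.
        candidates = set()
        for word in s:
            candidates |= index[word]
        total = sum(jaccard(s, sets[j]) for j in candidates)
        means.append(total / n)
    return means


keywords = [
    ["门票", "亲子", "动物园"],
    ["门票", "马戏", "排队"],
    ["酒店", "早餐"],
]
means = mean_similarity(keywords)
print(means)  # approximately [0.4, 0.4, 0.333]
```

For non-empty keyword lists the values equal the row means of the full similarity matrix, so the same threshold can be applied to them.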

I ran it on the officially provided sample data, and the result looks acceptable.

import jieba.analyse
import numpy as np
# extract the keywords first
jingqu = pd.read_excel(r'F:\桌面\研一\上课\数据挖掘\泰迪杯比赛数据1\景区评论(样例数据).xlsx')
# pull out the ID column and the review-text column (no deduplication happens in this cell)
jingqu_IDs = jingqu['景区名称']
#print(jingqu_IDs)
jingqu_contents = jingqu['评论详情']
#print(jingqu_contents)

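# similarity function: Jaccard index of two keyword lists
def compute_sim(word1,word2):
    jiaoji = set(word1).intersection(set(word2)) # intersection
    bingji = set(word1).union(set(word2)) # union
    if not bingji: # both keyword lists are empty: avoid ZeroDivisionError
        return 0.0
    return len(jiaoji)/len(bingji)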
    
a = []
for jingqu_ID,jingqu_content in zip(jingqu_IDs,jingqu_contents):
    keyword = jieba.analyse.extract_tags(jingqu_content,15)
    a.append(keyword)
matrix_list = []
for i in range(len(a)):
    for j in range(len(a)):
        sim_va = compute_sim(a[i],a[j])
        print(sim_va)
        matrix_list.append(sim_va)

#values = np.mat(matrix_list)
#print(np.mean(values,axis = 1))
1.0
0.034482758620689655
0.05555555555555555
0.0
0.0
0.11764705882352941
0.0
0.034482758620689655
0.034482758620689655
0.034482758620689655
0.07142857142857142

0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.034482758620689655
1.0
0.0
0.034482758620689655
0.0
0.0
0.0
0.
0.034482758620689655
0.043478260869565216
0.0
0.0
0.0
0.07142857142857142
0.0
0.034482758620689655
0.034482758620689655
0.0

0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.041666666666666664
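arr = np.array(matrix_list)
n = len(a) # number of reviews (50 in the sample data)
arr = arr.reshape((n,n)) # reshape the flat list into an n x n similarity matrix
arr_mean = np.mean(arr,axis = 1) # mean similarity of each review to all reviews
weizhi = np.where(arr_mean<=0.021) # threshold at or below which a review is flagged
for i in range(len(weizhi[0][:])):
    print('第%d条数据应该被屏蔽或者删除'%weizhi[0][i]) # "review no. i should be hidden or deleted"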

output

第6条数据应该被屏蔽或者删除
第13条数据应该被屏蔽或者删除

To use the original data, take the code above as it is; running it overnight should be about enough.
