2018 Teddy Cup Data Mining Competition, Problem C

I. Problem statement
Improving the reputation of tourist destinations such as scenic spots and hotels is a task that local tourism authorities and tourism businesses care deeply about, since it bears on stabilising the customer base, gaining competitive advantage, and attracting visitors to come and spend. Visitor satisfaction is closely tied to destination reputation: the higher the satisfaction, the better the reputation. Understanding what drives visitor satisfaction, and using that knowledge to raise satisfaction and ultimately reputation, therefore not only keeps the customer base stable but also pays off in the long run for scientific oversight, optimal resource allocation, and continued market development in the tourism industry.
Question 1: impression analysis of scenic spots and hotels
From the review texts in Attachment 1, compute the TOP20 hot words for each destination in the format of Table 1 of the problem statement, and save them as the file “印象词云表.xls”.

Solution process:
1. Extract the reviews of each destination from the Excel summaries into one text file per destination

import pandas as pd

# Write each destination's reviews (one review per line) into its own text file.
# name_col is the index column ('酒店名称' or '景区名称');
# prefix is 'H' for hotels, 'A' for scenic spots.
def xls_txt(inputfile, name_col, prefix):
    data = pd.read_excel(inputfile, index_col=name_col)
    for i in range(51, 61):
        index = f'{prefix}{i:02d}'   # e.g. 'H51'
        data1 = data.loc[index]      # all rows for this destination
        data1 = data1[['评论内容']]
        outputfile = '../test/1/' + index + '.txt'
        data1.to_csv(outputfile, index=False, header=False, encoding='utf-8')

inputfile1 = '../test/酒店评论(测试数据).xlsx'  # hotel review summary file
inputfile2 = '../test/景区评论(测试数据).xlsx'  # scenic-spot review summary file
xls_txt(inputfile1, '酒店名称', 'H')  # hotel data
xls_txt(inputfile2, '景区名称', 'A')  # scenic-spot data

2. Converting a CSV review file to a plain-text file

(Note: this block operates on an unrelated sample dataset, the Midea ('美的') appliance review summary, and appears to be a leftover template; the pattern is the same as above: filter the rows you want and dump one column to a text file.)

#-*- coding: utf-8 -*-
import pandas as pd

inputfile = '../data/huizong.csv'    # review summary file
outputfile = '../data/meidi_jd.txt'  # where the extracted reviews are saved
data = pd.read_csv(inputfile, encoding='utf-8')
data = data[['评论']][data['品牌'] == '美的']  # keep only the '美的' rows
data.to_csv(outputfile, index=False, header=False, encoding='utf-8')

3. Mechanical sentence compression; short sentences are dropped as invalid. (Judging by the paths, this step reads ../test/2, which the de-duplication step in section 4 below produces, so de-duplication actually runs first.)
import pandas as pd

# Mechanical compression: collapse a run of consecutively repeated fragments
# (e.g. '好好好好' -> '好') down to a single occurrence.
def func(st):
    for i in range(1, int(len(st) / 2) + 1):   # fragment length
        for j in range(len(st)):               # fragment start position
            if st[j:j + i] == st[j + i:j + 2 * i]:
                k = j + i
                while st[k:k + i] == st[k + i:k + 2 * i] and k < len(st):
                    k = k + i
                st = st[:j] + st[k:]           # keep one copy of the run
    return st.strip()


# Read a review file line by line, compress repeats, and drop short lines.
def run(inputfile, outputfile):
    f = open(inputfile, encoding='utf-8')
    filelist = []
    while True:
        line = f.readline()
        if line:
            # compress repeated fragments
            data = func(line)
            # drop short sentences (four characters or fewer count as invalid)
            if len(data) <= 4:
                continue
            else:
                filelist.append(data)
        else:
            break
    f.close()
    filelist2 = pd.DataFrame(filelist)
    filelist2.to_csv(outputfile, index=False, header=False, encoding='utf-8')
# Drivers (the commented-out block is the corresponding run over the training data)
'''for i in range(1, 51):
    index = f'A{i:02d}'
    inputfile = '../train/jingqu/jingqu_cl/' + index + '.txt'          # path before processing
    outputfile = '../train/jingqu/jingqu_cl/cl_pr/' + index + '.txt'   # path after processing
    run(inputfile, outputfile)'''
for prefix in ('A', 'H'):
    for i in range(51, 61):
        index = f'{prefix}{i:02d}'
        inputfile = '../test/2/' + index + '.txt'
        outputfile = '../test/3/' + index + '.txt'   # output file
        run(inputfile, outputfile)
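
As a quick sanity check, here is what the compression function does to two artificial examples (hand-traced against the algorithm above):

# Runs of repeated fragments collapse to a single occurrence.
print(func('好好好好棒棒棒'))    # -> '好棒'
print(func('位置很好位置很好'))  # -> '位置很好'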

4. Remove exactly duplicated reviews (treated as invalid text) and strip noise characters

import pandas as pd

# De-duplicate the reviews and strip noise characters.
def clean_same(inputfile, outputfile):
    # Drop exactly duplicated reviews.
    reviews = pd.read_csv(inputfile, encoding='utf-8', header=None)
    l1 = len(reviews)
    reviews = reviews.drop_duplicates()
    l2 = len(reviews)
    # print('Removed %s duplicate reviews.' % (l1 - l2))
    # All reviews are about hotels and scenic spots, so the words 景区/酒店 carry
    # no information; also remove letters, digits, and common punctuation.
    # regex=True is required on pandas >= 2.0; the hyphen inside the character
    # class is escaped so that '+-=' is not read as a character range.
    reviews_cut = pd.DataFrame(reviews[0].str.replace(r'.*?\d+?\t ', ' ', regex=True))
    content = pd.DataFrame(reviews_cut[0].str.replace('[0-9a-zA-Z]|景区|酒店', '', regex=True))
    content = pd.DataFrame(content[0].str.replace(r'[,!!:_.+\-=——,$%^。?、~@#¥%……&*《》<>「」{}【】()/]', '', regex=True))
    content.to_csv(outputfile, index=False, header=False, encoding='utf-8')


for prefix in ('A', 'H'):
    for i in range(51, 61):
        index = f'{prefix}{i:02d}'
        inputfile = '../test/1/' + index + '.txt'
        outputfile = '../test/2/' + index + '.txt'   # output file
        clean_same(inputfile, outputfile)
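
For reference, a minimal example of what the cleanup patterns above remove (letters, digits, the words 景区/酒店, and common punctuation):

import pandas as pd
s = pd.Series(['这个景区风景很美,性价比高!A区2号门'])
s = s.str.replace('[0-9a-zA-Z]|景区|酒店', '', regex=True)
s = s.str.replace(r'[,!!:_.+\-=——,$%^。?、~@#¥%……&*《》<>「」{}【】()/]', '', regex=True)
print(s[0])  # -> '这个风景很美性价比高区号门'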

5. Segment the text with jieba, remove stop words (including a hand-curated stop-word list), and keep the reviews that contain nouns

#coding:utf-8
import pandas as pd
import numpy as np
import jieba.posseg as psg

def word(inputfile, outputfile):
    reviews = pd.read_csv(inputfile, encoding='utf-8', names=['评论']).astype(str)
    content = reviews['评论']
    worker = lambda s: [(x.word, x.flag) for x in psg.cut(s)]  # simple segmentation helper
    seg_word = content.apply(worker)

    # Flatten into a data frame: one row per word, carrying the id of the review
    # it came from and (later) its position within that review.
    n_word = seg_word.apply(lambda x: len(x))  # number of words per review

    n_content = [[x + 1] * y for x, y in zip(list(seg_word.index), list(n_word))]
    index_content = sum(n_content, [])  # review id of each word

    seg_word = sum(seg_word, [])
    word = [x[0] for x in seg_word]

    nature = [x[1] for x in seg_word]  # part-of-speech flag

    result = pd.DataFrame({"index_content": index_content,
                           "word": word,
                           "nature": nature})
    # Drop punctuation ('x' is jieba's flag for non-words).
    result = result[result['nature'] != 'x']

    # Drop stop words.
    stop_path = open('../stop/stopword2.txt', 'r', encoding='UTF-8')
    stop = stop_path.readlines()
    stop = [x.replace('\n', '') for x in stop]
    word = list(set(word) - set(stop))
    result = result[result['word'].isin(word)]

    # Position of each word within its review.
    n_word = list(result.groupby(by=['index_content'])['index_content'].count())
    index_word = [list(np.arange(0, y)) for y in n_word]
    index_word = sum(index_word, [])
    result['index_word'] = index_word

    # Keep only reviews that contain at least one noun-like word
    # (a part-of-speech flag containing 'n').
    ind = result[['n' in x for x in result['nature']]]['index_content'].unique()
    result = result[[x in ind for x in result['index_content']]]

    # Write out the result.
    result.to_csv(outputfile, index=False, encoding='utf-8')


for prefix in ('A', 'H'):
    for i in range(51, 61):
        index = f'{prefix}{i:02d}'
        inputfile = '../test/3/' + index + '.txt'
        outputfile = '../test/5/' + index + '.csv'   # output file
        word(inputfile, outputfile)
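
For reference, jieba.posseg yields (word, flag) pairs, and the noun filter above keeps reviews whose flags contain 'n'; exact flags can vary with the jieba version and dictionary:

import jieba.posseg as psg
print([(w.word, w.flag) for w in psg.cut('酒店位置很好,服务热情')])
# e.g. [('酒店', 'n'), ('位置', 'n'), ('很', 'd'), ('好', 'a'), (',', 'x'), ('服务', 'vn'), ('热情', 'a')]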

6. Print the high-frequency words in order of frequency

# -*- coding: utf-8 -*-
import pandas as pd
from collections import Counter

def word(inputfile, outputfile):
    data = pd.read_csv(inputfile)
    words = data['word']
    counter = Counter(words)

    # Keep the 2000 most frequent words (the deliverable only needs the top 20).
    count = counter.most_common(2000)

    name = ['comment', 'count']
    data = pd.DataFrame(columns=name, data=count)
    cols = ['number'] + list(data.columns)
    data.index += 1
    data['number'] = data.index
    data2 = data[cols]
    data2.to_excel(outputfile)

'''for i in range(1, 51):
    index = f'A{i:02d}'
    inputfile = '../data/G2_jingquWord/' + index + 'word.csv'
    outputfile = '../train/jingqu/jingqu_cipin/' + index + '.csv'   # output file
    word(inputfile, outputfile)'''
'''for i in range(1, 51):
    index = f'H{i:02d}'
    inputfile = '../data/H2_jiudianWord/' + index + 'word.csv'
    outputfile = '../train/jiudian/jiudian_cipin/' + index + '.csv' # output file
    word(inputfile, outputfile)'''
inputfile = '../test/5_A/all_A.csv'
outputfile = '../test/5_A/A_cipin.xlsx'   # output file
word(inputfile, outputfile)

inputfile = '../test/5_H/all_H.csv'
outputfile = '../test/5_H/H_cipin.xlsx'   # output file
word(inputfile, outputfile)
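
The deliverable for Question 1 is only the TOP20 list saved as “印象词云表.xls”, while the script above writes the top 2000. A minimal sketch of that final step, assuming the merged frequency file all_A.csv above is the input (the exact column layout of 赛题表1 is not reproduced here):

import pandas as pd
from collections import Counter

data = pd.read_csv('../test/5_A/all_A.csv')
top20 = Counter(data['word']).most_common(20)
table = pd.DataFrame(top20, columns=['热门词', '频数'])
table.index += 1
# Writing .xls needs the xlwt engine on older pandas; newer pandas only writes .xlsx.
table.to_excel('印象词云表.xls', index_label='序号')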

Question 2: Using the review texts in Attachment 1 and the scores in Attachment 2, build a reasonable mathematical model and algorithm that rates each scenic spot and hotel on five aspects (service 服务, location 位置, facilities 设施, hygiene 卫生, and value for money 性价比) on a 5-point scale, and evaluate the model by mean squared error (MSE).
1. Classify each sentence as neutral, positive, or negative through the Baidu sentiment API.

# -*- coding: utf-8 -*-
import json
import requests
import pandas as pd
import time

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


def get_sentiment_result(text):
    """
    Call the Baidu sentiment-analysis API for one piece of text.
    :param text: input text
    :return response: parsed JSON response
    """
    if text == '':
        return ''
    # First obtain an OAuth token.
    url = "https://aip.baidubce.com/oauth/2.0/token"
    client_id = '<your_api_key>'         # use your own Baidu AIP credentials here
    client_secret = '<your_secret_key>'
    params = {
        'grant_type': 'client_credentials',
        'client_id': client_id,
        'client_secret': client_secret
    }
    headers = {'Content-Type': 'application/json; charset=UTF-8'}
    response = requests.post(url=url, params=params, headers=headers).json()
    access_token = response['access_token']

    # Sentiment-analysis endpoint.
    url = 'https://aip.baidubce.com/rpc/2.0/nlp/v1/sentiment_classify'

    # Call the endpoint with the token (note: a fresh token is requested on
    # every call here; see the caching sketch further below).
    params = {
        'access_token': access_token
    }
    payload = json.dumps({
        'text': text
    })
    headers = {'Content-Type': 'application/json; charset=UTF-8'}
    response = requests.post(url=url, params=params, data=payload, headers=headers).json()
    return response

def baidu_emotion(inputfile, outputfile):
    # Read the reviews to be analysed.
    text = pd.read_csv(inputfile, encoding='utf-8', names=['评论内容']).astype(str)
    review = text['评论内容']
    length = len(review)
    # Result lists.
    sentiment = ['blank'] * length      # polarity: 0 negative, 1 neutral, 2 positive
    negative_prob = ['blank'] * length  # probability of the negative class, in [0,1]
    positive_pro = ['blank'] * length   # probability of the positive class, in [0,1]
    confidence = ['blank'] * length
    time_start = time.time()  # timing
    i = 0
    for content in review:
        if content:
            result = {}
            op = True  # keep retrying until a valid result comes back
            while op:
                maxTryNum = 50  # maximum number of attempts, to ride out rate limits (tunable)
                for tries in range(maxTryNum):
                    try:
                        result = get_sentiment_result(content)
                        break
                    except:
                        if tries < (maxTryNum - 1):
                            continue
                        else:
                            print('All %d attempts failed!' % maxTryNum)
                            break
                # A successful response carries 3 keys and a failed one only 2,
                # so the key count is used as the stop condition.
                if len(result) == 3:
                    op = False
                else:
                    op = True

            # Handle the two response shapes.
            if 'items' in list(result.keys()):
                result1 = result.get('items')
                item = result1[0]
                sentiment[i] = item['sentiment']
                positive_pro[i] = item['positive_prob']
                negative_prob[i] = item['negative_prob']
                confidence[i] = item['confidence']

            elif 'error_code' in list(result.keys()):
                sentiment[i] = -1
                negative_prob[i] = -1
                positive_pro[i] = -1
                confidence[i] = -1
            # Progress output.
            print('Review %d of %d analysed' % (i + 1, length))
            i = i + 1
    time_end = time.time()
    print('Total analysis time:', time_end - time_start)
    print(sentiment)
    text['评论内容'] = review
    text['情感倾向'] = sentiment
    text['positive_prob'] = positive_pro
    text['negative_prob'] = negative_prob
    text['置信度'] = confidence

    # Save.
    text.to_csv(outputfile, index=None, encoding="utf_8_sig")

# Scenic-spot reviews (the range starts at 54, presumably resuming an interrupted run)
for i in range(54, 61):
    index = f'A{i:02d}'
    inputfile = '../test/3/' + index + '.txt'
    outputfile = '../test/4/' + index + '.csv'   # output file
    baidu_emotion(inputfile, outputfile)
    print('File ' + str(index) + ' done')
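
One easy improvement: get_sentiment_result above requests a fresh OAuth token for every review, even though Baidu AIP tokens are long-lived. A hedged sketch of a drop-in cache, reusing the client_id and client_secret from the same module:

_token_cache = {}

def get_access_token():
    # Fetch the token once and reuse it on subsequent calls.
    if 'token' not in _token_cache:
        params = {
            'grant_type': 'client_credentials',
            'client_id': client_id,
            'client_secret': client_secret,
        }
        resp = requests.post('https://aip.baidubce.com/oauth/2.0/token', params=params).json()
        _token_cache['token'] = resp['access_token']
    return _token_cache['token']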

2. Correct the sentiment-word polarities and draw the word clouds (the word-cloud part is left commented out below). This follows a textbook recipe (the 代码12-6 to 12-8 labels below), shown here for the training file A01.

# -*- coding: utf-8 -*-

# Code 12-6: match the sentiment words

import pandas as pd
import numpy as np
word = pd.read_csv("../train/jingqu/jingqu_stop2/A01_word.csv")

# Load the positive/negative evaluation and emotion word lists.
pos_comment = pd.read_csv("../train/express_data/正面评价词语(中文).txt", header=None, sep="\n",
                          encoding='utf-8', engine='python')
neg_comment = pd.read_csv("../train/express_data/负面评价词语(中文).txt", header=None, sep="\n",
                          encoding='utf-8', engine='python')
pos_emotion = pd.read_csv("../train/express_data/正面情感词语(中文).txt", header=None, sep="\n",
                          encoding='utf-8', engine='python')
neg_emotion = pd.read_csv("../train/express_data/负面情感词语(中文).txt", header=None, sep="\n",
                          encoding='utf-8', engine='python')

# Merge the emotion words with the evaluation words.
positive = set(pos_comment.iloc[:, 0]) | set(pos_emotion.iloc[:, 0])
negative = set(neg_comment.iloc[:, 0]) | set(neg_emotion.iloc[:, 0])
intersection = positive & negative  # words present in both polarity lists
positive = list(positive - intersection)
negative = list(negative - intersection)
positive = pd.DataFrame({"word": positive,
                         "weight": [1] * len(positive)})
negative = pd.DataFrame({"word": negative,
                         "weight": [-1] * len(negative)})

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
posneg = pd.concat([positive, negative])

# Merge the segmentation result with the polarity table to locate sentiment words.
data_posneg = posneg.merge(word, left_on='word', right_on='word',
                           how='right')
data_posneg = data_posneg.sort_values(by=['index_content', 'index_word'])



# Code 12-7: correct the sentiment polarity

# Flip a word's sentiment value when it is preceded by a negation word
# (a double negation cancels out).
# Load the negation word list.
notdict = pd.read_csv("../train/data/not.csv")
notwords = set(notdict['term'])  # membership tests on a set, not on the Series index

# Handle negation modifiers.
data_posneg['amend_weight'] = data_posneg['weight']  # sentiment value after correction
data_posneg['id'] = np.arange(0, len(data_posneg))
only_inclination = data_posneg.dropna()  # keep only words carrying a sentiment value
only_inclination.index = np.arange(0, len(only_inclination))
index = only_inclination['id']

for i in np.arange(0, len(only_inclination)):
    # Extract the review containing the i-th sentiment word.
    review = data_posneg[data_posneg['index_content'] ==
                         only_inclination['index_content'][i]]
    review.index = np.arange(0, len(review))
    affective = only_inclination['index_word'][i]  # position of the word in the review
    if affective == 1:
        # One preceding word: flip the sign if it is a negation.
        ne = sum([w in notwords for w in [review['word'][affective - 1]]])
        if ne == 1:
            data_posneg.loc[index[i], 'amend_weight'] = -data_posneg.loc[index[i], 'weight']
    elif affective > 1:
        # Two preceding words: exactly one negation flips the sign.
        ne = sum([w in notwords for w in review['word'][[affective - 1,
                                                         affective - 2]]])
        if ne == 1:
            data_posneg.loc[index[i], 'amend_weight'] = -data_posneg.loc[index[i], 'weight']

# Refresh the sentiment-only view after the correction.
only_inclination = data_posneg.dropna()

# Sentiment value of each review.
emotional_value = only_inclination.groupby(['index_content'],
                                           as_index=False)['amend_weight'].sum()
print(emotional_value)
# Drop reviews whose sentiment value is 0.
emotional_value = emotional_value[emotional_value['amend_weight'] != 0]



# Code 12-8: inspect the sentiment-analysis result

# Reviews with a positive sentiment value get type (a_type) '2', negative ones '0'.
emotional_value['a_type'] = ''
emotional_value.loc[emotional_value['amend_weight'] > 0, 'a_type'] = '2'
emotional_value.loc[emotional_value['amend_weight'] < 0, 'a_type'] = '0'

# Evaluate against labels (note: this needs a labelled 'content_type' column,
# which only the textbook's training data provides).
result = emotional_value.merge(word,
                               left_on='index_content',
                               right_on='index_content',
                               how='left')

result = result[['index_content', 'content_type', 'a_type']].drop_duplicates()
confusion_matrix = pd.crosstab(result['content_type'], result['a_type'],
                               margins=True)  # cross-tabulation
(confusion_matrix.iat[0, 0] + confusion_matrix.iat[1, 1]) / confusion_matrix.iat[2, 2]  # accuracy

# Split the words into positive and negative reviews.
ind_pos = list(emotional_value[emotional_value['a_type'] == '2']['index_content'])
ind_neg = list(emotional_value[emotional_value['a_type'] == '0']['index_content'])
posdata = word[[i in ind_pos for i in word['index_content']]]
negdata = word[[i in ind_neg for i in word['index_content']]]

# Draw the word clouds.
'''import matplotlib.pyplot as plt
from wordcloud import WordCloud
# word cloud of positive sentiment words
freq_pos = posdata.groupby(by=['word'])['word'].count()
freq_pos = freq_pos.sort_values(ascending=False)
backgroud_Image = plt.imread('../train/data/pl.jpg')
wordcloud = WordCloud(font_path="STZHONGS.ttf",
                      max_words=100,
                      background_color='white',
                      mask=backgroud_Image)
pos_wordcloud = wordcloud.fit_words(freq_pos)
plt.imshow(pos_wordcloud)
plt.axis('off')
plt.show()
# word cloud of negative sentiment words
freq_neg = negdata.groupby(by=['word'])['word'].count()
freq_neg = freq_neg.sort_values(ascending=False)
neg_wordcloud = wordcloud.fit_words(freq_neg)
plt.imshow(neg_wordcloud)
plt.axis('off')
plt.show()
'''
# Write out the result, one review per line.
posdata.to_csv("../train/jingqu/Emotional_correction/A01_posdata.csv", index=False, encoding='utf-8')
negdata.to_csv("../train/jingqu/Emotional_correction/A01_negdata.csv", index=False, encoding='utf-8')

3. Split the negative, positive, and neutral sentiment results into the five aspects by hand. (Only part of the code is shown.)

# -*- coding: utf-8 -*-
import pandas as pd

# For each manually built aspect word list (jd位置.xlsx etc., from step 3),
# count how many of the destination's reviews mention each aspect word.
def score(inputfiles, index):
    data1 = pd.read_csv(inputfiles)
    contents = data1['评论内容']

    aspects = ['位置', '服务', '设施', '卫生', '性价比']
    for aspect in aspects:
        data = pd.read_excel('../test/jiudian/jd' + aspect + '.xlsx')
        words = data[aspect]
        counts = ['blank'] * len(words)   # stays 'blank' if the word never appears
        i = 0
        for w in words:
            count1 = 0
            for content in contents:
                if str(w) in str(content):   # str() guards against NaN rows
                    count1 = count1 + 1
                    counts[i] = count1
            i = i + 1
        data['次数'] = counts
        # (the old encoding= argument was dropped from to_excel in newer pandas)
        data.to_excel('../test/jiudian/' + aspect + '/' + index + '.xlsx')

for i in range(51, 61):
    index = f'H{i:02d}'
    inputfile = '../test/4/' + index + '.csv'
    score(inputfile, index)

4. Compute the scores. (Partial code.)

# -*- coding: utf-8 -*-
import pandas as pd

# Turn the raw counts into positive/negative/neutral ratios per aspect word.
def ratio(inputfiles, outputfiles):
    data = pd.read_excel(inputfiles)
    data['消极'] = data['消极'].replace('blank', '0')
    data['积极'] = data['积极'].replace('blank', '0')
    data['中等'] = data['中等'].replace('blank', '0')

    data['积极_r'] = data['积极'].astype(float) / data['次数']
    data['消极_r'] = data['消极'].astype(float) / data['次数']
    data['中等_r'] = data['中等'].astype(float) / data['次数']
    data.to_excel(outputfiles)

aspects = ['服务', '位置', '设施', '卫生', '性价比']
for aspect in aspects:
    for i in range(51, 61):
        index = f'H{i:02d}'
        inputfiles = '../test/jingqu_6_radio/' + aspect + '/' + index + '.xlsx'
        outputfiles = '../test/jingqu_6_radio/' + aspect + '/' + index + '.xlsx'
        print(inputfiles)
        ratio(inputfiles, outputfiles)
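
Question 2 also asks for an MSE evaluation against the official scores in Attachment 2, which the code above stops short of. A minimal sketch of that comparison; the file names and column layout here are placeholders, not the actual attachment format:

import pandas as pd

pred = pd.read_excel('predicted_scores.xlsx')    # hypothetical model output, one row per destination
true = pd.read_excel('attachment2_scores.xlsx')  # hypothetical export of the 附件2 scores
for aspect in ['服务', '位置', '设施', '卫生', '性价比']:
    mse = ((pred[aspect] - true[aspect]) ** 2).mean()
    print(aspect, 'MSE:', round(mse, 4))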

Question 3: For various reasons, online reviews often contain irrelevant content, simple copy-and-edit duplicates, and posts with no real substance, which keeps travellers from getting useful information out of them and creates challenges for the platforms' operations. From a text-analysis perspective, build a reasonable model to assess the validity of the reviews in Attachment 1.
Approach: the model here is built mainly on text similarity.

# -*- coding: utf-8 -*-
import jieba
import numpy as np
import re
import pandas as pd


def get_word_vector(s1, s2):
    """
    :param s1: sentence 1
    :param s2: sentence 2
    :return: the term-frequency vectors of the two sentences
    """
    # Segment both sentences.
    cut1 = jieba.cut(s1)
    cut2 = jieba.cut(s2)
    list_word1 = (','.join(cut1)).split(',')
    list_word2 = (','.join(cut2)).split(',')

    # Vocabulary: the union of the two word lists.
    key_word = list(set(list_word1 + list_word2))
    # Zero-filled vectors, one slot per vocabulary word.
    word_vector1 = np.zeros(len(key_word))
    word_vector2 = np.zeros(len(key_word))

    # Term frequencies: count each vocabulary word in each sentence.
    for i in range(len(key_word)):
        for j in range(len(list_word1)):
            if key_word[i] == list_word1[j]:
                word_vector1[i] += 1
        for k in range(len(list_word2)):
            if key_word[i] == list_word2[k]:
                word_vector2[i] += 1

    return word_vector1, word_vector2


def cos_dist(vec1, vec2):
    """
    :param vec1: vector 1
    :param vec2: vector 2
    :return: the cosine similarity of the two vectors
    """
    dist1 = float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))
    return dist1


def filter_html(html):
    """
    :param html: html
    :return: the text with all HTML tags stripped
    """
    # helper kept from the original script; not called below
    dr = re.compile(r'<[^>]+>', re.S)
    dd = dr.sub('', html).strip()
    return dd


def inputfile(inputfile):
    # Compare every pair of reviews; mark the later one of any near-duplicate
    # pair (cosine similarity >= 0.8) as invalid.
    contents = pd.read_csv(inputfile)
    contents = contents['评论内容']
    count = 0
    for index in range(len(contents)):
        s1 = str(contents[index])
        if s1 == '':
            continue
        for index2 in range(index + 1, len(contents)):
            s2 = str(contents[index2])
            if s2 == '':
                continue
            vec1, vec2 = get_word_vector(s1, s2)
            dist1 = cos_dist(vec1, vec2)
            if dist1 >= 0.8:
                contents[index2] = ''   # flag the later review as invalid
                count = count + 1
                print(index, s1)
                print(index2, s2)
    print(inputfile + ':' + str(count))

# The original second loop repeated the hotel prefix with a lowercase 'h',
# presumably a typo for the scenic-spot prefix 'A', so both prefixes are covered here.
for prefix in ('H', 'A'):
    for i in range(51, 61):
        index = f'{prefix}{i:02d}'
        inputfile1 = '../test/4/' + index + '.csv'
        inputfile(inputfile1)
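
A quick check of the similarity measure on two near-duplicate sentences (exact segmentation depends on the jieba version, but the pair shares all tokens except the final exclamation mark):

v1, v2 = get_word_vector('房间干净,服务很好', '房间干净,服务很好!')
print(cos_dist(v1, v2))  # close to 1, above the 0.8 threshold, so the pair is flagged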



Question 4: Build an LDA topic model

import re
import itertools
import numpy as np
import matplotlib.pyplot as plt
from gensim import corpora, models
# posdata / negdata are the positive and negative word tables produced by the
# sentiment-correction step in Question 2 (A01_posdata.csv / A01_negdata.csv).

# Build the dictionaries.
pos_dict = corpora.Dictionary([[i] for i in posdata['word']])  # positive
neg_dict = corpora.Dictionary([[i] for i in negdata['word']])  # negative

# Build the corpora (one single-word document per row, as in the textbook recipe).
pos_corpus = [pos_dict.doc2bow(j) for j in [[i] for i in posdata['word']]]  # positive
neg_corpus = [neg_dict.doc2bow(j) for j in [[i] for i in negdata['word']]]  # negative


# Cosine similarity of two plain vectors.
def cos(vector1, vector2):
    dot_product = 0.0
    normA = 0.0
    normB = 0.0
    for a, b in zip(vector1, vector2):
        dot_product += a * b
        normA += a ** 2
        normB += b ** 2
    if normA == 0.0 or normB == 0.0:
        return None
    else:
        return dot_product / ((normA * normB) ** 0.5)

# Search for the best topic number via the average cosine similarity between topics.
def topic_search(x_corpus, x_dict):

    # Initialise (a single topic is maximally self-similar).
    mean_similarity = []
    mean_similarity.append(1)

    # For each candidate topic number, train an LDA model and measure how
    # similar its topics are to one another (lower means better separation).
    for i in np.arange(2, 11):
        lda = models.LdaModel(x_corpus, num_topics=i, id2word=x_dict)  # train LDA
        term = lda.show_topics(num_words=50)

        # Extract the topic words.
        top_word = []
        for k in np.arange(i):
            top_word.append([''.join(re.findall('"(.*)"', w))
                             for w in term[k][1].split('+')])  # all words of topic k

        # Build the term-frequency vectors.
        word = sum(top_word, [])  # all words
        unique_word = set(word)   # de-duplicated vocabulary

        # One row per topic, one column per vocabulary word.
        mat = []
        for j in np.arange(i):
            top_w = top_word[j]
            mat.append(tuple([top_w.count(k) for k in unique_word]))

        p = list(itertools.permutations(list(np.arange(i)), 2))
        len_p = len(p)
        top_similarity = [0]
        for w in np.arange(len_p):
            vector1 = mat[p[w][0]]
            vector2 = mat[p[w][1]]
            # treat degenerate (zero) vectors as similarity 0
            top_similarity.append(cos(vector1, vector2) or 0)

        # Average cosine similarity between topics.
        mean_similarity.append(sum(top_similarity) / len_p)
    return mean_similarity
            
# Average cosine similarity for each candidate topic number.
pos_k = topic_search(pos_corpus, pos_dict)
neg_k = topic_search(neg_corpus, neg_dict)

# Plot the curves.
from matplotlib.font_manager import FontProperties
font = FontProperties(size=14)

# Make Chinese labels render correctly.
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
fig = plt.figure(figsize=(10, 8))
ax_1 = fig.add_subplot(211)
ax_1.plot(pos_k)
ax_1.set_xlabel('正面评论LDA主题数寻优', fontproperties=font)

ax_2 = fig.add_subplot(212)
ax_2.plot(neg_k)
ax_2.set_xlabel('负面评论LDA主题数寻优', fontproperties=font)
#plt.show()  # show the topic-number search result

# LDA topic analysis with the chosen topic numbers.
pos_topic = models.LdaModel(pos_corpus, num_topics=6, id2word=pos_dict)  # num_topics: number of topics
neg_topic = models.LdaModel(neg_corpus, num_topics=3, id2word=neg_dict)
print('LDA topics of the positive reviews')
print(pos_topic.print_topics(num_words=5))  # num_words: words shown per topic
print('LDA topics of the negative reviews')
print(neg_topic.print_topics(num_words=5))
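
The topic-number search above uses the textbook's average-cosine heuristic. For comparison, gensim also ships a coherence measure that can serve the same purpose; a sketch using the positive corpus built above:

from gensim.models import CoherenceModel

lda = models.LdaModel(pos_corpus, num_topics=6, id2word=pos_dict)
cm = CoherenceModel(model=lda, corpus=pos_corpus, dictionary=pos_dict, coherence='u_mass')
print(cm.get_coherence())  # u_mass coherence: higher (less negative) is better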

Full code and data, packaged for download:
Download link: https://download.csdn.net/download/qq_44700741/85504402
