cumian8165

Sentiment analysis with text mining

In this tutorial, I will explore some text mining techniques for sentiment analysis. We'll look at how to prepare textual data. After that we will try two different classifiers to infer the tweets' sentiment. We will tune the hyperparameters of both classifiers with grid search. Finally, we evaluate the performance on a set of metrics like precision, recall and the F1 score.

For this project, we'll be working with the Twitter US Airline Sentiment data set on Kaggle. It contains the tweet’s text and one variable with three possible sentiment values. Let's start by importing the packages and configuring some settings.

import numpy as np 
import pandas as pd 
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated since pandas 1.0
from time import time
import re
import string
import os
import emoji
from pprint import pprint
import collections
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
sns.set(font_scale=1.3)
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import gensim
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')
np.random.seed(37)

Loading the data

We read in the comma separated file we downloaded from the Kaggle Datasets. We shuffle the data frame in case the classes are sorted. Applying the reindex method on the permutation of the original indices is good for that. In this notebook, we will work with the text variable and the airline_sentiment variable.

df = pd.read_csv('../input/Tweets.csv')
df = df.reindex(np.random.permutation(df.index))
df = df[['text', 'airline_sentiment']]

Exploratory Data Analysis

Target variable

There are three class labels we will predict: negative, neutral or positive.

The class labels are imbalanced, as the chart below shows. We should keep this in mind during the model training phase. With the catplot function of the seaborn package (called factorplot before seaborn 0.9), we can visualize the distribution of the target variable.

sns.catplot(x="airline_sentiment", data=df, kind="count", height=6, aspect=1.5, palette="PuBuGn_d")  # height was called size before seaborn 0.9
plt.show()

Input variable

To analyze the text variable we create a class TextCounts. In this class we compute some basic statistics on the text variable.

count_words: number of words in the tweet
count_mentions: number of mentions; referrals to other Twitter accounts start with an @
count_hashtags: number of tag words, preceded by a #
count_capital_words: number of uppercase words, which are sometimes used to "shout" and express (negative) emotions
count_excl_quest_marks: number of question or exclamation marks
count_urls: number of links in the tweet, preceded by http(s)
count_emojis: number of emoji, which might be a good indicator of the sentiment

class TextCounts(BaseEstimator, TransformerMixin):
    
    def count_regex(self, pattern, tweet):
        return len(re.findall(pattern, tweet))
    
    def fit(self, X, y=None, **fit_params):
        # fit method is used when specific operations need to be done on the train data, but not on the test data
        return self
    
    def transform(self, X, **transform_params):
        count_words = X.apply(lambda x: self.count_regex(r'\w+', x)) 
        count_mentions = X.apply(lambda x: self.count_regex(r'@\w+', x))
        count_hashtags = X.apply(lambda x: self.count_regex(r'#\w+', x))
        count_capital_words = X.apply(lambda x: self.count_regex(r'\b[A-Z]{2,}\b', x))
        count_excl_quest_marks = X.apply(lambda x: self.count_regex(r'!|\?', x))
        count_urls = X.apply(lambda x: self.count_regex(r'http.?://[^\s]+[\s]?', x))
        # We will replace the emoji symbols with a description, which makes using a regex for counting easier
        # Moreover, it will result in having more words in the tweet
        count_emojis = X.apply(lambda x: emoji.demojize(x)).apply(lambda x: self.count_regex(r':[a-z_&]+:', x))
        
        df = pd.DataFrame({'count_words': count_words
                           , 'count_mentions': count_mentions
                           , 'count_hashtags': count_hashtags
                           , 'count_capital_words': count_capital_words
                           , 'count_excl_quest_marks': count_excl_quest_marks
                           , 'count_urls': count_urls
                           , 'count_emojis': count_emojis
                          })
        
        return df

tc = TextCounts()
df_eda = tc.fit_transform(df.text)
df_eda['airline_sentiment'] = df.airline_sentiment
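
To see what these counts look like on a single tweet, here is a minimal sketch that applies the same regular expressions outside the class. The tweet text is made up for illustration and is not part of the data set, and the emoji count is left out so that only the standard library is needed.

```python
import re

# Hypothetical tweet, made up for illustration
tweet = "@united WHY is my flight delayed AGAIN?! See http://example.com #fail"

# Same patterns as in TextCounts.transform
patterns = {
    "count_words": r"\w+",
    "count_mentions": r"@\w+",
    "count_hashtags": r"#\w+",
    "count_capital_words": r"\b[A-Z]{2,}\b",
    "count_excl_quest_marks": r"!|\?",
    "count_urls": r"http.?://[^\s]+[\s]?",
}

# Count the non-overlapping matches of every pattern
counts = {name: len(re.findall(pattern, tweet)) for name, pattern in patterns.items()}
print(counts)
```

Note that the word count also picks up the pieces of the URL (http, example, com), since \w+ matches any run of word characters.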

It could be interesting to see how the TextCounts variables relate to the class variable. So we write a function show_dist that provides descriptive statistics and a plot per target class.

def show_dist(df, col):
    print('Descriptive stats for {}'.format(col))
    print('-'*(len(col)+22))
    print(df.groupby('airline_sentiment')[col].describe())
    bins = np.arange(df[col].min(), df[col].max() + 1)
    g = sns.FacetGrid(df, col='airline_sentiment', height=5, hue='airline_sentiment', palette="PuBuGn_d")  # height was called size before seaborn 0.9
    g = g.map(sns.histplot, col, stat='density', bins=bins)  # histplot replaces the deprecated distplot (seaborn >= 0.11)
    plt.show()

Below you can find the distribution of the number of words in a tweet per target class. For brevity, we will limit ourselves to this variable. The charts for all TextCounts variables are in the notebook on Github.

The number of words used in the tweets is rather low. The largest number of words is 36 and there are even tweets with only 2 words. So we'll have to be careful during data cleaning not to remove too many words. On the upside, the text processing will be faster. Negative tweets contain more words than neutral or positive tweets.
All tweets have at least one mention. This is the result of extracting the tweets based on mentions in the Twitter data. There seems to be no difference in the number of mentions with regard to the sentiment.
Most of the tweets do not contain hashtags. So this variable will not be retained during model training. Again, there is no difference in the number of hashtags with regard to the sentiment.
Most of the tweets do not contain capitalized words and we do not see a difference in distribution between the sentiments.
The positive tweets seem to use a few more exclamation or question marks.
Most tweets do not contain a URL.
Most tweets do not use emojis.

Text Cleaning

Before we start using the tweets' text we need to clean it. We'll do this in the class CleanText. With this class we'll perform the following actions:

remove the mentions, as we want to generalize to tweets of other airline companies too
remove the hash tag sign (#) but not the actual tag, as this may contain information
set all words to lowercase
remove all punctuation, including the question and exclamation marks
remove the URLs, as they do not contain useful information. We did not notice a difference in the number of URLs used between the sentiment classes
make sure to convert the emojis into one word
remove digits
remove stopwords
apply the PorterStemmer to keep the stem of the words

class CleanText(BaseEstimator, TransformerMixin):
    def remove_mentions(self, input_text):
        return re.sub(r'@\w+', '', input_text)
    
    def remove_urls(self, input_text):
        return re.sub(r'http.?://[^\s]+[\s]?', '', input_text)
    
    def emoji_oneword(self, input_text):
        # By compressing the underscore, the emoji is kept as one word
        return input_text.replace('_','')
    
    def remove_punctuation(self, input_text):
        # Make translation table
        punct = string.punctuation
        trantab = str.maketrans(punct, len(punct)*' ')  # Every punctuation symbol will be replaced by a space
        return input_text.translate(trantab)
    
    def remove_digits(self, input_text):
        return re.sub(r'\d+', '', input_text)  # raw string avoids an invalid escape sequence warning
    
    def to_lower(self, input_text):
        return input_text.lower()
    
    def remove_stopwords(self, input_text):
        stopwords_list = stopwords.words('english')
        # Some words which might indicate a certain sentiment are kept via a whitelist
        whitelist = ["n't", "not", "no"]
        words = input_text.split() 
        clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
        return " ".join(clean_words) 
    
    def stemming(self, input_text):
        porter = PorterStemmer()
        words = input_text.split() 
        stemmed_words = [porter.stem(word) for word in words]
        return " ".join(stemmed_words)
    
    def fit(self, X, y=None, **fit_params):
        return self
    
    def transform(self, X, **transform_params):
        clean_X = X.apply(self.remove_mentions).apply(self.remove_urls).apply(self.emoji_oneword).apply(self.remove_punctuation).apply(self.remove_digits).apply(self.to_lower).apply(self.remove_stopwords).apply(self.stemming)
        return clean_X
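
As a quick check of what the regex-based steps do to a single tweet, here is a simplified sketch that uses only the standard library. The function name basic_clean and the tweet are assumptions made for illustration; emoji handling, stopword removal and stemming are deliberately left out, so this is not a replacement for CleanText.

```python
import re
import string

def basic_clean(text):
    """Apply the mention, URL, punctuation, digit and lowercase steps to one tweet."""
    text = re.sub(r"@\w+", "", text)                   # drop mentions
    text = re.sub(r"http.?://[^\s]+[\s]?", "", text)   # drop URLs
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)                       # punctuation becomes spaces
    text = re.sub(r"\d+", "", text)                    # drop digits
    return " ".join(text.lower().split())              # lowercase and squeeze whitespace

# Hypothetical tweet, made up for illustration
cleaned = basic_clean("@JetBlue Flight 123 was GREAT!!! Thanks http://example.com #happy")
print(cleaned)
```

The hash sign is removed together with the other punctuation, while the tag word itself (happy) survives, which matches the intent described in the list above.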

To show what the cleaned text variable looks like, here's a sample.

ct = CleanText()
sr_clean = ct.fit_transform(df.text)
sr_clean.sample(5)

glad rt bet bird wish flown south winter
point upc code check baggag tell luggag vacat day tri swimsuit
vx jfk la dirti plane not standard
tell mean work need estim time arriv pleas need laptop work thank
sure busi go els airlin travel name kathryn sotelo

One side-effect of text cleaning is that some rows do not have any words left in their text. For the CountVectorizer and TfidfVectorizer this does not pose a problem. Yet, for the Word2Vec algorithm this causes an error. There are different strategies to deal with these missing values.

Remove the complete row, but in a production environment this is not desirable.
Impute the missing value with some placeholder text like [no_text]
When applying Word2Vec: use the average of all vectors

Here we will impute with placeholder text.

empty_clean = sr_clean == ''
print('{} records have no words left after text cleaning'.format(sr_clean[empty_clean].count()))
sr_clean.loc[empty_clean] = '[no_text]'

Now that we have the cleaned text of the tweets, we can look at the most frequent words. Below we'll show the top 20 words. The most frequent word is "flight".

cv = CountVectorizer()
bow = cv.fit_transform(sr_clean)
word_freq = dict(zip(cv.get_feature_names_out(), np.asarray(bow.sum(axis=0)).ravel()))  # use get_feature_names() on scikit-learn < 1.0
word_counter = collections.Counter(word_freq)
word_counter_df = pd.DataFrame(word_counter.most_common(20), columns = ['word', 'freq'])
fig, ax = plt.subplots(figsize=(12, 10))
sns.barplot(x="word", y="freq", data=word_counter_df, palette="PuBuGn_d", ax=ax)
plt.show();
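
The counting step itself can be illustrated without scikit-learn. The sketch below is a standard-library alternative on three made-up cleaned tweets, not the approach used above, and it only shows what "most frequent words" means here.

```python
from collections import Counter

# Hypothetical cleaned tweets, made up for illustration
docs = ["flight delay flight", "thank flight", "delay bag"]

# Count every word over all tweets
word_counter = Counter(word for doc in docs for word in doc.split())

# The two most frequent words with their frequencies
top2 = word_counter.most_common(2)
print(top2)
```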

Creating test data

To check the performance of the models we’ll need a test set. Evaluating on the train data would not be correct. You should not test on the same data used for training the model.

First, we combine the TextCounts variables with the CleanText variable. Initially, I made the mistake of executing TextCounts and CleanText inside the GridSearchCV. This took too long, as it applies these functions on each run of the grid search. It suffices to run them only once.

df_model = df_eda
df_model['clean_text'] = sr_clean
df_model.columns.tolist()

So df_model now contains several variables. But our vectorizers (see below) will only need the clean_text variable. The TextCounts variables can be added as such. To select columns, I wrote the class ColumnExtractor below.

class ColumnExtractor(TransformerMixin, BaseEstimator):
    def __init__(self, cols):
        self.cols = cols
    def transform(self, X, **transform_params):
        return X[self.cols]
    def fit(self, X, y=None, **fit_params):
        return self

X_train, X_test, y_train, y_test = train_test_split(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment, test_size=0.1, random_state=37)

Hyperparameter tuning and cross-validation

As we will see below, the vectorizers and classifiers all have configurable parameters. To choose the best parameters, we need to test on a separate validation set that was not used during training. Yet, using only one validation set may not produce reliable validation results. By chance, you might get a good model performance on that particular validation set. If you split the data differently, you might end up with other results. To get a more accurate estimate, we perform cross-validation.

With cross-validation we split the data into a train and validation set many times. The evaluation metric is then averaged over the different folds. Luckily, GridSearchCV applies cross-validation out-of-the-box.
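
To make the splitting concrete, here is a minimal sketch of plain k-fold splitting in pure Python. The function name k_fold_indices and the fold scores are assumptions made for illustration. It is not what GridSearchCV does internally: for classifiers, scikit-learn uses a stratified variant by default when cv is an integer, which also keeps the class proportions similar in each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)

# Hypothetical accuracy per fold; the cross-validated score is their mean
fold_scores = [0.78, 0.80, 0.79, 0.81, 0.77]
mean_score = sum(fold_scores) / len(fold_scores)
```

Every row ends up in the validation set exactly once, so the averaged score depends less on one lucky or unlucky split.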

To find the best parameters for both a vectorizer and classifier, we create a Pipeline.

Evaluation metrics

By default, GridSearchCV uses the estimator's default scorer to compute the best_score_. For both MultinomialNB and LogisticRegression this default scoring metric is accuracy.

In our function grid_vect we additionally generate the classification_report on the test data. This provides some interesting metrics per target class, which might be more appropriate here given the class imbalance. These metrics are the precision, recall and F1 score.

Precision: Of all rows we predicted to be a certain class, how many did we correctly predict?
Recall: Of all rows of a certain class, how many did we correctly predict?
F1 score: Harmonic mean of Precision and Recall.

With the elements of the confusion matrix we can calculate Precision and Recall.
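
As a worked example, the sketch below derives the three metrics per class from a confusion matrix. The matrix values are hypothetical and chosen only for illustration; they are not the results of the models in this tutorial.

```python
labels = ["negative", "neutral", "positive"]

# Hypothetical confusion matrix: rows are the true classes, columns the predicted classes
cm = [
    [80, 15, 5],
    [20, 50, 10],
    [5, 5, 40],
]

def class_metrics(cm, i):
    """Return (precision, recall, f1) for the class at index i."""
    true_positives = cm[i][i]
    predicted_as_i = sum(row[i] for row in cm)   # column total
    actually_i = sum(cm[i])                      # row total
    precision = true_positives / predicted_as_i
    recall = true_positives / actually_i
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: class_metrics(cm, i) for i, label in enumerate(labels)}
for label, (precision, recall, f1) in metrics.items():
    print(f"{label:>8}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision divides the diagonal element by its column total, while recall divides it by its row total.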

# Based on http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_text_feature_extraction.html
def grid_vect(clf, parameters_clf, X_train, X_test, parameters_text=None, vect=None, is_w2v=False):
    
    textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
                      ,'count_mentions','count_urls','count_words']
    
    if is_w2v:
        w2vcols = []
        for i in range(SIZE):
            w2vcols.append(i)
        features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
                                 , ('w2v', ColumnExtractor(cols=w2vcols))]
                                , n_jobs=-1)
    else:
        features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
                                 , ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text')), ('vect', vect)]))]
                                , n_jobs=-1)
    
    pipeline = Pipeline([
        ('features', features)
        , ('clf', clf)
    ])
    
    # Join the parameters dictionaries together
    parameters = dict()
    if parameters_text:
        parameters.update(parameters_text)
    parameters.update(parameters_clf)
    # Make sure you have scikit-learn version 0.19 or higher to use multiple scoring metrics
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1, cv=5)
    
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(X_train, y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    print("Best CV score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))
        
    print("Test score with best_estimator_: %0.3f" % grid_search.best_estimator_.score(X_test, y_test))
    print("\n")
    print("Classification Report Test Data")
    print(classification_report(y_test, grid_search.best_estimator_.predict(X_test)))
                        
    return grid_search

Parameter grids for GridSearchCV

In the grid search, we will investigate the performance of the classifier. The set of parameters used to test the performance are specified below.

# Parameter grid settings for the vectorizers (Count and TFIDF)
parameters_vect = {
    'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
    'features__pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'features__pipe__vect__min_df': (1,2)
}

# Parameter grid settings for MultinomialNB
parameters_mnb = {
    'clf__alpha': (0.25, 0.5, 0.75)
}

# Parameter grid settings for LogisticRegression
parameters_logreg = {
    'clf__C': (0.25, 0.5, 1.0),
    'clf__penalty': ('l1', 'l2')
}
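
It helps to know how many models a grid implies before starting the search. The sketch below repeats the grids from above and counts the parameter combinations with itertools.product. The helper name count_candidates is an assumption made for illustration, and the final refit on the full training data is not counted.

```python
from itertools import product

parameters_vect = {
    'features__pipe__vect__max_df': (0.25, 0.5, 0.75),
    'features__pipe__vect__ngram_range': ((1, 1), (1, 2)),
    'features__pipe__vect__min_df': (1, 2),
}
parameters_mnb = {'clf__alpha': (0.25, 0.5, 0.75)}
parameters_logreg = {'clf__C': (0.25, 0.5, 1.0), 'clf__penalty': ('l1', 'l2')}

def count_candidates(*grids):
    """Count the parameter combinations of the merged grids."""
    merged = {}
    for grid in grids:
        merged.update(grid)
    return sum(1 for _ in product(*merged.values()))

CV_FOLDS = 5  # same value as cv=5 in grid_vect
mnb_candidates = count_candidates(parameters_vect, parameters_mnb)
logreg_candidates = count_candidates(parameters_vect, parameters_logreg)
print("MultinomialNB:", mnb_candidates, "candidates,", mnb_candidates * CV_FOLDS, "fits")
print("LogisticRegression:", logreg_candidates, "candidates,", logreg_candidates * CV_FOLDS, "fits")
```

With this many fits per search, it is clear why the text cleaning is done once up front instead of inside the pipeline.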

Classifiers

Here we will compare the performance of MultinomialNB and LogisticRegression.

mnb = MultinomialNB()
logreg = LogisticRegression(solver='liblinear')  # liblinear supports both the 'l1' and 'l2' penalties used in the grid

CountVectorizer

To use words in a classifier, we need to convert the words to numbers. Sklearn’s CountVectorizer takes all words in all tweets, assigns an ID and counts the frequency of the word per tweet. We then use this bag of words as input for a classifier. This bag of words is a sparse data set. This means that each record will have many zeroes for the words not occurring in the tweet.

countvect = CountVectorizer()
# MultinomialNB
best_mnb_countvect = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=countvect)
joblib.dump(best_mnb_countvect, '../output/best_mnb_countvect.pkl')
# LogisticRegression
best_logreg_countvect = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=countvect)
joblib.dump(best_logreg_countvect, '../output/best_logreg_countvect.pkl')
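
To make the bag-of-words representation described above concrete, here is a minimal pure-Python sketch on two made-up cleaned tweets. It builds a dense matrix for readability, whereas CountVectorizer returns a sparse matrix; the variable names are assumptions made for illustration.

```python
# Hypothetical cleaned tweets, made up for illustration
docs = ["late flight late", "great flight"]

# Assign an ID to every distinct word (alphabetical order)
vocab = sorted({word for doc in docs for word in doc.split()})
word_to_id = {word: idx for idx, word in enumerate(vocab)}

# One row per tweet, one column per word, each cell holds the word frequency
bow = [[doc.split().count(word) for word in vocab] for doc in docs]

print(word_to_id)
print(bow)
```

Even in this tiny example each row contains zeroes for the words that the tweet does not use. With thousands of distinct words almost all cells are zero, which is why a sparse format pays off.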

TF-IDF Vectorizer

One issue with CountVectorizer is that there might be words that occur frequently. These words might not carry discriminating information, so they can be removed or down-weighted. TF-IDF (term frequency-inverse document frequency) can be used to down-weight these frequent words.

tfidfvect = TfidfVectorizer()
# MultinomialNB
best_mnb_tfidf = grid_vect(mnb, parameters_mnb, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)
joblib.dump(best_mnb_tfidf, '../output/best_mnb_tfidf.pkl')
# LogisticRegression
best_logreg_tfidf = grid_vect(logreg, parameters_logreg, X_train, X_test, parameters_text=parameters_vect, vect=tfidfvect)  # use the LogisticRegression grid, not the MultinomialNB one
joblib.dump(best_logreg_tfidf, '../output/best_logreg_tfidf.pkl')
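
The down-weighting can be shown with a small pure-Python sketch. It assumes the formula that scikit-learn applies with its default settings (smooth_idf=True and norm='l2'): idf = ln((1 + n) / (1 + df)) + 1, followed by scaling each row to unit length. The tweets and variable names are made up for illustration.

```python
import math

# Hypothetical cleaned tweets, made up for illustration
docs = ["flight late", "flight great", "flight cancelled late"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocab = sorted({word for doc in tokenized for word in doc})

# Document frequency: in how many tweets does each word occur?
doc_freq = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default)
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_vector(doc):
    """Term frequency times idf, scaled to unit Euclidean length."""
    raw = [doc.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for word in vocab:
    print(f"{word:>10}: idf={idf[word]:.3f}")
```

The word "flight" occurs in every tweet and gets the lowest possible weight, while the rarer words "great" and "cancelled" get the highest.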

Word2Vec

Another way of converting the words to numerical values is to use Word2Vec. Word2Vec maps each word in a multi-dimensional space. It does this by taking into account the context in which a word appears in the tweets. As a result, words that are similar are also close to each other in the multi-dimensional space.

The Word2Vec algorithm is part of the gensim package.

The Word2Vec algorithm uses lists of words as input. For that purpose, we use the word_tokenize method of the nltk package.

SIZE = 50
X_train['clean_text_wordlist'] = X_train.clean_text.apply(lambda x : word_tokenize(x))
X_test['clean_text_wordlist'] = X_test.clean_text.apply(lambda x : word_tokenize(x))
model = gensim.models.Word2Vec(X_train.clean_text_wordlist
, min_count=1
, vector_size=SIZE  # this argument is called size in gensim < 4.0
, window=5
, workers=4)
model.wv.most_similar('plane', topn=3)  # similarity queries live on the KeyedVectors in model.wv

The Word2Vec model provides a vocabulary of the words in all the tweets. For each word you also have its vector values. The number of vector values is equal to the chosen size. These are the dimensions on which each word is mapped in the multi-dimensional space. Words with an occurrence less than min_count are not kept in the vocabulary.

A side effect of the min_count parameter is that some tweets could have no vector values. This would be the case when all the words in the tweet occur fewer than min_count times in the corpus. Due to the small corpus of tweets, there is a risk of this happening in our case. Thus we set the min_count value equal to 1.

The tweets can have a different number of vectors, depending on the number of words they contain. To use this output for modeling we will calculate the average of all vectors per tweet. As such we will have the same number (i.e. size) of input variables per tweet.

We do this with the function compute_avg_w2v_vector. In this function we also check whether the words in the tweet occur in the vocabulary of the Word2Vec model. If none of them do, a list filled with 0.0 is returned. Otherwise, the average of the word vectors is returned.

def compute_avg_w2v_vector(w2v_dict, tweet):
    list_of_word_vectors = [w2v_dict[w] for w in tweet if w in w2v_dict]  # membership test works directly on the KeyedVectors
    
    if len(list_of_word_vectors) == 0:
        result = [0.0]*SIZE
    else:
        result = np.sum(list_of_word_vectors, axis=0) / len(list_of_word_vectors)
        
    return result

X_train_w2v = X_train['clean_text_wordlist'].apply(lambda x: compute_avg_w2v_vector(model.wv, x))
X_test_w2v = X_test['clean_text_wordlist'].apply(lambda x: compute_avg_w2v_vector(model.wv, x))
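
The averaging logic can be checked on a toy example without training a model. In the sketch below a plain dictionary with three-dimensional vectors stands in for the trained word vectors; the words, the vector values and the function name average_vector are made up for illustration.

```python
TOY_SIZE = 3

# Hypothetical word vectors standing in for a trained Word2Vec model
toy_vectors = {
    "plane": [1.0, 0.0, 2.0],
    "late": [3.0, 2.0, 0.0],
}

def average_vector(vectors, words, size):
    """Average the vectors of the known words; fall back to zeroes if none is known."""
    found = [vectors[word] for word in words if word in vectors]
    if not found:
        return [0.0] * size
    # zip(*found) groups the values per dimension
    return [sum(dimension) / len(found) for dimension in zip(*found)]

known = average_vector(toy_vectors, ["plane", "late", "unknownword"], TOY_SIZE)
unknown = average_vector(toy_vectors, ["unknownword"], TOY_SIZE)
print(known)
print(unknown)
```

Whatever the length of the tweet, the result always has the same number of values, which is what the classifier needs.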

This gives us a Series with a vector of dimension equal to SIZE. Now we will split this vector and create a DataFrame with each vector value in a separate column. That way we can concatenate the Word2Vec variables to the other TextCounts variables. We need to reuse the index of X_train and X_test. Otherwise this will give issues (duplicates) in the concatenation later on.

X_train_w2v = pd.DataFrame(X_train_w2v.values.tolist(), index= X_train.index)
X_test_w2v = pd.DataFrame(X_test_w2v.values.tolist(), index= X_test.index)
# Concatenate with the TextCounts variables
X_train_w2v = pd.concat([X_train_w2v, X_train.drop(['clean_text', 'clean_text_wordlist'], axis=1)], axis=1)
X_test_w2v = pd.concat([X_test_w2v, X_test.drop(['clean_text', 'clean_text_wordlist'], axis=1)], axis=1)

We only consider LogisticRegression as we have negative values in the Word2Vec vectors. MultinomialNB assumes that the variables have a multinomial distribution. So they cannot contain negative values.

best_logreg_w2v = grid_vect(logreg, parameters_logreg, X_train_w2v, X_test_w2v, is_w2v=True)
joblib.dump(best_logreg_w2v, '../output/best_logreg_w2v.pkl')

Conclusion

Both classifiers achieve the best results when using the features of the CountVectorizer.
Logistic Regression outperforms the Multinomial Naive Bayes classifier.
The best performance on the test set comes from the LogisticRegression with features from CountVectorizer.

Best parameters

C value of 1
L2 regularization
max_df: 0.5, or a maximum document frequency of 50%
min_df: 1, so a word needs to appear in at least 1 tweet
ngram_range: (1, 2), so both single words and bi-grams are used

Evaluation metrics

A test accuracy of 81.3%. This is better than the baseline of predicting the majority class (here a negative sentiment) for all observations. The baseline would give 63% accuracy.
The Precision is rather high for all three classes. For instance, of all cases that we predict as negative, 80% are actually negative.
The Recall for the neutral class is low. Of all neutral cases in our test data, we only predict 48% as being neutral.
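
The majority-class baseline mentioned above is easy to compute. The sketch below uses a made-up list of 100 labels, constructed so that 63% are negative to mirror the share reported in the text; the split between neutral and positive is an assumption for illustration only.

```python
from collections import Counter

# Hypothetical labels: 63% negative as in the text, the rest split arbitrarily
y_true = ["negative"] * 63 + ["neutral"] * 21 + ["positive"] * 16

# Always predicting the most frequent class gives the baseline accuracy
majority_class, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)
print(majority_class, baseline_accuracy)
```

Any model worth keeping should beat this number, since it requires no learning at all.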

Apply the best model on new tweets

For fun, we will use the best model and apply it to some new tweets that contain @VirginAmerica. I selected 3 negative and 3 positive tweets by hand.

Thanks to the GridSearchCV, we now know the best hyperparameters. So now we can train the best model on all the data, including the test data that we split off before.

textcountscols = ['count_capital_words','count_emojis','count_excl_quest_marks','count_hashtags'
,'count_mentions','count_urls','count_words']
features = FeatureUnion([('textcounts', ColumnExtractor(cols=textcountscols))
, ('pipe', Pipeline([('cleantext', ColumnExtractor(cols='clean_text'))
, ('vect', CountVectorizer(max_df=0.5, min_df=1, ngram_range=(1,2)))]))]
, n_jobs=-1)
pipeline = Pipeline([
('features', features)
, ('clf', LogisticRegression(C=1.0, penalty='l2'))
])
best_model = pipeline.fit(df_model.drop('airline_sentiment', axis=1), df_model.airline_sentiment)
# Applying on new positive tweets
new_positive_tweets = pd.Series(["Thank you @VirginAmerica for you amazing customer support team on Tuesday 11/28 at @EWRairport and returning my lost bag in less than 24h! #efficiencyiskey #virginamerica"
,"Love flying with you guys ask these years. Sad that this will be the last trip ? @VirginAmerica #LuxuryTravel"
,"Wow @VirginAmerica main cabin select is the way to fly!! This plane is nice and clean & I have tons of legroom! Wahoo! NYC bound! ✈️"])
df_counts_pos = tc.transform(new_positive_tweets)
df_clean_pos = ct.transform(new_positive_tweets)
df_model_pos = df_counts_pos
df_model_pos['clean_text'] = df_clean_pos
best_model.predict(df_model_pos).tolist()
# Applying on new negative tweets
new_negative_tweets = pd.Series(["@VirginAmerica shocked my initially with the service, but then went on to shock me further with no response to what my complaint was. #unacceptable @Delta @richardbranson"
,"@VirginAmerica this morning I was forced to repack a suitcase w a medical device because it was barely overweight - wasn't even given an option to pay extra. My spouses suitcase then burst at the seam with the added device and had to be taped shut. Awful experience so far!"
,"Board airplane home. Computer issue. Get off plane, traverse airport to gate on opp side. Get on new plane hour later. Plane too heavy. 8 volunteers get off plane. Ohhh the adventure of travel ✈️ @VirginAmerica"])
df_counts_neg = tc.transform(new_negative_tweets)
df_clean_neg = ct.transform(new_negative_tweets)
df_model_neg = df_counts_neg
df_model_neg['clean_text'] = df_clean_neg
best_model.predict(df_model_neg).tolist()

The model classifies all tweets correctly. A larger test set should be used to assess the model’s performance. But on this small data set it does what we are aiming for. I hope you enjoyed reading this story. If you did, feel free to share it.

Translated from: https://www.freecodecamp.org/news/sentiment-analysis-with-text-mining/

OpenCV引擎：驱动实时应用开发的科技狂飙芯作者 DD：计算机科学领域 opencv 计算机视觉
在人工智能与计算机视觉技术迅猛发展的今天，实时图像处理已成为工业自动化、自动驾驶、医疗诊断、增强现实等领域的核心技术需求。而**OpenCV（OpenSourceComputerVisionLibrary）**作为全球最活跃的开源计算机视觉库，正以其强大的算法生态、跨平台兼容性以及持续进化的架构设计，成为驱动实时应用开发的“数字引擎”。本文将深入剖析OpenCV如何通过技术创新突破实时处理的性能极
jxORM--编程指南 jxandrew jxWebUI 数据库 python jxWebUI jxORM ORM
jxORM是jxWebUI配套的数据库操作库，可以简化python程序员操作数据库。声明数据类定义数据类之前，先导入ORM修饰符：fromjxORMimportORM,DBDataType,ColType然后就可以用ORM修饰符来修饰一个类，从而定义一个数据类：@ORMclassUser:ID:DBDataType.Long=ColType.PrimaryKeyCreateTime:DBDataT
深度学习系列-----＞环境搭建（Ubuntu）二师兄用飘柔深度学习历程深度学习 ubuntu 人工智能 pytorch python
1、前言电脑基础系统硬件情况：系统：ubuntu18.04、显卡：GTX1050Ti；后续的环境搭建都在此基础上进行。此次学习选择Pytorch作为深度学习的框架，选择的原因主要由于PyTorch在研究领域特别受欢迎，较多的论文框架也是基于其开发。2、anaconda+python3安装测试在学习深度学习的过程中会涉及到使用不同版本python包的问题，而anaconda可以便捷获取包且对包能够进
Python中的enumerate()函数冉成未来 Service python 开发语言
文章目录基本用法参数说明特点实际应用与zip()的比较注意事项enumerate()是Python内置的一个非常有用的函数，它用于在遍历可迭代对象（如列表、元组、字符串等）时，同时获取元素的索引和值。基本用法fruits=['apple','banana','cherry']forindex,fruitinenumerate(fruits):print(index,fruit)输出：0apple1
空间曲线正交投影及其距离计算的理论与实践老歌老听老掉牙 python 正交投影
引言：正交投影的几何本质在三维空间中，正交投影是一种基础而重要的几何变换，它将空间中的点沿特定方向映射到一个平面上。当我们考虑将空间曲线投影到由给定法向量n\mathbf{n}n定义的平面时，这一问题在计算机图形学、CAD/CAM系统和科学计算中具有广泛应用。本文将从数学原理、Python实现到距离计算的等价性问题，全面探讨这一几何操作的深层内涵。设空间曲线由参数方程r(t)=(x(t),y(t)
pip是如何卸载你安装的第三方库的酷python python python
使用pipuninstall命令可以卸载掉你所安装的第三方库，所有与其相关的文件都将被pip整理出来展示并询问是否真的要删除，类似下面的提示pipuninstallnoxFoundexistinginstallation:nox2020.8.22Uninstallingnox-2020.8.22:Wouldremove:d:\python\lib\site-packages\nox-2020.8.
深度学习-常用环境配置瑶山 AI linux 人工智能 windows CUDA PyTorch
目录Miniconda安装安装NVIDIA显卡驱动安装CUDA和cnDNNCUDAcuDNNPyTorch安装手动下载测试Miniconda安装最新版Miniconda搭建Python环境_miniconda创建python虚拟环境-CSDN博客安装NVIDIA显卡驱动直接进NVIDIA官网：NVIDIAGeForce驱动程序-N卡驱动|NVIDIA在这里有GeForce驱动程序，立即下载，这是下
机器学习初学者理论初解 Mikhail_G 机器学习人工智能
大家好!为什么手机相册能自动识别人脸？为什么购物网站总能推荐你喜欢的商品？这些“智能”背后，都藏着一位隐形高手——机器学习（MachineLearning）。一、什么是机器学习？简单说，机器学习是教计算机从数据中自己找规律的技术。就像教孩子认猫：不是直接告诉他“猫有尖耳朵和胡须”，而是给他看100张猫狗照片，让他自己总结出猫的特征。传统程序vs机器学习传统程序：输入规则+数据→输出结果（例：按“温
Nginx IP授权页面实现步骤
目标：一、创建白名单文件sudomkdir-p/usr/local/nginx/conf/whitelistsudotouch/usr/local/nginx/conf/whitelist/temporary.conf二、创建Python认证服务文件路径：/opt/script/auth_server.pyimportosimporttimefromflaskimportFlask,request
高阶知识库搭建实战五、（向量数据库Milvus安装）伯牙碎琴大模型数据库 milvus 大模型 AI
以下是关于在Windows环境下直接搭建Milvus向量数据库的教程：本教程分两部分，第一部分是基于docker安装，在Windows环境下直接安装Milvus向量数据库，目前官方推荐的方式是通过Docker进行部署，因为Milvus的运行环境依赖于Linux系统。如果你希望在Windows上直接运行Milvus，可以考虑使用MilvusLite版本，这是一个轻量级的Python库，适用于快速原型
Embedding与向量数据库玖月初玖大模型应用开发基础人工智能 embedding 数据库
1.Embedding是什么EmbeddingModel是一种机器学习模型，它的核心任务是将离散的、高维的符号（如单词、句子、图片、用户、商品等）转换成连续的、低维的向量（称为“嵌入”或“向量表示”），并且这个向量能有效地捕捉原始符号的语义、关系或特征。1.1通俗理解EmbeddingModel是让计算机“理解”世界的核心工具，把“文字、图片、音频”等信息变成一串有意义的数字我们称之为“向量”。类
python分布式事务_分布式事务系列（2.1）分布式事务的概念
#1系列目录#2X/OpenDTPDTP全称是DistributedTransactionProcess，即分布式事务模型。之前我们接触的事务都是针对单个数据库的操作，如果涉及多个数据库的操作，还想保证原子性，这就需要使用分布式事务了。而X/OpenDTP就是一种分布式事务处理模型。##2.1X/OpenDTP模型X/Open是一个组织，维基百科上这样说明：X/Open是1984年由多个公司联合创
LLM初识
从零到一：用Python和LLM构建你的专属本地知识库问答机器人摘要：随着大型语言模型（LLM）的兴起，构建智能问答系统变得前所未有的简单。本文将详细介绍如何使用Python，结合开源的LLM和向量数据库技术，一步步搭建一个基于你本地文档的知识库问答机器人。你将学习到从环境准备、文档加载、文本切分、向量化、索引构建到最终实现问答交互的完整流程。本文包含详细的流程图描述、代码片段思路和关键注意事项，
CCF-GESP 等级考试 2025年6月认证Python四级真题解析
1单选题（每题2分，共30分）第1题2025年4月19日在北京举行了一场颇为瞩目的人形机器人半程马拉松赛。比赛期间，跑动着的机器人会利用身上安装的多个传感器所反馈的数据来调整姿态、保持平衡等，那么这类传感器类似于计算机的()。A.处理器B.存储器C.输入设备D.输出设备解析：答案：C。所有传感器都用于采集数据，属于输入设备，故选C。第2题小杨购置的计算机使用一年后觉得内存不够用了，想购置一个容量更
推荐开源项目：Milvus Lite —— 轻量级向量数据库，助力AI应用快速起飞穆希静
推荐开源项目：MilvusLite——轻量级向量数据库，助力AI应用快速起飞项目介绍MilvusLite是知名开源向量数据库Milvus的轻量级版本，专为需要在小型环境中进行向量嵌入和相似性搜索的AI应用设计。通过将MilvusLite导入您的Python应用，您可以直接使用Milvus的核心向量搜索功能。MilvusLite已集成在PythonSDKofMilvus中，只需通过pipinstal
【数据结构】详解堆排序当中的topk问题（leetcode例题） ylfxw 数据结构 leetcode 算法
文章目录前言如何理解topk问题代码逻辑代码实现前言Leetcode相关题目：215.数组中的第K个最大元素如何理解topk问题**TopK问题是一个经典的问题，在计算机科学中，它的目标是在一组数据中找到前K个最大或最小的元素。**这个问题在许多场景下都很重要，比如搜索引擎的搜索结果排名、数据分析中的热门元素筛选等。.在最简单的形式中，给定一个数组（或列表）和一个整数K，TopK问题要求返回数组中
【华为419机考真题】服务器能耗统计，JAVA 题解梦想橡皮擦华为服务器 java 华为OD机试华为OD
最近更新的博客华为od2023|什么是华为od，od薪资待遇，od机试题清单华为OD机试真题大全，用Python解华为机试题|机试宝典【华为OD机试】全流程解析+经验分享,题型分享,防作弊指南华为od机试，独家整理已参加机试人员的实战技巧本篇题解：服务器耗能题目描述服务器有三种运行状态：空载，单任务，多任务，每个时间片的能耗的分别为111、333、444，每个任务由起始时间片和结束时间片定义运行时
拼多多官方返利新动向，高省App引领购物省钱新趋势古楼
电商行业的快速发展带来了无数的新趋势和新机遇，而拼多多官方返利的新趋势无疑是其中的一大亮点。高省App作为这一趋势的敏锐洞察者和积极参与者，致力于帮助用户精准把握这些新机遇。通过高省App，用户可以及时了解拼多多官方返利的最新政策和活动信息，从而做出更加明智的购物决策。同时，高省App还提供了专业的数据分析工具，帮助用户分析自己的消费行为和省钱效果，让省钱之路更加清晰和明确。我们在开始讲今天的文章
Java实现的基于模板的网页结构化信息精准抽取组件：HtmlExtractor yangshangchuan 信息抽取 HtmlExtractor 精准抽取信息采集
HtmlExtractor是一个Java实现的基于模板的网页结构化信息精准抽取组件，本身并不包含爬虫功能，但可被爬虫或其他程序调用以便更精准地对网页结构化信息进行抽取。 HtmlExtractor是为大规模分布式环境设计的，采用主从架构，主节点负责维护抽取规则，从节点向主节点请求抽取规则，当抽取规则发生变化，主节点主动通知从节点，从而能实现抽取规则变化之后的实时动态生效。如
java编程思想 -- 多态百合不是茶 java 多态详解
一: 向上转型和向下转型面向对象中的转型只会发生在有继承关系的子类和父类中（接口的实现也包括在这里）。父类：人子类：男人向上转型： Person p = new Man() ; //向上转型不需要强制类型转化向下转型： Man man =
[自动数据处理]稳扎稳打,逐步形成自有ADP系统体系 comsci dp
对于国内的IT行业来讲,虽然我们已经有了"两弹一星",在局部领域形成了自己独有的技术特征,并初步摆脱了国外的控制...但是前面的路还很长.... 首先是我们的自动数据处理系统还无法处理很多高级工程...中等规模的拓扑分析系统也没有完成,更加复杂的
storm 自定义日志文件商人shang storm cluster logback
Storm中的日志级级别默认为INFO，并且，日志文件是根据worker号来进行区分的，这样，同一个log文件中的信息不一定是一个业务的，这样就会有以下两个需求出现： 1. 想要进行一些调试信息的输出 2. 调试信息或者业务日志信息想要输出到一些固定的文件中不要怕，不要烦恼，其实Storm已经提供了这样的支持，可以通过自定义logback 下的 cluster.xml 来输
Extjs3 SpringMVC使用 @RequestBody 标签问题记录 21jhf
springMVC使用 @RequestBody(required = false) UserVO userInfo 传递json对象数据，往往会出现http 415，400,500等错误，总结一下需要使用ajax提交json数据才行，ajax提交使用proxy，参数为jsonData，不能为params；另外，需要设置Content-type属性为json，代码如下：（由于使用了父类aaa
一些排错方法文强chu 方法
1、java.lang.IllegalStateException: Class invariant violation at org.apache.log4j.LogManager.getLoggerRepository(LogManager.java:199)at org.apache.log4j.LogManager.getLogger(LogManager.java:228) at o
Swing中文件恢复我觉得很难小桔子 swing
我那个草了！老大怎么回事，怎么做项目评估的？只会说相信你可以做的，试一下，有的是时间！用java开发一个图文处理工具，类似word，任意位置插入、拖动、删除图片以及文本等。文本框、流程图等，数据保存数据库，其余可保存pdf格式。ok,姐姐千辛万苦，
php 文件操作 aichenglong PHP 读取文件写入文件
1 写入文件 @$fp=fopen("$DOCUMENT_ROOT/order.txt", "ab"); if(!$fp){ echo "open file error" ; exit; } $outputstring="date:"." \t tire:".$tire."
MySQL的btree索引和hash索引的区别 AILIKES 数据结构 mysql 算法
Hash 索引结构的特殊性，其检索效率非常高，索引的检索可以一次定位，不像B-Tree 索引需要从根节点到枝节点，最后才能访问到页节点这样多次的IO访问，所以 Hash 索引的查询效率要远高于 B-Tree 索引。可能很多人又有疑问了，既然 Hash 索引的效率要比 B-Tree 高很多，为什么大家不都用 Hash 索引而还要使用 B-Tree 索引呢
JAVA的抽象--- 接口 --实现百合不是茶
抽象接口实现接口 //抽象类 ,方法 //定义一个公共抽象的类 ,并在类中定义一个抽象的方法体抽象的定义使用abstract abstract class A 定义一个抽象类例如： //定义一个基类 public abstract class A{ //抽象类不能用来实例化，只能用来继承 //
JS变量作用域实例 bijian1013 作用域
<script> var scope='hello'; function a(){ console.log(scope); //undefined var scope='world'; console.log(scope); //world console.log(b);
TDD实践（二） bijian1013 java TDD
实践题目：分解质因数 Step1：单元测试： package com.bijian.study.factor.test; import java.util.Arrays; import junit.framework.Assert; import org.junit.Before; import org.junit.Test; import com.bijian.
[MongoDB学习笔记一]MongoDB主从复制 bit1129 mongodb
MongoDB称为分布式数据库，主要原因是1.基于副本集的数据备份， 2.基于切片的数据扩容。副本集解决数据的读写性能问题，切片解决了MongoDB的数据扩容问题。事实上，MongoDB提供了主从复制和副本复制两种备份方式，在MongoDB的主从复制和副本复制集群环境中，只有一台作为主服务器，另外一台或者多台服务器作为从服务器。本文介绍MongoDB的主从复制模式，需要指明
【HBase五】Java API操作HBase bit1129 hbase
import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.hbase.HBaseConfiguration; import org.apache.hadoop.hbase.HColumnDescriptor; import org.apache.ha
python调用zabbix api接口实时展示数据 ronin47
zabbix api接口来进行展示。经过思考之后，计划获取如下内容： 1、获得认证密钥 2、获取zabbix所有的主机组 3、获取单个组下的所有主机 4、获取某个主机下的所有监控项
jsp取得绝对路径 byalias 绝对路径
在JavaWeb开发中，常使用绝对路径的方式来引入JavaScript和CSS文件，这样可以避免因为目录变动导致引入文件找不到的情况，常用的做法如下：一、使用${pageContext.request.contextPath} 　　代码” ${pageContext.request.contextPath}”的作用是取出部署的应用程序名，这样不管如何部署，所用路径都是正确的。
Java定时任务调度：用ExecutorService取代Timer bylijinnan java
《Java并发编程实战》一书提到的用ExecutorService取代Java Timer有几个理由，我认为其中最重要的理由是：如果TimerTask抛出未检查的异常，Timer将会产生无法预料的行为。Timer线程并不捕获异常，所以 TimerTask抛出的未检查的异常会终止timer线程。这种情况下，Timer也不会再重新恢复线程的执行了;它错误的认为整个Timer都被取消了。此时，已经被
SQL 优化原则 chicony sql
一、问题的提出　在应用系统开发初期，由于开发数据库数据比较少，对于查询SQL语句，复杂视图的的编写等体会不出SQL语句各种写法的性能优劣，但是如果将应用系统提交实际应用后，随着数据库中数据的增加，系统的响应速度就成为目前系统需要解决的最主要的问题之一。系统优化中一个很重要的方面就是SQL语句的优化。对于海量数据，劣质SQL语句和优质SQL语句之间的速度差别可以达到上百倍，可见对于一个系统
java 线程弹球小游戏 CrazyMizzz java 游戏
最近java学到线程，于是做了一个线程弹球的小游戏，不过还没完善这里是提纲 1.线程弹球游戏实现 1.实现界面需要使用哪些API类 JFrame JPanel JButton FlowLayout Graphics2D Thread Color ActionListener ActionEvent MouseListener Mouse
hadoop jps出现process information unavailable提示解决办法 daizj hadoop jps
hadoop jps出现process information unavailable提示解决办法 jps时出现如下信息： 3019 -- process information unavailable3053 -- process information unavailable2985 -- process information unavailable2917 --
PHP图片水印缩放类实现 dcj3sjt126com PHP
<?php class Image{ private $path; function __construct($path='./'){ $this->path=rtrim($path,'/').'/'; } //水印函数，参数：背景图，水印图，位置，前缀,TMD透明度 public function water($b,$l,$pos
IOS控件学习：UILabel常用属性与用法 dcj3sjt126com ios UILabel
参考网站： http://shijue.me/show_text/521c396a8ddf876566000007 http://www.tuicool.com/articles/zquENb http://blog.csdn.net/a451493485/article/details/9454695 http://wiki.eoe.cn/page/iOS_pptl_artile_281
完全手动建立maven骨架 eksliang java eclipse Web
建一个 JAVA 项目： mvn archetype:create -DgroupId=com.demo -DartifactId=App [-Dversion=0.0.1-SNAPSHOT] [-Dpackaging=jar] 建一个 web 项目： mvn archetype:create -DgroupId=com.demo -DartifactId=web-a
配置清单 gengzg 配置
1、修改grub启动的内核版本 vi /boot/grub/grub.conf 将default 0改为1 拷贝mt7601Usta.ko到/lib文件夹拷贝RT2870STA.dat到 /etc/Wireless/RT2870STA/文件夹拷贝wifiscan到bin文件夹，chmod 775 /bin/wifiscan 拷贝wifiget.sh到bin文件夹，chm
Windows端口被占用处理方法 huqiji windows
以下文章主要以80端口号为例，如果想知道其他的端口号也可以使用该方法..........................1、在windows下如何查看80端口占用情况?是被哪个进程占用?如何终止等. 这里主要是用到windows下的DOS工具,点击"开始"--"运行",输入&
开源ckplayer 网页播放器，跨平台(html5, mobile)，flv, f4v, mp4, rtmp协议. webm, ogg, m3u8 ！天梯梦 mobile
CKplayer，其全称为超酷flv播放器，它是一款用于网页上播放视频的软件，支持的格式有：http协议上的flv,f4v,mp4格式，同时支持rtmp视频流格式播放，此播放器的特点在于用户可以自己定义播放器的风格，诸如播放/暂停按钮，静音按钮，全屏按钮都是以外部图片接口形式调用，用户根据自己的需要制作出播放器风格所需要使用的各个按钮图片然后替换掉原始风格里相应的图片就可以制作出自己的风格了，
简单工厂设计模式 hm4123660 java 工厂设计模式简单工厂模式
简单工厂模式（Simple Factory Pattern）属于类的创新型模式，又叫静态工厂方法模式。是通过专门定义一个类来负责创建其他类的实例，被创建的实例通常都具有共同的父类。简单工厂模式是由一个工厂对象决定创建出哪一种产品类的实例。简单工厂模式是工厂模式家族中最简单实用的模式，可以理解为是不同工厂模式的一个特殊实现。
maven笔记 zhb8015 maven
跳过测试阶段： mvn package -DskipTests 临时性跳过测试代码的编译： mvn package -Dmaven.test.skip=true maven.test.skip同时控制maven-compiler-plugin和maven-surefire-plugin两个插件的行为，即跳过编译，又跳过测试。指定测试类 mvn test
非mapreduce生成Hfile，然后导入hbase当中 Stark_Summer map hbase reduce Hfile path实例
最近一个群友的boss让研究hbase，让hbase的入库速度达到5w+/s，这可愁死了，4台个人电脑组成的集群，多线程入库调了好久，速度也才1w左右，都没有达到理想的那种速度，然后就想到了这种方式，但是网上多是用mapreduce来实现入库，而现在的需求是实时入库，不生成文件了，所以就只能自己用代码实现了，但是网上查了很多资料都没有查到，最后在一个网友的指引下，看了源码，最后找到了生成Hfile
jsp web tomcat 编码问题王新春 tomcat jsp pageEncode
今天配置jsp项目在tomcat上，windows上正常，而linux上显示乱码，最后定位原因为tomcat 的server.xml 文件的配置，添加 URIEncoding 属性： <Connector port="8080" protocol="HTTP/1.1" connectionTi

文本挖掘 情感分析_文本挖掘的情感分析

加载数据 (Loading the data)

探索性数据分析 (Exploratory Data Analysis)

目标变量 (Target variable)

输入变量 (Input variable)

文字清理 (Text Cleaning)

创建测试数据 (Creating test data)

超参数调整和交叉验证 (Hyperparameter tuning and cross-validation)

评估指标 (Evaluation metrics)

GridSearchCV的参数网格 (Parameter grids for GridSearchCV)

分类器 (Classifiers)

CountVectorizer (CountVectorizer)

TF-IDF矢量化器 (TF-IDF Vectorizer)

Word2Vec (Word2Vec)

结论 (Conclusion)

最佳参数 (Best parameters)

评估指标 (Evaluation metrics)

在新推文上应用最佳模型 (Apply the best model on new tweets)

你可能感兴趣的:(大数据,python,机器学习,人工智能,数据分析)

文本挖掘情感分析_文本挖掘的情感分析