Sentiment Analysis from Scratch with Logistic Regression

Years ago, it was impossible for machines to perform text translation, text summarization, speech recognition, and similar tasks. Applications such as question answering systems or chatbots seemed like magic before the rise of machine learning, and especially natural language processing (NLP), a subfield of machine learning that deals with language and aims to let machines understand and interpret it at a human level. One of the hottest applications of NLP is sentiment analysis, which lets us classify a text, tweet, or comment as positive, neutral, or negative. For example, to evaluate people's satisfaction with a specific product, we can apply sentiment analysis to reviews and compute the percentage of positive and negative ones.

In this tutorial we'll do exactly that: build a sentiment classifier from scratch based on logistic regression and train it on a corpus of tweets. Along the way we'll cover:

Text processing

Features extraction

Sentiment classifier

Training & evaluating the sentiment classifier

Text processing

First, we'll use the Natural Language Toolkit (NLTK), an open-source Python library with a bunch of functions for processing textual data. It also ships with a Twitter dataset that we'll work on:

import nltk
import numpy as np
from nltk.corpus import twitter_samples
# nltk.download('twitter_samples')  # download the corpus on first use
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
example_positive_tweet = positive_tweets[0]
example_negative_tweet = negative_tweets[0]
# 80/20 split: 4000 tweets of each class for training, the rest for testing
train_pos = positive_tweets[:4000]
test_pos = positive_tweets[4000:]
train_neg = negative_tweets[:4000]
test_neg = negative_tweets[4000:]
train_x = train_pos + train_neg
test_x = test_pos + test_neg
# labels: 1 for positive, 0 for negative
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)

The Python code above gives us a list of positive tweets and a list of negative tweets. We've divided the dataset into train_x, test_x, train_y and test_y, with 80% for training and 20% for testing. These tweets contain a lot of irrelevant information like hashtags, mentions, stop words, etc. Data cleaning, or data preprocessing, is a key step in any data science workflow, since it prepares the data for training a classification algorithm. In the context of NLP, text processing includes:

Tokenization: the operation of splitting a sentence into a list of words (tokens).

Removing stop words: stop words are words that occur frequently in a text without adding semantic value to it.

Removing punctuation: marks such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (the set found in string.punctuation).

Stemming: the process of reducing a word to its stem by stripping suffixes and prefixes; for example, "learning" and "learned" are both reduced to "learn".

We'll implement all of these operations in a single Python function that processes each tweet before feeding it into our classifier:

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
# nltk.download('stopwords')  # download the stop word list on first use

def text_process(tweet):
    # remove the old-style retweet marker "RT", URLs and the '#' sign
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    # tokenization
    tokenizer = TweetTokenizer()
    tweet_tokenized = tokenizer.tokenize(tweet)
    # remove stop words and punctuation
    stopwords_english = stopwords.words('english')
    tweet_processed = [word for word in tweet_tokenized
                       if word not in stopwords_english
                       and word not in string.punctuation]
    # stemming
    stemmer = PorterStemmer()
    tweet_after_stem = []
    for word in tweet_processed:
        word = stemmer.stem(word)
        tweet_after_stem.append(word)
    return tweet_after_stem
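
As a quick sanity check, we can run text_process on the example tweet we stored earlier. The exact tokens depend on the tweet, but the result is a list of stemmed tokens with the retweet marker, URLs, hashtag signs, stop words and punctuation removed:

print(example_positive_tweet)                # the raw tweet text
print(text_process(example_positive_tweet))  # the cleaned, stemmed token list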

Features extraction

After text processing, it's time for feature extraction. Computers don't deal with raw text; they only understand numbers, so we need to transform each tweet into a vector that can be fed into our logistic regression function. There are many methods for representing text as vectors, and the right technique depends on the problem we're trying to solve. In our case we're doing binary classification: labeling a tweet as either positive or negative. Intuitively, some words such as "happy" or "good" occur more often in positive tweets, and likewise some words are more frequent in negative tweets.

(Figure: positive and negative frequencies of words)

As the figure shows, for each word we count how many times it occurs in the positive tweets and how many times it occurs in the negative tweets. To represent that, we'll build a dictionary whose keys are (word, class) pairs, where the class is 1 for positive and 0 for negative, and whose values are the counts in each class. For example, if the word "happy" occurred 12 times in the positive tweets and 5 times in the negative ones, we'd have something like this:

freqs_dict = {("happy", 1): 12, ("happy", 0): 5}

To implement this, we'll build a first dictionary containing the frequencies of the words in the positive tweets and a second dictionary containing the frequencies of the words in the negative tweets, and then we'll combine the two dictionaries.

pos_words = []
for tweet in positive_tweets:
    tweet = text_process(tweet)
    for word in tweet:
        pos_words.append(word)

# frequency of each word in the positive tweets, keyed by (word, 1)
freq_pos = {}
for word in pos_words:
    if (word, 1) not in freq_pos:
        freq_pos[(word, 1)] = 1
    else:
        freq_pos[(word, 1)] = freq_pos[(word, 1)] + 1

neg_words = []
for tweet in negative_tweets:
    tweet = text_process(tweet)
    for word in tweet:
        neg_words.append(word)

# frequency of each word in the negative tweets, keyed by (word, 0)
freq_neg = {}
for word in neg_words:
    if (word, 0) not in freq_neg:
        freq_neg[(word, 0)] = 1
    else:
        freq_neg[(word, 0)] = freq_neg[(word, 0)] + 1

# merge the two dictionaries into a single frequency dictionary
freqs_dict = dict(freq_pos)
freqs_dict.update(freq_neg)
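
Once freqs_dict is built, we can look up how often a given (word, class) pair occurs; using .get with a default of 0 avoids a KeyError for words that never appear in a class. Note that, because of stemming, a word like "happy" is stored under its stem ("happi" with the Porter stemmer):

print(freqs_dict.get(('happi', 1), 0))  # count of the stem "happi" in positive tweets
print(freqs_dict.get(('happi', 0), 0))  # count of the stem "happi" in negative tweets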

Back to feature extraction: now that this dictionary is built, we convert each tweet into a 3-dimensional vector consisting of a bias term equal to 1, the sum of the positive frequencies of the tweet's words, and the sum of their negative frequencies, as shown below:

(Figure: from tweet to feature vector)

Each vector is the representation of one tweet. Now we need to stack those vectors into a single matrix holding the features of all tweets. Since we have 8,000 training tweets and each one is represented as a 3-dimensional vector, the shape of our training matrix X will be (8000, 3):

import numpy as np

def features_extraction(tweet, freqs_dict):
    word_l = text_process(tweet)
    x = np.zeros((1, 3))
    x[0, 0] = 1  # bias term
    for word in word_l:
        # add the word's positive frequency, defaulting to 0 if it never appears
        x[0, 1] += freqs_dict.get((word, 1), 0)
        # add the word's negative frequency, defaulting to 0 if it never appears
        x[0, 2] += freqs_dict.get((word, 0), 0)
    assert x.shape == (1, 3)
    return x

# build the feature matrix for the training set
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = features_extraction(train_x[i], freqs_dict)
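
As a quick check on the feature matrix, we can print its shape and a sample row; the exact frequency sums depend on the corpus and the preprocessing:

print(X.shape)  # (8000, 3): one row per training tweet
print(X[0])     # [1., <sum of positive counts>, <sum of negative counts>]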

Sentiment classifier

To build the sentiment classifier, instead of using a library like scikit-learn, we are going to implement a logistic regression classifier from scratch:

(Figure: curve of the sigmoid function)

The logistic (sigmoid) function, sigmoid(x) = 1 / (1 + e^(-x)), takes the features as input and outputs the probability that a tweet is positive. If the output is greater than or equal to 0.5, we classify the tweet as positive; otherwise, we classify it as negative.

(Figure: the logistic regression cost function)

The idea behind classification with logistic regression is to minimize the cost function, which measures how far the outputs predicted by the sigmoid function are from the real labels. We update the weights with the gradient descent algorithm until the cost is minimized, thus obtaining the optimal weights.
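
For reference, these are the binary cross-entropy cost and the weight update rule that the gradientDescent_algo function below implements, written in LaTeX notation, where m is the number of training examples, h is the vector of sigmoid predictions and alpha is the learning rate:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h^{(i)} + (1 - y^{(i)})\log(1 - h^{(i)}) \right], \qquad h^{(i)} = \mathrm{sigmoid}(x^{(i)}\theta)

\theta := \theta - \frac{\alpha}{m} X^{T}(h - y)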

def sigmoid(x):
    # logistic function: maps any real value into (0, 1)
    h = 1 / (1 + np.exp(-x))
    return h

def gradientDescent_algo(x, y, theta, alpha, num_iters):
    m = x.shape[0]
    for i in range(0, num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        # binary cross-entropy cost
        J = -1/m * (np.dot(y.T, np.log(h)) + np.dot((1 - y).T, np.log(1 - h)))
        # gradient descent update of the weights
        theta = theta - (alpha/m) * np.dot(x.T, h - y)
    J = float(J)
    return J, theta

Training & evaluating the sentiment classifier

With gradient descent implemented, we can now train the model to compute the optimal weights theta:

X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = features_extraction(train_x[i], freqs_dict)
Y = train_y
J, theta = gradientDescent_algo(X, Y, np.zeros((3, 1)), 1e-9, 1500)
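
We can inspect the final cost and the learned weights. The exact values depend on the learning rate and the number of iterations, but for a working classifier the weight on the positive-count feature should come out positive and the weight on the negative-count feature negative:

print(f"final cost: {J}")
print(f"weights: {theta.ravel()}")  # [bias, weight for positive counts, weight for negative counts]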

It's time to test our sentiment classifier and evaluate how it performs on test_x:

def predict(tweet, freqs_dict, theta):
    # extract the tweet's features and return the predicted probability of it being positive
    x = features_extraction(tweet, freqs_dict)
    y_pred = sigmoid(np.dot(x, theta))
    return y_pred

def test_accuracy(test_x, test_y, freqs_dict, theta):
    y_hat = []
    for tweet in test_x:
        y_pred = predict(tweet, freqs_dict, theta)
        # threshold the probability at 0.5
        if y_pred > 0.5:
            y_hat.append(1)
        else:
            y_hat.append(0)
    m = len(y_hat)
    y_hat = np.array(y_hat).reshape(m)
    test_y = test_y.reshape(m)
    # fraction of predictions that match the true labels
    c = y_hat == test_y
    j = 0
    for i in c:
        if i:
            j = j + 1
    accuracy = j / m
    return accuracy

accuracy = test_accuracy(test_x, test_y, freqs_dict, theta)

We get over 98% accuracy; our model performs almost perfectly!
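
Finally, the predict function can be reused on any new text. The tweet below is just a hypothetical example; its probability is interpreted against the same 0.5 threshold:

my_tweet = "I loved this movie, what a great day :)"  # hypothetical example tweet
p = predict(my_tweet, freqs_dict, theta)[0, 0]         # scalar probability of being positive
print(p)
print("positive" if p > 0.5 else "negative")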

Note: This blog post is based on the new NLP specialization at DeepLearning.ai.

Translated from: https://medium.com/swlh/sentiment-analysis-from-scratch-with-logistic-regression-ca6f119256ab
