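Introduction

Starting from the deeplearning.ai courses, I am trying to pick up the NLP skills I let lapse for three years.
Coursera course link

Setting up a jupyter + vscode learning environment

Start with Why

Why vscode?

I would love to be lazy and just say "those who use it know, those who don't are missing out", but to reassure anyone who still has doubts, I should at least explain in my own words why this tool is so good.

vscode is the Swiss Army knife of programming. If there is something it really cannot do, you probably just have not found the right extension yet.

So I am not going to write a detailed list of vscode features. Anyone who has used it knows that once you start, you end up relying on it not only for code but for everyday writing as well.

Right now I use vscode for front-end and back-end development, debugging, note taking, remote login to servers, the command line, deploying code and blog posts, and more. I expect it to become the entry point for even more of my work, which is exactly why I now want to bring Jupyter into it.

I used Jupyter Notebook a few years ago when I first started learning machine learning, but gradually used it less (my Mac did not have the compute for machine learning), and eventually I forgot even how to set up the environment.

Lately I have wanted to get my NLP skills back, hence this idea. A quick search confirmed there is a ready-made solution, so let's get started.

Step 1: Set up a local Jupyter server

Jupyter Notebook is a very good tool for learning AI programming.

First, Jupyter combined with Anaconda lets you develop in separate package environments. Machine learning work usually pulls in a large number of libraries, and this saves you from wrestling with conflicting package versions and runtimes.

Second, with Jupyter you can take notes while watching the program's output, without switching windows. When you finish a study session, the notes are ready to publish and share, which helps keep you motivated.

So I strongly recommend setting up a Jupyter Notebook environment on your own computer. There are plenty of tutorials online, so I will not repeat them here.

First, install a Python 3 environment yourself

Download and install Anaconda (what on earth is that?)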


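I remember installing Anaconda from the command line in the past; now you just download the installer and run it. Fine, so let me briefly explain why to use Anaconda and what to watch out for.

Anaconda solves the problem of keeping runtime environments consistent: you can configure a separate, isolated environment for each application. (That is the one-sentence version.)

If that sentence still does not click, pull a few Python projects from GitHub and try to run them without Anaconda. You will quickly see why it exists.

Once it is installed, start Jupyter Notebook by running

jupyter notebook

When the command succeeds, the system opens localhost:8888 in your browser. Since we are going to work inside vscode, you can just close that tab.

Step 2: Configure vscode

Install the Jupyter extension from the extension marketplace, then open the Command Palette (Shift+Command+P)

Run Jupyter: Create New Blank Jupyter Notebook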


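You can start working right away. The new notebook shows the same information as the web interface, such as the connected local server and whether the Python 3 kernel is busy.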
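A quick first cell can confirm which interpreter the notebook is actually using, which is handy when several Anaconda environments are installed. This is only a minimal sketch; the helper name describe_kernel is my own and is not part of Jupyter or vscode.

```python
import platform
import sys


def describe_kernel():
    """Collect basic facts about the interpreter running this cell."""
    return {
        "executable": sys.executable,           # path of the Python binary in use
        "version": platform.python_version(),   # full version string
        "major": sys.version_info.major,        # 3 for any Python 3 kernel
    }


info = describe_kernel()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the executable path does not point into the environment you expected, pick a different kernel in vscode before installing any packages.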


References

Working with Jupyter Notebooks in Visual Studio Code
Install and Use — Jupyter Documentation 4.1.1 alpha documentation

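NLP Basics 01 - Data preprocessing

Preprocessing the data
Processing the dataset with NLTK

Importing packages

NLTK (http://www.nltk.org/) is a natural language toolkit. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning, and wrappers for industrial-strength NLP libraries.

It suits linguists, engineers, students, educators, researchers and industry users. NLTK is available for Windows, Mac OS X and Linux. Best of all, NLTK is a free, open-source, community-driven project.

The book Natural Language Processing with Python is a practical introduction to programming for language processing. Written by the creators of NLTK, it guides the reader through the basics of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more. The online version of the book has been updated for Python 3 and NLTK 3. (The original Python 2 edition is still available at http://nltk.org/book_1ed.)

The task is sentiment analysis on tweets, that is, deciding whether each tweet is positive, negative or neutral.
NLTK ships with a sample Twitter dataset that can be used directly.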

import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import random

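About the Twitter dataset

This NLTK dataset has already split the tweets into positive and negative, 5,000 of each. The tweets come from real data, but this balanced split is artificial.

On my machine, nltk.download('twitter_samples') failed with: Errno 61 Connection refused

So the data has to be installed by hand from the command line (working in a terminal inside the same window is one of vscode's strengths):

Download the zip from nltk/nltk_data: NLTK Data
Unzip it and rename the folder to nltk_data
If the next step raises an error, read the message and move the folder into one of the directories the program searches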
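Before moving folders around, it can help to check whether the corpus is already somewhere NLTK would look. The sketch below uses only the standard library and checks just two locations that I believe are on NLTK's search path: the NLTK_DATA environment variable and ~/nltk_data. The authoritative list is nltk.data.path, and the function names here are my own.

```python
import os
from pathlib import Path


def candidate_data_dirs():
    """Return directories to check, in order (an assumed, partial list)."""
    dirs = []
    # NLTK_DATA may hold several paths separated by os.pathsep.
    for entry in os.environ.get("NLTK_DATA", "").split(os.pathsep):
        if entry:
            dirs.append(Path(entry))
    dirs.append(Path.home() / "nltk_data")
    return dirs


def find_corpus(name, base_dirs):
    """Return the first existing corpora/<name> folder or zip, else None."""
    for base in base_dirs:
        for candidate in (base / "corpora" / name,
                          base / "corpora" / f"{name}.zip"):
            if candidate.exists():
                return candidate
    return None


location = find_corpus("twitter_samples", candidate_data_dirs())
print(location if location else "twitter_samples not found in the checked directories")
```

If nothing is found, the LookupError message raised by NLTK lists every directory it searched, which is the most reliable guide for where to put the folder.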

from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Load the data with the strings() function

all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

Now let's take a first look at the data. Doing this before any real processing is an important habit.

print('Number of positive tweets: ', len(all_positive_tweets))
print('Number of negative tweets: ', len(all_negative_tweets))

print('\nThe type of all_positive_tweets is: ', type(all_positive_tweets))
print('\nThe type of all_negative_tweets is: ', type(all_negative_tweets))
print('\nThe type of a tweet entry is: ', type(all_negative_tweets[0]))

Number of positive tweets:  5000
Number of negative tweets:  5000

The type of all_positive_tweets is:  <class 'list'>

The type of all_negative_tweets is:  <class 'list'>

The type of a tweet entry is:  <class 'str'>

As the output shows, the two JSON files have been loaded as lists, and each tweet is a string.

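You can also draw a pie chart with the pyplot library to describe the data above (a little data visualization never hurts).

For pyplot usage, see Basic pie chart (Matplotlib 3.3.3 documentation)

# Set the figure size
fig = plt.figure(figsize=(5, 5))

# Labels for the two classes
labels = 'Positive', 'Negative'

# Size of each slice
sizes = [len(all_positive_tweets), len(all_negative_tweets)]

# Draw the pie: slice sizes, labels, percentages with one decimal place, shadow,
# and a start angle of 90 degrees so the two equal halves split vertically
plt.pie(sizes, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)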

plt.axis('equal')

plt.show()
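The autopct argument is an ordinary %-format string that matplotlib applies to each slice's percentage of the total. The sketch below reproduces that calculation without matplotlib so the labels can be checked by hand; slice_labels is a name I made up for illustration.

```python
def slice_labels(sizes, fmt="%1.1f%%"):
    """Format each slice as a percentage of the total, as a pie chart label."""
    total = sum(sizes)
    return [fmt % (100.0 * size / total) for size in sizes]


print(slice_labels([5000, 5000]))  # ['50.0%', '50.0%']
print(slice_labels([3, 1]))        # ['75.0%', '25.0%']
```

With 5,000 tweets per class the chart is an even split, which is a reminder that the class balance here was set by the dataset authors rather than observed in the wild.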

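Looking at the raw text

To see what the real data looks like, the code below prints one random positive tweet and one random negative tweet in different colors.

# Positive tweet, in green
# randint includes both endpoints, so the upper bound must be 4999, not 5000
print('\033[92m' + all_positive_tweets[random.randint(0, 4999)])

# Negative tweet, in red
print('\033[91m' + all_negative_tweets[random.randint(0, 4999)])

@JayHorwell Hi Jay, if you haven't received it yet please email our events team at [email protected] and they'll sort it :)
@pickledog47 @FoxyLustyGrover Its Kate, tho!!  :(  #sniff

The data clearly contains quite a few emoticons and URLs, which the later processing steps need to take into account.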
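Two details of the printing code are easy to trip over: random.randint includes both endpoints, so an upper bound equal to the list length can occasionally raise IndexError, and an ANSI color code stays active until it is reset. A minimal sketch of a safer variant follows; the helper names and the two stand-in tweets are mine, not from the dataset.

```python
import random

GREEN = "\033[92m"
RED = "\033[91m"
RESET = "\033[0m"   # returns the terminal to its default color


def colorize(text, color):
    """Wrap text in an ANSI color code and reset afterwards."""
    return f"{color}{text}{RESET}"


def sample_one(items):
    """Pick one element; random.choice can never index past the end."""
    return random.choice(items)


tweets = ["great day :)", "so tired :("]   # stand-in data for illustration
print(colorize(sample_one(tweets), GREEN))
print(colorize(sample_one(tweets), RED))
```

random.randrange(len(items)) would work equally well if the index itself is needed.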

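Preprocessing the raw text

Data preprocessing is a key step in any machine learning project. It covers cleaning and formatting the data. For NLP, the main tasks are:

Tokenization
Handling upper and lower case
Removing stop words and punctuation
Stemming (specific to languages like English)

# Pick a fairly complex tweet
tweet = all_positive_tweets[2277]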

print(tweet)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

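import re                                     # regular expressions
import string                                 # string constants such as punctuation

from nltk.corpus import stopwords             # NLTK stop word lists (there appears to be no Chinese list)
from nltk.stem import PorterStemmer           # stemming
from nltk.tokenize import TweetTokenizer      # tokenizer designed for tweets

Removing hyperlinks, Twitter marks and formatting

Tweets, much like Weibo posts, are full of platform-specific strings such as '@', '#' and URLs.
We remove them with regular expressions from the re library, using sub() to replace each match with an empty string. (The @ handles are stripped by the tokenizer in the next step.)

For Python regular expressions, see: Python Regular Expressions | Runoob (菜鸟教程)
You can also debug a regular expression directly in vscode's search box

print('\033[92m' + tweet)
print('\033[94m')

# Remove a leading "RT" followed by whitespace, i.e. old-style retweets
tweet2 = re.sub(r'^RT[\s]+', '', tweet)

# Remove hyperlinks (.* is greedy, so everything from the URL to the end of the line is removed)
tweet2 = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet2)

# Remove only the '#' sign and keep the hashtag word
tweet2 = re.sub(r'#', '', tweet2)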

print(tweet2)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…
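One thing worth knowing about the hyperlink pattern: because .* is greedy, it deletes everything from the URL to the end of the line. That is harmless for this tweet, where the link comes last, but it drops text when a link sits mid-sentence. The sketch below compares it with a narrower pattern that stops at the first whitespace; the alternative pattern, the helper and the sample sentence are my own illustration, not part of the course code.

```python
import re

# Same behavior as the pattern in the post (the escaped slashes are optional).
GREEDY_URL = re.compile(r"https?://.*[\r\n]*")
# Alternative: match only the URL itself, up to the next whitespace.
TOKEN_URL = re.compile(r"https?://\S+")


def strip_urls(text, pattern):
    """Remove URL matches, then collapse the leftover runs of whitespace."""
    without_urls = pattern.sub("", text)
    return " ".join(without_urls.split())


sample = "Check https://t.co/abc this out #nlp"   # made-up example text
print(strip_urls(sample, GREEDY_URL))  # Check
print(strip_urls(sample, TOKEN_URL))   # Check this out #nlp
```

Which one is right depends on the data: if links only ever appear at the end of a tweet, the two behave the same.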

Let's first try tokenizing directly and see what comes out

print()
print('\033[92m' + tweet2)
print('\033[94m')

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

tweet_tokens = tokenizer.tokenize(tweet2)

print()
print('Tokenized string:')
print(tweet_tokens)

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off…


Tokenized string:
['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']
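The three TweetTokenizer options do most of the normalization here: preserve_case=False lowercases, strip_handles=True drops @usernames, and reduce_len=True shortens runs of three or more identical characters to exactly three. The following is a rough regex approximation of those effects, assuming a simplified handle pattern; it is not NLTK's implementation and it does not split emoticons or punctuation the way the real tokenizer does.

```python
import re

HANDLE = re.compile(r"@\w+")           # simplified handle pattern (assumption)
LONG_RUN = re.compile(r"(.)\1{2,}")    # any character repeated 3 or more times


def rough_normalize(text):
    """Approximate the effect of the three tokenizer options on raw text."""
    text = text.lower()                     # roughly preserve_case=False
    text = HANDLE.sub("", text)             # roughly strip_handles=True
    text = LONG_RUN.sub(r"\1\1\1", text)    # roughly reduce_len=True
    return " ".join(text.split())


print(rough_normalize("@nltk_fan Sooooo HAPPYYYY today"))  # sooo happyyy today
```

Shortening repeated letters keeps "sooo" distinct from "so" while preventing every spelling variant from becoming its own token.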

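Removing stop words and punctuation

Stop words are common words that carry little meaning on their own. When I generated word clouds in the past, Chinese words such as “的” ("of") and “那么” ("so") always dominated, so it is best to remove such words before processing.
English has its own list, shown in the output of the next step.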

stopwords_english = stopwords.words('english')

print('Stop words\n')
print(stopwords_english)

print('\nPunctuation\n')
print(string.punctuation)

Stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

Punctuation

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

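The stop word list above contains some words that may well matter, for example "i", "not", "between", "won" and "against".

Depending on the goal of the analysis, you may need to edit the stop word list. The nltk_data folder downloaded earlier contains a stopwords folder, and the file named english in it is this list.
In this exercise we use the whole list as is.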
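If negations matter for your analysis, one option is to subtract a small keep-list from the stop words before filtering. The sketch below uses a tiny stand-in stop list so it runs without NLTK; in practice the base set would come from stopwords.words('english'). The keep-list and function names are hypothetical.

```python
# Tiny stand-in for the NLTK list, for illustration only.
BASE_STOPWORDS = {"i", "am", "my", "on", "a", "not", "no", "off"}
# Hypothetical choice of words to keep for sentiment work.
KEEP_FOR_SENTIMENT = {"not", "no"}


def customize_stopwords(base, keep):
    """Return the stop word set with the words in `keep` removed from it."""
    return set(base) - set(keep)


def remove_stopwords(tokens, stop_words):
    """Filter tokens, preserving their order."""
    return [token for token in tokens if token not in stop_words]


tokens = ["i", "am", "not", "happy"]
custom = customize_stopwords(BASE_STOPWORDS, KEEP_FOR_SENTIMENT)
print(remove_stopwords(tokens, BASE_STOPWORDS))  # ['happy']
print(remove_stopwords(tokens, custom))          # ['not', 'happy']
```

With the full list, "i am not happy" collapses to just "happy", which is the opposite of what the writer meant. Using a set also makes each membership test constant time instead of a scan through a list.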

Now filter the stop words and punctuation out of the token list

print()
print('\033[92m')
print(tweet_tokens)
print('\033[94m')

tweets_clean = []

for word in tweet_tokens:
    if (word not in stopwords_english and
        word not in string.punctuation):
        tweets_clean.append(word)

print('removed stop words and punctuation:')
print(tweets_clean)

['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']

removed stop words and punctuation:
['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
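It is worth noticing why ':)' and '…' survive the filter. string.punctuation is a single string, so `word not in string.punctuation` is a substring test: ':)' is kept because ':' and ')' are not adjacent in that string, and '…' is kept because it is not an ASCII character. The sketch below contrasts that with a stricter per-character check; is_punctuation is my own helper, shown only to make the difference visible.

```python
import string

PUNCT_CHARS = set(string.punctuation)


def is_punctuation(token):
    """True only if the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


for token in ["!", "()", ":)", "…"]:
    # Columns: token, substring test used in the post, stricter per-character test
    print(repr(token), token in string.punctuation, is_punctuation(token))
```

For sentiment analysis the substring behavior is arguably the useful one, since the stricter check would throw away emoticons such as ':)' that carry most of the signal.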

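Stemming

This matters in particular when processing English. Consider

learn
learning
learned
learnt

All of these share the root learn, but the stem that gets extracted is not always a dictionary word. Take happy as an example:

happy
happiness
happier

Here we want the stem happi rather than happ, because happ is also the stem of happen.

NLTK has several modules for stemming; we will use PorterStemmer.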

print()
print('\033[92m')
print(tweets_clean)
print('\033[94m')

stemmer = PorterStemmer()

tweets_stem = []

for word in tweets_clean:
    stem_word = stemmer.stem(word)
    tweets_stem.append(stem_word)

print('stemmed words:')
print(tweets_stem)

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']
stemmed words:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
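To get a feel for why stems like happi are not dictionary words, here is a deliberately tiny suffix-stripping sketch. It is a toy with five hand-picked rules, not the Porter algorithm, and it only aims to reproduce the learn and happy examples above; the rule table and function name are my own.

```python
# (suffix, replacement) pairs, tried in order; a toy rule set, NOT Porter's rules.
TOY_RULES = [
    ("iness", "i"),
    ("ier", "i"),
    ("ing", ""),
    ("ed", ""),
    ("y", "i"),
]


def toy_stem(word, min_stem=3):
    """Apply the first matching rule that leaves at least min_stem characters."""
    for suffix, replacement in TOY_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem + replacement
    return word


for word in ["happy", "happiness", "happier", "learning", "learned", "learnt"]:
    print(word, "->", toy_stem(word))
```

All three happy forms map to happi, so they are counted as the same feature. The irregular form learnt is left unchanged, since suffix stripping has no rule for it.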

process_tweet()

The steps above can be wrapped in a file such as utils.py. The process_tweet function used below is one example.
The code for utils.py is at the end of this post.

from utils import process_tweet

tweet = all_positive_tweets[2277]

print()
print('\033[92m')
print(tweet)
print('\033[94m')

tweets_stem = process_tweet(tweet)

print('preprocessed tweet:')
print(tweets_stem)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i
preprocessed tweet:
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']

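Summary

This exercise walked through a typical NLP preprocessing pipeline. Real projects (especially ones involving Chinese) are more complicated, and the steps have to be adjusted to the data at hand.

Save the following as utils.py in the same directory as the ipynb file; the process_tweet() step above only runs once this file is in place.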

import re
import string
import numpy as np


from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer


def process_tweet(tweet):
    """Process tweet function.
    Input:
        tweet: a string containing a tweet
    Output:
        tweets_clean: a list of words containing the processed tweet

    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove hashtags
    # only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)
    # tokenize tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                               reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and  # remove stopwords
                word not in string.punctuation):  # remove punctuation
            # tweets_clean.append(word)
            stem_word = stemmer.stem(word)  # stemming word
            tweets_clean.append(stem_word)

    return tweets_clean


def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
        frequency
    """
    # Convert np array to list since zip needs an iterable.
    # The squeeze is necessary or the list ends up with one element.
    # Also note that this is just a NOP if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in freqs:
                freqs[pair] += 1
            else:
                freqs[pair] = 1
    
    return freqs
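The build_freqs helper is not exercised in this post, so here is a small sketch of the same counting idea that runs without NLTK or NumPy. It is separate from utils.py: simple_tokens is a stand-in for process_tweet, and collections.Counter replaces the manual dictionary update. The two sample tweets are made up.

```python
from collections import Counter


def simple_tokens(tweet):
    """Stand-in for process_tweet: lowercase and split on whitespace."""
    return tweet.lower().split()


def count_pairs(tweets, labels, tokenize=simple_tokens):
    """Count how often each (word, label) pair occurs across all tweets."""
    freqs = Counter()
    for label, tweet in zip(labels, tweets):
        freqs.update((word, label) for word in tokenize(tweet))
    return freqs


tweets = ["happy happy day", "sad day"]   # made-up examples
labels = [1, 0]                           # 1 = positive, 0 = negative
freqs = count_pairs(tweets, labels)
print(freqs[("happy", 1)])  # 2
print(freqs[("day", 0)])    # 1
print(freqs[("happy", 0)])  # 0
```

A Counter returns 0 for pairs it has never seen, which avoids the explicit `if pair in freqs` branch when the table is later used as features.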

ChangeLog

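2021/1/28 17:12:10 Spent two hours on this, stopping here for now. Getting stuck on a programming problem feels a lot like getting stuck in a (rather punishing) game; when that happens, it is best to stop and go do something else for a while...
2021/2/1 16:28:38 Spent another two hours finishing the rest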
