How do you become a good NLP engineer? It's not all about training! Many people's models perform well on the training set yet poorly on the test set, and some cannot even fit the training set. The first thing a good NLP engineer does on a new project is not to train a model but to look at the data: what does it contain, which parts are likely the features the model needs most, and what unimportant information can be dropped? Data preprocessing is an essential skill for NLP engineers, and indeed for much of the deep learning field.
Tokenisation is one of the most commonly used techniques in NLP. Today's models cannot obtain an embedding directly from a raw sentence: everything from the earlier word2vec to the wildly popular BERT needs a tokenizer first. Here is the simplest tokenization code:
# download the punkt tokenizer data
import nltk
nltk.download('punkt')
from nltk import word_tokenize

sentence = " I love python! "
sentence = ' '.join(word_tokenize(sentence))
print(sentence)  # I love python !
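Compared with a plain str.split(), word_tokenize also detaches punctuation and splits English contractions, which is what a downstream vocabulary expects. A quick check:
print(word_tokenize("You don't love python?"))
# ['You', 'do', "n't", 'love', 'python', '?']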
Word casing is another factor to consider: a vocabulary that distinguishes upper from lower case can introduce unnecessary noise. When casing carries no useful information for your model, you can simply lowercase everything:
text = "I love python"
text = text.lower()
# expected result: i love python
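One caveat worth knowing: lower() is enough for English, but if your data may contain other languages, Python's str.casefold() performs a more aggressive case normalisation (it maps the German ß to ss, for example):
text = "Straße"
print(text.lower())     # straße
print(text.casefold())  # strasse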
If casing does matter for your task but the text does not come correctly cased, for example country names, place names, and person names such as China or Dave, a truecaser can restore it:
# pip install truecase
import truecase

text = "i love python"
print(truecase.get_true_case(text))  # expected: I love python
Stop words are words that appear in large numbers everywhere; precisely because they are so frequent, they usually carry no important information and are typically removed:
from nltk.corpus import stopwords
nltk.download('stopwords')
stopwords_list = stopwords.words('english')

text = "What? You don't love python?"
words = text.split()
# iterate over a copy: removing items from the list being
# looped over would silently skip elements
for word in list(words):
    if word.lower() in stopwords_list:
        words.remove(word)
print(words)
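Mutating a list while iterating over it is a classic Python pitfall (hence the copy above); a list comprehension expresses the same filter more idiomatically:
text = "What? You don't love python?"
filtered = [w for w in text.split() if w.lower() not in stopwords_list]
# punctuation keeps 'What?' from matching the stop list,
# which is one more reason to tokenize before filtering
print(filtered)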
The standard English explanations of Stemming and Lemmatisation are as follows:
Stemming
In the case of stemming, we want to normalise all words to their stem (or root). The stem is the part of the word to which affixes (suffixes or prefixes) are attached. Stemming a word may result in the word not actually looking like a word. For example, some stemming algorithms may stem trouble, troubling, troubled as troubl.
Lemmatisation
Lemmatisation attempts to reduce tokens to a word that belongs in the language. The basic form of the word is called a lemma, and is the canonical form of a set of words. For example, runs, running, ran are all forms of the word run.
In plain terms, stemming extracts and keeps a word's stem, which is not necessarily an interpretable word on its own, while lemmatisation reduces a word to its shortest complete form, e.g. running becomes run. If tense and inflection don't matter for your task, try lemmatisation; if the bare stem is enough, stemming will do. Note that current implementations of both are still imperfect: they rely on trained models or statistical rules, so some error rate is unavoidable.
Stemming code:
# STEMMING
import nltk
from nltk.stem import PorterStemmer
porter = PorterStemmer()
stemming_word_list = ["friend", "friendship", "friends", "friendships",
                      "stabil", "destabilize", "misunderstanding",
                      "railroad", "moonlight", "football"]
print("{0:20}{1:20}".format("Word", "Stemmed variant"))
print()
for word in stemming_word_list:
    print("{0:20}{1:20}".format(word, porter.stem(word)))
'''
Output:
Word Stemmed variant
friend friend
friendship friendship
friends friend
friendships friendship
stabil stabil
destabilize destabil
misunderstanding misunderstand
railroad railroad
moonlight moonlight
football footbal
'''
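Porter is only one of several stemmers shipped with NLTK; the Snowball stemmer (sometimes called "Porter2") is generally regarded as a mild improvement and also supports languages other than English. A minimal sketch (the example words are illustrative):
from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english")
for word in ["running", "fairly", "generously"]:
    print("{0:20}{1:20}".format(word, snowball.stem(word)))
# e.g. running -> run, fairly -> fair, generously -> generous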
Lemmatisation code:
# LEMMATISATION
import nltk
from nltk.stem import WordNetLemmatizer
import string
nltk.download('wordnet')
nltk.download('omw-1.4')  # recent NLTK versions also need this corpus
wordnet_lemmatizer = WordNetLemmatizer()
to_lemmatize_sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
# lemmatisation requires punctuation removal
to_lemmatize_sentence = "".join([c for c in to_lemmatize_sentence if c not in string.punctuation])
to_lemmatize_sentence = to_lemmatize_sentence.split(" ")
print("{0:20}{1:20}".format("Word","Lemma"))
print()
for word in to_lemmatize_sentence:
    print("{0:20}{1:20}".format(word, wordnet_lemmatizer.lemmatize(word)))
'''
Output:
Word Lemma
He He
was wa
running running
and and
eating eating
at at
same same
time time
He He
has ha
bad bad
habit habit
of of
swimming swimming
after after
playing playing
long long
hours hour
in in
the the
Sun Sun
'''
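Two oddities in this output, was → wa and has → ha, are not corpus bugs but a consequence of WordNetLemmatizer defaulting to noun mode, so it merely strips what looks like a plural "s". Supplying the part of speech fixes this; a minimal sketch with hand-labelled tags (in practice you would derive them from a POS tagger such as nltk.pos_tag):
# pos='v' tells the lemmatizer to treat the token as a verb
print(wordnet_lemmatizer.lemmatize("was", pos="v"))      # be
print(wordnet_lemmatizer.lemmatize("has", pos="v"))      # have
print(wordnet_lemmatizer.lemmatize("running", pos="v"))  # run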