Application areas of NLP and text analysis include search engines, sentiment analysis, topic modeling, part-of-speech tagging, named entity recognition, and so on. This section is about how to extract useful information from text data.
# Tokenization splits a text into meaningful tokens, for example splitting a text into individual words or sentences
sample_text = "Are you curious about tokenization? Let's see how it works! We need to analyze\
a couple of sentences with punctuations to see it in action."
from nltk.tokenize import sent_tokenize  # import the sentence tokenizer
print(sent_tokenize(sample_text))  # the example text is split into 3 sentences
from nltk.tokenize import word_tokenize  # split the text into individual words
print(word_tokenize(sample_text))
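Note: sent_tokenize and word_tokenize rely on NLTK's punkt models, and the WordNetLemmatizer used further down relies on the WordNet corpus. If these data packages are not installed yet, a one-time download is needed (a minimal setup sketch, assuming internet access):
import nltk
nltk.download('punkt')    # sentence/word tokenizer models used by sent_tokenize and word_tokenize
nltk.download('wordnet')  # corpus used by WordNetLemmatizer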
# Stemming: words such as work, works, working and worker share the same root meaning, so extracting the common stem is useful
from nltk.stem.porter import PorterStemmer  # Porter stemmer
from nltk.stem.lancaster import LancasterStemmer  # Lancaster stemmer
from nltk.stem.snowball import SnowballStemmer  # Snowball stemmer
sample_words = ['table', 'probably', 'wolves', 'playing', 'is',
                'dog', 'the', 'beaches', 'grounded', 'dreamt', 'envision']
# Compare the differences between the stemmers
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')
stemmers = ['porter','lancaster','snowball']
format_style = '{:>16}' * (len(stemmers)+ 1)
print('\n',format_style.format('word',*stemmers),'\n')
for word in sample_words:
    stemmer_words = [porter.stem(word), lancaster.stem(word), snowball.stem(word)]
    print(format_style.format(word, *stemmer_words))
# The output shows that the Lancaster stemmer is the most aggressive: it is fast, but it strips away so much of a word that the stems can become hard to interpret. In general the Snowball stemmer is the preferred choice (see the usage sketch after the output below).
Output:
            word          porter       lancaster        snowball
           table            tabl            tabl            tabl
        probably         probabl            prob         probabl
          wolves            wolv            wolv            wolv
         playing            play            play            play
              is              is              is              is
             dog             dog             dog             dog
             the             the             the             the
         beaches           beach           beach           beach
        grounded          ground          ground          ground
          dreamt          dreamt          dreamt          dreamt
        envision           envis           envid           envis
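As a small usage sketch (not part of the original run), the Snowball stemmer can be combined with the tokenizer from above to stem every token of the sample text:
# Sketch: stem each token of sample_text with the Snowball stemmer
tokens = word_tokenize(sample_text)
print([snowball.stem(token) for token in tokens])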
# Lemmatization: reduce each word to its base form (lemma), e.g. wolves is reduced to the noun lemma wolf
from nltk.stem import WordNetLemmatizer
lemmatizers = ['noun lemmatizer','verb lemmatizer']
lemmatizer_format = '{:>24}' * (len(lemmatizers) + 1)
wordnet_lemmatizer = WordNetLemmatizer()
print('\n', lemmatizer_format.format('word', *lemmatizers), '\n')
for word in sample_words:
    lemmatizer_words = [wordnet_lemmatizer.lemmatize(word, pos='n'),
                        wordnet_lemmatizer.lemmatize(word, pos='v')]  # the pos argument specifies the part of speech
    print(lemmatizer_format.format(word, *lemmatizer_words))
Output:
                    word         noun lemmatizer         verb lemmatizer
                   table                   table                   table
                probably                probably                probably
                  wolves                    wolf                  wolves
                 playing                 playing                    play
                      is                      is                      be
                     dog                     dog                     dog
                     the                     the                     the
                 beaches                   beach                   beach
                grounded                grounded                  ground
                  dreamt                  dreamt                   dream
                envision                envision                envision
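Choosing pos='n' or pos='v' by hand is tedious; a common recipe (a sketch only, assuming NLTK's averaged_perceptron_tagger data has been downloaded) is to let nltk.pos_tag choose the part of speech and map the Penn Treebank tag onto a WordNet tag before lemmatizing:
import nltk
nltk.download('averaged_perceptron_tagger')  # one-time download of the POS tagger model

def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS; fall back to noun
    if tag.startswith('J'):
        return 'a'
    if tag.startswith('V'):
        return 'v'
    if tag.startswith('R'):
        return 'r'
    return 'n'

for token, tag in nltk.pos_tag(word_tokenize("The wolves were playing on the beaches")):
    print(token, '->', wordnet_lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag)))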
Bag-of-words model
1: Create a token for every word across the whole document collection, forming a vocabulary
2: Create a feature vector for each document, where each feature value is the number of times the corresponding word occurs in that document
docs = ['the sun is sunning',
'the weather is sweet',
'the sun is shining and the weather is sweet']
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(ngram_range=(1,1))  # the ngram_range argument set to 1-gram (single word) sequences
bag = count_vector.fit_transform(docs)
print(sorted(count_vector.vocabulary_.items(), key=lambda x: x[1]))  # the vocabulary built from the whole document collection
bag.toarray()
# As can be seen, in the first document 'and' (vocabulary index 0) does not appear, so the value at that position is 0, while 'is' (index 1) appears once,
# so the value is 1, and so on (see the sketch after the output below).
# The value of each feature in a feature vector is called the raw term frequency: the number of times that word occurs in the document.
Output:
[('and', 0), ('is', 1), ('shining', 2), ('sun', 3), ('sunning', 4), ('sweet', 5), ('the', 6), ('weather', 7)]
array([[0, 1, 0, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 0, 1, 2, 1]], dtype=int64)
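To make the index-to-word mapping explicit, here is a small sketch (not in the original) that prints each document's non-zero counts as a word: count dictionary; it assumes scikit-learn >= 1.0, where the method is called get_feature_names_out (older versions use get_feature_names):
# Sketch: show each document's term counts by word instead of by column index
feature_names = count_vector.get_feature_names_out()
for doc, row in zip(docs, bag.toarray()):
    print(doc, '->', {feature_names[i]: int(n) for i, n in enumerate(row) if n > 0})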
count_vector = CountVectorizer(ngram_range=(2,2))  # 2-gram sequences
bag = count_vector.fit_transform(docs)
print(sorted(count_vector.vocabulary_.items(), key=lambda x: x[1]))  # the vocabulary built from 2-gram sequences
bag.toarray()
Output:
[('and the', 0), ('is shining', 1), ('is sunning', 2), ('is sweet', 3), ('shining and', 4), ('sun is', 5), ('the sun', 6), ('the weather', 7), ('weather is', 8)]
array([[0, 0, 1, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 1],
       [1, 1, 0, 1, 1, 1, 1, 1, 1]], dtype=int64)
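ngram_range can also mix lengths, e.g. (1, 2) keeps both unigrams and bigrams in one vocabulary (a small sketch, not part of the original output):
# Sketch: combine 1-grams and 2-grams in a single vocabulary
count_vector_12 = CountVectorizer(ngram_range=(1, 2))
bag_12 = count_vector_12.fit_transform(docs)
print(len(count_vector_12.vocabulary_))  # number of unigram plus bigram features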
If a word appears too frequently across the documents, it is unlikely to carry much discriminative information; for that there is another technique, tf-idf (term frequency - inverse document frequency):
tf-idf = term frequency × inverse document frequency
idf(d, t) = log( n_d / (1 + df(d, t)) ), where n_d is the total number of documents and df(d, t) is the number of documents that contain the term t
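As a quick hand check of the formula above (not part of the original code): 'sunning' appears in only 1 of the 3 example documents, so its idf is log(3 / (1 + 1)) = log(1.5) ≈ 0.405. Note that TfidfVectorizer's defaults differ slightly (smooth_idf=True adds 1 to both counts and to the result, and each row is L2-normalized), so the output below does not come directly from this textbook formula.
import math
# Hand check of idf for 'sunning': n_d = 3 documents, df = 1 document contains it
print(round(math.log(3 / (1 + 1)), 3))  # 0.405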
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
tfidf = TfidfVectorizer()
np.set_printoptions(precision=2)
tfidf.fit_transform(docs).toarray()
Output:
array([[0.  , 0.39, 0.  , 0.5 , 0.66, 0.  , 0.39, 0.  ],
       [0.  , 0.43, 0.  , 0.  , 0.  , 0.56, 0.43, 0.56],
       [0.39, 0.46, 0.39, 0.3 , 0.  , 0.3 , 0.46, 0.3 ]])
That covers the basics of text processing; later posts will use these techniques to work with text, for example for sentiment analysis.