NLP Study Notes

NLP and text analysis are applied in search engines, sentiment analysis, topic modeling, part-of-speech tagging, named entity recognition, and more. This section covers how to extract useful information from text data.

#tokenize: split a piece of text into meaningful tokens, e.g. split a text into individual words or sentences
sample_text = "Are you curious about tokenization? Let's see how it works! We need to analyze \
a couple of sentences with punctuations to see it in action."
from nltk.tokenize import sent_tokenize  # sentence tokenizer
print(sent_tokenize(sample_text))  # the example text is split into 3 sentences
from nltk.tokenize import word_tokenize  # split the text into individual words
print(word_tokenize(sample_text))
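Note: sent_tokenize and word_tokenize rely on NLTK's punkt tokenizer models; if they have not been downloaded yet, a one-time setup is needed (a minimal sketch):
import nltk
nltk.download('punkt')  # tokenizer models used by sent_tokenize / word_tokenize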
#Stemming: work, works, working, worker all share the same root meaning, so extracting the common stem of such words is useful
from nltk.stem.porter import PorterStemmer  # Porter stemmer
from nltk.stem.lancaster import LancasterStemmer  # Lancaster stemmer
from nltk.stem.snowball import SnowballStemmer  # Snowball stemmer
sample_words = ['table', 'probably', 'wolves', 'playing', 
'is','dog', 'the', 'beaches','grounded', 'dreamt', 'envision']
#Compare how the three stemmers differ
porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer('english')
stemmers = ['porter','lancaster','snowball']
format_style = '{:>16}' * (len(stemmers)+ 1)
print('\n',format_style.format('word',*stemmers),'\n')
for word in sample_words:
    stemmer_words = [porter.stem(word),lancaster.stem(word),snowball.stem(word)]
    print(format_style.format(word,*stemmer_words))
#From the output below, the Lancaster stemmer is the most aggressive: it is fast, but it strips away so much of each word that the stems can become hard to interpret. The Snowball stemmer is usually the preferred choice.
Output:
            word          porter       lancaster        snowball

           table            tabl            tabl            tabl
        probably         probabl            prob         probabl
          wolves            wolv            wolv            wolv
         playing            play            play            play
              is              is              is              is
             dog             dog             dog             dog
             the             the             the             the
         beaches           beach           beach           beach
        grounded          ground          ground          ground
          dreamt          dreamt          dreamt          dreamt
        envision           envis           envid           envis
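In practice, stemming is applied after tokenization, so the two steps above can be chained; a minimal sketch reusing sample_text, word_tokenize, and the snowball stemmer defined above:
stemmed_tokens = [snowball.stem(token) for token in word_tokenize(sample_text)]
print(stemmed_tokens)  # every token of sample_text reduced to its Snowball stem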
#Lemmatization: reduce each word to its base form (lemma), e.g. wolves becomes the noun lemma wolf
from nltk.stem import WordNetLemmatizer
lemmatizers = ['noun lemmatizer','verb lemmatizer']
lemmatizer_format = '{:>24}' * (len(lemmatizers) + 1)
wordnet_lemmatizer = WordNetLemmatizer()
print('\n',lemmatizer_format.format('word',*lemmatizers),'\n')
for word in sample_words:
    lemmatizer_words = [wordnet_lemmatizer.lemmatize(word,pos='n'),wordnet_lemmatizer.lemmatize(word,pos='v')]#the pos argument specifies the part of speech
    print(lemmatizer_format.format(word,*lemmatizer_words))
Output:
                    word         noun lemmatizer         verb lemmatizer

                   table                   table                   table
                probably                probably                probably
                  wolves                    wolf                  wolves
                 playing                 playing                    play
                      is                      is                      be
                     dog                     dog                     dog
                     the                     the                     the
                 beaches                   beach                   beach
                grounded                grounded                  ground
                  dreamt                  dreamt                   dream
                envision                envision                envision
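Note: WordNetLemmatizer needs the WordNet corpus data from NLTK; if it is missing, a one-time download is needed (a minimal sketch):
import nltk
nltk.download('wordnet')  # corpus required by WordNetLemmatizer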

Bag-of-words model
1: Create a token for every unique word in the whole document set, forming a vocabulary
2: For each document, build a feature vector whose values are the number of times each vocabulary word occurs in that document

docs = ['the sun is sunning',
       'the weather is sweet',
       'the sun is shining and the weather is sweet']
from sklearn.feature_extraction.text import CountVectorizer
count_vector = CountVectorizer(ngram_range=(1,1))#ngram_range=(1,1) builds 1-gram (unigram) sequences
bag = count_vector.fit_transform(docs)
print(sorted(count_vector.vocabulary_.items(),key=lambda x:x[1]))#vocabulary built from the whole document set
bag.toarray()
#In the first document, 'and' (vocabulary index 0) does not appear, so that position is 0; 'is' (index 1) appears once,
#so that position is 1, and so on for the remaining words.
#The value of each feature is the raw term frequency: the number of times the word occurs in the document.
Output:
[('and', 0), ('is', 1), ('shining', 2), ('sun', 3), ('sunning', 4), ('sweet', 5), ('the', 6), ('weather', 7)]
array([[0, 1, 0, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 0, 1, 2, 1]], dtype=int64)
count_vector = CountVectorizer(ngram_range=(2,2))#2-gram (bigram) sequences
bag = count_vector.fit_transform(docs)
print(sorted(count_vector.vocabulary_.items(),key=lambda x:x[1]))#vocabulary built from 2-gram sequences
bag.toarray()
Output:
[('and the', 0), ('is shining', 1), ('is sunning', 2), ('is sweet', 3), ('shining and', 4), ('sun is', 5), ('the sun', 6), ('the weather', 7), ('weather is', 8)]
array([[0, 0, 1, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 1],
       [1, 1, 0, 1, 1, 1, 1, 1, 1]], dtype=int64)

If a word appears too frequently across the documents, it probably does not carry much discriminative information, so another technique is used: tf-idf (term frequency-inverse document frequency).
tf-idf = term frequency × inverse document frequency
idf(t) = log(n_d / (1 + df(d, t))), where n_d is the total number of documents and df(d, t) is the number of documents that contain the term t

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np 
tfidf = TfidfVectorizer()
np.set_printoptions(precision=2)
tfidf.fit_transform(docs).toarray()
Output:
array([[ 0.  ,  0.39,  0.  ,  0.5 ,  0.66,  0.  ,  0.39,  0.  ],
       [ 0.  ,  0.43,  0.  ,  0.  ,  0.  ,  0.56,  0.43,  0.56],
       [ 0.39,  0.46,  0.39,  0.3 ,  0.  ,  0.3 ,  0.46,  0.3 ]])
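These numbers do not follow the plain formula above exactly: by default scikit-learn's TfidfVectorizer uses a smoothed idf, idf(t) = log((1 + n_d) / (1 + df(t))) + 1, and then L2-normalizes every row. A minimal sketch that reproduces the same matrix by hand (assuming the docs list defined above):
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)  # raw term frequencies
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                        # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1            # smoothed idf (scikit-learn default)
tfidf_manual = counts * idf                          # raw tf-idf values
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)  # L2-normalize each row
print(tfidf_manual)                                  # matches the TfidfVectorizer output above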

That covers the basics of text processing; later notes will use these techniques on real text tasks, such as sentiment analysis.
