Python NLTK: text preprocessing

  • Download the data:

    import nltk
    nltk.download()  # with no arguments, opens the interactive downloader
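When only a few resources are needed, they can also be fetched by name instead of through the interactive downloader. The following is a minimal sketch: `RESOURCES` and `download_resources` are hypothetical names introduced here, and the id list is an assumption about what the rest of this post needs.

```python
import nltk

# Resource ids for the data used in this post (assumption). Newer NLTK
# releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
# instead; the LookupError raised by NLTK names the exact id to download.
RESOURCES = ["brown", "wordnet", "stopwords", "punkt", "averaged_perceptron_tagger"]


def download_resources(names=RESOURCES):
    """Download each named resource and report success per name."""
    return {name: nltk.download(name, quiet=True) for name in names}
```

Calling `download_resources()` once with network access should be enough; later runs find the data already on disk.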
    

0. Grammar basics

  • N: noun, V: verb, ADJ: adjective, ADV: adverb
    • NP: proper noun
    • PRO: pronoun, he/her/I/their
  • CNJ: conjunction, and/or/but/if/while/although
  • DET: determiner, the/a/some/most/every/no
  • EX: existential, there/there's
  • MOD: modal verb; UH: interjection
  • VD: past tense, VG: present participle, VN: past participle

1. Inspecting a corpus

  • brown: the Brown corpus

  • categories: text categories (genres)

  • sents: sentences

  • words: words

    >>> from nltk.corpus import brown
    # text categories (genres)
    >>> brown.categories()
    ['adventure', 'belles_lettres', 'editorial',
    'fiction', 'government', 'hobbies', 'humor',
    'learned', 'lore', 'mystery', 'news', 'religion',
    'reviews', 'romance', 'science_fiction']
    
    >>> len(brown.sents())
    57340
    >>> len(brown.words())
    1161192
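The same `categories()` and `words()` calls can be combined to see how the corpus is split across genres. This is a small sketch rather than part of NLTK: `category_sizes` and `largest_categories` are hypothetical helpers, and they assume only that the corpus object accepts `words(categories=...)`, as `brown` does.

```python
def category_sizes(corpus):
    """Return {category: number of word tokens} for a categorized corpus."""
    return {cat: len(corpus.words(categories=cat)) for cat in corpus.categories()}


def largest_categories(corpus, n=3):
    """Return the n categories with the most word tokens, largest first."""
    sizes = category_sizes(corpus)
    return sorted(sizes, key=sizes.get, reverse=True)[:n]


# Usage (requires the 'brown' data to be downloaded):
#   from nltk.corpus import brown
#   largest_categories(brown)
```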
    

2. Stemming and lemmatization

  • Stemming: walking ⇒ walk; walked ⇒ walk

    from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
    porter_stem = PorterStemmer()
    porter_stem.stem('walking')
    
  • Lemmatization: went ⇒ go; are ⇒ be

    from nltk.stem import WordNetLemmatizer
    lemma = WordNetLemmatizer()
    lemma.lemmatize('dogs')
    

    Mind the part of speech: if pos is not given, the word is treated as a noun by default.

    >> lemma.lemmatize('went')
    'went'
    >> lemma.lemmatize('went', pos='v')
    'go'
    >> lemma.lemmatize('are', pos='v')
    'be'				# forms of "be" are verbs too
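Going back to the three stemmers imported at the start of this section, only `PorterStemmer` is actually used. The sketch below runs the same word through all three so their results can be compared side by side; `stemmers` and `stem_all` are names made up for this example, and the word list is arbitrary.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# SnowballStemmer needs a language name; the other two take no arguments.
stemmers = {
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
    "snowball": SnowballStemmer("english"),
}


def stem_all(word):
    """Return {stemmer name: stem} for one word."""
    return {name: stemmer.stem(word) for name, stemmer in stemmers.items()}


for word in ["walking", "walked", "generously"]:
    print(word, stem_all(word))
```

None of the stemmers needs downloaded data, so this runs right after installing NLTK.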
    

3. POS tags and stopwords

  • pos tags

    import nltk
    words = nltk.word_tokenize('what does the fox say')
    nltk.pos_tag(words)  # the function is pos_tag, not pos_tags
    [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
    
  • stopwords

    from nltk.corpus import stopwords
    stopwords.words('english')
    

    Languages that stopwords supports (README in the listing is the corpus's documentation file, not a language):

    ['dutch',
     'german',
     'hungarian',
     'romanian',
     'kazakh',
     'turkish',
     'russian',
     'README',
     'italian',
     'english',
     'greek',
     'norwegian',
     'portuguese',
     'finnish',
     'danish',
     'french',
     'swedish',
     'azerbaijani',
     'spanish',
     'indonesian',
     'arabic',
     'nepali']
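The tags returned by `pos_tag` earlier in this section are Penn Treebank tags, while the lemmatizer from section 2 expects `pos` to be one of 'n', 'v', 'a', 'r'. A small mapping bridges the two. The helper below, `penn_to_wordnet`, is a hypothetical name, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's own default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # nouns and every other tag


# Tagged words copied from the pos_tag output shown in this section.
tagged = [('what', 'WDT'), ('does', 'VBZ'), ('the', 'DT'), ('fox', 'NNS'), ('say', 'VBP')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])

# Usage with the lemmatizer (requires the 'wordnet' data):
#   lemma = WordNetLemmatizer()
#   [lemma.lemmatize(w, pos=penn_to_wordnet(t)) for w, t in tagged]
```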
    

4. Text processing pipeline

[Figure 1: text processing pipeline]
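As a rough illustration of how the pieces from the earlier sections chain together, here is a minimal pipeline sketch. The stage order (tokenize, lowercase, drop punctuation, remove stopwords, stem) is an assumption, not a transcription of the figure. To keep it runnable without downloaded data, it swaps in `wordpunct_tokenize` for `word_tokenize`, a stemmer for the lemmatizer, and a tiny hand-written stopword set for `stopwords.words('english')`; `STOP_WORDS` and `preprocess` are hypothetical names.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Tiny illustrative stopword set (assumption); in practice use
# set(stopwords.words('english')) once the 'stopwords' data is downloaded.
STOP_WORDS = {"the", "a", "an", "and", "are", "is", "of", "to"}

stemmer = PorterStemmer()


def preprocess(text, stop_words=STOP_WORDS):
    """Tokenize, lowercase, drop punctuation and stopwords, then stem."""
    tokens = wordpunct_tokenize(text.lower())
    words = [t for t in tokens if t.isalpha()]          # drop punctuation tokens
    kept = [w for w in words if w not in stop_words]    # drop stopwords
    return [stemmer.stem(w) for w in kept]


print(preprocess("The dogs walked, and the dogs are walking."))
```

Replacing the stemming step with POS tagging followed by `lemma.lemmatize(word, pos=...)` gives dictionary forms instead of stems, at the cost of needing the tagger and WordNet data.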
