【NLP Natural Language Processing】Text Feature Extraction

Text representation methods:

  • One-hot
  • Bag of Words
  • N-gram
  • TF-IDF

These representation methods share two drawbacks: the resulting vectors are very high-dimensional, so training takes a long time, and they only count words without capturing any relationship between them.
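For a concrete picture of both drawbacks, here is a minimal one-hot sketch on a made-up five-word vocabulary: each word becomes a vector as long as the vocabulary, and any two distinct words look completely unrelated (dot product 0).

# One-hot sketch on a made-up vocabulary (illustrative only).
import numpy as np

vocab = ['text', 'feature', 'extraction', 'nlp', 'model']        # real vocabularies hold tens of thousands of words
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(one_hot['feature'])                    # [0. 1. 0. 0. 0.] — vector length equals vocabulary size
print(one_hot['feature'] @ one_hot['nlp'])   # 0.0 — every pair of distinct words is equally unrelated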

Count Vectors (Bag of Words model)

Further reading: the bag-of-words (BOW) model explained in detail; sklearn's CountVectorizer in detail.

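Before the full pipeline, a quick look at what CountVectorizer actually produces on a made-up English corpus (the sentences are illustrative only):

# Toy CountVectorizer demo on a made-up corpus.
from sklearn.feature_extraction.text import CountVectorizer

toy_corpus = ['this movie is good', 'this movie is bad', 'good movie good story']
vec = CountVectorizer()
bow = vec.fit_transform(toy_corpus)

print(sorted(vec.vocabulary_))   # one column per distinct word
print(bow.toarray())             # each row holds the word counts of one document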

# CountVectorizer + RidgeClassifier
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split



df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t', nrows=15000)

# Build bag-of-words features for the text column.
# max_features=3000 keeps the 3000 most frequent terms;
# ngram_range=(1, 3) counts unigrams, bigrams and trigrams.
vectorizer = CountVectorizer(max_features=3000, ngram_range=(1, 3))
train_text = vectorizer.fit_transform(df['text'])

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

# Fit a ridge classifier on the training split and evaluate on the validation split.
clf = RidgeClassifier()
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)
print(f1_score(y_val, val_pred, average='macro'))
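One practical caveat: any new text (for example the competition's test file) must go through the same fitted vectorizer via transform, not a new fit_transform, so the feature columns stay aligned with what the classifier saw. A minimal continuation of the example above, with made-up input strings in the dataset's character-ID format:

# Continuing from the block above; the two strings are made-up placeholders.
new_docs = ['3750 648 900', '2465 6122 3750 648']
new_features = vectorizer.transform(new_docs)   # reuse the fitted vocabulary — do not fit again
print(clf.predict(new_features))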

The TF-IDF model

A TF-IDF score has two parts: the first is term frequency (TF) and the second is inverse document frequency (IDF). The inverse document frequency is obtained by dividing the total number of documents in the corpus by the number of documents that contain the term, and then taking the logarithm.

  • TF(t) = (number of times term t appears in the current document) / (total number of terms in the current document)
  • IDF(t) = log_e(total number of documents / number of documents containing term t)

Once TF (term frequency) and IDF (inverse document frequency) are known, multiplying the two values gives the term's TF-IDF. The larger a term's TF-IDF within an article, the more important the term generally is to that article, so by computing TF-IDF for every term and sorting in descending order, the top-ranked terms can be taken as the article's keywords. A short worked example follows.
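A hand computation of the two formulas on a made-up toy corpus (note that scikit-learn's TfidfVectorizer additionally smooths the IDF and L2-normalizes each row, so its numbers differ slightly):

# Hand-computed TF-IDF for one term, following the definitions above (made-up toy corpus).
import math

docs = [['this', 'movie', 'is', 'good'],
        ['this', 'movie', 'is', 'bad'],
        ['good', 'movie', 'good', 'story']]
term = 'good'

tf = docs[2].count(term) / len(docs[2])                   # 2 / 4 = 0.5 in the third document
idf = math.log(len(docs) / sum(term in d for d in docs))  # log(3 / 2) ≈ 0.405
print(tf * idf)                                           # ≈ 0.203, the term's TF-IDF in that document

scikit-learn's TfidfVectorizer runs this computation for every term in a corpus, as in the following example.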


import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

vectorizer.get_feature_names()   # on newer scikit-learn versions this method is get_feature_names_out()
#['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

X.toarray()
# toarray() returns the tf-idf value of every feature in every document

# Print the term with the highest tf-idf value in each document.
word = vectorizer.get_feature_names()
weight = X.toarray()

for i in range(len(weight)):
    w_sort = np.argsort(-weight[i])   # feature indices sorted by descending tf-idf

    print('doc: {0}, top tf-idf is: {1}, {2}'.format(corpus[i], word[w_sort[0]], weight[i][w_sort[0]]))

Example

#TF-IDF + RidgeClassifier
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


df = pd.read_csv('新建文件夹/天池—新闻文本分类/train_set.csv', sep='\t',nrows = 15000)

# Extract tf-idf features from the text column (unigrams to trigrams, top 3000 terms).
train_text = TfidfVectorizer(ngram_range=(1, 3), max_features=3000).fit_transform(df.text)

X_train, X_val, y_train, y_val = train_test_split(train_text, df.label, test_size=0.3)

clf = RidgeClassifier()
clf.fit(X_train, y_train)
val_pred = clf.predict(X_val)
print(f1_score(y_val, val_pred, average='macro'))
  • These two vectorizers are normally paired with a machine-learning model: the vectorizer extracts features from the text, and the model handles prediction and classification (a minimal Pipeline sketch follows).
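To make that pairing explicit, the vectorizer and classifier can be chained into a single scikit-learn Pipeline, so raw text goes in and predictions come out (a minimal sketch; train_texts, train_labels and new_texts are placeholder names, not from the original code):

# Vectorizer + classifier chained into one estimator.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3), max_features=3000),
                      RidgeClassifier())
# model.fit(train_texts, train_labels)   # the pipeline vectorizes, then fits the classifier
# model.predict(new_texts)               # new raw text is vectorized with the fitted vocabulary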

Handling Chinese text with CountVectorizer / TfidfVectorizer
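Both vectorizers tokenize on whitespace/word boundaries by default, so raw (unsegmented) Chinese text has to be split into words first. A minimal sketch, assuming the jieba segmenter is available (jieba and the example sentences are my own illustration; the original post only names the topic):

# Segment Chinese text with jieba before vectorizing (jieba is an assumed dependency).
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

zh_corpus = ['我喜欢自然语言处理', '我不喜欢看电影']   # made-up sentences
vec = TfidfVectorizer(tokenizer=jieba.lcut)             # jieba.lcut splits each document into a word list
X = vec.fit_transform(zh_corpus)
print(sorted(vec.vocabulary_))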
