Machine learning models cannot process raw text directly; the text has to be converted into numerical vectors first. The following content briefly covers the techniques involved.
Before training a machine learning model, text data must be cleaned and turned into vectors. This process is called text preprocessing.
This section walks through the basic data cleaning steps and the following techniques for encoding text data:
**Bag of Words**
**Binary Bag of Words**
**Bigram, N-gram**
**TF-IDF** (**T**erm **F**requency - **I**nverse **D**ocument **F**requency)
**Word2Vec**
**Avg-Word2Vec**
**TF-IDF Word2Vec**
Import the libraries
import warnings
warnings.filterwarnings("ignore") #Ignoring unnecessary warnings
import numpy as np #for large and multi-dimensional arrays
import pandas as pd #for data manipulation and analysis
import nltk #Natural language processing tool-kit
from nltk.corpus import stopwords #Stopwords corpus
from nltk.stem import PorterStemmer # Stemmer
from sklearn.feature_extraction.text import CountVectorizer #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer #For TF-IDF
from gensim.models import Word2Vec #For Word2Vec
data_path = "Reviews.csv" #The download link is at the end of the article
data = pd.read_csv(data_path)
data_sel = data.head(10000) #Take the first 10,000 reviews
The dataset consists of more than 500,000 fine food reviews from Amazon, spanning a period of more than 10 years up to October 2012. Each review includes product and user information, a rating, and a plain-text review.
# Column names
data_sel.columns
Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
dtype='object')
Suppose the goal is to predict whether a review is positive or negative based on its text.
Looking at the Score column, the ratings are 1, 2, 3, 4 and 5. Ratings of 1 and 2 can be treated as negative, 4 and 5 as positive, and 3 as neutral.
Since the prediction target is positive vs. negative, the neutral ratings need to be removed.
data_sel['Score'].value_counts()
5 6183
4 1433
1 932
3 862
2 590
Name: Score, dtype: int64
data_score_removed = data_sel[data_sel['Score']!=3] #Remove neutral ratings
Convert the ratings into positive/negative binary labels.
def partition(x):
    if x < 3:
        return 'negative'
    return 'positive'
score_upd = data_score_removed['Score']
t = score_upd.map(partition)
data_score_removed['Score']=t
Deduplication means removing duplicate rows, which is necessary to get stable results. Duplicates are checked based on UserId, ProfileName, Time and Text: if all of these values are equal, the record is dropped.
HelpfulnessNumerator is roughly the number of people who found the review helpful, and HelpfulnessDenominator is the number of people who found it helpful plus those who did not.
HelpfulnessNumerator should therefore always be less than or equal to HelpfulnessDenominator, so this condition is checked and records that violate it are removed.
final_data = data_score_removed.drop_duplicates(subset={"UserId","ProfileName","Time","Text"})
final = final_data[final_data['HelpfulnessNumerator'] <= final_data['HelpfulnessDenominator']]
final_X = final['Text']
final_y = final['Score']
Convert all words to lowercase and remove punctuation.
Stemming - converting words to their base or root word (e.g., tastefully and tasty are both reduced to the root 'tasti'). Because similar words are no longer treated as distinct, this reduces the vector dimensionality; a small sketch follows.
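A minimal sketch of this step with NLTK's SnowballStemmer (the same stemmer used in the preprocessing loop below); the word list here is only illustrative:
from nltk.stem import SnowballStemmer
snow = SnowballStemmer('english')
for w in ['tasty', 'tasteful', 'tastes']:
    print(w, '->', snow.stem(w))   # each word is reduced to its root form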
Stopwords - stopwords are words that carry little meaning; removing them does not change the sentiment expressed by the sentence.
For example - This pasta is so tasty ==> pasta tasty (This, is, so are stopwords and are removed)
stop = set(stopwords.words('english'))
print(stop)
{"aren't", "weren't", 'to', 'had', 'in', 'same', 'did', 'off', 'as', 'was', 'about', 'its', 'hers', 'been', "you'll", 'didn', 'his', 'up', 'down', 'for', 'yourself', 'out', 'have', 'an', 'only', 'this', 'of', 'itself', 's', 'll', 'themselves', 'where', 'very', 'mightn', 'do', 'other', 'myself', 'some', 'above', "shan't", 'weren', 'those', 'yours', 'below', 'wasn', 'theirs', 'be', "didn't", 'has', "isn't", "that'll", 'can', 'any', 'own', "she's", "don't", "hadn't", 'then', 'or', "won't", 'they', 'ain', 'won', 'mustn', "you'd", 'herself', 'does', 'at', 'ma', 'am', 'each', 'isn', 'my', 'ours', 'our', 'me', 'so', "you've", 'if', 'with', 'once', 'such', 'yourselves', 'having', 'i', 'all', 'hasn', 'through', 'over', 'she', 'who', 'by', 'why', 'were', 'your', 'hadn', 'from', 'her', 'just', 'a', 'now', 'how', "haven't", 't', 'haven', 'not', 'ourselves', 'no', 'himself', 'wouldn', 'being', 'o', 'it', 'him', 'that', 'between', 'on', 'shan', 'd', 'you', 'after', 'both', "should've", 'needn', 'most', "doesn't", "hasn't", 'what', 'he', 'are', 'here', 're', 'don', 'before', 'more', 'few', "mustn't", 'until', 'but', 'whom', "shouldn't", 'we', 'aren', 'nor', 'than', 'while', 'these', 'during', 'y', 'when', 'their', 'doing', 'm', "wouldn't", 'under', 'further', 'too', "you're", 'which', "it's", 'the', 've', 'against', 'again', "mightn't", 'is', 'there', 'and', 'into', 'them', 'shouldn', 'doesn', 'couldn', 'should', 'will', "wasn't", 'because', "couldn't", "needn't"}
import re
temp = []
snow = nltk.stem.SnowballStemmer('english')
for sentence in final_X:
    sentence = sentence.lower()                           # Convert to lowercase
    cleanr = re.compile('<.*?>')
    sentence = re.sub(cleanr, ' ', sentence)              # Remove HTML tags
    sentence = re.sub(r'[?|!|\'|"|#]', r'', sentence)
    sentence = re.sub(r'[.|,|)|(|\|/]', r' ', sentence)   # Remove punctuation
    words = [snow.stem(word) for word in sentence.split() if word not in stop]  # Stemming and removing stopwords
    temp.append(words)
final_X = temp
print(final_X[1])
['product', 'arriv', 'label', 'jumbo', 'salt', 'peanut', 'peanut', 'actual', 'small', 'size', 'unsalt', 'sure', 'error', 'vendor', 'intend', 'repres', 'product', 'jumbo']
sent = []
for row in final_X:
    sequ = ''
    for word in row:
        sequ = sequ + ' ' + word
    sent.append(sequ)
final_X = sent
print(final_X[1])
product arriv label jumbo salt peanut peanut actual small size unsalt sure error vendor intend repres product jumbo
Encoding techniques
Bag of Words
In the Bag of Words model, a dictionary is built from all the words in the text dataset, with each word's frequency as its value. If the dictionary has d entries, every sentence is represented by a d-dimensional vector that stores, at the corresponding position, the number of times each word occurs in the sentence. Such vectors are inevitably very sparse.
For example: pasta is tasty and pasta is good
[0]…[1]…[1]…[2]…[2]…[1]… <== vector representation (every other position holds the value 0)
[a]…[and]…[good]…[is]…[pasta]…[tasty]… <== dictionary
scikit-learn's CountVectorizer implements the Bag of Words model; its max_features parameter sets the vocabulary size, keeping only the max_features most frequent words in the corpus.
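A minimal sketch of the example above using CountVectorizer on a single toy sentence (the variable names are only for illustration):
toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(['pasta is tasty and pasta is good'])
print(toy_vect.vocabulary_)   # maps each word to its column index: and=0, good=1, is=2, pasta=3, tasty=4
print(toy_bow.toarray())      # [[1 1 2 2 1]] -> 'is' and 'pasta' occur twice, the rest once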
BINARY词袋模型
在binary词袋模型中,不统计单词出现的频率,只用1表示单词出现,0表示单词未出现,在CountVectorizer中设置binary=true即可将词袋模型转化为binary词袋。
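A minimal sketch of the binary variant on the same toy sentence (hypothetical variable names; the count-based model applied to the real data follows):
bin_vect = CountVectorizer(binary=True)
bin_bow = bin_vect.fit_transform(['pasta is tasty and pasta is good'])
print(bin_bow.toarray())      # [[1 1 1 1 1]] -> every non-zero count is clipped to 1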
count_vect = CountVectorizer(max_features=5000)
bow_data = count_vect.fit_transform(final_X)
print(bow_data[1])
(0, 3641) 1
(0, 2326) 1
(0, 4734) 1
(0, 1539) 1
(0, 4314) 1
(0, 4676) 1
(0, 3980) 1
(0, 4013) 1
(0, 162) 1
(0, 3219) 2
(0, 3770) 1
(0, 2420) 2
(0, 2493) 1
(0, 332) 1
(0, 3432) 2
Drawbacks of the Bag of Words model
Texts with similar meaning should get similar encoded vectors, but the Bag of Words model sometimes fails to achieve this.
For example, after removing stopwords the sentences 'This pasta is very tasty' and 'This pasta is not tasty' both become 'pasta tasty', i.e., they end up looking identical to the model (see the sketch below).
To address this shortcoming, the bigram and n-gram models were introduced.
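A minimal sketch of this effect using the NLTK stopword list printed earlier (the sentences are the ones from the example above):
s1 = 'this pasta is very tasty'
s2 = 'this pasta is not tasty'
print([w for w in s1.split() if w not in stop])   # ['pasta', 'tasty']
print([w for w in s2.split() if w not in stop])   # ['pasta', 'tasty'] -> 'not' is dropped, both sentences collapse to the same tokens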
Bi-Gram Bag of Words
Setting CountVectorizer's ngram_range parameter to (1,2) turns it into a Bi-Gram Bag of Words model.
The Bi-Gram model significantly increases the size of the dictionary, as the sketch below illustrates.
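A minimal sketch of what ngram_range=(1,2) produces on two toy sentences (stopwords are deliberately kept here so that 'not' survives; variable names are only for illustration):
bi_vect = CountVectorizer(ngram_range=(1,2))
bi_vect.fit(['pasta tasty', 'pasta not tasty'])
print(sorted(bi_vect.vocabulary_))   # ['not', 'not tasty', 'pasta', 'pasta not', 'pasta tasty', 'tasty']
Bigrams such as 'not tasty' become features of their own, which is why the vocabulary grows so much.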
final_B_X = final_X
count_vect = CountVectorizer(ngram_range=(1,2))
Bigram_data = count_vect.fit_transform(final_B_X)
print(Bigram_data[1])
(0, 143171) 1
(0, 151696) 1
(0, 95087) 1
(0, 196648) 1
(0, 60866) 1
(0, 177168) 1
(0, 193567) 1
(0, 164722) 1
(0, 165627) 1
(0, 4021) 1
(0, 133855) 1
(0, 133898) 1
(0, 155987) 1
(0, 97865) 1
(0, 100490) 1
(0, 11861) 1
(0, 142800) 1
(0, 151689) 1
(0, 95076) 1
(0, 196632) 1
(0, 60852) 1
(0, 177092) 1
(0, 193558) 1
(0, 164485) 1
(0, 165423) 1
(0, 3831) 1
(0, 133854) 2
(0, 155850) 1
(0, 97859) 2
(0, 100430) 1
(0, 11784) 1
(0, 142748) 2
TF-IDF
Term Frequency - Inverse Document Frequency gives more importance to rare words and assigns lower weight to common words.
Term Frequency TF(W) is the number of times a word W occurs in a review divided by the total number of words in the review; its value lies between 0 and 1.
Inverse Document Frequency is IDF(W) = log(N / n), where N is the total number of documents and n is the number of documents that contain W.
TF-IDF is the product of the two: TF-IDF(W) = TF(W) * log(N / n). A small worked example follows.
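For instance (made-up numbers, just to illustrate the formula): if a word appears 3 times in a 100-word review, TF = 3/100 = 0.03; if it appears in 10 out of 1,000 reviews, IDF = log(1000/10) = log(100); so its TF-IDF weight in that review is 0.03 * log(100).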
scikit-learn's TfidfVectorizer computes the TF-IDF representation.
final_tf = final_X
tf_idf = TfidfVectorizer(max_features=5000)
tf_data = tf_idf.fit_transform(final_tf)
print(tf_data[1])
(0, 3432) 0.1822092004981035
(0, 332) 0.1574317775964303
(0, 2493) 0.18769649750089953
(0, 2420) 0.5671119742041831
(0, 3770) 0.1536626385509959
(0, 3219) 0.3726548417697838
(0, 162) 0.14731616688674187
(0, 4013) 0.14731616688674187
(0, 3980) 0.14758995053747803
(0, 4676) 0.2703170210936338
(0, 4314) 0.14376924933112933
(0, 1539) 0.2676489579732629
(0, 4734) 0.22110622670603633
(0, 2326) 0.25860104128863787
(0, 3641) 0.27633136515735446
Word2Vec
Word2Vec captures the semantic meaning of words and the relationships between them. It learns these relationships from the corpus and represents every word as a dense vector.
Gensim's Word2Vec provides this representation; its min_count parameter keeps only words that occur at least min_count times in the corpus.
size=50 means each word vector has 50 dimensions, and workers is the number of cores used for training. A small sketch on a toy corpus follows.
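A minimal sketch on a hypothetical toy corpus, assuming gensim 3.x (where the dimensionality parameter is called size; in gensim 4.x it was renamed vector_size):
from gensim.models import Word2Vec
toy_sentences = [['pasta', 'tasti', 'good'], ['pasta', 'great', 'tasti'], ['product', 'bad', 'tast']]
toy_model = Word2Vec(toy_sentences, min_count=1, size=10, workers=1)
print(toy_model.wv['pasta'])                       # the dense 10-dimensional vector learned for 'pasta'
print(toy_model.wv.most_similar('pasta', topn=2))  # the words whose vectors are closest to 'pasta'
On such a tiny corpus the neighbours are essentially random; with the full review corpus, semantically related words end up with nearby vectors.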
Average Word2Vec
Compute the Word2Vec vector of every word in a sentence, add the vectors together, and divide by the number of words in the sentence, i.e., average the Word2Vec vectors of all the words, as in the formula below.
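In the same notation as the TF-IDF Word2Vec formula further down, the sentence vector is simply:
V = ( w2v(W1) + w2v(W2) + ... + w2v(Wn) ) / n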
w2v_data = final_X
splitted = []
for row in w2v_data:
    splitted.append([word for word in row.split()])   # split each review back into a list of words
train_w2v = Word2Vec(splitted, min_count=5, size=50, workers=4)
avg_data = []
for row in splitted:
    vec = np.zeros(50)
    count = 0
    for word in row:
        try:
            vec += train_w2v.wv[word]   # add this word's Word2Vec vector
            count += 1
        except:                         # skip words filtered out by min_count
            pass
    avg_data.append(vec / count)
print(avg_data[1])
[-0.05002685 0.25928063 0.05951109 0.51900934 -0.03167963 0.46785719
0.00222484 -0.66310911 -0.20395733 0.46152488 -0.10946387 -0.23357287
-0.46547037 -0.15859849 -0.06357213 0.31450908 -0.20035413 0.15254458
0.43844457 -0.16903808 0.21322138 -0.31059842 -0.2618238 0.38762275
0.16056326 -0.20617737 0.11266044 0.11485556 0.45777261 -0.36297071
-0.06595719 -0.01621006 0.29060831 -0.3001661 0.47677443 0.11626731
-0.47465852 -0.30745895 -0.20314391 0.2791373 0.35737396 0.06957825
-0.17933015 0.30257337 0.24010014 -0.07452129 0.0677023 0.20865259
-0.37281946 0.23683855]
TF-IDF Word2Vec
In TF-IDF Word2Vec, each word's Word2Vec vector is multiplied by that word's TF-IDF value; the weighted vectors are summed and the result is divided by the sum of the TF-IDF values of the words in the sentence.
V = ( t(W1)*w2v(W1) + t(W2)*w2v(W2) + ... + t(Wn)*w2v(Wn) ) / ( t(W1) + t(W2) + ... + t(Wn) )
tf_w_data = final_X
tf_idf = TfidfVectorizer(max_features=5000)
tf_idf_data = tf_idf.fit_transform(tf_w_data)
tf_w_data = []
tf_idf_data = tf_idf_data.toarray()
i = 0
for row in splitted:
    vec = np.zeros(50)                   # use a numpy array so it can be weighted and summed
    temp_tfidf = []
    for val in tf_idf_data[i]:
        if val != 0:
            temp_tfidf.append(val)       # non-zero TF-IDF values of this review
    count = 0
    tf_idf_sum = 0
    for word in row:
        try:
            count += 1
            tf_idf_sum = tf_idf_sum + temp_tfidf[count - 1]
            vec += (temp_tfidf[count - 1] * train_w2v.wv[word])   # TF-IDF weighted Word2Vec vector
        except:                          # skip words missing from either vocabulary
            pass
    vec = (1 / tf_idf_sum) * vec         # normalize by the sum of TF-IDF weights
    tf_w_data.append(vec)
    i = i + 1
print(tf_w_data[1])
[-0.07845875 0.2705135 0.15750491 0.44439225 -0.10240859 0.4895916
-0.05983333 -0.5968678 -0.45577406 0.32542196 0.03153885 -0.33706042
-0.54295155 -0.24655421 -0.18965962 0.32954096 -0.48843989 0.34956909
0.51294949 -0.07487897 0.33537215 -0.33352162 -0.37988736 0.37148924
0.38356532 -0.38107139 0.00941638 0.04478032 0.65593118 -0.56111246
-0.25711972 -0.07745567 0.28346332 -0.34313615 0.70660757 0.17808426
-0.41935335 -0.39114685 -0.36363119 0.25106416 0.38174835 0.18360225
-0.15376004 0.51548442 0.22080342 -0.12319357 0.13416085 0.04792397
-0.54085787 -0.00814264]