I am Happy Because I am learning NLP @deeplearning
After preprocessing this becomes
[happy,learn,nlp]
and feature extraction then yields
[1,4,2]
Applying feature extraction to all m tweets in a corpus yields the matrix
{%raw%}
$$\begin{bmatrix} 1 & X^{(1)}_1 & X^{(1)}_2 \\ 1 & X^{(2)}_1 & X^{(2)}_2 \\ \vdots & \vdots & \vdots \\ 1 & X^{(m)}_1 & X^{(m)}_2 \end{bmatrix}$$
{%endraw%}
freqs = build_freqs(tweets, labels)  # build the frequencies dictionary
X = np.zeros((m, 3))  # initialize the matrix (requires `import numpy as np` beforehand)
for i in range(m):  # for every tweet
    p_tweet = process_tweet(tweets[i])  # process the tweet
    X[i, :] = extract_features(p_tweet, freqs)  # extract its features
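The `extract_features` helper used above is not defined in this section. As a minimal sketch, consistent with the three features described earlier ([bias, positive count, negative count]); the course's actual implementation may differ in details:

```python
import numpy as np

def extract_features(processed_tweet, freqs):
    """Map a processed tweet to the 3-element feature vector
    [bias, sum of positive counts, sum of negative counts]."""
    x = np.zeros(3)
    x[0] = 1  # bias term
    for word in processed_tweet:
        # labels are stored as floats (1.0 / 0.0) when they come from a numpy array
        x[1] += freqs.get((word, 1.0), 0)  # positive frequency of the word
        x[2] += freqs.get((word, 0.0), 0)  # negative frequency of the word
    return x
```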
First, as before, import the necessary libraries
import nltk # Python library for NLP
from nltk.corpus import twitter_samples # sample Twitter dataset from NLTK
import matplotlib.pyplot as plt # visualization library
import numpy as np # library for scientific computing and matrix operations
Here,
process_tweet() cleans the text, splits it into individual words, removes stopwords, and converts each word to its stem;
build_freqs() counts how often each word in the corpus appears with the positive and with the negative label, building the freqs dictionary, whose keys are (word, label) tuples.
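`process_tweet` lives in the course's `utils` module and is not shown here. As a rough sketch of what it could look like, built from standard NLTK components (TweetTokenizer, English stopwords, the Porter stemmer); the course's version may differ in details:

```python
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet_sketch(tweet):
    # remove retweet marks, hyperlinks, and the '#' sign
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    tweet = re.sub(r'https?://\S+', '', tweet)
    tweet = re.sub(r'#', '', tweet)
    # tokenize in lowercase, stripping @handles
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    stop = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    # drop stopwords and punctuation, then stem each remaining token
    return [stemmer.stem(tok) for tok in tokenizer.tokenize(tweet)
            if tok not in stop and tok not in string.punctuation]
```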
Download the required resources
# download the stopwords for the process_tweet function
nltk.download('stopwords')
# import our convenience functions
from utils import process_tweet, build_freqs
Load the dataset
This is the same as in the previous section.
# select the lists of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
# concatenate the lists, 1st part is the positive tweets followed by the negative
tweets = all_positive_tweets + all_negative_tweets
# let's see how many tweets we have
print("Number of tweets: ", len(tweets))
Build a labels array in which the first 5000 elements are labeled 1 and the last 5000 are labeled 0.
# make a numpy array representing labels of the tweets
labels = np.append(np.ones((len(all_positive_tweets))), np.zeros((len(all_negative_tweets))))
(For a refresher on Python dictionaries, see the Runoob (菜鸟教程) tutorial.)
Building the word frequency dictionary
def build_freqs(tweets, ys):
    """Build frequencies.
    Input:
        tweets: a list of tweets
        ys: an m x 1 array with the sentiment label of each tweet
            (either 0 or 1)
    Output:
        freqs: a dictionary mapping each (word, sentiment) pair to its
            frequency
    """
    # Convert the np array to a list, since zip needs an iterable.
    # The squeeze is necessary, or the list ends up with one element.
    # Note that this is just a no-op if ys is already a list.
    yslist = np.squeeze(ys).tolist()

    # Start with an empty dictionary and populate it by looping over all tweets
    # and over all processed words in each tweet.
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)  # the key is the (word, label) tuple
            if pair in freqs:
                freqs[pair] += 1  # increment the count
            else:
                freqs[pair] = 1  # initialize the count to 1
    return freqs
Now put the function to use
# create frequency dictionary
freqs = build_freqs(tweets, labels)
# check data type
print(f'type(freqs) = {type(freqs)}')
# check length of the dictionary
print(f'len(freqs) = {len(freqs)}')
But the dictionary is far too large to inspect directly, so select a subset of words to visualize. A list can hold this temporary table.
# select some words to appear in the report. we will assume that each word is unique (i.e. no duplicates)
keys = ['happi', 'merri', 'nice', 'good', 'bad', 'sad', 'mad', 'best', 'pretti',
'❤', ':)', ':(', '', '', '', '', '♛',
'song', 'idea', 'power', 'play', 'magnific']
# list representing our table of word counts.
# each element consists of a sublist with this pattern: [<word>, <positive_count>, <negative_count>]
data = []
# loop through our selected words
for word in keys:
    # initialize positive and negative counts
    pos = 0
    neg = 0
    # retrieve number of positive counts
    if (word, 1) in freqs:
        pos = freqs[(word, 1)]
    # retrieve number of negative counts
    if (word, 0) in freqs:
        neg = freqs[(word, 0)]
    # append the word counts to the table
    data.append([word, pos, neg])
data
We can draw these counts on a log-scale scatter plot, with the positive and negative frequencies on the horizontal and vertical axes.
fig, ax = plt.subplots(figsize = (8, 8))
# convert positive raw counts to logarithmic scale. we add 1 to avoid log(0)
x = np.log([x[1] + 1 for x in data])
# do the same for the negative counts
y = np.log([x[2] + 1 for x in data])
# Plot a dot for each pair of words
ax.scatter(x, y)
# assign axis labels
plt.xlabel("Log Positive count")
plt.ylabel("Log Negative count")
# Add the word as the label at the same position as you added the points just before
for i in range(0, len(data)):
    ax.annotate(data[i][0], (x[i], y[i]), fontsize=12)
ax.plot([0, 9], [0, 9], color='red')  # Plot the red line that divides the 2 areas.
plt.show()
The result is the scatter plot shown below.

We now use the features extracted above to classify a tweet's sentiment as positive or negative. Logistic regression uses the sigmoid function (output range 0 to 1), whose expression is
$$\sigma(z) = \frac{1}{1+e^{-z}}$$
The basic supervised learning setup was covered in the previous section; in logistic regression, the prediction function is the sigmoid. The hypothesis h is
$$h(x^{(i)},\theta)=\frac{1}{1+e^{-\theta^T x^{(i)}}}$$

which depends on the parameters $\theta$ and the $i$-th example $x^{(i)}$.
For classification we need a threshold, usually set to 0.5, which corresponds to a dot product of $\theta^T$ with $x$ equal to 0. If the dot product is less than 0, the prediction is negative; if it is greater than 0, the prediction is positive.
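As a minimal sketch of this decision rule in code (these helper names are illustrative, not from the course notebook):

```python
import numpy as np

def sigmoid(z):
    # maps any real-valued score to the interval (0, 1)
    return 1 / (1 + np.exp(-z))

def predict(x, theta):
    # h = sigmoid(theta^T x); positive iff theta^T x >= 0, i.e. h >= 0.5
    return 1 if sigmoid(np.dot(theta, x)) >= 0.5 else 0
```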
Example:
The text is
@YMourri and @ AndrewYNg are tuning a GREAT AI model
After preprocessing it becomes
[tun,ai,great,model]
and the resulting feature vector x is
{%raw%}
$$x^{(i)} = \begin{bmatrix} 1 \\ 3476 \\ 245 \end{bmatrix}$$
{%endraw%}
The parameter vector $\theta$ is what we want to learn; once we have it, we can make predictions. For now, suppose we have already obtained a $\theta$ with the value
{%raw%}
$$\theta = \begin{bmatrix} 0.00003 \\ 0.00150 \\ -0.00120 \end{bmatrix}$$
{%endraw%}
Then the dot product works out to $0.00003 + 0.00150 \times 3476 - 0.00120 \times 245 \approx 4.92 > 0$, so we predict positive.
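A quick numerical check of this example:

```python
x = np.array([1, 3476, 245])
theta = np.array([0.00003, 0.00150, -0.00120])
z = np.dot(theta, x)  # 0.00003 + 5.214 - 0.294 = 4.92003
print(z)              # 4.92003 > 0, so the prediction is positive
```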
(The earlier imports are omitted here.)
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
tweets = all_positive_tweets + all_negative_tweets ## Concatenate the lists.
labels = np.append(np.ones((len(all_positive_tweets),1)), np.zeros((len(all_negative_tweets),1)), axis = 0)
# split the data into two pieces, one for training and one for testing (validation set)
train_pos = all_positive_tweets[:4000]
train_neg = all_negative_tweets[:4000]
train_x = train_pos + train_neg
print("Number of tweets: ", len(train_x))
data = pd.read_csv('logistic_features.csv')  # Load a 3-column csv file using pandas
data.head(10)  # Print the first 10 data entries
# Each feature is labeled as bias, positive and negative
X = data[['bias', 'positive', 'negative']].values  # Get only the numerical values of the dataframe
Y = data['sentiment'].values  # Put in Y the corresponding labels or sentiments
print(X.shape) # Print the shape of the X part
print(X) # Print some rows of X
theta = [7e-08, 0.0005239, -0.00055517]
Now plot the data as a scatter plot:
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))
colors = ['red', 'green']
# Color based on the sentiment Y
ax.scatter(X[:,1], X[:,2], c=[colors[int(k)] for k in Y], s = 0.1) # Plot a dot for each pair of words
plt.xlabel("Positive")
plt.ylabel("Negative")
The line y = x separates the two classes well, so we can expect the model's accuracy to be good. Draw a gray line to mark the cutoff between the positive and negative regions, i.e.

$$z = \theta \cdot x = 0$$

The red and green arrows pointing toward the corresponding sentiment are drawn perpendicular to this separation line (computed by the neg function below). They point in the same direction as the change of the model's score, though the magnitude may differ; they serve only as a visual representation of the model.
# Equation for the separation plane
# It gives a value on the negative axis as a function of a positive value
# f(pos, neg, W) = w0 + w1 * pos + w2 * neg = 0
# neg(pos, W) = (-w0 - w1 * pos) / w2
def neg(theta, pos):
    return (-theta[0] - pos * theta[1]) / theta[2]

# Equation for the direction of the sentiment change
# We don't care about the magnitude of the change. We are only interested
# in the direction. So this direction is just a perpendicular function to the
# separation plane
# df(pos, W) = pos * w2 / w1
def direction(theta, pos):
    return pos * theta[2] / theta[1]
# Plot the samples using columns 1 and 2 of the matrix
fig, ax = plt.subplots(figsize = (8, 8))
colors = ['red', 'green']
# Color based on the sentiment Y
ax.scatter(X[:,1], X[:,2], c=[colors[int(k)] for k in Y], s = 0.1) # Plot a dot for each pair of words
plt.xlabel("Positive")
plt.ylabel("Negative")
# Now lets represent the logistic regression model in this chart.
maxpos = np.max(X[:,1])
offset = 5000 # The pos value for the direction vectors origin
# Plot a gray line that divides the 2 areas.
ax.plot([0, maxpos], [neg(theta, 0), neg(theta, maxpos)], color = 'gray')
# Plot a green line pointing to the positive direction
ax.arrow(offset, neg(theta, offset), offset, direction(theta, offset), head_width=500, head_length=500, fc='g', ec='g')
# Plot a red line pointing to the negative direction
ax.arrow(offset, neg(theta, offset), -offset, -direction(theta, offset), head_width=500, head_length=500, fc='r', ec='r')
plt.show()
This produces the figure: the scatter plot with the gray separation line and the red and green direction arrows.
Given the feature matrix X, the labels Y, and a learned θ, we make predictions with the hypothesis function and check whether each one exceeds the threshold:
$$pred = h(X_{val}, \theta) \ge 0.5$$
Comparing against the threshold yields a predicted label for each example; comparing those labels with Y, summing the number of correct predictions, and dividing by the total gives the accuracy:

$$\sum_{i=1}^m \frac{\left(pred^{(i)} == y_{val}^{(i)}\right)}{m}$$
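A minimal vectorized version of this accuracy computation, assuming `X_val` (m x 3), `y_val` (m,), and the `sigmoid` sketch from earlier:

```python
def accuracy(X_val, y_val, theta):
    # predicted labels: h(x) >= 0.5 for every validation example
    preds = sigmoid(X_val.dot(theta)) >= 0.5
    # fraction of predictions that agree with the true labels
    return np.mean(preds == y_val)
```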
The cost function for logistic regression is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}, \theta) + (1 - y^{(i)}) \log\left(1 - h(x^{(i)}, \theta)\right) \right]$$
In vectorized form,

$$J = -\frac{1}{m}\left(\mathbf{y}^T \cdot \log(\mathbf{h}) + (1-\mathbf{y})^T \cdot \log(1-\mathbf{h})\right)$$
and the gradient update for θ is

$$\theta = \theta - \frac{\alpha}{m}\left(\mathbf{x}^T \cdot (\mathbf{h}-\mathbf{y})\right)$$
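Putting the vectorized cost and the update rule together, a minimal gradient-descent loop might look like this (`alpha` and `num_iters` are illustrative hyperparameters, not values from the course):

```python
def gradient_descent(x, y, theta, alpha, num_iters):
    """x: (m, 3) features, y: (m, 1) labels, theta: (3, 1) weights."""
    m = x.shape[0]
    for _ in range(num_iters):
        h = 1 / (1 + np.exp(-x.dot(theta)))  # sigmoid of the scores
        # vectorized cost J(theta), useful for monitoring convergence
        J = -(y.T.dot(np.log(h)) + (1 - y).T.dot(np.log(1 - h))) / m
        theta = theta - (alpha / m) * x.T.dot(h - y)  # gradient step
    return float(J), theta
```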
The loss function is a complicated expression, so let's break it down. The sum takes the loss over every training example and averages it, and the leading negative sign makes the final value positive (the log of a positive number less than 1 is always negative).
The first term inside the brackets is

$$y^{(i)} \log h(x^{(i)}, \theta)$$
| y | h | y·log(h) |
|---|---|---|
| 0 | any | 0 |
| 1 | 0.99 | ~0 |
| 1 | ~0 | -inf |
In other words, the closer the prediction is to the label, the closer this term is to 0 (y = 0 with h = 1 is the special case this term cannot penalize). Now look at the second term:

$$(1-y^{(i)}) \log\left(1-h(x^{(i)}, \theta)\right)$$
| y | h | (1−y)·log(1−h) |
|---|---|---|
| 1 | any | 0 |
| 0 | 0.01 | ~0 |
| 0 | ~1 | -inf |
This behaves like the first term, except that now y = 1 with h = 0 is the special case. Adding the two terms together:
| y | h | sum of both terms |
|---|---|---|
| 1 | ~0 | -inf |
| 1 | ~1 | ~0 |
| 0 | ~0 | ~0 |
| 0 | ~1 | -inf |
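A quick numerical check of these cases; the negative sign in J flips the -inf / ~0 values in the tables into large / tiny positive losses:

```python
import numpy as np

def example_loss(y, h):
    # per-example negative log-likelihood
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(example_loss(1, 0.99))  # ~0.01: confident and correct -> tiny loss
print(example_loss(1, 0.01))  # ~4.6:  confident and wrong  -> large loss
print(example_loss(0, 0.01))  # ~0.01
print(example_loss(0, 0.99))  # ~4.6
```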
Download link