I've recently been wanting to teach myself natural language processing, so I found the Stanford CS224N lectures online and, following the trail, dug up some of the assignments to work through. Download link: the Stanford CS224N course website.
If you run into problems importing packages, just install the missing ones by name, one at a time. A mirror inside China is recommended: press Win+R, type cmd, and in that window run:
pip install 包名 -i https://pypi.tuna.tsinghua.edu.cn/simple
This problem asks you to collect the distinct words from a list of word lists; it's really just meant to get you comfortable with Python.
I use a set here (sets are backed by a hash table, so lookups are fast). On each iteration, take the union of the accumulated set and the new list:
## Question 1.1: Implement distinct_words [code] (2 points)
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    # ------------------
    # Write your implementation here.
    corpus_words = set()
    for item in corpus:
        words = set(item)
        corpus_words = corpus_words | words
    num_corpus_words = len(corpus_words)
    corpus_words = sorted(list(corpus_words))
    # ------------------
    return corpus_words, num_corpus_words
Let's test it:
>>> --------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------
The second problem is building the co-occurrence matrix. First look at the two things it has to return:
word2Ind is a dictionary whose keys are words and whose values are each word's index in the sorted list returned in Question 1.1. With it we can locate a position in the matrix in $O(1)$ time. (For how to create an all-zeros matrix, see the code below.)
M is a matrix that records which words appear to the left and right of each word. We only need a single pass over each sentence; at every position we look at the window_size variable and count all the words within $\pm$window_size of the current word. A quick recap of why such a window works, as covered in lecture: the window is exactly the context the word appears in, so two words that show up in similar windows (i.e., similar contexts) generally have similar meanings, and words with similar meanings end up clustered together.
If you've forgotten how to use numpy, see my earlier notes on the numpy package (数据可视化学习笔记【一】).
The implementation:
## Question 1.2: Implement compute_co_occurrence_matrix [code] (3 points)
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    # ------------------
    # Write your implementation here.
    # Map each word to its row/column index in the sorted word list.
    word2Ind = {word: i for i, word in enumerate(words)}
    M = np.zeros((num_words, num_words))
    for sentence in corpus:
        for i, word in enumerate(sentence):
            # Count every word within +/- window_size positions of i (skipping i itself).
            for j in range(i - window_size, i + window_size + 1):
                if j < 0 or j >= len(sentence):
                    continue
                if j != i:
                    M[word2Ind[word], word2Ind[sentence[j]]] += 1
    # ------------------
    return M, word2Ind
Run it:
>>> --------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------
This problem is just importing a package and doing an SVD: $M = U\Sigma V^{T}$, where $U$ and $V$ are unitary matrices and $\Sigma$ is a diagonal matrix with $\mathrm{rank}(\Sigma)=\mathrm{rank}(M)$. We keep the $k$ largest singular values and obtain the dimension-reduced matrix $U\Sigma$.
The original $10 \times 10$ matrix gets reduced to a $10 \times 2$ one:
[0.65480209 0.78322112]
[5.20200324e-01 -1.56599893e-15]
[0.70564718 -0.48405727]
[0.70564718 0.48405727]
[1.02780472e+00 1.01204090e-15]
[0.65480209 -0.78322112]
[0.38225849 -0.656224 ]
[0.38225849 0.656224 ]
[1.39420808 1.06179274]
[1.39420808 -1.06179274]
Why do this step at all? As mentioned in lecture, we are three-dimensional creatures and have a hard time picturing high-dimensional spaces; the 10-dimensional space represented by that $10 \times 10$ matrix is clearly beyond what a human can visualize, so we reduce its dimensionality. (A small guess of my own: since this also acts a bit like adding a small perturbation, could it help keep the model from overfitting?)
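To convince myself of the math, here is a tiny numpy sketch of my own (not part of the assignment; the $10 \times 10$ matrix is just random stand-in data) showing that keeping the top-$k$ singular values of the full SVD gives the same $U\Sigma$ embedding, up to column signs, that scikit-learn's TruncatedSVD.fit_transform returns:

import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
M = rng.random((10, 10))            # stand-in for a 10 x 10 co-occurrence matrix
k = 2

# Full SVD: M = U @ diag(S) @ Vt, with singular values S in descending order.
U, S, Vt = np.linalg.svd(M)
M_reduced_full = U[:, :k] * S[:k]   # U_k * Sigma_k, shape (10, 2)

# TruncatedSVD computes (approximately) the same U_k * Sigma_k directly.
M_reduced_trunc = TruncatedSVD(n_components=k, n_iter=10).fit_transform(M)

# The columns can differ in sign, so compare absolute values.
print(np.max(np.abs(np.abs(M_reduced_full) - np.abs(M_reduced_trunc))))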
The implementation:
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # Write your implementation here.
    handle = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = handle.fit_transform(M)  # returns U * Sigma, shape (num_corpus_words, k)
    # ------------------
    print("Done.")
    return M_reduced
Output:
Done.
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------
This problem is just making a plot, to get familiar with what the matplotlib module offers.
The code:
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus, 2)): matrix of 2-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    # Write your implementation here.
    fig = plt.figure()
    plt.style.use("seaborn-whitegrid")
    for word in words:
        point = M_reduced[word2Ind[word]]
        plt.scatter(point[0], point[1], marker="^")
        # Label each point, offset slightly above the marker.
        plt.annotate(word, xy=(point[0], point[1]), xytext=(point[0], point[1] + 0.1))
    # ------------------
The resulting plot:
>>> Outputted Plot:
--------------------------------------------------------------------------------
I feel like my plot looks a tiny bit nicer than the sample they give.
Just run the code they provide to produce the plot:
# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting
words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)
The resulting plot:
The instructors also pose a few questions:
Q:
Remark: Note: “bpd” stands for “barrels per day” and is a commonly used abbreviation in crude oil topic articles.
A:
-------------------------- End of Part 1 -----------------------------
For this part it's best to get the dataset downloaded ahead of time, otherwise the download fails very easily.
This problem is a comparison: the embeddings we produced earlier by SVD-reducing the co-occurrence matrix, versus the coordinates that come directly from the GloVe embeddings dataset itself.
Because the co-occurrence matrix is extremely sparse and the data is huge, the assignment kindly has us build everything from just 10,000 words.
We're asked to compare whether the point coordinates differ between the co-occurrence-matrix embeddings and the GloVe embeddings.
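For reference, here is a rough sketch of how such a 10,000-word matrix could be loaded. This is my own sketch, not the notebook's loader: the helper name load_glove_matrix is made up, and I'm assuming gensim 4.x and the 200-dimensional "glove-wiki-gigaword-200" model from the gensim downloader, which may not be exactly the file the assignment ships with.

import numpy as np
import gensim.downloader as api

def load_glove_matrix(n_words=10000):
    # The first call downloads the model, which is exactly the step that tends to fail.
    wv_from_bin = api.load("glove-wiki-gigaword-200")
    words = wv_from_bin.index_to_key[:n_words]         # first n_words vocabulary entries
    word2Ind = {w: i for i, w in enumerate(words)}
    M = np.stack([wv_from_bin[w] for w in words])      # shape (n_words, 200)
    return M, word2Ind, wv_from_bin

The resulting M can then go through reduce_to_k_dim and row normalization exactly like the co-occurrence matrix did.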
The comparison is again over those same 10 words; run the following code:
words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced_normalized, word2Ind, words)
The result:
Now to answer the question posed in the assignment:
Q:
The earlier plot is needed again, so here the two plots are put side by side for comparison:
A:
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting
Honestly I don't know how the coordinates of these words in the imported dataset were produced, so the code is all I have to go on. Without this normalization step, the plot looks like this:
This is only my personal guess; if you see it differently, feel free to discuss in the comments.
This problem asks us to look for polysemy, i.e., words with more than one meaning. The semantic similarity of two words is determined by the inner product of their vectors. (Since every word vector has already been normalized so that its distance to the origin is $1$, the inner product of two of them is exactly $\cos\alpha$ for the angle $\alpha$ between them.) The idea here is much like how distance and similarity are defined in cluster analysis.
We're pointed to an existing function, wv_from_bin.most_similar(word), to do the work. Under the hood it computes the inner product with every word in the vocabulary and returns the 10 words with the largest values.
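As a quick sanity check (my own, assuming wv_from_bin is a gensim KeyedVectors object and picking "oil" and "petroleum" purely as an example pair), the inner product of two unit-normalized vectors is exactly the cosine similarity the library reports:

import numpy as np

v1 = wv_from_bin["oil"] / np.linalg.norm(wv_from_bin["oil"])
v2 = wv_from_bin["petroleum"] / np.linalg.norm(wv_from_bin["petroleum"])

print(np.dot(v1, v2))                              # cos(alpha) between the two word vectors
print(wv_from_bin.similarity("oil", "petroleum"))  # should agree up to floating-point error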
My approach:
First let me introduce a notion I'll call the within-group deviation (something similar is used when checking the convergence of Markov chains). For each word, take its similarity set: the list returned by wv_from_bin.most_similar(word). Within that list, compute the inner product of every pair of words and take the minimum; call that the within-group deviation.
A set with the smallest within-group deviation means the word is highly similar to two words whose meanings are far apart from each other, which makes it a more likely case of polysemy.
The implementation:
def MultiMeaningWord(M_reduced_normalized, word2Ind):
    wordlst = []
    MultiMeans = ""
    n = len(M_reduced_normalized[0])  # embedding dimension
    MinVariance = 100                 # smallest within-group deviation seen so far
    for word in word2Ind.keys():
        lst = wv_from_bin.most_similar(word)  # top-10 (word, similarity) pairs
        cur = 100
        for i in range(10):
            try:
                xy_cur = M_reduced_normalized[word2Ind[lst[i][0]]]
            except KeyError:
                continue
            for j in range(i + 1, 10):
                try:
                    xy_nxt = M_reduced_normalized[word2Ind[lst[j][0]]]
                except KeyError:
                    continue
                # Inner product of two unit vectors = cosine similarity of the pair.
                temp = 0
                for k in range(n):
                    temp = temp + xy_cur[k] * xy_nxt[k]
                cur = min(cur, temp)
        if cur < MinVariance:
            wordlst = lst
            MinVariance = cur
            MultiMeans = word
    return MultiMeans, wordlst, MinVariance
The result:
>>> ('raisonné',
[('raisonne', 0.6827203035354614),
('catalogue', 0.6217593550682068),
('köchel', 0.5784828662872314),
('dictionnaire', 0.5532978773117065),
('recueil', 0.538439154624939),
('traité', 0.5328050851821899),
('hesiodic', 0.5189188718795776),
('études', 0.5103318691253662),
('etudes', 0.4762265682220459),
('encyclopédie', 0.4728423058986664)],
-0.9605955556035042)
A within-group deviation of $-0.96$ is a pretty small value. Now look at what the word means:
raisonné: reasoned, based on reasoning; carefully thought through
It doesn't seem to have much to do with the meanings of the other words, so I'm leaving this as an open question and would appreciate it if a reader could explain.
What often happens instead is that a word's several senses are all close to one another: words with similar meanings cluster together and their inner products are close to $1$, so directly taking the 10 most similar words very easily gives you a list made up entirely of synonyms.
The main goal of this problem is to find a synonym and an antonym. The Python implementation:
def Synonyms_Antonyms(word2Ind, w1):
    words = word2Ind.keys()
    far = ""
    furtherness = 100
    # Find the word (other than w1 itself) with the smallest cosine distance to w1.
    for word in words:
        if word == w1:
            continue
        temp = wv_from_bin.distance(w1, word)
        if furtherness > temp:
            furtherness, far = temp, word
    # Return the single most similar word, plus the closest word found above and its distance.
    return wv_from_bin.most_similar(w1)[0], (far, furtherness)
Finding a word that works is a bit of a hassle; I eventually landed on "satisfying". Running it:
Res = Synonyms_Antonyms(word2Ind, w1 = 'satisfying')
>>> (('enjoyable', 0.6154831051826477), ('unsatisfying', 0.4698140621185303))
So the synonym of "satisfying" is "enjoyable" and its antonym is "unsatisfying", which matches our intuition.
This part is about solving word analogies with word vectors. It starts by introducing a function:
wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.6978678703308105),
('princess', 0.6081745028495789),
('monarch', 0.5889754891395569),
('throne', 0.5775108933448792),
('prince', 0.5750998258590698),
('elizabeth', 0.5463595986366272),
('daughter', 0.5399125814437866),
('kingdom', 0.5318052172660828),
('mother', 0.5168544054031372),
('crown', 0.5164473056793213)]
What this returns are the words whose vectors are closest (by cosine similarity) to the sum of the positive word vectors minus the negative ones, i.e., words close to the positive words and far from the negative ones.
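To make that arithmetic concrete, here is a small sketch of my own (again assuming wv_from_bin is a gensim KeyedVectors object; unlike most_similar, this version does not exclude the query words, so "king" itself may top the list):

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# +1 weight for each positive word, -1 for each negative word, on unit-normalized vectors.
query = unit(wv_from_bin["woman"]) + unit(wv_from_bin["king"]) - unit(wv_from_bin["man"])

# similar_by_vector ranks the whole vocabulary by cosine similarity to the query vector.
print(wv_from_bin.similar_by_vector(query, topn=5))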
This problem asks us to come up with an analogy that holds correctly:
pprint.pprint(wv_from_bin.most_similar(positive=['satisfying', 'exciting'], negative=['unsatisfying']))
>>> [('interesting', 0.6445490121841431),
('really', 0.6026532649993896),
('very', 0.6022480726242065),
('excited', 0.596916675567627),
('wonderful', 0.5959773063659668),
('quite', 0.5956001281738281),
('truly', 0.5935688018798828),
('definitely', 0.5903993248939514),
('entertaining', 0.5786590576171875),
('fun', 0.56939697265625)]
This problem asks us to produce an analogy that comes out wrong, which is actually pretty hard to find. If the positive words contain a pair of antonyms, the result ends up imprecise:
pprint.pprint(wv_from_bin.most_similar(positive=['output', 'input'], negative=['energy']))
>>> [('outputs', 0.6508897542953491),
('inputs', 0.6220414638519287),
('voltage', 0.4847225546836853),
('waveform', 0.4809161126613617),
('audio', 0.46772128343582153),
('amplifier', 0.46416085958480835),
('corresponding', 0.45216110348701477),
('impedance', 0.4518190026283264),
('non-inverting', 0.4489710330963135),
('sequential', 0.4211637079715729)]
This section is about bias analysis.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'worker'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'worker'], negative=['woman']))
The output:
>>> [('employee', 0.6375863552093506),
('workers', 0.6068919897079468),
('nurse', 0.5837947130203247),
('pregnant', 0.5363885760307312),
('mother', 0.5321309566497803),
('employer', 0.5127025842666626),
('teacher', 0.5099577307701111),
('child', 0.5096741914749146),
('homemaker', 0.5019455552101135),
('nurses', 0.4970571994781494)]
[('workers', 0.611325740814209),
('employee', 0.5983108878135681),
('working', 0.5615329742431641),
('laborer', 0.5442320108413696),
('unemployed', 0.5368517637252808),
('job', 0.5278826951980591),
('work', 0.5223963260650635),
('mechanic', 0.5088937282562256),
('worked', 0.5054520964622498),
('factory', 0.4940453767776489)]
What the output shows: 'woman' + 'worker' pulls in words like 'nurse', 'pregnant', and 'homemaker', while 'man' + 'worker' pulls in 'laborer', 'mechanic', and 'factory', so a gender bias already shows up here.
I once read an article claiming that machine learning is biased. That's true: a model's results depend on its training set, so the bias that machine learning produces is really just our own human bias.
We're asked to look for more examples of bias. People are always going on about "women drivers" versus "men drivers", so let's check whether the vectors carry a gender bias around driving:
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'car'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'car'], negative=['woman']))
>>> [('vehicle', 0.6337087750434875),
('cars', 0.6253966689109802),
('driver', 0.6123777031898499),
('truck', 0.5899932384490967),
('minivan', 0.5488290190696716),
('driving', 0.5473644733428955),
('mercedes', 0.5350144505500793),
('parked', 0.5255646109580994),
('vehicles', 0.521051287651062),
('automobile', 0.5183522701263428)]
[('cars', 0.7136538624763489),
('vehicle', 0.6922875642776489),
('truck', 0.6608046293258667),
('driver', 0.6462159752845764),
('driving', 0.6076016426086426),
('vehicles', 0.5946481227874756),
('motorcycle', 0.5647350549697876),
('drivers', 0.5344247221946716),
('racing', 0.5336049795150757),
('parked', 0.5304452180862427)]
From this we can see that 'woman' + 'car' leans toward 'minivan' and 'mercedes', while 'man' + 'car' leans toward 'motorcycle', 'racing', and 'drivers', so the driving stereotype shows up in the vectors as well.
I touched briefly on where this bias comes from earlier; to sum up: the word vectors are trained on text written by people, so whatever biases are present in that text end up baked into the vectors.