# All Import Statements Defined Here
# Note: Do not add to this list.
# All the dependencies you need can be installed by running .
# ----------------
import ssl
_create_unverified_https_context = ssl._create_unverified_context
ssl._create_default_https_context = _create_unverified_https_context
import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
START_TOKEN = '<START>'
END_TOKEN = '<END>'
np.random.seed(0)
random.seed(0)
# ----------------
The import cell can fail because the NLTK download server cannot be reached. As a workaround, I downloaded the reuters archive manually and placed it under C:\Users\Administrator\AppData\Roaming\nltk_data\corpora (note that Roaming is a hidden folder).
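If you go this route, one way to make sure NLTK finds the local copy is to add the directory to its search path; a minimal sketch (the path below is the one mentioned above and should be adjusted to your machine):

import nltk
# Assumed local path; adjust to wherever your nltk_data directory actually lives.
nltk.data.path.append(r"C:\Users\Administrator\AppData\Roaming\nltk_data")
# After this, nltk.download('reuters') can be skipped and
# `from nltk.corpus import reuters` will use the local copy.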
Word vectors are often used as a fundamental component in downstream NLP tasks, e.g., question answering, text generation, and machine translation, so it is important to build some intuition about their strengths and weaknesses. Here you will explore two types of word vectors: those derived from a co-occurrence matrix and those derived via word2vec.
Note: the terms "word vectors" and "word embeddings" are often used interchangeably. A word vector encodes the meaning of a word in a lower-dimensional space. As Wikipedia puts it, "conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension".
Most word vector models stem from the following idea:
"You shall know a word by the company it keeps" (https://en.wikipedia.org/wiki/John_Rupert_Firth; the link may require a proxy to access). Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, are used in similar contexts. As a result, similar words are typically spoken or written together with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts. Here we elaborate on one such strategy, co-occurrence matrices (for more information, see http://web.stanford.edu/class/cs124/lec/vectorsemantics.video.pdf or https://medium.com/data-science-group-iitr/word-embedding-2d05d270b285).
Co-occurrence
A co-occurrence matrix counts how often words co-occur in some context. Given a word $w_i$ occurring in a document, we consider the context window surrounding it: the $n$ words before and the $n$ words after it, where $n$ is a fixed window size. We then build a co-occurrence matrix $M$, a symmetric word-by-word matrix in which $M_{ij}$ is the number of times $w_j$ appears inside $w_i$'s window.
For example:
Document 1: "all that glitters is not gold"
Document 2: "all is well that ends well"
Note:
In NLP, we often add START and END tokens to mark the beginning and end of sentences, paragraphs, or documents. Here we imagine START and END tokens encapsulating each document, e.g., "START All that glitters is not gold END", and we include these tokens in our co-occurrence counts.
The rows (or columns) of this matrix provide one type of word vector (one based on word-word co-occurrence), but these vectors are typically large (linear in the number of distinct words in the corpus). We therefore reduce their dimensionality next. In particular, we run SVD (Singular Value Decomposition), a kind of generalized PCA (Principal Component Analysis), to select the top $k$ principal components. Picturing the usual SVD diagram: our co-occurrence matrix is $A$, with $n$ rows corresponding to the $n$ words. We obtain a full matrix decomposition, with the singular values ordered along the diagonal of the matrix $S$, and our new, shorter, length-$k$ word vectors in $U_k$.
This reduced-dimensionality co-occurrence representation preserves semantic relationships between words; for example, doctor and hospital will be closer together than doctor and dog.
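As a minimal NumPy sketch of this reduction (the small matrix below is a made-up stand-in for a real co-occurrence matrix; only the truncation step is the point):

import numpy as np
A = np.array([[2., 1., 0.],
              [1., 0., 1.],
              [0., 1., 2.]])        # toy word-by-word co-occurrence matrix (n = 3)
k = 2
U, S, Vt = np.linalg.svd(A)         # full SVD: A = U @ np.diag(S) @ Vt, singular values sorted
word_vectors_k = U[:, :k] * S[:k]   # length-k word vectors: U_k scaled by the top singular values
print(word_vectors_k.shape)         # (3, 2)

This U_k * S_k scaling is the same quantity that sklearn's TruncatedSVD.fit_transform returns in the exercise further below.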
Here we will be using the Reuters (business and financial news) corpus. If you have not yet run the import cell at the top of this page, run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. The documents span 90 categories and are split into train and test sets. We provide the read_corpus function below, which pulls out only articles from the "crude" category (i.e., news articles about oil, gas, etc.). The function also adds START and END tokens to each document and lowercases the words. You do not have to perform any other kind of preprocessing.
def read_corpus(category="crude"):
    """ Read files from the specified Reuters category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    # files is the list of document ids for this category
    files = reuters.fileids(category)
    # The result is a list of lists: each inner list is one article,
    # wrapped in START/END tokens and lowercased.
    # Note: [START_TOKEN] + [w.lower() for f in files for w in list(reuters.words(f))] + [END_TOKEN]
    # would instead flatten every article into one single list.
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]
A Reuters story can cover several topics at once, i.e., the relation between files and categories is one-to-many. For example:
Query the articles for the topic 'barley': reuters.fileids('barley')
-> ['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', …]
Query the topics of a given article: reuters.categories('training/9865')
-> ['barley', 'corn', 'grain', 'wheat']
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:1], compact=True, width=100)
[['<START>', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',
  'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',
  'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',
  'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',
  'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',
  'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',
  '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',
  'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',
  'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',
  'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',
  'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',
  'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',
  'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',
  'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',
  'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', "'", 's',
  'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',
  'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',
  'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',
  'they', 'noted', '.', '<END>']]
Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with for loops, but it is more efficient to do it with Python list comprehensions.
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    # ------------------
    # Write your implementation here.
    # Flatten the corpus into one list of tokens, then deduplicate and sort.
    temp = [y for x in corpus for y in x]
    corpus_words = sorted(set(temp))
    num_corpus_words = len(corpus_words)
    # ------------------
    # The returned list contains every distinct word in the corpus in alphabetical order.
    return corpus_words, num_corpus_words
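A quick, informal sanity check for distinct_words on a made-up toy corpus (not part of the original notebook):

test_corpus = [
    [START_TOKEN, "all", "that", "glitters", "is", "not", "gold", END_TOKEN],
    [START_TOKEN, "all", "is", "well", "that", "ends", "well", END_TOKEN],
]
test_words, num_test_words = distinct_words(test_corpus)
print(test_words)        # sorted distinct tokens, START/END tokens included
print(num_test_words)    # expect 10 for this toy corpus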
Write a method that constructs a co-occurrence matrix for a window of size $n$ (with a default of 4).
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    word2Ind = {}
    M = np.zeros((num_words, num_words))
    # ------------------
    word2Ind = {w: i for i, w in enumerate(words)}
    for doc in corpus:
        for i, word in enumerate(doc):
            # Scan the window around position i, clipping at the document boundaries
            # and skipping the center word itself.
            for j in range(i - window_size, i + window_size + 1):
                if j < 0 or j >= len(doc):
                    continue
                if j != i:
                    M[word2Ind[word], word2Ind[doc[j]]] += 1
    # ------------------
    return M, word2Ind
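The same toy corpus from above gives an informal check of compute_co_occurrence_matrix (again just an illustration, not the notebook's official test):

M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
print(M_test.shape)      # (10, 10): one row/column per distinct word
# With window_size=1, "all" sits next to the START token once in each toy document,
# so the corresponding count should be 2.
print(M_test[word2Ind_test["all"], word2Ind_test[START_TOKEN]])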
Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings.
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # TruncatedSVD.fit_transform returns U_k * S_k, i.e. the k-dimensional embeddings.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------
    print("Done.")
    return M_reduced
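And a minimal shape check for reduce_to_k_dim (random toy input, used only for illustration):

M_test = np.random.rand(10, 10)          # stand-in for a small co-occurrence matrix
M_test_reduced = reduce_to_k_dim(M_test, k=2)
print(M_test_reduced.shape)              # expect (10, 2)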
Here you will write a function to plot a set of 2D vectors in 2D space.
def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        Params:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # ------------------
    for word in words:
        # Look up the 2D coordinates of this word and draw a labelled point.
        coord = M_reduced[word2Ind[word]]
        x, y = coord[0], coord[1]
        plt.scatter(x, y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    # Show the figure once, after all requested words have been plotted.
    plt.show()
    # ------------------
# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# The plot produced should look like the "test solution plot" depicted below.
# ---------------------
print ("-" * 80)
print ("Outputted Plot:")
M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)
print ("-" * 80)
Now we will put together all the parts you have written! We will compute the co-occurrence matrix with a fixed window of 4 over the Reuters "crude" corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U*S, so we normalize the returned vectors so that all of them lie around the unit circle (so that closeness is directional closeness).
Run the cell below to produce the plot. It may take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might have thought should? Note: "bpd" stands for "barrels per day" and is a commonly used abbreviation in crude-oil topic articles.
# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)
# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting
words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)
As discussed in class, more recently prediction-based word vectors have come into fashion, e.g., word2vec. Here, we shall explore the embeddings produced by word2vec. For more details on the word2vec algorithm, re-read the class notes and lecture slides.
def load_word2vec(embeddings_fp="./GoogleNews-vectors-negative300.bin"):
    """ Load Word2Vec Vectors
        Param:
            embeddings_fp (string) - path to .bin file of pretrained word vectors
        Return:
            wv_from_bin: All 3 million embeddings, each length 300
            This is the KeyedVectors format: https://radimrehurek.com/gensim/models/deprecated/keyedvectors.html
    """
    embed_size = 300
    print("Loading 3 million word vectors from file...")
    # Load the pretrained vectors from a locally downloaded .bin file.
    wv_from_bin = KeyedVectors.load_word2vec_format(embeddings_fp, binary=True)
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin
wv_from_bin = load_word2vec()
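As a side note, if downloading GoogleNews-vectors-negative300.bin by hand is inconvenient, gensim's downloader module can fetch the same pretrained vectors; this is an alternative sketch, not what the notebook above does:

import gensim.downloader as api
# The first call downloads roughly 1.6 GB and caches it under ~/gensim-data.
wv_from_bin = api.load("word2vec-google-news-300")   # returns a KeyedVectors object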
Let's compare the word2vec embeddings directly against the co-occurrence matrix embeddings. Run the following cells to:
1. Put the 3 million word2vec vectors into a matrix M.
2. Run reduce_to_k_dim (your Truncated SVD function) to reduce the vectors from 300 dimensions to 2 dimensions.
def get_matrix_of_vectors(wv_from_bin):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    words = list(wv_from_bin.vocab.keys())
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind
M, word2Ind = get_matrix_of_vectors(wv_from_bin)
M_reduced = reduce_to_k_dim(M, k=2)
words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced, word2Ind, words)
Now that we have word vectors, we need a way to quantify the similarity between individual words according to these vectors. One such metric is cosine similarity. We will be using this to find words that are "close" and "far" from one another.
We can think of n-dimensional vectors as points in n-dimensional space. If we take this perspective, the L1 and L2 distances quantify the amount of space we must travel to get between two such points. An alternative approach is to examine the angle between the two vectors.
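For reference, the cosine similarity $s$ between two vectors $p$ and $q$ is the cosine of that angle, $s = \frac{p \cdot q}{\lVert p \rVert \, \lVert q \rVert}$, so $s \in [-1, 1]$; larger values mean the vectors point in more similar directions.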
Polysemes are words with multiple meanings. Find a polyseme (for example, "leaves" or "scoop") such that the top-10 most similar words (according to cosine similarity) contain related words from both meanings. For example, "leaves" has both "vanish" and "stalks" in its top 10, and "scoop" has both "handed_waffle_cone" and "lowdown". You will probably need to try several polysemes before you find one. Please state the polyseme you discover and the multiple meanings that occur in its top 10.
wv_from_bin.most_similar("happy")
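The cell above only probes "happy"; the actual exploration would look something like the sketch below, where "leaves" is just one candidate polyseme to inspect (any word could be substituted):

# Check whether the top-10 neighbours mix both senses of the word
# (e.g. the plant sense vs. the "departs" sense of "leaves").
pprint.pprint(wv_from_bin.most_similar("leaves"))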
When thinking about cosine similarity, it is often more convenient to think in terms of cosine distance, which is simply 1 minus the cosine similarity.
Find three words (w1, w2, w3) where w1 and w2 are synonyms and w1 and w3 are antonyms, but Cosine Distance(w1, w3) < Cosine Distance(w1, w2). For example, w1 = "happy" is closer to the antonym w3 = "sad" than to the synonym w2 = "cheerful".
# ------------------
# Write your synonym & antonym exploration code here.
w1 = "sleep"
w2 = "nap"
w3 = "awake"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)
print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
# ------------------
Synonyms sleep, nap have cosine distance: 0.38177746534347534
Antonyms sleep, awake have cosine distance: 0.4313364028930664
Word2Vec vectors have been shown to sometimes exhibit the ability to solve analogies.
For example: for the analogy "man : king :: woman : x", what is x?
In the cell below, we show you how to use word vectors to find x. The most_similar function finds words that are most similar to the words in the positive list and most dissimilar from the words in the negative list. The answer to the analogy will be the word ranked most similar (with the largest numerical value).
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))
[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.518113374710083),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]
Find an example of an analogy that holds according to these vectors (i.e., the intended word is ranked at the top). In your solution, please state the full analogy in the form x:y :: a:b. If you believe the analogy is complicated, explain in one or two sentences why it holds.
pprint.pprint(wv_from_bin.most_similar(positive=['english','spain'], negative=['canada']))
[('spanish', 0.5764416456222534),
 ('portugal', 0.568500816822052),
 ('raul', 0.543082594871521),
 ('lyk', 0.5288429856300354),
 ('messi', 0.5282317996025085),
 ('robben', 0.5281049609184265),
 ('madrid', 0.5262929201126099),
 ('barcelona', 0.5260317325592041),
 ('Institute_ITRI_eng', 0.5254518985748291),
 ('portuguese', 0.522739589214325)]
Find an example of an analogy that does not hold according to these vectors. In your solution, state the intended analogy in the form x:y :: a:b, and state the (incorrect) value of b according to the word vectors.
pprint.pprint(wv_from_bin.most_similar(positive=['rome', 'spain'], negative=['italy']))
[('carlos', 0.504329264163971),
 ('samuel', 0.4907485842704773),
 ('albert', 0.48940616846084595),
 ('dubai', 0.48854902386665344),
 ('madrid', 0.48699095845222473),
 ('jh', 0.48649877309799194),
 ('cra', 0.4857587218284607),
 ('eddie', 0.4807842969894409),
 ('melrose', 0.47857555747032166),
 ('andre', 0.47524213790893555)]
It is important to be cognizant of the biases (gender, race, sexual orientation, etc.) implicit in our word embeddings.
Run the cell below to examine (a) which terms are most similar to "woman" and "boss" and most dissimilar to "man", and (b) which terms are most similar to "man" and "boss" and most dissimilar to "woman".
# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))
[('bosses', 0.5522644519805908),
 ('manageress', 0.49151360988616943),
 ('exec', 0.459408164024353),
 ('Manageress', 0.45598435401916504),
 ('receptionist', 0.4474116861820221),
 ('Jane_Danson', 0.44480547308921814),
 ('Fiz_Jennie_McAlpine', 0.44275766611099243),
 ('Coronation_Street_actress', 0.44275569915771484),
 ('supremo', 0.4409852921962738),
 ('coworker', 0.4398624897003174)]

[('supremo', 0.6097397804260254),
 ('MOTHERWELL_boss', 0.5489562153816223),
 ('CARETAKER_boss', 0.5375303626060486),
 ('Bully_Wee_boss', 0.5333974361419678),
 ('YEOVIL_Town_boss', 0.5321705341339111),
 ('head_honcho', 0.5281980037689209),
 ('manager_Stan_Ternent', 0.525971531867981),
 ('Viv_Busby', 0.5256163477897644),
 ('striker_Gabby_Agbonlahor', 0.5250812768936157),
 ('BARNSLEY_boss', 0.5238943099975586)]
Use the most_similar function to find another case where some bias is exhibited by the vectors. Please briefly explain the example of bias that you discover.
pprint.pprint(wv_from_bin.most_similar(positive=['woman','nurse'], negative=['man']))
[('registered_nurse', 0.7375059127807617),
 ('nurse_practitioner', 0.6650707721710205),
 ('midwife', 0.6506887674331665),
 ('nurses', 0.6448697447776794),
 ('nurse_midwife', 0.6239830255508423),
 ('birth_doula', 0.5852459669113159),
 ('neonatal_nurse', 0.5670714974403381),
 ('dental_hygienist', 0.5668443441390991),
 ('lactation_consultant', 0.566798985004425),
 ('respiratory_therapist', 0.5652169585227966)]