Let’s have a quick refresher on the word2vec algorithm. The key insight behind word2vec is that ‘a word is known by the company it keeps’. Concretely, suppose we have a ‘center’ word $c$ and a contextual window surrounding $c$. We shall refer to words that lie in this contextual window as ‘outside words’. For example, in Figure 1 we see that the center word $c$ is ‘banking’. Since the context window size is 2, the outside words are ‘turning’, ‘into’, ‘crises’, and ‘as’.
The goal of the skip-gram word2vec algorithm is to accurately learn the probability distribution $P(O \mid C)$. Given a specific word $o$ and a specific word $c$, we want to calculate $P(O = o \mid C = c)$, which is the probability that word $o$ is an ‘outside’ word for $c$, i.e., the probability that $o$ falls within the contextual window of $c$.
In word2vec, the conditional probability distribution is given by taking vector dot-products and applying the softmax function:
$$P(O = o \mid C = c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)} \tag{1}$$
Here, $u_o$ is the ‘outside’ vector representing outside word $o$, and $v_c$ is the ‘center’ vector representing center word $c$. To contain these parameters, we have two matrices, $U$ and $V$. The columns of $U$ are all the ‘outside’ vectors $u_w$. The columns of $V$ are all of the ‘center’ vectors $v_w$. Both $U$ and $V$ contain a vector for every $w \in$ Vocabulary.[^1]
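To make Equation (1) concrete, here is a minimal NumPy sketch (not the assignment’s starter code) that computes $P(O = o \mid C = c)$; the names `softmax` and `naive_softmax_prob` and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def naive_softmax_prob(U, v_c, o):
    """P(O = o | C = c) from Equation (1).

    U   : (d, V) matrix whose columns are the outside vectors u_w
    v_c : (d,)   center vector for word c
    o   : int    index of the outside word
    """
    scores = U.T @ v_c        # u_w^T v_c for every w in the vocabulary
    y_hat = softmax(scores)   # predicted distribution P(O | C = c)
    return y_hat[o]

# Toy usage: 5-word vocabulary, 3-dimensional embeddings.
rng = np.random.default_rng(0)
U, v_c = rng.normal(size=(3, 5)), rng.normal(size=3)
print(naive_softmax_prob(U, v_c, o=2))
```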
Recall from lectures that, for a single pair of words $c$ and $o$, the loss is given by:
$$J_{\text{naive-softmax}}(v_c, o, U) = -\log P(O = o \mid C = c) \tag{2}$$
Another way to view this loss is as the cross-entropy[^2] between the true distribution $y$ and the predicted distribution $\hat{y}$. Here, both $y$ and $\hat{y}$ are vectors with length equal to the number of words in the vocabulary. Furthermore, the $k^{th}$ entry in these vectors indicates the conditional probability of the $k^{th}$ word being an ‘outside word’ for the given $c$. The true empirical distribution $y$ is a one-hot vector with a 1 for the true outside word $o$, and 0 everywhere else. The predicted distribution $\hat{y}$ is the probability distribution $P(O \mid C = c)$ given by our model in equation (1).
Show that the naive-softmax loss given in Equation (2) is the same as the cross-entropy loss between $y$ and $\hat{y}$; i.e., show that
$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -\log(\hat{y}_o) \tag{3}$$
Since $y$ is the one-hot vector for the true outside word $o$,
$$y_w = \begin{cases} 1, & w = o \\ 0, & w \neq o \end{cases}$$
every term of the sum except the one with $w = o$ vanishes:
$$-\sum_{w \in \text{Vocab}} y_w \log(\hat{y}_w) = -y_o \log(\hat{y}_o) = -\log(\hat{y}_o)$$
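A quick numerical check of this equivalence, using arbitrary toy values for $\hat{y}$:

```python
import numpy as np

# Numerical check of Equation (3): with a one-hot y, the full cross-entropy
# sum collapses to -log(y_hat[o]). The values below are arbitrary.
y_hat = np.array([0.1, 0.2, 0.6, 0.1])   # a predicted distribution over a toy vocabulary
o = 2
y = np.zeros_like(y_hat)
y[o] = 1.0                               # one-hot true distribution

cross_entropy = -np.sum(y * np.log(y_hat))
print(np.isclose(cross_entropy, -np.log(y_hat[o])))   # True
```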
Compute the partial derivative of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to $v_c$. Please write your answer in terms of $y$, $\hat{y}$, and $U$.
$$\begin{aligned} \frac{\partial}{\partial v_c} J_{\text{naive-softmax}} &= -\frac{\partial}{\partial v_c} \log P(O = o \mid C = c) \\ &= -\frac{\partial}{\partial v_c} \log \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \\ &= -\frac{\partial}{\partial v_c} \log \exp(u_o^\top v_c) + \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^\top v_c) \\ &= -u_o + \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \frac{\partial}{\partial v_c} \sum_{x=1}^{V} \exp(u_x^\top v_c) \\ &= -u_o + \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \sum_{x=1}^{V} \exp(u_x^\top v_c) \frac{\partial}{\partial v_c} u_x^\top v_c \\ &= -u_o + \frac{1}{\sum_{w=1}^{V} \exp(u_w^\top v_c)} \sum_{x=1}^{V} \exp(u_x^\top v_c)\, u_x \\ &= -u_o + \sum_{x=1}^{V} \frac{\exp(u_x^\top v_c)}{\sum_{w=1}^{V} \exp(u_w^\top v_c)}\, u_x \\ &= -u_o + \sum_{x=1}^{V} P(O = x \mid C = c)\, u_x \\ &= -U y + U \hat{y} \\ &= U(\hat{y} - y) \end{aligned}$$
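As a sanity check on the closed form $U(\hat{y} - y)$, the sketch below compares it against a central-difference estimate of the gradient; the helper names and toy dimensions are assumptions, not part of the assignment.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def loss(U, v_c, o):
    # J_naive-softmax(v_c, o, U) = -log P(O = o | C = c)
    return -np.log(softmax(U.T @ v_c)[o])

def grad_vc(U, v_c, o):
    # Closed form derived above: dJ/dv_c = U (y_hat - y)
    y_hat = softmax(U.T @ v_c)
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    return U @ (y_hat - y)

# Central-difference check of the analytic gradient on toy data.
rng = np.random.default_rng(1)
U, v_c, o, eps = rng.normal(size=(3, 5)), rng.normal(size=3), 4, 1e-6
numeric = np.array([(loss(U, v_c + eps * e, o) - loss(U, v_c - eps * e, o)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(numeric, grad_vc(U, v_c, o)))   # True
```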
Compute the partial derivatives of $J_{\text{naive-softmax}}(v_c, o, U)$ with respect to each of the ‘outside’ word vectors, $u_w$. There will be two cases: when $w = o$, the true ‘outside’ word vector, and $w \neq o$, for all other words. Please write your answer in terms of $y$, $\hat{y}$, and $v_c$.
$$\begin{aligned} \frac{\partial}{\partial u_w} J_{\text{naive-softmax}} &= -\frac{\partial}{\partial u_w} \log \frac{\exp(u_o^\top v_c)}{\sum_{m=1}^{V} \exp(u_m^\top v_c)} \\ &= -\frac{\partial}{\partial u_w} \log \exp(u_o^\top v_c) + \frac{\partial}{\partial u_w} \log \sum_{m=1}^{V} \exp(u_m^\top v_c) \end{aligned}$$
When $w = o$:
$$\begin{aligned} \frac{\partial}{\partial u_o} J_{\text{naive-softmax}} &= -v_c + \frac{1}{\sum_{m=1}^{V} \exp(u_m^\top v_c)} \sum_{n=1}^{V} \frac{\partial}{\partial u_o} \exp(u_n^\top v_c) \\ &= -v_c + \frac{1}{\sum_{m=1}^{V} \exp(u_m^\top v_c)} \frac{\partial}{\partial u_o} \exp(u_o^\top v_c) \\ &= -v_c + \frac{\exp(u_o^\top v_c)}{\sum_{m=1}^{V} \exp(u_m^\top v_c)}\, v_c \\ &= -v_c + P(O = o \mid C = c)\, v_c \\ &= (P(O = o \mid C = c) - 1)\, v_c \end{aligned}$$
When $w \neq o$:
$$\begin{aligned} \frac{\partial}{\partial u_w} J_{\text{naive-softmax}} &= \frac{\partial}{\partial u_w} \log \sum_{m=1}^{V} \exp(u_m^\top v_c) \\ &= \frac{\exp(u_w^\top v_c)}{\sum_{m=1}^{V} \exp(u_m^\top v_c)}\, v_c \\ &= P(O = w \mid C = c)\, v_c \\ &= (P(O = w \mid C = c) - 0)\, v_c \end{aligned}$$
In summary:
$$\frac{\partial}{\partial u_w} J_{\text{naive-softmax}} = (\hat{y}_w - y_w)\, v_c$$
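Stacking the per-column results $(\hat{y}_w - y_w)\, v_c$ gives the full gradient with respect to $U$ as the outer product $v_c(\hat{y} - y)^\top$. A minimal NumPy sketch of that matrix gradient (helper names are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def grad_U(U, v_c, o):
    """dJ_naive-softmax/dU, with the columns of U being the outside vectors u_w.

    Column w of the result is (y_hat_w - y_w) * v_c, so the whole gradient is
    the outer product v_c (y_hat - y)^T with the same (d, V) shape as U.
    """
    y_hat = softmax(U.T @ v_c)
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    return np.outer(v_c, y_hat - y)
```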
The sigmoid function is given by Equation 4:
$$\sigma(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{e^x + 1} \tag{4}$$
Please compute the derivative of $\sigma(x)$ with respect to $x$, where $x$ is a vector.
$$\begin{aligned} \frac{d}{dx} \sigma(x) &= \frac{d}{dx} \frac{e^x}{e^x + 1} \\ &= \frac{d}{dy} \frac{y}{y+1} \cdot \frac{d}{dx} e^x \qquad (y = e^x) \\ &= \frac{d}{dy} \left(1 - \frac{1}{y+1}\right) \cdot \frac{d}{dx} e^x \\ &= \frac{1}{(y+1)^2} \cdot \frac{d}{dx} e^x \\ &= \frac{e^x}{(e^x + 1)^2} \\ &= \frac{e^x}{e^x + 1} \cdot \frac{1}{e^x + 1} \\ &= \frac{e^x}{e^x + 1} \cdot \frac{e^x + 1 - e^x}{e^x + 1} \\ &= \frac{e^x}{e^x + 1} \left(1 - \frac{e^x}{e^x + 1}\right) \\ &= \sigma(x)(1 - \sigma(x)) \end{aligned}$$
Since $\sigma$ is applied element-wise, for a vector $x$ the Jacobian is diagonal with entries $\sigma(x_i)(1 - \sigma(x_i))$.
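A small element-wise implementation and finite-difference check of this derivative (a sketch under the element-wise convention stated above):

```python
import numpy as np

def sigmoid(x):
    # Element-wise logistic function; accepts scalars or NumPy arrays.
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Element-wise derivative sigma(x) * (1 - sigma(x)); for a vector input
    # the full Jacobian is diagonal with these values on the diagonal.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(numeric, sigmoid_grad(x)))   # True
```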
Now we shall consider the Negative Sampling loss, which is an alternative to the Naive Softmax loss. Assume that $K$ negative samples (words) are drawn from the vocabulary. For simplicity of notation we shall refer to them as $w_1, w_2, \dots, w_K$ and their outside vectors as $u_1, \dots, u_K$. Note that $o \notin \{w_1, \dots, w_K\}$. For a center word $c$ and an outside word $o$, the negative sampling loss function is given by:
$$J_{\text{neg-sample}}(v_c, o, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c)) \tag{5}$$
for a sample $w_1, \dots, w_K$, where $\sigma(\cdot)$ is the sigmoid function.[^3]
Please repeat parts (b) and (c), computing the partial derivatives of $J_{\text{neg-sample}}$ with respect to $v_c$, with respect to $u_o$, and with respect to a negative sample $u_k$. Please write your answers in terms of the vectors $u_o$, $v_c$, and $u_k$, where $k \in [1, K]$. After you’ve done this, describe with one sentence why this loss function is much more efficient to compute than the naive-softmax loss. Note, you should be able to use your solution to part (d) to help compute the necessary gradients here.
$$\begin{aligned} \frac{\partial}{\partial v_c} J_{\text{neg-sample}} &= -\frac{\partial}{\partial v_c} \log(\sigma(u_o^\top v_c)) - \frac{\partial}{\partial v_c} \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c)) \\ &= -\frac{1}{\sigma(u_o^\top v_c)} \frac{\partial}{\partial v_c} \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \frac{1}{\sigma(-u_k^\top v_c)} \frac{\partial}{\partial v_c} \sigma(-u_k^\top v_c) \\ &= -\frac{1}{\sigma(u_o^\top v_c)} \sigma(u_o^\top v_c)(1 - \sigma(u_o^\top v_c)) \frac{\partial}{\partial v_c} u_o^\top v_c - \sum_{k=1}^{K} \frac{1}{\sigma(-u_k^\top v_c)} \sigma(-u_k^\top v_c)(1 - \sigma(-u_k^\top v_c)) \frac{\partial}{\partial v_c} (-u_k^\top v_c) \\ &= (\sigma(u_o^\top v_c) - 1)\, u_o - \sum_{k=1}^{K} (\sigma(-u_k^\top v_c) - 1)\, u_k \end{aligned}$$

$$\begin{aligned} \frac{\partial}{\partial u_o} J_{\text{neg-sample}} &= -\frac{\partial}{\partial u_o} \log(\sigma(u_o^\top v_c)) - \frac{\partial}{\partial u_o} \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c)) \\ &= -\frac{\partial}{\partial u_o} \log(\sigma(u_o^\top v_c)) \\ &= -\frac{1}{\sigma(u_o^\top v_c)} \frac{\partial}{\partial u_o} \sigma(u_o^\top v_c) \\ &= -\frac{1}{\sigma(u_o^\top v_c)} \sigma(u_o^\top v_c)(1 - \sigma(u_o^\top v_c)) \frac{\partial}{\partial u_o} u_o^\top v_c \\ &= (\sigma(u_o^\top v_c) - 1)\, v_c \end{aligned}$$

$$\begin{aligned} \frac{\partial}{\partial u_k} J_{\text{neg-sample}} &= -\frac{\partial}{\partial u_k} \log(\sigma(u_o^\top v_c)) - \frac{\partial}{\partial u_k} \sum_{x=1}^{K} \log(\sigma(-u_x^\top v_c)) \\ &= -\frac{\partial}{\partial u_k} \sum_{x=1}^{K} \log(\sigma(-u_x^\top v_c)) \\ &= -\frac{\partial}{\partial u_k} \log(\sigma(-u_k^\top v_c)) \\ &= -\frac{1}{\sigma(-u_k^\top v_c)} \frac{\partial}{\partial u_k} \sigma(-u_k^\top v_c) \\ &= -\frac{1}{\sigma(-u_k^\top v_c)} \sigma(-u_k^\top v_c)(1 - \sigma(-u_k^\top v_c)) \frac{\partial}{\partial u_k} (-u_k^\top v_c) \\ &= (1 - \sigma(-u_k^\top v_c))\, v_c \end{aligned}$$
This loss function is much more efficient because it only requires dot products with the true outside word and the $K$ negative samples, instead of summing over every word in the vocabulary as the softmax normalization does.
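The loss in Equation (5) and the three gradients above can be computed together in a few lines of NumPy. This is only a sketch; `neg_sample_loss_and_grads`, the row-major layout of the negative-sample vectors, and the variable names are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss_and_grads(v_c, u_o, U_neg):
    """Equation (5) and the gradients derived above.

    v_c   : (d,)   center vector
    u_o   : (d,)   outside vector of the true outside word
    U_neg : (K, d) rows are the outside vectors u_k of the K negative samples
    """
    pos = sigmoid(u_o @ v_c)       # sigma(u_o^T v_c), a scalar
    neg = sigmoid(-U_neg @ v_c)    # sigma(-u_k^T v_c) for each sample, shape (K,)

    loss = -np.log(pos) - np.sum(np.log(neg))
    grad_vc = (pos - 1.0) * u_o + (1.0 - neg) @ U_neg   # only K + 1 dot products
    grad_uo = (pos - 1.0) * v_c
    grad_uk = np.outer(1.0 - neg, v_c)                  # row k is (1 - sigma(-u_k^T v_c)) v_c
    return loss, grad_vc, grad_uo, grad_uk

# Toy usage with d = 3 and K = 4 negative samples.
rng = np.random.default_rng(2)
print(neg_sample_loss_and_grads(rng.normal(size=3), rng.normal(size=3),
                                rng.normal(size=(4, 3)))[0])
```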
Suppose the center word is $c = w_t$ and the context window is $[w_{t-m}, \dots, w_{t-1}, w_t, w_{t+1}, \dots, w_{t+m}]$, where $m$ is the context window size. Recall that for the skip-gram version of word2vec, the total loss for the context window is:
$$J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) = \sum_{-m \leq j \leq m,\, j \neq 0} J(v_c, w_{t+j}, U) \tag{6}$$
Here, $J(v_c, w_{t+j}, U)$ represents an arbitrary loss term for the center word $c = w_t$ and outside word $w_{t+j}$. $J(v_c, w_{t+j}, U)$ could be $J_{\text{naive-softmax}}(v_c, w_{t+j}, U)$ or $J_{\text{neg-sample}}(v_c, w_{t+j}, U)$, depending on your implementation.
Write down three partial derivatives: $\frac{\partial J_{\text{skip-gram}}}{\partial U}$, $\frac{\partial J_{\text{skip-gram}}}{\partial v_c}$, and $\frac{\partial J_{\text{skip-gram}}}{\partial v_w}$ for $w \neq c$.
Write your answers in terms of $\frac{\partial J(v_c, w_{t+j}, U)}{\partial U}$ and $\frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c}$. This is very simple; each solution should be one line.
$$\begin{aligned} \frac{\partial}{\partial U} J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) &= \sum_{-m \leq j \leq m,\, j \neq 0} \frac{\partial J(v_c, w_{t+j}, U)}{\partial U} \\ \frac{\partial}{\partial v_c} J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) &= \sum_{-m \leq j \leq m,\, j \neq 0} \frac{\partial J(v_c, w_{t+j}, U)}{\partial v_c} \\ \frac{\partial}{\partial v_w} J_{\text{skip-gram}}(v_c, w_{t-m}, \dots, w_{t+m}, U) &= 0 \quad (w \neq c) \end{aligned}$$
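Finally, a minimal sketch of how these window-level sums translate into code, assuming a per-pair helper `loss_and_grads(v_c, o, U)` (an illustrative name) that returns the per-pair loss and gradients derived earlier:

```python
import numpy as np

def skip_gram_grads(v_c, outside_indices, U, loss_and_grads):
    """Accumulate Equation (6) over the context window.

    `loss_and_grads(v_c, o, U)` is assumed to return (J, dJ/dv_c, dJ/dU) for a
    single (center, outside) pair, using either the naive-softmax or the
    negative-sampling loss derived above.
    """
    total_loss = 0.0
    grad_vc = np.zeros_like(v_c)
    grad_U = np.zeros_like(U)
    for o in outside_indices:      # the 2m outside words, i.e. j != 0
        J, d_vc, d_U = loss_and_grads(v_c, o, U)
        total_loss += J            # the window loss is just a sum of per-pair losses
        grad_vc += d_vc
        grad_U += d_U
    return total_loss, grad_vc, grad_U
```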
[^1]: Assume that every word in our vocabulary is matched to an integer number $k$. $u_k$ is both the $k^{th}$ column of $U$ and the ‘outside’ word vector for the word indexed by $k$. $v_k$ is both the $k^{th}$ column of $V$ and the ‘center’ word vector for the word indexed by $k$. In order to simplify notation we shall interchangeably use $k$ to refer to the word and the index-of-the-word.
[^2]: The Cross Entropy Loss between the true (discrete) probability distribution $p$ and another distribution $q$ is $-\sum_i p_i \log(q_i)$.
[^3]: Note: the loss function here is the negative of what Mikolov et al. had in their original paper, because we are doing a minimization instead of maximization in our assignment code. Ultimately, this is the same objective function.