CS224N NLP

Notes from an expert, for reference:
github.com/LooperXX/LooperXX.github.io.git

Table of Contents

  • Abbreviation
  • Lecture 1 - Introduction and Word Vectors
    • NLP
    • Word2vec
      • Use two vectors for each word: center word and context word.
      • softmax function
      • Train the model: gradient descent
    • Show some achievement with code(5640-h0516)
    • QA
  • Lecture 2 Word Vectors,Word Senses,and Neural Classifiers
    • Bag-of-words models (0245)
    • Gradient descent (0600)
      • Stochastic gradient descent (SGD) [ToLM] (0920)
    • more details of word2vec(1400)
      • Skip-gram (SG) uses the center word to predict context words
        • SGNS: skip-gram with negative sampling [ToLO]
      • CBOW: the opposite (uses context words to predict the center word)
    • Why use two vectors(1500)
    • Why not capture co-occurrence counts directly?(2337)
    • SVD(3230) [ToL]
    • Count based vs direct prediction
    • Encoding meaning components in vector differences(3948)
    • GloVe (4313)
    • How to evaluate word vectors Intrinsic vs. extrinsic(4756)
      • Analogy evaluation and hyperparameters (intrinsic)(5515)
      • Word vector distances and their correlation with human judgements(5640)
    • Data shows that 300-dimensional word vectors work well(5536)
    • The objective function for the GloVe model and What log-bilinear means(5739)
    • Word senses and word sense ambiguity(h0353)
  • Lecture 3 Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm): all the math details of doing neural net learning
    • Needs to be learned again; it is not fully understood yet.
    • Named Entity Recognition(0530)
    • Simple NER (0636)
      • How the sample model run (0836)
    • update equation(1220)
    • jacobian(1811)
    • Chain Rule(2015)
    • do one example step (2650)
    • image-20220214193417520
    • Reusing Computation(3402)
      • ds/dw
    • Forward and backward propagation(5000)
    • An example(5507)
    • Compute all gradients at once (h0005)
    • Back-prop in general computation graph(h0800)[ToL]
    • Automatic Differentiation(h1346)
    • Manual Gradient checking : Numeric Gradient(h1900)
  • Lecture 4 Dependency Parsing
    • Two views of linguistic structure
      • Constituency = phrase structure grammar = context-free grammars(CFGs)(0331)
      • Dependency structure(1449)
    • Why do we need sentence structure?(2205)
    • Prepositional phrase attachment ambiguity.(2422)
      • San Jose cops kill man with knife
    • Coordination scope ambiguity(3614)
    • Adjectival/Adverbial Modifier Ambiguity(3755)
    • Verb Phrase(VP) attachment ambiguity(4404)
    • Dependency Grammar and Dependency structure(4355)
      • A fake ROOT is added for convenience
    • Dependency Grammar history(4742)
    • The rise of annotated data Universal Dependency tree(5100)
      • Tree bank(5400)
    • how to build parser with dependency(5738)
    • Dependency Parsing
      • Projectivity(h0416)
    • Methods of Dependency Parsing(h0521)
    • Greedy transition-based parsing(h0621)
    • Basic transition-based dependency parser (h0808)
    • MaltParser(h1351)[ToL]
    • Evaluation of Dependency Parsing (h1845)[ToL]
  • Lecture-5 Languages models and Recurrent Neural Networks(RNNs)
    • A neural dependency parser(0624)
    • Distributed Representations(0945)
    • Deep Learning Classifier are non-linear classifiers(1210)
    • Simple feed-forward neural network multi-class classifier (1621)
    • Neural Dependency Parser Model Architecture(1730)
    • Graph-based dependency parsers (2044)
    • Regularization && Overfitting (2529)
    • Dropout (3100)[ToL]
    • Vectorization(3333)
    • Non-linearities (4000)
    • Parameter Initialization (4357)
    • Optimizers(4617)
    • Learning Rates(4810)
    • Language Modeling (5036)
    • n-gram Language Models(5356)
    • Sparsity Problems (5922)
    • Storage Problems(h0117)
    • How to build a neural language model(h0609)
    • A fixed-window neural Language Model(h1100)
    • Recurrent Neural Network (RNN)(h1250)
    • A Simple RNN Language Model(h1430)
  • Lecture 6 Simple and LSTM Recurrent Neural Networks.
    • The Simple RNN Language Model (0310)
    • Training an RNN Language Model (0818)
      • Teacher Forcing
    • Evaluating Language Models (2447)[ToL]
    • Language Model is a system that predicts the next word(3130)
    • Other use of RNN(3229)
      • Tag for word
      • Used for classification(3420)
      • Used to Language encoder module (3500)
      • Used to generate text (3600)
    • Problems with Vanishing and Exploding Gradients(3750)[IMPORTANT]
      • Why This is a problem (4400)
    • Long Short Term Memory RNNS(LSTMS)(5000)[ToL]
    • Bidirectional RNN (h2000)
  • Lecture-7 Translation, Seq2Seq, Attention
    • Machine Translation(0245)
      • What do you need (1200)
    • Decoding for SMT(1748)
    • What is Neural Machine Translation(NMT)(2130)
    • Seq2seq is more than MT(2600)
    • (2732)[ToL]
    • Multi-layer RNNs(3323)
    • Greedy decoding(4000)
    • Exhaustive search decoding(4200)
    • beam search decoding(4400)
    • How do we evaluate Machine Translation(5550)
      • BLEU
    • NMT perhaps the biggest success story of NLP Deep Learning(h00000)
    • Attention(h1300)
  • Lecture 8 Final Projects; Practical Tips
    • Sequence to Sequence with attention(0235)
    • Attention: in equations(0800)
    • there are several attention variants(1500)
    • Attention is a general Deep Learning technique(2240)
    • Final Project(3000)
  • Lecture-9 Self- Attention and Transformers
    • Issues with recurrent models (0434)
      • Linear interaction distance
      • Lack of parallelizability(0723)
    • If not recurrence
      • Word window models aggregate local contexts (1031)
      • Attention(1406)
    • Self-Attention(1638)
    • Self-attention as an nlp building block(2222)
    • Fix the first self-attention problem
      • sequence order (2423)
        • Position representation vector through sinusoids(2624)
          • Sinusoidal position representations(2730)
          • Position representation vector from scratch(2830)
      • Adding nonlinearities in self-attention(2953)
    • Barriers and solutions for Self-Attention as building block(2945)
    • The transformer encoder-decoder(3638)
      • key query value(4000)
      • Multi-headed attention (4322)
    • Residual connections(4723)
    • Layer normalization(5045)
    • Scaled dot product(5415)
  • Lecture 10 - Transformers and Pretraining
    • Word structure and subword models(0300)
    • The byte-pair encoding(0659)
    • Motivating word meaning and context(1556)
    • Pretraining whole models(2000)
    • These models haven't hit overfitting yet; you can hold out some data to test that.(2811)
    • transformers for encoding and decoding (3030)
    • Pretraining through language modeling(3400)
    • Stochastic gradient descent and pretrain/finetune(3740)
    • Model pretraining has three ways (4021)
      • Decoder(4300)
    • Generative Pretrained Transformer(GPT) (4818)
    • GPT2(5400)
    • Pretraining Encoding(5545)
      • (Bert)(5654)
    • Bidirectional encoder representations from transformers(h0100)
    • Limitations of pretrained encoders(h0900)
    • Extensions of BERT(h1000)
    • Pretraining Encoder-Decoder (h1200)
      • T5(h1500)
    • GPT3(h1800)
  • Lecture 11 Question Answering
    • What is question answering(0414)
    • Beyond textual QA problems(1100)
    • Reading comprehension(1223)
    • Stanford question answering dataset (1815)
    • Neural models for reading comprehension(2428)
    • LSTM-based vs BERT models (2713)
    • BiDAF(3200)
      • Encoding(3200)
      • Attention(3400)
      • Modeling and output layers(4640)
    • BERT for reading comprehension (5227)
    • Comparisons between BiDAF and BERT models(2734)
    • Can we design better pre-training objectives(h0000)
    • open domain question answering(h1000)
    • DPR(H1400)
    • DensePhrase:Demo(h1800)
  • Lecture 12 - Natural Language Generation[ToL]
    • What is neural language generation?(0300)
    • Components of NLG Systems(0845)
      • Basic of natural language generation(0916)
      • A look at a single step(1024)
      • then select and train(1115)
    • Decoding(1317)
      • Greedy methods(1432)
      • Greedy methods get repetitive(1545)
      • why do repetition happen(1613)
      • How can we reduce repetition (1824)[ToL]
      • People do not always choose the greedy option(1930)
      • Time to get random: Sampling(2047)
      • Decoding : Top-k sampling(2100)
      • Issues with Top-k sampling(2339)
      • Decoding: Top-p(nucleus)sampling(2421)
      • Scaling randomness: Softmax temperature (2500)[ToL]
      • improving decoding: re-balancing distributions(2710)
      • Backpropagation-based distribution re-balancing(3027)
      • Improving Decoding: Re-ranking(3300)[ToL]
      • Decoding: Takeaways(3540)
    • Training NLG models(4114)
      • Maximum Likelihood Training(4200)
      • Unlikelihood Training(4427)[ToL]
      • Exposure Bias(4513)[ToL]
      • Exposure Bias Solutions(4645)
      • Reinforce Basics(4900)
      • Reward Estimation(5020)
      • reinforce's dark side(5300)
      • Training: Takeaways(5423)
    • Evaluating NLG Systems(5613)
    • Types of evaluation methods for text generation(5734)
      • Content Overlap metrics(5800)
      • A simple failure case(5900)
      • Semantic overlap metrics(h0100)
      • Model-based metrics(h0120)
        • word distance functions(h0234)
        • Beyond word matching(h0350)
      • Human evaluations(h0433)
        • Issues(h0700)
      • Takeaways(h0912)
    • Ethical Considerations(h1025)
  • Lecture 13 - Coreference Resolution
    • What is Coreference Resolution?(0604)
    • Applications (1712)
    • Coreference Resolution in Two steps(1947)
    • Mention Detection(2049)
      • Not quite so simple(2255)
    • Avoiding a traditional pipeline system(2811)
    • Onto Coreference! First, some linguistics (3035)
      • not all anaphoric relations are coreferential (3349)
    • Anaphora vs Cataphora(3610)
    • Taking stock (3801)
    • Four kinds of coreference Models(4018)
    • Traditional pronominal anaphora resolution:Hobbs's naive algorithm(4130)
    • Knowledge-based Pronominal Coreference(4820)
    • Coreference Models: Mention Pair(5624)
      • Mention Pair Test Time(5800)
      • Disadvantage(5953)
    • Coreference Models: Mention Ranking(h0050)
    • Convolutional Neural Nets(h0341)
    • What is convolution anyway?(h0452)
    • End-to-End Neural Coref Model(h1206)
    • Conclusion (h2017)
  • Lecture 14 - T5 and Large Language Models
    • T5 with a task prefix(0800)
    • Others
      • STSB
      • Summarize
    • T5 change little from original transformer(1300)
    • what should my pre-training data set be?(1325)
    • Then is how to train from a start(1659)
    • pretrain(1805)
    • choose the model(2412)
    • pre-training objective(2629)
    • different structure of data source(2822)
    • Multi task learning (3443)
    • close the gap between multi-task training and this pre-training followed by separate fine tuning(3621)
    • What if it happens there are four times computes as much as before (3737)
    • Overview(3840)
    • What about all of the other languages?(mT5)(4735)
    • XTREME (5000)
    • How much knowledge does a language model pick up during pre-training?(5225)
    • Salient span masking (5631)
    • Do large language models memorize their training data(h0100)
    • Can we close the gap between large and small models by improving the transformer architecture(h1010)
    • QA(h1915)
  • Lecture 15 - Add Knowledge to Language Models
    • Recap: LM(0232)
    • What does a language model know?(0423)
    • The importance of knowledge-aware language models(0700)
    • Query traditional knowledge bases(0750)
    • Query language models as knowledge bases(0955)
    • Compare and disadvantage(1010)
    • Techniques to add knowledge to LMs(1300)
    • Add pretrained embeddings(1403)
    • Aside: What is entity linking?(1516)
    • Method 1: Add pretrained entity embeddings(1815)
      • How do we incorporate pretrained entity embeddings from a different embedding space?(2000)
    • ERNIE: Enhanced language representation with informative entities(2143)
      • strengths & remaining challenges(2610)
    • Jointly learn to link entities with KnowBERT(2958)
    • Use an external memory(3140)
      • KGLM(3355)
      • Local knowledge and full knowledge
      • When should the model use the external knowledge(3600)
    • Compare to the others(4334)
    • More recent takes: Nearest Neighbor Language Models(kNN-LM)(4730)
    • Modify the training data(5230)
    • WKLM(5458)
    • Learn inductive biases through masking(5811)
    • Salient span masking(5927)
    • Recap(h0053)
    • Evaluating knowledge in LMs(h0211)
      • LAMA(h0250)
      • The limitations (h0650)
    • LAMA UnHelpful Names (LAMA-UHN)
      • Developing better prompts to query knowledge in LMs
      • Knowledge-driven downstream tasks(h1253)
    • Relation extraction performance on TACRED(h1400)
    • Entity typing performance on Open Entity
    • Recap: Evaluating knowledge in LMs(h1600)
    • Other exciting progress & what's next?(h1652)
  • Lecture 17 - Model Analysis and Explanation
    • Motivation
      • what are our models doing(0415)
      • how do we make tomorrow's model?(0515)
      • What biases are built into model?(0700)
      • how do we make progress in the following 25 years?(0800)
    • Model analysis at varying levels of abstraction(0904)
    • Model evaluation as model analysis(1117)
    • Model evaluation as model analysis in natural language inference(1344)
      • What if the model is simply using heuristics to get good accuracy?(1558)
    • Language models as linguistic test subjects(2023)
    • Careful test sets as unit test suites: CheckListing(3230)
    • Fitting the dataset vs learning the task(3500)
    • Knowledge evaluation as model analysis(3642)
    • Input influence: does my model really use long-distance context?(3822)
    • Prediction explanations: what in the input led to this output?(4054)
    • Prediction explanations: simple saliency maps(4230)
    • Explanation by input reduction (4607)
    • Analyzing models by breaking them(5106)
    • Are models robust to noise in their input?(5518)
    • Analysis of "interpretable" architecture components(5719)
    • Probing: supervised analysis of neural networks(h0408)
    • Emergent simple structure in neural networks(h1019)
    • Probing: trees simply recoverable from BERT representations(h1136)
    • Final thoughts on probing and correlation studies(h1341)
    • Recasting model tweaks and ablations as analysis(h1406)
      • Ablation analysis: do we need all these attention heads?(h1445)
    • What's the right layer order for a transformer?(h1537)
    • Parting thoughts(h1612)
  • Lecture 18 - Future of NLP + Deep Learning
    • General Representation Learning Recipe(0312)
    • Large Language Models and GPT-3(0358)
      • Large Language models and GPT-3(0514)
      • What's new about GPT-3
  • There are three lectures left; they will be finished during review after I come back from Lee's course.

Abbreviation

- -
[ToL] To learn
[ToLM] To learn more
[ToLO] To learn optionally
(0501) 05 min 01s
(h0501) 1 hour 05 min 01s
(hh0501) 2 hour 05 min 01s

Lecture 1 - Introduction and Word Vectors

CS224N NLP_第1张图片

NLP

Convert one-hot encodings to distributed representations.

One-hot vectors can't represent relations between words, and they are too high-dimensional.

Word2vec

Word2vec ignores the positions of the words in the context window.

CS224N NLP_第2张图片 CS224N NLP_第3张图片 CS224N NLP_第4张图片

Use two vectors for each word: a center-word vector and a context-word vector.

softmax function

CS224N NLP_第5张图片

CS224N NLP_第6张图片
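
As a concrete reference, here is a minimal NumPy sketch of the softmax function and the skip-gram probability P(o|c). It is my own toy illustration (the vocabulary size, dimension, and the names U, V, center are made up), not the course code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Toy vocabulary of 5 words, embedding dimension 4.
rng = np.random.default_rng(0)
V = rng.normal(size=(5, 4))   # center-word ("v") vectors
U = rng.normal(size=(5, 4))   # context-word ("u") vectors

center = 2                          # index of the center word c
scores = U @ V[center]              # u_w . v_c for every word w
p_context_given_center = softmax(scores)
print(p_context_given_center)       # P(o | c) for each candidate word o
```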

Train the model: gradient descent

CS224N NLP_第7张图片

There is a derivation of the gradient for gradient descent at (39:50-56:40).

The result is: CS224N NLP_第8张图片

[ToL]

Review the derivation, and especially the following.

CS224N NLP_第9张图片
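
For review, the key result of that derivation (the standard skip-gram gradient, reconstructed here rather than copied from the slide) is:

$$\frac{\partial}{\partial v_c}\log p(o\mid c)=\frac{\partial}{\partial v_c}\left(u_o^{\top}v_c-\log\sum_{w\in V}\exp(u_w^{\top}v_c)\right)=u_o-\sum_{x\in V}p(x\mid c)\,u_x$$

That is, the observed context vector minus the expected context vector under the model.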

Show some achievement with code(5640-h0516)

  • We can do vector addition, subtraction, multiplication and division, etc.

QA

Why are there separate center-word and context-word vectors? (h0650)

To avoid a vector taking a dot product with itself in some situations???

Even synonyms can be merged into one vector (h1215)

This is different from Lee's course, where he says synonyms use different vectors.

Lecture 2 Word Vectors,Word Senses,and Neural Classifiers

CS224N NLP_第10张图片

CS224N NLP_第11张图片

Bag-of-words models (0245)

The model makes the same predictions at each position.

Gradient descent (0600)

Full-batch gradient descent is not usually used because the computation is too expensive.

Step size: not too big, not too small.

CS224N NLP_第12张图片

Stochastic gradient descent (SGD) [ToLM] (0920)

Take only a small part (a mini-batch) of the corpus at each step.

It is billions of times faster.

It may even get better results.

But the gradients are sparse: either you need sparse matrix update operations to only update certain rows of the full embedding matrices U and V, or you need to keep around a hash for word vectors. (1344) [ToL]

more details of word2vec(1400)

CS224N NLP_第13张图片

Skip-gram (SG) uses the center word to predict the context words.

SGNS: skip-gram with negative sampling [ToLO]

Use the logistic (sigmoid) function instead of softmax, and sample negative words from the corpus.
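
The negative-sampling objective, in its standard form (reconstructed here, not copied from the slide; σ is the sigmoid, o the observed context word, and the K negative words are sampled from a unigram distribution raised to the 3/4 power):

$$J_{\text{neg-sample}}(v_c,o)=-\log\sigma(u_o^{\top}v_c)\;-\;\sum_{k=1}^{K}\log\sigma(-u_k^{\top}v_c)$$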

CBOW is the opposite: it uses the context words to predict the center word.

CS224N NLP_第14张图片

Why use two vectors(1500)

Sometimes a word would otherwise take a dot product with itself (when it appears in its own context window).

CS224N NLP_第15张图片

[ToL]

The first term is for the positive (observed) word and the last is for the negative (sampled) words. (2800)

Negative words are just sampled, because the center word will turn up on other occasions; when it does, other negatives are sampled, and the model learns step by step.

Why not capture co-occurrence counts directly?(2337)

CS224N NLP_第16张图片

SVD(3230) [ToL]

https://zhuanlan.zhihu.com/p/29846048

Use SVD to get lower-dimensional representations for words.

CS224N NLP_第17张图片(3451)

Count based vs direct prediction

CS224N NLP_第18张图片(3900)

Encoding meaning components in vector differences(3948)

The goal is for meaning components to correspond to vector differences, so that addition and subtraction work for word vectors (e.g. king − man + woman ≈ queen).

CS224N NLP_第19张图片

GloVe (4313)

CS224N NLP_第20张图片

The idea is to make the dot product of two word vectors approximate the log of their co-occurrence count.
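
In symbols, the GloVe objective (standard form, reconstructed here; f is a weighting function that caps the influence of very frequent pairs, and the b terms are bias scalars):

$$J=\sum_{i,j=1}^{|V|}f(X_{ij})\left(w_i^{\top}\tilde{w}_j+b_i+\tilde{b}_j-\log X_{ij}\right)^2$$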

How to evaluate word vectors Intrinsic vs. extrinsic(4756)

CS224N NLP_第21张图片

Analogy evaluation and hyperparameters (intrinsic)(5515)

Word vector distances and their correlation with human judgements(5640)

Data shows that 300-dimensional word vectors work well(5536)

The objective function for the GloVe model and What log-bilinear means(5739)

Word senses and word sense ambiguity(h0353)

One word with different senses gets a different vector per sense.

Then the word's overall vector can be a (weighted) sum of its sense vectors.

This works surprisingly well. (h1200)

Because the vectors are sparse in a high-dimensional space, you can often separate the different senses back out of the sum. (h1402)

Lecture 3 Gradients by hand (matrix calculus) and algorithmically (the backpropagation algorithm): all the math details of doing neural net learning

CS224N NLP_第22张图片

Needs to be learned again; it is not fully understood yet.

Named Entity Recognition(0530)

CS224N NLP_第23张图片

Simple NER (0636)

CS224N NLP_第24张图片

How the sample model run (0836)

CS224N NLP_第25张图片

update equation(1220)

CS224N NLP_第26张图片

jacobian(1811)

CS224N NLP_第27张图片

Chain Rule(2015)

CS224N NLP_第28张图片

CS224N NLP_第29张图片

do one example step (2650)

image-20220214193417520

Hadamard (element-wise) product [ToL]

Reusing Computation(3402)

CS224N NLP_第30张图片

ds/dw

CS224N NLP_第31张图片 CS224N NLP_第32张图片

Forward and backward propagation(5000)

CS224N NLP_第33张图片 CS224N NLP_第34张图片

An example(5507)

a = x+y

b = max(y,z)

f = ab

CS224N NLP_第35张图片
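
A tiny worked sketch of the forward and backward pass for this graph (my own illustration; the input values x=1, y=2, z=0 are arbitrary):

```python
# Forward pass for the graph: a = x + y, b = max(y, z), f = a * b
x, y, z = 1.0, 2.0, 0.0
a = x + y          # 3.0
b = max(y, z)      # 2.0
f = a * b          # 6.0

# Backward pass: apply the chain rule from f back to the inputs.
df_da = b                      # d(a*b)/da
df_db = a                      # d(a*b)/db
da_dx, da_dy = 1.0, 1.0        # d(x+y)/dx, d(x+y)/dy
db_dy = 1.0 if y > z else 0.0  # max routes the gradient to the larger input
db_dz = 1.0 if z > y else 0.0

df_dx = df_da * da_dx                      # = b = 2
df_dy = df_da * da_dy + df_db * db_dy      # = b + a = 5 (gradients along both paths are summed)
df_dz = df_db * db_dz                      # = 0
print(df_dx, df_dy, df_dz)
```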

Compute all gradients at once (h0005)

image-20220215145351805

Back-prop in general computation graph(h0800)[ToL]

CS224N NLP_第36张图片

Automatic Differentiation(h1346)

Many tools can calculate gradients automatically. CS224N NLP_第37张图片

Manual Gradient checking : Numeric Gradient(h1900)

CS224N NLP_第38张图片
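
A minimal sketch of the numeric gradient check using the central difference (my own illustration; `f` is assumed to return a scalar):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-4):
    """Approximate df/dx for each component via the central difference."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        orig = x.flat[i]
        x.flat[i] = orig + h
        fp = f(x)
        x.flat[i] = orig - h
        fm = f(x)
        x.flat[i] = orig          # restore the original value
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# Check against the analytic gradient of f(x) = sum(x**2), which is 2x.
x = np.array([1.0, -2.0, 3.0])
print(numeric_gradient(lambda v: np.sum(v**2), x))   # ~ [ 2. -4.  6.]
```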

Lecture 4 Dependency Parsing

CS224N NLP_第39张图片

Two views of linguistic structure

Constituency = phrase structure grammar = context-free grammars(CFGs)(0331)

Phrase structure organizes words into nested constituents

CS224N NLP_第40张图片

Dependency structure(1449)

Dependency structure shows which words depend on (modify, attach to, or are arguments of) which other words.

CS224N NLP_第41张图片

Why do we need sentence structure?(2205)

We cannot express complex meanings with single words alone; we need to know how words fit together.

CS224N NLP_第42张图片

Prepositional phrase attachment ambiguity.(2422)

Some sentences that show it:

San Jose cops kill man with knife

Scientists count whales from space

The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting].

Coordination scope ambiguity(3614)

Shuttle veteran and longtime NASA executive Fred Gregory appointed to board

Doctor: No heart, cognitive issues

Adjectival/Adverbial Modifier Ambiguity(3755)

Students get [first hand] job experience
Students get first [hand job] experience

Verb Phrase(VP) attachment ambiguity(4404)

Mutilated body washes up on Rio beach to be used for Olympics beach volleyball.

CS224N NLP_第43张图片

Dependency Grammar and Dependency structure(4355)

CS224N NLP_第44张图片

A fake ROOT node is added for convenience, so that every word depends on exactly one node.

Dependency Grammar history(4742)

CS224N NLP_第45张图片

The rise of annotated data Universal Dependency tree(5100)

CS224N NLP_第46张图片

Tree bank(5400)

Building a treebank seems slower than writing a grammar by hand, but it is still worth it, because the data can be reused for many purposes, not just one NLP system.

how to build parser with dependency(5738)

CS224N NLP_第47张图片

Dependency Parsing

CS224N NLP_第48张图片

Projectivity(h0416)

CS224N NLP_第49张图片

Methods of Dependency Parsing(h0521)

CS224N NLP_第50张图片

Greedy transition-based parsing(h0621)

Basic transition-based dependency parser (h0808)

CS224N NLP_第51张图片

[root] I ate fish                (start: stack [root], buffer "I ate fish")

[root I ate] fish                (after shifting "I" and "ate")

[root ate] fish                  (left-arc: I ← ate)

[root ate fish]                  (shift "fish")

[root ate]                       (right-arc: ate → fish)

[root]                           (right-arc: root → ate; done)
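
A minimal sketch of the arc-standard shift/reduce machinery behind this trace (the action sequence for "I ate fish" is hand-coded here; a real parser predicts the next action from features of the stack and buffer):

```python
def parse(words, actions):
    """Arc-standard transition-based parsing with a fixed action sequence."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for act in actions:
        if act == "SHIFT":
            stack.append(buffer.pop(0))
        elif act == "LEFT-ARC":        # second-from-top depends on top
            dep = stack.pop(-2)
            arcs.append((stack[-1], dep))
        elif act == "RIGHT-ARC":       # top depends on second-from-top
            dep = stack.pop()
            arcs.append((stack[-1], dep))
        print(stack, buffer)
    return arcs

actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
print(parse(["I", "ate", "fish"], actions))
# arcs: [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```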

MaltParser(h1351)[ToL]

CS224N NLP_第52张图片

Evaluation of Dependency Parsing (h1845)[ToL]

CS224N NLP_第53张图片

Lecture-5 Languages models and Recurrent Neural Networks(RNNs)

CS224N NLP_第54张图片

A neural dependency parser(0624)

CS224N NLP_第55张图片

Distributed Representations(0945)

CS224N NLP_第56张图片

Deep Learning Classifier are non-linear classifiers(1210)

CS224N NLP_第57张图片

Deep learning classifiers are non-linear classifiers:

CS224N NLP_第58张图片

Simple feed-forward neural network multi-class classifier (1621)

CS224N NLP_第59张图片

Neural Dependency Parser Model Architecture(1730)

CS224N NLP_第60张图片

Graph-based dependency parsers (2044)

CS224N NLP_第61张图片

Regularization && Overfitting (2529)

CS224N NLP_第62张图片

Dropout (3100)[ToL]

CS224N NLP_第63张图片

Vectorization(3333)

CS224N NLP_第64张图片

Non-linearities (4000)

CS224N NLP_第65张图片

Parameter Initialization (4357)

CS224N NLP_第66张图片

Optimizers(4617)

CS224N NLP_第67张图片

Learning Rates(4810)

The learning rate can be decreased as training goes on.

CS224N NLP_第68张图片

Language Modeling (5036)

CS224N NLP_第69张图片

n-gram Language Models(5356)

CS224N NLP_第70张图片 CS224N NLP_第71张图片

Sparsity Problems (5922)

Many word combinations never occur in the training data, so their counts (and probabilities) will be zero.

CS224N NLP_第72张图片
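
A tiny count-based bigram model on a toy corpus (my own example) showing how unseen bigrams get probability zero unless we smooth:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate the fish".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w2, w1, k=0.0, vocab_size=len(unigrams)):
    """P(w2 | w1) with optional add-k smoothing (k=0 gives the raw MLE)."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)

print(p_bigram("cat", "the"))          # seen bigram: 2/4 = 0.5
print(p_bigram("dog", "the"))          # unseen bigram: 0.0 -> the sparsity problem
print(p_bigram("dog", "the", k=1.0))   # add-1 smoothing gives a small non-zero value
```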

Storage Problems(h0117)

How to build a neural language model(h0609)

CS224N NLP_第73张图片

A fixed-window neural Language Model(h1100)

CS224N NLP_第74张图片

Recurrent Neural Network (RNN)(h1250)

x1 → h1 → y1

(W·h1, x2) → h2 → y2, where the same weights W are applied at every time step

CS224N NLP_第75张图片

A Simple RNN Language Model(h1430)

CS224N NLP_第76张图片

CS224N NLP_第77张图片
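
A minimal NumPy sketch of one step of the simple RNN language model, following the standard equations h_t = tanh(W_h h_{t-1} + W_e e_t + b_1) and ŷ_t = softmax(U h_t + b_2); all sizes and weights are toy values:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab, d_emb, d_hid = 10, 4, 6
rng = np.random.default_rng(0)
E   = rng.normal(size=(vocab, d_emb))          # word embeddings
W_h = rng.normal(size=(d_hid, d_hid)) * 0.1    # hidden-to-hidden weights
W_e = rng.normal(size=(d_hid, d_emb)) * 0.1    # embedding-to-hidden weights
b1  = np.zeros(d_hid)
U   = rng.normal(size=(vocab, d_hid)) * 0.1    # output projection
b2  = np.zeros(vocab)

def rnn_lm_step(h_prev, word_id):
    """One step: new hidden state plus a distribution over the next word."""
    e_t = E[word_id]
    h_t = np.tanh(W_h @ h_prev + W_e @ e_t + b1)
    y_t = softmax(U @ h_t + b2)
    return h_t, y_t

h = np.zeros(d_hid)
for w in [3, 1, 7]:                            # a toy word-id sequence
    h, y = rnn_lm_step(h, w)
print(y.argmax(), y.max())
```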

Lecture 6 Simple and LSTM Recurrent Neural Networks.

CS224N NLP_第78张图片

CS224N NLP_第79张图片

The Simple RNN Language Model (0310)

CS224N NLP_第80张图片

Training an RNN Language Model (0818)

Training an RNN takes more time.

Teacher Forcing

Feed in the ground-truth previous words at each step, and penalize the model when its prediction differs from the actual next word.

CS224N NLP_第81张图片

CS224N NLP_第82张图片

CS224N NLP_第83张图片

But how do we get the answer?

CS224N NLP_第84张图片

CS224N NLP_第85张图片

Evaluating Language Models (2447)[ToL]

CS224N NLP_第86张图片
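
The standard metric here is perplexity, the exponential of the average per-word cross-entropy, so lower is better (standard definition, reconstructed rather than copied from the slide):

$$\text{PP}=\exp\!\left(\frac{1}{T}\sum_{t=1}^{T}-\log p_{\theta}(x_t\mid x_{<t})\right)$$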

Language Model is a system that predicts the next word(3130)

CS224N NLP_第87张图片

Other use of RNN(3229)

Tag for word

CS224N NLP_第88张图片

Used for classification(3420)

CS224N NLP_第89张图片

Used to Language encoder module (3500)

CS224N NLP_第90张图片

Used to generate text (3600)

CS224N NLP_第91张图片

Problems with Vanishing and Exploding Gradients(3750)[IMPORTANT]

CS224N NLP_第92张图片

[ToL]

CS224N NLP_第93张图片

Why This is a problem (4400)

CS224N NLP_第94张图片

CS224N NLP_第95张图片

CS224N NLP_第96张图片

We can clip the gradient to a maximum norm (gradient clipping).

CS224N NLP_第97张图片
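
A minimal sketch of gradient clipping by global norm, which is the usual way of imposing that limit (my own illustration, not the course starter code):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    """Rescale a list of gradient arrays if their global L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([30.0, 40.0])]          # norm 50 -> rescaled to norm 5
print(clip_gradients(grads))              # [array([3., 4.])]
```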

Long Short Term Memory RNNS(LSTMS)(5000)[ToL]

CS224N NLP_第98张图片

CS224N NLP_第99张图片

CS224N NLP_第100张图片

CS224N NLP_第101张图片

Bidirectional RNN (h2000)

We also need information from the words that come after (the right context).

CS224N NLP_第102张图片

Lecture-7 Translation, Seq2Seq, Attention

CS224N NLP_第103张图片

Machine Translation(0245)

CS224N NLP_第104张图片

What do you need (1200)

You need a parallel corpus; then you need alignment.

Decoding for SMT(1748)

Try many possible sequences.

CS224N NLP_第105张图片

What is Neural Machine Translation(NMT)(2130)

Neural Machine Translation (NMT) is a way to do Machine Translation with a single end-to-end neural network.

The neural network architecture is called a sequence-to-sequence model (aka seq2seq), and it involves two RNNs (an encoder and a decoder).

CS224N NLP_第106张图片

Seq2seq is more than MT(2600)

CS224N NLP_第107张图片

(2732)[ToL]

Multi-layer RNNs(3323)

CS224N NLP_第108张图片

Lower layers tend to capture more basic (lexical) meaning.

Higher layers tend to capture higher-level overall meaning.

CS224N NLP_第109张图片

Greedy decoding(4000)

CS224N NLP_第110张图片

Exhaustive search decoding(4200)

CS224N NLP_第111张图片

beam search decoding(4400)

CS224N NLP_第112张图片

CS224N NLP_第113张图片

CS224N NLP_第114张图片

CS224N NLP_第115张图片

CS224N NLP_第116张图片
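
A minimal beam-search sketch over a toy next-token distribution (the `next_log_probs` function is a stand-in for a real decoder, and real systems usually also length-normalize the final scores):

```python
import math

def beam_search(next_log_probs, start, end, beam_size=2, max_len=5):
    """Keep the beam_size highest-scoring partial hypotheses at each step."""
    beams = [([start], 0.0)]                      # (token sequence, total log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == end else beams).append((seq, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy "model": a fixed distribution that prefers ending after a couple of words.
def next_log_probs(seq):
    if len(seq) < 3:
        return {"hi": math.log(0.6), "there": math.log(0.3), "</s>": math.log(0.1)}
    return {"</s>": math.log(0.9), "hi": math.log(0.1)}

print(beam_search(next_log_probs, "<s>", "</s>"))
```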

How do we evaluate Machine Translation(5550)

BLEU

CS224N NLP_第117张图片

NMT perhaps the biggest success story of NLP Deep Learning(h00000)

Attention(h1300)

CS224N NLP_第118张图片

CS224N NLP_第119张图片

Lecture 8 Final Projects; Practical Tips

CS224N NLP_第120张图片

Sequence to Sequence with attention(0235)

CS224N NLP_第121张图片

Attention: in equations(0800)

CS224N NLP_第122张图片

CS224N NLP_第123张图片
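
A minimal NumPy sketch of dot-product attention for a single decoder step (toy shapes; it mirrors the slide's equations in spirit, not its exact notation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

T, d = 4, 5                                    # source length, hidden size
rng = np.random.default_rng(0)
enc_hiddens = rng.normal(size=(T, d))          # encoder hidden states h_1..h_T
dec_hidden  = rng.normal(size=d)               # current decoder state s_t

scores  = enc_hiddens @ dec_hidden             # e_i = s_t . h_i
alpha   = softmax(scores)                      # attention distribution over source positions
context = alpha @ enc_hiddens                  # weighted sum of encoder states
output  = np.concatenate([context, dec_hidden])  # concatenated, then used to predict the next word
print(alpha, context.shape, output.shape)
```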

there are several attention variants(1500)

CS224N NLP_第124张图片

Attention is a general Deep Learning technique(2240)

CS224N NLP_第125张图片

Final Project(3000)

Lecture-9 Self- Attention and Transformers

Issues with recurrent models (0434)

Linear interaction distance

Related words can be linearly far apart, so it is hard to learn their interactions.

CS224N NLP_第126张图片

Lack of parallelizability(0723)

GPUs are good at parallel computation, but RNNs must compute states sequentially, so they lack parallelizability.

CS224N NLP_第127张图片

If not recurrence

Word window models aggregate local contexts (1031)

CS224N NLP_第128张图片

Attention(1406)

CS224N NLP_第129张图片

Self-Attention(1638)

CS224N NLP_第130张图片

Self-attention as an nlp building block(2222)

CS224N NLP_第131张图片

Fix the first self-attention problem

sequence order (2423)

CS224N NLP_第132张图片

Position representation vector through sinusoids(2624)

Sinusoidal position representations(2730)
Position representation vector from scratch(2830)

CS224N NLP_第133张图片
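
A NumPy sketch of the sinusoidal position representations (the standard formula from the Transformer paper; d_model is assumed to be even here):

```python
import numpy as np

def sinusoidal_positions(max_len, d_model):
    """pos_i[2k] = sin(i / 10000^(2k/d)), pos_i[2k+1] = cos(i / 10000^(2k/d))."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)      # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positions(max_len=8, d_model=16)
print(pe.shape)    # (8, 16), added to the word embeddings at each position
```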

Adding nonlinearities in self-attention(2953)

Barriers and solutions for Self-Attention as building block(2945)

CS224N NLP_第134张图片

CS224N NLP_第135张图片

(3040)

CS224N NLP_第136张图片

(3428)

CS224N NLP_第137张图片

The transformer encoder-decoder(3638)

CS224N NLP_第138张图片

[ToL]

CS224N NLP_第139张图片

key query value(4000)

CS224N NLP_第140张图片

CS224N NLP_第141张图片

Multi-headed attention (4322)

(4450)

CS224N NLP_第142张图片

CS224N NLP_第143张图片

Residual connections(4723)

CS224N NLP_第144张图片

Layer normalization(5045)

CS224N NLP_第145张图片

Scaled dot product(5415)
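
A minimal NumPy sketch tying key-query-value and the scaled dot product together for one head, with no masking (all shapes and weights are toy values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_k = 3, 8, 4                     # sequence length, model dim, head dim
rng = np.random.default_rng(0)
X   = rng.normal(size=(n, d_model))           # input word representations
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)               # scaling keeps the dot products well-behaved
attn   = softmax(scores, axis=-1)             # each row: how much a token attends to the others
output = attn @ V
print(output.shape)                           # (3, 4)
```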

Lecture 10 - Transformers and Pretraining

CS224N NLP_第146张图片

Word structure and subword models(0300)

transform transformerify

taaaasty

CS224N NLP_第147张图片

The byte-pair encoding(0659)

Subword models work with parts of words. Byte-pair encoding sits in between word-level and character-level models; it does not learn linguistic structure, just frequent subword units.

(0943)

CS224N NLP_第148张图片

CS224N NLP_第149张图片
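
A toy version of the BPE merge-learning loop, following the well-known example from the BPE paper (a real implementation handles tokenization and edge cases more carefully):

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq for word, freq in vocab.items()}

# Words are stored as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(5):
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    print("merged", best)
print(vocab)
```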

Motivating word meaning and context(1556)

CS224N NLP_第150张图片

Pretraining whole models(2000)

CS224N NLP_第151张图片

Word2vec doesn't consider context, but we can use an LSTM (or Transformer) to get context-dependent representations.

Mask some of the data and pretrain the model to predict what was masked.

These models haven't hit overfitting yet; you can hold out some data to test that.(2811)

transformers for encoding and decoding (3030)

Pretraining through language modeling(3400)

CS224N NLP_第152张图片

CS224N NLP_第153张图片

Stochastic gradient descent and pretrain/finetune(3740)

Model pretraining has three ways (4021)

CS224N NLP_第154张图片

Decoders only see the history (left context); encoders can also see the future (bidirectional context).

Encoder-decoder models may be the best of both.

Decoder(4300)

CS224N NLP_第155张图片

CS224N NLP_第156张图片

Generative Pretrained Transformer(GPT) (4818)

CS224N NLP_第157张图片

CS224N NLP_第158张图片

GPT2(5400)

CS224N NLP_第159张图片

Pretraining Encoding(5545)

(Bert)(5654)

CS224N NLP_第160张图片

CS224N NLP_第161张图片

BERT masks some words and asks the model to predict what was masked.

Bidirectional encoder representations from transformers(h0100)

[ToL]

CS224N NLP_第162张图片

CS224N NLP_第163张图片

Limitations of pretrained encoders(h0900)

CS224N NLP_第164张图片

Extensions of BERT(h1000)

CS224N NLP_第165张图片

Pretraining Encoder-Decoder (h1200)

T5(h1500)

The model doesn't even know how many words are masked (each masked span is replaced by a single sentinel token).

CS224N NLP_第166张图片

CS224N NLP_第167张图片

The model learns a lot during pretraining, but what it learns is not always correct.

GPT3(h1800)

CS224N NLP_第168张图片

CS224N NLP_第169张图片

Lecture 11 Question Answering

CS224N NLP_第170张图片

What is question answering(0414)

CS224N NLP_第171张图片

CS224N NLP_第172张图片

There are lots of practical applications(0629)

Beyond textual QA problems(1100)

Reading comprehension(1223)

CS224N NLP_第173张图片

They are useful for many practical applications.

Reading comprehension is an important testbed for evaluating how well computer systems understand human language.

Stanford Question Answering Dataset (SQuAD) (1815)

CS224N NLP_第174张图片

Neural models for reading comprehension(2428)

CS224N NLP_第175张图片

LSTM-based vs BERT models (2713)

CS224N NLP_第176张图片

CS224N NLP_第177张图片

BiDAF(3200)

CS224N NLP_第178张图片

Encoding(3200)

CS224N NLP_第179张图片

Attention(3400)

CS224N NLP_第180张图片

CS224N NLP_第181张图片

Modeling and output layers(4640)

CS224N NLP_第182张图片

CS224N NLP_第183张图片

BERT for reading comprehension (5227)

CS224N NLP_第184张图片

Comparisons between BiDAF and BERT models(2734)

CS224N NLP_第185张图片

Can we design better pre-training objectives(h0000)

CS224N NLP_第186张图片

open domain question answering(h1000)

CS224N NLP_第187张图片

CS224N NLP_第188张图片

CS224N NLP_第189张图片

DPR(H1400)

CS224N NLP_第190张图片

CS224N NLP_第191张图片

CS224N NLP_第192张图片

DensePhrase:Demo(h1800)

Lecture 12 - Natural Language Generation[ToL]

CS224N NLP_第193张图片

What is neural language generation?(0300)

CS224N NLP_第194张图片

Machine Translation

Dialogue Systems (e.g. Siri)

Summarization

Visual Description

Creative Generation (e.g. stories)

Components of NLG Systems(0845)

Basic of natural language generation(0916)

CS224N NLP_第195张图片

A look at a single step(1024)

CS224N NLP_第196张图片

then select and train(1115)

Teacher forcing needs to be learned. [ToL]

CS224N NLP_第197张图片

Decoding(1317)

CS224N NLP_第198张图片

Greedy methods(1432)

CS224N NLP_第199张图片

Greedy methods get repetitive(1545)

CS224N NLP_第200张图片

why do repetition happen(1613)

CS224N NLP_第201张图片

How can we reduce repetition (1824)[ToL]

CS224N NLP_第202张图片

People do not always choose the greedy (most likely) option(1930)

CS224N NLP_第203张图片

Time to get random: Sampling(2047)

CS224N NLP_第204张图片

Decoding : Top-k sampling(2100)

CS224N NLP_第205张图片

CS224N NLP_第206张图片

Issues with Top-k sampling(2339)

CS224N NLP_第207张图片

Decoding: Top-p(nucleus)sampling(2421)

CS224N NLP_第208张图片

Scaling randomness: Softmax temperature (2500)[ToL]

CS224N NLP_第209张图片
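
A minimal sketch of temperature, top-k, and top-p filtering applied to a toy logits vector (my own illustration; the numbers are arbitrary):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

logits = np.array([3.0, 2.5, 1.0, 0.2, -1.0])   # toy next-token scores

# Temperature: divide logits by tau; tau < 1 sharpens, tau > 1 flattens.
def with_temperature(logits, tau):
    return softmax(logits / tau)

# Top-k: keep only the k highest-probability tokens, renormalize, then sample.
def top_k_probs(logits, k):
    p = softmax(logits)
    keep = np.argsort(p)[-k:]
    mask = np.zeros_like(p)
    mask[keep] = p[keep]
    return mask / mask.sum()

# Top-p (nucleus): keep the smallest set of tokens whose probability mass >= p.
def top_p_probs(logits, p_threshold):
    p = softmax(logits)
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), p_threshold) + 1
    keep = order[:cutoff]
    mask = np.zeros_like(p)
    mask[keep] = p[keep]
    return mask / mask.sum()

print(with_temperature(logits, 0.7))
print(top_k_probs(logits, k=2))
print(top_p_probs(logits, p_threshold=0.9))
rng = np.random.default_rng(0)
print(rng.choice(len(logits), p=top_p_probs(logits, 0.9)))   # sample a token id
```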

improving decoding: re-balancing distributions(2710)

CS224N NLP_第210张图片

Backpropagation-based distribution re-balancing(3027)

CS224N NLP_第211张图片

Improving Decoding: Re-ranking(3300)[ToL]

CS224N NLP_第212张图片

Decoding: Takeaways(3540)

CS224N NLP_第213张图片

Training NLG models(4114)

Maximum Likelihood Training(4200)

Are greedy decoders bad because of how they’re trained?

CS224N NLP_第214张图片

Unlikelihood Training(4427)[ToL]

CS224N NLP_第215张图片

Exposure Bias(4513)[ToL]

CS224N NLP_第216张图片

Exposure Bias Solutions(4645)

CS224N NLP_第217张图片

CS224N NLP_第218张图片

CS224N NLP_第219张图片

Reinforce Basics(4900)

CS224N NLP_第220张图片

Reward Estimation(5020)

CS224N NLP_第221张图片

CS224N NLP_第222张图片

reinforce’s dark side(5300)

CS224N NLP_第223张图片

image-20220301154547630

Training: Takeways(5423)

CS224N NLP_第224张图片

Evaluating NLG Systems(5613)

Types of evaluation methods for text generation(5734)

CS224N NLP_第225张图片

Content Overlap metrics(5800)

CS224N NLP_第226张图片

A simple failure case(5900)

CS224N NLP_第227张图片

Semantic overlap metrics(h0100)

Model-based metrics(h0120)

CS224N NLP_第228张图片

word distance functions(h0234)

CS224N NLP_第229张图片

Beyond word matching(h0350)

CS224N NLP_第230张图片

Human evaluations(h0433)

CS224N NLP_第231张图片

CS224N NLP_第232张图片

Issues(h0700)

CS224N NLP_第233张图片

Takeways(h0912)

CS224N NLP_第234张图片

Ethical Considerations(h1025)

CS224N NLP_第235张图片

CS224N NLP_第236张图片

CS224N NLP_第237张图片

CS224N NLP_第238张图片

CS224N NLP_第239张图片

CS224N NLP_第240张图片

Lecture 13 - Coreference Resolution

CS224N NLP_第241张图片

What is Coreference Resolution?(0604)

Identify all mentions that refer to the same entity in the world

Applications (1712)

CS224N NLP_第242张图片

CS224N NLP_第243张图片

Coreference Resolution in Two steps(1947)

CS224N NLP_第244张图片

Mention Detection(2049)

CS224N NLP_第245张图片

Not quite so simple(2255)

CS224N NLP_第246张图片

It is the best donut.

I want to find the best donut.

Avoiding a traditional pipeline system(2811)

CS224N NLP_第247张图片

End-to-end [ToL]

Onto Coreference! First, some linguistics (3035)

Coreference and Anaphora

CS224N NLP_第248张图片

CS224N NLP_第249张图片

not all anaphoric relations are coreferential (3349)

CS224N NLP_第250张图片

Anaphora vs Cataphora(3610)

Anaphora: the pronoun's antecedent comes before it; cataphora: the reference comes after it.

CS224N NLP_第251张图片

Taking stock (3801)

CS224N NLP_第252张图片

Four kinds of coreference Models(4018)

CS224N NLP_第253张图片

Traditional pronominal anaphora resolution:Hobbs’s naive algorithm(4130)

CS224N NLP_第254张图片

CS224N NLP_第255张图片

CS224N NLP_第256张图片

Knowledge-based Pronominal Coreference(4820)

CS224N NLP_第257张图片

Hobbs's algorithm cannot really solve these cases; the model needs to actually understand the sentence.

Coreference Models: Mention Pair(5624)

CS224N NLP_第258张图片

CS224N NLP_第259张图片

Mention Pair Test Time(5800)

CS224N NLP_第260张图片

Disadvantage(5953)

CS224N NLP_第261张图片

Coreference Models: Mention Ranking(h0050)

CS224N NLP_第262张图片

CS224N NLP_第263张图片

Convolutional Neural Nets(h0341)

CS224N NLP_第264张图片

What is convolution anyway?(h0452)

CS224N NLP_第265张图片

CS224N NLP_第266张图片

CS224N NLP_第267张图片

CS224N NLP_第268张图片

To summarize the features we have, we usually use pooling.

CS224N NLP_第269张图片

CS224N NLP_第270张图片

Max pooling is usually better.

CS224N NLP_第271张图片
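
A minimal NumPy sketch of a 1-D convolution over word vectors followed by max pooling (toy shapes; no padding, stride 1, no bias):

```python
import numpy as np

def conv1d(X, W):
    """X: (seq_len, emb_dim), W: (k, emb_dim, n_filters) -> (seq_len - k + 1, n_filters)."""
    k = W.shape[0]
    return np.stack([np.einsum('ke,kef->f', X[i:i + k], W)
                     for i in range(X.shape[0] - k + 1)])

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 4))          # 7 words, 4-dimensional embeddings
W = rng.normal(size=(3, 4, 2))       # window size 3, 2 filters

feature_maps = conv1d(X, W)               # (5, 2): one value per window per filter
sentence_vec = feature_maps.max(axis=0)   # max pooling over positions
print(feature_maps.shape, sentence_vec)
```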

End-to-End Neural Coref Model(h1206)

CS224N NLP_第272张图片

CS224N NLP_第273张图片

CS224N NLP_第274张图片

CS224N NLP_第275张图片

CS224N NLP_第276张图片

CS224N NLP_第277张图片

CS224N NLP_第278张图片

CS224N NLP_第279张图片

CS224N NLP_第280张图片

Conclusion (h2017)

CS224N NLP_第281张图片

Lecture 14 - T5 and Large Language Models

CS224N NLP_第282张图片

(0243)

CS224N NLP_第283张图片

CS224N NLP_第284张图片

T5 with a task prefix(0800)

image-20220302145406303

Others

CS224N NLP_第285张图片

CS224N NLP_第286张图片

STSB

CS224N NLP_第287张图片

Summarize

CS224N NLP_第288张图片

T5 change little from original transformer(1300)

CS224N NLP_第289张图片

what should my pre-training data set be?(1325)

Take text from an open web crawl, then clean it, to get C4 (1500).

Then is how to train from a start(1659)

CS224N NLP_第290张图片

pretrain(1805)

CS224N NLP_第291张图片

choose the model(2412)

CS224N NLP_第292张图片

They use the encoder-decoder model; it turns out to work well.

They don't tune hyperparameters much because of the compute cost.

pre-training objective(2629)

CS224N NLP_第293张图片

They try different training objectives and choose among them.

different structure of data source(2822)

CS224N NLP_第294张图片

Multi task learning (3443)

CS224N NLP_第295张图片

close the gap between multi-task training and this pre-training followed by separate fine tuning(3621)

CS224N NLP_第296张图片

What if it happens there are four times computes as much as before (3737)

CS224N NLP_第297张图片

Overview(3840)

CS224N NLP_第298张图片

CS224N NLP_第299张图片

CS224N NLP_第300张图片

CS224N NLP_第301张图片

What about all of the other languages?(mT5)(4735)

CS224N NLP_第302张图片

Same model, different (multilingual) corpus.

CS224N NLP_第303张图片

CS224N NLP_第304张图片

XTREME (5000)

CS224N NLP_第305张图片

How much knowledge does a language model pick up during pre-training?(5225)

CS224N NLP_第306张图片

CS224N NLP_第307张图片

CS224N NLP_第308张图片

CS224N NLP_第309张图片

Salient span masking (5631)

CS224N NLP_第310张图片

Instead of masking randomly, it masks salient spans such as names, places, and dates.

Do large language models memorize their training data(h0100)

It seems they do.

CS224N NLP_第311张图片

CS224N NLP_第312张图片

CS224N NLP_第313张图片

CS224N NLP_第314张图片

CS224N NLP_第315张图片

CS224N NLP_第316张图片

Larger models need to see a particular example fewer times in order to memorize it!

Can we close the gap between large and small models by improving the transformer architecture(h1010)

image-20220302164909562

In these tests, they changed parts of the architecture, such as the ReLU activation.

There actually were very few, if any, modifications that improved performance meaningfully.

CS224N NLP_第317张图片

CS224N NLP_第318张图片(h1700)

QA(h1915)

Lecture 15 - Add Knowledge to Language Models

CS224N NLP_第319张图片

Recap: LM(0232)

CS224N NLP_第320张图片

CS224N NLP_第321张图片

What does a language model know?(0423)

CS224N NLP_第322张图片

A prediction may be plausible in form but wrong in fact.

CS224N NLP_第323张图片

The importance of knowledge-aware language models(0700)

CS224N NLP_第324张图片

Query traditional knowledge bases(0750)

CS224N NLP_第325张图片

Query language models as knowledge bases(0955)

CS224N NLP_第326张图片

Compare and disadvantage(1010)

CS224N NLP_第327张图片

Techniques to add knowledge to LMs(1300)

CS224N NLP_第328张图片

Add pretrained embeddings(1403)

CS224N NLP_第329张图片

Aside: What is entity linking?(1516)

CS224N NLP_第330张图片

Method 1: Add pretrained entity embeddings(1815)

CS224N NLP_第331张图片

How do we incorporate pretrained entity embeddings from a different embedding space?(2000)

CS224N NLP_第332张图片

ERNIE: Enhanced language representation with informative entities(2143)

CS224N NLP_第333张图片

CS224N NLP_第334张图片

CS224N NLP_第335张图片

CS224N NLP_第336张图片

strengths & remaining challenges(2610)

CS224N NLP_第337张图片

Jointly learn to link entities with KnowBERT(2958)

CS224N NLP_第338张图片

Use an external memory(3140)

CS224N NLP_第339张图片

KGLM(3355)

CS224N NLP_第340张图片

Local knowledge and full knowledge

CS224N NLP_第341张图片

When should the model use the external knowledge(3600)

CS224N NLP_第342张图片

CS224N NLP_第343张图片

CS224N NLP_第344张图片

CS224N NLP_第345张图片

Compare to the others(4334)

CS224N NLP_第346张图片

More recent takes: Nearest Neighbor Language Models(kNN-LM)(4730)

CS224N NLP_第347张图片

CS224N NLP_第348张图片
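
A minimal sketch of the kNN-LM idea: interpolate the LM's next-word distribution with a distribution built from the nearest stored hidden states (the datastore, distance, and λ here are toy stand-ins):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab_size, d = 5, 3
rng = np.random.default_rng(0)

# Datastore: (context-representation, next-word id) pairs collected from training data.
keys = rng.normal(size=(10, d))
values = rng.integers(0, vocab_size, size=10)

def knn_probs(query, k=3):
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = softmax(-dists[nearest])            # closer neighbors get more weight
    p = np.zeros(vocab_size)
    for idx, w in zip(nearest, weights):
        p[values[idx]] += w
    return p

query = rng.normal(size=d)                        # current hidden state
p_lm = softmax(rng.normal(size=vocab_size))       # stand-in for the LM's own prediction
lam = 0.25
p_final = lam * knn_probs(query) + (1 - lam) * p_lm
print(p_final, p_final.sum())
```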

Modify the training data(5230)

CS224N NLP_第349张图片

CS224N NLP_第350张图片

WKLM(5458)

CS224N NLP_第351张图片

CS224N NLP_第352张图片

CS224N NLP_第353张图片

Learn inductive biases through masking(5811)

CS224N NLP_第354张图片

CS224N NLP_第355张图片

Salient span masking(5927)

CS224N NLP_第356张图片

Recap(h0053)

CS224N NLP_第357张图片

Evaluating knowledge in LMs(h0211)

LAMA(h0250)

CS224N NLP_第358张图片

CS224N NLP_第359张图片

The limitations (h0650)

CS224N NLP_第360张图片

LAMA UnHelpful Names (LAMA-UHN)

CS224N NLP_第361张图片

They remove examples that could be answered purely from surface co-occurrence cues (e.g. guessing facts from a name alone).

Developing better prompts to query knowledge in LMs

CS224N NLP_第362张图片

CS224N NLP_第363张图片

Knowledge-driven downstream tasks(h1253)

CS224N NLP_第364张图片

Relation extraction performance on TACRED(h1400)

CS224N NLP_第365张图片

Entity typing performance on Open Entity

CS224N NLP_第366张图片

Recap: Evaluating knowledge in LMs(h1600)

CS224N NLP_第367张图片

Other exciting progress & what’s next?(h1652)

CS224N NLP_第368张图片

Lecture 17 - Model Analysis and Explanation

CS224N NLP_第369张图片

CS224N NLP_第370张图片

Motivation

what are our models doing(0415)

CS224N NLP_第371张图片

how do we make tomorrow’s model?(0515)

CS224N NLP_第372张图片

What biases are built into model?(0700)

CS224N NLP_第373张图片

How do we make progress in the following 25 years?(0800)

CS224N NLP_第374张图片

Model analysis at varying levels of abstraction(0904)

CS224N NLP_第375张图片

Model evaluation as model analysis(1117)

CS224N NLP_第376张图片

Model evaluation as model analysis in natural language inference(1344)

CS224N NLP_第377张图片

What if the model is simply using heuristics to get good accuracy?(1558)

CS224N NLP_第378张图片

CS224N NLP_第379张图片

Language models as linguistic test subjects(2023)

CS224N NLP_第380张图片

CS224N NLP_第381张图片

CS224N NLP_第382张图片

Careful test sets as unit test suites: CheckListing(3230)

CS224N NLP_第383张图片

Fitting the dataset vs learning the task(3500)

CS224N NLP_第384张图片

Knowledge evaluation as model analysis(3642)

CS224N NLP_第385张图片

Input influence: does my model really use long-distance context?(3822)

CS224N NLP_第386张图片

Prediction explanations: what in the input led to this output?(4054)

CS224N NLP_第387张图片

Prediction explanations: simple saliency maps(4230)

CS224N NLP_第388张图片

CS224N NLP_第389张图片

Explanation by input reduction (4607)

CS224N NLP_第390张图片

CS224N NLP_第391张图片

Analyzing models by breaking them(5106)

CS224N NLP_第392张图片

CS224N NLP_第393张图片

They add a nonsense sentence at the end, and the prediction changes.

CS224N NLP_第394张图片

Changing the question also changes the prediction.

Are models robust to noise in their input?(5518)

CS224N NLP_第395张图片

It seems not.

Analysis of “interpretable” architecture components(5719)

CS224N NLP_第396张图片

CS224N NLP_第397张图片

CS224N NLP_第398张图片

CS224N NLP_第399张图片

CS224N NLP_第400张图片

CS224N NLP_第401张图片

Probing: supervised analysis of neural networks(h0408)

CS224N NLP_第402张图片

CS224N NLP_第403张图片

CS224N NLP_第404张图片

CS224N NLP_第405张图片

CS224N NLP_第406张图片

CS224N NLP_第407张图片

The most effective layers are in the middle.

CS224N NLP_第408张图片

Deeper layers capture more abstract information.

Emergent simple structure in neural networks(h1019)

CS224N NLP_第409张图片

Probing: trees simply recoverable from BERT representations(h1136)

CS224N NLP_第410张图片

Final thoughts on probing and correlation studies(h1341)

CS224N NLP_第411张图片

These are correlational, not causal, studies.

Recasting model tweaks and ablations as analysis(h1406)

CS224N NLP_第412张图片

Ablation analysis: do we need all these attention heads?(h1445)

CS224N NLP_第413张图片

What’s the right layer order for a transformer?(h1537)

CS224N NLP_第414张图片

Parting thoughts(h1612)

CS224N NLP_第415张图片

Lecture 18 - Future of NLP + Deep Learning

CS224N NLP_第416张图片

CS224N NLP_第417张图片

General Representation Learning Recipe(0312)

CS224N NLP_第418张图片

Certain properties emerge only when we scale up the model size!

Large Language Models and GPT-3(0358)

Large Language models and GPT-3(0514)

CS224N NLP_第419张图片

What’s new about GPT-3

CS224N NLP_第420张图片

CS224N NLP_第421张图片

CS224N NLP_第422张图片

There are three lectures left; they will be finished during review after I come back from Lee's course.
