我对算法一无所知

CS224N Assignment 1: Exploring Word Vectors part 1

Welcome to CS224n!

Before you start, make sure you read the README.txt in the same directory as this notebook.

# All Import Statements Defined Here
# Note: Do not add to this list.
# All the dependencies you need, can be installed by running .
# ----------------

import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = ''
END_TOKEN = ''

np.random.seed(0)
random.seed(0)
# ----------------

output:

[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\yezhipeng\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!

Word Vectors

Word Vectors are often used as a fundamental component for downstream NLP tasks, e.g. question answering, text generation, translation, etc., so it is important to build some intuitions as to their strengths and weaknesses. Here, you will explore two types of word vectors: those derived from co-occurrence matrices, and those derived via word2vec.

Assignment Notes: Please make sure to save the notebook as you go along. Submission Instructions are located at the bottom of the notebook.

Note on Terminology: The terms "word vectors" and "word embeddings" are often used interchangeably. The term "embedding" refers to the fact that we are encoding aspects of a word's meaning in a lower dimensional space. As Wikipedia states, "conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension".

Part 1: Count-Based Word Vectors (10 points)

Most word vector models start from the following idea:

You shall know a word by the company it keeps (Firth, J. R. 1957:11)

Many word vector implementations are driven by the idea that similar words, i.e., (near) synonyms, will be used in similar contexts. As a result, similar words will often be spoken or written along with a shared subset of words, i.e., contexts. By examining these contexts, we can try to develop embeddings for our words. With this intuition in mind, many "old school" approaches to constructing word vectors relied on word counts. Here we elaborate upon one of those strategies, co-occurrence matrices (for more information, see here or here).

Co-Occurrence

A co-occurrence matrix counts how often things co-occur in some environment. Given some word occurring in the document, we consider the context window surrounding . Supposing our fixed window size is , then this is the preceding and subsequent words in that document, i.e. words $w_{i-n}...w_{i-1}$ and $w_{i+1}..w_{i+n}$ . We build a co-occurrence matrix , which is a symmetric word-by-word matrix in which $M_{ij}$ is the number of times appears inside 's window.

Example: Co-Occurrence with Fixed Window of n=1:

Document 1: "all that glitters is not gold"

Document 2: "all is well that ends well"

*	START	all	that	glitters	is	not	gold	well	ends	END
START	0	2	0	0	0	0	0	0	0	0
all	2	0	1	0	1	0	0	0	0	0
that	0	1	0	1	0	0	0	1	1	0
glitters	0	0	1	0	1	0	0	0	0	0
is	0	1	0	1	0	1	0	1	0	0
not	0	0	0	0	1	0	1	0	0	0
gold	0	0	0	0	0	1	0	0	0	1
well	0	0	1	0	1	0	0	0	1	1
ends	0	0	1	0	0	0	0	1	0	0
END	0	0	0	0	0	0	1	1	0	0

Note: In NLP, we often add START and END tokens to represent the beginning and end of sentences, paragraphs or documents. In thise case we imagine START and END tokens encapsulating each document, e.g., "START All that glitters is not gold END", and include these tokens in our co-occurrence counts.

The rows (or columns) of this matrix provide one type of word vectors (those based on word-word co-occurrence), but the vectors will be large in general (linear in the number of distinct words in a corpus). Thus, our next step is to run dimensionality reduction. In particular, we will run SVD (Singular Value Decomposition), which is a kind of generalized PCA (Principal Components Analysis) to select the top principal components. Here's a visualization of dimensionality reduction with SVD. In this picture our co-occurrence matrix is with rows corresponding to words. We obtain a full matrix decomposition, with the singular values ordered in the diagonal matrix, and our new, shorter length- word vectors in .

This reduced-dimensionality co-occurrence representation preserves semantic relationships between words, e.g. doctor and hospital will be closer than doctor and dog.

Notes: If you can barely remember what an eigenvalue is, here's a slow, friendly introduction to SVD. If you want to learn more thoroughly about PCA or SVD, feel free to check out lectures 7, 8, and 9 of CS168. These course notes provide a great high-level treatment of these general purpose algorithms. Though, for the purpose of this class, you only need to know how to extract the k-dimensional embeddings by utilizing pre-programmed implementations of these algorithms from the numpy, scipy, or sklearn python packages. In practice, it is challenging to apply full SVD to large corpora because of the memory needed to perform PCA or SVD. However, if you only want the top vector components for relatively small — known as Truncated SVD — then there are reasonably scalable techniques to compute those iteratively.

Plotting Co-Occurrence Word Embeddings

Here, we will be using the Reuters (business and financial news) corpus. If you haven't run the import cell at the top of this page, please run it now (click it and press SHIFT-RETURN). The corpus consists of 10,788 news documents totaling 1.3 million words. These documents span 90 categories and are split into train and test. For more details, please see https://www.nltk.org/book/ch02.html. We provide a read_corpus function below that pulls out only articles from the "crude" (i.e. news articles about oil, gas, etc.) category. The function also adds START and END tokens to each of the documents, and lowercases words. You do not have perform any other kind of pre-processing.

def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

Let's have a look what these documents are like….

reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)

output:

[['', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',
  'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',
  'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',
  'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',
  'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',
  'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',
  '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',
  'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',
  'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',
  'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',
  'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',
  'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',
  'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',
  'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',
  'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', "'", 's',
  'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',
  'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',
  'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',
  'they', 'noted', '.', ''],
 ['', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'industry', 'cheap', 'oil',
  'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar', 'and', 'a', 'plant',
  'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel', 'the', 'streamlined', 'u',
  '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profits', 'this', 'year', ',',
  'with', 'growth', 'expected', 'through', 'at', 'least', '1990', ',', 'major', 'company',
  'executives', 'predicted', '.', 'this', 'bullish', 'outlook', 'for', 'chemical', 'manufacturing',
  'and', 'an', 'industrywide', 'move', 'to', 'shed', 'unrelated', 'businesses', 'has', 'prompted',
  'gaf', 'corp', '&', 'lt', ';', 'gaf', '>,', 'privately', '-', 'held', 'cain', 'chemical', 'inc',
  ',', 'and', 'other', 'firms', 'to', 'aggressively', 'seek', 'acquisitions', 'of', 'petrochemical',
  'plants', '.', 'oil', 'companies', 'such', 'as', 'ashland', 'oil', 'inc', '&', 'lt', ';', 'ash',
  '>,', 'the', 'kentucky', '-', 'based', 'oil', 'refiner', 'and', 'marketer', ',', 'are', 'also',
  'shopping', 'for', 'money', '-', 'making', 'petrochemical', 'businesses', 'to', 'buy', '.', '"',
  'i', 'see', 'us', 'poised', 'at', 'the', 'threshold', 'of', 'a', 'golden', 'period', ',"', 'said',
  'paul', 'oreffice', ',', 'chairman', 'of', 'giant', 'dow', 'chemical', 'co', '&', 'lt', ';',
  'dow', '>,', 'adding', ',', '"', 'there', "'", 's', 'no', 'major', 'plant', 'capacity', 'being',
  'added', 'around', 'the', 'world', 'now', '.', 'the', 'whole', 'game', 'is', 'bringing', 'out',
  'new', 'products', 'and', 'improving', 'the', 'old', 'ones', '."', 'analysts', 'say', 'the',
  'chemical', 'industry', "'", 's', 'biggest', 'customers', ',', 'automobile', 'manufacturers',
  'and', 'home', 'builders', 'that', 'use', 'a', 'lot', 'of', 'paints', 'and', 'plastics', ',',
  'are', 'expected', 'to', 'buy', 'quantities', 'this', 'year', '.', 'u', '.', 's', '.',
  'petrochemical', 'plants', 'are', 'currently', 'operating', 'at', 'about', '90', 'pct',
  'capacity', ',', 'reflecting', 'tighter', 'supply', 'that', 'could', 'hike', 'product', 'prices',
  'by', '30', 'to', '40', 'pct', 'this', 'year', ',', 'said', 'john', 'dosher', ',', 'managing',
  'director', 'of', 'pace', 'consultants', 'inc', 'of', 'houston', '.', 'demand', 'for', 'some',
  'products', 'such', 'as', 'styrene', 'could', 'push', 'profit', 'margins', 'up', 'by', 'as',
  'much', 'as', '300', 'pct', ',', 'he', 'said', '.', 'oreffice', ',', 'speaking', 'at', 'a',
  'meeting', 'of', 'chemical', 'engineers', 'in', 'houston', ',', 'said', 'dow', 'would', 'easily',
  'top', 'the', '741', 'mln', 'dlrs', 'it', 'earned', 'last', 'year', 'and', 'predicted', 'it',
  'would', 'have', 'the', 'best', 'year', 'in', 'its', 'history', '.', 'in', '1985', ',', 'when',
  'oil', 'prices', 'were', 'still', 'above', '25', 'dlrs', 'a', 'barrel', 'and', 'chemical',
  'exports', 'were', 'adversely', 'affected', 'by', 'the', 'strong', 'u', '.', 's', '.', 'dollar',
  ',', 'dow', 'had', 'profits', 'of', '58', 'mln', 'dlrs', '.', '"', 'i', 'believe', 'the',
  'entire', 'chemical', 'industry', 'is', 'headed', 'for', 'a', 'record', 'year', 'or', 'close',
  'to', 'it', ',"', 'oreffice', 'said', '.', 'gaf', 'chairman', 'samuel', 'heyman', 'estimated',
  'that', 'the', 'u', '.', 's', '.', 'chemical', 'industry', 'would', 'report', 'a', '20', 'pct',
  'gain', 'in', 'profits', 'during', '1987', '.', 'last', 'year', ',', 'the', 'domestic',
  'industry', 'earned', 'a', 'total', 'of', '13', 'billion', 'dlrs', ',', 'a', '54', 'pct', 'leap',
  'from', '1985', '.', 'the', 'turn', 'in', 'the', 'fortunes', 'of', 'the', 'once', '-', 'sickly',
  'chemical', 'industry', 'has', 'been', 'brought', 'about', 'by', 'a', 'combination', 'of', 'luck',
  'and', 'planning', ',', 'said', 'pace', "'", 's', 'john', 'dosher', '.', 'dosher', 'said', 'last',
  'year', "'", 's', 'fall', 'in', 'oil', 'prices', 'made', 'feedstocks', 'dramatically', 'cheaper',
  'and', 'at', 'the', 'same', 'time', 'the', 'american', 'dollar', 'was', 'weakening', 'against',
  'foreign', 'currencies', '.', 'that', 'helped', 'boost', 'u', '.', 's', '.', 'chemical',
  'exports', '.', 'also', 'helping', 'to', 'bring', 'supply', 'and', 'demand', 'into', 'balance',
  'has', 'been', 'the', 'gradual', 'market', 'absorption', 'of', 'the', 'extra', 'chemical',
  'manufacturing', 'capacity', 'created', 'by', 'middle', 'eastern', 'oil', 'producers', 'in',
  'the', 'early', '1980s', '.', 'finally', ',', 'virtually', 'all', 'major', 'u', '.', 's', '.',
  'chemical', 'manufacturers', 'have', 'embarked', 'on', 'an', 'extensive', 'corporate',
  'restructuring', 'program', 'to', 'mothball', 'inefficient', 'plants', ',', 'trim', 'the',
  'payroll', 'and', 'eliminate', 'unrelated', 'businesses', '.', 'the', 'restructuring', 'touched',
  'off', 'a', 'flurry', 'of', 'friendly', 'and', 'hostile', 'takeover', 'attempts', '.', 'gaf', ',',
  'which', 'made', 'an', 'unsuccessful', 'attempt', 'in', '1985', 'to', 'acquire', 'union',
  'carbide', 'corp', '&', 'lt', ';', 'uk', '>,', 'recently', 'offered', 'three', 'billion', 'dlrs',
  'for', 'borg', 'warner', 'corp', '&', 'lt', ';', 'bor', '>,', 'a', 'chicago', 'manufacturer',
  'of', 'plastics', 'and', 'chemicals', '.', 'another', 'industry', 'powerhouse', ',', 'w', '.',
  'r', '.', 'grace', '&', 'lt', ';', 'gra', '>', 'has', 'divested', 'its', 'retailing', ',',
  'restaurant', 'and', 'fertilizer', 'businesses', 'to', 'raise', 'cash', 'for', 'chemical',
  'acquisitions', '.', 'but', 'some', 'experts', 'worry', 'that', 'the', 'chemical', 'industry',
  'may', 'be', 'headed', 'for', 'trouble', 'if', 'companies', 'continue', 'turning', 'their',
  'back', 'on', 'the', 'manufacturing', 'of', 'staple', 'petrochemical', 'commodities', ',', 'such',
  'as', 'ethylene', ',', 'in', 'favor', 'of', 'more', 'profitable', 'specialty', 'chemicals',
  'that', 'are', 'custom', '-', 'designed', 'for', 'a', 'small', 'group', 'of', 'buyers', '.', '"',
  'companies', 'like', 'dupont', '&', 'lt', ';', 'dd', '>', 'and', 'monsanto', 'co', '&', 'lt', ';',
  'mtc', '>', 'spent', 'the', 'past', 'two', 'or', 'three', 'years', 'trying', 'to', 'get', 'out',
  'of', 'the', 'commodity', 'chemical', 'business', 'in', 'reaction', 'to', 'how', 'badly', 'the',
  'market', 'had', 'deteriorated', ',"', 'dosher', 'said', '.', '"', 'but', 'i', 'think', 'they',
  'will', 'eventually', 'kill', 'the', 'margins', 'on', 'the', 'profitable', 'chemicals', 'in',
  'the', 'niche', 'market', '."', 'some', 'top', 'chemical', 'executives', 'share', 'the',
  'concern', '.', '"', 'the', 'challenge', 'for', 'our', 'industry', 'is', 'to', 'keep', 'from',
  'getting', 'carried', 'away', 'and', 'repeating', 'past', 'mistakes', ',"', 'gaf', "'", 's',
  'heyman', 'cautioned', '.', '"', 'the', 'shift', 'from', 'commodity', 'chemicals', 'may', 'be',
  'ill', '-', 'advised', '.', 'specialty', 'businesses', 'do', 'not', 'stay', 'special', 'long',
  '."', 'houston', '-', 'based', 'cain', 'chemical', ',', 'created', 'this', 'month', 'by', 'the',
  'sterling', 'investment', 'banking', 'group', ',', 'believes', 'it', 'can', 'generate', '700',
  'mln', 'dlrs', 'in', 'annual', 'sales', 'by', 'bucking', 'the', 'industry', 'trend', '.',
  'chairman', 'gordon', 'cain', ',', 'who', 'previously', 'led', 'a', 'leveraged', 'buyout', 'of',
  'dupont', "'", 's', 'conoco', 'inc', "'", 's', 'chemical', 'business', ',', 'has', 'spent', '1',
  '.', '1', 'billion', 'dlrs', 'since', 'january', 'to', 'buy', 'seven', 'petrochemical', 'plants',
  'along', 'the', 'texas', 'gulf', 'coast', '.', 'the', 'plants', 'produce', 'only', 'basic',
  'commodity', 'petrochemicals', 'that', 'are', 'the', 'building', 'blocks', 'of', 'specialty',
  'products', '.', '"', 'this', 'kind', 'of', 'commodity', 'chemical', 'business', 'will', 'never',
  'be', 'a', 'glamorous', ',', 'high', '-', 'margin', 'business', ',"', 'cain', 'said', ',',
  'adding', 'that', 'demand', 'is', 'expected', 'to', 'grow', 'by', 'about', 'three', 'pct',
  'annually', '.', 'garo', 'armen', ',', 'an', 'analyst', 'with', 'dean', 'witter', 'reynolds', ',',
  'said', 'chemical', 'makers', 'have', 'also', 'benefitted', 'by', 'increasing', 'demand', 'for',
  'plastics', 'as', 'prices', 'become', 'more', 'competitive', 'with', 'aluminum', ',', 'wood',
  'and', 'steel', 'products', '.', 'armen', 'estimated', 'the', 'upturn', 'in', 'the', 'chemical',
  'business', 'could', 'last', 'as', 'long', 'as', 'four', 'or', 'five', 'years', ',', 'provided',
  'the', 'u', '.', 's', '.', 'economy', 'continues', 'its', 'modest', 'rate', 'of', 'growth', '.',
  ''],
 ['', 'turkey', 'calls', 'for', 'dialogue', 'to', 'solve', 'dispute', 'turkey', 'said',
  'today', 'its', 'disputes', 'with', 'greece', ',', 'including', 'rights', 'on', 'the',
  'continental', 'shelf', 'in', 'the', 'aegean', 'sea', ',', 'should', 'be', 'solved', 'through',
  'negotiations', '.', 'a', 'foreign', 'ministry', 'statement', 'said', 'the', 'latest', 'crisis',
  'between', 'the', 'two', 'nato', 'members', 'stemmed', 'from', 'the', 'continental', 'shelf',
  'dispute', 'and', 'an', 'agreement', 'on', 'this', 'issue', 'would', 'effect', 'the', 'security',
  ',', 'economy', 'and', 'other', 'rights', 'of', 'both', 'countries', '.', '"', 'as', 'the',
  'issue', 'is', 'basicly', 'political', ',', 'a', 'solution', 'can', 'only', 'be', 'found', 'by',
  'bilateral', 'negotiations', ',"', 'the', 'statement', 'said', '.', 'greece', 'has', 'repeatedly',
  'said', 'the', 'issue', 'was', 'legal', 'and', 'could', 'be', 'solved', 'at', 'the',
  'international', 'court', 'of', 'justice', '.', 'the', 'two', 'countries', 'approached', 'armed',
  'confrontation', 'last', 'month', 'after', 'greece', 'announced', 'it', 'planned', 'oil',
  'exploration', 'work', 'in', 'the', 'aegean', 'and', 'turkey', 'said', 'it', 'would', 'also',
  'search', 'for', 'oil', '.', 'a', 'face', '-', 'off', 'was', 'averted', 'when', 'turkey',
  'confined', 'its', 'research', 'to', 'territorrial', 'waters', '.', '"', 'the', 'latest',
  'crises', 'created', 'an', 'historic', 'opportunity', 'to', 'solve', 'the', 'disputes', 'between',
  'the', 'two', 'countries', ',"', 'the', 'foreign', 'ministry', 'statement', 'said', '.', 'turkey',
  "'", 's', 'ambassador', 'in', 'athens', ',', 'nazmi', 'akiman', ',', 'was', 'due', 'to', 'meet',
  'prime', 'minister', 'andreas', 'papandreou', 'today', 'for', 'the', 'greek', 'reply', 'to', 'a',
  'message', 'sent', 'last', 'week', 'by', 'turkish', 'prime', 'minister', 'turgut', 'ozal', '.',
  'the', 'contents', 'of', 'the', 'message', 'were', 'not', 'disclosed', '.', '']]

Question 1.1: Implement `distinct_words` code (2 points)

Write a method to work out the distinct words (word types) that occur in the corpus. You can do this with for loops, but it's more efficient to do it with Python list comprehensions. In particular, this may be useful to flatten a list of lists. If you're not familiar with Python list comprehensions in general, here's more information.

You may find it useful to use Python sets to remove duplicate words.

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    corpus_words = list(set([j for i in corpus for j in i]))
    num_corpus_words = len(corpus_words)
    corpus_words.sort()
    # ------------------

    return corpus_words, num_corpus_words

# ---------------------
# Run this sanity check
# Note that this not an exhaustive check for correctness.
# ---------------------

# Define toy corpus
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
test_corpus_words, num_corpus_words = distinct_words(test_corpus)

# Correct answers
ans_test_corpus_words = sorted(list(set(["START", "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", "END"])))
ans_num_corpus_words = len(ans_test_corpus_words)

# Test correct number of words
assert(num_corpus_words == ans_num_corpus_words), "Incorrect number of distinct words. Correct: {}. Yours: {}".format(ans_num_corpus_words, num_corpus_words)

# Test correct words
assert (test_corpus_words == ans_test_corpus_words), "Incorrect corpus_words.\nCorrect: {}\nYours:   {}".format(str(ans_test_corpus_words), str(test_corpus_words))

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)

output:

--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.2: Implement `compute_co_occurrence_matrix` code (3 points)

Write a method that constructs a co-occurrence matrix for a certain window-size (with a default of 4), considering words before and after the word in the center of the window. Here, we start to use numpy (np) to represent vectors, matrices, and tensors. If you're not familiar with NumPy, there's a NumPy tutorial in the second half of this cs231n Python NumPy tutorial.

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    M = np.zeros((num_words, num_words))
    word2Ind = {words[i]:i for i in range(num_words)}
    for sentence in corpus:
        length_sentence = len(sentence)
        for i in range(length_sentence):
            if i-window_size <= 0 :
                start_window = 0
            else:
                start_window = i-window_size
            if i+window_size >= length_sentence-1:
                end_window = length_sentence-1
            else:
                end_window = i+window_size
            
            center_word = sentence[i]
            row = word2Ind[center_word]
            for j in range(start_window, end_window+1):
                if j == i:
                    continue
                else:
                    context_word = sentence[j]
                    col = word2Ind[context_word]
                    M[row][col] += 1
    # ------------------

    return M, word2Ind

# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------

# Define toy corpus and get student's co-occurrence matrix
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)

# Correct M and word2Ind
M_test_ans = np.array( 
    [[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,],
     [0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,],
     [0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],
     [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,],
     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],
     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],
     [0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,],
     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],
     [1., 0., 0., 0., 1., 1., 0., 0., 0., 1.,],
     [0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,]]
)
word2Ind_ans = {'All': 0, "All's": 1, 'END': 2, 'START': 3, 'ends': 4, 'glitters': 5, 'gold': 6, "isn't": 7, 'that': 8, 'well': 9}

# Test correct word2Ind
assert (word2Ind_ans == word2Ind_test), "Your word2Ind is incorrect:\nCorrect: {}\nYours: {}".format(word2Ind_ans, word2Ind_test)

# Test correct M shape
assert (M_test.shape == M_test_ans.shape), "M matrix has incorrect shape.\nCorrect: {}\nYours: {}".format(M_test.shape, M_test_ans.shape)

# Test correct M values
for w1 in word2Ind_ans.keys():
    idx1 = word2Ind_ans[w1]
    for w2 in word2Ind_ans.keys():
        idx2 = word2Ind_ans[w2]
        student = M_test[idx1, idx2]
        correct = M_test_ans[idx1, idx2]
        if student != correct:
            print("Correct M:")
            print(M_test_ans)
            print("Your M: ")
            print(M_test)
            raise AssertionError("Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.".format(idx1, idx2, w1, w2, student, correct))

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)

output:

--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.3: Implement `reduce_to_k_dim` code (1 point)

Construct a method that performs dimensionality reduction on the matrix to produce k-dimensional embeddings. Use SVD to take the top k components and produce a new matrix of k-dimensional embeddings.

Note: All of numpy, scipy, and scikit-learn (sklearn) provide some implementation of SVD, but only scipy and sklearn provide an implementation of Truncated SVD, and only sklearn provides an efficient randomized algorithm for calculating large-scale Truncated SVD. So please use sklearn.decomposition.TruncatedSVD.

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
     # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=42)
    M_reduced = svd.fit_transform(M)
    # ------------------

    print("Done.")
    return M_reduced

# ---------------------
# Run this sanity check
# Note that this not an exhaustive check for correctness 
# In fact we only check that your M_reduced has the right dimensions.
# ---------------------

# Define toy corpus and run student code
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)
M_test_reduced = reduce_to_k_dim(M_test, k=2)

# Test proper dimensions
assert (M_test_reduced.shape[0] == 10), "M_reduced has {} rows; should have {}".format(M_test_reduced.shape[0], 10)
assert (M_test_reduced.shape[1] == 2), "M_reduced has {} columns; should have {}".format(M_test_reduced.shape[1], 2)

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)

output:

Running Truncated SVD over 10 words...
Done.
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.4: Implement `plot_embeddings` code (1 point)

Here you will write a function to plot a set of 2D vectors in 2D space. For graphs, we will use Matplotlib (plt).

For this example, you may find it useful to adapt this code. In the future, a good way to make a plot is to look at the Matplotlib gallery, find a plot that looks somewhat like what you want, and adapt the code they give.

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensioal word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    for word in words:
        x = M_reduced[word2Ind[word]][0]
        y = M_reduced[word2Ind[word]][1]
        plt.scatter(x, y, marker='x', color='blue')
        plt.text(x, y, word)
    plt.show()

    # ------------------

# ---------------------
# Run this sanity check
# Note that this not an exhaustive check for correctness.
# The plot produced should look like the "test solution plot" depicted below. 
# ---------------------

print ("-" * 80)
print ("Outputted Plot:")

M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)

print ("-" * 80)

Question 1.5: Co-Occurrence Plot Analysis written (3 points)

Now we will put together all the parts you have written! We will compute the co-occurrence matrix with fixed window of 4, over the Reuters "crude" corpus. Then we will use TruncatedSVD to compute 2-dimensional embeddings of each word. TruncatedSVD returns U*S, so we normalize the returned vectors, so that all the vectors will appear around the unit circle (therefore closeness is directional closeness). Note: The line of code below that does the normalizing uses the NumPy concept of broadcasting. If you don't know about broadcasting, check out Computation on Arrays: Broadcasting by Jake VanderPlas.

Run the below cell to produce the plot. It'll probably take a few seconds to run. What clusters together in 2-dimensional embedding space? What doesn't cluster together that you might think should have? Note: "bpd" stands for "barrels per day" and is a commonly used abbreviation in crude oil topic articles.

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)

答案是自己写的，有错请指正

你可能感兴趣的:(CS224N,NLP,Python,cs224n,自然语言处理)

Python入门(函数) 高育良00003 python 开发语言
一.基础认识一种映射关系1.1什么是函数呢？概念函数是可以重复执行的语句块，可以重复调用作用用于封装语句块，提高代码的重用性1.2函数的定义语法：deffunction():#def为关键字，function为函数名#语句想要执行的操作returnre#re为返回值二.函数的调用函数名后+小括号()表示函数的执行2.1基本用法语法：函数名(实际调用的参数)2.2调用传参2.2.1位置传参最为常见，
python本地连接minio 伶星37 python 网络服务器
在你浏览器能成功访问到你的minio网页，并且成功登录之后。接下来如果你想用python连接数据库，并且想用python连接minio，就可以用这个blog。连接代码client=Minio("localhost:9000",#9000是默认端口号access_key="admin",#你的账户secret_key="password",#你的密码secure=False,#这点我会详细说明)为什
头歌实践教学平台 Python程序设计实训答案（三）学习的锅头哥实践教学平台实训答案 python
第七阶段文件实验一文本文件的读取第1关：学习-Python文件之文本文件的读取任务描述本关任务：使用open函数以只写的方式打开文件，打印文件的打开方式。相关知识为了完成本关任务，你需要掌握：文本文件；open函数及其参数；文件打开模式；文件对象常用属性；关闭文件close函数。#请在下面的Begin-End之间按照注释中给出的提示编写正确的代码##########Begin###########
python基础之--面相对象--OOP基本特性暴龙胡乱写博客 python 开发语言人工智能
python基础之–面相对象–OOP基本特性文章目录python基础之--面相对象--OOP基本特性一，OOP基本特性1.1封装1.2继承/派生1.2.1基础概念1.2.3继承实现1.3多态1.4对象对成员的操作（补充）1.5私有属性1.6重写魔术方法二，super函数2.1基本使用2.2super().\__init__()一，OOP基本特性OOP的四大基本特性是封装、继承、多态和抽象。1.1封
Dify1.01版本vscode 本地环境搭建运行实践 hamish-wu vscode 编辑器 dify 大模型 python flask
dify是python编写的低代码AI开发平台，是常用的大模型开发平台。本文基于最新的1.0.1版本实践完成，有需要的可以私信交流。咨询免费，详细文档及视频需要一定成本，大概相当于节约的时间成本。搭建环境windows11开发工具vscode搭建步骤：1.Startthedocker-composestackwindow环境下运行docker命令，需要下载docker官网镜像，会遇到timeout
vscode python 入门教程(一) window 10 环境下安装pyenv hamish-wu Python python 开发语言 pyenv
python的环境配置方法很多，由于python有两个大版本，很多时候需要切换某个固定的版本才能运行三方包，所以推荐使用pyenv配置python环境变量pyenv的安装安装方法：Invoke-WebRequest-UseBasicParsing-Uri"https://raw.githubusercontent.com/pyenv-win/pyenv-win/master/pyenv-win/i
1-5 Python 入门之运算符的使用 Sa_sa_ki_Haise python
第1关：算术、比较、赋值运算符100任务要求参考答案评论201任务描述相关知识算术运算符比较(关系)运算符赋值运算符编程要求测试说明任务描述在编程时，我们常常需要对数值或对象进行算术、比较运算和赋值运算，以此来实现我们的功能需求。本关介绍Python中的一些基本运算符，并要求对给定的苹果和梨的数量进行算术运算、比较、赋值运算，然后输出相应的结果。相关知识要实现上述功能，需要用到Python中的各种
2025年第二届机器学习与神经网络国际学术会议(MLNN 2025) 分享学术科研与论文的禁小默机器学习神经网络人工智能
重要信息官网：www.icmlnn.org时间：2025年4月22-24日地点：中国-重庆简介2025年第二届机器学习与神经网络国际学术会议（MLNN2025）围绕学习系统与神经网络的核心理论、关键技术和应用展开讨论，涵盖深度学习、计算机视觉、自然语言处理、强化学习等多个子领域，通过特邀报告、主题演讲、海报展示等形式，展示相关领域的最新研究成果和技术创新。征稿主题神经网络机器学习深度学习算法及应用
rabbitmq + minio +python 上传文件伶星37 rabbitmq python ruby
功能实现RabbitMq接收hello里面传来的消息根据消息在MobileFile里面新建文件新建文件上传到miniopython新建文件importospath='./MobileFile'file_path=os.path.join(path,"new_file.txt")withopen(file_path,"w")asfile:pass转换成函数格式importosdefcreatefil
vscode python 入门教程(二) vscode使用gti 管理代码 hamish-wu vscode ide 编辑器
vscode代码管理需要用管道git的命令，这点和idea的代码管理区别比较大。作为java开发需要自己熟悉适应一下。一、GitHub新建一个仓库过程略二、本地git项目初始化gitinitvscode中可以看到文件状态gitstatus使用gitremote命令吧本地git仓库和远程git仓库链接起来[email protected]提交代码gitcommit-m"评论
Python进阶之-加密库cryptography使用详解夏天Aileft Python python 网络加密
✨前言cryptography库是一个强大的Python加密库，提供了对加密算法和协议的高层和低层访问。它是用来实现数据加密、签名、密钥管理等功能的。以下是一些常见用法的详解，帮助你理解如何使用这个库。✨安装首先，你需要确保安装了cryptography库：pipinstallcryptography✨1.对称加密对称加密是指加密和解密使用相同的密钥。Fernet是cryptography库中提供
python列表添加元素的三种方法定义集合数据对象_python 学习第三天可迭代对象（列表，字典，元组和集合）... weixin_39852491
列表，字典，元组和集合列表list列表是由一系列特定元素组成的，元素和元素之间没有任何关联关系，但他们之间有先后顺序关系列表是一种容器列表是序列的一种列表是可以被改变的序列Python中的序列类型简介（sequence）字符串（str）列表（list）元组（tuple）字节串（bytes）字节数组（bytearray）创建空列表的字面值L=[]#L绑定空列表创建非空列表：L=[1,’two’,3,
python~集合详解鱼跃龙 python python集合详解 set集合
集合的基本操作首先需要明确的是：集合(set)是一个无序的不重复元素序列，多用来进行排重；不支持切片和索引取值！1.创建集合>>>a={1,2,4,4}>>>a{1,2,4}>>>type(a)**创建空集合时需要注意：不能直接用大括号，只能用set()；否则创建的是一个字典>>>b=set()>>>type(b)>>>c={}>>>type(c)2.添加元素add()方法是将要添加的元素作为一个
Python密码学：cryptography库零度° python python 密码学
在数字时代，确保数据的安全性和隐私至关重要。Python中的cryptography库是一个全面的包，为Python开发者提供了密码学原语和配方。它支持高级配方和常见密码学算法的低级接口。cryptography库概述cryptography库旨在易于使用且默认安全。它包括各种密码学操作的高级和低级API，如：对称加密非对称加密哈希函数消息认证码（MAC）数字签名密钥管理cryptography库
Python---frozenset集合爱听雨声的北方汉快快乐乐学Python Python
frozenset是set的不可变版本，因此set集合中所有能改变集合本身的方法（如add、remove、discard、xxx_update等），frozenset都不支持；set集合中不改变集合本身的方法，fronzenset都支持。frozenset的作用主要有以下两点：1、当集合元素不需要改变时，使用frozenset代替set更安全。2、当某些API需要不可变对象时，必须用frozens
(python)保障信息安全的加密库-cryptography Marst·Zhang 基础知识实用工具 python
前言cryptography是一个广泛使用的Python加密库，提供了各种加密、哈希和签名算法的实现。它支持多种加密算法，如AES、RSA、ECC等，以及哈希函数（如SHA-256、SHA-384等）和数字签名算法(如DSA、ECDSA等).目录常见用途密码学函数主要功能优点缺点总结常见用途数据加密使用对称加密算法（如AES）对数据进行加密，确保数据在传输或存储过程中的机密性。数字签名生成和验证数
Python if-else对缩进的要求宇寒风暖 python编程 python 开发语言学习笔记
在Python中，缩进是语法的一部分，用于表示代码块的层次结构。if-else语句的代码块必须通过缩进来定义，缩进不正确会导致语法错误或逻辑错误。1.缩进的基本规则1.1缩进的作用缩进用于表示代码块的层次结构。同一代码块中的语句必须具有相同的缩进级别。缩进通常使用4个空格，这是Python官方推荐的风格。1.2示例x=10ifx>5:print("x大于5")#缩进4个空格print("这是if代
一文弄懂 Python assert 断言宇寒风暖 python编程 python 开发语言学习笔记
在Python中，assert是一种用于调试的语句，用于检查某个条件是否为True。如果条件为False，assert会抛出AssertionError异常，并可选地输出错误信息。assert通常用于在开发阶段验证程序的假设条件，确保代码的正确性。1.assert的基本语法1.1语法assertcondition,messagecondition：需要检查的条件表达式。message：可选参数，当
开源项目常见问题解决方案——cryptography 周屹隽
开源项目常见问题解决方案——cryptographycryptographycryptographyisapackagedesignedtoexposecryptographicprimitivesandrecipestoPythondevelopers.项目地址:https://gitcode.com/gh_mirrors/cr/cryptography项目基础介绍cryptography是一个
python 利用pandas实现从CSV导出并格式化后写入.jsonl文件风_流沙 python工具备忘录 python pandas 开发语言
你可以使用pandas库来读取CSV文件，然后通过一些格式化操作将数据转换为JSONL格式并写入文件。JSONL（JSONLines）格式是一种每行一个JSON对象的文件格式。下面是一个示例，演示了如何使用pandas读取CSV文件，处理数据并将其导出到JSONL文件中：示例代码：importpandasaspdimportjson#读取CSV文件df=pd.read_csv('data.csv'
Python文件加密库之cryptography使用详解 Rocky006 python 开发语言
概要在现代信息社会中，数据的安全性变得越来越重要。为了保护敏感信息，文件加密技术被广泛应用。Python的cryptography库提供了强大的加密功能，可以轻松实现文件加密和解密。本文将详细介绍如何使用cryptography库进行文件加密，包含具体的示例代码。cryptography库简介cryptography是Python中一个功能强大且易用的加密库，提供了对称加密、非对称加密、哈希算法、
【Python系列】高效Parquet数据处理策略：合并与分析实践小团团0 python 开发语言
在大数据时代，数据的存储、处理和分析变得尤为重要。Parquet作为一种高效的列存储格式，被广泛应用于大数据处理框架中，如ApacheSpark、ApacheHive等。Parquet是一个开源的列存储格式，它被设计用于支持复杂的嵌套数据结构，同时提供高效的压缩和编码方案，以优化存储空间和查询性能。以下将详细介绍如何使用Python对Parquet文件进行数据处理与合并，并提供相应的源码示例。一、
cryptography，一个神奇的 Python 库！ Sitin涛哥 Python python 开发语言
更多资料获取个人网站：ipengtao.com大家好，今天为大家分享一个神奇的Python库-cryptography。Github地址：https://github.com/pyca/cryptography在当今数字化时代，信息安全越来越受到重视。数据加密是保护数据安全的重要手段之一，而Python的cryptography库提供了丰富的功能来支持各种加密算法和协议。本文将深入探讨crypto
深度讨论Python for循环观智能 python 开发语言
作者的其他文章推荐：强化学习再受关注！for循环使用于遍历可迭代对象的Python语句，工作原理如下：#for循环foriteminiterable:print(item)#等价于iterator=iter(iterable)#获取迭代器whileTrue:try:item=next(iterator)#获取下一个元素print(item)exceptStopIteration:break#迭代结
Python第六章08：元组操作练习题苹果.Python.八宝粥 python 开发语言
#元组定义操作练习题"""定义一个元组，内容是：('周杰伦',11,['football','music'])，记录一个学生的信息（姓名、年龄、爱好）请通元组（tuple）的功能，对其进行如下操作：1.查询其年龄所在的下标位置2.查询学生的姓名3.删除学生爱好中的football4.增加爱好：coding"""my_tuple=('周杰伦',11,['football','music'])#1.查
Python第六章07：元组的定义和操作苹果.Python.八宝粥 python 前端开发语言
#tuple元组的定义和操作#tuple元组定义用小括号：(1,2,3,4,5),可以是不同类型元素#给变量定义元组时，写括号不写tuple：a=(1,2,3,4,5)#变量=（）变量=tuple（）空元组变量#tuple元组定义完成后，不可以修改，但是，如果元组中嵌套了一个列表时，元组中列表的内容可以修改#封装数据后，不希望被篡改数据，就使用元组tuple#1.定义一个元组t1=("halibo
利用Python爬虫获取Shopee（虾皮）商品详情：实战指南小爬虫程序猿 python 爬虫开发语言
在跨境电商领域，Shopee（虾皮）作为东南亚及台湾地区领先的电商平台，拥有海量的商品信息。无论是进行市场调研、数据分析，还是寻找热门商品，获取Shopee商品详情都是一项极具价值的任务。然而，手动浏览和整理这些信息显然是低效且容易出错的。幸运的是，通过编写Python爬虫程序，我们可以高效地完成这一任务。本文将详细介绍如何利用Python爬虫获取Shopee商品详情，并提供完整的代码示例。一、为
在Mac M1/M2芯片上完美安装DeepCTR库：避坑指南与实战验证 ku_code_ku 机器学习 macos 推荐算法推荐系统
让推荐算法在AppleSilicon上全速运行概述作为推荐系统领域的最经常用的明星库，DeepCTR集成了CTR预估、多任务学习等前沿模型实现。但在AppleSilicon架构的Mac设备上，安装过程常因ARM架构适配、依赖库版本冲突等问题受阻。本文通过20+次环境搭建实测，总结出最稳定的安装方案。关键版本说明（2024年验证）组件推荐版本注意事项Python3.10.x向下兼容至3.7，但3.1
数据库数值函数详解 web安全工具库数据库 oracle jvm
各类资料学习下载合集https://pan.quark.cn/s/8c91ccb5a474数值函数是数据库中用于处理数值数据的函数，可以用于执行各种数学运算、统计计算等。数值函数在数据分析及处理时非常重要，能够帮助我们进行数据的聚合、计算和转换。在本篇博客中，我们将详细介绍常用的数据库数值函数，并通过Python和SQLite进行示例，帮助您理解和应用这些函数。1.数值函数的基本概念数值函数是用于
Python中Requests的Cookies的简单使用北条苒茗殇 python 开发语言 Requests
概述Python的Requests库中有一个cookies，是用于管理HTTPCookie的工具，可以像字典一样操作Cookie，支持自动处理作用域（域名、路径）和持久化，cookies是一个RequestsCookieJar的类型。一、概念1.作用自动存储服务器返回的Cookie根据请求域名和路径进行自动发送匹配的Cookie支持手动添加、修改、删除Cookie2.RequestsCookieJ
mysql主从数据同步林鹤霄 mysql主从数据同步
配置mysql5.5主从服务器(转) 教程开始：一、安装MySQL 说明：在两台MySQL服务器192.168.21.169和192.168.21.168上分别进行如下操作，安装MySQL 5.5.22 二、配置MySQL主服务器（192.168.21.169）mysql -uroot -p &nb
oracle学习笔记 caoyong oracle
1、ORACLE的安装 a>、ORACLE的版本 8i,9i : i是internet 10g,11g : grid (网格) 12c : cloud (云计算) b>、10g不支持win7 &
数据库，SQL零基础入门天子之骄 sql 数据库入门基本术语
数据库，SQL零基础入门做网站肯定离不开数据库，本人之前没怎么具体接触SQL，这几天起早贪黑得各种入门，恶补脑洞。一些具体的知识点，可以让小白不再迷茫的术语，拿来与大家分享。数据库，永久数据的一个或多个大型结构化集合，通常与更新和查询数据的软件相关
pom.xml 一炮送你回车库 pom.xml
1、一级元素dependencies是可以被子项目继承的 2、一级元素dependencyManagement是定义该项目群里jar包版本号的，通常和一级元素properties一起使用，既然有继承，也肯定有一级元素modules来定义子元素 3、父项目里的一级元素<modules> <module>lcas-admin-war</module> <
sql查地区省市县 3213213333332132 sql mysql
-- db_yhm_city SELECT * FROM db_yhm_city WHERE class_parent_id = 1 -- 海南 class_id = 9 港、奥、台 class_id = 33、34、35 SELECT * FROM db_yhm_city WHERE class_parent_id =169 SELECT d1.cla
关于监听器那些让人头疼的事宝剑锋梅花香画图板监听器鼠标监听器
本人初学JAVA，对于界面开发我只能说有点蛋疼，用JAVA来做界面的话确实需要一定的耐心（不使用插件，就算使用插件的话也没好多少）既然Java提供了界面开发，老师又要求做，只能硬着头皮上啦。但是监听器还真是个难懂的地方，我是上了几次课才略微搞懂了些。
JAVA的遍历MAP darkranger map
Java Map遍历方式的选择 1. 阐述　　对于Java中Map的遍历方式，很多文章都推荐使用entrySet，认为其比keySet的效率高很多。理由是：entrySet方法一次拿到所有key和value的集合；而keySet拿到的只是key的集合，针对每个key，都要去Map中额外查找一次value，从而降低了总体效率。那么实际情况如何呢？　　为了解遍历性能的真实差距，包括在遍历ke
POJ 2312 Battle City 优先多列+bfs aijuans 搜索
来源：http://poj.org/problem?id=2312 题意：题目背景就是小时候玩的坦克大战，求从起点到终点最少需要多少步。已知S和R是不能走得，E是空的，可以走，B是砖，只有打掉后才可以通过。思路：很容易看出来这是一道广搜的题目，但是因为走E和走B所需要的时间不一样，因此不能用普通的队列存点。因为对于走B来说，要先打掉砖才能通过，所以我们可以理解为走B需要两步，而走E是指需要1
Hibernate与Jpa的关系，终于弄懂 avords java Hibernate 数据库 jpa
我知道Jpa是一种规范，而Hibernate是它的一种实现。除了Hibernate，还有EclipseLink(曾经的toplink)，OpenJPA等可供选择，所以使用Jpa的一个好处是，可以更换实现而不必改动太多代码。在play中定义Model时，使用的是jpa的annotations，比如javax.persistence.Entity, Table, Column, OneToMany
酸爽的console.log bee1314 console
在前端的开发中，console.log那是开发必备啊，简直直观。通过写小函数，组合大功能。更容易测试。但是在打版本时，就要删除console.log，打完版本进入开发状态又要添加，真不够爽。重复劳动太多。所以可以做些简单地封装，方便开发和上线。 /** * log.js hufeng * The safe wrapper for `console.xxx` functions *
哈佛教授：穷人和过于忙碌的人有一个共同思维特质 bijian1013 时间管理励志人生穷人过于忙碌
一个跨学科团队今年完成了一项对资源稀缺状况下人的思维方式的研究，结论是：穷人和过于忙碌的人有一个共同思维特质，即注意力被稀缺资源过分占据，引起认知和判断力的全面下降。这项研究是心理学、行为经济学和政策研究学者协作的典范。　　这个研究源于穆来纳森对自己拖延症的憎恨。他7岁从印度移民美国，很快就如鱼得水，哈佛毕业
other operate 征客丶 OS osx
一、Mac Finder 设置排序方式，预览栏在显示－》查看显示选项中二、有时预览显示时，卡死在那，有可能是一些临时文件夹被删除了，如：/private/tmp[有待验证] -------------------------------------------------------------------- 若有其他凝问或文中有错误，请及时向我指出，我好及时改正，同时也让我们一
【Scala五】分析Spark源代码总结的Scala语法三 bit1129 scala
1. If语句作为表达式 val properties = if (jobIdToActiveJob.contains(jobId)) { jobIdToActiveJob(stage.jobId).properties } else { // this stage will be assigned to "default" po
ZooKeeper 入门 BlueSkator 中间件 zk
ZooKeeper是一个高可用的分布式数据管理与系统协调框架。基于对Paxos算法的实现，使该框架保证了分布式环境中数据的强一致性，也正是基于这样的特性，使得ZooKeeper解决很多分布式问题。网上对ZK的应用场景也有不少介绍，本文将结合作者身边的项目例子，系统地对ZK的应用场景进行一个分门归类的介绍。值得注意的是，ZK并非天生就是为这些应用场景设计的，都是后来众多开发者根据其框架的特性，利
MySQL取得当前时间的函数是什么格式化日期的函数是什么 BreakingBad mysql Date
取得当前时间用 now() 就行。在数据库中格式化时间用DATE_FORMA T(date, format) . 根据格式串format 格式化日期或日期和时间值date，返回结果串。可用DATE_FORMAT( ) 来格式化DATE 或DATETIME 值，以便得到所希望的格式。根据format字符串格式化date值: %S, %s 两位数字形式的秒（ 00,01,
读《研磨设计模式》-代码笔记-组合模式 bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ import java.util.ArrayList; import java.util.List; abstract class Component { public abstract void printStruct(Str
4_JAVA+Oracle面试题(有答案) chenke oracle
基础测试题卷面上不能出现任何的涂写文字，所有的答案要求写在答题纸上，考卷不得带走。选择题 1、 What will happen when you attempt to compile and run the following code? （3） public class Static { static { int x = 5; // 在static内有效 } st
新一代工作流系统设计目标 comsci 工作算法脚本
用户只需要给工作流系统制定若干个需求，流程系统根据需求，并结合事先输入的组织机构和权限结构，调用若干算法，在流程展示版面上面显示出系统自动生成的流程图，然后由用户根据实际情况对该流程图进行微调，直到满意为止，流程在运行过程中，系统和用户可以根据情况对流程进行实时的调整，包括拓扑结构的调整，权限的调整，内置脚本的调整。。。。。在这个设计中，最难的地方是系统根据什么来生成流
oracle 行链接与行迁移 daizj oracle 行迁移
表里的一行对于一个数据块太大的情况有二种(一行在一个数据块里放不下) 第一种情况: INSERT的时候，INSERT时候行的大小就超一个块的大小。Oracle把这行的数据存储在一连串的数据块里(Oracle Stores the data for the row in a chain of data blocks)，这种情况称为行链接(Row Chain)，一般不可避免(除非使用更大的数据
[JShop]开源电子商务系统jshop的系统缓存实现 dinguangx jshop 电子商务
前言 jeeshop中通过SystemManager管理了大量的缓存数据，来提升系统的性能，但这些缓存数据全部都是存放于内存中的，无法满足特定场景的数据更新（如集群环境）。JShop对jeeshop的缓存机制进行了扩展，提供CacheProvider来辅助SystemManager管理这些缓存数据，通过CacheProvider,可以把缓存存放在内存,ehcache,redis，memcache
初三全学年难记忆单词 dcj3sjt126com english word
several 儿子；若干 shelf 架子 knowledge 知识；学问 librarian 图书管理员 abroad 到国外，在国外 surf 冲浪 wave 浪；波浪 twice 两次；两倍 describe 描写；叙述 especially 特别；尤其 attract 吸引 prize 奖品；奖赏 competition 比赛；竞争 event 大事；事件 O
sphinx实践 dcj3sjt126com sphinx
安装参考地址:http://briansnelson.com/How_to_install_Sphinx_on_Centos_Server yum install sphinx 如果失败的话使用下面的方式安装 wget http://sphinxsearch.com/files/sphinx-2.2.9-1.rhel6.x86_64.rpm yum loca
JPA之JPQL（三） frank1234 orm jpa JPQL
1 什么是JPQL JPQL是Java Persistence Query Language的简称，可以看成是JPA中的HQL， JPQL支持各种复杂查询。 2 检索单个对象 @Test public void querySingleObject1() { Query query = em.createQuery("sele
Remove Duplicates from Sorted Array II hcx2013 remove
Follow up for "Remove Duplicates":What if duplicates are allowed at most twice? For example,Given sorted array nums = [1,1,1,2,2,3], Your function should return length
Spring4新特性——Groovy Bean定义DSL jinnianshilongnian spring 4
Spring4新特性——泛型限定式依赖注入 Spring4新特性——核心容器的其他改进 Spring4新特性——Web开发的增强 Spring4新特性——集成Bean Validation 1.1(JSR-349)到SpringMVC Spring4新特性——Groovy Bean定义DSL Spring4新特性——更好的Java泛型操作API Spring4新
CentOS安装Mysql5.5 liuxingguome centos
CentOS下以RPM方式安装MySQL5.5 首先卸载系统自带Mysql： yum remove mysql mysql-server mysql-libs compat-mysql51 rm -rf /var/lib/mysql rm /etc/my.cnf 查看是否还有mysql软件： rpm -qa|grep mysql 去http://dev.mysql.c
第14章工具函数（下） onestopweb 函数
index.html <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/
POJ 1050 SaraWon 二维数组子矩阵最大和
POJ ACM第1050题的详细描述，请参照 http://acm.pku.edu.cn/JudgeOnline/problem?id=1050 题目意思：给定包含有正负整型的二维数组，找出所有子矩阵的和的最大值。如二维数组 0 -2 -7 0 9 2 -6 2 -4 1 -4 1 -1 8 0 -2 中和最大的子矩阵是 9 2 -4 1 -1 8 且最大和是15
Java8全新打造，英语学习supertool yangshangchuan java superword 闭包 java8 函数式编程
superword是一个Java实现的英文单词分析软件，主要研究英语单词音近形似转化规律、前缀后缀规律、词之间的相似性规律等等。Clean code、Fluent style、Java8 feature: Lambdas, Streams and Functional-style Programming。升学考试、工作求职、充电提高，都少不了英语的身影，英语对我们来说实在太重要

CS224N Assignment 1: Exploring Word Vectors part 1

Word Vectors

Part 1: Count-Based Word Vectors (10 points)

Co-Occurrence

Plotting Co-Occurrence Word Embeddings

Question 1.1: Implement distinct_words code (2 points)

Question 1.2: Implement compute_co_occurrence_matrix code (3 points)

Question 1.3: Implement reduce_to_k_dim code (1 point)

Question 1.4: Implement plot_embeddings code (1 point)

Question 1.5: Co-Occurrence Plot Analysis written (3 points)

你可能感兴趣的:(CS224N,NLP,Python,cs224n,自然语言处理)

Question 1.1: Implement `distinct_words` code (2 points)

Question 1.2: Implement `compute_co_occurrence_matrix` code (3 points)

Question 1.3: Implement `reduce_to_k_dim` code (1 point)

Question 1.4: Implement `plot_embeddings` code (1 point)