c_木ss

2019-CS224N-Assignment 1: Exploring Word Vectors

认真看2019-cs224n这门课，好好学习！
斯坦福作业一：http://web.stanford.edu/class/cs224n/assignments/a1_preview/exploring_word_vectors.html
首先导入各种包，这里不用自己写代码：

# All Import Statements Defined Here
# Note: Do not add to this list.
# All the dependencies you need, can be installed by running .
# ----------------

import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import pprint
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
import nltk
nltk.download('reuters')
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = ''
END_TOKEN = ''

np.random.seed(0)
random.seed(0)
# ----------------

Part 1: Count-Based Word Vectors (10 points)
接下来就是词向量的内容了，涉及共现矩阵以及利用SVD进行降维，最后画出共现矩阵的词向量。

def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:3], compact=True, width=100)

# 结果如下：
[['', 'japan', 'to', 'revise', 'long', '-', 'term', 'energy', 'demand', 'downwards', 'the',
  'ministry', 'of', 'international', 'trade', 'and', 'industry', '(', 'miti', ')', 'will', 'revise',
  'its', 'long', '-', 'term', 'energy', 'supply', '/', 'demand', 'outlook', 'by', 'august', 'to',
  'meet', 'a', 'forecast', 'downtrend', 'in', 'japanese', 'energy', 'demand', ',', 'ministry',
  'officials', 'said', '.', 'miti', 'is', 'expected', 'to', 'lower', 'the', 'projection', 'for',
  'primary', 'energy', 'supplies', 'in', 'the', 'year', '2000', 'to', '550', 'mln', 'kilolitres',
  '(', 'kl', ')', 'from', '600', 'mln', ',', 'they', 'said', '.', 'the', 'decision', 'follows',
  'the', 'emergence', 'of', 'structural', 'changes', 'in', 'japanese', 'industry', 'following',
  'the', 'rise', 'in', 'the', 'value', 'of', 'the', 'yen', 'and', 'a', 'decline', 'in', 'domestic',
  'electric', 'power', 'demand', '.', 'miti', 'is', 'planning', 'to', 'work', 'out', 'a', 'revised',
  'energy', 'supply', '/', 'demand', 'outlook', 'through', 'deliberations', 'of', 'committee',
  'meetings', 'of', 'the', 'agency', 'of', 'natural', 'resources', 'and', 'energy', ',', 'the',
  'officials', 'said', '.', 'they', 'said', 'miti', 'will', 'also', 'review', 'the', 'breakdown',
  'of', 'energy', 'supply', 'sources', ',', 'including', 'oil', ',', 'nuclear', ',', 'coal', 'and',
  'natural', 'gas', '.', 'nuclear', 'energy', 'provided', 'the', 'bulk', 'of', 'japan', "'", 's',
  'electric', 'power', 'in', 'the', 'fiscal', 'year', 'ended', 'march', '31', ',', 'supplying',
  'an', 'estimated', '27', 'pct', 'on', 'a', 'kilowatt', '/', 'hour', 'basis', ',', 'followed',
  'by', 'oil', '(', '23', 'pct', ')', 'and', 'liquefied', 'natural', 'gas', '(', '21', 'pct', '),',
  'they', 'noted', '.', ''],
 ['', 'energy', '/', 'u', '.', 's', '.', 'petrochemical', 'industry', 'cheap', 'oil',
  'feedstocks', ',', 'the', 'weakened', 'u', '.', 's', '.', 'dollar', 'and', 'a', 'plant',
  'utilization', 'rate', 'approaching', '90', 'pct', 'will', 'propel', 'the', 'streamlined', 'u',
  '.', 's', '.', 'petrochemical', 'industry', 'to', 'record', 'profits', 'this', 'year', ',',
  'with', 'growth', 'expected', 'through', 'at', 'least', '1990', ',', 'major', 'company',
  'executives', 'predicted', '.', 'this', 'bullish', 'outlook', 'for', 'chemical', 'manufacturing',
  'and', 'an', 'industrywide', 'move', 'to', 'shed', 'unrelated', 'businesses', 'has', 'prompted',
  'gaf', 'corp', '&', 'lt', ';', 'gaf', '>,', 'privately', '-', 'held', 'cain', 'chemical', 'inc',
  ',', 'and', 'other', 'firms', 'to', 'aggressively', 'seek', 'acquisitions', 'of', 'petrochemical',
  'plants', '.', 'oil', 'companies', 'such', 'as', 'ashland', 'oil', 'inc', '&', 'lt', ';', 'ash',
  '>,', 'the', 'kentucky', '-', 'based', 'oil', 'refiner', 'and', 'marketer', ',', 'are', 'also',
  'shopping', 'for', 'money', '-', 'making', 'petrochemical', 'businesses', 'to', 'buy', '.', '"',
  'i', 'see', 'us', 'poised', 'at', 'the', 'threshold', 'of', 'a', 'golden', 'period', ',"', 'said',
  'paul', 'oreffice', ',', 'chairman', 'of', 'giant', 'dow', 'chemical', 'co', '&', 'lt', ';',
  'dow', '>,', 'adding', ',', '"', 'there', "'", 's', 'no', 'major', 'plant', 'capacity', 'being',
  'added', 'around', 'the', 'world', 'now', '.', 'the', 'whole', 'game', 'is', 'bringing', 'out',
  'new', 'products', 'and', 'improving', 'the', 'old', 'ones', '."', 'analysts', 'say', 'the',
  'chemical', 'industry', "'", 's', 'biggest', 'customers', ',', 'automobile', 'manufacturers',
  'and', 'home', 'builders', 'that', 'use', 'a', 'lot', 'of', 'paints', 'and', 'plastics', ',',
  'are', 'expected', 'to', 'buy', 'quantities', 'this', 'year', '.', 'u', '.', 's', '.',
  'petrochemical', 'plants', 'are', 'currently', 'operating', 'at', 'about', '90', 'pct',
  'capacity', ',', 'reflecting', 'tighter', 'supply', 'that', 'could', 'hike', 'product', 'prices',
  'by', '30', 'to', '40', 'pct', 'this', 'year', ',', 'said', 'john', 'dosher', ',', 'managing',
  'director', 'of', 'pace', 'consultants', 'inc', 'of', 'houston', '.', 'demand', 'for', 'some',
  'products', 'such', 'as', 'styrene', 'could', 'push', 'profit', 'margins', 'up', 'by', 'as',
  'much', 'as', '300', 'pct', ',', 'he', 'said', '.', 'oreffice', ',', 'speaking', 'at', 'a',
  'meeting', 'of', 'chemical', 'engineers', 'in', 'houston', ',', 'said', 'dow', 'would', 'easily',
  'top', 'the', '741', 'mln', 'dlrs', 'it', 'earned', 'last', 'year', 'and', 'predicted', 'it',
  'would', 'have', 'the', 'best', 'year', 'in', 'its', 'history', '.', 'in', '1985', ',', 'when',
  'oil', 'prices', 'were', 'still', 'above', '25', 'dlrs', 'a', 'barrel', 'and', 'chemical',
  'exports', 'were', 'adversely', 'affected', 'by', 'the', 'strong', 'u', '.', 's', '.', 'dollar',
  ',', 'dow', 'had', 'profits', 'of', '58', 'mln', 'dlrs', '.', '"', 'i', 'believe', 'the',
  'entire', 'chemical', 'industry', 'is', 'headed', 'for', 'a', 'record', 'year', 'or', 'close',
  'to', 'it', ',"', 'oreffice', 'said', '.', 'gaf', 'chairman', 'samuel', 'heyman', 'estimated',
  'that', 'the', 'u', '.', 's', '.', 'chemical', 'industry', 'would', 'report', 'a', '20', 'pct',
  'gain', 'in', 'profits', 'during', '1987', '.', 'last', 'year', ',', 'the', 'domestic',
  'industry', 'earned', 'a', 'total', 'of', '13', 'billion', 'dlrs', ',', 'a', '54', 'pct', 'leap',
  'from', '1985', '.', 'the', 'turn', 'in', 'the', 'fortunes', 'of', 'the', 'once', '-', 'sickly',
  'chemical', 'industry', 'has', 'been', 'brought', 'about', 'by', 'a', 'combination', 'of', 'luck',
  'and', 'planning', ',', 'said', 'pace', "'", 's', 'john', 'dosher', '.', 'dosher', 'said', 'last',
  'year', "'", 's', 'fall', 'in', 'oil', 'prices', 'made', 'feedstocks', 'dramatically', 'cheaper',
  'and', 'at', 'the', 'same', 'time', 'the', 'american', 'dollar', 'was', 'weakening', 'against',
  'foreign', 'currencies', '.', 'that', 'helped', 'boost', 'u', '.', 's', '.', 'chemical',
  'exports', '.', 'also', 'helping', 'to', 'bring', 'supply', 'and', 'demand', 'into', 'balance',
  'has', 'been', 'the', 'gradual', 'market', 'absorption', 'of', 'the', 'extra', 'chemical',
  'manufacturing', 'capacity', 'created', 'by', 'middle', 'eastern', 'oil', 'producers', 'in',
  'the', 'early', '1980s', '.', 'finally', ',', 'virtually', 'all', 'major', 'u', '.', 's', '.',
  'chemical', 'manufacturers', 'have', 'embarked', 'on', 'an', 'extensive', 'corporate',
  'restructuring', 'program', 'to', 'mothball', 'inefficient', 'plants', ',', 'trim', 'the',
  'payroll', 'and', 'eliminate', 'unrelated', 'businesses', '.', 'the', 'restructuring', 'touched',
  'off', 'a', 'flurry', 'of', 'friendly', 'and', 'hostile', 'takeover', 'attempts', '.', 'gaf', ',',
  'which', 'made', 'an', 'unsuccessful', 'attempt', 'in', '1985', 'to', 'acquire', 'union',
  'carbide', 'corp', '&', 'lt', ';', 'uk', '>,', 'recently', 'offered', 'three', 'billion', 'dlrs',
  'for', 'borg', 'warner', 'corp', '&', 'lt', ';', 'bor', '>,', 'a', 'chicago', 'manufacturer',
  'of', 'plastics', 'and', 'chemicals', '.', 'another', 'industry', 'powerhouse', ',', 'w', '.',
  'r', '.', 'grace', '&', 'lt', ';', 'gra', '>', 'has', 'divested', 'its', 'retailing', ',',
  'restaurant', 'and', 'fertilizer', 'businesses', 'to', 'raise', 'cash', 'for', 'chemical',
  'acquisitions', '.', 'but', 'some', 'experts', 'worry', 'that', 'the', 'chemical', 'industry',
  'may', 'be', 'headed', 'for', 'trouble', 'if', 'companies', 'continue', 'turning', 'their',
  'back', 'on', 'the', 'manufacturing', 'of', 'staple', 'petrochemical', 'commodities', ',', 'such',
  'as', 'ethylene', ',', 'in', 'favor', 'of', 'more', 'profitable', 'specialty', 'chemicals',
  'that', 'are', 'custom', '-', 'designed', 'for', 'a', 'small', 'group', 'of', 'buyers', '.', '"',
  'companies', 'like', 'dupont', '&', 'lt', ';', 'dd', '>', 'and', 'monsanto', 'co', '&', 'lt', ';',
  'mtc', '>', 'spent', 'the', 'past', 'two', 'or', 'three', 'years', 'trying', 'to', 'get', 'out',
  'of', 'the', 'commodity', 'chemical', 'business', 'in', 'reaction', 'to', 'how', 'badly', 'the',
  'market', 'had', 'deteriorated', ',"', 'dosher', 'said', '.', '"', 'but', 'i', 'think', 'they',
  'will', 'eventually', 'kill', 'the', 'margins', 'on', 'the', 'profitable', 'chemicals', 'in',
  'the', 'niche', 'market', '."', 'some', 'top', 'chemical', 'executives', 'share', 'the',
  'concern', '.', '"', 'the', 'challenge', 'for', 'our', 'industry', 'is', 'to', 'keep', 'from',
  'getting', 'carried', 'away', 'and', 'repeating', 'past', 'mistakes', ',"', 'gaf', "'", 's',
  'heyman', 'cautioned', '.', '"', 'the', 'shift', 'from', 'commodity', 'chemicals', 'may', 'be',
  'ill', '-', 'advised', '.', 'specialty', 'businesses', 'do', 'not', 'stay', 'special', 'long',
  '."', 'houston', '-', 'based', 'cain', 'chemical', ',', 'created', 'this', 'month', 'by', 'the',
  'sterling', 'investment', 'banking', 'group', ',', 'believes', 'it', 'can', 'generate', '700',
  'mln', 'dlrs', 'in', 'annual', 'sales', 'by', 'bucking', 'the', 'industry', 'trend', '.',
  'chairman', 'gordon', 'cain', ',', 'who', 'previously', 'led', 'a', 'leveraged', 'buyout', 'of',
  'dupont', "'", 's', 'conoco', 'inc', "'", 's', 'chemical', 'business', ',', 'has', 'spent', '1',
  '.', '1', 'billion', 'dlrs', 'since', 'january', 'to', 'buy', 'seven', 'petrochemical', 'plants',
  'along', 'the', 'texas', 'gulf', 'coast', '.', 'the', 'plants', 'produce', 'only', 'basic',
  'commodity', 'petrochemicals', 'that', 'are', 'the', 'building', 'blocks', 'of', 'specialty',
  'products', '.', '"', 'this', 'kind', 'of', 'commodity', 'chemical', 'business', 'will', 'never',
  'be', 'a', 'glamorous', ',', 'high', '-', 'margin', 'business', ',"', 'cain', 'said', ',',
  'adding', 'that', 'demand', 'is', 'expected', 'to', 'grow', 'by', 'about', 'three', 'pct',
  'annually', '.', 'garo', 'armen', ',', 'an', 'analyst', 'with', 'dean', 'witter', 'reynolds', ',',
  'said', 'chemical', 'makers', 'have', 'also', 'benefitted', 'by', 'increasing', 'demand', 'for',
  'plastics', 'as', 'prices', 'become', 'more', 'competitive', 'with', 'aluminum', ',', 'wood',
  'and', 'steel', 'products', '.', 'armen', 'estimated', 'the', 'upturn', 'in', 'the', 'chemical',
  'business', 'could', 'last', 'as', 'long', 'as', 'four', 'or', 'five', 'years', ',', 'provided',
  'the', 'u', '.', 's', '.', 'economy', 'continues', 'its', 'modest', 'rate', 'of', 'growth', '.',
  ''],
 ['', 'turkey', 'calls', 'for', 'dialogue', 'to', 'solve', 'dispute', 'turkey', 'said',
  'today', 'its', 'disputes', 'with', 'greece', ',', 'including', 'rights', 'on', 'the',
  'continental', 'shelf', 'in', 'the', 'aegean', 'sea', ',', 'should', 'be', 'solved', 'through',
  'negotiations', '.', 'a', 'foreign', 'ministry', 'statement', 'said', 'the', 'latest', 'crisis',
  'between', 'the', 'two', 'nato', 'members', 'stemmed', 'from', 'the', 'continental', 'shelf',
  'dispute', 'and', 'an', 'agreement', 'on', 'this', 'issue', 'would', 'effect', 'the', 'security',
  ',', 'economy', 'and', 'other', 'rights', 'of', 'both', 'countries', '.', '"', 'as', 'the',
  'issue', 'is', 'basicly', 'political', ',', 'a', 'solution', 'can', 'only', 'be', 'found', 'by',
  'bilateral', 'negotiations', ',"', 'the', 'statement', 'said', '.', 'greece', 'has', 'repeatedly',
  'said', 'the', 'issue', 'was', 'legal', 'and', 'could', 'be', 'solved', 'at', 'the',
  'international', 'court', 'of', 'justice', '.', 'the', 'two', 'countries', 'approached', 'armed',
  'confrontation', 'last', 'month', 'after', 'greece', 'announced', 'it', 'planned', 'oil',
  'exploration', 'work', 'in', 'the', 'aegean', 'and', 'turkey', 'said', 'it', 'would', 'also',
  'search', 'for', 'oil', '.', 'a', 'face', '-', 'off', 'was', 'averted', 'when', 'turkey',
  'confined', 'its', 'research', 'to', 'territorrial', 'waters', '.', '"', 'the', 'latest',
  'crises', 'created', 'an', 'historic', 'opportunity', 'to', 'solve', 'the', 'disputes', 'between',
  'the', 'two', 'countries', ',"', 'the', 'foreign', 'ministry', 'statement', 'said', '.', 'turkey',
  "'", 's', 'ambassador', 'in', 'athens', ',', 'nazmi', 'akiman', ',', 'was', 'due', 'to', 'meet',
  'prime', 'minister', 'andreas', 'papandreou', 'today', 'for', 'the', 'greek', 'reply', 'to', 'a',
  'message', 'sent', 'last', 'week', 'by', 'turkish', 'prime', 'minister', 'turgut', 'ozal', '.',
  'the', 'contents', 'of', 'the', 'message', 'were', 'not', 'disclosed', '.', '']]

Question 1.1: Implement distinct_words [code] (2 points)

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """    
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    corpus_words = sorted(list({word  for words in corpus for word in words}))
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words

# ---------------------
# Run this sanity check
# Note that this not an exhaustive check for correctness.
# ---------------------

# Define toy corpus
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
test_corpus_words, num_corpus_words = distinct_words(test_corpus)

# Correct answers
ans_test_corpus_words = sorted(list(set(["START", "All", "ends", "that", "gold", "All's", "glitters", "isn't", "well", "END"])))
ans_num_corpus_words = len(ans_test_corpus_words)

# Test correct number of words
assert(num_corpus_words == ans_num_corpus_words), "Incorrect number of distinct words. Correct: {}. Yours: {}".format(ans_num_corpus_words, num_corpus_words)

# Test correct words
assert (test_corpus_words == ans_test_corpus_words), "Incorrect corpus_words.\nCorrect: {}\nYours:   {}".format(str(ans_test_corpus_words), str(test_corpus_words))

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)

结果如下：

--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.2: Implement compute_co_occurrence_matrix [code] (3 points)
实现计算共现矩阵，根据窗口的大小来计算中心词附近的词出现的次数

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    M = np.zeros((num_words, num_words))
    word2Ind = {word:i for i, word in enumerate(words)}
    
    for doc in corpus:
        for i, word in enumerate(doc):
            for j in range(i-window_size, i+window_size+1):
                if j<0 or j>=len(doc):
                    continue
                if j != i:
                    M[word2Ind[word], word2Ind[doc[j]]] += 1
                
    # ------------------

    return M, word2Ind

# ---------------------
# Run this sanity check
# Note that this is not an exhaustive check for correctness.
# ---------------------

# Define toy corpus and get student's co-occurrence matrix
test_corpus = ["START All that glitters isn't gold END".split(" "), "START All's well that ends well END".split(" ")]
M_test, word2Ind_test = compute_co_occurrence_matrix(test_corpus, window_size=1)

# Correct M and word2Ind
M_test_ans = np.array( 
    [[0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,],
     [0., 0., 0., 1., 0., 0., 0., 0., 0., 1.,],
     [0., 0., 0., 0., 0., 0., 1., 0., 0., 1.,],
     [1., 1., 0., 0., 0., 0., 0., 0., 0., 0.,],
     [0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,],
     [0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,],
     [0., 0., 1., 0., 0., 0., 0., 1., 0., 0.,],
     [0., 0., 0., 0., 0., 1., 1., 0., 0., 0.,],
     [1., 0., 0., 0., 1., 1., 0., 0., 0., 1.,],
     [0., 1., 1., 0., 1., 0., 0., 0., 1., 0.,]]
)
word2Ind_ans = {'All': 0, "All's": 1, 'END': 2, 'START': 3, 'ends': 4, 'glitters': 5, 'gold': 6, "isn't": 7, 'that': 8, 'well': 9}

# Test correct word2Ind
assert (word2Ind_ans == word2Ind_test), "Your word2Ind is incorrect:\nCorrect: {}\nYours: {}".format(word2Ind_ans, word2Ind_test)

# Test correct M shape
assert (M_test.shape == M_test_ans.shape), "M matrix has incorrect shape.\nCorrect: {}\nYours: {}".format(M_test.shape, M_test_ans.shape)

# Test correct M values
for w1 in word2Ind_ans.keys():
    idx1 = word2Ind_ans[w1]
    for w2 in word2Ind_ans.keys():
        idx2 = word2Ind_ans[w2]
        student = M_test[idx1, idx2]
        correct = M_test_ans[idx1, idx2]
        if student != correct:
            print("Correct M:")
            print(M_test_ans)
            print("Your M: ")
            print(M_test)
            raise AssertionError("Incorrect count at index ({}, {})=({}, {}) in matrix M. Yours has {} but should have {}.".format(idx1, idx2, w1, w2, student, correct))

# Print Success
print ("-" * 80)
print("Passed All Tests!")
print ("-" * 80)

结果如下：

--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.3: Implement reduce_to_k_dim [code]
进行SVD降维，不太懂的同学可以点击这里，官网上查看用法。

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
        # ------------------
        # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
        # ------------------

    print("Done.")
    return M_reduced

打印结果如下：

Running Truncated SVD over 10 words...
Done.
--------------------------------------------------------------------------------
Passed All Tests!
--------------------------------------------------------------------------------

Question 1.4: Implement plot_embeddings [code] (1 point)
画图

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensioal word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    for word in words:
        coord = M_reduced[word2Ind[word]]
        x = coord[0]
        y = coord[1]
        plt.scatter(x,y, marker='x', color='red')
        plt.text(x, y, word, fontsize=9)
    plt.show()

    # ------------------

# ---------------------
# Run this sanity check
# Note that this not an exhaustive check for correctness.
# The plot produced should look like the "test solution plot" depicted below. 
# ---------------------

print ("-" * 80)
print ("Outputted Plot:")

M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)

print ("-" * 80)

结果如下：

Question 1.5: Co-Occurrence Plot Analysis [written] (3 points)
画出共现矩阵的图

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)

结果如下：

Running Truncated SVD over 8185 words...
Done.

Part 2: Prediction-Based Word Vectors (15 points)
使用SVD降维，将300维降2维

def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each lengh 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("word2vec-google-news-300")
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin

# -----------------------------------
# Run Cell to Load Word Vectors
# Note: This may take several minutes
# -----------------------------------
wv_from_bin = load_word2vec()

结果如下：

[==================================================] 100.0% 1662.8/1662.8MB downloaded
Loaded vocab size 3000000

降维

def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.vocab.keys())
    print("Shuffling words ...")
    random.shuffle(words)
    words = words[:10000]
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind

# -----------------------------------------------------------------
# Run Cell to Reduce 300-Dimensinal Word Embeddings to k Dimensions
# Note: This may take several minutes
# -----------------------------------------------------------------
M, word2Ind = get_matrix_of_vectors(wv_from_bin)
M_reduced = reduce_to_k_dim(M, k=2)

结果如下:

Shuffling words ...
Putting 10000 words into word2Ind and matrix M ...
Done.
Running Truncated SVD over 10010 words...
Done.

Question 2.1: Word2Vec Plot Analysis [written] (4 points)

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced, word2Ind, words)

结果：

Question 2.2: Polysemous Words (2 points) [code + written]
多义词

# ------------------
# Write your polysemous word exploration code here.

wv_from_bin.most_similar("exciting")

# ------------------

结果如下：

[('interesting', 0.6667856574058533),
 ('excited', 0.6454532146453857),
 ('exhilarating', 0.6273326873779297),
 ('fantastic', 0.6220911145210266),
 ('Exciting', 0.616710364818573),
 ('intriguing', 0.6062461733818054),
 ('amazing', 0.5953893065452576),
 ('enjoyable', 0.5866266489028931),
 ('fascinating', 0.5797898173332214),
 ('gratifying', 0.575835645198822)]

Question 2.3: Synonyms & Antonyms (2 points) [code + written]
同义词和多义词

# ------------------
# Write your synonym & antonym exploration code here.

w1 = "sleep"
w2 = "nap"
w3 = "awake"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

# ------------------

结果如下：

Synonyms sleep, nap have cosine distance: 0.38177746534347534
Antonyms sleep, awake have cosine distance: 0.4313364028930664

Solving Analogies with Word Vectors
类比

# Run this cell to answer the analogy -- man : king :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'king'], negative=['man']))

结果：

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431607246399),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411999702454)]

Question 2.4: Finding Analogies [code + written] (2 Points)
类比

# ------------------
# Write your analogy exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['China', 'American'], negative=['America']))

# ------------------

结果如下：

[('Chinese', 0.7894362211227417),
 ('Beijing', 0.6521856784820557),
 ('Taiwanese', 0.6029292345046997),
 ('qualifier_Peng_Shuai', 0.5813732743263245),
 ('Shanghai', 0.5688328742980957),
 ('Taiwan', 0.5469641089439392),
 ('coach_Shang_Ruihua', 0.5457600355148315),
 ('Zhang', 0.5411955118179321),
 ('theChinese', 0.5395910739898682),
 ('Guangzhou', 0.5321415066719055)]

Question 2.5: Incorrect Analogy [code + written] (1 point)
错误的类比

# ------------------
# Write your incorrect analogy exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['China', 'American'], negative=['Japan']))

# ------------------

结果如下：

[('Chinese', 0.5462234616279602),
 ('Amercian', 0.5023548603057861),
 ('African', 0.44705307483673096),
 ('Indian', 0.4317880868911743),
 ('Beijing', 0.41833460330963135),
 ('Cuban', 0.4132118821144104),
 ('Americans', 0.4127004146575928),
 ('U.S.', 0.41143256425857544),
 ('America', 0.40900349617004395),
 ('Amercan', 0.4035891890525818)]

Question 2.6: Guided Analysis of Bias in Word Vectors [written] (1 point)
偏差分析

# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))

结果如下：

[('bosses', 0.5522644519805908),
 ('manageress', 0.49151360988616943),
 ('exec', 0.45940813422203064),
 ('Manageress', 0.45598435401916504),
 ('receptionist', 0.4474116563796997),
 ('Jane_Danson', 0.44480544328689575),
 ('Fiz_Jennie_McAlpine', 0.44275766611099243),
 ('Coronation_Street_actress', 0.44275566935539246),
 ('supremo', 0.4409853219985962),
 ('coworker', 0.43986251950263977)]

[('supremo', 0.6097398400306702),
 ('MOTHERWELL_boss', 0.5489562153816223),
 ('CARETAKER_boss', 0.5375303626060486),
 ('Bully_Wee_boss', 0.5333974361419678),
 ('YEOVIL_Town_boss', 0.5321705341339111),
 ('head_honcho', 0.5281980037689209),
 ('manager_Stan_Ternent', 0.525971531867981),
 ('Viv_Busby', 0.5256162881851196),
 ('striker_Gabby_Agbonlahor', 0.5250812768936157),
 ('BARNSLEY_boss', 0.5238943099975586)]

Question 2.7: Independent Analysis of Bias in Word Vectors [code + written] (2 points)
单词向量中偏差的独立分析

# ------------------
# Write your bias exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'nurse'], negative=['man']))

# ------------------

结果如下：

[('gynecologist', 0.7093892097473145),
 ('nurse', 0.647728681564331),
 ('doctors', 0.6471461057662964),
 ('physician', 0.64389967918396),
 ('pediatrician', 0.6249487996101379),
 ('nurse_practitioner', 0.6218312978744507),
 ('obstetrician', 0.6072014570236206),
 ('ob_gyn', 0.5986712574958801),
 ('midwife', 0.5927063226699829),
 ('dermatologist', 0.5739566683769226)]

[('registered_nurse', 0.7375058531761169),
 ('nurse_practitioner', 0.6650707721710205),
 ('midwife', 0.6506887674331665),
 ('nurses', 0.6448696851730347),
 ('nurse_midwife', 0.6239829659461975),
 ('birth_doula', 0.5852460265159607),
 ('neonatal_nurse', 0.5670714974403381),
 ('dental_hygienist', 0.5668443441390991),
 ('lactation_consultant', 0.5667990446090698),
 ('respiratory_therapist', 0.5652169585227966)]

Question 2.8: Thinking About Bias [written] (1 point)
关于偏差的理解，一开始数据集中就存在了。

你可能感兴趣的:(NLP)

免费的GPT可在线直接使用（一键收藏） kkai人工智能 gpt
1、LuminAI（https://kk.zlrxjh.top）LuminAI标志着一款融合了星辰大数据模型与文脉深度模型的先进知识增强型语言处理系统，旨在自然语言处理（NLP）的技术开发领域发光发热。此系统展现了卓越的语义把握与内容生成能力，轻松驾驭多样化的自然语言处理任务。VisionAI在NLP界的应用领域广泛，能够胜任从机器翻译、文本概要撰写、情绪分析到问答等众多任务。通过对大量文本数据的
机器学习-聚类算法不良人龍木木机器学习机器学习算法聚类
机器学习-聚类算法1.AHC2.K-means3.SC4.MCL仅个人笔记，感谢点赞关注！1.AHC2.K-means3.SC传统谱聚类：个人对谱聚类算法的理解以及改进4.MCL目前仅专注于NLP的技术学习和分享感谢大家的关注与支持！
轻量级模型解读——轻量transformer系列 lishanlu136 #图像分类轻量级模型 transformer 图像分类
先占坑，持续更新。。。文章目录1、DeiT2、ConViT3、Mobile-Former4、MobileViTTransformer是2017谷歌提出的一篇论文，最早应用于NLP领域的机器翻译工作，Transformer解读，但随着2020年DETR和ViT的出现(DETR解读，ViT解读)，其在视觉领域的应用也如雨后春笋般渐渐出现，其特有的全局注意力机制给图像识别领域带来了重要参考。但是tran
FlagEmbedding 吉小雨 python库 python
FlagEmbedding教程FlagEmbedding是一个用于生成文本嵌入（textembeddings）的库，适合处理自然语言处理（NLP）中的各种任务。嵌入（embeddings）是将文本表示为连续向量，能够捕捉语义上的相似性，常用于文本分类、聚类、信息检索等场景。官方文档链接：FlagEmbedding官方GitHub一、FlagEmbedding库概述1.1什么是FlagEmbeddi
【NumPy】深入解析numpy.zeros()函数二七830 numpy
欢迎莅临我的个人主页这里是我深耕Python编程、机器学习和自然语言处理（NLP）领域，并乐于分享知识与经验的小天地！博主简介：我是二七830，一名对技术充满热情的探索者。多年的Python编程和机器学习实践，使我深入理解了这些技术的核心原理，并能够在实际项目中灵活应用。尤其是在NLP领域，我积累了丰富的经验，能够处理各种复杂的自然语言任务。技术专长：我熟练掌握Python编程语言，并深入研究了机
CV、NLP、数据控掘推荐、量化海的那边- AI算法自然语言处理人工智能
下面是对CV（计算机视觉）、NLP（自然语言处理）、数据挖掘推荐和量化的简要概述及其应用领域的介绍：1.CV（计算机视觉，ComputerVision）定义：计算机视觉是一门让计算机能够从图像或视频中提取有用信息，并做出决策的学科。它通过模拟人类的视觉系统来识别、处理和理解视觉信息。主要任务：图像分类：识别图像中的物体并分类，比如猫、狗、车等。目标检测：在图像或视频中定位并识别多个对象，如人脸检测
深度解析：如何使用输出解析器将大型语言模型（LLM）的响应解析为结构化JSON格式 m0_57781768 语言模型 json 人工智能
深度解析：如何使用输出解析器将大型语言模型（LLM）的响应解析为结构化JSON格式在现代自然语言处理（NLP）的应用中，大型语言模型（LLM）已经成为了重要的工具。这些模型能够生成丰富的自然语言文本，适用于各种应用场景。然而，在某些应用中，开发者不仅仅需要生成文本，还需要将这些生成的文本转换为结构化的数据格式，例如JSON。这种结构化的数据格式在数据传输、存储以及进一步处理时具有显著优势。本文将深
使用LangChain和OpenAI实现高效文本标注 aehrutktrjk langchain python
使用LangChain和OpenAI实现高效文本标注引言在自然语言处理(NLP)领域，文本标注是一项重要且常见的任务。它涉及为文本分配标签，如情感、语言、风格等。本文将介绍如何使用LangChain和OpenAI的API来实现高效的文本标注系统。我们将探讨如何设置环境、定义标注模式，以及如何使用OpenAI的模型来执行标注任务。环境准备首先，我们需要安装必要的库并设置API密钥：%pipinsta
【NLP5-RNN模型、LSTM模型和GRU模型】一蓑烟雨紫洛 nlp rnn lstm gru nlp
RNN模型、LSTM模型和GRU模型1、什么是RNN模型RNN（RecurrentNeuralNetwork)中文称为循环神经网络，它一般以序列数据为输入，通过网络内部的结构设计有效捕捉序列之间的关系特征，一般也是以序列形式进行输出RNN的循环机制使模型隐层上一时间步产生的结果，能够作为当下时间步输入的一部分（当下时间步的输入除了正常的输入外还包括上一步的隐层输出）对当下时间步的输出产生影响2、R
基于深度学习的文本引导的图像编辑 SEU-WYL 深度学习dnn 深度学习人工智能
基于深度学习的文本引导的图像编辑（Text-GuidedImageEditing）是一种通过自然语言文本指令对图像进行编辑或修改的技术。它结合了图像生成和自然语言处理（NLP）的最新进展，使用户能够通过描述性文本对图像内容进行精确的调整和操控。1.文本引导的图像编辑的挑战文本和图像之间的对齐：如何将文本中的语义信息准确地映射到图像中的特定区域或元素是一个关键挑战。这涉及到多模态数据的对齐和理解。编
甘超波：NLP婚姻中如何与老人相处甘超波
哈喽，大家好我是甘超波，是一名NLP爱好者，每天一篇原创文章或视频，分享我的实战经验和案例，希望给你些启发和帮助看一下，在家庭中子女与老人观念不一致时案例1：在教育孩子方面，老人习惯用老一套教育方式教育孙子，子女受不了老人这种习惯，从而发生口舌之争？2：在生活习惯方面，老人喜欢吃剩菜剩饭，子女受不了老人这种习惯，从而发生口舌之争？.....这样的事情，我相信你或多或少都听过和看过，甚至了深有感悟。
transformer架构(Transformer Architecture)原理与代码实战案例讲解 AI架构设计之禅大数据AI人工智能 Python入门实战计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
transformer架构(TransformerArchitecture)原理与代码实战案例讲解关键词：Transformer,自注意力机制,编码器-解码器,预训练,微调,NLP,机器翻译作者：禅与计算机程序设计艺术/ZenandtheArtofComputerProgramming1.背景介绍1.1问题的由来自然语言处理（NLP）领域的发展经历了从规则驱动到统计驱动再到深度学习驱动的三个阶段。
英伟达（NVIDIA）B200架构解读 weixin_41205263 芯际争霸 GPGPU架构 gpu算力人工智能硬件架构
H100芯片是一款高性能AI芯片，其中的TransformerEngine是专门用于加速Transformer模型计算的核心部件。Transformer模型是一种自然语言处理（NLP）模型，广泛应用于机器翻译、文本生成等任务。TransformerEngine的电路设计原理主要包括以下几个方面：
《昇思 25 天学习打卡营第 25 天 | 基于 MindSpore 实现 BERT 对话情绪识别》 Sam9029 Mindscope模型学习深度学习
《昇思25天学习打卡营第25天|基于MindSpore实现BERT对话情绪识别》活动地址：https://xihe.mindspore.cn/events/mindspore-training-camp签名：Sam9029环境配置确保安装了正确版本的MindSpore和MindNLP库。!pipuninstallmindspore-y!pipinstall-ihttps://pypi.mirror
基于人工智能的智能语音助手人工智能发烧友人工智能
语音助手的自然语言处理模块是语音助手系统的关键组成部分。通过这个模块，系统能够识别用户的意图并做出相应的回应。我们可以使用NLP技术来解析文本输入，并将其转换为系统可以理解的命令或指令。在本项目中，我们将结合语音识别、自然语言处理和语音合成技术，构建一个功能简化的语音助手。一、项目背景与需求分析1.1项目目标本项目旨在创建一个语音助手系统，它可以：1.语音识别：从用户的语音输入中提取文本信息。2.
NLP_jieba中文分词的常用模块 Hiweir · NLP_jieba的使用自然语言处理中文分词人工智能 nlp
1.jieba分词模式（1）精确模式:把句子最精确的切分开,比较适合文本分析.默认精确模式.（2）全模式:把句子中所有可能成词的词都扫描出来,cut_all=True,缺点:速度快,不能解决歧义（3）paddle:利用百度的paddlepaddle深度学习框架.简单来说就是使用百度提供的分词模型.use_paddle=True.（4）搜索引擎模式:在精确模式的基础上,对长词再进行切分,提高召回率,
Linux如何查看端口 lanhuazui10 linux操作系统 linux
方法一：lsof-i:端口号用于查看某一端口的占用情况，比如查看9092端口使用情况，lsof-i:9095可以看到9095端口已经被nginx占用方法二：netstat-tunlp|grep端口号，用于查看指定的端口号的进程情况，如查看5050端口的情况，netstat-tunlp|grep5050-t(tcp)仅显示tcp相关选项-u(udp)仅显示udp相关选项-n拒绝显示别名，能显示数字的
【笔记】自然语言处理NLP---概论 xhanZ NLP相关
（from人文学院开设课程）目录1.自然语言处理概论1.1自然语言处理研究的意义、历史与现状1.1.1自然语言的特点1.1.2自然语言处理研究的意义1.1.3国外研究现状1.2NLP的方法、特点和规律1.2.1理性主义与经验主义1.2.2语料库语言学：经验主义研究方法1.2.3汉语语言处理的方法1.2.4基于知识图谱的深度学习1.自然语言处理概论1.1自然语言处理研究的意义、历史与现状1.1.1自
【笔记与idea】——ACL2017论文报告会胖胖的飞象深度学习人工智能笔记 idea
这篇是2017年我有幸参加了中文信息学会组织的ACL2017论文报告会记的笔记，当时还是研一新生，对NLP感兴趣，偶然通过老师知晓了这次报告会，所以想去现场听听大牛们的idea、和大牛们交流（然而由于当时没有入门，啥也不懂，交流失败。。。）但是总的来说，非常感谢组织这次报告会的老师们，尽管没能和大牛们有效的交流，但是这次报告会相当于在最短的时间内读懂了数十篇精彩论文的核心内容，对我后面的学习起到了
如何利用AI技术来提升用户的个性化体验和社区参与度？ Itfuture03 AI前沿技术人工智能
要利用AI技术提升用户的个性化体验和社区参与度，可以采取以下几种策略：个性化推荐系统：通过AI算法分析用户的行为和偏好，提供定制化的服务和内容推荐，如智能推荐活动、健康管理等，让居民感受到社区的温暖和关怀。智能助手与聊天机器人：引入AI驱动的虚拟助手，提供实时帮助、个性化建议和交互式对话，改善客户体验。自然语言处理（NLP）：实现具有AI能力的NLP，创建对用户友好的应用程序，简化用户体验，如客服
【Python】成功解决IndexError: list index out of range 高斯小哥 BUG解决方案合集 python list 新手入门学习 debug
【Python】成功解决IndexError:listindexoutofrange下滑查看解决方法欢迎莅临我的个人主页这里是我静心耕耘深度学习领域、真诚分享知识与智慧的小天地！博主简介：985高校的普通本硕，曾有幸发表过人工智能领域的中科院顶刊一作论文，熟练掌握PyTorch框架。技术专长：在CV、NLP及多模态等领域有丰富的项目实战经验。已累计一对一为数百位用户提供近千次专业服务，助力他们少走
使用Python和Jieba库进行中文情感分析：从文本预处理到模型训练的完整指南快撑死的鱼 Python算法精解 python 人工智能开发语言
使用Python和Jieba库进行中文情感分析：从文本预处理到模型训练的完整指南情感分析（SentimentAnalysis）是自然语言处理（NLP）领域中的一个重要分支，旨在从文本中识别出情绪、态度或意见等主观信息。在中文文本处理中，由于语言特性不同于英语，如何高效、准确地分词和提取关键词成为情感分析的关键步骤之一。在这篇文章中，我们将深入探讨如何使用Python和Jieba库进行中文情感分析，
论文阅读笔记: DINOv2: Learning Robust Visual Features without Supervision 小夏refresh 论文计算机视觉深度学习论文阅读笔记深度学习计算机视觉人工智能
DINOv2:LearningRobustVisualFeatureswithoutSupervision论文地址:https://arxiv.org/abs/2304.07193代码地址:https://github.com/facebookresearch/dinov2摘要大量数据上的预训练模型在NLP方面取得突破，为计算机视觉中的类似基础模型开辟了道路。这些模型可以通过生成通用视觉特征(即无
第3篇：LangChain的架构总览与设计理念 Gemini技术窝 langchain 架构大数据人工智能 AIGC nlp
LangChain库是一个专为自然语言处理（NLP）设计的强大工具包，致力于简化复杂语言模型链的构建和执行。在本文中，我们将深入解析LangChain库的架构，详细列出其核心组件、设计理念及其在不同场景中的应用，并讨论其优缺点。文章目录1.LangChain库简介2.核心组件2.1数据输入模块作用2.2数据预处理模块作用2.3数据增强模块作用2.4数据加载与批处理模块作用2.5模型训练模块作用2.
读李中莹先生论“阿Q精神" 猫咪06
这阵子重读《重塑心灵》，对“阿Q精神"一段很有感慨，在我们从小的信念里，阿Q的精神胜利法是被贬低的，是对无能力改变自己的境遇时，似手只能采用自我安慰的人的讽刺。李中莹先生在他的书中结合对话者的认可，定义阿Q精神“只求精神胜利，罔顾真实情况"，他就针对这两句话，解析阿Q精神，并进行了肯定‘，。首先“精神胜利"指的是自己内心有成功的感觉，这很符合NLP!如果所有人都认为你成功，而你自己没有成功的喜悦，
书单用户5521
提高思维（13本）：影响力逻辑思维（理查德·尼斯贝特）离经叛道:不按常理出牌的人如何改变世界（只看最后一章总结即可）改变:问题形成和解决的原则语言的魔力:谈笑间转变信念之NLP技巧（意识到语言顺序的重要性）改变心理学的40项研究对伪心理学说不你的误区:如何摆脱负面思维掌控你的生活战胜拖拉你的灯亮着吗?别做正常的傻瓜学会提问:批判性思维指南不确定世界的理性选择小说（5本）：霍乱时期的爱情那些回不去的
【Python】解决AttributeError: ‘NoneType‘ object has no attribute ‘xxxx‘ 云天徽上 Pandas python 开发语言 pandas 机器学习 numpy
【Python】解决AttributeError:'NoneType'objecthasnoattribute'xxxx'报错欢迎莅临我的个人主页这里是我深耕Python编程、机器学习和自然语言处理（NLP）领域，并乐于分享知识与经验的小天地！博主简介：我是云天徽上，一名对技术充满热情的探索者。多年的Python编程和机器学习实践，使我深入理解了这些技术的核心原理，并能够在实际项目中灵活应用。尤其
【自然语言处理】自然语言处理NLP概述及应用 @我们的天空人工智能技术 nlp 人工智能深度学习 python 机器学习自然语言处理 scikit-learn
自然语言处理（NaturalLanguageProcessing，简称NLP）是一门集计算机科学、人工智能以及语言学于一体的交叉学科，致力于让计算机能够理解、解析、生成和处理人类的自然语言。它是人工智能领域的一个关键分支，旨在缩小人与机器之间的交流障碍，使得机器能够更有效地识别并响应人类的自然语言指令或内容。自然语言处理NLP概述基本任务：文本分类：将文本划分为预定义的类别，如情感分析、主题分类等
OPENAI中RAG实现原理以及示例代码用PYTHON来实现 dzend aigc python 开发语言 ai
OPENAI中RAG实现原理以及示例代码用PYTHON来实现1.引言在当今人工智能领域，自然语言处理（NLP）是一个非常重要的研究方向。近年来，OPENAI发布了许多创新的NLP模型，其中之一就是RAG（Retrieval-AugmentedGeneration）模型。RAG模型结合了检索和生成两种方法，可以用于生成与给定问题相关的高质量文本。本文将介绍RAG模型的实现原理，并提供使用Python
开源AI图像识别：支持扫描文件批量识别快速对接数据库存储思通数科x 人工智能计算机视觉图像处理 OCR 文本识别
随着数字化转型的不断深入，图像识别技术在各行各业中的应用越来越广泛。文件封识别作为图像识别技术的一个分支，能够有效地提高文件处理的自动化程度和准确性。本文将探讨文件封识别技术的原理、应用场景以及如何将识别后的内容批量对应数据库字段进行存储。开源项目介绍(可本地部署，支持国产化)思通数科研发了一款多模态AI能力引擎，专注于提供自然语言处理（NLP）、情感分析、实体识别、图像识别与分类、OCR识别和语
异常的核心类Throwable 无量 java 源码异常处理 exception
java异常的核心是Throwable，其他的如Error和Exception都是继承的这个类里面有个核心参数是detailMessage，记录异常信息，getMessage核心方法，获取这个参数的值，我们可以自己定义自己的异常类，去继承这个Exception就可以了，方法基本上，用父类的构造方法就OK，所以这么看异常是不是很easy package com.natsu;
mongoDB 游标（cursor）实现分页迭代开窍的石头 mongodb
上篇中我们讲了mongoDB 中的查询函数，现在我们讲mongo中如何做分页查询如何声明一个游标 var mycursor = db.user.find({_id:{$lte:5}}); 迭代显示游标数
MySQL数据库INNODB 表损坏修复处理过程 0624chenhong tomcat mysql
最近mysql数据库经常死掉，用命令net stop mysql命令也无法停掉，关闭Tomcat的时候，出现Waiting for N instance(s) to be deallocated 信息。查了下，大概就是程序没有对数据库连接释放，导致Connection泄露了。因为用的是开元集成的平台，内部程序也不可能一下子给改掉的，就验证一下咯。启动Tomcat,用户登录系统，用netstat -
剖析如何与设计人员沟通不懂事的小屁孩工作
最近做图烦死了，不停的改图，改图……。烦，倒不是因为改，而是反反复复的改，人都会死。很多需求人员不知该如何与设计人员沟通，不明白如何使设计人员知道他所要的效果，结果只能是沟通变成了扯淡，改图变成了应付。那应该如何与设计人员沟通呢？我认为设计人员与需求人员先天就存在语言障碍。对一个合格的设计人员来说，整天玩的都是点、线、面、配色，哪种构图看起来协调；哪种配色看起来合理心里跟明镜似的，
qq空间刷评论工具换个号韩国红果果 JavaScript
var a=document.getElementsByClassName('textinput'); var b=[]; for(var m=0;m<a.length;m++){ if(a[m].getAttribute('placeholder')!=null) b.push(a[m]) } var l
S2SH整合之session 灵静志远 spring AOP struts session
错误信息： Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'cartService': Scope 'session' is not active for the current thread; consider defining a scoped
xmp标签 a-john 标签
今天在处理数据的显示上遇到一个问题： var html = '<li><div class="pl-nr"><span class="user-name">' + user + '</span>' + text + '</div></li>'; ulComme
Ajax的常用技巧（2）---实现Web页面中的级联菜单 aijuans Ajax
在网络上显示数据，往往只显示数据中的一部分信息，如文章标题，产品名称等。如果浏览器要查看所有信息，只需点击相关链接即可。在web技术中，可以采用级联菜单完成上述操作。根据用户的选择，动态展开，并显示出对应选项子菜单的内容。在传统的web实现方式中，一般是在页面初始化时动态获取到服务端数据库中对应的所有子菜单中的信息，放置到页面中对应的位置，然后再结合CSS层叠样式表动态控制对应子菜单的显示或者隐
天-安-门，好高 atongyeye 情感
我是85后，北漂一族，之前房租1100，因为租房合同到期，再续，房租就要涨150。最近网上新闻，地铁也要涨价。算了一下，涨价之后，每次坐地铁由原来2块变成6块。仅坐地铁费用，一个月就要涨200。内心苦痛。晚上躺在床上一个人想了很久，很久。我生在农
android 动画百合不是茶 android 透明度平移缩放旋转
android的动画有两种 tween动画和Frame动画 tween动画;,透明度,缩放,旋转,平移效果 Animation 动画 AlphaAnimation 渐变透明度 RotateAnimation 画面旋转 ScaleAnimation 渐变尺寸缩放 TranslateAnimation 位置移动 Animation
查看本机网络信息的cmd脚本 bijian1013 cmd
@echo 您的用户名是：%USERDOMAIN%\%username%>"%userprofile%\网络参数.txt" @echo 您的机器名是：%COMPUTERNAME%>>"%userprofile%\网络参数.txt" @echo ___________________>>"%userprofile%\
plsql 清除登录过的用户征客丶 plsql
tools---preferences----logon history---history 把你想要删除的删除 -------------------------------------------------------------------- 若有其他凝问或文中有错误，请及时向我指出，我好及时改正，同时也让我们一起进步。 email ： binary_spac
【Pig一】Pig入门 bit1129 pig
Pig安装 1.下载pig wget http://mirror.bit.edu.cn/apache/pig/pig-0.14.0/pig-0.14.0.tar.gz 2. 解压配置环境变量如果Pig使用Map/Reduce模式，那么需要在环境变量中，配置HADOOP_HOME环境变量 expor
Java 线程同步几种方式 BlueSkator volatile synchronized ThredLocal ReenTranLock Concurrent
为何要使用同步？ java允许多线程并发控制，当多个线程同时操作一个可共享的资源变量时（如数据的增删改查），将会导致数据不准确，相互之间产生冲突，因此加入同步锁以避免在该线程没有完成操作之前，被其他线程的调用，从而保证了该变量的唯一性和准确性。 1.同步方法&
StringUtils判断字符串是否为空的方法（转帖） BreakingBad null StringUtils “”
转帖地址：http://www.cnblogs.com/shangxiaofei/p/4313111.html public static boolean isEmpty(String str) 　　判断某字符串是否为空，为空的标准是 str== null 或 str.length()== 0
编程之美-分层遍历二叉树 bylijinnan java 数据结构算法编程之美
import java.util.ArrayList; import java.util.LinkedList; import java.util.List; public class LevelTraverseBinaryTree { /** * 编程之美分层遍历二叉树 * 之前已经用队列实现过二叉树的层次遍历，但这次要求输出换行，因此要
jquery取值和ajax提交复习记录 chengxuyuancsdn jquery取值 ajax提交
// 取值 // alert($("input[name='username']").val()); // alert($("input[name='password']").val()); // alert($("input[name='sex']:checked").val()); // alert($("
推荐国产工作流引擎嵌入式公式语法解析器-IK Expression comsci java 应用服务器工作 Excel 嵌入式
这个开源软件包是国内的一位高手自行研制开发的，正如他所说的一样，我觉得它可以使一个工作流引擎上一个台阶。。。。。。欢迎大家使用，并提出意见和建议。。。 ----------转帖--------------------------------------------------- IK Expression是一个开源的（OpenSource），可扩展的（Extensible），基于java语言
关于系统中使用多个PropertyPlaceholderConfigurer的配置及PropertyOverrideConfigurer daizj spring
1、PropertyPlaceholderConfigurer Spring中PropertyPlaceholderConfigurer这个类，它是用来解析Java Properties属性文件值，并提供在spring配置期间替换使用属性值。接下来让我们逐渐的深入其配置。基本的使用方法是：(1) <bean id="propertyConfigurerForWZ&q
二叉树:二叉搜索树 dieslrae 二叉树
所谓二叉树,就是一个节点最多只能有两个子节点,而二叉搜索树就是一个经典并简单的二叉树.规则是一个节点的左子节点一定比自己小,右子节点一定大于等于自己(当然也可以反过来).在树基本平衡的时候插入,搜索和删除速度都很快,时间复杂度为O(logN).但是,如果插入的是有序的数据,那效率就会变成O(N),在这个时候,树其实变成了一个链表. tree代码:
C语言字符串函数大全 dcj3sjt126com c function
C语言字符串函数大全函数名: stpcpy 功能: 拷贝一个字符串到另一个用法: char *stpcpy(char *destin, char *source); 程序例: #include <stdio.h> #include <string.h> int main
友盟统计页面技巧 dcj3sjt126com 技巧
在基类调用就可以了, 基类ViewController示例代码 -(void)viewWillAppear:(BOOL)animated { [super viewWillAppear:animated]; [MobClick beginLogPageView:[NSString stringWithFormat:@"%@",self.class]];
window下在同一台机器上安装多个版本jdk，修改环境变量不生效问题处理办法 flyvszhb java jdk
window下在同一台机器上安装多个版本jdk，修改环境变量不生效问题处理办法本机已经安装了jdk1.7，而比较早期的项目需要依赖jdk1.6，于是同时在本机安装了jdk1.6和jdk1.7. 安装jdk1.6前，执行java -version得到 C:\Users\liuxiang2>java -version java version "1.7.0_21&quo
Java在创建子类对象的同时会不会创建父类对象 happyqing java 创建子类对象父类对象
1.在thingking in java 的第四版第六章中明确的说了，子类对象中封装了父类对象， 2."When you create an object of the derived class, it contains within it a subobject of the base class. This subobject is the sam
跟我学spring3 目录贴及电子书下载 jinnianshilongnian spring
一、《跟我学spring3》电子书下载地址：《跟我学spring3》（1-7 和 8-13） http://jinnianshilongnian.iteye.com/blog/pdf 跟我学spring3系列 word原版下载二、源代码下载最新依
第12章 Ajax（上） onestopweb Ajax
index.html <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/
BI and EIM 4.0 at a glance blueoxygen BO
http://www.sap.com/corporate-en/press.epx?PressID=14787 有机会研究下EIM家族的两个新产品~~~~ New features of the 4.0 releases of BI and EIM solutions include: Real-time in-memory computing –
Java线程中yield与join方法的区别 tomcat_oracle java
长期以来，多线程问题颇为受到面试官的青睐。虽然我个人认为我们当中很少有人能真正获得机会开发复杂的多线程应用(在过去的七年中，我得到了一个机会)，但是理解多线程对增加你的信心很有用。之前，我讨论了一个wait()和sleep()方法区别的问题，这一次，我将会讨论join()和yield()方法的区别。坦白的说，实际上我并没有用过其中任何一个方法，所以，如果你感觉有不恰当的地方，请提出讨论。 &nb
android Manifest.xml选项阿尔萨斯 Manifest
结构继承关系 public final class Manifest extends Objectjava.lang.Objectandroid.Manifest 内部类 class Manifest.permission权限 class Manifest.permission_group权限组构造函数 public Manifest () 详细 androi
Oracle实现类split函数的方 zhaoshijie oracle
关键字：Oracle实现类split函数的方项目里需要保存结构数据，批量传到后他进行保存，为了减小数据量，子集拼装的格式，使用存储过程进行保存。保存的过程中需要对数据解析。但是oracle没有Java中split类似的函数。从网上找了一个，也补全了一下。 CREATE OR REPLACE TYPE t_split_100 IS TABLE OF VARCHAR2(100); cr