Irving_III

【cs224n学习作业】Assignment 1 - Exploring Word Vectors【附代码】

前言

这篇文章是CS224N课程的第一个大作业，主要是对词向量做了一个探索，并直观的感受了一下词嵌入或者词向量的效果。这里简单的记录一下我探索的一个过程。这一下几篇文章基于这次作业的笔记理论：
cs224n学习笔记 01: Introduction and word vectors.
cs224n学习笔记 02:Word Vectors and Word Senses.
cs224n学习笔记 03:Subword Models（fasttext附代码）.
cs224n学习笔记 04：Contextual Word Embeddings.
此次实验分为两部分，第一部分是基于计数的单词词向量，而第二部分，是基于词向量的预测，是利用了已经训练好的一个词向量矩阵去介绍一下怎么进行预测，比如可视化这些词向量啊，找同义词或者反义词啊，实现单词的类比关系啊等等。

大纲

实验前的准备工作(导入包和语料库)
Part1: Count-Based Word Vectors
Part2:Prediction-Based Word Vectors

准备工作

导入要用的包

import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors  # KeyedVectors:实现实体（单词、文档、图片都可以）和向量之间的映射。每个实体由其字符串id标识。
from gensim.test.utils import datapath
import pprint     #  输出的更加规范易读
import matplotlib.pyplot as plt  
plt.rcParams['figure.figsize'] = [10, 5]  #  plt.rcParams主要作用是设置画的图的分辨率，大小等信息
import nltk
nltk.download('reuters')    # 这个可以从GitHub下载， 网址：https://github.com/nltk/nltk_data/tree/gh-pages/packages/corpora
from nltk.corpus import reuters
import numpy as np
import random
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = ''
END_TOKEN = ''

np.random.seed(0)
random.seed(0)

这里面的Reuters是路透社(商业和金融新闻)语料库，是一个词库，语料库包含10788个新闻文档，共计130万词。这些文档跨越90个类别，分为train和test，我们这次需要用其中的一个类别(crude)里面的句子。
对于这个数据集以及后面要用到的数据我们直接下下不来，所以这里提供另外一个方法就是自己可以百度手动下载数据集。这些数据网上都有都是可以自行下载过来的。
然后就是采用下面的函数，导入这个语料库：

def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)    # 类别为crude文档
    # 每个文档都转化为小写， 并在开头结尾加标识符
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]

这个是导入语料库的函数，简单的进行了一下预处理，就是在每句话的前面和后面各加了一个标识符，表示句子的开始和结束，然后把每个单词分开。下面导入并看一下效果：

# pprint模块格式化打印
# pprint.pprint(object, stream=None, indent=1, width=80, depth=None, *, compact=False)
# width：控制打印显示的宽度。默认为80个字符。注意：当单个对象的长度超过width时，并不会分多行显示，而是会突破规定的宽度。
# compact：默认为False。如果值为False，超过width规定长度的序列会被分散打印到多行。如果为True，会尽量使序列填满width规定的宽度。
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:1], compact=True, width=100)  # compact 设置为False是一行一个单词

每个句子处理后长这样：

准备工作完成，接下来就是实验的主要两部分了。

Part1: Count-Based Word Vectors

共现矩阵是实现这种词向量的一种方式，我们看看共现矩阵是什么意思？共现矩阵计算的是单词在某些环境下一块出现的频率，对于共现矩阵，原文描述是这样的：

上面的话其实就是这样的一个意思，要想建立共现矩阵，我们需要先为单词构建一个词典，然后共现矩阵的行列都是这个词典里的单词，看下面这个例子：

上面基于这两段文档构建出的共现矩阵长这样，这个是怎么构建的？首先就是根据两个文档的单词构建一个词典，这里面的数就是两两单词在上下文中共现的频率，比如第一行， START和all一起出现了两次，这就是因为两个文档里面START的窗口中都有all。同理第二行all的那个，我们也固定一个窗口，发现第一个文档里面all左边是START，右边是that，第二个文档all左边是START，右边是is, 那么=2, =1, =1。下面的都是同理了。

我们就是要构建这样的一个矩阵来作为每个单词的词向量，当然这个还不是最终形式，因为可能词典很大的话维度会特别高，所以就相当了降维技术，降维之后的结果就是每个单词的词向量。这个里面使用的降维是SVD, 原理这里不说，这里使用了Truncated SVD，具体的实现是调用了sklearn中的包。

所以我们就有了下面的这样一个思路框架：

对于语录料库中的文档单词，得先构建一个词典（唯一单词且排好序）
然后我们就是基于词典和语料库，为每个单词构建词向量，也就是共现矩阵
对共现矩阵降维，就得到了最终的词向量
可视化

1.1为语料库中的单词构建词典

词典就是记录所有的单词，但是单词唯一且有序。那么实现这个词典的思路就是我遍历每一篇文档，先获得所有的单词，然后去掉重复的，然后再排序就搞定，当然还得记录字典里的单词总数。

# 计算出语料库中出现的不同单词，并排序。
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    # 首先得把所有单词放到一个列表里面, 然后用set去重， 然后排序
    for everylist in corpus:
        corpus_words.extend(everylist)
    corpus_words = sorted(set(corpus_words))
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words

1.2 构建共现矩阵

首先我们得定义一个M矩阵，也就是共现矩阵，大小就是行列都是词典的单词个数（上面图片一目了然），然后还得定义一个字典单词到索引的映射，因为我们统计的时候是遍历真实文档，而填矩阵的时候是基于字典，这两个是基于同一个单词进行联系起来的，所以我们需要获得真实文档中单词在字典里面的索引才能去填矩阵。

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)   # 单词已经去重或者排好序  
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    word2Ind = {k: v for (k, v) in zip(words, range(num_words))}
    M = np.zeros((num_words, num_words))
    	# 接下来是遍历语料库 对于每一篇文档， 我们得遍历每个单词
    # 对于每个单词， 我们得找到窗口的范围， 然后再去遍历它窗口内的每个单词
    # 对于这每个单词， 我们就可以在我们的M词典中进行计数， 但是要注意每个单词其实有两个索引
    # 一个是词典里面的索引， 一个是文档中的索引， 我们统计的共现频率是基于字典里面的索引， 
    # 所以这里涉及到一个索引的转换
    
    # 首先遍历语料库
    for every_doc in corpus:
        for cword_doc_ind, cword in enumerate(every_doc):  # 遍历当前文档的每个单词和在文档中的索引
            # 对于当前的单词， 我们先找到它在词典中的位置
            cword_dic_ind = word2Ind[cword]
            
            # 找窗口的起始和终止位置  开始位置就是当前单词的索引减去window_size, 终止位置
            # 是当前索引加上windo_size+1， 
            window_start = cword_doc_ind - window_size
            window_end = cword_doc_ind + window_size + 1
            
            # 有了窗口， 我们就要遍历窗口里面的每个单词， 然后往M里面记录就行了
            # 但是还要注意一点， 就是边界问题， 因为开始单词左边肯定不够窗口大小， 结束单词
            # 右边肯定不够窗口大小， 所以遍历之后得判断一下是不是左边后者右边有单词
            for j in range(window_start, window_end):
                # 前面两个条件控制不越界， 最后一个条件控制不是它本身
                if j >=0 and j < len(every_doc) and j != cword_doc_ind:
                    # 想办法加入到M， 那么得获取这个单词在词典中的位置
                    oword = every_doc[j]   # 获取到上下文单词
                    oword_dic_ind = word2Ind[oword]
                    # 加入M
                    M[cword_dic_ind, oword_dic_ind] += 1
    # ------------------

    return M, word2Ind

3.3 降维到K维

降维直接调用包SVD.

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
        # ------------------
        # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters, random_state=2020)
    M_reduced = svd.fit_transform(M)
        # ------------------

    print("Done.")
    return M_reduced

最后可视化结果

def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , k)): matrix of k-dimensioal word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    # 遍历句子， 获得每个单词的x，y坐标
    for word in words:
        word_dic_index = word2Ind[word]
        x = M_reduced[word_dic_index][0]
        y = M_reduced[word_dic_index][1]
        plt.scatter(x, y, marker='x', color='red')
        # plt.text()给图形添加文本注释
        plt.text(x+0.0002, y+0.0002, word, fontsize=9)  # # x、y上方0.002处标注文字说明，word标注的文字，fontsize：文字大小
    plt.show()
    # ------------------

综合以上的步骤：首先是读入数据，然后计算共现矩阵，然后是降维，最后是可视化：

reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_normalized, word2Ind_co_occurrence, words)

得到结果如下：

Part2:Prediction-Based Word Vectors

2.1 可视化Word2Vec训练的词嵌入

这一部分其实是利用了一个用Word2Vec技术训练好的词向量矩阵去测试一些有趣的效果，看看词向量到底是干啥用的。所以用gensim包下载了一个词向量矩阵：

def load_word2vec(embeddings_fp="F:\\jupyter lab./GoogleNews-vectors-negative300.bin"):
    """ Load Word2Vec Vectors
        Param:
            embeddings_fp (string) - path to .bin file of pretrained word vectors
        Return:
            wv_from_bin: All 3 million embeddings, each lengh 300
                This is the KeyedVectors format: https://radimrehurek.com/gensim/models/deprecated/keyedvectors.html
    """
    embed_size = 300
    print("Loading 3 million word vectors from file...")
    ## 自己下载的文件
    wv_from_bin = KeyedVectors.load_word2vec_format(embeddings_fp, binary=True)
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin
wv_from_bin = load_word2vec()

路径换成自己的数据保存路径就行

有了这个代码，我们就能得到一个基于Word2Vec训练好的词向量矩阵（和上面我们的M矩阵是类似的，只不过得到的方式不同），接下来就是进行降维并可视化词嵌入：

def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.vocab.keys())
    print("Shuffling words ...")
    random.shuffle(words)
    words = words[:10000]       # 选10000个加入
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind

# -----------------------------------------------------------------
# Run Cell to Reduce 300-Dimensinal Word Embeddings to k Dimensions
# Note: This may take several minutes
# -----------------------------------------------------------------
M, word2Ind = get_matrix_of_vectors(wv_from_bin)
M_reduced = reduce_to_k_dim(M, k=2)         # 减到了2维

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced, word2Ind, words)

结果如下：

2.2 余弦相似度

我们已经得到了每个单词的词向量表示，那么怎么看两个单词的相似性程度呢？余弦相似性是一种方式，公式如下：

基于这个方式，我们就可以找到单词的多义词，同义词，反义词还能实现单词的类比推理等。所以是直接调用的gensim的函数就行。

# 找和energy最相近的10个单词
wv_from_bin.most_similar("energy")

##结果
[('renewable_energy', 0.6721636056900024),
 ('enery', 0.6289607286453247),
 ('electricity', 0.6030439138412476),
 ('enegy', 0.6001754403114319),
 ('Energy', 0.595537006855011),
 ('fossil_fuel', 0.5802257061004639),
 ('natural_gas', 0.5767925381660461),
 ('renewables', 0.5708995461463928),
 ('fossil_fuels', 0.5689164996147156),
 ('renewable', 0.5663810968399048)]

再比如，为我们可以找同义词和反义词：

w1 = "man"
w2 = "king"
w3 = "woman"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

## 结果：
Synonyms man, king have cosine distance: 0.7705732733011246
Antonyms man, woman have cosine distance: 0.2335987687110901

还可以实现类比关系：
比如： China : Beijing = Japan : ?，那么我们可以用下面的代码求这样的类别关系，注意下面的positive和negative里面的单词顺序，我们求得？其实和Japan和Beijing相似，和China远。

# Run this cell to answer the analogy -- man : king :: woman : x
pprint.pprint(wv_from_bin.most_similar(positive=['Bejing', 'Japan'], negative=['China']))

## 结果：
[('Tokyo', 0.6124968528747559),
 ('Osaka', 0.5791803598403931),
 ('Maebashi', 0.5635818243026733),
 ('Fukuoka_Japan', 0.5362966060638428),
 ('Nagoya', 0.5359445214271545),
 ('Fukuoka', 0.5319067239761353),
 ('Osaka_Japan', 0.5298740267753601),
 ('Nagano', 0.5293833017349243),
 ('Taisuke', 0.5258569717407227),
 ('Chukyo', 0.5195443034172058)]

2.3 引导词向量偏差分析

识别单词嵌入隐含的偏见(性别、种族、性取向等)。
运行下面的单元格，检查
(a)哪些词与“woman”和“boss”最相似，而与“man”最不相似;
(b)哪些词与“man”和“boss”最相似，而与“woman”最不相似。

# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.
#这里“positive”表示类似的单词列表，而“negative”表示是最不相似于。
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))

得到结果如下：

2.4 独立的单词向量偏差分析

# ------------------
# Write your bias exploration code here.

pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'nurse'], negative=['man']))

# ------------------

结果如下：

【附完整代码】

# 导入包
import sys
assert sys.version_info[0]==3
assert sys.version_info[1] >= 5

from gensim.models import KeyedVectors  # KeyedVectors:实现实体（单词、文档、图片都可以）和向量之间的映射。每个实体由其字符串id标识。
from gensim.test.utils import datapath
import pprint     #  输出的更加规范易读
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]  #  plt.rcParams主要作用是设置画的图的分辨率，大小等信息
# import nltk
# nltk.download('reuters')    # 这个可以从GitHub下载， 网址：https://github.com/nltk/nltk_data/tree/gh-pages/packages/corpora
from nltk.corpus import reuters
import numpy as np
import random
import rando
import scipy as sp
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

START_TOKEN = ''
END_TOKEN = ''

np.random.seed(0)
random.seed(0)

# 导入 "reuters" 语料库
def read_corpus(category="crude"):
    """ Read files from the specified Reuter's category.
        Params:
            category (string): category name
        Return:
            list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)    # 类别为crude文档
    # 每个文档都转化为小写， 并在开头结尾加标识符
    return [[START_TOKEN] + [w.lower() for w in list(reuters.words(f))] + [END_TOKEN] for f in files]
print()

# 导入语料库的函数，简单的进行了一下预处理，
# 在每句话的前面和后面各加了一个标识符，表示句子的开始和结束，然后把每个单词分开。
# pprint模块格式化打印
# pprint.pprint(object, stream=None, indent=1, width=80, depth=None, *, compact=False)
# width：控制打印显示的宽度。默认为80个字符。注意：当单个对象的长度超过width时，并不会分多行显示，而是会突破规定的宽度。
# compact：默认为False。如果值为False，超过width规定长度的序列会被分散打印到多行。如果为True，会尽量使序列填满width规定的宽度。
reuters_corpus = read_corpus()
pprint.pprint(reuters_corpus[:1], compact=True, width=100)  # compact 设置为False是一行一个单词

# 问题1.1：实现不同单词
# 计算语料库的单词数量、单词集
def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): list of distinct words across the corpus, sorted (using python 'sorted' function)
            num_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    num_corpus_words = -1

    # Write your implementation here.
    corpus = [w for sent in corpus for w in sent]
    corpus_words = list(set(corpus))
    corpus_words = sorted(corpus_words)
    num_corpus_words = len(corpus_words)
    # 返回的结果是语料库中的所有单词按照字母顺序排列的。
    return corpus_words, num_corpus_words

# 问题1.2：实现共现矩阵
# 计算给定语料库的共现矩阵。具体来说，对于每一个词 w，统计前、后方 window_size 个词的出现次数\
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).

        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.

              For example, if we take the document "START All that glitters is not gold END" with window size of 4,
              "All" will co-occur with "START", "that", "glitters", "is", and "not".

        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (numpy matrix of shape (number of corpus words, number of corpus words)):
                Co-occurence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2Ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}

    # Write your implementation here.
    M = np.zeros(shape=(num_words, num_words), dtype=np.int32)
    for i in range(num_words):
        word2Ind[words[i]] = i

    for sent in corpus:
        for p in range(len(sent)):
            ci = word2Ind[sent[p]]

            # preceding
            for w in sent[max(0, p - window_size):p]:
                wi = word2Ind[w]
                M[ci][wi] += 1

            # subsequent
            for w in sent[p + 1:p + 1 + window_size]:
                wi = word2Ind[w]
                M[ci][wi] += 1

    return M, word2Ind

# 问题1.3：实现降到k维
# 这一步是降维。
# 在问题1.2得到的是一个N x N的矩阵（N是单词集的大小），使用scikit-learn实现的SVD（奇异值分解），从这个大矩阵里分解出一个含k个特制的N x k 小矩阵。
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html

        Params:
            M (numpy matrix of shape (number of corpus words, number of corpus words)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))

    # Write your implementation here.
    svd = TruncatedSVD(n_components=k)
    svd.fit(M.T)
    M_reduced = svd.components_.T

    print("Done.")
    return M_reduced

# 问题1.4 实现 plot_embeddings
# 编写一个函数来绘制2D空间中的一组2D矢量。
# 基于matplotlib，用scatter 画 “×”，用 text 写字
def plot_embeddings(M_reduced, word2Ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2Ind.
        Include a label next to each point.

        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensioal word embeddings
            word2Ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """
    # Write your implementation here.
    fig = plt.figure()
    plt.style.use("seaborn-whitegrid")
    for word in words:
        point = M_reduced[word2Ind[word]]
        plt.scatter(point[0], point[1], marker="^")
        plt.annotate(word, xy=(point[0], point[1]), xytext=(point[0], point[1] + 0.1))

# 测试解决方案图
print("-" * 80)
print("Outputted Plot:")

M_reduced_plot_test = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0, 0]])
word2Ind_plot_test = {'test1': 0, 'test2': 1, 'test3': 2, 'test4': 3, 'test5': 4}
words = ['test1', 'test2', 'test3', 'test4', 'test5']
plot_embeddings(M_reduced_plot_test, word2Ind_plot_test, words)

print("-" * 80)
print()

# 问题1.5：共现打印分析
# 将词嵌入到2个维度上，归一化，最终词向量会落到一个单位圆内，在坐标系上寻找相近的词。
reuters_corpus = read_corpus()
M_co_occurrence, word2Ind_co_occurrence = compute_co_occurrence_matrix(reuters_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']

plot_embeddings(M_normalized, word2Ind_co_occurrence, words)
plt.show()

# Part 2：基于预测的词向量
# 使用gensim探索词向量，不是自己实现word2vec，所使用的词向量维度是300，由google发布。
def load_word2vec(embeddings_fp="./GoogleNews-vectors-negative300.bin"):
    """ Load Word2Vec Vectors
        Param:
            embeddings_fp (string) - path to .bin file of pretrained word vectors
        Return:
            wv_from_bin: All 3 million embeddings, each lengh 300
                This is the KeyedVectors format: https://radimrehurek.com/gensim/models/deprecated/keyedvectors.html
    """
    embed_size = 300
    print("Loading 3 million word vectors from file...")
    ## 自己下载的文件
    wv_from_bin = KeyedVectors.load_word2vec_format(embeddings_fp, binary=True)
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin
wv_from_bin = load_word2vec()
print()

# 首先使用SVD降维，将300维降2维，方便打印查看。
# 问题2.1：word2vec打印分析
# 和问题1.5一样
def get_matrix_of_vectors(wv_from_bin, required_words=['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']):
    """ Put the word2vec vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 3 million word2vec vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 300) containing the vectors
            word2Ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.vocab.keys())
    print("Shuffling words ...")
    random.shuffle(words)
    words = words[:10000]       # 选10000个加入
    print("Putting %i words into word2Ind and matrix M..." % len(words))
    word2Ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        try:
            M.append(wv_from_bin.word_vec(w))
            word2Ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2Ind

# 测试解决方案图
print("-" * 80)
print("Outputted Plot:")
print("-" * 80)

M, word2Ind = get_matrix_of_vectors(wv_from_bin)
M_reduced = reduce_to_k_dim(M, k=2)         # 减到了2维
plt.tight_layout()
words = ['barrels', 'bpd', 'ecuador', 'energy', 'industry', 'kuwait', 'oil', 'output', 'petroleum', 'venezuela']
plot_embeddings(M_reduced, word2Ind, words)
plt.show()

# 问题2.2：一词多义
# 找到一个有多个含义的词（比如 “leaves”，“scoop”），这种词的top-10相似词（根据余弦相似度）里有两个词的意思不一样。比如"leaves"（叶子，花瓣）的top-10词里有"vanishes"（消失）和"stalks"（茎秆）。
# 这里我找到的词是"column"（列），它的top-10里有"columnist"（专栏作家）和"article"（文章）
w0 = "column"
w0_mean = wv_from_bin.most_similar(w0)
print("column：", w0_mean)
print()

# 问题2.3：近义词和反义词
# 找到三个词(w1, w2, w3)，其中w1和w2是近义词，w1和w3是反义词，但是w1和w3的距离
# 例如：w1=“happy”，w2=“cheerful”，w3=“sad”
w1 = "love"
w2 = "like"
w3 = "hate"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)
print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))
print()

# 问题2.4：类比
# man 对于 king，相当于woman对于___，这样的问题也可以用word2vec来解决
# man : him :: woman : her
print("类比 man : him :: woman : her：")
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'him'], negative=['man']))
print()


# 问题2.5：错误的类比
# 找到一个错误的类比，树：树叶 ：：花：花瓣
print("错误的类比 tree : leaf :: flower : petal：")
pprint.pprint(wv_from_bin.most_similar(positive=['leaf', 'flower'], negative=['tree']))
print()

# 问题2.6：偏见分析
# 注意偏见是很重要的比如性别歧视、种族歧视等，执行下面代码，分析两个问题：
# (a) 哪个词与“woman”和“boss”最相似，和“man”最不相似?
# (b) 哪个词与“man”和“boss”最相似，和“woman”最不相似?
print("偏见 woman : boss :: man：")
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'boss'], negative=['man']))
print()
print("偏见 man : boss :: woman：")
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'boss'], negative=['woman']))
print()

# 问题2.7：自行分析偏见
#     男人:女人 :: 医生:___
#     女人:男人 :: 医生:___
print("自行分析偏见 woman : doctor :: man：")
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'doctor'], negative=['man']))
print()
print("自行分析偏见 man : doctor :: woman：")
pprint.pprint(wv_from_bin.most_similar(positive=['man', 'doctor'], negative=['woman']))
print()

你可能感兴趣的:(自然语言处理)

探索OpenAI和LangChain的适配器集成：轻松切换模型提供商 nseejrukjhad langchain easyui 前端 python
#探索OpenAI和LangChain的适配器集成：轻松切换模型提供商##引言在人工智能和自然语言处理的世界中，OpenAI的模型提供了强大的能力。然而，随着技术的发展，许多人开始探索其他模型以满足特定需求。LangChain作为一个强大的工具，集成了多种模型提供商，通过提供适配器，简化了不同模型之间的转换。本篇文章将介绍如何使用LangChain的适配器与OpenAI集成，以便轻松切换模型提供商
使用Apify加载Twitter消息以进行微调的完整指南 nseejrukjhad twitter easyui 前端 python
#使用Apify加载Twitter消息以进行微调的完整指南##引言在自然语言处理领域，微调模型以适应特定任务是提升模型性能的常见方法。本文将介绍如何使用Apify从Twitter导出聊天信息，以便进一步进行微调。##主要内容###使用Apify导出推文首先，我们需要从Twitter导出推文。Apify可以帮助我们做到这一点。通过Apify的强大功能，我们可以批量抓取和导出数据，适用于各类应用场景。
深入理解 MultiQueryRetriever：提升向量数据库检索效果的强大工具 nseejrukjhad 数据库 python
深入理解MultiQueryRetriever：提升向量数据库检索效果的强大工具引言在人工智能和自然语言处理领域，高效准确的信息检索一直是一个关键挑战。传统的基于距离的向量数据库检索方法虽然广泛应用，但仍存在一些局限性。本文将介绍一种创新的解决方案：MultiQueryRetriever，它通过自动生成多个查询视角来增强检索效果，提高结果的相关性和多样性。MultiQueryRetriever的工
自然语言处理_tf-idf _feivirus_ 算法机器学习和数学自然语言处理 tf-idf 逆文档频率词频
importpandasaspdimportmath1.数据预处理docA="Thecatsatonmyface"docB="Thedogsatonmybed"wordsA=docA.split("")wordsB=docB.split("")wordsSet=set(wordsA).union(set(wordsB))print(wordsSet){'on','my','face','sat',
免费的GPT可在线直接使用（一键收藏） kkai人工智能 gpt
1、LuminAI（https://kk.zlrxjh.top）LuminAI标志着一款融合了星辰大数据模型与文脉深度模型的先进知识增强型语言处理系统，旨在自然语言处理（NLP）的技术开发领域发光发热。此系统展现了卓越的语义把握与内容生成能力，轻松驾驭多样化的自然语言处理任务。VisionAI在NLP界的应用领域广泛，能够胜任从机器翻译、文本概要撰写、情绪分析到问答等众多任务。通过对大量文本数据的
推荐3家毕业AI论文可五分钟一键生成！文末附免费教程！小猪包333 写论文人工智能 AI写作深度学习计算机视觉
在当前的学术研究和写作领域，AI论文生成器已经成为许多研究人员和学生的重要工具。这些工具不仅能够帮助用户快速生成高质量的论文内容，还能进行内容优化、查重和排版等操作。以下是三款值得推荐的AI论文生成器：千笔-AIPassPaper、懒人论文以及AIPaperPass。千笔-AIPassPaper千笔-AIPassPaper是一款基于深度学习和自然语言处理技术的AI写作助手，旨在帮助用户快速生成高质
AI论文题目生成器怎么用？9款论文写作网站简单3步搞定小猪包333 写论文人工智能深度学习计算机视觉
在当今信息爆炸的时代，AI写作工具的出现极大地提高了写作效率和质量。本文将详细介绍9款优秀的论文写作网站，并重点推荐千笔-AIPassPaper。一、千笔-AIPassPaper千笔-AIPassPaper是一款功能强大的AI论文生成器，基于最新的自然语言处理技术，能够一键生成高质量的毕业论文、开题报告等文本内容。它不仅提供智能选题、文献推荐和论文润色等功能，还具有较高的用户评价。其文献综述生成功
AI大模型的架构演进与最新发展季风泯灭的季节 AI大模型应用技术二人工智能架构
随着深度学习的发展，AI大模型（LargeLanguageModels,LLMs）在自然语言处理、计算机视觉等领域取得了革命性的进展。本文将详细探讨AI大模型的架构演进，包括从Transformer的提出到GPT、BERT、T5等模型的历史演变，并探讨这些模型的技术细节及其在现代人工智能中的核心作用。一、基础模型介绍：Transformer的核心原理Transformer架构的背景在Transfo
FlagEmbedding 吉小雨 python库 python
FlagEmbedding教程FlagEmbedding是一个用于生成文本嵌入（textembeddings）的库，适合处理自然语言处理（NLP）中的各种任务。嵌入（embeddings）是将文本表示为连续向量，能够捕捉语义上的相似性，常用于文本分类、聚类、信息检索等场景。官方文档链接：FlagEmbedding官方GitHub一、FlagEmbedding库概述1.1什么是FlagEmbeddi
【NumPy】深入解析numpy.zeros()函数二七830 numpy
欢迎莅临我的个人主页这里是我深耕Python编程、机器学习和自然语言处理（NLP）领域，并乐于分享知识与经验的小天地！博主简介：我是二七830，一名对技术充满热情的探索者。多年的Python编程和机器学习实践，使我深入理解了这些技术的核心原理，并能够在实际项目中灵活应用。尤其是在NLP领域，我积累了丰富的经验，能够处理各种复杂的自然语言任务。技术专长：我熟练掌握Python编程语言，并深入研究了机
Humanize 项目教程尤嫒冰
Humanize项目教程humanizeAJSlibraryforaddinga“humantouch”todata.项目地址:https://gitcode.com/gh_mirrors/humani/humanize项目介绍Humanize是一个开源项目，旨在将机器生成的文本转换为更加自然、人性化的文本。该项目通过先进的算法和自然语言处理技术，使得AI生成的内容更加贴近人类的表达方式，从而提高
全自动解密解码神器 — Ciphey K'illCode python_模块 python vscode
Ciphey是一个使用自然语言处理和人工智能的全自动解密/解码/破解工具。简单地来讲，你只需要输入加密文本，它就能给你返回解密文本。就是这么牛逼。有了Ciphey，你根本不需要知道你的密文是哪种类型的加密，你只知道它是加密的，那么Ciphey就能在3秒甚至更短的时间内给你解密，返回你想要的大部分密文的答案。下面就给大家介绍Ciphey的实战使用教程。1.准备开始之前，你要确保Python和pip已
CV、NLP、数据控掘推荐、量化海的那边- AI算法自然语言处理人工智能
下面是对CV（计算机视觉）、NLP（自然语言处理）、数据挖掘推荐和量化的简要概述及其应用领域的介绍：1.CV（计算机视觉，ComputerVision）定义：计算机视觉是一门让计算机能够从图像或视频中提取有用信息，并做出决策的学科。它通过模拟人类的视觉系统来识别、处理和理解视觉信息。主要任务：图像分类：识别图像中的物体并分类，比如猫、狗、车等。目标检测：在图像或视频中定位并识别多个对象，如人脸检测
深度解析：如何使用输出解析器将大型语言模型（LLM）的响应解析为结构化JSON格式 m0_57781768 语言模型 json 人工智能
深度解析：如何使用输出解析器将大型语言模型（LLM）的响应解析为结构化JSON格式在现代自然语言处理（NLP）的应用中，大型语言模型（LLM）已经成为了重要的工具。这些模型能够生成丰富的自然语言文本，适用于各种应用场景。然而，在某些应用中，开发者不仅仅需要生成文本，还需要将这些生成的文本转换为结构化的数据格式，例如JSON。这种结构化的数据格式在数据传输、存储以及进一步处理时具有显著优势。本文将深
深入探讨：如何在Python中通过LangChain技术精准追踪大型语言模型（LLM）的Token使用情况 m0_57781768 python langchain 语言模型
深入探讨：如何在Python中通过LangChain技术精准追踪大型语言模型（LLM）的Token使用情况在现代的人工智能开发中，大型语言模型（LLM）已经成为了不可或缺的工具，无论是用于自然语言处理、对话生成，还是其他复杂的文本生成任务。然而，随着这些模型的广泛应用，开发者面临的一个重要挑战是如何有效地追踪和管理Token的使用情况，特别是在生产环境中，Token的使用直接影响着API调用的成本
使用最大边际相关性(MMR)选择示例：提高AI模型的多样性和相关性 aehrutktrjk 人工智能 easyui 前端 python
使用最大边际相关性(MMR)选择示例：提高AI模型的多样性和相关性引言在机器学习和自然语言处理领域，选择合适的训练示例对模型性能至关重要。最大边际相关性(MaximalMarginalRelevance,MMR)是一种优秀的示例选择方法，它不仅考虑了示例与输入的相关性，还注重保持所选示例之间的多样性。本文将深入探讨如何使用MMR来选择示例，以提高AI模型的性能和泛化能力。什么是最大边际相关性(MM
使用LangChain和OpenAI实现高效文本标注 aehrutktrjk langchain python
使用LangChain和OpenAI实现高效文本标注引言在自然语言处理(NLP)领域，文本标注是一项重要且常见的任务。它涉及为文本分配标签，如情感、语言、风格等。本文将介绍如何使用LangChain和OpenAI的API来实现高效的文本标注系统。我们将探讨如何设置环境、定义标注模式，以及如何使用OpenAI的模型来执行标注任务。环境准备首先，我们需要安装必要的库并设置API密钥：%pipinsta
基于深度学习的文本引导的图像编辑 SEU-WYL 深度学习dnn 深度学习人工智能
基于深度学习的文本引导的图像编辑（Text-GuidedImageEditing）是一种通过自然语言文本指令对图像进行编辑或修改的技术。它结合了图像生成和自然语言处理（NLP）的最新进展，使用户能够通过描述性文本对图像内容进行精确的调整和操控。1.文本引导的图像编辑的挑战文本和图像之间的对齐：如何将文本中的语义信息准确地映射到图像中的特定区域或元素是一个关键挑战。这涉及到多模态数据的对齐和理解。编
多模态Transformer之文本与图像联合建模 - Transformer教程 shandianfk_com ChatGPT Transformer transformer 深度学习人工智能
大家好，今天我们来聊聊一个既前沿又有趣的话题——多模态Transformer，特别是文本与图像的联合建模。对于很多小伙伴来说，Transformer这个词已经不陌生了，但它不仅仅应用于自然语言处理，还能在图像处理、甚至是多模态数据的处理上大显身手。接下来，我会带大家深入了解什么是多模态Transformer，以及它是如何实现文本与图像的联合建模的。Transformer简介首先，我们简单回顾一下T
什么是AIGC？有哪些免费工具？ chent_某位 AIGC
AIGC（AIGeneratedContent），即“人工智能生成内容”，是指通过人工智能技术自动生成各种类型的数字内容。AIGC让机器能够根据输入的信息或数据生成符合人类需求的文本、图像、音频、视频等内容，极大提高了内容创作的效率。AIGC的背景与起源随着深度学习和自然语言处理技术的快速发展，人工智能已经不再局限于简单的任务，如分类、预测和数据分析，而是具备了生成内容的能力。生成式AI模型，如O
transformer架构(Transformer Architecture)原理与代码实战案例讲解 AI架构设计之禅大数据AI人工智能 Python入门实战计算科学神经计算深度学习神经网络大数据人工智能大型语言模型 AI AGI LLM Java Python 架构设计 Agent RPA
transformer架构(TransformerArchitecture)原理与代码实战案例讲解关键词：Transformer,自注意力机制,编码器-解码器,预训练,微调,NLP,机器翻译作者：禅与计算机程序设计艺术/ZenandtheArtofComputerProgramming1.背景介绍1.1问题的由来自然语言处理（NLP）领域的发展经历了从规则驱动到统计驱动再到深度学习驱动的三个阶段。
英伟达（NVIDIA）B200架构解读 weixin_41205263 芯际争霸 GPGPU架构 gpu算力人工智能硬件架构
H100芯片是一款高性能AI芯片，其中的TransformerEngine是专门用于加速Transformer模型计算的核心部件。Transformer模型是一种自然语言处理（NLP）模型，广泛应用于机器翻译、文本生成等任务。TransformerEngine的电路设计原理主要包括以下几个方面：
使用LangChain与Together AI模型交互：深入探讨和实践指南 llzwxh888 langchain 人工智能交互 python
使用LangChain与TogetherAI模型交互：深入探讨和实践指南1.引言在人工智能和自然语言处理领域，TogetherAI已经成为一个强大的平台，提供了对50多个领先开源模型的访问。本文将深入探讨如何使用LangChain与TogetherAI模型进行交互，为开发者提供实用的知识和见解，同时解决可能遇到的常见问题。2.TogetherAI简介TogetherAI是一个强大的API平台，允许
OpenLM: 一个灵活的开源大语言模型接口工具 llzwxh888 语言模型人工智能自然语言处理 python
OpenLM:一个灵活的开源大语言模型接口工具引言在人工智能和自然语言处理快速发展的今天，大语言模型(LLM)已经成为许多应用的核心。然而，不同的LLM提供商往往有着各自的API和使用方式，这给开发者带来了一定的挑战。本文将介绍OpenLM，这是一个零依赖、兼容OpenAIAPI的LLM提供者接口，它可以直接通过HTTP调用不同的推理端点。我们将深入探讨OpenLM的特性、使用方法，以及如何将其与
使用中专API实现AI模型调用与部署 llzwxh888 人工智能 easyui 前端 python
在AI技术领域，如何调用和部署大语言模型（LLM）是一个常见的需求。本文将详细介绍如何通过中专API地址http://api.wlai.vip，实现对OpenAI大模型的调用与部署，并提供一个详细的demo代码示例。引言随着人工智能技术的飞速发展，大语言模型在自然语言处理任务中的表现尤为突出。然而，由于国内访问海外API存在一定限制，本文将使用中专API地址来解决这一问题，并展示如何在本地环境中配
深度学习入门篇：PyTorch实现手写数字识别 AI_Guru人工智能深度学习 pytorch 人工智能
深度学习作为机器学习的一个分支，近年来在图像识别、自然语言处理等领域取得了显著的成就。在众多的深度学习框架中，PyTorch以其动态计算图、易用性强和灵活度高等特点，受到了广泛的喜爱。本篇文章将带领大家使用PyTorch框架，实现一个手写数字识别的基础模型。手写数字识别简介手写数字识别是计算机视觉领域的一个经典问题，目的是让计算机能够识别并理解手写数字图像。这个问题通常作为深度学习入门的练习，因为
基于人工智能的智能语音助手人工智能发烧友人工智能
语音助手的自然语言处理模块是语音助手系统的关键组成部分。通过这个模块，系统能够识别用户的意图并做出相应的回应。我们可以使用NLP技术来解析文本输入，并将其转换为系统可以理解的命令或指令。在本项目中，我们将结合语音识别、自然语言处理和语音合成技术，构建一个功能简化的语音助手。一、项目背景与需求分析1.1项目目标本项目旨在创建一个语音助手系统，它可以：1.语音识别：从用户的语音输入中提取文本信息。2.
深入掌握大模型精髓：《实战AI大模型》带你全面理解大模型开发！努力的光头强人工智能 langchain prompt transformer 深度学习
今天，人工智能技术的快速发展和广泛应用已经引起了大众的关注和兴趣，它不仅成为技术发展的核心驱动力，更是推动着社会生活的全方位变革。特别是作为AI重要分支的深度学习，通过不断刷新的表现力已引领并定义了一场科技革命。大型深度学习模型（简称AI大模型）以其强大的表征能力和卓越的性能，在自然语言处理、计算机视觉、推荐系统等领域均取得了突破性的进展。尤其随着AI大模型的广泛应用，无数领域因此受益。AI大模型
安装jina，并使用jina的向量化和重排序的功能 MonkeyKing.sun milvus numpy
为了在Python的FastAPI项目中使用Jina进行向量化和重排序，您需要按照以下步骤安装和使用Jina。1.安装Jina首先，确保您已经安装了Jina。可以使用pip来安装。pipinstalljina如果需要特定的功能模块，例如自然语言处理相关的向量化模型，可以通过JinaHub获取。pipinstalljina[hub]2.在FastAPI项目中集成Jina接下来，我们将Jina集成到F
Matlab,Python,Java,C++的比较 Codefengfeng python java c++
Matlabmatlab是一个大型计算机，擅长矩阵计算与科学计算，适合构建模型；然而，编译软件的运行效率低，不适合大型软件开发。Pythonpython的优势是简单，入门快。适合做数据挖掘、数据分析、机器学习、人工智能、自然语言处理、爬虫、批量文件处理等，此外，Python开源免费，有很多的库，开发环境开发社区都比较友好；不过，Python是动态型的语言，需要更多的测试，并且错误仅仅是在运行的时候
ios内付费 374016526 ios 内付费
近年来写了很多IOS的程序，内付费也用到不少，使用IOS的内付费实现起来比较麻烦，这里我写了一个简单的内付费包，希望对大家有帮助。具体使用如下: 这里的sender其实就是调用者，这里主要是为了回调使用。 [KuroStoreApi kuroStoreProductId:@"产品ID" storeSender:self storeFinishCallBa
20 款优秀的 Linux 终端仿真器 brotherlamp linux linux视频 linux资料 linux自学 linux教程
终端仿真器是一款用其它显示架构重现可视终端的计算机程序。换句话说就是终端仿真器能使哑终端看似像一台连接上了服务器的客户机。终端仿真器允许最终用户用文本用户界面和命令行来访问控制台和应用程序。（LCTT 译注：终端仿真器原意指对大型机-哑终端方式的模拟，不过在当今的 Linux 环境中，常指通过远程或本地方式连接的伪终端，俗称“终端”。）你能从开源世界中找到大量的终端仿真器，它们
Solr Deep Paging(solr 深分页) eksliang solr深分页 solr分页性能问题
转载请出自出处：http://eksliang.iteye.com/blog/2148370 作者：eksliang(ickes) blg:http://eksliang.iteye.com/ 概述长期以来，我们一直有一个深分页问题。如果直接跳到很靠后的页数，查询速度会比较慢。这是因为Solr的需要为查询从开始遍历所有数据。直到Solr的4.7这个问题一直没有一个很好的解决方案。直到solr
数据库面试题 18289753290 面试题数据库
1.union ,union all 网络搜索出的最佳答案： union和union all的区别是,union会自动压缩多个结果集合中的重复结果，而union all则将所有的结果全部显示出来，不管是不是重复。 Union：对两个结果集进行并集操作，不包括重复行，同时进行默认规则的排序； Union All：对两个结果集进行并集操作，包括重复行，不进行排序； 2.索引有哪些分类？作用是
Android TV屏幕适配酷的飞上天空 android
先说下现在市面上TV分辨率的大概情况两种分辨率为主 1.720标清，分辨率为1280x720. 屏幕尺寸以32寸为主，部分电视为42寸 2.1080p全高清，分辨率为1920x1080 屏幕尺寸以42寸为主，此分辨率电视屏幕从32寸到50寸都有适配遇到问题，已1080p尺寸为例：分辨率固定不变，屏幕尺寸变化较大。如：效果图尺寸为1920x1080，如果使用d
Timer定时器与ActionListener联合应用永夜-极光 java
功能:在控制台每秒输出一次代码: package Main; import javax.swing.Timer; import java.awt.event.*; public class T { private static int count = 0; public static void main(String[] args){
Ubuntu14.04系统Tab键不能自动补全问题解决随便小屋 Ubuntu 14.04
Unbuntu 14.4安装之后就在终端中使用Tab键不能自动补全，解决办法如下： 1、利用vi编辑器打开/etc/bash.bashrc文件（需要root权限） sudo vi /etc/bash.bashrc 接下来会提示输入密码 2、找到文件中的下列代码 #enable bash completion in interactive shells #if
学会人际关系三招轻松走职场 aijuans 职场
要想成功，仅有专业能力是不够的，处理好与老板、同事及下属的人际关系也是门大学问。如何才能在职场如鱼得水、游刃有余呢？在此，教您简单实用的三个窍门。　　第一，多汇报最近，管理学又提出了一个新名词“追随力”。它告诉我们，做下属最关键的就是要多请示汇报，让上司随时了解你的工作进度，有了新想法也要及时建议。不知不觉，你就有了“追随力”，上司会越来越了解和信任你。　　第二，勤沟通团队的力
《O2O：移动互联网时代的商业革命》读书笔记 aoyouzi 读书笔记
移动互联网的未来：碎片化内容+碎片化渠道=各式精准、互动的新型社会化营销。 O2O：Online to OffLine 线上线下活动 O2O就是在移动互联网时代，生活消费领域通过线上和线下互动的一种新型商业模式。手机二维码本质：O2O商务行为从线下现实世界到线上虚拟世界的入口。线上虚拟世界创造的本意是打破信息鸿沟，让不同地域、不同需求的人
js实现图片随鼠标滚动的效果百合不是茶 JavaScript 滚动属性的获取图片滚动属性获取页面加载
1,获取样式属性值 top 与顶部的距离 left 与左边的距离 right 与右边的距离 bottom 与下边的距离 zIndex 层叠层次例子:获取左边的宽度,当css写在body标签中时 <div id="adver" style="position:absolute;top:50px;left:1000p
ajax同步异步参数async bijian1013 jquery Ajax async
开发项目开发过程中，需要将ajax的返回值赋到全局变量中，然后在该页面其他地方引用，因为ajax异步的原因一直无法成功，需将async:false，使其变成同步的。格式： $.ajax({ type: 'POST', ur
Webx3框架（1） Bill_chen eclipse spring maven 框架 ibatis
Webx是淘宝开发的一套Web开发框架，Webx3是其第三个升级版本；采用Eclipse的开发环境，现在支持java开发；采用turbine原型的MVC框架，扩展了Spring容器，利用Maven进行项目的构建管理，灵活的ibatis持久层支持，总的来说，还是一套很不错的Web框架。 Webx3遵循turbine风格，velocity的模板被分为layout/screen/control三部
【MongoDB学习笔记五】MongoDB概述 bit1129 mongodb
MongoDB是面向文档的NoSQL数据库，尽量业界还对MongoDB存在一些质疑的声音，比如性能尤其是查询性能、数据一致性的支持没有想象的那么好，但是MongoDB用户群确实已经够多。MongoDB的亮点不在于它的性能，而是它处理非结构化数据的能力以及内置对分布式的支持(复制、分片达到的高可用、高可伸缩)，同时它提供的近似于SQL的查询能力，也是在做NoSQL技术选型时，考虑的一个重要因素。Mo
spring/hibernate/struts2常见异常总结白糖_ Hibernate
Spring ①ClassNotFoundException: org.aspectj.weaver.reflect.ReflectionWorld$ReflectionWorldException 缺少aspectjweaver.jar，该jar包常用于spring aop中 ②java.lang.ClassNotFoundException: org.sprin
jquery easyui表单重置(reset)扩展思路 bozch form jquery easyui reset
在jquery easyui表单中尚未提供表单重置的功能，这就需要自己对其进行扩展。扩展的时候要考虑的控件有： combo,combobox,combogrid,combotree,datebox,datetimebox 需要对其添加reset方法，reset方法就是把初始化的值赋值给当前的组件，这就需要在组件的初始化时将值保存下来。在所有的reset方法添加完毕之后，就需要对fo
编程之美-烙饼排序 bylijinnan 编程之美
package beautyOfCoding; import java.util.Arrays; /* *《编程之美》的思路是：搜索+剪枝。有点像是写下棋程序：当前情况下，把所有可能的下一步都做一遍；在这每一遍操作里面，计算出如果按这一步走的话，能不能赢（得出最优结果）。 *《编程之美》上代码有很多错误，且每个变量的含义令人费解。因此我按我的理解写了以下代码： */
Struts1.X 源码分析之ActionForm赋值原理 chenbowen00 struts
struts1在处理请求参数之前，首先会根据配置文件action节点的name属性创建对应的ActionForm。如果配置了name属性，却找不到对应的ActionForm类也不会报错，只是不会处理本次请求的请求参数。如果找到了对应的ActionForm类，则先判断是否已经存在ActionForm的实例，如果不存在则创建实例，并将其存放在对应的作用域中。作用域由配置文件action节点的s
[空天防御与经济]在获得充足的外部资源之前,太空投资需有限度 comsci 资源
这里有一个常识性的问题: 地球的资源,人类的资金是有限的,而太空是无限的..... 就算全人类联合起来,要在太空中修建大型空间站,也不一定能够成功,因为资源和资金,技术有客观的限制.... &
ORACLE临时表—ON COMMIT PRESERVE ROWS daizj oracle 临时表
ORACLE临时表转临时表：像普通表一样，有结构，但是对数据的管理上不一样，临时表存储事务或会话的中间结果集，临时表中保存的数据只对当前会话可见，所有会话都看不到其他会话的数据，即使其他会话提交了，也看不到。临时表不存在并发行为，因为他们对于当前会话都是独立的。创建临时表时，ORACLE只创建了表的结构（在数据字典中定义），并没有初始化内存空间，当某一会话使用临时表时，ORALCE会
基于Nginx XSendfile+SpringMVC进行文件下载 denger 应用服务器 Web nginx 网络应用 lighttpd
在平常我们实现文件下载通常是通过普通 read-write方式，如下代码所示。 @RequestMapping("/courseware/{id}") public void download(@PathVariable("id") String courseID, HttpServletResp
scanf接受char类型的字符 dcj3sjt126com c
/* 2013年3月11日22:35:54 目的：学习char只接受一个字符 */ # include <stdio.h> int main(void) { int i; char ch; scanf("%d", &i); printf("i = %d\n", i); scanf("%
学编程的价值 dcj3sjt126com 编程
发一个人会编程, 想想以后可以教儿女, 是多么美好的事啊, 不管儿女将来从事什么样的职业, 教一教, 对他思维的开拓大有帮助像这位朋友学习: http://blog.sina.com.cn/s/articlelist_2584320772_0_1.html VirtualGS教程 (By @林泰前): 几十年的老程序员，资深的
二维数组（矩阵）对角线输出飞天奔月二维数组
今天在BBS里面看到这样的面试题目, 1，二维数组（N*N），沿对角线方向，从右上角打印到左下角如N=4： 4*4二维数组 { 1 2 3 4 } { 5 6 7 8 } { 9 10 11 12 } {13 14 15 16 } 打印顺序 4 3 8 2 7 12 1 6 11 16 5 10 15 9 14 13 要
Ehcache（08）——可阻塞的Cache——BlockingCache 234390216 并发 ehcache BlockingCache 阻塞
可阻塞的Cache—BlockingCache 在上一节我们提到了显示使用Ehcache锁的问题，其实我们还可以隐式的来使用Ehcache的锁，那就是通过BlockingCache。BlockingCache是Ehcache的一个封装类，可以让我们对Ehcache进行并发操作。其内部的锁机制是使用的net.
mysqldiff对数据库间进行差异比较 jackyrong mysqld
mysqldiff该工具是官方mysql-utilities工具集的一个脚本，可以用来对比不同数据库之间的表结构，或者同个数据库间的表结构如果在windows下，直接下载mysql-utilities安装就可以了，然后运行后，会跑到命令行下： 1）基本用法 mysqldiff --server1=admin:12345
spring data jpa 方法中可用的关键字 lawrence.li java spring
spring data jpa 支持以方法名进行查询/删除/统计。查询的关键字为find 删除的关键字为delete/remove (>=1.7.x) 统计的关键字为count (>=1.7.x) 修改需要使用@Modifying注解 @Modifying @Query("update User u set u.firstna
Spring的ModelAndView类 nicegege spring
项目中controller的方法跳转的到ModelAndView类，一直很好奇spring怎么实现的？ /* * Copyright 2002-2010 the original author or authors. * * Licensed under the Apache License, Version 2.0 (the "License"); * yo
搭建 CentOS 6 服务器(13) - rsync、Amanda rensanning centos
（一）rsync Server端 # yum install rsync # vi /etc/xinetd.d/rsync service rsync { disable = no flags = IPv6 socket_type = stream wait
Learn Nodejs 02 toknowme nodejs
（1）npm是什么 npm is the package manager for node 官方网站：https://www.npmjs.com/ npm上有很多优秀的nodejs包，来解决常见的一些问题，比如用node-mysql，就可以方便通过nodejs链接到mysql，进行数据库的操作在开发过程往往会需要用到其他的包，使用npm就可以下载这些包来供程序调用 &nb
Spring MVC 拦截器 xp9802 spring mvc
Controller层的拦截器继承于HandlerInterceptorAdapter HandlerInterceptorAdapter.java 1 public abstract class HandlerInterceptorAdapter implements HandlerIntercep