Word Vector Source Code Analysis: (5.8) ngram2vec Source Code Analysis: counts2ppmi and More

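Next we weight the co-occurrence matrix to obtain the PPMI matrix. The name counts2ppmi is not entirely accurate: what this file actually produces is the PMI matrix. Probably for the sake of consistency, the toolkit renames everything that should be called PMI to PPMI. counts2ppmi in ngram2vec makes sensible use of the sparse matrices in scipy: it builds the co-occurrence matrix from the file quickly and then weights it to obtain the PMI matrix. Note that it assumes all the triples can be read into memory at once, so it may run out of memory.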

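# Imports needed by the functions shown from this file.
# load_vocabulary is the toolkit's own helper and is defined elsewhere.
import numpy as np
from scipy.sparse import csr_matrix, dok_matrix

def read_counts_matrix(words_path, contexts_path, counts_path):
    wi, iw = load_vocabulary(words_path)  # read the center-word vocabulary
    ci, ic = load_vocabulary(contexts_path)  # read the context vocabulary
    counts_num = 0
    row = []  # row ids of the non-zero entries
    col = []  # column ids of the non-zero entries
    data = []  # values of the non-zero entries
    with open(counts_path) as f:
        print(str(counts_num // 1000**2) + "M counts processed.")
        for line in f:
            if counts_num % 1000**2 == 0:
                # "\x1b[1A" moves the cursor up one line, so the progress message is overwritten
                print("\x1b[1A" + str(counts_num // 1000**2) + "M counts processed.")
            word, context, count = line.strip().split()  # read one (word, context, count) triple
            row.append(int(word))
            col.append(int(context))
            data.append(int(float(count)))
            counts_num += 1
    # Build the co-occurrence matrix in sparse (CSR) form; the counts are already sorted, so this step is cheap
    counts = csr_matrix((data, (row, col)), shape=(len(wi), len(ci)), dtype=np.float32)
    return counts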
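
To make the construction step concrete, here is a minimal sketch of building a CSR matrix from (row, column, value) triples. The toy triples are hypothetical and stand in for the lines of the counts file; the vocabulary sizes are fixed by hand instead of being read with load_vocabulary.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: (word_id, context_id, count) triples.
# The pair (0, 0) appears twice on purpose.
triples = [(0, 0, 2), (0, 0, 1), (0, 2, 1), (1, 1, 3), (2, 0, 4), (2, 2, 5)]

row = [w for w, _, _ in triples]
col = [c for _, c, _ in triples]
data = [n for _, _, n in triples]

# Same constructor form as in read_counts_matrix: (data, (row, col)).
# Entries that share the same (row, col) position are summed.
counts = csr_matrix((data, (row, col)), shape=(3, 3), dtype=np.float32)

print(counts.toarray())
```

The three Python lists hold every triple before the matrix exists, which is where the memory pressure mentioned above comes from.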

The rest, which computes the PMI matrix, is no different from hyperwords.
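
Reading calc_pmi below term by term, the value it stores for each non-zero cell appears to be the following ratio. The symbols are my own notation, not the toolkit's: #(w,c) is the co-occurrence count, #(w) and #(c) are the row and column sums, and α is the cds argument (context distribution smoothing).

```latex
M_{w,c} = \frac{\#(w,c)\,\sum_{c'} \#(c')^{\alpha}}{\#(w)\,\#(c)^{\alpha}},
\qquad
\mathrm{PMI}(w,c) = \log M_{w,c}
```

With α = 1 the sum in the numerator is just the total count. The function as shown stops at the ratio and does not take the logarithm itself.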

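def calc_pmi(counts, cds):
    sum_w = np.array(counts.sum(axis=1))[:, 0]  # row sums: total count of each word
    sum_c = np.array(counts.sum(axis=0))[0, :]  # column sums: total count of each context
    if cds != 1:
        sum_c = sum_c ** cds  # context distribution smoothing
    sum_total = sum_c.sum()
    sum_w = np.reciprocal(sum_w)
    sum_c = np.reciprocal(sum_c)

    # Each non-zero cell becomes count * sum_total / (sum_w * sum_c); no logarithm is taken here
    pmi = csr_matrix(counts)
    pmi = multiply_by_rows(pmi, sum_w)
    pmi = multiply_by_columns(pmi, sum_c)
    pmi = pmi * sum_total
    return pmi

def multiply_by_rows(matrix, row_coefs):
    # Left-multiplying by a diagonal matrix scales every row by its coefficient
    normalizer = dok_matrix((len(row_coefs), len(row_coefs)))
    normalizer.setdiag(row_coefs)
    return normalizer.tocsr().dot(matrix)

def multiply_by_columns(matrix, col_coefs):
    # Right-multiplying by a diagonal matrix scales every column by its coefficient
    normalizer = dok_matrix((len(col_coefs), len(col_coefs)))
    normalizer.setdiag(col_coefs)
    return matrix.dot(normalizer.tocsr())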
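
As a sanity check on the weighting above, here is a small sketch that recomputes the same ratio on a toy matrix and compares it with a dense calculation. The function name calc_ratio and the toy counts are hypothetical, and I have swapped the dok_matrix/setdiag normalizers for scipy.sparse.diags, which should give the same diagonal scaling.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def calc_ratio(counts, cds=1.0):
    """Return count * total / (row_sum * col_sum**cds) for the stored cells."""
    sum_w = np.asarray(counts.sum(axis=1)).ravel()
    sum_c = np.asarray(counts.sum(axis=0)).ravel() ** cds
    total = sum_c.sum()
    # A diagonal matrix on the left scales rows, on the right scales columns.
    scaled = diags(1.0 / sum_w).dot(counts).dot(diags(1.0 / sum_c))
    return (scaled * total).tocsr()

# Hypothetical toy counts with no empty rows or columns.
counts = csr_matrix(np.array([[2.0, 0.0, 1.0],
                              [0.0, 3.0, 0.0],
                              [4.0, 0.0, 5.0]]))
ratio = calc_ratio(counts, cds=0.75)

# Dense reference computation of the same quantity.
dense = counts.toarray()
sw = dense.sum(axis=1, keepdims=True)
sc = dense.sum(axis=0, keepdims=True) ** 0.75
expected = dense * sc.sum() / (sw * sc)

print(np.allclose(ratio.toarray(), expected))
```

Cells that are zero in the counts stay zero, so the result is exactly as sparse as the input. An empty row or column would cause a division by zero, which the toy data avoids.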

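The code that goes from PPMI to SVD is likewise essentially the same as in hyperwords. Here the name ppmi2svd is correct: SVD does read a PPMI matrix, because the representations package applies some simple processing to the PMI matrix first.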
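
The "simple processing" is not shown in this post. The sketch below is my assumption of what it amounts to, based on how PositiveExplicit behaves in hyperwords: take the logarithm of the stored values, subtract log(neg), and clip negatives to zero. The function name to_shifted_ppmi and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

def to_shifted_ppmi(ratio, neg=1):
    """Assumed post-processing: log, shift by log(neg), keep the positive part."""
    m = csr_matrix(ratio, dtype=np.float64, copy=True)
    m.data = np.log(m.data)   # ratio -> PMI, only on the stored (non-zero) cells
    m.data -= np.log(neg)     # shifted PMI; neg=1 leaves the values unchanged
    m.data[m.data < 0] = 0    # clip negative values
    m.eliminate_zeros()       # drop the cells that became zero
    return m

# Hypothetical ratio matrix, as it would come out of calc_pmi.
ratio = csr_matrix(np.array([[2.0, 0.0, 0.5],
                             [0.0, 1.0, 0.0],
                             [0.25, 0.0, 4.0]]))
ppmi = to_shifted_ppmi(ratio, neg=1)

print(ppmi.toarray())
```

Only cells whose ratio exceeds neg survive, so the PPMI matrix is usually sparser than the PMI matrix it came from.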

def main():
    args = docopt("""
    Usage:
        ppmi2svd.py [options]
    
    Options:
        --dim NUM    Dimensionality of eigenvectors [default: 300]
        --neg NUM    Number of negative samples; subtracts its log from PMI [default: 1]
    """)
    
    ppmi_path = args['']
    output_path = args['']
    dim = int(args['--dim'])
    neg = int(args['--neg'])
    
    # The PPMI matrix: the PositiveExplicit class applies some simple processing to the PMI matrix
    explicit = PositiveExplicit(ppmi_path, normalize=False, neg=neg)

    # Truncated SVD that keeps dim singular values
    ut, s, vt = sparsesvd(explicit.m.tocsc(), dim)

    np.save(output_path + '.ut.npy', ut)
    np.save(output_path + '.s.npy', s)
    np.save(output_path + '.vt.npy', vt)
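
To show how the three saved factors can become word vectors, here is a rough sketch. It uses scipy.sparse.linalg.svds as a stand-in for sparsesvd, so the shapes differ: svds returns u with one row per word, whereas the ut saved above is, as far as I can tell, the transposed layout. The random matrix, dim = 5 and the exponent eig = 0.5 are all hypothetical values chosen for illustration.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds

# Hypothetical stand-in for the PPMI matrix: 50 words x 40 contexts, about 20% non-zero.
rng = np.random.default_rng(0)
dense = rng.random((50, 40))
dense[dense < 0.8] = 0.0
ppmi = csc_matrix(dense)

dim = 5
# svds is swapped in for sparsesvd here; it does not promise descending order.
u, s, vt = svds(ppmi, k=dim)
order = np.argsort(s)[::-1]  # sort singular values from largest to smallest
u, s, vt = u[:, order], s[order], vt[order, :]

eig = 0.5  # exponent applied to the singular values
word_vectors = u * np.power(s, eig)  # one row per word, dim columns

print(word_vectors.shape)
```

Setting eig to 0 would use u alone and setting it to 1 would use the full u * s; which weighting works best is an empirical question.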

