LDA Series: An Introduction to LightLDA and DMTK

Microsoft Research Asia has open-sourced its Distributed Machine Learning Toolkit (DMTK). DMTK consists of a framework for distributed machine learning plus a set of distributed machine learning algorithms, making it a powerful toolkit for applying machine learning to big data. It contains the following components:

  • DMTK Framework: a flexible framework that supports a unified interface for data parallelization, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency (see the parameter-server sketch after this list).
  • LightLDA, an extremely fast and scalable topic model algorithm, with an O(1) Gibbs sampler and an efficient distributed implementation.
  • Distributed (Multisense) Word Embedding, a distributed version of (multi-sense) word embedding algorithm.
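
The list above names the framework's main building blocks. As a rough illustration of the data-parallel pattern they support, here is a minimal parameter-server-style sketch in Python; all class and function names are hypothetical, not DMTK's actual API:

```python
# Hypothetical parameter-server sketch (not DMTK's real interface):
# workers share a key-value model store and exchange additive updates.

class ParameterServerClient:
    """Toy stand-in for a distributed key-value model store."""
    def __init__(self):
        self.table = {}

    def get(self, keys):
        # Pull the latest values of the requested model parameters.
        return {k: self.table.get(k, 0.0) for k in keys}

    def add(self, updates):
        # Push additive deltas; the server aggregates across workers.
        for k, delta in updates.items():
            self.table[k] = self.table.get(k, 0.0) + delta

def worker_loop(client, shard, compute_update):
    """Data-parallel worker: iterate over this worker's own data shard,
    pulling only the model slice each block touches and pushing deltas back."""
    for keys, examples in shard:          # shard = [(touched_param_keys, examples), ...]
        params = client.get(keys)         # fetch the relevant model slice
        updates = compute_update(examples, params)
        client.add(updates)               # additive push, aggregated server-side
```

The real framework adds the pieces this toy omits: hybrid storage for big models, scheduling of which model slice each worker trains next, and pipelining that overlaps communication with computation.
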
**LightLDA** is a new, highly efficient algorithm for training topic models whose sampling complexity is independent of the number of topics. Its distributed implementation incorporates extensive system-level optimizations, enabling LightLDA to handle extremely large data and models on an ordinary computer cluster. For example, on a cluster of just 8 machines, we can train an LDA model with a 1-million-word vocabulary and 1 million topics (roughly 1 trillion parameters) on a dataset of 200 billion training tokens; experiments at this scale previously required clusters of thousands of machines.

Distributed word embedding: DMTK provides efficient distributed implementations of two word embedding algorithms. One is the standard word2vec algorithm; the other is a new algorithm that computes multiple embedding vectors for polysemous words.

Reference links
LightLDA
===========
LightLDA is a new, highly-efficient O(1) Metropolis-Hastings sampling algorithm, whose running cost is (surprisingly) agnostic of model size, and empirically converges nearly an order of magnitude more quickly than current state-of-the-art Gibbs samplers.
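
To see why a Metropolis-Hastings sampler can cost O(1) per token, recall Walker's alias method: after O(K) preprocessing, it draws from a fixed discrete distribution over K topics in constant time, and the bias from reusing a slightly stale proposal table is corrected by the accept/reject step. The Python sketch below shows this mechanism in simplified form (the actual LightLDA sampler alternates word-proposal and doc-proposal cycles):

```python
import random

def build_alias_table(weights):
    """Walker's alias method: O(K) setup, then O(1) draws from a discrete distribution."""
    K = len(weights)
    total = sum(weights)
    prob = [w * K / total for w in weights]
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    alias = [0] * K
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    return prob, alias

def alias_draw(prob, alias):
    """O(1) sample: pick a bucket uniformly, then take it or its alias."""
    k = random.randrange(len(prob))
    return k if random.random() < prob[k] else alias[k]

def mh_step(current_topic, true_weight, proposal_weight, prob, alias):
    """One Metropolis-Hastings step: an O(1) proposal from a (possibly stale)
    alias table, corrected by the acceptance ratio so the chain still targets
    the true LDA conditional. `true_weight` and `proposal_weight` return the
    unnormalized target and proposal probabilities of a topic."""
    proposed = alias_draw(prob, alias)
    accept = min(1.0, (true_weight(proposed) * proposal_weight(current_topic)) /
                      (true_weight(current_topic) * proposal_weight(proposed)))
    return proposed if random.random() < accept else current_topic
```

Because the proposal tables are rebuilt only periodically, their O(K) construction cost is amortized over many tokens, while the acceptance step keeps the chain exact.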

In the distributed implementation, we leverage the hybrid data structure, model scheduling, and automatic pipelining functions provided by the DMTK framework, so as to make LightLDA capable of handling extremely big data and big models even on a modest computer cluster. In particular, on a cluster of as few as 8 machines, we can train a topic model with 1 million topics and a 10-million-word vocabulary (for a total of 10 trillion parameters), on a document collection with over 100 billion tokens, a scale not yet reported in the literature even with thousands of machines.
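
The parameter count in this claim is easy to verify: the model is essentially a word-topic count matrix, so its size is vocabulary times topics. The dense-storage figure below is my own back-of-envelope arithmetic, and it shows why a hybrid sparse data structure is essential:

```python
vocab = 10_000_000        # 10-million-word vocabulary
topics = 1_000_000        # 1 million topics
params = vocab * topics   # word-topic count matrix
print(params)             # 10_000_000_000_000 = 10 trillion parameters

# Stored densely at 4 bytes per count, this would need ~40 TB of memory,
# far beyond an 8-machine cluster; sparse hybrid storage plus model
# scheduling is what makes the model fit.
print(params * 4 / 1e12, "TB")  # 40.0 TB
```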

Distributed (Multi-sense) Word Embedding
========================================

Word embedding has become a very popular tool for computing semantic representations of words, which can serve as high-quality word features for natural language processing tasks. We provide distributed implementations of two word embedding algorithms. One is the standard word2vec algorithm, and the other is a multi-sense word embedding algorithm that learns multiple embedding vectors for polysemous words. By leveraging the model scheduling and automatic pipelining functions provided by the DMTK framework, we are able to train 300-dimensional word embedding vectors for a 10-million-word vocabulary, on a document collection with over 100 billion tokens, on a cluster of just 8 machines.
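
The multi-sense idea can be sketched in a few lines: keep several vectors per word and, for each occurrence, select the sense whose vector best matches the averaged context before applying the usual skip-gram update. The sketch below is a generic multi-prototype scheme with hypothetical names, not necessarily the exact algorithm shipped in DMTK:

```python
import numpy as np

DIM, SENSES = 300, 3
rng = np.random.default_rng(0)

sense_vecs = {}    # word -> (SENSES, DIM) input embeddings, one per sense
context_vecs = {}  # word -> (DIM,) shared output (context) embedding

def _init(word):
    if word not in sense_vecs:
        sense_vecs[word] = rng.normal(scale=0.1, size=(SENSES, DIM))
        context_vecs[word] = rng.normal(scale=0.1, size=DIM)

def pick_sense(word, context_words):
    """Pick the sense whose vector is most similar to the mean context vector."""
    _init(word)
    ctx_list = [context_vecs[w] for w in context_words if w in context_vecs]
    ctx = np.mean(ctx_list, axis=0) if ctx_list else np.zeros(DIM)
    return int(np.argmax(sense_vecs[word] @ ctx))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(word, ctx_word, neg_words, lr=0.025):
    """Skip-gram negative-sampling update applied only to the chosen sense vector."""
    for w in [word, ctx_word, *neg_words]:
        _init(w)
    s = pick_sense(word, [ctx_word])
    v = sense_vecs[word][s]                    # view; updated in place below
    for target, label in [(ctx_word, 1.0)] + [(n, 0.0) for n in neg_words]:
        u = context_vecs[target]
        g = lr * (label - sigmoid(v @ u))      # gradient of the logistic loss
        context_vecs[target] = u + g * v       # update output vector
        v += g * u                             # update the selected sense vector
```

Standard word2vec is the special case SENSES = 1; the only extra cost of the multi-sense variant is the argmax over senses at each occurrence.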
