Optimistic Concurrency Control for Distributed Unsupervised Learning

1. Abstract

  • Machine learning algorithms sit at two extremes: strict concurrency constraints, or no constraints at all
  • Proposes a middle ground: the algorithm assumes conflicts rarely occur, and when they do, a conflict-resolution protocol is invoked
  • OCC (optimistic concurrency control) is particularly well suited to large-scale machine learning, especially unsupervised learning
  • Experiments: clustering, feature learning, and online facility location

2. Introduction

2.1. Two strategies for distributing machine learning algorithms

  • The mutual-exclusion approach: guarantees that the parallel execution preserves the semantics of the sequential algorithm (serializability), at the cost of limited parallelism and locking overhead.
    Parallel Gibbs sampling: From colored fields to thin junction trees.
    Distributed GraphLab: A framework for machine learning and data mining in the cloud.
  • The coordination-free approach: no synchronization at all, which minimizes locking overhead but introduces nondeterminism, possible data corruption, and difficulty in theoretical analysis. A minimal sketch contrasting the two strategies follows this list.
    Hogwild: A lock-free approach to parallelizing stochastic gradient descent
    Scalable inference in latent variable models
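
To make the contrast concrete, here is a minimal Python sketch (my own illustration, not code from any of the cited papers) of the two strategies applied to SGD on a shared parameter vector. The gradient, data layout, and step size are toy assumptions, and CPython's GIL means real deployments would use processes or native code for actual parallelism:

```python
import threading
import numpy as np

def grad(w, x, y, step=0.01):
    # toy least-squares gradient for a single example (x, y)
    return step * (w @ x - y) * x

def sgd_mutual_exclusion(w, data, lock):
    # Mutual exclusion: every update holds the lock, so the parallel run
    # matches the sequential algorithm, at the cost of contention and
    # limited parallelism.
    for x, y in data:
        with lock:
            w -= grad(w, x, y)

def sgd_coordination_free(w, data):
    # Coordination-free (Hogwild-style): no lock at all; racy
    # read-modify-write interleavings on the shared w are tolerated in
    # exchange for throughput, giving up determinism.
    for x, y in data:
        w -= grad(w, x, y)

# usage: two workers racing on the same shared parameter vector
w = np.zeros(3)
data = [(np.random.randn(3), 0.0) for _ in range(1000)]
ts = [threading.Thread(target=sgd_coordination_free, args=(w, data[i::2]))
      for i in range(2)]
for t in ts: t.start()
for t in ts: t.join()
```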

2.2. OCC

  • Delivers performance on par with the coordination-free approach
  • while still theoretically guaranteeing the semantics of the sequential algorithm (serializability)
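
As a concrete illustration of the epoch structure, below is a minimal single-process sketch of OCC applied to DP-means-style clustering (the paper's running example). The threshold name `lam` and the two-phase loop are my paraphrase of the protocol, not the authors' code; in the actual system the first phase runs in parallel across workers, and only the short validation phase is serial:

```python
import numpy as np

def occ_epoch(points, centers, lam):
    """One OCC epoch: optimistic proposals, then serial validation."""
    # Optimistic phase (parallel in the real system): act as if no other
    # worker will create a conflicting center during this epoch.
    proposals = []
    for x in points:
        if centers and min(np.linalg.norm(x - c) for c in centers) <= lam:
            continue          # safely covered by an existing center
        proposals.append(x)   # optimistically propose x as a new center

    # Serial validation phase: conflicts are assumed to be rare, so this
    # loop is short; it rejects proposals made stale by earlier acceptances.
    for x in proposals:
        if centers and min(np.linalg.norm(x - c) for c in centers) <= lam:
            continue          # conflict detected: resolve by rejecting
        centers.append(x)     # no conflict: commit the new center
    return centers
```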

2.3. Contribution

  • A concurrency-control approach for distributing unsupervised learning algorithms
  • An analysis of concurrency control for unsupervised learning

3. Related work

  • Distributed inference for Latent Dirichlet Allocation. Transforms the model to increase parallelism while preserving the marginal posterior, but is hard to implement and can harm convergence
  • ClusterCluster: Parallel Markov chain Monte Carlo for Dirichlet process mixtures. Reparameterizes the underlying model and exploits conditional independence to expose additional parallelism
  • Large scale nonparametric Bayesian inference: Data parallelisation in the Indian Buffet process. The approach closest to OCC: an approximate parallel sampling algorithm for the IBP, obtained by introducing an extra Metropolis-Hastings step
  • Multicore Gibbs sampling in dense, unstructured graphs. A look-ahead locking strategy that optimistically predicts future samples from the samples drawn so far

3.1. Scalable clustering: distributed clustering algorithms

  • A data-clustering algorithm on distributed memory multiprocessors
  • Google news personalization: Scalable online collaborative filtering
  • Fast clustering using MapReduce

3.2. Streaming approximation algorithms (based on a hierarchical divide-and-conquer strategy)

  • Clustering data streams: Theory and practice.
  • Streaming k-means approximation.
  • Fast and accurate k-means for large datasets.
  • Better streaming algorithms for clustering problems.
  • These algorithms must trade off communication cost against approximation quality, as sketched below
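
The pattern behind these algorithms can be sketched in a few lines. This is an illustration of the generic two-level scheme, not a faithful reimplementation of any cited paper; scikit-learn's `KMeans` stands in for their approximation subroutines, and a faithful version would also weight each local center by the size of its cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

def divide_and_conquer_kmeans(chunks, k):
    # Level 1 (divide): summarize each incoming chunk by k local centers.
    # Only these k points per chunk are communicated upstream, which is the
    # source of the communication/approximation trade-off: larger summaries
    # cost more bandwidth but lose less information.
    local = [KMeans(n_clusters=k, n_init=10).fit(c).cluster_centers_
             for c in chunks]
    # Level 2 (conquer): cluster the pooled local centers into the final k.
    return KMeans(n_clusters=k, n_init=10).fit(np.vstack(local)).cluster_centers_
```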
