Big Data Resource Roundup

A tidied-up and refreshed list of the Big Data papers and blog posts I have read and taken notes on.

Streaming & Spark

In-Stream Big Data Processing

Discretized Streams: discretized processing of streaming data

Spark - A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

Sparrow - Distributed, Low Latency Scheduling

 

LinkedIn Ecosystem

The Log: What every software engineer should know about real-time data's unifying abstraction

Kafka: a Distributed Messaging System for Log Processing

LinkedIn Kafka Design

LinkedIn Databus

Apache Samza - Reliable Stream Processing atop Apache Kafka and Hadoop YARN

 

Google Ecosystem

GFS - The Google File System

Bigtable: A Distributed Storage System for Structured Data

Dremel - Interactive Analysis of Web-Scale Datasets

Chubby - lock service for loosely-coupled distributed systems

Megastore - Providing Scalable, Highly Available Storage for Interactive Services

 

NoSQL

Consistency

How to beat the CAP theorem

Total order: the essence of distributed consistency

An overview of NoSQL data consistency techniques

Paxos Made Simple

Why Vector Clocks Are Easy or Hard?

Anti-Entropy Protocols

Indexing

Big data indexing techniques - B+ tree vs LSM tree

SSTable structure and LSM-tree indexing explained in detail

Data Models

NoSQL Data Modeling Techniques

Columnar Storage

Systems

Dynamo: Amazon’s Highly Available Key-value Store

Cassandra - A Decentralized Structured Storage System

NoSQL Databases - MongoDB

NoSQL Databases - CouchDB

 

Hadoop Ecosystem

Apache Tez Design

YARN - Yet Another Resource Negotiator

 

Data Analysis and Mining

Probabilistic data structures for big data processing

Duplicate detection and clustering over massive document sets -- the Locality Sensitive Hashing algorithm

 

Concurrency

How the LMAX Disruptor works

Synchronous vs. asynchronous, blocking vs. non-blocking, Reactor vs. Proactor

Concurrent programming models and access control

Scalable IO in Java

Java Concurrency In Practice
