Data strategy
"Google's Big Data Strategy": http://chuansong.me/n/150869
"Google's Big Data Empire": http://bay-hzrb.hangzhou.com.cn/system/2016/03/14/012967690.shtml
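Data warehouse
"Notes on Big Data Warehouse and Mining Systems: What Is Big Data?": http://chuansong.me/n/193464
"Notes on Big Data Warehouse and Mining Systems: Layers, Dimensions, and Subjects": http://chuansong.me/n/194438
"Notes on Big Data Warehouse and Mining Systems: Data Transfer and Synchronization": http://chuansong.me/n/195373
"Notes on Big Data Warehouse and Mining Systems: MapReduce and Large-Scale Offline Computing Systems": http://chuansong.me/n/196860
"Notes on Big Data Warehouse and Mining Systems: MapReduce and Large-Scale Offline Computing Systems (2): Hive": http://blog.renren.com/share/601017224/16364958852
"Notes on Big Data Warehouse and Mining Systems: The BSP (Bulk Synchronous Parallel) Computing Model": http://chuansong.me/n/206702
"Governing Large Data Warehouses (1): Slow Response to Data Requests": http://chuansong.me/n/157570
"Governing Large Data Warehouses (2): Unreliable Data Quality": http://chuansong.me/n/158532
"Governing Large Data Warehouses (3): High Maintenance Costs": http://chuansong.me/n/159405
"Governing Large Data Warehouses (4): Uncontrolled Data Security": http://chuansong.me/n/161043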
Data quality
"Venting Together About the Pains of Working with Data": http://chuansong.me/n/179819
"My View of Data Quality": http://chuansong.me/n/160563
Data metrics
Conversion rate: http://chuansong.me/n/162002
Guided transactions: http://chuansong.me/n/163070
Bounce rate and exit rate: http://chuansong.me/n/163550
Time on page: http://chuansong.me/n/145524
Data dictionary
"How Did I Write an Enterprise Data Dictionary?": http://www.tuicool.com/articles/fQ7ZV3u
The three major data domains of big data in the telecom industry:
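Data processing models
1. Pipelines: UNIX pipes are the most familiar kind of pipeline. Pipelines help with the reuse of processing primitives: simply chaining existing modules together produces a new one.
2. Message queues: message queues help with the synchronization of processing primitives. The programmer writes each data processing task as a primitive in the form of a producer or a consumer, and the system manages when they execute.
3. MapReduce: in the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes tedious, but once an application is written in MapReduce form, scaling it to run on thousands or tens of thousands of machines in a cluster takes only a configuration change. Its greatest advantage is how easily data processing scales across multiple computing nodes, and it is exactly this simple scalability that has attracted so many programmers to the MapReduce model.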
You’re probably aware of data processing models such as pipelines and message queues. These models provide specific capabilities in developing different aspects of data processing applications. The most familiar pipelines are the Unix pipes. Pipelines can help the reuse of processing primitives; simple chaining of existing modules creates new ones. Message queues can help the synchronization of processing primitives. The programmer writes her data processing task as processing primitives in the form of either a producer or a consumer. The timing of their execution is managed by the system.
Similarly, MapReduce is also a data processing model. Its greatest advantage is the easy scaling of data processing over multiple computing nodes. Under the MapReduce model, the data processing primitives are called mappers and reducers. Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But, once you write an application in the MapReduce form, scaling the application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a configuration change. This simple scalability is what has attracted many programmers to the MapReduce model.
From "Hadoop in Action", section 1.5, Understanding MapReduce
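The mapper/reducer decomposition described above can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's API: the names mapper, shuffle, reducer and word_count are hypothetical, and the shuffle step stands in for the grouping a real framework performs between the map and reduce phases.

```python
from collections import defaultdict
from itertools import chain


def mapper(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = chain.from_iterable(mapper(line) for line in lines)
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())


# Hypothetical sample input.
lines = ["big data big cluster", "data pipeline"]
counts = word_count(lines)
print(counts)
```

Because mapper sees one line at a time and reducer sees one key at a time, both can in principle run on different machines without sharing state, which is what makes the scaling step a matter of configuration.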
4. DAG: directed acyclic graph
"The Concept of Directed Acyclic Graphs": http://c.biancheng.net/cpp/html/1018.html
"Applications of Directed Acyclic Graphs": http://c.biancheng.net/cpp/html/1019.html
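As a rough illustration of why a DAG suits data processing, the sketch below orders a small job graph so that every job runs after the jobs it depends on. The job names and the helpers execution_order and is_dag are made up for this example; graphlib is part of the Python standard library from version 3.9.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical job graph: each key depends on the jobs in its set.
dependencies = {
    "load": {"extract"},
    "clean": {"load"},
    "report": {"clean", "load"},
}


def execution_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


def is_dag(deps):
    """A dependency graph is a DAG only if it has no cycle."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)
print(is_dag({"a": {"b"}, "b": {"a"}}))
```

A graph with a cycle has no valid execution order, which is why the "acyclic" part of the definition matters for schedulers.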
5. BSP: Bulk Synchronous Parallel, a parallel computing model built on global synchronization
"From the BSP Model to Apache Hama": http://www.cnblogs.com/BYRans/p/4682282.html
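A BSP program proceeds in supersteps, each made of local computation, message exchange, and a global barrier. The sketch below simulates that rhythm sequentially in one process; it is not Apache Hama's API. The function bsp_ring_max and the ring topology are assumptions chosen for the example: each worker repeatedly passes the largest value it has seen to its right-hand neighbor.

```python
def bsp_ring_max(values):
    """Let every worker in a ring learn the global maximum via supersteps."""
    n = len(values)
    state = list(values)            # one local value per worker
    inbox = [[] for _ in range(n)]  # messages delivered at the last barrier

    for _ in range(n):
        # 1. Local computation: merge received messages into local state.
        state = [max([s] + msgs) for s, msgs in zip(state, inbox)]

        # 2. Communication: send the local maximum to the right neighbor.
        outbox = [[] for _ in range(n)]
        for i, s in enumerate(state):
            outbox[(i + 1) % n].append(s)

        # 3. Barrier: messages become visible only in the next superstep.
        inbox = outbox

    return state


result = bsp_ring_max([3, 9, 4, 1])
print(result)
```

With n workers, a value needs at most n - 1 hops to reach everyone, so n supersteps (the first one merges an empty inbox) are enough here. In a real BSP system the workers run in parallel and the barrier is an actual synchronization point.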
6. Streaming
"Streaming Mode Basics": http://www.open-open.com/lib/view/open1452169086386.html
7. Others
"A Roundup of Open-Source Systems Running on Hadoop YARN": http://dongxicheng.org/mapreduce-nextgen/run-systems-on-hadoop-yarn/
Data navigation
http://hao.199it.com/