hadoop01 - Big Data Primer and a First Look at HDFS

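Divide and conquer:

Requirements:

1. I have 10,000 elements (numbers or words, for example) to store.
2. To find one particular element, what is the complexity of the simplest approach, a linear scan?
3. If I want a lookup to cost about O(4), that is, at most 4 comparisons, what should I do?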

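1. Store the data in a linked list.

[Figure: storing the data in a linked list]

2. Finding X by traversing the list has time complexity O(n).
3. Apply divide and conquer: spread the data over several linked lists. As a simple example, use 2,500 small lists of about 4 elements each, ignoring data skew and other complications. A lookup then only has to locate the right list (for example by hashing X) and scan its few elements.
4. Divide and conquer shows up in many systems, such as Redis Cluster, Elasticsearch, HBase, and the Hadoop ecosystem.

[Figure: data partitioning]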
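The partitioning idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain Python lists stand in for the linked lists, the bucket is chosen with the built-in `hash`, and the names `build_buckets` and `contains` are made up for this example.

```python
NUM_BUCKETS = 2500  # bucket count taken from the example in the text


def build_buckets(items, num_buckets=NUM_BUCKETS):
    """Distribute the items over num_buckets small lists, chosen by hash."""
    buckets = [[] for _ in range(num_buckets)]
    for item in items:
        buckets[hash(item) % num_buckets].append(item)
    return buckets


def contains(buckets, target):
    """Scan only the single bucket that the target can be in."""
    bucket = buckets[hash(target) % len(buckets)]
    return target in bucket


data = list(range(10_000))          # 10,000 elements, as in the requirement
buckets = build_buckets(data)
longest = max(len(b) for b in buckets)

print(longest)                      # length of the longest bucket
print(contains(buckets, 7321))      # present in the data
print(contains(buckets, 10_000))    # not present in the data
```

With evenly spread keys like these, every bucket holds 4 elements, so a lookup compares at most 4 values instead of up to 10,000. Real data is rarely this even, which is the data skew the text sets aside.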

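Handling big data on a single machine

Requirements:

1. There is a very large text file (1 TB) with a huge number of lines. Exactly two lines are identical, at unknown positions, and we need to find them.
2. A single machine with very little available memory: a few tens of MB.
3. Assume an IO throughput of 500 MB/s (SSD), so reading the 1 TB file once takes about 30 minutes.
Tip: memory access is roughly 100,000 times faster than disk IO. Disk seeks take milliseconds and memory access takes nanoseconds, so for this task we should scan the data in memory as much as possible.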

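Using a brute-force loop

A brute-force loop compares each line against the rest of the file, so with N lines it needs N full passes over the file: N * 30 min in total.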

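Using divide and conquer

1. Compute readLine().hashCode() % 2000 for each line and write the line to the small file with that index, splitting the 1 TB file into 2,000 small files.
2. hashCode() and % are both deterministic, so identical lines always land in the same small file.
3. Each small file can then be scanned in memory, where access is extremely fast. (The 2,000 files average about 500 MB each, so with only tens of MB of memory a larger bucket count, or a second split of each file, would be needed.)
With divide and conquer the total time drops to 2 full IO passes: 2 * 30 min.

[Figure: divide and conquer]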
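The two passes described above might look like the following Python sketch. It is not a tuned implementation: `find_duplicate_line` is a hypothetical name, `zlib.crc32` stands in for Java's `hashCode()`, and the demo uses a tiny generated file with 8 buckets instead of 1 TB and 2,000.

```python
import os
import tempfile
import zlib


def find_duplicate_line(path, num_buckets):
    """Return the line that occurs twice in the file at path, or None."""
    with tempfile.TemporaryDirectory() as tmp:
        bucket_paths = [os.path.join(tmp, f"bucket_{i}.txt")
                        for i in range(num_buckets)]
        # Note: keeping thousands of files open may exceed the OS limit
        # on open file descriptors; the demo below uses only 8 buckets.
        buckets = [open(p, "w", encoding="utf-8") for p in bucket_paths]
        try:
            # Pass 1: route every line to a bucket file by its hash.
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\n")
                    idx = zlib.crc32(line.encode("utf-8")) % num_buckets
                    buckets[idx].write(line + "\n")
        finally:
            for f in buckets:
                f.close()

        # Pass 2: identical lines share a bucket, so check each bucket alone.
        for p in bucket_paths:
            seen = set()
            with open(p, encoding="utf-8") as f:
                for line in f:
                    line = line.rstrip("\n")
                    if line in seen:
                        return line
                    seen.add(line)
    return None


# Demo on a small generated file with one planted duplicate.
with tempfile.TemporaryDirectory() as demo_dir:
    sample = os.path.join(demo_dir, "big.txt")
    lines = [f"row-{i}" for i in range(1000)]
    lines.insert(137, "row-900")
    with open(sample, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")
    duplicate = find_duplicate_line(sample, num_buckets=8)

print(duplicate)
```

Only one bucket's lines are held in memory at a time, which is why the bucket count has to be chosen so that a single bucket fits in the available memory.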

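What if we need to sort the 1 TB of data?
1. Range partitioning: for each readLine(), if (x > 0 && x <= 100) write it to small file 0, if (x > 100 && x <= 200) write it to small file 1, and so on. This gives a set of small files that are ordered relative to each other but unsorted inside. Then load each small file into memory and sort it.
2. Sorted runs: read 50 MB at a time and sort it, which gives small files that are sorted inside but unordered relative to each other. Then combine them with the merge step of merge sort to sort the numbers.

[Figure: merge sort illustration]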
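The second approach (sorted runs plus a merge) can be sketched as below. This is a minimal illustration, assuming the input is a stream of integers; `external_sort`, `chunk_size`, and `tmp_dir` are names invented for the example, and `chunk_size` counts numbers rather than the 50 MB used in the text.

```python
import heapq
import os
import random
import tempfile


def external_sort(numbers, chunk_size, tmp_dir):
    """Yield the numbers in ascending order using sorted runs on disk."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it out as one sorted run.
        if not chunk:
            return
        chunk.sort()
        path = os.path.join(tmp_dir, f"run_{len(run_paths)}.txt")
        with open(path, "w") as f:
            f.writelines(f"{n}\n" for n in chunk)
        run_paths.append(path)
        chunk.clear()

    # Phase 1: cut the input into runs that each fit in memory.
    for n in numbers:
        chunk.append(n)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    # Phase 2: k-way merge that reads each run lazily, line by line.
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()


data = [random.randint(0, 1_000_000) for _ in range(1000)]
with tempfile.TemporaryDirectory() as tmp_dir:
    result = list(external_sort(data, chunk_size=100, tmp_dir=tmp_dir))

print(result[:5])
```

`heapq.merge` keeps only the current head of each run in memory, so memory use depends on the number of runs, not on the total data size.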

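Question: how do we get the time down to minutes or seconds?
Use a distributed cluster. Suppose we use 2,000 machines, each handling 500 MB of data. At the disk throughput above, splitting 500 MB into small files takes about 1 second. The nodes then have to copy data to each other so that small files with the same index end up on the same node for comparison. This step goes through network IO, but it takes minutes at most. So with 2,000 machines the computation takes roughly 1 minute.

[Figure: distributed processing]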
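The figures quoted so far follow from simple arithmetic, reproduced here as a back-of-the-envelope Python sketch. It assumes decimal units (1 TB = 1,000,000 MB) and covers only disk reads; the network copy between nodes is not modeled because the text gives no bandwidth figure for it.

```python
FILE_SIZE_MB = 1_000_000   # 1 TB in decimal units (a simplification)
DISK_MB_PER_S = 500        # disk throughput assumed in the text
NODES = 2000               # cluster size assumed in the text

# Single machine: one full pass over the file.
single_pass_s = FILE_SIZE_MB / DISK_MB_PER_S
single_pass_min = single_pass_s / 60
two_pass_min = 2 * single_pass_min      # split pass + scan pass

# Cluster: each node reads only its own share.
per_node_mb = FILE_SIZE_MB / NODES
per_node_pass_s = per_node_mb / DISK_MB_PER_S

print(f"single pass: {single_pass_s:.0f} s (~{single_pass_min:.0f} min)")
print(f"two passes:  ~{two_pass_min:.0f} min")
print(f"per node:    {per_node_mb:.0f} MB in {per_node_pass_s:.0f} s")
```

One pass comes to 2,000 seconds, a little over the 30 minutes the text rounds to, while each node in the cluster reads its 500 MB share in about 1 second.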

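A critical look at distributed processing of big data on a cluster

Are 2,000 machines really faster than one?
What if we count the time needed to upload and distribute the file?
What if 1 TB of new data is produced every day?
What if the data has accumulated for a year and the computation runs on the last day?
Conclusion: the more the data grows, the clearer the advantage of multiple machines.
For example, Alipay's annual statements are sent out on time on New Year's Day, which shows how fast the computation is.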

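Conclusion:

Divide and conquer, parallel computation, moving computation to the data, and reading data locally.
These are the key points to keep in mind when learning big data technologies.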

A first look at HDFS

A brief history of Hadoop

1. "The Google File System", 2003
2. "MapReduce: Simplified Data Processing on Large Clusters", 2004
3. "Bigtable: A Distributed Storage System for Structured Data", 2006
4. Hadoop was formally introduced by the Apache Software Foundation in the fall of 2005 as part of Nutch, a subproject of Lucene.
5. In March 2006, Map/Reduce and the Nutch Distributed File System (NDFS) were moved into a project called Hadoop.
6. Cloudera started offering Hadoop-based software and services in 2008.
7. October 2016: hadoop-2.6.5
8. December 2017: hadoop-3.0.0

Official site:

http://hadoop.apache.org/old/
http://hadoop.apache.org/
The name hadoop comes before apache.org in the URL, which indicates that Hadoop is an Apache top-level project.

Hadoop modules

The project includes these modules:
1. Hadoop Common: The common utilities that support the other Hadoop modules. -- the shared common module
2. Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. -- the distributed storage module
3. Hadoop YARN: A framework for job scheduling and cluster resource management. -- the resource management and job scheduling module
4. Hadoop MapReduce: A YARN-based system for parallel processing of large data sets. -- the distributed computation module

The Hadoop ecosystem

Other Hadoop-related projects at Apache include:

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A Scalable machine learning and data mining library.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.

The Cloudera big data ecosystem

Official site: www.cloudera.com
Cloudera’s Distribution Including Apache Hadoop
CDH is the most complete, tested, and popular distribution of Apache Hadoop and related projects.
hadoop-2.6.0+cdh5.16.1
hbase-1.2.0+cdh5.16.1
hive-1.1.0+cdh5.16.1
spark-1.6.0+cdh5.16.1

[Figure: CDH]