Hadoop Learning: The Basics

Hadoop: Big Data Platform and Architecture

Features and Advantages

Hadoop is a framework for big data storage and analysis: a platform for both distributed storage and distributed computing.

Two Core Components

  • HDFS: a distributed file system that stores massive amounts of data
  • MapReduce: a parallel processing framework that handles task decomposition and scheduling

Applications

Hadoop can be used to build data warehouses and to run statistical analysis over the data.

Ecosystem

  • Hive: accepts SQL-style statements and translates them into Hadoop jobs for execution
  • HBase: a distributed database for storing structured data
  • ZooKeeper: service registration and coordination

HDFS Concepts

  • Block: a file is split into blocks of equal size (64 MB by default in Hadoop 1.x); the block is the logical unit of file storage
  • NameNode: the management node; it holds the metadata, namely the file-to-block mapping and the block-to-DataNode mapping
  • DataNode: a worker node that stores the actual data blocks
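
As a small illustration of the NameNode's role, here is a minimal sketch using the HDFS Java API (org.apache.hadoop.fs.FileSystem). Asking where a file's blocks live is answered by the NameNode from exactly those two mappings; the class name and file path here are placeholders, not part of the original post:

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockInfo {
        public static void main(String[] args) throws Exception {
            // Connects to the NameNode configured in core-site.xml
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/root/demo.txt"); // placeholder path
            FileStatus status = fs.getFileStatus(file);
            // The NameNode answers this from its file->block and block->DataNode maps
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + Arrays.toString(b.getHosts()));
            }
            fs.close();
        }
    }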

HDFS Data Management and Fault Tolerance

Each data block is stored as three replicas, distributed across two racks, so losing a single disk, node, or even an entire rack does not lose data.
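
The cluster-wide default of three replicas comes from the dfs.replication property in hdfs-site.xml; the factor can also be changed per file. A minimal sketch using the Java API (class name and path are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Ask the NameNode to keep 3 replicas of this one file;
            // re-replication happens asynchronously in the background.
            boolean scheduled = fs.setReplication(new Path("/user/root/demo.txt"), (short) 3);
            System.out.println("re-replication scheduled: " + scheduled);
            fs.close();
        }
    }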

HDFS Characteristics

  • Data redundancy provides tolerance of hardware failures
  • Streaming data access: write once, read many times
  • Well suited to storing large files

Using HDFS

After configuring and starting the Hadoop services, run the jps command; if you see the following processes, startup succeeded (JobTracker and TaskTracker indicate a Hadoop 1.x deployment):

[root@liyan ~]# jps
14258 Jps
5108 TaskTracker
4887 SecondaryNameNode
4634 NameNode
4970 JobTracker
4765 DataNode

Run hadoop dfsadmin -report to see usage statistics for the whole cluster:

[root@liyan ~]# hadoop dfsadmin -report
Configured Capacity: 42139451392 (39.25 GB)
Present Capacity: 34523619328 (32.15 GB)
DFS Remaining: 34523467776 (32.15 GB)
DFS Used: 151552 (148 KB)
DFS Used%: 0%
Under replicated blocks: 8
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Name: 127.0.0.1:50010
Decommission Status : Normal
Configured Capacity: 42139451392 (39.25 GB)
DFS Used: 151552 (148 KB)
Non DFS Used: 7615832064 (7.09 GB)
DFS Remaining: 34523467776(32.15 GB)
DFS Used%: 0%
DFS Remaining%: 81.93%
Last contact: Sun Jan 19 18:55:45 CST 2020

Three basic HDFS commands

  • Upload a file from the local Linux filesystem to HDFS:

    hadoop fs -put <file> <dir>/

  • View a file on HDFS:

    hadoop fs -cat <dir>/<file>

  • Download a file from HDFS to the local filesystem:

    hadoop fs -get <dir>/<file> <new-file>

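These shell commands map one-to-one onto methods of the org.apache.hadoop.fs.FileSystem Java API. A minimal sketch of the same three operations (class name and all paths are placeholders):

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsBasicOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Equivalent of: hadoop fs -put local.txt dir/
            fs.copyFromLocalFile(new Path("local.txt"), new Path("dir/local.txt"));

            // Equivalent of: hadoop fs -cat dir/local.txt
            InputStream in = fs.open(new Path("dir/local.txt"));
            IOUtils.copyBytes(in, System.out, 4096, true); // true closes the stream

            // Equivalent of: hadoop fs -get dir/local.txt copy.txt
            fs.copyToLocalFile(new Path("dir/local.txt"), new Path("copy.txt"));

            fs.close();
        }
    }
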
MapReduce Principles

Divide and conquer: a large task is split into many small tasks that run in parallel (map), and their results are then merged (reduce).
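
The idea in miniature, without any Hadoop machinery (a plain-Java sketch with made-up input): split the input into chunks, count each chunk independently (map), then merge the partial counts by key (reduce):

    import java.util.*;
    import java.util.stream.*;

    public class MiniMapReduce {
        // "map": count words in one chunk, independently of all other chunks
        static Map<String, Long> countChunk(String chunk) {
            return Arrays.stream(chunk.trim().split("\\s+"))
                    .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
        }

        public static void main(String[] args) {
            List<String> chunks = Arrays.asList("hadoop love hadoop", "love liyan liyan");
            // map phase: each chunk could run on a different node in parallel;
            // reduce phase: merge the partial counts by key
            Map<String, Long> total = chunks.parallelStream()
                    .map(MiniMapReduce::countChunk)
                    .flatMap(m -> m.entrySet().stream())
                    .collect(Collectors.groupingBy(Map.Entry::getKey,
                            Collectors.summingLong(Map.Entry::getValue)));
            System.out.println(total); // hadoop=2, liyan=2, love=2 (order may vary)
        }
    }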

Concepts

  • Job & Task: a Job is split into multiple Tasks
  • JobTracker: schedules jobs, assigns tasks to TaskTrackers, monitors task progress, and monitors TaskTracker health

Architecture

A single JobTracker (the master) coordinates many TaskTrackers (the workers), one per node.

MapReduce Fault Tolerance

  • Re-execution: a failed task is simply run again
  • Speculative execution: if a TaskTracker is running a task unusually slowly, the framework suspects a problem and launches a duplicate of the task on another TaskTracker; whichever copy finishes first wins (see the sketch below)
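
Speculative execution can be toggled per job. A minimal sketch using the Hadoop 1.x JobConf API, which matches the JobTracker/TaskTracker deployment shown above (the class name is a placeholder):

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            // Launch backup copies of straggling map/reduce tasks;
            // the first attempt to finish wins and the others are killed.
            conf.setMapSpeculativeExecution(true);
            conf.setReduceSpeculativeExecution(true);
        }
    }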

Hadoop in Practice

Let's try the WordCount example recommended in the official documentation: given text files as input, count how many times each word appears.
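
For reference, here is the classic WordCount implementation, essentially as it appears in the Hadoop documentation (the class actually packaged into wordcount.jar below may differ slightly):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map: emit (word, 1) for every word in the input line
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // reduce: sum the counts for each word
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count"); // Job.getInstance(conf, ...) in newer versions
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }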

1. First, upload the wordcount.jar file to the Linux server:

    scp wordcount.jar root@ip:/opt/examples

2. Create a local input directory holding the input text files.
3. Create an input_wordcount directory on HDFS to serve as the job's input directory:

    hadoop fs -mkdir input_wordcount/

4. Upload the input files to the HDFS input_wordcount directory, then use hadoop fs -ls to confirm the files are there:

    [root@liyan examples]# hadoop fs -put input/* input_wordcount
    [root@liyan examples]# hadoop fs -ls input_wordcount/
    Found 2 items
    -rw-r--r--   3 root supergroup        147 2020-01-19 19:55 /user/root/input_wordcount/file1.text
    -rw-r--r--   3 root supergroup        153 2020-01-19 19:55 /user/root/input_wordcount/file2.text

5. Run the jar file: hadoop jar <jar file> <main class> <input dir> <output dir>

    [root@liyan examples]# hadoop jar wordcount.jar WordCount input_wordcount output_wordcount
    20/01/19 20:35:13 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    20/01/19 20:35:13 INFO input.FileInputFormat: Total input paths to process : 2
    20/01/19 20:35:13 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    20/01/19 20:35:13 WARN snappy.LoadSnappy: Snappy native library not loaded
    20/01/19 20:35:13 INFO mapred.JobClient: Running job: job_202001192032_0002
    20/01/19 20:35:14 INFO mapred.JobClient:  map 0% reduce 0%
    20/01/19 20:35:22 INFO mapred.JobClient:  map 50% reduce 0%
    20/01/19 20:35:32 INFO mapred.JobClient:  map 100% reduce 0%
    20/01/19 20:35:33 INFO mapred.JobClient:  map 100% reduce 33%
    20/01/19 20:35:35 INFO mapred.JobClient:  map 100% reduce 100%
    20/01/19 20:35:36 INFO mapred.JobClient: Job complete: job_202001192032_0002
    20/01/19 20:35:36 INFO mapred.JobClient: Counters: 29
    20/01/19 20:35:36 INFO mapred.JobClient:   Map-Reduce Framework
    20/01/19 20:35:36 INFO mapred.JobClient:     Spilled Records=54
    20/01/19 20:35:36 INFO mapred.JobClient:     Map output materialized bytes=331
    20/01/19 20:35:36 INFO mapred.JobClient:     Reduce input records=27
    20/01/19 20:35:36 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=5852057600
    20/01/19 20:35:36 INFO mapred.JobClient:     Map input records=8
    20/01/19 20:35:36 INFO mapred.JobClient:     SPLIT_RAW_BYTES=246
    20/01/19 20:35:36 INFO mapred.JobClient:     Map output bytes=503
    20/01/19 20:35:36 INFO mapred.JobClient:     Reduce shuffle bytes=331
    20/01/19 20:35:36 INFO mapred.JobClient:     Physical memory (bytes) snapshot=411611136
    20/01/19 20:35:36 INFO mapred.JobClient:     Reduce input groups=19
    20/01/19 20:35:36 INFO mapred.JobClient:     Combine output records=27
    20/01/19 20:35:36 INFO mapred.JobClient:     Reduce output records=19
    20/01/19 20:35:36 INFO mapred.JobClient:     Map output records=51
    20/01/19 20:35:36 INFO mapred.JobClient:     Combine input records=51
    20/01/19 20:35:36 INFO mapred.JobClient:     CPU time spent (ms)=980
    20/01/19 20:35:36 INFO mapred.JobClient:     Total committed heap usage (bytes)=292954112
    20/01/19 20:35:36 INFO mapred.JobClient:   File Input Format Counters
    20/01/19 20:35:36 INFO mapred.JobClient:     Bytes Read=300
    20/01/19 20:35:36 INFO mapred.JobClient:   FileSystemCounters
    20/01/19 20:35:36 INFO mapred.JobClient:     HDFS_BYTES_READ=546
    20/01/19 20:35:36 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=156082
    20/01/19 20:35:36 INFO mapred.JobClient:     FILE_BYTES_READ=325
    20/01/19 20:35:36 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=147
    20/01/19 20:35:36 INFO mapred.JobClient:   Job Counters
    20/01/19 20:35:36 INFO mapred.JobClient:     Launched map tasks=3
    20/01/19 20:35:36 INFO mapred.JobClient:     Launched reduce tasks=1
    20/01/19 20:35:36 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11869
    20/01/19 20:35:36 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    20/01/19 20:35:36 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=13320
    20/01/19 20:35:36 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    20/01/19 20:35:36 INFO mapred.JobClient:     Data-local map tasks=3
    20/01/19 20:35:36 INFO mapred.JobClient:   File Output Format Counters
    20/01/19 20:35:36 INFO mapred.JobClient:     Bytes Written=147

6. Check the results:

    [root@liyan examples]# hadoop fs -ls output_wordcount/
    Found 3 items
    -rw-r--r--   3 root supergroup          0 2020-01-19 20:35 /user/root/output_wordcount/_SUCCESS
    drwxr-xr-x   - root supergroup          0 2020-01-19 20:35 /user/root/output_wordcount/_logs
    -rw-r--r--   3 root supergroup        147 2020-01-19 20:35 /user/root/output_wordcount/part-r-00000

The part-r-00000 file contains the word counts:

    [root@liyan examples]# hadoop fs -cat output_wordcount/part-r-00000
    five    2
    four    2
    hadoop  7
    help    3
    liyan   6
    love    7
    script  1
    six 1
    tacker  2
    tracker 4
    wend    5
    window  2
    word    3

Originally published on my personal blog.
