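Hadoop is an open-source framework written in Java for storing massive data sets on a cluster of servers and running distributed analysis applications over them. Its core components are HDFS and MapReduce. HDFS is a distributed file system, similar to mogilefs but different in important ways. It consists of a namenode, which holds file metadata, and datanodes, which hold the data itself. Unlike mogilefs, which keeps metadata in a database, HDFS keeps metadata in the namenode's memory and persists it to disk as a namespace image plus an edit log.

The secondary name node (SNN) supports that persistence. It does not actually act as a name node; its main task is to periodically merge the edit log into the namespace image file so that the edit log does not grow too large. It can run on its own physical host, and it needs as much memory as the name node to perform the merge. It also keeps a copy of the namespace image in case the name node fails. Because of how it works, however, the SNN always lags behind the primary name node, so some data loss is unavoidable when the primary fails. The main purpose of the SNN's image copy is therefore to keep that loss as small as possible.

MapReduce is a computing framework with two main phases: map and reduce. In the map phase, each mapper turns its input records into key/value pairs, and the framework then routes all pairs with the same key to the same reducer. In the reduce phase, the reducer folds (in effect, merges) the values for each key into a final result. This is the architecture of Hadoop v1. Hadoop v2 is different: it splits the old MapReduce framework into YARN and MapReduce, and MapReduce jobs run on top of YARN. So the core of Hadoop v1 is two clusters, HDFS + MapReduce, while the v2 architecture is HDFS + YARN + MapReduce.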
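The map, shuffle and reduce steps described above can be imitated on a single machine with an ordinary shell pipeline. The following is only a minimal sketch of the idea, not how Hadoop implements it: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `wordcount_local`) are made up for this illustration, and words are assumed to be separated by whitespace.

```shell
# map: emit one "word<TAB>1" record for every whitespace-separated token
map_phase() {
    tr -s '[:space:]' '\n' | awk 'NF { print $0 "\t1" }'
}

# shuffle: sort the records so that identical keys end up next to each other
shuffle_phase() {
    LC_ALL=C sort
}

# reduce: fold the counts of each run of identical keys into a single total
reduce_phase() {
    awk -F'\t' '
        NR > 1 && $1 != key { print key "\t" sum; sum = 0 }
        { key = $1; sum += $2 }
        END { if (NR > 0) print key "\t" sum }'
}

# the whole "job": map | shuffle | reduce
wordcount_local() {
    map_phase | shuffle_phase | reduce_phase
}
```

For example, `wordcount_local < /etc/passwd` prints one line per distinct token followed by its count. In a real cluster the same three steps are spread over many machines, and the shuffle moves data across the network.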
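Hadoop v1 and v2 architecture
Note: in the Hadoop v1 architecture every computing job runs on MapReduce, which therefore plays two roles: cluster resource manager and data processing engine. In Hadoop v2 the architecture becomes HDFS + YARN + a set of jobs. That set of jobs corresponds to MapReduce in v1, with one difference: in v2, MapReduce is responsible only for computation and no longer for cluster resource management, which is handled by YARN. In v2 all computing jobs run on top of YARN. HDFS plays the same role in v1 and v2: it stores files.
Hadoop v2 job resource scheduling process
Note: the RM (resource manager) receives a job request from the client. Based on the status that the NMs (node managers) running on each datanode report periodically, the RM decides which NM should run the job. Once an NM is chosen, the RM sends the job to it, and that NM starts an appmaster (AM) container, which acts as the controller for this job. The AM needs containers to run the job's tasks, so it requests them from the RM, and the RM allocates one or more containers on suitable NMs according to the AM's request. When the containers finish, their results go to the AM, which passes them to the RM, which in turn returns them to the client. In this process the RM receives the node status reported by each NM, schedules resources, and receives the results from each AM and passes them back to the clients. The NM manages the resources of its node and reports status to the RM. The AM manages resource requests for its job and returns the job's results to the RM.
Note: the figure above shows the Hadoop v2 ecosystem. HDFS and YARN are the core components of Hadoop; every kind of job that runs on top of them depends on Hadoop and must also support calling the MapReduce interfaces.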
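To make the "RM picks an NM from the reported status" step concrete, here is a toy sketch that chooses the node with the most free memory among those that can satisfy a request. It is an illustration only: the input format ("node free_mb" per line) and the function name `pick_node` are assumptions made for this example, and real YARN schedulers such as the CapacityScheduler take far more into account.

```shell
# pick_node REQUIRED_MB
# stdin: one "node_name free_memory_mb" line per node manager (hypothetical format)
# stdout: the node with the most free memory that can fit the request, or nothing
pick_node() {
    awk -v need="$1" '
        $2 + 0 >= need + 0 && $2 + 0 > best { best = $2 + 0; node = $1 }
        END { if (node != "") print node }'
}
```

For instance, `printf 'node02 2048\nnode03 4096\n' | pick_node 1536` selects `node03`. On a running cluster, `yarn node -list` shows the node managers that the RM currently knows about.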
Name   | Role        | IP |
node01 | nn, snn, rm |    |
node02 | dn, nm      |    |
node03 | dn, nm      |    |
node04 | dn, nm      |    |
yum install -y java-1.8.0-openjdk-devel
mkdir /bigdata
[root@node01 ~]# wget https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
--2020-09-27 22:50:16--  https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz
Resolving mirror.bit.edu.cn (mirror.bit.edu.cn)...,, 2001:da8:204:1205::22
Connecting to mirror.bit.edu.cn (mirror.bit.edu.cn)||:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 366447449 (349M) [application/octet-stream]
Saving to: ‘hadoop-2.9.2.tar.gz’

100%[============================================================================>] 366,447,449 1.44MB/s in 2m 19s

2020-09-27 22:52:35 (2.51 MB/s) - ‘hadoop-2.9.2.tar.gz’ saved [366447449/366447449]

[root@node01 ~]# ls
hadoop-2.9.2.tar.gz
[root@node01 ~]#
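The environment file used later points HADOOP_HOME at /bigdata/hadoop, so the downloaded tarball has to end up there somehow. One possible way, shown here only as a sketch, is to unpack it under /bigdata and add a version-independent symlink. The function name `install_hadoop` is hypothetical, and the sketch assumes the tarball unpacks into a top-level directory named hadoop-<version>.

```shell
# install_hadoop TARBALL DEST_DIR [VERSION]
# Unpacks the tarball into DEST_DIR and points DEST_DIR/hadoop at the unpacked tree.
# Assumption: the archive contains a top-level directory "hadoop-VERSION".
install_hadoop() {
    tarball=$1
    dest=$2
    version=${3:-2.9.2}
    mkdir -p "$dest" &&
    tar -xzf "$tarball" -C "$dest" &&
    (cd "$dest" && ln -sfn "hadoop-$version" hadoop)
}
```

With the paths used in this walkthrough the call would be `install_hadoop hadoop-2.9.2.tar.gz /bigdata`. The symlink keeps HADOOP_HOME stable if a different version is unpacked next to it later.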
[root@node01 ~]# cat /etc/profile.d/hadoop.sh
export HADOOP_HOME=/bigdata/hadoop
export PATH=$PATH:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin
export HADOOP_YARN_HOME=${HADOOP_HOME}
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
[root@node01 ~]#
[root@node01 ~]# useradd hadoop
[root@node01 ~]# echo "admin" |passwd --stdin hadoop
Changing password for user hadoop.
passwd: all authentication tokens updated successfully.
[root@node01 ~]#
[hadoop@node01 ~]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:6CNhqdagySJXc4iRBVSoLENddO7JLZMCsdjQzqSFnmw [email protected]
The key's randomart image is:
+---[RSA 2048]----+
| o*==o . |
| o=Bo o |
|=oX+ . |
|+E =.oo.+ |
|o.o B.oBS. |
|.o * =. o |
|=.+ o o |
|oo . . |
| |
+----[SHA256]-----+
[hadoop@node01 ~]$ ssh-copy-id node01
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'node01 (' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@node01's password:

Number of key(s) added: 1

Now try logging into the machine, with: "ssh 'node01'"
and check to make sure that only the key(s) you wanted were added.

[hadoop@node01 ~]$ scp -r ./.ssh node02:/home/hadoop/
The authenticity of host 'node02 (' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node02,' (ECDSA) to the list of known hosts.
hadoop@node02's password:
id_rsa                100% 1679   636.9KB/s   00:00
id_rsa.pub            100%  404   186.3KB/s   00:00
known_hosts           100%  362   153.4KB/s   00:00
authorized_keys       100%  404   203.9KB/s   00:00
[hadoop@node01 ~]$ scp -r ./.ssh node03:/home/hadoop/
The authenticity of host 'node03 (' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node03,' (ECDSA) to the list of known hosts.
hadoop@node03's password:
id_rsa                100% 1679   755.1KB/s   00:00
id_rsa.pub            100%  404   165.7KB/s   00:00
known_hosts           100%  543   350.9KB/s   00:00
authorized_keys       100%  404   330.0KB/s   00:00
[hadoop@node01 ~]$ scp -r ./.ssh node04:/home/hadoop/
The authenticity of host 'node04 (' can't be established.
ECDSA key fingerprint is SHA256:lE8/Vyni4z8hsXaa8OMMlDpu3yOIRh6dLcIr+oE57oE.
ECDSA key fingerprint is MD5:14:59:02:30:c0:16:b8:6c:1a:84:c3:0f:a7:ac:67:b3.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'node04,' (ECDSA) to the list of known hosts.
hadoop@node04's password:
id_rsa                100% 1679   707.0KB/s   00:00
id_rsa.pub            100%  404   172.8KB/s   00:00
known_hosts           100%  724   437.7KB/s   00:00
authorized_keys       100%  404   165.2KB/s   00:00
[hadoop@node01 ~]$
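After the keys have been copied, it is worth confirming that the hadoop user can reach every node without a password prompt, because the cluster start scripts rely on that. A possible check is sketched below; the function name `check_ssh` is made up for this example, and `BatchMode=yes` is used so that ssh fails instead of asking for a password.

```shell
# check_ssh NODE...
# Tries a key-only login on every node and reports which ones work.
# Returns non-zero if at least one node could not be reached without a password.
check_ssh() {
    failed=0
    for node in "$@"; do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" true; then
            echo "$node ok"
        else
            echo "$node FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Run as the hadoop user on node01, for example `check_ssh node01 node02 node03 node04`.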

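[root@node01 hadoop]# cat core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node01:8020</value>
        <final>true</final>
    </property>
</configuration>
[root@node01 hadoop]#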

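[root@node01 hadoop]# cat hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/hadoop/hdfs/nn</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node01:50090</value>
    </property>
    <property>
        <name>dfs.namenode.http-address</name>
        <value>node01:50070</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///data/hadoop/hdfs/dn</value>
    </property>
    <property>
        <name>fs.checkpoint.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
    <property>
        <name>fs.checkpoint.edits.dir</name>
        <value>file:///data/hadoop/hdfs/snn</value>
    </property>
</configuration>
[root@node01 hadoop]#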
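The fs.checkpoint.dir setting above is where the secondary name node keeps its merged namespace image. The merge it performs can be pictured with a toy model: a snapshot of the namespace plus a log of changes made since the snapshot yields a new snapshot. This is only a sketch of the idea. Real fsimage and edit log files are binary, and the text formats used here (one path per line, "ADD path" / "DEL path" log entries, no spaces in paths) as well as the name `checkpoint` are assumptions for this illustration.

```shell
# checkpoint FSIMAGE EDITS
# FSIMAGE: one path per line (toy namespace snapshot)
# EDITS:   "ADD <path>" or "DEL <path>" entries recorded after the snapshot
# stdout:  the merged snapshot, sorted
checkpoint() {
    awk '
        FILENAME == ARGV[1] { ns[$0] = 1; next }   # load the old snapshot
        $1 == "ADD"         { ns[$2] = 1 }         # replay additions
        $1 == "DEL"         { delete ns[$2] }      # replay deletions
        END { for (p in ns) print p }' "$1" "$2" | LC_ALL=C sort
}
```

After a merge like this the edit log can start again from empty, which is why periodic checkpoints keep it from growing without bound. Anything logged after the last merge exists only on the name node, which is the lag described earlier.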

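[root@node01 hadoop]# cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
[root@node01 hadoop]#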

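[root@node01 hadoop]# cat yarn-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>node01:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>node01:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>node01:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>node01:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>node01:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
    </property>
</configuration>
[root@node01 hadoop]#
[root@node01 hadoop]# cat slaves
node02
node03
node04
[root@node01 hadoop]#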
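A typo in a property name is easy to miss in these files, so a quick lookup of individual values can help before the daemons are started. The helper below is a rough sketch rather than a real XML parser: the name `get_prop` is made up, and it assumes at most one property per line with each `<name>` and `<value>` element on a single line.

```shell
# get_prop FILE PROPERTY_NAME
# Prints the <value> that follows the given <name> in a Hadoop *-site.xml file.
# Assumption: each <name>...</name> and <value>...</value> fits on one line.
get_prop() {
    awk -v want="$2" '
        /<name>/  { n = $0; gsub(/.*<name>|<\/name>.*/, "", n) }
        /<value>/ {
            if (n == want) {
                v = $0
                gsub(/.*<value>|<\/value>.*/, "", v)
                print v
                exit
            }
        }' "$1"
}
```

For example, `get_prop core-site.xml fs.defaultFS` prints the configured default file system URI. Once Hadoop's commands are on the PATH, `hdfs getconf -confKey fs.defaultFS` reports the value that Hadoop itself resolved.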
hdfs namenode -format
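Note: if running hdfs namenode -format produces the message highlighted in the red box, HDFS was formatted successfully.
Verification: look at the passwd file under the /test directory in HDFS and check whether its content is the same as /etc/passwd.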
[root@node02 ~]# tree /data
/data
└── hadoop
    └── hdfs
        ├── dn
        │   ├── current
        │   │   ├── BP-157891879-
        │   │   │   ├── current
        │   │   │   │   ├── finalized
        │   │   │   │   │   └── subdir0
        │   │   │   │   │       └── subdir0
        │   │   │   │   │           ├── blk_1073741825
        │   │   │   │   │           └── blk_1073741825_1001.meta
        │   │   │   │   ├── rbw
        │   │   │   │   └── VERSION
        │   │   │   ├── scanner.cursor
        │   │   │   └── tmp
        │   │   └── VERSION
        │   └── in_use.lock
        ├── nn
        └── snn

13 directories, 6 files
[root@node02 ~]# cat /data/hadoop/hdfs/dn/current/BP-157891879-
current/ scanner.cursor tmp/
[root@node02 ~]# cat /data/hadoop/hdfs/dn/current/BP-157891879-
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
polkitd:x:999:997:User for polkitd:/:/sbin/nologin
postfix:x:89:89::/var/spool/postfix:/sbin/nologin
sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
tcpdump:x:72:72::/:/sbin/nologin
chrony:x:998:996::/var/lib/chrony:/sbin/nologin
hadoop:x:1000:1000::/home/hadoop:/bin/bash
[root@node02 ~]#
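The same comparison can also be made from the client side without looking into a datanode's block directory. The sketch below assumes the file reached /test through an upload such as `hdfs dfs -put`; the exact upload command used originally is not shown in this walkthrough, so treat that as an assumption. The function name `upload_and_verify` is made up for this example.

```shell
# upload_and_verify LOCAL_FILE HDFS_DIR
# Uploads LOCAL_FILE into HDFS_DIR and checks that reading it back
# yields exactly the same bytes as the local copy.
upload_and_verify() {
    local_file=$1
    hdfs_dir=$2
    hdfs dfs -mkdir -p "$hdfs_dir" &&
    hdfs dfs -put -f "$local_file" "$hdfs_dir/" &&
    hdfs dfs -cat "$hdfs_dir/$(basename "$local_file")" | cmp -s - "$local_file"
}
```

For example, `upload_and_verify /etc/passwd /test && echo "HDFS copy matches"`. A small file such as this one occupies a single block, which is why the raw block file on node02 above reads like the original.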
[hadoop@node01 hadoop]$ yarn jar /bigdata/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[hadoop@node01 hadoop]$ yarn jar /bigdata/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount
Usage: wordcount <in> [<in>...] <out>
[hadoop@node01 hadoop]$ yarn jar /bigdata/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.2.jar wordcount /test/passwd /test/passwd-word-count
20/09/28 00:58:01 INFO client.RMProxy: Connecting to ResourceManager at node01/
20/09/28 00:58:01 INFO input.FileInputFormat: Total input files to process : 1
20/09/28 00:58:01 INFO mapreduce.JobSubmitter: number of splits:1
20/09/28 00:58:01 INFO Configuration.deprecation: yarn.resourcemanager.system-metrics-publisher.enabled is deprecated. Instead, use yarn.system-metrics-publisher.enabled
20/09/28 00:58:01 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1601224871685_0001
20/09/28 00:58:02 INFO impl.YarnClientImpl: Submitted application application_1601224871685_0001
20/09/28 00:58:02 INFO mapreduce.Job: The url to track the job: http://node01:8088/proxy/application_1601224871685_0001/
20/09/28 00:58:02 INFO mapreduce.Job: Running job: job_1601224871685_0001
20/09/28 00:58:08 INFO mapreduce.Job: Job job_1601224871685_0001 running in uber mode : false
20/09/28 00:58:08 INFO mapreduce.Job: map 0% reduce 0%
20/09/28 00:58:14 INFO mapreduce.Job: map 100% reduce 0%
20/09/28 00:58:20 INFO mapreduce.Job: map 100% reduce 100%
20/09/28 00:58:20 INFO mapreduce.Job: Job job_1601224871685_0001 completed successfully
20/09/28 00:58:20 INFO mapreduce.Job: Counters: 49
    File System Counters
        FILE: Number of bytes read=1144
        FILE: Number of bytes written=399079
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=1053
        HDFS: Number of bytes written=1018
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=2753
        Total time spent by all reduces in occupied slots (ms)=2779
        Total time spent by all map tasks (ms)=2753
        Total time spent by all reduce tasks (ms)=2779
        Total vcore-milliseconds taken by all map tasks=2753
        Total vcore-milliseconds taken by all reduce tasks=2779
        Total megabyte-milliseconds taken by all map tasks=2819072
        Total megabyte-milliseconds taken by all reduce tasks=2845696
    Map-Reduce Framework
        Map input records=22
        Map output records=30
        Map output bytes=1078
        Map output materialized bytes=1144
        Input split bytes=95
        Combine input records=30
        Combine output records=30
        Reduce input groups=30
        Reduce shuffle bytes=1144
        Reduce input records=30
        Reduce output records=30
        Spilled Records=60
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=87
        CPU time spent (ms)=620
        Physical memory (bytes) snapshot=444997632
        Virtual memory (bytes) snapshot=4242403328
        Total committed heap usage (bytes)=285212672
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=958
    File Output Format Counters
        Bytes Written=1018
[hadoop@node01 hadoop]$
[hadoop@node01 hadoop]$ hdfs dfs -ls -R /test
-rw-r--r--   3 hadoop supergroup        958 2020-09-28 00:32 /test/passwd
drwxr-xr-x   - hadoop supergroup          0 2020-09-28 00:58 /test/passwd-word-count
-rw-r--r--   3 hadoop supergroup          0 2020-09-28 00:58 /test/passwd-word-count/_SUCCESS
-rw-r--r--   3 hadoop supergroup       1018 2020-09-28 00:58 /test/passwd-word-count/part-r-00000
[hadoop@node01 hadoop]$ hdfs dfs -cat /test/passwd-word-count/part-r-00000
Management:/:/sbin/nologin 1
Network 1
SSH:/var/empty/sshd:/sbin/nologin 1
User:/var/ftp:/sbin/nologin 1
adm:x:3:4:adm:/var/adm:/sbin/nologin 1
bin:x:1:1:bin:/bin:/sbin/nologin 1
bus:/:/sbin/nologin 1
chrony:x:998:996::/var/lib/chrony:/sbin/nologin 1
daemon:x:2:2:daemon:/sbin:/sbin/nologin 1
dbus:x:81:81:System 1
for 1
ftp:x:14:50:FTP 1
games:x:12:100:games:/usr/games:/sbin/nologin 1
hadoop:x:1000:1000::/home/hadoop:/bin/bash 1
halt:x:7:0:halt:/sbin:/sbin/halt 1
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin 1
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin 1
message 1
nobody:x:99:99:Nobody:/:/sbin/nologin 1
ntp:x:38:38::/etc/ntp:/sbin/nologin 1
operator:x:11:0:operator:/root:/sbin/nologin 1
polkitd:/:/sbin/nologin 1
polkitd:x:999:997:User 1
postfix:x:89:89::/var/spool/postfix:/sbin/nologin 1
root:x:0:0:root:/root:/bin/bash 1
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown 1
sshd:x:74:74:Privilege-separated 1
sync:x:5:0:sync:/sbin:/bin/sync 1
systemd-network:x:192:192:systemd 1
tcpdump:x:72:72::/:/sbin/nologin 1
[hadoop@node01 hadoop]$
That completes the setup of the Hadoop v2 cluster.
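One simple cross-check of the word count result is to add up the counts in the output file: the total should equal the number of tokens the mappers emitted. The helper below is a small sketch; the name `total_count` is made up, and it assumes the count is the last whitespace-separated field on every line, as in the output shown above.

```shell
# total_count
# stdin: wordcount output lines of the form "<word> <count>"
# stdout: the sum of all counts
total_count() {
    awk '{ sum += $NF } END { print sum + 0 }'
}
```

On the cluster this would be run as `hdfs dfs -cat /test/passwd-word-count/part-r-00000 | total_count`. For the run shown above the result should agree with the "Map output records" counter in the job output.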