Hadoop Single-Node Deployment (Part 1): Hadoop

References:
  • Hadoop: Setting up a Single Node Cluster
  • Hadoop Cluster Setup

Component          Version requirement                         Selected version
-----------------  ------------------------------------------  ----------------
OS                 Linux only, CentOS 6.5+ or Ubuntu 16.04+    CentOS 7.5.1804
JDK                1.8+                                        1.8.0_202
Kylin              -                                           v3.1.1
Hadoop             2.7+, 3.1+                                  3.1.4
Hive               0.13 - 1.2.1+                               3.1.2
HBase              1.1+, 2.0                                   2.2.5
Spark (optional)   2.3.0+                                      2.3.1
Kafka (optional)   1.0.0+                                      1.1.1
MySQL              -                                           5.1.73

Kylin
https://mirrors.bfsu.edu.cn/apache/kylin/apache-kylin-3.1.1/apache-kylin-3.1.1-bin-hadoop3.tar.gz

Hadoop
https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz
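
A minimal fetch-and-unpack sketch, assuming the /opt install prefix used in the transcripts below:

# Download Hadoop 3.1.4 from the mirror above and unpack it into /opt.
wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz
tar -xzf hadoop-3.1.4.tar.gz -C /opt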

Hive
18 April 2020: release 2.3.7 works with Hadoop 2.x.y
26 August 2019: release 3.1.2 works with Hadoop 3.x.y
07 April 2017: release 1.2.2 works with Hadoop 1.x.y, 2.x.y
https://mirror.bit.edu.cn/apache/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
https://mirror.bit.edu.cn/apache/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz
https://mirror.bit.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

HBase: the current stable version is 2.2.5.
Hadoop 3.1.1+ works with HBase 2.2.x and HBase 2.3.x.
https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/hbase-2.2.5-bin.tar.gz
https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/hbase-2.2.5-client-bin.tar.gz

[root@localhost ~]# df -hT
Filesystem              Type      Size  Used Avail Use% Mounted on
/dev/mapper/centos-root xfs       291G  977M  291G   1% /
devtmpfs                devtmpfs  3.9G     0  3.9G   0% /dev
tmpfs                   tmpfs     3.9G     0  3.9G   0% /dev/shm
tmpfs                   tmpfs     3.9G  9.0M  3.9G   1% /run
tmpfs                   tmpfs     3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/sda1               xfs      1014M  142M  873M  14% /boot
tmpfs                   tmpfs     783M     0  783M   0% /run/user/0
[root@localhost ~]# 
[root@localhost ~]# 
[root@localhost ~]# lsblk
NAME            MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda               8:0    0   300G  0 disk 
├─sda1            8:1    0     1G  0 part /boot
└─sda2            8:2    0   299G  0 part 
  ├─centos-root 253:0    0 291.1G  0 lvm  /
  └─centos-swap 253:1    0   7.9G  0 lvm  [SWAP]
sdb               8:16   0    16G  0 disk 
sdc               8:32   0    16G  0 disk 
sdd               8:48   0    16G  0 disk 
sde               8:64   0    16G  0 disk 
sdf               8:80   0    16G  0 disk 
sdg               8:96   0    16G  0 disk 
sdh               8:112  0    16G  0 disk 
sdi               8:128  0    16G  0 disk 
sr0              11:0    1  1024M  0 rom  
[root@hadoop ~]# useradd hadoop
[root@hadoop ~]# su - hadoop
[hadoop@hadoop hadoop-3.1.4]$ pwd
/opt/hadoop-3.1.4

Preparation before starting the Hadoop cluster

  • Set JAVA_HOME in etc/hadoop/hadoop-env.sh
[hadoop@hadoop hadoop-3.1.4]$ vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_202
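
A quick sanity check that Hadoop picks up the JDK (banner abridged):

[hadoop@hadoop hadoop-3.1.4]$ bin/hadoop version
Hadoop 3.1.4
...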

Standalone mode

[hadoop@hadoop hadoop-3.1.4]$ mkdir input
[hadoop@hadoop hadoop-3.1.4]$ cp etc/hadoop/*.xml input
[hadoop@hadoop hadoop-3.1.4]$ ls
bin  etc  include  input  lib  libexec  LICENSE.txt  logs  NOTICE.txt   README.txt  sbin  share

[hadoop@hadoop hadoop-3.1.4]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar grep input output 'dfs[a-z.]+'
2020-10-21 04:28:22,207 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2020-10-21 04:28:26,405 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-10-21 04:28:26,405 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-10-21 04:28:26,787 INFO input.FileInputFormat: Total input files to process : 9
2020-10-21 04:28:26,816 INFO mapreduce.JobSubmitter: number of splits:9
2020-10-21 04:28:27,015 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1615788132_0001
2020-10-21 04:28:27,017 INFO mapreduce.JobSubmitter: Executing with tokens: []

... omitted ...

2020-10-21 04:28:30,464 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=1340028
        FILE: Number of bytes written=3329078
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Map output bytes=41
        Map output materialized bytes=51
        Input split bytes=120
        Combine input records=0
        Combine output records=0
        Reduce input groups=1
        Reduce shuffle bytes=51
        Reduce input records=2
        Reduce output records=2
        Spilled Records=4
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=489684992
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters 
        Bytes Read=155
    File Output Format Counters 
        Bytes Written=41
[hadoop@hadoop hadoop-3.1.4]$ 
[hadoop@hadoop hadoop-3.1.4]$ ls
bin  etc  include  input  lib  libexec  LICENSE.txt  logs  NOTICE.txt  output  README.txt  sbin  share
[hadoop@hadoop hadoop-3.1.4]$ cat output/*
1   dfsadmin
1   dfs.replication

That completes a successful run of the standalone-mode demo.

Pseudo-distributed mode

Hadoop can also run in pseudo-distributed mode on a single node, with each Hadoop daemon running as a separate Java process.

Normally it is recommended to run HDFS and YARN under dedicated users, e.g. user hdfs for HDFS, yarn for YARN, and mapred for MapReduce.
Here, the hdfs, yarn, and mapreduce daemons all run as the single hadoop user.
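
If you do want separate daemon users, the Hadoop 3 start scripts read per-daemon user variables from etc/hadoop/hadoop-env.sh; a sketch (user names are illustrative, and this guide keeps everything under hadoop instead):

# etc/hadoop/hadoop-env.sh -- per-daemon users checked by start-dfs.sh/start-yarn.sh
export HDFS_NAMENODE_USER=hdfs
export HDFS_DATANODE_USER=hdfs
export HDFS_SECONDARYNAMENODE_USER=hdfs
export YARN_RESOURCEMANAGER_USER=yarn
export YARN_NODEMANAGER_USER=yarn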

Hadoop Startup

Configure passwordless SSH

[hadoop@hadoop ~]$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
[hadoop@hadoop ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@hadoop ~]$ chmod 0600 ~/.ssh/authorized_keys

Edit the configuration files

etc/hadoop/core-site.xml (here 10.0.31.65 is this server's IP, i.e. the address the HDFS service is exposed on):

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.0.31.65:9000</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml (replication factor 1, since there is only a single DataNode):

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Set up and run HDFS
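
Before formatting, a quick sanity check that the new settings are picked up (hdfs getconf prints what the client-side configuration resolves a key to):

[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs getconf -confKey fs.defaultFS
hdfs://10.0.31.65:9000
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs getconf -confKey dfs.replication
1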

  1. Format the filesystem:
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs namenode -format
WARNING: /opt/hadoop-3.1.4/logs does not exist. Creating.
2020-10-21 01:47:10,993 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop/10.0.31.65
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 3.1.4
STARTUP_MSG:   classpath = /opt/hadoop-3.1.4/e...
...
...
************************************************************/
2020-10-21 01:47:11,005 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2020-10-21 01:47:11,136 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-f6f83b0d-b2c3-456d-b5b6-af3ff8957fbf
2020-10-21 01:47:11,856 INFO namenode.FSEditLog: Edit logging is async:true
2020-10-21 01:47:11,892 INFO namenode.FSNamesystem: KeyProvider: null
2020-10-21 01:47:11,893 INFO namenode.FSNamesystem: fsLock is fair: true
2020-10-21 01:47:11,894 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: supergroup          = supergroup
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: isPermissionEnabled = true
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: HA Enabled: false
2020-10-21 01:47:11,962 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2020-10-21 01:47:11,976 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
2020-10-21 01:47:11,977 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2020-10-21 01:47:11,987 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2020-10-21 01:47:11,987 INFO blockmanagement.BlockManager: The block deletion will start around 2020 Oct 21 01:47:11
2020-10-21 01:47:11,989 INFO util.GSet: Computing capacity for map BlocksMap
2020-10-21 01:47:11,989 INFO util.GSet: VM type       = 64-bit
2020-10-21 01:47:11,991 INFO util.GSet: 2.0% max memory 1.7 GB = 34.8 MB
2020-10-21 01:47:11,991 INFO util.GSet: capacity      = 2^22 = 4194304 entries
2020-10-21 01:47:12,006 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2020-10-21 01:47:12,019 INFO Configuration.deprecation: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
2020-10-21 01:47:12,019 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2020-10-21 01:47:12,019 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2020-10-21 01:47:12,019 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: defaultReplication         = 1
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: maxReplication             = 512
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: minReplication             = 1
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: maxReplicationStreams      = 2
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: redundancyRecheckInterval  = 3000ms
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: encryptDataTransfer        = false
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: maxNumBlocksToLog          = 1000
2020-10-21 01:47:12,100 INFO namenode.FSDirectory: GLOBAL serial map: bits=24 maxEntries=16777215
2020-10-21 01:47:12,126 INFO util.GSet: Computing capacity for map INodeMap
2020-10-21 01:47:12,126 INFO util.GSet: VM type       = 64-bit
2020-10-21 01:47:12,126 INFO util.GSet: 1.0% max memory 1.7 GB = 17.4 MB
2020-10-21 01:47:12,126 INFO util.GSet: capacity      = 2^21 = 2097152 entries
2020-10-21 01:47:12,186 INFO namenode.FSDirectory: ACLs enabled? false
2020-10-21 01:47:12,186 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2020-10-21 01:47:12,186 INFO namenode.FSDirectory: XAttrs enabled? true
2020-10-21 01:47:12,186 INFO namenode.NameNode: Caching file names occurring more than 10 times
2020-10-21 01:47:12,193 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2020-10-21 01:47:12,196 INFO snapshot.SnapshotManager: SkipList is disabled
2020-10-21 01:47:12,202 INFO util.GSet: Computing capacity for map cachedBlocks
2020-10-21 01:47:12,202 INFO util.GSet: VM type       = 64-bit
2020-10-21 01:47:12,202 INFO util.GSet: 0.25% max memory 1.7 GB = 4.3 MB
2020-10-21 01:47:12,202 INFO util.GSet: capacity      = 2^19 = 524288 entries
2020-10-21 01:47:12,213 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2020-10-21 01:47:12,213 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2020-10-21 01:47:12,214 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2020-10-21 01:47:12,219 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2020-10-21 01:47:12,219 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2020-10-21 01:47:12,221 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2020-10-21 01:47:12,221 INFO util.GSet: VM type       = 64-bit
2020-10-21 01:47:12,222 INFO util.GSet: 0.029999999329447746% max memory 1.7 GB = 534.2 KB
2020-10-21 01:47:12,222 INFO util.GSet: capacity      = 2^16 = 65536 entries
2020-10-21 01:47:12,261 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1053537240-10.0.31.65-1603216032251
2020-10-21 01:47:12,275 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
2020-10-21 01:47:12,305 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2020-10-21 01:47:12,432 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 393 bytes saved in 0 seconds .
2020-10-21 01:47:12,449 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-10-21 01:47:12,455 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
2020-10-21 01:47:12,455 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/10.0.31.65
************************************************************/
  2. Start the NameNode and DataNode daemons:
[hadoop@hadoop hadoop-3.1.4]$ sbin/start-dfs.sh 
Starting namenodes on [hadoop]
Starting datanodes
Starting secondary namenodes [hadoop]
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /
[hadoop@hadoop hadoop-3.1.4]$ 
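
At this point jps (from the JDK) should show the three HDFS daemons; the PIDs below are illustrative:

[hadoop@hadoop hadoop-3.1.4]$ jps
11394 NameNode
11531 DataNode
11752 SecondaryNameNode
11980 Jps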
  3. Browse the web interface for the NameNode; by default it is available at http://10.0.31.65:9870/.

    (Screenshot: NameNode web UI)
  4. Create the HDFS directories required to run MapReduce jobs:
    bin/hdfs dfs -mkdir /user/<username>, where <username> is the current user.

[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -mkdir /user
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -mkdir /user/hadoop
  5. Copy the input files into HDFS:
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -mkdir input
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -put etc/hadoop/*.xml input
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2020-10-21 06:17 /user/hadoop/input
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop/input
Found 9 items
-rw-r--r--   1 hadoop supergroup       9213 2020-10-21 06:17 /user/hadoop/input/capacity-scheduler.xml
-rw-r--r--   1 hadoop supergroup        885 2020-10-21 06:17 /user/hadoop/input/core-site.xml
-rw-r--r--   1 hadoop supergroup      11392 2020-10-21 06:17 /user/hadoop/input/hadoop-policy.xml
-rw-r--r--   1 hadoop supergroup        867 2020-10-21 06:17 /user/hadoop/input/hdfs-site.xml
-rw-r--r--   1 hadoop supergroup        620 2020-10-21 06:17 /user/hadoop/input/httpfs-site.xml
-rw-r--r--   1 hadoop supergroup       3518 2020-10-21 06:17 /user/hadoop/input/kms-acls.xml
-rw-r--r--   1 hadoop supergroup        682 2020-10-21 06:17 /user/hadoop/input/kms-site.xml
-rw-r--r--   1 hadoop supergroup       1072 2020-10-21 06:17 /user/hadoop/input/mapred-site.xml
-rw-r--r--   1 hadoop supergroup       1052 2020-10-21 06:17 /user/hadoop/input/yarn-site.xml
  6. Run the provided example:
[hadoop@hadoop hadoop-3.1.4]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar grep input output 'dfs[a-z.]+'
  7. Examine the output files:
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2020-10-21 06:17 /user/hadoop/input
drwxr-xr-x   - hadoop supergroup          0 2020-10-21 06:21 /user/hadoop/output
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop/output
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2020-10-21 06:21 /user/hadoop/output/_SUCCESS
-rw-r--r--   1 hadoop supergroup         29 2020-10-21 06:21 /user/hadoop/output/part-r-00000


Copy the output files from HDFS to the local filesystem and examine them; first remove the local output directory left over from the standalone run:

[hadoop@hadoop hadoop-3.1.4]$ ls
bin  etc  include  input  lib  libexec  LICENSE.txt  logs  NOTICE.txt  output  README.txt  sbin  share
[hadoop@hadoop hadoop-3.1.4]$ rm -rf output
[hadoop@hadoop hadoop-3.1.4]$ ls
bin  etc  include  input  lib  libexec  LICENSE.txt  logs  NOTICE.txt  README.txt  sbin  share


[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -get output output
[hadoop@hadoop hadoop-3.1.4]$ ls
bin  etc  include  input  lib  libexec  LICENSE.txt  logs  NOTICE.txt  output  README.txt  sbin  share
[hadoop@hadoop hadoop-3.1.4]$ cat output/*
1   dfsadmin
1   dfs.replication

Or view the output files directly on HDFS:

[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -cat output/*
1   dfsadmin
1   dfs.replication
  8. When you're done, stop the daemons:
  $ sbin/stop-dfs.sh

YARN

You can run MapReduce jobs on YARN in pseudo-distributed mode by setting a few parameters and additionally starting the ResourceManager and NodeManager daemons.

etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

  1. Start ResourceManager daemon and NodeManager daemon:
[hadoop@hadoop hadoop-3.1.4]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
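
Once both daemons are up, a quick check that the NodeManager has registered with the ResourceManager (expect a single RUNNING node, this host, in the listing):

[hadoop@hadoop hadoop-3.1.4]$ bin/yarn node -list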

  2. Browse the web interface for the ResourceManager; by default it is available at http://10.0.31.65:8088/.

    (Screenshot: ResourceManager web UI)
  3. Run a MapReduce job (see the sketch after this list).
  4. When you're done, stop the daemons with:
  $ sbin/stop-yarn.sh
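
The same grep example from the HDFS section serves as the MapReduce job for step 3. The earlier run already wrote an output directory on HDFS, and MapReduce refuses to overwrite an existing one, so remove it first; with mapreduce.framework.name set to yarn, the job now runs on YARN rather than the local runner:

[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -rm -r output
[hadoop@hadoop hadoop-3.1.4]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar grep input output 'dfs[a-z.]+'
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -cat output/*
1   dfsadmin
1   dfs.replication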
