Single Node Cluster
Hadoop Cluster Setup
Component | Version Requirement | Selected |
---|---|---|
OS | Linux only, CentOS 6.5+ or Ubuntu 16.04+ | CentOS 7.5.1804 |
JDK | 1.8+ | 1.8.0_202 |
Kylin | | v3.1.1 |
Hadoop | 2.7+, 3.1+ | 3.1.4 |
Hive | 0.13 - 1.2.1+ | 3.1.2 |
HBase | 1.1+, 2.0 | 2.2.5 |
Spark | (optional) 2.3.0+ | 2.3.1 |
Kafka | (optional) 1.0.0+ | 1.1.1 |
MySQL | | 5.1.73 |
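Before downloading anything, it helps to confirm the host matches the table. A minimal check, assuming a CentOS host with the JDK already on the PATH:

$ cat /etc/redhat-release   # expect CentOS Linux release 7.5.1804
$ java -version             # expect a 1.8 build, e.g. 1.8.0_202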
Kylin
https://mirrors.bfsu.edu.cn/apache/kylin/apache-kylin-3.1.1/apache-kylin-3.1.1-bin-hadoop3.tar.gz
Hadoop
https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz
Hive
- 18 April 2020: release 2.3.7, works with Hadoop 2.x.y
- 26 August 2019: release 3.1.2, works with Hadoop 3.x.y
- 07 April 2017: release 1.2.2, works with Hadoop 1.x.y, 2.x.y
https://mirror.bit.edu.cn/apache/hive/hive-1.2.2/apache-hive-1.2.2-bin.tar.gz
https://mirror.bit.edu.cn/apache/hive/hive-2.3.7/apache-hive-2.3.7-bin.tar.gz
https://mirror.bit.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
HBase stable version
- Current stable release: 2.2.5
- Hadoop 3.1.1+ is compatible with HBase 2.2.x and HBase 2.3.x
https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/hbase-2.2.5-bin.tar.gz
https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/hbase-2.2.5-client-bin.tar.gz
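A sketch of fetching the selected tarballs into /opt, using the mirror URLs listed above (mirrors periodically drop old releases, so the paths may need adjusting):

$ cd /opt
$ wget https://mirrors.bfsu.edu.cn/apache/hadoop/common/hadoop-3.1.4/hadoop-3.1.4.tar.gz
$ wget https://mirror.bit.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
$ wget https://mirrors.tuna.tsinghua.edu.cn/apache/hbase/stable/hbase-2.2.5-bin.tar.gz
$ wget https://mirrors.bfsu.edu.cn/apache/kylin/apache-kylin-3.1.1/apache-kylin-3.1.1-bin-hadoop3.tar.gz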
[root@localhost ~]# df -hT
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/centos-root xfs 291G 977M 291G 1% /
devtmpfs devtmpfs 3.9G 0 3.9G 0% /dev
tmpfs tmpfs 3.9G 0 3.9G 0% /dev/shm
tmpfs tmpfs 3.9G 9.0M 3.9G 1% /run
tmpfs tmpfs 3.9G 0 3.9G 0% /sys/fs/cgroup
/dev/sda1 xfs 1014M 142M 873M 14% /boot
tmpfs tmpfs 783M 0 783M 0% /run/user/0
[root@localhost ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 300G 0 disk
├─sda1 8:1 0 1G 0 part /boot
└─sda2 8:2 0 299G 0 part
├─centos-root 253:0 0 291.1G 0 lvm /
└─centos-swap 253:1 0 7.9G 0 lvm [SWAP]
sdb 8:16 0 16G 0 disk
sdc 8:32 0 16G 0 disk
sdd 8:48 0 16G 0 disk
sde 8:64 0 16G 0 disk
sdf 8:80 0 16G 0 disk
sdg 8:96 0 16G 0 disk
sdh 8:112 0 16G 0 disk
sdi 8:128 0 16G 0 disk
sr0 11:0 1 1024M 0 rom
[root@hadoop ~]# useradd hadoop
[root@hadoop ~]# su - hadoop
[hadoop@hadoop hadoop-3.1.4]$ pwd
/opt/hadoop-3.1.4
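The unpacking step between creating the user and landing in /opt/hadoop-3.1.4 is implied; a minimal sketch, run as root and assuming the Hadoop tarball sits in /opt:

# cd /opt
# tar -xzf hadoop-3.1.4.tar.gz            # unpack to /opt/hadoop-3.1.4
# chown -R hadoop:hadoop hadoop-3.1.4     # hand the tree to the hadoop user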
Preparation for starting the Hadoop cluster
- Configure JAVA_HOME in hadoop-env.sh
[hadoop@hadoop hadoop-3.1.4]$ vim etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/local/jdk/jdk1.8.0_202
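A quick sanity check that the JDK path is picked up; bin/hadoop version should print the release without complaining about JAVA_HOME:

$ bin/hadoop version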
Standalone Mode
[hadoop@hadoop hadoop-3.1.4]$ mkdir input
[hadoop@hadoop hadoop-3.1.4]$ cp etc/hadoop/*.xml input
[hadoop@hadoop hadoop-3.1.4]$ ls
bin etc include input lib libexec LICENSE.txt logs NOTICE.txt README.txt sbin share
[hadoop@hadoop hadoop-3.1.4]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar grep input output 'dfs[a-z.]+'
2020-10-21 04:28:22,207 INFO impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2020-10-21 04:28:26,405 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-10-21 04:28:26,405 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-10-21 04:28:26,787 INFO input.FileInputFormat: Total input files to process : 9
2020-10-21 04:28:26,816 INFO mapreduce.JobSubmitter: number of splits:9
2020-10-21 04:28:27,015 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1615788132_0001
2020-10-21 04:28:27,017 INFO mapreduce.JobSubmitter: Executing with tokens: []
... omitted ...
2020-10-21 04:28:30,464 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=1340028
FILE: Number of bytes written=3329078
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=2
Map output records=2
Map output bytes=41
Map output materialized bytes=51
Input split bytes=120
Combine input records=0
Combine output records=0
Reduce input groups=1
Reduce shuffle bytes=51
Reduce input records=2
Reduce output records=2
Spilled Records=4
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
Total committed heap usage (bytes)=489684992
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=155
File Output Format Counters
Bytes Written=41
[hadoop@hadoop hadoop-3.1.4]$
[hadoop@hadoop hadoop-3.1.4]$ ls
bin etc include input lib libexec LICENSE.txt logs NOTICE.txt output README.txt sbin share
[hadoop@hadoop hadoop-3.1.4]$ cat output/*
1 dfsadmin
1 dfs.replication
This completes a successful run of the standalone demo.
Pseudo-Distributed Mode
Hadoop can also run on a single node in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process.
Typically it is recommended to run HDFS and YARN as dedicated users, e.g. HDFS as user hdfs, YARN as user yarn, and MapReduce as user mapred.
Here, the hdfs, yarn and mapreduce services all run as the hadoop user.
Hadoop Startup
Configure passwordless SSH
[hadoop@hadoop ~]$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
[hadoop@hadoop ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[hadoop@hadoop ~]$ chmod 0600 ~/.ssh/authorized_keys
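Before moving on, verify that passwordless login actually works; if it still prompts for a password, check the permissions of ~/.ssh (0700) and authorized_keys (0600):

$ ssh localhost
$ exit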
Edit the configuration files
etc/hadoop/core-site.xml (10.0.31.65 is this server's IP, the address the HDFS service is exposed on):

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.0.31.65:9000</value>
    </property>
</configuration>
etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Configure HDFS
- Format the filesystem
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs namenode -format
WARNING: /opt/hadoop-3.1.4/logs does not exist. Creating.
2020-10-21 01:47:10,993 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = hadoop/10.0.31.65
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 3.1.4
STARTUP_MSG: classpath = /opt/hadoop-3.1.4/e...
...
...
************************************************************/
2020-10-21 01:47:11,005 INFO namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2020-10-21 01:47:11,136 INFO namenode.NameNode: createNameNode [-format]
Formatting using clusterid: CID-f6f83b0d-b2c3-456d-b5b6-af3ff8957fbf
2020-10-21 01:47:11,856 INFO namenode.FSEditLog: Edit logging is async:true
2020-10-21 01:47:11,892 INFO namenode.FSNamesystem: KeyProvider: null
2020-10-21 01:47:11,893 INFO namenode.FSNamesystem: fsLock is fair: true
2020-10-21 01:47:11,894 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: fsOwner = hadoop (auth:SIMPLE)
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: supergroup = supergroup
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: isPermissionEnabled = true
2020-10-21 01:47:11,901 INFO namenode.FSNamesystem: HA Enabled: false
2020-10-21 01:47:11,962 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2020-10-21 01:47:11,976 INFO blockmanagement.DatanodeManager: dfs.block.invalidate.limit: configured=1000, counted=60, effected=1000
2020-10-21 01:47:11,977 INFO blockmanagement.DatanodeManager: dfs.namenode.datanode.registration.ip-hostname-check=true
2020-10-21 01:47:11,987 INFO blockmanagement.BlockManager: dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2020-10-21 01:47:11,987 INFO blockmanagement.BlockManager: The block deletion will start around 2020 Oct 21 01:47:11
2020-10-21 01:47:11,989 INFO util.GSet: Computing capacity for map BlocksMap
2020-10-21 01:47:11,989 INFO util.GSet: VM type = 64-bit
2020-10-21 01:47:11,991 INFO util.GSet: 2.0% max memory 1.7 GB = 34.8 MB
2020-10-21 01:47:11,991 INFO util.GSet: capacity = 2^22 = 4194304 entries
2020-10-21 01:47:12,006 INFO blockmanagement.BlockManager: dfs.block.access.token.enable = false
2020-10-21 01:47:12,019 INFO Configuration.deprecation: No unit for dfs.namenode.safemode.extension(30000) assuming MILLISECONDS
2020-10-21 01:47:12,019 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2020-10-21 01:47:12,019 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.min.datanodes = 0
2020-10-21 01:47:12,019 INFO blockmanagement.BlockManagerSafeMode: dfs.namenode.safemode.extension = 30000
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: defaultReplication = 1
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: maxReplication = 512
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: minReplication = 1
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: maxReplicationStreams = 2
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: redundancyRecheckInterval = 3000ms
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: encryptDataTransfer = false
2020-10-21 01:47:12,020 INFO blockmanagement.BlockManager: maxNumBlocksToLog = 1000
2020-10-21 01:47:12,100 INFO namenode.FSDirectory: GLOBAL serial map: bits=24 maxEntries=16777215
2020-10-21 01:47:12,126 INFO util.GSet: Computing capacity for map INodeMap
2020-10-21 01:47:12,126 INFO util.GSet: VM type = 64-bit
2020-10-21 01:47:12,126 INFO util.GSet: 1.0% max memory 1.7 GB = 17.4 MB
2020-10-21 01:47:12,126 INFO util.GSet: capacity = 2^21 = 2097152 entries
2020-10-21 01:47:12,186 INFO namenode.FSDirectory: ACLs enabled? false
2020-10-21 01:47:12,186 INFO namenode.FSDirectory: POSIX ACL inheritance enabled? true
2020-10-21 01:47:12,186 INFO namenode.FSDirectory: XAttrs enabled? true
2020-10-21 01:47:12,186 INFO namenode.NameNode: Caching file names occurring more than 10 times
2020-10-21 01:47:12,193 INFO snapshot.SnapshotManager: Loaded config captureOpenFiles: false, skipCaptureAccessTimeOnlyChange: false, snapshotDiffAllowSnapRootDescendant: true, maxSnapshotLimit: 65536
2020-10-21 01:47:12,196 INFO snapshot.SnapshotManager: SkipList is disabled
2020-10-21 01:47:12,202 INFO util.GSet: Computing capacity for map cachedBlocks
2020-10-21 01:47:12,202 INFO util.GSet: VM type = 64-bit
2020-10-21 01:47:12,202 INFO util.GSet: 0.25% max memory 1.7 GB = 4.3 MB
2020-10-21 01:47:12,202 INFO util.GSet: capacity = 2^19 = 524288 entries
2020-10-21 01:47:12,213 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.window.num.buckets = 10
2020-10-21 01:47:12,213 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.num.users = 10
2020-10-21 01:47:12,214 INFO metrics.TopMetrics: NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2020-10-21 01:47:12,219 INFO namenode.FSNamesystem: Retry cache on namenode is enabled
2020-10-21 01:47:12,219 INFO namenode.FSNamesystem: Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2020-10-21 01:47:12,221 INFO util.GSet: Computing capacity for map NameNodeRetryCache
2020-10-21 01:47:12,221 INFO util.GSet: VM type = 64-bit
2020-10-21 01:47:12,222 INFO util.GSet: 0.029999999329447746% max memory 1.7 GB = 534.2 KB
2020-10-21 01:47:12,222 INFO util.GSet: capacity = 2^16 = 65536 entries
2020-10-21 01:47:12,261 INFO namenode.FSImage: Allocated new BlockPoolId: BP-1053537240-10.0.31.65-1603216032251
2020-10-21 01:47:12,275 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
2020-10-21 01:47:12,305 INFO namenode.FSImageFormatProtobuf: Saving image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 using no compression
2020-10-21 01:47:12,432 INFO namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-hadoop/dfs/name/current/fsimage.ckpt_0000000000000000000 of size 393 bytes saved in 0 seconds .
2020-10-21 01:47:12,449 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
2020-10-21 01:47:12,455 INFO namenode.FSImage: FSImageSaver clean checkpoint: txid = 0 when meet shutdown.
2020-10-21 01:47:12,455 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop/10.0.31.65
************************************************************/
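As the format log shows, the NameNode metadata was written to /tmp/hadoop-hadoop/dfs/name. Since /tmp is usually cleared on reboot, for anything beyond a throwaway experiment you may want to relocate the Hadoop data root via hadoop.tmp.dir in core-site.xml; a sketch, with /data/hadoop/tmp as a hypothetical target directory:

<property>
    <name>hadoop.tmp.dir</name>
    <!-- base for HDFS name/data dirs; /data/hadoop/tmp is just an example path -->
    <value>/data/hadoop/tmp</value>
</property>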
- Start the NameNode and DataNode daemons
[hadoop@hadoop hadoop-3.1.4]$ sbin/start-dfs.sh
Starting namenodes on [hadoop]
Starting datanodes
Starting secondary namenodes [hadoop]
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /
[hadoop@hadoop hadoop-3.1.4]$
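A quick way to confirm the daemons are actually up is jps, which ships with the JDK; the listing should include NameNode, DataNode and SecondaryNameNode (PIDs will differ):

$ jps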
- Browse the web interface for the NameNode; in Hadoop 3.x it is available by default at http://10.0.31.65:9870/
Create the working directories
Make the HDFS directories required to execute MapReduce jobs:
bin/hdfs dfs -mkdir /user/<username>
where <username> is the current user (here, hadoop).
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -mkdir /user
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -mkdir /user/hadoop
- Copy the input files into HDFS
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -mkdir input
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -put etc/hadoop/*.xml input
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2020-10-21 06:17 /user/hadoop/input
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop/input
Found 9 items
-rw-r--r-- 1 hadoop supergroup 9213 2020-10-21 06:17 /user/hadoop/input/capacity-scheduler.xml
-rw-r--r-- 1 hadoop supergroup 885 2020-10-21 06:17 /user/hadoop/input/core-site.xml
-rw-r--r-- 1 hadoop supergroup 11392 2020-10-21 06:17 /user/hadoop/input/hadoop-policy.xml
-rw-r--r-- 1 hadoop supergroup 867 2020-10-21 06:17 /user/hadoop/input/hdfs-site.xml
-rw-r--r-- 1 hadoop supergroup 620 2020-10-21 06:17 /user/hadoop/input/httpfs-site.xml
-rw-r--r-- 1 hadoop supergroup 3518 2020-10-21 06:17 /user/hadoop/input/kms-acls.xml
-rw-r--r-- 1 hadoop supergroup 682 2020-10-21 06:17 /user/hadoop/input/kms-site.xml
-rw-r--r-- 1 hadoop supergroup 1072 2020-10-21 06:17 /user/hadoop/input/mapred-site.xml
-rw-r--r-- 1 hadoop supergroup 1052 2020-10-21 06:17 /user/hadoop/input/yarn-site.xml
- Run the example
[hadoop@hadoop hadoop-3.1.4]$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar grep input output 'dfs[a-z.]+'
- Examine the example's output
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2020-10-21 06:17 /user/hadoop/input
drwxr-xr-x - hadoop supergroup 0 2020-10-21 06:21 /user/hadoop/output
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -ls /user/hadoop/output
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2020-10-21 06:21 /user/hadoop/output/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 29 2020-10-21 06:21 /user/hadoop/output/part-r-00000
[hadoop@hadoop hadoop-3.1.4]$ ls
bin etc include input lib libexec LICENSE.txt logs NOTICE.txt output README.txt sbin share
[hadoop@hadoop hadoop-3.1.4]$ rm -rf output
[hadoop@hadoop hadoop-3.1.4]$ ls
bin etc include input lib libexec LICENSE.txt logs NOTICE.txt README.txt sbin share
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -get output output
[hadoop@hadoop hadoop-3.1.4]$ ls
bin etc include input lib libexec LICENSE.txt logs NOTICE.txt output README.txt sbin share
[hadoop@hadoop hadoop-3.1.4]$ cat output/*
1 dfsadmin
1 dfs.replication
or
[hadoop@hadoop hadoop-3.1.4]$ bin/hdfs dfs -cat output/*
1 dfsadmin
1 dfs.replication
- Stop the daemons when you are done
$ sbin/stop-dfs.sh
YARN
etc/hadoop/mapred-site.xml:

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>
etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>
- Start ResourceManager daemon and NodeManager daemon:
[hadoop@hadoop hadoop-3.1.4]$ sbin/start-yarn.sh
Starting resourcemanager
Starting nodemanagers
- Browse the web interface for the ResourceManager; by default it is available at http://10.0.31.65:8088/
- Run a MapReduce job.
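For example, the same grep job from the HDFS section can be resubmitted; with mapreduce.framework.name set to yarn it is now scheduled by the ResourceManager. Delete the previous output directory first, since Hadoop refuses to overwrite it:

$ bin/hdfs dfs -rm -r -f output
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.4.jar grep input output 'dfs[a-z.]+'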
- When you’re done, stop the daemons with:
$ sbin/stop-yarn.sh