Hadoop is a software platform for analyzing and processing big data. It is an open-source framework from Apache, implemented in Java, that performs distributed computation over massive data sets on clusters made up of large numbers of machines. Apache Hadoop can also be deployed on virtual hosts as a PaaS layer, serving as a cloud computing platform.
In short, Hadoop is a tool for storing and analyzing massive amounts of data.
Components of the Hadoop ecosystem
HDFS: distributed file system
MAPREDUCE: distributed computation framework that provides processing for massive data sets
HIVE: SQL data warehouse tool built on big-data technology (file system + computation framework)
HBASE: distributed database for massive data, built on Hadoop
ZOOKEEPER: basic component providing distributed coordination services
Mahout: machine learning algorithm library built on distributed computation frameworks such as MapReduce/Spark/Flink
Oozie: workflow scheduling framework (comparable to Azkaban)
Sqoop: data import/export tool
Flume: log data collection framework
Common: the Java libraries and utilities needed by the other Hadoop modules, including the Java files and scripts required to start Hadoop
YARN: framework for job scheduling and cluster resource management
Hadoop data processing flow
Example scenario: "web log data mining"
Hadoop currently supports three cluster operating modes:
1. Local (Standalone) Mode: no daemons need to be started; everything runs directly on the host as a single process.
2. Pseudo-Distributed Mode: the distributed components are deployed on a single machine; useful for development and testing.
3. Fully-Distributed Mode: a true multi-node cluster.
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
[root@server1 ~]# ls
hadoop-3.2.1.tar.gz jdk-8u181-linux-x64.tar.gz
[root@server1 ~]# useradd cay
[root@server1 ~]# mv ./* /home/cay/
[root@server1 ~]# su - cay
[cay@server1 ~]$ ls
hadoop-3.2.1.tar.gz jdk-8u181-linux-x64.tar.gz
[cay@server1 ~]$ tar zxf hadoop-3.2.1.tar.gz
[cay@server1 ~]$ tar zxf jdk-8u181-linux-x64.tar.gz
[cay@server1 ~]$ ln -s hadoop-3.2.1 hadoop
[cay@server1 ~]$ ln -s jdk1.8.0_181/ jdk
[cay@server1 ~]$ ls
hadoop hadoop-3.2.1 hadoop-3.2.1.tar.gz jdk jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
Edit the environment script hadoop-env.sh:
[cay@server1 ~]$ cd hadoop
[cay@server1 hadoop]$ ls
bin etc include lib libexec LICENSE.txt NOTICE.txt README.txt sbin share
[cay@server1 hadoop]$ cd etc/
[cay@server1 etc]$ vim hadoop/hadoop-env.sh
export JAVA_HOME=/home/cay/jdk
# Location of Hadoop. By default, Hadoop will attempt to determine
# this location based upon its execution path.
export HADOOP_HOME=/home/cay/hadoop
// tell Hadoop where our Java and Hadoop installations are
[cay@server1 etc]$ cd ../bin
[cay@server1 bin]$ ls
container-executor hadoop hadoop.cmd hdfs hdfs.cmd mapred mapred.cmd oom-listener test-container-executor yarn yarn.cmd
[cay@server1 bin]$ pwd
/home/cay/hadoop/bin
[cay@server1 bin]$ ./hadoop / we can now run the command from the bin directory
It can also be added to the PATH environment variable:
[cay@server1 ~]$ vim .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/hadoop/bin
[cay@server1 ~]$ source ./.bash_profile
[cay@server1 ~]$ hadoop / same effect
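A quick way to confirm that the PATH change took effect is to print the version with the standard version subcommand:
[cay@server1 ~]$ hadoop version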
By default, Hadoop is configured to run in non-distributed mode as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input, then finds and displays every match of a given regular expression. Output is written to the given output directory:
[cay@server1 ~]$ mkdir input
[cay@server1 ~]$ cp hadoop/etc/hadoop/*.xml input/
[cay@server1 ~]$ cd input/
[cay@server1 input]$ ls
capacity-scheduler.xml core-site.xml hadoop-policy.xml hdfs-site.xml httpfs-site.xml kms-acls.xml kms-site.xml mapred-site.xml yarn-site.xml
List the example programs packaged in the jar:
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[cay@server1 mapreduce]$ pwd
/home/cay/hadoop/share/hadoop/mapreduce
The grep example:
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
Note that the output directory must not exist before the run; it is created automatically. input is the directory we created above. The regex filters out strings that start with dfs.
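For intuition, roughly the same counts could be produced locally with an ordinary shell pipeline (this is only an illustration of what the MapReduce job computes, not part of the Hadoop workflow):
[cay@server1 ~]$ grep -ohE 'dfs[a-z.]+' input/*.xml | sort | uniq -c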
[cay@server1 ~]$ cd output/
[cay@server1 output]$ ls
part-r-00000 _SUCCESS
[cay@server1 output]$ cat *
1 dfsadmin / only one match was found
[cay@server1 input]$ vim hadoop-policy.xml
dfsadmin and / this is the only file that contains a dfsadmin match
The ACL is a
At this point there is still only the current single node. To move to pseudo-distributed operation, first edit the configuration under the hadoop directory:
[cay@server1 hadoop]$ cd etc/hadoop/
[cay@server1 hadoop]$ vim core-site.xml / edit the core config file to tell Hadoop where our master is
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
[cay@server1 hadoop]$ vim hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>    <!-- only one replica, since there is currently just this single node -->
    </property>
</configuration>
[cay@server1 hadoop]$ ll workers
-rw-r--r-- 1 cay cay 10 Sep 10 2019 workers
[cay@server1 hadoop]$ cat workers
localhost / the workers file shows that the master and the worker currently live on the same node
The cluster scripts operate over passwordless SSH, so we need to set up passwordless login to the local host:
[cay@server1 hadoop]$ ssh-keygen
[cay@server1 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[cay@server1 ~]$ chmod 0600 ~/.ssh/authorized_keys
[cay@server1 ~]$ ssh localhost
Last failed login: Tue Jul 14 10:40:04 CST 2020 from localhost on ssh:notty
There were 5 failed login attempts since the last successful login.
Last login: Tue Jul 14 10:00:59 2020
The steps above are equivalent to using ssh-copy-id directly, except that ssh-copy-id localhost prompts for the localhost password.
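For reference, the single-command alternative mentioned above would look like this (it prompts once for the cay password):
[cay@server1 ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub cay@localhost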
Then format the filesystem as HDFS:
[cay@server1 hadoop]$ hdfs namenode -format / the master node is called the NameNode; slave nodes are called DataNodes
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at server1/172.25.254.1
************************************************************/
[cay@server1 hadoop]$ ll /tmp/
total 4
drwxr-xr-x 3 cay cay 20 Jul 14 10:14 hadoop
drwxr-xr-x 4 cay cay 31 Jul 14 10:45 hadoop-cay / this directory is created to hold the data
-rw-rw-r-- 1 cay cay 5 Jul 14 10:45 hadoop-cay-namenode.pid
drwxr-xr-x 2 cay cay 6 Jul 14 10:45 hsperfdata_cay
[cay@server1 hadoop]$ cd /tmp/hadoop-cay/
[cay@server1 hadoop-cay]$ ls
dfs mapred
Start the NameNode and DataNode daemons:
[cay@server1 ~]$ cd hadoop/sbin/
[cay@server1 sbin]$ ls
distribute-exclude.sh hadoop-daemons.sh mr-jobhistory-daemon.sh start-all.sh start-dfs.sh start-yarn.sh stop-balancer.sh stop-secure-dns.sh workers.sh
FederationStateStore httpfs.sh refresh-namenodes.sh start-balancer.sh start-secure-dns.sh stop-all.cmd stop-dfs.cmd stop-yarn.cmd yarn-daemon.sh
hadoop-daemon.sh kms.sh start-all.cmd start-dfs.cmd start-yarn.cmd stop-all.sh stop-dfs.sh stop-yarn.sh yarn-daemons.sh
[cay@server1 sbin]$ ./start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [server1]
server1: Warning: Permanently added 'server1,172.25.254.1' (ECDSA) to the list of known hosts.
start-dfs.sh only starts the components related to the HDFS distributed file system; it does not start any other components.
[cay@server1 ~]$ cd jdk/bin/
[cay@server1 bin]$ ls
appletviewer idlj java javafxpackager javapackager jcmd jdb jinfo jmc jrunscript jstat keytool pack200 rmid serialver unpack200 xjc
ControlPanel jar javac javah java-rmi.cgi jconsole jdeps jjs jmc.ini jsadebugd jstatd native2ascii policytool rmiregistry servertool wsgen
extcheck jarsigner javadoc javap javaws jcontrol jhat jmap jps jstack jvisualvm orbd rmic schemagen tnameserv wsimport
[cay@server1 bin]$ cd
[cay@server1 ~]$ vim .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/hadoop/bin:$HOME/jdk/bin / add the JDK commands to the PATH
[cay@server1 ~]$ source .bash_profile
[cay@server1 ~]$ jps / list the Java processes
4150 NameNode
4249 DataNode
4682 Jps
4429 SecondaryNameNode / the NameNode's auxiliary process
[cay@server1 ~]$ hdfs dfsadmin -report
The output of this command matches what the web UI shows.
We can also point Hadoop at a dedicated data directory: by default the data lives under /tmp, which the system may clean out periodically, so the data could be lost.
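A minimal sketch of such a change, assuming a directory like /home/cay/hadoopdata exists and is writable by cay (the path is only an example; hadoop.tmp.dir is the standard property), would be to add to core-site.xml:
<property>
    <name>hadoop.tmp.dir</name>
    <value>/home/cay/hadoopdata</value>    <!-- example path; keeps HDFS data out of /tmp -->
</property>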
Create the HDFS user home directory required for running MapReduce jobs:
[cay@server1 ~]$ hdfs dfs -ls /     / the / must be given for now to list anything; there is no data yet
[cay@server1 ~]$ hdfs dfs -mkdir /user
[cay@server1 ~]$ hdfs dfs -mkdir /user/cay
[cay@server1 ~]$ hdfs dfs -ls     / the home directory can now be listed directly
The web UI at this point:
From now on, job input and output both come from the distributed file system, not from the local disk.
Upload the input directory to HDFS:
[cay@server1 ~]$ hdfs dfs -put input/
The directory now appears in HDFS, so the local copies can be removed:
[cay@server1 ~]$ rm -fr input/ output/
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
The job reads the input directory from the distributed file system and writes its results to the output directory in HDFS.
There is only one replica of each block (dfs.replication is 1), and the block size here is the HDFS default of 128 MB, which can also be configured.
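If a different block size were wanted, it could be set in hdfs-site.xml (the value below is only illustrative):
<property>
    <name>dfs.blocksize</name>
    <value>268435456</value>    <!-- 256 MB; the default is 134217728 (128 MB) -->
</property>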
[cay@server1 ~]$ hdfs dfs -ls
Found 2 items
drwxr-xr-x - cay supergroup 0 2020-07-14 11:35 input
drwxr-xr-x - cay supergroup 0 2020-07-14 11:39 output
[cay@server1 ~]$ hdfs dfs -cat output/*
2020-07-14 11:43:28,989 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
1 dfsadmin
/ same result as before
We can also download the directory to the local disk and inspect it:
[cay@server1 ~]$ hdfs dfs -get output
2020-07-14 11:44:07,972 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[cay@server1 ~]$ cat output/*
1 dfsadmin
[cay@server1 ~]$ rm -fr output/
[cay@server1 ~]$ cat output/*
cat: output/*: No such file or directory
Run it again:
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/cay/output already exists
It fails, because the output directory already exists in the file system, and by default it cannot be deleted from the web UI, so delete it from the command line:
[cay@server1 ~]$ hdfs dfs -rm -r output
Deleted output
Other example jobs work the same way.
Word count:
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount input output
[cay@server1 ~]$ hdfs dfs -cat output/*
type="text/xsl" 4
u:%user:%user 1
under 28
unique 1
updating 1
use 11
used 24
user 48
user. 2
...
Now bring up two more hosts, server2 and server3, and stop the current single-node HDFS first:
[cay@server1 sbin]$ jps
4150 NameNode
5911 Jps
4249 DataNode
4429 SecondaryNameNode
[cay@server1 sbin]$ ./stop-dfs.sh
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [server1]
[cay@server1 sbin]$ jps
6371 Jps
On server1, export our configuration directory over NFS so the other two hosts can use it directly.
[root@server1 ~]# yum install nfs-utils -y
[root@server1 ~]# vim /etc/exports
/home/cay *(rw,anonuid=1000,anongid=1000) / export the home directory so the other nodes need no separate setup
[root@server1 ~]# id cay
uid=1000(cay) gid=1000(cay) groups=1000(cay)
[root@server1 ~]# systemctl enable --now nfs
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
[root@server1 ~]# showmount -e
Export list for server1:
/home/cay *
[root@server2 ~]# yum install nfs-utils -y
[root@server2 ~]# systemctl start rpcbind.service
[root@server2 ~]# mount 172.25.254.1:/home/cay/ /home/cay/
[root@server2 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 17811456 1199756 16611700 7% /
devtmpfs 1011516 0 1011516 0% /dev
tmpfs 1023608 0 1023608 0% /dev/shm
tmpfs 1023608 16868 1006740 2% /run
tmpfs 1023608 0 1023608 0% /sys/fs/cgroup
/dev/sda1 1038336 135224 903112 14% /boot
tmpfs 204724 0 204724 0% /run/user/0
172.25.254.1:/home/cay 17811456 3002112 14809344 17% /home/cay
[root@server3 ~]# yum install nfs-utils -y
[root@server3 ~]# systemctl start rpcbind.service
[root@server3 ~]# mount 172.25.254.1:/home/cay/ /home/cay/
[root@server3 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 17811456 1199756 16611700 7% /
devtmpfs 1011516 0 1011516 0% /dev
tmpfs 1023608 0 1023608 0% /dev/shm
tmpfs 1023608 16868 1006740 2% /run
tmpfs 1023608 0 1023608 0% /sys/fs/cgroup
/dev/sda1 1038336 135224 903112 14% /boot
tmpfs 204724 0 204724 0% /run/user/0
172.25.254.1:/home/cay 17811456 3002112 14809344 17% /home/cay
With this in place, the JDK, Hadoop, and the passwordless SSH keys are all already set up on the new nodes, and their data stays in sync.
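To make the NFS mount survive reboots, an /etc/fstab entry along these lines could be added on server2 and server3 (a sketch; the mount options are illustrative):
172.25.254.1:/home/cay  /home/cay  nfs  defaults,_netdev  0 0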
[root@server1 ~]# cd /tmp
[root@server1 tmp]# rm -fr *
[root@server1 tmp]# ls / the previous data has been cleared
[root@server1 tmp]# cd
[root@server1 ~]# su - cay
Last login: Tue Jul 14 10:40:43 CST 2020 from localhost on pts/1
[cay@server1 ~]$ cd hadoop
[cay@server1 hadoop]$ vim etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.25.254.1:9000</value>    <!-- the master (NameNode) -->
    </property>
</configuration>
[cay@server1 hadoop]$ vim etc/hadoop/workers / the data (worker) nodes
172.25.254.2
172.25.254.3
[cay@server1 hadoop]$ cat etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>    <!-- two worker nodes now, so the replica count is set to 2 -->
    </property>
</configuration>
[cay@server1 hadoop]$ hdfs namenode -format / re-format the filesystem
[cay@server1 hadoop]$ cd /tmp/
[cay@server1 tmp]$ ls
hadoop-cay hadoop-cay-namenode.pid hsperfdata_cay
Start DFS on server1:
[cay@server1 ~]$ cd hadoop
[cay@server1 hadoop]$ cd sbin/
[cay@server1 sbin]$ ./start-dfs.sh
Starting namenodes on [server1]
Starting datanodes
172.25.254.3: Warning: Permanently added '172.25.254.3' (ECDSA) to the list of known hosts.
Starting secondary namenodes [server1]
[cay@server1 sbin]$ jps
7092 NameNode / server1 is the NameNode
7449 Jps
7278 SecondaryNameNode
[cay@server2 ~]$ jps
3772 DataNode
3839 Jps / server2 and server3 are DataNodes
[root@server3 ~]# su - cay
[cay@server3 ~]$ jps
3617 DataNode
3727 Jps
[cay@server1 sbin]$ cd
[cay@server1 ~]$ hdfs dfs -mkdir /user
[cay@server1 ~]$ hdfs dfs -mkdir /user/cay / recreate the home directory
[cay@server1 ~]$ hdfs dfs -mkdir input
[cay@server1 ~]$ hdfs dfs -ls
Found 1 items
drwxr-xr-x - cay supergroup 0 2020-07-14 12:36 input
[cay@server1 ~]$ hdfs dfs -put hadoop/etc/hadoop/*.xml input
[cay@server1 ~]$ hdfs dfs -ls input
Found 9 items
-rw-r--r-- 2 cay supergroup 8260 2020-07-14 12:36 input/capacity-scheduler.xml
-rw-r--r-- 2 cay supergroup 887 2020-07-14 12:36 input/core-site.xml
-rw-r--r-- 2 cay supergroup 11392 2020-07-14 12:36 input/hadoop-policy.xml
-rw-r--r-- 2 cay supergroup 867 2020-07-14 12:36 input/hdfs-site.xml
-rw-r--r-- 2 cay supergroup 620 2020-07-14 12:36 input/httpfs-site.xml
-rw-r--r-- 2 cay supergroup 3518 2020-07-14 12:36 input/kms-acls.xml
-rw-r--r-- 2 cay supergroup 682 2020-07-14 12:37 input/kms-site.xml
-rw-r--r-- 2 cay supergroup 758 2020-07-14 12:37 input/mapred-site.xml
-rw-r--r-- 2 cay supergroup 690 2020-07-14 12:37 input/yarn-site.xml
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
[cay@server1 ~]$ hdfs dfs -cat output/*
2020-07-14 12:41:20,823 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
1 dfsadmin
1 dfs.replication
Next, bring up a fourth host, server4.
[root@server4 ~]# useradd cay
[root@server4 ~]# yum install -y nfs-utils -y
[root@server4 ~]# systemctl start rpcbind
[root@server4 ~]# mount 172.25.254.1:/home/cay/ /home/cay
[root@server4 ~]# df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/rhel-root 17811456 1166272 16645184 7% /
devtmpfs 495420 0 495420 0% /dev
tmpfs 507512 0 507512 0% /dev/shm
tmpfs 507512 13204 494308 3% /run
tmpfs 507512 0 507512 0% /sys/fs/cgroup
/dev/sda1 1038336 135224 903112 14% /boot
tmpfs 101504 0 101504 0% /run/user/0
172.25.254.1:/home/cay 17811456 3002368 14809088 17% /home/cay
[root@server4 ~]# su - cay
[cay@server4 ~]$ cd hadoop/etc/hadoop/
[cay@server4 hadoop]$ vim workers
[cay@server4 hadoop]$ cat workers
172.25.254.2
172.25.254.3
172.25.254.4 / add server4 as a worker node
[cay@server4 hadoop]$ vim hdfs-site.xml
[cay@server4 hadoop]$ cat hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>    <!-- raise the replica count to three -->
    </property>
</configuration>
Now the only thing missing is a running daemon. Start a DataNode process on the new node so it joins the distributed system:
[cay@server4 hadoop]$ hdfs --daemon start datanode
[cay@server4 hadoop]$ jps
4101 Jps
4041 DataNode
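To confirm that the new DataNode has registered with the NameNode, the report command used earlier can be run again on server1; it should now list server4 among the live DataNodes:
[cay@server1 ~]$ hdfs dfsadmin -report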
[cay@server4 ~]$ dd if=/dev/zero of=bigfile bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 8.53429 s, 24.6 MB/s
[cay@server4 ~]$ hdfs dfs -put bigfile
The upload produces two blocks, because a single block is limited to 128 MB. Below is the read order: first server2, then server3, then server4; the blocks themselves are stored on different nodes.
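To see exactly how the file was split into blocks and where each replica was placed, the standard fsck tool can be used (the path assumes the upload above landed in the user's HDFS home directory):
[cay@server1 ~]$ hdfs fsck /user/cay/bigfile -files -blocks -locations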