Hadoop: Fully Distributed, Pseudo-Distributed, and Standalone Mode Operation

Introduction

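Hadoop is a software platform for analyzing and processing big data. It is an open-source framework from Apache, written in Java, that performs distributed computation over massive data sets on clusters built from many machines. Apache Hadoop can also be deployed as PaaS on virtual hosts to serve as a cloud computing platform.
In short, Hadoop is a tool for storing and analyzing massive amounts of data.
Components of the Hadoop ecosystem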


[Figure 1]
Key components:

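  • HDFS: distributed file system

  • MAPREDUCE: framework for developing distributed computation programs; it provides the computation over massive data sets

  • HIVE: SQL data warehouse tool built on big data technology (file system + computation framework)

  • HBASE: distributed database for massive data sets, built on Hadoop

  • ZOOKEEPER: basic component providing distributed coordination services

  • Mahout: machine learning algorithm library built on distributed computation frameworks such as MapReduce, Spark, and Flink

  • Oozie: workflow scheduling framework (comparable to Azkaban)

  • Sqoop: data import and export tool

  • Flume: log data collection framework

  • Common: the Java libraries and utilities needed by the other Hadoop modules, including the Java files and scripts that start Hadoop

  • YARN: framework for job scheduling and resource management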

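Hadoop data processing flow
Example: web log data mining

[Figure 2]

[Figure 3]

Hadoop currently supports three cluster modes:

1. Local (Standalone) Mode
No daemons need to be started; jobs run directly on the host.

2. Pseudo-Distributed Mode
A distributed deployment on a single machine, built on top of standalone mode. It is suitable for development and testing.

3. Fully-Distributed Mode
This is a real cluster, spread across multiple machines.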

Deployment

Standalone mode

http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

[root@server1 ~]# ls
hadoop-3.2.1.tar.gz  jdk-8u181-linux-x64.tar.gz
[root@server1 ~]# useradd cay
[root@server1 ~]# mv ./* /home/cay/
[root@server1 ~]# su - cay
[cay@server1 ~]$ ls
hadoop-3.2.1.tar.gz  jdk-8u181-linux-x64.tar.gz
[cay@server1 ~]$ tar zxf hadoop-3.2.1.tar.gz 
[cay@server1 ~]$ tar zxf jdk-8u181-linux-x64.tar.gz 
[cay@server1 ~]$ ln -s hadoop-3.2.1 hadoop
[cay@server1 ~]$ ln -s jdk1.8.0_181/ jdk
[cay@server1 ~]$ ls
hadoop  hadoop-3.2.1  hadoop-3.2.1.tar.gz  jdk  jdk1.8.0_181  jdk-8u181-linux-x64.tar.gz

Edit the environment script:
[cay@server1 ~]$ cd  hadoop
[cay@server1 hadoop]$ ls
bin  etc  include  lib  libexec  LICENSE.txt  NOTICE.txt  README.txt  sbin  share
[cay@server1 hadoop]$ cd etc/              
[cay@server1 etc]$ vim hadoop/hadoop-env.sh
export JAVA_HOME=/home/cay/jdk

# Location of Hadoop.  By default, Hadoop will attempt to determine
# this location based upon its execution path.
export HADOOP_HOME=/home/cay/hadoop
# Tell Hadoop where the JDK and Hadoop itself are installed


[cay@server1 etc]$ cd ../bin
[cay@server1 bin]$ ls
container-executor  hadoop  hadoop.cmd  hdfs  hdfs.cmd  mapred  mapred.cmd  oom-listener  test-container-executor  yarn  yarn.cmd
[cay@server1 bin]$ pwd
/home/cay/hadoop/bin
[cay@server1 bin]$ ./hadoop		# the command can now be run from inside the bin directory

Alternatively, add the directory to PATH:
[cay@server1 ~]$  vim .bash_profile 
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/hadoop/bin

[cay@server1 ~]$ source ./.bash_profile 
[cay@server1 ~]$ hadoop		# same result
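
Because the PATH line in .bash_profile appends on every run, sourcing the file several times in one session leaves duplicate entries. A minimal sketch of an idempotent alternative follows; the helper name add_to_path is hypothetical and is not part of Hadoop or bash.

```shell
# add_to_path is a hypothetical helper (not part of Hadoop or bash).
# It appends a directory to PATH only if it is not already present.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;                 # already present: do nothing
    *) PATH="$PATH:$1" ;;        # otherwise append
  esac
}

add_to_path "$HOME/hadoop/bin"
add_to_path "$HOME/hadoop/bin"   # second call is a no-op
export PATH
```

Wrapping each entry in colons before matching avoids false hits on directories that merely share a prefix.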

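By default, Hadoop is configured to run in non-distributed mode, as a single Java process. This is useful for debugging.

The following example copies the XML files from the unpacked configuration directory (etc/hadoop) to use as input, then finds and displays every match of the given regular expression. Output is written to the given output directory: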

[cay@server1 ~]$ mkdir input
[cay@server1 ~]$ cp hadoop/etc/hadoop/*.xml input/
[cay@server1 ~]$ cd input/
[cay@server1 input]$ ls
capacity-scheduler.xml  core-site.xml  hadoop-policy.xml  hdfs-site.xml  httpfs-site.xml  kms-acls.xml  kms-site.xml  mapred-site.xml  yarn-site.xml

List the example programs packaged in the jar:

[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar 
An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
  dbcount: An example job that count the pageview counts from a database.
  distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
  wordmean: A map/reduce program that counts the average length of the words in the input files.
  wordmedian: A map/reduce program that counts the median length of the words in the input files.
  wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
[cay@server1 mapreduce]$ pwd
/home/cay/hadoop/share/hadoop/mapreduce

grep

[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
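Note that the output directory must not exist before the run; the job creates it. input is the directory created above. The regular expression matches strings that start with dfs followed by one or more lowercase letters or dots.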

[cay@server1 ~]$ cd output/
[cay@server1 output]$ ls
part-r-00000  _SUCCESS
[cay@server1 output]$ cat *
1	dfsadmin			# only one match was found

[cay@server1 input]$ vim hadoop-policy.xml 
    dfsadmin and 		# this file is the only one containing dfsadmin, and it appears once
    The ACL is a 
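
The result of the grep example can be cross-checked locally with standard tools. The sketch below is not what the Hadoop job executes; it only applies the same regular expression with grep and counts the distinct matches. The sample file content is made up for illustration.

```shell
# Build a tiny sample input (made-up content, for illustration only).
workdir=$(mktemp -d)
mkdir "$workdir/input"
printf '%s\n' 'dfsadmin and dfs.replication' 'no match here' 'dfsadmin again' \
  > "$workdir/input/sample.xml"

# -o prints only the matched text, -h hides file names, -E enables extended regex.
# sort | uniq -c counts each distinct match; sort -rn puts the most frequent first.
counts=$(grep -ohE 'dfs[a-z.]+' "$workdir"/input/*.xml | sort | uniq -c | sort -rn)
echo "$counts"

rm -rf "$workdir"
```

For this sample, dfsadmin is counted twice and dfs.replication once. Pointing the same pipeline at the real input directory should agree with part-r-00000.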

Pseudo-distributed mode

There is still only the one node.
First, edit the files in Hadoop's configuration directory:

[cay@server1 hadoop]$ cd etc/hadoop/

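[cay@server1 hadoop]$ vim core-site.xml		# edit the core configuration file to point at the master

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>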


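[cay@server1 hadoop]$ vim hdfs-site.xml 

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>    <!-- a single replica, because there is only one node -->
    </property>
</configuration>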


[cay@server1 hadoop]$ ll workers 
-rw-r--r-- 1 cay cay 10 Sep 10  2019 workers
[cay@server1 hadoop]$ cat workers 
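localhost			# the workers file lists localhost, so the master and the worker are on the same node

Hadoop's start scripts reach every node over SSH, so passwordless login is required. Set it up for the local host: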

[cay@server1 hadoop]$ ssh-keygen 
[cay@server1 ~]$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[cay@server1 ~]$ chmod 0600 ~/.ssh/authorized_keys
[cay@server1 ~]$ ssh localhost
Last failed login: Tue Jul 14 10:40:04 CST 2020 from localhost on ssh:notty
There were 5 failed login attempts since the last successful login.
Last login: Tue Jul 14 10:00:59 2020

The steps above are equivalent to using ssh-copy-id; the only difference is that ssh-copy-id localhost prompts for localhost's password.
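
One way to confirm passwordless login before starting any daemons is to try a non-interactive SSH command against every host in the workers file. The sketch below uses a hypothetical helper name, check_workers; the ssh options it relies on (-n, -o BatchMode=yes, -o ConnectTimeout) are standard OpenSSH options.

```shell
# check_workers is a hypothetical helper, not a Hadoop command.
# $1: path to a workers file with one host per line.
check_workers() {
  while IFS= read -r host; do
    [ -z "$host" ] && continue          # skip blank lines
    # -n keeps ssh from swallowing the rest of the workers file on stdin;
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
    fi
  done < "$1"
}

# Usage (path taken from this walkthrough):
# check_workers "$HOME/hadoop/etc/hadoop/workers"
```

A host whose key has not been accepted yet is also reported as FAILED, because batch mode disables the confirmation prompt.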

Then format the HDFS file system:

[cay@server1 hadoop]$ hdfs namenode -format			# the master node is called the NameNode; worker nodes are called DataNodes

/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at server1/172.25.254.1
************************************************************/
[cay@server1 hadoop]$ ll /tmp/
total 4
drwxr-xr-x 3 cay cay 20 Jul 14 10:14 hadoop
drwxr-xr-x 4 cay cay 31 Jul 14 10:45 hadoop-cay		# this directory is created to hold the data
-rw-rw-r-- 1 cay cay  5 Jul 14 10:45 hadoop-cay-namenode.pid
drwxr-xr-x 2 cay cay  6 Jul 14 10:45 hsperfdata_cay

[cay@server1 hadoop]$ cd /tmp/hadoop-cay/
[cay@server1 hadoop-cay]$ ls
dfs  mapred

Start the NameNode and DataNode daemons:

[cay@server1 ~]$  cd hadoop/sbin/
[cay@server1 sbin]$ ls
distribute-exclude.sh  hadoop-daemons.sh  mr-jobhistory-daemon.sh  start-all.sh       start-dfs.sh         start-yarn.sh  stop-balancer.sh  stop-secure-dns.sh  workers.sh
FederationStateStore   httpfs.sh          refresh-namenodes.sh     start-balancer.sh  start-secure-dns.sh  stop-all.cmd   stop-dfs.cmd      stop-yarn.cmd       yarn-daemon.sh
hadoop-daemon.sh       kms.sh             start-all.cmd            start-dfs.cmd      start-yarn.cmd       stop-all.sh    stop-dfs.sh       stop-yarn.sh        yarn-daemons.sh

[cay@server1 sbin]$ ./start-dfs.sh 
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [server1]
server1: Warning: Permanently added 'server1,172.25.254.1' (ECDSA) to the list of known hosts.

start-dfs.sh starts only the HDFS components; it does not start anything else.

[cay@server1 ~]$ cd jdk/bin/
[cay@server1 bin]$ ls
appletviewer  idlj       java     javafxpackager  javapackager  jcmd      jdb    jinfo  jmc      jrunscript  jstat      keytool       pack200     rmid         serialver   unpack200  xjc
ControlPanel  jar        javac    javah           java-rmi.cgi  jconsole  jdeps  jjs    jmc.ini  jsadebugd   jstatd     native2ascii  policytool  rmiregistry  servertool  wsgen
extcheck      jarsigner  javadoc  javap           javaws        jcontrol  jhat   jmap   jps      jstack      jvisualvm  orbd          rmic        schemagen    tnameserv   wsimport
[cay@server1 bin]$ cd
[cay@server1 ~]$ vim .bash_profile 
PATH=$PATH:$HOME/.local/bin:$HOME/bin:$HOME/hadoop/bin:$HOME/jdk/bin	# add the JDK commands to PATH
[cay@server1 ~]$ source .bash_profile 
[cay@server1 ~]$ jps		# list Java processes
4150 NameNode
4249 DataNode
4682 Jps
4429 SecondaryNameNode		# assists the NameNode (checkpointing)

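Open the NameNode web interface in a browser (Hadoop 3.x serves it on port 9870 by default):
[Figure 4]

Currently there is only one node.

[cay@server1 ~]$ hdfs dfsadmin -report
# this command reports the same information as the web interface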

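It is also possible to set a dedicated Hadoop data directory through the hadoop.tmp.dir property in core-site.xml. By default the data lives under /tmp, which the system may clean up periodically, and that can cause data loss.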
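
A minimal sketch of such a setting, assuming a hypothetical data directory $HOME/hadoop-data (any persistent path would do). The snippet only writes the property to a scratch file for review; it still has to be placed inside the <configuration> element of etc/hadoop/core-site.xml by hand, a step the walkthrough itself does not perform.

```shell
# Hypothetical location; pick any directory that is not cleaned automatically.
data_dir="$HOME/hadoop-data"
snippet=$(mktemp)

# Write the property to a scratch file so it can be reviewed before
# being pasted inside <configuration> in etc/hadoop/core-site.xml.
cat > "$snippet" <<EOF
<property>
    <name>hadoop.tmp.dir</name>
    <value>${data_dir}</value>
</property>
EOF

cat "$snippet"
```

Changing the directory after the NameNode has been formatted means the existing metadata is no longer found at the new location, so this is best decided before running hdfs namenode -format.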

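Create the HDFS user home directory required to run MapReduce jobs:

[cay@server1 ~]$ hdfs dfs -ls /		# the home directory does not exist yet, so the path / must be given; there is no data so far

[cay@server1 ~]$ hdfs dfs -mkdir /user
[cay@server1 ~]$ hdfs dfs -mkdir /user/cay
[cay@server1 ~]$ hdfs dfs -ls			# now the listing works without a path (it defaults to /user/cay)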

The web interface at this point:
[Figure 5]
[Figure 6]
[Figure 7]
In distributed mode, a job's input and output both live on the distributed file system, not on the local disk.
Upload the input directory to HDFS:

[cay@server1 ~]$ hdfs dfs -put input/

[Figure 8]

The data directory now appears in HDFS. Next, remove the local copies and run the job:

[cay@server1 ~]$ rm -fr input/ output/
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'

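The job reads the input directory from the distributed file system and writes to the output directory on HDFS:
[Figure 9]

[Figure 10]
There is only one replica. The block size shown here comes from the HDFS configuration (128 MB by default) and can be changed.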

[cay@server1 ~]$ hdfs dfs  -ls
Found 2 items
drwxr-xr-x   - cay supergroup          0 2020-07-14 11:35 input
drwxr-xr-x   - cay supergroup          0 2020-07-14 11:39 output
[cay@server1 ~]$ hdfs dfs  -cat output/*
2020-07-14 11:43:28,989 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
1	dfsadmin
		# same result as before

The directory can also be downloaded to the local file system for inspection:

[cay@server1 ~]$ hdfs dfs  -get output
2020-07-14 11:44:07,972 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
[cay@server1 ~]$ cat output/*
1	dfsadmin
[cay@server1 ~]$ rm -fr output/
[cay@server1 ~]$ cat output/*
cat: output/*: No such file or directory

Run the job again:

[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/user/cay/output already exists

It fails, because the output directory already exists on the file system. By default it cannot be deleted from the web interface.

It can be deleted from the command line:

[cay@server1 ~]$ hdfs dfs -rm -r output
Deleted output


The other example programs work the same way.
Word frequency count:


[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount  input output
[cay@server1 ~]$ hdfs dfs -cat output/*
...
type="text/xsl"	4
u:%user:%user	1
under	28
unique	1
updating	1
use	11
used	24
user	48
user.	2

...
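
For intuition, a similar count can be approximated on a local file with standard tools. This is only a sketch, not what the Hadoop job runs. It assumes, as the output above suggests, that words are split on whitespace only, so user and user. are counted separately. The sample text is made up.

```shell
# Made-up sample text, for illustration only.
sample=$(mktemp)
printf '%s\n' 'user used user.' 'use user' > "$sample"

# Put one word per line (splitting on any whitespace), then count identical
# words and print them as "word<TAB>count", like the job's output.
wc_out=$(tr -s '[:space:]' '\n' < "$sample" | sort | uniq -c | awk '{print $2 "\t" $1}')
echo "$wc_out"

rm -f "$sample"
```

The sort step plays the role of the shuffle phase by bringing identical words together, and uniq -c does the per-word summing that the reducer does.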

Fully distributed mode

Start two more hosts, server2 and server3. First stop the daemons still running on server1:

[cay@server1 sbin]$ jps
4150 NameNode
5911 Jps
4249 DataNode
4429 SecondaryNameNode
[cay@server1 sbin]$ ./stop-dfs.sh 
Stopping namenodes on [localhost]
Stopping datanodes
Stopping secondary namenodes [server1]
[cay@server1 sbin]$ jps
6371 Jps

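Export the home directory (which holds the Hadoop configuration) from server1 over NFS, so that the other two hosts need no separate setup.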

[root@server1 ~]# yum install nfs-utils -y
[root@server1 ~]# vim /etc/exports
/home/cay       *(rw,anonuid=1000,anongid=1000)		# export the directory so the other nodes need no separate setup
[root@server1 ~]# id cay
uid=1000(cay) gid=1000(cay) groups=1000(cay)
[root@server1 ~]# systemctl enable  --now  nfs
Created symlink from /etc/systemd/system/multi-user.target.wants/nfs-server.service to /usr/lib/systemd/system/nfs-server.service.
[root@server1 ~]# showmount -e
Export list for server1:
/home/cay *

[root@server2 ~]# yum install nfs-utils -y
[root@server2 ~]# systemctl start rpcbind.service 
[root@server2 ~]# mount 172.25.254.1:/home/cay/  /home/cay/
[root@server2 ~]# df
Filesystem             1K-blocks    Used Available Use% Mounted on
/dev/mapper/rhel-root   17811456 1199756  16611700   7% /
devtmpfs                 1011516       0   1011516   0% /dev
tmpfs                    1023608       0   1023608   0% /dev/shm
tmpfs                    1023608   16868   1006740   2% /run
tmpfs                    1023608       0   1023608   0% /sys/fs/cgroup
/dev/sda1                1038336  135224    903112  14% /boot
tmpfs                     204724       0    204724   0% /run/user/0
172.25.254.1:/home/cay  17811456 3002112  14809344  17% /home/cay


[root@server3 ~]# yum install nfs-utils -y
[root@server3 ~]# systemctl start rpcbind.service 
[root@server3 ~]# mount 172.25.254.1:/home/cay/  /home/cay/
[root@server3 ~]# df
Filesystem             1K-blocks    Used Available Use% Mounted on
/dev/mapper/rhel-root   17811456 1199756  16611700   7% /
devtmpfs                 1011516       0   1011516   0% /dev
tmpfs                    1023608       0   1023608   0% /dev/shm
tmpfs                    1023608   16868   1006740   2% /run
tmpfs                    1023608       0   1023608   0% /sys/fs/cgroup
/dev/sda1                1038336  135224    903112  14% /boot
tmpfs                     204724       0    204724   0% /run/user/0
172.25.254.1:/home/cay  17811456 3002112  14809344  17% /home/cay

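With the shared home directory, the JDK, Hadoop, and the passwordless SSH setup are all in place on every node, and the files stay in sync. (This assumes the cay user already exists on server2 and server3 with the same UID, 1000, as on server1.)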

[root@server1 ~]# cd /tmp
[root@server1 tmp]# rm -fr *
[root@server1 tmp]# ls			# the old data has been cleared
[root@server1 tmp]# cd 
[root@server1 ~]# su - cay
Last login: Tue Jul 14 10:40:43 CST 2020 from localhost on pts/1
[cay@server1 ~]$ cd hadoop
[cay@server1 hadoop]$ vim etc/hadoop/core-site.xml 

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://172.25.254.1:9000</value>    <!-- the master node -->
    </property>
</configuration>


[cay@server1 hadoop]$ vim etc/hadoop/workers 				# the data nodes
172.25.254.2
172.25.254.3


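[cay@server1 hadoop]$ cat etc/hadoop/hdfs-site.xml 

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>    <!-- there are two data nodes, so the replica count becomes 2 -->
    </property>
</configuration>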


[cay@server1 hadoop]$ hdfs namenode -format		# format again

[cay@server1 hadoop]$ cd /tmp/
[cay@server1 tmp]$ ls
hadoop-cay  hadoop-cay-namenode.pid  hsperfdata_cay

Start DFS on server1:

[cay@server1 ~]$ cd hadoop
[cay@server1 hadoop]$ cd sbin/
[cay@server1 sbin]$ ./start-dfs.sh 
Starting namenodes on [server1]
Starting datanodes
172.25.254.3: Warning: Permanently added '172.25.254.3' (ECDSA) to the list of known hosts.
Starting secondary namenodes [server1]
[cay@server1 sbin]$ jps
7092 NameNode			# server1 is the NameNode
7449 Jps
7278 SecondaryNameNode

[cay@server2 ~]$ jps
3772 DataNode
3839 Jps				# server2 and server3 are DataNodes

[root@server3 ~]# su - cay
[cay@server3 ~]$ jps
3617 DataNode
3727 Jps

[Figure 11]

[cay@server1 sbin]$ cd
[cay@server1 ~]$ hdfs dfs -mkdir /user
[cay@server1 ~]$ hdfs dfs -mkdir /user/cay			# recreate the home directory

[cay@server1 ~]$ hdfs dfs -mkdir input
[cay@server1 ~]$ hdfs dfs -ls
Found 1 items
drwxr-xr-x   - cay supergroup          0 2020-07-14 12:36 input
[cay@server1 ~]$ hdfs dfs -put hadoop/etc/hadoop/*.xml input
[cay@server1 ~]$ hdfs dfs -ls input
Found 9 items
-rw-r--r--   2 cay supergroup       8260 2020-07-14 12:36 input/capacity-scheduler.xml
-rw-r--r--   2 cay supergroup        887 2020-07-14 12:36 input/core-site.xml
-rw-r--r--   2 cay supergroup      11392 2020-07-14 12:36 input/hadoop-policy.xml
-rw-r--r--   2 cay supergroup        867 2020-07-14 12:36 input/hdfs-site.xml
-rw-r--r--   2 cay supergroup        620 2020-07-14 12:36 input/httpfs-site.xml
-rw-r--r--   2 cay supergroup       3518 2020-07-14 12:36 input/kms-acls.xml
-rw-r--r--   2 cay supergroup        682 2020-07-14 12:37 input/kms-site.xml
-rw-r--r--   2 cay supergroup        758 2020-07-14 12:37 input/mapred-site.xml
-rw-r--r--   2 cay supergroup        690 2020-07-14 12:37 input/yarn-site.xml
[cay@server1 ~]$ hadoop jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'

[Figure 12]

[cay@server1 ~]$ hdfs dfs -cat output/*
2020-07-14 12:41:20,823 INFO sasl.SaslDataTransferClient: SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
1	dfsadmin
1	dfs.replication

Adding a worker node online

Start server4.

[root@server4 ~]# useradd cay
[root@server4 ~]# yum install -y nfs-utils
[root@server4 ~]# systemctl start rpcbind
[root@server4 ~]# mount 172.25.254.1:/home/cay/ /home/cay

[root@server4 ~]# df
Filesystem             1K-blocks    Used Available Use% Mounted on
/dev/mapper/rhel-root   17811456 1166272  16645184   7% /
devtmpfs                  495420       0    495420   0% /dev
tmpfs                     507512       0    507512   0% /dev/shm
tmpfs                     507512   13204    494308   3% /run
tmpfs                     507512       0    507512   0% /sys/fs/cgroup
/dev/sda1                1038336  135224    903112  14% /boot
tmpfs                     101504       0    101504   0% /run/user/0
172.25.254.1:/home/cay  17811456 3002368  14809088  17% /home/cay

[root@server4 ~]# su - cay
[cay@server4 ~]$ cd hadoop/etc/hadoop/
[cay@server4 hadoop]$ vim workers 
[cay@server4 hadoop]$ cat workers 
172.25.254.2
172.25.254.3
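172.25.254.4			# the new worker node
[cay@server4 hadoop]$ vim hdfs-site.xml 
[cay@server4 hadoop]$ cat hdfs-site.xml 

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>    <!-- replicas increased to three -->
    </property>
</configuration>

All that remains is to start the daemon: bring up a DataNode on the new host so that it joins the running cluster.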

[cay@server4 hadoop]$ hdfs --daemon start datanode
[cay@server4 hadoop]$ jps
4101 Jps
4041 DataNode

[Figure 13]

[cay@server4 ~]$ dd if=/dev/zero of=bigfile bs=1M count=200
200+0 records in
200+0 records out
209715200 bytes (210 MB) copied, 8.53429 s, 24.6 MB/s
[cay@server4 ~]$ hdfs dfs -put bigfile

[Figure 14]
[Figure 15]
The 200 MB file is split into two blocks, because a single block holds at most 128 MB. The replicas are read in this order: server2 first, then server3, then server4. The blocks are stored separately.
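
The block count follows from simple arithmetic on the file size reported by dd and the 128 MB block size. A small sketch (the variable names are illustrative):

```shell
file_bytes=209715200                      # size reported by dd above (200 MiB)
block_bytes=$((128 * 1024 * 1024))        # 128 MiB block size

# Ceiling division: a partially filled last block still counts as a block.
blocks=$(( (file_bytes + block_bytes - 1) / block_bytes ))
last_block_bytes=$(( file_bytes - (blocks - 1) * block_bytes ))

echo "blocks=$blocks last_block_bytes=$last_block_bytes"
# prints: blocks=2 last_block_bytes=75497472
```

The second block therefore holds the remaining 72 MiB, and with dfs.replication set to 3 each of the two blocks is stored three times.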
