Hadoop Pseudo-Distributed Deployment: HDFS

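Preface

The previous part covered preparing the Linux environment for a Hadoop deployment; see "Hadoop Pseudo-Distributed Deployment: Linux Environment Preparation" if you have not read it.
This part covers a pseudo-distributed deployment of HDFS. The environment is as follows:
Operating system: CentOS 6.4
Java version: Oracle JDK 1.7
Hadoop version: Hadoop 2.5.0
Hostname: hadoop01.datacenter.com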

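Hadoop runs on Java. For Hadoop 2.5.0 we recommend Oracle JDK 1.7; for the full list of supported Java versions, see the Hadoop Java version dependencies page. The JDK installation was covered in the previous part, so it is not repeated here.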

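HDFS is a distributed file storage system, so everything we install, configure, and access below revolves around the idea of "distributed file storage". Readers who want more background can look up HDFS, the NameNode, and the DataNode.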

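1. Download and unpack Hadoop

Hadoop can be downloaded directly from the official archive at https://archive.apache.org/dist/hadoop/commo ; pick whichever version you need. Here we use the prebuilt binary package of Hadoop 2.5.0.
Once it is downloaded, upload the package to the directory planned earlier, /opt/software. It is roughly 300 MB.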

[hadoop@hadoop01 software]$ cd /opt/software/
[hadoop@hadoop01 software]$ ll hadoop-2.5.0.tar.gz 
-rw-rw-r-- 1 hadoop hadoop 311430119 Jan  7 12:55 hadoop-2.5.0.tar.gz
[hadoop@hadoop01 software]$ 
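
The listing above reports the archive size in bytes. As a quick sanity check, shell integer arithmetic can convert it to mebibytes. This is only an illustrative sketch; bytes_to_mib is a hypothetical helper name, not a Hadoop or system command.

```shell
# Convert a byte count to whole mebibytes (1 MiB = 1024 * 1024 bytes),
# rounding down because shell arithmetic is integer-only.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Size of hadoop-2.5.0.tar.gz taken from the listing above.
bytes_to_mib 311430119   # prints 297
```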

Next, unpack the archive into /opt/modules/:

[hadoop@hadoop01 software]$ tar -zxf hadoop-2.5.0.tar.gz -C /opt/modules/
[hadoop@hadoop01 software]$ cd /opt/modules/
[hadoop@hadoop01 modules]$ ll
total 16
drwxr-xr-x  9 hadoop hadoop 4096 Aug  7  2014 hadoop-2.5.0
drwxr-xr-x  8 hadoop hadoop 4096 Jul 26  2014 jdk1.7.0_67
drwxrwxr-x  9 hadoop hadoop 4096 Mar 18  2014 scala-2.10.4
drwxrwxr-x 11 hadoop hadoop 4096 Aug  7  2015 spark
[hadoop@hadoop01 modules]$ cd hadoop-2.5.0/
[hadoop@hadoop01 hadoop-2.5.0]$ ll
total 28
drwxr-xr-x 2 hadoop hadoop 4096 Apr  6 15:57 bin
drwxr-xr-x 3 hadoop hadoop 4096 Apr  6 15:57 etc
drwxr-xr-x 2 hadoop hadoop 4096 Apr  6 15:57 include
drwxr-xr-x 3 hadoop hadoop 4096 Apr  6 15:57 lib
drwxr-xr-x 2 hadoop hadoop 4096 Apr  6 15:57 libexec
drwxr-xr-x 2 hadoop hadoop 4096 Apr  6 15:57 sbin
drwxr-xr-x 4 hadoop hadoop 4096 Aug  7  2014 share
[hadoop@hadoop01 hadoop-2.5.0]$ 

With the archive unpacked, check how much disk space each directory uses:

[hadoop@hadoop01 hadoop-2.5.0]$ du -sh *
424K    bin
132K    etc
60K     include
4.5M    lib
56K     libexec
120K    sbin
1.7G    share
[hadoop@hadoop01 hadoop-2.5.0]$ du -sh ./share/*
1.6G    ./share/doc
162M    ./share/hadoop
[hadoop@hadoop01 hadoop-2.5.0]$ 

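The largest directory by far is share/doc, at about 1.6 GB. It holds the official English documentation. If you do not need to read it locally, you can delete it to save space, which matters more later when the installation is copied to several nodes in a fully distributed setup: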

[hadoop@hadoop01 hadoop-2.5.0]$ rm -rf share/doc
[hadoop@hadoop01 hadoop-2.5.0]$ ll share
total 4
drwxr-xr-x 8 hadoop hadoop 4096 Aug  7  2014 hadoop
[hadoop@hadoop01 hadoop-2.5.0]$ 
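
If the same cleanup has to be repeated on several machines, a small guarded function is safer than typing rm -rf by hand. The following is a minimal sketch; remove_hadoop_docs is a hypothetical helper, and the only assumption is the share/doc layout shown above.

```shell
# Remove the bundled documentation from a Hadoop install directory, if present.
# Usage: remove_hadoop_docs /opt/modules/hadoop-2.5.0
remove_hadoop_docs() {
    local hadoop_home="$1"
    # Refuse to run without an argument so we never touch /share/doc by accident.
    if [ -z "$hadoop_home" ]; then
        echo "usage: remove_hadoop_docs HADOOP_HOME" >&2
        return 1
    fi
    local doc_dir="${hadoop_home}/share/doc"
    if [ -d "$doc_dir" ]; then
        rm -rf -- "$doc_dir"
        echo "removed ${doc_dir}"
    else
        echo "nothing to remove: ${doc_dir} not found"
    fi
}
```

The function leaves share/hadoop untouched, since that directory holds the jars Hadoop needs at runtime.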

2. Configure JAVA_HOME

Set JAVA_HOME in the Hadoop runtime environment file:

[hadoop@hadoop01 hadoop-2.5.0]$ vim etc/hadoop/hadoop-env.sh 
...
# The java implementation to use.
export JAVA_HOME=/opt/modules/jdk1.7.0_67

...
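
The same edit can be made without opening an editor, which helps when the step is scripted. The sketch below assumes hadoop-env.sh contains at most one line starting with "export JAVA_HOME="; set_java_home is a hypothetical helper name.

```shell
# Set (or replace) the JAVA_HOME export in a hadoop-env.sh style file.
# Usage: set_java_home etc/hadoop/hadoop-env.sh /opt/modules/jdk1.7.0_67
set_java_home() {
    local env_file="$1" java_home="$2"
    if grep -q '^export JAVA_HOME=' "$env_file"; then
        # Rewrite the existing export line; '|' is the sed delimiter
        # because the JDK path itself contains '/'.
        sed "s|^export JAVA_HOME=.*|export JAVA_HOME=${java_home}|" "$env_file" \
            > "${env_file}.new" && mv "${env_file}.new" "$env_file"
    else
        # No export line yet: append one.
        printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
    fi
}
```

Writing to a temporary file and moving it into place avoids relying on sed -i, whose syntax differs between GNU and BSD sed.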

Run bin/hadoop. If the usage message below appears, the configuration works.

[hadoop@hadoop01 hadoop-2.5.0]$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
       where COMMAND is one of:
  fs                   run a generic filesystem user client
  version              print the version
  jar <jar>            run a jar file
  checknative [-a|-h]  check native hadoop and compression libraries availability
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME

Most commands print help when invoked w/o parameters.
[hadoop@hadoop01 hadoop-2.5.0]$ 

3. Configure the NameNode and DataNode

In etc/hadoop/core-site.xml, configure the address that clients use to talk to HDFS by adding the following property:

[hadoop@hadoop01 hadoop-2.5.0]$ vim etc/hadoop/core-site.xml 
...
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop01.datacenter.com:8020</value>
    </property>
...

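By default, hadoop.tmp.dir is /tmp/hadoop-${user.name}, and with the default settings the NameNode keeps its image files and edit logs under that directory. Linux systems commonly clear /tmp on reboot or through periodic cleanup jobs, so to keep the NameNode working we add the hadoop.tmp.dir property and point it at a different directory: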

[hadoop@hadoop01 hadoop-2.5.0]$ mkdir -p data/tmp
[hadoop@hadoop01 hadoop-2.5.0]$ ll data/tmp
total 0
[hadoop@hadoop01 hadoop-2.5.0]$ vim etc/hadoop/core-site.xml 
...
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/modules/hadoop-2.5.0/data/tmp</value>
    </property>
...
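
Taken together, the two core-site.xml properties from this section could be written in one step. The sketch below is illustrative only: write_core_site is a hypothetical helper, not part of Hadoop, and it overwrites any existing core-site.xml in the target directory, so back the file up first if it already holds other settings.

```shell
# Write a minimal core-site.xml containing fs.defaultFS and hadoop.tmp.dir.
# Usage: write_core_site CONF_DIR DEFAULT_FS_URI TMP_DIR
write_core_site() {
    local conf_dir="$1" default_fs="$2" tmp_dir="$3"
    # The unquoted heredoc delimiter lets the shell expand the two variables.
    cat > "${conf_dir}/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>${default_fs}</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>${tmp_dir}</value>
    </property>
</configuration>
EOF
}

# Example call with the values used in this article:
# write_core_site etc/hadoop hdfs://hadoop01.datacenter.com:8020 /opt/modules/hadoop-2.5.0/data/tmp
```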

Next comes the DataNode configuration. Because this is a pseudo-distributed deployment, only one node needs to be listed:

[hadoop@hadoop01 hadoop-2.5.0]$ vim etc/hadoop/slaves 
hadoop01.datacenter.com
[hadoop@hadoop01 hadoop-2.5.0]$ 

HDFS keeps 3 replicas of each block by default. A pseudo-distributed deployment has only one DataNode, so the replication factor has to be set to 1:

[hadoop@hadoop01 hadoop-2.5.0]$ vim etc/hadoop/hdfs-site.xml
...
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
...
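
After editing several XML files by hand, it can be useful to read a value back and confirm what was actually saved. The following sketch pulls one property out of a Hadoop configuration file with awk. It assumes each <name> and its <value> sit on separate, consecutive lines, as in the usual Hadoop layout; get_hadoop_property is a hypothetical helper, not a Hadoop command.

```shell
# Print the value of a named property from a Hadoop *-site.xml file.
# Prints nothing if the property is not present.
# Usage: get_hadoop_property etc/hadoop/hdfs-site.xml dfs.replication
get_hadoop_property() {
    local file="$1" name="$2"
    awk -v key="<name>${name}</name>" '
        # Remember that we have seen the requested <name> line.
        index($0, key) { found = 1; next }
        # The next <value> line after it carries the setting.
        found && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }' "$file"
}
```

This only inspects one file. On a working installation, bin/hdfs getconf -confKey dfs.replication should report the value Hadoop actually uses, which is the more reliable check.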

4. Format the NameNode

Format the NameNode with bin/hdfs namenode -format:

[hadoop@hadoop01 hadoop-2.5.0]$ ll data/tmp
total 0
[hadoop@hadoop01 hadoop-2.5.0]$ bin/hdfs namenode -format
...

Check the NameNode's storage directory. If the image file and its accompanying metadata files are there, the format succeeded:

[hadoop@hadoop01 hadoop-2.5.0]$ ll data/tmp              
total 4
drwxrwxr-x 3 hadoop hadoop 4096 Apr  6 17:02 dfs
[hadoop@hadoop01 hadoop-2.5.0]$ ll data/tmp/dfs/name/current/
total 16
-rw-rw-r-- 1 hadoop hadoop 353 Apr  6 17:02 fsimage_0000000000000000000
-rw-rw-r-- 1 hadoop hadoop  62 Apr  6 17:02 fsimage_0000000000000000000.md5
-rw-rw-r-- 1 hadoop hadoop   2 Apr  6 17:02 seen_txid
-rw-rw-r-- 1 hadoop hadoop 206 Apr  6 17:02 VERSION
[hadoop@hadoop01 hadoop-2.5.0]$ 
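
The manual check above can be turned into a small test for use in scripts. This is a sketch based only on the files visible in the listing (VERSION, seen_txid, and at least one fsimage_* file); is_namenode_formatted is a hypothetical helper, and the check says nothing about whether the files are valid.

```shell
# Return 0 if DIR looks like a formatted NameNode "current" directory.
# Usage: is_namenode_formatted data/tmp/dfs/name/current
is_namenode_formatted() {
    local current_dir="$1"
    [ -f "${current_dir}/VERSION" ] || return 1
    [ -f "${current_dir}/seen_txid" ] || return 1
    # At least one fsimage_* checkpoint file must exist.
    ls "${current_dir}"/fsimage_* >/dev/null 2>&1
}

# Example:
# is_namenode_formatted data/tmp/dfs/name/current && echo "formatted"
```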

5. Start the HDFS services

Start the NameNode and DataNode services:

[hadoop@hadoop01 hadoop-2.5.0]$ sbin/hadoop-daemon.sh start namenode
starting namenode, logging to /opt/modules/hadoop-2.5.0/logs/hadoop-hadoop-namenode-hadoop01.datacenter.com.out
[hadoop@hadoop01 hadoop-2.5.0]$ sbin/hadoop-daemon.sh start datanode
starting datanode, logging to /opt/modules/hadoop-2.5.0/logs/hadoop-hadoop-datanode-hadoop01.datacenter.com.out
[hadoop@hadoop01 hadoop-2.5.0]$

Use jps to list the Java processes. If both NameNode and DataNode appear, the services started successfully:

[hadoop@hadoop01 hadoop-2.5.0]$ jps
4629 DataNode
4704 Jps
4550 NameNode
[hadoop@hadoop01 hadoop-2.5.0]$ 
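
To automate this check, the jps output can be piped into a filter that looks at the process-name column. The sketch below assumes the two-column "PID Name" format shown above; hdfs_daemons_running is a hypothetical helper name.

```shell
# Read `jps` output on stdin and succeed only if both HDFS daemons are listed.
# Matching the whole second column avoids counting SecondaryNameNode as NameNode.
hdfs_daemons_running() {
    awk '$2 == "NameNode" { nn = 1 }
         $2 == "DataNode" { dn = 1 }
         END { exit !(nn && dn) }'
}

# Example:
# jps | hdfs_daemons_running && echo "HDFS is up"
```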

The HDFS web interface is now reachable on port 50070. On my machine the address is http://hadoop01.datacenter.com:50070

6. Using HDFS

Creating a directory

Create a directory from a relative path with bin/hdfs dfs -mkdir:

[hadoop@hadoop01 hadoop-2.5.0]$ bin/hdfs dfs -mkdir -p etc/conf               
18/04/06 17:18:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop01 hadoop-2.5.0]$ 

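In the web interface, open the file browser through Utilities > Browse the file system.
[Image 1]
The newly created directory shows up at the following location:
[Image 2]
As the screenshot shows, the directory created from the relative path etc/conf has the absolute path /user/hadoop/etc/conf on HDFS. Like Linux, HDFS gives each user a home directory, and relative paths are resolved against it. Our home directory here is /user/hadoop.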
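
The resolution rule described above can be written down in a few lines. This sketch only mirrors the behavior observed in this article (relative paths land under /user/<username>); resolve_hdfs_path is a hypothetical helper and does not query HDFS.

```shell
# Show where an HDFS path ends up: absolute paths are kept as they are,
# relative paths are placed under the user's HDFS home directory.
# Usage: resolve_hdfs_path PATH [USER]   (USER defaults to the current user)
resolve_hdfs_path() {
    local path="$1" user="${2:-$(whoami)}"
    case "$path" in
        /*) echo "$path" ;;
        *)  echo "/user/${user}/${path}" ;;
    esac
}

resolve_hdfs_path etc/conf hadoop   # prints /user/hadoop/etc/conf
```

This is why the later commands can use either etc/conf/slaves or /user/hadoop/etc/conf/slaves to refer to the same file.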

Uploading a file

Upload a file with bin/hdfs dfs -put:

[hadoop@hadoop01 hadoop-2.5.0]$ bin/hdfs dfs -put etc/hadoop/slaves /user/hadoop/etc/conf
18/04/06 17:29:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop01 hadoop-2.5.0]$ 

Check the result in the web interface:
[Image 3]

Viewing a file

Print a file's contents with bin/hdfs dfs -cat:

[hadoop@hadoop01 hadoop-2.5.0]$ bin/hdfs dfs -cat etc/conf/slaves
18/04/06 17:32:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
hadoop01.datacenter.com
[hadoop@hadoop01 hadoop-2.5.0]$ 

Downloading a file

Download a file to the local filesystem with bin/hdfs dfs -get:

[hadoop@hadoop01 hadoop-2.5.0]$ bin/hdfs dfs -get etc/conf/slaves /home/hadoop           
18/04/06 17:34:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@hadoop01 hadoop-2.5.0]$ ll /home/hadoop
total 40
drwxrwxr-x  2 hadoop hadoop 4096 Apr  6 17:33 conf
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 14:39 Desktop
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 22:29 Documents
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 22:29 Downloads
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 22:29 Music
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 22:29 Pictures
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 22:29 Public
-rw-r--r--  1 hadoop hadoop   24 Apr  6 17:34 slaves
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 22:29 Templates
drwxr-xr-x. 2 hadoop hadoop 4096 Dec 24 22:29 Videos
[hadoop@hadoop01 hadoop-2.5.0]$ 

7. Stop the HDFS services

Stop the NameNode and DataNode services:

[hadoop@hadoop01 hadoop-2.5.0]$ sbin/hadoop-daemon.sh stop namenode
stopping namenode
[hadoop@hadoop01 hadoop-2.5.0]$ sbin/hadoop-daemon.sh stop datanode
stopping datanode
[hadoop@hadoop01 hadoop-2.5.0]$ 

Use jps to confirm that both services have stopped:

[hadoop@hadoop01 hadoop-2.5.0]$ jps
5221 Jps
[hadoop@hadoop01 hadoop-2.5.0]$ 

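Summary

1. Hadoop depends on Java to run.
2. The deployment boils down to configuring the Java environment, configuring the NameNode and DataNode, and then starting both services.
3. HDFS can be operated with commands of the form bin/hdfs dfs -[option], which work much like the local Linux file commands.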
