Learning HDFS in Hadoop

Lab Environment

Server List

Role          IP Address      Hostname
NameNode      192.168.3.69    namenode.abc.local
DataNode 1    192.168.3.70    datanode1.abc.local
DataNode 2    192.168.3.71    datanode2.abc.local

Environment Preparation

All three servers run a minimal installation of CentOS 6.6, with hostnames and static IP addresses configured.

A minimal CentOS 6.6 installation has no Java environment by default, so one must be installed.

Download the Java runtime installation media: jre-7u80-linux-x64.tar.gz

# tar xvfz jre-7u80-linux-x64.tar.gz
# mv jre1.7.0_80/ /opt

Set the Java environment variables in /etc/profile:

export JAVA_HOME=/opt/jre1.7.0_80
PATH=$JAVA_HOME/bin:$PATH
export PATH

Log out of the console, log back in to the server, and check the Java runtime:

[root@namenode ~]# java -version
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
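
Alternatively, the new variables can be loaded into the current shell without logging out:

# source /etc/profile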

Install the Java runtime environment on the other two servers as well.

Setting Up Passwordless SSH Among the Three Servers

The minimal CentOS install does not include scp or the SSH client programs. Install them from RPM packages:

# rpm -ivh libedit-2.11-4.20080712cvs.1.el6.x86_64.rpm
# rpm -ivh openssh-clients-5.3p1-104.el6.x86_64.rpm
libedit is a dependency of openssh-clients.

Note: if remote SSH access to the Linux server is very slow, disable the SSH server's reverse DNS lookups by adding the following line to /etc/ssh/sshd_config:

UseDNS no

Although the [UseDNS yes] line in the configuration file is commented out, the default is yes (the SSH service enables reverse DNS resolution by default).

At the same time, set up local name resolution on the SSH client by editing /etc/hosts and adding the following entries:

192.168.3.69    namenode namenode.abc.local
192.168.3.70    datanode1 datanode1.abc.local
192.168.3.71    datanode2 datanode2.abc.local
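
As a quick sanity check that the names resolve from /etc/hosts (getent ships with glibc, so it is available even on a minimal install):

# getent hosts namenode datanode1 datanode2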

Note: after installing openssh-clients locally, using scp to copy a local file to a remote host can still fail with
-bash: scp: command not found. This is because openssh-clients (which provides the scp program) must also be installed on the remote host.

On namenode, generate the machine's public/private key pair.

[root@namenode ~]# ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa): # press Enter to accept the default key file location
Created directory '/root/.ssh'.
Enter passphrase (empty for no passphrase): # press Enter, do not set a passphrase
Enter same passphrase again: # press Enter again
Your identification has been saved in /root/.ssh/id_rsa. # the generated private key file
Your public key has been saved in /root/.ssh/id_rsa.pub. # the generated public key file
The key fingerprint is:
02:e0:5b:d0:53:19:25:48:e2:61:5a:a3:14:9e:d0:a6 [email protected]
The key's randomart image is:
+--[ RSA 2048]----+
|.+Xo.o++.        |
|+B+*+ ..         |
|o=o o.           |
|E  o .           |
|  .   . S        |
|       .         |
|                 |
|                 |
|                 |
+-----------------+

Enable passwordless ssh login from the machine to itself:

# cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
authorized_keys is a newly created file; it is referenced in the SSH configuration file.

Also modify the SSH server configuration file /etc/ssh/sshd_config:

RSAAuthentication yes   # uncomment to enable RSA authentication
PubkeyAuthentication yes    # uncomment
AuthorizedKeysFile      .ssh/authorized_keys    # uncomment

Restart the SSH service.

/etc/init.d/sshd restart

To enable passwordless remote login, the public key file must be uploaded to datanode1: copy /root/.ssh/id_rsa.pub from namenode to the /tmp directory on datanode1.

# scp /root/.ssh/id_rsa.pub [email protected]:/tmp
At this point passwordless ssh login is not yet configured, so a password is still required to upload the file.

On datanode1, import namenode's public key into the SSH authorized keys file.

[root@datanode1 ~]# cat /tmp/id_rsa.pub >> /root/.ssh/authorized_keys
authorized_keys is a newly created file; it is referenced in the SSH configuration file.

Modify the SSH server configuration file /etc/ssh/sshd_config on datanode1:

RSAAuthentication yes   # uncomment to enable RSA authentication
PubkeyAuthentication yes    # uncomment
AuthorizedKeysFile      .ssh/authorized_keys    # uncomment

Restart the SSH service.

/etc/init.d/sshd restart

Verify passwordless ssh login from namenode:

[root@namenode ~]# ssh [email protected]
The authenticity of host '192.168.3.70 (192.168.3.70)' can't be established.
RSA key fingerprint is c4:1f:56:68:f8:44:c7:d9:cc:97:b9:47:1c:37:bb:a7.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.3.70' (RSA) to the list of known hosts.
Last login: Mon Aug 10 18:47:02 2015 from 192.168.3.64
[root@datanode1 ~]# 

Upload namenode's public key file to datanode2 and perform the same steps there.

Likewise, datanode1's public key must be placed on namenode and datanode2, and datanode2's public key must be placed on namenode and datanode1. All three servers must be able to ssh to one another without a password.
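
A minimal check, run on each of the three servers, that every host is reachable without a password (using the short hostnames defined in /etc/hosts):

# for host in namenode datanode1 datanode2; do ssh root@$host hostname; done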

Installing Hadoop

Download the Hadoop installation media, hadoop-2.7.1.tar.gz, upload it to namenode, and extract it to the /opt directory.

# tar xvfz hadoop-2.7.1.tar.gz -C /opt/

Under /opt/hadoop-2.7.1, create the directories that will hold the data: tmp, hdfs, hdfs/data, and hdfs/name.

[root@namenode hadoop-2.7.1]# mkdir tmp
[root@namenode hadoop-2.7.1]# mkdir hdfs
[root@namenode hadoop-2.7.1]# cd hdfs/
[root@namenode hdfs]# mkdir data
[root@namenode hdfs]# mkdir name
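
The same directory layout can also be created in a single command with brace expansion:

# mkdir -p /opt/hadoop-2.7.1/{tmp,hdfs/data,hdfs/name}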

Configure the Hadoop runtime environment in /opt/hadoop-2.7.1/etc/hadoop/hadoop-env.sh:

# The java implementation to use.
export JAVA_HOME=/opt/jre1.7.0_80

Configure the NameNode parameters in /opt/hadoop-2.7.1/etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.abc.local:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/hadoop-2.7.1/tmp</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131702</value>
    </property>
</configuration>
fs.defaultFS is set to the NameNode's URI, and io.file.buffer.size sets the buffer size used when reading and writing sequence files.

Configure the HDFS parameters in /opt/hadoop-2.7.1/etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/opt/hadoop-2.7.1/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/opt/hadoop-2.7.1/hdfs/data</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>datanode1.abc.local:9000</value>
    </property>
</configuration>
In hdfs-site.xml, dfs.webhdfs.enabled must be set to true; otherwise the WebHDFS operations that list file and directory status, such as LISTSTATUS and GETFILESTATUS, cannot be used, because this metadata is held by the namenode.
Hadoop 2.x also addresses the NameNode single point of failure; here a secondary namenode is configured on datanode1 via dfs.namenode.secondary.http-address. Note that the secondary namenode only merges the fsimage and edit logs (checkpointing); it is not a hot standby.
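
As an example, once HDFS is running, directory status can be listed over WebHDFS. This is a sketch that assumes the NameNode web UI is on its default HTTP port, 50070:

# curl -i "http://namenode.abc.local:50070/webhdfs/v1/?op=LISTSTATUS"
# curl -i "http://namenode.abc.local:50070/webhdfs/v1/input_data?op=LISTSTATUS"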

Configure the MapReduce parameters in /opt/hadoop-2.7.1/etc/hadoop/mapred-site.xml (in Hadoop 2.7.1 this file is created by copying mapred-site.xml.template):

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
Configure the YARN parameters in /opt/hadoop-2.7.1/etc/hadoop/yarn-site.xml:

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>namenode.abc.local</value>
    </property>
</configuration>
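
For start-dfs.sh to start the DataNodes on datanode1 and datanode2 (as the startup output below shows), the DataNode hostnames are presumably listed, one per line, in /opt/hadoop-2.7.1/etc/hadoop/slaves:

datanode1.abc.local
datanode2.abc.local
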
Installing Hadoop on the DataNodes

Use scp to copy the Hadoop installation from namenode to the DataNodes.

# cd /opt/
# scp -r hadoop-2.7.1 [email protected]:/opt/
# scp -r hadoop-2.7.1 [email protected]:/opt/

Starting the Hadoop HDFS Environment

On namenode, run the Hadoop start script:

# cd /opt/hadoop-2.7.1/sbin
# ./start-dfs.sh 
15/08/12 03:15:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [namenode.abc.local]
namenode.abc.local: starting namenode, logging to /opt/hadoop-2.7.1/logs/hadoop-root-namenode-namenode.abc.local.out
datanode2.abc.local: starting datanode, logging to /opt/hadoop-2.7.1/logs/hadoop-root-datanode-datanode2.abc.local.out
datanode1.abc.local: starting datanode, logging to /opt/hadoop-2.7.1/logs/hadoop-root-datanode-datanode1.abc.local.out
Starting secondary namenodes [datanode1.abc.local]
datanode1.abc.local: starting secondarynamenode, logging to /opt/hadoop-2.7.1/logs/hadoop-root-secondarynamenode-datanode1.abc.local.out
15/08/12 03:15:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Working with HDFS

Formatting HDFS

Run the command:

# cd /opt/hadoop-2.7.1/bin
# ./hdfs namenode -format
..........................
15/08/12 03:48:58 INFO namenode.FSImage: Allocated new BlockPoolId: BP-486254444-192.168.3.69-1439322538827
15/08/12 03:48:59 INFO common.Storage: Storage directory **/opt/hadoop-2.7.1/hdfs/name** has been successfully formatted.
15/08/12 03:48:59 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
15/08/12 03:48:59 INFO util.ExitUtil: Exiting with status 0
15/08/12 03:48:59 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at namenode.abc.local/192.168.3.69
************************************************************/
After this command, the Hadoop process on namenode has been shut down, but the Hadoop processes on datanode1 and datanode2 are still running; they can be stopped with stop-dfs.sh.

After the command finishes, files are generated in the /opt/hadoop-2.7.1/hdfs/name directory on namenode.

[root@namenode name]# tree
.
└── current
    ├── fsimage_0000000000000000000
    ├── fsimage_0000000000000000000.md5
    ├── seen_txid
    └── VERSION

Restart the HDFS system.
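
The restart is done with stop-dfs.sh followed by start-dfs.sh under /opt/hadoop-2.7.1/sbin. As an optional check after the restart, the cluster state can be queried from namenode; the NameNode web UI should also be reachable at http://namenode.abc.local:50070 (the default HTTP port):

# cd /opt/hadoop-2.7.1/bin
# ./hdfs dfsadmin -report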

Examine the files in the /opt/hadoop-2.7.1/hdfs/name directory on namenode:

[root@namenode name]# tree
.
├── current
│   ├── edits_0000000000000000001-0000000000000000002
│   ├── edits_0000000000000000003-0000000000000000004
│   ├── edits_0000000000000000005-0000000000000000006
│   ├── edits_0000000000000000007-0000000000000000008
│   ├── edits_0000000000000000009-0000000000000000010
│   ├── edits_0000000000000000011-0000000000000000012
│   ├── edits_0000000000000000013-0000000000000000014
│   ├── edits_0000000000000000015-0000000000000000016
│   ├── edits_0000000000000000017-0000000000000000018
│   ├── edits_0000000000000000019-0000000000000000020
│   ├── edits_0000000000000000021-0000000000000000022
│   ├── edits_0000000000000000023-0000000000000000024
│   ├── edits_0000000000000000025-0000000000000000026
│   ├── edits_0000000000000000027-0000000000000000028
│   ├── edits_0000000000000000029-0000000000000000030
│   ├── edits_0000000000000000031-0000000000000000032
│   ├── edits_0000000000000000033-0000000000000000034
│   ├── edits_0000000000000000035-0000000000000000036
│   ├── edits_0000000000000000037-0000000000000000038
│   ├── edits_0000000000000000039-0000000000000000040
│   ├── edits_0000000000000000041-0000000000000000042
│   ├── edits_0000000000000000043-0000000000000000044
│   ├── edits_0000000000000000045-0000000000000000046
│   ├── edits_0000000000000000047-0000000000000000047
│   ├── edits_inprogress_0000000000000000048
│   ├── fsimage_0000000000000000046
│   ├── fsimage_0000000000000000046.md5
│   ├── fsimage_0000000000000000047
│   ├── fsimage_0000000000000000047.md5
│   ├── seen_txid
│   └── VERSION
└── in_use.lock # this file indicates that the NameNode has been started
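
The fsimage and edits files are binary, but Hadoop ships offline viewers for them. A sketch (the transaction IDs in the file names will differ from run to run):

# cd /opt/hadoop-2.7.1/bin
# ./hdfs oiv -p XML -i /opt/hadoop-2.7.1/hdfs/name/current/fsimage_0000000000000000047 -o /tmp/fsimage.xml
# ./hdfs oev -i /opt/hadoop-2.7.1/hdfs/name/current/edits_0000000000000000001-0000000000000000002 -o /tmp/edits.xml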

Examine the files in the /opt/hadoop-2.7.1/hdfs/data directory on datanode1 and datanode2.

[root@datanode1 data]# tree
.
├── current
│   ├── BP-486254444-192.168.3.69-1439322538827
│   │   ├── current
│   │   │   ├── dfsUsed
│   │   │   ├── finalized
│   │   │   ├── rbw
│   │   │   └── VERSION
│   │   ├── scanner.cursor
│   │   └── tmp
│   └── VERSION
└── in_use.lock # this file indicates that the DataNode has been started

Because datanode1 is configured as the secondary namenode, files are also generated under its /opt/hadoop-2.7.1/tmp directory.

[root@datanode1 tmp]# tree
.
└── dfs
    └── namesecondary
        ├── current
        │   ├── edits_0000000000000000001-0000000000000000002
        │   ├── edits_0000000000000000003-0000000000000000004
        │   ├── edits_0000000000000000005-0000000000000000006
        │   ├── edits_0000000000000000007-0000000000000000008
        │   ├── edits_0000000000000000009-0000000000000000010
        │   ├── edits_0000000000000000011-0000000000000000012
        │   ├── edits_0000000000000000013-0000000000000000014
        │   ├── edits_0000000000000000015-0000000000000000016
        │   ├── edits_0000000000000000017-0000000000000000018
        │   ├── edits_0000000000000000019-0000000000000000020
        │   ├── edits_0000000000000000021-0000000000000000022
        │   ├── edits_0000000000000000023-0000000000000000024
        │   ├── edits_0000000000000000025-0000000000000000026
        │   ├── edits_0000000000000000027-0000000000000000028
        │   ├── edits_0000000000000000029-0000000000000000030
        │   ├── edits_0000000000000000031-0000000000000000032
        │   ├── edits_0000000000000000033-0000000000000000034
        │   ├── edits_0000000000000000035-0000000000000000036
        │   ├── edits_0000000000000000037-0000000000000000038
        │   ├── edits_0000000000000000039-0000000000000000040
        │   ├── edits_0000000000000000041-0000000000000000042
        │   ├── edits_0000000000000000043-0000000000000000044
        │   ├── edits_0000000000000000045-0000000000000000046
        │   ├── edits_0000000000000000048-0000000000000000049
        │   ├── fsimage_0000000000000000047
        │   ├── fsimage_0000000000000000047.md5
        │   ├── fsimage_0000000000000000049
        │   ├── fsimage_0000000000000000049.md5
        │   └── VERSION
        └── in_use.lock

Putting Files into HDFS

Create a test file:

# mkdir -p /root/input_data
# cd /root/input_data/
# echo "This is a test." >> test_data.txt

Run the hadoop command to put the file into HDFS:

# cd /opt/hadoop-2.7.1/bin/
# ./hadoop fs -put /root/input_data/ /input_data
15/08/13 03:16:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

This copies the files under the local /root/input_data directory into the /input_data directory in HDFS.

Run the hadoop command to list the files:

# ./hadoop fs -ls /input_data
15/08/13 03:20:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   2 root supergroup         16 2015-08-13 03:20 /input_data/test_data.txt
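
The file can be read back, or copied out of HDFS, with the cat and get subcommands:

# ./hadoop fs -cat /input_data/test_data.txt
# ./hadoop fs -get /input_data/test_data.txt /tmp/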

Commands for Working with HDFS

# ./hadoop fs 
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-find <path> ... <expression> ...]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
**[-ls [-d] [-h] [-R] [<path> ...]]**
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
**[-put [-f] [-p] [-l] <localsrc> ... <dst>]**
[-renameSnapshot <snapshotDir> <oldName> <newName>]
**[-rm [-f] [-r|-R] [-skipTrash] <src> ...]**
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-truncate [-w] <length> <path> ...]
[-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
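
For example, the generic options can be combined with any file system command. A sketch (the -fs URI and the replication factor of 3 are only illustrations):

# ./hadoop fs -fs hdfs://namenode.abc.local:9000 -ls /
# ./hadoop fs -D dfs.replication=3 -put -f /root/input_data/test_data.txt /input_data/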
