Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within a tolerable time frame; it is a high-volume, fast-growing, and diverse information asset that requires new processing models to deliver stronger decision-making power, insight discovery, and process optimization.
The 5V characteristics of big data (proposed by IBM): Volume, Velocity, Variety, Value, and Veracity.
http://hadoop.apache.org
Apache Hadoop is an open-source, reliable, and scalable distributed computing framework.
The Hadoop framework lets users run distributed processing of large data sets on clusters of servers. A Hadoop cluster can consist of a single machine (a pseudo-distributed cluster) or thousands of commodity servers (a fully distributed cluster), and every server in the cluster provides local computation and storage. Hadoop does not rely on hardware for high availability; instead it detects and handles failures at the application layer, which is why a Hadoop cluster can be built on inexpensive commodity servers.
HDFS (Hadoop Distributed File System) is Hadoop's distributed file system, similar in spirit to other distributed file systems. HDFS is highly fault-tolerant, can be deployed on low-cost hardware, and is particularly well suited to the distributed storage of large data sets.
It is an open-source implementation of GFS, the file system described in Google's published paper.
Build a pseudo-distributed HDFS cluster (a single machine running all of the services of an HDFS cluster)
Install CentOS
CentOS 7.2 is used here
Configure the network
# ip a  (inspect the server's current network configuration)
vi /etc/sysconfig/network-scripts/ifcfg-ens33
# set ONBOOT=yes in the configuration file
systemctl restart network
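For reference, a sketch of the key ifcfg-ens33 settings for a static address matching the 192.168.12.129 host used later in this guide; the gateway and DNS values below are assumptions and should be adapted to your network:
TYPE=Ethernet
BOOTPROTO=static
DEVICE=ens33
ONBOOT=yes
IPADDR=192.168.12.129
NETMASK=255.255.255.0
# assumed values; adjust to your environment
GATEWAY=192.168.12.2
DNS1=192.168.12.2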
Disable the firewall
[root@localhost ~]# systemctl stop firewalld
[root@localhost ~]# systemctl disable firewalld
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
Change the server's hostname
# simplifies connecting to the server
[root@localhost ~]# vi /etc/hostname
# delete localhost and replace it with hadoop (a custom hostname)
Map the hostname to the IP address
[root@localhost ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
# append the mapping between this server's IP address and its hostname as the last line
192.168.12.129 hadoop
# test
[root@localhost ~]# ping hadoop
PING hadoop (192.168.12.129) 56(84) bytes of data.
64 bytes from hadoop (192.168.12.129): icmp_seq=1 ttl=64 time=0.107 ms
64 bytes from hadoop (192.168.12.129): icmp_seq=2 ttl=64 time=0.053 ms
Configure passwordless SSH (Secure Shell) login
[root@hadoop ~]# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:/VJcuTQzpC4EDqiiEKWwwtYAqS9Von3ssc12fM+ldvQ root@hadoop
The key's randomart image is:
+---[RSA 2048]----+
|++. .. . . |
|=o+ o o . o . |
|=* * . . . B |
|B + + o o o = |
|o+ o = .S o + . |
|o . o + o .+ o |
| . . . ..o.+ . |
| .= . E|
| . . |
+----[SHA256]-----+
[root@hadoop ~]#
[root@hadoop ~]#
[root@hadoop ~]# cd .ssh/
[root@hadoop .ssh]# ll
total 12
-rw-------. 1 root root 1679 Aug 12 15:45 id_rsa
-rw-r--r--. 1 root root 393 Aug 12 15:45 id_rsa.pub
-rw-r--r--. 1 root root 183 Aug 12 15:43 known_hosts
[root@hadoop .ssh]# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
[root@hadoop .ssh]# ll
total 16
-rw-r--r--. 1 root root 393 Aug 12 15:47 authorized_keys
-rw-------. 1 root root 1679 Aug 12 15:45 id_rsa
-rw-r--r--. 1 root root 393 Aug 12 15:45 id_rsa.pub
-rw-r--r--. 1 root root 183 Aug 12 15:43 known_hosts
[root@hadoop .ssh]# chmod 0600 ~/.ssh/authorized_keys
[root@hadoop .ssh]#
[root@hadoop .ssh]# ssh hadoop
Last login: Mon Aug 12 15:43:18 2019 from 192.168.12.1
[root@hadoop ~]# rpm -ivh jdk-8u191-linux-x64.rpm
warning: jdk-8u191-linux-x64.rpm: Header V3 RSA/SHA256 Signature, key ID ec551f03: NOKEY
Preparing...                          ################################# [100%]
Updating / installing...
   1:jdk1.8-2000:1.8.0_191-fcs        ################################# [100%]
Unpacking JAR files...
tools.jar...
plugin.jar...
javaws.jar...
deploy.jar...
rt.jar...
jsse.jar...
charsets.jar...
localedata.jar...
[root@hadoop ~]# java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
[root@hadoop ~]# tar -zxf hadoop-2.6.0_x64.tar.gz -C /usr
[root@hadoop hadoop-2.6.0]# vi etc/hadoop/core-site.xml
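<!-- the properties below go inside the <configuration> element of core-site.xml -->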
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoop:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.6.0/hadoop-${user.name}</value>
</property>
[root@hadoop hadoop-2.6.0]# vi etc/hadoop/hdfs-site.xml
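<!-- the property below goes inside the <configuration> element of hdfs-site.xml;
     replication is 1 because this pseudo-distributed cluster has a single DataNode -->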
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
[root@hadoop hadoop-2.6.0]# vi etc/hadoop/slaves
hadoop
[root@hadoop ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.6.0
JAVA_HOME=/usr/java/latest
CLASSPATH=.
PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME
export CLASSPATH
export PATH
export HADOOP_HOME
[root@hadoop ~]# source .bashrc
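A quick sanity check that the new environment variables are in effect:
[root@hadoop ~]# echo $HADOOP_HOME
/usr/hadoop-2.6.0
[root@hadoop ~]# hadoop version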
[root@hadoop ~]# hdfs namenode -format
NOTE:
The format step only needs to be run once, before the HDFS cluster is started for the first time; on later starts, skip it and start the services directly.
[root@hadoop ~]# start-dfs.sh
Starting namenodes on [hadoop]
hadoop: starting namenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-namenode-hadoop.out
hadoop: starting datanode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-datanode-hadoop.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is SHA256:yDvdRHO65GeTfU6PJQjEKMap+lEZb8a/JeuesbTsMYs.
ECDSA key fingerprint is MD5:d4:bf:fe:86:d3:ed:2d:fc:5f:a2:2b:e5:86:0c:ae:ee.
Are you sure you want to continue connecting (yes/no)? yes
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/hadoop-2.6.0/logs/hadoop-root-secondarynamenode-hadoop.out
# 1. Use the JDK's jps command to view the list of Java processes
[root@hadoop ~]# jps
10995 SecondaryNameNode # the NameNode's checkpointing assistant
10796 NameNode # HDFS master
10877 DataNode # HDFS slave
# 2. Visit the HDFS web UI
http://<server address>:50070
# 3. With a distributed system, learn to read the logs
[root@hadoop hadoop-2.6.0]# cd logs/
[root@hadoop logs]# ll
total 92
-rw-r--r--. 1 root root 24249 Aug 12 16:12 hadoop-root-datanode-hadoop.log
-rw-r--r--. 1 root root 714 Aug 12 16:12 hadoop-root-datanode-hadoop.out
-rw-r--r--. 1 root root 30953 Aug 12 16:17 hadoop-root-namenode-hadoop.log
-rw-r--r--. 1 root root 714 Aug 12 16:12 hadoop-root-namenode-hadoop.out
-rw-r--r--. 1 root root 22304 Aug 12 16:13 hadoop-root-secondarynamenode-hadoop.log
-rw-r--r--. 1 root root 714 Aug 12 16:12 hadoop-root-secondarynamenode-hadoop.out
-rw-r--r--. 1 root root 0 Aug 12 16:12 SecurityAuth-root.audit
[root@hadoop logs]# stop-dfs.sh
HDFS is a distributed file system, and its shell operations closely resemble those of a Linux file system: common Linux commands such as cp, mv, rm, cat, and mkdir all have near equivalents.
Syntax: hdfs dfs -<command> (see the examples after the command listing below)
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...] # print the contents of a text file
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...] # change group ownership
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...] # change permissions
[-chown [-R] [OWNER][:[GROUP]] PATH...] # change owner
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>] # copy from local to HDFS
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>] # copy from HDFS to local
[-count [-q] [-h] <path> ...] # count directories, files, and bytes
[-cp [-f] [-p | -p[topax]] <src> ... <dst>] # copy
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>] # download
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]] # help
[-ls [-d] [-h] [-R] [<path> ...]] # list a directory
[-mkdir [-p] <path> ...] # create a directory
[-moveFromLocal <localsrc> ... <dst>] # move from local to HDFS
[-moveToLocal <src> <localdst>] # move from HDFS to local
[-mv <src> ... <dst>] # move files or directories within HDFS
[-put [-f] [-p] [-l] <localsrc> ... <dst>] # upload
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...] # delete
[-rmdir [--ignore-fail-on-non-empty] <dir> ...] # delete a directory
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>] # print the tail of a text file
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]
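A few typical invocations against the cluster built above; the file and directory names are illustrative:
[root@hadoop ~]# hdfs dfs -mkdir -p /demo
[root@hadoop ~]# hdfs dfs -put install.log /demo
[root@hadoop ~]# hdfs dfs -ls /demo
[root@hadoop ~]# hdfs dfs -cat /demo/install.log
[root@hadoop ~]# hdfs dfs -get /demo/install.log /tmp
[root@hadoop ~]# hdfs dfs -rm -r /demo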
Environment setup (using Windows as an example)
Unpack the Hadoop distribution
# e.g. unpack it to the root of the E:\ drive
Copy the Windows compatibility files (typically winutils.exe and hadoop.dll) into the bin directory of the unpacked installation
Add the hostname-to-IP mapping to the Windows hosts file
Restart the development tool (IDE)
Configure the HADOOP_HOME environment variable
Hands-on
Create a Maven project and add the HDFS client dependencies
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
</dependency>
Test code
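A minimal sketch of an HDFS client test, assuming the pseudo-distributed cluster built above is reachable at hdfs://hadoop:9000 and connecting as root (the user that started the cluster); the /demo paths and class name are illustrative:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

import java.net.URI;

public class HDFSClientTest {

    private FileSystem fileSystem;

    @Before
    public void setUp() throws Exception {
        // connect to the NameNode configured in core-site.xml, as user root
        fileSystem = FileSystem.get(new URI("hdfs://hadoop:9000"), new Configuration(), "root");
    }

    @Test
    public void testMkdirAndWrite() throws Exception {
        // create a directory and write a small file (paths are illustrative)
        fileSystem.mkdirs(new Path("/demo"));
        FSDataOutputStream out = fileSystem.create(new Path("/demo/hello.txt"));
        out.write("hello hdfs\n".getBytes("UTF-8"));
        out.close();
    }

    @After
    public void tearDown() throws Exception {
        fileSystem.close();
    }
}
Connecting with an explicit user avoids HDFS permission errors when the local OS user differs from the user that owns the cluster's directories.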