A knowledge summary I put together over many late nights; I hope it helps.
Big data refers to information assets so massive, fast-growing, and diverse that new processing models are required to gain stronger decision-making power, insight discovery, and process-optimization capability from them.
Data storage
: a single machine's storage is limited; how can massive data be stored? (distributed systems, clusters)
Data analysis
: a single machine's compute power is limited; how can the computation be completed at reasonable cost within a reasonable time?
The 4 Vs: Volume (amount of data), Velocity (timeliness), Variety (diversity), Value
B-KB-MB-GB-TB-PB-EB-ZB…
The various cloud storage services (Baidu Cloud, Tencent Weiyun, OneDrive, etc.) show that existing hardware resources can support sufficiently large data sets.
Big data is produced rapidly within a short time, which requires collectors to gather and store it quickly, and to finish analyzing it soon afterwards (within a reasonable time).
Structured data: SQL tables, text
Unstructured data: video, audio, images
Geographic location: from Shanghai, Beijing, ...
Device information: PC, phone, watch, fitness band
Personal preferences: beauty, face masks, lipstick, graphics cards, digital gadgets
Social network: A may know B and C; B may know C
Phone numbers: 100, 10086
Network identity: device MAC + IP + geographic location + phone number
The police: only care whether a rule was broken
Big data developers: only care about the data itself (and discard useless data)
AI research: AlphaGo (only needs Go game records)
So extracting the useful data from the massive whole is the most critical part; this is the first step of data analysis: noise reduction (data preprocessing / data cleaning).
Based on user preferences, anticipate what the user wants before they ask for it.
Real-time big data stream processing, backed by a model of user behavior, judges whether an operation is normal.
From present-day data, analyze the climate anomalies of past years (even ancient times) and forecast future weather.
Self-driving cars: Baidu, Google, Tesla
Voice assistants: Xiao Ai, Siri
To solve the storage and analysis problems, a single machine is simply not enough, so the solution is to store and compute on a cluster composed of many machines.
With the hardware resources in place, how is this implemented in software?
Hadoop: a distributed storage and computing platform suited to big data.
Hadoop is not one specific framework or component; it is an open-source distributed computing platform developed in Java under the Apache Software Foundation. It performs distributed computation over massive data on clusters made up of large numbers of machines.
Hadoop 1.x contains two core components: MapReduce and the Hadoop Distributed File System (HDFS).
HDFS stores massive data in a distributed fashion, while MapReduce computes over that data and aggregates the results.
HDFS
: Hadoop Distributed File System
MapReduce
: the distributed computing framework in Hadoop, enabling parallel analysis and computation over massive data
HBase
: a column-oriented NoSQL database
Hive
: a SQL interpretation engine that translates SQL statements into MR code and runs them on the cluster
Flume
: a distributed log collection system
Kafka
: a message queue; a distributed messaging system
Zookeeper
: a distributed coordination service, used for service registries, configuration centers, cluster leader election, status monitoring, and distributed locks
MapReduce
: represents disk-based, offline batch processing of static big data
Spark
: represents memory-based, offline batch processing of static big data
Storm, Spark Streaming, Flink, Kafka Streams
: real-time stream processing, achieving millisecond-level handling of record-level data
Solving the storage problem
Set the IP address (eth0)
Remove the MAC address entry
Set the hostname in /etc/sysconfig/network
reboot
# Extract to the target directory and configure the environment variables
export JAVA_HOME=/home/java/jdk1.8.0_181
export PATH=$PATH:$JAVA_HOME/bin
[root@HadoopNode00 ~]# vi /etc/sysconfig/network # set the hostname; if it was not configured before, remember to reboot
NETWORKING=yes
HOSTNAME=HadoopNode00
[root@HadoopNode00 ~]# vi /etc/hosts # map the IP to the hostname
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.126.10 HadoopNode00
[root@HadoopNode00 ~]# service iptables stop # stop the firewall
[root@HadoopNode00 ~]# chkconfig iptables off # disable firewall autostart
SSH is short for [Secure Shell](https://baike.baidu.com/item/Secure%20Shell), a protocol defined by the IETF Network Working Group; SSH is a security protocol built on top of the application layer.
Password-based authentication
: based on a username and password
Key-based authentication
: relies on a key pair; before connecting, create a key pair and place the public key on the server you need to access
[root@HadoopNode00 .ssh]# ssh-keygen -t rsa # generate a public/private key pair
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
39:e3:a3:d6:5a:19:4f:c5:02:01:e0:00:a9:f0:5c:4a root@HadoopNode00
The key's randomart image is:
+--[ RSA 2048]----+
|.o. ....o. |
|o Eo. . . |
|o+ o. . o |
|. + . o |
| S . |
| . B |
| .= . |
| .o.. |
| .o. |
+-----------------+
[root@HadoopNode00 .ssh]# ssh-copy-id HadoopNode00 # copy the local public key to HadoopNode00
The authenticity of host 'hadoopnode00 (192.168.126.10)' can't be established.
RSA key fingerprint is 4d:18:40:3d:24:1a:85:ce:ea:3c:a2:76:85:47:e8:12.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'hadoopnode00,192.168.126.10' (RSA) to the list of known hosts.
root@hadoopnode00's password:
Now try logging into the machine, with "ssh 'HadoopNode00'", and check in:
.ssh/authorized_keys
to make sure we haven't added extra keys that you weren't expecting.
[root@HadoopNode00 .ssh]# ssh hadoopnode00 # passwordless login to hadoopnode00
Last login: Thu Oct 24 19:30:44 2019 from 192.168.126.1
[root@HadoopNode00 ~]# exit;
logout
Connection to hadoopnode00 closed.
[root@HadoopNode00 ~]# tar -zxvf hadoop-2.6.0.tar.gz -C /home/hadoop/
hadoop-2.6.0/
hadoop-2.6.0/share/
[root@HadoopNode00 ~]# vi .bashrc
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
[root@HadoopNode00 ~]# source .bashrc
HADOOP_HOME
This environment variable is relied upon by third parties: when HBase, Hive, Flume, or Spark integrate with Hadoop, they locate Hadoop by reading HADOOP_HOME.
core-site.xml, located under etc/hadoop/ in the Hadoop installation root:
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://HadoopNode00:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/hadoop-2.6.0/hadoop-${user.name}</value>
    </property>
</configuration>
hdfs-site.xml, located under etc/hadoop/ in the Hadoop installation root:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
# Only needed on the very first start
[root@HadoopNode00 ~]# hdfs namenode -format # format the NameNode
[root@HadoopNode00 ~]# tree /home/hadoop/hadoop-2.6.0/hadoop-root # inspect the directory structure
/home/hadoop/hadoop-2.6.0/hadoop-root
└── dfs
└── name
└── current
├── fsimage_0000000000000000000
├── fsimage_0000000000000000000.md5
├── seen_txid
└── VERSION
3 directories, 4 files
[root@HadoopNode00 ~]# start-dfs.sh # start HDFS
[root@HadoopNode00 ~]# jps # check the related processes
13779 DataNode
13669 NameNode
14201 SecondaryNameNode
14413 Jps
[root@HadoopNode00 ~]# stop-dfs.sh # stop HDFS
[root@HadoopNode00 ~]# jps # check the related processes
15323 Jps
If installation and startup succeeded, the web UI shows information about the current node:
hostname|IP:50070
On Windows, remember to map the hostname to the IP in
C:\Windows\System32\drivers\etc\hosts
192.168.126.10 HadoopNode00
[root@HadoopNode00 ~]# hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
fs run a generic filesystem user client
version print the version
jar <jar> run a jar file
checknative [-a|-h] check native hadoop and compression libraries availability
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
classpath prints the class path needed to get the
credential interact with credential providers
Hadoop jar and the required libraries
daemonlog get/set the log level for each daemon
trace view and modify Hadoop tracing settings
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
[root@HadoopNode00 ~]# hadoop fs
Usage: hadoop fs [generic options]
[-appendToFile <localsrc> ... <dst>]
[-cat [-ignoreCrc] <src> ...]
[-checksum <src> ...]
[-chgrp [-R] GROUP PATH...]
[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
[-chown [-R] [OWNER][:[GROUP]] PATH...]
[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-count [-q] [-h] <path> ...]
[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
[-createSnapshot <snapshotDir> [<snapshotName>]]
[-deleteSnapshot <snapshotDir> <snapshotName>]
[-df [-h] [<path> ...]]
[-du [-s] [-h] <path> ...]
[-expunge]
[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
[-getfacl [-R] <path>]
[-getfattr [-R] {-n name | -d} [-e en] <path>]
[-getmerge [-nl] <src> <localdst>]
[-help [cmd ...]]
[-ls [-d] [-h] [-R] [<path> ...]]
[-mkdir [-p] <path> ...]
[-moveFromLocal <localsrc> ... <dst>]
[-moveToLocal <src> <localdst>]
[-mv <src> ... <dst>]
[-put [-f] [-p] [-l] <localsrc> ... <dst>]
[-renameSnapshot <snapshotDir> <oldName> <newName>]
[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
[-setfattr {-n name [-v value] | -x name} <path>]
[-setrep [-R] [-w] <rep> <path> ...]
[-stat [format] <path> ...]
[-tail [-f] <file>]
[-test -[defsz] <path>]
[-text [-ignoreCrc] <src> ...]
[-touchz <path> ...]
[-usage [cmd ...]]
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
[root@HadoopNode00 ~]# hadoop fs -put 1.txt /
[root@HadoopNode00 ~]# hadoop fs -get /1.txt /root/2.txt
[root@HadoopNode00 ~]# hadoop fs -ls /
Found 1 items
-rw-r--r-- 1 root supergroup 8917 2019-10-24 23:44 /1.txt
[root@HadoopNode00 ~]# hadoop fs -cp /1.txt /2.txt
[root@HadoopNode00 ~]# hadoop fs -mkdir /baizhi
[root@HadoopNode00 ~]# hadoop fs -moveFromLocal 1.txt.tar.gz /
[root@HadoopNode00 ~]# hadoop fs -copyToLocal /1.txt /root/3.txt
[root@HadoopNode00 ~]# hadoop fs -rm -r -f /1.txt
[root@HadoopNode00 ~]# hadoop fs -rmdir /baizhi
[root@HadoopNode00 ~]# hadoop fs -cat /2.txt
[root@HadoopNode00 ~]# hadoop fs -tail -f /2.txt
[root@HadoopNode00 ~]# hadoop fs -appendToFile 1.txt /2.txt
core-site.xml
<property>
    <name>fs.trash.interval</name>
    <value>1</value>
</property>
Remember to restart HDFS afterwards.
[root@HadoopNode00 ~]# hadoop fs -rm -r -f /1.txt # delete the file; it is purged from trash one minute later
19/10/25 00:16:59 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1 minutes, Emptier interval = 0 minutes.
Moved: 'hdfs://HadoopNode00:9000/1.txt' to trash at: hdfs://HadoopNode00:9000/user/root/.Trash/Current
[root@HadoopNode00 ~]# hadoop fs -ls /user/root/.Trash/191025001700 # the file is visible in the corresponding trash folder
Found 1 items
-rw-r--r-- 1 root supergroup 8917 2019-10-25 00:16 /user/root/.Trash/191025001700/1.txt
[root@HadoopNode00 ~]# hadoop fs -ls /user/root/.Trash/191025001700 # one minute later the file is gone
ls: `/user/root/.Trash/191025001700': No such file or directory
If the following warnings appear while starting or using HDFS:
Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/hadoop/hadoop-2.6.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
19/10/25 00:06:15 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Replace all the files under lib/native in the installed Hadoop with the corresponding files from a Hadoop 2.6.0 build compiled for your platform.
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.6.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.6.0</version>
    </dependency>
</dependencies>
The Hadoop version used here must match the installed version.
Place winutils.exe and hadoop.dll into Hadoop's bin directory.
If a permission error like the following then appears:
org.apache.hadoop.security.AccessControlException: Permission denied: user=Administrator, access=WRITE, inode="/":root:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:238)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:179)
// Option 1: set the user as a system property in code
System.setProperty("HADOOP_USER_NAME","root");
// Option 2: pass a -D parameter to the JVM
-DHADOOP_USER_NAME=root
Alternatively, disable permission checking in
hdfs-site.xml
<property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
    <description>
        If "true", enable permission checking in HDFS.
        If "false", permission checking is turned off,
        but all other behavior is unchanged.
        Switching from one parameter value to the other does not change the mode,
        owner or group of files or directories.
    </description>
</property>
private Configuration conf;
private FileSystem fileSystem;
@Before
public void getClient() throws Exception {
//System.setProperty("HADOOP_USER_NAME","root");
conf = new Configuration();
conf.addResource("core-site.xml");
conf.addResource("hdfs-site.xml");
fileSystem = FileSystem.newInstance(conf);
}
@After
public void close() throws Exception {
fileSystem.close();
}
@Test
public void testUpload01() throws Exception {
/*
* Upload: copy a local file to HDFS
* */
Path localFile = new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\Hadoop.md");
Path hdfsFile = new Path("/3.md");
fileSystem.copyFromLocalFile(localFile, hdfsFile);
}
@Test
public void testUpload02() throws Exception{
/*
*
* input stream over the local file
* */
FileInputStream inputStream = new FileInputStream(new File("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\Hadoop.md"));
/*
* output stream to the HDFS file
* */
FSDataOutputStream outputStream = fileSystem.create(new Path("/4.md"));
IOUtils.copyBytes(inputStream,outputStream,1024,true);
}
@Test
public void testDownload01() throws Exception{
fileSystem.copyToLocalFile(false,new Path("/4.md"),new Path("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\5.md"),true);
}
@Test
public void testDownload02() throws Exception{
/*
* output stream to the local file
* */
FileOutputStream outputStream = new FileOutputStream(new File("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\6.md"));
FSDataInputStream inputStream = fileSystem.open(new Path("/4.md"));
IOUtils.copyBytes(inputStream,outputStream,1024,true);
}
@Test
public void testDelete() throws Exception{
boolean delete = fileSystem.delete(new Path("/1.md"), true);
if (delete){
System.out.println("delete succeeded");
}else {
System.out.println("delete failed");
}
}
@Test
public void testExists() throws Exception{
boolean exists = fileSystem.exists(new Path("/123122.md"));
if (exists){
System.out.println("exists");
}else {
System.out.println("does not exist");
}
}
@Test
public void testListFile() throws Exception{
RemoteIterator<LocatedFileStatus> remoteIterator = fileSystem.listFiles(new Path("/"), true);
while (remoteIterator.hasNext()){
LocatedFileStatus fileStatus = remoteIterator.next();
System.out.println(fileStatus.getPath());
}
}
@Test
public void testMkdir() throws Exception{
boolean is = fileSystem.mkdirs(new Path("/baizhi"));
if (is) {
System.out.println("directory created");
} else {
System.out.println("directory creation failed");
}
}
http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
namenode
: stores the file system's metadata (data that describes the data), e.g. the file namespace and the block-to-DataNode mapping; manages the DataNodes
datanode
: a node that stores the actual data; serves client read and write requests and reports block information to the NameNode
block
: a data block is the smallest unit into which a file is split, a 128 MB slice by default; every block is replicated, with a default replication factor of 3 configured via dfs.replication; the block size itself can be set via dfs.blocksize
rack
: racks physically organize the storage nodes, which is used to optimize storage and computation
# view the rack topology
[root@HadoopNode00 ~]# hdfs dfsadmin -printTopology
Rack: /default-rack
192.168.126.10:50010 (HadoopNode00)
<property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
    <description>
        The default block size for new files, in bytes.
        You can use the following suffix (case insensitive):
        k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size (such as 128k, 512m, 1g, etc.),
        Or provide complete size in bytes (such as 134217728 for 128 MB).
    </description>
</property>
In Hadoop 1.x the block size was 64 MB (due to industry constraints of the day).
Hardware constraint: cheap PCs with slow mechanical disks.
Software heuristic: the optimum is usually taken to be a seek time of about 1/100 of the transfer time.
Could the block be smaller? No: if the block size is set too small, the millions of small files in the cluster inflate seek time and efficiency drops.
If it is too large, space is wasted and individual reads and writes take too long, so efficiency is again low.
Whatever fits the workload is best.
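The 1/100 heuristic above can be turned into a back-of-the-envelope calculation. The seek time and transfer rate below are assumed figures for a commodity mechanical disk, not measurements:

```java
public class BlockSizeEstimate {
    public static void main(String[] args) {
        double seekTimeMs = 10.0;        // assumed average seek time of a mechanical disk
        double transferRateMBps = 100.0; // assumed sequential transfer rate, MB/s

        // Heuristic: seek time should be ~1/100 of transfer time,
        // so transferring one block should take about 100 * 10 ms = 1 s.
        double transferTimeMs = seekTimeMs * 100;
        double blockSizeMB = transferRateMBps * (transferTimeMs / 1000.0);

        System.out.println(blockSizeMB + " MB"); // 100.0 MB, rounded up to the 128 MB default
    }
}
```

Under these assumptions the ideal block is about 100 MB, which is why the default landed on the next power-of-two-friendly value, 128 MB.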
When configuring a cluster, you must manually declare which machines sit in which racks so that Hadoop knows how the nodes in the cluster are distributed.
When a file is stored with the default replication factor of 3, the first replica goes on the local machine in the local rack, the second on another node in the same rack, and the third on a node outside the local rack.
So one third of the replicas sit on one node, two thirds of the replicas sit in one rack, and the remaining third is spread evenly across the other racks.
FSImage: a snapshot (backup) of the metadata; it is loaded into memory
https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html
Edits: the edits file records file create and update operations, improving write efficiency
https://hadoop.apache.org/docs/r2.7.7/hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html
When the NameNode starts, it must load the edits file and the fsimage file, which is why the NameNode has to be formatted before the very first start.
When users upload or download files, the operations are recorded in the edits file, so the edits file plus the fsimage file always add up to the latest metadata.
As users keep operating, the edits file inevitably grows too large, which makes the next cluster start take too long (both files must be loaded).
To solve this, the edits file and the fsimage file can be merged.
The NameNode cannot merge the metadata itself; that task is handed to the SecondaryNameNode, which pulls the NameNode's current edits and fsimage files to its own node, merges them using its own compute resources, and uploads the newest fsimage back to the NameNode, which then loads the latest metadata.
If new client operations arrive while the merge is running, they would modify the edits file, and modifying it mid-merge could corrupt its data. How is this solved? New operations are written to a file called edits-inprogress; once the merge completes, edits-inprogress is renamed to become the current edits file.
dfs.namenode.checkpoint.period (default: 1 hour) specifies the maximum delay between two consecutive checkpoints:
<property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
    <description>The number of seconds between two periodic checkpoints.
    </description>
</property>
dfs.namenode.checkpoint.txns (default: 1,000,000) defines the number of uncheckpointed transactions on the NameNode that will force an emergency checkpoint even if the checkpoint period has not yet elapsed:
<property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
    <description>The Secondary NameNode or CheckpointNode will create a checkpoint
    of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless
    of whether 'dfs.namenode.checkpoint.period' has expired.
    </description>
</property>
HDFS enters safe mode by default when it starts, and exits it automatically once it finds that the vast majority of blocks are available.
Safe mode is the HDFS cluster's read-only mode; to put HDFS into safe mode explicitly, use the
hdfs dfsadmin -safemode command.
[root@HadoopNode00 ~]# hadoop fs -put 1.txt /
[root@HadoopNode00 ~]# hadoop fs -ls /
Found 6 items
-rw-r--r-- 1 root supergroup 8917 2019-10-25 23:27 /1.txt
-rw-r--r-- 1 root supergroup 16946 2019-10-25 18:01 /2.md
-rw-r--r-- 1 Administrator supergroup 18279 2019-10-25 18:07 /3.md
-rw-r--r-- 1 Administrator supergroup 18279 2019-10-25 18:16 /4.md
drwxr-xr-x - Administrator supergroup 0 2019-10-25 18:37 /baizhi
drwx------ - root supergroup 0 2019-10-25 00:15 /user
[root@HadoopNode00 ~]# hdfs dfsadmin -safemode enter # enter safe mode
Safe mode is ON
[root@HadoopNode00 ~]# hadoop fs -put 1.txt /2.txt
put: Cannot create file/2.txt._COPYING_. Name node is in safe mode.
[root@HadoopNode00 ~]# hdfs dfsadmin -safemode leave # leave safe mode
Safe mode is OFF
[root@HadoopNode00 ~]# hadoop fs -put 1.txt /2.txt
| File | NameNode memory usage | DataNode disk usage |
| --- | --- | --- |
| one 128 MB file | 1 KB of metadata | 128 MB × 3 |
| 1000 files of 128 KB | 1000 × 1 KB ≈ 1 MB of metadata | 128 MB × 3 |
Because the NameNode keeps all metadata on a single machine, storing too many small files puts its memory under pressure.
How can the small-file storage problem be solved?
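One common mitigation is a sketch like the following (it assumes the single-node HDFS set up above is running, that YARN is started, and that /baizhi holds the small files): pack the small files into a Hadoop Archive (HAR), so the NameNode tracks one archive index instead of thousands of individual entries.

```shell
# Pack everything under /baizhi into one HAR file stored under /archived.
# This launches a MapReduce job, so YARN must be running.
hadoop archive -archiveName small.har -p /baizhi /archived

# The original files remain readable through the har:// scheme.
hadoop fs -ls har:///archived/small.har
```

Another option along the same lines is to merge many small files into a single SequenceFile; either way, the point is to reduce the number of namespace entries the NameNode must hold in memory.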
MapReduce is a programming model for parallel computation over large data sets (larger than 1 TB). Its core ideas, "Map" and "Reduce", are borrowed from functional programming languages, with further features borrowed from vector programming languages. It lets programmers run their programs on a distributed system without knowing anything about distributed parallel programming. Current implementations take a Map function, which turns a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function, which ensures that all mapped pairs sharing the same key are grouped together.
MapReduce is a parallel computing framework that splits a task (Job) into two phases, Map and Reduce, and makes full use of the compute resources (CPU / memory / network / a little disk) of the physical hosts holding the storage nodes. Running MapReduce requires starting YARN on the cluster, which launches the corresponding processes: each node runs a NodeManager that manages and allocates that node's compute resources, and by default a NodeManager abstracts its physical host into 8 compute units, each called a Container; all NodeManagers take scheduling orders from the ResourceManager.
MapReduce excels at processing big data because the idea behind Map is "divide and conquer".
package com.baizhi;
import java.io.*;
public class CleanApp {
public static void main(String[] args) throws IOException {
File file = new File("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\access.tmp2019-05-19-10-28.log");
BufferedReader bufferedReader = new BufferedReader(new FileReader(file));
FileWriter fileWriter = new FileWriter("D:\\大数据训练营课程\\Note\\Day02-Hadoop\\数据文件\\clean_access.tmp2019-05-19-10-28.log");
while (true) {
String line = bufferedReader.readLine();
if (line == null) break;
if (line.contains("thisisshortvideoproject'slog")) {
String s = line.split("thisisshortvideoproject'slog")[0];
fileWriter.write(s + "\n");
fileWriter.flush();
}
}
// close the streams once the whole file has been processed
fileWriter.close();
bufferedReader.close();
}
}
The code above does simple log processing. With small data volumes there is no problem, but once the volume grows, the run time inevitably becomes too long to produce results within a reasonable window. There are two ways to raise compute capacity: (1) upgrade the current machine's hardware vertically, which is clearly costly and offers no fault tolerance when the node fails; (2) scale out horizontally with a cluster of many machines, which gives good compute capacity and allows proper fault-tolerance mechanisms.
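The "divide and conquer" idea can be sketched on a single machine first: the flatMap step plays the role of Map (emit one record per word) and the grouping/counting step plays the role of Reduce. This is only an illustration of the model, not Hadoop code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MiniWordCount {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c");

        // "Map" phase: split each line into words; "Reduce" phase: count per word.
        Map<String, Long> counts = lines.stream()
                .flatMap(l -> Arrays.stream(l.split(" ")))
                .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));

        System.out.println(counts); // {a=2, b=2, c=1}
    }
}
```

The distributed WC job later in these notes performs exactly this computation, except the "flatMap" runs as many MapTasks and the "groupingBy" as ReduceTasks across the cluster.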
The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent who is responsible for containers, monitoring their resource usage (cpu, memory, disk, network) and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
NodeManager
: manages the compute resources on its host; the per-machine framework agent, reporting its own status to the RM
ResourceManager
: plans the cluster's compute resources as a whole and holds final authority over them
ApplicationMaster
: the master of one computing job; requests resources and coordinates the job's tasks
YarnChild
: performs the actual computation (MapTask | ReduceTask)
Container
: the abstraction of compute resources, representing a slice of memory, CPU, and network; the ApplicationMaster and every YarnChild each consume a Container
etc/hadoop/yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>HadoopNode00</value>
</property>
etc/hadoop/mapred-site.xml
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
[root@HadoopNode00 ~]# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-root-resourcemanager-HadoopNode00.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.6.0/logs/yarn-root-nodemanager-HadoopNode00.out
[root@HadoopNode00 ~]# stop-yarn.sh
stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
[root@HadoopNode00 ~]# jps
43856 Jps
43362 ResourceManager
2403 NameNode
43636 NodeManager
2776 SecondaryNameNode
2556 DataNode
Visit: http://hadoopnode00:8088/cluster
http://hostname:8088
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>2.6.0</version>
</dependency>
package com.baizhi.mr.test01;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/*
*
* keyIn   LongWritable (Long): byte offset of the line within the input file
* valueIn Text (String): the line of text itself
* */
public class WCMapper extends Mapper<LongWritable, Text,Text, IntWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
/*
* split the line into individual words
* */
String[] words = value.toString().split(" ");
for (String word : words) {
context.write(new Text(word),new IntWritable(1));
}
}
}
package com.baizhi.mr.test01;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
/*
*
* (key,value)
*
* */
public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum+=value.get();
}
context.write(key,new IntWritable(sum));
}
}
package com.baizhi.mr.test01;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WCJob {
public static void main(String[] args) throws Exception {
/*
* 1. Build the Job object
* */
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "WC-JOB01");
job.setJarByClass(WCJob.class);
/*
* 2. Set the input and output formats
* */
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
/*
*
* 3. Set the input and output paths
* */
TextInputFormat.setInputPaths(job, new Path("/wcdata.txt"));
/*
* Note: the specified output directory must not already exist
* */
TextOutputFormat.setOutputPath(job, new Path("/test01"));
/*
*
* 4. Set the computation logic
* */
job.setMapperClass(WCMapper.class);
job.setReducerClass(WCReducer.class);
/*
* 5. Set the output key/value types of the Mapper and Reducer
* */
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
/*
* 6. Submit the job
* */
//job.submit();
job.waitForCompletion(true);
}
}
// set the jar class loader
job.setJarByClass(WCJob.class);
Package it with Maven.
Details:
<packaging>jar</packaging>
Run clean first, then package.
hadoop jar Hadoop_Test-1.0-SNAPSHOT.jar com.baizhi.mr.test01.WCJob
hadoop jar <jar file name> <job main class>
log4j
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
</dependency>
log4j.properties
log4j.rootLogger = info,stdout
log4j.appender.stdout = org.apache.log4j.ConsoleAppender
log4j.appender.stdout.Target = System.out
log4j.appender.stdout.layout = org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern = [%-5p] %d{yyyy-MM-dd HH:mm:ss,SSS} method:%l%n%m%n
Submitting locally still requires no changes to the code written earlier,
but the file paths must be specified as local paths.
A local file path may be mistakenly resolved as an HDFS path; in that case, prefix the path with file:///
<property>
    <description>If enabled, user can submit an application cross-platform
    i.e. submit an application from a Windows client to a Linux/Unix server or
    vice versa.
    </description>
    <name>mapreduce.app-submission.cross-platform</name>
    <value>true</value>
</property>
conf.set("mapreduce.app-submission.cross-platform","true");
conf.addResource("conf2/core-site.xml");
conf.addResource("conf2/hdfs-site.xml");
conf.addResource("conf2/mapred-site.xml");
conf.addResource("conf2/yarn-site.xml");
// set the path of the job's jar
conf.set(MRJobConfig.JAR,"D:\\大数据\\Code\\BigData\\Hadoop_Test\\target\\Hadoop_Test-1.0-SNAPSHOT.jar");
Development is never one-size-fits-all. Hadoop only ships serialization wrappers for a handful of basic data types, and in real development those basic types often cannot cope with complex requirements, so bean objects are introduced to simplify development; all of the objects involved then need serialization support.
Hadoop has its own serialization scheme: Writable.
Why not use the JDK's built-in serialization?
Because it is heavyweight, while the MR computation process needs efficient, fast transfers.
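The weight difference is easy to see with a small experiment: serializing a single Long with the JDK's ObjectOutputStream carries stream-header and class-descriptor overhead, while a raw writeLong (which is essentially what a Writable's write() boils down to) costs exactly 8 bytes. This is a standalone sketch, not Hadoop code:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SerializationWeight {
    public static void main(String[] args) throws IOException {
        // JDK serialization: writes stream header + class descriptor + value
        ByteArrayOutputStream jdkBytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(jdkBytes)) {
            oos.writeObject(Long.valueOf(12113L));
        }

        // Raw encoding, as a Writable's write(DataOutput) would do it
        ByteArrayOutputStream rawBytes = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(rawBytes)) {
            dos.writeLong(12113L);
        }

        // The raw form is exactly 8 bytes; the JDK form is roughly ten times larger.
        System.out.println("JDK: " + jdkBytes.size() + " bytes, raw: " + rawBytes.size() + " bytes");
    }
}
```

Multiplied by billions of records crossing the network during the shuffle, that per-record overhead is why MR sticks to compact Writable encodings.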
15713770999 12113 123 hn
15713770929 12133 123 zz
15713770909 123 1123 bj
13949158086 13 1213 kf
15713770929 11 12113 xy
15713770999 11113 123 hn
15713770929 123233 123 zz
15713770909 12113 1123 bj
13949158086 13 1213 kf
15713770929 121 12113 xy
phone        upstream  downstream  total
15713770999  12113     123         ?
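Before the MR version, the expected answer can be checked locally. 15713770999 appears twice in the sample data (12113 + 123 and 11113 + 123), so its totals are 23226 up, 246 down, 23472 overall. A plain-Java sketch of the aggregation the reducer will perform:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FlowLocalDemo {
    public static void main(String[] args) {
        // the two records for 15713770999 from the sample data above
        String[] lines = {"15713770999 12113 123", "15713770999 11113 123"};

        Map<String, long[]> totals = new LinkedHashMap<>();
        for (String line : lines) {
            String[] f = line.split(" ");                 // phone, up, down
            long[] t = totals.computeIfAbsent(f[0], k -> new long[2]);
            t[0] += Long.parseLong(f[1]);                 // upstream
            t[1] += Long.parseLong(f[2]);                 // downstream
        }

        long[] t = totals.get("15713770999");
        System.out.println(t[0] + " " + t[1] + " " + (t[0] + t[1])); // 23226 246 23472
    }
}
```

The MR job below does the same thing, with the map phase emitting (phone, FlowBean) pairs and the reduce phase performing this summation per phone.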
package com.rechen.mr.test04;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements Writable {
private String phone;
private Long upFlow;
private Long downFlow;
private Long sumFlow;
public FlowBean() {
}
public FlowBean(String phone, Long upFlow, Long downFlow, Long sumFlow) {
this.phone = phone;
this.upFlow = upFlow;
this.downFlow = downFlow;
this.sumFlow = sumFlow;
}
public String getPhone() {
return phone;
}
public void setPhone(String phone) {
this.phone = phone;
}
public Long getUpFlow() {
return upFlow;
}
public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}
public Long getDownFlow() {
return downFlow;
}
public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}
public Long getSumFlow() {
return sumFlow;
}
public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}
@Override
public String toString() {
return "FlowBean{" +
"phone='" + phone + '\'' +
", upFlow=" + upFlow +
", downFlow=" + downFlow +
", sumFlow=" + sumFlow +
'}' ;
}
/*
* Serialization: encode the fields
* */
public void write(DataOutput dataOutput) throws IOException {
// write a presence flag so readFields knows whether a phone value follows
dataOutput.writeBoolean(this.phone != null);
if (this.phone != null) {
dataOutput.writeUTF(this.phone);
}
dataOutput.writeLong(this.upFlow);
dataOutput.writeLong(this.downFlow);
dataOutput.writeLong(this.sumFlow);
}
/*
* Deserialization: decode the fields in the same order they were written
* */
public void readFields(DataInput dataInput) throws IOException {
if (dataInput.readBoolean()) {
this.phone = dataInput.readUTF();
}
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
}
package com.rechen.mr.test04;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class FlowJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(FlowJob.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
//TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
TextOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\out6"));
//TextOutputFormat.setOutputPath(job, new Path("/out312313"));
job.setMapperClass(FlowMapper.class);
job.setReducerClass(FlowReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(FlowBean.class);
job.waitForCompletion(true);
}
}
package com.rechen.mr.test04;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
/*
*15713770999 12113 123
*
* */
public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
/*
* get the line of text
* */
String line = value.toString();
/*
* split on spaces
* */
String[] infos = line.split(" ");
/*
* the phone number
* */
String phone = infos[0];
/*
* the upstream traffic
* */
Long up = Long.valueOf(infos[1]);
/*
* the downstream traffic
* */
Long down = Long.valueOf(infos[2]);
/*
* key: the phone number
*
* value: the FlowBean object
* */
context.write(new Text(phone), new FlowBean(null, up, down, up + down));
}
}
package com.rechen.mr.test04;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class FlowReducer extends Reducer<Text, FlowBean, NullWritable, FlowBean> {
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
/*
*
* 15713770999 [bean01,bean02,bean03]
* */
Long up = 0L;
Long down = 0L;
Long sum = 0L;
for (FlowBean value : values) {
up += value.getUpFlow();
down += value.getDownFlow();
sum += value.getSumFlow();
}
context.write(NullWritable.get(),new FlowBean(key.toString(),up,down,sum));
}
}
5b. Connect to the corresponding NM and start the MRAppMaster.
6. When the MRAppMaster starts, it initializes the Job.
Under the staging path, a folder named after the Job ID is created, containing:
files: other submitted files
libjars: dependency jars
archives: archive files
the job jar: the jar containing the job code
the split files
the job's conf.xml
Having written the WC example, it is easy to see that the program follows fixed rules for its input and output, and ultimately submits the job to the Hadoop cluster.
Hadoop splits the data into a number of input splits (InputSplit) and hands each split to one MapTask. The MapTask keeps parsing key/value pairs out of its split, calls the map() function on each, and, when finished, partitions the results into several partitions according to the number of ReduceTasks and writes them to disk.
Each ReduceTask then reads the data of its own partition from every MapTask's node, groups the records with equal keys together using a sort-based method, calls the reduce function, and writes the results to a file.
The description above still leaves three components unspecified:
(1) the input file format, which splits the input data into splits and parses each split into the key/value objects the map() function expects;
(2) the rule deciding which ReduceTask processes each new key/value pair produced by map();
(3) the output file format, i.e. the form in which each key/value pair is saved to the output file.
In MR these three components are InputFormat, Partitioner, and OutputFormat. All of them can be configured to match your business needs, but for WC the defaults suffice.
In the end, Hadoop gives us five programmable components:
InputFormat, Mapper, Partitioner, Reducer, OutputFormat.
There is one more component, the Combiner, usually used to optimize MR performance; whether to use it depends on the concrete business scenario.
InputFormat describes the format of the data input and provides two capabilities:
When running locally the split size is 32 MB; on a cluster it defaults to 128 MB.
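A sketch of how the split loop behaves (the 1.1 slack factor appears in the getSplits source below): with a 128 MB block and split size, a 260 MB file yields one 128 MB split plus one 132 MB split, because 132/128 ≈ 1.03 is under the 1.1 threshold and the tail is not split again. The helper mirrors FileInputFormat.computeSplitSize:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCalc {
    // same formula as FileInputFormat.computeSplitSize
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);

        // replay the split loop for a 260 MB file
        List<Long> splits = new ArrayList<>();
        long remaining = 260 * mb;
        while ((double) remaining / splitSize > 1.1) {
            splits.add(splitSize);
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits.add(remaining); // the tail, up to 1.1 * splitSize, stays in one split
        }

        System.out.println(splits.size() + " splits"); // 2 splits: 128 MB and 132 MB
    }
}
```

The slack factor avoids scheduling a MapTask for a tiny trailing sliver of a file.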
public List<InputSplit> getSplits(JobContext job) throws IOException {
// bookkeeping only, can be ignored
Stopwatch sw = (new Stopwatch()).start();
// minimum size, used when computing splitSize
long minSize = Math.max(this.getFormatMinSplitSize(), getMinSplitSize(job));
// maximum size, used when computing splitSize
long maxSize = getMaxSplitSize(job);
// the list that will hold the splits
List<InputSplit> splits = new ArrayList();
// get the list of all input files
List<FileStatus> files = this.listStatus(job);
// iterate over the files
Iterator i$ = files.iterator();
while(true) {
while(true) {
while(i$.hasNext()) {
// get the file status
FileStatus file = (FileStatus)i$.next();
// get the file path
Path path = file.getPath();
// get the file length
long length = file.getLen();
// 如果文件不为空 则 继续执行
if (length != 0L) {
//准备存储BlockLocation的数组
BlockLocation[] blkLocations;
// 判断文件是否为本地文件
if (file instanceof LocatedFileStatus) {
blkLocations = ((LocatedFileStatus)file).getBlockLocations();
} else {
//获取到fs 对象
FileSystem fs = path.getFileSystem(job.getConfiguration());
//通过fs 对象获取到文件所有的block的地址(位置)
blkLocations = fs.getFileBlockLocations(file, 0L, length);
}
// isSplitable 方法默认可以切分
if (this.isSplitable(job, path)) {
// i获取到当前的blcoksize大小 一般为128MB
long blockSize = file.getBlockSize();
/*
protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
return Math.max(minSize, Math.min(maxSize, blockSize));
}
上述代码是计算切片策略的大小
*/
long splitSize = this.computeSplitSize(blockSize, minSize, maxSize);
// 准备剩余 的字节数的对象
long bytesRemaining;
// blockIndex
int blkIndex;
// for 循环 剩余大小等于当前的长度 使用剩余大小除切片策略的大小 如果大于1.1 则向下执行 否则退出循环
for(bytesRemaining = length; (double)bytesRemaining / (double)splitSize > 1.1D; bytesRemaining -= splitSize) {
// 剩余大小除切片策略的大小 大于1.1
blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
// 给定文件路径 ,文件开始的位置 文件长度 如果剩余长度除以splitSize 大于1.1 则进入到次循环
splits.add(this.makeSplit(path, length - bytesRemaining, splitSize, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
}
// 如果剩余长度除以splitSize 小于1.1 则进入到判断语句
if (bytesRemaining != 0L) {
blkIndex = this.getBlockIndex(blkLocations, length - bytesRemaining);
// 给定文件路径 ,文件开始的位置 文件长度 如果剩余长度除以splitSize 大于1.1 则进入到次循环
splits.add(this.makeSplit(path, length - bytesRemaining, bytesRemaining, blkLocations[blkIndex].getHosts(), blkLocations[blkIndex].getCachedHosts()));
}
} else {
splits.add(this.makeSplit(path, 0L, length, blkLocations[0].getHosts(), blkLocations[0].getCachedHosts()));
}
} else {
splits.add(this.makeSplit(path, 0L, length, new String[0]));
}
}
job.getConfiguration().setLong("mapreduce.input.fileinputformat.numinputfiles", (long)files.size());
sw.stop();
if (LOG.isDebugEnabled()) {
LOG.debug("Total # of splits generated by getSplits: " + splits.size() + ", TimeTaken: " + sw.elapsedMillis());
}
return splits;
}
}
}
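The 1.1 slack factor in the loop above means the tail of a file may be up to 10% larger than splitSize before it is cut into a separate split. Below is a standalone sketch of just that arithmetic, with no Hadoop dependencies (SplitSizeDemo is a made-up name; the 128 MB block size mirrors the cluster default mentioned above):

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of FileInputFormat's split arithmetic.
public class SplitSizeDemo {
    static final double SPLIT_SLOP = 1.1; // same slack factor as in getSplits

    // Math.max(minSize, Math.min(maxSize, blockSize)) from computeSplitSize
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Returns the lengths of the splits a file of the given length is cut into.
    static List<Long> splitLengths(long fileLength, long splitSize) {
        List<Long> lengths = new ArrayList<>();
        long bytesRemaining = fileLength;
        while ((double) bytesRemaining / (double) splitSize > SPLIT_SLOP) {
            lengths.add(splitSize);
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0L) {
            lengths.add(bytesRemaining); // final, possibly oversized chunk
        }
        return lengths;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        // 260 MB file -> 128 MB + 132 MB (132 <= 1.1 * 128, so no third split)
        System.out.println(splitLengths(260 * mb, splitSize).size()); // 2
        // 300 MB file -> 128 MB + 128 MB + 44 MB
        System.out.println(splitLengths(300 * mb, splitSize).size()); // 3
    }
}
```

So a 260 MB file yields two splits rather than three, because the 132 MB remainder falls within the 10% slack.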
TextInputFormat uses LineRecordReader (under org.apache.hadoop.mapreduce.lib.input) as its RecordReader. Its initialize() method is called first to get the split's start and end positions and to open an input stream on the file via a FileSystem object. The mapper's key/value pairs come from LineRecordReader.nextKeyValue(): the key is set to the byte offset of the current line in the file, and the value is set to the line of text read by LineReader's readDefaultLine() method:
public boolean nextKeyValue() throws IOException {
if (this.key == null) {
this.key = new LongWritable();
}
this.key.set(this.pos);
if (this.value == null) {
this.value = new Text();
}
int newSize = 0;
while(this.getFilePosition() <= this.end || this.in.needAdditionalRecordAfterSplit()) {
if (this.pos == 0L) {
newSize = this.skipUtfByteOrderMark();
} else {
newSize = this.in.readLine(this.value, this.maxLineLength, this.maxBytesToConsume(this.pos));
this.pos += (long)newSize;
}
if (newSize == 0 || newSize < this.maxLineLength) {
break;
}
LOG.info("Skipped line of size " + newSize + " at pos " + (this.pos - (long)newSize));
}
if (newSize == 0) {
this.key = null;
this.value = null;
return false;
} else {
return true;
}
}
private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume) throws IOException {
str.clear();
int txtLength = 0;
int newlineLength = 0;
boolean prevCharCR = false;
long bytesConsumed = 0L;
do {
int startPosn = this.bufferPosn;
if (this.bufferPosn >= this.bufferLength) {
startPosn = this.bufferPosn = 0;
if (prevCharCR) {
++bytesConsumed;
}
this.bufferLength = this.fillBuffer(this.in, this.buffer, prevCharCR);
if (this.bufferLength <= 0) {
break;
}
}
while(this.bufferPosn < this.bufferLength) {
if (this.buffer[this.bufferPosn] == 10) {
newlineLength = prevCharCR ? 2 : 1;
++this.bufferPosn;
break;
}
if (prevCharCR) {
newlineLength = 1;
break;
}
prevCharCR = this.buffer[this.bufferPosn] == 13;
++this.bufferPosn;
}
int readLength = this.bufferPosn - startPosn;
if (prevCharCR && newlineLength == 0) {
--readLength;
}
bytesConsumed += (long)readLength;
int appendLength = readLength - newlineLength;
if (appendLength > maxLineLength - txtLength) {
appendLength = maxLineLength - txtLength;
}
if (appendLength > 0) {
str.append(this.buffer, startPosn, appendLength);
txtLength += appendLength;
}
} while(newlineLength == 0 && bytesConsumed < (long)maxBytesToConsume);
if (bytesConsumed > 2147483647L) {
throw new IOException("Too many bytes before newline: " + bytesConsumed);
} else {
return (int)bytesConsumed;
}
}
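The byte constants 10 and 13 in readDefaultLine above are '\n' (LF) and '\r' (CR). The following self-contained sketch (no Hadoop dependencies; LineSplitDemo is a made-up name) mimics its terminator handling: a line ends at LF, CRLF, or a lone CR, and the terminator bytes are consumed but not appended to the line:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the newline handling in readDefaultLine:
// a line ends at '\n' (LF), "\r\n" (CRLF), or a lone '\r' (CR).
public class LineSplitDemo {
    static List<String> splitLines(byte[] buf) {
        List<String> lines = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean prevCharCR = false;           // same flag as in readDefaultLine
        for (byte b : buf) {
            if (b == 10) {                    // '\n' ends the line (LF or CRLF)
                lines.add(cur.toString());
                cur.setLength(0);
                prevCharCR = false;
            } else if (prevCharCR) {          // a lone '\r' also ended a line
                lines.add(cur.toString());
                cur.setLength(0);
                prevCharCR = (b == 13);
                if (b != 13) cur.append((char) b);
            } else if (b == 13) {             // '\r': wait to see if '\n' follows
                prevCharCR = true;
            } else {
                cur.append((char) b);
            }
        }
        if (cur.length() > 0 || prevCharCR) lines.add(cur.toString());
        return lines;
    }

    public static void main(String[] args) {
        List<String> lines = splitLines("a\nb\r\nc\rd".getBytes());
        System.out.println(lines);            // [a, b, c, d]
    }
}
```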
The number of MapTasks is determined by the number of splits; the number of ReduceTasks can be set manually and defaults to 1.
FileInputFormat (reads HDFS files)
TextInputFormat
Splitting: per file, cut by splitSize
NLineInputFormat
Splitting: per file, N lines per split; defaults to one line per split, configurable
CombineTextInputFormat
Splitting: cut by splitSize, one split may cover several files
DBInputFormat (reads an RDBMS)
TableInputFormat (reads HBase) (important)
NLineInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
NLineInputFormat.setNumLinesPerSplit(job,3);
CombineTextInputFormat.setMinInputSplitSize(job,1024000);
CombineTextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));
package com.rechen.mr.test05;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class DBJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/hadoop", "root", "1234");
Job job = Job.getInstance(conf);
job.setInputFormatClass(DBInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
DBInputFormat.setInput(job, User.class, "select id,name from user", "select count(1) from user");
TextOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Code\\BigData\\Hadoop_Test\\src\\main\\java\\com\\baizhi\\mr\\test05\\out1"));
job.setMapperClass(DBMapper.class);
job.setReducerClass(DBReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
}
}
package com.rechen.mr.test05;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class DBMapper extends Mapper<LongWritable, User, LongWritable, Text> {
@Override
protected void map(LongWritable key, User value, Context context) throws IOException, InterruptedException {
context.write(key, new Text(value.toString()));
}
}
package com.rechen.mr.test05;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class DBReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
@Override
protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
context.write(key, value);
}
}
}
package com.rechen.mr.test05;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
public class User implements Writable, DBWritable {
private Integer id;
private String name;
public User() {
}
public User(Integer id, String name) {
this.id = id;
this.name = name;
}
public Integer getId() {
return id;
}
public void setId(Integer id) {
this.id = id;
}
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
/*
* Serialization (encode)
* */
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeInt(this.id);
dataOutput.writeUTF(this.name);
}
/*
* Deserialization (decode)
* */
public void readFields(DataInput dataInput) throws IOException {
this.id = dataInput.readInt();
this.name = dataInput.readUTF();
}
/*
* Serialization (encode)
* */
public void write(PreparedStatement pstm) throws SQLException {
pstm.setInt(1, this.id);
pstm.setString(2, this.name);
}
/*
* Deserialization (decode)
* */
public void readFields(ResultSet rs) throws SQLException {
this.id = rs.getInt(1);
this.name = rs.getString(2);
}
@Override
public String toString() {
return "User{" +
"id=" + id +
", name='" + name + '\'' +
'}';
}
}
Local run mode
Add the MySQL dependency to the pom file.
Submitting as a jar
Copy the MySQL jar to /home/hadoop/hadoop-2.6.0/share/hadoop/yarn/.
Remote submission
Same configuration as before.
Problem to solve: storing small files in HDFS
To solve the small-file storage problem, merge many files into one SequenceFile (a SequenceFile is a special Hadoop file format that stores binary key-value pairs and can hold many files inside it; here the key is the path plus file name and the value is the file content).
package com.rechen.mr.test06;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;
public class OwnInputFormat extends FileInputFormat<Text, BytesWritable> {
@Override
protected boolean isSplitable(JobContext context, Path filename) {
return false;
}
public RecordReader<Text, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
OwnRecordReader recordReader = new OwnRecordReader();
recordReader.initialize(inputSplit, taskAttemptContext);
return recordReader;
}
}
package com.rechen.mr.test06;
import com.rechen.mr.test01.WCJob;
import com.rechen.mr.test01.WCMapper;
import com.rechen.mr.test01.WCReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class OwnJob {
public static void main(String[] args) throws Exception {
/*
* 1. Build the Job object
* */
// System.setProperty("HADOOP_USER_NAME", "root");
Configuration conf = new Configuration();
/*
* Enable cross-platform submission
* */
/* conf.set("mapreduce.app-submission.cross-platform", "true");
conf.addResource("conf2/core-site.xml");
conf.addResource("conf2/hdfs-site.xml");
conf.addResource("conf2/mapred-site.xml");
conf.addResource("conf2/yarn-site.xml");
conf.set(MRJobConfig.JAR,"D:\\大数据\\Code\\BigData\\Hadoop_Test\\target\\Hadoop_Test-1.0-SNAPSHOT.jar");
*/
Job job = Job.getInstance(conf, "WC-JOB01");
job.setJarByClass(OwnJob.class);
/*
* 2. Set the input and output formats
* */
job.setInputFormatClass(OwnInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
/*
*
* 3. Set the input and output paths
* */
OwnInputFormat.addInputPath(job,new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));
//TextInputFormat.setInputPaths(job, new Path("/access.tmp2019-05-19-10-28.log"),new Path("/4.md"));
//TextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));
//CombineTextInputFormat.setMinInputSplitSize(job,1024000);
//CombineTextInputFormat.setInputPaths(job, new Path("D:\\大数据训\\Note\\Day02-Hadoop\\数据文件\\littlewenjian"));
//NLineInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
//NLineInputFormat.setNumLinesPerSplit(job,3);
/*
* Note: the specified output folder must not already exist
* */
//TextOutputFormat.setOutputPath(job, new Path("/outtes1t1112111311"));
//TextOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\out2"));
SequenceFileOutputFormat.setOutputPath(job,new Path("D:\\大数据\\Code\\BigData\\Hadoop_Test\\src\\main\\java\\com\\baizhi\\mr\\test06\\out4"));
/*
*
* 4. Set the computation logic
* */
job.setMapperClass(OwnMapper.class);
job.setReducerClass(OwnReducer.class);
/*
* 5. Set the Mapper and Reducer output types
* */
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(BytesWritable.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BytesWritable.class);
/*
* 6. Submit the job
* */
//job.submit();
job.waitForCompletion(true);
}
}
package com.rechen.mr.test06;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class OwnMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
@Override
protected void map(Text key, BytesWritable value, Context context) throws IOException, InterruptedException {
context.write(key, value);
}
}
package com.rechen.mr.test06;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
public class OwnRecordReader extends RecordReader<Text, BytesWritable> {
private FileSplit fileSplit;
private Configuration conf;
private Text key = new Text();
private BytesWritable value = new BytesWritable();
boolean isProgress = true;
public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
/*
* Initialization
* */
// get the file split object
this.fileSplit = (FileSplit) inputSplit;
// get the Configuration object
conf = taskAttemptContext.getConfiguration();
}
public boolean nextKeyValue() throws IOException, InterruptedException {
if (isProgress) {
// byte array that will hold the whole file content
byte[] bytes = new byte[(int) fileSplit.getLength()];
// get the file path
Path path = fileSplit.getPath();
FileSystem fileSystem = path.getFileSystem(conf);
FSDataInputStream fsDataInputStream = fileSystem.open(path);
IOUtils.readFully(fsDataInputStream, bytes, 0, bytes.length);
/*
* set the key to the file path
* */
key.set(path.toString());
/*
* set the value to the file content
* */
value.set(bytes, 0, bytes.length);
isProgress = false;
return true;
}
return false;
}
public Text getCurrentKey() throws IOException, InterruptedException {
return this.key;
}
public BytesWritable getCurrentValue() throws IOException, InterruptedException {
return this.value;
}
public float getProgress() throws IOException, InterruptedException {
return 0;
}
public void close() throws IOException {
}
}
package com.rechen.mr.test06;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class OwnReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
@Override
protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
for (BytesWritable value : values) {
context.write(key, value);
}
}
}
The default HashPartitioner computes a key's partition from the key's hash value.
Requirement:
Write the traffic data for different regions to different output files.
package com.rechen.mr.test07;
public class App {
public static void main(String[] args) {
String hn = "hn";
String zz = "zz";
String xy = "xy";
String kf = "kf";
String bj = "bj";
System.out.println((hn.hashCode()& 2147483647) % 1);
System.out.println((zz.hashCode()& 2147483647) % 1);
System.out.println((xy.hashCode()& 2147483647) % 1);
System.out.println((kf.hashCode()& 2147483647) % 1);
System.out.println((bj.hashCode()& 2147483647) % 1);
}
}
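The App above prints 0 for every key because `% 1` (a single ReduceTask) always yields partition 0. The formula is the same one HashPartitioner uses, `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch below (HashPartitionDemo is a made-up name) shows how keys spread once there is more than one ReduceTask:

```java
// Sketch of HashPartitioner's formula with several ReduceTasks:
// (hash & Integer.MAX_VALUE) % numReduceTasks clears the sign bit first,
// so negative hashCodes still yield a non-negative partition index.
public class HashPartitionDemo {
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[]{"hn", "zz", "xy", "kf", "bj"}) {
            // with 3 reducers the five region keys spread over partitions 0..2
            System.out.println(key + " -> " + partition(key, 3));
        }
        // with 1 reducer every key lands in partition 0, as in App above
        System.out.println(partition("hn", 1)); // 0
    }
}
```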
package com.rechen.mr.test07;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class AreaJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(AreaJob.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setPartitionerClass(OwnPartatiner.class);
TextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
//TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
TextOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\out112311111"));
//TextOutputFormat.setOutputPath(job, new Path("/out312313"));
job.setMapperClass(AreaMapper.class);
job.setReducerClass(AreaReducer.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(FlowBean.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(FlowBean.class);
job.setNumReduceTasks(5); // one ReduceTask per partition defined in OwnPartatiner
job.waitForCompletion(true);
}
}
package com.rechen.mr.test07;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class AreaMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split(" ");
String area = line[3];
String phone = line[0];
Long up = Long.valueOf(line[1]);
Long down = Long.valueOf(line[2]);
Long sum = up + down;
context.write(new Text(area), new FlowBean(phone, up, down, sum));
}
}
package com.rechen.mr.test07;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class AreaReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
@Override
protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
for (FlowBean value : values) {
context.write(key, value);
}
}
}
package com.rechen.mr.test07;
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements Writable {
private String phone;
private Long upFlow;
private Long downFlow;
private Long sumFlow;
public FlowBean() {
}
public FlowBean(String phone, Long upFlow, Long downFlow, Long sumFlow) {
this.phone = phone;
this.upFlow = upFlow;
this.downFlow = downFlow;
this.sumFlow = sumFlow;
}
public String getPhone() {
return phone;
}
public void setPhone(String phone) {
this.phone = phone;
}
public Long getUpFlow() {
return upFlow;
}
public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}
public Long getDownFlow() {
return downFlow;
}
public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}
public Long getSumFlow() {
return sumFlow;
}
public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}
@Override
public String toString() {
return "FlowBean{" +
"phone='" + phone + '\'' +
", upFlow=" + upFlow +
", downFlow=" + downFlow +
", sumFlow=" + sumFlow +
'}' ;
}
/*
* Serialization (encode)
* */
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeUTF(this.phone);
dataOutput.writeLong(this.upFlow);
dataOutput.writeLong(this.downFlow);
dataOutput.writeLong(this.sumFlow);
}
/*
* Deserialization (decode)
* */
public void readFields(DataInput dataInput) throws IOException {
this.phone = dataInput.readUTF();
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
}
package com.rechen.mr.test07;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import java.util.HashMap;
public class OwnPartatiner extends Partitioner<Text, FlowBean> {
private static HashMap<String, Integer> areaMap = new HashMap<>();
static {
areaMap.put("hn", 0);
areaMap.put("xy", 1);
areaMap.put("kf", 2);
areaMap.put("bj", 3);
areaMap.put("zz", 4);
}
public int getPartition(Text key, FlowBean value, int i) {
/*if (areaMap.get(key.toString()) != null) {
Integer integer = areaMap.get(key.toString());
return integer;
} else {
return 0;
}
*/
return areaMap.get(key.toString())==null ? 0 :areaMap.get(key.toString());
}
}
Custom OutputFormat
package com.rechen.mr.test08;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
public class MyJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(MyJob.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(OwnOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
//TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
OwnOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\asdjk1ha1"));
//TextOutputFormat.setOutputPath(job, new Path("/out312313"));
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(LongWritable.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
}
}
package com.rechen.mr.test08;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
context.write(key, value);
}
}
package com.rechen.mr.test08;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
@Override
protected void reduce(LongWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
for (Text value : values) {
context.write(key, value);
}
}
}
package com.rechen.mr.test08;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import java.io.IOException;
public class OenRecordWriter extends RecordWriter<LongWritable, Text> {
private FSDataOutputStream outputStream;
public OenRecordWriter(TaskAttemptContext context) throws Exception {
FileSystem fileSystem = FileSystem.get(context.getConfiguration());
outputStream = fileSystem.create(new Path("D:\\大数据\\Code\\BigData\\Hadoop_Test\\src\\main\\java\\com\\baizhi\\mr\\test08\\out1\\1.txt"));
}
public void write(LongWritable longWritable, Text text) throws IOException, InterruptedException {
outputStream.write((text.toString()+"\n").getBytes());
}
public void close(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
IOUtils.closeStream(outputStream);
}
}
package com.rechen.mr.test08;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class OwnOutputFormat extends FileOutputFormat<LongWritable, Text> {
public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
try {
return new OenRecordWriter(taskAttemptContext);
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
}
(1) A Combiner is an MR component besides the Mapper and the Reducer
(2) The Combiner's parent class is Reducer
(3) Combiner and Reducer differ in where they run:
the Combiner runs on the node of each individual Task;
the Reducer receives the output of all the Mappers globally
(4) The point of a Combiner is to do a local aggregation of each MapTask's output, reducing network traffic
(5) Using a Combiner must not change the final business result
Accumulation (sums) is fine
Averaging is not: the average of 0 20 10 25 15 is 14,
but pre-averaging (0,10,20) -> 10 and (25,15) -> 20, then averaging (10,20), gives 15
Suitable for sums, not for averages
(6) The Combiner's output KV types must match the Reducer's input KV types
(7) Usage
(8) Configuration
job.setCombinerClass(WCReducer.class);
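A standalone sketch (CombinerAvgDemo is a made-up name) of the average example from point (5): pre-aggregating per-MapTask averages changes the result, while a sum combines safely:

```java
import java.util.Arrays;

// Why a Combiner must not be used for averages: the average of
// per-MapTask averages is not the average of all the values.
public class CombinerAvgDemo {
    static double avg(double... values) {
        return Arrays.stream(values).average().orElse(0);
    }

    public static void main(String[] args) {
        // true average over all five values
        System.out.println(avg(0, 20, 10, 25, 15));           // 14.0
        // "combined" average: average each map-side group first
        System.out.println(avg(avg(0, 10, 20), avg(25, 15))); // 15.0
        // a sum, by contrast, combines safely:
        System.out.println(0 + 20 + 10 + 25 + 15);            // 70
    }
}
```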
So that Reduce can process the Map results in parallel, the Map output must be partitioned (Partition), sorted (Sort), combined (Combine), and merged (Merge) into intermediate results before being handed to the corresponding Reduce. This process, which turns unordered output into ordered input, is called Shuffle.
Shuffle is the core of MR: it describes the path from MapTask output to ReduceTask input. Because the computation runs on a cluster, nodes process data in parallel and several Jobs may run at once, so the efficiency of node-to-node data transfer affects the overall computation time. And because the MR framework spills intermediate files to disk, disk I/O affects it as well.
This analysis yields the basic requirements on the Shuffle process.
Summary: Shuffle is the process of partitioning, sorting, and merging the Map output and handing it to Reduce. It has a Map side and a Reduce side.
The relevant source is in org.apache.hadoop.mapred: MapTask and its inner class MapOutputBuffer.
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting
  files, in megabytes. By default, gives each merge stream 1MB, which
  should minimize seeks.</description>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
  <description>The soft limit in the serialization buffer. Once reached, a
  thread will begin to spill the contents to disk in the background. Note that
  collection will not block if this threshold is exceeded while a spill is
  already in progress, so spills may be larger than this threshold when it is
  set to less than .5</description>
</property>
On the Map side, Shuffle first buffers the output in memory (a circular buffer, 100 MB by default); when the buffer reaches 80% full (80 MB) the content is spilled to disk. Before each spill, the in-memory data is partitioned and sorted, then written out. Every spill creates a new disk file, so many small files accumulate as the MapTask runs. When the Map-side computation finishes, these small files are merged into one large file, and the corresponding ReduceTasks are notified to fetch the data that belongs to them.
(1) Small-file optimization: CombineTextInputFormat intervenes in the split computation
(2) Implement a Partitioner strategy to prevent data skew
(3) Tune the YarnChild memory parameters appropriately (see the YARN configuration manual)
(4) Tune the spill buffer size and threshold appropriately
(5) Tune the file-merge parallelism, mapreduce.task.io.sort.factor
(6) Compress the Map-side spill output with GZIP to save network bandwidth
<property>
  <name>mapreduce.map.output.compress</name>
  <value>false</value>
  <description>Should the outputs of the maps be compressed before being
  sent across the network. Uses SequenceFile compression.
  </description>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>If the map outputs are compressed, how should they be
  compressed?
  </description>
</property>
conf.setClass("mapreduce.map.output.compress.codec", GzipCodec.class, CompressionCodec.class);
conf.set("mapreduce.map.output.compress","true");
(omitted)
15713770999 12113 123 hn
15713770929 12133 123 zz
15713770909 123 1123 bj
13949158086 13 1213 kf
15713770929 11 12113 xy
15713770999 11113 123 hn
15713770929 123233 123 zz
15713770909 12113 1123 bj
13949158086 13 1213 kf
15713770929 121 12113 xy
package com.rechen.mr.test09;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements WritableComparable<FlowBean> {
private String phone;
private Long upFlow;
private Long downFlow;
private Long sumFlow;
public FlowBean() {
}
public FlowBean(String phone, Long upFlow, Long downFlow, Long sumFlow) {
this.phone = phone;
this.upFlow = upFlow;
this.downFlow = downFlow;
this.sumFlow = sumFlow;
}
public String getPhone() {
return phone;
}
public void setPhone(String phone) {
this.phone = phone;
}
public Long getUpFlow() {
return upFlow;
}
public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}
public Long getDownFlow() {
return downFlow;
}
public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}
public Long getSumFlow() {
return sumFlow;
}
public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}
@Override
public String toString() {
return "FlowBean{" +
"phone='" + phone + '\'' +
", upFlow=" + upFlow +
", downFlow=" + downFlow +
", sumFlow=" + sumFlow +
'}';
}
/*
* Serialization (encode)
* */
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeUTF(this.phone);
dataOutput.writeLong(this.upFlow);
dataOutput.writeLong(this.downFlow);
dataOutput.writeLong(this.sumFlow);
}
/*
* Deserialization (decode)
* */
public void readFields(DataInput dataInput) throws IOException {
this.phone = dataInput.readUTF();
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
}
public int compareTo(FlowBean flowBean) {
/*
if (this.sumFlow > flowBean.sumFlow) {
return 1;
} else {
return -1;
}*/
// descending by sumFlow (note: never returns 0, so equal sums are not treated as the same key)
return this.sumFlow > flowBean.sumFlow ? -1 : 1;
}
}
package com.rechen.mr.test09;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class SortJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(SortJob.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
//TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
TextOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\out11"));
//TextOutputFormat.setOutputPath(job, new Path("/out312313"));
job.setMapperClass(SortMapper.class);
job.setReducerClass(SortReducer.class);
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(FlowBean.class);
job.waitForCompletion(true);
}
}
package com.rechen.mr.test09;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class SortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
/*
* get the line of data
* */
String line = value.toString();
/*
* split on spaces
* */
String[] infos = line.split(" ");
/*
* phone number
* */
String phone = infos[0];
/*
* upstream traffic
* */
Long up = Long.valueOf(infos[1]);
/*
* downstream traffic
* */
Long down = Long.valueOf(infos[2]);
/*
* key is the FlowBean object (the framework sorts by key);
* value is NullWritable
* */
context.write(new FlowBean(phone, up, down, up + down), NullWritable.get());
}
}
package com.rechen.mr.test09;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class SortReducer extends Reducer<FlowBean, NullWritable, NullWritable, FlowBean> {
@Override
protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
/*
* the keys arrive already sorted by sumFlow
* */
Long up = key.getUpFlow();
Long down = key.getDownFlow();
Long sum = key.getSumFlow();
context.write(NullWritable.get(), new FlowBean(key.getPhone(), up, down, sum));
}
}
package com.rechen.mr.test10;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
public class FlowBean implements WritableComparable<FlowBean> {
private String phone;
private Long upFlow;
private Long downFlow;
private Long sumFlow;
private String area;
public FlowBean() {
}
public FlowBean(String phone, Long upFlow, Long downFlow, Long sumFlow, String area) {
this.phone = phone;
this.upFlow = upFlow;
this.downFlow = downFlow;
this.sumFlow = sumFlow;
this.area = area;
}
/*
* Serialization (encode)
* */
public void write(DataOutput dataOutput) throws IOException {
dataOutput.writeUTF(this.phone);
dataOutput.writeLong(this.upFlow);
dataOutput.writeLong(this.downFlow);
dataOutput.writeLong(this.sumFlow);
dataOutput.writeUTF(this.area);
}
public String getPhone() {
return phone;
}
public void setPhone(String phone) {
this.phone = phone;
}
public Long getUpFlow() {
return upFlow;
}
public void setUpFlow(Long upFlow) {
this.upFlow = upFlow;
}
public Long getDownFlow() {
return downFlow;
}
public void setDownFlow(Long downFlow) {
this.downFlow = downFlow;
}
public Long getSumFlow() {
return sumFlow;
}
public void setSumFlow(Long sumFlow) {
this.sumFlow = sumFlow;
}
public String getArea() {
return area;
}
public void setArea(String area) {
this.area = area;
}
/*
* Deserialization: decode the fields in the same order they were written
* */
public void readFields(DataInput dataInput) throws IOException {
this.phone = dataInput.readUTF();
this.upFlow = dataInput.readLong();
this.downFlow = dataInput.readLong();
this.sumFlow = dataInput.readLong();
this.area = dataInput.readUTF();
}
public int compareTo(FlowBean flowBean) {
// Descending by total flow. Note this never returns 0,
// so beans with equal totals are still treated as distinct keys.
return this.sumFlow > flowBean.sumFlow ? -1 : 1;
}
@Override
public String toString() {
return "FlowBean{" +
"phone='" + phone + '\'' +
", upFlow=" + upFlow +
", downFlow=" + downFlow +
", sumFlow=" + sumFlow +
", area='" + area + '\'' +
'}';
}
}
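In MapReduce, map-output keys are sorted with the key's `compareTo`, so returning `-1` when this bean's total is larger yields descending order. The same rule can be checked without Hadoop by sorting a plain list; `MiniBean` and `SortDemo` below are hypothetical stand-ins for illustration only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal stand-in for FlowBean: only the field that drives the sort.
class MiniBean implements Comparable<MiniBean> {
    final String phone;
    final long sumFlow;

    MiniBean(String phone, long sumFlow) {
        this.phone = phone;
        this.sumFlow = sumFlow;
    }

    // Same rule as FlowBean.compareTo: larger total flow sorts first.
    // (Fine here since all totals are distinct; a comparator that never
    // returns 0 can violate the sort contract on ties.)
    public int compareTo(MiniBean other) {
        return this.sumFlow > other.sumFlow ? -1 : 1;
    }
}

public class SortDemo {
    // Returns phone numbers ordered by descending total flow.
    public static List<String> sortedPhones(List<MiniBean> beans) {
        Collections.sort(beans);
        List<String> phones = new ArrayList<String>();
        for (MiniBean b : beans) {
            phones.add(b.phone);
        }
        return phones;
    }

    public static void main(String[] args) {
        List<MiniBean> beans = new ArrayList<MiniBean>();
        beans.add(new MiniBean("13500000001", 120L));
        beans.add(new MiniBean("13500000002", 500L));
        beans.add(new MiniBean("13500000003", 60L));
        System.out.println(sortedPhones(beans)); // largest total first
    }
}
```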
package com.rechen.mr.test10;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;
import java.util.HashMap;
public class OwnPartatiner extends Partitioner<FlowBean, NullWritable> {
private static HashMap<String, Integer> areaMap = new HashMap<String, Integer>();
static {
areaMap.put("hn", 0);
areaMap.put("xy", 1);
areaMap.put("kf", 2);
areaMap.put("bj", 3);
areaMap.put("zz", 4);
}
public int getPartition(FlowBean flowBean, NullWritable nullWritable, int i) {
String area = flowBean.getArea();
// Unknown areas fall back to partition 0
return areaMap.get(area) == null ? 0 : areaMap.get(area);
}
}
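The custom partitioner routes each record to a fixed reducer by its area code, with unknown areas falling back to partition 0 (this is why the job sets five reduce tasks, one per area). The lookup can be sketched in plain Java; `AreaPartitionDemo` is a hypothetical name for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

public class AreaPartitionDemo {
    private static final Map<String, Integer> AREA_MAP = new HashMap<String, Integer>();
    static {
        AREA_MAP.put("hn", 0);
        AREA_MAP.put("xy", 1);
        AREA_MAP.put("kf", 2);
        AREA_MAP.put("bj", 3);
        AREA_MAP.put("zz", 4);
    }

    // Same rule as OwnPartatiner.getPartition: unknown areas go to partition 0.
    public static int partitionFor(String area) {
        Integer p = AREA_MAP.get(area);
        return p == null ? 0 : p;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("bj")); // 3
        System.out.println(partitionFor("??")); // 0 (fallback)
    }
}
```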
package com.rechen.mr.test10;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class PSJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(PSJob.class);
job.setNumReduceTasks(5);
job.setPartitionerClass(OwnPartatiner.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\flow.dat"));
//TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
TextOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\out16"));
//TextOutputFormat.setOutputPath(job, new Path("/out312313"));
job.setMapperClass(PSMapper.class);
job.setReducerClass(PSReducer.class);
job.setMapOutputKeyClass(FlowBean.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(FlowBean.class);
job.waitForCompletion(true);
}
}
package com.rechen.mr.test10;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class PSMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
/*
* Get the input line
* */
String line = value.toString();
/*
* Split on spaces
* */
String[] infos = line.split(" ");
/*
* Phone number
* */
String phone = infos[0];
/*
* Upstream traffic
* */
Long up = Long.valueOf(infos[1]);
/*
* Downstream traffic
* */
Long down = Long.valueOf(infos[2]);
/*
* key: the FlowBean (carrying the area field used for partitioning)
* value: NullWritable
* */
String area = infos[3];
context.write(new FlowBean(phone, up, down, up + down, area), NullWritable.get());
}
}
}
package com.rechen.mr.test10;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class PSReducer extends Reducer<FlowBean, NullWritable, NullWritable, FlowBean> {
@Override
protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
context.write(NullWritable.get(), key);
}
}
Student info table (student_info.txt)
gjf 00001
gzy 00002
jzz 00003
zkf 00004
Student course table (student_info_class.txt)
00001 yuwen
00001 shuxue
00002 yinyue
00002 yuwen
00003 tiyu
00003 shengwu
00004 tiyu
00004 wuli
Expected join output:
00001 gjf yuwen shuxue
00002 gzy yinyue yuwen
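This is a reduce-side join: the mapper tags each record with its source ("a" for a course row, "b" for a name row), the shuffle groups both kinds under the student number, and the reducer stitches them together. Before the MapReduce implementation below, the same logic can be simulated in plain Java; `JoinDemo` is a hypothetical name for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JoinDemo {
    // Group tagged records by student number, then stitch name + courses,
    // mirroring what the shuffle + StuReduce do.
    public static List<String> join(String[] students, String[] courses) {
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (String s : students) {  // "gjf 00001" -> key 00001, value "gjf b"
            String[] f = s.split(" ");
            grouped.computeIfAbsent(f[1], k -> new ArrayList<String>()).add(f[0] + " b");
        }
        for (String c : courses) {   // "00001 yuwen" -> key 00001, value "yuwen a"
            String[] f = c.split(" ");
            grouped.computeIfAbsent(f[0], k -> new ArrayList<String>()).add(f[1] + " a");
        }
        List<String> out = new ArrayList<String>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String name = "";
            StringBuilder cls = new StringBuilder();
            for (String v : e.getValue()) {
                String[] f = v.split(" ");
                if (f[1].equals("b")) name = f[0];
                else cls.append(f[0]).append(" ");
            }
            out.add((e.getKey() + " " + name + " " + cls).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String[] students = {"gjf 00001", "gzy 00002"};
        String[] courses = {"00001 yuwen", "00001 shuxue", "00002 yinyue"};
        for (String line : join(students, courses)) System.out.println(line);
        // 00001 gjf yuwen shuxue
        // 00002 gzy yinyue
    }
}
```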
package com.rechen.mr.test11;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class StuJob {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(StuJob.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.setInputPaths(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\stu"));
//TextInputFormat.setInputPaths(job, new Path("/flow.dat"));
TextOutputFormat.setOutputPath(job, new Path("D:\\大数据\\Note\\Day02-Hadoop\\数据文件\\stu\\out1"));
//TextOutputFormat.setOutputPath(job, new Path("/out312313"));
job.setMapperClass(StuMapper.class);
job.setReducerClass(StuReduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.waitForCompletion(true);
}
}
package com.rechen.mr.test11;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import java.io.IOException;
public class StuMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String stuNum = "";
FileSplit inputSplit = (FileSplit) context.getInputSplit();
/*
* Get the path of the file this split came from
* */
Path name = inputSplit.getPath();
String[] line = value.toString().split(" ");
if (name.toString().contains("student_info_class.txt")) {
// student number
stuNum = line[0];
// course name
String classaName = line[1];
context.write(new Text(stuNum), new Text(classaName + " a"));
} else if (name.toString().contains("student_info.txt")) {
// student number
stuNum = line[1];
// student name
String stuName = line[0];
context.write(new Text(stuNum), new Text(stuName + " b"));
}
}
}
package com.rechen.mr.test11;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
public class StuReduce extends Reducer<Text, Text, NullWritable, Text> {
/*
*
* 00001 gjf b
* 00001 yuwen a
* 00001 shuxue a
*
*
* (0001,[shuxue a,gjf b,yuwen a ])
*
* 00001 gjf shuxue yuwen
* */
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
String stuName = "";
String className = "";
for (Text value : values) {
String[] s = value.toString().split(" ");
if (s[1].equals("a")) {
className += s[0] + " ";
}
if (s[1].equals("b")) {
stuName = s[0];
}
}
context.write(NullWritable.get(), new Text((key.toString() + " " + stuName + " " + className).trim()));
}
}
If this helped, a like, comment, and share would be much appreciated~