How the Hadoop framework works:
The flow: the input is converted into the context format used by the mapper; after the mapper processes it, the data is converted into the context format used by the reducer; after the reducer processes it, the output is produced.
C++ libraries and headers:
Hadoop ships a C++ API. After installing Hadoop, the libraries are under hadoop/hadoop-2.8.0/lib/native and the headers under hadoop/hadoop-2.8.0/include. Copy them to the system's /usr/lib64 and /usr/include, copy them directly into your own project, or point to them from a Makefile.
C++ programming pattern:
One mapper class that publicly inherits Mapper, and one reducer class that publicly inherits Reducer. The template looks like this:
class localmapper : public HadoopPipes::Mapper
{
public:
localmapper(HadoopPipes::TaskContext & context){}
void map(HadoopPipes::MapContext & context){}
};
class localreducer : public HadoopPipes::Reducer
{
public:
localreducer(HadoopPipes::TaskContext & context){}
void reduce(HadoopPipes::ReduceContext & context){}
};
The map method of localmapper emits the (key, value) pairs that the framework then shuffles (groups by key); the reduce method of localreducer decides what to output for each key. An example:
1. Write the input, here tmp.txt:
[root@master helloworld]# cat tmp.txt
a:067
b:066
a:100
b:089
b:099
2. Write the map and reduce methods:
#include <string>
#include <algorithm>
#include <iostream>
/* Hadoop headers */
#include "Pipes.hh"
#include "TemplateFactory.hh"
#include "StringUtils.hh"
using namespace std;
/* Hadoop's Mapper, Reducer, and the context each one uses */
using HadoopPipes::TaskContext;
using HadoopPipes::Mapper;
using HadoopPipes::MapContext;
using HadoopPipes::Reducer;
using HadoopPipes::ReduceContext;
/* two helpers from Hadoop's utility functions */
using HadoopUtils::toInt;
using HadoopUtils::toString;
/* Hadoop's run entry point */
using HadoopPipes::TemplateFactory;
using HadoopPipes::runTask;
/* publicly inherits Hadoop's Mapper */
class LocalMapper : public Mapper
{
public:
    LocalMapper(TaskContext & context){}
    /* the map function, using MapContext */
    void map(MapContext & context)
    {
        /* read one input record from the text */
        string line = context.getInputValue();
        string key = line.substr(0, 1);
        string value = line.substr(2, 3);
        /* emit according to the filter condition: here, drop values equal to 100 */
        if (value != "100")
        {
            context.emit(key, value);
        }
    }
};
/* publicly inherits Reducer */
class LocalReducer : public Reducer
{
public:
    LocalReducer(TaskContext & context){}
    /* the reduce function, using ReduceContext */
    void reduce(ReduceContext & context)
    {
        int max_value = 0;
        /* iterate over all values of one key and pick what to output: here, the maximum */
        while (context.nextValue())
        {
            max_value = max(max_value, toInt(context.getInputValue()));
        }
        context.emit(context.getInputKey(), toString(max_value));
    }
};
int main()
{
    return runTask(TemplateFactory<LocalMapper, LocalReducer>());
}
3. Compile with the following command:
g++ helloworld.cpp -lcrypto -lssl -L/root/hadoop/hadoop-2.8.0/lib/native -lhadooppipes -lhadooputils -lpthread
Link with -lpthread, because Hadoop Pipes is concurrent internally. Since no -o option is given, the build produces a.out.
4. Upload a.out and tmp.txt to HDFS with the following commands:
hdfs dfs -mkdir /helloworld
hdfs dfs -put tmp.txt /helloworld
hdfs dfs -put a.out /helloworld
The first line creates a directory helloworld in HDFS; the next two lines put the executable and the input file into it.
5. Launch the job; the script is as follows:
[root@master helloworld]# cat start.sh
hadoop pipes -Dhadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /helloworld/tmp.txt -output output -program /helloworld/a.out
The -input parameter gives the path of the input file; -output gives where the output is stored in HDFS (the directory is created fresh and must not already exist); -program gives the path of the executable. Output like the following means the run succeeded:
[root@master helloworld]# ./start.sh
18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1
18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2
18/09/25 15:54:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1537857363076_0004
18/09/25 15:54:24 INFO impl.YarnClientImpl: Submitted application application_1537857363076_0004
18/09/25 15:54:24 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1537857363076_0004/
18/09/25 15:54:24 INFO mapreduce.Job: Running job: job_1537857363076_0004
18/09/25 15:54:38 INFO mapreduce.Job: Job job_1537857363076_0004 running in uber mode : false
18/09/25 15:54:38 INFO mapreduce.Job: map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job: map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job: map 100% reduce 100%
18/09/25 15:55:14 INFO mapreduce.Job: Job job_1537857363076_0004 completed successfully
18/09/25 15:55:14 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=38
FILE: Number of bytes written=414120
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=223
HDFS: Number of bytes written=10
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=34605
Total time spent by all reduces in occupied slots (ms)=11779
Total time spent by all map tasks (ms)=34605
Total time spent by all reduce tasks (ms)=11779
Total vcore-milliseconds taken by all map tasks=34605
Total vcore-milliseconds taken by all reduce tasks=11779
Total megabyte-milliseconds taken by all map tasks=35435520
Total megabyte-milliseconds taken by all reduce tasks=12061696
Map-Reduce Framework
Map input records=5
Map output records=4
Map output bytes=24
Map output materialized bytes=44
Input split bytes=178
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=44
Reduce input records=4
Reduce output records=2
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=453
CPU time spent (ms)=3630
Physical memory (bytes) snapshot=548720640
Virtual memory (bytes) snapshot=6198038528
Total committed heap usage (bytes)=378470400
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=45
File Output Format Counters
Bytes Written=10
18/09/25 15:55:14 INFO util.ExitUtil: Exiting with status 0
This line shows the job had one input file:
18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1
This line shows the input was divided into two splits, so two map tasks run (the split count is determined by the input format and job settings, not by the number of datanodes — here it merely coincides with this cluster's two datanodes):
18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2
The following three lines show the framework's processing order: map runs first, and reduce runs only after all maps have finished:
18/09/25 15:54:38 INFO mapreduce.Job: map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job: map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job: map 100% reduce 100%
6. Check the result:
[root@master helloworld]# hdfs dfs -cat output/*
a 67
b 99
Here the value 100 under key a was filtered out, leaving 67 as its maximum; the maximum under b is 99.
7. Where the job logs are stored:
hadoop-2.8.0/logs/userlogs
This directory holds the logs of every job, like so:
[root@master userlogs]# ls
application_1537857363076_0002 application_1537857363076_0003 application_1537857363076_0004
Each job directory contains stderr, stdout, and syslog; the useful content is usually in syslog.
8. Check how many containers each datanode handled:
Here a.out is run twice to compare. There are two datanodes; the logs on the first datanode:
[root@master application_1537857363076_0003]# ls
container_1537857363076_0003_01_000001 container_1537857363076_0003_01_000004
The logs on the second datanode:
[root@slave1 application_1537857363076_0003]# ls
container_1537857363076_0003_01_000002 container_1537857363076_0003_01_000003
Note, however, that containers are not split evenly across the datanodes. Running again and checking the logs:
First datanode:
[root@master application_1537857363076_0004]# ls
container_1537857363076_0004_01_000004
Second datanode:
[root@slave1 application_1537857363076_0004]# ls
container_1537857363076_0004_01_000001 container_1537857363076_0004_01_000002 container_1537857363076_0004_01_000003
This time master handled only 1 container while slave1 handled 3.