Using C++ with Hadoop

How the Hadoop framework works:


The flow is: the input is converted into the context format consumed by the mapper; after the mapper processes it, it is converted into the context format consumed by the reducer; and after the reducer processes it, the output is produced.

C++ libraries and header files:

Hadoop ships with a C++ API. After installing Hadoop, the libraries are under hadoop/hadoop-2.8.0/lib/native and the headers under hadoop/hadoop-2.8.0/include. Copy them to /usr/lib64 and /usr/include on the system, copy them directly into your own project, or reference them from a Makefile, as in the sketch below.
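For the Makefile route, a minimal sketch (the HADOOP_HOME path and the helloworld.cpp/a.out names are assumptions based on this article; adjust them to your own layout):

# minimal sketch; HADOOP_HOME and file names are assumptions
HADOOP_HOME = /root/hadoop/hadoop-2.8.0

a.out: helloworld.cpp
	g++ helloworld.cpp -I$(HADOOP_HOME)/include -L$(HADOOP_HOME)/lib/native -lhadooppipes -lhadooputils -lcrypto -lssl -lpthread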

C++ programming model:

You write a map class that publicly inherits from Mapper and a reduce class that publicly inherits from Reducer; the template looks like this:

class localmapper : public HadoopPipes::Mapper
{
public:
    localmapper(HadoopPipes::TaskContext & context){}
    void map(HadoopPipes::MapContext & context){}
};

class localreducer : public HadoopPipes::Reducer
{
public:
    localreducer(HadoopPipes::TaskContext & context){}
    void reduce(HadoopPipes::ReduceContext & context){}
};

The map method of localmapper defines how records are keyed and filtered for the shuffle, and the reduce method of localreducer defines what is written to the output. A worked example:

1. Write the input, here tmp.txt:

[root@master helloworld]# cat tmp.txt 
a:067
b:066
a:100
b:089
b:099

2. Write the map and reduce methods:

#include <string>
#include <algorithm>
#include <stdint.h>

/* Hadoop headers (copied to /usr/include as described above) */
#include <Pipes.hh>
#include <TemplateFactory.hh>
#include <StringUtils.hh>

using namespace std;

/* Hadoop's Mapper, Reducer, and the context each of them uses */
using HadoopPipes::TaskContext;
using HadoopPipes::Mapper;
using HadoopPipes::MapContext;
using HadoopPipes::Reducer;
using HadoopPipes::ReduceContext;

/* two helpers from Hadoop's utility functions */
using HadoopUtils::toInt;
using HadoopUtils::toString;

/* Hadoop's run entry point */
using HadoopPipes::TemplateFactory;
using HadoopPipes::runTask;

/* publicly inherits from Hadoop's Mapper */
class LocalMapper : public Mapper
{
public:
	LocalMapper(TaskContext & context){}
	/* the map function, which uses MapContext */
	void map(MapContext & context)
	{
		/* read one input record; lines look like "a:067" */
		string line  = context.getInputValue();
		string key	 = line.substr(0, 1);  /* the character before ':' */
		string value = line.substr(2, 3);  /* the three digits after ':' */
		/* shuffle by a filter condition; here records whose value is 100 are dropped */
		if (value != "100")
		{
			context.emit(key, value);
		}
	}
};

/* publicly inherits from Hadoop's Reducer */
class LocalReducer : public Reducer
{
public:
	LocalReducer(TaskContext & context){}
	/* the reduce function, which uses ReduceContext */
	void reduce(ReduceContext & context)
	{
		int max_value = 0;
		/* iterate over all values of one key and emit by a chosen rule; here, keep the maximum */
		while (context.nextValue())
		{
			max_value = max(max_value, toInt(context.getInputValue()));
		}
		context.emit(context.getInputKey(), toString(max_value));
	}
};

int main()
{
	/* hand the mapper and reducer to the Pipes runtime via the factory */
	return runTask(TemplateFactory<LocalMapper, LocalReducer>());
}

3. Compile with the following command:

g++ helloworld.cpp -lcrypto -lssl -L/root/hadoop/hadoop-2.8.0/lib/native -lhadooppipes -lhadooputils -lpthread

The pthread library must be linked in (-lpthread) because the Hadoop Pipes runtime uses threads internally. Compilation produces a.out.

4. Upload a.out and tmp.txt to HDFS with the following commands:

hdfs dfs -mkdir /helloworld
hdfs dfs -put tmp.txt /helloworld
hdfs dfs -put a.out /helloworld

The first line creates the directory /helloworld in HDFS, and the next two lines put the executable and the input file into it.
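To verify the upload, list the directory (a standard HDFS command; the exact listing will differ on your cluster):

hdfs dfs -ls /helloworld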

5. Start the job; the script is as follows:

[root@master helloworld]# cat start.sh 
hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /helloworld/tmp.txt -output output -program /helloworld/a.out

The input parameter gives the path of the input file; output names the location in HDFS where results will be stored (the job creates it anew, so it must not already exist); program gives the path of the executable.
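If an earlier run left the output directory behind, the job will fail; remove it first (assuming the relative path output used above):

hdfs dfs -rm -r output

Output like the following indicates success: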

[root@master helloworld]# ./start.sh 

18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:21 INFO client.RMProxy: Connecting to ResourceManager at master/10.1.108.64:8032
18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1
18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2
18/09/25 15:54:23 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1537857363076_0004
18/09/25 15:54:24 INFO impl.YarnClientImpl: Submitted application application_1537857363076_0004
18/09/25 15:54:24 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1537857363076_0004/
18/09/25 15:54:24 INFO mapreduce.Job: Running job: job_1537857363076_0004
18/09/25 15:54:38 INFO mapreduce.Job: Job job_1537857363076_0004 running in uber mode : false
18/09/25 15:54:38 INFO mapreduce.Job:  map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job:  map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job:  map 100% reduce 100%
18/09/25 15:55:14 INFO mapreduce.Job: Job job_1537857363076_0004 completed successfully
18/09/25 15:55:14 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=38
		FILE: Number of bytes written=414120
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=223
		HDFS: Number of bytes written=10
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=34605
		Total time spent by all reduces in occupied slots (ms)=11779
		Total time spent by all map tasks (ms)=34605
		Total time spent by all reduce tasks (ms)=11779
		Total vcore-milliseconds taken by all map tasks=34605
		Total vcore-milliseconds taken by all reduce tasks=11779
		Total megabyte-milliseconds taken by all map tasks=35435520
		Total megabyte-milliseconds taken by all reduce tasks=12061696
	Map-Reduce Framework
		Map input records=5
		Map output records=4
		Map output bytes=24
		Map output materialized bytes=44
		Input split bytes=178
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=44
		Reduce input records=4
		Reduce output records=2
		Spilled Records=8
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=453
		CPU time spent (ms)=3630
		Physical memory (bytes) snapshot=548720640
		Virtual memory (bytes) snapshot=6198038528
		Total committed heap usage (bytes)=378470400
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=45
	File Output Format Counters 
		Bytes Written=10
18/09/25 15:55:14 INFO util.ExitUtil: Exiting with status 0

This line shows that the job has a single input file:

18/09/25 15:54:22 INFO mapred.FileInputFormat: Total input files to process : 1

This line shows that the input was divided into two splits, so two map tasks are launched (in this cluster, one per datanode):

18/09/25 15:54:22 INFO mapreduce.JobSubmitter: number of splits:2

The following three lines show the framework's processing flow: map runs first, and only after all maps have finished does reduce run:

18/09/25 15:54:38 INFO mapreduce.Job:  map 0% reduce 0%
18/09/25 15:54:59 INFO mapreduce.Job:  map 100% reduce 0%
18/09/25 15:55:14 INFO mapreduce.Job:  map 100% reduce 100%

6. View the result:

[root@master helloworld]# hdfs dfs -cat output/*
a	67
b	99

For key a, the record with value 100 was filtered out in map, so the largest remaining value is 67; for key b the largest value is 99.
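Spelled out as a trace: map filters and emits key/value pairs, the framework groups them by key, and reduce keeps the per-key maximum:

map output:     (a, 067) (b, 066) (b, 089) (b, 099)   -- (a, 100) filtered out
after shuffle:  a -> [067]    b -> [066, 089, 099]
reduce output:  (a, 67)       (b, 99)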

7. Where job logs are stored

hadoop-2.8.0/logs/userlogs

This directory stores the logs of each job:

[root@master userlogs]# ls
application_1537857363076_0002  application_1537857363076_0003  application_1537857363076_0004

Each job directory contains stderr, stdout, and syslog for each of its containers; the useful content is usually in syslog.
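For example, to read one container's syslog from inside userlogs (the container name here is taken from the listings in the next step):

cat application_1537857363076_0004/container_1537857363076_0004_01_000004/syslog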

8. Check how many containers each datanode ran:

Run the job twice to see the effect. There are two datanodes here; the logs on the first datanode look like this:

[root@master application_1537857363076_0003]# ls
container_1537857363076_0003_01_000001  container_1537857363076_0003_01_000004

The logs on the second datanode:

[root@slave1 application_1537857363076_0003]# ls
container_1537857363076_0003_01_000002  container_1537857363076_0003_01_000003

Note, however, that containers are not distributed evenly across the datanodes. Running the job again and checking the logs:

First datanode:

[root@master application_1537857363076_0004]# ls
container_1537857363076_0004_01_000004

Second datanode:

[root@slave1 application_1537857363076_0004]# ls
container_1537857363076_0004_01_000001  container_1537857363076_0004_01_000002  container_1537857363076_0004_01_000003

Here master ran only one container, while slave1 ran three.
