Hi everyone. Today I'd like to introduce DataJoin: Hadoop ships with a contrib package called DataJoin that provides a framework for data joins. Its jar can be found under contrib/datajoin/hadoop-*-datajoin.
To distinguish it from other data join techniques, we call this approach a reduce-side join (because most of the work happens in the reducer).
The reduce-side join introduces a few terms and concepts:
1. Data Source: essentially the counterpart of a table in a relational database. In our example the data sources look like this (CSV format):
Customers:
1,Stephanie Leung,555-555-5555
2,Edward Kim,123-456-7890
3,Jose Madriz,281-330-8004
4,David Stork,408-555-0000

Orders:
3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009
2. Tag: because the record type (Customers or Orders) is separate from the record itself, tagging a record ensures that this piece of metadata always travels with the record. For this purpose we tag each record with the name of its own data source.
3. Group Key: the group key plays the role of the join key in a relational database; in our example it is the Customer ID (the first column of both data sources). Because the datajoin package lets the user define the group key, it is more general than a relational join key. A worked example follows below.
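For example, group key 3 links one Customers record with the two Orders records that share it:

3,Jose Madriz,281-330-8004
3,A,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009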
Implementing the join with the datajoin package:
The datajoin package contains three classes for us to extend: DataJoinMapperBase, DataJoinReducerBase, and TaggedMapOutput. As the names suggest, our MapClass will extend DataJoinMapperBase and our Reduce class will extend DataJoinReducerBase. The package has already implemented the map() and reduce() methods, so our subclasses only need to implement a few methods that fill in the details.
Before using DataJoinMapperBase and DataJoinReducerBase, we need to understand the new abstract data type that is used throughout the program: TaggedMapOutput.
As shown earlier in the Advanced MapReduce data-flow figure, the mapper emits a package made up of a key and a value (the tagged record). The datajoin package fixes the key type to Text and the value type to TaggedMapOutput, a data type that wraps our record with a Text tag. It implements getTag() and setTag(Text tag), and it declares a getData() method that our subclass implements to return the wrapped record. The package does not explicitly require a setData() method, but it is good practice to provide one for symmetry (or to set the data in the constructor). Because it serves as the mapper's output value, TaggedMapOutput must be a Writable, so our subclass also has to implement readFields() and write().
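As a preview, a minimal TaggedMapOutput subclass might look like the following sketch. The class name TaggedTextWritable is hypothetical, and the sketch assumes the wrapped record is simply a Text line of input; the full TaggedWritable used in this article (see the listing in step 5) generalizes this to an arbitrary Writable payload by recording its class name during serialization.

// A minimal sketch of a TaggedMapOutput subclass (same imports as the full listing in step 5).
public static class TaggedTextWritable extends TaggedMapOutput {
    private Text data = new Text();               // the wrapped record: one line of input

    public TaggedTextWritable() {                 // no-arg constructor needed for deserialization
        this.tag = new Text("");                  // 'tag' is the Text field inherited from TaggedMapOutput
    }

    public TaggedTextWritable(Text data) {
        this.tag = new Text("");
        this.data = data;
    }

    public Writable getData() {                   // required by TaggedMapOutput
        return data;
    }

    public void write(DataOutput out) throws IOException {
        this.tag.write(out);                      // serialize the tag first...
        this.data.write(out);                     // ...then the record itself
    }

    public void readFields(DataInput in) throws IOException {
        this.tag.readFields(in);                  // deserialize in the same order
        this.data.readFields(in);
    }
}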
DataJoinMapperBase:
Recall the join data-flow diagram: the mapper's main job is to package each record so that it ends up at the same reducer as every other record sharing its group key. DataJoinMapperBase does all of this packaging work; the class defines three abstract methods for our subclass to implement:
protected abstract Text generateInputTag(String inputFile);
protected abstract TaggedMapOutput generateTaggedMapOutput(Object value);
protected abstract Text generateGroupKey(TaggedMapOutput aRecord);
Before a map task starts, generateInputTag() is called to produce the tag (a Text) that will be attached to every record this map task processes; the result is stored in DataJoinMapperBase's inputTag field. We can also save the file name into the inputFile field for later use.
After the map task has been initialized, DataJoinMapperBase's map() method runs for every record. It calls the two abstract methods we have not implemented yet: generateTaggedMapOutput() and generateGroupKey(aRecord) (see the code below).
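Conceptually, the base class's map() boils down to something like the sketch below. This is a simplified illustration rather than the actual contrib source (which also maintains counters and does some bookkeeping), but it shows how the two abstract methods cooperate:

// Simplified sketch of what DataJoinMapperBase.map() does conceptually.
public void map(Object key, Object value,
        OutputCollector output, Reporter reporter) throws IOException {
    TaggedMapOutput aRecord = generateTaggedMapOutput(value); // wrap the raw record and tag it with inputTag
    Text groupKey = generateGroupKey(aRecord);                // extract the join key from the tagged record
    output.collect(groupKey, aRecord);                        // records with the same group key meet at one reducer
}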
DataJoinReducerBase:
DataJoinReducerBase simplifies our work by performing a full outer join for us. Our Reducer subclass only has to implement the combine() method to filter out the combinations we do not want and keep the ones we do (inner join, left outer join, and so on). combine() is also where we format each combination into the output format.
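The complete listing in step 5 implements an inner join by returning null whenever fewer than two data sources appear in a combination. As an illustration of the filtering combine() allows, the sketch below shows how it could be changed into a left outer join that also keeps customers with no orders. It assumes the tag generated for the customer data source contains the string "Customer" (the tag in the listing is derived from the input file name/path):

// Sketch: left outer join on the Customers side (keeps unmatched customer rows).
protected TaggedMapOutput combine(Object[] tags, Object[] values) {
    // A single-source combination is kept only if that source is Customers.
    if (tags.length < 2 && !tags[0].toString().contains("Customer")) {
        return null;                                   // drop orders with no matching customer
    }
    String joinedStr = "";
    for (int i = 0; i < values.length; i++) {
        if (i > 0)
            joinedStr += ",";
        TaggedWritable tw = (TaggedWritable) values[i];
        String line = ((Text) tw.getData()).toString();
        joinedStr += line.split(",", 2)[1];            // strip the join key from each record
    }
    TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
    retv.setTag((Text) tags[0]);
    return retv;
}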
Environment: VMware 8.0 and Ubuntu 11.04
Step 1: create a project named HadoopTest. The directory structure is shown in the figure below:
Step 2: create a start.sh script under /home/tanglg1987. Each time the virtual machine boots, we delete everything under /tmp and reformat the namenode. The script is as follows:
sudo rm -rf /tmp/*
rm -rf /home/tanglg1987/hadoop-0.20.2/logs
hadoop namenode -format
hadoop datanode -format
start-all.sh
hadoop fs -mkdir input
hadoop dfsadmin -safemode leave
Step 3: make start.sh executable and start the pseudo-distributed Hadoop cluster:
chmod 777 /home/tanglg1987/start.sh
./start.sh
The execution output looks like this:
Step 4: upload the local files to HDFS.
Create Orders.txt under /home/tanglg1987 with the following content:
3,A,12.95,02-Jun-2008
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,D,25.00,22-Jan-2009
Create Customer.txt under /home/tanglg1987 with the following content:
1,tom,555-555-5555
2,white,123-456-7890
3,jerry,281-330-4563
4,tanglg,408-555-0000
Upload the local files to HDFS:
hadoop fs -put /home/tanglg1987/Orders.txt input
hadoop fs -put /home/tanglg1987/Customer.txt input
Step 5: create a new DataJoin.java with the following code:
package com.baison.action;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;

public class DataJoin extends Configured implements Tool {

    public static class MapClass extends DataJoinMapperBase {

        // Use the part of the input file name/path before the first '-' as the data source tag.
        protected Text generateInputTag(String inputFile) {
            String datasource = inputFile.split("-")[0];
            return new Text(datasource);
        }

        // The group key (join key) is the first CSV column: the customer ID.
        protected Text generateGroupKey(TaggedMapOutput aRecord) {
            String line = ((Text) aRecord.getData()).toString();
            String[] tokens = line.split(",");
            String groupKey = tokens[0];
            return new Text(groupKey);
        }

        // Wrap the raw input line in a TaggedWritable carrying this source's tag.
        protected TaggedMapOutput generateTaggedMapOutput(Object value) {
            TaggedWritable retv = new TaggedWritable((Text) value);
            retv.setTag(this.inputTag);
            return retv;
        }
    }

    public static class Reduce extends DataJoinReducerBase {

        // Inner join: drop combinations that do not contain both data sources,
        // otherwise concatenate the non-key fields of each record.
        protected TaggedMapOutput combine(Object[] tags, Object[] values) {
            if (tags.length < 2)
                return null;
            String joinedStr = "";
            for (int i = 0; i < values.length; i++) {
                if (i > 0)
                    joinedStr += ",";
                TaggedWritable tw = (TaggedWritable) values[i];
                String line = ((Text) tw.getData()).toString();
                String[] tokens = line.split(",", 2);
                joinedStr += tokens[1];
            }
            TaggedWritable retv = new TaggedWritable(new Text(joinedStr));
            retv.setTag((Text) tags[0]);
            return retv;
        }
    }

    // TaggedMapOutput subclass used as the mapper's output value type.
    public static class TaggedWritable extends TaggedMapOutput {

        private Writable data;

        public TaggedWritable() {
            this.tag = new Text();
        }

        public TaggedWritable(Writable data) {
            this.tag = new Text("");
            this.data = data;
        }

        public Writable getData() {
            return data;
        }

        public void setData(Writable data) {
            this.data = data;
        }

        // Serialize the tag, then the payload's class name, then the payload itself.
        public void write(DataOutput out) throws IOException {
            this.tag.write(out);
            out.writeUTF(this.data.getClass().getName());
            this.data.write(out);
        }

        // Deserialize in the same order; recreate the payload object via reflection if needed.
        public void readFields(DataInput in) throws IOException {
            this.tag.readFields(in);
            String dataClz = in.readUTF();
            if (this.data == null
                    || !this.data.getClass().getName().equals(dataClz)) {
                try {
                    this.data = (Writable) ReflectionUtils.newInstance(
                            Class.forName(dataClz), null);
                } catch (ClassNotFoundException e) {
                    e.printStackTrace();
                }
            }
            this.data.readFields(in);
        }
    }

    public int run(String[] args) throws Exception {
        for (String string : args) {
            System.out.println(string);
        }
        Configuration conf = getConf();
        JobConf job = new JobConf(conf, DataJoin.class);
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.setJobName("DataJoin");
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TaggedWritable.class);
        job.set("mapred.textoutputformat.separator", ",");
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Input and output paths on HDFS are hard-coded for this example.
        String[] arg = { "hdfs://localhost:9100/user/tanglg1987/input",
                "hdfs://localhost:9100/user/tanglg1987/output" };
        int res = ToolRunner.run(new Configuration(), new DataJoin(), arg);
        System.exit(res);
    }
}
Step 6: Run On Hadoop. The run output is as follows:
hdfs://localhost:9100/user/tanglg1987/input
hdfs://localhost:9100/user/tanglg1987/output
12/10/16 22:05:36 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
12/10/16 22:05:36 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
12/10/16 22:05:36 INFO mapred.FileInputFormat: Total input paths to process : 2
12/10/16 22:05:36 INFO mapred.JobClient: Running job: job_local_0001
12/10/16 22:05:36 INFO mapred.FileInputFormat: Total input paths to process : 2
12/10/16 22:05:36 INFO mapred.MapTask: numReduceTasks: 1
12/10/16 22:05:36 INFO mapred.MapTask: io.sort.mb = 100
12/10/16 22:05:37 INFO mapred.MapTask: data buffer = 79691776/99614720
12/10/16 22:05:37 INFO mapred.MapTask: record buffer = 262144/327680
12/10/16 22:05:37 INFO mapred.MapTask: Starting flush of map output
12/10/16 22:05:37 INFO mapred.MapTask: Finished spill 0
12/10/16 22:05:37 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/10/16 22:05:37 INFO mapred.LocalJobRunner: collectedCount 4
totalCount 4
12/10/16 22:05:37 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
12/10/16 22:05:37 INFO mapred.MapTask: numReduceTasks: 1
12/10/16 22:05:37 INFO mapred.MapTask: io.sort.mb = 100
12/10/16 22:05:37 INFO mapred.MapTask: data buffer = 79691776/99614720
12/10/16 22:05:37 INFO mapred.MapTask: record buffer = 262144/327680
12/10/16 22:05:37 INFO mapred.MapTask: Starting flush of map output
12/10/16 22:05:37 INFO mapred.MapTask: Finished spill 0
12/10/16 22:05:37 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
12/10/16 22:05:37 INFO mapred.LocalJobRunner: collectedCount 4
totalCount 4
12/10/16 22:05:37 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000001_0' done.
12/10/16 22:05:37 INFO mapred.LocalJobRunner:
12/10/16 22:05:37 INFO mapred.Merger: Merging 2 sorted segments
12/10/16 22:05:37 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 875 bytes
12/10/16 22:05:37 INFO mapred.LocalJobRunner:
12/10/16 22:05:37 INFO datajoin.job: key: 1 this.largestNumOfValues: 2
12/10/16 22:05:37 INFO datajoin.job: key: 3 this.largestNumOfValues: 3
12/10/16 22:05:37 INFO mapred.JobClient: map 100% reduce 0%
12/10/16 22:05:37 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
12/10/16 22:05:37 INFO mapred.LocalJobRunner:
12/10/16 22:05:37 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
12/10/16 22:05:37 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9100/user/tanglg1987/output
12/10/16 22:05:37 INFO mapred.LocalJobRunner: actuallyCollectedCount 4
collectedCount 5
groupCount 4
> reduce
12/10/16 22:05:37 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
12/10/16 22:05:38 INFO mapred.JobClient: map 100% reduce 100%
12/10/16 22:05:38 INFO mapred.JobClient: Job complete: job_local_0001
12/10/16 22:05:38 INFO mapred.JobClient: Counters: 15
12/10/16 22:05:38 INFO mapred.JobClient: FileSystemCounters
12/10/16 22:05:38 INFO mapred.JobClient: FILE_BYTES_READ=51466
12/10/16 22:05:38 INFO mapred.JobClient: HDFS_BYTES_READ=435
12/10/16 22:05:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=105007
12/10/16 22:05:38 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=162
12/10/16 22:05:38 INFO mapred.JobClient: Map-Reduce Framework
12/10/16 22:05:38 INFO mapred.JobClient: Reduce input groups=4
12/10/16 22:05:38 INFO mapred.JobClient: Combine output records=0
12/10/16 22:05:38 INFO mapred.JobClient: Map input records=8
12/10/16 22:05:38 INFO mapred.JobClient: Reduce shuffle bytes=0
12/10/16 22:05:38 INFO mapred.JobClient: Reduce output records=4
12/10/16 22:05:38 INFO mapred.JobClient: Spilled Records=16
12/10/16 22:05:38 INFO mapred.JobClient: Map output bytes=855
12/10/16 22:05:38 INFO mapred.JobClient: Map input bytes=175
12/10/16 22:05:38 INFO mapred.JobClient: Combine input records=0
12/10/16 22:05:38 INFO mapred.JobClient: Map output records=8
12/10/16 22:05:38 INFO mapred.JobClient: Reduce input records=8
Step 7: view the result set. The output is as follows: