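This is my original work; please credit the source when reposting. My QQ: 530422429. Corrections and discussion are welcome.
Purpose: show by example how to add an application to Giraph. Using the WCC (Weakly Connected Components) algorithm as the example, this post describes how to add a subclass of Vertex, define custom input and output formats, and use a Combiner.
Background: the Giraph source already ships a WCC implementation, the class org.apache.giraph.examples.ConnectedComponentsVertex. Its code is as follows: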
    package org.apache.giraph.examples;

    import org.apache.giraph.edge.Edge;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;

    import java.io.IOException;

    /**
     * Implementation of the HCC algorithm that identifies connected components and
     * assigns each vertex its "component identifier" (the smallest vertex id
     * in the component)
     *
     * The idea behind the algorithm is very simple: propagate the smallest
     * vertex id along the edges to all vertices of a connected component. The
     * number of supersteps necessary is equal to the length of the maximum
     * diameter of all components + 1
     *
     * The original Hadoop-based variant of this algorithm was proposed by Kang,
     * Charalampos, Tsourakakis and Faloutsos in
     * "PEGASUS: Mining Peta-Scale Graphs", 2010
     *
     * http://www.cs.cmu.edu/~ukang/papers/PegasusKAIS.pdf
     */
    @Algorithm(
        name = "Connected components",
        description = "Finds connected components of the graph"
    )
    public class ConnectedComponentsVertex extends
        Vertex<IntWritable, IntWritable, NullWritable, IntWritable> {
      /**
       * Propagates the smallest vertex id to all neighbors. Will always choose to
       * halt and only reactivate if a smaller id has been sent to it.
       *
       * @param messages Iterator of messages from the previous superstep.
       * @throws IOException
       */
      @Override
      public void compute(Iterable<IntWritable> messages) throws IOException {
        int currentComponent = getValue().get();

        // First superstep is special, because we can simply look at the neighbors
        if (getSuperstep() == 0) {
          for (Edge<IntWritable, NullWritable> edge : getEdges()) {
            int neighbor = edge.getTargetVertexId().get();
            if (neighbor < currentComponent) {
              currentComponent = neighbor;
            }
          }
          // Only need to send value if it is not the own id
          if (currentComponent != getValue().get()) {
            setValue(new IntWritable(currentComponent));
            for (Edge<IntWritable, NullWritable> edge : getEdges()) {
              IntWritable neighbor = edge.getTargetVertexId();
              if (neighbor.get() > currentComponent) {
                sendMessage(neighbor, getValue());
              }
            }
          }

          voteToHalt();
          return;
        }

        boolean changed = false;
        // did we get a smaller id ?
        for (IntWritable message : messages) {
          int candidateComponent = message.get();
          if (candidateComponent < currentComponent) {
            currentComponent = candidateComponent;
            changed = true;
          }
        }

        // propagate new component id to the neighbors
        if (changed) {
          setValue(new IntWritable(currentComponent));
          sendMessageToAllEdges(getValue());
        }
        voteToHalt();
      }
    }

Analysis: compute() optimizes superstep 0. Each vertex first finds the smallest ID among its own value (the code assumes the value starts out as the vertex's own ID) and the IDs of its neighbors. If that minimum is smaller than its own value, the vertex adopts it and sends it to every neighbor whose ID is larger than the minimum. In each later superstep, the vertex finds the smallest value among the messages it received. If that value is smaller than its own, it sets its value to it and sends it to all neighbors; otherwise it neither updates its value nor sends any message. In every case the vertex finally calls voteToHalt() and becomes inactive.
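The superstep-0 shortcut described above can be tried out without a cluster. The following is only a minimal sketch in plain Java, not Giraph code: the class SuperstepZeroSketch and its two methods are hypothetical names introduced for illustration, and plain int values stand in for IntWritable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; it mirrors only the superstep-0 branch.
public class SuperstepZeroSketch {
    /** Smallest id among the vertex itself and its neighbors. */
    static int smallestId(int vertexId, int[] neighbors) {
        int min = vertexId;
        for (int n : neighbors) {
            if (n < min) {
                min = n;
            }
        }
        return min;
    }

    /** Neighbors that would receive a message in superstep 0. */
    static List<Integer> recipients(int vertexId, int[] neighbors) {
        int min = smallestId(vertexId, neighbors);
        List<Integer> out = new ArrayList<>();
        if (min == vertexId) {
            return out; // own id is already the smallest: nothing is sent
        }
        for (int n : neighbors) {
            if (n > min) { // the neighbor that owns the minimum is skipped
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] neighbors = {2, 7, 9};
        System.out.println(smallestId(5, neighbors));  // prints 2
        System.out.println(recipients(5, neighbors));  // prints [7, 9]
    }
}
```

Vertex 5 adopts 2 as its component id and notifies only 7 and 9; vertex 2 already holds the minimum, so messaging it would be wasted.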
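Why add another WCC: to write the simplest (unoptimized) WCC code. In the bundled WCC, I, V and M are all IntWritable, which is not enough for big-data graphs with more than ten billion vertices (a 32-bit int tops out at 2,147,483,647). Below they are changed to LongWritable, which requires custom input and output formats. A Combiner is added as well. The steps are:
1. First define the input format by adding the class org.apache.giraph.examples.LongLongNullTextInputFormat. The types of I, V and E are LongWritable, LongWritable and NullWritable (meaning the edges carry no weight). The graph is read in adjacency-list form, with fields separated by \t. Source code: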
    package org.apache.giraph.examples;

    import java.io.IOException;
    import java.util.List;
    import java.util.regex.Pattern;

    import org.apache.giraph.conf.ImmutableClassesGiraphConfigurable;
    import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
    import org.apache.giraph.edge.Edge;
    import org.apache.giraph.edge.EdgeFactory;
    import org.apache.giraph.graph.Vertex;
    import org.apache.giraph.io.formats.TextVertexInputFormat;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    import com.google.common.collect.Lists;

    /**
     * Input format for unweighted graphs with long ids and long vertex values
     */
    public class LongLongNullTextInputFormat
        extends TextVertexInputFormat<LongWritable, LongWritable, NullWritable>
        implements ImmutableClassesGiraphConfigurable<LongWritable, LongWritable,
        NullWritable, Writable> {
      /** Configuration. */
      private ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
          NullWritable, Writable> conf;

      @Override
      public TextVertexReader createVertexReader(InputSplit split,
          TaskAttemptContext context) throws IOException {
        return new LongLongNullLongVertexReader();
      }

      @Override
      public void setConf(ImmutableClassesGiraphConfiguration<LongWritable,
          LongWritable, NullWritable, Writable> configuration) {
        this.conf = configuration;
      }

      @Override
      public ImmutableClassesGiraphConfiguration<LongWritable, LongWritable,
          NullWritable, Writable> getConf() {
        return conf;
      }

      /**
       * Vertex reader associated with
       * {@link LongLongNullTextInputFormat}.
       */
      public class LongLongNullLongVertexReader extends
          TextVertexInputFormat<LongWritable, LongWritable,
          NullWritable>.TextVertexReader {
        /** Separator of the vertex and neighbors */
        private final Pattern separator = Pattern.compile("\t");

        @Override
        public Vertex<LongWritable, LongWritable, NullWritable, ?>
        getCurrentVertex() throws IOException, InterruptedException {
          Vertex<LongWritable, LongWritable, NullWritable, ?> vertex =
              conf.createVertex();

          // tokens[0] is the vertex id; every following token is a neighbor id
          String[] tokens =
              separator.split(getRecordReader().getCurrentValue().toString());
          List<Edge<LongWritable, NullWritable>> edges =
              Lists.newArrayListWithCapacity(tokens.length - 1);
          for (int n = 1; n < tokens.length; n++) {
            edges.add(EdgeFactory.create(
                new LongWritable(Long.parseLong(tokens[n])),
                NullWritable.get()));
          }

          LongWritable vertexId = new LongWritable(Long.parseLong(tokens[0]));
          // The vertex value starts at 0; compute() overwrites it in superstep 0
          vertex.initialize(vertexId, new LongWritable(), edges);

          return vertex;
        }

        @Override
        public boolean nextVertex() throws IOException, InterruptedException {
          return getRecordReader().nextKeyValue();
        }
      }
    }
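To make the expected input concrete, here is a small stand-alone sketch of how one tab-separated adjacency line breaks down into a vertex id and its neighbor ids. It reuses the same Pattern and Long.parseLong calls as the reader, but AdjacencyLineSketch and its methods are hypothetical names used only for illustration, and the sample line is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the line parsing; no Giraph classes involved.
public class AdjacencyLineSketch {
    private static final Pattern SEPARATOR = Pattern.compile("\t");

    /** First token of the line: the vertex id. */
    static long vertexId(String line) {
        return Long.parseLong(SEPARATOR.split(line)[0]);
    }

    /** Remaining tokens of the line: the neighbor ids. */
    static List<Long> neighbors(String line) {
        String[] tokens = SEPARATOR.split(line);
        List<Long> result = new ArrayList<>();
        for (int n = 1; n < tokens.length; n++) {
            result.add(Long.parseLong(tokens[n]));
        }
        return result;
    }

    public static void main(String[] args) {
        // An id above Integer.MAX_VALUE, which is why long is needed here.
        String line = "10000000000\t3\t42";
        System.out.println(vertexId(line));   // prints 10000000000
        System.out.println(neighbors(line));  // prints [3, 42]
    }
}
```

A line holding only a vertex id, such as "7", yields an empty neighbor list, which matches the reader's loop starting at index 1.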
2. Define the output format by adding the class org.apache.giraph.examples.VertexWithLongValueNullEdgeTextOutputFormat. Only the vertex ID and value are written out. Source code:
    package org.apache.giraph.examples;

    import java.io.IOException;

    import org.apache.giraph.graph.Vertex;
    import org.apache.giraph.io.formats.TextVertexOutputFormat;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    /**
     * Output format for vertices with a long as id, a long as value and
     * null edges
     */
    public class VertexWithLongValueNullEdgeTextOutputFormat
        extends TextVertexOutputFormat<LongWritable, LongWritable, NullWritable> {
      @Override
      public TextVertexWriter createVertexWriter(TaskAttemptContext context)
        throws IOException, InterruptedException {
        return new VertexWithDoubleValueWriter();
      }

      /**
       * Vertex writer used with
       * {@link VertexWithLongValueNullEdgeTextOutputFormat}.
       * Despite its name, it writes a long vertex value.
       */
      public class VertexWithDoubleValueWriter extends TextVertexWriter {
        @Override
        public void writeVertex(
            Vertex<LongWritable, LongWritable, NullWritable, ?> vertex)
          throws IOException, InterruptedException {
          // One line per vertex: <vertex id> TAB <vertex value>
          StringBuilder output = new StringBuilder();
          output.append(vertex.getId().get());
          output.append('\t');
          output.append(vertex.getValue().get());
          getRecordWriter().write(new Text(output.toString()), null);
        }
      }
    }
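3. Add the Vertex subclass WeaklyConnectedComponentsVertex, the simple (unoptimized) WCC. The types of I, V, E and M are LongWritable, LongWritable, NullWritable and LongWritable. Source code:

    import java.io.IOException;

    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    /**
     * Weakly Connected Components Algorithm
     *
     * @author baisong
     */
    public class WeaklyConnectedComponentsVertex extends
        Vertex<LongWritable, LongWritable, NullWritable, LongWritable> {
      /**
       * Propagates the smallest vertex id to all neighbors. Will always choose to
       * halt and only reactivate if a smaller id has been sent to it.
       *
       * @param messages Iterator of messages from the previous superstep.
       * @throws IOException
       */
      @Override
      public void compute(Iterable<LongWritable> messages) throws IOException {
        // In superstep 0 every vertex starts with its own id as its value
        if (getSuperstep() == 0) {
          setValue(getId());
        }

        // Smallest value among the current value and all received messages
        long minValue = getValue().get();
        for (LongWritable msg : messages) {
          if (msg.get() < minValue) {
            minValue = msg.get();
          }
        }

        // Always send in superstep 0; afterwards only when the value shrinks
        if (getSuperstep() == 0 || minValue < getValue().get()) {
          setValue(new LongWritable(minValue));
          sendMessageToAllEdges(new LongWritable(minValue));
        }
        voteToHalt();
      }
    }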
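To see how the simple version converges, the superstep loop can be imitated in plain Java. This is only a rough single-process sketch, not how Giraph schedules work: WccSimulationSketch, its run method and the sample graph are hypothetical, and the sketch assumes every edge is listed in both directions in the adjacency map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the superstep loop.
public class WccSimulationSketch {
    /** Returns vertex id -> component id (smallest id in the component). */
    static Map<Long, Long> run(Map<Long, long[]> adjacency) {
        Map<Long, Long> value = new TreeMap<>();
        Map<Long, List<Long>> inbox = new HashMap<>();
        for (long superstep = 0; superstep == 0 || !inbox.isEmpty(); superstep++) {
            Map<Long, List<Long>> nextInbox = new HashMap<>();
            for (Map.Entry<Long, long[]> entry : adjacency.entrySet()) {
                long id = entry.getKey();
                List<Long> messages = inbox.get(id);
                if (superstep > 0 && messages == null) {
                    continue; // halted vertex with no incoming message
                }
                if (superstep == 0) {
                    value.put(id, id); // start with the vertex's own id
                }
                long min = value.get(id);
                if (messages != null) {
                    for (long m : messages) {
                        min = Math.min(min, m);
                    }
                }
                if (superstep == 0 || min < value.get(id)) {
                    value.put(id, min);
                    for (long target : entry.getValue()) {
                        nextInbox.computeIfAbsent(target, k -> new ArrayList<>()).add(min);
                    }
                }
            }
            inbox = nextInbox; // messages are delivered in the next superstep
        }
        return value;
    }

    public static void main(String[] args) {
        Map<Long, long[]> graph = new TreeMap<>();
        graph.put(1L, new long[]{2});
        graph.put(2L, new long[]{1, 3});
        graph.put(3L, new long[]{2});
        graph.put(4L, new long[]{5});
        graph.put(5L, new long[]{4});
        System.out.println(run(graph)); // prints {1=1, 2=1, 3=1, 4=4, 5=4}
    }
}
```

The loop stops once a superstep produces no messages, which plays the role of all vertices having voted to halt with nothing left to wake them.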
4. Add the Combiner: the class org.apache.giraph.combiner.MinimumLongCombiner, which keeps only the smallest of the messages addressed to a vertex. Source code:

    package org.apache.giraph.combiner;

    import org.apache.hadoop.io.LongWritable;

    /**
     * {@link Combiner} that finds the minimum {@link LongWritable}
     */
    public class MinimumLongCombiner
        extends Combiner<LongWritable, LongWritable> {
      @Override
      public void combine(LongWritable vertexIndex,
          LongWritable originalMessage, LongWritable messageToCombine) {
        // Keep the smaller of the two messages in originalMessage
        if (originalMessage.get() > messageToCombine.get()) {
          originalMessage.set(messageToCombine.get());
        }
      }

      @Override
      public LongWritable createInitialMessage() {
        // Long.MAX_VALUE is the neutral element for min
        return new LongWritable(Long.MAX_VALUE);
      }
    }
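The effect of the combiner is easy to check by hand. The sketch below uses plain long values instead of LongWritable, and MinCombinerSketch with its methods is a hypothetical stand-in rather than part of Giraph; it only shows that folding messages pairwise from Long.MAX_VALUE leaves the minimum.

```java
// Hypothetical stand-in for the combiner logic, using plain long values.
public class MinCombinerSketch {
    /** Same rule as combine(): keep the smaller of the two messages. */
    static long combine(long originalMessage, long messageToCombine) {
        return Math.min(originalMessage, messageToCombine);
    }

    /** Folds all messages for one vertex into a single message. */
    static long combineAll(long[] messages) {
        long combined = Long.MAX_VALUE; // plays the role of createInitialMessage()
        for (long m : messages) {
            combined = combine(combined, m);
        }
        return combined;
    }

    public static void main(String[] args) {
        System.out.println(combineAll(new long[]{9, 4, 7})); // prints 4
    }
}
```

Because WCC only ever needs the smallest incoming id, collapsing three messages into one loses nothing while reducing the data that has to be shipped and stored per vertex.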
The end!