This post walks through the source code of a Spark word-count program.
The Java source is as follows:
package sparkTest;

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        String logFile = "file:///home/hadoop/workspace/sparkTest/input/README.md"; // should be some file on your system
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Build an RDD of Strings, one element per line of the input file.
        JavaRDD<String> textFile = sc.textFile(logFile);

        // flatMap adds a flattening step on top of map: the sequences returned
        // for all lines are merged into a single RDD of words.
        JavaRDD<String> words = textFile.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
            }
        });

        // Apply a PairFunction to turn each word into a (word, 1) key-value pair.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Merge the values of pairs that share the same key.
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return a + b;
            }
        });

        counts.saveAsTextFile("file:///home/hadoop/workspace/sparkTest/output");
        sc.stop();
    }
}
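On Spark 2.x with Java 8, the same pipeline is usually written with lambdas. A minimal sketch, assuming the Spark 2.x Java API, where FlatMapFunction.call returns an Iterator rather than the Iterable used in the 1.x code above (hence the extra .iterator(); this will not compile against Spark 1.x):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class WordCountLambda {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> textFile = sc.textFile("file:///home/hadoop/workspace/sparkTest/input/README.md");

        JavaPairRDD<String, Integer> counts = textFile
                // Spark 2.x: the flatMap function must return an Iterator.
                .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        counts.saveAsTextFile("file:///home/hadoop/workspace/sparkTest/output");
        sc.stop();
    }
}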
1. sc.textFile() first builds the initial JavaRDD<String>, with one element per line of the input file.
2. flatMap() transforms each RDD element and flattens the results. Unlike map(), the function passed to flatMap() must return a sequence such as a List. The official documentation puts it this way (a small map-vs-flatMap demo follows the quote):
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
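To make the difference concrete, here is a small self-contained demo, assuming the same Spark 1.x Java API as the program above (FlatMapVsMap and its variable names are hypothetical): map yields exactly one output element per input line, while flatMap flattens the per-line lists into a single RDD of tokens.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;

public class FlatMapVsMap {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("FlatMapVsMap").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a b", "c"));

        // map: exactly one output per input -> an RDD of 2 String[] elements.
        JavaRDD<String[]> mapped = lines.map(new Function<String, String[]>() {
            public String[] call(String s) {
                return s.split(" ");
            }
        });

        // flatMap: each input yields 0..n outputs, flattened -> "a", "b", "c".
        JavaRDD<String> flattened = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
            }
        });

        System.out.println("map produced " + mapped.count() + " elements");        // 2
        System.out.println("flatMap produced " + flattened.count() + " elements"); // 3
        sc.stop();
    }
}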
3. words.mapToPair(new PairFunction<String, String, Integer>()) turns each word from the flatMap() result into a (word, 1) key-value pair.
4. pairs.reduceByKey(new Function2<Integer, Integer, Integer>()) merges all pairs that share a key, summing their values into the final counts (a runnable demo of steps 3 and 4 follows).
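A runnable mini-demo of steps 3 and 4, assuming the same Spark 1.x API (PairReduceDemo is a hypothetical name); the words RDD is replaced by a small parallelize()d list so the intermediate pairs are easy to follow:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class PairReduceDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("PairReduceDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Words as they might come out of the flatMap step.
        JavaRDD<String> words = sc.parallelize(Arrays.asList("Spark", "is", "fast", "Spark"));

        // Step 3 -> (Spark,1) (is,1) (fast,1) (Spark,1)
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String s) {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        // Step 4 -> (Spark,2) (is,1) (fast,1)
        JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return a + b;
            }
        });

        // collect() brings the small result back to the driver for printing.
        for (Tuple2<String, Integer> t : counts.collect()) {
            System.out.println(t);
        }
        sc.stop();
    }
}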
5. saveAsTextFile(path) writes the final result under path, which may be on the local file system, HDFS, or any other Hadoop-supported file system. Note that the path is treated as a directory: Spark writes one part-XXXXX file per partition into it.
Note: of all the operations above, only saveAsTextFile() is an action; the rest are transformations. Transformations are lazy, meaning they only record the computation and nothing runs until an action is called, as the sketch below demonstrates.
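A minimal sketch of that laziness, assuming the same Spark 1.x setup (LazyDemo and its variable names are hypothetical): the function passed to map() prints nothing when the transformation is declared, and only runs once the count() action triggers the job.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class LazyDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazyDemo").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(1, 2, 3));

        // map is a transformation: this line returns immediately and prints
        // nothing, because the function has not been applied yet.
        JavaRDD<Integer> doubled = nums.map(new Function<Integer, Integer>() {
            public Integer call(Integer x) {
                System.out.println("processing " + x);
                return x * 2;
            }
        });
        System.out.println("no processing has happened yet");

        // count() is an action: only now does Spark run the map function.
        System.out.println("count = " + doubled.count());
        sc.stop();
    }
}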
The input file contents are as follows:
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides

<http://spark.apache.org/>
And the word counts written to the output directory, one tuple per line (the (,2) entry counts the empty strings that split(" ") yields for the two blank lines):

(Spark,2)
(provides,1)
(is,1)
(general,1)
(a,1)
(Big,1)
(fast,1)
(Apache,1)
(#,1)
(,2)
(cluster,1)
(Data.,1)
(It,1)
(for,1)
(computing,1)
(and,1)
(<http://spark.apache.org/>,1)
(system,1)
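If those empty tokens are unwanted, a filter() transformation can drop them before counting. A minimal sketch in the same Spark 1.x style, meant as a replacement for the words step of the program above (it additionally needs import org.apache.spark.api.java.function.Function):

        // Split each line into tokens, then drop the empty strings that
        // split(" ") produces for blank lines, before counting.
        JavaRDD<String> words = textFile
                .flatMap(new FlatMapFunction<String, String>() {
                    public Iterable<String> call(String s) {
                        return Arrays.asList(s.split(" "));
                    }
                })
                .filter(new Function<String, Boolean>() {
                    public Boolean call(String s) {
                        return !s.isEmpty();
                    }
                });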