A detailed look at the Spark RDD operators flatMap, mapToPair, and reduceByKey

1. Official explanation

1.1 flatMap

<U> JavaRDD<U> flatMap(FlatMapFunction<T, U> f)

Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.

Parameters:

f - (undocumented)

Returns:

(undocumented)

In other words, the input is an RDD and the output is also an RDD, and the function applied to each element must return a collection of results (an Iterator in the Java API), which is then flattened into the output RDD.

A comparison of map and flatMap is shown below (this reflects my own understanding; corrections are welcome).

[Figure 1: comparison of map vs. flatMap]
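To make the difference concrete, here is a minimal sketch (assuming a JavaRDD<String> named lines in which each line holds space-separated words; the variable names are only for illustration):

// map: exactly one output element per input element, so the result is a JavaRDD<String[]>
JavaRDD<String[]> arrays = lines.map(line -> line.split(" "));

// flatMap: each input element expands into zero or more elements, and the returned
// iterators are flattened into a single JavaRDD<String> of individual words
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());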

1.2 mapToPair

<K2, V2> JavaPairRDD<K2, V2> mapToPair(PairFunction<T, K2, V2> f)

Return a new RDD by applying a function to all elements of this RDD.

Parameters:

f - (undocumented)

Returns:

(undocumented)

In other words, the function f is called on every element of the RDD; each element of the original RDD has type T, and f transforms it into a key-value pair (a Tuple2), so the result is a JavaPairRDD.
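For example, a minimal sketch (reusing the hypothetical words RDD from the previous sketch): each String element is mapped to a Tuple2, which is what turns the JavaRDD into a JavaPairRDD.

// Each word (type T = String) becomes a (word, 1) pair, i.e. a Tuple2<String, Integer>
JavaPairRDD<String, Integer> pairs = words.mapToPair(word -> new Tuple2<>(word, 1));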

1.3 reduceByKey

public JavaPairRDD<K, V> reduceByKey(Partitioner partitioner, Function2<V, V, V> func)

Merge the values for each key using an associative and commutative reduce function. This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.

Parameters:

partitioner - (undocumented)

func - (undocumented)

Returns:

(undocumented)

In other words, the values for each key are merged using an associative and commutative reduce function. The merging is also performed locally on each mapper before the results are sent to a reducer, similar to a "combiner" in MapReduce.

With reduceByKey, Spark can combine the values that share a key within each partition before any data is moved between partitions. The diagram below helps to understand what actually happens inside reduceByKey: note how, before any pairs are shuffled, values with the same key on the same machine are combined (by the lambda passed to reduceByKey); the lambda is then called again on each partition's results to reduce all values for a key down to a single final result, as shown below:

[Figure 2: how reduceByKey combines values with the same key within each partition before the shuffle]
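A minimal sketch matching the signature quoted above, continuing from the hypothetical pairs RDD in the previous sketch; when you call the overload that takes only a partition count, Spark builds a HashPartitioner for you, so the two forms below are equivalent.

// Merge the values of each key with an associative, commutative function; partial sums are
// computed inside each partition before the shuffle, like a MapReduce combiner
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new HashPartitioner(3), (v1, v2) -> v1 + v2);

// Equivalent shorthand: reduceByKey(func, numPartitions) uses a HashPartitioner internally
JavaPairRDD<String, Integer> counts2 = pairs.reduceByKey((v1, v2) -> v1 + v2, 3);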

2. Hands-on examples

The examples below were all run locally on Windows in local mode. Since my local Hadoop is hadoop-2.7.3, when configuring the Hadoop environment variables you need to find a hadoop.dll that matches hadoop-2.7.3 from the official site; otherwise you will get the following error:

NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V

2.1 WordCount

Sample code:

static {
        try {
            System.load("D:\\hadoop-2.7.3\\bin\\hadoop.dll");//an absolute path to hadoop.dll under the Hadoop bin directory is recommended
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Native code library failed to load.\n " + e);
            System.exit(1);
        }
    }

    public static void main(String[] args) throws Exception {

        System.setProperty("HADOOP_USER_NAME", "admin");

        SparkConf conf = new SparkConf().setAppName("Java-Test-WordCount").setMaster("local[*]");
        SparkContext sc = new SparkContext(conf);

        //WordCount demo
        JavaRDD<String> rdd = sc.textFile("D:\\words.txt", 2).toJavaRDD();

        //Split each line on spaces and flatten the resulting lists into a single RDD of words
        JavaRDD<String> words = rdd.flatMap(s -> Arrays.asList(s.split(" ")).iterator());

        //Turn each word into a key-value pair, with the value fixed at 1, stored in a Tuple2
        JavaPairRDD<String, Integer> stringIntegerJavaPairRDD = words.mapToPair((PairFunction<String, String, Integer>) t -> new Tuple2<>(t, 1));

        //reduceByKey groups by the first element of each Tuple2 (the word), sums the values, and writes the result into 3 partitions
        stringIntegerJavaPairRDD.reduceByKey((Function2<Integer, Integer, Integer>) (i1, i2) -> i1 + i2, 3).saveAsTextFile("D://result");

        sc.stop();
    }

The input file "D:\\words.txt" has the following content:

i i
lo lp
lo
k m

Because I wrote the output with 3 numPartitions, three separate output files were produced:

[Figure 3: the three part files produced in the output directory D:\result]

The contents of the output files are as follows:

Contents of part-00000:

(i,2)
(lo,2)

Contents of part-00001:

(lp,1)
(m,1)

Contents of part-00002:

(k,1)
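Which part file a given key lands in is decided by the HashPartitioner that reduceByKey(func, numPartitions) uses under the hood: the partition index is the key's hashCode() modulo numPartitions, made non-negative. A quick sketch to check this against the output above:

// Reproduce the HashPartitioner placement for the five words with numPartitions = 3
for (String key : new String[]{"i", "lo", "lp", "m", "k"}) {
    int partition = (key.hashCode() % 3 + 3) % 3;   // i -> 0, lo -> 0, lp -> 1, m -> 1, k -> 2
    System.out.println(key + " -> part-0000" + partition);
}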

2.2 sum

2.2.1 Sum by a single key (aggregate a sum per key)

Code:

static {
        try {
            System.load("D:\\hadoop-2.7.3\\bin\\hadoop.dll");//an absolute path to hadoop.dll under the Hadoop bin directory is recommended
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Native code library failed to load.\n " + e);
            System.exit(1);
        }
    }

    public static void main(String[] args) throws Exception {

        System.setProperty("HADOOP_USER_NAME", "admin");

        SparkConf conf = new SparkConf().setAppName("Java-Test-WordCount").setMaster("local[*]");
        SparkContext sc = new SparkContext(conf);

        //sum demo: each input line is "key value"; turn it into a (key, value) pair
        JavaPairRDD<String, Integer> stringIntegerJavaPairRDD1 = sc.textFile("D:\\sum.txt", 3).toJavaRDD().flatMapToPair((PairFlatMapFunction<String, String, Integer>) s -> {
            Tuple2<String, Integer> tuple2 = new Tuple2<>(s.split(" ")[0], Integer.parseInt(s.split(" ")[1]));
            return Arrays.asList(tuple2).iterator();
        });

        //Sum the values for each key and write the result into a single partition
        stringIntegerJavaPairRDD1.reduceByKey((Function2<Integer, Integer, Integer>) (v1, v2) -> v1 + v2, 1).saveAsTextFile("D:\\sumresult");

        sc.stop();
    }

The input file "D:\\sum.txt" contains:

 word 3
word 4
count 1
count 2
sum 4
sum 5
group 7
by 1

Because I used only 1 numPartition for the output, there is just one output file:

[Figure 4: the single part file produced in the output directory D:\sumresult]

Contents of part-00000:

(sum,9)
(word,7)
(group,7)
(by,1)
(count,3)

2.2.2 Sum by multiple key fields (aggregate a sum over a composite key)

Code:

MoreDimension.java

import java.io.Serializable;
import java.util.Objects;

public class MoreDimension implements Serializable {
    private String cityName;
    private String areaName;
    private String schoolName;

    public MoreDimension(String cityName, String areaName, String schoolName) {
        this.cityName = cityName;
        this.areaName = areaName;
        this.schoolName = schoolName;
    }

    public String getCityName() {
        return cityName;
    }

    public void setCityName(String cityName) {
        this.cityName = cityName;
    }

    public String getAreaName() {
        return areaName;
    }

    public void setAreaName(String areaName) {
        this.areaName = areaName;
    }

    public String getSchoolName() {
        return schoolName;
    }

    public void setSchoolName(String schoolName) {
        this.schoolName = schoolName;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MoreDimension)) return false;
        MoreDimension that = (MoreDimension) o;
        return getCityName().equals(that.getCityName()) &&
                getAreaName().equals(that.getAreaName()) &&
                getSchoolName().equals(that.getSchoolName());
    }

    @Override
    public int hashCode() {
        return Objects.hash(getCityName(), getAreaName(), getSchoolName());
    }

    @Override
    public String toString() {
        return cityName + " " + areaName + " " + schoolName + " ";
    }
}

static {
        try {
            System.load("D:\\hadoop-2.7.3\\bin\\hadoop.dll");//an absolute path to hadoop.dll under the Hadoop bin directory is recommended
        } catch (UnsatisfiedLinkError e) {
            System.err.println("Native code library failed to load.\n " + e);
            System.exit(1);
        }
    }

    public static void main(String[] args) throws Exception {

        System.setProperty("HADOOP_USER_NAME", "admin");

        SparkConf conf = new SparkConf().setAppName("Java-Test-WordCount").setMaster("local[*]");
        SparkContext sc = new SparkContext(conf);

        //sum grouped by a multi-field key demo
        JavaPairRDD<MoreDimension, Integer> listIntegerJavaPairRDD = sc.textFile("D:\\summ.txt", 1).toJavaRDD().flatMapToPair((PairFlatMapFunction<String, MoreDimension, Integer>) s -> {
            //Key = (city, district, school); field 3 (the class name) is not part of the key, field 4 is the value to sum
            MoreDimension moreDimension = new MoreDimension(s.split(" ")[0], s.split(" ")[1], s.split(" ")[2]);
            Tuple2<MoreDimension, Integer> moreDimensionIntegerTuple2 = new Tuple2<>(moreDimension, Integer.parseInt(s.split(" ")[4]));
            return Arrays.asList(moreDimensionIntegerTuple2).iterator();
        });

        listIntegerJavaPairRDD.reduceByKey((Function2<Integer, Integer, Integer>) (v1, v2) -> v1 + v2, 1).saveAsTextFile("D:\\summresult");

        sc.stop();
    }

 "D:\\summ.txt"

南京市 雨花台区 花小 一班 2
南京市 雨花台区 花小 二班 10
南京市 雨花台区 实小 一班 3
南京市 雨花台区 实小 二班 1

Because I assigned only one numPartition, there is a single output file:

[Figure 5: the single part file produced in the output directory D:\summresult]

Contents of part-00000:

(南京市 雨花台区 实小 ,4)
(南京市 雨花台区 花小 ,12)
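MoreDimension works as a reduceByKey key only because it overrides equals and hashCode; without them, every line would be its own key and nothing would be merged. As an alternative sketch (my own variation, not part of the original code), the composite key could also be a scala.Tuple3, which already provides equals/hashCode:

// Composite key = (city, district, school); field 3 (the class name) is ignored, field 4 is the value
JavaPairRDD<Tuple3<String, String, String>, Integer> byDims = sc.textFile("D:\\summ.txt", 1).toJavaRDD()
        .mapToPair((PairFunction<String, Tuple3<String, String, String>, Integer>) s -> {
            String[] f = s.split(" ");
            return new Tuple2<>(new Tuple3<>(f[0], f[1], f[2]), Integer.parseInt(f[4]));
        });

// Hypothetical output directory, chosen just to avoid overwriting the original result
byDims.reduceByKey((v1, v2) -> v1 + v2, 1).saveAsTextFile("D:\\summresult-tuple3");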

Feel free to reach out any time if you have questions and we can work through them together~~
