Three ways to compute an average in Spark

Method one:

Using groupByKey

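        // Average, method one: groupByKey
        // Assumes textFile is a JavaRDD<String> of "key score" lines,
        // with java.util.Iterator and scala.Tuple2 imported.
        textFile.mapToPair(line -> new Tuple2<>(line.split(" ")[0], Integer.parseInt(line.split(" ")[1])))
                .groupByKey()
                .mapToPair(info -> {
                    double sum = 0;
                    double count = 0;
                    // Typed iterator: a raw Iterator returns Object, so "sum += it.next()" would not compile
                    Iterator<Integer> it = info._2().iterator();
                    while (it.hasNext()) {
                        sum += it.next();
                        count++;
                    }
                    double ave = sum / count;
                    return new Tuple2<>(info._1(), ave);
                })
                .collect()
                .forEach(System.out::println);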

Result:

(EEE,420.14285714285717)
(FFF,464.625)
(CCC,447.9375)
(AAA,504.125)
(BBB,351.2)
(DDD,512.3)
(GGG,470.3076923076923)

Method two:

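        // Average, method two: combineByKey
        textFile.mapToPair(line -> new Tuple2<>(line.split(" ")[0], Integer.parseInt(line.split(" ")[1])))
                .combineByKey(score -> new Tuple2<>(score, 1),  // map the first score of a key into a (sum, count) pair, the initial value for in-partition aggregation
                        (t, score) -> new Tuple2<>(t._1() + score, t._2() + 1), // aggregation within a partition
                        (a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()))   // aggregation across partitions
                // int / int drops the fractional part; cast one operand to double to keep it,
                // e.g. (double) info._2()._1() / info._2()._2()
                .mapToPair(info -> new Tuple2<>(info._1(), info._2()._1()/info._2()._2()))
                .collect()
                .forEach(System.out::println);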

Result (whole numbers, because the final division is an integer division):

(EEE,420)
(FFF,464)
(CCC,447)
(AAA,504)
(BBB,351)
(DDD,512)
(GGG,470)
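
These values have lost the fractional part that method one returned for the same data. That is what Java does when both operands of `/` are `int`. A minimal plain-Java sketch of the difference, without Spark; the class and method names are made up for illustration, and 5882 and 14 are the sum and count of the EEE rows in the raw data:

```java
public class AverageDivision {
    // Both operands are int, so Java performs integer division
    // and discards the fractional part.
    public static int truncatedAverage(int sum, int count) {
        return sum / count;
    }

    // Casting one operand to double makes the whole division floating point.
    public static double exactAverage(int sum, int count) {
        return (double) sum / count;
    }

    public static void main(String[] args) {
        int sum = 5882;   // sum of the 14 EEE scores in the raw data
        int count = 14;
        System.out.println(truncatedAverage(sum, count)); // 420
        System.out.println(exactAverage(sum, count));     // 420.14285714285717
    }
}
```

The two printed values correspond to the EEE rows of method two and method one respectively.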

Method three:

Scala version

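// Assumes data is an RDD[(String, Int)] of (key, score) pairs
data
.map(a => (a._1, (a._2, 1)))                   // (key, (score, 1))
.reduceByKey((a,b) => (a._1+b._1,a._2+b._2))   // add up sums and counts per key
.map(t => (t._1,t._2._1.toDouble/t._2._2))     // toDouble avoids integer division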
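
Methods two and three both carry a (sum, count) pair per key instead of a running average, because partial averages from different partitions cannot simply be averaged again. A small plain-Java sketch of that idea, with no Spark involved. It assumes the 16 FFF scores from the raw data happen to be split across two partitions; the split and the SumCount class are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

public class SumCountMerge {
    /** Hypothetical accumulator: the (sum, count) pair kept for one key. */
    public static final class SumCount {
        public final long sum;
        public final long count;

        public SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Within a partition: fold one more score into the pair.
        public SumCount add(int score) {
            return new SumCount(sum + score, count + 1);
        }

        // Across partitions: add sums and counts component-wise.
        public SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        public double average() {
            return (double) sum / count;
        }
    }

    // Reduce all scores of one partition to a single (sum, count) pair.
    public static SumCount fold(List<Integer> scores) {
        SumCount acc = new SumCount(0, 0);
        for (int score : scores) {
            acc = acc.add(score);
        }
        return acc;
    }

    public static void main(String[] args) {
        // The FFF scores from the raw data, in an assumed two-partition split.
        List<Integer> partition1 = Arrays.asList(578, 268, 173, 968, 226);
        List<Integer> partition2 = Arrays.asList(39, 227, 159, 482, 620, 142, 791, 678, 436, 834, 813);

        SumCount a = fold(partition1);
        SumCount b = fold(partition2);

        // Merging the pairs first gives the true mean.
        System.out.println(a.merge(b).average());            // 464.625
        // Averaging the two partition averages gives a different, wrong value,
        // because the partitions hold different numbers of scores.
        System.out.println((a.average() + b.average()) / 2);
    }
}
```

The merge step is associative and commutative, which is what lets the per-partition results be combined in any order.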

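Summary:

Method one is the easiest to read, but groupByKey shuffles every value across the network and holds all values of a key in memory at once, so it is the most expensive of the three and can run out of memory on large or skewed keys.
Method two is compact, and combineByKey aggregates within each partition before the shuffle.
Method three is both readable and efficient (reduceByKey also combines within each partition first); it is the recommended one.
PS: Spark SQL can also be used for this kind of aggregation.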

Raw data:

FFF 578
GGG 839
EEE 566
AAA 815
AAA 334
FFF 268
BBB 963
FFF 173
EEE 160
EEE 309
AAA 131
AAA 312
GGG 472
BBB 78
AAA 80
FFF 968
EEE 774
GGG 960
FFF 226
CCC 725
GGG 671
CCC 155
AAA 927
BBB 41
EEE 622
BBB 4
BBB 715
CCC 201
GGG 131
EEE 16
EEE 872
GGG 44
EEE 71
AAA 303
FFF 39
BBB 410
CCC 349
CCC 401
AAA 53
EEE 189
GGG 411
EEE 580
AAA 215
CCC 355
EEE 470
FFF 227
GGG 501
AAA 753
CCC 385
DDD 239
BBB 146
CCC 897
CCC 670
DDD 778
AAA 993
CCC 757
CCC 802
FFF 159
AAA 841
BBB 273
DDD 317
DDD 483
FFF 482
FFF 620
CCC 415
FFF 142
EEE 462
AAA 783
GGG 452
BBB 258
AAA 752
EEE 483
BBB 0
BBB 242
DDD 743
GGG 175
EEE 308
AAA 516
BBB 971
BBB 280
DDD 774
FFF 791
GGG 479
CCC 647
DDD 548
CCC 253
DDD 493
FFF 678
CCC 81
AAA 258
FFF 436
BBB 658
DDD 350
GGG 418
BBB 229
FFF 834
CCC 74
DDD 398
GGG 561
FFF 813
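
For a quick local cross-check of the per-key averages without a Spark cluster, the same "key score" lines can be grouped with plain Java streams. This is only a sketch: the class and method names are made up, and the sample in main is just four of the rows above, so its averages are not those of the full data set:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class LocalAverageCheck {
    // Parse "key score" lines and compute the mean score per key.
    // A TreeMap keeps the keys sorted, so the printed order is stable.
    public static Map<String, Double> averageByKey(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(" "))
                .collect(Collectors.groupingBy(
                        parts -> parts[0],
                        TreeMap::new,
                        Collectors.averagingInt(parts -> Integer.parseInt(parts[1]))));
    }

    public static void main(String[] args) {
        // Four rows taken from the raw data, for illustration only.
        List<String> sample = Arrays.asList("DDD 239", "DDD 778", "BBB 963", "BBB 78");
        System.out.println(averageByKey(sample)); // {BBB=520.5, DDD=508.5}
    }
}
```

Passing all lines of the raw data to averageByKey should give the same per-key values as method one.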
