Spark Series 1: Shared Variables (broadcast and accumulator)

An important Spark feature is shared variables.
How shared variables work:
By default, when the function passed to an operator uses an external variable, that variable's value is copied into every task, and each task can only operate on its own copy. If multiple tasks need to share one variable, this default mechanism cannot do it.
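As a plain-Java sketch (no Spark involved; the class and method names are hypothetical), the copy-per-task behavior can be simulated like this: each "task" receives its own copy of the driver's variable, so updates made inside tasks never reach the driver's original.

```java
import java.util.stream.IntStream;

public class CopyPerTaskDemo {
    // Simulate Spark's default behavior: every task gets its own copy
    // of a closure variable, just as if it were serialized into the task.
    static int runTasks() {
        int[] driverCounter = {0};            // the "driver-side" variable
        IntStream.range(0, 4).forEach(task -> {
            int localCopy = driverCounter[0]; // copy shipped to the task
            localCopy += 10;                  // the task mutates only its copy
            // localCopy is discarded when the task finishes
        });
        return driverCounter[0];              // unchanged on the driver
    }

    public static void main(String[] args) {
        System.out.println(runTasks()); // prints 0: task updates are lost
    }
}
```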
Spark therefore provides two kinds of shared variables: the Broadcast Variable and the Accumulator. A broadcast variable is copied just once per node instead of once per task; its main use is performance optimization, cutting network transfer and memory consumption. An accumulator lets multiple tasks operate on one shared variable, mainly by adding to it.

Broadcast: call SparkContext's broadcast() method on a variable to create a broadcast variable. When the variable is then used inside an operator's function, each node holds only a single copy. Code reads the value through the broadcast variable's value() method. Remember: broadcast variables are read-only.
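The broadcast idea can be sketched in plain Java (no Spark; all names are hypothetical): every "task" on a node reads one shared, read-only value instead of carrying its own copy of it.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class BroadcastSketch {
    // Stand-in for a broadcast variable: one read-only copy per "node",
    // referenced (not copied) by every task running on that node.
    static final int FACTOR = 3;

    static int[] mapWithBroadcast(int[] numbers) {
        // Every "task" reads the same shared FACTOR; none may modify it.
        return IntStream.of(numbers).map(n -> n * FACTOR).toArray();
    }

    public static void main(String[] args) {
        int[] result = mapWithBroadcast(new int[]{1, 2, 3, 4, 5});
        System.out.println(Arrays.toString(result)); // [3, 6, 9, 12, 15]
    }
}
```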
Accumulator: used so that multiple nodes can operate on one shared variable. An accumulator only supports adding, but in exchange it lets many tasks update a single variable in parallel. Tasks can only add to an accumulator; they cannot read its value. Only the driver program can read an accumulator's value.
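A minimal plain-Java stand-in for the accumulator contract (hypothetical names, using AtomicInteger rather than Spark's Accumulator): parallel "tasks" only add to the shared value, and only the "driver" reads the final result.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

public class AccumulatorSketch {
    // Minimal stand-in for Spark's Accumulator: tasks may only add,
    // and only the "driver" reads the accumulated value at the end.
    static int sumWithAccumulator(int[] numbers) {
        AtomicInteger acc = new AtomicInteger(0);
        // Parallel "tasks" each add their element; none of them reads acc.
        IntStream.of(numbers).parallel().forEach(acc::addAndGet);
        // Only the driver reads the result, after all tasks have finished.
        return acc.get();
    }

    public static void main(String[] args) {
        int[] numbers = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(sumWithAccumulator(numbers)); // prints 55
    }
}
```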


Worked examples:
1. Java version:
Broadcast:
package cn.spark.study.core;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("BroadcastDemo")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);

        // Wrap the factor in a broadcast variable: one copy per node, not per task.
        int factor = 3;
        final Broadcast<Integer> factorBroadcast = sc.broadcast(factor);

        JavaRDD<Integer> multipliedRDD = numbersRDD.map(new Function<Integer, Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public Integer call(Integer v1) throws Exception {
                // Read the broadcast value; broadcast variables are read-only.
                return v1 * factorBroadcast.value();
            }
        });

        multipliedRDD.foreach(new VoidFunction<Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Integer t) throws Exception {
                System.out.println(t);
            }
        });

        sc.close();
    }
}
Accumulator:
package cn.spark.study.core;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class AccumulatorDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("AccumulatorDemo")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Integer> numbers = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        JavaRDD<Integer> numbersRDD = sc.parallelize(numbers);

        // Tasks may only add to the accumulator; only the driver reads it.
        final Accumulator<Integer> sum = sc.accumulator(0);

        numbersRDD.foreach(new VoidFunction<Integer>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(Integer t) throws Exception {
                sum.add(t);
            }
        });

        // Read on the driver side, after the action has run.
        System.out.println(sum.value()); // 55

        sc.close();
    }
}

2. Scala version:
Broadcast:
package com.spark.study.core

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object BroadcastDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("BroadcastDemo")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val numbers = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val numbersRDD = sc.parallelize(numbers, 1)

    // One read-only copy of factor per node.
    val factor = 3
    val factorBroadcast = sc.broadcast(factor)

    val multiplied = numbersRDD.map(num => num * factorBroadcast.value)
    multiplied.foreach(println)
  }
}
Accumulator:
package com.spark.study.core

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AccumulatorDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("AccumulatorDemo")
      .setMaster("local")
    val sc = new SparkContext(conf)

    val numbers = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
    val numbersRDD = sc.parallelize(numbers, 1)

    // Tasks only add; the driver reads the result via sum.value.
    val sum = sc.accumulator(0)
    numbersRDD.foreach(num => sum += num)
    println(sum.value) // 55
  }
}


Source: "ITPUB Blog", link: http://blog.itpub.net/30541278/viewspace-2153549/. Please credit the source when reposting.
