Using Spark Broadcast Variables (repost)

Environment:
Ubuntu 16.04, 64-bit
pseudo-distributed mode
Spark 2.3.1
Scala 2.11.8
Reference link:
https://blog.csdn.net/android_xue/article/details/79780463#commentsedit

Note: this post is a summary of the reference link above.

In one sentence, what is a broadcast variable for?
Without one, Spark serializes any driver-side variable referenced in a task closure and ships a separate copy of it with every single task. Broadcasting instead sends the variable to each worker once and caches a single read-only copy there, saving both memory and repeated transfer time.
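Before the full examples, here is a minimal, self-contained sketch of the core API; the object name BroadcastSketch and the tiny in-memory test data are illustrative, not from the original post:

import org.apache.spark.sql.SparkSession

object BroadcastSketch
{
  def main(args: Array[String]): Unit =
  {
    val spark = SparkSession.builder
      .appName("Sketch").config("spark.master", "local")
      .getOrCreate()
    val sc = spark.sparkContext

    val bc = sc.broadcast(List("hello"))          // one read-only copy cached per executor
    sc.parallelize(Seq("hello", "bye"))           // small in-memory test data
      .filter(line => bc.value.contains(line))    // read the value inside the task via .value
      .collect().foreach(println)                 // prints: hello

    spark.stop()
  }
}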
Now the full examples.
1. Complete code without a broadcast variable, BroadcastTest1.scala:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object BroadcastTest
{
  def main(args: Array[String]): Unit =
  {
    // Suppress the flood of INFO/WARN log output.
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    Logger.getRootLogger().setLevel(Level.ERROR)

    val spark = SparkSession.builder
      .appName("Intro").config("spark.master", "local")
      .getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

    // The list lives on the driver; without broadcasting, a serialized
    // copy of it is shipped with every task that references it.
    val list = List("hello java lalala")
    val linesRDD = spark.read.textFile("hdfs://master:9000/test/word.txt")

    // Keep only the lines that appear in the list.
    linesRDD.filter(line => list.contains(line)).collect().foreach(println)

    spark.stop()
  }
}

How to run:
1. Start the Hadoop HDFS daemons.
2. hdfs dfs -put word.txt hdfs://master:9000/test/
3. scalac BroadcastTest1.scala
4. scala BroadcastTest
Alternatively, package the job with Maven and run it with spark-submit, as sketched below.
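A minimal spark-submit invocation might look like the following; the jar name broadcast-test.jar is a placeholder for whatever your Maven build actually produces:

spark-submit --class BroadcastTest --master local[*] broadcast-test.jar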

2. Complete code with a broadcast variable, BroadcastTest2.scala:

import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object BroadcastTest
{
  def main(args: Array[String]): Unit =
  {
    // Suppress the flood of INFO/WARN log output.
    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)
    Logger.getRootLogger().setLevel(Level.ERROR)

    val spark = SparkSession.builder
      .appName("Intro").config("spark.master", "local")
      .getOrCreate()
    val sc = spark.sparkContext
    sc.setLogLevel("ERROR")

    val list = List("hello java lalala")
    // Broadcast the list: each executor receives and caches a single
    // read-only copy instead of getting one copy per task.
    val broadcast = sc.broadcast(list)

    val linesRDD = spark.read.textFile("hdfs://master:9000/test/word.txt")

    // Inside the task closure, read the broadcast value via .value.
    linesRDD.filter(line => broadcast.value.contains(line)).collect().foreach(println)

    spark.stop() // sc.stop() would also work here
  }
}
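When the broadcast value is no longer needed, its executor copies can be released. Assuming the broadcast handle from the example above, the two cleanup calls that Spark's Broadcast class provides look like this:

broadcast.unpersist() // drop the cached copies on the executors; re-sent lazily if used again
broadcast.destroy()   // remove all data and metadata, including on the driver; the handle is unusable afterwards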

How to run:
1. Start the Hadoop HDFS daemons.
2. hdfs dfs -put word.txt hdfs://master:9000/test/
3. scalac BroadcastTest2.scala
4. scala BroadcastTest
Again, you can instead package the job with Maven and run it via spark-submit, as sketched earlier.
