Spark Kryo Serialization: Configuration and a Running Example

1: Configuration

You can set this globally in spark-defaults.conf, or set it on the SparkConf when initializing your application: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). The setting applies both to shuffling data between machines and to serializing RDDs to disk or memory.
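For the global route, the entry in spark-defaults.conf is a single key/value line; a minimal sketch, assuming the stock $SPARK_HOME/conf/spark-defaults.conf location:

spark.serializer        org.apache.spark.serializer.KryoSerializer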

Spark does not make Kryo the default serializer because Kryo requires registering the classes you will serialize, but the official docs strongly recommend it for any network-intensive application. From the tuning guide:

You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk.

See https://spark.apache.org/docs/latest/tuning.html for details.

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Student]))
val sc = new SparkContext(sparkConf)
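As an optional extra (not in the original example), Spark's spark.kryo.registrationRequired flag makes the job fail fast if any class reaches Kryo without having been registered; a minimal sketch:

val strictConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true") // throw on unregistered classes instead of silently storing full class names
  .registerKryoClasses(Array(classOf[Student]))

Without the flag, Kryo still works on unregistered classes, but it stores the full class name with every object, which wastes space.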

2: Code

package g5.learning

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

import scala.collection.mutable.ListBuffer

case class Student(id: Int, name: String, age: Int)

object SerializationApp1 {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
      // .setMaster("local[2]").setAppName("SerializationApp1")
      // .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // The two lines above are passed via spark-submit instead, so the source
      // does not have to be edited (and the jar rebuilt) for every run.
      .registerKryoClasses(Array(classOf[Student])) // registering here is harmless; it is simply wasted effort if Kryo is never enabled
    val sc = new SparkContext(sparkConf)

    // Build a sample dataset of 10,000 Student objects.
    val students = ListBuffer[Student]()
    for (i <- 1 to 10000) {
      students.append(Student(i, "ruoze" + i, 39))
    }

    val studentRDD = sc.parallelize(students)
    // Cache in serialized form so the choice of serializer shows up in the memory footprint.
    studentRDD.persist(StorageLevel.MEMORY_ONLY_SER)
    studentRDD.count()

    Thread.sleep(1000 * 20) // keep the job alive long enough to inspect the web UI
    sc.stop()
  }
}
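To verify that the serializer supplied on the spark-submit command line (next section) actually reached the job, the value can be read back from the SparkContext. A small sketch, assuming it is placed inside main after sc is created:

    // Print the serializer in effect; the second argument is the default
    // Spark falls back to when the key is unset (Java serialization).
    println(sc.getConf.get("spark.serializer", "org.apache.spark.serializer.JavaSerializer"))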

3: Package the jar, upload it, and edit the shell script

[hadoop@hadoop001 shell]$ vi kryo_ser.sh
[hadoop@hadoop001 shell]$ rz           
[hadoop@hadoop001 shell]$ vi kryo_ser.sh

export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop
$SPARK_HOME/bin/spark-submit \
--master local[2] \
--class g5.learning.SerializationApp1 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--name SerializationApp1 \
/home/hadoop/lib/g5spark1-1.0.jar
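After saving, run the script with sh kryo_ser.sh. While the job sleeps for 20 seconds at the end, the Storage tab of the Spark web UI (http://localhost:4040 by default in local mode) shows the cached studentRDD, so its in-memory size can be compared with and without the Kryo setting.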

4: Result

[Figure 1: screenshot of the run result]
