Spark Tuning

1. Why Tune
Because of the in-memory nature of most Spark computations,
Spark programs can be bottlenecked by any resource in the cluster:
CPU, network bandwidth, or memory.
Most often, if the data fits in memory, the bottleneck is network bandwidth;
but sometimes you also need to do some tuning, such as storing RDDs
in serialized form, to decrease memory usage.
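As a minimal sketch of storing an RDD in serialized form (assuming sc is an existing SparkContext):

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER keeps each partition as a serialized byte array,
// trading extra CPU on access for a smaller memory footprint
val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count() // the first action materializes the serialized cache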
2. Data Serialization

2.1 Why Serialize

(1) Data sent over the network must be serialized.
(2) Inter-process communication also requires serialization.

2.2 Example
2.2.1 The Writable Interface
2.2.1.1 Javadoc

A serializable object which implements a simple, efficient,
serialization protocol.

2.2.1.2 Code

public interface Writable {
  // serialize this object's fields to out
  void write(DataOutput out) throws IOException;
  // deserialize this object's fields from in
  void readFields(DataInput in) throws IOException;
}

2.2.1.3 Notes

Any class that implements Writable can be serialized, which is why Hadoop recommends IntWritable rather than a plain int. A sketch of a custom Writable follows.
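As a minimal sketch (the PointWritable class and its fields are hypothetical), a custom type implements the two methods like this:

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

// a hypothetical two-field type; Writable implementations need a
// no-arg constructor so Hadoop can instantiate them reflectively
class PointWritable(var x: Int, var y: Int) extends Writable {
  def this() = this(0, 0)

  override def write(out: DataOutput): Unit = {
    out.writeInt(x)
    out.writeInt(y)
  }

  override def readFields(in: DataInput): Unit = {
    x = in.readInt()
    y = in.readInt()
  }
}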

2.3 Where Spark Uses Serialization

(1) External variables used inside operators (e.g. broadcast variables) have to travel over the network, so they must be serialized; a better serializer produces a smaller payload, which lowers the network-transfer cost and, when the value is held in memory, the memory footprint as well (see the sketch after this list).

(2) cache
(3) shuffle
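
A minimal sketch of all three scenarios, assuming sc is an existing SparkContext and words is an RDD[String] (both names are illustrative):

import org.apache.spark.storage.StorageLevel

val lookup = Map("spark" -> 1, "hadoop" -> 2)   // driver-side data
val bc = sc.broadcast(lookup)                   // (1) broadcast: serialized once, shipped to each executor

val pairs = words.map(w => (w, bc.value.getOrElse(w, 0)))
pairs.persist(StorageLevel.MEMORY_ONLY_SER)     // (2) cache: partitions stored as serialized bytes
val totals = pairs.reduceByKey(_ + _)           // (3) shuffle: records are serialized between nodes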

2.4 Serialization in Detail (from the official docs)

Serialization plays an important role
in the performance of any distributed application.
Formats that are slow to serialize objects into,
or consume a large number of bytes,
will greatly slow down the computation.
Spark provides two serialization libraries:
Java serialization (its advantage: it works with any class):
By default, Spark serializes objects using Java's ObjectOutputStream framework,
and can work with any class you create that
implements java.io.Serializable.
// Javadoc for java.io.ObjectOutputStream:
An ObjectOutputStream writes primitive data types
and graphs of Java objects to an OutputStream.
// java.io.Serializable
public interface Serializable {
}

java.io.Serializable is a marker interface;
any class that implements it can be serialized.
You can also control the performance of your serialization
more closely by extending java.io.Externalizable
(though this is rarely needed).
Java serialization is flexible but often quite slow,
and leads to large serialized formats for many classes.
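To make the size concern concrete, a minimal sketch (the User class is hypothetical) that serializes one object with ObjectOutputStream and prints the byte count:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// any class implementing java.io.Serializable can be written
// through ObjectOutputStream; Scala case classes already extend it
class User(val id: Int, val name: String) extends Serializable

val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(buffer)
out.writeObject(new User(1, "alice"))
out.close()
println(s"serialized size: ${buffer.size()} bytes") // typically far larger than the raw fields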
Kryo serialization (its advantages: faster and more compact):
Spark can also use the Kryo library (version 4)
to serialize objects more quickly.
Kryo is significantly faster and more compact than Java serialization (often as much as 10x),
but does not support all Serializable types and requires you
to register the classes you'll use in the program
in advance for best performance.
You can switch to using Kryo by initializing your job with a SparkConf and
calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"):
  val sparkConf = new SparkConf().
                      setMaster("local[2]").
                      setAppName("SparkContextApp").
                      set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
This setting configures the serializer used for not only shuffling data
between worker nodes but also when serializing RDDs to disk.
The only reason Kryo is not the default is because of
the custom registration requirement,
but we recommend trying it in any network-intensive application.
Since Spark 2.0.0, we internally use the Kryo serializer
when shuffling RDDs with simple types,
arrays of simple types, or string type.
Finally, if you don't register your custom classes,
Kryo will still work, but it will have to store
the full class name with each object, which is wasteful.

2.5 Using Serialization in a Program

// a custom class to register with Kryo (its fields are illustrative)
class abc(val id: Int, val name: String)

val sparkConf = new SparkConf().
                setMaster("local[2]").
                setAppName("SparkContextApp").
                set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
                registerKryoClasses(Array(classOf[abc]))

2.6 Complete Example
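
A minimal end-to-end sketch (the Click class and the sample data are hypothetical) combining Kryo registration, a serialized cache, and a shuffle:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// registering the class lets Kryo write a small numeric ID
// instead of the full class name with every object
case class Click(userId: Int, url: String)

object KryoExampleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().
      setMaster("local[2]").
      setAppName("KryoExampleApp").
      set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").
      registerKryoClasses(Array(classOf[Click]))
    val sc = new SparkContext(conf)

    val clicks = sc.parallelize(Seq(
      Click(1, "/home"), Click(1, "/cart"), Click(2, "/home")))

    // serialized cache: partitions are held as Kryo-encoded bytes
    clicks.persist(StorageLevel.MEMORY_ONLY_SER)

    // shuffle: records are serialized with Kryo between nodes
    val clicksPerUser = clicks.map(c => (c.userId, 1)).reduceByKey(_ + _)
    clicksPerUser.collect().foreach(println)

    sc.stop()
  }
}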
