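This post is mainly about Spark's Kryo serialization mechanism and compression.
First, the official Spark documentation on Kryo: http://spark.apache.org/docs/latest/tuning.html#data-serialization
The related configuration parameters: http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
In short: Kryo is powerful and recommended, and Spark 2.x already uses it automatically in many places (for example, when shuffling RDDs of simple types or strings).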
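Serialization comes into play in three places:
(1) External variables used inside operator functions (for example, the map used for random sampling in the previous post)
(2) Serializing RDDs when persisting them, e.g. StorageLevel.MEMORY_ONLY_SER
(3) The shuffle stage (remember the unsafe-based shuffle from my previous post?)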
So I cache the data with the MEMORY_ONLY_SER storage level and measure the actual effect.
(Here I am only using Java serialization.)
persist(StorageLevel.MEMORY_ONLY());
persist(StorageLevel.MEMORY_ONLY_SER());
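As you can see, with serialized caching the memory footprint is clearly smaller.
(The full code is at the bottom; conclusions first.)
After switching to Kryo, the source RDD and the Datasets are unaffected; only the intermediate RDD (the one holding SelfUrl objects) changes. For the Datasets this is expected, since they are serialized by their own Encoders rather than by spark.serializer.
If custom types are not registered with Kryo, the result can actually be worse than Java serialization, because Kryo then has to store the full class name with each object.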
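To make the registration cost concrete, here is a minimal sketch using only the JDK. It is a toy model, not Kryo's real wire format: it simply tags every record either with a class name or with a one-byte ID, which is the difference the Spark tuning guide describes for unregistered versus registered classes. The class name `com.example.SelfUrl`, the ID value, and the sample URL are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Toy model only: this is NOT Kryo's real wire format. It contrasts tagging
// every record with a class name against tagging it with a small numeric ID.
class ClassTagOverheadSketch {

    // Hypothetical fully qualified name for the SelfUrl bean (19 ASCII chars).
    static final String CLASS_NAME = "com.example.SelfUrl";

    // Encodes `records` copies of (url, count), each prefixed by the class name.
    static int sizeWithClassName(int records, String url, String count) {
        return encode(records, url, count, true);
    }

    // Same payload, but each record is prefixed by a one-byte class ID.
    static int sizeWithClassId(int records, String url, String count) {
        return encode(records, url, count, false);
    }

    private static int encode(int records, String url, String count, boolean useName) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < records; i++) {
                if (useName) {
                    out.writeUTF(CLASS_NAME); // 2-byte length prefix + name bytes
                } else {
                    out.writeByte(1);         // 1-byte stand-in for a registered ID
                }
                out.writeUTF(url);
                out.writeUTF(count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int named = sizeWithClassName(1000, "http://example.com/a", "1");
        int byId = sizeWithClassId(1000, "http://example.com/a", "1");
        System.out.println("class-name tag: " + named + " bytes");
        System.out.println("class-id tag:   " + byId + " bytes");
    }
}
```

In this toy model the name-tagged encoding costs 20 extra bytes per record. For short records such as a URL plus a count, a fixed per-record overhead of that order can easily outweigh the savings from a more compact field encoding.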
spark.rdd.compress | false | Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER in Java and Scala or StorageLevel.MEMORY_ONLY in Python). Can save substantial space at the cost of some extra CPU time. Compression will use spark.io.compression.codec. | 0.6.0 |
As you can see, with compression enabled the data shrinks further, and this applies to both the RDDs and the Datasets.
spark.io.compression.lz4.blockSize | 32k | Block size used in LZ4 compression, in the case when LZ4 compression codec is used. Lowering this block size will also lower shuffle memory usage when LZ4 is used. Default unit is bytes, unless otherwise specified. | 1.4.0 |
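Spark's default compression codec is lz4, which compresses data one 32k block at a time. Raising this parameter to 512k can improve the compression ratio further, at the cost of more memory and CPU pressure.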
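The effect of block size on compression ratio can be sketched with the JDK alone. This example is an illustration under assumptions, not a measurement of Spark: `java.util.zip.Deflater` stands in for LZ4 (which is not part of the JDK), and the sample data is synthetic, log-like text. Each block is compressed independently, so larger blocks mean fewer cold starts for the compressor.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Sketch only: Deflater (from the JDK) stands in for LZ4, the codec Spark uses by default.
class BlockSizeCompressionSketch {

    // Builds repetitive, log-like text (synthetic data, not the page_views file).
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("http://example.com/page/").append(i % 1000)
              .append('\t').append("user").append((i * 7919) % 5000).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Compresses `data` in independent blocks of `blockSize` bytes and
    // returns the total number of compressed bytes produced.
    static int compressedSize(byte[] data, int blockSize) {
        int total = 0;
        byte[] buf = new byte[8192];
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            Deflater deflater = new Deflater();
            deflater.setInput(data, off, len);
            deflater.finish();
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            deflater.end(); // release native resources for this block
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = sampleData(60000);
        System.out.println("raw:         " + data.length + " bytes");
        System.out.println("32k blocks:  " + compressedSize(data, 32 * 1024) + " bytes");
        System.out.println("512k blocks: " + compressedSize(data, 512 * 1024) + " bytes");
    }
}
```

The larger block size yields a smaller total here, but each block also needs a correspondingly larger working buffer, which is the memory side of the trade-off.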
spark.kryo.unsafe | false | Whether to use unsafe based Kryo serializer. Can be substantially faster by using Unsafe Based IO. | 2.1.0 |
spark.kryoserializer.buffer.max | 64m | Maximum allowable size of Kryo serialization buffer, in MiB unless otherwise specified. This must be larger than any object you attempt to serialize and must be less than 2048m. Increase this if you get a "buffer limit exceeded" exception inside Kryo. | 1.4.0 |
spark.kryoserializer.buffer | 64k | Initial size of Kryo's serialization buffer, in KiB unless otherwise specified. Note that there will be one buffer per core on each worker. This buffer will grow up to spark.kryoserializer.buffer.max if needed. | 1.4.0 |
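For the details of these parameters, read the Spark documentation and search for related articles. But always keep one thing in mind: it is a trade-off. Reducing disk IO and enlarging buffers inevitably increases Spark's memory pressure, so find the real bottleneck in your situation and choose accordingly.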
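As a rough back-of-the-envelope check on that trade-off, the buffer settings above can be turned into a memory estimate. This is only a sketch that takes the documented statement "one buffer per core on each worker" at face value; the core count of 8 is a hypothetical example, and real usage depends on how many buffers actually grow.

```java
// Rough estimate only, assuming one Kryo buffer per core as the docs describe.
class KryoBufferBudgetSketch {
    static final long KIB = 1024L;
    static final long MIB = 1024L * KIB;

    // Memory used if every per-core buffer stays at its initial size.
    static long initialBufferBytes(int cores, long bufferBytes) {
        return cores * bufferBytes;
    }

    // Upper bound if every per-core buffer grows to spark.kryoserializer.buffer.max.
    static long worstCaseBufferBytes(int cores, long bufferMaxBytes) {
        return cores * bufferMaxBytes;
    }

    public static void main(String[] args) {
        int cores = 8; // hypothetical worker size
        System.out.println("initial (64k each):    "
                + initialBufferBytes(cores, 64 * KIB) / KIB + " KiB");
        System.out.println("worst case (64m each): "
                + worstCaseBufferBytes(cores, 64 * MIB) / MIB + " MiB");
    }
}
```

With the defaults, the gap between the initial and worst-case figures is a factor of 1024, so raising spark.kryoserializer.buffer.max is cheap until large objects actually force the buffers to grow.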
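import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

// SparkAnalyzer is a base class from the author's project and is not shown in this post.
public class RDDInfoTest extends SparkAnalyzer {
    public static void main(String[] args) throws InterruptedException {
        // 1. Build the SparkSession. Toggle the commented-out options to compare configurations.
        SparkSession sparkSession = SparkSession
                .builder()
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                //.config("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
                //.config("spark.kryo.registrator", MyRegistrator.class.getName())
                //.config("spark.rdd.compress", "true")
                //.config("spark.io.compression.lz4.blockSize", "512k")
                .master("local[*]")
                .appName("RDDInfoTest")
                .getOrCreate();

        // 2. Read the data file into an RDD (the source RDD)
        String path = "datas/page_views.data";
        RDD<String> stringRDD = sparkSession.sparkContext().textFile(path, 2);
        stringRDD.persist(StorageLevel.MEMORY_ONLY_SER());
        stringRDD.count();

        // 3. Attach the string "1" to each record, converting String to the custom SelfUrl type
        JavaRDD<String> stringJavaRDD = stringRDD.toJavaRDD();
        JavaRDD<SelfUrl> mapRDD = stringJavaRDD.map(new Function<String, SelfUrl>() {
            @Override
            public SelfUrl call(String v1) throws Exception {
                return new SelfUrl(v1, "1");
            }
        });
        mapRDD.persist(StorageLevel.MEMORY_ONLY_SER());
        mapRDD.count();

        // 4. Convert the RDD of SelfUrl into a Dataset of Row
        Dataset<Row> dataFrame = sparkSession.createDataFrame(mapRDD, SelfUrl.class);
        dataFrame.persist(StorageLevel.MEMORY_ONLY_SER());
        dataFrame.count();

        // 5. Convert the Dataset of Row back into a Dataset of SelfUrl
        Dataset<SelfUrl> map = dataFrame.map(new MapFunction<Row, SelfUrl>() {
            @Override
            public SelfUrl call(Row value) throws Exception {
                // Read columns by name: the column order of a bean-derived DataFrame
                // need not match the field declaration order.
                return new SelfUrl(value.<String>getAs("url"), value.<String>getAs("count"));
            }
        }, Encoders.bean(SelfUrl.class));
        map.persist(StorageLevel.MEMORY_ONLY_SER());
        map.count();

        // Keep the application alive for a while so the cached sizes can be inspected (e.g. in the Spark UI)
        Thread.sleep(100000L);
    }
}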
import java.io.Serializable;
/**
* Created with IntelliJ IDEA
* Description:
* User: lsr
* Date: 2020/6/29
* Time: 15:47
*/
public class SelfUrl implements Serializable {
private String url;
private String count;
public SelfUrl() {
}
public SelfUrl(String url, String count) {
this.url = url;
this.count = count;
}
public String getUrl() {
return url;
}
public void setUrl(String url) {
this.url = url;
}
public String getCount() {
return count;
}
public void setCount(String count) {
this.count = count;
}
}
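import com.esotericsoftware.kryo.Kryo;
import org.apache.spark.serializer.KryoRegistrator;

// Registers custom classes with Kryo; enabled via the spark.kryo.registrator setting.
public class MyRegistrator implements KryoRegistrator {
    @Override
    public void registerClasses(Kryo kryo) {
        kryo.register(SelfUrl.class);
    }
}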
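1. Using Kryo is definitely beneficial, but always remember to register the custom data types it has to handle.
2. Serialization and compression always cost extra CPU and make the job take longer, so check the network IO and CPU load on the machines that run the job. If network IO pressure is low and CPU pressure is high, there is little point in adding much compression or serialization work, and vice versa.
I have finished reading 《Spark内核设计的艺术架构设计与实现》. Claiming that I understood all of it would be nonsense; Spark covers far too much for a beginner like me to grasp completely. But I did read the parts I consider important carefully, took notes, and wrote these blog posts, which I think does justice to the long time I have spent using Spark.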