A few questions to deepen your understanding of Spark broadcast variables.
1. Broadcast variables spill to disk when memory runs short — their storage level is MEMORY_AND_DISK — so why does the driver still throw java.lang.OutOfMemoryError when loading a broadcast value larger than driver memory?
The answer is actually simple, as long as you are not misled by the broadcast variable's MEMORY_AND_DISK storage level.
Let's look at three kinds of OOM stack traces that can occur while broadcasting.
Case 1:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.String.substring(String.java:1933)
at com.boco.uemr.streaming.domain.perf.utility.Tools.splitString(Tools.java:285)
at com.boco.uemr.streaming.drivers.combinesinglepositionapp.CombinedFingerPositionIniService.getLteMroAdjIniMap(CombinedFingerPositionIniService.java:119)
at com.boco.uemr.streaming.drivers.combinesinglepositionapp.CombinedFingerPositionDriver.run(CombinedFingerPositionDriver.java:81)
at com.boco.uemr.streaming.drivers.combinesinglepositionapp.CombinedFingerPositionRunner.run(CombinedFingerPositionRunner.java:62)
at com.boco.uemr.streaming.Main.main(Main.java:16)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
This case is the simplest: the OOM is thrown while reading file contents and assembling them into a Map, before any broadcast machinery is even involved.
The next two cases are different: there, the value object has already been fully built in driver heap memory.
Case 2:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.IdentityHashMap.resize(IdentityHashMap.java:471)
at java.util.IdentityHashMap.put(IdentityHashMap.java:440)
at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:174)
at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:225)
at org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:224)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:224)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:201)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:69)
at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
at org.apache.spark.storage.memory.DeserializedValuesHolder.storeValue(MemoryStore.scala:665)
at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:299)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:914)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1481)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:123)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:650)
at com.boco.uemr.streaming.Driver.broadcast(Driver.java:44)
at com.boco.uemr.streaming.drivers.combinesinglepositionapp.CombinedFingerPositionDriver.run(CombinedFingerPositionDriver.java:75)
at com.boco.uemr.streaming.drivers.combinesinglepositionapp.CombinedFingerPositionRunner.run(CombinedFingerPositionRunner.java:62)
at com.boco.uemr.streaming.Main.main(Main.java:16)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
As the trace shows, this one happens inside the call
blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false), i.e. while storing the broadcast value in the driver-side BlockManager. The error shows that while SizeEstimator is estimating the value's size, the IdentityHashMap it uses to track already-visited objects has to resize to hold the broadcast object graph, cannot allocate the larger array, and triggers the OOM.
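To see why this bookkeeping alone can blow the heap, here is a minimal sketch (my own illustration, not Spark's actual SizeEstimator code) of the role the IdentityHashMap plays: it marks every object already visited, so shared references are counted only once. For a broadcast value with millions of reachable objects, the map needs one entry per object, and the resize() in the trace above is where allocation fails.

```java
import java.util.IdentityHashMap;

// Illustration only: the key property of IdentityHashMap is that it compares
// by reference identity (==), not equals(), so two equal-but-distinct objects
// occupy two entries. That is exactly what a size estimator needs to avoid
// double-counting shared references -- at the cost of one entry per object.
public class VisitedSketch {
    // Returns how many distinct objects (by reference identity) appear.
    public static int countDistinctByIdentity(Object[] objects) {
        IdentityHashMap<Object, Boolean> visited = new IdentityHashMap<>();
        for (Object o : objects) {
            visited.put(o, Boolean.TRUE); // duplicate references collapse to one entry
        }
        return visited.size();
    }
}
```

With a huge broadcast value, this visited set grows in lock-step with the object graph, which is why the OOM fires inside IdentityHashMap.resize rather than in your own code.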
Case 3:
19/10/21 15:29:22 WARN memory.MemoryStore: Not enough space to cache broadcast_0 in memory! (computed 580.5 MB so far)
19/10/21 15:29:22 WARN storage.BlockManager: Persisting block broadcast_0 to disk instead.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.io.ObjectOutputStream$HandleTable.growEntries(ObjectOutputStream.java:2351)
at java.io.ObjectOutputStream$HandleTable.assign(ObjectOutputStream.java:2276)
at java.io.ObjectOutputStream.writeString(ObjectOutputStream.java:1302)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1172)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at java.util.HashMap.internalWriteEntries(HashMap.java:1790)
at java.util.HashMap.writeObject(HashMap.java:1363)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1128)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:650)
at com.boco.icos.mrfingerlib.driver.drivers.mro.TestDriver.run(TestDriver.java:114)
at com.boco.icos.mrfingerlib.driver.drivers.mro.TestDriverRunner.run(TestDriverRunner.java:64)
at com.boco.icos.mrfingerlib.driver.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Here, too, the file contents have already been loaded into the broadcast value Object. While serializing it and splitting it into blocks, ObjectOutputStream's handle table — which records a handle for every object already written, so back-references can be encoded — has to grow in growEntries, cannot get the memory, and overflows.
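What "serialize and split into blocks" amounts to can be sketched as follows. This is a simplified illustration using plain Java serialization into a single byte array, not Spark's actual TorrentBroadcast.blockifyObject (which writes through a chunked output stream); the point it demonstrates is that the original object and its serialized form are both alive on the driver heap at the same time, which is why spilling the block to disk does not prevent this OOM.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of serializing a value and cutting the bytes into
// fixed-size blocks, loosely in the spirit of TorrentBroadcast.blockifyObject.
public class BlockifySketch {
    public static List<byte[]> blockify(Serializable value, int blockSize) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(value); // whole serialized form is built on the heap
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = bos.toByteArray();
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < all.length; off += blockSize) {
            int len = Math.min(blockSize, all.length - off);
            byte[] block = new byte[len];
            System.arraycopy(all, off, block, 0, len);
            blocks.add(block); // every block but the last is exactly blockSize bytes
        }
        return blocks;
    }
}
```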
2. How many copies of a broadcast variable live on the driver? Is there some degree of redundancy?
There is, unquestionably, one copy for the tasks that run on the driver.
To answer the question properly, look at these two lines inside writeBlocks:
blockManager.putSingle(broadcastId, value, MEMORY_AND_DISK, tellMaster = false)
What does this step do?
It stores one copy of the broadcast value in the driver's BlockManager, so that tasks running on the driver do not need to create their own copy of the value when they use it.
blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true)
And what is this one for?
After the value has been serialized and split into blocks, the blocks are also stored in the driver-side BlockManager, and their metadata is reported to the BlockManagerMaster.
Why? This is where the broadcast mechanism — a BitTorrent-like implementation — comes in.
The metadata kept by the BlockManagerMaster records the locations of all blocks, so each executor can fetch the blocks it wants from the nearest holder. The blocks stored on the driver are the original seed data: when a broadcast begins, only the driver has the variable, so executors have no choice but to fetch blocks from the driver.
So the driver really does hold two copies of the same content, just in different storage formats: the deserialized value and the serialized blocks. To some extent that looks a little redundant to me — what do you think? Corrections welcome.
3. Is there one copy of a broadcast variable per node that hosts executors, or one per executor?
Today, discussing a problem with a colleague, we stumbled onto this question. He said one copy per node, which contradicted my long-standing impression, so I went back to the docs and the source.
The description on the official site is admittedly a bit misleading:
doesn't "keep a read-only variable on each machine" and "give every node a copy of a large input dataset" sound like one copy per node?
But the fact is one copy per executor, which we can see from what an executor does when it reads a broadcast variable. Follow this call chain:
Broadcast.getValue() —> TorrentBroadcast.getValue() —> TorrentBroadcast.readBroadcastBlock() —> TorrentBroadcast.readBlocks()
When an executor reads the broadcast variable, it fetches the complete set of blocks and stores them in its own BlockManager.
From that, it is easy to see that each executor — not each node — ends up with its own copy.
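The caching behavior behind that call chain can be sketched like this (hypothetical class and method names, not Spark's API): the executor consults its local store first and only fetches on a miss, so every read inside one executor process hits the same single copy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Minimal sketch of the per-executor caching that readBroadcastBlock() performs:
// check the local BlockManager; only on a miss fetch all blocks, rebuild the
// value, and cache it locally. One copy per executor process, not per node.
public class BroadcastCacheSketch {
    private final Map<String, Object> localBlockManager = new HashMap<>();
    public int remoteFetches = 0; // counts how often blocks were actually fetched

    public Object getValue(String broadcastId, Supplier<Object> fetchBlocksAndRebuild) {
        return localBlockManager.computeIfAbsent(broadcastId, id -> {
            remoteFetches++; // stands in for TorrentBroadcast.readBlocks()
            return fetchBlocksAndRebuild.get();
        });
    }
}
```

Two executors on the same node would each run this logic independently, which is exactly why the node ends up holding two copies.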
4. The broadcast process
You can read up on the BitTorrent protocol itself; here I'll describe Spark's broadcast process, based on version 2.4.3 (the older HTTP-based broadcast is not covered).
It breaks down roughly into the following steps:
Step 1: the driver loads the broadcast data and builds the Object type the user needs.
Step 2: a copy of the broadcast value is stored in the driver's BlockManager.
Step 3: the value is serialized and split into byte blocks — the block size is configurable (e.g. spark.broadcast.blockSize=64m) — and the blocks' metadata is reported to the BlockManagerMaster.
Step 4: an executor that needs the variable fetches it. How? It asks the BlockManagerMaster where each block lives, fetches the blocks from the driver or from other executors that already hold them, stores them in its own BlockManager, and reports them back, becoming a new source itself.
This broadcast scheme lets every executor serve blocks to other executors fetching the same variable, which greatly relieves pressure on the driver and speeds up the broadcast.
The BitTorrent protocol really is worth studying.
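The fetch in step 4 can be sketched roughly like this (a hypothetical structure, not Spark's code): blocks are requested in shuffled order from whichever current holder is picked, and each fetched block is reported back so this executor immediately becomes a source for later fetchers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Rough sketch of the BitTorrent-like fetch: shuffled block order spreads
// load across sources, and registering each fetched block makes this
// executor a new holder that others can fetch from.
public class TorrentFetchSketch {
    public static void fetchAll(String self, List<String> blockIds,
                                Map<String, List<String>> locations) {
        List<String> order = new ArrayList<>(blockIds);
        Collections.shuffle(order);           // avoid everyone hammering block 0 first
        Random rnd = new Random();
        for (String blockId : order) {
            List<String> holders = locations.get(blockId);
            String source = holders.get(rnd.nextInt(holders.size())); // any holder will do
            // ... fetch the bytes of blockId from `source` over the network ...
            holders.add(self);                // report back: this executor is a holder now
        }
    }
}
```

At the start only the driver appears in every holder list (the seed); as executors finish fetching, the holder lists grow and the driver's share of the serving work shrinks.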