Spark books and tutorials

Mastering Apache Spark (useful as a reference, but rather disorganized)

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-rdd-ShuffledRDD.html


SparkInternals (already read; explains Spark's internals very well)

https://github.com/JerryLead/SparkInternals


Apache Spark: core concepts, architecture and internals (fairly brief; draws on "SparkInternals")

http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/


Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3

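- Introduces the continuous processing mode added in Spark 2.3, which is still experimental. When a query starts, Spark launches long-running tasks that receive and process events continuously, and it takes checkpoints asynchronously using the Chandy-Lamport algorithm. Flink works the same way, and Spark's design was probably modeled on Flink, but Flink is already usable in production. With the original micro-batching mode the latency is about 100 ms at best, whereas continuous processing can bring it down to roughly 1 ms, about a 100-fold reduction. The difference arises because in micro-batching mode an event has to wait for the next batch to start, and every batch is checkpointed for fault tolerance. That checkpointing is synchronous, which adds further latency.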

https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
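
The latency argument in the note above can be illustrated with a toy model. This is only a minimal sketch in plain Python, not Spark code: the function names, the 100 ms batch interval, the 10 ms synchronous checkpoint cost, and the 1 ms per-event cost are all assumptions chosen for illustration, not measurements.

```python
def micro_batch_latencies(arrivals_ms, batch_interval_ms, checkpoint_ms):
    """Toy model: an event waits for the next batch boundary, and the
    batch only completes after a synchronous checkpoint (assumed cost)."""
    latencies = []
    for t in arrivals_ms:
        # Next batch boundary at or after the arrival time (integer ceiling).
        boundary = -(-t // batch_interval_ms) * batch_interval_ms
        latencies.append(boundary + checkpoint_ms - t)
    return latencies


def continuous_latencies(arrivals_ms, per_event_ms):
    """Toy model: each event is handled on arrival; checkpoints are
    asynchronous, so they add no waiting time in this sketch."""
    return [per_event_ms for _ in arrivals_ms]


# Hypothetical arrival times in milliseconds.
arrivals = [5, 40, 99, 100, 130]
mb = micro_batch_latencies(arrivals, batch_interval_ms=100, checkpoint_ms=10)
ct = continuous_latencies(arrivals, per_event_ms=1)

print(mb)  # [105, 70, 11, 10, 80]
print(ct)  # [1, 1, 1, 1, 1]
```

In this model the micro-batch latency depends on where in the interval an event happens to arrive, plus the checkpoint cost, while the continuous latency stays flat at the per-event cost.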
