Introducing Spark-Kafka integration for real-time Kafka SQL queries

Apache Kafka has become a key component of the modern data pipeline. In most cases, however, we only treat Kafka as a stream source or a message queue, which means that if you want to run an ad hoc query, you first need to sync the data to HDFS or some other storage.

People tend to forget that Kafka is really good at high-throughput reads, since it makes full use of both parallel consumption and sequential disk reads.

In order to satisfy use cases such as ad hoc analytics, data exploration, and trend discovery directly on Kafka, a new project called spark-adhoc-kafka has been open sourced.

With this project, you can:

  1. Treat Kafka topics/streams as tables;
  2. Query them with plain SQL (a minimal usage sketch follows this list);
  3. Run complex joins (joining other Kafka topics, or tables stored anywhere else);
  4. Use it from both MLSQL and Spark.
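Here is a minimal sketch of what this looks like from Spark, assuming the data source is registered under the short name `adHocKafka` and accepts the usual Kafka connection options, and that the output follows the standard Kafka source schema (key, value, topic, partition, offset). These names are assumptions for illustration; check the spark-adhoc-kafka README for the exact format name, options, and schema.

```scala
import org.apache.spark.sql.SparkSession

object KafkaAdHocQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-adhoc-query")
      .master("local[*]")
      .getOrCreate()

    // Load a Kafka topic as a batch DataFrame. The format name "adHocKafka"
    // and the topic "page_views" are illustrative assumptions.
    val df = spark.read
      .format("adHocKafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "page_views")
      .load()

    // Register it as a table so it can be joined and queried with plain SQL.
    df.createOrReplaceTempView("page_views")

    spark.sql(
      """
        |SELECT CAST(value AS STRING) AS payload, partition, offset
        |FROM page_views
        |LIMIT 10
      """.stripMargin).show(truncate = false)

    spark.stop()
  }
}
```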

Notice that you can speed up an ad hoc query in spark-adhoc-kafka with the following options (a sketch follows the list):

  1. Specify startingOffsets and endingOffsets to narrow the range of records Spark has to fetch;
  2. Specify startingTime and endingTime to narrow the range of records Spark has to fetch;
  3. Specify multiplyFactor or maxSizePerPartitions to control the parallelism of data fetching.
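Continuing the sketch above, these knobs would be passed as data source options. The option names are taken from the list above, but the value formats shown (per-partition JSON offset maps, millisecond timestamps, plain integers) are assumptions that should be verified against the project's documentation.

```scala
// A minimal sketch of narrowing and parallelising an ad hoc read.
// Value formats below are assumptions, not confirmed by the project docs.
val narrowed = spark.read
  .format("adHocKafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "page_views")
  // Only scan a bounded offset range instead of the whole topic.
  .option("startingOffsets", """{"page_views":{"0":100,"1":100,"2":100}}""")
  .option("endingOffsets", """{"page_views":{"0":10000,"1":10000,"2":10000}}""")
  // Or bound the scan by time instead of offsets (epoch milliseconds assumed).
  // .option("startingTime", "1546300800000")
  // .option("endingTime", "1546387200000")
  // Split each Kafka partition into several Spark partitions,
  // or cap how many records a single Spark partition may read.
  .option("multiplyFactor", "10")
  .option("maxSizePerPartitions", "100000")
  .load()
```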

How does spark-adhoc-kafka work? The default Kafka consuming model in Spark looks like this:

[Figure 1: the default consuming model]

Both Kafka and Spark have the concept of a partition: a partition represents a task or a slice of data. Changing the number of Kafka partitions is a heavyweight operation, and Spark starts exactly one task per Kafka partition. Suppose a topic has three partitions; Spark will start three tasks, consume all the data from those Kafka partitions, map it to three Spark partitions, and then do the processing. If the Spark cluster has 100 cores but only three of them are used, what a waste! This also really slows down query performance.
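To see the 1:1 mapping concretely, here is a small sketch using Spark's built-in Kafka batch source: reading a three-partition topic produces a three-partition DataFrame, so at most three tasks run in parallel regardless of how many cores the cluster has. The topic name and broker address are placeholders.

```scala
// Batch read with the stock Kafka source: one Spark partition per Kafka partition.
val defaultDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "page_views")        // assume this topic has 3 partitions
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()

// Prints 3 for a 3-partition topic, no matter how many executor cores exist.
println(defaultDf.rdd.getNumPartitions)
```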

So spark-adhoc-kafka introduces a new consuming model to resolve these issues:

[Figure 2: the new consuming model]
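The figure suggests that each Kafka partition's offset range is split into several smaller ranges, bounded by options such as multiplyFactor, so that many more Spark tasks can read the same topic in parallel. The sketch below only illustrates that idea with a plain helper function; it is not the project's actual implementation.

```scala
// Illustration only: split one Kafka partition's offset range into
// `multiplyFactor` smaller ranges so several Spark tasks can consume
// the same Kafka partition in parallel.
case class OffsetRange(topic: String, partition: Int, from: Long, until: Long)

def split(range: OffsetRange, multiplyFactor: Int): Seq[OffsetRange] = {
  val total = range.until - range.from
  val step  = math.max(1L, math.ceil(total.toDouble / multiplyFactor).toLong)
  (range.from until range.until by step).map { start =>
    range.copy(from = start, until = math.min(start + step, range.until))
  }
}

// One Kafka partition with offsets [0, 1000) becomes 10 smaller ranges,
// hence 10 Spark partitions instead of 1.
split(OffsetRange("page_views", 0, 0L, 1000L), multiplyFactor = 10).foreach(println)
```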
