Structured Streaming + Kafka: consumer fails on a single record that is too large

When reading Kafka data with Structured Streaming, a fetch that was too large failed with the following error log:

20/06/06 11:40:01 org.apache.spark.internal.Logging$class.logError(Logging.scala:70) ERROR TaskSetManager: Task 7 in stage 96.0 failed 4 times; aborting job
20/06/06 11:40:01 org.apache.spark.internal.Logging$class.logError(Logging.scala:91) ERROR MicroBatchExecution: Query sink alarm result to event table [id = f0960793-2c6e-4202-b099-ffd614471716, runId = 28fcb5c3-68a4-49be-814a-a7197336c449] terminated with error
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 96.0 failed 4 times, most recent failure: Lost task 7.3 in stage 96.0 (TID 1295, 10.123.42.47, executor 1): org.apache.kafka.common.errors.RecordTooLargeException: There are some messages at [Partition=Offset]: {intelligent_driving-3=13632613} whose size is larger than the fetch size 504827599 and hence cannot be ever returned. Increase the fetch size on the client (using max.partition.fetch.bytes), or decrease the maximum message size the broker will allow (using message.max.bytes).

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
	

This is a common problem, and most online write-ups say that setting the parameter "max.partition.fetch.bytes" is enough. However, after configuring "max.partition.fetch.bytes" in the Structured Streaming job, the job still failed with exactly the same error.
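For reference, this is roughly the kind of configuration that does not work (the broker address and fetch size below are placeholders, only the topic name comes from the error log above):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")      // placeholder
  .option("subscribe", "intelligent_driving")
  .option("max.partition.fetch.bytes", "1073741824")      // no "kafka." prefix -> has no effect
  .load()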

Cause: reading through the Structured Streaming source code that builds the data stream shows that Structured Streaming preprocesses the option names before handing them to the Kafka consumer. The key code (from KafkaSourceProvider) is:

// Only options whose name starts with "kafka." are forwarded to the Kafka
// consumer, and the prefix is stripped first.
override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    validateBatchOptions(parameters)
    val caseInsensitiveParams = parameters.map { case (k, v) => (k.toLowerCase(Locale.ROOT), v) }
    val specifiedKafkaParams =
      parameters
        .keySet
        .filter(_.toLowerCase(Locale.ROOT).startsWith("kafka."))  // keep only "kafka.*" options
        .map { k => k.drop(6).toString -> parameters(k) }          // drop the 6-character "kafka." prefix
        .toMap
    // ... rest of the method omitted; specifiedKafkaParams is what reaches the Kafka consumer
  }
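In other words, only option keys starting with "kafka." survive the filter; everything else never reaches the consumer. A standalone sketch of that transformation (REPL-style, values are placeholders):

val parameters = Map(
  "subscribe" -> "intelligent_driving",
  "max.partition.fetch.bytes" -> "1073741824",        // no prefix -> filtered out
  "kafka.max.partition.fetch.bytes" -> "1073741824"   // prefixed -> kept
)

val specifiedKafkaParams = parameters
  .keySet
  .filter(_.toLowerCase(java.util.Locale.ROOT).startsWith("kafka."))
  .map { k => k.drop(6) -> parameters(k) }
  .toMap

// prints: Map(max.partition.fetch.bytes -> 1073741824)
println(specifiedKafkaParams)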

Structured Streaming expects every Kafka parameter to carry the "kafka." prefix, so in the code the option must be named "kafka.max.partition.fetch.bytes".
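A minimal sketch of the corrected read (broker address and fetch size are still placeholders; the only change from the earlier sketch is the option name):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")        // placeholder
  .option("subscribe", "intelligent_driving")
  // "kafka."-prefixed, so it is actually forwarded to the underlying consumer;
  // the value must be larger than the biggest record in the topic
  .option("kafka.max.partition.fetch.bytes", "1073741824")  // placeholder size
  .load()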
