How to Use foreachRDD Efficiently

For a correct understanding of foreachRDD, see the post "Understanding DStream.foreachRDD".
The official Spark Streaming documentation also covers foreachRDD; see "Design Patterns for using foreachRDD".

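One connection per record

In practice, foreachRDD is often used to write data to an external data store, which raises the question of how to create the connection to that store. The most common mistake is to open a connection for every record: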

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // executed at the worker, once for every record
    val connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/tutorials", "root", "root")
    connection.send(record) // send is a placeholder for the actual write, as in the official docs
    connection.close()
  }
}

Drawback: a connection is created for every record, which is very inefficient.

One connection per partition

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/tutorials", "root", "root")
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}

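Drawback: this eases the load on the external data store to some extent, but if there are many partitions, the number of connections still grows too large.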
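The difference between the per-record and per-partition approaches can be counted without Spark. The following is a minimal sketch, assuming a hypothetical FakeConnection class that only records how often it is constructed, and plain Scala collections standing in for the partitions of one RDD:

```scala
object ConnectionCountSketch {
  // Hypothetical stand-in for a real connection; it only counts how often one is opened.
  final class FakeConnection(opened: Array[Int]) {
    opened(0) += 1
    def send(record: String): Unit = ()
    def close(): Unit = ()
  }

  // Two "partitions" holding five records in total.
  val partitions: Seq[Seq[String]] = Seq(Seq("a", "b", "c"), Seq("d", "e"))

  // Per-record: a new connection is opened for every single record.
  def perRecord(): Int = {
    val opened = Array(0)
    partitions.foreach { part =>
      part.iterator.foreach { record =>
        val connection = new FakeConnection(opened)
        connection.send(record)
        connection.close()
      }
    }
    opened(0)
  }

  // Per-partition: one connection is opened per partition and reused for its records.
  def perPartition(): Int = {
    val opened = Array(0)
    partitions.foreach { part =>
      val connection = new FakeConnection(opened)
      part.iterator.foreach(record => connection.send(record))
      connection.close()
    }
    opened(0)
  }
}
```

With this sample data, perRecord() opens one connection per record and perPartition() opens one per partition, so the second count grows with the number of partitions rather than with the number of records.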

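A static connection

Building on the example above, the connection can be held in a static singleton object, so that each JVM has only one connection object: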

object Client {
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/tutorials", "root", "root")
  def apply(): Connection = conn
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = Client()
    partitionOfRecords.foreach(record => connection.send(record))
    // do not close the connection here: it is shared by every task in this JVM
  }
}

Drawback: Client() is called for every partition, so the connection is created whether or not there is any data to write.

A lazy static connection

With a small change to the code above, the connection is created only when it is actually used:

object Client {
  lazy val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/tutorials", "root", "root")
  def apply(): Connection = conn
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Client() is only called when a record arrives, so an empty partition never opens the connection
    partitionOfRecords.foreach(record => Client().send(record))
  }
}

Drawback: every task on an executor depends on the same connection object, which can become a performance bottleneck, so a more complete solution is needed.
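How val and lazy val behave inside a Scala object can be checked in isolation. The sketch below is only an illustration: the counters stand in for "a connection was opened", and the names LazyInitSketch, EagerClient, LazyClient and touch are made up for this example.

```scala
object LazyInitSketch {
  // Counters stand in for "a connection was opened"; no real database is involved.
  var eagerOpens = 0
  var lazyOpens = 0

  object EagerClient {
    // Runs as soon as EagerClient is first referenced, even if conn is never read.
    val conn: String = { eagerOpens += 1; "eager-connection" }
    def touch(): Unit = ()
  }

  object LazyClient {
    // Runs only on the first read of conn, and at most once.
    lazy val conn: String = { lazyOpens += 1; "lazy-connection" }
    def touch(): Unit = ()
  }
}
```

Calling EagerClient.touch() already runs the initializer of conn, whereas LazyClient.touch() does not; only reading LazyClient.conn does, and repeated reads reuse the same value. A Scala object is itself initialized on first reference, so the practical difference lies in where the first read of conn happens.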

A lazy static connection pool

The official example also notes that the connection should come from a pool: "ConnectionPool is a static, lazily initialized pool of connections".

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

ConnectionPool can be implemented with the org.apache.commons.pool2 framework; see "Implementing a MySQL connection pool with commons.pool2".
Below is one simple implementation:

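import java.sql.{Connection, DriverManager}
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool}

object ConnectionPool {
  private val pool = new GenericObjectPool[Connection](new MysqlConnectionFactory("jdbc:mysql://103.235.245.156:3306/tutorials", "root", "root", "com.mysql.jdbc.Driver"))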
  def getConnection(): Connection ={
    pool.borrowObject()
  }

  def returnConnection(conn: Connection): Unit ={
    pool.returnObject(conn)
  }
}

class MysqlConnectionFactory(url: String, userName: String, password: String, className: String) extends BasePooledObjectFactory[Connection]{
  override def create(): Connection = {
    Class.forName(className)
    DriverManager.getConnection(url, userName, password)
  }

  override def wrap(conn: Connection): PooledObject[Connection] = new DefaultPooledObject[Connection](conn)

  override def validateObject(pObj: PooledObject[Connection]) = !pObj.getObject.isClosed

  override def destroyObject(pObj: PooledObject[Connection]) =  pObj.getObject.close()
}
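The pool above runs with the library's default settings. If the number of connections per executor needs an explicit upper bound, a GenericObjectPoolConfig can be passed to the pool. The following is a sketch under assumptions, not a recommended configuration: it assumes commons-pool2 2.6 or later (where GenericObjectPoolConfig takes a type parameter), the names JdbcConnectionFactory and BoundedConnectionPool are made up here, and the limits are illustrative values only.

```scala
import java.sql.{Connection, DriverManager}
import org.apache.commons.pool2.{BasePooledObjectFactory, PooledObject}
import org.apache.commons.pool2.impl.{DefaultPooledObject, GenericObjectPool, GenericObjectPoolConfig}

// Minimal factory so that this sketch is self-contained (hypothetical name).
class JdbcConnectionFactory(url: String, user: String, password: String)
    extends BasePooledObjectFactory[Connection] {
  override def create(): Connection = DriverManager.getConnection(url, user, password)
  override def wrap(conn: Connection): PooledObject[Connection] =
    new DefaultPooledObject[Connection](conn)
  override def validateObject(p: PooledObject[Connection]): Boolean = !p.getObject.isClosed
  override def destroyObject(p: PooledObject[Connection]): Unit = p.getObject.close()
}

object BoundedConnectionPool {
  // Illustrative limits; suitable values depend on the database and on executor cores.
  val config = new GenericObjectPoolConfig[Connection]()
  config.setMaxTotal(4)        // at most 4 connections borrowed or idle in this JVM
  config.setMaxIdle(4)         // keep up to 4 idle connections for reuse
  config.setTestOnBorrow(true) // call validateObject before handing out a connection

  // The pool is only built on first use; connections are created on demand by borrowObject.
  private lazy val pool = new GenericObjectPool[Connection](
    new JdbcConnectionFactory("jdbc:mysql://localhost:3306/tutorials", "root", "root"),
    config)

  def getConnection(): Connection = pool.borrowObject()
  def returnConnection(conn: Connection): Unit = pool.returnObject(conn)
}
```

Note that validateObject is only consulted when a setting such as testOnBorrow is enabled, so without it a closed connection could be handed out again.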

With this pool in place, the official example can be adapted as follows:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // Skip empty partitions so that no connection is borrowed when there is nothing to write
    if (partitionOfRecords.hasNext) {
      // ConnectionPool is a static, lazily initialized pool of connections
      val connection = ConnectionPool.getConnection()
      try partitionOfRecords.foreach(record => connection.send(record))
      finally ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
    }
  }
}
