The code is as follows:
val data = sc.parallelize(List(("192.168.34.5", "pc", 5, 12)))
val url = "jdbc:mysql://ip:port/database"  // user/password are passed to getConnection below
classOf[com.mysql.jdbc.Driver]  // reference the MySQL JDBC driver class (JDBC 4+ drivers also auto-register)
val conn = DriverManager.getConnection(url, user, pwd)
try {
  conn.setAutoCommit(false)
  val prep = conn.prepareStatement("INSERT INTO info (ip, source, hour, count) VALUES (?, ?, ?, ?)")
  // fails with Task not serializable: the closure captures prep,
  // which Spark would have to ship to the executors
  data.map { case (ip, source, hour, count) =>
    prep.setString(1, ip)
    prep.setString(2, source)
    prep.setInt(3, hour)
    prep.setInt(4, count)
    prep.addBatch()
  }
  prep.executeBatch()
  conn.commit()
} catch {
  case e: Exception => e.printStackTrace()
} finally {
  conn.close()
}
Solution:
data.map { case (ip, source, hour, count, count_all, ...
replace with:
data.collect().foreach { case (ip, source, hour, count, ...
Reason:
prep is a PreparedStatement object, and it cannot be serialized. The function passed to map, together with every object it captures, has to be distributed to the worker nodes: it is serialized before transmission and deserialized after arriving on each machine. For a Java object to be (de)serialized, its class must implement the Serializable interface, and PreparedStatement does not. The prep object lives on the driver, and after collect() the data is on the driver as well, so prep no longer needs to be serialized and shipped to the nodes.
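For completeness, a minimal full version of the collect-based fix, assuming the four-field tuples and the url/user/pwd placeholders from the example above:

val conn = DriverManager.getConnection(url, user, pwd)
try {
  conn.setAutoCommit(false)
  val prep = conn.prepareStatement("INSERT INTO info (ip, source, hour, count) VALUES (?, ?, ?, ?)")
  // collect() brings every row to the driver, so prep never leaves this JVM
  data.collect().foreach { case (ip, source, hour, count) =>
    prep.setString(1, ip)
    prep.setString(2, source)
    prep.setInt(3, hour)
    prep.setInt(4, count)
    prep.addBatch()
  }
  prep.executeBatch()
  conn.commit()
} catch {
  case e: Exception => e.printStackTrace()
} finally {
  conn.close()
}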
However, collect() pulls the entire RDD back to the driver, which creates its own performance (and memory) problem. A better solution:
use foreachPartition (the action counterpart of mapPartitions) to keep one MySQL connection per partition and do the inserts there:
data.foreachPartition { it =>
  // runs on the executor: one connection per partition
  val conn = DriverManager.getConnection(url, user, pwd)
  try {
    conn.setAutoCommit(false)
    val prep = conn.prepareStatement("INSERT INTO reg_ip_info (ip, source, hour, count) VALUES (?, ?, ?, ?)")
    it.foreach { case (ip, source, hour, count) =>
      prep.setString(1, ip)
      prep.setString(2, source)
      prep.setInt(3, hour)
      prep.setInt(4, count)
      // prep.setTimestamp(4, new Timestamp(System.currentTimeMillis()))
      prep.addBatch()
    }
    prep.executeBatch()
    conn.commit()
  } catch {
    case e: Exception => e.printStackTrace()
  } finally {
    conn.close()
  }
}
Reference link: https://stackoverflow.com/questions/37462803/prepare-batch-statement-to-store-all-the-rdd-to-mysql-generated-from-spark-strea
ps:
mapPartitions works like map, except that the mapping function receives an iterator over each partition of the RDD instead of each individual element. When the mapping has to create expensive extra objects repeatedly, mapPartitions is much more efficient than map. For example, when writing all of an RDD's data to a database over JDBC, map would effectively create a connection for every single element, which is very costly, while mapPartitions needs only one connection per partition.
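To make the difference concrete, a minimal sketch contrasting the two patterns (the insert itself is elided; rdd, url, user, and pwd stand for the placeholders used above):

// Anti-pattern: one connection per element
rdd.foreach { row =>
  val conn = DriverManager.getConnection(url, user, pwd)  // opened once per row
  try {
    // ... insert row ...
  } finally {
    conn.close()
  }
}

// Better: one connection per partition
rdd.foreachPartition { rows =>
  val conn = DriverManager.getConnection(url, user, pwd)  // opened once per partition
  try {
    rows.foreach { row =>
      // ... insert row ...
    }
  } finally {
    conn.close()
  }
}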