If anything here is inaccurate, feel free to leave a comment and discuss. Thanks!
Failing code:
from impala.dbapi import connect
is_test = False
host = '192.168.0.1' if is_test else '192.168.0.1'
conn = connect(host=host, port=25001, timeout=3600)
cursor = conn.cursor()
sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
data = [('0' , '1', "a", '23.0','a'), ('1','3', "C", '-23.0','a'), ('2','3', "A", '-21.0','a'), ('3','2', "B", '-19.0','a') ]
rdd = sc.parallelize(data)
rdd2 = rdd.map(lambda x : cursor.execute(sql,x))
rdd2.collect()
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 824, in collect
port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2470, in _jrdd
self._jrdd_deserializer, profiler)
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2403, in _wrap_function
pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
File "/usr/local/lib/python2.7/site-packages/pyspark/rdd.py", line 2389, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
File "/usr/local/lib/python2.7/site-packages/pyspark/serializers.py", line 568, in dumps
return cloudpickle.dumps(obj, 2)
File "/usr/local/lib/python2.7/site-packages/pyspark/cloudpickle.py", line 918, in dumps
cp.dump(obj)
File "/usr/local/lib/python2.7/site-packages/pyspark/cloudpickle.py", line 249, in dump
raise pickle.PicklingError(msg)
pickle.PicklingError: Could not serialize object: TypeError: can't pickle cStringIO.StringO objects
Correct code:
from impala.dbapi import connect
def fun2(row, is_test=True):
    # Connect inside the function, so the connection is created on the
    # executor at call time and never has to be pickled by the driver.
    host = '192.168.0.1' if is_test else '192.168.0.1'
    conn = connect(host=host, port=25001, timeout=3600)
    cursor = conn.cursor()
    sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
    cursor.execute(sql, row)
data = [('0' , '1', "a", '23.0','a'), ('1','3', "C", '-23.0','a'), ('2','3', "A", '-21.0','a'), ('3','2', "B", '-19.0','a') ]
rdd = sc.parallelize(data)
rdd2 = rdd.map(lambda x : fun2(x))
rdd2.collect()
Reason:
Spark pickles the task closure, including the connection/cursor object, so it can ship the work to the executors. That is bound to fail: the connection wraps sockets and I/O buffers (here a cStringIO object, per the traceback) that cannot be pickled, and even a successfully deserialized connection would not be a valid, authorized handle in another process or on another machine. The problem can be reproduced by simply trying to broadcast the connection object.
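A quick way to see this, reusing the conn object from the failing snippet above (a sketch; the exact error message depends on the Spark version):

# Broadcasting forces Spark to pickle the object right away, so the same
# serialization failure shows up without running any RDD job.
try:
    sc.broadcast(conn)
except Exception as e:
    print(e)  # e.g. Could not serialize ... can't pickle cStringIO.StringO objects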
Connecting to the database inside the map function removes the serialization problem, but it opens one connection per RDD element, so I had to switch to partition-level processing (mapPartitions/foreachPartition), which cut the number of connections from roughly 20k (one per record) to about 8-64 (one per partition). Spark developers should consider providing an initialization hook for executors so this kind of dead end can be avoided.
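A minimal sketch of that partition-level variant (the helper name insert_partition is illustrative; host, port, and table are taken from the snippets above):

from impala.dbapi import connect

def insert_partition(rows):
    # One connection and one cursor per partition, instead of one per record.
    conn = connect(host='192.168.0.1', port=25001, timeout=3600)
    cursor = conn.cursor()
    sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
    for row in rows:
        cursor.execute(sql, row)
    conn.close()

# foreachPartition is an action, so no collect() is needed afterwards.
rdd.foreachPartition(insert_partition)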
With such an init hook run once on every executor, each worker would hold its own connection (drawing from a connection pool, or from separate ZooKeeper nodes), the init code and the map functions would share the same scope, and the problem would disappear, giving faster code than the workaround above. At the end of the job Spark would free those objects and the program would end cleanly.
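Until such a hook exists, a rough approximation is a lazily created module-level connection that each Python worker process sets up on first use. This is only a sketch under assumptions: get_conn and insert_row are illustrative names, and reuse across tasks really only holds when these helpers live in a module shipped to the executors (for example via --py-files) rather than being defined in the interactive shell.

from impala.dbapi import connect

_conn = None  # at most one connection per Python worker process

def get_conn():
    # Created lazily on the executor the first time a task needs it,
    # then reused by later tasks running in the same worker process.
    global _conn
    if _conn is None:
        _conn = connect(host='192.168.0.1', port=25001, timeout=3600)
    return _conn

def insert_row(row):
    sql = 'INSERT INTO test_db.test_table(sec_code,dt,minute,itype,ftype) values(%s,%s,%s,%s,%s)'
    get_conn().cursor().execute(sql, row)

rdd.foreach(insert_row)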
References:
Spark can't pickle method_descriptor: https://stackoverflow.com/questions/28142578/spark-cant-pickle-method-descriptor