Using the Mongo Spark Connector to connect Python (pyspark) to MongoDB raises the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.
: java.lang.ClassNotFoundException: com.mongodb.hadoop.io.BSONWritable
at java.net.URLClassLoader$1.run(URLClassLoader.java:435)
at java.net.URLClassLoader$1.run(URLClassLoader.java:424)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:423)
at java.lang.ClassLoader.loadClass(ClassLoader.java:493)
at java.lang.ClassLoader.loadClass(ClassLoader.java:426)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:344)
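For context, the stack trace above comes from saving an RDD through mongo-hadoop's output format. A minimal sketch of that kind of call follows; it is modeled on the mongo-hadoop README, with the URI and collection name as placeholders and `rdd` assumed to be a pair RDD:

# Sketch: saving a pair RDD through mongo-hadoop's MongoOutputFormat.
# This is the path that needs com.mongodb.hadoop.io.BSONWritable on the
# classpath; the Mongo Spark Connector does not support it from Python.
rdd.saveAsNewAPIHadoopFile(
    path="file:///unused",  # required by the API, ignored by MongoOutputFormat
    outputFormatClass="com.mongodb.hadoop.MongoOutputFormat",
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.MapWritable",
    conf={"mongo.output.uri": "mongodb://127.0.0.1:27017/test.coll"},
)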
The Mongo Spark Connector supports only DataFrames from Python, not RDDs (a DataFrame write is sketched below).
If you want to save an RDD to MongoDB from Python, the pyMongo library is recommended instead (see the second sketch further down).
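A minimal sketch of the supported DataFrame route; the connector package coordinate, MongoDB URI, and database/collection names here are assumptions, and the job would be submitted with something like --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0:

from pyspark.sql import SparkSession

# Placeholder URI: database "test", collection "coll" on a local mongod.
spark = (SparkSession.builder
         .appName("mongo-dataframe-write")
         .config("spark.mongodb.output.uri",
                 "mongodb://127.0.0.1:27017/test.coll")
         .getOrCreate())

# Turn the RDD into a DataFrame, then write through the connector.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
df = spark.createDataFrame(rdd, ["name", "value"])
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()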
As the MongoDB JIRA ticket linked below explains, Python support is provided only via DataFrames by design: unlike DataFrames, there is no Spark API that hooks into the JVM at the RDD level from Python.
If you want to use RDDs in Python and cannot use DataFrames, the recommendation is to use the pyMongo library directly in your Python code and save the data through the native Python API.
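A sketch of that pyMongo route, reusing the rdd from the sketch above; the URI, database/collection names, and the document mapping are assumptions:

import pymongo

def save_partition(rows):
    # Connect inside the function so the client is created on the executor;
    # pymongo clients cannot be shipped from the driver to executors.
    client = pymongo.MongoClient("mongodb://127.0.0.1:27017")
    coll = client["test"]["coll"]
    docs = [{"name": name, "value": value} for name, value in rows]
    if docs:
        coll.insert_many(docs)
    client.close()

rdd.foreachPartition(save_partition)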
For more details, see: https://jira.mongodb.org/browse/SPARK-146
Mongo Spark Connector: https://docs.mongodb.com/spark-connector/master/
mongo-hadoop : https://github.com/mongodb/mongo-hadoop/