mongo-spark-connector notes

sbt dependency

libraryDependencies += "org.mongodb.spark" % "mongo-spark-connector_2.11" % "2.0.0"

Configuration

package com.neusoft.apps

import com.mongodb.spark.{MongoConnector, MongoSpark}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import com.mongodb.spark.config._
import scala.collection.mutable.HashMap // needed for the mutable HashMap used below

Set the required options (uri, database, collection) either on the SparkConf or in the $SPARK_HOME/conf/spark-defaults.conf file, for example:

// For a MongoDB cluster/replica set, list every host; for a standalone MongoDB, a single host is enough
val uri = """mongodb://xxx.xxx.xxx.xxx:27017,xxx.xxx.xxx.xxx:27017,xxx.xxx.xxx.xxx:27017/db.test"""

val conf = new SparkConf()
      .set("spark.mongodb.input.uri", uri)  // read configuration
      .set("spark.mongodb.output.uri", uri) // write configuration

val sparkSession = SparkSession.builder().config(conf).appName("learn something").getOrCreate()
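
With the session in place, a quick sanity check confirms the connector picks up those settings. A minimal sketch, assuming only the configuration above (it simply reads the default db.test collection from the uri):

// Load the default collection (db.test from the uri) as a DataFrame
val df = MongoSpark.load(sparkSession)
df.printSchema()
println(s"documents in db.test: ${df.count()}")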

When you later need to read or write a different collection, simply override the defaults with a WriteConfig:

val writeVectorMap = new HashMap[String, String]
writeVectorMap += ("collection" -> CollectionDict.VISIT_VECTOR)
writeVectorMap += ("writeConcern.w" -> "majority")

val writeVectorConfig = WriteConfig(writeVectorMap, Some(WriteConfig(sparkSession)))

MongoSpark.save(similarityDocRDD, writeVectorConfig)
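
Reading a different collection works the same way with a ReadConfig. A minimal sketch (the readPreference option is only an example, not required):

// Override only the collection; everything else falls back to the session defaults
val readVectorConfig = ReadConfig(
  Map("collection" -> CollectionDict.VISIT_VECTOR, "readPreference.name" -> "secondaryPreferred"),
  Some(ReadConfig(sparkSession)))

// Load that collection as a DataFrame using the overridden settings
val visitVectorDF = MongoSpark.load(sparkSession, readVectorConfig)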

Note: something strange happened. The uri did not spell out the database (db) and collection (test); the WriteConfig did specify a collection, yet the job still threw the exception: Missing collection name. Set via the 'spark.mongodb.output.uri' or 'spark.mongodb.output.collection' property.
That is, with a setup like this:

val uri = """mongodb://10.4.120.83"""
val db = "smarket"
val conf = new SparkConf()
      .set("spark.mongodb.input.uri", uri)
      .set("spark.mongodb.output.database", db)

val writeVectorMap = new HashMap[String, String]
writeVectorMap += ("collection" -> CollectionDict.VISIT_VECTOR)
writeVectorMap += ("writeConcern.w" -> "majority")

val writeVectorConfig = WriteConfig(writeVectorMap, Some(WriteConfig(sparkSession)))

MongoSpark.save(similarityDocRDD, writeVectorConfig)

(Screenshot 1: the "Missing collection name" exception)

Looking at the WriteConfig.apply() source:
(Screenshot 2: the WriteConfig.apply() source code)

By rights, the first argument is the Map I passed in and the second is Some(WriteConfig(sparkSession)), so cleanedOptions should hold the two key-value pairs from writeVectorMap. But after adding cleanedOptions to the debugger's watch list, I found:
(Screenshot 3: cleanedOptions in the debugger watch window)

It actually contained the IP address and database taken from the uri, while defaultDatabase and defaultCollection, which should not have been None, were mysteriously None, and that is what produced the strange exception above. So, to be safe, configure every required option in the SparkConf from the very start, and when reading or writing other collections use a WriteConfig or ReadConfig to replace just the collection option, as sketched below.
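
A minimal sketch of that safer setup, reusing the host and database from these notes; the "visit_vector" collection name and the sample RDD at the end are purely illustrative stand-ins:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.bson.Document
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig

// Put uri, database and collection into the SparkConf from the start
val conf = new SparkConf()
      .set("spark.mongodb.input.uri", "mongodb://10.4.120.83:27017/smarket.test")
      .set("spark.mongodb.output.uri", "mongodb://10.4.120.83:27017/smarket.test")

val sparkSession = SparkSession.builder().config(conf).appName("learn something").getOrCreate()

// Later, redirect writes to another collection by overriding only the collection option
// ("visit_vector" stands in for CollectionDict.VISIT_VECTOR from earlier)
val writeVectorConfig = WriteConfig(
  Map("collection" -> "visit_vector", "writeConcern.w" -> "majority"),
  Some(WriteConfig(sparkSession)))

// A throwaway RDD of documents, just to show the save call
val sampleDocs = sparkSession.sparkContext.parallelize(Seq(new Document("value", 1)))
MongoSpark.save(sampleDocs, writeVectorConfig)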
I spent two hours debugging this bizarre problem and another hour writing it up. I have no words.
