Writing Data to ElasticSearch with Apache Spark (via spark-shell)

ES and Spark versions:

Elasticsearch 6.8.2

Installation guide: https://blog.csdn.net/mei501501/article/details/100866673

spark-2.4.4-bin-hadoop2.7

Installation guide: https://blog.csdn.net/mei501501/article/details/102565970
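
Before going any further, it's worth confirming that ES is actually reachable; a plain GET against port 9200 should return the cluster banner (cluster name, version, and so on):

curl -XGET http://127.0.0.1:9200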

First, with ES running, launch spark-shell with the es-hadoop jar on the classpath:

Download: https://mvnrepository.com/artifact/org.elasticsearch/elasticsearch-spark-20_2.11/6.8.2

spark-shell --jars /Users/mengqingmei/Documents/elasticsearch-spark-20_2.11-6.8.2.jar
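
Note that spark-shell creates its SparkContext (sc) at startup, so connector settings are most reliably supplied on the command line; elasticsearch-hadoop recognizes any Spark property prefixed with spark.es.*. A launch along these lines (same jar path as above, values assumed for a local ES) should work:

spark-shell --jars /Users/mengqingmei/Documents/elasticsearch-spark-20_2.11-6.8.2.jar \
  --conf spark.es.nodes=127.0.0.1 \
  --conf spark.es.port=9200 \
  --conf spark.es.index.auto.create=true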

The interactive session looks like this:

import org.apache.spark.SparkConf
import org.elasticsearch.spark._
val conf = new SparkConf()
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "127.0.0.1")
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

The session above is broken down step by step below.

0. Initialize the SparkContext with ElasticSearch-related settings:

import org.apache.spark.SparkConf
import org.elasticsearch.spark._
val conf = new SparkConf()
conf.set("es.index.auto.create", "true")
conf.set("es.nodes", "127.0.0.1")
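
As mentioned above, calling set on a fresh SparkConf inside spark-shell does not reconfigure the already-running sc (the example still works here only because both settings happen to match the connector's defaults). In a standalone application the SparkConf is actually handed to the SparkContext; a minimal sketch, with the object and app names chosen arbitrarily:

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object EsWriteApp {                            // hypothetical object name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("es-write")                  // arbitrary app name
      .set("es.index.auto.create", "true")     // create the index on first write
      .set("es.nodes", "127.0.0.1")            // ES node to connect to
    val sc = new SparkContext(conf)
    sc.makeRDD(Seq(Map("one" -> 1))).saveToEs("spark/docs")
    sc.stop()
  }
}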

Before writing any data, import org.elasticsearch.spark._; its implicit conversions give every RDD a saveToEs method. Below I walk through writing different types of data to ElasticSearch.

1. Writing Map objects to ElasticSearch

import org.elasticsearch.spark._
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/doc")

The code above builds two Map objects and writes them to ElasticSearch. In the saveToEs argument, spark is the index and doc is the type. We can then inspect the spark index's mapping with the following URL:

curl -XGET http://127.0.0.1:9200/spark

And all of the documents can be searched with:

http://127.0.0.1:9200/spark/doc/_search
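
The connector can also read these documents back into Spark: the same org.elasticsearch.spark._ import adds an esRDD method on the SparkContext, returning (id, document) pairs. A quick round-trip check in the same shell session:

import org.elasticsearch.spark._
val docs = sc.esRDD("spark/doc")      // RDD of (_id, Map of fields)
docs.collect().foreach(println)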

2. Writing case class objects to ElasticSearch

Scala case class objects can also be written to ElasticSearch (in Java, JavaBean objects work the same way):

case class Trip(departure: String, arrival: String) 
val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")
val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
rdd.saveToEs("spark/doc")

The snippet above writes upcomingTrip and lastWeekTrip to the index named spark with type doc. So far the saveToEs method has come from implicit conversions on the RDD; elasticsearch-hadoop also provides an explicit API, EsSpark, for writing an RDD to ElasticSearch:

import org.elasticsearch.spark.rdd.EsSpark
val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
EsSpark.saveToEs(rdd, "spark/doc")
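
EsSpark.saveToEs also accepts a per-write settings map, which is handy for overriding an option for a single write without touching the global configuration. For example, using es.mapping.id (covered in section 4) purely as an illustration:

import org.elasticsearch.spark.rdd.EsSpark
// Per-write settings apply to this call only; here the document _id
// is taken from each Trip's departure field (an illustrative choice).
EsSpark.saveToEs(rdd, "spark/doc", Map("es.mapping.id" -> "departure"))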

3. Writing JSON strings to ElasticSearch

JSON strings can be written to ElasticSearch directly, as-is:

val json1 = """{"id" : 1, "blog" : "www.iteblog.com", "weixin" : "iteblog_hadoop"}"""
val json2 = """{"id" : 2, "blog" : "books.iteblog.com", "weixin" : "iteblog_hadoop"}"""
sc.makeRDD(Seq(json1, json2)).saveJsonToEs("iteblog3/json")
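
saveJsonToEs hands the strings to ElasticSearch verbatim, skipping the connector's own serialization, so each string must already be a well-formed JSON document. The explicit EsSpark API offers the same operation; a sketch using the two strings above:

import org.elasticsearch.spark.rdd.EsSpark
val jsonRdd = sc.makeRDD(Seq(json1, json2))
EsSpark.saveJsonToEs(jsonRdd, "iteblog3/json")   // strings are sent as-is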

4. Custom document ids

In ElasticSearch, the _index/_type/_id combination uniquely identifies a document. If we don't specify an id, ElasticSearch auto-generates a globally unique one, 20 characters long. Such a string is hard to read or remember, but we can set the id ourselves at insert time. saveToEsWithMeta expects a pair RDD and uses each tuple's key as the document's _id:

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val muc = Map("iata" -> "MUC", "name" -> "Munich")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")
val airportsRDD = sc.makeRDD(Seq((1, otp), (2, muc), (3, sfo)))  // (id, document) pairs
airportsRDD.saveToEsWithMeta("iteblog5/2015")  // the keys 1, 2, 3 become each document's _id
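
When the id already lives inside the document itself, an alternative is the es.mapping.id setting, which tells the connector which field to use as the _id. A minimal sketch against the same airport Maps:

import org.elasticsearch.spark._
val airportDocs = sc.makeRDD(Seq(otp, muc, sfo))
// Use each Map's "iata" field as the document _id instead of tupled metadata
airportDocs.saveToEs("iteblog5/2015", Map("es.mapping.id" -> "iata"))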

 
