Integrating Scala code with Elasticsearch is already quite common. All it takes is one Maven dependency:
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-hadoop</artifactId>
    <version>6.1.0</version>
</dependency>
Then a small piece of code is all it takes to write MySQL data into Elasticsearch, which is very convenient:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

val sconf = new SparkConf()
  .setAppName(this.getClass.getName)
  .setMaster("local[5]")
  .set("spark.testing.memory", "471859200")
  .set("es.nodes", "xxx")                // Elasticsearch node address
  .set("es.port", "9200")
  .set("es.index.auto.create", "true")   // create the index automatically if it does not exist
  .set("es.nodes.wan.only", "true")
val spark = SparkSession.builder().config(sconf).getOrCreate()
spark.sparkContext.setLogLevel("WARN")

// read the MySQL table over JDBC
val dataDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://xxx:3306/database?characterEncoding=utf8&useSSL=false")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "123")
  .option("dbtable", "table")
  .load()

// write the DataFrame to Elasticsearch: index test_index, type doc
EsSparkSQL.saveToEs(dataDF, "test_index/doc")
Reading and writing with PySpark is also very simple; you just need to download the corresponding jar package.
Download address:
https://www.elastic.co/cn/downloads/past-releases/elasticsearch-apache-hadoop-6-4-1
This archive contains several jars for integrating with Hive, Pig, MapReduce, Storm, Spark, and other frameworks. What we need this time is the Spark integration, so copy elasticsearch-hadoop-6.4.1.jar from the archive into Spark's jars folder.
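If you prefer not to copy the jar into the Spark installation itself, a minimal alternative sketch is to hand the connector jar to the session through the spark.jars config; the jar path, application name, and ES address below are placeholders:

from pyspark.sql import SparkSession

# Sketch only: load the connector jar at session startup instead of copying
# it into $SPARK_HOME/jars; the path and host below are placeholders.
spark = SparkSession.builder \
    .appName("es-demo") \
    .config("spark.jars", "/path/to/elasticsearch-hadoop-6.4.1.jar") \
    .config("es.nodes", "xxx") \
    .config("es.port", "9200") \
    .getOrCreate()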
A few examples from the official site:
Create an esRDD and specify the query:
import org.elasticsearch.spark._
..
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
Read an index as a DataFrame with Spark SQL:
import org.elasticsearch.spark.sql._
// DataFrame schema automatically inferred
val df = sqlContext.read.format("es").load("buckethead/albums")
// operations get pushed down and translated at runtime to Elasticsearch QueryDSL
val playlist = df.filter(df("category").equalTo("pikes").and(df("year").geq(2016)))
Import the org.elasticsearch.spark._ package to gain saveToEs methods on your RDDs:
import org.elasticsearch.spark._
val conf = ...
val sc = new SparkContext(conf)
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")
DataFrames can be saved the same way through org.elasticsearch.spark.sql._:
import org.elasticsearch.spark.sql._
val df = sqlContext.read.json("examples/people.json")
df.saveToEs("spark/people")
In a Java environment, use the org.elasticsearch.spark.rdd.api.java package, in particular the JavaEsSpark class. To read data from ES, create a dedicated RDD and specify the query as an argument:
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists");
For Spark SQL in Java:
SQLContext sql = new SQLContext(sc);
DataFrame df = sql.read().format("es").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2016)));
Use JavaEsSpark to index any RDD to Elasticsearch:
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);
Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");
JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");
To save a DataFrame from Java, use JavaEsSparkSQL:
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;
DataFrame df = sqlContext.read().json("examples/people.json");
JavaEsSparkSQL.saveToEs(df, "spark/docs");
This time, open PyCharm and write the test file's data into Elasticsearch:
import os
import sys
from pyspark.sql import SparkSession, Row

os.environ['SPARK_HOME'] = r'D:\data\spark-2.3.3-bin-hadoop2.6'
sys.path.append(r'D:\data\spark-2.3.3-bin-hadoop2.6\python')

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
test_json_file = sc.textFile("xxx.file")

# get_base_info is the parsing function
file_map = test_json_file.map(lambda line: get_base_info(line))

# convert the data to a DataFrame
file_info = spark.createDataFrame(file_map)

# write the result into Elasticsearch
file_info.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "xxx") \
    .option("es.resource", "xxx/doc") \
    .mode('append') \
    .save()
The data can also be read back freely and loaded as a table:
query = """
{
"query": {
"match": {
"sum_rec":"xxx"
}
}
}"""
spark.read \
.format("org.elasticsearch.spark.sql") \
.option("es.nodes", "ip") \
.option("es.resource", "xxx/doc") \
.option("es.input.json","yes") \
.option("es.index.read.missing.as.empty","true") \
.option("es.query",query) \
.load().registerTempTable("temp")
According to the Elasticsearch official documentation
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
the connector can also be combined with Spark Streaming:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.elasticsearch.spark.streaming._
import scala.collection.mutable
...
val conf = ...
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")
val rdd = sc.makeRDD(Seq(numbers, airports))
val microbatches = mutable.Queue(rdd)
ssc.queueStream(microbatches).saveToEs("spark/docs")
ssc.start()
ssc.awaitTermination()
Or:
import org.apache.spark.SparkContext
import org.elasticsearch.spark.streaming.EsSparkStreaming
import scala.collection.mutable
// sc and ssc are the SparkContext and StreamingContext created in the previous example
// define a case class
case class Trip(departure: String, arrival: String)
val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")
val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
val microbatches = mutable.Queue(rdd)
val dstream = ssc.queueStream(microbatches)
EsSparkStreaming.saveToEs(dstream, "spark/docs")
ssc.start()
For deployment to the production environment, simply put the jar into the production Spark client and restart Spark, or specify it with --jars.
When running PySpark in production you may also need to add a spark.driver.memory setting:
spark = SparkSession.builder.config("spark.driver.memory","2g").getOrCreate()