Reading and Writing Elasticsearch with Spark

Integrating Scala code with Elasticsearch is already common practice.

All it takes is one Maven dependency:

<dependency>
	<groupId>org.elasticsearch</groupId>
	<artifactId>elasticsearch-hadoop</artifactId>
	<version>6.1.0</version>
</dependency>

After that, a few lines of code are enough to write MySQL data into Elasticsearch, which is very convenient:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

// Spark and Elasticsearch connection settings
val sconf = new SparkConf()
  .setAppName(this.getClass.getName)
  .setMaster("local[5]")
  .set("spark.testing.memory", "471859200")
  .set("es.nodes", "xxx")
  .set("es.port", "9200")
  .set("es.index.auto.create", "true")
  .set("es.nodes.wan.only", "true")
val spark = SparkSession.builder().config(sconf).getOrCreate()
spark.sparkContext.setLogLevel("WARN")

// Load the MySQL table as a DataFrame over JDBC
val dataDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://xxx:3306/database?characterEncoding=utf8&useSSL=false")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "123")
  .option("dbtable", "table")
  .load()

// Write the DataFrame to the test_index index, type doc
EsSparkSQL.saveToEs(dataDF, "test_index/doc")

Reading and writing from PySpark is just as simple; you only need to download the corresponding jar.

Download link:
https://www.elastic.co/cn/downloads/past-releases/elasticsearch-apache-hadoop-6-4-1

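The archive contains several jars for integrating with Hive, Pig, MapReduce, Storm, Spark, and other frameworks. Here we need the Spark integration, so copy elasticsearch-hadoop-6.4.1.jar from the archive into Spark's jars folder.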
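
The copy step can also be scripted. The following is a minimal sketch, assuming the archive has already been unpacked and that the Spark installation has the usual jars directory; install_connector_jar is a hypothetical helper name, not part of any Spark or Elasticsearch tooling.

```python
import os
import shutil


def install_connector_jar(jar_path, spark_home):
    """Copy a connector jar into <spark_home>/jars and return the new path.

    Hypothetical helper: it only wraps a file copy and checks its inputs.
    """
    if not os.path.isfile(jar_path):
        raise FileNotFoundError("connector jar not found: " + jar_path)
    jars_dir = os.path.join(spark_home, "jars")
    if not os.path.isdir(jars_dir):
        raise NotADirectoryError("no jars directory under: " + spark_home)
    # shutil.copy2 returns the path of the file it created inside jars_dir
    return shutil.copy2(jar_path, jars_dir)
```

It would be called with the path of the unpacked elasticsearch-hadoop-6.4.1.jar and the Spark installation directory.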

A few examples from the official site:

Scala

Reading

Create an esRDD and specify the query:

import org.elasticsearch.spark._

...
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")

Spark SQL

import org.elasticsearch.spark.sql._

// DataFrame schema automatically inferred
val df = sqlContext.read.format("es").load("buckethead/albums")

// operations get pushed down and translated at runtime to Elasticsearch QueryDSL
val playlist = df.filter(df("category").equalTo("pikes").and(df("year").geq(2016)))

Writing

Import the org.elasticsearch.spark._ package to gain saveToEs methods on your RDDs:

import org.elasticsearch.spark._        

val conf = ...
val sc = new SparkContext(conf)         

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")

sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

Spark SQL

import org.elasticsearch.spark.sql._

val df = sqlContext.read.json("examples/people.json")
df.saveToEs("spark/people")

Java

In a Java environment, use the org.elasticsearch.spark.rdd.api.java package, in particular the JavaEsSpark class.

Reading

To read data from ES, create a dedicated RDD and specify the query as an argument.

import org.apache.spark.api.java.JavaSparkContext;   
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark; 

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);   

JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists");

Spark SQL

SQLContext sql = new SQLContext(sc);
DataFrame df = sql.read().format("es").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2016)));

Writing

Use JavaEsSpark to index any RDD to Elasticsearch:

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark; 

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf); 

Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);     
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");

JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");

Spark SQL

import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

DataFrame df = sqlContext.read().json("examples/people.json");
JavaEsSparkSQL.saveToEs(df, "spark/docs");

This time, open PyCharm and write the data from a test file into Elasticsearch:

import os
import sys
from pyspark.sql import SparkSession, Row

os.environ['SPARK_HOME'] =r'D:\data\spark-2.3.3-bin-hadoop2.6'
sys.path.append(r'D:\data\spark-2.3.3-bin-hadoop2.6\python')

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
test_json_file = sc.textFile("xxx.file")
# get_base_info is the parsing function
file_map = test_json_file.map(lambda line: get_base_info(line))
# convert the data to a DataFrame
file_info = spark.createDataFrame(file_map)

# store the result in Elasticsearch
file_info.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "xxx") \
    .option("es.resource", "xxx/doc") \
    .mode('append') \
    .save()
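
The script above calls get_base_info without showing it. As a rough sketch only, assuming each line of the test file is one JSON object, a parser could look like the following. The field names name and age are invented for illustration; the real function depends on the layout of the file.

```python
import json


def get_base_info(line):
    """Parse one JSON line into a flat dict (hypothetical field names)."""
    record = json.loads(line)
    return {
        # Coerce types so every record has the same shape
        "name": str(record.get("name", "")),
        "age": int(record.get("age", 0)),
    }
```

In the PySpark job the dict can be wrapped as Row(**get_base_info(line)) before calling spark.createDataFrame, which is presumably why Row is imported at the top of the script.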

You can also read data back freely and load it as a table:

query = """
    {   
     "query": {
        "match": {
          "sum_rec":"xxx"
        }
      }
    }"""
spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "ip") \
    .option("es.resource", "xxx/doc") \
    .option("es.input.json","yes") \
    .option("es.index.read.missing.as.empty","true") \
    .option("es.query",query) \
    .load().createOrReplaceTempView("temp")
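
A hand-written query string is easy to get wrong. As a small sketch, assuming the same match query as above, the query can be built with json.dumps and the reader options collected in one place. build_match_query, es_read_options, and load_as_table are hypothetical helper names; the es.* keys are the ones already used in this post.

```python
import json


def build_match_query(field, value):
    """Return an Elasticsearch match query as a JSON string."""
    return json.dumps({"query": {"match": {field: value}}})


def es_read_options(nodes, resource, query):
    """Collect the connector options used for reading."""
    return {
        "es.nodes": nodes,
        "es.resource": resource,
        "es.index.read.missing.as.empty": "true",
        "es.query": query,
    }


def load_as_table(spark, options, table_name):
    """Load the query result and register it as a temporary view.

    Needs a live SparkSession with the elasticsearch-hadoop jar available.
    """
    reader = spark.read.format("org.elasticsearch.spark.sql").options(**options)
    df = reader.load()
    df.createOrReplaceTempView(table_name)
    return df
```

With a running session, load_as_table(spark, es_read_options("ip", "xxx/doc", build_match_query("sum_rec", "xxx")), "temp") registers the view, after which spark.sql("select * from temp") can query it.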

According to the official Elasticsearch documentation at
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html

it can also be combined with Spark Streaming:

import scala.collection.mutable

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

import org.elasticsearch.spark.streaming._

...

val conf = ...
val sc = new SparkContext(conf)                      
val ssc = new StreamingContext(sc, Seconds(1))       

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

val rdd = sc.makeRDD(Seq(numbers, airports))
val microbatches = mutable.Queue(rdd)                

ssc.queueStream(microbatches).saveToEs("spark/docs") 

ssc.start()
ssc.awaitTermination() 

Or:

import org.apache.spark.SparkContext
import org.elasticsearch.spark.streaming.EsSparkStreaming         

// define a case class
case class Trip(departure: String, arrival: String)               

val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")

val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
val microbatches = mutable.Queue(rdd)                             
val dstream = ssc.queueStream(microbatches)

EsSparkStreaming.saveToEs(dstream, "spark/docs")                  

ssc.start()    

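To deploy in production, just place the jar in the production Spark client's jars directory and restart Spark, or specify it with --jars.

When running PySpark in production, you may need to add the spark.driver.memory setting: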
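
For the --jars route, here is a minimal sketch of assembling the spark-submit command from Python. build_submit_command and submit are hypothetical helpers; only spark-submit and its --jars flag come from Spark itself.

```python
import subprocess


def build_submit_command(script, jars):
    """Build a spark-submit argument list; --jars takes a comma-separated list."""
    if not jars:
        raise ValueError("at least one jar is required")
    return ["spark-submit", "--jars", ",".join(jars), script]


def submit(script, jars):
    """Run spark-submit and raise if it exits with a non-zero status."""
    subprocess.run(build_submit_command(script, jars), check=True)
```

Passing the argument list directly to subprocess.run avoids shell quoting problems when a jar path contains spaces.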

spark = SparkSession.builder.config("spark.driver.memory","2g").getOrCreate()
