A Solution for Improving Spark-to-Elasticsearch Write Performance

The Elasticsearch site provides an official set of Spark write interfaces for ES.
See: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
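
For reference, a typical write through that connector looks roughly like the sketch below (the DataFrame name resDF and the resource name "my_index/my_type" are placeholders; the connector routes all writes over HTTP/REST, by default on port 9200):

import org.elasticsearch.spark.sql._

// elasticsearch-hadoop connector: the DataFrame is written via the HTTP/REST layer.
resDF.saveToEs("my_index/my_type", Map("es.nodes" -> "10.200.8.187", "es.port" -> "9200"))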

In practice, writing to ES from Spark Streaming with that interface performed poorly. After some investigation I found that it is implemented on top of ES's underlying HTTP/RESTful interface.
I took a different route and wrote to ES over the TCP transport protocol instead, which improved performance dramatically.
Sharing the experience here; the ES write code snippet is as follows:

// Imports needed by the snippet (Elasticsearch 2.x transport-client API; fastjson assumed for JSON serialization)
import java.net.InetAddress
import java.util

import com.alibaba.fastjson.JSONObject
import org.apache.spark.sql.Row
import org.elasticsearch.action.bulk.BulkResponse
import org.elasticsearch.client.transport.TransportClient
import org.elasticsearch.common.settings.Settings
import org.elasticsearch.common.transport.InetSocketTransportAddress

resDF.foreachPartition { (iterRecords: Iterator[Row]) =>
  // One TransportClient (TCP, port 9300) per partition, reused for the whole batch.
  val settings = Settings.settingsBuilder.put("cluster.name", "myES").build
  val client = TransportClient.builder.settings(settings).build
  client.addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("10.200.8.187"), 9300))
  val bulkRequest = client.prepareBulk

  iterRecords.foreach { (row: Row) =>
    // Build the document body from the configured (column position, ES field name) pairs.
    val jsonMap = new util.HashMap[String, String]
    index_field.foreach { case (index, field) =>
      if (row.isNullAt(index))
        jsonMap.put(field, null)
      else
        jsonMap.put(field, row.get(index).toString)
    }

    // Target index, type, and document id come from dedicated columns of the row.
    bulkRequest.add(client.prepareIndex(row.get(esIndex_index).toString,
      row.get(esType_index).toString, row.get(esID_index).toString)
      .setSource(JSONObject.toJSONString(jsonMap)))
  }

  println("bulkRequest.numberOfActions():" + bulkRequest.numberOfActions)

  // Send the whole partition as a single bulk request and log any failures.
  if (bulkRequest.numberOfActions > 0) {
    val bulkResponse: BulkResponse = bulkRequest.execute.actionGet
    if (bulkResponse.hasFailures) {
      println("failed processing bulk index requests " + bulkResponse.buildFailureMessage)
    }
  }
  client.close()
}
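
The snippet assumes a few values are defined on the driver and captured by the closure: resDF (the result DataFrame), index_field (which Row columns map to which ES fields), and esIndex_index / esType_index / esID_index (the columns holding the target index, type, and document id). A purely illustrative shape they might take:

// Hypothetical definitions, for illustration only -- adapt to your own schema.
val index_field: Array[(Int, String)] = Array((3, "user_id"), (4, "event_time"), (5, "payload"))
val esIndex_index = 0  // column holding the target index name
val esType_index  = 1  // column holding the target type name
val esID_index    = 2  // column holding the document id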

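Note that the transport client ships with the full Elasticsearch artifact rather than with elasticsearch-spark. Assuming an Elasticsearch 2.x cluster (where Settings.settingsBuilder and TransportClient.builder exist), the sbt dependency would look roughly like:

// build.sbt -- the version is an assumption; match it to your cluster.
libraryDependencies += "org.elasticsearch" % "elasticsearch" % "2.3.5"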