roykingw

elasticSearch spark支持

–Note：1、文章摘自elasticsearch官网，因为觉得再费力的总结，总不如官网的说明学习更快。
2、关于elasticSearch-hadoop组件：从目前的版本看(2017年9月26日)，从elasticsearch官网下载的elasticsearch_hadoop组件还只能基于scala 2.10.x 版本。而spark的2.1.X版本开始，已经基于scala2.11以上的版本进行开发了。(而且坑爹的scala并不是完全向下兼容的)，所以造成官网的组件实际上不太好用。 elasticsearch与spark交互的包需要从maven中下载。

<dependency>
   <groupId>org.elasticsearchgroupId>
    <artifactId>elasticsearch-spark-20_2.11artifactId>
    <version>5.3.2version>
dependency>

以下摘自网页 https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
Apache Spark support

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs.

-- Spark website

Spark provides fast iterative/functional-like capabilities over large data sets, typically by caching data in memory. As opposed to the rest of the libraries mentioned in this documentation, Apache Spark is computing framework that is not tied to Map/Reduce itself however it does integrate with Hadoop, mainly to HDFS. elasticsearch-hadoop allows Elasticsearch to be used in Spark in two ways: through the dedicated support available since 2.1 or through the Map/Reduce bridge since 2.0. Spark 2.0 is supported in elasticsearch-hadoop since version 5.0

Installationedit
Just like other libraries, elasticsearch-hadoop needs to be available in Spark’s classpath. As Spark has multiple deployment modes, this can translate to the target classpath, whether it is on only one node (as is the case with the local mode - which will be used through-out the documentation) or per-node depending on the desired infrastructure.

Native supportedit
Note
Added in 2.1.

elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD (Resilient Distributed Dataset) (or Pair RDD to be precise) that can read data from Elasticsearch. The RDD is offered in two flavors: one for Scala (which returns the data as Tuple2 with Scala collections) and one for Java (which returns the data as Tuple2 containing java.util collections).

Important
Whenever possible, consider using the native integration as it offers the best performance and maximum flexibility.

Configurationedit
To configure elasticsearch-hadoop for Apache Spark, one can set the various properties described in the Configuration chapter in the SparkConf object:

import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName(appName).setMaster(master)
conf.set("es.index.auto.create", "true")
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
conf.set("es.index.auto.create", "true");

Command-line. For those that want to set the properties through the command-line (either directly or by loading them from a file), note that Spark only accepts those that start with the “spark.” prefix and will ignore the rest (and depending on the version a warning might be thrown). To work around this limitation, define the elasticsearch-hadoop properties by appending the spark. prefix (thus they become spark.es.) and elasticsearch-hadoop will automatically resolve them:


$ ./bin/spark-submit --conf spark.es.resource=index/type ...

Notice the es.resource property which became spark.es.resource

Writing data to Elasticsearchedit
With elasticsearch-hadoop, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. In practice this means the RDD type needs to be a Map (whether a Scala or a Java one), a JavaBean or a Scala case class. When that is not the case, one can easily transform the data in Spark or plug-in their own custom ValueWriter.

Scalaedit
When using Scala, simply import the org.elasticsearch.spark package which, through the pimp my library pattern, enriches the any RDD API with saveToEs methods:

import org.apache.spark.SparkContext    
import org.apache.spark.SparkContext._

import org.elasticsearch.spark._        

...

val conf = ...
val sc = new SparkContext(conf)         

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

Spark Scala imports

elasticsearch-hadoop Scala imports

start Spark through its Scala API

makeRDD creates an ad-hoc RDD based on the collection specified; any other RDD (in Java or Scala) can be passed in

index the content (namely the two documents (numbers and airports)) in Elasticsearch under spark/docs

Note
Scala users might be tempted to use Seq and the → notation for declaring root objects (that is the JSON document) instead of using a Map. While similar, the first notation results in slightly different types that cannot be matched to a JSON document: Seq is an order sequence (in other words a list) while → creates a Tuple which is more or less an ordered, fixed number of elements. As such, a list of lists cannot be used as a document since it cannot be mapped to a JSON object; however it can be used freely within one. Hence why in the example above Map(k→v) was used instead of Seq(k→v)

As an alternative to the implicit import above, one can use elasticsearch-hadoop Spark support in Scala through EsSpark in the org.elasticsearch.spark.rdd package which acts as a utility class allowing explicit method invocations. Additionally instead of Maps (which are convenient but require one mapping per instance due to their difference in structure), use a case class :

import org.apache.spark.SparkContext
import org.elasticsearch.spark.rdd.EsSpark                        

// define a case class
case class Trip(departure: String, arrival: String)               

val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")

val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))             
EsSpark.saveToEs(rdd, "spark/docs")

EsSpark import
Define a case class named Trip
Create an RDD around the Trip instances
Index the RDD explicitly through EsSpark

For cases where the id (or other metadata fields like ttl or timestamp) of the document needs to be specified, one can do so by setting the appropriate mapping namely es.mapping.id. Following the previous example, to indicate to Elasticsearch to use the field id as the document id, update the RDD configuration (it is also possible to set the property on the SparkConf though due to its global effect it is discouraged):

EsSpark.saveToEs(rdd, “spark/docs”, Map(“es.mapping.id” -> “id”))
Javaedit
Java users have a dedicated class that provides a similar functionality to EsSpark, namely JavaEsSpark in the org.elasticsearch.spark.rdd.api.java (a package similar to Spark’s Java API):

import org.apache.spark.api.java.JavaSparkContext;                              
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;                        
...

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);                              

Map numbers = ImmutableMap.of("one", 1, "two", 2);                   
Map airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");

JavaRDD> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");

Spark Java imports
elasticsearch-hadoop Java imports
start Spark through its Java API
to simplify the example, use Guava(a dependency of Spark) Immutable* methods for simple Map, List creation
create a simple RDD over the two collections; any other RDD (in Java or Scala) can be passed in
index the content (namely the two documents (numbers and airports)) in Elasticsearch under spark/docs

The code can be further simplified by using Java 5 static imports. Additionally, the Map (who’s mapping is dynamic due to its loose structure) can be replaced with a JavaBean:

public class TripBean implements Serializable {
   private String departure, arrival;

   public TripBean(String departure, String arrival) {
       setDeparture(departure);
       setArrival(arrival);
   }

   public TripBean() {}

   public String getDeparture() { return departure; }
   public String getArrival() { return arrival; }
   public void setDeparture(String dep) { departure = dep; }
   public void setArrival(String arr) { arrival = arr; }
}

import static org.elasticsearch.spark.rdd.api.java.JavaEsSpark;                
...

TripBean upcoming = new TripBean("OTP", "SFO");
TripBean lastWeek = new TripBean("MUC", "OTP");

JavaRDD javaRDD = jsc.parallelize(
                            ImmutableList.of(upcoming, lastWeek));        
saveToEs(javaRDD, "spark/docs");

statically import JavaEsSpark
define an RDD containing TripBean instances (TripBean is a JavaBean)
call saveToEs method without having to type JavaEsSpark again

Setting the document id (or other metadata fields like ttl or timestamp) is similar to its Scala counterpart, though potentially a bit more verbose depending on whether you are using the JDK classes or some other utilities (like Guava):

JavaEsSpark.saveToEs(javaRDD, “spark/docs”, ImmutableMap.of(“es.mapping.id”, “id”));
Writing existing JSON to Elasticsearchedit
For cases where the data in the RDD is already in JSON, elasticsearch-hadoop allows direct indexing without applying any transformation; the data is taken as is and sent directly to Elasticsearch. As such, in this case, elasticsearch-hadoop expects either an RDD containing String or byte arrays (byte[]/Array[Byte]), assuming each entry represents a JSON document. If the RDD does not have the proper signature, the saveJsonToEs methods cannot be applied (in Scala they will not be available).

Scalaedit

val json1 = """{"reason" : "business", "airport" : "SFO"}"""      
val json2 = """{"participants" : 5, "airport" : "OTP"}"""

new SparkContext(conf).makeRDD(Seq(json1, json2))
                      .saveJsonToEs("spark/json-trips")

example of an entry within the RDD - the JSON is written as is, without any transformation
index the JSON data through the dedicated saveJsonToEs method

Javaedit

String json1 = "{\"reason\" : \"business\",\"airport\" : \"SFO\"}";  
String json2 = "{\"participants\" : 5,\"airport\" : \"OTP\"}";

JavaSparkContext jsc = ...
JavaRDD stringRDD = jsc.parallelize(ImmutableList.of(json1, json2));
JavaEsSpark.saveJsonToEs(stringRDD, "spark/json-trips");

example of an entry within the RDD - the JSON is written as is, without any transformation
notice the RDD signature
index the JSON data through the dedicated saveJsonToEs method

Writing to dynamic/multi-resourcesedit
For cases when the data being written to Elasticsearch needs to be indexed under different buckets (based on the data content) one can use the es.resource.write field which accepts a pattern that is resolved from the document content, at runtime. Following the aforementioned media example, one could configure it as follows:

Scalaedit

val game = Map("media_type"->"game","title" -> "FF VI","year" -> "1994")
val book = Map("media_type" -> "book","title" -> "Harry Potter","year" -> "2010")
val cd = Map("media_type" -> "music","title" -> "Surfing With The Alien")

sc.makeRDD(Seq(game, book, cd)).saveToEs("my-collection/{media_type}")

Document key used for splitting the data. Any field can be declared (but make sure it is available in all documents)
Save each object based on its resource pattern, in this example based on media_type

For each document/object about to be written, elasticsearch-hadoop will extract the media_type field and use its value to determine the target resource.

Javaedit
As expected, things in Java are strikingly similar:

Map game =
  ImmutableMap.of("media_type", "game", "title", "FF VI", "year", "1994");
Map book = ...
Map cd = ...

JavaRDD> javaRDD =
                jsc.parallelize(ImmutableList.of(game, book, cd));
saveToEs(javaRDD, "my-collection/{media_type}");

Save each object based on its resource pattern, media_type in this example

Handling document metadataedit
Elasticsearch allows each document to have its own metadata. As explained above, through the various mapping options one can customize these parameters so that their values are extracted from their belonging document. Further more, one can even include/exclude what parts of the data are sent back to Elasticsearch. In Spark, elasticsearch-hadoop extends this functionality allowing metadata to be supplied outside the document itself through the use of pair RDDs. In other words, for RDDs containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source.

The metadata is described through the Metadata Java enum within org.elasticsearch.spark.rdd package which identifies its type - id, ttl, version, etc… Thus an RDD keys can be a Map containing the Metadata for each document and its associated values. If RDD key is not of type Map, elasticsearch-hadoop will consider the object as representing the document id and use it accordingly. This sounds more complicated than it is, so let us see some examples.

Scalaedit
Pair RDDs, or simply put RDDs with the signature RDD[(K,V)] can take advantage of the saveToEsWithMeta methods that are available either through the implicit import of org.elasticsearch.spark package or EsSpark object. To manually specify the id for each document, simply pass in the Object (not of type Map) in your RDD:

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val muc = Map("iata" -> "MUC", "name" -> "Munich")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// instance of SparkContext
val sc = ...

val airportsRDD = sc.makeRDD(Seq((1, otp), (2, muc), (3, sfo)))  
airportsRDD.saveToEsWithMeta("airports/2015")

airportsRDD is a key-value pair RDD; it is created from a Seq of tuples
The key of each tuple within the Seq represents the id of its associated value/document; in other words, document otp has id 1, muc 2 and sfo 3
Since airportsRDD is a pair RDD, it has the saveToEsWithMeta method available. This tells elasticsearch-hadoop to pay special attention to the RDD keys and use them as metadata, in this case as document ids. If saveToEs would have been used instead, then elasticsearch-hadoop would consider the RDD tuple, that is both the key and the value, as part of the document.

When more than just the id needs to be specified, one should use a scala.collection.Map with keys of type org.elasticsearch.spark.rdd.Metadata:

import org.elasticsearch.spark.rdd.Metadata._          

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val muc = Map("iata" -> "MUC", "name" -> "Munich")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// metadata for each document
// note it's not required for them to have the same structure
val otpMeta = Map(ID -> 1, TTL -> "3h")                
val mucMeta = Map(ID -> 2, VERSION -> "23")            
val sfoMeta = Map(ID -> 3)                             

// instance of SparkContext
val sc = ...

val airportsRDD = sc.makeRDD(Seq((otpMeta, otp), (mucMeta, muc), (sfoMeta, sfo)))
airportsRDD.saveToEsWithMeta("airports/2015")

Import the Metadata enum
The metadata used for otp document. In this case, ID with a value of 1 and TTL with a value of 3h
The metadata used for muc document. In this case, ID with a value of 2 and VERSION with a value of 23
The metadata used for sfo document. In this case, ID with a value of 3
The metadata and the documents are assembled into a pair RDD
The RDD is saved accordingly using the saveToEsWithMeta method

Javaedit
In a similar fashion, on the Java side, JavaEsSpark provides saveToEsWithMeta methods that are applied to JavaPairRDD (the equivalent in Java of RDD[(K,V)]). Thus to save documents based on their ids one can use:

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;

// data to be saved
Map otp = ImmutableMap.of("iata", "OTP", "name", "Otopeni");
Map jfk = ImmutableMap.of("iata", "JFK", "name", "JFK NYC");

JavaSparkContext jsc = ...

// create a pair RDD between the id and the docs
JavaPairRDD pairRdd = jsc.parallelizePairs(ImmutableList.of(
        new Tuple2(1, otp),          
        new Tuple2(2, jfk)));        
JavaEsSpark.saveToEsWithMeta(pairRDD, target);

Create a JavaPairRDD by using Scala Tuple2 class wrapped around the document id and the document itself
Tuple for the first document wrapped around the id (1) and the doc (otp) itself
Tuple for the second document wrapped around the id (2) and jfk
The JavaPairRDD is saved accordingly using the keys as a id and the values as documents

When more than just the id needs to be specified, one can choose to use a java.util.Map populated with keys of type org.elasticsearch.spark.rdd.Metadata:

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
import org.elasticsearch.spark.rdd.Metadata;

import static org.elasticsearch.spark.rdd.Metadata.*; 

// data to be saved
Map<String, ?> otp = ImmutableMap.of("iata", "OTP", "name", "Otopeni");
Map<String, ?> sfo = ImmutableMap.of("iata", "SFO", "name", "San Fran");

// metadata for each document
// note it's not required for them to have the same structure
Map otpMeta = ImmutableMap. of(ID, 1, TTL, "1d");
Map sfoMeta = ImmutableMap. of(ID, "2", VERSION, "23");

JavaSparkContext jsc = ...

// create a pair RDD between the id and the docs
JavaPairRDD pairRdd = jsc.parallelizePairs<(ImmutableList.of(
        new Tuple2(otpMeta, otp),    
        new Tuple2(sfoMeta, sfo)));  
JavaEsSpark.saveToEsWithMeta(pairRDD, target);

Metadata enum describing the document metadata that can be declared
static import for the enum to refer to its values in short format (ID, TTL, etc…)
Metadata for otp document
Boiler-plate construct for forcing the of method generic signature
Metadata for sfo document
Tuple between otp (as the value) and its metadata (as the key)
Tuple associating sfo and its metadata
saveToEsWithMeta invoked over the JavaPairRDD containing documents and their respective metadata

Reading data from Elasticsearchedit
For reading, one should define the Elasticsearch RDD that streams data from Elasticsearch to Spark.

Scalaedit
Similar to writing, the org.elasticsearch.spark package, enriches the SparkContext API with esRDD methods:

import org.apache.spark.SparkContext    
import org.apache.spark.SparkContext._

import org.elasticsearch.spark._        

...

val conf = ...
val sc = new SparkContext(conf)         

val RDD = sc.esRDD("radio/artists")

Spark Scala imports
elasticsearch-hadoop Scala imports
start Spark through its Scala API
a dedicated RDD for Elasticsearch is created for index radio/artists

The method can be overloaded to specify an additional query or even a configuration Map (overriding SparkConf):

...
import org.elasticsearch.spark._

...
val conf = ...
val sc = new SparkContext(conf)

sc.esRDD("radio/artists", "?q=me*")

create an RDD streaming all the documents matching me* from index radio/artists

The documents from Elasticsearch are returned, by default, as a Tuple2 containing as the first element the document id and the second element the actual document represented through Scala collections, namely one Map[String, Any]where the keys represent the field names and the value their respective values.

Javaedit
Java users have a dedicated JavaPairRDD that works the same as its Scala counterpart however the returned Tuple2 values (or second element) returns the documents as native, java.util collections.

import org.apache.spark.api.java.JavaSparkContext;               
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;             
...

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);               

JavaPairRDD> esRDD =
                        JavaEsSpark.esRDD(jsc, "radio/artists");

Spark Java imports
elasticsearch-hadoop Java import
start Spark through its Java API
a dedicated JavaPairRDD for Elasticsearch is created for index radio/artists

In a similar fashion one can use the overloaded esRDD methods to specify a query or pass a Map object for advanced configuration. Let us see how this looks, but this time around using Java static imports. Further more, let us discard the documents ids and retrieve only the RDD values:


import static org.elasticsearch.spark.rdd.api.java.JavaEsSpark.*;   

...
JavaRDD> esRDD =
                        esRDD(jsc, "radio/artists", "?q=me*").values();

statically import JavaEsSpark class
create an RDD streaming all the documents starting with me from index radio/artists. Note the method does not have to be fully qualified due to the static import
return only values of the PairRDD - hence why the result is of type JavaRDD and not JavaPairRDD

By using the JavaEsSpark API, one gets a hold of Spark’s dedicated JavaPairRDD which are better suited in Java environments than the base RDD (due to its Scala signatures). Moreover, the dedicated RDD returns Elasticsearch documents as proper Java collections so one does not have to deal with Scala collections (which is typically the case with RDDs). This is particularly powerful when using Java 8, which we strongly recommend as its lambda expressions make collection processing extremely concise.

To wit, let us assume one wants to filter the documents from the RDD and return only those that contain a value that contains mega (please ignore the fact one can and should do the filtering directly through Elasticsearch).

In versions prior to Java 8, the code would look something like this:

JavaRDDString, Object>> esRDD =
                        esRDD(jsc, "radio/artists", "?q=me*").values();
JavaRDDString, Object>> filtered = esRDD.filter(
    new FunctionString, Object>, Boolean>() {
      @Override
      public Boolean call(Map<String, Object> map) throws Exception {
          returns map.contains("mega");
      }
    });

with Java 8, the filtering becomes a one liner:


JavaRDD<Map<String, Object>> esRDD =
                        esRDD(jsc, "radio/artists", "?q=me*").values();
JavaRDD<Map<String, Object>> filtered = esRDD.filter(doc ->
                                                doc.contains("mega"));

Reading data in JSON formatedit
In case where the results from Elasticsearch need to be in JSON format (typically to be sent down the wire to some other system), one can use the dedicated esJsonRDD methods. In this case, the connector will return the documents content as it is received from Elasticsearch without any processing as an RDD[(String, String)] in Scala or JavaPairRDD[String, String] in Java with the keys representing the document id and the value its actual content in JSON format.

Type conversionedit
Important
When dealing with multi-value/array fields, please see this section and in particular these configuration options. IMPORTANT: If automatic index creation is used, please review this section for more information.

elasticsearch-hadoop automatically converts Spark built-in types to Elasticsearch types (and back) as shown in the table below:

Table 5. Scala Types Conversion Table
Scala type—-Elasticsearch type
None —- null
Unit —- null
Nil —- empty array
Some[T] —- T according to the table
Map —- object
Traversable —- array
case class —- object (see Map)
Product —- array

in addition, the following implied conversion applies for Java types:

Table 6. Java Types Conversion Table
Java type —- Elasticsearch type
null —- null
String —- string
Boolean —- boolean
Byte —- byte
Short —- short
Integer —- int
Long —- long
Double —- double
Float —- float
Number —- float or double (depending on size)
java.util.Calendar —- date (string format)
java.util.Date —- date (string format)
java.util.Timestamp —- date (string format)
byte[] —- string (BASE64)
Object[] —- array
Iterable —- array
Map —- object
Java Bean —- object (see Map)

The conversion is done as a best effort; built-in Java and Scala types are guaranteed to be properly converted, however there are no guarantees for user types whether in Java or Scala. As mentioned in the tables above, when a case class is encountered in Scala or JavaBean in Java, the converters will try to unwrap its content and save it as an object. Note this works only for top-level user objects - if the user object has other user objects nested in, the conversion is likely to fail since the converter does not perform nested unwrapping. This is done on purpose since the converter has to serialize and deserialize the data and user types introduce ambiguity due to data loss; this can be addressed through some type of mapping however that takes the project way too close to the realm of ORMs and arguably introduces too much complexity for little to no gain; thanks to the processing functionality in Spark and the plugability in elasticsearch-hadoop one can easily transform objects into other types, if needed with minimal effort and maximum control.

Geo types. It is worth mentioning that rich data types available only in Elasticsearch, such as GeoPoint or GeoShape are supported by converting their structure into the primitives available in the table above. For example, based on its storage a geo_point might be returned as a String or a Traversable.

Spark Streaming supportedit
Note
Added in 5.0.

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

-- Spark website

Spark Streaming is an extension on top of the core Spark functionality that allows near real time processing of stream data. Spark Streaming works around the idea of DStreams, or Discretized Streams. DStreams operate by collecting newly arrived records into a small RDD and executing it. This repeats every few seconds with a new RDD in a process called microbatching. The DStream api includes many of the same processing operations as the RDD api, plus a few other streaming specific methods. elasticsearch-hadoop provides native integration with Spark Streaming as of version 5.0.

When using the elasticsearch-hadoop Spark Streaming support, Elasticsearch can be targeted as an output location to index data into from a Spark Streaming job in the same way that one might persist the results from an RDD. Though, unlike RDDs, you are unable to read data out of Elasticsearch using a DStream due to the continuous nature of it.

Important
Spark Streaming support provides special optimizations to allow for conservation of network resources on Spark executors when running jobs with very small processing windows. For this reason, one should prefer to use this integration instead of invoking saveToEs on RDDs returned from the foreachRDD call on DStream.

Writing DStream to Elasticsearchedit
Like RDDs, any DStream can be saved to Elasticsearch as long as its content can be translated into documents. In practice this means the DStream type needs to be a Map (either a Scala or a Java one), a JavaBean or a Scala case class. When that is not the case, one can easily transform the data in Spark or plug-in their own custom ValueWriter.

Scalaedit
When using Scala, simply import the org.elasticsearch.spark.streaming package which, through the pimp my library pattern, enriches the DStream API with saveToEs methods:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._               
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._

import org.elasticsearch.spark.streaming._           

...

val conf = ...
val sc = new SparkContext(conf)                      
val ssc = new StreamingContext(sc, Seconds(1))       

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

val rdd = sc.makeRDD(Seq(numbers, airports))
val microbatches = mutable.Queue(rdd)                

ssc.queueStream(microbatches).saveToEs("spark/docs")

ssc.start()
ssc.awaitTermination()

Spark and Spark Streaming Scala imports
elasticsearch-hadoop Spark Streaming imports
start Spark through its Scala API
start SparkStreaming context by passing it the SparkContext. The microbatches will be processed every second.
makeRDD creates an ad-hoc RDD based on the collection specified; any other RDD (in Java or Scala) can be passed in. Create a queue of RDDs to signify the microbatches to perform.

Create a DStream out of the RDDs and index the content (namely the two _documents_ (numbers and airports)) in {es} underspark/docs
Start the spark Streaming Job and wait for it to eventually finish.

As an alternative to the implicit import above, one can use elasticsearch-hadoop Spark Streaming support in Scala through EsSparkStreaming in the org.elasticsearch.spark.streaming package which acts as a utility class allowing explicit method invocations. Additionally instead of Maps (which are convenient but require one mapping per instance due to their difference in structure), use a case class :

import org.apache.spark.SparkContext
import org.elasticsearch.spark.streaming.EsSparkStreaming         

// define a case class
case class Trip(departure: String, arrival: String)               

val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")

val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
val microbatches = mutable.Queue(rdd)                             
val dstream = ssc.queueStream(microbatches)

EsSparkStreaming.saveToEs(dstream, "spark/docs")                  

ssc.start()

EsSparkStreaming import
Define a case class named Trip
Create a DStream around the RDD of Trip instances
Configure the DStream to be indexed explicitly through EsSparkStreaming
Start the streaming process

Important
Once a SparkStreamingContext is started, no new DStreams can be added or configured. Once a context has stopped, it cannot be restarted. There can only be one active SparkStreamingContext at a time per JVM. Also note that when stopping a SparkStreamingContext programmatically, it stops the underlying SparkContext unless instructed not to.

For cases where the id (or other metadata fields like ttl or timestamp) of the document needs to be specified, one can do so by setting the appropriate mapping namely es.mapping.id. Following the previous example, to indicate to Elasticsearch to use the field id as the document id, update the DStream configuration (it is also possible to set the property on the SparkConf though due to its global effect it is discouraged):

EsSparkStreaming.saveToEs(dstream, “spark/docs”, Map(“es.mapping.id” -> “id”))
Javaedit
Java users have a dedicated class that provides a similar functionality to EsSparkStreaming, namely JavaEsSparkStreaming in the package org.elasticsearch.spark.streaming.api.java (a package similar to Spark’s Java API):

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkConf;                                              
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaDStream;

import org.elasticsearch.spark.streaming.api.java.JavaEsSparkStreaming;         
...

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);                              
JavaStreamingContext jssc = new JavaSparkStreamingContext(jsc, Seconds.apply(1));

Map numbers = ImmutableMap.of("one", 1, "two", 2);                   
Map airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");

JavaRDD> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
Queue>> microbatches = new LinkedList<>();
microbatches.add(javaRDD);                                                      
JavaDStream> javaDStream = jssc.queueStream(microbatches);

JavaEsSparkStreaming.saveToEs(javaDStream, "spark/docs");                       

jssc.start()

Spark and Spark Streaming Java imports
elasticsearch-hadoop Java imports
start Spark and Spark Streaming through its Java API. The microbatches will be processed every second.
to simplify the example, use Guava(a dependency of Spark) Immutable* methods for simple Map, List creation
create a simple DStream over the microbatch; any other RDDs (in Java or Scala) can be passed in
index the content (namely the two documents (numbers and airports)) in Elasticsearch under spark/docs
execute the streaming job.

The code can be further simplified by using Java 5 static imports. Additionally, the Map (who’s mapping is dynamic due to its loose structure) can be replaced with a JavaBean:

public class TripBean implements Serializable {
   private String departure, arrival;

   public TripBean(String departure, String arrival) {
       setDeparture(departure);
       setArrival(arrival);
   }

   public TripBean() {}

   public String getDeparture() { return departure; }
   public String getArrival() { return arrival; }
   public void setDeparture(String dep) { departure = dep; }
   public void setArrival(String arr) { arrival = arr; }
}

import static org.elasticsearch.spark.rdd.api.java.JavaEsSparkStreaming;  
...

TripBean upcoming = new TripBean("OTP", "SFO");
TripBean lastWeek = new TripBean("MUC", "OTP");

JavaRDD javaRDD = jsc.parallelize(ImmutableList.of(upcoming, lastWeek));
Queue> microbatches = new LinkedList>();
microbatches.add(javaRDD);
JavaDStream javaDStream = jssc.queueStream(microbatches);       

saveToEs(javaDStream, "spark/docs");                                          

jssc.start()

statically import JavaEsSparkStreaming
define a DStream containing TripBean instances (TripBean is a JavaBean)
call saveToEs method without having to type JavaEsSparkStreaming again
run that Streaming job
Setting the document id (or other metadata fields like ttl or timestamp) is similar to its Scala counterpart, though potentially a bit more verbose depending on whether you are using the JDK classes or some other utilities (like Guava):

JavaEsSparkStreaming.saveToEs(javaDStream, “spark/docs”, ImmutableMap.of(“es.mapping.id”, “id”));
Writing Existing JSON to Elasticsearchedit
For cases where the data being streamed by the DStream is already serialized as JSON, elasticsearch-hadoop allows direct indexing without applying any transformation; the data is taken as is and sent directly to Elasticsearch. As such, in this case, elasticsearch-hadoop expects either a DStream containing String or byte arrays (byte[]/Array[Byte]), assuming each entry represents a JSON document. If the DStream does not have the proper signature, the saveJsonToEs methods cannot be applied (in Scala they will not be available).

Scalaedit

val json1 = """{"reason" : "business", "airport" : "SFO"}"""      
val json2 = """{"participants" : 5, "airport" : "OTP"}"""

val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))

val rdd = sc.makeRDD(Seq(json1, json2))
val microbatch = mutable.Queue(rdd)
ssc.queueStream(microbatch).saveJsonToEs("spark/json-trips")      

ssc.start()

example of an entry within the DStream - the JSON is written as is, without any transformation
configure the stream to index the JSON data through the dedicated saveJsonToEs method
start the streaming job

Javaedit

String json1 = "{\"reason\" : \"business\",\"airport\" : \"SFO\"}";  
String json2 = "{\"participants\" : 5,\"airport\" : \"OTP\"}";

JavaSparkContext jsc = ...
JavaStreamingContext jssc = ...
JavaRDD stringRDD = jsc.parallelize(ImmutableList.of(json1, json2));
Queue> microbatches = new LinkedList>();      
microbatches.add(stringRDD);
JavaDStream stringDStream = jssc.queueStream(microbatches);

JavaEsSparkStreaming.saveJsonToEs(stringRDD, "spark/json-trips");    

jssc.start()

example of an entry within the DStream - the JSON is written as is, without any transformation
creating an RDD, placing it into a queue, and creating a DStream out of the queued RDDs, treating each as a microbatch.
notice the JavaDStream signature
configure stream to index the JSON data through the dedicated saveJsonToEs method
launch stream job

Scalaedit

val game = Map("media_type"->"game","title" -> "FF VI","year" -> "1994")
val book = Map("media_type" -> "book","title" -> "Harry Potter","year" -> "2010")
val cd = Map("media_type" -> "music","title" -> "Surfing With The Alien")

val batch = sc.makeRDD(Seq(game, book, cd))
val microbatches = mutable.Queue(batch)
ssc.queueStream(microbatches).saveToEs("my-collection/{media_type}")  
ssc.start()

For each document/object about to be written, elasticsearch-hadoop will extract the media_type field and use its value to determine the target resource.

Javaedit
As expected, things in Java are strikingly similar:

Map game =
  ImmutableMap.of("media_type", "game", "title", "FF VI", "year", "1994");
Map book = ...
Map cd = ...

JavaRDD> javaRDD =
                jsc.parallelize(ImmutableList.of(game, book, cd));
Queue>> microbatches = ...
JavaDStream> javaDStream =
                jssc.queueStream(microbatches);

saveToEs(javaDStream, "my-collection/{media_type}");  
jssc.start();

Save each object based on its resource pattern, media_type in this example

This is no different in Spark Streaming. For DStreamss containing a key-value tuple, the metadata can be extracted from the key and the value used as the document source.

The metadata is described through the Metadata Java enum within org.elasticsearch.spark.rdd package which identifies its type - id, ttl, version, etc… Thus a DStream’s keys can be a Map containing the Metadata for each document and its associated values. If the DStream key is not of type Map, elasticsearch-hadoop will consider the object as representing the document id and use it accordingly. This sounds more complicated than it is, so let us see some examples.

Scalaedit
Pair DStreamss, or simply put DStreamss with the signature DStream[(K,V)] can take advantage of the saveToEsWithMeta methods that are available either through the implicit import of org.elasticsearch.spark.streaming package or EsSparkStreaming object. To manually specify the id for each document, simply pass in the Object (not of type Map) in your DStream:

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val muc = Map("iata" -> "MUC", "name" -> "Munich")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// instance of SparkContext
val sc = ...
// instance of StreamingContext
val ssc = ...

val airportsRDD = sc.makeRDD(Seq((1, otp), (2, muc), (3, sfo)))  
val microbatches = mutable.Queue(airportsRDD)

ssc.queueStream(microbatches).saveToEsWithMeta("airports/2015")
ssc.start()

airportsRDD is a key-value pair RDD; it is created from a Seq of tuples
The key of each tuple within the Seq represents the id of its associated value/document; in other words, document otp has id 1, muc 2 and sfo 3
We construct a DStream which inherits the type signature of the RDD
Since the resulting DStream is a pair DStream, it has the saveToEsWithMeta method available. This tells elasticsearch-hadoop to pay special attention to the DStream keys and use them as metadata, in this case as document ids. If saveToEs would have been used instead, then elasticsearch-hadoop would consider the DStream tuple, that is both the key and the value, as part of the document.

When more than just the id needs to be specified, one should use a scala.collection.Map with keys of type org.elasticsearch.spark.rdd.Metadata:

import org.elasticsearch.spark.rdd.Metadata._

val otp = Map("iata" -> "OTP", "name" -> "Otopeni")
val muc = Map("iata" -> "MUC", "name" -> "Munich")
val sfo = Map("iata" -> "SFO", "name" -> "San Fran")

// metadata for each document
// note it's not required for them to have the same structure
val otpMeta = Map(ID -> 1, TTL -> "3h")                
val mucMeta = Map(ID -> 2, VERSION -> "23")            
val sfoMeta = Map(ID -> 3)                             

// instance of SparkContext
val sc = ...
// instance of StreamingContext
val ssc = ...

val airportsRDD = sc.makeRDD(Seq((otpMeta, otp), (mucMeta, muc), (sfoMeta, sfo)))
val microbatches = mutable.Queue(airportsRDD)

ssc.queueStream(microbatches).saveToEsWithMeta("airports/2015")
ssc.start()

Import the Metadata enum

The metadata used for otp document. In this case, ID with a value of 1 and TTL with a value of 3h
The metadata used for muc document. In this case, ID with a value of 2 and VERSION with a value of 23
The metadata used for sfo document. In this case, ID with a value of 3
The metadata and the documents are assembled into a pair RDD
The DStream inherits the signature from the RDD, becoming a pair DStream
The DStream is configured to index the data accordingly using the saveToEsWithMeta method

Javaedit
In a similar fashion, on the Java side, JavaEsSparkStreaming provides saveToEsWithMeta methods that are applied to JavaPairDStream (the equivalent in Java of DStream[(K,V)]).

This tends to involve a little more work due to the Java API’s limitations. For instance, you cannot create a JavaPairDStream directly from a queue of JavaPairRDDs. Instead, you must create a regular JavaDStream of Tuple2 objects and convert the JavaDStream into a JavaPairDStream. This sounds complex, but it’s a simple work around for a limitation of the API.

First, we’ll create a pair function, that takes a Tuple2 object in, and returns it right back to the framework:

public static class ExtractTuples implements PairFunction<Tuple2<Object, Object>, Object, Object>, Serializable {
    @Override
    public Tuple2 call(Tuple2 tuple2) throws Exception {
        return tuple2;
    }
}

Then we’ll apply the pair function to a JavaDStream of Tuple2s to create a JavaPairDStream and save it:

import org.elasticsearch.spark.streaming.api.java.JavaEsSparkStreaming;

// data to be saved
Map otp = ImmutableMap.of("iata", "OTP", "name", "Otopeni");
Map jfk = ImmutableMap.of("iata", "JFK", "name", "JFK NYC");

JavaSparkContext jsc = ...
JavaStreamingContext jssc = ...

// create an RDD of between the id and the docs
JavaRDD> rdd = jsc.parallelize(ImmutableList.of(
        new Tuple2(1, otp),          
        new Tuple2(2, jfk)));        

Queue>> microbatches = ...
JavaDStream> dStream = jssc.queueStream(microbatches); 

JavaPairDStream pairDStream = dstream.mapToPair(new ExtractTuples()); 

JavaEsSparkStreaming.saveToEsWithMeta(pairDStream, target);       
jssc.start();

Create a regular JavaRDD of Scala Tuple2s wrapped around the document id and the document itself
Tuple for the first document wrapped around the id (1) and the doc (otp) itself
Tuple for the second document wrapped around the id (2) and jfk
Assemble a regular JavaDStream out of the tuple RDD
Transform the JavaDStream into a JavaPairDStream by passing our Tuple2 identity function to the mapToPair method. This will allow the type to be converted to a JavaPairDStream. This function could be replaced by anything in your job that would extract both the id and the document to be indexed from a single entry.
The JavaPairRDD is configured to index the data accordingly using the keys as a id and the values as documents

When more than just the id needs to be specified, one can choose to use a java.util.Map populated with keys of type org.elasticsearch.spark.rdd.Metadata. We’ll use the same typing trick to repack the JavaDStream as a JavaPairDStream:

import org.elasticsearch.spark.streaming.api.java.JavaEsSparkStreaming;
import org.elasticsearch.spark.rdd.Metadata;          

import static org.elasticsearch.spark.rdd.Metadata.*; 

// data to be saved
Map<String, ?> otp = ImmutableMap.of("iata", "OTP", "name", "Otopeni");
Map<String, ?> sfo = ImmutableMap.of("iata", "SFO", "name", "San Fran");

// metadata for each document
// note it's not required for them to have the same structure
Map otpMeta = ImmutableMap. of(ID, 1, TTL, "1d");
Map sfoMeta = ImmutableMap. of(ID, "2", VERSION, "23");

JavaSparkContext jsc = ...

// create a pair RDD between the id and the docs
JavaRDD> pairRdd = jsc.parallelize<(ImmutableList.of(
        new Tuple2(otpMeta, otp),    
        new Tuple2(sfoMeta, sfo)));  

Queue>> microbatches = ...
JavaDStream> dStream = jssc.queueStream(microbatches); 

JavaPairDStream pairDStream = dstream.mapToPair(new ExtractTuples()) 

JavaEsSparkStreaming.saveToEsWithMeta(pairDStream, target);       
jssc.start();

Metadata enum describing the document metadata that can be declared
static import for the enum to refer to its values in short format (ID, TTL, etc…)
Metadata for otp document
Boiler-plate construct for forcing the of method generic signature
Metadata for sfo document
Tuple between otp (as the value) and its metadata (as the key)
Tuple associating sfo and its metadata
Create a JavaDStream out of the JavaRDD
Repack the JavaDStream into a JavaPairDStream by mapping the Tuple2 identity function over it.

saveToEsWithMeta invoked over the JavaPairDStream containing documents and their respective metadata

Spark Streaming Type Conversionedit
The elasticsearch-hadoop Spark Streaming support leverages the same type mapping as the regular Spark type mapping. The mappings are repeated here for consistency:

Table 7. Scala Types Conversion Table
Scala type —- Elasticsearch type
None —- null
Unit —- null
Nil —- empty array
Some[T] —- T according to the table
Map —- object
Traversable —- array
case class —- object (see Map)
Product —- array

in addition, the following implied conversion applies for Java types:

Table 8. Java Types Conversion Table
Java type —- Elasticsearch type
null —-null
String —- string
Boolean —- boolean
Byte —- byte
Short —- short
Integer —- int
Long —- long
Double —- double
Float —- float
Number —- float or double (depending on size)
java.util.Calendar —- date (string format)
java.util.Date —- date (string format)
java.util.Timestamp —- date (string format
byte[] —- string (BASE64)
Object[] —- array
Iterable —- array
Map —- object
Java Bean —- object (see Map)

Geo types. It is worth re-mentioning that rich data types available only in Elasticsearch, such as GeoPoint or GeoShape are supported by converting their structure into the primitives available in the table above. For example, based on its storage a geo_point might be returned as a String or a Traversable.

Spark SQL supportedit
Note
Added in 2.1.

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.

-- Spark website

On top of the core Spark support, elasticsearch-hadoop also provides integration with Spark SQL. In other words, Elasticsearch becomes a native source for Spark SQL so that data can be indexed and queried from Spark SQL transparently.

Important
Spark SQL works with structured data - in other words, all entries are expected to have the same structure (same number of fields, of the same type and name). Using unstructured data (documents with different structures) is not supported and will cause problems. For such cases, use PairRDDs.

Supported Spark SQL versionsedit
Spark SQL while becoming a mature component, is still going through significant changes between releases. Spark SQL became a stable component in version 1.3, however it is not backwards compatible with the previous releases. Further more Spark 2.0 introduced significant changed which broke backwards compatibility, through the Dataset API. elasticsearch-hadoop supports both version Spark SQL 1.3-1.6 and Spark SQL 2.0 through two different jars: elasticsearch-spark-1.x-.jar and elasticsearch-hadoop-.jar support Spark SQL 1.3-1.6 (or higher) while elasticsearch-spark-2.0-.jar supports Spark SQL 2.0. In other words, unless you are using Spark 2.0, use elasticsearch-spark-1.x-.jar

Spark SQL support is available under org.elasticsearch.spark.sql package.

API differences. From the elasticsearch-hadoop user perspectives, the differences between Spark SQL 1.3-1.6 and Spark 2.0 are fairly consolidated. This document describes at length the differences which are briefly mentioned below:

DataFrame vs Dataset
The core unit of Spark SQL in 1.3+ is a DataFrame. This API remains in Spark 2.0 however underneath it is based on a Dataset
Unified API vs dedicated Java/Scala APIs
In Spark SQL 2.0, the APIs are further unified by introducing SparkSession and by using the same backing code for both Datasets, DataFrames and RDDs.
As conceptually, a DataFrame is a Dataset[Row], the documentation below will focus on Spark SQL 1.3-1.6.

Writing DataFrame (Spark SQL 1.3+) to Elasticsearchedit
With elasticsearch-hadoop, DataFrames (or any Dataset for that matter) can be indexed to Elasticsearch.

Scalaedit
In Scala, simply import org.elasticsearch.spark.sql package which enriches the given DataFrame class with saveToEs methods; while these have the same signature as the org.elasticsearch.spark package, they are designed for DataFrame implementations:

// reusing the example from Spark SQL documentation

import org.apache.spark.sql.SQLContext    
import org.apache.spark.sql.SQLContext._

import org.elasticsearch.spark.sql._      

...

// sc = existing SparkContext
val sqlContext = new SQLContext(sc)

// case class used to define the DataFrame
case class Person(name: String, surname: String, age: Int)

//  create DataFrame
val people = sc.textFile("people.txt")    
        .map(_.split(","))
        .map(p => Person(p(0), p(1), p(2).trim.toInt))
        .toDF()

people.saveToEs("spark/people")

Spark SQL package import
elasticsearch-hadoop Spark package import
Read a text file as normal RDD and map it to a DataFrame (using the Person case class)
Index the resulting DataFrame to Elasticsearch through the saveToEs method

Note
By default, elasticsearch-hadoop will ignore null values in favor of not writing any field at all. Since a DataFrame is meant to be treated as structured tabular data, you can enable writing nulls as null valued fields for DataFrame Objects only by toggling the es.spark.dataframe.write.null setting to true.

Javaedit
In a similar fashion, for Java usage the dedicated package org.elasticsearch.spark.sql.api.java provides similar functionality through the JavaEsSpark SQL :

import org.apache.spark.sql.api.java.*;                      
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;  
...

DataFrame people = ...
JavaEsSparkSQL.saveToEs(people, "spark/people");

Spark SQL Java imports
elasticsearch-hadoop Spark SQL Java imports
index the DataFrame in Elasticsearch under spark/people
Again, with Java 5 static imports this can be further simplied to:

import static org.elasticsearch.spark.sql.api.java.JavaEsSpark SQL; 
...
saveToEs("spark/people");

statically import JavaEsSpark SQL
call saveToEs method without having to type JavaEsSpark again

Important
For maximum control over the mapping of your DataFrame in Elasticsearch, it is highly recommended to create the mapping before hand. See this chapter for more information.

Writing existing JSON to Elasticsearchedit
When using Spark SQL, if the input data is in JSON format, simply convert it to a DataFrame (in Spark SQL 1.3) or a Dataset (for Spark SQL 2.0) (as described in Spark documentation) through SQLContext/JavaSQLContext jsonFile methods.

Using pure SQL to read from Elasticsearchedit
Important
The index and its mapping, have to exist prior to creating the temporary table

Spark SQL 1.2 introduced a new API for reading from external data sources, which is supported by elasticsearch-hadoop simplifying the SQL configured needed for interacting with Elasticsearch. Further more, behind the scenes it understands the operations executed by Spark and thus can optimize the data and queries made (such as filtering or pruning), improving performance.

Data Sources in Spark SQLedit
When using Spark SQL, elasticsearch-hadoop allows access to Elasticsearch through SQLContext load method. In other words, to create a DataFrame/Dataset backed by Elasticsearch in a declarative manner:

val sql = new SQLContext...
// Spark 1.3 style
val df = sql.load("spark/index", "org.elasticsearch.spark.sql")

SQLContext experimental load method for arbitrary data sources
path or resource to load - in this case the index/type in Elasticsearch
the data source provider - org.elasticsearch.spark.sql
In Spark 1.4, one would use the following similar API calls:

// Spark 1.4 style
val df = sql.read.format("org.elasticsearch.spark.sql").load("spark/index")

SQLContext experimental read method for arbitrary data sources
the data source provider - org.elasticsearch.spark.sql
path or resource to load - in this case the index/type in Elasticsearch

In Spark 1.5, this can be further simplified to:

// Spark 1.5 style
val df = sql.read.format("es").load("spark/index")

Use es as an alias instead of the full package name for the DataSource provider
Whatever API is used, once created, the DataFrame can be accessed freely to manipulate the data.
The sources declaration also allows specific options to be passed in, namely:
Name —- Default value—- Description
path —- required —- Elasticsearch index/type
pushdown —- true —- Whether to translate (push-down) Spark SQL into Elasticsearch Query DSL
strict —- false —- Whether to use exact (not analyzed) matching or not (analyzed)
Usable in Spark 1.6 or higher —- double.filtering —-true

Whether to tell Spark apply its own filtering on the filters pushed down

Both options are explained in the next section. To specify the options (including the generic elasticsearch-hadoop ones), one simply passes a Map to the aforementioned methods:

For example:

val sql = new SQLContext...
// options for Spark 1.3 need to include the target path/resource
val options13 = Map("path" -> "spark/index",
                    "pushdown" -> "true",
                    "es.nodes" -> "someNode", "es.port" -> "9200")

// Spark 1.3 style
val spark13DF = sql.load("org.elasticsearch.spark.sql", options13)

// options for Spark 1.4 - the path/resource is specified separately
val options = Map("pushdown" -> "true", "es.nodes" -> "someNode", "es.port" -> "9200")

// Spark 1.4 style
val spark14DF = sql.read.format("org.elasticsearch.spark.sql")
                        .options(options).load("spark/index")

pushdown option - specific to Spark data sources
es.nodes configuration option
pass the options when definition/loading the source

sqlContext.sql(
   "CREATE TEMPORARY TABLE myIndex    " + 
   "USING org.elasticsearch.spark.sql " + 
   "OPTIONS ( resource 'spark/index', nodes 'spark/index')" ) "

Do note that due to the SQL parser, the . (among other common characters used for delimiting) is not allowed; the connector tries to work around it by append the es. prefix automatically however this works only for specifying the configuration options with only one . (like es.nodes above). Because of this, if properties with multiple . are needed, one should use the SQLContext.load or SQLContext.read methods above and pass the properties as a Map.

Push-Down operationsedit
An important hidden feature of using elasticsearch-hadoop as a Spark source is that the connector understand the operations performed within the DataFrame/SQL and, by default, will translate them into the appropriate QueryDSL. In other words, the connector pushes down the operations directly at the source, where the data is efficiently filtered out so that only the required data is streamed back to Spark. This significantly increases the queries performance and minimizes the CPU, memory and I/O on both Spark and Elasticsearch clusters as only the needed data is returned (as oppose to returning the data in bulk only to be processed and discarded by Spark). Note the push down operations apply even when one specifies a query - the connector will enhance it according to the specified SQL.

As a side note, elasticsearch-hadoop supports all the Filters available in Spark (1.3.0 and higher) while retaining backwards binary-compatibility with Spark 1.3.0, pushing down to full extent the SQL operations to Elasticsearch without any user interference.

To wit, consider the following Spark SQL:

// as a DataFrame
val df = sqlContext.read().format(“org.elasticsearch.spark.sql”).load(“spark/trips”)

df.printSchema()
// root
//|– departure: string (nullable = true)
//|– arrival: string (nullable = true)
//|– days: long (nullable = true)

val filter = df.filter(df(“arrival”).equalTo(“OTP”).and(df(“days”).gt(3))
or in pure SQL:

CREATE TEMPORARY TABLE trips USING org.elasticsearch.spark.sql OPTIONS (path “spark/trips”)
SELECT departure FROM trips WHERE arrival = “OTP” and days > 3
The connector translates the query into:

{
“query” : {
“filtered” : {
“query” : {
“match_all” : {}

  },
  "filter" : {
    "and" : [{
        "query" : {
          "match" : {
            "arrival" : "OTP"
          }
        }
      }, {
        "days" : {
          "gt" : 3
        }
      }
    ]
  }
}

}
}
Further more, the pushdown filters can work on analyzed terms (the default) or can be configured to be strict and provide exact matches (work only on not-analyzed fields). Unless one manually specifies the mapping, it is highly recommended to leave the defaults as they are. This and other topics are discussed at length in the Elasticsearch Reference Documentation.

Note that double.filtering, available since elasticsearch-hadoop 2.2 for Spark 1.6 or higher, allows filters that are already pushed down to Elasticsearch to be processed/evaluated by Spark as well (default) or not. Turning this feature off, especially when dealing with large data sizes speed things up. However one should pay attention to the semantics as turning this off, might return different results (depending on how the data is indexed, analyzed vs not_analyzed). In general, when turning strict on, one can disable double.filtering as well.

Data Sources as tablesedit
Available since Spark SQL 1.2, one can also access a data source by declaring it as a Spark temporary table (backed by elasticsearch-hadoop):

sqlContext.sql(
   "CREATE TEMPORARY TABLE myIndex    " + 
   "USING org.elasticsearch.spark.sql " + 
   "OPTIONS (resource 'spark/index', scroll_size '20')" )

Spark’s temporary table name
USING clause identifying the data source provider, in this case org.elasticsearch.spark.sql
elasticsearch-hadoop configuration options, the mandatory one being resource. One can use the es prefix or skip it for convenience.
Since using . can cause syntax exceptions, one should replace it instead with _ style. Thus, in this example es.scroll.size becomes scroll_size (as the leading es can be removed). Do note this only works in Spark 1.3 as the Spark 1.4 has a stricter parser. See the chapter above for more information.

Once defined, the schema is picked up automatically. So one can issue queries, right away:
val all = sqlContext.sql(“SELECT * FROM myIndex WHERE id <= 10”)
As elasticsearch-hadoop is aware of the queries being made, it can optimize the requests done to Elasticsearch. For example, given the following query:

val names = sqlContext.sql(“SELECT name FROM myIndex WHERE id >=1 AND id <= 10”)
it knows only the name and id fields are required (the first to be returned to the user, the second for Spark’s internal filtering) and thus will ask only for this data, making the queries quite efficient.

Reading DataFrames (Spark SQL 1.3) from Elasticsearchedit
As you might have guessed, one can define a DataFrame backed by Elasticsearch documents. Or even better, have them backed by a query result, effectively creating dynamic, real-time views over your data.

Scalaedit
Through the org.elasticsearch.spark.sql package, esDF methods are available on the SQLContext API:

import org.apache.spark.sql.SQLContext        

import org.elasticsearch.spark.sql._          
...

val sql = new SQLContext(sc)

val people = sql.esDF("spark/people")         

// check the associated schema
println(people.schema.treeString)             
// root
//  |-- name: string (nullable = true)
//  |-- surname: string (nullable = true)
//  |-- age: long (nullable = true)

Spark SQL Scala imports
elasticsearch-hadoop SQL Scala imports
create a DataFrame backed by the spark/people index in Elasticsearch
the DataFrame associated schema discovered from Elasticsearch
notice how the age field was transformed into a Long when using the default Elasticsearch mapping as discussed in the Mapping and Types chapter.

And just as with the Spark core support, additional parameters can be specified such as a query. This is quite a powerful concept as one can filter the data at the source (Elasticsearch) and use Spark only on the results:

// get only the Smiths
val smiths = sqlContext.esDF("spark/people","?q=Smith" )

Elasticsearch query whose results comprise the DataFrame

Controlling the DataFrame schema. In some cases, especially when the index in Elasticsearch contains a lot of fields, it is desireable to create a DataFrame that contains only a subset of them. While one can modify the DataFrame (by working on its backing RDD) through the official Spark API or through dedicated queries, elasticsearch-hadoop allows the user to specify what fields to include and exclude from Elasticsearch when creating the DataFrame.

Through es.read.field.include and es.read.field.exclude properties, one can indicate what fields to include or exclude from the index mapping. The syntax is similar to that of Elasticsearch include/exclude. Multiple values can be specified by using a comma. By default, no value is specified meaning all properties/fields are included and no properties/fields are excluded.

For example:

include

es.read.field.include = name, address.

exclude

es.read.field.exclude = *.created
Important
Due to the way SparkSQL works with a DataFrame schema, elasticsearch-hadoop needs to be aware of what fields are returned from Elasticsearch before executing the actual queries. While one can restrict the fields manually through the underlying Elasticsearch query, elasticsearch-hadoop is unaware of this and the results are likely to be different or worse, errors will occur. Use the properties above instead, which Elasticsearch will properly use alongside the user query.

Javaedit
For Java users, a dedicated API exists through JavaEsSpark SQL. It is strikingly similar to EsSpark SQL however it allows configuration options to be passed in through Java collections instead of Scala ones; other than that using the two is exactly the same:

import org.apache.spark.sql.api.java.JavaSQLContext;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;
…
SQLContext sql = new SQLContext(sc);

DataFrame people = JavaEsSparkSQL.esDF(sql, “spark/people”);

Spark SQL import

elasticsearch-hadoop import

create a Java DataFrame backed by an Elasticsearch index

Better yet, the DataFrame can be backed by a query result:

DataFrame people = JavaEsSparkSQL.esDF(sql, “spark/people”, “?q=Smith” );

Elasticsearch query backing the elasticsearch-hadoop DataFrame

Spark SQL Type conversionedit
Important
When dealing with multi-value/array fields, please see this section and in particular these configuration options. IMPORTANT: If automatic index creation is used, please review this section for more information.

elasticsearch-hadoop automatically converts Spark built-in types to Elasticsearch types (and back) as shown in the table below:

While Spark SQL DataTypes have an equivalent in both Scala and Java and thus the RDD conversion can apply, there are slightly different semantics - in particular with the java.sql types due to the way Spark SQL handles them:

Table 9. Spark SQL 1.3+ Conversion Table
Spark SQL DataType —- Elasticsearch type
null —- null
ByteType —- byte
ShortType —- short
IntegerType —- int
LongType —- long
FloatType —- float
DoubleType —- double
StringType —- string
BinaryType —- string (BASE64)
BooleanType —- boolean
DateType —- date (string format)
TimestampType —- long (unix time)
ArrayType —- array
MapType —- object
StructType —- object

Geo Types Conversion Table. In addition to the table above, for Spark SQL 1.3 or higher, elasticsearch-hadoop performs automatic schema detection for geo types, namely Elasticsearch geo_point and geo_shape. Since each type allows multiple formats (geo_point accepts latitude and longitude to be specified in 4 different ways, while geo_shape allows a variety of types (currently 9)) and the mapping does not provide such information, elasticsearch-hadoop will sample the determined geo fields at startup and retrieve an arbitrary document that contains all the relevant fields; it will parse it and thus determine the necessary schema (so for example it can tell whether a geo_point is specified as a StringType or as an ArrayType).

Important
Since Spark SQL is strongly-typed, each geo field needs to have the same format across all documents. Shy of that, the returned data will not fit the detected schema and thus lead to errors.

Spark Structured Streaming supportedit
Note
Added in 6.0.

Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.

-- Spark documentation

Released as an experimental feature in Spark 2.0, Spark Structured Streaming provides a unified streaming and batch interface built into the Spark SQL integration. As of elasticsearch-hadoop 6.0, we provide native functionality to index streaming data into Elasticsearch.

Important
Like Spark SQL, Structured Streaming works with structured data. All entries are expected to have the same structure (same number of fields, of the same type and name). Using unstructured data (documents with different structures) is not supported and will cause problems. For such cases, use DStreams.

Supported Spark Structured Streaming versionsedit
Spark Structured Streaming is considered generally available as of Spark v2.2.0. As such, elasticsearch-hadoop support for Structured Streaming (available in elasticsearch-hadoop 6.0+) is only compatible with Spark versions 2.2.0 and onward. Similar to Spark SQL before it, Structured Streaming may be subject to significant changes between releases before its interfaces are considered stable.

Spark Structured Streaming support is available under the org.elasticsearch.spark.sql and org.elasticsearch.spark.sql.streaming packages. It shares a unified interface with Spark SQL in the form of the Dataset[_] api. Clients can interact with streaming Datasets in almost exactly the same way as regular batch Datasets with only a few exceptions.

Writing Streaming Datasets (Spark SQL 2.0+) to Elasticsearchedit
With elasticsearch-hadoop, Stream-backed Datasets can be indexed to Elasticsearch.

Scalaedit
In Scala, to save your streaming based Datasets and DataFrames to Elasticsearch, simply configure the stream to write out using the “es” format, like so:

import org.apache.spark.sql.SparkSession    

...

val spark = SparkSession.builder()
   .appName("EsStreamingExample")           
   .getOrCreate()

// case class used to define the DataFrame
case class Person(name: String, surname: String, age: Int)

//  create DataFrame
val people = spark.readStream                   
        .textFile("/path/to/people/files/*")    
        .map(_.split(","))
        .map(p => Person(p(0), p(1), p(2).trim.toInt))

people.writeStream
      .option("checkpointLocation", "/save/location")      
      .format("es")
      .start("spark/people")

Spark SQL import
Create SparkSession
Instead of calling read, call readStream to get instance of DataStreamReader
Read a directory of text files continuously and convert them into Person objects
Provide a location to save the offsets and commit logs for the streaming query
Start the stream using the “es” format to index the contents of the Dataset continuously to Elasticsearch

Warning
Spark makes no type-based differentiation between batch and streaming based Datasets. While you may be able to import the org.elasticsearch.spark.sql package to add saveToEs methods to your Dataset or DataFrame, it will throw an illegal argument exception if those methods are called on streaming based Datasets or DataFrames.

Javaedit
In a similar fashion, the “es” format is available for Java usage as well:

import org.apache.spark.sql.SparkSession    

...

SparkSession spark = SparkSession
  .builder()
  .appName("JavaStructuredNetworkWordCount")     
  .getOrCreate();

// java bean style class
public static class PersonBean {
  private String name;
  private String surname;                        
  private int age;

  ...
}

Dataset people = spark.readStream()         
        .textFile("/path/to/people/files/*")
        .map(new MapFunction() {
            @Override
            public PersonBean call(String value) throws Exception {
                return someFunctionThatParsesStringToJavaBeans(value.split(","));         
            }
        }, Encoders.bean(PersonBean.class));

people.writeStream()
    .option("checkpointLocation", "/save/location")       
    .format("es")
    .start("spark/people");

Spark SQL Java imports. Can use the same session class as Scala
Create SparkSession. Can also use the legacy SQLContext api
We create a java bean class to be used as our data format
Use the readStream() method to get a DataStreamReader to begin building our stream
Convert our string data into our PersonBean
Set a place to save the state of our stream
Using the “es” format, we continuously index the Dataset in Elasticsearch under spark/people

Writing existing JSON to Elasticsearchedit
When using Spark SQL, if the input data is in JSON format, simply convert it to a Dataset (for Spark SQL 2.0) (as described in Spark documentation) through the DataStreamReader’s json format.

Sink commit log in Spark Structured Streamingedit
Spark Structured Streaming advertises an end-to-end fault-tolerant exactly-once processing model that is made possible through the usage of offset checkpoints and maintaining commit logs for each streaming query. When executing a streaming query, most sources and sinks require you to specify a “checkpointLocation” in order to persist the state of your job. In the event of an interruption, launching a new streaming query with the same checkpoint location will recover the state of the

Each file in the commit log directory corresponds to a batch id that has been committed. The log implementation periodically compacts the logs down to avoid clutter. You can set the location for the log directory a number of ways:

Set the explicit log location with es.spark.sql.streaming.sink.log.path (see below).
If that is not set, then the path specified by checkpointLocation will be used.
If that is not set, then a path will be constructed by combining the value of spark.sql.streaming.checkpointLocation from the SparkSession with the Dataset’s given query name.
If no query name is present, then a random UUID will be used in the above case instead of the query name
If none of the above settings are provided then the start call will throw an exception
Here is a list of configurations that affect the behavior of Elasticsearch’s commit log:

es.spark.sql.streaming.sink.log.enabled (default true)
Enables or disables the commit log for a streaming job. By default, the log is enabled, and output batches with the same batch id will be skipped to avoid double-writes. When this is set to false, the commit log is disabled, and all outputs will be sent to Elasticsearch, regardless if they have been sent in a previous execution.
es.spark.sql.streaming.sink.log.path
Sets the location to store the log data for this streaming query. If this value is not set, then the Elasticsearch sink will store its commit logs under the path given in checkpointLocation. Any HDFS Client compatible URI is acceptable.
es.spark.sql.streaming.sink.log.cleanupDelay (default 10m)
The commit log is managed through Spark’s HDFS Client. Some HDFS compatible filesystems (like Amazon’s S3) propagate file changes in an asynchronous manner. To get around this, after a set of log files have been compacted, the client will wait for this amount of time before cleaning up the old files.
es.spark.sql.streaming.sink.log.deletion (default true)
Determines if the log should delete old logs that are no longer needed. After every batch is committed, the client will check to see if there are any commit logs that have been compacted and are safe to be removed. If set to false, the log will skip this cleanup step, leaving behind a commit file for each batch.
es.spark.sql.streaming.sink.log.compactInterval (default 10)
Sets the number of batches to process before compacting the log files. By default, every 10 batches the commit log will be compacted down into a single file that contains all previously committed batch ids.
Spark Structured Streaming Type conversionedit
Structured Streaming uses the exact same type conversion rules as the Spark SQL integration.

Important
When dealing with multi-value/array fields, please see this section and in particular these configuration options.

Important
If automatic index creation is used, please review this section for more information.

elasticsearch-hadoop automatically converts Spark built-in types to Elasticsearch types as shown in the table below:

Table 10. Spark SQL 1.3+ Conversion Table
Spark SQL DataType —- Elasticsearch type
null —- null
ByteType —- byte
ShortType —- short
IntegerType —- int
LongType —- long
FloatType —- float
DoubleType —- double
StringType —- string
BinaryType —- string (BASE64)
BooleanType —- boolean
DateType —- date (string format)
TimestampType —- long (unix time)
ArrayType —- array
MapType —- object
StructType —- object

Using the Map/Reduce layeredit
Another way of using Spark with Elasticsearch is through the Map/Reduce layer, that is by leveraging the dedicated Input/OuputFormat in elasticsearch-hadoop. However, unless one is stuck on elasticsearch-hadoop 2.0, we strongly recommend using the native integration as it offers significantly better performance and flexibility.

Configurationedit
Through elasticsearch-hadoop, Spark can integrate with Elasticsearch through its dedicated InputFormat, and in case of writing, through OutputFormat. These are described at length in the Map/Reduce chapter so please refer to that for an in-depth explanation.

In short, one needs to setup a basic Hadoop Configuration object with the target Elasticsearch cluster and index, potentially a query, and she’s good to go.

From Spark’s perspective, the only thing required is setting up serialization - Spark relies by default on Java serialization which is convenient but fairly inefficient. This is the reason why Hadoop itself introduced its own serialization mechanism and its own types - namely Writables. As such, InputFormat and OutputFormats are required to return Writables which, out of the box, Spark does not understand. The good news is, one can easily enable a different serialization (Kryo) which handles the conversion automatically and also does this quite efficiently.

SparkConf sc = new SparkConf(); //.setMaster("local");
sc.set("spark.serializer", KryoSerializer.class.getName());

// needed only when using the Java API
JavaSparkContext jsc = new JavaSparkContext(sc);

Enable the Kryo serialization support with Spark

Or if you prefer Scala

val sc = new SparkConf(...)
sc.set("spark.serializer", classOf[KryoSerializer].getName)

Enable the Kryo serialization support with Spark

Note that the Kryo serialization is used as a work-around for dealing with Writable types; one can choose to convert the types directly (from Writable to Serializable types) - which is fine however for getting started, the one liner above seems to be the most effective.

Reading data from Elasticsearchedit
To read data, simply pass in the org.elasticsearch.hadoop.mr.EsInputFormat class - since it supports both the old and the new Map/Reduce APIs, you are free to use either method on SparkContext’s, hadoopRDD (which we recommend for conciseness reasons) or newAPIHadoopRDD. Which ever you chose, stick with it to avoid confusion and problems down the road.

Old (org.apache.hadoop.mapred) APIedit

JobConf conf = new JobConf();                             
conf.set("es.resource", "radio/artists");                 
conf.set("es.query", "?q=me*");                           

JavaPairRDD esRDD = jsc.hadoopRDD(conf, EsInputFormat.class,
                          Text.class, MapWritable.class); 
long docCount = esRDD.count();

Create the Hadoop object (use the old API)
Configure the source (index)
Setup the query (optional)
Create a Spark RDD on top of Elasticsearch through EsInputFormat - the key represents the doc id, the value the doc itself

The Scala version is below:

val conf = new JobConf()                                   
conf.set("es.resource", "radio/artists")                   
conf.set("es.query", "?q=me*")                             
val esRDD = sc.hadoopRDD(conf,
                classOf[EsInputFormat[Text, MapWritable]], 
                classOf[Text], classOf[MapWritable]))
val docCount = esRDD.count();

Create the Hadoop object (use the old API)
Configure the source (index)
Setup the query (optional)
Create a Spark RDD on top of Elasticsearch through EsInputFormat

New (org.apache.hadoop.mapreduce) APIedit
As expected, the mapreduce API version is strikingly similar - replace hadoopRDD with newAPIHadoopRDD and JobConf with Configuration. That’s about it.

Configuration conf = new Configuration();       
conf.set("es.resource", "radio/artists");       
conf.set("es.query", "?q=me*");                 

JavaPairRDD esRDD = jsc.newAPIHadoopRDD(conf, EsInputFormat.class,
                Text.class, MapWritable.class); 
long docCount = esRDD.count();

Create the Hadoop object (use the new API)
Configure the source (index)
Setup the query (optional)
Create a Spark RDD on top of Elasticsearch through EsInputFormat - the key represent the doc id, the value the doc itself

The Scala version is below:

val conf = new Configuration()                             
conf.set("es.resource", "radio/artists")                   
conf.set("es.query", "?q=me*")                             
val esRDD = sc.newAPIHadoopRDD(conf,
                classOf[EsInputFormat[Text, MapWritable]], 
                classOf[Text], classOf[MapWritable]))
val docCount = esRDD.count();

Create the Hadoop object (use the new API)
Configure the source (index)
Setup the query (optional)
Create a Spark RDD on top of Elasticsearch through EsInputFormat

Using the connector from PySparkedit
Thanks to its Map/Reduce layer, elasticsearch-hadoop can be used from PySpark as well to both read and write data to Elasticsearch. To wit, below is a snippet from the Spark documentation (make sure to switch to the Python snippet):

$ ./bin/pyspark –driver-class-path=/path/to/elasticsearch-hadoop.jar

conf = {“es.resource” : “index/type”} # assume Elasticsearch is running on localhost defaults
rdd = sc.newAPIHadoopRDD(“org.elasticsearch.hadoop.mr.EsInputFormat”,\
“org.apache.hadoop.io.NullWritable”, “org.elasticsearch.hadoop.mr.LinkedMapWritable”, conf=conf)
rdd.first() # the result is a MapWritable that is converted to a Python dict
(u’Elasticsearch ID’,
{u’field1’: True,
u’field2’: u’Some Text’,
u’field3’: 12345})
Also, the SQL loader can be used as well:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format("org.elasticsearch.spark.sql").load("index/type")
df.printSchema()

你可能感兴趣的:(大数据相关)

说说自己Python 代码优化实践 chilavert318 大数据 linux 运维 python
今年上半年在外省做一个大数据相关的项目，在review项目组成员的代码时，发现一段处理大数据集的模块存在明显性能瓶颈：10万条数据的清洗流程耗时近20分钟，CPU占用率却始终在30%以下。深入分析后发现，看似简洁的Python代码背后，隐藏着诸多可以优化的细节——这并非个例，我们的程序在追求代码可读性时，往往忽略了Python特有的性能陷阱。今天抽点时间，从我实践中的代码就python开发，从内存
【腾讯云】考个证...大数据开发工程师认证 runzhliu 腾讯云
作为一个大数据行业的从业者，考个腾讯云大数据开发工程师认证总比考个消防证easy吧…？关于考这个认证的意义其实主要在于全面复习一下大数据相关的知识点，另外有个腾讯云的认证，也许大概也会对你找工作有点帮助的吧？下面是报名的链接和考试大纲。https://cloud.tencent.com/edu/training/cert/detail?type=Big_Data既然是考试，大家肯定会比较关心考试资
Kafka 4.0版本的推出：数据处理新纪元的破晓之光 chilavert318 熬之滴水穿石 kafka 分布式大数据
之前做大数据相关项目，在项目中都使用过kafka。在数字化时代，数据如洪流般涌来，如何高效处理这些数据成为关键。Kafka就像是一条“智能数据管道”，在数据的世界里扮演着至关重要的角色。如果你第一次接触它，不妨把它想象成一个超级“数据快递员”，能快速、可靠地将数据从一个地方送到另一个地方，不管是电商平台的用户点击数据，还是社交平台的动态信息，它都能处理。以下是我对kafka在数据处理上的认知一、k
LeetCode题目54:螺旋矩阵【python4种算法实现】数据分析螺丝钉 LeetCode刷题与模拟面试 leetcode python 数据结构算法经验分享
作者介绍：10年大厂数据\经营分析经验，现任大厂数据部门负责人。会一些的技术：数据分析、算法、SQL、大数据相关、python欢迎加入社区：码上找工作作者专栏每日更新：LeetCode解锁1000题:打怪升级之旅python数据分析可视化：企业实战案例题目描述给定一个包含mxn个元素的矩阵（m行,n列），请按照顺时针螺旋顺序，返回矩阵中的所有元素。输入格式matrix：一个二维整数数组，表示输入的
python大数据相关职位，还需要学习java哪些知识不辉放弃 python java
一、核心需要掌握的Java知识1.Java基础语法语法基础：变量、数据类型、流程控制、异常处理（对比Python的差异）。面向对象编程（OOP）：类、继承、多态、接口（Java的OOP比Python更严格）。集合框架：List,Map,Set等（大数据处理中高频使用）。IO操作：文件读写、流处理（如BufferedReader,InputStream）。2.并发与多线程线程创建：Runnable,
如何根据个人现状确定职业方向转型大数据 xiaokaiabcde 大数据大数据开发转型大数据大数据职业规划大数据学习
本文章目录如下：一、大数据相关职位介绍（数据来源于拉钩、智联）（一）大数据相关职位列举（二）每个相关职位的岗位职责与要求二、非程序员转型大数据职位推荐与SWOT分析（一）金融财会，统计，其他商科转型大数据。（二）非科班理工科转型大数据（三）除了第1条以外的文科专业同学转型大数据。三、程序员转型大数据职位推荐与SWOT分析（一）Java后端/JavaWeb程序员转型大数据。（二）Python程序员转
LeetCode 题目 49：字母异位词分组 5种算法实现与典型应用案例【python】数据分析螺丝钉 LeetCode刷题与模拟面试算法 leetcode python 数据结构职场和发展
作者介绍：10年大厂数据\经营分析经验，现任大厂数据部门负责人。会一些的技术：数据分析、算法、SQL、大数据相关、python欢迎加入社区：码上找工作作者专栏每日更新：LeetCode解锁1000题:打怪升级之旅python数据分析可视化：企业实战案例备注说明：方便大家阅读，统一使用python，带必要注释，公众号数据分析螺丝钉一起打怪升级题目描述首先，字母异位词是指由相同字母以不同顺序组成的单词
数据挖掘与数据分析两者的区别中琛源科技
随着大数据爆发式增长，市场上对大数据相关人才的需求与日俱增，导致大数据行业人才需求紧缺，引发了关于大数据的学习浪潮，在这个过程中，人们也会不时将数据分析与数据挖掘的关系混淆，什么是数据挖掘?与数据分析有什么联系吗?又或者说数据挖掘与数据分析有什么区别呢?让我们带着这些问题，一起往下解惑吧。数据分析简单的说，就是对数据进行分析，比较专业的说法是，数据分析是指用适当的统计分析方法对收集来的大量数据进行
大数据相关开源项目汇总万里浮云大数据
调度与管理服务Azkaban是一款基于Java编写的任务调度系统任务调度，来自LinkedIn公司，用于管理他们的Hadoop批处理工作流。Azkaban根据工作的依赖性进行排序，提供友好的Web用户界面来维护和跟踪用户的工作流程。YARN是一种新的Hadoop资源管理器，它是一个通用资源管理系统，可为上层应用提供统一的资源管理和调度，解决了旧MapReduce框架的性能瓶颈。它的基本思想是把资源
大数据相关职位介绍之三（数据挖掘，数据安全，数据合规师，首席数据官，数据科学家）小Tomkk 大数据大数据数据挖掘首席数据官数据合规师数据安全数据科学家
大数据相关职位介绍之三（数据挖掘，数据安全，数据合规师，首席数据官，数据科学家）文章目录大数据相关职位介绍之三（数据挖掘，数据安全，数据合规师，首席数据官，数据科学家）1.数据挖掘工程师（DataMiningEngineer）2.数据安全工程师（DataSecurityEngineer）3.数据合规师（DataComplianceOfficer）4.首席数据官（CDO-ChiefDataOffic
大数据相关职位介绍之二（数据治理，数据库管理员，数据资产管理师，数据质量专员）小Tomkk 大数据大数据数据治理数据库管理员数据资产管理师数据质量专员
大数据相关职位介绍之二（数据治理，数据库管理员，数据资产管理师，数据质量专员）文章目录大数据相关职位介绍之二（数据治理，数据库管理员，数据资产管理师，数据质量专员）数据治理工程师/专家（DataGovernanceEngineer/Expert）1.元数据管理师（MetadataManager）2.主数据管理师（MasterDataManager）数据库管理员（DBA-DatabaseAdmini
Lambda架构 leveretz 大数据 lambda
原文地址：https://www.cnblogs.com/xiaodf/p/11642555.html首先我们来看一个典型的互联网大数据平台的架构，如下图所示：在这张架构图中，大数据平台里面向用户的在线业务处理组件用褐色标示出来，这部分是属于互联网在线应用的部分，其他蓝色的部分属于大数据相关组件，使用开源大数据产品或者自己开发相关大数据组件。你可以看到，大数据平台由上到下，可分为三个部分：数据采集
有史以来最全的异常类讲解没有之一！第二部分爆肝2万字，终于把Python的异常类写完了！最全Python异常类合集和案例演示，第二部分长风清留扬最新Python入门基础合集 python 笔记学习异常处理改行学it 异常 BUG
本文是第二部分，第一部分请看：有史以来最全的异常类讲解没有之一！爆肝3万字，终于把Python的异常类写完了！最全Python异常类合集和案例演示，第一部分博客主页：长风清留扬-CSDN博客系列专栏：Python基础专栏每天更新大数据相关方面的技术，分享自己的实战工作经验和学习总结，尽量帮助大家解决更多问题和学习更多新知识，欢迎评论区分享自己的看法感谢大家点赞收藏⭐评论异常类型IndexError
有史以来最全的异常类讲解没有之一！第三部分爆肝4万字，终于把Python的异常类写完了！最全Python异常类合集和案例演示，第三部分长风清留扬最新Python入门基础合集 python 面试异常处理 BUG 异常类型职场和发展改行学it
本文是第三部分，第一第二部分请看：有史以来最全的异常类讲解没有之一！爆肝3万字，终于把Python的异常类写完了！最全Python异常类合集和案例演示，第一部分有史以来最全的异常类讲解没有之一！第二部分爆肝2万字，终于把Python的异常类写完了！最全Python异常类合集和案例演示，第二部分博客主页：长风清留扬-CSDN博客系列专栏：Python基础专栏每天更新大数据相关方面的技术，分享自己的实
还在为Python“运算符”中遇到的BUG而发愁吗？，变量相关的问题和解决办法看这篇文章就够了！长风清留扬 android python bug 运算符
博客主页：长风清留扬-CSDN博客系列专栏：Python疑难杂症百科-BUG编年史每天更新大数据相关方面的技术，分享自己的实战工作经验和学习总结，尽量帮助大家解决更多问题和学习更多新知识，欢迎评论区分享自己的看法感谢大家点赞收藏⭐评论关于运算符中常见的问题和解决方法在Python编程的浩瀚宇宙中，变量如同星辰般璀璨，它们承载着数据，驱动着程序的运行。然而，即便是这些看似简单的构建块，也时常隐藏着令
动态规划详解-最小路径和问题【python】数据分析螺丝钉 LeetCode刷题与模拟面试动态规划算法 leetcode python 数据结构
作者介绍：10年大厂数据\经营分析经验，现任大厂数据部门负责人。会一些的技术：数据分析、算法、SQL、大数据相关、python欢迎加入社区：码上找工作作者专栏每日更新：LeetCode解锁1000题:打怪升级之旅python数据分析可视化：企业实战案例备注说明：方便大家阅读，统一使用python，带必要注释，公众号数据分析螺丝钉一起打怪升级1.问题介绍和应用场景最小路径和问题是一个常见的动态规划问
软考信安26~大数据安全需求分析与安全保护工程 jnprlxc 软考~信息安全工程师需求分析安全运维笔记
1、大数据安全威胁与需求分析1.1、大数据相关概念发展大数据是指非传统的数据处理工具的数据集，具有海量的数据规模、快速的数据流转、多样的数据类型和价值密度低等特征。大数据的种类和来源非常多，包括结构化、半结构化和非结构化数据。1.2、大数据安全威胁分析（1）“数据集“安全边界日渐模糊，安全保护难度提升（2）敏感数据泄露安全风险增大（3）数据失真与大数据污染安全风险（4）大数据处理平台业务连续性与拒
大数据分析与安全分析 Zh&&Li 网络安全运维数据分析安全数据挖掘运维数据库
大数据分析一、大数据安全威胁与需求分析1.1大数据相关概念发展大数据：是指非传统的数据处理工具的数据集大数据特征：海量的数据规模、快速的数据流转、多样的数据类型和价值密度低等大数据的种类和来源非常多，包括结构化、半结构化和非结构化数据有关大数据的新兴网络信息技术应用不断出现，主要包括大规模数据分析处理、数据挖掘、分布式文件系统、分布式数据库、云计算平台、互联网和存储系统1.2大数据安全威胁分析“数
pyflink 滚动窗口实例菜鸟社长菜鸟的大数据进阶之路大数据进阶之路 kafka big data python flink
写在前头：更多大数据相关精彩内容请进我的知识星球，每周定期更新正篇技术路线：模拟kafka生产者发送数据——>flink对kafka数据实时计算处理——>处理后的数据发送到kafka1、模拟客流数据的生产者，参考https://blog.csdn.net/qq_22611181/article/details/1199002502、flink聚合操作原理介绍，参考https://blog.csdn
Python自动化：Python操作Excel的多种方式Pandas+openpyxl+xlrd 长风清留扬 Python excel python pandas 自动化 Python办公自动化数据分析开发语言
在Python中，操作Excel数据通常可以通过几个流行的库来实现，比如pandas、openpyxl、xlrd等。下面会分别介绍这三个流行库来实现对Excel的操作。博客主页：长风清留扬-CSDN博客每天更新大数据相关方面的技术，分享自己的实战工作经验和学习总结，尽量帮助大家解决更多问题和学习更多新知识，欢迎评论区分享自己的看法感谢大家点赞收藏⭐评论推荐阅读：Python入门最全基础Python
高校为什么需要AIGC大数据实验室？泰迪智能科技01 AIGC AIGC 大数据
AIGC大数据实验室是一个专注于人工智能生成内容（AIGC）和大数据相关技术研究、开发与应用的创新实验平台。AIGC主要研究方向包括：AIGC技术创新、大数据处理与分析、AIGC与大数据融合应用。AIGC技术创新：探索如何利用人工智能算法，如深度学习中的生成对抗网络（GAN）、变分自编码器（VAE）、基于Transformer架构的语言模型（如GPT系列）等，来高效地生成高质量的文本、图像、音频、
php案例分析百度云_基于阿里云平台的大数据教学案例 —— B站弹幕数据分析 weixin_39892311 php案例分析百度云
简介：实验基于所学的大数据处理知识，结合阿里云大数据相关产品，分组完成一个大数据分析项目，数据集可以使用开源数据集或自行爬取，最终完成一个完整的实验报告：1、能够使用阿里云大数据相关产品完成数据分析、数据建模与模型优化2、能够基于分析结构构建可视化门户或可视化大屏，分析和呈现不少于5个3、分析案例有实用价值并能够形成有效结论4、能够将开源技术与阿里云产品结合，综合利用提升开发效率，降低成本5、能够
Hbase集群搭建超详细教程笑看风云路集群搭建系列 hbase hbase hadoop 大数据
Hbase集群搭建前言详细步骤1、下载安装包2、解压3、修改配置文件3.1修改hbase-env.sh文件3.2修改hbase-site.xml3.3修改regionservers文件4、分发hbase目录5、启动HBase集群6、查看HBaseWebUI大家好，我是风云，欢迎大家关注我的博客，在未来的日子里我们一起来学习大数据相关的技术，一起努力奋斗，遇见更好的自己！前言HBase是一个开源的、
魔法王国的故事——档案馆的危机健鑫. 数据仓库大数据 hadoop
❝这是一个连续的专栏,在这里,我将用一个奇幻的魔法王国的故事,来向你介绍大数据相关内容，希望在这里可以帮助你学到有用的知识第一章：档案馆的危机在一个遥远的魔法王国，有一个叫做档案馆的地方，那里存放着王国的所有重要的文件，比如法律、历史、魔法、地理等等。这些文件是王国的智慧之源，也是王国的秘密之宝，它们记录着王国的过去、现在和未来。档案馆由一位叫做档案大臣的人负责管理，他是王国最聪明也最忙碌的人之一
西安-腾讯云-Python面试经验--一面凉经 jiet07 腾讯云面试
自我介绍手撕链表排序操作系统a.线程和进程区别b.线程安全c.如何保证线程安全d.线程崩溃，会不会影响所在的进程e.什么是守护进程，僵尸进程，孤儿进程f.如何产生一个守护进程g.如何避免僵尸进程或者孤儿进程redisa.持久化方式有哪些，区别是什么b.redis集群有了解么c.rediszset()—底层如何实现（哈希表+跳跃表）和大数据相关的操作a.请求有多少，数据有多少b.Gbp/s负载均衡a
报表任务治理计划 liujianhuiouc
背景介绍近些年来，大数据技术得到了很广的应用，支撑了业务的快速发展。作为大数据的平台部门，提供了大数据相关的基础能力，业务同学借助于大数据的底层赋能完成更偏向业务的需求开发。报表是大数据支撑最早最广的功能形态。先给大家介绍我们我们公司的报表产出组件图：报表产出图底层平台由HDFS、Yarn分别提供存储和计算支持在这之上我们提供了一套支持MR、Spark任务开发、依赖执行的调度系统BI业务同学利用调
大数据相关技术 ssttIsme
1数据获取方式爬虫:分布式爬虫java的jsoup(操作方式基于选择器)，pythoon,八爪鱼日志收集:log4j(可以控制级别和放置的位置)(可以输出数据到flume)(可以输出到mq),flume(分布式日志收集系统)(收集用户ip，访问了哪个方法)(例如三大运营商的日志分析能根据用户71个字段，拿到谁在什么时间什么地点用什么手机什么浏览器哪个版本访问了什么网站访问了多长时间网站内容是什么)
大数据从何学起？大数据脑图+学习路线清晰的告诉你！ yoku酱
近些年，大数据的火热可谓是技术人都知道啊，很多人呢，也想学习大数据相关，但是又不知道从何下手，所以今天柠檬这里分享几个大数据脑图，希望可以让你清楚明白从哪里入门大数据，知道该学习以及掌握哪些知识点；当然还有自学教程分享哦！【大数据开发学习资料领取方式】：加入大数据技术学习交流扣扣群458345782，点击加入群聊，私信管理员即可免费领取第一阶段linux+搜索+hadoop体系Linux基础→sh
想学大数据？先看完这几本书再说 yoku酱
真正的数据爱好者有很多需要阅读的内容：大数据，机器学习，数据科学，数据挖掘等。除了这些技术领域，还有一些特定的技术和语言需要你继续研究：Hadoop，Spark，Python，和R等等，还有无数实现自动化的工具等等，这些工具几乎每天都会用到，这就需要你不断的学习。幸运的是，以上提到的这些都不缺关于它们的书籍。本文首先帮大家盘点几本大数据相关的书籍，这些书都是亚马逊上的畅销排行榜上的：关于大数据1、
2019-03-07 bigtian
早七点半起床。今天写了大量代码，最近一周的编码状态越来越好，代码也写得越来越顺手，今天把我的数据服务写了一个客户端调用程序，感觉质量还是比较满意的。公司做数据，但是我一个架构师对大数据相关技术却了解很浅，实在是惭愧。以后需要大力加强这一块的技能。对同事要善良，今天同事工作不开心闹了情绪，我主动将活揽过来，做好之后再跟他解释一遍我的思路，感觉这样他会更容易理解也更好的成长，只要一个人是积极向上的，就
VMware Workstation 11 或者 VMware Player 7安装MAC OS X 10.10 Yosemite iwindyforest vmware mac os 10.10 workstation player
最近尝试了下VMware下安装MacOS 系统，安装过程中发现网上可供参考的文章都是VMware Workstation 10以下， MacOS X 10.9以下的文章，只能提供大概的思路，但是实际安装起来由于版本问题，走了不少弯路，所以我尝试写以下总结，希望能给有兴趣安装OSX的人提供一点帮助。写在前面的话：其实安装好后发现，由于我的th
关于《基于模型驱动的B/S在线开发平台》源代码开源的疑虑？ deathwknight JavaScript java 框架
本人从学习Java开发到现在已有10年整，从一个要自学 java买成javascript的小菜鸟，成长为只会java和javascript语言的老菜鸟（个人邮箱：[email protected]）一路走来，跌跌撞撞。用自己的三年多业余时间，瞎搞一个小东西（基于模型驱动的B/S在线开发平台，非MVC框架、非代码生成）。希望与大家一起分享，同时有许些疑虑，希望有人可以交流下平台
如何把maven项目转成web项目 Kai_Ge maven MyEclipse
创建Web工程，使用eclipse ee创建maven web工程 1.右键项目,选择Project Facets,点击Convert to faceted from 2.更改Dynamic Web Module的Version为2.5.(3.0为Java7的,Tomcat6不支持). 如果提示错误,可能需要在Java Compiler设置Compiler compl
主管？？？ Array_06 工作
转载：http://www.blogjava.net/fastzch/archive/2010/11/25/339054.html 很久以前跟同事参加的培训，同事整理得很详细，必须得转！前段时间，公司有组织中高阶主管及其培养干部进行了为期三天的管理训练培训。三天的课程下来，虽然内容较多，因对老师三天来的课程内容深有感触，故借着整理学习心得的机会，将三天来的培训课程做了一个
python内置函数大全 2002wmj python
最近一直在看python的document，打算在基础方面重点看一下python的keyword、Build-in Function、Build-in Constants、Build-in Types、Build-in Exception这四个方面，其实在看的时候发现整个《The Python Standard Library》章节都是很不错的，其中描述了很多不错的主题。先把Build-in Fu
JSP页面通过JQUERY合并行 357029540 JavaScript jquery
在写程序的过程中我们难免会遇到在页面上合并单元行的情况，如图所示如果对于会的同学可能很简单，但是对没有思路的同学来说还是比较麻烦的，提供一下用JQUERY实现的参考代码 function mergeCell(){ var trs = $("#table tr"); &nb
Java基础冰天百华 java基础
学习函数式编程 package base; import java.text.DecimalFormat; public class Main { public static void main(String[] args) { // Integer a = 4; // Double aa = (double)a / 100000; // Decimal
unix时间戳相互转换 adminjun 转换 unix 时间戳
如何在不同编程语言中获取现在的Unix时间戳(Unix timestamp)？ Java time JavaScript Math.round(new Date().getTime()/1000) getTime()返回数值的单位是毫秒 Microsoft .NET / C# epoch = (DateTime.Now.ToUniversalTime().Ticks - 62135
作为一个合格程序员该做的事 aijuans 程序员
作为一个合格程序员每天该做的事 1、总结自己一天任务的完成情况最好的方式是写工作日志，把自己今天完成了什么事情，遇见了什么问题都记录下来，日后翻看好处多多 2、考虑自己明天应该做的主要工作把明天要做的事情列出来，并按照优先级排列，第二天应该把自己效率最高的时间分配给最重要的工作 3、考虑自己一天工作中失误的地方，并想出避免下一次再犯的方法出错不要紧，最重
由html5视频播放引发的总结 ayaoxinchao html5 视频 video
前言项目中存在视频播放的功能，前期设计是以flash播放器播放视频的。但是现在由于需要兼容苹果的设备，必须采用html5的方式来播放视频。我就出于兴趣对html5播放视频做了简单的了解，不了解不知道，水真是很深。本文所记录的知识一些浅尝辄止的知识，说起来很惭愧。视频结构本该直接介绍html5的<video>的，但鉴于本人对视频
解决httpclient访问自签名https报javax.net.ssl.SSLHandshakeException: sun.security.validat bewithme httpclient
如果你构建了一个https协议的站点，而此站点的安全证书并不是合法的第三方证书颁发机构所签发，那么你用httpclient去访问此站点会报如下错误 javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path bu
Jedis连接池的入门级使用 bijian1013 redis redis数据库 jedis
Jedis连接池操作步骤如下： a.获取Jedis实例需要从JedisPool中获取； b.用完Jedis实例需要返还给JedisPool； c.如果Jedis在使用过程中出错，则也需要还给JedisPool； packag
变与不变 bingyingao 不变变亲情永恒
变与不变周末骑车转到了五年前租住的小区，曾经最爱吃的西北面馆、江西水饺、手工拉面早已不在，各种店铺都换了好几茬，这些是变的。三年前还很流行的一款手机在今天看起来已经落后的不像样子。三年前还运行的好好的一家公司，今天也已经不复存在。一座座高楼拔地而起，
【Scala十】Scala核心四：集合框架之List bit1129 scala
Spark的RDD作为一个分布式不可变的数据集合，它提供的转换操作，很多是借鉴于Scala的集合框架提供的一些函数，因此，有必要对Scala的集合进行详细的了解 1. 泛型集合都是协变的，对于List而言，如果B是A的子类，那么List[B]也是List[A]的子类，即可以把List[B]的实例赋值给List[A]变量 2. 给变量赋值(注意val关键字，a，b
Nested Functions in C bookjovi c closure
Nested Functions 又称closure，属于functional language中的概念，一直以为C中是不支持closure的，现在看来我错了，不过C标准中是不支持的，而GCC支持。既然GCC支持了closure，那么 lexical scoping自然也支持了，同时在C中label也是可以在nested functions中自由跳转的
Java-Collections Framework学习与总结-WeakHashMap BrokenDreams Collections
总结这个类之前，首先看一下Java引用的相关知识。Java的引用分为四种：强引用、软引用、弱引用和虚引用。强引用：就是常见的代码中的引用，如Object o = new Object();存在强引用的对象不会被垃圾收集
读《研磨设计模式》-代码笔记-解释器模式-Interpret bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ package design.pattern; /* * 解释器（Interpreter）模式的意图是可以按照自己定义的组合规则集合来组合可执行对象 * * 代码示例实现XML里面1.读取单个元素的值 2.读取单个属性的值 * 多
After Effects操作&快捷键 cherishLC After Effects
1、快捷键官方文档中文版：https://helpx.adobe.com/cn/after-effects/using/keyboard-shortcuts-reference.html 英文版：https://helpx.adobe.com/after-effects/using/keyboard-shortcuts-reference.html 2、常用快捷键
Maven 常用命令 crabdave maven
Maven 常用命令 mvn archetype:generate mvn install mvn clean mvn clean complie mvn clean test mvn clean install mvn clean package mvn test mvn package mvn site mvn dependency:res
shell bad substitution daizj shell 脚本
#!/bin/sh /data/script/common/run_cmd.exp 192.168.13.168 "impala-shell -islave4 -q 'insert OVERWRITE table imeis.${tableName} select ${selectFields}, ds, fnv_hash(concat(cast(ds as string), im
Java SE 第二讲（原生数据类型 Primitive Data Type） dcj3sjt126com java
Java SE 第二讲： 1. Windows: notepad, editplus, ultraedit, gvim Linux: vi, vim, gedit 2. Java 中的数据类型分为两大类： 1）原生数据类型（Primitive Data Type） 2）引用类型（对象类型）（R
CGridView中实现批量删除 dcj3sjt126com PHP yii
1，CGridView中的columns添加 array( 'selectableRows' => 2, 'footer' => '<button type="button" onclick="GetCheckbox();" style=&
Java中泛型的各种使用 dyy_gusi java 泛型
Java中的泛型的使用：1.普通的泛型使用在使用类的时候后面的<>中的类型就是我们确定的类型。 public class MyClass1<T> {//此处定义的泛型是T private T var; public T getVar() { return var; } public void setVa
Web开发技术十年发展历程 gcq511120594 Web 浏览器数据挖掘
回顾web开发技术这十年发展历程： Ajax 03年的时候我上六年级，那时候网吧刚在小县城的角落萌生。传奇，大话西游第一代网游一时风靡。我抱着试一试的心态给了网吧老板两块钱想申请个号玩玩，然后接下来的一个小时我一直在，注，册，账，号。彼时网吧用的512k的带宽，注册的时候，填了一堆信息，提交，页面跳转，嘣，”您填写的信息有误，请重填”。然后跳转回注册页面，以此循环。我现在时常想，如果当时a
openSession()与getCurrentSession()区别： hetongfei java DAO Hibernate
来自 http://blog.csdn.net/dy511/article/details/6166134 1.getCurrentSession创建的session会和绑定到当前线程,而openSession不会。 2. getCurrentSession创建的线程会在事务回滚或事物提交后自动关闭,而openSession必须手动关闭。这里getCurrentSession本地事务(本地
第一章安装Nginx+Lua开发环境 jinnianshilongnian nginx lua openresty
首先我们选择使用OpenResty，其是由Nginx核心加很多第三方模块组成，其最大的亮点是默认集成了Lua开发环境，使得Nginx可以作为一个Web Server使用。借助于Nginx的事件驱动模型和非阻塞IO，可以实现高性能的Web应用程序。而且OpenResty提供了大量组件如Mysql、Redis、Memcached等等，使在Nginx上开发Web应用更方便更简单。目前在京东如实时价格、秒
HSQLDB In-Process方式访问内存数据库 liyonghui160com
HSQLDB一大特色就是能够在内存中建立数据库，当然它也能将这些内存数据库保存到文件中以便实现真正的持久化。先睹为快！下面是一个In-Process方式访问内存数据库的代码示例：下面代码需要引入hsqldb.jar包（hsqldb-2.2.8） import java.s
Java线程的5个使用技巧 pda158 java 数据结构
Java线程有哪些不太为人所知的技巧与用法？　　萝卜白菜各有所爱。像我就喜欢Java。学无止境，这也是我喜欢它的一个原因。日常工作中你所用到的工具，通常都有些你从来没有了解过的东西，比方说某个方法或者是一些有趣的用法。比如说线程。没错，就是线程。或者确切说是Thread这个类。当我们在构建高可扩展性系统的时候，通常会面临各种各样的并发编程的问题，不过我们现在所要讲的可能会略有不同。
开发资源大整合：编程语言篇——JavaScript（1） shoothao JavaScript
概述：本系列的资源整合来自于github中各个领域的大牛，来收藏你感兴趣的东西吧。程序包管理器管理javascript库并提供对这些库的快速使用与打包的服务。 Bower - 用于web的程序包管理。 component - 用于客户端的程序包管理，构建更好的web应用程序。 spm - 全新的静态的文件包管
避免使用终结函数 vahoa.ma java jvm C++
终结函数（finalizer）通常是不可预测的，常常也是很危险的，一般情况下不是必要的。使用终结函数会导致不稳定的行为、更差的性能，以及带来移植性问题。不要把终结函数当做C++中的析构函数（destructors）的对应物。我自己总结了一下这一条的综合性结论是这样的： 1）在涉及使用资源，使用完毕后要释放资源的情形下，首先要用一个显示的方