Spark Solr(1)Read Data from SOLR

I got a lot of jackson-databind version conflicts between Spark and Solr, so I cloned the project and made some version updates there.

It is originally forked from https://github.com/LucidWorks/spark-solr
>git clone https://github.com/luohuazju/spark-solr
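
If you hit the same conflict, a quick way to see which jackson versions each side pulls in is the standard Maven dependency tree; the filter pattern here is just an example:
>mvn dependency:tree -Dincludes=com.fasterxml.jackson.core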

Only some dependency versions are updated in pom.xml:
 <modelVersion>4.0.0</modelVersion>
 <groupId>com.lucidworks.spark</groupId>
 <artifactId>spark-solr</artifactId>
-<version>3.4.0-SNAPSHOT</version>
+<version>3.4.0.1</version>
 <packaging>jar</packaging>
 <name>spark-solr</name>
 <description>Tools for reading data from Spark into Solr</description>
@@ -39,11 +39,10 @@
     <java.version>1.8</java.version>
     <spark.version>2.2.1</spark.version>
     <solr.version>7.1.0</solr.version>
-    <fasterxml.version>2.4.0</fasterxml.version>
+    <fasterxml.version>2.6.7</fasterxml.version>
     <scala.version>2.11.8</scala.version>
     <scala.binary.version>2.11</scala.binary.version>
     <snappy.version>1.1.1</snappy.version>
-    <fasterxml.jackson.version>2.4.0</fasterxml.jackson.version>
     <MaxPermSize>128m</MaxPermSize>
Command to build the package:
>mvn clean compile install -DskipTests

After the build, I get a driver versioned 3.4.0.1.
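
If you would rather not maintain a fork, it may also work to pin a single jackson-databind version in your own project with dependencyManagement. This is only a sketch of that alternative, not what this post does, and the pinned version is an assumption:

<dependencyManagement>
    <dependencies>
        <!-- force one jackson-databind version for all transitive dependencies -->
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.6.7</version>
        </dependency>
    </dependencies>
</dependencyManagement>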

Set Up SOLR Spark Task
Here is the pom.xml that pulls in the dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.sillycat</groupId>
    <artifactId>sillycat-spark-solr</artifactId>
    <version>1.0</version>
    <name>Fetch the Events from Kafka</name>
    <description>Spark Streaming System</description>
    <packaging>jar</packaging>

    <properties>
        <spark.version>2.2.1</spark.version>
        <spark.solr.version>3.4.0.1</spark.solr.version>
    </properties>

    <dependencies>
        <!-- spark core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!-- spark solr connector, the driver built above -->
        <dependency>
            <groupId>com.lucidworks.spark</groupId>
            <artifactId>spark-solr</artifactId>
            <version>${spark.solr.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.3</version>
        </dependency>

        <!-- test -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4.1</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>com.sillycat.sparkjava.SparkJavaApp</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

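Since the assembly plugin is bound to the package phase, this command builds the fat jar used by the run commands below:
>mvn clean package
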
Here is the main implementation class, which connects to ZooKeeper and queries SOLR:
SeniorJavaFeedApp.java
package com.sillycat.sparkjava.app;

import java.util.List;

import org.apache.solr.common.SolrDocument;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

import com.lucidworks.spark.rdd.SolrJavaRDD;
import com.sillycat.sparkjava.base.SparkBaseApp;

public class SeniorJavaFeedApp extends SparkBaseApp {

    private static final long serialVersionUID = -1219898501920199612L;

    protected String getAppName() {
        return "SeniorJavaFeedApp";
    }

    public void executeTask(List<String> params) {
        SparkConf conf = this.getSparkConf();
        SparkContext sc = new SparkContext(conf);

        // ZooKeeper ensemble, with the chroot pointing at the SOLR cluster
        String zkHost = "zookeeper1.us-east-1.elasticbeanstalk.com,zookeeper2.us-east-1.elasticbeanstalk.com,zookeeper3.us-east-1.elasticbeanstalk.com/solr/allJobs";
        String collection = "allJobs";
        String solrQuery = "expired: false AND title: Java* AND source_id: 4675";
        String keyword = "Architect";

        logger.info("Prepare the resource from " + solrQuery);
        JavaRDD<SolrDocument> rdd = this.generateRdd(sc, zkHost, collection, solrQuery);
        logger.info("Executing the calculation based on keyword " + keyword);

        List<SolrDocument> results = processRows(rdd, keyword);
        for (SolrDocument result : results) {
            logger.info("Find some jobs for you:" + result);
        }
        sc.stop();
    }

    private JavaRDD<SolrDocument> generateRdd(SparkContext sc, String zkHost, String collection, String solrQuery) {
        // Query all shards of the collection and get the matches back as an RDD
        SolrJavaRDD solrRDD = SolrJavaRDD.get(zkHost, collection, sc);
        JavaRDD<SolrDocument> resultsRDD = solrRDD.queryShards(solrQuery);
        return resultsRDD;
    }

    private List<SolrDocument> processRows(JavaRDD<SolrDocument> rows, String keyword) {
        // Keep only the documents whose title contains the keyword
        JavaRDD<SolrDocument> lines = rows.filter(new Function<SolrDocument, Boolean>() {
            private static final long serialVersionUID = 1L;

            public Boolean call(SolrDocument s) throws Exception {
                Object titleObj = s.getFieldValue("title");
                if (titleObj != null) {
                    String title = titleObj.toString();
                    if (title.contains(keyword)) {
                        return true;
                    }
                }
                return false;
            }
        });
        return lines.collect();
    }

}
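
The post does not show SparkBaseApp, but the code above tells us what it must provide: a logger, getSparkConf(), and the executeTask(List<String>) hook. A minimal sketch of such a base class, assuming local[*] as the fallback master (the class name and members are taken from the usage above, the rest is illustrative):

package com.sillycat.sparkjava.base;

import java.io.Serializable;
import java.util.List;

import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;

public abstract class SparkBaseApp implements Serializable {

    private static final long serialVersionUID = 1L;

    // Shared logger for all the concrete apps
    protected transient Logger logger = Logger.getLogger(this.getClass());

    protected abstract String getAppName();

    // Concrete apps put their Spark logic here
    public abstract void executeTask(List<String> params);

    protected SparkConf getSparkConf() {
        SparkConf conf = new SparkConf().setAppName(getAppName());
        // Assumption: fall back to a local master when spark-submit supplies none
        if (!conf.contains("spark.master")) {
            conf.setMaster("local[*]");
        }
        return conf;
    }
}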

Here are the commands to run the Spark tasks locally and on the cluster.
#Run locally#

>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>java -jar target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp

#Run the binary locally#

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp /Users/carl/work/sillycat/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp

#Run the binary on a remote YARN cluster#

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.CountLinesOfKeywordApp

>bin/spark-submit --class com.sillycat.sparkjava.SparkJavaApp --master yarn-client /home/ec2-user/users/carl/sillycat-spark-java/sillycat-spark-solr/target/sillycat-spark-solr-1.0-jar-with-dependencies.jar com.sillycat.sparkjava.app.SeniorJavaFeedApp
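
All of these commands hand the target app class name to com.sillycat.sparkjava.SparkJavaApp as the first argument, which suggests a small reflective dispatcher. The post does not include that class; a minimal sketch of such an entry point, assuming the first argument is the class to run (only the names come from the commands above, the rest is illustrative):

package com.sillycat.sparkjava;

import java.util.Arrays;
import java.util.List;

import com.sillycat.sparkjava.base.SparkBaseApp;

public class SparkJavaApp {

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("Usage: SparkJavaApp <appClassName> [params...]");
            System.exit(1);
        }
        // Instantiate the requested app by its fully qualified class name
        SparkBaseApp app = (SparkBaseApp) Class.forName(args[0]).newInstance();
        // Pass any remaining arguments through to the task
        List<String> params = Arrays.asList(args).subList(1, args.length);
        app.executeTask(params);
    }
}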



References:
https://github.com/LucidWorks/spark-solr
https://lucidworks.com/2015/08/20/solr-spark-sql-datasource/
https://lucidworks.com/2016/08/16/solr-as-sparksql-datasource-part-ii/

Spark library
http://spark-packages.org/

Write to XML - stax
https://docs.oracle.com/javase/tutorial/jaxp/stax/example.html#bnbgx
https://www.journaldev.com/892/how-to-write-xml-file-in-java-using-java-stax-api
Spark to s3
http://www.sparktutorials.net/Reading+and+Writing+S3+Data+with+Apache+Spark
