Spark Streaming + Kafka + Hive + JSON: a real-time incremental computation example

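Business architecture:
JavaScript -> Netty -> Kafka -> Spark Streaming + Hive -> Redis -> PHP
1. A JavaScript tracking script sends statistics to the back-end server.
2. Netty receives the requests, generates a user identifier, filters the data, converts the raw data to JSON, and writes it to Kafka.
3. Spark Streaming pulls data from Kafka using the Direct Approach (No Receivers), processes each batch inside foreachRDD through Hive, and writes the final results to Redis.
4. PHP reads the statistics from Redis and runs whatever recommendation logic the product requires.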
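
Step 2 is easier to picture with a small example. The following is only a sketch of the "filter, add a user identifier, convert to JSON" part, written against the plain JDK so it runs on its own. The class name TrackingEvent, the field name "uid", and the rule "drop empty fields" are assumptions made for illustration and are not taken from the original project, which would more likely use the jackson-databind dependency it lists.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

final class TrackingEvent {

    /** Escapes the characters that must not appear raw inside a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Drops empty fields, adds a user id if none is present, and renders one flat JSON object. */
    static String toJson(Map<String, String> params) {
        Map<String, String> event = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (e.getValue() != null && !e.getValue().isEmpty()) {
                event.put(e.getKey(), e.getValue());
            }
        }
        if (!event.containsKey("uid")) {
            // "uid" is a hypothetical field name for the generated user identifier
            event.put("uid", UUID.randomUUID().toString());
        }
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, String> e : event.entrySet()) {
            if (sb.length() > 1) sb.append(',');
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("uid", "u-1");
        params.put("url", "/house/42");
        params.put("ref", ""); // empty, so it is filtered out
        System.out.println(toJson(params));
    }
}
```

Each string produced this way is one self-contained JSON object, which is the shape sqlContext.read().json(...) expects for every record it reads from the Kafka message values.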

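Deployment environment:
CentOS-6.7-x86_64 2.6.32-573.22.1.el6.x86_64
jdk1.8.0_77
spark-1.6.1
scala-2.10.6 (the version spark-1.6.1 requires)
hadoop-2.7.2 (at the time of testing, the default spark-1.6.1 build targeted hadoop-2.6.x; for anything newer you have to compile Spark yourself)
kafka_2.10-0.9.0.1 (must match the Scala version)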

Development environment:
Windows 10 Pro
myeclipse-2015-stable-1.0
jdk1.7.0_80

Node layout:
192.168.163.141 CoS6-Node1
192.168.163.136 CoS6-Node2
192.168.163.137 CoS6-Node3

1. Start Zookeeper (on 3 nodes): /opt/zookeeper-3.4.8/bin/zkServer.sh start
2. Start Yarn (on 1 node): /opt/hadoop-2.7.2/sbin/start-all.sh
3. Start ZKFC (on 2 nodes): /opt/hadoop-2.7.2/sbin/hadoop-daemon.sh start zkfc
4. Start Spark (on 1 node): /opt/spark-1.6.1/sbin/start-all.sh
5. Start Kafka (on 3 nodes): /opt/kafka_2.10-0.9.0.1/bin/kafka-server-start.sh /opt/kafka_2.10-0.9.0.1/config/server.properties > /data/kafka/run.out &


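Dependencies (pom.xml):
    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
            <version>0.8.2.1</version>
        </dependency>
        <dependency>
            <groupId>io.netty</groupId>
            <artifactId>netty-all</artifactId>
            <version>4.0.36.Final</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.2</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka_2.10</artifactId>
            <version>1.6.1</version>
        </dependency>
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>2.8.1</version>
        </dependency>
    </dependencies>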

JVM arguments:
-Dhadoop.home.dir="D:\\Workspaces\\MyEclipse Professional 2014\\hadoop-common-2.2.0-bin"
-Dhive.exec.scratchdir="C:\\Users\\Ouyang\\AppData\\Local\\Temp\\hive"
-XX:PermSize=128M   
-XX:MaxPermSize=4096M

Sample code:
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import kafka.serializer.StringDecoder;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

public class SparkJson {

    public static void main(String[] args) {
        // Configuration is the author's own helper class (not shown in this post);
        // it supplies settings such as "master", "app.name" and "topic.json".
        Configuration config = Configuration.getInstance();

        SparkConf conf = new SparkConf().setMaster(config.getProperty("master")).setAppName(config.getProperty("app.name"));
        // One micro-batch every 500 ms
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(500));
        final HiveContext sqlContext = new HiveContext(jssc.sparkContext());

        Set<String> topicsSet = new HashSet<String>(Arrays.asList(config.getProperty("topic.json")));
        Map<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", config.getProperty("metadata.broker.list"));

        // Create direct kafka stream with brokers and topics
        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
            jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet
        );
        // Get the lines, load to sqlContext
        messages.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
            private static final long serialVersionUID = 1L;

            @Override
            public void call(JavaPairRDD<String, String> t) throws Exception {
                // Skip empty batches
                if (t.isEmpty()) return;
                // The Kafka message values are the JSON strings; the keys are ignored
                DataFrame df = sqlContext.read().json(t.values());
                df.show();
            }
        });

        // Start the computation
        jssc.start();
        jssc.awaitTermination();
    }

}
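
The sample code calls Configuration.getInstance() and getProperty(...), but the post never shows that class. The following is one guess at what such a helper could look like, not the author's implementation: a singleton backed by java.util.Properties. The resource name config.properties and the rule that -D system properties take precedence are assumptions made for this sketch.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

final class Configuration {
    // Hypothetical resource name; the original post does not say where its settings live.
    private static final String RESOURCE = "config.properties";
    private static final Configuration INSTANCE = new Configuration();

    private final Properties props = new Properties();

    private Configuration() {
        // try-with-resources skips a null resource, so a missing file simply leaves props empty
        try (InputStream in = Configuration.class.getClassLoader().getResourceAsStream(RESOURCE)) {
            if (in != null) {
                props.load(in);
            }
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read " + RESOURCE, e);
        }
    }

    static Configuration getInstance() {
        return INSTANCE;
    }

    /** Returns the -D system property if set, otherwise the value from the file, otherwise null. */
    String getProperty(String key) {
        String override = System.getProperty(key);
        return override != null ? override : props.getProperty(key);
    }
}
```

With a helper like this, a config.properties file on the classpath would hold entries such as master, app.name, topic.json and metadata.broker.list, and any of them could be overridden for a single run with a -D argument.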

Problems and solutions:
1. The winutils.exe file is missing
Exception:
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
    at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:278)
    at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:300)
    at org.apache.hadoop.util.Shell.(Shell.java:293)
    at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
    at org.apache.hadoop.hive.conf.HiveConf$ConfVars.(HiveConf.java:365)
    at org.apache.spark.sql.hive.HiveContext$.newTemporaryConfiguration(HiveContext.scala:730)
    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:215)
    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    at org.apache.spark.sql.UDFRegistration.(UDFRegistration.scala:40)
    at org.apache.spark.sql.SQLContext.(SQLContext.scala:330)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:103)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:27)
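Solution:
① Download hadoop-common and build it by following BUILDING.txt. On Windows the build depends on zlib and the Windows SDK, so the process is fairly tedious.
Project: https://github.com/apache/hadoop-common
② Alternatively, search GitHub for winutils.exe and download a build that someone else has already compiled.
Example: https://github.com/srccodes/hadoop-common-2.2.0-bin
③ Set HADOOP_HOME and copy all of the files into its bin directory. A full Hadoop installation is not needed here.
④ If HADOOP_HOME has no effect, specify the location as a JVM argument instead: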
-Dhadoop.home.dir="D:\\Workspaces\\MyEclipse Professional 2014\\hadoop-common-2.2.0-bin"
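
The "null\bin\winutils.exe" in the exception message suggests that no Hadoop home was resolved at all. A small pre-flight check can confirm this before Spark starts. The sketch below uses only the JDK; the class name HadoopHomeCheck is made up, and the lookup order (the hadoop.home.dir property first, then the HADOOP_HOME environment variable) mirrors the two options in steps ③ and ④ rather than quoting Hadoop's own code.

```java
import java.io.File;

final class HadoopHomeCheck {

    /** JVM property first, then the HADOOP_HOME environment variable; null if neither is set. */
    static String resolveHome() {
        String home = System.getProperty("hadoop.home.dir");
        return home != null ? home : System.getenv("HADOOP_HOME");
    }

    /** The location where winutils.exe is expected for a given Hadoop home. */
    static File winutilsPath(String home) {
        return new File(new File(home, "bin"), "winutils.exe");
    }

    static boolean hasWinutils(String home) {
        return home != null && winutilsPath(home).isFile();
    }

    public static void main(String[] args) {
        String home = resolveHome();
        if (hasWinutils(home)) {
            System.out.println("winutils.exe found under " + home);
        } else if (home == null) {
            System.out.println("Neither hadoop.home.dir nor HADOOP_HOME is set");
        } else {
            System.out.println("winutils.exe missing, expected at " + winutilsPath(home).getPath());
        }
    }
}
```

Running it with the same -Dhadoop.home.dir argument as the Spark job shows whether the directory layout is what the error message expects.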

2. The default hive.exec.scratchdir directory, /tmp/hive, does not map to a valid disk partition on Windows.
Exception:
Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
    at org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:204)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
    at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
    at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
    at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
    at org.apache.spark.sql.UDFRegistration.(UDFRegistration.scala:40)
    at org.apache.spark.sql.SQLContext.(SQLContext.scala:330)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:90)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:103)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:27)
Caused by: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: ---------
    at org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:612)
    at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:554)
    at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
    ... 12 more
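Solution:
① Tracing through the code shows that HiveConf reads the relevant settings from SparkConf, so starting from rootHDFSDirPath makes the problem much simpler.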
------org.apache.hadoop.hive.ql.session.SessionState
  /**
   * Create the root scratch dir on hdfs (if it doesn't already exist) and make it writable
   * @param conf
   * @return
   * @throws IOException
   */
  private Path createRootHDFSDir(HiveConf conf) throws IOException {
    Path rootHDFSDirPath = new Path(HiveConf.getVar(conf, HiveConf.ConfVars.SCRATCHDIR));
    FsPermission writableHDFSDirPermission = new FsPermission((short)00733);
    FileSystem fs = rootHDFSDirPath.getFileSystem(conf);

------org.apache.hadoop.fs.FileSystem
  public static URI getDefaultUri(Configuration conf) {
    return URI.create(fixName(conf.get(FS_DEFAULT_NAME_KEY, DEFAULT_FS)));
  }

------org.apache.hadoop.fs.CommonConfigurationKeysPublic
 /** See core-default.xml */
  public static final String  FS_DEFAULT_NAME_KEY = "fs.defaultFS";
  /** Default value for FS_DEFAULT_NAME_KEY */
  public static final String  FS_DEFAULT_NAME_DEFAULT = "file:///";
② Set the JVM argument that corresponds to hive.exec.scratchdir:
-Dhive.exec.scratchdir="C:\\Users\\Ouyang\\AppData\\Local\\Temp\\hive"
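
The hard-coded user path above only works on one machine. A portable variant is to derive the directory from java.io.tmpdir and set the property in code. This is a sketch under one assumption: that Hive picks up the system property exactly as it picks up the -D flag, which means configure() would have to run before the HiveContext is created. The class name HiveScratchDir is hypothetical.

```java
import java.io.File;

final class HiveScratchDir {
    static final String KEY = "hive.exec.scratchdir";

    /** Uses the -D value if one was given, otherwise <java.io.tmpdir>/hive; creates the directory. */
    static File configure() {
        String preset = System.getProperty(KEY);
        File dir = preset != null
            ? new File(preset)
            : new File(System.getProperty("java.io.tmpdir"), "hive");
        if (!dir.isDirectory() && !dir.mkdirs()) {
            throw new IllegalStateException("Cannot create scratch dir " + dir);
        }
        System.setProperty(KEY, dir.getAbsolutePath());
        return dir;
    }

    public static void main(String[] args) {
        System.out.println(KEY + " = " + configure().getAbsolutePath());
    }
}
```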

3. Out of memory: OutOfMemoryError: PermGen space
Exception:
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:249)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:327)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:237)
    at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:441)
    at org.apache.spark.sql.hive.HiveContext.defaultOverrides(HiveContext.scala:226)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:229)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:101)
    at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:103)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:27)
Caused by: java.lang.OutOfMemoryError: PermGen space
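Solution:
① Add the following JVM arguments. They apply to Java 7 and earlier (the development JDK here is jdk1.7.0_80); Java 8 removed the permanent generation and ignores both flags.
-XX:PermSize=128M
-XX:MaxPermSize=4096M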
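
Since the development machine runs Java 7 and the servers run Java 8, it helps to check which kind of class-metadata pool the current JVM actually has before tuning PermGen flags. The sketch below lists the JVM's memory pools through the standard java.lang.management API. Matching on the names "Perm Gen" and "Metaspace" is an assumption that holds for HotSpot-style JVMs, and the class name ClassMetadataPool is made up.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

final class ClassMetadataPool {

    /** Returns the name of the first pool that looks like PermGen or Metaspace, or null if none does. */
    static String find() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            if (name.contains("Perm Gen") || name.contains("Metaspace")) {
                return name;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("class metadata pool = " + find());
    }
}
```

If the reported pool is a Perm Gen pool, the two flags above are the ones to adjust; if it is Metaspace, they have no effect.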

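4. Incompatible Kafka version
Exception:
pom.xml
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka_2.10</artifactId>
            <version>0.9.0.1</version>
        </dependency>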
Exception in thread "main" java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:87)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:86)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:86)
    at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:85)
    at scala.util.Either$RightProjection.flatMap(Either.scala:523)
    at org.apache.spark.streaming.kafka.KafkaCluster.findLeaders(KafkaCluster.scala:85)
    at org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:179)
    at org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:161)
    at org.apache.spark.streaming.kafka.KafkaCluster.getLatestLeaderOffsets(KafkaCluster.scala:150)
    at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:215)
    at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:211)
    at scala.util.Either$RightProjection.flatMap(Either.scala:523)
    at org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
    at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:34)
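Solution:
① Consult the official documentation to find the Kafka version that spark-streaming is compatible with.
Documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html
Kafka: Spark Streaming 1.6.1 is compatible with Kafka 0.8.2.1. See the Kafka Integration Guide for more details.
② Change the version in pom.xml accordingly (0.8.2.1 here); the servers can stay on the newer Kafka version.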
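
A mismatch like this can be caught at startup instead of deep inside createDirectStream. The sketch below checks whether kafka.cluster.BrokerEndPoint, the class named in the ClassCastException, is on the classpath. It assumes that this class is absent from the 0.8.2.1 client; the stack trace only confirms that it exists in 0.9.0.1. The class name KafkaClientCheck is hypothetical.

```java
final class KafkaClientCheck {

    /** True if the class can be found by this class's loader; the class is not initialized. */
    static boolean onClasspath(String className) {
        try {
            Class.forName(className, false, KafkaClientCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: BrokerEndPoint only exists in Kafka clients newer than 0.8.2.1
        if (onClasspath("kafka.cluster.BrokerEndPoint")) {
            System.out.println("Kafka client looks newer than 0.8.2.1; spark-streaming-kafka 1.6.1 may fail");
        } else {
            System.out.println("kafka.cluster.BrokerEndPoint not found on the classpath");
        }
    }
}
```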

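5. Incompatible fasterxml.jackson version
Exception:
pom.xml
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.7.4</version>
        </dependency>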
Exception in thread "main" java.lang.VerifyError: class com.fasterxml.jackson.module.scala.ser.ScalaIteratorSerializer overrides final method withResolved.(Lcom/fasterxml/jackson/databind/BeanProperty;Lcom/fasterxml/jackson/databind/jsontype/TypeSerializer;Lcom/fasterxml/jackson/databind/JsonSerializer;)Lcom/fasterxml/jackson/databind/ser/std/AsArraySerializerBase;
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at com.fasterxml.jackson.module.scala.ser.IteratorSerializerModule$class.$init$(IteratorSerializerModule.scala:70)
    at com.fasterxml.jackson.module.scala.DefaultScalaModule.(DefaultScalaModule.scala:19)
    at com.fasterxml.jackson.module.scala.DefaultScalaModule$.(DefaultScalaModule.scala:35)
    at com.fasterxml.jackson.module.scala.DefaultScalaModule$.(DefaultScalaModule.scala)
    at org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala:81)
    at org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala)
    at org.apache.spark.streaming.dstream.InputDStream.(InputDStream.scala:78)
    at org.apache.spark.streaming.kafka.DirectKafkaInputDStream.(DirectKafkaInputDStream.scala:56)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:485)
    at org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:607)
    at org.apache.spark.streaming.kafka.KafkaUtils.createDirectStream(KafkaUtils.scala)
    at com.leju.esf.cluster.SparkJson.main(SparkJson.java:34)
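Solution:
① Newer fasterxml.jackson releases changed the modifier of the method involved (the error shows that withResolved is now final). Check the release notes for that method to find the last compatible version.
② Change the version in pom.xml accordingly (2.4.4 in the dependency list above).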
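
When two Jackson artifacts disagree like this, the first thing to establish is which jar each class is really loaded from. The sketch below reports the code source and, if the jar's manifest declares one, the implementation version of a class. It uses only JDK reflection; the class name JarLocator and the output format are made up for illustration.

```java
import java.security.CodeSource;

final class JarLocator {

    /** Describes where a class was loaded from, without initializing it. */
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className, false, JarLocator.class.getClassLoader());
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself have no code source
            String from = src == null ? "(JDK runtime)" : String.valueOf(src.getLocation());
            Package p = c.getPackage();
            String version = p == null ? null : p.getImplementationVersion();
            return from + " version=" + version;
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Both class names appear in the dependency list or the stack trace above
        System.out.println(locate("com.fasterxml.jackson.databind.ObjectMapper"));
        System.out.println(locate("com.fasterxml.jackson.module.scala.DefaultScalaModule"));
    }
}
```

If the two lines point at jars with different Jackson versions, that is the pair to align in pom.xml.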
