基于Kafka+SparkStreaming+HBase实时点击流案例

背景

Kafka实时记录从数据采集工具Flume或业务系统实时接口收集数据,并作为消息缓冲组件为上游实时计算框架提供可靠数据支撑,Spark 1.3版本后支持两种整合Kafka机制(Receiver-based Approach 和 Direct Approach),具体细节请参考文章最后官方文档链接,数据存储使用HBase

实现思路

实现Kafka消息生产者模拟器

Spark-Streaming采用Direct Approach方式实时获取Kafka中数据

Spark-Streaming对数据进行业务计算后数据存储到HBase

本地虚拟机集群环境配置

由于笔者机器性能有限,hadoop/zookeeper/kafka集群都搭建在一起主机名分别为hadoop1,hadoop2,hadoop3; hbase为单节点 在hadoop1

缺点及不足

由于笔者技术有限,代码设计上有部分缺陷,比如spark-streaming计算后数据保存hbase逻辑性能很低,希望大家多提意见以便小编及时更正

代码实现

Kafka消息模拟器

packageclickstreamimportjava.util.{Properties,Random,UUID}importkafka.producer.{KeyedMessage,Producer,ProducerConfig}importorg.codehaus.jettison.json.JSONObject/**  *

Created by 郭飞 on 2016/5/31.

*/objectKafkaMessageGenerator{privatevalrandom =newRandom()privatevarpointer =-1privatevalos_type =Array("Android","IPhone OS","None","Windows Phone")defclick() :Double= {    random.nextInt(10)  }defgetOsType() :String= {    pointer = pointer +1if(pointer >= os_type.length) {      pointer =0os_type(pointer)    }else{      os_type(pointer)    }  }defmain(args:Array[String]):Unit= {valtopic ="user_events"//本地虚拟机ZK地址valbrokers ="hadoop1:9092,hadoop2:9092,hadoop3:9092"valprops =newProperties()    props.put("metadata.broker.list", brokers)    props.put("serializer.class","kafka.serializer.StringEncoder")valkafkaConfig =newProducerConfig(props)valproducer =newProducer[String,String](kafkaConfig)while(true) {// prepare event datavalevent =newJSONObject()      event        .put("uid",UUID.randomUUID())//随机生成用户id.put("event_time",System.currentTimeMillis.toString)//记录时间发生时间.put("os_type", getOsType)//设备类型.put("click_count", click)//点击次数// produce event messageproducer.send(newKeyedMessage[String,String](topic, event.toString))      println("Message sent: "+ event)Thread.sleep(200)    }  }}

Spark-Streaming主类

packageclickstreamimportkafka.serializer.StringDecoderimportnet.sf.json.JSONObjectimportorg.apache.hadoop.hbase.client.{HTable,Put}importorg.apache.hadoop.hbase.util.Bytesimportorg.apache.hadoop.hbase.{HBaseConfiguration,TableName}importorg.apache.spark.SparkConfimportorg.apache.spark.streaming.kafka.KafkaUtilsimportorg.apache.spark.streaming.{Seconds,StreamingContext}/**

* Created by 郭飞 on 2016/5/31.

*/objectPageViewStream{defmain(args:Array[String]):Unit= {varmasterUrl ="local[2]"if(args.length >0) {      masterUrl = args(0)    }// Create a StreamingContext with the given master URLvalconf =newSparkConf().setMaster(masterUrl).setAppName("PageViewStream")valssc =newStreamingContext(conf,Seconds(5))// Kafka configurationsvaltopics =Set("PageViewStream")//本地虚拟机ZK地址valbrokers ="hadoop1:9092,hadoop2:9092,hadoop3:9092"valkafkaParams =Map[String,String]("metadata.broker.list"-> brokers,"serializer.class"->"kafka.serializer.StringEncoder")// Create a direct streamvalkafkaStream =KafkaUtils.createDirectStream[String,String,StringDecoder,StringDecoder](ssc, kafkaParams, topics)valevents = kafkaStream.flatMap(line => {valdata =JSONObject.fromObject(line._2)Some(data)    })// Compute user click timesvaluserClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)    userClicks.foreachRDD(rdd => {      rdd.foreachPartition(partitionOfRecords => {        partitionOfRecords.foreach(pair => {//Hbase配置valtableName ="PageViewStream"valhbaseConf =HBaseConfiguration.create()          hbaseConf.set("hbase.zookeeper.quorum","hadoop1:9092")          hbaseConf.set("hbase.zookeeper.property.clientPort","2181")          hbaseConf.set("hbase.defaults.for.version.skip","true")//用户IDvaluid = pair._1//点击次数valclick = pair._2//组装数据valput =newPut(Bytes.toBytes(uid))          put.add("Stat".getBytes,"ClickStat".getBytes,Bytes.toBytes(click))valStatTable=newHTable(hbaseConf,TableName.valueOf(tableName))StatTable.setAutoFlush(false,false)//写入数据缓存StatTable.setWriteBufferSize(3*1024*1024)StatTable.put(put)//提交StatTable.flushCommits()        })      })    })    ssc.start()    ssc.awaitTermination()  }}

Maven POM文件

4.0.0com.guofei.sparkRiskControl1.0-SNAPSHOTjarRiskControlhttp://maven.apache.orgUTF-8org.apache.sparkspark-core_2.101.3.0org.apache.sparkspark-streaming_2.101.3.0org.apache.sparkspark-streaming-kafka_2.101.3.0org.apache.hbasehbase0.96.2-hadoop2pomorg.apache.hbasehbase-server0.96.2-hadoop2org.apache.hbasehbase-client0.96.2-hadoop2org.apache.hbasehbase-common0.96.2-hadoop2commons-iocommons-io1.3.2commons-loggingcommons-logging1.1.3log4jlog4j1.2.17com.google.protobufprotobuf-java2.5.0io.nettynetty3.6.6.Finalorg.apache.hbasehbase-protocol0.96.2-hadoop2org.apache.zookeeperzookeeper3.4.5org.cloudera.htracehtrace-core2.01org.codehaus.jacksonjackson-mapper-asl1.9.13org.codehaus.jacksonjackson-core-asl1.9.13org.codehaus.jacksonjackson-jaxrs1.9.13org.codehaus.jacksonjackson-xc1.9.13org.slf4jslf4j-api1.6.4org.slf4jslf4j-log4j121.6.4org.apache.hadoophadoop-client2.6.4commons-configurationcommons-configuration1.6org.apache.hadoophadoop-auth2.6.4org.apache.hadoophadoop-common2.6.4net.sf.json-libjson-lib2.4jdk15org.codehaus.jettisonjettison1.1redis.clientsjedis2.5.2org.apache.commonscommons-pool22.2src/main/scalasrc/test/scalanet.alchim31.mavenscala-maven-plugin3.2.2compiletestCompile-make:transitive-dependencyfile${project.build.directory}/.scala_dependenciesorg.apache.maven.pluginsmaven-shade-plugin2.4.3packageshade*:*META-INF/*.SFMETA-INF/*.DSAMETA-INF/*.RSA

FAQ

Maven导入json-lib报错

Failure to find net.sf.json-lib:json-lib:jar:2.3 in

http://repo.maven.apache.org/maven2was cached in the local

repository

解决:

http://stackoverflow.com/questions/4173214/maven-missing-net-sf-json-lib

net.sf.json-lib

json-lib

2.4

jdk15

执行Spark-Streaming程序报错

org.apache.spark.SparkException: Task not serializable

userClicks.foreachRDD(rdd=>{ rdd.foreachPartition(partitionOfRecords=>{ partitionOfRecords.foreach(这里面的代码中所包含的对象必须是序列化的这里面的代码中所包含的对象必须是序列化的这里面的代码中所包含的对象必须是序列化的}) }) })

执行Maven打包报错,找不到依赖的jar包

error:not found: object kafka

ERROR import kafka.javaapi.producer.Producer

解决:win10本地系统 用户/郭飞/.m2/ 目录含有中文

参考文档

spark-streaming官方文档

http://spark.apache.org/docs/latest/streaming-programming-guide.html

spark-streaming整合kafka官方文档

http://spark.apache.org/docs/latest/streaming-kafka-integration.html

spark-streaming整合flume官方文档

http://spark.apache.org/docs/latest/streaming-flume-integration.html

spark-streaming整合自定义数据源官方文档

http://spark.apache.org/docs/latest/streaming-custom-receivers.html

spark-streaming官方scala案例

https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming

简单之美博客

http://shiyanjun.cn/archives/1097.html

你可能感兴趣的:(基于Kafka+SparkStreaming+HBase实时点击流案例)