Kafka + Spark Streaming + Redis: A Real-Time System in Practice

Spark is a general-purpose computing platform that scales well across many kinds of applications, in particular because it ships with built-in libraries such as Spark Streaming, Spark SQL, MLlib, and GraphX. These libraries provide high-level abstractions that let you express complex computation in very little code, helped along by the conciseness of Scala. Here we build the computing platform on Spark 1.3.0 and implement real-time computation with Spark Streaming.

  Our use case is analyzing how users interact with a mobile app. The workflow is as follows:

  1. The mobile client collects user behavior events (click events in this example) and sends them to a data server; here we assume the data goes straight into a Kafka message queue.
  2. A backend real-time service consumes the data from Kafka and analyzes it as it arrives. We choose Spark Streaming because it ships with built-in Kafka integration.
  3. The Spark Streaming job writes its results to Redis, so user behavior data can be read in real time and also exported for offline, aggregate analysis.

Kafka + Spark Streaming + Redis Programming Practice

  Let's now implement the real-time application for the scenario described above. First, we wrote a Kafka producer simulator that continuously writes user behavior events to Kafka. The data is in JSON format, for example:

{
    "uid": "068b746ed4620d25e26055a9f804385f",
    "event_time": "1430204612405",
    "os_type": "Android",
    "click_count": 6
}

Each event contains four fields (a small illustrative Scala model follows this list):
  1. uid: the user ID
  2. event_time: the timestamp at which the event occurred
  3. os_type: the operating system type of the mobile app
  4. click_count: the number of clicks
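
  For reference only (this case class is an illustration and is not used by the code in this article), the event could be modeled in Scala as:

// Illustrative only: a Scala model of the JSON event described above
case class UserClickEvent(
  uid: String,        // user ID
  eventTime: String,  // event timestamp in milliseconds, kept as a string
  osType: String,     // mobile OS type, e.g. "Android"
  clickCount: Int     // number of clicks
)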
The producer implementation is shown below:

package com.iteblog.spark.streaming.utils

import java.util.Properties
import scala.util.Random
import org.codehaus.jettison.json.JSONObject
import kafka.javaapi.producer.Producer
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig

object KafkaEventProducer {

  private val users = Array(
      "4A4D769EB9679C054DE81B973ED5D768", "8dfeb5aaafc027d89349ac9a20b3930f",
      "011BBF43B89BFBF266C865DF0397AA71", "f2a8474bf7bd94f0aabbd4cdd2c06dcf",
      "068b746ed4620d25e26055a9f804385f", "97edfc08311c70143401745a03a50706",
      "d7f141563005d1b5d0d3dd30138f3f62", "c8ee90aade1671a21336c721512b817a",
      "6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")

  private val random = new Random()

  private var pointer = -1

  // Cycle through the simulated user IDs in round-robin order
  def getUserID(): String = {
    pointer = pointer + 1
    if (pointer >= users.length) {
      pointer = 0
      users(pointer)
    } else {
      users(pointer)
    }
  }

  // Simulate a click count between 0 and 9
  def click(): Double = {
    random.nextInt(10)
  }

  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --create --topic user_events --replication-factor 2 --partitions 2
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --list
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --describe user_events
  // bin/kafka-console-consumer.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --topic test_json_basis_event --from-beginning
  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", getUserID)
        .put("event_time", System.currentTimeMillis.toString)
        .put("os_type", "Android")
        .put("click_count", click)

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}

  The sleep interval near the end of the program controls how fast simulated events are written. Next we implement the real-time statistics: counting each user's clicks, i.e., accumulating click counts grouped by user. The logic itself is simple; the key is to watch out for a few implementation issues, such as object serialization. Let's look at the code first and discuss the details afterwards:

package org.shirdrn.spark.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val uid = pair._1
          val clickCount = pair._2
          val jedis = RedisClient.pool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          RedisClient.pool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

  The code above uses the Jedis client to access Redis and accumulates the grouped counts into Redis storage, so any other system that needs this data in real time can simply read it from Redis. The RedisClient implementation is as follows:

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool

object RedisClient extends Serializable {
  val redisHost = "10.10.4.130"
  val redisPort = 6379
  val redisTimeout = 30000
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  // Destroy the pool when the JVM exits
  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}

  We have run the code above successfully in both local[K] mode and Spark Standalone cluster mode.

  When debugging in a development environment we use the local[K] deployment mode, which starts K worker threads for the computation inside a single JVM instance. The code above defaults to local[K] when no argument is passed, so creating the Redis connection pool or connections is very likely to just work during debugging. However, when deployed with Spark Standalone, YARN client (or YARN cluster), or Mesos, the job fails, mainly because handling of the Redis connection pool or connections goes wrong. To see why, look at the Spark architecture, shown in the figure below (from the official Spark website):

[Figure: Spark cluster architecture (from the official Spark website)]

  Whether in local mode, Standalone mode, or on Mesos or YARN, the structure of a Spark cluster can be abstracted as in the figure above; only the runtime environments of the components differ, so the components may be distributed across machines, local, or co-located inside a single JVM instance. In local mode, for example, the figure collapses into multiple components inside a single process on one node; in YARN client mode, the Driver runs on a node outside the YARN cluster from which the Spark Application is submitted, while the remaining components run on nodes managed by YARN.

  Once the Application is deployed to a Spark cluster, the functions applied to RDD datasets (in Spark Streaming, the operations applied to DStreams) are shipped to Executors on the cluster's Worker nodes. The objects those functions operate on must therefore be serializable; otherwise they cannot be correctly deserialized and reconstructed into usable objects after being transferred across nodes. In Scala, a lazy reference can also be used to work around this. The code above uses a lazy reference, as shown below:

// lazy pool reference
lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
...
partitionOfRecords.foreach(pair => {
  val uid = pair._1
  val clickCount = pair._2
  val jedis = RedisClient.pool.getResource
  jedis.select(dbIndex)
  jedis.hincrBy(clickHashKey, uid, clickCount)
  RedisClient.pool.returnResource(jedis)
})
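
  For contrast, the pattern that fails on a cluster looks roughly like the hypothetical sketch below (illustrative only, not from the original code): a Jedis connection is created on the driver and captured by the closure, so Spark has to serialize it when shipping the closure to the Executors, and a Jedis connection is not serializable, so the job typically fails with a "Task not serializable" error.

import redis.clients.jedis.Jedis

// Anti-pattern (do not do this): the connection object lives on the driver...
val jedis = new Jedis("10.10.4.130", 6379)

userClicks.foreachRDD(rdd => {
  rdd.foreach { case (uid, clickCount) =>
    // ...but it is referenced inside the closure, so Spark must serialize it
    // to send the closure to the Executors, which fails at runtime
    jedis.hincrBy("app::users::click", uid, clickCount)
  }
})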

  Alternatively, we can move the management of the Redis connection inside the scope of the DStream output operation, since we know it will be initialized on a specific Executor, and manage it with a singleton object, as follows:

package org.shirdrn.spark.streaming

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import redis.clients.jedis.JedisPool

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {

          /**
           * Internal Redis client for managing Redis connection {@link Jedis} based on {@link RedisPool}
           */
          object InternalRedisClient extends Serializable {

            @transient private var pool: JedisPool = null

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int): Unit = {
              makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle, true, false, 10000)
            }

            def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
                maxTotal: Int, maxIdle: Int, minIdle: Int, testOnBorrow: Boolean,
                testOnReturn: Boolean, maxWaitMillis: Long): Unit = {
              if (pool == null) {
                val poolConfig = new GenericObjectPoolConfig()
                poolConfig.setMaxTotal(maxTotal)
                poolConfig.setMaxIdle(maxIdle)
                poolConfig.setMinIdle(minIdle)
                poolConfig.setTestOnBorrow(testOnBorrow)
                poolConfig.setTestOnReturn(testOnReturn)
                poolConfig.setMaxWaitMillis(maxWaitMillis)
                pool = new JedisPool(poolConfig, redisHost, redisPort, redisTimeout)

                val hook = new Thread {
                  override def run = pool.destroy()
                }
                sys.addShutdownHook(hook.run)
              }
            }

            def getPool: JedisPool = {
              assert(pool != null)
              pool
            }
          }

          // Redis configurations
          val maxTotal = 10
          val maxIdle = 10
          val minIdle = 1
          val redisHost = "10.10.4.130"
          val redisPort = 6379
          val redisTimeout = 30000
          val dbIndex = 1
          InternalRedisClient.makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle)

          val uid = pair._1
          val clickCount = pair._2
          val jedis = InternalRedisClient.getPool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          InternalRedisClient.getPool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

  This implementation takes advantage of a Scala language feature: a class or object can be defined anywhere in the code. By placing the Redis connection management inside the output operation itself, we avoid serializing a transient object across nodes. Doing this does require understanding how Spark operates on RDD datasets internally; see the RDD and Spark documentation for more details.
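
  A further refinement worth sketching (not from the original article): instead of borrowing and returning a connection for every single record, a connection can be borrowed once per partition inside foreachPartition and reused for all records in that partition, which reduces pool churn. A minimal sketch, assuming InternalRedisClient has been lifted to a top-level object in the application jar so that it is visible inside the partition closure, and that the Redis configuration values are in scope:

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // Initialize the pool on the executor; this is a no-op if it already exists
    InternalRedisClient.makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle)
    // Borrow one connection for the whole partition instead of one per record
    val jedis = InternalRedisClient.getPool.getResource
    jedis.select(dbIndex)
    partitionOfRecords.foreach { case (uid, clickCount) =>
      jedis.hincrBy(clickHashKey, uid, clickCount)
    }
    InternalRedisClient.getPool.returnResource(jedis)
  })
})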

  To run on the cluster in Standalone mode, execute the following commands:

cd /usr/local/spark
./bin/spark-submit --class org.shirdrn.spark.streaming.UserClickCountAnalytics \
            --master spark://hadoop1:7077 \
            --executor-memory 1G \
            --total-executor-cores 2 \
            ~/spark-0.0.1-SNAPSHOT.jar spark://hadoop1:7077

  You can then check the status of the computation tasks running on each Worker node in the cluster, which is most conveniently done through the Web UI.

  Now let's look at the results stored in Redis:

127.0.0.1:6379[1]> HGETALL app::users::click
 1) "4A4D769EB9679C054DE81B973ED5D768"
 2) "7037"
 3) "8dfeb5aaafc027d89349ac9a20b3930f"
 4) "6992"
 5) "011BBF43B89BFBF266C865DF0397AA71"
 6) "7021"
 7) "97edfc08311c70143401745a03a50706"
 8) "6874"
 9) "d7f141563005d1b5d0d3dd30138f3f62"
10) "7057"
11) "a95f22eabc4fd4b580c011a3161a9d9d"
12) "7092"
13) "6b67c8c700427dee7552f81f3228c927"
14) "7266"
15) "f2a8474bf7bd94f0aabbd4cdd2c06dcf"
16) "7188"
17) "c8ee90aade1671a21336c721512b817a"
18) "6950"
19) "068b746ed4620d25e26055a9f804385f"
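
  As mentioned earlier, any other system can pick these results up straight from Redis. A minimal, illustrative Jedis sketch of such a reader (not part of the original code; the host, port, database index, and hash key mirror the configuration used above):

import scala.collection.JavaConverters._
import redis.clients.jedis.Jedis

object ClickCountReader {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("10.10.4.130", 6379)
    jedis.select(1) // the same dbIndex the streaming job writes to
    // HGETALL returns a java.util.Map of uid -> accumulated click count
    val counts = jedis.hgetAll("app::users::click").asScala
    counts.foreach { case (uid, clicks) => println(s"$uid -> $clicks") }
    jedis.close()
  }
}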

POM File and Dependencies

  For reference, here are the dependencies used by the application developed above, along with the Maven configuration for packaging the Spark Streaming application. If you use the maven-shade-plugin and its configuration is wrong, submitting the packaged Application to the Spark cluster may fail with "Invalid signature file digest for Manifest main attributes". A reference Maven configuration is shown below:

001 ="http://maven.apache.org/POM/4.0.0"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
002      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0http://maven.apache.org/xsd/maven-4.0.0.xsd">
003      4.0.0
004      org.shirdrn.spark
005      spark
006      0.0.1-SNAPSHOT
007   
008      
009           
010                org.apache.spark
011                spark-core_2.10
012                1.3.0
013           
014           
015                org.apache.spark
016                spark-streaming_2.10
017                1.3.0
018           
019           
020                net.sf.json-lib
021                json-lib
022                2.3
023           
024           
025                org.apache.spark
026                spark-streaming-kafka_2.10
027                1.3.0
028           
029           
030                redis.clients
031                jedis
032                2.5.2
033           
034           
035                org.apache.commons
036                commons-pool2
037                2.2
038           
039      
040   
041      
042           ${basedir}/src/main/scala
043           ${basedir}/src/test/scala
044           
045                
046                     ${basedir}/src/main/resources
047                
048           
049           
050                
051                     ${basedir}/src/test/resources
052                
053           
054           
055                
056                     maven-compiler-plugin
057                     3.1
058                     
059                          1.6
060                          1.6
061                     
062                
063                
064                     org.apache.maven.plugins
065                     maven-shade-plugin
066                     2.2
067                     
068                          true
069                     
070                     
071                          
072                               package
073                               
074                                    shade
075                               
076                               
077                                    
078                                         
079                                              *:*
080                                         
081                                    
082                                    
083                                         
084                                              *:*
085                                              
086                                                   META-INF/*.SF
087                                                   META-INF/*.DSA
088                                                   META-INF/*.RSA
089                                              
090                                         
091                                    
092                                    
093                                         
094                                              implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
095                                         
096                                              implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
097                                              reference.conf
098                                         
099                                         
100                                              implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
101                                              log4j.properties
102                                         
103                                    
104                               
105                          
106                     
107                
108           
109      
110
