Our scenario is analyzing how users behave inside a mobile app, described as follows:
1. The mobile client collects user behavior events (we take click events as the example) and sends them to the data servers; we assume here that the data goes straight into a Kafka message queue.
2. A real-time backend service consumes the data from Kafka and analyzes it as it arrives. We choose Spark Streaming because it ships with built-in support for integrating with Kafka.
3. The Spark Streaming job writes its results to Redis, so user behavior data is available in real time and can also be exported for offline, aggregate analysis.
Next, we implement a real-time computation application for the scenario above. First, we wrote a Kafka producer that simulates writing user behavior events to Kafka in real time. The data is JSON; an example event looks like this:
{
    "uid": "068b746ed4620d25e26055a9f804385f",
    "event_time": "1430204612405",
    "os_type": "Android",
    "click_count": 6
}
An event contains four fields:
1. uid: user ID
2. event_time: timestamp of the event
3. os_type: mobile operating system type
4. click_count: number of clicks
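For reference, the event schema could be modeled as a simple Scala case class. This is only a sketch with field types inferred from the JSON example above; the producer and the streaming job below work with raw JSON and do not use this class:

// Sketch only (not part of the original code): the event schema as a case class,
// with field types inferred from the JSON example above.
case class ClickEvent(
  uid: String,        // user ID
  eventTime: String,  // event timestamp (milliseconds since epoch, sent as a string)
  osType: String,     // mobile OS type, e.g. "Android"
  clickCount: Int     // number of clicks in this event
)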
The producer is implemented as follows:
package com.iteblog.spark.streaming.utils

import java.util.Properties
import scala.util.Random

import org.codehaus.jettison.json.JSONObject

import kafka.javaapi.producer.Producer
import kafka.producer.KeyedMessage
import kafka.producer.ProducerConfig

object KafkaEventProducer {

  // Pool of fake user IDs to draw events from
  private val users = Array(
    "4A4D769EB9679C054DE81B973ED5D768", "8dfeb5aaafc027d89349ac9a20b3930f",
    "011BBF43B89BFBF266C865DF0397AA71", "f2a8474bf7bd94f0aabbd4cdd2c06dcf",
    "068b746ed4620d25e26055a9f804385f", "97edfc08311c70143401745a03a50706",
    "d7f141563005d1b5d0d3dd30138f3f62", "c8ee90aade1671a21336c721512b817a",
    "6b67c8c700427dee7552f81f3228c927", "a95f22eabc4fd4b580c011a3161a9d9d")

  private val random = new Random()

  private var pointer = -1

  // Cycle through the user IDs round-robin
  def getUserID(): String = {
    pointer = pointer + 1
    if (pointer >= users.length) {
      pointer = 0
    }
    users(pointer)
  }

  // Random click count between 0 and 9
  def click(): Int = {
    random.nextInt(10)
  }

  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --create --topic user_events --replication-factor 2 --partitions 2
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --list
  // bin/kafka-topics.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --describe user_events
  // bin/kafka-console-consumer.sh --zookeeper zk1:2181,zk2:2181,zk3:2181/kafka --topic test_json_basis_event --from-beginning

  def main(args: Array[String]): Unit = {
    val topic = "user_events"
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val props = new Properties()
    props.put("metadata.broker.list", brokers)
    props.put("serializer.class", "kafka.serializer.StringEncoder")

    val kafkaConfig = new ProducerConfig(props)
    val producer = new Producer[String, String](kafkaConfig)

    while (true) {
      // prepare event data
      val event = new JSONObject()
      event
        .put("uid", getUserID)
        .put("event_time", System.currentTimeMillis.toString)
        .put("os_type", "Android")
        .put("click_count", click)

      // produce event message
      producer.send(new KeyedMessage[String, String](topic, event.toString))
      println("Message sent: " + event)

      Thread.sleep(200)
    }
  }
}
The write rate of the simulator is controlled by the sleep interval on the last line of the loop. Next we implement the real-time computation that counts each user's clicks: events are grouped by user and their click counts accumulated. The logic itself is simple; what matters are a few implementation details, such as object serialization, which we discuss after the code. The implementation is as follows:
package org.shirdrn.spark.streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    // Parse each Kafka message value into a JSON object
    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {
        partitionOfRecords.foreach(pair => {
          val uid = pair._1
          val clickCount = pair._2
          val jedis = RedisClient.pool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          RedisClient.pool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
The code above uses the Jedis client to operate on Redis, accumulating the per-user counts into Redis; any other system that needs this data in real time can read it directly from Redis. RedisClient is implemented as follows:
package org.shirdrn.spark.streaming

import org.apache.commons.pool2.impl.GenericObjectPoolConfig

import redis.clients.jedis.JedisPool

object RedisClient extends Serializable {
  val redisHost = "10.10.4.130"
  val redisPort = 6379
  val redisTimeout = 30000
  // Lazy val: the pool is created on first use inside whichever JVM touches it
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  // Destroy the pool when the JVM shuts down
  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}
The code above ran successfully in both local[K] mode and on a Spark Standalone cluster.
When debugging in a development environment we use the local[K] deploy mode, which starts K worker threads for the computation inside a single JVM. The code above defaults to local[K] when no argument is passed, so creating the Redis connection pool or connections tends to just work during local debugging. Under the Spark Standalone, YARN Client (or YARN Cluster), or Mesos deploy modes, however, the job fails, and the failure comes from how the Redis connection pool or connections are handled. To see why, look at the Spark architecture, shown in the figure below (taken from the official website):
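To make the failure concrete, here is a minimal sketch (hypothetical code, not taken from the program above) of the pattern that breaks in cluster mode: the connection is created on the driver, but the closure that uses it is serialized and shipped to the executors, and a Jedis connection is not serializable:

import redis.clients.jedis.Jedis

val jedis = new Jedis("10.10.4.130", 6379)   // created once, on the driver

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    partition.foreach { case (uid, clickCount) =>
      // The closure captures the driver-side connection, so task serialization
      // typically fails with "Task not serializable"
      // (java.io.NotSerializableException: redis.clients.jedis.Jedis).
      jedis.hincrBy("app::users::click", uid, clickCount)
    }
  })
})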
Whether Spark runs in local mode, Standalone mode, or on Mesos or YARN, the structure of the whole cluster can be abstracted as in the figure above; only the runtime environment of each component differs, so a component may be distributed across nodes, local, or confined to a single JVM instance. In local mode, for example, the figure collapses to several components inside one process on a single node; in YARN Client mode, the Driver program runs on a node outside the YARN cluster from which the Spark Application is submitted, while the other components run on nodes managed by YARN.
Once an Application is deployed on a Spark cluster, the functions applied to the RDDs (in Spark Streaming, the operations applied to the DStreams) are shipped to the Executors on the cluster's Workers, so the objects those functions operate on must be serializable; otherwise, after being transferred across nodes, they cannot be deserialized and reconstructed into usable objects. In Scala this can also be solved with a lazy reference, which is what the code above does:
// lazy pool reference
lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
...
partitionOfRecords.foreach(pair => {
  val uid = pair._1
  val clickCount = pair._2
  val jedis = RedisClient.pool.getResource
  jedis.select(dbIndex)
  jedis.hincrBy(clickHashKey, uid, clickCount)
  RedisClient.pool.returnResource(jedis)
})
Alternatively, we can change the code so that management of the Redis connections happens entirely inside the scope of the DStream output operation, because we know that code is initialized on a specific Executor; a singleton object manages the pool there, as shown below:
package org.shirdrn.spark.streaming

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream.toPairDStreamFunctions
import org.apache.spark.streaming.kafka.KafkaUtils

import kafka.serializer.StringDecoder
import net.sf.json.JSONObject
import redis.clients.jedis.JedisPool

object UserClickCountAnalytics {

  def main(args: Array[String]): Unit = {
    var masterUrl = "local[1]"
    if (args.length > 0) {
      masterUrl = args(0)
    }

    // Create a StreamingContext with the given master URL
    val conf = new SparkConf().setMaster(masterUrl).setAppName("UserClickCountStat")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka configurations
    val topics = Set("user_events")
    val brokers = "10.10.4.126:9092,10.10.4.127:9092"
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> brokers, "serializer.class" -> "kafka.serializer.StringEncoder")

    val dbIndex = 1
    val clickHashKey = "app::users::click"

    // Create a direct stream
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

    val events = kafkaStream.flatMap(line => {
      val data = JSONObject.fromObject(line._2)
      Some(data)
    })

    // Compute user click times
    val userClicks = events.map(x => (x.getString("uid"), x.getInt("click_count"))).reduceByKey(_ + _)
    userClicks.foreachRDD(rdd => {
      rdd.foreachPartition(partitionOfRecords => {

        /**
         * Internal Redis client for managing a Redis connection {@link Jedis} based on a {@link JedisPool}
         */
        object InternalRedisClient extends Serializable {

          @transient private var pool: JedisPool = null

          def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
              maxTotal: Int, maxIdle: Int, minIdle: Int): Unit = {
            makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle, true, false, 10000)
          }

          def makePool(redisHost: String, redisPort: Int, redisTimeout: Int,
              maxTotal: Int, maxIdle: Int, minIdle: Int, testOnBorrow: Boolean,
              testOnReturn: Boolean, maxWaitMillis: Long): Unit = {
            if (pool == null) {
              val poolConfig = new GenericObjectPoolConfig()
              poolConfig.setMaxTotal(maxTotal)
              poolConfig.setMaxIdle(maxIdle)
              poolConfig.setMinIdle(minIdle)
              poolConfig.setTestOnBorrow(testOnBorrow)
              poolConfig.setTestOnReturn(testOnReturn)
              poolConfig.setMaxWaitMillis(maxWaitMillis)
              pool = new JedisPool(poolConfig, redisHost, redisPort, redisTimeout)

              val hook = new Thread {
                override def run = pool.destroy()
              }
              sys.addShutdownHook(hook.run)
            }
          }

          def getPool: JedisPool = {
            assert(pool != null)
            pool
          }
        }

        // Redis configurations
        val maxTotal = 10
        val maxIdle = 10
        val minIdle = 1
        val redisHost = "10.10.4.130"
        val redisPort = 6379
        val redisTimeout = 30000
        // Initialize the pool once per partition, on the executor that processes it
        InternalRedisClient.makePool(redisHost, redisPort, redisTimeout, maxTotal, maxIdle, minIdle)

        partitionOfRecords.foreach(pair => {
          val uid = pair._1
          val clickCount = pair._2
          val jedis = InternalRedisClient.getPool.getResource
          jedis.select(dbIndex)
          jedis.hincrBy(clickHashKey, uid, clickCount)
          InternalRedisClient.getPool.returnResource(jedis)
        })
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
This implementation takes advantage of a Scala feature: classes and objects can be defined anywhere in the code. By placing the code that manages the Redis connections inside the output operation itself, we avoid serializing a transient object across nodes. Doing this requires some understanding of how Spark operates on RDD datasets internally; see the RDD and Spark documentation for more details.
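For comparison, a common alternative is to skip the pool entirely and open one Jedis connection per partition inside foreachPartition, closing it once the partition has been processed. This is only a sketch, not part of the article's code, and it assumes that userClicks, redisHost, redisPort, dbIndex and clickHashKey are in scope as in the listings above:

import redis.clients.jedis.Jedis

userClicks.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // One connection per partition, created on the executor that processes it
    val jedis = new Jedis(redisHost, redisPort)
    jedis.select(dbIndex)
    partitionOfRecords.foreach { case (uid, clickCount) =>
      jedis.hincrBy(clickHashKey, uid, clickCount)
    }
    // Release the connection when the partition is done
    jedis.close()
  })
})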
To run on the cluster in Standalone mode, execute the following command (the trailing spark://hadoop1:7077 is the application argument that the program uses as its master URL):
cd /usr/local/spark
./bin/spark-submit --class org.shirdrn.spark.streaming.UserClickCountAnalytics \
  --master spark://hadoop1:7077 \
  --executor-memory 1G \
  --total-executor-cores 2 \
  ~/spark-0.0.SNAPSHOT.jar spark://hadoop1:7077
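If the job were instead submitted in YARN Client mode (one of the deploy modes discussed above), the command would look roughly like the following; the executor count here is an illustrative assumption:

./bin/spark-submit --class org.shirdrn.spark.streaming.UserClickCountAnalytics \
  --master yarn-client \
  --executor-memory 1G \
  --num-executors 2 \
  ~/spark-0.0.SNAPSHOT.jar yarn-client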
You can then check the status of the tasks running on each Worker node in the cluster, most conveniently through the web UI.
Now let us look at the results stored in Redis:
127.0.0.1:6379[1]> HGETALL app::users::click
 1) "4A4D769EB9679C054DE81B973ED5D768"
 2) "7037"
 3) "8dfeb5aaafc027d89349ac9a20b3930f"
 4) "6992"
 5) "011BBF43B89BFBF266C865DF0397AA71"
 6) "7021"
 7) "97edfc08311c70143401745a03a50706"
 8) "6874"
 9) "d7f141563005d1b5d0d3dd30138f3f62"
10) "7057"
11) "a95f22eabc4fd4b580c011a3161a9d9d"
12) "7092"
13) "6b67c8c700427dee7552f81f3228c927"
14) "7266"
15) "f2a8474bf7bd94f0aabbd4cdd2c06dcf"
16) "7188"
17) "c8ee90aade1671a21336c721512b817a"
18) "6950"
19) "068b746ed4620d25e26055a9f804385f"
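Any downstream system can read these counts straight from the Redis hash. As an example, a minimal standalone reader using the same Jedis client might look like the following (a sketch, not part of the original article; the host, port, database index and key are the ones used above):

import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

object ClickCountReader {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("10.10.4.130", 6379)
    jedis.select(1)                                          // the dbIndex the streaming job writes to
    val clicks = jedis.hgetAll("app::users::click").asScala  // uid -> accumulated click count (as strings)
    clicks.foreach { case (uid, count) => println(uid + " -> " + count) }
    jedis.close()
  }
}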
Finally, for reference, here are the dependencies of the application developed above and the Maven configuration used to package the Spark Streaming application. If the maven-shade-plugin is misconfigured, submitting the packaged Application to the Spark cluster may fail with "Invalid signature file digest for Manifest main attributes". A reference Maven configuration along these lines is shown below (the project coordinates are placeholders):
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.shirdrn</groupId>
  <artifactId>spark</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.3</version>
      <classifier>jdk15</classifier>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-pool2</artifactId>
      <version>2.2</version>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <filter>
                  <!-- Strip signature files so the shaded jar does not fail with
                       "Invalid signature file digest for Manifest main attributes" -->
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>reference.conf</resource>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                  <resource>log4j.properties</resource>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>