A Complete Spark Streaming Example

Summary

This post implements a small Spark Streaming example. The overall flow: read data from Kafka in real time, compute pv, uv, and sum(money), and write the results to Redis. Expressed in SQL, the computation is roughly:

select time, page, count(*) pv, count(distinct user) uv, sum(money) from test group by page, time

Sample data format:

user,page,money,time

smith,iphone4.html,578.02,1500618981283
andrew,mac.html,277.62,1500618981285
smith,note.html,388.56,1500618981285

Pushing data to Kafka

Start Kafka
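
A typical local, single-broker startup sequence looks like the following (the install-relative paths and the test topic name are assumptions; adjust them to your environment):

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test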

 

Generating data

package com.fan.spark.stream
 
import java.text.DecimalFormat
import java.util.Properties
 
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
 
import scala.util.Random
/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object ProduceMessage {
 
  def main(args: Array[String]): Unit = {
 
    val props = new Properties()
    props.put("bootstrap.servers","localhost:9092")
    props.put("acks","all")
    props.put("retries","0")
    props.put("batch.size","16384")
    props.put("linger.ms","1")
    props.put("buffer.memory","33554432")
    props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer","org.apache.kafka.common.serialization.StringSerializer")
 
    val producer = new KafkaProducer[String, String](props)
 
    val users = Array("jack","leo","andy","lucy","jim","smith","iverson","andrew")
    val pages = Array("iphone4.html","huawei.html","mi.html","mac.html","note.html","book.html","fanlegefan.com")
    val df = new DecimalFormat("#.00")
    val random = new Random()
    val num = 10
    for (i <- 0 to num) {
      val message = users(random.nextInt(users.length)) + "," + pages(random.nextInt(pages.length)) +
        "," + df.format(random.nextDouble() * 1000) + "," + System.currentTimeMillis()
      producer.send(new ProducerRecord[String, String]("test", Integer.toString(i), message))
      println(message)
    }
    producer.close()
  }
}

 

Consuming from the console shows:

 

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
andrew,book.html,309.58,1500620213384
jack,book.html,954.01,1500620213456
iverson,book.html,823.07,1500620213456
iverson,iphone4.html,486.76,1500620213456
lucy,book.html,14.00,1500620213457
iverson,note.html,206.30,1500620213457
jack,book.html,25.30,1500620213457
jim,iphone4.html,513.82,1500620213457
lucy,mac.html,677.29,1500620213457
smith,mi.html,571.30,1500620213457
lucy,iphone4.html,113.83,1500620213457

Computing pv, uv, and the running total

Since the results go into Redis, the Redis client helper code is as follows:

 

package com.fan.spark.stream
 
import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool
 
/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object RedisClient {
  val redisHost = "127.0.0.1"
  val redisPort = 6379
  val redisTimeout = 30000
 
  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)
  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
 
  sys.addShutdownHook(hook.run)
}

Spark Streaming processes data in batches. With batchDuration = 10, for example, each batch covers the data received in the last 10 seconds. For pv we can simply count each batch and keep accumulating. For uv it is harder: a user seen in this 10-second batch may also have appeared in earlier batches, and since Spark only sees one batch at a time it has no way of knowing whether that user was already counted. Naive accumulation would make the daily uv far larger than the real value. HyperLogLog solves this, and Redis already provides it out of the box; here is a quick example:

 

redis 127.0.0.1:6379> PFADD mykey a b c d e f g h i j
(integer) 1
redis 127.0.0.1:6379> PFCOUNT mykey
(integer) 10

 

Think of a b c d e f g h i j as users: for every incoming user we run PFADD, and PFCOUNT on the key returns the deduplicated uv directly. Note that the algorithm is approximate; according to the Redis documentation the error is roughly 0.8%, which is acceptable for uv counting. You can measure the actual error yourself; I won't do that here.
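
The same idea in Scala through Jedis, as a minimal sketch (it reuses the RedisClient pool defined above; the demo_uv key is just an illustration):

package com.fan.spark.stream

object HyperLogLogDemo {
  def main(args: Array[String]): Unit = {
    val jedis = RedisClient.pool.getResource
    // duplicate users do not increase the HyperLogLog count
    Seq("jack", "leo", "jack", "andy").foreach(user => jedis.pfadd("demo_uv", user))
    // approximate distinct count, prints 3 here
    println(jedis.pfcount("demo_uv"))
    jedis.close() // return the connection to the pool
  }
}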


The real-time computation code is as follows:

 

package com.fan.spark.stream
 
import java.text.SimpleDateFormat
import java.util.Date
 
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}
 
/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object UserActionStreaming {
 
  def main(args: Array[String]): Unit = {
    val df = new SimpleDateFormat("yyyyMMdd")
    val group = "test"
    val topics = "test"
 
    val sparkConf = new SparkConf().setAppName("pvuv").setMaster("local[3]")
 
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/home/work/IdeaProjects/sparklearn/checkpoint")
 
    val topicSets = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",
      "group.id" -> group
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
      kafkaParams, topicSets)
    stream.foreachRDD(rdd => rdd.foreachPartition(partition => {
      val jedis = RedisClient.pool.getResource
      partition.foreach(tuple => {
        val line = tuple._2
        val arr = line.split(",")
        val user = arr(0)
        val page = arr(1)
        val money = arr(2)
        val day = df.format(new Date(arr(3).toLong))
        // uv: add the user to this day's HyperLogLog for the page
        jedis.pfadd(day + "_" + page, user)
        // pv: increment the page counter in this day's hash
        jedis.hincrBy(day + "_pv", page, 1)
        // sum: accumulate the amount for the page
        jedis.hincrByFloat(day + "_sum", page, money.toDouble)
      })
      jedis.close() // return the connection to the pool
    }))
    ssc.start()
    ssc.awaitTermination()
  }
}
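
For reference, the streaming job above uses the Kafka 0.8 direct-stream API (kafka.serializer.StringDecoder with KafkaUtils.createDirectStream). A build.sbt along these lines pulls in the required dependencies; the exact versions below are assumptions and should be matched to your cluster:

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0",
  "org.apache.spark" %% "spark-streaming" % "2.1.0",
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.1.0",
  "org.apache.kafka" % "kafka-clients" % "0.10.0.0",  // for the ProduceMessage producer
  "redis.clients" % "jedis" % "2.9.0"
)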

Check the results in Redis

 

127.0.0.1:6379> keys *
1)"20170721_note.html"
2)"20170721_book.html"
3)"20170721_fanlegefan.com"
4)"20170721_mac.html"
5)"20170721_pv"
6)"20170721_mi.html"
7)"20170721_iphone4.html"
8)"20170721_sum"
9)"20170721_huawei.html"

 

Check pv

127.0.0.1:6379> HGETALL 20170721_pv
 1)"mi.html"
 2)"112"
 3)"note.html"
 4)"107"
 5)"fanlegefan.com"
 6)"124"
 7)"huawei.html"
 8)"122"
 9)"iphone4.html"
10)"92"
11)"mac.html"
12)"103"
13)"book.html"
14)"135"

Check sum (the long trailing digits are floating-point rounding noise accumulated by the repeated hincrByFloat additions)

 

127.0.0.1:6379> HGETALL 20170721_sum
 1)"mi.html"
 2)"56949.65999999999998948"
 3)"note.html"
 4)"56803.50999999999999801"
 5)"fanlegefan.com"
 6)"59622.50999999999999801"
 7)"huawei.html"
 8)"64456.50000000000000711"
 9)"iphone4.html"
10)"48643.07000000000001094"
11)"mac.html"
12)"51693.17999999999998906"
13)"book.html"
14)"67724.17999999999999261"

Check uv: the test data only contains 8 users, so every uv is 8.

 

127.0.0.1:6379> PFCOUNT 20170721_huawei.html
(integer) 8
127.0.0.1:6379> PFCOUNT 20170721_fanlegefan.com
(integer) 8

The data is now in Redis. A scheduled job can push it to MySQL so the frontend can display it. That is roughly the idea behind this kind of real-time computation.
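
As an illustration only, a minimal sketch of such a job is below; the MySQL URL, credentials, and the page_stat(day, page, pv, uv, total) table are hypothetical, and it assumes mysql-connector-java is on the classpath:

package com.fan.spark.stream

import java.sql.DriverManager

import scala.collection.JavaConverters._

object ExportToMysql {
  def main(args: Array[String]): Unit = {
    val day = "20170721"
    val jedis = RedisClient.pool.getResource
    // hypothetical MySQL connection and table
    val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "root")
    val stmt = conn.prepareStatement(
      "replace into page_stat(day, page, pv, uv, total) values (?, ?, ?, ?, ?)")
    val pv = jedis.hgetAll(day + "_pv").asScala
    val sum = jedis.hgetAll(day + "_sum").asScala
    for ((page, count) <- pv) {
      stmt.setString(1, day)
      stmt.setString(2, page)
      stmt.setLong(3, count.toLong)
      stmt.setLong(4, jedis.pfcount(day + "_" + page)) // deduplicated uv from the HyperLogLog
      stmt.setDouble(5, sum.getOrElse(page, "0").toDouble)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
    jedis.close()
  }
}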

 
