Big Data Hands-On Project -- Real-Time Data Warehouse

Table of Contents

  • 1. Real-Time Data
    • 1.1 Log Collector
    • 1.2 Log Generator
    • 1.3 Log Dispatcher
    • 1.4 Collection Pipeline Script
  • 2. Real-Time Ingestion
    • 2.1 Project Setup
    • 2.2 Reading Data from Kafka
    • 2.3 Deduplication with Redis
    • 2.4 Storing Data in Elasticsearch
    • 2.5 Exactly-Once Consumption
    • 2.6 Kibana Visualization
    • 2.7 Publishing a Data API
  • 3. Real-Time Monitoring
    • 3.1 Canal
      • 3.1.1 Configuring MySQL
      • 3.1.2 Installing Canal
    • 3.2 Canal ODS-Layer Data Splitting
    • 3.3 Maxwell
    • 3.4 Maxwell ODS-Layer Data Splitting

1. Real-Time Data

(Figure 1)

1.1 Log Collector

  • Create a new Spring Boot web project

On https://start.spring.io/, check the Lombok, Spring Web, and Spring for Apache Kafka dependencies.

(Figure 2)

  • Prepare the POM file

On top of the generated POM, add a JSON utility (fastjson).


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>gmall</artifactId>
        <groupId>com.simwor</groupId>
        <version>0.0.1-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>logger</artifactId>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.kafka</groupId>
            <artifactId>spring-kafka</artifactId>
        </dependency>

        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
            <exclusions>
                <exclusion>
                    <groupId>org.junit.vintage</groupId>
                    <artifactId>junit-vintage-engine</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.springframework.kafka</groupId>
            <artifactId>spring-kafka-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <excludes>
                        <exclude>
                            <groupId>org.projectlombok</groupId>
                            <artifactId>lombok</artifactId>
                        </exclude>
                    </excludes>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
  • Write the log collection controller
  It does two things: 1. route each log to the proper Kafka topic; 2. write the log to local disk.
package com.simwor.gmall.controller;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@Slf4j
public class LoggerController {
    @Autowired
    private KafkaTemplate<String,String> kafkaTemplate;

    @RequestMapping("/applog")
    public String appLog(@RequestBody String applog) {
        // Route startup logs and event logs to separate Kafka topics
        JSONObject jsonObject = JSON.parseObject(applog);
        if(jsonObject.getString("start") != null && jsonObject.getString("start").length() > 0)
            kafkaTemplate.send("gmall-start-log", applog);
        else
            kafkaTemplate.send("gmall-event-log", applog);

        // Write the raw log line to disk via Logback (see logback.xml)
        log.info(applog);
        return applog;
    }
}
  • Prepare the log-to-disk configuration file logback.xml

<configuration>
    <property name="LOG_HOME" value="/opt/applog/logs" />
    <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>

    <appender name="rollingFile" class="ch.qos.logback.core.rolling.RollingFileAppender">
        <file>${LOG_HOME}/app.log</file>
        <rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
            <fileNamePattern>${LOG_HOME}/app.%d{yyyy-MM-dd}.log</fileNamePattern>
        </rollingPolicy>
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>

    <logger name="com.simwor.gmall.controller.LoggerController"
            level="INFO" additivity="false">
        <appender-ref ref="rollingFile" />
        <appender-ref ref="console" />
    </logger>

    <root level="error" additivity="false">
        <appender-ref ref="console" />
    </root>
</configuration>
  • Prepare the application configuration file application.properties
#============== kafka ===================
# Kafka broker addresses (multiple brokers allowed)
spring.kafka.bootstrap-servers=simwor01:9092,simwor02:9092,simwor03:9092
# Serializers for message keys and values
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.apache.kafka.common.serialization.StringSerializer
  • Run and verify
  1. Package and run

(Figure 3)

  2. Send a message

(Figure 4)

  3. Verify

(Figure 5)
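Beyond the packaged run shown above, the endpoint can also be smoke-tested by hand. This is only a sketch: the host, port, and sample payload are assumptions based on the setup in this section.

# Post a minimal startup-style log to the collector
curl -X POST 'http://simwor01:8080/applog' \
     -H 'Content-Type: application/json' \
     -d '{"common":{"mid":"mid_test"},"start":{"entry":"manual"},"ts":1623810190000}'

# The same line should appear in /opt/applog/logs/app.log and on the gmall-start-log topic
kafka-console-consumer.sh --bootstrap-server simwor01:9092 --topic gmall-start-log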

1.2 Log Generator

The log generator continuously sends requests to the log collector, simulating the gmall-start-log and gmall-event-log formats.

[omm@simwor01 mock-log]$ ll
-rw-r--r--. 1 omm omm      610 Jun 16 10:16 application.properties
-rw-r--r--. 1 omm omm 11114569 Jun 13  2020 gmall2020-mock-log-2020-05-10.jar
-rw-r--r--. 1 omm omm     3211 Jun 16 10:17 logback.xml
-rw-r--r--. 1 omm omm      493 Mar 19  2020 path.json

[omm@simwor01 mock-log]$ java -jar gmall2020-mock-log-2020-05-10.jar 
...
{"common":{"ar":"110000","ba":"Xiaomi","ch":"web","md":"Xiaomi 9","mid":"mid_35","os":"Android 9.0","uid":"60","vc":"v2.1.134"},"start":{"entry":"notice","loading_time":9558,"open_ad_id":19,"open_ad_ms":8081,"open_ad_skip_ms":0},"ts":1623810190000}
{"common":{"ar":"110000","ba":"Xiaomi","ch":"web","md":"Xiaomi 9","mid":"mid_35","os":"Android 9.0","uid":"60","vc":"v2.1.134"},"displays":[{"display_type":"activity","item":"2","item_type":"activity_id","order":1},{"display_type":"query","item":"9","item_type":"sku_id","order":2},{"display_type":"query","item":"10","item_type":"sku_id","order":3},{"display_type":"query","item":"5","item_type":"sku_id","order":4},{"display_type":"query","item":"7","item_type":"sku_id","order":5},{"display_type":"query","item":"1","item_type":"sku_id","order":6},{"display_type":"query","item":"8","item_type":"sku_id","order":7},{"display_type":"promotion","item":"8","item_type":"sku_id","order":8},{"display_type":"query","item":"3","item_type":"sku_id","order":9},{"display_type":"promotion","item":"2","item_type":"sku_id","order":10}],"page":{"during_time":18544,"page_id":"home"},"ts":1623810199558}
...

The business date of the generated logs and the target URL of the requests are configurable.

[omm@simwor01 mock-log]$ head application.properties 

#business date
mock.date=2021-06-16

#send mode of the mock data
mock.type=http
#target URL in http mode
mock.url=http://localhost:8080/applog

[omm@simwor01 mock-log]$ 

1.3 Log Dispatcher

The log dispatcher is an Nginx instance that evenly distributes the log generator's requests across multiple backend log collectors.

  • Configure Nginx
[root@simwor01 conf.d]# pwd
/etc/nginx/conf.d
[root@simwor01 conf.d]# cat applog.conf 
upstream applog {
  server simwor01:8080;
  server simwor02:8080;
  server simwor03:8080;
}

server {
  listen 80;
  server_name localhost;
  location / {
    proxy_pass http://applog;
  }
}
[root@simwor01 conf.d]# 
  • Point the log generator at the dispatcher
[omm@simwor01 mock-log]$ head application.properties 

#business date
mock.date=2021-06-16

#send mode of the mock data
mock.type=http
#target URL in http mode
mock.url=http://localhost/applog

[omm@simwor01 mock-log]$ 
  • Verify

(Figure 6)

1.4 Collection Pipeline Script

#!/bin/bash
JAVA_BIN=/opt/module/jdk/bin/java
PROJECT=/opt/applog/logger
APPNAME=logger-0.0.1-SNAPSHOT.jar
 
case $1 in
 "start")
   {
    for i in simwor01 simwor02 simwor03
    do
     echo "========: $i==============="
    ssh $i  "$JAVA_BIN -Xms32m -Xmx64m  -jar $PROJECT/$APPNAME >/dev/null 2>&1  &"
    done
     echo "========NGINX==============="
    sudo systemctl start nginx
  };;
  "stop")
  { 
     echo "======== NGINX==============="
    sudo systemctl stop nginx
    for i in simwor01 simwor02 simwor03
    do
     echo "========: $i==============="
     ssh $i "ps -ef|grep $APPNAME |grep -v grep|awk '{print \$2}'|xargs kill" >/dev/null 2>&1
    done
 
  };;
esac

2. Real-Time Ingestion

2.1 Project Setup

  • POM file

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>gmall</artifactId>
        <groupId>com.simwor</groupId>
        <version>0.0.1-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>realtime</artifactId>

    <properties>
        <spark.version>2.4.0</spark.version>
        <scala.version>2.11.8</scala.version>
        <kafka.version>1.0.0</kafka.version>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.56</version>
        </dependency>
        <dependency>
            <groupId>org.elasticsearch</groupId>
            <artifactId>elasticsearch</artifactId>
            <version>2.4.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>${kafka.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>2.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.phoenix</groupId>
            <artifactId>phoenix-spark</artifactId>
            <version>4.14.2-HBase-1.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>io.searchbox</groupId>
            <artifactId>jest</artifactId>
            <version>5.3.3</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-api</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>net.java.dev.jna</groupId>
            <artifactId>jna</artifactId>
            <version>4.5.2</version>
        </dependency>
        <dependency>
            <groupId>org.codehaus.janino</groupId>
            <artifactId>commons-compiler</artifactId>
            <version>2.7.8</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.4.6</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>
  • Configuration file config.properties
# Kafka configuration
kafka.broker.list=simwor01:9092,simwor02:9092,simwor03:9092

# Redis configuration
redis.host=simwor01
redis.port=6379
  • Utility class
package com.simwor.realtime.util

import java.io.InputStreamReader
import java.util.Properties

object PropertiesUtil {

  def main(args: Array[String]): Unit = {
    val properties: Properties = PropertiesUtil.load("config.properties")
    println(properties.getProperty("kafka.broker.list"))
  }

  def load(propertieName:String): Properties ={
    val prop=new Properties();
    prop.load(new InputStreamReader(Thread.currentThread().getContextClassLoader.getResourceAsStream(propertieName) , "UTF-8"))
    prop
  }

}

2.2 Reading Data from Kafka

  • Kafka utility class
package com.simwor.realtime.util

import java.util.Properties

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object MyKafkaUtil {
  private val properties: Properties = PropertiesUtil.load("config.properties")
  val broker_list = properties.getProperty("kafka.broker.list")

  // Kafka consumer configuration
  var kafkaParam = collection.mutable.Map(
    "bootstrap.servers" -> broker_list, // addresses used to bootstrap the connection to the cluster
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    // consumer group this consumer belongs to
    "group.id" -> "gmall_consumer_group",
    // where to start when there is no initial offset, or the current offset no longer exists on the server;
    // "latest" resets the offset to the most recent one
    "auto.offset.reset" -> "latest",
    // true: offsets are auto-committed in the background, which can lose data if the job fails;
    // false: offsets must be maintained manually
    "enable.auto.commit" -> (true: java.lang.Boolean)
  )

  // Create a DStream that returns the received input data
  // LocationStrategies: how consumers are scheduled for the given topics and cluster
  // LocationStrategies.PreferConsistent: distribute partitions consistently across all executors
  // ConsumerStrategies: how Kafka consumers are created and configured on the driver and executors
  // ConsumerStrategies.Subscribe: subscribe to a collection of topics
  def getKafkaStream(topic: String,ssc:StreamingContext ): InputDStream[ConsumerRecord[String,String]]={
    val dStream = KafkaUtils.createDirectStream[String,String](ssc, LocationStrategies.PreferConsistent,ConsumerStrategies.Subscribe[String,String](Array(topic),kafkaParam ))
    dStream
  }

  def getKafkaStream(topic: String,ssc:StreamingContext,groupId:String): InputDStream[ConsumerRecord[String,String]]={
    kafkaParam("group.id")=groupId
    val dStream = KafkaUtils.createDirectStream[String,String](ssc, LocationStrategies.PreferConsistent,ConsumerStrategies.Subscribe[String,String](Array(topic),kafkaParam ))
    dStream
  }

  def getKafkaStream(topic: String,ssc:StreamingContext,offsets:Map[TopicPartition,Long],groupId:String): InputDStream[ConsumerRecord[String,String]]={
    kafkaParam("group.id")=groupId
    val dStream = KafkaUtils.createDirectStream[String,String](ssc, LocationStrategies.PreferConsistent,ConsumerStrategies.Subscribe[String,String](Array(topic),kafkaParam,offsets))
    dStream
  }
}
  • Consume the data
package com.simwor.realtime.app

import com.alibaba.fastjson.{JSON, JSONObject}
import com.simwor.realtime.bean.DauInfo
import com.simwor.realtime.util.{MyEsUtil, MyKafkaUtil, RedisUtil}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import java.text.SimpleDateFormat
import java.util.Date
import scala.collection.mutable.ListBuffer

object DauApp {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("dau_app").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // Consume the startup logs from Kafka
    val recordInputStream: InputDStream[ConsumerRecord[String, String]] = MyKafkaUtil.getKafkaStream("gmall-start-log", ssc)
    val jsonObjectDataStream = recordInputStream.map(record => {
      val jsonString = record.value()
      val jsonObject = JSON.parseObject(jsonString)

      val timestamp = jsonObject.getLong("ts")
      val simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd HH")
      val dateHourString = simpleDateFormat.format(new Date(timestamp))
      val dateHour = dateHourString.split(" ")
      jsonObject.put("dt", dateHour(0))
      jsonObject.put("hr", dateHour(1))

      jsonObject
    })

    // De-duplicate with Redis and compute DAU
    //...

    // Final storage in Elasticsearch
    //...

    ssc.start()
    ssc.awaitTermination()
  }

}

2.3 Deduplication with Redis

  • Utility class
package com.simwor.realtime.util

import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}

object RedisUtil {

  var jedisPool:JedisPool=null

  def getJedisClient: Jedis = {
    if(jedisPool==null){
      //      println("creating the connection pool")
      val config = PropertiesUtil.load("config.properties")
      val host = config.getProperty("redis.host")
      val port = config.getProperty("redis.port")

      val jedisPoolConfig = new JedisPoolConfig()
      jedisPoolConfig.setMaxTotal(100)             // max total connections
      jedisPoolConfig.setMaxIdle(20)               // max idle connections
      jedisPoolConfig.setMinIdle(20)               // min idle connections
      jedisPoolConfig.setBlockWhenExhausted(true)  // block and wait when the pool is exhausted
      jedisPoolConfig.setMaxWaitMillis(500)        // max wait time (ms) when the pool is exhausted
      jedisPoolConfig.setTestOnBorrow(true)        // validate each connection on borrow

      jedisPool=new JedisPool(jedisPoolConfig,host,port.toInt)
    }
    //    println(s"jedisPool.getNumActive = ${jedisPool.getNumActive}")
    //   println("borrowed a connection")
    jedisPool.getResource
  }

}
  • De-duplicate
    // De-duplicate with Redis and compute DAU
    val filteredDStream: DStream[JSONObject] = jsonObjectDataStream.mapPartitions { jsonObjItr =>
      val originalList = jsonObjItr.toList
      val filteredList = new ListBuffer[JSONObject]()
      val jedisClient = RedisUtil.getJedisClient

      println("Before Filter : " + originalList.size)
      for(jsonObj <- originalList) {
        val dt = jsonObj.getString("dt")
        val mid = jsonObj.getJSONObject("common").getString("mid")
        val dauKey = "dau:" + dt
        // sadd returns 1 only the first time this mid appears in the day's set
        val exists = jedisClient.sadd(dauKey, mid)
        jedisClient.expire(dauKey, 3600*24)
        if (exists == 1L)
          filteredList += jsonObj
      }

      println("After Filter : " + filteredList.size)
      jedisClient.close()
      filteredList.toIterator
    }
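To sanity-check the de-duplication, the day's set can be inspected directly in Redis. A small sketch, with the key layout taken from the code above and the date as an example:

redis-cli -h simwor01 -p 6379 scard dau:2021-06-16   # number of distinct mids recorded for the day
redis-cli -h simwor01 -p 6379 ttl dau:2021-06-16     # should be at most 86400 seconds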

2.4 Storing Data in Elasticsearch

  • Index template
PUT   _template/gmall_dau_info_template
{
  "index_patterns": ["gmall_dau_info*"],                  
  "settings": {                                               
    "number_of_shards": 3
  },
  "aliases" : { 
    "{index}-query": {},
    "gmall_dau_info-query":{}
  },
 "mappings": {
   "properties":{
     "mid":{
       "type":"keyword"
     },
     "uid":{
       "type":"keyword"
     },
     "ar":{
       "type":"keyword"
     },
     "ch":{
       "type":"keyword"
     },
     "vc":{
       "type":"keyword"
     },
      "dt":{
       "type":"keyword"
     },
      "hr":{
       "type":"keyword"
     },
      "mi":{
       "type":"keyword"
     },
     "ts":{
       "type":"date"
     }
   }
 }
}
  • Index case class
package com.simwor.realtime.bean

case class DauInfo(
                mid:String,
                uid:String,
                ar:String,
                ch:String,
                vc:String,
                var dt:String,
                var hr:String,
                var mi:String,
                ts:Long)
  • Utility class
package com.simwor.realtime.util

import io.searchbox.client.config.HttpClientConfig
import io.searchbox.client.{JestClient, JestClientFactory}
import io.searchbox.core.{Bulk, Index, Search}
import org.elasticsearch.index.query.{BoolQueryBuilder, MatchQueryBuilder}
import org.elasticsearch.search.builder.SearchSourceBuilder

object MyEsUtil {

  def bulkDoc(sourceList: List[Any], indexName: String): Unit = {
    val jestClient = getClient

    val bulkBuilder = new Bulk.Builder
    for(source <- sourceList) {
      val index = new Index.Builder(source).index(indexName).`type`("_doc").build()
      bulkBuilder.addAction(index)
    }

    jestClient.execute(bulkBuilder.build())
    jestClient.close()
  }

  /* ElasticSearch Connection Factory */

  def getClient:JestClient ={
    if(factory==null) build();
    factory.getObject
  }

  def  build(): Unit ={
    factory = new JestClientFactory
    factory.setHttpClientConfig(new HttpClientConfig.Builder("http://simwor01:9200")
      .multiThreaded(true)
      .maxTotalConnection(20)
      .connTimeout(10000).readTimeout(1000).build())
  }

  private var factory: JestClientFactory = null;
}
  • Store the data
    // Final storage in Elasticsearch
    filteredDStream.foreachRDD { rdd =>
      rdd.foreachPartition { jsonItr =>
        val list = jsonItr.toList
        val dt = new SimpleDateFormat("yyyy-MM-dd").format(new Date())
        val dauList = list.map { startupJsonObj =>
          val dtHr: String = new SimpleDateFormat("yyyy-MM-dd HH:mm").format(new Date(startupJsonObj.getLong("ts")))
          val dtHrArr: Array[String] = dtHr.split(" ")
          val dt = dtHrArr(0)
          val timeArr = dtHrArr(1).split(":")
          val hr = timeArr(0)
          val mi = timeArr(1)
          val commonJSONObj: JSONObject = startupJsonObj.getJSONObject("common")
          DauInfo(commonJSONObj.getString("mid"),
            commonJSONObj.getString("uid"),
            commonJSONObj.getString("mid"),
            commonJSONObj.getString("ch"),
            commonJSONObj.getString("vc"),
            dt, hr, mi,
            startupJsonObj.getLong("ts"))
        }
        MyEsUtil.bulkDoc(dauList, "gmall_dau_info_" + dt)
      }
    }
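Once data is flowing, the template and the daily index can be checked over the Elasticsearch REST API. A quick sketch, with the host and date being examples from this setup:

curl 'http://simwor01:9200/_template/gmall_dau_info_template?pretty'        # the template registered above
curl 'http://simwor01:9200/gmall_dau_info_2021-06-22-query/_count?pretty'   # document count via the -query alias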

2.5 Exactly-Once Consumption

Kafka supports transactional production but not transactional consumption; Elasticsearch supports idempotent writes but does not support transactions.

Exactly-once consumption can therefore be achieved by manually saving the Kafka offsets (committed only after the output succeeds) combined with idempotent writes to ES.

  • Manually save Kafka offsets to Redis
  1. OffsetManager
package com.simwor.realtime.util

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

import java.util

object OffsetManager {

  // Read the saved offsets
  def getOffset(topicName: String, groupId: String): Map[TopicPartition, Long] = {
    // Redis
    // type -> hash
    // key -> offset:[topic]:[groupid]
    // field -> partition_id
    // value -> offset
    val jedisClient = RedisUtil.getJedisClient

    val offsetMap: util.Map[String, String] = jedisClient.hgetAll("offset:" + topicName + ":" + groupId)
    import scala.collection.JavaConversions._
    val kafkaOffsetMapMap: Map[TopicPartition, Long] = offsetMap.map { case (partitionId, offset) =>
      (new TopicPartition(topicName, partitionId.toInt), offset.toLong)
    }.toMap

    jedisClient.close()
    kafkaOffsetMapMap
  }

  // Save the offsets
  def saveOffset(topicName: String, groupId: String, offsetRanges: Array[OffsetRange]): Unit = {
    val jedisClient = RedisUtil.getJedisClient

    val offsetMap: util.Map[String, String] = new util.HashMap()
    for(offset <- offsetRanges) {
      val partition: Int = offset.partition
      val untilOffset: Long = offset.untilOffset
      offsetMap.put(partition.toString, untilOffset.toString)
      println("partition := " + partition + " -- " + offset.fromOffset + " --> " + untilOffset)
    }
    if(offsetMap != null && offsetMap.size() > 0)
      jedisClient.hmset("offset:" + topicName + ":" + groupId, offsetMap)

    jedisClient.close()
  }

}
  2. DauApp
package com.simwor.realtime.app

import com.alibaba.fastjson.{JSON, JSONObject}
import com.simwor.realtime.bean.DauInfo
import com.simwor.realtime.util.{MyEsUtil, MyKafkaUtil, OffsetManager, RedisUtil}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import java.text.SimpleDateFormat
import java.util.Date
import scala.collection.mutable.ListBuffer

object DauApp {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("dau_app").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // ***************** Read the Kafka offsets saved in Redis
    val topicName = "gmall-start-log"
    val groupId = "gmall-start-group"
    val kafkaOffsetMap = OffsetManager.getOffset(topicName, groupId)
    var recordInputStream: InputDStream[ConsumerRecord[String, String]] = null
    if(kafkaOffsetMap != null && kafkaOffsetMap.size > 0)
      recordInputStream = MyKafkaUtil.getKafkaStream("gmall-start-log", ssc, kafkaOffsetMap, groupId)
    else
      recordInputStream = MyKafkaUtil.getKafkaStream("gmall-start-log", ssc)

    // ***************** Capture this batch's offset ranges
    var offsetRanges: Array[OffsetRange] = Array.empty[OffsetRange]
    val startupInputGetOffsetDstream: DStream[ConsumerRecord[String, String]] = recordInputStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }

...

    // Final storage in Elasticsearch
    filteredDStream.foreachRDD { rdd =>
...
      // ***************** Commit the Kafka offsets (after the ES write)
      OffsetManager.saveOffset(topicName, groupId, offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }

}
  • Idempotent writes to ES
  1. Specify the document ID in MyEsUtil
  def bulkDoc(sourceList: List[(String, DauInfo)], indexName: String): Unit = {
    val jestClient = getClient

    val bulkBuilder = new Bulk.Builder
    for((id, source) <- sourceList) {
      // ************ Specify the ID so duplicates update the existing document instead of creating a new one
      val index = new Index.Builder(source).index(indexName).`type`("_doc").id(id).build()
      bulkBuilder.addAction(index)
    }

    jestClient.execute(bulkBuilder.build())
    jestClient.close()
  }
  2. Provide the document ID in DauApp
    // Final storage in Elasticsearch
    filteredDStream.foreachRDD { rdd =>
      rdd.foreachPartition { jsonItr =>
        val list = jsonItr.toList
        val dt = new SimpleDateFormat("yyyy-MM-dd").format(new Date())
        val dauList: List[(String, DauInfo)] = list.map { startupJsonObj =>
          val dtHr: String = new SimpleDateFormat("yyyy-MM-dd HH:mm").format(new Date(startupJsonObj.getLong("ts")))
          val dtHrArr: Array[String] = dtHr.split(" ")
          val dt = dtHrArr(0)
          val timeArr = dtHrArr(1).split(":")
          val hr = timeArr(0)
          val mi = timeArr(1)
          val commonJSONObj: JSONObject = startupJsonObj.getJSONObject("common")
          val dauInfo = DauInfo(commonJSONObj.getString("mid"),
            commonJSONObj.getString("uid"),
            commonJSONObj.getString("mid"),
            commonJSONObj.getString("ch"),
            commonJSONObj.getString("vc"),
            dt, hr, mi,
            startupJsonObj.getLong("ts"))

          // **************** The return value must include the document ID; mid is used here
          (dauInfo.mid, dauInfo)
        }
        MyEsUtil.bulkDoc(dauList, "gmall_dau_info_" + dt)
      }
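Whether offsets really are committed after each batch can be checked in Redis; the hash layout follows OffsetManager, with the topic and group id used in this section:

redis-cli -h simwor01 hgetall offset:gmall-start-log:gmall-start-group
# field = partition id, value = the next offset to consume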

2.6 Kibana Visualization

  • Configure the data source: Stack Management -> Index Patterns -> Create Index Pattern

(Figure 7)
(Figure 8)

  • Configure a visualization: Visualize
  1. Create new visualization -> New Vertical Bar / Choose a source -> gmall_dau_info_2021*

(Figure 9)

  2. Set the Y-axis

(Figure 10)

  3. Set the X-axis

(Figure 11)

  4. Set the time range and click Refresh

(Figure 12)

  5. Review and save: Update -> Save

(Figure 13)

  • Combine into a dashboard
  1. Dashboard -> Create new dashboard -> Add

(Figure 14)

  2. Live refresh

(Figure 15)

  3. Share the link: Share -> Embed Code -> Saved Object

<html>
<head>
	<meta charset="utf-8">
	<title>Simwor</title>
</head>
<body>
	<h1>Daily Active Users</h1>
	<iframe src="http://simwor01:5601/app/kibana#/dashboard/39adc0a0-d4f0-11eb-8ddb-af39ee8ef270?embed=true&_g=(filters%3A!()%2CrefreshInterval%3A(pause%3A!t%2Cvalue%3A0)%2Ctime%3A(from%3Anow%2Fw%2Cto%3Anow%2Fw))" height="600" width="800"></iframe>
</body>
</html>

(Figure 16)

2.7 Publishing a Data API

  • API format

    Totals: http://publisher:8070/realtime-total?date=2019-02-01
            returns [{"id":"dau","name":"DAU","value":1200},{"id":"new_mid","name":"New devices","value":233}]
    Hourly breakdown: http://publisher:8070/realtime-hour?id=dau&date=2019-02-01
            returns {"yesterday":{"11":383,"12":123,"17":88,"19":200}, "today":{"12":38,"13":1233,"17":123,"19":688}}
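Once the publisher service built in the rest of this section is running, the two endpoints can be smoke-tested with curl (hostname and date below are placeholders):

curl 'http://simwor01:8070/realtime-total?date=2021-06-22'
curl 'http://simwor01:8070/realtime-hour?id=dau&date=2021-06-22'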
  • Create the project

(Figure 17)

In the POM, change the Spring Boot version to 2.1.15.RELEASE and add a few extra utility dependencies.

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.10</version>
</dependency>
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>29.0-jre</version>
</dependency>
<dependency>
    <groupId>com.alibaba</groupId>
    <artifactId>fastjson</artifactId>
    <version>1.2.68</version>
</dependency>
<dependency>
    <groupId>io.searchbox</groupId>
    <artifactId>jest</artifactId>
    <version>5.3.3</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>4.5.2</version>
</dependency>
<dependency>
    <groupId>org.codehaus.janino</groupId>
    <artifactId>commons-compiler</artifactId>
    <version>2.7.8</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>2.4.6</version>
</dependency>
  • Project configuration file application.properties
spring.elasticsearch.jest.uris=http://simwor01:9200,http://simwor02:9200,http://simwor03:9200
server.port=8070
  • Define the service interface
package com.simwor.publisher.service;

import java.util.Map;

public interface EsService {

    public Long getDauTotal(String date);

    public Map getDauHour(String date);

}
  • Implement the interface
package com.simwor.publisher.service.impl;

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.simwor.publisher.service.EsService;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import io.searchbox.core.search.aggregation.TermsAggregation;
import org.elasticsearch.index.query.MatchAllQueryBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.TermsBuilder;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

@Service
public class EsServiceImpl implements EsService {

    @Autowired
    JestClient jestClient;

    @Override
    public Long getDauTotal(String date) {
        Long totalResult = 0L;
        String indexName = "gmall_dau_info_" + date + "-query";
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(new MatchAllQueryBuilder());
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex(indexName)
                .addType("_doc")
                .build();

        try {
            SearchResult searchResult = jestClient.execute(search);
            JsonObject jsonObject = searchResult.getJsonObject();
            JsonElement jsonElement = jsonObject.get("hits").getAsJsonObject().get("total").getAsJsonObject().get("value");
            totalResult = jsonElement.getAsLong();
        } catch (IOException e) {
            e.printStackTrace();
            throw new RuntimeException("Elasticsearch query failed");
        }

        return totalResult;
    }

    @Override
    public Map getDauHour(String date) {
        Map<String, Long> results = new HashMap<>();
        String indexName = "gmall_dau_info_" + date + "-query";
        // Build the aggregation query (group by hr, up to 24 buckets)
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        TermsBuilder termsBuilder = AggregationBuilders.terms("groupby_hr").field("hr").size(24);
        searchSourceBuilder.aggregation(termsBuilder);
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex(indexName)
                .addType("_doc")
                .build();

        try {
            // Execute the query and collect the bucket results
            SearchResult searchResult = jestClient.execute(search);
            List<TermsAggregation.Entry> buckets = searchResult.getAggregations().getTermsAggregation("groupby_hr").getBuckets();
            for(TermsAggregation.Entry bucket : buckets)
                results.put(bucket.getKey(), bucket.getCount());
        } catch (IOException e) {
            e.printStackTrace();
        }

        return results;
    }
}
  • Front-end request controller
package com.simwor.publisher.controller;

import com.alibaba.fastjson.JSON;
import com.simwor.publisher.service.EsService;
import org.apache.commons.lang3.time.DateUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.*;

@RestController
public class PublisherController {

    @Autowired
    private EsService esService;

    @GetMapping("realtime-total")
    public String realtimeTotal(@RequestParam("date") String dt) {
        List<Map<String, Object>> resultList = new ArrayList<>();

        Map<String, Object> dauMap = new HashMap<>();
        dauMap.put("id", "dau");
        dauMap.put("name", "新增日活");
        dauMap.put("value", esService.getDauTotal(dt));
        resultList.add(dauMap);

        Map<String, Object> midMap = new HashMap<>();
        midMap.put("id", "new_mid");
        midMap.put("name", "新增设备");
        midMap.put("value", 233);
        resultList.add(midMap);

        return JSON.toJSONString(resultList);
    }

    @GetMapping("realtime-hour")
    public String realTimeHour(@RequestParam("id") String id,
                               @RequestParam("date") String dt) {
        Map<String, Map<String, Long>> resultMap = new HashMap<>();

        Map dauHourToday = esService.getDauHour(dt);
        Map dauHourYesterday = esService.getDauHour(getYesterday(dt));
        resultMap.put("today", dauHourToday);
        resultMap.put("yesterday", dauHourYesterday);

        return JSON.toJSONString(resultMap);
    }

    private String getYesterday(String today) {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
        String yesterday = "";

        try {
            Date todayDate = simpleDateFormat.parse(today);
            Date yesterdayDate = DateUtils.addDays(todayDate, -1);
            yesterday = simpleDateFormat.format(yesterdayDate);
        } catch (ParseException e) {
            e.printStackTrace();
        }

        return yesterday;
    }

}
For reference, the hourly aggregation above corresponds to the following ES query, shown with a sample response:

GET gmall_dau_info_2021-06-22-query/_search
{
  "aggs": {
    "groupby_hr": {
      "terms": {
        "field": "hr",
        "size": 24
      }
    }
  }
}

"aggregations" : {
   "groupby_hr" : {
     "doc_count_error_upper_bound" : 0,
     "sum_other_doc_count" : 0,
     "buckets" : [
       {
         "key" : "21",
         "doc_count" : 50
       }
     ]
   }
 }

3. Real-Time Monitoring

This part introduces two tools that monitor MySQL data changes in real time: Canal and Maxwell.

3.1 Canal

  • Definition

Canal monitors data changes in real time by mimicking the behavior of a replica in MySQL master-slave replication:

(Figure 18)

  1. The master records changes in its binary log (binlog);
  2. The slave sends the dump protocol to the master and copies the master's binlog events into its relay log;
  3. The slave reads and replays the events in the relay log, applying the changes to its own database.

3.1.1 Configuring MySQL

  • Initialize the database and grant the canal user privileges
mysql> create database gmall_db;

mysql> use gmall_db;

mysql> source /opt/appdb/gmall_db.sql

mysql> GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%' IDENTIFIED BY 'ABcd12#$..';

mysql> 
  • Enable binlog
[omm@simwor01 ~]$ sudo vi /etc/my.cnf
[omm@simwor01 ~]$ tail -4 /etc/my.cnf
server-id= 1
log-bin=mysql-bin
binlog_format=row
binlog-do-db=gmall_db
[omm@simwor01 ~]$ sudo systemctl restart mysqld
[omm@simwor01 mysql]$ pwd
/var/lib/mysql
[omm@simwor01 mysql]$ ll mysql-bin*
-rwxr-xr-x. 1 mysql mysql 154 Jun 29 11:11 mysql-bin.000001
-rwxr-xr-x. 1 mysql mysql  19 Jun 29 11:11 mysql-bin.index
[omm@simwor01 mysql]$ 
  • Generate mock business data and watch the binlog size change
[omm@simwor01 appdb]$ java -jar gmall2020-mock-db-2020-05-18.jar 
--------Generating data--------
--------Generating user data--------
10 users changed
0 new users generated
--------Generating favorites--------
100 favorites generated
--------Generating cart data--------
274 cart records generated
--------Generating orders--------
200 coupons
14 orders generated
9 orders joined promotion activities
--------Generating payments--------
14 orders' status updated
8 orders paid
--------Generating refunds--------
8 orders' status updated
2 refunds generated
--------Generating reviews--------
8 reviews generated
[omm@simwor01 appdb]$ 

[omm@simwor01 mysql]$ ll mysql-bin*
-rwxr-xr-x. 1 mysql mysql 220806 Jun 29 11:16 mysql-bin.000001
-rwxr-xr-x. 1 mysql mysql     19 Jun 29 11:11 mysql-bin.index
[omm@simwor01 mysql]$ 
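Because binlog_format=row, the individual row changes can also be inspected with mysqlbinlog, which is roughly what Canal and Maxwell decode for us. A sketch using the paths from this environment:

sudo mysqlbinlog --base64-output=decode-rows -vv /var/lib/mysql/mysql-bin.000001 | head -50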

3.1.2 Installing Canal

  • Architecture

A single Canal server can monitor multiple MySQL instances.

(Figure 19)

  • Extract
[omm@simwor01 soft]$ pwd
/opt/soft
[omm@simwor01 soft]$ mkdir /opt/module/canal
[omm@simwor01 soft]$ tar -zxf canal.deployer-1.1.4.tar.gz -C /opt/module/canal
[omm@simwor01 soft]$ ll /opt/module/canal
total 4
drwxrwxr-x. 2 omm omm   76 Jun 29 11:22 bin
drwxrwxr-x. 5 omm omm  123 Jun 29 11:22 conf
drwxrwxr-x. 2 omm omm 4096 Jun 29 11:22 lib
drwxrwxr-x. 2 omm omm    6 Sep  2  2019 logs
[omm@simwor01 soft]$ 
  • Edit the configuration files
[omm@simwor01 conf]$ vi canal.properties 
[omm@simwor01 conf]$ grep canal.mq.servers canal.properties 
canal.mq.servers = simwor01:9092,simwor02:9092,simwor03:9092
[omm@simwor01 conf]$ grep serverMode canal.properties 
canal.serverMode = kafka
[omm@simwor01 conf]$ 
[omm@simwor01 example]$ pwd
/opt/module/canal/conf/example
[omm@simwor01 example]$ vi instance.properties 
[omm@simwor01 example]$ grep canal.instance.master.address instance.properties 
canal.instance.master.address=simwor01:3306
[omm@simwor01 example]$ grep canal.instance.db instance.properties 
canal.instance.dbUsername=canal
canal.instance.dbPassword=ABcd12#$..
[omm@simwor01 example]$ grep canal.mq.topic instance.properties 
canal.mq.topic=GMALL_DB_CANAL
[omm@simwor01 example]$ 
  • Simulate Canal picking up MySQL data changes
# Start Canal
[omm@simwor01 canal]$ bin/startup.sh

# Generate data
[omm@simwor01 appdb]$ pwd
/opt/appdb
[omm@simwor01 appdb]$ java -jar gmall2020-mock-db-2020-05-18.jar 

# Watch the Kafka topic
[omm@simwor01 bin]$ ./kafka-console-consumer.sh --bootstrap-server simwor01:9092 --topic GMALL_DB_CANAL --from-beginning
...
^CProcessed a total of 1582 messages
[omm@simwor01 bin]$ 

(Figure 20)

3.2 Canal ODS-Layer Data Splitting

Canal now gives us real-time change records; the requirement is to route each table's changes to its own Kafka topic. For example, changes to the user_info table are pushed to the ODS_USER_INFO topic:

(Figure 21)

  • BaseDbCanal splitting code
package com.simwor.realtime.ods

import com.alibaba.fastjson.JSON
import com.simwor.realtime.util.{MyKafkaSink, MyKafkaUtil, OffsetManager}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BaseDbCanal {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("base_db_canal_app").setMaster("local[4]")
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // ***************** Read the Kafka offsets saved in Redis
    val topicName = "GMALL_DB_CANAL"
    val groupId = "gmall-canal-group"
    val kafkaOffsetMap = OffsetManager.getOffset(topicName, groupId)
    var recordInputStream: InputDStream[ConsumerRecord[String, String]] = null
    if(kafkaOffsetMap != null && kafkaOffsetMap.size > 0)
      recordInputStream = MyKafkaUtil.getKafkaStream(topicName, ssc, kafkaOffsetMap, groupId)
    else
      recordInputStream = MyKafkaUtil.getKafkaStream(topicName, ssc)

    // ***************** Capture this batch's offset ranges
    var offsetRanges: Array[OffsetRange] = Array.empty[OffsetRange]
    val startupInputGetOffsetDstream: DStream[ConsumerRecord[String, String]] = recordInputStream.transform { rdd =>
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }

    // ***************** Parse the Kafka records into JSON objects
    val jsonObjDstream = startupInputGetOffsetDstream.map { record =>
      val jsonString = record.value()
      val jsonObj = JSON.parseObject(jsonString)
      jsonObj
    }

    // ***************** Parse the objects, split by table, and push back to Kafka
    jsonObjDstream.foreachRDD { rdd =>
      // push back to Kafka
      rdd.foreach { jsonObj =>
        // derive the topic name from the table name
        val tableName = jsonObj.getString("table")
        val topic = "ODS_" + tableName.toUpperCase()
        // push each changed row to the per-table topic (Canal wraps rows in a "data" array)
        val jsonArr = jsonObj.getJSONArray("data")
        import scala.collection.JavaConversions._
        for( item <- jsonArr)
          MyKafkaSink.send(topic, item.toString)
      }

      // ***************** Commit the Kafka offsets once the batch has been pushed
      OffsetManager.saveOffset(topicName, groupId, offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
  
}
  • MyKafkaSink utility class
package com.simwor.realtime.util

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object MyKafkaSink {
  private val properties: Properties = PropertiesUtil.load("config.properties")
  val broker_list = properties.getProperty("kafka.broker.list")
  var kafkaProducer: KafkaProducer[String, String] = null

  def createKafkaProducer: KafkaProducer[String, String] = {
    val properties = new Properties
    properties.put("bootstrap.servers", broker_list)
    properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    properties.put("enable.idompotence",(true: java.lang.Boolean))
    var producer: KafkaProducer[String, String] = null
    try
      producer = new KafkaProducer[String, String](properties)
    catch {
      case e: Exception =>
        e.printStackTrace()
    }
    producer
  }

  def send(topic: String, msg: String): Unit = {
    if (kafkaProducer == null) kafkaProducer = createKafkaProducer
    kafkaProducer.send(new ProducerRecord[String, String](topic, msg))

  }

  def send(topic: String,key:String, msg: String): Unit = {
    if (kafkaProducer == null) kafkaProducer = createKafkaProducer
    kafkaProducer.send(new ProducerRecord[String, String](topic,key, msg))

  }
}
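With BaseDbCanal running, each table's changes should now land on its own ODS_* topic. A quick check, with the topic name following the table-name rule above:

kafka-console-consumer.sh --bootstrap-server simwor01:9092 --topic ODS_USER_INFO --from-beginning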

3.3 Maxwell

  • Comparison with Canal
  1. Maxwell has no server+client split like Canal; there is only a single server that sends data to a message queue or to Redis.
  2. Maxwell's standout feature is bootstrap: Canal can only capture new changes and has no way to handle data that already exists, whereas Maxwell can bootstrap the complete historical data for initialization, which is very handy (see the sketch at the end of this subsection).
  3. Maxwell does not support HA directly, but it supports resuming from a breakpoint: once the failure is resolved, it restarts and continues reading from where it left off.
  4. Maxwell only outputs JSON, while Canal in server+client mode allows a custom format.
  5. Maxwell is more lightweight than Canal.
  • Installation
  1. Extract
[omm@simwor01 soft]$ tar -zxf maxwell-1.25.0.tar.gz -C /opt/module/
[omm@simwor01 soft]$ ln -s /opt/module/maxwell-1.25.0/ /opt/module/maxwell
[omm@simwor01 soft]$ ll -d /opt/module/max*
lrwxrwxrwx. 1 omm omm  27 Jun 30 10:23 /opt/module/maxwell -> /opt/module/maxwell-1.25.0/
drwxrwxr-x. 4 omm omm 200 Jun 30 10:23 /opt/module/maxwell-1.25.0
[omm@simwor01 soft]$ 
  2. Configure the MySQL environment (prerequisite: binlog is already enabled)
mysql> CREATE DATABASE maxwell;

mysql> GRANT ALL   ON maxwell.* TO 'maxwell'@'%' IDENTIFIED BY 'Abcd12#$..';

mysql> GRANT  SELECT ,REPLICATION SLAVE , REPLICATION CLIENT  ON *.* TO maxwell@'%';
  3. Edit the configuration file
[omm@simwor01 maxwell]$ cp config.properties.example config.properties
[omm@simwor01 maxwell]$ vi config.properties
[omm@simwor01 maxwell]$ head -15 config.properties

log_level=info

producer=kafka
kafka.bootstrap.servers=simwor01:9092,simwor02:9092,simwor03:9092
kafka_topic=GMALL_DB_MAXWELL
# database | table | primary_key | random | column
producer_partition_by=primary_key

# mysql login info
host=simwor01
user=maxwell
password=Abcd12#$..

client_id=maxwell_1

[omm@simwor01 maxwell]$ 
  4. Start and verify

Start Maxwell -> generate mock data -> verify by consuming from Kafka.

(Figure 22)
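As a sketch of the bootstrap feature mentioned in the comparison above, Maxwell ships a maxwell-bootstrap helper that replays a table's existing rows through the same pipeline; the database and table names here are illustrative:

cd /opt/module/maxwell
bin/maxwell-bootstrap --config config.properties --database gmall_db --table user_info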

3.4 Maxwell ODS-Layer Data Splitting

  • Data format comparison

(Figure 23)

  1. Log structure: Canal produces one entry per SQL statement; if a statement affects multiple rows, they are collected into that entry as an array (even a single row is wrapped in an array). Maxwell produces one entry per affected row; entries that originate from the same SQL statement share the same xid.
  2. Numeric types: Maxwell keeps the original numeric types and does not add quotes, while Canal converts everything to strings.
  3. Schema information: Canal messages carry the table structure along with the data; Maxwell's output is more compact.
  • BaseDbMaxwell code
package com.simwor.realtime.ods

...

object BaseDbMaxwell {

...
    // ***************** Read the Kafka offsets saved in Redis
    val topicName = "GMALL_DB_MAXWELL"
    val groupId = "gmall-maxwell-group"
    ...

    // ***************** Parse the objects, split by table, and push back to Kafka
    jsonObjDstream.foreachRDD { rdd =>
      // push back to Kafka
      rdd.foreach { jsonObj =>
        // derive the topic name from the table name
        val tableName = jsonObj.getString("table")
        val topic = "ODS_" + tableName.toUpperCase()
        // Maxwell's "data" field is a single row object, so it is pushed as-is
        val jsonString = jsonObj.getString("data")
        MyKafkaSink.send(topic, jsonString)
      }
    }

...
}

(Figure 24)
