Flink Notes

Day 1 - Flink: Stream Computing Framework

Course plan:

Introduction to Flink (features, integrations), Flink environment setup (standalone, YARN), Flink DataSet (batch processing)

Introduction to Flink

Features

  • High throughput, low latency
  • Window functions: event time (key point)
  • Exactly-once consistency semantics (understand)
  • Fault tolerance (checkpointing, key point)
  • Manages its own memory
  • Watermarks (waterMark: handles out-of-order and delayed data on the network)
  • State management

Flink's core compute module: runtime
Roles:

  • JobManager: master node, monitors the worker nodes
  • TaskManager: worker node, executes the actual tasks

Library support:
  • Graph processing
  • CEP
  • Flink SQL
  • Machine learning
  • DataStream (key point)
  • DataSet

Integration support:
  • Flink on YARN
  • HDFS
  • Input data from Kafka (day 3)
  • Apache HBase (project)

Two modes

yarn-session mode

  • Resources (memory) are requested once up front
  • Suited to a large number of small jobs
  • The requested resources are not released automatically and must be shut down manually
    Command:
bin/yarn-session.sh -n 2 -tm 800 -jm 800 -s 1 -d

-n : number of TaskManager containers

-s : number of slots

-tm: memory of each TaskManager container

-jm: memory of the JobManager container

-d: detached mode

Three containers in total: 2 TaskManagers and 1 JobManager

Submit a job:

 bin/flink run examples/batch/WordCount.jar 

View the list of yarn-session applications

yarn application -list

Kill a specific yarn-session

yarn application -kill application_1571196306040_0002

View other options:

bin/yarn-session.sh -help

yarn-cluster mode

  • Resources are released automatically: they are requested when the job is submitted and released when it completes
  • Suited to large jobs, and to batch / offline workloads
  • The job is submitted directly to the YARN cluster

Submit command:

 bin/flink run -m yarn-cluster -yn 2 -ys 2 -ytm 1024 -yjm 1024 /export/servers/flink-1.7.0/examples/batch/WordCount.jar 

-m : specifies the execution mode

-yn : number of containers

-ys: number of slots

-ytm: TaskManager memory

-yjm: JobManager memory

/export/servers/flink-1.7.0/examples/batch/WordCount.jar : the jar to execute

View help

 bin/flink run -m yarn-cluster -help

Extension: writing a custom Flume source

https://www.cnblogs.com/nstart/p/7699904.html

(Image: photo/1571198075931.png)

Flink Application Development

1. WordCount

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object WordCount {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取批处理执行环境
      * 2.加载数据源
      * 3.数据转换:切分,分组,聚合
      * 4.数据打印
      * 5.触发执行
      */
    //1.获取批处理执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.加载数据源
    val source: DataSet[String] = env.fromElements("ss dd dd ss ff")

    source.flatMap(line=>line.split("\\W+"))//正则表达式W+ ,多个空格
      .map(line=>(line,1))  //(ss,1)
      //分组
      .groupBy(0)
      .sum(1)  //求和
      .print()//数据打印,在批处理中,print是一个触发算子

    //env.execute()  //表示触发执行
  }
}

2. Packaging and deployment

Option 1: package with Maven

Option 2: package from IDEA

(Image: photo/1571208964476.png)

(Image: photo/1571209075510.png)

Run the job:

bin/flink run -m yarn-cluster -yn 1 -ys 1 -ytm 1024 -yjm 1024 /export/servers/tmp/flink-1016.jar 

Reasons not to bundle the dependency jars into the project jar:

the jar stays small

upgrades and maintenance are easier

Operators

1. map

Transforms one element into another element

(Image: photo/1571209951856.png)

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object MapDemo {

  def main(args: Array[String]): Unit = {

    /**
      * 1. 获取 ExecutionEnvironment 运行环境
      * 2. 使用 fromCollection 构建数据源
      * 3. 创建一个 User 样例类
      * 4. 使用 map 操作执行转换
      * 5. 打印测试
      */
    //1. 获取 ExecutionEnvironment 运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    //2. 使用 fromCollection 构建数据源
    val source: DataSet[String] = env.fromCollection(List("1,张三", "2,李四", "3,王五", "4,赵六"))

    //3.数据转换
    source.map(line=>{
      val arr: Array[String] = line.split(",")
      User(arr(0).toInt,arr(1))
    }).print()
  }
}

case class User(id:Int,userName:String)

2. flatMap

Transforms one element into 0, 1, or n elements

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object FlatMap {

  def main(args: Array[String]): Unit = {
    /**
      * 1. 获取 ExecutionEnvironment 运行环境
      * 2. 使用 fromElements构建数据源
      * 3. 使用 flatMap 执行转换
      * 4. 使用groupBy进行分组
      * 5. 使用sum求值
      * 6. 打印测试
      */
    //1. 获取 ExecutionEnvironment 运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. 使用 fromElements构建数据源
    val source: DataSet[List[(String, Int)]] = env.fromElements(List(("java", 1), ("scala", 1), ("java", 1)) )
    source.flatMap(line=>line)
      .groupBy(0)  //对第一个元素进行分组
      .sum(1)  //对第二个元素求和
      .print()  //打印和触发执行
  }
}

3. mapPartition

A per-partition transformation operator: processes all the elements of one partition in a single function call

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object MapPartition {

  def main(args: Array[String]): Unit = {

    /**
      * 1. 获取 ExecutionEnvironment 运行环境
      * 2. 使用 fromElements构建数据源
      * 3. 创建一个 Demo 样例类
      * 4. 使用 mapPartition 操作执行转换
      * 5. 打印测试
      */
    //1. 获取 ExecutionEnvironment 运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. 使用 fromElements构建数据源
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
    //3.数据转换
    source.mapPartition(line=>{
      line.map(y=>(y._1,y._2))
    }).print()
  }
}

4. filter

Keeps only the elements for which the predicate returns true

package cn.itcast

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object Filter {


  def main(args: Array[String]): Unit = {

    //1. 获取 ExecutionEnvironment 运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. 使用 fromElements构建数据源
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1) )
    //3.数据过滤
    source.filter(line=>line._1.contains("java"))
      .print()

  }
}

5. reduce

An incremental aggregation that reduces a (grouped) data set down to a single element

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object Reduce {

  def main(args: Array[String]): Unit = {

    /**
      * 1. 获取 ExecutionEnvironment 运行环境
      * 2. 使用 fromElements 构建数据源
      * 3. 使用 map和group执行转换操作
      * 4.使用reduce进行聚合操作
      * 5.打印测试
      */
    //1. 获取 ExecutionEnvironment 运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. 使用 fromElements 构建数据源
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1)   )
    //3.数据转换
    source.groupBy(0)
      //4.使用reduce进行聚合操作
      .reduce((x,y)=>(x._1,x._2+y._2))
      //5.打印测试
      .print()
  }

}

6. reduce and reduceGroup

(Image: photo/1571213462938.png)

package cn.itcast

import java.lang

import akka.stream.impl.fusing.Collect
import org.apache.flink.api.common.functions.{GroupCombineFunction, GroupReduceFunction}
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.util.Collector

/**
  * @Date 2019/10/16
  */
object Reduce {

  def main(args: Array[String]): Unit = {

    /**
      * 1. 获取 ExecutionEnvironment 运行环境
      * 2. 使用 fromElements 构建数据源
      * 3. 使用 map和group执行转换操作
      * 4.使用reduce进行聚合操作
      * 5.打印测试
      */
    //1. 获取 ExecutionEnvironment 运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. 使用 fromElements 构建数据源
    val source: DataSet[(String, Int)] = env.fromElements(("java", 1), ("scala", 1), ("java", 1))
    //3.数据转换
    source.groupBy(0)
      //4.使用reduce进行聚合操作
      //.reduce((x,y)=>(x._1,x._2+y._2))
      //reduceGroup写法
//      .reduceGroup(line => {
//      line.reduce((x, y) => (x._1, x._2 + y._2))
//    })

//      .reduceGroup{
//      (in:Iterator[(String,Int)],out:Collector[(String,Int)])=>{
//        val tuple: (String, Int) = in.reduce((x,y)=>(x._1,x._2+y._2))
//        out.collect(tuple)
//      }
//    }

      //combine
      .combineGroup(new GroupCombineAndReduce)

      //5.打印测试
      .print()
  }

}

//导入包 java语法改成可以使用scala语法
import collection.JavaConverters._
class GroupCombineAndReduce extends GroupReduceFunction[(String,Int),(String,Int)]
  with GroupCombineFunction[(String,Int),(String,Int)] {

  //后执行
  override def reduce(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    for(line<- values.asScala){
      out.collect(line)
    }
  }

  //先执行,能够预先合并数据
  override def combine(values: lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {

    var key= ""
    var sum:Int =0
    for(line<- values.asScala){
      key =line._1
      sum = sum+ line._2
    }
    out.collect((key,sum))

  }
}

Note: the amount of data received at once (per group) must not be too large

7. Aggregation functions

package cn.itcast

import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._

import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object Aggregate {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.加载数据源
    val data = new mutable.MutableList[(Int, String, Double)]
    data.+=((1, "yuwen", 89.0))
    data.+=((2, "shuxue", 92.2))
    data.+=((3, "yingyu", 89.99))
    data.+=((4, "wuli", 98.9))
    data.+=((5, "yuwen", 88.88))
    data.+=((6, "wuli", 93.00))
    data.+=((7, "yuwen", 94.3))

    val source: DataSet[(Int, String, Double)] = env.fromCollection(data)

    //3.数据分组
    val groupData: GroupedDataSet[(Int, String, Double)] = source.groupBy(1)

    //4.数据聚合
    groupData
      //根据第三个元素取最小值
//      .minBy(2)
      //.maxBy(2)  //返回满足条件的一组元素
        //.min(2)
//      .max(2) //返回满足条件的最值
        .aggregate(Aggregations.MAX,2)
      .print()
  }
}

8. distinct

De-duplicates the data

package cn.itcast

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object DistinctDemo {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.加载数据源
    val data = new mutable.MutableList[(Int, String, Double)]
    data.+=((1, "yuwen", 89.0))
    data.+=((2, "shuxue", 92.2))
    data.+=((3, "yingyu", 89.99))
    data.+=((4, "wuli", 93.00))
    data.+=((5, "yuwen", 89.0))
    data.+=((6, "wuli", 93.00))

    val source: DataSet[(Int, String, Double)] = env.fromCollection(data)
    source.distinct(1) //去重
      .print()

  }

}

9. Left / right / full outer join

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object LeftAndRightAndFull {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.加载数据源
    val s1: DataSet[(Int, String)] = env.fromElements((1, "zhangsan") , (2, "lisi") ,(3 , "wangwu") ,(4 , "zhaoliu"))
    val s2: DataSet[(Int, String)] = env.fromElements((1, "beijing"), (2, "shanghai"), (4, "guangzhou"))

    //3.join关联
    //leftJoin
//    s1.leftOuterJoin(s2).where(0).equalTo(0){
//      (s1,s2)=>{
//        if(s2 == null){
//          (s1._1,s1._2,null)
//        }else{
//          (s1._1,s1._2,s2._2)
//        }
//      }
//    }

      //rightJoin
//    s1.rightOuterJoin(s2).where(0).equalTo(0) {
//      (s1, s2) => {
//        if (s1 == null) {
//          (s2._1, null, s2._2)
//        } else {
//          (s2._1, s1._2, s2._2)
//        }
//      }
//    }

      //fullJoin
      s1.fullOuterJoin(s2).where(0).equalTo(0){
        (s1,s2)=>{
          if (s1 == null) {
            (s2._1, null, s2._2)
          }else if(s2 == null){
            (s1._1,s1._2,null)
          } else {
            (s2._1, s1._2, s2._2)
          }
        }
      }
      .print()
  }
}

10. union

Merges multiple data sets

object Union {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    val s1: DataSet[String] = env.fromElements("java")
    val s2: DataSet[String] = env.fromElements("scala")
    val s3: DataSet[String] = env.fromElements("java")

    //union数据合并
    s1.union(s2).union(s3).print()
  }

}

11.rebalance

package cn.itcast

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
/**
  * @Date 2019/10/16
  */
object Rebalance {

  def main(args: Array[String]): Unit = {

    /**
      * 1. 获取 ExecutionEnvironment 运行环境
      * 2. 生成序列数据源
      * 3. 使用filter过滤大于50的数字
      * 4. 执行rebalance操作
      * 5.使用map操作传入 RichMapFunction ,将当前子任务的ID和数字构建成一个元组
      * 6. 打印测试
      */

    //1. 获取 ExecutionEnvironment 运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2. 生成序列数据源
    val source: DataSet[Long] = env.generateSequence(0,100)
    //3. 使用filter过滤大于50的数字
    val filterData: DataSet[Long] = source.filter(_>50)

    //4.避免数据倾斜
    val rebData: DataSet[Long] = filterData.rebalance()

    //5.数据转换
    rebData.map(new RichMapFunction[Long,(Int,Long)] {
      var subtask: Int = 0
      //open方法会在map方法之前执行
      override def open(parameters: Configuration): Unit = {
        //获取线程任务执行id
        //通过上下文对象获取
        subtask = getRuntimeContext.getIndexOfThisSubtask

      }

      override def map(value: Long): (Int, Long) = {
        (subtask,value)
      }
    })
    //数据打印,触发执行
      .print()
  }
}

12. partition

Partitioning operators

package cn.itcast

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object PartitionDemo {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    //加载数据源
    val data = new mutable.MutableList[(Int, Long, String)]
    data.+=((1, 1L, "Hi"))
    data.+=((2, 2L, "Hello"))
    data.+=((3, 2L, "Hello world"))
    data.+=((4, 3L, "Hello world, how are you?"))
    data.+=((5, 3L, "I am fine."))
    data.+=((6, 3L, "Luke Skywalker"))
    data.+=((7, 4L, "Comment#1"))
    data.+=((8, 4L, "Comment#2"))
    data.+=((9, 4L, "Comment#3"))
    data.+=((10, 4L, "Comment#4"))
    data.+=((11, 5L, "Comment#5"))
    data.+=((12, 5L, "Comment#6"))
    data.+=((13, 5L, "Comment#7"))
    data.+=((14, 5L, "Comment#8"))
    data.+=((15, 5L, "Comment#9"))
    data.+=((16, 6L, "Comment#10"))
    data.+=((17, 6L, "Comment#11"))
    data.+=((18, 6L, "Comment#12"))
    data.+=((19, 6L, "Comment#13"))
    data.+=((20, 6L, "Comment#14"))
    data.+=((21, 6L, "Comment#15"))
    val source = env.fromCollection(data)

    //3.partitionByHash分区
//    val result: DataSet[(Int, Long, String)] = source.partitionByHash(0).setParallelism(2).mapPartition(line => {
//      line.map(line => (line._1, line._2, line._3))
//    })

    //partitionByRange
//    val result: DataSet[(Int, Long, String)] = source.partitionByRange(0).setParallelism(2).mapPartition(line => {
//      line.map(line => (line._1, line._2, line._3))
//    })

    //sortPartition
    val result: DataSet[(Int, Long, String)] = source.sortPartition(0,Order.DESCENDING).setParallelism(2).mapPartition(line => {
      line.map(line => (line._1, line._2, line._3))
    })

    //4.数据落地
    result.writeAsText("sort",WriteMode.OVERWRITE)

    //5.触发执行
    env.execute("partition")
  }
}

13. first

Takes the first N records (counting from 1); it can also be applied after a grouping, as sketched after the example below

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import scala.collection.mutable

/**
  * @Date 2019/10/16
  */
object FirstDemo {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    //2.加载数据
    val data = new mutable.MutableList[(Int, Long, String)]
    data.+=((1, 1L, "Hi"))
    data.+=((2, 2L, "Hello"))
    data.+=((3, 2L, "Hello world"))
    data.+=((4, 3L, "Hello world, how are you?"))
    data.+=((5, 3L, "I am fine."))
    data.+=((6, 3L, "Luke Skywalker"))
    data.+=((7, 4L, "Comment#1"))
    data.+=((8, 4L, "Comment#2"))
    data.+=((9, 4L, "Comment#3"))
    data.+=((10, 4L, "Comment#4"))
    data.+=((11, 5L, "Comment#5"))
    data.+=((12, 5L, "Comment#6"))
    data.+=((13, 5L, "Comment#7"))
    data.+=((14, 5L, "Comment#8"))
    data.+=((15, 5L, "Comment#9"))
    data.+=((16, 6L, "Comment#10"))
    data.+=((17, 6L, "Comment#11"))
    data.+=((18, 6L, "Comment#12"))
    data.+=((19, 6L, "Comment#13"))
    data.+=((20, 6L, "Comment#14"))
    data.+=((21, 6L, "Comment#15"))
    val ds = env.fromCollection(data)
    //    ds.first(10).print()
    //还可以先goup分组,然后在使用first取值
    ds.first(2).print()

  }
}
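
As the comment in the example hints, first can also be applied after a grouping, optionally combined with an in-group sort. A minimal sketch using the same DataSet API as above (the object name FirstGroupedDemo and the sample tuples are only illustrative):

package cn.itcast

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._

object FirstGroupedDemo {

  def main(args: Array[String]): Unit = {

    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val ds: DataSet[(Int, Long, String)] = env.fromElements(
      (1, 1L, "Hi"), (2, 2L, "Hello"), (3, 2L, "Hello world"), (4, 3L, "I am fine."), (5, 3L, "Luke Skywalker"))

    //group by the second field, sort inside each group by the first field,
    //then take the first 2 records of every group
    ds.groupBy(1)
      .sortGroup(0, Order.DESCENDING)
      .first(2)
      .print()
  }
}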

source

  • Local collections
  • File-based
    • HDFS
      • txt
      • csv
    • Local files

Reading from local files and HDFS

package cn.itcast

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
/**
  * @Date 2019/10/16
  */
object TeadTxtDemo {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.读取本地磁盘文件
    //val source: DataSet[String] = env.readTextFile("C:\\Users\\zhb09\\Desktop\\tmp\\user.txt")
    //读取hdfs文件
    val source: DataSet[String] = env.readTextFile("hdfs://node01:8020/tmp/user.txt")

    //3.数据转换,单词统计
    val result: AggregateDataSet[(String, Int)] = source.flatMap(_.split(","))
      .map((_, 1))
      .groupBy(0)
      .sum(1)

    //4.数据写入hdfs,OVERWRITE:数据覆盖
    result.writeAsText("hdfs://node01:8020/tmp/user2.txt",WriteMode.OVERWRITE)

    env.execute()
  }
}

Reading CSV files

package cn.itcast

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/16
  */
object ReadCsv {


  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2.读取CSV文件
    val result: DataSet[(String, String, Int)] = env.readCsvFile[(String, String, Int)](
      "C:\\Users\\zhb09\\Desktop\\write\\test\\test.csv",
      lineDelimiter = "\n", //行分隔符
      fieldDelimiter = ",", //字段之间的分隔符
      ignoreFirstLine = true, //忽略首行
      lenient = false, //不忽略解析错误的行
      includedFields = Array(0, 1, 2) //读取列
    )
    result.first(5).print()
  }
}

Day 2 - Flink: Stream Computing Framework

Review:

Flink environment setup
  • standalone

    • submit a job: flink run …/*.jar
    • HA mode
  • flink on yarn

    • yarn-session
      • yarn-session -n 2 -tm 1024 -jm 1024 -s 1 -d
      • flink run …/*.jar
      • suited to a large number of small jobs
      • the YARN resources are requested once up front
      • the yarn-session must be shut down manually (yarn application -kill id)
    • yarn-cluster
      • flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 1024 -ys 1 …/*.jar
      • suited to batch and offline jobs
      • suited to large jobs
      • each job requests its own resources and releases them automatically when it finishes
  • Packaging: separate the code from its dependency jars and from its configuration files

  • Operators

    • map/flatmap/reduce/reduceGroup/filter/union
    • source
      • local
        • csv
        • txt
      • hdfs
        • csv
        • txt

Course plan

1. Batch processing

  • Broadcast variables
  • Distributed cache

2. Stream processing

  • Operators
    • keyBy
    • connect
    • split and select
  • source
    • hdfs
    • local
    • custom sources (kafka)
    • mysql
  • sink
    • mysql
    • kafka
    • hbase
    • redis
  • window and time

DataSet

(Image: photo/1571277131670.png)

1. Broadcast variables

package cn.itcast.dataset

import java.util

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.collection.mutable

/**
  * @Date 2019/10/17
  */
object BrocastDemo {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取批处理执行环境
      * 2.加器数据源
      * 3.数据转换
      *   (1)共享广播变量
      *   (2)获取广播变量
      *   (3)数据合并
      * 4.数据打印/触发执行
      *需求:从内存中拿到data2的广播数据,再与data1数据根据第二列元素组合成(Int, Long, String, String)
      */
    //1.获取批处理执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.加器数据源
    val data1 = new mutable.MutableList[(Int, Long, String)]
    data1 .+=((1, 1L, "xiaoming"))
    data1 .+=((2, 2L, "xiaoli"))
    data1 .+=((3, 2L, "xiaoqiang"))
    val ds1 = env.fromCollection(data1)

    val data2 = new mutable.MutableList[(Int, Long, Int, String, Long)]
    data2 .+=((1, 1L, 0, "Hallo", 1L))
    data2 .+=((2, 2L, 1, "Hallo Welt", 2L))
    data2 .+=((2, 3L, 2, "Hallo Welt wie", 1L))
    val ds2 = env.fromCollection(data2)

    //3.数据转换
    ds1.map(new RichMapFunction[(Int,Long,String),(Int, Long, String, String)] {

      var ds: util.List[(Int, Long, Int, String, Long)] = null
      //open在map方法之前先执行
      override def open(parameters: Configuration): Unit = {
        //(2)获取广播变量
        ds = getRuntimeContext.getBroadcastVariable[(Int, Long, Int, String, Long)]("ds2")

      }

      //(3)数据合并
      import collection.JavaConverters._
      override def map(value: (Int, Long, String)): (Int, Long, String, String) = {
        var tuple: (Int, Long, String, String) = null
        for(line<- ds.asScala){
          if(line._2 == value._2){
            tuple = (value._1,value._2,value._3,line._4)
          }
        }
        tuple
      }
    }).withBroadcastSet(ds2,"ds2")  //(1)共享广播变量
    //4.数据打印/触发执行
      .print()
  }
}

2. Distributed cache

package cn.itcast.dataset

import java.io.File

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

import scala.collection.mutable.ArrayBuffer
import scala.io.Source
/**
  * @Date 2019/10/17
  */
object DistributeCache {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取执行环境
      * 2.加载数据源
      * 3.注册分布式缓存
      * 4.数据转换
      *   (1)获取缓存文件
      *   (2)解析文件
      *   (3)数据转换
      * 5.数据打印/以及触发执行
      */

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

    //2.加载数据源
    val clazz:DataSet[Clazz] = env.fromElements(
      Clazz(1,"class_1"),
      Clazz(2,"class_1"),
      Clazz(3,"class_2"),
      Clazz(4,"class_2"),
      Clazz(5,"class_3"),
      Clazz(6,"class_3"),
      Clazz(7,"class_4"),
      Clazz(8,"class_1")
    )

    //3.注册分布式缓存
    val url = "hdfs://node01:8020/tmp/subject.txt"
    env.registerCachedFile(url,"cache")

    //4.数据转换
    clazz.map(new RichMapFunction[Clazz,Info] {

      val buffer = new ArrayBuffer[String]()
      override def open(parameters: Configuration): Unit = {
        //(1)获取缓存文件
        val file: File = getRuntimeContext.getDistributedCache.getFile("cache")
        //(2)解析文件
        val strs: Iterator[String] = Source.fromFile(file.getAbsoluteFile).getLines()
        strs.foreach(line=>{
          buffer.append(line)
        })
      }

      override def map(value: Clazz): Info = {
        var info:Info = null
        //(3)数据转换
        for(line <- buffer){
          val arr: Array[String] = line.split(",")
          if(arr(0).toInt == value.id){
            info = Info(value.id,value.clazz,arr(1),arr(2).toDouble)
          }
        }
        info
      }
    }).print() // 5.数据打印/以及触发执行

  }
}
//(学号 , 班级 , 学科 , 分数)
case class Info(id:Int,clazz:String,subject:String,score:Double)
case class Clazz(id:Int,clazz:String)

3. Accumulators

package cn.itcast.dataset

import org.apache.flink.api.common.JobExecutionResult
import org.apache.flink.api.common.accumulators.IntCounter
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration

/**
  * @Date 2019/10/17
  */
object AccumulatorCount {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取执行环境
      * 2.加载数据源
      * 3.数据转换
      * (1)新建累加器
      * (2)注册累加器
      * (3)使用累加器
      * 4.批量数据sink
      * 5.触发执行
      * 6.获取累加器的结果
      */

    //1.获取执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.加载数据源
    val source: DataSet[Int] = env.fromElements(1, 2, 3, 4, 5, 6)
    //3.数据转换
    val result: DataSet[Int] = source.map(new RichMapFunction[Int, Int] {

      //(1)新建累加器
      var counter = new IntCounter()

      override def open(parameters: Configuration): Unit = {
        //(2)注册累加器
        getRuntimeContext.addAccumulator("accumulator", counter)
      }

      override def map(value: Int): Int = {
        //(3)使用累加器
        counter.add(value)
        value
      }
    })
    //4.批量数据sink
    result.writeAsText("accumulator")

    //5.触发执行
    val execuResult: JobExecutionResult = env.execute()

    //6.获取累加器的结果
    val i: Int = execuResult.getAccumulatorResult[Int]("accumulator")
    println("累加器结果:" + i)
  }
}

DataStream

WordCount

package cn.itcast.datastream

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
  * @Date 2019/10/17
  */
object WordCountStream {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取流处理执行环境
      * 2.数据源加载
      * 3.数据转换
      *   切分,分组,聚合
      * 4.数据打印
      * 5.触发执行
      */

    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //2.数据源加载
    val source: DataStream[String] = env.socketTextStream("node01",8090)

    //3.数据转换
    source.flatMap(_.split("\\W+"))  //flatMap (not split) turns each line into words
      .map((_,1))
      .keyBy(0) //分组
      .sum(1)  //数据聚合
      .print()     //4.数据打印

    //5.触发执行
    env.execute()
  }
}

connect

package cn.itcast.datastream

import org.apache.flink.streaming.api.functions.co.CoMapFunction
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
  * @Date 2019/10/17
  */
object Connect {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取流处理执行环境
      * 2.加载/创建数据源
      * 3.使用connect连接数据流,并做map转换
      * 4.打印测试
      * 5.触发执行
      */
    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //2.加载/创建数据源
    val source: DataStream[Int] = env.fromElements(1,2,3,4,5,6)
    //3.使用connect连接数据流,并做map转换
    val strSource: DataStream[String] = source.map(line=>line+"==")

    //connect
    source.connect(strSource).map(new CoMapFunction[Int,String,String] {
      override def map1(value: Int): String = {
        value+"xxxxx"
      }

      override def map2(value: String): String = {
        value
      }
    }).print()//4.打印测试
    //5.触发执行
    env.execute()
  }
}

split and select

package cn.itcast.datastream

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
  * @Date 2019/10/17
  */
object SplitAndSelect {

  def main(args: Array[String]): Unit = {

    //1.获取执行
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.加载数据源
    val source: DataStream[Int] = env.fromElements(1,2,3,4,5,6)

    //3.数据切分
    val splitData: SplitStream[Int] = source.split(line => {
      line % 2 match {
        case 0 => List("even")
        case 1 => List("odd")
      }
    })
    //4.切分流数据查询
    splitData.select("even").print()

    env.execute()
  }

}

source

  • Local collections

  • Distributed file systems

  • Custom sources

    • kafka
    • user-defined sources
  • Socket-based

    A parallel source function:

package cn.itcast.datastream.source

import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
  * @Date 2019/10/17
  */
object SourceFunDemo {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.自定义数据源
    env.addSource(new SourceFun).setParallelism(1)
      .print()

    env.execute()
  }

}

//多并行度数据源
class SourceFun extends RichParallelSourceFunction[Int]{
  override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
        var i:Int = 0
        while (true){
          i+=1
          ctx.collect(i)
          Thread.sleep(1000)
        }
  }

  override def cancel(): Unit = ???
}


//单数源,不能够用多个并行度
//class SourceFun extends RichSourceFunction[Int] {
//  override def run(ctx: SourceFunction.SourceContext[Int]): Unit = {
//    var i:Int = 0
//    while (true){
//      i+=1
//      ctx.collect(i)
//      Thread.sleep(1000)
//    }
//
//  }
//
//  override def cancel(): Unit = ???
//}


Extension

(Image: photo/1571284628010.png)

kafka

1. View the consumer-group offsets for a topic

kafka-consumer-groups.sh --group test1017 --describe --bootstrap-server node01:9092,node02:9092,node03:9092 

2. Create a topic

kafka-topics.sh --create --topic demo --partitions 3 --replication-factor 2 --zookeeper node01:2181,node02:2181,node03:2181

3. Produce data

kafka-console-producer.sh --broker-list node01:9092,node02:9092,node03:9092 --topic demo

4. Consume data

kafka-console-consumer.sh --topic demo --bootstrap-server node01:9092,node02:9092,node03:9092

5. Change the number of partitions

 kafka-topics.sh --alter --partitions 4 --topic demo --zookeeper node01:2181,node02:2181,node03:2181

package cn.itcast.datastream.source

import java.{lang, util}
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.internals.KafkaTopicPartition
/**
  * @Date 2019/10/17
  */
object KafkaConsumer {


  def main(args: Array[String]): Unit = {

    /**
      * 1.获取流处理执行环境
      * 2.配置kafka参数
      * 3.整合kafka
      * 4.设置kafka消费者模式
      * 5.加载数据源
      * 6.数据打印
      * 7.触发执行
      */

    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.配置kafka参数
    val properties = new Properties()
    properties.setProperty("bootstrap.servers","node01:9092,node02:9092,node03:9092")
    properties.setProperty("group.id","test1017")
    properties.setProperty("auto.offset.reset", "latest") //最近消费,与offset相关,从消费组上次消费的偏移量开始消费

    //3.整合kafka
    val kafkaConsumer = new FlinkKafkaConsumer011[String]("demo",new SimpleStringSchema(),properties)

    //4.设置kafka消费者模式
    //默认值,当前消费组记录的偏移量开始,接着上次的偏移量消费
    //kafkaConsumer.setStartFromGroupOffsets()
    //从头消费
    //kafkaConsumer.setStartFromEarliest()
    //从最近消费,与offset无关,会导致数据丢失
//    kafkaConsumer.setStartFromLatest()

    //指定偏移量消费数据
//    val map = new util.HashMap[KafkaTopicPartition,lang.Long]()
//    map.put(new KafkaTopicPartition("demo",0),6L)
//    map.put(new KafkaTopicPartition("demo",1),6L)
//    map.put(new KafkaTopicPartition("demo",2),6L)
//
//    kafkaConsumer.setStartFromSpecificOffsets(map)

    //动态感知kafka主题分区的增加 单位毫秒
    properties.setProperty("flink.partition-discovery.interval-millis", "5000");


    //5.加载数据源
    val source: DataStream[String] = env.addSource(kafkaConsumer)
    //6.数据打印
    source.print()
    //7.触发执行
    env.execute()
  }
}

mysql Source

package cn.itcast.datastream.source

import java.sql.{Connection, DriverManager, PreparedStatement, ResultSet}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
  * @Date 2019/10/17
  */
object MysqlSource {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.自定义数据源,读取mysql数据
    env.addSource(new MySqlSourceDemo)
      .print()

    //3.触发执行
    env.execute()
  }

}

case class Demo(id:Int,name:String,age:Int)
class MySqlSourceDemo extends RichSourceFunction[Demo] {
  var conn: Connection = null
  var pst: PreparedStatement = null
  //初始化数据源
  override def open(parameters: Configuration): Unit = {

    val driver = "com.mysql.jdbc.Driver"
    Class.forName(driver)
    val url = "jdbc:mysql://node02:3306/itcast"
    //获取连接
    conn = DriverManager.getConnection(url,"root","123456")
    pst = conn.prepareStatement("select * from demo")
    pst.setMaxRows(100) //查询最大行数
  }

  //执行业务查询的主方法
  override def run(ctx: SourceFunction.SourceContext[Demo]): Unit = {
    val rs: ResultSet = pst.executeQuery()
    while (rs.next()){
      val id: Int = rs.getInt(1)
      val name: String = rs.getString(2)
      val age: Int = rs.getInt(3)
      ctx.collect(Demo(id,name,age))
    }

  }

  //关流
  override def close(): Unit = {
    if(pst != null){
      pst.close()
    }
    if(conn != null){
      conn.close()
    }

  }

  override def cancel(): Unit = ???
}

sink

mysql sink

package cn.itcast.datastream

import java.sql.{Connection, DriverManager, PreparedStatement}

import cn.itcast.datastream.source.Demo
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

/**
  * @Date 2019/10/17
  */
object MysqlSinkDemo {

  def main(args: Array[String]): Unit = {
    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.加载数据源
    val source: DataStream[Demo] = env.fromElements(Demo(20, "xiaoli", 20))

    //3.数据写入mysql
    source.addSink(new SinkMysql)

    //4.触发执行
    env.execute()
  }

}

class SinkMysql extends RichSinkFunction[Demo] {


  var conn: Connection = null
  var pst: PreparedStatement = null

  //初始化数据源
  override def open(parameters: Configuration): Unit = {

    val driver = "com.mysql.jdbc.Driver"
    Class.forName(driver)
    val url = "jdbc:mysql://node02:3306/itcast"
    //获取连接
    conn = DriverManager.getConnection(url, "root", "123456")
    pst = conn.prepareStatement("insert into demo values(?,?,?)")
  }

  //数据插入操作
  override def invoke(value: Demo): Unit = {

    pst.setInt(1, value.id)
    pst.setString(2, value.name)
    pst.setInt(3, value.age)
    pst.executeUpdate()
  }

  //关流
  override def close(): Unit = {
    if (pst != null) {
      pst.close()
    }
    if (conn != null) {
      conn.close()
    }

  }
}

kafka sink

package cn.itcast.datastream.sink

import java.util.Properties

import cn.itcast.datastream.source.Demo
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
/**
  * @Date 2019/10/17
  */
object KafkaSinkDemo {

  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.加载本地数据
    val source: DataStream[Demo] = env.fromElements(Demo(1,"xiaoli",20))
    val sourceStr: DataStream[String] = source.map(line => {
      line.toString
    })


    //3.flink整合kafka
    val properties = new Properties() 
    properties.setProperty("bootstrap.servers","node01:9092,node02:9092,node03:9092")

    val kafkaProducer: FlinkKafkaProducer011[String] = new FlinkKafkaProducer011[String]("demo",new SimpleStringSchema(),properties)

    //4.数据写入kafka
    sourceStr.addSink(kafkaProducer)

    //5.触发执行
    env.execute()
  }
}

redis sink

package cn.itcast.datastream.sink

import java.net.{InetAddress, InetSocketAddress}
import java.util

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisClusterConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
/**
  * @Date 2019/10/17
  */
object RedisSinkDemo {


  def main(args: Array[String]): Unit = {

    //1.获取执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.实时读取数据
    val source: DataStream[String] = env.socketTextStream("node01",8090)

    //3.将数据求和
    val result: DataStream[(String, Int)] = source.flatMap(_.split("\\W+"))
      .map((_, 1))
      .keyBy(0)
      .sum(1)

    //4.将结果放入redis
    //节点配置
    val set = new util.HashSet[InetSocketAddress]()
    set.add(new InetSocketAddress(InetAddress.getByName("node01"),7001))
    set.add(new InetSocketAddress(InetAddress.getByName("node01"),7002))
    set.add(new InetSocketAddress(InetAddress.getByName("node01"),7003))
    //配置对象
    val config: FlinkJedisClusterConfig = new FlinkJedisClusterConfig.Builder()
      .setNodes(set)
      .setMaxTotal(5)
      .build()

    //5.数据写入redis
    result.addSink(new RedisSink(config,new MySinkRedis))

    //6.触发执行
    env.execute()
  }
}

class MySinkRedis extends RedisMapper[(String,Int)] {
  //指定redis的数据类型
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET,"sinkRedis")
  }

  //redis key
  override def getKeyFromData(data: (String, Int)): String = {
    data._1
  }

  //redis value
  override def getValueFromData(data: (String, Int)): String = {
    data._2.toString
  }
}

Window

Time types:

  • Event time: the timestamp carried by the event itself
  • Ingestion time: the time at which the data is read from the source
  • Processing time: the time at which the operator processes the event

Window types:

  • Time windows
    • overlapping (sliding)
    • non-overlapping (tumbling)
  • Count windows
    • overlapping
    • non-overlapping

Time windows:

(Image: photo/1571302824199.png)

Count windows:

(Image: photo/1571303238317.png)

Aggregating with time windows and count windows:

package cn.itcast.datastream

import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * @Date 2019/10/17
  */
object WindowDemo {


  def main(args: Array[String]): Unit = {

    /**
      * 1.获取流处理执行环境
      * 2.实时加载数据源
      * 3.数据转换:Car(id,count)
      * 4.数据分组
      * 5.划分窗口
      * 6.求和
      * 7.数据打印
      * 8.触发执行
      */
    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.实时加载数据源
    val source: DataStream[String] = env.socketTextStream("node01", 8090)

    //3.数据转换:Car(id,count)
    val cars: DataStream[Car] = source.map(line => {
      val arr: Array[String] = line.split(",")
      Car(arr(0).toInt, arr(1).toInt)
    })
    //4.数据分组
    cars.keyBy(_.id)
      //5.划分窗口
      //无重叠的时间窗口
      .timeWindow(Time.seconds(3))
      //有重叠的时间窗口
//      .timeWindow(Time.seconds(6),Time.seconds(3))
      //无重叠的数量窗口
      //.countWindow(3)
      //有重叠的数量窗口
      //.countWindow(6,3)
      //.reduce((x,y)=>Car(x.id,x.count+y.count))
      //.apply(new CarWindow)
      .fold(100){
      (x,y)=>{
        x+y.count
      }
    }
      //6.求和
      //.sum(1)
      //7.数据打印
      .print()

    // 8.触发执行
    env.execute()

  }
}

case class Car(id: Int, count: Int)

class CarWindow extends WindowFunction[Car,Car,Int,TimeWindow] {
  override def apply(key: Int, window: TimeWindow, input: Iterable[Car], out: Collector[Car]): Unit = {

    var key = 0
    var sum =0
    for(line<- input){
      key = line.id
      sum = sum+line.count
    }
    out.collect(Car(key,sum))
  }
}

Day 3 - Flink: Stream Computing Framework

Review

  • Local mode

  • Standalone mode (HA)

  • flink on yarn

    • makes maximum use of the cluster's resources
    • yarn-session
      • suited to a large number of small jobs
      • resources are requested once up front
      • resources are not released automatically and must be shut down manually
      • yarn-session -n 2 -tm 1024 -jm 1024 -s 2 -d
      • flink run *.jar
      • yarn application -list to view the application list
      • yarn application -kill ID
    • yarn-cluster
      • suited to batch, one-off, or large jobs
      • resources are requested when the job starts and released when it finishes
      • flink run -m yarn-cluster -yn 1 -ys 1 -ytm 1024 -yjm 1024 *.jar
  • dataSet

    • operators
      • map/flatmap/reduce/reduceGroup/mapPartition/join/union/rebalance
    • source
      • local collections
      • local/disk files (txt/csv)
      • distributed file systems (hdfs) (txt/csv)
      • FTP
      • custom sources
        • kafka/mysql
        • RichSourceFunction/SourceFunction (single-parallelism sources)
        • RichParallelSourceFunction (parallel sources)
    • sink
      • collections
      • files
        • local
        • distributed file systems
        • custom sinks (kafka/mysql/redis)
    • broadcast variables
      • the data is kept in memory
      • fast
      • cannot handle large data volumes (OOM)
    • distributed cache
      • file-based, can handle large data volumes
      • incurs network and disk I/O, so it is slower
  • DataStream

    • keyBy: grouping
    • connect / split / select
    • window
      • time
        • overlapping (6, 3)
        • non-overlapping
      • count
        • overlapping
        • non-overlapping
      • apply

DataStream

window and time

Time, classified by how the timestamp is obtained (a short sketch of selecting each follows below):

Event time (eventTime): the timestamp carried by the event itself

Ingestion time: the time at which Flink consumes the data, i.e. the current system time at ingestion

Processing time: the time at which the operator processes the data, i.e. the current system time

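A minimal sketch of selecting the time semantics on the execution environment (same Flink 1.7-era API used throughout these notes; in a real job only one of the three calls would be kept, and EventTime is the one used by the watermark example below; the object name TimeCharacteristicDemo is only illustrative):

package cn.itcast

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object TimeCharacteristicDemo {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //processing time is the default: windows fire based on the operator's system clock
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    //ingestion time: the timestamp is attached when the source reads the record
    env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime)
    //event time: the timestamp carried by the record itself; requires watermarks
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  }
}
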
Watermark example

package cn.itcast

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * @Date 2019/10/19
  */
object EventTimeDemo {


  def main(args: Array[String]): Unit = {

    /**
      * 需求:以EventTime划分窗口,计算3秒钟内出价最高的产品
      * 步骤:
      * 1.获取流处理执行环境
      * 2.设置事件时间
      * 3.加载数据源
      * 4.数据转换,新建样例类
      * 5.设置水位线(延迟的时间轴)
      * 6.分组
      * 7.划分窗口时间 3s
      * 8.统计最大值
      * 9.数据打印
      * 10.触发执行
      */

    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2.设置事件时间,必须设置
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    //3.加载数据源
    val source: DataStream[String] = env.socketTextStream("node01", 8090)
    //4.数据转换,新建样例类
    val bossData: DataStream[Boss] = source.map(line => {
      val arr: Array[String] = line.split(",")
      Boss(arr(0).toLong, arr(1), arr(2), arr(3).toDouble)
    })
    //5.设置水位线(延迟的时间轴),周期性水位线
    //实现一:
        val waterData: DataStream[Boss] = bossData.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Boss] {

          //延时时间,自己定义的,到公司中,看具体实际情况
          val delayTime: Long = 2000L
          var currentTimestamp: Long = 0L //当前时间  22s 23s

          //2.后执行, 获取当前水位线
          override def getCurrentWatermark: Watermark = {
            new Watermark(currentTimestamp - delayTime) //这就是水位线的时间,是一个延迟的时间轴
          }

          //1.先执行, 提取事件时间(消息源本身的时间)
          override def extractTimestamp(element: Boss, previousElementTimestamp: Long): Long = {
            //消息本身的时间 22s 23s 19s
            val timestamp = element.time
            //谁打大取谁
            currentTimestamp = Math.max(timestamp, currentTimestamp) //保证时间永远往前走
            timestamp
          }
        })

    //实现二:
    //Time.seconds(2) 是延时时间
//    val waterData: DataStream[Boss] = bossData.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[Boss](Time.seconds(2)) {
//      override def extractTimestamp(element: Boss): Long = {
//        val time: Long = element.time
//        time
//      }
//    })

    //6.分组
    waterData.keyBy(_.boss)
      //7.划分窗口时间 3s
      .timeWindow(Time.seconds(3))
      //8.统计最大值
      .maxBy(3)
      //9.数据打印
      .print()

    //10.触发执行
    env.execute()
  }
}

//数据:(时间,公司,产品,价格)
case class Boss(time: Long, boss: String, product: String, price: Double)

1. Window boundaries do not depend on the event time of the data; they depend only on the window itself. With a 3-second window, windows start counting from second 00, covering (00, 01, 02), then the next window, and so on (a small sketch of this alignment follows below).

2. To divide windows by event time, TimeCharacteristic.EventTime must be set.

3. A periodic watermark assigner (AssignerWithPeriodicWatermarks) must be implemented.
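
A small illustration of point 1, in plain Scala rather than the Flink API: the window an event falls into is determined only by its timestamp and the window size (tumbling windows are aligned to the epoch), not by when the first event arrives. The object name WindowAlignment is only illustrative:

object WindowAlignment {

  //start of the tumbling window (size in milliseconds) that a timestamp falls into;
  //Flink aligns its tumbling windows the same way (modulo an optional offset)
  def windowStart(timestamp: Long, size: Long): Long = timestamp - (timestamp % size)

  def main(args: Array[String]): Unit = {
    val size = 3000L                    //3-second windows: [0s,3s), [3s,6s), ...
    println(windowStart(1000L, size))   //0    -> window [0s, 3s)
    println(windowStart(4500L, size))   //3000 -> window [3s, 6s)
  }
}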

Handling late data and the side output stream

package cn.itcast

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * @Date 2019/10/19
  */
object EventTimeDemo {


  def main(args: Array[String]): Unit = {

    /**
      * 需求:以EventTime划分窗口,计算3秒钟内出价最高的产品
      * 步骤:
      * 1.获取流处理执行环境
      * 2.设置事件时间
      * 3.加载数据源
      * 4.数据转换,新建样例类
      * 5.设置水位线(延迟的时间轴)
      * 6.分组
      * 7.划分窗口时间 3s
      * 8.统计最大值
      * 9.数据打印
      * 10.触发执行
      */

    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2.设置事件时间,必须设置
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    //3.加载数据源
    val source: DataStream[String] = env.socketTextStream("node01", 8090)
    //4.数据转换,新建样例类
    val bossData: DataStream[Boss] = source.map(line => {
      val arr: Array[String] = line.split(",")
      Boss(arr(0).toLong, arr(1), arr(2), arr(3).toDouble)
    })
    //5.设置水位线(延迟的时间轴),周期性水位线
    //实现一:
    val waterData: DataStream[Boss] = bossData.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Boss] {

      //延时时间,自己定义的,到公司中,看具体实际情况
      val delayTime: Long = 2000L
      var currentTimestamp: Long = 0L //当前时间  22s 23s

      //2.后执行, 获取当前水位线
      override def getCurrentWatermark: Watermark = {
        new Watermark(currentTimestamp - delayTime) //这就是水位线的时间,是一个延迟的时间轴
      }

      //1.先执行, 提取事件时间(消息源本身的时间)
      override def extractTimestamp(element: Boss, previousElementTimestamp: Long): Long = {
        //消息本身的时间 22s 23s 19s
        val timestamp = element.time
        //谁打大取谁
        currentTimestamp = Math.max(timestamp, currentTimestamp) //保证时间永远往前走
        timestamp
      }
    })

    //定义侧输出流
    val outputTag = new OutputTag[Boss]("outPutTag")

    //6.分组
    val result: DataStream[Boss] = waterData.keyBy(_.boss)
      //7.划分窗口时间 3s
      .timeWindow(Time.seconds(3))
      //在水位线延迟的基础之上,再延迟2s钟
      .allowedLateness(Time.seconds(2))
      //收集延迟数据
      .sideOutputLateData(outputTag)
      //8.统计最大值
      .maxBy(3)
    //9.数据打印
    result.print("正常数据:")

    //打印延迟数据
    val outputData: DataStream[Boss] = result.getSideOutput(outputTag)
    outputData.print("延迟数据:")

    //10.触发执行
    env.execute()
  }
}

//数据:(时间,公司,产品,价格)
case class Boss(time: Long, boss: String, product: String, price: Double)

Async I/O

1. Async I/O main class

package cn.itcast;

import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.concurrent.TimeUnit;

/**
 * @Date 2019/10/19
 */
public class AsyIoDemo {

    public static void main(String[] args) throws Exception {
        /**
         * 1.获取流处理执行环境
         * 2.加载ab.txt
         * 3.异步流对象AsyncDataStream使用有序模式
         * 4.自定义异步处理函数,初始化redis
         * 5.CompletableFuture发起异步请求
         * 6.thenAccept接收异步返回数据
         * 7.打印结果
         * 8.触发执行
         */
        //1.获取流处理执行环境
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //2.加载ab.txt
        DataStreamSource source = env.readTextFile("C:\\Users\\zhb09\\Desktop\\tmp\\fs\\ab.txt");

        //3.异步流对象AsyncDataStream使用有序模式
        SingleOutputStreamOperator result = AsyncDataStream.orderedWait(source, new MyAsyncFun(), 60000, TimeUnit.SECONDS, 1);
        //7.打印结果
        result.print();
        //8.触发执行
        env.execute();
    }
}

2. RichAsyncFunction

package cn.itcast;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
import org.apache.flink.streaming.api.scala.async.AsyncFunction;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.JedisPoolConfig;

import java.util.Collections;
import java.util.HashSet;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/**
 * @Date 2019/10/19
 */
//
public class MyAsyncFun extends RichAsyncFunction {

    //初始化redis连接
    JedisCluster jedisCluster = null;

    @Override
    public void open(Configuration parameters) throws Exception {
        HashSet set = new HashSet();
        set.add(new HostAndPort("node01", 7001));
        set.add(new HostAndPort("node01", 7002));
        set.add(new HostAndPort("node01", 7003));

        JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
        jedisPoolConfig.setMaxTotal(10);
        jedisPoolConfig.setMaxIdle(10);
        jedisPoolConfig.setMinIdle(5);

        jedisCluster = new JedisCluster(set, jedisPoolConfig);
    }

    @Override
    public void asyncInvoke(String input, ResultFuture resultFuture) throws Exception {

        //5.发起异步请求
        CompletableFuture.supplyAsync(new Supplier() {
            //第一步,获取所需结果值
            @Override
            public String get() {
                //数据切分,获取name值
                String name = input.split(",")[1];
                String redisValue = jedisCluster.hget("AsyncReadRedis", name);
                return redisValue;
            }
        }).thenAccept((String str) -> {  //6.thenAccept接收异步返回数据
            //第二,异步回调
            resultFuture.complete(Collections.singleton(str));
        });
    }
}

State management

  • state: stores intermediate result values
  • checkpoint (key point): stores snapshot data; it is the basis of fault tolerance
  • state data structures:
    • ValueState
    • ListState
    • ReducingState (a short sketch follows this list)
    • MapState (the most widely used in practice)
    • all of this data is kept in memory, so the volume must not be too large
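
ReducingState has no dedicated example in the sections below, so here is a minimal sketch following the same keyed-state pattern as the ValueState and MapState examples (the object name ReducingStateDemo and the sample data are only illustrative): the descriptor carries a ReduceFunction, and the state itself merges every new value into the stored one.

package cn.itcast

import org.apache.flink.api.common.functions.{ReduceFunction, RichMapFunction}
import org.apache.flink.api.common.state.{ReducingState, ReducingStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

object ReducingStateDemo {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val source: DataStream[(String, Int)] = env.fromCollection(List(
      ("java", 1), ("scala", 2), ("java", 3)))

    source.keyBy(_._1).map(new RichMapFunction[(String, Int), (String, Int)] {
      var sumState: ReducingState[Int] = _

      override def open(parameters: Configuration): Unit = {
        //the descriptor holds the reduce function that merges a new value
        //into the value already stored for the current key
        val rs = new ReducingStateDescriptor[Int]("rs",
          new ReduceFunction[Int] {
            override def reduce(a: Int, b: Int): Int = a + b
          },
          TypeInformation.of(new TypeHint[Int] {}))
        sumState = getRuntimeContext.getReducingState(rs)
      }

      override def map(value: (String, Int)): (String, Int) = {
        sumState.add(value._2)      //the state applies the reduce function itself
        (value._1, sumState.get())  //running sum for this key
      }
    }).print()

    env.execute()
  }
}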

ValueState

package cn.itcast

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._

/**
  * @Date 2019/10/19
  */
object ValueState {

  def main(args: Array[String]): Unit = {

    /**
      * 开发步骤:
      *       1.获取流处理执行环境
      *       2.加载数据源,以及数据处理
      *       3.数据分组
      *       4.数据转换,定义ValueState,保存中间结果
      *       5.数据打印
      *       6.触发执行
      */
    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //2.加载数据源
    val source: DataStream[(Long, Long)] = env.fromCollection(List(
      (1L, 4L),
      (2L, 3L),
      (3L, 1L),
      (1L, 2L),
      (3L, 2L),
      (1L, 2L),
      (2L, 2L),
      (2L, 9L)
    ))

    //数据处理

    //3.数据分组
    val keyData: KeyedStream[(Long, Long), Tuple] = source.keyBy(0)

    //4.数据转换,定义ValueState,保存中间结果
    keyData.map(new RichMapFunction[(Long, Long), (Long, Long)] {
      var vsState: ValueState[(Long, Long)] = _

      //定义ValueState
      override def open(parameters: Configuration): Unit = {
        //新建状态描述器
        val vs = new ValueStateDescriptor[(Long, Long)]("vs", TypeInformation.of(new TypeHint[(Long, Long)] {}))
        //获取ValueState
        vsState = getRuntimeContext.getState(vs)
      }

      //计算并保存中间结果
      override def map(value: (Long, Long)): (Long, Long) = {

        //获取vsState内的值
        val tuple: (Long, Long) = vsState.value()
        val currnetTuple = if (tuple == null) {
          (0L, 0L)
        } else {
          tuple
        }

        val tupleResult: (Long, Long) = (value._1, value._2 + currnetTuple._2)
        //更新vsState
        vsState.update(tupleResult)

        tupleResult
      }
    }).print() //5.数据打印


    //6.触发执行
    env.execute()

  }
}

MapState

package cn.itcast

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._

/**
  * @Date 2019/10/19
  */
object MapState {

  def main(args: Array[String]): Unit = {
    /**
      *       1.获取流处理执行环境
      *       2.加载数据源
      *       3.数据分组
      *       4.数据转换,定义MapState,保存中间结果
      *       5.数据打印
      *       6.触发执行
      */
    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //2.加载数据源
    val source: DataStream[(String, Int)] = env.fromCollection(List(
      ("java", 1),
      ("python", 3),
      ("java", 2),
      ("scala", 2),
      ("python", 1),
      ("java", 1),
      ("scala", 2)
    ))
    //3.数据分组
    val keyData: KeyedStream[(String, Int), Tuple] = source.keyBy(0)
    //4.数据转换,定义MapState,保存中间结果
    keyData.map(new RichMapFunction[(String, Int), (String, Int)] {
      var mapState: MapState[String, Int] = _

      //定义MapState
      override def open(parameters: Configuration): Unit = {
        //定义MapState的描述器
        val ms = new MapStateDescriptor[String, Int]("ms", TypeInformation.of(new TypeHint[String] {}),
          TypeInformation.of(new TypeHint[Int] {}))
        //注册和获取mapState
        mapState = getRuntimeContext.getMapState(ms)
      }

      //计算并保存中间结果
      override def map(value: (String, Int)): (String, Int) = {
        //先获取mapState内的值
        val i: Int = mapState.get(value._1)
        //mapState数据更新
        mapState.put(value._1, value._2 + i)
        (value._1, value._2 + i)
      }
    }).print() //5.数据打印

    //6.触发执行
    env.execute()
  }
}

ListState

(1) Data accumulation (running sum)

package cn.itcast

import java.{lang, util}

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.api.java.typeutils.ListTypeInfo
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
import redis.clients.jedis.Tuple
/**
  * @Date 2019/10/14
  */
object ListState {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val source: DataStream[(String, Int)] = env.fromElements(
      ("java", 1),
      ("python", 3),
      ("java", 4),
      ("scala", 2),
      ("python", 1),
      ("java", 1))

    source.keyBy(0).map(new RichMapFunction[(String,Int),(String,Int)] {
      //定义listState
      var listState: ListState[(String,Int)] = _
      override def open(parameters: Configuration): Unit = {
        //定义描述器
        val liState: ListStateDescriptor[(String,Int)] = new ListStateDescriptor[(String,Int)]("liState",TypeInformation.of(new TypeHint[(String,Int)] {}))
        //注册和获取listState
        listState = getRuntimeContext.getListState(liState)
      }

      //数据转换和计算
      override def map(value: (String,Int)): (String,Int) = {
        //获取最新的listState的值
        val ints: lang.Iterable[(String,Int)] = listState.get()
        val v2: util.Iterator[(String,Int)] = ints.iterator()
        var i: (String,Int) =null
        while (v2.hasNext){
          i = v2.next()
        }
        val iData = if(i == null){
          ("null",0)
        }else{
          i
        }
        val v3 = (value._1,value._2+iData._2)
        listState.clear()
        listState.add(v3)
        v3
      }
    }).print()

    env.execute()
  }
}

(2) Simulating Kafka offsets

package cn.itcast

import java.util

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.api.common.time.Time
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
/**
  * @Date 2019/10/19
  */
object OperateListState {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取执行环境
      * 2.设置检查点机制:路径,重启策略
      * 3.自定义数据源
      * (1)需要继承并行数据源和CheckpointedFunction
      * (2)设置listState,通过上下文对象context获取
      * (3)数据处理,保留offset
      * (4)制作快照
      * 4.数据打印
      * 5.触发执行
      */
    //1.获取执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)


    //2.设置检查点机制:路径,重启策略
    env.enableCheckpointing(1000)//每1s,启动一次检查点
    //检查点保存路径
    env.setStateBackend(new FsStateBackend("hdfs://node01:8020/checkpoint"))
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)//强一致性
    env.getCheckpointConfig.setCheckpointTimeout(60000)//检查点制作超时时间
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false) //检查点制作失败,任务继续运行
    //任务取消时,删除检查点
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION)

    //重启策略 ,重启3次,每次间隔5s
    //注意:一旦配置检查点机制(且未配置重启策略)会无限重启;如果取消检查点机制,出现异常直接宕机
    env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,Time.seconds(5)))

    //3.自定义数据源
    env.addSource(new OpeSource)
      .print() // 4.数据打印

    //5.触发执行
    env.execute()

  }

}

class OpeSource extends RichSourceFunction[Long] with CheckpointedFunction {
  var lsState: ListState[Long] =_
  var offset:Long = 0L
  //业务逻辑处理
  override def run(ctx: SourceFunction.SourceContext[Long]): Unit = {
    //获取listState的数据
    val lsData: util.Iterator[Long] = lsState.get().iterator()
    while (lsData.hasNext){
      offset = lsData.next()
    }

    //注意:我们是模拟kafka的offset的提交
    while (true){
      offset +=1
      ctx.collect(offset)
      Thread.sleep(1000)
      if(offset>10){
        1/0
      }
    }

  }

  //取消任务
  override def cancel(): Unit = ???

  //制作offset 的快照
  override def snapshotState(context: FunctionSnapshotContext): Unit = {
    lsState.clear() //清空历史offset数据
    lsState.add(offset) //更新最新的offset
  }

  //(2)设置listState,通过上下文对象context获取
  //初始化状态
  override def initializeState(context: FunctionInitializationContext): Unit = {
    //定义并获取listState
    val ls = new ListStateDescriptor[Long]("ls",TypeInformation.of(new TypeHint[Long] {}))
    lsState = context.getOperatorStateStore.getListState(ls)
  }
}

Broadcast State

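Broadcast state sends a low-throughput stream (rules, configuration, dimension data) to every parallel instance of an operator, where it is held as a MapState that the main stream can read while processing its own elements. The following is a minimal, self-contained sketch of that pattern against the Flink 1.7 broadcast-state API; the BroadcastStateDemo object, the sample word/label streams and the "rules" descriptor name are illustrative assumptions, not part of the course code.

package cn.itcast

import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.api.common.typeinfo.{TypeHint, TypeInformation}
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object BroadcastStateDemo {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //主流:单词计数;广播流:单词的标签(模拟规则/维表数据)
    val mainStream: DataStream[(String, Int)] = env.fromElements(("java", 1), ("scala", 2), ("python", 3))
    val ruleStream: DataStream[(String, String)] = env.fromElements(("java", "jvm"), ("scala", "jvm"))

    //描述器既用于广播,也用于后续读取广播状态
    val ruleDesc = new MapStateDescriptor[String, String]("rules",
      TypeInformation.of(new TypeHint[String] {}),
      TypeInformation.of(new TypeHint[String] {}))

    //1.广播规则流
    val broadcastRules = ruleStream.broadcast(ruleDesc)

    //2.主流connect广播流,在process中读写广播状态
    mainStream.connect(broadcastRules)
      .process(new BroadcastProcessFunction[(String, Int), (String, String), String] {

        //主流元素:只能读广播状态
        override def processElement(value: (String, Int),
                                    ctx: BroadcastProcessFunction[(String, Int), (String, String), String]#ReadOnlyContext,
                                    out: Collector[String]): Unit = {
          val label = ctx.getBroadcastState(ruleDesc).get(value._1)
          out.collect(value._1 + "," + value._2 + ",label=" + label)
        }

        //广播流元素:更新每个并行实例上的广播状态
        override def processBroadcastElement(value: (String, String),
                                             ctx: BroadcastProcessFunction[(String, Int), (String, String), String]#Context,
                                             out: Collector[String]): Unit = {
          ctx.getBroadcastState(ruleDesc).put(value._1, value._2)
        }
      })
      .print() //3.数据打印

    //4.触发执行
    env.execute()
  }
}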

Course review

  • watermark (a minimal assigner sketch follows this list)
    • Event time must be enabled: TimeCharacteristic.EventTime
    • Windows are aligned from second 00; with a 3 s window the first window covers (00, 01, 02)
    • Use a periodic assigner: AssignerWithPeriodicWatermarks
      • Track the current max event time with Math.max(timestamp, currentTimestamp)
      • Choose the allowed delay according to the actual data
  • allowedLateness: an extra delay of N on top of the watermark
  • OutputTag: side-output stream
  • Async I/O
    • Callback function: RichAsyncFunction (often as an anonymous class)
    • Ordered: AsyncDataStream.orderedWait()
    • Unordered: AsyncDataStream.unorderedWait()
  • State management
    • Raw state
    • Managed state
      • ListState
      • ValueState
      • MapState
      • Broadcast state
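
As a refresher on the watermark bullet above, here is a minimal periodic assigner sketch; the (userId, eventTime) element type, the class name and the 2 s default delay are assumptions for illustration, not from the course code.

import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.watermark.Watermark

//周期性水位线:元素类型假设为(userId, eventTime),延迟时间假设为2s
class DelayedWatermarkAssigner(delayMs: Long = 2000L) extends AssignerWithPeriodicWatermarks[(String, Long)] {

  //当前最大事件时间
  private var currentMaxTimestamp: Long = 0L

  //水位线 = 当前最大事件时间 - 延迟时间
  override def getCurrentWatermark: Watermark = new Watermark(currentMaxTimestamp - delayMs)

  //抽取事件时间,并更新当前最大事件时间
  override def extractTimestamp(element: (String, Long), previousElementTimestamp: Long): Long = {
    currentMaxTimestamp = Math.max(element._2, currentMaxTimestamp)
    element._2
  }
}

It would be wired in with env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) and stream.assignTimestampsAndWatermarks(new DelayedWatermarkAssigner()).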

Day 4 - Flink, a streaming compute framework

Agenda:

  • Fault tolerance (key topic)
  • Flink SQL (to master)
  • Exactly-once semantics (overview)
  • 品优购 e-commerce project (Spring Boot)

Fault tolerance

[image: photo/1571565339794.png]

Three checkpoint state backends (a configuration sketch follows the two lists below):

  • MemoryStateBackend: keeps state and checkpoints in JVM memory; not for production
  • FsStateBackend: stores checkpoints on a file system such as HDFS
  • RocksDBStateBackend: supports incremental and very large checkpoints

Checkpoint modes:

  • EXACTLY_ONCE: strong consistency; each record is reflected in state exactly once
  • AT_LEAST_ONCE: records may be replayed, so state may count them more than once
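
A configuration sketch for the three backends and the checkpoint mode; the HDFS path mirrors the examples below, while the RocksDB line additionally assumes the flink-statebackend-rocksdb dependency is on the classpath (not part of the course project).

import org.apache.flink.runtime.state.memory.MemoryStateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(5000) //每5s触发一次检查点

//三选一:
env.setStateBackend(new MemoryStateBackend())                                          //基于内存,仅用于本地测试
//env.setStateBackend(new FsStateBackend("hdfs://node01:8020/checkpoint"))             //基于文件系统
//env.setStateBackend(new RocksDBStateBackend("hdfs://node01:8020/checkpoint", true))  //增量检查点

//检查点执行模式:EXACTLY_ONCE 或 AT_LEAST_ONCE
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)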

Restart strategies (a configuration sketch follows this list)

  • fixed-delay: fixed-delay restart; after N failed restart attempts the job fails permanently

  • failure-rate: failure-rate restart; if more than N failures occur within the configured time window, the job fails permanently

  • no restart: never restart

    Once checkpointing is enabled without an explicit strategy, the job restarts indefinitely.

    If checkpointing is enabled but the job should not restart after a failure, configure no restart explicitly.

    Without checkpointing, Flink's default is not to restart at all.
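
A sketch of configuring each strategy on the execution environment (env as obtained in the examples below); the concrete counts and intervals are illustrative.

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time

//fixed-delay:最多重启3次,每次间隔5s
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, Time.seconds(5)))

//failure-rate:5分钟内最多失败3次,两次重启之间间隔10s
//env.setRestartStrategy(RestartStrategies.failureRateRestart(3, Time.minutes(5), Time.seconds(10)))

//no restart:出现异常直接宕机
//env.setRestartStrategy(RestartStrategies.noRestart())

The CheckpointDemo program below combines checkpointing with the fixed-delay strategy.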

package cn.itcast.checkpoint

import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.time.Time
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/**
  * @Date 2019/10/20
  */
object CheckpointDemo {


  def main(args: Array[String]): Unit = {

    /**
      * 输入三次zhangsan,程序挂掉
      */

    /**
      * 1.获取流处理执行环境
      * 2.设置检查点机制
      * 3.设置重启策略
      * 4.数据打印
      * 5.触发执行
      */
//    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//    //2.设置检查点机制
//    env.enableCheckpointing(5000) //开启检查点,每5s钟触发一次
//    //设置检查点存储路径
//    env.setStateBackend(new FsStateBackend("hdfs://node01:8020/tmp/checkpoint"))
//    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE) //强一致性
//    env.getCheckpointConfig.setCheckpointTimeout(60000) //检查点制作的超时时间
//    env.getCheckpointConfig.setFailOnCheckpointingErrors(false) //如果检查点制作失败,任务继续运行
//    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1) //检查点最大次数
//    //DELETE_ON_CANCELLATION:任务取消的时候,会删除检查点
//    //RETAIN_ON_CANCELLATION:任务取消的时候,会保留检查点,需要手动删除检查点,生产上主要使用这种方式
//    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    //3.设置重启策略 失败重启三次,重启间隔时间是5s
    //env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3,Time.seconds(5)))

    //4.数据打印
    val source: DataStream[String] = env.socketTextStream("node01",8090)
    source.flatMap(_.split("\\W+"))
      .map(line=>{
        if(line.equals("zhangsan")){
          throw new RuntimeException("失败重启........")
        }
        line
      }).print()

    //5.触发执行
    env.execute()
  }

}

Neither checkpointing nor a restart strategy configured: an exception kills the job immediately.

No checkpointing but a restart strategy configured: the restart strategy is applied.

Both checkpointing and a restart strategy configured: the restart strategy is applied.

Checkpointing configured but no restart strategy: the job restarts indefinitely.

Checkpoint example

Save the checkpoint data to HDFS

package cn.itcast.checkpoint

import java.util

import org.apache.flink.api.java.tuple.Tuple
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * @Date 2019/10/20
  */
object CheckpointSumResult {

  /**
    * 1)使用自定义算子每秒钟产生大约10000条数据。
    * 2)产生的数据为一个四元组(Long,String,String,Integer)—------(id,name,info,count)
    * 3)数据经统计后,统计结果打印到终端输出
    * 4)打印输出的结果为Long类型的数据
    *
    * 开发思路
    * 1)source算子每隔1秒钟发送1000条数据,并注入到Window算子中。
    * 2)window算子每隔1秒钟统计一次最近4秒钟内数据数量。
    * 3)每隔1秒钟将统计结果打印到终端
    * 4)每隔6秒钟触发一次checkpoint,然后将checkpoint的结果保存到HDFS中。
    */
  def main(args: Array[String]): Unit = {
    /**
      * 开发步骤:
      * 1.获取执行环境
      * 2.设置检查点机制
      * 3.自定义数据源 ,新建样例类
      * 4.设置水位线(必须设置处理时间)
      * 5.数据分组
      * 6.划分时间窗口
      * 7.数据聚合,然后将checkpoint的结果保存到HDFS中
      * 8.数据打印
      * 9.触发执行
      */
    //1.获取执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //2.设置检查点存储路径
    env.enableCheckpointing(5000)
    env.setStateBackend(new FsStateBackend("hdfs://node01:8020/tmp/checkpoint"))
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE) //强一致性
    env.getCheckpointConfig.setCheckpointTimeout(60000) //检查点制作的超时时间
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false) //如果检查点制作失败,任务继续运行
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1) //检查点最大次数
    //DELETE_ON_CANCELLATION:任务取消的时候,会删除检查点
    //RETAIN_ON_CANCELLATION:任务取消的时候,会保留检查点,需要手动删除检查点,生产上主要使用这种方式
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)


    //3.自定义数据源 ,新建样例类
    val source: DataStream[Info] = env.addSource(new CheckpointSourceFunc)
    //4.设置水位线(必须设置处理时间)
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
    source.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[Info] {
      override def getCurrentWatermark: Watermark = {
        new Watermark(System.currentTimeMillis())
      }

      override def extractTimestamp(element: Info, previousElementTimestamp: Long): Long = {
        System.currentTimeMillis()
      }
    })
      //5.数据分组
      .keyBy(0)
      //6.划分时间窗口
      .timeWindow(Time.seconds(4), Time.seconds(1))
      //7.数据聚合,然后将checkpoint的结果保存到HDFS中
      .apply(new CheckpointWindowFunc)
      //8.数据打印
      .print()

    //9.触发执行
    env.execute()
  }

}

//(Long,String,String,Integer)—------(id,name,info,count)
case class Info(id: Long, name: String, info: String, count: Long)

class CheckpointSourceFunc extends RichSourceFunction[Info] {
  override def run(ctx: SourceFunction.SourceContext[Info]): Unit = {

    while (true) {

      for (line <- 0 until 1000) {
        ctx.collect(Info(1, "test", "test:" + line, 1))
      }
      Thread.sleep(1000)
    }
  }

  override def cancel(): Unit = ???
}

class State extends Serializable {
  var total: Long = 0

  def getTotal = total

  def setTotal(value: Long) = {
    total = value
  }
}

//聚合结果,并将结果保存到hdfs
class CheckpointWindowFunc extends WindowFunction[Info, Long, Tuple, TimeWindow]
  with ListCheckpointed[State] {
  var total: Long = 0

  //数据处理,数据聚合
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[Info], out: Collector[Long]): Unit = {

    var sum: Long = 0L
    for (line <- input) {
      sum = sum + line.count
    }

    total += sum
    out.collect(total)
  }

  //制作快照
  override def snapshotState(checkpointId: Long, timestamp: Long): util.List[State] = {

    val states = new util.ArrayList[State]()
    val state = new State
    state.setTotal(total)
    states.add(state)
    states
  }

  //修复快照,更新数据,通过快照,拿到最新数据
  override def restoreState(state: util.List[State]): Unit = {
    total = state.get(0).getTotal
  }
}

Exactly-once semantics

Official documentation:

https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html

Kafka exactly-once

package cn.itcast.checkpoint

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper

/**
  * @Date 2019/10/20
  */
object ExactlyKafka {

  def main(args: Array[String]): Unit = {

    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

    //隐式转换
    import org.apache.flink.api.scala._
    //checkpoint配置
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
    env.getCheckpointConfig.setCheckpointTimeout(60000)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    //数据加载及数据转换
    val source: DataStream[String] = env.socketTextStream("node01", 8090)
    val strValue: DataStream[String] = source.flatMap(_.split(" ")).map(line=>line)

    //配置kafka生产,参数
    val topic = "demo"
    val prop = new Properties()
    prop.setProperty("bootstrap.servers", "node01:9092,node02:9092,node03:9092")
    //设置事务超时时间,也可在kafka配置中设置
    prop.setProperty("transaction.timeout.ms",60000*15+"")
    val kafkaProducer = new FlinkKafkaProducer011[String](topic, new KeyedSerializationSchemaWrapper[String](new SimpleStringSchema), prop, FlinkKafkaProducer011.Semantic.EXACTLY_ONCE)
    //使用至少一次语义的形式
    //val myProducer = new FlinkKafkaProducer011<>(brokerList, topic, new SimpleStringSchema());
    //使用支持仅一次语义的形式
    strValue.addSink(kafkaProducer)
    env.execute("StreamingKafkaSinkScala")

  }
}

Redis exactly-once

package cn.itcast.checkpoint

import java.net.{InetAddress, InetSocketAddress}
import java.util
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisClusterConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
/**
  * @Date 2019/10/20
  */
object ExactlyRedis {

  def main(args: Array[String]): Unit = {

    /**
      *     1.获取流处理执行环境
      *     2.设置检查点机制
      *     3.定义kafkaConsumer
      *     4.数据转换:分组,求和
      *     5.数据写入redis
      *     6.触发执行
      */
    //1.获取流处理执行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    //2.设置检查点机制
    //checkpoint配置
    env.enableCheckpointing(5000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
    env.getCheckpointConfig.setCheckpointTimeout(60000)
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    env.getCheckpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)

    //3.定义kafkaConsumer
    val properties = new Properties()
    properties.setProperty("bootstrap.servers","node01:9092,node02:9092,node03:9092")
    properties.setProperty("group.id","test1020")
    //不自动提交偏移量
    properties.setProperty("enable.auto.commit", "false")

    val kafkaConsumer = new FlinkKafkaConsumer011[String]("demo",new SimpleStringSchema(),properties)

    //检查点制作成功之后,再提交偏移量
    kafkaConsumer.setCommitOffsetsOnCheckpoints(true)
    val source: DataStream[String] = env.addSource(kafkaConsumer)

    //4.数据转换:分组,求和
    val sumResult: DataStream[(String, Int)] = source.flatMap(_.split(" "))
      .map((_, 1))
      .keyBy(0)
      .sum(1)

    //5.数据写入redis

    //设置redis属性
    //redis的节点
    val set = new util.HashSet[InetSocketAddress]()
    set.add(new InetSocketAddress(InetAddress.getByName("node01"),7001))
    set.add(new InetSocketAddress(InetAddress.getByName("node01"),7002))
    set.add(new InetSocketAddress(InetAddress.getByName("node01"),7003))

    val config: FlinkJedisClusterConfig = new FlinkJedisClusterConfig.Builder()
      .setNodes(set)
      .setMaxTotal(5)
      .setMinIdle(2)
      .setMaxIdle(5)
      .build()

    sumResult.addSink(new RedisSink[(String, Int)](config,new ExactlyRedisMapper))

    env.execute("redis exactly")
  }

}

class ExactlyRedisMapper extends RedisMapper[(String,Int)] {
  override def getCommandDescription: RedisCommandDescription = {
    //设置redis的数据结构类型
    new RedisCommandDescription(RedisCommand.HSET,"exactlyRedis")
  }

  override def getKeyFromData(data: (String, Int)): String = {
    data._1
  }

  override def getValueFromData(data: (String, Int)): String = {
    data._2.toString
  }
}

Flink SQL

DataSet

1. Reading static (in-memory) data

package cn.itcast.sql

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala.BatchTableEnvironment
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.table.sinks.CsvTableSink
import org.apache.flink.types.Row

/**
  * @Date 2019/10/20
  */
object BatchDataSetSql {

  def main(args: Array[String]): Unit = {

    /**
      * 1)获取一个批处理运行环境
      * 2)获取一个Table运行环境
      * 3)创建一个样例类 Order 用来映射数据(订单名、用户名、订单日期、订单金额)
      * 4)基于本地 Order 集合创建一个DataSet source
      * 5)使用Table运行环境将DataSet注册为一张表
      * 6)使用SQL语句来操作数据(统计用户消费订单的总金额、最大金额、最小金额、订单总数)
      * 7)使用TableEnv.toDataSet将Table转换为DataSet
      * 8)打印测试
      */
    //1)获取一个批处理运行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2)获取一个Table运行环境
    val tblEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)

    //4)基于本地 Order 集合创建一个DataSet source
    val source: DataSet[Order] = env.fromElements(
      Order(1, "zhangsan", "2018-10-20 15:30", 358.5),
      Order(2, "zhangsan", "2018-10-20 16:30", 131.5),
      Order(3, "lisi", "2018-10-20 16:30", 127.5),
      Order(4, "lisi", "2018-10-20 16:30", 328.5),
      Order(5, "lisi", "2018-10-20 16:30", 432.5),
      Order(6, "zhaoliu", "2018-10-20 22:30", 451.0),
      Order(7, "zhaoliu", "2018-10-20 22:30", 362.0),
      Order(8, "zhaoliu", "2018-10-20 22:30", 364.0),
      Order(9, "zhaoliu", "2018-10-20 22:30", 341.0)
    )

    //将批量数据注册成表
    tblEnv.registerDataSet("order2", source)
    //1.table api数据查询
    //val table: Table = tblEnv.scan("order2").select("userId,userName")
    //2.sql数据查询
    //val table: Table = tblEnv.sqlQuery("select * from order2")

    //统计用户消费订单的总金额、最大金额、最小金额、订单总数
    val sql =
      """
        | select
        |   userName,
        |   sum(price) totalMoney,
        |   max(price) maxMoney,
        |   min(price) minMoney,
        |   count(1) totalCount
        |  from order2
        |  group by userName
        |""".stripMargin //在scala中stripMargin默认是“|”作为多行连接符

    // val sql = "select userName,sum(price),max(price),min(price),count(*) from order2 group by userName"
    val table: Table = tblEnv.sqlQuery(sql)

    //    table.writeToSink(new CsvTableSink("C:\\Users\\zhb09\\Desktop\\write\\test\\orderCsv.csv",
    //      ",",1,WriteMode.OVERWRITE))

    //打印数据到控制台
    val rows: DataSet[Row] = tblEnv.toDataSet[Row](table)
    rows.print()
//    env.execute()
  }


}

//3)创建一个样例类 Order 用来映射数据(订单名、用户名、订单日期、订单金额)
case class Order(userId: Int, userName: String, eventTime: String, price: Double)

2. Reading a CSV file

package cn.itcast.sql

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.table.api.{Table, TableEnvironment, Types}
import org.apache.flink.table.api.scala.BatchTableEnvironment
import org.apache.flink.table.sources.CsvTableSource
import org.apache.flink.types.Row
import org.apache.flink.api.scala._
/**
  * @Date 2019/10/20
  */
object ReadCsvSource {

  def main(args: Array[String]): Unit = {

    /**
      * 1.获取批处理执行环境
      * 2.获取表执行环境
      * 3.加载csv数据
      * 4.注册表
      * 5.查询表数据
      * 6.table 转换为批量数据
      * 7.数据打印
      */
    //1.获取批处理执行环境
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //2.获取表执行环境
    val tblEnv: BatchTableEnvironment = TableEnvironment.getTableEnvironment(env)
    //3.加载csv数据
    val tableSource: CsvTableSource = CsvTableSource.builder()
      .path("C:\\Users\\zhb09\\Desktop\\write\\test\\test.csv")
      .field("name", Types.STRING)
      .field("address", Types.STRING)
      .field("age", Types.INT)
      .ignoreFirstLine() //忽略首行
      .lineDelimiter("\n") //换行符
      .fieldDelimiter(",")
      .build()

    //4.注册表
    tblEnv.registerTableSource("csv",tableSource)

    //5.查询表数据
    val table: Table = tblEnv.sqlQuery("select * from csv")

    //6.table 转换为批量数据
    val values: DataSet[Row] = tblEnv.toDataSet[Row](table)

    //数据打印
    values.print()
  }
}

DataStream

Requirement: use Flink SQL to compute, for each user within a 5-second window, the total order count, the maximum order amount, and the minimum order amount.

package cn.itcast.sql

import java.util.UUID

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala.StreamTableEnvironment
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.types.Row

import scala.util.Random

/**
  * @Date 2019/10/20
  */
object StreamSql {

  def main(args: Array[String]): Unit = {
    /**
      * 1)获取流处理运行环境
      * 2)获取Table运行环境
      * 3)设置处理时间为 EventTime
      * 4)创建一个订单样例类 Order ,包含四个字段(订单ID、用户ID、订单金额、时间戳)
      * 5)创建一个自定义数据源
      * 6)添加水印,允许延迟2秒
      * 7)导入 import org.apache.flink.table.api.scala._ 隐式参数
      * 8)使用 registerDataStream 注册表,并分别指定字段,还要指定rowtime字段
      * 9)编写SQL语句统计用户订单总数、最大金额、最小金额
      * 分组时要使用 tumble(时间列, interval '窗口时间' second) 来创建窗口
      * 10)使用 tableEnv.sqlQuery 执行sql语句
      * 11)将SQL的执行结果转换成DataStream再打印出来
      * 12)启动流处理程序
      * 需求:使用Flink SQL来统计5秒内 用户的 订单总数、订单的最大金额、订单的最小金额。
      */

    //1)获取流处理运行环境
    val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2)获取Table运行环境
    val tblEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)
    //3)设置处理时间为 EventTime
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    //5)创建一个自定义数据源
    val source: DataStream[Order3] = env.addSource(new OrderSourceFunc)
    //6)添加水印,允许延迟2秒
    val waterData: DataStream[Order3] = source.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[Order3](Time.seconds(2)) {
      override def extractTimestamp(element: Order3): Long = {
        val time: Long = element.orderTime
        time
      }
    })

    //7)导入 import org.apache.flink.table.api.scala._ 隐式参数
    import org.apache.flink.table.api.scala._
    //8)使用 registerDataStream 注册表,并分别指定字段,还要指定rowtime字段
    tblEnv.registerDataStream("order3",waterData,'orderId,'userId,'orderPrice,'orderTime.rowtime)

    // 9)编写SQL语句统计用户订单总数、最大金额、最小金额
    //      * 分组时要使用 tumble(时间列, interval '窗口时间' second) 来创建窗口
    val sql = "select userId,count(orderId),max(orderPrice),min(orderPrice) from order3 group by userId,tumble(orderTime, interval '5' second) "

    //10)使用 tableEnv.sqlQuery 执行sql语句
    val table: Table = tblEnv.sqlQuery(sql)

    //11)将SQL的执行结果转换成DataStream再打印出来
    val values: DataStream[Row] = tblEnv.toAppendStream[Row](table)
    values.print()

    //12)启动流处理程序
    env.execute()
  }


}
//4)创建一个订单样例类 Order ,包含四个字段(订单ID、用户ID、订单金额、时间戳)
case class Order3(orderId:String,userId:Int,orderPrice:Double,orderTime:Long)

class OrderSourceFunc extends RichSourceFunction[Order3] {
  /**
    * * a.使用for循环生成1000个订单
    * * b.随机生成订单ID(UUID)
    * * c.随机生成用户ID(0-2)
    * * d.随机生成订单金额(0-100)
    * * e.时间戳为当前系统时间
    * * f.每隔1秒生成一个订单
    *
    * @param ctx
    */
  override def run(ctx: SourceFunction.SourceContext[Order3]): Unit = {

    for(line<- 0 until 1000){
      ctx.collect(Order3(UUID.randomUUID().toString,Random.nextInt(2),Random.nextInt(100),System.currentTimeMillis()))
      Thread.sleep(100)
    }

  }

  override def cancel(): Unit = ???
}

Real-time e-commerce analytics system

Business background

By analysing users' purchasing behaviour with big data technology, the backend computes a range of consumption metrics; the goal is to increase market share and revenue.

System architecture

[image: B:/video/flink/10.16/flink-day04/photo/1571562941575.png]

Technology choices

Stream processing framework: Flink

Message queue: Kafka (on the order of 1,000,000 messages per second)

Database: HBase (NoSQL), single tables holding hundreds of millions of rows

All three share the same traits: high throughput, high concurrency, high availability.

canal: reads the MySQL binlog in real time

Spring Boot: rapid Java development framework, annotation-driven, no XML configuration files

Development languages

Java, Scala

Course goals

1) Set up HBase and perform basic operations

2) Master basic Flink syntax

3) Set up Kafka and perform basic operations

4) Master how to use canal

5) Independently develop the report (data ingestion) service

6) Use Flink to process real-time hot-item data and write it to HBase

7) Use Flink to compute channel PV and UV and write them to HBase

8) Use Flink to compute user freshness

9) Use Flink to compute the channel region distribution

10) Use Flink to process carrier (network operator) data

11) Use Flink to process browser-type data

12) Connect to canal in code and sync its data to Kafka

13) Use Flink to sync data to HBase

Creating the project

[image: B:/video/flink/10.16/flink-day04/photo/1571563693185.png]

1. Report (data ingestion) service module

1. Dependencies (pom.xml)


<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>pyg-1020</artifactId>
        <groupId>cn.itcast</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>report</artifactId>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
        <spring-cloud.version>Greenwich.M3</spring-cloud.version>
    </properties>

    <repositories>
        <repository>
            <id>alimaven</id>
            <name>alimaven</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-tomcat</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>

        <dependency>
            <groupId>org.apache.tomcat</groupId>
            <artifactId>tomcat-catalina</artifactId>
            <version>8.5.35</version>
        </dependency>

        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.47</version>
        </dependency>

        <dependency>
            <groupId>org.springframework.kafka</groupId>
            <artifactId>spring-kafka</artifactId>
            <version>1.0.6.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-autoconfigure</artifactId>
            <version>1.5.13.RELEASE</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

</project>

2. Startup class

package cn.itcast;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

/**
 * @Date 2019/10/20
 * 启动类
 * SpringBoot内置tomcat
 */
@SpringBootApplication
public class ReportApplication {

    //通过main方法启动服务
    public static void main(String[] args) {

        SpringApplication.run(ReportApplication.class,args);
    }
}

3. Test controller

package cn.itcast.controller;

import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;

/**
 * @Date 2019/10/20
 */
@Controller
@RequestMapping("report")
public class ReportTestController {

    @RequestMapping("test")
    public void acceptData(String str){
        System.out.println("<<<接收的数据<<<:"+str);
    }

}

Flink - real-time e-commerce analytics system

Spring Boot can be used on its own as a web framework for Java projects.

Spring Cloud builds on top of Spring Boot and cannot be used without it.

1. Real-time channel PV/UV analysis

(1) ChannelPvuvTask

package cn.itcast.task

import cn.itcast.`trait`.ProcessData
import cn.itcast.bean.Message
import cn.itcast.map.ChannelPvuvFlatMap
import cn.itcast.reduce.ChannelPvuvReduce
import cn.itcast.sink.ChannelPvuvSink
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * @Date 2019/10/22
  */
object ChannelPvuvTask extends ProcessData {
  override def process(waterData: DataStream[Message]): Unit = {
    /**
      * 开发步骤一:
      * 开发userState
      */

    /** 开发步骤二:
      * (1)数据转换
      * (2)数据分组
      * (3)划分时间窗口
      * (4)数据聚合
      * (5)数据落地
      */
    //(1)数据转换
    waterData.flatMap(new ChannelPvuvFlatMap)
      //(2)数据分组
      .keyBy(line => line.getChannelId + line.getTime)
      //(3)划分时间窗口
      .timeWindow(Time.seconds(3))
      //(4)数据聚合
      .reduce(new ChannelPvuvReduce)
      //5)数据落地
      .addSink(new ChannelPvuvSink)

  }
}

(2)ChannelPvuvFlatMap

package cn.itcast.map

import cn.itcast.bean.{ChannelPvuv, Message, UserBrowse, UserState}
import cn.itcast.util.TimeUtil
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector

/**
  * @Date 2019/10/23
  */
class ChannelPvuvFlatMap extends RichFlatMapFunction[Message,ChannelPvuv]{

  //格式化模板
  val hour = "yyyyMMddHH"
  val day ="yyyyMMdd"
  val month ="yyyyMM"

  override def flatMap(in: Message, out: Collector[ChannelPvuv]): Unit = {
    val userBrowse: UserBrowse = in.userBrowse
    val timestamp: Long = userBrowse.timestamp
    val channelID: Long = userBrowse.channelID
    val userID: Long = userBrowse.userID

    //查询用户访问状态
    val userState: UserState = UserState.getUserState(userID,timestamp)
    val isNew: Boolean = userState.isNew
    val firstHour: Boolean = userState.isFirstHour
    val firstDay: Boolean = userState.isFirstDay
    val firstMonth: Boolean = userState.isFirstMonth

    //封装数据到ChannelPvuv
    val channelPvuv = new ChannelPvuv
    channelPvuv.setChannelId(channelID)
    channelPvuv.setPv(1)

    //日期格式化
    val hourTime: String = TimeUtil.parseTime(timestamp,hour)
    val dayTime: String = TimeUtil.parseTime(timestamp,day)
    val monthTime: String = TimeUtil.parseTime(timestamp,month)

    //判断用户访问状态
    if(isNew == true){
      channelPvuv.setUv(1L)
    }

    //小时
    if(firstHour == true){
      channelPvuv.setUv(1L)
      channelPvuv.setTime(hourTime)
    }else{
      channelPvuv.setUv(0L)
      channelPvuv.setTime(hourTime)
    }
    out.collect(channelPvuv)

    //天
    if(firstDay == true){
      channelPvuv.setUv(1L)
      channelPvuv.setTime(dayTime)
    }else{
      channelPvuv.setUv(0L)
      channelPvuv.setTime(dayTime)
    }
    out.collect(channelPvuv)

    //月
    if(firstMonth == true){
      channelPvuv.setUv(1L)
      channelPvuv.setTime(monthTime)
    }else{
      channelPvuv.setUv(0L)
      channelPvuv.setTime(monthTime)
    }
    out.collect(channelPvuv)

  }
}

(3)ChannelPvuvReduce

package cn.itcast.reduce

import cn.itcast.bean.ChannelPvuv
import org.apache.flink.api.common.functions.ReduceFunction

/**
  * @Date 2019/10/23
  */
class ChannelPvuvReduce extends ReduceFunction[ChannelPvuv] {
  override def reduce(value1: ChannelPvuv, value2: ChannelPvuv): ChannelPvuv = {

    //增量聚合,最终聚合成一条数据
    val pvuv = new ChannelPvuv
    pvuv.setTime(value1.getTime)
    pvuv.setChannelId(value1.getChannelId)
    pvuv.setPv(value1.getPv + value2.getPv)
    pvuv.setUv(value1.getUv + value2.getUv)
    pvuv
  }
}

(4)ChannelPvuvSink

package cn.itcast.sink

import cn.itcast.bean.ChannelPvuv
import cn.itcast.util.HbaseUtil
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

/**
  * @Date 2019/10/23
  */
class ChannelPvuvSink extends RichSinkFunction[ChannelPvuv] {

  /**
    * 落地数据到hbase
    */
  override def invoke(value: ChannelPvuv): Unit = {

    /**
      * 表名: channel
      * rowkey: channelId+ time(格式化)
      * 字段: channelId, time,pv,uv
      * 列名:channelId,time,pv,uv
      * 列族: info
      */
    val tableName = "channel"
    val rowkey = value.getChannelId + value.getTime
    val pvCol = "pv"
    val uvCol = "uv"
    val family = "info"

    //需要查询hbase,如果有pv/uv数据,需要累加,再插入数据库,如果没有数据,直接插入
    val pvData: String = HbaseUtil.queryByRowkey(tableName, family, pvCol, rowkey)
    val uvData: String = HbaseUtil.queryByRowkey(tableName, family, uvCol, rowkey)

    var pv = value.getPv
    var uv = value.getUv
    //数据非空判断
    if (StringUtils.isNotBlank(pvData)) {
      pv = pv + pvData.toLong
    }
    if (StringUtils.isNotBlank(uvData)) {
      uv = uv + uvData.toLong
    }

    //需要封装map多列数据
    var map = Map[String,Any]()
    map+=("channelId"->value.getChannelId)
    map+=("time"->value.getTime)
    map+=(pvCol -> pv)
    map+=(uvCol -> uv)

    //将数据插入hbase
    HbaseUtil.putMapDataByRowkey(tableName,family,map,rowkey)
  }
}

2. User freshness

(1) ChannelFreshnessTask

package cn.itcast.task

import cn.itcast.`trait`.ProcessData
import cn.itcast.bean.Message
import cn.itcast.map.ChannelFreshnessFlatMap
import cn.itcast.reduce.ChannelFreshnessReduce
import cn.itcast.sink.ChannelFreshnessSink
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
/**
  * @Date 2019/10/23
  */
object ChannelFreshnessTask extends ProcessData{
  override def process(waterData: DataStream[Message]): Unit = {

    /**
      * 1.数据转换
      * 2.数据分组
      * 3.划分时间窗口
      * 4.数据聚合
      * 5.数据落地
      */
    //1.数据转换
    waterData.flatMap(new ChannelFreshnessFlatMap)
    //2.数据分组
      .keyBy(line=>line.getChannelId+line.getTime)
    //3.划分时间窗口
      .timeWindow(Time.seconds(3))
    //4.数据聚合
      .reduce(new ChannelFreshnessReduce)
    //5.数据落地
      .addSink(new ChannelFreshnessSink)

  }
}

(2)ChannelFreshnessFlatMap

package cn.itcast.map

import cn.itcast.bean.{ChannelFreshness, Message, UserBrowse, UserState}
import cn.itcast.util.TimeUtil
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector

/**
  * @Date 2019/10/23
  */
class ChannelFreshnessFlatMap extends RichFlatMapFunction[Message, ChannelFreshness] {

  //格式化模板
  val hour = "yyyyMMddHH"
  val day = "yyyyMMdd"
  val month = "yyyyMM"

  override def flatMap(in: Message, out: Collector[ChannelFreshness]): Unit = {

    val userBrowse: UserBrowse = in.userBrowse
    val timestamp: Long = userBrowse.timestamp
    val channelID: Long = userBrowse.channelID
    val userID: Long = userBrowse.userID

    //获取用户访问状态
    val userState: UserState = UserState.getUserState(userID, timestamp)
    val isNew: Boolean = userState.isNew
    val firstHour: Boolean = userState.isFirstHour
    val firstDay: Boolean = userState.isFirstDay
    val firstMonth: Boolean = userState.isFirstMonth

    //日期格式化
    val hourTime: String = TimeUtil.parseTime(timestamp, hour)
    val dayTime: String = TimeUtil.parseTime(timestamp, day)
    val monthTime: String = TimeUtil.parseTime(timestamp, month)

    val freshness = new ChannelFreshness
    freshness.setChannelId(channelID)

    //根据用户访问状态设置结果值
    isNew match {
      case true =>
        freshness.setNewCount(1L)
      case false =>
        freshness.setOldCount(1L)
    }

    //小时
    firstHour match {
      case true =>
        freshness.setNewCount(1L)
        freshness.setTime(hourTime)
      case false =>
        freshness.setOldCount(1L)
        freshness.setTime(hourTime)
    }
    out.collect(freshness)
    //天
    firstDay match {
      case true =>
        freshness.setNewCount(1L)
        freshness.setTime(dayTime)
      case false =>
        freshness.setOldCount(1L)
        freshness.setTime(dayTime)
    }
    out.collect(freshness)
    //月
    firstMonth match {
      case true =>
        freshness.setNewCount(1L)
        freshness.setTime(monthTime)
      case false =>
        freshness.setOldCount(1L)
        freshness.setTime(monthTime)
    }
    out.collect(freshness)
  }
}

(3)ChannelFreshnessReduce

package cn.itcast.reduce

import cn.itcast.bean.ChannelFreshness
import org.apache.flink.api.common.functions.ReduceFunction

/**
  * @Date 2019/10/23
  */
class ChannelFreshnessReduce extends ReduceFunction[ChannelFreshness]{
  override def reduce(value1: ChannelFreshness, value2: ChannelFreshness): ChannelFreshness = {

    val freshness = new ChannelFreshness
    freshness.setChannelId(value1.getChannelId)
    freshness.setTime(value1.getTime)
    freshness.setNewCount(value1.getNewCount + value2.getNewCount)
    freshness.setOldCount(value1.getOldCount+value2.getOldCount)
    freshness
  }
}

(4)ChannelFreshnessSink

package cn.itcast.sink

import cn.itcast.bean.ChannelFreshness
import cn.itcast.util.HbaseUtil
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

/**
  * @Date 2019/10/23
  */
class ChannelFreshnessSink extends RichSinkFunction[ChannelFreshness]{


  override def invoke(value: ChannelFreshness): Unit = {

    /**
      * 表名: channel
      * rowkey: channelId+ time(格式化)
      * 字段: channelId, time,newCount,oldCount
      * 列名:channelId,time,newCount,oldCount
      * 列族: info
      */
    val tableName = "channel"
    val rowkey = value.getChannelId + value.getTime
    val newCountCol = "newCount"
    val oldCountCol = "oldCount"
    val family ="info"

    var newCount = value.getNewCount
    var oldCount = value.getOldCount

    //需要先查询hbase,如果数据库有数据,需要进行累加,如果没有数据,直接插入数据
    val newCountData: String = HbaseUtil.queryByRowkey(tableName,family,newCountCol,rowkey)
    val oldCountData: String = HbaseUtil.queryByRowkey(tableName,family,oldCountCol,rowkey)

    //数据非空判断,并累加
    if(StringUtils.isNotBlank(newCountData)){
      newCount = newCount + newCountData.toLong
    }
    if(StringUtils.isNotBlank(oldCountData)){
      oldCount = oldCount + oldCountData.toLong
    }

    //封装map数据
    var map = Map[String ,Any]()
    map+=("channelId"->value.getChannelId)
    map+=("time"->value.getTime)
    map+=(newCountCol -> newCount)
    map+=(oldCountCol -> oldCount)

    //插入数据
    HbaseUtil.putMapDataByRowkey(tableName,family,map,rowkey)

  }
}

3. Real-time channel region analysis

1. ChannelRegionTask

package cn.itcast.task

import cn.itcast.`trait`.ProcessData
import cn.itcast.bean.Message
import cn.itcast.map.ChannelRegionFlatMap
import cn.itcast.reduce.ChannelRegionReduce
import cn.itcast.sink.ChannelRegionSink
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

/**
  * @Date 2019/10/23
  */
object ChannelRegionTask extends ProcessData {
  override def process(waterData: DataStream[Message]): Unit = {

    /** 开发步骤:
      * (1)数据转换
      * (2)数据分组
      * (3)划分时间窗口
      * (4)数据聚合
      * (5)数据落地
      */
    //(1)数据转换
    waterData.flatMap(new ChannelRegionFlatMap)
      //(2)数据分组
      .keyBy(line => line.getChannelId + line.getTime)
      //(3)划分时间窗口
      .timeWindow(Time.seconds(3))
      //(4)数据聚合
      .reduce(new ChannelRegionReduce)
      //(5)数据落地
      .addSink(new ChannelRegionSink)
  }
}

2.ChannelRegionFlatMap

package cn.itcast.map

import cn.itcast.bean._
import cn.itcast.util.TimeUtil
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.util.Collector

/**
  * @Date 2019/10/23
  */
class ChannelRegionFlatMap extends RichFlatMapFunction[Message,ChannelRegion]{

  //格式化模板
  val hour = "yyyyMMddHH"
  val day ="yyyyMMdd"
  val month ="yyyyMM"
  override def flatMap(in: Message, out: Collector[ChannelRegion]): Unit = {

    val userBrowse: UserBrowse = in.userBrowse
    val timestamp: Long = userBrowse.timestamp
    val userID: Long = userBrowse.userID
    val channelID: Long = userBrowse.channelID

    //查询用户访问状态
    val userState: UserState = UserState.getUserState(userID,timestamp)
    val isNew: Boolean = userState.isNew
    val firstHour: Boolean = userState.isFirstHour
    val firstDay: Boolean = userState.isFirstDay
    val firstMonth: Boolean = userState.isFirstMonth

    //日期格式化
    val hourTime: String = TimeUtil.parseTime(timestamp,hour)
    val dayTime: String = TimeUtil.parseTime(timestamp,day)
    val monthTime: String = TimeUtil.parseTime(timestamp,month)

    //封装一部分数据
    val channelRegion = new ChannelRegion
    channelRegion.setChannelId(channelID)
    channelRegion.setCity(userBrowse.city)
    channelRegion.setCountry(userBrowse.country)
    channelRegion.setProvince(userBrowse.province)
    channelRegion.setPv(1L)

    //需要根据用户访问状态设置结果值
    isNew match {
      case true=>
        channelRegion.setUv(1L)
        channelRegion.setNewCount(1L)
      case false =>
        channelRegion.setUv(0L)
        channelRegion.setOldCount(1L)
    }

    //小时
    firstHour match {
      case true =>
        channelRegion.setUv(1L)
        channelRegion.setNewCount(1L)
      case false =>
        channelRegion.setUv(0L)
        channelRegion.setOldCount(1L)
    }
    channelRegion.setTime(hourTime)
    out.collect(channelRegion)

    //天
    firstDay match {
      case true =>
        channelRegion.setUv(1L)
        channelRegion.setNewCount(1L)
      case false =>
        channelRegion.setUv(0L)
        channelRegion.setOldCount(1L)
    }
    channelRegion.setTime(dayTime)
    out.collect(channelRegion)

    //月
    firstMonth match {
      case true =>
        channelRegion.setUv(1L)
        channelRegion.setNewCount(1L)
      case false =>
        channelRegion.setUv(0L)
        channelRegion.setOldCount(1L)
    }
    channelRegion.setTime(monthTime)
    out.collect(channelRegion)
  }
}

3.ChannelRegionReduce

package cn.itcast.reduce

import cn.itcast.bean.ChannelRegion
import org.apache.flink.api.common.functions.ReduceFunction

/**
  * @Date 2019/10/23
  */
class ChannelRegionReduce extends ReduceFunction[ChannelRegion] {
  override def reduce(value1: ChannelRegion, value2: ChannelRegion): ChannelRegion = {

    val region = new ChannelRegion
    region.setChannelId(value1.getChannelId)
    region.setCity(value1.getCity)
    region.setCountry(value1.getCountry)
    region.setNewCount(value1.getNewCount + value2.getNewCount)
    region.setOldCount(value1.getOldCount + value2.getOldCount)
    region.setProvince(value1.getProvince)
    region.setPv(value1.getPv + value2.getPv)
    region.setUv(value1.getUv + value2.getUv)
    region.setTime(value1.getTime)
    region
  }
}

4.ChannelRegionSink

package cn.itcast.sink

import cn.itcast.bean.ChannelRegion
import cn.itcast.util.HbaseUtil
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

/**
  * @Date 2019/10/23
  */
class ChannelRegionSink extends RichSinkFunction[ChannelRegion] {


  override def invoke(value: ChannelRegion): Unit = {

    /**
      * 设计表:
      * 表名: region
      * rowkey: channelId+ time(格式化)
      * 字段: channelId, time,newCount,oldCount,pv,uv,country,province,city
      * 列名:channelId,time,newCount,oldCount,pv,uv,country,province,city
      * 列族: info
      */
    val tableName = "region"
    val family = "info"
    val rowkey = value.getChannelId + value.getTime
    val pvCol = "pv"
    val uvCol = "uv"
    val newCountCol = "newCount"
    val oldCountCol = "oldCount"

    //需要先查询hbase数据库,并进行数据累加
    val pvData: String = HbaseUtil.queryByRowkey(tableName, family, pvCol, rowkey)
    val uvData: String = HbaseUtil.queryByRowkey(tableName, family, uvCol, rowkey)
    val newCountData: String = HbaseUtil.queryByRowkey(tableName, family, newCountCol, rowkey)
    val oldCountData: String = HbaseUtil.queryByRowkey(tableName, family, oldCountCol, rowkey)

    //先从value取待累加值
    var pv = value.getPv
    var uv = value.getUv
    var newCount = value.getNewCount
    var oldCount = value.getOldCount

    if (StringUtils.isNotBlank(pvData)) {
      pv = pv + pvData.toLong
    }
    if (StringUtils.isNotBlank(uvData)) {
      uv = uv + uvData.toLong
    }
    if (StringUtils.isNotBlank(newCountData)) {
      newCount = newCount + newCountData.toLong
    }
    if (StringUtils.isNotBlank(oldCountData)) {
      oldCount = oldCount + oldCountData.toLong
    }

    //封装插入数据
    //channelId, time,newCount,oldCount,pv,uv,country,province,city
    var map = Map[String, Any]()
    map += ("channelId" -> value.getChannelId)
    map += ("time" -> value.getTime)
    map += (newCountCol -> newCount)
    map += (oldCountCol -> oldCount)
    map += (pvCol -> pv)
    map += (uvCol -> uv)
    map += ("country" -> value.getCountry)
    map += ("province" -> value.getProvince)
    map += ("city" -> value.getCity)

    HbaseUtil.putMapDataByRowkey(tableName,family,map,rowkey)

  }

}
