Flink: how do we ensure data sharing when connecting two parallel data sources?

1. Scenario

  • In a project, integrating two data sources led to data loss.
  • Requirement: resolve the domain in the Kafka data to a userid via the MySQL data.
  • Kafka(ip, domain, traffic) --Flink connect--> MySQL(userid, domain) ==> Result(ip, userid, traffic)
  • Source 1: log records from Kafka (ip, domain, traffic)
  • Source 2: configuration data from MySQL (the domain-to-userid mapping)

2. Code

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

import scala.collection.mutable

val env = StreamExecutionEnvironment.getExecutionEnvironment
// use event time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val data = env.addSource(getConsumer)
// the stream read from Kafka
val logData = data.map(new MapSplitFunction)
    .filter(_._1 == "E")
    .filter(_._2 != 0)
    .map(x => {
        (x._2, x._3, x._4)
    })
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator) // assign timestamps and watermarks
    .keyBy(1)
    .window(TumblingEventTimeWindows.of(Time.seconds(10)))
    .apply(new MyWindowFunction)
// the stream read from MySQL
val sqlData = env.addSource(new JdbcSourceFunction)
// connect the two streams
val connectData = logData.connect(sqlData)
       .flatMap(new CoFlatMapFunction[(String, String, Long), mutable.HashMap[String, String], (String, String, Long)] {

         // latest domain -> userid mapping received from the MySQL stream
         var userDomainMap = new mutable.HashMap[String, String]()

         // called for each logData element: look up the userid for its domain
         override def flatMap1(value: (String, String, Long), out: Collector[(String, String, Long)]): Unit = {
           val domain = value._2
           val userid = userDomainMap.get(domain)
           out.collect((value._1, value._2 + "=" + userid, value._3))
         }

         // called for each sqlData element: replace the local mapping
         override def flatMap2(value: mutable.HashMap[String, String], out: Collector[(String, String, Long)]): Unit = {
           userDomainMap = value
         }
       })
connectData.print()


3. Output
 

8> (2019-07-06 11:23,v1.go2yd.com=None,9682)
3> (2019-07-06 11:23,v4.go2yd.com=None,14199)
7> (2019-07-06 11:23,vmi.go2yd.com=None,31990)
5> (2019-07-06 11:23,v3.go2yd.com=None,27307)
8> (2019-07-06 11:23,v2.go2yd.com=None,10066)

4. Analysis

1. As you can see, the userid was not resolved from the domain. Print sqlData to inspect it:

2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> (2019-07-06 12:45,v4.go2yd.com=None,20299)
5> (2019-07-06 12:45,v3.go2yd.com=None,13853)
8> (2019-07-06 12:45,v2.go2yd.com=None,24317)
8> (2019-07-06 12:45,v1.go2yd.com=None,13945)
7> (2019-07-06 12:45,vmi.go2yd.com=Some(10002),15739)

2. The sqlData itself is correct. But sqlData arrived at subtask 2, while the logData records arrived at subtasks 5, 8, 8 and 7. Parallel subtasks do not share data by default, so some subtasks of the connected operator never received sqlData. Here is the sqlData source code:

import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}

import scala.collection.mutable

class JdbcSourceFunction extends RichSourceFunction[mutable.HashMap[String, String]] {

  var connection: Connection = null
  var pstmt: PreparedStatement = null
  val sql = "select userid,domain from user_domain_map"

  // open the JDBC connection
  override def open(parameters: Configuration): Unit = {
    Class.forName(Constant.JdbcInfo.DRIVER_NAME)
    connection = DriverManager.getConnection(Constant.JdbcInfo.URL,
      Constant.JdbcInfo.USER,
      Constant.JdbcInfo.PW)
    pstmt = connection.prepareStatement(sql)
  }

  // read the whole table once and emit it as a single map
  override def run(ctx: SourceFunction.SourceContext[mutable.HashMap[String, String]]): Unit = {
    val resultSet = pstmt.executeQuery()
    val map: mutable.HashMap[String, String] = new mutable.HashMap[String, String]()
    while (resultSet.next()) {
      val userid = resultSet.getString("userid")
      val domain = resultSet.getString("domain")
      map.put(domain, userid)
    }
    ctx.collect(map)
  }

  override def cancel(): Unit = {}

  // release JDBC resources in close(), which runs on both normal completion
  // and cancellation (cancel() only runs when the job is cancelled)
  override def close(): Unit = {
    if (pstmt != null) {
      pstmt.close()
    }
    if (connection != null) {
      connection.close()
    }
  }
}


3. Notice that JdbcSourceFunction extends RichSourceFunction, which is a non-parallel source: it always runs with parallelism 1. logData, however, runs with parallelism 8 (my machine has 8 cores). We can confirm this from the logData source:

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val data = env.addSource(getConsumer)

// build the Kafka consumer
def getConsumer = {
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", Constant.KafkaInfo.BOOTSTRAP_SERVERS)
    properties.setProperty("group.id", Constant.KafkaInfo.GROUP_ID)
    val consumer = new FlinkKafkaConsumer[String](
      Constant.KafkaInfo.TOPIC,
      new SimpleStringSchema(),
      properties
    )
    consumer
}

FlinkKafkaConsumer extends FlinkKafkaConsumerBase
FlinkKafkaConsumerBase extends RichParallelSourceFunction
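
As a quick runtime check (a hypothetical addition, not part of the original job), a rich source can log its subtask index in open(); the non-parallel JDBC source prints one line, while the Kafka source prints eight:

// Hypothetical logging inside JdbcSourceFunction.open() (or the Kafka
// source's open()) to confirm the actual parallelism at runtime.
override def open(parameters: Configuration): Unit = {
  val rc = getRuntimeContext
  println(s"source subtask ${rc.getIndexOfThisSubtask + 1} of ${rc.getNumberOfParallelSubtasks}")
  // ... existing JDBC setup from above ...
}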


4. For every subtask to receive sqlData, the sqlData source must run on the same set of subtasks as logData; that is, JdbcSourceFunction should extend RichParallelSourceFunction instead, as sketched below. Rerun:
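
// Sketch of the fix: only the superclass changes; open()/run()/cancel()/close()
// stay exactly as above. A parallel source runs run() once per subtask, so each
// subtask emits its own copy of the domain -> userid map, and every subtask of
// the connected operator receives it.
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction

class JdbcSourceFunction extends RichParallelSourceFunction[mutable.HashMap[String, String]] {
  // ... same body as the RichSourceFunction version ...
}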

7> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
8> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
1> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
6> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
4> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
5> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
8> (2019-07-06 12:58,v2.go2yd.com,512414)
5> (2019-07-06 12:58,v3.go2yd.com,549535)
7> (2019-07-06 12:58,vmi.go2yd.com,417180)
3> (2019-07-06 12:58,v4.go2yd.com,520637)
8> (2019-07-06 12:58,v1.go2yd.com,533559)
5> (2019-07-06 12:58,v3.go2yd.com=Some(10002),549535)
7> (2019-07-06 12:58,vmi.go2yd.com=Some(10002),417180)
8> (2019-07-06 12:58,v2.go2yd.com=Some(10001),512414)
3> (2019-07-06 12:58,v4.go2yd.com=Some(10003),520637)
8> (2019-07-06 12:58,v1.go2yd.com=Some(10001),533559)


5. As you can see, logData now resolves the userid from sqlData correctly.
6. But what if we want the job to run with 1 or 3 parallel subtasks instead of the default 8? That brings us to how Flink's parallelism is set and which setting takes precedence.
7. Put simply, Flink's parallelism determines how many slots the tasks are spread across for execution.
8. Parallelism can be set in four places, listed from highest to lowest precedence (see the sketch after this list):

  1. operator level (on each source, transform, or sink)
  2. env level (env.setParallelism)
  3. client level (the -p flag when submitting the job)
  4. system level (parallelism.default in flink-conf.yaml)
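
A minimal sketch of the two in-code levels (the other two live outside the program); an operator-level setting overrides the env-level one:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(3)                        // env level: default for every operator in this job

val sqlData = env.addSource(new JdbcSourceFunction)
  .setParallelism(1)                         // operator level: overrides the env setting for this source

// client level:  flink run -p 3 myjob.jar
// system level:  parallelism.default: 3   (in flink-conf.yaml)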

9. So to guarantee that the two streams share data, the entire flow of both streams must run in the same set of slots (one or more).

10. If we set it at the operator level, there are three places in this project that need a parallelism setting (sketched after the list):

  1. logData: source and transforms
  2. sqlData: source
  3. logData.connect(sqlData)
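
A sketch of the operator-level approach, reusing the classes defined above and pinning everything to parallelism 3. UserDomainCoFlatMap is a hypothetical name for the anonymous CoFlatMapFunction shown earlier:

// 1. logData: source and every transform in the chain
val logData = env.addSource(getConsumer).setParallelism(3)
  .map(new MapSplitFunction).setParallelism(3)
  .filter(_._1 == "E").setParallelism(3)
  .filter(_._2 != 0).setParallelism(3)
  .map(x => (x._2, x._3, x._4)).setParallelism(3)
  .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator)
  .keyBy(1)
  .window(TumblingEventTimeWindows.of(Time.seconds(10)))
  .apply(new MyWindowFunction).setParallelism(3)

// 2. sqlData: source
val sqlData = env.addSource(new JdbcSourceFunction).setParallelism(3)

// 3. the connected operator
val connectData = logData.connect(sqlData)
  .flatMap(new UserDomainCoFlatMap).setParallelism(3)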


11. Test: set parallelism on the first two places but not the third:

  1. logData: source and transforms (parallelism set)
  2. sqlData: source (parallelism set)
  3. logData.connect(sqlData) (parallelism not set)
2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
1> (2019-07-06 15:45,vmi.go2yd.com,87001)
7> (2019-07-06 15:45,v4.go2yd.com,45513)
2> (2019-07-06 15:45,v3.go2yd.com,57755)
3> (2019-07-06 15:45,v1.go2yd.com,50410)
8> (2019-07-06 15:45,v2.go2yd.com,86325)
8> (2019-07-06 15:45,v4.go2yd.com=None,45513)
2> (2019-07-06 15:45,vmi.go2yd.com=None,87001)
4> (2019-07-06 15:45,v1.go2yd.com=None,50410)
3> (2019-07-06 15:45,v3.go2yd.com=None,57755)
1> (2019-07-06 15:45,v2.go2yd.com=Some(10001),86325)

12. As soon as one place is missed, that operator falls back to the default of 8 and the extra 5 slots receive no sqlData, so the problem reappears. This approach requires familiarity with the whole pipeline and adding the setting at every spot, which is tedious.

13. It is simpler to set it once with env.setParallelism(3). Test result:

1> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> (2019-07-06 15:33,v2.go2yd.com,7868)
1> (2019-07-06 15:33,v4.go2yd.com,5400)
2> (2019-07-06 15:33,v3.go2yd.com,6737)
3> (2019-07-06 15:33,v1.go2yd.com,6560)
3> (2019-07-06 15:33,vmi.go2yd.com,9068)
1> (2019-07-06 15:33,v4.go2yd.com=Some(10005),5400)
3> (2019-07-06 15:33,v2.go2yd.com=Some(10001),7868)
2> (2019-07-06 15:33,v3.go2yd.com=Some(10002),6737)
3> (2019-07-06 15:33,v1.go2yd.com=Some(10001),6560)
3> (2019-07-06 15:33,vmi.go2yd.com=Some(10002),9068)


14. Through this problem encountered in a real project, we have gained an initial understanding of Flink's parallelism and its precedence rules.
