val env = StreamExecutionEnvironment.getExecutionEnvironment
// use event time
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// the stream consumed from Kafka
val data = env.addSource(getConsumer)
val logData = data.map(new MapSplitFunction)
  .filter(_._1 == "E")
  .filter(_._2 != 0)
  .map(x => (x._2, x._3, x._4))
  .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessGenerator) // assign timestamps and watermarks
  .keyBy(1)
  .window(TumblingEventTimeWindows.of(Time.seconds(10)))
  .apply(new MyWindowFunction)
// the stream read from MySQL
val sqlData = env.addSource(new JdbcSourceFunction)
// connect the two streams
val connectData = logData.connect(sqlData)
  .flatMap(new CoFlatMapFunction[(String, String, Long), mutable.HashMap[String, String], (String, String, Long)] {
    var userDomainMap = new mutable.HashMap[String, String]()

    override def flatMap1(value: (String, String, Long), out: Collector[(String, String, Long)]): Unit = {
      val domain = value._2
      // Option[String]: None when this subtask has not received the map
      val userid = userDomainMap.get(domain)
      out.collect((value._1, value._2 + "=" + userid, value._3))
    }

    override def flatMap2(value: mutable.HashMap[String, String], out: Collector[(String, String, Long)]): Unit = {
      userDomainMap = value
    }
  })
connectData.print()
8> (2019-07-06 11:23,v1.go2yd.com=None,9682)
3> (2019-07-06 11:23,v4.go2yd.com=None,14199)
7> (2019-07-06 11:23,vmi.go2yd.com=None,31990)
5> (2019-07-06 11:23,v3.go2yd.com=None,27307)
8> (2019-07-06 11:23,v2.go2yd.com=None,10066)
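The `BoundedOutOfOrdernessGenerator` used in the pipeline above is not shown in this post. Its core logic (track the maximum event timestamp seen so far and emit a watermark that trails it by a fixed bound) can be sketched in plain Scala; the class and member names below are hypothetical, not Flink's API:

```scala
// Sketch of bounded-out-of-orderness watermarking: the watermark always lags
// the largest timestamp seen by maxOutOfOrdernessMs, so elements up to that
// much out of order are still assigned to their (not yet fired) windows.
class BoundedOutOfOrderness(maxOutOfOrdernessMs: Long) {
  private var currentMaxTimestamp = Long.MinValue + maxOutOfOrdernessMs

  // called per element: record its event time, keep the max seen so far
  def extractTimestamp(eventTimeMs: Long): Long = {
    currentMaxTimestamp = math.max(currentMaxTimestamp, eventTimeMs)
    eventTimeMs
  }

  // called periodically: watermark = max timestamp minus the lateness bound
  def currentWatermark: Long = currentMaxTimestamp - maxOutOfOrdernessMs
}
```

A 10-second tumbling event-time window like the one above fires once this watermark passes the window's end timestamp; a late element within the bound does not move the watermark backwards.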
1. Notice that no userid is being resolved from the domain. Let's print sqlData to take a look:
2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> (2019-07-06 12:45,v4.go2yd.com=None,20299)
5> (2019-07-06 12:45,v3.go2yd.com=None,13853)
8> (2019-07-06 12:45,v2.go2yd.com=None,24317)
8> (2019-07-06 12:45,v1.go2yd.com=None,13945)
7> (2019-07-06 12:45,vmi.go2yd.com=Some(10002),15739)
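The `=None` in the output is simply Scala's `Option` rendered as a string: `mutable.HashMap.get` returns an `Option[String]`, and `flatMap1` concatenates that Option directly. A minimal reproduction (domain and userid values borrowed from the output above):

```scala
import scala.collection.mutable

// an empty map, as seen by a subtask that never received sqlData
val userDomainMap = mutable.HashMap[String, String]()
// same concatenation as in flatMap1 above
val miss = "v1.go2yd.com" + "=" + userDomainMap.get("v1.go2yd.com")
// miss is "v1.go2yd.com=None"

// once the map has arrived, the same lookup yields Some(...)
userDomainMap.put("v1.go2yd.com", "10001")
val hit = "v1.go2yd.com" + "=" + userDomainMap.get("v1.go2yd.com")
// hit is "v1.go2yd.com=Some(10001)"
```

So `None` here does not mean the MySQL data is wrong; it means that particular subtask's `userDomainMap` is still empty.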
2. The sqlData itself is fine. But sqlData arrives on subtask 2, while the logData records arrive on subtasks 5, 8, 8, and 7. By default, data is not shared across subtasks, so some subtasks performing the connect never receive sqlData at all. Let's open the sqlData source code:
class JdbcSourceFunction extends RichSourceFunction[mutable.HashMap[String, String]] {
  var connection: Connection = null
  var pstmt: PreparedStatement = null
  val sql = "select userid,domain from user_domain_map"

  // open the database connection
  override def open(parameters: Configuration): Unit = {
    Class.forName(Constant.JdbcInfo.DRIVER_NAME)
    connection = DriverManager.getConnection(Constant.JdbcInfo.URL,
      Constant.JdbcInfo.USER,
      Constant.JdbcInfo.PW)
    pstmt = connection.prepareStatement(sql)
  }

  override def run(ctx: SourceFunction.SourceContext[mutable.HashMap[String, String]]): Unit = {
    val resultSet = pstmt.executeQuery()
    val map: mutable.HashMap[String, String] = new mutable.HashMap[String, String]()
    while (resultSet.next()) {
      val userid = resultSet.getString("userid")
      val domain = resultSet.getString("domain")
      map.put(domain, userid)
    }
    // emit the whole map as a single element
    ctx.collect(map)
  }

  override def cancel(): Unit = {
    if (pstmt != null) {
      pstmt.close()
    }
    if (connection != null) {
      connection.close()
    }
  }
}
3. JdbcSourceFunction extends RichSourceFunction, which is a non-parallel source (its parallelism is fixed at 1), whereas logData runs with parallelism 8 (my machine has 8 cores). We can also confirm this by looking at how logData is sourced:
val data = env.addSource(getConsumer)

// connect to Kafka as a consumer
def getConsumer = {
  val properties = new Properties()
  properties.setProperty("bootstrap.servers", Constant.KafkaInfo.BOOTSTRAP_SERVERS)
  properties.setProperty("group.id", Constant.KafkaInfo.GROUP_ID)
  val consumer = new FlinkKafkaConsumer[String](
    Constant.KafkaInfo.TOPIC,
    new SimpleStringSchema(),
    properties
  )
  consumer
}
FlinkKafkaConsumer extends FlinkKafkaConsumerBase, and FlinkKafkaConsumerBase extends RichParallelSourceFunction.
4. To make sqlData available everywhere, its subtasks must cover those of logData. In other words, JdbcSourceFunction should extend RichParallelSourceFunction instead (only the extends clause needs to change). Rerun:
7> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
8> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
1> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
6> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
4> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
5> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10003, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
8> (2019-07-06 12:58,v2.go2yd.com,512414)
5> (2019-07-06 12:58,v3.go2yd.com,549535)
7> (2019-07-06 12:58,vmi.go2yd.com,417180)
3> (2019-07-06 12:58,v4.go2yd.com,520637)
8> (2019-07-06 12:58,v1.go2yd.com,533559)
5> (2019-07-06 12:58,v3.go2yd.com=Some(10002),549535)
7> (2019-07-06 12:58,vmi.go2yd.com=Some(10002),417180)
8> (2019-07-06 12:58,v2.go2yd.com=Some(10001),512414)
3> (2019-07-06 12:58,v4.go2yd.com=Some(10003),520637)
8> (2019-07-06 12:58,v1.go2yd.com=Some(10001),533559)
5. Notice that logData now successfully looks up the data from sqlData.
6. Now, what if I want the job to run with 1 or 3 threads instead of the default 8? This brings us to how Flink's parallelism is set, and the priority among those settings.
7. Flink's parallelism can be understood, roughly, as how many containers (slots) the tasks are placed into for execution.
8. Parallelism can be set at the following four levels, listed from highest to lowest priority:
- operator level: setParallelism() on an individual operator
- execution environment level: env.setParallelism()
- client level: the -p option of flink run
- system level: parallelism.default in flink-conf.yaml
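The fallback order among these settings can be modeled as a small plain-Scala function (illustrative only; this is not a Flink API):

```scala
// Effective parallelism: an operator-level setting wins over the
// environment-level one, which wins over the client's -p option,
// which wins over parallelism.default from flink-conf.yaml.
def effectiveParallelism(operatorLevel: Option[Int],
                         envLevel: Option[Int],
                         clientLevel: Option[Int],
                         configDefault: Int): Int =
  operatorLevel.orElse(envLevel).orElse(clientLevel).getOrElse(configDefault)
```

For example, with env.setParallelism(3) and no operator-level setting, every operator in the job runs with parallelism 3.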
9. So to guarantee that the two streams can share data, the entire pipeline of both streams must run in the same set of slots (one or more).
10. If we set it at the operator level, there are three places in this project where the parallelism must be set.
11. Test:
2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
1> (2019-07-06 15:45,vmi.go2yd.com,87001)
7> (2019-07-06 15:45,v4.go2yd.com,45513)
2> (2019-07-06 15:45,v3.go2yd.com,57755)
3> (2019-07-06 15:45,v1.go2yd.com,50410)
8> (2019-07-06 15:45,v2.go2yd.com,86325)
8> (2019-07-06 15:45,v4.go2yd.com=None,45513)
2> (2019-07-06 15:45,vmi.go2yd.com=None,87001)
4> (2019-07-06 15:45,v1.go2yd.com=None,50410)
3> (2019-07-06 15:45,v3.go2yd.com=None,57755)
1> (2019-07-06 15:45,v2.go2yd.com=Some(10001),86325)
12. Notice that missing even one of those places breaks things again (that operator falls back to the default of 8, and the 5 extra slots receive no sqlData). This approach requires being familiar with the whole pipeline and touching every spot individually, which is cumbersome.
13. Instead, set it once with env.setParallelism(3). Test result:
1> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
2> Map(v3.go2yd.com -> 10002, vmi.go2yd.com -> 10002, v4.go2yd.com -> 10005, v1.go2yd.com -> 10001, v2.go2yd.com -> 10001)
3> (2019-07-06 15:33,v2.go2yd.com,7868)
1> (2019-07-06 15:33,v4.go2yd.com,5400)
2> (2019-07-06 15:33,v3.go2yd.com,6737)
3> (2019-07-06 15:33,v1.go2yd.com,6560)
3> (2019-07-06 15:33,vmi.go2yd.com,9068)
1> (2019-07-06 15:33,v4.go2yd.com=Some(10005),5400)
3> (2019-07-06 15:33,v2.go2yd.com=Some(10001),7868)
2> (2019-07-06 15:33,v3.go2yd.com=Some(10002),6737)
3> (2019-07-06 15:33,v1.go2yd.com=Some(10001),6560)
3> (2019-07-06 15:33,vmi.go2yd.com=Some(10002),9068)
14. Through the problems encountered in this project, we have gained an initial understanding of Flink's parallelism and the priority among its settings.