Big Data Series 8: Flink Real-Time Stream Computing - Time Windows (Kafka to MySQL)

1. Preface

Compared with Storm and Spark, Flink is probably not as widely used in industry, as search-engine hit volumes and the quality of online articles suggest: the related concepts are explained rather inconsistently, and even the official documentation is not especially clear.

2. Concepts

Flink cluster deployment modes:
Standalone (the simplest; its weaknesses are poor resource isolation and unbalanced job deployment), YARN (mature, commonly used in production), Mesos, and Docker.

This article focuses on Flink on YARN. In this mode the JobManager is a single point of failure, which is usually addressed with HA. Without HA, the mode needs essentially no installation or configuration: download the release from the official site, unpack it, and place it on each node.
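With HA enabled, ZooKeeper provides JobManager leader election and YARN restarts failed application masters. A minimal flink-conf.yaml sketch (the quorum hosts, HDFS path, and attempt count below are illustrative placeholders, not values from this cluster):

high-availability: zookeeper
high-availability.zookeeper.quorum: ipsnode1:2181,ipsnode2:2181,ipsnode3:2181
high-availability.storageDir: hdfs:///flink/ha/
yarn.application-attempts: 4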

Flink on YARN jobs can be submitted in two ways (important, as the related concepts are often muddled elsewhere); option 2 is recommended for production.
1. yarn session (start a long-running Flink cluster on YARN). The cluster can be started in two ways, differing only in the -d flag: with -d, the client detaches and exits after submitting the Flink cluster to YARN. #ENV VAR MUST SETUP: YARN_CONF_DIR or HADOOP_CONF_DIR
----1.1 Client (attached) mode:
./bin/yarn-session.sh -jm 500 -tm 500 #jm & tm both default to 1G; HADOOP_CLASSPATH must be set
./bin/flink run ./examples/batch/WordCount.jar -input hdfs://ip:9000/LICENSE -output hdfs://ip:9000/wordcount-result_1.txt
----1.2 Detached mode #i.e. the -d flag
./bin/yarn-session.sh -nm yarn-session_test -d

2. flink run (run a single Flink job on YARN) #-d can be used here as well
./bin/flink run -m yarn-cluster -yjm 500m -ytm 500m ./examples/batch/WordCount.jar

To stop:
echo "stop" | ./bin/yarn-session.sh -id <applicationId>
or: yarn application -kill <applicationId> #does not clean up all files the jobs leave behind
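The <applicationId> for either command can be looked up with:
yarn application -list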

3. Installation and Configuration

If you install a Flink build that bundles the Hadoop jars (recent releases no longer ship one), no configuration is needed: unpack the official download and place it on each node. Otherwise, especially on the latest versions, either set HADOOP_CLASSPATH or copy the required Hadoop jars into Flink's lib directory (mainly when jar conflicts arise):
export HADOOP_CLASSPATH=/usr/local/hadoop-3.1.3/share/hadoop/common/lib/*:/usr/local/hadoop-3.1.3/share/hadoop/yarn/*:/usr/local/hadoop-3.1.3/share/hadoop/hdfs/*
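If the hadoop command is on the PATH, the form recommended in the Flink documentation avoids hand-listing directories:
export HADOOP_CLASSPATH=`hadoop classpath`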

4. Program

K2FlinkDemo.java

package org.ips.demo;

import java.util.Properties;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
import org.apache.flink.util.Collector;

public class K2FlinkDemo {

    private static final String READ_TOPIC = "IPS_CASH";
    // Reserved for writing results back to Kafka; not used below
    private static final String WRITER_TOPIC = "IPS_CASH_RESULT";

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka consumer configuration
        Properties props = new Properties();
        props.put("bootstrap.servers", "192.168.100.101:9092");
        props.put("group.id", "flink-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "latest");

        // Source: raw text lines from the Kafka topic
        DataStreamSource<String> dssk = env
                .addSource(new FlinkKafkaConsumer<>(READ_TOPIC, new SimpleStringSchema(), props))
                .setParallelism(1);

        // Split "key value" lines into tuples, key by the first field, and sum
        // the counts over a 60-second window (size == slide, i.e. tumbling)
        DataStream<Tuple2<String, Integer>> counts = dssk
                .flatMap(new LineSplitter())
                .keyBy(0)
                .timeWindow(Time.seconds(60), Time.seconds(60))
                .sum(1);

        counts.print();
        counts.addSink(new MySQLSink());

        env.execute("flink custom kafka");
    }

    /** Splits a "key value" line into Tuple2(key, value); a non-numeric value becomes 0. */
    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            if (tokens.length > 1) {
                try {
                    out.collect(new Tuple2<>(tokens[0], Integer.parseInt(tokens[1])));
                } catch (NumberFormatException e) {
                    out.collect(new Tuple2<>(tokens[0], 0));
                }
            }
        }
    }
}
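A note on the window call above: timeWindow(Time.seconds(60), Time.seconds(60)) sets the slide equal to the size, so it behaves as a tumbling window. For comparison, a sketch of both window variants over the same stream:

// Tumbling: non-overlapping 60-second windows (equivalent to the call above)
dssk.flatMap(new LineSplitter())
    .keyBy(0)
    .timeWindow(Time.seconds(60))
    .sum(1);

// Sliding: 60-second windows evaluated every 10 seconds, so each element
// is counted in six overlapping windows
dssk.flatMap(new LineSplitter())
    .keyBy(0)
    .timeWindow(Time.seconds(60), Time.seconds(10))
    .sum(1);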

MySQLSink.java

package org.ips.demo;

import java.sql.DriverManager;
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;


/**
 * Sink that writes each (key, count) tuple into MySQL table test1.
 * This simple version opens and closes a JDBC connection for every record;
 * a connection-reusing variant is sketched below.
 */
public class MySQLSink extends RichSinkFunction<Tuple2<String, Integer>> {

    private Connection connection;
    private PreparedStatement preparedStatement;
    String username = "root";
    String password = "Msql#1234";
    String drivername = "com.mysql.cj.jdbc.Driver";
    String dburl = "jdbc:mysql://192.168.100.102:3310/test";

    @Override
    public void invoke(Tuple2<String, Integer> value) throws Exception {
        Class.forName(drivername);
        connection = DriverManager.getConnection(dburl, username, password);
        String sql = "insert into test1(ky,vl) values(?,?)";
        preparedStatement = connection.prepareStatement(sql);
        preparedStatement.setString(1, value.f0);
        preparedStatement.setInt(2, value.f1);
        preparedStatement.executeUpdate();
        if (preparedStatement != null) {
            preparedStatement.close();
        }
        if (connection != null) {
            connection.close();
        }
    }
}
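Opening and closing a JDBC connection for every record works but is costly under load. Below is a minimal sketch of a connection-reusing variant built on the RichSinkFunction open()/close() lifecycle; the class name PooledMySQLSink is hypothetical, and the URL, credentials, and table are the same assumptions as above.

package org.ips.demo;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

// Sketch: one JDBC connection per parallel task instead of one per record.
public class PooledMySQLSink extends RichSinkFunction<Tuple2<String, Integer>> {

    private transient Connection connection;
    private transient PreparedStatement preparedStatement;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Called once per parallel task, before any invoke() call
        Class.forName("com.mysql.cj.jdbc.Driver");
        connection = DriverManager.getConnection(
                "jdbc:mysql://192.168.100.102:3310/test", "root", "Msql#1234");
        preparedStatement = connection.prepareStatement("insert into test1(ky,vl) values(?,?)");
    }

    @Override
    public void invoke(Tuple2<String, Integer> value) throws Exception {
        preparedStatement.setString(1, value.f0);
        preparedStatement.setInt(2, value.f1);
        preparedStatement.executeUpdate();
    }

    @Override
    public void close() throws Exception {
        // Called once when the task shuts down
        if (preparedStatement != null) preparedStatement.close();
        if (connection != null) connection.close();
    }
}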

pom.xml



<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>org.ips</groupId>
  <artifactId>demo</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <name>demo</name>

  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-streaming-java_2.11</artifactId>
      <version>1.10.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-connector-kafka_2.11</artifactId>
      <version>1.10.1</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>6.0.2</version>
      <scope>runtime</scope>
    </dependency>
  </dependencies>

  <build>
    <finalName>K2FlinkDemo</finalName>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <archive>
            <manifest>
              <mainClass>org.ips.demo.K2FlinkDemo</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

5. Build

mvn -U clean package assembly:assembly -Dmaven.test.skip=true

6. Run

[root@ipsnode1 flink-1.10.1]# ./bin/flink run -m yarn-cluster /home/test/K2Flink2K.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/flink-1.10.1/lib/slf4j-log4j12-1.7.15.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-3.1.3/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2020-04-17 22:39:08,947 INFO org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at /0.0.0.0:10200
2020-04-17 22:39:08,976 INFO org.apache.flink.yarn.YarnClusterDescriptor - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2020-04-17 22:39:09,382 INFO org.apache.hadoop.conf.Configuration - resource-types.xml not found
2020-04-17 22:39:09,383 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils - Unable to find 'resource-types.xml'.
2020-04-17 22:39:09,451 INFO org.apache.flink.yarn.YarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=1024, taskManagerMemoryMB=1728, slotsPerTaskManager=1}
2020-04-17 22:39:11,137 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:11,582 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:11,724 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:12,560 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:13,105 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:13,177 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:13,568 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:14,824 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:17,637 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:18,604 INFO org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient - SASL encryption trust check: localHostTrusted = false, remoteHostTrusted = false
2020-04-17 22:39:20,693 INFO org.apache.flink.yarn.YarnClusterDescriptor - Submitting application master application_1592302610578_0007
2020-04-17 22:39:20,939 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1592302610578_0007
2020-04-17 22:39:20,940 INFO org.apache.flink.yarn.YarnClusterDescriptor - Waiting for the cluster to be allocated
2020-04-17 22:39:21,183 INFO org.apache.flink.yarn.YarnClusterDescriptor - Deploying cluster, current state ACCEPTED
2020-04-17 22:39:41,121 INFO org.apache.flink.yarn.YarnClusterDescriptor - YARN application has been deployed successfully.
2020-04-17 22:39:41,127 INFO org.apache.flink.yarn.YarnClusterDescriptor - Found Web Interface ipsnode3:59460 of application 'application_1592302610578_0007'.
Job has been submitted with JobID 0c96066f2c453e1da5613f2fc7f1e5d5
^C[root@ipsnode1 flink-1.10.1]#

7. Verification

Send test data with the Kafka console producer (each line is a "key count" pair, the format LineSplitter expects):
[root@ipsnode2 bin]# ./kafka-console-producer.sh --broker-list ipsnode1:9092 --topic IPS_CASH
>aaa 100
>aaa 10
>aaa 1
>bbb 200
>bbb 22
>ccc 333
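
The sink expects the target table to exist in advance; a minimal DDL sketch (the column types are an assumption, sized to the demo data):
CREATE TABLE test1 (ky VARCHAR(64), vl INT);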

Verify in MySQL:
mysql> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from test1;
+------+------+
| ky   | vl   |
+------+------+
| aaa  |  100 |
| aaa  |  110 |
| aaa  |  111 |
| bbb  |  200 |
| bbb  |  222 |
| ccc  |  333 |
+------+------+
6 rows in set (0.00 sec)

mysql>
