Flink: consume Kafka data, aggregate, filter, union the streams, and sink back to Kafka

Consuming and producing Kafka data with Flink

My graduation project recently called for a small feature: analyze logs, transform and merge them, and push the result back onto Kafka. This post records the approach and the code.

  1. First, let's be clear about the business flow. We need to:

    1. consume data from Kafka as two data streams; the records are JSON strings on Kafka;
    2. process and analyze the data (aggregation, filtering, transformation) and normalize both streams into one common format;
    3. union the two streams;
    4. sink the result to Kafka with a custom key and value.

    OK, with the requirements clear, let's get to the code; feel free to adapt it to your own business. A bird's-eye sketch of the whole pipeline follows right below.
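
    Here is that sketch as a single Flink job. It is only an outline: kafkaProps, transformStream1/transformStream2, and buildKafkaSink are placeholders standing in for the configuration, the analysis operators, and the sink that are built out in the sections below.

        // outline only: two Kafka sources -> per-stream transformation -> union -> Kafka sink
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        FlinkKafkaConsumer<String> source1 = new FlinkKafkaConsumer<>("inTopic1", new SimpleStringSchema(), kafkaProps);
        FlinkKafkaConsumer<String> source2 = new FlinkKafkaConsumer<>("inTopic2", new SimpleStringSchema(), kafkaProps);

        DataStream<String> stream1 = transformStream1(env, source1); // aggregate / filter / convert stream 1
        DataStream<String> stream2 = transformStream2(env, source2); // aggregate / filter / convert stream 2

        stream1.union(stream2)                        // merge the two normalized streams
               .addSink(buildKafkaSink("outTopic"));  // write to Kafka with a custom key and value

        env.execute("union DataStream");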

  2. The pom file contents:

    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>

      <groupId>com.huyue.flink</groupId>
      <artifactId>FlinkConsumerAOI</artifactId>
      <version>0.0.1-SNAPSHOT</version>
      <packaging>jar</packaging>

      <name>FlinkConsumerAOI</name>
      <url>http://maven.apache.org</url>

      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      </properties>

      <dependencies>
        <dependency>
          <groupId>com.alibaba</groupId>
          <artifactId>fastjson</artifactId>
          <version>1.2.56</version>
        </dependency>

        <dependency>
          <groupId>org.redisson</groupId>
          <artifactId>redisson</artifactId>
          <version>3.11.6</version>
        </dependency>

        <dependency>
          <groupId>org.apache.poi</groupId>
          <artifactId>poi</artifactId>
          <version>4.0.1</version>
        </dependency>

        <dependency>
          <groupId>org.apache.poi</groupId>
          <artifactId>poi-ooxml</artifactId>
          <version>4.0.1</version>
        </dependency>

        <dependency>
          <groupId>org.apache.poi</groupId>
          <artifactId>poi-ooxml-schemas</artifactId>
          <version>4.0.1</version>
        </dependency>

        <dependency>
          <groupId>c3p0</groupId>
          <artifactId>c3p0</artifactId>
          <version>0.9.0.4</version>
        </dependency>

        <dependency>
          <groupId>com.zaxxer</groupId>
          <artifactId>HikariCP</artifactId>
          <version>3.1.0</version>
        </dependency>

        <dependency>
          <groupId>net.sourceforge.javacsv</groupId>
          <artifactId>javacsv</artifactId>
          <version>2.0</version>
        </dependency>

        <dependency>
          <groupId>org.postgresql</groupId>
          <artifactId>postgresql</artifactId>
          <version>42.2.5</version>
        </dependency>

        <dependency>
          <groupId>org.apache.kafka</groupId>
          <artifactId>kafka-clients</artifactId>
          <version>2.2.0</version>
        </dependency>

        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-clients_2.12</artifactId>
          <version>1.11.1</version>
        </dependency>

        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-streaming-java_2.12</artifactId>
          <version>1.11.1</version>
        </dependency>

        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-java</artifactId>
          <version>1.11.1</version>
        </dependency>

        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-connector-kafka_2.12</artifactId>
          <version>1.11.1</version>
        </dependency>

        <dependency>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
          <version>1.1.1</version>
        </dependency>

        <dependency>
          <groupId>commons-cli</groupId>
          <artifactId>commons-cli</artifactId>
          <version>1.4</version>
        </dependency>

        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
          <version>1.8.0-beta0</version>
        </dependency>

        <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
          <version>1.8.0-beta0</version>
        </dependency>

        <dependency>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
          <version>1.2.17</version>
        </dependency>

        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>3.8.1</version>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </project>
    
    

    To make it easy to watch Flink's processing logs, we pull in the log4j jars and place a log4j.properties file under the project's src directory:

    # log4j.properties file
    # Root logger option
    log4j.rootLogger=info, stdout
    
    # Direct log messages to stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.Target=System.out
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %m%n
    
  3. The main method: Kafka configuration, defining the two data streams, invoking the analysis operators, and unioning the streams before sinking to Kafka. (An alternative sink based on FlinkKafkaProducer is sketched right after the code.)

    /**
     * @Author: Hu.Yue
     * @Title: main
     * @Description: consume two Kafka topics, transform them, union the streams, and sink back to Kafka
     * @param args
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // source topic for stream 1
        String topic = "inTopic1";
        // source topic for stream 2
        String topic2 = "inTopic2";
        // target topic for the merged stream
        String outTopic = "outTopic";

        // create the execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        // enable checkpointing (every 10 s) and use processing time
        environment.enableCheckpointing(10000);
        environment.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);

        Properties props = new Properties();
        // Kafka cluster addresses
        props.setProperty("bootstrap.servers", "10.144.3.155:9092,10.144.5.233:9092,10.144.4.54:9092");
        // authentication (skip these three lines if your cluster does not use SASL);
        // note that the mechanism must match the login module (ScramLoginModule pairs with SCRAM-SHA-256/512, PlainLoginModule with PLAIN)
        props.put("sasl.jaas.config", "org.apache.kafka.common.security.scram.ScramLoginModule required username='xxx' password='xxx';");
        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "PLAIN");
        // consumer group
        props.setProperty("group.id", "java_group1");
        // fallback when no committed offset exists (the consumers below override this via setStartFromEarliest)
        props.setProperty("auto.offset.reset", "earliest");
        // consumer deserializers
        props.setProperty("key.deserializer", StringDeserializer.class.getName());
        props.setProperty("value.deserializer", StringDeserializer.class.getName());
        // producer serializers
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // producer tuning
        props.put("acks", "all");
        props.put("retries", 3);
        props.put("batch.size", 65536);
        props.put("linger.ms", 1);
        props.put("buffer.memory", 33554432);
        props.put("max.request.size", 10485760);

        // two String-typed consumers
        FlinkKafkaConsumer<String> consumerBoard = new FlinkKafkaConsumer<String>(topic, new SimpleStringSchema(), props);
        FlinkKafkaConsumer<String> consumerFovComp = new FlinkKafkaConsumer<String>(topic2, new SimpleStringSchema(), props);

        // start positions
        consumerBoard.setStartFromEarliest();
        consumerFovComp.setStartFromEarliest();

        // build the two streams; the conversion methods are shown below
        DataStream<String> dataStream_board = ConversionBoard(environment, consumerBoard);
        DataStream<String> dataStream_dovComp = ConversionFovComp(environment, consumerFovComp);

        // union the two streams and sink them with a RichSinkFunction that sets a custom key and value
        DataStream<String> unionStream = dataStream_board.union(dataStream_dovComp);
        unionStream.addSink(new RichSinkFunction<String>() {
            private static final long serialVersionUID = 1L;
            KafkaProducer<String, String> producer = null;

            @Override
            public void open(Configuration parameters) {
                producer = new KafkaProducer<String, String>(props);
            }

            @Override
            public void invoke(String value) throws Exception {
                if (null != value) {
                    // random key in [0, 5] to spread records across partitions
                    String key = String.format("%d", (int) (Math.random() * 6));
                    ProducerRecord<String, String> producerRecord = new ProducerRecord<String, String>(outTopic, key, value);

                    producer.send(producerRecord);
                    producer.flush();
                }
            }

            @Override
            public void close() {
                // release the producer when the sink shuts down
                if (producer != null) {
                    producer.close();
                }
            }
        });
        // print the merged stream to the console
        unionStream.print();
        // submit the job
        environment.execute("union DataStream");
    }
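
    As a side note, instead of wrapping a raw KafkaProducer in a RichSinkFunction as above, the flink-connector-kafka dependency from the pom also ships FlinkKafkaProducer plus KafkaSerializationSchema, which let Flink manage the producer lifecycle and delivery guarantees for you. A minimal sketch of the same random-key sink in that style (this is an alternative I'm suggesting, not what the original code uses; it assumes imports of FlinkKafkaProducer, KafkaSerializationSchema, and java.nio.charset.StandardCharsets):

        // alternative sink: let Flink own the Kafka producer
        KafkaSerializationSchema<String> schema = new KafkaSerializationSchema<String>() {
            private static final long serialVersionUID = 1L;

            @Override
            public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
                // same random key in [0, 5] as the RichSinkFunction above
                String key = String.format("%d", (int) (Math.random() * 6));
                return new ProducerRecord<>(outTopic,
                        key.getBytes(StandardCharsets.UTF_8),
                        element.getBytes(StandardCharsets.UTF_8));
            }
        };

        FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(
                outTopic,                                     // default topic
                schema,                                       // builds the ProducerRecord (key + value)
                props,                                        // reuse the producer properties from above
                FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);   // or EXACTLY_ONCE with Kafka transactions

        unionStream.addSink(kafkaSink);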
    
  4. The two data streams have different formats, so the operators need to transform them into one common format through some business rules before the union. (A sketch of the FisResult POJO used in this conversion follows after the code.)

    /**
     * @Author: Hu.Yue
     * @Title: ConversionBoard
     * @Description: board-pass
     * @param environment
     * @param consumer
     * @return DataStream<String>
     * @throws Exception
     */
    public static DataStream<String> ConversionBoard(StreamExecutionEnvironment environment, FlinkKafkaConsumer<String> consumer) throws Exception {
        DataStream<String> dataStream = environment.addSource(consumer)
            // wrap each record so the stream carries a key we can group by and a count of 1 for aggregation
            .flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {

                private static final long serialVersionUID = 1L;

                public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {

                    ObjectMapper oMapper = new ObjectMapper();
                    JsonNode node = oMapper.readTree(value);
                    // concatenate four of the JSON fields as the key; every record carries a trailing 1
                    out.collect(new Tuple3<String, String, Integer>(
                            node.get("xxx").asText() + node.get("yyy").asText() + node.get("zzz").asText() + node.get("bbb").asText(),
                            value,
                            1));
                }
            })
            // group by the concatenated key
            .keyBy(0)
            // a session window gives the whole group time to arrive; 3 seconds is only for testing, use a longer gap in production
            .window(ProcessingTimeSessionWindows.withGap(Time.seconds(3)))
            // reduce records with the same key: sum the counts and emit a single record
            .reduce(new ReduceFunction<Tuple3<String, String, Integer>>() {

                private static final long serialVersionUID = 1L;

                public Tuple3<String, String, Integer> reduce(Tuple3<String, String, Integer> value1,
                        Tuple3<String, String, Integer> value2) throws Exception {
                    return new Tuple3<String, String, Integer>(value2.f0, value2.f1, value2.f2 + value1.f2);
                }
            })
            // filter out records that do not match our expectations
            .filter(new FilterFunction<Tuple3<String, String, Integer>>() {

                private static final long serialVersionUID = 1L;

                public boolean filter(Tuple3<String, String, Integer> value) throws Exception {
                    ObjectMapper oMapper = new ObjectMapper();
                    JsonNode node = oMapper.readTree(value.f1);
                    if ((node.get("boardSN").asText().length() == 10) &&
                            (!node.get("idStation").asText().startsWith("L")) &&
                            ("\"P \"".equals(node.get("status").asText().trim()) || "\"Y \"".equals(node.get("status").asText().trim())) &&
                            (node.get("modifierDate").asText() != null) &&
                            (node.get("bBoardsn").asText() == null || "".equals(node.get("bBoardsn").asText()) || !node.get("bBoardsn").asText().equals(node.get("boardSN").asText()))) {
                        return true;
                    }
                    return false;
                }
            })
            // transform what is left into the target structure
            .flatMap(new FlatMapFunction<Tuple3<String, String, Integer>, String>() {

                private static final long serialVersionUID = 1L;

                @Override
                public void flatMap(Tuple3<String, String, Integer> value, Collector<String> out) throws Exception {
                    ObjectMapper mapper = new ObjectMapper();
                    JsonNode node = mapper.readTree(value.f1);

                    String mcbSno = node.get("boardSN").asText().length() > 32 ? node.get("boardSN").asText().toUpperCase().substring(0, 32) : node.get("boardSN").asText();
                    String fdate = node.get("fdate").asText();
                    String fixNo = node.get("idStation").asText().toUpperCase().substring(0, 4);
                    String cModel = node.get("cModel").asText();
                    String badgeNo = node.get("modifier").asText().length() > 7 ? node.get("modifier").asText().substring(0, 7) : node.get("modifier").asText();
                    String cycleTime = node.get("cycleTime").asText();
                    Integer unionQty = 1;
                    String wc = "";
                    String pdLine = fixNo.substring(0, 3);
                    String pcbSide = node.get("cModel").asText().toUpperCase().trim().substring(
                            node.get("cModel").asText().length() - 1);
                    // the count aggregated per key in the reduce step (value.f1 carries the original JSON)
                    unionQty = value.f2;
                    switch (pcbSide) {
                    case "1":
                        wc = "05";
                        break;
                    case "2":
                        wc = "0B";
                        break;
                    case "3":
                        wc = "0C";
                        break;
                    case "4":
                        wc = "0D";
                        break;
                    case "5":
                        wc = "PE";
                        break;
                    default:
                        break;
                    }

                    if (wc != null && !"".equals(wc)) {
                        // convert into the FIS-compliant structure
                        FisResult fis = new FisResult(
                                mcbSno + "_" + wc + "_" + fdate,
                                mcbSno,
                                wc,
                                1,
                                pdLine,
                                "",
                                "",
                                badgeNo,
                                unionQty,
                                "57",
                                fdate,
                                fixNo,
                                cycleTime,
                                cModel,
                                "f3");
                        // serialize to a JSON string and emit it
                        String result = JSON.toJSONString(fis);
                        out.collect(result);
                    }
                }
            });
        // remember to return the stream
        return dataStream;
    }
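
    The FisResult class used above is just a POJO mirroring the target FIS record; the original post does not show it. A minimal sketch that matches the 15-argument constructor call (the field names are my own illustrative guesses, only the constructor order is taken from the code above; the fields are public so that fastjson's JSON.toJSONString can serialize them without getters):

        import java.io.Serializable;

        // Hypothetical shape of FisResult; only the constructor order matches ConversionBoard.
        public class FisResult implements Serializable {
            private static final long serialVersionUID = 1L;

            public String id;        // mcbSno + "_" + wc + "_" + fdate
            public String mcbSno;
            public String wc;
            public int flag;         // the literal 1 in the call above
            public String pdLine;
            public String attr1;
            public String attr2;
            public String badgeNo;
            public int unionQty;     // the count aggregated in the reduce step
            public String siteCode;  // the literal "57" in the call above
            public String fdate;
            public String fixNo;
            public String cycleTime;
            public String cModel;
            public String source;    // the literal "f3" in the call above

            public FisResult(String id, String mcbSno, String wc, int flag, String pdLine,
                             String attr1, String attr2, String badgeNo, int unionQty,
                             String siteCode, String fdate, String fixNo, String cycleTime,
                             String cModel, String source) {
                this.id = id;
                this.mcbSno = mcbSno;
                this.wc = wc;
                this.flag = flag;
                this.pdLine = pdLine;
                this.attr1 = attr1;
                this.attr2 = attr2;
                this.badgeNo = badgeNo;
                this.unionQty = unionQty;
                this.siteCode = siteCode;
                this.fdate = fdate;
                this.fixNo = fixNo;
                this.cycleTime = cycleTime;
                this.cModel = cModel;
                this.source = source;
            }
        }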
    
  5. The conversion method for the second data stream is almost identical to the one above; only the analysis logic differs, so for brevity I won't walk through it here (a bare structural skeleton is sketched below for orientation). The important parts are already commented, so adapt them to your own needs, and feel free to discuss in the comments.
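
    For orientation only, here is the bare shape such a method could follow. The key field, window gap, filter rules, and conversion for the second stream are placeholders, not the author's actual logic:

        public static DataStream<String> ConversionFovComp(StreamExecutionEnvironment environment,
                FlinkKafkaConsumer<String> consumer) throws Exception {
            return environment.addSource(consumer)
                    // key each JSON record and attach a count of 1 (same pattern as ConversionBoard)
                    .flatMap(new FlatMapFunction<String, Tuple3<String, String, Integer>>() {
                        private static final long serialVersionUID = 1L;

                        public void flatMap(String value, Collector<Tuple3<String, String, Integer>> out) throws Exception {
                            ObjectMapper mapper = new ObjectMapper();
                            JsonNode node = mapper.readTree(value);
                            // placeholder key: use whatever fields identify a group in stream 2
                            out.collect(new Tuple3<String, String, Integer>(node.get("someKeyField").asText(), value, 1));
                        }
                    })
                    .keyBy(0)
                    .window(ProcessingTimeSessionWindows.withGap(Time.seconds(3)))
                    .reduce(new ReduceFunction<Tuple3<String, String, Integer>>() {
                        private static final long serialVersionUID = 1L;

                        public Tuple3<String, String, Integer> reduce(Tuple3<String, String, Integer> v1,
                                Tuple3<String, String, Integer> v2) throws Exception {
                            return new Tuple3<String, String, Integer>(v2.f0, v2.f1, v1.f2 + v2.f2);
                        }
                    })
                    // stream-2-specific filtering and conversion to the common FisResult JSON goes here
                    .flatMap(new FlatMapFunction<Tuple3<String, String, Integer>, String>() {
                        private static final long serialVersionUID = 1L;

                        @Override
                        public void flatMap(Tuple3<String, String, Integer> value, Collector<String> out) throws Exception {
                            // placeholder: build a FisResult from value.f1 and value.f2 and emit it as JSON
                            out.collect(value.f1);
                        }
                    });
        }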
