[Flink Study Notes] [4] DataStream API - Source Operators

Table of Contents

  • 1. Program Structure
  • 2. Execution Environment
    • ①getExecutionEnvironment
    • ②createLocalEnvironment
    • ③createRemoteEnvironment
    • ④ Execution Modes
      • Batch mode
      • Streaming mode
  • 3. Source (source operators)
    • 3.1 Preparation: the POJO type
    • 3.2 Reading a bounded stream
    • 3.3 Reading from Kafka
    • 3.4 Custom source
    • 3.5 Custom parallel source
  • 4. Data Types Supported by Flink
    • Flink's type system: TypeInformation

1. Program Structure

  1. Execution environment
  2. Source (data input)
  3. Transformation
  4. Sink (data output)
  5. Trigger program execution
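
These five steps form the skeleton of every DataStream program. A minimal sketch of that skeleton (the data and variable names here are illustrative only, written inside main with the usual imports):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); // 1. environment
DataStreamSource<String> source = env.fromElements("hello", "flink");                  // 2. source
SingleOutputStreamOperator<String> upper = source.map(String::toUpperCase);            // 3. transformation
upper.print();                                                                         // 4. sink
env.execute();                                                                         // 5. trigger execution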

2. Execution Environment

Since version 1.12, Flink unifies stream and batch processing under one API.

①getExecutionEnvironment

Automatically determines whether to create a local or a remote environment: it returns a local environment when the program runs standalone (e.g. in an IDE) and a cluster environment when the program is submitted to a cluster.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

②createLocalEnvironment

StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

③createRemoteEnvironment

StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment(
	"host", // 主机名
	1234, // port
	"path/to/jarFile.jar" // jobmanager 的jar包
);

Returns a cluster execution environment; you must specify the JobManager's host and port and the JAR file(s) to ship.

④ Execution Modes

Batch mode

① Command line

bin/flink run -Dexecution.runtime-mode=BATCH ……

② In code

StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
// hardcoding the mode reduces flexibility; prefer the command-line option

③ When BATCH mode applies: bounded data (finite streams).

Streaming mode
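
STREAMING is the default runtime mode, so no extra configuration is needed for unbounded data. A minimal sketch of setting it explicitly anyway (shown only for symmetry with BATCH):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// STREAMING is already the default mode
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);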

3. Source (source operators)

The source operator defines the concrete input end of the whole program.

3.1 Preparation: the POJO type

Define an Event class with the following characteristics:

  • has a no-argument constructor
  • the class is public
  • all fields are public
  • all field types are serializable

This makes it easy for Flink to parse and serialize the type.
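
The Event class itself is not listed in these notes; a minimal sketch consistent with the constructor calls used in the examples below (user, url, timestamp) could look like this:

import java.sql.Timestamp;

public class Event {
    public String user;
    public String url;
    public Long timestamp;

    // public no-arg constructor, required for Flink to treat this as a POJO
    public Event() {
    }

    public Event(String user, String url, Long timestamp) {
        this.user = user;
        this.url = url;
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "Event{user='" + user + "', url='" + url + "', timestamp=" + new Timestamp(timestamp) + "}";
    }
}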

3.2 Reading a bounded stream

package com.shinho.chapter05_source;

import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.ArrayList;

public class SourceTest {
    public static void main(String[] args) throws Exception {
        // create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // 1. read from a file (bounded input)
        DataStreamSource<String> stream1 = env.readTextFile("input/clicks.txt");

        // 2. read from a collection
        ArrayList<Integer> nums = new ArrayList<>();
        nums.add(2);
        nums.add(5);
        DataStreamSource<Integer> ds = env.fromCollection(nums);

        ArrayList<Event> events = new ArrayList<>();
        events.add(new Event("mary","/home",1000L));
        events.add(new Event("bob","/prod",2000L));
        DataStreamSource<Event> stream2 = env.fromCollection(events);

        // 3. read from elements
        DataStreamSource<Event> ele_ds = env.fromElements(new Event("bob", "/prod", 2000L));


        // 4. socket text stream (genuinely streaming, but unstable; good for testing)
        DataStreamSource<String> stream4 = env.socketTextStream("node00", 7777);


//        stream1.print("txt print");
//        ds.print("collection print");
//        stream2.print("event print");
//        ele_ds.print("ele print");
        stream4.print("socket stream");

        env.execute();
    }
}

3.3 Reading from Kafka

Kafka is a distributed message queue: a high-throughput, easily scalable messaging system whose delivery model matches stream processing exactly, making Flink and Kafka a natural pair:
① Kafka: data collection and transport
② Flink: analysis and computation

  • Kafka is not a built-in source; it is added as a source operator via the Kafka connector

Dependency

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.12</artifactId>
    <version>1.13.6</version>
</dependency>

Kafka installation:
Archives: https://archive.apache.org/dist/
Install guide: https://www.cnblogs.com/ding2016/p/8282907.html

Start the broker

[root@node00 bin]# kafka-server-start.sh -daemon ../config/server.properties 
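
If topic auto-creation is disabled on the broker, create the clicks topic first. Note this is an optional step and the exact flags depend on your Kafka version (older releases use --zookeeper node00:2181 instead of --bootstrap-server):

[root@node00 bin]# kafka-topics.sh --create --bootstrap-server node00:9092 --replication-factor 1 --partitions 1 --topic clicks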

Producer

[root@node00 bin]# kafka-console-producer.sh --broker-list node00:9092 --topic clicks
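
To check that messages are flowing before wiring up Flink, you can optionally tail the topic with the console consumer in another terminal:

[root@node00 bin]# kafka-console-consumer.sh --bootstrap-server node00:9092 --topic clicks --from-beginning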

Implementing the consumer in code

        // 5. read from Kafka
        // requires: java.util.Properties, org.apache.flink.api.common.serialization.SimpleStringSchema,
        //           org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "node00:9092");
        properties.setProperty("group.id", "consumer.group");
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.setProperty("auto.offset.reset", "latest");

        DataStreamSource<String> kafka_stream = env.addSource(
                new FlinkKafkaConsumer<String>("clicks", new SimpleStringSchema(), properties));

        kafka_stream.print("kafka");

        env.execute();


3.4 Custom source

The SourceFunction implementation:

package com.shinho.chapter05_source;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Calendar;
import java.util.Random;

public class ClickSource implements SourceFunction<Event> {
    // flag controlling the generator loop; flipped by cancel()
    private Boolean running = true;

    @Override
    public void run(SourceContext<Event> sourceContext) throws Exception {
        Random random = new Random();
        String[] users = {"gyz", "xxz", "ztk"};
        String[] urls = {"./home", "./prod", "./prod?id=100"};

        // emit one random click event per second until cancelled
        while (running) {
            String user = users[random.nextInt(users.length)];
            String url = urls[random.nextInt(urls.length)];
            Long timestamp = Calendar.getInstance().getTimeInMillis();
            sourceContext.collect(new Event(user, url, timestamp));
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

The main method

public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStreamSource<Event> ds = env.addSource(new ClickSource());

        ds.print("CLICK RESOUCE");

        env.execute();
    }


3.5 Custom parallel source

For a simple SourceFunction the parallelism can only be 1; setting a larger parallelism on such a source throws an exception at runtime. To run a source in parallel, implement ParallelSourceFunction instead, and each parallel subtask runs its own copy of the source:

DataStreamSource<Integer> ds = env.addSource(new ParallelCustomSource()).setParallelism(2);

public static class ParallelCustomSource implements ParallelSourceFunction<Integer> {
        private Boolean running = true;
        private Random random = new Random();

        @Override
        public void run(SourceContext<Integer> sourceContext) throws Exception {
            // each parallel instance emits random integers until cancelled
            while (running) {
                sourceContext.collect(random.nextInt());
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

4. Data Types Supported by Flink

Flink's type system: TypeInformation

  1. Basic types: Java primitive types and their wrapper classes, plus void, String, Date, …

  2. Arrays: primitive arrays and object arrays

  3. Composite types: tuples (the most flexible), Scala case classes, the row type (Row), and POJOs (Flink's own convention, similar to the Java-bean pattern)

A POJO must satisfy:

  • has a no-argument constructor
  • the class is public
  • all fields are public
  • all field types are serializable
  4. Auxiliary types: Option, Either, List, Map
  5. Generic types (GenericType)
    Because of type erasure, Flink only sees the outer type (e.g. Tuple2) and cannot recover the nested field types on its own; only by explicitly telling the system the return type can it parse the data completely. Data types may be nested.
    Use the .returns method:

FlatMapOperator<String, Tuple2<String, Long>> wordAndOne = dataSource.flatMap(
        (String line, Collector<Tuple2<String, Long>> out) -> {
            String[] words = line.split(" ");
            // convert each word into a (word, 1) two-tuple
            for (String word : words) {
                out.collect(Tuple2.of(word, 1L));
            }
        }).returns(Types.TUPLE(Types.STRING, Types.LONG));
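
The same return type can equally be supplied with a TypeHint (from org.apache.flink.api.common.typeinfo.TypeHint), whose anonymous subclass preserves the generic parameters at runtime:

// alternative to Types.TUPLE(...): an anonymous TypeHint subclass
.returns(new TypeHint<Tuple2<String, Long>>() {})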

