Flink Stream Processing

4.1 State

4.1.1 State Overview

Apache Flink® — Stateful Computations over Data Streams

Recall the word-count example:

java
/**
 * Word count
 */
public class WordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> data = env.socketTextStream("localhost", 8888);
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = data.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] fields = line.split(",");
                for (String word : fields) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }
        }).keyBy(0)
          .sum(1);

        result.print();

        env.execute("WordCount");
    }
}

Input:

java
hadoop,hadoop
hadoop
hive,hadoop

Output:

java
4> (hadoop,1)
4> (hadoop,2)
4> (hadoop,3)
1> (hive,1)
4> (hadoop,4)

We can see that the word counts accumulate over time. Without state management there would be no accumulation, which is why Flink has the concept of state.

![State](assets/State.png)

State generally refers to the state of a specific task/operator. State can be recorded and restored after a failure. Flink has two basic kinds of state, Keyed State and Operator State, and both can exist in two forms: raw state and managed state.
Managed state: state managed by the Flink framework; this is what we normally use.
Raw state: the user manages the concrete data structure of the state; during checkpoints the framework reads and writes the state content as byte[] and knows nothing about its internal structure. Managed state is recommended for state on a DataStream; raw state is only needed when implementing a custom operator. Since it is rarely used in practice, we will not cover it further.

4.1.2 State Types
Operator State

![Operator State](assets/Operator State.png)

  1. Operator state is task-level state; put simply, each parallel task keeps its own state.

  2. For example, each partition (task) of the Kafka connector source records the topic, partition, and offset it has consumed.

  3. The commonly used managed form of operator state is:

    ListState (there is also BroadcastState for broadcast state)

Keyed State

![Keyed State](assets/Keyed State-1573781173705.png)

  1. Keyed state records a separate state for each key.
  2. There are six types of managed keyed state (each is registered through a matching descriptor via the RuntimeContext; see the sketch below):
    1. ValueState
    2. ListState
    3. MapState
    4. ReducingState
    5. AggregatingState
    6. FoldingState
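
As a minimal sketch (the state names here are illustrative, not taken from the examples below), each keyed state type is obtained inside a RichFunction, typically in open(), from the RuntimeContext via the corresponding descriptor:

```java
// assumes this code runs inside a RichFunction, e.g. in open()
ValueState<Long> valueState = getRuntimeContext().getState(
        new ValueStateDescriptor<>("value-state", Long.class));
ListState<Long> listState = getRuntimeContext().getListState(
        new ListStateDescriptor<>("list-state", Long.class));
MapState<String, Long> mapState = getRuntimeContext().getMapState(
        new MapStateDescriptor<>("map-state", String.class, Long.class));
ReducingState<Long> reducingState = getRuntimeContext().getReducingState(
        new ReducingStateDescriptor<>("reducing-state", (a, b) -> a + b, Long.class));
// AggregatingState is obtained the same way via getAggregatingState(...) with an AggregateFunction
```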
4.1.3 Keyed State Examples
ValueState

java
/**
 * ValueState: keeps one value per key
 *  - value()  : read the state value
 *  - update() : update the state value
 *  - clear()  : clear the state
 */
public class CountWindowAverageWithValueState
        extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Double>> {
// keeps, for each key, the number of occurrences and the running sum of its values
// managed keyed state
// 1. ValueState stores a single state value for one key
private ValueState<Tuple2<Long, Long>> countAndSum;

@Override
public void open(Configuration parameters) throws Exception {
    // register the state
    ValueStateDescriptor<Tuple2<Long, Long>> descriptor =
            new ValueStateDescriptor<>(
                    "average",  // name of the state
                    Types.TUPLE(Types.LONG, Types.LONG)); // type of the data stored in the state
    countAndSum = getRuntimeContext().getState(descriptor);
}

@Override
public void flatMap(Tuple2<Long, Long> element,
                    Collector<Tuple2<Long, Double>> out) throws Exception {
    // read the current state for this key
    Tuple2<Long, Long> currentState = countAndSum.value();

    // if the state has not been initialized yet, initialize it
    if (currentState == null) {
        currentState = Tuple2.of(0L, 0L);
    }

    // update the element count
    currentState.f0 += 1;

    // update the running sum
    currentState.f1 += element.f1;

    // write the state back
    countAndSum.update(currentState);

    // once this key has appeared 3 times, compute the average and emit it
    if (currentState.f0 >= 3) {
        double avg = (double) currentState.f1 / currentState.f0;
        // emit the key and its average
        out.collect(Tuple2.of(element.f0, avg));
        // clear the state
        countAndSum.clear();
    }
}

}

/**
 * Requirement: once three or more elements with the same key have been received,
 * compute the average of their values,
 * i.e. the average value of every 3 elements per key in the keyed stream.
 */
public class TestKeyedStateMain {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<Tuple2<Long, Long>> dataStreamSource =
                env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L),
                        Tuple2.of(2L, 4L), Tuple2.of(2L, 2L), Tuple2.of(2L, 5L));

        // expected output:
        // (1,5.0)
        // (2,3.6666666666666665)
        dataStreamSource
                .keyBy(0)
                .flatMap(new CountWindowAverageWithValueState())
                .print();

        env.execute("TestStatefulApi");
    }
}

Output:

java
3> (1,5.0)
4> (2,3.6666666666666665)

ListState

java
/**
 * ListState: keeps a list of values per key
 *  - get()            : read the state
 *  - add() / addAll() : add data to the state
 *  - clear()          : clear the state
 */
public class CountWindowAverageWithListState
        extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Double>> {
// managed keyed state
// 1. ListState stores all elements seen so far for one key
private ListState<Tuple2<Long, Long>> elementsByKey;

@Override
public void open(Configuration parameters) throws Exception {
    // register the state
    ListStateDescriptor<Tuple2<Long, Long>> descriptor =
            new ListStateDescriptor<>(
                    "average",  // name of the state
                    Types.TUPLE(Types.LONG, Types.LONG)); // type of the data stored in the state
    elementsByKey = getRuntimeContext().getListState(descriptor);
}

@Override
public void flatMap(Tuple2<Long, Long> element,
                    Collector<Tuple2<Long, Double>> out) throws Exception {
    // read the current state for this key
    Iterable<Tuple2<Long, Long>> currentState = elementsByKey.get();

    // if the state has not been initialized yet, initialize it
    if (currentState == null) {
        elementsByKey.addAll(Collections.emptyList());
    }

    // add the current element to the state
    elementsByKey.add(element);

    // once this key has appeared 3 times, compute the average and emit it
    List<Tuple2<Long, Long>> allElements = Lists.newArrayList(elementsByKey.get());
    if (allElements.size() >= 3) {
        long count = 0;
        long sum = 0;
        for (Tuple2<Long, Long> ele : allElements) {
            count++;
            sum += ele.f1;
        }
        double avg = (double) sum / count;
        out.collect(Tuple2.of(element.f0, avg));

        // clear the state
        elementsByKey.clear();
    }
}

}

/**
 * Requirement: once three or more elements with the same key have been received,
 * compute the average of their values,
 * i.e. the average value of every 3 elements per key in the keyed stream.
 */
    public class TestKeyedStateMain {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStreamSource<Tuple2<Long, Long>> dataStreamSource =
            env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L),
                    Tuple2.of(2L, 4L), Tuple2.of(2L, 2L), Tuple2.of(2L, 5L));
    
    // 输出:
    //(1,5.0)
    //(2,3.6666666666666665)
    dataStreamSource
            .keyBy(0)
            .flatMap(new CountWindowAverageWithListState())
            .print();
    
    env.execute("TestStatefulApi");
    

    }
    }

Output:

java
3> (1,5.0)
4> (2,3.6666666666666665)

MapState

java
/**
 * MapState: keeps a Map per key
 *  - put()    : put a key/value pair into the state
 *  - values() : get all values in the MapState
 *  - clear()  : clear the state
 */
public class CountWindowAverageWithMapState
        extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Double>> {
// managed keyed state
// 1. MapState: the map key is a unique value, the map value is the incoming value for this stream key
private MapState<String, Long> mapState;

@Override
public void open(Configuration parameters) throws Exception {
    // register the state
    MapStateDescriptor<String, Long> descriptor =
            new MapStateDescriptor<>(
                    "average",  // name of the state
                    String.class, Long.class); // types of the data stored in the state
    mapState = getRuntimeContext().getMapState(descriptor);
}

@Override
public void flatMap(Tuple2<Long, Long> element,
                    Collector<Tuple2<Long, Double>> out) throws Exception {
    mapState.put(UUID.randomUUID().toString(), element.f1);

    // once this key has appeared 3 times, compute the average and emit it
    List<Long> allElements = Lists.newArrayList(mapState.values());
    if (allElements.size() >= 3) {
        long count = 0;
        long sum = 0;
        for (Long ele : allElements) {
            count++;
            sum += ele;
        }
        double avg = (double) sum / count;
        out.collect(Tuple2.of(element.f0, avg));

        // 清除状态
        mapState.clear();
    }
}

}

/**
 * Requirement: once three or more elements with the same key have been received,
 * compute the average of their values,
 * i.e. the average value of every 3 elements per key in the keyed stream.
 */
    public class TestKeyedStateMain {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStreamSource<Tuple2<Long, Long>> dataStreamSource =
            env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L),
                    Tuple2.of(2L, 4L), Tuple2.of(2L, 2L), Tuple2.of(2L, 5L));
    
    // 输出:
    //(1,5.0)
    //(2,3.6666666666666665)
    dataStreamSource
            .keyBy(0)
            .flatMap(new CountWindowAverageWithMapState())
            .print();
    
    env.execute("TestStatefulApi");
    

    }
    }

Output:

4> (2,3.6666666666666665)
3> (1,5.0)

ReducingState

java
/**
 * ReducingState: keeps one aggregated value per key
 *  - get()   : read the state value
 *  - add()   : add data to the state (it is aggregated immediately)
 *  - clear() : clear the state
 */
public class SumFunction
        extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>> {
// managed keyed state
// keeps the running sum of the values for each key
private ReducingState<Long> sumState;

@Override
public void open(Configuration parameters) throws Exception {
    // register the state
    ReducingStateDescriptor<Long> descriptor =
            new ReducingStateDescriptor<>(
                    "sum",  // name of the state
                    new ReduceFunction<Long>() { // the aggregation function
                        @Override
                        public Long reduce(Long value1, Long value2) throws Exception {
                            return value1 + value2;
                        }
                    }, Long.class); // type of the data stored in the state
    sumState = getRuntimeContext().getReducingState(descriptor);
}

@Override
public void flatMap(Tuple2<Long, Long> element,
                    Collector<Tuple2<Long, Long>> out) throws Exception {
    // add the value to the state; it is summed by the ReduceFunction
    sumState.add(element.f1);

    out.collect(Tuple2.of(element.f0, sumState.get()));
}

}

public class TestKeyedStateMain2 {
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStreamSource<Tuple2<Long, Long>> dataStreamSource =
            env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L),
                    Tuple2.of(2L, 4L), Tuple2.of(2L, 2L), Tuple2.of(2L, 5L));

    // the output is the running sum per key (see the result below)
    dataStreamSource
            .keyBy(0)
            .flatMap(new SumFunction())
            .print();

    env.execute("TestStatefulApi");
}

}

Output:

4> (2,4)
4> (2,6)
4> (2,11)
3> (1,3)
3> (1,8)
3> (1,15)

AggregatingState

java
public class ContainsValueFunction
        extends RichFlatMapFunction<Tuple2<Long, Long>, Tuple2<Long, String>> {

private AggregatingState<Long, String> totalStr;

@Override
public void open(Configuration parameters) throws Exception {
    // register the state
    AggregatingStateDescriptor<Long, String, String> descriptor =
            new AggregatingStateDescriptor<>(
                    "totalStr",  // name of the state
                    new AggregateFunction<Long, String, String>() {
                        @Override
                        public String createAccumulator() {
                            return "Contains:";
                        }

                        @Override
                        public String add(Long value, String accumulator) {
                            if ("Contains:".equals(accumulator)) {
                                return accumulator + value;
                            }
                            return accumulator + " and " + value;
                        }

                        @Override
                        public String getResult(String accumulator) {
                            return accumulator;
                        }

                        @Override
                        public String merge(String a, String b) {
                            return a + " and " + b;
                        }
                    }, String.class); // 状态存储的数据类型
    totalStr = getRuntimeContext().getAggregatingState(descriptor);
}

@Override
public void flatMap(Tuple2<Long, Long> element,
                    Collector<Tuple2<Long, String>> out) throws Exception {
    totalStr.add(element.f1);
    out.collect(Tuple2.of(element.f0, totalStr.get()));
}

}

public class TestKeyedStateMain2 {
public static void main(String[] args) throws Exception{
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    DataStreamSource<Tuple2<Long, Long>> dataStreamSource =
            env.fromElements(Tuple2.of(1L, 3L), Tuple2.of(1L, 5L), Tuple2.of(1L, 7L),
                    Tuple2.of(2L, 4L), Tuple2.of(2L, 2L), Tuple2.of(2L, 5L));


    dataStreamSource
            .keyBy(0)
            .flatMap(new ContainsValueFunction())
            .print();

    env.execute("TestStatefulApi");
}

}

Output:

4> (2,Contains:4)
3> (1,Contains:3)
3> (1,Contains:3 and 5)
3> (1,Contains:3 and 5 and 7)
4> (2,Contains:4 and 2)
4> (2,Contains:4 and 2 and 5)

4.1.4 Operator State Example
ListState

java
public class CustomSink
        implements SinkFunction<Tuple2<String, Integer>>, CheckpointedFunction {

// buffers the incoming records in memory
private List<Tuple2<String, Integer>> bufferElements;
// threshold: how many records to buffer before flushing
private int threshold;
// operator state used to snapshot the in-memory buffer
private ListState<Tuple2<String, Integer>> checkpointState;

public CustomSink(int threshold) {
    this.threshold = threshold;
    this.bufferElements = new ArrayList<>();
}

@Override
public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
    // each record could be written to any external storage system; here we simply buffer it
    bufferElements.add(value);
    if (bufferElements.size() == threshold) {
        // just print the buffered records
        System.out.println("自定义格式:" + bufferElements);
        bufferElements.clear();
    }
}

// copy the in-memory buffer into the operator state on each checkpoint
@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
    checkpointState.clear();
    for (Tuple2<String, Integer> ele : bufferElements) {
        checkpointState.add(ele);
    }
}
// restore the in-memory buffer from the operator state when the job is restarted
@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
    ListStateDescriptor<Tuple2<String, Integer>> descriptor =
            new ListStateDescriptor<>(
                    "buffered-elements",
                    TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));
    // 注册一个 operator state
    checkpointState = context.getOperatorStateStore().getListState(descriptor);

    if (context.isRestored()) {
        for (Tuple2<String, Integer> ele : checkpointState.get()) {
            bufferElements.add(ele);
        }
    }
}

}

/**
 * Requirement: print the buffered results once for every two records
 */
    public class TestOperatorStateMain {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

     DataStreamSource<Tuple2<String, Integer>> dataStreamSource =
             env.fromElements(Tuple2.of("Spark", 3), Tuple2.of("Hadoop", 5), Tuple2.of("Hadoop", 7),
                     Tuple2.of("Spark", 4));
    
     // the custom sink prints the buffered records once every two elements (see the output below)
     dataStreamSource
             .addSink(new CustomSink(2)).setParallelism(1);
    
     env.execute("TestStatefulApi");
    

    }
    }

Output:

自定义格式:[(Spark,3), (Hadoop,5)]
自定义格式:[(Hadoop,7), (Spark,4)]

4.1.5 Keyed State Example: Joining Two Streams

Requirement: combine the records from two streams that share the same order ID and emit them together.

OrderInfo1 data

java
123,拖把,30.0
234,牙膏,20.0
345,被子,114.4
333,杯子,112.2
444,Mac电脑,30000.0

OrderInfo2 data

java
123,2019-11-11 10:11:12,江苏
234,2019-11-11 11:11:13,云南
345,2019-11-11 12:11:14,安徽
333,2019-11-11 13:11:15,北京
444,2019-11-11 14:11:16,深圳

Implementation:

java
public class Constants {
    public static final String ORDER_INFO1_PATH = "D:\\kkb\\flinklesson\\src\\main\\input\\OrderInfo1.txt";
    public static final String ORDER_INFO2_PATH = "D:\\kkb\\flinklesson\\src\\main\\input\\OrderInfo2.txt";
}

java
public class OrderInfo1 {
// order ID
private Long orderId;
// product name
private String productName;
// price
private Double price;

public OrderInfo1(){

}

public OrderInfo1(Long orderId,String productName,Double price){
this.orderId=orderId;
this.productName=productName;
this.price=price;
}

@Override
public String toString() {
    return "OrderInfo1{" +
            "orderId=" + orderId +
            ", productName='" + productName + '\'' +
            ", price=" + price +
            '}';
}

public Long getOrderId() {
    return orderId;
}

public void setOrderId(Long orderId) {
    this.orderId = orderId;
}

public String getProductName() {
    return productName;
}

public void setProductName(String productName) {
    this.productName = productName;
}

public Double getPrice() {
    return price;
}

public void setPrice(Double price) {
    this.price = price;
}

public static OrderInfo1 string2OrderInfo1(String line){
    OrderInfo1 orderInfo1 = new OrderInfo1();
    if(line != null && line.length() > 0){
       String[] fields = line.split(",");
        orderInfo1.setOrderId(Long.parseLong(fields[0]));
        orderInfo1.setProductName(fields[1]);
        orderInfo1.setPrice(Double.parseDouble(fields[2]));
   }
   return orderInfo1;
}

}

java
public class OrderInfo2 {
// order ID
private Long orderId;
// order time
private String orderDate;
// order address
private String address;

public OrderInfo2(){

}
public OrderInfo2(Long orderId,String orderDate,String address){
    this.orderId = orderId;
    this.orderDate = orderDate;
    this.address = address;
}

@Override
public String toString() {
    return "OrderInfo2{" +
            "orderId=" + orderId +
            ", orderDate='" + orderDate + '\'' +
            ", address='" + address + '\'' +
            '}';
}

public Long getOrderId() {
    return orderId;
}

public void setOrderId(Long orderId) {
    this.orderId = orderId;
}

public String getOrderDate() {
    return orderDate;
}

public void setOrderDate(String orderDate) {
    this.orderDate = orderDate;
}

public String getAddress() {
    return address;
}

public void setAddress(String address) {
    this.address = address;
}


public static OrderInfo2 string2OrderInfo2(String line){
    OrderInfo2 orderInfo2 = new OrderInfo2();
    if(line != null && line.length() > 0){
        String[] fields = line.split(",");
        orderInfo2.setOrderId(Long.parseLong(fields[0]));
        orderInfo2.setOrderDate(fields[1]);
        orderInfo2.setAddress(fields[2]);
    }

    return orderInfo2;
}

}

java
/**
 * Custom source that replays a text file line by line
 */
public class FileSource implements SourceFunction<String> {
    // path of the input file
    public String filePath;

    public FileSource(String filePath) {
        this.filePath = filePath;
    }

    private InputStream inputStream;
    private BufferedReader reader;

    private Random random = new Random();

    @Override
    public void run(SourceContext<String> ctx) throws Exception {

        reader = new BufferedReader(new InputStreamReader(new FileInputStream(filePath)));
        String line = null;
        while ((line = reader.readLine()) != null) {
            // simulate a delay between records
            TimeUnit.MILLISECONDS.sleep(random.nextInt(500));
            // emit the record
            ctx.collect(line);
        }
     if(reader != null){
         reader.close();
     }
     if(inputStream != null){
         inputStream.close();
     }
    

    }

    @Override
    public void cancel() {
    try{
    if(reader != null){
    reader.close();
    }
    if(inputStream != null){
    inputStream.close();
    }
    }catch (Exception e){

    }
    }
    }

java
public class OrderStream {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> info1 = env.addSource(new FileSource(Constants.ORDER_INFO1_PATH));
DataStreamSource<String> info2 = env.addSource(new FileSource(Constants.ORDER_INFO2_PATH));

    KeyedStream<OrderInfo1, Long> orderInfo1Stream = info1.map(line -> string2OrderInfo1(line))
            .keyBy(orderInfo1 -> orderInfo1.getOrderId());

    KeyedStream<OrderInfo2, Long> orderInfo2Stream = info2.map(line -> string2OrderInfo2(line))
            .keyBy(orderInfo2 -> orderInfo2.getOrderId());

    orderInfo1Stream.connect(orderInfo2Stream)
            .flatMap(new EnrichmentFunction())
            .print();

    env.execute("OrderStream");

}

    /**
     * IN1: input type of the first stream
     * IN2: input type of the second stream
     * OUT: output type
     */
    public static class EnrichmentFunction extends
            RichCoFlatMapFunction<OrderInfo1, OrderInfo2, Tuple2<OrderInfo1, OrderInfo2>> {
        // keyed state holding the first stream's record for the current key
        private ValueState<OrderInfo1> orderInfo1State;
        // keyed state holding the second stream's record for the current key
        private ValueState<OrderInfo2> orderInfo2State;

    @Override
    public void open(Configuration parameters) {
        orderInfo1State = getRuntimeContext()
                .getState(new ValueStateDescriptor<>("info1", OrderInfo1.class));
        orderInfo2State = getRuntimeContext()
                .getState(new ValueStateDescriptor<>("info2", OrderInfo2.class));
    }

    @Override
    public void flatMap1(OrderInfo1 orderInfo1, Collector<Tuple2<OrderInfo1, OrderInfo2>> out) throws Exception {
        OrderInfo2 value2 = orderInfo2State.value();
        if(value2 != null){
            orderInfo2State.clear();
            out.collect(Tuple2.of(orderInfo1,value2));
        }else{
            orderInfo1State.update(orderInfo1);
        }

    }

    @Override
    public void flatMap2(OrderInfo2 orderInfo2, Collector<Tuple2<OrderInfo1, OrderInfo2>> out) throws Exception {
        OrderInfo1 value1 = orderInfo1State.value();
        if(value1 != null){
            orderInfo1State.clear();
            out.collect(Tuple2.of(value1,orderInfo2));
        }else{
            orderInfo2State.update(orderInfo2);
        }

    }
}

}

4.2 State backend

4.2.1 Overview

Flink supports the following state backends:

  • MemoryStateBackend
  • FsStateBackend
  • RocksDBStateBackend
4.2.2 MemoryStateBackend

![MemoryStateBackend](assets/MemoryStateBackend.png)

By default, state is stored in the TaskManager's heap memory, and a checkpoint saves the state into the JobManager's heap memory.
Drawbacks:

only small amounts of state can be kept

state may be lost

Advantages:

very convenient for development and testing

4.2.3 FSStateBackend

![FSStateBackend](assets/FSStateBackend.png)

State is kept in the TaskManager's heap memory; during a checkpoint the state is written to a configured file system (HDFS, etc.).

Drawbacks:
state size is limited by the TaskManager's memory
Advantages:
fast state access
state is not lost after a failure
Used for: production, including jobs with a fairly large amount of state

4.2.4 RocksDBStateBackend

![RocksDBStateBackend](assets/RocksDBStateBackend.png)

State is stored in RocksDB (an embedded key-value store) and ends up in local files;
during a checkpoint the state is written to a configured file system (HDFS, etc.).
Drawbacks:
somewhat slower state access
Advantages:
can hold a very large amount of state
state is not lost after a failure
Used for: production, including jobs with a very large amount of state
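
As a minimal sketch (the checkpoint path is the same example path used later in this section), the RocksDB backend can be enabled in code with incremental checkpoints; it requires the flink-statebackend-rocksdb dependency on the classpath:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// the second argument enables incremental checkpoints; the constructor may throw IOException
env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:9000/flink/checkpoints", true));
```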

4.2.5 Configuring the State Backend

(1) Per-job configuration

Set the backend in the job code:
env.setStateBackend(new FsStateBackend("hdfs://namenode:9000/flink/checkpoints"));
or new MemoryStateBackend()
or new RocksDBStateBackend(filebackend, true); (requires an additional third-party dependency)

(2) Cluster-wide configuration

Edit flink-conf.yaml:
state.backend: filesystem
state.checkpoints.dir: hdfs://namenode:9000/flink/checkpoints
Note: state.backend can take the following values: jobmanager (MemoryStateBackend), filesystem (FsStateBackend), rocksdb (RocksDBStateBackend)

4.3 checkpoint

4.3.1 Checkpoint Overview

(1) To make state fault tolerant, Flink needs to checkpoint it.
(2) Checkpointing is the core of Flink's fault-tolerance mechanism. Based on the configuration, it periodically takes snapshots of the state of each operator/task in the stream and persists them. If the Flink program crashes unexpectedly, it can be restarted from a selected snapshot, correcting any data inconsistencies caused by the failure.
(3) Prerequisites for Flink's checkpoint mechanism to interact with persistent storage for streams and state:
a persistent source that can replay events for a certain amount of time, typically a durable message queue (e.g. Apache Kafka, RabbitMQ) or a file system (e.g. HDFS, S3, GFS);
persistent storage for the state, typically a distributed file system (e.g. HDFS, S3, GFS).

Taking a snapshot

![Taking a snapshot](assets/1569326195474.png)

Restoring from a snapshot

![Restoring from a snapshot](assets/1569326229867.png)

4.3.2 Checkpoint Configuration

Checkpointing is disabled by default and has to be enabled explicitly. Once enabled, the checkpoint mode is either Exactly-once or At-least-once; the default is Exactly-once, which is the right choice for most applications. At-least-once may be used for applications that need consistently ultra-low latency (a few milliseconds).

Checkpointing is disabled by default and must be enabled first:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms (the checkpoint interval)
env.enableCheckpointing(1000);
// advanced options:
// set the mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// make sure at least 500 ms pass between checkpoints (minimum pause between checkpoints)
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within one minute or are discarded (checkpoint timeout)
env.getCheckpointConfig().setCheckpointTimeout(60000);
// allow only one checkpoint at a time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// keep the checkpoint data when the job is cancelled, so the job can later be restored from it
env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

4.5 Recovering Data

4.5.1 Restart Strategy Overview

Flink supports different restart strategies that control how a job is restarted after a failure. The cluster starts with a default restart strategy, which is used whenever no job-specific strategy is defined. If a restart strategy is passed when the job is submitted, it overrides the cluster default. The default strategy is configured in Flink's flink-conf.yaml; the restart-strategy parameter defines which strategy is used.
Commonly used restart strategies:
(1) Fixed delay
(2) Failure rate
(3) No restart
If checkpointing is not enabled, the no-restart strategy is used.
If checkpointing is enabled but no restart strategy is configured, the fixed-delay strategy is used, with Integer.MAX_VALUE restart attempts by default. The restart strategy can be configured globally in flink-conf.yaml, or set dynamically in the application code, in which case it overrides the global configuration.

4.5.2 Restart Strategies

Fixed delay

Option 1: global configuration in flink-conf.yaml
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
Option 2: set in the application code
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay between attempts
));

Failure rate

Option 1: global configuration in flink-conf.yaml
restart-strategy: failure-rate
restart-strategy.failure-rate.max-failures-per-interval: 3
restart-strategy.failure-rate.failure-rate-interval: 5 min
restart-strategy.failure-rate.delay: 10 s
Option 2: set in the application code
env.setRestartStrategy(RestartStrategies.failureRateRestart(
3, // maximum number of failures per interval
Time.of(5, TimeUnit.MINUTES), // interval over which failures are counted
Time.of(10, TimeUnit.SECONDS) // delay between attempts
));

No restart

Option 1: global configuration in flink-conf.yaml
restart-strategy: none
Option 2: set in the application code
env.setRestartStrategy(RestartStrategies.noRestart());

4.5.3 Retaining Multiple Checkpoints

By default, when checkpointing is enabled, Flink keeps only the most recent successful checkpoint, and a failed job is recovered from that checkpoint. It can be more flexible to retain several checkpoints and pick one of them for recovery, for example to roll the state back to a point four hours ago after discovering that the last four hours of data were processed incorrectly. Flink supports retaining multiple checkpoints; add the following setting to conf/flink-conf.yaml to specify how many checkpoints are retained at most:

state.checkpoints.num-retained: 20

Afterwards the retained checkpoints can be listed in their HDFS storage directory:
hdfs dfs -ls hdfs://namenode:9000/flink/checkpoints
To roll back to a particular checkpoint, simply point the job at that checkpoint's path.

4.5.4 Recovering Data from a Checkpoint

If the Flink job failed abnormally, or data was processed incorrectly during some recent period, the job can be restarted from a specific checkpoint:

bin/flink run -s hdfs://namenode:9000/flink/checkpoints/467e17d2cc343e6c56255d222bae3421/chk-56/_metadata flink-job.jar

Once the job is running normally again, it keeps checkpointing according to its configuration and continues to produce new checkpoint data.

The checkpoint directory can also be configured in the job code itself, so that the externalized checkpoint data lives in a known location and can be used for recovery on the next start even after the code has changed; a sketch follows.
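
A minimal sketch, reusing the HDFS path from the examples above; the externalized checkpoints retained this way can then be passed to bin/flink run -s for recovery:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// keep checkpoint data under a fixed, well-known directory
env.setStateBackend(new FsStateBackend("hdfs://namenode:9000/flink/checkpoints"));
env.enableCheckpointing(5000);
// retain the checkpoint files even when the job is cancelled
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
```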

4.5.5 Savepoints

With savepoints, Flink can upgrade a program and then continue the computation from the point reached before the upgrade, without interrupting the data.
A savepoint is a global, consistent snapshot. It stores the source offsets, operator state, and so on, so the application can resume consumption from any point in the past at which a savepoint was taken.

Checkpoint vs. savepoint

Checkpoint:
triggered periodically by the application itself to save its state; may expire; used internally when the job recovers from a failure.
Savepoint:
triggered manually by the user; is a pointer to a checkpoint-style snapshot; does not expire; used for planned upgrades.
Note: to upgrade smoothly between different versions of a job, and between different Flink versions, it is strongly recommended to assign IDs to operators manually via uid(String). These IDs determine the state scope of each operator. Without manual IDs, Flink generates an ID per operator automatically; the program can only be restored from a savepoint as long as these IDs have not changed, and the auto-generated IDs depend on the program structure and are very sensitive to code changes. Therefore it is strongly recommended to set the IDs manually, as sketched below.
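
A minimal sketch of assigning stable uids to stateful operators (Tokenizer is a hypothetical FlatMapFunction, and the uid strings are only examples):

```java
DataStream<Tuple2<String, Integer>> counts = env
        .socketTextStream("localhost", 8888)
        .flatMap(new Tokenizer()).uid("tokenizer")   // stable ID for this operator's state
        .keyBy(0)
        .sum(1).uid("word-count");                   // stable ID for the aggregation state
```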


Using savepoints

1: Configure the savepoint directory in flink-conf.yaml
Optional, but if it is set, the target directory does not have to be specified on the command line every time a savepoint is triggered for a job:
state.savepoints.dir: hdfs://namenode:9000/flink/savepoints
2: Trigger a savepoint (directly, or while cancelling the job)
bin/flink savepoint jobId [targetDirectory] [-yid yarnAppId] (the -yid parameter is needed in YARN mode)
bin/flink cancel -s [targetDirectory] jobId [-yid yarnAppId] (the -yid parameter is needed in YARN mode)

3: Start a job from a given savepoint
bin/flink run -s savepointPath [runArgs]

4.1 Background Requirement

Requirement: every 5 seconds, count how often each word occurred during the last 10 seconds.

![Window](assets/Window.png)

4.1.1 TimeWindow Implementation

java
/**
 * Every 5 seconds, count the occurrences of each word over the last 10 seconds
 */
public class TimeWindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStream = env.socketTextStream("localhost", 8888);
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = dataStream.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String line, Collector<Tuple2<String, Integer>> out) throws Exception {
                String[] fields = line.split(",");
                for (String word : fields) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }).keyBy(0)
          .timeWindow(Time.seconds(10), Time.seconds(5))
          .sum(1);

        result.print().setParallelism(1);

        env.execute("TimeWindowWordCount");
    }
}

4.1.2 ProcessWindowFunction

java
/**
 * Every 5 seconds, count the occurrences of each word over the last 10 seconds
 */
    public class TimeWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“10.148.15.10”, 8888);
    SingleOutputStreamOperator> result = dataStream.flatMap(new FlatMapFunction>() {
    @Override
    public void flatMap(String line, Collector> out) throws Exception {
    String[] fields = line.split(",");
    for (String word : fields) {
    out.collect(new Tuple2<>(word, 1));
    }
    }
    }).keyBy(0)
    .timeWindow(Time.seconds(10), Time.seconds(5))
    .process(new SumProcessWindowFunction());

     result.print().setParallelism(1);
    
     env.execute("TimeWindowWordCount");
    

    }

    /**
     * IN, OUT, KEY, W
     * IN:  input data type
     * OUT: output data type
     * KEY: key type (with keyBy(int) on a tuple stream, the key is exposed as a Tuple)
     * W:   window type
     */
    public static class SumProcessWindowFunction extends
            ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow> {
        FastDateFormat dataFormat = FastDateFormat.getInstance("HH:mm:ss");

        /**
         * Called when a window fires
         * @param tuple    the key
         * @param context  the operator context
         * @param elements all elements in this window
         * @param out      collector for the output
         */
        @Override
        public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> elements,
                            Collector<Tuple2<String, Integer>> out) {

        System.out.println("当天系统的时间:" + dataFormat.format(System.currentTimeMillis()));

        System.out.println("Window的处理时间:" + dataFormat.format(context.currentProcessingTime()));
        System.out.println("Window的开始时间:" + dataFormat.format(context.window().getStart()));
        System.out.println("Window的结束时间:" + dataFormat.format(context.window().getEnd()));

        int sum = 0;
        for (Tuple2<String, Integer> ele : elements) {
        sum += 1;
        }
        // 输出单词出现的次数
        out.collect(Tuple2.of(tuple.getField(0), sum));

      }
      }
      }

First, type:

hive

Then type hive,hbase

Output:

java
当天系统的时间:15:10:30
Window的处理时间:15:10:30
Window的开始时间:15:10:20
Window的结束时间:15:10:30
(hive,1)
当天系统的时间:15:10:35
Window的处理时间:15:10:35
Window的开始时间:15:10:25
Window的结束时间:15:10:35
当天系统的时间:15:10:35
Window的处理时间:15:10:35
Window的开始时间:15:10:25
Window的结束时间:15:10:35
(hbase,1)
(hive,1)

For a 10-second window sliding every 5 seconds, Flink aligns the windows as follows (the alignment formula is sketched after the list):

java
[00:00:00, 00:00:05) [00:00:05, 00:00:10)
[00:00:10, 00:00:15) [00:00:15, 00:00:20)
[00:00:20, 00:00:25) [00:00:25, 00:00:30)
[00:00:30, 00:00:35) [00:00:35, 00:00:40)
[00:00:40, 00:00:45) [00:00:45, 00:00:50)
[00:00:50, 00:00:55) [00:00:55, 00:01:00)
[00:01:00, 00:01:05) …
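
The alignment above comes from how Flink computes window start timestamps; as a sketch (this mirrors the formula in TimeWindow.getWindowStartWithOffset, with no offset used here):

```java
// start of the last window of size `windowSize` (ms) that contains `timestamp`
long getWindowStart(long timestamp, long offset, long windowSize) {
    return timestamp - (timestamp - offset + windowSize) % windowSize;
}
```

For a sliding window, an element with timestamp t then belongs to every window whose start is a multiple of the slide and lies in (t - windowSize, t].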

4.1.3 Types of Time

For the time attached to stream data, three notions can be distinguished (the snippet below shows how to pick one):
Event Time: the time at which the event was produced, usually described by a timestamp carried inside the event.
Ingestion Time: the time at which the event enters Flink.
Processing Time: the current system time of the machine that is processing the event.
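
As a minimal sketch (in this Flink version processing time is the default), the time semantics are selected on the execution environment:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// pick one of the three notions of time described above
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);        // event time
// env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);  // ingestion time
// env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); // processing time (default)
```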

![1569394563906](assets/1569394563906.png)

Example:
A raw log line:

2018-10-10 10:00:01,134 INFO executor.Executor: Finished task in state 0.0

This record enters Flink at 2018-10-10 20:00:00,102
and reaches the window operator at 2018-10-10 20:00:01,100.

2018-10-10 10:00:01,134 is the event time
2018-10-10 20:00:00,102 is the ingestion time
2018-10-10 20:00:01,100 is the processing time

Think about it:

If we want to count, per minute, the number of error logs for failed API calls, which of these times is meaningful?

4.2 Processing Time Window (In-Order Events)

Requirement: every 5 seconds, count how often each word occurred during the last 10 seconds.

Custom source that simulates: two events are sent back to back at second 13, and one more event is sent at second 16.

![Custom source - TimeWindow](assets/自定义source-TimeWindow.png)

java
/**
 * Every 5 seconds, count the occurrences of each word over the last 10 seconds
 */
    public class TimeWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    DataStreamSource dataStream = env.addSource(new TestSouce());
    SingleOutputStreamOperator> result = dataStream.flatMap(new FlatMapFunction>() {
    @Override
    public void flatMap(String line, Collector> out) throws Exception {
    String[] fields = line.split(",");
    for (String word : fields) {
    out.collect(new Tuple2<>(word, 1));
    }
    }
    }).keyBy(0)
    .timeWindow(Time.seconds(10), Time.seconds(5))
    .process(new SumProcessWindowFunction());

     result.print().setParallelism(1);
    
     env.execute("TimeWindowWordCount");
    

    }

    public static class TestSouce implements SourceFunction<String> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");
        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            // wait until the current time is roughly at a 10-second boundary
            String currTime = String.valueOf(System.currentTimeMillis());
            while (Integer.valueOf(currTime.substring(currTime.length() - 4)) > 100) {
                currTime = String.valueOf(System.currentTimeMillis());
                continue;
            }
            System.out.println("开始发送事件的时间:" + dateFormat.format(System.currentTimeMillis()));
            // send two events at second 13
            TimeUnit.SECONDS.sleep(13);
            ctx.collect("hadoop," + System.currentTimeMillis());
            ctx.collect("hadoop," + System.currentTimeMillis());
            // send one event at second 16
            TimeUnit.SECONDS.sleep(3);
            ctx.collect("hadoop," + System.currentTimeMillis());
            TimeUnit.SECONDS.sleep(300);

     }
    
     @Override
     public void cancel() {
    
     }
    

    }

    /**
     * IN, OUT, KEY, W
     * IN:  input data type
     * OUT: output data type
     * KEY: key type (with keyBy(int) on a tuple stream, the key is exposed as a Tuple)
     * W:   window type
     */
    public static class SumProcessWindowFunction extends
            ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

        /**
         * Called when a window fires
         * @param tuple    the key
         * @param context  the operator context
         * @param elements all elements in this window
         * @param out      collector for the output
         */
        @Override
        public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> elements,
                            Collector<Tuple2<String, Integer>> out) {

// System.out.println(“当天系统的时间:”+dateFormat.format(System.currentTimeMillis()));
//
// System.out.println(“Window的处理时间:”+dateFormat.format(context.currentProcessingTime()));
// System.out.println(“Window的开始时间:”+dateFormat.format(context.window().getStart()));
// System.out.println(“Window的结束时间:”+dateFormat.format(context.window().getEnd()));

        int sum = 0;
        for (Tuple2 ele : elements) {
            sum += 1;
        }
        // 输出单词出现的次数
        out.collect(Tuple2.of(tuple.getField(0), sum));

    }
}

}

Output:

java
开始发送事件的时间:16:16:40
(hadoop,2)
(1573287413001,1)
(1573287413015,1)
(hadoop,3)
(1573287416016,1)
(1573287413001,1)
(1573287413015,1)
(hadoop,1)
(1573287416016,1)

![Custom source - TimeWindow 2](assets/自定义source-TimeWindow2.png)

4.3 Processing Time Window (Out-of-Order Events)

Custom source that simulates: two events are produced at second 13, but only one of them is actually sent at second 13; for some reason (e.g. a network delay) the other one is not sent until second 19. One more event is sent at second 16.

java
/**
 * Every 5 seconds, count the occurrences of each word over the last 10 seconds
 */
    public class TimeWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    DataStreamSource dataStream = env.addSource(new TestSouce());
    SingleOutputStreamOperator> result = dataStream.flatMap(new FlatMapFunction>() {
    @Override
    public void flatMap(String line, Collector> out) throws Exception {
    String[] fields = line.split(",");
    for (String word : fields) {
    out.collect(new Tuple2<>(word, 1));
    }
    }
    }).keyBy(0)
    .timeWindow(Time.seconds(10), Time.seconds(5))
    .process(new SumProcessWindowFunction());

     result.print().setParallelism(1);
    
     env.execute("TimeWindowWordCount");
    

    }

    /**
     * Simulates: two events produced at second 13 (one of them delayed until second 19), one more event at second 16
     */
    public static class TestSouce implements SourceFunction<String> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");
      @Override
      public void run(SourceContext ctx) throws Exception {
      // 控制大约在 10 秒的倍数的时间点发送事件
      String currTime = String.valueOf(System.currentTimeMillis());
      while (Integer.valueOf(currTime.substring(currTime.length() - 4)) > 100) {
      currTime = String.valueOf(System.currentTimeMillis());
      continue;
      }
      System.out.println(“开始发送事件的时间:” + dateFormat.format(System.currentTimeMillis()));
      // 第 13 秒发送两个事件
      TimeUnit.SECONDS.sleep(13);
      ctx.collect(“hadoop,” + System.currentTimeMillis());
      // 产生了一个事件,但是由于网络原因,事件没有发送
      String event = “hadoop,” + System.currentTimeMillis();
      // 第 16 秒发送一个事件
      TimeUnit.SECONDS.sleep(3);
      ctx.collect(“hadoop,” + System.currentTimeMillis());
      // 第 19 秒的时候发送
      TimeUnit.SECONDS.sleep(3);
      ctx.collect(event);

       TimeUnit.SECONDS.sleep(300);
      

      }

      @Override
      public void cancel() {

      }
      }

    /**
     * IN, OUT, KEY, W
     * IN:  input data type
     * OUT: output data type
     * KEY: key type (with keyBy(int) on a tuple stream, the key is exposed as a Tuple)
     * W:   window type
     */
    public static class SumProcessWindowFunction extends
            ProcessWindowFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple, TimeWindow> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

        /**
         * Called when a window fires
         * @param tuple    the key
         * @param context  the operator context
         * @param elements all elements in this window
         * @param out      collector for the output
         */
        @Override
        public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Integer>> elements,
                            Collector<Tuple2<String, Integer>> out) {

// System.out.println(“当天系统的时间:”+dateFormat.format(System.currentTimeMillis()));
//
// System.out.println(“Window的处理时间:”+dateFormat.format(context.currentProcessingTime()));
// System.out.println(“Window的开始时间:”+dateFormat.format(context.window().getStart()));
// System.out.println(“Window的结束时间:”+dateFormat.format(context.window().getEnd()));

        int sum = 0;
        for (Tuple2 ele : elements) {
            sum += 1;
        }
        // 输出单词出现的次数
        out.collect(Tuple2.of(tuple.getField(0), sum));

    }
}

}

Output:

java
开始发送事件的时间:16:18:50
(hadoop,1)
(1573287543001,1)
(1573287543001,1)
(hadoop,3)
(1573287546016,1)
(1573287543016,1)
(1573287546016,1)
(hadoop,2)
(1573287543016,1)

![Custom source - TimeWindow out-of-order](assets/自定义source-TimeWindow-无序.png)

4.4 Handling Out-of-Order Events with Event Time

Processing with event time:

java
/**
 * Every 5 seconds, count the occurrences of each word over the last 10 seconds
 */
    public class TimeWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    //步骤一:设置时间类型
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    DataStreamSource dataStream = env.addSource(new TestSouce());
    dataStream.map(new MapFunction>() {
    @Override
    public Tuple2 map(String line) throws Exception {
    String[] fields = line.split(",");
    return new Tuple2<>(fields[0],Long.valueOf(fields[1]));
    }
    //步骤二:获取数据里面的event Time
    }).assignTimestampsAndWatermarks(new EventTimeExtractor() )
    .keyBy(0)
    .timeWindow(Time.seconds(10), Time.seconds(5))
    .process(new SumProcessWindowFunction())
    .print().setParallelism(1);

     env.execute("TimeWindowWordCount");
    

    }

    public static class TestSouce implements SourceFunction{
    FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);
    @Override
    public void run(SourceContext ctx) throws Exception {
    // 控制大约在 10 秒的倍数的时间点发送事件
    String currTime = String.valueOf(System.currentTimeMillis());
    while (Integer.valueOf(currTime.substring(currTime.length() - 4)) > 100) {
    currTime = String.valueOf(System.currentTimeMillis());
    continue;
    }
    System.out.println(“开始发送事件的时间:” + dateFormat.format(System.currentTimeMillis()));
    // 第 13 秒发送两个事件
    TimeUnit.SECONDS.sleep(13);
    ctx.collect(“hadoop,” + System.currentTimeMillis());
    // 产生了一个事件,但是由于网络原因,事件没有发送
    String event = “hadoop,” + System.currentTimeMillis();
    // 第 16 秒发送一个事件
    TimeUnit.SECONDS.sleep(3);
    ctx.collect(“hadoop,” + System.currentTimeMillis());
    // 第 19 秒的时候发送
    TimeUnit.SECONDS.sleep(3);
    ctx.collect(event);

         TimeUnit.SECONDS.sleep(300);
    
     }
    
     @Override
     public void cancel() {
    
     }
    

    }

    /**
     * IN, OUT, KEY, W
     * IN:  input data type
     * OUT: output data type
     * KEY: key type (with keyBy(int) on a tuple stream, the key is exposed as a Tuple)
     * W:   window type
     */
    public static class SumProcessWindowFunction extends
            ProcessWindowFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple, TimeWindow> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

        /**
         * Called when a window fires
         * @param tuple    the key
         * @param context  the operator context
         * @param elements all elements in this window
         * @param out      collector for the output
         */
        @Override
        public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Long>> elements,
                            Collector<Tuple2<String, Integer>> out) {

// System.out.println(“当天系统的时间:”+dateFormat.format(System.currentTimeMillis()));
//
// System.out.println(“Window的处理时间:”+dateFormat.format(context.currentProcessingTime()));
// System.out.println(“Window的开始时间:”+dateFormat.format(context.window().getStart()));
// System.out.println(“Window的结束时间:”+dateFormat.format(context.window().getEnd()));

        int sum = 0;
        for (Tuple2 ele : elements) {
            sum += 1;
        }
        // 输出单词出现的次数
        out.collect(Tuple2.of(tuple.getField(0), sum));

    }
}


private static class EventTimeExtractor
        implements AssignerWithPeriodicWatermarks> {
    FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

    // 拿到每一个事件的 Event Time
    @Override
    public long extractTimestamp(Tuple2 element,
                                 long previousElementTimestamp) {
        return element.f1;
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {

        return new Watermark(System.currentTimeMillis());
    }
}

}

Output:

java
开始发送事件的时间:16:44:10
(hadoop,1)
(hadoop,3)
(hadoop,1)

![Processing out-of-order data with event time](assets/用EventTime处理无序的数据.png)

The result of the third window is now computed correctly, but the problem is not completely solved yet. To finish it we need the watermark mechanism.

4.5 Handling Out-of-Order Events with Watermarks

![Handling out-of-order data with watermarks](assets/使用waterMark机制处理无序的数据.png)

java
/**
 * Every 5 seconds, count the occurrences of each word over the last 10 seconds
 */
    public class TimeWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    //步骤一:设置时间类型
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    DataStreamSource dataStream = env.addSource(new TestSouce());
    dataStream.map(new MapFunction>() {
    @Override
    public Tuple2 map(String line) throws Exception {
    String[] fields = line.split(",");
    return new Tuple2<>(fields[0],Long.valueOf(fields[1]));
    }
    //步骤二:获取数据里面的event Time
    }).assignTimestampsAndWatermarks(new EventTimeExtractor() )
    .keyBy(0)
    .timeWindow(Time.seconds(10), Time.seconds(5))
    .process(new SumProcessWindowFunction())
    .print().setParallelism(1);

     env.execute("TimeWindowWordCount");
    

    }

    public static class TestSouce implements SourceFunction{
    FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);
    @Override
    public void run(SourceContext ctx) throws Exception {
    // 控制大约在 10 秒的倍数的时间点发送事件
    String currTime = String.valueOf(System.currentTimeMillis());
    while (Integer.valueOf(currTime.substring(currTime.length() - 4)) > 100) {
    currTime = String.valueOf(System.currentTimeMillis());
    continue;
    }
    System.out.println(“开始发送事件的时间:” + dateFormat.format(System.currentTimeMillis()));
    // 第 13 秒发送两个事件
    TimeUnit.SECONDS.sleep(13);
    ctx.collect(“hadoop,” + System.currentTimeMillis());
    // 产生了一个事件,但是由于网络原因,事件没有发送
    String event = “hadoop,” + System.currentTimeMillis();
    // 第 16 秒发送一个事件
    TimeUnit.SECONDS.sleep(3);
    ctx.collect(“hadoop,” + System.currentTimeMillis());
    // 第 19 秒的时候发送
    TimeUnit.SECONDS.sleep(3);
    ctx.collect(event);

         TimeUnit.SECONDS.sleep(300);
    
     }
    
     @Override
     public void cancel() {
    
     }
    

    }

    /**
     * IN, OUT, KEY, W
     * IN:  input data type
     * OUT: output data type
     * KEY: key type (with keyBy(int) on a tuple stream, the key is exposed as a Tuple)
     * W:   window type
     */
    public static class SumProcessWindowFunction extends
            ProcessWindowFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple, TimeWindow> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

        /**
         * Called when a window fires
         * @param tuple    the key
         * @param context  the operator context
         * @param elements all elements in this window
         * @param out      collector for the output
         */
        @Override
        public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Long>> elements,
                            Collector<Tuple2<String, Integer>> out) {

// System.out.println(“当天系统的时间:”+dateFormat.format(System.currentTimeMillis()));
//
// System.out.println(“Window的处理时间:”+dateFormat.format(context.currentProcessingTime()));
// System.out.println(“Window的开始时间:”+dateFormat.format(context.window().getStart()));
// System.out.println(“Window的结束时间:”+dateFormat.format(context.window().getEnd()));

        int sum = 0;
        for (Tuple2 ele : elements) {
            sum += 1;
        }
        // 输出单词出现的次数
        out.collect(Tuple2.of(tuple.getField(0), sum));

    }
}


private static class EventTimeExtractor
        implements AssignerWithPeriodicWatermarks> {
    FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

    // 拿到每一个事件的 Event Time
    @Override
    public long extractTimestamp(Tuple2 element,
                                 long previousElementTimestamp) {
        return element.f1;
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        //window延迟5秒触发
        return new Watermark(System.currentTimeMillis() - 5000);
    }
}

}

Output:

java
开始发送事件的时间:16:57:40
(hadoop,2)
(hadoop,3)
(hadoop,1)

The result is correct!

4.6 The Watermark Mechanism

4.6.1 Watermark Generation Interval

java
/**
 * Every 5 seconds, count the occurrences of each word over the last 10 seconds
 */
    public class TimeWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    //步骤一:设置时间类型
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    //设置waterMark产生的周期为1s
    env.getConfig().setAutoWatermarkInterval(1000);

     DataStreamSource dataStream = env.addSource(new TestSouce());
    dataStream.map(new MapFunction>() {
         @Override
         public Tuple2 map(String line) throws Exception {
             String[] fields = line.split(",");
             return new Tuple2<>(fields[0],Long.valueOf(fields[1]));
         }
         //步骤二:获取数据里面的event Time
     }).assignTimestampsAndWatermarks(new EventTimeExtractor() )
            .keyBy(0)
             .timeWindow(Time.seconds(10), Time.seconds(5))
             .process(new SumProcessWindowFunction())
             .print().setParallelism(1);
    
     env.execute("TimeWindowWordCount");
    

    }

    public static class TestSouce implements SourceFunction{
    FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);
    @Override
    public void run(SourceContext ctx) throws Exception {
    // 控制大约在 10 秒的倍数的时间点发送事件
    String currTime = String.valueOf(System.currentTimeMillis());
    while (Integer.valueOf(currTime.substring(currTime.length() - 4)) > 100) {
    currTime = String.valueOf(System.currentTimeMillis());
    continue;
    }
    System.out.println(“开始发送事件的时间:” + dateFormat.format(System.currentTimeMillis()));
    // 第 13 秒发送两个事件
    TimeUnit.SECONDS.sleep(13);
    ctx.collect(“hadoop,” + System.currentTimeMillis());
    // 产生了一个事件,但是由于网络原因,事件没有发送
    String event = “hadoop,” + System.currentTimeMillis();
    // 第 16 秒发送一个事件
    TimeUnit.SECONDS.sleep(3);
    ctx.collect(“hadoop,” + System.currentTimeMillis());
    // 第 19 秒的时候发送
    TimeUnit.SECONDS.sleep(3);
    ctx.collect(event);

         TimeUnit.SECONDS.sleep(300);
    
     }
    
     @Override
     public void cancel() {
    
     }
    

    }

    /**
     * IN, OUT, KEY, W
     * IN:  input data type
     * OUT: output data type
     * KEY: key type (with keyBy(int) on a tuple stream, the key is exposed as a Tuple)
     * W:   window type
     */
    public static class SumProcessWindowFunction extends
            ProcessWindowFunction<Tuple2<String, Long>, Tuple2<String, Integer>, Tuple, TimeWindow> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

        /**
         * Called when a window fires
         * @param tuple    the key
         * @param context  the operator context
         * @param elements all elements in this window
         * @param out      collector for the output
         */
        @Override
        public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Long>> elements,
                            Collector<Tuple2<String, Integer>> out) {
        int sum = 0;
        for (Tuple2 ele : elements) {
        sum += 1;
        }
        // 输出单词出现的次数
        out.collect(Tuple2.of(tuple.getField(0), sum));

      }
      }

    private static class EventTimeExtractor
    implements AssignerWithPeriodicWatermarks> {
    FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);

     // 拿到每一个事件的 Event Time
     @Override
     public long extractTimestamp(Tuple2 element,
                                  long previousElementTimestamp) {
         //这个方法是每获取到一个数据就会被调用一次。
         return element.f1;
     }
    
     @Nullable
     @Override
     public Watermark getCurrentWatermark() {
         /**
          * WasterMark会周期性的产生,默认就是每隔200毫秒产生一个
          *
          *         设置 watermark 产生的周期为 1000ms
          *         env.getConfig().setAutoWatermarkInterval(1000);
          */
         //window延迟5秒触发
         System.out.println("water mark...");
         return new Watermark(System.currentTimeMillis() - 5000);
     }
    

    }
    }

Output:

java
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
开始发送事件的时间:17:10:50
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
water mark…
(hadoop,2)
water mark…
water mark…
water mark…
water mark…
water mark…
(hadoop,3)
water mark…
water mark…
water mark…
water mark…
water mark…
(hadoop,1)
water mark…
water mark…
water mark…
water mark…
water mark…

4.6.2 What Is a Watermark?

How do we handle out-of-order data when using event time?
Between the moment an event is produced and the moments it passes through the source and reaches an operator, some time elapses. In most cases the events arrive at the operator in the order in which they were produced, but network delays and similar issues can introduce out-of-order data; with Kafka in particular, ordering across multiple partitions cannot be guaranteed. When computing windows we cannot wait indefinitely, so we need a mechanism that guarantees that after a specific amount of time the window is triggered and computed. That mechanism is the watermark, which exists specifically to handle out-of-order events; the term is usually rendered in Chinese as 水位线 (water level line).
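
With the periodic watermark generator used later in section 4.6.3, the watermark is simply the largest event time seen so far minus an allowed out-of-orderness bound; a minimal sketch (the 10-second bound is only an example):

```java
// fragment of an AssignerWithPeriodicWatermarks<T>: the watermark lags the max event time
private long currentMaxEventTime = 0L;
private final long maxOutOfOrderness = 10000; // example bound: 10 seconds

@Override
public Watermark getCurrentWatermark() {
    return new Watermark(currentMaxEventTime - maxOutOfOrderness);
}
```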

Watermarks in an ordered stream

![Watermarks in an ordered stream](assets/1569479960665.png)

Watermarks in an out-of-order stream

![Watermarks in an out-of-order stream](assets/1569479997521.png)

Watermarks in a stream with multiple parallel subtasks

![Watermarks with parallel subtasks](assets/1569480051217.png)

4.6.3 Requirement

Every 3 seconds, collect and print all events that share the same key from the previous 3 seconds.

![1573294611566](assets/1573294611566.png)

Implementation:

java
/**
 * Every 3 seconds, collect and print all events with the same key from the previous 3 seconds
 */
    public class WaterMarkWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    //步骤一:设置时间类型
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    //设置waterMark产生的周期为1s
    env.getConfig().setAutoWatermarkInterval(1000);

     DataStreamSource dataStream = env.socketTextStream("10.148.15.10", 8888);
     dataStream.map(new MapFunction>() {
         @Override
         public Tuple2 map(String line) throws Exception {
             String[] fields = line.split(",");
             return new Tuple2<>(fields[0],Long.valueOf(fields[1]));
         }
         //步骤二:获取数据里面的event Time
     }).assignTimestampsAndWatermarks(new EventTimeExtractor() )
            .keyBy(0)
             .timeWindow(Time.seconds(3))
             .process(new SumProcessWindowFunction())
             .print().setParallelism(1);
    
     env.execute("TimeWindowWordCount");
    

    }

    /**
     * IN, OUT, KEY, W
     * IN:  input data type
     * OUT: output data type
     * KEY: key type (with keyBy(int) on a tuple stream, the key is exposed as a Tuple)
     * W:   window type
     */
    public static class SumProcessWindowFunction extends
            ProcessWindowFunction<Tuple2<String, Long>, String, Tuple, TimeWindow> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

        /**
         * Called when a window fires
         * @param tuple    the key
         * @param context  the operator context
         * @param elements all elements in this window
         * @param out      collector for the output
         */
        @Override
        public void process(Tuple tuple, Context context, Iterable<Tuple2<String, Long>> elements,
                            Collector<String> out) {
        System.out.println("处理时间:" + dateFormat.format(context.currentProcessingTime()));
        System.out.println("window start time : " + dateFormat.format(context.window().getStart()));

        List<String> list = new ArrayList<>();
        for (Tuple2<String, Long> ele : elements) {
            list.add(ele.toString() + "|" + dateFormat.format(ele.f1));
        }
        out.collect(list.toString());
        System.out.println("window end time : " + dateFormat.format(context.window().getEnd()));

      }
      }

    private static class EventTimeExtractor
            implements AssignerWithPeriodicWatermarks<Tuple2<String, Long>> {
        FastDateFormat dateFormat = FastDateFormat.getInstance("HH:mm:ss");

        private long currentMaxEventTime = 0L;
        private long maxOutOfOrderness = 10000; // maximum allowed out-of-orderness: 10 seconds

        // extract the event time from each element
        @Override
        public long extractTimestamp(Tuple2<String, Long> element,
                                     long previousElementTimestamp) {
         long currentElementEventTime = element.f1;
         currentMaxEventTime = Math.max(currentMaxEventTime, currentElementEventTime);
         System.out.println("event = " + element
                 + "|" + dateFormat.format(element.f1) // Event Time
                 + "|" + dateFormat.format(currentMaxEventTime)  // Max Event Time
                 + "|" + dateFormat.format(getCurrentWatermark().getTimestamp())); // Current Watermark
         return currentElementEventTime;
     }
    
     @Nullable
     @Override
     public Watermark getCurrentWatermark() {
         /**
          * WasterMark会周期性的产生,默认就是每隔200毫秒产生一个
          *
          *         设置 watermark 产生的周期为 1000ms
          *         env.getConfig().setAutoWatermarkInterval(1000);
          */
         //window延迟5秒触发
         System.out.println("water mark...");
         return new Watermark(currentMaxEventTime - maxOutOfOrderness);
     }
    

    }
    }

Test data:

java
-- when does the window computation trigger?
000001,1461756862000
000001,1461756866000
000001,1461756872000
000001,1461756873000
000001,1461756874000
000001,1461756876000
000001,1461756877000

Enter the records one at a time.

4.6.4 Determining When the Window Fires

![1573295426278](assets/1573295426278.png)

![1573295434967](assets/1573295434967.png)

![1573295444736](assets/1573295444736.png)

![1573295452688](assets/1573295452688.png)

![1573295462557](assets/1573295462557.png)

![1573295482248](assets/1573295482248.png)

![1573295499134](assets/1573295499134.png)

![1573295512707](assets/1573295512707.png)

Summary: a window fires when

  1. the watermark >= window_end_time, and
  2. there is at least one element whose event time falls in [window_start_time, window_end_time); note that the interval is closed on the left and open on the right, and it is evaluated in event time (see the worked example below).
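
As a worked example with the demo data above (window size 3 s, maxOutOfOrderness 10 s): the first event 000001,1461756862000 falls into the window [1461756861000, 1461756864000), because tumbling 3-second windows are aligned to multiples of 3000 ms. The watermark is currentMaxEventTime - 10000, so it only reaches the window end 1461756864000 once an event with timestamp >= 1461756874000 has been seen; that is why this window should fire exactly when 000001,1461756874000 is entered.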

4.6.5 Watermark + Window: Handling Out-of-Order Events

Input data:

java
000001,1461756879000
000001,1461756871000

000001,1461756883000

![1573296359405](assets/1573296359405.png)

![1573296391460](assets/1573296391460.png)

4.6.6 Events That Arrive Too Late

![1573296532906](assets/1573296532906.png)

  1. Discard them; this is the default behavior
  2. allowedLateness: specify an extra period during which late data is still accepted
  3. sideOutputLateData: collect the late data in a side output
Discarding late data

Restart the program and test.

Input data:

java
000001,1461756870000
000001,1461756883000

000001,1461756870000
000001,1461756871000
000001,1461756872000

![1573296944424](assets/1573296944424.png)

![1573296954680](assets/1573296954680.png)

![1573296963406](assets/1573296963406.png)

We can see that data arriving too late is simply discarded.

Allowing an extra lateness period

java
).assignTimestampsAndWatermarks(new EventTimeExtractor())
        .keyBy(0)
        .timeWindow(Time.seconds(3))
        .allowedLateness(Time.seconds(2)) // allow events to be up to 2 seconds late
        .process(new SumProcessWindowFunction())
        .print().setParallelism(1);

Input data:

java
000001,1461756870000
000001,1461756883000

000001,1461756870000
000001,1461756871000
000001,1461756872000

000001,1461756884000

000001,1461756870000
000001,1461756871000
000001,1461756872000

000001,1461756885000

000001,1461756870000
000001,1461756871000
000001,1461756872000

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-K0xMKFcI-1638891977227)(assets/1573297641179.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-eruwWqgF-1638891977227)(assets/1573297653341.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-VxD3s5eq-1638891977227)(assets/1573297664487.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-iNxuS6Jj-1638891977227)(assets/1573297613203.png)]

  1. 当我们设置允许迟到 2 秒的事件,第一次 window 触发的条件是 watermark >= window_end_time
  2. 第二次(或者多次)触发的条件是 watermark < window_end_time + allowedLateness
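
把这两个条件写成表达式可能更直观一些(示意,watermark、windowEnd、allowedLateness 都是假设已有的毫秒值):

java
// 第一次触发:watermark 追上了窗口结束时间
boolean firstFire = watermark >= windowEnd;
// 迟到数据再次触发的前提:watermark 还没有超过 窗口结束时间 + allowedLateness
boolean lateFire = watermark < windowEnd + allowedLateness;
// 一旦 watermark >= windowEnd + allowedLateness,窗口状态就会被清理,
// 之后再来的数据只能进入侧输出流(sideOutputLateData)或者被丢弃
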
收集迟到的数据

/**

  • 得到并打印每隔 3 秒钟统计前 3 秒内的相同的 key 的所有的事件

  • 收集迟到太多的数据
    */
    public class WaterMarkWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setParallelism(1);
    //步骤一:设置时间类型
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    //设置waterMark产生的周期为1s
    env.getConfig().setAutoWatermarkInterval(1000);

     // 保存迟到的,会被丢弃的数据
     OutputTag<Tuple2<String, Long>> outputTag =
             new OutputTag<Tuple2<String, Long>>("late-data"){};
    
     DataStreamSource dataStream = env.socketTextStream("10.148.15.10", 8888);
     SingleOutputStreamOperator result = dataStream.map(new MapFunction>() {
         @Override
         public Tuple2 map(String line) throws Exception {
             String[] fields = line.split(",");
             return new Tuple2<>(fields[0], Long.valueOf(fields[1]));
         }
         //步骤二:获取数据里面的event Time
     }).assignTimestampsAndWatermarks(new EventTimeExtractor())
             .keyBy(0)
             .timeWindow(Time.seconds(3))
             // .allowedLateness(Time.seconds(2)) // 允许事件迟到 2 秒
             .sideOutputLateData(outputTag) // 保存迟到太多的数据
             .process(new SumProcessWindowFunction());
     //打印正常的数据
     result.print();
     //获取迟到太多的数据
    
     DataStream lateDataStream
             = result.getSideOutput(outputTag).map(new MapFunction, String>() {
         @Override
         public String map(Tuple2 stringLongTuple2) throws Exception {
             return "迟到的数据:" + stringLongTuple2.toString();
         }
     });
    
     lateDataStream.print();
    
     env.execute("TimeWindowWordCount");
    

    }

    /**

    • IN, OUT, KEY, W

    • IN:输入的数据类型

    • OUT:输出的数据类型

    • Key:key的数据类型(在Flink里面,String用Tuple表示)

    • W:Window的数据类型
      /
      public static class SumProcessWindowFunction extends
      ProcessWindowFunction,String,Tuple,TimeWindow> {
      FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);
      /
      *

      • 当一个window触发计算的时候会调用这个方法

      • @param tuple key

      • @param context operator的上下文

      • @param elements 指定window的所有元素

      • @param out 用户输出
        */
        @Override
        public void process(Tuple tuple, Context context, Iterable> elements,
        Collector out) {
        System.out.println(“处理时间:” + dateFormat.format(context.currentProcessingTime()));
        System.out.println("window start time : " + dateFormat.format(context.window().getStart()));

        List list = new ArrayList<>();
        for (Tuple2 ele : elements) {
        list.add(ele.toString() + “|” + dateFormat.format(ele.f1));
        }
        out.collect(list.toString());
        System.out.println("window end time : " + dateFormat.format(context.window().getEnd()));

      }
      }

    private static class EventTimeExtractor
    implements AssignerWithPeriodicWatermarks> {
    FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);

     private long currentMaxEventTime = 0L;
     private long maxOutOfOrderness = 10000; // 最大允许的乱序时间 10 秒
    
    
     // 拿到每一个事件的 Event Time
     @Override
     public long extractTimestamp(Tuple2 element,
                                  long previousElementTimestamp) {
         long currentElementEventTime = element.f1;
         currentMaxEventTime = Math.max(currentMaxEventTime, currentElementEventTime);
         System.out.println("event = " + element
                 + "|" + dateFormat.format(element.f1) // Event Time
                 + "|" + dateFormat.format(currentMaxEventTime)  // Max Event Time
                 + "|" + dateFormat.format(getCurrentWatermark().getTimestamp())); // Current Watermark
         return currentElementEventTime;
     }
    
     @Nullable
     @Override
     public Watermark getCurrentWatermark() {
         /**
          * WaterMark 会周期性地产生,默认是每隔 200 毫秒产生一个
          *
          *         设置 watermark 产生的周期为 1000ms
          *         env.getConfig().setAutoWatermarkInterval(1000);
          */
         System.out.println("water mark...");
         return new Watermark(currentMaxEventTime - maxOutOfOrderness);
     }
    

    }
    }

输入:

java
000001,1461756870000
000001,1461756883000
迟到的数据
000001,1461756870000
000001,1461756871000
000001,1461756872000

4.7 多并行度下的WaterMark

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-FK3BOTJM-1638891977228)(assets/1573298799383.png)]

一个 window 算子可能会接收到多个上游发来的 waterMark,Flink 以其中最小的为准。

/**

  • 得到并打印每隔 3 秒钟统计前 3 秒内的相同的 key 的所有的事件

  • 测试多并行度
    */
    public class WaterMarkWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    //把并行度设置为2
    env.setParallelism(2);
    //步骤一:设置时间类型
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    //设置waterMark产生的周期为1s
    env.getConfig().setAutoWatermarkInterval(1000);

     // 保存迟到的,会被丢弃的数据
     OutputTag<Tuple2<String, Long>> outputTag =
             new OutputTag<Tuple2<String, Long>>("late-data"){};
    
     DataStreamSource dataStream = env.socketTextStream("10.148.15.10", 8888);
     SingleOutputStreamOperator result = dataStream.map(new MapFunction>() {
         @Override
         public Tuple2 map(String line) throws Exception {
             String[] fields = line.split(",");
             return new Tuple2<>(fields[0], Long.valueOf(fields[1]));
         }
         //步骤二:获取数据里面的event Time
     }).assignTimestampsAndWatermarks(new EventTimeExtractor())
             .keyBy(0)
             .timeWindow(Time.seconds(3))
             // .allowedLateness(Time.seconds(2)) // 允许事件迟到 2 秒
             .sideOutputLateData(outputTag) // 保存迟到太多的数据
             .process(new SumProcessWindowFunction());
     //打印正常的数据
     result.print();
     //获取迟到太多的数据
    
     DataStream lateDataStream
             = result.getSideOutput(outputTag).map(new MapFunction, String>() {
         @Override
         public String map(Tuple2 stringLongTuple2) throws Exception {
             return "迟到的数据:" + stringLongTuple2.toString();
         }
     });
    
     lateDataStream.print();
    
     env.execute("TimeWindowWordCount");
    

    }

    /**

    • IN, OUT, KEY, W

    • IN:输入的数据类型

    • OUT:输出的数据类型

    • Key:key的数据类型(在Flink里面,String用Tuple表示)

    • W:Window的数据类型
      /
      public static class SumProcessWindowFunction extends
      ProcessWindowFunction,String,Tuple,TimeWindow> {
      FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);
      /
      *

      • 当一个window触发计算的时候会调用这个方法

      • @param tuple key

      • @param context operator的上下文

      • @param elements 指定window的所有元素

      • @param out 用户输出
        */
        @Override
        public void process(Tuple tuple, Context context, Iterable> elements,
        Collector out) {
        System.out.println(“处理时间:” + dateFormat.format(context.currentProcessingTime()));
        System.out.println("window start time : " + dateFormat.format(context.window().getStart()));

        List list = new ArrayList<>();
        for (Tuple2 ele : elements) {
        list.add(ele.toString() + “|” + dateFormat.format(ele.f1));
        }
        out.collect(list.toString());
        System.out.println("window end time : " + dateFormat.format(context.window().getEnd()));

      }
      }

    private static class EventTimeExtractor
    implements AssignerWithPeriodicWatermarks> {
    FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);

     private long currentMaxEventTime = 0L;
     private long maxOutOfOrderness = 10000; // 最大允许的乱序时间 10 秒
    
    
     // 拿到每一个事件的 Event Time
     @Override
     public long extractTimestamp(Tuple2 element,
                                  long previousElementTimestamp) {
         long currentElementEventTime = element.f1;
         currentMaxEventTime = Math.max(currentMaxEventTime, currentElementEventTime);
         //打印线程
         long id = Thread.currentThread().getId();
         System.out.println("当前线程ID:"+id+"event = " + element
                 + "|" + dateFormat.format(element.f1) // Event Time
                 + "|" + dateFormat.format(currentMaxEventTime)  // Max Event Time
                 + "|" + dateFormat.format(getCurrentWatermark().getTimestamp())); // Current Watermark
         return currentElementEventTime;
     }
    
     @Nullable
     @Override
     public Watermark getCurrentWatermark() {
         /**
          * WaterMark 会周期性地产生,默认是每隔 200 毫秒产生一个
          *
          *         设置 watermark 产生的周期为 1000ms
          *         env.getConfig().setAutoWatermarkInterval(1000);
          */
         System.out.println("water mark...");
         return new Watermark(currentMaxEventTime - maxOutOfOrderness);
     }
    

    }
    }

输入数据:

java
000001,1461756870000
000001,1461756883000
000001,1461756888000

输出结果:

当前线程ID:55event = (000001,1461756883000)|19:34:43|19:34:43|19:34:33
water mark…
当前线程ID:56event = (000001,1461756870000)|19:34:30|19:34:30|19:34:20
water mark…
water mark…
water mark…
当前线程ID:56event = (000001,1461756888000)|19:34:48|19:34:48|19:34:38
water mark…
water mark…
处理时间:19:31:25
window start time : 19:34:30
2> [(000001,1461756870000)|19:34:30]
window end time : 19:34:33

ID 为 56 的线程先后产生了两个 WaterMark:20 和 38,后产生的 38 会覆盖 20,所以线程 56 当前的 WaterMark 是 38。

线程 55 的 WaterMark 是 33,线程 56 的 WaterMark 是 38,下游的窗口算子会取两者中较小的 33 作为自己的 WaterMark。33 已经达到了窗口 [30, 33) 的结束时间,所以这个窗口被触发,而窗口里面正好有 (000001,1461756870000) 这条数据。
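
用一小段代码模拟一下这个取最小值的过程(示意,时间戳对应上面日志里的 19:34:33 和 19:34:38):

java
public class MinWatermarkDemo {
    public static void main(String[] args) {
        long watermarkOfThread55 = 1461756873000L; // 19:34:33
        long watermarkOfThread56 = 1461756878000L; // 19:34:38
        // 下游算子取所有输入通道 watermark 的最小值,作为自己当前的 watermark
        long operatorWatermark = Math.min(watermarkOfThread55, watermarkOfThread56);
        long windowEnd = 1461756873000L;           // 窗口 [19:34:30, 19:34:33) 的结束时间
        // watermark >= windowEnd,所以这个窗口会被触发
        System.out.println("窗口是否触发:" + (operatorWatermark >= windowEnd)); // true
    }
}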

4.8 WaterMark生成机制

/**

  • 得到并打印每隔 3 秒钟统计前 3 秒内的相同的 key 的所有的事件

  • 有条件的产生watermark
    */
    public class WaterMarkWindowWordCount {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    //把并行度设置为2
    env.setParallelism(2);
    //步骤一:设置时间类型
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    //设置waterMark产生的周期为1s
    env.getConfig().setAutoWatermarkInterval(1000);

     // 保存迟到的,会被丢弃的数据
     OutputTag<Tuple2<String, Long>> outputTag =
             new OutputTag<Tuple2<String, Long>>("late-data"){};
    
     DataStreamSource dataStream = env.socketTextStream("10.148.15.10", 8888);
     SingleOutputStreamOperator result = dataStream.map(new MapFunction>() {
         @Override
         public Tuple2 map(String line) throws Exception {
             String[] fields = line.split(",");
             return new Tuple2<>(fields[0], Long.valueOf(fields[1]));
         }
         //步骤二:获取数据里面的event Time
     }).assignTimestampsAndWatermarks(new EventTimeExtractor())
             .keyBy(0)
             .timeWindow(Time.seconds(3))
             // .allowedLateness(Time.seconds(2)) // 允许事件迟到 2 秒
             .sideOutputLateData(outputTag) // 保存迟到太多的数据
             .process(new SumProcessWindowFunction());
     //打印正常的数据
     result.print();
     //获取迟到太多的数据
    
     DataStream lateDataStream
             = result.getSideOutput(outputTag).map(new MapFunction, String>() {
         @Override
         public String map(Tuple2 stringLongTuple2) throws Exception {
             return "迟到的数据:" + stringLongTuple2.toString();
         }
     });
    
     lateDataStream.print();
    
     env.execute("TimeWindowWordCount");
    

    }

    /**

    • IN, OUT, KEY, W

    • IN:输入的数据类型

    • OUT:输出的数据类型

    • Key:key的数据类型(在Flink里面,String用Tuple表示)

    • W:Window的数据类型
      /
      public static class SumProcessWindowFunction extends
      ProcessWindowFunction,String,Tuple,TimeWindow> {
      FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);
      /
      *

      • 当一个window触发计算的时候会调用这个方法

      • @param tuple key

      • @param context operator的上下文

      • @param elements 指定window的所有元素

      • @param out 用户输出
        */
        @Override
        public void process(Tuple tuple, Context context, Iterable> elements,
        Collector out) {
        System.out.println(“处理时间:” + dateFormat.format(context.currentProcessingTime()));
        System.out.println("window start time : " + dateFormat.format(context.window().getStart()));

        List list = new ArrayList<>();
        for (Tuple2 ele : elements) {
        list.add(ele.toString() + “|” + dateFormat.format(ele.f1));
        }
        out.collect(list.toString());
        System.out.println("window end time : " + dateFormat.format(context.window().getEnd()));

      }
      }

    /**

    • 按条件产生waterMark
      */
      private static class EventTimeExtractor2
      implements AssignerWithPunctuatedWatermarks<Tuple2<String, Long>> {

      @Nullable
      @Override
      public Watermark checkAndGetNextWatermark(Tuple2<String, Long> lastElement,
      long extractedTimestamp) {
      // 这个方法每接收到一个事件就会调用一次
      // 根据条件产生 watermark,并不是周期性地产生 watermark
      if ("000002".equals(lastElement.f0)) {
      // 只有 key 为 000002 的事件才会发送 watermark
      return new Watermark(lastElement.f1 - 10000);
      }
      // 返回 null 表示这一次不产生 watermark
      return null;
      }

      @Override
      public long extractTimestamp(Tuple2<String, Long> element,
      long previousElementTimestamp) {
      return element.f1;
      }
      }

    private static class EventTimeExtractor
    implements AssignerWithPeriodicWatermarks> {
    FastDateFormat dateFormat = FastDateFormat.getInstance(“HH:mm:ss”);

     private long currentMaxEventTime = 0L;
     private long maxOutOfOrderness = 10000; // 最大允许的乱序时间 10 秒
    
    
     // 拿到每一个事件的 Event Time
     @Override
     public long extractTimestamp(Tuple2 element,
                                  long previousElementTimestamp) {
         long currentElementEventTime = element.f1;
         currentMaxEventTime = Math.max(currentMaxEventTime, currentElementEventTime);
         long id = Thread.currentThread().getId();
         System.out.println("当前线程ID:"+id+"event = " + element
                 + "|" + dateFormat.format(element.f1) // Event Time
                 + "|" + dateFormat.format(currentMaxEventTime)  // Max Event Time
                 + "|" + dateFormat.format(getCurrentWatermark().getTimestamp())); // Current Watermark
         return currentElementEventTime;
     }
    
     @Nullable
     @Override
     public Watermark getCurrentWatermark() {
         /**
          * WaterMark 会周期性地产生,默认是每隔 200 毫秒产生一个
          *
          *         设置 watermark 产生的周期为 1000ms
          *         env.getConfig().setAutoWatermarkInterval(1000);
          *
          *
          * 和事件关系不大
          *    1. watermark 值依赖处理时间的场景
          *    2. 当有一段时间没有接收到事件,但是仍然需要产生 watermark 的场景
          */
         System.out.println("water mark...");
         return new Watermark(currentMaxEventTime - maxOutOfOrderness);
     }
    

    }
    }
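
如果想试一下上面这种按条件(punctuated)产生 watermark 的方式,只需要把 main 方法里注册的时间戳分配器换成 EventTimeExtractor2 即可(示意):

java
// 周期性(periodic)的 watermark 换成按条件(punctuated)产生的 watermark
.assignTimestampsAndWatermarks(new EventTimeExtractor2())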

4.1 Window概述

聚合事件(比如计数、求和)在流上的工作方式与批处理不同。比如,对流中的所有元素进行计数是不可能的,因为通常流是无限的(无界的)。所以,流上的聚合需要由 window 来划定范围,比如 “计算过去的5分钟” ,或者 “最后100个元素的和” 。window是一种可以把无限数据切割为有限数据块的手段。

窗口可以是 时间驱动的 【Time Window】(比如:每30秒)或者 数据驱动的【Count Window】 (比如:每100个元素)。
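
两类窗口在 API 上的直观对比如下(示意,wordStream 假设是一个 DataStream<Tuple2<String, Integer>>):

java
// 时间驱动:每 30 秒一个滚动窗口
wordStream.keyBy(0).timeWindow(Time.seconds(30)).sum(1);
// 数据驱动:每 100 个元素一个窗口
wordStream.keyBy(0).countWindow(100).sum(1);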

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-J2v57Cve-1638891999640)(assets/1569381744120.png)]

4.2 Window类型

窗口通常被区分为不同的类型:
tumbling windows:滚动窗口 【没有重叠】
sliding windows:滑动窗口 【有重叠】
session windows:会话窗口
global windows: 全局窗口【所有数据进入同一个窗口,本身不会触发计算,需要配合 trigger 使用】

4.2.1 tumblingwindows:滚动窗口【没有重叠】

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-6ZAUzdp9-1638891999641)(assets/1569381903653.png)]

4.2.2 slidingwindows:滑动窗口 【有重叠】

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LotwBOkR-1638891999642)(assets/1569381981992.png)]

4.2.3 session windows

需求:实时计算每个单词出现的次数,如果一个单词过了5秒就没出现过了,那么就输出这个单词。

案例演示:见下方

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CUbWQprK-1638891999642)(assets/session-windows.svg)]

4.2.4 global windows

案例见下方

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-E6mS6H6J-1638891999643)(assets/non-windowed.svg)]

4.2.5 Window类型总结
Keyed Window 和 Non Keyed Window

/**

  • Non Keyed Window 和 Keyed Window
    */
    public class WindowType {
    public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“10.148.15.10”, 8888);

     SingleOutputStreamOperator> stream = dataStream.flatMap(new FlatMapFunction>() {
         @Override
         public void flatMap(String line, Collector> collector) throws Exception {
             String[] fields = line.split(",");
             for (String word : fields) {
                 collector.collect(Tuple2.of(word, 1));
             }
         }
     });
    
     //Non keyed Stream
    

// AllWindowedStream, TimeWindow> nonkeyedStream = stream.timeWindowAll(Time.seconds(3));
// nonkeyedStream.sum(1)
// .print();

    //Keyed Stream
    stream.keyBy(0)
            .timeWindow(Time.seconds(3))
            .sum(1)
            .print();

    env.execute("word count");


}

}

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-9CTr2fm7-1638891999643)(assets/window的类型2.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-oiNUyTTX-1638891999643)(assets/window的类型-1573884417208.png)]

TimeWindow

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ZSzkj4IG-1638891999644)(assets/1569383737549.png)]

CountWindow

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-RbyitwKC-1638891999644)(assets/1569383745199.png)]

自定义Window

一般前面两种window就能解决我们所遇到的业务场景了,本人至今还没遇到需要自定义window的场景。

4.3 window操作

Keyed Windows

stream
.keyBy(…) <- keyed versus non-keyed windows
.window(…) <- required: “assigner”
[.trigger(…)] <- optional: “trigger” (else default trigger)
[.evictor(…)] <- optional: “evictor” (else no evictor)
[.allowedLateness(…)] <- optional: “lateness” (else zero)
[.sideOutputLateData(…)] <- optional: “output tag” (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: “function”
[.getSideOutput(…)] <- optional: “output tag”

Non-Keyed Windows

java
stream
.windowAll(…) <- required: “assigner”
[.trigger(…)] <- optional: “trigger” (else default trigger)
[.evictor(…)] <- optional: “evictor” (else no evictor)
[.allowedLateness(…)] <- optional: “lateness” (else zero)
[.sideOutputLateData(…)] <- optional: “output tag” (else no side output for late data)
.reduce/aggregate/fold/apply() <- required: “function”
[.getSideOutput(…)] <- optional: “output tag”

4.3.1 window function
Tumbling window和slide window

java
//滚动窗口
stream.keyBy(0)
.window(TumblingEventTimeWindows.of(Time.seconds(2)))
.sum(1)
.print();
//滑动窗口
stream.keyBy(0)
.window(SlidingProcessingTimeWindows.of(Time.seconds(6),Time.seconds(4)))
.sum(1)
.print();

session window

java
/**

  • 5秒过去以后,该单词不出现就打印出来该单词
    */
    public class SessionWindowTest {
    public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“10.148.15.10”, 8888);

     SingleOutputStreamOperator> stream = dataStream.flatMap(new FlatMapFunction>() {
         @Override
         public void flatMap(String line, Collector> collector) throws Exception {
             String[] fields = line.split(",");
             for (String word : fields) {
                 collector.collect(Tuple2.of(word, 1));
             }
         }
     });
    
     stream.keyBy(0)
             .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
             .sum(1)
             .print();
    
     env.execute("SessionWindowTest");
    

    }
    }

global window

global window + trigger 一起配合才能使用

需求:单词每出现三次统计一次

java
/**

  • 单词每出现三次统计一次
    */
    public class GlobalWindowTest {
    public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“10.148.15.10”, 8888);

     SingleOutputStreamOperator> stream = dataStream.flatMap(new FlatMapFunction>() {
         @Override
         public void flatMap(String line, Collector> collector) throws Exception {
             String[] fields = line.split(",");
             for (String word : fields) {
                 collector.collect(Tuple2.of(word, 1));
             }
         }
     });
    
     stream.keyBy(0)
             .window(GlobalWindows.create())
              //如果不加这个程序是启动不起来的
             .trigger(CountTrigger.of(3))
             .sum(1)
             .print();
    
     env.execute("SessionWindowTest");
    

    }
    }

执行结果:

java
hello,3
hello,6
hello,9

总结:效果跟 CountWindow(3) 很像,但又不完全一样:CountWindow(3) 每次触发后会清空窗口,输出的始终只是最近 3 条的计数;而 GlobalWindow + CountTrigger 不会清空窗口,所以每次输出都包含了之前累计的次数。
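
如果想对比验证,可以把上面的 GlobalWindows + CountTrigger 换成 countWindow(3) 跑一下(示意):

java
// countWindow(3) 内部是 GlobalWindows + PurgingTrigger(CountTrigger.of(3)),触发后会清空窗口
stream.keyBy(0)
        .countWindow(3)
        .sum(1)
        .print();
// 连续输入 9 个 hello,输出会是 (hello,3)、(hello,3)、(hello,3),而不是 3、6、9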

4.3.2 Trigger

需求:自定义一个CountWindow

java
/**

  • 使用Trigger 自己实现一个类似CountWindow的效果
    */
    public class CountWindowWordCount {
    public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“10.148.15.10”, 8888);

     SingleOutputStreamOperator> stream = dataStream.flatMap(new FlatMapFunction>() {
         @Override
         public void flatMap(String line, Collector> collector) throws Exception {
             String[] fields = line.split(",");
             for (String word : fields) {
                 collector.collect(Tuple2.of(word, 1));
             }
         }
     });
    
     WindowedStream, Tuple, GlobalWindow> keyedWindow = stream.keyBy(0)
             .window(GlobalWindows.create())
             .trigger(new MyCountTrigger(3));
    
    
         //可以看看里面的源码,跟我们写的很像
    

// WindowedStream, Tuple, GlobalWindow> keyedWindow = stream.keyBy(0)
// .window(GlobalWindows.create())
// .trigger(CountTrigger.of(3));

    DataStream> wordCounts = keyedWindow.sum(1);

    wordCounts.print().setParallelism(1);

    env.execute("Streaming WordCount");
}



private static class MyCountTrigger
        extends Trigger, GlobalWindow> {
    // 表示指定的元素的最大的数量
    private long maxCount;

    // 用于存储每个 key 对应的 count 值
    private ReducingStateDescriptor stateDescriptor
            = new ReducingStateDescriptor("count", new ReduceFunction() {
        @Override
        public Long reduce(Long aLong, Long t1) throws Exception {
            return aLong + t1;
        }
    }, Long.class);

    public MyCountTrigger(long maxCount) {
        this.maxCount = maxCount;
    }

    /**
     *  当一个元素进入到一个 window 中的时候就会调用这个方法
     * @param element   元素
     * @param timestamp 进来的时间
     * @param window    元素所属的窗口
     * @param ctx 上下文
     * @return TriggerResult
     *      1. TriggerResult.CONTINUE :表示对 window 不做任何处理
     *      2. TriggerResult.FIRE :表示触发 window 的计算
     *      3. TriggerResult.PURGE :表示清除 window 中的所有数据
     *      4. TriggerResult.FIRE_AND_PURGE :表示先触发 window 计算,然后删除 window 中的数据
     * @throws Exception
     */
    @Override
    public TriggerResult onElement(Tuple2 element,
                                   long timestamp,
                                   GlobalWindow window,
                                   TriggerContext ctx) throws Exception {
        // 拿到当前 key 对应的 count 状态值
        ReducingState count = ctx.getPartitionedState(stateDescriptor);
        // count 累加 1
        count.add(1L);
        // 如果当前 key 的 count 值等于 maxCount
        if (count.get() == maxCount) {
            count.clear();
            // 触发 window 计算,删除数据
            return TriggerResult.FIRE_AND_PURGE;
        }
        // 否则,对 window 不做任何的处理
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time,
                                          GlobalWindow window,
                                          TriggerContext ctx) throws Exception {
        // 写基于 Processing Time 的定时器任务逻辑
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time,
                                     GlobalWindow window,
                                     TriggerContext ctx) throws Exception {
        // 写基于 Event Time 的定时器任务逻辑
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
        // 清除状态值
        ctx.getPartitionedState(stateDescriptor).clear();
    }
}

}

注:效果跟CountWindow一模一样

4.3.3 Evictor

需求:实现每隔2个单词,计算最近3个单词

java
/**

  • 使用Evictor 自己实现一个类似CountWindow(3,2)的效果

  • 每隔2个单词计算最近3个单词
    */
    public class CountWindowWordCountByEvictor {
    public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“10.148.15.10”, 8888);

     SingleOutputStreamOperator> stream = dataStream.flatMap(new FlatMapFunction>() {
         @Override
         public void flatMap(String line, Collector> collector) throws Exception {
             String[] fields = line.split(",");
             for (String word : fields) {
                 collector.collect(Tuple2.of(word, 1));
             }
         }
     });
    
     WindowedStream, Tuple, GlobalWindow> keyedWindow = stream.keyBy(0)
             .window(GlobalWindows.create())
             .trigger(new MyCountTrigger(2))
             .evictor(new MyCountEvictor(3));
    
    
    
     DataStream> wordCounts = keyedWindow.sum(1);
    
     wordCounts.print().setParallelism(1);
    
     env.execute("Streaming WordCount");
    

    }

    private static class MyCountTrigger
    extends Trigger, GlobalWindow> {
    // 表示指定的元素的最大的数量
    private long maxCount;

     // 用于存储每个 key 对应的 count 值
     private ReducingStateDescriptor stateDescriptor
             = new ReducingStateDescriptor("count", new ReduceFunction() {
         @Override
         public Long reduce(Long aLong, Long t1) throws Exception {
             return aLong + t1;
         }
     }, Long.class);
    
     public MyCountTrigger(long maxCount) {
         this.maxCount = maxCount;
     }
    
     /**
      *  当一个元素进入到一个 window 中的时候就会调用这个方法
      * @param element   元素
      * @param timestamp 进来的时间
      * @param window    元素所属的窗口
      * @param ctx 上下文
      * @return TriggerResult
      *      1. TriggerResult.CONTINUE :表示对 window 不做任何处理
      *      2. TriggerResult.FIRE :表示触发 window 的计算
      *      3. TriggerResult.PURGE :表示清除 window 中的所有数据
      *      4. TriggerResult.FIRE_AND_PURGE :表示先触发 window 计算,然后删除 window 中的数据
      * @throws Exception
      */
     @Override
     public TriggerResult onElement(Tuple2 element,
                                    long timestamp,
                                    GlobalWindow window,
                                    TriggerContext ctx) throws Exception {
         // 拿到当前 key 对应的 count 状态值
         ReducingState count = ctx.getPartitionedState(stateDescriptor);
         // count 累加 1
         count.add(1L);
         // 如果当前 key 的 count 值等于 maxCount
         if (count.get() == maxCount) {
             count.clear();
             // 触发 window 计算,删除数据
             return TriggerResult.FIRE;
         }
         // 否则,对 window 不做任何的处理
         return TriggerResult.CONTINUE;
     }
    
     @Override
     public TriggerResult onProcessingTime(long time,
                                           GlobalWindow window,
                                           TriggerContext ctx) throws Exception {
         // 写基于 Processing Time 的定时器任务逻辑
         return TriggerResult.CONTINUE;
     }
    
     @Override
     public TriggerResult onEventTime(long time,
                                      GlobalWindow window,
                                      TriggerContext ctx) throws Exception {
         // 写基于 Event Time 的定时器任务逻辑
         return TriggerResult.CONTINUE;
     }
    
     @Override
     public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
         // 清除状态值
         ctx.getPartitionedState(stateDescriptor).clear();
     }
    

    }

    private static class MyCountEvictor
    implements Evictor, GlobalWindow> {
    // window 的大小
    private long windowCount;

     public MyCountEvictor(long windowCount) {
         this.windowCount = windowCount;
     }
    
     /**
      *  在 window 计算之前删除特定的数据
      * @param elements  window 中所有的元素
      * @param size  window 中所有元素的大小
      * @param window    window
      * @param evictorContext    上下文
      */
     @Override
     public void evictBefore(Iterable>> elements,
                             int size, GlobalWindow window, EvictorContext evictorContext) {
         if (size <= windowCount) {
             return;
         } else {
             int evictorCount = 0;
             Iterator>> iterator = elements.iterator();
             while (iterator.hasNext()) {
                 iterator.next();
                 evictorCount++;
                 // 如果删除的数量小于当前的 window 大小减去规定的 window 的大小,就需要删除当前的元素
                 if (evictorCount > size - windowCount) {
                     break;
                 } else {
                     iterator.remove();
                 }
             }
         }
     }
    
     /**
      *  在 window 计算之后删除特定的数据
      * @param elements  window 中所有的元素
      * @param size  window 中所有元素的大小
      * @param window    window
      * @param evictorContext    上下文
      */
     @Override
     public void evictAfter(Iterable>> elements,
                            int size, GlobalWindow window, EvictorContext evictorContext) {
    
     }
    

    }

    }

4.3.4 window增量聚合

窗口中每进入一条数据,就进行一次计算,等时间到了展示最后的结果

常用的聚合算子

java
reduce(reduceFunction)
aggregate(aggregateFunction)
sum(),min(),max()

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-YsoVwJU7-1638891999645)(assets/1573871695836.png)]

java
/**

  • 演示增量聚合
    */
    public class SocketDemoIncrAgg {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“localhost”, 8888);
    SingleOutputStreamOperator intDStream = dataStream.map(number -> Integer.valueOf(number));
    AllWindowedStream windowResult = intDStream.timeWindowAll(Time.seconds(10));
    windowResult.reduce(new ReduceFunction() {
    @Override
    public Integer reduce(Integer last, Integer current) throws Exception {
    System.out.println(“执行逻辑”+last + " "+current);
    return last+current;
    }
    }).print();

     env.execute(SocketDemoIncrAgg.class.getSimpleName());
    

    }
    }

aggregate算子

需求:求每个窗口里面的数据的平均值

java
/**

  • 求每个窗口中的数据的平均值
    */
    public class aggregateWindowTest {
    public static void main(String[] args) throws Exception{
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“10.148.15.10”, 8888);

     SingleOutputStreamOperator numberStream = dataStream.map(line -> Integer.valueOf(line));
     AllWindowedStream windowStream = numberStream.timeWindowAll(Time.seconds(5));
     windowStream.aggregate(new MyAggregate())
             .print();
    
     env.execute("aggregateWindowTest");
    

    }

    /**

    • IN, 输入的数据类型

    • ACC,自定义的中间状态

    •  Tuple2:
      
    •      key: 计算数据的个数
      
    •      value:计算总值
      
    • OUT,输出的数据类型
      /
      private static class MyAggregate
      implements AggregateFunction,Double>{
      /
      *

      • 初始化 累加器
      • @return
        */
        @Override
        public Tuple2 createAccumulator() {
        return new Tuple2<>(0,0);
        }

      /**

      • 针对每个数据的操作
      • @return
        */
        @Override
        public Tuple2 add(Integer element,
        Tuple2 accumulator) {
        //个数+1
        //总的值累计
        return new Tuple2<>(accumulator.f0+1,accumulator.f1+element);
        }

      @Override
      public Double getResult(Tuple2 accumulator) {
      return (double)accumulator.f1/accumulator.f0;
      }

      @Override
      public Tuple2 merge(Tuple2 a1,
      Tuple2 b1) {
      return Tuple2.of(a1.f0+b1.f0,a1.f1+b1.f1);
      }
      }
      }

4.3.5 window全量聚合

等属于窗口的数据到齐,才开始进行聚合计算【可以实现对窗口内的数据进行排序等需求】

java
apply(windowFunction)
process(processWindowFunction)
processWindowFunction 比 windowFunction 提供了更多的上下文信息,类似于 MapFunction 和 RichMapFunction 的关系

效果图

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ZcqyYNRP-1638891999645)(assets/1573877034053.png)]

java
/**

  • 全量计算
    */
    public class SocketDemoFullAgg {
    public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource dataStream = env.socketTextStream(“localhost”, 8888);
    SingleOutputStreamOperator intDStream = dataStream.map(number -> Integer.valueOf(number));
    AllWindowedStream windowResult = intDStream.timeWindowAll(Time.seconds(10));
    windowResult.process(new ProcessAllWindowFunction() {
    @Override
    public void process(Context context, Iterable iterable, Collector collector) throws Exception {
    System.out.println(“执行计算逻辑”);
    int count=0;
    Iterator numberiterator = iterable.iterator();
    while (numberiterator.hasNext()){
    Integer number = numberiterator.next();
    count+=number;
    }
    collector.collect(count);
    }
    }).print();

     env.execute("socketDemoFullAgg");
    

    }
    }

4.3.6 window join

两个window之间可以进行join,join操作只支持三种类型的window:滚动窗口,滑动窗口,会话窗口

使用方式:

java
stream.join(otherStream) //两个流进行关联
.where() //选择第一个流的key作为关联字段
.equalTo()//选择第二个流的key作为关联字段
.window()//设置窗口的类型
.apply() //对结果做操作

Tumbling Window Join

java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream orangeStream = …
DataStream greenStream = …

orangeStream.join(greenStream)
.where()
.equalTo()
.window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + “,” + second;
}
});

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-jVWbwv40-1638891999646)(assets/tumbling-window-join.svg)]

Sliding Window Join

java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream orangeStream = …
DataStream greenStream = …

orangeStream.join(greenStream)
.where()
.equalTo()
.window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + “,” + second;
}
});

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-5p8AWQNq-1638891999646)(assets/sliding-window-join.svg)]

Session Window Join

java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream orangeStream = …
DataStream greenStream = …

orangeStream.join(greenStream)
.where()
.equalTo()
.window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
.apply (new JoinFunction<Integer, Integer, String> (){
@Override
public String join(Integer first, Integer second) {
return first + “,” + second;
}
});

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-c8aeRAzr-1638891999647)(assets/session-window-join.svg)]

Interval Join

java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream orangeStream = …
DataStream greenStream = …

orangeStream
.keyBy()
.intervalJoin(greenStream.keyBy())
.between(Time.milliseconds(-2), Time.milliseconds(1))
.process (new ProcessJoinFunction<Integer, Integer, String> (){

    @Override
    public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
        out.collect(left + "," + right);
    }
});

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-oXiXtmiy-1638891999647)(assets/interval-join.svg)]

五 、招聘要求介绍(5分钟)

六 、总结(5分钟)

深入浅出Flink-task

一 、课前准备

  1. 掌握前面的flink知识

二 、课堂主题

了解TaskManager,slot,Task之间的关系

三 、课程目标

了解TaskManager,slot,Task之间的关系

四 、知识要点

4.1 flink基础知识

4.1.1 Flink基本架构

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-JmWfMyc6-1638892018928)(assets/Flink架构.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-2JFAe959-1638892018929)(assets/yarn.png)]

4.1.2 概述

Flink 整个系统主要由两个组件组成,分别为 JobManager 和 TaskManager,Flink 架构也遵循 Master - Slave 架构设计原则,JobManager 为 Master 节点,TaskManager 为 Worker (Slave)节点。
所有组件之间的通信都是借助于 Akka Framework,包括任务的状态以及 Checkpoint 触发等信息。

4.1.3 Client 客户端

客户端负责将任务提交到集群,与 JobManager 构建 Akka 连接,然后将任务提交到 JobManager,通过和 JobManager 之间进行交互获取任务执行状态。
客户端提交任务可以采用 CLI 方式或者通过使用 Flink WebUI 提交,也可以在应用程序中指定 JobManager 的 RPC 网络端口构建 ExecutionEnvironment 提交 Flink 应用。

4.1.4 JobManager

JobManager 负责整个 Flink 集群任务的调度以及资源的管理,从客户端中获取提交的应用,然后根据集群中 TaskManager 上 TaskSlot 的使用情况,为提交的应用分配相应的 TaskSlot 资源并命令 TaskManager 启动从客户端中获取的应用。
JobManager 相当于整个集群的 Master 节点,且整个集群有且只有一个活跃的 JobManager ,负责整个集群的任务管理和资源管理。
JobManager 和 TaskManager 之间通过 Actor System 进行通信,获取任务执行的情况并通过 Actor System 将应用的任务执行情况发送给客户端。
同时在任务执行的过程中,Flink JobManager 会触发 Checkpoint 操作,每个 TaskManager 节点 收到 Checkpoint 触发指令后,完成 Checkpoint 操作,所有的 Checkpoint 协调过程都是在 Fink JobManager 中完成。
当任务完成后,Flink 会将任务执行的信息反馈给客户端,并且释放掉 TaskManager 中的资源以供下一次提交任务使用。

4.1.5 TaskManager

TaskManager 相当于整个集群的 Slave 节点,负责具体的任务执行和对应任务在每个节点上的资源申请和管理。
客户端通过将编写好的 Flink 应用编译打包,提交到 JobManager,然后 JobManager 会根据已注册在 JobManager 中 TaskManager 的资源情况,将任务分配给有资源的 TaskManager节点,然后启动并运行任务。
TaskManager 从 JobManager 接收需要部署的任务,然后使用 Slot 资源启动 Task,建立数据接入的网络连接,接收数据并开始数据处理。同时 TaskManager 之间的数据交互都是通过数据流的方式进行的。
可以看出,Flink 的任务运行其实是采用多线程的方式,这和 MapReduce 多 JVM 进行的方式有很大的区别,Flink 能够极大提高 CPU 使用效率,在多个任务和 Task 之间通过 TaskSlot 方式共享系统资源,每个 TaskManager 中通过管理多个 TaskSlot 资源池进行对资源进行有效管理。

4.2 TaskManager 与 Slot

4.2.1 Slot

Flink 的每个 TaskManager 为集群提供 slot。slot 的数量通常与每个 TaskManager 节点的可用 CPU 核数成比例,一般情况下 slot 数就设置为每个节点的 CPU 核数(对应 flink-conf.yaml 里的 taskmanager.numberOfTaskSlots 配置项)。

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-FAbuFPkJ-1638892018930)(assets/1574472572814.png)]

4.2.2 并行度

一个Flink程序由多个任务组成(source、transformation和 sink)。 一个任务由多个并行的实例(线程)来执行, 一个任务的并行实例(线程)数目就被称为该任务的并行度。

4.2.3 并行度的设置

一个任务的并行度设置可以从多个层次指定

•Operator Level(算子层次)

•Execution Environment Level(执行环境层次)

•Client Level(客户端层次)

•System Level(系统层次)

算子层次

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-dNnqxNMx-1638892018930)(assets/1574472860477.png)]

执行环境层次

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-Wf9ebwHm-1638892018930)(assets/1574472880358.png)]
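
上面两张截图如果看不到,可以参考下面这个简单示意(MyFlatMap 为假设的函数实现):

java
// 执行环境层次:对整个 job 生效
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2);

// 算子层次:直接跟在某个算子后面,会覆盖执行环境层次的设置
env.socketTextStream("localhost", 8888)
        .flatMap(new MyFlatMap()).setParallelism(4)
        .print().setParallelism(1);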

客户端层次

并行度可以在客户端将job提交到Flink时设定,对于CLI客户端,可以通过-p参数指定并行度

java
./bin/flink run -p 10 WordCount.jar

系统层次

在系统级可以通过设置flink-conf.yaml文件中的parallelism.default属性来指定所有执行环境的默认并行度

4.2.4 案例演示

并行度为1

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-bE3VWtKu-1638892018930)(assets/1574473161843.png)]

各种不同的并行度

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LCXV7bwc-1638892018931)(assets/1574473199938.png)]

4.3 任务提交

4.3.1 把任务提交到yarn上

java
演示一:
flink run -m yarn-cluster -p 2 -yn 2 -yjm 1024 -ytm 1024 -c streaming.slot.lesson01.WordCount flinklesson-1.0-SNAPSHOT.jar

演示二:
flink run -m yarn-cluster -p 3 -yn 2 -yjm 1024 -ytm 1024 -c streaming.slot.lesson01.WordCount flinklesson-1.0-SNAPSHOT.jar

4.3.2 把任务提交到standalone集群

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-hsekYYlN-1638892018931)(assets/1574475633598.png)]

java
演示一:
flink run -c streaming.slot.lesson01.WordCount -p 2 flinklesson-1.0-SNAPSHOT.jar
演示二:
flink run -c streaming.slot.lesson01.WordCount -p 3 flinklesson-1.0-SNAPSHOT.jar

4.4 task

4.4.1 数据传输的方式

forward strategy

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ZhFyOrJE-1638892018931)(assets/1574478499283.png)]

  1. 一个 task 的输出只发送给一个 task 作为输入
  2. 如果两个 task 都在一个 JVM 中的话,那么就可以避免网络开销

key based strategy

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-IU21DFJr-1638892018932)(assets/1574478613578.png)]

  1. 数据需要按照某个属性(我们称为 key)进行分组(或者说分区)

  2. 相同 key 的数据需要传输给同一个 task,在一个 task 中进行处理

broadcast strategy

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-rtA7ZZR5-1638892018932)(assets/1574478760221.png)]

random strategy

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-EWdVIobK-1638892018932)(assets/1574478869374.png)]

  1. 数据随机地从一个 task 传输给下一个 operator 的某个 subtask
  2. 保证数据能均匀的传输给所有的 subtask
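
这几种传输策略在 DataStream API 里都有对应的算子,可以显式指定(示意,dataStream 为假设的流):

java
dataStream.forward();    // forward strategy:上下游一对一直连(要求上下游并行度相同)
dataStream.keyBy(0);     // key based strategy:按 key 分区
dataStream.broadcast();  // broadcast strategy:广播给下游所有 subtask
dataStream.shuffle();    // random strategy:随机发送
dataStream.rebalance();  // 轮询发送,保证数据均匀地分布到下游所有 subtask
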
4.4.2 Operator Chain

代码:

java
public class WordCount {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
String topic=“testSlot”;
Properties consumerProperties = new Properties();
consumerProperties.setProperty(“bootstrap.servers”,“192.168.167.254:9092”);
consumerProperties.setProperty(“group.id”,“testSlot_consumer”);

    FlinkKafkaConsumer011 myConsumer =
            new FlinkKafkaConsumer011<>(topic, new SimpleStringSchema(), consumerProperties);

    DataStreamSource data = env.addSource(myConsumer).setParallelism(3);

    SingleOutputStreamOperator> wordOneStream = data.flatMap(new FlatMapFunction>() {
        @Override
        public void flatMap(String line,
                            Collector> out) throws Exception {
            String[] fields = line.split(",");
            for (String word : fields) {
                out.collect(Tuple2.of(word, 1));
            }
        }
    }).setParallelism(2);

    SingleOutputStreamOperator> result = wordOneStream.keyBy(0).sum(1).setParallelism(2);

    result.map( tuple -> tuple.toString()).setParallelism(2)
            .print().setParallelism(1);

    env.execute("WordCount2");

}

}

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-CCFxCtzJ-1638892018933)(assets/1574479305528.png)]****

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-JmMkpdAT-1638892018933)(assets/1574479086824.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-PTRtnjOB-1638892018934)(assets/1574479147372.png)]

Operator Chain的条件:

  1. 数据传输策略是 forward strategy
  2. 在同一个 TaskManager 中运行
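
如果想人为控制 chain 的行为,可以用下面这些 API(示意,someOperator 为假设的算子):

java
env.disableOperatorChaining();   // 整个 job 禁用 operator chain
someOperator.disableChaining();  // 当前算子不和前后的算子 chain 在一起
someOperator.startNewChain();    // 从当前算子开始一条新的 chain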

并行度设置为1:

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-EhGPsWoW-1638892018934)(assets/1574480303336.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-QVNAEQJ9-1638892018934)(assets/1574480321456.png)]

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-5nYVQkPK-1638892018934)(assets/1574480283016.png)]

4.1.1 需求背景

针对算法产生的日志数据进行清洗拆分

•1:算法产生的日志数据是嵌套json格式,需要拆分打平

•2:针对算法中的国家字段进行大区转换

•3:把数据回写到Kafka

4.1.2 项目架构

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-YxmZsyh8-1638892042235)(assets/项目架构-1573899524477.png)]

视频网站(抖音),生成日志的时候,他们日志里面是把多条数据合并成一条数据了。

4.1.3 方案设计

日志格式:

java
直播平台(不是国内的,但是类似于国内的抖音)
处理前:
{“dt”:“2019-11-19 20:33:39”,“countryCode”:“TW”,“data”:[{“type”:“s1”,“score”:0.8,“level”:“D”},{“type”:“s2”,“score”:0.1,“level”:“B”}]}

kafka:
如何去评估存储,10亿,评估每条数据多大,50k-几百k
我们公司里面还有几百个topic,数据都是这样的一个情况,所以我们有很多的实时任务都是进行ETL
处理后:
“dt”:“2019-11-19 20:33:39”,“countryCode”:“TW”,“type”:“s1”,“score”:0.8,“level”:“D”
“dt”:“2019-11-19 20:33:39”,“countryCode”:“TW”,“type”:“s2”,“score”:0.1,“level”:“B”

其实是需要我们处理成:
“dt”:“2019-11-19 20:33:39”,“area”:“AREA_CT”,“type”:“s1”,“score”:0.8,“level”:“D”
“dt”:“2019-11-19 20:33:39”,“area”:“AREA_CT”,“type”:“s2”,“score”:0.1,“level”:“B”

我们日志里面有地区,地区用的是编号,需要我们做ETL的时候顺带也要转化一下。

如果用SparkStrimming怎么做?
1.读取redis里面的数据,作为一个广播变量
2.读区Kafka里面的日志数据
flatMap,把广播变量传进去。
如果是用flink又怎么做?

hset areas AREA_US US
hset areas AREA_CT TW,HK
hset areas AREA_AR PK,KW,SA
hset areas AREA_IN IN

flink -> reids -> k,v HashMap
US,AREA_US
TW,AREA_CT
HK,AREA_CT
IN,AREA_IN

{“dt”:“2019-11-19 20:33:41”,“countryCode”:“KW”,“data”:[{“type”:“s2”,“score”:0.2,“level”:“A”},{“type”:“s1”,“score”:0.2,“level”:“D”}]}

{“dt”:“2019-11-19 20:33:43”,“countryCode”:“HK”,“data”:[{“type”:“s5”,“score”:0.5,“level”:“C”},{“type”:“s2”,“score”:0.8,“level”:“B”}]}

reids码表格式(元数据):

java
大区 国家
hset areas AREA_US US
hset areas AREA_CT TW,HK
hset areas AREA_AR PK,KW,SA
hset areas AREA_IN IN

操作:

java
HKEYS areas
HGETALL areas

4.2 实时报表

4.2.1 需求背景

主要针对直播/短视频平台审核指标的统计

•1:统计不同大区每1 min内过审(上架)视频的数据量(单词的个数)

​ 分析一下:

​ 统计的是大区,不同的大区,大区应该就是一个分组的字段,每分钟(时间)的有效视频(Process时间,事件的事件?)

每分钟【1:事件时间 2:加上水位,这样的话,我们可以挽救一些数据。3:收集数据延迟过多的数据】的不同大区的【有效视频】的数量(单词计数)
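
按这个思路,核心的窗口部分大致是下面这个样子(一个最简化的示意,不是最终实现;auditStream 假设为 DataStream<Tuple2<Long, String>>,f0 是事件时间戳、f1 是大区,最大乱序时间假设为 30 秒):

java
SingleOutputStreamOperator<String> report = auditStream
        // 事件时间 + 最大乱序 30 秒的 watermark
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<Long, String>>(Time.seconds(30)) {
                    @Override
                    public long extractTimestamp(Tuple2<Long, String> element) {
                        return element.f0;
                    }
                })
        .keyBy(1)                                              // 按大区分组
        .timeWindow(Time.minutes(1))                           // 每 1 分钟一个滚动窗口
        .apply(new WindowFunction<Tuple2<Long, String>, String, Tuple, TimeWindow>() {
            @Override
            public void apply(Tuple key, TimeWindow window,
                              Iterable<Tuple2<Long, String>> input, Collector<String> out) {
                long count = 0;
                for (Tuple2<Long, String> ignored : input) {
                    count++;
                }
                out.collect(key.getField(0) + " 过去 1 分钟的数量:" + count);
            }
        });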

PM:产品经理

•2:统计不同大区每1 min内未过审(下架)的数据量

我们公司的是一个电商的平台(京东,淘宝)

京东 -》 店主 -〉 上架商品 -》 通过审核了,可以上架了,有效商品数

每分钟的不同主题的有效商品数。

【衣服】

【鞋】

【书】

【电子产品】

淘宝 -》 店主 -〉 上架商品 -》 未通过审核,下架 -〉 无效的商品数

4.2.2 项目架构

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-xMJyYLig-1638892042236)(assets/实时报表项目架构.png)]

4.2.3 方案设计

日志格式:

一条数据就代表一个有效视频。

  1. 统计的过去的一分钟的每个大区的有效视频数量

  2. 统计的过去的一分钟的每个大区的,不同类型的有效视频数量

统计的过去一分钟是每个单词出现次数。

java
{“dt”:“2019-11-20 15:09:43”,“type”:“child_unshelf”,“username”:“shenhe5”,“area”:“AREA_ID”}
{“dt”:“2019-11-20 15:09:44”,“type”:“chlid_shelf”,“username”:“shenhe2”,“area”:“AREA_ID”}
{“dt”:“2019-11-20 15:09:45”,“type”:“black”,“username”:“shenhe2”,“area”:“AREA_US”}
{“dt”:“2019-11-20 15:09:46”,“type”:“chlid_shelf”,“username”:“shenhe3”,“area”:“AREA_US”}
{“dt”:“2019-11-20 15:09:47”,“type”:“unshelf”,“username”:“shenhe3”,“area”:“AREA_ID”}
{“dt”:“2019-11-20 15:09:48”,“type”:“black”,“username”:“shenhe4”,“area”:“AREA_IN”}

pom文件:

java

1.9.0
2.11.8



    
        
            org.apache.flink
            flink-java
            ${flink.version}
        
        
            org.apache.flink
            flink-streaming-java_2.11
            ${flink.version}
        
        
            org.apache.flink
            flink-scala_2.11
            ${flink.version}
        
        
            org.apache.flink
            flink-streaming-scala_2.11
            ${flink.version}
        

        
            org.apache.bahir
            flink-connector-redis_2.11
            1.0
        

        
            org.apache.flink
            flink-statebackend-rocksdb_2.11
            ${flink.version}
        

        
            org.apache.flink
            flink-connector-kafka-0.11_2.11
            ${flink.version}
        

        
            org.apache.kafka
            kafka-clients
            0.11.0.3
        
        
        
            org.slf4j
            slf4j-api
            1.7.25
        

        
            org.slf4j
            slf4j-log4j12
            1.7.25
        
        
        
            redis.clients
            jedis
            2.9.0
        
        
        
            com.alibaba
            fastjson
            1.2.44
        

        
        
            org.apache.flink
            flink-connector-elasticsearch6_2.11
            ${flink.version}
        

    




    
        
            org.apache.maven.plugins
            maven-compiler-plugin
            3.1
            
                1.8
                1.8
                
                    /src/test/**
                
                utf-8
            
        
        
            net.alchim31.maven
            scala-maven-plugin
            3.2.0
            
                
                    compile-scala
                    compile
                    
                        add-source
                        compile
                    
                
                
                    test-compile-scala
                    test-compile
                    
                        add-source
                        testCompile
                    
                
            
            
                ${scala.version}
            
        
        
            maven-assembly-plugin
            
                
                    jar-with-dependencies
                
            
            
                
                    make-assembly 
                    package 
                    
                        single
                    
                
            
        
    

ETL 子模块的 pom.xml(继承父工程 kkbPro):

<parent>
    <artifactId>kkbPro</artifactId>
    <groupId>com.kkb.flink</groupId>
    <version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>ETL</artifactId>

<dependencies>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-java</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-java_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-scala_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-scala_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.bahir</groupId><artifactId>flink-connector-redis_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-statebackend-rocksdb_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-connector-kafka-0.11_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.kafka</groupId><artifactId>kafka-clients</artifactId></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-api</artifactId></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-log4j12</artifactId></dependency>
    <dependency><groupId>redis.clients</groupId><artifactId>jedis</artifactId></dependency>
    <dependency><groupId>com.alibaba</groupId><artifactId>fastjson</artifactId></dependency>
</dependencies>

package com.kkb.core;

import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import com.kkb.source.KkbRedisSource;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.connectors.kafka.internals.KeyedSerializationSchemaWrapper;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Properties;

/**

  • 数据清洗
    */
    public class DataClean {
    public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    //我们是从Kafka里面读取数据,所以这儿就是topic有多少个partition,那么就设置几个并行度。
    env.setParallelism(3);
    env.enableCheckpointing(60000);
    env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10000);
    env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
    env.getCheckpointConfig().enableExternalizedCheckpoints(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    //注释:我们这儿其实需要设置 state backend 类型,把 checkpoint 的数据存储到
    //rocksdb 里面
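    // 例如(示意,checkpoint 路径为假设值):
    // env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:9000/flink/checkpoints", true));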

     //第一步:从Kafka里面读取数据 消费者 数据源需要kafka
     //topic的取名还是有讲究的,最好就是让人根据这个名字就能知道里面有什么数据。
     //xxxx_xxx_xxx_xxx
     String topic="allData";
     Properties consumerProperties = new Properties();
     consumerProperties.put("bootstrap.servers","192.168.167.254:9092");
     consumerProperties.put("group.id","allTopic_consumer");
    
     /**
      * String topic, 主题
      * KafkaDeserializationSchema deserializer,
      * Properties props
      */
     FlinkKafkaConsumer011 consumer = new FlinkKafkaConsumer011<>(topic,
             new SimpleStringSchema(),
             consumerProperties);
     //{"dt":"2019-11-24 19:54:23","countryCode":"PK","data":[{"type":"s4","score":0.8,"level":"C"},{"type":"s5","score":0.2,"level":"C"}]}
     DataStreamSource allData = env.addSource(consumer);
     //设置为广播变量
     DataStream> mapData = env.addSource(new KkbRedisSource()).broadcast();
     SingleOutputStreamOperator etlData = allData.connect(mapData).flatMap(new CoFlatMapFunction, String>() {
         HashMap allMap = new HashMap();
    
         //里面处理的是kafka的数据
         @Override
         public void flatMap1(String line, Collector out) throws Exception {
             JSONObject jsonObject = JSONObject.parseObject(line);
             String dt = jsonObject.getString("dt");
             String countryCode = jsonObject.getString("countryCode");
             //可以根据countryCode获取大区的名字
             String area = allMap.get(countryCode);
             JSONArray data = jsonObject.getJSONArray("data");
             for (int i = 0; i < data.size(); i++) {
                 JSONObject dataObject = data.getJSONObject(i);
                 System.out.println("大区:"+area);
                 dataObject.put("dt", dt);
                 dataObject.put("area", area);
                 //下游获取到数据的时候,也就是一个json格式的数据
                 out.collect(dataObject.toJSONString());
             }
    
    
         }
    
         //里面处理的是redis里面的数据
         @Override
         public void flatMap2(HashMap map,
                              Collector collector) throws Exception {
             System.out.println(map.toString());
             allMap = map;
    
         }
     });
    
     //ETL -> load kafka
    
    
     etlData.print().setParallelism(1);
    
     /**
      * String topicId,
      * SerializationSchema serializationSchema,
      * Properties producerConfig)
      */
    

// String outputTopic=“allDataClean”;
// Properties producerProperties = new Properties();
// producerProperties.put(“bootstrap.servers”,“192.168.167.254:9092”);
// FlinkKafkaProducer011 producer = new FlinkKafkaProducer011<>(outputTopic,
// new KeyedSerializationSchemaWrapper(new SimpleStringSchema()),
// producerProperties);
//
// //搞一个Kafka的生产者
// etlData.addSink(producer);

    env.execute("DataClean");


}

}

package com.kkb.producer;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.Random;

/**

  • 模拟数据源
    */
    public class kafkaProducer {

    public static void main(String[] args) throws Exception{
    Properties prop = new Properties();
    //指定kafka broker地址
    prop.put(“bootstrap.servers”, “192.168.167.254:9092”);
    //指定key value的序列化方式
    prop.put(“key.serializer”, StringSerializer.class.getName());
    prop.put(“value.serializer”, StringSerializer.class.getName());
    //指定topic名称
    String topic = “allData”;

     //创建producer链接
     KafkaProducer producer = new KafkaProducer(prop);
    
     //{"dt":"2018-01-01 10:11:11","countryCode":"US","data":[{"type":"s1","score":0.3,"level":"A"},{"type":"s2","score":0.2,"level":"B"}]}
    
    
     while(true){
         String message = "{\"dt\":\""+getCurrentTime()+"\",\"countryCode\":\""+getCountryCode()+"\",\"data\":[{\"type\":\""+getRandomType()+"\",\"score\":"+getRandomScore()+",\"level\":\""+getRandomLevel()+"\"},{\"type\":\""+getRandomType()+"\",\"score\":"+getRandomScore()+",\"level\":\""+getRandomLevel()+"\"}]}";
         System.out.println(message);
         //同步的方式,往Kafka里面生产数据
        producer.send(new ProducerRecord(topic,message));
         Thread.sleep(2000);
     }
     //关闭链接
     //producer.close();
    

    }

    public static String getCurrentTime(){
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    return sdf.format(new Date());
    }

    public static String getCountryCode(){
    String[] types = {“US”,“TW”,“HK”,“PK”,“KW”,“SA”,“IN”};
    Random random = new Random();
    int i = random.nextInt(types.length);
    return types[i];
    }

    public static String getRandomType(){
    String[] types = {“s1”,“s2”,“s3”,“s4”,“s5”};
    Random random = new Random();
    int i = random.nextInt(types.length);
    return types[i];
    }

    public static double getRandomScore(){
    double[] types = {0.3,0.2,0.1,0.5,0.8};
    Random random = new Random();
    int i = random.nextInt(types.length);
    return types[i];
    }

    public static String getRandomLevel(){
    String[] types = {“A”,“A+”,“B”,“C”,“D”};
    Random random = new Random();
    int i = random.nextInt(types.length);
    return types[i];
    }

}


package com.kkb.source;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.exceptions.JedisConnectionException;

import java.util.HashMap;
import java.util.Map;

/**
*

  • hset areas AREA_US US
  • hset areas AREA_CT TW,HK
  • hset areas AREA_AR PK,KW,SA
  • hset areas AREA_IN IN
  • IN,AREA_IN

*/
public class KkbRedisSource implements SourceFunction<HashMap<String, String>> {

private Logger logger = LoggerFactory.getLogger(KkbRedisSource.class);

private Jedis jedis;
private boolean isRunning = true;

@Override
public void run(SourceContext<HashMap<String, String>> cxt) throws Exception {
    this.jedis = new Jedis("192.168.167.254", 6379);
    HashMap<String, String> map = new HashMap<>();
    while (isRunning) {
        try {
            map.clear();
            // read the whole "areas" hash: area -> comma-separated country codes
            Map<String, String> areas = jedis.hgetAll("areas");
            for (Map.Entry<String, String> entry : areas.entrySet()) {
                String area = entry.getKey();
                String value = entry.getValue();
                String[] fields = value.split(",");
                for (String country : fields) {
                    // emit the reverse mapping: country -> area
                    map.put(country, area);
                }
            }
            if (map.size() > 0) {
                cxt.collect(map);
            }
            // refresh once per minute
            Thread.sleep(60000);
        } catch (JedisConnectionException e) {
            logger.error("redis connection error", e.getCause());
            this.jedis = new Jedis("192.168.167.254", 6379);
        } catch (Exception e) {
            logger.error("source error", e.getCause());
        }
    }

}

@Override
public void cancel() {
    isRunning=false;
    if(jedis != null){
        jedis.close();
    }

}

}
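
For orientation, a minimal sketch of how this source might be wired into a job; the class name `KkbRedisSourceWiring`, the broadcast step and the variable names are illustrative assumptions, not part of the original code:

```java
import java.util.HashMap;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KkbRedisSourceWiring {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // country -> area mapping, refreshed from Redis roughly once per minute
        DataStream<HashMap<String, String>> mappingStream = env.addSource(new KkbRedisSource());

        // broadcast the mapping so every parallel task of a downstream enrichment operator
        // (e.g. one that rewrites countryCode into its area) sees the full, latest map
        DataStream<HashMap<String, String>> broadcastMapping = mappingStream.broadcast();

        broadcastMapping.print();
        env.execute("KkbRedisSource wiring sketch");
    }
}
```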

<!-- POM of the Report module (reconstructed from the flattened text; dependency versions
     are expected to be managed by the parent kkbPro pom) -->
<parent>
    <artifactId>kkbPro</artifactId>
    <groupId>com.kkb.flink</groupId>
    <version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>Report</artifactId>

<dependencies>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-java</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-java_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-scala_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-scala_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.bahir</groupId><artifactId>flink-connector-redis_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-statebackend-rocksdb_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-connector-kafka-0.11_2.11</artifactId></dependency>
    <dependency><groupId>org.apache.kafka</groupId><artifactId>kafka-clients</artifactId></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-api</artifactId></dependency>
    <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-log4j12</artifactId></dependency>
    <dependency><groupId>redis.clients</groupId><artifactId>jedis</artifactId></dependency>
    <dependency><groupId>com.alibaba</groupId><artifactId>fastjson</artifactId></dependency>
</dependencies>

package com.kkb.core;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import com.kkb.function.MySumFuction;
import com.kkb.watermark.MyWaterMark;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.connectors.kafka.internals.KeyedSerializationSchemaWrapper;
import org.apache.flink.util.OutputTag;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;

/**
 * ETL: pre-process the incoming data
 * Report: compute the reporting metrics
 */
public class DataReport {

    public static void main(String[] args) throws Exception{

     StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    
     env.setParallelism(3);
     env.enableCheckpointing(60000);
     env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
     env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10000);
     env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
     env.getCheckpointConfig().enableExternalizedCheckpoints(
             CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
     //env.setStateBackend(new RocksDBStateBackend(""));
     // use event time
     env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
     String topic = "auditLog";
     Properties consumerProperties = new Properties();
     consumerProperties.put("bootstrap.servers", "192.168.167.254:9092");
     consumerProperties.put("group.id", "auditLog_consumer");

     // read the audit log from Kafka, one JSON string per record, e.g.
     // {"dt":"2019-11-24 21:19:47","type":"child_unshelf","username":"shenhe1","area":"AREA_ID"}
     FlinkKafkaConsumer011<String> consumer =
             new FlinkKafkaConsumer011<>(topic, new SimpleStringSchema(), consumerProperties);
     DataStreamSource<String> data = env.addSource(consumer);
    
     Logger logger= LoggerFactory.getLogger(DataReport.class);
    
     // parse the raw JSON into (time, type, area)
     SingleOutputStreamOperator<Tuple3<Long, String, String>> preData = data.map(new MapFunction<String, Tuple3<Long, String, String>>() {
         /**
          * Long:   event time (epoch millis)
          * String: type
          * String: area
          */
         @Override
         public Tuple3<Long, String, String> map(String line) throws Exception {
             JSONObject jsonObject = JSON.parseObject(line);
             String dt = jsonObject.getString("dt");
             String type = jsonObject.getString("type");
             String area = jsonObject.getString("area");
             long time = 0;
    
             try {
                 SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                 time = sdf.parse(dt).getTime();
             } catch (ParseException e) {
                 logger.error("时间解析失败,dt:" + dt, e.getCause());
             }
    
    
             return Tuple3.of(time, type, area);
         }
     });
    
     /**
      * drop records whose timestamp could not be parsed
      */
     SingleOutputStreamOperator<Tuple3<Long, String, String>> filterData = preData.filter(tuple3 -> tuple3.f0 != 0);

     /**
      * side output for data that arrives too late
      */
     OutputTag<Tuple3<Long, String, String>> outputTag =
             new OutputTag<Tuple3<Long, String, String>>("late-data"){};
     /**
      * window aggregation:
      * count the number of valid records per area and per type within each event-time window
      */
     SingleOutputStreamOperator<Tuple4<String, String, String, Long>> resultData = filterData.assignTimestampsAndWatermarks(new MyWaterMark())
             .keyBy(1, 2)
             .window(TumblingEventTimeWindows.of(Time.seconds(30)))
             .sideOutputLateData(outputTag)
             .apply(new MySumFuction());
    
    
     /**
      * the late data collected in the side output is required (by the business) to be written to Kafka
      */
     SingleOutputStreamOperator<String> sideOutput =
             // Java 8 lambda
             resultData.getSideOutput(outputTag).map(line -> line.toString());
    

// String outputTopic = "lateData";
// Properties producerProperties = new Properties();
// producerProperties.put("bootstrap.servers", "192.168.167.254:9092");
// FlinkKafkaProducer011<String> producer = new FlinkKafkaProducer011<>(outputTopic,
//         new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
//         producerProperties);
// sideOutput.addSink(producer);

    /**
     * The business side wants the result written to Elasticsearch;
     * in our setup we write the data to Kafka instead.
     */
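    // (Sketch only, not in the original code.) If the report itself also had to go to Kafka,
    // it could reuse the producer pattern from the commented-out block above; the topic name
    // "reportResult" is an assumption.
    // FlinkKafkaProducer011<String> reportProducer = new FlinkKafkaProducer011<>("reportResult",
    //         new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
    //         producerProperties);
    // resultData.map(t -> t.toString()).addSink(reportProducer);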

    resultData.print();


    env.execute("DataReport");

}

}
package com.kkb.function;

import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.text.SimpleDateFormat;
import java.util.Date;

/**
 * IN  - the input type
 * OUT - the output type
 * KEY - in Flink this is the grouping key; it always shows up as a Tuple:
 *       if you group by a single field the Tuple has one field,
 *       if you group by several fields it has several fields.
 * W extends Window
 */
public class MySumFuction implements WindowFunction<Tuple3<Long, String, String>,
        Tuple4<String, String, String, Long>, Tuple, TimeWindow> {
    @Override
    public void apply(Tuple tuple, TimeWindow timeWindow,
                      Iterable<Tuple3<Long, String, String>> input,
                      Collector<Tuple4<String, String, String, Long>> out) {
    // the key Tuple holds the fields selected by keyBy(1, 2): type and area
    String type = tuple.getField(0).toString();
    String area = tuple.getField(1).toString();

    java.util.Iterator<Tuple3<Long, String, String>> iterator = input.iterator();
    long count=0;
    while(iterator.hasNext()){
        iterator.next();
        count++;
    }
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    String time = sdf.format(new Date(timeWindow.getEnd()));


    Tuple4<String, String, String, Long> result =
            new Tuple4<>(time, type, area, count);
    out.collect(result);
}

}
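
To make the KEY type parameter above concrete, here is a small illustrative snippet (not from the original code) showing how the key Tuple relates to the `keyBy` fields:

```java
import org.apache.flink.api.java.tuple.Tuple;
import org.apache.flink.api.java.tuple.Tuple2;

public class KeyTupleDemo {
    public static void main(String[] args) {
        // keyBy(1, 2) hands the window function a Tuple with two fields;
        // here we build one by hand to show how getField works.
        Tuple key = Tuple2.of("child_unshelf", "AREA_IN");
        String type = key.getField(0).toString();  // "child_unshelf"
        String area = key.getField(1).toString();  // "AREA_IN"
        System.out.println(type + " / " + area);
    }
}
```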
package com.kkb.source;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Properties;
import java.util.Random;

/**
*
*/
public class ProducerDataReport {

public static void main(String[] args) throws Exception{
    Properties prop = new Properties();
    // Kafka broker address
    prop.put("bootstrap.servers", "192.168.167.254:9092");
    // key/value serializers
    prop.put("key.serializer", StringSerializer.class.getName());
    prop.put("value.serializer", StringSerializer.class.getName());
    // topic name
    String topic = "auditLog";

    // create the producer
    KafkaProducer<String, String> producer = new KafkaProducer<>(prop);

    // sample message:
    // {"dt":"2018-01-01 10:11:22","type":"shelf","username":"shenhe1","area":"AREA_US"}

    // produce messages forever
    while(true){
        String message = "{\"dt\":\""+getCurrentTime()+"\",\"type\":\""+getRandomType()+"\",\"username\":\""+getRandomUsername()+"\",\"area\":\""+getRandomArea()+"\"}";
        System.out.println(message);
        producer.send(new ProducerRecord<>(topic, message));
        Thread.sleep(500);
    }
    // close the producer (unreachable because of the endless loop)
    //producer.close();
}

public static String getCurrentTime(){
    // lowercase yyyy: uppercase YYYY is the week-based year and misbehaves around New Year
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    return sdf.format(new Date());
}

public static String getRandomArea(){
    String[] types = {"AREA_US","AREA_CT","AREA_AR","AREA_IN","AREA_ID"};
    Random random = new Random();
    int i = random.nextInt(types.length);
    return types[i];
}


public static String getRandomType(){
    String[] types = {"shelf","unshelf","black","chlid_shelf","child_unshelf"};
    Random random = new Random();
    int i = random.nextInt(types.length);
    return types[i];
}


public static String getRandomUsername(){
    String[] types = {"shenhe1","shenhe2","shenhe3","shenhe4","shenhe5"};
    Random random = new Random();
    int i = random.nextInt(types.length);
    return types[i];
}

}
package com.kkb.watermark;

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

import javax.annotation.Nullable;

/**
*
*/
public class MyWaterMark
implements AssignerWithPeriodicWatermarks<Tuple3<Long, String, String>> {

long currentMaxTimestamp = 0L;
final long maxOutputOfOrderness = 20000L; // allowed out-of-orderness (20s)
@Nullable
@Override
public Watermark getCurrentWatermark() {
    return new Watermark(currentMaxTimestamp - maxOutputOfOrderness);
}

@Override
public long extractTimestamp(Tuple3<Long, String, String> element, long l) {
    Long timeStamp = element.f0;
    currentMaxTimestamp=Math.max(timeStamp,currentMaxTimestamp);
    return timeStamp;
}

}
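
As a quick sanity check on the numbers above (a hand-written illustration, not part of the original code): with a 20-second out-of-orderness allowance, a 30-second event-time window only fires once the watermark passes its end, i.e. once an event at least 20 seconds newer than the window end has been seen.

```java
// Illustration of how MyWaterMark interacts with a 30-second event-time window.
public class WatermarkMath {
    public static void main(String[] args) {
        long maxOutOfOrderness = 20_000L;        // 20s allowance, as in MyWaterMark
        long windowEnd = 30_000L;                // tumbling window [0s, 30s)
        long currentMaxTimestamp = 50_000L;      // newest event timestamp seen so far

        // periodic watermark emitted by MyWaterMark
        long watermark = currentMaxTimestamp - maxOutOfOrderness;   // 30_000
        // Flink evaluates the window once the watermark passes its max timestamp (end - 1 ms)
        boolean windowFires = watermark >= windowEnd - 1;
        System.out.println("watermark=" + watermark + ", window fires: " + windowFires);
    }
}
```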

<!-- POM of the flinklesson project (reconstructed from the flattened text) -->
<modelVersion>4.0.0</modelVersion>
<groupId>com.kkb.test</groupId>
<artifactId>flinklesson</artifactId>
<version>1.0-SNAPSHOT</version>

<properties>
    <flink.version>1.9.0</flink.version>
    <scala.version>2.11.8</scala.version>
</properties>

<dependencies>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-java_2.11</artifactId><version>${flink.version}</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-runtime-web_2.11</artifactId><version>${flink.version}</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-scala_2.11</artifactId><version>${flink.version}</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-statebackend-rocksdb_2.11</artifactId><version>${flink.version}</version></dependency>
    <dependency><groupId>joda-time</groupId><artifactId>joda-time</artifactId><version>2.7</version></dependency>
    <dependency><groupId>org.apache.bahir</groupId><artifactId>flink-connector-redis_2.11</artifactId><version>1.0</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-connector-kafka-0.11_2.11</artifactId><version>${flink.version}</version></dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <excludes>
                    <exclude>/src/test/**</exclude>
                </excludes>
                <encoding>utf-8</encoding>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <id>compile-scala</id>
                    <phase>compile</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>test-compile-scala</id>
                    <phase>test-compile</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
            <configuration>
                <scalaVersion>${scala.version}</scalaVersion>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

package streaming.sink;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;

/**
 * Write data to Redis
 */
public class SinkForRedisDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> text = env.socketTextStream("192.168.167.254", 8888, "\n");
        // lpush l_words word
        // wrap each String into a Tuple2 of (redis key, value)
        DataStream<Tuple2<String, String>> l_wordsData = text.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String value) throws Exception {
                return new Tuple2<>("b", value);
            }
        });
        // Redis connection configuration
        FlinkJedisPoolConfig conf = new FlinkJedisPoolConfig.Builder().setHost("192.168.167.254").setPort(6379).build();

        // create the Redis sink
        RedisSink<Tuple2<String, String>> redisSink = new RedisSink<>(conf, new MyRedisMapper());
        l_wordsData.addSink(redisSink);
        env.execute("StreamingDemoToRedis");
    }

    public static class MyRedisMapper implements RedisMapper<Tuple2<String, String>> {
        // the redis key to operate on, taken from the incoming record
        @Override
        public String getKeyFromData(Tuple2<String, String> data) {
            return data.f0;
        }
        // the redis value to write, taken from the incoming record
        @Override
        public String getValueFromData(Tuple2<String, String> data) {
            return data.f1;
        }

        @Override
        public RedisCommandDescription getCommandDescription() {
            return new RedisCommandDescription(RedisCommand.LPUSH);
        }
    }
}
    package com.atguigu.apitest.sinktest

import java.util

import com.atguigu.apitest.SensorReading
import org.apache.flink.api.common.functions.RuntimeContext
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.elasticsearch.{ElasticsearchSinkFunction, RequestIndexer}
import org.apache.flink.streaming.connectors.elasticsearch6.ElasticsearchSink
import org.apache.http.HttpHost
import org.elasticsearch.client.Requests

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest.sinktest

  • Version: 1.0

  • Created by wushengran on 2019/9/17 16:27
    */
    object EsSinkTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // source
    val inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt")

    // transform
    val dataStream = inputStream
    .map(
    data => {
    val dataArray = data.split(",")
    SensorReading( dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble )
    }
    )

    val httpHosts = new util.ArrayList[HttpHost]()
    httpHosts.add(new HttpHost("localhost", 9200))

    // create a builder for the es sink
    val esSinkBuilder = new ElasticsearchSink.Builder[SensorReading](
    httpHosts,
    new ElasticsearchSinkFunction[SensorReading] {
    override def process(element: SensorReading, ctx: RuntimeContext, indexer: RequestIndexer): Unit = {
    println("saving data: " + element)
    // wrap the fields into a Map (or a JsonObject)
    val json = new util.HashMap[String, String]()
    json.put("sensor_id", element.id)
    json.put("temperature", element.temperature.toString)
    json.put("ts", element.timestamp.toString)

      // build the index request that will carry the data
      val indexRequest = Requests.indexRequest()
        .index("sensor")
        .`type`("readingdata")
        .source(json)

      // hand the request to the indexer, which writes it into ES
      indexer.add(indexRequest)
      println("data saved.")
    }
    

    }
    )

    // sink
    dataStream.addSink( esSinkBuilder.build() )

    env.execute("es sink test")
    }
    }
    package com.atguigu.apitest.sinktest

import java.sql.{Connection, DriverManager, PreparedStatement}

import com.atguigu.apitest.SensorReading
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
import org.apache.flink.streaming.api.scala._

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest.sinktest

  • Version: 1.0

  • Created by wushengran on 2019/9/17 16:44
    */
    object JdbcSinkTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // source
    val inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt")

    // transform
    val dataStream = inputStream
    .map(
    data => {
    val dataArray = data.split(",")
    SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
    }
    )

    // sink
    dataStream.addSink( new MyJdbcSink() )

    env.execute("jdbc sink test")
    }
    }

class MyJdbcSink() extends RichSinkFunction[SensorReading]{
// JDBC connection and prepared statements
var conn: Connection = _
var insertStmt: PreparedStatement = _
var updateStmt: PreparedStatement = _

// initialization: open the connection and prepare the statements
override def open(parameters: Configuration): Unit = {
super.open(parameters)
conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "123456")
insertStmt = conn.prepareStatement("INSERT INTO temperatures (sensor, temp) VALUES (?,?)")
updateStmt = conn.prepareStatement("UPDATE temperatures SET temp = ? WHERE sensor = ?")
}

// called for every record: run the SQL
override def invoke(value: SensorReading, context: SinkFunction.Context[_]): Unit = {
// try the update first
updateStmt.setDouble(1, value.temperature)
updateStmt.setString(2, value.id)
updateStmt.execute()
// if the update did not touch any row, do an insert instead
if( updateStmt.getUpdateCount == 0 ){
insertStmt.setString(1, value.id)
insertStmt.setDouble(2, value.temperature)
insertStmt.execute()
}
}

// clean up when the sink is closed
override def close(): Unit = {
insertStmt.close()
updateStmt.close()
conn.close()
}
}

package com.atguigu.apitest.sinktest

import java.util.Properties

import com.atguigu.apitest.SensorReading
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011.Semantic
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer011, FlinkKafkaProducer011}

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest.sinktest

  • Version: 1.0

  • Created by wushengran on 2019/9/17 15:43
    */
    object KafkaSinkTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // source
    //    val inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt")
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "localhost:9092")
    properties.setProperty("group.id", "consumer-group")
    properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    properties.setProperty("auto.offset.reset", "latest")

    val inputStream = env.addSource(new FlinkKafkaConsumer011[String]("sensor", new SimpleStringSchema(), properties))

    // transform

    val dataStream = inputStream
    .map(
    data => {
    val dataArray = data.split(",")
    SensorReading( dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble ).toString // convert to String so it is easy to serialize
    }
    )

    // sink
    dataStream.addSink( new FlinkKafkaProducer011[String]( "sinkTest", new SimpleStringSchema(), properties) )
    dataStream.print()

    env.execute("kafka sink test")
    }
    }
    package com.atguigu.apitest.sinktest

import com.atguigu.apitest.SensorReading
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest.sinktest

  • Version: 1.0

  • Created by wushengran on 2019/9/17 16:12
    */
    object RedisSinkTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // source
    val inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt")

    // transform
    val dataStream = inputStream
    .map(
    data => {
    val dataArray = data.split(",")
    SensorReading( dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble )
    }
    )

    val conf = new FlinkJedisPoolConfig.Builder()
    .setHost("localhost")
    .setPort(6379)
    .build()

    // sink
    dataStream.addSink( new RedisSink(conf, new MyRedisMapper()) )

    env.execute("redis sink test")
    }
    }

class MyRedisMapper() extends RedisMapper[SensorReading]{

// the redis command used to save the data
override def getCommandDescription: RedisCommandDescription = {
// store sensor id and temperature in a hash: HSET key field value
new RedisCommandDescription( RedisCommand.HSET, "sensor_temperature" )
}

// the value written to redis
override def getValueFromData(t: SensorReading): String = t.temperature.toString

// the key (hash field) used in redis
override def getKeyFromData(t: SensorReading): String = t.id
}
package com.atguigu.apitest

import org.apache.flink.api.common.functions.{RichFlatMapFunction, RichMapFunction}
import org.apache.flink.api.common.restartstrategy.RestartStrategies
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.runtime.state.memory.MemoryStateBackend
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest

  • Version: 1.0

  • Created by wushengran on 2019/8/24 10:14
    */
    object ProcessFunctionTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    env.enableCheckpointing(60000)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.AT_LEAST_ONCE)
    env.getCheckpointConfig.setCheckpointTimeout(100000)
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
    // env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(100)
    env.getCheckpointConfig.enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION)

    env.setRestartStrategy(RestartStrategies.failureRateRestart(3, org.apache.flink.api.common.time.Time.seconds(300), org.apache.flink.api.common.time.Time.seconds(10)))

// env.setStateBackend( new RocksDBStateBackend("") )

val stream = env.socketTextStream("localhost", 7777)

val dataStream = stream.map(data => {
  val dataArray = data.split(",")
  SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
})
  .assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[SensorReading]( Time.seconds(1) ) {
  override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000
} )

val processedStream = dataStream.keyBy(_.id)
  .process( new TempIncreAlert() )

val processedStream2 = dataStream.keyBy(_.id)

// .process( new TempChangeAlert(10.0) )
.flatMap( new TempChangeAlert(10.0) )

val processedStream3 = dataStream.keyBy(_.id)
  .flatMapWithState[(String, Double, Double), Double]{
  // no state yet (first element for this key): just store the current temperature
  case ( input: SensorReading, None ) => ( List.empty, Some(input.temperature) )
  // otherwise compare with the last temperature and emit an alert if the jump exceeds the threshold
  case ( input: SensorReading, lastTemp: Some[Double] ) =>
    val diff = ( input.temperature - lastTemp.get ).abs
    if( diff > 10.0 ){
      ( List((input.id, lastTemp.get, input.temperature)), Some(input.temperature) )
    } else
      ( List.empty, Some(input.temperature) )
}

dataStream.print("input data")
processedStream3.print("processed data")

env.execute("process function test")

}
}

class TempIncreAlert() extends KeyedProcessFunction[String, SensorReading, String]{

// state holding the previous temperature for this key
lazy val lastTemp: ValueState[Double] = getRuntimeContext.getState( new ValueStateDescriptor[Double]("lastTemp", classOf[Double]) )
// state holding the timestamp of the registered timer
lazy val currentTimer: ValueState[Long] = getRuntimeContext.getState( new ValueStateDescriptor[Long]("currentTimer", classOf[Long]) )

override def processElement(value: SensorReading, ctx: KeyedProcessFunction[String, SensorReading, String]#Context, out: Collector[String]): Unit = {
// read the previous temperature
val preTemp = lastTemp.value()
// update the stored temperature
lastTemp.update( value.temperature )

val curTimerTs = currentTimer.value()


if( value.temperature < preTemp || preTemp == 0.0 ){
  // temperature dropped, or this is the first record: delete the timer and clear the state
  ctx.timerService().deleteProcessingTimeTimer( curTimerTs )
  currentTimer.clear()
} else if ( value.temperature > preTemp && curTimerTs == 0 ){
  // temperature rose and no timer is registered yet: register one 5 seconds from now
  val timerTs = ctx.timerService().currentProcessingTime() + 5000L
  ctx.timerService().registerProcessingTimeTimer( timerTs )
  currentTimer.update( timerTs )
}

}

override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[String, SensorReading, String]#OnTimerContext, out: Collector[String]): Unit = {
// the timer fired: the temperature kept rising for 5 seconds, emit the alert
out.collect( ctx.getCurrentKey + " temperature keeps rising" )
currentTimer.clear()
}
}

class TempChangeAlert(threshold: Double) extends RichFlatMapFunction[SensorReading, (String, Double, Double)]{

private var lastTempState: ValueState[Double] = _

override def open(parameters: Configuration): Unit = {
// declare the state in open()
lastTempState = getRuntimeContext.getState(new ValueStateDescriptor[Double]("lastTemp", classOf[Double]))
}

override def flatMap(value: SensorReading, out: Collector[(String, Double, Double)]): Unit = {
// read the previous temperature
val lastTemp = lastTempState.value()
// compare with the current temperature; emit an alert if the change exceeds the threshold
val diff = (value.temperature - lastTemp).abs
if(diff > threshold){
out.collect( (value.id, lastTemp, value.temperature) )
}
lastTempState.update(value.temperature)
}

}

class TempChangeAlert2(threshold: Double) extends KeyedProcessFunction[String, SensorReading, (String, Double, Double)]{
// state holding the previous temperature
lazy val lastTempState: ValueState[Double] = getRuntimeContext.getState( new ValueStateDescriptor[Double]("lastTemp", classOf[Double]) )

override def processElement(value: SensorReading, ctx: KeyedProcessFunction[String, SensorReading, (String, Double, Double)]#Context, out: Collector[(String, Double, Double)]): Unit = {
// read the previous temperature
val lastTemp = lastTempState.value()
// compare with the current temperature; emit an alert if the change exceeds the threshold
val diff = (value.temperature - lastTemp).abs
if(diff > threshold){
out.collect( (value.id, lastTemp, value.temperature) )
}
lastTempState.update(value.temperature)
}
}
package com.atguigu.apitest

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.runtime.state.memory.MemoryStateBackend
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest

  • Version: 1.0

  • Created by wushengran on 2019/8/24 11:16
    */
    object SideOutputTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val stream = env.socketTextStream("localhost", 7777)

    val dataStream = stream.map(data => {
    val dataArray = data.split(",")
    SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
    })
    .assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[SensorReading]( Time.seconds(1) ) {
    override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000
    } )

    val processedStream = dataStream
    .process( new FreezingAlert() )

// dataStream.print("input data")
processedStream.print("processed data")
processedStream.getSideOutput( new OutputTag[String]("freezing alert") ).print("alert data")

env.execute("side output test")

}
}

// Freezing alert: if the temperature is below 32F, emit an alert to a side output
class FreezingAlert() extends ProcessFunction[SensorReading, SensorReading]{

//  lazy val alertOutput: OutputTag[String] = new OutputTag[String]( "freezing alert" )

override def processElement(value: SensorReading, ctx: ProcessFunction[SensorReading, SensorReading]#Context, out: Collector[SensorReading]): Unit = {
if( value.temperature < 32.0 ){
ctx.output( new OutputTag[String]( "freezing alert" ), "freezing alert for " + value.id )
}
out.collect( value )
}
}
package com.atguigu.apitest

import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.functions.source.SourceFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

import scala.util.Random

/**
*
*

  • Project: FlinkTutorial
  • Package: com.atguigu.apitest
  • Version: 1.0
  • Created by wushengran on 2019/9/17 10:11
    */

// case class for sensor readings
case class SensorReading( id: String, timestamp: Long, temperature: Double )

object SourceTest {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)

// 1. 从集合中读取数据
val stream1 = env.fromCollection(List(
  SensorReading("sensor_1", 1547718199, 35.80018327300259),
  SensorReading("sensor_6", 1547718201, 15.402984393403084),
  SensorReading("sensor_7", 1547718202, 6.720945201171228),
  SensorReading("sensor_10", 1547718205, 38.101067604893444)
))

// env.fromElements("flink", 1, 32, 3213, 0.324).print("test")

// 2. 从文件中读取数据
val stream2 = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt")

// 3. 从kafka中读取数据
// 创建kafka相关的配置
val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
properties.setProperty("group.id", "consumer-group")
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
properties.setProperty("auto.offset.reset", "latest")

val stream3 = env.addSource(new FlinkKafkaConsumer011[String]("sensor", new SimpleStringSchema(), properties))

// 4. 自定义数据源
val stream4 = env.addSource(new SensorSource())

// sink输出
stream4.print("stream4")

env.execute("source api test")

}
}

class SensorSource() extends SourceFunction[SensorReading]{
// 定义一个flag:表示数据源是否还在正常运行
var running: Boolean = true
override def cancel(): Unit = running = false

override def run(ctx: SourceFunction.SourceContext[SensorReading]): Unit = {
// 创建一个随机数发生器
val rand = new Random()

// 随机初始换生成10个传感器的温度数据,之后在它基础随机波动生成流数据
var curTemp = 1.to(10).map(
  i => ( "sensor_" + i, 60 + rand.nextGaussian() * 20 )
)

// 无限循环生成流数据,除非被cancel
while(running){
  // 更新温度值
  curTemp = curTemp.map(
    t => (t._1, t._2 + rand.nextGaussian())
  )
  // 获取当前的时间戳
  val curTime = System.currentTimeMillis()
  // 包装成SensorReading,输出
  curTemp.foreach(
    t => ctx.collect( SensorReading(t._1, curTime, t._2) )
  )
  // 间隔100ms
  Thread.sleep(100)
}

}
}
package com.atguigu.apitest

import org.apache.flink.api.common.functions.{FilterFunction, RichMapFunction}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.scala._

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest

  • Version: 1.0

  • Created by wushengran on 2019/9/17 11:41
    */
    object TransformTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // read the input
    val inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt")

    // transform

    val dataStream = inputStream
    .map(
    data => {
    val dataArray = data.split(",")
    SensorReading( dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble )
    }
    )

    // 1. aggregation
    val stream1 = dataStream
    .keyBy("id")
    //    .sum("temperature")
    .reduce( (x, y) => SensorReading(x.id, x.timestamp + 1, y.temperature + 10) )

    // 2. split the stream depending on whether the temperature is above 30 degrees
    val splitStream = dataStream
    .split( sensorData => {
    if( sensorData.temperature > 30 ) Seq("high") else Seq("low")
    } )

    val highTempStream = splitStream.select("high")
    val lowTempStream = splitStream.select("low")
    val allTempStream = splitStream.select("high", "low")

    // 3. connect / union two streams
    val warningStream = highTempStream.map( sensorData => (sensorData.id, sensorData.temperature) )
    val connectedStreams = warningStream.connect(lowTempStream)

    val coMapStream = connectedStreams.map(
    warningData => ( warningData._1, warningData._2, "high temperature warning" ),
    lowData => ( lowData.id, "healthy" )
    )

    val unionStream = highTempStream.union(lowTempStream)

    // function classes
    dataStream.filter( new MyFilter() ).print()

    // output
    //    dataStream.print()
    //    highTempStream.print("high")
    //    lowTempStream.print("low")
    //    allTempStream.print("all")
    //    unionStream.print("union")

    env.execute("transform test job")
    }
    }

class MyFilter() extends FilterFunction[SensorReading]{
override def filter(value: SensorReading): Boolean = {
value.id.startsWith("sensor_1")
}
}

class MyMapper() extends RichMapFunction[SensorReading, String]{
override def map(value: SensorReading): String = {
"flink"
}

override def open(parameters: Configuration): Unit = super.open(parameters)
}
package com.atguigu.apitest

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.{AssignerWithPeriodicWatermarks, AssignerWithPunctuatedWatermarks, KeyedProcessFunction}
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.apitest

  • Version: 1.0

  • Created by wushengran on 2019/9/18 9:31
    */
    object WindowTest {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // use event time
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.getConfig.setAutoWatermarkInterval(500)

    // read the input
    //    val inputStream = env.readTextFile("D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\sensor.txt")

    val inputStream = env.socketTextStream("localhost", 7777)

    val dataStream = inputStream
    .map(
    data => {
    val dataArray = data.split(",")
    SensorReading(dataArray(0).trim, dataArray(1).trim.toLong, dataArray(2).trim.toDouble)
    }
    )
    //    .assignAscendingTimestamps(_.timestamp * 1000L)
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor[SensorReading]( Time.seconds(1) ) {
    override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
    })
    //    .assignTimestampsAndWatermarks( new MyAssigner() )
    .map(data => (data.id, data.temperature))
    .keyBy(_._1)
    //    .process( new MyProcess() )
    .timeWindow(Time.seconds(10), Time.seconds(3))
    .reduce((result, data) => (data._1, result._2.min(data._2))) // the minimum temperature within the 10-second window

    dataStream.print()

    env.execute("window api test")
    }
    }

class MyAssigner() extends AssignerWithPeriodicWatermarks[SensorReading]{
// fixed delay of 3 seconds
val bound: Long = 3 * 1000L
// the largest timestamp seen so far
var maxTs: Long = Long.MinValue

override def getCurrentWatermark: Watermark = {
new Watermark(maxTs - bound)
}

override def extractTimestamp(element: SensorReading, previousElementTimestamp: Long): Long = {
maxTs = maxTs.max(element.timestamp * 1000L)
element.timestamp * 1000L
}
}

class MyAssigner2() extends AssignerWithPunctuatedWatermarks[SensorReading]{
val bound: Long = 1000L

override def checkAndGetNextWatermark(lastElement: SensorReading, extractedTimestamp: Long): Watermark = {
if( lastElement.id == "sensor_1" ){
new Watermark(extractedTimestamp - bound)
}else{
null
}
}

override def extractTimestamp(element: SensorReading, previousElementTimestamp: Long): Long = {
element.timestamp * 1000L
}
}
package com.atguigu.wc

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala._

/**
*
*

  • Project: FlinkTutorial

  • Package: com.atguigu.wc

  • Version: 1.0

  • Created by wushengran on 2019/9/16 14:08
    */
    object StreamWordCount {
    def main(args: Array[String]): Unit = {

    val params = ParameterTool.fromArgs(args)
    val host: String = params.get("host")
    val port: Int = params.getInt("port")

    // create a streaming execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //    env.setParallelism(1)
    //    env.disableOperatorChaining()

    // receive the socket text stream
    val textDataStream = env.socketTextStream(host, port)

    // split each line into words and count them
    val wordCountDataStream = textDataStream.flatMap(_.split("\\s"))
    .filter(_.nonEmpty).startNewChain()
    .map( (_, 1) )
    .keyBy(0)
    .sum(1)

    // print the result
    wordCountDataStream.print().setParallelism(1)

    // run the job
    env.execute("stream word count job")
    }
    }
    package com.atguigu.wc

import org.apache.flink.api.scala._

/**
*
*

  • Project: FlinkTutorial
  • Package: com.atguigu.wc
  • Version: 1.0
  • Created by wushengran on 2019/9/16 11:48
    */

// batch processing version
object WordCount {
def main(args: Array[String]): Unit = {
// create a batch execution environment
val env = ExecutionEnvironment.getExecutionEnvironment

// read the data from a file
val inputPath = "D:\\Projects\\BigData\\FlinkTutorial\\src\\main\\resources\\hello.txt"
val inputDataSet = env.readTextFile(inputPath)

// split into words, then count
val wordCountDataSet = inputDataSet.flatMap(_.split(" "))
  .map( (_, 1) )
  .groupBy(0)
  .sum(1)

// print the result
wordCountDataSet.print()

}
}

<!-- POM of the FlinkTutorial project (reconstructed from the flattened text) -->
<modelVersion>4.0.0</modelVersion>
<groupId>com.atguigu</groupId>
<artifactId>FlinkTutorial</artifactId>
<version>1.0-SNAPSHOT</version>

<dependencies>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-scala_2.11</artifactId><version>1.7.2</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-scala_2.11</artifactId><version>1.7.2</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-connector-kafka-0.11_2.11</artifactId><version>1.7.2</version></dependency>
    <dependency><groupId>org.apache.bahir</groupId><artifactId>flink-connector-redis_2.11</artifactId><version>1.0</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-connector-elasticsearch6_2.11</artifactId><version>1.7.2</version></dependency>
    <dependency><groupId>mysql</groupId><artifactId>mysql-connector-java</artifactId><version>5.1.44</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-statebackend-rocksdb_2.11</artifactId><version>1.7.2</version></dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <goals>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

<!-- Parent POM of the UserBehaviorAnalysis project (reconstructed from the flattened text) -->
<modelVersion>4.0.0</modelVersion>
<groupId>com.atguigu</groupId>
<artifactId>UserBehaviorAnalysis</artifactId>
<packaging>pom</packaging>
<version>1.0-SNAPSHOT</version>

<modules>
    <module>HotItemsAnalysis</module>
    <module>NetworkFlowAnalysis</module>
    <module>MarketAnalysis</module>
    <module>LoginFailDetect</module>
    <module>OrderPayDetect</module>
</modules>

<properties>
    <flink.version>1.7.2</flink.version>
    <scala.binary.version>2.11</scala.binary.version>
    <kafka.version>2.2.0</kafka.version>
</properties>

<dependencies>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-scala_${scala.binary.version}</artifactId><version>${flink.version}</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-streaming-scala_${scala.binary.version}</artifactId><version>${flink.version}</version></dependency>
    <dependency><groupId>org.apache.kafka</groupId><artifactId>kafka_${scala.binary.version}</artifactId><version>${kafka.version}</version></dependency>
    <dependency><groupId>org.apache.flink</groupId><artifactId>flink-connector-kafka_${scala.binary.version}</artifactId><version>${flink.version}</version></dependency>
</dependencies>

<build>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.4.6</version>
            <executions>
                <execution>
                    <goals>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>3.0.0</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

<!-- POM of the OrderPayDetect module (reconstructed from the flattened text) -->
<parent>
    <artifactId>UserBehaviorAnalysis</artifactId>
    <groupId>com.atguigu</groupId>
    <version>1.0-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>OrderPayDetect</artifactId>

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-cep-scala_${scala.binary.version}</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>

package com.atguigu.orderpay_detect

import java.util

import org.apache.flink.cep.{PatternSelectFunction, PatternTimeoutFunction}
import org.apache.flink.cep.scala.CEP
import org.apache.flink.cep.scala.pattern.Pattern
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

/**
*
*

  • Project: UserBehaviorAnalysis
  • Package: com.atguigu.orderpay_detect
  • Version: 1.0
  • Created by wushengran on 2019/9/25 9:17
    */

// case class for the input order events
case class OrderEvent(orderId: Long, eventType: String, txId: String, eventTime: Long)

// case class for the output result
case class OrderResult(orderId: Long, resultMsg: String)

object OrderTimeout {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setParallelism(1)

// 1. 读取订单数据
val resource = getClass.getResource("/OrderLog.csv")
//    val orderEventStream = env.readTextFile(resource.getPath)
val orderEventStream = env.socketTextStream("localhost", 7777)
  .map(data => {
    val dataArray = data.split(",")
    OrderEvent(dataArray(0).trim.toLong, dataArray(1).trim, dataArray(2).trim, dataArray(3).trim.toLong)
  })
  .assignAscendingTimestamps(_.eventTime * 1000L)
  .keyBy(_.orderId)

// 2. define the match pattern: a "create" that must be followed by a "pay" within 15 minutes
val orderPayPattern = Pattern.begin[OrderEvent]("begin").where(_.eventType == "create")
  .followedBy("follow").where(_.eventType == "pay")
  .within(Time.minutes(15))
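// Note: followedBy uses relaxed contiguity, so other events may occur between the matched
// "create" and "pay"; within(Time.minutes(15)) bounds the whole match, and sequences that do
// not complete in time are handed to the timeout function registered below.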

// 3. apply the pattern to the stream to get a PatternStream
val patternStream = CEP.pattern(orderEventStream, orderPayPattern)

// 4. call select to extract matched sequences; timed-out sequences go to a side output as alerts
val orderTimeoutOutputTag = new OutputTag[OrderResult]("orderTimeout")

val resultStream = patternStream.select(orderTimeoutOutputTag,
  new OrderTimeoutSelect(),
  new OrderPaySelect())

resultStream.print("payed")
resultStream.getSideOutput(orderTimeoutOutputTag).print("timeout")

env.execute("order timeout job")

}
}

// custom handler for timed-out event sequences
class OrderTimeoutSelect() extends PatternTimeoutFunction[OrderEvent, OrderResult] {
override def timeout(map: util.Map[String, util.List[OrderEvent]], l: Long): OrderResult = {
val timeoutOrderId = map.get("begin").iterator().next().orderId
OrderResult(timeoutOrderId, "timeout")
}
}

// custom handler for successfully matched (paid) event sequences
class OrderPaySelect() extends PatternSelectFunction[OrderEvent, OrderResult] {
override def select(map: util.Map[String, util.List[OrderEvent]]): OrderResult = {
val payedOrderId = map.get("follow").iterator().next().orderId
OrderResult(payedOrderId, "payed successfully")
}
}
package com.atguigu.orderpay_detect

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
*
*

  • Project: UserBehaviorAnalysis
  • Package: com.atguigu.orderpay_detect
  • Version: 1.0
  • Created by wushengran on 2019/9/25 10:27
    */
    object OrderTimeoutWithoutCep {

val orderTimeoutOutputTag = new OutputTag[OrderResult]("orderTimeout")

def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setParallelism(1)

// 读取订单数据
val resource = getClass.getResource("/OrderLog.csv")
//    val orderEventStream = env.readTextFile(resource.getPath)
val orderEventStream = env.socketTextStream("localhost", 7777)
  .map(data => {
    val dataArray = data.split(",")
    OrderEvent(dataArray(0).trim.toLong, dataArray(1).trim, dataArray(2).trim, dataArray(3).trim.toLong)
  })
  .assignAscendingTimestamps(_.eventTime * 1000L)
  .keyBy(_.orderId)

// 定义process function进行超时检测

// val timeoutWarningStream = orderEventStream.process( new OrderTimeoutWarning() )
val orderResultStream = orderEventStream.process( new OrderPayMatch() )

orderResultStream.print("payed")
orderResultStream.getSideOutput(orderTimeoutOutputTag).print("timeout")

env.execute("order timeout without cep job")

}

class OrderPayMatch() extends KeyedProcessFunction[Long, OrderEvent, OrderResult]{
// whether the pay event has already been seen
lazy val isPayedState: ValueState[Boolean] = getRuntimeContext.getState(new ValueStateDescriptor[Boolean]("ispayed-state", classOf[Boolean]))
// the timestamp of the registered timer, kept as state
lazy val timerState: ValueState[Long] = getRuntimeContext.getState(new ValueStateDescriptor[Long]("timer-state", classOf[Long]))

override def processElement(value: OrderEvent, ctx: KeyedProcessFunction[Long, OrderEvent, OrderResult]#Context, out: Collector[OrderResult]): Unit = {
  // 先读取状态
  val isPayed = isPayedState.value()
  val timerTs = timerState.value()

  // 根据事件的类型进行分类判断,做不同的处理逻辑
  if( value.eventType == "create" ){
    // 1. 如果是create事件,接下来判断pay是否来过
    if( isPayed ){
      // 1.1 如果已经pay过,匹配成功,输出主流,清空状态
      out.collect( OrderResult(value.orderId, "payed successfully") )
      ctx.timerService().deleteEventTimeTimer(timerTs)
      isPayedState.clear()
      timerState.clear()
    } else {
      // 1.2 如果没有pay过,注册定时器等待pay的到来
      val ts = value.eventTime * 1000L + 15 * 60 * 1000L
      ctx.timerService().registerEventTimeTimer(ts)
      timerState.update(ts)
    }
  } else if ( value.eventType == "pay" ){
    // 2. 如果是pay事件,那么判断是否create过,用timer表示
    if( timerTs > 0 ){
      // 2.1 如果有定时器,说明已经有create来过
      // 继续判断,是否超过了timeout时间
      if( timerTs > value.eventTime * 1000L ){
        // 2.1.1 如果定时器时间还没到,那么输出成功匹配
        out.collect( OrderResult(value.orderId, "payed successfully") )
      } else{
        // 2.1.2 如果当前pay的时间已经超时,那么输出到侧输出流
        ctx.output(orderTimeoutOutputTag, OrderResult(value.orderId, "payed but already timeout"))
      }
      // 输出结束,清空状态
      ctx.timerService().deleteEventTimeTimer(timerTs)
      isPayedState.clear()
      timerState.clear()
    } else {
      // 2.2 pay先到了,更新状态,注册定时器等待create
      isPayedState.update(true)
      ctx.timerService().registerEventTimeTimer( value.eventTime * 1000L )
      timerState.update(value.eventTime * 1000L)
    }
  }
}

override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, OrderEvent, OrderResult]#OnTimerContext, out: Collector[OrderResult]): Unit = {
  // 根据状态的值,判断哪个数据没来
  if( isPayedState.value() ){
    // 如果为true,表示pay先到了,没等到create
    ctx.output(orderTimeoutOutputTag, OrderResult(ctx.getCurrentKey, "already payed but not found create log"))
  } else{
    // 表示create到了,没等到pay
    ctx.output(orderTimeoutOutputTag, OrderResult(ctx.getCurrentKey, "order timeout"))
  }
  isPayedState.clear()
  timerState.clear()
}

}
}

// a custom process function implementation
class OrderTimeoutWarning() extends KeyedProcessFunction[Long, OrderEvent, OrderResult]{

// state flag: has the pay event arrived yet
lazy val isPayedState: ValueState[Boolean] = getRuntimeContext.getState(new ValueStateDescriptor[Boolean]("ispayed-state", classOf[Boolean]))

override def processElement(value: OrderEvent, ctx: KeyedProcessFunction[Long, OrderEvent, OrderResult]#Context, out: Collector[OrderResult]): Unit = {
// 先取出状态标识位
val isPayed = isPayedState.value()

if( value.eventType == "create" && !isPayed ){
  // 如果遇到了create事件,并且pay没有来过,注册定时器开始等待
  ctx.timerService().registerEventTimeTimer( value.eventTime * 1000L + 15 * 60 * 1000L )
} else if( value.eventType == "pay" ){
  // 如果是pay事件,直接把状态改为true
  isPayedState.update(true)
}

}

override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, OrderEvent, OrderResult]#OnTimerContext, out: Collector[OrderResult]): Unit = {
// check whether the pay event ever arrived
val isPayed = isPayedState.value()
if(isPayed){
out.collect( OrderResult( ctx.getCurrentKey, "order payed successfully" ) )
} else {
out.collect( OrderResult( ctx.getCurrentKey, "order timeout" ) )
}
// clear the state
isPayedState.clear()
}

}
package com.atguigu.orderpay_detect

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.CoProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

/**
*
*

  • Project: UserBehaviorAnalysis
  • Package: com.atguigu.orderpay_detect
  • Version: 1.0
  • Created by wushengran on 2019/9/25 14:15
    */

// case class for the receipt (account arrival) events
case class ReceiptEvent(txId: String, payChannel: String, eventTime: Long)

object TxMacthDetect {
// side-output tags for unmatched events (the tag name strings are reconstructed from the variable names)
val unmatchedPays = new OutputTag[OrderEvent]("unmatchedPays")
val unmatchedReceipts = new OutputTag[ReceiptEvent]("unmatchedReceipts")

def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// read the order event stream
val resource = getClass.getResource("/OrderLog.csv")

// val orderEventStream = env.readTextFile(resource.getPath)
val orderEventStream = env.socketTextStream("localhost", 7777)
.map(data => {
val dataArray = data.split(",")
OrderEvent(dataArray(0).trim.toLong, dataArray(1).trim, dataArray(2).trim, dataArray(3).trim.toLong)
})
.filter(_.txId != "")
.assignAscendingTimestamps(_.eventTime * 1000L)
.keyBy(_.txId)

// read the receipt (account arrival) event stream
val receiptResource = getClass.getResource("/ReceiptLog.csv")

// val receiptEventStream = env.readTextFile(receiptResource.getPath)
val receiptEventStream = env.socketTextStream("localhost", 8888)
.map( data => {
val dataArray = data.split(",")
ReceiptEvent( dataArray(0).trim, dataArray(1).trim, dataArray(2).toLong )
} )
.assignAscendingTimestamps(_.eventTime * 1000L)
.keyBy(_.txId)

// connect the two streams and process them together
val processedStream = orderEventStream.connect(receiptEventStream)
  .process( new TxPayMatch() )

processedStream.print("matched")
processedStream.getSideOutput(unmatchedPays).print("unmatchedPays")
processedStream.getSideOutput(unmatchedReceipts).print("unmatchReceipts")

env.execute("tx match job")

}

class TxPayMatch() extends CoProcessFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]{
// state holding the pay event and the receipt event that have already arrived
lazy val payState: ValueState[OrderEvent] = getRuntimeContext.getState(new ValueStateDescriptor[OrderEvent]("pay-state", classOf[OrderEvent]))
lazy val receiptState: ValueState[ReceiptEvent] = getRuntimeContext.getState(new ValueStateDescriptor[ReceiptEvent]("receipt-state", classOf[ReceiptEvent]))

// 订单支付事件数据的处理
override def processElement1(pay: OrderEvent, ctx: CoProcessFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]#Context, out: Collector[(OrderEvent, ReceiptEvent)]): Unit = {
  // 判断有没有对应的到账事件
  val receipt = receiptState.value()
  if( receipt != null ){
    // 如果已经有receipt,在主流输出匹配信息,清空状态
    out.collect((pay, receipt))
    receiptState.clear()
  } else {
    // 如果还没到,那么把pay存入状态,并且注册一个定时器等待
    payState.update(pay)
    ctx.timerService().registerEventTimeTimer( pay.eventTime * 1000L + 5000L )
  }
}

// 到账事件的处理
override def processElement2(receipt: ReceiptEvent, ctx: CoProcessFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]#Context, out: Collector[(OrderEvent, ReceiptEvent)]): Unit = {
  // 同样的处理流程
  val pay = payState.value()
  if( pay != null ){
    out.collect((pay, receipt))
    payState.clear()
  } else {
    receiptState.update(receipt)
    ctx.timerService().registerEventTimeTimer( receipt.eventTime * 1000L + 5000L )
  }
}

override def onTimer(timestamp: Long, ctx: CoProcessFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]#OnTimerContext, out: Collector[(OrderEvent, ReceiptEvent)]): Unit = {
  // 到时间了,如果还没有收到某个事件,那么输出报警信息
  if( payState.value() != null ){
    // recipt没来,输出pay到侧输出流
    ctx.output(unmatchedPays, payState.value())
  }
  if( receiptState.value() != null ){
    ctx.output(unmatchedReceipts, receiptState.value())
  }
  payState.clear()
  receiptState.clear()
}

}
}

package com.atguigu.orderpay_detect

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector

/**
*
*

  • Project: UserBehaviorAnalysis

  • Package: com.atguigu.orderpay_detect

  • Version: 1.0

  • Created by wushengran on 2019/9/25 15:40
    */
    object TxMatchByJoin {
    def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // read the order event stream
    val resource = getClass.getResource("/OrderLog.csv")
    // val orderEventStream = env.readTextFile(resource.getPath)
    val orderEventStream = env.socketTextStream("localhost", 7777)
    .map(data => {
    val dataArray = data.split(",")
    OrderEvent(dataArray(0).trim.toLong, dataArray(1).trim, dataArray(2).trim, dataArray(3).trim.toLong)
    })
    .filter(_.txId != "")
    .assignAscendingTimestamps(_.eventTime * 1000L)
    .keyBy(_.txId)

    // read the receipt (account arrival) event stream
    val receiptResource = getClass.getResource("/ReceiptLog.csv")
    // val receiptEventStream = env.readTextFile(receiptResource.getPath)
    val receiptEventStream = env.socketTextStream("localhost", 8888)
    .map( data => {
    val dataArray = data.split(",")
    ReceiptEvent( dataArray(0).trim, dataArray(1).trim, dataArray(2).toLong )
    } )
    .assignAscendingTimestamps(_.eventTime * 1000L)
    .keyBy(_.txId)

    // interval join of the two keyed streams
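    // between(Time.seconds(-5), Time.seconds(5)) means: for each order pay event, join it with
    // any receipt for the same txId whose event time lies at most 5 seconds before or after it.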
    val processedStream = orderEventStream.intervalJoin( receiptEventStream )
    .between(Time.seconds(-5), Time.seconds(5))
    .process( new TxPayMatchByJoin() )

    processedStream.print()

    env.execute("tx pay match by join job")
    }
    }

class TxPayMatchByJoin() extends ProcessJoinFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]{
override def processElement(left: OrderEvent, right: ReceiptEvent, ctx: ProcessJoinFunction[OrderEvent, ReceiptEvent, (OrderEvent, ReceiptEvent)]#Context, out: Collector[(OrderEvent, ReceiptEvent)]): Unit = {
out.collect((left, right))
}
}
package com.atguigu.marketanalysis

import java.sql.Timestamp

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
*
*

  • Project: UserBehaviorAnalysis
  • Package: com.atguigu.marketanalysis
  • Version: 1.0
  • Created by wushengran on 2019/9/24 10:10
    */
    // case class for the input ad-click events
    case class AdClickEvent( userId: Long, adId: Long, province: String, city: String, timestamp: Long )
    // case class for the per-province output
    case class CountByProvince( windowEnd: String, province: String, count: Long )
    // blacklist warning emitted to the side output
    case class BlackListWarning( userId: Long, adId: Long, msg: String )

object AdStatisticsByGeo {
// side-output tag for the blacklist (the tag name string is reconstructed)
val blackListOutputTag: OutputTag[BlackListWarning] = new OutputTag[BlackListWarning]("blacklist")

def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.setParallelism(1)

// 读取数据并转换成AdClickEvent
val resource = getClass.getResource("/AdClickLog.csv")
val adEventStream = env.readTextFile(resource.getPath)
  .map( data => {
    val dataArray = data.split(",")
    AdClickEvent( dataArray(0).trim.toLong, dataArray(1).trim.toLong, dataArray(2).trim, dataArray(3).trim, dataArray(4).trim.toLong )
  } )
  .assignAscendingTimestamps(_.timestamp * 1000L)

// 自定义process function,过滤大量刷点击的行为
val filterBlackListStream = adEventStream
  .keyBy( data => (data.userId, data.adId) )
  .process( new FilterBlackListUser(100) )

// 根据省份做分组,开窗聚合
val adCountStream = filterBlackListStream
  .keyBy(_.province)
  .timeWindow( Time.hours(1), Time.seconds(5) )
  .aggregate( new AdCountAgg(), new AdCountResult() )

adCountStream.print("count")
filterBlackListStream.getSideOutput(blackListOutputTag).print("blacklist")

env.execute("ad statistics job")

}

class FilterBlackListUser(maxCount: Int) extends KeyedProcessFunction[(Long, Long), AdClickEvent, AdClickEvent]{
// state: number of clicks of this user on this ad
lazy val countState: ValueState[Long] = getRuntimeContext.getState(new ValueStateDescriptor[Long]("count-state", classOf[Long]))
// state: whether the blacklist warning has already been sent
lazy val isSentBlackList: ValueState[Boolean] = getRuntimeContext.getState( new ValueStateDescriptor[Boolean]("issent-state", classOf[Boolean]) )
// state: timestamp of the reset timer
lazy val resetTimer: ValueState[Long] = getRuntimeContext.getState( new ValueStateDescriptor[Long]("resettime-state", classOf[Long]) )

override def processElement(value: AdClickEvent, ctx: KeyedProcessFunction[(Long, Long), AdClickEvent, AdClickEvent]#Context, out: Collector[AdClickEvent]): Unit = {
  // 取出count状态
  val curCount = countState.value()

  // 如果是第一次处理,注册定时器,每天00:00触发
  if( curCount == 0 ){
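    // next midnight (UTC): floor the current processing time to whole days, then add one day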
    val ts = ( ctx.timerService().currentProcessingTime()/(1000*60*60*24) + 1) * (1000*60*60*24)
    resetTimer.update(ts)
    ctx.timerService().registerProcessingTimeTimer(ts)
  }

  // 判断计数是否达到上限,如果到达则加入黑名单
  if( curCount >= maxCount ){
    // 判断是否发送过黑名单,只发送一次
    if( !isSentBlackList.value() ){
      isSentBlackList.update(true)
      // 输出到侧输出流
      ctx.output( blackListOutputTag, BlackListWarning(value.userId, value.adId, "Click over " + maxCount + " times today.") )
    }
    return
  }
  // 计数状态加1,输出数据到主流
  countState.update( curCount + 1 )
  out.collect( value )
}

override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[(Long, Long), AdClickEvent, AdClickEvent]#OnTimerContext, out: Collector[AdClickEvent]): Unit = {
  // 定时器触发时,清空状态
  if( timestamp == resetTimer.value() ){
    isSentBlackList.clear()
    countState.clear()
    resetTimer.clear()
  }
}

}
}

// 自定义预聚合函数
class AdCountAgg() extends AggregateFunction[AdClickEvent, Long, Long]{
override def add(value: AdClickEvent, accumulator: Long): Long = accumulator + 1

override def createAccumulator(): Long = 0L

override def getResult(accumulator: Long): Long = accumulator

override def merge(a: Long, b: Long): Long = a + b
}

// 自定义窗口处理函数
class AdCountResult() extends WindowFunction[Long, CountByProvince, String, TimeWindow]{
override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[CountByProvince]): Unit = {
out.collect( CountByProvince( new Timestamp(window.getEnd).toString, key, input.iterator.next() ) )
}
}
package com.atguigu.marketanalysis

import java.sql.Timestamp

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

/**
  * Project: UserBehaviorAnalysis
  * Package: com.atguigu.marketanalysis
  * Version: 1.0
  * Created by wushengran on 2019/9/23 15:37
  */
object AppMarketing {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    val dataStream = env.addSource( new SimulatedEventSource() )
      .assignAscendingTimestamps(_.timestamp)
      .filter( _.behavior != "UNINSTALL" )
      .map( data => ( "dummyKey", 1L ) )
      .keyBy(_._1)     // all events share a single dummy key, so the window counts the overall total
      .timeWindow( Time.hours(1), Time.seconds(10) )
      .aggregate( new CountAgg(), new MarketingCountTotal() )

    dataStream.print()
    env.execute("app marketing job")
  }
}

class CountAgg() extends AggregateFunction[(String, Long), Long, Long]{
override def add(value: (String, Long), accumulator: Long): Long = accumulator + 1

override def createAccumulator(): Long = 0L

override def getResult(accumulator: Long): Long = accumulator

override def merge(a: Long, b: Long): Long = a + b
}

class MarketingCountTotal() extends WindowFunction[Long, MarketingViewCount, String, TimeWindow]{
override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[MarketingViewCount]): Unit = {
val startTs = new Timestamp(window.getStart).toString
val endTs = new Timestamp(window.getEnd).toString
val count = input.iterator.next()
out.collect( MarketingViewCount(startTs, endTs, "app marketing", "total", count) )
}
}
package com.atguigu.marketanalysis

import java.sql.Timestamp
import java.util.UUID
import java.util.concurrent.TimeUnit

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.source.{RichSourceFunction, SourceFunction}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import scala.util.Random

/**
  * Project: UserBehaviorAnalysis
  * Package: com.atguigu.marketanalysis
  * Version: 1.0
  * Created by wushengran on 2019/9/23 15:06
  */

// 输入数据样例类
case class MarketingUserBehavior( userId: String, behavior: String, channel: String, timestamp: Long )
// 输出结果样例类
case class MarketingViewCount( windowStart: String, windowEnd: String, channel: String, behavior: String, count: Long )

object AppMarketingByChannel {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val dataStream = env.addSource( new SimulatedEventSource() )
  .assignAscendingTimestamps(_.timestamp)
  .filter( _.behavior != "UNINSTALL" )
  .map( data => {
    ( (data.channel, data.behavior), 1L )
  } )
  .keyBy(_._1)     // 以渠道和行为类型作为key分组
  .timeWindow( Time.hours(1), Time.seconds(10) )
  .process( new MarketingCountByChannel() )

dataStream.print()
env.execute("app marketing by channel job")

}
}

// 自定义数据源
class SimulatedEventSource() extends RichSourceFunction[MarketingUserBehavior]{
// 定义是否运行的标识位
var running = true
// 定义用户行为的集合
val behaviorTypes: Seq[String] = Seq("CLICK", "DOWNLOAD", "INSTALL", "UNINSTALL")
// 定义渠道的集合
val channelSets: Seq[String] = Seq("wechat", "weibo", "appstore", "huaweistore")
// 定义一个随机数发生器
val rand: Random = new Random()

override def cancel(): Unit = running = false

override def run(ctx: SourceFunction.SourceContext[MarketingUserBehavior]): Unit = {
// 定义一个生成数据的上限
val maxElements = Long.MaxValue
var count = 0L

// 随机生成所有数据
while( running && count < maxElements ){
  val id = UUID.randomUUID().toString
  val behavior = behaviorTypes(rand.nextInt(behaviorTypes.size))
  val channel = channelSets(rand.nextInt(channelSets.size))
  val ts = System.currentTimeMillis()

  ctx.collect( MarketingUserBehavior( id, behavior, channel, ts ) )

  count += 1
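  // sleep 10 ms between two records, i.e. roughly 100 simulated events per second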
  TimeUnit.MILLISECONDS.sleep(10L)
}

}
}

// 自定义处理函数
class MarketingCountByChannel() extends ProcessWindowFunction[((String, String), Long), MarketingViewCount, (String, String), TimeWindow]{
override def process(key: (String, String), context: Context, elements: Iterable[((String, String), Long)], out: Collector[MarketingViewCount]): Unit = {
val startTs = new Timestamp(context.window.getStart).toString
val endTs = new Timestamp(context.window.getEnd).toString
val channel = key._1
val behavior = key._2
val count = elements.size
out.collect( MarketingViewCount(startTs, endTs, channel, behavior, count) )
}
}
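`MarketingCountByChannel` above buffers every element of the window and counts them with `elements.size`. As a rough alternative sketch (the class names `ChannelCountAgg` and `MarketingCountByChannelResult` are invented here, not part of the project), the count can also be maintained incrementally with an `AggregateFunction`, mirroring the `CountAgg` + `MarketingCountTotal` pair used in `AppMarketing`, so the window state only holds a single `Long`:

```scala
// additional imports needed on top of the ones already in this file
import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala.function.WindowFunction

// incremental pre-aggregation: one Long counter per (channel, behavior) window
class ChannelCountAgg() extends AggregateFunction[((String, String), Long), Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(value: ((String, String), Long), accumulator: Long): Long = accumulator + 1
  override def getResult(accumulator: Long): Long = accumulator
  override def merge(a: Long, b: Long): Long = a + b
}

// window function only wraps the pre-aggregated count into the MarketingViewCount sample class defined above
class MarketingCountByChannelResult() extends WindowFunction[Long, MarketingViewCount, (String, String), TimeWindow] {
  override def apply(key: (String, String), window: TimeWindow, input: Iterable[Long], out: Collector[MarketingViewCount]): Unit = {
    out.collect( MarketingViewCount(
      new Timestamp(window.getStart).toString,
      new Timestamp(window.getEnd).toString,
      key._1, key._2, input.iterator.next() ) )
  }
}

// usage: replace .process( new MarketingCountByChannel() ) in the pipeline above with
//   .aggregate( new ChannelCountAgg(), new MarketingCountByChannelResult() )
```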

UserBehaviorAnalysis

com.atguigu

1.0-SNAPSHOT

4.0.0

MarketAnalysis
