从flink-example分析flink组件(1)WordCount batch实战及源码分析

上一章 简单介绍了一下flink在windows下如何通过flink-webui运行已经打包完成的示例程序(jar),那么我们为什么要使用flink呢?

flink的特征

官网给出的特征如下:

1、一切皆为流(All streaming use cases )

  • 事件驱动应用(Event-driven Applications)

              

  

  • 流式 & 批量分析(Stream & Batch Analytics)

    

 


  

  •  数据管道&ETL(Data Pipelines & ETL)

     

 

2、正确性保证(Guaranteed correctness)

  • 唯一状态一致性(Exactly-once state consistency)
  • 事件-事件处理(Event-time processing)
  • 高超的最近数据处理(Sophisticated late data handling)

3、多层api(Layered APIs)   

  • 基于流式和批量数据处理的SQL(SQL on Stream & Batch Data)
  • 流水数据API & 数据集API(DataStream API & DataSet API)
  • 处理函数 (时间 & 状态)(ProcessFunction (Time & State))

           

4、易用性

  • 部署灵活(Flexible deployment)
  • 高可用安装(High-availability setup)
  • 保存点(Savepoints)

5、可扩展性

  • 可扩展架构(Scale-out architecture)
  • 大量状态的支持(Support for very large state)
  • 增量检查点(Incremental checkpointing)

6、高性能

  • 低延迟(Low latency)
  • 高吞吐量(High throughput)
  • 内存计算(In-Memory computing)

flink架构 

1、层级结构

 

2.工作架构图

 

 flink实战

1、依赖文件pom.xml

xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0modelVersion>

    <groupId>flinkDemogroupId>
    <artifactId>flinkDemoartifactId>
    <version>1.0-SNAPSHOTversion>

    <dependencies>
        <dependency>
            <groupId>org.apache.flinkgroupId>
            <artifactId>flink-javaartifactId>
            <version>1.5.0version>
            
        dependency>
        <dependency>
            <groupId>org.apache.flinkgroupId>
            <artifactId>flink-streaming-java_2.11artifactId>
            <version>1.5.0version>
            
        dependency>
        
        <dependency>
            <groupId>org.apache.flinkgroupId>
            <artifactId>flink-connector-kafka-0.10_2.11artifactId>
            <version>1.5.0version>
        dependency>
        
        <dependency>
            <groupId>org.apache.flinkgroupId>
            <artifactId>flink-hbase_2.11artifactId>
            <version>1.5.0version>
        dependency>

        <dependency>
            <groupId>org.apache.kafkagroupId>
            <artifactId>kafka-clientsartifactId>
            <version>0.10.1.1version>
        dependency>

        <dependency>
            <groupId>org.apache.hbasegroupId>
            <artifactId>hbase-clientartifactId>
            <version>1.1.2version>
        dependency>

        <dependency>
            <groupId>org.projectlombokgroupId>
            <artifactId>lombokartifactId>
            <version>1.16.10version>
            <scope>compilescope>
        dependency>
        <dependency>
            <groupId>com.google.code.gsongroupId>
            <artifactId>gsonartifactId>
            <version>2.8.2version>
        dependency>
        <dependency>
            <groupId>com.github.rholdergroupId>
            <artifactId>guava-retryingartifactId>
            <version>2.0.0version>
        dependency>
    dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.pluginsgroupId>
                <artifactId>maven-compiler-pluginartifactId>
                <version>3.5.1version>
                <configuration>
                    <source>1.8source>
                    <target>1.8target>
                configuration>
            plugin>
        plugins>
    build>
project>

2、java程序

public class WordCountDemo {

        public static void main(String[] args) throws Exception {
            final ParameterTool params = ParameterTool.fromArgs(args);

            // create execution environment
            final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            env.getConfig().setGlobalJobParameters(params);

            // get input data
            DataSet text;
            if (params.has("input")) {
                // read the text file from given input path
                text = env.readTextFile(params.get("input"));
            } else {
                // get default test text data
                System.out.println("Executing WordCount example with default input data set.");
                System.out.println("Use --input to specify file input.");
                text = WordCountData.getDefaultTextLineDataSet(env);
            }

            DataSet> counts =
                    // split up the lines in pairs (2-tuples) containing: (word,1)
                    text.flatMap(new Tokenizer())
                            // group by the tuple field "0" and sum up tuple field "1"
                            .groupBy(0)
                            .sum(1);

            // emit result
            if (params.has("output")) {
                counts.writeAsCsv(params.get("output"), "\n", " ");
                // execute program
                env.execute("WordCount Example");
            } else {
                System.out.println("Printing result to stdout. Use --output to specify output path.");
                counts.print();
            }

        }

    // *************************************************************************
    //     USER FUNCTIONS
    // *************************************************************************

    /**
     * Implements the string tokenizer that splits sentences into words as a user-defined
     * FlatMapFunction. The function takes a line (String) and splits it into
     * multiple pairs in the form of "(word,1)" ({@code Tuple2}).
     */
    public static final class Tokenizer implements FlatMapFunction> {

        @Override
        public void flatMap(String value, Collector> out) {
            // normalize and split the line
            String[] tokens = value.toLowerCase().split("\\W+");

            // emit the pairs
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}

3、单步调试分析

   第一步:获取环境信息ExecutionEnvironment.java

  

/**
 * The ExecutionEnvironment is the context in which a program is executed. A
 * {@link LocalEnvironment} will cause execution in the current JVM, a
 * {@link RemoteEnvironment} will cause execution on a remote setup.
 *
 * 

The environment provides methods to control the job execution (such as setting the parallelism) * and to interact with the outside world (data access). * *

Please note that the execution environment needs strong type information for the input and return types * of all operations that are executed. This means that the environments needs to know that the return * value of an operation is for example a Tuple of String and Integer. * Because the Java compiler throws much of the generic type information away, most methods attempt to re- * obtain that information using reflection. In certain cases, it may be necessary to manually supply that * information to some of the methods. * * @see LocalEnvironment * @see RemoteEnvironment */

  创建本地环境

    /**
     * Creates a {@link LocalEnvironment} which is used for executing Flink jobs.
     *
     * @param configuration to start the {@link LocalEnvironment} with
     * @param defaultParallelism to initialize the {@link LocalEnvironment} with
     * @return {@link LocalEnvironment}
     */
    private static LocalEnvironment createLocalEnvironment(Configuration configuration, int defaultParallelism) {
        final LocalEnvironment localEnvironment = new LocalEnvironment(configuration);

        if (defaultParallelism > 0) {
            localEnvironment.setParallelism(defaultParallelism);
        }

        return localEnvironment;
    }

  第二步:获取外部数据,创建数据集  ExecutionEnvironment.java

    /**
     * Creates a DataSet from the given non-empty collection. Note that this operation will result
     * in a non-parallel data source, i.e. a data source with a parallelism of one.
     *
     * 

The returned DataSet is typed to the given TypeInformation. * * @param data The collection of elements to create the data set from. * @param type The TypeInformation for the produced data set. * @return A DataSet representing the given collection. * * @see #fromCollection(Collection) */ public DataSource fromCollection(Collection data, TypeInformation type) { return fromCollection(data, type, Utils.getCallLocationName()); } private DataSource fromCollection(Collection data, TypeInformation type, String callLocationName) { CollectionInputFormat.checkCollection(data, type.getTypeClass()); return new DataSource<>(this, new CollectionInputFormat<>(data, type.createSerializer(config)), type, callLocationName); }

   数据集的继承关系

从flink-example分析flink组件(1)WordCount batch实战及源码分析_第1张图片

其中,DataSet是一组相同类型数据的集合,抽象类,它提供了数据的转换功能,如map,reduce,join和coGroup

/**
 * A DataSet represents a collection of elements of the same type.
 *
 * 

A DataSet can be transformed into another DataSet by applying a transformation as for example *

    *
  • {@link DataSet#map(org.apache.flink.api.common.functions.MapFunction)},
  • *
  • {@link DataSet#reduce(org.apache.flink.api.common.functions.ReduceFunction)},
  • *
  • {@link DataSet#join(DataSet)}, or
  • *
  • {@link DataSet#coGroup(DataSet)}.
  • *
* *
@param The type of the DataSet, i.e., the type of the elements of the DataSet. */

Operator是java api的操作基类,抽象类

/**
 * Base class of all operators in the Java API.
 *
 * @param  The type of the data set produced by this operator.
 * @param  The type of the operator, so that we can return it.
 */
@Public
public abstract class Operatorextends Operator> extends DataSet {

DataSource具体实现类。

/**
 * An operation that creates a new data set (data source). The operation acts as the
 * data set on which to apply further transformations. It encapsulates additional
 * configuration parameters, to customize the execution.
 *
 * @param  The type of the elements produced by this data source.
 */
@Public
public class DataSource extends Operator> {

  第三步:对输入数据集进行转换

            DataSet> counts =
                    // split up the lines in pairs (2-tuples) containing: (word,1)
                    text.flatMap(new Tokenizer())
                            // group by the tuple field "0" and sum up tuple field "1"
                            .groupBy(0)
                            .sum(1);

     >>调用map DataSet.java

    /**
     * Applies a FlatMap transformation on a {@link DataSet}.
     *
     * 

The transformation calls a {@link org.apache.flink.api.common.functions.RichFlatMapFunction} for each element of the DataSet. * Each FlatMapFunction call can return any number of elements including none. * * @param flatMapper The FlatMapFunction that is called for each element of the DataSet. * @return A FlatMapOperator that represents the transformed DataSet. * * @see org.apache.flink.api.common.functions.RichFlatMapFunction * @see FlatMapOperator * @see DataSet */ public FlatMapOperator flatMap(FlatMapFunction flatMapper) { if (flatMapper == null) { throw new NullPointerException("FlatMap function must not be null."); } String callLocation = Utils.getCallLocationName(); TypeInformation resultType = TypeExtractor.getFlatMapReturnTypes(flatMapper, getType(), callLocation, true); return new FlatMapOperator<>(this, resultType, clean(flatMapper), callLocation); }

  >>调用groupby   DataSet.java

    /**
     * Groups a {@link Tuple} {@link DataSet} using field position keys.
     *
     * 

Note: Field position keys only be specified for Tuple DataSets. * *

The field position keys specify the fields of Tuples on which the DataSet is grouped. * This method returns an {@link UnsortedGrouping} on which one of the following grouping transformation * can be applied. *

    *
  • {@link UnsortedGrouping#sortGroup(int, org.apache.flink.api.common.operators.Order)} to get a {@link SortedGrouping}. *
  • {@link UnsortedGrouping#aggregate(Aggregations, int)} to apply an Aggregate transformation. *
  • {@link UnsortedGrouping#reduce(org.apache.flink.api.common.functions.ReduceFunction)} to apply a Reduce transformation. *
  • {@link UnsortedGrouping#reduceGroup(org.apache.flink.api.common.functions.GroupReduceFunction)} to apply a GroupReduce transformation. *
* *
@param fields One or more field positions on which the DataSet will be grouped. * @return A Grouping on which a transformation needs to be applied to obtain a transformed DataSet. * * @see Tuple * @see UnsortedGrouping * @see AggregateOperator * @see ReduceOperator * @see org.apache.flink.api.java.operators.GroupReduceOperator * @see DataSet */ public UnsortedGrouping groupBy(int... fields) { return new UnsortedGrouping<>(this, new Keys.ExpressionKeys<>(fields, getType())); }

 

  >>调用sum  UnsortedGrouping.java

    /**
     * Syntactic sugar for aggregate (SUM, field).
     * @param field The index of the Tuple field on which the aggregation function is applied.
     * @return An AggregateOperator that represents the summed DataSet.
     *
     * @see org.apache.flink.api.java.operators.AggregateOperator
     */
    public AggregateOperator sum (int field) {
        return this.aggregate (Aggregations.SUM, field, Utils.getCallLocationName());
    }
    // private helper that allows to set a different call location name
    private AggregateOperator aggregate(Aggregations agg, int field, String callLocationName) {
        return new AggregateOperator(this, agg, field, callLocationName);
    }

 UnsortedGrouping和DataSet的关系

  UnsortedGrouping使用AggregateOperator做聚合

从flink-example分析flink组件(1)WordCount batch实战及源码分析_第2张图片

  第四步:对转换的输入值进行处理

            // emit result
            if (params.has("output")) {
                counts.writeAsCsv(params.get("output"), "\n", " ");
                // execute program
                env.execute("WordCount Example");
            } else {
                System.out.println("Printing result to stdout. Use --output to specify output path.");
                counts.print();
            }

  如果不指定output参数,则打印到控制台

    /**
     * Prints the elements in a DataSet to the standard output stream {@link System#out} of the JVM that calls
     * the print() method. For programs that are executed in a cluster, this method needs
     * to gather the contents of the DataSet back to the client, to print it there.
     *
     * 

The string written for each element is defined by the {@link Object#toString()} method. * *

This method immediately triggers the program execution, similar to the * {@link #collect()} and {@link #count()} methods. * * @see #printToErr() * @see #printOnTaskManager(String) */ public void print() throws Exception { List elements = collect(); for (T e: elements) { System.out.println(e); } }

  若指定输出,则先进行输入转换为csv文件的DataSink,它是用来存储数据结果的

/**
 * An operation that allows storing data results.
 * @param 
 */

过程如下:

    /**
     * Writes a {@link Tuple} DataSet as CSV file(s) to the specified location with the specified field and line delimiters.
     *
     * 

Note: Only a Tuple DataSet can written as a CSV file. * For each Tuple field the result of {@link Object#toString()} is written. * * @param filePath The path pointing to the location the CSV file is written to. * @param rowDelimiter The row delimiter to separate Tuples. * @param fieldDelimiter The field delimiter to separate Tuple fields. * @param writeMode The behavior regarding existing files. Options are NO_OVERWRITE and OVERWRITE. * * @see Tuple * @see CsvOutputFormat * @see DataSet#writeAsText(String) Output files and directories */ public DataSink writeAsCsv(String filePath, String rowDelimiter, String fieldDelimiter, WriteMode writeMode) { return internalWriteAsCsv(new Path(filePath), rowDelimiter, fieldDelimiter, writeMode); } @SuppressWarnings("unchecked") private extends Tuple> DataSink internalWriteAsCsv(Path filePath, String rowDelimiter, String fieldDelimiter, WriteMode wm) { Preconditions.checkArgument(getType().isTupleType(), "The writeAsCsv() method can only be used on data sets of tuples."); CsvOutputFormat of = new CsvOutputFormat<>(filePath, rowDelimiter, fieldDelimiter); if (wm != null) { of.setWriteMode(wm); } return output((OutputFormat) of); } /** * Emits a DataSet using an {@link OutputFormat}. This method adds a data sink to the program. * Programs may have multiple data sinks. A DataSet may also have multiple consumers (data sinks * or transformations) at the same time. * * @param outputFormat The OutputFormat to process the DataSet. * @return The DataSink that processes the DataSet. * * @see OutputFormat * @see DataSink */ public DataSink output(OutputFormat outputFormat) { Preconditions.checkNotNull(outputFormat); // configure the type if needed if (outputFormat instanceof InputTypeConfigurable) { ((InputTypeConfigurable) outputFormat).setInputType(getType(), context.getConfig()); } DataSink sink = new DataSink<>(this, outputFormat, getType()); this.context.registerDataSink(sink); return sink; }

  最后执行job

    @Override
    public JobExecutionResult execute(String jobName) throws Exception {
        if (executor == null) {
            startNewSession();
        }

        Plan p = createProgramPlan(jobName);

        // Session management is disabled, revert this commit to enable
        //p.setJobId(jobID);
        //p.setSessionTimeout(sessionTimeout);

        JobExecutionResult result = executor.executePlan(p);

        this.lastJobExecutionResult = result;
        return result;
    }

这一阶段是内容比较多,放到下一篇讲解吧

总结

  Apache Flink 功能强大,支持开发和运行多种不同种类的应用程序。它的主要特性包括:批流一体化、精密的状态管理、事件时间支持以及精确一次的状态一致性保障等。Flink 不仅可以运行在包括 YARN、 Mesos、Kubernetes 在内的多种资源管理框架上,还支持在裸机集群上独立部署。在启用高可用选项的情况下,它不存在单点失效问题。事实证明,Flink 已经可以扩展到数千核心,其状态可以达到 TB 级别,且仍能保持高吞吐、低延迟的特性。世界各地有很多要求严苛的流处理应用都运行在 Flink 之上。

  其应用场景如下:  

1、事件驱动型应用
  典型的事件驱动型应用实例:
  反欺诈
  异常检测
  基于规则的报警
  业务流程监控
 (社交网络)Web 应用
2、数据分析应用
  典型的数据分析应用实例
  电信网络质量监控
  移动应用中的产品更新及实验评估分析
  消费者技术中的实时数据即席分析
  大规模图分析
3、数据管道应用
  典型的数据管道应用实例
  电子商务中的实时查询索引构建
  电子商务中的持续 ETL

参考资料

【1】https://flink.apache.org/

【2】https://blog.csdn.net/yangyin007/article/details/82382734

【3】https://flink.apache.org/zh/usecases.html

 

转载于:https://www.cnblogs.com/davidwang456/p/10948698.html

你可能感兴趣的:(大数据,java)