Differences Between the print Methods of Flink DataSet and DataStream

The Flink examples include two very similar WordCount programs: one for batch and one for streaming. They are also used in much the same way.

WordCount example
  1. Batch WordCount example
  • Usage
    In Flink on YARN mode, after setting HADOOP_CONF_DIR and YARN_CONF_DIR, run:
bin/flink run -m yarn-cluster -yjm 1024 -ytm 1024 -yqu exampleQ  examples/batch/WordCount.jar

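By default, when no input and output arguments are given, the job reads the built-in data set WordCountData, counts the words, and writes the result to stdout. After the job finishes, the client prints the following output.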

Program execution finished
Job with JobID 21eca2a01dbbc9594525824e9590c453 has finished.
Job Runtime: 12435 ms
Accumulator Results:
- 71de74f85dc472654c31b6df79701cf5 (java.util.ArrayList) [170 elements]
  2. Streaming WordCount example
  • Usage
bin/flink run -m yarn-cluster -yjm 1024 -ytm 1024 -yqu exampleQ  examples/streaming/WordCount.jar

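By default, when no input and output arguments are given, the job reads the built-in stream data WordCountData, counts the words, and writes the result to stdout. After the job finishes, output is again shown on the client side.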



The main method of the batch WordCount contains the following:

        // get input data
        DataSet<String> text;
        if (params.has("input")) {
            // read the text file from given input path
            text = env.readTextFile(params.get("input"));
        } else {
            // get default test text data
            System.out.println("Executing WordCount example with default input data set.");
            System.out.println("Use --input to specify file input.");
            text = WordCountData.getDefaultTextLineDataSet(env);
        }

        DataSet<Tuple2<String, Integer>> counts =
                // split up the lines in pairs (2-tuples) containing: (word,1)
                text.flatMap(new Tokenizer())
                // group by the tuple field "0" and sum up tuple field "1"
                .groupBy(0)
                .sum(1);

        // emit result
        if (params.has("output")) {
            counts.writeAsCsv(params.get("output"), "\n", " ");
            // execute program
            env.execute("WordCount Example");
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            counts.print();
        }

As shown, the default data is read into the DataSet text, map and reduce operations on text compute the result counts, and counts is then printed to stdout.
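The counting logic itself can be sketched without Flink. The following is a minimal plain-Java illustration, not the Flink API: the class WordCountSketch and its methods are hypothetical names, and the tokenizing rule (lower-case, split on non-word characters) is an assumption about what the example's Tokenizer does, since its body is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the WordCount logic; no Flink classes are involved.
class WordCountSketch {

    // Assumed stand-in for the example's Tokenizer:
    // lower-case the line and split it on runs of non-word characters.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Stand-in for grouping by word and summing the 1s:
    // every occurrence adds 1 to that word's running total.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : tokenize(line)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("To be, or not to be", "that is the question");
        // Print each (word,count) pair, one per line.
        countWords(lines).forEach((word, count) ->
                System.out.println("(" + word + "," + count + ")"));
    }
}
```

Everything here runs in a single JVM, so the question of where the output appears does not arise; that question only matters once the job runs on a cluster.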

The main method of the streaming WordCount has the same structure:

        // get input data
        DataStream<String> text;
        if (params.has("input")) {
            // read the text file from given input path
            text = env.readTextFile(params.get("input"));
        } else {
            System.out.println("Executing WordCount example with default input data set.");
            System.out.println("Use --input to specify file input.");
            // get default test text data
            text = env.fromElements(WordCountData.WORDS);
        }

        DataStream<Tuple2<String, Integer>> counts =
            // split up the lines in pairs (2-tuples) containing: (word,1)
            text.flatMap(new Tokenizer())
            // group by the tuple field "0" and sum up tuple field "1"
            .keyBy(0).sum(1);

        // emit result
        if (params.has("output")) {
            counts.writeAsText(params.get("output"));
        } else {
            System.out.println("Printing result to stdout. Use --output to specify output path.");
            counts.print();
        }

        // execute program
        env.execute("Streaming WordCount");


The difference comes from how the two print() methods are implemented. DataSet.print() collects the result back to the client:

    /**
     * Prints the elements in a DataSet to the standard output stream {@link System#out} of the JVM that calls
     * the print() method. For programs that are executed in a cluster, this method needs
     * to gather the contents of the DataSet back to the client, to print it there.
     *
     * <p>The string written for each element is defined by the {@link Object#toString()} method.
     *
     * <p>This method immediately triggers the program execution, similar to the
     * {@link #collect()} and {@link #count()} methods.
     *
     * @see #printToErr()
     * @see #printOnTaskManager(String)
     */
    public void print() throws Exception {
        List<T> elements = collect();
        for (T e: elements) {
            System.out.println(e);
        }
    }


    /**
     * Writes a DataStream to the standard output stream (stdout).
     *
     * <p>For each element of the DataStream the result of {@link Object#toString()} is written.
     *
     * <p>NOTE: This will print to stdout on the machine where the code is executed, i.e. the Flink
     * worker.
     *
     * @return The closed DataStream.
     */
    @PublicEvolving
    public DataStreamSink<T> print() {
        PrintSinkFunction<T> printFunction = new PrintSinkFunction<>();
        return addSink(printFunction).name("Print to Std. Out");
    }

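As shown, DataStream.print() only adds a print sink and does not collect the data back to the client. The output therefore does not appear on the client's stdout but on the machine that executes this code, that is, on the Flink TaskManager.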
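The contrast between the two strategies can be mimicked in plain Java. This is a toy model under stated assumptions, not Flink code: the class PrintLocationSketch and its methods are hypothetical, and two in-memory PrintStream objects merely stand in for the stdout of the client JVM and of a TaskManager JVM.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model: two in-memory streams play the role of client stdout and TaskManager stdout.
class PrintLocationSketch {

    // Models the batch behaviour: results are first gathered back to the
    // caller (the stand-in for collect()) and then printed on the client side.
    static <T> void batchStylePrint(List<T> resultsOnWorker, PrintStream clientOut) {
        List<T> collected = new ArrayList<>(resultsOnWorker);
        for (T e : collected) {
            clientOut.println(e);
        }
    }

    // Models the streaming behaviour: a sink running on the worker prints
    // every element locally; nothing is sent back to the client.
    static <T> void streamStylePrint(List<T> resultsOnWorker, PrintStream workerOut) {
        for (T e : resultsOnWorker) {
            workerOut.println(e);
        }
    }

    // Runs one of the two strategies and returns {client stdout, worker stdout} as text.
    static String[] run(List<String> results, boolean batchStyle) {
        ByteArrayOutputStream clientBuf = new ByteArrayOutputStream();
        ByteArrayOutputStream workerBuf = new ByteArrayOutputStream();
        PrintStream clientOut = new PrintStream(clientBuf, true);
        PrintStream workerOut = new PrintStream(workerBuf, true);
        if (batchStyle) {
            batchStylePrint(results, clientOut);
        } else {
            streamStylePrint(results, workerOut);
        }
        return new String[] {clientBuf.toString(), workerBuf.toString()};
    }

    public static void main(String[] args) {
        List<String> results = Arrays.asList("(to,2)", "(be,2)");
        String[] batch = run(results, true);
        String[] stream = run(results, false);
        System.out.println("batch:  client empty? " + batch[0].isEmpty()
                + ", worker empty? " + batch[1].isEmpty());
        System.out.println("stream: client empty? " + stream[0].isEmpty()
                + ", worker empty? " + stream[1].isEmpty());
    }
}
```

In this model the batch-style call leaves the worker stream empty and the stream-style call leaves the client stream empty, which mirrors why the streaming WordCount results have to be looked for on the TaskManager rather than in the client console.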


