Flink: reading a text file and aggregating the names on each line by uid

The text file has roughly 300,000 lines. Each line holds a uid and a name separated by a tab, like this:

001	jack
001	jack
001	rose
004	tom
004	jerry
001	sofia
005	natasha
006	catalina
006	jennifer

The required output is:

001	[jack,rose,sofia]
004	[tom,jerry]
005	[natasha]
006	[catalina,jennifer]
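
To make the target concrete, here is a minimal sketch of the same grouping in plain Java, without Flink. The class name UidGrouping and the method group are hypothetical names chosen for illustration, and the sketch assumes every line contains exactly one tab.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Flink job.
class UidGrouping {
    // Groups "uid<TAB>name" lines into uid -> distinct names,
    // keeping first-seen order for both uids and names.
    public static Map<String, Set<String>> group(List<String> lines) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            result.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "001\tjack", "001\tjack", "001\trose", "004\ttom", "004\tjerry",
                "001\tsofia", "005\tnatasha", "006\tcatalina", "006\tjennifer");
        // Print one line per uid: the uid, a tab, then the set of names.
        group(lines).forEach((uid, names) -> System.out.println(uid + "\t" + names));
    }
}
```

Using a Set rather than a List is what drops the duplicate "001 jack" line.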

First, parse each line of the file into a (uid, set of names) tuple:

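import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.HashSet;
import java.util.Set;

public class Test2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStreamSource = env.readTextFile("E:/test/uid_person.txt");
        // Turn each "uid<TAB>name" line into (uid, {name}).
        SingleOutputStreamOperator<Tuple2<String, Set<String>>> map = dataStreamSource.map(new MapFunction<String, Tuple2<String, Set<String>>>() {
            @Override
            public Tuple2<String, Set<String>> map(String s) throws Exception {
                String[] split = s.split("\t");
                String uid = split[0];
                String name = split[1];
                Set<String> set = new HashSet<>();
                set.add(name);
                return Tuple2.of(uid, set);
            }
        });
        // Sink parallelism 1 so the result is written as a single file.
        map.writeAsText("E:/test/mytest.txt").setParallelism(1);
        env.execute("Test");
    }
}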

Contents of the output file:

(004,[tom])
(004,[jerry])
(001,[sofia])
(001,[jack])
(001,[jack])
(001,[rose])
(006,[jennifer])
(005,[natasha])
(006,[catalina])

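Every line has now become a Tuple2<String, Set<String>>. The next step uses reduce, which combines two values of the same type into one value of that type. The first argument is the result of the previous reduce call and the second argument is the current element. Note that reduce can only work on data of a single type: it merges the two inputs into a new value and returns that single result.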
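
The reduce contract described above can be sketched in plain Java, independent of Flink. The names ReduceSemantics, MERGE and reduceAll are hypothetical and exist only for this illustration; the point is that both inputs and the result share one type, and that each call receives the previous result as its first argument.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BinaryOperator;

// Hypothetical illustration of reduce semantics; not Flink code.
class ReduceSemantics {
    // Merges two sets into a new set. Inputs and result have the same type,
    // which is the constraint a reduce step imposes.
    public static final BinaryOperator<Set<String>> MERGE = (previous, current) -> {
        Set<String> merged = new HashSet<>(previous);
        merged.addAll(current);
        return merged;
    };

    // Folds the values left to right; the first argument of every MERGE call
    // is the result of the previous call.
    public static Set<String> reduceAll(List<Set<String>> values) {
        return values.stream().reduce(MERGE).orElseGet(HashSet::new);
    }

    public static void main(String[] args) {
        // Three single-name sets for uid 001, as produced by the map step.
        List<Set<String>> perLineSets = List.of(Set.of("jack"), Set.of("jack"), Set.of("rose"));
        System.out.println(reduceAll(perLineSets));
    }
}
```

Building a new set on every call costs an extra copy, but it avoids modifying either input, which keeps the function free of side effects.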

Apply keyBy followed by reduce to the mapped records:

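import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.HashSet;
import java.util.Set;

public class Test2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStreamSource = env.readTextFile("E:/test/uid_person.txt");
        // Turn each "uid<TAB>name" line into (uid, {name}).
        SingleOutputStreamOperator<Tuple2<String, Set<String>>> map = dataStreamSource.map(new MapFunction<String, Tuple2<String, Set<String>>>() {
            @Override
            public Tuple2<String, Set<String>> map(String s) throws Exception {
                String[] split = s.split("\t");
                String uid = split[0];
                String name = split[1];
                Set<String> set = new HashSet<>();
                set.add(name);
                return Tuple2.of(uid, set);
            }
        });
        // Key by uid (field f0); a key selector replaces the deprecated keyBy(0).
        map.keyBy(t -> t.f0).reduce(new ReduceFunction<Tuple2<String, Set<String>>>() {
            @Override
            public Tuple2<String, Set<String>> reduce(Tuple2<String, Set<String>> acc, Tuple2<String, Set<String>> current) throws Exception {
                // Merge into a new set instead of mutating the accumulated input.
                Set<String> merged = new HashSet<>(acc.f1);
                merged.addAll(current.f1);
                return Tuple2.of(acc.f0, merged);
            }
        }).writeAsText("E:/test/mytest.txt").setParallelism(1);
        env.execute("Test");
    }
}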

The output is:

(001,[sofia])
(001,[sofia, jack])
(001,[sofia, jack])
(001,[sofia, rose, jack])
(006,[catalina])
(006,[jennifer, catalina])
(005,[natasha])
(004,[tom])
(004,[tom, jerry])

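On a keyed stream, reduce emits an updated result for every incoming element, which is why intermediate sets appear in the output. The last record emitted for each uid is therefore the most complete one.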
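
The rolling behaviour, and one way to keep only the final record per uid, can be sketched in plain Java. This is a simulation under stated assumptions, not Flink code: RollingReduce, emitAll and lastPerUid are hypothetical names, and a TreeSet is used so the printed order is deterministic, whereas the HashSet in the job gives no ordering guarantee.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical simulation of a keyed rolling reduce; not Flink code.
class RollingReduce {
    // After every input record, emits the updated set for that record's uid
    // in the "(uid,[names])" form used by the output file.
    public static List<String> emitAll(List<String[]> records) {
        Map<String, Set<String>> state = new LinkedHashMap<>();
        List<String> emitted = new ArrayList<>();
        for (String[] record : records) {
            Set<String> names = state.computeIfAbsent(record[0], k -> new TreeSet<>());
            names.add(record[1]);
            emitted.add("(" + record[0] + "," + names + ")");
        }
        return emitted;
    }

    // Keeps only the last emitted line per uid; later lines overwrite earlier ones.
    public static Map<String, String> lastPerUid(List<String> emitted) {
        Map<String, String> last = new LinkedHashMap<>();
        for (String line : emitted) {
            String uid = line.substring(1, line.indexOf(','));
            last.put(uid, line);
        }
        return last;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
                new String[]{"001", "jack"}, new String[]{"001", "rose"},
                new String[]{"004", "tom"}, new String[]{"001", "sofia"});
        List<String> emitted = emitAll(records);
        emitted.forEach(System.out::println);              // every intermediate result
        lastPerUid(emitted).values().forEach(System.out::println); // final result per uid
    }
}
```

The same post-processing idea applies to the file written by the job: read it line by line and let later lines for a uid replace earlier ones.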
