The text data has about 300,000 lines, in the following format:
001 jack
001 jack
001 rose
004 tom
004 jerry
001 sofia
005 natasha
006 catalina
006 jennifer
The required output is as follows:
001 [jack,rose,sofia]
004 [tom,jerry]
005 [natasha]
006 [catalina,jennifer]
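To pin down the expected semantics before involving Flink, the grouping can be sketched in plain Java. This is only a minimal illustration: the class name `GroupByUid` and its two helper methods are hypothetical, and it assumes the uid and name are separated by whitespace (the sample shows spaces, while the Flink code in this post splits on a tab).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the Flink job: shows the intended grouping only.
class GroupByUid {

    // Collect the distinct names per uid, keeping first-seen order for uids and names.
    static Map<String, Set<String>> group(List<String> lines) {
        Map<String, Set<String>> byUid = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+"); // assumed whitespace delimiter
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            byUid.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1]);
        }
        return byUid;
    }

    // Render each entry as "uid [name1,name2,...]".
    static List<String> format(Map<String, Set<String>> byUid) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : byUid.entrySet()) {
            out.add(e.getKey() + " [" + String.join(",", e.getValue()) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "001 jack", "001 jack", "001 rose", "004 tom", "004 jerry",
                "001 sofia", "005 natasha", "006 catalina", "006 jennifer");
        format(group(sample)).forEach(System.out::println);
    }
}
```

A set per uid is what removes the duplicate `001 jack` line; the Flink versions below use the same idea with `Set<String>` as the tuple's second field.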
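import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Test2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStreamSource<String> dataStreamSource = env.readTextFile("E:/test/uid_person.txt");
        // Turn each "uid<TAB>name" line into (uid, {name}).
        SingleOutputStreamOperator<Tuple2<String, Set<String>>> map = dataStreamSource.map(new MapFunction<String, Tuple2<String, Set<String>>>() {
            @Override
            public Tuple2<String, Set<String>> map(String s) throws Exception {
                String[] split = s.split("\t");
                String uid = split[0];
                String name = split[1];
                Set<String> set = new HashSet<>();
                set.add(name);
                return Tuple2.of(uid, set);
            }
        });
        // A single sink task, so everything is written to one file.
        map.writeAsText("E:/test/mytest.txt").setParallelism(1);
        env.execute("Test");
    }
}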
Contents of the output file:
(004,[tom])
(004,[jerry])
(001,[sofia])
(001,[jack])
(001,[jack])
(001,[rose])
(006,[jennifer])
(005,[natasha])
(006,[catalina])
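Each line has simply been turned into a Tuple2; records with the same uid are not merged yet. The next version adds keyBy and reduce to merge the sets per uid: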
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> dataStreamSource = env.readTextFile("E:/test/uid_person.txt");
    // Turn each "uid<TAB>name" line into (uid, {name}).
    SingleOutputStreamOperator<Tuple2<String, Set<String>>> map = dataStreamSource.map(new MapFunction<String, Tuple2<String, Set<String>>>() {
        @Override
        public Tuple2<String, Set<String>> map(String s) throws Exception {
            String[] split = s.split("\t");
            String uid = split[0];
            String name = split[1];
            Set<String> set = new HashSet<>();
            set.add(name);
            return Tuple2.of(uid, set);
        }
    });
    // Key by uid (field f0); a key selector replaces the deprecated positional keyBy(0).
    map.keyBy(t -> t.f0).reduce(new ReduceFunction<Tuple2<String, Set<String>>>() {
        @Override
        public Tuple2<String, Set<String>> reduce(Tuple2<String, Set<String>> stringSetTuple2, Tuple2<String, Set<String>> t1) throws Exception {
            // Merge into a fresh set rather than mutating the input record.
            Set<String> merged = new HashSet<>(stringSetTuple2.f1);
            merged.addAll(t1.f1);
            return Tuple2.of(stringSetTuple2.f0, merged);
        }
    }).writeAsText("E:/test/mytest.txt").setParallelism(1);
    env.execute("Test");
}
The output is as follows:
(001,[sofia])
(001,[sofia, jack])
(001,[sofia, jack])
(001,[sofia, rose, jack])
(006,[catalina])
(006,[jennifer, catalina])
(005,[natasha])
(004,[tom])
(004,[tom, jerry])
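This way, the last record emitted for each uid is the most complete one. A reduce on a keyed stream emits an updated result for every incoming record, which is why the intermediate results also appear in the file.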
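If only the final record per uid is wanted, one option is to post-process the output file and keep the last line seen for each uid. The following is a rough sketch in plain Java, not part of the Flink job: the class `LastPerUid` and its method are hypothetical, and it assumes every line has the `(uid,[...])` shape shown above and that the lines for a given uid appear in the file in the order they were emitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing helper for the rolling-reduce output.
class LastPerUid {

    // Keep only the last "(uid,[...])" line for each uid, in first-seen uid order.
    static List<String> lastPerUid(List<String> outputLines) {
        Map<String, String> last = new LinkedHashMap<>();
        for (String line : outputLines) {
            int comma = line.indexOf(',');
            if (!line.startsWith("(") || comma < 0) {
                continue; // not a tuple line, ignore it
            }
            String uid = line.substring(1, comma);
            last.put(uid, line); // later lines overwrite earlier ones
        }
        return new ArrayList<>(last.values());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "(001,[sofia])", "(001,[sofia, jack])", "(001,[sofia, jack])",
                "(001,[sofia, rose, jack])", "(006,[catalina])",
                "(006,[jennifer, catalina])", "(005,[natasha])",
                "(004,[tom])", "(004,[tom, jerry])");
        lastPerUid(lines).forEach(System.out::println);
    }
}
```

Because the uid is taken as everything between the opening parenthesis and the first comma, this only works while uids themselves contain no commas.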