Disclaimer: This blog series is original work, first published on Lagou Education (part of it as free-to-read content). Readers have reposted it to various websites; all other sources are unauthorized copies.
I. WordCount
1. First, create the project, then add the required dependencies
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
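The ${flink.version} and ${scala.binary.version} placeholders must be defined in the pom's <properties> block. A minimal sketch, assuming Flink 1.10.0 and Scala 2.11 (the versions used explicitly in the Flink SQL section below):

<properties>
    <!-- Assumed versions; align them with your cluster -->
    <flink.version>1.10.0</flink.version>
    <scala.binary.version>2.11</scala.binary.version>
</properties>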
2. DataSet WordCount
WordCount is the introductory program of big data processing frameworks: it counts how many times each word appears in a piece of text. The program has two main parts: one splits the text into words; the other groups the words, counts them, and prints the result.
The complete code is as follows:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Wrapper class so the example compiles as-is (name arbitrary)
public class WordCountBatch {

    public static void main(String[] args) throws Exception {
        // Create the context environment in which Flink runs
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // Create a DataSet; here the input is text, one line per element
        DataSet<String> text = env.fromElements(
                "Flink Spark Storm",
                "Flink Flink Flink",
                "Spark Spark Spark",
                "Storm Storm Storm"
        );
        // Compute using Flink's built-in transformation functions
        DataSet<Tuple2<String, Integer>> counts =
                text.flatMap(new LineSplitter())
                        .groupBy(0)
                        .sum(1);
        // Print the result (printToErr writes to stderr)
        counts.printToErr();
    }

    public static final class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            // Split the text into words
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}
The implementation breaks down into the following steps.
(1) Create the Flink execution context environment:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
(2) Use the fromElements function to create a DataSet containing our input, then transform it with the flatMap, groupBy, and sum functions.
(3) Run the program directly and inspect the output.
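For the input above, each of the three words appears four times, so the job should print the following tuples (line order may vary across runs):

(flink,4)
(spark,4)
(storm,4)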
3. DataStream WordCount
To simulate a streaming environment, we listen on a local socket port and use a Flink time window, 5 seconds wide and sliding every second, to print the computed results periodically. The code is as follows:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamingJob {

    public static void main(String[] args) throws Exception {
        // Create the Flink streaming execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Listen on local port 9000
        DataStream<String> text = env.socketTextStream("127.0.0.1", 9000, "\n");
        // Split the incoming data, key it by word, window it, aggregate, and emit
        DataStream<WordWithCount> windowCounts = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                .keyBy("word")
                // Sliding window: 5 seconds wide, advancing every 1 second
                .timeWindow(Time.seconds(5), Time.seconds(1))
                .reduce(new ReduceFunction<WordWithCount>() {
                    @Override
                    public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                        return new WordWithCount(a.word, a.count + b.count);
                    }
                });
        // Print the results with a single thread so the output is not interleaved
        windowCounts.print().setParallelism(1);
        env.execute("Socket Window WordCount");
    }

    // Data type for words with count
    public static class WordWithCount {
        public String word;
        public long count;

        public WordWithCount() {}

        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}
The whole streaming computation breaks down into the following steps:
(1) First, create a streaming execution environment:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
(2) Listen on local port 9000, then split, key, window, aggregate, and emit the received data. The code uses Flink's window functions, which are covered in detail later; a quick preview follows below.
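As that preview: keyBy returns a KeyedStream, and this (pre-Flink-1.12) API offers a tumbling and a sliding overload of timeWindow. A minimal sketch, assuming keyedStream is the KeyedStream produced by keyBy("word") in the job above:

// Tumbling window: fixed, non-overlapping 5-second windows,
// results emitted once per window
keyedStream.timeWindow(Time.seconds(5));
// Sliding window: 5-second windows that advance every second
// (this is the variant the job above uses)
keyedStream.timeWindow(Time.seconds(5), Time.seconds(1));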
(3) Locally, start a listener with the netcat command (-l listens on the port, -k keeps it open across connections):
nc -lk 9000
(4) Run the program and observe the result.
Input:
$ nc -lk 9000
Flink Flink Flink
Flink Spark Storm
Result (both input lines fall into the same 5-second window, so Flink is counted 3 + 1 = 4 times):
Flink : 4
Spark : 1
Storm : 1
4. Flink Table & SQL WordCount
Flink SQL is a development language with standard SQL semantics, designed by Flink's real-time computation engine to simplify the computation model and lower the barrier to entry for real-time processing.
Writing a complete Flink SQL program involves the following parts:
(1) First, add the dependencies in the pom:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-java-bridge_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner-blink_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-planner_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table-api-scala-bridge_2.11</artifactId>
    <version>1.10.0</version>
</dependency>
(2) Create the context environment:
ExecutionEnvironment fbEnv = ExecutionEnvironment.getExecutionEnvironment();
BatchTableEnvironment fbTableEnv = BatchTableEnvironment.create(fbEnv);
(3) Take a single line of text as the input:
String words = "hello flink hello lagou";
String[] split = words.split("\\W+");
ArrayList<WC> list = new ArrayList<>();
for (String word : split) {
    WC wc = new WC(word, 1);
    list.add(wc);
}
DataSet<WC> input = fbEnv.fromCollection(list);
(4) Register the DataSet as a table, execute the SQL, and print the output:
// Convert the DataSet to a Table, specifying the field names
Table table = fbTableEnv.fromDataSet(input, "word,frequency");
table.printSchema();
// Register it as a table
fbTableEnv.createTemporaryView("WordCount", table);
Table table02 = fbTableEnv.sqlQuery(
        "SELECT word, SUM(frequency) AS frequency FROM WordCount GROUP BY word");
// Convert the Table back to a DataSet
DataSet<WC> ds3 = fbTableEnv.toDataSet(table02, WC.class);
ds3.printToErr();
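When run, printSchema first writes the table's schema and printToErr then writes one WC per line (row order may vary). Assuming Flink 1.10's schema-printing format, the output looks roughly like:

root
 |-- word: STRING
 |-- frequency: BIGINT

hello, 2
flink, 1
lagou, 1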
The complete code is as follows:
import java.util.ArrayList;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.BatchTableEnvironment;

public class WordCountSQL {

    public static void main(String[] args) throws Exception {
        // Get the execution environment
        ExecutionEnvironment fbEnv = ExecutionEnvironment.getExecutionEnvironment();
        // Create a TableEnvironment on top of it
        BatchTableEnvironment fbTableEnv = BatchTableEnvironment.create(fbEnv);

        String words = "hello flink hello lagou";
        String[] split = words.split("\\W+");
        ArrayList<WC> list = new ArrayList<>();
        for (String word : split) {
            WC wc = new WC(word, 1);
            list.add(wc);
        }
        DataSet<WC> input = fbEnv.fromCollection(list);

        // Convert the DataSet to a Table, specifying the field names
        Table table = fbTableEnv.fromDataSet(input, "word,frequency");
        table.printSchema();
        // Register it as a table
        fbTableEnv.createTemporaryView("WordCount", table);
        Table table02 = fbTableEnv.sqlQuery(
                "SELECT word, SUM(frequency) AS frequency FROM WordCount GROUP BY word");
        // Convert the Table back to a DataSet
        DataSet<WC> ds3 = fbTableEnv.toDataSet(table02, WC.class);
        ds3.printToErr();
    }

    // POJO used both as the input element type and as the query result type
    public static class WC {
        public String word;
        public long frequency;

        public WC() {}

        public WC(String word, long frequency) {
            this.word = word;
            this.frequency = frequency;
        }

        @Override
        public String toString() {
            return word + ", " + frequency;
        }
    }
}
5. Summary
This article walked through the WordCount scenario with Flink to give you a feel for what Flink, and Flink SQL in particular, can do, laying the groundwork for the content that follows.