1.pom.xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>1.6.0</version>
</dependency>
Scala API: to use the Scala API, replace the flink-java artifact id with flink-scala_2.11, and flink-streaming-java_2.11 with flink-streaming-scala_2.11.
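For a Scala project, the dependency block above would then look like this (same version, only the artifact ids swapped as described):
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-scala_2.11</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.6.0</version>
</dependency>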
2. Using files as a data source
readTextFile(String path): reads the file with the default TextInputFormat and returns each line as a String;
readTextFileWithValue(String path): returns StringValues; StringValue is a mutable string;
readCsvFile(String path): parses CSV lines into Java POJOs or tuples (see the readCsvFile sketch after the WordCount example below);
readFileOfPrimitives(path, delimiter, class): parses each delimiter-separated value into the given primitive class;
readHadoopFile(FileInputFormat, Key, Value, path): reads a Hadoop file with the given input format, key/value classes, and path;
readSequenceFile(Key, Value, path): reads a SequenceFile, again specifying the key and value classes.
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
// get input data
val text = env.readTextFile("/path/to/file")
val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
  .map { (_, 1) }
  .groupBy(0)
  .sum(1)
counts.writeAsCsv(outputPath, "\n", " ")
// writeAsCsv is a lazy sink, so the job still needs an explicit env.execute() call to run
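The list above also mentions readCsvFile. A minimal sketch of reading CSV lines into tuples, assuming a hypothetical two-column file (name, count) at /path/to/people.csv:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
// each line such as "alice,3" becomes a (String, Int) tuple
val people = env.readCsvFile[(String, Int)]("/path/to/people.csv")
people.filter(_._2 >= 2).print()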
3. Using console (socket) input as a data source
$ nc -l 9000
abc,sad,as
asd,a
bv
Then submit the Flink program (the SocketWordCount source is listed below the run commands):
$ ./bin/flink run examples/streaming/SocketWordCount.jar --port 9000
$ bin/flink run examples/streaming/SocketWordCount.jar \
--hostname slave01 \
--port 9000
package cn.com.xxx.zzy;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
* Created with IntelliJ IDEA.
* To change this template use File | Settings | File Templates.
*/
public class SocketWordCount {
public static void main(String[] args) throws Exception {
// the port to connect to
final int port;
try {
final ParameterTool params = ParameterTool.fromArgs(args);
port = params.getInt("port");
} catch (Exception e) {
System.err.println("No port specified. Please run 'SocketWordCount --port <port>'");
return;
}
// get the execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// get input data by connecting to the socket
DataStream<String> text = env.socketTextStream("localhost", port, "\n");
// parse the data, group it, window it, and aggregate the counts
DataStream<WordWithCount> windowCounts = text
.flatMap(new FlatMapFunction<String, WordWithCount>() {
@Override
public void flatMap(String value, Collector<WordWithCount> out) {
for (String word : value.split("\\s")) {
out.collect(new WordWithCount(word, 1L));
}
}
})
.keyBy("word")
.timeWindow(Time.seconds(5), Time.seconds(1))
.reduce(new ReduceFunction<WordWithCount>() {
@Override
public WordWithCount reduce(WordWithCount a, WordWithCount b) {
return new WordWithCount(a.word, a.count + b.count);
}
});
// print the results with a single thread, rather than in parallel
windowCounts.print().setParallelism(1);
env.execute("Socket WordCount");
}
// Data type for words with count
public static class WordWithCount {
public String word;
public long count;
public WordWithCount() {
}
public WordWithCount(String word, long count) {
this.word = word;
this.count = count;
}
@Override
public String toString() {
return word + " : " + count;
}
}
}
4. Using Java collections as a data source
fromCollection(Collection): creates a DataSet from a Java collection;
fromCollection(Iterator, Class): creates a DataSet from an iterator, with the given class as the element type;
fromElements(T ...): creates a DataSet from a sequence of objects;
fromParallelCollection(SplittableIterator, Class): creates a DataSet from an iterator, in parallel;
generateSequence(from, to): generates the sequence of numbers in the given range.
(A short sketch of fromCollection and generateSequence follows the example below.)
package com.gr.dologic
import java.util
import org.apache.flink.api.java.aggregation.Aggregations
import org.apache.flink.api.scala._
object flink1 {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val list = new util.ArrayList[Int]();
list.add(1);
list.add(2);
list.add(3);
val stream = env.fromElements(1,2,3,4,3,4,3,3,5).map(arr=>{
arr
}).filter(_>=2).map{x=>(x,1)}.groupBy(0).aggregate(Aggregations.SUM,1)//.sum(1)
stream.print
// env.execute()
/* With env.execute() enabled, the program fails with:
   Exception in thread "main" java.lang.RuntimeException: No new data sinks have been defined since the last execution.
   The last execution refers to the latest call to 'execute()', 'count()', 'collect()', or 'print()'.
   print() already triggers execution, so the later env.execute() finds no new sinks; commenting out env.execute() fixes it. */
}
}
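A short sketch of the other collection-based sources listed above, fromCollection and generateSequence:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
// fromCollection: turn a local collection into a DataSet
val fromList = env.fromCollection(List(1, 2, 3))
// generateSequence: the numbers 1 to 10 as a DataSet[Long]
val sequence = env.generateSequence(1, 10)
fromList.print()
sequence.print()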
5. Using Kafka as a data source
val properties = new Properties();
properties.setProperty("bootstrap.servers","localhost:9092");
properties.setProperty("zookeeper.connect","localhost:2181");
properties.setProperty("group.id","test");
val stream = env.addSource(new FlinkKafkaConsumer09[String]("mytopic",new SimpleStringSchema(),properties))//.print
Computing the average temperature over a window:
DataStream<Tuple2<String, Double>> keyedStream = env
        .addSource(new FlinkKafkaConsumer09<String>("mytopic", new SimpleStringSchema(), properties))
        // Splitter is assumed to be a FlatMapFunction emitting Tuple2<String, Double> records of (sensorId, temperature)
        .flatMap(new Splitter())
        .keyBy(0)
        .timeWindow(Time.seconds(300))
        .apply(new WindowFunction<Tuple2<String, Double>, Tuple2<String, Double>, Tuple, TimeWindow>() {
            public void apply(Tuple key, TimeWindow window,
                              Iterable<Tuple2<String, Double>> input,
                              Collector<Tuple2<String, Double>> out) throws Exception {
                double sum = 0L;
                int count = 0;
                for (Tuple2<String, Double> record : input) {
                    sum += record.f1;
                    count++;
                }
                Tuple2<String, Double> result = input.iterator().next();
                result.f1 = (sum / count);
                out.collect(result);
            }
        });
Note fault tolerance: enable checkpointing so the Kafka offsets become part of Flink's recoverable state.
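A minimal sketch of turning checkpointing on; the 5-second interval is an arbitrary choice, not a value from the notes:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// checkpoint the job state (including Kafka consumer offsets) every 5000 ms
env.enableCheckpointing(5000)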
A Kafka producer can also be used as a sink:
stream.addSink(new FlinkKafkaProducer09[String]("localhost:9092", "mytopic", new SimpleStringSchema()))
6. Using a relational database as a data source
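This section has no example in the original notes. Below is a minimal sketch using Flink's JDBCInputFormat from the flink-jdbc module; the MySQL URL, credentials, and the users(id INT, name VARCHAR) table are hypothetical, and a matching JDBC driver must be on the classpath:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat
import org.apache.flink.api.java.typeutils.RowTypeInfo
import org.apache.flink.api.scala._
import org.apache.flink.types.Row

val env = ExecutionEnvironment.getExecutionEnvironment

// describe the result schema so Flink knows the field types of each Row
val rowTypeInfo = new RowTypeInfo(BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO)

val jdbcInput = JDBCInputFormat.buildJDBCInputFormat()
  .setDrivername("com.mysql.jdbc.Driver")
  .setDBUrl("jdbc:mysql://localhost:3306/testdb")
  .setUsername("user")
  .setPassword("password")
  .setQuery("SELECT id, name FROM users")
  .setRowTypeInfo(rowTypeInfo)
  .finish()

val users: DataSet[Row] = env.createInput(jdbcInput)
users.print()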
7. Table API
Flink provides a relational interface for both batch and stream processing, called the Table API. Once a DataSet/DataStream has been registered as a Table, relational operations such as aggregations, joins, and selects can be applied to it. A Table can also be queried with standard SQL; after the operations run, the result table is converted back into a DataSet/DataStream. Internally, Flink uses the open-source framework Apache Calcite to optimize these queries.
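A minimal sketch of the batch Table API and SQL flow described above; the Sensor case class and the sample data are made up for illustration, and the flink-table dependency is required:

import org.apache.flink.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

// hypothetical record type used only for this sketch
case class Sensor(id: String, temperature: Double)

object TableApiExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val tableEnv = TableEnvironment.getTableEnvironment(env)

    val sensors = env.fromElements(Sensor("s1", 21.0), Sensor("s1", 23.0), Sensor("s2", 19.5))

    // register the DataSet as a table, then query it with standard SQL
    tableEnv.registerDataSet("sensors", sensors)
    val avgTemps = tableEnv.sqlQuery(
      "SELECT id, AVG(temperature) AS avgTemp FROM sensors GROUP BY id")

    // convert the result table back into a DataSet
    avgTemps.toDataSet[(String, Double)].print()
  }
}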
8. Reference: 精通Apache Flink (Mastering Apache Flink) reading notes 1-5
https://blog.csdn.net/lmalds/article/details/60867262