Having set up the environment and run the hello-world experiment, the next topic is Flink's data processing. Data processing can be approached through the function relation y = f(x) that everyone knows. Purely as a function, it has three elements: a domain A (the x values), a range C (the y values), and a mapping rule f. Viewed from the angle of data flow, x is our data source, f(x) is our system's method (the transformation rule), and y is our output.
About the data source x we can ask:
1. Where does the data come from?
2. When does the data arrive?
3. Who sends the data?
4. In what form does the data arrive?
…
About f(x) we can ask:
What is f(x)'s processing rule?
Which processing rules does f(x) provide?
In which scenarios is f(x) used?
…
About y we can ask:
Where can the output go?
What is the output used for afterwards?
…
Building on these simple questions, we can also have composed forms like y = f(g(x)) — in other words simulations, conversions, transformations, and so on.
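The composed form y = f(g(x)) can be sketched with plain Java function composition (this is a minimal illustration using java.util.function, not Flink; the functions g and f below are arbitrary examples chosen for this sketch):

```java
import java.util.function.Function;

public class Compose {
    public static void main(String[] args) {
        // g(x): parse the raw input (the "source" side of the pipeline)
        Function<String, Integer> g = Integer::parseInt;
        // f(x): transform the parsed value (the "processing" side)
        Function<Integer, Integer> f = n -> n * 2;

        // y = f(g(x)): chain the two rules into a single pipeline
        Function<String, Integer> pipeline = g.andThen(f);

        System.out.println(pipeline.apply("21")); // 42
    }
}
```

The same idea — chaining transformation rules into one pipeline — is what a Flink job graph does at scale.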
For the data source side, the Flink source code provides implementations of various data sources, which the environment further wraps as convenience methods. Users can also define a source of their own; Flink provides a corresponding method for that, shown below. Anything that satisfies the addSource entry contract can serve as a data source:
/**
 * Adds a data source with a custom type information thus opening a
 * {@link DataStream}. Only in very special cases does the user need to
 * support type information. Otherwise use
 * {@link #addSource(org.apache.flink.streaming.api.functions.source.SourceFunction)}
 *
 * @param function
 *            the user defined function
 * @param sourceName
 *            Name of the data source
 * @param <OUT>
 *            type of the returned stream
 * @param typeInfo
 *            the user defined type information for the stream
 * @return the data stream constructed
 */
@SuppressWarnings("unchecked")
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {

    if (typeInfo == null) {
        if (function instanceof ResultTypeQueryable) {
            typeInfo = ((ResultTypeQueryable<OUT>) function).getProducedType();
        } else {
            try {
                typeInfo = TypeExtractor.createTypeInfo(
                        SourceFunction.class,
                        function.getClass(), 0, null, null);
            } catch (final InvalidTypesException e) {
                typeInfo = (TypeInformation<OUT>) new MissingTypeInfo(sourceName, e);
            }
        }
    }

    boolean isParallel = function instanceof ParallelSourceFunction;

    clean(function);

    StreamSource<OUT, ?> sourceOperator;
    if (function instanceof StoppableFunction) {
        sourceOperator = new StoppableStreamSource<>(cast2StoppableSourceFunction(function));
    } else {
        sourceOperator = new StreamSource<>(function);
    }

    return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);
}
Alternatively, you can extend the lower-level DataStream machinery directly.
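To get a feel for the contract addSource expects, here is a minimal plain-Java sketch of it. The SimpleSource interface and FixedSource class below are simplified stand-ins invented purely for illustration; the real contract is org.apache.flink.streaming.api.functions.source.SourceFunction, whose run method receives a SourceContext<T> to collect into rather than a List:

```java
import java.util.ArrayList;
import java.util.List;

public class SourceSketch {
    // Simplified stand-in for Flink's SourceFunction<T>: emit records until cancelled.
    interface SimpleSource<T> {
        void run(List<T> out) throws Exception; // real Flink passes a SourceContext<T>
        void cancel();
    }

    // A hypothetical source that emits a fixed set of records and then stops.
    static class FixedSource implements SimpleSource<String> {
        private volatile boolean running = true;
        private final String[] records = {"a", "b", "c"};

        @Override
        public void run(List<String> out) {
            for (String r : records) {
                if (!running) {
                    break;      // cancel() flips the flag; stop emitting
                }
                out.add(r);     // real code would call ctx.collect(r)
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) {
        List<String> collected = new ArrayList<>();
        new FixedSource().run(collected);
        System.out.println(collected); // [a, b, c]
    }
}
```

The run/cancel pair is the essence: run produces the stream, and cancel must make run return promptly when the job shuts down.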
For the transformation rules — the f(x) part — Flink provides implementations of the common processing functions, such as Map, Reduce, and FlatMap. For transformations on the data stream itself, Flink implements the corresponding set of methods on DataStream (map, flatMap, filter, keyBy, and so on).
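The semantics of map, flatMap, and reduce can be illustrated with plain java.util.stream (this sketch deliberately uses the JDK's streams rather than Flink, just to show the shape of each transformation):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TransformSketch {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("to be", "or not");

        // flatMap: one input element -> zero or more output elements (split lines into words)
        List<String> words = lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());

        // map: one input element -> exactly one output element (word -> its length)
        List<Integer> lengths = words.stream()
                .map(String::length)
                .collect(Collectors.toList());

        // reduce: fold all elements into a single value (total character count)
        int total = lengths.stream().reduce(0, Integer::sum);

        System.out.println(words); // [to, be, or, not]
        System.out.println(total); // 9
    }
}
```

Flink's operators follow the same element-level contracts, but apply them continuously over an unbounded stream instead of a finite collection.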
For beginners, you can create an experimental project directly from the Flink Java template, or follow the official DataStream getting-started tutorial. Because my Scala version is 2.12, I changed a few things; the complete Maven pom.xml is in the appendix.
step1:
curl https://flink.apache.org/q/quickstart.sh | bash -s 1.7.1
step2: create a new WikipediaAnalysis.java
package org.myorg.quickstart;

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

/**
 * @author legotime
 */
public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

        KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
            .keyBy(new KeySelector<WikipediaEditEvent, String>() {
                @Override
                public String getKey(WikipediaEditEvent event) {
                    return event.getUser();
                }
            });

        DataStream<Tuple2<String, Long>> result = keyedEdits
            .timeWindow(Time.seconds(5))
            .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
                    acc.f0 = event.getUser();
                    acc.f1 += event.getByteDiff();
                    return acc;
                }
            });

        result.print();

        see.execute();
    }
}
step3: package
mvn clean package
step4: run
mvn exec:java -Dexec.mainClass=org.myorg.quickstart.WikipediaAnalysis
Every five seconds a word-count-style result appears (each tuple is a user and the sum of that user's byte diffs within the window), similar to:
4> (Samf4u,1)
3> (Rodw,31)
4> (Mlaffs,-3)
1> (Atlantic306,0)
4> (Samf4u,738)
2> (Stephencdickson,45)
4> (01:558:6001:2B:532:3495:C6CE:8A9F,47)
3> (KylieTastic,8795)
4> (Marcocapelle,-36)
3> (.81.254.95,0)
4> (Hotcop2,-11)
This run happened without a standalone Flink cluster started; the leading `4>`, `3>`, etc. is the index of the parallel subtask that printed the record. The DataStream experiments that follow will all be based on this Maven project. Appendix: the complete pom.xml.
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>org.myorg.quickstart</groupId>
	<artifactId>quickstart</artifactId>
	<version>0.1</version>
	<packaging>jar</packaging>

	<name>Flink Quickstart Job</name>
	<url>http://www.myorganization.org</url>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
		<flink.version>1.7.1</flink.version>
		<java.version>1.8</java.version>
		<scala.binary.version>2.12</scala.binary.version>
		<maven.compiler.source>${java.version}</maven.compiler.source>
		<maven.compiler.target>${java.version}</maven.compiler.target>
	</properties>

	<repositories>
		<repository>
			<id>apache.snapshots</id>
			<name>Apache Development Snapshot Repository</name>
			<url>https://repository.apache.org/content/repositories/snapshots/</url>
			<releases>
				<enabled>false</enabled>
			</releases>
			<snapshots>
				<enabled>true</enabled>
			</snapshots>
		</repository>
	</repositories>

	<dependencies>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-java</artifactId>
			<version>1.7.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-java_2.12</artifactId>
			<version>1.7.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-connector-wikiedits_2.12</artifactId>
			<version>${flink.version}</version>
		</dependency>
		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
			<version>1.7.7</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>log4j</groupId>
			<artifactId>log4j</artifactId>
			<version>1.2.17</version>
			<scope>runtime</scope>
		</dependency>
	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.1</version>
				<configuration>
					<source>${java.version}</source>
					<target>${java.version}</target>
				</configuration>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<version>3.0.0</version>
				<executions>
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
						<configuration>
							<artifactSet>
								<excludes>
									<exclude>org.apache.flink:force-shading</exclude>
									<exclude>com.google.code.findbugs:jsr305</exclude>
									<exclude>org.slf4j:*</exclude>
									<exclude>log4j:*</exclude>
								</excludes>
							</artifactSet>
							<filters>
								<filter>
									<artifact>*:*</artifact>
									<excludes>
										<exclude>META-INF/*.SF</exclude>
										<exclude>META-INF/*.DSA</exclude>
										<exclude>META-INF/*.RSA</exclude>
									</excludes>
								</filter>
							</filters>
							<transformers>
								<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
									<mainClass>org.myorg.quickstart.StreamingJob</mainClass>
								</transformer>
							</transformers>
						</configuration>
					</execution>
				</executions>
			</plugin>
		</plugins>
		<pluginManagement>
			<plugins>
				<plugin>
					<groupId>org.eclipse.m2e</groupId>
					<artifactId>lifecycle-mapping</artifactId>
					<version>1.0.0</version>
					<configuration>
						<lifecycleMappingMetadata>
							<pluginExecutions>
								<pluginExecution>
									<pluginExecutionFilter>
										<groupId>org.apache.maven.plugins</groupId>
										<artifactId>maven-shade-plugin</artifactId>
										<versionRange>[3.0.0,)</versionRange>
										<goals>
											<goal>shade</goal>
										</goals>
									</pluginExecutionFilter>
									<action>
										<ignore/>
									</action>
								</pluginExecution>
								<pluginExecution>
									<pluginExecutionFilter>
										<groupId>org.apache.maven.plugins</groupId>
										<artifactId>maven-compiler-plugin</artifactId>
										<versionRange>[3.1,)</versionRange>
										<goals>
											<goal>testCompile</goal>
											<goal>compile</goal>
										</goals>
									</pluginExecutionFilter>
									<action>
										<ignore/>
									</action>
								</pluginExecution>
							</pluginExecutions>
						</lifecycleMappingMetadata>
					</configuration>
				</plugin>
			</plugins>
		</pluginManagement>
	</build>

	<profiles>
		<profile>
			<id>add-dependencies-for-IDEA</id>
			<activation>
				<property>
					<name>idea.version</name>
				</property>
			</activation>
			<dependencies>
				<dependency>
					<groupId>org.apache.flink</groupId>
					<artifactId>flink-java</artifactId>
					<version>${flink.version}</version>
					<scope>compile</scope>
				</dependency>
				<dependency>
					<groupId>org.apache.flink</groupId>
					<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
					<version>${flink.version}</version>
					<scope>compile</scope>
				</dependency>
			</dependencies>
		</profile>
	</profiles>
</project>