Flink Primer: Understanding the Data-Flow API in DataStream

Table of Contents

    • Basic information
      • Where does the data source x come from?
        • Direct input form
        • Socket form
        • File form
        • Custom sources
      • What processing rules f(x) are there? (transformations)
      • Where can the data y go? (Data sinks)
    • Experimental setup
      • Appendix
        • The pom.xml file

After setting up the environment and running the hello world experiment, it is time to talk about Flink's data processing. We can start from the function relation everyone knows: y = f(x). Purely as a function, the concept has three parts: the domain A (the x values), the range C (the y values), and the mapping rule f. Viewed from the angle of data flow, x is our data source, f(x) is our system's processing (the transformation rules), and y is our output.

About the data source x we can ask:

1. Where does the data come from?
2. When does it arrive?
3. Who sends it?
4. In what form does it arrive?
...

About f(x) we can ask:

What is its processing rule?
What processing rules are available?
In what scenarios is it used?

About y we can ask:

Where can it be written?
What is the output used for?
...

Building on these simple questions, we can also form compositions such as y = f(g(x)), that is, chained or nested transformations.

Basic information

Where does the data source x come from?

For data sources, the Flink source code provides implementations for various kinds of sources, which are further wrapped as convenience methods on the environment (StreamExecutionEnvironment).

Direct input form

[Figure 1: the direct-input source methods wrapped in the environment]
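As a rough sketch (not from the original post) of what these wrapper methods look like in practice, the snippet below builds streams directly from in-memory values using fromElements and fromCollection; the class name and job name are made up.

import java.util.Arrays;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DirectInputExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // fromElements: a fixed list of values becomes the stream
        DataStream<String> words = env.fromElements("hello", "flink", "datastream");

        // fromCollection: any java.util.Collection can back a stream as well
        DataStream<Integer> numbers = env.fromCollection(Arrays.asList(1, 2, 3));

        words.print();
        numbers.print();

        env.execute("direct-input-demo");
    }
}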

Socket form

[Figure: the socket source method in the environment]
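The method behind the socket form is socketTextStream. A minimal sketch, assuming a plain text server is listening on localhost:9999 (for example started with `nc -lk 9999`); host, port and names are placeholders:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SocketSourceExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Every line received on the socket becomes one element of the stream.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        lines.print();

        env.execute("socket-source-demo");
    }
}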

File form

[Figure 2: the file-based source methods in the environment]
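The method behind the file form is readTextFile, which reads a text file line by line into a stream. A minimal sketch; the path /tmp/input.txt is only a placeholder:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FileSourceExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // readTextFile reads the file line by line; the path is a placeholder
        DataStream<String> lines = env.readTextFile("/tmp/input.txt");

        lines.print();

        env.execute("file-source-demo");
    }
}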

Custom sources

Users can also define their own sources. Flink provides the corresponding entry point, shown below: anything that satisfies the addSource contract can serve as a data source:

	/**
	 * Adds a data source with a custom type information thus opening a
	 * {@link DataStream}. Only in very special cases does the user need to
	 * support type information. Otherwise use
	 * {@link #addSource(org.apache.flink.streaming.api.functions.source.SourceFunction)}
	 *
	 * @param function
	 * 		the user defined function
	 * @param sourceName
	 * 		Name of the data source
	 * @param <OUT>
	 * 		type of the returned stream
	 * @param typeInfo
	 * 		the user defined type information for the stream
	 * @return the data stream constructed
	 */
	@SuppressWarnings("unchecked")
	public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo) {

		if (typeInfo == null) {
			if (function instanceof ResultTypeQueryable) {
				typeInfo = ((ResultTypeQueryable<OUT>) function).getProducedType();
			} else {
				try {
					typeInfo = TypeExtractor.createTypeInfo(
							SourceFunction.class,
							function.getClass(), 0, null, null);
				} catch (final InvalidTypesException e) {
					typeInfo = (TypeInformation<OUT>) new MissingTypeInfo(sourceName, e);
				}
			}
		}

		boolean isParallel = function instanceof ParallelSourceFunction;

		clean(function);
		StreamSource<OUT, ?> sourceOperator;
		if (function instanceof StoppableFunction) {
			sourceOperator = new StoppableStreamSource<>(cast2StoppableSourceFunction(function));
		} else {
			sourceOperator = new StreamSource<>(function);
		}

		return new DataStreamSource<>(this, typeInfo, sourceOperator, isParallel, sourceName);
	}

Alternatively, you can extend the lower-level DataStream directly.
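As a rough sketch of such a custom source (made up for illustration, not taken from Flink's code base), the class below implements the run/cancel contract of SourceFunction and is registered through addSource:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CustomSourceExample {

    /** A made-up SourceFunction that emits an increasing counter once per second until cancelled. */
    public static class CounterSource implements SourceFunction<Long> {

        private volatile boolean running = true;

        @Override
        public void run(SourceContext<Long> ctx) throws Exception {
            long counter = 0L;
            while (running) {
                ctx.collect(counter++);   // emit one element into the stream
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // register the custom source via addSource; "counter-source" is just a display name
        DataStream<Long> counts = env.addSource(new CounterSource(), "counter-source");

        counts.print();
        env.execute("custom-source-demo");
    }
}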

What processing rules f(x) are there? (transformations)

For the transformation rules, that is, the function f(x), Flink ships the commonly used processing functions, such as Map, FlatMap, and Reduce.

For transformations on data streams, Flink implements a variety of methods on the DataStream class, shown below:

[Figure 3: the transformation methods defined on DataStream]
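A small sketch (not from the original post) of chaining such transformations: the snippet below splits lines with flatMap, partitions by word with keyBy, and aggregates with reduce.

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class TransformationExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> lines = env.fromElements("hello flink", "hello datastream");

        DataStream<Tuple2<String, Integer>> counts = lines
                // FlatMap: split each line into (word, 1) pairs
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split(" ")) {
                            out.collect(new Tuple2<>(word, 1));
                        }
                    }
                })
                // keyBy: partition the stream by the word (tuple field 0)
                .keyBy(0)
                // Reduce: sum the counts for each word
                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
                        return new Tuple2<>(a.f0, a.f1 + b.f1);
                    }
                });

        counts.print();
        env.execute("transformation-demo");
    }
}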

Where can the data y go? (Data sinks)

The sink implementations in the source code, i.e. the places where the data can be written, are as follows:
[Figure 4: the data sink methods]
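A sketch of the common options (class name and output path are placeholders): print writes each element to stdout, writeAsText writes the elements to a file, and addSink attaches any SinkFunction, here the built-in PrintSinkFunction.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;

public class SinkExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> words = env.fromElements("hello", "flink");

        // print: write every element to stdout
        words.print();

        // writeAsText: write the elements to a file; the path is only a placeholder
        words.writeAsText("/tmp/flink-output");

        // addSink: attach any SinkFunction; PrintSinkFunction is a built-in one
        words.addSink(new PrintSinkFunction<>());

        env.execute("sink-demo");
    }
}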

Experimental setup

For beginners, an experimental environment can be created directly from the Flink Java quickstart template, or by following the official DataStream getting-started tutorial. Since my Scala version is 2.12, I changed a few things; the complete Maven pom.xml is given in the appendix.

step1:

curl https://flink.apache.org/q/quickstart.sh | bash -s 1.7.1

step2:

Create a new WikipediaAnalysis.java:

package org.myorg.quickstart;

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditEvent;
import org.apache.flink.streaming.connectors.wikiedits.WikipediaEditsSource;

/**
 * @author legotime
 */
public class WikipediaAnalysis {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<WikipediaEditEvent> edits = see.addSource(new WikipediaEditsSource());

        KeyedStream<WikipediaEditEvent, String> keyedEdits = edits
                .keyBy(new KeySelector<WikipediaEditEvent, String>() {
                    @Override
                    public String getKey(WikipediaEditEvent event) {
                        return event.getUser();
                    }
                });

        DataStream<Tuple2<String, Long>> result = keyedEdits
                .timeWindow(Time.seconds(5))
                .fold(new Tuple2<>("", 0L), new FoldFunction<WikipediaEditEvent, Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> fold(Tuple2<String, Long> acc, WikipediaEditEvent event) {
                        acc.f0 = event.getUser();
                        acc.f1 += event.getByteDiff();
                        return acc;
                    }
                });

        result.print();

        see.execute();
    }
}

step3: package

mvn clean package                                                    

step4: run

mvn exec:java -Dexec.mainClass=org.myorg.quickstart.WikipediaAnalysis

Every five seconds a wordcount-style result appears, similar to the following:

4> (Samf4u,1)
3> (Rodw,31)
4> (Mlaffs,-3)
1> (Atlantic306,0)
4> (Samf4u,738)
2> (Stephencdickson,45)
4> (01:558:6001:2B:532:3495:C6CE:8A9F,47)
3> (KylieTastic,8795)
4> (Marcocapelle,-36)
3> (.81.254.95,0)
4> (Hotcop2,-11)

This was run without starting a Flink cluster. The subsequent DataStream experiments will be based on this Maven project.

Appendix

The pom.xml file

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>

	<groupId>org.myorg.quickstart</groupId>
	<artifactId>quickstart</artifactId>
	<version>0.1</version>
	<packaging>jar</packaging>

	<name>Flink Quickstart Job</name>
	<url>http://www.myorganization.org</url>

	<properties>
		<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
		<flink.version>1.7.1</flink.version>
		<java.version>1.8</java.version>
		<scala.binary.version>2.12</scala.binary.version>
		<maven.compiler.source>${java.version}</maven.compiler.source>
		<maven.compiler.target>${java.version}</maven.compiler.target>
	</properties>

	<repositories>
		<repository>
			<id>apache.snapshots</id>
			<name>Apache Development Snapshot Repository</name>
			<url>https://repository.apache.org/content/repositories/snapshots/</url>
			<releases>
				<enabled>false</enabled>
			</releases>
			<snapshots>
				<enabled>true</enabled>
			</snapshots>
		</repository>
	</repositories>

	<dependencies>
		<!-- Apache Flink dependencies -->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-java</artifactId>
			<version>1.7.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-java_2.12</artifactId>
			<version>1.7.1</version>
		</dependency>

		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-connector-wikiedits_2.12</artifactId>
			<version>${flink.version}</version>
		</dependency>

		<!-- Logging dependencies -->
		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>slf4j-log4j12</artifactId>
			<version>1.7.7</version>
			<scope>runtime</scope>
		</dependency>
		<dependency>
			<groupId>log4j</groupId>
			<artifactId>log4j</artifactId>
			<version>1.2.17</version>
			<scope>runtime</scope>
		</dependency>
	</dependencies>

	<build>
		<plugins>

			<!-- Java compiler -->
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.1</version>
				<configuration>
					<source>${java.version}</source>
					<target>${java.version}</target>
				</configuration>
			</plugin>

			<!-- Shade plugin: build a fat jar containing the job and its dependencies -->
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-shade-plugin</artifactId>
				<version>3.0.0</version>
				<executions>
					<execution>
						<phase>package</phase>
						<goals>
							<goal>shade</goal>
						</goals>
						<configuration>
							<artifactSet>
								<excludes>
									<exclude>org.apache.flink:force-shading</exclude>
									<exclude>com.google.code.findbugs:jsr305</exclude>
									<exclude>org.slf4j:*</exclude>
									<exclude>log4j:*</exclude>
								</excludes>
							</artifactSet>
							<filters>
								<filter>
									<artifact>*:*</artifact>
									<excludes>
										<exclude>META-INF/*.SF</exclude>
										<exclude>META-INF/*.DSA</exclude>
										<exclude>META-INF/*.RSA</exclude>
									</excludes>
								</filter>
							</filters>
							<transformers>
								<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
									<mainClass>org.myorg.quickstart.StreamingJob</mainClass>
								</transformer>
							</transformers>
						</configuration>
					</execution>
				</executions>
			</plugin>
		</plugins>

		<pluginManagement>
			<plugins>

				<!-- Keep Eclipse m2e from complaining about the compiler and shade plugin lifecycle -->
				<plugin>
					<groupId>org.eclipse.m2e</groupId>
					<artifactId>lifecycle-mapping</artifactId>
					<version>1.0.0</version>
					<configuration>
						<lifecycleMappingMetadata>
							<pluginExecutions>
								<pluginExecution>
									<pluginExecutionFilter>
										<groupId>org.apache.maven.plugins</groupId>
										<artifactId>maven-shade-plugin</artifactId>
										<versionRange>[3.0.0,)</versionRange>
										<goals>
											<goal>shade</goal>
										</goals>
									</pluginExecutionFilter>
									<action>
										<ignore/>
									</action>
								</pluginExecution>
								<pluginExecution>
									<pluginExecutionFilter>
										<groupId>org.apache.maven.plugins</groupId>
										<artifactId>maven-compiler-plugin</artifactId>
										<versionRange>[3.1,)</versionRange>
										<goals>
											<goal>testCompile</goal>
											<goal>compile</goal>
										</goals>
									</pluginExecutionFilter>
									<action>
										<ignore/>
									</action>
								</pluginExecution>
							</pluginExecutions>
						</lifecycleMappingMetadata>
					</configuration>
				</plugin>
			</plugins>
		</pluginManagement>
	</build>

	<!-- Profile that adds the Flink dependencies in compile scope so the job can be run from the IDE -->
	<profiles>
		<profile>
			<id>add-dependencies-for-IDEA</id>

			<activation>
				<property>
					<name>idea.version</name>
				</property>
			</activation>

			<dependencies>
				<dependency>
					<groupId>org.apache.flink</groupId>
					<artifactId>flink-java</artifactId>
					<version>${flink.version}</version>
					<scope>compile</scope>
				</dependency>
				<dependency>
					<groupId>org.apache.flink</groupId>
					<artifactId>flink-streaming-java_${scala.binary.version}</artifactId>
					<version>${flink.version}</version>
					<scope>compile</scope>
				</dependency>
			</dependencies>
		</profile>
	</profiles>

</project>
