1 Introduction
Stateful Computations over Data Streams
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.
Here we explain the important aspects of Flink's architecture.
Processing Unbounded and Bounded Data
Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, and user interactions on a website or mobile application: all of this data is generated as a stream.
Data can be processed as unbounded or bounded streams.
Apache Flink excels at processing both unbounded and bounded data sets. Precise control of time and state enables Flink's runtime to run any kind of application on unbounded streams. Bounded streams are processed internally by algorithms and data structures that are designed for fixed-size data sets, yielding excellent performance.
Convince yourself by exploring the use cases built on top of Flink.
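To make the bounded case concrete, here is a plain-Java sketch (no Flink dependency; the class and method names are invented for illustration): with a bounded data set, every record is available up front, so the program can aggregate and terminate with a final result. This is the same logic the batch WordCount example later in this document runs on Flink.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedWordCount {
    // Count words in a fixed (bounded) data set: because the input is finite,
    // the computation can finish and emit one final result.
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(new String[]{"To be, or not to be"}));
        // prints {to=2, be=2, or=1, not=1}
    }
}
```

With an unbounded stream, no such "final" result exists; the program must instead maintain running state and decide, via time and windowing, when to emit intermediate results.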
Deploy Applications Anywhere
Apache Flink is a distributed system and requires compute resources to execute applications. Flink integrates with all common cluster resource managers such as Hadoop YARN, Apache Mesos, and Kubernetes, but can also be set up to run as a standalone cluster.
Flink is designed to work well with each of the resource managers listed above. This is achieved through resource-manager-specific deployment modes that allow Flink to interact with each resource manager in its idiomatic way.
When deploying a Flink application, Flink automatically identifies the required resources based on the application's configured parallelism and requests them from the resource manager. In case of a failure, Flink replaces the failed container by requesting new resources. All communication to submit or control an application happens via REST calls, which eases Flink's integration in many environments.
Run Applications at any Scale
Flink is designed to run stateful streaming applications at any scale. Applications are parallelized into possibly thousands of tasks that are distributed and concurrently executed in a cluster. An application can therefore leverage virtually unlimited amounts of CPU, main memory, disk, and network I/O. Moreover, Flink easily maintains very large application state. Its asynchronous and incremental checkpointing algorithm ensures minimal impact on processing latency while guaranteeing exactly-once state consistency.
Users have reported impressive scalability numbers for Flink applications running in their production environments, such as applications processing multiple trillions of events per day, applications maintaining multiple terabytes of state, and applications running on thousands of cores.
Leverage In-Memory Performance
Stateful Flink applications are optimized for local state access. Task state is always kept in memory or, if the state size exceeds the available memory, in access-efficient on-disk data structures. Tasks therefore perform all computations by accessing local, usually in-memory, state, which yields very low processing latency. Flink guarantees exactly-once state consistency in case of failures by periodically and asynchronously checkpointing the local state to durable storage.
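The interplay of local state and checkpoints can be illustrated with a toy sketch (plain Java, not Flink's actual implementation; all names here are invented): events update a local in-memory map, a checkpoint copies it to a stand-in for durable storage, and recovery restores the last copy. Flink's real checkpoints are asynchronous and can be incremental; this copy is neither.

```java
import java.util.HashMap;
import java.util.Map;

public class LocalStateSketch {
    // Working state is a local in-memory map: every event is applied here,
    // so reads and writes never cross the network.
    private final Map<String, Long> state = new HashMap<>();
    // Stands in for durable storage (HDFS, S3, ...) in this toy example.
    private Map<String, Long> lastCheckpoint = new HashMap<>();

    public void onEvent(String key) {
        state.merge(key, 1L, Long::sum); // local, typically in-memory access
    }

    // A naive, full-copy checkpoint of the local state to "durable" storage.
    public void checkpoint() {
        lastCheckpoint = new HashMap<>(state);
    }

    // After a failure, restore working state from the last completed checkpoint;
    // everything since that checkpoint is reprocessed, giving exactly-once state.
    public void recover() {
        state.clear();
        state.putAll(lastCheckpoint);
    }

    public long get(String key) {
        return state.getOrDefault(key, 0L);
    }
}
```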
1.2 Architecture
2.1.1 Preparation
2.1.1.1 Basic requirements
2.1.1.1.1 Operating system
CentOS 7.2 or later
2.1.1.1.2 JDK, Maven, and HDP versions
2.1.1.1.3 Obtaining the Flink lib package
Option 1: build flink-1.6.0 yourself (recommended; time-consuming, but best compatibility)
Build output directory: /data02/maven/soft/flink-release-1.6.0/
Build environment:
Build time: about one hour
Option 2: download a prebuilt package (recommended; fast, but compatibility may be weaker)
# cd /opt/
# wget http://www.us.apache.org/dist/flink/flink-1.6.0/flink-1.6.0-bin-hadoop27-scala_2.11.tgz
# tar -zxvf flink-1.6.0-bin-hadoop27-scala_2.11.tgz
# mv flink-1.6.0 flink
# chown -R hdfs:hdfs flink/*
Note: every node in the Flink cluster needs these packages.
2.1.1.1.4 Adjusting directory permissions
chmod 755 /var/run
chmod 755 /etc/profile
2.1.1.1.5 Installation user
Use the root user during installation.
2.1.1.1.6 Every node must map the hostnames and IPs of all nodes in /etc/hosts
172.16.5.117 bdp03nn01
172.16.5.118 bdp03nn02
172.16.5.119 bdp03dn01
2.1.1.2 Version selection
Based on Flink's release notes, weigh compatibility and extensibility.
Apache Flink: 1.6.0
2.1.1.3 Role planning
Host                   | Role                     | Notes
172.16.5.117 bdp03nn01 | Flink gateway and client |
172.16.5.118 bdp03nn02 | Flink gateway            |
172.16.5.119 bdp03dn01 | Flink gateway            |
2.1.2 Installation steps
Download the Flink service pack on the Ambari server host.
2.1.2.1 Download ambari-flink-service
1) Run on the Ambari server host:
# VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
# sudo git clone https://github.com/highfei2011/ambari-flink-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/FLINK
Note:
https://github.com/highfei2011/ambari-flink-service.git
is my GitHub repository; if you want a newer Flink version, you also need to update the corresponding parameters in that repository.
2) Change into the directory:
# cd /opt/
3) Download on every cluster node:
# wget http://www.us.apache.org/dist/flink/flink-1.6.0/flink-1.6.0-bin-hadoop27-scala_2.11.tgz
4) Extract to /opt on every cluster node:
tar -zxvf flink-1.6.0-bin-hadoop27-scala_2.11.tgz
mv /opt/flink-1.6.0 /opt/flink
chown -R flink:flink /opt/flink/
chmod -R 777 /opt/flink/
sudo mkdir -p /opt/flink/conf
5) Add environment variables on every node
Append to /etc/profile on every node:
export HADOOP_CLASSPATH=`hadoop classpath`
export CLASSPATH=$CLASSPATH:$HADOOP_CLASSPATH
export FLINK_HOME=/opt/flink/
export PATH=$FLINK_HOME/bin:$PATH

source /etc/profile
2.1.2.2. Restart ambari-server
Run on the ambari-server host:
# sudo systemctl restart ambari-server
or
# sudo service ambari-server restart
2.1.2.3. Install Flink
Choose Actions -> Add Service
Select flink-1.6.0
Select the host on which Flink should start
Finish the wizard
2.1.2.4. Restart the cluster
2.1.2.5. Parameter changes
Modify Flink's parameters in Ambari.
A. Network buffer size
If you run Flink with very high parallelism, you may need to increase the number of network buffers. By default, Flink uses 10% of the JVM heap size as network buffer memory, with a minimum of 64 MB and a maximum of 1 GB; this is configured with the parameters below. Why are network buffers needed?
See https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#configuring-the-network-buffers
taskmanager.network.memory.min — minimum network buffer memory, in bytes (default 64 MB)
taskmanager.network.memory.max — maximum network buffer memory, in bytes (default 1 GB)
taskmanager.network.memory.fraction — fraction of JVM memory used for network buffers (default 0.1)
Defaults:
taskmanager.network.memory.fraction: 0.1
taskmanager.network.memory.min: 67108864
taskmanager.network.memory.max: 1073741824
Startup exception: org.apache.flink.configuration.IllegalConfigurationException: Invalid configuration value for (taskmanager.network.memory.fraction, taskmanager.network.memory.min, taskmanager.network.memory.max) : (0.1, 67108864, 1073741824) - Network buffer memory size too large: 67108864 >= 8388608 (total JVM memory size)
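How the three parameters interact can be sketched as follows (an illustrative helper, not Flink code): the network buffer memory is fraction × total JVM memory, clamped to [min, max]. The exception above fires because the clamped result (67108864 bytes, the 64 MB minimum) is not smaller than the 8 MB total JVM memory.

```java
public class NetworkBufferMath {
    // Flink 1.6 sizes network buffer memory roughly as
    //   clamp(fraction * totalJvmBytes, min, max)
    // and rejects the configuration when the clamped value is not
    // smaller than the total JVM memory (the exception shown above).
    static long networkBufferBytes(long totalJvmBytes, double fraction, long min, long max) {
        long sized = (long) (totalJvmBytes * fraction);
        return Math.max(min, Math.min(max, sized));
    }

    public static void main(String[] args) {
        // Defaults (fraction 0.1, min 64 MB, max 1 GB) on a 4 GB JVM:
        long defaults = networkBufferBytes(4L << 30, 0.1, 64L << 20, 1L << 30);
        System.out.println(defaults); // prints 429496729 (about 410 MB)

        // On a tiny 8 MB JVM the clamp lands on the 64 MB minimum, which
        // exceeds the JVM size itself: the IllegalConfigurationException case.
        long tiny = networkBufferBytes(8L << 20, 0.1, 64L << 20, 1L << 30);
        System.out.println(tiny >= (8L << 20)); // prints true
    }
}
```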
B. YARN parameters
yarn.nodemanager.resource.memory-mb=30G — maximum memory available per NodeManager
yarn.scheduler.maximum-allocation-mb=30G — maximum memory a single container can request
yarn.scheduler.minimum-allocation-mb=1024M — minimum memory per container
containerized.heap-cutoff-min=400
Reference: https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html
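For reference, a sketch of how containerized.heap-cutoff-min interacts with the container size in Flink 1.6 (an illustrative helper, not Flink code; it assumes the default containerized.heap-cutoff-ratio of 0.25): the cutoff reserved for non-heap memory is the larger of the ratio share and the configured minimum, and the TaskManager JVM heap gets the rest of the container.

```java
public class YarnHeapCutoff {
    // Flink 1.6 reserves part of each YARN container for non-heap memory:
    //   cutoff = max(containerized.heap-cutoff-min,
    //                containerSizeMB * containerized.heap-cutoff-ratio)
    // The JVM heap is sized to the remainder of the container.
    static long cutoffMb(long containerMb, double ratio, long cutoffMinMb) {
        return Math.max(cutoffMinMb, (long) (containerMb * ratio));
    }

    public static void main(String[] args) {
        long container = 1024; // a 1024 MB container
        long cutoff = cutoffMb(container, 0.25, 400); // 400 as configured above
        // 25% of 1024 MB is 256 MB, so the 400 MB minimum wins:
        System.out.println("cutoff=" + cutoff + " heapMb=" + (container - cutoff));
        // prints cutoff=400 heapMb=624
    }
}
```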
C. Flink-on-YARN parameters
Case 1: Flink starts directly on YARN; no changes are needed.
Case 2: Flink does not start on YARN; adjust the YARN parameters above
and restart the YARN cluster.
D. Add an environment variable
export HADOOP_CLASSPATH=`hadoop classpath`
2.2.2 flink run job
Resource manager: Flink on YARN
Host: bdp03nn01
Submitting user: hdfs
Command:
flink run --jobmanager yarn-cluster \
  -yn 1 \
  -ytm 768 \
  -yjm 768 \
  /opt/flink/examples/batch/WordCount.jar \
  --input hdfs://bdp03nn01:8020/user/hdfs/demo/input/word \
  --output hdfs://bdp03nn01:8020/user/hdfs/demo/output/wc/
Wait for the job to finish.
View the output:
# hdfs dfs -cat /user/hdfs/demo/output/wc
While the job is running you can open the web UI at
http://host:8081
(port 8081 must be reachable/mapped)
172.16.5.117 bdp03nn01
172.16.5.118 bdp03nn02
172.16.5.119 bdp03dn01
2.2.3 flink sql client
# sql-client.sh embedded
Reference:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/table/sqlClient.html#starting-the-sql-client-cli
IDEA 2018-1.1, jdk-1.8, flink-1.6.0, maven-3.4.5
Prerequisites for building the project:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/quickstart/scala_api_quickstart.html#maven
https://flink.apache.org/downloads.html
2.2.2.1 The pom.xml file
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>cn.acewell</groupId>
    <artifactId>dev-flink-1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
        <jvm.version>1.8</jvm.version>
        <apache.flink.version>1.6.0</apache.flink.version>
        <slf4j.version>1.7.7</slf4j.version>
        <log4j.version>1.2.17</log4j.version>
        <flinkspector.version>0.9.1</flinkspector.version>
        <!-- Additional property values from the original whose tag names were
             lost in extraction: 128m, 1024m, 1024m, 2.10, 1.8, 1.7.2, 2.11,
             0.10_2.11, 1.2.47, 1.2.0, 2.9.1, 2.3.0, 3.3.0 -->
    </properties>

    <repositories>
        <repository>
            <id>horton-works-releases</id>
            <url>http://repo.hortonworks.com/content/groups/public/</url>
        </repository>
        <repository>
            <id>apache-maven</id>
            <url>https://repo.maven.apache.org/maven2/</url>
        </repository>
        <repository>
            <id>mvn-repository</id>
            <url>https://mvnrepository.com/artifact/</url>
        </repository>
        <repository>
            <id>CDH</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

    <organization>
        <name>Acewill</name>
    </organization>
    <developers>
        <developer>
            <name>Jeff Yang</name>
            <email>[email protected]</email>
        </developer>
    </developers>

    <dependencies>
        <dependency>
            <groupId>io.flinkspector</groupId>
            <artifactId>flinkspector-datastream_2.11</artifactId>
            <version>${flinkspector.version}</version>
        </dependency>
        <dependency>
            <groupId>io.flinkspector</groupId>
            <artifactId>flinkspector-dataset_2.11</artifactId>
            <version>${flinkspector.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka-0.10_2.11</artifactId>
            <version>${apache.flink.version}</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${apache.flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table_2.11</artifactId>
            <version>${apache.flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-runtime-web_2.11</artifactId>
            <version>${apache.flink.version}</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>dev-flink-1.6</finalName>
        <sourceDirectory>src/main/java</sourceDirectory>
        <testSourceDirectory>src/test/java</testSourceDirectory>
        <outputDirectory>target/java-${java.version}/classes</outputDirectory>
        <testOutputDirectory>target/java-${java.version}/test-classes</testOutputDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.21.0</version>
                <executions>
                    <execution>
                        <id>default-test</id>
                        <phase>test</phase>
                        <goals>
                            <goal>test</goal>
                        </goals>
                        <configuration>
                            <includes>
                                <include>**/*Test.*</include>
                            </includes>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${jvm.version}</source>
                    <target>${jvm.version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
2.2.2.2 Writing the test class and utility class
WordCount.java
package cn.acewill.flink.batch;
import cn.acewill.flink.utils.WordCountData;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.util.Collector;
/**
* @author Created by yangjf on 20180920.
* Update date:
* Time: 6:12 PM
* Project: dev-flink-1.6
* Package: cn.acewell.flink.batch
* Describe:
* Frequency:
* Result of test: test ok
* Command:
*
* Email: [email protected]
* Status: used online
*
* Please note:
* Check that the configuration file is correct every time you submit!
* Data is priceless! Deleting it by accident has consequences!
*/
public class WordCount {
// *************************************************************************
// PROGRAM
// *************************************************************************
public static void main(String[] args) throws Exception {
final ParameterTool params = ParameterTool.fromArgs(args);
// set up the execution environment
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// make parameters available in the web interface
env.getConfig().setGlobalJobParameters(params);
// get input data
DataSet<String> text;
if (params.has("input")) {
// read the text file from given input path
text = env.readTextFile(params.get("input"));
} else {
// get default test text data
System.out.println("Executing WordCountWindow example with default input data set.");
System.out.println("Use --input to specify file input.");
text = WordCountData.getDefaultTextLineDataSet(env);
}
DataSet<Tuple2<String, Integer>> counts =
// split up the lines in pairs (2-tuples) containing: (word,1)
text.flatMap(new Tokenizer())
// group by the tuple field "0" and sum up tuple field "1"
.groupBy(0)
.sum(1)
.setParallelism(2)
// Number of parallel subtask instances: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/parallel.html
// With setParallelism(2), these tasks (transformations/operators, data sources, and sinks) each run with two parallel instances.
;
// emit result
if (params.has("output")) {
counts.writeAsCsv(params.get("output"), "\n", " ");
// execute program
env.execute("WordCountWindow Example");
} else {
System.out.println("Printing result to stdout. Use --output to specify output path.");
counts.print();
}
}
// *************************************************************************
// USER FUNCTIONS
// *************************************************************************
/**
* Implements the string tokenizer that splits sentences into words as a user-defined
* FlatMapFunction. The function takes a line (String) and splits it into
* multiple pairs in the form of "(word,1)" ({@code Tuple2<String, Integer>}).
*/
public static final class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(new Tuple2<>(token, 1));
}
try {
// Sleeping 1 s per record slows the job down (e.g. to observe it in the web UI).
Thread.sleep(1000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
}
}
}
WordCountData.java
package cn.acewill.flink.utils;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
/**
* @author Created by yangjf on 20180920.
* Update date:
* Time: 6:14 PM
* Project: dev-flink-1.6
* Package: cn.acewell.flink.utils
* Describe:
* Provides the default data set used for the WordCount example program.
* The default data set is used if no parameters are given to the program.
* Frequency: calculated once a day.
* Result of test: test ok
* Command:
*
* Email: [email protected]
* Status: used online
*
* Please note:
* Check that the configuration file is correct every time you submit!
* Data is priceless! Deleting it by accident has consequences!
*/
public class WordCountData {
public static final String[] WORDS = new String[]{
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",
"Or to take arms against a sea of troubles,",
"And by opposing end them?--To die,--to sleep,--",
"No more; and by a sleep to say we end",
"The heartache, and the thousand natural shocks",
"That flesh is heir to,--'tis a consummation",
"Devoutly to be wish'd. To die,--to sleep;--",
"To sleep! perchance to dream:--ay, there's the rub;",
"For in that sleep of death what dreams may come,",
"When we have shuffled off this mortal coil,",
"When we have shuffled off this mortal coil,",
"When we have shuffled off this mortal coil,",
"When we have shuffled off this mortal coil,",
"Must give us pause: there's the respect",
"Must give us pause: there's the respect",
"Must give us pause: there's the respect",
"Must give us pause: there's the respect",
"That makes calamity of so long life;",
"For who would bear the whips and scorns of time,",
"The oppressor's wrong, the proud man's contumely,",
"The pangs of despis'd love, the law's delay,",
"The insolence of office, and the spurns",
"The insolence of office, and the spurns",
"The insolence of office, and the spurns",
"That patient merit of the unworthy takes,",
"When he himself might his quietus make",
"With a bare bodkin? who would these fardels bear,",
"To grunt and sweat under a weary life,",
"But that the dread of something after death,--",
"The undiscover'd country, from whose bourn",
"No traveller returns,--puzzles the will,",
"And makes us rather bear those ills we have",
"Than fly to others that we know not of?",
"Thus conscience does make cowards of us all;",
"And thus the native hue of resolution",
"Is sicklied o'er with the pale cast of thought;",
"And enterprises of great pith and moment,",
"With this regard, their currents turn awry,",
"And lose the name of action.--Soft you now!",
"The fair Ophelia!--Nymph, in thy orisons",
"Be all my sins remember'd."
};
public static DataSet<String> getDefaultTextLineDataSet(ExecutionEnvironment env) {
return env.fromElements(WORDS);
}
}
2.2.2.3 Running WordCount.java
View the result:
Grafana + Prometheus
3.1 Install Grafana
Ambari ships with Grafana, so it only needs to be configured.
The author of the service does not recommend it for production use, but by now most companies have used it in production.
cd ${FLINK_HOME}
Run in the background:
yarn-session.sh -n 1 -s 1 -jm 768 -tm 1024 -qu default -nm flinkapp-from-ambari -d >> /var/log/flink/flink-test.log
./bin/flink run --jobmanager yarn-cluster --yarnqueue offline --yarnjobManagerMemory 1024 --yarncontainer 2 --yarntaskManagerMemory 1024 --yarnslots 3 ./examples/batch/WordCount.jar --input hdfs:///user/hdfs/demo/data/wc.txt --output hdfs:///user/hdfs/demo/result/wc
Building Flink:
https://community.hortonworks.com/articles/2659/exploring-apache-flink-with-hdp.html
http://doc.flink-china.org/1.1.0/setup/building.html
Flink tutorial: https://ci.apache.org/projects/flink/flink-docs-release-1.6/quickstart/setup_quickstart.html
Prebuilt binaries:
http://www.gtlib.gatech.edu/pub/apache/flink/flink-1.6.0/
http://www.us.apache.org/dist/flink/flink-1.6.0/
Installation reference: https://community.hortonworks.com/articles/2659/exploring-apache-flink-with-hdp.html
Run on YARN: https://ci.apache.org/projects/flink/flink-docs-release-1.6/ops/deployment/yarn_setup.html
Flink examples:
https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/batch/examples.html#word-count
https://ci.apache.org/projects/flink/flink-docs-release-1.6/examples/
Flink training: http://training.data-artisans.com/
Flink reference project: https://github.com/highfei2011/flink-training-exercises
Flink metrics: https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/metrics.html#system-metrics
Grafana plugins: https://grafana.com/dashboards/5151
Flink configuration: https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html