This is the first of a few Flink articles that I will update as time allows. It collects some of the understanding and lessons learned from my own Flink development work; I have not dug extremely deep, so comments and corrections are very welcome. The series is based on Flink 1.12.
Introduction to Flink
History
Official overview
Component stack
Use cases
All kinds of streaming computation
Flink installation and deployment
Local mode (for reference only)
How it works
Steps
1. Download the installation package
https://archive.apache.org/dist/flink/
2. Upload flink-1.12.0-bin-scala_2.12.tgz to the target directory on node1
3. Extract it
tar -zxvf flink-1.12.0-bin-scala_2.12.tgz
4. If you run into permission problems, fix the ownership
chown -R root:root /export/server/flink-1.12.0
5. Rename the directory or create a symlink
mv flink-1.12.0 flink
ln -s /export/server/flink-1.12.0 /export/server/flink
Test
1. Prepare the file /root/words.txt
vim /root/words.txt
hello me you her
hello me you
hello me
hello
2. Start the local Flink "cluster"
/export/server/flink/bin/start-cluster.sh
3. jps should show the following two processes
- TaskManagerRunner
- StandaloneSessionClusterEntrypoint
4. Open the Flink Web UI
http://node1:8081/#/overview
In Flink, a slot can be thought of as a resource group: Flink runs a program in parallel by splitting it into subtasks and assigning those subtasks to slots. A short sketch of how parallelism relates to slots follows at the end of this local-mode section.
5. Run the official example
/export/server/flink/bin/flink run /export/server/flink/examples/batch/WordCount.jar --input /root/words.txt --output /root/out
6. Stop Flink
/export/server/flink/bin/stop-cluster.sh
Start the interactive Scala shell (note: at the moment none of the Scala 2.12 builds ship a working Scala Shell)
/export/server/flink/bin/start-scala-shell.sh local
Run the following command
benv.readTextFile("/root/words.txt").flatMap(_.split(" ")).map((_,1)).groupBy(0).sum(1).print()
Exit the shell
:quit
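Here is the slot/parallelism sketch mentioned above: a rough, hypothetical illustration in code (the class name and numbers are made up, not taken from this setup). With one TaskManager offering 2 slots, a job whose parallelism is at most 2 can be scheduled, while a higher parallelism would fail for lack of slots.
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // hypothetical: 1 TaskManager x 2 slots = 2 slots in total, so a job-wide parallelism of 2 fits,
        // while parallelism 3 could not be scheduled on such a cluster
        env.setParallelism(2);
        env.fromElements("hello me you her", "hello me you").print();
        env.execute("slot sketch");
    }
}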
Standalone cluster mode (for reference only)
How it works
Steps
1. Cluster plan:
- Server node1 (master + worker): JobManager + TaskManager
- Server node2 (worker): TaskManager
- Server node3 (worker): TaskManager
2. Edit flink-conf.yaml
vim /export/server/flink/conf/flink-conf.yaml
jobmanager.rpc.address: node1
taskmanager.numberOfTaskSlots: 2
web.submit.enable: true
# history server
jobmanager.archive.fs.dir: hdfs://node1:8020/flink/completed-jobs/
historyserver.web.address: node1
historyserver.web.port: 8082
historyserver.archive.fs.dir: hdfs://node1:8020/flink/completed-jobs/
3. Edit masters
vim /export/server/flink/conf/masters
node1:8081
4. Edit workers
vim /export/server/flink/conf/workers
node1
node2
node3
5. Add the HADOOP_CONF_DIR environment variable
vim /etc/profile
export HADOOP_CONF_DIR=/export/server/hadoop/etc/hadoop
6. Distribute to the other nodes
scp -r /export/server/flink node2:/export/server/flink
scp -r /export/server/flink node3:/export/server/flink
scp /etc/profile node2:/etc/profile
scp /etc/profile node3:/etc/profile
or
for i in {2..3}; do scp -r flink node$i:$PWD; done
7. Reload the profile
source /etc/profile
Test
1. Start the cluster; run the following on node1
/export/server/flink/bin/start-cluster.sh
Or start the components individually
/export/server/flink/bin/jobmanager.sh ((start|start-foreground) cluster)|stop|stop-all
/export/server/flink/bin/taskmanager.sh start|start-foreground|stop|stop-all
2. Start the history server
/export/server/flink/bin/historyserver.sh start
3. Check with jps or open the Flink web UIs
http://node1:8081/#/overview
http://node1:8082/#/overview
4. Run the official example
/export/server/flink/bin/flink run /export/server/flink/examples/batch/WordCount.jar
5. Stop the Flink cluster
/export/server/flink/bin/stop-cluster.sh
Standalone HA (high-availability) cluster mode (for reference only)
How it works
Steps
1. Cluster plan
- Server node1 (master + worker): JobManager + TaskManager
- Server node2 (master + worker): JobManager + TaskManager
- Server node3 (worker): TaskManager
2. Start ZooKeeper
zkServer.sh status
zkServer.sh stop
zkServer.sh start
3. Start HDFS
/export/server/hadoop/sbin/start-dfs.sh
4. Stop the Flink cluster
/export/server/flink/bin/stop-cluster.sh
5. Edit flink-conf.yaml
vim /export/server/flink/conf/flink-conf.yaml
Add the following
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs://node1:8020/flink-checkpoints
high-availability: zookeeper
high-availability.storageDir: hdfs://node1:8020/flink/ha/
high-availability.zookeeper.quorum: node1:2181,node2:2181,node3:2181
6. Edit masters so that both masters are listed
vim /export/server/flink/conf/masters
node1:8081
node2:8081
7. Sync the changes to the other nodes
scp -r /export/server/flink/conf/flink-conf.yaml node2:/export/server/flink/conf/
scp -r /export/server/flink/conf/flink-conf.yaml node3:/export/server/flink/conf/
scp -r /export/server/flink/conf/masters node2:/export/server/flink/conf/
scp -r /export/server/flink/conf/masters node3:/export/server/flink/conf/
8. On node2, edit flink-conf.yaml
vim /export/server/flink/conf/flink-conf.yaml
jobmanager.rpc.address: node2
9. Restart the Flink cluster; on node1 run
/export/server/flink/bin/stop-cluster.sh
/export/server/flink/bin/start-cluster.sh
10. Check with jps
You will find that no Flink processes have started
11. Check the log
cat /export/server/flink/log/flink-root-standalonesession-0-node1.log
It shows an error, because since Flink 1.8 the official binary distribution no longer bundles the jar that integrates with HDFS
12. Download that jar, put it into Flink's lib directory and distribute it, so that Flink can work with Hadoop
Download from
https://flink.apache.org/downloads.html
13. Put it into the lib directory
cd /export/server/flink/lib
14. Distribute it
for i in {2..3}; do scp -r flink-shaded-hadoop-2-uber-2.7.5-10.0.jar node$i:$PWD; done
15. Restart the Flink cluster; on node1 run
/export/server/flink/bin/stop-cluster.sh
/export/server/flink/bin/start-cluster.sh
16. Check with jps; all three machines should now show the expected processes
Test
1. Open the web UIs
http://node1:8081/#/job-manager/config
http://node2:8081/#/job-manager/config
2. Run WordCount
/export/server/flink/bin/flink run /export/server/flink/examples/batch/WordCount.jar
3. Kill one of the masters
4. Run WordCount again; it still completes normally
/export/server/flink/bin/flink run /export/server/flink/examples/batch/WordCount.jar
5. Stop the cluster
/export/server/flink/bin/stop-cluster.sh
Flink on YARN - the mode used for real development
How it works
Two modes
Session mode
Per-job mode
Steps
1. Disable YARN's memory checks
vim /export/server/hadoop/etc/hadoop/yarn-site.xml
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
2. Distribute it
scp -r /export/server/hadoop/etc/hadoop/yarn-site.xml node2:/export/server/hadoop/etc/hadoop/yarn-site.xml
scp -r /export/server/hadoop/etc/hadoop/yarn-site.xml node3:/export/server/hadoop/etc/hadoop/yarn-site.xml
3. Restart YARN
/export/server/hadoop/sbin/stop-yarn.sh
/export/server/hadoop/sbin/start-yarn.sh
Test
Session mode
Start one Flink cluster on YARN and keep reusing it: all jobs submitted afterwards run on that cluster, and its resources stay occupied until the cluster is shut down manually. Suited to large numbers of small jobs.
1. Start a Flink cluster/session on YARN; run the following on node1
/export/server/flink/bin/yarn-session.sh -n 2 -tm 800 -s 1 -d
Explanation:
This requests 2 CPUs and 1600 MB of memory in total (2 TaskManagers x 800 MB).
# -n  number of containers to request, i.e. the number of TaskManagers (deprecated in newer Flink versions, which allocate TaskManagers on demand)
# -tm memory per TaskManager
# -s  number of slots per TaskManager
# -d  run as a detached (background) process
Note:
The following warning can be ignored
WARN org.apache.hadoop.hdfs.DFSClient - Caught exception
java.lang.InterruptedException
2. Open the YARN UI
http://node1:8088/cluster
3. Submit a job with flink run:
/export/server/flink/bin/flink run /export/server/flink/examples/batch/WordCount.jar
After it finishes you can keep submitting other small jobs
/export/server/flink/bin/flink run /export/server/flink/examples/batch/WordCount.jar
4. The ApplicationMaster link above takes you to the Flink web UI
5. Shut down the yarn-session:
yarn application -kill application_1609508087977_0005
Per-job mode - used more often
Start a dedicated Flink cluster on YARN for each job; when the job finishes, the cluster shuts down automatically and its resources are released. Suited to large jobs.
1. Submit the job directly
/export/server/flink/bin/flink run -m yarn-cluster -yjm 1024 -ytm 1024 /export/server/flink/examples/batch/WordCount.jar
# -m        address of the jobmanager (here: yarn-cluster)
# -yjm 1024 memory for the JobManager
# -ytm 1024 memory for each TaskManager
2. Open the YARN UI
http://node1:8088/cluster
Command-line options
/export/server/flink/bin/flink --help
./flink <ACTION> [OPTIONS] [ARGUMENTS]
The following actions are available:
Action "run" compiles and runs a program.
Syntax: run [OPTIONS]
"run" action options:
-c,--class
Class with the program entry point ("main()" method). Only needed if the
JAR file does not specify the class in
its manifest.
-C,--classpath
Adds a URL to each user code classloader on all nodes in the
cluster. The paths must specify a
protocol (e.g. file://) and be
accessible on all nodes (e.g. by means
of a NFS share). You can use this
option multiple times for specifying
more than one URL. The protocol must
be supported by the {@link
java.net.URLClassLoader}.
-d,--detached If present, runs the job in detached
mode
-n,--allowNonRestoredState Allow to skip savepoint state that
cannot be restored. You need to allow
this if you removed an operator from
your program that was part of the
program when the savepoint was
triggered.
-p,--parallelism
The parallelism with which to run the program. Optional flag to override the
default value specified in the
configuration.
-py,--python
Python script with the program entry point. The dependent resources can be
configured with the `--pyFiles`
option.
-pyarch,--pyArchives
Add python archive files for job. The archive files will be extracted to the
working directory of python UDF
worker. Currently only zip-format is
supported. For each archive file, a
target directory be specified. If the
target directory name is specified,
the archive file will be extracted to
a name can directory with the
specified name. Otherwise, the archive
file will be extracted to a directory
with the same name of the archive
file. The files uploaded via this
option are accessible via relative
path. '#' could be used as the
separator of the archive file path and
the target directory name. Comma (',')
could be used as the separator to
specify multiple archive files. This
option can be used to upload the
virtual environment, the data files
used in Python UDF (e.g.: --pyArchives
file:///tmp/py37.zip,file:///tmp/data.
zip#data --pyExecutable
py37.zip/py37/bin/python). The data
files could be accessed in Python UDF,
e.g.: f = open('data/data.txt', 'r').
-pyexec,--pyExecutable
Specify the path of the python interpreter used to execute the python
UDF worker (e.g.: --pyExecutable
/usr/local/bin/python3). The python
UDF worker depends on Python 3.5+,
Apache Beam (version == 2.23.0), Pip
(version >= 7.1.0) and SetupTools
(version >= 37.0.0). Please ensure
that the specified environment meets
the above requirements.
-pyfs,--pyFiles
Attach custom python files for job. These files will be added to the
PYTHONPATH of both the local client
and the remote python UDF worker. The
standard python resource file suffixes
such as .py/.egg/.zip or directory are
all supported. Comma (',') could be
used as the separator to specify
multiple files (e.g.: --pyFiles
file:///tmp/myresource.zip,hdfs:///$na
menode_address/myresource2.zip).
-pym,--pyModule
Python module with the program entry point. This option must be used in
conjunction with `--pyFiles`.
-pyreq,--pyRequirements
Specify a requirements.txt file which defines the third-party dependencies.
These dependencies will be installed
and added to the PYTHONPATH of the
python UDF worker. A directory which
contains the installation packages of
these dependencies could be specified
optionally. Use '#' as the separator
if the optional parameter exists
(e.g.: --pyRequirements
file:///tmp/requirements.txt#file:///t
mp/cached_dir).
-s,--fromSavepoint
Path to a savepoint to restore the job from (for example
hdfs:///flink/savepoint-1537).
-sae,--shutdownOnAttachedExit If the job is submitted in attached
mode, perform a best-effort cluster
shutdown when the CLI is terminated
abruptly, e.g., in response to a user
interrupt, such as typing Ctrl + C.
Options for Generic CLI mode:
-D
Allows specifying multiple generic configuration options. The available options can be found at
https://ci.apache.org/projects/flink/flink-docs-stabl
e/ops/config.html
-e,--executor
DEPRECATED: Please use the -t option instead which is also available with the "Application Mode".
The name of the executor to be used for executing the
given job, which is equivalent to the
"execution.target" config option. The currently
available executors are: "remote", "local",
"kubernetes-session", "yarn-per-job", "yarn-session".
-t,--target
The deployment target for the given application, which is equivalent to the "execution.target" config
option. For the "run" action the currently available
targets are: "remote", "local", "kubernetes-session",
"yarn-per-job", "yarn-session". For the
"run-application" action the currently available
targets are: "kubernetes-application",
"yarn-application".
Options for yarn-cluster mode:
-d,--detached If present, runs the job in detached
mode
-m,--jobmanager
Set to yarn-cluster to use YARN execution mode.
-yat,--yarnapplicationType
Set a custom application type for the application on YARN
-yD
use value for given property
-yd,--yarndetached If present, runs the job in detached
mode (deprecated; use non-YARN
specific option instead)
-yh,--yarnhelp Help for the Yarn session CLI.
-yid,--yarnapplicationId
Attach to running YARN session
-yj,--yarnjar
Path to Flink jar file
-yjm,--yarnjobManagerMemory
Memory for JobManager Container with optional unit (default: MB)
-ynl,--yarnnodeLabel
Specify YARN node label for the YARN application
-ynm,--yarnname
Set a custom name for the application on YARN
-yq,--yarnquery Display available YARN resources
(memory, cores)
-yqu,--yarnqueue
Specify YARN queue.
-ys,--yarnslots
Number of slots per TaskManager
-yt,--yarnship
Ship files in the specified directory (t for transfer)
-ytm,--yarntaskManagerMemory
Memory per TaskManager Container with optional unit (default: MB)
-yz,--yarnzookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Options for default mode:
-D
Allows specifying multiple generic configuration options. The available
options can be found at
https://ci.apache.org/projects/flink/flink-
docs-stable/ops/config.html
-m,--jobmanager
Address of the JobManager to which to connect. Use this flag to connect to a
different JobManager than the one specified
in the configuration. Attention: This
option is respected only if the
high-availability configuration is NONE.
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Action "run-application" runs an application in Application Mode.
Syntax: run-application [OPTIONS]
Options for Generic CLI mode:
-D
Allows specifying multiple generic configuration options. The available options can be found at
https://ci.apache.org/projects/flink/flink-docs-stabl
e/ops/config.html
-e,--executor
DEPRECATED: Please use the -t option instead which is also available with the "Application Mode".
The name of the executor to be used for executing the
given job, which is equivalent to the
"execution.target" config option. The currently
available executors are: "remote", "local",
"kubernetes-session", "yarn-per-job", "yarn-session".
-t,--target
The deployment target for the given application, which is equivalent to the "execution.target" config
option. For the "run" action the currently available
targets are: "remote", "local", "kubernetes-session",
"yarn-per-job", "yarn-session". For the
"run-application" action the currently available
targets are: "kubernetes-application",
"yarn-application".
Action "info" shows the optimized execution plan of the program (JSON).
Syntax: info [OPTIONS]
"info" action options:
-c,--class
Class with the program entry point ("main()" method). Only needed if the JAR
file does not specify the class in its
manifest.
-p,--parallelism
The parallelism with which to run the program. Optional flag to override the
default value specified in the
configuration.
Action "list" lists running and scheduled programs.
Syntax: list [OPTIONS]
"list" action options:
-a,--all Show all programs and their JobIDs
-r,--running Show only running programs and their JobIDs
-s,--scheduled Show only scheduled programs and their JobIDs
Options for Generic CLI mode:
-D
Allows specifying multiple generic configuration options. The available options can be found at
https://ci.apache.org/projects/flink/flink-docs-stabl
e/ops/config.html
-e,--executor
DEPRECATED: Please use the -t option instead which is also available with the "Application Mode".
The name of the executor to be used for executing the
given job, which is equivalent to the
"execution.target" config option. The currently
available executors are: "remote", "local",
"kubernetes-session", "yarn-per-job", "yarn-session".
-t,--target
The deployment target for the given application, which is equivalent to the "execution.target" config
option. For the "run" action the currently available
targets are: "remote", "local", "kubernetes-session",
"yarn-per-job", "yarn-session". For the
"run-application" action the currently available
targets are: "kubernetes-application",
"yarn-application".
Options for yarn-cluster mode:
-m,--jobmanager
Set to yarn-cluster to use YARN execution mode.
-yid,--yarnapplicationId
Attach to running YARN session
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Options for default mode:
-D
Allows specifying multiple generic configuration options. The available
options can be found at
https://ci.apache.org/projects/flink/flink-
docs-stable/ops/config.html
-m,--jobmanager
Address of the JobManager to which to connect. Use this flag to connect to a
different JobManager than the one specified
in the configuration. Attention: This
option is respected only if the
high-availability configuration is NONE.
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Action "stop" stops a running program with a savepoint (streaming jobs only).
Syntax: stop [OPTIONS]
"stop" action options:
-d,--drain Send MAX_WATERMARK before taking the
savepoint and stopping the pipeline.
-p,--savepointPath
Path to the savepoint (for example hdfs:///flink/savepoint-1537). If no
directory is specified, the configured
default will be used
("state.savepoints.dir").
Options for Generic CLI mode:
-D
Allows specifying multiple generic configuration options. The available options can be found at
https://ci.apache.org/projects/flink/flink-docs-stabl
e/ops/config.html
-e,--executor
DEPRECATED: Please use the -t option instead which is also available with the "Application Mode".
The name of the executor to be used for executing the
given job, which is equivalent to the
"execution.target" config option. The currently
available executors are: "remote", "local",
"kubernetes-session", "yarn-per-job", "yarn-session".
-t,--target
The deployment target for the given application, which is equivalent to the "execution.target" config
option. For the "run" action the currently available
targets are: "remote", "local", "kubernetes-session",
"yarn-per-job", "yarn-session". For the
"run-application" action the currently available
targets are: "kubernetes-application",
"yarn-application".
Options for yarn-cluster mode:
-m,--jobmanager
Set to yarn-cluster to use YARN execution mode.
-yid,--yarnapplicationId
Attach to running YARN session
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Options for default mode:
-D
Allows specifying multiple generic configuration options. The available
options can be found at
https://ci.apache.org/projects/flink/flink-
docs-stable/ops/config.html
-m,--jobmanager
Address of the JobManager to which to connect. Use this flag to connect to a
different JobManager than the one specified
in the configuration. Attention: This
option is respected only if the
high-availability configuration is NONE.
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Action "cancel" cancels a running program.
Syntax: cancel [OPTIONS]
"cancel" action options:
-s,--withSavepoint
**DEPRECATION WARNING**: Cancelling a job with savepoint is deprecated.
Use "stop" instead.
Trigger savepoint and cancel job.
The target directory is optional. If
no directory is specified, the
configured default directory
(state.savepoints.dir) is used.
Options for Generic CLI mode:
-D
Allows specifying multiple generic configuration options. The available options can be found at
https://ci.apache.org/projects/flink/flink-docs-stabl
e/ops/config.html
-e,--executor
DEPRECATED: Please use the -t option instead which is also available with the "Application Mode".
The name of the executor to be used for executing the
given job, which is equivalent to the
"execution.target" config option. The currently
available executors are: "remote", "local",
"kubernetes-session", "yarn-per-job", "yarn-session".
-t,--target
The deployment target for the given application, which is equivalent to the "execution.target" config
option. For the "run" action the currently available
targets are: "remote", "local", "kubernetes-session",
"yarn-per-job", "yarn-session". For the
"run-application" action the currently available
targets are: "kubernetes-application",
"yarn-application".
Options for yarn-cluster mode:
-m,--jobmanager
Set to yarn-cluster to use YARN execution mode.
-yid,--yarnapplicationId
Attach to running YARN session
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Options for default mode:
-D
Allows specifying multiple generic configuration options. The available
options can be found at
https://ci.apache.org/projects/flink/flink-
docs-stable/ops/config.html
-m,--jobmanager
Address of the JobManager to which to connect. Use this flag to connect to a
different JobManager than the one specified
in the configuration. Attention: This
option is respected only if the
high-availability configuration is NONE.
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Action "savepoint" triggers savepoints for a running job or disposes existing ones.
Syntax: savepoint [OPTIONS] <Job ID> [<target directory>]
"savepoint" action options:
-d,--dispose
Path of savepoint to dispose.
-j,--jarfile
Flink program JAR file.
Options for Generic CLI mode:
-D
Allows specifying multiple generic configuration options. The available options can be found at
https://ci.apache.org/projects/flink/flink-docs-stabl
e/ops/config.html
-e,--executor
DEPRECATED: Please use the -t option instead which is also available with the "Application Mode".
The name of the executor to be used for executing the
given job, which is equivalent to the
"execution.target" config option. The currently
available executors are: "remote", "local",
"kubernetes-session", "yarn-per-job", "yarn-session".
-t,--target
The deployment target for the given application, which is equivalent to the "execution.target" config
option. For the "run" action the currently available
targets are: "remote", "local", "kubernetes-session",
"yarn-per-job", "yarn-session". For the
"run-application" action the currently available
targets are: "kubernetes-application",
"yarn-application".
Options for yarn-cluster mode:
-m,--jobmanager
Set to yarn-cluster to use YARN execution mode.
-yid,--yarnapplicationId
Attach to running YARN session
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Options for default mode:
-D
Allows specifying multiple generic configuration options. The available
options can be found at
https://ci.apache.org/projects/flink/flink-
docs-stable/ops/config.html
-m,--jobmanager
Address of the JobManager to which to connect. Use this flag to connect to a
different JobManager than the one specified
in the configuration. Attention: This
option is respected only if the
high-availability configuration is NONE.
-z,--zookeeperNamespace
Namespace to create the Zookeeper sub-paths for high availability mode
Flink getting-started examples
Preliminary notes
Note: the getting-started example uses the DataSet API; after this we will no longer use it and will instead use the unified batch/stream DataStream API
https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/batch/
Prepare the environment
POM
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>XX.XXXX</groupId>
    <artifactId>flink_XXXXX</artifactId>
    <version>1.0-SNAPSHOT</version>
    <repositories>
        <repository>
            <id>aliyun</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
        <repository>
            <id>apache</id>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.12</scala.version>
        <flink.version>1.12.0</flink.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-scala_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-scala-bridge_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-api-java-bridge_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-planner-blink_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-table-common</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-sql-connector-kafka_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-csv</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-json</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.bahir</groupId>
            <artifactId>flink-connector-redis_2.11</artifactId>
            <version>1.0</version>
            <exclusions>
                <exclusion>
                    <artifactId>flink-streaming-java_2.11</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>flink-runtime_2.11</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>flink-core</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
                <exclusion>
                    <artifactId>flink-java</artifactId>
                    <groupId>org.apache.flink</groupId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-hive_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-metastore</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-shaded-hadoop-2-uber</artifactId>
            <version>2.7.5-10.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <dependency>
            <groupId>io.vertx</groupId>
            <artifactId>vertx-core</artifactId>
            <version>3.9.0</version>
        </dependency>
        <dependency>
            <groupId>io.vertx</groupId>
            <artifactId>vertx-jdbc-client</artifactId>
            <version>3.9.0</version>
        </dependency>
        <dependency>
            <groupId>io.vertx</groupId>
            <artifactId>vertx-redis-client</artifactId>
            <version>3.9.0</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.7</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.44</version>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.2</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <build>
        <sourceDirectory>src/main/java</sourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
Code: DataSet API (for reference only)
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * Author ZuoYan
 * Desc Flink DataSet API WordCount demo
 */
public class WordCount {
    public static void main(String[] args) throws Exception {
        //TODO 0.env
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        //TODO 1.source
        DataSet<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");

        //TODO 2.transformation
        //split each line into words
        /*
        @FunctionalInterface
        public interface FlatMapFunction<T, O> extends Function, Serializable {
            void flatMap(T value, Collector<O> out) throws Exception;
        }
        */
        DataSet<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                //value is one line of input
                String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });

        //map each word to (word, 1)
        /*
        @FunctionalInterface
        public interface MapFunction<T, O> extends Function, Serializable {
            O map(T value) throws Exception;
        }
        */
        DataSet<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                //value is a single word
                return Tuple2.of(value, 1);
            }
        });

        //group by the word
        UnsortedGrouping<Tuple2<String, Integer>> grouped = wordAndOne.groupBy(0);

        //aggregate
        AggregateOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        //TODO 3.sink
        result.print();
    }
}
Code: DataStream API - anonymous inner classes - batch processing
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Author ZuoYan
 * Desc Flink DataStream API WordCount demo
 * Note: in Flink 1.12 the DataStream API supports both stream and batch processing; the runtime mode decides which.
 */
public class WordCount2 {
    public static void main(String[] args) throws Exception {
        //TODO 0.env
        //ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setRuntimeMode(RuntimeExecutionMode.BATCH);     //run the DataStream program as a batch job
        //env.setRuntimeMode(RuntimeExecutionMode.STREAMING); //run the DataStream program as a streaming job
        //env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); //let Flink pick batch or streaming based on the sources

        //TODO 1.source
        //DataSet<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");
        DataStream<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");

        //TODO 2.transformation
        //split each line into words
        /*
        @FunctionalInterface
        public interface FlatMapFunction<T, O> extends Function, Serializable {
            void flatMap(T value, Collector<O> out) throws Exception;
        }
        */
        DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                //value is one line of input
                String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });

        //map each word to (word, 1)
        /*
        @FunctionalInterface
        public interface MapFunction<T, O> extends Function, Serializable {
            O map(T value) throws Exception;
        }
        */
        DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                //value is a single word
                return Tuple2.of(value, 1);
            }
        });

        //group: the DataSet API groups with groupBy, the DataStream API with keyBy
        //wordAndOne.keyBy(0);
        /*
        @FunctionalInterface
        public interface KeySelector<IN, KEY> extends Function, Serializable {
            KEY getKey(IN value) throws Exception;
        }
        */
        KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        //aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        //TODO 3.sink
        result.print();

        //TODO 4.execute - start the job and wait for it to finish
        env.execute();
    }
}
Code: DataStream API - anonymous inner classes - stream processing
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

/**
 * Author ZuoYan
 * Desc Flink DataStream API WordCount demo
 * Note: in Flink 1.12 the DataStream API supports both stream and batch processing; the runtime mode decides which.
 */
public class WordCount3 {
    public static void main(String[] args) throws Exception {
        //TODO 0.env
        //ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setRuntimeMode(RuntimeExecutionMode.BATCH);     //run the DataStream program as a batch job
        //env.setRuntimeMode(RuntimeExecutionMode.STREAMING); //run the DataStream program as a streaming job
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);   //let Flink pick batch or streaming based on the sources

        //TODO 1.source
        //DataSet<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");
        //DataStream<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");
        DataStream<String> lines = env.socketTextStream("node1", 9999);

        //TODO 2.transformation
        //split each line into words
        /*
        @FunctionalInterface
        public interface FlatMapFunction<T, O> extends Function, Serializable {
            void flatMap(T value, Collector<O> out) throws Exception;
        }
        */
        DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                //value is one line of input
                String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });

        //map each word to (word, 1)
        /*
        @FunctionalInterface
        public interface MapFunction<T, O> extends Function, Serializable {
            O map(T value) throws Exception;
        }
        */
        DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                //value is a single word
                return Tuple2.of(value, 1);
            }
        });

        //group: the DataSet API groups with groupBy, the DataStream API with keyBy
        //wordAndOne.keyBy(0);
        /*
        @FunctionalInterface
        public interface KeySelector<IN, KEY> extends Function, Serializable {
            KEY getKey(IN value) throws Exception;
        }
        */
        KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        //aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        //TODO 3.sink
        result.print();

        //TODO 4.execute - start the job and wait for it to finish
        env.execute();
    }
}
Code: DataStream API - lambdas
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import java.util.Arrays;

/**
 * Author ZuoYan
 * Desc Flink DataStream API WordCount demo, written with Java lambdas
 * Note: in Flink 1.12 the DataStream API supports both stream and batch processing; the runtime mode decides which.
 */
public class WordCount4 {
    public static void main(String[] args) throws Exception {
        //TODO 0.env
        //ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setRuntimeMode(RuntimeExecutionMode.BATCH);     //run the DataStream program as a batch job
        //env.setRuntimeMode(RuntimeExecutionMode.STREAMING); //run the DataStream program as a streaming job
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);   //let Flink pick batch or streaming based on the sources

        //TODO 1.source
        //DataSet<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");
        DataStream<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");

        //TODO 2.transformation
        //split each line into words
        /*
        @FunctionalInterface
        public interface FlatMapFunction<T, O> extends Function, Serializable {
            void flatMap(T value, Collector<O> out) throws Exception;
        }
        */
        /*
        DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                //value is one line of input
                String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });
        */
        SingleOutputStreamOperator<String> words = lines.flatMap(
                (String value, Collector<String> out) -> Arrays.stream(value.split(" ")).forEach(out::collect)
        ).returns(Types.STRING); //returns(...) is needed because Java lambdas erase the generic type information

        //map each word to (word, 1)
        /*
        @FunctionalInterface
        public interface MapFunction<T, O> extends Function, Serializable {
            O map(T value) throws Exception;
        }
        */
        /*
        DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                //value is a single word
                return Tuple2.of(value, 1);
            }
        });
        */
        DataStream<Tuple2<String, Integer>> wordAndOne = words.map(
                (String value) -> Tuple2.of(value, 1)
        ).returns(Types.TUPLE(Types.STRING, Types.INT));

        //group: the DataSet API groups with groupBy, the DataStream API with keyBy
        //wordAndOne.keyBy(0);
        /*
        @FunctionalInterface
        public interface KeySelector<IN, KEY> extends Function, Serializable {
            KEY getKey(IN value) throws Exception;
        }
        */
        KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        //aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        //TODO 3.sink
        result.print();

        //TODO 4.execute - start the job and wait for it to finish
        env.execute();
    }
}
Code: running on YARN
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import java.util.Arrays;

/**
 * Author ZuoYan
 * Desc Flink DataStream API WordCount demo, packaged for submission to YARN
 * Note: in Flink 1.12 the DataStream API supports both stream and batch processing; the runtime mode decides which.
 */
public class WordCount5_Yarn {
    public static void main(String[] args) throws Exception {
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String output = "";
        if (parameterTool.has("output")) {
            output = parameterTool.get("output");
            System.out.println("Using the specified output path: " + output);
        } else {
            output = "hdfs://node1:8020/wordcount/output47_";
            System.out.println("No --output specified, falling back to the default: " + output);
        }

        //TODO 0.env
        //ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //env.setRuntimeMode(RuntimeExecutionMode.BATCH);     //run the DataStream program as a batch job
        //env.setRuntimeMode(RuntimeExecutionMode.STREAMING); //run the DataStream program as a streaming job
        //env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); //let Flink pick batch or streaming based on the sources

        //TODO 1.source
        //DataSet<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");
        DataStream<String> lines = env.fromElements("itcast hadoop spark", "itcast hadoop spark", "itcast hadoop", "itcast");

        //TODO 2.transformation
        //split each line into words
        /*
        @FunctionalInterface
        public interface FlatMapFunction<T, O> extends Function, Serializable {
            void flatMap(T value, Collector<O> out) throws Exception;
        }
        */
        /*
        DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                //value is one line of input
                String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });
        */
        SingleOutputStreamOperator<String> words = lines.flatMap(
                (String value, Collector<String> out) -> Arrays.stream(value.split(" ")).forEach(out::collect)
        ).returns(Types.STRING);

        //map each word to (word, 1)
        /*
        @FunctionalInterface
        public interface MapFunction<T, O> extends Function, Serializable {
            O map(T value) throws Exception;
        }
        */
        /*
        DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String value) throws Exception {
                //value is a single word
                return Tuple2.of(value, 1);
            }
        });
        */
        DataStream<Tuple2<String, Integer>> wordAndOne = words.map(
                (String value) -> Tuple2.of(value, 1)
        ).returns(Types.TUPLE(Types.STRING, Types.INT));

        //group: the DataSet API groups with groupBy, the DataStream API with keyBy
        //wordAndOne.keyBy(0);
        /*
        @FunctionalInterface
        public interface KeySelector<IN, KEY> extends Function, Serializable {
            KEY getKey(IN value) throws Exception;
        }
        */
        KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        //aggregate
        SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        //TODO 3.sink
        //if the job fails with an HDFS permission error, you can run: hadoop fs -chmod -R 777 /
        System.setProperty("HADOOP_USER_NAME", "root"); //set the HDFS user name
        //result.print();
        //result.writeAsText("hdfs://node1:8020/wordcount/output47_" + System.currentTimeMillis()).setParallelism(1);
        result.writeAsText(output + System.currentTimeMillis()).setParallelism(1);

        //TODO 4.execute - start the job and wait for it to finish
        env.execute();
    }
}
Package the jar, rename it, and upload it
Submit
/export/server/flink/bin/flink run -Dexecution.runtime-mode=BATCH -m yarn-cluster -yjm 1024 -ytm 1024 -c cn.itcast.hello.WordCount5_Yarn /root/wc.jar --output hdfs://node1:8020/wordcount/output_xx
Note
RuntimeExecutionMode.BATCH     // run the DataStream program as a batch job
RuntimeExecutionMode.STREAMING // run the DataStream program as a streaming job
RuntimeExecutionMode.AUTOMATIC // let the DataStream program pick batch or streaming based on its sources
// if no mode is set, the default is STREAMING
In later Flink development, simply treat every data source as a stream, or just use AUTOMATIC.
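To make the difference between the modes concrete, here is a minimal, self-contained sketch (the class name and sample data are made up for illustration): with BATCH only the final count per word is printed, while with STREAMING an updated count is printed for every incoming element.
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class RuntimeModeDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // switch to RuntimeExecutionMode.STREAMING to see incremental updates instead of only final counts
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.fromElements("hello flink", "hello world")
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String w : line.split(" ")) {
                   out.collect(Tuple2.of(w, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();
        env.execute();
    }
}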
A first look at how Flink works - take your time to digest this
Roles and responsibilities
Execution flow
DataFlow
https://ci.apache.org/projects/flink/flink-docs-release-1.12/concepts/glossary.html
DataFlow, Operator, Partition, Parallelism, SubTask
OperatorChain and Task
TaskSlot and TaskSlotSharing
How the execution graph is generated
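The logical graph Flink builds for a job can also be inspected from code. Below is a small sketch (reusing the WordCount pipeline from the examples above; the class name is made up) that prints the StreamGraph as JSON; the output can be pasted into the Flink plan visualizer at https://flink.apache.org/visualizer/ to see the generated graph.
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class ExecutionPlanSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.fromElements("itcast hadoop spark", "itcast hadoop")
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String w : line.split(" ")) {
                   out.collect(Tuple2.of(w, 1));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();
        // prints the StreamGraph (logical plan) as JSON instead of executing the job
        System.out.println(env.getExecutionPlan());
    }
}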
WeChat official account: 漫话架构之美
An original big-data technology account focused on Hadoop, Flink, Spark, Kafka, Hive, HBase and more, covering big-data internals, data warehousing, data governance, and emerging big-data technologies