Flink Application Development
Like Spark, Flink is a one-stop processing framework: it supports both batch processing (the DataSet API) and real-time stream processing (the DataStream API).
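A minimal sketch of the two entry points, assuming the flink-scala and flink-streaming-scala dependencies listed below: the batch (DataSet) API starts from an ExecutionEnvironment, the streaming (DataStream) API from a StreamExecutionEnvironment.
import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// batch entry point: builds DataSet programs
val batchEnv = ExecutionEnvironment.getExecutionEnvironment
// streaming entry point: builds DataStream programs
val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment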
Maven dependencies (pom.xml):
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.2</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <hadoop.version>2.6.2</hadoop.version>
    <flink.version>1.5.0</flink.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.22</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.9_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>
DataSet Development
Development workflow
1. Obtain an execution environment.
2. Load/create the initial data.
3. Specify the transformations on this data.
4. Specify where to put the results of the computation.
5. Trigger the program execution.
Example
object DataSet_WordCount {
def main(args: Array[String]) {
//TODO initialize the environment
val env = ExecutionEnvironment.getExecutionEnvironment
//TODO load/create the initial data
val text = env.fromElements(
"Who's there?",
"I think I hear them. Stand, ho! Who's there?")
//TODO specify the transformations on the data
val split_words = text.flatMap(line => line.toLowerCase().split("\\W+"))
val filter_words = split_words.filter(x=> x.nonEmpty)
val map_words = filter_words.map(x=> (x,1))
val groupBy_words = map_words.groupBy(0)
val sum_words = groupBy_words.sum(1)
//TODO specify where to put the results
// sum_words.setParallelism(1)//write the result as a single file
sum_words.writeAsText(args(0))//e.g. "/Users/niutao/Desktop/flink.txt"
//TODO trigger the program execution
env.execute("DataSet wordCount")
}
}
Package the program and submit it to YARN
Add the Maven build plugins:
<build>
    <sourceDirectory>src/main/java</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.5.1</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.18.1</version>
            <configuration>
                <useFile>false</useFile>
                <disableXmlReport>true</disableXmlReport>
                <includes>
                    <include>**/*Test.*</include>
                    <include>**/*Suite.*</include>
                </includes>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.itcast.DEMO.WordCount</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
After a successful build, the JAR is generated in the target folder.
Upload the JAR (for example with the rz command) and run the program:
bin/flink run -m yarn-cluster -yn 2 /home/elasticsearch/flinkjar/itcast_learn_flink-1.0-SNAPSHOT.jar com.itcast.DEMO.WordCount
The submitted job can be observed on the YARN web UI (port 8088).
The output of the run can be found under /export/servers/flink-1.3.2/flinkJAR.
Transformation | Description
Map | Takes one element and produces one element.
FlatMap | Takes one element and produces zero, one, or more elements.
MapPartition | Transforms a parallel partition in a single function call. The function gets the partition as an `Iterator` and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree-of-parallelism and previous operations.
Filter | Evaluates a boolean function for each element and retains those for which the function returns true.
Reduce | Combines a group of elements into a single element by repeatedly combining two elements into one. Reduce may be applied on a full data set, or on a grouped data set.
ReduceGroup | Combines a group of elements into one or more elements. ReduceGroup may be applied on a full data set, or on a grouped data set.
Aggregate | Aggregates a group of values into a single value. Aggregation functions can be thought of as built-in reduce functions. Aggregate may be applied on a full data set, or on a grouped data set.
Distinct | Returns the distinct elements of a data set. It removes the duplicate entries from the input DataSet, with respect to all fields of the elements, or a subset of fields.
Join | Joins two data sets by creating all pairs of elements that are equal on their keys. Optionally uses a JoinFunction to turn the pair of elements into a single element, or a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys. Note that the join transformation works only for equi-joins. Other join types need to be expressed using OuterJoin or CoGroup.
OuterJoin | Performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, records of the "outer" side (left, right, or both in case of full) are preserved if no matching key is found in the other side. Matching pairs of elements (or one element and a `null` value for the other input) are given to a JoinFunction to turn the pair of elements into a single element, or to a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.
CoGroup | The two-dimensional variant of the reduce operation. Groups each input on one or more fields and then joins the groups. The transformation function is called per pair of groups. See the keys section to learn how to define coGroup keys.
Cross | Builds the Cartesian product (cross product) of two inputs, creating all pairs of elements. Optionally uses a CrossFunction to turn the pair of elements into a single element. Note: Cross is potentially a very compute-intensive operation which can challenge even large compute clusters! It is advised to hint the system with the DataSet sizes by using crossWithTiny() and crossWithHuge().
Union | Produces the union of two data sets.
Rebalance | Evenly rebalances the parallel partitions of a data set to eliminate data skew. Only Map-like transformations may follow a rebalance transformation.
Hash-Partition | Hash-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
Range-Partition | Range-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
Custom Partitioning | Manually specify a partitioning over the data.
Sort Partition | Locally sorts all partitions of a data set on a specified field in a specified order. Fields can be specified as tuple positions or field expressions. Sorting on multiple fields is done by chaining sortPartition() calls.
First-n | Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key-selector functions, tuple positions or case class fields.
//initialize the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//load the data
val data = env.fromElements(("A" , 1) , ("B" , 1) , ("C" , 1))
//apply transformations to the data
//TODO map
val map_result = data.map(line => line._1+line._2)
map_result.print()
//TODO flatMap (the resulting String is flattened into its characters)
val flatmap_result = data.flatMap(line => line._1+line._2)
flatmap_result.print()
Exercise: given the following data
A;B;C;D;B;D;C
B;D;A;E;D;C
A;B
Requirement: count how many times each pair of adjacent strings occurs.
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/** * Created by angel;*/
object demo {
/**
A;B;C;D;B;D;C
B;D;A;E;D;C
A;B
Count how many times each pair of adjacent strings occurs: (A+B , 2) (B+C , 1)....*/
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
val data = env.fromElements("A;B;C;D;B;D;C;B;D;A;E;D;C;A;B")
val map_data: DataSet[Array[String]] = data.map(line => line.split(";"))
//[A,B,C,D] ---> (x,1) , (y,1) --> groupBy ---> sum ---> total
val tupe_data = map_data.flatMap{
line =>
for(index <- 0 until line.length-1) yield (line(index)+"+"+line(index+1) , 1)
}
val gropudata = tupe_data.groupBy(0)
val result = gropudata.sum(1)
result.print()
}
}
The benefit of mapPartition shows up when, for example, the processed data later has to be written to MySQL: with mapPartition you open one connection per partition, whereas with a plain map you would open one MySQL connection per record (a sketch of this pattern follows the example below).
//TODO mapPartition
val ele_partition = elements.setParallelism(2)//set the parallelism (number of partitions) to 2
val partition = ele_partition.mapPartition(line => line.map(x=> x+"======"))
//line is the iterator over the data of one partition
partition.print()
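A minimal sketch of the one-connection-per-partition pattern described above, assuming a hypothetical MySQL table test.word_count(word VARCHAR, cnt INT) and placeholder connection settings:
import java.sql.DriverManager
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val counts = env.fromElements(("java", 3), ("scala", 1), ("flink", 2))
val written = counts.mapPartition { partition =>
  // one JDBC connection per partition instead of one per record
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "root")
  val stmt = conn.prepareStatement("INSERT INTO word_count (word, cnt) VALUES (?, ?)")
  var rows = 0
  partition.foreach { case (word, cnt) =>
    stmt.setString(1, word)
    stmt.setInt(2, cnt)
    stmt.executeUpdate()
    rows += 1
  }
  stmt.close()
  conn.close()
  Iterator.single(rows) // emit how many rows this partition wrote
}
written.print()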
The filter function is especially useful in production: filtering out the bulk of the records that are irrelevant to the business early in the pipeline greatly reduces the overall computation load on Flink.
val filter:DataSet[String] = elements.filter(line => line.contains("java"))//keep only the records that contain "java"
filter.print()
The plain reduce function
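A minimal sketch of plain reduce on a grouped data set, for comparison with reduceGroup below (the sample data mirrors the reduceGroup example):
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val words = env.fromElements(("java", 1), ("java", 1), ("scala", 1))
// reduce repeatedly combines two elements of each group into one
val counts = words.groupBy(0).reduce((x, y) => (x._1, x._2 + y._2))
counts.print()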
reduceGroup is an optimized variant of reduce:
it first reduces each group locally (a partial combine) and then performs the overall reduce, which cuts down the network I/O.
//TODO reduceGroup
val elements:DataSet[List[Tuple2[String , Int]]] = env.fromElements(List(("java" , 1) ,("java" , 1), ("scala" , 1)))
val tuple_words = elements.flatMap(x=>x)
val group_words = tuple_words.groupBy(x => x._1)
val a = group_words.reduceGroup{
(in:Iterator[(String,Int)],out:Collector[(String , Int)]) =>
val result = in.reduce((x, y) => (x._1, x._2+y._2))
out.collect(result)
}
a.print()
import collection.JavaConverters._
class Tuple3GroupReduceWithCombine extends GroupReduceFunction[( String , Int), (String, Int)] with GroupCombineFunction[(String, Int), (String, Int)] {
override def reduce(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
for(in <- values.asScala){
out.collect((in._1 , in._2))
}
}
override def combine(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
val map = new mutable.HashMap[String , Int]()
var num = 0
var s = ""
for(in <- values.asScala){
num += in._2
s = in._1
}
out.collect((s , num))
}
}
// TODO GroupReduceFunction GroupCombineFunction
val env = ExecutionEnvironment.getExecutionEnvironment
val elements:DataSet[List[Tuple2[String , Int]]] = env.fromElements(List(("java" , 3) ,("java" , 1), ("scala" , 1)))
val collection = elements.flatMap(line => line)
val groupDatas:GroupedDataSet[(String, Int)] = collection.groupBy(line => line._1)
//use the custom reduce and combine functions via reduceGroup
val result = groupDatas.reduceGroup(new Tuple3GroupReduceWithCombine())
val result_sort = result.collect().sortBy(x=>x._1)
println(result_sort)
The group operations used above, such as reduceGroup with a GroupReduceFunction, can easily cause out-of-memory errors,
because all the data of a group has to be transformed in one go, which requires enough memory. If memory is insufficient,
use combineGroup instead.
combineGroup applies a GroupCombineFunction to a grouped data set.
A GroupCombineFunction is similar to a GroupReduceFunction, but does not perform a full data exchange.
[Note] combineGroup may produce partial results rather than the complete result.
import collection.JavaConverters._
class MycombineGroup extends GroupCombineFunction[Tuple1[String] , (String , Int)]{
override def combine(iterable: Iterable[Tuple1[String]], out: Collector[(String, Int)]): Unit = {
var key: String = null
var count = 0
for(line <- iterable.asScala){
key = line._1
count += 1
}
out.collect((key, count))
}
}
//TODO combineGroup
val input = env.fromElements("a", "b", "c", "a").map(Tuple1(_))
val combinedWords = input.groupBy(0).combineGroup(new MycombineGroup())
combinedWords.print()
Aggregate computes an aggregate (such as the maximum or minimum) over a data set. Aggregate can only be applied to tuples.
//TODO Aggregate
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 89.0))
data.+=((2, "shuxue", 92.2))
data.+=((3, "yingyu", 89.99))
data.+=((4, "wuli", 98.9))
data.+=((1, "yuwen", 88.88))
data.+=((1, "wuli", 93.00))
data.+=((1, "yuwen", 94.3))
//fromCollection converts the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val output = input.groupBy(1).aggregate(Aggregations.MAX, 2)
output.print()
//TODO MinBy / MaxBy
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 90.0))
data.+=((2, "shuxue", 20.0))
data.+=((3, "yingyu", 30.0))
data.+=((4, "wuli", 40.0))
data.+=((5, "yuwen", 50.0))
data.+=((6, "wuli", 60.0))
data.+=((7, "yuwen", 70.0))
//fromCollection converts the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val output: DataSet[(Int, String, Double)] = input
.groupBy(1)
//find the minimum score within each subject
//the argument of minBy is the field whose minimum is requested
.minBy(2)
output.print()
//TODO distinct (remove duplicates)
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 90.0))
data.+=((2, "shuxue", 20.0))
data.+=((3, "yingyu", 30.0))
data.+=((4, "wuli", 40.0))
data.+=((5, "yuwen", 50.0))
data.+=((6, "wuli", 60.0))
data.+=((7, "yuwen", 70.0))
//fromCollection converts the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val distinct = input.distinct(1)
distinct.print()
Join operations come up regularly in Flink and make it easy to combine data sets into the desired result, for example:
finding the highest score for each subject within each class.
//TODO join
val data1 = new mutable.MutableList[(Int, String, Double)]
//student id --- subject --- score
data1.+=((1, "yuwen", 90.0))
data1.+=((2, "shuxue", 20.0))
data1.+=((3, "yingyu", 30.0))
data1.+=((4, "yuwen", 40.0))
data1.+=((5, "shuxue", 50.0))
data1.+=((6, "yingyu", 60.0))
data1.+=((7, "yuwen", 70.0))
data1.+=((8, "yuwen", 20.0))
val data2 = new mutable.MutableList[(Int, String)]
//student id --- class
data2.+=((1,"class_1"))
data2.+=((2,"class_1"))
data2.+=((3,"class_2"))
data2.+=((4,"class_2"))
data2.+=((5,"class_3"))
data2.+=((6,"class_3"))
data2.+=((7,"class_4"))
data2.+=((8,"class_1"))
val input1: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data1))
val input2: DataSet[(Int, String)] = env.fromCollection(Random.shuffle(data2))
//find the highest score of each subject within each class
val joindata = input2.join(input1).where(0).equalTo(0){
(input2 , input1) => (input2._1 , input2._2 , input1._2 , input1._3)
}
// joindata.print()
// println("===================")
val aggregateDataSet = joindata.groupBy(1,2).aggregate(Aggregations.MAX , 3)
aggregateDataSet.print()
Cross
Similar to join, but cross builds the Cartesian product of the two inputs; on large data sets this is a very memory-intensive operation.
//TODO Cross: builds the Cartesian product
val data1 = new mutable.MutableList[(Int, String, Double)]
//student id --- subject --- score
data1.+=((1, "yuwen", 90.0))
data1.+=((2, "shuxue", 20.0))
data1.+=((3, "yingyu", 30.0))
data1.+=((4, "yuwen", 40.0))
data1.+=((5, "shuxue", 50.0))
data1.+=((6, "yingyu", 60.0))
data1.+=((7, "yuwen", 70.0))
data1.+=((8, "yuwen", 20.0))
val data2 = new mutable.MutableList[(Int, String)]
//student id --- class
data2.+=((1,"class_1"))
data2.+=((2,"class_1"))
data2.+=((3,"class_2"))
data2.+=((4,"class_2"))
data2.+=((5,"class_3"))
data2.+=((6,"class_3"))
data2.+=((7,"class_4"))
data2.+=((8,"class_1"))
val input1: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data1))
val input2: DataSet[(Int, String)] = env.fromCollection(Random.shuffle(data2))
val cross = input1.cross(input2){
(input1 , input2) => (input1._1,input1._2,input1._3,input2._2)
}
cross.print()
//TODO union
val elements1 = env.fromElements(("123"))
val elements2 = env.fromElements(("456"))
val elements3 = env.fromElements(("123"))
val union = elements1.union(elements2).union(elements3).distinct(line => line)
union.print()
Flink can also suffer from data skew. Suppose, for example, that roughly one billion records need to be processed, and during processing the situation shown in the figure occurs:
a job whose total data volume should take about 10 minutes now hits data skew; the task on machine 1 needs 4 hours to finish, and the other 3 machines, although done, must wait for machine 1 before the whole job completes.
A good solution in practice is rebalance (it internally uses round-robin to spread the data evenly, which is a good choice when the data is skewed).
Without rebalance, observe which records each parallel subtask processes:
val ds = env.generateSequence(1, 3000)
val skewed = ds.filter(_ > 780)
// val rebalanced = skewed.rebalance()
val countsInPartition = skewed.map( new RichMapFunction[Long, (Int, Long)] {
def map(in: Long) = {
//getRuntimeContext.getIndexOfThisSubtask returns the index of the parallel subtask
(getRuntimeContext.getIndexOfThisSubtask, in)
}
})
countsInPartition.print()
[The data is distributed to the subtasks (partitions) arbitrarily, so the load can be uneven]
With rebalance
//TODO rebalance
val ds = env.generateSequence(1, 3000)
val skewed = ds.filter(_ > 780)
val rebalanced = skewed.rebalance()
val countsInPartition = rebalanced.map( new RichMapFunction[Long, (Int, Long)] {
def map(in: Long) = {
//getRuntimeContext.getIndexOfThisSubtask returns the index of the parallel subtask
(getRuntimeContext.getIndexOfThisSubtask, in)
}
})
countsInPartition.print()
//TODO partitionByHash
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val collection = env.fromCollection(Random.shuffle(data))
val unique = collection.partitionByHash(1).mapPartition{
line =>
line.map(x => (x._1 , x._2 , x._3))
}
unique.writeAsText("hashPartition", WriteMode.NO_OVERWRITE)
env.execute()
//TODO Range-Partition
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val collection = env.fromCollection(Random.shuffle(data))
val unique = collection.partitionByRange(x => x._1).mapPartition(line => line.map{
x=>
(x._1 , x._2 , x._3)
})
unique.writeAsText("rangePartition", WriteMode.OVERWRITE)
env.execute()
Sort Partition: locally sorts the data within each partition on the specified field, in the specified order.
//TODO Sort Partition
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val ds = env.fromCollection(Random.shuffle(data))
val result = ds
.map { x => x }.setParallelism(2)
.sortPartition(1, Order.DESCENDING)//the first argument is the field to sort by within each partition
.mapPartition(line => line)
.collect()
println(result)
//TODO first - take the first N elements
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val ds = env.fromCollection(Random.shuffle(data))
// ds.first(10).print()
//you can also group first and then take the first n elements of each group
ds.groupBy(line => line._2).first(2).print()
Common batch sources in Flink
Flink's common batch sources fall into two main categories:
1. Collection-based sources (Collection-based-source)
2. File-based sources (File-based-source)
There are three common ways to create a DataSet in Flink:
1. env.fromElements(), which also supports tuples, custom objects and other composite types.
2. env.fromCollection(), which supports many concrete Collection types.
3. env.generateSequence(), which creates a sequence-based DataSet.
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, _}
import scala.collection.immutable.{Queue, Stack}
import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
object DataSource001 {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
//0. create a DataSet from elements (fromElements)
val ds0: DataSet[String] = env.fromElements("spark", "flink")
ds0.print()
//1. create a DataSet from tuples (fromElements)
val ds1: DataSet[(Int, String)] = env.fromElements((1, "spark"), (2, "flink"))
ds1.print()
//2. create a DataSet from an Array
val ds2: DataSet[String] = env.fromCollection(Array("spark", "flink"))
ds2.print()
//3. create a DataSet from an ArrayBuffer
val ds3: DataSet[String] = env.fromCollection(ArrayBuffer("spark", "flink"))
ds3.print()
//4. create a DataSet from a List
val ds4: DataSet[String] = env.fromCollection(List("spark", "flink"))
ds4.print()
//5. create a DataSet from a ListBuffer
val ds5: DataSet[String] = env.fromCollection(ListBuffer("spark", "flink"))
ds5.print()
//6. create a DataSet from a Vector
val ds6: DataSet[String] = env.fromCollection(Vector("spark", "flink"))
ds6.print()
//7. create a DataSet from a Queue
val ds7: DataSet[String] = env.fromCollection(Queue("spark", "flink"))
ds7.print()
//8. create a DataSet from a Stack
val ds8: DataSet[String] = env.fromCollection(Stack("spark", "flink"))
ds8.print()
//9. create a DataSet from a Stream (a Stream is a lazy List, avoiding unnecessary intermediate collections)
val ds9: DataSet[String] = env.fromCollection(Stream("spark", "flink"))
ds9.print()
//10. create a DataSet from a Seq
val ds10: DataSet[String] = env.fromCollection(Seq("spark", "flink"))
ds10.print()
//11. create a DataSet from a Set
val ds11: DataSet[String] = env.fromCollection(Set("spark", "flink"))
ds11.print()
//12. create a DataSet from an Iterable
val ds12: DataSet[String] = env.fromCollection(Iterable("spark", "flink"))
ds12.print()
//13. create a DataSet from an ArraySeq
val ds13: DataSet[String] = env.fromCollection(mutable.ArraySeq("spark", "flink"))
ds13.print()
//14. create a DataSet from an ArrayStack
val ds14: DataSet[String] = env.fromCollection(mutable.ArrayStack("spark", "flink"))
ds14.print()
//15. create a DataSet from a Map
val ds15: DataSet[(Int, String)] = env.fromCollection(Map(1 -> "spark", 2 -> "flink"))
ds15.print()
//16. create a DataSet from a Range
val ds16: DataSet[Int] = env.fromCollection(Range(1, 9))
ds16.print()
//17. create a DataSet with generateSequence
val ds17: DataSet[Long] = env.generateSequence(1,9)
ds17.print()
}
}
Reading a local file
//TODO use readTextFile to read a local file
//TODO initialize the environment
val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//TODO load the data
val datas: DataSet[String] = environment.readTextFile("data.txt")
//TODO specify the transformations on the data
val flatmap_data: DataSet[String] = datas.flatMap(line => line.split("\\W+"))
val tuple_data: DataSet[(String, Int)] = flatmap_data.map(line => (line , 1))
val groupData: GroupedDataSet[(String, Int)] = tuple_data.groupBy(line => line._1)
val result: DataSet[(String, Int)] = groupData.reduce((x, y) => (x._1 , x._2+y._2))
result.print()
Reading data from HDFS
//TODO readTextFile reading data from HDFS
//TODO initialize the environment
val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//TODO load the data
val file: DataSet[String] = environment.readTextFile("hdfs://hadoop01:9000/README.txt")
val flatData: DataSet[String] = file.flatMap(line => line.split("\\W+"))
val map_data: DataSet[(String, Int)] = flatData.map(line => (line , 1))
val groupdata: GroupedDataSet[(String, Int)] = map_data.groupBy(line => line._1)
val result_data: DataSet[(String, Int)] = groupdata.reduce((x, y) => (x._1 , x._2+y._2))
result_data.print()
Reading CSV data
//TODO read CSV data
val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val path = "data2.csv"
val ds3 = environment.readCsvFile[(String, String, String, String,String,Int,Int,Int)](
filePath = path,
lineDelimiter = "\n",
fieldDelimiter = ",",
lenient = false,
ignoreFirstLine = true,
includedFields = Array(0, 1, 2, 3 , 4 , 5 , 6 , 7))
val first = ds3.groupBy(0 , 1).first(50)
first.print()
File-based source (traversing a directory)
Flink supports traversing all files inside a directory, including all files in all of its subdirectories.
When reading from a directory, nested files are not read by default: only the files directly inside the given directory are read, and the rest are ignored. To read recursively, use the recursive.file.enumeration parameter.
val env = ExecutionEnvironment.getExecutionEnvironment
val parameters = new Configuration
// recursive.file.enumeration enables recursive reading
parameters.setBoolean("recursive.file.enumeration", true)
val ds1 = env.readTextFile("test").withParameters(parameters)
ds1.print()
Reading compressed files
For supported compression formats, no extra inputformat needs to be specified; Flink detects them automatically and decompresses the files. However, compressed files may not be read in parallel but only sequentially, which can affect the scalability of the job.
//TODO read a compressed file
val env = ExecutionEnvironment.getExecutionEnvironment
val file = env.readTextFile("test/data1/zookeeper.out.gz").print()
tar -czvf ***.tar.gz
Common batch sinks in Flink
1. Collection-based sink (Collection-based-sink)
2. File-based sink (File-based-sink)
(1) Collection-based sink (Collection-based-sink)
//1. define the environment
val env = ExecutionEnvironment.getExecutionEnvironment
//2. define the data: stu(age, name, height)
val stu: DataSet[(Int, String, Double)] = env.fromElements(
(19, "zhangsan", 178.8),
(17, "lisi", 168.8),
(18, "wangwu", 184.8),
(21, "zhaoliu", 164.8)
)
//3.TODO sink to standard output
stu.print
//3.TODO sink to standard error output
stu.printToErr()
//4.TODO sink to a local Collection
print(stu.collect())
File-based sink (File-based-sink)
Flink supports files on a variety of storage systems, including local files, HDFS files, and more.
Flink supports several file formats, including text files, CSV files, and more.
writeAsText(): TextOutputFormat - writes the elements line by line as strings. The string of each element is obtained by calling its toString() method.
Writing data to a local file
//0. Note: whether local or HDFS, if Parallelism > 1 the path is treated as a directory name; if Parallelism = 1 the path is treated as a file name.
val env = ExecutionEnvironment.getExecutionEnvironment
val ds1: DataSource[Map[Int, String]] = env.fromElements(Map(1 -> "spark" , 2 -> "flink"))
//1.TODO write to a local text file; with NO_OVERWRITE an error is raised if the file already exists, with OVERWRITE an existing file is overwritten
ds1.setParallelism(1).writeAsText("test/data1/aa", WriteMode.OVERWRITE)
env.execute()
Writing data to HDFS
//TODO writeAsText writes the data to HDFS
val env = ExecutionEnvironment.getExecutionEnvironment
val ds1: DataSource[Map[Int, String]] = env.fromElements(Map(1 -> "spark" , 2 -> "flink"))
ds1.setParallelism(1).writeAsText("hdfs://hadoop01:9000/a", WriteMode.OVERWRITE)
env.execute()
You can also sort the data with sortPartition before sinking it to an external system, as sketched below.
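A minimal sketch of sorting before writing, reusing the student tuples from above (the output path is just an example):
import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._
import org.apache.flink.core.fs.FileSystem.WriteMode

val env = ExecutionEnvironment.getExecutionEnvironment
val stu = env.fromElements((19, "zhangsan", 178.8), (17, "lisi", 168.8), (18, "wangwu", 184.8))
// sort each partition by age (field 0) in descending order, then write as a single text file
stu.sortPartition(0, Order.DESCENDING)
  .setParallelism(1)
  .writeAsText("test/data1/sorted_stu", WriteMode.OVERWRITE)
env.execute()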
Local execution
Local environment
A LocalEnvironment is a handle for executing a Flink program locally. It runs the program in a local JVM - standalone or embedded in other programs.
The local environment is instantiated via ExecutionEnvironment.createLocalEnvironment(). By default it uses as many local threads for execution as your machine has CPU cores (hardware contexts). You can also specify the desired parallelism (see the sketch below). The local environment can be configured to log to the console with enableLogging()/disableLogging().
In most cases, ExecutionEnvironment.getExecutionEnvironment() is the better choice. It returns a LocalEnvironment when the program is started locally (outside the command line interface), and a pre-configured cluster execution environment when the program is invoked by the command line interface.
Note: the local execution environment does not start any web frontend to monitor the execution.
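A minimal sketch of explicitly choosing the parallelism of the local environment (the value 4 here is arbitrary):
import org.apache.flink.api.scala.ExecutionEnvironment

// run the program locally with 4 threads instead of one thread per CPU core
val localEnv = ExecutionEnvironment.createLocalEnvironment(4)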
object LocalEven {
def main(args: Array[String]): Unit = {
//TODO initialize the local execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.createLocalEnvironment()
val path = "data2.csv"
val data = env.readCsvFile[(String, String, String, String,String,Int,Int,Int)](
filePath = path,
lineDelimiter = "\n",
fieldDelimiter = ",",
ignoreFirstLine = true
)
data.groupBy(0,1).first(100).print()
}
}
Collection environment
Execution on a CollectionEnvironment is a low-overhead way to run Flink programs. Typical use cases for this mode are automated tests, debugging, and code reuse.
Algorithms implemented for batch processing can also be reused in more interactive cases.
Note that execution of collection-based Flink programs is only suitable for small data that fits into the JVM heap. Execution on collections is not multi-threaded; only a single thread is used.
//TODO createCollectionsEnvironment
val collectionENV = ExecutionEnvironment.createCollectionsEnvironment
val path = "data2.csv"
val data = collectionENV.readCsvFile[(String, String, String, String,String,Int,Int,Int)](
filePath = path,
lineDelimiter = "\n",
fieldDelimiter = ",",
ignoreFirstLine = true
)
data.groupBy(0,1).first(50).print()
Cluster execution:
Flink programs can run distributed on clusters of many machines. There are two ways to send a program to a cluster for execution:
Option 1: the command line interface:
• ./bin/flink run ./examples/batch/WordCount.jar \
• --input file:///home/user/hamlet.txt --output file:///home/user/wordcount_out
Option 2: submitting via a remote environment in code
The remote environment lets you execute Flink Java programs directly on a cluster. The remote environment points to the cluster on which the program is to be executed (see the code after the Maven configuration below).
Maven packaging:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-jar-plugin</artifactId>
    <version>2.6</version>
    <configuration>
        <archive>
            <manifest>
                <addClasspath>true</addClasspath>
                <classpathPrefix>lib/</classpathPrefix>
                <mainClass>com.flink.DataStream.RemoteEven</mainClass>
            </manifest>
        </archive>
    </configuration>
</plugin>
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-dependency-plugin</artifactId>
    <version>2.10</version>
    <executions>
        <execution>
            <id>copy-dependencies</id>
            <phase>package</phase>
            <goals>
                <goal>copy-dependencies</goal>
            </goals>
            <configuration>
                <outputDirectory>${project.build.directory}/lib</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>
val env: ExecutionEnvironment = ExecutionEnvironment.createRemoteEnvironment("hadoop01", 8081, "target/learning-flink-1.0-SNAPSHOT.jar")
val data: DataSet[String] = env.readTextFile("hdfs://hadoop01:9000/README.txt")
val flatMap_data: DataSet[String] = data.flatMap(line => line.toLowerCase().split("\\W+"))
val mapdata: DataSet[(String, Int)] = flatMap_data.map(line => (line , 1))
val groupData: GroupedDataSet[(String, Int)] = mapdata.groupBy(line => line._1)
val result = groupData.reduce((x , y) => (x._1 , x._2+y._2))
result.writeAsText("hdfs://hadoop01:9000/remote")
env.execute()
Flink supports broadcast variables: the data is broadcast to the individual TaskManagers and kept in memory there, which can reduce a large amount of shuffling.
For example, in a join step a lot of shuffling is normally unavoidable; by broadcasting one of the DataSets so that it stays loaded in each TaskManager's memory, the data can be fetched directly from memory, avoiding the massive shuffle that would otherwise degrade cluster performance.
Note: because the broadcast DataSet is held in memory, the broadcast data must not be too large, otherwise an OOM error may occur.
/**
* Created by angel;
*/
object BrodCast {
def main(args: Array[String]): Unit = {
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//TODO join data2 with data3, using a broadcast variable
val data2 = new mutable.MutableList[(Int, Long, String)]
data2.+=((1, 1L, "Hi"))
data2.+=((2, 2L, "Hello"))
data2.+=((3, 2L, "Hello world"))
val ds1 = env.fromCollection(Random.shuffle(data2))
val data3 = new mutable.MutableList[(Int, Long, Int, String, Long)]
data3.+=((1, 1L, 0, "Hallo", 1L))
data3.+=((2, 2L, 1, "Hallo Welt", 2L))
data3.+=((2, 3L, 2, "Hallo Welt wie", 1L))
val ds2 = env.fromCollection(Random.shuffle(data3))
//todo use an anonymous RichMapFunction providing open and map to implement the join
val result = ds1.map(new RichMapFunction[(Int , Long , String) , ArrayBuffer[(Int , Long , String , String)]] {
var brodCast:mutable.Buffer[(Int, Long, Int, String, Long)] = null
override def open(parameters: Configuration): Unit = {
import scala.collection.JavaConverters._
//asScala requires the implicit conversions from JavaConverters
brodCast = this.getRuntimeContext.getBroadcastVariable[(Int, Long, Int, String, Long)]("ds2").asScala
}
override def map(value: (Int, Long, String)):ArrayBuffer[(Int , Long , String , String)] = {
val toArray: Array[(Int, Long, Int, String, Long)] = brodCast.toArray
val array = new mutable.ArrayBuffer[(Int , Long , String , String)]
var index = 0
var a:(Int, Long, String, String) = null
while(index < toArray.size){
if(value._2 == toArray(index)._5){
a = (value._1 , value._2 , value._3 , toArray(index)._4)
array += a
}
index = index + 1
}
array
}
}).withBroadcastSet(ds2 , "ds2")
println(result.collect())
}
}
Flink provides a distributed cache similar to Hadoop's, so that the functions of the parallel running instances can access files locally. It can be used to share external static data, such as a logistic regression model for machine learning.
Usage of the cache:
Using the ExecutionEnvironment instance, register a local or remote file (for example a file on HDFS) as a cache file under a given name. When the program is executed, Flink automatically copies the file or directory to the local file system of every worker node, and functions can look up the file in the node's local file system by that name.
[Note] A broadcast variable is distributed to the memory of each worker node, whereas the distributed cache places a file on each worker node.
package com.flink.DEMO.dataset
import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.configuration.Configuration
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import scala.io.Source
import org.apache.flink.streaming.api.scala._
/**
* Created by angel;
*/
object Distribute_cache {
def main(args: Array[String]): Unit = {
val env = ExecutionEnvironment.getExecutionEnvironment
//1"开启分布式缓存
val path = "hdfs://hadoop01:9000/score"
env.registerCachedFile(path , "Distribute_cache")
//2:加载本地数据
val clazz:DataSet[Clazz] = env.fromElements(
Clazz(1,"class_1"),
Clazz(2,"class_1"),
Clazz(3,"class_2"),
Clazz(4,"class_2"),
Clazz(5,"class_3"),
Clazz(6,"class_3"),
Clazz(7,"class_4"),
Clazz(8,"class_1")
)
//3: perform the join
clazz.map(new MyJoinmap()).print()
}
}
class MyJoinmap() extends RichMapFunction[Clazz , ArrayBuffer[INFO]]{
private var myLine = new ListBuffer[String]
override def open(parameters: Configuration): Unit = {
val file = getRuntimeContext.getDistributedCache.getFile("Distribute_cache")
val lines: Iterator[String] = Source.fromFile(file.getAbsoluteFile).getLines()
lines.foreach( line =>{
myLine.append(line)
})
}
//perform the join inside the map function
override def map(value: Clazz): ArrayBuffer[INFO] = {
var stoNO = 0
var subject = ""
var score = 0.0
var array = new collection.mutable.ArrayBuffer[INFO]()
//(student id --- subject --- score)
for(str <- myLine){
val tokens = str.split(",")
stoNO = tokens(0).toInt
subject = tokens(1)
score = tokens(2).toDouble
if(tokens.length == 3){
if(stoNO == value.stu_no){
array += INFO(value.stu_no , value.clazz_no , subject , score)
}
}
}
array
}
}
//(student id , class) join (student id --- subject --- score) == (student id , class , subject , score)
case class INFO(stu_no:Int , clazz_no:String , subject:String , score:Double)
case class Clazz(stu_no:Int , clazz_no:String)