Flink 1.5 Basics: DataSet Development

Flink Application Development

Like Spark, Flink is a one-stop processing framework: it supports both batch processing (DataSet) and real-time stream processing (DataStream).
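As a quick orientation, the two APIs have separate entry points (a minimal sketch, not tied to any specific job below):

import org.apache.flink.api.scala.ExecutionEnvironment
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

// Batch (DataSet) programs start from an ExecutionEnvironment
val batchEnv = ExecutionEnvironment.getExecutionEnvironment
// Streaming (DataStream) programs start from a StreamExecutionEnvironment
val streamEnv = StreamExecutionEnvironment.getExecutionEnvironment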

  1. Import the required dependencies with Maven:
<properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.2</scala.version>
    <scala.binary.version>2.11</scala.binary.version>
    <hadoop.version>2.6.2</hadoop.version>
    <flink.version>1.5.0</flink.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>${hadoop.version}</version>
    </dependency>

    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.38</version>
    </dependency>

    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>fastjson</artifactId>
        <version>1.2.22</version>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-kafka-0.9_2.11</artifactId>
        <version>${flink.version}</version>
    </dependency>
</dependencies>

DataSet Development

Development workflow:

1. Obtain an execution environment
2. Load/create the initial data
3. Specify transformations of the data
4. Specify where to put the results of the computation
5. Trigger program execution

Example
import org.apache.flink.api.scala._

object DataSet_WordCount {
  def main(args: Array[String]) {
    //TODO initialize the environment
    val env = ExecutionEnvironment.getExecutionEnvironment
    //TODO load/create the initial data
    val text = env.fromElements(
      "Who's there?",
      "I think I hear them. Stand, ho! Who's there?")
    //TODO specify transformations of the data
    val split_words = text.flatMap(line => line.toLowerCase().split("\\W+"))
    val filter_words = split_words.filter(x=> x.nonEmpty)
    val map_words = filter_words.map(x=> (x,1))
    val groupBy_words = map_words.groupBy(0)
    val sum_words = groupBy_words.sum(1)
    //TODO specify where to put the results
//    sum_words.setParallelism(1) // write the result as a single file
    sum_words.writeAsText(args(0)) // e.g. "/Users/niutao/Desktop/flink.txt"
    //TODO trigger program execution
    env.execute("DataSet wordCount")
  }
}
Package the program and submit it to YARN

Add the Maven build plugins:

 



<build>
    <sourceDirectory>src/main/java</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.5.1</version>
            <configuration>
                <source>1.7</source>
                <target>1.7</target>
            </configuration>
        </plugin>

        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.0</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-surefire-plugin</artifactId>
            <version>2.18.1</version>
            <configuration>
                <useFile>false</useFile>
                <disableXmlReport>true</disableXmlReport>
                <includes>
                    <include>**/*Test.*</include>
                    <include>**/*Suite.*</include>
                </includes>
            </configuration>
        </plugin>

        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <mainClass>com.itcast.DEMO.WordCount</mainClass>
                            </transformer>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>


After a successful build, the jar is generated in the target folder;


 

Upload the jar with the rz command, then run the program:

bin/flink run -m yarn-cluster -yn 2 /home/elasticsearch/flinkjar/itcast_learn_flink-1.0-SNAPSHOT.jar com.itcast.DEMO.WordCount

 

The submitted job can be observed on YARN's 8088 page:


The output of the run can be found in the /export/servers/flink-1.3.2/flinkJAR folder:

 

DataSet Transformations

Each entry below lists the transformation, a description, and a short example.

Map

Takes one element and produces one element.

data.map{x=>x.toInt}

FlatMap

Takes one element and produces zero, one, or more elements.

data.flatMap{str=>str.split(" ")}

MapPartition

Transforms a parallel partition in a single function call. The function gets the partition as an `Iterator` and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree-of-parallelism and previous operations.

data.mapPartition{in=>in map{(_,1)}}

Filter

Evaluates a boolean function for each element and retains those for which the function returns true.
IMPORTANT: The system assumes that the function does not modify the element on which the predicate is applied. Violating this assumption can lead to incorrect results.

data.filter{_ >1000}

Reduce

Combines a group of elements into a single element by repeatedly combining two elements into one. Reduce may be applied on a full data set, or on a grouped data set.

data.reduce{_+_}

ReduceGroup

Combines a group of elements into one or more elements. ReduceGroup may be applied on a full data set, or on a grouped data set.

data.reduceGroup{elements=>elements.sum}

Aggregate

Aggregates a group of values into a single value. Aggregation functions can be thought of as built-in reduce functions. Aggregate may be applied on a full data set, or on a grouped data set.

val input:DataSet[(Int,String,Double)]=// [...]
val output:DataSet[(Int,String,Double)]=input.aggregate(SUM,0).aggregate(MIN,2);

You can also use short-hand syntax for minimum, maximum, and sum aggregations.
val input:DataSet[(Int,String,Double)]=// [...]
val output:DataSet[(Int,String,Double)]=input.sum(0).min(2)

Distinct

Returns the distinct elements of a data set. It removes the duplicate entries from the input DataSet, with respect to all fields of the elements, or a subset of fields.

data.distinct()

Join

Joins two data sets by creating all pairs of elements that are equal on their keys. Optionally uses a JoinFunction to turn the pair of elements into a single element, or a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.

// In this case tuple fields are used as keys. "0" is the join field on the first tuple
// "1" is the join field on the second tuple.
val result=input1.join(input2).where(0).equalTo(1)
You can specify the way that the runtime executes the join via Join Hints. The hints describe whether the join happens through partitioning or broadcasting, and whether it uses a sort-based or a hash-based algorithm. Please refer to the Transformations Guide for a list of possible hints and an example. If no hint is specified, the system will try to make an estimate of the input sizes and pick the best strategy according to those estimates.
// using a hash table for the broadcasted data
val result=input1.join(input2,JoinHint.BROADCAST_HASH_FIRST).where(0).equalTo(1)

Note that the join transformation works only for equi-joins. Other join types need to be expressed using OuterJoin or CoGroup.

OuterJoin

Performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, records of the "outer" side (left, right, or both in case of full) are preserved if no matching key is found in the other side. Matching pairs of elements (or one element and a `null` value for the other input) are given to a JoinFunction to turn the pair of elements into a single element, or to a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.

val joined=left.leftOuterJoin(right).where(0).equalTo(1){
  (left,right)=>
    val a=if(left==null) "none" else left._1
  (a,right)
}

CoGroup

The two-dimensional variant of the reduce operation. Groups each input on one or more fields and then joins the groups. The transformation function is called per pair of groups. See the keys section to learn how to define coGroup keys.

data1.coGroup(data2).where(0).equalTo(1)

Cross

Builds the Cartesian product (cross product) of two inputs, creating all pairs of elements. Optionally uses a CrossFunction to turn the pair of elements into a single element

val data1:DataSet[Int]=// [...]
val data2:DataSet[String]=// [...]
val result:DataSet[(Int,String)]=data1.cross(data2)

Note: Cross is potentially a very compute-intensive operation which can challenge even large compute clusters! It is advised to hint the system with the DataSet sizes by using crossWithTiny() and crossWithHuge().
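For example (a minimal sketch; the two DataSets are placeholders), hinting that the second input is small:

val hugeData: DataSet[Int] = // [...]
val tinyData: DataSet[String] = // [...]
// tells the optimizer that tinyData is small enough to broadcast
val hinted = hugeData.crossWithTiny(tinyData)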

Union

Produces the union of two data sets.

data.union(data2)

Rebalance

Evenly rebalances the parallel partitions of a data set to eliminate data skew. Only Map-like transformations may follow a rebalance transformation.

val data1:DataSet[Int]=// [...]
val result:DataSet[(Int,String)]=data1.rebalance().map(...)

Hash-Partition

Hash-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.

val in:DataSet[(Int,String)]=// [...]
val result=in.partitionByHash(0).mapPartition{
...
}

Range-Partition

Range-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.

val in:DataSet[(Int,String)]=// [...]
val result=
in.partitionByRange(0).mapPartition
{
...
}

Custom Partitioning

Manually specify a partitioning over the data. 
Note: This method works only on single field keys.

val in:DataSet[(Int,String)]=// [...]
val result=in.partitionCustom(partitioner:Partitioner[K],key)

Sort Partition

Locally sorts all partitions of a data set on a specified field in a specified order. Fields can be specified as tuple positions or field expressions. Sorting on multiple fields is done by chaining sortPartition() calls.

val in:DataSet[(Int,String)]=// [...]
val result=in.sortPartition(1,Order.ASCENDING).mapPartition{
...
}

First-n

Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key-selector functions, tuple positions or case class fields.

val in:DataSet[(Int,String)]=// [...]
val result1=in.first(3)
val result2=in.groupBy(0).first(3)
val result3=in.groupBy(0).sortGroup(1,Order.ASCENDING).first(3)

 

The map and flatMap functions

​
// initialize the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment

// load the data
val data = env.fromElements(("A" , 1) , ("B" , 1) , ("C" , 1))

// apply transformations to the data
//TODO map
val map_result = data.map(line => line._1+line._2)
map_result.print()

//TODO flatMap
val flatmap_result = data.flatMap(line => line._1+line._2)
flatmap_result.print()

​

 

Exercise: given the following data

A;B;C;D;B;D;C
B;D;A;E;D;C
A;B

Requirement: count how many times each pair of adjacent strings occurs.

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.streaming.api.scala._
/** Created by angel */
object demo {
  /**
   * A;B;C;D;B;D;C
   * B;D;A;E;D;C
   * A;B
   * Count the occurrences of adjacent string pairs: (A+B, 2), (B+C, 1), ...
   */
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val data = env.fromElements("A;B;C;D;B;D;C;B;D;A;E;D;C;A;B")
    val map_data: DataSet[Array[String]] = data.map(line => line.split(";"))
    // "A;B;C;D" ---> [A,B,C,D] ---> (A+B,1), (B+C,1), ... ---> groupBy ---> sum
    val tupe_data = map_data.flatMap{
    line =>
        for(index <- 0 until line.length-1) yield (line(index)+"+"+line(index+1) , 1)
    }
    val gropudata = tupe_data.groupBy(0)
    val result = gropudata.sum(1)
    result.print()
  }
}

The mapPartition function: mapPartition hands the data to the function one whole partition at a time.

The benefit: when the results later need to be written to MySQL, you only open as many connections as there are partitions (see the sketch after the snippet below); with map you would open one MySQL connection per record.

//TODO mapPartition
  val ele_partition = elements.setParallelism(2) // set the parallelism to 2
  val partition = ele_partition.mapPartition(line => line.map(x=> x+"======"))
  // line is the data of one partition
  partition.print()
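A minimal sketch of the MySQL pattern described above (the JDBC URL, credentials, and the words table are hypothetical); one connection is opened per partition instead of one per record:

import java.sql.DriverManager // requires the mysql-connector-java dependency from the pom above

val withSink = ele_partition.mapPartition { records =>
  // one connection per partition, not one per record
  val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "root")
  val stmt = conn.prepareStatement("INSERT INTO words(word) VALUES (?)")
  val inserted = records.map { record =>
    stmt.setString(1, record.toString)
    stmt.executeUpdate()
    record.toString
  }.toList // force evaluation before the connection is closed
  stmt.close()
  conn.close()
  inserted
}
withSink.print()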

The filter function

Filter is especially useful in production: filtering out the bulk of records that do not match the business logic early on greatly reduces the load on the rest of the Flink job.

  val filter:DataSet[String] = elements.filter(line => line.contains("java")) // keep only records containing "java"
  filter.print()

The reduce function

//TODO reduce
 val elements:DataSet[List[Tuple2[String , Int]]] = env.fromElements(List(("java" , 1) , ("scala" , 1) , ("java" , 1)))
  val tuple_map = elements.flatMap(x=> x) // flatten the inner list into tuples
  val group_map = tuple_map.groupBy(x => x._1) // group by word
  val reduce = group_map.reduce((x,y) => (x._1 ,x._2+y._2))
reduce.print()

reduceGroup

A plain reduce:

[Figure: plain reduce — every record takes part in the pairwise reduce]

reduceGroup is an optimized variant of reduce: it first reduces within each group and then reduces the partial results, which cuts down on network IO.

[Figure: reduceGroup — partial (combine) aggregation before the final reduce]

 //TODO reduceGroup
  val elements:DataSet[List[Tuple2[String , Int]]] = env.fromElements(List(("java" , 1) ,("java" , 1), ("scala" , 1)))
  val tuple_words = elements.flatMap(x=>x)
  val group_words = tuple_words.groupBy(x => x._1)
  val a = group_words.reduceGroup{
    (in:Iterator[(String,Int)],out:Collector[(String , Int)]) =>
      val result = in.reduce((x, y) => (x._1, x._2+y._2))
      out.collect(result)
  }
  a.print()

 

GroupReduceFunction and GroupCombineFunction (user-defined functions)

import collection.JavaConverters._

class Tuple3GroupReduceWithCombine extends GroupReduceFunction[(String , Int), (String, Int)] with GroupCombineFunction[(String, Int), (String, Int)] {

  override def reduce(values: java.lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    for(in <- values.asScala){
      out.collect((in._1 , in._2))
    }
  }

  override def combine(values: java.lang.Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    var num = 0
    var s = ""
    for(in <- values.asScala){
      num += in._2
      s = in._1
    }
    out.collect((s , num))
  }
}

//TODO GroupReduceFunction / GroupCombineFunction
val env = ExecutionEnvironment.getExecutionEnvironment
val elements:DataSet[List[Tuple2[String , Int]]] = env.fromElements(List(("java" , 3) ,("java" , 1), ("scala" , 1)))
val collection = elements.flatMap(line => line)
val groupDatas:GroupedDataSet[(String, Int)] = collection.groupBy(line => line._1)
// plug the custom reduce and combine functions in via reduceGroup
val result = groupDatas.reduceGroup(new Tuple3GroupReduceWithCombine())
val result_sort = result.collect().sortBy(x=>x._1)
println(result_sort)

 

combineGroup

Group operations such as reduceGroup (or a GroupReduceFunction) can easily run out of memory,
because all of the data has to be transformed in one step, which requires enough memory to hold it.
When memory is tight, use combineGroup instead.
combineGroup applies a GroupCombineFunction to a grouped data set.
A GroupCombineFunction is similar to a GroupReduceFunction, but it does not perform a full data exchange.

[Note] combineGroup may return partial results instead of complete ones.

import collection.JavaConverters._
class MycombineGroup extends GroupCombineFunction[Tuple1[String] , (String , Int)]{
  override def combine(iterable: java.lang.Iterable[Tuple1[String]], out: Collector[(String, Int)]): Unit = {
    var key: String = null
    var count = 0
    for(line <- iterable.asScala){
      key = line._1
      count += 1
    }
    out.collect((key, count))
  }
}

//TODO combineGroup
val input = env.fromElements("a", "b", "c", "a").map(Tuple1(_))
val combinedWords = input.groupBy(0).combineGroup(new MycombineGroup())
combinedWords.print()

Aggregate

Aggregate computes an aggregate (e.g. the maximum or minimum) over a data set; it can only be applied to tuples.

//TODO Aggregate
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 89.0))
data.+=((2, "shuxue", 92.2))
data.+=((3, "yingyu", 89.99))
data.+=((4, "wuli", 98.9))
data.+=((1, "yuwen", 88.88))
data.+=((1, "wuli", 93.00))
data.+=((1, "yuwen", 94.3))
// fromCollection turns the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val output = input.groupBy(1).aggregate(Aggregations.MAX, 2)
output.print()

minBy and maxBy

//TODO MinBy / MaxBy
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 90.0))
data.+=((2, "shuxue", 20.0))
data.+=((3, "yingyu", 30.0))
data.+=((4, "wuli", 40.0))
data.+=((5, "yuwen", 50.0))
data.+=((6, "wuli", 60.0))
data.+=((7, "yuwen", 70.0))
// fromCollection turns the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val output: DataSet[(Int, String, Double)] = input
  .groupBy(1)
  // the minimum score per subject
  // minBy's argument is the field to take the minimum of
  .minBy(2)
output.print()

distinct (deduplication)

//TODO distinct - remove duplicates
val data = new mutable.MutableList[(Int, String, Double)]
data.+=((1, "yuwen", 90.0))
data.+=((2, "shuxue", 20.0))
data.+=((3, "yingyu", 30.0))
data.+=((4, "wuli", 40.0))
data.+=((5, "yuwen", 50.0))
data.+=((6, "wuli", 60.0))
data.+=((7, "yuwen", 70.0))
// fromCollection turns the collection into a DataSet
val input: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data))
val distinct = input.distinct(1)
distinct.print()

join

Flink sometimes needs to combine related data sets; a join conveniently returns the combined result, for example:

Find the highest score per subject per class.

//TODO join
val data1 = new mutable.MutableList[(Int, String, Double)]
// student id --- subject --- score
data1.+=((1, "yuwen", 90.0))
data1.+=((2, "shuxue", 20.0))
data1.+=((3, "yingyu", 30.0))
data1.+=((4, "yuwen", 40.0))
data1.+=((5, "shuxue", 50.0))
data1.+=((6, "yingyu", 60.0))
data1.+=((7, "yuwen", 70.0))
data1.+=((8, "yuwen", 20.0))
val data2 = new mutable.MutableList[(Int, String)]
// student id --- class
data2.+=((1,"class_1"))
data2.+=((2,"class_1"))
data2.+=((3,"class_2"))
data2.+=((4,"class_2"))
data2.+=((5,"class_3"))
data2.+=((6,"class_3"))
data2.+=((7,"class_4"))
data2.+=((8,"class_1"))
val input1: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data1))
val input2: DataSet[(Int, String)] = env.fromCollection(Random.shuffle(data2))
// find the highest score per subject per class
val joindata = input2.join(input1).where(0).equalTo(0){
  (input2 , input1) => (input2._1 , input2._2 , input1._2 , input1._3)
}
// joindata.print()
// println("===================")
val aggregateDataSet = joindata.groupBy(1,2).aggregate(Aggregations.MAX , 3)
aggregateDataSet.print()

cross (Cartesian product)

Similar to join, but cross builds the Cartesian product of the two inputs, which is a very memory-intensive operation on large data sets.

//TODO Cross - builds the Cartesian product
val data1 = new mutable.MutableList[(Int, String, Double)]
// student id --- subject --- score
data1.+=((1, "yuwen", 90.0))
data1.+=((2, "shuxue", 20.0))
data1.+=((3, "yingyu", 30.0))
data1.+=((4, "yuwen", 40.0))
data1.+=((5, "shuxue", 50.0))
data1.+=((6, "yingyu", 60.0))
data1.+=((7, "yuwen", 70.0))
data1.+=((8, "yuwen", 20.0))
val data2 = new mutable.MutableList[(Int, String)]
// student id --- class
data2.+=((1,"class_1"))
data2.+=((2,"class_1"))
data2.+=((3,"class_2"))
data2.+=((4,"class_2"))
data2.+=((5,"class_3"))
data2.+=((6,"class_3"))
data2.+=((7,"class_4"))
data2.+=((8,"class_1"))
val input1: DataSet[(Int, String, Double)] = env.fromCollection(Random.shuffle(data1))
val input2: DataSet[(Int, String)] = env.fromCollection(Random.shuffle(data2))
val cross = input1.cross(input2){
  (input1 , input2) => (input1._1,input1._2,input1._3,input2._2)
}
cross.print()

 

union: merges several DataSets into one DataSet. [Note] the DataSets being merged must have the same type.

//TODO union
  val elements1 = env.fromElements(("123"))
  val elements2 = env.fromElements(("456"))
  val elements3 = env.fromElements(("123"))
  val union = elements1.union(elements2).union(elements3).distinct(line => line)
union.print()

 

rebalance

Flink can also suffer from data skew. Suppose roughly 1 billion records have to be processed; during processing the situation shown below may occur:

[Figure: data skew — one subtask receives far more data than the others]

A job whose total workload should take about 10 minutes ends up with machine 1 running for 4 hours because of the skew, while the other 3 machines, although finished, have to wait for machine 1 before the job as a whole is complete.

A good fix in practice is rebalance (internally it uses round robin to spread the data evenly, which makes it a good choice for skewed data).

[Figure: rebalance redistributes the data evenly across the subtasks]

Without rebalance, observe which subtask each record is processed by:

    val ds = env.generateSequence(1, 3000)
    val skewed = ds.filter(_ > 780)
//    val rebalanced = skewed.rebalance()
    val countsInPartition = skewed.map( new RichMapFunction[Long, (Int, Long)] {
      def map(in: Long) = {
        // getRuntimeContext.getIndexOfThisSubtask returns the index of this parallel subtask
        (getRuntimeContext.getIndexOfThisSubtask, in)
      }
    })
    countsInPartition.print()
[The data is distributed to the subtasks (partitions) at random]

Using rebalance:

//TODO rebalance
val ds = env.generateSequence(1, 3000)
val skewed = ds.filter(_ > 780)
val rebalanced = skewed.rebalance()
val countsInPartition = rebalanced.map( new RichMapFunction[Long, (Int, Long)] {
  def map(in: Long) = {
    // getRuntimeContext.getIndexOfThisSubtask returns the index of this parallel subtask
    (getRuntimeContext.getIndexOfThisSubtask, in)
  }
})
countsInPartition.print()

[Figure: the subtask index repeats in a cycle of 8 — the data is handed to the subtasks round-robin]

 

partitionByHash

//TODO partitionByHash
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val collection = env.fromCollection(Random.shuffle(data))
val unique = collection.partitionByHash(1).mapPartition{
  line =>
    line.map(x => (x._1 , x._2 , x._3))
}
unique.writeAsText("hashPartition", WriteMode.NO_OVERWRITE)
env.execute()

Range-Partition

//TODO Range-Partition
val data = new mutable.MutableList[(Int, Long, String)]
data.+=((1, 1L, "Hi"))
data.+=((2, 2L, "Hello"))
data.+=((3, 2L, "Hello world"))
data.+=((4, 3L, "Hello world, how are you?"))
data.+=((5, 3L, "I am fine."))
data.+=((6, 3L, "Luke Skywalker"))
data.+=((7, 4L, "Comment#1"))
data.+=((8, 4L, "Comment#2"))
data.+=((9, 4L, "Comment#3"))
data.+=((10, 4L, "Comment#4"))
data.+=((11, 5L, "Comment#5"))
data.+=((12, 5L, "Comment#6"))
data.+=((13, 5L, "Comment#7"))
data.+=((14, 5L, "Comment#8"))
data.+=((15, 5L, "Comment#9"))
data.+=((16, 6L, "Comment#10"))
data.+=((17, 6L, "Comment#11"))
data.+=((18, 6L, "Comment#12"))
data.+=((19, 6L, "Comment#13"))
data.+=((20, 6L, "Comment#14"))
data.+=((21, 6L, "Comment#15"))
val collection = env.fromCollection(Random.shuffle(data))
val unique = collection.partitionByRange(x => x._1).mapPartition(line => line.map{
  x=>
    (x._1 , x._2 , x._3)
})
unique.writeAsText("rangePartition", WriteMode.OVERWRITE)
env.execute()

 sortPartition

Locally sorts each partition on the specified field in the specified order;

//TODO Sort Partition
    val data = new mutable.MutableList[(Int, Long, String)]
    data.+=((1, 1L, "Hi"))
    data.+=((2, 2L, "Hello"))
    data.+=((3, 2L, "Hello world"))
    data.+=((4, 3L, "Hello world, how are you?"))
    data.+=((5, 3L, "I am fine."))
    data.+=((6, 3L, "Luke Skywalker"))
    data.+=((7, 4L, "Comment#1"))
    data.+=((8, 4L, "Comment#2"))
    data.+=((9, 4L, "Comment#3"))
    data.+=((10, 4L, "Comment#4"))
    data.+=((11, 5L, "Comment#5"))
    data.+=((12, 5L, "Comment#6"))
    data.+=((13, 5L, "Comment#7"))
    data.+=((14, 5L, "Comment#8"))
    data.+=((15, 5L, "Comment#9"))
    data.+=((16, 6L, "Comment#10"))
    data.+=((17, 6L, "Comment#11"))
    data.+=((18, 6L, "Comment#12"))
    data.+=((19, 6L, "Comment#13"))
    data.+=((20, 6L, "Comment#14"))
    data.+=((21, 6L, "Comment#15"))
    val ds = env.fromCollection(Random.shuffle(data))
    val result = ds
      .map { x => x }.setParallelism(2)
      .sortPartition(1, Order.DESCENDING) // the first argument is the field to sort by within each partition
      .mapPartition(line => line)
      .collect()
    println(result)

first

 //TODO first - take the first N elements
    val data = new mutable.MutableList[(Int, Long, String)]
    data.+=((1, 1L, "Hi"))
    data.+=((2, 2L, "Hello"))
    data.+=((3, 2L, "Hello world"))
    data.+=((4, 3L, "Hello world, how are you?"))
    data.+=((5, 3L, "I am fine."))
    data.+=((6, 3L, "Luke Skywalker"))
    data.+=((7, 4L, "Comment#1"))
    data.+=((8, 4L, "Comment#2"))
    data.+=((9, 4L, "Comment#3"))
    data.+=((10, 4L, "Comment#4"))
    data.+=((11, 5L, "Comment#5"))
    data.+=((12, 5L, "Comment#6"))
    data.+=((13, 5L, "Comment#7"))
    data.+=((14, 5L, "Comment#8"))
    data.+=((15, 5L, "Comment#9"))
    data.+=((16, 6L, "Comment#10"))
    data.+=((17, 6L, "Comment#11"))
    data.+=((18, 6L, "Comment#12"))
    data.+=((19, 6L, "Comment#13"))
    data.+=((20, 6L, "Comment#14"))
    data.+=((21, 6L, "Comment#15"))
    val ds = env.fromCollection(Random.shuffle(data))
//    ds.first(10).print()
    // you can also group first and then take the first N per group
    ds.groupBy(line => line._2).first(2).print()

Input: Data Sources

Common sources in Flink batch processing

There are two main kinds of batch sources:

1. Collection-based sources
2. File-based sources

Collection-based sources

The three most common ways to create a DataSet:

1. env.fromElements(), which also supports Tuples, custom objects, and other composite types.
2. env.fromCollection(), which supports many concrete Collection types.
3. env.generateSequence(), which creates a DataSet from a sequence of numbers.

import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment, _}
import scala.collection.immutable.{Queue, Stack}
import scala.collection.mutable
import scala.collection.mutable.{ArrayBuffer, ListBuffer}

object DataSource001 {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    //0. create a DataSet from elements (fromElements)
    val ds0: DataSet[String] = env.fromElements("spark", "flink")
    ds0.print()

    //1. create a DataSet from Tuples (fromElements)
    val ds1: DataSet[(Int, String)] = env.fromElements((1, "spark"), (2, "flink"))
    ds1.print()

    //2. create a DataSet from an Array
    val ds2: DataSet[String] = env.fromCollection(Array("spark", "flink"))
    ds2.print()

    //3. create a DataSet from an ArrayBuffer
    val ds3: DataSet[String] = env.fromCollection(ArrayBuffer("spark", "flink"))
    ds3.print()

    //4. create a DataSet from a List
    val ds4: DataSet[String] = env.fromCollection(List("spark", "flink"))
    ds4.print()

    //5. create a DataSet from a ListBuffer
    val ds5: DataSet[String] = env.fromCollection(ListBuffer("spark", "flink"))
    ds5.print()

    //6. create a DataSet from a Vector
    val ds6: DataSet[String] = env.fromCollection(Vector("spark", "flink"))
    ds6.print()

    //7. create a DataSet from a Queue
    val ds7: DataSet[String] = env.fromCollection(Queue("spark", "flink"))
    ds7.print()

    //8. create a DataSet from a Stack
    val ds8: DataSet[String] = env.fromCollection(Stack("spark", "flink"))
    ds8.print()

    //9. create a DataSet from a Stream (a Stream is a lazy List; it avoids building intermediate collections)
    val ds9: DataSet[String] = env.fromCollection(Stream("spark", "flink"))
    ds9.print()

    //10. create a DataSet from a Seq
    val ds10: DataSet[String] = env.fromCollection(Seq("spark", "flink"))
    ds10.print()

    //11. create a DataSet from a Set
    val ds11: DataSet[String] = env.fromCollection(Set("spark", "flink"))
    ds11.print()

    //12. create a DataSet from an Iterable
    val ds12: DataSet[String] = env.fromCollection(Iterable("spark", "flink"))
    ds12.print()

    //13. create a DataSet from an ArraySeq
    val ds13: DataSet[String] = env.fromCollection(mutable.ArraySeq("spark", "flink"))
    ds13.print()

    //14. create a DataSet from an ArrayStack
    val ds14: DataSet[String] = env.fromCollection(mutable.ArrayStack("spark", "flink"))
    ds14.print()

    //15. create a DataSet from a Map
    val ds15: DataSet[(Int, String)] = env.fromCollection(Map(1 -> "spark", 2 -> "flink"))
    ds15.print()

    //16. create a DataSet from a Range
    val ds16: DataSet[Int] = env.fromCollection(Range(1, 9))
    ds16.print()

    //17. create a DataSet with generateSequence
    val ds17: DataSet[Long] =  env.generateSequence(1,9)
    ds17.print()
  }
}

 

File-based sources (File-based-source)

Reading a local file

//TODO read a local file with readTextFile
//TODO initialize the environment
val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//TODO load the data
val datas: DataSet[String] = environment.readTextFile("data.txt")
//TODO specify the transformations
val flatmap_data: DataSet[String] = datas.flatMap(line => line.split("\\W+"))
val tuple_data: DataSet[(String, Int)] = flatmap_data.map(line => (line , 1))
val groupData: GroupedDataSet[(String, Int)] = tuple_data.groupBy(line => line._1)
val result: DataSet[(String, Int)] = groupData.reduce((x, y) => (x._1 , x._2+y._2))
result.print()

Reading data from HDFS

//TODO read HDFS data with readTextFile
//TODO initialize the environment
val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
//TODO load the data
val file: DataSet[String] = environment.readTextFile("hdfs://hadoop01:9000/README.txt")
val flatData: DataSet[String] = file.flatMap(line => line.split("\\W+"))
val map_data: DataSet[(String, Int)] = flatData.map(line => (line , 1))
val groupdata: GroupedDataSet[(String, Int)] = map_data.groupBy(line => line._1)
val result_data: DataSet[(String, Int)] = groupdata.reduce((x, y) => (x._1 , x._2+y._2))
result_data.print()

Reading CSV data

//TODO read CSV data
val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
val path = "data2.csv"
val ds3 = environment.readCsvFile[(String, String, String, String,String,Int,Int,Int)](
  filePath = path,
  lineDelimiter = "\n",
  fieldDelimiter = ",",
  lenient = false,
  ignoreFirstLine = true,
  includedFields = Array(0, 1, 2, 3 , 4 , 5 , 6 , 7))
val first = ds3.groupBy(0 , 1).first(50)
first.print()

 

File-based source (traversing a directory)

Flink can traverse all files in a directory, including all files in all of its subdirectories.

When reading from a directory, nested files are not read by default: only the files in the top-level directory are read and the rest are ignored. To read recursively, enable recursive.file.enumeration:

val env = ExecutionEnvironment.getExecutionEnvironment
val parameters = new Configuration
// recursive.file.enumeration enables recursive enumeration
parameters.setBoolean("recursive.file.enumeration", true)
val ds1 = env.readTextFile("test").withParameters(parameters)
ds1.print()

Reading compressed files

For the following compression formats no extra InputFormat needs to be specified; Flink recognizes them automatically and decompresses them. Compressed files may not be read in parallel, however, but sequentially, which can affect the scalability of a job.

[Figure: table of compression formats and file extensions that Flink decompresses automatically]

//TODO read a compressed file
val env = ExecutionEnvironment.getExecutionEnvironment
val file = env.readTextFile("test/data1/zookeeper.out.gz").print()

(A test archive can be created with: tar -czvf ***.tar.gz)

Output: Data Sinks

Common sinks in Flink batch processing:

1. Collection-based sinks
2. File-based sinks

(1) Collection-based sinks

//1. define the environment
val env = ExecutionEnvironment.getExecutionEnvironment
//2. define the data: stu(age, name, height)
val stu: DataSet[(Int, String, Double)] = env.fromElements(
  (19, "zhangsan", 178.8),
  (17, "lisi", 168.8),
  (18, "wangwu", 184.8),
  (21, "zhaoliu", 164.8)
)
//3. TODO sink to standard output
stu.print

//3. TODO sink to standard error output
stu.printToErr()

//4. TODO sink to a local collection
print(stu.collect())

File-based sinks

Flink supports files on many storage systems, including local files, HDFS files, and so on.

Flink supports many file formats, including text files, CSV files, and so on.

      writeAsText() (TextOutputFormat): writes the elements line by line as strings, obtained by calling each element's toString() method.

Writing data to a local file

//0. Note: whether local or HDFS, if parallelism > 1 the path is treated as a directory name; if parallelism = 1 it is treated as a file name.
val env = ExecutionEnvironment.getExecutionEnvironment
val ds1: DataSource[Map[Int, String]] = env.fromElements(Map(1 -> "spark" , 2 -> "flink"))
//1. TODO write to a local text file; with NO_OVERWRITE an existing file causes an error, with OVERWRITE an existing file is overwritten
ds1.setParallelism(1).writeAsText("test/data1/aa", WriteMode.OVERWRITE)
env.execute()

Writing data to HDFS

//TODO writeAsText writes the data to HDFS
val env = ExecutionEnvironment.getExecutionEnvironment
val ds1: DataSource[Map[Int, String]] = env.fromElements(Map(1 -> "spark" , 2 -> "flink"))
ds1.setParallelism(1).writeAsText("hdfs://hadoop01:9000/a", WriteMode.OVERWRITE)
env.execute()

You can also sort the data with sortPartition before sinking it to an external system, as sketched below.
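A minimal sketch of that pattern (the data and the output path are made up for illustration):

import org.apache.flink.api.common.operators.Order
import org.apache.flink.core.fs.FileSystem.WriteMode

val scores = env.fromElements((3, "hadoop"), (1, "spark"), (2, "flink"))
scores
  .sortPartition(0, Order.ASCENDING) // sort each partition by the first field
  .writeAsText("test/data1/sorted", WriteMode.OVERWRITE)
env.execute()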

 

 

Local execution and cluster execution

Local execution

The local environment

A LocalEnvironment is a handle for executing a Flink program locally. It runs the program in a local JVM, standalone or embedded in another program.

A local environment is instantiated with ExecutionEnvironment.createLocalEnvironment(). By default it executes with as many local threads as your machine has CPU cores (hardware contexts). You can also specify the desired parallelism. The local environment can be configured to log to the console with enableLogging()/disableLogging().

In most cases ExecutionEnvironment.getExecutionEnvironment() is the better choice: it returns a LocalEnvironment when the program is started locally (outside the command-line interface), and it returns a pre-configured cluster execution environment when the program is invoked through the command-line interface.

Note: the local execution environment does not start any web front end for monitoring the execution.
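For example, a minimal sketch of a local environment with an explicit parallelism:

import org.apache.flink.api.scala._

// run locally with 4 parallel slots instead of one per CPU core
val localEnv: ExecutionEnvironment = ExecutionEnvironment.createLocalEnvironment(4)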

 

object LocalEven {
  def main(args: Array[String]): Unit = {
    //TODO initialize the local execution environment
    val env: ExecutionEnvironment = ExecutionEnvironment.createLocalEnvironment()
    val path = "data2.csv"
    val data = env.readCsvFile[(String, String, String, String,String,Int,Int,Int)](
        filePath = path,
        lineDelimiter = "\n",
        fieldDelimiter = ",",
        ignoreFirstLine = true
    )
    data.groupBy(0,1).first(100).print()
  }
}

The collection environment

Executing with a CollectionEnvironment is a low-overhead way to run a Flink program. Typical use cases for this mode are automated tests, debugging, and code reuse.

It also lets users reuse algorithms written for batch processing in more interactive settings.

Note that execution of collection-based Flink programs is only suitable for small data that fits in the JVM heap. Execution on collections is not multi-threaded; only a single thread is used.

//TODO createCollectionsEnvironment
val collectionENV = ExecutionEnvironment.createCollectionsEnvironment
val path = "data2.csv"
val data = collectionENV.readCsvFile[(String, String, String, String,String,Int,Int,Int)](
    filePath = path,
    lineDelimiter = "\n",
    fieldDelimiter = ",",
    ignoreFirstLine = true
)
data.groupBy(0,1).first(50).print()

Cluster execution:

Flink programs can run distributed on clusters of many machines. There are two ways to send a program to a cluster for execution.

Method 1: the command-line interface

./bin/flink run ./examples/batch/WordCount.jar \
    --input file:///home/user/hamlet.txt --output file:///home/user/wordcount_out

Method 2: submitting via a remote environment in code

The remote environment lets you execute Flink programs directly on a cluster. The remote environment points to the cluster on which the program is to be executed.

Maven packaging:


    
        
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <archive>
                    <manifest>
                        <addClasspath>true</addClasspath>
                        <classpathPrefix>lib/</classpathPrefix>
                        <mainClass>com.flink.DataStream.RemoteEven</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <version>2.10</version>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>${project.build.directory}/lib</outputDirectory>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

val env: ExecutionEnvironment = ExecutionEnvironment.createRemoteEnvironment("hadoop01", 8081, "target/learning-flink-1.0-SNAPSHOT.jar")
val data: DataSet[String] = env.readTextFile("hdfs://hadoop01:9000/README.txt")
val flatMap_data: DataSet[String] = data.flatMap(line => line.toLowerCase().split("\\W+"))
val mapdata: DataSet[(String, Int)] = flatMap_data.map(line => (line , 1))
val groupData: GroupedDataSet[(String, Int)] = mapdata.groupBy(line => line._1)
val result = groupData.reduce((x , y) => (x._1 , x._2+y._2))
result.writeAsText("hdfs://hadoop01:9000/remote")
env.execute()

 

Flink broadcast variables

Flink supports broadcast variables: the data is broadcast to each TaskManager and kept in memory, which can cut down on heavy shuffle operations.

For example, the join phase inevitably causes a lot of shuffling. If one of the DataSets is broadcast and kept in every TaskManager's memory, it can be read directly from memory, avoiding a large shuffle that would drag down cluster performance.

Note: because the broadcast DataSet is held in memory, the broadcast data must not be too large, otherwise you risk an OOM error.

    1. Broadcast: register the broadcast set with withBroadcastSet(dataset, string)
    2. Access: read the broadcast variable with getRuntimeContext().getBroadcastVariable(String)
/**
  * Created by angel;
  */
object BrodCast {
  def main(args: Array[String]): Unit = {
    val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    //TODO join data2 with data3 using a broadcast variable
    val data2 = new mutable.MutableList[(Int, Long, String)]
    data2.+=((1, 1L, "Hi"))
    data2.+=((2, 2L, "Hello"))
    data2.+=((3, 2L, "Hello world"))
    val ds1 = env.fromCollection(Random.shuffle(data2))
    val data3 = new mutable.MutableList[(Int, Long, Int, String, Long)]
    data3.+=((1, 1L, 0, "Hallo", 1L))
    data3.+=((2, 2L, 1, "Hallo Welt", 2L))
    data3.+=((2, 3L, 2, "Hallo Welt wie", 1L))
    val ds2 = env.fromCollection(Random.shuffle(data3))
    //todo use a RichMapFunction (providing open and map) to perform the join
    val result = ds1.map(new RichMapFunction[(Int , Long , String) , ArrayBuffer[(Int , Long , String , String)]] {

      var brodCast:mutable.Buffer[(Int, Long, Int, String, Long)] = null

      override def open(parameters: Configuration): Unit = {
        import scala.collection.JavaConverters._
        //asScala requires the implicit conversions from JavaConverters
        brodCast = this.getRuntimeContext.getBroadcastVariable[(Int, Long, Int, String, Long)]("ds2").asScala
      }
      override def map(value: (Int, Long, String)):ArrayBuffer[(Int , Long , String , String)] = {
        val toArray: Array[(Int, Long, Int, String, Long)] = brodCast.toArray
        val array = new mutable.ArrayBuffer[(Int , Long , String , String)]
        var index = 0

        var a:(Int, Long, String, String) = null
        while(index < toArray.size){
          if(value._2 == toArray(index)._5){
            a = (value._1 , value._2 , value._3 , toArray(index)._4)
            array += a
          }
          index = index + 1
        }
        array
      }
    }).withBroadcastSet(ds2 , "ds2")
    println(result.collect())
  }
}

Flink's distributed cache

Flink provides a distributed cache similar to Hadoop's, so that functions running in parallel instances can access files locally. It can be used to share external static data, for example a machine-learning logistic regression model.

How the cache is used:

Register a local or remote file (for example a file on HDFS) under a name via the ExecutionEnvironment instance. When the program runs, Flink automatically copies the file or directory to the local file system of every worker node, and a function can look the file up in that node's local file system by the registered name.

[Note] A broadcast variable distributes a DataSet into the memory of each worker node; the distributed cache copies a file onto each worker node.

package com.flink.DEMO.dataset

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala.{DataSet, ExecutionEnvironment}
import org.apache.flink.configuration.Configuration

import scala.collection.mutable.{ArrayBuffer, ListBuffer}
import scala.io.Source
import org.apache.flink.streaming.api.scala._
/**
  * Created by angel;
  */
object Distribute_cache {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    //1: register the distributed cache file
    val path = "hdfs://hadoop01:9000/score"
    env.registerCachedFile(path , "Distribute_cache")

    //2: load the local data
    val clazz:DataSet[Clazz] = env.fromElements(
        Clazz(1,"class_1"),
        Clazz(2,"class_1"),
        Clazz(3,"class_2"),
        Clazz(4,"class_2"),
        Clazz(5,"class_3"),
        Clazz(6,"class_3"),
        Clazz(7,"class_4"),
        Clazz(8,"class_1")
    )

    //3: perform the join
    clazz.map(new MyJoinmap()).print()
  }
}
class MyJoinmap() extends RichMapFunction[Clazz , ArrayBuffer[INFO]]{
  private var myLine = new ListBuffer[String]
  override def open(parameters: Configuration): Unit = {
    val file = getRuntimeContext.getDistributedCache.getFile("Distribute_cache")
    val lines: Iterator[String] = Source.fromFile(file.getAbsoluteFile).getLines()
    lines.foreach( line =>{
      myLine.append(line)
    })
  }

  // perform the join inside the map function
  override def map(value: Clazz):  ArrayBuffer[INFO] = {
    var stoNO = 0
    var subject = ""
    var score = 0.0
    var array = new collection.mutable.ArrayBuffer[INFO]()
    // (student id --- subject --- score)
    for(str <- myLine){
      val tokens = str.split(",")
      stoNO = tokens(0).toInt
      subject = tokens(1)
      score = tokens(2).toDouble
      if(tokens.length == 3){
        if(stoNO == value.stu_no){
          array += INFO(value.stu_no , value.clazz_no , subject , score)
        }
      }
    }
    array
  }
}
// (student id, class) join (student id, subject, score) == (student id, class, subject, score)


case class INFO(stu_no:Int , clazz_no:String , subject:String , score:Double)
case class Clazz(stu_no:Int , clazz_no:String)

