1.11.3. Submitting jobs with Flink on YARN: there are two main ways to run Flink on YARN: (1) start a YARN session (start a long-running Flink cluster on YARN); (2) run a single Flink job directly on YARN (run a Flink job on YARN).
Option 1: YARN session. In this mode a YARN session is started together with the two services Flink needs: a JobManager and the TaskManagers. You can then submit jobs to this cluster, and several Flink jobs can share the same session. Note that this mode requires Hadoop 2.2 or later and an HDFS installation, because starting a YARN session uploads the Flink jar and configuration files to HDFS.
A YARN session is started with the ./bin/yarn-session.sh script. The script accepts the following parameters:
Usage:
Required
-n,--container              Number of YARN containers to allocate (= number of TaskManagers)
Optional
-D                          Dynamic properties
-d,--detached               Start detached
-id,--applicationId         Attach to a running YARN session
-j,--jar                    Path to the Flink jar file
-jm,--jobManagerMemory      Memory for the JobManager container [in MB]
-nm,--name                  Set a custom name for the application on YARN
-q,--query                  Display available YARN resources (memory, cores)
-qu,--queue                 Specify the YARN queue
-s,--slots                  Number of slots per TaskManager
-st,--streaming             Start Flink in streaming mode
-t,--ship                   Ship files in the specified directory (t for transfer)
-tm,--taskManagerMemory     Memory per TaskManager container [in MB]
-z,--zookeeperNamespace     Namespace to create the ZooKeeper sub-paths for high availability mode
Note: if you do not want the Flink YARN client to keep running, you can start a detached YARN session with the -d or --detached option. In that case the Flink YARN client only submits Flink to the cluster and then shuts itself down.
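For example, a detached session with two TaskManagers, 1024 MB for the JobManager and 1024 MB per TaskManager could be started roughly like this (the container count, slot count and memory sizes are only illustrative):
./bin/yarn-session.sh -n 2 -jm 1024 -tm 1024 -s 2 -d
A job is then submitted to the running session in the usual way, e.g. ./bin/flink run ./examples/batch/WordCount.jar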
2.2.3. DataSet Transformations
Map: Takes one element and produces one element.
data.map { x => x.toInt }
FlatMap: Takes one element and produces zero, one, or more elements.
data.flatMap { str => str.split(" ") }
MapPartition: Transforms a parallel partition in a single function call. The function gets the partition as an Iterator and can produce an arbitrary number of result values. The number of elements in each partition depends on the degree of parallelism and previous operations.
data.mapPartition { in => in map { (_, 1) } }
Filter: Evaluates a boolean function for each element and retains those for which the function returns true. IMPORTANT: The system assumes that the function does not modify the element on which the predicate is applied. Violating this assumption can lead to incorrect results.
data.filter { _ > 1000 }
Reduce: Combines a group of elements into a single element by repeatedly combining two elements into one. Reduce may be applied on a full data set or on a grouped data set.
data.reduce { _ + _ }
ReduceGroup: Combines a group of elements into one or more elements. ReduceGroup may be applied on a full data set or on a grouped data set.
data.reduceGroup { elements => elements.sum }
Aggregate: Aggregates a group of values into a single value. Aggregation functions can be thought of as built-in reduce functions. Aggregate may be applied on a full data set or on a grouped data set.
val input: DataSet[(Int, String, Double)] = // […]
val output: DataSet[(Int, String, Double)] = input.aggregate(SUM, 0).aggregate(MIN, 2)
You can also use short-hand syntax for minimum, maximum, and sum aggregations.
val input: DataSet[(Int, String, Double)] = // […]
val output: DataSet[(Int, String, Double)] = input.sum(0).min(2)
Distinct: Returns the distinct elements of a data set. It removes the duplicate entries from the input DataSet, with respect to all fields of the elements or a subset of fields.
data.distinct()
Join: Joins two data sets by creating all pairs of elements that are equal on their keys. Optionally uses a JoinFunction to turn the pair of elements into a single element, or a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.
// In this case tuple fields are used as keys. "0" is the join field on the first tuple
// "1" is the join field on the second tuple.
val result = input1.join(input2).where(0).equalTo(1)
You can specify the way that the runtime executes the join via Join Hints. The hints describe whether the join happens through partitioning or broadcasting, and whether it uses a sort-based or a hash-based algorithm. Please refer to the Transformations Guide for a list of possible hints and an example. If no hint is specified, the system will try to make an estimate of the input sizes and pick the best strategy according to those estimates.
// This executes a join by broadcasting the first data set
// using a hash table for the broadcasted data
val result = input1.join(input2, JoinHint.BROADCAST_HASH_FIRST)
  .where(0).equalTo(1)
Note that the join transformation works only for equi-joins. Other join types need to be expressed using OuterJoin or CoGroup.
OuterJoin: Performs a left, right, or full outer join on two data sets. Outer joins are similar to regular (inner) joins and create all pairs of elements that are equal on their keys. In addition, records of the "outer" side (left, right, or both in case of full) are preserved if no matching key is found in the other side. Matching pairs of elements (or one element and a null value for the other input) are given to a JoinFunction to turn the pair of elements into a single element, or to a FlatJoinFunction to turn the pair of elements into arbitrarily many (including none) elements. See the keys section to learn how to define join keys.
val joined = left.leftOuterJoin(right).where(0).equalTo(1) {
  (left, right) =>
    val a = if (left == null) "none" else left._1
    (a, right)
}
CoGroup: The two-dimensional variant of the reduce operation. Groups each input on one or more fields and then joins the groups. The transformation function is called per pair of groups. See the keys section to learn how to define coGroup keys.
data1.coGroup(data2).where(0).equalTo(1)
Cross: Builds the Cartesian product (cross product) of two inputs, creating all pairs of elements. Optionally uses a CrossFunction to turn the pair of elements into a single element.
val data1: DataSet[Int] = // […]
val data2: DataSet[String] = // […]
val result: DataSet[(Int, String)] = data1.cross(data2)
Note: Cross is potentially a very compute-intensive operation which can challenge even large compute clusters! It is advised to hint the system with the DataSet sizes by using crossWithTiny() and crossWithHuge().
Union: Produces the union of two data sets.
data.union(data2)
Rebalance: Evenly rebalances the parallel partitions of a data set to eliminate data skew. Only Map-like transformations may follow a rebalance transformation.
val data1: DataSet[Int] = // […]
val result: DataSet[(Int, String)] = data1.rebalance().map(…)
Hash-Partition: Hash-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
val in: DataSet[(Int, String)] = // […]
val result = in.partitionByHash(0).mapPartition { … }
Range-Partition: Range-partitions a data set on a given key. Keys can be specified as position keys, expression keys, and key selector functions.
val in: DataSet[(Int, String)] = // […]
val result = in.partitionByRange(0).mapPartition { … }
Custom Partitioning: Manually specify a partitioning over the data. Note: This method works only on single field keys.
val in: DataSet[(Int, String)] = // […]
val result = in.partitionCustom(partitioner: Partitioner[K], key)
Sort Partition: Locally sorts all partitions of a data set on a specified field in a specified order. Fields can be specified as tuple positions or field expressions. Sorting on multiple fields is done by chaining sortPartition() calls.
val in: DataSet[(Int, String)] = // […]
val result = in.sortPartition(1, Order.ASCENDING).mapPartition { … }
First-n: Returns the first n (arbitrary) elements of a data set. First-n can be applied on a regular data set, a grouped data set, or a grouped-sorted data set. Grouping keys can be specified as key-selector functions, tuple positions, or case class fields.
val in: DataSet[(Int, String)] = // […]
// regular data set
val result1 = in.first(3)
// grouped data set
val result2 = in.groupBy(0).first(3)
// grouped-sorted data set
val result3 = in.groupBy(0).sortGroup(1, Order.ASCENDING).first(3)
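As a quick, self-contained illustration of a few of these operators working together, here is a minimal sketch. The object name and the sample tuples are made up for the example and are not taken from the original text:
import org.apache.flink.api.scala._

object DataSetTransformDemo {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // two small sample data sets (values are illustrative)
    val scores: DataSet[(String, Int)] = env.fromElements(("java", 3), ("scala", 1), ("python", 2))
    val descs: DataSet[(String, String)] = env.fromElements(("java", "jvm"), ("scala", "jvm"))
    // inner join on the first field of both tuples
    val joined = scores.join(descs).where(0).equalTo(0)
    joined.print()
    // left outer join: keep every score, even when no description matches
    val outer = scores.leftOuterJoin(descs).where(0).equalTo(0) {
      (l, r) => (l._1, l._2, if (r == null) "none" else r._2)
    }
    outer.print()
  }
}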
1: map  2: flatMap
// initialize the execution environment
val env: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
// load the data
val data = env.fromElements(("A", 1), ("B", 1), ("C", 1))
// apply transformations to the data
// TODO map
val map_result = data.map(line => line._1 + line._2)
map_result.print()
// TODO flatMap: the concatenated string is flattened into its individual characters
val flatmap_result = data.flatMap(line => line._1 + line._2)
flatmap_result.print()
// combine is called before the shuffle and pre-aggregates each local group,
// summing the counts and keeping the word of the group
override def combine(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    var num = 0
    var s = ""
    for (in <- values.asScala) {
      num += in._2
      s = in._1
    }
    out.collect((s, num))
  }
}
// TODO GroupReduceFunction / GroupCombineFunction
val env = ExecutionEnvironment.getExecutionEnvironment
val elements: DataSet[List[Tuple2[String, Int]]] = env.fromElements(List(("java", 3), ("java", 1), ("scala", 1)))
val collection = elements.flatMap(line => line)
val groupDatas: GroupedDataSet[(String, Int)] = collection.groupBy(line => line._1)
// use the custom reduce and combine functions via reduceGroup
val result = groupDatas.reduceGroup(new Tuple3GroupReduceWithCombine())
val result_sort = result.collect().sortBy(x => x._1)
println(result_sort)
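Only the combine method of the custom function is shown above; the reduce half and the class header are not. A minimal sketch of how the whole class might look, assuming it implements both GroupReduceFunction and GroupCombineFunction over (String, Int) tuples (the body is illustrative, only the class name is taken from the call above):
import java.lang.Iterable
import org.apache.flink.api.common.functions.{GroupCombineFunction, GroupReduceFunction}
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

class Tuple3GroupReduceWithCombine
  extends GroupReduceFunction[(String, Int), (String, Int)]
  with GroupCombineFunction[(String, Int), (String, Int)] {

  // reduce receives the (possibly pre-combined) records of one group and sums the counts
  override def reduce(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    var word = ""
    var sum = 0
    for (v <- values.asScala) {
      word = v._1
      sum += v._2
    }
    out.collect((word, sum))
  }

  // combine pre-aggregates each local group before the shuffle (same summing logic as above)
  override def combine(values: Iterable[(String, Int)], out: Collector[(String, Int)]): Unit = {
    reduce(values, out)
  }
}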
14: union. Merges several DataSets into one DataSet.
Note: the DataSets being unioned must all have the same type.
// TODO union
val elements1 = env.fromElements(("123"))
val elements2 = env.fromElements(("456"))
val elements3 = env.fromElements(("123"))
val union = elements1.union(elements2).union(elements3).distinct(line => line)
union.print()
Created by angel */
object Distribute_cache {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    // 1: register the file in the distributed cache
    val path = "hdfs://hadoop01:9000/score"
    env.registerCachedFile(path, "Distribute_cache")
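The snippet above only registers the cached file; inside a rich function the local copy is then fetched from the runtime context. A minimal continuation sketch (the input data set and the way the cached lines are used are assumptions for illustration; it needs org.apache.flink.api.common.functions.RichMapFunction and org.apache.flink.configuration.Configuration on the imports):
    // any input data set works here; this one is made up for the example
    val input = env.fromElements("tom", "jerry")
    val enriched = input.map(new RichMapFunction[String, String] {
      var cachedLines: List[String] = Nil
      override def open(parameters: Configuration): Unit = {
        // fetch the local copy of the registered file by its registration name
        val file = getRuntimeContext.getDistributedCache.getFile("Distribute_cache")
        cachedLines = scala.io.Source.fromFile(file).getLines().toList
      }
      override def map(value: String): String = {
        // here we simply append how many lines the cached score file has
        value + "," + cachedLines.size
      }
    })
    enriched.print()
  }
}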
Transformation Description Map DataStream → DataStream Takes one element and produces one element. A map function that doubles the values of the input stream: dataStream.map { x => x * 2 }
FlatMap DataStream → DataStream Takes one element and produces zero, one, or more elements. A flatmap function that splits sentences to words: dataStream.flatMap { str => str.split(" ") }
Filter DataStream → DataStream Evaluates a boolean function for each element and retains those for which the function returns true. A filter that filters out zero values: dataStream.filter { _ != 0 }
KeyBy DataStream → KeyedStream Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. Internally, this is implemented with hash partitioning. See keys on how to specify keys. This transformation returns a KeyedStream. dataStream.keyBy("someKey") // Key by field "someKey" dataStream.keyBy(0) // Key by the first element of a Tuple
Reduce KeyedStream → DataStream A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.
A reduce function that creates a stream of partial sums: keyedStream.reduce { _ + _ }
Fold KeyedStream → DataStream A "rolling" fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.
A fold function that, when applied on the sequence (1,2,3,4,5), emits the sequence "start-1", "start-1-2", "start-1-2-3", … val result: DataStream[String] = keyedStream.fold("start")((str, i) => { str + "-" + i })
Aggregations KeyedStream → DataStream Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy). keyedStream.sum(0) keyedStream.sum("key") keyedStream.min(0) keyedStream.min("key") keyedStream.max(0) keyedStream.max("key") keyedStream.minBy(0) keyedStream.minBy("key") keyedStream.maxBy(0) keyedStream.maxBy("key")
Window KeyedStream → WindowedStream Windows can be defined on already partitioned KeyedStreams. Windows group the data in each key according to some characteristic (e.g., the data that arrived within the last 5 seconds). See windows for a description of windows. dataStream.keyBy(0).window(TumblingEventTimeWindows.of(Time.seconds(5))) // Last 5 seconds of data
WindowAll DataStream → AllWindowedStream Windows can be defined on regular DataStreams. Windows group all the stream events according to some characteristic (e.g., the data that arrived within the last 5 seconds). See windows for a complete description of windows. WARNING: This is in many cases a non-parallel transformation. All records will be gathered in one task for the windowAll operator. dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))) // Last 5 seconds of data
Window Apply WindowedStream → DataStream AllWindowedStream → DataStream Applies a general function to the window as a whole. Below is a function that manually sums the elements of a window. Note: If you are using a windowAll transformation, you need to use an AllWindowFunction instead. windowedStream.apply { WindowFunction }
// applying an AllWindowFunction on non-keyed window stream allWindowedStream.apply { AllWindowFunction }
Window Reduce WindowedStream → DataStream Applies a functional reduce function to the window and returns the reduced value. windowedStream.reduce { _ + _ }
Window Fold WindowedStream → DataStream Applies a functional fold function to the window and returns the folded value. The example function, when applied on the sequence (1,2,3,4,5), folds the sequence into the string "start-1-2-3-4-5": val result: DataStream[String] = windowedStream.fold("start", (str, i) => { str + "-" + i })
Aggregations on windows WindowedStream → DataStream Aggregates the contents of a window. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy). windowedStream.sum(0) windowedStream.sum("key") windowedStream.min(0) windowedStream.min("key") windowedStream.max(0) windowedStream.max("key") windowedStream.minBy(0) windowedStream.minBy("key") windowedStream.maxBy(0) windowedStream.maxBy("key")
Union DataStream* → DataStream Union of two or more data streams creating a new stream containing all the elements from all the streams. Note: If you union a data stream with itself you will get each element twice in the resulting stream. dataStream.union(otherStream1, otherStream2, …)
Window Join DataStream,DataStream → DataStream Join two data streams on a given key and a common window. dataStream.join(otherStream) .where(<key selector>).equalTo(<key selector>) .window(TumblingEventTimeWindows.of(Time.seconds(3))) .apply { … }
Window CoGroup DataStream,DataStream → DataStream Cogroups two data streams on a given key and a common window. dataStream.coGroup(otherStream) .where(0).equalTo(1) .window(TumblingEventTimeWindows.of(Time.seconds(3))) .apply {}
Connect DataStream,DataStream → ConnectedStreams “Connects” two data streams retaining their types, allowing for shared state between the two streams. someStream : DataStream[Int] = … otherStream : DataStream[String] = …
val connectedStreams = someStream.connect(otherStream)
CoMap, CoFlatMap ConnectedStreams → DataStream Similar to map and flatMap on a connected data stream connectedStreams.map( (_ : Int) => true, (_ : String) => false ) connectedStreams.flatMap( (_ : Int) => true, (_ : String) => false )
Split DataStream → SplitStream Split the stream into two or more streams according to some criterion. val split = someDataStream.split( (num: Int) => (num % 2) match { case 0 => List("even") case 1 => List("odd") } )
Select SplitStream → DataStream Select one or more streams from a split stream. val even = split select "even" val odd = split select "odd" val all = split.select("even", "odd")
Iterate DataStream → IterativeStream → DataStream Creates a "feedback" loop in the flow, by redirecting the output of one operator to some previous operator. This is especially useful for defining algorithms that continuously update a model. The following code starts with a stream and applies the iteration body continuously. Elements that are greater than 0 are sent back to the feedback channel, and the rest of the elements are forwarded downstream. See iterations for a complete description. initialStream.iterate { iteration => { val iterationBody = iteration.map { /* do something */ } (iterationBody.filter(_ > 0), iterationBody.filter(_ <= 0)) } }
Extract Timestamps DataStream → DataStream Extracts timestamps from records in order to work with windows that use event time semantics. See Event Time. stream.assignTimestamps { timestampExtractor }
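To see a few of these DataStream operators in one place, here is a minimal runnable sketch. The socket host/port, the word-count logic and the job name are assumptions made only for the example:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object StreamTransformDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // read text lines from a socket (host and port are placeholders)
    val text: DataStream[String] = env.socketTextStream("localhost", 9999)
    val counts = text
      .flatMap(_.split(" "))           // FlatMap: line -> words
      .filter(_.nonEmpty)              // Filter: drop empty tokens
      .map(word => (word, 1))          // Map: word -> (word, 1)
      .keyBy(0)                        // KeyBy: partition by the word
      .timeWindow(Time.seconds(5))     // Window: tumbling 5-second windows
      .sum(1)                          // Aggregation: sum the counts per window
    counts.print()
    env.execute("stream-transformation-demo")
  }
}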
Created by angel */
object DataSource_kafka {
  def main(args: Array[String]): Unit = {
    // 1. Kafka connection information
    val zkCluster = "hadoop01,hadoop02,hadoop03:2181"
    val kafkaCluster = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val kafkaTopicName = "test"
    // 2. create the stream processing environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 3. create the Kafka consumer properties
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", kafkaCluster)
    properties.setProperty("zookeeper.connect", zkCluster)
    properties.setProperty("group.id", kafkaTopicName)

    val kafka09 = new FlinkKafkaConsumer09[String](kafkaTopicName, new SimpleStringSchema(), properties)
    // 4. add the Kafka source: addSource(kafka09)
    val text = env.addSource(kafka09).setParallelism(4)
/**
test#CS#request http://b2c.csair.com/B2C40/query/jaxb/direct/query.ao?t=S&c1=HLN&c2=CTU&d1=2018-07-12&at=2&ct=2&inf=1#CS#POST#CS#application/x-www-form-urlencoded#CS#t=S&json={‘adultnum’:‘1’,‘arrcity’:‘NAY’,‘childnum’:‘0’,‘depcity’:‘KHH’,‘flightdate’:‘2018-07-12’,‘infantnum’:‘2’}#CS#http://b2c.csair.com/B2C40/modules/bookingnew/main/flightSelectDirect.html?t=R&c1=LZJ&c2=MZG&d1=2018-07-12&at=1&ct=2&inf=2#CS#123.235.193.25#CS#Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1#CS#2018-01-19T10:45:13:578+08:00#CS#106.86.65.18#CS#cookie
*/
    val values: DataStream[ProcessedData] = text.map {
      line =>
        var encrypted = line
        val values = encrypted.split("#CS#")
        val valuesLength = values.length
        var regionalRequest = if (valuesLength > 1) values(1) else ""
        val requestMethod = if (valuesLength > 2) values(2) else ""
        val contentType = if (valuesLength > 3) values(3) else ""
        // body of the POST request
        val requestBody = if (valuesLength > 4) values(4) else ""
        // http_referrer
        val httpReferrer = if (valuesLength > 5) values(5) else ""
        // client IP
        val remoteAddr = if (valuesLength > 6) values(6) else ""
        // client user agent
        val httpUserAgent = if (valuesLength > 7) values(7) else ""
        // server time in ISO8601 format
        val timeIso8601 = if (valuesLength > 8) values(8) else ""
        // server address
        val serverAddr = if (valuesLength > 9) values(9) else ""
        // cookie string from the raw record
        val cookiesStr = if (valuesLength > 10) values(10) else ""
        ProcessedData(regionalRequest, requestMethod, contentType, requestBody,
          httpReferrer, remoteAddr, httpUserAgent, timeIso8601, serverAddr, cookiesStr)
    }
    values.print()
    val remoteAddr: DataStream[String] = values.map(line => line.remoteAddr)
    remoteAddr.print()
    // 5. trigger execution
    env.execute()
  }
}
Created by angel */
object DataSource_kafka {
  def main(args: Array[String]): Unit = {
    // 1. Kafka connection information
    val zkCluster = "hadoop01,hadoop02,hadoop03:2181"
    val kafkaCluster = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
    val kafkaTopicName = "test"
    val sinkKafka = "test2"
    // 2. create the stream processing environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // 3. create the Kafka consumer properties
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", kafkaCluster)
    properties.setProperty("zookeeper.connect", zkCluster)
    properties.setProperty("group.id", kafkaTopicName)

    val kafka09 = new FlinkKafkaConsumer09[String](kafkaTopicName, new SimpleStringSchema(), properties)
    // 4. add the Kafka source: addSource(kafka09)
    val text = env.addSource(kafka09).setParallelism(4)
/**
test#CS#request http://b2c.csair.com/B2C40/query/jaxb/direct/query.ao?t=S&c1=HLN&c2=CTU&d1=2018-07-12&at=2&ct=2&inf=1#CS#POST#CS#application/x-www-form-urlencoded#CS#t=S&json={‘adultnum’:‘1’,‘arrcity’:‘NAY’,‘childnum’:‘0’,‘depcity’:‘KHH’,‘flightdate’:‘2018-07-12’,‘infantnum’:‘2’}#CS#http://b2c.csair.com/B2C40/modules/bookingnew/main/flightSelectDirect.html?t=R&c1=LZJ&c2=MZG&d1=2018-07-12&at=1&ct=2&inf=2#CS#123.235.193.25#CS#Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1#CS#2018-01-19T10:45:13:578+08:00#CS#106.86.65.18#CS#cookie
*/
    val values: DataStream[ProcessedData] = text.map {
      line =>
        var encrypted = line
        val values = encrypted.split("#CS#")
        val valuesLength = values.length
        var regionalRequest = if (valuesLength > 1) values(1) else ""
        val requestMethod = if (valuesLength > 2) values(2) else ""
        val contentType = if (valuesLength > 3) values(3) else ""
        // body of the POST request
        val requestBody = if (valuesLength > 4) values(4) else ""
        // http_referrer
        val httpReferrer = if (valuesLength > 5) values(5) else ""
        // client IP
        val remoteAddr = if (valuesLength > 6) values(6) else ""
        // client user agent
        val httpUserAgent = if (valuesLength > 7) values(7) else ""
        // server time in ISO8601 format
        val timeIso8601 = if (valuesLength > 8) values(8) else ""
        // server address
        val serverAddr = if (valuesLength > 9) values(9) else ""
        // cookie string from the raw record
        val cookiesStr = if (valuesLength > 10) values(10) else ""
        ProcessedData(regionalRequest, requestMethod, contentType, requestBody,
          httpReferrer, remoteAddr, httpUserAgent, timeIso8601, serverAddr, cookiesStr)
    }
    values.print()
    val remoteAddr: DataStream[String] = values.map(line => line.remoteAddr)
    remoteAddr.print()
    // TODO sink to Kafka
    val p: Properties = new Properties
    p.setProperty("bootstrap.servers", "hadoop01:9092,hadoop02:9092,hadoop03:9092")
    p.setProperty("key.serializer", classOf[ByteArraySerializer].getName)
    p.setProperty("value.serializer", classOf[ByteArraySerializer].getName)
    val sink = new FlinkKafkaProducer09[String](sinkKafka, new SimpleStringSchema(), p)
    remoteAddr.addSink(sink)
    // 5. trigger execution
    env.execute("flink-kafka-wordcount")
  }
}
// case class holding the parsed record
case class ProcessedData(regionalRequest: String,
                         requestMethod: String,
                         contentType: String,
                         requestBody: String,
                         httpReferrer: String,
                         remoteAddr: String,
                         httpUserAgent: String,
                         timeIso8601: String,
                         serverAddr: String,
                         cookiesStr: String)
2.3.4. MySQL-based source and sink
1: MySQL-based source:
object MysqlSource {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val source: DataStream[Student] = env.addSource(new SQL_source)
    source.print()
    env.execute()
  }
}

class SQL_source extends RichSourceFunction[Student] {
  private var connection: Connection = null
  private var ps: PreparedStatement = null

  override def open(parameters: Configuration): Unit = {
    val driver = "com.mysql.jdbc.Driver"
    val url = "jdbc:mysql://hadoop01:3306/test"
    val username = "root"
    val password = "root"
    Class.forName(driver)
    connection = DriverManager.getConnection(url, username, password)
    val sql = "select stuid , stuname , stuaddr , stusex from Student"
    ps = connection.prepareStatement(sql)
  }

  override def run(sourceContext: SourceContext[Student]): Unit = {
    val queryRequest = ps.executeQuery()
    while (queryRequest.next()) {
      val stuid = queryRequest.getInt("stuid")
      val stuname = queryRequest.getString("stuname")
      val stuaddr = queryRequest.getString("stuaddr")
      val stusex = queryRequest.getString("stusex")
      val stu = new Student(stuid, stuname, stuaddr, stusex)
      sourceContext.collect(stu)
    }
  }

  override def cancel(): Unit = {}
}
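The section title also mentions a MySQL sink, which is not shown above. A minimal sketch of how one could look, assuming the same Student case class (stuid: Int plus three String fields, matching the columns read by the source) and a Student table with those four columns; the class name and insert statement are illustrative:
import java.sql.{Connection, DriverManager, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction

class SQL_sink extends RichSinkFunction[Student] {
  private var connection: Connection = null
  private var ps: PreparedStatement = null

  override def open(parameters: Configuration): Unit = {
    Class.forName("com.mysql.jdbc.Driver")
    connection = DriverManager.getConnection("jdbc:mysql://hadoop01:3306/test", "root", "root")
    // insert one row per incoming Student (replace with an upsert if needed)
    ps = connection.prepareStatement("insert into Student (stuid, stuname, stuaddr, stusex) values (?, ?, ?, ?)")
  }

  override def invoke(value: Student): Unit = {
    ps.setInt(1, value.stuid)
    ps.setString(2, value.stuname)
    ps.setString(3, value.stuaddr)
    ps.setString(4, value.stusex)
    ps.executeUpdate()
  }

  override def close(): Unit = {
    if (ps != null) ps.close()
    if (connection != null) connection.close()
  }
}
// usage: source.addSink(new SQL_sink)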
// User-defined state: holds the number of records the operator has processed so far;
// it is saved when the operator takes a snapshot.
class UDFState extends Serializable {
  private var count = 0L
  // set the user-defined state
  def setState(s: Long) = count = s
  // get the user-defined state
  def getState = count
}

// Window operator: counts the tuples in the window each time the window fires,
// and checkpoints the running total via ListCheckpointed.
class WindowStatisticWithChk extends WindowFunction[SEvent, Long, Tuple, TimeWindow] with ListCheckpointed[UDFState] {
  private var total = 0L

  // window logic: count the tuples in the window
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[SEvent], out: Collector[Long]): Unit = {
    var count = 0L
    for (event <- input) {
      count += 1L
    }
    total += count
    out.collect(count)
  }

  // restore the operator state from a custom snapshot
  override def restoreState(state: util.List[UDFState]): Unit = {
    val udfState = state.get(0)
    total = udfState.getState
  }

  // take a snapshot of the custom state
  override def snapshotState(checkpointId: Long, timestamp: Long): util.List[UDFState] = {
    val udfList: util.ArrayList[UDFState] = new util.ArrayList[UDFState]
    val udfState = new UDFState
    udfState.setState(total)
    udfList.add(udfState)
    udfList
  }
}
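How this window function might be wired into a job is not shown above. A minimal sketch, assuming checkpointing is enabled (snapshotState/restoreState are only called when it is) and assuming an SEvent case class defined here purely for illustration, since the original SEvent definition does not appear in this section:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class SEvent(id: Long, name: String) // illustrative only

object WindowChkDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // take a checkpoint every 10 seconds so the custom state is actually snapshotted
    env.enableCheckpointing(10000)
    val events: DataStream[SEvent] = env.fromCollection(Seq(
      SEvent(1L, "a"), SEvent(1L, "b"), SEvent(2L, "c")))
    events
      .keyBy("id")                                              // field-expression keys yield a Tuple key
      .window(TumblingProcessingTimeWindows.of(Time.seconds(4)))
      .apply(new WindowStatisticWithChk)
      .print()
    env.execute("window-checkpoint-demo")
  }
}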
2.5.1. Structure of a Table API and SQL program
Flink's batch and streaming Table API and SQL programs follow the same pattern, so one example is enough to demonstrate both.
To run Flink SQL you first need a table environment. There are two variants (batch and streaming):
// ***************
// STREAMING QUERY
// ***************
val sEnv = StreamExecutionEnvironment.getExecutionEnvironment
// create a TableEnvironment for streaming queries
val sTableEnv = TableEnvironment.getTableEnvironment(sEnv)

// ***********
// BATCH QUERY
// ***********
val bEnv = ExecutionEnvironment.getExecutionEnvironment
// create a TableEnvironment for batch queries
val bTableEnv = TableEnvironment.getTableEnvironment(bEnv)
getTableEnvironment returns a TableEnvironment, which is the central concept of the Table API and SQL integration. It is responsible for:
registering a Table in the internal catalog
registering an external catalog
executing SQL queries
registering user-defined (scalar, table, or aggregation) functions
converting a DataStream or DataSet into a Table
holding a reference to an ExecutionEnvironment or a StreamExecutionEnvironment
1. Registering a table in the internal catalog
The TableEnvironment maintains a catalog of tables registered by name. There are two kinds of tables: input tables and output tables. Input tables can be referenced in Table API and SQL queries and provide input data; output tables are used to emit the result of a Table API or SQL query to an external system.
An input table can be registered from several sources:
an existing Table object, usually the result of a Table API or SQL query;
a TableSource, which accesses external data such as a file, a database, or a messaging system;
a DataStream or DataSet from a DataStream or DataSet program.
An output table is registered with a TableSink.
1. Registering a table
// get a TableEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(env)
// Table is the result of a simple projection query
val projTable: Table = tableEnv.scan("X").select(…)

// register the Table projTable as table "projectedTable"
tableEnv.registerTable("projectedTable", projTable)
2. Registering a TableSource
A TableSource provides access to external data stored in a storage system such as a database (MySQL, HBase, …), a file with a specific encoding (CSV, Apache Parquet, Avro, ORC, …), or a messaging system (Apache Kafka, RabbitMQ, …).
// get a TableEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(env)
// create a TableSource
val csvSource: TableSource = new CsvTableSource("/path/to/file", …)
// register the TableSource as table "CsvTable"
tableEnv.registerTableSource("CsvTable", csvSource)
3. Registering a TableSink
A registered TableSink can be used to emit the result of a Table API or SQL query to an external storage system such as a database, a key-value store, a message queue, or a file system (in different encodings, e.g. CSV, Apache Parquet, Avro, ORC, …).
// get a TableEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(env)
// create a TableSink val csvSink: TableSink = new CsvTableSink("/path/to/file", …)
// define the field names and types
val fieldNames: Array[String] = Array("a", "b", "c")
val fieldTypes: Array[TypeInformation[_]] = Array(Types.INT, Types.STRING, Types.LONG)

// register the TableSink as table "CsvSinkTable"
tableEnv.registerTableSink("CsvSinkTable", fieldNames, fieldTypes, csvSink)
Example:
// create the batch execution environment
val env = ExecutionEnvironment.getExecutionEnvironment
// create the table environment for batch queries
val tableEnvironment = TableEnvironment.getTableEnvironment(env)
// load the external data
val csvTableSource = CsvTableSource.builder()
  .path("data1.csv")              // file path
  .field("id", Types.INT)         // first column
  .field("name", Types.STRING)    // second column
  .field("age", Types.INT)        // third column
  .fieldDelimiter(",")            // column delimiter, defaults to ","
  .lineDelimiter("\n")            // line delimiter
  .ignoreFirstLine()              // skip the first line
  .ignoreParseErrors()            // ignore parse errors
  .build()
// register the external data as a table
tableEnvironment.registerTableSource("tableA", csvTableSource)
// TODO 1: query the data with the Table API
// note: equality in Table API string expressions is written ===
val table = tableEnvironment.scan("tableA").select("id , name , age").filter("name === 'lisi'")
// write the result out
table.writeToSink(new CsvTableSink("bbb", ",", 1, FileSystem.WriteMode.OVERWRITE))
// TODO 2: query the data with SQL
// val sqlResult = tableEnvironment.sqlQuery("select id,name,age from tableA where id > 0 order by id limit 2")
// // write the result out
// sqlResult.writeToSink(new CsvTableSink("aaaaaa.csv", ",", 1, FileSystem.WriteMode.OVERWRITE))
env.execute()
In the example above, registerTableSource was used to register the table. Flink also offers a more flexible option: registering a DataStream or DataSet directly as a table. The DataStream or DataSet then behaves like a table, so SQL can be used to work on streaming or batch data.
Syntax:
// get a TableEnvironment
// registration of a DataSet is equivalent
val tableEnv = TableEnvironment.getTableEnvironment(env)
val stream: DataStream[(Long, String)] = …
// register the DataStream as table "myTable" with fields "f0", "f1"
tableEnv.registerDataStream("myTable", stream)
Example:
object SQLToDataSetAndStreamSet {
  def main(args: Array[String]): Unit = {
// set up execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = TableEnvironment.getTableEnvironment(env)
// build the sample data
val orderA: DataStream[Order] = env.fromCollection(Seq(
Order(1L, "beer", 3),
Order(1L, "diaper", 4),
Order(3L, "rubber", 2)))
val orderB: DataStream[Order] = env.fromCollection(Seq(
Order(2L, "pen", 3),
Order(2L, "rubber", 3),
Order(4L, "beer", 1)))
// register the data streams as tables
tEnv.registerDataStream("OrderA", orderA)
tEnv.registerDataStream("OrderB", orderB)
// union the two tables
val result = tEnv.sqlQuery(
"SELECT * FROM OrderA WHERE amount > 2 UNION ALL " +
"SELECT * FROM OrderB WHERE amount < 2")
result.writeToSink(new CsvTableSink("ccc" , "," , 1 , FileSystem.WriteMode.OVERWRITE))
env.execute()
  }
}
case class Order(user: Long, product: String, amount: Int)
3. Converting a Table to a DataStream or DataSet
A Table can be converted into a DataStream or a DataSet. This way, a custom DataStream or DataSet program can run on the result of a Table API or SQL query.
1: Converting a Table to a DataStream
There are two modes for converting a Table into a DataStream:
1: Append Mode, which emits only appended rows and can be used when the table is modified by INSERT changes only;
2: Retract Mode, which can always be used and encodes INSERT and DELETE changes in the resulting stream.
Syntax:
// get a TableEnvironment
// registration of a DataSet is equivalent
val tableEnv = TableEnvironment.getTableEnvironment(env)
// Table with two fields (String name, Integer age)
val table: Table = …

// convert the Table into an append DataStream of Row
val dsRow: DataStream[Row] = tableEnv.toAppendStream[Row](table)

// convert the Table into an append DataStream of Tuple2[String, Int]
val dsTuple: DataStream[(String, Int)] = tableEnv.toAppendStream[(String, Int)](table)

// convert the Table into a retract DataStream of Row.
// A retract stream of type X is a DataStream[(Boolean, X)].
// The boolean field indicates the type of the change.
// True is INSERT, false is DELETE.
val retractStream: DataStream[(Boolean, Row)] = tableEnv.toRetractStream[Row](table)
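Putting the conversion together with the Order case class from the earlier example, here is a minimal sketch; the aggregation query, object name and sample data are only illustrative:
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.{DataStream, StreamExecutionEnvironment}
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala._
import org.apache.flink.types.Row

object TableToStreamDemo {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tEnv = TableEnvironment.getTableEnvironment(env)
    val orders: DataStream[Order] = env.fromCollection(Seq(
      Order(1L, "beer", 3), Order(1L, "diaper", 4), Order(3L, "rubber", 2)))
    tEnv.registerDataStream("Orders", orders)
    // a grouped aggregation produces updates, so the retract conversion is needed
    val counts: Table = tEnv.sqlQuery("SELECT product, COUNT(*) AS cnt FROM Orders GROUP BY product")
    val retract: DataStream[(Boolean, Row)] = tEnv.toRetractStream[Row](counts)
    retract.print()
    // a simple projection only appends rows, so toAppendStream is sufficient
    val products: Table = tEnv.sqlQuery("SELECT product, amount FROM Orders")
    tEnv.toAppendStream[(String, Int)](products).print()
    env.execute()
  }
}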
CREATE USER canal IDENTIFIED BY 'canal';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
-- GRANT ALL PRIVILEGES ON *.* TO 'canal'@'%';
FLUSH PRIVILEGES;
MySQL only produces binlog files when the log-bin option is enabled:
-rw-rw---- 1 mysql mysql 669 Aug 10 21:29 mysql-bin.000001
-rw-rw---- 1 mysql mysql 126 Aug 10 22:06 mysql-bin.000002
-rw-rw---- 1 mysql mysql 11799 Aug 15 18:17 mysql-bin.000003
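A minimal my.cnf sketch for enabling the binlog in the row format that canal expects (the server-id value is just an example):
[mysqld]
log-bin=mysql-bin    # enable binary logging; files are named mysql-bin.00000N as above
binlog-format=ROW    # canal requires row-based binlog events
server-id=1          # must be unique among the MySQL server and its replication clients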