Flink SQL 功能介绍

Flink SQL 功能介绍

  • SQL Job Support
  • Processing Mode :
  • Flink Sources :
  • Flink Sinks :
  • 本地调试与提交部署的区别
  • 支持的数据格式
  • 受支持的Time Attributes(时间属性)
  • WaterMark
  • 窗口函数时间单位
  • 目前支持的数据类型
  • Field Type Mapping
  • 目前不支持的SQL操作(以官网为参考)
  • Reserved Keywords

Flink Version : 1.8.2
Kafka Version : 0.10
Scala Version : 2.12

SQL Job Support

  • 注意事项(请务必仔细阅读此注意事项)
  • 单个CSV测试文件不得超过30行
  • 对于聚合和去重操作,本地CSV调试时的Sink端可以支持,但提交YARN运行时Kafka的Sink端并不支持(KafkaSink属于Append)。
  • 使用CSV文件测试时可使用From Field时间属性(ROW_TIME)和PROC_TIME
  • 如果Source端数据的格式为CSV,请按照顺序定义其字段名及类型。
    • 字段定义:
      fieldNames:userId、itemId、categoryId、behavior、timestemps
      fieldTypes:bigint、bigint、bigint、varchar、bigint
    • 数据样版:
      543462,1715,1464116,pv,1511658000
  • 如Source端数据的格式为JSON,定义字段名时请与JSON数据的属性名相对应(顺序可忽略)。否则无法对应的字段将显示为空或null。
    • 字段定义:
      fieldNames:userId、itemId、categoryId、timestemps、behavior
      fieldTypes:bigint、bigint、bigint、bigint、varchar
    • 数据样版:
      {“userId”:”148725”, “itemId”:”1798692”, “categoryId”:”737184”, “timestemps”:”1511667184”, “behavior”:”pv”]
  • 时间特性为ROW_TIME时source端不能使用"rowtime"命名字段(后台会根据时间戳指端自动生成),同理时间特性为PROC_TIME时不可使用proctime。否则会抛出ValidationException:Duplicate field name rowtime
  • 表名、字段名区分大小写
  • 查询(Select)出来的字段在数量和顺序上需要和Sink端的表相对应
    • 例如:
      SQL:Insert into tableName2 select id, age, name from tableName1;
      Sink端Table注册时的字段顺序:{“id”:”bigint”,”age”:”int”,”name”:”varchar”}
  • 如果KafkaSink端的数据格式为JSON,请使用cast(rowtime as varchar) as rowtime或其它方法进行类型转换,否则进入Sink端的数据依然是bigint型数据。
  • 使用From Source或者使用ProcessingTime(处理时间)后,rowTime字段会出现8小时的时间差,此时可以调用TO_LOCAL_DATA / TO_LOCAL_TIMESTAMP(函数)来处理。
  • 表名的组成包括支持数字、字母及下划线。
  • udtf和聚合窗口搭配时下游没有输出,暂时没有找到正确的写法。
  • 时间字段的源字段必须是Long、Timestamp或String类型。
  • 在时间窗口内使用的时间字段必须是程序自动生成的rowtime或proctime字段。
  • 目前使用ES Sink端时group 的字段需显式指定在select后面,源码中会去匹配。
  • 不支持维表JOIN。

Processing Mode :

  • Streaming

Flink Sources :

  • Kafka
  • Csv(本地测试)

Flink Sinks :

  • Kafka(AppendMode)
  • Elasticsearcha(AppendMode、UpsertMode)

本地调试与提交部署的区别

  • 在时间属性上,本地调试时本地调试时支持PROC_TIME、ROW_TIME,,但不包括KAFKA_TIME(数据进入kafka的时间,属于ROW_TIME范畴)。
  • 本地调试时只支持CSV格式的数据文件作为数据源,KAFKA可以支持CSV及JSON。
  • 本地调试时同时支持了AppendMode和UpsertMode两种模式,所以要注意sql语法属于哪一种模式。即使sql语句在调试时通过,提交部署后也可能会出错。
    • 例如GroupBy、 Having、Distinct、UDAF这几个操作都属于属于UpsertMode。
    • UpsertMode操作在单独使用的情况下,Sink端只能是Elasticsearch(目前),但搭配TUMBLE、HOP、SESSION这些窗口函数使用时可以支持AppendMode。

支持的数据格式

CSV、JSON(不包括本地调试)

受支持的Time Attributes(时间属性)

  • rocessing Time : 处理时间
  • Event Time : 数据本身自带的时间
    • From Field : 数据中自带的时间
    • From Source : 数据进入source端时的时间(目前只支持Kafka)

WaterMark

为rowtime属性设置一个内置的WaterMark,数据在一定的时间间隔内是有可能是无序的。这样会导致部分流数据丢失,可以使用水位来设置数据的最大延迟时间来尽可能的防止数据丢失(单位/毫秒)

窗口函数时间单位

YEAR

MONTH

DAY

HOUR

MINUTE

SECOND

目前支持的数据类型

字符型:varchar、boolean(布尔)、tinyint(字节)

数值型:bigint、int、float、smallint 、decimal

日期型:date、time、timestamp

Field Type Mapping

SQL varchar boolean tinyint smallint int bigint float double decimal date time timestamp
Java lang.String lang.Boolean lang.Byte lang.Short lang.Integer lang.Long lang.Float lang.Double math.BigDecimal sql.Date sql.Time sql.Timestamp

目前不支持的SQL操作(以官网为参考)

  • 流表查询单独使用Order By时必须按照升序时间属性排序;
  • Rowtime属性不能在常规联接的输入行中(Rowtime attributes must not be in the input rows of a regular join. As a workaround you can cast the time attributes of input tables to TIMESTAMP before.);
- Opertation
OrderBy & Limit Limit
Set Operations In、Union、Intersect、Except、Exists(不存在)
Aggregations Grouping Sets
Joins Full Outer、Right、Left、cross join unnest、Time-windowed Join

Reserved Keywords

虽然还没有实现所有的SQL特性,但是一些字符串组合已经被保留为将来使用的关键字。如果想使用下列字符串中的一个作为字段名,确保使用反引号(例如 value, count)

A, ABS, ABSOLUTE, ACTION, ADA, ADD, ADMIN, AFTER, ALL, ALLOCATE, ALLOW, ALTER, ALWAYS, AND, ANY, ARE, ARRAY, AS, ASC, ASENSITIVE, ASSERTION, ASSIGNMENT, ASYMMETRIC, AT, ATOMIC, ATTRIBUTE, ATTRIBUTES, AUTHORIZATION, AVG, BEFORE, BEGIN, BERNOULLI, BETWEEN, BIGINT, BINARY, BIT, BLOB, BOOLEAN, BOTH, BREADTH, BY, C, CALL, CALLED, CARDINALITY, CASCADE, CASCADED, CASE, CAST, CATALOG, CATALOG_NAME, CEIL, CEILING, CENTURY, CHAIN, CHAR, CHARACTER, CHARACTERISTICS, CHARACTERS, CHARACTER_LENGTH, CHARACTER_SET_CATALOG, CHARACTER_SET_NAME, CHARACTER_SET_SCHEMA, CHAR_LENGTH, CHECK, CLASS_ORIGIN, CLOB, CLOSE, COALESCE, COBOL, COLLATE, COLLATION, COLLATION_CATALOG, COLLATION_NAME, COLLATION_SCHEMA, COLLECT, COLUMN, COLUMN_NAME, COMMAND_FUNCTION, COMMAND_FUNCTION_CODE, COMMIT, COMMITTED, CONDITION, CONDITION_NUMBER, CONNECT, CONNECTION, CONNECTION_NAME, CONSTRAINT, CONSTRAINTS, CONSTRAINT_CATALOG, CONSTRAINT_NAME, CONSTRAINT_SCHEMA, CONSTRUCTOR, CONTAINS, CONTINUE, CONVERT, CORR, CORRESPONDING, COUNT, COVAR_POP, COVAR_SAMP, CREATE, CROSS, CUBE, CUME_DIST, CURRENT, CURRENT_CATALOG, CURRENT_DATE, CURRENT_DEFAULT_TRANSFORM_GROUP, CURRENT_PATH, CURRENT_ROLE, CURRENT_SCHEMA, CURRENT_TIME, CURRENT_TIMESTAMP, CURRENT_TRANSFORM_GROUP_FOR_TYPE, CURRENT_USER, CURSOR, CURSOR_NAME, CYCLE, DATA, DATABASE, DATE, DATETIME_INTERVAL_CODE, DATETIME_INTERVAL_PRECISION, DAY, DEALLOCATE, DEC, DECADE, DECIMAL, DECLARE, DEFAULT, DEFAULTS, DEFERRABLE, DEFERRED, DEFINED, DEFINER, DEGREE, DELETE, DENSE_RANK, DEPTH, DEREF, DERIVED, DESC, DESCRIBE, DESCRIPTION, DESCRIPTOR, DETERMINISTIC, DIAGNOSTICS, DISALLOW, DISCONNECT, DISPATCH, DISTINCT, DOMAIN, DOUBLE, DOW, DOY, DROP, DYNAMIC, DYNAMIC_FUNCTION, DYNAMIC_FUNCTION_CODE, EACH, ELEMENT, ELSE, END, END-EXEC, EPOCH, EQUALS, ESCAPE, EVERY, EXCEPT, EXCEPTION, EXCLUDE, EXCLUDING, EXEC, EXECUTE, EXISTS, EXP, EXPLAIN, EXTEND, EXTERNAL, EXTRACT, FALSE, FETCH, FILTER, FINAL, FIRST, FIRST_VALUE, FLOAT, FLOOR, FOLLOWING, FOR, FOREIGN, FORTRAN, FOUND, FRAC_SECOND, FREE, FROM, FULL, FUNCTION, FUSION, G, GENERAL, GENERATED, GET, GLOBAL, GO, GOTO, GRANT, GRANTED, GROUP, GROUPING, HAVING, HIERARCHY, HOLD, HOUR, IDENTITY, IMMEDIATE, IMPLEMENTATION, IMPORT, IN, INCLUDING, INCREMENT, INDICATOR, INITIALLY, INNER, INOUT, INPUT, INSENSITIVE, INSERT, INSTANCE, INSTANTIABLE, INT, INTEGER, INTERSECT, INTERSECTION, INTERVAL, INTO, INVOKER, IS, ISOLATION, JAVA, JOIN, K, KEY, KEY_MEMBER, KEY_TYPE, LABEL, LANGUAGE, LARGE, LAST, LAST_VALUE, LATERAL, LEADING, LEFT, LENGTH, LEVEL, LIBRARY, LIKE, LIMIT, LN, LOCAL, LOCALTIME, LOCALTIMESTAMP, LOCATOR, LOWER, M, MAP, MATCH, MATCHED, MAX, MAXVALUE, MEMBER, MERGE, MESSAGE_LENGTH, MESSAGE_OCTET_LENGTH, MESSAGE_TEXT, METHOD, MICROSECOND, MILLENNIUM, MIN, MINUTE, MINVALUE, MOD, MODIFIES, MODULE, MONTH, MORE, MULTISET, MUMPS, NAME, NAMES, NATIONAL, NATURAL, NCHAR, NCLOB, NESTING, NEW, NEXT, NO, NONE, NORMALIZE, NORMALIZED, NOT, NULL, NULLABLE, NULLIF, NULLS, NUMBER, NUMERIC, OBJECT, OCTETS, OCTET_LENGTH, OF, OFFSET, OLD, ON, ONLY, OPEN, OPTION, OPTIONS, OR, ORDER, ORDERING, ORDINALITY, OTHERS, OUT, OUTER, OUTPUT, OVER, OVERLAPS, OVERLAY, OVERRIDING, PAD, PARAMETER, PARAMETER_MODE, PARAMETER_NAME, PARAMETER_ORDINAL_POSITION, PARAMETER_SPECIFIC_CATALOG, PARAMETER_SPECIFIC_NAME, PARAMETER_SPECIFIC_SCHEMA, PARTIAL, PARTITION, PASCAL, PASSTHROUGH, PATH, PERCENTILE_CONT, PERCENTILE_DISC, PERCENT_RANK, PLACING, PLAN, PLI, POSITION, POWER, PRECEDING, PRECISION, PREPARE, PRESERVE, PRIMARY, PRIOR, PRIVILEGES, PROCEDURE, PUBLIC, QUARTER, RANGE, RANK, READ, READS, REAL, RECURSIVE, REF, REFERENCES, REFERENCING, REGR_AVGX, REGR_AVGY, REGR_COUNT, REGR_INTERCEPT, REGR_R2, REGR_SLOPE, REGR_SXX, REGR_SXY, REGR_SYY, RELATIVE, RELEASE, REPEATABLE, RESET, RESTART, RESTRICT, RESULT, RETURN, RETURNED_CARDINALITY, RETURNED_LENGTH, RETURNED_OCTET_LENGTH, RETURNED_SQLSTATE, RETURNS, REVOKE, RIGHT, ROLE, ROLLBACK, ROLLUP, ROUTINE, ROUTINE_CATALOG, ROUTINE_NAME, ROUTINE_SCHEMA, ROW, ROWS, ROW_COUNT, ROW_NUMBER, SAVEPOINT, SCALE, SCHEMA, SCHEMA_NAME, SCOPE, SCOPE_CATALOGS, SCOPE_NAME, SCOPE_SCHEMA, SCROLL, SEARCH, SECOND, SECTION, SECURITY, SELECT, SELF, SENSITIVE, SEQUENCE, SERIALIZABLE, SERVER, SERVER_NAME, SESSION, SESSION_USER, SET, SETS, SIMILAR, SIMPLE, SIZE, SMALLINT, SOME, SOURCE, SPACE, SPECIFIC, SPECIFICTYPE, SPECIFIC_NAME, SQL, SQLEXCEPTION, SQLSTATE, SQLWARNING, SQL_TSI_DAY, SQL_TSI_FRAC_SECOND, SQL_TSI_HOUR, SQL_TSI_MICROSECOND, SQL_TSI_MINUTE, SQL_TSI_MONTH, SQL_TSI_QUARTER, SQL_TSI_SECOND, SQL_TSI_WEEK, SQL_TSI_YEAR, SQRT, START, STATE, STATEMENT, STATIC, STDDEV_POP, STDDEV_SAMP, STREAM, STRUCTURE, STYLE, SUBCLASS_ORIGIN, SUBMULTISET, SUBSTITUTE, SUBSTRING, SUM, SYMMETRIC, SYSTEM, SYSTEM_USER, TABLE, TABLESAMPLE, TABLE_NAME, TEMPORARY, THEN, TIES, TIME, TIMESTAMP, TIMESTAMPADD, TIMESTAMPDIFF, TIMEZONE_HOUR, TIMEZONE_MINUTE, TINYINT, TO, TOP_LEVEL_COUNT, TRAILING, TRANSACTION, TRANSACTIONS_ACTIVE, TRANSACTIONS_COMMITTED, TRANSACTIONS_ROLLED_BACK, TRANSFORM, TRANSFORMS, TRANSLATE, TRANSLATION, TREAT, TRIGGER, TRIGGER_CATALOG, TRIGGER_NAME, TRIGGER_SCHEMA, TRIM, TRUE, TYPE, UESCAPE, UNBOUNDED, UNCOMMITTED, UNDER, UNION, UNIQUE, UNKNOWN, UNNAMED, UNNEST, UPDATE, UPPER, UPSERT, USAGE, USER, USER_DEFINED_TYPE_CATALOG, USER_DEFINED_TYPE_CODE, USER_DEFINED_TYPE_NAME, USER_DEFINED_TYPE_SCHEMA, USING, VALUE, VALUES, VARBINARY, VARCHAR, VARYING, VAR_POP, VAR_SAMP, VERSION, VIEW, WEEK, WHEN, WHENEVER, WHERE, WIDTH_BUCKET, WINDOW, WITH, WITHIN, WITHOUT, WORK, WRAPPER, WRITE, XML, YEAR, ZONE

你可能感兴趣的:(flink,sql,大数据)