Flink自定义Catalog之oracle

Flink自定义Catalog之oracle

概述

Catalog提供了元数据信息,例如数据库、表、分区、视图以及数据库或其他外部系统中存储的函数和信息。Catalog 提供了一个统一的API,用于管理元数据,并使其可以从 Table API 和 SQL 查询语句中来访问。

从Flink1.15后官方提供了oracle版Table的DDL定义未提供catalog的DDL定义。为满足大数据中间件要求,参考官方PostgreSQL的catalog源代码,扩展JdbcCatalog实现oracle的catalog。

JdbcCatalog使得用户可以将 Flink 通过 JDBC 协议连接到关系数据库。

具体实现

第一步创建OracleCatalog类

创建OracleCatalog类继承于AbstractJdbcCatalog抽象类,新类能够复用catalog常规功能。在新类中覆盖部分方法实现专有功能。

public class OracleCatalog extends AbstractJdbcCatalog {

    private static final Logger LOG = LoggerFactory.getLogger(OracleCatalog.class);

    private final JdbcDialectTypeMapper dialectTypeMapper;

    private static final Set builtinDatabases = new HashSet() {

        {

            add("SCOTT");

            add("ANONYMOUS");

            add("XS$NULL");

            add("DIP");

            add("SPATIAL_WFS_ADMIN_USR");

            add("SPATIAL_CSW_ADMIN_USR");

            add("APEX_PUBLIC_USER");

            add("ORACLE_OCM");

            add("MDDATA");

        }

    };

    public OracleCatalog(ClassLoader userClassLoader,

                         String catalogName,

                         String defaultDatabase,

                         String username,

                         String pwd,

                         String baseUrl) {

        super(userClassLoader, catalogName, defaultDatabase, username, pwd, baseUrl);

        String driverVersion =

                Preconditions.checkNotNull(getDriverVersion(), "Driver version must not be null.");

        String databaseVersion =

                Preconditions.checkNotNull(

                        getDatabaseVersion(), "Database version must not be null.");

        LOG.info("Driver version: {}, database version: {}", driverVersion, databaseVersion);

        this.dialectTypeMapper = new OracleTypeMapper(databaseVersion, driverVersion);

    }

    private String getDatabaseVersion() {

        try (TemporaryClassLoaderContext ignored =

                     TemporaryClassLoaderContext.of(userClassLoader)) {

            try (Connection conn = DriverManager.getConnection(defaultUrl, username, pwd)) {

                return conn.getMetaData().getDatabaseProductVersion();

            } catch (Exception e) {

                throw new CatalogException(

                        String.format("Failed in getting Oracle version by %s.", defaultUrl), e);

            }

        }

    }

    private String getDriverVersion() {

        try (TemporaryClassLoaderContext ignored =

                     TemporaryClassLoaderContext.of(userClassLoader)) {

            try (Connection conn = DriverManager.getConnection(defaultUrl, username, pwd)) {

                String driverVersion = conn.getMetaData().getDriverVersion();

                Pattern regexp = Pattern.compile("\\d+?\\.\\d+?\\.\\d+");

                Matcher matcher = regexp.matcher(driverVersion);

                return matcher.find() ? matcher.group(0) : null;

            } catch (Exception e) {

                throw new CatalogException(

                        String.format("Failed in getting Oracle driver version by %s.", defaultUrl),

                        e);

            }

        }

    }

    @Override

    public List listDatabases() throws CatalogException {

        return extractColumnValuesBySQL(

                this.defaultUrl,

                "select username from sys.dba_users " +

                        "where DEFAULT_TABLESPACE <> 'SYSTEM' and DEFAULT_TABLESPACE <> 'SYSAUX' " +

                        " order by username",

                1,

                dbName -> !builtinDatabases.contains(dbName));

    }

    @Override

    public List listTables(String databaseName) throws DatabaseNotExistException, CatalogException {

        Preconditions.checkState(

                StringUtils.isNotBlank(databaseName), "Database name must not be blank.");

        if (!databaseExists(databaseName)){// 注意这个值是 oracle 实例名称

            throw new DatabaseNotExistException(getName(), databaseName);

        }

        List listDatabases = listDatabases().stream().map(username -> "'" + username + "'")

                .collect(Collectors.toList());

        return extractColumnValuesBySQL(

                this.defaultUrl,

                "SELECT OWNER||'.'||TABLE_NAME AS schemaTableName FROM sys.all_tables WHERE OWNER IN (" + String.join(",", listDatabases) + ")"+

                "ORDER BY OWNER,TABLE_NAME",1, null, null);

    }

    @Override

    public boolean tableExists(ObjectPath tablePath) throws CatalogException {

        String[] schemaTableNames = getSchemaTableName(tablePath).split("\\.");

        return !extractColumnValuesBySQL(

                defaultUrl,

                "SELECT table_name FROM sys.all_tables where OWNER = ? and table_name = ?",

                1, null,

                schemaTableNames[0], schemaTableNames[1])

                .isEmpty();

    }

    /**

     * Converts Oracle type to Flink {@link DataType}.

     */

    @Override

    protected DataType fromJDBCType(ObjectPath tablePath, ResultSetMetaData metadata, int colIndex)

            throws SQLException {

        return dialectTypeMapper.mapping(tablePath, metadata, colIndex);

    }

    @Override

    protected String getTableName(ObjectPath tablePath) {

//        return tablePath.getObjectName();

        return PostgresTablePath.fromFlinkTableName(tablePath.getObjectName()).getPgTableName();

    }

    @Override

    protected String getSchemaName(ObjectPath tablePath) {

//        return tablePath.getDatabaseName();

        return PostgresTablePath.fromFlinkTableName(tablePath.getObjectName()).getPgSchemaName();

    }

    @Override

    protected String getSchemaTableName(ObjectPath tablePath) {

//        return tablePath.getObjectName();

        return PostgresTablePath.fromFlinkTableName(tablePath.getObjectName()).getFullPath();

    }

}

需要特别说明的:

  • builtinDatabases中定义的数据库名是不需要加载到catalog中,一般是系统级数据库
  • 通过SQL语句获取库、表的元数据,注意全表名的命名为.
  • 改写tableExists方法,避免调用listTables方法(Postgres的Catalog没有避免),Flink发起表查询之前有更好的性能
  • 覆盖getTableName/getSchemaName/getSchemaTableName三个方法实现,直接使用PostgresTablePath中的方法
  • 第二步创建OracleTypeMapper类

    创建OracleTypeMapper类继承于JdbcDialectTypeMap抽象类,实现oracle字段类型转换FlinkSql字段类型。这个转换是最容易出错的地方,需要在使用过程中持续完善。

    public class OracleTypeMapper implements JdbcDialectTypeMapper {

        private final String databaseVersion;

        private final String driverVersion;

        public OracleTypeMapper(String databaseVersion, String driverVersion) {

            this.databaseVersion = databaseVersion;

            this.driverVersion = driverVersion;

        }

        @Override

        public DataType mapping(ObjectPath tablePath, ResultSetMetaData metadata, int colIndex)

                throws SQLException {

            int jdbcType = metadata.getColumnType(colIndex);

            String columnName = metadata.getColumnName(colIndex);

            String oracleType = metadata.getColumnTypeName(colIndex).toUpperCase();

            int precision = metadata.getPrecision(colIndex);

            int scale = metadata.getScale(colIndex);

            switch (jdbcType) {

                case Types.CHAR:

                case Types.VARCHAR:

                case Types.NCHAR:

                case Types.NVARCHAR:

                case Types.STRUCT:

                case Types.CLOB:

                    return DataTypes.STRING();

                case Types.BLOB:

                    return DataTypes.BYTES();

                case Types.INTEGER:

                case Types.SMALLINT:

                case Types.TINYINT:

                    return DataTypes.INT();

                case Types.FLOAT:

                case Types.REAL:

                case OracleTypes.BINARY_FLOAT:

                    return DataTypes.FLOAT();

                case Types.DOUBLE:

                case OracleTypes.BINARY_DOUBLE:

                    return DataTypes.DOUBLE();

                case Types.NUMERIC:

                case Types.DECIMAL:

                    if (precision > 0 && precision < DecimalType.MAX_PRECISION) {

                        return DataTypes.DECIMAL(precision, metadata.getScale(colIndex));

                    }

                    return DataTypes.DECIMAL(DecimalType.MAX_PRECISION, 18);

                case Types.DATE:

                    return DataTypes.DATE();

                case Types.TIMESTAMP:

                case Types.TIMESTAMP_WITH_TIMEZONE:

                case OracleTypes.TIMESTAMPTZ:

                case OracleTypes.TIMESTAMPLTZ:

                    return scale > 0 ? DataTypes.TIMESTAMP(scale) : DataTypes.TIMESTAMP();

                case OracleTypes.INTERVALYM:

                    return DataTypes.INTERVAL(DataTypes.YEAR(), DataTypes.MONTH());

                case OracleTypes.INTERVALDS:

                    return DataTypes.INTERVAL(DataTypes.DAY(), DataTypes.SECOND());

                case Types.BOOLEAN:

                    return DataTypes.BOOLEAN();

                default:

                    final String jdbcColumnName = metadata.getColumnName(colIndex);

                    throw new UnsupportedOperationException(

                            String.format(

                                    "Doesn't support Oracle type '%s' on column '%s' in Oracle version %s, driver version %s yet.",

                                    oracleType, jdbcColumnName, databaseVersion, driverVersion));

            }

        }

    }

    第三步改造OracleDialect

    在Flink新版本中官方已经提供了这个类,为什么还要改造呢?原因如下:

    1. 定义为私有类不能在第四步创建Catalog对象
    2. 在getLimitClause方法中使用oracle12("FETCH FIRST " + limit + " ROWS ONLY")的语法,而产品线大量使用oracle11版本,需要改造为更通用的语法("WHERE ROWNUM <= " + limit)

    public class OracleDialect extends AbstractDialect {

        private static final long serialVersionUID = 1L;

        // Define MAX/MIN precision of TIMESTAMP type according to Oracle docs:

        // https://www.techonthenet.com/oracle/datatypes.php

        private static final int MAX_TIMESTAMP_PRECISION = 9;

        private static final int MIN_TIMESTAMP_PRECISION = 1;

        // Define MAX/MIN precision of DECIMAL type according to Oracle docs:

        // https://www.techonthenet.com/oracle/datatypes.php

        private static final int MAX_DECIMAL_PRECISION = 38;

        private static final int MIN_DECIMAL_PRECISION = 1;

        @Override

        public JdbcRowConverter getRowConverter(RowType rowType) {

            return new OracleRowConverter(rowType);

        }

        @Override

        public String getLimitClause(long limit) {

            // 该语法直到 Oracle Database 12c 才有效。

    //        return "FETCH FIRST " + limit + " ROWS ONLY";

            // oracle 更通用的语法

            return "WHERE ROWNUM <= " + limit;

        }

        @Override

        public Optional defaultDriverName() {

            return Optional.of("oracle.jdbc.OracleDriver");

        }

        @Override

        public String dialectName() {

            return "Oracle";

        }

        @Override

        public String quoteIdentifier(String identifier) {

            return identifier;

            // 不能指定下面的字符,会修改`库.名`为"库.名"而报错

    //        return "\"" + identifier + "\"";

        }

        @Override

        public Optional getUpsertStatement(

                String tableName, String[] fieldNames, String[] uniqueKeyFields) {

            String sourceFields =

                    Arrays.stream(fieldNames)

                            .map(f -> ":" + f + " " + quoteIdentifier(f))

                            .collect(Collectors.joining(", "));

            String onClause =

                    Arrays.stream(uniqueKeyFields)

                            .map(f -> "t." + quoteIdentifier(f) + "=s." + quoteIdentifier(f))

                            .collect(Collectors.joining(" and "));

            final Set uniqueKeyFieldsSet =

                    Arrays.stream(uniqueKeyFields).collect(Collectors.toSet());

            String updateClause =

                    Arrays.stream(fieldNames)

                            .filter(f -> !uniqueKeyFieldsSet.contains(f))

                            .map(f -> "t." + quoteIdentifier(f) + "=s." + quoteIdentifier(f))

                            .collect(Collectors.joining(", "));

            String insertFields =

                    Arrays.stream(fieldNames)

                            .map(this::quoteIdentifier)

                            .collect(Collectors.joining(", "));

            String valuesClause =

                    Arrays.stream(fieldNames)

                            .map(f -> "s." + quoteIdentifier(f))

                            .collect(Collectors.joining(", "));

            // if we can't divide schema and table-name is risky to call quoteIdentifier(tableName)

            // for example [tbo].[sometable] is ok but [tbo.sometable] is not

            String mergeQuery =

                    " MERGE INTO "

                            + tableName

                            + " t "

                            + " USING (SELECT "

                            + sourceFields

                            + " FROM DUAL) s "

                            + " ON ("

                            + onClause

                            + ") "

                            + " WHEN MATCHED THEN UPDATE SET "

                            + updateClause

                            + " WHEN NOT MATCHED THEN INSERT ("

                            + insertFields

                            + ")"

                            + " VALUES ("

                            + valuesClause

                            + ")";

            return Optional.of(mergeQuery);

        }

        @Override

        public Optional decimalPrecisionRange() {

            return Optional.of(Range.of(MIN_DECIMAL_PRECISION, MAX_DECIMAL_PRECISION));

        }

        @Override

        public Optional timestampPrecisionRange() {

            return Optional.of(Range.of(MIN_TIMESTAMP_PRECISION, MAX_TIMESTAMP_PRECISION));

        }

        @Override

        public Set supportedTypes() {

            // The data types used in Oracle are list at:

            // https://www.techonthenet.com/oracle/datatypes.php

            return EnumSet.of(

                    LogicalTypeRoot.CHAR,

                    LogicalTypeRoot.VARCHAR,

                    LogicalTypeRoot.BOOLEAN,

                    LogicalTypeRoot.VARBINARY,

                    LogicalTypeRoot.DECIMAL,

                    LogicalTypeRoot.TINYINT,

                    LogicalTypeRoot.SMALLINT,

                    LogicalTypeRoot.INTEGER,

                    LogicalTypeRoot.BIGINT,

                    LogicalTypeRoot.FLOAT,

                    LogicalTypeRoot.DOUBLE,

                    LogicalTypeRoot.DATE,

                    LogicalTypeRoot.TIME_WITHOUT_TIME_ZONE,

                    LogicalTypeRoot.TIMESTAMP_WITHOUT_TIME_ZONE,

                    LogicalTypeRoot.TIMESTAMP_WITH_LOCAL_TIME_ZONE,

                    LogicalTypeRoot.ARRAY);

        }

    }

    第四步改造JdbcCatalogUtils

    这个类是创建Catalog的关键类。修改该类的createCatalog方法,添加创建OracleCatalog对象的代码。

    public class JdbcCatalogUtils {

        public static void validateJdbcUrl(String url) {

              String[] parts = url.trim().split("\\/+");

              checkArgument(parts.length == 2);

        }

        /**

         * Create catalog instance from given information.

         */

        public static AbstractJdbcCatalog createCatalog(

                ClassLoader userClassLoader,

                String catalogName,

                String defaultDatabase,

                String username,

                String pwd,

                String baseUrl) {

            JdbcDialect dialect = JdbcDialectLoader.load(baseUrl, userClassLoader);

            if (dialect instanceof PostgresDialect) {

                return new PostgresCatalog(

                        userClassLoader, catalogName, defaultDatabase, username, pwd, baseUrl);

            } else if (dialect instanceof MySqlDialect) {

                return new MySqlCatalog(

                        userClassLoader, catalogName, defaultDatabase, username, pwd, baseUrl);

            }

            // 添加 oracle的catalog

            else if (dialect instanceof OracleDialect) {

                return new OracleCatalog(

                        userClassLoader, catalogName, defaultDatabase, username, pwd, baseUrl);

            }

     else {

                throw new UnsupportedOperationException(

                        String.format("Catalog for '%s' is not supported yet.", dialect));

            }

        }

    }

    使用OracleCatalog

    OracleCatalog的使用和官方PostgreSQL使用近似,请参考JDBC | Apache Flink

    使用案例:

    CREATE CATALOG ORACLE_CATALOG WITH(

     'type' = 'jdbc',

     'default-database' = 'orclgps',    -- 指定数据库实例名

     'username' = '****', 'password' = '****',

     'base-url' = 'jdbc:oracle:thin:@//ip:port' --不要添加数据库实例名

    );

    SELECT `LINE_NO`, `UP_PASSENGER`, `DOWN_PASSENGER`, `SITE_TIME`, `STATION_NAME`

    -- 注意全表名查询数据的格式

    FROM ORACLE_CATALOG.orclgps.`TMGPS.TM_BUS_PASSENGER_UPDOWN` LIMIT 1;

    使用经验分享

    使用Catalog能够替换TABEL定义,减少字段类型转换问题和当字段多的时候能够大幅减少DDL的代码量。分析Flink通过JdbcCatalog发起查询过程源代码,必定调用tableExists方法,该方法中级联调用获取数据库和每个库的表,如果表非常多,有一定的性能损耗(OracleCatalog实现中做了优化)。

    建议在批作业且短周期调度(比如10分钟以内)、尤其是adHoc查询时优先使用Table定义;流作业、长周期调度时使用Catalog定义(字段多、类型转换可以通过大数据中间件提供的工具自动生成DDL)。

    你可能感兴趣的:(oracle,flink,数据库)