Kudu Operations

1. Introduction to Kudu

        Kudu is positioned as a real-time data warehouse. Two systems overlap with it in functionality: ODPS and HBase.

        ODPS is positioned as a batch data warehouse. Kudu, a rising star among real-time data warehouses, has clear advantages over ODPS when combined with big-data tools such as Spark:

        1. Flexible operations: with Spark + Kudu we can run statistical analysis quickly, far more flexibly than with ODPS.

        2. Speed: in my own experience, pulling data out of Kudu is very fast, which is why it is positioned as a real-time warehouse.

        3. Table design lets you specify column types, which is more flexible than HBase's opaque rowkey.

        4. Kudu's queries are much better than HBase's. Both ultimately scan, but Kudu's primary key is a composite of several columns, which makes it far stronger at filtering.

2. Creating a Kudu Table

CREATE TABLE dbName.tableName
(
    name STRING NOT NULL,
    timestamp BIGINT NOT NULL,
    dt STRING NOT NULL,
    label DOUBLE NOT NULL,
    PRIMARY KEY (name, timestamp, dt)
) USING kudu
PARTITION BY HASH (timestamp) PARTITIONS 2,
RANGE (dt) (
    PARTITION "20181018" <= VALUES < "20181019",
    PARTITION "20181019" <= VALUES < "20181020",
    PARTITION "20181020" <= VALUES < "20181021",
    PARTITION "20181021" <= VALUES < "20181022",
    PARTITION "20181022" <= VALUES < "20181023",
    PARTITION "20181023" <= VALUES < "20181024"
)

        Kudu supports partitioned tables, and on top of one partitioning scheme it allows a second level of partitioning (hash plus range, as above), which suits very large data volumes. The sketch below shows the equivalent creation through the Java client.
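        For reference, the same table can also be created and maintained through the Kudu Java client. This is a minimal sketch assuming the schema above (the master address is a placeholder, matching the examples later in this post); it also shows adding the next day's range partition via alterTable, which is how a daily table keeps growing.

import com.google.common.collect.ImmutableList;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

public class CreateKuduTable {

    public static void main(String[] args) throws Exception {
        KuduClient client = new KuduClient.KuduClientBuilder("master-address-cluster").build();

        // Key columns must be NOT NULL and come first, in primary-key order.
        Schema schema = new Schema(ImmutableList.of(
            new ColumnSchema.ColumnSchemaBuilder("name", Type.STRING).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("timestamp", Type.INT64).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("dt", Type.STRING).key(true).build(),
            new ColumnSchema.ColumnSchemaBuilder("label", Type.DOUBLE).build()));

        // 2 hash buckets on timestamp, plus one range partition per day on dt.
        CreateTableOptions options = new CreateTableOptions()
            .addHashPartitions(ImmutableList.of("timestamp"), 2)
            .setRangePartitionColumns(ImmutableList.of("dt"));
        PartialRow lower = schema.newPartialRow();
        lower.addString("dt", "20181018");
        PartialRow upper = schema.newPartialRow();
        upper.addString("dt", "20181019");
        options.addRangePartition(lower, upper); // repeat for each day

        client.createTable("dbName.tableName", schema, options);

        // Later, a new day's range partition can be added without recreating the table.
        PartialRow newLower = schema.newPartialRow();
        newLower.addString("dt", "20181024");
        PartialRow newUpper = schema.newPartialRow();
        newUpper.addString("dt", "20181025");
        client.alterTable("dbName.tableName",
            new AlterTableOptions().addRangePartition(newLower, newUpper));

        client.close();
    }
}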

3. Kudu Read and Write Example

import com.google.common.collect.Lists;
import org.apache.kudu.client.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

/**
 * Kudu table operations
 *
 * @author xuanchi.lyf
 */
public class KuduDao {

    private static final Logger logger              = LoggerFactory.getLogger(KuduDao.class);

    private static final String KUDU_MASTER_ADDRESS = "master-address-cluster";

    private static KuduClient   client              = new KuduClient.KuduClientBuilder(
        KUDU_MASTER_ADDRESS).build();

    private static void insert() {
        KuduSession session = null;
        int size = 10;
        try {
            KuduTable table = client.openTable("dbName.tableName");
            session = client.newSession();
            session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
            session.setMutationBufferSpace(size);
            for (int i = 0; i < size; i++) {
                Insert insert = table.newInsert();
                insert.getRow().addString("name", "name" + i);
                insert.getRow().addLong("timestamp", System.currentTimeMillis());
                insert.getRow().addDouble("label", i);
                insert.getRow().addString("dt", "20181010");
                session.apply(insert);
            }
            session.flush();
            RowErrorsAndOverflowStatus status = session.getPendingErrors();
            if (status.isOverflowed()) {
                logger.error("操作kudu溢出");
            }
            for (RowError rowError : status.getRowErrors()) {
                String errorMessage = String.format(
                    "Kudu output error '%s' during operation '%s' at tablet server '%s'",
                    rowError.getErrorStatus(), rowError.getOperation(), rowError.getTsUUID());
                logger.error(errorMessage);
            }
            logger.info("pending errors: {}", session.countPendingErrors());
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        } finally {
            if (session != null) {
                try {
                    session.close();
                } catch (KuduException e) {
                    logger.error(e.getMessage(), e);
                }
            }
        }
        logger.info("插入结束");
    }

    private static void query() {
        KuduScanner scanner = null;
        try {
            KuduTable table = client.openTable("dbName.tableName");
            List<String> projectColumns = Lists.newArrayList("name", "timestamp", "label", "dt");

            scanner = client.newScannerBuilder(table).setProjectedColumnNames(projectColumns)
                .limit(1000).build();
            while (scanner.hasMoreRows()) {
                RowResultIterator results = scanner.nextRows();
                while (results.hasNext()) {
                    RowResult result = results.next();
                    String name = result.getString("name");
                    System.out.println(name);
                }
            }
        } catch (KuduException e) {
            logger.error(e.getMessage(), e);
        } finally {
            if (scanner != null) {
                try {
                    scanner.close();
                } catch (KuduException e) {
                    logger.error(e.getMessage(), e);
                }
            }
        }
    }

    private static void close() {
        try {
            client.close();
        } catch (KuduException e) {
            logger.error(e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        insert();
        query();
        close();
    }

}

        1. Kudu's write performance is comparable to HBase's; both suit write-heavy workloads. On our cluster, 50,000 writes per second poses no problem (see the flush-mode sketch after this list).

        2. Partition information, i.e. the schema, lives in the KuduTable object, so be careful not to hold on to that object indefinitely; re-open the table periodically to refresh it.
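        For sustained write throughput, the flush mode matters: MANUAL_FLUSH as above gives explicit control, but AUTO_FLUSH_BACKGROUND lets the client batch and flush asynchronously. A minimal sketch, assuming the table from section 2; the buffer size and flush interval are illustrative values, not tuning advice. Per note 2, the table is re-opened rather than cached.

import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.SessionConfiguration;

public class KuduBulkWriter {

    public static void main(String[] args) throws Exception {
        KuduClient client = new KuduClient.KuduClientBuilder("master-address-cluster").build();
        // Re-open the table instead of caching a stale KuduTable (see note 2).
        KuduTable table = client.openTable("dbName.tableName");

        KuduSession session = client.newSession();
        // AUTO_FLUSH_BACKGROUND batches writes and flushes asynchronously,
        // which is what makes sustained high write rates possible.
        session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);
        session.setMutationBufferSpace(10000); // illustrative buffer size
        session.setFlushInterval(1000);        // flush at least once a second

        long now = System.currentTimeMillis();
        for (int i = 0; i < 100000; i++) {
            Insert insert = table.newInsert();
            insert.getRow().addString("name", "name" + i);
            insert.getRow().addLong("timestamp", now + i);
            insert.getRow().addDouble("label", i);
            insert.getRow().addString("dt", "20181018");
            session.apply(insert);
        }
        session.flush(); // drain anything still buffered
        if (session.countPendingErrors() > 0) {
            // errors from background flushes surface here, not from apply()
            System.err.println(session.getPendingErrors().getRowErrors().length + " row errors");
        }
        session.close();
        client.close();
    }
}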

4. Querying with Predicates

KuduTable table = kuduClient.openTable(TABLE);
Schema schema = table.getSchema();
KuduPredicate dtFilter = KuduPredicate.newComparisonPredicate(
    schema.getColumn("dt"), KuduPredicate.ComparisonOp.EQUAL, dt);
KuduPredicate appNameFilter = KuduPredicate.newComparisonPredicate(
    schema.getColumn("name"), KuduPredicate.ComparisonOp.EQUAL, name);
KuduPredicate timestampFilter1 = KuduPredicate.newComparisonPredicate(
    schema.getColumn("timestamp"), KuduPredicate.ComparisonOp.GREATER_EQUAL, startTimestamp);
KuduPredicate timestampFilter2 = KuduPredicate.newComparisonPredicate(
    schema.getColumn("timestamp"), KuduPredicate.ComparisonOp.LESS_EQUAL, endTimestamp);
scanner = kuduClient.newScannerBuilder(table)
    .addPredicate(dtFilter)
    .addPredicate(appNameFilter)
    .addPredicate(timestampFilter1)
    .addPredicate(timestampFilter2)
    .build();
            
while (scanner.hasMoreRows()) {
    RowResultIterator results = scanner.nextRows();
    while (results.hasNext()) {
        RowResult result = results.next();
        String name = result.getString("name");
        ...
        ...
    }
}

        1. Kudu provides predicates that are pushed down to the tablet servers, letting us filter data very quickly; a further predicate type is sketched below.
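        Besides comparison predicates, the Java client also offers an in-list predicate (newInListPredicate, available in recent client versions). A short sketch against the same schema as above; the value list is illustrative:

// Match rows whose "name" is any of several values; the predicate is
// pushed down and evaluated on the tablet servers.
KuduPredicate nameInFilter = KuduPredicate.newInListPredicate(
    schema.getColumn("name"), Lists.newArrayList("name1", "name2", "name3"));

scanner = kuduClient.newScannerBuilder(table)
    .addPredicate(nameInFilter)
    .addPredicate(dtFilter) // predicates combine with AND
    .build();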

5. Accessing Kudu from Spark SQL



<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.2</version>
</dependency>

<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-client</artifactId>
    <version>${kudu.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2-tools_2.11</artifactId>
    <version>${kudu.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2_2.11</artifactId>
    <version>${kudu.version}</version>
</dependency>
Demo:

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

public class Application {

    private static final String KUDU_MASTER_ADDRESS = "...";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("kudu-sparkSql").setMaster("local[*]")
            .set("spark.driver.userClassPathFirst", "true")
            .set("spark.sql.crossJoin.enabled", "true");
        SparkContext sparkContext = new SparkContext(conf);
        SparkSession sparkSession = SparkSession.builder().sparkContext(sparkContext).getOrCreate();

        List<StructField> fields = Arrays.asList(
            DataTypes.createStructField("name", DataTypes.StringType, true),
            DataTypes.createStructField("ldc", DataTypes.StringType, true),
            DataTypes.createStructField("flag", DataTypes.StringType, true),
            DataTypes.createStructField("type", DataTypes.StringType, true),
            DataTypes.createStructField("service", DataTypes.StringType, true),
            DataTypes.createStructField("subservice", DataTypes.StringType, true),
            DataTypes.createStructField("timestamp", DataTypes.LongType, true),
            DataTypes.createStructField("label", DataTypes.DoubleType, true),
            DataTypes.createStructField("dt", DataTypes.StringType, true));
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> dataset = sparkSession.read().format("org.apache.kudu.spark.kudu").schema(schema)
            .option("kudu.master", KUDU_MASTER_ADDRESS).option("kudu.table", "tableName").load();
        dataset.createOrReplaceTempView("t");
        sparkSession.sql("select * from t where dt = '20181102' and timestamp > 1541143802000 and timestamp < 1541145302000 limit 10").show();

        sparkSession.close();
    }

}
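        Writing back from Spark goes through the same data source. As far as I know the connector only supports append mode (which upserts rows under the hood); a minimal sketch, assuming the dataset's schema matches the target table:

// Append (upsert) the rows of an existing Dataset<Row> into a Kudu table.
dataset.write().format("org.apache.kudu.spark.kudu")
    .option("kudu.master", KUDU_MASTER_ADDRESS)
    .option("kudu.table", "tableName")
    .mode("append") // only append is supported by the connector
    .save();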

 
