I. Introduction to Kudu
Kudu is positioned as a real-time data warehouse. Two databases overlap with it in functionality: ODPS and HBase.
ODPS is positioned as an offline data warehouse; Kudu, a rising star on the real-time side, has clear advantages over ODPS when combined with big-data processing tools such as Spark:
1. Flexible to operate: with Spark + Kudu we can run statistical analysis quickly, far more flexibly than with ODPS.
2. Fast: in my experience, pulling data out of Kudu is very quick, which is exactly why Kudu fits the real-time warehouse role.
3. Table design lets you specify a type for each column, which is more flexible than HBase's rowkey.
4. Kudu queries far better than HBase: both rely on scans, but Kudu's primary key is a composite of several columns, so it filters much more effectively than HBase.
II. Creating a Kudu Table
CREATE TABLE dbName.tableName
(
  name STRING NOT NULL,
  timestamp BIGINT NOT NULL,
  dt STRING NOT NULL,
  label DOUBLE NOT NULL,
  PRIMARY KEY (name, timestamp, dt)
)
PARTITION BY HASH (timestamp) PARTITIONS 2,
RANGE (dt) (
  PARTITION "20181018" <= VALUES < "20181019",
  PARTITION "20181019" <= VALUES < "20181020",
  PARTITION "20181020" <= VALUES < "20181021",
  PARTITION "20181021" <= VALUES < "20181022",
  PARTITION "20181022" <= VALUES < "20181023",
  PARTITION "20181023" <= VALUES < "20181024"
)
STORED AS KUDU;
Kudu supports partitioned tables, and on top of the hash partitioning it allows a further split by range, which suits scenarios with very large data volumes.
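Range partitions like the ones above only cover a fixed window of dt values, so new ranges have to be added as time moves forward. A minimal sketch using the Java client's AlterTableOptions (the table and master names follow the examples in this post; error handling is omitted):

import org.apache.kudu.Schema;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

public class AddDtRangePartition {
    public static void main(String[] args) throws Exception {
        KuduClient client = new KuduClient.KuduClientBuilder("master-address-cluster").build();
        Schema schema = client.openTable("dbName.tableName").getSchema();
        // New range covering dt = "20181024" (lower bound inclusive, upper exclusive).
        PartialRow lower = schema.newPartialRow();
        lower.addString("dt", "20181024");
        PartialRow upper = schema.newPartialRow();
        upper.addString("dt", "20181025");
        client.alterTable("dbName.tableName",
            new AlterTableOptions().addRangePartition(lower, upper));
        client.close();
    }
}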
III. Kudu Query Example
import com.google.common.collect.Lists;
import org.apache.kudu.client.*;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;

/**
 * Kudu table operations.
 *
 * @author xuanchi.lyf
 */
public class KuduDao {

    private static final Logger logger = LoggerFactory.getLogger(KuduDao.class);

    private static final String KUDU_MASTER_ADDRESS = "master-address-cluster";

    private static KuduClient client = new KuduClient.KuduClientBuilder(
        KUDU_MASTER_ADDRESS).build();

    private static void insert() {
        KuduSession session = null;
        int size = 10;
        try {
            KuduTable table = client.openTable("dbName.tableName");
            session = client.newSession();
            // Buffer the writes locally and flush them as one batch; the buffer
            // must be large enough to hold the whole batch in MANUAL_FLUSH mode.
            session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH);
            session.setMutationBufferSpace(size);
            for (int i = 0; i < size; i++) {
                Insert insert = table.newInsert();
                insert.getRow().addString("name", "name" + i);
                insert.getRow().addLong("timestamp", System.currentTimeMillis());
                insert.getRow().addDouble("label", i);
                insert.getRow().addString("dt", "20181010");
                session.apply(insert);
            }
            session.flush();
            // Check for per-row errors collected during the flush.
            RowErrorsAndOverflowStatus status = session.getPendingErrors();
            if (status.isOverflowed()) {
                logger.error("Kudu error buffer overflowed; some row errors were dropped");
            }
            for (RowError rowError : status.getRowErrors()) {
                String errorMessage = String.format(
                    "Kudu output error '%s' during operation '%s' at tablet server '%s'",
                    rowError.getErrorStatus(), rowError.getOperation(), rowError.getTsUUID());
                logger.error(errorMessage);
            }
            logger.info("pending errors: {}", session.countPendingErrors());
        } catch (Exception e) {
            logger.error(e.getMessage(), e);
        } finally {
            if (session != null) {
                try {
                    session.close();
                } catch (KuduException e) {
                    logger.error(e.getMessage(), e);
                }
            }
        }
        logger.info("insert finished");
    }

    private static void query() {
        KuduScanner scanner = null;
        try {
            KuduTable table = client.openTable("dbName.tableName");
            // Only project the columns we actually read.
            List<String> projectColumns = Lists.newArrayList("name", "timestamp", "label", "dt");
            scanner = client.newScannerBuilder(table).setProjectedColumnNames(projectColumns)
                .limit(1000).build();
            while (scanner.hasMoreRows()) {
                RowResultIterator results = scanner.nextRows();
                while (results.hasNext()) {
                    RowResult result = results.next();
                    String name = result.getString("name");
                    System.out.println(name);
                }
            }
        } catch (KuduException e) {
            logger.error(e.getMessage(), e);
        } finally {
            if (scanner != null) {
                try {
                    scanner.close();
                } catch (KuduException e) {
                    logger.error(e.getMessage(), e);
                }
            }
        }
    }

    private static void close() {
        try {
            client.close();
        } catch (KuduException e) {
            logger.error(e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        insert();
        query();
        close();
    }
}
1. Kudu's write performance is similar to HBase's; both suit heavy-write workloads. On our cluster, 50K writes per second is no strain.
2. Partition information, i.e. the schema, is held inside the KuduTable object, so be careful not to keep that object around long-term; keep refreshing it (a sketch follows).
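A minimal sketch of that refresh pattern, with a hypothetical KuduTables helper (the class and method names are illustrative) that re-opens the table instead of caching it:

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduTable;

public class KuduTables {

    private final KuduClient client;

    public KuduTables(KuduClient client) {
        this.client = client;
    }

    // Re-open the table on every call so newly added range partitions and
    // schema changes are always visible; never cache the returned KuduTable.
    public KuduTable freshTable(String name) throws KuduException {
        return client.openTable(name);
    }
}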
IV. Querying with Predicates
KuduTable table = kuduClient.openTable(TABLE);
Schema schema = table.getSchema();
// Equality predicates on the leading key columns ...
KuduPredicate dtFilter = KuduPredicate.newComparisonPredicate(schema.getColumn("dt"),
    KuduPredicate.ComparisonOp.EQUAL, dt);
KuduPredicate appNameFilter = KuduPredicate.newComparisonPredicate(
    schema.getColumn("name"), KuduPredicate.ComparisonOp.EQUAL, name);
// ... plus a [startTimestamp, endTimestamp] range on the timestamp column.
KuduPredicate timestampFilter1 = KuduPredicate.newComparisonPredicate(
    schema.getColumn("timestamp"), KuduPredicate.ComparisonOp.GREATER_EQUAL,
    startTimestamp);
KuduPredicate timestampFilter2 = KuduPredicate.newComparisonPredicate(
    schema.getColumn("timestamp"), KuduPredicate.ComparisonOp.LESS_EQUAL, endTimestamp);
scanner = kuduClient.newScannerBuilder(table).addPredicate(dtFilter)
    .addPredicate(appNameFilter).addPredicate(timestampFilter1)
    .addPredicate(timestampFilter2).build();
while (scanner.hasMoreRows()) {
    RowResultIterator results = scanner.nextRows();
    while (results.hasNext()) {
        RowResult result = results.next();
        String name = result.getString("name");
        ...
        ...
    }
}
1. Kudu provides these filter predicates so we can narrow the data down quickly; filtering happens on the scan itself, which keeps data processing very fast (two more predicate types are sketched below).
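Beyond comparison predicates, the Java client also offers IN-list and NULL-check predicates (available in Kudu 1.3+). A minimal sketch against the same schema, assuming the schema, table, and kuduClient variables from the snippet above:

import java.util.Arrays;

// Match rows whose name is in a fixed set.
KuduPredicate nameIn = KuduPredicate.newInListPredicate(
    schema.getColumn("name"), Arrays.asList("name1", "name2"));
// Keep only rows where label is set.
KuduPredicate labelNotNull = KuduPredicate.newIsNotNullPredicate(schema.getColumn("label"));
KuduScanner scanner = kuduClient.newScannerBuilder(table)
    .addPredicate(nameIn).addPredicate(labelNotNull).build();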
V. Accessing Kudu from Spark SQL
Maven dependencies:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.3.2</version>
</dependency>
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-client</artifactId>
    <version>${kudu.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2-tools_2.11</artifactId>
    <version>${kudu.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.kudu</groupId>
    <artifactId>kudu-spark2_2.11</artifactId>
    <version>${kudu.version}</version>
</dependency>
demo:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

public class Application {

    private static final String KUDU_MASTER_ADDRESS = "...";

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("kudu-sparkSql").setMaster("local[*]")
            .set("spark.driver.userClassPathFirst", "true")
            .set("spark.sql.crossJoin.enabled", "true");
        SparkContext sparkContext = new SparkContext(conf);
        SparkSession sparkSession = SparkSession.builder().sparkContext(sparkContext).getOrCreate();
        // Declare the Kudu table's columns so Spark knows the schema up front.
        List<StructField> fields = Arrays.asList(
            DataTypes.createStructField("name", DataTypes.StringType, true),
            DataTypes.createStructField("ldc", DataTypes.StringType, true),
            DataTypes.createStructField("flag", DataTypes.StringType, true),
            DataTypes.createStructField("type", DataTypes.StringType, true),
            DataTypes.createStructField("service", DataTypes.StringType, true),
            DataTypes.createStructField("subservice", DataTypes.StringType, true),
            DataTypes.createStructField("timestamp", DataTypes.LongType, true),
            DataTypes.createStructField("label", DataTypes.DoubleType, true),
            DataTypes.createStructField("dt", DataTypes.StringType, true));
        StructType schema = DataTypes.createStructType(fields);
        Dataset<Row> dataset = sparkSession.read().format("org.apache.kudu.spark.kudu").schema(schema)
            .option("kudu.master", KUDU_MASTER_ADDRESS).option("kudu.table", "tableName").load();
        // createOrReplaceTempView replaces the deprecated registerTempTable.
        dataset.createOrReplaceTempView("t");
        sparkSession.sql("select * from t where dt = '20181102' and timestamp > 1541143802000 and timestamp < 1541145302000 limit 10").show();
        sparkSession.close();
    }
}
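For completeness, rows can also be written back through the same datasource. A hedged sketch that would slot into the main method above ("targetTable" is a hypothetical Kudu table that must already exist; "append" mode inserts the rows):

// Hypothetical write-back: persist a filtered view into another Kudu table.
Dataset<Row> filtered = sparkSession.sql("select * from t where dt = '20181102'");
filtered.write()
    .format("org.apache.kudu.spark.kudu")
    .option("kudu.master", KUDU_MASTER_ADDRESS)
    .option("kudu.table", "targetTable")
    .mode("append")
    .save();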