This release adds the following features:
Read YDB data with Hive for analysis.
Read YDB data with Spark for analysis.
A programmatic interface for exporting data to other systems.
A MapReduce InputFormat interface.
YDB download:
(You must accept the license agreement before using this software; license agreement download.)
Current version: v1.0.5-beta
360 cloud drive: http://yunpan.cn/cuHD72ifTWtz2  access code: 5928
By mapping YDB data into Hive, you can use Hive or Spark to extend YDB's capabilities and run complex queries (such as multi-table joins, medians, and nested SQL).
add jar /data/xxx.xxx.xx/ydb-x.x.x-pg.jar;
Two points to note:
1. Map only the fields you actually need; avoid mapping unused fields.
2. Use the WHERE clause of the mapped SQL to restrict the number of mapped rows as much as possible.
Both measures reduce the volume of data YDB passes to Hive and also lower YDB's own disk I/O, so queries run faster.
Mapping example 1:
CREATE EXTERNAL TABLE ydbhive_example (
tradetime string,tradenum string,tradeid string,nickname string,cardnum string
)
STORED BY 'cn.net.ycloud.ydb.handle.YdbStorageHandler'
TBLPROPERTIES (
"ydb.handler.hostport"="101.200.130.48:8080",
"ydb.handler.sql.key"="ydb.sql.ydbhive_example",
"ydb.handler.sql"=" select tradetime,tradenum,tradeid,nickname,cardnum from ydb_example_trade where ydbpartion='20151011' and ydbkv='export.joinchar:%01' and ydbkv='export.max.return.docset.size:100' limit 0,10"
);
Mapping example 2:
CREATE EXTERNAL TABLE ydbhive_example_bigdata (
phonenum string, ydb_sex string, ydb_province string, ydb_grade string, ydb_age string
)
STORED BY 'cn.net.ycloud.ydb.handle.YdbStorageHandler'
TBLPROPERTIES (
"ydb.handler.hostport"="101.200.130.48:8080",
"ydb.handler.sql.key"="ydb.sql.ydbhive_example_bigdata",
"ydb.handler.sql"=" select phonenum,ydb_sex,ydb_province,ydb_grade,ydb_age from ydb_example_ads where ydbpartion='20151111' and (ydb_grade='博士') and ydb_sex='女' and ydb_province='北京' and ydbkv='export.joinchar:%01' and ydbkv='export.max.return.docset.size:1000000' and ydbkv='max.return.docset.size:100000000' limit 0,10"
);
Mapping example 3:
To save I/O, some queries can be pre-aggregated on the YDB side, which reduces the load on Hive and Spark:
CREATE EXTERNAL TABLE ydbhive_example_groupby (province string, bank string, amt double, cnt double)
STORED BY 'cn.net.ycloud.ydb.handle.YdbStorageHandler'
TBLPROPERTIES (
"ydb.handler.hostport"="101.200.130.48:8080",
"ydb.handler.sql.key"="ydb.sql.ydbhive_example_groupby",
"ydb.handler.sql"=" select province,bank,sum(amt),count(*) from ydb_example_trade where ydbpartion='20151011' and ydbkv='export.joinchar:%01' and ydbkv='export.max.return.docset.size:100' group by province,bank limit 0,10"
);
Query example 1:
select * from ydbhive_example limit 10;
select count(*) from ydbhive_example limit 10;
select tradeid,count(*) from ydbhive_example group by tradeid limit 10;
Query example 2:
select ydb_sex,ydb_province,ydb_grade,ydb_age,count(*) as cnt from ydbhive_example_bigdata group by ydb_sex,ydb_province,ydb_grade,ydb_age order by cnt desc limit 100;
select count(*) from ydbhive_example_bigdata limit 10;
Query example 3:
Run further queries in Hive over YDB's pre-aggregated results:
select * from ydbhive_example_groupby limit 10;
select province,bank,sum(amt),sum(cnt) as cnt from ydbhive_example_groupby group by province,bank order by cnt desc limit 100;
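Once the mappings above are in place, Hive itself supplies the complex-query features mentioned earlier (multi-table joins, medians, nested SQL). A minimal sketch against the mapped tables above — the join key and filter values are illustrative only, and percentile_approx is Hive's built-in approximate-percentile UDAF:

```sql
-- Multi-table join across two mapped YDB tables (join condition is made up):
select t.tradeid, t.cardnum, a.ydb_province
from ydbhive_example t
join ydbhive_example_bigdata a on t.cardnum = a.phonenum
limit 10;

-- Approximate median of amt per province, computed in Hive:
select province, percentile_approx(amt, 0.5) as median_amt
from ydbhive_example_groupby
group by province
limit 100;

-- Nested SQL: filter on an aggregate over a subquery of the mapped table:
select tradeid, cnt
from (select tradeid, count(*) as cnt from ydbhive_example group by tradeid) t
where cnt > 1
limit 10;
```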
The mapped SQL can also be changed at query time by setting a Hive property; the property name is the value of ydb.handler.sql.key configured when the table was created.
Example 1:
set ydb.sql.ydbhive_example_bigdata=" select phonenum,ydb_sex,ydb_province,ydb_grade,ydb_age from ydb_example_ads where ydbpartion='20151111' and ydb_province='辽宁' and ydbkv='export.joinchar:%01' and ydbkv='export.max.return.docset.size:1000000' and ydbkv='max.return.docset.size:100000000' limit 0,10";
select count(*) from ydbhive_example_bigdata limit 10;
Example 2:
set ydb.sql.ydbhive_example_groupby=" select province,bank,sum(amt),count(*) from ydb_example_trade where ydbpartion='20151011' and province='辽宁省' and ydbkv='export.joinchar:%01' and ydbkv='export.max.return.docset.size:1000000' and ydbkv='max.return.docset.size:100000000' group by province,bank limit 0,10";
select province,bank,sum(amt),sum(cnt) as cnt from ydbhive_example_groupby group by province,bank order by cnt desc limit 100;
Example 3:
set ydb.sql.ydbhive_example_bigdata=" select higoempty_ex1_s,ydb_sex, higoempty_ex2_s, higoempty_ex3_s, higoempty_ex4_s from ydb_example_ads where ydbpartion='20151111' and ydb_province='北京' and ydbkv='export.joinchar:%01' and ydbkv='export.max.return.docset.size:1000000' and ydbkv='max.return.docset.size:100000000' limit 0,10";
select ydb_sex, count(*) as cnt from ydbhive_example_bigdata group by ydb_sex order by cnt desc limit 100;
select * from ydbhive_example_bigdata limit 10;
Using YDB from Spark is almost identical to using it from Hive.
However, because Spark does not support the add jar statement, remember to configure SPARK_CLASSPATH instead.
For example:
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/data/ycloud/ycloud/ydb/lib/ydb-1.0.5-pg.jar
Programming example:
// Required imports (plus the YDB handler classes shipped in the ydb-*-pg.jar):
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;

String master = "101.200.130.48:8080";
String exportSql = " select tradetime,tradenum,tradeid,nickname,cardnum from ydb_example_trade where ydbpartion='20151011' and ydbkv='export.joinchar:%09' and ydbkv='export.max.return.docset.size:30' limit 0,10 ";

HiveYdbTableInputFormat format = new HiveYdbTableInputFormat();
YdbInputSplit[] splits = format.getSplits(master, exportSql, "");
System.out.println(Arrays.toString(splits));

// Shuffle the splits so they are consumed in random order. (Keying a HashMap
// with Math.random(), as before, can silently drop splits on key collisions.)
List<YdbInputSplit> shuffled = new ArrayList<YdbInputSplit>(Arrays.asList(splits));
Collections.shuffle(shuffled);

// The export could also be parallelized here with multiple threads.
for (YdbInputSplit split : shuffled) {
    System.out.println("#######################");
    System.out.println(split.toString());
    YdbRecordReader reader = new YdbRecordReader(split);
    LongWritable key = new LongWritable();
    BytesWritable value = new BytesWritable();
    while (reader.next(key, value)) {
        System.out.println(reader.getProgress() + "\t" + reader.getPos() + "\t"
                + reader.getTotal() + "\t"
                + new String(value.getBytes(), 0, value.getLength(), "utf-8"));
    }
    reader.close();
}
Note: exportSql must be a plain SELECT statement; aggregate/statistical SQL is not supported here.
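The ydbkv='export.joinchar:%09' option above appears to set the separator used to join a record's fields for export, with the value URL-encoded: %09 decodes to a tab, and %01 (used in the Hive examples) to the \u0001 byte. Assuming that encoding, a pure-JDK sketch of decoding the separator and splitting one exported record — the record content and class name are made up for illustration:

```java
import java.net.URLDecoder;
import java.util.Arrays;
import java.util.regex.Pattern;

public class JoinCharDemo {
    public static void main(String[] args) throws Exception {
        // The joinchar value is URL-encoded: %09 -> "\t", %01 -> "\u0001".
        String sep = URLDecoder.decode("%09", "UTF-8");

        // A made-up exported record: five fields joined by the separator.
        String record = String.join(sep,
                "20151011", "3", "TRADE-001", "alice", "6222020200001");

        // Split on the literal separator character to recover the fields.
        String[] fields = record.split(Pattern.quote(sep));
        System.out.println(Arrays.toString(fields));
        // prints [20151011, 3, TRADE-001, alice, 6222020200001]
    }
}
```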
Usage example:
String master = "101.200.130.48:8080";
String exportSql = " select tradetime,tradenum,tradeid,nickname,cardnum from ydb_example_trade where ydbpartion='20151011' and ydbkv='export.joinchar:%09' and ydbkv='export.max.return.docset.size:30' limit 0,10 ";
job.setInputFormatClass(HiveYdbTableInputFormat.class);
HiveYdbTableInputFormat.setYdb(job.getConfiguration(), master, exportSql);
Note: exportSql must be a plain SELECT statement; aggregate/statistical SQL is not supported here.