HiveQL数据抽样Sample查询

当数据量很大时,需要查找一个数据的子集用于加快数据的分析,这种技术就是抽样技术。Hive中,数据抽样分为以下三种:
随机抽样;
桶表抽样;
块抽样;

1 随机抽样
语法结构
使用Rand()和LIMIT关键字得到抽样数据,Distribute和Sort关键字确保数据在mappers和reducers之间高效的随机分布,也可以使用order by rand()实现,但是性能不好。

语法:
    SELECT * FROM DISTRIBUTE BY RAND() SORT BY RAND()
    LIMIT ;

示例

hive> select * from ana distribute by rand() sort by rand() limit 3;
    +-----------+-------------+-------------+
    | ana.name  | ana.depart  | ana.salary  |
    +-----------+-------------+-------------+
    | Mike      | 1001        | 6400        |
    | Will      | 1000        | 4000        |
    | Richard   | 1002        | 8000        |
    +-----------+-------------+-------------+
    3 rows selected (123.999 seconds)

2 桶表抽样
语法结构
桶表抽样是桶表已优化的特殊的抽样方法,colname指定列在哪里取样数据,当在整个行取样时,可以使用Rand()函数,如果抽样列是Clustered By列,Tablesample语句会更高效。

语法:
    SELECT * FROM
    TABLESAMPLE(BUCKET OUT OF     number of buckets> ON [colname|RAND()]) table_alias;

示例,略

3 块抽样
语法结构
块抽样允许Hive随机从数据中挑选N行、数据量的百分比或者数据的N字节大小。该种抽样的粒度是HDFS的块大小。
语法:

    SELECT *
    FROM TABLESAMPLE(N PERCENT|ByteLengthLiteral|N ROWS) s;
     
    -- ByteLengthLiteral
    -- (Digit)+ ('b' | 'B' | 'k' | 'K' | 'm' | 'M' | 'g' | 'G')

示例
示例一:按行抽样

 hive> select name from ana  tablesample(4 rows) a;
    +----------+
    |   name   |
    +----------+
    | Lucy     |
    | Michael  |
    | Steven   |
    | Will     |
    +----------+
    4 rows selected (0.29 seconds)

示例二:按数据大小的百分比抽样

 hive> select name from ana  tablesample(10 percent) a;
    +----------+
    |   name   |
    +----------+
    | Lucy     |
    | Michael  |
    +----------+
    2 rows selected (0.345 seconds)

示例三:按数据大小抽样

 hive> select name from ana  tablesample(2M) a;
    +----------+
    |   name   |
    +----------+
    | Lucy     |
    | Michael  |
    | Steven   |
    | Will     |
    | Will     |
    | Jess     |
    | Lily     |
    | Mike     |
    | Richard  |
    | Wei      |
    | Yun      |
    +----------+
    11 rows selected (0.264 seconds)


转自:https://blog.csdn.net/alen_liu_sz/article/details/80988553

你可能感兴趣的:(#,Hive,Sql)