Programming Hive: Reading Notes


    We said that Hive really has no control over the integrity of the files used for storage and whether or not their contents are consistent with the table schema. Even managed tables don’t give us this control. Still, a general principle of good software design is to express intent. If the data is shared between tools, then creating an external table makes this ownership explicit.

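        Note: Hive is not integrated with the underlying HDFS permission controls, and the HDFS control interfaces themselves fall short. What is missing includes: whether a directory may contain subdirectories; permission to create subdirectories; directory-level locks; whether a directory still accepts writes after a certain time (for data sources); the times of the most recent writes; whether new files may still be added some time after a directory was written; and what action should run after new files are added. Part of this should be handled by HCatalog.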

    
    For very large data sets, partitioning can dramatically improve query performance, but only if the partitioning scheme reflects common range filtering (e.g., by locations, timestamp ranges). A highly suggested safety measure is putting Hive into “strict” mode, which prohibits queries of partitioned tables without a WHERE clause that filters on partitions.

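        Note: One difference between partitioning and PPD (predicate pushdown): a partition column has to express a "common" filter condition, and a single one. PPD has no such restriction; columns such as order number or customer type can benefit from PPD just as well.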
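
    A minimal sketch of what strict mode does, assuming a hypothetical table logs that is partitioned by a string column dt (neither name is from the book). hive.mapred.mode is the setting the book refers to; later Hive releases split it into finer-grained hive.strict.checks.* properties, so treat the exact property name as version-dependent.

```sql
-- Hypothetical table: logs, partitioned by dt (STRING), with a column level.
SET hive.mapred.mode=strict;

-- Rejected in strict mode, because nothing filters on the partition column dt:
-- SELECT * FROM logs WHERE level = 'ERROR';

-- Accepted: the WHERE clause limits which partitions are scanned.
SELECT * FROM logs
WHERE dt = '2012-01-01' AND level = 'ERROR';
```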

    
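    For deduplication over large data volumes, select count(*) from (select sid from tb1 group by sid) t is more efficient than select count(distinct sid) from tb1, because the GROUP BY can be spread across many reducers while a global count(distinct) funnels every value through a single reducer.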

    
    select name, `price.*` from stocks;  (a backquoted regular expression selects columns)

    Note: Hive supports regular expressions for column names.
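
    A short sketch of the regex column selection above. The stocks table and its price_* columns follow the book's example data set. One assumption worth labeling: from Hive 0.13 on, backquoted names are treated as literal identifiers by default, so the regex form only works after switching hive.support.quoted.identifiers to none.

```sql
-- Needed on Hive 0.13 and later; older releases accept the regex form as-is.
SET hive.support.quoted.identifiers=none;

-- Selects symbol plus every column whose name matches the Java regex price.*
-- (for example price_open, price_high, price_low, price_close).
SELECT symbol, `price.*`
FROM stocks;
```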

    

    data type casting

    
    Note the functions floor, round, and ceil (“ceiling”) for converting DOUBLE to BIGINT, which is floating-point numbers to integer numbers. These functions are the preferred technique, rather than using the cast operator we mentioned above.

    There is a bug when trying to use count(DISTINCT col) when col is a partition column. The answer should be 743 for NASDAQ and NYSE, at least as of early 2010 in the infochimps.org data set we used.

    Note: Prefer round, floor, and ceil over cast for type conversion.
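
    To make the difference concrete, here is a small sketch against the book's employees table (salary is a floating-point column). The point is that the three functions state the rounding rule explicitly, whereas a cast to an integer type simply drops the fractional part. The column aliases are illustrative only.

```sql
SELECT salary,
       round(salary)           AS nearest,    -- rounds to the nearest whole number
       floor(salary)           AS rounded_dn, -- largest whole number <= salary
       ceil(salary)            AS rounded_up, -- smallest whole number >= salary
       cast(salary AS BIGINT)  AS truncated   -- fractional part is discarded
FROM employees;
```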

    
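    CASE WHEN ... WHEN ... ELSE expresses a custom mapping.
    GROUP BY with collect_set() expresses a grouped custom mapping. Note: If you often run this kind of SQL against the same table, consider creating a new table that contains the derived field.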
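
    A sketch of that advice, using the book's employees table. The table name employees_bracketed, the bracket labels, and the salary thresholds are assumptions made up for illustration: the CASE mapping is materialized once, and later queries group on the stored column instead of repeating the expression.

```sql
-- Materialize the custom mapping once (hypothetical table name and thresholds).
CREATE TABLE employees_bracketed AS
SELECT name,
       salary,
       CASE
         WHEN salary < 50000.0  THEN 'low'
         WHEN salary < 70000.0  THEN 'middle'
         WHEN salary < 100000.0 THEN 'high'
         ELSE 'very high'
       END AS bracket
FROM employees;

-- Grouped mapping: the set of distinct names that fall into each bracket.
SELECT bracket, collect_set(name) AS names
FROM employees_bracketed
GROUP BY bracket;
```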

    
    Trust us, you want to add set hive.exec.mode.local.auto=true; to your $HOME/.hiverc file.

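    Note: The client machine must have the corresponding compression libraries installed, such as LZO or Snappy; otherwise the query fails with an error.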

    
    (p. 93) The following variation eliminates the duplication, using a column alias, but unfortunately it’s not valid:

    hive> SELECT name, salary, deductions["Federal Taxes"],
        > salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
        > FROM employees
        > WHERE round(salary_minus_fed_taxes) > 70000;
    FAILED: Error in semantic analysis: Line 4:13 Invalid table alias or
    column reference 'salary_minus_fed_taxes': (possible column names are:
    name, salary, subordinates, deductions, address)

    A nested SELECT statement works instead:

    hive> SELECT e.* FROM
        > (SELECT name, salary, deductions["Federal Taxes"] as ded,
        > salary * (1 - deductions["Federal Taxes"]) as salary_minus_fed_taxes
        > FROM employees) e
        > WHERE round(e.salary_minus_fed_taxes) > 70000;

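    Note: When Hive evaluates WHERE, it has not yet resolved the aliases defined in the SELECT projection, so an extra level of nesting is required. The correct execution plan gains one more MR step, but there is no way around it. Likewise, a HAVING clause also needs an extra MR layer.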

    
    Use extreme caution when comparing floating-point numbers. Avoid all implicit casts from smaller to wider types.
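
    The classic case behind this warning, sketched with the book's employees table, where deductions is a MAP<STRING, FLOAT>. The literal 0.2 is a DOUBLE, so the stored FLOAT is widened before the comparison, and a stored 0.2 can end up slightly greater than the literal. Casting the literal to the column's type keeps both sides in the same representation.

```sql
-- Implicit FLOAT-to-DOUBLE widening: rows whose deduction is exactly 0.2
-- may unexpectedly satisfy the predicate.
SELECT name, salary, deductions['Federal Taxes']
FROM employees
WHERE deductions['Federal Taxes'] > 0.2;

-- Compare FLOAT with FLOAT instead.
SELECT name, salary, deductions['Federal Taxes']
FROM employees
WHERE deductions['Federal Taxes'] > cast(0.2 AS FLOAT);
```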

    SELECT name, address.street
    > FROM employees WHERE address.street RLIKE '.*(Chicago|Ontario).*';

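    Note: The pattern after RLIKE is a standard Java regular expression. With LIKE, "_" and "%" play the roles that "." and ".*" play in a regular expression. The same applies to SHOW TABLES and partitions.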

    
    WHERE clauses are evaluated after joins are performed, so WHERE clauses should use predicates that only filter on column values that won’t be NULL. Also, contrary to Hive documentation, partition filters don’t work in ON clauses for OUTER JOINS, although they do work for INNER JOINS!

    The workaround is to add one more level of nesting:

    SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM
    > (SELECT * FROM stocks WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') s
    > LEFT OUTER JOIN
    > (SELECT * FROM dividends WHERE symbol = 'AAPL' AND exchange = 'NASDAQ') d
    > ON s.ymd = d.ymd;

    Note: When an outer join needs to filter on partitions, wrap each side in a nested query that applies the filter.

    
    Hive contains an implementation of bitmap indexes since v0.8.0. The main use case for bitmap indexes is when there are comparatively few values for a given column.

    Note: The difference in storage size between a compact index and a bitmap index is quite large.
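
    For reference, a sketch of how such an index is declared. The table orders, its low-cardinality column status, and the index name are hypothetical. This syntax applies to the Hive versions the book covers; index support was removed in Hive 3.0, so it will not work there.

```sql
-- Hypothetical table orders with a column status that has only a few values.
CREATE INDEX orders_status_idx
ON TABLE orders (status)
AS 'BITMAP'
WITH DEFERRED REBUILD;

-- With DEFERRED REBUILD the index starts out empty and is built explicitly.
ALTER INDEX orders_status_idx ON orders REBUILD;
```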

    
    Making Multiple Passes over the Same Data (p. 123)

    FROM history
    > INSERT OVERWRITE TABLE sales SELECT * WHERE action='purchased'
    > INSERT OVERWRITE TABLE credits SELECT * WHERE action='returned';

    Note: Multi-insert reduces the number of scans over the source table; it is a common Hive optimization in ETL.

    
    Relational databases typically use unique keys, indexes, and normalization to store data sets that fit into memory or mostly into memory. Hive, however, does not have the concept of primary keys or automatic, sequence-based key generation. Joins should be avoided in favor of denormalized data, when feasible. The complex types, Array, Map, and Struct, help by allowing the storage of one-to-many data inside a single row. This is not to say normalization should never be utilized, but star-schema type designs are nonoptimal. The primary reason to avoid normalization is to minimize disk seeks, such as those typically required to navigate foreign key relations. Denormalizing data permits it to be scanned from or written to large, contiguous sections of disk drives, which optimizes I/O performance. However, you pay the penalty of denormalization, data duplication and the greater risk of inconsistent data.

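    Note: In my personal view, the star schema only suits shared-disk data warehouses; it fits neither Hadoop nor shared-nothing MPP systems. Whether tables should have primary keys at all is also contentious: many MPP systems definitely have no notion of a primary key during data loading (for speed reasons). In addition, columnar compression now supports denormalized designs better than it used to.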

    
    bin/hive -hiveconf hive.root.logger=DEBUG,console

    
    InputFormat reads key-value pairs from files; Hive currently ignores the key and works only with the data found in the value by default. The reason for this is that the key, which comes from TextInputFormat, is a long integer that represents the byte offset in the block (which is not user data).

    Note: I used to think Hive had no key at all (NullWritable), since that would avoid sorting. It turns out the key in TextInputFormat is the position within the file.

    
    Hive Avro SerDe: It also allows for default values to be returned if the column is not defined in the data file.

    {"name":"extra_field",
    "type":"string",
    "doc":"an extra field not in the original file",
    "default":"fishfingers and custard" }

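    Note: Avro has good support for adding columns and for column default values. This helps a lot of downstream processing, especially when multiple schema versions coexist. Fields that do not conform are converted to 0 or NULL by default, which also avoids parse errors in later MapReduce stages.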

    
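    Many of the case studies in the final chapter, Chapter 16, are very practical; after all, they come from front-line experience. M6D describes their rank UDF, and also the multi-cluster approach they use to handle HDFS snapshots and a mixed workload of hourly and daily jobs. It may not look that impressive, but only people who really understand the problem will appreciate their approach and capability. (I can only marvel that first-rate people fighting alongside first-rate people become even more effective.)

    Outbrain's routine analyses of website access logs are also practical. However, I personally would not recommend the way they implement the session concept in Hive. Sessionization is a standard time-window-function analysis, and Pig already has implementations of it, for example the Sessionize documentation in LinkedIn's DataFu:

    http://linkedin.github.com/datafu/docs/javadoc/datafu/pig/sessions/Sessionize.html

    There is also an implementation among the Hive windowing functions: https://github.com/hbutani/SQLWindowing

    The reason I don't recommend the book's approach is its low computational efficiency. A window function only has to sort the data once during the computation and then produces the result. The book's implementation instead blows up the result set N-fold with a self join, filters it, joins again, and ends up with only a small amount of data. This pattern does not suit session computation over large data volumes. In addition, step 4 in the book is missing a FROM, and when I ran it, Hive's default self join failed with: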

    java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{"_col0":18398210,"_col1":"7

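    This kind of error usually means that some records are malformed, yet after I turned the table into two separate temporary tables the join succeeded. In my own test, the result set had already grown more than 80-fold in the first step, and the later join and filter threw most of it away. So this computation pattern is not very efficient, and I still recommend the two purpose-built session implementations above.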
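
    To illustrate the window-function alternative argued for above, here is a minimal sketch in HiveQL. It is not the book's method and not the DataFu or SQLWindowing code. It assumes a Hive version with built-in windowing (0.11 or later), a hypothetical table clicks(user_id STRING, ts BIGINT) with ts in epoch seconds, and a 30-minute inactivity threshold. Each user's rows are ordered by time and compared with the previous row, so no self join is needed.

```sql
-- Hypothetical input: clicks(user_id STRING, ts BIGINT), ts in epoch seconds.
-- Assumption: a gap of more than 1800 seconds starts a new session.
SELECT user_id,
       ts,
       -- Running count of session starts gives a per-user session number.
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_seq
FROM (
  SELECT user_id,
         ts,
         -- A user's first row has no predecessor, so it starts a session.
         CASE WHEN prev_ts IS NULL OR ts - prev_ts > 1800 THEN 1 ELSE 0 END
           AS is_new_session
  FROM (
    SELECT user_id,
           ts,
           -- Timestamp of the same user's previous event, NULL for the first.
           LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) AS prev_ts
    FROM clicks
  ) with_prev
) marked;
```

    The intermediate data stays the same size as the input at every level, in contrast to the 80-fold growth observed with the self-join approach.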
