Hive

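Introduction to Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like querying.
In essence, it translates SQL into MapReduce programs.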

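Hive configuration files

hive-site.xml #Hive's main configuration file
hive-env.sh   #Hive's runtime environment file
hive-default.xml.template #template listing the default settings
hive-env.sh.template #template for hive-env.sh
hive-exec-log4j.properties.template #template for the execution log settings
hive-log4j.properties.template #template for the Hive log settings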

hive-site.xml

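<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>test</value>
  <description>password to use against metastore database</description>
</property>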
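
To check which metastore settings a file actually contains, the properties above can be read back with a few lines of Python. This is only a sketch: the helper name read_hive_site and the embedded SAMPLE are made up for illustration, and it assumes the file wraps its property entries in a configuration root element, as a complete hive-site.xml does.

```python
import xml.etree.ElementTree as ET

# Assumed sample: two of the metastore properties, wrapped in the
# <configuration> root element that a complete hive-site.xml uses.
SAMPLE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
</configuration>"""


def read_hive_site(xml_text):
    """Return {name: value} for every <property> in a hive-site.xml string."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing or empty <value> is stored as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


if __name__ == "__main__":
    for name, value in read_hive_site(SAMPLE).items():
        print(f"{name} = {value}")
```

For a file on disk, pass its contents in, for example read_hive_site(open(path).read()), where path is wherever your hive-site.xml lives.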

hive-env.sh

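# Path to Hive's configuration directory (replace with your own path)
export HIVE_CONF_DIR=/path/to/hive/conf
# Hadoop installation path (replace with your own Hadoop home)
HADOOP_HOME=/path/to/hadoop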

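Relational operators

Common relational operators
Equality: =
Inequality: <>
Less than: <
Less than or equal to: <=
Greater than: >
Greater than or equal to: >=
Null check: IS NULL
Not-null check: IS NOT NULL
LIKE comparison: LIKE
Description: Returns NULL if string A or string B is NULL; TRUE if string A matches the simple SQL pattern B; FALSE otherwise. In B, the character "_" matches any single character and the character "%" matches any number of characters.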

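(dual is assumed here to be a user-created single-row helper table; Hive does not provide one by default.)

hive> select 1 from dual where 'football' like 'foot%';
1
hive> select 1 from dual where 'football' like 'foot____';
1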
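
The LIKE rules described above can be modelled outside Hive by translating the pattern into a regular expression. The Python sketch below is an illustration only, not Hive's implementation: the function names are made up, and escape characters inside the pattern are deliberately not handled.

```python
import re


def like_to_regex(pattern):
    """Translate a simple SQL LIKE pattern into a Python regex string."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")            # any number of characters
        elif ch == "_":
            parts.append(".")             # exactly one character
        else:
            parts.append(re.escape(ch))   # everything else is literal
    return "".join(parts)


def sql_like(a, b):
    """Mimic A LIKE B: None if either side is None, else True or False."""
    if a is None or b is None:
        return None
    # LIKE must match the whole string, hence fullmatch.
    return re.fullmatch(like_to_regex(b), a, flags=re.DOTALL) is not None


if __name__ == "__main__":
    print(sql_like("football", "foot%"))
    print(sql_like("football", "foot____"))
    print(sql_like("key", "foot%"))
```

Note that the whole string has to match: 'football' satisfies 'foot____' only because the four underscores account for exactly the four remaining characters.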

Java regex LIKE: RLIKE
Description: Returns NULL if string A or string B is NULL; TRUE if string A matches the Java regular expression B; FALSE otherwise.

hive> select 1 from dual where 'footbar' rlike '^f.*r$';
1

REGEXP operation: REGEXP
Description: Same behavior as RLIKE.

hive> select 1 from dual where 'footbar' REGEXP '^f.*r$';
1
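
Unlike LIKE, RLIKE and REGEXP take a regular expression, and to the best of my knowledge the match succeeds when the pattern is found anywhere in the string, which is why the examples anchor it with ^ and $. The sketch below models that with Python's re module standing in for java.util.regex; the two dialects differ in some details, and the function name rlike is made up for illustration.

```python
import re


def rlike(a, b):
    """Mimic A RLIKE B: None if either side is None, else True or False."""
    if a is None or b is None:
        return None
    # search() looks for the pattern anywhere in the string (unanchored).
    return re.search(b, a) is not None


if __name__ == "__main__":
    print(rlike("footbar", "^f.*r$"))   # anchored: whole string must fit
    print(rlike("footbar", "oot"))      # unanchored: a substring is enough
    print(rlike("key", "^f.*r$"))
```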


Complex type constructors

Map construction: map
Map element access: M[key]
Syntax: map(key1, value1, key2, value2, …)
Description: Builds a map from the given key/value pairs.

hive> create table alex_test as select map('100','tom','200','mary') as t from dual;
hive> describe alex_test;
t       map<string,string>
hive> select t from alex_test;
{"100":"tom","200":"mary"}

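Struct construction: struct
Struct field access: S.x
Syntax: struct(val1, val2, val3, …)
Description: Builds a struct from the given arguments; the fields are named col1, col2, and so on.

hive> create table alex_test as select struct('tom','mary','tim') as t from dual;
hive> describe alex_test;
t       struct<col1:string,col2:string,col3:string>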
hive> select t from alex_test;
{"col1":"tom","col2":"mary","col3":"tim"}

Array construction: array
Array element access: A[n]
Syntax: array(val1, val2, …)
Description: Builds an array from the given arguments.

hive> create table alex_test as select array("tom","mary","tim") as t from dual;
hive> describe alex_test;
t       array<string>
hive> select t from alex_test;
["tom","mary","tim"]

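Hive parameters

hive.exec.max.created.files
#Description: Maximum total number of files that all map and reduce tasks of a Hive job may create
#Default: 100000
hive.exec.dynamic.partition
#Description: Whether dynamic partitioning is enabled
#Default: false
hive.mapred.reduce.tasks.speculative.execution
#Description: Whether speculative execution is enabled
#Default: true
hive.input.format
#Description: Hive's default input format
#Default: org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
#If this causes problems, use org.apache.hadoop.hive.ql.io.HiveInputFormat instead
hive.exec.counters.pull.interval
#Description: How often Hive polls the JobTracker for counter information
#Default: 1000 ms
hive.script.recordreader
#Description: Default record reader class used with scripts
#Default: org.apache.hadoop.hive.ql.exec.TextRecordReader
hive.script.recordwriter
#Description: Default record writer class used with scripts
#Default: org.apache.hadoop.hive.ql.exec.TextRecordWriter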
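hive.mapjoin.check.memory.rows
#Description: Number of rows processed between memory-usage checks
#Default: 100000
hive.mapjoin.smalltable.filesize
#Description: Size threshold for the small table's input files; if the size is below this value, Hive tries to convert the common join into a map join
#Default: 25000000
hive.auto.convert.join
#Description: Whether to convert a common join into a map join based on the input file size
#Default: false
hive.mapjoin.followby.gby.localtask.max.memory.usage
#Description: Fraction of memory the local task may use to hold data when a map join is followed by a group by; if the data needs more than this, the local task aborts
#Default: 0.55
hive.mapjoin.localtask.max.memory.usage
#Description: Fraction of memory the local task may use
#Default: 0.90
hive.heartbeat.interval
#Description: Interval at which heartbeats are sent during map join and filter operations
#Default: 1000
hive.merge.size.per.task
#Description: Size of the merged files
#Default: 256000000
hive.mergejob.maponly
#Description: Whether to merge output with a map-only job
#Default: true
hive.merge.mapredfiles
#Description: Whether to merge small files at the end of a map-reduce job
#Default: false
hive.merge.mapfiles
#Description: Whether to merge small files at the end of a map-only job
#Default: true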
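hive.hwi.listen.host
#Description: Host the Hive web interface listens on
#Default: 0.0.0.0
hive.hwi.listen.port
#Description: Port the web interface listens on
#Default: 9999
hive.exec.parallel.thread.number
#Description: Number of threads Hive may use to run jobs in parallel
#Default: 8
hive.exec.parallel
#Description: Whether to submit jobs in parallel
#Default: false
hive.exec.compress.output
#Description: Whether to compress output
#Default: false
hive.mapred.mode
#Description: Mode that restricts which operations may run; in nonstrict mode no restrictions apply
#Default: nonstrict
hive.join.cache.size
#Description: Number of rows that may be cached in memory during a join
#Default: 25000
hive.mapjoin.cache.numrows
#Description: Number of rows a map join caches in memory
#Default: 25000
hive.join.emit.interval
#Description: Number of rows Hive buffers before emitting join results
#Default: 1000
hive.optimize.groupby
#Description: Whether to take advantage of bucketed tables when grouping
#Default: true
hive.fileformat.check
#Description: Whether to check the input file format
#Default: true
hive.metastore.client.connect.retry.delay
#Description: Delay between retries when a client connection fails
#Default: 1 second
hive.metastore.client.socket.timeout
#Description: Client socket timeout
#Default: 20 seconds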
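mapred.reduce.tasks
#Default: -1
#Description: Default number of reduce tasks per job
 -1 means Hive sets the number of reducers automatically based on the job
hive.exec.reducers.bytes.per.reducer
#Default: 1000000000 (about 1 GB)
#Description: Amount of data each reducer receives
    For example, if 10 GB of data goes to the reducers, 10 reduce tasks are created
hive.exec.reducers.max
#Default: 999
#Description: Maximum number of reducers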
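hive.metastore.warehouse.dir
#Default: /user/hive/warehouse
#Description: Default storage location for databases
hive.default.fileformat
#Default: TextFile
#Description: Default file format
hive.map.aggr
#Default: true
#Description: Map-side aggregation, comparable to a combiner
hive.exec.max.dynamic.partitions.pernode
#Default: 100
#Description: Maximum number of partitions each task node may create
hive.exec.max.dynamic.partitions
#Default: 1000
#Description: Maximum total number of partitions that may be created
hive.metastore.server.max.threads
#Default: 100000
#Description: Maximum number of worker threads in the metastore
hive.metastore.server.min.threads
#Default: 200
#Description: Minimum number of worker threads in the metastore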
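
The three reducer settings listed above (mapred.reduce.tasks, hive.exec.reducers.bytes.per.reducer, hive.exec.reducers.max) interact roughly as in the Python sketch below. It is a simplified model under stated assumptions, not Hive's planner code: the function name estimate_reducers is made up, the defaults are the values quoted above, and the real estimate depends on the Hive version and on the input size of each stage.

```python
def estimate_reducers(input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999,
                      reduce_tasks=-1):
    """Rough model of how many reducers a job gets.

    reduce_tasks mirrors mapred.reduce.tasks: a positive value is used
    as-is, while -1 lets the size-based estimate decide.
    """
    if reduce_tasks > 0:
        return reduce_tasks
    # Integer ceiling division: one reducer per bytes_per_reducer of input.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    # Never fewer than one reducer, never more than the configured cap.
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    ten_gb = 10 * 1_000_000_000
    print(estimate_reducers(ten_gb))                  # size-based estimate
    print(estimate_reducers(ten_gb, reduce_tasks=4))  # explicit override
```

With the defaults, 10 GB of reducer input gives 10 reducers, matching the example in the description of hive.exec.reducers.bytes.per.reducer, and very large inputs are capped at 999.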
