Example:
INSERT OVERWRITE TABLE prices_collected_${hiveconf:wid_version}
select
pc.collect_id as product_id ,
regexp_extract(pc.price,'(\\d*\\.?\\d+)',1) as price ,
pc.region,
'' as location_area_code,
'' as city_code,
from_unixtime(unix_timestamp() , 'yyyy-MM-dd hh:mm:ss') as created_at,
from_unixtime(unix_timestamp() , 'yyyy-MM-dd hh:mm:ss') as updated_at
from products_compared_${hiveconf:wid_version} as pc
1. Set the table name dynamically from a parameter passed on the hive command line: prices_collected_${hiveconf:wid_version}
hive -hiveconf wid_version='4'
The value is then picked up via ${hiveconf:wid_version}, producing the table prices_collected_4.
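Putting the two together, a full run might look like this (the script file name is hypothetical):
hive -hiveconf wid_version='4' -f /home/hive/create_prices_collected.sql
Note that INSERT OVERWRITE TABLE requires the target table prices_collected_4 to exist already.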
2. Use a regular expression to extract the piece of information you need, e.g. the number inside a string:
regexp_extract(pc.price,'(\\d*\\.?\\d+)',1) as price
Note that in Hive the backslashes in a regex must be doubled, because the string literal is unescaped once before it reaches the regex engine.
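A quick way to sanity-check the pattern from the hive shell (input string illustrative):
select regexp_extract('USD 12.50', '(\\d*\\.?\\d+)', 1);
-- returns 12.50; the third argument selects the first capture group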
3. Get the current system time
from_unixtime(unix_timestamp() , 'yyyy-MM-dd hh:mm:ss') as created_at
Use from_unixtime(unix_timestamp(), 'yyyy-MM-dd hh:mm:ss') to obtain the current system time; the format string can be adjusted as needed.
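One caveat worth knowing: the pattern letters come from Java's SimpleDateFormat, where hh is the 12-hour clock; for unambiguous 24-hour timestamps use HH:
select from_unixtime(unix_timestamp(), 'yyyy-MM-dd HH:mm:ss');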
4. Joining several tables at once may fail with an error
Work around it with: set hive.auto.convert.join=false;
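hive.auto.convert.join makes Hive rewrite joins as map-side joins when it guesses one input is small enough to fit in memory; when the guess is wrong the job can die with out-of-memory errors, and disabling it forces ordinary reduce-side joins. A minimal sketch, with hypothetical table names:
set hive.auto.convert.join=false;
select o.id, u.name, p.price
from orders o
join users u on o.user_id = u.id
join products p on o.product_id = p.id;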
5. Create a table
create table if not exists brands (
name string,
created_at string,
updated_at string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
STORED AS TEXTFILE;
The table is stored as plain text, with "\\" as the escape character and "\t" as the field delimiter.
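Once the table exists, a tab-delimited local file can be loaded into it in the usual way (the file path is hypothetical):
LOAD DATA LOCAL INPATH '/home/hive/hive_data/brands.tsv' OVERWRITE INTO TABLE brands;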
6. Export the data of a Hive table to the local filesystem with a hive command like the following:
hive \
-hiveconf local_path=/home/hive/hive_data/products_24_1 \
-hiveconf hive_table=products_24_1 \
-hiveconf columnstr=' name , created_at, updated_at, "released" as status ' \
-f /home/hive/export_hive_table_to_local.sql
The parameters are, in order:
1. local_path: the local directory to export to
2. hive_table: which Hive table to export
3. columnstr: which columns of products_24_1 to export
4. With those parameters bound, -f points at the SQL file to execute, which writes the products_24_1 data to the local path
The contents of /home/hive/export_hive_table_to_local.sql are as follows:
insert overwrite local directory '${hiveconf:local_path}'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
STORED AS TEXTFILE
select ${hiveconf:columnstr}
from ${hiveconf:hive_table};
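The export lands as one or more plain text files under local_path (typically named 000000_0, 000001_0, ...); these are what the Python loader in the next tip reads.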
7. Import a local file into a PostgreSQL database. Hive's PostgreSQL support is poor and Sqoop cannot do the import directly, so first export the Hive data to a local file (as above), then use a Python script to write it in:
#!/usr/bin/python
# _*_ coding: utf-8 _*_
import os, sys
import psycopg2

class ReadFileProgress:
    """File-like wrapper that reports progress to stderr while being read."""
    def __init__(self, filename):
        self.datafile = open(filename)
        self.totalRecords = 0
        self.totalBytes = os.stat(filename).st_size
        self.readBytes = 0
        self.datafile.readline()  # skip the first line (assumed to be a header)
        i = 0
        for i, l in enumerate(self.datafile):
            pass
        self.totalRecords = i + 1
        sys.stderr.write("Number of records: %d\n" % (self.totalRecords))
        self.datafile.seek(0)
        self.datafile.readline()  # skip the header again before the real read
        self.perc5 = self.totalBytes / 20.0  # one progress tick per 5% of the file
        self.perc5count = 0
        self.lastPerc5 = 0
        sys.stderr.write("Writing records: 0%")

    def countBytes(self, size=0):
        if size and size > 0:
            self.readBytes += size
        if self.readBytes - self.lastPerc5 >= self.perc5:
            self.lastPerc5 = self.readBytes
            if int(self.readBytes / self.perc5) == 5:
                sys.stderr.write("25%")
            elif int(self.readBytes / self.perc5) == 10:
                sys.stderr.write("50%")
            elif int(self.readBytes / self.perc5) == 15:
                sys.stderr.write("75%")
            else:
                sys.stderr.write(".")
            sys.stderr.flush()

    def readline(self, size=-1):
        self.countBytes(size)  # the original called countBytes() without self
        return self.datafile.readline(size)

    def read(self, size=-1):
        self.countBytes(size)
        return self.datafile.read(size)

    def close(self):
        sys.stderr.write("100%\n")
        self.datafile.close()

def insert_to_pg(conn_str, table_name, file_path, insert_columns=None):
    # conn_str is a libpq connection string; the original shadowed it with the connection object
    conn = psycopg2.connect(conn_str)
    cursor = conn.cursor()
    if os.path.isfile(file_path):
        datafile = ReadFileProgress(file_path)
        cursor.copy_from(file=datafile, table=table_name, sep='\t',
                         null='\\N', size=81920, columns=insert_columns)
        datafile.close()
    cursor.close()
    conn.commit()  # without the commit the COPY would be rolled back
    conn.close()
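A usage sketch (the connection string and file path are hypothetical; null='\\N' above matches how Hive encodes NULL in text exports):
insert_to_pg("host=localhost dbname=shop user=loader password=secret",
             "products_24_1",
             "/home/hive/hive_data/products_24_1/000000_0",
             insert_columns=("name", "created_at", "updated_at", "status"))
Keep in mind that ReadFileProgress skips the first line of the file, so it only suits inputs whose first line is a header.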
8. Export a given table from PostgreSQL
def do_export(conn_str, table_name, file_path, columns=None):
    conn = psycopg2.connect(conn_str)
    cursor = conn.cursor()
    with open(file_path, 'w') as out:  # open() rather than the Python-2-only file()
        cursor.copy_to(file=out, table=table_name, sep='\t',
                       null='\\N', columns=columns)
    cursor.close()
    conn.commit()
    sys.stdout.write("Transaction finished successfully.\n")
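And a matching usage sketch, again with hypothetical connection details:
do_export("host=localhost dbname=shop user=loader password=secret",
          "brands", "/tmp/brands.tsv",
          columns=("name", "created_at", "updated_at"))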
9. hiveconf parameters can be referenced inside a select statement as well. Invoke hive with the parameter (plus -e or -f to supply the query):
hive -hiveconf name='hello hive'
INSERT OVERWRITE TABLE companies
select
'${hiveconf:name}' as name
from companies_old
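Every row selected from companies_old is then written to companies with the literal value hello hive as its name.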