As the data warehouse of a Hadoop big-data cluster, Hive naturally gets used for data processing and analysis, which means you will constantly reach for Hive functions. Hive functions fall into two groups: built-in functions and user-defined functions (UDFs).

- Built-in functions: as in other common databases, these ship with the engine and cover the vast majority of general-purpose data processing and analysis;
- User-defined functions: when a piece of business logic is needed frequently but Hive provides no function for it, users create their own following Hive's UDF development guide (covered in detail in the next post, skipped here).

The most authoritative and classic reference on Hive functions is, of course, the official Hive wiki: Hive Operators and User-Defined Functions (UDFs), laid out as shown in Figure 1. This post walks through the built-in functions from that official page;
Besides splitting Hive functions into built-in and user-defined, they can also be grouped by functional area: date functions, string functions, mathematical functions, aggregate functions, window functions, and others. Listed here are the built-in functions used most frequently in development; for the more obscure ones, just look them up when you need them. See Table 1 for an overview;
Category | Functions |
---|---|
Date functions | year, month, day, date_add, date_sub, datediff, from_unixtime, unix_timestamp, to_date, etc. |
String functions | substr, substring, concat, concat_ws, split, regexp_replace, replace, regexp_extract, get_json_object, trim, instr, length, etc. |
Mathematical functions | abs, ceil, floor, round, rand, pow, pmod, etc. |
Aggregate functions | count, max, min, avg, count distinct, sum, group_concat, collect_set, collect_list, etc. |
Window functions | row_number, lead, lag, rank, dense_rank, max, min, count, etc. (unlike their aggregate namesakes, the max, min, count here compute within the current window) |
Other functions | coalesce, cast, decode, row-column transforms such as lateral view and explode, etc. |
## ==Date functions==
As the name suggests, these handle date-related processing. Note that date functions generally expect arguments of type `date` or `timestamp`, formatted as `yyyy-MM-dd` or `yyyy-MM-dd hh:mm:ss`. A value in `yyyyMMdd` form is likely to be treated by Hive as a plain string, and applying date functions to it directly may return `NULL`, as shown below;
- `year`, `month`, `day`, `hour`, `minute`, `second`
**Usage**: `year(date)`, `month(date)`, `day(date)`, `hour(timestamp)`, `minute(timestamp)`, `second(timestamp)`
**Parameters**: a date in `timestamp` or `date` format; other argument types may trigger an exception or yield NULL;
**Returns**: the year, month, day (or hour, minute, second) of the given `date`/`timestamp`;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> year('20200706') as `字符串年无效`
> ,month('20200706') as `字符串月无效`
> ,day('20200706') as `字符串日无效`
> ,year(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))) as `字符串年`
> ,month(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))) as `字符串月`
> ,day(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))) as `字符串日`
> ,year('2020-07-06') as `年`
> ,month('2020-07-06') as `月`
> ,day('2020-07-06') as `日`
> ,hour('2020-07-06 12:30:49') as `时`
> ,minute('2020-07-06 12:30:49') as `分`
> ,second('2020-07-06 12:30:49') as `秒`
> ;
OK
字符串年无效 字符串月无效 字符串日无效 字符串年 字符串月 字符串日 年 月 日 时 分 秒
NULL NULL NULL 2020 7 6 2020 7 6 12 30 49
Time taken: 0.177 seconds, Fetched: 1 row(s)
- `date_add`
**Usage**: `date_add(timestamp/date time, int days)`
**Parameter 1**: a date in `timestamp` or `date` format;
**Parameter 2**: an `int` number of days: a positive value adds days to the given date, a negative value subtracts them;
**Returns**: a date;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> date_add('20200706',1) as `参数异常`,
> date_add('2020-07-06',1) as `明天`,
> date_add('2020-07-06',-1) as `昨天`;
OK
参数异常 明天 昨天
NULL 2020-07-07 2020-07-05
Time taken: 0.088 seconds, Fetched: 1 row(s)
- `date_sub`
**Usage**: `date_sub(timestamp/date time, int days)`
**Parameter 1**: a date in `timestamp` or `date` format;
**Parameter 2**: an `int` number of days, with the signs meaning the opposite of `date_add`: a positive value subtracts days from the given date, a negative value adds them;
**Returns**: a date;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> date_sub('20200706',1) as `参数异常`,
> date_sub('2020-07-06',1) as `昨天`,
> date_sub('2020-07-06',-1) as `明天`;
OK
参数异常 昨天 明天
NULL 2020-07-05 2020-07-07
Time taken: 0.609 seconds, Fetched: 1 row(s)
- `datediff`
**Usage**: `datediff(timestamp/date enddate, timestamp/date startdate)`
**Parameter 1**: an end date in `timestamp` or `date` format;
**Parameter 2**: a start date in `timestamp` or `date` format;
**Returns**: the number of days from startdate to enddate, as an `int`;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> datediff('2020-07-06','2020-06-06') as `相差天数`,
> datediff('20200606','20200706') as `参数异常相差天数`;
OK
相差天数 参数异常相差天数
30 NULL
Time taken: 0.084 seconds, Fetched: 1 row(s)
- `unix_timestamp`
**Usage**: `unix_timestamp(timestamp/date, format)`
**Parameter 1**: a date in `timestamp` or `date` format;
**Parameter 2**: a format string describing parameter 1, such as `'yyyyMMdd'` or `'yyyy-MM-dd HH:mm:ss'`;
**Returns**: a Unix timestamp;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> unix_timestamp('20200706','yyyyMMdd') as `日期时间戳`,
> unix_timestamp('20200706 18:34:56','yyyyMMdd hh:mm:ss') as `时间日期时间戳`
> ;
OK
日期时间戳 时间日期时间戳
1593964800 1594031696
Time taken: 0.061 seconds, Fetched: 1 row(s)
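One caveat on the format argument: it follows Java's SimpleDateFormat conventions, where pattern letters are case-sensitive — `MM` is the month while `mm` is the minute, and `HH` is the 24-hour clock while `hh` is the 12-hour one. A minimal sketch of the distinction, using an illustrative literal:

```sql
-- MM = month, mm = minute; HH = 24-hour clock, hh = 12-hour clock.
-- For an evening time such as 18:34:56, HH is the safe pattern to use:
select unix_timestamp('20200706 18:34:56', 'yyyyMMdd HH:mm:ss') as ts_24h_clock;
```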
- `from_unixtime`
**Usage**: `from_unixtime(bigint unixtime[, string format])`, often nested as `from_unixtime(unix_timestamp(timestamp/date, format))` to normalize a string date;
**Parameter 1**: a Unix timestamp;
**Returns**: the corresponding date-time string;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> from_unixtime(1593964800) as `日期1`
> ,from_unixtime(1594031696) as `日期2`
> ,from_unixtime(unix_timestamp('20200706','yyyyMMdd')) as `字符串转日期`
> ;
OK
日期1 日期2 字符串转日期
2020-07-06 00:00:00 2020-07-06 18:34:56 2020-07-06 00:00:00
Time taken: 0.07 seconds, Fetched: 1 row(s)
- `to_date`
**Usage**: `to_date(timestamp/date)`
**Parameter 1**: a date in `timestamp` or `date` format;
**Returns**: the date part, in `yyyy-MM-dd` form;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> to_date('2020-07-06 12:30:30') as `日期`
> ,to_date(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))) as `字符串转日期`
> ;
OK
日期 字符串转日期
2020-07-06 2020-07-06
Time taken: 0.544 seconds, Fetched: 1 row(s)
Putting these together: implement "the Saturday of last week for a given date" in Hive SQL. Assume the date is dynamic and formatted as `yyyyMMdd`; today it happens to be `20200706`, and tomorrow it becomes `20200707`. The Hive SQL looks like this;
hive> set hive.cli.print.header=true;
hive> select
>
> unix_timestamp('20200706','yyyyMMdd') as `时间戳`,
> -- get the Unix timestamp; the date literal and the yyyyMMdd pattern must match exactly,
> -- otherwise you get an error or NULL (and note it is MM for month: mm would mean minutes)
>
> from_unixtime(unix_timestamp('20200706','yyyyMMdd')) as `日期`,
> -- from_unixtime converts the timestamp back into a normal date
>
> datediff(to_date(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))),'1900-01-01') as `相差几天`,
> -- days between 20200706 and 1900-01-01; 1900-01-01 is taken as the earliest reference Monday
>
>
> pmod(datediff(to_date(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))),'1900-01-01'),7)+1 as `周几`,
> -- the remainder of the day difference divided by 7, plus 1, gives the day of the week
>
> date_sub(to_date(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))),pmod(datediff(to_date(from_unixtime(unix_timestamp('20200706','yyyyMMdd'))),'1900-01-08'),7)+2) as `上周六`
> -- get the Saturday of last week
> ;
OK
时间戳 日期 相差几天 周几 上周六
1593964800 2020-07-06 00:00:00 44016 1 2020-07-04
Time taken: 0.725 seconds, Fetched: 1 row(s)
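The same logic can be wrapped up with a Hive variable so the date never has to be edited by hand; a sketch, assuming the variable `ds` is supplied at launch (e.g. `hive --hivevar ds=20200706`):

```sql
-- previous Saturday for an arbitrary yyyyMMdd date passed in as ${hivevar:ds}
select date_sub(
         to_date(from_unixtime(unix_timestamp('${hivevar:ds}', 'yyyyMMdd'))),
         pmod(datediff(to_date(from_unixtime(unix_timestamp('${hivevar:ds}', 'yyyyMMdd'))), '1900-01-08'), 7) + 2
       ) as last_saturday;
```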
For more built-in date functions, see the official Hive wiki on date functions; questions are welcome in the comments below so we can learn together;
## ==String functions==
Examples of commonly used functions for string handling;
- `substring`
**Usage**: `substring(string A, int start[, int len])`
**Parameter 1**: a string of type `string`, `varchar`, or `char`;
**Parameter 2**: the 1-based start position;
**Parameter 3**: the number of characters to take (a length, not an end index); if omitted, the rest of the string;
**Returns**: the substring;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select substring('abcderf',2,3) as `substring截取`
> ,substr('abcderf',2,3) as `substr截取`
> ,substring('abcderf',0,100) as `substring溢出位数`
> ,substring('abcderf',0,-1) as `substring溢出位数`
> ;
OK
substring截取 substr截取 substring溢出位数 substring溢出位数
bcd bcd abcderf
Time taken: 0.233 seconds, Fetched: 1 row(s)
- `substr`
**Usage**: `substr(string A, int start[, int len])`
**Parameters**: same as `substring`;
**Returns**: same as `substring`;
**Hive Cli example**: same as `substring`;
- `concat`
**Usage**: `concat(args1, args2, args3, ..., argsn)`
**Parameters**: multiple fields, possibly of different types;
**Returns**: the concatenation of all the fields;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select concat('adc',1,22.3) as `字符连接`
> ;
OK
字符连接
adc122.3
Time taken: 0.066 seconds, Fetched: 1 row(s)
- `concat_ws`
**Usage**: `concat_ws(string SEP, string1, string2, ..., stringn)`
**Parameter 1**: a `string` separator SEP (a literal, not a regex);
**Parameters 2...n**: the `string` fields to join;
**Returns**: the fields joined by the separator;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select concat_ws(',','adc','1','22.3') as `字符连接`
> ,concat_ws('/','跑步','游戏','看书') as `爱好`
> ;
OK
字符连接 爱好
adc,1,22.3 跑步/游戏/看书
Time taken: 0.231 seconds, Fetched: 1 row(s)
- `split`
**Usage**: `split(str, regex)`
**Parameter 1**: a `string` to split;
**Parameter 2**: a `string` delimiter, interpreted as a regular expression;
**Returns**: an array of the split parts;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select split('张三,李四,王五',',') as `合伙人`
> ;
OK
合伙人
["张三","李四","王五"]
Time taken: 0.103 seconds, Fetched: 1 row(s)
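Since `split` returns an array, it pairs naturally with `explode` (covered in the complex-type section later in this post) to turn a delimited string into rows; a minimal sketch:

```sql
-- one row per delimited element: 张三 / 李四 / 王五
select explode(split('张三,李四,王五', ',')) as partner;
```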
- `regexp_replace`
**Usage**: `regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)`
**Parameter 1**: the source string;
**Parameter 2**: the pattern to replace, which supports regular expressions;
**Parameter 3**: the replacement string;
**Returns**: the source string with every match of the pattern replaced; note below how both `regexp_replace` and `replace` clean the newline out of the result;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> regexp_replace("foobar", "oo|ar", "") `regexp_replace替换`,
> replace("foobar", "oo|ar", "") as `replace替换`,
> 'foobar\n' as `源字符`,
> regexp_replace('foobar\n', '\n', '') `regexp_replace替换特殊字符`,
> replace('foobar\n', '\n', '') as `replace替换替换特殊字符`
> ;
OK
regexp_replace替换 replace替换 源字符 regexp_replace替换特殊字符 replace替换替换特殊字符
fb foobar foobar
foobar foobar
Time taken: 0.203 seconds, Fetched: 1 row(s)
- `replace`
**Usage**: `replace(string, string1, string2)`
**Parameters**: like `regexp_replace`, except that the match string does not support regular expressions, only literal matching (special characters such as `\n` are still supported);
**Returns**: the source string with every literal match replaced;
**Hive Cli example**: see the `regexp_replace` demo above; the same query exercises both functions, and it shows `replace("foobar", "oo|ar", "")` returning the input unchanged because `oo|ar` is taken literally;
- `regexp_extract`
**Usage**: `regexp_extract(string subject, string pattern, int index)`
**Parameter 1**: the source `string` field;
**Parameter 2**: a `string` pattern, which supports regular expressions;
**Parameter 3**: the index of the capture group to return, with 0 meaning the whole match;
**Returns**: the text matched by that group;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> regexp_extract('foothebar', 'foo(.*?)(bar)', 2) as `匹配第2组数据`,
> regexp_extract('foothebar', 'foo(.*?)(bar)', 1) as `匹配第1组数据`,
> regexp_extract('foothebar', 'foo(.*?)(bar)', 0) as `匹配第整组数据`
> ;
OK
匹配第2组数据 匹配第1组数据 匹配第整组数据
bar the foothebar
Time taken: 0.265 seconds, Fetched: 1 row(s)
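A common practical use is pulling a structured token out of messy text; a small sketch with an illustrative input string (note the doubled backslash Hive needs for regex character classes):

```sql
-- extract the digit run from an order code; group 1 is the (\d+) capture
select regexp_extract('order-12345-done', 'order-(\\d+)-', 1) as order_id;  -- returns 12345
```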
- `get_json_object`
**Usage**: `get_json_object(string json_string, string path)`
**Parameter 1**: a `string` holding JSON;
**Parameter 2**: a JSON path rooted at `$`;
**Returns**: the value found at that path, or NULL when the JSON or the path is invalid;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> get_json_object('{"name":"Jack","sex":"man","age":12}','$.name') as `姓名`
> ,get_json_object('{"name":"Jack","sex":"man","age":12}','$.sex') as `性别`
> ,get_json_object('{"name":"Jack","sex":"man","age":12}','$.age') as `年龄`
> ;
OK
姓名 性别 年龄
Jack man 12
Time taken: 0.47 seconds, Fetched: 1 row(s)
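`get_json_object` also handles nested objects and arrays through the path argument; a minimal sketch with an illustrative JSON document:

```sql
select get_json_object('{"user":{"name":"Jack","tags":["a","b"]}}', '$.user.name')    as name,      -- Jack
       get_json_object('{"user":{"name":"Jack","tags":["a","b"]}}', '$.user.tags[0]') as first_tag; -- a
```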
- `trim`
**Usage**: `trim(string A)`
**Parameter 1**: a `string`;
**Returns**: the string with spaces stripped from both ends; `ltrim` and `rtrim` strip only the left or right side;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
>
> trim(" hello world ") as `去掉两端的空格`
> ,rtrim(" hello world ") as `只去掉右边空格`
> ,ltrim(" hello world ") as `只去掉左边空格`
> ;
OK
去掉两端的空格 只去掉右边空格 只去掉左边空格
hello world hello world hello world
Time taken: 0.042 seconds, Fetched: 1 row(s)
- `instr`
**Usage**: `instr(string str, string substr)`
**Parameters**: the string to search, and the substring to look for;
**Returns**: the position of the first occurrence of substr, counting from 1; returns 0 when substr is not found;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> instr("阿兹卡班的囚徒阿", "囚徒") as `位置下标1`,
> instr("阿兹卡班的囚徒", "mei") as `位置下标2`,
> instr("阿兹卡班的囚徒阿", "阿") as `位置下标3`
> ;
OK
位置下标1 位置下标2 位置下标3
6 0 1
Time taken: 0.048 seconds, Fetched: 1 row(s)
- `length`
**Usage**: `length(string A)`
**Returns**: the length of the string in characters;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> length("阿兹卡班的囚徒") as `长度`
> ;
OK
长度
7
There are more string functions than can be listed here one by one; see Table 2 below, or the official Hive wiki on string functions;
Return type | Function | Description |
---|---|---|
int | ascii(string str) | Returns the numeric value of the first character of str |
string | base64(binary bin) | Converts the binary argument to a base64 string |
string | concat(string\|binary A, string\|binary B...) | Concatenates the arguments, in order, into a single string |
array<struct<string,double>> | context_ngrams(array<array<string>>, array<string>, int K, int pf) | Returns the top-k contextual N-grams from a set of tokenized sentences |
string | concat_ws(string SEP, string A, string B...) | Like concat(), but with the custom separator SEP |
string | concat_ws(string SEP, array<string>) | Like concat_ws() above, but taking an array of strings |
string | decode(binary bin, string charset) | Decodes the first argument to a string using the given charset; returns null if either argument is null. Possible charsets: 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16' |
binary | encode(string src, string charset) | Encodes the first argument to binary using the given charset; returns null if either argument is null |
int | find_in_set(string str, string strList) | Returns the position of the first occurrence of str in strList, a comma-separated list; returns 0 if str contains a comma, null if any argument is null. E.g. find_in_set('ab', 'abc,b,ab,c,def') returns 3 |
string | format_number(number x, int d) | Formats x like '#,###,###.##', rounded to d decimal places, and returns the result as a string; with d=0 the result has no decimal point or fractional part |
string | get_json_object(string json_string, string path) | Extracts a JSON object from a JSON string based on the JSON path and returns it as a JSON string; returns null if the input JSON is invalid. The path may contain only digits, letters, and underscores |
boolean | in_file(string str, string filename) | Returns true if str appears as a whole line in filename |
int | instr(string str, string substr) | Returns the position of the first occurrence of substr in str; null if any argument is null, 0 if substr is not found. The first character of str is at position 1 |
int | length(string A) | Returns the length of A |
int | locate(string substr, string str[, int pos]) | Returns the first occurrence of substr in str after position pos |
string | lower(string A) / lcase(string A) | Returns the lowercase form of the string |
string | lpad(string str, int len, string pad) | Left-pads str with pad to length len |
string | ltrim(string A) | Removes spaces from the left side of A, e.g. ltrim(' foobar ') returns 'foobar ' |
array<struct<string,double>> | ngrams(array<array<string>>, int N, int K, int pf) | Returns the top-K N-grams from a set of tokenized sentences |
string | parse_url(string urlString, string partToExtract [, string keyToExtract]) | Returns the given part of a URL; valid values of partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. E.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'. When the second argument is QUERY, the third argument extracts the value of a specific key, e.g. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1' |
string | printf(String format, Obj... args) | Formats the input arguments according to the format string |
string | regexp_extract(string subject, string pattern, int index) | Extracts a string using the pattern, e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. Take care with predefined character classes: '\s' as the pattern matches the letter s, while '\\s' matches whitespace. The index argument follows the group() indexing of the Java regex Matcher |
string | regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT) | Replaces the substrings of INITIAL_STRING matching PATTERN with REPLACEMENT, e.g. regexp_replace("foobar", "oo\|ar", "") returns 'fb' |
string | repeat(string str, int n) | Repeats str n times |
string | reverse(string A) | Reverses A |
string | rpad(string str, int len, string pad) | Right-pads str with pad to length len |
string | rtrim(string A) | Removes spaces from the right side of A, e.g. rtrim(' foobar ') returns ' foobar' |
array<array<string>> | sentences(string str, string lang, string locale) | Tokenizes natural-language text into words and sentences, splitting each sentence at the appropriate boundary, and returns arrays of words; lang and locale are optional. E.g. sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ) |
string | space(int n) | Returns a string of n spaces |
array | split(string str, string pat) | Splits str around pat, a regular expression |
map<string,string> | str_to_map(text[, delimiter1, delimiter2]) | Splits text into key-value pairs using two delimiters: the first splits the text into K-V pairs, the second splits each K-V pair. Defaults are ',' for the first and '=' for the second |
string | substr(string\|binary A, int start) / substring(string\|binary A, int start) | Returns the substring or slice of A from position start to the end |
string | substr(string\|binary A, int start, int len) / substring(string\|binary A, int start, int len) | Returns the substring or slice of A of length len, starting at position start |
string | translate(string input, string from, string to) | Replaces the characters of input that appear in from with the corresponding characters in to; null if any argument is null |
string | trim(string A) | Removes spaces from both ends of A |
binary | unbase64(string str) | Converts a base64 string to binary |
string | upper(string A) / ucase(string A) | Returns the uppercase form of A |
## ==Mathematical functions==
Functions for mathematical computation; examples of the common ones follow;
- `abs`
**Usage**: `abs(double a)`
**Returns**: the absolute value of a;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> abs(-12.89) as `绝对值1`
> ,abs(-28) as `绝对值2`
> ,abs(28) as `绝对值3`
> ;
OK
绝对值1 绝对值2 绝对值3
12.89 28 28
Time taken: 0.333 seconds, Fetched: 1 row(s)
- `ceil`
**Usage**: `ceil(double a)`
**Returns**: the smallest integer not less than a (rounds up);
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> ceil(-12.89) as `向上取整1`,
> ceil(11.239) as `向上取整2`
> ;
OK
向上取整1 向上取整2
-12 12
Time taken: 0.068 seconds, Fetched: 1 row(s)
- `floor`
**Usage**: `floor(double a)`
**Returns**: the largest integer not greater than a (rounds down);
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> floor(-12.89) as `向下取整1`,
> floor(11.239) as `向下取整2`
> ;
OK
向下取整1 向下取整2
-13 11
Time taken: 0.063 seconds, Fetched: 1 row(s)
- `round`
**Usage**: `round(double d)` / `round(double d, int n)`
**Returns**: d rounded to the nearest integer, or to n decimal places;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> round(12.89) as `四舍五入取整`,
> round(11.239123,2) as `四舍五入保留2为小数`
> ;
OK
四舍五入取整 四舍五入保留2为小数
13 11.24
- `rand`
**Usage**: `rand()` / `rand(int seed)`
**Returns**: a random number uniformly distributed in [0, 1); with a fixed seed the result is reproducible, as the repeated columns below show;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> rand() as `随机数`,
> rand() as `随机数`,
> rand(3) as `固定随机因子随机数`,
> rand(3) as `固定随机因子随机数`,
> floor(rand()*10)%6+1 as `摇骰子一次`,
> floor(rand()*10)%6+1 as `摇骰子二次`
> ;
OK
随机数 随机数 固定随机因子随机数 固定随机因子随机数 摇骰子一次 摇骰子二次
0.4495149771126544 0.9207548698949779 0.731057369148862 0.731057369148862 6 2
Time taken: 0.647 seconds, Fetched: 1 row(s)
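A handy application of `rand()` is simple random sampling; a sketch, with an illustrative table name:

```sql
-- take 100 random rows; beware that order by rand() forces a full sort,
-- so it is only suitable for small-to-medium inputs
select * from some_table order by rand() limit 100;
```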
- `pow`
**Usage**: `pow(double a, double p)` / `power(double a, double p)`
**Returns**: a raised to the power p, as a double;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> pow(2,2) `2的2次方`,
> pow(2,3) `2的3次方`
> ;
OK
2的2次方 2的3次方
4.0 8.0
- `pmod`
**Usage**: `pmod(int a, int b)` / `pmod(double a, double b)`
**Returns**: the positive remainder of a divided by b;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select
> pmod(5,2) `取模`
> ;
OK
取模
1
Time taken: 0.055 seconds, Fetched: 1 row(s)
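The point of `pmod` over the plain `%` operator is the sign of the result: for a positive divisor, `pmod` always returns a non-negative value, while `%` keeps the sign of the dividend. A minimal sketch:

```sql
select pmod(-5, 3) as pmod_result,  -- 1: ((-5 % 3) + 3) % 3
       -5 % 3      as mod_result;   -- -2: % keeps the dividend's sign
```

This is exactly why `pmod`, rather than `%`, is used in the day-of-week calculation earlier in this post.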
For more mathematical functions, see the official Hive wiki on mathematical functions;
## ==Aggregate functions==
Aggregate functions compute the measure values for the dimensions listed after the `group by` keyword; they are an everyday staple of data analysis. Some commonly used ones are illustrated below.
To make the examples concrete, here is a student table prepared in advance, with the following data;
hive> set hive.cli.print.header=true;
hive> select * FROM ods_rs_basic_tbd_student where event_day='20200618';
OK
sno sname ssex sage classid event_week event_day event_hour
1 小明 男 15 6 25 20200618 00
2 小红 女 13 5 25 20200618 00
3 小丽 女 14 7 25 20200618 00
4 小华 男 17 1 25 20200618 00
5 小蓝 男 15 2 25 20200618 00
6 大林 男 14 3 25 20200618 00
7 大姝 女 13 5 25 20200618 00
8 大瑶 女 14 7 25 20200618 00
9 大发 男 17 1 25 20200618 00
10 大佬 男 15 4 25 20200618 00
10 NULL 男 14 1 25 20200618 00
11 大稳 男 14 NULL 25 20200618 00
Time taken: 2.708 seconds, Fetched: 12 row(s)
- `count`
**Usage**: `count(*)` / `count(expr)` / `count(distinct expr[, expr...])`
**Returns**: `count(*)` counts all rows, including those containing `NULL`; `count(expr)` counts the rows of a given column, with `NULL` values excluded; `count(distinct expr[, expr...])` counts the deduplicated values of the given columns, again excluding `NULL`;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> count(*) as `所有行数`
> ,count(sname) as `学生数`
> ,count(distinct classid) as `班号数`
> from ods_rs_basic_tbd_student where event_day='20200618';
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = liuxiaowei_20200714143913_0c4645f8-03b1-431d-a8e1-f8e6bc0b96ac
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0071, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0071/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job -kill job_1592876386879_0071
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-07-14 14:39:22,464 Stage-1 map = 0%, reduce = 0%
2020-07-14 14:39:33,063 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.9 sec
2020-07-14 14:39:40,392 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 12.26 sec
MapReduce Total cumulative CPU time: 12 seconds 260 msec
Ended Job = job_1592876386879_0071
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 12.26 sec HDFS Read: 19855 HDFS Write: 107 SUCCESS
Total MapReduce CPU Time Spent: 12 seconds 260 msec
OK
所有行数 学生数 班号数
12 11 7
Time taken: 27.714 seconds, Fetched: 1 row(s)
- `max` / `min`
**Usage**: `max(col)` / `min(col)`
**Returns**: the maximum / minimum value of the column within each group (NULLs are ignored);
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> ssex as `学生性别`,
> max(sage) as `最大年龄`,
> min(sage) as `最小年龄`
> from ods_rs_basic_tbd_student
> where event_day='20200618'
> group by ssex
> ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = liuxiaowei_20200714145235_8c25f6ed-fc0a-4a85-8c7f-2cda501af316
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0072, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0072/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job -kill job_1592876386879_0072
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-07-14 14:52:43,145 Stage-1 map = 0%, reduce = 0%
2020-07-14 14:52:51,934 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 6.32 sec
2020-07-14 14:52:52,978 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.43 sec
2020-07-14 14:52:58,190 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.62 sec
MapReduce Total cumulative CPU time: 14 seconds 620 msec
Ended Job = job_1592876386879_0072
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 14.62 sec HDFS Read: 19691 HDFS Write: 137 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 620 msec
OK
学生性别 最大年龄 最小年龄
女 14 13
男 17 14
Time taken: 23.729 seconds, Fetched: 2 row(s)
- `avg` / `sum`
**Usage**: `avg(col)` / `sum(col)`
**Returns**: the average / sum of the column within each group;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> ssex as `学生性别`,
> sum(sage) as `年龄总和`,
> avg(sage) as `平均年龄`
> from ods_rs_basic_tbd_student
> where event_day='20200618'
> group by ssex
> ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = liuxiaowei_20200714150201_b4e790f7-340f-4ea9-b4dc-fe637cd4bec1
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0074, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0074/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job -kill job_1592876386879_0074
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-07-14 15:02:28,797 Stage-1 map = 0%, reduce = 0%
2020-07-14 15:02:36,135 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.58 sec
2020-07-14 15:02:41,370 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.75 sec
MapReduce Total cumulative CPU time: 11 seconds 750 msec
Ended Job = job_1592876386879_0074
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 11.75 sec HDFS Read: 20505 HDFS Write: 148 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 750 msec
OK
学生性别 年龄总和 平均年龄
女 54.0 13.5
男 121.0 15.125
Time taken: 41.839 seconds, Fetched: 2 row(s)
- `collect_set` / `collect_list`
**Usage**: `collect_set(col)` / `collect_list(col)`
**Returns**: an array of the column's values within each group; `collect_set` removes duplicates while `collect_list` keeps them, as the two output columns below show;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> ssex as `学生性别`,
> collect_set(classid) as `性别班级分布`,
> collect_list(classid) as `性别班级分布`
> from ods_rs_basic_tbd_student
> where event_day='20200618'
> group by ssex
> ;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = liuxiaowei_20200714153712_cd429f62-2507-470c-a3b7-2ecd1f45ce75
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1592876386879_0076, Tracking URL = http://dw-test-cluster-007:8088/proxy/application_1592876386879_0076/
Kill Command = /usr/local/tools/hadoop/current//bin/hadoop job -kill job_1592876386879_0076
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2020-07-14 15:37:20,134 Stage-1 map = 0%, reduce = 0%
2020-07-14 15:37:32,699 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 11.63 sec
2020-07-14 15:37:38,990 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.72 sec
MapReduce Total cumulative CPU time: 13 seconds 720 msec
Ended Job = job_1592876386879_0076
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 13.72 sec HDFS Read: 20372 HDFS Write: 161 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 720 msec
OK
学生性别 性别班级分布 性别班级分布
女 ["5","7"] ["5","7","5","7"]
男 ["6","1","2","3","4"] ["6","1","2","1","3","1","4"]
Time taken: 27.955 seconds, Fetched: 2 row(s)
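Hive has no built-in `group_concat` like MySQL's, but combining `collect_set`/`collect_list` with `concat_ws` achieves the same effect; a sketch against the student table above (assuming `classid` is a string column, as the outputs above suggest):

```sql
select ssex,
       concat_ws(',', collect_set(classid)) as class_list  -- e.g. "5,7"
from ods_rs_basic_tbd_student
where event_day = '20200618'
group by ssex;
```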
For more aggregate functions, see the official Hive wiki on aggregate functions;
## ==Window functions==
Window functions are great fun and one of the cores of data analysis in Hive, so they get a dedicated post of their own; stay tuned for Hive从入门到放弃——玩一玩Hive的数据分析开窗函数(十三);
## ==Complex type functions==
Since Hive supports complex data types such as `array`, `map`, and `struct`, it also provides a family of functions for operating on them;
First, let's prepare a table with complex types; the DDL is as follows;
CREATE EXTERNAL TABLE `rowyet.employees`(
`name` string,
`salary` float,
`subordinates` array<string>,
`deductions` map<string,float>,
`address` struct<street:string,city:string,state:string,zip:int>)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'colelction.delim'=',',
'field.delim'='|',
'line.delim'='\n',
'mapkey.delim'='\;',
'serialization.format'='|')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/rowyet/employees'
TBLPROPERTIES (
'transient_lastDdlTime'='1590204797')
A preview of the data follows (by the way, `colelction.delim` in the DDL above is not a typo in this post: the property name is genuinely misspelled inside Hive's own serde code, so that spelling is the one that works);
hive> select * from rowyet.employees;
OK
name salary subordinates deductions address
John Doe 100000.0 ["Mary","SmithTodd","JonesFederal"] {"Taxes":2.0,"State":15.2} {"street":"Insurance.11","city":"Michigan","state":"Ave.ChicagoIL","zip":60600}
Mary Smith 80000.0 ["Bill","KingFedera"] {"Taxes":0.5,"State":10.2} {"street":"Insurance.1100","city":"Ontario","state":"St.ChicagoIL","zip":60601}
Todd Jones 70000.0 ["Federal"] {"Taxes":15.0} {"street":"Insurance.1200","city":"Chicago","state":"Ave.OakParkIL","zip":60700}
Bill King 60000.0 ["Federal"] {"Taxes":15.0} {"street":"Insurance.1300","city":"Obscure","state":"Dr.ObscuriaIL","zip":60100}
Boss Man 200000.0 ["John","DoeFred","FinanceFederal"] {"Taxes":13.0,"late":200.0} {"street":"Insurance.051","city":"Pretentious","state":"Drive.ChicagoIL","zip":60500}
Fred Finance 150000.0 ["Stacy","AccountantFederal"] {"Taxes":30.23,"others":102.9} {"street":"Insurance.052","city":"Pretentious","state":"Drive.ChicagoIL","zip":60500}
Stacy Accountant 60000.0 ["Federal"] {"Taxes":15.0} {"street":"Insurance.1300","city":"Main","state":"St.NapervilleIL","zip":60563}
Time taken: 0.169 seconds, Fetched: 7 row(s)
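Given the SERDEPROPERTIES above (`|` between fields, `,` between collection items, `;` between map keys and values), one line of the underlying data file would plausibly look like the following sketch (reconstructed for illustration, not copied from the actual file):

```
John Doe|100000.0|Mary,SmithTodd,JonesFederal|Taxes;2.0,State;15.2|Insurance.11,Michigan,Ave.ChicagoIL,60600
```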
- `size`
**Usage**: `size(Map<K.V>)` / `size(Array<T>)`
**Returns**: the number of elements in the map or array;
- `map_keys`
**Usage**: `map_keys(Map<K.V>)`
**Returns**: an array containing the keys of the map;
- `map_values`
**Usage**: `map_values(Map<K.V>)`
**Returns**: an array containing the values of the map;
- `array_contains`
**Usage**: `array_contains(Array<T>, value)`
**Returns**: true if the array contains the value, otherwise false;
- `sort_array`
**Usage**: `sort_array(Array<T>)`
**Returns**: the array sorted in ascending order;
**Hive Cli example for all of the above**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> size(subordinates) as `array元素个数`,
> size(deductions) as `map元素个数`,
> map_keys(deductions) as map_keys,
> map_values(deductions) as map_values,
> array_contains(subordinates,'Mary') as `array是否包含元素`,
> sort_array(subordinates) as `array元素排序`
> from rowyet.employees
> ;
OK
array元素个数 map元素个数 map_keys map_values array是否包含元素 array元素排序
3 2 ["Taxes","State"] [2.0,15.2] true ["JonesFederal","Mary","SmithTodd"]
2 2 ["Taxes","State"] [0.5,10.2] false ["Bill","KingFedera"]
1 1 ["Taxes"] [15.0] false ["Federal"]
1 1 ["Taxes"] [15.0] false ["Federal"]
3 2 ["Taxes","late"] [13.0,200.0] false ["DoeFred","FinanceFederal","John"]
2 2 ["Taxes","others"] [30.23,102.9] false ["AccountantFederal","Stacy"]
1 1 ["Taxes"] [15.0] false ["Federal"]
Time taken: 0.162 seconds, Fetched: 7 row(s)
- `explode`
**Usage**: `explode(ARRAY<T> a)` / `explode(MAP<K,V> m)`
**Returns**: one row per array element, or one (key, value) row per map entry; as a UDTF, `explode` cannot be mixed with other select columns unless wrapped in a `lateral view` (see below);
**Hive Cli example**: as follows;
-- explode on an array type
hive> set hive.cli.print.header=true;
hive>
> select explode(subordinates) as `array_explode`
> from rowyet.employees;
OK
array_explode
Mary
SmithTodd
JonesFederal
Bill
KingFedera
Federal
Federal
John
DoeFred
FinanceFederal
Stacy
AccountantFederal
Federal
Time taken: 7.779 seconds, Fetched: 13 row(s)
-- explode on a map type
hive> set hive.cli.print.header=true;
> select explode(deductions) as (`array_explode_key`,`array_explode_value`)
> from rowyet.employees;
OK
array_explode_key array_explode_value
Taxes 2.0
State 15.2
Taxes 0.5
State 10.2
Taxes 15.0
Taxes 15.0
Taxes 13.0
late 200.0
Taxes 30.23
others 102.9
Taxes 15.0
Time taken: 0.14 seconds, Fetched: 11 row(s)
- `lateral view`
**Usage**: `lateral view explode(T) tf as colAlias[, ...]`
**Returns**: joins each source row to the rows generated by the UDTF, so the exploded columns can be selected alongside others;
**Hive Cli example**: as follows;
-- row-to-column over a single array column can be rewritten in the lateral view style;
-- lateral view can be read as saving the explode result into a view,
-- then selecting columns from that processed view
-- note: all selected fields come from the lateral view; they carry a re_ prefix to tell them apart
hive> set hive.cli.print.header=true;
hive>
> select tf.re_subordinates
> from rowyet.employees
> lateral view explode(subordinates) tf as re_subordinates;
OK
tf.re_subordinates
Mary
SmithTodd
JonesFederal
Bill
KingFedera
Federal
Federal
John
DoeFred
FinanceFederal
Stacy
AccountantFederal
Federal
Time taken: 0.126 seconds, Fetched: 13 row(s)
-- row-to-column over a single map column, likewise rewritten in the lateral view style;
-- lateral view can be read as saving the explode result into a view,
-- then selecting columns from that processed view
-- note: all selected fields come from the lateral view; they carry a re_ prefix to tell them apart
hive> set hive.cli.print.header=true;
hive>
> select tf.array_explode_key,tf.array_explode_value
> from rowyet.employees
> lateral view explode(deductions) tf as `array_explode_key`,`array_explode_value`;
OK
tf.array_explode_key tf.array_explode_value
Taxes 2.0
State 15.2
Taxes 0.5
State 10.2
Taxes 15.0
Taxes 15.0
Taxes 13.0
late 200.0
Taxes 30.23
others 102.9
Taxes 15.0
Time taken: 0.086 seconds, Fetched: 11 row(s)
-- row-to-column over multiple complex columns must use lateral view;
-- here lateral view can be read as saving each explode into a view and combining them,
-- then selecting columns from those processed views
-- note: all selected fields come from the lateral views; they carry a re_ prefix to tell them apart
hive> set hive.cli.print.header=true;
hive>
> select tf.re_subordinates, tf.array_explode_key,tf.array_explode_value
> from rowyet.employees
> lateral view explode(subordinates) tf as re_subordinates
> lateral view explode(deductions) tf as `array_explode_key`,`array_explode_value`;
OK
tf.re_subordinates tf.array_explode_key tf.array_explode_value
Mary Taxes 2.0
Mary State 15.2
SmithTodd Taxes 2.0
SmithTodd State 15.2
JonesFederal Taxes 2.0
JonesFederal State 15.2
Bill Taxes 0.5
Bill State 10.2
KingFedera Taxes 0.5
KingFedera State 10.2
Federal Taxes 15.0
Federal Taxes 15.0
John Taxes 13.0
John late 200.0
DoeFred Taxes 13.0
DoeFred late 200.0
FinanceFederal Taxes 13.0
FinanceFederal late 200.0
Stacy Taxes 30.23
Stacy others 102.9
AccountantFederal Taxes 30.23
AccountantFederal others 102.9
Federal Taxes 15.0
Time taken: 0.07 seconds, Fetched: 23 row(s)
- `posexplode`
**Usage**: `posexplode(ARRAY<T> a)`
**Returns**: like `explode`, plus a leading column holding each element's 0-based position in the array;
**Hive Cli example**: as follows;
-- posexplode used directly
hive> set hive.cli.print.header=true;
hive>
> select posexplode(subordinates) as (pos,subordinates)
> from rowyet.employees;
OK
pos subordinates
0 Mary
1 SmithTodd
2 JonesFederal
0 Bill
1 KingFedera
0 Federal
0 Federal
0 John
1 DoeFred
2 FinanceFederal
0 Stacy
1 AccountantFederal
0 Federal
Time taken: 0.155 seconds, Fetched: 13 row(s)
-- combined with lateral view
-- note: all selected fields come from the lateral view; they carry a re_ prefix to tell them apart
hive>
> select tf.re_pos,tf.re_subordinates
> from rowyet.employees
> lateral view posexplode(subordinates) tf as re_pos,re_subordinates;
OK
tf.re_pos tf.re_subordinates
0 Mary
1 SmithTodd
2 JonesFederal
0 Bill
1 KingFedera
0 Federal
0 Federal
0 John
1 DoeFred
2 FinanceFederal
0 Stacy
1 AccountantFederal
0 Federal
Time taken: 0.062 seconds, Fetched: 13 row(s)
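A classic use of `posexplode` is zipping two parallel arrays together by position; a minimal sketch with literal arrays:

```sql
select t1.idx, t1.v1, t2.v2
from (select array('a','b','c') as arr1, array('x','y','z') as arr2) t
lateral view posexplode(arr1) t1 as idx, v1
lateral view posexplode(arr2) t2 as idx2, v2
where t1.idx = t2.idx2;  -- keep only position-aligned pairs: (0,a,x), (1,b,y), (2,c,z)
```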
- `inline`
**Usage**: `inline(ARRAY<STRUCT<f1:T1, ..., fn:Tn>> a)`
**Returns**: explodes an array of structs into a table, one column per struct field;
**Hive Cli example**: as follows;
-- inline used directly
hive> set hive.cli.print.header=true;
hive> select inline(array(address)) as (street,city,state,zip)
> from rowyet.employees;
OK
street city state zip
Insurance.11 Michigan Ave.ChicagoIL 60600
Insurance.1100 Ontario St.ChicagoIL 60601
Insurance.1200 Chicago Ave.OakParkIL 60700
Insurance.1300 Obscure Dr.ObscuriaIL 60100
Insurance.051 Pretentious Drive.ChicagoIL 60500
Insurance.052 Pretentious Drive.ChicagoIL 60500
Insurance.1300 Main St.NapervilleIL 60563
Time taken: 0.178 seconds, Fetched: 7 row(s)
-- inline combined with lateral view
hive>
> select tf.re_street,tf.re_city,tf.re_state,tf.re_zip
> from rowyet.employees
> lateral view inline(array(address)) tf as re_street,re_city,re_state,re_zip
> ;
OK
tf.re_street tf.re_city tf.re_state tf.re_zip
Insurance.11 Michigan Ave.ChicagoIL 60600
Insurance.1100 Ontario St.ChicagoIL 60601
Insurance.1200 Chicago Ave.OakParkIL 60700
Insurance.1300 Obscure Dr.ObscuriaIL 60100
Insurance.051 Pretentious Drive.ChicagoIL 60500
Insurance.052 Pretentious Drive.ChicagoIL 60500
Insurance.1300 Main St.NapervilleIL 60563
Time taken: 0.061 seconds, Fetched: 7 row(s)
- `stack`
**Usage**: `stack(int r, T1 V1, ..., Tn/r Vn)`
**Returns**: splits the n values V1 ... Vn into r rows, each with n/r columns;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive>
> select stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') as (name,age,my_date);
OK
name age my_date
A 10 2015-01-01
B 20 2016-01-01
Time taken: 0.151 seconds, Fetched: 2 row(s)
hive> select tf.* from (select 0) t lateral view stack(2,'A',10,date '2015-01-01','B',20,date '2016-01-01') tf as re_name,re_age,my_date;
OK
tf.re_name tf.re_age tf.my_date
A 10 2015-01-01
B 20 2016-01-01
Time taken: 0.044 seconds, Fetched: 2 row(s)
- `json_tuple`
**Usage**: `json_tuple(string jsonStr, string k1, ..., string kn)`
**Returns**: an alternative to `get_json_object`; when several keys are needed, `json_tuple` is the friendlier of the two, returning the values of multiple JSON keys in a single call;
**Hive Cli example**: as follows;
-- first, a refresher on get_json_object
hive> set hive.cli.print.header=true;
> select get_json_object('{"name":"Jack","sex":"man","age":12}','$.name') as `姓名`,
> get_json_object('{"name":"Jack","sex":"man","age":12}','$.sex') as `姓名`,
> get_json_object('{"name":"Jack","sex":"man","age":12}','$.age') as `姓名`
> ;
OK
姓名 姓名 姓名
Jack man 12
Time taken: 0.146 seconds, Fetched: 1 row(s)
-- json_tuple on its own
hive>select
> json_tuple('{"name":"Jack","sex":"man","age":12}','name','sex','age') as (`姓名`,`性别`,`年龄`)
> ;
OK
姓名 性别 年龄
Jack man 12
Time taken: 0.038 seconds, Fetched: 1 row(s)
-- json_tuple combined with lateral view
hive> select tf.*,t.num
> from (select 0 as num) t
> lateral view json_tuple('{"name":"Jack","sex":"man","age":12}','name','sex','age') tf as `re_姓名`,`re_性别`,`re_年龄`
> ;
OK
tf.re_姓名 tf.re_性别 tf.re_年龄 t.num
Jack man 12 0
Time taken: 0.061 seconds, Fetched: 1 row(s)
- `parse_url_tuple`
**Usage**: `parse_url_tuple(string urlStr, string p1, ..., string pn)`
**Returns**: remember `parse_url()` from the string functions? Same relationship: `parse_url_tuple` is the friendlier choice when several parts of a URL are needed at once, fetching them in one call, whereas `parse_url()` has to be written once per part;
**Hive Cli example**: as follows;
-- parse_url fetches one part of the URL at a time
hive> set hive.cli.print.header=true;
hive>
>
> select parse_url('http://facebook.com/path/p1.php?query=1&name=3', 'HOST') as host;
OK
host
facebook.com
Time taken: 0.076 seconds, Fetched: 1 row(s)
hive> select parse_url('http://facebook.com/path/p1.php?query=1&name=3', 'PATH') as path;
OK
path
/path/p1.php
Time taken: 0.043 seconds, Fetched: 1 row(s)
-- parse_url_tuple, plain usage
hive> select parse_url_tuple('http://facebook.com/path/p1.php?query=1&name=3', 'HOST', 'PATH', 'QUERY')as (host,path,query);
OK
host path query
facebook.com /path/p1.php query=1&name=3
Time taken: 0.043 seconds, Fetched: 1 row(s)
-- parse_url_tuple combined with lateral view
hive> SELECT tf.*
> FROM (select 0 as num) src
> lateral view parse_url_tuple('http://facebook.com/path/p1.php?query=1&name=3', 'HOST', 'PATH', 'QUERY') tf as re_host, re_path, re_query;
OK
tf.re_host tf.re_path tf.re_query
facebook.com /path/p1.php query=1&name=3
Time taken: 0.086 seconds, Fetched: 1 row(s)
For more functions over complex data types, see the official Hive wiki on complex type operators;
## ==Type conversion functions==
Used to convert between Hive's data types.
- `cast`
**Usage**: `cast(expr as <type>)`
**Returns**: expr converted to the target type, or NULL when the conversion fails;
**Hive Cli example**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> cast(1.0 as int) as `强类型转换`
> ;
OK
强类型转换
1
Time taken: 0.071 seconds, Fetched: 1 row(s)
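When a conversion is impossible, `cast` returns NULL rather than raising an error, which is worth checking for in ETL code; a minimal sketch:

```sql
select cast('abc' as int)         as bad_cast,      -- NULL: not a number
       cast('2020-07-06' as date) as str_to_date,   -- 2020-07-06
       cast(1 as string)          as int_to_string; -- '1'
```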
For more type-conversion functions, see the official Hive wiki on type conversion;
## ==Conditional functions==
Used for all kinds of conditional tests and filtering in Hive;
- `if`
**Usage**: `if(boolean testCondition, T valueTrue, T valueFalseOrNull)`
**Returns**: valueTrue when the condition holds, otherwise valueFalseOrNull;
- `isnull` / `isnotnull`
**Usage**: `isnull(a)` / `isnotnull(a)`
**Returns**: true or false, depending on whether a is NULL;
- `nvl`
**Usage**: `nvl(T value, T default_value)`
**Returns**: value if it is not NULL, otherwise default_value;
- `coalesce`
**Usage**: `coalesce(T v1, T v2, ...)`
**Returns**: the first non-NULL argument;
- `case`
**Usage**: `CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END`
**Returns**: c when a equals b, e when a equals d, and so on, otherwise f; if there is no `ELSE` branch and a matches none of the `WHEN` values, the result is NULL;
- `nullif`
**Usage**: `nullif(a, b)`
**Returns**: NULL if a equals b, otherwise a;
- `assert_true`
**Usage**: `assert_true(boolean condition)`
**Returns**: NULL when the condition is true; throws an exception when it is false;
**Hive Cli example for all of the above**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> if(true,'yes','no') as `if条件1`,
> if(false,'yes','no') as `if条件2`,
> isnull(null) as `isnull条件1` ,
> isnull('hello') as `isnull条件2`,
> nvl(null, 0) as `nvl条件1`,
> nvl('hello', 0) as `nvl条件2`,
> coalesce(null,null,null,'hello') as `coalese条件`,
> case when 1=1 then 'hello' else 'bye' end as `case条件`,
> nullif('a','b') as `nullif条件1`,
> nullif('a','a') as `nullif条件2`,
> assert_true(true) as `assert_true条件1`
> ;
OK
if条件1 if条件2 isnull条件1 isnull条件2 nvl条件1 nvl条件2 coalese条件 case条件 nullif条件1 nullif条件2 assert_true条件1
yes no true false 0 hello hello hello a NULL NULL
Time taken: 0.055 seconds, Fetched: 1 row(s)
hive>
> select assert_true(2<1) as `assert_true条件2`
> ;
OK
assert_true条件2
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: ASSERT_TRUE(): assertion failed.
Time taken: 0.05 seconds
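The `CASE` in the example above used the "searched" form (`case when ... then ... end`); the "simple" form listed in the usage compares one expression against each `WHEN` value, and with no `ELSE` an unmatched value yields NULL. A minimal sketch:

```sql
select case 2 when 1 then 'one' when 2 then 'two' else 'other' end as simple_case, -- two
       case 9 when 1 then 'one' end                               as no_match;     -- NULL
```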
For more conditional functions, see the official Hive wiki on conditional functions;
## ==Data masking functions==
Hive ships functions for masking sensitive data such as home addresses and phone numbers. When data is exposed for external preview, it may need desensitizing, and that is where the mask functions come in: the format of the data is preserved while the values are masked, keeping the data safe;
- `mask`
**Usage**: `mask(string str[, string upper[, string lower[, string number]]])`
**Returns**: a masked version of str; by default uppercase letters become `X`, lowercase letters become `x`, and digits become `n`, and the three optional arguments override those replacement characters;
- `mask_first_n`
**Usage**: `mask_first_n(string str[, int n])`
**Returns**: str with its first n characters masked;
- `mask_last_n`
**Usage**: `mask_last_n(string str[, int n])`
**Returns**: str with its last n characters masked;
- `mask_show_first_n`
**Usage**: `mask_show_first_n(string str[, int n])`
**Returns**: str with everything except the first n characters masked;
- `mask_show_last_n`
**Usage**: `mask_show_last_n(string str[, int n])`
**Returns**: str with everything except the last n characters masked;
- `mask_hash`
**Usage**: `mask_hash(string|char|varchar str)`
**Returns**: a hash of str, equivalent to `md5(string str)` as the output below shows; the same input always produces the same hashed result;
**Hive Cli example for all of the above**: as follows;
hive> set hive.cli.print.header=true;
hive> select
> mask('abcd-EFGH-8765-4321') as `默认打码1`
> ,mask('abcd-EFGH-8765-4321','U','l','#') as `自定义打码符号`
> ,mask_first_n('1234-5678-8765-4321', 4) as `前n位打码`
> ,mask_last_n('1234-5678-8765-4321', 4) as `后n位打码`
> ,mask_show_first_n('1234-5678-8765-4321', 4) as `排除前n位后的全部打码`
> ,mask_show_last_n('1234-5678-8765-4321', 4) as `排除后n位后的全部打码`
> ,mask_hash('abcd-EFGH-8765-4321') as `哈希打码`
> ,mask_hash('abcd-EFGH-8765-4321') as `哈希打码`
> ,md5('abcd-EFGH-8765-4321') as `md5处理`
> ;
OK
默认打码1 自定义打码符号 前n位打码 后n位打码 排除前n位后的全部打码 排除后n位后的全部打码 哈希打码 哈希打码 md5处理
xxxx-XXXX-nnnn-nnnn llll-UUUU-####-#### nnnn-5678-8765-4321 1234-5678-8765-nnnn 1234-nnnn-nnnn-nnnn nnnn-nnnn-nnnn-4321 60c713f5ec6912229d2060df1c322776 60c713f5ec6912229d2060df1c322776 60c713f5ec6912229d2060df1c322776
Time taken: 0.076 seconds, Fetched: 1 row(s)
For more masking functions, see the official Hive wiki on data masking functions;
## ==Miscellaneous functions==
Functions that grew out of various niche needs; the only one in common use is `version()`;
- `version()`
**Usage**: `version()`
**Parameters**: none;
**Returns**: the current Hive version;
**Hive Cli example for version()**: as follows;
```sql
hive> set hive.cli.print.header=true;
hive> select
> version() as `Hive版本`
> ;
OK
Hive版本
2.3.5 r76595628ae13b95162e77bba365fe4d2c60b3f29
Time taken: 0.064 seconds, Fetched: 1 row(s)
```
For more miscellaneous functions, see the official Hive wiki on miscellaneous functions;