Unlike a textfile table, an ORC-format Hive table cannot be loaded directly with load data. The usual approach is to create a temporary textfile table first, then write the data into the ORC table with insert into or insert overwrite.
create table if not exists hm3.hm3_format_log_tmp
(
time string,
data string
)
partitioned by (dt string, hour string, msgtype string, action string)
row format delimited fields terminated by '\t';
create external table if not exists hm3.hm3_format_log
(
time string,
data string
)
partitioned by (dt string, hour string, msgtype string, action string)
row format delimited fields terminated by '\t'
stored as orc;
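The flow then looks like this; a hedged sketch in which the load path and partition values are placeholders:
-- load the raw text files into the temporary textfile table
load data inpath '/tmp/hm3_format_log/2018-07-01/09' into table hm3.hm3_format_log_tmp partition(dt='2018-07-01', hour='09', msgtype='helper', action='click');
-- rewrite the same partition into the ORC table
insert overwrite table hm3.hm3_format_log partition(dt='2018-07-01', hour='09', msgtype='helper', action='click')
select time, data from hm3.hm3_format_log_tmp
where dt='2018-07-01' and hour='09' and msgtype='helper' and action='click';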
Registering a custom UDF:
add jar /home/hadoop/codejar/flash_format.jar;
create temporary function gamelabel as 'com.js.dataclean.hive.udf.hm2.GameLabel';
Looking up how a function is used:
List functions whose name contains month:
show functions like '*month*';
Show the usage of add_months:
desc function add_months;
Show the detailed description of add_months, with examples:
desc function extended add_months;
Notes on developing a UDAF:
1. Import org.apache.hadoop.hive.ql.exec.UDAF and org.apache.hadoop.hive.ql.exec.UDAFEvaluator; both are required.
2. The function class must extend UDAF, and an inner Evaluator class must implement the UDAFEvaluator interface.
3. The Evaluator must implement init, iterate, terminatePartial, merge and terminate:
1) init acts like a constructor and initializes the UDAF.
2) iterate receives the input arguments and updates the internal state; it returns boolean.
3) terminatePartial takes no arguments and returns the partial aggregation state once a round of iterate calls has finished; iterate and terminatePartial play a role similar to Hadoop's Combiner.
4) merge receives the result of terminatePartial and merges it into the current state; it returns boolean.
5) terminate returns the final aggregation result.
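Once such a class is packaged into a jar, it is registered and called the same way as the UDF above. A hedged usage sketch (the jar path, class name and function name are placeholders, not real artifacts):
add jar /home/hadoop/codejar/my_udaf.jar;
create temporary function my_count as 'com.js.dataclean.hive.udaf.hm2.MyCount';
select dt, my_count(guid) from hm2.helper where dt = '2018-10-21' group by dt;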
Hive table backed by HBase: first create the HBase table in the HBase shell:
create 'flash_people','info','label'
Then create the Hive external table mapped onto it:
create external table if not exists hm2.flash_people(
guid string comment "people guid",
firsttime string comment "first ingestion time",
ip string comment "user ip",
jstimestamp bigint comment "json timestamp"
)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key,info:firsttime,info:ip,:timestamp")
tblproperties("hbase.table.name" = "hm2:flash_people");
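Once both exist, the HBase-backed table can be read like a normal Hive table; a small hedged check:
-- each Hive row maps to an HBase row in hm2:flash_people, keyed by guid
select guid, firsttime, ip from hm2.flash_people limit 10;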
hive -e is mainly used to run SQL from the command line, e.g.:
hive -e 'set mapred.reduce.tasks = 30;insert into hm2.flash_people select guid,dt,remote_addr,(32523145869-ngx_timestamp) from hm2.data where dt = "2018-07-01" and length(guid) = 38 and ngx_timestamp is not null and ngx_timestamp != "" and ngx_timestamp regexp "\\d{8}" and remote_addr is not null and remote_addr != "";'
The same statement inside the Hive CLI:
set mapred.reduce.tasks = 30;insert into hm2.flash_people select guid,dt,remote_addr,(32523145869-ngx_timestamp) from hm2.data where dt = "2018-07-01" and length(guid) = 38 and ngx_timestamp is not null and ngx_timestamp != '' and ngx_timestamp rlike '^\\d+$' and remote_addr is not null and remote_addr != '';
Some Hive tuning parameters:
set hive.auto.convert.join = false;
set hive.ignore.mapjoin.hint=false;
set hive.exec.parallel=true;
set mapred.reduce.tasks = 60;
alter table hm2.helper add columns(country string, province string, city string);
After adding columns to a Hive table, existing rows show NULL for the new columns; even after deleting the old HDFS files and re-importing the data, queries still return NULL. This issue is left unresolved here.
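A hedged note: for a partitioned table this is usually because add columns only changes the table-level schema, so existing partitions keep their old column list. On Hive versions that support it, adding the columns with the CASCADE option updates existing partition metadata as well, for example:
alter table hm2.helper add columns(country string, province string, city string) cascade;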
Adding a column of the complex type map:
alter table hm2.helper add columns(data_map map<string,string>);
hive> desc formatted hm2.helper;
OK
# col_name data_type comment
time string
uuid string
country string
province string
city string
data_map map<string,string>
# Partition Information
# col_name data_type comment
dt string
hour string
msgtype string
# Detailed Table Information
Database: hm2
Owner: hadoop
CreateTime: Wed Apr 24 10:12:30 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://nameser/hive/warehouse/hm2.db/helper
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
last_modified_by hadoop
last_modified_time 1556072221
transient_lastDdlTime 1556072221
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim \t
serialization.format \t
Time taken: 0.105 seconds, Fetched: 59 row(s)
Compare with the storage description of a table whose map column was declared at create time:
Storage Desc Params:
colelction.delim ,
field.delim \t
mapkey.delim :
serialization.format \t
So we only need to set the map delimiters via serde properties (colelction.delim, with that spelling, is the property name Hive itself uses internally):
hive> alter table hm2.helper set serdeproperties('colelction.delim' = ',', 'mapkey.delim' = ':');
OK
Time taken: 0.132 seconds
After that:
hive> desc formatted hm2.helper;
OK
# col_name data_type comment
time string
uuid string
country string
province string
city string
data_map map<string,string>
# Partition Information
# col_name data_type comment
dt string
hour string
msgtype string
# Detailed Table Information
Database: hm2
Owner: hadoop
CreateTime: Wed Apr 24 10:12:30 CST 2019
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: hdfs://nameser/hive/warehouse/hm2.db/helper
Table Type: EXTERNAL_TABLE
Table Parameters:
EXTERNAL TRUE
last_modified_by hadoop
last_modified_time 1556072669
transient_lastDdlTime 1556072669
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
colelction.delim ,
field.delim \t
mapkey.delim :
serialization.format \t
Time taken: 0.079 seconds, Fetched: 61 row(s)
That is all that is needed; the map column now works.
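A hedged sketch of reading the new column once rows carry map data (the key name and partition values are placeholders):
-- map values are addressed by key; str_to_map('k1:v1,k2:v2', ',', ':') builds such a map from a delimited string
select uuid, data_map['some_key'] from hm2.helper where dt = '2019-04-24' and hour = '10' and msgtype = 'helper' limit 10;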
CREATE TABLE test (
creatingTs BIGINT,
a STRING,
b BIGINT,
c STRING,
d STRING,
e BIGINT,
f BIGINT
);
To drop column f, use REPLACE COLUMNS and list every column except f:
ALTER TABLE test REPLACE COLUMNS (
creatingTs BIGINT,
a STRING,
b BIGINT,
c STRING,
d STRING,
e BIGINT
);
Add a column:
alter table of_test add columns (judegment int);
Configuration for insert, update and delete (ACID): add the following to hive-site.xml:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
Create the table:
create external table if not exists hm2.history_helper
(
guid string,
starttime string,
endtime string,
num int
)
clustered by(guid) into 50 buckets
stored as orc TBLPROPERTIES ('transactional'='true');
Note: the create statement must include a clustered by ... into N buckets clause and the stored as orc TBLPROPERTIES ('transactional'='true') clause, and must not include a sorted by clause. With that, the table supports insert, update and delete.
Caveat: the rules above work on hive-1.2.1; on hive-2.1.1 the external keyword has to be dropped, i.e. newer versions no longer support update on external tables.
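For example, a minimal hedged sketch of the DML this enables (the values are made up):
insert into table hm2.history_helper values ('guid-0001', '20190501100000', '20190501100500', 1);
update hm2.history_helper set num = 2 where guid = 'guid-0001';
delete from hm2.history_helper where guid = 'guid-0001';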
Scenario:
If an insert into or insert overwrite job is killed manually partway through, the table can end up stuck: no MapReduce job can be submitted, only queries still work, and drop or insert statements hang forever no matter how long you wait.
A temporary workaround is to switch to a different table name.
The root cause is that the table, or one of its partitions, is locked and has to be unlocked.
show locks <table_name>; shows whether a table is locked.
Unlocking:
unlock table <table_name>; -- unlock the table
unlock table <table_name> partition(dt='2014-04-01'); -- unlock a single partition
Note: table locks and partition locks are separate locks; unlocking the table does not unlock its partitions, which must be unlocked individually.
On newer Hive versions, data being inserted cannot be queried by default, because of the lock:
hive> show locks;
OK
test@helper EXCLUSIVE
Workaround: switch off the transactional lock manager:
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager; -- this is the default value
set hive.support.concurrency=false; -- default is true
Viewing table schema information:
desc formatted table_name;
desc table_name;
hive -e 'load data inpath "/data/MROutput/hm2_data_gamelabel_output2/part-r-*" into table hm2.game_label partition(dt="2018-10-11");'
Note: the HDFS path after inpath must be quoted.
Example of building the hive command string in Python:
cmd = 'hive -e \'load data inpath "%s/part-r-*" into table hm2.game_label partition(dt=%s);\''%(outpath, formatDate(day))
When does a Hive job end up with only one reducer?
You will often find that, no matter how large the input is and regardless of whether you set the reducer-count parameters, a job runs with a single reduce task. Besides the case where the input is smaller than the hive.exec.reducers.bytes.per.reducer threshold, the usual causes are an aggregation without a group by, a global order by, or a Cartesian product.
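For instance, a global order by runs with a single reducer no matter what mapred.reduce.tasks is set to; a hedged illustration using a table from the earlier examples:
set mapred.reduce.tasks = 30;
-- the final order by stage still uses 1 reducer: a total order needs all rows in one place
select guid, count(*) cou from hm2.helper where dt = '2018-10-21' group by guid order by cou desc;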
set mapreduce.map.memory.mb = 4096 ;
set mapreduce.reduce.memory.mb = 4096 ;
set mapreduce.map.java.opts=-Xmx3686m;
set mapreduce.reduce.java.opts=-Xmx3428m;
Below is the test procedure for the hive.exec.parallel parameter.
Test SQL:
select r1.a
from
(select t.a from sunwg_10 t join sunwg_10000000 s on t.a=s.b) r1
join
(select s.b from sunwg_100000 t join sunwg_10 s on t.a=s.b) r2
on (r1.a=r2.b);
set hive.exec.parallel=false;
set hive.exec.parallel=true;
set mapred.reduce.tasks = 15;
Summary:
When resources are sufficient, hive.exec.parallel lets SQL that contains independent, parallelizable jobs run faster, at the cost of consuming more resources.
It is worth evaluating whether hive.exec.parallel helps our refresh jobs.
set mapred.max.split.size=256000000; -- maximum amount of input handled by each map task, in bytes
Under CDH 6.0.1, setting:
set hive.execution.engine=spark;
switches the default MapReduce execution engine to Spark.
On Apache Hadoop, Hive on Spark has to be configured manually.
Once you understand how a Spark job runs, the resource-related parameters are easy to follow. Spark resource tuning essentially means adjusting the parameters that control how Spark uses resources at each point of execution, so that resources are used more efficiently and the job runs faster. The parameters below are the main resource parameters in Spark; each corresponds to a part of the job's execution model.
num-executors/spark.executor.instances
executor-memory/spark.executor.memory
executor-cores/spark.executor.cores
driver-memory
spark.default.parallelism
spark.storage.memoryFraction
spark.shuffle.memoryFraction
Resolving the executor memory limit when running Spark on YARN.
Cluster: Spark 2.2.0 + Hadoop 3.0-CDH 6.0.1.
With Hive on Spark, setting:
hive> set hive.execution.engine=spark;
hive> set spark.executor.memory=31.5g;
hive> set spark.executor.cores=11;
hive> set spark.serializer=org.apache.spark.serializer.KryoSerializer;
the job fails, reporting insufficient memory:
Failed to execute spark task, with exception 'org.apache.hadoop.hive.ql.metadata.HiveException(Failed to create Spark client for Spark session 50288c8b-96aa-44ad-9eea-3cb4abb1ae5b)'
FAILED: Execution Error, return code 30041 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Failed to create Spark client for Spark session 50288c8b-96aa-44ad-9eea-3cb4abb1ae5b
Solution: adjust the YARN configuration:
1. yarn.nodemanager.resource.memory-mb (container memory): set it to at least executor-memory (15g) + driver (512m); in the example it can be configured as 16g.
2. yarn.scheduler.maximum-allocation-mb (maximum container allocation): set it to at least executor-memory (15g) + driver (512m); again 16g in the example.
The first parameter is a NodeManager setting; the second is a ResourceManager setting.
String concatenation:
concat(str1, str2, str3, ...) concatenates the strings
concat_ws(separator, str1, str2, str3, ...) concatenates the strings with separator between them
Substring:
substr(s, 0, 1) takes the first character
substr(s, -1) takes the last character
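A quick illustration with literal values:
select concat('2018', '-', '07');        -- 2018-07
select concat_ws('-', '2018', '07');     -- 2018-07
select substr('hive', 0, 1);             -- h
select substr('hive', -1);               -- e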
The MapJoin optimization works like this: before the MapReduce task starts, a local task loads the small table into memory as a hashtable, serializes it to disk, and compresses the in-memory hashtable into a tar file. That file is then pushed to the Hadoop Distributed Cache and shipped to every mapper; each mapper deserializes it locally, loads it into memory, and performs the join there.
SQL:
select workflow,count(workflow) from (select guid, substr(workflow, -1) workflow from hm2.workflow_list) m right join hm2.helper helper on m.guid = helper.guid and helper.dt = "2018-10-21" group by workflow;
Workaround when this runs out of memory:
set hive.auto.convert.join = false;
set hive.ignore.mapjoin.hint=false;
set hive.exec.parallel=true;
Broadly speaking, joins in Hive fall into Common Join (the join is completed in the Reduce phase) and Map Join (the join is completed in the Map phase). The following is a brief look at how each works.
If MapJoin is not requested, or its conditions are not met, the Hive planner turns the join into a Common Join, i.e. the join happens in the Reduce phase.
The whole process goes through Map, Shuffle and Reduce phases.
Map phase
Read the source tables; the map output key is the column in the join on condition, or the combination of columns when there are several join keys.
The map output value holds the columns needed afterwards (those referenced in select or where), plus a tag marking which table the value came from.
The output is sorted by key.
Shuffle phase
Rows are hashed by key and pushed to reducers according to the hash value, so that identical keys from the two tables land in the same reducer.
Reduce phase
The join is completed per key, with the tag used to tell the two tables' rows apart.
Take the following HQL as an example:
SELECT
a.id,a.dept,b.age
FROM a join b
ON (a.id = b.id);
MapJoin is normally used when a very small table is joined with a large table. How small counts as small is decided by hive.mapjoin.smalltable.filesize, which is the total size of the small table; the default is 25000000 bytes, i.e. 25 MB.
Before Hive 0.7 you had to give the hint /*+ mapjoin(table) */ for a MapJoin to be executed, otherwise a Common Join was used; since 0.7 the conversion to Map Join happens automatically, controlled by hive.auto.convert.join, which defaults to true.
Sticking with the HQL above, suppose a is a large table, b is a small table, and hive.auto.convert.join=true; Hive then converts the join into a MapJoin automatically.
The flow starts with Task A, a Local Task that runs on the client: it scans the small table b, builds a HashTable from its data, writes it to a local file, and loads that file into the DistributedCache. The HashTable can be pictured as:
key | value |
---|---|
1 | 26 |
2 | 34 |
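A hedged sketch of the two ways to get a MapJoin, reusing the a/b tables from the example above:
-- explicit hint (needed before Hive 0.7, or when auto conversion is turned off)
select /*+ mapjoin(b) */ a.id, a.dept, b.age from a join b on (a.id = b.id);
-- with hive.auto.convert.join=true (the default), the plain join is converted automatically
-- as long as b is smaller than hive.mapjoin.smalltable.filesize
set hive.auto.convert.join = true;
select a.id, a.dept, b.age from a join b on (a.id = b.id);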
hive> select split('a:1||b:2||c:3','\\|\\|') from hm2.test;
OK
["a:1","b:2","c:3"]
["a:1","b:2","c:3"]
["a:1","b:2","c:3"]
["a:1","b:2","c:3"]
Other characters that need the same escaping include { and [.
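For example (a hedged illustration), splitting on [ needs the same double-backslash escape:
select split('a[1[b[2', '\\[');   -- ["a","1","b","2"]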
insert into tbName select * from tbName2;
insert overwrite table tbName select * from tbName2;
An insert overwrite example:
insert overwrite table hm2.helper partition(dt = '2018-06-22', hour = '09',msgtype = 'helper') select time,source,remote_addr,remote_user,body_bytes_sent,request_time,status,host,request_method,http_referrer,http_x_forwarded_for,http_user_agent,upstream_response_time,upstream_addr,guid,helperversion,osversion,ngx_timestamp,get_type,split(ip2area(http_x_forwarded_for,remote_addr), "\t")[0] country,split(ip2area(http_x_forwarded_for,remote_addr), "\t")[1] province,split(ip2area(http_x_forwarded_for,remote_addr), "\t")[2] city from hm2.helper where dt = '2018-06-22' and hour = '09' and msgtype = 'helper';
When inserting into a partitioned table, the partitions do not have to be specified explicitly; dynamic partitioning resolves them automatically:
INSERT overwrite TABLE test.dis_helper PARTITION (dt,hour,msgtype) select `(num)?+.+` from (select *,row_number() over (partition by guid order by time asc) num from hm2.helper where dt ='2018-09-06'and hour between '00' and '23' and msgtype='helper') t where t.num=1;
This deduplicates the data and inserts it into the partitioned table test.dis_helper, routing each row into a partition based on its dt, hour and msgtype values; `(num)?+.+` selects every column except num (on newer Hive versions this regex column syntax requires set hive.support.quoted.identifiers=none).
For a query like this:
explain
select
workflow,count(workflow) cou from (select guid, split(split(workflow,"\\|\\|")[0], ":")[1] workflow
from
hm2.workflow_list) m
inner join
hm2.flash flash
on m.guid = flash.guid and flash.dt = "2018-11-04"
group by workflow order by cou;
explain prints the execution plan:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-2 depends on stages: Stage-1
Stage-3 depends on stages: Stage-2
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: workflow_list
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: guid is not null (type: boolean)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Select Operator
expressions: guid (type: string), split(split(workflow, '\|\|')[0], ':')[1] (type: string)
outputColumnNames: _col0, _col1
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column stats: NONE
value expressions: _col1 (type: string)
TableScan
alias: flash
Statistics: Num rows: 489153811 Data size: 48915382339 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: guid is not null (type: boolean)
Statistics: Num rows: 244576906 Data size: 24457691219 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: guid (type: string)
sort order: +
Map-reduce partition columns: guid (type: string)
Statistics: Num rows: 244576906 Data size: 24457691219 Basic stats: COMPLETE Column stats: NONE
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 _col0 (type: string)
1 guid (type: string)
outputColumnNames: _col1
Statistics: Num rows: 269034602 Data size: 26903460924 Basic stats: COMPLETE Column stats: NONE
Group By Operator
aggregations: count(_col1)
keys: _col1 (type: string)
mode: hash
outputColumnNames: _col0, _col1
Statistics: Num rows: 269034602 Data size: 26903460924 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-2
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Map-reduce partition columns: _col0 (type: string)
Statistics: Num rows: 269034602 Data size: 26903460924 Basic stats: COMPLETE Column stats: NONE
value expressions: _col1 (type: bigint)
Reduce Operator Tree:
Group By Operator
aggregations: count(VALUE._col0)
keys: KEY._col0 (type: string)
mode: mergepartial
outputColumnNames: _col0, _col1
Statistics: Num rows: 134517301 Data size: 13451730462 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
Reduce Output Operator
key expressions: _col1 (type: bigint)
sort order: +
Statistics: Num rows: 134517301 Data size: 13451730462 Basic stats: COMPLETE Column stats: NONE
value expressions: _col0 (type: string)
Reduce Operator Tree:
Select Operator
expressions: VALUE._col0 (type: string), KEY.reducesinkkey0 (type: bigint)
outputColumnNames: _col0, _col1
Statistics: Num rows: 134517301 Data size: 13451730462 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 134517301 Data size: 13451730462 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
explode
Its main use is turning one row into multiple rows based on a field.
select mt.impress,count(mt.guid) from (select (case when t.impressed = '' or t.impressed is null then 'null' else t.impressed end) impress,t.guid from (select guid,impressed from hm2.install lateral view explode(split(replace(replace(offer_impressed,'[',''),']',''), ',')) test_alias as impressed where dt='2018-12-17') t) mt group by mt.impress;
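A simpler hedged illustration of the same idea, reusing the hm2.test table from the split example above:
-- each input row of hm2.test expands into three rows: a, b, c
select ex.item from hm2.test lateral view explode(split('a,b,c', ',')) ex as item;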
row_number() over(partition by)
The table hm4.ffbrowser_domain_url_2019_07_17 contains data like:
www.baidu.com https://www.baidu.com/link?url=fCBMjU_THQzoAdo0RTAQLQIfVWIlPlgzpEM5Pk5qChKiahWFfMzFo6ckRjd9OFc7w8cj5h8esZXKBab5WqeLPgDYipewXUdz9LFPf-oxOfK&wd=&eqid=d2c80e8e0033424b000000045d2f14e9 1
www.baidu.com https://www.baidu.com/link?url=3jlP1HLS4aYSnN1L5CG3pPr0zXnqBpDdG8mlFQm47w1RcHFwHiBU0t8Hi0UrMD37lSJvQkGWQ3iBtNpc0AJhEei-v8MdGKgRnVqy62tuCA_&wd=&eqid=e3d5845b002790f7000000045d2f0af8 1
www.baidu.com http://www.baidu.com/ 1
www.baidu.com https://www.baidu.com/link?url=8S_ziMFwpClJww3C15iXu__wqMrMOxnYuDnZDpQWnbs1PTTqx_wwjIY7QsrFfaKT&wd=&eqid=eb8f38f70039cf51000000045d2e8716 1
www.baidu.com https://www.baidu.com/link?url=AZvluWbTjZjpaT5lnIpkB-gTIdyX_nZdtoLX_pkbM5i&wd=&eqid=8b09c549000038e7000000035930ca9a 1
www.baidu.com https://www.baidu.com/link?url=IjStquL7c4YwVDk1zQJFYkwBiGY20Kv2PQsXuTQTHH0BhAPLjUaz-XhLLp5Zfe3fE4hU_KNfEs6JxyESkwGlea&wd=&eqid=e7e404c9000012b1000000045d2ee845 1
www.baidu.com https://www.baidu.com/link?url=qRaLKHc_ZZIskkWF_f6azkmHqRlfgmuRQZcrzRovBC5MEBR5yTIG20FiR3O__8Jz&wd=&eqid=e13f05290018e7fb000000045d2eed7d 2
www.baidu.com https://www.baidu.com/s?tn=50000201_hao_pg&usm=3&wd=%E7%AB%AF%E7%81%AB%E9%94%85%E6%B3%BC%E5%A6%BB%E5%AD%90%E5%90%8C%E5%AD%A6&ie=utf-8&rsv_cq=%E4%BA%BA%E7%B1%BB%E7%99%BB%E6%9C%8850%E5%91%A8%E5%B9%B4&rsv_dl=0_right_fyb_pchot_20811_01&rsf=531e82477396136261c6275e8afa58b1_1_10_8&rqid=e3d5845b002790f7 1
www.baidu.com https://www.baidu.com/link?url=NHmzZVrcbQ1tf6JnR4MJlHXJZFy-4RMgKwjNeDvskMyl17vpdi_8XgVCdRvGFtU2WJpNpHQf4VbwIeQi5qDHskDTrDUK5KMUkrkfKcWYxhy&wd=&eqid=e3d5845b002790f7000000045d2f0af8 1
www.baidu.com https://www.baidu.com/s?wd=%E5%BE%AE%E5%8D%9A&ie=utf-8 1
Query:
select a.domain,a.url,a.cou from (select domain,url,cou,row_number() over(partition by domain order by cou desc) as n from hm4.ffbrowser_domain_url_2019_07_17)a where a.n <= 2;
Result:
www.baidu.com https://www.baidu.com/ 69
www.baidu.com https://www.baidu.com/link?url=r6riiF-vxG9OX70KBVx86FuywJYXHu-TpTTSEst9ggK78xIjVvkI_QoS9tEDBAqq&wd=&eqid=ba409e160014f8c9000000045eec97 3
A similar problem: the top N scores for each course.
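A hedged sketch for that variant, assuming a hypothetical table score(student string, course string, score int):
select course, student, score
from (
  select course, student, score,
         row_number() over (partition by course order by score desc) rn
  from score
) t
where t.rn <= 3;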
select reginlistlength, softname, sum(cou) from (select (case when reginlistlength > '1' then 'more' else reginlistlength end) as reginlistlength, softname, count(l.guid) cou from (select reginlistlength, softname, guid from hm2.lnk where dt = '2019-05-24' and softtype = '1' group by guid,reginlistlength, softname, guid)l where l.guid in (select guid from hm2.helper where dt = '2019-05-24' group by guid) group by reginlistlength, softname)m group by reginlistlength, softname;
select (case when reginlistlength > '1' then 'more' else reginlistlength end) as reginlistlength, softname, count(l.guid) cou from (select reginlistlength, softname, guid from hm2.lnk where dt = '2019-05-24' and softtype = '1' group by guid,reginlistlength, softname, guid)l where l.guid in (select guid from hm2.helper where dt = '2019-05-24' group by guid) group by reginlistlength, softname;