For Hive, I use collect_set() + concat_ws() from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF.
But note that collect_set() always removes duplicate elements; if you want to keep duplicates, writing your own UDF seems to be the only choice for now.
hive> select uid, concat_ws(',', collect_set(tag)) from test group by uid;
FAILED: SemanticException [Error 10016]: Line 1:27 Argument type mismatch 'tag': Argument 2 of function CONCAT_WS must be "string or array<string>"
hive> select uid, concat_ws(',', collect_set(CAST(tag AS STRING))) from test group by uid;
...
Job 0: Map: 3  Reduce: 1   Cumulative CPU: 8.43 sec   HDFS Read: 890  HDFS Write: 18  SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 430 msec
OK
1	2,1,3
2	1,4
3	5
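The dedupe-then-join behavior of concat_ws(',', collect_set(x)) above can be sketched in plain Python (the function name concat_ws_set is mine, not a Hive API; also note Hive does not guarantee the element order inside collect_set, while this sketch keeps first-seen order):

```python
def concat_ws_set(sep, values):
    """Join values with sep, dropping duplicates while keeping
    first-seen order -- roughly what concat_ws(sep, collect_set(x))
    produces for one group in the Hive query above."""
    seen = []
    for v in values:
        s = str(v)          # the CAST(tag AS STRING) step
        if s not in seen:   # the collect_set() dedupe step
            seen.append(s)
    return sep.join(seen)   # the concat_ws() step

print(concat_ws_set(',', [2, 1, 2, 3, 1]))  # 2,1,3
```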
Impala also has a group_concat(), but it differs from MySQL's:
group_concat(string s [, string sep])
Purpose: Returns a single string representing the argument value concatenated together for each row of the result set. If the optional separator string is specified, the separator is added between each pair of concatenated values.
Return type: string
Usage notes: concat() and concat_ws() are appropriate for concatenating the values of multiple columns within the same row, while group_concat() joins together values from different rows.
By default, returns a single string covering the whole result set. To include other columns or values in the result set, or to produce multiple concatenated strings for subsets of rows, include a GROUP BY clause in the query.
group_concat(string s [, string sep]) is meant to be used together with GROUP BY, as group_concat(column, separator). For example:
[hadoop4.xxx.com:21000] > select uid, group_concat(cast(tag as string), ',') as tag_list from test3 group by uid;
Query: select uid, group_concat(cast(tag as string), ',') as tag_list from test3 group by uid
+-----+----------+
| uid | tag_list |
+-----+----------+
| 3   | 5        |
| 2   | 1,4      |
| 1   | 1,2,3    |
+-----+----------+
Returned 3 row(s) in 0.68s
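As it happens, SQLite ships a group_concat() with the same (value, separator) argument shape, so the query above can be sanity-checked locally from Python (SQLite, like Impala, does not guarantee concatenation order, which is why the check below compares element sets rather than exact strings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test3 (uid INTEGER, tag INTEGER);
    INSERT INTO test3 VALUES (1,1),(1,2),(1,3),(2,1),(2,4),(3,5);
""")
# Same shape as the Impala query: group_concat(value, separator) + GROUP BY.
rows = conn.execute("""
    SELECT uid, group_concat(CAST(tag AS TEXT), ',') AS tag_list
    FROM test3 GROUP BY uid ORDER BY uid
""").fetchall()
print(rows)  # e.g. [(1, '1,2,3'), (2, '1,4'), (3, '5')]
```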
FROM:
+-----+-----+-----+
| uid | tag | val |
+-----+-----+-----+
| 1   | 1   | 1   |
| 1   | 2   | 0   |
| 1   | 3   | 1   |
| 2   | 1   | 1   |
| 2   | 4   | 0   |
| 3   | 5   | 1   |
+-----+-----+-----+
TO:
+-----+----------+----------+----------+----------+----------+
| uid | tag1_val | tag2_val | tag3_val | tag4_val | tag5_val |
+-----+----------+----------+----------+----------+----------+
| 1   | 1        | 0        | 1        | 0        | 0        |
| 2   | 1        | 0        | 0        | 0        | 0        |
| 3   | 0        | 0        | 0        | 0        | 1        |
+-----+----------+----------+----------+----------+----------+
[hadoop4.xxx.com:21000] > select
> uid,
> max(case when tag = 1 then val else 0 end) as tag1_val,
> max(case when tag = 2 then val else 0 end) as tag2_val,
> max(case when tag = 3 then val else 0 end) as tag3_val,
> max(case when tag = 4 then val else 0 end) as tag4_val,
> max(case when tag = 5 then val else 0 end) as tag5_val
> from test2
> group by uid;
Query: select uid, max(case when tag = 1 then val else 0 end) as tag1_val, max(case when tag = 2 then val else 0 end) as tag2_val, max(case when tag = 3 then val else 0 end) as tag3_val, max(case when tag = 4 then val else 0 end) as tag4_val, max(case when tag = 5 then val else 0 end) as tag5_val from test2 group by uid
+-----+----------+----------+----------+----------+----------+
| uid | tag1_val | tag2_val | tag3_val | tag4_val | tag5_val |
+-----+----------+----------+----------+----------+----------+
| 3   | 0        | 0        | 0        | 0        | 1        |
| 2   | 1        | 0        | 0        | 0        | 0        |
| 1   | 1        | 0        | 1        | 0        | 0        |
+-----+----------+----------+----------+----------+----------+
Returned 3 row(s) in 0.99s
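The max(CASE WHEN ...) pivot is plain standard SQL, not anything Impala-specific, so the same query runs verbatim against SQLite from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test2 (uid INTEGER, tag INTEGER, val INTEGER);
    INSERT INTO test2 VALUES
        (1,1,1),(1,2,0),(1,3,1),(2,1,1),(2,4,0),(3,5,1);
""")
# One MAX(CASE ...) column per tag value turns rows into columns.
rows = conn.execute("""
    SELECT uid,
           MAX(CASE WHEN tag = 1 THEN val ELSE 0 END) AS tag1_val,
           MAX(CASE WHEN tag = 2 THEN val ELSE 0 END) AS tag2_val,
           MAX(CASE WHEN tag = 3 THEN val ELSE 0 END) AS tag3_val,
           MAX(CASE WHEN tag = 4 THEN val ELSE 0 END) AS tag4_val,
           MAX(CASE WHEN tag = 5 THEN val ELSE 0 END) AS tag5_val
    FROM test2 GROUP BY uid ORDER BY uid
""").fetchall()
print(rows)
# [(1, 1, 0, 1, 0, 0), (2, 1, 0, 0, 0, 0), (3, 0, 0, 0, 0, 1)]
```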
FROM:
+-----+----------+
| uid | tag_list |
+-----+----------+
| 1   | 1,2,3    |
| 2   | 1,4      |
| 3   | 5        |
+-----+----------+
TO:
+-----+-----+
| uid | tag |
+-----+-----+
| 1   | 1   |
| 1   | 2   |
| 1   | 3   |
| 2   | 1   |
| 2   | 4   |
| 3   | 5   |
+-----+-----+
A UNION [ALL] SELECT per element seems to be one solution. And… a stored procedure or a UDF? But Lateral View is AWESOME! I tried explode(), which splits an array into rows, combined with split(), which splits a string into an array:
hive> select uid, tag from test4 LATERAL VIEW explode(split(tag_list, ',')) tag_table as tag;
...
Job 0: Map: 1  Cumulative CPU: 1.69 sec  HDFS Read: 293  HDFS Write: 24  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 690 msec
OK
1	1
1	2
1	3
2	1
2	4
3	5
Time taken: 12.894 seconds
hive>
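The row-multiplying effect of LATERAL VIEW explode(split(...)) is easy to mimic in Python, which also makes the expected output above checkable (the test4 data here is hard-coded to match the tables in this post):

```python
# The (uid, tag_list) rows of test4 from the FROM: table above.
test4 = [(1, "1,2,3"), (2, "1,4"), (3, "5")]

# split(tag_list, ',') makes the array; explode() emits one output
# row per element, each paired with the row's uid.
rows = [(uid, int(tag))
        for uid, tag_list in test4
        for tag in tag_list.split(",")]
print(rows)  # [(1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 5)]
```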
The reverse transformation, columns back to rows, I have not figured out yet:
FROM:
+-----+----------+----------+----------+----------+----------+
| uid | tag1_val | tag2_val | tag3_val | tag4_val | tag5_val |
+-----+----------+----------+----------+----------+----------+
| 1   | 1        | 0        | 1        | 0        | 0        |
| 2   | 1        | 0        | 0        | 0        | 0        |
| 3   | 0        | 0        | 0        | 0        | 1        |
+-----+----------+----------+----------+----------+----------+
TO:
+-----+-----+-----+
| uid | tag | val |
+-----+-----+-----+
| 1   | 1   | 1   |
| 1   | 2   | 0   |
| 1   | 3   | 1   |
| 2   | 1   | 1   |
| 2   | 4   | 0   |
| 3   | 5   | 1   |
+-----+-----+-----+
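For what it's worth, the textbook unpivot in plain SQL is one SELECT per tag column glued together with UNION ALL (sketched here in SQLite; I haven't verified it on Impala). Note it cannot recover the original six rows exactly: it emits a row for every (uid, tag) pair, and a 0 in the wide table is ambiguous between "val = 0" and "this uid never had this tag":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wide (uid INTEGER, tag1_val INTEGER, tag2_val INTEGER,
                       tag3_val INTEGER, tag4_val INTEGER, tag5_val INTEGER);
    INSERT INTO wide VALUES (1,1,0,1,0,0),(2,1,0,0,0,0),(3,0,0,0,0,1);
""")
# One SELECT per column, stacked with UNION ALL -> (uid, tag, val) rows.
rows = conn.execute("""
    SELECT uid, 1 AS tag, tag1_val AS val FROM wide UNION ALL
    SELECT uid, 2, tag2_val FROM wide UNION ALL
    SELECT uid, 3, tag3_val FROM wide UNION ALL
    SELECT uid, 4, tag4_val FROM wide UNION ALL
    SELECT uid, 5, tag5_val FROM wide
    ORDER BY uid, tag
""").fetchall()
print(rows[:3])  # [(1, 1, 1), (1, 2, 0), (1, 3, 1)]
print(len(rows))  # 15 rows, not the original 6 -- the unpivot is lossy
```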