Word Count with Hive

Overview

hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar wordcount: a map/reduce program that counts the words in the input files.

Here, the same word count is implemented with Hive instead of writing MapReduce code directly.

Experiment steps

1. Start HDFS and import the data

[hadoop@hadoop-master ~]$ start-dfs.sh 
Starting namenodes on [hadoop-master]
hadoop-master: starting namenode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-hadoop-namenode-hadoop-master.out
localhost: starting datanode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-hadoop-datanode-hadoop-master.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-hadoop-secondarynamenode-hadoop-master.out

[hadoop@hadoop-master ~]$ start-yarn.sh 
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/yarn-hadoop-resourcemanager-hadoop-master.out
localhost: starting nodemanager, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/yarn-hadoop-nodemanager-hadoop-master.out

[hadoop@hadoop-master ~]$ hdfs dfs -mkdir -p /wordcount/input
[hadoop@hadoop-master ~]$ hdfs dfs -ls -R  /wordcount
drwxr-xr-x   - hadoop supergroup          0 2017-12-30 18:10 /wordcount
drwxr-xr-x   - hadoop supergroup          0 2017-12-30 18:59 /wordcount/input
-rw-r--r--   1 hadoop supergroup         80 2017-12-30 18:59 /wordcount/input/1.log


[hadoop@hadoop-master ~]$ cat /tmp/computerskills.txt 
nginx            scala         lua
openresty        scala         lua
haproxy          scala         openresty
keeplive         scala         openresty
oracle           lua           mysql
postgensql       hadoop        hive
redis            hadoop        hive
mencache         hadoop        hive
elasticsearch    elasticsearch hive
kafka            kafka         kafka
hadoop           redis         python
zookeeper        hive          python
hive             python        python
shell            shell         hive
awk              hadoop        hive
sed              nginx         nginx 
lua              awk           mysql
python           mysql         mysql 
java             
scala

[hadoop@hadoop-master ~]$ hdfs dfs -put /tmp/computerskills.txt /wordcount/input

[hadoop@hadoop-master ~]$ hdfs dfs -ls  /wordcount/input
Found 1 items
-rw-r--r--   1 hadoop supergroup        693 2017-12-30 23:40 /wordcount/input/computerskills.txt

2. Start Hive, create a table, and load the data

[hadoop@hadoop-master hadoop]$ hive

Logging initialized using configuration in jar:file:/usr/local/apache-hive-1.1.0-cdh5.7.0-bin/lib/hive-common-1.1.0-cdh5.7.0.jar!/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

hive> create table computerskills_wordcount (line string);
OK
Time taken: 6.16 seconds
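
The table is deliberately a single string column with no row format clause, so each raw line of the file is stored as one row; this is exactly the shape the tokenizing query in step 3 expects. The schema can be confirmed with:

hive> describe computerskills_wordcount;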

hive> load data inpath '/wordcount/input/computerskills.txt' into table computerskills_wordcount;
Loading data to table default.computerskills_wordcount
Table default.computerskills_wordcount stats: [numFiles=1, totalSize=693]
OK
Time taken: 2.448 seconds
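
Note that load data inpath moves the file from /wordcount/input into the table's warehouse directory rather than copying it, so the file disappears from its original HDFS path afterwards. If the uploaded file should stay where it is, an external table pointed at the directory is an alternative; a minimal sketch (the table name is illustrative):

hive> create external table computerskills_wordcount_ext (line string)
    > location '/wordcount/input';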

hive> select line from computerskills_wordcount;
OK
nginx            scala         lua
openresty        scala         lua
haproxy          scala         openresty
keeplive         scala         openresty
oracle           lua           mysql
postgensql       hadoop        hive
redis            hadoop        hive
mencache         hadoop        hive
elasticsearch    elasticsearch hive
kafka            kafka         kafka
hadoop           redis         python
zookeeper        hive          python
hive             python        python
shell            shell         hive
awk              hadoop        hive
sed              nginx         nginx 
lua              awk           mysql
python           mysql         mysql 
java             
scala
Time taken: 1.535 seconds, Fetched: 20 row(s)

3. Complete the MapReduce word count with HQL
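
The query below builds the count in two stages: split(line, ' ') tokenizes each line into an array, explode flattens that array into one row per token, and the outer query groups the tokens and counts them. Running the inner subquery by itself shows the token stream; a minimal sketch (the limit is only there to keep the output short):

hive> select explode(split(line, ' ')) as word from computerskills_wordcount limit 5;

Note that split takes a regular expression, and because the file pads its columns with runs of spaces, splitting on a single space also emits empty-string tokens; this will show up in the final result below.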

hive> select word,count(1) as count from (select explode(split(line,' ')) as word from computerskills_wordcount) word group by word order by count;

Query ID = hadoop_20171231135050_49285687-2446-45a7-8488-ec5fdca24a70
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1514699364714_0001, Tracking URL = http://hadoop-master:8088/proxy/application_1514699364714_0001/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1514699364714_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-12-31 13:51:51,392 Stage-1 map = 0%,  reduce = 0%
2017-12-31 13:52:35,396 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 18.56 sec
2017-12-31 13:53:09,157 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 24.79 sec
2017-12-31 13:53:13,212 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 31.58 sec
MapReduce Total cumulative CPU time: 31 seconds 580 msec
Ended Job = job_1514699364714_0001
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1514699364714_0002, Tracking URL = http://hadoop-master:8088/proxy/application_1514699364714_0002/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1514699364714_0002
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-12-31 13:54:00,966 Stage-2 map = 0%,  reduce = 0%
2017-12-31 13:54:30,809 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 6.7 sec
2017-12-31 13:55:00,188 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 17.29 sec
MapReduce Total cumulative CPU time: 17 seconds 290 msec
Ended Job = job_1514699364714_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 31.58 sec   HDFS Read: 7595 HDFS Write: 645 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 20.76 sec   HDFS Read: 5151 HDFS Write: 197 SUCCESS
Total MapReduce CPU Time Spent: 52 seconds 340 msec
OK
zookeeper   1
sed 1
postgensql  1
oracle  1
mencache    1
keeplive    1
java    1
haproxy 1
shell   2
elasticsearch   2
redis   2
awk 2
openresty   3
nginx   3
kafka   3
lua 4
mysql   4
hadoop  5
scala   5
python  5
hive    8
    324
Time taken: 273.32 seconds, Fetched: 22 row(s)
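
The row with an empty name and count 324 is the empty-string token: since the input pads its columns with runs of spaces, split(line, ' ') produces an empty token for every extra space. Because split accepts a regular expression, splitting on whitespace runs and discarding any remaining empty tokens gives a clean result. A sketch of the adjusted query (same table, only the tokenizer and a filter changed):

hive> select word, count(1) as count
    > from (select explode(split(line, '\\s+')) as word from computerskills_wordcount) w
    > where word != ''
    > group by word
    > order by count;

As before, order by count sorts ascending; add desc after count to list the most frequent words first.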
