Basic Hive operations on the Sogou Labs word-frequency file

1: Download the Sogou Labs word-frequency file (SogouW): http://www.sogou.com/labs/dl/w.html

[jifeng@jifeng02 ~]$ wget http://download.labs.sogou.com/dl/sogoulabdown/SogouW/SogouW.tar.gz
[jifeng@jifeng02 ~]$ tar zxf SogouW.tar.gz
[jifeng@jifeng02 ~]$ cd Freq
[jifeng@jifeng02 Freq]$ ls -l
-rw-r--r--. 1 jifeng jifeng 2961911 8月  21 09:34 SogouLabDic.dic
-rw-r--r--. 1 jifeng jifeng     217 10月 11 2006 ???????.txt

2: Convert the encoding of the word-frequency file

The dictionary is GBK-encoded, so it displays as mojibake on a UTF-8 Linux system (the ???????.txt filename in the listing above is the same problem). Convert it to UTF-8:

[jifeng@jifeng02 Freq]$ file SogouLabDic.dic 
SogouLabDic.dic: ISO-8859 text
[jifeng@jifeng02 Freq]$ iconv -f gbk -t utf8 SogouLabDic.dic >SogouLabDic1.dic
[jifeng@jifeng02 Freq]$ file SogouLabDic1.dic 
SogouLabDic1.dic: UTF-8 Unicode text
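
The conversion went through cleanly here. If iconv ever aborts partway with an "illegal input sequence" error, the -c flag (or a //IGNORE suffix on the target encoding) tells it to drop the offending characters instead of stopping; both are standard glibc iconv options:

# drop characters that cannot be converted rather than aborting
iconv -c -f gbk -t utf8 SogouLabDic.dic > SogouLabDic1.dic
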
Then split the converted file into comma-separated fields and write it to a new file:

[jifeng@jifeng02 Freq]$ cat SogouLabDic1.dic | awk '{print $1","$2","$3}' > dic.dic
[jifeng@jifeng02 Freq]$ file dic.dic 
dic.dic: UTF-8 Unicode text
[jifeng@jifeng02 Freq]$ ls -l
总用量 5804
-rw-rw-r--. 1 jifeng jifeng 2961911 8月  21 10:12 dic.dic
-rw-r--r--. 1 jifeng jifeng 2961911 8月  21 09:34 SogouLabDic.dic
-rw-r--r--. 1 jifeng jifeng     217 10月 11 2006 ???????.txt
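
As a quick sanity check (a sketch using standard head/awk; it assumes the source fields themselves contain no commas), confirm that every line of dic.dic now has exactly three comma-separated fields:

head -3 dic.dic
# prints 0 if every line split into exactly three fields
awk -F',' 'NF != 3 {bad++} END {print bad+0}' dic.dic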

3: Working with the data in Hive

1) Create the table: create table dic(word string,num string,class string)row format delimited fields terminated by ',';

2) Load the file: load data local inpath '/home/jifeng/hadoop/Freq/dic.dic' into table dic;

3) Sort by the Chinese word and take the first 10 rows:

select word from dic order by word limit 10;

This implements the equivalent of select top 10 * from dic in T-SQL.
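
One caveat: num was declared as string, so sorting on it would compare lexicographically ("9" sorts after "10"). To get the 10 highest-frequency words rather than the first 10 in dictionary order, cast it to an int (a sketch, not run in this session):

select word, cast(num as int) as freq
from dic
order by freq desc
limit 10;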

hive> create table dic(word string,num string,class string)row format delimited fields terminated by ',';
OK
Time taken: 0.194 seconds
hive> load data local inpath '/home/jifeng/hadoop/Freq/dic.dic' into table dic;  
Copying data from file:/home/jifeng/hadoop/Freq/dic.dic
Copying file: file:/home/jifeng/hadoop/Freq/dic.dic
Loading data to table default.dic
Table default.dic stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 2961911, raw_data_size: 0]
OK
Time taken: 0.281 seconds
hive> select word from dic order by word limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0004, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0004
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201408202333_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-08-21 10:15:54,411 Stage-1 map = 0%,  reduce = 0%
2014-08-21 10:15:56,430 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:15:57,439 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:15:58,448 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:15:59,459 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:16:00,469 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:16:01,477 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:16:02,482 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 0.61 sec
2014-08-21 10:16:03,489 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.1 sec
2014-08-21 10:16:04,504 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.1 sec
MapReduce Total cumulative CPU time: 1 seconds 100 msec
Ended Job = job_201408202333_0004
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.1 sec   HDFS Read: 2962117 HDFS Write: 97 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 100 msec
OK
一一
一一七
一一三
一一九
一一二
一一四
一一点
一丁点儿
一七
一七三
Time taken: 13.937 seconds, Fetched: 10 row(s)
hive> 

A test of sort by (see the note after this transcript for why it launches two jobs):

select word from dic sort by word limit 10; 

hive> select word from dic sort by word limit 10;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0014, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0014
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201408202333_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-08-21 13:19:44,026 Stage-1 map = 0%,  reduce = 0%
2014-08-21 13:19:46,040 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:47,045 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:48,052 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:49,058 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:50,065 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:51,071 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:52,077 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:53,083 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.05 sec
2014-08-21 13:19:54,089 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1.05 sec
MapReduce Total cumulative CPU time: 1 seconds 50 msec
Ended Job = job_201408202333_0014
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0015, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0015
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201408202333_0015
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2014-08-21 13:19:56,360 Stage-2 map = 0%,  reduce = 0%
2014-08-21 13:19:58,372 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:19:59,377 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:00,385 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:01,391 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:02,398 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:03,402 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:04,407 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:05,413 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 0.78 sec
2014-08-21 13:20:06,420 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 0.78 sec
MapReduce Total cumulative CPU time: 780 msec
Ended Job = job_201408202333_0015
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 1.05 sec   HDFS Read: 2962117 HDFS Write: 363 SUCCESS
Job 1: Map: 1  Reduce: 1   Cumulative CPU: 0.78 sec   HDFS Read: 819 HDFS Write: 97 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
一一
一一七
一一三
一一九
一一二
一一四
一一点
一丁点儿
一七
一七三
Time taken: 25.807 seconds, Fetched: 10 row(s)
hive> 
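
Why two jobs? order by imposes a total order by funneling all rows through a single reducer, so one job is enough. sort by only guarantees order within each reducer, so Hive schedules a first job in which each reducer sorts its share and applies limit 10 locally, then a second single-reducer job that merges those per-reducer candidates into the global top 10; that is why the output matches order by exactly. Since this run used a single reducer anyway, the first stage was already globally sorted. To see sort by genuinely diverge, force multiple reducers (a sketch, assuming it is re-run in the same session):

set mapred.reduce.tasks=2;
-- without a limit, the output is now only sorted within each reducer's partition
select word from dic sort by word;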

4: Working with the Chinese collocation library (SogouR): http://www.sogou.com/labs/dl/r.html

Extract: tar zxf SogouR.tar.gz
Convert: iconv -f gbk -t utf8 SogouR.txt > RDic.dic
cat RDic.dic | awk '{print $1","$2}' > RDic2.dic

Create the table in Hive: create table rdic(word string,num int)row format delimited fields terminated by ',';

Import:
load data local inpath '/home/jifeng/hadoop/Freq/RDic2.dic' into table rdic;

Count the rows: 18,399,496 in total.

hive> select count(*) from rdic ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0016, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0016
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201408202333_0016
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-08-21 13:24:46,700 Stage-1 map = 0%,  reduce = 0%
2014-08-21 13:24:54,742 Stage-1 map = 25%,  reduce = 0%
2014-08-21 13:24:55,748 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 4.36 sec
2014-08-21 13:24:56,757 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 4.36 sec
2014-08-21 13:24:57,769 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:24:58,774 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:24:59,782 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:00,793 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:01,800 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:02,807 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:03,812 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 12.77 sec
2014-08-21 13:25:04,818 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 14.24 sec
2014-08-21 13:25:05,823 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 14.24 sec
2014-08-21 13:25:06,828 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 14.24 sec
MapReduce Total cumulative CPU time: 14 seconds 240 msec
Ended Job = job_201408202333_0016
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 14.24 sec   HDFS Read: 333094897 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 240 msec
OK
18399496
Time taken: 22.527 seconds, Fetched: 1 row(s)
hive> 
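
Since rdic declares num as an int, the string-sorting caveat from earlier does not apply here; a direct top 10 by collocation count would be (a sketch, not run in this session):

select word, num from rdic order by num desc limit 10;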

Sorting test on the larger table:

select word from rdic order by word limit 10;
select word from rdic sort by word limit 10;

hive> select word from rdic order by word limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0017, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0017
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201408202333_0017
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-08-21 13:26:42,060 Stage-1 map = 0%,  reduce = 0%
2014-08-21 13:26:52,102 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:53,106 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:54,110 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:55,116 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:56,120 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:57,132 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:58,144 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:59,149 Stage-1 map = 75%,  reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:00,153 Stage-1 map = 75%,  reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:01,157 Stage-1 map = 75%,  reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:02,163 Stage-1 map = 75%,  reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:03,168 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:04,173 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:05,179 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:06,183 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:07,187 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:08,193 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:09,198 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.99 sec
2014-08-21 13:27:10,203 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.99 sec
MapReduce Total cumulative CPU time: 26 seconds 990 msec
Ended Job = job_201408202333_0017
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 26.99 sec   HDFS Read: 333094897 HDFS Write: 96 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 990 msec
OK
2-一个
2-一个字
2-一个月
2-一二
2-一些
2-一休
2-一共
2-一台
2-一周
2-一套
Time taken: 32.492 seconds, Fetched: 10 row(s)
hive> select word from rdic sort by word limit 10;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0018, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0018
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201408202333_0018
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-08-21 13:27:45,948 Stage-1 map = 0%,  reduce = 0%
2014-08-21 13:27:55,990 Stage-1 map = 13%,  reduce = 0%
2014-08-21 13:27:56,995 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:27:57,999 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:27:59,004 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:00,007 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:01,011 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:02,015 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:03,019 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:04,021 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:05,027 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:06,031 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:07,036 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:08,040 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:09,045 Stage-1 map = 88%,  reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:10,053 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:11,063 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:12,068 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:13,075 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:14,080 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:15,085 Stage-1 map = 100%,  reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:16,092 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 29.77 sec
2014-08-21 13:28:17,097 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 29.77 sec
2014-08-21 13:28:18,105 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 29.77 sec
2014-08-21 13:28:19,110 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 29.77 sec
MapReduce Total cumulative CPU time: 29 seconds 770 msec
Ended Job = job_201408202333_0018
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0019, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0019
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job  -kill job_201408202333_0019
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2014-08-21 13:28:22,398 Stage-2 map = 0%,  reduce = 0%
2014-08-21 13:28:23,403 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:24,409 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:25,414 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:26,418 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:27,423 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:28,427 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:29,432 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:30,436 Stage-2 map = 100%,  reduce = 33%, Cumulative CPU 0.33 sec
2014-08-21 13:28:31,442 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 0.81 sec
2014-08-21 13:28:32,447 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 0.81 sec
MapReduce Total cumulative CPU time: 810 msec
Ended Job = job_201408202333_0019
MapReduce Jobs Launched: 
Job 0: Map: 2  Reduce: 1   Cumulative CPU: 29.77 sec   HDFS Read: 333094897 HDFS Write: 362 SUCCESS
Job 1: Map: 1  Reduce: 1   Cumulative CPU: 0.81 sec   HDFS Read: 818 HDFS Write: 96 SUCCESS
Total MapReduce CPU Time Spent: 30 seconds 580 msec
OK
2-一个
2-一个字
2-一个月
2-一二
2-一些
2-一休
2-一共
2-一台
2-一周
2-一套
Time taken: 49.836 seconds, Fetched: 10 row(s)
hive> 

Reference: SELECT TOP N in Hive (order by vs. sort by):

http://blog.sina.com.cn/s/blog_6ff05a2c0101eaxf.html
