1: Download the Sogou Labs word frequency file from http://www.sogou.com/labs/dl/w.html
[jifeng@jifeng02 ~]$ wget http://download.labs.sogou.com/dl/sogoulabdown/SogouW/SogouW.tar.gz
[jifeng@jifeng02 ~]$ tar zxf SogouW.tar.gz
[jifeng@jifeng02 ~]$ cd Freq
[jifeng@jifeng02 Freq]$ ls -l
-rw-r--r--. 1 jifeng jifeng 2961911 8月 21 09:34 SogouLabDic.dic
-rw-r--r--. 1 jifeng jifeng 217 10月 11 2006 ???????.txt
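The ???????.txt entry is the bundled readme: its filename is GBK-encoded, which a UTF-8 terminal cannot render. A minimal fix for the filename itself, assuming the convmv utility is installed (this step is not part of the original session):

[jifeng@jifeng02 Freq]$ convmv -f gbk -t utf8 --notest *.txt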
2: Convert the word frequency file's encoding
The file displays as garbled text under Linux, so convert it to UTF-8:
[jifeng@jifeng02 Freq]$ file SogouLabDic.dic
SogouLabDic.dic: ISO-8859 text
[jifeng@jifeng02 Freq]$ iconv -f gbk -t utf8 SogouLabDic.dic > SogouLabDic1.dic
[jifeng@jifeng02 Freq]$ file SogouLabDic1.dic
SogouLabDic1.dic: UTF-8 Unicode text
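If the source file contained any bytes that are not valid GBK, iconv would abort partway through. A hedged variant (not needed in the original session) that silently drops unconvertible characters:

[jifeng@jifeng02 Freq]$ iconv -c -f gbk -t utf8 SogouLabDic.dic > SogouLabDic1.dic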
Next, split the converted file into a new comma-separated file:

[jifeng@jifeng02 Freq]$ cat SogouLabDic1.dic | awk '{print $1","$2","$3}' > dic.dic
[jifeng@jifeng02 Freq]$ file dic.dic
dic.dic: UTF-8 Unicode text
[jifeng@jifeng02 Freq]$ ls -l
总用量 5804
-rw-rw-r--. 1 jifeng jifeng 2961911 8月 21 10:12 dic.dic
-rw-r--r--. 1 jifeng jifeng 2961911 8月 21 09:34 SogouLabDic.dic
-rw-r--r--. 1 jifeng jifeng 217 10月 11 2006 ???????.txt
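The dictionary is tab-delimited, so awk's default whitespace splitting happens to work here; an explicit field separator is safer if a field could ever contain a space. A sketch under that assumption:

[jifeng@jifeng02 Freq]$ awk -F'\t' 'BEGIN{OFS=","} {print $1,$2,$3}' SogouLabDic1.dic > dic.dic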
3: Load the file into Hive and sort
1) Create the table: create table dic(word string,num string,class string)row format delimited fields terminated by ',';
2) Load the file: load data local inpath '/home/jifeng/hadoop/Freq/dic.dic' into table dic;
3) Sort by the Chinese word and take the first 10 rows:
select word from dic order by word limit 10;
This implements the equivalent of select top 10 * from dic; Hive has no TOP clause, so order by ... limit serves that purpose.
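Before sorting, a quick spot check (hypothetical queries, not part of the original session) confirms that the comma-delimited load parsed cleanly:

hive> select * from dic limit 5;
hive> select count(*) from dic;

The full session follows: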
hive> create table dic(word string,num string,class string)row format delimited fields terminated by ',';
OK
Time taken: 0.194 seconds
hive> load data local inpath '/home/jifeng/hadoop/Freq/dic.dic' into table dic;
Copying data from file:/home/jifeng/hadoop/Freq/dic.dic
Copying file: file:/home/jifeng/hadoop/Freq/dic.dic
Loading data to table default.dic
Table default.dic stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 2961911, raw_data_size: 0]
OK
Time taken: 0.281 seconds
hive> select word from dic order by word limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0004, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0004
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201408202333_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-08-21 10:15:54,411 Stage-1 map = 0%, reduce = 0%
2014-08-21 10:15:56,430 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:15:57,439 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:15:58,448 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:15:59,459 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:16:00,469 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:16:01,477 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.61 sec
2014-08-21 10:16:02,482 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 0.61 sec
2014-08-21 10:16:03,489 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.1 sec
2014-08-21 10:16:04,504 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.1 sec
MapReduce Total cumulative CPU time: 1 seconds 100 msec
Ended Job = job_201408202333_0004
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.1 sec HDFS Read: 2962117 HDFS Write: 97 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 100 msec
OK
一一
一一七
一一三
一一九
一一二
一一四
一一点
一丁点儿
一七
一七三
Time taken: 13.937 seconds, Fetched: 10 row(s)
hive>
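Note that order by builds a total order through a single reducer, which is why the plan above pins the reducer count to 1 regardless of the tuning hints. In strict mode Hive even rejects an order by without a limit; a sketch of that guard (standard settings, but the exact error text varies by version):

hive> set hive.mapred.mode=strict;
hive> select word from dic order by word;   -- rejected: strict mode requires a LIMIT with ORDER BY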
Test sorting with sort by:
select word from dic sort by word limit 10;
hive> select word from dic sort by word limit 10;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0014, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0014
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201408202333_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2014-08-21 13:19:44,026 Stage-1 map = 0%, reduce = 0%
2014-08-21 13:19:46,040 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:47,045 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:48,052 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:49,058 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:50,065 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:51,071 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:52,077 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.6 sec
2014-08-21 13:19:53,083 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.05 sec
2014-08-21 13:19:54,089 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.05 sec
MapReduce Total cumulative CPU time: 1 seconds 50 msec
Ended Job = job_201408202333_0014
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0015, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0015
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201408202333_0015
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2014-08-21 13:19:56,360 Stage-2 map = 0%, reduce = 0%
2014-08-21 13:19:58,372 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:19:59,377 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:00,385 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:01,391 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:02,398 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:03,402 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:04,407 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.28 sec
2014-08-21 13:20:05,413 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 0.78 sec
2014-08-21 13:20:06,420 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 0.78 sec
MapReduce Total cumulative CPU time: 780 msec
Ended Job = job_201408202333_0015
MapReduce Jobs Launched:
Job 0: Map: 1 Reduce: 1 Cumulative CPU: 1.05 sec HDFS Read: 2962117 HDFS Write: 363 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 0.78 sec HDFS Read: 819 HDFS Write: 97 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
一一
一一七
一一三
一一九
一一二
一一四
一一点
一丁点儿
一七
一七三
Time taken: 25.807 seconds, Fetched: 10 row(s)
hive>
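Because Stage-1 ran with a single reducer, sort by happened to produce the same global order as order by, and the second job merely applied the final limit to that one sorted run. sort by only guarantees ordering within each reducer's output; with several reducers the combined result is no longer globally sorted. A hedged illustration (reducer count chosen arbitrarily, not run in the original session):

hive> set mapred.reduce.tasks=3;
hive> select word from dic sort by word;   -- three sorted runs; their concatenation is not globally sorted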
4: Work with the Chinese word collocation corpus (SogouR) http://www.sogou.com/labs/dl/r.html
Extract: tar zxf SogouR.tar.gz
Convert the encoding: iconv -f gbk -t utf8 SogouR.txt > RDic.dic
Split into comma-separated fields: cat RDic.dic | awk '{print $1","$2}' > RDic2.dic
Create the table in Hive: create table rdic(word string,num int)row format delimited fields terminated by ',';
Load the data:
load data local inpath '/home/jifeng/hadoop/Freq/RDic2.dic' into table rdic;
Count the rows: 18399496
hive> select count(*) from rdic ;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0016, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0016
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201408202333_0016
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-08-21 13:24:46,700 Stage-1 map = 0%, reduce = 0%
2014-08-21 13:24:54,742 Stage-1 map = 25%, reduce = 0%
2014-08-21 13:24:55,748 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 4.36 sec
2014-08-21 13:24:56,757 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 4.36 sec
2014-08-21 13:24:57,769 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:24:58,774 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:24:59,782 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:00,793 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:01,800 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:02,807 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 12.77 sec
2014-08-21 13:25:03,812 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 12.77 sec
2014-08-21 13:25:04,818 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.24 sec
2014-08-21 13:25:05,823 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.24 sec
2014-08-21 13:25:06,828 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 14.24 sec
MapReduce Total cumulative CPU time: 14 seconds 240 msec
Ended Job = job_201408202333_0016
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 14.24 sec HDFS Read: 333094897 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 14 seconds 240 msec
OK
18399496
Time taken: 22.527 seconds, Fetched: 1 row(s)
hive>
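For comparison, a hypothetical follow-up (not run in the original session) to see how many distinct values the word column holds:

hive> select count(distinct word) from rdic;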
Run the same two sorting tests on the larger table:
select word from rdic order by word limit 10;
select word from rdic sort by word limit 10;
hive> select word from rdic order by word limit 10;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0017, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0017
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201408202333_0017
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-08-21 13:26:42,060 Stage-1 map = 0%, reduce = 0%
2014-08-21 13:26:52,102 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:53,106 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:54,110 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:55,116 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:56,120 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:57,132 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:58,144 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 6.6 sec
2014-08-21 13:26:59,149 Stage-1 map = 75%, reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:00,153 Stage-1 map = 75%, reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:01,157 Stage-1 map = 75%, reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:02,163 Stage-1 map = 75%, reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:03,168 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:04,173 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 6.6 sec
2014-08-21 13:27:05,179 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:06,183 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:07,187 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:08,193 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 26.35 sec
2014-08-21 13:27:09,198 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 26.99 sec
2014-08-21 13:27:10,203 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 26.99 sec
MapReduce Total cumulative CPU time: 26 seconds 990 msec
Ended Job = job_201408202333_0017
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 26.99 sec HDFS Read: 333094897 HDFS Write: 96 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 990 msec
OK
2-一个
2-一个字
2-一个月
2-一二
2-一些
2-一休
2-一共
2-一台
2-一周
2-一套
Time taken: 32.492 seconds, Fetched: 10 row(s)
hive> select word from rdic sort by word limit 10;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0018, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0018
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201408202333_0018
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2014-08-21 13:27:45,948 Stage-1 map = 0%, reduce = 0%
2014-08-21 13:27:55,990 Stage-1 map = 13%, reduce = 0%
2014-08-21 13:27:56,995 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:27:57,999 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:27:59,004 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:00,007 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:01,011 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:02,015 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:03,019 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:04,021 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 6.5 sec
2014-08-21 13:28:05,027 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:06,031 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:07,036 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:08,040 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:09,045 Stage-1 map = 88%, reduce = 17%, Cumulative CPU 6.5 sec
2014-08-21 13:28:10,053 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:11,063 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:12,068 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:13,075 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:14,080 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:15,085 Stage-1 map = 100%, reduce = 17%, Cumulative CPU 28.23 sec
2014-08-21 13:28:16,092 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 29.77 sec
2014-08-21 13:28:17,097 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 29.77 sec
2014-08-21 13:28:18,105 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 29.77 sec
2014-08-21 13:28:19,110 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 29.77 sec
MapReduce Total cumulative CPU time: 29 seconds 770 msec
Ended Job = job_201408202333_0018
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201408202333_0019, Tracking URL = http://jifeng01:50030/jobdetails.jsp?jobid=job_201408202333_0019
Kill Command = /home/jifeng/hadoop/hadoop-1.2.1/libexec/../bin/hadoop job -kill job_201408202333_0019
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2014-08-21 13:28:22,398 Stage-2 map = 0%, reduce = 0%
2014-08-21 13:28:23,403 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:24,409 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:25,414 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:26,418 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:27,423 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:28,427 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:29,432 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 0.33 sec
2014-08-21 13:28:30,436 Stage-2 map = 100%, reduce = 33%, Cumulative CPU 0.33 sec
2014-08-21 13:28:31,442 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 0.81 sec
2014-08-21 13:28:32,447 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 0.81 sec
MapReduce Total cumulative CPU time: 810 msec
Ended Job = job_201408202333_0019
MapReduce Jobs Launched:
Job 0: Map: 2 Reduce: 1 Cumulative CPU: 29.77 sec HDFS Read: 333094897 HDFS Write: 362 SUCCESS
Job 1: Map: 1 Reduce: 1 Cumulative CPU: 0.81 sec HDFS Read: 818 HDFS Write: 96 SUCCESS
Total MapReduce CPU Time Spent: 30 seconds 580 msec
OK
2-一个
2-一个字
2-一个月
2-一二
2-一些
2-一休
2-一共
2-一台
2-一周
2-一套
Time taken: 49.836 seconds, Fetched: 10 row(s)
hive>
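On the 18,399,496-row table the trade-off is visible: order by finished in 32.5 s against 49.8 s for sort by, because the extra merge job only adds overhead while Stage-1 still runs with a single reducer. sort by pays off once the reducer count is raised, and distribute by keeps rows with the same key on the same reducer; a sketch (settings chosen for illustration, not from the original session):

hive> set mapred.reduce.tasks=4;
hive> select word from rdic distribute by word sort by word;
-- cluster by word is shorthand for distribute by word sort by word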
Related: http://blog.sina.com.cn/s/blog_6ff05a2c0101eaxf.html