Hadoop is an open-source framework for writing and running distributed applications that process large-scale data.
Running Hadoop requires Java 1.6 or later. The JDK can be downloaded from: http://www.oracle.com/technetwork/java/javase/downloads/jdk-7u3-download-1501626.html
Download JDK 1.6 and copy it to the Linux server via Samba or FTP.
Run ./jdk-6u29-linux-i586-rpm.bin to unpack jdk-6u29-linux-i586.rpm, then install it:
rpm -ivh jdk-6u29-linux-i586.rpm
By default the JDK is installed under /usr/java.
Next, configure JAVA_HOME: open ~/.bash_profile with vi and add the JAVA_HOME entries:
- # User specific environment and startup programs
- PATH=$PATH:$HOME/bin
- JAVA_HOME=/usr/java/jdk1.6.0_29
- export PATH
- export JAVA_HOME
- unset USERNAME
Run source ~/.bash_profile to make the changed variable values take effect.
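To confirm that the variables took effect, you can echo JAVA_HOME in a new shell (the value printed is simply the path configured above) and ask the JDK for its version:
- [root@localhost ~]# echo $JAVA_HOME
- /usr/java/jdk1.6.0_29
- [root@localhost ~]# $JAVA_HOME/bin/java -version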
Hadoop can be downloaded from http://labs.renren.com/apache-mirror/hadoop/common/hadoop-1.0.2/
Copy hadoop-1.0.2.tar.gz to the Linux server
and extract it with tar zxvf hadoop-1.0.2.tar.gz.
Then change into the bin directory of the extracted tree, for example: cd /opt/hadoop-1.0.2/bin
Run Hadoop without any arguments:
./hadoop
which prints:
- [root@localhost bin]# ./hadoop
- Usage: hadoop [--config confdir] COMMAND
- where COMMAND is one of:
- namenode -format format the DFS filesystem
- secondarynamenode run the DFS secondary namenode
- namenode run the DFS namenode
- datanode run a DFS datanode
- dfsadmin run a DFS admin client
- mradmin run a Map-Reduce admin client
- fsck run a DFS filesystem checking utility
- fs run a generic filesystem user client
- balancer run a cluster balancing utility
- fetchdt fetch a delegation token from the NameNode
- jobtracker run the MapReduce job Tracker node
- pipes run a Pipes job
- tasktracker run a MapReduce task Tracker node
- historyserver run job history servers as a standalone daemon
- job manipulate MapReduce jobs
- queue get information regarding JobQueues
- version print the version
- jar <jar> run a jar file
- distcp <srcurl> <desturl> copy file or directories recursively
- archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
- classpath prints the class path needed to get the
- Hadoop jar and the required libraries
- daemonlog get/set the log level for each daemon
- or
- CLASSNAME run the class named CLASSNAME
- Most commands print help when invoked w/o parameters.
For example, running ./hadoop classpath gives:
- [root@localhost bin]# ./hadoop classpath
- /opt/hadoop-1.0.2/libexec/../conf:/usr/java/jdk1.6.0_29/lib/tools.jar:/opt/hadoop-1.0.2/libexec/..:/opt/hadoop-1.0.2/libexec/../hadoop-core-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/asm-3.2.jar:/opt/hadoop-1.0.2/libexec/../lib/aspectjrt-1.6.5.jar:/opt/hadoop-1.0.2/libexec/../lib/aspectjtools-1.6.5.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-beanutils-1.7.0.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-beanutils-core-1.8.0.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-cli-1.2.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-codec-1.4.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-collections-3.2.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-configuration-1.6.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-daemon-1.0.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-digester-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-el-1.0.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-httpclient-3.0.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-lang-2.4.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-logging-1.1.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-logging-api-1.0.4.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-math-2.1.jar:/opt/hadoop-1.0.2/libexec/../lib/commons-net-1.4.1.jar:/opt/hadoop-1.0.2/libexec/../lib/core-3.1.1.jar:/opt/hadoop-1.0.2/libexec/../lib/hadoop-capacity-scheduler-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/hadoop-fairscheduler-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/hadoop-thriftfs-1.0.2.jar:/opt/hadoop-1.0.2/libexec/../lib/hsqldb-1.8.0.10.jar:/opt/hadoop-1.0.2/libexec/../lib/jackson-core-asl-1.8.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jasper-compiler-5.5.12.jar:/opt/hadoop-1.0.2/libexec/../lib/jasper-runtime-5.5.12.jar:/opt/hadoop-1.0.2/libexec/../lib/jdeb-0.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jersey-core-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jersey-json-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jersey-server-1.8.jar:/opt/hadoop-1.0.2/libexec/../lib/jets3t-0.6.1.jar:/opt/hadoop-1.0.2/libexec/../lib/jetty-6.1.26.jar:/opt/hadoop-1.0.2/libexec/../lib/jetty-util-6.1.26.jar:/opt/hadoop-1.0.2/libexec/../lib/jsch-0.1.42.jar:/opt/hadoop-1.0.2/libexec/../lib/junit-4.5.jar:/opt/hadoop-1.0.2/libexec/../lib/kfs-0.2.2.jar:/opt/hadoop-1.0.2/libexec/../lib/log4j-1.2.15.jar:/opt/hadoop-1.0.2/libexec/../lib/mockito-all-1.8.5.jar:/opt/hadoop-1.0.2/libexec/../lib/oro-2.0.8.jar:/opt/hadoop-1.0.2/libexec/../lib/servlet-api-2.5-20081211.jar:/opt/hadoop-1.0.2/libexec/../lib/slf4j-api-1.4.3.jar:/opt/hadoop-1.0.2/libexec/../lib/slf4j-log4j12-1.4.3.jar:/opt/hadoop-1.0.2/libexec/../lib/xmlenc-0.52.jar:/opt/hadoop-1.0.2/libexec/../lib/jsp-2.1/jsp-2.1.jar:/opt/hadoop-1.0.2/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
As shown above, the command for running a (Java) Hadoop program is hadoop jar <jar>. As the usage message indicates, Hadoop programs written in Java are packaged into executable JAR files.
The Hadoop directory contains a file named hadoop-examples-1.0.2.jar (the file name differs across Hadoop versions), which bundles several example programs. Their source can be found under hadoop-1.0.2/src/examples/org/apache/hadoop/examples:
- [root@localhost examples]# ll
- total 228
- -rw-rw-r-- 1 root root 2797 Mar 25 08:01 AggregateWordCount.java
- -rw-rw-r-- 1 root root 2879 Mar 25 08:01 AggregateWordHistogram.java
- drwxr-xr-x 2 root root 4096 Apr 11 21:50 dancing
- -rw-rw-r-- 1 root root 13089 Mar 25 08:01 DBCountPageView.java
- -rw-rw-r-- 1 root root 3751 Mar 25 08:01 ExampleDriver.java
- -rw-rw-r-- 1 root root 3334 Mar 25 08:01 Grep.java
- -rw-rw-r-- 1 root root 6582 Mar 25 08:01 Join.java
- -rw-rw-r-- 1 root root 8282 Mar 25 08:01 MultiFileWordCount.java
- -rw-rw-r-- 1 root root 853 Mar 25 08:01 package.html
- -rw-rw-r-- 1 root root 11914 Mar 25 08:01 PiEstimator.java
- -rw-rw-r-- 1 root root 40350 Mar 25 08:01 RandomTextWriter.java
- -rw-rw-r-- 1 root root 10190 Mar 25 08:01 RandomWriter.java
- -rw-rw-r-- 1 root root 7809 Mar 25 08:01 SecondarySort.java
- -rw-rw-r-- 1 root root 9156 Mar 25 08:01 SleepJob.java
- -rw-rw-r-- 1 root root 8040 Mar 25 08:01 Sort.java
- drwxr-xr-x 2 root root 4096 Apr 11 21:50 terasort
- -rw-rw-r-- 1 root root 2395 Mar 25 08:01 WordCount.java
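As an aside, if you are unsure which program names the examples JAR accepts, running it without a program name should make its driver print the list of valid example names:
- [root@localhost bin]# ./hadoop jar /opt/hadoop-1.0.2/hadoop-examples-1.0.2.jar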
Let's use WordCount to give Hadoop a test run.
Running wordcount without any arguments prints its usage information:
- [root@localhost bin]# ./hadoop jar /opt/hadoop-1.0.2/hadoop-examples-1.0.2.jar wordcount
- Usage: wordcount <in> <out>
Download an English essay from the web and save it as test.txt under the /opt/data directory.
Running wordcount again gives:
- [root@localhost bin]# ./hadoop jar /opt/hadoop-1.0.2/hadoop-examples-1.0.2.jar wordcount /opt/data/test.txt /opt/data/output
- 12/04/11 22:48:41 INFO util.NativeCodeLoader: Loaded the native-hadoop library
- ****file:/opt/data/test.txt
- 12/04/11 22:48:41 INFO input.FileInputFormat: Total input paths to process : 1
- 12/04/11 22:48:41 WARN snappy.LoadSnappy: Snappy native library not loaded
- 12/04/11 22:48:42 INFO mapred.JobClient: Running job: job_local_0001
- 12/04/11 22:48:42 INFO util.ProcessTree: setsid exited with exit code 0
- 12/04/11 22:48:42 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@9ced8e
- 12/04/11 22:48:42 INFO mapred.MapTask: io.sort.mb = 100
- 12/04/11 22:48:43 INFO mapred.MapTask: data buffer = 79691776/99614720
- 12/04/11 22:48:43 INFO mapred.MapTask: record buffer = 262144/327680
- 12/04/11 22:48:43 INFO mapred.MapTask: Starting flush of map output
- 12/04/11 22:48:43 INFO mapred.MapTask: Finished spill 0
- 12/04/11 22:48:43 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
- 12/04/11 22:48:43 INFO mapred.JobClient: map 0% reduce 0%
- 12/04/11 22:48:45 INFO mapred.LocalJobRunner:
- 12/04/11 22:48:45 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
- 12/04/11 22:48:45 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@281d4b
- 12/04/11 22:48:45 INFO mapred.LocalJobRunner:
- 12/04/11 22:48:45 INFO mapred.Merger: Merging 1 sorted segments
- 12/04/11 22:48:45 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 6079 bytes
- 12/04/11 22:48:45 INFO mapred.LocalJobRunner:
- 12/04/11 22:48:45 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
- 12/04/11 22:48:45 INFO mapred.LocalJobRunner:
- 12/04/11 22:48:45 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
- 12/04/11 22:48:45 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to /opt/data/output
- 12/04/11 22:48:46 INFO mapred.JobClient: map 100% reduce 0%
- 12/04/11 22:48:48 INFO mapred.LocalJobRunner: reduce > reduce
- 12/04/11 22:48:48 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
- 12/04/11 22:48:49 INFO mapred.JobClient: map 100% reduce 100%
- 12/04/11 22:48:49 INFO mapred.JobClient: Job complete: job_local_0001
- 12/04/11 22:48:49 INFO mapred.JobClient: Counters: 20
- 12/04/11 22:48:49 INFO mapred.JobClient: File Output Format Counters
- 12/04/11 22:48:49 INFO mapred.JobClient: Bytes Written=4241
- 12/04/11 22:48:49 INFO mapred.JobClient: FileSystemCounters
- 12/04/11 22:48:49 INFO mapred.JobClient: FILE_BYTES_READ=301803
- 12/04/11 22:48:49 INFO mapred.JobClient: FILE_BYTES_WRITTEN=368355
- 12/04/11 22:48:49 INFO mapred.JobClient: File Input Format Counters
- 12/04/11 22:48:49 INFO mapred.JobClient: Bytes Read=5251
- 12/04/11 22:48:49 INFO mapred.JobClient: Map-Reduce Framework
- 12/04/11 22:48:49 INFO mapred.JobClient: Map output materialized bytes=6083
- 12/04/11 22:48:49 INFO mapred.JobClient: Map input records=21
- 12/04/11 22:48:49 INFO mapred.JobClient: Reduce shuffle bytes=0
- 12/04/11 22:48:49 INFO mapred.JobClient: Spilled Records=946
- 12/04/11 22:48:49 INFO mapred.JobClient: Map output bytes=9182
- 12/04/11 22:48:49 INFO mapred.JobClient: Total committed heap usage (bytes)=321134592
- 12/04/11 22:48:49 INFO mapred.JobClient: CPU time spent (ms)=0
- 12/04/11 22:48:49 INFO mapred.JobClient: SPLIT_RAW_BYTES=88
- 12/04/11 22:48:49 INFO mapred.JobClient: Combine input records=970
- 12/04/11 22:48:49 INFO mapred.JobClient: Reduce input records=473
- 12/04/11 22:48:49 INFO mapred.JobClient: Reduce input groups=473
- 12/04/11 22:48:49 INFO mapred.JobClient: Combine output records=473
- 12/04/11 22:48:49 INFO mapred.JobClient: Physical memory (bytes) snapshot=0
- 12/04/11 22:48:49 INFO mapred.JobClient: Reduce output records=473
- 12/04/11 22:48:49 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0
- 12/04/11 22:48:49 INFO mapred.JobClient: Map output records=970
View the resulting counts:
- [root@localhost bin]# more /opt/data/output/*
- ::::::::::::::
- /opt/data/output/part-r-00000
- ::::::::::::::
- "Eat, 1
- "How 1
- "she 1
- And 1
- But 1
- Darkness 1
- Epicurean 1
- Eyes". 1
- He 1
- I 24
- If 2
- In 1
- It 3
- Nature 2
- Occasionally, 1
- Only 1
- Particularly 1
- Persian 1
- Recently 1
- So 1
- Sometimes 1
- Such 1
- The 2
- Their 1
- There 1
- To 2
- Use 1
- We 3
- What 1
- When 1
- Yet, 1
The wordcount program has a shortcoming: it splits words purely on whitespace, ignoring punctuation, so tokens such as "Eat, (with the quotation mark and comma attached), eat, and Eat are all counted as distinct words. This can be improved by editing WordCount.java:
change StringTokenizer itr = new StringTokenizer(line) to
StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]'")
Recompile and run the job again; the results are much better.
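For reference, here is roughly what the mapper in WordCount.java looks like after the change. This is a sketch paraphrased from the 1.0.2 example source (the enclosing WordCount class, imports, and the reducer are omitted):
- // Tokenizing mapper from the WordCount example, with the delimiter set
- // extended to strip punctuation as described above.
- public static class TokenizerMapper
-     extends Mapper<Object, Text, Text, IntWritable> {
-
-   private final static IntWritable one = new IntWritable(1);
-   private Text word = new Text();
-
-   public void map(Object key, Text value, Context context)
-       throws IOException, InterruptedException {
-     String line = value.toString();
-     // Whitespace plus common punctuation. Note that the double quote is
-     // not in this set; add \" as well if tokens like "Eat should also
-     // lose their quotation marks.
-     StringTokenizer itr = new StringTokenizer(line, " \t\n\r\f,.:;?![]'");
-     while (itr.hasMoreTokens()) {
-       word.set(itr.nextToken());
-       context.write(word, one);  // emit (word, 1) for each token
-     }
-   }
- }
Note that eat and Eat still differ in case; lowercasing each token before word.set(...) would merge those as well. One way to rebuild, reusing the classpath command shown earlier: create a classes directory, compile with javac -d classes -classpath $(./hadoop classpath) WordCount.java, package it with jar cvf wordcount.jar -C classes ., and run it again via ./hadoop jar (wordcount.jar and the classes directory are just illustrative names).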