First, a look at the versions after everything was installed:
$ java -version
java version "1.7.0_51"
Java(TM) SE Runtime Environment (build 1.7.0_51-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode)

$ hadoop version
Hadoop 2.3.0
Subversion Unknown -r Unknown
Compiled by hduser on 2014-03-01T09:07Z
Compiled with protoc 2.5.0
From source with checksum dfe46336fbc6a044bc124392ec06b85
This command was run using /home/hduser/hadoop/hadoop-2.3.0/share/hadoop/common/hadoop-common-2.3.0.jar
I basically followed this excellent blog post step by step:
http://blog.changecong.com/2013/10/ubuntu-%E7%BC%96%E8%AF%91%E5%AE%89%E8%A3%85-hadoop-2-2-0/
But when building the Hadoop source with Maven, the build failed:
$ mvn package -Pdist,native -DskipTests -Dtar
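An aside from me rather than from the original post: the -Pdist,native profile also needs a native toolchain and protoc 2.5.0 (consistent with "Compiled with protoc 2.5.0" in the version output above). On Ubuntu the usual prerequisites are roughly:

$ sudo apt-get install build-essential cmake zlib1g-dev libssl-dev
$ protoc --version   # the native build expects libprotoc 2.5.0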
It seemed to be because a Maven plugin, javadoc, could not be found. I didn't really understand it; I had never used Maven before.
After some haphazard searching, I followed the approach in this post http://www.cnblogs.com/lucius/p/3435296.html and switched the Maven repository to a mirror in China, after which the build succeeded.
The modified part of my settings.xml:
<mirrors>
  <mirror>
    <id>nexus-osc</id>
    <mirrorOf>*</mirrorOf>
    <name>Nexusosc</name>
    <url>http://maven.oschina.net/content/groups/public/</url>
  </mirror>
</mirrors>

<profiles>
  <profile>
    <id>jdk-1.7</id>
    <activation>
      <jdk>1.7</jdk>
    </activation>
    <repositories>
      <repository>
        <id>nexus</id>
        <name>local private nexus</name>
        <url>http://maven.oschina.net/content/groups/public/</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </repository>
    </repositories>
    <pluginRepositories>
      <pluginRepository>
        <id>nexus</id>
        <name>local private nexus</name>
        <url>http://maven.oschina.net/content/groups/public/</url>
        <releases>
          <enabled>true</enabled>
        </releases>
        <snapshots>
          <enabled>false</enabled>
        </snapshots>
      </pluginRepository>
    </pluginRepositories>
  </profile>
</profiles>
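With the mirror in place, re-running the same command went through. (The output location below is from the standard Hadoop build layout, not from my notes at the time:)

$ mvn package -Pdist,native -DskipTests -Dtar
$ ls hadoop-dist/target/   # the built distribution, e.g. hadoop-2.3.0.tar.gz, lands here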
Of course, a pile of environment variables has to be set; following http://www.cnblogs.com/lucius/p/3435296.html does the job.
Add to ~/.bashrc:
# hadoop variable settings
export HADOOP_HOME="$HOME/hadoop/hadoop-2.3.0"
export HADOOP_PREFIX=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop/"
export YARN_CONF_DIR=$HADOOP_CONF_DIR
export PATH="$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH"
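To make these take effect in the current shell and sanity-check the result (a quick check of mine, not in the referenced post):

$ source ~/.bashrc
$ hadoop version   # should print the Hadoop 2.3.0 banner shown at the top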
Then modify the configuration files under etc/hadoop:
hadoop-env.sh
# The java implementation to use.
export JAVA_HOME=/usr/lib/java/jdk1.7.0_51
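If you are unsure where the JDK actually lives, one way to find it (my suggestion, assuming javac is on the PATH):

$ readlink -f "$(which javac)"
# e.g. /usr/lib/java/jdk1.7.0_51/bin/javac — JAVA_HOME is everything before /bin/javac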
core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/hadoop/hadoop-2.3.0/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Note: a tmp directory has to be created under the install directory first:

$ mkdir /home/hduser/hadoop/hadoop-2.3.0/tmp
mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
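A hedged aside: the Hadoop 2.x tree usually ships only mapred-site.xml.template, so the file may need to be created from the template first:

$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml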
yarn-site.xml
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
A submitted job would hang like this, never moving on to the map and reduce phases:

14/03/02 21:35:04 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/03/02 21:35:05 INFO input.FileInputFormat: Total input paths to process : 1
14/03/02 21:35:05 INFO mapreduce.JobSubmitter: number of splits:1
14/03/02 21:35:05 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1393682087091_0005
14/03/02 21:35:05 INFO impl.YarnClientImpl: Submitted application application_1393682087091_0005
14/03/02 21:35:05 INFO mapreduce.Job: The url to track the job: http://abalone-ubuntu:8088/proxy/application_1393682087091_0005/
14/03/02 21:35:05 INFO mapreduce.Job: Running job: job_1393682087091_0005

It finally turned out that I had written one of the configuration files wrong. Facepalm.
The situation is very similar to this one: http://stackoverflow.com/questions/20506992/cant-run-a-mapreduce-job-with-yarn
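A quick way to catch malformed XML in the config files (my own suggestion, not part of the original troubleshooting; assumes xmllint from the libxml2-utils package is installed):

$ xmllint --noout $HADOOP_HOME/etc/hadoop/core-site.xml \
          $HADOOP_HOME/etc/hadoop/mapred-site.xml \
          $HADOOP_HOME/etc/hadoop/yarn-site.xml
# xmllint prints nothing when the files are well-formed; a typo in a
# property name still has to be checked by eye against the docs.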
1. Before the first run, format the filesystem:

$ hdfs namenode -format

2. Start Hadoop (a way to verify the daemons actually came up is sketched right after this list):

$ start-dfs.sh
$ start-yarn.sh

3. Job status can be watched in a browser; the URL is localhost:8088.

4. Stop Hadoop:

$ stop-yarn.sh
$ stop-dfs.sh

5. Some basic HDFS operations:

$ hdfs dfs -mkdir /tmp                   # make directory
$ hdfs dfs -copyFromLocal file.txt /tmp  # copy from local file system
$ hdfs dfs -ls /tmp                      # list content on hdfs

6. Run one of the bundled examples:

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar pi 2 5
Number of Maps  = 2
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Starting Job
14/03/02 22:16:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/03/02 22:16:53 INFO input.FileInputFormat: Total input paths to process : 2
14/03/02 22:16:53 INFO mapreduce.JobSubmitter: number of splits:2
14/03/02 22:16:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1393682087091_0006
14/03/02 22:16:53 INFO impl.YarnClientImpl: Submitted application application_1393682087091_0006
14/03/02 22:16:53 INFO mapreduce.Job: The url to track the job: http://abalone-ubuntu:8088/proxy/application_1393682087091_0006/
14/03/02 22:16:53 INFO mapreduce.Job: Running job: job_1393682087091_0006
14/03/02 22:16:59 INFO mapreduce.Job: Job job_1393682087091_0006 running in uber mode : false
14/03/02 22:16:59 INFO mapreduce.Job:  map 0% reduce 0%
14/03/02 22:17:04 INFO mapreduce.Job:  map 100% reduce 0%
14/03/02 22:17:09 INFO mapreduce.Job:  map 100% reduce 100%
14/03/02 22:17:09 INFO mapreduce.Job: Job job_1393682087091_0006 completed successfully
14/03/02 22:17:09 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=50
		FILE: Number of bytes written=257571
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=530
		HDFS: Number of bytes written=215
		HDFS: Number of read operations=11
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=3
	Job Counters
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=5043
		Total time spent by all reduces in occupied slots (ms)=2761
		Total time spent by all map tasks (ms)=5043
		Total time spent by all reduce tasks (ms)=2761
		Total vcore-seconds taken by all map tasks=5043
		Total vcore-seconds taken by all reduce tasks=2761
		Total megabyte-seconds taken by all map tasks=5164032
		Total megabyte-seconds taken by all reduce tasks=2827264
	Map-Reduce Framework
		Map input records=2
		Map output records=4
		Map output bytes=36
		Map output materialized bytes=56
		Input split bytes=294
		Combine input records=0
		Combine output records=0
		Reduce input groups=2
		Reduce shuffle bytes=56
		Reduce input records=4
		Reduce output records=0
		Spilled Records=8
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=72
		CPU time spent (ms)=1700
		Physical memory (bytes) snapshot=693141504
		Virtual memory (bytes) snapshot=3142955008
		Total committed heap usage (bytes)=524288000
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=236
	File Output Format Counters
		Bytes Written=97
Job Finished in 16.452 seconds
Estimated value of Pi is 3.60000000000000000000
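To confirm the daemons actually started after step 2 (a sanity check of mine, not from the original post):

$ jps
# On a single-node setup, expect these five alongside Jps (PIDs will differ):
#   NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager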
A couple of other problems I ran into:

1.
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /input/words.txt /output
14/03/02 21:34:15 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/03/02 21:34:15 WARN security.UserGroupInformation: PriviledgedActionException as:hduser (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/output already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:9000/output already exists
	at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)

This means the specified output directory /output already exists. By default Hadoop never overwrites existing output, so either delete /output or pick a directory name that does not exist yet.
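For example, deleting the old directory (hdfs dfs -rm -r is the standard recursive delete in the HDFS shell):

$ hdfs dfs -rm -r /output   # remove the existing output directory on HDFS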
2.
When copying data up to HDFS, an error was reported saying there was no datanode:
DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/hadoop-Ceylon/mapred/system/job_local_0001/job.jar could only be replicated to 0 nodes, instead of 1

What I did was shut Hadoop down, delete everything under /home/hduser/hadoop/hadoop-2.3.0/tmp/, and start Hadoop again.
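Spelled out, the recovery sequence looks roughly like this. Note that wiping tmp also destroys the namenode metadata (hadoop.tmp.dir points there), so I believe re-formatting afterwards is required; that step is my inference rather than something recorded at the time:

$ stop-yarn.sh
$ stop-dfs.sh
$ rm -rf /home/hduser/hadoop/hadoop-2.3.0/tmp/*
$ hdfs namenode -format   # metadata lived under tmp, so format again
$ start-dfs.sh
$ start-yarn.sh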