Hadoop cluster install

vvvvvvvvv config vvvvvvvv

set domain aliases on all nodes (marked optional, but a must in practice):
/etc/hosts

# Let the master access all the slaves without passwords.
# Method 1:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave
# Method 2:
copy $HOME/.ssh/id_rsa.pub to each slave,
then log in to the slave and run: cat id_rsa.pub >> $HOME/.ssh/authorized_keys
The first login from the master also saves each slave's host key fingerprint into hadoop@master's "known_hosts" file.
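
A minimal end-to-end sketch, assuming the user is hadoop and the slaves are named slave1/slave2 as in the listing further below:

# on the master, generate a key pair once (empty passphrase):
ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa
# push the public key to every slave (method 1):
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave2
# verify: these should now succeed without a password prompt
ssh hadoop@slave1 hostname
ssh hadoop@slave2 hostname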

# Set up the masters & slaves files (contents below; a command sketch follows the listing).
# *Do this only on the master machine.*
# Note that the machine on which bin/start-dfs.sh is run becomes the primary namenode,
# so you MUST start Hadoop from the namenode ONLY. (You can still submit jobs from any node.)
conf/masters:
 master
conf/slaves:
 slave1
 slave2
 ...
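
Equivalently, as shell commands on the master (a sketch assuming the stock conf/ directory of Hadoop 0.20):

# write the node lists that the start scripts read:
echo master > conf/masters
printf 'slave1\nslave2\n' > conf/slaves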

## On all nodes (including the master)
# Set core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

# hdfs-site.xml (optional):
<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified at create time.
  </description>
</property>
# Note: this defines how many machines a single file should be replicated to before it
# becomes available. If you set it to a value higher than the number of slave nodes
# (more precisely, the number of datanodes) you have available, you will start seeing
# a lot of "Zero targets found, forbidden1.size=1" type errors in the log files.
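
Once the cluster is running, replication can be inspected and changed per file with the standard HDFS shell commands (the file path below is only an example):

# report files/blocks and their replication across the filesystem:
bin/hadoop fsck / -files -blocks
# change the replication factor of an existing file:
bin/hadoop fs -setrep -w 2 /user/hadoop/input/file01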

# mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>


setting /etc/hosts
 By default (on Ubuntu) the machine's hostname, here c1.domain, is mapped to 127.0.1.1,
 which causes errors during start-dfs.sh. Resolution: take it off the 127.0.1.1 line and
 append it as an extra alias after the existing "desktop" hostname, on the line with the
 machine's real IP; leave the other entries unchanged.
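
A before/after sketch of the fix (all IP addresses and hostnames here are placeholders):

# before (Ubuntu default; the 127.0.1.1 mapping breaks start-dfs.sh):
127.0.0.1    localhost
127.0.1.1    c1.domain
# after (the hostname is now an alias on the real-IP line):
127.0.0.1    localhost
192.168.1.10 desktop c1.domain
192.168.1.11 slave1
192.168.1.12 slave2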


format HDFS on the master (first run only), then start the daemons:
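
The standard Hadoop 0.20.2 commands, run from the master:

# WARNING: formatting wipes any existing HDFS data
bin/hadoop namenode -format
# start the HDFS and then the MapReduce daemons:
bin/start-dfs.sh
bin/start-mapred.sh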

done
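
To sanity-check that the cluster came up (jps ships with the JDK; dfsadmin -report is the standard HDFS admin command):

# on each node, list the running Hadoop daemons:
jps
# on the master, check that all datanodes have registered:
bin/hadoop dfsadmin -report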

FAQ: java.io.IOException: Incompatible namespaceIDs
resolution:
 a. (delete the datanode's data directly, then reformat the namenode; deprecated)
   1. stop the cluster
   2. delete the data directory on the problematic datanode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data
   3. reformat the namenode (NOTE: all HDFS data is lost during this process!)
   4. restart the cluster
 b. (copy the namenode's namespaceID into the datanode's VERSION file; see the sketch after these steps)
   1. stop the datanode
   2. edit the value of namespaceID in <dfs.data.dir>/current/VERSION to match the value of the current namenode
   3. restart the datanode
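
A sketch of resolution b; the name/data directory paths below follow this tutorial's hadoop.tmp.dir layout and are only examples:

# 1. stop the affected datanode:
bin/hadoop-daemon.sh stop datanode
# 2. read the namenode's namespaceID (on the master):
cat /usr/local/hadoop-datastore/hadoop-hadoop/dfs/name/current/VERSION
# 3. on the datanode, edit namespaceID in its VERSION file to match:
vi /usr/local/hadoop-datastore/hadoop-hadoop/dfs/data/current/VERSION
# 4. restart the datanode:
bin/hadoop-daemon.sh start datanode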

summary:
The local hadoop.tmp.dir directory and the directories you see with ls on the cluster correspond in their leading path components, but not in the trailing parts.

^^^^ config ^^^^^

 

output:


cluster run (took ~54s):
input: 3 files
hadoop@leibnitz-laptop:/cc/hadoop/hadoop-0.20.2$ hadoop jar hadoop-0.20.2-examples.jar wordcount input/ output
11/02/26 02:51:33 INFO input.FileInputFormat: Total input paths to process : 3
11/02/26 02:51:34 INFO mapred.JobClient: Running job: job_201102260237_0002
11/02/26 02:51:35 INFO mapred.JobClient:  map 0% reduce 0%
11/02/26 02:51:57 INFO mapred.JobClient:  map 33% reduce 0%
11/02/26 02:52:05 INFO mapred.JobClient:  map 92% reduce 0%
11/02/26 02:52:08 INFO mapred.JobClient:  map 100% reduce 0%
11/02/26 02:52:18 INFO mapred.JobClient:  map 100% reduce 22%
11/02/26 02:52:25 INFO mapred.JobClient:  map 100% reduce 100%
11/02/26 02:52:27 INFO mapred.JobClient: Job complete: job_201102260237_0002
11/02/26 02:52:27 INFO mapred.JobClient: Counters: 17
11/02/26 02:52:27 INFO mapred.JobClient:   Job Counters
11/02/26 02:52:27 INFO mapred.JobClient:     Launched reduce tasks=1
11/02/26 02:52:27 INFO mapred.JobClient:     Launched map tasks=3
11/02/26 02:52:27 INFO mapred.JobClient:     Data-local map tasks=3
11/02/26 02:52:27 INFO mapred.JobClient:   FileSystemCounters
11/02/26 02:52:27 INFO mapred.JobClient:     FILE_BYTES_READ=2214725
11/02/26 02:52:27 INFO mapred.JobClient:     HDFS_BYTES_READ=3671479
11/02/26 02:52:27 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3689100
11/02/26 02:52:27 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=880802
11/02/26 02:52:27 INFO mapred.JobClient:   Map-Reduce Framework
11/02/26 02:52:27 INFO mapred.JobClient:     Reduce input groups=82331
11/02/26 02:52:27 INFO mapred.JobClient:     Combine output records=102317
11/02/26 02:52:27 INFO mapred.JobClient:     Map input records=77931
11/02/26 02:52:27 INFO mapred.JobClient:     Reduce shuffle bytes=1474279
11/02/26 02:52:27 INFO mapred.JobClient:     Reduce output records=82331
11/02/26 02:52:27 INFO mapred.JobClient:     Spilled Records=255947
11/02/26 02:52:27 INFO mapred.JobClient:     Map output bytes=6076039
11/02/26 02:52:27 INFO mapred.JobClient:     Combine input records=629167
11/02/26 02:52:27 INFO mapred.JobClient:     Map output records=629167
11/02/26 02:52:27 INFO mapred.JobClient:     Reduce input records=102317
