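After a few days of studying Hadoop, I found that some of the main concepts have to be clear first. Below are some very good explanations from wiki.apache, organized as follows:
Source: http://wiki.apache.org/hadoop/FrontPage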
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
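The lookup described above can be sketched as a toy model in Python. This is not Hadoop's API: the class ToyNameNode, its methods, and the block and host names are all hypothetical, chosen only to show that the NameNode holds metadata (paths, block IDs, block locations) while the file bytes stay on the DataNodes.

```python
# Toy model of NameNode metadata; every name here is hypothetical, not Hadoop's API.
class ToyNameNode:
    def __init__(self):
        self.namespace = {}        # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> list of DataNode host names

    def add_file(self, path, blocks):
        """Record a file given an ordered mapping of block ID -> DataNode hosts."""
        self.namespace[path] = list(blocks)
        for block_id, hosts in blocks.items():
            self.block_locations[block_id] = list(hosts)

    def locate(self, path):
        """Return [(block_id, [hosts])] for a file, or raise if it is unknown."""
        if path not in self.namespace:
            raise FileNotFoundError(path)
        return [(b, self.block_locations[b]) for b in self.namespace[path]]


namenode = ToyNameNode()
namenode.add_file("/logs/day1.txt", {
    "blk_1": ["datanode-a", "datanode-b"],
    "blk_2": ["datanode-b", "datanode-c"],
})

# The client asks where the data lives, then would read from those DataNodes directly.
for block_id, hosts in namenode.locate("/logs/day1.txt"):
    print(block_id, "->", hosts)
```

Note that the model stores no file contents at all, which mirrors the point that the NameNode does not store the data of the files itself.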
The NameNode is a Single Point of Failure for the HDFS cluster. HDFS is not currently a High Availability system. When the NameNode goes down, the file system goes offline. There is an optional SecondaryNameNode that can be hosted on a separate machine. It only creates checkpoints of the namespace by merging the edits file into the fsimage file, and it does not provide any real redundancy. Hadoop 0.21+ has a BackupNameNode that is part of a plan to have an HA name service, but it needs active contributions from the people who want it (i.e. you) to make it Highly Available.
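The checkpointing idea (merging the edits file into the fsimage file) can be illustrated with a small Python sketch. The operation names and the dictionary layout below are assumptions made for illustration; they are not Hadoop's on-disk formats.

```python
# Toy checkpoint: replay an edit log onto a namespace snapshot.
# Operation names and data layout are hypothetical, not Hadoop's file formats.
def checkpoint(fsimage, edits):
    """Return a new snapshot with every logged edit applied, in order."""
    merged = dict(fsimage)  # copy, so the old snapshot is left untouched
    for op, path, *args in edits:
        if op == "create":
            merged[path] = args[0]
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[args[0]] = merged.pop(path)
        else:
            raise ValueError(f"unknown edit operation: {op}")
    return merged


fsimage = {"/a.txt": ["blk_1"]}
edits = [
    ("create", "/b.txt", ["blk_2"]),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_fsimage = checkpoint(fsimage, edits)
print(new_fsimage)  # {'/c.txt': ['blk_1']}
```

After a checkpoint, the edit log can start empty again, which keeps a NameNode restart from having to replay a long history. As the text says, this is housekeeping, not redundancy.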
It is essential to look after the NameNode. Here are some recommendations from production use
If a NameNode does not start up, look at the TroubleShooting page.
A DataNode stores data in the Hadoop filesystem. A functional filesystem has more than one DataNode, with data replicated across them.
On startup, a DataNode connects to the NameNode, retrying until that service comes up. It then responds to requests from the NameNode for filesystem operations.
Client applications can talk directly to a DataNode once the NameNode has provided the location of the data. Similarly, MapReduce operations farmed out to TaskTracker instances near a DataNode talk directly to the DataNode to access the files. TaskTracker instances can, indeed should, be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.
DataNode instances can talk to each other, which is what they do when they are replicating data.
There is usually no need to use RAID storage for DataNode data, because data is designed to be replicated across multiple servers, rather than multiple disks on the same server.
An ideal configuration is for a server to have a DataNode, a TaskTracker, and then physical disks, with one TaskTracker slot per CPU. This will allow every TaskTracker slot 100% of a CPU, and separate disks to read and write data.
Avoid using NFS for data storage in a production system.
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
Every TaskTracker is configured with a set of slots, which indicate the number of tasks that it can accept. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if there is none, it looks for an empty slot on a machine in the same rack.
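The slot search described above can be sketched in Python. This is a toy model, not the JobTracker's real scheduler: the function pick_tracker, the host and rack names, and the final fallback to any host with a free slot are assumptions added for illustration.

```python
# Toy locality-aware slot picker; function, host, and rack names are hypothetical.
def pick_tracker(data_hosts, free_slots, rack_of):
    """Choose a TaskTracker host for a task whose input lives on data_hosts.

    Preference order: a host holding the data, then a host in the same rack
    as the data, then (an assumption of this sketch) any host with a free slot.
    Returns None when no slot is free anywhere.
    """
    candidates = [host for host, free in free_slots.items() if free > 0]
    for host in candidates:
        if host in data_hosts:
            return host
    data_racks = {rack_of[host] for host in data_hosts}
    for host in candidates:
        if rack_of[host] in data_racks:
            return host
    return candidates[0] if candidates else None


rack_of = {"n1": "rack1", "n2": "rack1", "n3": "rack2"}
free_slots = {"n1": 0, "n2": 2, "n3": 1}

print(pick_tracker({"n3"}, free_slots, rack_of))  # n3: data-local slot available
print(pick_tracker({"n1"}, free_slots, rack_of))  # n2: n1 is full, n2 shares its rack
```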
The TaskTracker spawns separate JVM processes to do the actual work; this ensures that a process failure does not take down the TaskTracker. The TaskTracker monitors these spawned processes, capturing the output and exit codes. When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.
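The heartbeat bookkeeping can be pictured with a short Python sketch. The class HeartbeatMonitor, the timeout value, and the host names are hypothetical; timestamps are passed in explicitly so the example is deterministic.

```python
# Toy heartbeat bookkeeping; class name, timeout, and host names are hypothetical.
class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}   # tracker host -> timestamp of its last heartbeat
        self.free_slots = {}  # tracker host -> free slots reported in that heartbeat

    def heartbeat(self, host, free_slots, now):
        """Record that a tracker is alive and how many slots it has free."""
        self.last_seen[host] = now
        self.free_slots[host] = free_slots

    def live_trackers(self, now):
        """Hosts whose last heartbeat is no older than the timeout."""
        return sorted(h for h, t in self.last_seen.items() if now - t <= self.timeout)


monitor = HeartbeatMonitor(timeout_seconds=600)
monitor.heartbeat("tracker-a", free_slots=2, now=0)
monitor.heartbeat("tracker-b", free_slots=0, now=0)
monitor.heartbeat("tracker-a", free_slots=1, now=500)

print(monitor.live_trackers(now=700))   # ['tracker-a']: tracker-b has gone silent
print(monitor.free_slots["tracker-a"])  # 1: the most recent report wins
```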
MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.
A map transform is provided to transform an input data row of key and value to an output key/value:
map(key1, value1) -> list<(key2, value2)>
That is, for an input it returns a list containing zero or more (key, value) pairs.
A reduce transform is provided to take all values for a specific key, and generate a new list of the reduced output.
reduce(key2, list<value2>) -> list<value3>
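As an illustration of the two signatures, here is a minimal word-count sketch in plain Python. It runs in a single process and does not use Hadoop; the helper names map_words, reduce_counts, and run_job are hypothetical.

```python
from collections import defaultdict


def map_words(key, line):
    """map(key1, value1) -> list<(key2, value2)>: emit (word, 1) for each word."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_counts(word, counts):
    """reduce(key2, list<value2>) -> list<value3>: sum the counts for one word."""
    return [sum(counts)]


def run_job(records, mapper, reducer):
    """Run every map, group the outputs by key, then reduce each key."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    return {k: reducer(k, values) for k, values in sorted(grouped.items())}


records = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "The end")]
result = run_job(records, map_words, reduce_counts)
print(result["the"])  # [3]
```

An empty line maps to an empty list, which is the "zero or more pairs" case.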
The key aspect of the MapReduce algorithm is that if every Map and Reduce is independent of all other ongoing Maps and Reduces, then the operation can be run in parallel on different keys and lists of data. On a large cluster of machines, you can go one step further, and run the Map operations on servers where the data lives. Rather than copy the data over the network to the program, you push out the program to the machines. The output list can then be saved to the distributed filesystem, and the reducers run to merge the results. Again, it may be possible to run these in parallel, each reducing different keys.
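The independence property can be demonstrated with a small Python sketch in which threads stand in for cluster machines. It is a simplification: the function names are hypothetical, and the reduce step here merges partial counts from each split rather than grouping by key across the network as Hadoop does.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_split(lines):
    """Map one input split without looking at any other split."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """Reduce step of this sketch: add up the per-split partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


splits = [["a b a"], ["b c"], ["a c c"]]

# Run the maps concurrently, one worker per split.
with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = merge_counts(pool.map(map_split, splits))

# Run the same maps one at a time, in the opposite order.
sequential = merge_counts(map_split(split) for split in reversed(splits))

print(parallel == sequential)  # True: the order of the maps does not matter
```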
A job scheduler (in Hadoop, the JobTracker) keeps track of which MR jobs are executing, schedules individual Maps, Reduces or intermediate merging operations to specific machines, monitors the successes and failures of these individual Tasks, and works to complete the entire batch job.
Apache Hadoop is such a MapReduce engine. It provides its own distributed filesystem and runs Hadoop MapReduce jobs on servers near the data stored on that filesystem, or on any other supported filesystem, of which there is more than one.
For maximum parallelism, you need the Maps and Reduces to be stateless, to not depend on any data generated in the same MapReduce job. You cannot control the order in which the maps run, or the reductions.
Whether MapReduce will solve your problem depends on your algorithms: if you can rewrite them as Maps and Reduces, then yes. If not, then no.
It is not a silver bullet for all the problems of scale, just a good technique to work on large sets of data when you can work on small pieces of that dataset in parallel.
Pseudo Distributed Hadoop is where Hadoop runs as a set of independent JVMs, but only on a single host. It has much lower performance than a real Hadoop cluster, due to the smaller number of hard disks limiting IO bandwidth. It is, however, a good way to play with new MR algorithms on very small datasets, and to learn how to use Hadoop. Developers working in the Hadoop codebase usually test their code in this mode before deploying their build of Hadoop to a local test cluster.
If you are running in this mode (and don't have a proxy server fielding HTTP requests), and have not changed the default port values, then both the NameNode and JobTracker can be reached through web pages on localhost.
These are the standard ports; if the configuration files are changed then they will not be valid.
HDFS filesystem browser: http://localhost:50075/browseDirectory.jsp?namenodeInfoPort=50070&dir=/
Server logs: http://localhost:50070/logs/
With only a single HDFS DataNode, the replication factor should be set to 1; the same goes for the replication factor of submitted jars. You also need to tell the JobTracker not to try handing a failing task to another TaskTracker, and not to blacklist a tracker that appears to fail a lot. While those options are essential in large clusters with many machines, some of which will start to fail, on a single-node cluster they do more harm than good.
mapred.submit.replication=1
mapred.skip.attempts.to.start.skipping=1
mapred.max.tracker.failures=10000
mapred.max.tracker.blacklists=10000
mapred.map.tasks.speculative.execution=false
mapred.reduce.tasks.speculative.execution=false
tasktracker.http.threads=5
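Hadoop reads such settings from XML configuration files made of property elements with a name and a value. The Python sketch below renders the settings above into that shape. The helper to_hadoop_xml is hypothetical, and which file the output belongs in (for example mapred-site.xml) depends on your Hadoop version, so treat that as an assumption to check against your installation.

```python
from xml.sax.saxutils import escape


def to_hadoop_xml(properties):
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines)


# The single-node settings listed above, kept as strings so that
# "false" is written exactly as it appears in the list.
settings = {
    "mapred.submit.replication": "1",
    "mapred.skip.attempts.to.start.skipping": "1",
    "mapred.max.tracker.failures": "10000",
    "mapred.max.tracker.blacklists": "10000",
    "mapred.map.tasks.speculative.execution": "false",
    "mapred.reduce.tasks.speculative.execution": "false",
    "tasktracker.http.threads": "5",
}

xml_text = to_hadoop_xml(settings)
print(xml_text)
```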