[Hadoop] “Too many fetch-failures” / stuck reducers issue

I’m posting the solution here to help any ‘Hadoopers’ who hit the same problem. This issue has been asked about a lot on the Hadoop mailing list, but no answer has been given so far.
After installing a Hadoop cluster and trying to run some jobs, you may see the reducers get stuck, while the TaskTracker log on one of the worker nodes shows messages like these:
INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
INFO org.apache.hadoop.mapred.TaskTracker: task_200801281756_0001_r_000000_0 0.2727273% reduce > copy (9 of 11 at 0.00 MB/s) >
INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200801281756_0001_r_000000_0 … killing it



The reducers are failing to fetch map output from the other worker nodes (this copy happens over HTTP between TaskTrackers, not through HDFS). What you should do is double-check your Linux network and Hadoop configuration:

1. Make sure that all the required parameters are set in hadoop-site.xml, and that every worker node has an identical copy of this file.
2. The URIs for the JobTracker and HDFS should use hostnames instead of IP addresses. I have seen Hadoop clusters that used IP addresses in these URIs; they could start all the services and run jobs, but the tasks never finished successfully.
3. Check the /etc/hosts file on every node and make sure each hostname is bound to its network IP address, not the loopback address (127.0.0.1). Don’t forget to verify that every node can reach all the others by hostname.
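To illustrate points 1 and 2, a hadoop-site.xml along these lines is what you want on every node. The hostname `master` and the ports are placeholders for your own cluster; `fs.default.name` and `mapred.job.tracker` are the classic property names from this era of Hadoop:

```xml
<!-- hadoop-site.xml: keep this file identical on all nodes -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- NameNode URI: use the hostname, not the IP address -->
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <!-- JobTracker address: again a hostname, not an IP -->
    <value>master:9001</value>
  </property>
</configuration>
```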
Anyway, it doesn’t make sense to me that Hadoop insists on resolving addresses through hostnames. I consider this a bug in Hadoop and hope it will be fixed in the next stable release.
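Since a hostname bound to 127.0.0.1 is the most common cause of point 3, here is a minimal sketch of a check you can run against your /etc/hosts content. The sample entries and node names (`master`, `worker1`, `worker2`) are hypothetical:

```python
# Sketch: flag hostnames in /etc/hosts-style content that resolve to the
# loopback address. Such entries break the reducers' fetch of map output,
# because other nodes receive 127.0.0.1 as the worker's address.

def find_loopback_bound(hosts_text):
    """Return hostnames mapped to a 127.x address (other than localhost)."""
    bad = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        ip, *names = line.split()
        if ip.startswith("127."):
            bad.extend(n for n in names
                       if n not in ("localhost", "localhost.localdomain"))
    return bad

sample = """
127.0.0.1     localhost worker1    # wrong: worker1 is bound to loopback
192.168.1.10  master
192.168.1.11  worker2
"""
print(find_loopback_bound(sample))  # -> ['worker1']
```

Any hostname this reports should be moved onto a line with the node’s real network IP.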
