While running the FairScheduler on Hadoop 2.2.0, we hit the following problem:
When multiple clients submitted jobs at the same time, the generated appattempt never entered the FairScheduler's eventQueue, so the FairScheduler never scheduled those jobs. When the AM then asked the scheduler for information about the job, the error below appeared, and it was logged over and over:
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1122_000001
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1121_000001
Combing through the logs for clues, we found that the scheduler's usual scheduling entries never appeared.
Log of a job that was scheduled normally:
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1120 submitted by user root
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1120
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root IP=192.168.24.101 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1384743376038_1120
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from NEW to NEW_SAVING
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1120
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1120_000001
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1120_000001 State change from NEW to SUBMITTED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application Submission: appattempt_1384743376038_1120_000001, user: root, currently active: 2
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1120_000001 State change from SUBMITTED to SCHEDULED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from SUBMITTED to ACCEPTED
2013-11-27 14:25:36,816 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: Node offered to app: application_1384743376038_1120 reserved: false
Log of the abnormal case:
2013-11-27 14:27:01,391 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1122
2013-11-27 14:27:01,391 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1121 submitted by user yangping.wu
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1122 submitted by user yangping.wu
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yangping.wu IP=192.168.24.101 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yangping.wu IP=192.168.24.101 OPERATION=Submit Application Request TARGET=ClientRMService RESULT=SUCCESS APPID=application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1122 State change from NEW to NEW_SAVING
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1121 State change from NEW to NEW_SAVING
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1122 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1121 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1122_000001
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1122_000001 State change from NEW to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1121_000001
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1121_000001 State change from NEW to SUBMITTED
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1122_000001
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1121_000001
That difference did not seem to be the cause either.
The symptom was that the scheduler simply stopped working and became unresponsive, while the rest of the ResourceManager's functionality was unaffected.
Our inference was that a deadlock had occurred somewhere inside the scheduler.
At that point we could not reproduce the problem, which made it even harder to fix...
If anyone has run into the same problem, please share what you found.
We kept trying to reproduce the problem ourselves, hoping to get it resolved soon...
After nearly 10 hours of continuously submitting jobs, we finally reproduced the exception. Sure enough, the FairScheduler was blocked internally. We filed an issue:
https://issues.apache.org/jira/browse/YARN-1458
Working with Cloudera engineer Sandy and our teammate Damai, we finally tracked the problem down and submitted a patch.
Cause:
If every job in a queue requests zero resources, or the jobs have not yet submitted any resource requests, then an infinite loop occurs in org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102). The code:
while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) < totalResource) {
  rMax *= 2.0;
}
When resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) returns 0 no matter how large rMax gets, the doubling rMax *= 2.0 has no effect on the loop condition: 0 stays below totalResource, and the program spins in this loop forever.
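To make the failure mode concrete, here is a small standalone sketch of the doubling loop. The class, the Sched type, the usageAtRatio helper, and the 8192 capacity are our own simplified approximations, not the actual Hadoop code. When every schedulable's weight and demand are zero, the computed usage is 0 for any rMax, so the real loop would never exit; the sketch caps the iterations only so it can print that fact.

import java.util.Arrays;
import java.util.List;

public class FairShareLoopSketch {

  // Stand-in for a Schedulable: only the fields that matter for this sketch.
  static class Sched {
    final double weight;
    final int demand;
    Sched(double weight, int demand) { this.weight = weight; this.demand = demand; }
  }

  // Rough analogue of resourceUsedWithWeightToResourceRatio: each schedulable's
  // share at this ratio is capped by its demand, and the shares are summed.
  static double usageAtRatio(double ratio, List<Sched> scheds) {
    double total = 0;
    for (Sched s : scheds) {
      total += Math.min(s.weight * ratio, s.demand);
    }
    return total;
  }

  public static void main(String[] args) {
    // Every app has zero weight and zero demand, e.g. no resource request yet.
    List<Sched> scheds = Arrays.asList(new Sched(0.0, 0), new Sched(0.0, 0));
    int totalResource = 8192;  // capacity to distribute (hypothetical number)

    double rMax = 1.0;
    int iterations = 0;
    // The original loop keeps doubling rMax until usage reaches totalResource.
    // Usage stays 0 here, so without the iteration cap this would never stop.
    while (usageAtRatio(rMax, scheds) < totalResource && iterations < 100) {
      rMax *= 2.0;
      iterations++;
    }
    System.out.println("usage is still " + usageAtRatio(rMax, scheds)
        + " after " + iterations + " doublings; the real loop never terminates");
  }
}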
Fix:
1. Set the following property in yarn-site.xml:

<property>
  <name>yarn.scheduler.fair.sizebasedweight</name>
  <value>false</value>
</property>

Setting it to false avoids the problem (with size-based weights enabled, an application's weight is derived from the memory it has requested, so an application with no outstanding request ends up with a weight of 0, which appears to be exactly the situation that makes the loop spin).
2. When the method below returns 0, break out of the loop immediately; the scheduler then treats that queue's weight as 0 (a sketch of this guard follows the method signature below). For the full description, see the discussion on the issue.
resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
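A minimal sketch of that guard, applied to the loop shown earlier. This is our illustration of the idea described above; the patch actually attached to YARN-1458 may be structured differently (for instance, it would likely cache the helper's return value rather than call it twice per iteration).

while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) < totalResource) {
  if (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type) == 0) {
    // Every schedulable in the queue has zero weight/demand, so doubling rMax
    // can never raise the usage; stop here and treat the queue's weight as 0.
    break;
  }
  rMax *= 2.0;
}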