A problem encountered with the FairScheduler in Hadoop 2.2.0

While using the FairScheduler in Hadoop 2.2.0, we ran into the following problem:

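When several clients submitted jobs at the same time, the newly created app attempts never reached the FairScheduler's eventQueue, so the FairScheduler never scheduled them. When the AM then asked the scheduler for information about these jobs, the following error appeared, and it was logged many times: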

2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1122_000001
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1121_000001

Searching the log for clues, we found that the scheduler's own scheduling entries were missing.

Log of a normally scheduled job:

2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1120 submitted by user root
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1120
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=root     IP=192.168.24.101       OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1384743376038_1120
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from NEW to NEW_SAVING
2013-11-27 14:25:36,515 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1120
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1120_000001
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1120_000001 State change from NEW to SUBMITTED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application Submission: appattempt_1384743376038_1120_000001, user: root, currently active: 2
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1120_000001 State change from SUBMITTED to SCHEDULED
2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1120 State change from SUBMITTED to ACCEPTED
2013-11-27 14:25:36,816 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable: Node offered to app: application_1384743376038_1120 reserved: false

Log of the failing case:

2013-11-27 14:27:01,391 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1122
2013-11-27 14:27:01,391 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Allocated new applicationId: 1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1121 submitted by user yangping.wu
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.ClientRMService: Application with id 1122 submitted by user yangping.wu
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yangping.wu      IP=192.168.24.101 OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yangping.wu      IP=192.168.24.101       OPERATION=Submit Application Request    TARGET=ClientRMService  RESULT=SUCCESS  APPID=application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1122 State change from NEW to NEW_SAVING
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1122
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1121 State change from NEW to NEW_SAVING
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Storing info for app: application_1384743376038_1121
2013-11-27 14:27:02,252 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1122 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1384743376038_1121 State change from NEW_SAVING to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1122_000001
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1122_000001 State change from NEW to SUBMITTED
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Registering app attempt : appattempt_1384743376038_1121_000001
2013-11-27 14:27:02,253 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1384743376038_1121_000001 State change from NEW to SUBMITTED
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1122_000001
2013-11-27 14:27:02,258 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Request for appInfo of unknown attemptappattempt_1384743376038_1121_000001

Notice that the failing log lacks the scheduler's application-submission entry, which a normal run produces. For example:

2013-11-27 14:25:36,516 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Application Submission: appattempt_1384743376038_1120_000001, user: root, currently active: 2

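Tracing through the code, we first suspected a concurrency problem. After a job is created it is put into the SchedulerEventDispatcher.eventQueue queue, but that queue is a LinkedBlockingQueue, which is thread-safe, so it is probably not the cause.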
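
The reasoning about the queue can be checked in isolation. The sketch below is only an illustration and has nothing to do with the ResourceManager code itself: the class QueueThreadSafetyDemo, the producer counts and the event strings are all made up. It shows that concurrent producers adding to a java.util.concurrent.LinkedBlockingQueue do not lose elements.

```java
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical demo class; not part of Hadoop. */
class QueueThreadSafetyDemo {

    /** Starts `producers` threads that each add `perProducer` items, then returns the queue size. */
    static int concurrentAdds(int producers, int perProducer) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            final int id = p;
            threads[p] = new Thread(() -> {
                for (int i = 0; i < perProducer; i++) {
                    // The queue is unbounded here, so add() cannot fail for lack of capacity.
                    queue.add("event-" + id + "-" + i);
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return queue.size();
    }

    public static void main(String[] args) {
        System.out.println(concurrentAdds(8, 10000)); // prints 80000
    }
}
```

Since no element is lost on the producer side, a missing scheduler event is more likely explained by the consumer side not making progress.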

The symptom is that the scheduler stops working and no longer responds, while the other functions of the ResourceManager are unaffected.

Our inference was that a deadlock had occurred inside the scheduler.

At this point we could not reproduce the problem, which made it even harder to fix.

If anyone has run into the same problem, please share what you found.

We will keep trying to reproduce it ourselves and hope to resolve it soon.


After nearly 10 hours of continuously submitting jobs, we finally reproduced the failure. Sure enough, the FairScheduler was blocked internally. We filed an issue:

https://issues.apache.org/jira/browse/YARN-1458


Working together with Sandy, an engineer at Cloudera, and our teammate Damai (大麦), we finally solved the problem, and we also submitted a patch.

Cause:

If every job in a queue requests 0 resources, or the jobs have not yet submitted any resource request, then at:

org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
an infinite loop occurs. The code:

while (resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
        < totalResource) {
      rMax *= 2.0;
    }

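If
resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
keeps returning 0, then doubling rMax (rMax *= 2.0) never changes the result. The value stays below totalResource, the loop condition stays true, and the program spins forever.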
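
The loop's behaviour can be reproduced in a small toy model. The sketch below is not the Hadoop source: the class ZeroWeightLoopDemo, the plain list of weights and the maxIterations cap are assumptions introduced for illustration, and the helper simply sums weight * ratio. It only shows that when every weight is 0 the sum stays 0 no matter how large rMax becomes.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the doubling loop; hypothetical names, not the Hadoop implementation. */
class ZeroWeightLoopDemo {

    /** Stand-in for the real method: total of weight * ratio over all schedulables. */
    static double resourceUsedWithWeightToResourceRatio(double ratio, List<Double> weights) {
        double used = 0.0;
        for (double w : weights) {
            used += w * ratio;
        }
        return used;
    }

    /**
     * Runs the doubling loop but gives up after maxIterations.
     * Returns the number of doublings performed, or -1 if the cap was reached.
     */
    static int doublingsNeeded(List<Double> weights, double totalResource, int maxIterations) {
        double rMax = 1.0;
        int iterations = 0;
        while (resourceUsedWithWeightToResourceRatio(rMax, weights) < totalResource) {
            if (iterations == maxIterations) {
                return -1; // the loop quoted above has no such cap and would never exit
            }
            rMax *= 2.0;
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Non-zero weights: the loop ends once 3 * rMax reaches 1024.
        System.out.println(doublingsNeeded(Arrays.asList(1.0, 2.0), 1024, 1000)); // prints 9
        // All weights zero: the sum is always 0, so only the artificial cap stops the loop.
        System.out.println(doublingsNeeded(Arrays.asList(0.0, 0.0), 1024, 1000)); // prints -1
    }
}
```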


Fix:

1. In yarn-site.xml, set the following property to false. This avoids the problem:

        <property>
                <name>yarn.scheduler.fair.sizebasedweight</name>
                <value>false</value>
        </property>

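2. When the method below returns 0, break out of the loop immediately; the scheduler then treats the queue's weight as 0. See the discussion section of the issue for the details.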

resourceUsedWithWeightToResourceRatio(rMax, schedulables, type)
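
As a rough illustration of this second approach, here is a minimal sketch under toy assumptions. It is not the patch attached to YARN-1458: the class GuardedDoublingSketch, the method findUpperBound and the use of a plain list of non-negative weights are hypothetical, and returning 0.0 for the all-zero case is just one way to signal "treat the weight as 0".

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a guarded doubling loop; not the actual Hadoop patch. */
class GuardedDoublingSketch {

    /** Toy stand-in: total of weight * ratio, assuming non-negative weights. */
    static double resourceUsed(double ratio, List<Double> weights) {
        double used = 0.0;
        for (double w : weights) {
            used += w * ratio;
        }
        return used;
    }

    /**
     * Doubles rMax until the weighted usage covers totalResource.
     * Returns 0.0 instead of looping forever when the usage is exactly 0,
     * which in this toy model happens only if every weight is 0.
     */
    static double findUpperBound(List<Double> weights, double totalResource) {
        double rMax = 1.0;
        while (true) {
            double used = resourceUsed(rMax, weights);
            if (used >= totalResource) {
                return rMax;
            }
            if (used == 0.0) {
                return 0.0; // guard: doubling rMax can never raise a zero sum
            }
            rMax *= 2.0;
        }
    }

    public static void main(String[] args) {
        System.out.println(findUpperBound(Arrays.asList(1.0, 2.0), 1024)); // prints 512.0
        System.out.println(findUpperBound(Arrays.asList(0.0, 0.0), 1024)); // prints 0.0
    }
}
```

The guard compares against exactly 0 because the toy sum is a double; a version that truncates shares to integers would need a different check, since small non-zero weights could also round down to 0.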



