A Little Progress Every Day: Disk Failure Causes the container-sync Service to Exit (Swift Bug)

  Please credit the source when reposting: http://blog.csdn.net/cywosp/article/details/23848083

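    In an earlier project I built a module that monitors the state of each Swift service. Swift has 15 services: container-updater, account-auditor, object-replicator, proxy-server, container-replicator, object-auditor, object-expirer, container-auditor, container-server, account-server, account-reaper, container-sync, account-replicator, object-updater, and object-server. Of these, proxy-server, account-server, container-server, and object-server matter most: if they stop working, the Swift cluster can no longer serve requests. Monitoring the state of these services is therefore especially important when handling cluster faults.
    A while ago, problems that surfaced while the monitoring module was running led me to discover a few minor Swift bugs. One of them is that the container-sync service stops when a disk that has been added to Swift is damaged. The log output for this bug looks like this: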
Apr 15 10:07:24 0d7d51e8-024e-3a94-a310-46cf5426b3f9 container-sync UNCAUGHT EXCEPTION#012Traceback (most recent call last):#012 File "/usr/bin/swift-container-sync", line 23, in <module>#012 run_daemon(ContainerSync, conf_file, **options)#012 File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 110, in run_daemon#012 klass(conf).run(once=once, **kwargs)#012 File "/usr/lib/python2.6/site-packages/swift/common/daemon.py", line 57, in run#012 self.run_forever(**kwargs)#012 File "/usr/lib/python2.6/site-packages/swift/container/sync.py", line 162, in run_forever#012 for path, device, partition in all_locs:#012 File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1521, in audit_location_generator#012 partitions = listdir(datadir_path)#012 File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 1814, in listdir#012 return os.listdir(path)#012OSError: [Errno 5] Input/output error: '/srv/node/sdb1/containers'
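The `#012` sequences in the log line above are how syslog daemons such as rsyslog escape a newline (octal 012) so that a multi-line traceback fits on one line. The following is a minimal sketch for turning such a line back into a readable traceback. The helper name `decode_syslog_escapes` is made up for illustration, the sample string is a shortened version of the log line, and the code is written for Python 3 rather than the Python 2.6 shown in the log.

```python
import re

def decode_syslog_escapes(line):
    """Replace '#ooo' octal escapes (for example '#012') with the real character."""
    return re.sub(r'#([0-7]{3})', lambda m: chr(int(m.group(1), 8)), line)

# Shortened version of the log line above.
sample = ("container-sync UNCAUGHT EXCEPTION#012Traceback (most recent call last):"
          "#012OSError: [Errno 5] Input/output error: '/srv/node/sdb1/containers'")

decoded = decode_syslog_escapes(sample)
print(decoded)
```

Note that this simple pattern would also rewrite a literal `#` followed by three octal digits if one appeared in the original message, so it is best used only for reading logs, not for processing them automatically.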

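From the log output we can see that disk sdb1 hit an input/output error, which made the program raise an exception when it called the listdir function. listdir is implemented as follows: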
#swift/common/utils.py
import errno
import os

def listdir(path):
    try:
        return os.listdir(path)
    except OSError as err:
        # ENOENT: No such file or directory
        if err.errno != errno.ENOENT:
            raise         # any error other than a missing path (e.g. EIO) is re-raised
    return []             # a missing path is treated as an empty directory
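To make the two code paths concrete, here is a small sketch that reproduces the same logic and exercises it with a missing path and with a simulated I/O error. The EIO failure is faked with `unittest.mock` (an assumption made for the demo, since a real failing disk is hard to come by), and the code targets Python 3.

```python
import errno
import os
import tempfile
from unittest import mock

def listdir(path):
    # Same logic as the excerpt above.
    try:
        return os.listdir(path)
    except OSError as err:
        if err.errno != errno.ENOENT:
            raise
    return []

# Case 1: the path does not exist (ENOENT), so an empty list comes back.
missing = listdir(os.path.join(tempfile.mkdtemp(), 'does-not-exist'))

# Case 2: os.listdir fails with EIO, as it did on the damaged disk.
def failing_listdir(path):
    raise OSError(errno.EIO, 'Input/output error', path)

with mock.patch('os.listdir', failing_listdir):
    try:
        listdir('/srv/node/sdb1/containers')
        eio_propagated = False
    except OSError as err:
        eio_propagated = (err.errno == errno.EIO)

print(missing, eio_propagated)
```

Only the missing-path case is swallowed; the I/O error reaches the caller unchanged.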
listdir is called by audit_location_generator, which is implemented as follows:
#swift/common/utils.py
def audit_location_generator(devices, datadir, suffix='', mount_check=True, logger=None):
    device_dir = listdir(devices)
    # randomize devices in case of process restart before sweep completed
    shuffle(device_dir)
    for device in device_dir:
        ……
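This function does not catch exceptions, so any exception raised inside it keeps propagating upward. Because it is a generator, the exception surfaces in the caller's for loop, which is exactly where the traceback points.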

audit_location_generator is called by run_forever, which is implemented as follows:
#swift/container/sync.py
def run_forever(self):
        sleep(random() * self.interval)
        while True:
            begin = time()
            all_locs = audit_location_generator(self.devices,
                                                container_server.DATADIR,
                                                '.db',
                                                mount_check=self.mount_check,
                                                logger=self.logger)
            for path, device, partition in all_locs:
                self.container_sync(path)
                if time() - self.reported >= 3600: # once an hour
                    self.report()
            elapsed = time() - begin
            if elapsed < self.interval:
                sleep(self.interval - elapsed)
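From these three functions and their call chain we can see that run_forever does not catch exceptions either. listdir swallows only ENOENT, so the EIO raised by the damaged disk passes through audit_location_generator and out of run_forever, which exits abnormally and takes the container-sync process down with it.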
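One possible way to keep the daemon alive is to log the failing directory and skip it instead of letting the error escape. The sketch below only illustrates that idea and is not the fix adopted upstream (if any). `safe_listdir`, `iter_partitions`, and the `raw_listdir` parameter are hypothetical names, the directory layout is simplified, and the disk failure is simulated. It is written for Python 3.

```python
import errno
import logging
import os
import tempfile

def safe_listdir(path, logger, raw_listdir=os.listdir):
    """Hypothetical helper: list a directory, logging and skipping on I/O errors."""
    try:
        return raw_listdir(path)
    except OSError as err:
        if err.errno != errno.ENOENT:
            # Unlike the original listdir, do not re-raise: log and move on.
            logger.warning('Skipping %s: %s', path, err)
    return []

def iter_partitions(devices, datadir, logger, raw_listdir=os.listdir):
    """Yield (device, partition) pairs, skipping directories that cannot be listed."""
    for device in safe_listdir(devices, logger, raw_listdir):
        datadir_path = os.path.join(devices, device, datadir)
        for partition in safe_listdir(datadir_path, logger, raw_listdir):
            yield device, partition

# Demo: two fake devices; listing sdb1's data directory fails with EIO.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sda1', 'containers', '100'))
os.makedirs(os.path.join(root, 'sdb1', 'containers', '200'))

def flaky_listdir(path):
    # Simulate the damaged disk: any directory directly under sdb1 fails.
    if os.path.basename(os.path.dirname(path)) == 'sdb1':
        raise OSError(errno.EIO, 'Input/output error', path)
    return os.listdir(path)

logger = logging.getLogger('container-sync-demo')
found = sorted(iter_partitions(root, 'containers', logger, flaky_listdir))
print(found)
```

Whether skipping a failed device is the right policy is a judgment call: the sweep continues on healthy disks, but the failure is then visible only in the log, so the monitoring side still needs to watch for it. Another option would be to catch the exception in the loop of run_forever and retry on the next interval.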

The entries recorded in /var/log/messages when the disk I/O error occurred:
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: scanning ...
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: end_request: I/O error, dev sdb, sector 976403386
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): metadata I/O error: block 0x3a32bb76 ("xlog_iodone") error 5 numblks 64
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1115 of file fs/xfs/xfs_log.c. Return address = 0xffffffffa072c8b1
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): Please umount the filesystem and rectify the problem(s)
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Apr 15 10:06:41 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: sd 0:2:1:0: [sdb] Synchronizing SCSI cache
Apr 15 10:06:54 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Apr 15 10:07:24 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.
Apr 15 10:07:54 0d7d51e8-024e-3a94-a310-46cf5426b3f9 kernel: XFS (sdb1): xfs_log_force: error 5 returned.

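    This problem does not cause much trouble for the cluster as a whole, and the probability of a disk failing is already quite low nowadays, but it does have a small impact on the health of the cluster and on the consistency of container data. I therefore filed the bug on Swift's official bug tracker; whether the maintainers will accept and fix it remains to be seen. For details, see: https://bugs.launchpad.net/swift/+bug/1307798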


