CephFS (Jewel) client warning:

mds0: Client l10-190 failing to respond to capability release

2016-08-26 08:01:49.905823 mds.0 192.168.14.120:6800/171607 3 : cluster [WRN] 7 slow requests, 5 included below; oldest blocked for > 33.942552 secs
2016-08-26 08:01:49.905831 mds.0 192.168.14.120:6800/171607 4 : cluster [WRN] slow request 32.990829 seconds old, received at 2016-08-26 08:01:16.914880: client_request(client.1840252:164888 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.919088) currently failed to rdlock, waiting
2016-08-26 08:01:49.905840 mds.0 192.168.14.120:6800/171607 5 : cluster [WRN] slow request 32.945069 seconds old, received at 2016-08-26 08:01:16.960641: client_request(client.1840252:164889 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:16.965089) currently failed to rdlock, waiting
2016-08-26 08:01:49.905848 mds.0 192.168.14.120:6800/171607 6 : cluster [WRN] slow request 32.819194 seconds old, received at 2016-08-26 08:01:17.086515: client_request(client.1840252:164893 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:17.091092) currently failed to rdlock, waiting
2016-08-26 08:01:49.905852 mds.0 192.168.14.120:6800/171607 7 : cluster [WRN] slow request 33.942552 seconds old, received at 2016-08-26 08:01:15.963158: client_request(client.1840252:164835 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.967068) currently failed to rdlock, waiting
2016-08-26 08:01:49.905857 mds.0 192.168.14.120:6800/171607 8 : cluster [WRN] slow request 33.930154 seconds old, received at 2016-08-26 08:01:15.975555: client_request(client.1840252:164836 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.980068) currently failed to rdlock, waiting
2016-08-26 08:01:54.905862 mds.0 192.168.14.120:6800/171607 9 : cluster [WRN] 7 slow requests, 2 included below; oldest blocked for > 38.942642 secs
2016-08-26 08:01:54.905868 mds.0 192.168.14.120:6800/171607 10 : cluster [WRN] slow request 38.920220 seconds old, received at 2016-08-26 08:01:15.985579: client_request(client.1840252:164837 getattr pAsLsXsFs #1000000760d 2016-08-26 08:01:15.990068) currently failed to rdlock, waiting
2016-08-26 08:01:54.905871 mds.0 192.168.14.120:6800/171607 11 : cluster [WRN] slow request 38.894827 seconds old, received at 2016-08-26 08:01:16.010972: client_request(client.1840252:164838 getattr pAsLsXsFs #100
cluster 75f7dde4-d350-4853-9asda6b4ed2
     health HEALTH_WARN
            mds0: Client l10-190 failing to respond to capability release
            mds0: Client l10-191 failing to respond to capability release
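To see which client session and inode the blocked requests pile up on, the slow-request lines above can be tallied with a short script. This is only a sketch: the regular expression is written against the log lines shown in this post, and the names SLOW_REQ, parse_slow_request and summarize are made up for illustration. The client value is the numeric session id as logged; it is not mapped to the host names in the health output.

```python
import re
from collections import Counter

# Pattern written against the "slow request" lines shown above (an assumption:
# other Ceph releases or request types may format the line differently).
SLOW_REQ = re.compile(
    r"slow request (?P<age>\d+(?:\.\d+)?) seconds old.*?"
    r"client_request\(client\.(?P<client>\d+):(?P<tid>\d+) "
    r"(?P<op>\w+) (?:(?P<mask>[^#\s]\S*) )?#(?P<ino>[0-9a-fA-F]+)"
    r"(?:.*currently (?P<state>.+))?"
)


def parse_slow_request(line):
    """Return a dict describing one slow-request line, or None if it does not match."""
    m = SLOW_REQ.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["age"] = float(rec["age"])
    if rec["state"] is not None:
        rec["state"] = rec["state"].strip()
    return rec


def summarize(lines):
    """Count slow requests per (client id, inode, blocking state)."""
    counts = Counter()
    for line in lines:
        rec = parse_slow_request(line)
        if rec is not None:
            counts[(rec["client"], rec["ino"], rec["state"])] += 1
    return counts
```

Passing the log lines to summarize() and reading the result with most_common() shows at a glance whether all the blocked requests come from one client and hit one inode, as they do in the excerpt above (client.1840252, inode 1000000760d).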
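Root cause:
    CDN image prefetching triggered a large number of back-to-origin requests, which brought the nginx process down.
Solution:
    Temporary workaround: stop the nginx service, then remount with mount -t ceph *:path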
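The temporary workaround can be written down as a small dry-run helper, so the exact commands are reviewed before anything is executed. This is a sketch under assumptions: the monitor address, remote path and mount point are hypothetical parameters (the note above elides them), the umount step is an assumption about what the remount involves, and the nginx path is the one used by the watchdog script in this post.

```python
import subprocess

NGINX = "/usr/local/nginx/sbin/nginx"  # path used by the watchdog script in this post


def build_recovery_commands(mon_addr, remote_path, mount_point, options=None):
    """Return the commands for: stop nginx, unmount, remount. Nothing is executed here."""
    mount_cmd = ["mount", "-t", "ceph",
                 "%s:%s" % (mon_addr, remote_path), mount_point]
    if options:
        # e.g. credentials; left to the caller because they are site-specific
        mount_cmd += ["-o", options]
    return [
        [NGINX, "-s", "stop"],
        ["umount", mount_point],  # assumption: the old mount is dropped first
        mount_cmd,
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them in order when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)
```

With dry_run left at True, run(build_recovery_commands(...)) only prints the three commands. Whether a plain umount succeeds on a hung CephFS mount depends on the state of the client, so the printed list is a starting point, not a guaranteed recovery procedure.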
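Thoughts:
    1. Video back-to-origin requests take the nginx process down as soon as they start. Can the way
       video is fetched from the origin be changed, e.g. by setting a slice size for video files and
       reducing the concurrency?
    2. Can nginx be tuned, e.g. by raising its cache parameters?
    3. Can CephFS be tuned?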
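After a long period of monitoring, the problem kept occurring. Final workaround, in outline:
    1. Check the port status on the client
    2. If the port is unreachable or the check times out,
    3. restart the nginx service
    4. The code is as follows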
#!/usr/bin/env python
#-*-coding:UTF-8-*-
"""
@Item   :  Cephfs 
@Author :  Villiam Sheng
@Group  :  System Group
@Date   :  2016-08-11
@Mail   :  [email protected]
@Function:
           Check NGX Server 
"""

import os, socket, sys, time, traceback

LOGFILE = '/tmp/ngx.log'


def LOG(info):
    # Append one timestamped line; mode 'a' creates the file if it is missing,
    # and the with-block closes the handle after every write.
    with open(LOGFILE, 'a') as fopen:
        fopen.write("%s INFO  %s \n" % (time.ctime(), info))

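class TelnetHttp(object):
    """TCP reachability check for the nginx listen port."""

    def __init__(self):
        self.version = 0

    def work(self, dest_addr, port):
        # Return True if a TCP connection succeeds within 6 seconds.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.settimeout(6)
        try:
            sock.connect((dest_addr, int(port)))
        except socket.error as e:
            # Covers refused connections as well as timeouts.
            LOG("ping ngx port  %s:%s failed: %s" % (dest_addr, port, e))
            return False
        finally:
            sock.close()
        LOG("ping ngx port  %s:%s is OK" % (dest_addr, port))
        return True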


NGINX = "/usr/local/nginx/sbin/nginx"


def restart_nginx():
    # Stop nginx, give it time to exit, then start it again.
    os.system("%s -s stop" % NGINX)
    time.sleep(6)
    os.system(NGINX)
    LOG("Ngx service anomalies, restart the NGX services")


if __name__ == '__main__':
    try:
        # Detach from the terminal: the parent exits, the child keeps running.
        pid = os.fork()
        if pid > 0:
            sys.exit(0)
        os.setsid()
        os.chdir('/')
        # Point stdin/stdout/stderr at /dev/null at the file-descriptor level.
        devnull = os.open(os.devnull, os.O_RDWR)
        for fd in (0, 1, 2):
            os.dup2(devnull, fd)

        checker = TelnetHttp()
        while True:
            try:
                if checker.work("127.0.0.1", 80):
                    time.sleep(3)
                else:
                    restart_nginx()
            except Exception:
                # Treat an unexpected error during the check as a failed check.
                LOG(traceback.format_exc())
                restart_nginx()
    except (IOError, OSError):
        LOG(traceback.format_exc())
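
The script above restarts nginx on the first failed probe. One possible refinement, not part of the original script, is to restart only after several consecutive failures, so that a single slow connect does not bounce the service. A minimal sketch, assuming a threshold of 3; port_is_open and FailureCounter are hypothetical names introduced here:

```python
import socket


def port_is_open(host, port, timeout=6.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        sock = socket.create_connection((host, int(port)), timeout=timeout)
    except socket.error:
        # Refused connections and timeouts both end up here.
        return False
    sock.close()
    return True


class FailureCounter(object):
    """Track consecutive probe failures and signal when a restart is due."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def record(self, ok):
        """Record one probe result; return True when the threshold is reached."""
        if ok:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            # Reset so the next restart again needs `threshold` failures.
            self.failures = 0
            return True
        return False
```

In the main loop this would replace the direct restart: call counter.record(port_is_open("127.0.0.1", 80)) on every pass and restart nginx only when it returns True. The trade-off is a longer outage before recovery (threshold times the probe interval), so the threshold needs to match how quickly the CDN traffic must be served again.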