CephFS: Common Commands and Troubleshooting

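Our production environment recently started using CephFS for file system storage. This post records the problems we ran into along the way, as well as some commonly used commands.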

1. Common commands

1.1 ceph daemon mds.xxx help

ceph daemon is a very commonly used command for inspecting the state of Ceph's individual daemons. The help subcommand lists every subcommand the MDS daemon supports:

$ sudo ceph daemon mds.cephfs-master1 help
{
    "cache status": "show cache status",
    "config diff": "dump diff of current config and default config",
    "config diff get": "dump diff get : dump diff of current and default config setting ",
    "config get": "config get : get the config value",
    "config help": "get config setting schema and descriptions",
    "config set": "config set   [ ...]: set a config variable",
    "config show": "dump current config settings",
    "dirfrag ls": "List fragments in directory",
    "dirfrag merge": "De-fragment directory by path",
    "dirfrag split": "Fragment directory by path",
    "dump cache": "dump metadata cache (optionally to a file)",
    "dump loads": "dump metadata loads",
    "dump tree": "dump metadata cache for subtree",
    "dump_blocked_ops": "show the blocked ops currently in flight",
    "dump_historic_ops": "show slowest recent ops",
    "dump_historic_ops_by_duration": "show slowest recent ops, sorted by op duration",
    "dump_mempools": "get mempool stats",
    "dump_ops_in_flight": "show the ops currently in flight",
    "export dir": "migrate a subtree to named MDS",
    "flush journal": "Flush the journal to the backing store",
    "flush_path": "flush an inode (and its dirfrags)",
    "force_readonly": "Force MDS to read-only mode",
    "get subtrees": "Return the subtree map",
    "get_command_descriptions": "list available commands",
    "git_version": "get git sha1",
    "help": "list available commands",
    "log dump": "dump recent log entries to log file",
    "log flush": "flush log entries to log file",
    "log reopen": "reopen log file",
    "objecter_requests": "show in-progress osd requests",
    "ops": "show the ops currently in flight",
    "osdmap barrier": "Wait until the MDS has this OSD map epoch",
    "perf dump": "dump perfcounters value",
    "perf histogram dump": "dump perf histogram values",
    "perf histogram schema": "dump perf histogram schema",
    "perf reset": "perf reset : perf reset all or one perfcounter name",
    "perf schema": "dump perfcounters schema",
    "scrub_path": "scrub an inode and output results",
    "session evict": "Evict a CephFS client",
    "session ls": "Enumerate connected CephFS clients",
    "status": "high-level status of MDS",
    "tag path": "Apply scrub tag recursively",
    "version": "get ceph version"
}

1.2 ceph daemon mds.xxx cache status

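This command shows how much memory the Ceph MDS cache is using. By default the cache is configured to use 1G of memory, but this is not a hard limit, and actual usage can exceed the configured value.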

$ sudo ceph daemon mds.cephfs-master1 cache status
{
    "pool": {
        "items": 321121429,
        "bytes": 25797208658
    }
}

1.3 ceph mds stat

Shows the state of the MDS component. The output in the example below indicates that there is only one MDS and that it is up and active.

$ ceph mds stat
cephfs-1/1/1 up  {0=cephfs-master1=up:active}

1.4 ceph daemon mds.xxx perf dump mds

Shows the MDS performance counters.

$ sudo ceph daemon mds.cephfs-master1 perf dump mds
{
    "mds": {
        "request": 4812776,
        "reply": 4812772,
        "reply_latency": {
            "avgcount": 4812772,
            "sum": 4018.941028931,
            "avgtime": 0.000835057
        },
        "forward": 0,
        "dir_fetch": 170753,
        "dir_commit": 3253,
        "dir_split": 9,
        "dir_merge": 6,
        "inode_max": 2147483647,
        "inodes": 9305913,
        "inodes_top": 1617338,
        "inodes_bottom": 7688575,
        "inodes_pin_tail": 0,
        "inodes_pinned": 6995430,
        "inodes_expired": 13937,
        "inodes_with_caps": 6995443,
        "caps": 7002958,
        "subtrees": 2,
        "traverse": 5076658,
        "traverse_hit": 4835068,
        "traverse_forward": 0,
        "traverse_discover": 0,
        "traverse_dir_fetch": 91030,
        "traverse_remote_ino": 0,
        "traverse_lock": 109,
        "load_cent": 5356538,
        "q": 1,
        "exported": 0,
        "exported_inodes": 0,
        "imported": 0,
        "imported_inodes": 0
    }
}
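The raw counters are easier to read as ratios. The following is a minimal sketch, assuming the JSON layout shown above; the `summarize` helper and the choice of ratios are illustrative and not part of Ceph. The sample string is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the sample `perf dump mds` output shown above.
sample = """
{
    "mds": {
        "reply_latency": {"avgcount": 4812772, "sum": 4018.941028931},
        "traverse": 5076658,
        "traverse_hit": 4835068,
        "inodes": 9305913,
        "inodes_with_caps": 6995443
    }
}
"""


def summarize(perf_json: str) -> dict:
    """Derive a few ratios from the raw MDS counters (hypothetical helper)."""
    mds = json.loads(perf_json)["mds"]
    latency = mds["reply_latency"]
    return {
        # Mean reply latency in seconds: total latency / number of replies.
        "avg_reply_latency_s": latency["sum"] / latency["avgcount"],
        # Fraction of path traversals served from the metadata cache.
        "traverse_hit_ratio": mds["traverse_hit"] / mds["traverse"],
        # Fraction of cached inodes on which clients hold capabilities.
        "inodes_with_caps_ratio": mds["inodes_with_caps"] / mds["inodes"],
    }


summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value:.6f}")
```

In practice the same function could be fed the live output of the command instead of the embedded sample.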

1.5 ceph daemon mds.xxx dirfrag ls /

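This command lists the fragments (dirfrags) of a directory in the file system. In the output below the root directory has a single fragment, 0/0, meaning it has not been split.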

$ sudo ceph daemon mds.cephfs-master1 dirfrag ls /
[
    {
        "value": 0,
        "bits": 0,
        "str": "0/0"
    }
]

1.6 ceph daemon mds.xxx session ls

This command lists the client sessions connected to CephFS.

$ sudo ceph daemon mds.cephfs-master1 session ls

[
    {
        "id": 9872,
        "num_leases": 0,
        "num_caps": 1,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 0,
        "reconnecting": false,
        "inst": "client.9872 192.168.250.1:0/1887245819",
        "client_metadata": {
            "entity_id": "k8s.training.cephfs-teamvolume-aaaaaa-pvc",
            "hostname": "GPU-P100",
            "kernel_version": "4.9.107-0409107-generic",
            "root": "/prod/training/cephfs-teamvolume-aaaaaa-pvc"
        }
    },
    ......
]
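When many clients are connected, it can help to rank sessions by how many capabilities they hold. The following is a minimal sketch, assuming the JSON layout shown above; `top_cap_holders` is a hypothetical helper, and the second session in the sample data is invented purely for illustration.

```python
import json

# The first entry is a trimmed copy of the sample above; the second entry
# (id 9999, "example-host") is made up for illustration only.
sessions_json = """
[
    {"id": 9872, "num_caps": 1, "state": "open",
     "client_metadata": {"hostname": "GPU-P100"}},
    {"id": 9999, "num_caps": 250000, "state": "open",
     "client_metadata": {"hostname": "example-host"}}
]
"""


def top_cap_holders(raw: str, limit: int = 10) -> list:
    """Return (client id, hostname, num_caps) tuples, largest cap count first."""
    sessions = json.loads(raw)
    ranked = sorted(sessions, key=lambda s: s["num_caps"], reverse=True)
    return [
        (s["id"], s.get("client_metadata", {}).get("hostname", "?"), s["num_caps"])
        for s in ranked[:limit]
    ]


top = top_cap_holders(sessions_json)
for client_id, hostname, caps in top:
    print(f"client.{client_id}\t{hostname}\t{caps}")
```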

2. Troubleshooting

2.1 Client cephfs-master1 failing to respond to cache pressure client_id: 9807

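This warning happened to appear right after I had changed the MDS cache settings, so at first I suspected that enlarging the cache had caused it. But after I restored the cache to its default value, the problem persisted. Searching the Ceph mailing list for similar reports, I found that the problem is usually attributed to inode_max being set too low, so I checked the current inodes and inode_max values: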

$ sudo ceph daemon mds.cephfs-master1 perf dump mds
{
    "mds": {
        "request": 404611246,
        "reply": 404611201,
        "reply_latency": {
            "avgcount": 404611201,
            "sum": 9613563.153437701,
            "avgtime": 0.023760002
        },
        ......
        "inode_max": 2147483647,
        "inodes": 3907095,
        ......
}
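The two counters can be compared directly. This is only a small arithmetic sketch using the values from the output above; the variable names are illustrative.

```python
inode_max = 2147483647  # "inode_max" from the output above
inodes = 3907095        # "inodes" from the output above

# Fraction of the configured inode ceiling currently in use.
usage = inodes / inode_max
print(f"inode usage: {usage:.4%}")
```

The result is well under one percent of the limit.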

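inodes is far smaller than inode_max, so this setting is not the problem either. Further searching showed that it is not only the number of inodes that can cause this warning; expired inodes also play a part.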

$ sudo ceph daemon mds.cephfs-master1 perf dump mds
{
        ......
        "inodes_expired": 21999096501,
        ......
}

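Sure enough, inodes_expired had grown very large. Further searching indicated that the main cause is that CephFS does not automatically clean up expired inodes, so over time they accumulate until capacity runs short. The fix is as follows: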

$ sudo vim /etc/ceph/ceph.conf
……
[client]
client_try_dentry_invalidate = false
……

$ sudo systemctl restart [email protected]

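2.2 MDS cache configuration

The officially recommended MDS setup is still a single active MDS, meaning only one MDS in the cluster serves requests. Ceph MDS performs well, but it is still a single point, and since the physical machine running the MDS has plenty of spare memory, the natural idea is to use that memory as cache to improve MDS performance. However, there are many cache-related MDS options, so it was not obvious which one to use or how large to set it.

After sorting things out, the cache configuration question breaks down into the following four smaller questions.

Which option actually sets the cache size
Why the configured amount of memory is not fully used most of the time
Why the MDS sometimes uses far more memory than the configured cache size
How large the cache should be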

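2.2.1 Which option actually sets the cache size

There are two main relevant options:
mds_cache_size and mds_cache_memory_limit. mds_cache_size is the option from older releases; its unit is inodes, and its current default is 0, meaning no limit. mds_cache_memory_limit is the recommended option; its unit is bytes, and its default is 1G. So to adjust the cache size, change mds_cache_memory_limit.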
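Because mds_cache_memory_limit is given in bytes, it is easy to mistype. The following is a minimal sketch of a helper that converts a size in GiB to bytes and formats the config line; `cache_limit_line` is a hypothetical name, the 4 GiB value is an arbitrary example, and placing the option under an [mds] section is an assumption about how the config file is organized.

```python
GIB = 1024 ** 3  # bytes per GiB


def cache_limit_line(gib: int) -> str:
    """Format a ceph.conf line setting mds_cache_memory_limit to `gib` GiB."""
    return f"mds_cache_memory_limit = {gib * GIB}"


line = cache_limit_line(4)  # 4 GiB is an arbitrary example value
print("[mds]")  # assumed section; adjust to match your own ceph.conf layout
print(line)
```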

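2.2.2 Why the configured amount of memory is not fully used most of the time

For example, with mds_cache_memory_limit set to 30G (mds_cache_memory_limit = 32212254726), the cache usage observed at runtime looked like this: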

$ sudo ceph daemon mds.cephfs-master1 cache status
{
    "pool": {
        "items": 321121429,
        "bytes": 31197237046
    }
}

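The gap is small, but why does usage never quite reach the configured amount?
The reason is the mds_cache_reservation option. It tells the MDS to keep a portion of the cache memory in reserve; the reserve has no specific purpose other than to leave headroom. When the MDS starts eating into the reserve, the excess is released automatically.

The default value of mds_cache_reservation is 5%, which explains what we observed.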
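The numbers from this example can be checked with a little arithmetic. This sketch assumes the reservation is a fraction of mds_cache_memory_limit, so the reserve begins at limit × (1 − reservation); the variable names are illustrative.

```python
limit = 32212254726      # mds_cache_memory_limit from the example above
reservation = 0.05       # default mds_cache_reservation stated above
observed = 31197237046   # "bytes" from the cache status output above

# Usage level at which the MDS starts dipping into the reserved portion.
threshold = limit * (1 - reservation)

print(f"threshold:        {threshold / 1024**3:.2f} GiB")
print(f"observed:         {observed / 1024**3:.2f} GiB")
print(f"observed / limit: {observed / limit:.2%}")
```

Under that assumption, the observed usage sits between the threshold and the configured limit, that is, inside the reserved band rather than at the limit itself.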

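2.2.3 Why the MDS sometimes uses far more memory than the configured cache size

At other times, though, the MDS uses far more memory than the configured cache size. This is because mds_cache_memory_limit is not a hard limit; under certain conditions the process can exceed it at runtime. It is therefore advisable not to set this value too close to the total system memory, otherwise the MDS may exhaust the server's memory.

2.2.4 How large the cache should be

The official documentation states explicitly that values above 64G are not recommended. The main reason is a Ceph bug: many users have found that above 64G the MDS has a fairly high chance of using far more memory than configured, and the bug had not been fixed at the time of writing.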
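The two pieces of advice above (stay at or below 64G, and stay well clear of total system memory) can be turned into a quick sanity check. This is a minimal sketch: `check_cache_limit` is a hypothetical helper, and the 50% headroom fraction is an arbitrary assumption for illustration, not a value from the Ceph documentation.

```python
GIB = 1024 ** 3  # bytes per GiB


def check_cache_limit(limit_bytes: int, system_ram_bytes: int,
                      max_recommended: int = 64 * GIB,
                      ram_fraction: float = 0.5) -> list:
    """Return warnings for a proposed mds_cache_memory_limit value.

    ram_fraction is an assumed safety margin: the limit should not exceed
    this fraction of system RAM, because the MDS can overshoot the limit.
    """
    warnings = []
    if limit_bytes > max_recommended:
        warnings.append("limit exceeds the 64G mentioned above")
    if limit_bytes > system_ram_bytes * ram_fraction:
        warnings.append("limit leaves little headroom below system RAM")
    return warnings


ok = check_cache_limit(30 * GIB, 128 * GIB)
risky = check_cache_limit(80 * GIB, 96 * GIB)
print(ok)
print(risky)
```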

3. References

Why ceph status showing cephfs client failing to respond to cache pressure in RHCS
MDS CONFIG REFERENCE
Understanding MDS Cache Size Limits