Keywords: SPDK, NVMe-oF, Ceph, CPU load balancing
SPDK (Storage Performance Development Kit), led by Intel, provides a set of tools and libraries for writing high-performance, scalable, user-space storage applications. It achieves its high performance through a few key techniques:
SPDK is a framework rather than a distributed system. Its cornerstone is a user-space, polled-mode, asynchronous, lock-free NVMe driver, which provides zero-copy, highly concurrent access to SSDs directly from user space. SPDK was originally built to optimize the performance of persisting blocks to disk, but as it has evolved it is now used to optimize many kinds of storage protocol stacks. Its architecture is divided into a protocol layer, a service layer, and a driver layer: the protocol layer includes the NVMe-oF target, vhost-nvme target, iSCSI target, vhost-scsi target, vhost-blk target, and so on; the service layer includes logical volumes (LV), RAID, AIO, malloc, Ceph RBD, and so on; and the driver layer mainly consists of the NVMe-oF initiator, the NVMe PCIe driver, virtio, and drivers for persistent memory.
Figure: SPDK architecture
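To make the polled-mode model concrete, below is a minimal sketch of an SPDK application that registers a poller. It assumes a reasonably recent SPDK release (the spdk_app_opts_init signature and the SPDK_POLLER_* return macros differ across versions), and demo_poll, demo_start, and "poller_demo" are illustrative names rather than anything from the setup described in this article. Pollers registered this way are exactly the entries that thread_get_pollers lists further down.

#include "spdk/event.h"
#include "spdk/thread.h"

static struct spdk_poller *g_poller;

/* An active poller: the reactor calls it in a tight loop on its SPDK thread.
 * Return SPDK_POLLER_BUSY when work was done, SPDK_POLLER_IDLE otherwise. */
static int
demo_poll(void *arg)
{
        return SPDK_POLLER_IDLE;
}

static void
demo_start(void *arg)
{
        /* A period of 0 registers an active (non-timed) poller on the current thread. */
        g_poller = spdk_poller_register(demo_poll, NULL, 0);
}

int
main(int argc, char **argv)
{
        struct spdk_app_opts opts;
        int rc;

        spdk_app_opts_init(&opts, sizeof(opts));
        opts.name = "poller_demo";

        /* spdk_app_start() launches the reactors and runs demo_start on the app
         * thread; it blocks until spdk_app_stop() is called. */
        rc = spdk_app_start(&opts, demo_start, NULL);
        spdk_app_fini();
        return rc;
}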
Ceph is a widely deployed distributed storage system that provides block, object, and file services, and SPDK has long supported attaching Ceph RBD images as block storage backends. While running performance tests against RBD through SPDK, we found that performance stopped improving once it reached a certain level, limiting the overall performance of the product. Using several debugging techniques together with on-site observation and code analysis, we eventually identified the root cause and resolved it. The process was as follows.
Listing the pollers shows that RBD has only a single poller, bdev_rbd_group_poll, which runs on the thread with id 2 (nvmf_tgt_poll_group_0) alongside that group's nvmf_poll_group_poll. Since nvmf_tgt_poll_group_0 runs on core 0, bdev_rbd_group_poll also runs on core 0.
[root@test]# spdk_rpc.py thread_get_pollers
{
"tick_rate": 2300000000,
"threads": [
{
"timed_pollers": [
{
"period_ticks": 23000000,
"run_count": 77622,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_tgt_accept"
},
{
"period_ticks": 9200000,
"run_count": 194034,
"busy_count": 194034,
"state": "waiting",
"name": "rpc_subsystem_poll"
}
],
"active_pollers": [],
"paused_pollers": [],
"id": 1,
"name": "app_thread"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5919074761,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
},
{
"run_count": 40969661,
"busy_count": 0,
"state": "waiting",
"name": "bdev_rbd_group_poll"
}
],
"paused_pollers": [],
"id": 2,
"name": "nvmf_tgt_poll_group_0"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5937329587,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
}
],
"paused_pollers": [],
"id": 3,
"name": "nvmf_tgt_poll_group_1"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5927158562,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
}
],
"paused_pollers": [],
"id": 4,
"name": "nvmf_tgt_poll_group_2"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5971529095,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
}
],
"paused_pollers": [],
"id": 5,
"name": "nvmf_tgt_poll_group_3"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5923260338,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
}
],
"paused_pollers": [],
"id": 6,
"name": "nvmf_tgt_poll_group_4"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5968032945,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
}
],
"paused_pollers": [],
"id": 7,
"name": "nvmf_tgt_poll_group_5"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5931553507,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
}
],
"paused_pollers": [],
"id": 8,
"name": "nvmf_tgt_poll_group_6"
},
{
"timed_pollers": [],
"active_pollers": [
{
"run_count": 5058745767,
"busy_count": 0,
"state": "waiting",
"name": "nvmf_poll_group_poll"
}
],
"paused_pollers": [],
"id": 9,
"name": "nvmf_tgt_poll_group_7"
}
]
}
Turning to the code: when the rbd module is loaded, it registers bdev_rbd_create_cb as its io_channel creation callback. When an rbd bdev is created, bdev_examine runs on it by default, which creates an io_channel once and then destroys it. When the rbd bdev is attached to an nvmf subsystem, the io_channel creation callback is invoked again; because nvmf_tgt runs 8 threads, the callback is invoked 8 times, but disk->main_td is always set to the thread of the first caller, i.e. nvmf_tgt_poll_group_0. Whenever an IO reaches the rbd module, bdev_rbd_submit_request dispatches the IO context to disk->main_td, so the main thread of every rbd bdev ends up running on core 0.
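The behavior just described can be sketched roughly as follows. This is a simplified paraphrase of the logic in SPDK's rbd bdev module, not a verbatim quote: the struct bdev_rbd layout is abbreviated, and the _bdev_rbd_submit_request helper merely stands in for the code that actually issues the librbd request.

#include "spdk/bdev_module.h"
#include "spdk/thread.h"

/* Abbreviated per-disk context; only the field that matters here is shown. */
struct bdev_rbd {
        struct spdk_thread *main_td;
        /* ... rados/rbd handles, spdk_bdev, etc. ... */
};

/* Stand-in for the code that drives librbd for this IO. */
static void
_bdev_rbd_submit_request(void *ctx)
{
        struct spdk_bdev_io *bdev_io = ctx;
        (void)bdev_io;
        /* ... issue the aio read/write against the rbd image ... */
}

static int
bdev_rbd_create_cb(void *io_device, void *ctx_buf)
{
        struct bdev_rbd *disk = io_device;

        /* The first thread to create an io_channel for this disk becomes its
         * main thread. With the examine and nvmf attach flows described above,
         * that first caller is always nvmf_tgt_poll_group_0, i.e. core 0. */
        if (disk->main_td == NULL) {
                disk->main_td = spdk_get_thread();
        }

        /* ... set up the per-channel state ... */
        return 0;
}

static void
bdev_rbd_submit_request(struct spdk_io_channel *ch, struct spdk_bdev_io *bdev_io)
{
        struct bdev_rbd *disk = bdev_io->bdev->ctxt;
        struct spdk_thread *submit_td = spdk_io_channel_get_thread(ch);

        if (disk->main_td != submit_td) {
                /* IOs submitted from any other poll group thread are bounced
                 * over to disk->main_td, which is pinned to core 0. */
                spdk_thread_send_msg(disk->main_td, _bdev_rbd_submit_request, bdev_io);
        } else {
                _bdev_rbd_submit_request(bdev_io);
        }
}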
Combining the observed behavior with the code analysis, the root cause is that when a disk is created, the SPDK rbd module's bdev_rbd_create_cb ends up placing that disk's main thread, disk->main_td, on core 0; with multiple disks under test the CPU load is therefore unbalanced and performance cannot keep scaling.
Because each disk's disk->main_td is the thread context of whichever caller created its io_channel first, all of the main threads land on the same core. Every IO that arrives at the rbd module from the upper layers is then funneled onto that single core, and this load imbalance is what prevents performance from improving further.
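The article does not show the final patch, but one possible direction, sketched purely as an illustration and not necessarily the fix that was actually applied, is to stop pinning main_td to the first io_channel caller and instead spread the main threads of successive disks across the channel-creating threads with a simple round-robin registry. The names and the registry itself are assumptions; a real change would also need locking around the shared state and a guarantee that the chosen main_td owns an io_channel for the disk.

#include "spdk/stdinc.h"
#include "spdk/thread.h"

#define RBD_MAX_THREADS 64

/* Module-level registry of SPDK threads that have created rbd io_channels.
 * For brevity there is no locking here; a real implementation would need it. */
static struct spdk_thread *g_rbd_threads[RBD_MAX_THREADS];
static uint32_t g_rbd_thread_count;
static uint32_t g_rbd_next_td;

/* Called from the channel-create path so every poll group thread registers itself. */
static void
rbd_register_thread(void)
{
        struct spdk_thread *td = spdk_get_thread();
        uint32_t i;

        for (i = 0; i < g_rbd_thread_count; i++) {
                if (g_rbd_threads[i] == td) {
                        return;
                }
        }
        if (g_rbd_thread_count < RBD_MAX_THREADS) {
                g_rbd_threads[g_rbd_thread_count++] = td;
        }
}

/* Pick a main thread for a newly created disk instead of always using the
 * first io_channel caller. The very first disk may still land on the first
 * thread, since only that thread is registered at that point. */
static struct spdk_thread *
rbd_pick_main_thread(void)
{
        if (g_rbd_thread_count == 0) {
                return spdk_get_thread();
        }
        return g_rbd_threads[g_rbd_next_td++ % g_rbd_thread_count];
}

In this sketch, bdev_rbd_create_cb would call rbd_register_thread() on every invocation and, when it finds disk->main_td still unset, assign disk->main_td = rbd_pick_main_thread() instead of spdk_get_thread(), so the per-disk main threads, and hence the IO processing for different disks, are spread across cores rather than all converging on core 0.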