QEMU itself offers a rich set of optimization options for live migration, which can be inspected through the QEMU monitor:
(qemu) info migrate_capabilities
xbzrle: off
rdma-pin-all: off
auto-converge: off
zero-blocks: off
compress: off
events: off
postcopy-ram: off
x-colo: off
release-ram: off
block: off
return-path: off
pause-before-switchover: off
x-multifd: off
dirty-bitmaps: off
postcopy-blocktime: off
late-block-activate: off
Some of these options are fairly common, for example postcopy-ram and compress. I looked them up, but for some there is virtually no documentation, so reading the source code seems to be the only way to understand what they do. Below I try these options one by one, study how they work, and measure how much they actually improve migration.
1. compress
With the compress capability turned on, the source server compresses the RAM contents before transferring them. In my tests the compression ratio varied widely, from 10% to 80%, which helps migration efficiency when bandwidth is scarce. Compression costs extra CPU and takes time, however, so when bandwidth is plentiful, compressing before transfer can actually make the live migration take longer.
Compression and decompression can each run multi-threaded to speed them up; the default configuration is 8 compress threads and 2 decompress threads. Because the compression algorithm is zlib, whose compression is roughly four times slower than its decompression, it is recommended to configure about four compress threads per decompress thread.
The trade-off between compression ratio and speed can also be tuned: level 0 means no compression, level 1 compresses fastest but with the lowest ratio, and level 9 gives the highest ratio but is correspondingly the slowest.
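To get a feel for what compress-level trades off, here is a minimal standalone sketch (my own illustration, not QEMU's migration code) that compresses a fake 4K guest page with zlib's standard compress2()/compressBound() APIs at level 1 and level 9; build it with gcc -lz:
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PAGE_SIZE 4096

int main(void)
{
    unsigned char page[PAGE_SIZE];
    int i, level;
    /* Fill a fake guest page with somewhat compressible data. */
    for (i = 0; i < PAGE_SIZE; i++)
        page[i] = (unsigned char)(i % 64);
    for (level = 1; level <= 9; level += 8) {   /* level 1, then level 9 */
        uLongf out_len = compressBound(PAGE_SIZE);
        unsigned char *out = malloc(out_len);
        if (out && compress2(out, &out_len, page, PAGE_SIZE, level) == Z_OK)
            printf("level %d: %d -> %lu bytes\n", level, PAGE_SIZE, out_len);
        free(out);
    }
    return 0;
}
The real migration path does this per page across the configured compress threads, which is why a high level on an already fast link just burns CPU without shortening the transfer.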
To simulate memory usage inside the VM, I used this script:
[root@localhost ~]# cat make_cache_2.2G.sh
#!/bin/bash -x
mount -t ramfs ramfs z/
cp 800m_file z/1
cp 800m_file z/2
[root@localhost ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           4.9G        180M        2.5G         11M        2.2G        4.0G
Swap:          4.0G          0B        4.0G
After copying the two 800M files into the ramfs, buff/cache in the free output has risen to 2.2G.
Configure in the QEMU monitor.
On the source server:
(qemu) migrate_set_capability compress on
(qemu) migrate_set_parameter compress-threads 1
(qemu) migrate_set_parameter compress-level 1
On the destination server:
(qemu) migrate_set_capability compress on
(qemu) migrate_set_parameter decompress-threads 2
Comparison of migration results:
| | no compress | compress (compress threads: 8, decompress threads: 2, compression level: 1) |
| --- | --- | --- |
| total time (msec) | 33172 | 37697 (13.6%↑) |
| downtime (msec) | 131 | 17 (87%↓) |
| transferred ram (kB) | 3728602 | 3464038 (7.1%↓) |
| throughput (mbps) | 921.04 | 752.96 (18.2%↓) |
| total ram (kB) | 5374400 | 5374400 |
2. xbzrle
XBZRLE is short for Xor Based Zero Run Length Encoding.
In short, this technique uses XOR to locate the parts of a memory page that have changed (without xbzrle, the entire page is retransmitted), compresses just those deltas, and sends them to the destination server. This cuts the amount of data transferred and lets the number of dirty pages drop quickly to the point where the VM can be paused and the remainder migrated in one go. It is especially helpful for live-migrating VMs with memory-write-intensive workloads. More precisely: without this option, a guest that writes memory very frequently keeps its dirty-page count hovering at the same order of magnitude, so the migration runs forever and never reaches the phase where the VM is paused and the remaining dirty pages are transferred.
To find the changed parts, the previous contents of each page are kept in a cache on the source server for comparison; the size of this cache affects the cache hit rate, something I have not yet studied in depth.
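The encoding idea itself is simple enough to sketch in a few lines. The following is my own simplified illustration, not QEMU's exact wire format: XOR tells us which bytes changed, and the output is a sequence of (zero-run length, non-zero-run length + literal bytes) pairs, so long unchanged stretches cost four bytes instead of being resent:
/* Simplified xbzrle-style delta encoder: illustration only. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Encode new_page against old_page into out; returns encoded size,
 * or -1 if the encoding would not fit (send the page raw instead). */
static int xbzrle_encode(const uint8_t *old_page, const uint8_t *new_page,
                         uint8_t *out, int out_max)
{
    int i = 0, pos = 0;
    while (i < PAGE_SIZE) {
        /* Count a run of unchanged (XOR == 0) bytes. */
        int zrun = 0;
        while (i + zrun < PAGE_SIZE &&
               old_page[i + zrun] == new_page[i + zrun])
            zrun++;
        i += zrun;
        /* Count the following run of changed bytes. */
        int nzrun = 0;
        while (i + nzrun < PAGE_SIZE &&
               old_page[i + nzrun] != new_page[i + nzrun])
            nzrun++;
        if (pos + 4 + nzrun > out_max)
            return -1;
        out[pos++] = zrun & 0xff;        /* toy 16-bit run lengths */
        out[pos++] = zrun >> 8;
        out[pos++] = nzrun & 0xff;
        out[pos++] = nzrun >> 8;
        memcpy(out + pos, new_page + i, nzrun);
        pos += nzrun;
        i += nzrun;
    }
    return pos;
}

int main(void)
{
    static uint8_t old_page[PAGE_SIZE], new_page[PAGE_SIZE];
    static uint8_t out[PAGE_SIZE + 64];
    memcpy(new_page, old_page, PAGE_SIZE);
    new_page[100]++;                     /* dirty a few bytes ... */
    new_page[3000]++;                    /* ... like a guest write */
    printf("encoded %d of %d bytes\n",
           xbzrle_encode(old_page, new_page, out, sizeof(out)), PAGE_SIZE);
    return 0;
}
A page that changed too much for the encoding to pay off is sent raw; that case is what the "xbzrle overflow" counter in the info migrate output below tracks.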
For this test I wrote a small program that reads and writes memory.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main()
{
    long int buf_length = 4096;          /* touch memory in 4K page units */
    long int buf_num;
    char c[100];
    int compare_result;

    printf("Enter memory you want to test: ");
    /* fgets instead of the unsafe gets(); strip the trailing newline */
    if (fgets(c, sizeof(c), stdin) == NULL)
        return 1;
    c[strcspn(c, "\n")] = '\0';

    if (strcmp(c, "128m") == 0)
        compare_result = 1;
    else if (strcmp(c, "256m") == 0)
        compare_result = 2;
    else if (strcmp(c, "512m") == 0)
        compare_result = 3;
    else if (strcmp(c, "1g") == 0)
        compare_result = 4;
    else if (strcmp(c, "2g") == 0)
        compare_result = 5;
    else
        compare_result = 0;

    switch (compare_result) {
    case 1:
        buf_num = 32768;                 /* 32768 * 4K = 128M */
        printf("start generate 128M memory r/w load, ctrl+c to quit. \n");
        break;
    case 2:
        buf_num = 65536;                 /* 256M */
        printf("start generate 256M memory r/w load, ctrl+c to quit. \n");
        break;
    case 3:
        buf_num = 131072;                /* 512M */
        printf("start generate 512M memory r/w load, ctrl+c to quit. \n");
        break;
    case 4:
        buf_num = 262144;                /* 1G */
        printf("start generate 1G memory r/w load, ctrl+c to quit. \n");
        break;
    case 5:
        buf_num = 524288;                /* 2G */
        printf("start generate 2G memory r/w load, ctrl+c to quit. \n");
        break;
    default:
        buf_num = 0;
        printf("please input one of the following sizes: 128m, 256m, 512m, 1g, 2g. \n");
        break;
    }

    if (buf_num != 0) {
        printf("use free -h to monitor \n");
        char *buf = (char *) calloc(buf_num, buf_length);
        if (buf == NULL)
            return 1;
        while (1) {
            long int i;
            /* Increment one byte every 1024 bytes, so every 4K page
             * gets dirtied four times per pass over the buffer. */
            for (i = 0; i < buf_num * 4; i++) {
                buf[i * buf_length / 4]++;
            }
            printf(".");
            fflush(stdout);              /* make the progress dots visible */
        }
    } else {
        printf("please input correct size. \n");
    }
    return 0;
}
Depending on the input, the program occupies between 128M and 2G of memory and keeps reading and writing it.
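For reference, compiling and running it inside the guest looks like this (I save the source as memload.c; the file name is arbitrary):
[root@localhost ~]# gcc -o memload memload.c
[root@localhost ~]# ./memload
Enter memory you want to test: 128m
start generate 128M memory r/w load, ctrl+c to quit.
use free -h to monitor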
Configure in the QEMU monitor.
On the source server:
(qemu) migrate_set_capability xbzrle on
(qemu) migrate_set_cache_size 256m
The default cache size is 64M. It is best to configure a cache size larger than the amount of memory being dirtied; otherwise the xbzrle cache miss rate will be high and the migration will never converge.
Without xbzrle, I started the 128M memory load in the guest and then began a live migration. You can see that 11G of RAM has already been transferred and the migration still has not finished:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: off
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: active
total time: 99833 milliseconds
expected downtime: 4872 milliseconds
setup: 4 milliseconds
transferred ram: 11202151 kbytes
throughput: 891.65 mbps
remaining ram: 240188 kbytes
total ram: 5374400 kbytes
duplicate: 1258520 pages
skipped: 0 pages
normal: 2792316 pages
normal bytes: 11169264 kbytes
dirty sync count: 23
page size: 4 kbytes
multifd bytes: 0 kbytes
dirty pages rate: 31494 pages
With xbzrle turned on, the migration finishes quickly:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: off
capabilities: xbzrle: on rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off release-ram: off block: off return-path: off pause-before-switchover: off x-multifd: off dirty-bitmaps: off postcopy-blocktime: off late-block-activate: off
Migration status: completed
total time: 6405 milliseconds
downtime: 16 milliseconds
setup: 4 milliseconds
transferred ram: 601813 kbytes
throughput: 770.35 mbps
remaining ram: 0 kbytes
total ram: 5374400 kbytes
duplicate: 1246173 pages
skipped: 0 pages
normal: 146870 pages
normal bytes: 587480 kbytes
dirty sync count: 8
page size: 4 kbytes
multifd bytes: 0 kbytes
cache size: 268435456 bytes
xbzrle transferred: 2232 kbytes
xbzrle pages: 87614 pages
xbzrle cache miss: 46297
xbzrle cache miss rate: 15.33
xbzrle overflow : 0
Note that info migrate now shows several extra xbzrle-related fields; their meanings are clear from the names, so I will not explain them here.