jianchwa

QEMU Code Walkthrough

Overview

QOM

QM.1 Class & Object

QM.2 Initialization

QM.3 Property

QM.4 QOM List

Options

MemoryRegion

MR.1 Hierarchy

MR.2 Listeners

Qemu Task Model

TM.1 Overview

TM.2 Coroutine

TM.2.1 Overview

TM.2.2 Coroutine implementation

TM.3 glib

TM.4 AIO

TM.4.1 AIO driving model

TM.4.2 aio bh

TM.4.3 aio coroutine schedule

Device Emulation

DE.0 Overview

DE.1 CPU and chipset

DE.1.1 Chipset

DE.1.2 PCIE

DE.2 Network devices

DE.2.1 Configuration parameters

DE.2.2 netdev parameters

DE.2.3 NetQueue

DE.2.4 Backend

DE.3 Storage devices

DE.3.1 Configuration parameters

DE.3.2 BlockBackend

Migration

MG.0 Overview

MG.1 VMState

MG.2 Flow overview

MG.3 Memory migration

MG.3.1 Send path

MG.3.2 Dirty Pages

MG.3.2.1 Tracking

MG.3.2.2 Output

MG.3.2.3 Dirty log Qemu code

MG.3.2.4 Manual Dirty Log Protect

MG.3.2.5 libvirt migrate --timeout

MG.4 Stopping the guest

MG.4.1 Stopping vcpus

MG.4.2 Stopping peripherals

MG.4.2.1 net

MG.4.2.2 blk

Appendix

vhost-blk

Overview

Code organization:

vl.c, the file that contains qemu's main function;
tcg/tcg.c; see Documentation/TCG - QEMU: The Tiny Code Generator (TCG) exists to transform target insns (the processor being emulated) via the TCG frontend to TCG ops which are then transformed into host insns (the processor executing QEMU itself) via the TCG backend. tcg.c is the core file of the whole TCG mechanism;
Guest CPU, named qemu/target-xxx; this holds the CPU emulation code, in which translate.c translates guest insns into TCG ops;
tcg/xxx, the TCG backend code, which translates TCG ops into host insns;
qemu/cpu-exec.c, the core execution code for CPU emulation;
Emulated Hardware: qemu/hw holds the code for emulated hardware; its first-level directory layout is:

9pfs   block  cris     i386   isa            mips   openrisc    ppc    sparc    unicore32  xtensa
acpi   bt     display  ide    lm32           misc   pci         s390x  sparc64  usb
alpha  char   dma      input  m68k           moxie  pci-bridge  scsi   ssi      virtio
arm    core   gpio     intc   Makefile.objs  net    pci-host    sd     timer    watchdog
audio  cpu    i2c      ipack  microblaze     nvram  pcmcia      sh4    tpm      xen

qom, the code for the QEMU Object Model

QOM

QOM, the QEMU Object Model, is used for device emulation in QEMU. See the official QEMU documentation: The QEMU Object Model (QOM) — QEMU 7.1.50 documentation https://qemu.readthedocs.io/en/latest/devel/qom.html

The QEMU Object Model provides a framework for registering user creatable types and instantiating objects from those types. QOM provides the following features:

System for dynamically registering types

Support for single-inheritance of types

Multiple inheritance of stateless interfaces

QM.1 Class & Object

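QOM implements some C++ features in C, such as class inheritance. As a thought experiment: how would class inheritance be used to implement device emulation?

Note: the figure above is not how QEMU is actually organized, and the topmost level of the organization shown is not quite accurate either.

As the figure shows,

devices are organized by dev and bus, forming a hierarchy;
classes such as PCI Dev or SCSI Dev contain the attributes and interfaces of a specific type of device:
- PCI Dev holds the config space and BARs, while the way individual registers are used is implemented in the concrete device classes, such as NVMe and NICs;
- SCSI Dev holds information such as vendor, rotational, wbc and ncq;

So class inheritance reflects the way devices are organized very well, and it makes device code considerably easier to write.

Next, how does QOM implement class inheritance?

QOM introduces two concepts:

class, corresponding to the methods and static members of a C++ class, both of which are global;
object, corresponding to the per-instance part of a C++ class;

Both class and object implement inheritance by embedding the parent class or object as the first member; this also means a pointer can be converted between the types with a plain cast. For example: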

typedef struct X86CPUClass {
    CPUClass parent_class;
	...
} X86CPUClass;

typedef struct CPUClass {
    DeviceClass parent_class;
	...
} CPUClass;

X86CPUClass -> CPUClass -> DeviceClass -> ObjectClass

X86CPU -> DeviceState -> Object
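
The first-member embedding can be checked with a small stand-alone sketch. The type names below (BaseClass, MidClass, LeafClass) and the two helpers are hypothetical stand-ins rather than QEMU definitions; the only point illustrated is that, because the parent sits at offset 0, converting between child and parent pointers is a plain cast that leaves the address unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for ObjectClass -> DeviceClass -> CPUClass. */
typedef struct BaseClass {
    const char *type_name;
} BaseClass;

typedef struct MidClass {
    BaseClass parent_class;          /* parent embedded as the FIRST member */
    int (*realize)(void);
} MidClass;

typedef struct LeafClass {
    MidClass parent_class;           /* again the first member */
    int (*parent_realize)(void);
} LeafClass;

/* Upcast: the first member lives at offset 0, so the address is unchanged. */
static BaseClass *leaf_to_base(LeafClass *lc)
{
    assert(offsetof(LeafClass, parent_class) == 0);
    assert(offsetof(MidClass, parent_class) == 0);
    return (BaseClass *)lc;
}

/* Downcast: only valid if bc really points into a LeafClass. */
static LeafClass *base_to_leaf(BaseClass *bc)
{
    return (LeafClass *)bc;
}
```

Calling leaf_to_base() on a LeafClass and then base_to_leaf() on the result yields the original pointer. QEMU's cast macros such as X86_CPU_CLASS() additionally verify the type at run time, which this sketch does not attempt.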

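QOM introduces TypeImpl (or Type) to hold the sizes of the class and object structures and their init/exit methods, which play the role of constructors and destructors; it also holds the pointer to the globally unique class structure. All TypeImpls are stored in type_table; for details see type_table_lookup and type_table_add.

Now let's look at the code.

Class initialization. The points to focus on are (1) the child class and the parent class exist as separate structures, and (2) the child class overrides the parent's methods. See the code: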

static void type_initialize(TypeImpl *ti)
{
    TypeImpl *parent;

    // the class has already been initialized
    if (ti->class) {
        return;
    }

    ti->class_size = type_class_get_size(ti);
    ti->instance_size = type_object_get_size(ti);

    // build the globally unique class structure
    ti->class = g_malloc0(ti->class_size);

    parent = type_get_parent(ti);
    if (parent) {
        type_initialize(parent);
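        // copy the members of the parent's class structure into this class;
        // note that the child and the parent are separate structures, and the
        // child contains its own copy of the parent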
        memcpy(ti->class, parent->class, parent->class_size);
		...
	}

    ti->class->type = ti;
    ...

    // call the child's class_init, which may override the parent's functions; see the function below
    if (ti->class_init) {
        ti->class_init(ti->class, ti->class_data);
    }
}


x86_cpu_common_class_init()
---
    X86CPUClass *xcc = X86_CPU_CLASS(oc);
    CPUClass *cc = CPU_CLASS(oc);
    DeviceClass *dc = DEVICE_CLASS(oc);

	// xcc -> cc -> dc
    xcc->parent_realize = dc->realize;
    dc->realize = x86_cpu_realizefn;
    dc->bus_type = TYPE_ICC_BUS;
    dc->props = x86_cpu_properties;

    xcc->parent_reset = cc->reset;
    cc->reset = x86_cpu_reset;
	...
---
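
The copy-then-override behaviour of type_initialize() and class_init can be modelled roughly as below. This is a minimal sketch with hypothetical names (DevClass, CpuClass, cpu_class_initialize); it assumes a single parent and leaves out interfaces and everything else the real function handles.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical two-level hierarchy; not QEMU's real structures. */
typedef struct DevClass {
    int (*realize)(void);
} DevClass;

typedef struct CpuClass {
    DevClass parent_class;           /* embedded copy of the parent class */
    int (*parent_realize)(void);     /* saved parent method */
} CpuClass;

static int dev_realize(void)
{
    return 1;
}

static CpuClass *the_cpu_class;      /* one global class struct per type */

/* Child method: chain to the saved parent method, then add its own work. */
static int cpu_realize(void)
{
    return the_cpu_class->parent_realize() + 10;
}

/* Same shape as type_initialize(): allocate, copy the parent, override. */
static CpuClass *cpu_class_initialize(const DevClass *parent)
{
    CpuClass *cc = calloc(1, sizeof(*cc));
    assert(cc != NULL);
    /* The child gets its own copy of the parent's class contents. */
    memcpy(cc, parent, sizeof(*parent));
    /* "class_init": save the inherited method, then override it. */
    cc->parent_realize = cc->parent_class.realize;
    cc->parent_class.realize = cpu_realize;
    the_cpu_class = cc;
    return cc;
}
```

After initialization the parent's own structure is untouched, while calling realize through the child class runs cpu_realize(), which chains to the saved parent method. That is the same pattern as xcc->parent_realize = dc->realize followed by dc->realize = x86_cpu_realizefn.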

Object initialization. The point to focus on is how the object is associated with its class. See the functions:

object_new_with_type()
---
    // allocate the object according to the TypeImpl
	obj = g_malloc(type->instance_size);
    object_initialize_with_type(obj, type->instance_size, type);
---
object_initialize_with_type()
---
    // the topmost (base) Object stores the class pointer
	obj->class = type->class;
    object_init_with_type(obj, type);
---

// run instance_init in order, starting from the topmost ancestor
object_init_with_type()
---
    if (type_has_parent(ti)) {
        object_init_with_type(obj, type_get_parent(ti));
    }

    if (ti->instance_init) {
        ti->instance_init(obj);
    }
---

Before the object is initialized, the class is initialized first; see object_new_with_type() again:

object_new_with_type()
---
    type_initialize(type);

    obj = g_malloc(type->instance_size);
    object_initialize_with_type(obj, type->instance_size, type);
---
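
The order in which the instance_init functions run can be illustrated with a miniature model. MiniType, BaseObj and mini_new are hypothetical; unlike QEMU, the sketch stores a type pointer rather than a class pointer in the object and zero-fills the allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical miniature of TypeImpl: parent link plus instance_init. */
typedef struct MiniType {
    const struct MiniType *parent;
    size_t instance_size;
    void (*instance_init)(void *obj);
} MiniType;

typedef struct BaseObj {
    const MiniType *type;   /* which type this object was created from */
    int trace[4];           /* records the order of the init calls */
    int n;
} BaseObj;

/* Run instance_init from the root ancestor down to the concrete type. */
static void mini_init_with_type(void *obj, const MiniType *t)
{
    if (t->parent) {
        mini_init_with_type(obj, t->parent);
    }
    if (t->instance_init) {
        t->instance_init(obj);
    }
}

static void *mini_new(const MiniType *t)
{
    BaseObj *obj = calloc(1, t->instance_size);
    assert(obj != NULL);
    obj->type = t;
    mini_init_with_type(obj, t);
    return obj;
}

static void base_init(void *obj)
{
    BaseObj *o = obj;
    o->trace[o->n++] = 1;
}

static void child_init(void *obj)
{
    BaseObj *o = obj;
    o->trace[o->n++] = 2;
}

static const MiniType base_type = { NULL, sizeof(BaseObj), base_init };
static const MiniType child_type = { &base_type, sizeof(BaseObj), child_init };
```

Creating an object with mini_new(&child_type) runs base_init() before child_init(), mirroring the recursion in object_init_with_type().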

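QM.2 Initialization

We now know about QOM's class and object, and about the TypeImpl (Type) that guides their construction, namely:

TypeImpl, equivalent to the definition of a class in C++, i.e. class xxx { xxx };
class, equivalent to the methods and static members of a C++ class, both of which are global;
object, equivalent to the per-instance part of a C++ class;

In the qemu code we only find various TypeInfo definitions, so when are the three elements above initialized? This subsection looks at QOM's construction framework.

Take target-i386/cpu.c as an example: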

static const TypeInfo x86_cpu_type_info = {
    .name = TYPE_X86_CPU,
    .parent = TYPE_CPU,
    .instance_size = sizeof(X86CPU),
    .instance_init = x86_cpu_initfn,
    .abstract = true,
    .class_size = sizeof(X86CPUClass),
    .class_init = x86_cpu_common_class_init,
};

static void x86_cpu_register_types(void)
{
    int i;

    type_register_static(&x86_cpu_type_info);
    for (i = 0; i < ARRAY_SIZE(builtin_x86_defs); i++) {
        x86_register_cpudef_type(&builtin_x86_defs[i]);
    }
#ifdef CONFIG_KVM
    type_register_static(&host_x86_cpu_type_info);
#endif
}

type_init(x86_cpu_register_types)

We see two key functions:

type_register_static, which is where the TypeImpl is registered; see the code:

type_register_static()
  -> type_register()
    -> type_register_internal()
	   ---
	    TypeImpl *ti;
 	    ti = type_new(info);

 	    type_table_add(ti);
	   ---

type_init, which ensures that x86_cpu_register_types is called automatically:

#define type_init(function) module_init(function, MODULE_INIT_QOM)

#define module_init(function, type)                                         \
static void __attribute__((constructor)) do_qemu_init_ ## function(void)    \
{                                                                           \
    register_module_init(function, type);                                   \
}

void register_module_init(void (*fn)(void), module_init_type type)
{
    ModuleEntry *e;
    ModuleTypeList *l;

    e = g_malloc0(sizeof(*e));
    e->init = fn;
    e->type = type;

    l = find_type(type);

    QTAILQ_INSERT_TAIL(l, e, node);
}

vl.c main()
  -> module_call_init(MODULE_INIT_QOM);
	---
	    l = find_type(type);

   		QTAILQ_FOREACH(e, l, node) {
        	e->init();
    	}
	---

This relies on gcc's constructor attribute; see Function Attributes - Using the GNU Compiler Collection (GCC) https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/Function-Attributes.html

The constructor attribute causes the function to be called automatically before execution enters main (). Here the constructor only registers the function; the actual call is made later from main().

At this point the TypeImpl has been registered with the system through type_init().
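
The registration scheme can be reproduced in a few lines, assuming a GCC- or Clang-compatible compiler. The registry array and the names my_type_init, register_init_fn and call_all_init_fns are hypothetical simplifications of module_init(), register_module_init() and module_call_init(), with a single list instead of one list per module type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-size registry standing in for QEMU's ModuleTypeList. */
#define MAX_INIT_FNS 8
static void (*init_fns[MAX_INIT_FNS])(void);
static size_t n_init_fns;
static int types_registered;

static void register_init_fn(void (*fn)(void))
{
    assert(n_init_fns < MAX_INIT_FNS);
    init_fns[n_init_fns++] = fn;
}

/* Same shape as module_init(): a constructor that only records fn. */
#define my_type_init(fn)                                              \
static void __attribute__((constructor)) do_my_init_ ## fn(void)      \
{                                                                     \
    register_init_fn(fn);                                             \
}

/* Stands in for a function like x86_cpu_register_types(). */
static void demo_register_types(void)
{
    types_registered++;
}

my_type_init(demo_register_types)

/* Counterpart of module_call_init(): called explicitly from main(). */
static void call_all_init_fns(void)
{
    for (size_t i = 0; i < n_init_fns; i++) {
        init_fns[i]();
    }
}
```

By the time main() starts, the constructor has already stored demo_register_types in the registry, but the function itself only runs once main() calls call_all_init_fns().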

Class initialization has no fixed location; instead, type_initialize() is executed before the class is first used, for example:

cpu_x86_create()
---
    oc = x86_cpu_class_by_name(name);
	  -> object_class_by_name();
	...
    cpu = X86_CPU(object_new(object_class_get_name(oc)));
---

object_class_by_name()
---
    TypeImpl *type = type_get_by_name(typename);

    if (!type) {
        return NULL;
    }

    type_initialize(type);

    return type->class;
---

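type_initialize() also initializes all of the type's parents along the way.

Objects are allocated on demand. For example, for the allocation of the default cpu object, see the code: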

main()
---
    machine_class = find_default_machine(); // find the machine with "is_default = true"
	...
  	current_machine = MACHINE(object_new(object_class_get_name(
                          OBJECT_CLASS(machine_class))));
    object_property_add_child(object_get_root(), "machine",
                              OBJECT(current_machine), &error_abort);

    machine = machine_class->qemu_machine;
	...
    machine->init(&current_machine->init_args);
	...
    cpu_synchronize_all_post_init();
---

static QEMUMachine pc_i440fx_machine_v2_0 = {
    PC_I440FX_2_0_MACHINE_OPTIONS,
    .name = "pc-i440fx-2.0",
    .alias = "pc",
    .init = pc_init_pci,
    .is_default = 1,
};

pc_init_pci()
  -> pc_init1()
    -> pc_cpus_init()
	   ---
	    /* init CPUs */
   	 	if (cpu_model == NULL)
	        cpu_model = "qemu64";
	   ...
	   for (i = 0; i < smp_cpus; i++) {
   	       cpu = pc_new_cpu(cpu_model, x86_cpu_apic_id_from_index(i),
                         icc_bridge, &error);
       }
	   ---
	   -> cpu_x86_create()
	      ---
			model_pieces = g_strsplit(cpu_model, ",", 2);
		    name = model_pieces[0];

		    oc = x86_cpu_class_by_name(name);
  			xcc = X86_CPU_CLASS(oc);

    		cpu = X86_CPU(object_new(object_class_get_name(oc)));
		  ---

QM.3 Property

For more on properties, see the following two links:

[Qemu-devel] qdev properties vs qom object properties https://qemu-devel.nongnu.narkive.com/e31JZTzZ/qdev-properties-vs-qom-object-properties
Features/QOM - QEMU Properties in QOM https://wiki.qemu.org/Features/QOM#Device_Properties

In summary, the main points are:

qdev properties are just a wrapper around QOM object properties, taking care of:

not allowing to set the property after realize
providing default values
letting people use the familiar DEFINE_PROP_* array syntax
pretty printing for "info qtree" (which right now is only used by PCI devfn and vlan properties)

More concretely, a property is a piece of configuration of some object, such as a virtual NIC; the NIC's mac, for instance, is one of its properties. In addition, some properties carry a 'side-effect', for example:

realized/unrealize. The official QEMU description is: Devices support the notion of "realize" which roughly corresponds to construction. More accurately, it corresponds to the moment before a device will be first consumed by a guest. "unrealize" roughly corresponds to reset. A device may be realized and unrealized many times during its lifecycle. See the related call stack:

#0  x86_cpu_reset (s=0x555556204a60) at /usr/src/debug/qemu-2.0.0/target-i386/cpu.c:2404
#1  0x00005555557f7141 in x86_cpu_realizefn (dev=0x555556204a60, errp=0x7fffffffdf20)
    at /usr/src/debug/qemu-2.0.0/target-i386/cpu.c:2630
#2  0x0000555555685ea8 in device_set_realized (obj=, value=, 
    err=0x7fffffffe010) at hw/core/qdev.c:757
#3  0x000055555575bede in property_set_bool (obj=0x555556204a60, v=, 
    opaque=0x555556215050, name=, errp=0x7fffffffe010) at qom/object.c:1420
#4  0x000055555575e4a7 in object_property_set_qobject (obj=0x555556204a60, value=, 
    name=0x5555558aac3a "realized", errp=0x7fffffffe010) at qom/qom-qobject.c:24
#5  0x000055555575d380 in object_property_set_bool (obj=obj@entry=0x555556204a60, 
    value=value@entry=true, name=name@entry=0x5555558aac3a "realized", errp=errp@entry=0x7fffffffe010)
    at qom/object.c:883
#6  0x00005555557c4e7e in pc_new_cpu (cpu_model=cpu_model@entry=0x5555558ec002 "qemu64", apic_id=0, 
    icc_bridge=icc_bridge@entry=0x555556201aa0, errp=errp@entry=0x7fffffffe050)
    at /usr/src/debug/qemu-2.0.0/hw/i386/pc.c:944
#7  0x00005555557c5f7f in pc_cpus_init (cpu_model=0x5555558ec002 "qemu64", 
    icc_bridge=icc_bridge@entry=0x555556201aa0) at /usr/src/debug/qemu-2.0.0/hw/i386/pc.c:1015
#8  0x00005555557c71bf in pc_init1 (args=0x5555561eab68, pci_enabled=1, kvmclock_enabled=1)
    at /usr/src/debug/qemu-2.0.0/hw/i386/pc_piix.c:108
#9  0x00005555555ee7d3 in main (argc=, argv=, envp=)
    at vl.c:4383

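The device initialization path thus invokes the callback of the "realized" property.

Properties are also used when device parameters are processed; see the code: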

qdev_device_add()
---
    /* set properties */
    if (qemu_opt_foreach(opts, set_property, dev, 1) != 0) {
		...
    }

    dev->opts = opts;
    object_property_set_bool(OBJECT(dev), true, "realized", &err);
---

set_property()
  -> object_property_parse()
	-> object_property_set()
	   ---
		ObjectProperty *prop = object_property_find(obj, name, errp);
	    if (prop == NULL) {
   	     return;
   		}

	    if (!prop->set) {
   	    	error_set(errp, QERR_PERMISSION_DENIED);
    	} else {
        	prop->set(obj, v, prop->opaque, name, errp);
    	}
	   ---

static Property nvme_props[] = {
    DEFINE_BLOCK_PROPERTIES(NvmeCtrl, conf),
    DEFINE_PROP_STRING("serial", NvmeCtrl, serial),
    DEFINE_PROP_END_OF_LIST(),
};

#define DEFINE_BLOCK_PROPERTIES(_state, _conf)                          \
    DEFINE_PROP_DRIVE("drive", _state, _conf.bs),                       \
    DEFINE_PROP_BLOCKSIZE("logical_block_size", _state,                 \
                          _conf.logical_block_size, 512),               \
    DEFINE_PROP_BLOCKSIZE("physical_block_size", _state,                \
                          _conf.physical_block_size, 512),              \
    DEFINE_PROP_UINT16("min_io_size", _state, _conf.min_io_size, 0),  \
    DEFINE_PROP_UINT32("opt_io_size", _state, _conf.opt_io_size, 0),    \
    DEFINE_PROP_INT32("bootindex", _state, _conf.bootindex, -1),        \
    DEFINE_PROP_UINT32("discard_granularity", _state, \
                       _conf.discard_granularity, -1)
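
To make the lookup-and-dispatch flow of object_property_set() more concrete, here is a minimal sketch with a hypothetical MiniDev that has a plain serial property and a realized property whose setter has a side effect. String values and the fixed property table are assumptions made for brevity, not how QEMU represents property values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device with one plain property and one with a side effect. */
typedef struct MiniDev {
    char serial[32];
    bool realized;
    int realize_calls;
} MiniDev;

typedef struct MiniProp {
    const char *name;
    int (*set)(MiniDev *dev, const char *value);  /* 0 on success, -1 on error */
} MiniProp;

static int set_serial(MiniDev *dev, const char *value)
{
    /* Refuse changes after realize, as qdev properties do. */
    if (dev->realized || strlen(value) >= sizeof(dev->serial)) {
        return -1;
    }
    strcpy(dev->serial, value);
    return 0;
}

/* Setting "realized" to "on" triggers the realize step once. */
static int set_realized(MiniDev *dev, const char *value)
{
    bool on = strcmp(value, "on") == 0;
    if (on && !dev->realized) {
        dev->realize_calls++;
    }
    dev->realized = on;
    return 0;
}

static const MiniProp mini_props[] = {
    { "serial", set_serial },
    { "realized", set_realized },
};

/* Look the property up by name and dispatch to its setter. */
static int mini_property_set(MiniDev *dev, const char *name, const char *value)
{
    assert(dev != NULL && name != NULL && value != NULL);
    for (size_t i = 0; i < sizeof(mini_props) / sizeof(mini_props[0]); i++) {
        if (strcmp(mini_props[i].name, name) == 0) {
            return mini_props[i].set(dev, value);
        }
    }
    return -1;  /* unknown property */
}
```

Setting serial and then realized mirrors what qdev_device_add() does: the ordinary options are applied first, and the final write to "realized" is what actually brings the device up.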

QM.4 QOM List

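Through QMP commands we can view the tree formed by all QOM objects and their property values, and from it understand the device topology and configuration. The following script is used: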

#!/bin/bash

opt=$1
obj=$2
path=$3
prop=$4

case $opt in
	list)
		virsh qemu-monitor-command $obj --pretty  "{\"execute\": \"qom-list\", \"arguments\": { \"path\": \"$path\" }}"
		;;
	get)
		virsh qemu-monitor-command $obj --pretty  "{\"execute\":  \"qom-get\", \"arguments\": { \"path\": \"$path\", \"property\":\"$prop\"}}"
		;;
	*)
		;;
esac

The following output is based on QEMU 7.0.0:

[root@dceff7e73f37 libvirt]# sh qmp.sh list will1 /
{
  "return": [
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "objects",
      "type": "child"
    },
    {
      "name": "machine",
      "type": "child"
    },
    {
      "name": "chardevs",
      "type": "child"
    }
  ],
  "id": "libvirt-487"
}

Let's look at /chardevs first:

[root@dceff7e73f37 libvirt]# sh qmp.sh list will1 /chardevs
{
  "return": [
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "charchannel0",
      "type": "child"
    },
    {
      "name": "charserial0",
      "type": "child"
    },
    {
      "name": "charmonitor",
      "type": "child"
    }
  ],
  "id": "libvirt-489"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh list will1 /chardevs/charmonitor
{
  "return": [
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "connected",
      "type": "bool"
    },
    {
      "name": "addr",
      "type": "SocketAddress"
    }
  ],
  "id": "libvirt-490"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /chardevs/charmonitor addr
{
  "return": {
    "path": "/var/lib/libvirt/qemu/domain-1-will1/monitor.sock",
    "type": "unix"
  },
  "id": "libvirt-491"
}

From the above we can see that the object for the qemu monitor sits at the same level as the virtual machine's machine object;

Next, let's look at machine:

[root@dceff7e73f37 libvirt]# sh qmp.sh list will1 /machine
{
  "return": [
    {
      "name": "type",
      "type": "string"
    },
    ...
    {
      "name": "smp",
      "type": "SMPConfiguration"
    },
    ...
    {
      "name": "memory-backend",
      "type": "string"
    },
    ...
    {
      "name": "q35",
      "type": "child"
    },
    ...
    {
      "name": "unattached",
      "type": "child"
    },
    {
      "name": "peripheral",
      "type": "child"
    },
    ...
  ],
  "id": "libvirt-492"
}

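The output is long, so part of it is omitted. We can see that machine contains two types of objects:

container, used to hold other objects, for example:

'unattached', which holds devices that have no parent device, most typically the cpu; see the code: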

device_set_realized()
---
	...
	if (value && !dev->realized) {
		...
        if (!obj->parent) {
            gchar *name = g_strdup_printf("device[%d]", unattached_count++);

            object_property_add_child(container_get(qdev_get_machine(),
                                                    "/unattached"),
                                      name, obj);
            unattached_parent = true;
            g_free(name);
        }
		...
	}
---

So '/machine/unattached' means no parent...

'peripheral', which holds all peripherals; see the code:

device_init_func()
  -> qdev_device_add()
	-> qdev_device_add_from_qdict()
	   ---
	    driver = qdict_get_try_str(opts, "driver");
		...
	    dev = qdev_new(driver);
		...
	    id = g_strdup(qdict_get_try_str(opts, "id"));
   		qdev_set_id(dev, id, errp);
		...
	    object_set_properties_from_keyval(&dev->parent_obj, dev->opts, from_json, errp);
	    qdev_realize(DEVICE(dev), bus, errp);
	   ---

Here we can look at the configuration of the cpu and of the devices:

[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/unattached/device[6] kvm
{
  "return": true,
  "id": "libvirt-496"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/unattached/device[6] type
{
  "return": "host-x86_64-cpu",
  "id": "libvirt-497"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/unattached/device[6] kvm_pv_eoi
{
  "return": true,
  "id": "libvirt-498"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/unattached/device[6] hv-apicv
{
  "return": false,
  "id": "libvirt-499"
}


[root@dceff7e73f37 libvirt]# sh qmp.sh list will1 /machine/peripheral/net0/virtio-pci[0]
{
  "return": [
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "container",
      "type": "link"
    },
    {
      "name": "addr",
      "type": "uint64"
    },
    {
      "name": "size",
      "type": "uint64"
    },
    {
      "name": "priority",
      "type": "uint32"
    }
  ],
  "id": "libvirt-509"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/peripheral/net0/virtio-pci[0] addr
{
  "return": 4271898624,
  "id": "libvirt-510"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/peripheral/net0/virtio-pci[0] size
{
  "return": 16384,
  "id": "libvirt-511"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/peripheral/net0 rx_queue_size
{
  "return": 256,
  "id": "libvirt-512"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh list will1 /machine/peripheral/net0/msix-table[0]
{
  "return": [
    {
      "name": "type",
      "type": "string"
    },
    {
      "name": "container",
      "type": "link"
    },
    {
      "name": "addr",
      "type": "uint64"
    },
    {
      "name": "size",
      "type": "uint64"
    },
    {
      "name": "priority",
      "type": "uint32"
    }
  ],
  "id": "libvirt-513"
}
[root@dceff7e73f37 libvirt]# sh qmp.sh get will1 /machine/peripheral/net0 mac
{
  "return": "52:54:00:3b:87:a1",
  "id": "libvirt-514"
}

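The above gives a more direct view of the form QOM takes inside QEMU and of the role it plays.

Options

This section looks at how Qemu handles its command-line arguments.

First, a few key structures:

The relationship among QemuOptsList, QemuOpts and QemuOpt is like that among a Directory, a Page, and the KV pairs on each Page;

See the code: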

vl.c main()
---
    qemu_add_opts(&qemu_drive_opts);
	...
    qemu_add_opts(&qemu_device_opts);
    qemu_add_opts(&qemu_netdev_opts);
	...
    qemu_add_opts(&qemu_machine_opts);
    qemu_add_opts(&qemu_smp_opts);
	...
---

QemuOptsList qemu_device_opts = {
    .name = "device",
    .implied_opt_name = "driver",
    .head = QTAILQ_HEAD_INITIALIZER(qemu_device_opts.head),
    .desc = {
        /*
         * no elements => accept any
         * sanity checking will happen later
         * when setting device properties
         */
        { /* end of list */ }
    },
};

This is how a QemuOptsList is defined and registered. Take -device as an example, using the following command line:

NVMe Emulation — QEMU documentation https://qemu-project.gitlab.io/qemu/system/devices/nvme.html

-drive file=nvm.img,if=none,id=nvm
-device nvme,serial=deadbeef,drive=nvm

The parsing code is:

vl.c main()
---
            case QEMU_OPTION_drive:
                if (qemu_opts_parse(qemu_find_opts("drive"), optstr, 0) == NULL) {
                    exit(1);
                }
	        break;
			...
            case QEMU_OPTION_device:
                if (!qemu_opts_parse(qemu_find_opts("device"), optarg, 1)) {
                    exit(1);
                }
                break;
---

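We will not analyze these two functions in depth here; the end result is that the following QemuOpts is constructed:

The special entry is driver=nvme; see the code:

the implied_opt_name of the "device" QemuOptsList is "driver",
and the permit_abbrev argument passed in is 1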

opts_parse()
---
    firstname = permit_abbrev ? list->implied_opt_name : NULL;
	...
    if (opts_do_parse(opts, params, firstname, defaults) != 0) {
        qemu_opts_del(opts);
        return NULL;
    }
---

opts_do_parse()
---
	for (p = params; *p != '\0'; p++) {
        pe = strchr(p, '=');
        pc = strchr(p, ',');
        if (!pe || (pc && pc < pe)) {
            /* found "foo,more" */
            if (p == params && firstname) {
                /* implicitly named first option */
                pstrcpy(option, sizeof(option), firstname);
                p = get_opt_value(value, sizeof(value), p);
            } else {
				...
            }
        } ....
        if (strcmp(option, "id") != 0) {
            /* store and parse */
            opt_set(opts, option, value, prepend, &local_err);
            if (local_err) {
                qerror_report_err(local_err);
                error_free(local_err);
                return -1;
            }
        }
        if (*p != ',') {
            break;
        }
    }
---
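
How the implied first option name works can be sketched with a much-reduced parser. OptSet, parse_opts and opt_get are hypothetical; the sketch assumes there are no escaped commas, does not treat id specially, and rejects bare flags, so it covers only the key=value path and the implied first name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_OPTS 8

typedef struct KV {
    char name[32];
    char value[64];
} KV;

typedef struct OptSet {
    KV kv[MAX_OPTS];
    int n;
} OptSet;

/*
 * Parse "a=b,c=d" into opts. If firstname is non-NULL and the first item
 * has no '=', that item is stored under firstname.
 * Returns 0 on success, -1 on error.
 */
static int parse_opts(OptSet *opts, const char *params, const char *firstname)
{
    const char *p = params;

    assert(opts != NULL && params != NULL);
    opts->n = 0;
    while (*p != '\0') {
        size_t len = strcspn(p, ",");            /* length of this item */
        const char *eq = memchr(p, '=', len);
        KV *kv;

        if (opts->n == MAX_OPTS) {
            return -1;
        }
        kv = &opts->kv[opts->n];
        if (eq) {
            /* explicit "name=value" */
            snprintf(kv->name, sizeof(kv->name), "%.*s", (int)(eq - p), p);
            snprintf(kv->value, sizeof(kv->value), "%.*s",
                     (int)(len - (size_t)(eq - p) - 1), eq + 1);
        } else if (p == params && firstname) {
            /* implicitly named first option */
            snprintf(kv->name, sizeof(kv->name), "%s", firstname);
            snprintf(kv->value, sizeof(kv->value), "%.*s", (int)len, p);
        } else {
            return -1;                           /* bare flag: not modelled */
        }
        opts->n++;
        p += len;
        if (*p == ',') {
            p++;
        }
    }
    return 0;
}

/* Return the value stored under name, or NULL if it is absent. */
static const char *opt_get(const OptSet *opts, const char *name)
{
    for (int i = 0; i < opts->n; i++) {
        if (strcmp(opts->kv[i].name, name) == 0) {
            return opts->kv[i].value;
        }
    }
    return NULL;
}
```

Parsing "nvme,serial=deadbeef,drive=nvm" with firstname set to "driver" yields driver=nvme, serial=deadbeef and drive=nvm, which is why qdev_device_add() can later read the device type with qemu_opt_get(opts, "driver").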

After the arguments have been parsed, Qemu builds the devices from them; again, take the device part as an example:

vl.c main()
  -> qemu_opts_foreach(qemu_find_opts("device"), device_init_func, NULL, 1)
    -> device_init_func()
	  -> qdev_device_add()

qdev_device_add()
---
    ObjectClass *oc;
    DeviceClass *dc;

    driver = qemu_opt_get(opts, "driver");
    
	oc = object_class_by_name(driver);
	...
    dc = DEVICE_CLASS(oc);
	...
    path = qemu_opt_get(opts, "bus");
    if (path != NULL) {
		...
    } else if (dc->bus_type != NULL) {
        bus = qbus_find_recursive(sysbus_get_default(), NULL, dc->bus_type);
		..
    }
    /* create device */
    dev = DEVICE(object_new(driver));

    if (bus) {
        qdev_set_parent_bus(dev, bus);
    }
	...
    dev->opts = opts;
    object_property_set_bool(OBJECT(dev), true, "realized", &err);
	...
---

static const TypeInfo nvme_info = {
    .name          = "nvme",
    .parent        = TYPE_PCI_DEVICE,
    .instance_size = sizeof(NvmeCtrl),
    .class_init    = nvme_class_init,
};

MemoryRegion

MR.1 Hierarchy

(To be continued)

MR.2 Listeners

When MemoryRegion information is updated, the following functions are called:

memory_region_transaction_commit()
  -> address_space_update_topology()
    -> generate_memory_topology()
	  -> render_memory_region()
	  -> flatview_simplify()
    -> address_space_update_topology_pass() // pass 0
    -> address_space_update_topology_pass() // pass 1

render_memory_region() converts the MemoryRegion hierarchy into a FlatView:

struct AddrRange {
    Int128 start;
    Int128 size;
};

struct FlatRange {
    MemoryRegion *mr;
    hwaddr offset_in_region;
    AddrRange addr;
	...
};

/* Flattened global view of current active memory hierarchy.  Kept in sorted
 * order.
 */
struct FlatView {
    unsigned ref;
    FlatRange *ranges;
    unsigned nr;
    unsigned nr_allocated;
};


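A FlatView is simply an array of FlatRange sorted in ascending order; for details see render_memory_region().

flatview_simplify() merges the entries of the FlatView that can be merged;

address_space_update_topology_pass() compares the old and new FlatViews and then calls the memory_listeners' callbacks, such as region_add and region_del;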
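
The merging performed by flatview_simplify() can be sketched as follows. The Range structure and can_merge() here are simplified assumptions: regions are identified by an integer id and addresses are 64-bit, and attributes the real code may also compare (for example read-only state) are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified FlatRange: 64-bit addresses, region by id. */
typedef struct Range {
    int region_id;
    uint64_t offset_in_region;
    uint64_t start;
    uint64_t size;
} Range;

/* Two neighbours can merge if they are contiguous in both address spaces. */
static int can_merge(const Range *a, const Range *b)
{
    return a->region_id == b->region_id &&
           a->start + a->size == b->start &&
           a->offset_in_region + a->size == b->offset_in_region;
}

/* Merge mergeable neighbours of a sorted array in place; returns new count. */
static size_t simplify(Range *r, size_t n)
{
    size_t out = 0;

    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && can_merge(&r[out - 1], &r[i])) {
            r[out - 1].size += r[i].size;   /* extend the previous range */
        } else {
            r[out++] = r[i];                /* keep as a separate range */
        }
    }
    return out;
}
```

Two 4 KiB ranges of the same region that are adjacent both in guest-physical address and in region offset collapse into one 8 KiB range, while a range belonging to another region stays separate.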

Next, by following memory_listener_register(), let's see which listeners exist in qemu:

address_space_memory, used to handle PIO and MMIO requests coming from the kernel; see the code:

vl.c main()
  -> cpu_exec_init_all()
	-> memory_map_init()
	   ---
    memory_region_init(system_memory, NULL, "system", UINT64_MAX);
    address_space_init(&address_space_memory, system_memory, "memory");

    system_io = g_malloc(sizeof(*system_io));
    memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io",
                          65536);
    address_space_init(&address_space_io, system_io, "I/O");
	   ---

address_space_init()
  -> address_space_init_dispatch()
	-> memory_listener_register()

kvm_cpu_exec()
  -> cpu_physical_memory_rw()
    -> address_space_rw(&address_space_memory, addr, buf, len, is_write);

memory slot, used to update the kernel's memslot information; see the code:

kvm_init()
	-> memory_listener_register(&kvm_memory_listener, &address_space_memory);

kvm_region_add()
  -> kvm_set_phys_mem()
	-> kvm_set_user_memory_region()
	  -> ioctl of KVM_SET_USER_MEMORY_REGION

Qemu Task Model

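TM.1 Overview

Qemu itself is an event-driven framework; for details see: Improving the QEMU Event Loop http://events17.linuxfoundation.org/sites/events/files/slides/Improving%20the%20QEMU%20Event%20Loop%20-%203.pdf

The data there may be somewhat dated, but it still points in the right general direction; after studying the code in depth, we will revise that data at the end.

All code in this section is based on qemu 7.0.0.

TM.2 Coroutine

TM.2.1 Overview

Why did Qemu introduce coroutines?

Here we refer to two sources:

The first is the commit that introduced qemu coroutines: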

commit 00dccaf1f848290d979a4b1e6248281ce1b32aaa
Author: Kevin Wolf 
Date:   Mon Jan 17 16:08:14 2011 +0000

    coroutine: introduce coroutines
    
    Asynchronous code is becoming very complex.  At the same time
    synchronous code is growing because it is convenient to write.
    Sometimes duplicate code paths are even added, one synchronous and the
    other asynchronous.  This patch introduces coroutines which allow code
    that looks synchronous but is asynchronous under the covers.
    
    A coroutine has its own stack and is therefore able to preserve state
    across blocking operations, which traditionally require callback
    functions and manual marshalling of parameters.
    ...

Coroutines were introduced so that code can be written in a simpler, more convenient synchronous style while, underneath, it runs in the better-performing asynchronous manner;

The second is a developer's blog; see the link: Stefan Hajnoczi: Coroutines in QEMU: The basics http://blog.vmsplice.net/2014/01/coroutines-in-qemu-basics.html

An excerpt:

QEMU is an event-driven program with a main loop that invokes callback functions when file descriptors or timers become ready. Callbacks become hard to manage when multiple steps are needed as part of a single high-level operation:

With coroutines, complex multi-step asynchronous code becomes simple and intuitive; see the example from the link:

/* 3-step process written using callbacks */
void start(void)
{
    send("Hi, what's your name? ", step1);
}

void step1(void)
{
    read_line(step2);
}

void step2(const char *name)
{
    send("Hello, %s\n", name, step3);
}

void step3(void)
{
    /* done! */
}
            |\/|
            |  |
           \|  |/
            \  /
             \/
             
/* 3-step process using coroutines */
void coroutine_fn say_hello(void)
{
    const char *name;

    co_send("Hi, what's your name? ");
    name = co_read_line();
    co_send("Hello, %s\n", name);
    /* done! */
}

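In summary, the main purpose of introducing Qemu coroutines is to make programming simpler and the code more readable without giving up performance.

TM.2.2 Coroutine implementation

On Linux, the coroutine implementation is based on two functions: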

sigsetjmp, a call to sigsetjmp() saves the calling environment in its env parameter for later use by siglongjmp();
siglongjmp, the siglongjmp() function restores the environment saved by the most recent invocation of sigsetjmp() in the same thread, with the corresponding env argument;

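Here is how the two functions are used in qemu coroutines. First, a few core pieces:

current, the context that is currently running; compare the kernel's current macro. qemu uses a thread-local variable: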
```
static __thread Coroutine *current;
```

switch, which switches from one context to another; compare the kernel's context_switch():

qemu_coroutine_switch()
---
	current = to_;

    ret = sigsetjmp(from->env, 0);
    if (ret == 0)
        siglongjmp(to->env, action);
---

qemu_coroutine_enter(), which switches to the next coroutine; compare the kernel's schedule() function:

qemu_aio_coroutine_enter()
---
 	QSIMPLEQ_INSERT_TAIL(&pending, co, co_queue_next);

    /* Run co and any queued coroutines */
    while (!QSIMPLEQ_EMPTY(&pending)) {
        Coroutine *to = QSIMPLEQ_FIRST(&pending);
       	...
        QSIMPLEQ_REMOVE_HEAD(&pending, co_queue_next);
    	...
        to->caller = from;
        to->ctx = ctx;
        ...
        ret = qemu_coroutine_switch(from, to, COROUTINE_ENTER);
		...
        QSIMPLEQ_PREPEND(&pending, &to->co_queue_wakeup);
		...
    }
---

co_queue_wakeup holds the coroutines that were woken up from within the context of "to"; see the function:

aio_co_enter()
---
    if (qemu_in_coroutine()) {
        Coroutine *self = qemu_coroutine_self();
        assert(self != co);
        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
    }
---

qemu_coroutine_yield() switches back to the caller, i.e. the caller of qemu_coroutine_enter(); see the code:

qemu_coroutine_yield() 
---
	Coroutine *self = qemu_coroutine_self();
    Coroutine *to = self->caller;
	...
    self->caller = NULL;
    qemu_coroutine_switch(self, to, COROUTINE_YIELD);
---
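
The enter/yield pairing can be tried out with a small sketch. Note that it swaps in a different mechanism: it uses the POSIX ucontext functions (getcontext, makecontext, swapcontext) instead of sigsetjmp/siglongjmp, because they can also set up the new stack in a few lines. MiniCo and the mini_* functions are hypothetical, and the sketch assumes a Linux/glibc system and a single thread.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

/* Hypothetical miniature coroutine; QEMU's real Coroutine has more state. */
typedef struct MiniCo {
    ucontext_t ctx;                  /* saved context of the coroutine */
    ucontext_t caller;               /* saved context of whoever entered it */
    void (*entry)(struct MiniCo *);
    int done;
    char stack[64 * 1024];
} MiniCo;

static MiniCo *current_co;           /* QEMU keeps this per thread */
static int demo_log;                 /* progress marker used by demo_entry */

static void trampoline(void)
{
    MiniCo *co = current_co;
    co->entry(co);
    co->done = 1;
    /* Returning resumes uc_link, i.e. the caller of mini_enter(). */
}

static void mini_create(MiniCo *co, void (*entry)(MiniCo *))
{
    int rc = getcontext(&co->ctx);
    assert(rc == 0);
    co->ctx.uc_stack.ss_sp = co->stack;
    co->ctx.uc_stack.ss_size = sizeof(co->stack);
    co->ctx.uc_link = &co->caller;
    co->entry = entry;
    co->done = 0;
    makecontext(&co->ctx, trampoline, 0);
}

/* Switch from the caller into co; returns when co yields or finishes. */
static void mini_enter(MiniCo *co)
{
    MiniCo *prev = current_co;
    current_co = co;
    swapcontext(&co->caller, &co->ctx);
    current_co = prev;
}

/* Switch from the running coroutine back to whoever entered it. */
static void mini_yield(void)
{
    MiniCo *self = current_co;
    swapcontext(&self->ctx, &self->caller);
}

/* Example coroutine body: does one step, yields, then does another. */
static void demo_entry(MiniCo *co)
{
    (void)co;
    demo_log = 1;
    mini_yield();
    demo_log = 3;
}
```

The first mini_enter() runs demo_entry() up to its mini_yield() and returns to the caller; the second one resumes it right after the yield and lets it finish. This is the same control flow that qemu_coroutine_enter() and qemu_coroutine_yield() provide, minus the pending queue.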

Now that we understand qemu's coroutine implementation, we can look at a few simple examples:

mutex implementation:

qemu_co_mutex_wake()
---
    mutex->ctx = co->ctx;
    aio_co_wake(co);
	  -> aio_co_enter();
---

qemu_co_mutex_lock()
  -> qemu_coroutine_yield()

laio implementation:

laio_co_submit()
---
    struct qemu_laiocb laiocb = {
        .co         = qemu_coroutine_self(),
        .nbytes     = qiov->size,
        .ctx        = s,
        .ret        = -EINPROGRESS,
        .is_read    = (type == QEMU_AIO_READ),
        .qiov       = qiov,
    };

    ret = laio_do_submit(fd, &laiocb, offset, type, dev_max_batch);
    if (ret < 0) {
        return ret;
    }

    if (laiocb.ret == -EINPROGRESS) {
        qemu_coroutine_yield();
    }
    return laiocb.ret;
---


qemu_laio_process_completions()
  -> qemu_laio_process_completion()
      ---
	    if (!qemu_coroutine_entered(laiocb->co)) {
   	    	aio_co_wake(laiocb->co);
   		}
	   ---

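laio is built on linux libaio, whose interface is itself asynchronous; through coroutines, however, the code reads like synchronous code.

All of the above could also be implemented with Linux processes or threads, but the overhead would be very high. Moreover, because tasks keep going to sleep, new tasks would have to be created continually so that other work can proceed, which leads to a large increase in the number of threads. The kernel's io-uring makes buffered IO asynchronous in exactly this way, through kernel threads called iowq; from this point of view, iowq does deliver the expected functionality, but it is not the optimal solution in terms of system performance.

TM.3 glib

This part uses many glib functions; see the following link:

The Main Event Loop http://tux.iar.unlp.edu.ar/~fede/manuales/glib/glib-The-Main-Event-Loop.html

TM.4 AIO

TM.4.1 AIO driving model

What drives aio and keeps it running?

qemu has a basic aio instance, qemu_aio_context; let's see how it runs;

Initialization, see the following code: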

qemu_init_main_loop()
---
    qemu_aio_context = aio_context_new(errp);
      -> ctx = (AioContext *) g_source_new(&aio_source_funcs, sizeof(AioContext));
	...
    src = aio_get_g_source(qemu_aio_context);
    g_source_set_name(src, "aio-context");
    g_source_attach(src, NULL);
---

qemu_main_loop()
---
    while (!main_loop_should_exit()) {
		...
        main_loop_wait(false);
		  -> os_host_main_loop_wait()
	        -> qemu_poll_ns()
			-> glib_pollfds_poll()
			  -> g_main_context_dispatch()
		...
    }
---

qemu_aio_context is one of the sources of the glib context used by qemu's main loop;

A new fd is added to qemu_aio_context through the following interface:

aio_set_fd_handler()
---
 	new_node = g_new0(AioHandler, 1);
 	...
 	new_node->pfd.fd = fd;
 	...
     g_source_add_poll(&ctx->source, &new_node->pfd);

     new_node->pfd.events = (io_read ? G_IO_IN | G_IO_HUP | G_IO_ERR : 0);
     new_node->pfd.events |= (io_write ? G_IO_OUT | G_IO_ERR : 0);

 	QLIST_INSERT_HEAD_RCU(&ctx->aio_handlers, new_node, node);
---

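The fd is added to qemu_aio_context (or another aio ctx), its handler functions are set, and the node is linked onto ctx->aio_handlers; when an event occurs on the fd, the dispatch callback is invoked: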

aio_source_funcs.aio_ctx_dispatch()
  -> aio_dispatch()
	-> aio_dispatch_handlers() // iterate ctx->aio_handlers
	  -> aio_dispatch_handler()
	     ---
			revents = node->pfd.revents & node->pfd.events;
			node->pfd.revents = 0;
			...
			if (!QLIST_IS_INSERTED(node, node_deleted) &&
		        (revents & (G_IO_IN | G_IO_HUP | G_IO_ERR)) &&
		        aio_node_check(ctx, node->is_external) &&
		        node->io_read) {
		        node->io_read(node->opaque);
				...
		    }

		    if (!QLIST_IS_INSERTED(node, node_deleted) &&
		        (revents & (G_IO_OUT | G_IO_ERR)) &&
		        aio_node_check(ctx, node->is_external) &&
		        node->io_write) {
		        node->io_write(node->opaque);
				...
    		}
		 ---

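It iterates over all aio_handlers, checks whether the corresponding fd has events, and calls the matching handler.

In summary, at the lowest level aio depends on fd multiplexing mechanisms such as poll/epoll;

TM.4.2 aio bh

BH stands for bottom half. Its purpose is to insert into an aio context an operation that must be carried out within that aio ctx, similar to the way a Linux kernel interrupt defers part of its work to the bottom half, for example by moving operations that may sleep into a kworker;

aio bh relies on the aio notifier mechanism; see the code: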

aio_context_new()
---
 	ret = event_notifier_init(&ctx->notifier, false);
	---
	    ret = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
		...
        e->rfd = e->wfd = ret;
	---
    ...
    aio_set_event_notifier(ctx, &ctx->notifier,
                           false,
                           aio_context_notifier_cb,
                           aio_context_notifier_poll,
                           aio_context_notifier_poll_ready);
		-> aio_set_fd_handler(ctx, event_notifier_get_fd(notifier) ...); // event_notifier_get_fd() is 'e->rfd'

	...
---

aio notifier其实就是aio context内置的一个eventfd；

aio_notify()
  -> event_notifier_set()
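    -> ret = write(e->wfd, &value, sizeof(value)); // value = 1

aio_context_notifier_cb()
  -> event_notifier_test_and_clear()
    -> len = read(e->rfd, buffer, sizeof(buffer));

aio_notify() only ever writes the constant 1 to the eventfd (a write of 0 would not make the eventfd readable). Clearly its purpose is not to carry data but to wake the main loop so that it enters aio's dispatch function.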
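The set and test-and-clear pair can be sketched directly on top of eventfd(2). This is a minimal Linux-only illustration; struct notifier and the notifier_*() helpers are hypothetical names, not QEMU's EventNotifier API.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical miniature of an event notifier: one eventfd serves as
 * both the read end and the write end. */
struct notifier {
    int fd;
};

int notifier_init(struct notifier *n)
{
    n->fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    return n->fd < 0 ? -1 : 0;
}

/* Wake whoever polls the fd: add 1 to the eventfd counter. */
int notifier_set(struct notifier *n)
{
    uint64_t value = 1;

    return write(n->fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Drain the counter. Returns 1 if the notifier had been set, 0 otherwise
 * (a non-blocking read of a zero counter fails with EAGAIN). */
int notifier_test_and_clear(struct notifier *n)
{
    uint64_t counter;

    return read(n->fd, &counter, sizeof(counter)) == (ssize_t)sizeof(counter);
}
```

Several notifier_set() calls before one notifier_test_and_clear() collapse into a single wakeup, which is all a "please run dispatch" signal needs.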

aio bh is built on this mechanism; see the code:

aio_bh_schedule_oneshot()
  -> aio_bh_schedule_oneshot_full()
    -> aio_bh_enqueue()
	   ---
	    old_flags = qatomic_fetch_or(&bh->flags, BH_PENDING | new_flags);
	    if (!(old_flags & BH_PENDING)) {
   	    	QSLIST_INSERT_HEAD_ATOMIC(&ctx->bh_list, bh, next);
   		}

   		aio_notify(ctx);
	   ---

aio_ctx_dispatch()
  -> aio_dispatch()
	-> aio_bh_poll()
	   ---
	    QSLIST_MOVE_ATOMIC(&slice.bh_list, &ctx->bh_list);
	    QSIMPLEQ_INSERT_TAIL(&ctx->bh_slice_list, &slice, next);

	    while ((s = QSIMPLEQ_FIRST(&ctx->bh_slice_list))) {
			...
	        bh = aio_bh_dequeue(&s->bh_list, &flags);
   			...
	        if ((flags & (BH_SCHEDULED | BH_DELETED)) == BH_SCHEDULED) {
				...
            	aio_bh_call(bh);
	        }
			...
    	}
	   ---

aio_bh_schedule_oneshot() atomically inserts a bh into the ctx and notifies that ctx; the ctx then calls aio_bh_poll(), which runs the bh callbacks.
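The enqueue and poll halves can be sketched with C11 atomics. This is an illustration under simplifying assumptions: struct bh, struct ctx, bh_enqueue() and bh_poll() are hypothetical names, the notified counter stands in for aio_notify(), and the slice list and deletion handling of the real code are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BH_PENDING   1u
#define BH_SCHEDULED 2u

struct bh {
    atomic_uint flags;
    struct bh *next;
    void (*cb)(void *opaque);
    void *opaque;
};

struct ctx {
    _Atomic(struct bh *) bh_list;  /* lock-free LIFO of pending bhs */
    int notified;                  /* stand-in for aio_notify() */
};

/* Example callback: counts how often it ran. */
void count_cb(void *opaque)
{
    ++*(int *)opaque;
}

/* Any thread may call this: link the bh only on the first PENDING transition. */
void bh_enqueue(struct ctx *ctx, struct bh *bh, unsigned new_flags)
{
    unsigned old = atomic_fetch_or(&bh->flags, BH_PENDING | new_flags);

    if (!(old & BH_PENDING)) {
        struct bh *head = atomic_load(&ctx->bh_list);
        do {
            bh->next = head;
        } while (!atomic_compare_exchange_weak(&ctx->bh_list, &head, bh));
    }
    ctx->notified++;
}

/* Runs in the ctx's own thread: take the whole list, then call each bh. */
int bh_poll(struct ctx *ctx)
{
    struct bh *list = atomic_exchange(&ctx->bh_list, NULL);
    int ran = 0;

    while (list) {
        struct bh *bh = list;
        unsigned flags;

        list = bh->next;
        flags = atomic_fetch_and(&bh->flags, ~(BH_PENDING | BH_SCHEDULED));
        if (flags & BH_SCHEDULED) {
            bh->cb(bh->opaque);
            ran++;
        }
    }
    return ran;
}
```

Scheduling the same bh twice before it runs links it only once, which is the point of the BH_PENDING test.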

TM.4.3 aio coroutine schedule

We have already looked at the low-level implementation of QEMU coroutines, but how does QEMU actually use them?

Consider the following usage scenario:

virtio_mmio_write()
  -> virtio_queue_notify()
	-> vq->handle_output()
       virtio_blk_handle_output()
	     -> virtio_blk_handle_vq()
	       -> virtio_blk_handle_request()
	         -> virtio_blk_submit_multireq()
	           -> submit_requests()
	             -> blk_aio_pwritev() // virtio_blk_rw_complete()
	               -> blk_aio_prwv()

blk_aio_prwv()
---
    acb = blk_aio_get(&blk_aio_em_aiocb_info, blk, cb, opaque);
    acb->rwco = (BlkRwCo) {
        .blk    = blk,
        .offset = offset,
        .iobuf  = iobuf,
        .flags  = flags,
        .ret    = NOT_DONE,
    };
    acb->bytes = bytes;
    acb->has_returned = false;

    co = qemu_coroutine_create(co_entry, acb);
    bdrv_coroutine_enter(blk_bs(blk), co);

    acb->has_returned = true;
    if (acb->rwco.ret != NOT_DONE) {
        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
                                         blk_aio_complete_bh, acb);
    }
---

bdrv_coroutine_enter()
  -> aio_co_enter()
	 ---
    if (ctx != qemu_get_current_aio_context()) {
        aio_co_schedule(ctx, co);
        return;
    }

    if (qemu_in_coroutine()) {
        Coroutine *self = qemu_coroutine_self();
        assert(self != co);
        QSIMPLEQ_INSERT_TAIL(&self->co_queue_wakeup, co, co_queue_next);
    } else {
        aio_context_acquire(ctx);
        qemu_aio_coroutine_enter(ctx, co);
        aio_context_release(ctx);
    }
	 ---

Eventually the vcpu thread calls aio_co_enter(). What happens next? The vcpu thread obviously should not run the coroutine itself; the deciding factor is qemu_get_current_aio_context(), see the code:

AioContext *qemu_get_current_aio_context(void)
{
    AioContext *ctx = get_my_aiocontext();
    if (ctx) {
        return ctx;
    }
	...
}

void qemu_set_current_aio_context(AioContext *ctx)
{
    assert(!get_my_aiocontext());
    set_my_aiocontext(ctx);
}

iothread_run()
qemu_init_main_loop()

The current aio context is stored in a thread-local variable, and only two functions set it: iothread_run() and qemu_init_main_loop(). The vcpu thread never sets it, so aio_co_enter() takes the aio_co_schedule() path; see the code:

aio_co_schedule()
---
    QSLIST_INSERT_HEAD_ATOMIC(&ctx->scheduled_coroutines,
                              co, co_scheduled_next);
    qemu_bh_schedule(ctx->co_schedule_bh);
---

This triggers the bh, which was created in:

aio_context_new()
  -> ctx->co_schedule_bh = aio_bh_new(ctx, co_schedule_bh_cb, ctx);

The bh handler then triggers coroutine scheduling:
co_schedule_bh_cb()
---
    AioContext *ctx = opaque;
    QSLIST_HEAD(, Coroutine) straight, reversed;

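    /*
        The atomic list always inserts new members at the head, so
        scheduled_coroutines holds its tasks in last-in-first-out order.
        reversed and straight are used to invert the list so that the
        coroutines run first-in-first-out.
    */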
    QSLIST_MOVE_ATOMIC(&reversed, &ctx->scheduled_coroutines);
    QSLIST_INIT(&straight);

    while (!QSLIST_EMPTY(&reversed)) {
        Coroutine *co = QSLIST_FIRST(&reversed);
        QSLIST_REMOVE_HEAD(&reversed, co_scheduled_next);
        QSLIST_INSERT_HEAD(&straight, co, co_scheduled_next);
    }

    while (!QSLIST_EMPTY(&straight)) {
        Coroutine *co = QSLIST_FIRST(&straight);
        QSLIST_REMOVE_HEAD(&straight, co_scheduled_next);
		...
        qemu_aio_coroutine_enter(ctx, co);
		...
    }

---
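The list inversion in that handler is worth seeing in isolation. The sketch below uses a hypothetical struct node and helper names rather than QEMU's QSLIST macros, and shows that popping from the reversed list and pushing onto a second list restores arrival order.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int id;
    struct node *next;
};

/* Head insertion, the only cheap operation on an atomic singly linked list. */
void push_head(struct node **head, struct node *n)
{
    n->next = *head;
    *head = n;
}

/* Pop every node from the LIFO list and push it onto a new list.
 * The result holds the nodes in the order they were first inserted. */
struct node *to_fifo(struct node *reversed)
{
    struct node *straight = NULL;

    while (reversed) {
        struct node *n = reversed;

        reversed = n->next;
        push_head(&straight, n);
    }
    return straight;
}
```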

Device Emulation

DE.0 Overview

QEMU is, at its core, device emulation software.

Device emulation is split into a front end and a back end.

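DE.1 CPU and chipset

DE.1.1 Chipset

Consider the q35 chipset:

Two questions are worth thinking about here:

Do we need all the functions shown in the figure?
Does QEMU need to move to a newer chipset?

On the first question: when we start a virtual machine used purely for computation, functions such as USB, graphics, and sound are not needed. Now look at the command-line parameters used to start a real virtual machine: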

-name guest=will1,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-will1/master-key.aes"} -machine pc-q35-rhel9.0.0,usb=off,dump-guest-core=off,memory-backend=pc.ram -accel kvm -cpu host,migratable=on -m 16384 -object {"qom-type":"memory-backend-ram","id":"pc.ram","size":17179869184} -overcommit mem-lock=off -smp 8,sockets=8,cores=1,threads=1 -uuid 48fd510f-4c9e-4761-b37c-5dfe0af49a63 -display none -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=23,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device {"driver":"pcie-root-port","port":8,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x1"} -device {"driver":"pcie-root-port","port":9,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x1.0x1"} -device {"driver":"pcie-root-port","port":10,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x1.0x2"} -device {"driver":"pcie-root-port","port":11,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x1.0x3"} -device {"driver":"pcie-root-port","port":12,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x1.0x4"} -device {"driver":"pcie-root-port","port":13,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x1.0x5"} -device {"driver":"pcie-root-port","port":14,"chassis":7,"id":"pci.7","bus":"pcie.0","addr":"0x1.0x6"} -device {"driver":"pcie-root-port","port":15,"chassis":8,"id":"pci.8","bus":"pcie.0","addr":"0x1.0x7"} -device {"driver":"pcie-root-port","port":16,"chassis":9,"id":"pci.9","bus":"pcie.0","multifunction":true,"addr":"0x2"} -device {"driver":"pcie-root-port","port":17,"chassis":10,"id":"pci.10","bus":"pcie.0","addr":"0x2.0x1"} -device {"driver":"pcie-root-port","port":18,"chassis":11,"id":"pci.11","bus":"pcie.0","addr":"0x2.0x2"} -device 
{"driver":"pcie-root-port","port":19,"chassis":12,"id":"pci.12","bus":"pcie.0","addr":"0x2.0x3"} -device {"driver":"pcie-root-port","port":20,"chassis":13,"id":"pci.13","bus":"pcie.0","addr":"0x2.0x4"} -device {"driver":"pcie-root-port","port":21,"chassis":14,"id":"pci.14","bus":"pcie.0","addr":"0x2.0x5"} -device {"driver":"qemu-xhci","p2":15,"p3":15,"id":"usb","bus":"pci.3","addr":"0x0"} -device {"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.4","addr":"0x0"} -blockdev {"driver":"file","filename":"/vm/will1_sda.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-2-format","read-only":false,"driver":"qcow2","file":"libvirt-2-storage","backing":null} -device {"driver":"virtio-blk-pci","bus":"pci.5","addr":"0x0","drive":"libvirt-2-format","id":"virtio-disk0","bootindex":1} -device {"driver":"ide-cd","bus":"ide.0","id":"sata0-0-0"} -netdev tap,fd=24,vhost=on,vhostfd=26,id=hostnet0 -device {"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"52:54:00:3b:87:a1","bus":"pci.1","addr":"0x0"} -netdev user,id=hostnet1 -device {"driver":"virtio-net-pci","netdev":"hostnet1","id":"net1","mac":"52:54:00:c2:53:da","bus":"pci.2","addr":"0x0"} -chardev pty,id=charserial0 -device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0} -chardev socket,id=charchannel0,fd=22,server=on,wait=off -device {"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.qemu.guest_agent.0"} -audiodev {"id":"audio1","driver":"none"} -device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.6","addr":"0x0"} -object {"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"} -device {"driver":"virtio-rng-pci","rng":"objrng0","id":"rng0","bus":"pci.7","addr":"0x0"} -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on

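The corresponding devices:

The virtual NICs and storage we use are all virtio devices, not the chipset's native SATA controller or integrated NIC.

What q35 really gives users is not q35 itself but PCIe. Apart from the fact that most devices of recent years are PCIe devices, even under emulation PCIe MSI performs better than the IOAPIC.

This also answers the second question: a newer chipset does not bring better performance, because most devices today are still PCIe devices, so upgrading the chipset is unnecessary.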

DE.1.2 PCIE

We need to understand PCIe better to see why it is emulated this way.

First, the PCIe topology: PCI Express System Architecture 200 (abridged edition) https://www.mindshare.com/files/ebooks/pci%20express%20system%20architecture.pdf

PCI Express System Architecture, Chapter 3 (complete) https://www.pearsonhighered.com/assets/samplechapter/0/3/2/1/0321156307.pdf The key passage is quoted here:

Unlike shared-bus architectures such as PCI and PCI-X, where traffic is visible to each device and routing is mainly a concern of bridges, PCI Express devices are dependent on each other to accept traffic or forward it in the direction of the ultimate recipient.

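The links between PCIe Endpoints are dedicated point-to-point links. The one exception is the PCIe Switch, which works as follows:

The link between a PCIe Switch's upstream port and the root complex is shared by the devices on its downstream ports. The switch routes each TLP (Transaction Layer Packet) arriving at the upstream port by comparing the address information in the TLP against the configuration held in the type 1 configuration space of each downstream port.

Note: every downstream port has its own type 1 configuration space.

An ordinary PCIe Endpoint only needs to compare the address against its own BARs.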
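Address routing in a switch can be sketched as a window lookup. This is a deliberately reduced model: struct downstream_port and route_mem_tlp() are hypothetical, and only one memory window per port is modeled, whereas a real type 1 header also carries prefetchable, I/O, and bus-number ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of a downstream port's type 1 header:
 * the memory window that the port forwards downstream. */
struct downstream_port {
    uint64_t mem_base;
    uint64_t mem_limit;   /* inclusive upper bound */
};

/* Returns the index of the port whose window contains addr,
 * or -1 if no downstream port claims the TLP. */
int route_mem_tlp(const struct downstream_port *ports, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= ports[i].mem_base && addr <= ports[i].mem_limit) {
            return (int)i;
        }
    }
    return -1;
}
```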

PCIe is a packet-based protocol.

For the PCI root port, see: PCI EXPRESS GUIDELINES https://raw.githubusercontent.com/qemu/qemu/master/docs/pcie.txt

Root Bus (pcie.0)
=====================
Place only the following kinds of devices directly on the Root Complex:

PCI Devices (e.g. network card, graphics card, IDE controller), not controllers. Place only legacy PCI devices on the Root Complex. These will be considered Integrated Endpoints. Note: Integrated Endpoints are not hot-pluggable. Although the PCI Express spec does not forbid PCI Express devices as Integrated Endpoints, existing hardware mostly integrates legacy PCI devices with the Root Complex. Guest OSes are suspected to behave strangely when PCI Express devices are integrated with the Root Complex.

PCI Express Root Ports (pcie-root-port), for starting exclusively PCI Express hierarchies.

PCI Express to PCI Bridge (pcie-pci-bridge), for starting legacy PCI hierarchies.

Extra Root Complexes (pxb-pcie), if multiple PCI Express Root Buses are needed.

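What we learn from this is that PCIe devices are not attached directly to the Root Complex, i.e. the pcie.0 bus, but to a PCIe root port. The specification does not forbid it; it is more of a historical convention. The QEMU command line shown earlier follows the same pattern.

(To Be Continued)

See also: PCIe (Part 1): Basic Concepts and Device Tree | Soul Orbit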

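DE.2 Network devices

DE.2.1 Configuration parameters

For the network device configuration parameters, see:

QEMU's new -nic command line option - QEMU https://www.qemu.org/2018/05/31/nic-parameter/ It contains a very detailed description; part of it is excerpted here:

The original option for configuring network devices was -net, for example: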

-net nic,model=e1000 -net user

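The -net option connects its net client to a hub, so the front end and back end above are connected and forward traffic to each other. This approach has a problem, however: in the following configuration all four clients end up on the same hub, so traffic from either NIC reaches both back ends: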

-net nic,model=e1000 -net user -net nic,model=virtio -net tap

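This can be solved with the vlan parameter, as follows:

-net nic,model=e1000,vlan=0 -net user,vlan=0 -net nic,model=virtio,vlan=1 -net tap,vlan=1

The vlan here has nothing to do with IEEE 802.1Q, which led to many misconfigurations, so it was removed.

Once the more sensible -netdev option was introduced, the problem above no longer existed: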

-netdev user,id=n1 -device e1000,netdev=n1 -netdev tap,id=n2 -device virtio-net,netdev=n2

Later the -nic option was added as well. Compared with the options above, it:

is easier to use (and shorter to type) than -netdev ,id= -device ,netdev=

can be used to configure on-board / non-pluggable NICs, too

does not place a hub between the NIC and the host back-end

instead of -netdev tap,id=n1 -device e1000,netdev=n1, you can simply type -nic tap,model=e1000.

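DE.2.2 The netdev parameter

This subsection looks at how netdev takes effect in the code. Note first that netdev appears in two places:

-netdev tap,id=n1

-device e1000,netdev=n1

-netdev creates the network back end and names it through id;
netdev=, as a sub-parameter of -device, specifies the back end of that network front end.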

qemu_init()
  -> qemu_create_late_backends()
	-> net_init_clients()
	   ---
	    qemu_opts_foreach(qemu_find_opts("netdev"), net_init_netdev, NULL, errp);

	    qemu_opts_foreach(qemu_find_opts("nic"), net_param_nic, NULL, errp);

	    qemu_opts_foreach(qemu_find_opts("net"), net_init_client, NULL, errp);
	   ---
  -> qmp_x_exit_preconfig()
	-> qemu_create_cli_devices()
      -> qemu_opts_foreach(qemu_find_opts("device"),
                      device_init_func, NULL, &error_fatal);
         -> device_init_func()
	       -> qdev_device_add()
	         -> qdev_device_add_from_qdict()
	           -> qdev_new()
               -> object_set_properties_from_keyval(&dev->parent_obj, dev->opts, from_json, errp)
	           -> qdev_realize()

net_init_clients() iterates over all -netdev options and initializes every network back end, creating a net client for each; see the tap code:

net_init_netdev()
  -> net_client_init()
	-> net_client_init1()
	  -> net_client_init_fun[type]()
	    -> net_init_tap()
	      -> net_init_tap_one()
	        -> net_tap_fd_init()
	          -> qemu_new_net_client() //net_tap_info

qemu_create_cli_devices() then creates all devices from the -device options, network devices included; see the code:

static Property rtl8139_properties[] = {
    DEFINE_NIC_PROPERTIES(RTL8139State, conf),
    DEFINE_PROP_END_OF_LIST(),
};

#define DEFINE_NIC_PROPERTIES(_state, _conf)                            \
    DEFINE_PROP_MACADDR("mac",   _state, _conf.macaddr),                \
    DEFINE_PROP_NETDEV("netdev", _state, _conf.peers)

#define DEFINE_PROP_NETDEV(_n, _s, _f)             \
    DEFINE_PROP(_n, _s, _f, qdev_prop_netdev, NICPeers)

const PropertyInfo qdev_prop_netdev = {
    .name  = "str",
    .description = "ID of a netdev to use as a backend",
    .get   = get_netdev,
    .set   = set_netdev,
};

set_netdev()
---
    queues = qemu_find_net_clients_except(str, peers,
                                          NET_CLIENT_DRIVER_NIC,
                                          MAX_QUEUE_NUM)
	...

    for (i = 0; i < queues; i++) {
		...
        ncs[i] = peers[i];
        ncs[i]->queue_index = i;
    }
---

pci_rtl8139_realize()
---
    s->nic = qemu_new_nic(&net_rtl8139_info, &s->conf,
                          object_get_typename(OBJECT(dev)), d->id, s);
---
    ---
    NetClientState **peers = conf->peers.ncs;
	...
	
    for (i = 0; i < queues; i++) {
        qemu_net_client_setup(&nic->ncs[i], info, peers[i], model, name,
                              NULL, true);
        nic->ncs[i].queue_index = i;
    }
	---

The conf->peers.ncs used by pci_rtl8139_realize() comes from set_netdev(), and by the time set_netdev() runs the -netdev options have already been initialized. qemu_net_client_setup() makes the rtl8139 and tap net clients peers of each other:

qemu_net_client_setup()
---
    if (peer) {
        assert(!peer->peer);
        nc->peer = peer;
        peer->peer = nc;
    }
---

DE.2.3 NetQueue

NetQueue is a simple data structure that connects the network front end and back end; see the figure below:

The key code paths in the figure are:

net_init_tap()
  -> net_init_tap_one()
	-> qemu_new_net_client() // net_tap_info
	  -> qemu_net_client_setup()
	     ---
		    if (peer) {
   		 	    nc->peer = peer;
   	   	 		peer->peer = nc;
	   		}
    		nc->incoming_queue = qemu_new_net_queue(qemu_deliver_packet_iov, nc);
		 ---
	    -> nc->incoming_queue = qemu_new_net_queue(qemu_deliver_packet_iov, nc);
    -> tap_read_poll()
	  -> tap_update_fd_handler() // read_poll == tap_send, write_poll == tap_writable
	    -> qemu_set_fd_handler()


tap_send()
  -> qemu_send_packet_async()
	-> qemu_send_packet_async_with_flags()
	   ---
	    queue = sender->peer->incoming_queue;

	    return qemu_net_queue_send(queue, sender, flags, buf, size, sent_cb);
	   ---
	    -> qemu_net_queue_deliver()
	       qemu_deliver_packet_iov()
	         -> nc->info->receive()
	            rtl8139_receive()
	              -> rtl8139_do_receive()


rtl8139_io_writel()
  -> rtl8139_TxStatus_write()
	-> rtl8139_transmit()
	  -> rtl8139_transmit_one()
	    -> rtl8139_transfer_frame()
	      -> qemu_send_packet()
	        -> qemu_sendv_packet_async()
	          -> qemu_net_queue_send_iov()
	            -> qemu_net_queue_deliver_iov()
                  -> tap_receive()
                    -> tap_write_packet() // write to fd of tap directly

NetQueue has two main functions:

qemu_net_queue_receive(), mainly used for loopback;

qemu_send_packet_async_with_flags(), used to send a packet to the peer; see the code:

qemu_send_packet_async_with_flags()
---
    queue = sender->peer->incoming_queue;

    return qemu_net_queue_send(queue, sender, flags, buf, size, sent_cb);
---
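The peer linkage and the "send into the peer's incoming queue" rule can be sketched together. All names below (struct net_client, struct net_queue, client_setup(), send_packet()) are hypothetical miniatures; the real NetQueue additionally buffers packets when the receiver cannot take them, which is omitted here.

```c
#include <assert.h>
#include <stddef.h>

struct net_client;

/* Hypothetical miniature of NetQueue: it only remembers its owner. */
struct net_queue {
    struct net_client *owner;
};

struct net_client {
    const char *name;
    struct net_client *peer;
    struct net_queue incoming_queue;
    size_t rx_bytes;              /* bytes delivered to this client */
};

/* Initialize a client and, if a peer is given, link the two both ways. */
void client_setup(struct net_client *nc, const char *name,
                  struct net_client *peer)
{
    nc->name = name;
    nc->peer = NULL;
    nc->rx_bytes = 0;
    nc->incoming_queue.owner = nc;
    if (peer) {
        assert(!peer->peer);
        nc->peer = peer;
        peer->peer = nc;
    }
}

/* A sender never touches its own queue: it delivers into the
 * incoming queue of its peer. */
size_t send_packet(struct net_client *sender, const void *buf, size_t size)
{
    struct net_queue *queue;

    (void)buf;                    /* payload is not inspected in this sketch */
    if (!sender->peer) {
        return size;              /* assumption: without a peer the packet is dropped */
    }
    queue = &sender->peer->incoming_queue;
    queue->owner->rx_bytes += size;
    return size;
}
```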

DE.2.4 Backend

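The most common back-end combination in production is tap + vhost. tap is a virtual layer-2 (Ethernet) network device provided by the Linux TUN/TAP driver; traffic can be injected into or read from the tap NIC through a char device (a socket in the kernel). Unlike a physical NIC, where data goes to or comes from a PCIe-attached card and its cable, a tap device's data comes from user space or from vhost, as shown below:

Between the tap device and the network protocol stack there may be a layer of stacked netdevs.

I do not know the formal name of this layer. It can be compared with storage, where device stacking also exists; the most common examples are the device mapper that LVM depends on, and md-raid. As virtual storage devices they copy, split, or redirect bios before sending them on to the underlying storage devices. The same happens in the network software stack, the most typical example being the bridge.

The tap fd is passed in through the QEMU command line, for example: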

-netdev tap,fd=27,id=hostnet0

-device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:b9:a7:5d,bus=pci.0,addr=0x3

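This fd is created by libvirtd; libvirtd forks the qemu child process, which inherits the fd.

A parent can pass an fd to a child simply through inheritance, but how does a child pass an fd to its parent? The answer is a UNIX socket: its sendmsg/recvmsg can transfer fds between processes.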
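A minimal sketch of fd passing with SCM_RIGHTS follows. send_fd() and recv_fd() are hypothetical helper names; the calls themselves (sendmsg, recvmsg, the CMSG_* macros) are the standard POSIX/Linux interfaces. For brevity the usage further down runs both ends in a single process over a socketpair.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one descriptor over a UNIX socket. One dummy payload byte is
 * required because ancillary data cannot travel on its own. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;       /* forces correct alignment */
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd number or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file, so writing through it is visible to whoever holds the original.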

For the vhost part, see: KVM/QEMU IO virtualization, jianchwa's blog on CSDN https://blog.csdn.net/home19900111/article/details/128610752?spm=1001.2014.3001.5501

For what happens to traffic after it leaves the tap or before it enters the tap, see these two links:

KVM Virtual Networking Concepts - NovaOrdis Knowledge Base https://kb.novaordis.com/index.php/KVM_Virtual_Networking_Concepts
What is OpenVSwitch https://nsrc.org/workshops/2014/nznog-sdn/raw-attachment/wiki/Agenda/OpenVSwitch.pdf

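DE.3 Storage devices

DE.3.1 Configuration parameters

For the storage device configuration parameters, see the document below. This subsection is based on the QEMU 7.0.0 code.
QEMU Block Layer Concepts & Features https://vmsplice.net/~stefan/qemu-block-layer-features-and-concepts.pdf

As with network devices, storage configuration parameters come in two types:

-device configures the front end and links to the back end through drive=xxx;
-blockdev configures the back end and defines its own name through node-name so that the front end can reference it.

What is special about -blockdev is that nodes can be stacked: each one points to the next layer through the 'file' parameter. In the figure above, for example, qcow2 describes the image format, while file and rbd are what actually hold the image. Another layer can be added: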

DE.3.2 BlockBackend

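Three data structures are involved with BlockBackend:

BlockBackend, referenced directly by the storage front end;
BlockDriver, for example raw, qcow2, file, rbd, and nbd are all BlockDrivers;
BlockDriverState, an instance of a BlockDriver.

Consider the following configuration parameters: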

-blockdev {"driver":"file","filename":"/vm/will1_sda.qcow2","node-name":"libvirt-2-storage","auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-2-format","read-only":false,"driver":"qcow2","file":"libvirt-2-storage","backing":null}
-device {"driver":"virtio-blk-pci","bus":"pci.5","addr":"0x0","drive":"libvirt-2-format","id":"virtio-disk0","bootindex":1}

The resulting relationship between the data structures is:

The stacking of BlockDriverStates shows up at creation time as follows:

configure_blockdev()
  -> qmp_blockdev_add()
	-> bds_tree_init()
	  -> bdrv_open()
	    -> bdrv_open_inherit()


bdrv_open_inherit()
---
    bs = bdrv_new();
	...
    ret = bdrv_fill_options(&options, filename, &flags, &local_err);
       ---
	    drvname = qdict_get_try_str(*options, "driver");
	    if (drvname) {
	        drv = bdrv_find_format(drvname); // iterate bdrv_drivers
			...
	        protocol = drv->bdrv_file_open;
	    }

	    if (protocol) {
	        *flags |= BDRV_O_PROTOCOL;
	    } else {
	        *flags &= ~BDRV_O_PROTOCOL;
	    }
	   ---
	...
	if ((flags & BDRV_O_PROTOCOL) == 0) {
        BlockDriverState *file_bs;

        file_bs = bdrv_open_child_bs(filename, options, "file", bs,
                                     &child_of_bds, BDRV_CHILD_IMAGE,
                                     true, &local_err);
		    ---
		    reference = qdict_get_try_str(options, bdref_key); //bdref_key is "file"
		    ...
		    bs = bdrv_open_inherit(filename, reference, image_options, 0,
        		                   parent, child_class, child_role, errp);
			---
		...
    }
---

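Whether the nested call continues depends on whether the BlockDriver has a bdrv_file_open callback: a driver without it is not a protocol driver, so its "file" child is opened recursively. Judging from the code, the drivers that do have this callback are the low-level storage drivers such as rbd, gluster, file-posix, nbd, and iscsi.

The stacking relationship itself is established in the BlockDriver's bdrv_open, for example: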
raw_open()
---
    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
                               file_role, false, errp);
---

qcow2_open()
---

    bs->file = bdrv_open_child(NULL, options, "file", bs, &child_of_bds,
                               BDRV_CHILD_IMAGE, false, errp);
---

This stacking shows up in the IO path as follows:

blk_aio_preadv() // BlockBackend
  -> blk_aio_read_entry()
	-> blk_co_do_preadv()
	   ---
		    bdrv_inc_in_flight(bs);
			...
		    ret = bdrv_co_preadv(blk->root, offset, bytes, qiov, flags);
			                     ^^^^^^^^^^
		    bdrv_dec_in_flight(bs);
	   ---

The raw here is the raw format driver; its code is fairly simple:
raw_co_preadv()
---
    ret = raw_adjust_offset(bs, &offset, bytes, false);
	...
    return bdrv_co_preadv(bs->file, offset, bytes, qiov, flags);
	                      ^^^^^^^^
---

The raw here is the raw of the file storage driver:
bdrv_file.bdrv_co_preadv()
	      raw_co_preadv()
	        -> raw_co_prw()
	          -> laio_co_submit()

The storage front end is connected to the back end during parameter configuration, through the drive parameter of -device:

set_drive()
  -> set_drive_helper()
	 ---
    blk = blk_by_name(str);
    if (!blk) {
        bs = bdrv_lookup_bs(NULL, str, NULL);
        if (bs) {
			...
            ctx = iothread ? bdrv_get_aio_context(bs) : qemu_get_aio_context();
            blk = blk_new(ctx, 0, BLK_PERM_ALL);
            blk_created = true;

            ret = blk_insert_bs(blk, bs, errp);
			...
        }
    }
	...
    *ptr = blk;
	 ---

virtio_blk_device_realize()
---
    s->blk = conf->conf.blk;
---

Migration

MG.0 Overview

This section draws on the following links:

Live Migrating QEMU-KVM Virtual Machines | Red Hat Developer https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines#vmstate_example__before_
Introduction To Kvm Migration | Linux Technical Blogs https://balamuruhans.github.io/2018/11/13/introduction-to-kvm-migration.html

Purposes of migration:

Load balancing – If host gets overloaded, guests can be moved to other host which are not utilized much
Energy saving – guests can be moved from multiple host to single host and power off hosts which are not used
Maintenance – If host have to be shut-off for maintenance or to upgrade etc.,

Types of migration:

Live migration – guest remains to in running state in source, guest is booted and remains to be in paused state in destination host while migration is in progress. Once the migration is completed guest starts its execution from destination host without downtime.
Online migration – guest remains in paused state in source and in destination host, once migration completes it resumes in destination host. It takes comparatively less time to migrate as the there is no memory dirtying during migration as guest remains to be in paused state but there will be downtime equal to the migration time.
Offline migration – guest can be in running or in shutoff state in source, offline migration would just define a guest in destination in the destination host and it would remain in shutoff state.

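This section focuses mainly on the live migration flow.

Before diving into the code, we first need to be clear about which information of a virtual machine has to be migrated.

A virtual machine's information falls broadly into two kinds:

Configuration information, i.e. information that can be initialized from the configuration parameters, such as the number of vcpus, the amount of memory, and the disk capacity. This does not need to be migrated; or rather, it is enough to send the configuration xml over as a whole;
Runtime information, i.e. information produced by the virtual machine's components while they run.

What does the runtime information consist of? In other words, which components of a computer keep changing while it runs?

Memory, which can be further divided into:
- memory used by peripherals; some peripherals need the OS to allocate memory for them, such as a NIC's ring buffers or NVMe's sq/cq;
- memory used by the CPU;
Registers; every cpu has a set of registers that hold the running state of the program;
Internal runtime state of peripherals; besides accessing the memory the OS gave them, peripherals also keep internal state.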

MG.1 VMState

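vmstate is the auxiliary structure behind virtual machine migration; it tells QEMU what information needs to be migrated and how. Two interfaces currently exist:

vmsd, which most devices use: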

static const VMStateDescription vmstate_virtio_blk = {
    .name = "virtio-blk",
    .minimum_version_id = 2,
    .version_id = 2,
    .fields = (VMStateField[]) {
        VMSTATE_VIRTIO_DEVICE,
        VMSTATE_END_OF_LIST()
    },
};

#define VMSTATE_VIRTIO_DEVICE \
    {                                         \
        .name = "virtio",                     \
        .info = &virtio_vmstate_info,         \
        .flags = VMS_SINGLE,                  \
    }

const VMStateInfo  virtio_vmstate_info = {
    .name = "virtio",
    .get = virtio_device_get,
    .put = virtio_device_put,
};


virtio_blk_class_init()
---
    dc->vmsd = &vmstate_virtio_blk;
---

device_set_realized()
  -> vmstate_register_with_alias_id() // qdev_get_vmsd()
	 ---
	    se = g_new0(SaveStateEntry, 1);
		...
   		se->opaque = opaque;
	    se->vmsd = vmsd;
	    se->alias_id = alias_id;
		...
	    savevm_state_handler_insert(se);
	 ---

vmstate_save()
  -> vmstate_save_state()
	-> vmstate_save_state_v()
	   ---
 	 while (field->name) {
        if ((field->field_exists && field->field_exists(opaque, version_id)) ||
            (!field->field_exists && field->version_id <= version_id)) {
            void *first_elem = opaque + field->offset;
            int i, n_elems = vmstate_n_elems(opaque, field);
            int size = vmstate_size(opaque, field);
			...
            for (i = 0; i < n_elems; i++) {
                void *curr_elem = first_elem + size * i;

                vmsd_desc_field_start(vmsd, vmdesc_loop, field, i, n_elems);

                if (!curr_elem && size) {
					...
                } else if (field->flags & VMS_STRUCT) {
                    ret = vmstate_save_state(f, field->vmsd, curr_elem, vmdesc_loop);
                } else if (field->flags & VMS_VSTRUCT) {
                    ret = vmstate_save_state_v(f, field->vmsd, curr_elem, vmdesc_loop, field->struct_version_id);
                } else {
                    ret = field->info->put(f, curr_elem, size, field, vmdesc_loop);
                }
				...
                written_bytes = qemu_ftell_fast(f) - old_offset;

                vmsd_desc_field_end(vmsd, vmdesc_loop, field, written_bytes, i);
				...
            }
        }
		...
        field++;
    }

	   ---
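The core idea of that loop, a table of field descriptors walked with opaque + field->offset, can be sketched in a few lines. Everything here is hypothetical (struct field, struct dev_state, the FIELD macro, save_state(), load_state()); versions, arrays, nested structs, and endianness handling of the real VMState code are all left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One descriptor per migrated member; a NULL name ends the table. */
struct field {
    const char *name;
    size_t offset;
    size_t size;
};

/* Hypothetical device state. */
struct dev_state {
    uint32_t status;
    uint16_t queue_sel;
    uint8_t isr;
};

#define FIELD(type, member) \
    { #member, offsetof(type, member), sizeof(((type *)0)->member) }

const struct field dev_fields[] = {
    FIELD(struct dev_state, status),
    FIELD(struct dev_state, queue_sel),
    FIELD(struct dev_state, isr),
    { NULL, 0, 0 },
};

/* Walk the table and copy each member into the stream buffer.
 * Returns the number of bytes written, or 0 if the buffer is too small. */
size_t save_state(const struct field *f, const void *opaque,
                  unsigned char *out, size_t cap)
{
    size_t written = 0;

    for (; f->name; f++) {
        if (written + f->size > cap) {
            return 0;
        }
        memcpy(out + written, (const char *)opaque + f->offset, f->size);
        written += f->size;
    }
    return written;
}

/* Mirror image on the destination side. Returns bytes consumed, 0 on error. */
size_t load_state(const struct field *f, void *opaque,
                  const unsigned char *in, size_t len)
{
    size_t consumed = 0;

    for (; f->name; f++) {
        if (consumed + f->size > len) {
            return 0;
        }
        memcpy((char *)opaque + f->offset, in + consumed, f->size);
        consumed += f->size;
    }
    return consumed;
}
```

Because source and destination walk the same table, the device code never has to write its own serializer, which is the attraction of the vmsd style.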

old style, typically used by ram; see the code:

ram_mig_init()
  -> register_savevm_live("ram", 0, 4, &savevm_ram_handlers, &ram_state);
     ---
		se = g_new0(SaveStateEntry, 1);
		...
	    se->ops = ops;
	    se->opaque = opaque;
	    se->vmsd = NULL;
		...
        savevm_state_handler_insert(se);
	 ---

static SaveVMHandlers savevm_ram_handlers = {
    .save_setup = ram_save_setup,
    .save_live_iterate = ram_save_iterate,
    ...
    .load_state = ram_load,
    .save_cleanup = ram_save_cleanup,
    .load_setup = ram_load_setup,
    .load_cleanup = ram_load_cleanup,
    .resume_prepare = ram_resume_prepare,

};

vmstate_save()
  -> vmstate_save_old_style() // se->vmsd is NULL
	-> se->ops->save_state(f, se->opaque);

ram uses this kind of vmstate because copying memory during migration takes multiple rounds, which is what save_live_iterate is for.

MG.2 Flow overview

The sender-side code flow is roughly as follows:

Migration starts from the qmp migrate command, whose handler is qmp_migrate():

qmp_migrate()
  -> fd_start_outgoing_migration()
	-> migration_channel_connect()
	  -> migrate_fd_connect()
	     ---
			qemu_thread_create(&s->thread, "live_migration",
            			    migration_thread, s, QEMU_THREAD_JOINABLE);
		 ---

migration_thread()
---
    qemu_savevm_state_setup(s->to_dst_file);
	...
    while (migration_is_active(s)) {
        if (urgent || !qemu_file_rate_limit(s->to_dst_file)) {
            MigIterateState iter_state = migration_iteration_run(s);
            if (iter_state == MIG_ITERATE_SKIP) {
                continue;
            } else if (iter_state == MIG_ITERATE_BREAK) {
                break;
            }
        }
        ...

        urgent = migration_rate_limit();
    }
---

migration_iteration_run()
---
 	qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
                              &pend_compat, &pend_post);
    pending_size = pend_pre + pend_compat + pend_post;

    if (pending_size && pending_size >= s->threshold_size) {
		...
        qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
    } else {
		migration_completion(s);
		return MIG_ITERATE_BREAK;
	}
---

All memory has to be copied first:

qemu_savevm_state_setup() marks all RAM as dirty, see ram_save_setup();
qemu_savevm_state_iterate() then keeps sending dirty RAM, see ram_save_iterate();
Once sending finishes, or some condition forces it to end, migration_completion() is entered.

migration_completion()
  -> vm_stop_force_state()
	-> vm_stop()
	  -> do_vm_stop()
	    -> pause_all_vcpus()
	       ---
		    CPU_FOREACH(cpu) {
	        	if (qemu_cpu_is_self(cpu)) {
	            	qemu_cpu_stop(cpu, true);
	        	} else {
	            	cpu->stop = true;
	       	    	qemu_cpu_kick(cpu);
   	 	    	}
				...
			    while (!all_vcpus_paused()) {
			        qemu_cond_wait(&qemu_pause_cond, &qemu_global_mutex);
			        CPU_FOREACH(cpu) {
				        qemu_cpu_kick(cpu);
						  -> cpus_kick_thread()
							  -> pthread_kill(cpu->thread->thread, SIG_IPI)
			        }
	    		}
    		}
		   ---

        -> bdrv_drain_all()
	      -> bdrv_drain_all_begin()
	         ---
			   /* Now poll the in-flight requests */
  			  AIO_WAIT_WHILE(NULL, bdrv_drain_all_poll());
			 ---
	    -> bdrv_flush_all()

  -> qemu_savevm_state_complete_precopy()
	-> cpu_synchronize_all_states()
	  -> cpu_synchronize_state()
	     kvm_cpu_synchronize_state()
	       -> kvm_arch_get_registers() // cpu->vcpu_dirty = true, means need to put cpu registers
	          ---
			    ret = kvm_get_vcpu_events(cpu);
			    ret = kvm_get_mp_state(cpu);
			    ret = kvm_getput_regs(cpu, 0);
			    ret = kvm_get_xsave(cpu);
			    ret = kvm_get_xcrs(cpu);
			    ret = has_sregs2 ? kvm_get_sregs2(cpu) : kvm_get_sregs(cpu);
			    ret = kvm_get_msrs(cpu);
			    ret = kvm_get_apic(cpu);
    		  ---
    -> qemu_savevm_state_complete_precopy_iterable()
        ---
        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
            if (!se->ops ||
                (in_postcopy && se->ops->has_postcopy &&
                 se->ops->has_postcopy(se->opaque)) ||
                 !se->ops->save_live_complete_precopy) {
                 continue;
             }
	     	...
            ret = se->ops->save_live_complete_precopy(f, se->opaque);
		    ...
  	    }
       ---
    -> qemu_savevm_state_complete_precopy_non_iterable()
	   ---
		QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
			if ((!se->ops || !se->ops->save_state) && !se->vmsd) {
				continue;
			}
			...
        	ret = vmstate_save(f, se, vmdesc);
			...
    	}
	   ---

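As the code shows, migration_completion() consists of four main steps:

Stop all vcpu threads;
Wait for in-flight IO to complete:
- since the vcpu threads are stopped, no new IO is issued;
- when IO that was already submitted completes, an interrupt is raised to the vcpu, and the interrupt information is recorded;
Save the register state, which includes the interrupt information from the previous step;
Save the runtime state of all devices, in two steps:
- those with a save_live_complete_precopy() callback, mainly RAM;
- those with a vmsd, i.e. the various devices.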

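MG.3 Memory migration

MG.3.1 Sending flow

The sender side of memory migration roughly works as follows:

In the first round, all memory is copied;
From the second round on, only dirty pages are copied;
Once the number of dirty pages has converged, the vcpus are stopped and a final round of copying is performed.

For the condition for entering the final round, see the code: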


migration_iteration_run()
---
    qemu_savevm_state_pending(s->to_dst_file, s->threshold_size, &pend_pre,
                              &pend_compat, &pend_post);
    pending_size = pend_pre + pend_compat + pend_post;
	...
    if (pending_size && pending_size >= s->threshold_size) {
		...
        qemu_savevm_state_iterate(s->to_dst_file, in_postcopy);
    } else {
        migration_completion(s);
        return MIG_ITERATE_BREAK;
    }
---

pending_size comes from migration_dirty_pages; for how that counter grows and shrinks, see the code:

migration_iteration_run()
  -> qemu_savevm_state_pending()
	-> .save_live_pending()
	   ram_save_pending()
	     -> migration_bitmap_sync_precopy()
	       -> migration_bitmap_sync()
             -> ramblock_sync_dirty_bitmap()
	            ---
				    uint64_t new_dirty_pages =
				        cpu_physical_memory_sync_dirty_bitmap(rb, 0, rb->used_length);

  			        rs->migration_dirty_pages += new_dirty_pages;
				---
  -> qemu_savevm_state_iterate()
    -> ram_save_iterate()
      -> ram_find_and_save_block()
        -> ram_save_host_page()
          -> ram_save_target_page()
            -> migration_bitmap_clear_dirty()
	       ---
			ret = test_and_clear_bit(page, rb->bmap);
		    if (ret) {
		        rs->migration_dirty_pages--;
				    ^^^^^^^^^^^^^^^^^^^^^^^^
    		}
		   ---
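The two directions of the counter can be sketched on a plain bitmap. struct ram_state, bitmap_sync() and clear_dirty() below are hypothetical miniatures: syncing counts only bits that go from 0 to 1, and sending a page clears its bit and decrements the counter.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical miniature of the per-RAMBlock migration bitmap. */
struct ram_state {
    unsigned long *bmap;     /* one bit per page */
    size_t nr_pages;
    uint64_t dirty_pages;    /* number of bits currently set in bmap */
};

/* Merge a freshly fetched dirty log into bmap, counting only the
 * pages that were not already marked dirty. */
void bitmap_sync(struct ram_state *rs, const unsigned long *new_dirty)
{
    size_t words = (rs->nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;

    for (size_t i = 0; i < words; i++) {
        unsigned long fresh = new_dirty[i] & ~rs->bmap[i];

        rs->bmap[i] |= fresh;
        while (fresh) {              /* count the newly set bits */
            fresh &= fresh - 1;
            rs->dirty_pages++;
        }
    }
}

/* Called when a page is about to be sent. Returns 1 if the page was
 * dirty (and therefore has to be sent), 0 otherwise. */
int clear_dirty(struct ram_state *rs, size_t page)
{
    unsigned long mask = 1UL << (page % BITS_PER_WORD);
    unsigned long *word = &rs->bmap[page / BITS_PER_WORD];

    if (!(*word & mask)) {
        return 0;
    }
    *word &= ~mask;
    rs->dirty_pages--;
    return 1;
}
```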

threshold_size comes from the function:

migration_rate_limit()
  -> migration_update_counters()
	 ---
	    current_bytes = migration_total_bytes(s);
	    transferred = current_bytes - s->iteration_initial_bytes;
	    time_spent = current_time - s->iteration_start_time;
	    bandwidth = (double)transferred / time_spent;
	    s->threshold_size = bandwidth * s->parameters.downtime_limit;
	 ---

It is computed from the estimated transfer bandwidth and the user-configured downtime_limit.
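A worked example makes the convergence test concrete. The helper names below are hypothetical; the arithmetic follows the excerpts above, with bandwidth in bytes per millisecond and downtime_limit in milliseconds (an assumption about units made for this sketch).

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per millisecond observed during the last iteration. */
double estimate_bandwidth(uint64_t transferred, uint64_t time_spent_ms)
{
    return time_spent_ms ? (double)transferred / (double)time_spent_ms : 0.0;
}

/* How many bytes can still be sent within the allowed downtime. */
uint64_t compute_threshold(double bandwidth, uint64_t downtime_limit_ms)
{
    return (uint64_t)(bandwidth * (double)downtime_limit_ms);
}

/* Mirrors the branch in migration_iteration_run(): keep iterating while
 * the pending data is non-zero and at least threshold_size. */
int should_complete(uint64_t pending_size, uint64_t threshold_size)
{
    return !(pending_size && pending_size >= threshold_size);
}
```

With 100,000,000 bytes sent in 1000 ms and a 300 ms downtime limit, the threshold is 30,000,000 bytes: 40 MB of pending dirty memory means another iteration, 20 MB means the guest can be stopped for the final round.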

MG.3.2 Dirty Pages

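The above covers the rough steps on the sender side, but one key step is still missing: how the dirty pages are obtained. There are two key points:

How dirty pages are tracked;
How the dirty page information is passed to QEMU.

As in the figure above,

Dirty page tracking uses two mechanisms:

Page Modification Log, a mechanism provided by Intel VMX;
Page Write Protection, a software mechanism that tracks writes by clearing the writable bit of the EPT sptes.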

MG.3.2.1 Tracking

For the PML mechanism, see the Intel manual, 28.2.5 Page-Modification Logging; it works as follows:

When accessed and dirty flags for EPT are enabled, software can track writes to guest-physical addresses using a feature called page-modification logging.

Before allowing a guest-physical access, the processor may determine that it first needs to set an accessed or dirty flag for EPT (see Section 28.2.4). When this happens, the processor examines the PML index. If the PML index is not in the range 0–511, there is a page-modification log-full event and a VM exit occurs. In this case, the accessed or dirty flag is not set, and the guest-physical access that triggered the event does not occur.

If instead the PML index is in the range 0–511, the processor proceeds to update accessed or dirty flags for EPT as described in Section 28.2.4. If the processor updated a dirty flag for EPT (changing it from 0 to 1), it then operates as follows:

The guest-physical address of the access is written to the page-modification log. Specifically, the guest-physical address is written to physical address determined by adding 8 times the PML index to the PML address. Bits 11:0 of the value written are always 0 (the guest-physical address written is thus 4-KByte aligned).

The PML index is decremented by 1 (this may cause the value to transition from 0 to FFFFH)

The related kernel code for PML is as follows:

__vmx_handle_exit()
---
	/*
	 * Flush logged GPAs PML buffer, this will make dirty_bitmap more
	 * updated. Another good is, in kvm_vm_ioctl_get_dirty_log, before
	 * querying dirty_bitmap, we only need to kick all vcpus out of guest
	 * mode as if vcpus is in root mode, the PML buffer must has been
	 * flushed already.  Note, PML is never enabled in hardware while
	 * running L2.
	 */
	if (enable_pml && !is_guest_mode(vcpu))
		vmx_flush_pml_buffer(vcpu);
---

handle_pml_full()
---
	/*
	 * PML buffer already flushed at beginning of VMEXIT. Nothing to do
	 * here.., and there's no userspace involvement needed for PML.
	 */
---

vmx_flush_pml_buffer()
---
	pml_idx = vmcs_read16(GUEST_PML_INDEX);

	/* Do nothing if PML buffer is empty */
	if (pml_idx == (PML_ENTITY_NUM - 1))
		return;

	/* PML index always points to next available PML buffer entity */
	if (pml_idx >= PML_ENTITY_NUM)
		pml_idx = 0;
	else
		pml_idx++;

	pml_buf = page_address(vmx->pml_pg);
	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
		u64 gpa;
		gpa = pml_buf[pml_idx];
		WARN_ON(gpa & (PAGE_SIZE - 1));
		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
	}

	/* reset PML index */
	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
 ---

Finally, it reports the dirty pages through kvm_vcpu_mark_page_dirty().
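The interplay between the hardware-side logging and the flush routine can be modeled in plain C. The sketch below is a user-space simulation only, assuming a 512-entry buffer and 4 KiB pages as in the quoted manual text; `struct pml_model`, `pml_log_write` and `pml_flush` are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PML_ENTRIES 512

/* Hypothetical user-space model of the PML buffer and its index. */
struct pml_model {
    uint64_t buf[PML_ENTRIES];
    uint16_t index; /* starts at 511 and counts down, like GUEST_PML_INDEX */
};

void pml_reset(struct pml_model *p)
{
    p->index = PML_ENTRIES - 1;
}

/*
 * Models the processor side: log one dirtied GPA.
 * Returns false when the index is outside 0..511 ("log full"); in that
 * case nothing is logged, mirroring the VM exit described in the manual.
 */
bool pml_log_write(struct pml_model *p, uint64_t gpa)
{
    if (p->index >= PML_ENTRIES) {
        return false;
    }
    p->buf[p->index] = gpa & ~0xfffULL; /* bits 11:0 are always 0 */
    p->index--;                         /* 0 wraps around to 0xFFFF */
    return true;
}

/*
 * Models the flush side: copy the gfn of every logged entry into gfns_out
 * (which must hold PML_ENTRIES items), reset the index, and return the
 * number of entries flushed.
 */
size_t pml_flush(struct pml_model *p, uint64_t *gfns_out)
{
    uint32_t idx = p->index;
    size_t n = 0;

    if (idx == PML_ENTRIES - 1) {
        return 0; /* buffer is empty */
    }
    /* index points at the next free slot, so valid data starts one above */
    idx = (idx >= PML_ENTRIES) ? 0 : idx + 1;
    for (; idx < PML_ENTRIES; idx++) {
        gfns_out[n++] = p->buf[idx] >> 12;
    }
    pml_reset(p);
    return n;
}
```

Because the index counts down, the most recently logged page is flushed first.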

If the Page Write Protection mechanism is used, the code is as follows:

direct_page_fault()
  -> fast_page_fault()
	-> fast_pf_fix_direct_spte()
	   ---
		if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
			return false;

		if (is_writable_pte(new_spte) && !is_writable_pte(old_spte)) {
			gfn = kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
			kvm_vcpu_mark_page_dirty(vcpu, gfn);
		}
	   ---
	     -> mark_page_dirty_in_slot()

Likewise, the final result ends up in mark_page_dirty_in_slot().

MG.3.2.2 Output

How is the dirty log delivered to QEMU in user space? Currently there are two ways:

Dirty Log Bitmap, QEMU's traditional approach;
Dirty Log Ring, the newly added approach.

For a comparison of the two, see:

KVM Dirty Ring Interface https://static.sched.com/hosted_files/kvmforum2020/97/kvm_dirty_ring_peter.pdf?_gl=1*1rs0obh*_ga*ODM3MDA3NzIwLjE2OTUwMDcyNDc.*_ga_XH5XM35VHB*MTY5NTAwNzI0Ny4xLjAuMTY5NTAwNzI0Ny42MC4wLjA.
See the comparison of Dirty Log Ring and Bitmap in it:

Performance data from the author's commit message: [PATCH RFC v3 00/11] KVM: Dirty ring support (QEMU part) - Peter Xu https://lore.kernel.org/qemu-devel/[email protected]/

I gave it a shot with a 24G guest, 8 vcpus, using 10g NIC as migration
channel.  When idle or dirty workload small, I don't observe major
difference on total migration time.  When with higher random dirty
workload (800MB/s dirty rate upon 20G memory, worse for kvm dirty
ring). Total migration time is (ping pong migrate for 6 times, in
seconds):

|-------------------------+---------------|
| dirty ring (4k entries) | dirty logging |
|-------------------------+---------------|
|                      70 |            58 |
|                      78 |            70 |
|                      72 |            48 |
|                      74 |            52 |
|                      83 |            49 |
|                      65 |            54 |
|-------------------------+---------------|

Summary:

dirty ring average:    73s
dirty logging average: 55s

The KVM dirty ring will be slower in above case.  The number may show
that the dirty logging is still preferred as a default value because
small/medium VMs are still major cases, and high dirty workload
happens frequently too.  And that's what this series did.

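It seems the dirty ring does not deliver much of a performance benefit...

From a design standpoint, comparing the dirty ring with the dirty bitmap:

The system call is replaced with mmap; however, KVM_GET_DIRTY_LOG is not executed at high frequency, so the gain here is not significant.
The dirty ring only needs to copy out the kvm_dirty_gfn entries (16 bytes each) that record dirty pages, whereas the bitmap approach must copy out the entire bitmap. For a large-memory VM with a low dirty rate the dirty ring is more efficient, yet this does not translate into a visible performance gain. If the dirty rate is high and the dirty ring overflows, its poor per-byte storage efficiency shows: the dirty ring needs 16 bytes per dirty page while the dirty bitmap needs only 1 bit, a 128x difference. A dirty ring overflow also forces the vcpu back to user space, which degrades performance.

From the above, we can see that the dirty ring does not bring a performance benefit.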
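The 16-bytes-versus-1-bit argument can be turned into a small back-of-the-envelope calculation. The sketch below assumes 4 KiB pages and a single flat bitmap for the whole guest (the kernel actually keeps one bitmap per memslot); all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL
#define DIRTY_GFN_BYTES 16ULL /* size of one kvm_dirty_gfn entry, per the text */

/* Bytes of bitmap needed to cover the guest: one bit per 4 KiB page. */
uint64_t bitmap_bytes(uint64_t guest_mem_bytes)
{
    uint64_t pages = guest_mem_bytes / PAGE_SIZE_BYTES;
    return (pages + 7) / 8;
}

/* Bytes of ring entries needed to report the given number of dirty pages. */
uint64_t ring_bytes(uint64_t dirty_pages)
{
    return dirty_pages * DIRTY_GFN_BYTES;
}

/* Number of dirty pages at which the ring costs as many bytes as the bitmap. */
uint64_t ring_break_even_pages(uint64_t guest_mem_bytes)
{
    return bitmap_bytes(guest_mem_bytes) / DIRTY_GFN_BYTES;
}
```

For the 24G guest in the quoted benchmark, the bitmap is 768 KiB and the break-even point is 49152 dirty pages (192 MiB of dirty memory) per sync; above that, the ring moves more bytes than the bitmap.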

However, the return to user space on dirty ring full provides an opportunity to introduce a mechanism that speeds up convergence; see the code:

kernel side:
------------
vcpu_enter_guest()
---
	/* Forbid vmenter if vcpu dirty ring is soft-full */
	if (unlikely(vcpu->kvm->dirty_ring_size &&
		     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
		vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
		r = 0;
		goto out;
	}
---

qemu side:
----------
kvm_cpu_exec()
---
		...
        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
		...
        case KVM_EXIT_DIRTY_RING_FULL:
            qemu_mutex_lock_iothread();
            if (dirtylimit_in_service()) {
                kvm_dirty_ring_reap(kvm_state, cpu);
            } else {
                kvm_dirty_ring_reap(kvm_state, NULL);
            }
            qemu_mutex_unlock_iothread();
            dirtylimit_vcpu_execute(cpu);
		...
---

dirtylimit_vcpu_execute()
---
   if (dirtylimit_in_service() &&
        dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled &&
        cpu->throttle_us_per_full) {
        usleep(cpu->throttle_us_per_full);
   }
---

MG.3.2.3 Dirty log QEMU code

Enabling the dirty log:

See the code:

ram_save_setup()
  -> ram_init_all()
	-> ram_init_bitmaps()
	  -> memory_global_dirty_log_start() //GLOBAL_DIRTY_MIGRATION
	     ---
	    global_dirty_tracking |= flags;

	    if (!old_flags) {
   	    	MEMORY_LISTENER_CALL_GLOBAL(log_global_start, Forward);
	        memory_region_transaction_begin();
	        memory_region_update_pending = true;
   	    	memory_region_transaction_commit();
    	}
		 ---

render_memory_region()
---
    fr.dirty_log_mask = memory_region_get_dirty_log_mask(mr);
---

address_space_update_topology_pass()
---
		if (frold && frnew && flatrange_equal(frold, frnew)) {
            /* In both and unchanged (except logging may have changed) */

            if (adding) {
                MEMORY_LISTENER_UPDATE_REGION(frnew, as, Forward, region_nop);
                if (frnew->dirty_log_mask & ~frold->dirty_log_mask) {
                    MEMORY_LISTENER_UPDATE_REGION(frnew, as, Forward, log_start,
                                                  frold->dirty_log_mask,
                                                  frnew->dirty_log_mask);
                }
                if (frold->dirty_log_mask & ~frnew->dirty_log_mask) {
                    MEMORY_LISTENER_UPDATE_REGION(frnew, as, Reverse, log_stop,
                                                  frold->dirty_log_mask,
                                                  frnew->dirty_log_mask);
                }
            }
            ++iold;
            ++inew;
       } 
---

kvm_log_start()
  -> kvm_section_update_flags()
	-> kvm_slot_update_flags()
	   ---
    	mem->flags = kvm_mem_flags(mr);
		   ---
    		if (memory_region_get_dirty_log_mask(mr) != 0) {
        		flags |= KVM_MEM_LOG_DIRTY_PAGES;
    		}
		   ---
		..
	    kvm_slot_init_dirty_bitmap(mem);
   		return kvm_set_user_memory_region(kml, mem, false);
	   ---

KVM_MEM_LOG_DIRTY_PAGES affects the memslot on the kernel side as follows:

KVM_SET_USER_MEMORY_REGION
kvm_vm_ioctl_set_memory_region()
  -> kvm_set_memory_region()
	-> __kvm_set_memory_region()
	  -> kvm_alloc_dirty_bitmap() // KVM_MEM_LOG_DIRTY_PAGES
	  -> kvm_set_memslot()
	    -> kvm_arch_commit_memory_region()
	      -> kvm_mmu_slot_apply_flags()
	         ---
				bool log_dirty_pages = new->flags & KVM_MEM_LOG_DIRTY_PAGES;
				...
				if (!log_dirty_pages) {
					...
				} else {
					...
					/*
					 * Size of the CPU's dirty log buffer, i.e. VMX's PML buffer.  A zero
					 * value indicates CPU dirty logging is unsupported or disabled.
					 */
					if (kvm_x86_ops.cpu_dirty_log_size) {
						kvm_mmu_slot_leaf_clear_dirty(kvm, new);
						kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_2M);
					} else {
						kvm_mmu_slot_remove_write_access(kvm, new, PG_LEVEL_4K);
					}
				}
			 ---

Reading the dirty log:

See the code:

migration_iteration_run()
  -> qemu_savevm_state_pending()
	-> .save_live_pending()
	   ram_save_pending()
	     -> migration_bitmap_sync_precopy()
	       -> migration_bitmap_sync()
	         -> memory_global_dirty_log_sync()
	           -> memory_region_sync_dirty_bitmap()
	             -> .log_sync()
	                kvm_log_sync()
	                  -> kvm_physical_sync_dirty_bitmap()
	                    -> kvm_slot_get_dirty_log()
                          -> ret = kvm_vm_ioctl(s, KVM_GET_DIRTY_LOG, &d)

The kernel-side code:

KVM_GET_DIRTY_LOG
kvm_vm_ioctl_get_dirty_log()
  -> kvm_get_dirty_log_protect()

1. Take a snapshot of the bit and clear it if needed.
2. Write protect the corresponding page.
3. Copy the snapshot to the userspace.
4. Upon return caller flushes TLB's if needed.

kvm_get_dirty_log_protect() 
---
	n = kvm_dirty_bitmap_bytes(memslot);
	flush = false;
	if (kvm->manual_dirty_log_protect) {
		...
	} else {
		...
		KVM_MMU_LOCK(kvm);
		for (i = 0; i < n / sizeof(long); i++) {
			unsigned long mask;
			gfn_t offset;

			if (!dirty_bitmap[i])
				continue;

			flush = true;
			mask = xchg(&dirty_bitmap[i], 0);
			dirty_bitmap_buffer[i] = mask;

			offset = i * BITS_PER_LONG;
			kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot,
								offset, mask);
		}
		KVM_MMU_UNLOCK(kvm);
	}

	if (flush)
		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);

	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
		return -EFAULT;
---

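It not only reads the dirty log; when write protection is used, it also re-enables write protection. A write to a write-protected page triggers a VM exit, and the handler grants write permission and records the page in the dirty bitmap.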
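The core of the loop above is an atomic "snapshot and clear" of each bitmap word. The following user-space sketch reproduces only that part with C11 atomics; `snapshot_and_clear` is a hypothetical name, and the re-protection of pages is reduced to a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space analogue of the xchg() loop: atomically take each
 * non-zero word of the live dirty bitmap, leave zero behind, and store the
 * old value in the snapshot buffer. Returns true if any bit was set, which
 * is the condition under which the kernel code flushes TLBs.
 */
bool snapshot_and_clear(atomic_ulong *dirty_bitmap,
                        unsigned long *snapshot,
                        size_t nwords)
{
    bool any = false;

    for (size_t i = 0; i < nwords; i++) {
        if (atomic_load(&dirty_bitmap[i]) == 0) {
            snapshot[i] = 0;
            continue;
        }
        /* concurrent writers may still set bits; none of them is lost */
        snapshot[i] = atomic_exchange(&dirty_bitmap[i], 0);
        any = true;
        /* the kernel re-enables dirty tracking for these pages here */
    }
    return any;
}
```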

MG.3.2.4 Manual Dirty Log Protect

kvm_get_dirty_log_protect() contains a manual_dirty_log_protect condition. The patch that introduced it is:

[3/3] kvm: introduce manual dirty log reprotect - Patchwork https://patchwork.kernel.org/project/kvm/patch/[email protected]/
The scenario it describes is:

1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects them.
2. The guest modifies the pages, causing them to be marked dirty.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though they were not written to since (3).

In this scenario, the copy in Step 3 belongs to Round N, while the dirty mark triggered in Step 2 is consumed in Round N + 1; as a result, the same page is copied again in Round N + 1.

With manual_dirty_log_protect, pages can be explicitly write-protected right before they are copied, which shrinks the time window that causes the double copy to a minimum. The related QEMU code is:

kvm_init()
---
    if (!s->kvm_dirty_ring_size) {
        dirty_log_manual_caps = kvm_check_extension(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2);
        dirty_log_manual_caps &= (KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE |
                                  KVM_DIRTY_LOG_INITIALLY_SET);
        s->manual_dirty_log_protect = dirty_log_manual_caps;
        if (dirty_log_manual_caps) {
            ret = kvm_vm_enable_cap(s, KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2, 0,
                                    dirty_log_manual_caps);
            if (ret) {
                warn_report("Trying to enable capability %"PRIu64" of "
                            "KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 but failed. "
                            "Falling back to the legacy mode. ",
                            dirty_log_manual_caps);
                s->manual_dirty_log_protect = 0;
            }
        }
    }
---

The code here is from version 7.0.0, where this feature is enabled by default.

ram_save_host_page()
  -> migration_bitmap_clear_dirty()
    -> migration_clear_memory_region_dirty_bitmap()
	  -> listener->log_clear()
	     kvm_log_clear()
	       -> kvm_physical_log_clear()
	          ---
 				if (!s->manual_dirty_log_protect) {
			        /* No need to do explicit clear */
			        return ret;
			    }

			   	...
			    kvm_slots_lock();

			    for (i = 0; i < s->nr_slots; i++) {
			        mem = &kml->slots[i];
					...
			        ret = kvm_log_clear_one_slot(mem, kml->as_id, offset, count);
					  -> ret = kvm_vm_ioctl(s, KVM_CLEAR_DIRTY_LOG, &d);
					...
			    }

			    kvm_slots_unlock();
			  ---

kvm_vm_ioctl_clear_dirty_log() // KVM_CLEAR_DIRTY_LOG
  -> kvm_clear_dirty_log_protect()
	-> kvm_arch_mmu_enable_log_dirty_pt_masked()
	   ---
		/* Now handle 4K PTEs.  */
		if (kvm_x86_ops.cpu_dirty_log_size)
			kvm_mmu_clear_dirty_pt_masked(kvm, slot, gfn_offset, mask);
		else
			kvm_mmu_write_protect_pt_masked(kvm, slot, gfn_offset, mask);
	   ---

However, this appears to issue a system call for every page; is that not somewhat inefficient...
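The difference between the legacy and the manual protocol can be illustrated with a toy model of a single memslot. This is a sketch under simplifying assumptions (at most 64 pages, one word of bitmap, no real page faults); the struct and function names are hypothetical and only mimic what KVM_GET_DIRTY_LOG and KVM_CLEAR_DIRTY_LOG do.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of one memslot with up to 64 pages; all names are hypothetical. */
struct slot_model {
    uint64_t dirty;         /* kernel-side dirty bitmap */
    uint64_t write_protect; /* pages whose next write will be trapped */
};

/* A guest write marks the page dirty and makes it writable again (page < 64). */
void guest_write(struct slot_model *s, unsigned page)
{
    uint64_t bit = 1ULL << page;
    s->dirty |= bit;
    s->write_protect &= ~bit;
}

/* Legacy behaviour: report, clear and re-protect in a single step. */
uint64_t get_dirty_log_legacy(struct slot_model *s)
{
    uint64_t snap = s->dirty;
    s->dirty = 0;
    s->write_protect |= snap;
    return snap;
}

/* Manual mode: only report; the bits stay set until cleared explicitly. */
uint64_t get_dirty_log_manual(const struct slot_model *s)
{
    return s->dirty;
}

/* Manual mode: clear and re-protect just the pages about to be copied. */
void clear_dirty_log(struct slot_model *s, uint64_t mask)
{
    s->dirty &= ~mask;
    s->write_protect |= mask;
}
```

In the legacy flow, a guest write that lands between the get and the copy makes the page show up again in the next round even though the copy already captured it. In the manual flow, calling `clear_dirty_log()` right before the copy absorbs that write, so the next `get_dirty_log_manual()` only reports pages written after the clear.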

MG.3.2.5 libvirt migrate --timeout

What if the guest keeps dirtying pages?

libvirt has a parameter for this case; see:

--timeout [seconds] - forces a guest virtual machine to suspend when the live migration counter exceeds N seconds. It can only be used with a live migration. Once the timeout is initiated, the migration continues on the suspended guest virtual machine.

See the libvirt code:

cmdMigrate()
---
    if (virThreadCreate(&workerThread,
                        true,
                        doMigrate,
                        &data) < 0)
        goto cleanup;
    virshWatchJob(ctl, dom, verbose, eventLoop,
                  &data.ret, timeout,
                  virshMigrateTimeout,
                  &timeoutAction, _("Migration"));
---


virshMigrateTimeout()
---
    switch (action) {
    case VIRSH_MIGRATE_TIMEOUT_DEFAULT: /* unreachable */
    case VIRSH_MIGRATE_TIMEOUT_SUSPEND:
		...
        if (virDomainSuspend(dom) < 0)
            vshDebug(ctl, VSH_ERR_INFO, "suspending domain failed\n");
        break;

    case VIRSH_MIGRATE_TIMEOUT_POSTCOPY:
		...
        if (virDomainMigrateStartPostCopy(dom, 0) < 0)
            vshDebug(ctl, VSH_ERR_INFO, "switching to post-copy failed\n");
        break;
    }
---

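MG.4 Stopping the Guest

Next, the guest state must be transferred to the destination host. During this period, the vcpus and all peripheral devices must be completely stopped. Let us look at how QEMU handles this.

MG.4.1 Stopping the vcpus

First, see the code: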

do_vm_stop() // state = RUN_STATE_FINISH_MIGRATE
  -> pause_all_vcpus()
	-> cpu->stop = true
	-> qemu_cpu_kick()
	  -> cpus_kick_thread()
	    -> pthread_kill(cpu->thread->thread, SIG_IPI) //  #define SIG_IPI SIGUSR1

In the kernel, when the signal is sent, an IPI is sent to the task if it is currently running:
__send_signal()
  -> complete_signal()
	-> signal_wake_up()
	  -> signal_wake_up_state()
	     ---
			if (!wake_up_state(t, state | TASK_INTERRUPTIBLE))
				kick_process(t); // IPI is sent here
		 ---

After the vcpu thread receives the signal, it leaves kernel mode and returns to user space; see the code:

vcpu_run()
  -> xfer_to_guest_mode_handle_work()
	-> xfer_to_guest_mode_work()
	   ---
		if (ti_work & _TIF_SIGPENDING) {
			kvm_handle_signal_exit(vcpu);
			return -EINTR;
		}
	   ---

In QEMU user space, it goes through two rounds of the loop:

Round 1
cpu->stop == true
---------
kvm_vcpu_thread_fn()
  -> qemu_wait_io_event()
	-> qemu_wait_io_event_common()
	  -> qemu_cpu_stop() // exit == false
	     ---
		    cpu->stop = false;
		    cpu->stopped = true;
		    if (exit) {
		        cpu_exit(cpu);
		    }
		    qemu_cond_broadcast(&qemu_pause_cond); // notify pause_all_vcpus()
		 ---

Round 2
cpu->stopped == true
-----------------
kvm_vcpu_thread_fn()
  -> qemu_wait_io_event()
	-> qemu_cond_wait(cpu->halt_cond, &qemu_global_mutex);

Finally, the vcpu thread sleeps on the pthread condition variable halt_cond.
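The two rounds can be reduced to a tiny state machine over the `stop` and `stopped` flags. The sketch below is a single-threaded model without real threads or condition variables; `struct vcpu_model` and `vcpu_loop_step` are hypothetical names used only to show the order of events.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two flags used by the vcpu thread loop. */
struct vcpu_model {
    bool stop;    /* set by the thread that wants the vcpu paused */
    bool stopped; /* set by the vcpu thread once it has acknowledged */
};

enum vcpu_step {
    VCPU_STEP_RUN_GUEST, /* nothing pending, enter the guest again */
    VCPU_STEP_ACK_STOP,  /* round 1: acknowledge the stop request */
    VCPU_STEP_SLEEP      /* round 2: block on the halt condition */
};

/* What pause_all_vcpus() does to each vcpu before kicking it. */
void request_stop(struct vcpu_model *c)
{
    c->stop = true;
}

/* One pass through the vcpu thread loop. */
enum vcpu_step vcpu_loop_step(struct vcpu_model *c)
{
    if (c->stop) {
        c->stop = false;
        c->stopped = true; /* the requester is notified at this point */
        return VCPU_STEP_ACK_STOP;
    }
    if (c->stopped) {
        return VCPU_STEP_SLEEP;
    }
    return VCPU_STEP_RUN_GUEST;
}
```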

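MG.4.2 Stopping peripheral devices

After the vcpus stop, no new requests are issued to the devices, but the following cases still need care:

Requests issued by a vcpu with results returned by the device; for example, a vcpu issues a request to a storage device, and the storage device replies with data or a result;
Data delivered by the device on its own initiative, such as packet reception by a NIC

Next, let us see how QEMU handles these cases during migration.

QEMU informs these devices of its state through vm_state_notify():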

migration_completion()
  -> vm_stop_force_state(RUN_STATE_FINISH_MIGRATE)
	-> vm_stop(state)
	  -> do_vm_stop(state)
        -> pause_all_vcpus()
        -> vm_state_notify(0, state)
	       ---
			if (running) {
		        QTAILQ_FOREACH_SAFE(e, &vm_change_state_head, entries, next) {
		            e->cb(e->opaque, running, state);
		        }
		    } else {
		        QTAILQ_FOREACH_REVERSE_SAFE(e, &vm_change_state_head, entries, next) {
		            e->cb(e->opaque, running, state);
		        }
		    }
		   ---
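The forward/reverse traversal means handlers are notified in one order when the VM starts running and in the opposite order when it stops. The sketch below models this with a fixed-size array instead of a QTAILQ; the types, `add_handler`, `notify` and the `record_call` demo callback are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_HANDLERS 8

typedef void (*state_cb)(void *opaque, bool running);

struct handler {
    state_cb cb;
    void *opaque;
};

/* Hypothetical stand-in for vm_change_state_head. */
struct handler_list {
    struct handler h[MAX_HANDLERS];
    size_t n;
};

/* Append a handler; returns false when the list is full. */
bool add_handler(struct handler_list *l, state_cb cb, void *opaque)
{
    if (l->n == MAX_HANDLERS) {
        return false;
    }
    l->h[l->n].cb = cb;
    l->h[l->n].opaque = opaque;
    l->n++;
    return true;
}

/* Forward order when starting, reverse order when stopping. */
void notify(struct handler_list *l, bool running)
{
    if (running) {
        for (size_t i = 0; i < l->n; i++) {
            l->h[i].cb(l->h[i].opaque, running);
        }
    } else {
        for (size_t i = l->n; i-- > 0; ) {
            l->h[i].cb(l->h[i].opaque, running);
        }
    }
}

/* Demo callback: append this handler's tag to a shared trace buffer. */
struct trace { char buf[16]; size_t len; };
struct tagged { struct trace *t; char tag; };

void record_call(void *opaque, bool running)
{
    struct tagged *x = opaque;
    (void)running;
    if (x->t->len < sizeof(x->t->buf) - 1) {
        x->t->buf[x->t->len++] = x->tag;
    }
}
```

One plausible reason for such a design is that a component started last is stopped first, so that nothing keeps feeding a component that has already been quiesced.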

Devices register their handlers through the following two functions:

qemu_add_vm_change_state_handler()
qdev_add_vm_change_state_handler()

MG.4.2.1 net

For virtual NICs such as e1000e and rtl8139, the handler is:

net_vm_change_state_handler()
---
    QTAILQ_FOREACH_SAFE(nc, &net_clients, next, tmp) {
        if (running) {
            /* Flush queued packets and wake up backends. */
            if (nc->peer && qemu_can_send_packet(nc)) {
                qemu_flush_queued_packets(nc->peer);
            }
        } else {
            /* Complete all queued packets, to guarantee we don't modify
             * state later when VM is not running.
             */
            qemu_flush_or_purge_queued_packets(nc, true);
        }
    }
---

qemu_flush_or_purge_queued_packets()
---
    if (qemu_net_queue_flush(nc->incoming_queue)) {
        /* We emptied the queue successfully, signal to the IO thread to repoll
         * the file descriptor (for tap, for example).
         */
        qemu_notify_event();
    } else if (purge) {
        /* Unable to empty the queue, purge remaining packets */
        qemu_net_queue_purge(nc->incoming_queue, nc->peer);
		---
	    QTAILQ_FOREACH_SAFE(packet, &queue->packets, entry, next) {
    	    if (packet->sender == from) {
        	    QTAILQ_REMOVE(&queue->packets, packet, entry);
	            queue->nq_count--;
   	        	if (packet->sent_cb) {
                	packet->sent_cb(packet->sender, 0);
            	}
            	g_free(packet);
        	}
    	}
		---
    }
---

This attempts to deliver the queued packets; if that is not possible, the packets are dropped.

Now let us look at how vhost-net handles this:

virtio_init()
---
    vdev->vmstate = qdev_add_vm_change_state_handler(DEVICE(vdev),
            virtio_vmstate_change, vdev);
---

virtio_vmstate_change()
  -> virtio_set_status()
	-> virtio_net_set_status()
	  -> virtio_net_vhost_status()
	    -> vhost_net_stop()

vhost_net_stop()
---
    struct vhost_vring_file file = { .fd = -1 };
                                     ^^^^^^^^^

    if (net->nc->info->type == NET_CLIENT_DRIVER_TAP) {
        for (file.index = 0; file.index < net->dev.nvqs; ++file.index) {
            int r = vhost_net_set_backend(&net->dev, &file);
            assert(r >= 0);
        }
    }
	...

    vhost_dev_stop(&net->dev, dev);
	---
    if (hdev->vhost_ops->vhost_dev_start) {
        hdev->vhost_ops->vhost_dev_start(hdev, false);
    }

    for (i = 0; i < hdev->nvqs; ++i) {
        vhost_virtqueue_stop(hdev,
                             vdev,
                             hdev->vqs + i,
                             hdev->vq_index + i);
    }
	---
---

This clears the vhost backend. Let us see how the kernel handles it:

vhost_net_set_backend()
---
	vq = &n->vqs[index].vq;
	nvq = &n->vqs[index];
	mutex_lock(&vq->mutex);
	...
	sock = get_socket(fd); // return NULL if fd is -1
	...
	oldsock = vhost_vq_get_backend(vq);
	if (sock != oldsock) {
		...
		vhost_net_disable_vq(n, vq); // Detach poll from old socket fd
		vhost_vq_set_backend(vq, sock);
		r = vhost_vq_init_access(vq);
		r = vhost_net_enable_vq(n, vq); // Does nothing if the backend is NULL
		...
	}
	mutex_unlock(&vq->mutex);
	...
	if (oldsock) {
		vhost_net_flush_vq(n, index); // Queue a work item on the vhost thread and wait for it to complete
		sockfd_put(oldsock);
	}
---

handle_tx()
---
	mutex_lock_nested(&vq->mutex, VHOST_NET_VQ_TX);
	sock = vhost_vq_get_backend(vq);
	if (!sock)
		goto out;
	....
	vhost_disable_notify(&net->dev, vq);
	vhost_net_disable_vq(net, vq);

	if (vhost_sock_zcopy(sock))
		handle_tx_zerocopy(net, sock);
	else
		handle_tx_copy(net, sock);

out:
	mutex_unlock(&vq->mutex);
---

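vq->mutex guarantees mutual exclusion between handle_tx/rx() and vhost_net_set_backend():

If handle_tx/rx() acquires vq->mutex first, vhost_net_set_backend() waits;
If vhost_net_set_backend() acquires it first, handle_tx/rx() waits; by the time it gets the lock, the backend has already been cleared, so it returns immediately;
vhost_net_set_backend() waits for all queued work to complete through vhost_net_flush_vq();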
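The locking pattern in the list above can be sketched in user space with a pthread mutex. This is only an analogy to the kernel code: `struct vq_model`, `vq_set_backend` and `vq_handle_tx` are hypothetical names, the backend is represented by a plain string pointer, and the flush of pending work is not modeled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space analogue of a vhost virtqueue. */
struct vq_model {
    pthread_mutex_t mutex;
    const char *backend; /* NULL means the backend has been detached */
    unsigned handled;    /* number of times the handler did real work */
};

/* Swap the backend under the lock and return the previous one. */
const char *vq_set_backend(struct vq_model *vq, const char *backend)
{
    pthread_mutex_lock(&vq->mutex);
    const char *old = vq->backend;
    vq->backend = backend;
    pthread_mutex_unlock(&vq->mutex);
    return old;
}

/*
 * The handler re-checks the backend after taking the lock, so a handler
 * that lost the race against vq_set_backend(vq, NULL) does nothing.
 */
bool vq_handle_tx(struct vq_model *vq)
{
    bool done = false;

    pthread_mutex_lock(&vq->mutex);
    if (vq->backend != NULL) {
        vq->handled++;
        done = true;
    }
    pthread_mutex_unlock(&vq->mutex);
    return done;
}
```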

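After vhost_net_set_backend() returns, no data remains inside vhost-net, but the socket may still hold data that could not be received in time.

vhost-net guarantees that data already read from the socket is delivered to the guest through virtio and that an interrupt is raised; during the subsequent migration, this data is sent to the destination host together with the CPU state and memory. Data not yet read from the socket is handled differently: if it belongs to a TCP connection, the unread portion is never ACKed, so the TCP peer will later retransmit it.

Note: this also tells us that the virtqueue may hold data during migration; later we will see how QEMU handles this.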

MG.4.2.2 blk

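A storage device receives requests from a vcpu, processes them, and returns the results to the vcpu. In-flight requests must be completed by the storage device; otherwise, they would have to be migrated along with the device.

Note: we cannot return errors or timeouts for I/O requests to the guest OS and let it retry; I/O timeout handling in particular can make the system stall.

For a device emulated locally in QEMU, such as virtio-blk, handling is relatively simple; see the code: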

do_vm_stop()
---
    if (runstate_is_running()) {
        runstate_set(state);
        cpu_disable_ticks();
        pause_all_vcpus();
        vm_state_notify(0, state);
        ...
    }

    bdrv_drain_all();
    ret = bdrv_flush_all();
---

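This waits for all in-flight I/O to complete and sends a flush to the storage device to ensure written data reaches the disk. This guarantees that the data is also accessible on the destination host.

But how does vhost-user handle this? It likewise relies on the virtio set_status callback: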

vhost_user_blk_set_status()
---
    if (should_start) {
        ret = vhost_user_blk_start(vdev, &local_err);
		...
    } else {
        vhost_user_blk_stop(vdev);
    }
---

vhost_user_blk_stop()
  -> vhost_dev_stop()
	-> dev->vhost_ops->vhost_get_vring_base()
	   vhost_user_get_vring_base()
	   ---
	    VhostUserMsg msg = {
	        .hdr.request = VHOST_USER_GET_VRING_BASE,
        	.hdr.flags = VHOST_USER_VERSION,
        	.payload.state = *ring,
        	.hdr.size = sizeof(msg.payload.state),
    	};
		...
	    ret = vhost_user_write(dev, &msg, NULL, 0);
		...
    	ret = vhost_user_read(dev, &msg);
		...
	   ---

VHOST_USER_GET_VRING_BASE ?

Reference: Vhost-user Protocol — QEMU documentation https://qemu-project.gitlab.io/qemu/interop/vhost-user.html

Each ring is initialized in a stopped state. The back-end must start ring upon receiving a kick (that is, detecting that file descriptor is readable) on the descriptor specified by VHOST_USER_SET_VRING_KICK or receiving the in-band message VHOST_USER_VRING_KICK if negotiated, and stop ring upon receiving VHOST_USER_GET_VRING_BASE.

Appendix

vhost-blk

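The basic principle of vhost-blk is described in:

Virtio-blk Performance Improvement https://www.linux-kvm.org/images/f/f9/2012-forum-virtio-blk-performance-improvement.pdf

However, vhost-blk was never merged into the kernel; it sparked a heated debate on the mailing list. For details, see:

[PATCH 0/5] Add vhost-blk support https://linux-kernel.vger.kernel.narkive.com/piveEhxB/patch-0-5-add-vhost-blk-support
In my opinion, vhost-blk did not make it into the mainline kernel for the following reasons:

Storage I/O is far slower than network I/O, especially in 2012 when the patch was posted, so the benefits vhost can bring, such as fewer context switches and fewer system calls, are not pronounced on storage devices;
QEMU's storage stack layers format + storage, while vhost-blk only works with raw disks, especially since v6 switched from aio to bio-based; networking has no such layering, since bridges, Open vSwitch and the like all live outside QEMU;

The introduction of vhost-user made a solution like vhost-blk even less necessary; see:

Accelerating NVMe I/Os in Virtual Machine via SPDK vhost* Solution https://events19.linuxfoundation.org/wp-content/uploads/2017/11/Accelerating-NVMe-I_Os-in-Virtual-Machine-via-SPDK-vhost_-Solution-Ziye-Yang-_-Changpeng-Liu-Intel.pdf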

完全手动建立maven骨架 eksliang java eclipse Web
建一个 JAVA 项目： mvn archetype:create -DgroupId=com.demo -DartifactId=App [-Dversion=0.0.1-SNAPSHOT] [-Dpackaging=jar] 建一个 web 项目： mvn archetype:create -DgroupId=com.demo -DartifactId=web-a
配置清单 gengzg 配置
1、修改grub启动的内核版本 vi /boot/grub/grub.conf 将default 0改为1 拷贝mt7601Usta.ko到/lib文件夹拷贝RT2870STA.dat到 /etc/Wireless/RT2870STA/文件夹拷贝wifiscan到bin文件夹，chmod 775 /bin/wifiscan 拷贝wifiget.sh到bin文件夹，chm
Windows端口被占用处理方法 huqiji windows
以下文章主要以80端口号为例，如果想知道其他的端口号也可以使用该方法..........................1、在windows下如何查看80端口占用情况?是被哪个进程占用?如何终止等. 这里主要是用到windows下的DOS工具,点击"开始"--"运行",输入&
开源ckplayer 网页播放器，跨平台(html5, mobile)，flv, f4v, mp4, rtmp协议. webm, ogg, m3u8 ！天梯梦 mobile
CKplayer，其全称为超酷flv播放器，它是一款用于网页上播放视频的软件，支持的格式有：http协议上的flv,f4v,mp4格式，同时支持rtmp视频流格式播放，此播放器的特点在于用户可以自己定义播放器的风格，诸如播放/暂停按钮，静音按钮，全屏按钮都是以外部图片接口形式调用，用户根据自己的需要制作出播放器风格所需要使用的各个按钮图片然后替换掉原始风格里相应的图片就可以制作出自己的风格了，
简单工厂设计模式 hm4123660 java 工厂设计模式简单工厂模式
简单工厂模式（Simple Factory Pattern）属于类的创新型模式，又叫静态工厂方法模式。是通过专门定义一个类来负责创建其他类的实例，被创建的实例通常都具有共同的父类。简单工厂模式是由一个工厂对象决定创建出哪一种产品类的实例。简单工厂模式是工厂模式家族中最简单实用的模式，可以理解为是不同工厂模式的一个特殊实现。
maven笔记 zhb8015 maven
跳过测试阶段： mvn package -DskipTests 临时性跳过测试代码的编译： mvn package -Dmaven.test.skip=true maven.test.skip同时控制maven-compiler-plugin和maven-surefire-plugin两个插件的行为，即跳过编译，又跳过测试。指定测试类 mvn test
非mapreduce生成Hfile，然后导入hbase当中 Stark_Summer map hbase reduce Hfile path实例
最近一个群友的boss让研究hbase，让hbase的入库速度达到5w+/s，这可愁死了，4台个人电脑组成的集群，多线程入库调了好久，速度也才1w左右，都没有达到理想的那种速度，然后就想到了这种方式，但是网上多是用mapreduce来实现入库，而现在的需求是实时入库，不生成文件了，所以就只能自己用代码实现了，但是网上查了很多资料都没有查到，最后在一个网友的指引下，看了源码，最后找到了生成Hfile
jsp web tomcat 编码问题王新春 tomcat jsp pageEncode
今天配置jsp项目在tomcat上，windows上正常，而linux上显示乱码，最后定位原因为tomcat 的server.xml 文件的配置，添加 URIEncoding 属性： <Connector port="8080" protocol="HTTP/1.1" connectionTi