OpenStack VM Cold and Live Migration: Implementation Principles and Code Analysis

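Contents

  • Previous articles
  • Cold migration code analysis (based on Newton)
    • How Nova implements cold migration
  • Live migration code analysis
    • How Nova implements live migration
    • Sending the live migration command to libvirtd
    • Monitoring libvirtd's data migration status
  • Live migration issues with NUMA affinity, CPU pinning, and SR-IOV NICs
  • Closing remarks
  • References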

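Previous articles

"Disk File Types and Storage Modes of OpenStack VMs"
"Libvirt Live Migration and the Pre-Copy Implementation"
"OpenStack VM Cold/Live Migration in Practice and Workflow Analysis"

With the groundwork laid by the articles above, we finally reach the code. Walking through the implementation shows what OpenStack VM migration really does under the hood.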

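Cold migration code analysis (based on Newton)

How Nova implements cold migration

  1. Decide whether the request is a Resize or a Cold Migrate by checking whether a new flavor was passed in
  2. Get the VM's network information, network_info
  3. Get the VM's disk device information, block_device_info
  4. Get the VM's power-off timeout and retry interval
  5. Power off the VM
  6. Migrate the VM's local disk files
  7. Migrate the VM's shared block devices
  8. Migrate the VM's networking
  9. Update the VM's host record and state

NOTE: block_device_info does not hold only OpenStack block device (volume) information. It describes all of the VM's block devices, that is, its disks (both image-backed and volume-backed). Losing sight of this makes the code easy to misread.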

MariaDB [nova]> select device_name,destination_type,device_type,source_type,image_id from block_device_mapping where instance_uuid="1935fcf7-ba9b-437c-a7d3-5d54c6d0d6d3";
+-------------+------------------+-------------+-------------+--------------------------------------+
| device_name | destination_type | device_type | source_type | image_id                             |
+-------------+------------------+-------------+-------------+--------------------------------------+
| /dev/vda    | local            | disk        | image       | 0aff2888-47f8-4133-928a-9c54414b3afb |
+-------------+------------------+-------------+-------------+--------------------------------------+
# nova/nova/api/openstack/compute/migrate_server.py

    def _migrate(self, req, id, body):
        """Permit admins to migrate a server to a new host."""
        ...
        # Check that the user is permitted to perform the migrate operation
        context.can(ms_policies.POLICY_ROOT % 'migrate')

        # Fetch the instance object
        instance = common.get_instance(self.compute_api, context, id)
        try:
            # This actually calls the instance resize API
            self.compute_api.resize(req.environ['nova.context'], instance)
        ...


# nova/nova/compute/api.py

    def resize(self, context, instance, flavor_id=None, clean_shutdown=True,
               **extra_instance_updates):
        """Resize (ie, migrate) a running instance.

        If flavor_id is None, the process is considered a migration, keeping
        the original flavor_id. If flavor_id is not None, the instance should
        be migrated to a new host and resized to the new flavor_id.
        """
        # As the docstring says, whether this is a migrate or a resize
        # depends on whether a new flavor is passed in
        ...

        # Get the VM's current flavor
        current_instance_type = instance.get_flavor()
        # If flavor_id is not provided, only migrate the instance.
        if not flavor_id:
            LOG.debug("flavor_id is None. Assuming migration.",
                      instance=instance)
            # Make sure the flavor stays the same across the migration
            new_instance_type = current_instance_type
            ...

        filter_properties = {'ignore_hosts': []}
        # The allow_resize_to_same_host option decides whether the source host
        # remains a scheduling candidate.
        # In practice, when a migrate lands on the same compute node, nova-compute
        # raises UnableToMigrateToSelf and the request is rescheduled until a suitable
        # node is found or scheduling fails. This requires RetryFilter to be enabled
        # in nova-scheduler.
        if not CONF.allow_resize_to_same_host:
            filter_properties['ignore_hosts'].append(instance.host)
        ...
        
        scheduler_hint = {'filter_properties': filter_properties}
        self.compute_task_api.resize_instance(context, instance,
                extra_instance_updates, scheduler_hint=scheduler_hint,
                flavor=new_instance_type,
                reservations=quotas.reservations or [],
                clean_shutdown=clean_shutdown,
                request_spec=request_spec)


# nova/compute/manager.py

    def resize_instance(self, context, instance, image,
                        reservations, migration, instance_type,
                        clean_shutdown):
        """Starts the migration of a running instance to another host."""
        ...
            # Get the VM's network information
            network_info = self.network_api.get_instance_nw_info(context,
                                                                 instance)

        ...
            # Get the VM's disk (block device) information
            bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
                    context, instance.uuid)
            block_device_info = self._get_instance_block_device_info(
                                context, instance, bdms=bdms)

            # Get the VM's power-off timeout and retry interval
            timeout, retry_interval = self._get_power_off_values(context,
                                            instance, clean_shutdown)

            # Power off the VM and migrate its disk files
            disk_info = self.driver.migrate_disk_and_power_off(
                    context, instance, migration.dest_host,
                    instance_type, network_info,
                    block_device_info,
                    timeout, retry_interval)

            # Terminate the VM's shared block device (volume) connections
            self._terminate_volume_connections(context, instance, bdms)

            # Start migrating the VM's networking
            migration_p = obj_base.obj_to_primitive(migration)
            self.network_api.migrate_instance_start(context,
                                                    instance,
                                                    migration_p)
            ...
            
            # Update the VM's host record
            instance.host = migration.dest_compute
            instance.node = migration.dest_node
            instance.task_state = task_states.RESIZE_MIGRATED
            instance.save(expected_task_state=task_states.RESIZE_MIGRATING)
            ...


# nova/nova/virt/libvirt/driver.py

    def migrate_disk_and_power_off(self, context, instance, dest,
                                   flavor, network_info,
                                   block_device_info=None,
                                   timeout=0, retry_interval=0):

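        # Get the ephemeral disk information
        ephemerals = driver.block_device_info_get_ephemerals(block_device_info)
        ...
        # NOTE: eph_size (the instance's current ephemeral disk size) is
        # computed in code elided from this excerpt.

        # Check whether the disks would have to shrink
        # Checks if the migration needs a disk resize down.
        root_down = flavor.root_gb < instance.flavor.root_gb
        ephemeral_down = flavor.ephemeral_gb < eph_size
        # Check whether the VM was booted from a volume
        booted_from_volume = self._is_booted_from_volume(block_device_info)

        # Local disk files cannot be resized down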
        if (root_down and not booted_from_volume) or ephemeral_down:
            reason = _("Unable to resize disk down.")
            raise exception.InstanceFaultRollback(
                exception.ResizeError(reason=reason))

        # VMs whose local disks use the LVM image backend (images_type = lvm),
        # i.e. VMs that are not booted from a volume, cannot be migrated
        # NOTE(dgenin): Migration is not implemented for LVM backed instances.
        if CONF.libvirt.images_type == 'lvm' and not booted_from_volume:
            reason = _("Migration is not supported for LVM backed instances")
            raise exception.InstanceFaultRollback(
                exception.MigrationPreCheckError(reason=reason))

        # copy disks to destination
        # rename instance dir to +_resize at first for using
        # shared storage for instance dir (eg. NFS).
        inst_base = libvirt_utils.get_instance_path(instance)
        inst_base_resize = inst_base + "_resize"
        # Check whether the instance directory is on shared storage
        shared_storage = self._is_storage_shared_with(dest, inst_base)

        # try to create the directory on the remote compute node
        # if this fails we pass the exception up the stack so we can catch
        # failures here earlier
        if not shared_storage:
            try:
                # Non-shared storage: create the instance directory on the
                # destination host over SSH
                self._remotefs.create_dir(dest, inst_base)
            except processutils.ProcessExecutionError as e:
                reason = _("not able to execute ssh command: %s") % e
                raise exception.InstanceFaultRollback(
                    exception.ResizeError(reason=reason))

        # Power off the VM
        self.power_off(instance, timeout, retry_interval)

        # Disconnect the shared block devices (volumes)
        block_device_mapping = driver.block_device_info_get_mapping(
            block_device_info)
        for vol in block_device_mapping:
            connection_info = vol['connection_info']
            disk_dev = vol['mount_device'].rpartition("/")[2]
            self._disconnect_volume(connection_info, disk_dev, instance)

        # Get the VM's local disk information as a JSON string; it covers
        # the file paths of the root, ephemeral, and swap disks
        disk_info_text = self.get_instance_disk_info(
            instance, block_device_info=block_device_info)
        disk_info = jsonutils.loads(disk_info_text)

        try:
            # Rename the instance directory to <inst_base>_resize first
            utils.execute('mv', inst_base, inst_base_resize)
            # if we are migrating the instance with shared storage then
            # create the directory.  If it is a remote node the directory
            # has already been created

            if shared_storage:
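                # Shared storage: treat the destination host as the local host
                dest = None
                # Shared storage: create the instance directory directly on
                # the local filesystem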
                utils.execute('mkdir', '-p', inst_base)
            ...
            active_flavor = instance.get_flavor()
            # Copy the VM's local disk files to the destination
            for info in disk_info:
                # assume inst_base == dirname(info['path'])
                img_path = info['path']
                fname = os.path.basename(img_path)
                from_path = os.path.join(inst_base_resize, fname)
                ...
                # We will not copy over the swap disk here, and rely on
                # finish_migration/_create_image to re-create it for us.
                if not (fname == 'disk.swap' and
                    active_flavor.get('swap', 0) != flavor.get('swap', 0)):
                    # Whether to enable compression
                    compression = info['type'] not in NO_COMPRESSION_TYPES
                    # Non-shared storage: remote copy with scp
                    # Shared storage: local copy with cp
                    libvirt_utils.copy_image(from_path, img_path, host=dest,
                                             on_execute=on_execute,
                                             on_completion=on_completion,
                                             compression=compression)

            # Ensure disk.info is written to the new path to avoid disks being
            # reinspected and potentially changing format.
            
            # Copy the disk.info file
            src_disk_info_path = os.path.join(inst_base_resize, 'disk.info')
            if os.path.exists(src_disk_info_path):
                dst_disk_info_path = os.path.join(inst_base, 'disk.info')
                libvirt_utils.copy_image(src_disk_info_path,
                                         dst_disk_info_path,
                                         host=dest, on_execute=on_execute,
                                         on_completion=on_completion)
        except Exception:
            with excutils.save_and_reraise_exception():
                self._cleanup_remote_migration(dest, inst_base,
                                               inst_base_resize,
                                               shared_storage)

        return disk_info_text

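Live migration code analysis

How Nova implements live migration

The article "Libvirt Live Migration and the Pre-Copy Implementation" covered how Libvirt live migration and KVM pre-copy live migration work. In short, pre-copy has three stages:

  • Stage 1: Mark all of the VM's RAM as dirty.
  • Stage 2: Transfer all dirty memory, then recompute the memory dirtied in the meantime, and iterate until an exit condition is met, for example when the amount of dirty memory drops to the low watermark.
  • Stage 3: Stop the guest OS and transfer the remaining dirty memory together with the VM's device state.

The critical stage is Stage 2, or more precisely how its exit condition is implemented. Libvirt's early native exit conditions were:

  • 50% or less of the dirty memory remains to be migrated
  • No second iteration is needed, or the number of iterations exceeds 30
  • A dynamically configured max downtime
  • Source host policy (e.g. the source host shuts down in 5 minutes, so all of its VMs must be migrated immediately)

Nova uses the dynamically configured max downtime as its exit condition. In each iteration, Libvirt pre-copy live migration recomputes the VM's newly dirtied memory and uses the time the iteration took to estimate the bandwidth. From the bandwidth and the number of dirty pages in the current iteration, it then computes how long transferring the remaining data would take. That time is the downtime. If the downtime is within the Live Migration Max Downtime configured by the administrator, the loop exits and Stage 3 begins.

NOTE: Live Migration Max Downtime (in ms) is the longest time the VM is allowed to stay paused during the final switchover. It describes how much service interruption can be tolerated and is usually small enough to be negligible. It is set with the nova.conf option CONF.libvirt.live_migration_downtime.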
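The exit condition described above can be illustrated with a toy model. This is a minimal sketch, not the actual QEMU or Libvirt algorithm: it assumes a constant dirty rate and a constant bandwidth, and the function name and all parameters are made up for illustration.

```python
def simulate_pre_copy(ram_mb, dirty_mb_per_s, bandwidth_mb_per_s,
                      max_downtime_ms, max_iterations=30):
    """Toy model of the pre-copy loop (all names are hypothetical).

    Returns (iterations, estimated_downtime_ms, converged).
    """
    # Stage 1: every page of RAM starts out dirty.
    remaining_mb = float(ram_mb)
    for iteration in range(1, max_iterations + 1):
        # Time needed to send everything that is currently dirty.
        transfer_s = remaining_mb / bandwidth_mb_per_s
        downtime_ms = transfer_s * 1000.0
        # Exit condition: the remaining data fits into the allowed pause.
        if downtime_ms <= max_downtime_ms:
            return iteration, downtime_ms, True
        # Stage 2: while we were sending, the guest dirtied more memory
        # (never more than the total RAM).
        remaining_mb = min(float(ram_mb), dirty_mb_per_s * transfer_s)
    # Gave up: the estimated downtime never dropped below the limit.
    return max_iterations, remaining_mb / bandwidth_mb_per_s * 1000.0, False


if __name__ == "__main__":
    print(simulate_pre_copy(4096, 100, 1000, 500))
    print(simulate_pre_copy(4096, 1200, 1000, 500))
```

In this model, a dirty rate well below the bandwidth lets the loop converge within a few iterations. Once the dirty rate reaches the bandwidth, the estimated downtime stops shrinking and the loop never satisfies the exit condition.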

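Note that the dynamic-downtime exit condition has a weakness: if the VM stays under heavy load and keeps producing new dirty memory, every iteration has a lot of data to transfer and the downtime never falls within the exit range. So be prepared: a live migration can take a long time. To address this, Libvirt introduced some newer features:

  • Auto-converge: if the VM stays under heavy load, libvirtd automatically throttles the vCPUs to reduce the load and slow the growth of dirty memory, so that the downtime can fall within the exit range.

Besides pre-copy, Libvirt also supports post-copy. Pre-copy requires all data to be copied before the VM switches to the destination host. Post-copy instead switches to the destination as early as possible and copies the memory afterwards. In post-copy mode, the VM's device state and a portion (10%) of the dirty memory are first transferred to the destination host, and the VM then starts running there. When the guest accesses a memory page that is not yet present, a remote page fault is triggered and the page is pulled from the source host. Post-copy clearly has its own problems: if either host crashes or fails, or the network between them breaks, the whole VM ends up in a broken state. Post-copy is therefore not the recommended live migration mode for critical workloads. The nova.conf option live_migration_permit_post_copy controls whether it is allowed.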

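In addition, the Libvirt live migration control model that Nova uses is direct client control, so Nova, as the Libvirt client, has to poll libvirtd for the state of the data migration and use it to steer the migration. Nova therefore also needs to implement its own migration monitoring mechanism.

In short, Nova's implementation of Libvirt live migration comes down to two things:

  • Acting as a Libvirt client, sending the live migration command to the libvirtd process on the source host
  • A data migration monitoring mechanism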
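The two responsibilities above can be pictured as one thread that issues a blocking migration call and a second loop that polls for progress. The following is a minimal sketch of that pattern only. `FakeMigrationJob` and `run_and_monitor` are hypothetical stand-ins, and no libvirt call is involved.

```python
import threading
import time


class FakeMigrationJob:
    """Hypothetical stand-in for a hypervisor-side migration job."""

    def __init__(self, total_mb, rate_mb_per_tick):
        self.remaining = total_mb
        self.rate = rate_mb_per_tick
        self.state = "running"
        self._lock = threading.Lock()

    def run(self, tick=0.01):
        # Blocks until all data is transferred, like a blocking migrate call.
        while True:
            with self._lock:
                self.remaining = max(0, self.remaining - self.rate)
                if self.remaining == 0:
                    self.state = "completed"
                    return
            time.sleep(tick)

    def get_job_info(self):
        # What the client sees when it polls.
        with self._lock:
            return self.state, self.remaining


def run_and_monitor(job, poll_interval=0.005):
    # Responsibility 1: issue the (blocking) migration in its own thread.
    operation = threading.Thread(target=job.run)
    operation.start()
    # Responsibility 2: poll for status until the job leaves "running".
    samples = []
    while True:
        state, remaining = job.get_job_info()
        samples.append(remaining)
        if state != "running":
            break
        time.sleep(poll_interval)
    operation.join()
    return state, samples


if __name__ == "__main__":
    print(run_and_monitor(FakeMigrationJob(100, 10)))
```

The monitoring loop is where a client can also act on what it observes, for example by aborting a stalled job or adjusting limits while the transfer is in flight.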
# nova/api/openstack/compute/migrate_server.py
 
    def _migrate_live(self, req, id, body):
        """Permit admins to (live) migrate a server to a new host."""
        ...
        # Whether to perform block migration
        block_migration = body["os-migrateLive"]["block_migration"]
        ...
        # Whether to run asynchronously
        async = api_version_request.is_supported(req, min_version='2.34')
        ...
            # Whether to force the migration
            force = self._get_force_param_for_live_migration(body, host)
        ...
            # Whether disk over-commit is allowed
            disk_over_commit = body["os-migrateLive"]["disk_over_commit"]
        ...
            self.compute_api.live_migrate(context, instance, block_migration,
                                          disk_over_commit, host, force, async)
        ...


# nova/nova/compute/api.py

    def live_migrate(self, context, instance, block_migration,
                     disk_over_commit, host_name, force=None, async=False):
        """Migrate a server lively to a new host."""
        ...
     
        # NOTE(sbauza): Force is a boolean by the new related API version
        if force is False and host_name:
        ...
                # Not forced: record the requested destination host
                destination = objects.Destination(
                    host=target.host,
                    node=target.hypervisor_hostname
                )
                request_spec.requested_destination = destination
        ...
            self.compute_task_api.live_migrate_instance(context, instance,
                host_name, block_migration=block_migration,
                disk_over_commit=disk_over_commit,
                request_spec=request_spec, async=async)


# nova/nova/conductor/manager.py

    def _live_migrate(self, context, instance, scheduler_hint,
                      block_migration, disk_over_commit, request_spec):
        # Get the destination host
        destination = scheduler_hint.get("host")
        ...
        task = self._build_live_migrate_task(context, instance, destination,
                                             block_migration, disk_over_commit,
                                             migration, request_spec)
            ...
            task.execute()
            
    ...


# nova/nova/conductor/tasks/live_migrate.py

class LiveMigrationTask(base.TaskBase):
    ...
    def _execute(self):
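        # Check that the VM is running normally
        self._check_instance_is_active()
        # Check that the compute service on the source host is up
        self._check_host_is_up(self.source)

        # If no destination host was specified, let the scheduler pick one
        if not self.destination:
            self.destination = self._find_destination()
            self.migration.dest_compute = self.destination
            self.migration.save()
        else:
            # Check that the destination is not the source host
            # Check that the compute service on the destination host is up
            # Check that the destination host has enough memory
            # Check that the destination and source hypervisors are compatible
            # Check that the destination host can accept the live migration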
            self._check_requested_destination()

        # TODO(johngarbutt) need to move complexity out of compute manager
        # TODO(johngarbutt) disk_over_commit?
        return self.compute_rpcapi.live_migration(self.context,
                host=self.source,
                instance=self.instance,
                dest=self.destination,
                block_migration=self.block_migration,
                migration=self.migration,
                migrate_data=self.migrate_data)


# nova/compute/manager.py

    def live_migration(self, context, dest, instance, block_migration,
                       migration, migrate_data):
        ...
        # Set the migration status to 'queued'
        self._set_migration_status(migration, 'queued')

        def dispatch_live_migration(*args, **kwargs):
            with self._live_migration_semaphore:
                self._do_live_migration(*args, **kwargs)

        # Spawn a thread for the live migration task; the semaphore
        # above limits how many run at the same time
        utils.spawn_n(dispatch_live_migration,
                      context, dest, instance,
                      block_migration, migration,
                      migrate_data)

    def _do_live_migration(self, context, dest, instance, block_migration,
                           migration, migrate_data):
        ...
        # Set the migration status to 'preparing'
        self._set_migration_status(migration, 'preparing')

        got_migrate_data_object = isinstance(migrate_data,
                                             migrate_data_obj.LiveMigrateData)
        if not got_migrate_data_object:
            migrate_data = \
                migrate_data_obj.LiveMigrateData.detect_implementation(
                    migrate_data)

        try:
            if ('block_migration' in migrate_data and
                    migrate_data.block_migration):
                # Block migration: get the VM's local disk file information
                block_device_info = self._get_instance_block_device_info(
                    context, instance)
                disk = self.driver.get_instance_disk_info(
                    instance, block_device_info=block_device_info)
            else:
                disk = None

            # Have the destination host prepare for the live migration
            migrate_data = self.compute_rpcapi.pre_live_migration(
                context, instance,
                block_migration, disk, dest, migrate_data)
        ...

        # Set the migration status to 'running'
        self._set_migration_status(migration, 'running')

        ...
            self.driver.live_migration(context, instance, dest,
                                       self._post_live_migration,
                                       self._rollback_live_migration,
                                       block_migration, migrate_data)
        ...


# nova/nova/virt/libvirt/driver.py

    def _live_migration(self, context, instance, dest, post_method,
                        recover_method, block_migration,
                        migrate_data):
        ...
        # A nova.virt.libvirt.guest.Guest object
        guest = self._host.get_guest(instance)

        disk_paths = []
        device_names = []
        
        
        if migrate_data.block_migration:
            # Block migration: get the paths of the local disk files
            # Without block migration, only memory data is migrated
            # e.g. /var/lib/nova/instances/bf6824e9-1dac-466c-ab53-69f82d8adf73/disk
            disk_paths, device_names = self._live_migration_copy_disk_paths(
                context, instance, guest)

        # Spawn a thread that runs the live migration operation
        opthread = utils.spawn(self._live_migration_operation,
                                     context, instance, dest,
                                     block_migration,
                                     migrate_data, guest,
                                     device_names)
        ...
            # Monitor libvirtd's data migration progress
            self._live_migration_monitor(context, instance, guest, dest,
                                         post_method, recover_method,
                                         block_migration, migrate_data,
                                         finish_event, disk_paths)
        ...

Sending the live migration command to libvirtd

    def _live_migration_operation(self, context, instance, dest,
                                  block_migration, migrate_data, guest,
                                  device_names):
        ...
            # Get the live migration URI
            migrate_uri = None
            if ('target_connect_addr' in migrate_data and
                    migrate_data.target_connect_addr is not None):
                dest = migrate_data.target_connect_addr
                if (migration_flags &
                    libvirt.VIR_MIGRATE_TUNNELLED == 0):
                    migrate_uri = self._migrate_uri(dest)

            # Get the updated guest XML
            new_xml_str = None
            params = None
            if (self._host.is_migratable_xml_flag() and (
                    listen_addrs or migrate_data.bdms)):
                new_xml_str = libvirt_migrate.get_updated_guest_xml(
                    # TODO(sahid): It's not a really well idea to pass
                    # the method _get_volume_config and we should to find
                    # a way to avoid this in future.
                    guest, migrate_data, self._get_volume_config)
            ...
            # Call the wrapper around libvirt's migration API
            # to send the live migration command to libvirtd
            guest.migrate(self._live_migration_uri(dest),
                          migrate_uri=migrate_uri,
                          flags=migration_flags,
                          params=params,
                          domain_xml=new_xml_str,
                          bandwidth=CONF.libvirt.live_migration_bandwidth)
         ...

The migration function prototype in the libvirt Python bindings is libvirt.virDomain.migrate:

migrate(self, dconn, flags, dname, uri, bandwidth) method of libvirt.virDomain instance
    Migrate the domain object from its current host to the destination host given by dconn (a connection to the destination host).

The Nova libvirt driver wraps this family of calls (the migrateToURI variants) in its own Guest.migrate method:

# nova/virt/libvirt/guest.py

    def migrate(self, destination, migrate_uri=None, params=None, flags=0,
                domain_xml=None, bandwidth=0):
        """Migrate guest object from its current host to the destination
        """
        if domain_xml is None:
            self._domain.migrateToURI(
                destination, flags=flags, bandwidth=bandwidth)
        else:
            if params:
                ...
                if migrate_uri:
                    # In migrateToURI3 this paramenter is searched in
                    # the `params` dict
                    params['migrate_uri'] = migrate_uri
                params['bandwidth'] = bandwidth
                self._domain.migrateToURI3(
                    destination, params=params, flags=flags)
            else:
                self._domain.migrateToURI2(
                    destination, miguri=migrate_uri, dxml=domain_xml,
                    flags=flags, bandwidth=bandwidth)

Flags configure the details of a Libvirt migration:

  • VIR_MIGRATE_LIVE – Do not pause the VM during migration
  • VIR_MIGRATE_PEER2PEER – Direct connection between source & destination hosts
  • VIR_MIGRATE_TUNNELLED – Tunnel migration data over the libvirt RPC channel
  • VIR_MIGRATE_PERSIST_DEST – If the migration is successful, persist the domain on the destination host.
  • VIR_MIGRATE_UNDEFINE_SOURCE – If the migration is successful, undefine the domain on the source host.
  • VIR_MIGRATE_PAUSED – Leave the domain suspended on the remote side.
  • VIR_MIGRATE_CHANGE_PROTECTION – Protect against domain configuration changes during the migration process (set automatically when supported).
  • VIR_MIGRATE_UNSAFE – Force migration even if it is considered unsafe.
  • VIR_MIGRATE_OFFLINE – Migrate offline.

These flags are defined through the nova.conf option live_migration_flag, e.g.

live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED
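A comma-separated flag list like this has to be turned into a single bitmask before it can be passed as the `flags` argument. The following is a minimal sketch of that conversion. The numeric values are hard-coded here so the example runs without libvirt installed and are meant to mirror libvirt's virDomainMigrateFlags enum; real code should read the constants from the `libvirt` module instead. `parse_migration_flags` is a hypothetical helper, not a Nova function.

```python
from functools import reduce

# Assumed to mirror libvirt's virDomainMigrateFlags; hard-coded for the sketch.
MIGRATE_FLAGS = {
    "VIR_MIGRATE_LIVE": 1 << 0,
    "VIR_MIGRATE_PEER2PEER": 1 << 1,
    "VIR_MIGRATE_TUNNELLED": 1 << 2,
    "VIR_MIGRATE_PERSIST_DEST": 1 << 3,
    "VIR_MIGRATE_UNDEFINE_SOURCE": 1 << 4,
    "VIR_MIGRATE_PAUSED": 1 << 5,
    "VIR_MIGRATE_CHANGE_PROTECTION": 1 << 8,
    "VIR_MIGRATE_UNSAFE": 1 << 9,
    "VIR_MIGRATE_OFFLINE": 1 << 10,
}


def parse_migration_flags(flag_string):
    """OR together the flags named in a comma-separated string."""
    names = [name.strip() for name in flag_string.split(",") if name.strip()]
    # A KeyError here means the configuration names an unknown flag.
    return reduce(lambda acc, name: acc | MIGRATE_FLAGS[name], names, 0)


if __name__ == "__main__":
    flags = parse_migration_flags(
        "VIR_MIGRATE_UNDEFINE_SOURCE, VIR_MIGRATE_PEER2PEER, "
        "VIR_MIGRATE_LIVE, VIR_MIGRATE_TUNNELLED")
    print(flags, bool(flags & MIGRATE_FLAGS["VIR_MIGRATE_TUNNELLED"]))
```

Testing a single bit with `&`, as the driver code above does for VIR_MIGRATE_TUNNELLED, then tells the caller whether a given behavior was requested.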

Monitoring libvirtd's data migration status
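Before reading the monitor code, it helps to see how a downtime schedule like the one it uses can be built and consumed. The following is a sketch under assumptions: the formula is modeled on the exponential backoff that the `_migration_downtime_steps` docstring describes, the parameter defaults (500 ms, 10 steps, 75 s delay per GB) are assumed values, and `pick_downtime` is a hypothetical helper standing in for the step-selection logic.

```python
def migration_downtime_steps(data_gb, downtime=500, steps=10, delay=75):
    """Yield (elapsed_seconds_threshold, max_downtime_ms) tuples.

    Sketch of an exponential backoff: early steps raise the max downtime
    by small amounts, later steps by ever larger amounts.
    """
    # The wait between steps scales with the amount of data to move.
    delay = int(delay * data_gb)
    offset = downtime / float(steps + 1)
    base = (downtime - offset) ** (1 / float(steps))
    for i in range(steps + 1):
        yield (int(delay * i), int(offset + base ** i))


def pick_downtime(schedule, elapsed, current):
    """Return the max downtime that applies after `elapsed` seconds."""
    chosen = current
    for threshold, value in schedule:
        # A step becomes active once its time threshold has passed.
        if elapsed > threshold:
            chosen = value
    return chosen


if __name__ == "__main__":
    schedule = list(migration_downtime_steps(4))
    for step in schedule:
        print(step)
    print(pick_downtime(schedule, elapsed=1000, current=46))
```

With 4 GB of data and these assumed defaults, the thresholds are 300 seconds apart and the downtime values start in the mid-40s and end near the configured maximum, which matches the shape of the example list quoted in the code comments below.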

# nova/nova/virt/libvirt/driver.py

    def _live_migration_monitor(self, context, instance, guest,
                                dest, post_method,
                                recover_method, block_migration,
                                migrate_data, finish_event,
                                disk_paths):
        # Get the total amount of data to migrate, including RAM and local disk files
        # data_gb: total GB of RAM and disk to transfer
        data_gb = self._live_migration_data_gb(instance, disk_paths)
        
        # e.g. downtime_steps = [(0, 46), (300, 47), (600, 48), (900, 51), (1200, 57), (1500, 66), (1800, 84), (2100, 117), (2400, 179), (2700, 291), (3000, 500)]
        # downtime_steps is computed from the following inputs:
        #    data_gb
        #    CONF.libvirt.live_migration_downtime
        #    CONF.libvirt.live_migration_downtime_steps
        #    CONF.libvirt.live_migration_downtime_delay
        # Meaning of downtime_steps:
        #     Each tuple is one step; the max downtime is handed to libvirtd step by step
        #     (delay, downtime): once `delay` seconds have elapsed since the migration
        #     started, libvirtd's max downtime is set to `downtime` (in ms)
        #     The final step carries the full CONF.libvirt.live_migration_downtime value
        #     If the downtime libvirtd estimates in an iteration is within the max downtime
        #     currently set, the exit condition is met
        # NOTE: the max downtime grows with every step until it reaches the maximum the
        #       user is willing to tolerate. Nova is probing for the smallest max downtime
        #       that works, so that the migration can finish as early as possible.
        downtime_steps = list(self._migration_downtime_steps(data_gb))
        ...
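        # Number of polling iterations
        n = 0
        # Time at which monitoring started
        start = time.time()
        progress_time = start

        # progress_watermark records the smallest amount of remaining data seen so far;
        # while data is actually being migrated, this watermark keeps decreasing
        progress_watermark = None

        # Whether post-copy mode is enabled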
        is_post_copy_enabled = self._is_post_copy_enabled(migration_flags)
        while True:
            # Get the live migration job information
            info = guest.get_job_info()
            ...
            elif info.type == libvirt.VIR_DOMAIN_JOB_UNBOUNDED:
                # Migration is still running
                #
                # This is where we wire up calls to change live
                # migration status. eg change max downtime, cancel
                # the operation, change max bandwidth
                libvirt_migrate.run_tasks(guest, instance,
                                          self.active_migrations,
                                          on_migration_failure,
                                          migration,
                                          is_post_copy_enabled)

                now = time.time()
                elapsed = now - start

                if ((progress_watermark is None) or
                    (progress_watermark == 0) or
                    (progress_watermark > info.data_remaining)):
                    progress_watermark = info.data_remaining
                    progress_time = now

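                # progress_timeout guards against a migration that stalls,
                # for example because of a libvirtd fault.
                # If no progress is made for this long, the migration is aborted
                progress_timeout = CONF.libvirt.live_migration_progress_timeout

                # completion_timeout keeps libvirtd from staying in the migrating state too long.
                # With low network bandwidth, for instance, a migration can drag on
                # and congest the management network.
                # It counts from the first poll; if the migration has not finished
                # by then, it is aborted
                completion_timeout = int(
                    CONF.libvirt.live_migration_completion_timeout * data_gb)

                # Decide whether the migration should be aborted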
                if libvirt_migrate.should_abort(instance, now, progress_time,
                                                progress_timeout, elapsed,
                                                completion_timeout,
                                                migration.status):
                    try:
                        guest.abort_job()
                    except libvirt.libvirtError as e:
                        LOG.warning(_LW("Failed to abort migration %s"),
                                    e, instance=instance)
                        self._clear_empty_migration(instance)
                        raise

                # Decide whether to switch to post-copy mode
                if (is_post_copy_enabled and
                    libvirt_migrate.should_switch_to_postcopy(
                    info.memory_iteration, info.data_remaining,
                    previous_data_remaining, migration.status)):
                    # Switch to post-copy
                    libvirt_migrate.trigger_postcopy_switch(guest,
                                                            instance,
                                                            migration)
                previous_data_remaining = info.data_remaining

                # Step through the max downtime schedule as time elapses
                curdowntime = libvirt_migrate.update_downtime(
                    guest, instance, curdowntime,
                    downtime_steps, elapsed)

                if (n % 10) == 0:
                    remaining = 100
                    if info.memory_total != 0:
                        # Compute the percentage of memory still to be migrated
                        remaining = round(info.memory_remaining *
                                          100 / info.memory_total)

                    libvirt_migrate.save_stats(instance, migration,
                                               info, remaining)

                    # Log at info level every 60 polls
                    # and at debug level every 10 polls
                    lg = LOG.debug
                    if (n % 60) == 0:
                        lg = LOG.info

                    # Log the elapsed seconds, the remaining memory, and the progress
                    lg(_LI("Migration running for %(secs)d secs, "
                           "memory %(remaining)d%% remaining; "
                           "(bytes processed=%(processed_memory)d, "
                           "remaining=%(remaining_memory)d, "
                           "total=%(total_memory)d)"),
                       {"secs": n / 2, "remaining": remaining,
                        "processed_memory": info.memory_processed,
                        "remaining_memory": info.memory_remaining,
                        "total_memory": info.memory_total}, instance=instance)
                    if info.data_remaining > progress_watermark:
                        lg(_LI("Data remaining %(remaining)d bytes, "
                               "low watermark %(watermark)d bytes "
                               "%(last)d seconds ago"),
                           {"remaining": info.data_remaining,
                            "watermark": progress_watermark,
                            "last": (now - progress_time)}, instance=instance)

                n = n + 1
            # Migration completed
            elif info.type == libvirt.VIR_DOMAIN_JOB_COMPLETED:
                # Migration is all done
                LOG.info(_LI("Migration operation has completed"),
                         instance=instance)
                post_method(context, instance, dest, block_migration,
                            migrate_data)
                break
            # Migration failed
            elif info.type == libvirt.VIR_DOMAIN_JOB_FAILED:
                # Migration did not succeed
                LOG.error(_LE("Migration operation has aborted"),
                          instance=instance)
                libvirt_migrate.run_recover_tasks(self._host, guest, instance,
                                                  on_migration_failure)
                recover_method(context, instance, dest, block_migration,
                               migrate_data)
                break
            # Migration cancelled
            elif info.type == libvirt.VIR_DOMAIN_JOB_CANCELLED:
                # Migration was stopped by admin
                LOG.warning(_LW("Migration operation was cancelled"),
                         instance=instance)
                libvirt_migrate.run_recover_tasks(self._host, guest, instance,
                                                  on_migration_failure)
                recover_method(context, instance, dest, block_migration,
                               migrate_data, migration_status='cancelled')
                break
            else:
                LOG.warning(_LW("Unexpected migration job type: %d"),
                         info.type, instance=instance)

            time.sleep(0.5)
        self._clear_empty_migration(instance)    

    def _live_migration_data_gb(self, instance, disk_paths):
        '''Calculate total amount of data to be transferred

        :param instance: the nova.objects.Instance being migrated
        :param disk_paths: list of disk paths that are being migrated
        with instance

        Calculates the total amount of data that needs to be
        transferred during the live migration. The actual
        amount copied will be larger than this, due to the
        guest OS continuing to dirty RAM while the migration
        is taking place. So this value represents the minimal
        data size possible.

        :returns: data size to be copied in GB
        '''

        ram_gb = instance.flavor.memory_mb * units.Mi / units.Gi
        if ram_gb < 2:
            ram_gb = 2

        disk_gb = 0
        for path in disk_paths:
            try:
                size = os.stat(path).st_size
                size_gb = (size / units.Gi)
                if size_gb < 2:
                    size_gb = 2
                disk_gb += size_gb
            except OSError as e:
                LOG.warning(_LW("Unable to stat %(disk)s: %(ex)s"),
                            {'disk': path, 'ex': e})
                # Ignore error since we don't want to break
                # the migration monitoring thread operation
        # Return the combined size of RAM and disks
        return ram_gb + disk_gb

    @staticmethod
    def _migration_downtime_steps(data_gb):
        '''Calculate downtime value steps and time between increases.

        :param data_gb: total GB of RAM and disk to transfer

        This looks at the total downtime steps and upper bound
        downtime value and uses an exponential backoff. So initially
        max downtime is increased by small amounts, and as time goes
        by it is increased by ever larger amounts

        For example, with 10 steps, 30 second step delay, 3 GB
        of RAM and 400ms target maximum downtime, the downtime will
        be increased every 90 seconds in the following progression:

        -   0 seconds -> set downtime to  37ms
        -  90 seconds -> set downtime to  38ms
        - 180 seconds -> set downtime to  39ms
        - 270 seconds -> set downtime to  42ms
        - 360 seconds -> set downtime to  46ms
        - 450 seconds -> set downtime to  55ms
        - 540 seconds -> set downtime to  70ms
        - 630 seconds -> set downtime to  98ms
        - 720 seconds -> set downtime to 148ms
        - 810 seconds -> set downtime to 238ms
        - 900 seconds -> set downtime to 400ms

        This allows the guest a good chance to complete migration
        with a small downtime value.
        '''
        # These config options control the details of live migration
        downtime = CONF.libvirt.live_migration_downtime
        steps = CONF.libvirt.live_migration_downtime_steps
        delay = CONF.libvirt.live_migration_downtime_delay

        # TODO(hieulq): Need to move min/max value into the config option,
        # currently oslo_config will raise ValueError instead of setting
        # option value to its min/max.
        if downtime < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN:
            downtime = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_MIN
        if steps < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN:
            steps = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_STEPS_MIN
        if delay < nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN:
            delay = nova.conf.libvirt.LIVE_MIGRATION_DOWNTIME_DELAY_MIN
        delay = int(delay * data_gb)

        offset = downtime / float(steps + 1)
        base = (downtime - offset) ** (1 / float(steps))

        for i in range(steps + 1):
            yield (int(delay * i), int(offset + base ** i))   
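
The two helpers above can be tried outside Nova. The following is a minimal sketch, not the driver code itself: `estimate_data_gb` and `downtime_schedule` are hypothetical standalone names, the keyword arguments stand in for the `CONF.libvirt.live_migration_downtime*` options, and the minimum-value clamping is left out. It uses Python 3 true division, so sizes come out as floats.

```python
MI = 1024 ** 2  # stand-in for oslo_utils.units.Mi
GI = 1024 ** 3  # stand-in for oslo_utils.units.Gi


def estimate_data_gb(memory_mb, disk_sizes_bytes):
    """Lower bound, in GB, of the data a live migration must copy.

    Hypothetical helper: takes plain numbers instead of an Instance
    object and a list of disk paths.
    """
    # RAM and every disk are each counted as at least 2 GB.
    ram_gb = max(memory_mb * MI / GI, 2)
    disk_gb = sum(max(size / GI, 2) for size in disk_sizes_bytes)
    return ram_gb + disk_gb


def downtime_schedule(data_gb, downtime=400, steps=10, delay=30):
    """Yield (seconds_elapsed, max_downtime_ms) pairs.

    The defaults are the values used in the docstring example above;
    they are assumptions for this sketch, not Nova's defaults.
    """
    # The wait between increases scales with the amount of data.
    delay = int(delay * data_gb)
    offset = downtime / float(steps + 1)
    base = (downtime - offset) ** (1 / float(steps))
    for i in range(steps + 1):
        yield (int(delay * i), int(offset + base ** i))


if __name__ == "__main__":
    size = estimate_data_gb(3072, [])  # 3 GB of RAM, no local disks
    for marker, ms in downtime_schedule(size):
        print("%4d s -> set downtime to %3d ms" % (marker, ms))
```

The schedule starts with small increments and ends near the configured maximum, so a guest that dirties memory slowly can finish with a short pause, while a busy guest is gradually allowed a longer one.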


# nova/nova/virt/libvirt/migration.py

def update_downtime(guest, instance,
                    olddowntime,
                    downtime_steps, elapsed):
    """Update max downtime if needed

    :param guest: a nova.virt.libvirt.guest.Guest to set downtime for
    :param instance: a nova.objects.Instance
    :param olddowntime: current set downtime, or None
    :param downtime_steps: list of downtime steps
    :param elapsed: total time of migration in secs

    Determine if the maximum downtime needs to be increased
    based on the downtime steps. Each element in the downtime
    steps list should be a 2 element tuple. The first element
    contains a time marker and the second element contains
    the downtime value to set when the marker is hit.

    The guest object will be used to change the current
    downtime value on the instance.

    Any errors hit when updating downtime will be ignored

    :returns: the new downtime value
    """
    LOG.debug("Current %(dt)s elapsed %(elapsed)d steps %(steps)s",
              {"dt": olddowntime, "elapsed": elapsed,
               "steps": downtime_steps}, instance=instance)
    thisstep = None
    for step in downtime_steps:
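        # elapsed is how long the migration has been running, in seconds
        if elapsed > step[0]:
            # The time marker of this step has passed, so it becomes the
            # current step; the last step that matches wins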
            thisstep = step

    if thisstep is None:
        LOG.debug("No current step", instance=instance)
        return olddowntime

    if thisstep[1] == olddowntime:
        LOG.debug("Downtime does not need to change",
                  instance=instance)
        return olddowntime

    LOG.info(_LI("Increasing downtime to %(downtime)d ms "
                 "after %(waittime)d sec elapsed time"),
             {"downtime": thisstep[1],
              "waittime": thisstep[0]},
             instance=instance)

    try:
        # Pass the current max downtime to libvirtd
        guest.migrate_configure_max_downtime(thisstep[1])
    except libvirt.libvirtError as e:
        LOG.warning(_LW("Unable to increase max downtime to %(time)d"
                        "ms: %(e)s"),
                    {"time": thisstep[1], "e": e}, instance=instance)
    return thisstep[1]
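
The step selection in `update_downtime` can be exercised without libvirt. This is a minimal sketch under stated assumptions: `FakeGuest` is a hypothetical stand-in that only records calls, and `select_step` and `apply_downtime` are illustrative names, not Nova functions. Logging and the `libvirtError` handling are omitted.

```python
def select_step(downtime_steps, elapsed):
    """Return the last (marker, downtime_ms) whose marker has passed."""
    current = None
    for marker, downtime_ms in downtime_steps:
        # Strict comparison, as in update_downtime: a step only
        # applies once elapsed time exceeds its marker.
        if elapsed > marker:
            current = (marker, downtime_ms)
    return current


class FakeGuest(object):
    """Hypothetical stand-in for nova.virt.libvirt.guest.Guest."""

    def __init__(self):
        self.calls = []

    def migrate_configure_max_downtime(self, downtime_ms):
        # The real Guest forwards this value to libvirt.
        self.calls.append(downtime_ms)


def apply_downtime(guest, old_downtime, downtime_steps, elapsed):
    """Raise the max downtime only when the current step changed."""
    step = select_step(downtime_steps, elapsed)
    if step is None or step[1] == old_downtime:
        return old_downtime
    guest.migrate_configure_max_downtime(step[1])
    return step[1]


if __name__ == "__main__":
    steps = [(0, 37), (90, 38), (180, 39)]
    guest = FakeGuest()
    downtime = None
    for elapsed in (0, 1, 50, 200):
        downtime = apply_downtime(guest, downtime, steps, elapsed)
    print(guest.calls)
```

Because the monitoring loop calls this every iteration, returning the old value when nothing changed keeps the number of calls to libvirt down to one per step.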

Live Migration Issues with NUMA Affinity, CPU Pinning, and SR-IOV NICs

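In 《OpenStack 虚拟机冷/热迁移功能实践与流程分析》 we migrated a virtual machine with NUMA affinity and CPU pinning, and the virtual machine kept those properties after the migration. Here we run a more extreme test: migrating a virtual machine with NUMA affinity and dedicated CPU pinning to a destination host whose NUMA and CPU resources are already exhausted.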

[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Field                                | Value                                                                                                                                      |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                                                                                                                       |
| OS-EXT-AZ:availability_zone          | nova                                                                                                                                       |
| OS-EXT-SRV-ATTR:host                 | overcloud-ovscompute-1.localdomain                                                                                                         |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-ovscompute-1.localdomain                                                                                                         |
| OS-EXT-SRV-ATTR:instance_name        | instance-000000d6                                                                                                                          |
| OS-EXT-STS:power_state               | Running                                                                                                                                    |
| OS-EXT-STS:task_state                | None                                                                                                                                       |
| OS-EXT-STS:vm_state                  | active                                                                                                                                     |
| OS-SRV-USG:launched_at               | 2019-03-20T10:45:55.000000                                                                                                                 |
| OS-SRV-USG:terminated_at             | None                                                                                                                                       |
| accessIPv4                           |                                                                                                                                            |
| accessIPv6                           |                                                                                                                                            |
| addresses                            | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19                                                                       |
| config_drive                         |                                                                                                                                            |
| created                              | 2019-03-20T10:44:52Z                                                                                                                       |
| flavor                               | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788)                                                                                             |
| hostId                               | 9f1230901ddf3fe0e1a41e1c650a784c122b791f89fdf66a40cff3d6                                                                                   |
| id                                   | a17ddcbf-d936-4c77-9ea6-2e684c41cc39                                                                                                       |
| image                                | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb)                                                                        |
| key_name                             | stack                                                                                                                                      |
| name                                 | VM1                                                                                                                                        |
| os-extended-volumes:volumes_attached | []                                                                                                                                         |
| progress                             | 0                                                                                                                                          |
| project_id                           | a6c78435075246f3aa5ab946b87086c5                                                                                                           |
| properties                           |                                                                                                                                            |
| security_groups                      | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}] |
| status                               | ACTIVE                                                                                                                                     |
| updated                              | 2019-03-20T10:45:56Z                                                                                                                       |
| user_id                              | 4fe574569664493bbd660abfe762a630                                                                                                           |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+

[stack@undercloud (overcloudrc) ~]$ openstack server migrate --block-migration --live overcloud-ovscompute-0.localdomain --wait VM1
Complete

[stack@undercloud (overcloudrc) ~]$ openstack server show VM1
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Field                                | Value                                                                                                                                      |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| OS-DCF:diskConfig                    | AUTO                                                                                                                                       |
| OS-EXT-AZ:availability_zone          | ovs                                                                                                                                        |
| OS-EXT-SRV-ATTR:host                 | overcloud-ovscompute-0.localdomain                                                                                                         |
| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-ovscompute-0.localdomain                                                                                                         |
| OS-EXT-SRV-ATTR:instance_name        | instance-000000d6                                                                                                                          |
| OS-EXT-STS:power_state               | Running                                                                                                                                    |
| OS-EXT-STS:task_state                | None                                                                                                                                       |
| OS-EXT-STS:vm_state                  | active                                                                                                                                     |
| OS-SRV-USG:launched_at               | 2019-03-20T10:45:55.000000                                                                                                                 |
| OS-SRV-USG:terminated_at             | None                                                                                                                                       |
| accessIPv4                           |                                                                                                                                            |
| accessIPv6                           |                                                                                                                                            |
| addresses                            | net1=10.0.1.11, 10.0.1.8, 10.0.1.16, 10.0.1.10, 10.0.1.18, 10.0.1.19                                                                       |
| config_drive                         |                                                                                                                                            |
| created                              | 2019-03-20T10:44:52Z                                                                                                                       |
| flavor                               | Flavor1 (2ff09ec5-19e4-40b9-a52e-6026652c0788)                                                                                             |
| hostId                               | 0f2ec590cd73fe0e9522f1ba715dae7a7d4b884e15aa8254defe85d0                                                                                   |
| id                                   | a17ddcbf-d936-4c77-9ea6-2e684c41cc39                                                                                                       |
| image                                | CentOS-7-x86_64-GenericCloud (0aff2888-47f8-4133-928a-9c54414b3afb)                                                                        |
| key_name                             | stack                                                                                                                                      |
| name                                 | VM1                                                                                                                                        |
| os-extended-volumes:volumes_attached | []                                                                                                                                         |
| progress                             | 0                                                                                                                                          |
| project_id                           | a6c78435075246f3aa5ab946b87086c5                                                                                                           |
| properties                           |                                                                                                                                            |
| security_groups                      | [{u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}, {u'name': u'default'}] |
| status                               | ACTIVE                                                                                                                                     |
| updated                              | 2019-03-20T10:51:47Z                                                                                                                       |
| user_id                              | 4fe574569664493bbd660abfe762a630                                                                                                           |
+--------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+

Exception logged during the migration

2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager [req-566373ae-5282-4378-9678-d8d08e121cdb - - - - -] Error updating resources for node overcloud-ovscompute-0.localdomain.
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager Traceback (most recent call last):
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6590, in update_available_resource_for_node
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     rt.update_available_resource(context)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 536, in update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_available_resource(context, resources)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     return f(*args, **kwargs)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 896, in _update_available_resource
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_usage_from_instances(context, instances)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1393, in _update_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self._update_usage_from_instance(context, instance)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1273, in _update_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     sign, is_periodic)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 1119, in _update_usage
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self.compute_node, usage, free)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1574, in get_host_numa_usage_from_instance
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     host_numa_topology, instance_numa_topology, free=free))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1447, in numa_usage_from_instances
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     newcell.pin_cpus(pinned_cpus)
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager   File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 86, in pin_cpus
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager     self.pinned_cpus))
2019-03-20 10:52:48.401 424891 ERROR nova.compute.manager CPUPinningInvalid: CPU set to pin [0, 1] must be a subset of free CPU set [8]

NUMA affinity and CPU pinning after the migration

# The migrated virtual machine
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d6
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 1
   
# The virtual machine that already existed on the destination host
[root@overcloud-ovscompute-0 ~]# virsh vcpupin instance-000000d0
VCPU: CPU Affinity
----------------------------------
   0: 0
   1: 1
   2: 2
   3: 3
   4: 4
   5: 5
   6: 6
   7: 7

Excerpt from the migrated virtual machine's XML file

  
[XML excerpt: the markup was lost in extraction; only the CPU model value IvyBridge survives]


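Conclusion: the virtual machine migrates successfully and keeps its original NUMA and CPU properties, even though pCPUs 0 and 1 on the destination host are already pinned by instance-000000d0. The Dedicated CPU Policy is a Nova-level concept, yet as the code analysis above shows, Nova's live migration path is entirely NUMA-unaware; the resource tracker only notices the conflict afterwards and logs CPUPinningInvalid. The hypervisor layer cares even less about these parameters. It follows the XML description faithfully: if the XML says to use pCPUs 0 and 1, it schedules the vCPUs there even when those pCPUs are already occupied by another virtual machine. From Nova's point of view this is a bug, and the community has already described the problem and proposed a blueprint, "NUMA-aware live migration".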
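
The check behind the `CPUPinningInvalid` error in the log can be sketched as a set comparison. This is an illustration only: `HostCell` is a hypothetical class, not Nova's `NUMACell`, and the host CPU set 0-8 is an assumption inferred from the free set `[8]` in the log and the `virsh vcpupin` output.

```python
class CPUPinningInvalid(Exception):
    """Hypothetical stand-in for nova.exception.CPUPinningInvalid."""


class HostCell(object):
    """Tracks which pCPUs of a host NUMA cell are already pinned."""

    def __init__(self, cpuset, pinned=()):
        self.cpuset = set(cpuset)
        self.pinned_cpus = set(pinned)

    @property
    def free_cpus(self):
        return self.cpuset - self.pinned_cpus

    def pin_cpus(self, cpus):
        cpus = set(cpus)
        # Reject the request unless every pCPU is still free.
        if not cpus <= self.free_cpus:
            raise CPUPinningInvalid(
                "CPU set to pin %s must be a subset of free CPU set %s"
                % (sorted(cpus), sorted(self.free_cpus)))
        self.pinned_cpus |= cpus


if __name__ == "__main__":
    # Assumed destination host: pCPUs 0-8, with 0-7 pinned already.
    cell = HostCell(range(9), pinned=range(8))
    try:
        cell.pin_cpus([0, 1])  # what the migrated VM's XML asks for
    except CPUPinningInvalid as exc:
        print(exc)
```

In the test above this kind of accounting only ran in the periodic resource update, after the virtual machine was already running on the destination, which is why the migration itself reported success.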

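As for SR-IOV, the official Nova documentation states explicitly that live migration of SR-IOV virtual machines is not supported. As I analyzed in 《启用 SR-IOV 解决 Neutron 网络 I/O 性能瓶颈》, an SR-IOV VF device is, from the KVM virtual machine's point of view, just an XML element, e.g.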


   
      

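As long as a VF device matching this element can be found on the destination compute node, the SR-IOV NIC can be migrated. The catch is that, in principle, the XML of a live-migrated virtual machine should not be modified, although in practice rewriting a single VF element is probably harmless. What matters is a rollback plan for failed migrations and making Nova SR-IOV-aware, that is, able to detect and manage these devices. Writing this, I hope more and more that OpenStack Placement matures quickly, because Nova's "black box" management of NUMA, SR-IOV and similar resources is so painful.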

Conclusion

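The analysis of the implementation and code of OpenStack cold and live migration shows that Nova only wraps and orchestrates traditional migration methods and the migration features of the underlying hypervisor software, so that cold and live migration reach the level an enterprise cloud platform requires. The main technical value lies in the underlying technology, as is the case for other OpenStack projects.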

References

https://developers.redhat.com/blog/2015/03/24/live-migrating-qemu-kvm-virtual-machines/
https://www.cnblogs.com/sammyliu/p/4572287.html
https://docs.openstack.org/nova/pike/admin/configuring-migrations.html
https://docs.openstack.org/nova/pike/admin/live-migration-usage.html
https://blog.csdn.net/lemontree1945/article/details/79901874
https://www.ibm.com/developerworks/cn/linux/l-cn-mgrtvm1/index.html
https://blog.csdn.net/hawkerou/article/details/53482268
