We recently hit a virtual machine creation failure in our OpenStack deployment.
The symptoms:
A virtual machine is created,
it lands in the ERROR state, and the error points at nova/compute/manager.py:1902.
The cause is the resource limits check failing against:
{u'memory_mb': 130669.0, u'disk_gb': 199.0}
yet the host's real disk size is far larger than 199 GB.
The logs:
INFO nova.compute.manager [req-ac70ea37-c0df-4909-8322-... - - -] {u'memory_mb': 130669.0, u'disk_gb': 199.0}
INFO nova.compute.claims [req-ac70ea37-c0df-4909-8322-2...- - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] Attempting claim: memory 2048 MB, disk 60 GB, vcpus 1 CPU
INFO nova.compute.claims [req-ac70ea37-c0df-4909-8322-2...- - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] Total memory: 130669 MB, used: 23552.00 MB
INFO nova.compute.claims [req-ac70ea37-c0df-4909-8322-2...- - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] memory limit: 130669.00 MB, free: 107117.00 MB
INFO nova.compute.claims [req-ac70ea37-c0df-4909-8322-2...- - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] Total disk: 111710 GB, used: 181.00 GB
INFO nova.compute.claims [req-ac70ea37-c0df-4909-8322-2...- - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] disk limit: 199.00 GB, free: 18.00 GB
INFO nova.compute.claims [req-ac70ea37-c0df-4909-8322-2...- - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] Total vcpu: 31 VCPU, used: 4.00 VCPU
INFO nova.compute.claims [req-ac70ea37-c0df-4909-8322-2...- - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] vcpu limit not specified, defaulting to unlimited
INFO nova.compute.manager [req-ac70ea37-c0df-4909-8322-... - - -] [instance: BE099D73-3273-489C-B468-1772C17A2A74] Took 0.06 seconds to deallocate network for instance.
INFO nova.compute.resource_tracker [req-e1d70ac4-a379-4.... - -] Total usable vcpus: 31, total allocated vcpus: 4
INFO nova.compute.resource_tracker [req-e1d70ac4-a379-4.... - -] Final resource view: name=test phys_ram=130669MB used_ram=23552MB phys_disk=111710GB used_disk=181GB total_vcpus=31 used_vcpus=4 pci_stats=[]
WARNING nova.scheduler.client.report [req-e1d70ac4-a379.... - - -] Unable to refresh my resource provider record
INFO nova.compute.resource_tracker [req-e1d70ac4-a379-4.... - -] Compute_service record updated for test:test
The disk numbers in the log tell the story: the limit handed to the compute node is 199 GB, 181 GB is accounted as used, so only 18 GB is "free", less than the requested 60 GB, even though the node actually has 111710 GB. The limit itself is wrong. To see where it is enforced, start from the source: nova/compute/manager.py
def _build_and_run_instance(self, context, instance, image, injected_files,
        admin_password, requested_networks, security_groups,
        block_device_mapping, node, limits, filter_properties):

    image_name = image.get('name')
    self._notify_about_instance_usage(context, instance, 'create.start',
            extra_usage_info={'image_name': image_name})

    self._check_device_tagging(requested_networks, block_device_mapping)

    try:
        rt = self._get_resource_tracker(node)
        with rt.instance_claim(context, instance, limits):
            # NOTE(russellb) It's important that this validation be done
            # *after* the resource tracker instance claim, as that is where
            # the host is set on the instance.
            self._validate_instance_group_policy(context, instance,
                    filter_properties)
            image_meta = objects.ImageMeta.from_dict(image)
            with self._build_resources(context, instance,
                    requested_networks, security_groups, image_meta,
                    block_device_mapping) as resources:
                instance.vm_state = vm_states.BUILDING
                instance.task_state = task_states.SPAWNING
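Before digging further, it helps to see the shape of this pattern: the claim is used as a context manager whose validation happens at construction time, and whose exit handler reverts the reservation if the guarded build raises. A toy sketch of the pattern (illustrative only, not Nova code):

class ToyClaim(object):
    """Minimal stand-in for nova's Claim: validate up front, revert on error."""

    def __init__(self, requested_gb, free_gb):
        # Like nova's Claim, the test runs in the constructor; a failed
        # claim never even enters the with-block.
        if requested_gb > free_gb:
            raise RuntimeError('Free disk %.02f GB < requested %d GB'
                               % (free_gb, requested_gb))
        self.requested_gb = requested_gb

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is not None:
            self.abort()  # release the reservation when the build fails

    def abort(self):
        pass  # real code would give the reserved resources back to the tracker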
The exception is raised inside "with rt.instance_claim(context, instance, limits)": when the claim fails, ComputeResourcesUnavailable is raised and the build is aborted. The definition lives in nova/compute/resource_tracker.py:
@utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE)
def instance_claim(self, context, instance, limits=None):
    """Indicate that some resources are needed for an upcoming compute
    instance build operation.

    This should be called before the compute node is about to perform
    an instance build operation that will consume additional resources.

    :param context: security context
    :param instance: instance to reserve resources for.
    :type instance: nova.objects.instance.Instance object
    :param limits: Dict of oversubscription limits for memory, disk,
                   and CPUs.
    :returns: A Claim ticket representing the reserved resources. It can
              be used to revert the resource usage if an error occurs
              during the instance build.
    """
    .....
    # get the overhead required to build this instance:
    overhead = self.driver.estimate_instance_overhead(instance)
    LOG.debug("Memory overhead for %(flavor)d MB instance; %(overhead)d "
              "MB", {'flavor': instance.flavor.memory_mb,
                     'overhead': overhead['memory_mb']})
    LOG.debug("Disk overhead for %(flavor)d GB instance; %(overhead)d "
              "GB", {'flavor': instance.flavor.root_gb,
                     'overhead': overhead.get('disk_gb', 0)})

    pci_requests = objects.InstancePCIRequests.get_by_instance_uuid(
        context, instance.uuid)
    claim = claims.Claim(context, instance, self, self.compute_node,
                         pci_requests, overhead=overhead, limits=limits)

    # self._set_instance_host_and_node() will save instance to the DB
    # so set instance.numa_topology first. We need to make sure
    # that numa_topology is saved while under COMPUTE_RESOURCE_SEMAPHORE
    # so that the resource audit knows about any cpus we've pinned.
    instance_numa_topology = claim.claimed_numa_topology
    .....
nova/compute/claims.py
class Claim(NopClaim):
    """A declaration that a compute host operation will require free resources.
    Claims serve as marker objects that resources are being held until the
    update_available_resource audit process runs to do a full reconciliation
    of resource usage.

    This information will be used to help keep the local compute hosts's
    ComputeNode model in sync to aid the scheduler in making efficient / more
    correct decisions with respect to host selection.
    """

    def __init__(self, context, instance, tracker, resources, pci_requests,
                 overhead=None, limits=None):
        super(Claim, self).__init__()
        # Stash a copy of the instance at the current point of time
        ....
        # Check claim at constructor to avoid mess code
        # Raise exception ComputeResourcesUnavailable if claim failed
        self._claim_test(resources, limits)
The claim test checks whether the requested vcpus, memory, disk, and NUMA topology fit within the given limits:
def _claim_test(self, resources, limits=None):
    """Test if this claim can be satisfied given available resources and
    optional oversubscription limits

    This should be called before the compute node actually consumes the
    resources required to execute the claim.

    :param resources: available local compute node resources
    :returns: Return true if resources are available to claim.
    """
    if not limits:
        limits = {}

    # If an individual limit is None, the resource will be considered
    # unlimited:
    memory_mb_limit = limits.get('memory_mb')
    disk_gb_limit = limits.get('disk_gb')
    vcpus_limit = limits.get('vcpu')
    numa_topology_limit = limits.get('numa_topology')

    LOG.info(_LI("Attempting claim: memory %(memory_mb)d MB, "
                 "disk %(disk_gb)d GB, vcpus %(vcpus)d CPU"),
             {'memory_mb': self.memory_mb, 'disk_gb': self.disk_gb,
              'vcpus': self.vcpus}, instance=self.instance)

    reasons = [self._test_memory(resources, memory_mb_limit),
               self._test_disk(resources, disk_gb_limit),
               self._test_vcpus(resources, vcpus_limit),
               self._test_numa_topology(resources, numa_topology_limit),
               self._test_pci()]
    reasons = [r for r in reasons if r is not None]

    if len(reasons) > 0:
        raise exception.ComputeResourcesUnavailable(reason=
                "; ".join(reasons))
The disk check:
def _test_disk(self, resources, limit):
    type_ = _("disk")
    unit = "GB"
    total = resources.local_gb
    used = resources.local_gb_used
    requested = self.disk_gb

    return self._test(type_, unit, total, used, requested, limit)
_test compares the requested amount against what is still free under the limit:
def _test(self, type_, unit, total, used, requested, limit):
    """Test if the given type of resource needed for a claim can be safely
    allocated.
    """
    LOG.info(_LI('Total %(type)s: %(total)d %(unit)s, used: %(used).02f '
                 '%(unit)s'),
             {'type': type_, 'total': total, 'unit': unit, 'used': used},
             instance=self.instance)

    if limit is None:
        # treat resource as unlimited:
        LOG.info(_LI('%(type)s limit not specified, defaulting to '
                     'unlimited'), {'type': type_}, instance=self.instance)
        return

    free = limit - used

    # Oversubscribed resource policy info:
    LOG.info(_LI('%(type)s limit: %(limit).02f %(unit)s, '
                 'free: %(free).02f %(unit)s'),
             {'type': type_, 'limit': limit, 'free': free, 'unit': unit},
             instance=self.instance)

    if requested > free:
        return (_('Free %(type)s %(free).02f '
                  '%(unit)s < requested %(requested)d %(unit)s') %
                {'type': type_, 'free': free, 'unit': unit,
                 'requested': requested})
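Plugging the values from our failing request into this check makes the rejection obvious (a standalone reproduction, not Nova code):

# Values taken from the claim log lines above:
total = 111710.0   # resources.local_gb -- the node's real disk capacity
used = 181.0       # resources.local_gb_used
requested = 60     # disk_gb requested by the new instance's flavor
limit = 199.0      # the disk_gb limit handed down by the scheduler

free = limit - used  # 18.0
if requested > free:
    print('Free disk %.02f GB < requested %d GB' % (free, requested))

Because free is derived from the limit rather than from the actual capacity, a node with roughly 111 TB of disk ends up rejecting a 60 GB request.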
Now trace where limits comes from, working back through the create path. The conductor's build_instances dispatches the build to the compute node with limits=host['limits']:
def build_instances(self, context, instances, image, filter_properties,
        admin_password, injected_files, requested_networks,
        security_groups, block_device_mapping=None, legacy_bdm=True):
    # TODO(ndipanov): Remove block_device_mapping and legacy_bdm in version
    #                 2.0 of the RPC API.
    ....
    self.compute_rpcapi.build_and_run_instance(context,
            instance=instance, host=host['host'], image=image,
            request_spec=request_spec,
            filter_properties=local_filter_props,
            admin_password=admin_password,
            injected_files=injected_files,
            requested_networks=requested_networks,
            security_groups=security_groups,
            block_device_mapping=bdms, node=host['nodename'],
            limits=host['limits'])
nova/compute/api.py
Stepping back to the start of the flow, the API-side create call:
(instances, resv_id) = self.compute_api.create(context,
        inst_type,
        image_uuid,
        display_name=name,
        display_description=description,
        availability_zone=availability_zone,
        forced_host=host, forced_node=node,
        metadata=server_dict.get('metadata', {}),
        admin_password=password,
        requested_networks=requested_networks,
        check_server_group_quota=True,
        **create_kwargs)
nova/conductor/manager.py
Earlier in the same build_instances, the conductor obtains the hosts (and their limits):
def build_instances(self, context, instances, image, filter_properties,
        admin_password, injected_files, requested_networks,
        security_groups, block_device_mapping=None, legacy_bdm=True):
    # TODO(ndipanov): Remove block_device_mapping and legacy_bdm in version
    ....
    request_spec = scheduler_utils.build_request_spec(
        context, image, instances)
    hosts = self._schedule_instances(
        context, request_spec, filter_properties)
So hosts comes out of scheduling via _schedule_instances. Looking at the scheduler:
nova/scheduler/manager.py
@messaging.expected_exceptions(exception.NoValidHost)
def select_destinations(self, ctxt,
                        request_spec=None, filter_properties=None,
                        spec_obj=_sentinel):
    """Returns destinations(s) best suited for this RequestSpec.

    The result should be a list of dicts with 'host', 'nodename' and
    'limits' as keys.
    """
    # TODO(sbauza): Change the method signature to only accept a spec_obj
    # argument once API v5 is provided.
    if spec_obj is self._sentinel:
        spec_obj = objects.RequestSpec.from_primitives(ctxt,
                                                       request_spec,
                                                       filter_properties)
    dests = self.driver.select_destinations(ctxt, spec_obj)
    return jsonutils.to_primitive(dests)
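For our failing request, the destinations handed back to the conductor would have looked roughly like this (shape per the docstring above; host/node names and limit values reconstructed from our logs, not captured output):

dests = [{
    'host': 'test',
    'nodename': 'test',
    'limits': {u'memory_mb': 130669.0, u'disk_gb': 199.0},
}]

This limits dict is exactly what later shows up in the compute log and feeds _claim_test.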
nova/scheduler/filter_scheduler.py
class FilterScheduler(driver.Scheduler):
    """Scheduler that can be used for filtering and weighing."""
    def __init__(self, *args, **kwargs):
        super(FilterScheduler, self).__init__(*args, **kwargs)
        self.options = scheduler_options.SchedulerOptions()
        self.notifier = rpc.get_notifier('scheduler')

    def select_destinations(self, context, spec_obj):
        """Selects a filtered set of hosts and nodes."""
        self.notifier.info(
            context, 'scheduler.select_destinations.start',
            dict(request_spec=spec_obj.to_legacy_request_spec_dict()))

        num_instances = spec_obj.num_instances
        selected_hosts = self._schedule(context, spec_obj)

        # Couldn't fulfill the request_spec
        if len(selected_hosts) < num_instances:
            # NOTE(Rui Chen): If multiple creates failed, set the updated time
            # of selected HostState to None so that these HostStates are
            # refreshed according to database in next schedule, and release
            # the resource consumed by instance in the process of selecting
            # host.
            for host in selected_hosts:
                host.obj.updated = None

            # Log the details but don't put those into the reason since
            # we don't want to give away too much information about our
            # actual environment.
            LOG.debug('There are %(hosts)d hosts available but '
                      '%(num_instances)d instances requested to build.',
                      {'hosts': len(selected_hosts),
                       'num_instances': num_instances})

            reason = _('There are not enough hosts available.')
            raise exception.NoValidHost(reason=reason)

        dests = [dict(host=host.obj.host, nodename=host.obj.nodename,
                      limits=host.obj.limits) for host in selected_hosts]

        self.notifier.info(
            context, 'scheduler.select_destinations.end',
            dict(request_spec=spec_obj.to_legacy_request_spec_dict()))
        return dests

    def _schedule(self, context, spec_obj):
        """Returns a list of hosts that meet the required specs,
        ordered by their fitness.
        """
        elevated = context.elevated()
        config_options = self._get_configuration_options()
        ....
        hosts = self._get_all_host_states(elevated)
nova/scheduler/host_manager.py
Fetching the host list:
def get_all_host_states(self, context):
    """Returns a list of HostStates that represents all the hosts
    the HostManager knows about. Also, each of the consumable resources
    in HostState are pre-populated and adjusted based on data in the db.
    """
    service_refs = {service.host: service
                    for service in objects.ServiceList.get_by_binary(
                        context, 'nova-compute', include_disabled=True)}
    # Get resource usage across the available compute nodes:
    compute_nodes = objects.ComputeNodeList.get_all(context)
    seen_nodes = set()
    for compute in compute_nodes:
        service = service_refs.get(compute.host)
The compute node records are read through the nova objects layer, so the data the scheduler works from can come via OpenStack's oslo caching layer rather than straight from the database. A stale HostState here would explain the bogus 199 GB limit.
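For context (not shown in the trace above): in this Nova release it is the scheduler's DiskFilter that stamps host_state.limits['disk_gb'], essentially as the total usable disk scaled by the disk allocation ratio. A minimal sketch of that calculation, under those assumptions:

# Hypothetical sketch modeled on the era's DiskFilter; names are illustrative.
def disk_gb_limit(total_usable_disk_gb, disk_allocation_ratio=1.0):
    # The limit later enforced by the compute node's claim is just the
    # total usable disk scaled by the oversubscription ratio.
    return total_usable_disk_gb * disk_allocation_ratio

print(disk_gb_limit(111710))  # 111710.0 -- what a fresh HostState would yield

A limit of 199 GB on a node reporting 111710 GB therefore cannot have been computed from current data, which again points at stale state.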
Restart the services to rebuild the cached state:
systemctl restart openstack-nova-api.service openstack-nova-scheduler.service openstack-nova-conductor.service
A new create attempt then logs:
INFO nova.compute.manager [req-503a8a5a-8a4c-403d-814b-5f... - - -] {}
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] Attempting claim: memory 2048 MB, disk 60 GB, vcpus 1 CPU
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] Total memory: 130669 MB, used: 30720.00 MB
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] memory limit not specified, defaulting to unlimited
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] Total disk: 111710 GB, used: 420.00 GB
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] disk limit not specified, defaulting to unlimited
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] Total vcpu: 31 VCPU, used: 7.00 VCPU
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] vcpu limit not specified, defaulting to unlimited
INFO nova.compute.claims [req-503a8a5a-8a4c-403d-814b-5f6... - - -] [instance: B9248A76-7D9E-4965-8D6B-239FA6C00630] Claim successful
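Note the first line after the restart: the limits dict reaching the compute node is now empty ({}). Every limits.get(...) in _claim_test therefore returns None, each resource defaults to unlimited, and the claim succeeds. The stale 199 GB disk_gb limit evidently came from cached scheduler-side state, and restarting the services cleared it. A one-liner showing why an empty dict disables the checks:

limits = {}                   # what the compute node received post-restart
print(limits.get('disk_gb'))  # None -> _test() treats disk as unlimited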