目录
一.nova-scheduler基本介绍
1.1 基本配置
1.2 调度过程
二. nova-scheduler调度过程源码剖析
2.1 /nova/scheduler/filter_scheduler.py/FilterSchduler/_schedule():
2.2 /nova/scheduler/filter_scheduler.py/FilterSchduler/_get_sorted_hosts():
2.3 /nova/scheduler/host_manager.py/HostManager/get_filtered_hosts():
2.4 /nova/filters.py/BaseFilterHandler/get_filtered_objects():
三. 过滤/权值算法讲解
3.1 过滤算法
1.RetryFilter
2.TypeFilter
3.CoreFilter
4.RamFilter
3.2 权值算法
我们在创建云主机的时候,openstack肯定需要选择在哪个计算节点上面创建云主机,然而选择的标准就是用户定义的flavor,flavor包括磁盘大小,内存大小,vCPU个数以及metadata等等,openstack会根据这些参数选择一个合适的计算节点来创建云主机。
openstack的调度器有很多类型,在此我们仅讨论filter_scheduler,在配置文件nova.conf文件中的调度器驱动scheduler_driver选项选择filter_scheduler即可(其他备选项为:caching_scheduler,chance_scheduler,fake_scheduler)。
在配置文件nova.conf中的scheduler_available_filters 和 scheduler_default_filters这两个参数用于配置openstack使用的filter。其中,scheduler_available_filters表示openstack中所有可用的filter,scheduler_default_filters表示实际使用到的过滤器。
openstack的nova调度过程主要分为两步:
1. 利用多个filters结合具体flavor对计算节点进行层层过滤;
2. 利用weighting对通过filters层层过滤的计算节点进行权重的计算,最终选择权重最大的计算节点创建云主机。
首先,该函数内部会调用_get_all_host_states()函数来获得所有可用主机和节点以及对应的状态,其中_get_all_host_states()的一个传入参数provider_summaries是通过nova的Placement_API获得的所有可以提供创建云主机要求的计算节点的摘要(Placement_API会根据预期创建的主机所需资源去目前所有计算节点去搜索是否有可以满足要求的主机和节点,如果找不到则会报错:找不到有效主机);然后,获得需要创建的云主机的个数:num_instances;最后, 对于num_instances个云主机分别调用_get_sorted_hosts()过滤掉不满足要求的主机和节点,再返回找到的用于创建云主机的主机和节点。
def _schedule(self, context, spec_obj, instance_uuids,
alloc_reqs_by_rp_uuid, provider_summaries,
allocation_request_version=None, return_alternates=False):
elevated = context.elevated()
hosts = self._get_all_host_states(elevated, spec_obj,
provider_summaries)
num_instances = (len(instance_uuids) if instance_uuids
else spec_obj.num_instances)
......
for num in range(num_instances):
hosts = self._get_sorted_hosts(spec_obj, hosts, num)
if not hosts:
# NOTE(jaypipes): If we get here, that means not all instances
# in instance_uuids were able to be matched to a selected host.
# Any allocations will be cleaned up in the
# _ensure_sufficient_hosts() call.
break
instance_uuid = instance_uuids[num]
# Attempt to claim the resources against one or more resource
# providers, looping over the sorted list of possible hosts
# looking for an allocation_request that contains that host's
# resource provider UUID
claimed_host = None
for host in hosts:
cn_uuid = host.uuid
if cn_uuid not in alloc_reqs_by_rp_uuid:
msg = ("A host state with uuid = '%s' that did not have a "
"matching allocation_request was encountered while "
"scheduling. This host was skipped.")
LOG.debug(msg, cn_uuid)
continue
alloc_reqs = alloc_reqs_by_rp_uuid[cn_uuid]
# TODO(jaypipes): Loop through all allocation_requests instead
# of just trying the first one. For now, since we'll likely
# want to order the allocation_requests in the future based on
# information in the provider summaries, we'll just try to
# claim resources using the first allocation_request
alloc_req = alloc_reqs[0]
if utils.claim_resources(elevated, self.placement_client,
spec_obj, instance_uuid, alloc_req,
allocation_request_version=allocation_request_version):
claimed_host = host
break
if claimed_host is None:
# We weren't able to claim resources in the placement API
# for any of the sorted hosts identified. So, clean up any
# successfully-claimed resources for prior instances in
# this request and return an empty list which will cause
# select_destinations() to raise NoValidHost
LOG.debug("Unable to successfully claim against any host.")
break
claimed_instance_uuids.append(instance_uuid)
claimed_hosts.append(claimed_host)
# Now consume the resources so the filter/weights will change for
# the next instance.
self._consume_selected_host(claimed_host, spec_obj)
# Check if we were able to fulfill the request. If not, this call will
# raise a NoValidHost exception.
self._ensure_sufficient_hosts(context, claimed_hosts, num_instances,
claimed_instance_uuids)
# We have selected and claimed hosts for each instance. Now we need to
# find alternates for each host.
selections_to_return = self._get_alternate_hosts(
claimed_hosts, spec_obj, hosts, num, num_alts,
alloc_reqs_by_rp_uuid, allocation_request_version)
return selections_to_return
对于_get_sorted_hosts()函数,内部会先调用get_filtered_hosts()函数来过滤出满足条件的主机和节点,然后再调用get_weighted_hosts()函数权衡计算节点的权重选出权重最大的那个计算节点,最后获取到该函数返回的排序后的主机节点与权值一一对应的对象,取出排序后的主机节点保存进weighted_hosts变量中,最后再通过参数host_subset_size(位于nova.conf文件中,一般设置为1)选取前host_subset_size个主机节点,这样就选出来权值最大的主机节点了,最后返回到2.1小节的函数。
def _get_sorted_hosts(self, spec_obj, host_states, index):
"""Returns a list of HostState objects that match the required
scheduling constraints for the request spec object and have been sorted
according to the weighers.
"""
filtered_hosts = self.host_manager.get_filtered_hosts(host_states,
spec_obj, index)
LOG.debug("Filtered %(hosts)s", {'hosts': filtered_hosts})
if not filtered_hosts:
return []
weighed_hosts = self.host_manager.get_weighed_hosts(filtered_hosts,
spec_obj)
if CONF.filter_scheduler.shuffle_best_same_weighed_hosts:
# NOTE(pas-ha) Randomize best hosts, relying on weighed_hosts
# being already sorted by weight in descending order.
# This decreases possible contention and rescheduling attempts
# when there is a large number of hosts having the same best
# weight, especially so when host_subset_size is 1 (default)
best_hosts = [w for w in weighed_hosts
if w.weight == weighed_hosts[0].weight]
random.shuffle(best_hosts)
weighed_hosts = best_hosts + weighed_hosts[len(best_hosts):]
# Log the weighed hosts before stripping off the wrapper class so that
# the weight value gets logged.
LOG.debug("Weighed %(hosts)s", {'hosts': weighed_hosts})
# Strip off the WeighedHost wrapper class...
weighed_hosts = [h.obj for h in weighed_hosts]
# We randomize the first element in the returned list to alleviate
# congestion where the same host is consistently selected among
# numerous potential hosts for similar request specs.
host_subset_size = CONF.filter_scheduler.host_subset_size
if host_subset_size < len(weighed_hosts):
weighed_subset = weighed_hosts[0:host_subset_size]
else:
weighed_subset = weighed_hosts
chosen_host = random.choice(weighed_subset)
weighed_hosts.remove(chosen_host)
return [chosen_host] + weighed_hosts
该函数内部分别调用_strip_ignore_hosts(),_match_forced_hosts(),_match_forced_nodes()来删除忽略的主机,保留强制需要使用的主机和节点。最后调用get_filtered_objects()函数来对主机和节点进行过滤,具体实现如下所述:
def get_filtered_hosts(self, hosts, spec_obj, index=0):
"""Filter hosts and return only ones passing all filters."""
def _strip_ignore_hosts(host_map, hosts_to_ignore):
ignored_hosts = []
for host in hosts_to_ignore:
for (hostname, nodename) in list(host_map.keys()):
if host.lower() == hostname.lower():
del host_map[(hostname, nodename)]
ignored_hosts.append(host)
ignored_hosts_str = ', '.join(ignored_hosts)
LOG.info(_LI('Host filter ignoring hosts: %s'), ignored_hosts_str)
def _match_forced_hosts(host_map, hosts_to_force):
forced_hosts = []
lowered_hosts_to_force = [host.lower() for host in hosts_to_force]
for (hostname, nodename) in list(host_map.keys()):
if hostname.lower() not in lowered_hosts_to_force:
del host_map[(hostname, nodename)]
else:
forced_hosts.append(hostname)
if host_map:
forced_hosts_str = ', '.join(forced_hosts)
msg = _LI('Host filter forcing available hosts to %s')
else:
forced_hosts_str = ', '.join(hosts_to_force)
msg = _LI("No hosts matched due to not matching "
"'force_hosts' value of '%s'")
LOG.info(msg, forced_hosts_str)
def _match_forced_nodes(host_map, nodes_to_force):
forced_nodes = []
for (hostname, nodename) in list(host_map.keys()):
if nodename not in nodes_to_force:
del host_map[(hostname, nodename)]
else:
forced_nodes.append(nodename)
if host_map:
forced_nodes_str = ', '.join(forced_nodes)
msg = _LI('Host filter forcing available nodes to %s')
else:
forced_nodes_str = ', '.join(nodes_to_force)
msg = _LI("No nodes matched due to not matching "
"'force_nodes' value of '%s'")
LOG.info(msg, forced_nodes_str)
def _get_hosts_matching_request(hosts, requested_destination):
(host, node) = (requested_destination.host,
requested_destination.node)
requested_nodes = [x for x in hosts
if x.host == host and x.nodename == node]
if requested_nodes:
LOG.info(_LI('Host filter only checking host %(host)s and '
'node %(node)s'), {'host': host, 'node': node})
else:
# NOTE(sbauza): The API level should prevent the user from
# providing a wrong destination but let's make sure a wrong
# destination doesn't trample the scheduler still.
LOG.info(_LI('No hosts matched due to not matching requested '
'destination (%(host)s, %(node)s)'),
{'host': host, 'node': node})
return iter(requested_nodes)
ignore_hosts = spec_obj.ignore_hosts or []
force_hosts = spec_obj.force_hosts or []
force_nodes = spec_obj.force_nodes or []
requested_node = spec_obj.requested_destination
if requested_node is not None and 'host' in requested_node:
# NOTE(sbauza): Reduce a potentially long set of hosts as much as
# possible to any requested destination nodes before passing the
# list to the filters
hosts = _get_hosts_matching_request(hosts, requested_node)
if ignore_hosts or force_hosts or force_nodes:
# NOTE(deva): we can't assume "host" is unique because
# one host may have many nodes.
name_to_cls_map = {(x.host, x.nodename): x for x in hosts}
if ignore_hosts:
_strip_ignore_hosts(name_to_cls_map, ignore_hosts)
if not name_to_cls_map:
return []
# NOTE(deva): allow force_hosts and force_nodes independently
if force_hosts:
_match_forced_hosts(name_to_cls_map, force_hosts)
if force_nodes:
_match_forced_nodes(name_to_cls_map, force_nodes)
check_type = ('scheduler_hints' in spec_obj and
spec_obj.scheduler_hints.get('_nova_check_type'))
if not check_type and (force_hosts or force_nodes):
# NOTE(deva,dansmith): Skip filters when forcing host or node
# unless we've declared the internal check type flag, in which
# case we're asking for a specific host and for filtering to
# be done.
if name_to_cls_map:
return name_to_cls_map.values()
else:
return []
hosts = six.itervalues(name_to_cls_map)
return self.filter_handler.get_filtered_objects(self.enabled_filters,
hosts, spec_obj, index)
该函数内部会循环调用各种过滤器来对主机和节点进行过滤,其中最关键的一句话是objs = filter_.filter_all()。这里需要先说明几点:1. filter_就是各种过滤器之一(定义在/nova/scheduler/filters文件夹中);2. 这些过滤器类都是基类BaseHostFilter的扩展类,而BaseHostFilter是BaseFilter扩展类;3. 每个过滤器类都重写了BaseHostFilter中的host_passes()函数,而BaseHosrFilter类重写了BaseFilter中的_filter_one()函数。因此,filter_all()函数会调用BaseHostFilter类的_filter_one()函数,_filter_one()函数调用每一个过滤器内部重写的host_passes()函数,于是便开始了各个过滤器的过滤操作,从而过滤出需要的主机和节点。
def get_filtered_objects(self, filters, objs, spec_obj, index=0):
list_objs = list(objs)
LOG.debug("Starting with %d host(s)", len(list_objs))
part_filter_results = []
full_filter_results = []
log_msg = "%(cls_name)s: (start: %(start)s, end: %(end)s)"
for filter_ in filters:
if filter_.run_filter_for_index(index):
cls_name = filter_.__class__.__name__
start_count = len(list_objs)
objs = filter_.filter_all(list_objs, spec_obj)#最关键一句:具体过滤主机节点
if objs is None:
LOG.debug("Filter %s says to stop filtering", cls_name)
return
list_objs = list(objs)
end_count = len(list_objs)
part_filter_results.append(log_msg % {"cls_name": cls_name,
"start": start_count, "end": end_count})
if list_objs:
remaining = [(getattr(obj, "host", obj),
getattr(obj, "nodename", ""))
for obj in list_objs]
full_filter_results.append((cls_name, remaining))
else:
LOG.info(_LI("Filter %s returned 0 hosts"), cls_name)
full_filter_results.append((cls_name, None))
break
LOG.debug("Filter %(cls_name)s returned "
"%(obj_len)d host(s)",
{'cls_name': cls_name, 'obj_len': len(list_objs)})
......
return list_objs
这个过滤器的作用就是:当A,B,C三个节点都通过了过滤,然后因为B的权重最大,因此B又通过了筛选,但是在节点B上部署云主机操作失败了,就会重新进行过滤筛选节点,重复次数由nova.conf文件中的scheduler_max_attempts选项决定,那么这个时候RetryFilter过滤器会将节点B直接刷掉,避免再次操作错误,RetryFilter通常作为第一个过滤器。
首先会获得本次操作是否是重复尝试过滤,并且保存在变量retry中,如果不是重复尝试则直接跳过该过滤器不进行本过滤操作;然后再获得需要过滤的所有主机和节点,保存在变量host中;然后在retry变量中获得前面几次操作失败的节点和主机,保存在变量hosts中;最后找出所有host不在创捷失败集合hosts中的节点和主机,保存在变量passes中返回。
def host_passes(self, host_state, spec_obj):
"""Skip nodes that have already been attempted."""
retry = spec_obj.retry
if not retry:
# Re-scheduling is disabled
LOG.debug("Re-scheduling is disabled")
return True
# TODO(sbauza): Once the HostState is actually a ComputeNode, we could
# easily get this one...
host = [host_state.host, host_state.nodename]
# TODO(sbauza)... and we wouldn't need to primitive the hosts into
# lists
hosts = [[cn.host, cn.hypervisor_hostname] for cn in retry.hosts]
passes = host not in hosts
if not passes:
LOG.info(_LI("Host %(host)s fails. Previously tried hosts: "
"%(hosts)s"), {'host': host, 'hosts': hosts})
# Host passes if it's not in the list of previously attempted hosts:
return passes
如果目前主机上不存在规格为将要创建云主机的instance_type的其他云主机,则返回false,保证主机上创建相同规格的云主机。
首先获取当前要创建云主机的主机类型保存进instance_type变量;然后获取要筛选主机节点中所拥有云主机的主机类型集合,保存进aggregate_vals变量;最后判断要筛选主机节点中是否拥有将要创建的云主机主机类型,如果有则返回true。
def host_passes(self, host_state, spec_obj):
instance_type = spec_obj.flavor
aggregate_vals = utils.aggregate_values_from_key(
host_state, 'instance_type')
for val in aggregate_vals:
if (instance_type.name in
[x.strip() for x in val.split(',')]):
return True
return not aggregate_vals
看host上的vcpu个数能否满足创建虚拟机的instance_type中的vcpu个数。并且根据nova.conf文件中的cpu_allocation_ratio来进行超频:这意味着一个 8 vCPU 的计算节点,nova-scheduler 在调度时认为它有 128 个 vCPU。
首先获取需要创建的云主机需要的vCPU个数,保存进instance_vcpus变量;然后获取云主机CPU超频倍数,保存进cpu_allocation_ratio变量;计算超频后的主机节点支持的vCPU个数,保存进vcpus_total变量;判断如果要创建的云主机需要的vCPU个数大于超频前vCPU的个数,则返回false;最后再计算超频后的vCPU个数减去已经占用的vCPU个数,即超频后可用的vCPU个数,保存进free_vcpus变量,如果要创建的云主机需要的vCPU个数大于超频后可用的vCPU的个数,则返回false。总结来讲,需要创建云主机的vCPU个数不能大于超频前的主机节点支持的最大vCPU个数,也不能超过超频后可用的vCPU个数。
def host_passes(self, host_state, spec_obj):
"""Return True if host has sufficient CPU cores.
:param host_state: nova.scheduler.host_manager.HostState
:param spec_obj: filter options
:return: boolean
"""
if not host_state.vcpus_total:
# Fail safe
LOG.warning(_LW("VCPUs not set; assuming CPU collection broken"))
return True
instance_vcpus = spec_obj.vcpus
cpu_allocation_ratio = self._get_cpu_allocation_ratio(host_state,
spec_obj)
vcpus_total = host_state.vcpus_total * cpu_allocation_ratio
# Only provide a VCPU limit to compute if the virt driver is reporting
# an accurate count of installed VCPUs. (XenServer driver does not)
if vcpus_total > 0:
host_state.limits['vcpu'] = vcpus_total
# Do not allow an instance to overcommit against itself, only
# against other instances.
if instance_vcpus > host_state.vcpus_total:
LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
"total cpus before overcommit, it only has %(cpus)d",
{'host_state': host_state,
'instance_vcpus': instance_vcpus,
'cpus': host_state.vcpus_total})
return False
free_vcpus = vcpus_total - host_state.vcpus_used
if free_vcpus < instance_vcpus:
LOG.debug("%(host_state)s does not have %(instance_vcpus)d "
"usable vcpus, it only has %(free_vcpus)d usable "
"vcpus",
{'host_state': host_state,
'instance_vcpus': instance_vcpus,
'free_vcpus': free_vcpus})
return False
return True
具体原理同CoreFilter。
首先将创建的云主机需要的内存大小保存入requested_ram变量,将超频前空闲的内存保存进free_ram_mb变量中,将超频前最大的内存大小保存进total_usable_ram_mb变量;判断超频前主机节点支持的最大内存大小是否大于将要创建的云主机需要的内存大小;然后将内存的超频参数保存进ram_allocation_ratio变量,计算超频后主机节点支持的最大内存大小保存进memory_mb_limit变量中,计算已经使用的内存大小,保存进used_ram_mb变量中,计算超频后空闲的内存大小,保存进变usable_ram变量中;最后判断超频后可用的内存大小是否大于将要创建主机需要的内存大小,如果大于则返回true,否则返回false。
def host_passes(self, host_state, spec_obj):
"""Only return hosts with sufficient available RAM."""
requested_ram = spec_obj.memory_mb
free_ram_mb = host_state.free_ram_mb
total_usable_ram_mb = host_state.total_usable_ram_mb
# Do not allow an instance to overcommit against itself, only against
# other instances.
if not total_usable_ram_mb >= requested_ram:
LOG.debug("%(host_state)s does not have %(requested_ram)s MB "
"usable ram before overcommit, it only has "
"%(usable_ram)s MB.",
{'host_state': host_state,
'requested_ram': requested_ram,
'usable_ram': total_usable_ram_mb})
return False
ram_allocation_ratio = self._get_ram_allocation_ratio(host_state,
spec_obj)
memory_mb_limit = total_usable_ram_mb * ram_allocation_ratio
used_ram_mb = total_usable_ram_mb - free_ram_mb
usable_ram = memory_mb_limit - used_ram_mb
if not usable_ram >= requested_ram:
LOG.debug("%(host_state)s does not have %(requested_ram)s MB "
"usable ram, it only has %(usable_ram)s MB usable ram.",
{'host_state': host_state,
'requested_ram': requested_ram,
'usable_ram': usable_ram})
return False
# save oversubscription limit for compute node to test against:
host_state.limits['memory_mb'] = memory_mb_limit
return True
该算法主要是在/nova/weights.py/BaseWeightHandler/get_weighed_objects()函数中实现:
具体流程:先创建对象,第一项为每一个之前过滤器过滤出来的主机节点,第二项为每个主机节点对应的权重,初始化为0,保存进weighed_objs变量中;然后分别用每一个权值类(位于/nova/scheduler/weights/文件夹中)对所有主机节点计算权值保存进weights变量中,再对每一个主机节点的权值进行归一化,再更新每一个主机对应的权值(主机节点的权值=主机节点的权值+当前权值类的权重*当前权值类赋予该主机节点的权值);最后返回排序后的权值。返回到2.2小节。
def get_weighed_objects(self, weighers, obj_list, weighing_properties):
"""Return a sorted (descending), normalized list of WeighedObjects."""
weighed_objs = [self.object_class(obj, 0.0) for obj in obj_list]
if len(weighed_objs) <= 1:
return weighed_objs
for weigher in weighers:
weights = weigher.weigh_objects(weighed_objs, weighing_properties)
# Normalize the weights
weights = normalize(weights,
minval=weigher.minval,
maxval=weigher.maxval)
for i, weight in enumerate(weights):
obj = weighed_objs[i]
obj.weight += weigher.weight_multiplier() * weight
return sorted(weighed_objs, key=lambda x: x.weight, reverse=True)