1. Preface
Please credit the original source when reposting, and respect the author's work.
Source location: https://github.com/nicktming/kubernetes/tree/tming-v1.13/pkg/kubelet/cm/devicemanager
Branch: tming-v1.13 (based on v1.13)
k8s-device-plugin
Branch: tming-v1.11 (based on v1.11)
device manager and device plugin
1. [k8s source analysis][kubelet] devicemanager: pod_devices and checkpoint
2. [k8s source analysis][kubelet] devicemanager: using a device-plugin (simulated GPU)
3. [k8s source analysis][kubelet] devicemanager: device-plugin registering with kubelet
4. [k8s source analysis][kubelet] devicemanager: kubelet requesting resources
5. [k8s source analysis][kubelet] devicemanager: restarting kubelet and device-plugin
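The previous article, [k8s source analysis][kubelet] devicemanager: device-plugin registering with kubelet, covered how a device plugin registers with the device manager. This article covers how the kubelet requests resources, that is, how it asks the device manager to allocate devices.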
Manager
// Manager manages all the Device Plugins running on a node.
type Manager interface {
	// Start starts device plugin registration service.
	Start(activePods ActivePodsFunc, sourcesReady config.SourcesReady) error
	// Allocate configures and assigns devices to pods. The pods are provided
	// through the pod admission attributes in the attrs argument. From the
	// requested device resources, Allocate will communicate with the owning
	// device plugin to allow setup procedures to take place, and for the
	// device plugin to provide runtime settings to use the device (environment
	// variables, mount points and device files). The node object is provided
	// for the device manager to update the node capacity to reflect the
	// currently available devices.
	Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error
	// Stop stops the manager.
	Stop() error
	// GetDeviceRunContainerOptions checks whether we have cached containerDevices
	// for the passed-in and returns its DeviceRunContainerOptions
	// for the found one. An empty struct is returned in case no cached state is found.
	GetDeviceRunContainerOptions(pod *v1.Pod, container *v1.Container) (*DeviceRunContainerOptions, error)
	// GetCapacity returns the amount of available device plugin resource capacity, resource allocatable
	// and inactive device plugin resources previously registered on the node.
	GetCapacity() (v1.ResourceList, v1.ResourceList, []string)
	GetWatcherHandler() watcher.PluginHandler
	// GetDevices returns information about the devices assigned to pods and containers
	GetDevices(podUID, containerName string) []*podresourcesapi.ContainerDevices
}
// DeviceRunContainerOptions contains the combined container runtime settings to consume its allocated devices.
type DeviceRunContainerOptions struct {
	// The environment variables list.
	Envs []kubecontainer.EnvVar
	// The mounts for the container.
	Mounts []kubecontainer.Mount
	// The host devices mapped into the container.
	Devices []kubecontainer.DeviceInfo
	// The Annotations for the container
	Annotations []kubecontainer.Annotation
}
This is the definition of the Manager interface: every device manager must implement these methods.
ManagerImpl
type monitorCallback func(resourceName string, devices []pluginapi.Device)
type ManagerImpl struct {
	// Socket path: /var/lib/kubelet/device-plugins/kubelet.sock
	socketname string
	socketdir string
	// Maps each resourceName to its endpoint
	endpoints map[string]endpointInfo // Key is ResourceName
	mutex sync.Mutex
	// gRPC server
	server *grpc.Server
	wg sync.WaitGroup
	// activePods returns the active pods on the node. It is used to refresh the
	// device bookkeeping: once a pod that holds devices has finished, its devices
	// must be reclaimed and reflected in the device manager.
	activePods ActivePodsFunc
	sourcesReady config.SourcesReady
	// Callback function
	callback monitorCallback
	// Maps each resourceName to all of its healthy devices
	healthyDevices map[string]sets.String
	// Maps each resourceName to all of its unhealthy devices
	unhealthyDevices map[string]sets.String
	// Maps each resourceName to the devices that have already been allocated
	allocatedDevices map[string]sets.String
	// podDevices records, for each pod, the devices that pod holds
	podDevices podDevices
	// Persistence
	checkpointManager checkpointmanager.CheckpointManager
}
type endpointInfo struct {
	e endpoint
	opts *pluginapi.DevicePluginOptions
}
func NewManagerImpl() (*ManagerImpl, error) {
	// pluginapi.KubeletSocket=/var/lib/kubelet/device-plugins/kubelet.sock
	return newManagerImpl(pluginapi.KubeletSocket)
}
func newManagerImpl(socketPath string) (*ManagerImpl, error) {
	klog.V(2).Infof("Creating Device Plugin manager at %s", socketPath)
	if socketPath == "" || !filepath.IsAbs(socketPath) {
		return nil, fmt.Errorf(errBadSocket+" %s", socketPath)
	}
	dir, file := filepath.Split(socketPath)
	manager := &ManagerImpl{
		endpoints: make(map[string]endpointInfo),
		socketname: file,
		socketdir: dir,
		healthyDevices: make(map[string]sets.String),
		unhealthyDevices: make(map[string]sets.String),
		allocatedDevices: make(map[string]sets.String),
		podDevices: make(podDevices),
	}
	manager.callback = manager.genericDeviceUpdateCallback
	// The following structs are populated with real implementations in manager.Start()
	// Before that, initializes them to perform no-op operations.
	// Start later passes in the real activePods and sourcesReady.
	manager.activePods = func() []*v1.Pod { return []*v1.Pod{} }
	manager.sourcesReady = &sourcesReadyStub{}
	checkpointManager, err := checkpointmanager.NewCheckpointManager(dir)
	if err != nil {
		return nil, fmt.Errorf("failed to initialize checkpoint manager: %v", err)
	}
	manager.checkpointManager = checkpointManager
	return manager, nil
}
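ManagerImpl is an implementation of the Manager interface. A few fields deserve attention:
healthyDevices: maps each resourceName to all of its healthy devices.
unhealthyDevices: maps each resourceName to all of its unhealthy devices.
allocatedDevices: maps each resourceName to the devices that have already been allocated.
podDevices: records, for each pod, the devices that pod holds. See [k8s source analysis][kubelet] devicemanager: pod_devices and checkpoint.
activePods: a function that returns the active pods on the node. It is used to refresh the device bookkeeping: once a pod that holds devices has finished, its devices must be reclaimed and reflected in the device manager.
callback: the callback function, already analyzed in [k8s source analysis][kubelet] devicemanager: device-plugin registering with kubelet. It updates healthyDevices and unhealthyDevices.
endpoints: a map from resourceName to the corresponding endpoint information, also analyzed in [k8s source analysis][kubelet] devicemanager: device-plugin registering with kubelet.
Note also that the constructor NewManagerImpl uses /var/lib/kubelet/device-plugins/kubelet.sock as the default socket path.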
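The path handling at the top of newManagerImpl can be tried in isolation. The following is a minimal sketch, not kubelet code: splitSocketPath is a hypothetical helper and its error text is only illustrative. The calls filepath.IsAbs and filepath.Split are the same ones used in the listing above, and a Unix-style path is assumed.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// splitSocketPath (hypothetical helper) repeats the check made by
// newManagerImpl: the path must be non-empty and absolute. It then splits the
// path into the directory, which the manager also uses for its checkpoint
// file, and the socket file name.
func splitSocketPath(socketPath string) (dir, file string, err error) {
	if socketPath == "" || !filepath.IsAbs(socketPath) {
		// Illustrative error text, not the kubelet's errBadSocket constant.
		return "", "", fmt.Errorf("socket path must be absolute: %q", socketPath)
	}
	dir, file = filepath.Split(socketPath)
	return dir, file, nil
}

func main() {
	dir, file, err := splitSocketPath("/var/lib/kubelet/device-plugins/kubelet.sock")
	fmt.Printf("dir=%q file=%q err=%v\n", dir, file, err)
}
```

Note that filepath.Split keeps the trailing separator on the directory part, so socketdir ends with a slash.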
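As mentioned in [k8s source analysis][kubelet] devicemanager: device-plugin registering with kubelet, the Start method receives the real activePods and sourcesReady. The kubelet passes them in when it starts the device manager, which tells the device manager how to obtain the node's active pods.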
func (m *ManagerImpl) Start(activePods ActivePodsFunc, sourcesReady config.SourcesReady) error {
	...
	m.activePods = activePods
	m.sourcesReady = sourcesReady
	...
}
Allocate
func (m *ManagerImpl) Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
	// The pod that is requesting resources
	pod := attrs.Pod
	// Try to allocate resources for this pod
	err := m.allocatePodResources(pod)
	...
	// Check podDevices: if the pod has no entry, no device plugin resources were allocated to it
	if _, podRequireDevicePluginResource := m.podDevices[string(pod.UID)]; !podRequireDevicePluginResource {
		return nil
	}
	// Devices were allocated: adjust the node information
	m.sanitizeNodeAllocatable(node)
	return nil
}
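1. Call allocatePodResources to try to allocate resources for the pod; return if there is an error.
2. Check podDevices for an allocation record for the pod; return if there is none.
3. The allocation succeeded, so call sanitizeNodeAllocatable to adjust the node information.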
allocatePodResources
func (m *ManagerImpl) allocatePodResources(pod *v1.Pod) error {
	devicesToReuse := make(map[string]sets.String)
	for _, container := range pod.Spec.InitContainers {
		if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
			return err
		}
		m.podDevices.addContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
	}
	for _, container := range pod.Spec.Containers {
		if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
			return err
		}
		m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
	}
	return nil
}
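This allocates resources for every container in the pod.
1. For pod.Spec.Containers, each container is allocated its own distinct devices.
2. For pod.Spec.InitContainers, the pod.Spec.Containers only start after all InitContainers have finished. The devices allocated to the InitContainers therefore become idle once the InitContainers finish, so they can be handed on to pod.Spec.Containers. Otherwise they would be wasted, because no other pod can claim them either. devicesToReuse is explained in detail in the last part of this article.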
allocateContainerResources
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
	podUID := string(pod.UID)
	contName := container.Name
	allocatedDevicesUpdated := false
	for k, v := range container.Resources.Limits {
		resource := string(k)
		needed := int(v.Value())
		klog.V(3).Infof("needs %d %s", needed, resource)
		if !m.isDevicePluginResource(resource) {
			continue
		}
		// Refresh the allocated-device bookkeeping once per container
		if !allocatedDevicesUpdated {
			m.updateAllocatedDevices(m.activePods())
			allocatedDevicesUpdated = true
		}
		// Pick the devices to allocate to this container
		allocDevices, err := m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])
		...
		// Look up the endpoint for this resourceName; it is used to send requests to the device plugin that registered the resource
		eI, ok := m.endpoints[resource]
		...
		devs := allocDevices.UnsortedList()
		// Ask the device plugin for the runtime settings that go with these devices.
		// For example, the nvidia device plugin returns NVIDIA_VISIBLE_DEVICES=<UUIDs of the allocated GPUs>
		resp, err := eI.e.allocate(devs)
		...
		// Record the allocation in podDevices
		m.podDevices.insert(podUID, contName, resource, allocDevices, resp.ContainerResponses[0])
		...
	}
	// Persist to kubelet_internal_checkpoint
	return m.writeCheckpoint()
}
// k8s-device-plugin/server.go
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	devs := m.devs
	name := fmt.Sprintf("NVIDIA_VISIBLE_DEVICES/%v", resourceName)
	responses := pluginapi.AllocateResponse{}
	for _, req := range reqs.ContainerRequests {
		response := pluginapi.ContainerAllocateResponse{
			Envs: map[string]string{
				name: strings.Join(req.DevicesIDs, ","),
			},
		}
		for _, id := range req.DevicesIDs {
			if !deviceExists(devs, id) {
				return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
			}
		}
		responses.ContainerResponses = append(responses.ContainerResponses, &response)
	}
	return &responses, nil
}
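Code that does not affect the main logic has been omitted.
1. The first time a container turns out to request a device plugin resource, updateAllocatedDevices is called once to refresh the device manager's bookkeeping, because some pods that hold devices may have already terminated.
2. For each resource, the following steps are performed:
2.1 Call devicesToAllocate to pick the concrete devices of this resource for the container. If it returns an error, return immediately. There is no point in handling the remaining resources, because the container fails as soon as any one resource cannot be allocated.
2.2 Look up the endpoint for this resource. The endpoint is used to send requests to the device plugin that registered the resource.
2.3 Send a request to the device plugin to obtain the container runtime settings that go with these devices. For example, the nvidia device plugin returns NVIDIA_VISIBLE_DEVICES=<UUIDs of the allocated GPUs>, so that when the container is actually started, nvidia docker exposes the corresponding host GPUs inside the container.
2.4 Record in podDevices that container contName of pod podUID was allocated the devices allocDevices of this resource.
3. Persist the device manager's current state to kubelet_internal_checkpoint. One more remark: why not persist only after all containers have been allocated? My understanding is that this container has definitely been allocated these devices, and they are already in m.allocatedDevices (as the devicesToAllocate method shows later). Whether the later containers of this pod succeed has nothing to do with this container, since the allocations are independent, so the state can be written to disk as soon as one container succeeds.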
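Steps 2.2 to 2.4 can be modelled without gRPC. The following is a rough sketch under stated assumptions, not the kubelet implementation: fakeEndpoint, allocation and allocateOne are hypothetical names, the nested map only imitates the shape of podDevices (pod UID, then container name, then resource name), and nvidia.com/gpu with gpu-0 and gpu-1 are example values.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fakeEndpoint stands in for the gRPC connection to a device plugin.
// It is a hypothetical type that exists only in this sketch.
type fakeEndpoint struct{}

// allocate imitates a plugin that exposes the chosen devices via an env var.
func (fakeEndpoint) allocate(devs []string) map[string]string {
	ids := append([]string(nil), devs...)
	sort.Strings(ids) // sorted only to make the result deterministic
	return map[string]string{"NVIDIA_VISIBLE_DEVICES": strings.Join(ids, ",")}
}

// allocation is what this sketch records per pod, container and resource.
type allocation struct {
	deviceIDs []string
	envs      map[string]string
}

// podDevices: podUID -> container name -> resource name -> allocation.
type podDevices map[string]map[string]map[string]allocation

func (p podDevices) insert(podUID, contName, resource string, a allocation) {
	if p[podUID] == nil {
		p[podUID] = map[string]map[string]allocation{}
	}
	if p[podUID][contName] == nil {
		p[podUID][contName] = map[string]allocation{}
	}
	p[podUID][contName][resource] = a
}

// allocateOne walks steps 2.2 to 2.4 for devices that were already picked.
func allocateOne(p podDevices, endpoints map[string]fakeEndpoint,
	podUID, contName, resource string, devs []string) error {
	ep, ok := endpoints[resource] // step 2.2: find the plugin for the resource
	if !ok {
		return fmt.Errorf("no endpoint registered for resource %s", resource)
	}
	envs := ep.allocate(devs) // step 2.3: ask the plugin for runtime settings
	// step 2.4: remember what this container was given
	p.insert(podUID, contName, resource, allocation{deviceIDs: devs, envs: envs})
	return nil
}

func main() {
	p := podDevices{}
	endpoints := map[string]fakeEndpoint{"nvidia.com/gpu": {}}
	err := allocateOne(p, endpoints, "pod-1", "main", "nvidia.com/gpu", []string{"gpu-1", "gpu-0"})
	fmt.Println(err, p["pod-1"]["main"]["nvidia.com/gpu"].envs)
}
```

In the real code the record also carries mounts, device files and annotations from the plugin response; the sketch keeps only the environment variables to stay short.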
updateAllocatedDevices
func (m *ManagerImpl) updateAllocatedDevices(activePods []*v1.Pod) {
	if !m.sourcesReady.AllReady() {
		return
	}
	m.mutex.Lock()
	defer m.mutex.Unlock()
	activePodUids := sets.NewString()
	for _, pod := range activePods {
		activePodUids.Insert(string(pod.UID))
	}
	// allocatedPodUids: the pods the device manager currently has records for
	// activePodUids: the pods that are actually still active on the node
	allocatedPodUids := m.podDevices.pods()
	// podsToBeRemoved: pods that hold devices on this node but have already terminated
	podsToBeRemoved := allocatedPodUids.Difference(activePodUids)
	if len(podsToBeRemoved) <= 0 {
		return
	}
	klog.V(3).Infof("pods to be removed: %v", podsToBeRemoved.List())
	m.podDevices.delete(podsToBeRemoved.List())
	// Regenerated allocatedDevices after we update pod allocation information.
	m.allocatedDevices = m.podDevices.devices()
}
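This method updates two fields of the device manager: podDevices and allocatedDevices.
allocatedPodUids: the pods the device manager currently has records for.
activePodUids: the pods that are actually still active on the node.
allocatedPodUids - activePodUids is podsToBeRemoved, that is, the pods that hold devices on this node but have already terminated. Their devices can be released, which means updating the data in allocatedDevices.
podDevices is updated because its earlier data is stale.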
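The set difference and the regeneration of allocatedDevices can be sketched with plain maps. This is a simplified model, not the kubelet code: releaseInactive is a hypothetical function, it tracks a single resource, and it leaves out the sourcesReady check and the mutex.

```go
package main

import "fmt"

// releaseInactive (hypothetical) drops the records of pods that are no longer
// active and returns the regenerated set of allocated device IDs.
// podDevs maps a pod UID to the device IDs that pod holds.
func releaseInactive(podDevs map[string][]string, activePodUIDs []string) map[string]bool {
	active := make(map[string]bool, len(activePodUIDs))
	for _, uid := range activePodUIDs {
		active[uid] = true
	}
	// allocated pods minus active pods = pods to be removed.
	// Deleting from a map while ranging over it is safe in Go.
	for uid := range podDevs {
		if !active[uid] {
			delete(podDevs, uid)
		}
	}
	// Rebuild the allocated set from the pods that remain.
	allocated := make(map[string]bool)
	for _, devs := range podDevs {
		for _, id := range devs {
			allocated[id] = true
		}
	}
	return allocated
}

func main() {
	podDevs := map[string][]string{
		"pod-a": {"gpu-0"},          // already terminated
		"pod-b": {"gpu-1", "gpu-2"}, // still active
	}
	allocated := releaseInactive(podDevs, []string{"pod-b"})
	fmt.Println(len(podDevs), len(allocated), allocated["gpu-0"])
}
```

After the call, gpu-0 no longer counts as allocated, so a later request can pick it up again.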
devicesToAllocate
func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, required int, reusableDevices sets.String) (sets.String, error) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	needed := required
	// Gets list of devices that have already been allocated.
	// This can happen if a container restarts for example.
	// Check whether this container already has an allocation record
	devices := m.podDevices.containerDevices(podUID, contName, resource)
	if devices != nil {
		klog.V(3).Infof("Found pre-allocated devices for resource %s container %q in Pod %q: %v", resource, contName, podUID, devices.List())
		needed = needed - devices.Len()
		// A pod's resource is not expected to change once admitted by the API server,
		// so just fail loudly here. We can revisit this part if this no longer holds.
		if needed != 0 {
			// Return an error if the previously allocated count differs from the currently requested count
			return nil, fmt.Errorf("pod %q container %q changed request for resource %q from %d to %d", podUID, contName, resource, devices.Len(), required)
		}
	}
	if needed == 0 {
		// No change, no work.
		return nil, nil
	}
	klog.V(3).Infof("Needs to allocate %d %q for pod %q container %q", needed, resource, podUID, contName)
	// Needs to allocate additional devices.
	if _, ok := m.healthyDevices[resource]; !ok {
		return nil, fmt.Errorf("can't allocate unregistered device %s", resource)
	}
	devices = sets.NewString()
	// Allocates from reusableDevices list first.
	// Take devices from the InitContainers first; an InitContainer can also reuse the previous InitContainer's devices, because InitContainers run one at a time
	for device := range reusableDevices {
		devices.Insert(device)
		needed--
		if needed == 0 {
			return devices, nil
		}
	}
	// Needs to allocate additional devices.
	if m.allocatedDevices[resource] == nil {
		m.allocatedDevices[resource] = sets.NewString()
	}
	// Gets Devices in use.
	devicesInUse := m.allocatedDevices[resource]
	// Gets a list of available devices.
	available := m.healthyDevices[resource].Difference(devicesInUse)
	if int(available.Len()) < needed {
		return nil, fmt.Errorf("requested number of devices unavailable for %s. Requested: %d, Available: %d", resource, needed, available.Len())
	}
	allocated := available.UnsortedList()[:needed]
	// Record the picked devices in allocatedDevices
	for _, device := range allocated {
		m.allocatedDevices[resource].Insert(device)
		devices.Insert(device)
	}
	return devices, nil
}
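The logic here is simple; it is basic arithmetic.
devicesInUse := m.allocatedDevices[resource] is the set of devices of this resource that have already been allocated.
available := m.healthyDevices[resource].Difference(devicesInUse) is the set of devices of this resource that can currently be allocated.
needed is the number of devices of this resource that the container requests.
So the allocation succeeds only if needed <= available.Len(). On success, the allocated devices are added to the device manager's allocatedDevices.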
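The arithmetic can be checked with a small stand-alone sketch. It is an illustration under assumptions rather than the kubelet code: pickDevices is a hypothetical function, plain slices and maps replace sets.String, and the available devices are sorted only so that the result is deterministic (the kubelet takes them from UnsortedList).

```go
package main

import (
	"fmt"
	"sort"
)

// pickDevices (hypothetical) computes available = healthy - inUse and takes
// `needed` devices from it. It fails only when fewer than `needed` devices
// are available, so needed == len(available) still succeeds.
func pickDevices(healthy []string, inUse map[string]bool, needed int) ([]string, error) {
	var available []string
	for _, id := range healthy {
		if !inUse[id] {
			available = append(available, id)
		}
	}
	sort.Strings(available) // deterministic order for the example
	if len(available) < needed {
		return nil, fmt.Errorf("requested %d devices, only %d available", needed, len(available))
	}
	picked := available[:needed]
	for _, id := range picked {
		inUse[id] = true // the picked devices now count as allocated
	}
	return picked, nil
}

func main() {
	healthy := []string{"gpu-0", "gpu-1", "gpu-2"}
	inUse := map[string]bool{"gpu-0": true}

	picked, err := pickDevices(healthy, inUse, 2) // exactly two are free
	fmt.Println(picked, err)

	_, err = pickDevices(healthy, inUse, 1) // nothing is left now
	fmt.Println(err)
}
```

The first call uses up every free device, which shows that a request equal to the number of available devices is accepted; the second call fails because nothing is left.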
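One more point to note:
Devices are taken from reusableDevices first, that is, from the InitContainers. An InitContainer can also reuse the devices of the previous InitContainer, because InitContainers run one at a time.
For example, suppose every InitContainer in the pod requests 1 GPU. The first InitContainer is allocated one GPU. The second InitContainer only starts after the first one has finished, so the GPU it gets is the first InitContainer's GPU, which it reuses. The third, fourth, and fifth InitContainers likewise reuse that GPU.
With this in mind, look back at the allocatePodResources method to understand what devicesToReuse means.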
func (m *ManagerImpl) allocatePodResources(pod *v1.Pod) error {
	devicesToReuse := make(map[string]sets.String)
	for _, container := range pod.Spec.InitContainers {
		...
		m.podDevices.addContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
	}
	for _, container := range pod.Spec.Containers {
		...
		m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
	}
	return nil
}
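As shown, the devices that pod.Spec.Containers can reuse come from the InitContainers, because pod.Spec.Containers only start (all at the same time) after every InitContainer has run to completion, one by one in order. From the analysis above, the number of devices pod.Spec.Containers can reuse is max(number of devices requested by each InitContainer).
If the largest InitContainer requests 10 GPUs and all pod.Spec.Containers together request only 1 GPU, the pod holds 9 GPUs that sit idle.
If the largest InitContainer requests 1 GPU and all pod.Spec.Containers together request 10 GPUs, the pod first reuses the 1 GPU already allocated to the InitContainer and then requests the other 9 GPUs.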
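The two scenarios can be replayed with a small counting model. This is a deliberately simplified sketch, not the kubelet code: devicesHeldByPod is a hypothetical function, it handles a single resource, and it tracks only how many devices are reusable instead of their IDs. It assumes that addContainerAllocatedResources grows devicesToReuse by the newly allocated devices and that removeContainerAllocatedResources takes out the devices a regular container consumed.

```go
package main

import "fmt"

// minInt returns the smaller of a and b.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// devicesHeldByPod (hypothetical) replays allocatePodResources with counts
// only and returns how many devices the pod takes from the node's free pool.
func devicesHeldByPod(initRequests, containerRequests []int) int {
	reusable := 0 // size of devicesToReuse for this resource
	held := 0     // devices taken from the free pool so far

	for _, n := range initRequests {
		fromReuse := minInt(n, reusable)
		fresh := n - fromReuse // still missing after reuse
		held += fresh
		reusable += fresh // newly allocated devices become reusable too
	}
	for _, n := range containerRequests {
		fromReuse := minInt(n, reusable)
		held += n - fromReuse
		reusable -= fromReuse // reused devices now belong to this container
	}
	return held
}

func main() {
	// Every InitContainer asks for 1 GPU: they all share the same one.
	fmt.Println(devicesHeldByPod([]int{1, 1, 1}, []int{1}))
	// Largest InitContainer asks for 10, containers ask for 1 in total.
	fmt.Println(devicesHeldByPod([]int{10}, []int{1}))
	// Largest InitContainer asks for 1, containers ask for 10 in total.
	fmt.Println(devicesHeldByPod([]int{1}, []int{10}))
}
```

In this model the pod ends up holding max(largest InitContainer request, sum of the regular container requests) devices, which matches the 9 idle GPUs in the first scenario and the 9 additional GPUs in the second.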