Monitoring the Right Metric for Container OOMKills (Translated)

Not long after I started working at Splunk, a colleague reached out to me on Slack to ask about an earlier blog post of mine on Kubernetes metrics.

His question was which "memory usage" metric of a container the OOMKiller uses to decide whether the container should be killed. The claim I made in that post was:

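"You might think that memory utilization is easy to track with container_memory_usage_bytes. However, this metric also includes cached data (think filesystem cache) that can be evicted under memory pressure. The better metric is container_memory_working_set_bytes, because this is what the OOM killer watches."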

That is the central claim of the post, so I decided I needed to reproduce the behavior. Let's see which metric the OOMKiller is actually watching.

I wrote a small tool that keeps allocating memory until the OOMKiller steps in and kills the container in the pod.


package main

import (
    "log"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Allocate 1 KiB every 5 ms and keep a reference in the map,
    // so the garbage collector can never free it.
    memoryTicker := time.NewTicker(time.Millisecond * 5)
    leak := make(map[int][]byte)
    i := 0

    go func() {
        for range memoryTicker.C {
            leak[i] = make([]byte, 1024)
            i++
        }
    }()

    // Expose the default Prometheus process metrics.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8081", nil))
}

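Deploying it in minikube with both the memory request and the memory limit set to 128MB, we can see that container_memory_usage_bytes and container_memory_working_set_bytes track each other almost 1:1. When both reach the limit set on the container, the OOMKiller kills the container and the process starts over.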


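Because container_memory_usage_bytes also tracks the filesystem cache used by the process, I then extended the tool so that it also writes data straight to a file on the filesystem.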

package main

import (
    "log"
    "net/http"
    "os"
    "time"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // Same heap leak as before: 1 KiB every 5 ms, never released.
    memoryTicker := time.NewTicker(time.Millisecond * 5)
    leak := make(map[int][]byte)
    i := 0

    go func() {
        for range memoryTicker.C {
            leak[i] = make([]byte, 1024)
            i++
        }
    }()

    // Additionally append 1 KiB to a file every 5 ms, which fills
    // the filesystem (page) cache charged to the container.
    fileTicker := time.NewTicker(time.Millisecond * 5)
    go func() {
        f, err := os.Create("/tmp/file")
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        buffer := make([]byte, 1024)

        for range fileTicker.C {
            f.Write(buffer)
            f.Sync()
        }
    }()

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8081", nil))
}

With the filesystem cache in play, container_memory_usage_bytes and container_memory_working_set_bytes start to diverge.


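What is interesting now is that the container is still not allowed to use more memory than its limit, but the OOMKiller only kills the container once container_memory_working_set_bytes reaches the memory limit.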


Another interesting aspect of this behavior is that container_memory_usage_bytes tops out at the container's memory limit, yet data keeps being written to disk.

If we also look at container_memory_cache, we find that the amount of cache in use keeps growing until container_memory_usage_bytes reaches the limit, and then starts to shrink.


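From this experiment we can see that container_memory_usage_bytes does account for filesystem pages that are being cached. We can also see that the OOMKiller is tracking container_memory_working_set_bytes. This makes sense, because shared filesystem cache pages can be evicted from memory at any time. Killing a process merely because it performs disk I/O would be pointless.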
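The relationship between the two metrics can be sketched in a few lines. To my understanding, cAdvisor derives the working set by subtracting the inactive file cache (the total_inactive_file entry of the cgroup v1 memory.stat file) from the usage figure, never going below zero; treat that as an assumption to verify against the cAdvisor version you run. The function names parseStat and workingSet, and all numbers in the sample, are invented for illustration and are not measurements from the experiment.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseStat reads "key value" lines, the layout used by cgroup memory.stat.
// Lines that do not match that layout are skipped.
func parseStat(content string) map[string]uint64 {
	stats := make(map[string]uint64)
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 2 {
			continue
		}
		value, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			continue
		}
		stats[fields[0]] = value
	}
	return stats
}

// workingSet subtracts the inactive file cache from usage, flooring at zero.
// Assumption: this mirrors how the working set metric is derived.
func workingSet(usage, inactiveFile uint64) uint64 {
	if inactiveFile >= usage {
		return 0
	}
	return usage - inactiveFile
}

func main() {
	// Made-up sample, shaped like a cgroup v1 memory.stat file.
	sample := "cache 50331648\nrss 83886080\ntotal_inactive_file 41943040\n"
	stats := parseStat(sample)

	usage := uint64(134217728) // usage pinned at a 128 MiB limit
	fmt.Println(workingSet(usage, stats["total_inactive_file"])) // prints 92274688
}
```

In this sample the usage figure sits exactly at the limit while the working set is still about 40 MiB below it, which is the situation in which the container keeps running.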
