关于Kubernetes的介绍与核心对象概念-阿里云开发者社区
k8s架构
核心对象
使用kubeadm+10分钟部署k8s集群
使用 KuboardSpray 安装kubernetes_v1.23.1 | Kuboard
k8s-上部署第一个应用程序
Deployment基本概念
给应用添加service,执行扩容和滚动更新
安装Kuboard在页面上熟悉k8s集群
kubernetes 1.24.2安装kuboard v3
static pod 安装 kuboard
安装命令
curl -fsSL https://addons.kuboard.cn/kuboard/kuboard-static-pod.sh -o kuboard.sh
sh kuboard.sh
阅读k8s源码的准备工作
vscode
下载k8s 1.24.2源码
k8s组件代码仓库地址
从创建pod开始看流程和源码
Kubernetes源码分析一叶知秋(一)kubectl中Pod的创建流程 - 掘金
编写一个创建nginx pod的yaml
使用kubectl部署这个pod
cobra 中有个重要的概念,分别是 commands、arguments 和 flags。其中 commands 代表执行动作,arguments 就是执行参数,flags 是这些动作的标识符。执行命令行程序时的一般格式为:
APPNAME COMMAND ARG --FLAG
比如下面的例子:
# server是 commands,port 是 flag
hugo server --port=1313
# clone 是 commands,URL 是 arguments,bare 是 flag
git clone URL --bare
kubectl create命令的执行入口在cmd目录下对应组件的子目录中
代码位置/home/gopath/src/k8s.io/kubernetes-1.24.2/cmd/kubectl/kubectl.go
package main
import (
"k8s.io/component-base/cli"
"k8s.io/kubectl/pkg/cmd"
"k8s.io/kubectl/pkg/cmd/util"
// Import to initialize client auth plugins.
_ "k8s.io/client-go/plugin/pkg/client/auth"
)
func main() {
command := cmd.NewDefaultKubectlCommand()
if err := cli.RunNoErrOutput(command); err != nil {
// Pretty-print the error and exit with an error.
util.CheckErr(err)
}
}
rand.Seed 设置随机数种子
调用 kubectl 库的 cmd 包创建 command 对象
command := cmd.NewDefaultKubectlCommand()
D:\Workspace\Go\src\k8s.io\[email protected]\staging\src\k8s.io\kubectl\pkg\cmd\cmd.go
staging\src\k8s.io\kubectl\pkg\cmd\cmd.go
github.com/spf13/cobra
cobra的主要功能如下
Cobra主要提供的功能
* 简易的子命令行模式,如 app server, app fetch等等
* 完全兼容 POSIX 命令行模式
* 嵌套子命令subcommand
* 支持全局,局部,串联flags
* 使用Cobra很容易的生成应用程序和命令,使用cobra create appname和cobra add cmdname
* 如果命令输入错误,将提供智能建议,如 app srver,将提示srver没有,是否是app server
* 自动生成commands和flags的帮助信息
* 自动生成详细的help信息,如app help
* 自动识别-h,--help帮助flag
* 自动生成应用程序在bash下命令自动完成功能
* 自动生成应用程序的man手册
* 命令行别名
* 自定义help和usage信息
* 可选地与 [viper](http://github.com/spf13/viper) 紧密集成
创建cobra应用
go install github.com/spf13/cobra-cli@latest
mkdir my_cobra
cd my_cobra
// 打开my_cobra项目,执行go mod init后可以看到相关的文件
go mod init github.com/spf13/my_cobra
find
go run main.go
// 修改root.go
// Uncomment the following line if your bare application
// has an action associated with it:
// Run: func(cmd *cobra.Command, args []string) { },
Run: func(cmd *cobra.Command, args []string) {
fmt.Println("my_cobra")
},
// 编译运行;首次运行会报 undefined: fmt,在 root.go 中补充 import "fmt" 后再次运行即可
go run main.go
[root@k8s-worker02 my_cobra]# go run main.go
# github.com/spf13/my_cobra/cmd
cmd/root.go:29:10: undefined: fmt
[root@k8s-worker02 my_cobra]# go run main.go
my_cobra
// 用cobra程序生成应用程序框架
cobra-cli init
// 除了init生成应用程序框架,还可以通过cobra-cli add命令生成子命令的代码文件,比如下面的命令会添加两个子命令image和container相关的代码文件:
cobra-cli add image
cobra-cli add container
[root@k8s-worker02 my_cobra]# find
.
./go.mod
./main.go
./cmd
./cmd/root.go
./cmd/image.go
./LICENSE
./go.sum
[root@k8s-worker02 my_cobra]# cobra-cli add container
container created at /home/gopath/src/my_cobra
[root@k8s-worker02 my_cobra]# go run main.go image
image called
[root@k8s-worker02 my_cobra]# go run main.go container
container called
可以看出执行的是对应xxxCmd下的Run方法
// containerCmd represents the container command
var containerCmd = &cobra.Command{
Use: "container",
Short: "A brief description of your command",
Long: `A longer description that spans multiple lines and likely contains examples
and usage of using your command. For example:
Cobra is a CLI library for Go that empowers applications.
This application is a tool to generate the needed files
to quickly create a Cobra application.`,
Run: func(cmd *cobra.Command, args []string) {
fmt.Println("container called")
},
}
复制cmd/container.go为version.go,添加version信息
package cmd
import (
"fmt"
"github.com/spf13/cobra"
)
// versionCmd represents the version command
var versionCmd = &cobra.Command{
Use: "version",
Short: "A brief description of your command",
Long: `A longer description that spans multiple lines and likely contains examples
and usage of using your command. For example:
Cobra is a CLI library for Go that empowers applications.
This application is a tool to generate the needed files
to quickly create a Cobra application.`,
Run: func(cmd *cobra.Command, args []string) {
fmt.Println("my_cobra version is v1.0")
},
}
func init() {
rootCmd.AddCommand(versionCmd)
}
设置一个MinimumNArgs的验证
新增一个cmd/times.go
package cmd
import (
"fmt"
"strings"
"github.com/spf13/cobra"
)
// timesCmd represents the times command
var echoTimes int
var timesCmd = &cobra.Command{
Use: "times [string to echo]",
Short: "Echo anything to the screen more times",
Long: `echo things multiple times back to the user by providing a count and a string`,
Args: cobra.MinimumNArgs(1),
Run: func(cmd *cobra.Command, args []string) {
for i := 0; i < echoTimes; i++ {
fmt.Println("Echo: " + strings.Join(args, " "))
}
},
}
func init() {
rootCmd.AddCommand(timesCmd)
timesCmd.Flags().IntVarP(&echoTimes, "times", "t", 1, "times to echo the input")
}
因为我们为timesCmd命令设置了Args: cobra.MinimumNArgs(1),所以必须为times子命令至少传入一个参数,否则会报错;传入参数后正常执行:
go run main.go times -t=4 k8s
[root@k8s-worker02 my_cobra]# go run main.go times -t=4 k8s
Echo: k8s
Echo: k8s
Echo: k8s
Echo: k8s
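除了 MinimumNArgs,cobra 还内置了 NoArgs、ExactArgs、MaximumNArgs、RangeArgs 等位置参数校验器。下面是一个沿用上文 my_cobra 工程的最小示意(rootCmd 为上文已有对象,echo 子命令为本文举例新增):

```go
package cmd

import (
	"fmt"

	"github.com/spf13/cobra"
)

// echoCmd 演示 cobra 内置的其他 Args 校验器(示意代码)
var echoCmd = &cobra.Command{
	Use:   "echo [string to echo]",
	Short: "Echo the first argument back to the user",
	// ExactArgs(1) 要求恰好 1 个位置参数;类似的还有:
	//   cobra.NoArgs              不允许任何位置参数
	//   cobra.MinimumNArgs(n)     至少 n 个参数(上文 times 命令用的就是它)
	//   cobra.MaximumNArgs(n)     至多 n 个参数
	//   cobra.RangeArgs(min, max) 参数个数在 [min, max] 之间
	Args: cobra.ExactArgs(1),
	Run: func(cmd *cobra.Command, args []string) {
		fmt.Println("Echo: " + args[0])
	},
}

func init() {
	rootCmd.AddCommand(echoCmd)
}
```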
修改rootCmd
PersistentPreRun: func(cmd *cobra.Command, args []string) {
fmt.Printf("[step_1]PersistentPreRun with args: %v\n", args)
},
PreRun: func(cmd *cobra.Command, args []string) {
fmt.Printf("[step_2]PreRun with args: %v\n", args)
},
Run: func(cmd *cobra.Command, args []string) {
fmt.Printf("[step_3]my_cobra version is v1.0: %v\n", args)
},
PostRun: func(cmd *cobra.Command, args []string) {
fmt.Printf("[step_4]PostRun with args: %v\n", args)
},
PersistentPostRun: func(cmd *cobra.Command, args []string) {
fmt.Printf("[step_5]PersistentPostRun with args: %v\n", args)
},
[root@k8s-worker02 my_cobra]# go run main.go
[step_1]PersistentPreRun with args: []
[step_2]PreRun with args: []
[step_3]my_cobra version is v1.0: []
[step_4]PostRun with args: []
[step_5]PersistentPostRun with args: []
kubectl命令行设置pprof抓取火焰图
cmd调用入口
D:\Workspace\Go\src\k8s.io\[email protected]\staging\src\k8s.io\kubectl\pkg\cmd\cmd.go
// NewDefaultKubectlCommand creates the `kubectl` command with default arguments
func NewDefaultKubectlCommand() *cobra.Command {
return NewDefaultKubectlCommandWithArgs(KubectlOptions{
PluginHandler: NewDefaultPluginHandler(plugin.ValidPluginFilenamePrefixes),
Arguments: os.Args,
ConfigFlags: defaultConfigFlags,
IOStreams: genericclioptions.IOStreams{In: os.Stdin, Out: os.Stdout, ErrOut: os.Stderr},
})
}
底层函数NewKubectlCommand解析
func NewKubectlCommand(o KubectlOptions) *cobra.Command {}
使用cobra创建rootCmd
// Parent command to which all subcommands are added.
cmds := &cobra.Command{
Use: "kubectl",
Short: i18n.T("kubectl controls the Kubernetes cluster manager"),
Long: templates.LongDesc(`
kubectl controls the Kubernetes cluster manager.
Find more information at:
https://kubernetes.io/docs/reference/kubectl/overview/`),
Run: runHelp,
// Hook before and after Run initialize and write profiles to disk,
// respectively.
PersistentPreRunE: func(*cobra.Command, []string) error {
rest.SetDefaultWarningHandler(warningHandler)
return initProfiling()
},
PersistentPostRunE: func(*cobra.Command, []string) error {
if err := flushProfiling(); err != nil {
return err
}
if warningsAsErrors {
count := warningHandler.WarningCount()
switch count {
case 0:
// no warnings
case 1:
return fmt.Errorf("%d warning received", count)
default:
return fmt.Errorf("%d warnings received", count)
}
}
return nil
},
}
配合后面的addProfilingFlags(flags)添加pprof的flag
在PersistentPreRunE中设置pprof采集相关逻辑
代码位置
staging\src\k8s.io\kubectl\pkg\cmd\profiling.go
意思是有两个选项
--profile代表pprof统计哪类指标,可以是cpu,block等
--profile-output代表输出的pprof结果文件
initProfiling代码
func addProfilingFlags(flags *pflag.FlagSet) {
flags.StringVar(&profileName, "profile", "none", "Name of profile to capture. One of (none|cpu|heap|goroutine|threadcreate|block|mutex)")
flags.StringVar(&profileOutput, "profile-output", "profile.pprof", "Name of the file to write the profile to")
}
func initProfiling() error {
var (
f *os.File
err error
)
switch profileName {
case "none":
return nil
case "cpu":
f, err = os.Create(profileOutput)
if err != nil {
return err
}
err = pprof.StartCPUProfile(f)
if err != nil {
return err
}
// Block and mutex profiles need a call to Set{Block,Mutex}ProfileRate to
// output anything. We choose to sample all events.
case "block":
runtime.SetBlockProfileRate(1)
case "mutex":
runtime.SetMutexProfileFraction(1)
default:
// Check the profile name is valid.
if profile := pprof.Lookup(profileName); profile == nil {
return fmt.Errorf("unknown profile '%s'", profileName)
}
}
// If the command is interrupted before the end (ctrl-c), flush the
// profiling files
c := make(chan os.Signal, 1)
signal.Notify(c, os.Interrupt)
go func() {
<-c
f.Close()
flushProfiling()
os.Exit(0)
}()
return nil
}
并且在PersistentPostRunE中设置了pprof统计结果落盘
PersistentPostRunE: func(*cobra.Command, []string) error {
if err := flushProfiling(); err != nil {
return err
}
if warningsAsErrors {
count := warningHandler.WarningCount()
switch count {
case 0:
// no warnings
case 1:
return fmt.Errorf("%d warning received", count)
default:
return fmt.Errorf("%d warnings received", count)
}
}
return nil
},
对应执行的flushProfiling
func flushProfiling() error {
switch profileName {
case "none":
return nil
case "cpu":
pprof.StopCPUProfile()
case "heap":
runtime.GC()
fallthrough
default:
profile := pprof.Lookup(profileName)
if profile == nil {
return nil
}
f, err := os.Create(profileOutput)
if err != nil {
return err
}
defer f.Close()
profile.WriteTo(f, 0)
}
return nil
}
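kubectl 这里直接复用的是 Go 标准库 runtime/pprof 的能力。下面是一个脱离 kubectl 的最小示意程序,按同样的套路先 StartCPUProfile、执行业务逻辑、最后停止采集并落盘(文件名 cpu.pprof 为举例):

```go
package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

func main() {
	// 与 initProfiling 类似:创建结果文件并开启 CPU profile
	f, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	// 与 flushProfiling 类似:程序结束前停止采集,数据写入 f
	defer pprof.StopCPUProfile()

	// 模拟一段消耗 CPU 的业务逻辑
	sum := 0
	for i := 0; i < 1e8; i++ {
		sum += i
	}
	fmt.Println("sum =", sum)
}
```

生成的 cpu.pprof 同样可以用 go tool pprof 分析。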
执行采集pprof cpu的kubectl命令
# 执行命令
kubectl get node --profile=cpu --profile-output=cpu.pprof
# 查看结果文件
ll cpu.pprof
# 生成svg
go tool pprof -svg cpu.pprof > kubectl_get_node_cpu.svg
kubectl get node --profile=goroutine --profile-output=goroutine.pprof
go tool pprof -text goroutine.pprof
cpu火焰图svg结果
kubectl架构图
用cmd工厂函数f创建7大分组命令
基础初级命令Basic Commands(Beginner)
基础中级命令Basic Commands(Intermediate)
部署命令Deploy Commands
集群管理分组 Cluster Management Commands
故障排查和调试Troubleshooting and Debugging Commands
高级命令Advanced Commands
设置命令Settings Commands
设置全局参数(PersistentFlags)
flags := cmds.PersistentFlags()
addProfilingFlags(flags)
flags.BoolVar(&warningsAsErrors, "warnings-as-errors", warningsAsErrors, "Treat warnings received from the server as errors and exit with a non-zero exit code")
设置kubeconfig相关的命令行
kubeConfigFlags := o.ConfigFlags
if kubeConfigFlags == nil {
kubeConfigFlags = defaultConfigFlags
}
kubeConfigFlags.AddFlags(flags)
matchVersionKubeConfigFlags := cmdutil.NewMatchVersionFlags(kubeConfigFlags)
matchVersionKubeConfigFlags.AddFlags(flags)
设置cmd工厂函数f,主要是封装了与kube-apiserver交互客户端
后面的子命令都使用这个f创建
f := cmdutil.NewFactory(matchVersionKubeConfigFlags)
创建proxy子命令
proxyCmd := proxy.NewCmdProxy(f, o.IOStreams)
proxyCmd.PreRun = func(cmd *cobra.Command, args []string) {
kubeConfigFlags.WrapConfigFn = nil
}
创建7大分组命令
1.基础初级命令Basic Commands (Beginner)
代码
{
Message: "Basic Commands (Beginner):",
Commands: []*cobra.Command{
create.NewCmdCreate(f, o.IOStreams),
expose.NewCmdExposeService(f, o.IOStreams),
run.NewCmdRun(f, o.IOStreams),
set.NewCmdSet(f, o.IOStreams),
},
},
命令行使用kubectl
对应的输出
释义
create代表创建资源
expose将一种资源暴露成service
run运行一个镜像
set在对象上设置一些功能
2.基础中级命令Basic Commands(Intermediate)
{
Message: "Basic Commands (Intermediate):",
Commands: []*cobra.Command{
explain.NewCmdExplain("kubectl", f, o.IOStreams),
getCmd,
edit.NewCmdEdit(f, o.IOStreams),
delete.NewCmdDelete(f, o.IOStreams),
},
},
打印的help效果
释义
explain获取资源的文档
get显示资源
edit编辑资源
delete删除资源
3.部署命令Deploy Commands
{
Message: "Deploy Commands:",
Commands: []*cobra.Command{
rollout.NewCmdRollout(f, o.IOStreams),
scale.NewCmdScale(f, o.IOStreams),
autoscale.NewCmdAutoscale(f, o.IOStreams),
},
},
释义
rollout滚动更新
scale扩缩容
autoscale自动扩缩容
4.集群管理分组Cluster Management Commands:
{
Message: "Cluster Management Commands:",
Commands: []*cobra.Command{
certificates.NewCmdCertificate(f, o.IOStreams),
clusterinfo.NewCmdClusterInfo(f, o.IOStreams),
top.NewCmdTop(f, o.IOStreams),
drain.NewCmdCordon(f, o.IOStreams),
drain.NewCmdUncordon(f, o.IOStreams),
drain.NewCmdDrain(f, o.IOStreams),
taint.NewCmdTaint(f, o.IOStreams),
},
},
释义
certificate管理证书
cluster-info展示集群信息
top展示资源消耗top
cordon将节点标记为不可用
uncordon将节点标记为可用
drain驱逐pod
taint设置节点污点
5.故障排查和调试Troubleshooting and Debugging Commands
{
Message: "Troubleshooting and Debugging Commands:",
Commands: []*cobra.Command{
describe.NewCmdDescribe("kubectl", f, o.IOStreams),
logs.NewCmdLogs(f, o.IOStreams),
attach.NewCmdAttach(f, o.IOStreams),
cmdexec.NewCmdExec(f, o.IOStreams),
portforward.NewCmdPortForward(f, o.IOStreams),
proxyCmd,
cp.NewCmdCp(f, o.IOStreams),
auth.NewCmdAuth(f, o.IOStreams),
debug.NewCmdDebug(f, o.IOStreams),
},
},
输出
释义
describe展示资源详情
logs打印pod中容器日志
attach进入容器
exec在容器中执行命令
port-forward端口转发
proxy运行代理
cp拷贝文件
auth检查鉴权
debug打印debug
6.高级命令Advanced Commands
代码
{
Message: "Advanced Commands:",
Commands: []*cobra.Command{
diff.NewCmdDiff(f, o.IOStreams),
apply.NewCmdApply("kubectl", f, o.IOStreams),
patch.NewCmdPatch(f, o.IOStreams),
replace.NewCmdReplace(f, o.IOStreams),
wait.NewCmdWait(f, o.IOStreams),
kustomize.NewCmdKustomize(o.IOStreams),
},
},
输出
释义
diff对比当前和应该运行的版本
apply应用变更或配置
patch更新资源的字段
replace替换资源
wait等待资源的特定状态
kustomize从目录或远程url构建kustomization目标
7.设置命令Setting Commands
代码
{
Message: "Settings Commands:",
Commands: []*cobra.Command{
label.NewCmdLabel(f, o.IOStreams),
annotate.NewCmdAnnotate("kubectl", f, o.IOStreams),
completion.NewCmdCompletion(o.IOStreams.Out, ""),
},
},
输出
释义
label打标签
annotate更新注释
completion在shell上设置补全
设置cmd工厂函数f,主要是封装了与kube-apiserver交互客户端
用cmd工厂函数f创建7大分组命令
kubectl create架构图
create流程
newCmdCreate调用cobra的Run函数
调用RunCreate构建resourceBuilder对象
调用visit方法创建资源
底层使用RESTClient和kube-apiserver通信
create的流程NewCmdCreate
代码入口staging\src\k8s.io\kubectl\pkg\cmd\create\create.go
创建Create选项对象
o := NewCreateOptions(ioStreams)
初始化cmd
cmd := &cobra.Command{
Use: "create -f FILENAME",
DisableFlagsInUseLine: true,
Short: i18n.T("Create a resource from a file or from stdin"),
Long: createLong,
Example: createExample,
Run: func(cmd *cobra.Command, args []string) {
if cmdutil.IsFilenameSliceEmpty(o.FilenameOptions.Filenames, o.FilenameOptions.Kustomize) {
ioStreams.ErrOut.Write([]byte("Error: must specify one of -f and -k\n\n"))
defaultRunFunc := cmdutil.DefaultSubCommandRun(ioStreams.ErrOut)
defaultRunFunc(cmd, args)
return
}
cmdutil.CheckErr(o.Complete(f, cmd))
cmdutil.CheckErr(o.ValidateArgs(cmd, args))
cmdutil.CheckErr(o.RunCreate(f, cmd))
},
}
设置选项
具体绑定到o的各个字段上
// bind flag structs
o.RecordFlags.AddFlags(cmd)
usage := "to use to create the resource"
cmdutil.AddFilenameOptionFlags(cmd, &o.FilenameOptions, usage)
cmdutil.AddValidateFlags(cmd)
cmd.Flags().BoolVar(&o.EditBeforeCreate, "edit", o.EditBeforeCreate, "Edit the API resource before creating")
cmd.Flags().Bool("windows-line-endings", runtime.GOOS == "windows",
"Only relevant if --edit=true. Defaults to the line ending native to your platform.")
cmdutil.AddApplyAnnotationFlags(cmd)
cmdutil.AddDryRunFlag(cmd)
cmdutil.AddLabelSelectorFlagVar(cmd, &o.Selector)
cmd.Flags().StringVar(&o.Raw, "raw", o.Raw, "Raw URI to POST to the server. Uses the transport specified by the kubeconfig file.")
cmdutil.AddFieldManagerFlagVar(cmd, &o.fieldManager, "kubectl-create")
o.PrintFlags.AddFlags(cmd)
绑定创建子命令
// create subcommands
cmd.AddCommand(NewCmdCreateNamespace(f, ioStreams))
cmd.AddCommand(NewCmdCreateQuota(f, ioStreams))
cmd.AddCommand(NewCmdCreateSecret(f, ioStreams))
cmd.AddCommand(NewCmdCreateConfigMap(f, ioStreams))
cmd.AddCommand(NewCmdCreateServiceAccount(f, ioStreams))
cmd.AddCommand(NewCmdCreateService(f, ioStreams))
cmd.AddCommand(NewCmdCreateDeployment(f, ioStreams))
cmd.AddCommand(NewCmdCreateClusterRole(f, ioStreams))
cmd.AddCommand(NewCmdCreateClusterRoleBinding(f, ioStreams))
cmd.AddCommand(NewCmdCreateRole(f, ioStreams))
cmd.AddCommand(NewCmdCreateRoleBinding(f, ioStreams))
cmd.AddCommand(NewCmdCreatePodDisruptionBudget(f, ioStreams))
cmd.AddCommand(NewCmdCreatePriorityClass(f, ioStreams))
cmd.AddCommand(NewCmdCreateJob(f, ioStreams))
cmd.AddCommand(NewCmdCreateCronJob(f, ioStreams))
cmd.AddCommand(NewCmdCreateIngress(f, ioStreams))
cmd.AddCommand(NewCmdCreateToken(f, ioStreams))
核心的cmd.Run函数
校验文件参数
if cmdutil.IsFilenameSliceEmpty(o.FilenameOptions.Filenames, o.FilenameOptions.Kustomize) {
ioStreams.ErrOut.Write([]byte("Error: must specify one of -f and -k\n\n"))
defaultRunFunc := cmdutil.DefaultSubCommandRun(ioStreams.ErrOut)
defaultRunFunc(cmd, args)
return
}
完善并填充所需字段
cmdutil.CheckErr(o.Complete(f, cmd))
校验参数
cmdutil.CheckErr(o.ValidateArgs(cmd, args))
核心的RunCreate
cmdutil.CheckErr(o.RunCreate(f, cmd))
如果配置了apiserver的raw-uri就直接发送请求
if len(o.Raw) > 0 {
restClient, err := f.RESTClient()
if err != nil {
return err
}
return rawhttp.RawPost(restClient, o.IOStreams, o.Raw, o.FilenameOptions.Filenames[0])
}
如果配置了创建前edit就执行RunEditOnCreate
if o.EditBeforeCreate {
return RunEditOnCreate(f, o.PrintFlags, o.RecordFlags, o.IOStreams, cmd, &o.FilenameOptions, o.fieldManager)
}
根据配置中的validate选项决定是否开启校验
--validate=true:在创建前根据schema校验资源配置
cmdNamespace, enforceNamespace, err := f.ToRawKubeConfigLoader().Namespace()
if err != nil {
return err
}
构建builder对象,建造者模式
r := f.NewBuilder().
Unstructured().
Schema(schema).
ContinueOnError().
NamespaceParam(cmdNamespace).DefaultNamespace().
FilenameParam(enforceNamespace, &o.FilenameOptions).
LabelSelectorParam(o.Selector).
Flatten().
Do()
err = r.Err()
if err != nil {
return err
}
FilenameParam读取配置文件
除了支持简单的本地文件,也支持标准输入和http/https协议访问的文件,保存为Visitor
代码位置 staging\src\k8s.io\cli-runtime\pkg\resource\builder.go
// FilenameParam groups input in two categories: URLs and files (files, directories, STDIN)
// If enforceNamespace is false, namespaces in the specs will be allowed to
// override the default namespace. If it is true, namespaces that don't match
// will cause an error.
// If ContinueOnError() is set prior to this method, objects on the path that are not
// recognized will be ignored (but logged at V(2)).
func (b *Builder) FilenameParam(enforceNamespace bool, filenameOptions *FilenameOptions) *Builder {
if errs := filenameOptions.validate(); len(errs) > 0 {
b.errs = append(b.errs, errs...)
return b
}
recursive := filenameOptions.Recursive
paths := filenameOptions.Filenames
for _, s := range paths {
switch {
case s == "-":
b.Stdin()
case strings.Index(s, "http://") == 0 || strings.Index(s, "https://") == 0:
url, err := url.Parse(s)
if err != nil {
b.errs = append(b.errs, fmt.Errorf("the URL passed to filename %q is not valid: %v", s, err))
continue
}
b.URL(defaultHttpGetAttempts, url)
default:
matches, err := expandIfFilePattern(s)
if err != nil {
b.errs = append(b.errs, err)
continue
}
if !recursive && len(matches) == 1 {
b.singleItemImplied = true
}
b.Path(recursive, matches...)
}
}
if filenameOptions.Kustomize != "" {
b.paths = append(
b.paths,
&KustomizeVisitor{
mapper: b.mapper,
dirPath: filenameOptions.Kustomize,
schema: b.schema,
fSys: filesys.MakeFsOnDisk(),
})
}
if enforceNamespace {
b.RequireNamespace()
}
return b
}
调用visit函数创建资源
err = r.Visit(func(info *resource.Info, err error) error {
if err != nil {
return err
}
if err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
if err := o.Recorder.Record(info.Object); err != nil {
klog.V(4).Infof("error recording current command: %v", err)
}
if o.DryRunStrategy != cmdutil.DryRunClient {
if o.DryRunStrategy == cmdutil.DryRunServer {
if err := o.DryRunVerifier.HasSupport(info.Mapping.GroupVersionKind); err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
}
obj, err := resource.
NewHelper(info.Client, info.Mapping).
DryRun(o.DryRunStrategy == cmdutil.DryRunServer).
WithFieldManager(o.fieldManager).
WithFieldValidation(o.ValidationDirective).
Create(info.Namespace, true, info.Object)
if err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
info.Refresh(obj, true)
}
Create函数追踪底层调用createResource创建资源
代码位置D:\Workspace\Go\src\k8s.io\[email protected]\staging\src\k8s.io\cli-runtime\pkg\resource\helper.go
func (m *Helper) createResource(c RESTClient, resource, namespace string, obj runtime.Object, options *metav1.CreateOptions) (runtime.Object, error) {
return c.Post().
NamespaceIfScoped(namespace, m.NamespaceScoped).
Resource(resource).
VersionedParams(options, metav1.ParameterCodec).
Body(obj).
Do(context.TODO()).
Get()
}
底层使用RESTClient的Post方法
代码位置staging\src\k8s.io\cli-runtime\pkg\resource\interfaces.go
// RESTClient is a client helper for dealing with RESTful resources
// in a generic way.
type RESTClient interface {
Get() *rest.Request
Post() *rest.Request
Patch(types.PatchType) *rest.Request
Delete() *rest.Request
Put() *rest.Request
}
1.newCmdCreate调用cobra的Run函数
2.调用RunCreate构建resourceBuilder对象
3.调用Visit方法创建资源
4.底层使用RESTClient和kube-apiserver通信
设计模式之建造者模式
优点
缺点
kubectl中的建造者模式
设计模式之建造者模式
建造者(Builder)模式:指将一个复杂对象的构造与它的表示分离
使同样的构建过程可以创建不同的对象,这样的设计模式被称为建造者模式
它是将一个复杂的对象分解为多个简单的对象,然后一步一步构建而成
它将变与不变相分离,即产品的组成部分是不变的,但每一部分是可以灵活选择的。
更多用来针对复杂对象的创建
优点
封装性好,构建和表示分离
扩展性好,各个具体的建造者相互分离,有利于系统的解耦
客户端不必知道产品内部组成的细节,建造者可以对创建过程逐步细化,而不对其他模块产生任何影响,便于控制细节风险。
缺点
产品的组成部分必须相同,这限制了其使用范围
如果产品的内部变化复杂,建造者也要同步修改,后期维护成本较大(下面给出一个建造者模式的最小示意)。
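示意中的 Request、RequestBuilder 均为举例,与 kubectl 无关,仅用来说明"每个方法返回建造者指针、最后 Do() 收尾"的链式写法:

```go
package main

import "fmt"

// Request 是要构建的"复杂对象",字段仅为举例
type Request struct {
	namespace string
	resource  string
	labels    []string
}

// RequestBuilder 持有被构建对象,每个方法都返回建造者自身指针,便于链式调用
type RequestBuilder struct {
	r Request
}

func NewRequestBuilder() *RequestBuilder { return &RequestBuilder{} }

func (b *RequestBuilder) Namespace(ns string) *RequestBuilder {
	b.r.namespace = ns
	return b
}

func (b *RequestBuilder) Resource(res string) *RequestBuilder {
	b.r.resource = res
	return b
}

func (b *RequestBuilder) Label(l string) *RequestBuilder {
	b.r.labels = append(b.r.labels, l)
	return b
}

// Do 结束构建,返回最终对象,对应 kubectl Builder 的 Do()
func (b *RequestBuilder) Do() Request { return b.r }

func main() {
	req := NewRequestBuilder().
		Namespace("default").
		Resource("pods").
		Label("app=nginx").
		Do()
	fmt.Printf("%+v\n", req)
}
```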
kubectl中的建造者模式
kubectl中的Builder对象
特点1 针对复杂对象的创建,字段非常多
特点2 开头的方法要返回要创建对象的指针
特点3 所有的方法都返回的是建造者对象的指针
特点1 针对复杂对象的创建,字段非常多
kubectl中的Builder对象,可以看到字段非常多
如果使用构造函数一次性传入所有参数,参数列表会非常长
而且参数是不固定的,即可以根据用户传入的参数情况构造不同对象
代码位置staging\src\k8s.io\cli-runtime\pkg\resource\builder.go
// Builder provides convenience functions for taking arguments and parameters
// from the command line and converting them to a list of resources to iterate
// over using the Visitor interface.
type Builder struct {
categoryExpanderFn CategoryExpanderFunc
// mapper is set explicitly by resource builders
mapper *mapper
// clientConfigFn is a function to produce a client, *if* you need one
clientConfigFn ClientConfigFunc
restMapperFn RESTMapperFunc
// objectTyper is statically determinant per-command invocation based on your internal or unstructured choice
// it does not ever need to rely upon discovery.
objectTyper runtime.ObjectTyper
// codecFactory describes which codecs you want to use
negotiatedSerializer runtime.NegotiatedSerializer
// local indicates that we cannot make server calls
local bool
errs []error
paths []Visitor
stream bool
stdinInUse bool
dir bool
labelSelector *string
fieldSelector *string
selectAll bool
limitChunks int64
requestTransforms []RequestTransform
resources []string
subresource string
namespace string
allNamespace bool
names []string
resourceTuples []resourceTuple
defaultNamespace bool
requireNamespace bool
flatten bool
latest bool
requireObject bool
singleResourceType bool
continueOnError bool
singleItemImplied bool
schema ContentValidator
// fakeClientFn is used for testing
fakeClientFn FakeClientFunc
}
特点2 开头的方法要返回要创建对象的指针
func NewBuilder(restClientGetter RESTClientGetter) *Builder {
categoryExpanderFn := func() (restmapper.CategoryExpander, error) {
discoveryClient, err := restClientGetter.ToDiscoveryClient()
if err != nil {
return nil, err
}
return restmapper.NewDiscoveryCategoryExpander(discoveryClient), err
}
return newBuilder(
restClientGetter.ToRESTConfig,
restClientGetter.ToRESTMapper,
(&cachingCategoryExpanderFunc{delegate: categoryExpanderFn}).ToCategoryExpander,
)
}
特点3 所有的方法都返回的是建造者对象的指针
staging\src\k8s.io\kubectl\pkg\cmd\create\create.go
r := f.NewBuilder().
Unstructured().
Schema(schema).
ContinueOnError().
NamespaceParam(cmdNamespace).DefaultNamespace().
FilenameParam(enforceNamespace, &o.FilenameOptions).
LabelSelectorParam(o.Selector).
Flatten().
Do()
调用时看着像链式调用,链上的每个方法都返回这个要建造对象的指针
如
func (b *Builder) Schema(schema ContentValidator) *Builder {
b.schema = schema
return b
}
func (b *Builder) ContinueOnError() *Builder {
b.continueOnError = true
return b
}
看起来就是设置构造对象的各种属性
访问者模式(Visitor Pattern)是一种将数据结构与数据操作分离的设计模式,
指封装一些作用于某种数据结构中的各元素的操作,
可以在不改变数据结构的前提下定义作用于这些元素的新的操作,
属于行为型设计模式。
kubectl中的访问者模式
在kubectl中多个Visitor是来访问一个数据结构的不同部分
这种情况下,数据结构有点像一个数据库,而各个Visitor会成为一个个小应用
visitor访问者模式简介
kubectl中的visitor应用
visitor访问者模式简介
访问者模式(Visitor Pattern)是一种将数据结构与数据操作分离的设计模式,
指封装一些作用于某种数据结构中的各元素的操作,
可以在不改变数据结构的前提下定义作用于这些元素的新操作,
属于行为型设计模式。
kubectl中的访问者模式
在kubectl中多个Visitor是来访问一个数据结构的不同部分。
这种情况下,数据结构有点像一个数据库,而各个Visitor会成为一个个小应用。
访问者模式主要适用于以下应用场景:
(1)数据结构稳定,作用于数据结构稳定的操作经常变化的场景。
(2)需要将数据结构与数据操作分离的场景。
(3)需要对不同数据类型(元素)进行操作,而不使用分支判断具体类型的场景。
访问者模式的优点
(1)解耦了数据结构与数据操作,使得操作集合可以独立变化。
(2)可以通过扩展访问者角色,实现对数据集的不同操作,程序扩展性更好。
(3)元素具体类型并非单一,访问者均可操作。
(4)各角色职责分离,符合单一职责原则。
访问者模式的缺点
(1)无法增加元素类型:若系统数据结构对象易于变化,
经常有新的数据对象增加进来,
则访问者类必须增加对应元素类型的操作,违背了开闭原则。
(2)具体元素变更困难:具体元素增加属性、删除属性等操作,
会导致对应的访问者类需要进行相应的修改,
尤其当有大量访问类时,修改范围太大。
(3)违背依赖倒置原则:为了达到"区别对待",
访问者角色依赖的是具体元素类型,而不是抽象。下面先用一个最小示意回顾访问者模式的结构,再看kubectl中的实现。
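示意中的 Info、VisitorFunc、ListVisitor 均为举例,仅模仿 kubectl 中的同名概念,并非其源码:

```go
package main

import "fmt"

// Info 是被访问的数据结构,对应 kubectl 中的 resource.Info(字段为举例)
type Info struct {
	Name      string
	Namespace string
}

// VisitorFunc 定义作用在 Info 上的操作
type VisitorFunc func(*Info) error

// Visitor 只负责遍历数据,把每个元素交给传入的操作
type Visitor interface {
	Visit(VisitorFunc) error
}

// ListVisitor 遍历一组 Info
type ListVisitor struct {
	infos []*Info
}

func (l *ListVisitor) Visit(fn VisitorFunc) error {
	for _, info := range l.infos {
		if err := fn(info); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	v := &ListVisitor{infos: []*Info{
		{Name: "nginx", Namespace: "default"},
		{Name: "redis", Namespace: "kube-system"},
	}}
	// 在不修改 ListVisitor 的前提下,随时可以定义作用在 Info 上的新操作
	_ = v.Visit(func(info *Info) error {
		fmt.Printf("creating %s/%s\n", info.Namespace, info.Name)
		return nil
	})
}
```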
kubectl中访问者模式
在kubectl中多个Visitor是来访问一个数据结构的不同部分
这种情况下,数据结构有点像一个数据库,而各个Visitor会成为一个个小应用。
Visitor接口和VisitorFunc定义
位置在kubernetes/staging/src/k8s.io/cli-runtime/pkg/resource/interfaces.go
// Visitor lets clients walk a list of resources.
type Visitor interface {
Visit(VisitorFunc) error
}
// VisitorFunc implements the Visitor interface for a matching function.
// If there was a problem walking a list of resources, the incoming error
// will describe the problem and the function can decide how to handle that error.
// A nil returned indicates to accept an error to continue loops even when errors happen.
// This is useful for ignoring certain kinds of errors or aggregating errors in some way.
type VisitorFunc func(*Info, error) error
result的Visit方法
func (r *Result) Visit(fn VisitorFunc) error {
if r.err != nil {
return r.err
}
err := r.visitor.Visit(fn)
return utilerrors.FilterOut(err, r.ignoreErrors...)
}
具体的visitor的visit方法定义,参数都是一个VisitorFunc的fn
// Visit in a FileVisitor is just taking care of opening/closing files
func (v *FileVisitor) Visit(fn VisitorFunc) error {
var f *os.File
if v.Path == constSTDINstr {
f = os.Stdin
} else {
var err error
f, err = os.Open(v.Path)
if err != nil {
return err
}
defer f.Close()
}
// TODO: Consider adding a flag to force to UTF16, apparently some
// Windows tools don't write the BOM
utf16bom := unicode.BOMOverride(unicode.UTF8.NewDecoder())
v.StreamVisitor.Reader = transform.NewReader(f, utf16bom)
return v.StreamVisitor.Visit(fn)
}
kubectl create中 通过Builder模式创建visitor并执行的过程
FilenameParam解析 -f文件参数 创建一个visitor
位置kubernetes/staging/src/k8s.io/cli-runtime/pkg/resource/builder.go
validate校验-f参数
func (o *FilenameOptions) validate() []error {
var errs []error
if len(o.Filenames) > 0 && len(o.Kustomize) > 0 {
errs = append(errs, fmt.Errorf("only one of -f or -k can be specified"))
}
if len(o.Kustomize) > 0 && o.Recursive {
errs = append(errs, fmt.Errorf("the -k flag can't be used with -f or -R"))
}
return errs
}
-k代表使用Kustomize配置
如果-f -k都存在报错only one of -f or -k can be specified
kubectl create -f rule.yaml -k rule.yaml
error: only one of -f or -k can be specified
-k不支持递归 -R
kubectl create -k rule.yaml -R
error: the -k flag can't be used with -f or -R
调用path解析文件
recursive := filenameOptions.Recursive
paths := filenameOptions.Filenames
for _, s := range paths {
switch {
case s == "-":
b.Stdin()
case strings.Index(s, "http://") == 0 || strings.Index(s, "https://") == 0:
url, err := url.Parse(s)
if err != nil {
b.errs = append(b.errs, fmt.Errorf("the URL passed to filename %q is not valid: %v", s, err))
continue
}
b.URL(defaultHttpGetAttempts, url)
default:
matches, err := expandIfFilePattern(s)
if err != nil {
b.errs = append(b.errs, err)
continue
}
if !recursive && len(matches) == 1 {
b.singleItemImplied = true
}
b.Path(recursive, matches...)
}
}
遍历-f传入的paths
如果是-代表从标准输入传入
如果是http开头的代表从远端http接口读取,调用b.URL
默认是文件,调用b.Path解析
b.Path调用ExpandPathsToFileVisitors生成visitor
// ExpandPathsToFileVisitors will return a slice of FileVisitors that will handle files from the provided path.
// After FileVisitors open the files, they will pass an io.Reader to a StreamVisitor to do the reading. (stdin
// is also taken care of). Paths argument also accepts a single file, and will return a single visitor
func ExpandPathsToFileVisitors(mapper *mapper, paths string, recursive bool, extensions []string, schema ContentValidator) ([]Visitor, error) {
var visitors []Visitor
err := filepath.Walk(paths, func(path string, fi os.FileInfo, err error) error {
if err != nil {
return err
}
if fi.IsDir() {
if path != paths && !recursive {
return filepath.SkipDir
}
return nil
}
// Don't check extension if the filepath was passed explicitly
if path != paths && ignoreFile(path, extensions) {
return nil
}
visitor := &FileVisitor{
Path: path,
StreamVisitor: NewStreamVisitor(nil, mapper, path, schema),
}
visitors = append(visitors, visitor)
return nil
})
if err != nil {
return nil, err
}
return visitors, nil
}
底层调用的StreamVisitor,把对应的方法注册到visitor中
位置D:\Workspace\Go\kubernetes\staging\src\k8s.io\cli-runtime\pkg\resource\visitor.go
// Visit implements Visitor over a stream. StreamVisitor is able to distinct multiple resources in one stream.
func (v *StreamVisitor) Visit(fn VisitorFunc) error {
d := yaml.NewYAMLOrJSONDecoder(v.Reader, 4096)
for {
ext := runtime.RawExtension{}
if err := d.Decode(&ext); err != nil {
if err == io.EOF {
return nil
}
return fmt.Errorf("error parsing %s: %v", v.Source, err)
}
// TODO: This needs to be able to handle object in other encodings and schemas.
ext.Raw = bytes.TrimSpace(ext.Raw)
if len(ext.Raw) == 0 || bytes.Equal(ext.Raw, []byte("null")) {
continue
}
if err := ValidateSchema(ext.Raw, v.Schema); err != nil {
return fmt.Errorf("error validating %q: %v", v.Source, err)
}
info, err := v.infoForData(ext.Raw, v.Source)
if err != nil {
if fnErr := fn(info, err); fnErr != nil {
return fnErr
}
continue
}
if err := fn(info, nil); err != nil {
return err
}
}
}
用YAMLOrJSONDecoder解析文件
ValidateSchema会解析文件中字段进行校验,比如我们把spec故意写成aspec
kubectl apply -f rule.yaml
error: error validating "rule.yaml": error validating data: [ValidationError(PrometheusRule): Unknown field "aspec" in
infoForData将解析结果转换为Info对象
创建Info,其中Object就是k8s的对象
位置staging\src\k8s.io\cli-runtime\pkg\resource\mapper.go
m.decoder.Decode解析出object和gvk对象
其中object代表就是k8s的对象
gvk是Group/Version/Kind的缩写
// InfoForData creates an Info object for the given data. An error is returned
// if any of the decoding or client lookup steps fail. Name and namespace will be
// set into Info if the mapping's MetadataAccessor can retrieve them.
func (m *mapper) infoForData(data []byte, source string) (*Info, error) {
obj, gvk, err := m.decoder.Decode(data, nil, nil)
if err != nil {
return nil, fmt.Errorf("unable to decode %q: %v", source, err)
}
name, _ := metadataAccessor.Name(obj)
namespace, _ := metadataAccessor.Namespace(obj)
resourceVersion, _ := metadataAccessor.ResourceVersion(obj)
ret := &Info{
Source: source,
Namespace: namespace,
Name: name,
ResourceVersion: resourceVersion,
Object: obj,
}
if m.localFn == nil || !m.localFn() {
restMapper, err := m.restMapperFn()
if err != nil {
return nil, err
}
mapping, err := restMapper.RESTMapping(gvk.GroupKind(), gvk.Version)
if err != nil {
if _, ok := err.(*meta.NoKindMatchError); ok {
return nil, fmt.Errorf("resource mapping not found for name: %q namespace: %q from %q: %v\nensure CRDs are installed first",
name, namespace, source, err)
}
return nil, fmt.Errorf("unable to recognize %q: %v", source, err)
}
ret.Mapping = mapping
client, err := m.clientFn(gvk.GroupVersion())
if err != nil {
return nil, fmt.Errorf("unable to connect to a server to handle %q: %v", mapping.Resource, err)
}
ret.Client = client
}
return ret, nil
}
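infoForData 中 m.decoder.Decode 做的事情,可以借助 client-go 自带的 scheme 单独演示:把一段 yaml 字节流解码成 runtime.Object,并同时拿到它的 GVK。下面是一个最小示意(yaml 内容为举例):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/kubernetes/scheme"
)

const podYAML = `
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx:1.23
`

func main() {
	// 与 infoForData 中的 m.decoder.Decode 类似:
	// 把字节流解码成 runtime.Object,同时返回它的 GVK(Group/Version/Kind)
	obj, gvk, err := scheme.Codecs.UniversalDeserializer().Decode([]byte(podYAML), nil, nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("gvk: %s, type: %T\n", gvk, obj)
	// 输出类似:gvk: /v1, Kind=Pod, type: *v1.Pod
}
```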
k8s对象object讲解
Object k8s对象
文档地址https://kubernetes.io/zh/docs/concepts/overview/working-with-objects/kubernetes-objects/
位置staging\src\k8s.io\apimachinery\pkg\runtime\interfaces.go
// Object interface must be supported by all API types registered with Scheme. Since objects in a scheme are
// expected to be serialized to the wire, the interface an Object must provide to the Scheme allows
// serializers to set the kind, version, and group the object is represented as. An Object may choose
// to return a no-op ObjectKindAccessor in cases where it is not expected to be serialized.
type Object interface {
GetObjectKind() schema.ObjectKind
DeepCopyObject() Object
}
作用
Kubernetes对象是持久化的实体
Kubernetes使用这些实体去表示整个集群的状态。特别地,它们描述了如下信息:
哪些容器化应用在运行(以及在哪些节点上)
可以被应用使用的资源
关于应用运行时表现的策略,比如重启策略、升级策略,以及容错策略
操作Kubernetes对象,无论是创建、修改,或者删除,需要使用Kubernetes API
期望状态
Kubernetes对象是"目标性记录":一旦创建对象,Kubernetes系统将持续工作以确保对象存在
通过创建对象,本质上是在告知Kubernetes系统,所需要的集群工作负载看起来是什么样子的,这就是Kubernetes集群的期望状态(Desired State)
对象规约(Spec)与状态(Status)
几乎每个Kubernetes对象都包含两个嵌套的对象字段,它们负责管理对象的配置:对象spec(规约)和对象status(状态)
对于具有spec的对象,你必须在创建时设置其内容,描述你希望对象所具有的特征:期望状态(Desired State)。
status描述了对象的当前状态(Current State),它是由Kubernetes系统和组件设置并更新的。在任何时刻,Kubernetes控制平面都一直积极地管理着对象的实际状态,以使之与期望状态相匹配。
yaml中的必须字段
在想要创建的Kubernetes对象对应的.yaml文件中,需要配置如下的字段:
apiVersion - 创建该对象所使用的Kubernetes API的版本
kind - 想要创建的对象的类别
metadata-帮助唯一性标识对象的一些数据,包括一个name字符串、UID和可选的namespace
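这几个必须字段与 Go 类型的对应关系可以直观感受一下:TypeMeta 对应 apiVersion 和 kind,ObjectMeta 对应 metadata,Spec/Status 分别对应规约和状态。下面是一个最小示意(镜像等取值为举例):

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	pod := corev1.Pod{
		// TypeMeta 对应 yaml 中的 apiVersion 和 kind
		TypeMeta: metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
		// ObjectMeta 对应 yaml 中的 metadata(name、namespace、labels 等)
		ObjectMeta: metav1.ObjectMeta{Name: "nginx", Namespace: "default"},
		// Spec 描述期望状态(Desired State)
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "nginx", Image: "nginx:1.23"}},
		},
		// Status 描述当前状态(Current State),由系统填充,创建时不需要设置
	}
	out, _ := json.MarshalIndent(pod, "", "  ")
	fmt.Println(string(out))
}
```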
Do中创建一批visitor
// Do returns a Result object with a Visitor for the resources identified by the Builder.
// The visitor will respect the error behavior specified by ContinueOnError. Note that stream
// inputs are consumed by the first execution - use Infos() or Object() on the Result to capture a list
// for further iteration.
func (b *Builder) Do() *Result {
r := b.visitorResult()
r.mapper = b.Mapper()
if r.err != nil {
return r
}
if b.flatten {
r.visitor = NewFlattenListVisitor(r.visitor, b.objectTyper, b.mapper)
}
helpers := []VisitorFunc{}
if b.defaultNamespace {
helpers = append(helpers, SetNamespace(b.namespace))
}
if b.requireNamespace {
helpers = append(helpers, RequireNamespace(b.namespace))
}
helpers = append(helpers, FilterNamespace)
if b.requireObject {
helpers = append(helpers, RetrieveLazy)
}
if b.continueOnError {
r.visitor = ContinueOnErrorVisitor{Visitor: r.visitor}
}
r.visitor = NewDecoratedVisitor(r.visitor, helpers...)
return r
}
helpers代表一批VisitorFunc
比如校验namespace的RequireNamespace
// RequireNamespace will either set a namespace if none is provided on the
// Info object, or if the namespace is set and does not match the provided
// value, returns an error. This is intended to guard against administrators
// accidentally operating on resources outside their namespace.
func RequireNamespace(namespace string) VisitorFunc {
return func(info *Info, err error) error {
if err != nil {
return err
}
if !info.Namespaced() {
return nil
}
if len(info.Namespace) == 0 {
info.Namespace = namespace
UpdateObjectNamespace(info, nil)
return nil
}
if info.Namespace != namespace {
return fmt.Errorf("the namespace from the provided object %q does not match the namespace %q. You must pass '--namespace=%s' to perform this operation.", info.Namespace, namespace, info.Namespace)
}
return nil
}
}
创建带装饰器的visitor DecoratedVisitor
if b.continueOnError {
r.visitor = ContinueOnErrorVisitor{Visitor: r.visitor}
}
r.visitor = NewDecoratedVisitor(r.visitor, helpers...)
对应的visit方法
// Visit implements Visitor
func (v DecoratedVisitor) Visit(fn VisitorFunc) error {
return v.visitor.Visit(func(info *Info, err error) error {
if err != nil {
return err
}
for i := range v.decorators {
if err := v.decorators[i](info, nil); err != nil {
return err
}
}
return fn(info, nil)
})
}
visitor的调用
Visitor调用链分析
外层调用result.Visit方法,内部的func
err = r.Visit(func(info *resource.Info, err error) error {
if err != nil {
return err
}
if err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
if err := o.Recorder.Record(info.Object); err != nil {
klog.V(4).Infof("error recording current command: %v", err)
}
if o.DryRunStrategy != cmdutil.DryRunClient {
if o.DryRunStrategy == cmdutil.DryRunServer {
if err := o.DryRunVerifier.HasSupport(info.Mapping.GroupVersionKind); err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
}
obj, err := resource.
NewHelper(info.Client, info.Mapping).
DryRun(o.DryRunStrategy == cmdutil.DryRunServer).
WithFieldManager(o.fieldManager).
WithFieldValidation(o.ValidationDirective).
Create(info.Namespace, true, info.Object)
if err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
info.Refresh(obj, true)
}
count++
return o.PrintObj(info.Object)
})
visitor接口中的调用方法
// Visit implements the Visitor interface on the items described in the Builder.
// Note that some visitor sources are not traversable more than once, or may
// return different results. If you wish to operate on the same set of resources
// multiple times, use the Infos() method.
func (r *Result) Visit(fn VisitorFunc) error {
if r.err != nil {
return r.err
}
err := r.visitor.Visit(fn)
return utilerrors.FilterOut(err, r.ignoreErrors...)
}
最终的调用就是前面注册的各个visitor的Visit方法
外层VisitorFunc分析
如果出错就返回错误
DryRunStrategy代表试运行策略
默认为None代表不试运行
client代表客户端试运行,不发送请求到server
server代表服务端试运行,会发送请求到服务端处理,但不会真正持久化改动
最终调用,Create创建资源,然后调用o.PrintObj(info.Object)打印结果
func(info *resource.Info, err error) error {
if err != nil {
return err
}
if err := util.CreateOrUpdateAnnotation(cmdutil.GetFlagBool(cmd, cmdutil.ApplyAnnotationsFlag), info.Object, scheme.DefaultJSONEncoder()); err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
if err := o.Recorder.Record(info.Object); err != nil {
klog.V(4).Infof("error recording current command: %v", err)
}
if o.DryRunStrategy != cmdutil.DryRunClient {
if o.DryRunStrategy == cmdutil.DryRunServer {
if err := o.DryRunVerifier.HasSupport(info.Mapping.GroupVersionKind); err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
}
obj, err := resource.
NewHelper(info.Client, info.Mapping).
DryRun(o.DryRunStrategy == cmdutil.DryRunServer).
WithFieldManager(o.fieldManager).
WithFieldValidation(o.ValidationDirective).
Create(info.Namespace, true, info.Object)
if err != nil {
return cmdutil.AddSourceToErr("creating", info.Source, err)
}
info.Refresh(obj, true)
}
count++
return o.PrintObj(info.Object)
}
kubectl的职责
主要的工作是处理用户提交的内容(包括命令行参数、yaml文件等)
然后其会把用户提交的这些东西组织成一个数据结构体
然后把其发送给API Server
kubectl的代码原理
cobra从命令行和yaml文件中获取信息
通过Builder模式把其转换成一系列的资源
最后用Visitor模式来迭代处理这些Resources,实现各类资源对象的解析和校验
用RESTClient将Object发送到kube-apiserver
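这套流程的最终效果,等价于用 client-go 直接把对象 POST 给 kube-apiserver。下面用一个最小示意帮助理解最后一步(kubeconfig 路径、Pod 内容均为举例):

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// 读取 kubeconfig(路径为举例),得到 rest.Config
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "nginx", Namespace: "default"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "nginx", Image: "nginx:1.23"}},
		},
	}
	// 底层同样是向 /api/v1/namespaces/default/pods 发送 POST 请求
	created, err := clientset.CoreV1().Pods("default").Create(context.TODO(), pod, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("created pod:", created.Name)
}
```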
kubectl架构图
create流程
kubectl中的核心对象
RESTClient:与kube-apiserver通信的RESTful客户端
位置D:\Workspace\Go\kubernetes\staging\src\k8s.io\cli-runtime\pkg\resource\interfaces.go
type RESTClientGetter interface {
ToRESTConfig() (*rest.Config, error)
ToDiscoveryClient() (discovery.CachedDiscoveryInterface, error)
ToRESTMapper() (meta.RESTMapper, error)
}
Object k8s对象
文档地址https://kubernetes.io/zh/docs/concepts/overview/working-with-objects/kubernetes-objects/
staging\src\k8s.io\cli-runtime\pkg\resource\interfaces.go
apiserver启动流程
CreateServerChain创建3个server
CreateKubeAPIServer创建kubeAPIServer代表API核心服务,包括常见的Pod/Deployment/Service
createAPIExtensionsServer创建apiExtensionsServer代表API扩展服务,主要针对CRD
createAggregatorServer创建aggregatorServer代表处理metrics的服务
入口地址
位置D:\Workspace\Go\kubernetes\cmd\kube-apiserver\apiserver.go
初始化apiserver的cmd并执行
func main() {
command := app.NewAPIServerCommand()
code := cli.Run(command)
os.Exit(code)
}
newCmd执行流程
之前我们说过cobra的几个func执行顺序
// The *Run functions are executed in the following order:
// * PersistentPreRun()
// * PreRun()
// * Run()
// * PostRun()
// * PersistentPostRun()
// All functions get the same args, the arguments after the command name.
//
PersistentPreRunE准备
设置WarningHandler
PersistentPreRunE: func(*cobra.Command, []string) error {
// silence client-go warnings.
// kube-apiserver loopback clients should not log self-issued warnings.
rest.SetDefaultWarningHandler(rest.NoWarnings{})
return nil
},
runE解析 准备工作
打印版本信息
verflag.PrintAndExitIfRequested()
...
// PrintAndExitIfRequested will check if the -version flag was passed
// and, if so, print the version and exit.
func PrintAndExitIfRequested() {
if *versionFlag == VersionRaw {
fmt.Printf("%#v\n", version.Get())
os.Exit(0)
} else if *versionFlag == VersionTrue {
fmt.Printf("%s %s\n", programName, version.Get())
os.Exit(0)
}
}
打印命令行参数
// PrintFlags logs the flags in the flagset
func PrintFlags(flags *pflag.FlagSet) {
flags.VisitAll(func(flag *pflag.Flag) {
klog.V(1).Infof("FLAG: --%s=%q", flag.Name, flag.Value)
})
}
检查不安全的端口
delete this check after insecure flags removed in v1.24
Complete设置默认值
// set default options
completedOptions, err := Complete(s)
if err != nil {
return err
}
检查命令行参数
// validate options
if errs := completedOptions.Validate(); len(errs) != 0 {
return utilerrors.NewAggregate(errs)
}
cmd\kube-apiserver\app\options\validation.go
// Validate checks ServerRunOptions and return a slice of found errs.
func (s *ServerRunOptions) Validate() []error {
var errs []error
if s.MasterCount <= 0 {
errs = append(errs, fmt.Errorf("--apiserver-count should be a positive number, but value '%d' provided", s.MasterCount))
}
errs = append(errs, s.Etcd.Validate()...)
errs = append(errs, validateClusterIPFlags(s)...)
errs = append(errs, validateServiceNodePort(s)...)
errs = append(errs, validateAPIPriorityAndFairness(s)...)
errs = append(errs, s.SecureServing.Validate()...)
errs = append(errs, s.Authentication.Validate()...)
errs = append(errs, s.Authorization.Validate()...)
errs = append(errs, s.Audit.Validate()...)
errs = append(errs, s.Admission.Validate()...)
errs = append(errs, s.APIEnablement.Validate(legacyscheme.Scheme, apiextensionsapiserver.Scheme, aggregatorscheme.Scheme)...)
errs = append(errs, validateTokenRequest(s)...)
errs = append(errs, s.Metrics.Validate()...)
errs = append(errs, validateAPIServerIdentity(s)...)
return errs
}
举一个例子,比如这个校验etcd的src\k8s.io\apiserver\pkg\server\options\etcd.go
func (s *EtcdOptions) Validate() []error {
if s == nil {
return nil
}
allErrors := []error{}
if len(s.StorageConfig.Transport.ServerList) == 0 {
allErrors = append(allErrors, fmt.Errorf("--etcd-servers must be specified"))
}
if s.StorageConfig.Type != storagebackend.StorageTypeUnset && !storageTypes.Has(s.StorageConfig.Type) {
allErrors = append(allErrors, fmt.Errorf("--storage-backend invalid, allowed values: %s. If not specified, it will default to 'etcd3'", strings.Join(storageTypes.List(), ", ")))
}
for _, override := range s.EtcdServersOverrides {
tokens := strings.Split(override, "#")
if len(tokens) != 2 {
allErrors = append(allErrors, fmt.Errorf("--etcd-servers-overrides invalid, must be of format: group/resource#servers, where servers are URLs, semicolon separated"))
continue
}
apiresource := strings.Split(tokens[0], "/")
if len(apiresource) != 2 {
allErrors = append(allErrors, fmt.Errorf("--etcd-servers-overrides invalid, must be of format: group/resource#servers, where servers are URLs, semicolon separated"))
continue
}
}
return allErrors
}
kubectl get pod -n kube-system
ps -ef |grep apiserver
ps -ef |grep apiserver |grep etcd
真正的Run函数
Run(completeOptions, genericapiserver.SetupSignalHandler())
completedOptions代表ServerRunOptions
第二个参数解析stopCh
在底层的Run函数定义上可以看到第二个参数类型是一个只读的stop chan,即stopCh <-chan struct{}
对应的genericapiserver.SetupSignalHandler()解析
var onlyOneSignalHandler = make(chan struct{})
var shutdownHandler chan os.Signal
// SetupSignalHandler registered for SIGTERM and SIGINT. A stop channel is returned
// which is closed on one of these signals. If a second signal is caught, the program
// is terminated with exit code 1.
// Only one of SetupSignalContext and SetupSignalHandler should be called, and only can
// be called once.
func SetupSignalHandler() <-chan struct{} {
return SetupSignalContext().Done()
}
// SetupSignalContext is same as SetupSignalHandler, but a context.Context is returned.
// Only one of SetupSignalContext and SetupSignalHandler should be called, and only can
// be called once.
func SetupSignalContext() context.Context {
close(onlyOneSignalHandler) // panics when called twice
shutdownHandler = make(chan os.Signal, 2)
ctx, cancel := context.WithCancel(context.Background())
signal.Notify(shutdownHandler, shutdownSignals...)
go func() {
<-shutdownHandler
cancel()
<-shutdownHandler
os.Exit(1) // second signal. Exit directly.
}()
return ctx
}
从上面可以看出这是一个context的Done方法返回,就是一个<-chan struct{}
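SetupSignalContext 这种"信号转 context/只读 channel"的写法,可以用一个独立的小程序来理解(最小示意,信号处理逻辑仿照上文,并非 kube-apiserver 源码):

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	// 与 SetupSignalContext 类似:收到第一个信号时 cancel,第二个信号直接退出
	c := make(chan os.Signal, 2)
	signal.Notify(c, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-c
		cancel()
		<-c
		os.Exit(1)
	}()

	fmt.Println("running, press ctrl-c to stop")
	<-ctx.Done() // 等价于接收一个 <-chan struct{}
	fmt.Println("graceful shutdown")
}
```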
CreateKubeAPIServer创建 kubeAPIServer代表API核心服务,包括常见的Pod/Deployment/Service
createAPIExtensionsServer创建 apiExtensionsServer 代表API扩展服务,主要针对CRD
createAggregatorServer创建aggregatorServer代表处理metrics的服务
然后运行
这一小节先简单过一下运行的流程,后面再慢慢看细节
// Run runs the specified APIServer. This should never exit.
func Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {
// To help debugging, immediately log version
klog.Infof("Version: %+v", version.Get())
klog.InfoS("Golang settings", "GOGC", os.Getenv("GOGC"), "GOMAXPROCS", os.Getenv("GOMAXPROCS"), "GOTRACEBACK", os.Getenv("GOTRACEBACK"))
server, err := CreateServerChain(completeOptions, stopCh)
if err != nil {
return err
}
prepared, err := server.PrepareRun()
if err != nil {
return err
}
return prepared.Run(stopCh)
}
API核心服务所需通用配置的准备工作
创建和节点通信的结构体proxyTransport,使用缓存长连接来提高效率
创建clientset
初始化etcd存储
D:\Workspace\Go\src\github.com\kubernetes\kubernetes\cmd\kube-apiserver\app\server.go
创建和节点通信的结构体proxyTransport,使用缓存长连接提高效率
proxyTransport := CreateProxyTransport()
http.Transport功能简介
transport的主要功能其实就是缓存了长连接
用于大量http请求场景下的连接复用
减少发送请求时TCP(TLS)连接建立的时间损耗
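下面用标准库 http.Transport 简单示意"缓存长连接"的含义(参数取值为举例,并非 kube-apiserver 的实际配置):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Transport 内部维护一个空闲连接池,后续请求会复用已建立的 TCP/TLS 连接
	transport := &http.Transport{
		MaxIdleConns:        100,              // 连接池中最多保留的空闲连接数
		MaxIdleConnsPerHost: 10,               // 每个目标主机最多保留的空闲连接数
		IdleConnTimeout:     90 * time.Second, // 空闲连接保留多久后关闭
	}
	client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

	// 对同一主机的多次请求会复用同一批长连接,省去重复握手的开销
	for i := 0; i < 3; i++ {
		resp, err := client.Get("https://kubernetes.io")
		if err != nil {
			fmt.Println("request error:", err)
			continue
		}
		resp.Body.Close()
		fmt.Println("status:", resp.Status)
	}
}
```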
创建通用配置genericConfig
genericConfig, versionedInformers, serviceResolver, pluginInitializers, admissionPostStartHook, storageFactory, err := buildGenericConfig(s.ServerRunOptions, proxyTransport)
下面是众多的ApplyTo分析
众多ApplyTo分析,并且有对应的AddFlags标记命令行参数
先创建genericConfig
genericConfig = genericapiserver.NewConfig(legacyscheme.Codecs)
以https安全服务配置的SecureServing.ApplyTo为例分析
if lastErr = s.SecureServing.ApplyTo(&genericConfig.SecureServing, &genericConfig.LoopbackClientConfig); lastErr != nil {
return
}
底层调用SecureServingOptions的ApplyTo,有对应的AddFlags方法标记命令行参数,位置在
staging\src\k8s.io\apiserver\pkg\server\options\serving.go
func (s *SecureServingOptions) AddFlags(fs *pflag.FlagSet) {
if s == nil {
return
}
fs.IPVar(&s.BindAddress, "bind-address", s.BindAddress, ""+
"The IP address on which to listen for the --secure-port port. The "+
"associated interface(s) must be reachable by the rest of the cluster, and by CLI/web "+
"clients. If blank or an unspecified address (0.0.0.0 or ::), all interfaces will be used.")
desc := "The port on which to serve HTTPS with authentication and authorization."
if s.Required {
desc += " It cannot be switched off with 0."
} else {
desc += " If 0, don't serve HTTPS at all."
}
fs.IntVar(&s.BindPort, "secure-port", s.BindPort, desc)
初始化etcd存储
创建存储工厂配置
storageFactoryConfig := kubeapiserver.NewStorageFactoryConfig()
storageFactoryConfig.APIResourceConfig = genericConfig.MergedResourceConfig
completedStorageFactoryConfig, err := storageFactoryConfig.Complete(s.Etcd)
if err != nil {
lastErr = err
return
}
初始化存储工厂
storageFactory, lastErr = completedStorageFactoryConfig.New()
if lastErr != nil {
return
}
将存储工厂应用到服务端运行对象中,后期可以通过RESTOptionsGetter获取操作Etcd的句柄
if lastErr = s.Etcd.ApplyWithStorageFactoryTo(storageFactory, genericConfig); lastErr != nil {
return
}
func (s *EtcdOptions) ApplyWithStorageFactoryTo(factory serverstorage.StorageFactory, c *server.Config) error {
if err := s.addEtcdHealthEndpoint(c); err != nil {
return err
}
// use the StorageObjectCountTracker interface instance from server.Config
s.StorageConfig.StorageObjectCountTracker = c.StorageObjectCountTracker
c.RESTOptionsGetter = &StorageFactoryRestOptionsFactory{Options: *s, StorageFactory: factory}
return nil
}
addEtcdHealthEndpoint创建etcd的健康检测
func (s *EtcdOptions) addEtcdHealthEndpoint(c *server.Config) error {
healthCheck, err := storagefactory.CreateHealthCheck(s.StorageConfig)
if err != nil {
return err
}
c.AddHealthChecks(healthz.NamedCheck("etcd", func(r *http.Request) error {
return healthCheck()
}))
if s.EncryptionProviderConfigFilepath != "" {
kmsPluginHealthzChecks, err := encryptionconfig.GetKMSPluginHealthzCheckers(s.EncryptionProviderConfigFilepath)
if err != nil {
return err
}
c.AddHealthChecks(kmsPluginHealthzChecks...)
}
return nil
}
从CreateHealthCheck得知,只支持etcdV3的接口
// CreateHealthCheck creates a healthcheck function based on given config.
func CreateHealthCheck(c storagebackend.Config) (func() error, error) {
switch c.Type {
case storagebackend.StorageTypeETCD2:
return nil, fmt.Errorf("%s is no longer a supported storage backend", c.Type)
case storagebackend.StorageTypeUnset, storagebackend.StorageTypeETCD3:
return newETCD3HealthCheck(c)
default:
return nil, fmt.Errorf("unknown storage type: %s", c.Type)
}
}
设置使用protobufs用来内部交互,并且禁用压缩功能
因为内部网络速度快,没必要为了节省带宽而将cpu浪费在压缩和解压上
// Use protobufs for self-communication.
// Since not every generic apiserver has to support protobufs, we
// cannot default to it in generic apiserver and need to explicitly
// set it in kube-apiserver.
genericConfig.LoopbackClientConfig.ContentConfig.ContentType = "application/vnd.kubernetes.protobuf"
// Disable compression for self-communication, since we are going to be
// on a fast local network
genericConfig.LoopbackClientConfig.DisableCompression = true
创建clientset
kubeClientConfig := genericConfig.LoopbackClientConfig
clientgoExternalClient, err := clientgoclientset.NewForConfig(kubeClientConfig)
if err != nil {
lastErr = fmt.Errorf("failed to create real external clientset: %v", err)
return
}
versionedInformers = clientgoinformers.NewSharedInformerFactory(clientgoExternalClient, 10*time.Minute)
versionedInformers是client-go的SharedInformerFactory对象,用于对k8s对象进行ListAndWatch
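SharedInformerFactory 的典型用法可以用下面这个 client-go 最小示意来说明(kubeconfig 路径为举例):

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config") // 路径为举例
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// 与 buildGenericConfig 中类似,创建一个带 resync 周期的 SharedInformerFactory
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("pod added:", pod.Namespace+"/"+pod.Name)
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)            // 启动底层的 ListAndWatch
	factory.WaitForCacheSync(stopCh) // 等待本地缓存同步完成
	select {}                        // 阻塞,持续接收事件
}
```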
Authentication的目的
验证你是谁 确认“你是不是你”
包括多种方式,如Client Certificates、Password and Plain Tokens、Bootstrap Tokens、JWT Tokens等
Kubernetes使用身份认证插件利用下面的策略来认证API请求的身份
-客户端证书
-持有者令牌(Bearer Token)
-身份认证代理(Proxy)
-HTTP基本认证机制
union认证的规则
-如果某一个认证方法报错就返回,说明认证没过
-如果某一个认证方法报ok,说明认证过了,直接return了,无需再运行其他认证了
-如果所有的认证方法都没报ok,则认证没过
验证你是谁 确认“你是不是你”
包括多种方式,如Client Certificates、Password and Plain Tokens、Bootstrap Tokens、JWT Tokens等
文档地址https://kubernetes.io/zh/docs/reference/access-authn-authz/authentication/
所有Kubernetes集群都有两类用户:由Kubernetes管理的服务账号和普通用户
所以认证要围绕这两类用户展开
身份认证策略
Kubernetes使用身份认证插件利用客户端证书、持有者令牌(Bearer Token)、身份认证代理(Proxy)或者HTTP基本认证机制来认证API请求的身份
Http请求发给API服务器时,插件会将以下属性关联到请求本身:
- 用户名:用来辨识最终用户的字符串。常见的值可以是kube-admin或jane@example.com。
- 用户ID:用来辨识最终用户的字符串。旨在比用户名有更好的一致性和唯一性。
- 用户组:取值为一组字符串,其中各个字符串用来表明用户是某个命名的用户逻辑集合的成员。常见的值可能是system:masters或者devops-team等。
-附加字段:一组额外的键-值映射,键是字符串,值是一组字符串;用来保存鉴权组件可能认为有用的额外信息
你可以同时启用多种身份认证方法,并且你通常会至少使用两种方法:
-针对服务账号使用服务账号令牌
-至少另外一种方法对用户的身份进行认证
当集群中启用了多个身份认证模块时,第一个成功地对请求完成身份认证的模块会直接做出评估决定。API服务器并不保证身份认证模块的运行顺序。
对于所有通过身份认证的用户,system:authenticated组都会被添加到其组列表中。
其他身份认证协议(LDAP、SAML、Kerberos、X509的替代模式等)都可以通过身份认证代理或身份认证webhook来实现。
代码解读
D:\Workspace\Go\src\github.com\kubernetes\kubernetes\cmd\kube-apiserver\app\server.go
之前构建server之前生成通用配置buildGenericConfig里
// Authentication.ApplyTo requires already applied OpenAPIConfig and EgressSelector if present
if lastErr = s.Authentication.ApplyTo(&genericConfig.Authentication, genericConfig.SecureServing, genericConfig.EgressSelector, genericConfig.OpenAPIConfig, genericConfig.OpenAPIV3Config, clientgoExternalClient, versionedInformers); lastErr != nil {
return
}
真正的Authentication初始化
D:\Workspace\Go\src\github.com\kubernetes\kubernetes\pkg\kubeapiserver\options\authentication.go
authInfo.Authenticator, openAPIConfig.SecurityDefinitions, err = authenticatorConfig.New()
New代码、创建认证实例,支持多种认证方式:请求Header认证、Auth文件认证、CA证书认证、Bearer token认证
D:\Workspace\Go\src\github.com\kubernetes\kubernetes\pkg\kubeapiserver\authenticator\config.go
核心变量1 tokenAuthenticators []authenticator.Token代表Bearer token认证
// Token checks a string value against a backing authentication store and
// returns a Response or an error if the token could not be checked.
type Token interface {
AuthenticateToken(ctx context.Context, token string) (*Response, bool, error)
}
各个认证方法不断添加到数组中,最终创建union对象,认证时调用unionAuthTokenHandler.AuthenticateToken
// Union the token authenticators
tokenAuth := tokenunion.New(tokenAuthenticators...)
// AuthenticateToken authenticates the token using a chain of authenticator.Token objects.
func (authHandler *unionAuthTokenHandler) AuthenticateToken(ctx context.Context, token string) (*authenticator.Response, bool, error) {
var errlist []error
for _, currAuthRequestHandler := range authHandler.Handlers {
info, ok, err := currAuthRequestHandler.AuthenticateToken(ctx, token)
if err != nil {
if authHandler.FailOnError {
return info, ok, err
}
errlist = append(errlist, err)
continue
}
if ok {
return info, ok, err
}
}
return nil, false, utilerrors.NewAggregate(errlist)
}
核心变量2 authenticator.Request代表用户认证的接口,其中AuthenticateRequest是对应的认证方法
// Request attempts to extract authentication information from a request and
// returns a Response or an error if the request could not be checked.
type Request interface {
AuthenticateRequest(req *http.Request) (*Response, bool, error)
}
然后不断添加到切片中,比如x509认证
// X509 methods
if config.ClientCAContentProvider != nil {
certAuth := x509.NewDynamic(config.ClientCAContentProvider.VerifyOptions, x509.CommonNameUserConversion)
authenticators = append(authenticators, certAuth)
}
把上面的unionAuthTokenHandler也加入到链中
authenticators = append(authenticators, bearertoken.New(tokenAuth), websocket.NewProtocolAuthenticator(tokenAuth))
最后创建一个union对象unionAuthRequestHandler
authenticator := union.New(authenticators...)
最终调用的unionAuthRequestHandler.AuthenticateRequest方法会遍历各认证方法进行认证
// AuthenticateRequest authenticates the request using a chain of authenticator.Request objects.
func (authHandler *unionAuthRequestHandler) AuthenticateRequest(req *http.Request) (*authenticator.Response, bool, error) {
var errlist []error
for _, currAuthRequestHandler := range authHandler.Handlers {
resp, ok, err := currAuthRequestHandler.AuthenticateRequest(req)
if err != nil {
if authHandler.FailOnError {
return resp, ok, err
}
errlist = append(errlist, err)
continue
}
if ok {
return resp, ok, err
}
}
return nil, false, utilerrors.NewAggregate(errlist)
}
代码解读:
-如果某一个认证方法报错就返回,说明认证没过
-如果某一个认证方法报ok,说明认证过了,直接return了,无需再运行其他认证了
-如果所有得认证方法都没报ok,则认证没过
Authentication的目的
Kubernetes使用身份认证插件利用下面的策略来认证API请求的身份
-客户端证书
-持有者令牌(Bearer Token)
-身份认证代理(Proxy)
-HTTP基本认证机制
union认证的规则
-如果某一个认证方法报错就返回,说明认证没过
-如果某一个认证方法报ok,说明认证过了,直接return了,无需再运行其他认证了
-如果所有得认证方法都没报ok,则认证没过
Authorization鉴权
确认"你是否有权限做这件事";是否有权限,通过配置的策略来判定
4种鉴权模块
鉴权执行链unionAuthzHandler
Authorization鉴权相关
Authorization鉴权,确认"你是否有权限做这件事";是否有权限,通过配置的策略来判定。
Kubernetes使用API服务器对API请求进行鉴权
它根据所有策略评估所有请求属性来决定允许或拒绝请求。
一个API请求的所有部分都必须被某些策略允许才能继续。这意味着默认情况下拒绝权限。
当系统配置了多个鉴权模块时,Kubernetes将按顺序使用每个模块。如果任何鉴权模块批准或拒绝请求,则立即返回该决定,并不会与其他鉴权模块协商。如果所有模块对请求没有意见,则拒绝该请求。被拒绝的响应返回HTTP状态码403。
文档地址https://kubernetes.io/zh/docs/reference/access-authn-authz/authorization/
文档地址https://kubernetes.io/zh/docs/reference/access-authn-authz/authorization/#authorization-modules
Node - 一个专用鉴权组件,根据调度到kubelet上运行的Pod为kubelet授予权限。了解有关使用节点鉴权模式的更多信息,请参阅节点鉴权。
ABAC-基于属性的访问控制(ABAC)定义了一种访问控制范型,通过使用将属性组合在一起的策略,将访问权限授予用户。策略可以使用任何类型的属性(用户属性、资源属性、对象、环境属性等)。要了解有关ABAC模式更多信息,请参阅ABAC模式。
RBAC-基于角色的访问控制(RBAC)是一种基于企业内个人用户的角色管理对计算机或网络资源的访问的方法。在此上下文中,权限是单个用户执行特定任务的能力,例如查看、创建或修改文件。要了解有关使用RBAC模式更多信息,请参阅RBAC模式。
- 被启用之后,RBAC(基于角色的访问控制)使用rbac.authorization.k8s.io API组来驱动鉴权决策,从而允许管理员通过Kubernetes API动态配置权限策略。
- 要启用RBAC,请使用--authorization-mode = RBAC启动API服务器。
Webhook-Webhook是一个HTTP回调:发生某些事情时调用的HTTP POST,即通过HTTP POST进行简单的事件通知。实现Webhook的Web应用程序会在发生某些事情时将消息发布到URL。要了解有关使用Webhook模式的更多信息,请参阅Webhook模式。
入口还在buildGenericConfig D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go
genericConfig.Authorization.Authorizer, genericConfig.RuleResolver, err = BuildAuthorizer(s, genericConfig.EgressSelector, versionedInformers)
还是通过New构造,位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\authorizer\config.go
authorizationConfig.New()
构造函数New分析
核心变量1 authorizers
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\authorizer\interfaces.go
// Authorizer makes an authorization decision based on information gained by making
// zero or more calls to methods of the Attributes interface. It returns nil when an action is
// authorized, otherwise it returns an error.
type Authorizer interface {
Authorize(ctx context.Context, a Attributes) (authorized Decision, reason string, err error)
}
鉴权的接口,有对应的Authorize执行鉴权操作,返回参数如下
Decision代表鉴权结果,有
- 拒绝 DecisionDeny
- 通过 DecisionAllow
- 未表态 DecisionNoOpinion
reason代表拒绝的原因
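为了直观理解这三种返回值,下面给出一个自定义 Authorizer 的最小示意(demoAuthorizer 及其规则纯属举例,并非 k8s 源码):只允许 admin 用户执行读操作,其余请求不表态,交给鉴权链上的其他 Authorizer 处理:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/apiserver/pkg/authentication/user"
	"k8s.io/apiserver/pkg/authorization/authorizer"
)

// demoAuthorizer 实现 authorizer.Authorizer 接口,规则仅为演示
type demoAuthorizer struct{}

func (demoAuthorizer) Authorize(ctx context.Context, a authorizer.Attributes) (authorizer.Decision, string, error) {
	// 只对 admin 用户的 get/list/watch 放行,其余操作明确拒绝
	if a.GetUser() != nil && a.GetUser().GetName() == "admin" {
		switch a.GetVerb() {
		case "get", "list", "watch":
			return authorizer.DecisionAllow, "", nil
		}
		return authorizer.DecisionDeny, "admin is read-only in this demo", nil
	}
	// 其他用户不表态,由鉴权链中的下一个 Authorizer 决定
	return authorizer.DecisionNoOpinion, "", nil
}

func main() {
	attrs := authorizer.AttributesRecord{
		User:            &user.DefaultInfo{Name: "admin"},
		Verb:            "delete",
		Resource:        "pods",
		Namespace:       "default",
		ResourceRequest: true,
	}
	decision, reason, _ := demoAuthorizer{}.Authorize(context.TODO(), attrs)
	fmt.Println(decision, reason) // 此例中 decision == authorizer.DecisionDeny
}
```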
核心变量2 ruleResolvers
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\authorizer\interfaces.go
// RuleResolver provides a mechanism for resolving the list of rules that apply to a given user within a namespace.
type RuleResolver interface {
// RulesFor get the list of cluster wide rules, the list of rules in the specific namespace, incomplete status and errors.
RulesFor(user user.Info, namespace string) ([]ResourceRuleInfo, []NonResourceRuleInfo, bool, error)
}
获取rule的接口,有对应的RulesFor执行获取rule操作,返回参数如下
[]ResourceRuleInfo代表资源型的rule
[]NonResourceRuleInfo代表非资源型的如nonResourceURLs:["/metrics"]
遍历鉴权模块判断,向上述切片中append
for _, authorizationMode := range config.AuthorizationModes {
// Keep cases in sync with constant list in k8s.io/kubernetes/pkg/kubeapiserver/authorizer/modes/modes.go.
switch authorizationMode {
case modes.ModeNode:
node.RegisterMetrics()
graph := node.NewGraph()
node.AddGraphEventHandlers(
graph,
config.VersionedInformerFactory.Core().V1().Nodes(),
config.VersionedInformerFactory.Core().V1().Pods(),
config.VersionedInformerFactory.Core().V1().PersistentVolumes(),
config.VersionedInformerFactory.Storage().V1().VolumeAttachments(),
)
nodeAuthorizer := node.NewAuthorizer(graph, nodeidentifier.NewDefaultNodeIdentifier(), bootstrappolicy.NodeRules())
authorizers = append(authorizers, nodeAuthorizer)
ruleResolvers = append(ruleResolvers, nodeAuthorizer)
case modes.ModeAlwaysAllow:
alwaysAllowAuthorizer := authorizerfactory.NewAlwaysAllowAuthorizer()
authorizers = append(authorizers, alwaysAllowAuthorizer)
ruleResolvers = append(ruleResolvers, alwaysAllowAuthorizer)
case modes.ModeAlwaysDeny:
alwaysDenyAuthorizer := authorizerfactory.NewAlwaysDenyAuthorizer()
authorizers = append(authorizers, alwaysDenyAuthorizer)
ruleResolvers = append(ruleResolvers, alwaysDenyAuthorizer)
case modes.ModeABAC:
abacAuthorizer, err := abac.NewFromFile(config.PolicyFile)
if err != nil {
return nil, nil, err
}
authorizers = append(authorizers, abacAuthorizer)
ruleResolvers = append(ruleResolvers, abacAuthorizer)
case modes.ModeWebhook:
if config.WebhookRetryBackoff == nil {
return nil, nil, errors.New("retry backoff parameters for authorization webhook has not been specified")
}
clientConfig, err := webhookutil.LoadKubeconfig(config.WebhookConfigFile, config.CustomDial)
if err != nil {
return nil, nil, err
}
webhookAuthorizer, err := webhook.New(clientConfig,
config.WebhookVersion,
config.WebhookCacheAuthorizedTTL,
config.WebhookCacheUnauthorizedTTL,
*config.WebhookRetryBackoff,
)
if err != nil {
return nil, nil, err
}
authorizers = append(authorizers, webhookAuthorizer)
ruleResolvers = append(ruleResolvers, webhookAuthorizer)
case modes.ModeRBAC:
rbacAuthorizer := rbac.New(
&rbac.RoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().Roles().Lister()},
&rbac.RoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().RoleBindings().Lister()},
&rbac.ClusterRoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoles().Lister()},
&rbac.ClusterRoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoleBindings().Lister()},
)
authorizers = append(authorizers, rbacAuthorizer)
ruleResolvers = append(ruleResolvers, rbacAuthorizer)
default:
return nil, nil, fmt.Errorf("unknown authorization mode %s specified", authorizationMode)
}
}
Finally it returns union objects for both, just like authentication does:
return union.New(authorizers...), union.NewRuleResolvers(ruleResolvers...), nil
The union of authorizers: unionAuthzHandler
Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\union\union.go
// New returns an authorizer that authorizes against a chain of authorizer.Authorizer objects
func New(authorizationHandlers ...authorizer.Authorizer) authorizer.Authorizer {
return unionAuthzHandler(authorizationHandlers)
}
// Authorizes against a chain of authorizer.Authorizer objects and returns nil if successful and returns error if unsuccessful
func (authzHandler unionAuthzHandler) Authorize(ctx context.Context, a authorizer.Attributes) (authorizer.Decision, string, error) {
var (
errlist []error
reasonlist []string
)
for _, currAuthzHandler := range authzHandler {
decision, reason, err := currAuthzHandler.Authorize(ctx, a)
if err != nil {
errlist = append(errlist, err)
}
if len(reason) != 0 {
reasonlist = append(reasonlist, reason)
}
switch decision {
case authorizer.DecisionAllow, authorizer.DecisionDeny:
return decision, reason, err
case authorizer.DecisionNoOpinion:
// continue to the next authorizer
}
}
return authorizer.DecisionNoOpinion, strings.Join(reasonlist, "\n"), utilerrors.NewAggregate(errlist)
}
unionAuthzHandler's Authorize method likewise iterates over the inner authorizers and calls each one's Authorize.
If any authorizer returns a decision of allow or deny, that decision is returned immediately.
Otherwise that authorizer has no opinion and the next one is tried (see the standalone sketch below).
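To make the tri-state semantics concrete, here is a minimal standalone sketch (plain Go with local stand-in types, not the real apiserver interfaces): Allow/Deny short-circuits the chain, NoOpinion falls through, and errors only surface if nobody decides.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Decision mirrors the three possible authorizer outcomes.
type Decision int

const (
	DecisionDeny Decision = iota
	DecisionAllow
	DecisionNoOpinion
)

// authorizer is a trimmed-down stand-in for authorizer.Authorizer.
type authorizer func() (Decision, string, error)

// authorize walks the chain: Allow/Deny short-circuits, NoOpinion falls through.
func authorize(chain ...authorizer) (Decision, string, error) {
	var errs []error
	var reasons []string
	for _, a := range chain {
		d, reason, err := a()
		if err != nil {
			errs = append(errs, err)
		}
		if reason != "" {
			reasons = append(reasons, reason)
		}
		switch d {
		case DecisionAllow, DecisionDeny:
			return d, reason, err
		case DecisionNoOpinion:
			// keep going, exactly like unionAuthzHandler
		}
	}
	return DecisionNoOpinion, strings.Join(reasons, "\n"), errors.Join(errs...)
}

func main() {
	noOpinion := func() (Decision, string, error) { return DecisionNoOpinion, "node: not my request", nil }
	allow := func() (Decision, string, error) { return DecisionAllow, "rbac: allowed by ClusterRoleBinding", nil }
	deny := func() (Decision, string, error) { return DecisionDeny, "always deny", nil }

	d, reason, _ := authorize(noOpinion, allow, deny) // deny is never consulted
	fmt.Println(d == DecisionAllow, reason)           // true "rbac: allowed by ClusterRoleBinding"
}
```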
The union of ruleResolvers: unionAuthzRulesHandler
Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\authorization\union\union.go
// unionAuthzRulesHandler authorizer against a chain of authorizer.RuleResolver
type unionAuthzRulesHandler []authorizer.RuleResolver
// NewRuleResolvers returns an authorizer that authorizes against a chain of authorizer.Authorizer objects
func NewRuleResolvers(authorizationHandlers ...authorizer.RuleResolver) authorizer.RuleResolver {
return unionAuthzRulesHandler(authorizationHandlers)
}
// RulesFor against a chain of authorizer.RuleResolver objects and returns nil if successful and returns error if unsuccessful
func (authzHandler unionAuthzRulesHandler) RulesFor(user user.Info, namespace string) ([]authorizer.ResourceRuleInfo, []authorizer.NonResourceRuleInfo, bool, error) {
var (
errList []error
resourceRulesList []authorizer.ResourceRuleInfo
nonResourceRulesList []authorizer.NonResourceRuleInfo
)
incompleteStatus := false
for _, currAuthzHandler := range authzHandler {
resourceRules, nonResourceRules, incomplete, err := currAuthzHandler.RulesFor(user, namespace)
if incomplete {
incompleteStatus = true
}
if err != nil {
errList = append(errList, err)
}
if len(resourceRules) > 0 {
resourceRulesList = append(resourceRulesList, resourceRules...)
}
if len(nonResourceRules) > 0 {
nonResourceRulesList = append(nonResourceRulesList, nonResourceRules...)
}
}
return resourceRulesList, nonResourceRulesList, incompleteStatus, utilerrors.NewAggregate(errList)
}
unionAuthzRulesHandler's RulesFor method iterates over the inner rule resolvers,
calls each one's RulesFor to obtain its resourceRules and nonResourceRules,
and appends the results to resourceRulesList and nonResourceRulesList before returning them.
Purpose of Authorization
The 4 authorization modules
The authorization chain unionAuthzHandler
Node authorization
Node authorization is a special-purpose authorization mode dedicated to authorizing API requests made by kubelets.
Interpretation of the 4 rules:
- If the request is not from a node, reject.
- If the nodeName cannot be identified, reject.
- If the request targets a configmap, pod, pv, pvc or secret, extra checks apply:
- if the verb is not a read verb (get), reject;
- if the requested resource has no relationship to the node, reject.
- If any other resource is requested, it must match the predefined node rules.
Node authorization
Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/node/
Node authorization is a special-purpose authorization mode that specifically authorizes API requests made by kubelets.
Overview
The node authorizer allows a kubelet to perform API operations, including:
Read operations:
services
endpoints
nodes
pods
secrets, configmaps, pvcs and persistent volumes related to pods bound to the kubelet's node
Write operations:
nodes and node status (enable the NodeRestriction admission plugin to limit a kubelet to modifying its own node)
pods and pod status (enable the NodeRestriction admission plugin to limit a kubelet to modifying pods bound to itself)
Auth-related operations:
read/write access to the certificatesigningrequests API used during TLS bootstrapping
the ability to create tokenreviews and subjectaccessreviews for delegated authentication/authorization checks
Location: D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\auth\authorizer\node\node_authorizer.go
func (r *NodeAuthorizer) Authorize(ctx context.Context, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
nodeName, isNode := r.identifier.NodeIdentity(attrs.GetUser())
if !isNode {
// reject requests from non-nodes
return authorizer.DecisionNoOpinion, "", nil
}
if len(nodeName) == 0 {
// reject requests from unidentifiable nodes
klog.V(2).Infof("NODE DENY: unknown node for user %q", attrs.GetUser().GetName())
return authorizer.DecisionNoOpinion, fmt.Sprintf("unknown node for user %q", attrs.GetUser().GetName()), nil
}
// subdivide access to specific resources
if attrs.IsResourceRequest() {
requestResource := schema.GroupResource{Group: attrs.GetAPIGroup(), Resource: attrs.GetResource()}
switch requestResource {
case secretResource:
return r.authorizeReadNamespacedObject(nodeName, secretVertexType, attrs)
case configMapResource:
return r.authorizeReadNamespacedObject(nodeName, configMapVertexType, attrs)
case pvcResource:
if attrs.GetSubresource() == "status" {
return r.authorizeStatusUpdate(nodeName, pvcVertexType, attrs)
}
return r.authorizeGet(nodeName, pvcVertexType, attrs)
case pvResource:
return r.authorizeGet(nodeName, pvVertexType, attrs)
case vaResource:
return r.authorizeGet(nodeName, vaVertexType, attrs)
case svcAcctResource:
return r.authorizeCreateToken(nodeName, serviceAccountVertexType, attrs)
case leaseResource:
return r.authorizeLease(nodeName, attrs)
case csiNodeResource:
return r.authorizeCSINode(nodeName, attrs)
}
}
// Access to other resources is not subdivided, so just evaluate against the statically defined node rules
if rbac.RulesAllow(attrs, r.nodeRules...) {
return authorizer.DecisionAllow, "", nil
}
return authorizer.DecisionNoOpinion, "", nil
}
Rule-by-rule interpretation
// NodeAuthorizer authorizes requests from kubelets, with the following logic:
// 1. If a request is not from a node (NodeIdentity() returns isNode=false), reject
// 2. If a specific node cannot be identified (NodeIdentity() returns nodeName=""), reject
// 3. If a request is for a secret, configmap, persistent volume or persistent volume claim, reject unless the verb is get, and the requested object is related to the requesting node:
// node <- configmap
// node <- pod
// node <- pod <- secret
// node <- pod <- configmap
// node <- pod <- pvc
// node <- pod <- pvc <- pv
// node <- pod <- pvc <- pv <- secret
// 4. For other resources, authorize all nodes uniformly using statically defined rules
The first two rules are straightforward.
Interpreting rule 3
Rule 3: if the requested resource is a secret, configmap, persistent volume or persistent volume claim, the verb must be a read (get for individual objects).
Taking secretResource as an example, it calls the authorizeReadNamespacedObject method:
case secretResource:
return r.authorizeReadNamespacedObject(nodeName, secretVertexType, attrs)
authorizeReadNamespacedObject, the namespace-checking wrapper
authorizeReadNamespacedObject is a decorator: it first checks that the request is a read on a namespace-scoped object, then calls the lower-level authorize method.
// authorizeReadNamespacedObject authorizes "get", "list" and "watch" requests to single objects of a
// specified types if they are related to the specified node.
func (r *NodeAuthorizer) authorizeReadNamespacedObject(nodeName string, startingType vertexType, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
switch attrs.GetVerb() {
case "get", "list", "watch":
//ok
default:
klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
return authorizer.DecisionNoOpinion, "can only read resources of this type", nil
}
if len(attrs.GetSubresource()) > 0 {
klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
return authorizer.DecisionNoOpinion, "cannot read subresource", nil
}
if len(attrs.GetNamespace()) == 0 {
klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
return authorizer.DecisionNoOpinion, "can only read namespaced object of this type", nil
}
return r.authorize(nodeName, startingType, attrs)
}
Interpretation:
- DecisionNoOpinion means "no opinion"; if this is the only authorizer in the chain, it effectively means the request is denied.
- If the verb is a mutating one, reject.
- If the request includes a subresource, reject.
- If the request has no namespace, reject.
- Otherwise call the lower-level authorize.
The node authorizer's lower-level authorize method:
- If the object name is missing, reject.
- hasPathFrom checks whether the object is related to the node (via the node graph).
- If there is no relationship, reject.
func (r *NodeAuthorizer) authorize(nodeName string, startingType vertexType, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
if len(attrs.GetName()) == 0 {
klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
return authorizer.DecisionNoOpinion, "No Object name found", nil
}
ok, err := r.hasPathFrom(nodeName, startingType, attrs.GetNamespace(), attrs.GetName())
if err != nil {
klog.V(2).InfoS("NODE DENY", "err", err)
return authorizer.DecisionNoOpinion, fmt.Sprintf("no relationship found between node '%s' and this object", nodeName), nil
}
if !ok {
klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
return authorizer.DecisionNoOpinion, fmt.Sprintf("no relationship found between node '%s' and this object", nodeName), nil
}
return authorizer.DecisionAllow, "", nil
}
authorizeGet, used for pvResource:
- if the verb is not get, reject;
- if the request contains a subresource, reject;
- otherwise call the lower-level authorize.
// authorizeGet authorizes "get" requests to objects of the specified type if they are related to the specified node
func (r *NodeAuthorizer) authorizeGet(nodeName string, startingType vertexType, attrs authorizer.Attributes) (authorizer.Decision, string, error) {
if attrs.GetVerb() != "get" {
klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
return authorizer.DecisionNoOpinion, "can only get individual resources of this type", nil
}
if len(attrs.GetSubresource()) > 0 {
klog.V(2).Infof("NODE DENY: '%s' %#v", nodeName, attrs)
return authorizer.DecisionNoOpinion, "cannot get subresource", nil
}
return r.authorize(nodeName, startingType, attrs)
}
Interpreting rule 4
Rule 4: when a node requests any other resource, the request is evaluated against the statically defined node rules.
// Access to other resources is not subdivided, so just evaluate against the statically defined node rules
if rbac.RulesAllow(attrs, r.nodeRules...) {
return authorizer.DecisionAllow, "", nil
}
return authorizer.DecisionNoOpinion, "", nil
Under the hood this ends up in rbac.RuleAllows:
D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\auth\authorizer\rbac\rbac.go
func RuleAllows(requestAttributes authorizer.Attributes, rule *rbacv1.PolicyRule) bool {
if requestAttributes.IsResourceRequest() {
combinedResource := requestAttributes.GetResource()
if len(requestAttributes.GetSubresource()) > 0 {
combinedResource = requestAttributes.GetResource() + "/" + requestAttributes.GetSubresource()
}
return rbacv1helpers.VerbMatches(rule, requestAttributes.GetVerb()) &&
rbacv1helpers.APIGroupMatches(rule, requestAttributes.GetAPIGroup()) &&
rbacv1helpers.ResourceMatches(rule, combinedResource, requestAttributes.GetSubresource()) &&
rbacv1helpers.ResourceNameMatches(rule, requestAttributes.GetName())
}
return rbacv1helpers.VerbMatches(rule, requestAttributes.GetVerb()) &&
rbacv1helpers.NonResourceURLMatches(rule, requestAttributes.GetPath())
}
At the bottom are two match helpers.
VerbMatches
- if the rule's verb is "*" (VerbAll), allow;
- if the requested verb equals one of the rule's verbs, allow.
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\apis\rbac\v1\evaluation_helpers.go
func VerbMatches(rule *rbacv1.PolicyRule, requestedVerb string) bool {
for _, ruleVerb := range rule.Verbs {
if ruleVerb == rbacv1.VerbAll {
return true
}
if ruleVerb == requestedVerb {
return true
}
}
return false
}
NonResourceURLMatches
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\apis\rbac\v1\evaluation_helpers.go
- if the rule URL is "*" (NonResourceAll), allow;
- if the requested URL equals one of the rule's URLs, allow;
- if a rule URL ends with "*" it is a wildcard: allow when the requested URL starts with that rule URL's prefix.
func NonResourceURLMatches(rule *rbacv1.PolicyRule, requestedURL string) bool {
for _, ruleURL := range rule.NonResourceURLs {
if ruleURL == rbacv1.NonResourceAll {
return true
}
if ruleURL == requestedURL {
return true
}
if strings.HasSuffix(ruleURL, "*") && strings.HasPrefix(requestedURL, strings.TrimRight(ruleURL, "*")) {
return true
}
}
return false
}
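To see the two matchers in action, here is a small self-contained example against the public rbacv1.PolicyRule type; since the helpers above live in an internal package (k8s.io/kubernetes/pkg/apis/rbac/v1), the sketch copies their logic locally.

```go
package main

import (
	"fmt"
	"strings"

	rbacv1 "k8s.io/api/rbac/v1"
)

// Local copies of the helper logic shown above, so the example runs against the
// public rbacv1.PolicyRule type without importing the internal helpers package.
func verbMatches(rule *rbacv1.PolicyRule, requestedVerb string) bool {
	for _, ruleVerb := range rule.Verbs {
		if ruleVerb == rbacv1.VerbAll || ruleVerb == requestedVerb {
			return true
		}
	}
	return false
}

func nonResourceURLMatches(rule *rbacv1.PolicyRule, requestedURL string) bool {
	for _, ruleURL := range rule.NonResourceURLs {
		if ruleURL == rbacv1.NonResourceAll || ruleURL == requestedURL {
			return true
		}
		if strings.HasSuffix(ruleURL, "*") && strings.HasPrefix(requestedURL, strings.TrimRight(ruleURL, "*")) {
			return true
		}
	}
	return false
}

func main() {
	rule := rbacv1.PolicyRule{
		Verbs:           []string{"get"},
		NonResourceURLs: []string{"/metrics", "/logs/*"},
	}
	fmt.Println(verbMatches(&rule, "get"))                 // true
	fmt.Println(verbMatches(&rule, "delete"))              // false
	fmt.Println(nonResourceURLMatches(&rule, "/metrics"))  // true
	fmt.Println(nonResourceURLMatches(&rule, "/logs/pod")) // true: the trailing * is a prefix match
	fmt.Println(nonResourceURLMatches(&rule, "/healthz"))  // false
}
```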
Where are the node rules defined?
Location: D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\auth\authorizer\rbac\bootstrappolicy\policy.go
// NodeRules returns node policy rules, it is slice of rbacv1.PolicyRule.
func NodeRules() []rbacv1.PolicyRule {
nodePolicyRules := []rbacv1.PolicyRule{
// Needed to check API access. These creates are non-mutating
rbacv1helpers.NewRule("create").Groups(authenticationGroup).Resources("tokenreviews").RuleOrDie(),
rbacv1helpers.NewRule("create").Groups(authorizationGroup).Resources("subjectaccessreviews", "localsubjectaccessreviews").RuleOrDie(),
// Needed to build serviceLister, to populate env vars for services
rbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources("services").RuleOrDie(),
// Nodes can register Node API objects and report status.
// Use the NodeRestriction admission plugin to limit a node to creating/updating its own API object.
rbacv1helpers.NewRule("create", "get", "list", "watch").Groups(legacyGroup).Resources("nodes").RuleOrDie(),
rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("nodes/status").RuleOrDie(),
rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("nodes").RuleOrDie(),
// TODO: restrict to the bound node as creator in the NodeRestrictions admission plugin
rbacv1helpers.NewRule("create", "update", "patch").Groups(legacyGroup).Resources("events").RuleOrDie(),
// TODO: restrict to pods scheduled on the bound node once field selectors are supported by list/watch authorization
rbacv1helpers.NewRule(Read...).Groups(legacyGroup).Resources("pods").RuleOrDie(),
// Needed for the node to create/delete mirror pods.
// Use the NodeRestriction admission plugin to limit a node to creating/deleting mirror pods bound to itself.
rbacv1helpers.NewRule("create", "delete").Groups(legacyGroup).Resources("pods").RuleOrDie(),
// Needed for the node to report status of pods it is running.
// Use the NodeRestriction admission plugin to limit a node to updating status of pods bound to itself.
rbacv1helpers.NewRule("update", "patch").Groups(legacyGroup).Resources("pods/status").RuleOrDie(),
// Needed for the node to create pod evictions.
// Use the NodeRestriction admission plugin to limit a node to creating evictions for pods bound to itself.
rbacv1helpers.NewRule("create").Groups(legacyGroup).Resources("pods/eviction").RuleOrDie(),
// Needed for imagepullsecrets, rbd/ceph and secret volumes, and secrets in envs
// Needed for configmap volume and envs
// Use the Node authorization mode to limit a node to get secrets/configmaps referenced by pods bound to itself.
rbacv1helpers.NewRule("get", "list", "watch").Groups(legacyGroup).Resources("secrets", "configmaps").RuleOrDie(),
// Needed for persistent volumes
// Use the Node authorization mode to limit a node to get pv/pvc objects referenced by pods bound to itself.
rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("persistentvolumeclaims", "persistentvolumes").RuleOrDie(),
// TODO: add to the Node authorizer and restrict to endpoints referenced by pods or PVs bound to the node
// Needed for glusterfs volumes
rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("endpoints").RuleOrDie(),
// Used to create a certificatesigningrequest for a node-specific client certificate, and watch
// for it to be signed. This allows the kubelet to rotate it's own certificate.
rbacv1helpers.NewRule("create", "get", "list", "watch").Groups(certificatesGroup).Resources("certificatesigningrequests").RuleOrDie(),
// Leases
rbacv1helpers.NewRule("get", "create", "update", "patch", "delete").Groups("coordination.k8s.io").Resources("leases").RuleOrDie(),
// CSI
rbacv1helpers.NewRule("get").Groups(storageGroup).Resources("volumeattachments").RuleOrDie(),
// Use the Node authorization to limit a node to create tokens for service accounts running on that node
// Use the NodeRestriction admission plugin to limit a node to create tokens bound to pods on that node
rbacv1helpers.NewRule("create").Groups(legacyGroup).Resources("serviceaccounts/token").RuleOrDie(),
}
// Use the Node authorization mode to limit a node to update status of pvc objects referenced by pods bound to itself.
// Use the NodeRestriction admission plugin to limit a node to just update the status stanza.
pvcStatusPolicyRule := rbacv1helpers.NewRule("get", "update", "patch").Groups(legacyGroup).Resources("persistentvolumeclaims/status").RuleOrDie()
nodePolicyRules = append(nodePolicyRules, pvcStatusPolicyRule)
// CSI
csiDriverRule := rbacv1helpers.NewRule("get", "watch", "list").Groups("storage.k8s.io").Resources("csidrivers").RuleOrDie()
nodePolicyRules = append(nodePolicyRules, csiDriverRule)
csiNodeInfoRule := rbacv1helpers.NewRule("get", "create", "update", "patch", "delete").Groups("storage.k8s.io").Resources("csinodes").RuleOrDie()
nodePolicyRules = append(nodePolicyRules, csiNodeInfoRule)
// RuntimeClass
nodePolicyRules = append(nodePolicyRules, rbacv1helpers.NewRule("get", "list", "watch").Groups("node.k8s.io").Resources("runtimeclasses").RuleOrDie())
return nodePolicyRules
}
Taking endpoints as an example: a node may access the endpoints resource in the core API group with the get verb:
rbacv1helpers.NewRule("get").Groups(legacyGroup).Resources("endpoints").RuleOrDie(),
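For reference, assuming legacyGroup is the empty string (the core API group, as it is in bootstrappolicy), the builder call above expands to roughly this PolicyRule:

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
)

func main() {
	// Roughly what NewRule("get").Groups(legacyGroup).Resources("endpoints").RuleOrDie() builds,
	// assuming legacyGroup == "" (the core API group).
	rule := rbacv1.PolicyRule{
		Verbs:     []string{"get"},
		APIGroups: []string{""}, // core group
		Resources: []string{"endpoints"},
	}
	fmt.Printf("%+v\n", rule)
}
```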
Key points of this section (as above)
Rules in a Role or ClusterRole:
- resource objects
- non-resource objects
- apiGroups
- verbs
Code logic of RBAC authorization:
- List clusterRoleBindings via the informer, match the subjects against the user, fetch the referenced rules via the informer, and call visit on each rule to check for a match.
- List RoleBindings via the informer, match the subjects against the user and namespace, fetch the referenced rules via the informer, and call visit on each rule to check for a match.
Relationship between the four RBAC objects:
- Role, ClusterRole: carry the rules listed above
- RoleBinding, ClusterRoleBinding: bind subjects to a Role or ClusterRole, which is what the authorization logic above walks
Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/rbac/
Introduction
Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an organization.
The RBAC authorization mechanism uses the rbac.authorization.k8s.io API group to drive authorization decisions, allowing you to configure policies dynamically through the Kubernetes API.
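As a refresher on the object model the authorizer walks, here is a small illustrative example (the object names are made up) that constructs a Role and a RoleBinding with the public rbacv1 types: the Role carries the rules, and the RoleBinding ties subjects to that Role via RoleRef.

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// The Role carries the rules (verbs, apiGroups, resources / nonResourceURLs).
	role := rbacv1.Role{
		ObjectMeta: metav1.ObjectMeta{Name: "pod-reader", Namespace: "default"},
		Rules: []rbacv1.PolicyRule{{
			APIGroups: []string{""}, // "" is the core API group
			Resources: []string{"pods"},
			Verbs:     []string{"get", "list", "watch"},
		}},
	}

	// The RoleBinding ties subjects (users, groups, service accounts) to that Role via RoleRef.
	binding := rbacv1.RoleBinding{
		ObjectMeta: metav1.ObjectMeta{Name: "read-pods", Namespace: "default"},
		Subjects: []rbacv1.Subject{{
			Kind:     rbacv1.UserKind,
			APIGroup: rbacv1.GroupName, // rbac.authorization.k8s.io
			Name:     "jane",
		}},
		RoleRef: rbacv1.RoleRef{
			APIGroup: rbacv1.GroupName,
			Kind:     "Role",
			Name:     role.Name,
		},
	}

	fmt.Printf("binding %q grants %v on %v to %s\n",
		binding.Name, role.Rules[0].Verbs, role.Rules[0].Resources, binding.Subjects[0].Name)
}
```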
See the documentation for the details of the model.
Entry point: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\authorizer\config.go
case modes.ModeRBAC:
rbacAuthorizer := rbac.New(
&rbac.RoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().Roles().Lister()},
&rbac.RoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().RoleBindings().Lister()},
&rbac.ClusterRoleGetter{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoles().Lister()},
&rbac.ClusterRoleBindingLister{Lister: config.VersionedInformerFactory.Rbac().V1().ClusterRoleBindings().Lister()},
)
authorizers = append(authorizers, rbacAuthorizer)
ruleResolvers = append(ruleResolvers, rbacAuthorizer)
rbac.New is given getters/listers for the 4 object kinds: Role, ClusterRole, RoleBinding and ClusterRoleBinding.
func New(roles rbacregistryvalidation.RoleGetter, roleBindings rbacregistryvalidation.RoleBindingLister, clusterRoles rbacregistryvalidation.ClusterRoleGetter, clusterRoleBindings rbacregistryvalidation.ClusterRoleBindingLister) *RBACAuthorizer {
authorizer := &RBACAuthorizer{
authorizationRuleResolver: rbacregistryvalidation.NewDefaultRuleResolver(
roles, roleBindings, clusterRoles, clusterRoleBindings,
),
}
return authorizer
}
It builds a DefaultRuleResolver and uses it to construct the RBACAuthorizer.
Analysis of RBACAuthorizer's Authorize
The core decision point is the ruleCheckingVisitor's allowed flag: if it is true the request passes, otherwise it does not.
func (r *RBACAuthorizer) Authorize(ctx context.Context, requestAttributes authorizer.Attributes) (authorizer.Decision, string, error) {
ruleCheckingVisitor := &authorizingVisitor{requestAttributes: requestAttributes}
r.authorizationRuleResolver.VisitRulesFor(requestAttributes.GetUser(), requestAttributes.GetNamespace(), ruleCheckingVisitor.visit)
if ruleCheckingVisitor.allowed {
return authorizer.DecisionAllow, ruleCheckingVisitor.reason, nil
}
// Build a detailed log of the denial.
// Make the whole block conditional so we don't do a lot of string-building we won't use.
if klogV := klog.V(5); klogV.Enabled() {
var operation string
if requestAttributes.IsResourceRequest() {
b := &bytes.Buffer{}
b.WriteString(`"`)
b.WriteString(requestAttributes.GetVerb())
b.WriteString(`" resource "`)
b.WriteString(requestAttributes.GetResource())
if len(requestAttributes.GetAPIGroup()) > 0 {
b.WriteString(`.`)
b.WriteString(requestAttributes.GetAPIGroup())
}
if len(requestAttributes.GetSubresource()) > 0 {
b.WriteString(`/`)
b.WriteString(requestAttributes.GetSubresource())
}
b.WriteString(`"`)
if len(requestAttributes.GetName()) > 0 {
b.WriteString(` named "`)
b.WriteString(requestAttributes.GetName())
b.WriteString(`"`)
}
operation = b.String()
} else {
operation = fmt.Sprintf("%q nonResourceURL %q", requestAttributes.GetVerb(), requestAttributes.GetPath())
}
var scope string
if ns := requestAttributes.GetNamespace(); len(ns) > 0 {
scope = fmt.Sprintf("in namespace %q", ns)
} else {
scope = "cluster-wide"
}
klogV.Infof("RBAC: no rules authorize user %q with groups %q to %s %s", requestAttributes.GetUser().GetName(), requestAttributes.GetUser().GetGroups(), operation, scope)
}
reason := ""
if len(ruleCheckingVisitor.errors) > 0 {
reason = fmt.Sprintf("RBAC: %v", utilerrors.NewAggregate(ruleCheckingVisitor.errors))
}
return authorizer.DecisionNoOpinion, reason, nil
}
The allowed flag is only set inside the visit method, and only when RuleAllows returns true:
func (v *authorizingVisitor) visit(source fmt.Stringer, rule *rbacv1.PolicyRule, err error) bool {
if rule != nil && RuleAllows(v.requestAttributes, rule) {
v.allowed = true
v.reason = fmt.Sprintf("RBAC: allowed by %s", source.String())
return false
}
if err != nil {
v.errors = append(v.errors, err)
}
return true
}
VisitRulesFor calls visit to check every rule:
r.authorizationRuleResolver.VisitRulesFor(requestAttributes.GetUser(), requestAttributes.GetNamespace(), ruleCheckingVisitor.visit)
ClusterRoleBindings are checked first
The flow starts by listing clusterRoleBindings via the informer; if that fails the check fails, because the visitor is called with a nil rule and allowed can therefore never be set to true.
if clusterRoleBindings, err := r.clusterRoleBindingLister.ListClusterRoleBindings(); err != nil {
if !visitor(nil, nil, err) {
return
}
}
For each binding, the subjects are first compared against the requesting user:
for _, clusterRoleBinding := range clusterRoleBindings {
subjectIndex, applies := appliesTo(user, clusterRoleBinding.Subjects, "")
if !applies {
continue
}
The appliesToUser comparison function
It switches on the subject kind:
- for a User subject, compare the user name;
- for a Group subject, check whether the user's groups contain the subject name;
- for a ServiceAccount subject, compare with serviceaccount.MatchesUsername.
func appliesToUser(user user.Info, subject rbacv1.Subject, namespace string) bool {
switch subject.Kind {
case rbacv1.UserKind:
return user.GetName() == subject.Name
case rbacv1.GroupKind:
return has(user.GetGroups(), subject.Name)
case rbacv1.ServiceAccountKind:
// default the namespace to namespace we're working in if its available. This allows rolebindings that reference
// SAs in th local namespace to avoid having to qualify them.
saNamespace := namespace
if len(subject.Namespace) > 0 {
saNamespace = subject.Namespace
}
if len(saNamespace) == 0 {
return false
}
// use a more efficient comparison for RBAC checking
return serviceaccount.MatchesUsername(saNamespace, subject.Name, user.GetName())
default:
return false
}
}
serviceaccount.MatchesUsername compares the full service account username, system:serviceaccount:<namespace>:<name>, piece by piece:
// MatchesUsername checks whether the provided username matches the namespace and name without
// allocating. Use this when checking a service account namespace and name against a known string.
func MatchesUsername(namespace, name string, username string) bool {
if !strings.HasPrefix(username, ServiceAccountUsernamePrefix) {
return false
}
username = username[len(ServiceAccountUsernamePrefix):]
if !strings.HasPrefix(username, namespace) {
return false
}
username = username[len(namespace):]
if !strings.HasPrefix(username, ServiceAccountUsernameSeparator) {
return false
}
username = username[len(ServiceAccountUsernameSeparator):]
return username == name
}
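For intuition, here is a standalone sketch of the same comparison, assuming the usual prefix "system:serviceaccount:" and ":" separator (the values of ServiceAccountUsernamePrefix/ServiceAccountUsernameSeparator):

```go
package main

import (
	"fmt"
	"strings"
)

// matches checks a service account username of the form
// "system:serviceaccount:<namespace>:<name>" against a namespace and name.
func matches(namespace, name, username string) bool {
	const prefix = "system:serviceaccount:" // assumed value of ServiceAccountUsernamePrefix
	if !strings.HasPrefix(username, prefix) {
		return false
	}
	ns, sa, ok := strings.Cut(strings.TrimPrefix(username, prefix), ":")
	return ok && ns == namespace && sa == name
}

func main() {
	fmt.Println(matches("kube-system", "default", "system:serviceaccount:kube-system:default")) // true
	fmt.Println(matches("default", "builder", "system:serviceaccount:kube-system:builder"))     // false
}
```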
Then the rules are fetched from the informer using clusterRoleBinding.RoleRef:
rules, err := r.GetRoleReferenceRules(clusterRoleBinding.RoleRef, "")
Each rule is then passed to visit, together with the matched clusterRoleBinding, for comparison:
sourceDescriber.binding = clusterRoleBinding
sourceDescriber.subject = &clusterRoleBinding.Subjects[subjectIndex]
for i := range rules {
if !visitor(sourceDescriber, &rules[i], nil) {
return
}
}
Resource-type requests
- compare the request's verb (together with apiGroup, resource and resource name, per RuleAllows above) against the rule; VerbMatches is:
func VerbMatches(rule *rbacv1.PolicyRule, requestedVerb string) bool {
for _, ruleVerb := range rule.Verbs {
if ruleVerb == rbacv1.VerbAll {
return true
}
if ruleVerb == requestedVerb {
return true
}
}
return false
}
Non-resource-type requests
- compare the request's URL and verb against the rule:
func NonResourceURLMatches(rule *rbacv1.PolicyRule, requestedURL string) bool {
for _, ruleURL := range rule.NonResourceURLs {
if ruleURL == rbacv1.NonResourceAll {
return true
}
if ruleURL == requestedURL {
return true
}
if strings.HasSuffix(ruleURL, "*") && strings.HasPrefix(requestedURL, strings.TrimRight(ruleURL, "*")) {
return true
}
}
return false
}
RoleBindings are checked next
The roleBinding list is fetched via the informer:
if roleBindings, err := r.roleBindingLister.ListRoleBindings(namespace); err != nil {
if !visitor(nil, nil, err) {
return
}
Iterate and compare subjects
- the same appliesTo check is used on the subjects:
for _, roleBinding := range roleBindings {
subjectIndex, applies := appliesTo(user, roleBinding.Subjects, namespace)
if !applies {
continue
}
The rules of the matched roleBinding are fetched via the informer:
rules, err := r.GetRoleReferenceRules(roleBinding.RoleRef, namespace)
if err != nil {
if !visitor(nil, nil, err) {
return
}
continue
}
visit is called for each rule to check for a match;
if a rule matches, allowed is set to true:
sourceDescriber.binding = roleBinding
sourceDescriber.subject = &roleBinding.Subjects[subjectIndex]
for i := range rules {
if !visitor(sourceDescriber, &rules[i], nil) {
return
}
}
Auditing
Kubernetes auditing provides a security-relevant, chronological set of records documenting the activity generated by each user, by applications that use the Kubernetes API, and by the control plane itself.
Auditing lets cluster administrators answer questions such as:
what happened?
when did it happen?
who triggered it?
on what object(s) did it happen?
where was it observed?
from where was it initiated?
what happened to it afterwards?
Audit levels, from coarsest to finest grained:
- None - events matching this rule are not logged.
- Metadata - log request metadata (requesting user, timestamp, resource, verb, etc.) but not the request or response body.
- Request - log event metadata and the request body but not the response body. Does not apply to non-resource requests.
- RequestResponse - log event metadata plus the request and response bodies. Does not apply to non-resource requests.
Introduction to auditing
Follow along with the documentation:
https://kubernetes.io/zh/docs/tasks/debug-application-cluster/audit/
Entry point: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go, in buildGenericConfig:
lastErr = s.Audit.ApplyTo(genericConfig)
if lastErr != nil {
return
}
1. Load the audit policy from the configured --audit-policy-file
You pass the file containing the policy to kube-apiserver with the --audit-policy-file flag.
If the flag is not set, no events are recorded.
The rules field must be present in the audit policy file; a policy with zero rules is treated as illegal configuration.
// 1. Build policy evaluator
evaluator, err := o.newPolicyRuleEvaluator()
if err != nil {
return err
}
2. Build the log backend from the configured --audit-log-path
// 2. Build log backend
var logBackend audit.Backend
w, err := o.LogOptions.getWriter()
if err != nil {
return err
}
if w != nil {
if evaluator == nil {
klog.V(2).Info("No audit policy file provided, no events will be recorded for log backend")
} else {
logBackend = o.LogOptions.newBackend(w)
}
}
If --audit-log-path="-", events are written to standard output:
func (o *AuditLogOptions) getWriter() (io.Writer, error) {
if !o.enabled() {
return nil, nil
}
if o.Path == "-" {
return os.Stdout, nil
}
if err := o.ensureLogFile(); err != nil {
return nil, fmt.Errorf("ensureLogFile: %w", err)
}
return &lumberjack.Logger{
Filename: o.Path,
MaxAge: o.MaxAge,
MaxBackups: o.MaxBackups,
MaxSize: o.MaxSize,
Compress: o.Compress,
}, nil
}
ensureLogFile tries to open the log file once, verifying that it is usable.
Under the hood it uses https://github.com/natefinch/lumberjack, a logging library with rotation support.
Once the log writer has been obtained, the code checks whether an evaluator exists:
- if there is no evaluator, it prints an informational log line;
- otherwise it builds logBackend from the writer w:
if w != nil {
if evaluator == nil {
klog.V(2).Info("No audit policy file provided, no events will be recorded for log backend")
} else {
logBackend = o.LogOptions.newBackend(w)
}
}
3. Build the webhook backend from the configuration
// 3. Build webhook backend
var webhookBackend audit.Backend
if o.WebhookOptions.enabled() {
if evaluator == nil {
klog.V(2).Info("No audit policy file provided, no events will be recorded for webhook backend")
} else {
if c.EgressSelector != nil {
var egressDialer utilnet.DialFunc
egressDialer, err = c.EgressSelector.Lookup(egressselector.ControlPlane.AsNetworkContext())
if err != nil {
return err
}
webhookBackend, err = o.WebhookOptions.newUntruncatedBackend(egressDialer)
} else {
webhookBackend, err = o.WebhookOptions.newUntruncatedBackend(nil)
}
if err != nil {
return err
}
}
}
4. If a webhook backend exists, wrap it as dynamicBackend
// 4. Apply dynamic options.
var dynamicBackend audit.Backend
if webhookBackend != nil {
// if only webhook is enabled wrap it in the truncate options
dynamicBackend = o.WebhookOptions.TruncateOptions.wrapBackend(webhookBackend, groupVersion)
}
5. Set the audit policy rule evaluator
// 5. Set the policy rule evaluator
c.AuditPolicyRuleEvaluator = evaluator
6. Union the logBackend with the dynamicBackend
// 6. Join the log backend with the webhooks
c.AuditBackend = appendBackend(logBackend, dynamicBackend)
func appendBackend(existing, newBackend audit.Backend) audit.Backend {
if existing == nil {
return newBackend
}
if newBackend == nil {
return existing
}
return audit.Union(existing, newBackend)
}
7. The methods that finally run
The Backend interface:
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\audit\types.go
type Sink interface {
// ProcessEvents handles events. Per audit ID it might be that ProcessEvents is called up to three times.
// Errors might be logged by the sink itself. If an error should be fatal, leading to an internal
// error, ProcessEvents is supposed to panic. The event must not be mutated and is reused by the caller
// after the call returns, i.e. the sink has to make a deepcopy to keep a copy around if necessary.
// Returns true on success, may return false on error.
ProcessEvents(events ...*auditinternal.Event) bool
}
type Backend interface {
Sink
// Run will initialize the backend. It must not block, but may run go routines in the background. If
// stopCh is closed, it is supposed to stop them. Run will be called before the first call to ProcessEvents.
Run(stopCh <-chan struct{}) error
// Shutdown will synchronously shut down the backend while making sure that all pending
// events are delivered. It can be assumed that this method is called after
// the stopCh channel passed to the Run method has been closed.
Shutdown()
// Returns the backend PluginName.
String() string
}
Ultimately the backend's ProcessEvents method is called. Taking the log backend as an example, located at D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\plugin\pkg\audit\log\backend.go:
func (b *backend) ProcessEvents(events ...*auditinternal.Event) bool {
success := true
for _, ev := range events {
success = b.logEvent(ev) && success
}
return success
}
func (b *backend) logEvent(ev *auditinternal.Event) bool {
line := ""
switch b.format {
case FormatLegacy:
line = audit.EventString(ev) + "\n"
case FormatJson:
bs, err := runtime.Encode(b.encoder, ev)
if err != nil {
audit.HandlePluginError(PluginName, err, ev)
return false
}
line = string(bs[:])
default:
audit.HandlePluginError(PluginName, fmt.Errorf("log format %q is not in list of known formats (%s)",
b.format, strings.Join(AllowedFormats, ",")), ev)
return false
}
if _, err := fmt.Fprint(b.out, line); err != nil {
audit.HandlePluginError(PluginName, err, ev)
return false
}
return true
}
8. The handler invoked on the HTTP side
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\endpoints\filters\audit.go
// WithAudit decorates a http.Handler with audit logging information for all the
// requests coming to the server. Audit level is decided according to requests'
// attributes and audit policy. Logs are emitted to the audit sink to
// process events. If sink or audit policy is nil, no decoration takes place.
func WithAudit(handler http.Handler, sink audit.Sink, policy audit.PolicyRuleEvaluator, longRunningCheck request.LongRunningRequestCheck) http.Handler {
if sink == nil || policy == nil {
return handler
}
return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
auditContext, err := evaluatePolicyAndCreateAuditEvent(req, policy)
if err != nil {
utilruntime.HandleError(fmt.Errorf("failed to create audit event: %v", err))
responsewriters.InternalError(w, req, errors.New("failed to create audit event"))
return
}
ev := auditContext.Event
if ev == nil || req.Context() == nil {
handler.ServeHTTP(w, req)
return
}
req = req.WithContext(audit.WithAuditContext(req.Context(), auditContext))
ctx := req.Context()
omitStages := auditContext.RequestAuditConfig.OmitStages
ev.Stage = auditinternal.StageRequestReceived
if processed := processAuditEvent(ctx, sink, ev, omitStages); !processed {
audit.ApiserverAuditDroppedCounter.WithContext(ctx).Inc()
responsewriters.InternalError(w, req, errors.New("failed to store audit event"))
return
}
// intercept the status code
var longRunningSink audit.Sink
if longRunningCheck != nil {
ri, _ := request.RequestInfoFrom(ctx)
if longRunningCheck(req, ri) {
longRunningSink = sink
}
}
respWriter := decorateResponseWriter(ctx, w, ev, longRunningSink, omitStages)
// send audit event when we leave this func, either via a panic or cleanly. In the case of long
// running requests, this will be the second audit event.
defer func() {
if r := recover(); r != nil {
defer panic(r)
ev.Stage = auditinternal.StagePanic
ev.ResponseStatus = &metav1.Status{
Code: http.StatusInternalServerError,
Status: metav1.StatusFailure,
Reason: metav1.StatusReasonInternalError,
Message: fmt.Sprintf("APIServer panic'd: %v", r),
}
processAuditEvent(ctx, sink, ev, omitStages)
return
}
// if no StageResponseStarted event was sent b/c neither a status code nor a body was sent, fake it here
// But Audit-Id http header will only be sent when http.ResponseWriter.WriteHeader is called.
fakedSuccessStatus := &metav1.Status{
Code: http.StatusOK,
Status: metav1.StatusSuccess,
Message: "Connection closed early",
}
if ev.ResponseStatus == nil && longRunningSink != nil {
ev.ResponseStatus = fakedSuccessStatus
ev.Stage = auditinternal.StageResponseStarted
processAuditEvent(ctx, longRunningSink, ev, omitStages)
}
ev.Stage = auditinternal.StageResponseComplete
if ev.ResponseStatus == nil {
ev.ResponseStatus = fakedSuccessStatus
}
processAuditEvent(ctx, sink, ev, omitStages)
}()
handler.ServeHTTP(respWriter, req)
})
}
tail -f /var/log/audit/audit.log
- An admission controller is a piece of code that intercepts requests to the API server after the request has been authenticated and authorized, but before the object is persisted.
- Admission control runs in two phases: in the first phase the mutating admission controllers run; in the second phase the validating admission controllers run.
- The controllers are compiled into the kube-apiserver binary and can only be configured by cluster administrators.
- If any controller in either phase rejects the request, the whole request is rejected immediately and an error is returned to the end user.
Role of admission control plugins
- Enable advanced features
What is an admission control plugin
Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/admission-controllers/
- An admission controller is a piece of code that intercepts requests to the API server after authentication and authorization, but before the object is persisted.
- Admission control runs in two phases: first the mutating admission controllers, then the validating admission controllers. The controllers are compiled into the kube-apiserver binary and can only be configured by cluster administrators.
- If any controller in either phase rejects the request, the whole request is rejected immediately and an error is returned to the end user.
Why are admission controllers needed?
- Many advanced features of Kubernetes require an admission controller to be enabled in order to support the feature properly.
- An API server without properly configured admission controllers is therefore incomplete and cannot support all the features you expect.
Classified by whether they may modify the object
- Admission controllers can perform "validating" and/or "mutating" operations.
- Mutating controllers may modify the objects they admit; validating controllers may not.
Classified as static vs. dynamic
- Static controllers have a fixed, single function, e.g. AlwaysPullImages rewrites every newly created Pod's image pull policy to Always.
- Dynamic admission control is provided by two special controllers: MutatingAdmissionWebhook and ValidatingAdmissionWebhook.
- They execute mutating and validating admission webhooks, respectively, according to configuration stored in the API.
- In effect, admission control can call out to external HTTP services as plugins.
The entry point is in D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go:
pluginInitializers, admissionPostStartHook, err = admissionConfig.New(proxyTransport, genericConfig.EgressSelector, serviceResolver, genericConfig.TracerProvider)
admissionConfig.New initializes the admission controller configuration.
The New function sets up the plugin initializers and the post-start hook needed for webhook admission:
// New sets up the plugins and admission start hooks needed for admission
func (c *Config) New(proxyTransport *http.Transport, egressSelector *egressselector.EgressSelector, serviceResolver webhook.ServiceResolver, tp *trace.TracerProvider) ([]admission.PluginInitializer, genericapiserver.PostStartHookFunc, error) {
webhookAuthResolverWrapper := webhook.NewDefaultAuthenticationInfoResolverWrapper(proxyTransport, egressSelector, c.LoopbackClientConfig, tp)
webhookPluginInitializer := webhookinit.NewPluginInitializer(webhookAuthResolverWrapper, serviceResolver)
var cloudConfig []byte
if c.CloudConfigFile != "" {
var err error
cloudConfig, err = ioutil.ReadFile(c.CloudConfigFile)
if err != nil {
klog.Fatalf("Error reading from cloud configuration file %s: %#v", c.CloudConfigFile, err)
}
}
clientset, err := kubernetes.NewForConfig(c.LoopbackClientConfig)
if err != nil {
return nil, nil, err
}
discoveryClient := cacheddiscovery.NewMemCacheClient(clientset.Discovery())
discoveryRESTMapper := restmapper.NewDeferredDiscoveryRESTMapper(discoveryClient)
kubePluginInitializer := NewPluginInitializer(
cloudConfig,
discoveryRESTMapper,
quotainstall.NewQuotaConfigurationForAdmission(),
)
admissionPostStartHook := func(context genericapiserver.PostStartHookContext) error {
discoveryRESTMapper.Reset()
go utilwait.Until(discoveryRESTMapper.Reset, 30*time.Second, context.StopCh)
return nil
}
return []admission.PluginInitializer{webhookPluginInitializer, kubePluginInitializer}, admissionPostStartHook, nil
}
The admission initializer used here is PluginInitializer, located in D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\admission\initializer.go.
It has a corresponding Initialize method whose job is to supply initialization data to each plugin:
// PluginInitializer is used for initialization of the Kubernetes specific admission plugins.
type PluginInitializer struct {
cloudConfig []byte
restMapper meta.RESTMapper
quotaConfiguration quota.Configuration
}

// Initialize checks the initialization interfaces implemented by each plugin
// and provide the appropriate initialization data
func (i *PluginInitializer) Initialize(plugin admission.Interface) {
if wants, ok := plugin.(WantsCloudConfig); ok {
wants.SetCloudConfig(i.cloudConfig)
}
if wants, ok := plugin.(WantsRESTMapper); ok {
wants.SetRESTMapper(i.restMapper)
}
if wants, ok := plugin.(initializer.WantsQuotaConfiguration); ok {
wants.SetQuotaConfiguration(i.quotaConfiguration)
}
}
It also initializes admission support for quota:
kubePluginInitializer := NewPluginInitializer(
cloudConfig,
discoveryRESTMapper,
quotainstall.NewQuotaConfigurationForAdmission(),
)
It also produces a post-start hook that resets discoveryRESTMapper every 30 seconds, flushing the internally cached discovery information:
admissionPostStartHook := func(context genericapiserver.PostStartHookContext) error {
discoveryRESTMapper.Reset()
go utilwait.Until(discoveryRESTMapper.Reset, 30*time.Second, context.StopCh)
return nil
}
s.Admission.ApplyTo initializes admission control:
err = s.Admission.ApplyTo(
genericConfig,
versionedInformers,
kubeClientConfig,
utilfeature.DefaultFeatureGate,
pluginInitializers...)
Based on the controller list passed in and the recommended list, it computes which plugins are enabled and which are disabled:
if a.PluginNames != nil {
// pass PluginNames to generic AdmissionOptions
a.GenericAdmission.EnablePlugins, a.GenericAdmission.DisablePlugins = computePluginNames(a.PluginNames, a.GenericAdmission.RecommendedPluginOrder)
}
PluginNames holds what was passed via --admission-control.
a.GenericAdmission.RecommendedPluginOrder is the full upstream list, AllOrderedPlugins, defined in D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\options\plugins.go.
computePluginNames takes the set difference to produce the enabled and disabled lists:
// explicitly disable all plugins that are not in the enabled list
func computePluginNames(explicitlyEnabled []string, all []string) (enabled []string, disabled []string) {
return explicitlyEnabled, sets.NewString(all...).Difference(sets.NewString(explicitlyEnabled...)).List()
}
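A standalone sketch of the same set difference (using a plain map instead of k8s.io/apimachinery's sets, purely for illustration):

```go
package main

import (
	"fmt"
	"sort"
)

// computePluginNames: everything in `all` that was not explicitly enabled
// ends up in the disabled list.
func computePluginNames(explicitlyEnabled, all []string) (enabled, disabled []string) {
	enabledSet := map[string]bool{}
	for _, p := range explicitlyEnabled {
		enabledSet[p] = true
	}
	for _, p := range all {
		if !enabledSet[p] {
			disabled = append(disabled, p)
		}
	}
	sort.Strings(disabled)
	return explicitlyEnabled, disabled
}

func main() {
	enabled, disabled := computePluginNames(
		[]string{"NodeRestriction", "AlwaysPullImages"},
		[]string{"AlwaysPullImages", "LimitRanger", "NodeRestriction", "ServiceAccount"},
	)
	fmt.Println(enabled)  // [NodeRestriction AlwaysPullImages]
	fmt.Println(disabled) // [LimitRanger ServiceAccount]
}
```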
Analysis of the lower-level ApplyTo
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\options\admission.go
func (a *AdmissionOptions) ApplyTo(){}
It computes the list of plugins that will actually be enabled from the disabled list, the enabled list and the recommended order:
// enabledPluginNames makes use of RecommendedPluginOrder, DefaultOffPlugins,
// EnablePlugins, DisablePlugins fields
// to prepare a list of ordered plugin names that are enabled.
func (a *AdmissionOptions) enabledPluginNames() []string {
allOffPlugins := append(a.DefaultOffPlugins.List(), a.DisablePlugins...)
disabledPlugins := sets.NewString(allOffPlugins...)
enabledPlugins := sets.NewString(a.EnablePlugins...)
disabledPlugins = disabledPlugins.Difference(enabledPlugins)
orderedPlugins := []string{}
for _, plugin := range a.RecommendedPluginOrder {
if !disabledPlugins.Has(plugin) {
orderedPlugins = append(orderedPlugins, plugin)
}
}
return orderedPlugins
}
Read the plugin configuration from the file given by --admission-control-config-file:
pluginsConfigProvider, err := admission.ReadAdmissionConfiguration(pluginNames, a.ConfigFile, configScheme)
if err != nil {
return fmt.Errorf("failed to read plugin config: %v", err)
}
Initialize the genericInitializer:
clientset, err := kubernetes.NewForConfig(kubeAPIServerClientConfig)
if err != nil {
return err
}
genericInitializer := initializer.New(clientset, informers, c.Authorization.Authorizer, features)
initializersChain := admission.PluginInitializers{}
pluginInitializers = append(pluginInitializers, genericInitializer)
initializersChain = append(initializersChain, pluginInitializers...)
NewFromPlugins instantiates all enabled admission plugins
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\admission\plugins.go
It iterates over the plugins and calls InitPlugin to create each instance:
for _, pluginName := range pluginNames {
pluginConfig, err := configProvider.ConfigFor(pluginName)
if err != nil {
return nil, err
}
plugin, err := ps.InitPlugin(pluginName, pluginConfig, pluginInitializer)
if err != nil {
return nil, err
}
if plugin != nil {
if decorator != nil {
handlers = append(handlers, decorator.Decorate(plugin, pluginName))
} else {
handlers = append(handlers, plugin)
}
if _, ok := plugin.(MutationInterface); ok {
mutationPlugins = append(mutationPlugins, pluginName)
}
if _, ok := plugin.(ValidationInterface); ok {
validationPlugins = append(validationPlugins, pluginName)
}
}
}
InitPlugin
- calls getPlugin to obtain the plugin instance from Plugins:
// InitPlugin creates an instance of the named interface.
func (ps *Plugins) InitPlugin(name string, config io.Reader, pluginInitializer PluginInitializer) (Interface, error) {
if name == "" {
klog.Info("No admission plugin specified.")
return nil, nil
}
plugin, found, err := ps.getPlugin(name, config)
if err != nil {
return nil, fmt.Errorf("couldn't init admission plugin %q: %v", name, err)
}
if !found {
return nil, fmt.Errorf("unknown admission plugin: %s", name)
}
pluginInitializer.Initialize(plugin)
// ensure that plugins have been properly initialized
if err := ValidateInitialization(plugin); err != nil {
return nil, fmt.Errorf("failed to initialize admission plugin %q: %v", name, err)
}
return plugin, nil
}
getPlugin
// getPlugin creates an instance of the named plugin. It returns `false` if the
// the name is not known. The error is returned only when the named provider was
// known but failed to initialize. The config parameter specifies the io.Reader
// handler of the configuration file for the cloud provider, or nil for no configuration.
func (ps *Plugins) getPlugin(name string, config io.Reader) (Interface, bool, error) {
ps.lock.Lock()
defer ps.lock.Unlock()
f, found := ps.registry[name]
if !found {
return nil, false, nil
}
config1, config2, err := splitStream(config)
if err != nil {
return nil, true, err
}
if !PluginEnabledFn(name, config1) {
return nil, true, nil
}
ret, err := f(config2)
return ret, true, err
}
The key step is looking the plugin up in the ps.registry map, whose values are factory functions:
// Factory is a function that returns an Interface for admission decisions.
// The config parameter provides an io.Reader handler to the factory in
// order to load specific configurations. If no configuration is provided
// the parameter is nil.
type Factory func(config io.Reader) (Interface, error)
type Plugins struct {
lock sync.Mutex
registry map[string]Factory
}
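Before looking at where the real factories are registered, here is a minimal standalone sketch of this register/lookup pattern (simplified local types, not the real admission package):

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// Interface is a stand-in for admission.Interface in this sketch.
type Interface interface{ Name() string }

// Factory builds a plugin instance from an optional config reader.
type Factory func(config io.Reader) (Interface, error)

// plugins mirrors the registry idea: Register stores a factory under a name,
// InitPlugin looks the factory up and invokes it.
type plugins struct {
	lock     sync.Mutex
	registry map[string]Factory
}

func (ps *plugins) Register(name string, f Factory) {
	ps.lock.Lock()
	defer ps.lock.Unlock()
	if ps.registry == nil {
		ps.registry = map[string]Factory{}
	}
	ps.registry[name] = f
}

func (ps *plugins) InitPlugin(name string, config io.Reader) (Interface, error) {
	ps.lock.Lock()
	f, found := ps.registry[name]
	ps.lock.Unlock()
	if !found {
		return nil, fmt.Errorf("unknown admission plugin: %s", name)
	}
	return f(config)
}

type alwaysPullImages struct{}

func (alwaysPullImages) Name() string { return "AlwaysPullImages" }

func main() {
	ps := &plugins{}
	ps.Register("AlwaysPullImages", func(io.Reader) (Interface, error) { return alwaysPullImages{}, nil })
	p, _ := ps.InitPlugin("AlwaysPullImages", nil)
	fmt.Println(p.Name())
}
```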
Tracing further shows that these factory functions are registered in RegisterAllAdmissionPlugins:
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\options\plugins.go
// RegisterAllAdmissionPlugins registers all admission plugins.
// The order of registration is irrelevant, see AllOrderedPlugins for execution order.
func RegisterAllAdmissionPlugins(plugins *admission.Plugins) {
admit.Register(plugins) // DEPRECATED as no real meaning
alwayspullimages.Register(plugins)
antiaffinity.Register(plugins)
defaulttolerationseconds.Register(plugins)
defaultingressclass.Register(plugins)
denyserviceexternalips.Register(plugins)
deny.Register(plugins) // DEPRECATED as no real meaning
eventratelimit.Register(plugins)
extendedresourcetoleration.Register(plugins)
gc.Register(plugins)
imagepolicy.Register(plugins)
limitranger.Register(plugins)
autoprovision.Register(plugins)
exists.Register(plugins)
noderestriction.Register(plugins)
nodetaint.Register(plugins)
label.Register(plugins) // DEPRECATED, future PVs should not rely on labels for zone topology
podnodeselector.Register(plugins)
podtolerationrestriction.Register(plugins)
runtimeclass.Register(plugins)
resourcequota.Register(plugins)
podsecurity.Register(plugins)
podsecuritypolicy.Register(plugins)
podpriority.Register(plugins)
scdeny.Register(plugins)
serviceaccount.Register(plugins)
setdefault.Register(plugins)
resize.Register(plugins)
storageobjectinuseprotection.Register(plugins)
certapproval.Register(plugins)
certsigning.Register(plugins)
certsubjectrestriction.Register(plugins)
}
ps -ef |grep apiserver |grep admission-control
Take alwayspullimages.Register(plugins) as an example
- the corresponding factory function is:
D:\Workspace\Go\src\k8s.io\kubernetes\plugin\pkg\admission\alwayspullimages\admission.go
// Register registers a plugin
func Register(plugins *admission.Plugins) {
plugins.Register(PluginName, func(config io.Reader) (admission.Interface, error) {
return NewAlwaysPullImages(), nil
})
}
which simply constructs an AlwaysPullImages handler:
// NewAlwaysPullImages creates a new always pull images admission control handler
func NewAlwaysPullImages() *AlwaysPullImages {
return &AlwaysPullImages{
Handler: admission.NewHandler(admission.Create, admission.Update),
}
}
It therefore has a mutating admission method, Admit, that modifies the object:
// Admit makes an admission decision based on the request attributes
func (a *AlwaysPullImages) Admit(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
// Ignore all calls to subresources or resources other than pods.
if shouldIgnore(attributes) {
return nil
}
pod, ok := attributes.GetObject().(*api.Pod)
if !ok {
return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
}
pods.VisitContainersWithPath(&pod.Spec, field.NewPath("spec"), func(c *api.Container, _ *field.Path) bool {
c.ImagePullPolicy = api.PullAlways
return true
})
return nil
}
The code above rewrites the pod's ImagePullPolicy to api.PullAlways.
Documentation: https://kubernetes.io/zh/docs/reference/access-authn-authz/admission-controllers/#alwayspullimages
- This admission controller modifies every newly created Pod so that its image pull policy is Always.
- This is useful in multi-tenant clusters: users can be confident that their private images can only be used by those who have the credentials to pull them.
- Without this admission controller, once an image has been pulled onto a node, any user's Pod can use it simply by knowing the image name (assuming the Pod is scheduled onto the right node), without any authorization check on the image.
- With this admission controller enabled, images are always pulled before containers start, which means valid credentials are required.
- There is also a corresponding Validate method:
// Validate makes sure that all containers are set to always pull images
func (*AlwaysPullImages) Validate(ctx context.Context, attributes admission.Attributes, o admission.ObjectInterfaces) (err error) {
if shouldIgnore(attributes) {
return nil
}
pod, ok := attributes.GetObject().(*api.Pod)
if !ok {
return apierrors.NewBadRequest("Resource was marked with kind Pod but was unable to be converted")
}
var allErrs []error
pods.VisitContainersWithPath(&pod.Spec, field.NewPath("spec"), func(c *api.Container, p *field.Path) bool {
if c.ImagePullPolicy != api.PullAlways {
allErrs = append(allErrs, admission.NewForbidden(attributes,
field.NotSupported(p.Child("imagePullPolicy"), c.ImagePullPolicy, []string{string(api.PullAlways)}),
))
}
return true
})
if len(allErrs) > 0 {
return utilerrors.NewAggregate(allErrs)
}
return nil
}
Writing an admission controller that automatically injects an nginx sidecar into pods
- Write the admission webhook and run it.
- The end result: every application pod in the chosen namespaces gets a simple nginx sidecar injected.
How Istio auto-injects Envoy
https://istio.io/latest/img/service-mesh.svg
- The popular service mesh Istio uses the Kubernetes API server's mutating webhooks to automatically inject the Envoy sidecar container into Pods; see https://istio.io/docs/setup/kubernetes/sidecar-injection/.
- To take advantage of all of Istio's features, pods in the mesh must run the Istio sidecar proxy.
- When automatic injection is enabled in a pod's namespace, the admission controller injects the proxy configuration at pod creation time, so you end up with Envoy running next to your pod.
Workflow
- Check whether the admission webhook controllers are enabled in the cluster and configure them if needed.
- Write the mutating webhook code:
  - start a TLS HTTP server;
  - implement the /mutate handler;
  - when a user creates/updates a pod,
  - the apiserver calls this mutating webhook, which patches the pod to add the nginx sidecar container,
  - and returns the patch to the apiserver, achieving the injection.
- Create certificates and have them signed by a CA.
- Create the MutatingWebhookConfiguration.
- Deploy a service and verify the injection result.
What is an admission control plugin
- Check the k8s cluster
- Create the new project kube-mutating-webhook-inject-pod and do the preparation work
Checking the k8s cluster
Check whether the admission registration API is enabled in the cluster:
- run kubectl api-versions |grep admission;
- if you see the following output, it is enabled:
kubectl api-versions |grep admission
admissionregistration.k8s.io/v1
Check that the MutatingAdmissionWebhook and ValidatingAdmissionWebhook admission plugins are enabled in the apiserver
- On versions 1.20 and above they are enabled by default; check the --enable-admission-plugins flag:
/usr/local/bin/kube-apiserver -h | grep enable-admission-plugins
--enable-admission-plugins strings   admission plugins that should be enabled in addition to default enabled ones. Admission is divided into two phases: in the first phase, only mutating admission plugins run; in the second phase, only validating admission plugins run.
Writing the webhook
- Create the new project kube-mutating-webhook-inject-pod:
go mod init kube-mutating-webhook-inject-pod
Designing the configuration file for the injected sidecar container
- Since we are injecting a container, we need its configuration, so we reuse the containers section of the k8s pod YAML.
- We also need to mount the injected container's configuration, so we reuse the volumes section of the pod YAML.
- Create config.yaml as follows:
```yaml
containers:
- name: sidecar-nginx
image: nginx:1.12.2
imagePullPolicy: IfNotPresent
ports:
- containerPort: 80
volumeMounts:
- name: nginx-conf
mountPath: /etc/nginx
volumes:
- name: nginx-conf
configMap:
name: nginx-configmap
```
The corresponding Go code
Create pkg/webhook.go:
package main

import (
	corev1 "k8s.io/api/core/v1"
)

// Config mirrors config.yaml: the containers and volumes to inject.
type Config struct {
	Containers []corev1.Container `yaml:"containers"`
	Volumes    []corev1.Volume    `yaml:"volumes"`
}
The function that parses the configuration file:
func loadConfig(configFile string) (*Config, error) {
data, err := ioutil.ReadFile(configFile)
if err != nil {
return nil, err
}
glog.Infof("New configuration: sha256sum %x", sha256.Sum256(data))
var cfg Config
if err := yaml.Unmarshal(data, &cfg); err != nil {
return nil, err
}
return &cfg, nil
}
Define the webhook server options:
// Webhook server options
type webHookSvrOptions struct {
	port           int    // port the HTTPS server listens on
	certFile       string // path to the x509 certificate for HTTPS
	keyFile        string // path to the x509 private key matching certFile
	sidecarCfgFile string // path to the sidecar-injection configuration file
}
In main, expose these as command-line flags with defaults and parse them:
package main
import (
"flag"
"github.com/golang/glog"
)
func main() {
var runOption webHookSvrOptions
// get command line parameters
flag.IntVar(&runOption.port, "port", 8443, "Webhook server port.")
flag.StringVar(&runOption.certFile, "tlsCertFile", "/etc/webhook/certs/cert.pem", "File containing the x509 Certificate for HTTPS.")
flag.StringVar(&runOption.keyFile, "tlsKeyFile", "/etc/webhook/certs/key.pem", "File containing the x509 private key to --tlsCertFile.")
//flag.StringVar(&runOption.sidecarCfgFile, "sidecarCfgFile", "/etc/webhook/config/sidecarconfig.yaml", "File containing the mutation configuration.")
flag.StringVar(&runOption.sidecarCfgFile, "sidecarCfgFile", "config.yaml", "File containing the mutation configuration.")
flag.Parse()
sidecarConfig, err := loadConfig(runOption.sidecarCfgFile)
if err != nil {
glog.Errorf("Failed to load configuration: %v", err)
return
}
glog.Infof("[sidecarConfig:%v]", sidecarConfig)
}
Load the TLS x509 key pair:
pair, err := tls.LoadX509KeyPair(runOption.certFile, runOption.keyFile)
if err != nil {
glog.Errorf("Failed to load key pair: %v", err)
return
}
Define the webhook HTTP server and construct it
In webhook.go:
type webhookServer struct {
	sidecarConfig *Config      // sidecar-injection configuration
	server        *http.Server // the HTTPS server
}
In main:
webhooksvr := &webhookServer{
sidecarConfig: sidecarConfig,
server: &http.Server{
Addr: fmt.Sprintf(":%v", runOption.port),
TLSConfig: &tls.Config{Certificates: []tls.Certificate{pair}},
},
}
Add webhookServer's mutate handler and wire up the path
webhook.go:
func (ws *webhookServer) serveMutate(w http.ResponseWriter, r *http.Request) {
}
main.go:
mux := http.NewServeMux()
mux.HandleFunc("/mutate", webhooksvr.serveMutate)
webhooksvr.server.Handler = mux
// start webhook server in new goroutine
go func() {
	if err := webhooksvr.server.ListenAndServeTLS("", ""); err != nil {
		glog.Errorf("Failed to listen and serve webhook server: %v", err)
	}
}()
This means requests to /mutate are handled by webhooksvr.serveMutate.
In main, listen for shutdown signals:
// listening OS shutdown singal
signalChan := make(chan os.Signal, 1)
signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
<-signalChan
glog.Infof("Got 0S shutdown signal, shutting down webhook server gracefully...")
webhooksvr.server.Shutdown(context.Background())
This part implements the mutating admission webhook piece (the red box in the diagram).
Code repository: GitHub - yunixiangfeng/k8s-exercise
k8s-exercise/kube-mutating-webhook-inject-pod at main · yunixiangfeng/k8s-exercise · GitHub
What is an admission control plugin
Writing serveMutate
- Validate the admission request parameters
- Decide from annotations whether the sidecar should be injected
- Write the mutatePod injection function
- Write the patch functions that add the injected container and volume
Writing serveMutate
Basic request validation
- the serveMutate method checks:
- whether the body is empty
- whether the request header Content-Type is application/json
// webhookServer's mutate handler
func (ws *webhookServer) serveMutate(w http.ResponseWriter, r *http.Request) {
	var body []byte
	if r.Body != nil {
		if data, err := ioutil.ReadAll(r.Body); err == nil {
			body = data
		}
	}
	if len(body) == 0 {
		glog.Error("empty body")
		http.Error(w, "empty body", http.StatusBadRequest)
		return
	}
	// verify the content type is accurate
	contentType := r.Header.Get("Content-Type")
	if contentType != "application/json" {
		glog.Errorf("Content-Type=%s, expect application/json", contentType)
		http.Error(w, "invalid Content-Type, expect `application/json`", http.StatusUnsupportedMediaType)
		return
	}
}
Validating the admission request
- Construct the admission review object, which contains the request and the response.
- Decode the incoming request with the UniversalDeserializer.
- If decoding fails, set the response to the error message.
- Otherwise call mutatePod to build the response.
// the admission controller's response
var admissionResponse *v1beta1.AdmissionResponse
// construct the admission review object (request + response),
// decode the incoming request with the UniversalDeserializer,
// set an error response if decoding fails,
// otherwise call mutatePod to build the response
ar := v1beta1.AdmissionReview{}
if _, _, err := deserializer.Decode(body, nil, &ar); err != nil {
glog.Errorf("Can't decode body: %v", err)
admissionResponse = &v1beta1.AdmissionResponse{
Result: &metav1.Status{
Message: err.Error(),
},
}
} else {
admissionResponse = ws.mutatePod(&ar)
}
The decoder is the UniversalDeserializer:
D:\Workspace\Go\pkg\mod\k8s.io\[email protected]\pkg\runtime\serializer\codec_factory.go
import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"io/ioutil"
	"net/http"
	"strings"

	"github.com/golang/glog"
	"gopkg.in/yaml.v2"
	"k8s.io/api/admission/v1beta1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/serializer"
)
var (
runtimeScheme = runtime.NewScheme()
codecs = serializer.NewCodecFactory(runtimeScheme)
deserializer = codecs.UniversalDeserializer()
// (https://github.com/kubernetes/kubernetes/issues/57982)
defaulter = runtime.ObjectDefaulter(runtimeScheme)
)
Writing the response
- Construct the final response object, an AdmissionReview.
- Assign the response to it.
- Marshal it to JSON and write it with w.Write.
// construct the final response object, an AdmissionReview
// assign the response to it
// marshal to JSON and write it with w.Write
admissionReview := v1beta1.AdmissionReview{}
if admissionResponse != nil {
admissionReview.Response = admissionResponse
if ar.Request != nil {
admissionReview.Response.UID = ar.Request.UID
}
}
resp, err := json.Marshal(admissionReview)
if err != nil {
glog.Errorf("Can't encode response: %v", err)
http.Error(w, fmt.Sprintf("could not encode response: %v", err),
http.StatusInternalServerError)
}
glog.Infof("Ready to write reponse ...")
if _, err := w.Write(resp); err != nil {
glog.Errorf("Can't write response: %v", err)
http.Error(w, fmt.Sprintf("could not write response: %v", err), http.StatusInternalServerError)
}
Writing the mutatePod injection function
- Decode the object in the request into a Pod; return an error response if that fails.
func (ws *webhookServer) mutatePod(ar *v1beta1.AdmissionReview) *v1beta1.AdmissionResponse {
// decode the object in the request into a Pod; return an error response on failure
req := ar.Request
var pod corev1.Pod
if err := json.Unmarshal(req.Object.Raw, &pod); err != nil {
glog.Errorf("Could not unmarshal raw object: %v", err)
return &v1beta1.AdmissionResponse{
Result: &metav1.Status{
Message: err.Error(),
},
}
}
}
是否需要注入判断
// 是否需要注入判断
if !mutationRequired(ignoredNamespaces, &pod.ObjectMeta) {
glog.Infof("Skipping mutation for %s/%s due to policy check", pod.Namespace, pod.Name)
return &v1beta1.AdmissionResponse{
Allowed: true,
}
}
mutationRequired判断函数, 判断这个pod资源要不要注入
1.如果pod在高权限的ns中,不注入
2.如果pod annotations中标记为已注入就不再注入了
3.如果pod annotations中配置不愿意注入就不注入
// 判断这个pod资源要不要注入
// 1.如果pod在高权限的ns中,不注入
// 2.如果pod annotations中标记为已注入就不再注入了
// 3.如果pod annotations中配置不愿意注入就不注入
func mutationRequired(ignoredList []string, metadata *metav1.ObjectMeta) bool {
// skip special kubernete system namespaces
for _, namespace := range ignoredList {
if metadata.Namespace == namespace {
glog.Infof("skip mutation for %v for it's in special namespace:%v", metadata.Name, metadata.Namespace)
return false
}
}
annotations := metadata.GetAnnotations()
if annotations == nil {
annotations = map[string]string{}
}
// 如果 annotation中 标记为已注入就不再注入了
status := annotations[admissionWebhookAnnotationStatusKey]
if strings.ToLower(status) == "injected" {
return false
}
// 如果pod中配置不愿意注入就不注入
switch strings.ToLower(annotations[admissionWebhookAnnotationInjectKey]) {
case "true":
return true
default:
return false
}
}
相关的常量定义
const (
// 代表这个pod是否要注入 = true代表要注入
admissionWebhookAnnotationInjectKey = "sidecar-injector-webhook.xiaoyi/need_inject"
// 代表判断pod已经注入过的标志 = injected代表已经注入了,就不再注入
admissionWebhookAnnotationStatusKey = "sidecar-injector-webhook.xiaoyi/status"
)
// 为了安全,不给这两个ns中的pod注入 sidecar
var ignoredNamespaces = []string{
metav1.NamespaceSystem,
metav1.NamespacePublic,
}
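结合上面的常量,可以写一小段演示代码直观看一下 mutationRequired 的判定结果(与上文同包,示例中的 pod 名称和 namespace 均为假设值):
import (
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func demoMutationRequired() {
    // kube-system 中的 pod:在 ignoredNamespaces 中,直接跳过
    inSystem := &metav1.ObjectMeta{Name: "demo", Namespace: metav1.NamespaceSystem}
    fmt.Println(mutationRequired(ignoredNamespaces, inSystem)) // false

    // 显式声明 need_inject=true 的 pod:需要注入
    wantInject := &metav1.ObjectMeta{
        Name:      "demo",
        Namespace: "nginx-injection",
        Annotations: map[string]string{
            admissionWebhookAnnotationInjectKey: "true",
        },
    }
    fmt.Println(mutationRequired(ignoredNamespaces, wantInject)) // true

    // 已经标记为 injected 的 pod:不再重复注入
    already := &metav1.ObjectMeta{
        Name:      "demo",
        Namespace: "nginx-injection",
        Annotations: map[string]string{
            admissionWebhookAnnotationStatusKey: "injected",
        },
    }
    fmt.Println(mutationRequired(ignoredNamespaces, already)) // false
}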
添加默认的配置
https://github.com/kubernetes/kubernetes/pull/58025
defaulter = runtime.ObjectDefaulter(runtimeScheme)
func applyDefaultsWorkaround(containers []corev1.Container, volumes []corev1.Volume) {
defaulter.Default(&corev1.Pod{
Spec: corev1.PodSpec{
Containers: containers,
Volumes: volumes,
},
})
}
定义patchOperation
type patchOperation struct {
Op string `json:"op"` // 动作
Path string `json:"path"` // 操作的path
Value interface{} `json:"value,omitempty"` //值
}
生成容器端的patch函数
// 添加容器的patch
// 如果目标列表为空(添加的是第一个元素),直接以列表形式写到basePath;否则在path末尾追加 /- 表示向已有列表追加
func addContainer(target, added []corev1.Container, basePath string) (patch []patchOperation) {
first := len(target) == 0
var value interface{}
for _, add := range added {
value = add
path := basePath
if first {
first = false
value = []corev1.Container{add}
} else {
path = path + "/-"
}
patch = append(patch, patchOperation{
Op: "add",
Path: path,
Value: value,
})
}
return patch
}
生成添加volume的patch函数
func addVolume(target, added []corev1.Volume, basePath string) (patch []patchOperation) {
first := len(target) == 0
var value interface{}
for _, add := range added {
value = add
path := basePath
if first {
first = false
value = []corev1.Volume{add}
} else {
path = path + "/-"
}
patch = append(patch, patchOperation{
Op: "add",
Path: path,
Value: value,
})
}
return patch
}
更新annotation的patch
func updateAnnotation(target map[string]string, added map[string]string) (patch []patchOperation) {
for key, value := range added {
if target == nil || target[key] == "" {
target = map[string]string{}
patch = append(patch, patchOperation{
Op: "add",
Path: "/metadata/annotations",
Value: map[string]string{
key: value,
},
})
} else {
patch = append(patch, patchOperation{
Op: "replace",
Path: "/metadata/annotations/" + key,
Value: value,
})
}
}
return patch
}
最终的patch调用
func createPatch(pod *corev1.Pod, sidecarConfig *Config, annotations map[string]string) ([]byte, error) {
var patch []patchOperation
patch = append(patch, addContainer(pod.Spec.Containers, sidecarConfig.Containers, "/spec/containers")...)
patch = append(patch, addVolume(pod.Spec.Volumes, sidecarConfig.Volumes, "/spec/volumes")...)
patch = append(patch, updateAnnotation(pod.Annotations, annotations)...)
return json.Marshal(patch)
}
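可以写一小段演示代码把生成的 JSON Patch 打印出来,直观感受最终 patch 的结构(与上文同包;sidecar 容器和 volume 的内容是假设的示例值,不代表真实的 sidecarConfig):
import (
    "encoding/json"
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

func demoPatch() {
    existing := []corev1.Container{{Name: "alpine", Image: "alpine"}}
    sidecars := []corev1.Container{{Name: "nginx-sidecar", Image: "nginx:1.21"}}
    volumes := []corev1.Volume{{Name: "nginx-conf"}}

    var patch []patchOperation
    // pod 已有容器,所以 path 是 /spec/containers/-
    patch = append(patch, addContainer(existing, sidecars, "/spec/containers")...)
    // pod 没有 volume,所以 path 是 /spec/volumes,value 是整个列表
    patch = append(patch, addVolume(nil, volumes, "/spec/volumes")...)
    patch = append(patch, updateAnnotation(nil, map[string]string{
        admissionWebhookAnnotationStatusKey: "injected",
    })...)

    out, _ := json.Marshal(patch)
    fmt.Println(string(out))
    // 大致输出(省略零值字段):
    // [{"op":"add","path":"/spec/containers/-","value":{"name":"nginx-sidecar","image":"nginx:1.21"}},
    //  {"op":"add","path":"/spec/volumes","value":[{"name":"nginx-conf"}]},
    //  {"op":"add","path":"/metadata/annotations","value":{"sidecar-injector-webhook.xiaoyi/status":"injected"}}]
}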
调用patch 生成patch option
- mutatePod方法中
annotations := map[string]string{admissionWebhookAnnotationStatusKey: "injected"}
patchBytes, err := createPatch(&pod, ws.sidecarConfig, annotations)
if err != nil {
return &v1beta1.AdmissionResponse{
Result: &metav1.Status{
Message: err.Error(),
},
}
}
glog.Infof("AdmissionResponse: patch=%v\n", string(patchBytes))
return &v1beta1.AdmissionResponse{
Allowed: true,
Patch: patchBytes,
PatchType: func() *v1beta1.PatchType {
pt := v1beta1.PatchTypeJSONPatch
return &pt
}(),
}
return nil
}
// Workaround: https://github.com/kubernetes/kubernetes/issues/57982
glog.Infof("[before applyDefaultsWorkaround][ws.sidecarConfig.Containers:%+v][ws.sidecarConfig.Volumes:%+v]", ws.sidecarConfig.Containers[0], ws.sidecarConfig.Volumes[0])
applyDefaultsWorkaround(ws.sidecarConfig.Containers, ws.sidecarConfig.Volumes)
glog.Infof("[after applyDefaultsWorkaround][ws.sidecarConfig.Containers:%+v][ws.sidecarConfig.Volumes:%+v]", ws.sidecarConfig.Containers[0], ws.sidecarConfig.Volumes[0])
// 这里构造一个本次已注入sidecar的annotations
annotations := map[string]string{admissionWebhookAnnotationStatusKey: "injected"}
serveMutate编写
。准入控制请求参数校验
。根据annotation标签判断是否需要注入sidecar
。mutatePod注入函数编写
。生成注入容器和volume的patch函数
创建ca证书,通过csr让apiserver签名
获取审批后的证书,用它创建MutatingWebhookConfiguration
编译打镜像
makefile
IMAGE_NAME ?= sidecar-injector
PWD := $(shell pwd)
BASE_DIR := $(shell basename $(PWD))
export GOPATH ?= $(GOPATH_DEFAULT)
IMAGE_TAG ?= $(shell date +v%Y%m%d)-$(shell git describe --match=$(git rev-parse --short=8 HEAD) --tags --always --dirty)
build:
@echo "Building the $(IMAGE_NAME) binary..."
@CGO_ENABLED=0 go build -o $(IMAGE_NAME) ./pkg/
build-linux:
@echo "Building the $(IMAGE_NAME) binary for Docker (linux)..."
@GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -o $(IMAGE_NAME) ./pkg/
#################################
# image section
#################################
image: build-image
build-image: build-linux
@echo "Building the docker image: $(IMAGE_NAME)..."
@docker build -t $(IMAGE_NAME) -f Dockerfile .
.PHONY: all build image
dockerfile
FROM alpine:latest
# set environment variables
ENV SIDECAR_INJECTOR=/usr/local/bin/sidecar-injector \
USER_UID=1001 \
USER_NAME=sidecar-injector
COPY sidecar-injector /usr/local/bin/sidecar-injector
# set entrypoint
ENTRYPOINT ["/usr/local/bin/sidecar-injector"]
# switch to non-root user
USER ${USER_UID}
打包代码kube-mutating-webhook-inject-pod.zip
拷贝到k8s集群节点,
打镜像
运行make build-image
docker save sidecar-injector > a.tar
scp a.tar k8s-worker02:~
ctr --namespace k8s.io images import a.tar
部署
创建ns nginx-injection,最终部署到这个ns中的容器会被注入nginx sidecar
kubectl create ns nginx-injection
创建ns sidecar-injector,我们的这个mutate webhook服务运行的ns
kubectl create ns sidecar-injector
创建ca证书,并让apiserver签名
01生成证书签名请求配置文件csr.conf
cat <<EOF > csr.conf
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name
[req_distinguished_name]
[ v3_req ]
basicConstraints = CA:FALSE
keyUsage = nonRepudiation, digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = sidecar-injector-webhook-svc
DNS.2 = sidecar-injector-webhook-svc.sidecar-injector
DNS.3 = sidecar-injector-webhook-svc.sidecar-injector.svc
EOF
02 openssl genrsa 命令生成RSA私钥
openssl genrsa -out server-key.pem 2048
03 使用私钥和上面的配置文件生成证书签名请求文件(CSR)
openssl req -new -key server-key.pem -subj "/CN=sidecar-injector-webhook-svc.sidecar-injector.svc" -out server.csr -config csr.conf
删除之前的csr请求
kubectl delete csr sidecar-injector-webhook-svc.sidecar-injector
申请csr CertificateSigningRequest
cat <
检查csr
kubectl get csr
审批csr
kubectl certificate approve sidecar-injector-webhook-svc.sidecar-injector
certificatesigningrequest.certificates.k8s.io/sidecar-injector-webhook-svc.sidecar-injector approved
获取签名后的证书
serverCert=$(kubectl get csr sidecar-injector-webhook-svc.sidecar-injector -o jsonpath='{ .status.certificate}')
echo "${serverCert}" | openssl base64 -d -A -out server-cert.pem
使用证书创建secret
kubectl create secret generic sidecar-injector-webhook-certs \
--from-file=key.pem=server-key.pem \
--from-file=cert.pem=server-cert.pem \
--dry-run=client -o yaml |
kubectl -n sidecar-injector apply -f -
检查证书
kubectl get secret -n sidecar-injector
NAME TYPE DATA AGE
default-token-hvgnl kubernetes.io/service-account-token 3 25m
sidecar-injector-webhook-certs Opaque 2 25m
获取CA_BUNDLE并替换 mutatingwebhook 中的CA_BUNDLE占位符
CA_BUNDLE=$(kubectl config view --raw --minify --flatten -o jsonpath='{.clusters[].cluster.certificate-authority-data}')
if [ -z "${CA_BUNDLE}" ]; then
CA_BUNDLE=$(kubectl get secrets -o jsonpath="{.items[?(@.metadata.annotations['kubernetes\.io/service-account\.name']=='default')].data.ca\.crt}")
fi
替换
cat deploy/mutating_webhook.yaml | sed -e "s|\${CA_BUNDLE}|${CA_BUNDLE}|g" > deploy/mutatingwebhook-ca-bundle.yaml
检查结果
cat deploy/mutatingwebhook-ca-bundle.yaml
上述两个步骤可以直接运行脚本
脚本如下
chmod +x ./deploy/*.sh
./deploy/webhook-create-signed-cert.sh \
--service sidecar-injector-webhook-svc \
--secret sidecar-injector-webhook-certs \
--namespace sidecar-injector
cat deploy/mutating_webhook.yaml | \
deploy/webhook-patch-ca-bundle.sh > \
deploy/mutatingwebhook-ca-bundle.yaml
这里重用 Istio 项目中生成证书签名请求的脚本:向 apiserver 发送 CSR 请求获取签名后的证书,然后用结果创建需要的 secret 对象。
部署yaml
01先部署sidecar-injector
部署
kubectl create -f deploy/inject_configmap.yaml
kubectl create -f deploy/inject_deployment.yaml
kubectl create -f deploy/inject_service.yaml
检查
kubectl get pod -n sidecar-injector
kubectl get svc -n sidecar-injector
02 部署 mutatingwebhook
kubectl create -f deploy/mutatingwebhook-ca-bundle.yaml
检查
kubectl get MutatingWebhookConfiguration -A
03 部署nginx-sidecar 运行所需的configmap
kubectl create -f deploy/nginx_configmap.yaml
04 创建一个namespace ,并打上标签 sidecar-injection=enabled
kubectl create ns nginx-injection
kubectl label namespace nginx-injection nginx-sidecar-injection=enabled
nginx-sidecar-injection=enabled 这个标签和 MutatingWebhookConfiguration 中的 namespaceSelector 过滤条件相同
namespaceSelector:
matchLabels:
nginx-sidecar-injection: enabled
检查标签结果,最终部署到这里的pod都判断是否要注入sidecar
kubectl get ns -L nginx-sidecar-injection
05 向nginx-injection中部署一个pod
annotations中配置的 sidecar-injector-webhook.nginx.sidecar/need_inject: "true" 代表需要注入
apiVersion: v1
kind: Pod
metadata:
namespace: nginx-injection
name: test-alpine-inject01
labels:
role: myrole
annotations:
sidecar-injector-webhook.nginx.sidecar/need_inject: "true"
spec:
containers:
- image: alpine
command:
- /bin/sh
- "-c"
- "sleep 60m"
imagePullPolicy: IfNotPresent
name: alpine
restartPolicy: Always
部署
kubectl create -f test_sleep_deployment.yaml
查看结果,可以看到test-alpine-inject01 pod中被注入了nginx sidecar,curl这个pod的ip访问80端口,可以看到nginx sidecar的响应
kubectl get pod -n nginx-injection -o wide
curl pod_ip
06 观察sidecar-injector的日志
apiserver过来访问 sidecar-injector,然后经过判断后给该pod 注入了sidecar
07 部署一个不需要注入sidecar的pod
sidecar-injector-webhook.nginx.sidecar/need_inject: "false" 明确指出不需要注入
apiVersion: v1
kind: Pod
metadata:
namespace: nginx-injection
name: test-alpine-inject02
labels:
role: myrole
annotations:
sidecar-injector-webhook.nginx.sidecar/need_inject: "false"
spec:
containers:
- image: alpine
command:
- /bin/sh
- "-c"
- "sleep 60m"
imagePullPolicy: IfNotPresent
name: alpine
restartPolicy: Always
观察部署结果,test-alpine-inject02 只运行了一个容器
观察sidecar-injector的日志,可以看到[skip mutation][reason=pod not need]
通用的GenericAPIServer New函数
apiserver核心服务的初始化
最终的apiserver启动流程
之前我们分析了使用buildGenericConfig构建api核心服务的配置
然后回到CreateServerChain函数中,位置
D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go
发现会调用三个server的create函数,传入对应的配置初始化
。 apiExtensionsServer API扩展服务,主要针对CRD
。 kubeAPIServer API核心服务,包括常见的Pod/Deployment/Service
。 aggregatorServer API聚合服务,主要针对metrics
代码如下
apiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegateWithCustomHandler(notFoundHandler))
if err != nil {
return nil, err
}
kubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer)
if err != nil {
return nil, err
}
// aggregator comes last in the chain
aggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, kubeAPIServerConfig.ExtraConfig.ProxyTransport, pluginInitializer)
if err != nil {
return nil, err
}
aggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)
if err != nil {
// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines
return nil, err
}
。这三个server在创建时,内部都会调用GenericConfig.New初始化一个通用的GenericAPIServer
。如 kubeAPIServer
s, err := c.GenericConfig.New("kube-apiserver", delegationTarget)
。还有apiExtensionsServer
genericServer, err := c.GenericConfig.New("apiextensions-apiserver", delegationTarget)
。还有 createAggregatorServer
genericServer, err := c.GenericConfig.New("kube-aggregator", delegationTarget)
位置
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\config.go
New 创建一个新的服务器,它在逻辑上将处理链与传入的服务器组合在一起。
name用于区分日志记录。
初始化handler
handlerChainBuilder := func(handler http.Handler) http.Handler {
return c.BuildHandlerChainFunc(handler, c.Config)
}
apiServerHandler := NewAPIServerHandler(name, c.Serializer, handlerChainBuilder, delegationTarget.UnprotectedHandler())
s := &GenericAPIServer{
discoveryAddresses: c.DiscoveryAddresses,
LoopbackClientConfig: c.LoopbackClientConfig,
legacyAPIGroupPrefixes: c.LegacyAPIGroupPrefixes,
admissionControl: c.AdmissionControl,
Serializer: c.Serializer,
AuditBackend: c.AuditBackend,
Authorizer: c.Authorization.Authorizer,
delegationTarget: delegationTarget,
EquivalentResourceRegistry: c.EquivalentResourceRegistry,
HandlerChainWaitGroup: c.HandlerChainWaitGroup,
Handler: apiServerHandler,
listedPathProvider: apiServerHandler,
minRequestTimeout: time.Duration(c.MinRequestTimeout) * time.Second,
ShutdownTimeout: c.RequestTimeout,
ShutdownDelayDuration: c.ShutdownDelayDuration,
SecureServingInfo: c.SecureServing,
ExternalAddress: c.ExternalAddress,
openAPIConfig: c.OpenAPIConfig,
openAPIV3Config: c.OpenAPIV3Config,
skipOpenAPIInstallation: c.SkipOpenAPIInstallation,
postStartHooks: map[string]postStartHookEntry{},
preShutdownHooks: map[string]preShutdownHookEntry{},
disabledPostStartHooks: c.DisabledPostStartHooks,
healthzChecks: c.HealthzChecks,
livezChecks: c.LivezChecks,
readyzChecks: c.ReadyzChecks,
livezGracePeriod: c.LivezGracePeriod,
DiscoveryGroupManager: discovery.NewRootAPIsHandler(c.DiscoveryAddresses, c.Serializer),
maxRequestBodyBytes: c.MaxRequestBodyBytes,
livezClock: clock.RealClock{},
lifecycleSignals: c.lifecycleSignals,
ShutdownSendRetryAfter: c.ShutdownSendRetryAfter,
APIServerID: c.APIServerID,
StorageVersionManager: c.StorageVersionManager,
Version: c.Version,
muxAndDiscoveryCompleteSignals: map[string]<-chan struct{}{},
}
先从传入的server中获取
// first add poststarthooks from delegated targets
for k, v := range delegationTarget.PostStartHooks() {
s.postStartHooks[k] = v
}
for k, v := range delegationTarget.PreShutdownHooks() {
s.preShutdownHooks[k] = v
}
再从提前配置的completedConfig中获取
// add poststarthooks that were preconfigured. Using the add method will give us an error if the same name has already been registered.
for name, preconfiguredPostStartHook := range c.PostStartHooks {
if err := s.AddPostStartHook(name, preconfiguredPostStartHook.hook); err != nil {
return nil, err
}
}
比如在之前生成GenericConfig配置的admissionPostStartHook准入控制hook,位置
D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go
if err := config.GenericConfig.AddPostStartHook("start-kube-apiserver-admission-initializer", admissionPostStartHook); err != nil {
return nil, nil, nil, err
}
对应的hook方法在admissionNew中,位置
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\kubeapiserver\admission\config.go
admissionPostStartHook := func(context genericapiserver.PostStartHookContext) error {
discoveryRESTMapper.Reset()
go utilwait.Until(discoveryRESTMapper.Reset, 30*time.Second, context.StopCh)
return nil
}
注册generic-apiserver-start-informers的hook
genericApiServerHookName := "generic-apiserver-start-informers"
if c.SharedInformerFactory != nil {
if !s.isPostStartHookRegistered(genericApiServerHookName) {
err := s.AddPostStartHook(genericApiServerHookName, func(context PostStartHookContext) error {
c.SharedInformerFactory.Start(context.StopCh)
return nil
})
if err != nil {
return nil, err
}
}
// TODO: Once we get rid of /healthz consider changing this to post-start-hook.
err := s.AddReadyzChecks(healthz.NewInformerSyncHealthz(c.SharedInformerFactory))
if err != nil {
return nil, err
}
}
注册apiserver中的限流策略 hook
具体的内容在限流那章节中讲解
const priorityAndFairnessConfigConsumerHookName = "priority-and-fairness-config-consumer"
if s.isPostStartHookRegistered(priorityAndFairnessConfigConsumerHookName) {
} else if c.FlowControl != nil {
err := s.AddPostStartHook(priorityAndFairnessConfigConsumerHookName, func(context PostStartHookContext) error {
go c.FlowControl.MaintainObservations(context.StopCh)
go c.FlowControl.Run(context.StopCh)
return nil
})
if err != nil {
return nil, err
}
// TODO(yue9944882): plumb pre-shutdown-hook for request-management system?
} else {
klog.V(3).Infof("Not requested to run hook %s", priorityAndFairnessConfigConsumerHookName)
}
// Add PostStartHooks for maintaining the watermarks for the Priority-and-Fairness and the Max-in-Flight filters.
if c.FlowControl != nil {
const priorityAndFairnessFilterHookName = "priority-and-fairness-filter"
if !s.isPostStartHookRegistered(priorityAndFairnessFilterHookName) {
err := s.AddPostStartHook(priorityAndFairnessFilterHookName, func(context PostStartHookContext) error {
genericfilters.StartPriorityAndFairnessWatermarkMaintenance(context.StopCh)
return nil
})
if err != nil {
return nil, err
}
}
} else {
const maxInFlightFilterHookName = "max-in-flight-filter"
if !s.isPostStartHookRegistered(maxInFlightFilterHookName) {
err := s.AddPostStartHook(maxInFlightFilterHookName, func(context PostStartHookContext) error {
genericfilters.StartMaxInFlightWatermarkMaintenance(context.StopCh)
return nil
})
if err != nil {
return nil, err
}
}
}
添加健康检查
for _, delegateCheck := range delegationTarget.HealthzChecks() {
skip := false
for _, existingCheck := range c.HealthzChecks {
if existingCheck.Name() == delegateCheck.Name() {
skip = true
break
}
}
if skip {
continue
}
s.AddHealthChecks(delegateCheck)
}
通过把liveness容忍度(grace period)设置为0,意味着这些健康检查一旦失败,就立即把apiserver标记为不健康
位置
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\healthz.go
// AddHealthChecks adds HealthCheck(s) to health endpoints (healthz, livez, readyz) but
// configures the liveness grace period to be zero, which means we expect this health check
// to immediately indicate that the apiserver is unhealthy.
func (s *GenericAPIServer) AddHealthChecks(checks ...healthz.HealthChecker) error {
// we opt for a delay of zero here, because this entrypoint adds generic health checks
// and not health checks which are specifically related to kube-apiserver boot-sequences.
return s.addHealthChecks(0, checks...)
}
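作为补充,下面是一个注册自定义健康检查的演示草图(假设已经拿到上面 New 出来的 *GenericAPIServer,检查名称 demo-check 是假设值),用的是 healthz.NamedCheck:
import (
    "net/http"

    genericapiserver "k8s.io/apiserver/pkg/server"
    "k8s.io/apiserver/pkg/server/healthz"
)

func addDemoHealthCheck(s *genericapiserver.GenericAPIServer) error {
    return s.AddHealthChecks(healthz.NamedCheck("demo-check", func(_ *http.Request) error {
        // 返回 nil 表示健康;返回 error 则 /healthz、/livez、/readyz 会报告该检查失败
        return nil
    }))
}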
初始化api路由的installAPI
。添加/和/index.html的路由规则
if c.EnableIndex {
routes.Index{}.Install(s.listedPathProvider, s.Handler.NonGoRestfulMux)
}
添加/debug/pprof 分析的路由规则,用于性能分析
if c.EnableProfiling {
routes.Profiling{}.Install(s.Handler.NonGoRestfulMux)
if c.EnableContentionProfiling {
goruntime.SetBlockProfileRate(1)
}
// so far, only logging related endpoints are considered valid to add for these debug flags.
routes.DebugFlags{}.Install(s.Handler.NonGoRestfulMux, "v", routes.StringFlagPutHandler(logs.GlogSetter))
}
添加/metrics 指标监控的路由规则
if c.EnableMetrics {
if c.EnableProfiling {
routes.MetricsWithReset{}.Install(s.Handler.NonGoRestfulMux)
} else {
routes.DefaultMetrics{}.Install(s.Handler.NonGoRestfulMux)
}
}
添加/version 版本信息的路由规则
routes.Version{Version: c.Version}.Install(s.Handler.GoRestfulContainer)
开启服务发现
if c.EnableDiscovery {
s.Handler.GoRestfulContainer.Add(s.DiscoveryGroupManager.WebService())
}
位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\controlplane\instance.go
// New returns a new instance of Master from the given config.
// Certain config fields will be set to a default value if unset.
// Certain config fields must be specified, including:
// KubeletClientConfig
func (c completedConfig) New(delegationTarget genericapiserver.DelegationTarget) (*Instance, error) { ... }
上面提到的初始化通用的server
s, err := c.GenericConfig.New("kube-apiserver", delegationTarget)
并且用通用配置实例化master 实例
m := &Instance{
GenericAPIServer: s,
ClusterAuthenticationInfo: c.ExtraConfig.ClusterAuthenticationInfo,
}
// install legacy rest storage
if err := m.InstallLegacyAPI(&c, c.GenericConfig.RESTOptionsGetter); err != nil {
return nil, err
}
if err := m.InstallAPIs(c.ExtraConfig.APIResourceConfigSource, c.GenericConfig.RESTOptionsGetter, restStorageProviders...); err != nil {
return nil, err
}
最终的apiserver启动流程
回到Run函数通过CreateServerChain拿到创建的3个server,执行run即可
位置D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-apiserver\app\server.go
// Run runs the specified APIServer. This should never exit.
func Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {
// To help debugging, immediately log version
klog.Infof("Version: %+v", version.Get())
klog.InfoS("Golang settings", "GOGC", os.Getenv("GOGC"), "GOMAXPROCS", os.Getenv("GOMAXPROCS"), "GOTRACEBACK", os.Getenv("GOTRACEBACK"))
server, err := CreateServerChain(completeOptions, stopCh)
if err != nil {
return err
}
prepared, err := server.PrepareRun()
if err != nil {
return err
}
return prepared.Run(stopCh)
}
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\genericapiserver.go
stoppedCh, listenerStoppedCh, err := s.NonBlockingRun(stopHttpServerCh, shutdownTimeout)
if s.SecureServingInfo != nil && s.Handler != nil {
var err error
stoppedCh, listenerStoppedCh, err = s.SecureServingInfo.Serve(s.Handler, shutdownTimeout, internalStopCh)
if err != nil {
close(internalStopCh)
close(auditStopCh)
return nil, nil, err
}
}
最终调用Serve运行secure http server,
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\secure_serving.go
// Serve runs the secure http server. It fails only if certificates cannot be loaded or the initial listen call fails.
// The actual server loop (stoppable by closing stopCh) runs in a go routine, i.e. Serve does not block.
// It returns a stoppedCh that is closed when all non-hijacked active requests have been processed.
// It returns a listenerStoppedCh that is closed when the underlying http Server has stopped listening.
func (s *SecureServingInfo) Serve(handler http.Handler, shutdownTimeout time.Duration, stopCh <-chan struct{}) (<-chan struct{}, <-chan struct{}, error) {
if s.Listener == nil {
return nil, nil, fmt.Errorf("listener must not be nil")
}
tlsConfig, err := s.tlsConfig(stopCh)
if err != nil {
return nil, nil, err
}
secureServer := &http.Server{
Addr: s.Listener.Addr().String(),
Handler: handler,
MaxHeaderBytes: 1 << 20,
TLSConfig: tlsConfig,
IdleTimeout: 90 * time.Second, // matches http.DefaultTransport keep-alive timeout
ReadHeaderTimeout: 32 * time.Second, // just shy of requestTimeoutUpperBound
}
// At least 99% of serialized resources in surveyed clusters were smaller than 256kb.
// This should be big enough to accommodate most API POST requests in a single frame,
// and small enough to allow a per connection buffer of this size multiplied by `MaxConcurrentStreams`.
const resourceBody99Percentile = 256 * 1024
http2Options := &http2.Server{
IdleTimeout: 90 * time.Second, // matches http.DefaultTransport keep-alive timeout
}
// shrink the per-stream buffer and max framesize from the 1MB default while still accommodating most API POST requests in a single frame
http2Options.MaxUploadBufferPerStream = resourceBody99Percentile
http2Options.MaxReadFrameSize = resourceBody99Percentile
// use the overridden concurrent streams setting or make the default of 250 explicit so we can size MaxUploadBufferPerConnection appropriately
if s.HTTP2MaxStreamsPerConnection > 0 {
http2Options.MaxConcurrentStreams = uint32(s.HTTP2MaxStreamsPerConnection)
} else {
http2Options.MaxConcurrentStreams = 250
}
// increase the connection buffer size from the 1MB default to handle the specified number of concurrent streams
http2Options.MaxUploadBufferPerConnection = http2Options.MaxUploadBufferPerStream * int32(http2Options.MaxConcurrentStreams)
if !s.DisableHTTP2 {
// apply settings to the server
if err := http2.ConfigureServer(secureServer, http2Options); err != nil {
return nil, nil, fmt.Errorf("error configuring http2: %v", err)
}
}
// use tlsHandshakeErrorWriter to handle messages of tls handshake error
tlsErrorWriter := &tlsHandshakeErrorWriter{os.Stderr}
tlsErrorLogger := log.New(tlsErrorWriter, "", 0)
secureServer.ErrorLog = tlsErrorLogger
klog.Infof("Serving securely on %s", secureServer.Addr)
return RunServer(secureServer, s.Listener, shutdownTimeout, stopCh)
}
Scheme 定义了资源序列化和反序列化的方法以及资源类型和版本的对应关系;
这里我们可以理解成一张记录表
所有的k8s资源必须要注册到scheme表中才可以使用
RESTStorage定义了一种资源该如何CRUD,如何和存储打交道
- 各个资源创建的restStore 塞入restStorageMap中
- map的key是 资源/子资源的名称, value是对应的restStore
。上节课讲到apiserver 核心服务初始化的时候会创建restStorage
并用restStorage初始化核心服务
入口地址D:\Workspace\Go\src\k8s.io\kubernetes\pkg\controlplane\instance.go
// InstallLegacyAPI will install the legacy APIs for the restStorageProviders if they are enabled.
func (m *Instance) InstallLegacyAPI(c *completedConfig, restOptionsGetter generic.RESTOptionsGetter) error {
legacyRESTStorageProvider := corerest.LegacyRESTStorageProvider{
StorageFactory: c.ExtraConfig.StorageFactory,
ProxyTransport: c.ExtraConfig.ProxyTransport,
KubeletClientConfig: c.ExtraConfig.KubeletClientConfig,
EventTTL: c.ExtraConfig.EventTTL,
ServiceIPRange: c.ExtraConfig.ServiceIPRange,
SecondaryServiceIPRange: c.ExtraConfig.SecondaryServiceIPRange,
ServiceNodePortRange: c.ExtraConfig.ServiceNodePortRange,
LoopbackClientConfig: c.GenericConfig.LoopbackClientConfig,
ServiceAccountIssuer: c.ExtraConfig.ServiceAccountIssuer,
ExtendExpiration: c.ExtraConfig.ExtendExpiration,
ServiceAccountMaxExpiration: c.ExtraConfig.ServiceAccountMaxExpiration,
APIAudiences: c.GenericConfig.Authentication.APIAudiences,
}
legacyRESTStorage, apiGroupInfo, err := legacyRESTStorageProvider.NewLegacyRESTStorage(c.ExtraConfig.APIResourceConfigSource, restOptionsGetter)
if err != nil {
return fmt.Errorf("error building core storage: %v", err)
}
if len(apiGroupInfo.VersionedResourcesStorageMap) == 0 { // if all core storage is disabled, return.
return nil
}
controllerName := "bootstrap-controller"
coreClient := corev1client.NewForConfigOrDie(c.GenericConfig.LoopbackClientConfig)
bootstrapController, err := c.NewBootstrapController(legacyRESTStorage, coreClient, coreClient, coreClient, coreClient.RESTClient())
if err != nil {
return fmt.Errorf("error creating bootstrap controller: %v", err)
}
m.GenericAPIServer.AddPostStartHookOrDie(controllerName, bootstrapController.PostStartHook)
m.GenericAPIServer.AddPreShutdownHookOrDie(controllerName, bootstrapController.PreShutdownHook)
if err := m.GenericAPIServer.InstallLegacyAPIGroup(genericapiserver.DefaultLegacyAPIPrefix, &apiGroupInfo); err != nil {
return fmt.Errorf("error in registering group versions: %v", err)
}
return nil
}
位置 D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\rest\storage_core.go
func (c LegacyRESTStorageProvider) NewLegacyRESTStorage(apiResourceConfigSource serverstorage.APIResourceConfigSource, restOptionsGetter generic.RESTOptionsGetter) (LegacyRESTStorage, genericapiserver.APIGroupInfo, error) {
apiGroupInfo := genericapiserver.APIGroupInfo{
PrioritizedVersions: legacyscheme.Scheme.PrioritizedVersionsForGroup(""),
VersionedResourcesStorageMap: map[string]map[string]rest.Storage{},
Scheme: legacyscheme.Scheme,
ParameterCodec: legacyscheme.ParameterCodec,
NegotiatedSerializer: legacyscheme.Codecs,
}
·legacyscheme.Scheme是k8s的重要结构体Scheme 的默认实例
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apimachinery\pkg\runtime\scheme.go
Scheme 定义了资源序列化和反序列化的方法以及资源类型和版本的对应关系,这里我们可以理解成一张记录表
运维人员在创建资源的时候,可能只关注kind (如deployment,本能的忽略分组和版本信息)
但是k8s的资源定位中只说deployment是不准确的
因为k8s系统支持多个Group,每个Group支持多个Version,每个Version支持多个Resource
其中部分资源同时会拥有自己的子资源(即SubResource)。例如,Deployment资源拥有Status子资源
资源组、资源版本、资源、子资源的完整表现形式为 <group>/<version>/<resource>/<subresource>
以常用的Deployment资源为例,其完整表现形式为apps/v1/deployments/status
。其中apps代表资源组
。v1代表版本
。deployments代表resource
。status代表子资源
Group:被称为资源组,在Kubernetes API Server中也可称其为APIGroup。
Version:被称为资源版本,在Kubernetes API Server中也可称其为APIVersions。
Resource:被称为资源,在Kubernetes API Server中也可称其为APIResource。.
Kind:资源种类,描述Resource的种类,与Resource为同一级别。
k8s系统拥有众多资源,每一种资源就是一个资源类型
这些资源类型需要有统一的注册、存储、查询、管理等机制
目前k8s系统中的所有资源类型都已注册到Scheme资源注册表中,其是一个内存型的资源注册表,拥有如下特点:
。支持注册多种资源类型,包括内部版本和外部版本。
。支持多种版本转换机制。
。支持不同资源的序列化/反序列化机制。
Scheme资源注册表中的资源类型分为UnversionedType和KnownType两种,分别介绍如下
UnversionedType: 无版本资源类型
这是一个早期Kubernetes系统中的概念,它主要应用于某些没有版本的资源类型
该类型的资源对象并不需要进行转换
在目前的Kubernetes发行版本中,无版本类型已被弱化,几乎所有的资源对象都拥有版本
但在metav1元数据中还有部分类型,它们既属于meta.k8s.io/v1又属于UnversionedType无版本资源类型,例如:
o metav1.Status
o metav1.APIVersions
o metav1.APIGroupList
o metav1.APIGroup
o metav1.APIResourceList
KnownType: 是目前Kubernetes最常用的资源类型
也可称其为“拥有版本的资源类型”。在scheme资源注册表中,UnversionedType资源类型的对象通过scheme.AddUnversionedTypes方法进行注册
KnownType资源类型的对象通过scheme.AddKnownTypes方法进行注册
代码位置
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apimachinery\pkg\runtime\scheme.go
s := &Scheme{
gvkToType: map[schema.GroupVersionKind]reflect.Type{},
typeToGVK: map[reflect.Type][]schema.GroupVersionKind{},
unversionedTypes: map[reflect.Type]schema.GroupVersionKind{},
unversionedKinds: map[string]reflect.Type{},
fieldLabelConversionFuncs: map[schema.GroupVersionKind]FieldLabelConversionFunc{},
defaulterFuncs: map[reflect.Type]func(interface{}){},
versionPriority: map[string][]string{},
schemeName: naming.GetNameFromCallsite(internalPackages...),
}
具体定义如下
。gvkToType:存储GVK与Type的映射关系
· typeToGVK:存储Type与GVK的映射关系,一个Type会对应一个或多个GVK。
· unversionedTypes: 存储UnversionedType与GVK的映射关系。
。unversionedKinds:存储Kind (资源种类)名称与UnversionedType的映射关系
Scheme资源注册表通过Go语言的map结构实现映射关系
这些映射关系可以实现高效的正向和反向检索,从Scheme资源注册表中检索某个GVK的Type,它的时间复杂度O(1)
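下面用一个最小的示例演示这张“记录表”的正反向检索(仅为演示 gvkToType / typeToGVK 的效果,不是 k8s 源码):
package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
    scheme := runtime.NewScheme()
    gv := schema.GroupVersion{Group: "", Version: "v1"}
    // KnownType 注册:把 Pod 这个 Go 类型和 GVK 关联起来
    scheme.AddKnownTypes(gv, &corev1.Pod{})

    // 正向:由对象(Type)查 GVK
    gvks, _, err := scheme.ObjectKinds(&corev1.Pod{})
    fmt.Println(gvks, err) // [/v1, Kind=Pod] <nil>

    // 反向:由 GVK new 出对应 Type 的空对象
    obj, err := scheme.New(gv.WithKind("Pod"))
    fmt.Printf("%T %v\n", obj, err) // *v1.Pod <nil>
}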
获取scheme对象
var Scheme = runtime.NewScheme()
定义注册方法AddToScheme
通过runtime.NewScheme实例化一个新的Scheme资源注册表。注册资源类型到Scheme资源注册表有两种方式:
。通过scheme.AddKnownTypes方法注册KnownType类型的对象。
。通过scheme.AddUnversionedTypes方法注册UnversionedType类型的对象
实例代码
func init() {
metav1.AddToGroupVersion(Scheme, schema.GroupVersion{Version: "v1"})
AddToScheme(Scheme)
}
获取解码对象
var Codecs = serializer.NewCodecFactory(Scheme)
var ParameterCodec = runtime.NewParameterCodec(Scheme)
实际举例
比如我们之前写的 webhook-mutation的准入控制器注入sidecar。
runtimeScheme代表初始化这个注册表。
codecs和deserializer 是解码编码相关的对象
最后可以调用deserializer.Decode解码参数为 v1beta1.AdmissionReview资源
这里就是用了我们上面提到的scheme
apiGroupInfo := genericapiserver.APIGroupInfo{
PrioritizedVersions: legacyscheme.Scheme.PrioritizedVersionsForGroup(""),
VersionedResourcesStorageMap: map[string]map[string]rest.Storage{},
Scheme: legacyscheme.Scheme,
ParameterCodec: legacyscheme.ParameterCodec,
NegotiatedSerializer: legacyscheme.Codecs,
}
restStorage := LegacyRESTStorage{}
使用各种资源的NewREST创建RESTStorage,以configmap为例
configMapStorage, err := configmapstore.NewREST(restOptionsGetter)
if err != nil {
return LegacyRESTStorage{}, genericapiserver.APIGroupInfo{}, err
}
RESTStorage定义了一种资源该如何CRUD,如何和存储打交道
位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\configmap\storage\storage.go
// REST implements a RESTStorage for ConfigMap
type REST struct {
*genericregistry.Store
}
// NewREST returns a RESTStorage object that will work with ConfigMap objects.
func NewREST(optsGetter generic.RESTOptionsGetter) (*REST, error) {
store := &genericregistry.Store{
NewFunc: func() runtime.Object { return &api.ConfigMap{} },
NewListFunc: func() runtime.Object { return &api.ConfigMapList{} },
PredicateFunc: configmap.Matcher,
DefaultQualifiedResource: api.Resource("configmaps"),
CreateStrategy: configmap.Strategy,
UpdateStrategy: configmap.Strategy,
DeleteStrategy: configmap.Strategy,
TableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)},
}
options := &generic.StoreOptions{
RESTOptions: optsGetter,
AttrFunc: configmap.GetAttrs,
TriggerFunc: map[string]storage.IndexerFunc{"metadata.name": configmap.NameTriggerFunc},
}
if err := store.CompleteWithOptions(options); err != nil {
return nil, err
}
return &REST{store}, nil
}
· NewFunc代表get一个对象时的方法
。NewListFunc代表list对象时的方法
· PredicateFunc 返回与提供的标签对应的匹配器和字段。如果object 匹配给定的字段和标签选择器则返回真
。DefaultQualifiedResource 是资源的复数名称。
· CreateStrategy 代表创建的策略
· UpdateStrategy代表更新的策略
· DeleteStrategy 代表删除的策略
。TableConvertor 代表输出为表格的方法
。options代表选项,并使用store.CompleteWithOptions(options)做校验
// CompleteWithOptions updates the store with the provided options and
// defaults common fields.
func (e *Store) CompleteWithOptions(options *generic.StoreOptions) error {
if e.DefaultQualifiedResource.Empty() {
return fmt.Errorf("store %#v must have a non-empty qualified resource", e)
}
if e.NewFunc == nil {
return fmt.Errorf("store for %s must have NewFunc set", e.DefaultQualifiedResource.String())
}
if e.NewListFunc == nil {
return fmt.Errorf("store for %s must have NewListFunc set", e.DefaultQualifiedResource.String())
}
if (e.KeyRootFunc == nil) != (e.KeyFunc == nil) {
return fmt.Errorf("store for %s must set both KeyRootFunc and KeyFunc or neither", e.DefaultQualifiedResource.String())
}
if e.TableConvertor == nil {
return fmt.Errorf("store for %s must set TableConvertor; rest.NewDefaultTableConvertor(e.DefaultQualifiedResource) can be used to output just name/creation time", e.DefaultQualifiedResource.String())
}
位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\pod\storage\storage.go
// NewStorage returns a RESTStorage object that will work against pods.
func NewStorage(optsGetter generic.RESTOptionsGetter, k client.ConnectionInfoGetter, proxyTransport http.RoundTripper, podDisruptionBudgetClient policyclient.PodDisruptionBudgetsGetter) (PodStorage, error) {
store := &genericregistry.Store{
NewFunc: func() runtime.Object { return &api.Pod{} },
NewListFunc: func() runtime.Object { return &api.PodList{} },
PredicateFunc: registrypod.MatchPod,
DefaultQualifiedResource: api.Resource("pods"),
CreateStrategy: registrypod.Strategy,
UpdateStrategy: registrypod.Strategy,
DeleteStrategy: registrypod.Strategy,
ResetFieldsStrategy: registrypod.Strategy,
ReturnDeletedObject: true,
TableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)},
}
options := &generic.StoreOptions{
RESTOptions: optsGetter,
AttrFunc: registrypod.GetAttrs,
TriggerFunc: map[string]storage.IndexerFunc{"spec.nodeName": registrypod.NodeNameTriggerFunc},
Indexers: registrypod.Indexers(),
}
if err := store.CompleteWithOptions(options); err != nil {
return PodStorage{}, err
}
statusStore := *store
statusStore.UpdateStrategy = registrypod.StatusStrategy
statusStore.ResetFieldsStrategy = registrypod.StatusStrategy
ephemeralContainersStore := *store
ephemeralContainersStore.UpdateStrategy = registrypod.EphemeralContainersStrategy
bindingREST := &BindingREST{store: store}
return PodStorage{
Pod: &REST{store, proxyTransport},
Binding: &BindingREST{store: store},
LegacyBinding: &LegacyBindingREST{bindingREST},
Eviction: newEvictionStorage(store, podDisruptionBudgetClient),
Status: &StatusREST{store: &statusStore},
EphemeralContainers: &EphemeralContainersREST{store: &ephemeralContainersStore},
Log: &podrest.LogREST{Store: store, KubeletConn: k},
Proxy: &podrest.ProxyREST{Store: store, ProxyTransport: proxyTransport},
Exec: &podrest.ExecREST{Store: store, KubeletConn: k},
Attach: &podrest.AttachREST{Store: store, KubeletConn: k},
PortForward: &podrest.PortForwardREST{Store: store, KubeletConn: k},
}, nil
}
podStore 返回的是PodStorage,和其它资源不同的是下面会有很多subresource 子资源的restStore
// PodStorage includes storage for pods and all sub resources
type PodStorage struct {
Pod *REST
Binding *BindingREST
LegacyBinding *LegacyBindingREST
Eviction *EvictionREST
Status *StatusREST
EphemeralContainers *EphemeralContainersREST
Log *podrest.LogREST
Proxy *podrest.ProxyREST
Exec *podrest.ExecREST
Attach *podrest.AttachREST
PortForward *podrest.PortForwardREST
}
。利用上面各个资源创建的restStore塞入storage这个map中,map的key是资源/子资源的名称,value是对应的Storage
storage := map[string]rest.Storage{}
if resource := "pods"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = podStorage.Pod
storage[resource+"/attach"] = podStorage.Attach
storage[resource+"/status"] = podStorage.Status
storage[resource+"/log"] = podStorage.Log
storage[resource+"/exec"] = podStorage.Exec
storage[resource+"/portforward"] = podStorage.PortForward
storage[resource+"/proxy"] = podStorage.Proxy
storage[resource+"/binding"] = podStorage.Binding
if podStorage.Eviction != nil {
storage[resource+"/eviction"] = podStorage.Eviction
}
if utilfeature.DefaultFeatureGate.Enabled(features.EphemeralContainers) {
storage[resource+"/ephemeralcontainers"] = podStorage.EphemeralContainers
}
}
if resource := "bindings"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = podStorage.LegacyBinding
}
if resource := "podtemplates"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = podTemplateStorage
}
if resource := "replicationcontrollers"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = controllerStorage.Controller
storage[resource+"/status"] = controllerStorage.Status
if legacyscheme.Scheme.IsVersionRegistered(schema.GroupVersion{Group: "autoscaling", Version: "v1"}) {
storage[resource+"/scale"] = controllerStorage.Scale
}
}
if resource := "services"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = serviceRESTStorage
storage[resource+"/proxy"] = serviceRESTProxy
storage[resource+"/status"] = serviceStatusStorage
}
if resource := "endpoints"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = endpointsStorage
}
if resource := "nodes"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = nodeStorage.Node
storage[resource+"/proxy"] = nodeStorage.Proxy
storage[resource+"/status"] = nodeStorage.Status
}
if resource := "events"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = eventStorage
}
if resource := "limitranges"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = limitRangeStorage
}
if resource := "resourcequotas"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = resourceQuotaStorage
storage[resource+"/status"] = resourceQuotaStatusStorage
}
if resource := "namespaces"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = namespaceStorage
storage[resource+"/status"] = namespaceStatusStorage
storage[resource+"/finalize"] = namespaceFinalizeStorage
}
if resource := "secrets"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = secretStorage
}
if resource := "serviceaccounts"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = serviceAccountStorage
if serviceAccountStorage.Token != nil {
storage[resource+"/token"] = serviceAccountStorage.Token
}
}
if resource := "persistentvolumes"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = persistentVolumeStorage
storage[resource+"/status"] = persistentVolumeStatusStorage
}
if resource := "persistentvolumeclaims"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = persistentVolumeClaimStorage
storage[resource+"/status"] = persistentVolumeClaimStatusStorage
}
if resource := "configmaps"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = configMapStorage
}
if resource := "componentstatuses"; apiResourceConfigSource.ResourceEnabled(corev1.SchemeGroupVersion.WithResource(resource)) {
storage[resource] = componentstatus.NewStorage(componentStatusStorage{c.StorageFactory}.serversToValidate)
}
最终将上述storage塞入apiGroupInfo的VersionedResourcesStorageMap中
这是一个双层map,第一层的key是版本,然后是资源名称,最后是对应的资源存储
if len(storage) > 0 {
apiGroupInfo.VersionedResourcesStorageMap["v1"] = storage
}
kube-apiserver createPod数据时的保存
·上节课我们知道每种资源有对应的Storage,其中定义了如何跟存储打交道
。比如pod的位置
store := &genericregistry.Store{
NewFunc: func() runtime.Object { return &api.Pod{} },
NewListFunc: func() runtime.Object { return &api.PodList{} },
PredicateFunc: registrypod.MatchPod,
DefaultQualifiedResource: api.Resource("pods"),
CreateStrategy: registrypod.Strategy,
UpdateStrategy: registrypod.Strategy,
DeleteStrategy: registrypod.Strategy,
ResetFieldsStrategy: registrypod.Strategy,
ReturnDeletedObject: true,
TableConvertor: printerstorage.TableConvertor{TableGenerator: printers.NewTableGenerator().With(printersinternal.AddHandlers)},
}
pod的资源对应的就是原始的store
。rest store底层是 genericregistry.Store,下面我们来分析一下genericregistry.Store的create 创建方法
// REST implements a RESTStorage for pods
type REST struct {
*genericregistry.Store
proxyTransport http.RoundTripper
}
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\registry\generic\registry\store.go
先调用BeginCreate
if e.BeginCreate != nil {
fn, err := e.BeginCreate(ctx, obj, options)
if err != nil {
return nil, err
}
finishCreate = fn
defer func() {
finishCreate(ctx, false)
}()
}
if err := rest.BeforeCreate(e.CreateStrategy, ctx, obj); err != nil {return nil, err}
strategy.PrepareForCreate(ctx,obj)
那么对应pod中就在D:\Workspace\Go\src\k8s.io\kubernetes\pkg\registry\core\pod\strategy.go
// PrepareForCreate clears fields that are not allowed to be set by end users on creation.
func (podStrategy) PrepareForCreate(ctx context.Context, obj runtime.Object) {
pod := obj.(*api.Pod)
pod.Status = api.PodStatus{
Phase: api.PodPending,
QOSClass: qos.GetPodQOS(pod),
}
podutil.DropDisabledPodFields(pod, nil)
applySeccompVersionSkew(pod)
}
先把pod状态设置为pending
pod.Status = api.PodStatus{
Phase: api.PodPending,
去掉一些字段
podutil.DropDisabledPodFields(pod, nil)
applySeccompVersionSkew(pod)
。通过 GetPodQOS获取pod的 qos
简介
QoS(Quality of Service) 即服务质量
Qos 是一种控制机制,它提供了针对不同用户或者不同数据流采用相应不同的优先级
或者是根据应用程序的要求,保证数据流的性能达到一定的水准。kubernetes 中有三种 Qos,分别为:
。 Guaranteed:pod 的 requests 与 limits 设定的值相等;
。 Burstable:pod 的 requests 小于 limits 的值且不为0;
。 BestEffort:pod 的 requests 与 limits 均未设置;
三者的优先级如下所示,依次递增:
BestEffort -> Burstable -> Guaranteed
不同 Qos 的本质区别
一是调度时,调度器只会根据requests值进行调度;
二是当系统OOM时,对不同OOMScore的进程处理不同:首先会kill掉 BestEffort pod 的进程,若系统依然处于OOM状态,然后才会kill掉 Burstable pod,最后是Guaranteed pod;
资源的requests和limits
我们知道在k8s中为了达到容器资源限制的目的,在yaml文件中有cpu和内存的 requests和limits配置
对这两个参数可以简单理解为根据requests进行调度,根据limits进行运行限制。
举例下面的配置代表cpu 申请100m,限制1000m。内存申请100Mi,限制2500Mi
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 1000m
memory: 2500Mi
。API优先级和公平性
https://kubernetes.io/zh/docs/concepts/cluster-administration/flow-control/
代码解读
位置D:\Workspace\Go\src\k8s.io\kubernetes\pkg\apis\core\helper\qos\qos.go
首先遍历pod中的容器处理 resource.request
for _, container := range allContainers {
// process requests
for name, quantity := range container.Resources.Requests {
if !isSupportedQoSComputeResource(name) {
continue
}
if quantity.Cmp(zeroQuantity) == 1 {
delta := quantity.DeepCopy()
if _, exists := requests[name]; !exists {
requests[name] = delta
} else {
delta.Add(requests[name])
requests[name] = delta
}
}
}
// process limits
qosLimitsFound := sets.NewString()
for name, quantity := range container.Resources.Limits {
if !isSupportedQoSComputeResource(name) {
continue
}
if quantity.Cmp(zeroQuantity) == 1 {
qosLimitsFound.Insert(string(name))
delta := quantity.DeepCopy()
if _, exists := limits[name]; !exists {
limits[name] = delta
} else {
delta.Add(limits[name])
limits[name] = delta
}
}
}
if !qosLimitsFound.HasAll(string(core.ResourceMemory), string(core.ResourceCPU)) {
isGuaranteed = false
}
}
然后遍历处理 resource.limit
// process limits
qosLimitsFound := sets.NewString()
for name, quantity := range container.Resources.Limits {
if !isSupportedQoSComputeResource(name) {
continue
}
if quantity.Cmp(zeroQuantity) == 1 {
qosLimitsFound.Insert(string(name))
delta := quantity.DeepCopy()
if _, exists := limits[name]; !exists {
limits[name] = delta
} else {
delta.Add(limits[name])
limits[name] = delta
}
}
}
判定规则
如果limit和request都没设置就是 BestEffort
if len(requests) == 0 && len(limits) == 0 {
return core.PodQOSBestEffort
}
如果limit和request相等就是 Guaranteed
if isGuaranteed &&
len(requests) == len(limits) {
return core.PodQOSGuaranteed
}
。否则就是Burstable
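按照上面三条判定规则,可以手写一个简化版的 QoS 判定草图帮助理解(只看 cpu/memory,属于演示性质,不是 k8s 源码里的 GetPodQOS):
package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// 简化版判定:BestEffort / Guaranteed / Burstable
func demoQOS(pod *corev1.Pod) string {
    requests := corev1.ResourceList{}
    limits := corev1.ResourceList{}
    guaranteed := true
    for _, c := range pod.Spec.Containers {
        for name, q := range c.Resources.Requests {
            if name == corev1.ResourceCPU || name == corev1.ResourceMemory {
                requests[name] = q
            }
        }
        limitsFound := 0
        for name, q := range c.Resources.Limits {
            if name == corev1.ResourceCPU || name == corev1.ResourceMemory {
                limits[name] = q
                limitsFound++
            }
        }
        // cpu 和 memory 必须都设置 limit 才可能是 Guaranteed
        if limitsFound != 2 {
            guaranteed = false
        }
    }
    switch {
    case len(requests) == 0 && len(limits) == 0:
        return "BestEffort"
    case guaranteed && len(requests) == len(limits):
        return "Guaranteed"
    default:
        return "Burstable"
    }
}

func main() {
    pod := &corev1.Pod{Spec: corev1.PodSpec{Containers: []corev1.Container{{
        Name: "app",
        Resources: corev1.ResourceRequirements{
            Requests: corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("100m")},
            Limits:   corev1.ResourceList{corev1.ResourceCPU: resource.MustParse("1")},
        },
    }}}}
    fmt.Println(demoQOS(pod)) // Burstable:只限制了 cpu,没有设置 memory limit
}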
回到genericregistry.Store的Create,准备工作完成后会调用底层Storage的Create真正写入:
if err := e.Storage.Create(ctx, key, obj, out, ttl, dryrun.IsDryRun(options.DryRun)); err != nil {
Storage调用的是 DryRunnableStorage的create
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\registry\generic\registry\dryrun.go
func (s *DryRunnableStorage) Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64, dryRun bool) error {
if dryRun {
if err := s.Storage.Get(ctx, key, storage.GetOptions{}, out); err == nil {
return storage.NewKeyExistsError(key, 0)
}
return s.copyInto(obj, out)
}
return s.Storage.Create(ctx, key, obj, out, ttl)
}
如果是dryRun就是空跑,不存储在etcd中,只是将资源的结果返回
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\storage\etcd3\store.go
// Create implements storage.Interface.Create.
func (s *store) Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64) error {
if version, err := s.versioner.ObjectResourceVersion(obj); err == nil && version != 0 {
return errors.New("resourceVersion should not be set on objects to be created")
}
if err := s.versioner.PrepareObjectForStorage(obj); err != nil {
return fmt.Errorf("PrepareObjectForStorage failed: %v", err)
}
data, err := runtime.Encode(s.codec, obj)
if err != nil {
return err
}
key = path.Join(s.pathPrefix, key)
opts, err := s.ttlOpts(ctx, int64(ttl))
if err != nil {
return err
}
newData, err := s.transformer.TransformToStorage(ctx, data, authenticatedDataString(key))
if err != nil {
return storage.NewInternalError(err.Error())
}
startTime := time.Now()
txnResp, err := s.client.KV.Txn(ctx).If(
notFound(key),
).Then(
clientv3.OpPut(key, string(newData), opts...),
).Commit()
metrics.RecordEtcdRequestLatency("create", getTypeName(obj), startTime)
if err != nil {
return err
}
if !txnResp.Succeeded {
return storage.NewKeyExistsError(key, 0)
}
if out != nil {
putResp := txnResp.Responses[0].GetResponsePut()
return decode(s.codec, s.versioner, data, out, putResp.Header.Revision)
}
return nil
}
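如果想脱离 apiserver 直观感受这个“key 不存在才写入”的事务,可以用原生 etcd clientv3 做个小实验(演示草图,endpoint、key、value 都是假设值):
package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"127.0.0.1:2379"}, // 假设本地有一个 etcd
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        panic(err)
    }
    defer cli.Close()

    key := "/registry/pods/default/nginx-demo" // k8s 默认的 pathPrefix 是 /registry
    resp, err := cli.Txn(context.TODO()).
        If(clientv3.Compare(clientv3.ModRevision(key), "=", 0)). // 等价于源码里的 notFound(key)
        Then(clientv3.OpPut(key, "raw-pod-bytes")).
        Commit()
    if err != nil {
        panic(err)
    }
    // Succeeded=false 说明 key 已存在,对应源码里返回 NewKeyExistsError
    fmt.Println("created:", resp.Succeeded)
}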
。如果有AfterCreate和Decorator就调用
if e.AfterCreate != nil {
e.AfterCreate(out, options)
}
if e.Decorator != nil {
e.Decorator(out)
}
。kube-apiserver createPod数据时的保存
。架构图
为了防止突发流量影响apiserver可用性,k8s支持多种限流配置,包括:
- MaxInFlightLimit,server级别整体限流
- Client限流
- EventRateLimit,限制event
- APF,更细粒度的限流配置
apiserver默认可设置最大并发量(集群级别,区分只读与修改操作)
通过参数--max-requests-inflight代表只读请求
--max-mutating-requests-inflight代表修改请求
。可以简单实现限流
。入口是GenericAPIServer.New中添加的hook
位置D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\config.go
// Add PostStartHooks for maintaining the watermarks for the Priority-and-Fairness and the Max-in-Flight filters.
if c.FlowControl != nil {
const priorityAndFairnessFilterHookName = "priority-and-fairness-filter"
if !s.isPostStartHookRegistered(priorityAndFairnessFilterHookName) {
err := s.AddPostStartHook(priorityAndFairnessFilterHookName, func(context PostStartHookContext) error {
genericfilters.StartPriorityAndFairnessWatermarkMaintenance(context.StopCh)
return nil
})
if err != nil {
return nil, err
}
}
} else {
const maxInFlightFilterHookName = "max-in-flight-filter"
if !s.isPostStartHookRegistered(maxInFlightFilterHookName) {
err := s.AddPostStartHook(maxInFlightFilterHookName, func(context PostStartHookContext) error {
genericfilters.StartMaxInFlightWatermarkMaintenance(context.StopCh)
return nil
})
if err != nil {
return nil, err
}
}
}
意思是FlowControl为nil,代表未启用APF,API服务器中的整体并发量将受到kube-apiserver的参数--max-requests-inflight和--max-mutating-requests-inflight的限制
启动metrics观测的函数
// startWatermarkMaintenance starts the goroutines to observe and maintain the specified watermark.
func startWatermarkMaintenance(watermark *requestWatermark, stopCh <-chan struct{}) {
// Periodically update the inflight usage metric.
go wait.Until(func() {
watermark.lock.Lock()
readOnlyWatermark := watermark.readOnlyWatermark
mutatingWatermark := watermark.mutatingWatermark
watermark.readOnlyWatermark = 0
watermark.mutatingWatermark = 0
watermark.lock.Unlock()
metrics.UpdateInflightRequestMetrics(watermark.phase, readOnlyWatermark, mutatingWatermark)
}, inflightUsageMetricUpdatePeriod, stopCh)
// Periodically observe the watermarks. This is done to ensure that they do not fall too far behind. When they do
// fall too far behind, then there is a long delay in responding to the next request received while the observer
// catches back up.
go wait.Until(func() {
watermark.readOnlyObserver.Add(0)
watermark.mutatingObserver.Add(0)
}, observationMaintenancePeriod, stopCh)
}
调用的入口在 D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apiserver\pkg\server\config.go
if c.FlowControl != nil {
requestWorkEstimator := flowcontrolrequest.NewWorkEstimator(c.StorageObjectCountTracker.Get, c.FlowControl.GetInterestedWatchCount)
handler = filterlatency.TrackCompleted(handler)
handler = genericfilters.WithPriorityAndFairness(handler, c.LongRunningFunc, c.FlowControl, requestWorkEstimator)
handler = filterlatency.TrackStarted(handler, "priorityandfairness")
} else {
handler = genericfilters.WithMaxInFlightLimit(handler, c.MaxRequestsInFlight, c.MaxMutatingRequestsInFlight, c.LongRunningFunc)
}
如果两个limit(nonMutatingLimit和mutatingLimit)都为0就不开启限流了
if nonMutatingLimit == 0 && mutatingLimit == 0 {
return handler
}
构造限流的chan,是容量为limit的带缓冲bool chan
var nonMutatingChan chan bool
var mutatingChan chan bool
if nonMutatingLimit != 0 {
nonMutatingChan = make(chan bool, nonMutatingLimit)
watermark.readOnlyObserver.SetDenominator(float64(nonMutatingLimit))
}
if mutatingLimit != 0 {
mutatingChan = make(chan bool, mutatingLimit)
watermark.mutatingObserver.SetDenominator(float64(mutatingLimit))
}
检查是否是长时间运行的请求
// Skip tracking long running events.
if longRunningRequestCheck != nil && longRunningRequestCheck(r, requestInfo) {
handler.ServeHTTP(w, r)
return
}
使用BasicLongRunningRequestCheck检查是否是watch或者pprof debug等长时间运行的请求,因为这些请求不受限制,位置D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\apiserver\pkg\server\filters\longrunning.go
// BasicLongRunningRequestCheck returns true if the given request has one of the specified verbs or one of the specified subresources, or is a profiler request.
func BasicLongRunningRequestCheck(longRunningVerbs, longRunningSubresources sets.String) apirequest.LongRunningRequestCheck {
return func(r *http.Request, requestInfo *apirequest.RequestInfo) bool {
if longRunningVerbs.Has(requestInfo.Verb) {
return true
}
if requestInfo.IsResourceRequest && longRunningSubresources.Has(requestInfo.Subresource) {
return true
}
if !requestInfo.IsResourceRequest && strings.HasPrefix(requestInfo.Path, "/debug/pprof/") {
return true
}
return false
}
}
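kube-apiserver 在构建 GenericConfig 时大致会这样构造这个检查函数(下面的 verbs/子资源列表按常见默认值给出,仅作示意,不保证与当前版本完全一致):
import (
    "k8s.io/apimachinery/pkg/util/sets"
    apirequest "k8s.io/apiserver/pkg/endpoints/request"
    genericfilters "k8s.io/apiserver/pkg/server/filters"
)

func newLongRunningCheck() apirequest.LongRunningRequestCheck {
    // watch/proxy 动词,以及 exec/log/portforward 等子资源都会被视为长连接请求,不参与 max-in-flight 限流
    return genericfilters.BasicLongRunningRequestCheck(
        sets.NewString("watch", "proxy"),
        sets.NewString("attach", "exec", "proxy", "log", "portforward"),
    )
}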
检查是只读操作还是修改操作,决定使用哪个chan限制
var c chan bool
isMutatingRequest := !nonMutatingRequestVerbs.Has(requestInfo.Verb)
if isMutatingRequest {
c = mutatingChan
} else {
c = nonMutatingChan
}
如果队列未满,有空的位置,则更新下排队数字
使用select向c中写入true,如果能写入说明队列未满
记录下对应的指标
select {
case c <- true:
// We note the concurrency level both while the
// request is being served and after it is done being
// served, because both states contribute to the
// sampled stats on concurrency.
if isMutatingRequest {
watermark.recordMutating(len(c))
} else {
watermark.recordReadOnly(len(c))
}
defer func() {
<-c
if isMutatingRequest {
watermark.recordMutating(len(c))
} else {
watermark.recordReadOnly(len(c))
}
}()
handler.ServeHTTP(w, r)
default:
// at this point we're about to return a 429, BUT not all actors should be rate limited. A system:master is so powerful
// that they should always get an answer. It's a super-admin or a loopback connection.
if currUser, ok := apirequest.UserFrom(ctx); ok {
for _, group := range currUser.GetGroups() {
if group == user.SystemPrivilegedGroup {
handler.ServeHTTP(w, r)
return
}
}
}
// We need to split this data between buckets used for throttling.
metrics.RecordDroppedRequest(r, requestInfo, metrics.APIServerComponent, isMutatingRequest)
metrics.RecordRequestTermination(r, requestInfo, metrics.APIServerComponent, http.StatusTooManyRequests)
tooManyRequests(r, w)
}
default代表队列已满,但是如果请求的group中含有 system:masters,则放行。
因为apiserver认为这个组是很重要的请求,不能被限流
if currUser, ok := apirequest.UserFrom(ctx); ok {
for _, group := range currUser.GetGroups() {
if group == user.SystemPrivilegedGroup {
handler.ServeHTTP(w, r)
return
}
}
}
group=system:masters 对应的clusterRole 为cluster-admin
队列已满,如果请求的group中没有 system:masters,则返回http 429错误
·http 429代表当前请求太多,请稍后重试
并设置response的header Retry-After = 1
// We need to split this data between buckets used for throttling.
metrics.RecordDroppedRequest(r, requestInfo, metrics.APIServerComponent, isMutatingRequest)
metrics.RecordRequestTermination(r, requestInfo, metrics.APIServerComponent, http.StatusTooManyRequests)
tooManyRequests(r, w)
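上面整段逻辑的核心思路可以抽象成一个极简的中间件草图(非 k8s 源码):用容量为 limit 的带缓冲 channel 占坑,占不到就返回 429 并带上 Retry-After。
package main

import "net/http"

func withMaxInFlight(limit int, next http.Handler) http.Handler {
    slots := make(chan struct{}, limit) // 容量即最大并发数
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        select {
        case slots <- struct{}{}: // 占到一个坑
            defer func() { <-slots }() // 请求结束释放
            next.ServeHTTP(w, r)
        default: // 坑已满,直接拒绝
            w.Header().Set("Retry-After", "1")
            http.Error(w, "too many requests", http.StatusTooManyRequests)
        }
    })
}
真实实现里还区分只读/修改两个 channel,并对 system:masters 放行,但核心就是这个 select。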
例如client-go默认的qps为5,但是只支持客户端限流,只能由各个发起端限制
集群管理员无法控制用户行为。
EventRateLimit在1.13之后支持,只限制event请求
集成在apiserver内部webhoook中
可配置某个用户、namespace、server等event操作限制,通过webhook形式实现。
和文档一起学习
https://kubernetes.io/zh/docs/reference/access-authn-authz/admission-controllers/#eventratelimit
原理
具体原理可以参考提案,每个eventratelimit 配置使用一个单独的令牌桶限速器
每次event操作,遍历每个匹配的限速器检查是否能获取令牌,如果可以允许请求,否则返回429.
优点
实现简单,允许一定量的并发
可支持server/namespace/user等级别的限流
缺点
仅支持event,通过webhook实现只能拦截修改类请求
·所有namespace的限流相同,没有优先级
。API优先级和公平性(APF)是MaxInFlightLimit限流的一种替代方案,设计文档见提案
·API优先级和公平性(1.15以上,alpha版本),以更细粒度(byUser,byNamespace)对请求进行分类和隔离。支持突发流量,通过使用公平排队技术从队列中分发请求从而避免饥饿。
APF限流通过两种资源实现:PriorityLevelConfiguration定义隔离类型和可处理的并发预算量,还可以调整排队行为;FlowSchema用于对每个入站请求进行分类,并与一个PriorityLevelConfiguration相匹配。
可对用户或用户组或全局进行某些资源某些请求的限制,如限制default namespace写services put/patch请求。
优点
考虑情况较全面,支持优先级,白名单等
。可支持server/namespace/user/resource等细粒度级别的限流
缺点
配置复杂,不直观,需要对APF原理深入了解
。功能较新,缺少生产环境验证
文档地址
https://kubernetes.io/zh/docs/concepts/cluster-administration/flow-control/
apiserver 启动三个server
三个APIServer底层均依赖通用的GenericAPIServer,使用go-restful对外提供RESTful风格的API服务
kube-apiserver对请求进行Authentication、Authorization和Admission三层验证
完成验证后,请求会根据路由规则,触发到对应资源的handler,主要包括数据的预处理和保存
kube-apiserver的底层存储为etcd v3,它被抽象为一种RESTStorage,使请求和存储操作一一对应
。apiExtensionsServer API扩展服务,主要针对CRD
。kubeAPIServer API核心服务,包括常见的Pod/Deployment/Service
。aggregatorServer API聚合服务,主要针对metrics
自定义准入控制器实现注入nginx-sidecar容器
k8s系统拥有众多资源,每一种资源就是一个资源类型
这些资源类型需要有统一的注册、存储、查询、管理等机制
目前k8s系统中的所有资源类型都已注册到Scheme资源注册表中,其是一个内存型的资源注册表,拥有如下特点:
。支持注册多种资源类型,包括内部版本和外部版本。
。支持多种版本转换机制。
。支持不同资源的序列化/反序列化机制。
- 了解kube-scheduler的启动流程
- 了解clientset 的使用方法
D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\scheduler.go
command := app.NewSchedulerCommand()
code := cli.Run(command)
os.Exit(code)
D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go
// runCommand runs the scheduler.
func runCommand(cmd *cobra.Command, opts *options.Options, registryOptions ...Option) error {
verflag.PrintAndExitIfRequested()
// Activate logging as soon as possible, after that
// show flags with the final logging configuration.
if err := opts.Logs.ValidateAndApply(utilfeature.DefaultFeatureGate); err != nil {
fmt.Fprintf(os.Stderr, "%v\n", err)
os.Exit(1)
}
cliflag.PrintFlags(cmd.Flags())
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
go func() {
stopCh := server.SetupSignalHandler()
<-stopCh
cancel()
}()
cc, sched, err := Setup(ctx, opts, registryOptions...)
if err != nil {
return err
}
return Run(ctx, cc, sched)
}
Apply应用配置
c := &schedulerappconfig.Config{}
if err := o.ApplyTo(c); err != nil {
return nil, err
}
使用 --kubeconfig传入的配置初始化kube config
// Prepare kube config.
kubeConfig, err := createKubeConfig(c.ComponentConfig.ClientConnection, o.Master)
if err != nil {
return nil, err
}
使用kube-config 创建kube-clients 返回的是client-set对象
// Prepare kube clients.
client, eventClient, err := createClients(kubeConfig)
if err != nil {
return nil, err
}
。Clientset中包含一批rest.Interface的对象,如下
cs.admissionregistrationV1, err = admissionregistrationv1.NewForConfigAndClient(&configShallowCopy, httpClient)
if err != nil {
return nil, err
}
cs.admissionregistrationV1beta1, err = admissionregistrationv1beta1.NewForConfigAndClient(&configShallowCopy, httpClient)
if err != nil {
return nil, err
}
cs.internalV1alpha1, err = internalv1alpha1.NewForConfigAndClient(&configShallowCopy, httpClient)
if err != nil {
return nil, err
}
cs.appsV1, err = appsv1.NewForConfigAndClient(&configShallowCopy, httpClient)
if err != nil {
return nil, err
}
The Clientset that is finally returned:
// Clientset contains the clients for groups. Each group has exactly one
// version included in a Clientset.
type Clientset struct {
*discovery.DiscoveryClient
admissionregistrationV1 *admissionregistrationv1.AdmissionregistrationV1Client
admissionregistrationV1beta1 *admissionregistrationv1beta1.AdmissionregistrationV1beta1Client
internalV1alpha1 *internalv1alpha1.InternalV1alpha1Client
appsV1 *appsv1.AppsV1Client
appsV1beta1 *appsv1beta1.AppsV1beta1Client
appsV1beta2 *appsv1beta2.AppsV1beta2Client
authenticationV1 *authenticationv1.AuthenticationV1Client
authenticationV1beta1 *authenticationv1beta1.AuthenticationV1beta1Client
authorizationV1 *authorizationv1.AuthorizationV1Client
authorizationV1beta1 *authorizationv1beta1.AuthorizationV1beta1Client
autoscalingV1 *autoscalingv1.AutoscalingV1Client
autoscalingV2 *autoscalingv2.AutoscalingV2Client
autoscalingV2beta1 *autoscalingv2beta1.AutoscalingV2beta1Client
autoscalingV2beta2 *autoscalingv2beta2.AutoscalingV2beta2Client
batchV1 *batchv1.BatchV1Client
batchV1beta1 *batchv1beta1.BatchV1beta1Client
certificatesV1 *certificatesv1.CertificatesV1Client
certificatesV1beta1 *certificatesv1beta1.CertificatesV1beta1Client
coordinationV1beta1 *coordinationv1beta1.CoordinationV1beta1Client
coordinationV1 *coordinationv1.CoordinationV1Client
coreV1 *corev1.CoreV1Client
discoveryV1 *discoveryv1.DiscoveryV1Client
discoveryV1beta1 *discoveryv1beta1.DiscoveryV1beta1Client
eventsV1 *eventsv1.EventsV1Client
eventsV1beta1 *eventsv1beta1.EventsV1beta1Client
extensionsV1beta1 *extensionsv1beta1.ExtensionsV1beta1Client
flowcontrolV1alpha1 *flowcontrolv1alpha1.FlowcontrolV1alpha1Client
flowcontrolV1beta1 *flowcontrolv1beta1.FlowcontrolV1beta1Client
flowcontrolV1beta2 *flowcontrolv1beta2.FlowcontrolV1beta2Client
networkingV1 *networkingv1.NetworkingV1Client
networkingV1beta1 *networkingv1beta1.NetworkingV1beta1Client
nodeV1 *nodev1.NodeV1Client
nodeV1alpha1 *nodev1alpha1.NodeV1alpha1Client
nodeV1beta1 *nodev1beta1.NodeV1beta1Client
policyV1 *policyv1.PolicyV1Client
policyV1beta1 *policyv1beta1.PolicyV1beta1Client
rbacV1 *rbacv1.RbacV1Client
rbacV1beta1 *rbacv1beta1.RbacV1beta1Client
rbacV1alpha1 *rbacv1alpha1.RbacV1alpha1Client
schedulingV1alpha1 *schedulingv1alpha1.SchedulingV1alpha1Client
schedulingV1beta1 *schedulingv1beta1.SchedulingV1beta1Client
schedulingV1 *schedulingv1.SchedulingV1Client
storageV1beta1 *storagev1beta1.StorageV1beta1Client
storageV1 *storagev1.StorageV1Client
storageV1alpha1 *storagev1alpha1.StorageV1alpha1Client
}
Using the clientSet
Later code can use it whenever it lists objects, for example fetching nodes as in the earlier ink8s-pod-metrics example:
nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
Fetching pods:
pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{})
Both the node and pod calls above go through CoreV1() on the clientset.
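Putting it together, a minimal runnable sketch of that usage (the kubeconfig path is just an example):

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // List nodes through the CoreV1 client inside the Clientset.
    nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, n := range nodes.Items {
        fmt.Println("node:", n.Name)
    }

    // List pods in the kube-system namespace the same way.
    pods, err := clientset.CoreV1().Pods("kube-system").List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, p := range pods.Items {
        fmt.Println("pod:", p.Namespace+"/"+p.Name)
    }
}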
By default the scheduler starts with --leader-elect=true, meaning it performs leader election before running the main loop, so that it can be deployed in a highly available way.
// Set up leader election if enabled.
var leaderElectionConfig *leaderelection.LeaderElectionConfig
if c.ComponentConfig.LeaderElection.LeaderElect {
// Use the scheduler name in the first profile to record leader election.
schedulerName := corev1.DefaultSchedulerName
if len(c.ComponentConfig.Profiles) != 0 {
schedulerName = c.ComponentConfig.Profiles[0].SchedulerName
}
coreRecorder := c.EventBroadcaster.DeprecatedNewLegacyRecorder(schedulerName)
leaderElectionConfig, err = makeLeaderElectionConfig(c.ComponentConfig.LeaderElection, kubeConfig, coreRecorder)
if err != nil {
return nil, err
}
}
c.InformerFactory = scheduler.NewInformerFactory(client, 0)
// Create the scheduler.
sched, err := scheduler.New(cc.Client,
cc.InformerFactory,
cc.DynInformerFactory,
recorderFactory,
ctx.Done(),
scheduler.WithComponentConfigVersion(cc.ComponentConfig.TypeMeta.APIVersion),
scheduler.WithKubeConfig(cc.KubeConfig),
scheduler.WithProfiles(cc.ComponentConfig.Profiles...),
scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
scheduler.WithFrameworkOutOfTreeRegistry(outOfTreeRegistry),
scheduler.WithPodMaxBackoffSeconds(cc.ComponentConfig.PodMaxBackoffSeconds),
scheduler.WithPodInitialBackoffSeconds(cc.ComponentConfig.PodInitialBackoffSeconds),
scheduler.WithPodMaxInUnschedulablePodsDuration(cc.PodMaxInUnschedulablePodsDuration),
scheduler.WithExtenders(cc.ComponentConfig.Extenders...),
scheduler.WithParallelism(cc.ComponentConfig.Parallelism),
scheduler.WithBuildFrameworkCapturer(func(profile kubeschedulerconfig.KubeSchedulerProfile) {
// Profiles are processed during Framework instantiation to set default plugins and configurations. Capturing them for logging
completedProfiles = append(completedProfiles, profile)
}),
)
The component config is saved into a global map.
It can be retrieved over HTTPS at the /configz path; the code is as follows:
// Configz registration.
if cz, err := configz.New("componentconfig"); err == nil {
cz.Set(cc.ComponentConfig)
} else {
return fmt.Errorf("unable to register configz: %s", err)
}
Fetching /configz with curl
Modify the clusterrole used by the prometheus service account from earlier,
adding "/configz" to its nonResourceURLs:
vim rbac.yaml
kubectl apply -f rbac.yaml
kubectl get sa prometheus -n kube-system
After applying it, grab the service account token with curl and use it to access /configz.
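The same check can be done from Go; a rough sketch, where the scheduler address/port (10259 on a typical kubeadm setup) and the in-pod token path are assumptions:

package main

import (
    "crypto/tls"
    "fmt"
    "io"
    "net/http"
    "os"
)

func main() {
    // Token of a service account allowed to GET the /configz nonResourceURL
    // (path assumes this runs inside a pod with that service account mounted).
    token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
    if err != nil {
        panic(err)
    }
    // kube-scheduler's secure port is 10259 on a default kubeadm install (assumption).
    req, _ := http.NewRequest("GET", "https://127.0.0.1:10259/configz", nil)
    req.Header.Set("Authorization", "Bearer "+string(token))

    // Skip certificate verification only for this local debugging call.
    client := &http.Client{Transport: &http.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
    }}
    resp, err := client.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(resp.Status)
    fmt.Println(string(body))
}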
Event is a core resource in k8s; a later section covers it in detail.
// Prepare the event broadcaster.
cc.EventBroadcaster.StartRecordingToSink(ctx.Done())
// Setup healthz checks.
var checks []healthz.HealthChecker
if cc.ComponentConfig.LeaderElection.LeaderElect {
checks = append(checks, cc.LeaderElection.WatchDog)
}
waitingForLeader is the channel that signals the outcome of leader election. It is closed in two places:
- below, when this instance wins the election;
- when leader election is disabled.
isLeader is a func that reports whether the current node is the leader: if waitingForLeader has been closed, this node becomes the leader.
waitingForLeader := make(chan struct{})
isLeader := func() bool {
select {
case _, ok := <-waitingForLeader:
// if channel is closed, we are leading
return !ok
default:
// channel is open, we are waiting for a leader
return false
}
}
If the instance is not the leader, the /metrics/resources handler returns immediately, i.e. non-leader nodes do not export these metrics.
func installMetricHandler(pathRecorderMux *mux.PathRecorderMux, informers informers.SharedInformerFactory, isLeader func() bool) {
configz.InstallHandler(pathRecorderMux)
pathRecorderMux.Handle("/metrics", legacyregistry.HandlerWithReset())
resourceMetricsHandler := resources.Handler(informers.Core().V1().Pods().Lister())
pathRecorderMux.HandleFunc("/metrics/resources", func(w http.ResponseWriter, req *http.Request) {
if !isLeader() {
return
}
resourceMetricsHandler.ServeHTTP(w, req)
})
}
The handler chain adds authorization, authentication, request-info, cache-control, logging and panic-recovery handlers in turn:
// buildHandlerChain wraps the given handler with the standard filters.
func buildHandlerChain(handler http.Handler, authn authenticator.Request, authz authorizer.Authorizer) http.Handler {
requestInfoResolver := &apirequest.RequestInfoFactory{}
failedHandler := genericapifilters.Unauthorized(scheme.Codecs)
handler = genericapifilters.WithAuthorization(handler, authz, scheme.Codecs)
handler = genericapifilters.WithAuthentication(handler, authn, failedHandler, nil)
handler = genericapifilters.WithRequestInfo(handler, requestInfoResolver)
handler = genericapifilters.WithCacheControl(handler)
handler = genericfilters.WithHTTPLogging(handler)
handler = genericfilters.WithPanicRecovery(handler, requestInfoResolver)
return handler
}
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\factory.go
// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
f.lock.Lock()
defer f.lock.Unlock()
for informerType, informer := range f.informers {
if !f.startedInformers[informerType] {
go informer.Run(stopCh)
f.startedInformers[informerType] = true
}
}
}
// Wait for all caches to sync before scheduling.
cc.InformerFactory.WaitForCacheSync(ctx.Done())
Location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\factory.go
// WaitForCacheSync waits for all started informers' cache were synced.
func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool {
informers := func() map[reflect.Type]cache.SharedIndexInformer {
f.lock.Lock()
defer f.lock.Unlock()
informers := map[reflect.Type]cache.SharedIndexInformer{}
for informerType, informer := range f.informers {
if f.startedInformers[informerType] {
informers[informerType] = informer
}
}
return informers
}()
res := map[reflect.Type]bool{}
for informType, informer := range informers {
res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)
}
return res
}
If this instance is elected leader, sched.Run is executed.
// If leader election is enabled, runCommand via LeaderElector until done and exit.
if cc.LeaderElection != nil {
cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
close(waitingForLeader)
sched.Run(ctx)
},
OnStoppedLeading: func() {
select {
case <-ctx.Done():
// We were asked to terminate. Exit 0.
klog.InfoS("Requested to terminate, exiting")
os.Exit(0)
default:
// We lost the lock.
klog.ErrorS(nil, "Leaderelection lost")
klog.FlushAndExit(klog.ExitFlushTimeout, 1)
}
},
}
leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
if err != nil {
return fmt.Errorf("couldn't create leader elector: %v", err)
}
leaderElector.Run(ctx)
return fmt.Errorf("lost lease")
}
The k8s leader election lock mechanism
- leaderelection implements a distributed lock on top of the atomicity of k8s API operations, and leaders are elected through continuous competition for that lock.
- Only the process elected as leader runs the actual business logic; this pattern is very common in k8s.
Why elect a leader
- In Kubernetes, kube-scheduler and kube-controller-manager are usually deployed with multiple replicas for high availability.
- But only one instance is actually doing the work at any time.
- The leaderelection mechanism guarantees that the leader is the working instance.
- When the leader dies, a new leader is elected from the other replicas so the component keeps functioning.
Location: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go
if c.ComponentConfig.LeaderElection.LeaderElect {
// Use the scheduler name in the first profile to record leader election.
schedulerName := corev1.DefaultSchedulerName
if len(c.ComponentConfig.Profiles) != 0 {
schedulerName = c.ComponentConfig.Profiles[0].SchedulerName
}
coreRecorder := c.EventBroadcaster.DeprecatedNewLegacyRecorder(schedulerName)
leaderElectionConfig, err = makeLeaderElectionConfig(c.ComponentConfig.LeaderElection, kubeConfig, coreRecorder)
if err != nil {
return nil, err
}
}
makeLeaderElectionConfig builds the configuration used to compete for the lock
Location: D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go
- The identity is hostname + uuid.
- The lock is a resourcelock.
- The default resourcelock configuration can be read from /configz.
The code is as follows:
// makeLeaderElectionConfig builds a leader election configuration. It will
// create a new resource lock associated with the configuration.
func makeLeaderElectionConfig(config componentbaseconfig.LeaderElectionConfiguration, kubeConfig *restclient.Config, recorder record.EventRecorder) (*leaderelection.LeaderElectionConfig, error) {
hostname, err := os.Hostname()
if err != nil {
return nil, fmt.Errorf("unable to get hostname: %v", err)
}
// add a uniquifier so that two processes on the same host don't accidentally both become active
id := hostname + "_" + string(uuid.NewUUID())
rl, err := resourcelock.NewFromKubeconfig(config.ResourceLock,
config.ResourceNamespace,
config.ResourceName,
resourcelock.ResourceLockConfig{
Identity: id,
EventRecorder: recorder,
},
kubeConfig,
config.RenewDeadline.Duration)
if err != nil {
return nil, fmt.Errorf("couldn't create resource lock: %v", err)
}
return &leaderelection.LeaderElectionConfig{
Lock: rl,
LeaseDuration: config.LeaseDuration.Duration,
RenewDeadline: config.RenewDeadline.Duration,
RetryPeriod: config.RetryPeriod.Duration,
WatchDog: leaderelection.NewLeaderHealthzAdaptor(time.Second * 20),
Name: "kube-scheduler",
ReleaseOnCancel: true,
}, nil
}
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\leaderelection\resourcelock\interface.go
// Manufacture will create a lock of a given type according to the input parameters
func New(lockType string, ns string, name string, coreClient corev1.CoreV1Interface, coordinationClient coordinationv1.CoordinationV1Interface, rlc ResourceLockConfig) (Interface, error) {
endpointsLock := &endpointsLock{
EndpointsMeta: metav1.ObjectMeta{
Namespace: ns,
Name: name,
},
Client: coreClient,
LockConfig: rlc,
}
configmapLock := &configMapLock{
ConfigMapMeta: metav1.ObjectMeta{
Namespace: ns,
Name: name,
},
Client: coreClient,
LockConfig: rlc,
}
leaseLock := &LeaseLock{
LeaseMeta: metav1.ObjectMeta{
Namespace: ns,
Name: name,
},
Client: coordinationClient,
LockConfig: rlc,
}
switch lockType {
case endpointsResourceLock:
return nil, fmt.Errorf("endpoints lock is removed, migrate to %s", EndpointsLeasesResourceLock)
case configMapsResourceLock:
return nil, fmt.Errorf("configmaps lock is removed, migrate to %s", ConfigMapsLeasesResourceLock)
case LeasesResourceLock:
return leaseLock, nil
case EndpointsLeasesResourceLock:
return &MultiLock{
Primary: endpointsLock,
Secondary: leaseLock,
}, nil
case ConfigMapsLeasesResourceLock:
return &MultiLock{
Primary: configmapLock,
Secondary: leaseLock,
}, nil
default:
return nil, fmt.Errorf("Invalid lock-type %s", lockType)
}
}
In the scheduler's Run function, located at D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go
// If leader election is enabled, runCommand via LeaderElector until done and exit.
if cc.LeaderElection != nil {
cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
close(waitingForLeader)
sched.Run(ctx)
},
OnStoppedLeading: func() {
select {
case <-ctx.Done():
// We were asked to terminate. Exit 0.
klog.InfoS("Requested to terminate, exiting")
os.Exit(0)
default:
// We lost the lock.
klog.ErrorS(nil, "Leaderelection lost")
klog.FlushAndExit(klog.ExitFlushTimeout, 1)
}
},
}
leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
if err != nil {
return fmt.Errorf("couldn't create leader elector: %v", err)
}
leaderElector.Run(ctx)
return fmt.Errorf("lost lease")
}
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\leaderelection\leaderelection.go
// Run starts the leader election loop. Run will not return
// before leader election loop is stopped by ctx or it has
// stopped holding the leader lease
func (le *LeaderElector) Run(ctx context.Context) {
defer runtime.HandleCrash()
defer func() {
le.config.Callbacks.OnStoppedLeading()
}()
if !le.acquire(ctx) {
return // ctx signalled done
}
ctx, cancel := context.WithCancel(ctx)
defer cancel()
go le.config.Callbacks.OnStartedLeading(ctx)
le.renew(ctx)
}
acquire polls tryAcquireOrRenew and returns true as soon as the lock is acquired;
it returns false if ctx signals shutdown.
// acquire loops calling tryAcquireOrRenew and returns true immediately when tryAcquireOrRenew succeeds.
// Returns false if ctx signals done.
func (le *LeaderElector) acquire(ctx context.Context) bool {
ctx, cancel := context.WithCancel(ctx)
defer cancel()
succeeded := false
desc := le.config.Lock.Describe()
klog.Infof("attempting to acquire leader lease %v...", desc)
wait.JitterUntil(func() {
succeeded = le.tryAcquireOrRenew(ctx)
le.maybeReportTransition()
if !succeeded {
klog.V(4).Infof("failed to acquire lease %v", desc)
return
}
le.config.Lock.RecordEvent("became leader")
le.metrics.leaderOn(le.config.Name)
klog.Infof("successfully acquired lease %v", desc)
cancel()
}, le.config.RetryPeriod, JitterFactor, true, ctx.Done())
return succeeded
}
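acquire is essentially a jittered polling loop around tryAcquireOrRenew. The standalone sketch below shows the same wait.JitterUntil pattern in isolation (the three-attempts logic is made up purely for illustration; 1.2 matches client-go's leader-election JitterFactor):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    stopCh := make(chan struct{})
    attempts := 0

    // Call the function roughly every second (with jitter, like RetryPeriod in acquire)
    // until stopCh is closed.
    wait.JitterUntil(func() {
        attempts++
        fmt.Println("trying to acquire the lock, attempt", attempts)
        if attempts == 3 {
            // Pretend the lock was acquired on the third try and stop polling,
            // the way acquire calls cancel() once tryAcquireOrRenew succeeds.
            close(stopCh)
        }
    }, time.Second, 1.2, true, stopCh)

    fmt.Println("acquired after", attempts, "attempts")
}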
First fetch the existing lock record (via the apiserver, backed by etcd).
If the error is IsNotFound, create the resource and take the lock.
// 1. obtain or create the ElectionRecord
oldLeaderElectionRecord, oldLeaderElectionRawRecord, err := le.config.Lock.Get(ctx)
if err != nil {
if !errors.IsNotFound(err) {
klog.Errorf("error retrieving resource lock %v: %v", le.config.Lock.Describe(), err)
return false
}
if err = le.config.Lock.Create(ctx, leaderElectionRecord); err != nil {
klog.Errorf("error initially creating leader election record: %v", err)
return false
}
le.setObservedRecord(&leaderElectionRecord)
return true
}
Compare the locally cached record with the remote lock object and refresh the cache if they differ.
// 2. Record obtained, check the Identity & Time
if !bytes.Equal(le.observedRawRecord, oldLeaderElectionRawRecord) {
le.setObservedRecord(oldLeaderElectionRecord)
le.observedRawRecord = oldLeaderElectionRawRecord
}
Check whether the held lock has expired and whether it is held by this instance.
if len(oldLeaderElectionRecord.HolderIdentity) > 0 &&
le.observedTime.Add(le.config.LeaseDuration).After(now.Time) &&
!le.IsLeader() {
klog.V(4).Infof("lock is held by %v and has not yet expired", oldLeaderElectionRecord.HolderIdentity)
return false
}
At this point we are the leader, but there are two cases:
- le.IsLeader() means we were already the leader last time, so nothing needs to change;
- otherwise we are becoming the leader for the first time, so LeaderTransitions is incremented by 1.
// 3. We're going to try to update. The leaderElectionRecord is set to it's default
// here. Let's correct it before updating.
if le.IsLeader() {
leaderElectionRecord.AcquireTime = oldLeaderElectionRecord.AcquireTime
leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions
} else {
leaderElectionRecord.LeaderTransitions = oldLeaderElectionRecord.LeaderTransitions + 1
}
Update the lock resource; if the record changed between the Get and the Update, this update fails.
// update the lock itself
if err = le.config.Lock.Update(ctx, leaderElectionRecord); err != nil {
klog.Errorf("Failed to update lock: %v", err)
return false
}
le.setObservedRecord(&leaderElectionRecord)
return true
What happens if the Update above races with a concurrent writer?
le.config.Lock.Get() returns the lock object, which carries a resourceVersion field identifying the internal version of the resource object; every update bumps its value.
If an update request carries a resourceVersion, the apiserver checks that the current resourceVersion matches the supplied value, ensuring that no other update happened within this update cycle and thus making the update atomic.
resourceVersion lives in ObjectMeta; location: D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\apimachinery\pkg\apis\meta\v1\types.go
// An opaque value that represents the internal version of this object that can
// be used by clients to determine when objects have changed. May be used for optimistic
// concurrency, change detection, and the watch operation on a resource or set of resources.
// Clients must treat these values as opaque and passed unmodified back to the server.
// They may only be valid for a particular resource or set of resources.
//
// Populated by the system.
// Read-only.
// Value must be treated as opaque by clients and .
// More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#concurrency-control-and-consistency
// +optional
ResourceVersion string `json:"resourceVersion,omitempty" protobuf:"bytes,6,opt,name=resourceVersion"`
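The usual client-side pattern built on resourceVersion is get-modify-update with a retry on 409 Conflict; a hedged sketch using client-go's retry helper (the kubeconfig path, namespace and deployment name are examples):

package main

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/util/retry"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // RetryOnConflict re-runs the closure when Update fails with a 409 Conflict,
    // which is what the apiserver returns if resourceVersion is stale.
    err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Get the latest object (and therefore the latest resourceVersion).
        d, err := clientset.AppsV1().Deployments("default").Get(context.TODO(), "nginx", metav1.GetOptions{})
        if err != nil {
            return err
        }
        if d.Annotations == nil {
            d.Annotations = map[string]string{}
        }
        d.Annotations["demo/touched"] = "true"
        // Update sends resourceVersion back; a concurrent writer makes this fail with a conflict.
        _, err = clientset.AppsV1().Deployments("default").Update(context.TODO(), d, metav1.UpdateOptions{})
        return err
    })
    if err != nil {
        panic(err)
    }
}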
kubectl get lease -n kube-system
[root@k8s-master01 k8s-leaderelection]# kubectl get lease -n kube-system
NAME HOLDER AGE
kube-controller-manager k8s-master01_def02578-36f9-4a43-b700-66dd407ff612 161d
kube-scheduler k8s-master01_29da2906-54c1-4db1-9146-4bf8919b4cda 161d
This shows the cluster is currently a single-master environment.
Both kube-scheduler and kube-controller-manager use a Lease as the election lock,
in the kube-system namespace.
HOLDER shows which node currently holds the lock, in the form hostname + uuid.
PS D:\Workspace\Go\src\k8s-leaderelection> go mod init k8s-leaderelection
go: creating new go.mod: module k8s-leaderelection
Create leaderelection.go
package main
import (
"context"
"flag"
"os"
"os/signal"
"syscall"
"time"
"github.com/google/uuid"
metav1 "k8s.io/apimechinery/pkg/apis/meta/v1"
clientset "k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"
"k8s.io/klog"
)
// Initialize the rest config: use the file if kubeconfig is given, otherwise fall back to InClusterConfig (the pod's service account)
func buildConfig(kubeconfig string) (*rest.Config, error) {
if kubeconfig != "" {
cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
if err != nil {
return nil, err
}
return cfg, nil
}
cfg, err := rest.InClusterConfig()
if err != nil {
return nil, err
}
return cfg, nil
}
func main() {
klog.InitFlags(nil)
var kubeconfig string
var leaseLockName string
var leaseLockNamespace string
var id string
flag.StringVar(&kubeconfig, "kubeconfig", "", "absolute path to the kubeconfig file")
// unique id of this member
flag.StringVar(&id, "id", uuid.New().String(), "the holder identity name")
// name of the lease-lock resource
flag.StringVar(&leaseLockName, "lease-lock-name", "", "the lease lock resource name")
// namespace of the lease-lock resource
flag.StringVar(&leaseLockNamespace, "lease-lock-namespace", "", "the lease lock resource namespace")
flag.Parse()
if leaseLockName == "" {
klog.Fatal("unable to get lease lock resource name (missing lease-lock-name flag).")
}
if leaseLockNamespace == "" {
klog.Fatal("unable to get lease lock resource namespace (missing lease-lock-namespace flag).")
}
// leader election uses the Kubernetes API by writing to a
// lock object, which can be a LeaseLock object (preferred),
// a ConfigMap, or an Endpoints (deprecated) object.
// Conflicting writes are detected and each client handles those actions
// independently.
config, err := buildConfig(kubeconfig)
if err != nil {
klog.Fatal(err)
}
// create the clientset
client := clientset.NewForConfigOrDie(config)
run := func(ctx context.Context) {
// complete your controller loop here
klog.Info("Controller loop...")
select {}
}
// use a Go context so we can tell the leaderelection code when we
// want to step down
// context used to stop competing for the lock
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// listen for termination signals
ch := make(chan os.Signal, 1)
signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
go func() {
<-ch
klog.Info("Received termination, signaling shutdown")
cancel()
}()
// we use the Lease lock type since edits to Leases are less common
// and fewer objects in the cluster watch "all Leases"
// specify the lock's resource object; a Lease resource is used here, but configmap, endpoint, or multilock (a combination of several) are also supported
lock := &resourcelock.LeaseLock{
LeaseMeta: metav1.ObjectMeta{
Name: leaseLockName,
Namespace: leaseLockNamespace,
},
Client: client.CoordinationV1(),
LockConfig: resourcelock.ResourceLockConfig{
Identity: id,
},
}
// start the leader election code loop
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Lock: lock,
// IMPORTANT: you MUST ensure that any code you have that
// is protected by the lease must terminate **before**
// you call cancel. Otherwise, you could have a background
// loop still running and another process could
// get elected before your background loop finished, violating
// the stated goal of the lease.
ReleaseOnCancel: true,
LeaseDuration: 60 * time.Second, // how long the lease is valid
RenewDeadline: 15 * time.Second, // deadline for the leader to renew the lease
RetryPeriod: 5 * time.Second, // how often non-leader members retry
Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
// business logic that runs once we become the leader
// we're notified when we start - this is where you would
// usually put your code
run(ctx)
},
OnStoppedLeading: func() {
// the process exits here
// we can do cleanup here
klog.Infof("leader lost: %s", id)
os.Exit(0)
},
OnNewLeader: func(identity string) {
// we're notified when a new leader is elected
if identity == id {
// I just got the lock
return
}
klog.Infof("new leader elected: %s", identity)
},
},
})
}
- The kubeconfig, the lease name, the id and other parameters are passed in on the command line.
- buildConfig produces the restConfig.
- clientset.NewForConfigOrDie creates the clientset.
- A resourcelock.LeaseLock is instantiated, using the Lease resource type as the lock.
- leaderelection.RunOrDie starts the lock-acquisition loop.
Time-related parameters:
- LeaseDuration: 60 * time.Second, how long the lease is valid
- RenewDeadline: 15 * time.Second, deadline for the leader to renew the lease
- RetryPeriod: 5 * time.Second, how often non-leader members retry
Callback semantics:
- OnStartedLeading is what runs when this member becomes leader, usually the business logic; here it runs the empty run loop.
- OnStoppedLeading means the process exits.
- OnNewLeader is invoked whenever a new leader is elected.
go build
Run it, starting the member with id=1 first.
You can see this process grabs the lock and is elected leader; get the lease to confirm it.
Start a second member with id=2; member 1 is still shown as leader.
Now stop member 1; member 2 acquires the lock.
Summary: leaderelection implements a distributed lock on top of the atomicity of k8s API operations and elects a leader through continuous competition.
Only the process elected as leader runs the actual business logic, which is very common in k8s.
kube-scheduler uses a Lease-type resource lock for election and only schedules after winning it,
so that multiple replicas can be deployed for high availability.
k8s Events are objects that show you what is happening inside the cluster,
- for example which decisions the scheduler made,
- or why certain Pods were evicted from a node.
All core components and extensions (operators) can create events through the APIServer.
- Many k8s components produce events.
- For instance, create a pod with a deliberately wrong image repository name;
- after creating it, describe the pod to get its events and you can see the image-pull-failure events.
The event machinery has three parts (a standalone sketch follows below):
- EventRecorder: the event producer; k8s components call its methods to generate events.
- EventBroadcaster: the event broadcaster; it consumes events produced by the EventRecorder and distributes them to broadcasterWatchers.
- broadcasterWatcher: defines how events are handled, e.g. reporting them to the apiserver.
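A minimal sketch of those three parts wired together through the legacy core/v1 event API (the kubeconfig path, component name and pod name are assumptions):

package main

import (
    "context"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/kubernetes/scheme"
    typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
    "k8s.io/client-go/tools/clientcmd"
    "k8s.io/client-go/tools/record"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(config)

    // EventBroadcaster consumes what the EventRecorder produces.
    broadcaster := record.NewBroadcaster()
    // One broadcasterWatcher uploads the events to the apiserver (the sink).
    broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{Interface: client.CoreV1().Events("")})
    // EventRecorder is what a component calls to emit events.
    recorder := broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "demo-component"})

    // Attach a demo event to an existing pod (name/namespace are assumptions).
    pod, err := client.CoreV1().Pods("default").Get(context.TODO(), "nginx-pod", metav1.GetOptions{})
    if err != nil {
        panic(err)
    }
    recorder.Eventf(pod, corev1.EventTypeNormal, "Demo", "demo event for pod %s", pod.Name)

    // Crude flush for this demo before shutting the broadcaster down.
    time.Sleep(2 * time.Second)
    broadcaster.Shutdown()
}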
The volume of Events is very large, so they are only kept for a short time.
Events are usually staged through the apiserver into etcd (ideally a dedicated etcd cluster for events, separate from the etcd holding cluster business data).
To avoid filling up disk space, a retention policy is enforced: events older than one hour, counted from the last occurrence, are deleted.
(A diagram from the web in the original notes shows how event-based monitoring of a k8s cluster can be built.)
In the configuration initialization (Config), location D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go
c.EventBroadcaster = events.NewEventBroadcasterAdapter(eventClient)
NewEventBroadcasterAdapter:
// NewEventBroadcasterAdapter creates a wrapper around new and legacy broadcasters to simplify
// migration of individual components to the new Event API.
func NewEventBroadcasterAdapter(client clientset.Interface) EventBroadcasterAdapter {
eventClient := &eventBroadcasterAdapterImpl{}
if _, err := client.Discovery().ServerResourcesForGroupVersion(eventsv1.SchemeGroupVersion.String()); err == nil {
eventClient.eventsv1Client = client.EventsV1()
eventClient.eventsv1Broadcaster = NewBroadcaster(&EventSinkImpl{Interface: eventClient.eventsv1Client})
}
// Even though there can soon exist cases when coreBroadcaster won't really be needed,
// we create it unconditionally because its overhead is minor and will simplify using usage
// patterns of this library in all components.
eventClient.coreClient = client.CoreV1()
eventClient.coreBroadcaster = record.NewBroadcaster()
return eventClient
}
Under the hood it uses eventBroadcasterImpl from client-go, located at D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\record\event.go
type eventBroadcasterImpl struct {
*watch.Broadcaster
sleepDuration time.Duration
options CorrelatorOptions
}
In the scheduler's Setup function, location D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go
recorderFactory := getRecorderFactory(&cc)
recorderFactory is the factory function that produces eventRecorders.
What it ultimately uses is recorderImpl from client-go, located at D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\events\event_recorder.go
type recorderImpl struct {
scheme *runtime.Scheme
reportingController string
reportingInstance string
*watch.Broadcaster
clock clock.Clock
}
In the scheduler's Run, located at D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go
// Prepare the event broadcaster.
cc.EventBroadcaster.StartRecordingToSink(ctx.Done())
// StartRecordingToSink starts sending events received from the specified eventBroadcaster to the given sink.
func (e *eventBroadcasterAdapterImpl) StartRecordingToSink(stopCh <-chan struct{}) {
if e.eventsv1Broadcaster != nil && e.eventsv1Client != nil {
e.eventsv1Broadcaster.StartRecordingToSink(stopCh)
}
if e.coreBroadcaster != nil && e.coreClient != nil {
e.coreBroadcaster.StartRecordingToSink(&typedv1core.EventSinkImpl{Interface: e.coreClient.Events("")})
}
}
D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\events\event_broadcaster.go
func (e *eventBroadcasterImpl) startRecordingEvents(stopCh <-chan struct{}) {
eventHandler := func(obj runtime.Object) {
event, ok := obj.(*eventsv1.Event)
if !ok {
klog.Errorf("unexpected type, expected eventsv1.Event")
return
}
e.recordToSink(event, clock.RealClock{})
}
stopWatcher := e.StartEventWatcher(eventHandler)
go func() {
<-stopCh
stopWatcher()
}()
}
An eventHandler is started; it calls recordToSink to write events to the backing store.
StartEventWatcher consumes events from the ResultChan queue and hands them to the eventHandler.
// StartEventWatcher starts sending events received from this EventBroadcaster to the given event handler function.
// The return value is used to stop recording
func (e *eventBroadcasterImpl) StartEventWatcher(eventHandler func(event runtime.Object)) func() {
watcher := e.Watch()
go func() {
defer utilruntime.HandleCrash()
for {
watchEvent, ok := <-watcher.ResultChan()
if !ok {
return
}
eventHandler(watchEvent.Object)
}
}()
return watcher.Stop
}
getKey builds the event's key, which identifies it in the cache:
func getKey(event *eventsv1.Event) eventKey {
key := eventKey{
action: event.Action,
reason: event.Reason,
reportingController: event.ReportingController,
regarding: event.Regarding,
}
if event.Related != nil {
key.related = *event.Related
}
return key
}
Event.Series records how many times this event has occurred and when it was last observed.
The key above is looked up in the eventCache; if it is found and Series is set, the count and last-observed time are updated.
isomorphicEvent, isIsomorphic := e.eventCache[eventKey]
if isIsomorphic {
if isomorphicEvent.Series != nil {
isomorphicEvent.Series.Count++
isomorphicEvent.Series.LastObservedTime = metav1.MicroTime{Time: clock.Now()}
return nil
}
Otherwise a new Series is created and returned.
isomorphicEvent.Series = &eventsv1.EventSeries{
Count: 1,
LastObservedTime: metav1.MicroTime{Time: clock.Now()},
}
return isomorphicEvent
The resulting evToRecord is then sent, and the cache is updated.
if evToRecord != nil {
recordedEvent := e.attemptRecording(evToRecord)
if recordedEvent != nil {
recordedEventKey := getKey(recordedEvent)
e.mu.Lock()
defer e.mu.Unlock()
e.eventCache[recordedEventKey] = recordedEvent
}
}
attemptRecording sends with retries; under the hood it calls the sink's methods, location D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\events\interfaces.go
// EventSink knows how to store events (client-go implements it.)
// EventSink must respect the namespace that will be embedded in 'event'.
// It is assumed that EventSink will return the same sorts of errors as
// client-go's REST client.
type EventSink interface {
Create(event *eventsv1.Event) (*eventsv1.Event, error)
Update(event *eventsv1.Event) (*eventsv1.Event, error)
Patch(oldEvent *eventsv1.Event, data []byte) (*eventsv1.Event, error)
}
What Event.Series is for:
just like a repeated "cannot connect to mysql" error produces many log lines,
the same k8s event can also be emitted many times, so deduplication and noise reduction are necessary. A unique key is built from event.Action, event.Reason, event.ReportingController (the source) and related fields; if the cache already holds a record for that key, it is enough to update that event's occurrence count and last-observed time.
In the scheduler's frameworkImpl, location D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\runtime\framework.go
// EventRecorder returns an event recorder.
func (f *frameworkImpl) EventRecorder() events.EventRecorder {
return f.eventRecorder
}
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go
// handleSchedulingFailure records an event for the pod that indicates the
// pod has failed to schedule. Also, update the pod condition and nominated node name if set.
func (sched *Scheduler) handleSchedulingFailure(fwk framework.Framework, podInfo *framework.QueuedPodInfo, err error, reason string, nominatingInfo *framework.NominatingInfo) {
sched.Error(podInfo, err)
// Update the scheduling queue with the nominated pod information. Without
// this, there would be a race condition between the next scheduling cycle
// and the time the scheduler receives a Pod Update for the nominated pod.
// Here we check for nil only for tests.
if sched.SchedulingQueue != nil {
sched.SchedulingQueue.AddNominatedPod(podInfo.PodInfo, nominatingInfo)
}
pod := podInfo.Pod
msg := truncateMessage(err.Error())
fwk.EventRecorder().Eventf(pod, nil, v1.EventTypeWarning, "FailedScheduling", "Scheduling", msg)
if err := updatePod(sched.client, pod, &v1.PodCondition{
Type: v1.PodScheduled,
Status: v1.ConditionFalse,
Reason: reason,
Message: err.Error(),
}, nominatingInfo); err != nil {
klog.ErrorS(err, "Error updating pod", "pod", klog.KObj(pod))
}
}
Here the pod is required to be scheduled onto a node with disktype=ssd.
After creating it, list the events and you can see the scheduling-failure event:
kubectl get event
Let's analyze the kube-scheduler code that records this event.
Eventf parameters and fields:
- regarding: which resource the event is about; here the pod is passed in.
- related: other related resources; nil here.
- eventtype: Warning or Normal; here v1.EventTypeWarning.
- reason: the reason; here FailedScheduling.
- action: which action was being performed; here Scheduling.
- note: the detailed message; here the error message msg.
You can also see that
a successfully scheduled pod has related events as well.
Tracing the normal-path Eventf call back to the source, location D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\schedule_one.go
func (sched *Scheduler) finishBinding(fwk framework.Framework, assumed *v1.Pod, targetNode string, err error) {
if finErr := sched.Cache.FinishBinding(assumed); finErr != nil {
klog.ErrorS(finErr, "Scheduler cache FinishBinding failed")
}
if err != nil {
klog.V(1).InfoS("Failed to bind pod", "pod", klog.KObj(assumed))
return
}
fwk.EventRecorder().Eventf(assumed, nil, v1.EventTypeNormal, "Scheduled", "Binding", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, targetNode)
}
Summary: k8s events are objects that show what is happening inside the cluster.
Many components can produce events; the data is uploaded to the apiserver and stored temporarily in etcd, and because the volume is so large there is a deletion policy.
The event management machinery consists of three parts:
- EventRecorder: the event producer; k8s components call its methods to generate events;
- EventBroadcaster: the event broadcaster; it consumes events produced by the EventRecorder and distributes them to BroadcasterWatchers;
- broadcasterWatcher: defines how events are handled, such as reporting them to the apiserver.
The Informer mechanism guarantees real-time delivery, reliability and ordering of messages without relying on any middleware,
and reduces the communication pressure the various k8s components put on etcd and the k8s APIServer.
- The informer framework in k8s makes it easy for every submodule and extension to obtain resource information from k8s.
- Reflector: talks directly to the k8s apiserver and implements the listwatch mechanism internally.
- DeltaFIFO: the update queue.
- Informer: a code abstraction of the resource we want to watch.
- Indexer: client-go's local store for resource objects, with built-in indexing.
The architecture diagram has two parts:
the yellow icons are the pieces developers need to write themselves,
while everything else is already provided by client-go and can be used directly.
What the informer mechanism is for:
the informer framework makes it easy for every submodule and extension to obtain resource information from k8s,
and it guarantees real-time delivery, reliability and ordering of messages without relying on middleware, reducing the load on etcd and the k8s APIServer.
Reflector: talks directly to the k8s apiserver and implements the listwatch mechanism internally.
- listwatch is what watches for resource changes.
- One listwatch corresponds to exactly one resource,
- which can be a built-in k8s resource or a custom resource.
- When a change arrives (create, delete, modify) the resource is put into the DeltaFIFO queue.
- The Reflector keeps a long-lived connection to the apiserver.
DeltaFIFO: the update queue.
- FIFO is a queue with the usual queue methods (ADD, UPDATE, DELETE, LIST, POP, CLOSE, and so on).
- Delta is a resource-object store that records the consumption type of each stored object, e.g. Added, Updated, Deleted, Sync.
informer: a code abstraction of the resource we want to watch.
- It pops data out of the DeltaFIFO queue,
- stores it in the local cache, the Indexer (step 5 in the diagram),
- and at the same time distributes the data to the custom controller for event handling (step 6 in the diagram).
Indexer: client-go's local store for resource objects, with built-in indexing.
- The Reflector stores the objects consumed from the DeltaFIFO into the Indexer.
- The Indexer stays fully consistent with the data in the etcd cluster,
- so client-go can read locally and reduce the load on the Kubernetes API and etcd.
go mod init k8s-informer
informer.go
package main
import (
"context"
"flag"
"log"
"os"
"os/signal"
"path/filepath"
"syscall"
"time"
v1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/clientcmd"
"k8s.io/client-go/util/homedir"
"k8s.io/klog"
)
func main() {
var kubeconfig *string
// On Windows this reads the config file under C:\Users\xxx\.kube\config
// On Linux this reads the config file under ~/.kube/config
if home := homedir.HomeDir(); home != "" {
kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
} else {
kubeconfig = flag.String("kubeconfig", "", "absolute path to the kubeconfig file")
}
flag.Parse()
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
if err != nil {
panic(err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
panic(err)
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
// listen for termination signals
ch := make(chan os.Signal, 1)
signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
go func() {
<-ch
klog.Info("Received termination,signaling shutdown")
cancel()
}()
// resync once per minute; a resync periodically replays the List operation
sharedInformers := informers.NewSharedInformerFactory(clientset, time.Minute)
informer := sharedInformers.Core().V1().Pods().Informer()
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
AddFunc: func(obj interface{}) {
mObj := obj.(v1.Object)
log.Printf("New Pod Added to store: %s", mObj.GetName())
},
UpdateFunc: func(oldObj, newobj interface{}) {
oObj := oldObj.(v1.Object)
nObj := newobj.(v1.Object)
log.Printf("%s Pod Updated to %s", oObj.GetName(), nObj.GetName())
},
DeleteFunc: func(obj interface{}) {
mObj := obj.(v1.Object)
log.Printf("pod Deleted from Store: %s", mObj.GetName())
},
})
informer.Run(ctx.Done())
}
First build a restclient.Config from the kubeconfig:
config,err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
Then create the clientset that talks to the apiserver:
clientset,err := kubernetes.NewForConfig(config)
Listen for the exit signal and create a context that is cancelled on shutdown.
Use a SharedInformerFactory to create the shared informer; the resync period passed in is one minute, meaning the List operation runs every minute.
Then create the informer for the Pod resource
and add the EventHandler before running it:
- AddFunc is the callback for newly created resources,
- UpdateFunc is the callback for resource updates,
- DeleteFunc is the callback for resource deletions, as in the code above.
go build
./informer
After running it, the initial full List is pulled.
On the informer side:
modify the pod created earlier by adding a label;
the pod's yaml
and the informer's update log show the change.
Every resync also triggers an Update callback.
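Once the cache has synced, reads should go through the Lister, which serves from the local Indexer rather than the apiserver; a minimal sketch built on the same factory (the kubeconfig path and namespace are examples):

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/labels"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    factory := informers.NewSharedInformerFactory(clientset, time.Minute)
    podInformer := factory.Core().V1().Pods()
    podLister := podInformer.Lister()

    stopCh := make(chan struct{})
    defer close(stopCh)
    factory.Start(stopCh)
    // Block until the pod informer has done its initial List and filled the Indexer.
    if !cache.WaitForCacheSync(stopCh, podInformer.Informer().HasSynced) {
        panic("cache did not sync")
    }

    // This List is served from the local Indexer, not from the apiserver.
    pods, err := podLister.Pods("kube-system").List(labels.Everything())
    if err != nil {
        panic(err)
    }
    for _, p := range pods {
        fmt.Println(p.Name)
    }
}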
The entry point is in kube-scheduler's config, location
D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\options\options.go
Here resync=0 is passed in, meaning no periodic List; instead there is one initial full List followed by incremental updates.
c.InformerFactory = scheduler.NewInformerFactory(client, 0)
informerFactory := informers.NewSharedInformerFactory(cs, resyncPeriod)
NewSharedInformerFactory eventually calls NewSharedInformerFactoryWithOptions to initialize a sharedInformerFactory, located at D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\informers\factory.go
// NewSharedInformerFactoryWithOptions constructs a new instance of a SharedInformerFactory with additional options.
func NewSharedInformerFactoryWithOptions(client kubernetes.Interface, defaultResync time.Duration, options ...SharedInformerOption) SharedInformerFactory {
factory := &sharedInformerFactory{
client: client,
namespace: v1.NamespaceAll,
defaultResync: defaultResync,
informers: make(map[reflect.Type]cache.SharedIndexInformer),
startedInformers: make(map[reflect.Type]bool),
customResync: make(map[reflect.Type]time.Duration),
}
// Apply all options
for _, opt := range options {
factory = opt(factory)
}
return factory
}
The fields of sharedInformerFactory are:
type sharedInformerFactory struct {
client kubernetes.Interface
namespace string
tweakListOptions internalinterfaces.TweakListOptionsFunc
lock sync.Mutex
defaultResync time.Duration
customResync map[reflect.Type]time.Duration
informers map[reflect.Type]cache.SharedIndexInformer
// startedInformers is used for tracking which informers have been started.
// This allows Start() to be called multiple times safely.
startedInformers map[reflect.Type]bool
}
The most important field is the informers map, which tracks the informer for each resource type.
If one resource had many separate Informers it would be inefficient, so each resource maps to a single shared informer, and the shared informer itself fans the data out to multiple listeners internally.
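That sharing is observable from user code: asking the factory for the Pod informer twice returns the same instance. A small sketch (the kubeconfig path is an example):

package main

import (
    "fmt"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
    a := factory.Core().V1().Pods().Informer()
    b := factory.Core().V1().Pods().Informer()
    // The factory keeps one informer per reflect.Type, so both calls return the same object.
    fmt.Println(a == b) // true
}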
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go
// NewInformerFactory creates a SharedInformerFactory and initializes a scheduler specific
// in-place podInformer.
func NewInformerFactory(cs clientset.Interface, resyncPeriod time.Duration) informers.SharedInformerFactory {
informerFactory := informers.NewSharedInformerFactory(cs, resyncPeriod)
informerFactory.InformerFor(&v1.Pod{}, newPodInformer)
return informerFactory
}
Here informerFactory's InformerFor is used to create the pod informer object.
The concrete InformerFor is sharedInformerFactory.InformerFor, located at
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\factory.go
// InternalInformerFor returns the SharedIndexInformer for obj using an internal
// client.
func (f *sharedInformerFactory) InformerFor(obj runtime.Object, newFunc internalinterfaces.NewInformerFunc) cache.SharedIndexInformer {
f.lock.Lock()
defer f.lock.Unlock()
informerType := reflect.TypeOf(obj)
informer, exists := f.informers[informerType]
if exists {
return informer
}
resyncPeriod, exists := f.customResync[informerType]
if !exists {
resyncPeriod = f.defaultResync
}
informer = newFunc(f.client, resyncPeriod)
f.informers[informerType] = informer
return informer
}
Reading InformerFor:
- look up the informer in the informers map by the obj's reflect type;
- if found, return it; otherwise create one with the newFunc passed in and update the map.
The newFunc here is newPodInformer, located at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go
// newPodInformer creates a shared index informer that returns only non-terminal pods.
func newPodInformer(cs clientset.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
selector := fmt.Sprintf("status.phase!=%v,status.phase!=%v", v1.PodSucceeded, v1.PodFailed)
tweakListOptions := func(options *metav1.ListOptions) {
options.FieldSelector = selector
}
return coreinformers.NewFilteredPodInformer(cs, metav1.NamespaceAll, resyncPeriod, nil, tweakListOptions)
}
The underlying pod informer constructor lives in client-go, location D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\informers\core\v1\pod.go
Inside it you can see the corresponding ListFunc and WatchFunc:
// NewFilteredPodInformer constructs a new informer for Pod type.
// Always prefer using an informer factory to get a shared informer instead of getting an independent
// one. This reduces memory footprint and number of connections to the server.
func NewFilteredPodInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer {
return cache.NewSharedIndexInformer(
&cache.ListWatch{
ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
if tweakListOptions != nil {
tweakListOptions(&options)
}
return client.CoreV1().Pods(namespace).List(context.TODO(), options)
},
WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
if tweakListOptions != nil {
tweakListOptions(&options)
}
return client.CoreV1().Pods(namespace).Watch(context.TODO(), options)
},
},
&corev1.Pod{},
resyncPeriod,
indexers,
)
}
As mentioned above, NewSharedIndexInformer creates a new indexer store; location D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\store.go
// NewIndexer returns an Indexer implemented simply with a map and a lock.
func NewIndexer(keyFunc KeyFunc, indexers Indexers) Indexer {
return &cache{
cacheStorage: NewThreadSafeStore(indexers, Indices{}),
keyFunc: keyFunc,
}
}
The underlying data structure is threadSafeMap, located at D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\thread_safe_store.go
// threadSafeMap implements ThreadSafeStore
type threadSafeMap struct {
lock sync.RWMutex
items map[string]interface{}
// indexers maps a name to an IndexFunc
indexers Indexers
// indices maps a name to an Index
indices Indices
}
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\index.go
// Index maps the indexed value to a set of keys in the store that match on that value
type Index map[string]sets.String
// Indexers maps a name to an IndexFunc
type Indexers map[string]IndexFunc
// Indices maps a name to an Index
type Indices map[string]Index
The indices data structure is effectively a three-level map whose keys are all strings.
threadSafeMap.items stores the concrete resource objects, while indices are indexes that speed up lookups.
The key function used is MetaNamespaceKeyFunc, i.e. the object's namespace/name.
D:\Workspace\Go\src\k8s.io\kubernetes\staging\src\k8s.io\client-go\tools\cache\store.go
// MetaNamespaceKeyFunc is a convenient default KeyFunc which knows how to make
// keys for API objects which implement meta.Interface.
// The key uses the format <namespace>/<name> unless <namespace> is empty, then
// it's just <name>.
//
// TODO: replace key-as-string with a key-as-struct so that this
// packing/unpacking won't be necessary.
func MetaNamespaceKeyFunc(obj interface{}) (string, error) {
if key, ok := obj.(ExplicitKey); ok {
return string(key), nil
}
meta, err := meta.Accessor(obj)
if err != nil {
return "", fmt.Errorf("object has no meta: %v", err)
}
if len(meta.GetNamespace()) > 0 {
return meta.GetNamespace() + "/" + meta.GetName(), nil
}
return meta.GetName(), nil
}
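A standalone sketch of how the key function and the indices fit together: store two pods keyed by MetaNamespaceKeyFunc and query them back both by key and through the built-in namespace index (the pod names are made up):

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/tools/cache"
)

func main() {
    // Keys are "namespace/name"; one index ("namespace") maps a namespace to its keys.
    indexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{
        cache.NamespaceIndex: cache.MetaNamespaceIndexFunc,
    })

    pod := func(ns, name string) *corev1.Pod {
        return &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Namespace: ns, Name: name}}
    }
    indexer.Add(pod("default", "nginx"))
    indexer.Add(pod("kube-system", "kube-scheduler-master01"))

    // Direct lookup by key, i.e. the threadSafeMap.items path.
    obj, exists, _ := indexer.GetByKey("default/nginx")
    fmt.Println(exists, obj.(*corev1.Pod).Name)

    // Lookup via the namespace index, i.e. the indices path.
    objs, _ := indexer.ByIndex(cache.NamespaceIndex, "kube-system")
    for _, o := range objs {
        fmt.Println(o.(*corev1.Pod).Name)
    }
}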
In the scheduler's New, addAllEventHandlers is called.
These handlers are what each consumer does with the data: the informer updates its store when it receives objects, while a consumer such as the scheduler uses them to schedule pods.
Location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go
addAllEventHandlers(sched, informerFactory, dynInformerFactory, unionedGVKs(clusterEventMap))
Register the callbacks for scheduled pods, with the corresponding AddFunc, UpdateFunc and DeleteFunc:
// scheduled pod cache
informerFactory.Core().V1().Pods().Informer().AddEventHandler(
cache.FilteringResourceEventHandler{
FilterFunc: func(obj interface{}) bool {
switch t := obj.(type) {
case *v1.Pod:
return assignedPod(t)
case cache.DeletedFinalStateUnknown:
if _, ok := t.Obj.(*v1.Pod); ok {
// The carried object may be stale, so we don't use it to check if
// it's assigned or not. Attempting to cleanup anyways.
return true
}
utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
return false
default:
utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
return false
}
},
Handler: cache.ResourceEventHandlerFuncs{
AddFunc: sched.addPodToCache,
UpdateFunc: sched.updatePodInCache,
DeleteFunc: sched.deletePodFromCache,
},
},
)
The entry point is in the scheduler's Run, location D:\Workspace\Go\src\k8s.io\kubernetes\cmd\kube-scheduler\app\server.go
// Start all informers.
cc.InformerFactory.Start(ctx.Done())
// DynInformerFactory can be nil in tests.
if cc.DynInformerFactory != nil {
cc.DynInformerFactory.Start(ctx.Done())
}
// Wait for all caches to sync before scheduling.
cc.InformerFactory.WaitForCacheSync(ctx.Done())
// DynInformerFactory can be nil in tests.
if cc.DynInformerFactory != nil {
cc.DynInformerFactory.WaitForCacheSync(ctx.Done())
}
The corresponding Start lives on sharedInformerFactory: it iterates over the informers map and starts each informer, checking the startedInformers map first so nothing is started twice.
// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
f.lock.Lock()
defer f.lock.Unlock()
for informerType, informer := range f.informers {
if !f.startedInformers[informerType] {
go informer.Run(stopCh)
f.startedInformers[informerType] = true
}
}
}
Create the FIFO queue:
fifo := NewDeltaFIFOWithOptions(DeltaFIFOOptions{
KnownObjects: s.indexer,
EmitDeltaTypeReplaced: true,
})
Create the controller:
func() {
s.startedLock.Lock()
defer s.startedLock.Unlock()
s.controller = New(cfg)
s.controller.(*controller).clock = s.clock
s.started = true
}()
The processor starts its listeners:
wg.StartWithChannel(processorStopCh, s.processor.run)
which in turn runs each processorListener's run.
D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\shared_informer.go
processorListener.run is where the callbacks registered via the eventHandler are invoked:
func (p *processorListener) run() {
// this call blocks until the channel is closed. When a panic happens during the notification
// we will catch it, **the offending item will be skipped!**, and after a short delay (one second)
// the next notification will be attempted. This is usually better than the alternative of never
// delivering again.
stopCh := make(chan struct{})
wait.Until(func() {
for next := range p.nextCh {
switch notification := next.(type) {
case updateNotification:
p.handler.OnUpdate(notification.oldObj, notification.newObj)
case addNotification:
p.handler.OnAdd(notification.newObj)
case deleteNotification:
p.handler.OnDelete(notification.oldObj)
default:
utilruntime.HandleError(fmt.Errorf("unrecognized notification: %T", next))
}
}
// the only way to get here is if the p.nextCh is empty and closed
close(stopCh)
}, 1*time.Second, stopCh)
}
D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\controller.go
// Run begins processing items, and will continue until a value is sent down stopCh or it is closed.
// It's an error to call Run more than once.
// Run blocks; call via go.
func (c *controller) Run(stopCh <-chan struct{}) {
defer utilruntime.HandleCrash()
go func() {
<-stopCh
c.config.Queue.Close()
}()
r := NewReflector(
c.config.ListerWatcher,
c.config.ObjectType,
c.config.Queue,
c.config.FullResyncPeriod,
)
r.ShouldResync = c.config.ShouldResync
r.WatchListPageSize = c.config.WatchListPageSize
r.clock = c.clock
if c.config.WatchErrorHandler != nil {
r.watchErrorHandler = c.config.WatchErrorHandler
}
c.reflectorMutex.Lock()
c.reflector = r
c.reflectorMutex.Unlock()
var wg wait.Group
wg.StartWithChannel(stopCh, r.Run)
wait.Until(c.processLoop, time.Second, stopCh)
wg.Wait()
}
A reflector is created here; r.Run is the producer side, putting data into the Queue.
ListAndWatch calls watchHandler.
watchHandler, as the name suggests, calls the matching handler for each event received from the Watch; inside it you can see how add, update and delete actions are handled. Location:
D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\reflector.go
switch event.Type {
case watch.Added:
err := r.store.Add(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
}
case watch.Modified:
err := r.store.Update(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
}
case watch.Deleted:
// TODO: Will any consumers need access to the "last known
// state", which is passed in event.Object? If so, may need
// to change this.
err := r.store.Delete(event.Object)
if err != nil {
utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
}
D:\Workspace\Go\src\k8s.io\kubernetes\vendor\k8s.io\client-go\tools\cache\shared_informer.go
func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {
s.blockDeltas.Lock()
defer s.blockDeltas.Unlock()
if deltas, ok := obj.(Deltas); ok {
return processDeltas(s, s.indexer, s.transform, deltas)
}
return errors.New("object given as Process argument is not Deltas")
}
The process is:
- check whether the object already exists in the indexer: update it if it does, add it if it does not;
- at the same time call the distribute function to fan the delta out to the listeners.
That is the framework of the informer mechanism.
Pod scheduling works asynchronously through a queue, the SchedulingQueue:
- when a relevant pod event is observed, the pod is put into the queue;
- a consumer takes pods off the queue and schedules them.
Scheduling a single pod has three main steps:
- run the Predict and Priority phases, calling their algorithm plugins to pick the best Node;
- Assume the Pod is scheduled onto that Node and save that into the cache;
- validate with extenders and plugins, and bind if everything passes.
Location: D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go
You can see the two fields directly related to pod scheduling:
// Scheduler watches for new unscheduled pods. It attempts to find
// nodes that they fit on and writes bindings back to the api server.
type Scheduler struct {
NextPod func() *framework.QueuedPodInfo // returns the next Pod that needs to be scheduled
// SchedulingQueue holds pods to be scheduled
SchedulingQueue internalqueue.SchedulingQueue // the queue of Pods waiting to be scheduled; let's look at what this queue is
}
In the create function the podQueue is created, location D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\scheduler.go
podQueue := internalqueue.NewSchedulingQueue(
profiles[options.profiles[0].SchedulerName].QueueSortFunc(),
informerFactory,
internalqueue.WithPodInitialBackoffDuration(time.Duration(options.podInitialBackoffSeconds)*time.Second),
internalqueue.WithPodMaxBackoffDuration(time.Duration(options.podMaxBackoffSeconds)*time.Second),
internalqueue.WithPodNominator(nominator),
internalqueue.WithClusterEventMap(clusterEventMap),
internalqueue.WithPodMaxInUnschedulablePodsDuration(options.podMaxInUnschedulablePodsDuration),
)
As you can see, this is a priority queue:
// NewSchedulingQueue initializes a priority queue as a new scheduling queue.
func NewSchedulingQueue(
lessFn framework.LessFunc,
informerFactory informers.SharedInformerFactory,
opts ...Option) SchedulingQueue {
return NewPriorityQueue(lessFn, informerFactory, opts...)
}
because some pods are more important and need to be scheduled first.
See the scheduling-priority documentation,
check the cluster's default priority classes,
and, as an example of pod scheduling priority, the configuration in the prometheus statefulset discussed earlier.
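Roughly, the queue's ordering comes from a LessFunc; the sketch below is a simplified stand-in (not the in-tree PrioritySort plugin) that compares spec.priority and falls back to creation time:

package main

import (
    "fmt"
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// podPriority returns spec.priority, treating an unset priority as 0.
func podPriority(p *corev1.Pod) int32 {
    if p.Spec.Priority != nil {
        return *p.Spec.Priority
    }
    return 0
}

// less orders pods the way the default queue sort roughly does:
// higher priority first, then earlier creation time first.
func less(p1, p2 *corev1.Pod) bool {
    if podPriority(p1) != podPriority(p2) {
        return podPriority(p1) > podPriority(p2)
    }
    return p1.CreationTimestamp.Before(&p2.CreationTimestamp)
}

func main() {
    high := int32(1000)
    p1 := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "low", CreationTimestamp: metav1.NewTime(time.Now())}}
    p2 := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "high", CreationTimestamp: metav1.NewTime(time.Now())},
        Spec: corev1.PodSpec{Priority: &high}}
    fmt.Println(less(p2, p1)) // true: the high-priority pod is popped first
}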
NextPod simply pops one item off the podQueue:
// MakeNextPodFunc returns a function to retrieve the next pod from a given
// scheduling queue
func MakeNextPodFunc(queue SchedulingQueue) func() *framework.QueuedPodInfo {
return func() *framework.QueuedPodInfo {
podInfo, err := queue.Pop()
if err == nil {
klog.V(4).InfoS("About to try and schedule pod", "pod", klog.KObj(podInfo.Pod))
for plugin := range podInfo.UnschedulablePlugins {
metrics.UnschedulableReason(plugin, podInfo.Pod.Spec.SchedulerName).Dec()
}
return podInfo
}
klog.ErrorS(err, "Error while retrieving next pod from scheduling queue")
return nil
}
}
Recall that the scheduler's New registers callbacks via addAllEventHandlers:
addAllEventHandlers(sched, informerFactory, dynInformerFactory, unionedGVKs(clusterEventMap))
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\eventhandlers.go
// scheduled pod cache
informerFactory.Core().V1().Pods().Informer().AddEventHandler(
cache.FilteringResourceEventHandler{
FilterFunc: func(obj interface{}) bool {
switch t := obj.(type) {
case *v1.Pod:
return assignedPod(t)
case cache.DeletedFinalStateUnknown:
if _, ok := t.Obj.(*v1.Pod); ok {
// The carried object may be stale, so we don't use it to check if
// it's assigned or not. Attempting to cleanup anyways.
return true
}
utilruntime.HandleError(fmt.Errorf("unable to convert object %T to *v1.Pod in %T", obj, sched))
return false
default:
utilruntime.HandleError(fmt.Errorf("unable to handle object in %T: %T", sched, obj))
return false
}
},
Handler: cache.ResourceEventHandlerFuncs{
AddFunc: sched.addPodToCache,
UpdateFunc: sched.updatePodInCache,
DeleteFunc: sched.deletePodFromCache,
},
},
)
FilterFunc is a filter; assignedPod means the pod already carries node information, i.e. it has been scheduled onto a node.
// assignedPod selects pods that are assigned (scheduled and running).
func assignedPod(pod *v1.Pod) bool {
return len(pod.Spec.NodeName) != 0
}
The Add event triggers sched.addPodToCache; the nginx pod we created earlier, for example, goes through here.
In addPodToCache below you can see SchedulingQueue.AssignedPodAdded being called to push the pod into the queue.
func (sched *Scheduler) addPodToCache(obj interface{}) {
pod, ok := obj.(*v1.Pod)
if !ok {
klog.ErrorS(nil, "Cannot convert to *v1.Pod", "obj", obj)
return
}
klog.V(3).InfoS("Add event for scheduled pod", "pod", klog.KObj(pod))
if err := sched.Cache.AddPod(pod); err != nil {
klog.ErrorS(err, "Scheduler cache AddPod failed", "pod", klog.KObj(pod))
}
sched.SchedulingQueue.AssignedPodAdded(pod)
}
At this point we understand both how a created pod is enqueued and how it is dequeued.
We can trace where NextPod is called and find it in scheduleOne:
// scheduleOne does the entire scheduling workflow for a single pod. It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne(ctx context.Context) {
podInfo := sched.NextPod()
}
Tracing further up: when the scheduler starts, the OnStartedLeading callback that fires on winning the election calls sched.Run, which drives scheduling.
// If leader election is enabled, runCommand via LeaderElector until done and exit.
if cc.LeaderElection != nil {
cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
OnStartedLeading: func(ctx context.Context) {
close(waitingForLeader)
sched.Run(ctx)
},
OnStoppedLeading: func() {
select {
case <-ctx.Done():
// We were asked to terminate. Exit 0.
klog.InfoS("Requested to terminate, exiting")
os.Exit(0)
default:
// We lost the lock.
klog.ErrorS(nil, "Leaderelection lost")
klog.FlushAndExit(klog.ExitFlushTimeout, 1)
}
},
}
leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
if err != nil {
return fmt.Errorf("couldn't create leader elector: %v", err)
}
leaderElector.Run(ctx)
return fmt.Errorf("lost lease")
}
podInfo is the pod object taken from the queue; its validity is checked first:
podInfo := sched.NextPod()
// pod could be nil when schedulerQueue is closed
if podInfo == nil || podInfo.Pod == nil {
return
}
pod := podInfo.Pod
Look up the matching profile by the pod's spec.schedulerName:
fwk, err := sched.frameworkForPod(pod)
if err != nil {
// This shouldn't happen, because we only accept for scheduling the pods
// which specify a scheduler name that matches one of the profiles.
klog.ErrorS(err, "Error occurred")
return
}
Run the scheduling algorithm to get a result:
scheduleResult, err := sched.SchedulePod(schedulingCycleCtx, fwk, state, pod)
Call assume to validate the scheduling result:
// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
// This allows us to keep scheduling without waiting on binding to occur.
assumedPodInfo := podInfo.DeepCopy()
assumedPod := assumedPodInfo.Pod
// assume modifies `assumedPod` by setting NodeName=scheduleResult.SuggestedHost
err = sched.assume(assumedPod, scheduleResult.SuggestedHost)
if err != nil {
metrics.PodScheduleError(fwk.ProfileName(), metrics.SinceInSeconds(start))
// This is most probably result of a BUG in retrying logic.
// We report an error here so that pod scheduling can be retried.
// This relies on the fact that Error will check if the pod has been bound
// to a node and if so will not add it back to the unscheduled pods queue
// (otherwise this would cause an infinite loop).
sched.handleSchedulingFailure(fwk, assumedPodInfo, err, SchedulerError, clearNominatedNode)
return
}
The following go func performs the binding asynchronously:
// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
err := sched.bind(bindingCycleCtx, fwk, assumedPod, scheduleResult.SuggestedHost, state)
After a successful bind, a few metrics are recorded:
metrics.PodScheduled(fwk.ProfileName(), metrics.SinceInSeconds(start))
metrics.PodSchedulingAttempts.Observe(float64(podInfo.Attempts))
metrics.PodSchedulingDuration.WithLabelValues(getAttemptsLabel(podInfo)).Observe(metrics.SinceInSeconds(podInfo.InitialAttemptTimestamp))
For example, the average scheduling time:
scheduler_pod_scheduling_duration_seconds_sum / scheduler_pod_scheduling_duration_seconds_count
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\schedule_one.go
A snapshot of the current cluster state is taken; if the snapshot contains zero nodes, "no nodes available" is returned:
// schedulePod tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError with reasons.
func (sched *Scheduler) schedulePod(ctx context.Context, fwk framework.Framework, state *framework.CycleState, pod *v1.Pod) (result ScheduleResult, err error) {
trace := utiltrace.New("Scheduling", utiltrace.Field{Key: "namespace", Value: pod.Namespace}, utiltrace.Field{Key: "name", Value: pod.Name})
defer trace.LogIfLong(100 * time.Millisecond)
if err := sched.Cache.UpdateSnapshot(sched.nodeInfoSnapshot); err != nil {
return result, err
}
trace.Step("Snapshotting scheduler cache and node infos done")
if sched.nodeInfoSnapshot.NumNodes() == 0 {
return result, ErrNoNodesAvailable
}
feasibleNodes, diagnosis, err := sched.findNodesThatFitPod(ctx, fwk, state, pod)
if err != nil {
return result, err
}
trace.Step("Computing predicates done")
if len(feasibleNodes) == 0 {
return result, &framework.FitError{
Pod: pod,
NumAllNodes: sched.nodeInfoSnapshot.NumNodes(),
Diagnosis: diagnosis,
}
}
Predict phase: find all nodes that satisfy the scheduling constraints (feasibleNodes); the rest are filtered out directly:
feasibleNodes, diagnosis, err := sched.findNodesThatFitPod(ctx, fwk, state, pod)
if err != nil {
return result, err
}
trace.Step("Computing predicates done")
if len(feasibleNodes) == 0 {
return result, &framework.FitError{
Pod: pod,
NumAllNodes: sched.nodeInfoSnapshot.NumNodes(),
Diagnosis: diagnosis,
}
}
If the Predict phase finds only one node, just use it:
// When only one node after predicate, just use it.
if len(feasibleNodes) == 1 {
return ScheduleResult{
SuggestedHost: feasibleNodes[0].Name,
EvaluatedNodes: 1 + len(diagnosis.NodeToStatusMap),
FeasibleNodes: 1,
}, nil
}
Priority phase: score the feasible nodes and pick the highest-scoring, i.e. the best, node:
priorityList, err := prioritizeNodes(ctx, sched.Extenders, fwk, state, pod, feasibleNodes)
if err != nil {
return result, err
}
host, err := selectHost(priorityList)
trace.Step("Prioritizing done")
return ScheduleResult{
SuggestedHost: host,
EvaluatedNodes: len(feasibleNodes) + len(diagnosis.NodeToStatusMap),
FeasibleNodes: len(feasibleNodes),
}, err
Predict and Priority
- Predict and Priority are the two key steps for choosing a node; under the hood they call the various algorithm plugins.
- The NodeName matching mentioned earlier belongs to the Predict phase.
- The chosen host is written into the pod's spec.nodeName, assuming the pod is assigned to that node.
- AssumePod on the scheduler Cache is then called as a check; if it errors, validation fails.
// assume signals to the cache that a pod is already in the cache, so that binding can be asynchronous.
// assume modifies `assumed`.
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
// Optimistically assume that the binding will succeed and send it to apiserver
// in the background.
// If the binding fails, scheduler will release resources allocated to assumed pod
// immediately.
assumed.Spec.NodeName = host
if err := sched.Cache.AssumePod(assumed); err != nil {
klog.ErrorS(err, "Scheduler cache AssumePod failed")
return err
}
// if "assumed" is a nominated pod, we should remove it from internal cache
if sched.SchedulingQueue != nil {
sched.SchedulingQueue.DeleteNominatedPodIfExists(assumed)
}
return nil
}
AssumePod looks the pod up in the cache by its key, which is derived from the pod UID; under normal circumstances the pod is not in the cache yet:
func (cache *cacheImpl) AssumePod(pod *v1.Pod) error {
key, err := framework.GetPodKey(pod)
if err != nil {
return err
}
cache.mu.Lock()
defer cache.mu.Unlock()
if _, ok := cache.podStates[key]; ok {
return fmt.Errorf("pod %v is in the cache, so can't be assumed", key)
}
return cache.addPod(pod, true)
}
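The key used above comes from framework.GetPodKey, which is simply the pod's UID. A minimal sketch of that helper, paraphrased from memory rather than quoted verbatim from the source:
package framework

import (
	"errors"

	v1 "k8s.io/api/core/v1"
)

// GetPodKey returns the pod UID as the cache key; a pod without a UID
// cannot be cached, so that case is an error.
func GetPodKey(pod *v1.Pod) (string, error) {
	uid := string(pod.UID)
	if len(uid) == 0 {
		return "", errors.New("cannot get cache key for pod with empty UID")
	}
	return uid, nil
}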
cache.addPod(pod, true) then records the pod's information on the corresponding node entry in the cache:
// Assumes that lock is already acquired.
func (cache *cacheImpl) addPod(pod *v1.Pod, assumePod bool) error {
key, err := framework.GetPodKey(pod)
if err != nil {
return err
}
n, ok := cache.nodes[pod.Spec.NodeName]
if !ok {
n = newNodeInfoListItem(framework.NewNodeInfo())
cache.nodes[pod.Spec.NodeName] = n
}
n.info.AddPod(pod)
cache.moveNodeInfoToHead(pod.Spec.NodeName)
ps := &podState{
pod: pod,
}
cache.podStates[key] = ps
if assumePod {
cache.assumedPods.Insert(key)
}
return nil
}
AddPodInfo updates the NodeInfo, accumulating the newly arrived pod's resource requests and bookkeeping:
// AddPodInfo adds pod information to this NodeInfo.
// Consider using this instead of AddPod if a PodInfo is already computed.
func (n *NodeInfo) AddPodInfo(podInfo *PodInfo) {
res, non0CPU, non0Mem := calculateResource(podInfo.Pod)
n.Requested.MilliCPU += res.MilliCPU
n.Requested.Memory += res.Memory
n.Requested.EphemeralStorage += res.EphemeralStorage
if n.Requested.ScalarResources == nil && len(res.ScalarResources) > 0 {
n.Requested.ScalarResources = map[v1.ResourceName]int64{}
}
for rName, rQuant := range res.ScalarResources {
n.Requested.ScalarResources[rName] += rQuant
}
n.NonZeroRequested.MilliCPU += non0CPU
n.NonZeroRequested.Memory += non0Mem
n.Pods = append(n.Pods, podInfo)
if podWithAffinity(podInfo.Pod) {
n.PodsWithAffinity = append(n.PodsWithAffinity, podInfo)
}
if podWithRequiredAntiAffinity(podInfo.Pod) {
n.PodsWithRequiredAntiAffinity = append(n.PodsWithRequiredAntiAffinity, podInfo)
}
// Consume ports when pods added.
n.updateUsedPorts(podInfo.Pod, true)
n.updatePVCRefCounts(podInfo.Pod, true)
n.Generation = nextGeneration()
}
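The non0CPU/non0Mem values above come from calculateResource, which substitutes small default requests for containers that declare none, so that request-less pods still occupy some capacity in the scheduler's accounting. An illustrative sketch of that defaulting rule (the constants match the values the scheduler has used historically, but treat them as assumptions rather than a quote of the source):
package main

import "fmt"

const (
	defaultMilliCPURequest int64 = 100               // 0.1 core
	defaultMemoryRequest   int64 = 200 * 1024 * 1024 // 200 MiB
)

// nonZeroRequest returns the requested values, falling back to the defaults
// when a container did not specify any request at all.
func nonZeroRequest(requestedMilliCPU, requestedMemory int64) (int64, int64) {
	cpu, mem := requestedMilliCPU, requestedMemory
	if cpu == 0 {
		cpu = defaultMilliCPURequest
	}
	if mem == 0 {
		mem = defaultMemoryRequest
	}
	return cpu, mem
}

func main() {
	cpu, mem := nonZeroRequest(0, 0)
	fmt.Printf("a request-less container is accounted as %dm CPU / %d bytes memory\n", cpu, mem)
}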
bind attaches the assumed pod to the chosen node; extenders get the first chance to bind, and otherwise the framework's bind plugins run:
// bind binds a pod to a given node defined in a binding object.
// The precedence for binding is: (1) extenders and (2) framework plugins.
// We expect this to run asynchronously, so we handle binding metrics internally.
func (sched *Scheduler) bind(ctx context.Context, fwk framework.Framework, assumed *v1.Pod, targetNode string, state *framework.CycleState) (err error) {
defer func() {
sched.finishBinding(fwk, assumed, targetNode, err)
}()
bound, err := sched.extendersBinding(assumed, targetNode)
if bound {
return err
}
bindStatus := fwk.RunBindPlugins(ctx, state, assumed, targetNode)
if bindStatus.IsSuccess() {
return nil
}
if bindStatus.Code() == framework.Error {
return bindStatus.AsError()
}
return fmt.Errorf("bind status: %s, %v", bindStatus.Code().String(), bindStatus.Message())
}
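Before looking at the extender path, it helps to know what the framework's bind step ultimately boils down to: creating a v1.Binding for the pod against the pods/binding subresource. The helper below (bindPodToNode is a made-up name) is an illustrative sketch using client-go, not the scheduler's own code, though the in-tree DefaultBinder plugin performs essentially this call:
package bindutil

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bindPodToNode issues the pods/binding subresource request that pairs the
// pod with the chosen node.
func bindPodToNode(ctx context.Context, cs kubernetes.Interface, pod *v1.Pod, node string) error {
	binding := &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Namespace: pod.Namespace, Name: pod.Name, UID: pod.UID},
		Target:     v1.ObjectReference{Kind: "Node", Name: node},
	}
	return cs.CoreV1().Pods(pod.Namespace).Bind(ctx, binding, metav1.CreateOptions{})
}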
The underlying request issued by an HTTP extender:
// Bind delegates the action of binding a pod to a node to the extender.
func (h *HTTPExtender) Bind(binding *v1.Binding) error {
var result extenderv1.ExtenderBindingResult
if !h.IsBinder() {
// This shouldn't happen as this extender wouldn't have become a Binder.
return fmt.Errorf("unexpected empty bindVerb in extender")
}
req := &extenderv1.ExtenderBindingArgs{
PodName: binding.Name,
PodNamespace: binding.Namespace,
PodUID: binding.UID,
Node: binding.Target.Name,
}
if err := h.send(h.bindVerb, req, &result); err != nil {
return err
}
if result.Error != "" {
return fmt.Errorf(result.Error)
}
return nil
}
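h.send marshals the args to JSON and POSTs them to the URL configured as the extender's bindVerb. For context, here is a sketch of what the receiving side of such a request could look like, assuming the extender API types from k8s.io/kube-scheduler/extender/v1; the /bind path and the port are assumptions, since the real URL comes from the extender configuration:
package main

import (
	"encoding/json"
	"log"
	"net/http"

	extenderv1 "k8s.io/kube-scheduler/extender/v1"
)

// bindHandler decodes the scheduler's ExtenderBindingArgs and replies with an
// ExtenderBindingResult; a non-empty Error field tells the scheduler the bind failed.
func bindHandler(w http.ResponseWriter, r *http.Request) {
	var args extenderv1.ExtenderBindingArgs
	result := extenderv1.ExtenderBindingResult{}
	if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
		result.Error = err.Error()
	} else {
		// A real extender would do its own binding work for
		// args.PodNamespace/args.PodName on args.Node here.
		log.Printf("bind %s/%s (uid=%s) to node %s", args.PodNamespace, args.PodName, args.PodUID, args.Node)
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(&result)
}

func main() {
	http.HandleFunc("/bind", bindHandler)
	log.Fatal(http.ListenAndServe(":8888", nil))
}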
Next, open the scheduler's plugin directory:
D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\plugins
It contains one directory per filter-style plugin. Among them is the nodename plugin, located at D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\plugins\nodename\node_name.go
// Filter invoked at the filter extension point.
func (pl *NodeName) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
if nodeInfo.Node() == nil {
return framework.NewStatus(framework.Error, "node not found")
}
if !Fits(pod, nodeInfo) {
return framework.NewStatus(framework.UnschedulableAndUnresolvable, ErrReason)
}
return nil
}
// Fits actually checks if the pod fits the node.
func Fits(pod *v1.Pod, nodeInfo *framework.NodeInfo) bool {
return len(pod.Spec.NodeName) == 0 || pod.Spec.NodeName == nodeInfo.Node().Name
}
// New initializes a new plugin and returns it.
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
return &NodeName{}, nil
}
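A quick usage sketch of the Fits rule above: an empty spec.nodeName fits any node, while a set one must match the node's name exactly. The node and pod names below are made up:
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
	"k8s.io/kubernetes/pkg/scheduler/framework/plugins/nodename"
)

func main() {
	nodeInfo := framework.NewNodeInfo()
	nodeInfo.SetNode(&v1.Node{ObjectMeta: metav1.ObjectMeta{Name: "k8s-worker02"}})

	unpinned := &v1.Pod{}                                         // no spec.nodeName: fits everywhere
	pinned := &v1.Pod{Spec: v1.PodSpec{NodeName: "k8s-worker01"}} // pinned to a different node

	fmt.Println(nodename.Fits(unpinned, nodeInfo)) // true
	fmt.Println(nodename.Fits(pinned, nodeInfo))   // false
}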
Tracing upward shows where the plugin is registered, in D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\plugins\registry.go
// NewInTreeRegistry builds the registry with all the in-tree plugins.
// A scheduler that runs out of tree plugins can register additional plugins
// through the WithFrameworkOutOfTreeRegistry option.
func NewInTreeRegistry() runtime.Registry {
fts := plfeature.Features{
EnablePodDisruptionBudget: feature.DefaultFeatureGate.Enabled(features.PodDisruptionBudget),
EnableReadWriteOncePod: feature.DefaultFeatureGate.Enabled(features.ReadWriteOncePod),
EnableVolumeCapacityPriority: feature.DefaultFeatureGate.Enabled(features.VolumeCapacityPriority),
EnableMinDomainsInPodTopologySpread: feature.DefaultFeatureGate.Enabled(features.MinDomainsInPodTopologySpread),
}
return runtime.Registry{
selectorspread.Name: selectorspread.New,
imagelocality.Name: imagelocality.New,
tainttoleration.Name: tainttoleration.New,
nodename.Name: nodename.New,
Tracing further up, NewInTreeRegistry is called from the scheduler's New function:
registry := frameworkplugins.NewInTreeRegistry()
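Out-of-tree plugins do not need to be added to NewInTreeRegistry; as the comment above says, they are merged in through the WithFrameworkOutOfTreeRegistry option. A common way to do that is to build your own scheduler binary with app.NewSchedulerCommand. The import path example.com/scheduler-plugins/minimalfilter is a hypothetical module; a sketch of such a plugin follows the RunFilterPlugins walkthrough below:
package main

import (
	"os"

	"k8s.io/kubernetes/cmd/kube-scheduler/app"

	"example.com/scheduler-plugins/minimalfilter" // hypothetical out-of-tree plugin
)

func main() {
	// WithPlugin registers minimalfilter.New under minimalfilter.Name in the
	// out-of-tree registry before the scheduler framework is built.
	command := app.NewSchedulerCommand(
		app.WithPlugin(minimalfilter.Name, minimalfilter.New),
	)
	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}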
Filter calls Fits to decide whether pod.Spec.NodeName is unset or equal to the target node's name.
Tracing how Filter is invoked leads to RunFilterPlugins, which loops over the configured filter plugins, in D:\Workspace\Go\src\k8s.io\kubernetes\pkg\scheduler\framework\runtime\framework.go
// RunFilterPlugins runs the set of configured Filter plugins for pod on
// the given node. If any of these plugins doesn't return "Success", the
// given node is not suitable for running pod.
// Meanwhile, the failure message and status are set for the given node.
func (f *frameworkImpl) RunFilterPlugins(
ctx context.Context,
state *framework.CycleState,
pod *v1.Pod,
nodeInfo *framework.NodeInfo,
) framework.PluginToStatus {
statuses := make(framework.PluginToStatus)
for _, pl := range f.filterPlugins {
pluginStatus := f.runFilterPlugin(ctx, pl, state, pod, nodeInfo)
if !pluginStatus.IsSuccess() {
if !pluginStatus.IsUnschedulable() {
// Filter plugins are not supposed to return any status other than
// Success or Unschedulable.
errStatus := framework.AsStatus(fmt.Errorf("running %q filter plugin: %w", pl.Name(), pluginStatus.AsError())).WithFailedPlugin(pl.Name())
return map[string]*framework.Status{pl.Name(): errStatus}
}
pluginStatus.SetFailedPlugin(pl.Name())
statuses[pl.Name()] = pluginStatus
}
}
return statuses
}
func (f *frameworkImpl) runFilterPlugin(ctx context.Context, pl framework.FilterPlugin, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
if !state.ShouldRecordPluginMetrics() {
return pl.Filter(ctx, state, pod, nodeInfo)
}
startTime := time.Now()
status := pl.Filter(ctx, state, pod, nodeInfo)
f.metricsRecorder.observePluginDurationAsync(Filter, pl.Name(), status, metrics.SinceInSeconds(startTime))
return status
}
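Since RunFilterPlugins simply walks f.filterPlugins and calls Filter on each, any type that satisfies the FilterPlugin interface can take part in the Predict stage. Below is a minimal sketch of a custom filter, modelled on the NodeName plugin above; the plugin name, package, and label key are made up for the example (this is the minimalfilter plugin referenced in the registration sketch earlier):
package minimalfilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const Name = "MinimalFilter"

type MinimalFilter struct{}

var _ framework.FilterPlugin = &MinimalFilter{}

func (pl *MinimalFilter) Name() string { return Name }

// Filter rejects nodes that carry a hypothetical "example.com/quarantined" label.
func (pl *MinimalFilter) Filter(ctx context.Context, _ *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	if _, quarantined := node.Labels["example.com/quarantined"]; quarantined {
		return framework.NewStatus(framework.Unschedulable, "node is quarantined")
	}
	return nil
}

// New is the factory that would be registered with the framework (see the registry above).
func New(_ runtime.Object, _ framework.Handle) (framework.Plugin, error) {
	return &MinimalFilter{}, nil
}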
Following the trace to its origin, findNodesThatPassFilters (also in schedule_one.go) is what ends up calling RunFilterPluginsWithNominatedPods:
feasibleNodes, err := sched.findNodesThatPassFilters(ctx, fwk, state, pod, diagnosis, nodes)
// findNodesThatPassFilters finds the nodes that fit the filter plugins.
func (sched *Scheduler) findNodesThatPassFilters(
ctx context.Context,
fwk framework.Framework,
state *framework.CycleState,
pod *v1.Pod,
diagnosis framework.Diagnosis,
nodes []*framework.NodeInfo) ([]*v1.Node, error) {
// (elided) the full body fans the work out across the candidate nodes in
// parallel; for each node a checkNode function runs the filter plugins, essentially:
status := fwk.RunFilterPluginsWithNominatedPods(ctx, state, pod, nodeInfo)
}
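The elided body fans filtering out across nodes in parallel. The scheduler's internal parallelize helper follows the same pattern as client-go's workqueue.ParallelizeUntil; the sketch below shows that pattern only and is not the scheduler's actual code (16 matches the scheduler's default parallelism, but here it is just a number, and the node names and fake filter are made up):
package main

import (
	"context"
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d"}

	var mu sync.Mutex
	feasible := []string{}

	// checkNode stands in for the scheduler's per-node filter run.
	checkNode := func(i int) {
		name := nodes[i]
		if fits := name != "node-c"; fits { // pretend node-c fails a filter
			mu.Lock()
			feasible = append(feasible, name)
			mu.Unlock()
		}
	}

	// Run checkNode for every node index with up to 16 workers.
	workqueue.ParallelizeUntil(context.Background(), 16, len(nodes), checkNode)
	fmt.Println("feasible nodes:", feasible)
}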