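kspan: A Cluster Measurement Approach

This post is not original work. It draws on the articles below, but the steps and explanations here are closer to day-to-day operations: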

  • https://mp.weixin.qq.com/s/8A8YDAQd67YACnbZiN6Q5g
  • https://felipecruz.es/visualizing-kubernetes-events-with-kspan/

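Background

As cluster administrators, especially when we manage many clusters, we want to know which stages a pod goes through from creation to startup and how long each stage takes. With that information we can work out where the cluster is slow.

Before reaching for a visualization tool, we can inspect events to estimate how long each step takes: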

$ kubectl create deploy nginx --image=nginx
deployment.apps/nginx created
$ kubectl get event
LAST SEEN   TYPE     REASON              OBJECT                       MESSAGE
7s          Normal   Scheduled           pod/nginx-f89759699-whcxz    Successfully assigned default/nginx-f89759699-whcxz to hd-k8s-master003
7s          Normal   Pulling             pod/nginx-f89759699-whcxz    Pulling image "nginx"
7s          Normal   SuccessfulCreate    replicaset/nginx-f89759699   Created pod: nginx-f89759699-whcxz
7s          Normal   ScalingReplicaSet   deployment/nginx             Scaled up replica set nginx-f89759699 to 1

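The events trace the pod through scheduling, image pull, container creation, and start, and give a rough idea of how long each step took. (The listing above was captured early, so it stops at Pulling.)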
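
The "LAST SEEN" column only gives relative ages, so gaps between steps are hard to read. As a rough sketch, the helper below turns timestamped event lines into per-step gaps. The function name `event_gaps` is hypothetical, it assumes GNU `date`, and it assumes every event carries a `lastTimestamp` (events written through the newer events API may show `<none>` instead).

```shell
# Print the gap in seconds between consecutive events.
# Input on stdin: lines of "<RFC3339 timestamp> <reason>", oldest first.
event_gaps() {
  local prev="" ts reason now
  while read -r ts reason; do
    now=$(date -u -d "$ts" +%s)   # GNU date: timestamp -> epoch seconds
    if [ -n "$prev" ]; then
      echo "$reason +$((now - prev))s"
    else
      echo "$reason +0s"          # first event is the reference point
    fi
    prev=$now
  done
}

# Against a live cluster (requires kubectl access):
#   kubectl get events --sort-by=.lastTimestamp --no-headers \
#     -o custom-columns=TIME:.lastTimestamp,REASON:.reason | event_gaps
```

Each output line shows an event reason and the seconds elapsed since the previous event, which is the same information kspan later presents as span durations.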

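A more elegant approach

Each of these Kubernetes events corresponds to an operation we performed. Above, creating a deployment produced several events, including Scheduled, Pulled, and Created. Viewed abstractly, this looks a lot like distributed tracing.

This is where Jaeger[1] comes in. Jaeger is a CNCF graduated project: an open-source, end-to-end distributed tracing system, which the author of the original article has covered several times in the K8S生态周报 series. OpenTelemetry[2], a CNCF sandbox-level project when the original article was written, is an observability framework for cloud-native software that can be combined with Jaeger. Neither project is the focus of this post, so we simply follow the documentation to get a Jaeger instance running quickly.

The main project in this post is kspan, open-sourced by Weaveworks. Its core idea is to organize Kubernetes events as spans in a tracing system.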

Deploying kspan
Create the RBAC authorization first, because kspan needs to watch events and the resources they refer to:

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kspan
  
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kspan-admin
rules:
- apiGroups:
  - ""
  resources:
  - configmaps
  - endpoints
  - persistentvolumeclaims
  - persistentvolumeclaims/status
  - pods
  - replicationcontrollers
  - replicationcontrollers/scale
  - serviceaccounts
  - services
  - services/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - bindings
  - events
  - limitranges
  - namespaces/status
  - pods/log
  - pods/status
  - replicationcontrollers/status
  - resourcequotas
  - resourcequotas/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - create
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - controllerrevisions
  - daemonsets
  - daemonsets/status
  - deployments
  - deployments/scale
  - deployments/status
  - replicasets
  - replicasets/scale
  - replicasets/status
  - statefulsets
  - statefulsets/scale
  - statefulsets/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - autoscaling
  resources:
  - horizontalpodautoscalers
  - horizontalpodautoscalers/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - cronjobs
  - cronjobs/status
  - jobs
  - jobs/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - extensions
  resources:
  - daemonsets
  - daemonsets/status
  - deployments
  - deployments/scale
  - deployments/status
  - ingresses
  - ingresses/status
  - networkpolicies
  - replicasets
  - replicasets/scale
  - replicasets/status
  - replicationcontrollers/scale
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - policy
  resources:
  - poddisruptionbudgets
  - poddisruptionbudgets/status
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - networking.k8s.io
  resources:
  - ingresses
  - ingresses/status
  - networkpolicies
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  - nodes
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - metrics.k8s.io
  resources:
  - pods
  verbs:
  - get
  - list
  - watch

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: kspan-admin
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kspan-admin
subjects:
- kind: ServiceAccount
  name: kspan
  namespace: default
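
After applying the manifests above, it can help to confirm that the binding took effect before starting kspan. The sketch below uses `kubectl auth can-i` with ServiceAccount impersonation. The helper names and the `KUBECTL` override are hypothetical conveniences, and the file name in the comment is only an example.

```shell
KUBECTL=${KUBECTL:-kubectl}   # set KUBECTL=echo for a dry run without a cluster

# Impersonation subject for a ServiceAccount, in the form accepted by --as.
sa_subject() { printf 'system:serviceaccount:%s:%s' "$1" "$2"; }

# Ask the API server whether the ServiceAccount may get, list, and watch events
# in all namespaces. Each call prints "yes" or "no".
check_event_access() {
  local ns=${1:-default} name=${2:-kspan} verb
  for verb in get list watch; do
    $KUBECTL auth can-i "$verb" events --all-namespaces \
      --as="$(sa_subject "$ns" "$name")"
  done
}

# Example (kspan-rbac.yaml is a hypothetical file holding the manifests above):
#   kubectl apply -f kspan-rbac.yaml
#   check_event_access default kspan
```

Three "yes" answers indicate that the ClusterRoleBinding reaches the kspan ServiceAccount in the default namespace.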

Create the pod:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: kspan
  name: kspan
spec:
  containers:
  - image: docker.io/weaveworks/kspan:v0.0
    name: kspan
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  serviceAccountName: kspan

Deploy Jaeger:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: jaeger
  name: jaeger
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  strategy: {}
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - image: jaegertracing/opentelemetry-all-in-one
        name: opentelemetry-all-in-one
        resources: {}
        ports:
        - containerPort: 16685
        - containerPort: 16686
        - containerPort: 5775
          protocol: UDP
        - containerPort: 6831
          protocol: UDP
        - containerPort: 6832
          protocol: UDP
        - containerPort: 5778
          protocol: TCP

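Create the Jaeger Service. By default, kspan sends spans to otlp-collector.default:55680, so the Service is named otlp-collector and created in the default namespace: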

apiVersion: v1
kind: Service
metadata:
  labels:
    app: jaeger
  name: otlp-collector
spec:
  ports:
  - port: 55680
    protocol: TCP
    targetPort: 55680
  selector:
    app: jaeger

Once all the pods have started successfully, we can test the setup.

Results

Create a namespace and a deployment:
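
One way to check that everything has started is to block on readiness instead of polling `kubectl get pods` by hand. This is a minimal sketch that assumes the object names from the manifests above (pod `kspan`, deployment `jaeger`) in the current namespace. The function name and the `KUBECTL` override are hypothetical.

```shell
KUBECTL=${KUBECTL:-kubectl}   # set KUBECTL=echo for a dry run without a cluster

# Block until the kspan pod is Ready and the jaeger deployment has rolled out.
wait_for_stack() {
  local timeout=${1:-120s}
  $KUBECTL wait --for=condition=Ready pod/kspan --timeout="$timeout" &&
    $KUBECTL rollout status deployment/jaeger --timeout="$timeout"
}

# wait_for_stack 120s
```

If either command times out, `kubectl describe` on the affected object usually shows the reason in its events.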

$ kubectl create ns moelove
namespace/moelove created
$ kubectl -n moelove create deploy nginx --image=nginx
deployment.apps/nginx created

Open the Jaeger UI and inspect the resulting traces.
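
The Service above only exposes the collector port, so the UI is not reachable from outside the cluster by default. One option, sketched here, is a local port-forward. It assumes the UI listens on container port 16686, as listed in the Deployment. The function name and the `KUBECTL` override are hypothetical.

```shell
KUBECTL=${KUBECTL:-kubectl}   # set KUBECTL=echo for a dry run without a cluster

# Forward a local port to the Jaeger UI port (16686) of the jaeger deployment.
forward_jaeger_ui() {
  local local_port=${1:-16686}
  $KUBECTL port-forward deployment/jaeger "${local_port}:16686"
}

# forward_jaeger_ui          # then browse to http://localhost:16686
```

The command stays in the foreground while forwarding. Stop it with Ctrl-C when finished.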


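Timing details for pod creation

Conclusion

kspan's open-source repository does not currently provide a customizable deployment option, or at least I could not find detailed documentation for one. I therefore do not recommend running kspan as a standing component of a Kubernetes cluster. Deploy it when the need arises, look at how long a rollout takes, and find the bottleneck.

If you run a multi-tenant environment and need alerting on conditions such as slow scheduling, OpenTelemetry is worth investigating.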
