琦彦

Kubernetes高可用性监控：Thanos的部署

原文发表于kubernetes中文社区，为作者原创翻译，原文地址

更多kubernetes文章，请多关注kubernetes中文社区

介绍

对Prometheus高可用性的需求

实现

Thanos架构

Thanos包含以下组件：

重复数据删除

Thanos搭建

先决条件

组件部署

部署Prometheus中的ServiceAccount资源对象，分别创建Clusterrole和Clusterrolebinding

部署Prometheus配置文件configmap.yaml

部署prometheus-rules的configmap 这将创建我们的警报规则，该警报规则将中继到alertmanager进行交付

部署Prometheus的StatefulSet资源

部署Prometheus服务

部署Thanos Querier

部署Thanos Store Gateway

部署Thanos Ruler

部署Alertmanager

部署Kubestate指标

部署Node-Exporter Daemonset

部署Grafana

部署Ingress对象

结论

介绍

对Prometheus高可用性的需求

在过去的几个月中，Kubernetes的采用已经增长了很多倍，现在很明显，Kubernetes是容器编排的事实标准。

同时，监视是任何基础架构的重要方面。Prometheus被认为是监视容器应用和非容器应用的绝佳选择。我们应该确保监视系统具有高可用性和高度可扩展性，以适应不断增长的基础架构的需求，尤其是在Kubernetes的情况下。

因此，今天，我们将部署Prometheus集群，它不仅可以抵抗节点故障，而且还可以确保数据归档以备将来参考。我们的集群也具有很大的可扩展性，以至于我们可以在同一监控系统内跨越多个Kubernetes集群。

当前方案

大多数Prometheus部署都使用具备持久性存储的Pod，而Prometheus使用联邦进行扩展。但是，并非所有数据都可以使用联邦进行聚合，在添加其他服务器时，通常需要一种机制来管理Prometheus配置。

解决方案

Thanos旨在解决上述问题。在Thanos的帮助下，我们不仅可以扩展Prometheus实例，并能够消除重复数据，还可以将数据归档在GCS或S3等持久性存储中。

实现

Thanos架构

Thanos包含以下组件：

Thanos Sidecar：这是在Prometheus运行的主要组件。它读取并存储object store中的数据。此外，它管理Prometheus的配置和生命周期。为了区分每个Prometheus实例，Sidecar组件将外部标签注入Prometheus配置中。Sidecar组件能够在Prometheus服务器的PromQL接口上运行查询。Sidecar组件还侦听Thanos gRPC协议，并在gRPC查询和REST查询之间转换。
Thanos Store：此组件在object store中的历史数据之上实现Store API。它主要充当API网关，因此不需要大量的本地磁盘空间。它在启动时加入Thanos集群，并公布它可以访问的数据。它会在本地磁盘上保留有关所有远程块的少量信息，并使它与object store保持同步。通常，在重新启动时可以安全地删除此数据，但会增加启动时间。
Thanos Query：是个查询组件，负责侦听HTTP并将查询转换为Thanos gRPC格式。它汇总了来自不同来源的查询结果，并且可以从Sidecar和Store中读取数据。在高可用性设置中，它甚至可以对重复数据进行删除。

重复数据删除

Prometheus是有状态的，不允许复制其数据库。这意味着通过运行多个Prometheus副本来增强高可用性并不是最佳选择。

简单的负载平衡将不起作用，例如，在发生崩溃后，副本可能已启动，但是查询此类副本将导致其关闭期间的间隙很小。你有第二个副本可能正在运行，但是又可能在另一时间关闭（例如，滚动重启），因此在这些副本上进行负载平衡将无法正常工作。

相反，Thanos Querier从两个副本中提取数据，并对这些信号进行重复数据删除，从而帮助Querier使用者填补了空白。
Thanos Compact：是Thanos的压缩器组件，它采用Prometheus 2.0存储引擎的压缩过程，来阻止数据存储在object store中。通常，它以单例方式部署。它还负责数据的向下采样(downsampling)-40小时后执行5m的向下采样，而10天后执行1h的向下采样。
Thanos Ruler：它基本上与Prometheus的rules具有相同的作用。唯一的区别是它可以与Thanos组件进行通信。

Thanos搭建

先决条件

为了完全理解本教程，需要以下内容：

Kubernetes的工作原理和熟练使用Kubectl
Kubernetes集群至少有3个节点（在本演示中，使用GKE集群）
实现Ingress Controller和Ingress对象（出于演示目的，使用Nginx Ingress Controller）。尽管这不是强制性的，但强烈建议你这样做以减少外部端点创建的数量。
创建供Thanos组件访问object store的凭证（在本例中为GCS存储）
创建2个GCS存储，并将其命名为prometheus-long-term和thanos-ruler
创建一个角色为“ 存储对象管理员”的服务帐户
将密钥文件下载保存为JSON格式，并将其命名为thanos-gcs-credentials.json
使用secret创建kubernetes密钥 kubectl create secret generic thanos-gcs-credentials --from-file=thanos-gcs-credentials.json -n monitoring

组件部署

部署Prometheus中的ServiceAccount资源对象，分别创建Clusterrole和Clusterrolebinding

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: monitoring
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: monitoring
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: monitoring
subjects:
  - kind: ServiceAccount
    name: monitoring
    namespace: monitoring
roleRef:
  kind: ClusterRole
  Name: monitoring
  apiGroup: rbac.authorization.k8s.io
---

上述清单创建监控命名空间和服务ServiceAccount。

部署Prometheus配置文件configmap.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-server-conf
  labels:
    name: prometheus-server-conf
  namespace: monitoring
data:
  prometheus.yaml.tmpl: |-
    global:
      scrape_interval: 5s
      evaluation_interval: 5s
      external_labels:
        cluster: prometheus-ha
        # Each Prometheus has to have unique labels.
        replica: $(POD_NAME)

    rule_files:
      - /etc/prometheus/rules/*rules.yaml

    alerting:

      # We want our alerts to be deduplicated
      # from different replicas.
      alert_relabel_configs:
      - regex: replica
        action: labeldrop

      alertmanagers:
        - scheme: http
          path_prefix: /
          static_configs:
            - targets: ['alertmanager:9093']

    scrape_configs:
    - job_name: kubernetes-nodes-cadvisor
      scrape_interval: 10s
      scrape_timeout: 10s
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
        - role: node
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_node_label_(.+)
        # Only for Kubernetes ^1.7.3.
        # See: https://github.com/prometheus/prometheus/issues/2916
        - target_label: __address__
          replacement: kubernetes.default.svc:443
        - source_labels: [__meta_kubernetes_node_name]
          regex: (.+)
          target_label: __metrics_path__
          replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
      metric_relabel_configs:
        - action: replace
          source_labels: [id]
          regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
          target_label: rkt_container_name
          replacement: '${2}-${1}'
        - action: replace
          source_labels: [id]
          regex: '^/system\.slice/(.+)\.service$'
          target_label: systemd_service_name
          replacement: '${1}'

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_pod_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_pod_name]
          action: replace
          target_label: kubernetes_pod_name
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_pod_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: $1:$2


    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
        - role: endpoints
      scheme: https 
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https

    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
        - role: endpoints
      relabel_configs:
        - action: labelmap
          regex: __meta_kubernetes_service_label_(.+)
        - source_labels: [__meta_kubernetes_namespace]
          action: replace
          target_label: kubernetes_namespace
        - source_labels: [__meta_kubernetes_service_name]
          action: replace
          target_label: kubernetes_name
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
          action: replace
          target_label: __scheme__
          regex: (https?)
        - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
          action: replace
          target_label: __metrics_path__
          regex: (.+)
        - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
          action: replace
          target_label: __address__
          regex: (.+)(?::\d+);(\d+)
          replacement: $1:$2

上面的Configmap创建Prometheus配置文件模板。Thanos sidecar组件将读取此配置文件模板，并将生成实际的配置文件，而该配置文件又将由在同一容器中运行的Prometheus容器使用。

在配置文件中添加external_labels部分非常重要，以使Querier可以基于该部分对重复数据进行删除。

部署prometheus-rules的configmap 这将创建我们的警报规则，该警报规则将中继到alertmanager进行交付

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  labels:
    name: prometheus-rules
  namespace: monitoring
data:
  alert-rules.yaml: |-
    groups:
      - name: Deployment
        rules:
        - alert: Deployment at 0 Replicas
          annotations:
            summary: Deployment {
    {$labels.deployment}} in {
    {$labels.namespace}} is currently having no pods running
          expr: |
            sum(kube_deployment_status_replicas{pod_template_hash=""}) by (deployment,namespace)  < 1
          for: 1m
          labels:
            team: devops

        - alert: HPA Scaling Limited  
          annotations: 
            summary: HPA named {
    {$labels.hpa}} in {
    {$labels.namespace}} namespace has reached scaling limited state
          expr: | 
            (sum(kube_hpa_status_condition{condition="ScalingLimited",status="true"}) by (hpa,namespace)) == 1
          for: 1m
          labels: 
            team: devops

        - alert: HPA at MaxCapacity 
          annotations: 
            summary: HPA named {
    {$labels.hpa}} in {
    {$labels.namespace}} namespace is running at Max Capacity
          expr: | 
            ((sum(kube_hpa_spec_max_replicas) by (hpa,namespace)) - (sum(kube_hpa_status_current_replicas) by (hpa,namespace))) == 0
          for: 1m
          labels: 
            team: devops

      - name: Pods
        rules:
        - alert: Container restarted
          annotations:
            summary: Container named {
    {$labels.container}} in {
    {$labels.pod}} in {
    {$labels.namespace}} was restarted
          expr: |
            sum(increase(kube_pod_container_status_restarts_total{namespace!="kube-system",pod_template_hash=""}[1m])) by (pod,namespace,container) > 0
          for: 0m
          labels:
            team: dev

        - alert: High Memory Usage of Container 
          annotations: 
            summary: Container named {
    {$labels.container}} in {
    {$labels.pod}} in {
    {$labels.namespace}} is using more than 75% of Memory Limit
          expr: | 
            ((( sum(container_memory_usage_bytes{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name)  / sum(container_spec_memory_limit_bytes{image!="",container_name!="POD",namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100 ) < +Inf ) > 75
          for: 5m
          labels: 
            team: dev

        - alert: High CPU Usage of Container 
          annotations: 
            summary: Container named {
    {$labels.container}} in {
    {$labels.pod}} in {
    {$labels.namespace}} is using more than 75% of CPU Limit
          expr: | 
            ((sum(irate(container_cpu_usage_seconds_total{image!="",container_name!="POD", namespace!="kube-system"}[30s])) by (namespace,container_name,pod_name) / sum(container_spec_cpu_quota{image!="",container_name!="POD", namespace!="kube-system"} / container_spec_cpu_period{image!="",container_name!="POD", namespace!="kube-system"}) by (namespace,container_name,pod_name) ) * 100)  > 75
          for: 5m
          labels: 
            team: dev

      - name: Nodes
        rules:
        - alert: High Node Memory Usage
          annotations:
            summary: Node {
    {$labels.kubernetes_io_hostname}} has more than 80% memory used. Plan Capcity
          expr: |
            (sum (container_memory_working_set_bytes{id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum (machine_memory_bytes{}) by (kubernetes_io_hostname) * 100) > 80
          for: 5m
          labels:
            team: devops

        - alert: High Node CPU Usage
          annotations:
            summary: Node {
    {$labels.kubernetes_io_hostname}} has more than 80% allocatable cpu used. Plan Capacity.
          expr: |
            (sum(rate(container_cpu_usage_seconds_total{id="/", container_name!="POD"}[1m])) by (kubernetes_io_hostname) / sum(machine_cpu_cores) by (kubernetes_io_hostname)  * 100) > 80
          for: 5m
          labels:
            team: devops

        - alert: High Node Disk Usage
          annotations:
            summary: Node {
    {$labels.kubernetes_io_hostname}} has more than 85% disk used. Plan Capacity.
          expr: |
            (sum(container_fs_usage_bytes{device=~"^/dev/[sv]d[a-z][1-9]$",id="/",container_name!="POD"}) by (kubernetes_io_hostname) / sum(container_fs_limit_bytes{container_name!="POD",device=~"^/dev/[sv]d[a-z][1-9]$",id="/"}) by (kubernetes_io_hostname)) * 100 > 85
          for: 5m
          labels:
            team: devops

部署Prometheus的StatefulSet资源

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: fast
  namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 3
  serviceName: prometheus-service
  template:
    metadata:
      labels:
        app: prometheus
        thanos-store-api: "true"
    spec:
      serviceAccountName: monitoring
      containers:
        - name: prometheus
          image: prom/prometheus:v2.4.3
          args:
            - "--config.file=/etc/prometheus-shared/prometheus.yaml"
            - "--storage.tsdb.path=/prometheus/"
            - "--web.enable-lifecycle"
            - "--storage.tsdb.no-lockfile"
            - "--storage.tsdb.min-block-duration=2h"
            - "--storage.tsdb.max-block-duration=2h"
          ports:
            - name: prometheus
              containerPort: 9090
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus/
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - "sidecar"
            - "--log.level=debug"
            - "--tsdb.path=/prometheus"
            - "--prometheus.url=http://127.0.0.1:9090"
            - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
            - "--reloader.config-file=/etc/prometheus/prometheus.yaml.tmpl"
            - "--reloader.config-envsubst-file=/etc/prometheus-shared/prometheus.yaml"
            - "--reloader.rule-dir=/etc/prometheus/rules/"
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name : GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
            - name: http-sidecar
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
              httpGet:
                port: 10902
                path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: prometheus-storage
              mountPath: /prometheus
            - name: prometheus-config-shared
              mountPath: /etc/prometheus-shared/
            - name: prometheus-config
              mountPath: /etc/prometheus
            - name: prometheus-rules
              mountPath: /etc/prometheus/rules
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      securityContext:
        fsGroup: 2000
        runAsNonRoot: true
        runAsUser: 1000
      volumes:
        - name: prometheus-config
          configMap:
            defaultMode: 420
            name: prometheus-server-conf
        - name: prometheus-config-shared
          emptyDir: {}
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 20Gi

上面提供的清单，重要的是要了解以下几个方面：

Prometheus部署为具有3个副本的StatefulSet，每个副本动态地配置自己的持久存储卷。
Thanos sidecar容器使用我们在上面创建的模板文件，生成Prometheus配置信息。
Thanos需要处理数据压缩，因此我们需要设置--storage.tsdb.min-block-duration = 2h和--storage.tsdb.max-block-duration = 2h
PrometheusStatefulSet被打上thanos-store-api：true的标签，因此每个headless 服务都会发现每个Pod，我们将在下面的Service资源中创建它。Thanos Querier将使用此headless服务来查询所有Prometheus实例中的数据。我们还将相同的标签(thanos-store-api：true)应用于Thanos Store和Thanos Ruler组件，以便Querier也会发现它们并将其用于查询指标。
使用GOOGLE_APPLICATION_CREDENTIALS环境变量提供了GCS存储凭据路径。这个凭据是我们创建secret获得的。

部署Prometheus服务

apiVersion: v1
kind: Service
metadata: 
  name: prometheus-0-service
  annotations: 
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector: 
    statefulset.kubernetes.io/pod-name: prometheus-0
  ports: 
    - name: prometheus 
      port: 8080
      targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata: 
  name: prometheus-1-service
  annotations: 
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector: 
    statefulset.kubernetes.io/pod-name: prometheus-1
  ports: 
    - name: prometheus 
      port: 8080
      targetPort: prometheus
---
apiVersion: v1
kind: Service
metadata: 
  name: prometheus-2-service
  annotations: 
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
  namespace: monitoring
  labels:
    name: prometheus
spec:
  selector: 
    statefulset.kubernetes.io/pod-name: prometheus-2
  ports: 
    - name: prometheus 
      port: 8080
      targetPort: prometheus
---
#This service creates a srv record for querier to find about store-api's
apiVersion: v1
kind: Service
metadata:
  name: thanos-store-gateway
  namespace: monitoring
spec:
  type: ClusterIP
  clusterIP: None
  ports:
    - name: grpc
      port: 10901
      targetPort: grpc
  selector:
    thanos-store-api: "true"

我们为StatefulSet中的每个Prometheus Pod创建了不同的服务，这不是必需的，这些仅用于调试目的。上面已经解释了headless服务名称为thanos-store-gateway的目的。稍后我们将使用ingress 对象暴露Prometheus服务。

部署Thanos Querier

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-querier
  namespace: monitoring
  labels:
    app: thanos-querier
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-querier
  template:
    metadata:
      labels:
        app: thanos-querier
    spec:
      containers:
      - name: thanos
        image: quay.io/thanos/thanos:v0.8.0
        args:
        - query
        - --log.level=debug
        - --query.replica-label=replica
        - --store=dnssrv+thanos-store-gateway:10901
        ports:
        - name: http
          containerPort: 10902
        - name: grpc
          containerPort: 10901
        livenessProbe:
          httpGet:
            port: http
            path: /-/healthy
        readinessProbe:
          httpGet:
            port: http
            path: /-/ready
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-querier
  name: thanos-querier
  namespace: monitoring
spec:
  ports:
  - port: 9090
    protocol: TCP
    targetPort: http
    name: http
  selector:
    app: thanos-querier

Thanos Querier是Thanos部署的主要组件之一。请注意以下几点：

容器参数--store=dnssrv+thanos-store-gateway:10901有助于从度量标准数据中发现所有组件。
thanos-querier服务提供了一个Web界面来运行PromQL查询。它还可以选择在多个Prometheus群集之间删除重复数据。
Thanos Querier也是Grafana等所有仪表板的数据源。

部署Thanos Store Gateway

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: thanos-store-gateway
  namespace: monitoring
  labels:
    app: thanos-store-gateway
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-store-gateway
  serviceName: thanos-store-gateway
  template:
    metadata:
      labels:
        app: thanos-store-gateway
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
          - "store"
          - "--log.level=debug"
          - "--data-dir=/data"
          - "--objstore.config={type: GCS, config: {bucket: prometheus-long-term}}"
          - "--index-cache-size=500MB"
          - "--chunk-pool-size=500MB"
          env:
            - name : GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
          ports:
          - name: http
            containerPort: 10902
          - name: grpc
            containerPort: 10901
          livenessProbe:
            httpGet:
              port: 10902
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: 10902
              path: /-/ready
          volumeMounts:
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
---

这将创建存储组件，该组件存储服务从object store到Querier的指标信息。

部署Thanos Ruler

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: thanos-ruler-rules
  namespace: monitoring
data:
  alert_down_services.rules.yaml: |
    groups:
    - name: metamonitoring
      rules:
      - alert: PrometheusReplicaDown
        annotations:
          message: Prometheus replica in cluster {
    {$labels.cluster}} has disappeared from Prometheus target discovery.
        expr: |
          sum(up{cluster="prometheus-ha", instance=~".*:9090", job="kubernetes-service-endpoints"}) by (job,cluster) < 3
        for: 15s
        labels:
          severity: critical
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: thanos-ruler
  serviceName: thanos-ruler
  template:
    metadata:
      labels:
        app: thanos-ruler
        thanos-store-api: "true"
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:v0.8.0
          args:
            - rule
            - --log.level=debug
            - --data-dir=/data
            - --eval-interval=15s
            - --rule-file=/etc/thanos-ruler/*.rules.yaml
            - --alertmanagers.url=http://alertmanager:9093
            - --query=thanos-querier:9090
            - "--objstore.config={type: GCS, config: {bucket: thanos-ruler}}"
            - --label=ruler_cluster="prometheus-ha"
            - --label=replica="$(POD_NAME)"
          env:
            - name : GOOGLE_APPLICATION_CREDENTIALS
              value: /etc/secret/thanos-gcs-credentials.json
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          ports:
            - name: http
              containerPort: 10902
            - name: grpc
              containerPort: 10901
          livenessProbe:
            httpGet:
              port: http
              path: /-/healthy
          readinessProbe:
            httpGet:
              port: http
              path: /-/ready
          volumeMounts:
            - mountPath: /etc/thanos-ruler
              name: config
            - name: thanos-gcs-credentials
              mountPath: /etc/secret
              readOnly: false
      volumes:
        - configMap:
            name: thanos-ruler-rules
          name: config
        - name: thanos-gcs-credentials
          secret:
            secretName: thanos-gcs-credentials
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: thanos-ruler
  name: thanos-ruler
  namespace: monitoring
spec:
  ports:
    - port: 9090
      protocol: TCP
      targetPort: http
      name: http
  selector:
    app: thanos-ruler

现在，在与我们的工作负载相同的名称空间中的输入以下命令，能够查看到thanos-store-gateway对应有哪些Pod ：

root@my-shell-95cb5df57-4q6w8:/# nslookup thanos-store-gateway
Server:     10.63.240.10
Address:    10.63.240.10#53

Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.2
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.25.4
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.2
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.30.8
Name:   thanos-store-gateway.monitoring.svc.cluster.local
Address: 10.60.31.2

root@my-shell-95cb5df57-4q6w8:/# exit

上面返回的IP对应于我们的Prometheus中的Pod（thanos-store和thanos-ruler）。

可以通过以下命令验证

$ kubectl get pods -o wide -l thanos-store-api="true"
NAME                     READY   STATUS    RESTARTS   AGE    IP           NODE                              NOMINATED NODE   READINESS GATES
prometheus-0             2/2     Running   0          100m   10.60.31.2   gke-demo-1-pool-1-649cbe02-jdnv              
prometheus-1             2/2     Running   0          14h    10.60.30.2   gke-demo-1-pool-1-7533d618-kxkd              
prometheus-2             2/2     Running   0          31h    10.60.25.2   gke-demo-1-pool-1-4e9889dd-27gc              
thanos-ruler-0           1/1     Running   0          100m   10.60.30.8   gke-demo-1-pool-1-7533d618-kxkd              
thanos-store-gateway-0   1/1     Running   0          14h    10.60.25.4   gke-demo-1-pool-1-4e9889dd-27gc

部署Alertmanager

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: monitoring
data:
  config.yml: |-
    global:
      resolve_timeout: 5m
      slack_api_url: ""
      victorops_api_url: ""

    templates:
    - '/etc/alertmanager-templates/*.tmpl'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 5m  
      receiver: default 
      routes:
      - match:
          team: devops
        receiver: devops
        continue: true 
      - match: 
          team: dev
        receiver: dev
        continue: true

    receivers:
    - name: 'default'

    - name: 'devops'
      victorops_configs:
      - api_key: ''
        routing_key: 'devops'
        message_type: 'CRITICAL'
        entity_display_name: '{
    { .CommonLabels.alertname }}'
        state_message: 'Alert: {
    { .CommonLabels.alertname }}. Summary:{
    { .CommonAnnotations.summary }}. RawData: {
    { .CommonLabels }}'
      slack_configs:
      - channel: '#k8-alerts'
        send_resolved: true


    - name: 'dev'
      victorops_configs:
      - api_key: ''
        routing_key: 'dev'
        message_type: 'CRITICAL'
        entity_display_name: '{
    { .CommonLabels.alertname }}'
        state_message: 'Alert: {
    { .CommonLabels.alertname }}. Summary:{
    { .CommonAnnotations.summary }}. RawData: {
    { .CommonLabels }}'
      slack_configs:
      - channel: '#k8-alerts'
        send_resolved: true

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: alertmanager
  template:
    metadata:
      name: alertmanager
      labels:
        app: alertmanager
    spec:
      containers:
      - name: alertmanager
        image: prom/alertmanager:v0.15.3
        args:
          - '--config.file=/etc/alertmanager/config.yml'
          - '--storage.path=/alertmanager'
        ports:
        - name: alertmanager
          containerPort: 9093
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: alertmanager
          mountPath: /alertmanager
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager
      - name: alertmanager
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/path: '/metrics'
  labels:
    name: alertmanager
  name: alertmanager
  namespace: monitoring
spec:
  selector:
    app: alertmanager
  ports:
  - name: alertmanager
    protocol: TCP
    port: 9093
    targetPort: 9093

alertmanager将根据Prometheus规则生成所有的警报。

部署Kubestate指标

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1 
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  - nodes
  - pods
  - services
  - resourcequotas
  - replicationcontrollers
  - limitranges
  - persistentvolumeclaims
  - persistentvolumes
  - namespaces
  - endpoints
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources:
  - daemonsets
  - deployments
  - replicasets
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources:
  - statefulsets
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources:
  - cronjobs
  - jobs
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources:
  - horizontalpodautoscalers
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
  name: kube-state-metrics
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: kube-state-metrics-resizer
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
# kubernetes versions before 1.8.0 should use rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
  namespace: monitoring
  name: kube-state-metrics-resizer
rules:
- apiGroups: [""]
  resources:
  - pods
  verbs: ["get"]
- apiGroups: ["extensions"]
  resources:
  - deployments
  resourceNames: ["kube-state-metrics"]
  verbs: ["get", "update"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: monitoring
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: monitoring
spec:
  selector:
    matchLabels:
      k8s-app: kube-state-metrics
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/mxinden/kube-state-metrics:v1.4.0-gzip.3
        ports:
        - name: http-metrics
          containerPort: 8080
        - name: telemetry
          containerPort: 8081
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 5
          timeoutSeconds: 5
      - name: addon-resizer
        image: k8s.gcr.io/addon-resizer:1.8.3
        resources:
          limits:
            cpu: 150m
            memory: 50Mi
          requests:
            cpu: 150m
            memory: 50Mi
        env:
          - name: MY_POD_NAME
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: MY_POD_NAMESPACE
            valueFrom:
              fieldRef:
                fieldPath: metadata.namespace
        command:
          - /pod_nanny
          - --container=kube-state-metrics
          - --cpu=100m
          - --extra-cpu=1m
          - --memory=100Mi
          - --extra-memory=2Mi
          - --threshold=5
          - --deployment=kube-state-metrics
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: monitoring
  labels:
    k8s-app: kube-state-metrics
  annotations:
    prometheus.io/scrape: 'true'
spec:
  ports:
  - name: http-metrics
    port: 8080
    targetPort: http-metrics
    protocol: TCP
  - name: telemetry
    port: 8081
    targetPort: telemetry
    protocol: TCP
  selector:
    k8s-app: kube-state-metrics

需要使用Kubestate指标来中继一些重要的容器指标，这些指标不是kubelet本身公开的，因此不能直接用于Prometheus。

部署Node-Exporter Daemonset

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    name: node-exporter
spec:
  template:
    metadata:
      labels:
        name: node-exporter
      annotations:
         prometheus.io/scrape: "true"
         prometheus.io/port: "9100"
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
        - name: node-exporter
          image: prom/node-exporter:v0.16.0
          securityContext:
            privileged: true
          args:
            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
          ports:
            - containerPort: 9100
              protocol: TCP
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 10m
              memory: 100Mi
          volumeMounts:
            - name: dev
              mountPath: /host/dev
            - name: proc
              mountPath: /host/proc
            - name: sys
              mountPath: /host/sys
            - name: rootfs
              mountPath: /rootfs
      volumes:
        - name: proc
          hostPath:
            path: /proc
        - name: dev
          hostPath:
            path: /dev
        - name: sys
          hostPath:
            path: /sys
        - name: rootfs
          hostPath:
            path: /

Node-Exporter是Daemonset资源，它在每个节点上运行一个pod -exporter的容器，并公开非常重要的与节点相关的度量标准，这些度量标准可以由Prometheus实例提取。

部署Grafana

apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: fast
  namespace: monitoring
provisioner: kubernetes.io/gce-pd
allowVolumeExpansion: true
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  serviceName: grafana
  template:
    metadata:
      labels:
        task: monitoring
        k8s-app: grafana
    spec:
      containers:
      - name: grafana
        image: k8s.gcr.io/heapster-grafana-amd64:v5.0.4
        ports:
        - containerPort: 3000
          protocol: TCP
        volumeMounts:
        - mountPath: /etc/ssl/certs
          name: ca-certificates
          readOnly: true
        - mountPath: /var
          name: grafana-storage
        env:
        - name: GF_SERVER_HTTP_PORT
          value: "3000"
          # The following env variables are required to make Grafana accessible via
          # the kubernetes api-server proxy. On production clusters, we recommend
          # removing these env variables, setup auth for grafana, and expose the grafana
          # service using a LoadBalancer or a public IP.
        - name: GF_AUTH_BASIC_ENABLED
          value: "false"
        - name: GF_AUTH_ANONYMOUS_ENABLED
          value: "true"
        - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          value: Admin
        - name: GF_SERVER_ROOT_URL
          # If you're only using the API Server proxy, set this value instead:
          # value: /api/v1/namespaces/kube-system/services/monitoring-grafana/proxy
          value: /
      volumes:
      - name: ca-certificates
        hostPath:
          path: /etc/ssl/certs
  volumeClaimTemplates:
  - metadata:
      name: grafana-storage
      namespace: monitoring
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 5Gi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: 'true'
    kubernetes.io/name: grafana
  name: grafana
  namespace: monitoring
spec:
  ports:
  - port: 3000
    targetPort: 3000
  selector:
    k8s-app: grafana

这将创建我们的Grafana的Deployment和Service资源对象，该Service将通过我们的Ingress对象公开。

为了将Thanos-Querier添加为Grafana数据源。我们可以这样做：

在Grafana单击Add DataSource
名称：DS_PROMETHEUS
类型：Prometheus
网址：http://thanos-querier:9090
点击Save and Test。现在，你可以构建自定义仪表板，也可以直接从grafana.net导入仪表板。仪表盘＃315和＃1471是很好的开始。

部署Ingress对象

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: monitoring-ingress
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - host: grafana..com
    http:
      paths:
      - path: /
        backend:
          serviceName: grafana
          servicePort: 3000
  - host: prometheus-0..com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-0-service
          servicePort: 8080
  - host: prometheus-1..com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-1-service
          servicePort: 8080
  - host: prometheus-2..com
    http:
      paths:
      - path: /
        backend:
          serviceName: prometheus-2-service
          servicePort: 8080
  - host: alertmanager..com
    http: 
      paths:
      - path: /
        backend:
          serviceName: alertmanager
          servicePort: 9093
  - host: thanos-querier..com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-querier
          servicePort: 9090
  - host: thanos-ruler..com
    http:
      paths:
      - path: /
        backend:
          serviceName: thanos-ruler
          servicePort: 9090

这将有助于在Kubernetes集群之外公开我们所有的服务。记得将替换为你可以访问的域名，并且可以将Ingress-Controller的服务指向该域名。

现在，您应该可以在http://thanos-querier..com上访问Thanos Querier 。

它看起来像这样：

可以选择“ deldupication“ 删除重复数据。

如果单击“ Stores ”，则可以看到thanos-store-gateway服务发现的所有活动端点

现在，您将Thanos Querier添加为Grafana中的数据源，并开始创建仪表板

Kubernetes集群监控仪表板

Kubernetes节点监控仪表板

结论

将Thanos与Prometheus集成无疑提供了水平扩展Prometheus的能力，并且由于Thanos-Querier能够从其他查询器实例中提取指标，因此你实际上可以跨集群提取指标，从而在单个仪表板上可视化它们。

我们还能够将度量标准数据存档在object store中，该object store为我们的监视系统提供了无限的存储空间，并提供了来自object store本身的度量。

但是，要实现所有这些，你需要进行大量配置。上面提供的清单已在生产环境中进行了测试。如果你有任何疑问，请随时与我们联系。

译文连接：https://dzone.com/articles/high-availability-kubernetes-monitoring-using-prom

你可能感兴趣的:(Kubernetes,高可用,监控,Thanos)

k8s:安装 Helm 私有仓库ChartMuseum、helm-push插件并上传、安装Zookeeper 云游 docker helm helm-push
ChartMuseum是Kubernetes生态中用于存储、管理和发布HelmCharts的开源系统，主要用于扩展Helm包管理器的功能核心功能‌集中存储‌：提供中央化仓库存储Charts，支持版本管理和权限控制。‌‌跨集群部署‌：支持多集群环境下共享Charts，简化部署流程。‌‌离线部署‌：适配无网络环境，可将Charts存储在本地或局域网内。‌‌HTTP接口‌：通过HTTP协议提供服务，用户
分布式学习笔记_04_复制模型 NzuCRAS 分布式学习笔记架构后端
常见复制模型使用复制的目的在分布式系统中，数据通常需要被分布在多台机器上，主要为了达到：拓展性：数据量因读写负载巨大，一台机器无法承载，数据分散在多台机器上仍然可以有效地进行负载均衡，达到灵活的横向拓展高容错&高可用：在分布式系统中单机故障是常态，在单机故障的情况下希望整体系统仍然能够正常工作，这时候就需要数据在多台机器上做冗余，在遇到单机故障时能够让其他机器接管统一的用户体验：如果系统客户端分布
传统检测响应慢？陌讯多模态引擎提速90+FPS实战 2501_92473147 算法计算机视觉目标检测
开篇痛点：实时目标检测在安防监控中的核心挑战在安防监控领域，实时目标检测是保障公共安全的关键技术。然而，传统算法如YOLOv5或开源框架MMDetection常面临两大痛点：误报率高（复杂光照或遮挡场景下检测不稳定）和响应延迟（高分辨率视频流处理FPS低于30）。实测数据显示，城市交通监控系统误报率达15%，导致安保资源浪费；客户反馈表明，延迟超100ms时，目标跟踪可能失效。这些问题源于算法泛化
反光衣识别漏检率 30%？陌讯多尺度模型实测优化
在建筑工地、交通指挥等场景中，反光衣是保障作业人员安全的重要装备，对其进行精准识别是智能监控系统的核心功能之一。但传统视觉算法在实际应用中却屡屡碰壁：强光下反光衣易与背景混淆、远距离小目标漏检率高达30%、复杂场景下模型泛化能力不足[实测数据来源：某智慧工地项目2024年Q1日志]。这些问题直接导致安全监控系统预警滞后，给安全生产埋下隐患。一、技术解析：反光衣识别的核心难点与陌讯算法创新反光衣识别
Python 爬虫实战：视频平台播放量实时监控（含反爬对抗与数据趋势预测）西攻城狮北 python 爬虫音视频
一、引言在数字内容蓬勃发展的当下，视频平台的播放量数据已成为内容创作者、营销人员以及行业分析师手中极为关键的情报资源。它不仅能够实时反映内容的受欢迎程度，更能在竞争分析、营销策略制定以及内容优化等方面发挥不可估量的作用。然而，视频平台为了保护自身数据和用户隐私，往往会设置一系列反爬虫机制，对数据爬取行为进行限制。这就向我们发起了挑战：如何巧妙地突破这些限制，同时精准地捕捉并预测播放量的动态变化趋势
redis集群之Sentinel哨兵高可用会飞的爱迪生 redis redis sentinel bootstrap
Sentinel是官网推荐的高可用（HA）解决方案，可以实现redis的高可用，即主挂了从代替主工作，在一台单独的服务器上运行多个sentinel，去监控其他服务器上的redismaster-slave状态(可以监控多个master-slave)，当发现master宕机后sentinel会在slave中选举并启动新的master。至少需要3台redis才能建立起基于哨兵的reids集群。一、通过s
GoView 强势入驻 GitCode：拖拽低代码，打造高颜值数据大屏 GitCode 代码君 gitcode 低代码开源
信息可视化时代，数字大屏日益成为展示核心KPI、运营状态、监控预警的主流形式。然而，用传统方式开发一个定制化数字大屏需要解决多少问题？1.繁复的数据源集成，各种不同的协议和格式……2.让人晕头转向的可视化逻辑，调动艰难的样式、布局、动画，和往往难以统一的风格3.牵一发而动全身的代码结构，就想换个主题色结果开启的全局CSS大冒险……现在，一个开源项目即可搞定上述问题——拖拽式低代码数字可视化平台Go
Kubernetes自动扩缩容方案对比与实践指南浅沫云归后端技术栈小结 kubernetes autoscaling devops
Kubernetes自动扩缩容方案对比与实践指南随着微服务架构和容器化的广泛采用，Kubernetes自动扩缩容（Autoscaling）成为保障生产环境性能稳定与资源高效利用的关键技术。面对水平Pod扩缩容、垂直资源调整、集群节点扩缩容以及事件驱动扩缩容等多种需求，社区提供了HPA、VPA、ClusterAutoscaler、KEDA等多种方案。本篇文章将从业务背景、方案对比、优缺点分析、选型建
【运维实战】解决 K8s 节点无法拉取 pause:3.6 镜像导致 API Server 启动失败的问题 gs80140 各种问题运维 kubernetes 容器
目录【运维实战】解决K8s节点无法拉取pause:3.6镜像导致APIServer启动失败的问题问题分析✅解决方案：替代拉取方式导入pause镜像Step1.从私有仓库拉取pause镜像Step2.重新打tag为Kubernetes默认命名Step3.导出镜像为tar包Step4.拷贝镜像到目标节点Step5.在目标节点导入镜像到containerd的k8s.io命名空间Step6.验证镜像是否导
基于Python的智能公示信息监控爬虫系统开发实战 Python爬虫项目 2025年爬虫实战项目 python 爬虫开发语言音视频搜索引擎 scrapy
摘要本文详细介绍了如何使用Python构建一个高效的公示信息监控爬虫系统。系统采用最新技术栈，包括异步爬取、智能解析、反反爬策略等，能够自动监控各类政府网站、企业公示平台的更新信息。文章从系统设计到具体实现，提供了完整的代码示例和详细的技术解析，帮助读者掌握大规模公示信息采集的核心技术。关键词：Python爬虫、公示监控、信息采集、异步爬取、智能解析1.引言在数字化时代，各类公示信息（如政府采购、
ZooKeeper架构及应用场景详解走过冬季学习笔记 zookeeper 架构分布式
ZooKeeper是一个开源的分布式协调服务，由Apache软件基金会维护。它旨在为分布式应用提供高性能、高可用、强一致性的基础服务，解决分布式系统中常见的协调难题（如配置管理、命名服务、分布式锁、服务发现、领导者选举等）。核心软件架构ZooKeeper的架构设计围绕其核心目标（协调）而优化，主要包含以下关键组件：集群模式(Ensemble):ZooKeeper通常部署为集群（称为ensemble
zookeeper etcd区别 sun007700 zookeeper etcd 分布式
ZooKeeper与etcd的核心区别体现在设计理念、数据模型、一致性协议及适用场景等方面。‌ZooKeeper基于ZAB协议实现分布式协调，采用树形数据结构和临时节点特性，适合传统分布式系统；而etcd基于Raft协议，以高性能键值对存储为核心，专为云原生场景优化，是Kubernetes等容器编排系统的默认存储组件。‌‌1‌‌2‌架构与设计目标差异‌‌ZooKeeper‌。‌设计定位‌:专注于分
目标检测（object detection）加油吧zkf 目标检测目标检测人工智能计算机视觉
目标检测作为计算机视觉的核心技术，在自动驾驶、安防监控、医疗影像等领域发挥着不可替代的作用。本文将系统讲解目标检测的概念、原理、主流模型、常见数据集及应用场景，帮助读者构建对这一技术的完整认知。一、目标检测的核心概念目标检测（ObjectDetection）是指在图像或视频中自动定位并识别出所有感兴趣的目标的技术。它需要解决两个核心问题：分类（Classification）：确定图像中每个目标的类
在 Linux（openEuler 24.03 LTS-SP1）上安装 Kubernetes + KubeSphere 的防火墙放行全攻略
目录在Linux（openEuler24.03LTS-SP1）上安装Kubernetes+KubeSphere的防火墙放行全攻略一、为什么要先搞定防火墙？二、目标环境三、需放行的端口和协议列表四、核心工具说明1.修正后的exec.sh脚本（支持管道/重定向）2.批量放行脚本：open_firewall.sh五、使用示例1.批量放行端口2.查看当前防火墙规则3.仅开放单一端口（临时需求）4.检查特定
微算法科技的前沿探索：量子机器学习算法在视觉任务中的革新应用 MicroTech2025 量子计算算法
在信息技术飞速发展的今天，计算机视觉作为人工智能领域的重要分支，正逐步渗透到我们生活的方方面面。从自动驾驶到人脸识别，从医疗影像分析到安防监控，计算机视觉技术展现了巨大的应用潜力。然而，随着视觉任务复杂度的不断提升，传统机器学习算法在处理大规模、高维度数据时遇到了计算瓶颈。在此背景下，量子计算作为一种颠覆性的计算模式，以其独特的并行处理能力和指数级增长的计算空间，为解决这一难题提供了新的思路。微算
UDP服务器的优缺点都包含哪些？ wanhengidc udp 服务器网络协议
UDP协议不需要像TCP协议那样进行复杂的连接建立与拆除过程，在进行传输数据信息的过程中，应用层将数据交给UDP层，UDP层直接加上首部就发往网络层，极大地减少了处理时间和资源消耗。例如在一些简单的网络监控程序中，只是定期发送一些状态信息，对数据准确性的要求不高时，企业可以选择使用UDP服务器，能够实现快速传输数据的功能。由于UDP服务器不需要连接建立过程和重传机制的束缚，UDP数据能够快速地从发
服务器深夜告警？可能是攻击前兆！群联云防护小杜安全问题汇总服务器网络运维前端人工智能重构 ddos
凌晨三点，刺耳的告警铃声把你从梦中惊醒：服务器CPU100%，内存耗尽！你手忙脚乱地登录服务器，发现某个进程疯狂占用资源。是程序Bug？还是业务突增？排查半天，最后在角落的日志里发现蛛丝马迹——你的服务器正在遭受攻击！这种资源被“悄悄”耗尽的攻击，往往比直接的流量洪峰更难察觉，危害却同样巨大。本文将深入探讨这类资源消耗型攻击的原理，并提供一个实用的监控脚本，助你早发现、早处置。一、资源消耗型攻击：
Flink DataStream API详解（一） bxlj_jcj Flink flink 大数据
一、引言Flink的DataStreamAPI，在流处理领域大显身手的核心武器。在很多实时数据处理场景中，如电商平台实时分析用户购物行为以实现精准推荐，金融领域实时监控交易数据以防范风险，DataStreamAPI都发挥着关键作用，能够对源源不断的数据流进行高效处理和分析。接下来，就让我们一起深入探索FlinkDataStreamAPI。二、DataStream编程基础搭建在开始使用FlinkDa
Elasticsearch搜索引擎存储：从原理到实践的全景解析 Python×CATIA工业智造搜索引擎 elasticsearch 大数据
引言在大数据时代，数据规模呈指数级增长，传统数据库的模糊查询、实时分析能力逐渐成为瓶颈。Elasticsearch（简称ES）凭借其分布式架构、实时搜索和灵活的数据分析能力，成为企业级搜索与存储的核心引擎。截至2025年，ES在全球日志分析、电商搜索、实时监控等场景的市场占有率超过60%。本文将从存储架构、核心技术、应用场景及优化策略四个维度，深入解析Elasticsearch的设计哲学与实践价值
三、【docker】docker和docker-compose的常用命令
文章目录一、docker常用命令1、镜像管理2、容器管理3、容器监控和调试4、网络管理5、数据卷管理6、系统维护7、实用组合命令8、常用技巧二、docker-compose常用命令1、基本命令2、构建相关3、运行维护4、常用组合命令5、实用参数一、docker常用命令1、镜像管理#查看本地镜像dockerimages#拉取镜像dockerpull:#删除镜像dockerrmi#构建镜像docker
做了10年的性能测试，性能测试调优全解析：从定位到优化的实用指南颜挺锐性能测试性能优化
性能测试调优全解析：从定位到优化的实用指南**引言在当今数字化时代，软件系统的性能直接影响用户体验和业务的成功。性能测试调优作为确保系统高效运行的关键手段，对于提升系统响应速度、吞吐量以及稳定性至关重要。本文将深入探讨性能测试调优的全过程，从性能瓶颈的定位到具体调优策略的实施，帮助读者掌握性能测试调优的核心技能。性能瓶颈定位监控工具的使用APM工具：如NewRelic、Dynatrace等应用性能
jmeter 性能测试步骤是什么？
1.测试计划2.线程组-设置线程数3.HTTP请求（替换参数）4.用户参数/CSV数据文件设置参数、消息体数据5.集合点（同步定时器）-设置模拟用户数和超时时间6.响应断言（检查点）7.断言结果8.监听器-察看结果树9.监听器-聚合报告10.场景监控、运行10.1配置监听器参数10.2登录服务器启动agent服务jmeter性能测试实战（零基础入门到精通）即学即上手！
SQL Server通过存储过程实现企业微信消息卡片推送 Favor_Yang SQL调优及高级SQL语法编写数据库企业微信 SQL Server 消息推送
背景与需求分析企业微信消息卡片广泛应用于企业内部系统通知（如审批流提醒、工单状态变更、数据监控报警）。SQLServer存储过程因其高效执行、业务逻辑封装能力，成为处理数据库触发式消息推送的理想选择。技术整合的核心价值在于将数据库业务事件直接转化为企业微信消息，减少人工干预，提升流程自动化水平。技术架构设计系统采用三层架构：数据层：SQLServer存储过程处理业务数据并生成消息内容传输层：通过O
K3s-io/kine项目核心架构与数据流解析富珂祯
K3s-io/kine项目核心架构与数据流解析kineRunKubernetesonMySQL,Postgres,sqlite,dqlite,notetcd.项目地址:https://gitcode.com/gh_mirrors/ki/kine项目概述K3s-io/kine是一个创新的存储适配器，它在传统SQL数据库之上实现了轻量级的键值存储功能。该项目最显著的特点是采用单一数据表结构，通过巧妙的
20250707-3-Kubernetes 核心概念-有了Docker，为什么还用K8s_笔记 Andy杨 CKA-专栏 kubernetes docker 笔记
一、Kubernetes核心概念1.有了Docker，为什么还用Kubernetes1）企业需求独立性问题：Docker容器本质上是独立存在的，多个容器跨主机提供服务时缺乏统一管理机制负载均衡需求：为提高业务并发和高可用，企业会使用多台服务器部署多个容器实例，但Docker本身不具备负载均衡能力管理复杂度：随着Docker主机和容器数量增加，面临部署、升级、监控等统一管理难题运维效率：单机升
20250707-4-Kubernetes 集群部署、配置和验证-K8s基本资源概念初_笔记
一、kubeconfig配置文件文件作用:kubectl使用kubeconfig认证文件连接K8s集群生成方式:使用kubectlconfig指令生成核心字段:clusters:定义集群信息，包括证书和服务端地址contexts:定义上下文，关联集群和用户users:定义客户端认证信息current-context:指定当前使用的上下文二、Kubernetes弃用Docker1.弃用背景原因:
k8s之configmap 西京刀客云原生(Cloud Native)云计算虚拟化 #Kubernetes(k8s)kubernetes 容器云原生
文章目录k8s之configmap什么是ConfigMap？为什么需要ConfigMap？ConfigMap的创建方式ConfigMap的使用方式实际应用场景ConfigMap最佳实践参考k8s之configmap什么是ConfigMap？ConfigMap是Kubernetes中用于存储非机密配置数据的API对象。它允许你将配置信息与容器镜像解耦，使应用程序更加灵活和可移植。ConfigMap以
数据分析框架和方法 XiaoQiong.Zhang 人工智能
一、核心分析框架(TheBigPictureFrameworks)描述性分析(WhatHappened?)目的：了解过去发生了什么，描述现状，监控业务健康。核心工作：汇总、聚合、计算基础指标(KPI)，生成报表和仪表盘。常用方法/指标：计数/求和/平均值/中位数：DAU/MAU，总销售额，客单价等。比率：转化率，点击率，流失率，毛利率等。分布：用户活跃度分布、订单金额分布、地域分布等。常用于理解群
SQL Server通过CLR连接InfluxDB实现异构数据关联查询技术指南 Favor_Yang SQL调优及高级SQL语法编写 SQL Server InfluxDB
一、背景与需求场景在工业物联网和金融监控场景中，实时时序数据（InfluxDB）需与业务元数据（SQLServer）联合分析：工业场景：设备传感器每秒采集温度、振动数据（InfluxDB），需关联工单状态、设备型号（SQLServer）金融场景：交易流水时序数据（每秒万条）需实时匹配客户风险等级、账户余额（SQLServer）核心痛点：传统ETL延迟高，无法满足实时风控/故障诊断需求，需实现毫秒级
STM32-- 调试 -日志输出 code_snow stm32 stm32 嵌入式硬件单片机
在调试嵌入式程序时，输出日志是非常重要的环节，可以帮助开发者定位问题、监控程序状态和性能。以下是几种常见的日志输出方式及其适用场景：1.使用串口（UART）输出日志实现方式：通过串口将日志输出到主机的串口工具（如PuTTY、TeraTerm、minicom）中。优点：简单易用，几乎所有嵌入式设备都支持。实时性强，适合调试运行时的动态信息。与printf结合使用方便。示例代码：#include//配
java杨辉三角 3213213333332132 java基础
package com.algorithm; /** * @Description 杨辉三角 * @author FuJianyong * 2015-1-22上午10:10:59 */ public class YangHui { public static void main(String[] args) { //初始化二维数组长度 int[][] y
《大话重构》之大布局的辛酸历史白糖_ 重构
《大话重构》中提到“大布局你伤不起”，如果企图重构一个陈旧的大型系统是有非常大的风险，重构不是想象中那么简单。我目前所在公司正好对产品做了一次“大布局重构”，下面我就分享这个“大布局”项目经验给大家。背景公司专注于企业级管理产品软件，企业有大中小之分，在2000年初公司用JSP/Servlet开发了一套针对中
电驴链接在线视频播放源码 dubinwei 源码电驴播放器视频 ed2k
本项目是个搜索电驴（ed2k）链接的应用,借助于磁力视频播放器（官网： http://loveandroid.duapp.com/ 开放平台），可以实现在线播放视频，也可以用迅雷或者其他下载工具下载。项目源码： http://git.oschina.net/svo/Emule,动态更新。也可从附件中下载。项目源码依赖于两个库项目，库项目一链接： http://git.oschina.
Javascript中函数的toString()方法周凡杨 JavaScript js toString function object
简述 The toString() method returns a string representing the source code of the function. 简译之，Javascript的toString()方法返回一个代表函数源代码的字符串。句法 function.
struts处理自定义异常 g21121 struts
很多时候我们会用到自定义异常来表示特定的错误情况，自定义异常比较简单，只要分清是运行时异常还是非运行时异常即可，运行时异常不需要捕获，继承自RuntimeException，是由容器自己抛出，例如空指针异常。非运行时异常继承自Exception，在抛出后需要捕获，例如文件未找到异常。此处我们用的是非运行时异常，首先定义一个异常LoginException: /** * 类描述：登录相
Linux中find常见用法示例 510888780 linux
Linux中find常见用法示例 ·find path -option [ -print ] [ -exec -ok command ] {} \; find命令的参数；
SpringMVC的各种参数绑定方式 Harry642 springMVC 绑定表单
1. 基本数据类型(以int为例，其他类似)： Controller代码： @RequestMapping("saysth.do") public void test(int count) { } 表单代码： <form action="saysth.do" method="post&q
Java 获取Oracle ROWID aijuans java oracle
A ROWID is an identification tag unique for each row of an Oracle Database table. The ROWID can be thought of as a virtual column, containing the ID for each row. The oracle.sql.ROWID class i
java获取方法的参数名 antlove java jdk parameter method reflect
reflect.ClassInformationUtil.java package reflect; import javassist.ClassPool; import javassist.CtClass; import javassist.CtMethod; import javassist.Modifier; import javassist.bytecode.CodeAtt
JAVA正则表达式匹配查找替换提取操作百合不是茶 java 正则表达式替换提取查找
正则表达式的查找;主要是用到String类中的split(); String str; str.split();方法中传入按照什么规则截取,返回一个String数组常见的截取规则: str.split("\\.")按照.来截取 str.
Java中equals()与hashCode()方法详解 bijian1013 java set equals()hashCode()
一.equals()方法详解 equals()方法在object类中定义如下： public boolean equals(Object obj) { return (this == obj); } 很明显是对两个对象的地址值进行的比较（即比较引用是否相同）。但是我们知道，String 、Math、I
精通Oracle10编程SQL(4)使用SQL语句 bijian1013 oracle 数据库 plsql
--工资级别表 create table SALGRADE ( GRADE NUMBER(10), LOSAL NUMBER(10,2), HISAL NUMBER(10,2) ) insert into SALGRADE values(1,0,100); insert into SALGRADE values(2,100,200); inser
【Nginx二】Nginx作为静态文件HTTP服务器 bit1129 HTTP服务器
Nginx作为静态文件HTTP服务器在本地系统中创建/data/www目录，存放html文件(包括index.html) 创建/data/images目录，存放imags图片在主配置文件中添加http指令 http { server { listen 80; server_name
kafka获得最新partition offset blackproof kafka partition offset 最新
kafka获得partition下标，需要用到kafka的simpleconsumer import java.util.ArrayList; import java.util.Collections; import java.util.Date; import java.util.HashMap; import java.util.List; import java.
centos 7安装docker两种方式 ronin47
第一种是采用yum 方式 yum install -y docker
java-60-在O(1)时间删除链表结点 bylijinnan java
public class DeleteNode_O1_Time { /** * Q 60 在O(1)时间删除链表结点 * 给定链表的头指针和一个结点指针(!!)，在O(1)时间删除该结点 * * Assume the list is: * head->...->nodeToDelete->mNode->nNode->..
nginx利用proxy_cache来缓存文件 cfyme cache
user zhangy users; worker_processes 10; error_log /var/vlogs/nginx_error.log crit; pid /var/vlogs/nginx.pid; #Specifies the value for ma
[JWFD开源工作流]JWFD嵌入式语法分析器负号的使用问题 comsci 嵌入式
假如我们需要用JWFD的语法分析模块定义一个带负号的方程式，直接在方程式之前添加负号是不正确的，而必须这样做： string str01 = "a=3.14;b=2.71;c=0;c-((a*a)+(b*b))" 定义一个0整数c,然后用这个整数c去
如何集成支付宝官方文档 dai_lm android
官方文档下载地址 https://b.alipay.com/order/productDetail.htm?productId=2012120700377310&tabId=4#ps-tabinfo-hash 集成的必要条件 1. 需要有自己的Server接收支付宝的消息 2. 需要先制作app，然后提交支付宝审核，通过后才能集成调试的时候估计会真的扣款，请注意
应该在什么时候使用Hadoop datamachine hadoop
原帖地址：http://blog.chinaunix.net/uid-301743-id-3925358.html 存档，某些观点与我不谋而合，过度技术化不可取，且hadoop并非万能。 --------------------------------------------万能的分割线-------------------------------- 有人问我，“你在大数据和Hado
在GridView中对于有外键的字段使用关联模型进行搜索和排序 dcj3sjt126com yii
在GridView中使用关联模型进行搜索和排序首先我们有两个模型它们直接有关联: class Author extends CActiveRecord { ... } class Post extends CActiveRecord { ... function relations() { return array( '
使用NSString 的格式化大全 dcj3sjt126com Objective-C
格式定义The format specifiers supported by the NSString formatting methods and CFString formatting functions follow the IEEE printf specification; the specifiers are summarized in Table 1. Note that you c
使用activeX插件对象object滚动有重影蕃薯耀 activeX插件滚动有重影
使用activeX插件对象object滚动有重影 <object style="width:0;" id="abc" classid="CLSID:D3E3970F-2927-9680-BBB4-5D0889909DF6" codebase="activex/OAX339.CAB#
SpringMVC4零配置 hanqunfeng springmvc4
基于Servlet3.0规范和SpringMVC4注解式配置方式，实现零xml配置，弄了个小demo，供交流讨论。项目说明如下： 1.db.sql是项目中用到的表，数据库使用的是oracle11g 2.该项目使用mvn进行管理，私服为自搭建nexus,项目只用到一个第三方 jar，就是oracle的驱动； 3.默认项目为零配置启动，如果需要更改启动方式，请
《开源框架那点事儿16》：缓存相关代码的演变 j2eetop 开源框架
问题引入上次我参与某个大型项目的优化工作，由于系统要求有比较高的TPS，因此就免不了要使用缓冲。该项目中用的缓冲比较多，有MemCache，有Redis，有的还需要提供二级缓冲，也就是说应用服务器这层也可以设置一些缓冲。当然去看相关实现代代码的时候，大致是下面的样子。 [java] view plain copy print ? public vo
AngularJS浅析 kvhur JavaScript
概念 AngularJS is a structural framework for dynamic web apps. 了解更多详情请见原文链接：http://www.gbtags.com/gb/share/5726.htm Directive 扩展html，给html添加声明语句，以便实现自己的需求。对于页面中html元素以ng为前缀的属性名称，ng是angular的命名空间
架构师之jdk的bug排查(一)---------------split的点号陷阱 nannan408 split
1.前言. jdk1.6的lang包的split方法是有bug的,它不能有效识别A.b.c这种类型,导致截取长度始终是0.而对于其他字符,则无此问题.不知道官方有没有修复这个bug. 2.代码 String[] paths = "object.object2.prop11".split("'"); System.ou
如何对10亿数据量级的mongoDB作高效的全表扫描 quentinXXZ mongodb
本文链接: http://quentinXXZ.iteye.com/blog/2149440 一、正常情况下，不应该有这种需求首先，大家应该有个概念，标题中的这个问题，在大多情况下是一个伪命题，不应该被提出来。要知道，对于一般较大数据量的数据库，全表查询，这种操作一般情况下是不应该出现的，在做正常查询的时候，如果是范围查询，你至少应该要加上limit。说一下，
C语言算法之水仙花数 qiufeihu c 算法
/** * 水仙花数 */ #include <stdio.h> #define N 10 int main() { int x,y,z; for(x=1;x<=N;x++) for(y=0;y<=N;y++) for(z=0;z<=N;z++) if(x*100+y*10+z == x*x*x
JSP指令 wyzuomumu jsp
jsp指令的一般语法格式： <%@ 指令名属性 =”值 ” %> 常用的三种指令： page,include,taglib page指令语法形式： <%@ page 属性 1=”值 1” 属性 2=”值 2”%> include指令语法形式： <%@include file=”relative url”%> (jsp可以通过 include