Prometheus introduction:
Prometheus is an open-source systems monitoring and alerting toolkit. It has been accepted into the CNCF, becoming the second project hosted there after Kubernetes. It is commonly deployed alongside Kubernetes clusters for monitoring, supports a wide range of exporters for collecting metrics as well as a Pushgateway for pushing them, and its performance is sufficient for clusters of more than ten thousand machines.
Prometheus components:
prometheus: the core component; scrapes and stores metrics and evaluates alerting rules
alertmanager: receives alerts fired by Prometheus and sends out notifications
node_exporter: exposes host metrics such as CPU, memory and network
blackbox_exporter: black-box probing, e.g. HTTP status codes, host liveness and port checks
mysqld_exporter: exposes MySQL metrics
redis_exporter: exposes Redis metrics
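A quick way to confirm an exporter is serving data is to hit its metrics endpoint. This is only a sanity-check sketch; it assumes the node-exporter container from the compose file below is already running on this host and published on its default port 9100:
curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head -n 3   # should print a few CPU samples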
Setting it up with docker-compose:
mkdir -p /docker/{prometheus,prometheus/data,alertmanager,alertmanager/template,grafana,blackbox_exporter}
cd /docker
Create the Prometheus configuration file (docker-compose below mounts it into the container as /etc/prometheus/prometheus.yml):
vi /docker/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets:
- docker_alertmanager_1:9093
rule_files:
- "*rules.yml"
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['192.1.1.12:9090']
- job_name: 'node'
static_configs:
- targets: ['192.1.1.12:9100','192.168.88.11:9100']
- job_name: 'alertmanager'
static_configs:
- targets: ['192.1.1.12:9093']
- job_name: 'cadvisor'
static_configs:
- targets: ['192.168.88.13:8080','192.168.88.11:8080']
- job_name: 'mysql-exporter'
#scrape_interval: 5s
static_configs:
- targets: ['192.168.88.13:9104']
- job_name: 'redis-exporter'
#scrape_interval: 5s
static_configs:
- targets: ['192.168.88.13:9121']
  # the redis targets in the job below are for a (6-node) Redis cluster; remove this job if you are not running a cluster
- job_name: 'redis_exporter_targets'
static_configs:
- targets:
- redis://172.16.8.58:6379
    metrics_path: /scrape   # redis_exporter's multi-target scrape endpoint
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 172.16.0.46:9121
  - job_name: 'http_status'        # job name
    metrics_path: /probe           # path the metrics are fetched from
    params:
      module: [http_2xx]           # module name defined in the blackbox_exporter config
    file_sd_configs:               # there are many probe targets, so they live in a separate file (described later)
      - files:
          - '/etc/prometheus/job-web.yml'
        refresh_interval: 30s      # re-read the file every 30s, so new targets are picked up without a restart
    relabel_configs:
      - source_labels: [__address__]    # the target's address, e.g. https://baidu.com when probing Baidu
        target_label: __param_target    # copy __address__ into the URL parameter "target", i.e. target=https://baidu.com
      - source_labels: [__param_target]
        target_label: instance          # copy __param_target into the instance label
      - target_label: __address__
        replacement: docker_blackbox_exporter_1:9115   # Prometheus does not request the site itself; it asks blackbox_exporter to probe it, so the scrape address is rewritten to the exporter's address
- job_name: 'rocketmq-exporter'
#scrape_interval: 5s
static_configs:
      - targets: ['<rocketmq-exporter-IP>:5557']
labels:
project: baidu
        instance: 172.30.0.150:9876   # extra label added to make searching easier
app: rocketmq
env: pro
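Once the stack is running, the scrape configuration can be verified through the Prometheus HTTP API. A small sketch, using the same IP as the prometheus job above:
curl -s http://192.1.1.12:9090/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c   # every target should report "up"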
vi /docker/prometheus/rules.yml    # YAML file, mind the indentation
groups:
- name: node-alert
rules:
- alert: NodeDown
expr: up{job="node"} == 0
for: 1m
labels:
severity: critical
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} down"
description: "Instance: {{ $labels.instance }} 已经宕机 1分钟"
value: "{{ $value }}"
- alert: NodeCpuHigh
expr: (1 - avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="idle"}[5m]))) * 100 > 85
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} cpu使用率过高"
description: "CPU 使用率超过 80%"
value: "{{ $value }}"
- alert: NodeCpuIowaitHigh
expr: avg by (instance) (irate(node_cpu_seconds_total{job="node",mode="iowait"}[5m])) * 100 > 80
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} cpu iowait 使用率过高"
description: "CPU iowait 使用率超过 50%"
value: "{{ $value }}"
- alert: NodeLoad5High
expr: node_load5 > (count by (instance) (node_cpu_seconds_total{job="node",mode='system'})) * 1.2
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} load(5m) 过高"
description: "Load(5m) 过高,超出cpu核数 1.2倍"
value: "{{ $value }}"
- alert: NodeMemoryHigh
expr: (1 - node_memory_MemAvailable_bytes{job="node"} / node_memory_MemTotal_bytes{job="node"}) * 100 > 90
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} memory 使用率过高"
description: "Memory 使用率超过 90%"
value: "{{ $value }}"
- alert: NodeDiskRootHigh
expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/"}) * 100 > 90
for: 10m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk(/ 分区) 使用率过高"
description: "Disk(/ 分区) 使用率超过 90%"
value: "{{ $value }}"
- alert: NodeDiskBootHigh
expr: (1 - node_filesystem_avail_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/boot"} / node_filesystem_size_bytes{job="node",fstype=~"ext.*|xfs",mountpoint ="/boot"}) * 100 > 80
for: 10m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk(/boot 分区) 使用率过高"
description: "Disk(/boot 分区) 使用率超过 80%"
value: "{{ $value }}"
- alert: NodeDiskReadHigh
expr: irate(node_disk_read_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk 读取字节数 速率过高"
description: "Disk 读取字节数 速率超过 20 MB/s"
value: "{{ $value }}"
- alert: NodeDiskWriteHigh
expr: irate(node_disk_written_bytes_total{job="node"}[5m]) > 20 * (1024 ^ 2)
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk 写入字节数 速率过高"
description: "Disk 写入字节数 速率超过 20 MB/s"
value: "{{ $value }}"
- alert: NodeDiskReadRateCountHigh
expr: irate(node_disk_reads_completed_total{job="node"}[5m]) > 3000
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk iops 每秒读取速率过高"
description: "Disk iops 每秒读取速率超过 3000 iops"
value: "{{ $value }}"
- alert: NodeDiskWriteRateCountHigh
expr: irate(node_disk_writes_completed_total{job="node"}[5m]) > 3000
for: 5m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk iops 每秒写入速率过高"
description: "Disk iops 每秒写入速率超过 3000 iops"
value: "{{ $value }}"
- alert: NodeInodeRootUsedPercentHigh
expr: (1 - node_filesystem_files_free{job="node",fstype=~"ext4|xfs",mountpoint="/"} / node_filesystem_files{job="node",fstype=~"ext4|xfs",mountpoint="/"}) * 100 > 80
for: 10m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk(/ 分区) inode 使用率过高"
description: "Disk (/ 分区) inode 使用率超过 80%"
value: "{{ $value }}"
- alert: NodeInodeBootUsedPercentHigh
expr: (1 - node_filesystem_files_free{job="node",fstype=~"ext4|xfs",mountpoint="/boot"} / node_filesystem_files{job="node",fstype=~"ext4|xfs",mountpoint="/boot"}) * 100 > 80
for: 10m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} disk(/boot 分区) inode 使用率过高"
description: "Disk (/boot 分区) inode 使用率超过 80%"
value: "{{ $value }}"
- alert: NodeFilefdAllocatedPercentHigh
expr: node_filefd_allocated{job="node"} / node_filefd_maximum{job="node"} * 100 > 80
for: 10m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} filefd 打开百分比过高"
description: "Filefd 打开百分比 超过 80%"
value: "{{ $value }}"
- alert: NodeNetworkNetinBitRateHigh
expr: avg by (instance) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
for: 3m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} network 接收比特数 速率过高"
description: "Network 接收比特数 速率超过 20MB/s"
value: "{{ $value }}"
- alert: NodeNetworkNetoutBitRateHigh
expr: avg by (instance) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m]) * 8) > 20 * (1024 ^ 2) * 8
for: 3m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} network 发送比特数 速率过高"
description: "Network 发送比特数 速率超过 20MB/s"
value: "{{ $value }}"
- alert: NodeNetworkNetinPacketErrorRateHigh
expr: avg by (instance) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
for: 3m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} 接收错误包 速率过高"
description: "Network 接收错误包 速率超过 15个/秒"
value: "{{ $value }}"
- alert: NodeNetworkNetoutPacketErrorRateHigh
    expr: avg by (instance) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) > 15
for: 3m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} 发送错误包 速率过高"
description: "Network 发送错误包 速率超过 15个/秒"
value: "{{ $value }}"
- alert: NodeProcessBlockedHigh
expr: node_procs_blocked{job="node"} > 10
for: 10m
labels:
severity: warning
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} 当前被阻塞的任务的数量过多"
description: "Process 当前被阻塞的任务的数量超过 10个"
value: "{{ $value }}"
- alert: NodeTimeOffsetHigh
expr: abs(node_timex_offset_seconds{job="node"}) > 3 * 60
for: 2m
labels:
severity: info
instance: "{{ $labels.instance }}"
annotations:
summary: "instance: {{ $labels.instance }} 时间偏差过大"
description: "Time 节点的时间偏差超过 3m"
value: "{{ $value }}"
  - alert: WebProbeFailed
    expr: probe_http_status_code{not_200!="yes"} != 200
    for: 30s
    annotations:
      summary: "Web probe failed for {{ $labels.instance }}"
    labels:
      severity: critical
  - alert: WebResponseTimeOver3s
    expr: probe_duration_seconds >= 3
    for: 30s
    annotations:
      summary: "Web response time above 3s for {{ $labels.instance }}"
    labels:
      severity: warning
  - alert: SSLCertExpiresIn30Days
    expr: probe_ssl_earliest_cert_expiry - time() < 3600*24*30
    annotations:
      summary: "The certificate for {{ $labels.instance }} expires in less than 30 days"
    labels:
      severity: info
  - alert: SSLCertExpiresIn7Days
    expr: probe_ssl_earliest_cert_expiry - time() < 3600*24*7
    annotations:
      summary: "The certificate for {{ $labels.instance }} expires in less than 7 days"
    labels:
      severity: critical
  - alert: SSLCertExpiresIn1Day
    expr: probe_ssl_earliest_cert_expiry - time() < 3600*24*1
    annotations:
      summary: "The certificate for {{ $labels.instance }} expires in less than 1 day"
    labels:
      severity: critical
#groups:
# - name: mysql.rules
# rules:
- alert: MysqlDown
    expr: up{job="mysql-exporter"} == 0
for: 0m
labels:
severity: critical
annotations:
title: 'MySQL down'
description: "Mysql实例: 【{{ $labels.instance }}】, MySQL instance is down"
- alert: MysqlRestarted
expr: mysql_global_status_uptime < 60
for: 0m
labels:
severity: info
annotations:
title: 'MySQL Restarted'
description: "Mysql实例: 【{{ $labels.instance }}】, MySQL has just been restarted, less than one minute ago"
- alert: MysqlTooManyConnections(>80%)
expr: avg by (instance) (rate(mysql_global_status_threads_connected[1m])) / avg by (instance) (mysql_global_variables_max_connections) * 100 > 80
for: 2m
labels:
severity: warning
annotations:
title: 'MySQL too many connections (> 80%)'
description: "Mysql实例: 【{{ $labels.instance }}】, More than 80% of MySQL connections are in use, Current Value: {{ $value }}%"
- alert: MysqlThreadsRunningHigh
expr: mysql_global_status_threads_running > 40
for: 2m
labels:
severity: warning
annotations:
title: 'MySQL Threads_Running High'
description: "Mysql实例: 【{{ $labels.instance }}】, Threads_Running above the threshold(40), Current Value: {{ $value }}"
- alert: MysqlQpsHigh
expr: sum by (instance) (rate(mysql_global_status_queries[2m])) > 500
for: 2m
labels:
severity: warning
annotations:
title: 'MySQL QPS High'
description: "Mysql实例: 【{{ $labels.instance }}】, MySQL QPS above 500"
- alert: MysqlSlowQueries
expr: increase(mysql_global_status_slow_queries[1m]) > 0
for: 2m
labels:
severity: warning
annotations:
title: 'MySQL slow queries'
description: "Mysql实例: 【{{ $labels.instance }}】, has some new slow query."
- alert: MysqlTooManyAbortedConnections
expr: round(increase(mysql_global_status_aborted_connects[5m])) > 20
for: 2m
labels:
severity: warning
annotations:
      title: 'MySQL too many aborted connections in 5 minutes'
      description: "MySQL instance {{ $labels.instance }}: {{ $value }} aborted connections within 5 minutes"
- alert: MysqlTooManyAbortedClients
expr: round(increase(mysql_global_status_aborted_clients[120m])) > 5000
for: 2m
labels:
severity: warning
annotations:
title: 'MySQL too many Aborted connections in 2 hours'
description: "Mysql实例: 【{{ $labels.instance }}】, {{ $value }} Aborted Clients within 2 hours"
- alert: MysqlSlaveIoThreadNotRunning
expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_io_running == 0
for: 0m
labels:
severity: critical
annotations:
title: 'MySQL Slave IO thread not running'
description: "Mysql实例: 【{{ $labels.instance }}】, MySQL Slave IO thread not running"
- alert: MysqlSlaveSqlThreadNotRunning
expr: mysql_slave_status_master_server_id > 0 and ON (instance) mysql_slave_status_slave_sql_running == 0
for: 0m
labels:
severity: critical
annotations:
title: 'MySQL Slave SQL thread not running'
description: "Mysql实例: 【{{ $labels.instance }}】, MySQL Slave SQL thread not running"
- alert: MysqlSlaveReplicationLag
expr: mysql_slave_status_master_server_id > 0 and ON (instance) (mysql_slave_status_seconds_behind_master - mysql_slave_status_sql_delay) > 30
for: 1m
labels:
severity: critical
annotations:
title: 'MySQL Slave replication lag'
description: "Mysql实例: 【{{ $labels.instance }}】, MySQL replication lag"
- alert: MysqlInnodbLogWaits
expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
for: 0m
labels:
severity: warning
annotations:
title: 'MySQL InnoDB log waits'
description: "Mysql实例: 【{{ $labels.instance }}】, innodb log writes stalling"
#groups:
# - name: Docker containers monitoring
# rules:
- alert: ContainerKilled
expr: time() - container_last_seen > 60
for: 5m
labels:
severity: warning
annotations:
summary: "Container killed (instance {{ $labels.instance }})"
description: "A container has disappeared\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ContainerCpuUsage
expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container CPU usage (instance {{ $labels.instance }})"
description: "Container CPU usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ContainerMemoryUsage
expr: (sum(container_memory_usage_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes) BY (instance, name) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container Memory usage (instance {{ $labels.instance }})"
description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ContainerVolumeUsage
expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance)) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container Volume usage (instance {{ $labels.instance }})"
description: "Container Volume usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ContainerVolumeIoUsage
expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Container Volume IO usage (instance {{ $labels.instance }})"
description: "Container Volume IO usage is above 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ContainerHighThrottleRate
expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Container high throttle rate (instance {{ $labels.instance }})"
description: "Container is being throttled\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
  - alert: PgbouncerActiveConnections
expr: pgbouncer_pools_server_active_connections > 200
for: 5m
labels:
severity: warning
annotations:
summary: "PGBouncer active connectinos (instance {{ $labels.instance }})"
description: "PGBouncer pools are filling up\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PgbouncerErrors
expr: increase(pgbouncer_errors_count{errmsg!="server conn crashed?"}[5m]) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "PGBouncer errors (instance {{ $labels.instance }})"
description: "PGBouncer is logging errors. This may be due to a a server restart or an admin typing commands at the pgbouncer console.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: PgbouncerMaxConnections
expr: rate(pgbouncer_errors_count{errmsg="no more connections allowed (max_client_conn)"}[1m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "PGBouncer max connections (instance {{ $labels.instance }})"
description: "The number of PGBouncer client connections has reached max_client_conn.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: SidekiqQueueSize
expr: sidekiq_queue_size{} > 100
for: 5m
labels:
severity: warning
annotations:
summary: "Sidekiq queue size (instance {{ $labels.instance }})"
description: "Sidekiq queue {{ $labels.name }} is growing\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: SidekiqSchedulingLatencyTooHigh
expr: max(sidekiq_queue_latency) > 120
for: 5m
labels:
severity: critical
annotations:
summary: "Sidekiq scheduling latency too high (instance {{ $labels.instance }})"
description: "Sidekiq jobs are taking more than 2 minutes to be picked up. Users may be seeing delays in background processing.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ConsulServiceHealthcheckFailed
expr: consul_catalog_service_node_healthy == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Consul service healthcheck failed (instance {{ $labels.instance }})"
description: "Service: `{{ $labels.service_name }}` Healthcheck: `{{ $labels.service_id }}`\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ConsulMissingMasterNode
expr: consul_raft_peers < 3
for: 5m
labels:
severity: critical
annotations:
summary: "Consul missing master node (instance {{ $labels.instance }})"
description: "Numbers of consul raft peers should be 3, in order to preserve quorum.\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ConsulAgentUnhealthy
expr: consul_health_node_status{status="critical"} == 1
for: 5m
labels:
severity: critical
annotations:
summary: "Consul agent unhealthy (instance {{ $labels.instance }})"
description: "A Consul agent is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
#groups:
#- name: Redis
# rules:
- alert: RedisDown
expr: redis_up == 0
for: 5m
labels:
severity: error
annotations:
summary: "Redis down (instance {{ $labels.instance }})"
description: "Redis 挂了啊,mmp\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: MissingBackup
expr: time() - redis_rdb_last_save_timestamp_seconds > 60 * 60 * 24
for: 5m
labels:
severity: error
annotations:
summary: "Missing backup (instance {{ $labels.instance }})"
description: "Redis has not been backuped for 24 hours\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: OutOfMemory
expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
for: 5m
labels:
severity: warning
annotations:
summary: "Out of memory (instance {{ $labels.instance }})"
description: "Redis is running out of memory (> 90%)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ReplicationBroken
expr: delta(redis_connected_slaves[1m]) < 0
for: 5m
labels:
severity: error
annotations:
summary: "Replication broken (instance {{ $labels.instance }})"
description: "Redis instance lost a slave\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: TooManyConnections
expr: redis_connected_clients > 10
for: 1m
labels:
severity: warning
annotations:
summary: "Too many connections (instance {{ $labels.instance }})"
description: "Redis instance has too many connections\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: NotEnoughConnections
expr: redis_connected_clients < 5
for: 5m
labels:
severity: warning
annotations:
summary: "Not enough connections (instance {{ $labels.instance }})"
description: "Redis instance should have more connections (> 5)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: RejectedConnections
expr: increase(redis_rejected_connections_total[1m]) > 0
for: 5m
labels:
severity: error
annotations:
summary: "Rejected connections (instance {{ $labels.instance }})"
description: "Some connections to Redis has been rejected\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
# rules_rocketmq.yml
#groups:
#- name: rocketmq
# rules:
  - alert: RocketMQExporterDown
    expr: up{job="rocketmq-exporter"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: RocketMQ exporter {{ $labels.instance }} is down
  - alert: RocketMQMessageBacklog
    expr: (sum(irate(rocketmq_producer_offset[1m])) by (topic) - on(topic) group_right sum(irate(rocketmq_consumer_offset[1m])) by (group,topic)) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: RocketMQ backlog for group={{ $labels.group }} topic={{ $labels.topic }} is {{ .Value }}
  - alert: GroupGetLatencyByStoretime
    expr: rocketmq_group_get_latency_by_storetime/1000 > 5 and rate(rocketmq_group_get_latency_by_storetime[5m]) > 0
    for: 3m
    labels:
      severity: warning
    annotations:
      description: 'consumer {{$labels.group}} on {{$labels.broker}}, {{$labels.topic}} consume time lags behind message store time (lag is {{$value}}).'
      summary: Consumer group consume latency is too high
  - alert: RocketMQClusterProduceHigh
    expr: sum(rocketmq_producer_tps) by (cluster) >= 20
    for: 3m
    labels:
      severity: warning
    annotations:
      description: '{{$labels.cluster}} sending TPS too high. Current TPS = {{ .Value }}'
      summary: Cluster send TPS too high (>= 20)
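YAML indentation mistakes are the easiest way to break a rule file, so it is worth validating it before loading. A sketch using the promtool binary bundled in the prom/prometheus image (a locally installed promtool works the same way):
docker run --rm -v /docker/prometheus:/cfg --entrypoint promtool prom/prometheus:latest check rules /cfg/rules.yml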
vi /docker/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m              # how long to wait before marking an alert as resolved (default 5m)
  # SMTP settings for e-mail notifications
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: '######'     # QQ mail authorization token
  smtp_require_tls: false
# notification templates
templates:
  - 'template/*.tmpl'              # template path
# routing
route:
  group_by: ['alertname']          # label used to group alerts
  group_wait: 10s                  # how long to wait before sending the first notification for a group
  group_interval: 10s              # how long to wait before sending notifications about new alerts added to a group
  repeat_interval: 1m              # how long to wait before re-sending a notification
  receiver: 'mail'                 # default receiver
# receivers
receivers:
  - name: 'mail'                   # receiver name
    email_configs:
      - to: '{{ template "email.to" . }}'          # alert recipients
        html: '{{ template "email.to.html" . }}'   # e-mail body template
        send_resolved: true
# inhibition rules
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
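Like the rule file, this configuration can be validated before (re)starting the stack. A sketch using the amtool binary bundled in the prom/alertmanager image:
docker run --rm -v /docker/alertmanager:/cfg --entrypoint amtool prom/alertmanager:latest check-config /cfg/alertmanager.yml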
vi /docker/alertmanager/template/em.tmpl
{{ define "email.from" }}[email protected]{{ end }}
{{ define "email.to" }}[email protected],[email protected]{{ end }}
{{ define "email.to.html" }}
{{ range .Alerts }}
=========start==========
        Alerting program: prometheus_alert
        Severity: {{ .Labels.severity }}
        Alert name: {{ .Labels.alertname }}
        Host: {{ .Labels.instance }}
        Summary: {{ .Annotations.summary }}
        Description: {{ .Annotations.description }}
        Fired at (UTC+8): {{ (.StartsAt.Add 28800e9) }}
=========end==========
{{ end }}
{{ end }}
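To see the rendered email without waiting for a real incident, a test alert can be pushed by hand. A sketch, assuming the Alertmanager from this guide is reachable on 192.1.1.12:9093 (the alert name and labels here are made up for the test):
docker run --rm --entrypoint amtool prom/alertmanager:latest alert add TestAlert severity=warning instance=test:9100 --annotation=summary="test notification" --alertmanager.url=http://192.1.1.12:9093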
vi /docker/blackbox_exporter/config.yml
modules:
http_2xx:
prober: http
http_post_2xx:
prober: http
http:
method: POST
tcp_connect:
prober: tcp
pop3s_banner:
prober: tcp
tcp:
query_response:
- expect: "^+OK"
tls: true
tls_config:
insecure_skip_verify: false
grpc:
prober: grpc
grpc:
tls: true
preferred_ip_protocol: "ip4"
grpc_plain:
prober: grpc
grpc:
tls: false
service: "service1"
ssh_banner:
prober: tcp
tcp:
query_response:
- expect: "^SSH-2.0-"
- send: "SSH-2.0-blackbox-ssh-check"
irc_banner:
prober: tcp
tcp:
query_response:
- send: "NICK prober"
- send: "USER prober prober prober :prober"
- expect: "PING :([^ ]+)"
send: "PONG ${1}"
- expect: "^:[^ ]+ 001"
icmp:
prober: icmp
icmp_ttl5:
prober: icmp
timeout: 5s
icmp:
ttl: 5
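These modules are also handy for checking what the http_status job's relabeling ultimately requests: Prometheus calls the exporter's /probe endpoint with module and target parameters. A sketch, assuming the exporter is published on this host's port 9115 as in the compose file below:
curl -s 'http://localhost:9115/probe?module=http_2xx&target=https://www.baidu.com' | grep probe_success   # 1 means the probe succeeded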
Create the web-probe targets file referenced by the http_status job:
vi /docker/prometheus/job-web.yml
---
- targets:
- https://www.baidu.com/
labels:
env: pro
app: web
    project: baidu
    desc: Baidu production
- targets:
- https://blog.csdn.net/
labels:
env: test
app: web
project: CSDN
    desc: just a test
    not_200: yes    # custom label marking targets that legitimately do not return a 200 status code
vi /docker/docker-compose.yml
version: '3.7'
services:
node-exporter:
image: prom/node-exporter:latest
restart: always
ports:
- "9100:9100"
networks:
- prom
# dingtalk:
# image: timonwong/prometheus-webhook-dingtalk:latest
# restart: always
# volumes:
# - type: bind
# source: ./alertmanager/config.yml
# target: /etc/prometheus-webhook-dingtalk/config.yml
# read_only: true
# ports:
# - "8060:8060"
# networks:
# - prom
alertmanager:
#depends_on:
# - dingtalk
image: prom/alertmanager:latest
restart: always
volumes:
- type: bind
source: ./alertmanager/alertmanager.yml
target: /etc/alertmanager/alertmanager.yml
read_only: true
- type: volume
source: alertmanager
target: /etc/alertmanager
ports:
- "9093:9093"
- "9094:9094"
networks:
- prom
prometheus:
depends_on:
- alertmanager
image: prom/prometheus:latest
restart: always
command:
- --config.file=/etc/prometheus/prometheus.yml
- --web.enable-lifecycle
volumes:
- type: bind
source: ./prometheus/prometheus.yml
target: /etc/prometheus/prometheus.yml
read_only: true
- type: bind
source: ./prometheus/rules.yml
target: /etc/prometheus/rules.yml
read_only: true
      - type: bind
        source: ./prometheus/job-web.yml
        target: /etc/prometheus/job-web.yml
        read_only: true
- type: volume
source: prometheus
target: /prometheus
ports:
- "9090:9090"
networks:
- prom
grafana:
depends_on:
- prometheus
image: grafana/grafana:latest
restart: always
volumes:
- type: volume
source: grafana
target: /var/lib/grafana
ports:
- "3000:3000"
networks:
- prom
cadvisor:
image: google/cadvisor:latest
#container_name: cadvisor
hostname: cadvisor
restart: always
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
ports:
- "8080:8080"
networks:
- prom
privileged: true
mysqld-exporter:
image: prom/mysqld-exporter
hostname: mysqld-exporter
restart: always
ports:
- "9104:9104"
environment:
      - DATA_SOURCE_NAME=root:12345@(10.211.122.9:3306)/   # username:password@(ip:port)
networks:
- prom
redis-exporter:
image: oliver006/redis_exporter
#container_name: mysqld-exporter
hostname: redis-exporter
restart: always
ports:
- "9121:9121"
command:
- --redis.addr=redis://10.211.122.9:6379
- --redis.password=123456
networks:
- prom
blackbox_exporter:
#container_name: blackbox_exporter
image: prom/blackbox-exporter:master
restart: always
volumes:
- /docker/blackbox_exporter/config.yml:/etc/blackbox_exporter/config.yml
    ports:
      - 9115:9115
    networks:
      - prom
rocketmq-exporter:
image: sawyerlan/rocketmq-exporter
#container_name: mysqld-exporter
    hostname: rocketmq-exporter
restart: always
ports:
- "5557:5557"
command:
#- --rocketmq.config.namesrvAddr=172.30.0.150:9876;172.30.0.151:9876
- --rocketmq.config.namesrvAddr=172.30.0.150:9876
networks:
- prom
volumes:
prometheus:
driver: local
driver_opts:
type: none
o: bind
device: /docker/prometheus/data
grafana:
driver: local
driver_opts:
type: none
o: bind
device: /docker/grafana
alertmanager:
driver: local
driver_opts:
type: none
o: bind
device: /docker/alertmanager
networks:
prom:
driver: bridge
Run docker-compose to bring the stack up (a minimal start-up sequence is sketched below). Once it is running, Prometheus can be hot-reloaded after prometheus.yml or rules.yml changes (enabled by --web.enable-lifecycle):
curl -X POST http://192.1.1.12:9090/-/reload
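A minimal start-up sequence, assuming the file above was saved as /docker/docker-compose.yml:
cd /docker
docker-compose up -d     # create and start all services in the background
docker-compose ps        # every container should show State: Up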
Install script (prometheusinstall.sh): https://www.aliyundrive.com/s/X5qiSZMeMDk
Grafana dashboard IDs:
blackbox_exporter (black-box probing): 14928, 14603
rocketmq_exporter: 14612, 14883
node_exporter: 8919
Docker containers: 13631
MySQL: 14934 (the "Buffer Pool Size of Total RAM" panel may show no data)