Monitoring Lecture Series (12): Common System Monitoring Metrics: Storage

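4. Disk/Storage Monitoring Metrics

Generally speaking, when we monitor storage devices we are mostly monitoring file systems, that is, the part the operating system can use directly. In real production environments, however, there are other monitoring needs:

  • Disks that have not been formatted with a file system (XFS, FAT32, EXT4). These are invisible to the ordinary df command. Examples are Oracle ASM disks, raw block storage, and LUNs mapped from a storage array (FC, iSCSI). Special tooling is needed before we can work with them.
  • Devices that provide storage, such as HP 3PAR, IBM DS series, and EMC arrays, or servers that provide storage services, such as NFS servers and Ceph/Swift clusters. For vendor products it is best to ask the vendor's engineers about monitoring: whether a plugin lets monitoring software (Zabbix, Grafana) collect metrics directly, or whether there is an API that external programs can poll. Even if neither exists, SNMP remains a simple way to monitor the device, but it exposes a limited number of metrics and should be treated as a fallback solution. For solutions built on open-source software, what our customers or managers most want to hear are hard numbers such as random read/write speed and sequential read/write speed, because these are among the key measures of the system.

We will cover this in detail later when we discuss distributed storage and Ceph. Here we only compare the metrics that the built-in templates of a few tools can monitor.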

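4.1. Viewing disk metrics on the system

Again, there are two approaches:

  • By command: top, iostat, vmstat, and sar show instantaneous rates, while df-style commands show usage. You can also use dd together with time and judge speed from the read/write result. There are third-party tools as well, such as FIO, hdparm, and smartctl.

  • By file: a disk is generally also exposed as files, each carrying its own metrics. The information is under /sys/block/sda, where sda is the device name. Not every Linux/Unix system works this way; macOS, for example, has no /sys/block directory. (The listing below contains mmcblk0p1 and mmcblk0p2 partition entries, so it appears to have been taken from an mmcblk0 device.)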

    # ls /sys/block/sda
    alignment_offset  discard_alignment  inflight   queue      slaves
    bdi               ext_range          mmcblk0p1  range      stat
    capability        force_ro           mmcblk0p2  removable  subsystem
    dev               hidden             mq         ro         trace
    device            holders            power      size       uevent
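
As a rough illustration of reading metrics "by file", the sketch below parses the stat file that appears in the listing above. The field names are my own labels for the columns described in the kernel's block-layer documentation (Documentation/block/stat.rst), and the helper names parse_block_stat and read_block_stat are hypothetical. The sample line is reconstructed from the mmcblk0 values shown by node_exporter in section 4.4, not captured from the file itself.

```python
from pathlib import Path

# Column order of /sys/block/<dev>/stat per the kernel documentation.
# Older kernels expose only the first 11 columns; the discard and flush
# columns were added later, so a shorter line is tolerated.
STAT_FIELDS = [
    "read_ios", "read_merges", "read_sectors", "read_ticks_ms",
    "write_ios", "write_merges", "write_sectors", "write_ticks_ms",
    "in_flight", "io_ticks_ms", "time_in_queue_ms",
    "discard_ios", "discard_merges", "discard_sectors", "discard_ticks_ms",
    "flush_ios", "flush_ticks_ms",
]


def parse_block_stat(text):
    """Map the whitespace-separated counters of a stat file to field names."""
    values = [int(v) for v in text.split()]
    return dict(zip(STAT_FIELDS, values))


def read_block_stat(device):
    """Read and parse /sys/block/<device>/stat (Linux only)."""
    return parse_block_stat(Path("/sys/block", device, "stat").read_text())


if __name__ == "__main__":
    # Reconstructed sample (assumption): counters rebuilt from section 4.4.
    sample = "4883 6505 455012 11972 1456 2529 136386 26967 0 11476 16476"
    stats = parse_block_stat(sample)
    # Sector counts in this file are in 512-byte units.
    print(stats["read_sectors"] * 512, "bytes read")
```

On a Linux host, read_block_stat("sda") would return the same dictionary for a live device, assuming that device exists.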
    

The metrics visible on the system itself are in fact the most complete, whereas the commonly used vmstat command reports very few of them:

   Swap
       si: Amount of memory swapped in from disk (/s).
       so: Amount of memory swapped to disk (/s).

   IO
       bi: Blocks received from a block device (blocks/s).
       bo: Blocks sent to a block device (blocks/s).

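That is, only swap-in/swap-out and block-device reads/writes. To look at disk I/O we often use iostat -d; plain iostat, as in the example below, also prints the CPU summary: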

$ iostat
Linux 2.6.32-431.11.15.el6.ucloud.x86_64 (ssdk1)     10/14/2016     _x86_64_    (4 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.44    0.00    0.26    0.01    0.01   99.29

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
vda               0.66         0.09         6.75    1404732  105885456
vdb               1.42        12.47        55.86  195619082  876552296

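This shows the throughput of every disk:

tps: number of transfers (I/O requests) per second issued to the device
Blk_read/s: amount of data read from the device, expressed in blocks per second
Blk_wrtn/s: amount of data written to the device, expressed in blocks per second
Blk_read: total number of blocks read
Blk_wrtn: total number of blocks written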
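
To make the block-based columns easier to read, here is a small sketch that converts them to KiB/s and derives a rate from two readings of a cumulative counter. It assumes 512-byte blocks, which is what the iostat man page states for Linux; the function names and the second Blk_wrtn reading are made up for illustration. Keep in mind that the first report iostat prints is an average since boot, so an interval-based rate like the one below is usually more useful for monitoring.

```python
BLOCK_SIZE = 512  # bytes; assumption: iostat "blocks" are 512-byte sectors


def blocks_per_sec_to_kib(blocks_per_sec):
    """Convert a Blk_read/s or Blk_wrtn/s value to KiB per second."""
    return blocks_per_sec * BLOCK_SIZE / 1024


def average_rate(prev_total, curr_total, interval_seconds):
    """Average blocks per second between two readings of a cumulative counter."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return (curr_total - prev_total) / interval_seconds


if __name__ == "__main__":
    # vdb in the report above writes 55.86 blocks per second.
    print(f"{blocks_per_sec_to_kib(55.86):.2f} KiB/s")
    # Hypothetical second reading of Blk_wrtn for vdb, taken 10 s later.
    print(average_rate(876552296, 876552856, 10), "blocks/s")
```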

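Next is the df command, which shows disk usage. This is a very important metric: as with CPU exhaustion, if a disk fills up, running programs may terminate unexpectedly because they can no longer write data.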

df -H
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       126G  2.0G  119G   2% /
devtmpfs        1.9G     0  1.9G   0% /dev
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           2.0G  8.8M  2.0G   1% /run
tmpfs           5.3M  4.1k  5.3M   1% /run/lock
tmpfs           2.0G     0  2.0G   0% /sys/fs/cgroup
/dev/mmcblk0p1  265M   55M  210M  21% /boot
tmpfs           400M     0  400M   0% /run/user/1000
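
The Use% column can also be checked from a script. The following is a minimal sketch, assuming GNU df's convention of used / (used + avail) rounded up, which matches the rows above (for example 55M used and 210M available on /boot gives 21%). The helper names and the 90% threshold are illustrative choices, not something the tools prescribe.

```python
import math
import shutil


def use_percent(used, avail):
    """Use% as df reports it: used / (used + avail), rounded up."""
    total = used + avail
    if total == 0:
        return 0
    return math.ceil(used * 100 / total)


def check_mount(path, threshold=90):
    """Return (percent, over_threshold) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    percent = use_percent(usage.used, usage.free)
    return percent, percent >= threshold


if __name__ == "__main__":
    percent, alarm = check_mount("/")
    print(f"/ is {percent}% full", "(ALERT)" if alarm else "")
```

A monitoring agent would typically run such a check per mount point on a schedule and alert well before 100%, since a full disk is already an outage.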

4.2. Storage monitoring metrics in Zabbix

These are much the same as what we see on the system itself.

[Screenshot: image-20200724235135473.png]

4.3. Storage monitoring metrics in Grafana

Grafana adds inode monitoring; everything else is basically the same.

[Screenshot: image-20200725000108563.png]
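
Inode usage matters because a filesystem can run out of inodes while free space remains, and writes fail just the same. A minimal sketch of such a check follows, using os.statvfs from the Python standard library (Unix only). The function names are hypothetical, and returning None when the filesystem reports zero inodes is my own convention.

```python
import os


def inode_use_percent(total_inodes, free_inodes):
    """Percentage of inodes in use, or None if the filesystem reports none."""
    if total_inodes == 0:
        return None  # some filesystems do not expose a fixed inode count
    return (total_inodes - free_inodes) * 100 / total_inodes


def inode_usage(path):
    """Inode usage for the filesystem that holds `path`."""
    st = os.statvfs(path)  # f_files: total inodes, f_ffree: free inodes
    return inode_use_percent(st.f_files, st.f_ffree)


if __name__ == "__main__":
    print("inode usage of /:", inode_usage("/"))
```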

4.4. Storage monitoring metrics in node_exporter

There appear to be many more metrics here:

# HELP node_disk_discard_time_seconds_total This is the total number of seconds spent by all discards.
# TYPE node_disk_discard_time_seconds_total counter
node_disk_discard_time_seconds_total{device="mmcblk0"} 0
node_disk_discard_time_seconds_total{device="mmcblk0p1"} 0
node_disk_discard_time_seconds_total{device="mmcblk0p2"} 0
# HELP node_disk_discarded_sectors_total The total number of sectors discarded successfully.
# TYPE node_disk_discarded_sectors_total counter
node_disk_discarded_sectors_total{device="mmcblk0"} 0
node_disk_discarded_sectors_total{device="mmcblk0p1"} 0
node_disk_discarded_sectors_total{device="mmcblk0p2"} 0
# HELP node_disk_discards_completed_total The total number of discards completed successfully.
# TYPE node_disk_discards_completed_total counter
node_disk_discards_completed_total{device="mmcblk0"} 0
node_disk_discards_completed_total{device="mmcblk0p1"} 0
node_disk_discards_completed_total{device="mmcblk0p2"} 0
# HELP node_disk_discards_merged_total The total number of discards merged.
# TYPE node_disk_discards_merged_total counter
node_disk_discards_merged_total{device="mmcblk0"} 0
node_disk_discards_merged_total{device="mmcblk0p1"} 0
node_disk_discards_merged_total{device="mmcblk0p2"} 0
# HELP node_disk_io_now The number of I/Os currently in progress.
# TYPE node_disk_io_now gauge
node_disk_io_now{device="mmcblk0"} 0
node_disk_io_now{device="mmcblk0p1"} 0
node_disk_io_now{device="mmcblk0p2"} 0
# HELP node_disk_io_time_seconds_total Total seconds spent doing I/Os.
# TYPE node_disk_io_time_seconds_total counter
node_disk_io_time_seconds_total{device="mmcblk0"} 11.476
node_disk_io_time_seconds_total{device="mmcblk0p1"} 0.44
node_disk_io_time_seconds_total{device="mmcblk0p2"} 11.064
# HELP node_disk_io_time_weighted_seconds_total The weighted # of seconds spent doing I/Os.
# TYPE node_disk_io_time_weighted_seconds_total counter
node_disk_io_time_weighted_seconds_total{device="mmcblk0"} 16.476
node_disk_io_time_weighted_seconds_total{device="mmcblk0p1"} 0.668
node_disk_io_time_weighted_seconds_total{device="mmcblk0p2"} 15.792
# HELP node_disk_read_bytes_total The total number of bytes read successfully.
# TYPE node_disk_read_bytes_total counter
node_disk_read_bytes_total{device="mmcblk0"} 2.32966144e+08
node_disk_read_bytes_total{device="mmcblk0p1"} 1.153536e+07
node_disk_read_bytes_total{device="mmcblk0p2"} 2.20890112e+08
# HELP node_disk_read_time_seconds_total The total number of seconds spent by all reads.
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="mmcblk0"} 11.972
node_disk_read_time_seconds_total{device="mmcblk0p1"} 0.704
node_disk_read_time_seconds_total{device="mmcblk0p2"} 11.232000000000001
# HELP node_disk_reads_completed_total The total number of reads completed successfully.
# TYPE node_disk_reads_completed_total counter
node_disk_reads_completed_total{device="mmcblk0"} 4883
node_disk_reads_completed_total{device="mmcblk0p1"} 416
node_disk_reads_completed_total{device="mmcblk0p2"} 4447
# HELP node_disk_reads_merged_total The total number of reads merged.
# TYPE node_disk_reads_merged_total counter
node_disk_reads_merged_total{device="mmcblk0"} 6505
node_disk_reads_merged_total{device="mmcblk0p1"} 3795
node_disk_reads_merged_total{device="mmcblk0p2"} 2710
# HELP node_disk_write_time_seconds_total This is the total number of seconds spent by all writes.
# TYPE node_disk_write_time_seconds_total counter
node_disk_write_time_seconds_total{device="mmcblk0"} 26.967000000000002
node_disk_write_time_seconds_total{device="mmcblk0p1"} 0.008
node_disk_write_time_seconds_total{device="mmcblk0p2"} 26.958000000000002
# HELP node_disk_writes_completed_total The total number of writes completed successfully.
# TYPE node_disk_writes_completed_total counter
node_disk_writes_completed_total{device="mmcblk0"} 1456
node_disk_writes_completed_total{device="mmcblk0p1"} 3
node_disk_writes_completed_total{device="mmcblk0p2"} 1453
# HELP node_disk_writes_merged_total The number of writes merged.
# TYPE node_disk_writes_merged_total counter
node_disk_writes_merged_total{device="mmcblk0"} 2529
node_disk_writes_merged_total{device="mmcblk0p1"} 0
node_disk_writes_merged_total{device="mmcblk0p2"} 2529
# HELP node_disk_written_bytes_total The total number of bytes written successfully.
# TYPE node_disk_written_bytes_total counter
node_disk_written_bytes_total{device="mmcblk0"} 6.9829632e+07
node_disk_written_bytes_total{device="mmcblk0p1"} 5120
node_disk_written_bytes_total{device="mmcblk0p2"} 6.9824512e+07
node_scrape_collector_duration_seconds{collector="diskstats"} 0.001754445
node_scrape_collector_success{collector="diskstats"} 1
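  • merged: counts adjacent I/O requests that the kernel combined into a single request before issuing it to the device. A high value relative to completed requests usually indicates sequential access.

  • discard: counts discard (TRIM/UNMAP) operations, which tell the device that a range of blocks is no longer in use; they are mainly relevant on SSDs and thin-provisioned storage. Media health is a separate matter, reported by SMART data (for example via smartctl).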
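
These counters become more useful once combined. As a sketch, the code below parses device-labelled samples from node_exporter's text output and derives the average time per completed read. The regular expression only handles the single device="..." label seen in the output above, and the function names are made up for illustration; a real setup would let Prometheus do the parsing.

```python
import re

# Matches lines such as: node_disk_reads_completed_total{device="mmcblk0"} 4883
SAMPLE_RE = re.compile(r'^(\w+)\{device="([^"]+)"\}\s+(\S+)$')


def parse_disk_metrics(text):
    """Return {metric_name: {device: value}} for device-labelled samples."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        match = SAMPLE_RE.match(line)
        if match:
            name, device, value = match.groups()
            metrics.setdefault(name, {})[device] = float(value)
    return metrics


def avg_read_latency_ms(metrics, device):
    """Lifetime average milliseconds per completed read for one device."""
    reads = metrics["node_disk_reads_completed_total"][device]
    seconds = metrics["node_disk_read_time_seconds_total"][device]
    return seconds / reads * 1000 if reads else 0.0


# Two samples copied from the output above.
SAMPLE = """\
# TYPE node_disk_read_time_seconds_total counter
node_disk_read_time_seconds_total{device="mmcblk0"} 11.972
# TYPE node_disk_reads_completed_total counter
node_disk_reads_completed_total{device="mmcblk0"} 4883
"""

if __name__ == "__main__":
    parsed = parse_disk_metrics(SAMPLE)
    print(f"{avg_read_latency_ms(parsed, 'mmcblk0'):.2f} ms per read")
```

Because these are counters accumulated since boot, the result is a lifetime average. For live monitoring the same ratio is normally taken over a time window, for example by dividing the rate() of the two series in Prometheus.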