1. Quick command: sysdig
curl -s https://s3.amazonaws.com/download.draios.com/stable/install-sysdig | bash
Output of running sysdig -cl | less:
Category: Application
---------------------
httplog HTTP requests log
httptop Top HTTP requests
memcachelog memcached requests log
Category: CPU Usage
-------------------
spectrogram Visualize OS latency in real time.
subsecoffset Visualize subsecond offset execution time.
topcontainers_cpu
Top containers by CPU usage
topprocs_cpu Top processes by CPU usage
Category: Errors
----------------
topcontainers_error
Top containers by number of errors
topfiles_errors Top files by number of errors
topprocs_errors top processes by number of errors
Category: I/O
-------------
echo_fds Print the data read and written by processes.
fdbytes_by I/O bytes, aggregated by an arbitrary filter field
fdcount_by FD count, aggregated by an arbitrary filter field
fdtime_by FD time group by
iobytes Sum of I/O bytes on any type of FD
iobytes_file Sum of file I/O bytes
spy_file Echo any read/write made by any process to all files. Optionally, you can provide the name of one file to only intercept reads/writes to that file.
stderr Print stderr of processes
stdin Print stdin of processes
stdout Print stdout of processes
topcontainers_file
Top containers by R+W disk bytes
topfiles_bytes Top files by R+W bytes
topfiles_time Top files by time
topprocs_file Top processes by R+W disk bytes
Category: Logs
--------------
spy_logs Echo any write made by any process to a log file. Optionally, export the events around each log message to file.
spy_syslog Print every message written to syslog. Optionally, export the events around each syslog message to file.
Category: Misc
--------------
around Export to file the events around the time range where the given filter matches.
Category: Net
-------------
iobytes_net Show total network I/O bytes
spy_ip Show the data exchanged with the given IP address
spy_port Show the data exchanged using the given IP port number
topconns Top network connections by total bytes
topcontainers_net
Top containers by network I/O
topports_server Top TCP/UDP server ports by R+W bytes
topprocs_net Top processes by network I/O
Category: Performance
---------------------
bottlenecks Slowest system calls
fileslower Trace slow file I/O
netlower Trace slow network I/0
proc_exec_time Show process execution time
scallslower Trace slow syscalls
topscalls Top system calls by number of calls
topscalls_time Top system calls by time
Category: Security
------------------
list_login_shells
List the login shell IDs
shellshock_detect
print shellshock attacks
spy_users Display interactive user activity
Category: System State
----------------------
lscontainers List the running containers
lsof List (and optionally filter) the open file descriptors.
netstat List (and optionally filter) network connections.
ps List (and optionally filter) the machine processes.
Category: Tracers
-----------------
tracers_2_statsd
Export spans duration as statds metrics.
Use the -i flag to get detailed information about a specific chisel
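The listing is long, so a small awk filter helps to pull out a single category. This is only a sketch, not part of sysdig: the function name chisels_in_category is made up for illustration, and it assumes that chisel names start in column 0 while wrapped descriptions are indented, as in the usual terminal layout. The sample listing comes from a here-document, so the snippet runs even where sysdig is not installed.

```shell
#!/bin/sh
# Hypothetical helper: print the chisel names listed under one category.
# Reads `sysdig -cl` style text on stdin; $1 is the category name.
chisels_in_category() {
    awk -v want="$1" '
        /^Category: /     { in_cat = (substr($0, 11) == want); next }
        /^-+$/            { next }   # underline below the category title
        /^[ \t]/          { next }   # indented continuation of a description
        /^Use the -i flag/ { next }  # footer printed after the last category
        in_cat && NF      { print $1 }   # first word is the chisel name
    '
}

# Sample input taken from the listing above (the indentation is an assumption).
sample_listing() {
    cat <<'EOF'
Category: CPU Usage
-------------------
spectrogram     Visualize OS latency in real time.
topcontainers_cpu
                Top containers by CPU usage
topprocs_cpu    Top processes by CPU usage

Category: Errors
----------------
topfiles_errors Top files by number of errors
EOF
}

sample_listing | chisels_in_category "CPU Usage"
```

On a machine with sysdig installed, the same function can be fed directly, for example: sysdig -cl | chisels_in_category "I/O".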
2. sysdig case study: analyzing disk I/O activity with the fdbytes_by chisel
http://shanker.blog.51cto.com/1189689/1771418
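Today I want to share how to use fdbytes_by. This case study shows how to find which file on the system accounts for the most I/O (not only files; network I/O works too), which process is reading or writing that file, and the kernel-level detail of the I/O activity. Typical uses are checking whether your file system is running efficiently, or investigating a disk I/O latency problem. Combining it with dstat --top-io makes it easier to pin down the process name, but this post focuses on sysdig's fdbytes_by chisel, so imagine a scenario where dstat is not available.
First, let's look at the usage details of today's protagonist, fdbytes_by: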
# sysdig -i fdbytes_by
Category: I/O
-------------
fdbytes_by I/O bytes, aggregated by an arbitrary filter field
Groups FD activity based on the given filter field, and returns the key that generated the most input+output bytes. For example, this script can be used to list the processes or TCP ports that generated most traffic.
Args:
[string] key - The filter field used for grouping
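Roughly speaking, it ranks the values of the chosen key by the amount of I/O generated by file descriptor activity.
2.1 First, capture sysdig trace data in 30 MB files for later analysis. With -C 30, sysdig splits the capture into files of that size and appends a counter to the name given to -w, which is why the commands below read fdbytes_by.scap0.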
sysdig -w fdbytes_by.scap -C 30
2.2 Next, analyze the I/O activity in this capture, grouped by file descriptor type:
sysdig -r fdbytes_by.scap0 -c fdbytes_by fd.type
Bytes fd.type
--------------------------------------------------------------------------------
45.16M file
9.30M ipv4
87.55KB unix
316B
60B pipe
file accounts for 45.16M, which makes it the FD type with the most I/O.
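To see how dominant each FD type is, the human-readable sizes can be converted to bytes and expressed as a share of the total. The following is a rough sketch: fd_shares is a hypothetical helper name, and it assumes the K/M/G suffixes are 1024-based, which I have not verified against the sysdig source. A row with no key, like the 316B line above, is labeled (none).

```shell
#!/bin/sh
# Hypothetical helper: read "SIZE KEY" rows (fdbytes_by style output) on
# stdin and print each key with its percentage of the total bytes.
fd_shares() {
    awk '
        # Convert 45.16M / 87.55KB / 316B to bytes (assumes 1024-based units).
        function bytes(s,   n, u, m) {
            n = s + 0                      # leading number, e.g. 45.16
            u = s; gsub(/[0-9.]/, "", u)   # unit suffix, e.g. M
            m = 1
            if (u == "KB") m = 1024
            else if (u == "M") m = 1024 * 1024
            else if (u == "G") m = 1024 * 1024 * 1024
            return n * m
        }
        $1 ~ /^[0-9]/ {                    # skips the header and ruler lines
            key = (NF > 1) ? $2 : "(none)"
            b[key] = bytes($1); order[++cnt] = key; total += b[key]
        }
        END {
            for (i = 1; i <= cnt; i++)
                printf "%-8s %5.1f%%\n", order[i], 100 * b[order[i]] / total
        }
    '
}

# The rows from the fd.type output above.
sample_fd_types() {
    cat <<'EOF'
Bytes fd.type
45.16M file
9.30M ipv4
87.55KB unix
316B
60B pipe
EOF
}

sample_fd_types | fd_shares
```

With a real capture the input would come from sysdig itself, for example: sysdig -r fdbytes_by.scap0 -c fdbytes_by fd.type | fd_shares.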
2.3 Next, sort the I/O activity by directory:
# sysdig -r fdbytes_by.scap0 -c fdbytes_by fd.directory
Bytes fd.directory
--------------------------------------------------------------------------------
38.42M /etc
7.59M /
5.04M /var/www/html
1.38M /var/log/nginx
304.73KB /root/.zsh_history/root
7.31KB /lib/x86_64-linux-gnu
2.82KB /dev
2.76KB /dev/pts
1.62KB /usr/lib/x86_64-linux-gnu
The /etc directory accounts for the most I/O.
2.4 So which file exactly is being accessed?
# sysdig -r fdbytes_by.scap0 -c fdbytes_by fd.name fd.directory=/etc
Bytes fd.name
--------------------------------------------------------------------------------
38.42M /etc/services
2.5 Bingo! /etc/services accounts for the most I/O. Because services is a system file, it is safe to conclude that these 38.42M are reads. So which process is accessing this file?
# sysdig -r fdbytes_by.scap0 -c fdbytes_by proc.name "fd.filename=services and fd.directory=/etc"
Bytes proc.name
--------------------------------------------------------------------------------
38.42M nscd
2.6 Found the culprit: the nscd caching daemon. But why does it read the services file so many times? Let's keep looking:
# sysdig -r fdbytes_by.scap0 -A -s 4096 -c echo_fds proc.name=nscd
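It turns out nscd was reading the mappings between ports and service names defined in services. During the capture I was running ab to stress-test a static nginx page. I had expected nginx to show the heaviest reads and writes, and did not expect nscd to get in the way: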
ab -k -c 2000 -n 300000 http://shanker.heyoa.com/index.html
# sysdig -r fdbytes_by.scap0 -c topprocs_file
Bytes Process PID
--------------------------------------------------------------------------------
38.42M nscd 1343
6.43M nginx 4804
304.89KB zsh 32402
9.20KB ab 20774
2.79KB screen 18338
2.37KB sshd 12812
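Afterwards I timed the ab test once with nscd enabled and once without nscd caching. Enabling nscd as a local cache for services did improve the total time by 10.189%, that is, (38.561 - 34.632) / 38.561: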
ab -k -c 2000 -n 300000 http://shanker.heyoa.com/index.html 0.94s user 2.77s system 9% cpu 38.561 total
ab -k -c 2000 -n 300000 http://shanker.heyoa.com/index.html 0.93s user 2.79s system 10% cpu 34.632 total
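The 10.189% figure can be checked from the two wall-clock totals above. A minimal sketch, where speedup_pct is just an illustrative helper name:

```shell
#!/bin/sh
# Relative reduction in wall-clock time, in percent, to three decimals.
# $1 = slower total in seconds, $2 = faster total in seconds.
speedup_pct() {
    awk -v slow="$1" -v fast="$2" \
        'BEGIN { printf "%.3f\n", (slow - fast) / slow * 100 }'
}

speedup_pct 38.561 34.632   # prints 10.189
```

The percentage is taken relative to the slower run, which is what reproduces the number quoted in the text.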
For more on speeding things up with nscd caching, see this earlier post:
http://shanker.blog.51cto.com/1189689/1735058
That concludes the analysis. This post is just one example of how to use the fdbytes_by chisel; sysdig provides many more chisels for analyzing your system.
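The drill-down used in this case study (FD type, then directory, then file, then process) can be collected into one small script. The sketch below only prints the commands and does not run them, so it works without sysdig; drilldown_cmds and its three parameters are hypothetical names, while the sysdig command lines themselves are the ones used above.

```shell
#!/bin/sh
# Print the drill-down commands from this case study for a capture file.
# Nothing is executed here; pipe the output to `sh` on a machine that has
# sysdig installed to actually run the four steps.
drilldown_cmds() {
    capture="$1"   # capture file, e.g. fdbytes_by.scap0
    dir="$2"       # directory to zoom into, e.g. /etc
    file="$3"      # file name inside that directory, e.g. services
    cat <<EOF
sysdig -r $capture -c fdbytes_by fd.type
sysdig -r $capture -c fdbytes_by fd.directory
sysdig -r $capture -c fdbytes_by fd.name fd.directory=$dir
sysdig -r $capture -c fdbytes_by proc.name "fd.filename=$file and fd.directory=$dir"
EOF
}

drilldown_cmds fdbytes_by.scap0 /etc services
```

In practice the directory and file are not known in advance, so each step would be run by hand and its top result fed into the next one.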
3. Performance tuning, comprehensive edition: Sysdig, a powerful tool for Linux performance monitoring and troubleshooting
http://shanker.blog.51cto.com/1189689/1768735
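The latest version of Sysdig is available as a Docker container image, so you can simply pull the image. In addition, it provides container-level information collection (sysdig -pc container.name=your_container_name) and supports querying the network traffic between specific containers, the CPU usage of a specific container, and so on.
The company's commercial product, Sysdig Cloud, is a container-level tool for monitoring and debugging system information and network traffic, and was presented at CoreOS Fest. It supports Real-Time Dashboard, Historical Replay, Dynamic Topology and Intelligent Alert; think of it as playing the role that Nagios plays for system monitoring.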
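For installation, see the official documentation: http://www.sysdig.org/install/ Sysdig is easier to install than SystemTap. This article is already long, so I won't spend space on installation; if you are familiar with Ansible, you can use the sysdig role on Galaxy directly: https://galaxy.ansible.com/detail#/role/692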
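For recording and replaying system traces, Sysdig's syntax resembles tcpdump and perf. For performance analysis, its chisels resemble SystemTap and dstat's --top* options, except that with SystemTap you have to write the scripts yourself (once written, they can be more powerful than Sysdig), whereas Sysdig has already written them for you. For interactive use, it resembles htop.
The simplest way to use it is to just run sysdig, which captures every system event and prints it straight to the screen.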