Original article: http://www.yellow-bricks.com/esxtop/ by Duncan Epping


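esxtop metrics and their thresholds (reference values the original author derived from the official documentation, testing, and hands-on experience)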

Metrics and Thresholds

Display   Metric     Threshold  Explanation
CPU       %RDY       10         Overprovisioning of vCPUs, excessive usage of vSMP or a limit (check %MLMTD) has been set. See Jason’s explanation for vSMP VMs.
CPU       %CSTP      3          Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU       %SYS       20         The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause.
CPU       %MLMTD     0          The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0, the world is being throttled due to the limit on CPU.
CPU       %SWPWT     5          VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
MEM       MCTLSZ     1          If larger than 0, the host is forcing VMs to inflate the balloon driver to reclaim memory because the host is overcommitted.
MEM       SWCUR      1          If larger than 0, the host has swapped memory pages in the past. Possible cause: overcommitment.
MEM       SWR/s      1          If larger than 0, the host is actively reading from swap (vswp). Possible cause: excessive memory overcommitment.
MEM       SWW/s      1          If larger than 0, the host is actively writing to swap (vswp). Possible cause: excessive memory overcommitment.
MEM       CACHEUSD   0          If larger than 0, the host has compressed memory. Possible cause: memory overcommitment.
MEM       ZIP/s      0          If larger than 0, the host is actively compressing memory. Possible cause: memory overcommitment.
MEM       UNZIP/s    0          If larger than 0, the host is accessing compressed memory. Possible cause: the host was previously overcommitted on memory.
MEM       N%L        80         If less than 80, the VM experiences poor NUMA locality. If a VM has a memory size greater than the amount of memory local to each processor, the ESX scheduler does not attempt to use NUMA optimizations for that VM and “remotely” uses memory via the “interconnect”. Check “GST_ND(X)” to find out which NUMA nodes are used.
NETWORK   %DRPTX     1          Dropped packets transmitted, hardware overworked. Possible cause: very high network utilization.
NETWORK   %DRPRX     1          Dropped packets received, hardware overworked. Possible cause: very high network utilization.
DISK      GAVG       25         Look at “DAVG” and “KAVG”, as the sum of both is GAVG.
DISK      DAVG       25         Disk latency most likely caused by the array.
DISK      KAVG       2          Disk latency caused by the VMkernel; high KAVG usually means queuing. Check “QUED”.
DISK      QUED       1          Queue maxed out. Possibly the queue depth is set too low. Check with the array vendor for the optimal queue depth value.
DISK      ABRTS/s    1          Aborts issued by the guest (VM) because storage is not responding. For Windows VMs this happens after 60 seconds by default. Can be caused, for instance, when paths have failed or the array is not accepting any IO for whatever reason.
DISK      RESETS/s   1          The number of commands reset per second.
DISK      CONS/s     20         SCSI Reservation Conflicts per second. If many SCSI Reservation Conflicts occur, performance could be degraded due to the lock on the VMFS.
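
As a rough illustration, the thresholds in the table can be written down as data and checked against a set of sampled values. The sketch below is in Python. The function name check_thresholds and the sample values are made up for illustration, and the comparison rule (flag a value that is above 0 and at or above the threshold, except N%L, which is flagged when below 80) is an assumption read from the explanations in the table, not something esxtop itself provides.

```python
# Thresholds copied from the table above: (display, metric) -> threshold.
THRESHOLDS = {
    ("CPU", "%RDY"): 10,
    ("CPU", "%CSTP"): 3,
    ("CPU", "%SYS"): 20,
    ("CPU", "%MLMTD"): 0,
    ("CPU", "%SWPWT"): 5,
    ("MEM", "MCTLSZ"): 1,
    ("MEM", "SWCUR"): 1,
    ("MEM", "SWR/s"): 1,
    ("MEM", "SWW/s"): 1,
    ("MEM", "CACHEUSD"): 0,
    ("MEM", "ZIP/s"): 0,
    ("MEM", "UNZIP/s"): 0,
    ("MEM", "N%L"): 80,
    ("NETWORK", "%DRPTX"): 1,
    ("NETWORK", "%DRPRX"): 1,
    ("DISK", "GAVG"): 25,
    ("DISK", "DAVG"): 25,
    ("DISK", "KAVG"): 2,
    ("DISK", "QUED"): 1,
    ("DISK", "ABRTS/s"): 1,
    ("DISK", "RESETS/s"): 1,
    ("DISK", "CONS/s"): 20,
}

# N%L is the only metric in the table where a LOW value is the problem.
LOWER_IS_WORSE = {("MEM", "N%L")}


def check_thresholds(sample):
    """Return the sorted (display, metric) keys in `sample` that breach the table.

    `sample` maps (display, metric) -> observed value. Keys that are not
    in the table are ignored.
    """
    breaches = []
    for key, value in sample.items():
        if key not in THRESHOLDS:
            continue
        limit = THRESHOLDS[key]
        if key in LOWER_IS_WORSE:
            breached = value < limit
        else:
            # Assumed rule: the value must be non-zero and reach the threshold.
            breached = value > 0 and value >= limit
        if breached:
            breaches.append(key)
    return sorted(breaches)


if __name__ == "__main__":
    # Hypothetical readings, typed in by hand for the example.
    readings = {
        ("CPU", "%RDY"): 12.5,
        ("CPU", "%MLMTD"): 0,
        ("MEM", "N%L"): 75,
        ("DISK", "KAVG"): 1.2,
    }
    for display, metric in check_thresholds(readings):
        print(f"{display} {metric} is outside its reference threshold")
```

Treat the result as a hint about where to look first. The thresholds are reference values, so a flagged metric still needs to be read together with the explanation in the table.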


Basic usage

Log in via the local console or SSH and run esxtop to start it

esxtop

The default refresh interval is 5 seconds. Press s and enter a positive integer to change it.

s 2

Use the following keys to switch views

c = cpu
m = memory
n = network
i = interrupts
d = disk adapter
u = disk device (includes NFS as of 4.0 Update 2)
v = disk VM
p = power states
V = only show virtual machine worlds
e = Expand/Rollup CPU statistics, show details of all worlds associated with group (GID)
k = kill world, for tech support purposes only!
l = limit display to a single group (GID), enables you to focus on one VM
# = limit the number of entities, for instance the top 5
2 = highlight a row, moving down
8 = highlight a row, moving up
4 = remove selected row from view
e = statistics broken down per world
6 = statistics broken down per world

Adding and removing fields

f <enter the letter of the field as prompted on screen>

Changing the field order

o <enter the letter of the field to move it: uppercase moves it left, lowercase moves it right>

Saving the configuration

W

If you keep the default file name when saving, the saved configuration becomes the default.

Getting help

?

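In large environments esxtop itself can consume a lot of CPU, because of the amount of data it has to collect and process. A command-line option lets you lock esxtop to specific entities and specific information, which reduces the CPU it consumes.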

esxtop -l

For more information, follow the link in the original article.


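Collecting data in batch mode

First decide which information you need, add or remove fields accordingly (f), and save the result to the configuration file (W).

Run the following command to collect data and save the result to a CSV file.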

esxtop -b -d 2 -n 100 > esxtopcapture.csv

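Here "-b" selects batch mode, "-d 2" sets a 2-second sampling interval, and "-n 100" collects 100 samples. 100 samples at a 2-second interval cover 200 seconds of data. To collect all metrics, add the "-a" option.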
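
The relationship between the options can be sketched in a few lines of Python. This is only an illustration: the helper names build_batch_command and capture_duration_s are made up here, and only the flags described above (-b, -a, -d, -n) are used.

```python
def build_batch_command(delay_s, iterations, all_metrics=False):
    """Build the esxtop batch-mode argument list from the flags described above."""
    if delay_s <= 0 or iterations <= 0:
        raise ValueError("delay_s and iterations must be positive integers")
    cmd = ["esxtop", "-b"]
    if all_metrics:
        cmd.append("-a")  # collect every metric, not only the saved fields
    cmd += ["-d", str(delay_s), "-n", str(iterations)]
    return cmd


def capture_duration_s(delay_s, iterations):
    """Approximate wall-clock length of the capture in seconds."""
    return delay_s * iterations


if __name__ == "__main__":
    print(" ".join(build_batch_command(2, 100)))  # esxtop -b -d 2 -n 100
    print(capture_duration_s(2, 100))             # 200
```

Working out the duration first makes it easier to pick an interval and sample count that cover the period you actually want to observe.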

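If there are many entities or the collection period is long, the output can become very large. In that case you can compress it with gzip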

esxtop -b -a -d 2 -n 100 | gzip -9c > esxtopoutput.csv.gz

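Note that data collected this way does not include virtual machines created after the command was started, or virtual machines that were vMotioned in from other hosts. In this respect it behaves like the -l option.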
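
A gzip-compressed capture does not have to be unpacked before you look at it. The Python sketch below reads only the header row to list the captured columns. The function name read_capture_header is hypothetical, and the demo writes a tiny stand-in file with invented column names instead of using a real capture.

```python
import csv
import gzip
import os
import tempfile


def read_capture_header(path):
    """Return the column names from a gzip-compressed CSV capture."""
    # Text mode ("rt") lets the csv module read the decompressed stream directly.
    with gzip.open(path, "rt", newline="") as handle:
        return next(csv.reader(handle), [])


if __name__ == "__main__":
    # Stand-in file for the demo; a real capture would come from esxtop -b.
    demo_path = os.path.join(tempfile.mkdtemp(), "esxtopoutput.csv.gz")
    with gzip.open(demo_path, "wt", newline="") as handle:
        handle.write('"Time","example column"\n"t0","1.0"\n')
    print(read_capture_header(demo_path))
```

Listing the columns first is a cheap way to confirm that the fields you saved with W ended up in the capture.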


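Data analysis

There are several options. The official approach is to use Windows Performance Monitor or Excel. http://labs.vmware.com/flings/ also hosts tools that can present the data, such as visualEsxtop and esxplot.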
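
For a quick look without any of these tools, the CSV can also be summarized with a short script. The Python sketch below finds the highest value per column for one counter. The function name max_per_column is hypothetical, the sample data is invented, and the column naming it relies on (a perfmon-style path ending in the counter name, such as "% Ready") is an assumption about the capture format that you should check against the header of your own file.

```python
import csv
import io

# Invented column names in an assumed perfmon-style layout.
COL_A = r"\\esx01\Group Cpu(1001:vm-a)\% Ready"
COL_B = r"\\esx01\Group Cpu(1002:vm-b)\% Ready"

# Invented sample data standing in for a real capture.
SAMPLE = "\n".join([
    ",".join(f'"{name}"' for name in ["Time", COL_A, COL_B]),
    '"t0","1.5","12.0"',
    '"t1","3.0","9.5"',
])


def max_per_column(csv_text, counter_suffix):
    """Return {column name: max value} for columns ending in counter_suffix."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    wanted = [i for i, name in enumerate(header) if name.endswith(counter_suffix)]
    maxima = {}
    for row in reader:
        for i in wanted:
            try:
                value = float(row[i])
            except (ValueError, IndexError):
                continue  # skip blank or malformed cells
            name = header[i]
            if name not in maxima or value > maxima[name]:
                maxima[name] = value
    return maxima


if __name__ == "__main__":
    for name, peak in max_per_column(SAMPLE, "% Ready").items():
        print(f"{name}: peak {peak}")
```

The peaks can then be compared with the reference thresholds at the top of this article, for example 10 for %RDY.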


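Miscellaneous

In practice the display may be incomplete because of the number of entities, the width of the fields, the screen resolution, and so on. You can limit the view by exporting the entity list, editing it, and importing it again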

esxtop -export-entity filename

After exporting, edit the file and comment out the parts you do not need

esxtop -import-entity filename

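The following commands pick out the information for a single virtual machine from the command line. Replace virtualmachinename with the name of your VM (untested)

VMWID=`vm-support -x | grep virtualmachinename | awk '{gsub("wid=", "");print $1}'`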
VMXCARTEL=`vsish -e cat /vm/$VMWID/vmxCartelID`
vsish -e cat /sched/memClients/$VMXCARTEL/SchedGroupID