Performance tuning is something everyone needs to do, and should do periodically. I recently started a new job, and one of the ways I can get a handle on the SQL Servers I support is to start baselining performance. This series focuses on the methods I used to set up monitoring and analyze the data.
NOTE: Let me qualify this by saying that I built this system for someone to use in a somewhat manual process. My company has PATROL from BMC Software, which will automate most of this (for a significant price). There are also other products that can do this, but BMC's is the one I am most familiar with.
When I started this process, I went looking for a list of counters to use. Not that I didn't have ideas of my own, but it's nice to see what others are using and what they've learned. I checked out Brad McGehee's series on this topic at SQL Server Performance, which is a great resource, and also looked at the MS whitepapers. Both references are given below.
From those lists and some thought of my own, I narrowed my focus to the counters below. You cannot track everything, but these, IMHO, give me a good overview of what is happening on the server. Here's a nice, concise list of what I'd drop into Performance Monitor (or add to a third-party tool) to get a baseline on all my servers. Each counter is explained below, along with my rationale for choosing it.

Processor Object : % Processor Time
System Object : Processor Queue Length
Memory Object : Pages/Sec
Memory Object : Available MBytes
PhysicalDisk Object : Avg. Disk Queue Length
PhysicalDisk Object : % Idle Time
Network Interface Object : Bytes Total/Sec
SQL Server Access Methods Object : Full Scans/Sec
SQL Server Databases Object : Transactions/Sec
SQL Server Buffer Manager Object : Cache Hit Ratio
SQL Server General Statistics Object : User Connections
SQL Server Locks Object : Average Wait Time
That's it: a simple list you can plug into Performance Monitor and begin watching your system. That's fine if you trust me, or don't care and just want something to start with. Not the worst thing, but if you're interested in the reasons why I picked these counters, read on.
Each of these is discussed below, with an explanation and the reasoning behind why I use it.
Processor Object : % Processor Time
This is everyone's standard counter. Maybe that's because it's the default counter selected, but it's also a good gauge of how hard a server is working. It is defined as the percentage of time that the server is executing a non-idle thread, but it is actually calculated by looking at the percentage of time that the idle thread executes and subtracting that from 100.
I use this counter because it provides an indication of a first bottleneck. If the CPU is pegged, then the server probably cannot get much else done. I like to keep this number under 40% unless a large query is being processed. However, if this goes above 40%, I do not set an alert or have my monitoring software notify me immediately; doing that will result in you being notified constantly.
Instead, understand that your server will go to 100% CPU at times, but unless this is sustained for a period of time (for me, more than 15 minutes), I don't usually worry. I might lower this threshold for some systems, but for most of them this seems to work well.
However, over time, like over a month, if you constantly see your CPUs averaging more than 50%, you might want to start planning for an upgrade, and if they average more than 70%, you might want to hurry that planning along.
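To make the "spike versus sustained" distinction concrete, here's a minimal sketch of the kind of check I mean. The threshold and window match the numbers above; the sample values (and wherever they come from) are hypothetical, standing in for whatever your collection tool provides.

```python
from collections import deque

SUSTAINED_WINDOW_MINS = 15   # my threshold for "sustained"
CPU_ALERT_PCT = 40.0         # worry level from the discussion above

# Rolling window of the last 15 one-minute samples.
recent = deque(maxlen=SUSTAINED_WINDOW_MINS)

def record_sample(cpu_pct: float) -> None:
    """Add a sample; flag only when the whole window stays high."""
    recent.append(cpu_pct)
    if len(recent) == recent.maxlen and min(recent) > CPU_ALERT_PCT:
        print(f"Sustained CPU above {CPU_ALERT_PCT}% for "
              f"{SUSTAINED_WINDOW_MINS} minutes - investigate")

# Hypothetical samples: a brief 100% spike does not alert;
# only a full 15-minute window above the threshold would.
for sample in [35.0, 100.0, 38.0, 42.0, 30.0]:
    record_sample(sample)
```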
System Object : Processor Queue Length
This is an interesting counter, and one whose effect on the system I've had a hard time pinning down. For me, it's really a value that I watch over a sustained period to see if the CPU is a bottleneck. It represents how many threads are waiting for processor time. The count includes all threads that are "ready" and not waiting on something else, such as IO, to complete.
The general rule of thumb is that this value should be less than 2 x the number of CPUs in the system. If it is growing and the CPU % (above) is 80%+, then you likely need more or faster processors.
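Here's that rule of thumb as a quick sketch; the readings are made up, and in practice the queue length and CPU % would come from your Performance Monitor log.

```python
import os

def queue_looks_suspect(queue_length: int, cpu_pct: float,
                        num_cpus: int) -> bool:
    """Rule of thumb from above: a queue above 2 x CPUs, combined
    with 80%+ CPU, suggests you need more or faster processors."""
    return queue_length > 2 * num_cpus and cpu_pct >= 80.0

# Hypothetical spot readings.
cpus = os.cpu_count() or 4
print(queue_looks_suspect(queue_length=3, cpu_pct=60.0, num_cpus=cpus))  # False
print(queue_looks_suspect(queue_length=12, cpu_pct=85.0, num_cpus=4))    # True
```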
Memory Object : Pages/Sec
This counter looks at the number of memory pages read from or written to disk per second. It is the sum of both reads and writes.
From the SQL 2000 Operations Guide: Defined as the number of pages read from or written to disk to resolve hard page faults. (Hard page faults occur when a process requires code or data that is not in its working set or elsewhere in physical memory, and must be retrieved from disk). This counter was designed as a primary indicator of the kinds of faults that cause system-wide delays. It is the sum of Memory: Pages Input/sec and Memory: Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory: Page Faults/sec, without conversion. It includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files. This counter displays the difference between the values observed in the last two samples, divided by the duration of the sample interval.
This is a counter that cannot really be measured independent of a given system. In other words, I can't tell you whether 1,000 pages/sec is a high or low value; it really depends on your system. So it's a relative counter, one that you want to watch so you can tell if things are changing: you make a change and see if the counter goes up or down. In general, adding memory (or allocating more to SQL) should lower this counter, but you have to be careful. If you allocate too much memory to SQL and starve the OS, this could go up. Again, it depends on your system, so get a baseline and then watch it as you make changes.
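A tiny sketch of that arithmetic, with made-up sample values, just to show the sum and the baseline comparison:

```python
# Pages/sec is the sum of its two component counters.
pages_input_sec = 640.0    # hypothetical sampled values
pages_output_sec = 210.0
pages_sec = pages_input_sec + pages_output_sec   # what Pages/sec reports

# Because the counter is only meaningful relative to your own baseline,
# compare the current reading against your long-term average.
baseline_pages_sec = 400.0
change_pct = (pages_sec - baseline_pages_sec) / baseline_pages_sec * 100
print(f"Pages/sec: {pages_sec:.0f} ({change_pct:+.0f}% vs baseline)")
```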
Memory Object : Available MBytes
This counter is one I am glad got added to the list, because I really got tired of trying to compute this from the Available Bytes counter. If you are worried about tracking memory in bytes as opposed to MBytes, you have other issues. Or you're a lot smarter and more detailed than I am.
This number should be the amount of memory, in MB, on the computer (from the OS's perspective) that is available to be allocated to processes. It's the sum of the Free, Standby, and Zeroed memory pages. Again, if you need to know those individual numbers, you are probably working at a level of detail well beyond this counter. I look at this to see if the average number of MB stays fairly consistent from a baseline perspective. A drop here can often clue me in that some other process or service has been added when I thought everything on the server was the same.
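For illustration, a small sketch with hypothetical numbers, showing both the old bytes-to-MB conversion and the baseline consistency check I just described (the 20% cutoff is an arbitrary choice of mine):

```python
# The old way: convert Available Bytes to MB by hand.
available_bytes = 1_342_177_280                  # hypothetical sample
available_mb = available_bytes / (1024 * 1024)   # 1280 MB

# The baseline check: a noticeable drop from the long-term average
# often means a new process or service has appeared on the box.
baseline_mb = 1800.0
if available_mb < baseline_mb * 0.8:             # 20% drop cutoff
    print(f"Available MBytes {available_mb:.0f} is well below the "
          f"{baseline_mb:.0f} MB baseline - what got installed?")
```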
PhysicalDisk Object : Avg. Disk Queue Length
This counter measures how many I/O requests are queued up waiting while the disk drive catches up. This can be a performance bottleneck for a server: if it grows large, or you consistently see it above 8 for a particular disk, then that disk is spending a good amount of time stacking requests up instead of servicing them.
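A short sketch of the per-disk check, using made-up spot readings; the instance names are hypothetical:

```python
# Hypothetical spot readings, one per physical disk instance.
disk_queue_lengths = {"0 C:": 1.2, "1 D:": 9.5, "2 E:": 0.4}

QUEUE_THRESHOLD = 8   # the level above which I start to worry

for disk, avg_queue in disk_queue_lengths.items():
    if avg_queue > QUEUE_THRESHOLD:
        print(f"Disk {disk}: queue length {avg_queue} - requests are "
              "stacking up instead of being serviced")
```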
PhysicalDisk Object : % Idle Time
The % Disk Time counter has been analyzed a bit on the Internet, and there are some problems with how its data is collected (there is a reference below for this). I'm not sure how accurate it is, but I have read a few items that present a good case for it not being quite accurate, at least without some extra math by the person doing the interpretation. Since I'm looking for things that are simple and easy to read, I've taken the alternate approach (also recommended by others): I look at the idle time, which supposedly is more accurate. Subtracting this from 100 gives me an idea of how hard the disks are working.
Be sure that you include this counter for each disk instance and not just the _Total instance for the whole group.
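The subtraction is trivial, but for completeness, a sketch with hypothetical per-disk samples:

```python
# % Idle Time is collected per disk instance; busy time is the complement.
idle_time_pct = {"0 C:": 92.0, "1 D:": 35.0}   # hypothetical samples

for disk, idle in idle_time_pct.items():
    busy = 100.0 - idle   # how hard the disk is actually working
    print(f"Disk {disk}: approximately {busy:.0f}% busy")
```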
Network Interface Object : Bytes Total/Sec
This counter is supposed to be the number of bytes being sent and received per second on the NIC. There will be one instance per NIC on the machine, including one for the loopback adapter. This is much better than the Current Bandwidth counter, which is merely the rated limit of the NIC: 10, 100, or perhaps 1,000 Mbps.
From the SQL 2000 Operations Guide: Defined as the number of bytes traveling over the network interface per second. If this rate begins to drop, you should investigate whether or not network problems are interfering with your application.
From the Steve Jones Operations Guide: this should be a high number, roughly 60% of the theoretical max for your NIC. If it's not, start asking the network guys. I had an instance where this counter was 10% of what I expected, because a gigabit Ethernet card was plugged into a 100 Mbps port; I just happened to notice that this number wasn't much larger than on other servers when I expected it to be. Overall, this is basically a double check, not something I examine regularly. Having it in the baseline is just a way of double-checking things.
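Here's the utilization math as a sketch; the numbers are hypothetical, but they mirror the gigabit-card-in-a-100 Mbps-port situation above:

```python
def nic_utilization_pct(bytes_total_per_sec: float,
                        nic_speed_mbps: float) -> float:
    """Utilization as a percentage of the NIC's theoretical maximum."""
    max_bytes_per_sec = nic_speed_mbps * 1_000_000 / 8
    return bytes_total_per_sec / max_bytes_per_sec * 100

# Hypothetical busy server: ~12 MB/s through what should be a gigabit card.
util = nic_utilization_pct(bytes_total_per_sec=12_000_000,
                           nic_speed_mbps=1000)
print(f"{util:.0f}% of theoretical max")   # ~10% - time to ask the network guys
```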
SQL Server Access Methods Object : Full Scans/Sec
This counter is one I always capture to ensure that I know how often indexes are not being used.
From the SQL 2000 Operations Guide: Defined as the number of unrestricted full scans. These can either be base table or full index scans.
I'm not completely sure what an unrestricted full scan is, as opposed to a restricted one. This is another "relative" counter that you have to baseline on your system. Until you know what your average number is, it's hard to tell whether your tuning efforts are lowering or raising it, which should correspond to improving or worsening performance.
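One wrinkle worth showing: if you sample SQL Server's counters yourself (for example, from master.dbo.sysperfinfo in SQL 2000) rather than through Performance Monitor, the per-second counters are, as I understand it, cumulative raw values; you take two samples and divide by the elapsed time. A sketch with hypothetical sample values:

```python
def rate_per_sec(first_value: int, second_value: int,
                 elapsed_secs: float) -> float:
    """Per-second counters sampled as cumulative raw values:
    the rate is the delta divided by the sampling interval."""
    return (second_value - first_value) / elapsed_secs

# Hypothetical pair of 'Full Scans/sec' raw samples taken 60s apart.
v1, v2 = 1_483_220, 1_489_520
print(f"Full scans/sec: {rate_per_sec(v1, v2, 60.0):.1f}")   # 105.0
```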
SQL Server Databases Object : Transactions/Sec
This counter is one I always capture to ensure that I know what the average utilization of the server is. After all, transactions are the basis of everything in SQL Server; most queries are implicit transactions, but they are transactions nonetheless. Unfortunately, this counter only shows the transactions that change data. So for me, it shows, for a particular database, the long-term expected number of transactions.
This is extremely handy for determining whether the load has substantially increased. Having a long-term average of a dozen or so transactions/sec and seeing a spot rate of 200/sec shows me that perhaps the application isn't at fault; I may just be outgrowing this hardware.
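A sketch of that spot-versus-baseline comparison, with hypothetical numbers mirroring the dozen-versus-200 example; the 5x multiplier is an arbitrary cutoff of my own:

```python
long_term_avg_tps = 12.0   # transactions/sec, from months of baseline data
spot_tps = 200.0           # hypothetical current reading

ratio = spot_tps / long_term_avg_tps
if ratio > 5:   # tune this multiplier to your own comfort level
    print(f"Load is {ratio:.0f}x the baseline - the application may be "
          "fine; the hardware may simply be outgrown")
```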
SQL Server Buffer Manager Object : Cache Hit Ratio
This counter shows the percentage of pages that are found in the buffer pool as opposed to being read from disk. Since there is a read-ahead thread, if it can stay ahead of the read requests, it can keep this counter high. As far as I know, a high value does not imply that everything you have requested was in memory as opposed to disk. I have had systems with issues, with low IO, show this above 95% with only 4GB of RAM and a 200+GB database, so the pages brought in by read-ahead must not be counted as misses. Especially when I've seen it high after reading a couple of GB from a table, not all of which could possibly be in memory.
From the SQL 2000 Operations Guide: Defined as the percentage of pages that were found in the buffer pool without having to incur a read from disk. When this percentage is high your server is operating at optimal efficiency (as far as disk I/O is concerned).
The Operations Guide gives a hint for this counter. I rarely see it below 90% on any of my servers. Perhaps it's dumb luck, or perhaps I've got things fairly well tuned, but this is one of those counters that I'm not concerned with until it drops below 95%. At which point I am really concerned.
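One note if you sample this outside Performance Monitor: ratio counters are exposed as a value/base pair (in SQL 2000's sysperfinfo, a 'Buffer cache hit ratio' row and its 'base' row), and the percentage is the value divided by the base. A sketch with hypothetical raw values:

```python
# Ratio counters come as a value/base pair; the percentage is value / base.
buffer_cache_hit_ratio = 3_196        # hypothetical raw pair, sampled
buffer_cache_hit_ratio_base = 3_347   # from the ratio and base rows

hit_ratio_pct = buffer_cache_hit_ratio / buffer_cache_hit_ratio_base * 100
print(f"Cache hit ratio: {hit_ratio_pct:.1f}%")   # ~95.5%

if hit_ratio_pct < 95.0:
    print("Below 95% - now I am really concerned")
```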
SQL Server General Statistics Object : User Connections
This counter tells you how many connections there are to SQL Server on a spot basis, meaning this is not an average, just the current number of connections. There are a number of system connections, typically 10-20, depending on how your server is set up; but if you track this against a baseline over time, you don't really care. What you're trying to do is correlate the performance of the system with an average number of users. If the users go up, check other counters (CPU, memory, lock wait time, etc.) to see if there is a corresponding change with the larger load.
As with many counters, you want to use this as a broad instrument to measure the performance of the system, not a finely honed one.
SQL Server Locks Object : Average Wait Time
This counter measures the amount of time in milliseconds that a user is waiting for a lock. This is an average that is compiled over the performance monitor interval.
From the SQL 2000 Operations Guide: Defined as the average amount of wait time (milliseconds) for each lock request that resulted in a wait.
I look at this counter not so much for its actual value, but for the deviation from the average, or baseline, over time. I track it, and then as I am troubleshooting performance issues (the main place I use it) I compare the spot values to the baseline. On a broad, system-level basis, this tells me whether the system in general is experiencing problems or just one user. Usually it's one user, but I have seen this counter at 400-500% of the long-term average, clueing me in to some other problem with the system in general.
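A sketch of that deviation check, with made-up numbers in the 400-500% territory mentioned above:

```python
baseline_wait_ms = 40.0   # long-term average lock wait, hypothetical
spot_wait_ms = 190.0      # current sampled value

deviation_pct = spot_wait_ms / baseline_wait_ms * 100
if deviation_pct >= 400:  # the 400-500% territory mentioned above
    print(f"Lock waits at {deviation_pct:.0f}% of baseline - this looks "
          "like a system-wide problem, not a single user")
```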
This has been a very basic look at an extremely complex subject. I haven't been thrilled with most books or articles on the topic because they don't give a clear list of counters to look at without making you wade through tons of theory or background. I've tried to give a list of counters to watch and keep the explanations separate. However, there is much more to consider, so this is a very introductory look at the complex art of performance tuning, with some rationale as to why I use these counters.
If there are areas that are not clear, or you have a request for a topic, please feel free to contact me. As always I welcome feedback on this article using the "Your Opinion" button below.
Steve Jones ©dkRanch.net March 2004