NewServerMaint Mechanics Union Updated Aug 22, 2011 Watching Server HealthMemcached has a lot of statistical counters. Most are tallies, but others serve as warnings for an administrator to notice. Note that there are many tools and top programs floating around which serve to help you digest this information. Issuing CommandsMonitoring scripts should ideally issue test sets/gets/deletes to the servers occasionally, timing the response of each. If it's slow to connect you might have a connection or network problem. If it's slow to respond the server might be swapping or in ill health. You should always strictly monitor the health of a server, but an extra layer can be appreciated. Important StatsThese are found when issuing a simple stats command to a memcached server. curr_connectionsLists the number of clients presently connected. Monitor that this number doesn't come too close to your max connection setting (-c). listen_disabled_numAn obscure named statistic counting the number of times memcached has hit its connection limit. When memcached hits the max connections setting, it disables its listener and new connections will wait in a queue. When someone disconnects, memcached wakes up the listener and starts accepting again. Each time this counter goes up, you're entering a situation where new connections will lag. Make sure it stays at or close to zero. accepting_connsRelated to the above listen_disabled_num, if you're already connected to memcached you can see if it's hit max connections or not by checking if this value is 1 or 0. limit_maxbytesSometimes it's nice to ensure that how you think memcached starts and how it actually starts are in line. By checking limit_maxbytes you can verify that the -m argument took. Occasionally people using init scripts can be misconfigured and memcached will start with default values. cmd_flushWhile not exactly server health per-se, this is a good one to generically monitor. Every time someone issues a flush_all command, all items inside the cache are invalidated, and this counter is incremented. Sometimes debug code or misinformed people can leave scripts or callbacks running to "invalidate" the entire cache. Watch this value in production and sound the alarms if it starts moving, unless you really intended it to. stats sizesWritten a few times in these pages, it bears repeating; don't run this thing in production unless you can handle having the cache hang for up to several minutes. As of this writing this is the last command which iterates over all known cache items to run a calculation. Slab imbalanceA more complicated issue is an imbalance of the amount of memory assigned per slab, compared to where your actual data wants to go. Confusing speech for a simple algorithm:
Sadly issuing a flush_all does not do enough. You actually have to restart memcached if you ever end up in this position. It's rare but it happens. Future versions will have memcached mitigate this situation itself. tailrepairs and outofmemoryThese two stats appear in stats items, and can indicate bugs in the memcached code, or in the case of outofmemory, severe memory pressure paired with high read activity. If you see these counters move with any regularity, please contact the mailing list with as much info as you can. While we haven't hit problems with these for a long time, we feel it's better to be safe and instrument the case of the more common runtime bugs. Troubleshooting Client TimeoutsSee Timeouts for help. Stats for Application HealthMonitoring these stats can tell you how the available memory in memcached, and how your app behaves, changes efficiency. Global hitrateHitrate is defined as: get_hits / (get_hits + get_misses). The higher this value is, the more often your application is finding cache results instead of finding dead air. Watch that the value is both higher than what you expect, and watch to see if it changes with time or releases of your application. Hitrate per slabWhile not possible to calculate the actual hitrate per-slab, since a "get miss" doesn't know how large the item could have been, you can monitor the number of get_hits and cmd_set on a per-slab basis via stats slabs. Watch that the number of get_hits is usually above cmd_set (more fetches than updates is often what you want), and watch for changes over time. EvictionsAn item is "evicted" from the cache if it still has time to live, but ended up at the tail end of the LRU cache when it comes time to allocate a new item. While memcached tries a little to find an expired item from the tail, it doesn't guarantee it. stats items shows per-slab eviction status. 'evicted' is the total number of items tossed early from that slab, 'evicted_nonzero' are the number of tossed items that did not have an unlimited expiration time, and 'evicted - evicted_nonzero' are the number of items tossed which were stored forever. 'evicted_time' notes how many seconds it's been since the last item to be evicted, was last fetched. If this number is small, it means you're evicting items which were recently used. Looks Can be DeceivingYou see a low hit rate, a high eviction rate, and assume that you need to buy more memory for memcached. Unfortunately this isn't necessarily true either. Consider some potential scenarios:
The former can cause high eviction count even though the data was not important to begin with. Audit what our application stores and see if this is a fault. The latter can show poor hitrates, as many get_misses are accounted for by this flag. A workaround for this is to have the app "recache" a flag with a value of "not set". So the flag stays around in cache even if it's disabled, and doesn't throw off your hitrate calculations. OS Health (avoid swap!)Memcached interacts hard with the network and with RAM. It's common for people to monitor swap usage, which can cause severe performance degredation to memcached. It's also important to watch your network stack. Linux, for example, sports a number of counters for its network interfaces related to dropped packets, underruns, crc errors, etc. Most switches also have per-port counters on packet issues or link issues. Degraded performance could trace down to bad wiring, bad switchport, a dying NIC, or a network driver in need of tuning. UpgradingMost minor releases of memcached focus on fixing bugs and adding instrumentation. New features are rare and tend to only happen on middle version bumps now. IE; 1.2 to 1.4 added the binary protocol and a large number of counters. 1.4.0 to 1.4.1 was mostly bugfixes and missing counters. Minor features may show up as well. SASL authentication support happened in the mid-1.4 series, but the change is isolated and should not break existing clients. You should follow new releases and the release notes. It's often better to uprade early so you have the extra instrumentation when you need it, instead of wishing you had it when things go wrong. |