Back when our team was dealing with operations, optimization and scalability at our previous company, we had our fair share of troubleshooting poorly performing applications and infrastructures of various sizes, often large ones (think CNN or the World Bank). Tight deadlines, “exotic” technical stacks and a lack of information usually made for memorable experiences.
The cause of the issues was rarely obvious; here are a few of the things we usually started with.
Don’t rush onto the servers just yet: you need to figure out how much is already known about the server and the specifics of the issue. You don’t want to waste your time (trouble)shooting in the dark.
A few “must haves”:
The last two are the most convenient sources of information, but don’t expect too much: they’re also the ones that are usually painfully absent. Tough luck; make a note to get this corrected and move on.
$ w
$ last
Not critical, but you’d rather not be troubleshooting a platform others are playing with. One cook in the kitchen is enough.
$ history
Always a good thing to look at, combined with the knowledge of who was on the box earlier. Be responsible by all means; being admin shouldn’t allow you to break one’s privacy.
A quick mental note for later: you may want to update the HISTTIMEFORMAT environment variable to keep track of the time those commands were run. Nothing is more frustrating than investigating an outdated list of commands…
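A minimal sketch of what that could look like (the exact format string is up to you; add it to your shell profile so it sticks):
$ export HISTTIMEFORMAT="%d/%m/%y %T "
$ history | tail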
$ pstree -a
$ ps aux
While ps aux tends to be pretty verbose, pstree -a gives you a nice condensed view of what is running and who called what.
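If memory or CPU already looks like the culprit, sorting the process list puts names on it quickly; a small sketch, assuming GNU ps:
$ ps aux --sort=-%mem | head -n 10
$ ps aux --sort=-%cpu | head -n 10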
$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp
I tend to prefer running them separately, mainly because I don’t like looking at all the services at the same time. netstat -nalp will do too, though. Even then, I’d omit the numeric option (IPs are more readable IMHO).
Identify the running services and whether they’re expected to be running or not. Look for the various listening ports. You can always match the PID of the process with the output of ps aux; this can be quite useful, especially when you end up with 2 or 3 Java or Erlang processes running concurrently.
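As an illustration (the port and process name below are purely hypothetical), tying a listening socket back to its process could look like this:
$ netstat -ntlp | grep ':3306'
$ ps aux | grep '[m]ysqld'
The last column of netstat -ntlp shows the PID/program name; the [m] trick simply keeps grep from matching itself.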
We usually prefer to have more or less specialized boxes, with a low number of services running on each one of them. If you see three dozen listening ports, you should probably make a mental note to investigate further and see what can be cleaned up or reorganized.
$ free -m
$ uptime
$ top
$ htop
This should answer a few questions:
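One quick sanity check worth doing here: compare the load average against the number of CPU cores, since a load of 8 means very different things on 2 cores and on 32. For instance:
$ nproc
$ uptime
nproc gives you the number of cores; the last three figures of uptime are the 1, 5 and 15 minute load averages.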
$ lspci
$ dmidecode
$ ethtool
There are still a lot of bare-metal servers out there; this should help with:
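For instance (eth0 is just an assumed interface name here), a mis-negotiated or downgraded network link is a classic cause of mysterious slowness on physical hardware:
$ ethtool eth0 | grep -iE 'speed|duplex|link detected'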
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio
Very useful commands to analyze the overall performance of your backend; dstat is my all-time favorite. What is using the IO? Is MySQL sucking up the resources? Is it your PHP processes?
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D / /* beware not to kill your box */
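A couple of quick follow-up checks I find worthwhile at this point (a sketch that relies on the usual df and /proc/mounts layouts): spot filesystems that are nearly full, or that were remounted read-only after an error:
$ df -h | awk 'NR==1 || int($5) > 90'
$ awk '$4 ~ /^ro($|,)/ {print $2, $4}' /proc/mounts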
$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack /* may take some time on busy servers */
$ netstat
$ ss -s
Is your conntrack_max set to a high enough number to handle your traffic? How long do you keep connections in the various states (TIME_WAIT, …)? netstat can be a bit slow to display all the existing connections; you may want to use ss instead to get a summary. Have a look at Linux TCP tuning for some more pointers on how to tune your network stack.
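To make those questions concrete, here is a hedged sketch (the /proc paths can vary with kernel and netfilter versions): compare the number of tracked connections with the limit, and count sockets per TCP state:
$ cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
$ ss -ant | awk 'NR > 1 {states[$1]++} END {for (s in states) print s, states[s]}'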
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth
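Before moving on, a quick grep for the usual suspects often pays off; for example (log file names differ across distributions):
$ dmesg | grep -iE 'error|oom|segfault'
$ grep -i oom /var/log/messages | tail -n 20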
$ ls /etc/cron* + cat
$ for user in $(cat /etc/passwd | cut -f1 -d:); do crontab -l -u $user; done
There is a lot to analyze here, but it’s unlikely you’ll have time to be exhaustive at first. Focus on the obvious ones, for example in the case of a LAMP stack:
Apache and/or Nginx: look for the usual 5xx errors, and for possible limit_zone errors.
MySQL: look for errors in the mysql.log, traces of corrupted tables, or an InnoDB repair process in progress. Look for slow logs and determine whether there are disk/index/query issues.
Varnish: in varnishlog and varnishstat, check your hit/miss ratio. Are you missing some rules in your config that let end-users hit your backend instead?
After these first 5 minutes (give or take 10 minutes) you should have a better understanding of:
You may even have found the actual root cause. If not, you should be in a good place to start digging further, with the knowledge that you’ve covered the obvious.