Okay, is anyone tired of my server admin tips yet? Yes? Too bad.
Monitor everything… Put as much info at your fingertips as you can, as easily as you can. Put that info in a place where you will always be looking at it for some other reason. For example, I made a vBulletin plug-in that monitors 4 memcached servers (including a latency test it runs) as well as 10 blade servers. It shows up every time I’m in the admin area of the forum (which is a lot), so I can’t help but see it.
I also wrote a little daemon that runs on my servers and can quickly report back whatever info I want (time, disk RAID status, server load, MySQL replication status, etc.).
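To give a rough idea of what a daemon like that could look like, here is a minimal sketch in Python. It is not the actual daemon described above; the names `collect_status`, `StatusHandler`, and `serve`, the port 8765, and the choice of JSON over a plain TCP socket are all assumptions made for illustration. It only reports time, load average, and free disk space, and it relies on `os.getloadavg()`, which is Unix-only.

```python
import json
import os
import shutil
import socketserver
import time


def collect_status(path="/"):
    """Gather a few cheap health metrics for this host (hypothetical set)."""
    usage = shutil.disk_usage(path)
    load1, load5, load15 = os.getloadavg()  # Unix only
    return {
        "time": time.time(),
        "load": [load1, load5, load15],
        "disk_free_pct": round(100 * usage.free / usage.total, 1),
    }


class StatusHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Send one JSON line per connection, then let the connection close.
        self.wfile.write(json.dumps(collect_status()).encode() + b"\n")


def serve(host="127.0.0.1", port=8765):
    """Answer status requests until interrupted (port is an assumption)."""
    with socketserver.TCPServer((host, port), StatusHandler) as server:
        server.serve_forever()
```

Calling `serve()` on each box and connecting to that port from a dashboard would return one line of JSON per server. Checks such as RAID or MySQL replication status would presumably be added as extra keys in `collect_status`, but how to query them depends on the setup, so they are left out here.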
The more info you have in one place (especially when you run a bunch of servers), the easier it is to spot anything wrong. For example, I had an issue with a web server serving requests slowly one day… it ultimately turned out not to be a problem with the web server, but with the memcached server it was using. The latency test was showing ~2,000 ms latency (2 seconds) vs. the normal 0.5 ms (1/2000 of a second).
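A latency test along those lines can be sketched in a few lines of Python. This is an assumption about how such a check might work, not the plug-in's actual code: it times one round trip of memcached's text-protocol `version` command over a raw socket. The function names and the 50 ms alert threshold are hypothetical.

```python
import socket
import time


def memcached_latency_ms(host, port=11211, timeout=5.0):
    """Round-trip time in ms for one memcached `version` command."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        start = time.perf_counter()
        sock.sendall(b"version\r\n")
        reply = sock.recv(1024)
        elapsed_ms = (time.perf_counter() - start) * 1000
    if not reply.startswith(b"VERSION"):
        raise RuntimeError(f"unexpected reply: {reply!r}")
    return elapsed_ms


def is_slow(latency_ms, threshold_ms=50.0):
    """Flag a latency above the alert threshold (50 ms is an assumption)."""
    return latency_ms > threshold_ms
```

With a check like this, the 2,000 ms reading from the story would be flagged and the normal 0.5 ms would not. Timing starts after the connection is established, so the number reflects how quickly the server answers rather than how long the TCP handshake takes.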
And be proactive about monitoring stuff… don’t wait until something bad happens to start doing it! Then it’s too late.
Can I go to bed now?