Call to arms: two system management applications we sorely need in Linux land

Diagnosing server (and desktop, and network) load issues is always tricky. When an application misbehaves for short bursts of time, it’s often a matter of guesswork to find out which.

The regular monitoring application suite is just not enough. top, by default, only ranks processes by CPU usage. vmstat and iostat only present system-wide summaries, so they can’t find culprits. strace is too low-level and gives no way to visualize how disk accesses are split among processes. Pinpointing the exact application or daemon that’s taxing computer resources, whether for short or long periods, is currently impossible.

Therefore, I propose someone very clever creates these two applications, which should close the gap.

Application 1: psnettop

We need a combo between top and iftop. I should be able to fire up an application that shows me network usage, not by port or by interface, but by process. That way, I can quickly diagnose whether it’s BitTorrent or Firefox. Network admins could remotely log onto users’ computers and inspect which applications are sucking their Internet links.

It should also let me collapse statistics per-application or per-PID.

In Windows land, there’s the fantastic NetLimiter which, apart from setting network bandwidth limits per process, automatically shows you the instantaneous bandwidth per process. We need this.
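To give a flavor of where such a tool could start, here is a minimal Python sketch of the attribution step only: matching TCP sockets to the processes that own them. It assumes a Linux /proc filesystem and a little-endian host, and the function names (decode_address, parse_proc_net_tcp, socket_owners) are made up for this illustration. It does not measure bandwidth; counting bytes per socket would still need packet capture or kernel instrumentation.

```python
import os
import re
import socket
import struct

SOCKET_LINK = re.compile(r"socket:\[(\d+)\]")


def decode_address(hex_pair):
    """Turn '0100007F:0277' into ('127.0.0.1', 631).

    Assumes a little-endian host (e.g. x86), where the kernel prints the
    IPv4 address with its bytes reversed.
    """
    hex_ip, hex_port = hex_pair.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(hex_ip, 16)))
    return ip, int(hex_port, 16)


def parse_proc_net_tcp(text):
    """Map socket inode -> (local, remote) from the text of /proc/net/tcp."""
    sockets = {}
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 10 or fields[9] == "0":
            continue  # malformed line, or socket with no owning inode
        sockets[fields[9]] = (decode_address(fields[1]),
                              decode_address(fields[2]))
    return sockets


def socket_owners(proc_root="/proc"):
    """Map socket inode -> set of PIDs holding a descriptor for it."""
    owners = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            for fd in os.listdir(fd_dir):
                target = os.readlink(os.path.join(fd_dir, fd))
                match = SOCKET_LINK.fullmatch(target)
                if match:
                    owners.setdefault(match.group(1), set()).add(int(entry))
        except OSError:
            continue  # process exited, or we lack permission to look
    return owners


if __name__ == "__main__":
    with open("/proc/net/tcp") as handle:
        sockets = parse_proc_net_tcp(handle.read())
    owners = socket_owners()
    for inode, (local, remote) in sockets.items():
        print(sorted(owners.get(inode, [])), local, remote)
```

Run as a regular user, this only sees descriptors of your own processes; run as root, it sees everyone’s. A real psnettop would combine this mapping with per-connection byte counts and refresh it continuously.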

Application 2: disktop

We also need a top-style application that shows us disk transactions per application. Not just file accesses, but real major page faults (page-ins from disk, swap-ins included) and disk reads/writes that actually reach the hardware rather than being served from the disk cache. One column for major page faults. Another for blocks read. A third for blocks written. They should be counted as they happen on the hardware, so anything satisfied by the caches should be ignored. Thus, system admins can identify applications that are:

  • causing memory pressure to increase
  • sucking up disk bandwidth with excessive swap-ins/outs
  • exercising the disk by loading files not in the cache

strace does not cut it, because it doesn’t show which accesses caused actual disk hardware operations, and because it only monitors a single process.

It should also let me collapse statistics per-application or per-PID.
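As a rough illustration of the three columns and of the per-application collapse, here is a minimal Python sketch. It assumes a Linux kernel that exposes /proc/<pid>/io (whose read_bytes and write_bytes count traffic that actually went to the storage layer) and reads the major fault count from field 12 of /proc/<pid>/stat. The function names are invented for this example, and the counters are in bytes rather than blocks.

```python
import os
from collections import defaultdict


def parse_io(text):
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip().isdigit():
            counters[key.strip()] = int(value)
    return counters


def parse_stat(text):
    """Return (command, major_faults) from the text of /proc/<pid>/stat.

    The command sits in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' before counting fields.
    """
    head, _, tail = text.rpartition(")")
    command = head.partition("(")[2]
    fields = tail.split()  # fields[0] is field 3 (state)
    return command, int(fields[9])  # field 12: majflt


def sample(proc_root="/proc"):
    """Return one (pid, command, majflt, read_bytes, write_bytes) per process."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                command, majflt = parse_stat(handle.read())
            with open(os.path.join(proc_root, entry, "io")) as handle:
                io_counters = parse_io(handle.read())
        except OSError:
            continue  # process exited, or we may not read its io file
        rows.append((int(entry), command, majflt,
                     io_counters.get("read_bytes", 0),
                     io_counters.get("write_bytes", 0)))
    return rows


def collapse_by_command(rows):
    """Sum the three counters over all PIDs sharing a command name."""
    totals = defaultdict(lambda: [0, 0, 0])
    for _pid, command, majflt, read_bytes, write_bytes in rows:
        totals[command][0] += majflt
        totals[command][1] += read_bytes
        totals[command][2] += write_bytes
    return {command: tuple(values) for command, values in totals.items()}


if __name__ == "__main__":
    for command, totals in sorted(collapse_by_command(sample()).items()):
        print(command, *totals)
```

These counters are cumulative since each process started, so a top-style display would take two samples a second or so apart and show the difference.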

Extra wishes

If these applications, or the metrics they collect, could also be put to use in the following ways, it’d be even better.

KSysGuard integration

As much of a fan as I am of the command line, I’ve grown very accustomed to the great KSysGuard. If these metrics could be integrated into a graphical console, it would be very easy to spot problems at a glance.

Logging à la sar

Those of us who have used sar know it’s a valuable way to make assessments on a timeline. If periodic logging of these metrics were incorporated into the system, we could quickly diagnose problems that have occurred in the past, and share log files with debuggers and developers everywhere. Thus we wouldn’t rely on anecdotal evidence and hearsay when it comes to discussing naughty applications.
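A minimal Python sketch of what such a log could look like follows. sar keeps its own binary data files; plain CSV is swapped in here for simplicity, and the column layout (timestamp, pid, command, major faults, bytes read, bytes written) and function names are assumptions made for this illustration only.

```python
import csv
import io
import time


def append_samples(stream, samples, now=None):
    """Write one CSV row per sample, all stamped with the same time.

    Each sample is (pid, command, majflt, read_bytes, write_bytes).
    """
    stamp = int(time.time() if now is None else now)
    writer = csv.writer(stream)
    for pid, command, majflt, read_bytes, write_bytes in samples:
        writer.writerow([stamp, pid, command, majflt, read_bytes, write_bytes])


def rows_between(stream, start, end):
    """Return logged rows whose timestamp falls within [start, end]."""
    return [row for row in csv.reader(stream)
            if row and start <= int(row[0]) <= end]


if __name__ == "__main__":
    log = io.StringIO()  # stands in for a log file opened in append mode
    append_samples(log, [(4321, "firefox", 3, 4096, 0)])
    log.seek(0)
    print(rows_between(log, 0, int(time.time())))
```

Being able to pull out just the rows around the time of an incident is what makes this kind of log useful after the storm has passed.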

Please help!

Sadly, I don’t have the skillset required to do this. I’m hoping someone in the Free Software community can help us with this. I understand that, with some clever work, SystemTap can be used to derive these values. So there’s a start.

Have a great day!

5 Responses to “Call to arms: two system management applications we sorely need in Linux land”

  1. steve Says:

    Sounds like you want a port of Solaris’ dtrace. Can do all that and more, and then some.

    You can use orca for useful views of system performance. (orcaware.com) I mainly use it on Solaris boxes, but it does work on linux.

    I also don’t understand why you find it so hard to identify a particular process that’s running amok? That’s kinda normal for sysadmins and dba’s day-in day-out. :-) Processes that use excessive resources are generally very easy to find. Simple matter of keeping detailed notes and crosschecks and so forth. The patterns give it away, generally, pretty quickly.

    Use of netstat, lsof & tcpdump etc. helps identify network abusers. Or network flow recorders like argus to get a back history to compare with. Combo of top and strace will identify disk abusers. You don’t use strace to monitor one process, you use it to monitor several, and gradually slice and dice. If something is doing disk I/O, what are the likely culprits? Eliminate those or not. Then go for the next ones. Which partitions they’re abusing can also be a BIG Clue.

    The hard problems in my experience, are the subtle bugs in major applications. Not so much the logic bugs, rather the subtle races, 100% cpu for no reason or purpose and doing “nothing”, O/S’s and network devices doing subtly weird things with network packets, or random hardware fluctuations that only duplicate when the cow jumping over a blue moon is a Jersey painted a just-so shade of lime green and in bright orange boots.

    You’re also ignoring all the application end-to-end monitoring you should be doing. You are monitoring, aren’t you? ;-) Nagios and its various plugins and extras allow for a fantastic view of how a given app is performing. So look for changes. That may be the clue you need. More complex apps should be monitored at multiple points. The idea being to spot the problems before they become problems.

    I tend not to use it so much anymore for a variety of reasons, but there is a wealth of info you can get from SNMP.

    And lastly, don’t overlook the value of running logs at info or even debug levels.

    HTH?

    My concern is that I’m unsure if you’re suggesting the fault to be found is for server-level monitoring, or troubleshooting desktop woes. It really depends on how much time and effort is involved. Windows has the bad rep of ctrl-alt-del because that’s often the cheapest and most efficient way of solving the problem. Rightly or wrongly, and I have my own opinions on that one. On mission-critical systems you simply cannot do a bounce, so you have a proportionately greater and more complete monitoring regime. There’s a lot more to sysadmin than running “apt-get upgrade” :-)

  2. frankg Says:

    I agree with steve, look into dtrace. Here are a couple of monitoring applications that can display the information you want, but not on a per-process basis. Give either of these the ability to collect data from dtrace and you will have what you want.

    nmon from IBM http://www-128.ibm.com/developerworks/aix/library/au-analyze_aix/ and moodss http://moodss.sourceforge.net/

  3. Rudd-O Says:

    I regularly use top, ksysguard, strace, vmstat, iostat, nagios and sar. None of those alone help me get anything more than a suspicion as to what process is at fault. I usually have to get a bunch of terminal windows running (hard to do on a dying system), and usually when that’s done, it’s too late because the storm has passed, leaving me with no data to judge what went wrong.

    And I certainly want to know not only which process is running amok, but what exactly is it doing so I can fix it.

    Lemme give you guys an example: kio_http_cache_cleaner has been a regular bitch for me, because when it runs, it evicts nearly anything useful from the pagecache and eats RAM that makes my system swap, leaving me with a sluggish system, even when it has stopped running.

    Of course, I didn’t find out about it with top, because kio_http_cache_cleaner is disk-bound and it doesn’t show up in top when sorting by CPU usage. You know how I found it? I had to “correlate” several instances of the computer going slow and it running (and by chance). If I had had a tool that let me see disk accesses per process (no, sar -x doesn’t count because it doesn’t say which process name or group by process name), I could have pinpointed the issue way earlier.

  4. steve Says:

    If top sorted by CPU won’t show you excessive memory usage, why don’t you tell top to sort by memory usage? RTFM? ;-) From memory, it’s just “M”. Really Bad pun intended. :-)

    By the sounds of it you’re trying to fix desktop issues. Opening new xterms or similar on a dying machine is not a Good Idea(tm). Better off switching to the local consoles. They have a far smaller use footprint. You can also re-nice your own process to give your troubleshooting console priority to find & fix the issue.

    HTH?

  5. frankg Says:

    Look at iosnoop, a dtrace script. Watch live what is happening on the disk, including the PID and command responsible. Dtrace will get ported to Linux, just like it has been to OS X 10.5. See http://www.brendangregg.com/dtrace.html
