Call to arms: two system management applications we sorely need in Linux land

published Oct 30, 2006, last modified Jun 26, 2013

Diagnosing server (and desktop, and network) load issues is always tricky. When an application misbehaves in short bursts, finding out which one is often a matter of guesswork.

Apparently, my wishes have been granted, so this entire article is now obsolete.

The regular suite of monitoring applications is just not enough. top only covers CPU and memory usage. vmstat and iostat present system-wide summaries, but can't find culprits. strace is too low-level and offers no way to see how disk accesses are split among processes. Pinpointing the exact application or daemon that's taxing the computer's resources, whether in short bursts or over long periods, is currently impossible.

Therefore, I propose someone very clever creates these two applications, which should close the gap.

Application 1: psnettop

We need a cross between top and iftop. I should be able to fire up an application that shows me network usage, not by port or by interface, but by process. That way, I can quickly tell whether it's BitTorrent or Firefox that's eating the bandwidth. Network admins could remotely log onto users' computers and see which applications are hogging their Internet links.

It should also let me collapse statistics per-application or per-PID.

In Windows land, there's the fantastic NetLimiter which, apart from setting per-process network bandwidth limits, automatically shows you the instantaneous bandwidth each process is using. We need this.
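To make the idea a bit more concrete, here is a rough Python sketch of the first half of the problem: working out which process owns which TCP socket. It assumes a Linux `/proc` filesystem, where `/proc/net/tcp` lists sockets by inode and each `/proc/<pid>/fd/` entry for a socket is a symlink of the form `socket:[inode]`. The function names are mine, the address decoding assumes a little-endian machine, and the sketch only attributes connections to processes. It does not measure bandwidth, which would still need packet capture or kernel instrumentation on top.

```python
import os
import re
import socket
import struct
from collections import defaultdict

SOCKET_LINK = re.compile(r"socket:\[(\d+)\]")


def parse_proc_net(text):
    """Map socket inode -> (local, remote) hex addresses from /proc/net/tcp text."""
    sockets = {}
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 10:
            continue
        inode = int(fields[9])  # tenth whitespace-separated token is the inode
        if inode:
            sockets[inode] = (fields[1], fields[2])
    return sockets


def decode_ipv4(hex_addr):
    """Turn '0100007F:0277' into ('127.0.0.1', 631); assumes a little-endian host."""
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def socket_inodes_by_pid(proc="/proc"):
    """Map PID -> set of socket inodes, by reading the /proc/<pid>/fd symlinks."""
    owners = defaultdict(set)
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        fd_dir = os.path.join(proc, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process has exited, or we lack permission
        for fd in fds:
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            match = SOCKET_LINK.fullmatch(target)
            if match:
                owners[int(entry)].add(int(match.group(1)))
    return owners


def connections_by_pid(sockets, owners):
    """Join the two maps: PID -> sorted list of (local, remote) address pairs."""
    result = {}
    for pid, inodes in owners.items():
        conns = sorted(sockets[i] for i in inodes if i in sockets)
        if conns:
            result[pid] = conns
    return result
```

On a Linux box, feeding the contents of `/proc/net/tcp` to `parse_proc_net()` and the result of `socket_inodes_by_pid()` to `connections_by_pid()` gives a per-PID list of connections. Seeing other users' processes generally requires root.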

Application 2: disktop

We also need a top-style application that shows us disk transactions per application. Not just disk accesses, but real major page faults (swap-ins) and disk reads/writes that bypass the disk cache. One column for page faults. Another for blocks read. A third for blocks written. They should be reported as they hit the hardware, so anything served from the caches should be ignored. Thus, system admins can identify applications that are:

causing memory pressure to increase
sucking up disk bandwidth with excessive swapins/outs
exercising the disk by loading files not in the cache

strace does not cut it, because it doesn't show which accesses caused actual disk hardware operations, and because it only monitors a single process.

It should also let me collapse statistics per-application or per-PID.
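As an illustration of the three columns described above, here is a hedged Python sketch that takes two samples of per-process counters and reports the difference. It assumes a kernel with task I/O accounting, which exposes `read_bytes` and `write_bytes` in `/proc/<pid>/io` (bytes that actually went to or came from the storage layer), and it reads the major fault count from field 12 of `/proc/<pid>/stat`. It reports bytes rather than blocks, and all function names are made up for the example.

```python
import os


def parse_stat(text):
    """Return (comm, majflt) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')'.
    close_paren = text.rindex(")")
    comm = text[text.index("(") + 1:close_paren]
    rest = text[close_paren + 1:].split()
    return comm, int(rest[9])  # rest[0] is field 3, so field 12 is rest[9]


def parse_io(text):
    """Return (read_bytes, write_bytes) from the contents of /proc/<pid>/io."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters["read_bytes"], counters["write_bytes"]


def sample(proc="/proc"):
    """Snapshot PID -> (comm, (majflt, read_bytes, write_bytes))."""
    snap = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        base = os.path.join(proc, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                comm, majflt = parse_stat(f.read())
            with open(os.path.join(base, "io")) as f:
                read_bytes, write_bytes = parse_io(f.read())
        except (OSError, KeyError, ValueError, IndexError):
            continue  # process vanished, permission denied, or odd format
        snap[int(entry)] = (comm, (majflt, read_bytes, write_bytes))
    return snap


def deltas(before, after):
    """Activity between two snapshots, for PIDs present in both."""
    out = {}
    for pid, (comm, now) in after.items():
        if pid in before:
            prev = before[pid][1]
            out[pid] = (comm, tuple(n - p for n, p in zip(now, prev)))
    return out


def collapse_by_name(per_pid):
    """Sum the per-PID figures into one row per application name."""
    totals = {}
    for comm, counts in per_pid.values():
        prev = totals.get(comm, (0, 0, 0))
        totals[comm] = tuple(a + b for a, b in zip(prev, counts))
    return totals
```

Calling `sample()` twice a second or so apart and passing both results to `deltas()` gives the per-PID view. `collapse_by_name()` gives the per-application view. Reading `/proc/<pid>/io` for processes you don't own generally requires root.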

Extra wishes

If these applications or monitorables could be taken advantage of in the following ways, it'd be even better.

KSysGuard integration

As much of a fan as I am of the command line, I've grown very accustomed to the great KSysGuard. If these monitorables could be integrated in a graphical console, it would be very easy to spot problems at a glance.

Logging à la `sar`

Those of us who have used sar know it's a valuable way to make assessments on a timeline. If periodic logging of these monitorables were incorporated into the system, we could quickly diagnose problems that occurred in the past, and share log files with debuggers and developers everywhere. Thus we wouldn't have to rely on anecdotal evidence and hearsay when discussing naughty applications.
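A minimal sketch of what such a log could look like, in Python. This is not sar's actual format. It assumes each periodic sample is a mapping from application name to a hypothetical `(majflt, read_bytes, write_bytes)` tuple, appends those to a plain CSV file, and then answers the "who was busy between these two times?" question from the file.

```python
import csv
import os
import time

FIELDS = ["timestamp", "name", "majflt", "read_bytes", "write_bytes"]


def append_samples(path, totals, timestamp=None):
    """Append one row per application, writing the header if the file is new."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    try:
        new_file = os.path.getsize(path) == 0
    except OSError:
        new_file = True  # file does not exist yet
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(FIELDS)
        for name, (majflt, read_bytes, write_bytes) in sorted(totals.items()):
            writer.writerow([timestamp, name, majflt, read_bytes, write_bytes])


def busiest_between(path, start, end, field="read_bytes"):
    """Sum `field` per application for rows with start <= timestamp < end."""
    totals = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if start <= int(row["timestamp"]) < end:
                totals[row["name"]] = totals.get(row["name"], 0) + int(row[field])
    # Highest total first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

A cron job or small daemon could call `append_samples()` every minute, and the resulting file is easy to attach to a bug report.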

Please help!

Sadly, I don't have the skillset required to do this. I'm hoping someone in the Free Software community can help us with it. I understand that, with some clever work, SystemTap can be used to derive these values. So there's a start.

Have a great day!

debugging ideas system administration Linux