Computer stability explained: why your computer crashes, and what you can do about it

by Rudd-O published 2006/05/08 18:01:28 GMT+0, last modified 2013-06-26T03:24:27+00:00

Does your computer crash? Do you want to know why? Here's a guide which will help you understand why this happens, and (hopefully) lead you on to a better computing experience.

"Computer stability" is somewhat of a misnomer. The term "stability" is related to the physical properties of an object, meaning "it won't tip or fall down". In the computing world, the term "stability" is applied (by analogy, and rather liberally) to any situation involving a computer crash (or "downfall").

So what causes computer crashes? On a single computer system (and most computer systems are as different from each other as snowflakes), there can be a gazillion latent reasons for crashes, as well as different types of crashes. You see, having a computer up and running is (at least for Intel x86-based computers, also known as PCs) practically a miracle of modern technology: layers and layers of (mostly legacy) tech piled atop each other, starting from the computer hardware, passing through the BIOS, into the operating system and running applications. Fortunately, due to the high predictability of computer chips (they tend to repeat what they did before with more accuracy than, say, people), once the tower's up, it's kind of hard to make it topple.

Okay, let's list the two major categories of crashes:

  • hardware-related crashes
  • software-related crashes

And without further ado, let's investigate each one of them. We'll start with the hardware part of the problem.

Hardware-related crashes

To put it in one sentence: A computer can crash due to faulty hardware. (Somewhat un)fortunately, computer hardware tends to fail after long periods of use (usually in the 3 to 5 year span). And, most of the time, any computer part that fails goes down spectacularly hard: CPUs burn, power supplies explode into bursts of blue (or wacky colored) smoke, PCI cards overheat and melt, or hard disk motors die down. So, when these conditions are met, you're usually left with a nonfunctional computer, hardly what you might call a "crash".

Glitches

But sometimes (and not that rarely) a component in your computer can experience intermittent glitches. Some of these glitches are passing events (such as those caused by voltage spikes or transients, or heat buildup around the faulty component), and some repeat themselves with amazing regularity. While the glitch is absent, no malfunction is evident, and the computer works as usual. When the glitch manifests itself, odd behaviors or crashes follow.

RAM problems

Faulty RAM (random access memory) modules (not to be confused with hard disks or other nonvolatile storage media such as 75g laser printer paper) are frequent culprits of crashes. When (partly) faulty memory acts up, bits of memory flip; sometimes these flips go unnoticed (and then they come back to haunt you in the form of unreadable documents or MP3s with blips), and sometimes these flips affect application or kernel (the operating system core) memory.

When the second situation arises, applications usually fail (on Windows, with an "exception" dialog, and on Linux or UNIX, with a Segmentation fault message on the console) or, if kernel memory was compromised, a "blue screen of death" or "panic" screen appears. That's usually the point where the operating system calls it quits, and the computer stops until you reboot it. Note that faulty RAM usually does not induce a silent hard freeze but rather causes a panic or blue screen.
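To get a feel for how little it takes, here is a minimal Python sketch of a single flipped bit. It only simulates the flip in software; the names flip_bit and count_differing_bits and the sample message are made up for illustration, and nothing here touches real RAM.

```python
def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    byte_index, bit_in_byte = divmod(bit_index, 8)
    corrupted[byte_index] ^= 1 << bit_in_byte
    return bytes(corrupted)


def count_differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length buffers differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


original = b"Pay Bob 100 dollars"
# Invert bit 3 of the ninth byte (the character "1", value 0x31).
damaged = flip_bit(original, bit_index=8 * 8 + 3)

print(original)   # b'Pay Bob 100 dollars'
print(damaged)    # b'Pay Bob 900 dollars'
print(count_differing_bits(original, damaged))  # 1
```

One bit out of 152 changed, and the message now says something quite different. When the flipped bit belongs to a program's instructions or to the kernel's bookkeeping rather than to a document, you get the crashes described above.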

Hard disks

Hard disks can also make a computer fail. You see, machines with moving parts tend to fail more often than solid-state ones. Hard disks do have moving parts. And sometimes the surface of a hard disk platter gets hit by the read/write head(s) (and that's why you shouldn't bump the computer when it's working) causing permanent damage. Don't get me wrong, some disks don't fail because of this but rather because the platter simply could not hold the data (magnetically speaking, in these situations the disk simply can't tell whether an area represents a zero or a one because the platter has too weak a magnetic charge). But most of the time, mechanical failure is induced either by wear or blunt force.

Okay, when this happens, your computer can also start acting up. Best case scenario, the damaged sector of the disk hits a data file, and you can't read it (and, since hard disks are hard working machines, while the disk is trying to figure out what the hell the damaged sector says, the computer or a particular application may appear to hang indefinitely). A worse scenario is when an operating system file is hit. Then it's quite probable that the computer won't boot. But the absolute worst case scenario is when the damaged sector lies squarely on a critical area of the file system: massive (or total) data loss is not uncommon in this case, and what are computers if not devices to hold, store and let you enjoy your data?
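One way to notice a damaged data file before you need it is to record a checksum while the file is known to be good and compare it later. What follows is a rough Python sketch of that idea, not a backup tool: the function names, the three status strings and the sample file are all assumptions made for the example.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path: str, expected: str) -> str:
    """Return 'ok', 'corrupt' or 'unreadable' for one file."""
    try:
        actual = sha256_of(path)
    except OSError:
        # Covers missing files and read errors reported by the operating system.
        return "unreadable"
    return "ok" if actual == expected else "corrupt"


# Demonstration on a throwaway file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "song.mp3")
    with open(sample, "wb") as handle:
        handle.write(b"pretend this is music")
    recorded = sha256_of(sample)
    status_before = check_file(sample, recorded)

    # Simulate damage by changing one character of the contents.
    with open(sample, "wb") as handle:
        handle.write(b"pretend this is musik")
    status_after = check_file(sample, recorded)

    status_missing = check_file(os.path.join(folder, "gone.mp3"), recorded)
```

A checksum only tells you that something changed; it cannot bring the data back. That part is still the job of backups.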

Bugs in hardware parts

Sometimes components have bugs (programming or design mistakes which cause unexpected behavior). Yes, hardware can also have bugs. This is especially true of modern hardware, where cost differences in volume sales matter a lot, so devices are built from general purpose parts driven by (you guessed it) embedded software (where it's easier to let bugs slip, since the manufacturer can always issue a "firmware update" to retroactively fix their pieces of crap).

There's a moral to the hardware story

The moral of the story is quite simple: don't skimp on hardware. Brand name devices (more expensive) usually last longer and have fewer glitches: their makers use better-engineered parts and have more resources to test their products, so you get fewer incompatibilities and random failures. It's a matter of doing your research before buying anything. Use Google to dig up success and failure stories on your would-be purchases. Oh, and make sure you're purchasing a good UPS with power conditioning as well (preferably one that can tell your computer "it's time to shut down because the power is out, dude") because you don't want random glitches or burnt parts caused by power problems.

Now, on to the software part of the equation.

Software-related crashes

Software is what computers are all about, right? Data is software. Programs are software. The operating system is software.

And software causes the majority of computer malfunctions as well! That's a no-brainer, of course. If you've got good hardware, and your computer starts to fail, software is almost surely the cause.

Just a heads up: I won't be discussing the topic of application crashes here. That's a topic which is far too wide for a humble blog posting. I'm gonna be talking about real crashes, those that pop up blue screens and cause you to reboot your computer.

First: why modern operating systems don't crash so often

Operating systems are really good these days. All modern PC operating systems have memory protection (that's a fancy term for "Application A cannot get into Application B's memory at all"), paging (another fancy term which means "it's harder for a program to hog all the memory since the computer can simulate almost limitless amounts of it") and ring-based privilege separation (yet another tech term which means "applications run confined in a cage, which is enforced by the actual hardware"). Which means that applications couldn't possibly be the cause of computer breakage, right?

Wrong.

There is software that you can install that can ruin your computer with practically no limits.

You know what that kind of software is called?

Any ideas?

They're called device drivers.

Device drivers

You see, the operating system cannot possibly know how to drive all the different and crazy types of hardware that exist in the market. So modern operating systems include special facilities to let you "plug" extra functionality or knowledge right into the heart of the operating system, close to the bare metal, which enables the OS (and you) to use the hardware efficiently. These mini-applications are called device drivers, because they're used to "drive" devices.
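If you're curious which drivers are plugged into your own kernel, Linux lists its loaded modules in /proc/modules (this is where lsmod gets its data). Here is a small Python sketch that parses that listing. The sample lines below are invented for the example (module names, sizes and addresses are placeholders), and the LoadedModule type is my own naming, not anything the kernel defines.

```python
import os
from typing import List, NamedTuple, Tuple


class LoadedModule(NamedTuple):
    name: str
    size_bytes: int
    used_by: Tuple[str, ...]


def parse_proc_modules(text: str) -> List[LoadedModule]:
    """Parse lines of the form: name size refcount used_by state address."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or unexpected lines
        name, size, _refcount, users = fields[:4]
        # The fourth field is "-" when nothing uses the module,
        # otherwise a comma-separated list with a trailing comma.
        used_by = tuple(u for u in users.split(",") if u and u != "-")
        modules.append(LoadedModule(name, int(size), used_by))
    return modules


# Invented sample in the /proc/modules layout.
SAMPLE = """\
e1000e 286720 0 - Live 0x0000000000000000
snd_hda_codec 172032 1 snd_hda_intel, Live 0x0000000000000000
"""

parsed = parse_proc_modules(SAMPLE)

# On a Linux machine, the real list can be read the same way.
if os.path.exists("/proc/modules"):
    with open("/proc/modules") as handle:
        live = parse_proc_modules(handle.read())
```

Every entry in that list is code running with the kernel's privileges, which is why the quality of each one matters so much.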

Bugs in device drivers

Sometimes a device driver contains bugs. Such a bug may cause the device to malfunction (which would be the best case scenario, since usually all that happens is that the device becomes temporarily unavailable until you reboot your computer), or it can cause actual data loss. That data loss may be caused by the bug itself corrupting data, or by a blue screen of death or panic. Yes, panic, usually an indication that the memory reserved for the operating system has been corrupted, just like when a memory module fails.

And, yes, higher-quality hardware parts usually come with higher-quality drivers -- a great reason to not skimp on hardware.

Malicious software

But, sometimes, a device driver is installed for malicious purposes. Witness the Sony DRM fiasco. Yes, Sony. Once you've played (at least once) an audio CD with Sony's special software (included in the audio CD) you can be pretty sure you have a rogue device driver in your computer. Sony built that device driver with the express intention of making your CD/DVD writer malfunction and fail to copy or rip the CD into your computer or onto another disc. Yes, Sony intentionally crippled consumers' computers to "protect" their bottom line. And they're not the only industry players out there doing this. Worse, the most popular antivirus products won't tell you that your computer has this malicious software installed -- why they don't tell you is a matter for conspiracy theorists.

But of course, that's only a concern if you're using Microsoft Windows, because Sony's device driver won't run on Linux or Mac OS X. What's to stop Sony from doing the same to Linux or Mac OS X?

Attitude.

Yes, attitude. Mostly the operating system developers' attitudes.

"What's that got to do with the problem?" I hear you ask. Read on to find out why.

Why Windows users tend to be hit more by malicious software

This is the issue: Most people who run Windows NT or higher on their computers don't bother to create a non-privileged user account for everyday use. Why should they bother, when they trust Microsoft to do the hard work? So, by default, they run as the Administrator account, or an account with Administrator-level privileges. You've probably heard this before, and the name says it all: "Administrator" means "Do anything with my computer", even installing device drivers. Yes, indeed, under normal conditions, only the Administrator-level user accounts can install software that touches key system parts (and that includes device drivers). In fact, Administrator-level accounts can do nearly anything to a Windows-based computer.

Of course, most people don't know they're doing themselves a disservice by running as an administrator, because the default setting is (you guessed it) "be an administrator". Let me repeat this simple fact: programs run by Administrator (or equivalents) can do anything to a computer, including deleting your entire MP3 collection, changing everyone's passwords (or retrieving them and deciphering them, then sending them to a rogue Internet site), installing software and modifying key parts of the operating system so these applications can run hidden from the user. That's exactly how viruses install themselves on a Windows computer:

  • First, they break a vulnerable part of the operating system,
  • which is extremely easy on Windows, because Internet Explorer is a gateway for all kinds of malware,
  • or they go in unnoticed thanks to the "marvelous" AutoPlay feature for CDs that also defaults to On,
  • or they sneak in through malicious and deceitful software installers,
  • then they sit there, most of the time attempting to replicate themselves onto other computers.

End result? Crippled computers, and annoying pop-ups telling you how to refinance your debt or gain several inches on key parts of your anatomy.

Do yourself a favor: erase your Windows installation, reinstall and create yourself a non-administrative user account for everyday use, reserving the Administrator account (you did password it, right?) for software installation and system administration. And think twice about installing anything downloaded from the Internet, or opening mail attachments that aren't pictures.

Or even better, change to Linux (it's -- mostly -- free) or Mac OS X (if you can afford a Mac).

Why Linux and Mac OS X users (almost) never get viruses, worms, or 'refinance your debt' popups

Yes, that would be the "attitude" adjustment the world needs. Contrary to common Microsoft practice, both Linux and Mac OS X (based on real, true UNIX) avoid the practice of using the administrator account (called root in UNIX-land) for ordinary usage. Linux reserves the root account for special uses (installing software and system maintenance tasks, mostly) and prompts you for the root password when a change affecting the entire system is needed. Mac OS X does something similar (but with a different mechanism, much like the well-known sudo in Linux-land).

Yes, both operating systems can be crippled by malicious software installations but, in both cases, due to this simple fact (and due to AutoPlay not being available -- hurrah against Sony) the probability of this happening to you is much, much lower. In other words, you'd have to be the victim of a real con artist, following his/her exact step-by-step guidance, for a malicious software installation to cripple your computer.
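The root-versus-ordinary-user split is easy to see from inside a program. On UNIX-like systems the root account has user ID 0, so a program can simply ask which user it is running as. This is a small Python sketch under that assumption; the function names and the wording of the messages are mine.

```python
import os


def is_superuser(effective_uid: int) -> bool:
    """On UNIX-like systems, user ID 0 is the root account."""
    return effective_uid == 0


def describe_privileges() -> str:
    # os.geteuid() exists on UNIX-like systems but not on Windows.
    geteuid = getattr(os, "geteuid", None)
    if geteuid is None:
        return "cannot tell: os.geteuid() is not available on this platform"
    if is_superuser(geteuid()):
        return "running as root: this program could change anything on the system"
    return "running as an ordinary user: most system-wide changes will be refused"


message = describe_privileges()
print(message)
```

Run it from your everyday account and it should report an ordinary user. If it reports root during normal desktop use, you are in the same position as the Windows users described earlier.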

But wait, the advantage doesn't end there:

  • Fewer Linux installations: this means malicious program writers don't have as much incentive to write malicious Linux programs as Windows programs.
  • Fewer critically integrated components: the design philosophies of Mac OS X and Linux make them much more robust against actual virus and worm attacks, because each part of the operating system is clearly separated.
  • Security is built in, rather than bolted on: heard about the latest Microsoft security initiative? Well, the UNIX guys have a 20-year head start on that one, simply because they're older and more experienced. In fact, the latest computer security/hardening technologies usually come first in UNIX-land, because it's a much more mature field.
  • Fewer bugs: and this is not a myth, despite what some would have you believe -- Free Software has fewer bugs, simply because there are many more people out there with the ability to "crack open the hood" on the programs, discover and fix the bugs, which also leads to...
  • Quicker patches: when a bug hits Free Software, since the process happens out in the open, everyone can start patching their computer systems faster, either by themselves or by just waiting for their operating system vendors to issue patches -- and nowadays, even truly Free (as in beer) Linux distributions have automatic patch management systems, just like Windows Update, but better.

Keep reading for the conclusions.

Conclusions? Do we really need anything said here?

Okay, enough of the stability ramblings. I'll talk a bit about my own (anecdotal) experience, and then I'll call it quits (since I've got a real need for sleep now).

I'm a happy Linux user. I've been using Linux since 1998, and I haven't looked back once. During these (admittedly uneasy yet incredibly rewarding) times, I haven't had a single instance of my computer randomly failing or crashing that I couldn't trace to a hardware fault (remember those?). In fact, I've had just about 7 hard disks fail catastrophically, three malfunctioning motherboards and several memory modules with "flickering bits". Every single one of these failures has caused my computer to malfunction in at least one way.

But my computer hasn't had a single operating system failure in years. Sure, I've had my share of corrupted files (by malfunctioning disks) which have required reinstallation or restoration from backups. Of course, when hardware flaked on me, my computer crashed (spectacularly, at times). Naturally, some applications have bugs, and sometimes they even die without notice.

But I can confidently tell you that, during the course of the last eight and a half years, my operating system has never gone AWOL. It just doesn't go down. It's so damn mother***ing reliable that, once the operating system starts to act up, I am so 100% positive I'm in need of new hardware, that I don't even bother with tracing the cause to software. At the very most, I reboot using a different computer (something unthinkable in Windows-land) and check the log files to see if anything extraneous was going on at the "time of death".
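Checking the logs around the "time of death" is mostly a matter of filtering by timestamp. Here is a rough Python sketch of that step. It assumes each log line starts with an ISO-style timestamp (YYYY-MM-DDTHH:MM:SS), which is not true of every log format, and the sample lines are invented for the example rather than taken from a real system.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, time_of_death, window_minutes=5):
    """Keep the lines stamped within window_minutes of time_of_death."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%dT%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if abs(stamp - time_of_death) <= window:
            selected.append(line)
    return selected


# Invented sample lines in the assumed format.
SAMPLE_LOG = [
    "2006-05-08T17:40:02 example: network link up",
    "2006-05-08T17:58:51 example: disk read error, retrying",
    "2006-05-08T18:00:13 example: disk read error, giving up",
    "not a log line",
]

suspects = lines_near(SAMPLE_LOG, datetime(2006, 5, 8, 18, 1, 28))
for line in suspects:
    print(line)
```

With the default five-minute window, only the two disk errors are printed. That's the kind of hint that tells you which part to replace.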

Plus, the operating system is (and I can't explain why, because I'm no Linux kernel guru) kind of more resilient to hardware faults and glitches: I once popped a PCI card out of its socket (mental note: consumer-level PCI is not hot-pluggable) then plugged it back with the computer on, and it still worked! I have done the same to hard disks and CD drives (of course, always ensuring they're not mounted and being used, and manually syncing drives before the event) and they actually work after being reconnected. And they're not hot-pluggable hardware!

For the past year, my computer has been on, up and running in excess of 30 days between reboots. Mind you, it's not a brand-name computer, but a white box PC assembled from (OK, I'll accept it) good hardware (OK, OK, built by me) plus a great APC UPS.
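If you want to check your own numbers, Linux exposes the time since boot in /proc/uptime as two numbers: seconds since boot, then accumulated idle seconds. This is a minimal Python sketch that reads the first one; the function names and the sample string are made up for illustration.

```python
import os
from datetime import timedelta


def parse_uptime(text: str) -> timedelta:
    """Parse the first field of /proc/uptime: seconds since boot."""
    seconds_since_boot = float(text.split()[0])
    return timedelta(seconds=seconds_since_boot)


def read_uptime(path: str = "/proc/uptime") -> timedelta:
    """Read the running system's uptime (Linux only)."""
    with open(path) as handle:
        return parse_uptime(handle.read())


# A made-up reading: 2678400 seconds is exactly 31 days.
sample = parse_uptime("2678400.51 9000000.12\n")
print(sample.days)  # 31

if os.path.exists("/proc/uptime"):
    print("this machine has been up for", read_uptime().days, "days")
```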

I've said goodbye to daily reboots "just in case the computer fails later on", or rebooting because "it's gotten so damn slow since yesterday, I'd rather boot again", or power cycles because of blue screens of death.

I've also said goodbye to viruses and worms. Sure, I know a worm (or an actual living person hacker) may attempt to hit my computer some day. So I keep up with patches and I have a firewall installed, to keep bystanders out of my turf. Viruses? I haven't installed an antivirus since 1997. I also haven't had a single "enlarge your penis and we'll give you free screensavers" pop-up on my screen, ever (because, apparently, enlarging one's penis wasn't so hip in 1997).

Yes, this is not a pipe dream. You can be on the same bandwagon. Say yes to Linux.