Introduction: the problem

One dark evening over the Pacific Ocean, near the coast of Washington state:

"Seattle Center, this is transglobal zero-three-four with you at 27,000 bound for three niner thousand"

"Transglobal zero-three-four thank you please squawk three six zero five"

"Squawk three six zero five. Transglobal zero-three-four"

"Seattle Center, this is Air Vancouver Island one zero zero with you at FL three four five, smooth ride"

"Air Vancouver, I've gotcha. Seattle Center"

"Transglobal zero-seven-niner, please turn right to zero seven six, maintain altitude and speed"

"Seattle Center: come right to zero seven six, maintain altitude and speed. Transglobal zero-seven-niner"

"Seattle Center, this is Pacific Sky zero zero four at 38,000 request permission for 26,000."

"Pacific Sky zero zero four descend to 26,000 at your discretion"

"Cleared to descend to 26,000 at my discretion, thank you Seattle Center"

"Seattle Center, this is World Wide zero niner three with you at 39,000"

"World wide, please squawk three six zero six"

"MAYDAY MAYDAY MAYDAY This is this is Pacific Sky zero zero four. I have a sudden cabin decompression. Repeat I have a sudden cabin decompression. Request emergency descent to 8,000 and a vector to Portland"

"Pacific Sky zero zero four: This is Seattle Center, I hear your emergency. Let me get traffic out of your way. Squawk seven seven zero zero. Begin your descent now and turn left, I'll get you your vector to PDX in just a moment. Transglobal, turn right come to 270 hold altitude and maintain speed. Oh shit, my computer just went down. Who's squawking three six zero five? Say your altitude, course and speed. Where the hell are you?"

"Seattle Center: Which Transglobal?"

There was a time when there were no computers. Everything that had to be done was done by humans, perhaps with the assistance of animals. Clearly anything that required thinking at any level had to be done by humans. As we got smarter, we started building tools. Our tools became machines and then became sophisticated machines. We started doing things faster and cheaper, and the profits fueled research and development into still faster and still cheaper ways of doing things. Even nuclear weapons, which are horrendously expensive, were developed as a faster and cheaper way of destroying something.

But after a while, people started noticing something unanticipated: sometimes the tools didn't work. Sometimes the problems were annoyances, and sometimes the problems were lethal. In many cases, once the failure mode began, there was no way to recover other than by starting over (consider, for example, the Titanic).

As far as I know, the first industry to start thinking about safety and reliability in design was the railroad industry. Trains cannot start or stop quickly because they are very heavy relative to the power available for starting or stopping. So signals were developed to extend the vision of the engineer. The first such signal was a ball on a tower that was hoisted to the top when the way ahead was clear. If the engineer could see the ball, then the way was clear to proceed. If the engineer could not see the ball, then he had to assume that the tracks were blocked by something. This was a brilliant system. If it was foggy, or dark, or if the rope broke, or if the signalman had neglected to raise the ball, then the train would not start. This may have been inconvenient, but it was safe. To this day, when the conductor of a train wants the engineer to start, he will call on his radio or cell phone: "high ball, high ball, high ball".

Alas, the computer industry has not been nearly so safety conscious. There is an underlying tension between the demands of cost and the demands of reliability, and in general, cost has won out. This is not to say that you cannot build a reliable computer system on a budget: clearly you can. But it is going to cost more and it's hard to justify that extra cost to uptight managers and anxious investors. Furthermore, reliable computer systems have been around since the 1950s.

What's worse is that, to a degree I haven't seen in any other industry, the players tend to be highly polarized. The managers want everything done faster and cheaper, frequently with no understanding of how to accomplish that (or even to measure it to see if they are succeeding). The software developers want to write really high quality code, but there is no silver bullet. The testers can always find more things to test. The system administrators have to make it all work when it's done. What makes the situation more interesting is that a lot of the players have (reasonable) aspirations of transitioning into management. So it has proven difficult to unionize (in many industries, unions have been instrumental in advocating safety rules, and frequently safety translated into reliability).

We know that computers break. Fans fail, which causes motherboards or power supplies to fry. Disk drives crash. Capacitors leak (see articles at www.pcstats.com, www.siliconchip.com, Wikipedia and www.dashdist.com).

Finally, most of the literature concerning computer reliability has been theoretical work - by computer engineers, software engineers and electrical engineers for engineers. By way of contrast, there is relatively little literature for system administrators, network administrators, software engineers, and the managers who command them. I wrote this book for those people, trying very hard to stay away from theory and instead to embrace practical suggestions and tutorials on how to do things reliably.

The keys to reliability

In order to build a reliable computer system, you have to go through the following steps, more or less in order:

Establish goals for performance, reliability and cost. In particular, you have to estimate the cost of failure, and then estimate the cost of getting the job done. It may very well be the case that what you are trying to do just isn't economically feasible.
Develop a design of hardware, software, physical infrastructure and network infrastructure to get the job done. Use this design to create estimates of system reliability and cost.
Implement the design.
Test it in a test environment.
Put it in production.
Test it in production.
Monitor it for proper operation on an ongoing basis.
Sunset the system at end-of-life.
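
The first two steps above come down to arithmetic, so here is a minimal sketch of what such an estimate might look like. Every number in it is a made-up placeholder, the function names are hypothetical, and the model assumes that components fail independently of each other, which real hardware sharing a rack, a power feed or an air conditioner often does not.

```python
# Rough availability and cost-of-failure estimate.
# Assumption: component failures are independent. All figures are hypothetical.

HOURS_PER_YEAR = 24 * 365


def series_availability(*parts):
    """Availability of a chain in which EVERY component must work."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def parallel_availability(*parts):
    """Availability of redundant components where ANY one is sufficient."""
    all_fail = 1.0
    for availability in parts:
        all_fail *= 1.0 - availability
    return 1.0 - all_fail


def annual_downtime_cost(availability, cost_per_hour):
    """Expected yearly cost of being down, given a cost per hour of outage."""
    return (1.0 - availability) * HOURS_PER_YEAR * cost_per_hour


def redundancy_pays_off(extra_cost_per_year, base, improved, cost_per_hour):
    """True if the downtime saved is worth more than the extra yearly spend."""
    saved = (annual_downtime_cost(base, cost_per_hour)
             - annual_downtime_cost(improved, cost_per_hour))
    return saved > extra_cost_per_year


if __name__ == "__main__":
    server, network = 0.99, 0.999          # hypothetical availabilities
    single = series_availability(server, network)
    paired = series_availability(parallel_availability(server, server), network)
    print("single server:", single, "two servers:", paired)
    print("second server worth it:",
          redundancy_pays_off(20000.0, single, paired, 1000.0))
```

The point of writing it down is not the precision of the answer, which is only as good as the guesses you feed it, but that it forces you to put a number on the cost of failure before you start arguing about the cost of the fix.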

Sysadmin training (this belongs someplace else)

Most of the system administrators I know, especially the ones I respect, started as something else. There is relatively little training for sysadmins, and most of the sysadmin training comes from learning how things are done at a particular site. I know one sysadmin who is an absolutely brilliant man, who has a master's degree in computer science. However, he's weak at installing hardware and will frequently ask other sysadmins on the team to do his hardware installations for him. I know another sysadmin who holds an MCSE (Microsoft Certified Systems Engineer) and he's very good at running Microsoft systems, but he doesn't seem to understand how to make them interoperate with anything else. Further, he doesn't know about a lot of failure modes that aren't in the Microsoft documentation, such as how to detect two machines with the same IP address. I interviewed a 17-year-old kid who's a genius at building old computers: he cons computer repair shops out of obsolete but still good hardware and puts 'em together (he built a software RAID-5 disk array out of floppy drives). I myself was an engineer at the Boeing Company, and I was charged with building a software integration lab: I became the sysadmin by virtue of the fact that nobody else wanted to do it. Most of what I know I learned by either reading the book or asking my fellow sysadmins how to do things.

One of the consequences is that sysadmins frequently have gaps in their training. For example, if you grew up in an environment that was scrupulous about tracking IP addresses, then you probably don't know what two machines with the same IP address look like (you might want to try it on a test network). If you grew up in an environment where machines were remotely administered, then you might not know how to install a computer. If you haven't done facilities work, then you might not understand the difference between two-phase and three-phase power (and Y vs. delta wired three-phase) and what can go wrong with each. If you grew up in a Solaris environment, then you might find the BSDs and Linux somewhat confusing (and vice versa). If you grew up in a Microsoft environment, then the whole concept of all this really cool stuff for free (as in Free Beer) might be a little weird. One of the reasons for this phenomenon is that frequently management will buy something and then skimp on the training for it. The sysadmins will figure out how to do the stuff they need to do, but it won't be the "best" way (a classic example: SNMP).
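
To make the duplicate IP address example concrete, here is a small sketch of one way to spot it: collect (IP address, MAC address) pairs as seen on the wire over time and flag any IP address that has been claimed by more than one MAC address. The function name and the sample data are hypothetical, and how you gather the pairs (ARP table snapshots, a packet capture, switch logs) is deliberately left out.

```python
from collections import defaultdict


def find_duplicate_ips(observations):
    """Return {ip: sorted MACs} for each IP seen with more than one MAC.

    `observations` is an iterable of (ip, mac) string pairs.
    """
    macs_by_ip = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same MAC is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs)
            for ip, macs in macs_by_ip.items()
            if len(macs) > 1}


if __name__ == "__main__":
    # Hypothetical observations using documentation-range addresses.
    sample = [
        ("192.0.2.10", "00:11:22:33:44:55"),
        ("192.0.2.11", "00:11:22:33:44:66"),
        ("192.0.2.10", "00:AA:BB:CC:DD:EE"),  # second claimant for .10
        ("192.0.2.11", "00:11:22:33:44:66"),  # same machine seen again
    ]
    for ip, macs in find_duplicate_ips(sample).items():
        print(ip, "is claimed by", ", ".join(macs))
```

A hit is a lead to investigate, not proof of a conflict: a replaced network card or a deliberately shared failover address will also show one IP address behind more than one MAC address.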

When I wrote this book, I tried hard to make no assumptions about what sysadmins know and don't know. There are sysadmins who are geniuses at running Apache, khttpd, and roxen, but they don't know how to write HTML, even with a nice HTML editor such as Mozilla or emacs. Some sysadmins don't know how to work with a relational database, and why should they when there are DBAs who can run circles around them? So for each topic I tried to cover what you have to do, why you have to do it, a suggestion on how to do it, and then pointers to more information.

$Log: introduction.html,v $
Revision 1.1.1.1  2006/10/01 23:36:20  cvsuser
Initial checkin to CVS

Revision 1.1  2006/01/05 06:02:19  jeffs
Initial revision