One dark evening over the Pacific ocean, near the coast of
Washington state:
"Seattle Center, this is
Transglobal zero-three-four with you at 27,000 bound for three niner
thousand"
"Transglobal zero-three-four thank you please squawk
three six zero five"
"Squawk three six zero five. Transglobal zero-three-four"
"Seattle Center, this is Air
Vancouver Island one zero zero with you at FL three four five, smooth
ride"
"Air Vancouver, I've
gotcha. Seattle Center"
"Transglobal zero-seven-niner,
please turn right to zero seven six, maintain altitude and speed"
"Seattle Center: come right to zero seven six, maintain altitude and speed. Transglobal zero-seven-niner"
"Seattle Center, this is Pacific
Sky zero zero four at 38,000 request permission
for 26,000."
"Pacific Sky zero zero four descend
to 26,000 at your discretion"
"Cleared to descend to 26,000 at
my discretion, thank you Seattle Center"
"Seattle Center, this is World
Wide zero niner three with you at 39,000"
"World Wide, please squawk three
six zero six"
"MAYDAY MAYDAY MAYDAY This is this is Pacific Sky zero zero
four. I have a sudden cabin decompression. Repeat I have a
sudden cabin decompression. Request emergency descent to 8,000
and a vector to Portland"
"Pacific Sky zero zero four: This is
Seattle Center, I hear your emergency. Let me get traffic out of
your way. Squawk seven seven zero zero. Begin your descent now
and turn left, I'll get you your vector to PDX in just a moment.
Transglobal, turn right come to 270 hold altitude and maintain
speed. Oh shit, my
computer just went down. Who's squawking three six zero
five? Say your altitude, course and speed. Where the hell
are you?"
"Seattle Center: Which
Transglobal?"
There was a time when there were no computers. Everything that
had to be done was done by humans, perhaps with the assistance of
animals. Clearly anything that required thinking at any level had
to be done by humans. As we got smarter, we started building
tools. Our
tools became machines and then became sophisticated machines. We
started doing things faster and cheaper, and the profits fueled
research and development into still faster and still cheaper ways of
doing things. Even nuclear weapons, which are horrendously
expensive, were developed as a faster and cheaper way of destroying
something.
But after a while, people started noticing something unanticipated: sometimes the tools didn't work. Sometimes the problems were annoyances, and sometimes the problems were lethal. In many cases, once the failure mode began, there was no way to recover other than by starting over (consider, for example, the Titanic).
Insofar as I know, the first industry to start thinking about safety and reliability in design was the railroad industry. Trains cannot start or stop quickly because they are very heavy in relation to the power available for starting or stopping. So signals were developed to extend the vision of the engineer. The first such signal was a ball on a tower that was hoisted to the top when the way ahead was clear. If the engineer could see the ball, then the way was clear to proceed. If the engineer could not see the ball, then he had to assume that the tracks were blocked by something. This was a brilliant system. If it was foggy, or dark, or if the rope broke, or if the signalman had neglected to raise the ball, then the train would not start. This might be inconvenient, but it was safe. To this day, when the conductor of a train wants the engineer to start, he will call on his radio or cell phone: "high ball, high ball, high ball".
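The raised-ball rule is what would now be called a fail-safe default: only a positive sighting grants permission, and every other outcome means "stay put". Here is a minimal Python sketch of that rule. The names `Signal` and `may_proceed` are made up for illustration, and `None` is an assumed stand-in for any failed observation (fog, darkness, a broken rope, a forgetful signalman).

```python
from enum import Enum
from typing import Optional


class Signal(Enum):
    """Hypothetical model of the ball signal described above."""
    BALL_HIGH = "ball high"  # way ahead is clear
    BALL_LOW = "ball low"    # way ahead is not known to be clear


def may_proceed(observed: Optional[Signal]) -> bool:
    """Return True only when the raised ball was positively seen.

    None (no observation at all) is treated exactly like a lowered
    ball, so every failure of the signalling system stops the train.
    """
    return observed is Signal.BALL_HIGH


if __name__ == "__main__":
    for observed in (Signal.BALL_HIGH, Signal.BALL_LOW, None):
        verdict = "proceed" if may_proceed(observed) else "stay put"
        print(observed, "->", verdict)
```

The design choice worth noticing is that there is no "unknown, so carry on" branch: the only way to get `True` is for everything to have worked.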
Alas, the computer industry has not been nearly so safety conscious.
There is an underlying tension between the demands of cost and the
demands of reliability, and in general, cost has won out. This is
not to say that you cannot build a reliable computer system on a
budget: clearly you can. But it is going to cost more and it's
hard to justify that extra cost to uptight managers and anxious
investors. Furthermore, reliable computer systems have been
around since the 1950s.
What's worse is that, to a degree I haven't seen in any other
industry, the players tend to be highly polarized. The managers
want everything done faster and cheaper, frequently with no
understanding of how to accomplish that (or even to measure it to see
if they are succeeding). The software developers want to write
really high quality code, but there is no silver bullet. The
testers can always find more things to test. The system
administrators have to make it all work when it's done. What
makes
the situation more interesting is that a lot of the players have
(reasonable) aspirations of transitioning into management. So it
has proven difficult to unionize (in many industries, unions have been
instrumental in advocating safety rules, and frequently safety
translated into reliability).
We know that computers break. Fans fail, which causes
motherboards or power supplies to fry. Disk drives crash. Capacitors leak (see articles at www.pcstats.com, www.siliconchip.com, Wikipedia, and www.dashdist.com).
Finally, most of the literature concerning computer reliability has
been theoretical work - by computer engineers, software engineers and
electrical engineers for engineers. By way of contrast, there is
relatively little literature for system administrators, network
administrators, software engineers, and the managers who command
them. I wrote this book for those people, trying very hard to
stay away from theory and instead to embrace practical suggestions and
tutorials on how to do things reliably.
In order to build a reliable computer system, you have to go through
the following steps, more or less in order:
Most of the system administrators I know, especially the ones I
respect, usually start as something else. There is relatively
little training for sysadmins, and most of the sysadmin training comes
from learning how things are done at a particular site. I know
one sysadmin who
is an absolutely brilliant man, who has a master's degree in computer
science. However, he's weak at installing hardware and will
frequently ask other sysadmins on the team to do his hardware
installations for him. I know another sysadmin who holds an MCSE
(Microsoft Certified Systems Engineer) and he's very good at running
Microsoft systems, but he doesn't seem to understand how to make them
interoperate with anything else. Further, he doesn't know about a
lot of failure modes that aren't in the Microsoft documentation, such
as how to detect two machines with the same IP address. I
interviewed a 17-year-old kid who's a genius at building old computers
- he cons computer repair shops out of obsolete but still good hardware
and puts 'em together (he built a software RAID-5 disk array out of
floppy drives). I myself was an engineer at the Boeing
Company, and I was charged with building a software integration lab: I
became the sysadmin by virtue of the fact that nobody else wanted to do
it. I learned most of what I know either by reading the book or by
asking my fellow sysadmins how to do things.
One of the consequences is that sysadmins frequently have gaps in
their training. For example, if you grew up in an environment
that was scrupulous about tracking IP addresses, then you probably
don't know what two machines with the same IP address look like (you
might want to try it on a test network). If you grew up in an
environment where machines were remotely administered, then you might
not know how to install a computer. If you haven't done
facilities work, then you might not understand the difference between
two phase and three phase power (and Y vs. delta wired three phase) and
what can go wrong with each. If you grew up in a Solaris
environment, then you might find the BSDs and Linux somewhat confusing
(and vice versa). If you grew up in a Microsoft environment, then
the whole concept of all this really cool stuff for free (as in Free
Beer) might be a little weird. One of the reasons for this
phenomenon is that frequently management will buy something and then
skimp on the training for it. The sysadmins will figure out how
to do the stuff they need to do, but it won't be the "best" way (a
classic example: SNMP).
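The duplicate-address gap mentioned above is one that a few lines of code can help close. The following is a minimal Python sketch, not a definitive tool: it assumes you have already collected (IP address, MAC address) observations by some other means, for example from ARP tables on a test network, and it simply reports every IP address claimed by more than one MAC address. The function name `find_duplicate_ips` and the sample addresses are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_duplicate_ips(
    observations: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Map each IP address seen with more than one MAC to those MACs."""
    macs_by_ip: Dict[str, Set[str]] = defaultdict(set)
    for ip, mac in observations:
        # Normalize case so the same card is not counted twice.
        macs_by_ip[ip].add(mac.lower())
    return {ip: macs for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # Hypothetical observations using documentation-only addresses.
    sample = [
        ("192.0.2.10", "00:00:5e:00:53:01"),
        ("192.0.2.11", "00:00:5e:00:53:02"),
        ("192.0.2.10", "00:00:5E:00:53:03"),  # second claimant for .10
        ("192.0.2.11", "00:00:5E:00:53:02"),  # same card, different case
    ]
    for ip, macs in sorted(find_duplicate_ips(sample).items()):
        print(f"{ip} is claimed by {', '.join(sorted(macs))}")
```

A hit is a reason to go look, not proof of a misconfiguration: a replaced network card or a deliberately shared failover address can also show one IP address with two MAC addresses over time.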
When I wrote this book, I tried hard to make no assumptions about
what sysadmins know and don't know. There are sysadmins who are
geniuses at running Apache, khttpd, and roxen, but they don't know how
to write HTML, even with a nice HTML editor such as Mozilla or
emacs. Some sysadmins don't know how to do a relational database,
and why should they when there are DBAs who can run circles around
them? So what I tried to do was to consider what you have to do,
why you have to do it, a suggestion on how to do it, and then pointers to
more information.
$Log: introduction.html,v $
Revision 1.1.1.1 2006/10/01 23:36:20 cvsuser
Initial checkin to CVS
Revision 1.1 2006/01/05 06:02:19 jeffs
Initial revision