Nobody can predict when a given system is going to fail, but we can make predictions about how many systems are going to fail in a given period of time. To do this, we have to have a basic understanding of probability theory. Normally, this will be done by your S&M person.
The probability of any event e occurring (in a given period of time) is Pe, which is a floating point number in the range of 0 to 1, inclusive. An event which cannot happen has a Pimpossible of 0. An event which must happen has a Pcertain of 1. If you toss a fair coin, there are only two possibilities, Pheads=0.5 and Ptails=0.5. Note that 0.5+0.5=1.0.
The probability that the computer will fail in a given unit of time is the Mean Time To Repair (MTTR) divided by the Mean Time Between Failures (MTBF): Pfailure=MTTR/MTBF. So, if the MTTR is 2 hours and the MTBF is 9000 hours (roughly a year), then Pfailure=2/9000=0.000222.
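The arithmetic above can be sketched in a few lines of Python. This is only an illustration using the 2 hour MTTR and 9000 hour MTBF from the text; the function name failure_probability is made up for this example.

```python
def failure_probability(mttr_hours: float, mtbf_hours: float) -> float:
    """Return Pfailure = MTTR / MTBF for one unit of time."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return mttr_hours / mtbf_hours


# Figures from the text: 2 hours to repair, 9000 hours between failures.
p_failure = failure_probability(mttr_hours=2, mtbf_hours=9000)
p_success = 1 - p_failure

print(f"Pfailure = {p_failure:.6f}")  # 0.000222
print(f"Psuccess = {p_success:.6f}")  # 0.999778
```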
Note that the event either occurs or does not occur. So, for example, if the machine kernel panics and reboots automatically, that is a failure for whatever quantum of time you use. Even if the machine reboots in three minutes, if you measure reliability in hours, it is considered down for the whole hour. Of course, if the machine kernel panics and reboots three times in that hour, it's still considered a single failure, but that's an unusual failure mode. In fact, kernel panics followed by a successful reboot are pretty rare, at least in the Linux world. It is more common for a hardware problem to take the machine down and leave it down. My intuition from years of watching computers break suggests to me that an hour is about the right quantum of time for discussing such things, but that's just my preference.
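As a rough sketch of this "whole hour" bookkeeping, the snippet below turns a list of outages into the set of hours that count as failed. The input format (start minute and duration in whole minutes) and the name failed_hours are assumptions made for this example, not part of any monitoring tool.

```python
def failed_hours(outages):
    """Return the set of hour indices touched by any outage.

    `outages` is a list of (start_minute, duration_minutes) pairs of
    whole minutes, counted from the start of the observation period.
    """
    hours = set()
    for start, duration in outages:
        first = start // 60
        # A zero-length blip still spoils the hour it happens in.
        last = (start + max(duration, 1) - 1) // 60
        hours.update(range(first, last + 1))
    return hours


# Three 3-minute reboots in the same hour count as one failed hour.
print(len(failed_hours([(5, 3), (20, 3), (40, 3)])))  # 1

# One outage from minute 58 lasting 125 minutes spoils hours 0 through 3.
print(sorted(failed_hours([(58, 125)])))  # [0, 1, 2, 3]
```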
The probability that both of two independent events will occur is the product of the probabilities of the individual events. So, for example, if the Pfailure of a computer is 0.000222 and you have two such computers, then the probability that the two of them will fail in the same hour is:
Pfailure_1∩failure_2=Pfailure_1*Pfailure_2=0.000222222*0.000222222=4.93827E-08
(∩ is the symbol for intersection).
There are 365.25*24=8766 hours in a year, so the probability that the two computers will fail in the same hour in a given year is approximately 8766*4.93827E-08 or 0.000432889.
We usually quote reliability in terms of "nines". Since reliability is the complement of failure, Psuccess=1-Pfailure. In the example above, 1-0.000432889=0.999567. This is three nines reliability in one year.
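The two-computer calculation, and the conversion to "nines", can be sketched as follows. Multiplying the hourly figure by 8766 is an approximation that works because the hourly probability is tiny; the sketch also computes the exact value 1-(1-p)^8766 for comparison. The helper name nines is made up for this example.

```python
import math

HOURS_PER_YEAR = 365.25 * 24  # 8766

p_failure = 2 / 9000
# Both independent machines down in the same hour.
p_both_hour = p_failure * p_failure

# Approximation used in the text: hourly probability times hours per year.
p_both_year_linear = HOURS_PER_YEAR * p_both_hour
# Exact: 1 - (1 - p)^hours, written with log1p/expm1 to keep precision.
p_both_year_exact = -math.expm1(HOURS_PER_YEAR * math.log1p(-p_both_hour))


def nines(p_success: float) -> int:
    """Count the leading nines in a reliability figure such as 0.999567.

    Values sitting exactly on a boundary (for example 0.999) may land on
    either side because of floating point rounding.
    """
    return math.floor(-math.log10(1 - p_success))


print(f"{p_both_year_linear:.6f}")      # 0.000433
print(f"{1 - p_both_year_linear:.6f}")  # 0.999567
print(nines(1 - p_both_year_linear))    # 3
```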
The astute reader will ask: why calculate the odds of failure, why not calculate the odds of success? We can, and we come up with the same answer. The probability that either of two independent events will occur is the sum of the probabilities of the events minus the probability that both occur, or Psuccess_1 ∪ success_2=Psuccess_1+Psuccess_2 - Psuccess_1*Psuccess_2 (the symbol ∪ means union). This may seem counterintuitive, but think about it using this table (where the column widths and heights are proportional to the probabilities):
     |             | Pe1         | P~e1
     | Probability | .7          | .3
Pe2  | .8          | Pe2∩e1=.56  | Pe2∩~e1=.24
P~e2 | .2          | P~e2∩e1=.14 | P~e2∩~e1=.06
Note that .56+.14=0.7, and so on. .56+.24+.14+.06=1.0.
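The table can be rebuilt in a few lines to check the union rule. This sketch uses exact fractions so the sums come out cleanly; the dictionary keys are just labels chosen for this example.

```python
from fractions import Fraction

p_e1 = Fraction(7, 10)
p_e2 = Fraction(8, 10)

# The four cells of the table, assuming e1 and e2 are independent.
cells = {
    "e1 and e2": p_e1 * p_e2,
    "not e1 and e2": (1 - p_e1) * p_e2,
    "e1 and not e2": p_e1 * (1 - p_e2),
    "neither": (1 - p_e1) * (1 - p_e2),
}
for name, p in cells.items():
    print(f"{name:>14}: {float(p):.2f}")

# Simply adding the two probabilities counts "e1 and e2" twice.
naive_sum = p_e1 + p_e2
p_union = p_e1 + p_e2 - cells["e1 and e2"]
# Same answer from adding the three cells where at least one event occurs.
p_union_from_cells = sum(p for name, p in cells.items() if name != "neither")

print(float(naive_sum))  # 1.5, which cannot be a probability
print(float(p_union))    # 0.94
```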
The problem with the intuitive thought (simply adding the two probabilities) is that the case of Pe2∩e1 is counted twice. Another way to think of it is
Psuccess_1 ∪ success_2=Psuccess_2 ∩ fail_1 + Psuccess_2 ∩ success_1 + Pfail_2 ∩ success_1
Psuccess_1 ∪ success_2=Psuccess_2*Pfail_1 + Psuccess_2*Psuccess_1 + Pfail_2*Psuccess_1
In the case of the numbers above, Psuccess_1 ∪ success_2 = 8998/9000 * 2/9000 + 8998/9000 * 8998/9000 + 2/9000 * 8998/9000 = 0.999999950617284. But that's for a given hour. To find the probability of success over a year you must calculate Psuccess_year=1-(1-Psuccess_hour)*8766, which gives 0.999567, which is again 3 nines. So the reason why we do these calculations in terms of failure is that it's easier to do these calculations with probabilities of intersections instead of unions.
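To check that the success route and the failure route agree, here is a small sketch using exact fractions and the 2/9000 figure from the text. The variable names are chosen for this example only.

```python
from fractions import Fraction

p_fail = Fraction(2, 9000)
p_ok = 1 - p_fail
HOURS_PER_YEAR = 8766

# Union route: add the three disjoint cases where at least one machine is up.
p_either_up = p_ok * p_fail + p_ok * p_ok + p_fail * p_ok

# Intersection route: one minus the chance that both are down.
assert p_either_up == 1 - p_fail * p_fail

# Scale the hourly failure probability up to a year, as in the text.
p_success_year = 1 - (1 - p_either_up) * HOURS_PER_YEAR

print(f"{float(p_either_up):.15f}")    # 0.999999950617284
print(f"{float(p_success_year):.6f}")  # 0.999567
```

The intersection route needs one multiplication and one subtraction, which is the practical reason for working with failures.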
If you have N servers, any one of which is capable of doing the job, then Psimultaneous_failure=Pfailure_1*Pfailure_2*...*Pfailure_N. If all the machines are equally likely to fail, then Psimultaneous_failure=Pfailure^N, which is much easier to calculate. It should be a small number. For example, if Pfailure=2/9000=0.000222222 and you have 8 computers, any 1 of which is up to carrying the load, then Psimultaneous_failure=0.000222222^8 or 5.94698*10^-30. So 8 computers is probably overkill.
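A quick sketch of how fast the simultaneous failure probability shrinks as machines are added, assuming identical and independent machines with Pfailure=2/9000. The function name p_all_fail is made up for this example.

```python
def p_all_fail(p_failure: float, n: int) -> float:
    """Probability that all n identical, independent machines are down at once."""
    return p_failure ** n


p_failure = 2 / 9000
for n in range(1, 9):
    print(f"{n} machine(s): {p_all_fail(p_failure, n):.3e}")
# The last line printed is: 8 machine(s): 5.947e-30
```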
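What about the probability of a failure if you need M computers out of N computers? The service fails when fewer than M machines are up, that is, when more than N-M machines are down at the same time. If all the machines are equally likely to fail, Pfailure_service is the sum, for k from N-M+1 to N, of C(N,k)*Pfailure^k*(1-Pfailure)^(N-k), where C(N,k) is the number of ways to choose k machines out of N. So, if Pfailure=2/9000 and you need 4 computers out of 8, then 5 or more must fail at once, and Pfailure_service is roughly 56*(2/9000)^5, or about 3.0*10^-17. That is far larger than the 6*10^-30 chance of all 8 failing together, but it is still tiny. For comparison, the probability that at least one of the 8 computers is up, 1-Psimultaneous_failure, is 0.999 999 999 999 999 999 999 999 999 994. Calculating the M out of N answer using the probabilities of success is left as an exercise for the reader.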
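One standard way to model the M out of N case is the binomial distribution. The sketch below assumes identical, independent machines and that the service is down whenever fewer than M machines are up. The function name p_service_down is hypothetical.

```python
import math


def p_service_down(p_failure: float, n: int, m: int) -> float:
    """Probability that fewer than m of n independent machines are up.

    That happens when at least n - m + 1 machines are down at once, so
    we add the binomial probabilities for k = n - m + 1 through n failures.
    """
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    min_failed = n - m + 1
    return sum(
        math.comb(n, k) * p_failure ** k * (1 - p_failure) ** (n - k)
        for k in range(min_failed, n + 1)
    )


p_failure = 2 / 9000
print(f"{p_service_down(p_failure, n=8, m=4):.2e}")  # 3.03e-17
print(f"{p_service_down(p_failure, n=8, m=1):.3e}")  # 5.947e-30
```

With m=1 the sum collapses to the single term Pfailure^N, which matches the "any one machine will do" case.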
The discussion above assumes that the probabilities are independent. This is true for most but not all failure modes. A fire in the computer room, power failure (the most common failure mode), cooling failure, a fire below the computer room (see the stories: the fried computers), earthquake, insurrection, and volcanic eruption are all disasters that will affect all computers. Simultaneous resignation of the operations staff is an interesting disaster, and it has happened.
To deal with these possibilities, you've got to have a remote facility for disaster recovery (DR). DR is a whole chapter unto itself.
$Log: intro_to_stats.html,v $
Revision 1.1.1.1 2006/10/01 23:36:21 cvsuser
Initial checkin to CVS
Revision 1.2 2006/09/20 21:22:45 jeffs
Update discussion of statistics
Revision 1.1 2006/01/05 08:08:22 jeffs
Initial revision
Revision 1.1 2006/01/05 06:02:19 jeffs
Initial revision