I'm going to create a fairly simple,
highly reliable application to demonstrate some of these ideas: an
acronym server.
We always start these kinds of documents with a description of what
we want the system to do. There are two kinds of requirements:
business requirements and technical requirements. The business
people haven't a clue about the problems that the technical people are
going to face. So we take the business requirements and create
technical requirements. In some cases, these requirements border
on design. In the case of computer systems, there frequently
isn't a nice dividing line between requirements and design, because one
has to beat what one wants against what can be done.
The acronym server has the following business requirements, in order of decreasing priority:
One of the interesting things about this exercise was what happened to the reliability figure. In my rough draft, I wanted 99.99% reliability, which is a number I just dreamed up (don't be surprised; a lot of managers do that). After beating that reliability figure against what I felt could be done at a reasonable cost, I decided to relax the reliability requirement by a factor of 3. In an academic exercise where I am both the technician and the manager, I can get away with this sort of thing.
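To make the reliability figure concrete, here is a rough sketch that converts a percentage into allowed downtime per year. It assumes that "relax by a factor of 3" means allowing three times as much downtime, which is only one reading of the sentence above; the function names are hypothetical and exist only for this illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Allowed downtime per year for a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def relax(availability_pct: float, factor: float) -> float:
    """Relax a target by allowing `factor` times as much downtime.

    Assumption: this is what "relax by a factor of 3" means here.
    """
    return 100 - (100 - availability_pct) * factor


original = 99.99
relaxed = relax(original, 3)

print(f"{original:.2f}% allows {downtime_minutes_per_year(original):.1f} min/year")
print(f"{relaxed:.2f}% allows {downtime_minutes_per_year(relaxed):.1f} min/year")
```

Under that assumption, the target moves from 99.99% to 99.97%, or from roughly 53 minutes of downtime a year to roughly 158.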
In addition to the business requirements, there are some technical requirements:
| Environment | Name | Usage and characteristics | Reliability and repair | Configuration control | Hardware |
|---|---|---|---|---|---|
| Development | dev | Used for development. Software includes compilers, debuggers, documentation tools. | Systems may crash at will. Repair SLA is next business day. | No formal configuration control, or control by developers. | Virtual machines are acceptable. Whatever is lying around. |
| Test | test | Test systems, including simulated load generators. | Systems may crash at will. Repair SLA is next business day. | Informal configuration control by software leads. | Virtual machines are acceptable. Whatever is lying around. Must be fast enough to generate a proper load. |
| Integration | int | Used to simulate a production environment and test software release procedures. | Systems ought not to crash. Repair SLA is next business day. | Informal configuration control by QA leads. | Virtual machines are acceptable. |
| Load | load | Used to simulate a production environment and check that performance is what is desired. | Systems ought not to crash. Repair SLA is next business day. | Informal configuration control by QA leads. | Production hardware. Virtual machines are acceptable if and only if production machines with the same function are virtual machines. |
| Production | prod | Production environment. | Repair SLA is 1 hour, 24x7x365. | Formal configuration control by management. | Production hardware. Virtual machines are acceptable if approved by the system architect. |
| Ancillary | an | Advertising Content management, log processing, system monitoring, security monitoring, release system. | Repair SLA is defined on a per-machine basis. | Configuration controls as appropriate for the application. | Production hardware. Virtual machines are acceptable if approved by the system architect. |
| Customer facing | | Rewrite rules go to other machines. | Repair SLA is 1 hour, 24x7x365. | Formal configuration control by management. | Production hardware. Virtual machines are acceptable if approved by the system architect. |
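
The virtual machine rules in the environment table can be written down as data and checked mechanically. What follows is a minimal sketch, not a real configuration system: the `VmPolicy` and `Environment` types and the `vm_acceptable` function are names I made up for illustration, and the customer-facing environment is left out because the table gives it no short name.

```python
from dataclasses import dataclass
from enum import Enum


class VmPolicy(Enum):
    ALLOWED = "virtual machines are acceptable"
    MATCH_PRODUCTION = "only if the production counterpart is virtual"
    ARCHITECT_APPROVAL = "only if approved by the system architect"


@dataclass(frozen=True)
class Environment:
    name: str
    repair_sla: str
    vm_policy: VmPolicy


# Values transcribed from the environment table; the structure is hypothetical.
ENVIRONMENTS = {
    env.name: env
    for env in (
        Environment("dev", "next business day", VmPolicy.ALLOWED),
        Environment("test", "next business day", VmPolicy.ALLOWED),
        Environment("int", "next business day", VmPolicy.ALLOWED),
        Environment("load", "next business day", VmPolicy.MATCH_PRODUCTION),
        Environment("prod", "1 hour, 24x7x365", VmPolicy.ARCHITECT_APPROVAL),
        Environment("an", "defined per machine", VmPolicy.ARCHITECT_APPROVAL),
    )
}


def vm_acceptable(
    env_name: str,
    *,
    prod_counterpart_is_vm: bool = False,
    architect_approved: bool = False,
) -> bool:
    """Return True if a virtual machine may be used in the named environment."""
    policy = ENVIRONMENTS[env_name].vm_policy
    if policy is VmPolicy.ALLOWED:
        return True
    if policy is VmPolicy.MATCH_PRODUCTION:
        return prod_counterpart_is_vm
    return architect_approved


print(vm_acceptable("dev"))
print(vm_acceptable("load", prod_counterpart_is_vm=True))
print(vm_acceptable("prod"))
```

Keeping the rules in one table like this means a provisioning script can refuse a virtual machine in `prod` unless somebody records the architect's approval.
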
| Requirement | Rule | Rationale |
|---|---|---|
| Prevent water damage in case of a flood in the computer room | No machine installed lower than 8" above the floor. | Why 8"? So that if somebody is wading through the water, the waves won't touch the machines and possibly start a fire or kill somebody. |
| Any power supply might fail | All production computers have hot-swappable dual power supplies. | If a PS fails, then the machine won't go down, and the PS can be repaired without interrupting service. |
| | Dual power supplies on different phases. | A phase might fail (it's more likely that two phases will fail, but a single phase failure is still a possibility). |
| | Machines in a farm must be spread out over several racks. | If a PDU in a rack fails, it won't take out an entire farm. |
| | Current draw must be balanced across all phases in a rack. No more than 40% of nominal max current may be drawn under normal conditions. | If one of the PDUs fails, then all of the load goes to the other PDU. If any PDU is loaded beyond 50% when the failover occurs, the surviving PDU will be overloaded. |
| SOX and FACTA | Financial processing machines must have special security mounting with limited access controls. | |
| Good airflow | Aisles between racks should alternate between "hot side" and "cold side". Computers should draw air from the front, which is the cold side, and exhaust it on the back, which is the hot side. The fronts of the computers should face the fronts of the computers on the next row; similarly, the backs of the computers should face the backs of the computers. | The computers will stay cooler if they draw in cool air. Once the air has been heated, it will tend to rise out the hot side and be captured by the HVAC. |
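
The arithmetic behind the 40% current rule is worth spelling out. This is a small sketch under the assumption that the load is split evenly across the PDUs and that a failed PDU's share is spread evenly over the survivors; the function names and the generalization beyond two PDUs are mine, not part of the requirement.

```python
def load_after_failover(load_fraction: float, pdus: int = 2) -> float:
    """Load on each surviving PDU after one of `pdus` equally loaded PDUs fails.

    `load_fraction` is each PDU's normal draw as a fraction of its nominal
    maximum (0.40 means 40%).
    """
    if pdus < 2:
        raise ValueError("failover needs at least two PDUs")
    return load_fraction * pdus / (pdus - 1)


def survives_failover(load_fraction: float, pdus: int = 2) -> bool:
    """True if the surviving PDUs stay at or below 100% of nominal maximum."""
    return load_after_failover(load_fraction, pdus) <= 1.0


for fraction in (0.40, 0.50, 0.60):
    after = load_after_failover(fraction)
    status = "ok" if survives_failover(fraction) else "overloaded"
    print(f"{fraction:.0%} per PDU becomes {after:.0%} after failover: {status}")
```

With two PDUs, a 40% normal load becomes 80% on the survivor, which leaves some headroom below the 50% point at which the survivor would be running at its full nominal maximum.
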