I can hear all of my geeky friends screaming at me "but you promised
that this would be a technical book!" Trust me, it will be.
My thinking is that you and your career will be better served if you
understand what "the suits" are thinking about. You want "the
suits" to read the technical stuff, so it seems fair to ask you
to read the managerial stuff. It's about questions you ought to
know the answers to:
Management is all about what, why, where, and when (and how much). The technical stuff is about the how.
Before any geeks can touch any computer, there are some things that
have to happen:
All of the things in the list above are things that sysadmins generally don't worry about, but they are critical things that have to happen if the business is going to stay healthy. If the business is not healthy, then you will lose your job. Guaranteed. In addition to the list above, there is another list of things that has to be worried about:
What is the cost of failure? The answer is: it depends.
Consider some scenarios:
Before you can design for reliability, you have to know how much
failure you can tolerate. Some systems are less critical than
others, and a wise manager will put more resources (money, personnel,
space, redundancy) into the critical systems, while spending less on
the not-so-critical systems.
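One way to put a number on "how much failure you can tolerate" is to turn an availability target into a yearly downtime budget. The sketch below is only an illustration: the system names, availability targets, and cost per hour of downtime are made-up numbers, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours per year a system may be down and still meet its target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def yearly_cost_of_downtime(availability_percent, cost_per_hour):
    """Worst-case yearly cost if the whole downtime budget is used up."""
    return allowed_downtime_hours(availability_percent) * cost_per_hour


if __name__ == "__main__":
    # Hypothetical systems: (name, availability target %, cost of one hour down)
    systems = [
        ("billing", 99.9, 5000),
        ("internal wiki", 99.0, 50),
    ]
    for name, target, cost in systems:
        hours = allowed_downtime_hours(target)
        dollars = yearly_cost_of_downtime(target, cost)
        print(f"{name}: {hours:.1f} hours/year, about ${dollars:,.0f}")
```

Comparing the two rows side by side is the point: a system that loses real money every hour it is down justifies more redundancy than one that merely inconveniences people.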
The key measure of the success of any business is net profit.
This is something you have to have at least a minimal understanding of
if you are going to convince your boss of anything.
Profit = Revenue - Total Costs
Anything you can (legally) do to increase your revenue is A Good
Thing. Anything you can (legally) do to decrease your costs is
also A Good Thing. There are a couple of problems with
considering pure profit: it fails to consider investment (paying money
in order to make money in the future) and risk.
However, the formulas below overstate the true value. The
degree to which Return On Investment (ROI) overstates the
economic value depends on at least 5 factors:
The formula for ROI is
Net Income / Book Value of Assets = Return On Investment
however, a better formula is
(Net Income + Interest × (1 - Tax Rate)) / Book Value of Assets = Return On Investment
One of the things that throws sysadmins for a loop is the concept of
depreciation. Most sysadmins understand, intuitively, that
equipment you buy has a finite life. Straight line depreciation,
the simplest way to think about it, finds the annual cost of an item by
dividing its purchase price by the lifetime in years. What is
confusing is that the faster you depreciate something, the higher the
ROI.
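Here is a small sketch of why that happens, using the simple ROI formula (Net Income / Book Value of Assets) and straight line depreciation. The purchase price and net income are made-up numbers, and the sketch assumes, purely for simplicity, that net income is the same every year. In real books the depreciation expense also reduces net income, so treat this as an illustration of the book value effect only.

```python
def straight_line_book_values(purchase_price, lifetime_years):
    """Book value at the start of each year of the asset's life."""
    annual_depreciation = purchase_price / lifetime_years
    return [purchase_price - annual_depreciation * year
            for year in range(lifetime_years)]


def roi(net_income, book_value):
    """Simple ROI: net income divided by book value of assets."""
    return net_income / book_value


if __name__ == "__main__":
    price = 10_000   # hypothetical purchase price
    income = 2_000   # hypothetical net income, assumed constant each year
    for lifetime in (5, 4):
        values = straight_line_book_values(price, lifetime)
        rois = [round(roi(income, value), 2) for value in values]
        print(f"{lifetime}-year life: book values {values}, ROI {rois}")
```

With the shorter lifetime the book value in the denominator shrinks faster, so from the second year on the same income produces a larger ROI.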
I am about to show you a minimal list of all of the functions that a
company has to do. This list has been organized the way I think
it ought to be done, putting organizations and suborganizations
together so that as much similar functionality as possible is
grouped.
Your organization will vary, but this list represents the minimum of
what you have to do.
First, security is mentioned five times: physical, computing,
personnel and 2 application checks. The rationale for this is
that while the need for security is always the same, the methods you
use to achieve it are very different. Physical security
can be handled by guards, possibly armed. But an armed guard
can't do a thing about a 14 year old prodigy from Hoboken who's just
given himself administrator rights on your MySQL database.
So you have to have somebody who knows computers to deal with computing
threats. In a small organization, that might be the system
administrator; in a large organization, a dedicated person or
team. Personnel security is supposed to protect you from "inside
jobs", where a person abuses their position of trust. That
includes things such as carefully screening new hires, running ongoing
security checks on your current employees (I know that sounds draconian
- I'm a sysadmin myself and I resent being investigated), and having
processes and procedures in place to keep sensitive information
safe. For example, you may require that all of your credit card
data be handled by two people at all times. Also, if you ever
fire or lay off somebody, you must escort them from the building and cut
off all access to the computer systems, so that you are safe from
disgruntled employees. Application security is crucial, as most
modern operating systems are pretty secure (even MS-Windows has gotten
better). The applications that run under them are frequently the
source of holes. Applications should be secure by design, and
they should be tested for security before release.
What is "Statistics and Measures", and why is it 'way up the
corporate ladder? This is the embodiment of an idea from Tom
Yourdon's book [insert citation here].
Yourdon proposes a group whose job it is to measure
things - defects, productivity, reliability. This group verifies
that your organization is meeting its goals, whatever those goals might
be. Statistics and measures is high in the corporate ladder
because it has to have as much independence as possible. So, for
example, if a given project is a train wreck, and the statistics and
measures group predicts it's going to be a train wreck, then the group
has been successful! Having statistics and measures allows
you to truly manage, as opposed to merely giving orders.
Quality assurance is in a separate organization under technology for
the same reason Statistics and Measures is in a separate organization:
it has to be as independent as possible. Quality assurance should
be involved with the design process - it must ensure that the design is
testable.
Every business has legal requirements. Every data processing
organization has additional legal requirements. In some cases,
the penalties for violating those laws can be quite severe. If
you have troubles staying awake, then ask your lawyer about some of
these laws and regulations:
| Acronym | Common Name | Who it covers | Brief summary | Typical penalties for violations |
|---|---|---|---|---|
| SOX | Sarbanes-Oxley | Public corporations | Good corporate governance | |
| HIPAA | | Hospitals, insurance companies, physicians and other private practitioners | | |
| FERPA | | Schools, community colleges, 4 year colleges, universities | | |
Your lawyer should meet with your sysadmins, DBAs, and networking
people to make sure that all of the rules and regulations are
followed. There are a couple of reasons why: 1) It's The Right
Thing To Do; and 2) The cost of proper controls is smaller than the
cost of defending, let alone losing, a lawsuit.
Marketing is all about finding out what your customers need that you
can fulfill. One of the trends in the music business in 2005 was an
increase in (legal) downloading, as opposed to purchasing CDs, which
lost market share. Why? Because most customers want only one or two
songs on a typical CD. The rest is filler. By downloading (legally)
just what they want, customers avoid paying for the filler.
Why is customer support a marketing function? Because it is a
golden opportunity to talk with your customers, an opportunity most
companies squander. Your customer support people should be noting
what doesn't work, and you should give your engineers the results of
that information so they know what should be made better. You can
even make measurements of customer satisfaction, or customer
dissatisfaction, and use that to measure if the product is
improved. Your customer support people should be asking
questions, seeking ideas, that sort of thing.
It never ceases to amaze me how tolerant American business is of the
cost of turnover. Most computer organizations are
idiosyncratic. Their systems have been developed over years or
even decades, and it is frequently cheaper to fix the old systems than
it is to rebuild them new. When a computer finally does wear out,
it is cheaper and less disruptive to duplicate its functionality than
it is to reengineer entire processes.
Once I worked on a Solaris machine
that got its inputs from a machine running Red Hat and gave its output
to another machine running Debian. The Solaris machine had given
us years of faithful and reliable service, but it was getting old and
unreliable. I replaced it with a PC running Red Hat, recompiled
all of the software, and put it into place. However, I then
noticed that the other two machines and this machine were only working
at 10% of their capacity. So I combined all of their
functionality into one computer and that, I thought, was that.
Unknown to me, or anybody else for that matter, was a little process,
just a short program in a cronjob, which was a corporate critical
process. And it would run only under Solaris, and it had to talk
to both of the other machines - occasionally. And the source code
was lost in antiquity. It turns out that the guy who wrote the
program had quit in disgust the year before. We were able to
track him down and find the source code on an old backup tape, so the
day was saved, but at such a cost!
I like to think I am a pretty good system administrator. I've
worked in places where the documentation for the sysadmins was quite
good. I've worked in places where the documentation was quite
bad, or even nonexistent. Even with superb documentation,
it takes a long time for a sysadmin to get acquainted with the
environment. Systems are frequently put together as a result of
mergers, buyouts, and moves. There's never enough time to
properly document everything, so much time is wasted trying to find
things. But it's okay, because people remember things (Under Charlie's desk is an enterprise
critical workstation, but Charlie is the only guy who can make it work
to do the billing). One day, the person with the memory
leaves, and all that arcane site specific knowledge goes with him or her.
So turnover has four costs associated with it
New people will make mistakes, not because they are stupid or incompetent, but because that's how we all learn. The person who left may have spent six months learning how a given system works (and how it fails); when he or she leaves, those six months of experience are gone. Worse, if you discover 6 months after he or she left that you need him or her for something, and bring them back on an expen$ive consulting contract, you run the risk that they will have forgotten everything.
Clearly, the solution is to reduce turnover. How do you do
that?
It happens, despite laws against it. The "sweet spot" of most
people's careers occurs in their late 20s or early 30s. At that
time, you've finally gained enough experience so that you know enough
to be useful, but you haven't had so much salary growth that you're
priced out of the market. The perception is that, as we grow
older, we slow down, get set in our ways, have more health
problems, and have families which are a distraction. Like all
stereotypes, there is some truth in all of these perceptions, and the
occasional truths tend to reinforce what we believe.
Obviously, you want technically qualified people, so you should look
for things like education, certifications, and experience. But
there are some other things you ought to look for:
I was chatting with a sysadmin who
had turned down a job offer
because the bathroom was dirty and they were out of toilet paper.
She was a very qualified sysadmin and would have been a dynamite
addition to the organization. But they stinted on the cans!
The cost of recruiting somebody else probably swamped the cost of
making nice bathrooms.
Similarly, system administration is a high stress job.
Sysadmins are smart people, and they understand that exercise is a good
relief for stress. Exercise also lowers your health care
costs. Provide a shower facility if you possibly can.
This section focuses on the "what" and the "why" of building a
reliable system. There are other parts of the book that are
devoted to "how" to do it.
The key to making reliable systems is in the design stage. It
is axiomatic that it is cheaper to fix design flaws in the design stage
than in coding, and it is cheaper to fix problems in coding than it is
in testing.
The programming industry is fairly mature at this point, and we know
what works and what doesn't. There are also some things that,
remarkably enough, we still don't know. A classic example is
"what is the best programming language?" Another example is "What
is the best operating system?"
As I write this in 2005, there seem to be a relatively small set of
programming languages in common use. Some of them are safer than
others.
My critics accuse me of "Microsoft Bashing" and to a certain extent,
they are correct: I do. The problem is that, as a computing
expert with decades of experience, I see how Microsoft has utterly
botched the job of designing for security. For example, the
Microsoft system has a single data structure, the registry, and if you
corrupt it, then your system can become unbootable. I am unaware
of any data structure like that in the UNIX or Linux world. I
suppose /etc/inittab could do it, if, for example, you made the default
run level 0 or 6. But you seldom touch the /etc/inittab, and the
only account that can is root, whereas anything and everything touches
the registry.
In all likelihood, you will have several sets of problems.
Your software engineers will have bugs, your operations staff will have
things that break and need fixing. However, your facilities
people also have problems. So do your purchasing people.
The solution, of course, is a problem tracking system that
implements business rules.
There are several ways of failure testing. I've seen (been the victim of) somebody pulling the power cord in the middle of a load test. I've also seen a client test script that locked a record, read the record, modified the record, did a kill -9 on the database PID, and then tried to write the record. When a component of the system fails, and it will, can the system recover fast enough to meet the requirement? When part of the system has failed, will performance be adequate? Is the MTTR acceptable?
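Once you have run a few failure tests, the last two questions come down to arithmetic on the recovery times you measured. The sketch below assumes you wrote down how many minutes each recovery took; the sample times, the ten minute requirement, and the function names are all hypothetical.

```python
def mean_time_to_repair(recovery_minutes):
    """Average recovery time over a set of failure tests."""
    return sum(recovery_minutes) / len(recovery_minutes)


def check_recovery(recovery_minutes, required_minutes):
    """Compare measured recovery times against the requirement."""
    mttr = mean_time_to_repair(recovery_minutes)
    worst = max(recovery_minutes)
    return {
        "mttr": mttr,
        "worst": worst,
        "mttr_ok": mttr <= required_minutes,
        "worst_ok": worst <= required_minutes,
    }


if __name__ == "__main__":
    # Hypothetical recovery times, in minutes, from three failure tests
    measured = [4, 6, 11]
    print(check_recovery(measured, required_minutes=10))
```

Reporting the worst case next to the average is a deliberate choice: an acceptable MTTR can hide one recovery that blew through the requirement.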
How well are you doing? One way to measure that is by looking
at your financials. If your revenue is greater than your costs,
then you are doing well indeed. Your shareholders, VCs, and your
employees are all interested in this measure.
But there are other metrics you might use, and those have a bearing
on your profitability.
Why is Ethics in a book about reliable systems? Because Ethics is key to making systems reliable. In order for your systems to work reliably, you have to know what the problems are. Your people have to have confidence that they can come to you with a problem, and you won't shoot the messenger. You only have to shoot one or two, and the word will get out that you don't want to hear bad news. Consequently, when bad things happen (and they will), you won't know about them until they are too big to ignore, and perhaps too big to do anything about.
One day, an engine fell off of a
Boeing 747. The FAA was concerned that a part which holds the
engine on, called a "fuse pin", may be defective. So I was put on
a team to test the fuse pins. I had been given some software to
run the test with (the test was almost completely automated) and all I
had to do was run the computer. So we put a fuse pin in the test
machine and tested it. It was fine. We put another fuse pin
in the test machine, and it was fine. We put another fuse pin in
the test machine, and it was fine. After a while, I suggested
that we try testing a fuse pin to destruction, just to see what would
happen. I was told this had been done decades ago, we don't need
to do that. I tried again, pointing out that it would be an
interesting test of my software. So we took an old fuse pin that
we had already tested and put it back in the machine. I gave
instructions to the program to increase the load on the pin to 110% of
"worst case" load. Then 120% of "worst case" load. Then 130% of "worst case" load. Then the fuse pin
broke. It turned out that my software had a bug in it.
I had to go to my manager and tell him that the software had a bug in
it, and we might have to redo all of the testing we had done.
Fortunately, my manager was an ethical man, and he listened carefully
to my story and then asked me what I was going to do about it.
The two of us came up with a plan. He reported to his higher ups
that there was a problem and we were working on it. I worked with
my fellow engineers to figure out the problem, develop a fix, test it
by destroying another fuse pin, and reprocess the old data to make it right.
My manager took a risk that his managers would be mad at him - this was
a highly visible issue. I took a risk by going to my
manager. We could have covered up the problem, and the world
would never know. But Boeing had a reputation as a quality
organization. That day, I tested that reputation, and it passed
that test.
Earlier, I mentioned personnel security. I discussed
investigating people before you hire them, and investigating your key
people again while they are working for you. That is to protect
you against abuse of trust for financial gain. But people are
motivated by things other than money, e.g. revenge. So while you
should not be afraid of your people, you do have to treat them with
respect, dignity and understanding. Remarkably enough, you don't
have to pay them very well if you can motivate them in other ways
(consider, for example, Boy Scouts and Girl Scouts - nobody is making
money but there are a lot of scouts). One way is an equitable
profit sharing plan. After all, they are sharing risk with you,
even if they are not aware of it. Another way is a relaxed
atmosphere, especially when there are no customers around.
Frequent parties, recognition for work well done, respecting people's
wishes concerning overtime and schedule (to the extent that you can)
are all ways to keep your workforce loyal.
It ain't easy.
While I was out of work, I was offered a ludicrous sum of money to
come in and help a pornographer who wanted to take over his own
hosting. I have friends who go "war driving", find unsecured
wireless transceivers and send hundreds of thousands of spams.
With a little luck and skill, the owner of the access point will never
know. There are phishing sites. I found one that was
registered in Taiwan, but running traceroute suggested it was in
Fullerton, California. There are times when I really want to
write a virus.
One of my managers once asked me how small I could make an
operations department and still make it work 24x7. After some
thought, I decided that the answer was zero: you could farm out the
whole thing. The manager said that he wanted to know how small to
make the operations department but still have it under his
control. Again, the answer was: zero. Just because you've
outsourced doesn't necessarily mean that you've lost control of
it. Now that manager, clearly frustrated, told me that he wanted
to have an operations staff of direct reports, how small could it be
and still run 24x7? I decided that the answer was anywhere from 4
to 16, depending on how failure tolerant he wanted the operation to
be. His response was that he wanted an organization of between 8
and 10 people - how could we organize it to provide 24x7 operation?
Most people want to work the day shift, because the rest of the
world does. Computer system failures seem to be uniformly
distributed over time, especially for 24x7 systems. Even if your
systems are failure tolerant, they will break. A lot of outfits
do batch processing at night, to get the next day's billings out the
door. If you are selling to the global economy, then your load
will be fairly constant over 24 hours. So you have to organize to
provide a human being, 24x7.
However, you need more than one human being. The system
administrators have a different skill set than the networking guys, who
in turn have different skills than the database administrators.
So you need one person from each of these three groups available.
Then, everybody needs a backup person, for when things go really bad,
or if a question arises that requires more research or thought.
Also, people go on vacation, get sick, go to conferences, etc., so
you need a third person from each of these groups. Finally, you
will need a cadre of developers to analyze and possibly correct
software failures. This discussion assumes that you have
automatic monitoring that will alert your people at the earliest sign
of trouble and will at least minimally diagnose the problem, which may
or may not be a valid assumption.
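The head count in the reasoning above works out to three groups times three people each, before you add any developers. The sketch below does that arithmetic and shows one possible weekly rotation inside a group. The names are invented, and the rotation scheme (primary, backup, and a third person free for vacation or training) is my assumption for illustration, not a prescription.

```python
def operations_headcount(groups):
    """Total people across all groups (developers not included)."""
    return sum(len(people) for people in groups.values())


def weekly_rotation(people, week):
    """Rotate a three-person group through primary, backup, and off duty."""
    n = len(people)
    return {
        "primary": people[week % n],
        "backup": people[(week + 1) % n],
        "off": people[(week + 2) % n],
    }


if __name__ == "__main__":
    # Hypothetical staff: three people in each of the three skill groups
    groups = {
        "sysadmin": ["Ann", "Bob", "Cy"],
        "network": ["Dee", "Ed", "Flo"],
        "dba": ["Gus", "Hal", "Ida"],
    }
    print("headcount:", operations_headcount(groups))
    for week in range(3):
        for name, people in groups.items():
            print(week, name, weekly_rotation(people, week))
```

Nine people falls inside the 8 to 10 the manager asked for, and rotating the roles each week means nobody is permanently stuck carrying the pager.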
If you have a sysadmin, a DBA, a network administrator, and a
developer, then you have the kernel of a technical operations
organization. However, if anything happens to any of them, then
you have a major hole in your organization, one that is impossible to
fill.
This is expensive. If your organization is very small, then it
might make sense to outsource the system administration. In this
day and age, you have a lot of options. You can outsource to an
outfit in town, a company elsewhere in the country, or a company
elsewhere in the world.
$Log: management.html,v $ Revision 1.1.1.1 2006/10/01 23:36:20 cvsuser Initial checkin to CVS Revision 1.1 2006/01/05 06:02:19 jeffs Initial revision