Ideally, you want to measure how well you are doing.
Earlier, I mentioned comparing the mean time between boots as a
measure of reliability, and that is certainly one measure.
However, it's not the only one. Applications always restart at
least as frequently as the OS, and often more frequently. For
example, an application may malloc RAM more often than it frees
it (a memory leak): when the system runs out of virtual memory (VM),
the process has to be restarted. It is common for the problem to
manifest itself by the
application refusing to take more requests, which in turn may lead to
the loss of some data. When the application stops accepting data,
one can hope that the load balancer will detect the problem and take
the machine out of rotation before the problem becomes significant.
See the introduction for pointers to details.
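To make the comparison concrete, here is a minimal Python sketch that computes the mean time between starts for the OS and for an application. The timestamps are made up for illustration, and `mean_time_between` is a hypothetical helper name; in practice you would pull the start times from wherever your systems record boots and process restarts.

```python
from datetime import datetime, timedelta

def mean_time_between(starts):
    """Return the mean interval between consecutive start times, or None."""
    ordered = sorted(starts)
    if len(ordered) < 2:
        return None  # one start gives no interval to measure
    # The sum of consecutive gaps is simply last minus first.
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)

# Hypothetical data: the OS booted twice, the application started five times.
os_boots = [datetime(2006, 1, 1), datetime(2006, 1, 31)]
app_starts = [datetime(2006, 1, 1) + timedelta(days=d) for d in (0, 6, 12, 18, 24)]

print("OS  mean time between boots:   ", mean_time_between(os_boots))
print("App mean time between restarts:", mean_time_between(app_starts))
```

With this sample data the OS averages 30 days between boots while the application averages 6 days between restarts, which is the kind of gap that the boot count alone would hide.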
At first glance, it looks easy to measure the productivity of a
programmer or a system administrator. Sysadmins deal with tickets
and you can count how many tickets each sysadmin deals with.
Programmers churn out lines of code (LOCs). However, sysadmins
are famous for handing off tickets to one another for all sorts of
reasons. One sysadmin may be really good at very complicated,
time-consuming tickets, while another excels at getting lightweight
stuff done
quickly. I know one sysadmin who lets account change requests
pile up for up to a week, and then he handles them all on Wednesday, en
masse. So while he is very productive in terms of getting a lot
of tickets done quickly, the users are mad at him because it takes so
long to get simple things done. Programmers have tricks of their
own to crank out lots of LOCs, for example:
    if ( $error_condition ) { die "an error condition was detected"; };
which is 1 LOC, versus:
    if ( $error_condition )
    {
        die "an error condition was detected";
    };
which is 4 LOCs. Focussing on LOCs also drives design. I
once replaced a horrible rat's nest of if-then-elses with a much simpler
implementation using a table-driven finite state automaton. This
was a better solution: it was smaller and easier to understand and
change. But it had fewer LOCs, so I was penalized.
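For readers who have not seen the technique, here is a small Python sketch of a table-driven finite state automaton. It is not the code I replaced; it is a made-up example that recognizes an optionally signed integer, and the state names, `TRANSITIONS`, `classify` and `accepts` are all hypothetical. The point is that the behavior lives in a table rather than in nested if-then-elses.

```python
# Transition table: (current state, input class) -> next state.
# Any pair that is not listed is treated as an error.
TRANSITIONS = {
    ("start", "sign"): "signed",
    ("start", "digit"): "digits",
    ("signed", "digit"): "digits",
    ("digits", "digit"): "digits",
}
ACCEPTING = {"digits"}

def classify(ch):
    """Map a character to the input class used in the table."""
    if ch in "+-":
        return "sign"
    if ch in "0123456789":
        return "digit"
    return "other"

def accepts(text):
    """Return True if text is an optionally signed run of digits."""
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:
            return False  # no transition defined: reject
    return state in ACCEPTING

for sample in ("-42", "42", "+", "4-2"):
    print(repr(sample), accepts(sample))
```

Changing the behavior means editing rows in the table, not untangling branches, which is why this style tends to come out shorter.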
Never, never, never use statistics and measurements as a basis for
merit raises or, worse, firing decisions. If you do, then people
will be motivated to "game" the system, which will destroy its
usefulness. For example, a single account creation request
for 5 users will suddenly become 5 account creation requests each for
one user. The productivity actually goes down while the measured
productivity goes up.
There is a lot of information you can get from the log files.
The access log file can tell you which services are popular and which
are not. Unpopular services can be discontinued or
combined. The error log file can tell you which code is reliable
and which is not. The error log file can also be used to detect
failures in an environment where failure is frequently masked by the
system. If the log files show lots of login failures, then you
have a strong sign that you are under attack.
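As a rough sketch of this kind of log mining, the Python below counts requests per service and flags clients with repeated login failures. The log format is an assumption made up for the example (whitespace-separated timestamp, client address, then either a path and status or a free-text message); real logs will need their own parsing. The function names and the threshold of 3 are likewise hypothetical.

```python
from collections import Counter

def hits_per_service(access_lines):
    """Count requests per top-level path (assumed format: time client path status)."""
    counts = Counter()
    for line in access_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip lines that do not match the assumed format
        service = fields[2].strip("/").split("/")[0] or "/"
        counts[service] += 1
    return counts

def failed_logins_by_client(error_lines, marker="login failed"):
    """Count lines containing the marker, keyed by client (assumed to be field 2)."""
    counts = Counter()
    for line in error_lines:
        fields = line.split()
        if len(fields) >= 2 and marker in line.lower():
            counts[fields[1]] += 1
    return counts

def suspicious_clients(error_lines, threshold=3):
    """Return clients with at least `threshold` failed logins, sorted."""
    failures = failed_logins_by_client(error_lines)
    return sorted(c for c, n in failures.items() if n >= threshold)

# Hypothetical sample data in the assumed format.
access_log = [
    "2006-01-05T06:02:19 10.0.0.1 /mail/inbox 200",
    "2006-01-05T06:02:20 10.0.0.2 /mail/send 200",
    "2006-01-05T06:02:21 10.0.0.1 /calendar/today 200",
]
error_log = [
    "2006-01-05T06:03:00 10.0.0.9 login failed user=root",
    "2006-01-05T06:03:01 10.0.0.9 login failed user=admin",
    "2006-01-05T06:03:02 10.0.0.9 login failed user=guest",
    "2006-01-05T06:04:00 10.0.0.1 login failed user=alice",
    "2006-01-05T06:05:00 10.0.0.1 disk quota exceeded",
]

print(hits_per_service(access_log).most_common())
print(suspicious_clients(error_log))
```

A single failed login is usually a typo, so the sketch only reports clients that cross a threshold; where to set that threshold depends on your environment.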
$Log: SandM.html,v $ Revision 1.1.1.1 2006/10/01 23:36:20 cvsuser Initial checkin to CVS Revision 1.1 2006/01/05 06:02:19 jeffs Initial revision