Statistics And Measurement (S&M)

Ideally, you want to measure how well you are doing

Code defect density

Time between patches

Uptime, time between restarts

Earlier, I mentioned comparing the mean time between boots as a measure of reliability, and that is certainly one measure. However, it's not the only one. Applications restart at least as often as the OS, and often more frequently. For example, an application may allocate memory with malloc more often than it frees it: when the system runs out of virtual memory (VM), the process has to be restarted. The problem commonly manifests itself as the application refusing to take more requests, which in turn may lead to the loss of some data. When the application stops accepting data, one can hope that the load balancer will detect the problem and take the machine out of rotation before the problem becomes significant.
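As a rough illustration of the measure discussed above, here is a minimal Python sketch that computes the mean time between restarts from a list of restart timestamps. The function name and the sample timestamps are made up for the example; in practice the times would come from whatever records restarts on your systems.

```python
from datetime import datetime, timedelta


def mean_time_between_restarts(restarts):
    """Return the mean interval between consecutive restarts as a timedelta.

    `restarts` is an iterable of datetime objects; at least two are needed.
    """
    times = sorted(restarts)
    if len(times) < 2:
        raise ValueError("need at least two restart times")
    # The sum of the consecutive gaps telescopes to (last - first).
    return (times[-1] - times[0]) / (len(times) - 1)


# Hypothetical restart times for one application instance.
app_restarts = [
    datetime(2006, 1, 1, 0, 0),
    datetime(2006, 1, 3, 0, 0),
    datetime(2006, 1, 4, 12, 0),
    datetime(2006, 1, 8, 0, 0),
]

print(mean_time_between_restarts(app_restarts))  # prints "2 days, 8:00:00"
```

Running the same calculation over OS boot times and over application restart times gives two numbers that can be compared directly.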

Hardware reliability and RMAs (Return Merchandise Authorizations)

leaking capacitors

See the introduction for pointers to details.

SNMP

Some abuses of statistics and measures

Sysadmin and programmer productivity

At first glance, it looks easy to measure the productivity of a programmer or a system administrator. Sysadmins deal with tickets, and you can count how many tickets each sysadmin handles. Programmers churn out lines of code (LOCs). However, sysadmins are famous for handing off tickets to one another for all sorts of reasons. One sysadmin may be really good at very complicated, time-consuming tickets, while another excels at getting lightweight stuff done quickly. I know one sysadmin who lets account change requests pile up for up to a week, and then handles them all on Wednesday, en masse. So while he is very productive in terms of getting a lot of tickets done quickly, the users are mad at him because it takes so long to get simple things done. Programmers have tricks of their own to crank out lots of LOCs, for example:

    if ( $error_condition ) { die "an error condition was detected"; }

which is 1 LOC, versus:

    if ( $error_condition )
    {
        die "an error condition was detected";
    }

which is 4 LOCs. Focussing on LOCs also drives design. I once replaced a horrible rat's nest of if-then-elses with a much simpler implementation using a table-driven finite state automaton. This was a better solution: it was smaller and easier to understand and change. But it had fewer LOCs, so I was penalized.
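To show why a table-driven finite state automaton tends to be shorter than nested if-then-elses, here is a minimal Python sketch. It is not the code described above; the states, events, and the `run` function are hypothetical, chosen only to illustrate the technique.

```python
# Transition table: (current state, event) -> next state.
# The states and events are hypothetical, for illustration only.
TRANSITIONS = {
    ("idle", "connect"): "connected",
    ("connected", "login_ok"): "authenticated",
    ("connected", "login_fail"): "idle",
    ("authenticated", "logout"): "idle",
}


def run(events, state="idle"):
    """Feed events through the table and return the final state.

    Raises ValueError if an event has no transition from the current state.
    """
    for event in events:
        try:
            state = TRANSITIONS[(state, event)]
        except KeyError:
            raise ValueError(f"no transition from {state!r} on {event!r}") from None
    return state


print(run(["connect", "login_ok"]))  # prints "authenticated"
```

Adding a state or a transition means adding a row to the table rather than another branch, which is exactly the kind of change that lowers the LOC count while improving the code.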

Never, never, never use statistics and measures as a basis for merit raises or, worse, firing decisions. If you do, people will be motivated to "game" the system, which destroys its usefulness. For example, a single account creation request for 5 users will suddenly become 5 account creation requests, each for one user. Actual productivity goes down while measured productivity goes up.

Log file analysis

There is a lot of information you can get from the log files. The access log file can tell you which services are popular and which are not. Unpopular services can be discontinued or combined. The error log file can tell you which code is reliable and which is not. The error log file can also be used to detect failures in an environment where failure is frequently masked by the system. If the log files show lots of login failures, then you are probably under attack.
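As one example of this kind of analysis, the following Python sketch counts failed logins per source address. The line format is an assumption, modelled on the "Failed password" messages that OpenSSH writes; the sample lines, addresses, and the function name are invented for the example, and a real log may need a different pattern.

```python
import re
from collections import Counter

# Assumed line shape, modelled on OpenSSH "Failed password" messages.
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+)")


def count_failed_logins(lines):
    """Count failed login attempts per source address."""
    failures = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            failures[match.group(2)] += 1
    return failures


# Invented sample lines using documentation-only IP addresses.
sample_log = [
    "Jan  5 06:02:19 host sshd[311]: Failed password for root from 192.0.2.7 port 4022 ssh2",
    "Jan  5 06:02:21 host sshd[311]: Failed password for invalid user admin from 192.0.2.7 port 4023 ssh2",
    "Jan  5 06:03:02 host sshd[340]: Accepted password for alice from 198.51.100.4 port 5121 ssh2",
    "Jan  5 06:03:40 host sshd[352]: Failed password for root from 203.0.113.9 port 6001 ssh2",
]

# List source addresses, most failures first.
for address, count in count_failed_logins(sample_log).most_common():
    print(address, count)
```

A sudden jump in the count for one address, or a steady trickle from many addresses, is the sort of signal that is worth a closer look.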

$Log: SandM.html,v $
Revision 1.1.1.1  2006/10/01 23:36:20  cvsuser
Initial checkin to CVS

Revision 1.1  2006/01/05 06:02:19  jeffs
Initial revision