As part of a recent push to automate everything from test builds to documentation updates, my group, at the request of one of our development teams, deployed a job-scheduling system. The idea behind the deployment is that anyone should be able to set up a periodic job to do work that takes a long time but is not absolutely critical to the day-to-day work of the company. It is a way of avoiding having people run cron jobs on their desktops and of providing a centralized set of background processing services.
There are a couple of problems with the system, though. The first is that it is very resource intensive, particularly in terms of memory use. The second is that no one in our group knows how it works, or how it is really used, but only how to deploy it on our servers. Every week or so someone uses the system in a new and unexpected way, which then breaks the system for all the previous users. The people who use the system are too busy to explain how they use it, which actually defeats the main reason we deployed it in the first place: to save them time. The documentation is not very good either. No one in the group that supports the system has the time to read and understand the source code, but we have to do something to learn how the system works and how it scales in order to save ourselves from this code. Can you shed some light on how to proceed without becoming mired in the code?
So your group fell for the "just install this software and things will be great" ploy. It is an old trick that continues to snag sysadmins and others who have supporting roles around developers. Whenever someone asks you to trust them, don't. Cynical as that might be, it is better than being suckered.
But now that you have been suckered, how do you un-sucker yourselves? While wading through thousands of lines of unknown code of dubious provenance is the normal approach to such a problem (a sort of "suck it up" effort), there are some other ways of trying to understand the system without starting from main() and reading every function.
The first is to build a second system, just for yourselves, and create a set of typical test jobs for your environment. The second is to use the system already in place to test how far you can push it. In both cases, you will want to instrument the machine so that you can measure the effect that adding work has on the system.
Once you have the set of test jobs, or you are running on the production machine, you instrument your machine(s) to measure the effect each job has on the system. In your original question, you say memory is one of the things the job-control system uses in large amounts, so that is the first thing to look at. How much real memory, not virtual, does the system use when you add a job? If you add two jobs, does it take twice as much? What about three? How does the memory usage scale? Once you can graph how the memory usage scales, you can get an idea of how much work the system can take before you start to have memory problems. You should continue to add work until the system begins to swap, at which point you will know the memory limit of the system.
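One way to turn those measurements into a graphable trend is sketched below in Python. It assumes a Linux host, where a process's resident (real) memory appears as the VmRSS field of /proc/<pid>/status. The function names and sample numbers are invented for illustration and do not come from any particular job-scheduling system. The straight-line fit is itself an assumption: it is only a first guess at where trouble starts.

```python
import re

def rss_kb(status_text):
    """Return resident memory in kB from the text of /proc/<pid>/status (Linux)."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no VmRSS field found")
    return int(match.group(1))

def fit_line(samples):
    """Least-squares fit of rss = base + per_job * jobs over (jobs, rss_kb) pairs."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    per_job = sxy / sxx
    base = mean_y - per_job * mean_x
    return base, per_job

def jobs_before_limit(samples, limit_kb):
    """Extrapolate how many jobs fit under limit_kb, assuming linear growth."""
    base, per_job = fit_line(samples)
    if per_job <= 0:
        return None  # memory does not grow with job count in these samples
    return int((limit_kb - base) // per_job)

# Hypothetical measurements: (number of jobs, resident memory in kB).
samples = [(1, 150_000), (2, 200_000), (3, 250_000)]
estimate = jobs_before_limit(samples, 1_000_000)
```

An estimate like this tells you roughly where to expect problems. It does not replace pushing the real system until it swaps, because memory use may stop being linear well before the predicted limit.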
Do not make the mistake of trying only one or two jobs; go all the way to the limit of the system, because there are effects you will not find with only a small amount of work. If the system had failed with one or two jobs, you would not have deployed it at all, right? Please tell me that is right.
Another thing to measure is what happens when a job ends. Does the memory get freed? On most modern systems you will not see memory freed until another program needs memory, so you will have to test by running jobs until the system swaps, then remove all the jobs, and then add the same number of jobs again. Does the system swap with fewer jobs after the warm-up run? The system may have a memory leak. If you cannot fix the leak, then guess what, you will get to reboot the system periodically, since you are unlikely to have time to find the leak yourself.
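The warm-up comparison described above can be sketched as a small Python harness. It assumes a Linux host, where swap use can be derived from the SwapTotal and SwapFree fields of /proc/meminfo. The add_job callback is hypothetical and stands in for whatever submits one test job to your scheduler, since the real submission interface is unknown here.

```python
def swap_used_kb(meminfo_text):
    """Swap in use, in kB, from the text of /proc/meminfo (Linux)."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name in ("SwapTotal", "SwapFree"):
            fields[name] = int(rest.split()[0])
    return fields["SwapTotal"] - fields["SwapFree"]

def jobs_until_swap(add_job, read_swap_kb, max_jobs):
    """Add jobs one at a time until swap use rises above its starting value.

    add_job and read_swap_kb are caller-supplied callables (hypothetical hooks).
    Returns the job count at which swap use first grew, or None if it never did.
    """
    baseline = read_swap_kb()
    for count in range(1, max_jobs + 1):
        add_job()
        if read_swap_kb() > baseline:
            return count
    return None

def looks_like_leak(first_run, second_run):
    """True if the second run hit swap with fewer jobs than the first."""
    if first_run is None or second_run is None:
        return False
    return second_run < first_run
```

In practice read_swap_kb would be something like a function that opens /proc/meminfo and passes its contents to swap_used_kb, and you would remove all jobs between the two runs. Real jobs take time to start and allocate memory, so a real harness would also need to wait after each add_job call before sampling.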
When you are trying to understand how a system scales, it is also good to look at how it uses resources other than memory. All systems have simple tools to look at CPU utilization, and you should, of course, make sure that the job-control system is not the one taking all the CPU time, as that adds to the total system overhead.
The files and network resources a system uses can be understood using programs such as procstat, as well as lsof. Does the system open lots of files and just leave them? That is a waste of resources you need to know about, because most operating systems limit the number of open files a process can have. Is the system disk intensive, or does it use lock files for a lot of work? A system that uses lots of lock files needs to have space on a local, non-networked disk for the lock files, as network file systems are particularly bad at file locking.
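If you would rather track open files over time than eyeball lsof output, a rough Python sketch follows. It assumes Linux, where /proc/<pid>/fd holds one entry per open descriptor and /proc/<pid>/limits lists the per-process limits. The function names are invented for illustration, and reading another user's process this way normally requires root.

```python
import os
import re

def open_fd_count(pid):
    """Number of file descriptors a process currently has open (Linux /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def max_open_files(limits_text):
    """Soft 'Max open files' limit parsed from the text of /proc/<pid>/limits.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    match = re.search(r"^Max open files\s+(\S+)\s+(\S+)", limits_text, re.MULTILINE)
    if match is None or match.group(1) == "unlimited":
        return None
    return int(match.group(1))

def fd_headroom(open_count, soft_limit):
    """How many more descriptors the process can open before hitting its limit."""
    return soft_limit - open_count
```

Sampling open_fd_count for the scheduler's process as you add and remove jobs shows whether descriptors are released when jobs finish. A count that only ever goes up is the "open lots of files and just leave them" case.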
A rather drastic measure, and one that I favor, is the use of ktrace, strace, and particularly DTrace to figure out just what a program is doing to a system. The first two will definitely slow down the system they are measuring, but they can quickly show you what a program is doing, including the system calls it makes when waiting for I/O to complete, plus what files it is using, and other details. On systems that support DTrace, the overhead of tracing is reduced, and on a system that is not latency sensitive, it is acceptable to do a great deal more tracing with DTrace than with either ktrace or strace. There is even a script, dtruss, provided with DTrace, that works like strace, but that has the lower overhead associated with DTrace. If you want to know what a program is doing without tiptoeing through the source code, I strongly recommend using some form of tracing.
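Raw trace output gets long quickly, so it helps to boil it down. The Python sketch below tallies system calls and collects the paths passed to open and openat from a saved trace. It assumes strace's default text format, one call per line in the form name(args) = result, optionally prefixed by a process ID when children are followed. The sample lines are made up, and lines that do not start with a call name (signals, exit notices, resumed calls) are simply skipped.

```python
import re
from collections import Counter

# Optional "[pid 1234] " or "1234 " prefix, then the syscall name and "(".
SYSCALL = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(")
QUOTED = re.compile(r'"([^"]*)"')

def summarize(trace_lines):
    """Return (Counter of syscall names, set of paths opened) for a trace."""
    calls = Counter()
    paths = set()
    for line in trace_lines:
        match = SYSCALL.match(line)
        if match is None:
            continue  # signal, exit notice, or resumed call: not counted
        name = match.group(1)
        calls[name] += 1
        if name in ("open", "openat"):
            quoted = QUOTED.search(line)
            if quoted:
                paths.add(quoted.group(1))
    return calls, paths

# Hypothetical excerpt of a trace, for illustration only.
sample = [
    'openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3',
    'read(3, "\\177ELF"..., 832) = 832',
    '[pid  4242] openat(AT_FDCWD, "/var/lock/job.lock", O_WRONLY|O_CREAT, 0644) = 4',
    '--- SIGCHLD {si_signo=SIGCHLD} ---',
    '+++ exited with 0 +++',
]
calls, paths = summarize(sample)
```

A summary like this answers the earlier questions directly: which files the system touches, whether lock files show up, and which calls dominate. When all you want is a per-call count, strace can also produce one itself with its -c option.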
In the end it is always better to understand the goals of a system, but with engineers and programmers being who they are, this might be like pulling teeth. Not that pulling teeth isn't fun (trust me, I've done it), but it is more work than it looks like and sometimes the tooth fairy doesn't give you that extra buck for all your hard work.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2013 ACM, Inc.