
Checking and Killing Processes Using spy and shoot

Bugs? - Just don't write buggy programs! - Simple! Of course, it will clearly never happen that a program written in this class would ever have any sort of problems, but, if, for some reason, a program that you write were to crash unexpectedly, there's something to watch out for.

An MPI program that contains parallelism may start simultaneously on all (or at least, many) of the DEC Alpha machines. If one process crashes and MPI dies, it is quite possible that some of the other processes will continue living and, cut off from their MPI connection, just sort of hang around and use up CPU time. This is a great way to lose friends!

In fact, sit back for a while and imagine the Alphas, filled to the brim with students, all of them running their programs together on all the machines. One student's program crashes, leaving nine other copies of his program treading water.

Then a second person's program crashes. And a third.

These people try to fix their bugs, recompile, and run their programs again. The twenty-seven floundering processes from their first attempts are still around.

Some other people's programs crash, adding more dead weight. After a second compile-and-run attempt, the Alphas are host to sixty-three floundering processes, each potentially using up a unit of CPU load.

Inexplicably, the Alphas start to feel sluggish.

Slow, even.

Tempers flare. People start getting out their knives.

Not a good scene!

Soooo, for just such an eventuality, we have provided the commands spy, spyall, and shoot.

Typing spy starts a remote shell on each of the Alphas and issues a ps command that displays the current status of every process on the Alphas associated with your username. spyall does the same, but shows the processes owned by anyone on the Alphas. shoot ensures that all of your processes (except login shells on porsche) die.
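To make the mechanism concrete, here is a minimal sketch of what a spy-like helper could look like. It is not the actual spy script: the function name spy_sketch, the host names in HOSTS, and the choice of rsh as the remote shell are all assumptions made for illustration, and the real script may work differently.

```shell
#!/bin/sh
# Sketch of a spy-like helper. This is NOT the real spy script.
# HOSTS and REMOTE_SH are assumptions; substitute your cluster's values.
HOSTS="${HOSTS:-alpha1 alpha2 alpha3}"   # hypothetical host names
REMOTE_SH="${REMOTE_SH:-rsh}"            # assumed remote shell command

# Print the calling user's processes on every host listed in $HOSTS.
spy_sketch() {
    user="${USER:-$(id -un)}"
    for host in $HOSTS; do
        echo "=== $host ==="
        # -n redirects the remote shell's stdin from /dev/null so it
        # cannot swallow input meant for the calling script.
        $REMOTE_SH -n "$host" ps -u "$user"
    done
}
```

After sourcing this file, running spy_sketch prints one block of ps output per host. A spyall-like variant would list every user's processes instead (for example with ps -e), under the same assumptions.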

To access these programs, first create symbolic (soft) links to them from within a directory that is on your path and for which you have write permission (e.g., your home directory or personal bin directory):

ln -s ~pollock/public/spy spy 
ln -s ~pollock/public/shoot shoot 
ln -s ~pollock/public/spyall spyall

Now when you type spy, spyall, or shoot, the shell finds the link through your path, and the link points to my copy of the file, which is what gets executed.
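If the way a link on your path resolves is unclear, the following self-contained sketch reproduces it in a scratch directory. Every name in it (demo_dir, the public and bin subdirectories, and the command tool) is hypothetical and stands in for the real shared directory and commands.

```shell
#!/bin/sh
# Demonstration only: all directory and command names are hypothetical.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/public" "$demo_dir/bin"

# A stand-in for the shared copy of a command.
printf '#!/bin/sh\necho "hello from the shared copy"\n' > "$demo_dir/public/tool"
chmod +x "$demo_dir/public/tool"

# A soft link in a directory of your own, pointing at the shared copy.
ln -s "$demo_dir/public/tool" "$demo_dir/bin/tool"

# Once that directory is on the path, the shell finds the link by name
# and executes the file it points to.
PATH="$demo_dir/bin:$PATH"
tool
```

Because the link refers to the shared file rather than copying it, any later update to the shared copy is picked up automatically.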

It is suggested that whenever you run an MPI program on a large portion of the Alphas, and it crashes unexpectedly in a way that leads you to believe that there may be other, floundering processes left over, you should run spy to check out your suspicions and shoot to find and kill any processes you have hanging around.

It is strongly suggested that you issue a shoot command immediately before logging off porsche to help keep the peace. You should also run spyall just before you run a program to collect performance numbers, to be sure that no one else is running a job that will skew them. You want to be the only one using the cluster during performance runs.






Lori Pollock
Wed Feb 4 14:18:58 EST 1998