
Error Messages

Many messages - When an MPI job crashes, you typically get several lines of error messages. The FIRST line is the most important and usually contains the clue to your actual problem. The remaining messages usually come from the system's attempt to clean up the processes that were left hanging.

Infinite messages - Occasionally, you will get runaway error messages that appear to be in an infinite loop. To stop them, log onto porsche from another window and issue a shoot command.

Intermittent messages - During initial testing, we discovered that the DEC Alphas sometimes issue intermittent error messages for programs that are correct. The errors may be related to a network or hardware problem. Gurus are looking into it, but unfortunately you may still have to deal with this! Our suggestion: build your program slowly, adding just a few lines at a time. If you get an error with no obvious origin, run the code a couple of times to find out whether the problem is yours or the system's. Don't forget to issue a shoot command between runs to clean up leftover jobs. Error messages that point to a system problem usually contain phrases like:

	net_send: could not write
	unidentified err handler
	bad file number
	interrupt SIGBUS: 10
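
As a rough illustration, a check for these phrases could be sketched in C as follows. The function name looks_like_system_problem is hypothetical, and the phrase list is simply copied from the messages above. A match only hints at a system problem; it does not prove one.

```c
#include <stddef.h>
#include <string.h>

/* Phrases copied from the list above; matching is a plain substring test. */
static const char *const suspicious_phrases[] = {
    "net_send: could not write",
    "unidentified err handler",
    "bad file number",
    "interrupt SIGBUS: 10",
};

/* Hypothetical helper: returns 1 if the message contains any of the
 * listed phrases, 0 otherwise. */
int looks_like_system_problem(const char *message)
{
    size_t count = sizeof suspicious_phrases / sizeof suspicious_phrases[0];
    size_t i;

    for (i = 0; i < count; i++) {
        if (strstr(message, suspicious_phrases[i]) != NULL) {
            return 1;
        }
    }
    return 0;
}
```

If the first line of your error output makes this function return 1, do a shoot and rerun the job before you start hunting for a bug in your own code.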

Uninitialized variables - Another potential source of errors is uninitialized variables. MPI_Init in the main part of your program appears to set uninitialized variables to zero; however, uninitialized variables in subroutines appear to be left the way the C compiler usually leaves them; that is, holding garbage. Beware of subroutines bearing garbage! A clue to this problem is a SIGFPE error message.
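
A minimal sketch of the defensive habit, assuming a hypothetical subroutine mean_of that averages an integer array: every local gets an explicit starting value, and the divisor is checked before use, since an integer divide by a zero or garbage value is a typical way to end up with SIGFPE.

```c
/* Hypothetical helper: integer mean of the first n entries of values. */
int mean_of(const int *values, int n)
{
    int sum = 0;   /* explicitly initialized; do not count on it being zeroed */
    int i;

    if (n <= 0) {
        return 0;  /* refuse to divide by zero or a nonsensical count */
    }
    for (i = 0; i < n; i++) {
        sum += values[i];
    }
    return sum / n;
}
```

If the "= 0" on sum were missing, the result would depend on whatever happened to be in that memory, and the same code could behave differently from run to run.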

A reminder of common signals and their meanings:

SIGABRT - Abnormal termination of the program (such as a call to abort)
SIGFPE  - An erroneous arithmetic operation, such as a divide-by-zero
          or an operation resulting in overflow
SIGILL  - Detection of an illegal instruction
SIGINT  - Receipt of an interactive attention signal
SIGSEGV - An invalid access to storage
SIGTERM - A termination request sent to the program
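
The same table can be kept handy in code. The sketch below assumes a hypothetical function describe_signal; the six signal names are the standard ones from signal.h, and the descriptions are shortened from the list above.

```c
#include <signal.h>

/* Hypothetical helper: maps a signal number to a short description
 * taken from the table above. */
const char *describe_signal(int sig)
{
    switch (sig) {
    case SIGABRT: return "abnormal termination of the program";
    case SIGFPE:  return "erroneous arithmetic operation";
    case SIGILL:  return "illegal instruction";
    case SIGINT:  return "interactive attention signal";
    case SIGSEGV: return "invalid access to storage";
    case SIGTERM: return "termination request";
    default:      return "unknown signal";
    }
}
```

For example, describe_signal(SIGFPE) returns "erroneous arithmetic operation". Signals outside the table, such as the SIGBUS mentioned earlier, fall through to "unknown signal" in this sketch.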



Lori Pollock
Wed Feb 4 14:18:58 EST 1998