Many messages - When an MPI job crashes, you typically get more than one line of error messages. The FIRST line is the most important and contains the clue to your actual problem. The remaining messages are usually the system's attempt to clean up the other processes that have been left hanging!
Infinite messages - Occasionally, you will get runaway error messages that appear to be in an infinite loop... You will need to log onto porsche from another window and issue a shoot command.
Intermittent messages - During initial testing, it was discovered that the DEC Alphas sometimes issue intermittent error messages on programs that are correct. The errors may have something to do with a network or hardware problem. Gurus are looking at the problem but... unfortunately you may still have to deal with this! Our suggestion... build your program slowly, adding just a few lines at a time. If you do get an error that does not have an obvious origin, run the code a couple of times to make sure the problem is in your program and not in the system. Don't forget to do a shoot command between runs to clean up leftover jobs. Error messages that point to a system problem usually contain phrases like:
net_send: could not write
unidentified err handler
bad file number
interrupt SIGBUS: 10
Uninitialized variables - Another potential source of errors is uninitialized variables. MPI_Init in the main part of your program appears to set uninitialized variables to zero; however, uninitialized variables in subroutines appear to be left the way a C compiler usually leaves them; that is, holding garbage. Beware of subroutines bearing garbage! A clue to this problem is a SIGFPE error message.
A reminder of common signals and their explanation:
SIGABRT - Abnormal termination of the program (such as a call to abort).
SIGFPE - An erroneous arithmetic operation, such as a divide-by-zero or an operation resulting in overflow.
SIGILL - Detection of an illegal instruction.
SIGINT - Receipt of an interactive attention signal.
SIGSEGV - An invalid access to storage.
SIGTERM - A termination request sent to the program.