This page collects real-life example bugs that we might like to do something useful with in TASS.
- MPICH
The parameters are NPROCS and count. For certain values, the recv count argument to an MPI_Sendrecv is negative. The program below is reduced to capture the essence of the bug. The original code is here: https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/coll/bcast.c
The function is MPIR_Bcast_scatter_doubling_allgather().
    #include <stdlib.h>
    #include <stdio.h>
    #include <assert.h>

    int NPROCS; /* Number of processes. 256 in the field failure */
    int RANK;   /* my rank. 0<=RANK<NPROCS. 254 in field failure */

    void f(int count, int root) {
      /* assume count >= 0 && 0<=root<NPROCS */
      int nbytes=0, i, dst_tree_root, relative_dst, comm_size, rank, type_size;
      int scatter_size, recv_offset, mask, relative_rank;

      comm_size = NPROCS;
      rank = RANK; /* anything in range 0..comm_size-1 */
      relative_rank = (rank >= root) ? rank - root : rank - root + comm_size;
      type_size = 4; /* 4 byte ints. Anything >0 */
      nbytes = type_size * count;
      scatter_size = (nbytes + comm_size - 1)/comm_size; /* ceiling division */
      mask = 0x1; /* same as 1 in an unsigned format */
      i = 0;
      while (mask < comm_size) {
        /* flip the i+1-th bit from the right. either add 2^i or subtract it */
        /* same as: ((x/2^i)%2==0 ? x+2^i : x-2^i) */
        relative_dst = relative_rank ^ mask;
        /* 0 out the i right-most bits */
        dst_tree_root = relative_dst >> i; /* int divide by 2^i */
        dst_tree_root <<= i; /* multiply by 2^i */
        recv_offset = dst_tree_root * scatter_size;
        if (relative_dst < comm_size) {
          int my_count = nbytes-recv_offset;
          if (my_count < 0) {
            /* one %d per argument; the original format string had lost them */
            printf("NPROCS=%d, RANK=%d, count=%d, i=%d, nbytes-recv_offset=%d\n",
                   NPROCS, RANK, count, i, my_count);
            fflush(stdout);
          }
        }
        mask <<= 1; /* mask=mask*2 */
        i++;
      }
    }

    int main(int argc, char* argv[]) {
      int count = 3251;
      int ROOT = 0;

      for (NPROCS = 1; NPROCS <=256; NPROCS++) {
        for (RANK=0; RANK<NPROCS; RANK++) {
          f(count,ROOT);
        }
      }
      return 0;
    }
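The comments in the reduced program restate each bitwise step as arithmetic (XOR with the mask as "add or subtract 2^i", and the shift pair as "divide then multiply by 2^i"). The following is a minimal sketch that checks those two equivalences by brute force for non-negative values. The function names `flip_bit_arith`, `clear_low_bits_arith`, and `bitwise_matches_arith` are hypothetical helpers introduced for illustration only; they are not part of MPICH or TASS.

```c
#include <assert.h>

/* Arithmetic form of x ^ (1 << i): add 2^i if bit i is clear, else subtract it.
   Hypothetical helper; assumes x >= 0 and 0 <= i < 30. */
int flip_bit_arith(int x, int i) {
    int p = 1 << i;                      /* 2^i */
    return ((x / p) % 2 == 0) ? x + p : x - p;
}

/* Arithmetic form of (x >> i) << i: integer-divide by 2^i, then multiply back.
   Hypothetical helper; assumes x >= 0 and 0 <= i < 30. */
int clear_low_bits_arith(int x, int i) {
    int p = 1 << i;                      /* 2^i */
    return (x / p) * p;
}

/* Returns 1 if the bitwise and arithmetic forms agree for every
   0 <= x < limit and 0 <= i < bits, and 0 at the first mismatch. */
int bitwise_matches_arith(int limit, int bits) {
    assert(limit >= 0 && bits >= 0 && bits < 30);
    for (int x = 0; x < limit; x++) {
        for (int i = 0; i < bits; i++) {
            if ((x ^ (1 << i)) != flip_bit_arith(x, i)) return 0;
            if (((x >> i) << i) != clear_low_bits_arith(x, i)) return 0;
        }
    }
    return 1;
}
```

A call such as `bitwise_matches_arith(1024, 10)` covers every rank and round that can occur for up to 1024 processes. The check only covers non-negative operands, which is all the reduced program uses.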
[6/1/10 2:37:43 PM] Tim Zirkel: so the problem is the nbytes-recv_offset?
[6/1/10 2:37:59 PM] Stephen Siegel: yeah, that is used as the argument to MPI_Sendrecv, as the recv count arg.
[6/1/10 2:40:12 PM] Stephen Siegel: the problem is all the bitwise operations
[6/1/10 2:40:40 PM] Stephen Siegel: i was trying to translate them to arithmetic ops
[6/1/10 2:43:52 PM] Stephen Siegel: So in the actual field failure, NPROCS was 256
[6/1/10 2:44:28 PM] Stephen Siegel: and the defect only manifests when 3201<=count<=3251 on rank 254
[6/1/10 2:44:41 PM] Stephen Siegel: Otherwise, the recv count is non-negative, so no failure
[6/1/10 2:47:22 PM] Stephen Siegel: Another interesting view of the defect: if you fix count at 3201, then you need at least 128 procs before the defect manifests
[6/1/10 2:47:57 PM] Stephen Siegel: I "know" that (pretty sure anyway) by writing a loop that goes over all possible values with count<=100000 or something like that.
[6/1/10 2:48:32 PM] Stephen Siegel: So if you fix any one of the parameters, it is a very specific range in the other parameter that you need to find to catch the bug.
[6/1/10 2:48:42 PM] Stephen Siegel: (Two parameters are NPROCS and count.)
[6/1/10 2:49:08 PM] Stephen Siegel: However, if you let both parameters be free, then you can catch the defect with NPROCS=6
[6/1/10 2:49:21 PM] Stephen Siegel: and count=something reasonably small, maybe 2
[6/1/10 2:49:39 PM] Stephen Siegel: But you need at least 6 procs, I think.
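The exhaustive loop mentioned in the chat can be sketched as below, with both parameters left free. This is not the loop that was actually run. It follows the reduced program under the assumptions root = 0 and type_size = 4, and the names `min_recv_count` and `smallest_failing_nprocs` are hypothetical.

```c
#include <assert.h>

/* Smallest value of nbytes - recv_offset over all ranks and rounds,
   following the reduced program. Assumes root = 0 and type_size = 4. */
int min_recv_count(int nprocs, int count) {
    assert(nprocs >= 1 && count >= 0);
    int nbytes = 4 * count;
    int scatter_size = (nbytes + nprocs - 1) / nprocs;  /* ceiling division */
    int min = nbytes;                                   /* value at recv_offset == 0 */
    for (int rank = 0; rank < nprocs; rank++) {
        int i = 0;
        for (int mask = 1; mask < nprocs; mask <<= 1, i++) {
            int relative_dst = rank ^ mask;             /* root = 0: relative_rank == rank */
            if (relative_dst >= nprocs) continue;       /* no partner in this round */
            int dst_tree_root = (relative_dst >> i) << i;
            int my_count = nbytes - dst_tree_root * scatter_size;
            if (my_count < min) min = my_count;
        }
    }
    return min;
}

/* Smallest nprocs in 1..max_nprocs for which some count in 0..max_count
   yields a negative recv count; the first such count is stored in *count_out.
   Returns -1 if no failing pair exists in the searched range. */
int smallest_failing_nprocs(int max_nprocs, int max_count, int *count_out) {
    for (int nprocs = 1; nprocs <= max_nprocs; nprocs++) {
        for (int count = 0; count <= max_count; count++) {
            if (min_recv_count(nprocs, count) < 0) {
                *count_out = count;
                return nprocs;
            }
        }
    }
    return -1;
}
```

Under these assumptions, `smallest_failing_nprocs(16, 100, &c)` returns 6 with `c` set to 1, which is consistent with the "at least 6 procs" estimate above. `min_recv_count(6, 2)` is also negative, and `min_recv_count(256, 3251)` is negative as in the field failure.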