This page collects real-life example bugs that we might like to do something useful with in TASS.
- MPICH
The parameters are NPROCS and count. For certain values, the recv count argument to an MPI_Sendrecv is negative. The code below has been reduced to capture the essence of the bug. The original code is here: https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/coll/bcast.c
The function is MPIR_Bcast_scatter_doubling_allgather().
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
int NPROCS; /* Number of processes. 256 in the field failure */
int RANK; /* my rank. 0<=RANK<NPROCS. 254 in field failure */
void f(int count, int root) {
    /* assume count >= 0 && 0<=root<NPROCS */
    int nbytes=0, i, dst_tree_root, relative_dst, comm_size, rank, type_size;
    int scatter_size, recv_offset, mask, relative_rank;
    comm_size = NPROCS;
    rank = RANK; /* anything in range 0..comm_size-1 */
    relative_rank = (rank >= root) ? rank - root : rank - root + comm_size;
    type_size = 4; /* 4 byte ints. Anything >0 */
    nbytes = type_size * count;
    scatter_size = (nbytes + comm_size - 1)/comm_size; /* ceiling division */
    mask = 0x1; /* same as 1 in an unsigned format */
    i = 0;
    while (mask < comm_size) {
        /* flip the i+1-th bit from the right. either add 2^i or subtract it */
        /* same as: ((x/2^i)%2==0 ? x+2^i : x-2^i) */
        relative_dst = relative_rank ^ mask;
        /* 0 out the i right-most bits */
        dst_tree_root = relative_dst >> i; /* int divide by 2^i */
        dst_tree_root <<= i; /* multiply by 2^i */
        recv_offset = dst_tree_root * scatter_size;
        if (relative_dst < comm_size) {
            int my_count = nbytes-recv_offset;
            if (my_count < 0) {
                /* one %d per argument: NPROCS, RANK, count, i, my_count */
                printf("NPROCS=%d, RANK=%d, count=%d, i=%d, nbytes-recv_offset=%d\n",
                       NPROCS, RANK, count, i, my_count);
                fflush(stdout);
            }
        }
        mask <<= 1; /* mask=mask*2 */
        i++;
    }
}
int main(int argc, char* argv[]) {
    int count = 3251;
    int ROOT = 0;
    for (NPROCS = 1; NPROCS <=256; NPROCS++) {
        for (RANK=0; RANK<NPROCS; RANK++) {
            f(count,ROOT);
        }
    }
    return 0;
}
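The comments in the loop above describe each bitwise step in arithmetic terms. The following is a minimal sketch of that correspondence, assuming non-negative operands (as relative_rank and relative_dst are here). The helper names pow2, flip_bit_arith, clear_low_bits_arith and arith_forms_agree are hypothetical and do not appear in MPICH.

```c
#include <assert.h>

/* 2^i as an int. Assumes 0 <= i < 30 so the result fits in an int. */
static int pow2(int i) {
    assert(i >= 0 && i < 30);
    int p = 1;
    while (i-- > 0) p *= 2;
    return p;
}

/* Arithmetic form of x ^ (1 << i): toggle the bit with weight 2^i.
   Assumes x >= 0. */
static int flip_bit_arith(int x, int i) {
    int p = pow2(i);
    return ((x / p) % 2 == 0) ? x + p : x - p;
}

/* Arithmetic form of (x >> i) << i: zero out the i right-most bits.
   Assumes x >= 0. */
static int clear_low_bits_arith(int x, int i) {
    int p = pow2(i);
    return (x / p) * p;
}

/* Returns 1 if the bitwise and arithmetic forms agree for every
   0 <= x < limit and every i with 2^i < limit, and 0 otherwise. */
static int arith_forms_agree(int limit) {
    for (int i = 0; pow2(i) < limit; i++) {
        for (int x = 0; x < limit; x++) {
            if ((x ^ (1 << i)) != flip_bit_arith(x, i)) return 0;
            if (((x >> i) << i) != clear_low_bits_arith(x, i)) return 0;
        }
    }
    return 1;
}
```

Calling arith_forms_agree(256) covers every relative rank and mask that the loop can produce for NPROCS up to 256. The equivalence is only claimed for non-negative x, because integer division and right shift can differ on negative values.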
[6/1/10 2:37:43 PM] Tim Zirkel: so the problem is the nbytes-recv_offset?
[6/1/10 2:37:59 PM] Stephen Siegel: yeah, that is used as the argument to MPI_Sendrecv, as the recv count arg.
[6/1/10 2:40:12 PM] Stephen Siegel: the problem is all the bitwise operations
[6/1/10 2:40:40 PM] Stephen Siegel: i was trying to translate them to arithmetic ops
[6/1/10 2:43:52 PM] Stephen Siegel: So in the actual field failure, NPROCS was 256
[6/1/10 2:44:28 PM] Stephen Siegel: and the defect only manifests when 3201<=count<=3251 on rank 254
[6/1/10 2:44:41 PM] Stephen Siegel: Otherwise, the recv count is non-negative, so no failure
[6/1/10 2:47:22 PM] Stephen Siegel: Another interesting view of the defect: if you fix count at 3201, then you need at least 128 procs before the defect manifests
[6/1/10 2:47:57 PM] Stephen Siegel: I "know" that (pretty sure anyway) by writing a loop that goes over all possible values with count<=100000 or something like that.
[6/1/10 2:48:32 PM] Stephen Siegel: So if you fix any one of the parameters, it is a very specific range in the other parameter that you need to find to catch the bug.
[6/1/10 2:48:42 PM] Stephen Siegel: (Two parameters are NPROCS and count.)
[6/1/10 2:49:08 PM] Stephen Siegel: However, if you let both parameters be free, then you can catch the defect with NPROCS=6
[6/1/10 2:49:21 PM] Stephen Siegel: and count=something reasonably small, maybe 2
[6/1/10 2:49:39 PM] Stephen Siegel: But you need at least 6 procs, I think.
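The exhaustive search mentioned in the chat could look something like the sketch below. It is not the loop that was actually run. It assumes root = 0 and 4-byte elements, as in the reduced code above; the function names are made up for illustration, and the search bounds are whatever the caller passes in.

```c
#include <assert.h>
#include <limits.h>

/* Smallest value of nbytes - recv_offset over all loop iterations for one
   (nprocs, rank, count). Assumes root = 0 (so relative_rank == rank) and
   type_size = 4. Returns INT_MAX if the loop body never runs (nprocs == 1). */
static int min_recv_count(int nprocs, int rank, int count) {
    assert(nprocs >= 1 && rank >= 0 && rank < nprocs && count >= 0);
    int nbytes = 4 * count;
    int scatter_size = (nbytes + nprocs - 1) / nprocs; /* ceiling division */
    int min = INT_MAX;
    int mask = 1, i = 0;
    while (mask < nprocs) {
        int relative_dst = rank ^ mask;
        int dst_tree_root = (relative_dst >> i) << i;
        if (relative_dst < nprocs) {
            int my_count = nbytes - dst_tree_root * scatter_size;
            if (my_count < min) min = my_count;
        }
        mask <<= 1;
        i++;
    }
    return min;
}

/* Returns 1 if some rank sees a negative recv count, 0 otherwise. */
static int has_negative(int nprocs, int count) {
    for (int rank = 0; rank < nprocs; rank++)
        if (min_recv_count(nprocs, rank, count) < 0) return 1;
    return 0;
}

/* Smallest nprocs in 1..max_nprocs for which some count in 1..max_count
   yields a negative recv count; 0 if there is none in that range. */
static int smallest_failing_nprocs(int max_nprocs, int max_count) {
    for (int nprocs = 1; nprocs <= max_nprocs; nprocs++)
        for (int count = 1; count <= max_count; count++)
            if (has_negative(nprocs, count)) return nprocs;
    return 0;
}
```

Under these assumptions, min_recv_count(256, 254, 3251) evaluates to -1, which matches the field failure, and smallest_failing_nprocs(16, 100) evaluates to 6.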


