This page collects real-life example bugs that we might want to do something useful with in TASS.

  • MPICH

The parameters are NPROCS and count. For certain values, the recv count argument to an MPI_Sendrecv is negative. The code below is reduced to capture the essence of the bug. The original code is here: https://svn.mcs.anl.gov/repos/mpi/mpich2/trunk/src/mpi/coll/bcast.c

The function is MPIR_Bcast_scatter_doubling_allgather().

 #include<stdlib.h>
 #include<stdio.h>
 #include<assert.h>

 int NPROCS; /* Number of processes.  256 in the field failure */
 int RANK;   /* my rank.   0<=RANK<NPROCS.  254 in field failure */

 void f(int count, int root) {
  /* assume count >= 0 && 0<=root<NPROCS */

  int nbytes=0, i, dst_tree_root, relative_dst, comm_size, rank, type_size;
  int scatter_size, recv_offset, mask, relative_rank;

  comm_size = NPROCS;
  rank = RANK; /* anything in range 0..comm_size-1 */
  relative_rank = (rank >= root) ? rank - root : rank - root + comm_size;
  type_size = 4; /* 4 byte ints.  Anything >0 */
  nbytes = type_size * count;
  scatter_size = (nbytes + comm_size - 1)/comm_size; /* ceiling division */
  mask = 0x1; /* same as 1 in an unsigned format */
  i = 0;
  while (mask < comm_size) {
    /* flip the i+1-th bit from the right. either add 2^i or subtract it */
    /* same as: ((x/2^i)%2==0 ? x+2^i : x-2^i) */
    relative_dst = relative_rank ^ mask;
    /* 0 out the i right-most bits */
    dst_tree_root = relative_dst >> i;  /* int divide by 2^i */
    dst_tree_root <<= i;                /* multiply by 2^i */
    recv_offset = dst_tree_root * scatter_size;
    if (relative_dst < comm_size) {
      int my_count = nbytes-recv_offset;

      if (my_count < 0) {
	printf("NPROCS=%d, RANK=%d, count=%d, i=%d, nbytes-recv_offset=%d\n",
	       NPROCS, RANK, count, i, my_count);
	fflush(stdout);
      }
    }
    mask <<= 1; /* mask=mask*2 */
    i++;
  }
 }

 int main(int argc, char* argv[]) {
  int count = 3251;
  int ROOT = 0;

  for (NPROCS = 1; NPROCS <=256; NPROCS++) {
    for (RANK=0; RANK<NPROCS; RANK++) {
      f(count,ROOT);
    }
  }
  return 0;
 }

[6/1/10 2:37:43 PM] Tim Zirkel: so the problem is the nbytes-recv_offset?

[6/1/10 2:37:59 PM] Stephen Siegel: yeah, that is used as the argument to MPI_Sendrecv, as the recv count arg.

[6/1/10 2:40:12 PM] Stephen Siegel: the problem is all the bitwise operations

[6/1/10 2:40:40 PM] Stephen Siegel: i was trying to translate them to arithmetic ops

[6/1/10 2:43:52 PM] Stephen Siegel: So in the actual field failure, NPROCS was 256

[6/1/10 2:44:28 PM] Stephen Siegel: and the defect only manifests when 3201<=count<=3251 on rank 254

[6/1/10 2:44:41 PM] Stephen Siegel: Otherwise, the recv count is non-negative, so no failure

[6/1/10 2:47:22 PM] Stephen Siegel: Another interesting view of the defect: if you fix count at 3201, then you need at least 128 procs before the defect manifests

[6/1/10 2:47:57 PM] Stephen Siegel: I "know" that (pretty sure anyway) by writing a loop that goes over all possible values with count<=100000 or something like that.

[6/1/10 2:48:32 PM] Stephen Siegel: So if you fix any one of the parameters, it is a very specific range in the other parameter that you need to find to catch the bug.

[6/1/10 2:48:42 PM] Stephen Siegel: (Two parameters are NPROCS and count.)

[6/1/10 2:49:08 PM] Stephen Siegel: However, if you let both parameters be free, then you can catch the defect with NPROCS=6

[6/1/10 2:49:21 PM] Stephen Siegel: and count=something reasonably small, maybe 2

[6/1/10 2:49:39 PM] Stephen Siegel: But you need at least 6 procs, I think.