Organization

  • MADRE stands for Memory-Aware Data REdistribution
  • Participants: Stephen Siegel, Andrew Siegel
  • Wiki: https://www.eecis.udel.edu/wiki/vsl/index.php/MADRE
  • SVN code repo: svn://vsl.cis.udel.edu/madre
  • SVN paper repo: svn://vsl.cis.udel.edu/vsl/papers/madre
  • Andrew's username: asiegel. Password: nameofsecondcat (Dennis made up the name)

I have converted the old CVS repository to SVN, renaming it from "padre" to "madre". I checked it out, built, and ran all tests on my Mac, with no warnings or errors (to my great surprise).

Literature Summary

Andrew is going to fill this in.

  1. Ali Pinar and Bruce Hendrickson, Interprocessor Communication with Limited Memory, IEEE Trans. Par. Dist. Systems 15(7), 2004, pp. 606-616.
  2. Robert D. Falgout, Jim E. Jones, and Ulrike Meier Yang, Pursuing Scalability for hypre's Conceptual Interfaces, ACM Trans. Math. Soft. 31(3), 2005, pp. 326-350.
  3. A.H. Baker, R.D. Falgout, and U.M. Yang, An Assumed Partition Algorithm for Determining Processor Inter-Communication, Parallel Computing 32, 2006, pp. 394-414.

Code issues

Code needs to be cleaned up in a few ways:

  • PADRE needs to be changed to MADRE throughout
  • Makefiles need to be fixed. The dependencies aren't right.
  • Make sure there are no mallocs after creation.
  • Run some analysis tools (lint, -Wall, ...) to find common problems

Algorithm Summary

Algorithm features

contiguity
Some algorithms require that the data reside in one contiguous block of memory. Others make no such restriction: the blocks could be anywhere. Currently, however, the interface assumes contiguity, so we are not able to take advantage of this feature of the more general algorithms. Nevertheless, I will point out which algorithms require contiguity and which do not.
symmetry
A "root" algorithm requires that one process play a special role, for example, by collecting and collating information about the map from all the other processes and then disseminating this information back to the other processes. In a symmetric algorithm each process plays a symmetric role (they are indistinguishable, except perhaps for trivial things such as printing output). In a root algorithm, the root process may have to use a lot more memory than the other processes. If this is a significant amount of memory, then the root process may not have room for all the data plus the extra memory required by the algorithm. Hence a root algorithm may be most useful for an application in which one process can be reserved for this sort of thing and not be used for storing data.
MPI resource requirements
Some algorithms can have a large number of outstanding MPI communication requests. An MPI implementation can only handle a finite number of these and will break down at some point. I will try to estimate this number for each algorithm.
memory requirements
Some algorithms require a lot of extra memory. I will try to estimate this too.
deadlock
Some algorithms can deadlock for certain ranges of the parameters. This does not make them useless: they may be fast when they do not deadlock, and the circumstances that cause deadlock may be rare. I will try to figure this out for each algorithm.

The algorithms

unitBred
this is probably the simplest and most naive redistributor. All it does is post a bunch of wildcard receives and a bunch of sends; it does not try to create a schedule beforehand. Because of this, it can deadlock. More precisely: you iterate until done. At each iteration, you post as many receives as you have dead spaces (without exceeding the bound on the number of outstanding receive requests), then post as many sends as you can (without exceeding the bound on the number of outstanding send requests), wait for them all to complete, then shuffle everything around and repeat. (A sketch of this loop appears after this list.) Status: passes tests.
cycleShiftBred
uses a root process to calculate the cycle-shift decomposition of the global map. This is pretty much the algorithm that was used to save FLASH. The messages consist of one block each. The amount of memory allocated on the root is O(globalNumBlocks); on a non-root it is O(numBlocks). But this is all for meta-data: only one block has to be allocated for the real data. The initial meta-data computation uses a few collectives. The data communication is just vanilla (standard mode) MPI_Send, MPI_Recv, and MPI_Sendrecv. Status: passes tests.
contigBred
This one was written by Andrew but is not clearly documented. I'm pretty sure it is an implementation of the algorithm from the one paper we found that addresses this problem. It works in phases and uses a heuristic to decide whom to receive from in each phase; it is now using a "round robin" heuristic. Status: for some reason it does not run the tests.
cyclicSchedulerBred(+cyclicScheduler)
I just re-wrote this one. It is now working very well. Some extra synchronization had to be added to get it to work on BGL without the "unexpected message" error. The synchronization guarantees that all sends are "ready": they are not posted unless one is sure the receive has already been posted. The schedule creation algorithm uses a root process to create a "schedule" for each proc to follow. The algorithm used by the root basically proceeds as follows: the root performs a depth-first search (DFS) on the (weighted, directed) graph in which the nodes are the procs and there is an edge from p to q if p wishes to send data to q. The weight of the edge p->q is the number of blocks p wishes to send to q. The algorithm keeps track of the amount of free space (i.e., the number of dead blocks) on each proc. If there is any free space on q, then the maximum possible transfer from p to q is scheduled immediately, i.e., the minimum of the free space on q and the weight of the edge p->q. This edge is then popped from the DFS stack. If there is no free space on q, the DFS search continues. Hence the stack consists of nodes with no free space (except possibly the first node). Eventually, a cycle must be reached, i.e., a node will be reached that is already on the stack. (This must happen if the redistribution is feasible.) When such a cycle is reached, a cyclic redistribution is scheduled using one MPI_Sendrecv_replace on each proc involved in the cycle. The quantity of data transferred is the minimum of the weights of the edges involved in the cycle. This results in the elimination of at least one edge. The stack is then popped back to just before the cycle and the search proceeds. The root returns to each process the part of the global schedule that concerns that process, as a sequence of "actions". Each action is either (1) an instruction to send a certain number of blocks to another process, (2) an instruction to receive a certain number of blocks from another process, or (3) an instruction to send to one process while receiving from another using the same buffer. The cyclicSchedulerBred then executes the schedule. (A sketch of the root's algorithm appears after this list.) Status: passing tests and performing very well in experiments, even when 0 free space is available.
phaseBred(+phaseScheduler+cyclicScheduler)
this basically works like contigBred: it schedules a series of phases, using some heuristic to decide whom to receive from in each phase. However, after creating each phase of the schedule, it checks whether a deadlock has arisen; if so, it schedules a cyclic transformation to break the deadlock. The whole schedule is computed first, before the data redistribution takes place. The regular phases are executed using a single MPI_Alltoall; the cycle-breaking phases are executed using Irecv/Isend/Sendrecv. The cycle breaking just sends one block around the cycle, so it can be slow. Hopefully that doesn't happen too often? Status: passing tests.
unitCyclicBred(+cyclicScheduler)
uses the cyclicScheduler, but just sends messages of size one block. So pretty simple. Uses Isend/Irecv as much as possible. unitCyclicReadyBred is a small variation using "ready mode" sends (Irsend) to make it work on BGL. Status: both unitCyclicBred and unitCyclicReadyBred passing tests.
parkBred
this is the Pinar-Hendrickson algorithm with parking. Each phase includes parking blocks as well as direct-transmission blocks. Each phase is executed with a single MPI_Alltoallv. Scratch that: the Alltoallv was very slow, so I replaced it with nonblocking point-to-points, but have not yet re-run all experiments. Will do soon. This one really slows down when there is not a lot of free space.
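
To make the unitBred loop concrete, here is a minimal sketch of one version of it in C/MPI. This is not the actual MADRE source: blockSize, numDeadBlocks, numIncomingRemaining, deadBlock, nextSend, updateBlockTable, and allDone are hypothetical stand-ins for the local block-table bookkeeping, and error handling is omitted.

  #include <mpi.h>
  #include <stdlib.h>

  extern int blockSize;                         /* doubles per block */
  extern int numDeadBlocks(void);               /* free slots in the block table */
  extern int numIncomingRemaining(void);        /* blocks still expected to arrive */
  extern double *deadBlock(int i);              /* pointer to the i-th free slot */
  extern int nextSend(double **buf, int *dest); /* next outgoing block; 0 if none */
  extern int allDone(MPI_Comm comm);            /* collective global-termination test */
  extern void updateBlockTable(int numRecvd);   /* record arrivals and departures */

  void unitBred(int recvLimit, int sendLimit, MPI_Comm comm) {
      while (!allDone(comm)) {
          /* post one wildcard receive per dead space, up to the bounds */
          int nr = numDeadBlocks();
          if (nr > numIncomingRemaining()) nr = numIncomingRemaining();
          if (nr > recvLimit) nr = recvLimit;
          MPI_Request *reqs = malloc((nr + sendLimit) * sizeof(MPI_Request));
          int n = 0;
          for (int i = 0; i < nr; i++)
              MPI_Irecv(deadBlock(i), blockSize, MPI_DOUBLE, MPI_ANY_SOURCE,
                        0, comm, &reqs[n++]);
          /* post as many sends as the bound allows */
          double *buf; int dest;
          for (int i = 0; i < sendLimit && nextSend(&buf, &dest); i++)
              MPI_Isend(buf, blockSize, MPI_DOUBLE, dest, 0, comm, &reqs[n++]);
          /* wait for them all, shuffle the block table, and repeat.
             A posted receive whose matching send is never posted is
             exactly where the deadlock mentioned above comes from. */
          MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);
          updateBlockTable(nr);
          free(reqs);
      }
  }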
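
Similarly, here is a sketch of the schedule construction done by the root in the cyclicScheduler. For clarity the transmission graph is stored as a dense n x n weight matrix w (w[p][q] = number of blocks p must send to q) and freeBlocks[p] is the number of dead blocks on p; the real code cannot afford O(n^2) storage on the root (see the Goals section), and emitTransfer/emitCycle are hypothetical stand-ins for appending actions to the per-proc schedules.

  /* Root-side schedule construction: DFS on the transmission graph,
     scheduling direct transfers into free space and cyclic shifts
     when a cycle is found.  C99 (variable-length array parameters). */
  static int min(int a, int b) { return a < b ? a : b; }

  extern void emitTransfer(int src, int dest, int numBlocks);
  extern void emitCycle(const int *procs, int len, int numBlocks);

  /* some q with w[p][q] > 0, or -1 if p has no outgoing edges */
  static int anyEdge(int n, int w[n][n], int p) {
      for (int q = 0; q < n; q++)
          if (w[p][q] > 0) return q;
      return -1;
  }

  void buildSchedule(int n, int w[n][n], int *freeBlocks) {
      int stack[n], pos[n];   /* pos[p] = p's position on the stack, or -1 */
      for (int p = 0; p < n; p++) pos[p] = -1;
      for (int start = 0; start < n; start++) {
          int top = 0;
          stack[0] = start; pos[start] = 0;
          while (top >= 0) {
              int p = stack[top];
              int q = anyEdge(n, w, p);
              if (q < 0) { pos[p] = -1; top--; continue; }  /* p is done */
              if (freeBlocks[q] > 0) {
                  /* direct transfer: the minimum of the free space on q
                     and the weight of the edge p->q */
                  int amt = min(freeBlocks[q], w[p][q]);
                  emitTransfer(p, q, amt);
                  w[p][q] -= amt;
                  freeBlocks[q] -= amt;
                  freeBlocks[p] += amt;  /* departing blocks free space on p */
              } else if (pos[q] >= 0) {
                  /* q is already on the stack: a cycle q -> ... -> p -> q.
                     Ship the minimum cycle-edge weight around the cycle
                     (one MPI_Sendrecv_replace per proc involved). */
                  int amt = w[p][q];
                  for (int i = pos[q]; i < top; i++)
                      amt = min(amt, w[stack[i]][stack[i+1]]);
                  emitCycle(&stack[pos[q]], top - pos[q] + 1, amt);
                  for (int i = pos[q]; i < top; i++)
                      w[stack[i]][stack[i+1]] -= amt;   /* kills >= 1 edge */
                  w[p][q] -= amt;
                  /* pop back to the start of the cycle and resume */
                  while (top > pos[q]) { pos[stack[top]] = -1; top--; }
              } else {
                  stack[++top] = q;    /* no free space on q: search deeper */
                  pos[q] = top;
              }
          }
      }
  }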

The Tests

In the following n=numProcs, r=rank.

  1. numBlocks=2. blockSize=1. elementType: double. Map on proc r: 0->((r+1)%n,0). 1:dead. This is just one big cycle.
  2. numBlocks=3. blockSize=3. elementType: int. Map on proc r: 0->(r,1), 1->(r,2), 2->((r+1)%n,0). Think of globally ordering all the blocks in dictionary order and cyclically shifting them all one to the right. Globally, it is one big cycle.
  3. numBlocks=3. blockSize=3. elementType: int. Map on proc r: i->((r+1)%n,i). So the index stays the same, but the block is moved to the next proc (cyclically). This creates numBlocks disjoint cycles.
  4. numBlocks=3. blockSize=2. elementType: (double,char). For r<n-1, map is i->(r+1,i). For r=n-1, all cells dead. Hence map consists of numBlocks disjoint shifts.
  5. numBlocks on proc 0: 3n. numBlocks on all other procs: 3. blockSize=1. elementType: int. The map just gathers all data onto proc 0. A bunch of shifts of length 2.
  6. numBlocks on proc 0: 3n. numBlocks on all other procs: 3. blockSize=1. elementType: int. The map just scatters all data on proc 0 to the other procs. A bunch of shifts of length 2.
  7. numBlocks on proc 0: 3n. numBlocks on all other procs: 3. blockSize=1. elementType: int. The map combines the previous two maps: it scatters everything from proc 0 and replaces everything on proc 0 by gathering up everything from the other procs. So a bunch of cycles of length 2.
  8. numBlocks on proc r: r. blockSize=1. elementType: int. Map: there is no inter-process communication. So each process just performs a local permutation.
  9. numBlocks=10. blockSize=1. elementType: int. Additional parameters: dx=3, dy=4. Map: (r,i)->((r+dx)%n,(i+dy)%numBlocks). (This map is written out as code after the list.)
  10. numBlocks=70. blockSize=1. elementType: int. Same as above but with dx=3, dy=15.
  11. numBlocks on proc r: r+1. blockSize=1. elementType: int. Initially the data on the processes form the shape of an isosceles triangle; the redistribution flips the triangle over its hypotenuse.
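
For concreteness, the parameterized shift map of tests 9 and 10 can be written as a function. This is a minimal rendering with hypothetical names; MADRE's actual map representation may differ.

  typedef struct { int proc; int index; } Location;

  /* tests 9 and 10: block i on proc r moves to proc (r+dx)%n,
     slot (i+dy)%numBlocks */
  Location shiftMap(int r, int i, int n, int numBlocks, int dx, int dy) {
      Location dest;
      dest.proc  = (r + dx) % n;
      dest.index = (i + dy) % numBlocks;
      return dest;
  }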

Basic BG/L usage

cqstat
see the job queue. Option -f shows more info.
cqsub -t <time> -n <nodecount> -c <#processors> -m <mode> <exe> [arg1,arg2,...]
queue the job. Option -q short puts the job in the reserved "short" queue; documentation says this is limited to 64 nodes, but that does not seem to be true; it does seem, however, that wall time must be no more than 30 seconds. Default mode is co, meaning one MPI process per node, so one MPI process gets two cores. (An example invocation appears after this list.)
cqdel [-f] jobid
delete a queued job. Option -f forces.
partlist
get a list of all partitions and their status
qmove
qalter
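
For example, a hypothetical submission requesting 64 nodes for 30 time units (the -t units are minutes, if I recall correctly) in the default co mode, running some executable myprog:

  cqsub -t 30 -n 64 -m co myprog arg1 arg2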

Resources:

Experiment 1

Structure of Experiment 1

n = number of MPI processes (= number of BGL "nodes")
m = number of live blocks per process
k = number of dead blocks per process (unused spaces)
r = number of times to repeat

Each block: 16K bytes of doubles. Map: (i,j) |-> ((m*i+j)%n, (m*i+j)/n), where i is the proc, j is the block index, and m, n are as above. The rationale for this map is just that it "mixes things up a lot." Not sure if that is good or not.
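
Written out as code (hypothetical names):

  /* block j on proc i has global position g = m*i + j; the globally
     ordered blocks are dealt out round-robin across the n procs */
  void exp1Map(int i, int j, int m, int n, int *destProc, int *destIndex) {
      int g = m * i + j;
      *destProc  = g % n;
      *destIndex = g / n;
  }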

Command-line arguments to exp1:

  1. bredName, e.g., unitBred
  2. numBlocks: number of blocks on each process
  3. numLiveBlocks: number of those blocks containing live data; others are "free"
  4. blockSize: size of each block, in bytes. Must be a multiple of sizeof(double)
  5. repeat: number of times to repeat experiment

Any additional arguments are specific to the given bred (an example invocation follows this list):

  • unitBred
    1. outstandingSendLimit
    2. outstandingReceiveLimit
    3. outstandingRequestLimit
  • cyclicSchedulerBred
    1. edgeBufferSize
    2. actionBufferSize
  • unitCyclicBred
    1. requestLimit
    2. edgeBufferSize
    3. actionBufferSize
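
Putting this together, a hypothetical invocation of exp1 under cqsub, using the experiment parameters and unitBred limits from the results below, would look like:

  cqsub -t 30 -n 64 -m co exp1 unitBred 25000 20000 16000 1 100 100 200

where the trailing 100 100 200 are the outstandingSendLimit, outstandingReceiveLimit, and outstandingRequestLimit.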

BGL Results of Experiment 1:

numBlocks=25000, numLiveBlocks=20000, blockSize=16000, repeat=1. (In the results below, the leading number on each line is the number of procs, unless noted otherwise, and times are in seconds.)

  • unitBred
    • outstandingSendLimit=outstandingReceiveLimit=100, outstandingRequestLimit=200
    • 2: job=151030, time=2.64
    • 4: job=151031, time=3.69
    • 8: job=143688, time=3.87
    • 16: job=143687, time=4.02
    • 32: job=143689, time=4.33
    • 64: job=143690, time=6.17
    • 128: job=143962, time=8.44
    • 256: job=143693, time=7.30
    • 512: job=150930, time=5.97
    • 1024: job=150931, time=9.64
  • phaseBred
    • no parameters?
    • observations
      • we have inserted printf's to report how much time is used to prepare the schedule and how much to execute each phase. The time to prepare the schedule is insignificant.
      • None of these runs required any cycle executions to break a deadlock. Perhaps we need to try some tighter requirements (fewer free blocks) to exercise that possibility.
    • 8: job=143694, time=9.02, phases=6
    • 16: job=143695, time=11.57, phases=7
    • 32: job=143696, time=11.59, phases=7
    • 64: job=143697, time=13.99, phases=7
    • 128: job=143698, time=14.37, phases=7
    • 256: job=143699, time=20.24, phases=7
  • parkBred
    • 2: job=151382, time=3.41, mem=716484
    • 4: job=151383, time=7.50, mem=716612
    • 8: job=151384, time=9.67, mem=716868
    • 16: job=151385, time=11.53, mem=717380
    • 32: job=151379, time=11.66, mem=718404
    • 64: job=151386, time=14.02, mem=720452
    • 128: job=151387, time=15.21, mem=724548
    • 256: job=151388, time=21.18, mem=732740
    • 512: job=151541, time=14.62, mem=749124
    • 1024: job=151542, time=28.32, mem=781892
  • cyclicSchedulerBred
    • 2: job=151035, time=1.79
    • 4: job=151036, time=4.97
    • 8: job=151037, time=7.30
    • 16: job=150901, time=9.85
    • 32: job=150902, time=13.80
    • 64: job=150903, time=108.92
    • 128: job=150920, time=1085.73
  • phBred
    • 2:
    • 4:
    • 8:
    • 16:
    • 32: job=154181, time=11.81
    • 64: job=154182, time=14.13
    • 128: job=154183, time=14.75
    • 256:
    • 512:
    • 1024:

numBlocks=25000, numLiveBlocks=24000, blockSize=16000, repeat=1

  • unitBred
    • 64: job=143703: "Rzv: cannot allocate unexpected buffer"
  • phaseBred
    • 64: job=143700, time=51.35, phases=42, no cycles

Experiment 2

This was where we ran into an MPI bug on BGL and did some experiments to nail down the bug.

Experiment 3

In this experiment, all blocks on proc i are sent to proc (i+1)%n in a big cycle.

numBlocks=25000, numLiveBlocks=25000, blockSize=16000, repeat=1

  • unitBred
    • 2: job=151054, time=4.81
    • 4: job=151055, time=5.78
    • 8: job=151056, time=5.04
    • 16: job=151053, time=5.07
    • 32: job=151057, time=5.90
    • 64: job=151058, time=6.14
    • 128: job=151059, time=5.93
    • 256: job=151060, time=6.16
    • 512: job=151105, time=5.94
    • 1024: job=151106, time=6.01
  • cyclicSchedulerBred
    • 2: job=151013, time=4.52
    • 4: job=151014, time=4.89
    • 8: job=151015, time=4.95
    • 16: job=150702, time=4.97
    • 32: job=150704, time=5.23
    • 64: job=150705, time=5.19
    • 128: job=150706, time=5.32
    • 256: job=150718, time=5.29
    • 512: job=150720, time=4.75
    • 1024: job=150854, time=4.77
  • parkBred
    • 16: job=149381, time: >30 minutes (killed by the scheduler)

numBlocks=25000, numLiveBlocks=20000, blockSize=16000, repeat=1

  • unitBred
    • 2: job=151046, time=4.82, mem=717920
    • 4: job=151047, time=5.33, mem=717952
    • 8: job=151048, time=5.27, mem=718016
    • 16: job=151043, time=5.35, mem=718144
    • 32: job=151049, time=5.29, mem=718400
    • 64: job=151050, time=5.36, mem=718912
    • 128: job=151051, time=5.40, mem=719936
    • 256: job=151052, time=5.33, mem=721984
    • 512: job=151107, time=5.30, mem=726080
    • 1024: job=151108, time=5.29, mem=734272
  • cyclicSchedulerBred
    • 2: job=151010, time=2.63, mem=1316748
    • 4: job=151011, time=2.63, mem=1317028
    • 8: job=151012, time=2.63, mem=1317588
    • 16: job=150694, time=2.63, mem=1318708
    • 32: job=150697, time=2.63, mem=1320948
    • 64: job=150699, time=2.73, mem=1325428
    • 128: job=150701, time=2.81, mem=1334388
    • 256: job=150715, time=2.93, mem=1352308
    • 512: job=150716, time=2.60, mem=1388148
    • 1024: job=150855, time=2.63, mem=1459828
  • parkBred
    • 2: job=151401, time=4.49, mem=716484
    • 4: job=151402, time=4.47, mem=716612
    • 8: job=151403, time=4.49, mem=716868
    • 16: job=151404, time=4.50, mem=717380
    • 32: job=151405, time=4.49, mem=718404
    • 64: job=151406, time=4.60, mem=720452
    • 128: job=151407, time=4.71, mem=724548
    • 256: job=151408, time=4.95, mem=732740
    • 512: job=151544, time=4.52, mem=749124
    • 1024: job=151545, time=4.59, mem=781892

In this sequence I fix numProcs at 128 and scale the number of live blocks; the leading number on each line is now numLiveBlocks. This shows how parkBred blows up as free space decreases, while cyclicSchedulerBred does not.

  • unitBred
    • 12000: job=151064, time=3.22
    • 13000: job=151065, time=3.52
    • 14000: job=151066, time=3.76
    • 15000: job=151067, time=4.06
    • 16000: job=151068, time=4.30
    • 17000: job=151069, time=4.55
    • 18000: job=151070, time=4.83
    • 19000: job=151071, time=5.12
    • 20000: job=151051, time=5.40
    • 21000: job=151074, time=5.64
    • 22000: job=151075, time=5.87
    • 23000: job=151076, time=6.07
    • 24000: job=151077, time=6.18
    • 25000: job=151059, time=5.93
  • parkBred
    • 12000: job=151412, time=1.75, mem=724548
    • 13000: job=151413, time=2.24, mem=724548
    • 14000: job=151414, time=2.42, mem=724548
    • 15000: job=151415, time=2.64, mem=etc.
    • 16000: job=151416, time=2.80
    • 17000: job=151417, time=3.43
    • 18000: job=151418, time=3.70
    • 19000: job=151419, time=4.42
    • 20000: job=151420, time=4.72
    • 21000: job=151421, time=6.03
    • 22000: job=151422, time=7.44
    • 23000: job=151423, time=10.21
    • 24000: job=151424, time=18.20
    • 24100: job=151426, time=20.12
    • 24200: job=151427, time=22.91
    • 24300: job=151428, time=25.39
    • 24400: job=151429, time=29.58
    • 24500: job=151430, time=34.10
    • 24600: job=151431, time=43.17
    • 24700: job=151432, time=55.33
    • 24800: job=151433, time=85.04
    • 24900: job=151434, time=163.93
    • 24950: job=151439, time=310.30
  • cyclicSchedulerBred
    • 12000: job=150834, time=1.72
    • 13000: job=150835, time=1.85
    • 14000: job=150836, time=1.98
    • 15000: job=150837, time=2.13
    • 16000: job=150838, time=2.27
    • 17000: job=150839, time=2.40
    • 18000: job=150840, time=2.54
    • 19000: job=150841, time=2.68
    • 20000: job=150842, time=2.81
    • 21000: job=150843, time=2.94
    • 22000: job=150844, time=3.06
    • 23000: job=150845, time=3.18
    • 24000: job=150846, time=3.32
    • 25000: job=150706, time=5.32
  • phBred
    • 24800: job=154193, time=81.01

Experiment 4

Idea: have one proc that has all free blocks. Each remaining proc divides its data into nprocs-2 approximately equally sized chunks and sends one chunk to each of the other procs. The parking algorithm should take advantage of the free space on this one proc; the cyclicScheduler will ignore it entirely.

numBlocks=25000, blockSize=16000. Scale numProcs.

  • unitBred
    • 16: job=151062, "Rzv:cannot allocate unexpected buffer from R:9 T:15084 C:124"
  • parkBred
    • 4: job=151391, time=10.69, mem=716612
    • 8: job=151392, time=18.22, mem=716868
    • 16: job=151393, time=32.78, mem=717380
    • 32: job=151394, time=47.89, mem=718404
    • 64: job=151395, time=86.05, mem=720452
    • 128: job=151396, time=157.56, mem=724548
    • 256: job=151397, time=304.50, mem=732740
    • 512: job=151547, time=550.38, mem=749124
    • 1024: job=151548, time=1129.67, mem=781892
  • cyclicSchedulerBred
    • 4: job=151024, time=7.64
    • 8: job=151025, time=10.86
    • 16: job=150893, time=14.47
    • 32: job=150894, time=19.76
    • 64: job=150895, time=29.74
    • 128: job=150898, time=48.72
    • 256: job=150923, time=86.78
    • 512: job=150924, time=179.24
    • 1024: job=150936, time=367.71

Goals

Here is what I suggest we do for a first paper:

  • Describe the data redistribution problem and the requirements for a solution:
    • solution must be an algorithm, i.e., must complete in finite time (cannot deadlock, livelock, etc.)
    • memory requirements: the memory allocated for carrying out the redistribution should be O(n+b), where n is the number of procs and b is the maximum number of blocks on a single proc. Hopefully the constant hidden in the O is relatively small. Note that this eliminates any algorithm that requires gathering the entire global map onto one proc (O(nb)). It also eliminates any algorithm that requires gathering the entire "transmission graph" onto one proc (O(n^2) or O(nb)). The transmission graph is the weighted directed graph in which the nodes are the procs and there is an edge from p to q if p has at least one block to send to q; the weight of that edge is the number of blocks p has to send to q. If the out-degree of each node is bounded (i.e., independent of n and b), then I suppose you could gather the whole transmission graph, but there is no reason to assume this is the case.
    • algorithm must work for any "feasible" problem
  • Pinar-Hendrickson describes a basic phase-based algorithm for data redistribution. The algorithm doesn't work in some cases: it will either deadlock or livelock, so it does not meet our requirements.
  • A modified algorithm uses "parking" to solve this problem (and also potentially reduce the number of phases). This does meet our requirements. However, parking involves moving some data twice, which is potentially expensive. It also forces the redistribution to take place in "phases", with an effective barrier between each phase. There can be a lot of phases, and a lot of procs doing nothing in many of them, limiting potential parallelism. I think that is what we are seeing in our experiment 3; you really notice it when the free space approaches 0. The algorithm is called parkBred.
  • We have a very different algorithm which does not involve parking. It depends on finding cycles in the transmission graph. It is called cyclicSchedulerBred. It differs in several ways. First, there is no parking: all blocks go directly to their destination. Yet it still terminates on any input. Second, it does not proceed in discrete phases. Rather, each proc has a sequence of send and receive actions it is to carry out, and the way these are interleaved in time is left largely unspecified. Third, the complete schedule is computed first, and then executed.
  • We should compare Pinar-Hendrickson's parking algorithm with our cyclic algorithm.
  • The entire schedule should be made first, in both cases. (Objection: the schedule should not be made first, because the memory requirements would exceed O(n). Response: I don't think that is a big deal, because the size of the schedule is bounded above by the number of blocks.) Anyway, as it now stands, parkBred does not make a schedule in advance and cyclicScheduler does. It might be worthwhile to make a version of parkBred in which the schedule is computed first, to see if this helps.
  • Comparison should show time and memory usage.
  • Experiments:
    1. Exp3: one big cycle. This is a good experiment. parkBred seems to break down on it.
    2. Exp1: a good general mix-up, with a small amount of free space on all procs.
    3. Exp4: a case where one proc has all the free spaces. parkBred should take advantage of that; cyclicScheduler will not.
    4. others??? Andrew: have any ideas?

Conferences

  • Paper for EuroPVM/MPI 2008, due April 6, 2008. Conference is in Dublin, Sep. 8-10, 2008. LNCS style, 8 pages.
  • Paper for SC08, abstract due April 4, 2008, full paper due April 7. Conference in Austin TX, Nov. 2008. 10pt text, double spaced, PDF format, <20 pages.

Future Work

  • Andrew said the interface would be improved if the user could attach free space during the call to redistribute, rather than all at once at the beginning.
  • Note to self: I keep getting burned by the BG/L's damn "rzv: unexpected message" errors. These tend to pop up only on the 1024 (or 512) node runs. Why not eat my own dogfood and use MPI-Spin to check that none of the big messages are ever sent before the matching receive is posted? Good example of how model checking only requires small configurations.

Response to referees

  • what is going on with the cyclic algorithm in the case where every proc is sending to every other proc approx. equal volume?
    • I looked into this and saw that the search through the transmission graph yields only cycles of length 2. The transmission graph is a complete graph with all edges of approximately equal weight, and all nodes order their outgoing edges the same way: by increasing rank. The schedule can be visualized with an nxn matrix (n=numProcs); the phases correspond to its diagonals. In the first phase, 0 and 1 exchange and no one else does anything. In the second phase, 0-2 and 1-3 are the exchanging pairs. Etc. Basically there are n(n-1)/2 exchanges and these are broken up into 2n-3 phases, so there is a lot of waiting around doing nothing. In contrast, parkBred uses 4 phases regardless of n. (At least, that's what I got for n<=64.)