UDel: CISC 879 Parallel Computation - Exercise: Fox/MPI matrix multiply analysis

Analyze the Fox matrix multiplication algorithm, as found in eecis:~saunders/879/fox.c.

Compute S, the number of communication calls, each of which incurs the per-call latency T_s. In other words, count the total number of calls to MPI communication functions per process.

Compute W, the total number of words transmitted in the messages to and from each process. For the word total in a message, it suffices to count just data elements. When a message sends one block object (local_matrix_mpi_t) just use the number of entries in the block.

Compute A, the total number of element arithmetic operations used by the process. It suffices to count the ops to multiply and add the blocks. For k by k block multiply, assume k^3 mults and k^2 adds. For k by k block add, k^2 element adds are used.

This model then expects the total run time to be T_total = S*T_s + W*T_w + A*T_ma, where T_s, T_w, T_ma are the costs we computed last week.
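Once S, W, and A have been counted by hand, the formula can be evaluated mechanically. The following is only a minimal sketch: the struct machine_costs and the functions predict_total_time and print_breakdown are hypothetical names that do not appear in fox.c, and the cost values must be supplied from your own measurements.

```c
#include <stdio.h>

/* Hypothetical container for the three cost parameters of the model.
   The field names are assumptions; all values are in seconds. */
typedef struct {
    double t_s;   /* T_s: cost charged per communication call */
    double t_w;   /* T_w: cost charged per word transmitted */
    double t_ma;  /* T_ma: cost charged per counted arithmetic operation */
} machine_costs;

/* Evaluate T_total = S*T_s + W*T_w + A*T_ma for one process. */
double predict_total_time(double S, double W, double A, machine_costs c)
{
    return S * c.t_s + W * c.t_w + A * c.t_ma;
}

/* Print the three terms separately so the dominant one is easy to see. */
void print_breakdown(double S, double W, double A, machine_costs c)
{
    printf("latency     S*T_s  = %g s\n", S * c.t_s);
    printf("bandwidth   W*T_w  = %g s\n", W * c.t_w);
    printf("arithmetic  A*T_ma = %g s\n", A * c.t_ma);
    printf("T_total            = %g s\n", predict_total_time(S, W, A, c));
}
```

Printing the terms separately may help when discussing why the prediction and the measurement differ, since it shows which of the three terms the model expects to dominate for a given n.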

What is T_total?
Time fox.c on porsche with 4 processors and n = 100, 200, 400. For this it will suffice to have process 0 call the timer before and after the call to fox. [ Optionally, have all procs compute and print their fox() call time. Is there much variation? ] (Note: you may want to remove the block printing in main.) Is the T_total formula borne out? If not, discuss briefly what you think the reason(s) may be.
Optionally, try 9 procs. Compare this formulation with the HPF model. [ Assignment is just to think about the two models for yourself. No writeup required. ]
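To judge whether the T_total formula is borne out, and how much the per-process times vary, two small helpers may be enough. This is a sketch under assumptions: relative_error and time_spread are hypothetical names, and the measured times are assumed to have been collected already, in seconds, from timer calls placed around the call to fox.

```c
#include <stddef.h>

/* Relative error of the model against a measurement:
   |predicted - measured| / measured. Assumes measured > 0. */
double relative_error(double predicted, double measured)
{
    double diff = predicted - measured;
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff / measured;
}

/* Spread (max - min) of the per-process fox() times.
   Assumes times points to p values and p >= 1. */
double time_spread(const double *times, size_t p)
{
    double lo = times[0];
    double hi = times[0];
    for (size_t i = 1; i < p; i++) {
        if (times[i] < lo) lo = times[i];
        if (times[i] > hi) hi = times[i];
    }
    return hi - lo;
}
```

A relative error that shrinks as n grows from 100 to 400 would suggest the model is missing a fixed overhead; one that grows would point at a term that scales with n.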

Due Mar 6.