Compute S, the number of communication calls, each of which incurs the per-call latency. In other words, count the total number of calls to MPI communication functions per process.
Compute W, the total number of words transmitted in the messages to and from each process. To count the words in a message, it suffices to count just the data elements. When a message sends one block object (local_matrix_mpi_t), just use the number of entries in the block.
Compute A, the total number of element arithmetic operations used by the process. It suffices to count the ops needed to multiply and add the blocks. For a k by k block multiply, assume k^3 mults and k^2 adds. For a k by k block add, k^2 element adds are used.
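As a rough sketch of the counting rule for A, the per-process total can be tallied from the number of block multiplies and block adds. The function names and the example counts used here (3 block multiplies, 3 block adds, k = 4) are hypothetical and are not part of the assignment; the per-block counts are the ones stated above.

```python
def block_multiply_ops(k):
    """Element ops for one k-by-k block multiply, using the counts
    stated in the assignment: k^3 multiplies plus k^2 adds."""
    return k**3 + k**2


def block_add_ops(k):
    """Element ops for one k-by-k block add: k^2 adds."""
    return k**2


def total_arithmetic_ops(k, num_block_multiplies, num_block_adds):
    """A for one process: sum the ops over all of its block operations."""
    return (num_block_multiplies * block_multiply_ops(k)
            + num_block_adds * block_add_ops(k))


# Hypothetical example: a process does 3 block multiplies and 3 block adds
# on 4-by-4 blocks.
A = total_arithmetic_ops(4, 3, 3)
```

The actual numbers of block multiplies and adds depend on the algorithm and the process grid, so they have to be read off from the code being modeled.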
This model then expects the total run time to be T_total = S*T_s + W*T_w + A*T_ma, where T_s, T_w, T_ma are the costs we computed last week.
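The model can be evaluated with a few lines of code once S, W, and A are known. This is only a sketch: the function name and every numeric value below are made-up placeholders, not measured costs, and the real T_s, T_w, and T_ma should be the values computed last week.

```python
def predicted_total_time(S, W, A, T_s, T_w, T_ma):
    """Evaluate T_total = S*T_s + W*T_w + A*T_ma for one process.

    S: number of MPI communication calls
    W: number of words sent and received
    A: number of element arithmetic operations
    T_s, T_w, T_ma: per-call, per-word, and per-operation costs (seconds)
    """
    return S * T_s + W * T_w + A * T_ma


# Placeholder counts and costs, for illustration only.
T_total = predicted_total_time(S=10, W=1000, A=5000,
                               T_s=1e-5, T_w=1e-8, T_ma=1e-9)
print(f"predicted run time: {T_total:.3e} s")
```

Comparing the three terms separately shows whether latency, bandwidth, or arithmetic dominates the prediction.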
Due Mar 6.