CISC 372: INTRODUCTION TO PARALLEL PROGRAMMING
Spring 2000
Individual Programming Assignment 2
Due Date: start of class, Thursday, March 2, 2000
The objective of this assignment is to
write some simple MPI programs, learning how to accomplish communication between
processes and how to time MPI programs.
As a side effect of this assignment, you will implement several
versions of a global sum reduction operation.
1. Read through Section 3 of the Beginner's Guide to MPI on the University of
Delaware DEC Alpha Cluster, and Chapter 3 of your textbook.
2. Centralized Summation: Write an MPI program that computes the
global sum of results from computations on the data local to each process in the following way. First,
for ease of creating local data, use the rank of each process as its local data.
To give each node some computation, have each node locally
compute the sum of 1000 copies of its rank
(myrank + myrank + ... + myrank), performing the additions explicitly
rather than multiplying myrank by 1000.
The point is simply to create a fair amount of local computation.
The overall centralized summation is the sum of these local sums.
Assuming that
there are N processes, the global sum should be
computed by having processes 1 through N-1 send their local result to process 0, and
having process 0 receive their local results and compute the global sum
of the local results.
The processes should not have to send their local results in any particular order, and
process 0 should not have to receive them in any particular order.
That is, process 0 should not explicitly try to receive the results in order
1 through N-1; doing so causes undue synchronization.
Your centralized summation program should be written in a general manner
in order to work for any number of processes
N > 1, without recompilation of the program code.
You need to time from just before when the processes start sending data until just before
process 0 prints the result.
Do not time the printing.
Take 3 timings for 4 processes, and then for 8 processes.
3. Shifting Summation: Write an MPI program that computes the
global sum of the local results of each process in the following way. Again,
assume that the rank of the process is the local data and that the same
local summation is computed as above. Assuming that there
are N processes, process N-1 should send its result to the next lower-numbered
process, N-2, which adds its own local result to the running sum (we call this a
partial sum) and sends the current partial
sum to the next lower-ranked process. This continues until process 0 receives
the partial sum from process 1. Process 0 then adds its local sum (I know, it is
zero, but we are mimicking summation of local data), and prints out the
final global sum. You need to time from the start of process N-1 sending its
data, until just before process 0 prints.
Take 3 timings for 4 processes, and then for 8 processes.
Again, your code should be written in a general manner, able to work for
any N > 1 number of processes without recompilation.
4. Tree Summation: Write an MPI program that computes the
global sum of the local results of each process in the following way. Again,
assume that the local data is the rank of the process and that the
local result is computed as above. Assume that there
is a number of processes, N > 1, that is a power of 2 (e.g., 2, 4, 8).
The communication and summation should
be performed in a tree-like pattern. The following diagram shows the communication
for 8 processes, where the numbers indicate the process that is doing the summation
at that point in the tree computation.
So, process 1 sends its local result to process 0, which computes a partial sum, and
waits for process 2 to send the partial sum of processes 2 and 3. Then, process 0
computes the partial sum of its current sum with the partial sum from process 2.
Then, process 0 waits for the partial sum from process 4, and computes the final
global sum. Each interior node in this logical tree is a partial sum computation.
The original local results are at each process at the bottom of this
tree of computation.
You should time from the start of computing local sums at the bottom of the
tree computation until just before process 0 prints the final result.
Take 3 timings for 4 processes, and then for 8 processes.
              0
          /       \
      0               4
    /   \           /   \
  0       2       4       6
 / \     / \     / \     / \
0   1   2   3   4   5   6   7
5. Create a single graph that displays the averages of your timings for
4 and 8 processes, for each method. The horizontal
axis should be the number of processes involved in the summation.
The vertical axis should be the
timings. The graph should include only your average timings,
so there should be 6 points on the graph.
A bar graph might be a nice depiction of this data. A different bar could
be used for each of the three methods, for a given number of processes.
You should use a tool that automatically creates
the graph based on the data inputs.
Two different approaches to generating these graphs from output of your MPI programs
are given on the CISC 372 web site.
In order to get reliable timing data, you need to make sure that no one
else is using the cluster. To do this, run the spyall command
just prior to making a performance run. Also, from past experience,
leaving the timings until the last evenings before the assignment due date
makes reliable timing runs very difficult to obtain, because
many other people will be trying to get access to the cluster for their own timings.
Your experimental report should consist of the following sections in this
order. It is strongly recommended that you type your report using a
word processor rather than handing in a hand-written report.
1. Cover Page: Title, author, course number and semester, date.
2. Project Summary:
In one paragraph, summarize the activities of the lab.
3. Hypothesis:
First, state and justify your hypothesis. Which method for summation do you believe
should take the longest time and which should take the shortest time? Why?
Explain your hypothesis in the context of a large number of processes and
a small number of processes, and a large amount of computation at each node
versus a small amount of computation at each node.
4. Analysis:
(a) Collected Timing Data.
This section includes both a chart of the raw collected timings and the
graph depicting the averages pictorially.
(b) Analysis.
In English, explain your timing results.
Describe your observations concerning the different methods
and how they compare to your hypothesis.
Explain why you believe the timings are different or similar between the
methods.
5. Conclusions:
This section consists of a discussion of what you learned from the various aspects
of the lab. Discuss possible reasons for inconsistencies or discrepancies
in your data versus what you would have expected to happen.
6. General Remarks:
Any comments on the lab itself that you would like to make in terms of
suggestions for improvement or problems you experienced.
7. Appendix:
Your code for the three versions of summation.
Please staple all parts of your lab together, and label each piece.
Be prepared to discuss your results in class.
Your lab will be evaluated according to the following criteria:
1. 13 pts: Centralized summation
2. 3 pts: Timings of centralized summation
3. 15 pts: Shifting summation
4. 3 pts: Timings of shifting summation
5. 25 pts: Tree summation
6. 3 pts: Timings of tree summation
7. 5 pts: Graph of averages
8. 3 pts: Use of tool for graph construction
9. 30 pts: Experimental report
   (a) 5 pts: cover page
   (b) 5 pts: project summary
   (c) 5 pts: hypothesis
   (d) 10 pts: analysis of timing results (includes points for code in appendix)
   (e) 5 pts: conclusions
Extra credit (7 pts):
Try varying the amount of local computation performed at each node until you
see a different effect or a trend in the overall comparisons between the 3 methods.
Each process should do
the same amount of computation, but vary the amount they all do. In this
original scheme, they are all doing 1000 summations. What happens if they
do very little computation locally? What happens if they do a lot more
computation locally? Write short answers to these questions and
back your answers up with timing data. Be sure to describe what local
computations you used for your timing data.
Lori Pollock
2000-02-22