CISC879 Project 4

Parallelization of a Scattering Code

 

 

 

 

 

Final Report

 

 

 

 

Faculty Advisor: P. Monk

Group Members: J. Sun, E. Fellows, M. Jochen, L. Milano, & P. Dmitruk


Problem Description:

 

1. Motivation

The interaction of electromagnetic waves with matter underlies many important engineering applications. For example, radar involves the scattering of electromagnetic pulses from targets, and understanding the effects of microwave radiation on biological tissue involves the scattering of electromagnetic waves from complicated, penetrable targets. Another large class of problems of particular interest to one of us (P. Monk) is microwave imaging and non-destructive testing. In these problems we attempt to infer properties of hidden objects from their effects on microwave signals. Examples are again radar, ground penetrating radar (for example for mine detection), and attempts to use microwaves to image inside the human body.

All these problems require an efficient code to simulate the propagation of electromagnetic waves. In a recent paper, Cessenat and Despres [1,2] suggested a new method for solving this problem called the Ultra Weak Variational Formulation (UWVF). This formulation has potential advantages over standard finite element methods. Dr. Monk has implemented a serial code for this problem in F90. This code is the subject of our parallelization attempts.

 

2. Problem Statement

Given a bounded domain Ω, the goal is to approximate the electric field E (a complex-valued vector function of position) that satisfies the Maxwell system

∇ × (∇ × E) − κ² ε_r E = 0

in Ω, where κ is the wave number of the radiation (the wavelength of the radiation is 2π/κ) and ε_r is the possibly complex and spatially varying relative permittivity of the material. In addition, the solution must satisfy a generalized impedance boundary condition relating the tangential traces of E and ∇ × E on the boundary of Ω, where n is the unit outward normal to the boundary. Using different combinations of the impedance parameter and the boundary data function, we can simulate different types of boundaries: one choice gives a classical absorbing boundary condition, while another gives a perfect conductor (i.e. a metallic wall). The UWVF shares many similarities with finite element codes. First, the region of interest is subdivided into tetrahedra using an automatic mesh generator (this would be an interesting subject for parallelization, but far too difficult for this semester!). Note that the resulting mesh is unstructured and may have elements of very different sizes. The advantage of such a mesh is that it can model complex geometric shapes well; the disadvantage is that the elements are of variable size and have no simple relationship to one another (unstructured). A cut through a simple mesh is shown in Figure 1.

 

Figure 1: A cut through the mesh for the domain between a sphere and a cube. The goal is to simulate scattering from the inner sphere, using the outer surface as an artificial boundary to terminate the mesh. This is a small test mesh with around 71,000 tetrahedra.

 

Once the mesh is available, the UWVF code performs two operations

·        The tetrahedra are processed and a large, sparse, complex-valued matrix is generated. In addition, a right-hand-side vector is computed.

·        The matrix problem is solved by a simple iterative process requiring only matrix multiplies and dot-products.

The UWVF is not a standard finite element method (it is closer to a discontinuous Galerkin finite element method in spirit), so many of the node based parallelization schemes in the literature need to be adapted in this case. In particular, at the assembly stage, when computing the matrices for one tetrahedron, it is necessary to use data from surrounding tetrahedra (unlike finite elements in which this process is entirely local). However the sparsity structure of the resulting matrix is quite simple.

 

Serial Implementation:

 

For the Ultra Weak Variational Formulation, due to Cessenat and Despres, we have the following general outline:

·        Mesh the domain Ω by cutting it up into tetrahedra. This mesh generation step is very time consuming, but only needs to be done once for a given domain and wave number.

·        On each element K (tetrahedron) we represent the solution as a sum of simple exact solutions of Maxwell's equations:

E ≈ Σ_j x_{K,j} P_j exp(iκ d_j · x)   on K,

where |d_j| = 1 and P_j · d_j = 0. For us the number of terms per element is even, since there are two independent polarizations P_j for each different direction of propagation d_j. These simple functions are called plane waves (a small sketch of evaluating one such basis function is given after this list).

·        Note the connection with spectral methods (exponential basis elements).

·        Maxwell's equations are satisfied exactly on each tetrahedron, but there needs to be a way of imposing suitable continuity between elements. This is done by the UWVF.
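
As mentioned above, the following is a minimal Fortran 90 sketch of evaluating a single plane-wave basis function P exp(iκ d·x) at a point x. It is only an illustration: the routine name and argument list are ours, not those of the UWVF code, and we assume d is a unit vector with P · d = 0.

! Sketch: evaluate one plane-wave basis function E(x) = P * exp(i*kappa*d.x).
! kappa is the wave number, d a unit propagation direction, and P a polarization
! vector orthogonal to d (so that the field solves Maxwell's equations exactly).
subroutine plane_wave(kappa, d, P, x, E)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp), intent(in)     :: kappa, d(3), P(3), x(3)
  complex(dp), intent(out) :: E(3)
  complex(dp) :: phase
  phase = exp(cmplx(0.0_dp, kappa*dot_product(d, x), kind=dp))
  E = P * phase
end subroutine plane_wave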

 

For simplicity of exposition, we shall now assume ε_r = 1, although this is not necessary in our code. To construct the UWVF, on each tetrahedron K of the mesh we work with fields that satisfy the Maxwell system in K together with impedance-type conditions on ∂K, where ∂K denotes the boundary of K and n_K is the unit outward normal to ∂K. A second family of such fields, playing the role of test functions, is required to satisfy the same local problem on each tetrahedron; the precise statement of these local problems is given in [1,2].

Diagram of the relationship between elements in 2D, showing the orientation of the normal on an inter-tetrahedron boundary.

 

Suppose K touches tetrahedra K1, K2, K3 and K4 through its faces. Let Σ_{K,Kj} denote the common face between K and Kj, and let n_K denote the outward normal to K. Using integration by parts and Maxwell's equations, we obtain an identity, denoted (1), that expresses the impedance trace data of the solution on ∂K in terms of the corresponding trace data on the neighboring tetrahedra K1, ..., K4 and the boundary data.

This constrains the boundary values of the solution across mesh faces, since similar equations hold for each tetrahedron.

 

Using the plane-wave expansion for the solution, and taking the test functions to be plane waves as well, the previous equations give a system of linear equations for the unknown coefficients x_{K,j} on each tetrahedron.

We list the unknowns element by element in a vector x.

The terms involving traces on a single element give rise to a block-diagonal, Hermitian matrix D.

The terms coupling an element to its neighbors through faces give rise to a sparse matrix C with 4 non-zero blocks per block row.

We need to solve

D x = C x + b,

and we do this by solving the equivalent system (I − D⁻¹C) x = D⁻¹ b by the bi-conjugate gradient method (D is block diagonal, so applying D⁻¹ is cheap).

For the serial algorithm, we first "assemble" the matrices D and C by processing tetrahedron by tetrahedron and evaluating exactly the integrals in (1). Data is needed from neighboring tetrahedra through faces; this has implications for domain decomposition parallel schemes. Then we solve the linear system by the bi-conjugate gradient method. This requires matrix-vector multiplication by D⁻¹C, which can be done element by element as long as data on the neighboring tetrahedra (through faces) is available. A sketch of this element-by-element product is given below.
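
To make this element-by-element structure concrete, here is a minimal Fortran 90 sketch of how the action of D − C on a vector might be organized. The array names and the assumption of a fixed number p of unknowns per tetrahedron are ours for illustration; they are not the data structures of the actual code.

! Sketch: apply y = D*x - C*x element by element.
! Hypothetical layout: p unknowns per tetrahedron, ntet tetrahedra,
! neigh(f,k) = index of the neighbor of tetrahedron k across face f (0 on the boundary),
! Dblock(:,:,k) = local block of D, Cblock(:,:,f,k) = coupling block to that neighbor.
subroutine apply_matrix(ntet, p, neigh, Dblock, Cblock, x, y)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in)      :: ntet, p
  integer, intent(in)      :: neigh(4, ntet)
  complex(dp), intent(in)  :: Dblock(p, p, ntet), Cblock(p, p, 4, ntet)
  complex(dp), intent(in)  :: x(p, ntet)
  complex(dp), intent(out) :: y(p, ntet)
  integer :: k, f

  do k = 1, ntet
     ! Block-diagonal part: only the unknowns of element k are needed.
     y(:, k) = matmul(Dblock(:, :, k), x(:, k))
     ! Face-coupling part: needs the unknowns of the (up to four) neighbors,
     ! which is exactly the data that must be communicated in a parallel version.
     do f = 1, 4
        if (neigh(f, k) > 0) then
           y(:, k) = y(:, k) - matmul(Cblock(:, :, f, k), x(:, neigh(f, k)))
        end if
     end do
  end do
end subroutine apply_matrix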

 

Typical linear finite element methods need at least 10 grid points per wavelength to yield acceptable results (and usually more!). Experiments with the UWVF have shown that it also has a restriction on the minimal number of unknowns per wavelength (4 per wavelength so far). Due to this grid-points-per-wavelength requirement, we need many tetrahedra; for example, an aircraft 15 m long illuminated at a 2 mm wavelength would require a number of tetrahedra that is simply not possible today, and since each tetrahedron must carry a minimum number of basis functions, the total number of unknowns is larger still!

Even for small numbers of tetrahedra, the UWVF can be very time consuming and very memory intensive. For example, a very small test problem using 71,528 tetrahedra with 20 basis functions per tetrahedron currently requires 2.5 GBytes of memory and runs for 4,859 seconds (in serial mode) on a 195 MHz Silicon Graphics Origin-2000. Realistic problems, such as simulating the interaction of microwaves with the human head, will require far more memory and computer time. This drives us to turn to parallel methods.

For our project, we investigated two approaches:

1. Application of MUMPS: Once the matrix is generated (and perhaps stored on disk), the matrix problem could be fed to an existing parallel direct sparse matrix code such as the MUltifrontal Massively Parallel Solver (MUMPS). This should be a relatively easy job (at least if the most straightforward matrix format is used). Statistics could be gathered comparing the performance of this code on different architectures (e.g. shared memory multi-processor, high performance cluster, poor network cluster...) and different sized problems. Some MPI coding may be needed, but the amount will be limited.

2. Data Parallelization or Domain Decomposition: Using partitioning software such as METIS, the grid can be partitioned among the processors. A picture of the mesh used in Figure 1, partitioned into 8 sub-meshes, is shown in Figure 2.

Figure 2: The sphere mesh used in Figure 1 partitioned into 8 sub-meshes using METIS.

The matrix can then be constructed on each sub-grid (but, unlike in the finite element case, this will require significant communication). Having computed the matrix in parallel, we can then implement a parallel matrix multiplication routine and hence fully parallelize the code. This will require a significant investment of time to take care of renumbering and indexing the grid and unknowns at various times in the solution process.

 

Application of MUMPS and Numerical Results:

 

Introduction to MUMPS:

MUMPS ("MUltifrontal Massively Parallel Solver") is a package for solving linear systems of equations Ax = b, where the matrix A is sparse and can be either unsymmetric, symmetric positive definite, or general symmetric. MUMPS uses a multifrontal technique, which is a direct method based on either the LU or the LDL^T factorization of the matrix. MUMPS exploits both the parallelism arising from sparsity in the matrix A and the parallelism available in dense factorization kernels.

The software is written in Fortran 90. It requires MPI for message passing and makes use of BLAS, BLACS, and ScaLAPACK subroutines. 

MUMPS distributes the work among the processors, but an identified (host) processor is required to perform the analysis phase, distribute the incoming matrix to the other (slave) processors in the case where the matrix is centralized, collect the solution, and generally oversee the computation. The system Ax = b is solved in three main steps:

1.      Analysis. The host performs an approximate minimum degree ordering based on the symmetrized pattern A + A^T, and carries out symbolic factorization. A mapping of the multifrontal computational graph is then computed, and symbolic information is transferred from the host to the other processors. Using this information, the processors estimate the memory necessary for factorization and solution.

2.      Factorization. The original matrix is first distributed to the processors that will participate in the numerical factorization. The numerical factorization of each frontal matrix is conducted by a master processor (determined by the analysis phase) and one or more slave processors (determined dynamically). Each processor allocates an array for contribution blocks and factors; the factors must be kept for the solution phase.

3.      Solution. The right-hand side b is broadcast from the host to the other processors. These processors compute the solution x using the (distributed) factors computed during Step 2, and the solution is assembled on the host.

MUMPS allows the host processor to participate in computations during the factorization and solve phases, just like a slave processor. This may lead to memory imbalance if the host already stores the initial matrix, but it also allows us to run MUMPS on a single processor and avoids one processor being idle during the factorization and solve phases. If the user is able to input the matrix in a distributed way or if memory is not an issue, we recommend this mode.

For both the symmetric and the unsymmetric algorithms used in the code, MUMPS chooses a fully asynchronous approach with dynamic scheduling of the computational tasks. Asynchronous communication is used to enable overlapping between communication and computation. Dynamic scheduling was initially chosen to accommodate numerical pivoting in the factorization. The other important reason for this choice is that, with dynamic scheduling, the algorithm can adapt itself at execution time to re-map work and data to more appropriate processors. In fact, MUMPS combines the main features of static and dynamic approaches; it uses the estimation obtained during analysis to map some of the main computational tasks; the other tasks are dynamically scheduled at execution time. The main data structures (the original matrix and the factors) are similarly partially mapped according to the analysis phase. Part of the initial matrix is replicated to enable rapid task migration without data redistribution.

In the Fortran 90 interface, there is a single user-callable subroutine called MUMPS that has a single parameter, mumps_par, of the Fortran 90 derived datatype MUMPS_STRUC. MPI must be initialized by the user before the first call to MUMPS. The calling sequence looks as follows:

 

INCLUDE 'mpif.h'
INCLUDE 'mumps_struc.h'
...
INTEGER IERR
TYPE (MUMPS_STRUC) :: mumps_par
...
CALL MPI_INIT(IERR)
...
CALL MUMPS (mumps_par)
...
CALL MPI_FINALIZE(IERR)

 

The datatype MUMPS_STRUC holds all the data for the problem. It has many components that are internal to the package (and could be declared private). Some of the components must be defined only on the host; others must be defined on all processors. The file mumps_struc.h defines the derived datatype and must always be included in the program that calls MUMPS. The file mumps_root.h, which is included in mumps_struc.h, defines the datatype for an internal component root, and must also be available at compilation time. The components of the structure MUMPS_STRUC that are of interest to the user are listed in the MUMPS manual (see the Appendix). This structure contains all the parameters of the user interface, plus internal information.

The interface to MUMPS consists of calling the subroutine MUMPS with the appropriate parameters set in mumps_par; a condensed sketch of such a driver is given below, and we append a sample program at the end of our report.
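
The sketch below shows where the phases appear in practice. It is modeled on the simple example program in the MUMPS user's guide rather than taken from our code: the component names (COMM, SYM, PAR, JOB, N, NZ, IRN, JCN, A, RHS, MYID) and the JOB values (-1 to initialize, 6 for analysis + factorization + solve, -2 to free internal data) follow the MUMPS documentation, and the tiny diagonal 2 x 2 system is purely illustrative.

PROGRAM call_mumps_sketch
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INCLUDE 'mumps_struc.h'
  INTEGER IERR
  TYPE (MUMPS_STRUC) :: mumps_par

  CALL MPI_INIT(IERR)
  ! Initialize a MUMPS instance: unsymmetric matrix, host participates in the work.
  mumps_par%COMM = MPI_COMM_WORLD
  mumps_par%SYM  = 0
  mumps_par%PAR  = 1
  mumps_par%JOB  = -1
  CALL MUMPS(mumps_par)
  ! Define a purely illustrative 2 x 2 problem on the host in coordinate format.
  IF (mumps_par%MYID .EQ. 0) THEN
     mumps_par%N  = 2
     mumps_par%NZ = 2
     ALLOCATE( mumps_par%IRN(2), mumps_par%JCN(2), mumps_par%A(2), mumps_par%RHS(2) )
     mumps_par%IRN = (/ 1, 2 /)
     mumps_par%JCN = (/ 1, 2 /)
     mumps_par%A   = (/ 1.0D0, 2.0D0 /)
     mumps_par%RHS = (/ 3.0D0, 4.0D0 /)
  END IF
  ! Analysis, factorization and solve in one call; the solution overwrites RHS on the host.
  mumps_par%JOB = 6
  CALL MUMPS(mumps_par)
  ! Free the internal data and shut down.
  mumps_par%JOB = -2
  CALL MUMPS(mumps_par)
  CALL MPI_FINALIZE(IERR)
END PROGRAM call_mumps_sketch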

 

Modify UWVF to use MUMPS

 

In our serial UWVF code, we use bi-conjugate gradients to solve the linear system. Now we turn to MUMPS: instead of solving the linear system serially with bi-conjugate gradients, we want to call MUMPS.

So far, we have a complex linear system. In order to use MUMPS, we need to change it into a real linear system. Suppose we have Az = b, where A is a complex matrix and b is a complex vector. Write A, z and b as A = A_r + iA_i, z = z_r + iz_i and b = b_r + ib_i, where the subscripts r and i denote the real and imaginary parts. We obtain

(A_r + iA_i)(z_r + iz_i) = b_r + ib_i.

Thus

A_r z_r − A_i z_i = b_r   and   A_i z_r + A_r z_i = b_i.

Written in matrix form,

[ A_r  −A_i ] [ z_r ]   [ b_r ]
[ A_i   A_r ] [ z_i ] = [ b_i ].

This is a real linear system of twice the size, and it is what we want to hand to MUMPS. But we need to be a little bit careful. Suppose A is a diagonal matrix. The block matrix above is clearly not diagonal, so the nice structure of A is not retained. Here we need a trick that does not destroy the structure of A.

Instead of listing all the real parts followed by all the imaginary parts, we interleave them: the unknown vector becomes (z_{1,r}, z_{1,i}, z_{2,r}, z_{2,i}, ...), the right-hand side is interleaved in the same way, and each complex entry a_{jk} = a_{jk,r} + i a_{jk,i} of A is replaced by the 2 x 2 real block

[ a_{jk,r}  −a_{jk,i} ]
[ a_{jk,i}   a_{jk,r} ].

Call the resulting matrix B. The interleaved system is equivalent to Az = b, and if A is diagonal then B is tri-diagonal (in fact block diagonal with 2 x 2 blocks), which retains the structure of A. Once we obtain this real linear system, we can use MUMPS to solve it.
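
As a concrete illustration, the following Fortran 90 sketch performs this interleaving for a matrix stored in coordinate (row, column, value) format, which is the format MUMPS accepts; the routine and array names are ours, not those of our code. Note that each complex non-zero becomes four real non-zeros, which matches the counts reported below (315584 complex entries versus 1262336 real entries).

! Sketch: expand a complex coordinate-format matrix A into the interleaved real matrix B.
subroutine complex_to_real_interleaved(nz, irn, jcn, a, irn2, jcn2, a2)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in)      :: nz                 ! number of complex non-zeros
  integer, intent(in)      :: irn(nz), jcn(nz)   ! row/column indices of A
  complex(dp), intent(in)  :: a(nz)              ! complex entries of A
  integer, intent(out)     :: irn2(4*nz), jcn2(4*nz)
  real(dp), intent(out)    :: a2(4*nz)           ! entries of B
  integer :: k, m

  m = 0
  do k = 1, nz
     ! Each complex entry a_jk becomes the 2x2 real block [ Re -Im ; Im Re ]
     ! placed at rows 2i-1, 2i and columns 2j-1, 2j of B.
     m = m + 1; irn2(m) = 2*irn(k) - 1; jcn2(m) = 2*jcn(k) - 1; a2(m) =  real(a(k), dp)
     m = m + 1; irn2(m) = 2*irn(k) - 1; jcn2(m) = 2*jcn(k);     a2(m) = -aimag(a(k))
     m = m + 1; irn2(m) = 2*irn(k);     jcn2(m) = 2*jcn(k) - 1; a2(m) =  aimag(a(k))
     m = m + 1; irn2(m) = 2*irn(k);     jcn2(m) = 2*jcn(k);     a2(m) =  real(a(k), dp)
  end do
end subroutine complex_to_real_interleaved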

 

Numerical Results:

 

(Note: we ran the program on a 195 MHz Silicon Graphics Origin-2000 with four processors.)

Our test problem (verysmall.dat) has 226 vertices, 2138 faces and 1023 tetrahedra. The complex linear system has 8184 unknowns and 315584 non-zero entries. After we change it to the real linear system, we have 16368 unknowns and 1262336 non-zero entries.

Using MUMPS, we get output like this:

 

MUMPS Version 4.1.6  --- March 2000

L U Solver for unsymmetric matrices

Type of parallelism: Working host

 

***** ANALYSIS STEP *****

Node parallelisms: ICNTL(12/13) = 2*0

A root of estimated size 1680 has been selected for Scalapack

Mapping  takes memory into account

 

***** FACTORIZATION STEP *****

GLOBAL STATISTICS PRIOR NUMERICAL FACTORIZATION ...

NUMBER OF WORKING PROCESSES                  =                      4

REAL SPACE FOR FACTORS                                  =        12194112 

...

 

***** SCALING OF ORIGINAL MATRIX

 

ROW AND COLUMN SCALING (1 Pass)

....

 

***** SOLVE & CHECK STEP *****

STATISTICS PRIOR SOLVE PHASE..

...

 

Table 1 shows the timing data of our test problem:

 

# of processors used    time used on each processor (seconds)
1                       41.88
2                       37.05   37.05
3                       26.79   26.79   26.79
4                       23.66   23.66   23.66   23.66

Table 1. Timing Data

 

 

Figure 3. Speedup of MUMPS code

 

In Figure 3 (speedup), we see a modest speedup going from 1 to 2 processors, a better gain from 2 to 3 processors, and again a smaller gain from 3 to 4 processors. With 4 processors we obtain a speedup of almost 2. In the efficiency plot (Figure 4), the efficiency drops quickly from 1 processor to 2 processors, indicating that communication time and idle time on each processor eat up a lot of the potential gain.
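
For reference, if the 1-processor time in Table 1 is taken as the baseline, the speedup and efficiency on 4 processors work out to

S_4 = T_1 / T_4 = 41.88 / 23.66 ≈ 1.77,     E_4 = S_4 / 4 ≈ 0.44.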

 

Figure 4. Efficiency of MUMPS code

 

 

Comments on MUMPS and Numerical Results:

1.      MUMPS is a direct solver for large sparse matrices. It usually needs more memory than iterative methods. We encountered a memory allocation error from MUMPS when we tried to solve a larger system with 24 thousand unknowns. According to one of the authors of MUMPS, we can reorder the matrix first so that MUMPS asks for less memory during the factorization phase.

2.      For our test problem, we use the serial code to generate the linear system and then use MUMPS to solve it. Thus, our timing data covers only the part of the run in which MUMPS is used.

3.      Because of the dynamic scheduling approach and the threshold partial pivoting used in MUMPS, the analysis phase cannot fully predict the space that will be required on each processor, and an upper bound is therefore used for memory allocation. Thus, when we use MUMPS, we may pass the analysis phase but still run out of memory during the factorization phase. Recent numerical results show that MUMPS uses more memory than other parallel solvers such as SuperLU.

4.      In MUMPS Version 4.0, communications are asynchronous and based on immediate sends with explicit buffering in user space. On the destination process, the reception of the messages will be the key point for the synchronization and scheduling of the work. In fact, message reception can be invoked in the following three situations:

·        Dynamic scheduling: Blocking and non-blocking receives are used to drive the scheduling of the tasks on each process.

·        Task ordering: A process may have to receive and treat a "late" message to be able to finish its current task.

·        Insufficient space in send buffer: To avoid deadlock, the corresponding process tries to receive messages until space becomes available in its local send buffer.

A recent report on MUMPS says that the "overhead" time may take 25% to 50% of the total time, depending on the structure of the matrix and the number of processors used. Here, "overhead" means MPI calls and idle time due to communication or synchronization.

 

A Parallel Approach for UWVF

 

            Our second approach is to parallelize the serial UWVF solver. In doing so, we hope to permit larger problems to be solved by computing the solution more quickly and by requiring less memory per computation node. Large scattering problems require a large mesh, which may not fit into memory. By solving the problem across multiple processors, we can decompose the mesh into smaller sub-problems, which will fit in memory.

 

            The opportunity for parallelism exists where we solve the linear system by the bi-conjugate gradient method. If the domain is partitioned across a number of processes, each process can perform its part of the matrix multiplication in parallel. Between iterations, data for neighboring tetrahedra across partition boundaries must be shared. Once all processes have detected convergence of the iterative method, each process reports its solution to the managing process. A sketch of the algorithm follows:

 

1.      MANAGER: Read in meshfile

2.      MANAGER: Partition mesh via Metis

3.      ALL: Broadcast resulting partition vector from MANAGER

4.      MANAGER: Read physical properties

5.      ALL: Broadcast physical properties from MANAGER

6.      MANAGER: Loop for each process

6.1.      Build local (partitioned) mesh for each worker

6.2.      Renumber/re-index tets, faces, & verts

6.3.      Send renumbered package to each worker

7.      ALL: Determine border tets with neighbors

8.      ALL: Send/recv data for border tets

9.      ALL: Compute matrices D & C

10.  ALL: Compute rhs, solve by bi-conjugate gradients, loop

10.1.  Calculate values for x

10.2.  Gather dot-products across all processes at each iteration to find the global variance (see the MPI sketch following this list)

10.3.  Share rhs info for border tets after each iteration

10.4.  When the global variance is within tolerance, send results to the manager and exit

11.  MANAGER: Collect rhs from workers

12.  MANAGER: Reorder rhs from local ordering to global ordering

13.  MANAGER: Save results to file
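
As noted in step 10.2, each process forms the dot products needed by the bi-conjugate gradient iteration over its own unknowns, and the partial results are then combined across all processes. The following Fortran 90 sketch shows one way to do this with an MPI reduction; the routine and variable names are ours, and we assume each unknown is owned by exactly one process and that the MPI_DOUBLE_COMPLEX datatype is available.

! Sketch: global dot product of two distributed complex vectors.
subroutine global_dot_product(n, u, v, global_dot, ierr)
  implicit none
  include 'mpif.h'
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in)      :: n          ! number of locally owned unknowns
  complex(dp), intent(in)  :: u(n), v(n) ! local pieces of the two vectors
  complex(dp), intent(out) :: global_dot
  integer, intent(out)     :: ierr
  complex(dp) :: local_dot

  local_dot = dot_product(u, v)          ! local (conjugated) dot product
  ! Sum the local contributions so that every process sees the global value.
  call MPI_ALLREDUCE(local_dot, global_dot, 1, MPI_DOUBLE_COMPLEX, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
end subroutine global_dot_product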

 

The above is a high level look at how we parallelized the serial UWVF code. Note the use of Metis to partition the mesh. Metis is discussed in the following section.

 

Using METIS

 

Domain decomposition is accomplished using the software package METIS. The raw data consists of a mesh of tetrahedra generated automatically and stored in a file which is read at run time. This file contains information about the mesh (e.g. the number of tetrahedra, faces and vertices) and sufficient detail about the spatial relationships between tetrahedra for METIS to partition the mesh among processors. The desired distribution is one in which the number of elements assigned to each processor is approximately the same, while the number of adjacent elements assigned to different processors is minimized. The former balances the computation time on each processor, while the latter minimizes the amount of communication between processors. The call METIS_PartGraphKway() is used to obtain such a distribution by minimizing the edge cut, which is defined as the number of edges that straddle partitions; a sketch of this call is given below. According to the authors this method is extremely fast and efficient. In runs on mesh sizes ranging from 6000 to 72000 tetrahedra, the partitioning takes less than 0.01% of the overall execution time and thus has negligible effect. Since the amount of data being communicated for each tetrahedron is the same in this implementation, the communication between processors is also minimized. If the number of basis functions, p, varies from tetrahedron to tetrahedron, then the communication volume may no longer be minimal. This information cannot be determined until run time, but it could be included in the mesh partitioning process by using different weights on the nodes of the METIS adjacency graph (not yet implemented).
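
The following is a minimal Fortran 90 sketch of this partitioning call, assuming the tetrahedra have already been turned into a face-adjacency graph in compressed (CSR) form. The argument list follows the METIS 4 manual (wgtflag = 0 requests no weights, numflag = 1 selects Fortran-style numbering); the surrounding routine is ours for illustration only.

! Sketch: partition the face-adjacency graph of the tetrahedral mesh into nparts pieces.
subroutine partition_mesh(ntet, xadj, adjncy, nparts, part)
  implicit none
  integer, intent(in)  :: ntet          ! number of tetrahedra (graph vertices)
  integer, intent(in)  :: xadj(ntet+1)  ! CSR pointers into adjncy
  integer, intent(in)  :: adjncy(*)     ! neighboring tetrahedra of each tetrahedron
  integer, intent(in)  :: nparts        ! number of sub-meshes (processors)
  integer, intent(out) :: part(ntet)    ! part(k) = sub-mesh assigned to tetrahedron k
  integer :: options(5), wgtflag, numflag, edgecut
  integer :: vwgt(1), adjwgt(1)         ! dummy weight arrays (unused when wgtflag = 0)

  options(1) = 0   ! use the METIS default options
  wgtflag    = 0   ! no vertex or edge weights
  numflag    = 1   ! Fortran-style (1-based) numbering
  call METIS_PartGraphKway(ntet, xadj, adjncy, vwgt, adjwgt, &
                           wgtflag, numflag, nparts, options, edgecut, part)
end subroutine partition_mesh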

An example of the process of mesh partitioning and the subsequent re-mapping of indices can be found at http://www.math.udel.edu/~monk/CISC879/code.html. This simple example also describes the details of the communication requirements during the computation and what is involved in determining which information needs to be communicated. Care was taken in the code to achieve parallel communication. Each processor may need to communicate with many other processors to share information about border tetrahedra. The goal during this communication phase was to have as many processor pairs as possible communicating this information simultaneously, rather than forcing the communication to be serial in nature; a sketch of one way to do this with non-blocking MPI calls follows. Considerable coding effort was spent in UseMetis.f90, matops.f90, and init_basis.f90 to parallelize the code.
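
As an illustration of this pairwise communication, the sketch below posts all receives and sends with non-blocking MPI calls before waiting, so that different processor pairs can exchange their border data at the same time. The buffer layout (a fixed count of values per neighbor) and all names are ours; the actual code's communication pattern may differ.

! Sketch: exchange border-tetrahedron data with all neighboring processes at once.
subroutine exchange_border(nneigh, ranks, count, sendbuf, recvbuf, ierr)
  implicit none
  include 'mpif.h'
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in)      :: nneigh               ! number of neighboring processes
  integer, intent(in)      :: ranks(nneigh)        ! their MPI ranks
  integer, intent(in)      :: count                ! values exchanged per neighbor
  complex(dp), intent(in)  :: sendbuf(count, nneigh)
  complex(dp), intent(out) :: recvbuf(count, nneigh)
  integer, intent(out)     :: ierr
  integer :: j, requests(2*nneigh), statuses(MPI_STATUS_SIZE, 2*nneigh)

  ! Post every receive and send first, then wait for all of them: neighboring
  ! pairs of processes can therefore communicate simultaneously.
  do j = 1, nneigh
     call MPI_IRECV(recvbuf(1, j), count, MPI_DOUBLE_COMPLEX, ranks(j), 0, &
                    MPI_COMM_WORLD, requests(j), ierr)
  end do
  do j = 1, nneigh
     call MPI_ISEND(sendbuf(1, j), count, MPI_DOUBLE_COMPLEX, ranks(j), 0, &
                    MPI_COMM_WORLD, requests(nneigh + j), ierr)
  end do
  call MPI_WAITALL(2*nneigh, requests, statuses, ierr)
end subroutine exchange_border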

 

Challenges in Implementation

 

            A local numbering for faces, tetrahedra, and vertices is established when the mesh is partitioned. This local numbering must be mapped back to the global numbering of the mesh. The mapping must also re-index references between faces, tetrahedra, and vertices within the local partition. Further, when information for border tetrahedra is received from neighboring processes, this establishes a second renumbering between the local numberings of these border elements. Keeping track of these somewhat involved numbering schemes is paramount to ensuring the correctness of the solution. A good deal of time was spent keeping track of this information on each process.
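
As a simple illustration of the bookkeeping involved, each worker can keep arrays that map its local indices back to the global mesh numbering, and the manager can use them when it collects the distributed solution (steps 11 and 12 of the algorithm above). The names below are hypothetical, not the arrays used in the code.

! Sketch: scatter a worker's local block of the solution back into the global ordering.
subroutine gather_solution(ntet_local, p, tet_l2g, x_local, x_global)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in)        :: ntet_local, p
  integer, intent(in)        :: tet_l2g(ntet_local)      ! local-to-global tetrahedron map
  complex(dp), intent(in)    :: x_local(p, ntet_local)   ! worker's part of the solution
  complex(dp), intent(inout) :: x_global(p, *)           ! global solution (on the manager)
  integer :: k

  do k = 1, ntet_local
     x_global(:, tet_l2g(k)) = x_local(:, k)
  end do
end subroutine gather_solution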

            Between iterations of the computation, and even within each iteration, data must be shared with neighboring processes. The shared data is used to determine when to stop the computation and to ensure the solution is correctly computed across partition boundaries. Because this communication permeates the computation, we experience a synchronization of processes which keeps the rate of computation equal across all processes. This synchronization adversely affects performance, as the overall speed of computation is tied to the speed of the slowest process.

            One thought on how to relieve this synchronization is to partition the mesh into many more partitions than there are processors. The idea then becomes one of dynamically distributing work (partitions) to processors as they become available. This approach may prove unappealing, as the amount of information to be communicated increases greatly: we would need to send portions of the matrices D & C along with each partition, as well as the current state of the right-hand side for that partition. It also requires the manager to store the entire solution and all the matrices in its own memory, whereas part of the attractiveness of the current approach is that it enables larger problems to be solved that ordinarily would not fit into one machine's memory.

 

Results

            We did not perform a theoretical time complexity analysis of this code because the nature of the computation and communication is rather complex and dynamic. The main problem is that the grid is unstructured, which makes it impossible to predict the communication load. Furthermore, the number of unknowns per tetrahedron can vary from one tetrahedron to the next, and in the future the number of unknowns per tetrahedron will be able to change during runtime; this can greatly increase or decrease the amount of communication across border tetrahedra. In addition, the amount of communication depends heavily on the result of the partitioning code Metis and on the nature (size and structure) of the original mesh. We could perform an analysis based on an average partition with fixed numbers of unknowns for some particular problem, but we feel this analysis would be misleading at best. Perhaps a useful future exercise would be to perform a theoretical analysis on a simple "uniform" tetrahedral grid to give a "best case" for the code.

 

 

          tiny    small    sph1
Serial      99      844    4060
1 Proc     103      863    4080
2 Proc      70      582    2401
3 Proc      45      441    1586

Table 2. Time in seconds for serial and parallel UWVF code

 

            We give timing data for the serial and parallel bi-conjugate gradient codes in Table 2. These codes were run on three mesh sizes: a tiny mesh with 6,000 tetrahedra, a small mesh with 15,581 tetrahedra, and a large mesh (sph1) with 71,528 tetrahedra. The first row gives the times for the serial version of the code on each mesh; time is recorded in seconds. We use these times as the baseline for the speedup and efficiency computations. The subsequent rows give times for the parallel code on one through three processors. Tables 3 and 4 give speedup and efficiency respectively. Our results are encouraging. While the efficiency is not one, it does increase as the size of the mesh increases. The speedup of the parallel code is also encouraging; again, the best gains are made with the large mesh. With a larger mesh there are more internal tetrahedra per border tetrahedron within each partition, so the overhead of communication can be amortized over increased parallel computation.
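
For reference, the entries in Tables 3 and 4 follow from the serial baseline in Table 2; for example, for the sph1 mesh on 3 processors,

S_3 = T_serial / T_3 = 4060 / 1586 ≈ 2.56,     E_3 = S_3 / 3 ≈ 0.85.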

 

 

          tiny    small    sph1
1 Proc    0.96     0.98    1.00
2 Proc    1.42     1.45    1.69
3 Proc    2.22     1.91    2.56

Table 3. Speedup of parallel UWVF code

 

 

          tiny    small    sph1
1 Proc    0.96     0.98    1.00
2 Proc    0.71     0.72    0.85
3 Proc    0.74     0.64    0.85

Table 4. Efficiency of parallel UWVF code

 

Figure 5. Speedup parallel UWVF code

 

Figure 6. Efficiency of parallel UWVF code

 

Figures 5 and 6 graph the results presented in Tables 3 and 4 respectively. While we are encouraged by the shape of the speedup curves, the efficiency curves are not as flattering. We do not expect to achieve an efficiency of one, but we do hope to keep the efficiency above 70%. During testing of our code, we were only able to use three processors on an SGI Origin 2000. We believe that the efficiency will stabilize on large meshes when run on a larger number of processors.


Future Work

            What follows is a list of enhancements and areas of exploration for potential improvement to our current effort.


1.      We notice that creating the matrices C and D can take a large amount of time during execution. Is there any way of speeding up this process?


2.      Would it be worthwhile to parallelize mesh partitioning? At this point, we conjecture no. Mesh partitioning takes a negligible amount of time compared with the rest of the computation.


3.      Mesh generation can take a very long time. This would be a good spot to focus attention for future improvement.


4.      Consider other solvers (perhaps QMR) that require less synchronization during the iteration process.


 

5.      Test the parallel UWVF code on a very large mesh over more processors.


 

Conclusion

 

            Reviewing our results, we conclude that MUMPS is not a desirable solution alternative for our application. Direct methods typically require a great deal of memory to reach a solution. We are trying to find solutions on rather large meshes, so MUMPS requires more memory than we are currently able to provide.

 

            The parallel UWVF, on the other hand, fares well. While there are areas for improvement, we are encouraged by our results. The code achieved a speedup over multiple processors, and the efficiency, while not quite stellar, was not found to be offensive.

 

Appendix

 

Mesh partitioning example at: http://www.math.udel.edu/~monk/CISC879/code.html

 

Source code listing of the parallel code at: http://www.math.udel.edu/~monk/CISC879/HTML/ftagshtml/progindex.html

 

MUMPS Manual and sample code at: http://www.math.udel.edu/~monk/CISC879/HTML/ftagshtml/progindex.html

 

 

References

 

[1] O. CESSENAT, Application d'une nouvelle formulation variationnelle aux équations d'ondes harmoniques. Problèmes de Helmholtz 2D et de Maxwell 3D, PhD thesis, Université Paris IX Dauphine, 1996.

 

[2] O. CESSENAT and B. DESPRES, Application of the ultra-weak variational formulation of elliptic PDEs to the 2-dimensional Helmholtz problem, SIAM J. Numer. Anal., 35 (1998), pp. 255-299.