CISC879 Project 4: Parallelization of a Scattering Code
Final Report
Faculty Advisor: P. Monk
Group Members: J. Sun, E. Fellows, M. Jochen, L. Milano, & P. Dmitruk
Problem Description:
1. Motivation
The interaction of
electromagnetic waves with matter underlies many important engineering
applications. For example, radar involves the scattering of electromagnetic pulses from targets, and understanding the effects of microwave radiation on biological tissue involves the scattering of electromagnetic waves from complicated, penetrable targets. Another large class of problems of particular
interest to one of us (P. Monk) is microwave imaging and non-destructive
testing. In these problems we attempt to infer properties of hidden objects by
their effects on microwave signals. Examples are again radar, ground
penetrating radar (for example for mine detection) and attempts to use
microwaves to image inside the human body.
All these problems require an efficient code to
simulate the propagation of electromagnetic waves. In a recent paper, Cessenat
and Despres [1,2] suggested a new method for solving this problem called the
Ultra Weak Variational Formulation (UWVF). This formulation has potential
advantages over standard finite element methods. Dr. Monk has implemented a
serial code for this problem in F90. This code is the subject of our
parallelization attempts.
2. Problem Statement
Given a bounded domain Ω, the goal is to approximate the electric field E (a complex-valued vector function of position) that satisfies the Maxwell system

   ∇×(∇×E) - κ²εE = 0   in Ω,

where κ is the wave number of the radiation (the wavelength of the radiation is 2π/κ) and ε is the possibly complex and spatially varying relative permittivity of the material. In addition, the solution must satisfy a generalized impedance boundary condition on the boundary of Ω, where n denotes the unit outward normal to the boundary. Using different combinations of the impedance coefficient and the data function in this condition, we can simulate different types of boundaries: one choice gives a classical absorbing boundary condition, while another gives a perfect conductor (i.e. a metallic wall).
The UWVF shares many similarities with finite element codes. First, the region of interest is subdivided into tetrahedra using an automatic mesh generator (this would be an interesting subject for parallelization, but far too difficult for this semester!). Note that the resulting mesh is unstructured and may have elements of very different sizes. The advantage of such a mesh is that it can model complex geometric shapes well, but the disadvantage is that the elements are of variable size and have no simple relationship to one another (unstructured). A cut through of a simple mesh is shown in Figure 1.
Figure 1: A cut through a mesh for the domain between a sphere and a cube. The goal is to simulate
scattering from the inner sphere using the outer sphere as an artificial
boundary to terminate the mesh. This is a small test mesh with around 71,000
tetrahedra.
Once the mesh is available, the UWVF code performs two operations:
· The tetrahedra are processed and a large, sparse, complex-valued matrix is generated. In addition, a right-hand side vector is also computed.
· The matrix problem is solved by a simple iterative process requiring only matrix multiplies and dot products.
The UWVF is not a standard finite element method
(it is closer to a discontinuous Galerkin finite element method in spirit), so
many of the node based parallelization schemes in the literature need to be
adapted in this case. In particular, at the assembly stage, when computing the
matrices for one tetrahedron, it is necessary to use data from surrounding
tetrahedra (unlike finite elements in which this process is entirely local).
However the sparsity structure of the resulting matrix is quite simple.
Serial Implementation:
For the Ultra Weak Variational Formulation, due to Cessenat and Despres, we have the following general outline:
· Mesh the domain by cutting it up into tetrahedra. This mesh generation step is very time consuming, but only needs to be done once for a given domain and wave number.
· On each element K (tetrahedron) we represent the solution as a sum of simple exact solutions of Maxwell's equations. Each such solution is a plane wave propagating in a chosen direction with one of two independent polarizations, so the number of basis functions per element is twice the number of propagation directions used. These simple functions are called Plane Waves.
· Note the connection with spectral methods (exponential basis elements).
· Maxwell's equations are satisfied exactly on each tetrahedron, but there needs to be a way of imposing suitable continuity between elements. This is done by the UWVF.
For simplicity of exposition, we shall now assume ε = 1, although this is not necessary in our code. To construct the UWVF, let E satisfy the Maxwell system on a single tetrahedron K, together with an impedance-type condition on ∂K, where ∂K is the boundary of K and n_K is the outward normal to K. A second, adjoint field satisfying the conjugated Maxwell system on K, with a corresponding condition on ∂K, is also introduced.
[Figure: diagram of the relationship between elements in 2D, showing the orientation of the normal on an inter-tetrahedron boundary.]
Suppose K touches the tetrahedra K_1, K_2, K_3 and K_4 through its faces. Let Σ_j denote the common face between K and K_j, and let n_j denote the outward normal to K_j on Σ_j. Using integration by parts and Maxwell's equations, we obtain an identity, equation (1), relating the impedance traces of E on the faces of K to the corresponding traces from the neighboring tetrahedra K_j. This constrains the boundary values of E across mesh faces, since similar equations hold for each tetrahedron.
Using the plane wave expansion for E on each element, and taking each of the plane waves in turn as the test function in the previous equations, gives us a system of linear equations for the unknown expansion coefficients on each tetrahedron. We list the unknowns element by element in a vector X. The terms involving element K itself give rise to a block diagonal, Hermitian matrix D. The terms involving the neighboring tetrahedra give rise to a sparse matrix C with 4 non-zero blocks per block row. We need to solve DX = CX + b, and we do this by solving the equivalent system (I - D^{-1}C)X = D^{-1}b by the bi-conjugate gradient method; since D is block diagonal, D^{-1} can be applied element by element.
For the serial algorithm, we first "assemble" the matrices D and C by processing tetrahedron by tetrahedron and evaluating exactly the integrals in (1). Data is needed from neighboring tetrahedra through faces; this has implications for domain decomposition parallel schemes. Then we solve the linear system by the bi-conjugate gradient method. This requires matrix-vector multiplication by C (plus block solves with D), which can be done element by element as long as data on neighboring tetrahedra (via faces) is available.
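To make the element-by-element structure concrete, here is a sketch (our own illustration, not the routine in the code) of how a matrix-vector product with C might be organized. It assumes that each tetrahedron carries P unknowns stored contiguously, that NEIGH(1:4,K) lists the face neighbours of tetrahedron K (0 marking a boundary face), and that CBLK stores one P x P coupling block per face; all names are hypothetical.

  SUBROUTINE APPLY_C(NTETS, P, NEIGH, CBLK, X, Y)
  ! Apply the sparse coupling matrix C element by element.  Tetrahedron K
  ! has at most four face neighbours NEIGH(1:4,K) (0 marks a boundary
  ! face), and CBLK(:,:,F,K) is the assumed P x P block coupling K to its
  ! F-th neighbour.  X and Y hold P unknowns per tetrahedron.
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: NTETS, P
  INTEGER, INTENT(IN) :: NEIGH(4, NTETS)
  COMPLEX(KIND(1.0D0)), INTENT(IN)  :: CBLK(P, P, 4, NTETS)
  COMPLEX(KIND(1.0D0)), INTENT(IN)  :: X(P, NTETS)
  COMPLEX(KIND(1.0D0)), INTENT(OUT) :: Y(P, NTETS)
  INTEGER :: K, F

  Y = (0.0D0, 0.0D0)
  DO K = 1, NTETS
     DO F = 1, 4
        IF (NEIGH(F, K) > 0) THEN
           ! add the contribution of the neighbour seen through face F
           Y(:, K) = Y(:, K) + MATMUL(CBLK(:, :, F, K), X(:, NEIGH(F, K)))
        END IF
     END DO
  END DO
  END SUBROUTINE APPLY_C

Because each block row touches only the element itself and its face neighbours, the same loop works on a sub-mesh once the border data from neighbouring partitions has been received.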
Typical linear finite element methods need at least 10 grid points per wavelength to yield acceptable results (and usually more!). Experiments with the UWVF have shown that it also has a restriction on the minimal number of unknowns per wavelength (4 per wavelength so far). Due to this grid-points-per-wavelength requirement, we need many tetrahedra; for example, an aircraft 15m long at a 2mm wavelength would require a number of tetrahedra that is not possible today, and with the required minimum number of basis functions on each tetrahedron the total number of unknowns is enormous!
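As a rough, back-of-the-envelope illustration (our own figures, not the numbers elided in the text above): even at only 4 resolution cells per wavelength, a 15m object at a 2mm wavelength spans 7,500 wavelengths in each direction, so a bounding box around it would contain on the order of

   (4 x 7,500)^3 = 30,000^3 ≈ 2.7 x 10^13

cells, and hence a comparable number of tetrahedra, before even counting the several plane-wave unknowns carried by each element.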
Even for small
numbers of tetrahedra, the UWVF can be very time consuming and very memory
intensive. For example, a very small test problem using 71,528 tetrahedra with
20 basis functions per tetrahedron currently requires 2.5GByte of memory and
runs for 4,859 seconds (in serial mode) on a 195 MHz Silicon Graphics
Origin-2000. Realistic problems of simulating the interaction of microwaves
with the human head will require far more memory and computer time. This forces us to turn to parallel methods.
For our project, we investigated two approaches:
1. Application of MUMPS: Once the matrix is generated (and perhaps stored on disk) the matrix problem could be fed to an existing parallel direct sparse matrix code such as the MUltifrontal Massively Parallel Solver (MUMPS). This should be a relatively easy job (at least if the most straightforward matrix format is used). Statistics could be gathered comparing the performance of this code on different architectures (e.g. shared memory multi-processor, high performance cluster, poor network cluster...) and different sized problems. Some MPI coding may be needed, but the amount will be limited.
2. Data Parallelization or Domain Decomposition: Using partitioning software such as METIS, the grid can be partitioned among the processors. A picture of the mesh used in Figure 1, partitioned into 8 sub-meshes, is shown in Figure 2.
Figure 2:
The sphere mesh used in Figure 1 partitioned into 8 sub-meshes using METIS.
The matrix can then be constructed on each sub-grid (but this will require significant communication unlike in the finite element case). Having computed the matrix in parallel, we can then implement a parallel matrix multiplication routine and hence fully parallelize the code. This will require significant investment of time to take care of renumbering and indexing the grid and unknowns at various times in the solution process.
Application of MUMPS and Numerical Results:
Introduction to MUMPS:
MUMPS ("MUltifrontal Massively Parallel
Solver") is a package for solving linear systems of equations , where the matrix
is sparse and can be
either unsymmetric, symmetric positive definite, or general symmetric. MUMPS
uses a multifrontal technique which is a direct method based on either the
or the
factorization of the
matrix. MUMPS exploits both parallelism arising from sparsity in the matrix
and from dense
factorizations kernels.
The software is written in Fortran 90. It
requires MPI for message passing and makes use of BLAS, BLACS, and ScaLAPACK
subroutines.
MUMPS distributes the work among the processors, but an identified (host) processor is required to perform the analysis phase, distribute the incoming matrix to the other (slave) processors in the case where the matrix is centralized, collect the solution, and generally oversee the computation. The system is solved in three main steps:
1. Analysis. The host performs an approximate minimum degree ordering based on the symmetric pattern A + A^T, and carries out the symbolic factorization. A mapping of the multifrontal computational graph is then computed, and symbolic information is transferred from the host to the other processors. Using this information, the processors estimate the memory necessary for factorization and solution.
2. Factorization. The original matrix is first distributed to the processors that will participate in the numerical factorization. The numerical factorization of each frontal matrix is conducted by a master processor (determined by the analysis phase) and one or more slave processors (determined dynamically). Each processor allocates an array for contribution blocks and factors; the factors must be kept for the solution phase.
3. Solution.
The right-hand side is broadcast from the
host to the other processors. These processors compute the solution
using the
(distributed) factors computed during Step 2, and the solution is assembled on
the host.
MUMPS
allows the host processor to participate in computations during the factorization
and solve phases, just like a slave processor. This may lead to memory
imbalance if the host already stores the initial matrix, but it also allows us
to run MUMPS on a single processor and avoids one processor being idle during
the factorization and solve phases. If the user is able to input the matrix in
a distributed way or if memory is not an issue, we recommend this mode.
For
both the symmetric and the unsymmetric algorithms used in the code, MUMPS
chooses a fully asynchronous approach with dynamic scheduling of the
computational tasks. Asynchronous communication is used to enable overlapping
between communication and computation. Dynamic scheduling was initially chosen
to accommodate numerical pivoting in the factorization. The other important
reason for this choice is that, with dynamic scheduling, the algorithm can
adapt itself at execution time to re-map work and data to more appropriate
processors. In fact, MUMPS combines the main features of static and dynamic
approaches; it uses the estimation obtained during analysis to map some of the
main computational tasks; the other tasks are dynamically scheduled at
execution time. The main data structures (the original matrix and the factors)
are similarly partially mapped according to the analysis phase. Part of the
initial matrix is replicated to enable rapid task migration without data
redistribution.
In the Fortran 90 interface, there is a single user-callable subroutine called MUMPS that has a single parameter mumps_par of the Fortran 90 derived datatype MUMPS_STRUC. MPI must be initialized by the user before the first call to MUMPS. The calling sequence looks as follows:
      INCLUDE 'mpif.h'
      INCLUDE 'mumps_struc.h'
      ...
      INTEGER IERR
      TYPE (MUMPS_STRUC) :: mumps_par
      ...
      CALL MPI_INIT(IERR)
      ...
      CALL MUMPS(mumps_par)
      ...
      CALL MPI_FINALIZE(IERR)
The datatype MUMPS_STRUC holds all the data for the problem. It has many components that are internal to the package (and could be declared private). Some of the components must be defined only on the host; others must be defined on all processors. The file mumps_struc.h defines the derived datatype and must always be included in the program that calls MUMPS. The file mumps_root.h, which is included in mumps_struc.h, defines the datatype for an internal component root, and must also be available at compilation time. The components of MUMPS_STRUC that are of interest to the user are documented in the MUMPS manual (see the Appendix); the structure contains all parameters of the user interface, plus internal information.
The interface to MUMPS consists of calling the subroutine MUMPS with the appropriate parameters set in mumps_par. We append a sample program at the end of our report.
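To make the calling sequence concrete, the fragment below sketches how the centralized (host-only) input interface is typically driven. The control values JOB = -1 (initialize), JOB = 6 (analysis, factorization and solution in one call) and JOB = -2 (terminate), and the structure components COMM, SYM, PAR, N, NZ, IRN, JCN, A, RHS and MYID, are taken from the MUMPS documentation; the tiny 2 x 2 matrix is a made-up example, not one of our scattering systems.

  PROGRAM TEST_MUMPS
  INCLUDE 'mpif.h'
  INCLUDE 'mumps_struc.h'
  INTEGER IERR
  TYPE (MUMPS_STRUC) :: mumps_par

  CALL MPI_INIT(IERR)
  ! Initialize a MUMPS instance on MPI_COMM_WORLD
  mumps_par%COMM = MPI_COMM_WORLD
  mumps_par%SYM  = 0        ! unsymmetric matrix
  mumps_par%PAR  = 1        ! host also works during factorization/solve
  mumps_par%JOB  = -1       ! initialization
  CALL MUMPS(mumps_par)
  ! Define a (made-up) 2x2 problem on the host only, assembled format
  IF (mumps_par%MYID .EQ. 0) THEN
     mumps_par%N  = 2
     mumps_par%NZ = 3
     ALLOCATE( mumps_par%IRN(3), mumps_par%JCN(3) )
     ALLOCATE( mumps_par%A(3), mumps_par%RHS(2) )
     mumps_par%IRN = (/ 1, 1, 2 /)
     mumps_par%JCN = (/ 1, 2, 2 /)
     mumps_par%A   = (/ 2.0D0, 1.0D0, 3.0D0 /)
     mumps_par%RHS = (/ 3.0D0, 3.0D0 /)
  END IF
  ! Analysis, factorization and solution in one call
  mumps_par%JOB = 6
  CALL MUMPS(mumps_par)
  ! The solution overwrites RHS on the host
  IF (mumps_par%MYID .EQ. 0) PRINT *, 'Solution:', mumps_par%RHS
  ! Free internal storage and finalize
  mumps_par%JOB = -2
  CALL MUMPS(mumps_par)
  CALL MPI_FINALIZE(IERR)
  END PROGRAM TEST_MUMPS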
Modify UWVF to use MUMPS
In our serial method for the UWVF, we use bi-conjugate gradients to solve the linear system. Now we turn to MUMPS: instead of solving the linear system serially with bi-conjugate gradients, we want to call MUMPS.
So far, we have a complex linear system. In order to use MUMPS, we need to change it to a real linear system. Suppose we have Ax = b, where A is a complex matrix and b is a complex vector. Write A, x and b as A = A_r + i A_i, x = x_r + i x_i and b = b_r + i b_i, where the subscripts r and i denote the real part and the imaginary part. We obtain

   (A_r + i A_i)(x_r + i x_i) = b_r + i b_i.

Thus

   A_r x_r - A_i x_i = b_r,
   A_i x_r + A_r x_i = b_i.

Written in matrix form,

   [ A_r  -A_i ] [ x_r ]   [ b_r ]
   [ A_i   A_r ] [ x_i ] = [ b_i ].

Let M denote this real matrix, y = (x_r, x_i) and c = (b_r, b_i). We obtain a real linear system My = c. This is where we want to call MUMPS, but we need to be a little bit careful. Suppose A is a diagonal matrix. We can see that M, as written above, is not diagonal; the nice characteristic of A is not retained. Here we need a trick so as not to destroy the nice structure of A.
We can instead write the linear system by interleaving the real and imaginary parts, ordering the unknowns as (x_r,1, x_i,1, x_r,2, x_i,2, ...) and the equations in the same way, so that each complex entry a = a_r + i a_i of A is replaced by the 2 x 2 real block

   [ a_r  -a_i ]
   [ a_i   a_r ].

Here, a diagonal complex matrix A becomes a tri-diagonal real matrix, which retains the structure of A. Once we obtain the real linear system, we can use MUMPS to solve it.
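The interleaving above is easy to apply to a matrix stored in coordinate (triplet) format, which is also the centralized assembled input format expected by MUMPS. The sketch below is our own illustration of the conversion, not the exact routine used for the report; the array names are hypothetical.

  SUBROUTINE COMPLEX_TO_REAL(NZ, IRN, JCN, A, IRN2, JCN2, A2)
  ! Expand a complex sparse matrix in coordinate format into the
  ! interleaved real form described above: complex unknown k becomes
  ! real unknowns 2k-1 (real part) and 2k (imaginary part), and each
  ! complex entry becomes a 2 x 2 real block.
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: NZ
  INTEGER, INTENT(IN)  :: IRN(NZ), JCN(NZ)
  COMPLEX(KIND(1.0D0)), INTENT(IN) :: A(NZ)
  INTEGER, INTENT(OUT) :: IRN2(4*NZ), JCN2(4*NZ)
  DOUBLE PRECISION, INTENT(OUT) :: A2(4*NZ)
  INTEGER :: K, P
  DOUBLE PRECISION :: AR, AI

  DO K = 1, NZ
     AR = DBLE(A(K))
     AI = AIMAG(A(K))
     P  = 4*(K-1)
     ! first row of the block:  ( ar  -ai )
     IRN2(P+1) = 2*IRN(K) - 1;  JCN2(P+1) = 2*JCN(K) - 1;  A2(P+1) =  AR
     IRN2(P+2) = 2*IRN(K) - 1;  JCN2(P+2) = 2*JCN(K);      A2(P+2) = -AI
     ! second row of the block: ( ai   ar )
     IRN2(P+3) = 2*IRN(K);      JCN2(P+3) = 2*JCN(K) - 1;  A2(P+3) =  AI
     IRN2(P+4) = 2*IRN(K);      JCN2(P+4) = 2*JCN(K);      A2(P+4) =  AR
  END DO
  END SUBROUTINE COMPLEX_TO_REAL

Note the 4x growth in the number of stored entries, which matches the counts reported for our test problem below.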
Numerical
Results:
(Note: We ran the program on a 195 MHz Silicon Graphics Origin-2000 with four processors.)
Our test problem (verysmall.dat) has 226 vertices, 2138 faces and 1023 tetrahedra. The resulting complex linear system has 8184 unknowns and 315584 non-zero entries. After we convert it to the real linear system, we have 16368 unknowns and 1262336 non-zero entries.
Using MUMPS, we get output like the following:
MUMPS Version 4.1.6 --- March 2000
L U Solver for unsymmetric matrices
Type of parallelism: Working host
***** ANALYSIS STEP *****
Node parallelisms: ICNTL(12/13) = 2*0
A root of estimated
size 1680 has been selected for Scalapack
Mapping
takes memory into account
***** FACTORIZATION STEP *****
GLOBAL STATISTICS PRIOR NUMERICAL FACTORIZATION
...
NUMBER OF WORKING PROCESSES = 4
REAL SPACE FOR FACTORS = 12194112
...
***** SCALING OF ORIGINAL MATRIX
ROW AND COLUMN SCALING (1 Pass)
....
***** SOLVE & CHECK STEP *****
STATISTICS PRIOR SOLVE PHASE..
...
Table 1 shows the timing data of our test
problem:
# of processors used | time used on each processor (seconds)
1                    | 41.88
2                    | 37.05  37.05
3                    | 26.79  26.79  26.79
4                    | 23.66  23.66  23.66  23.66
Table 1. Timing Data
Figure 3. Speedup
of MUMPS code
In Figure 3 (speedup), we see a low speedup going from 1 processor to 2 processors, a better speedup from 2 to 3, and again a lower speedup from 3 to 4 processors. With 4 processors we obtain a speedup of almost 2. In the efficiency plot (Figure 4), the efficiency drops quickly from 1 processor to 2 processors, indicating that communication time and idle time on each processor eat up a lot of the run time.
Figure 4. Efficiency of MUMPS code
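For reference, speedup and efficiency here are the usual ratios

   S(p) = T(1) / T(p)   and   E(p) = S(p) / p,

so that, from Table 1, for example S(4) = 41.88 / 23.66 ≈ 1.77 and E(4) ≈ 0.44.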
Comments
on MUMPS and Numerical Results:
1. MUMPS is a direct solver for large sparse matrices. It usually takes more memory than iterative methods. We encountered a memory allocation error from MUMPS when we tried to solve a larger system with 24 thousand unknowns. According to one of the authors of MUMPS, we may reorder the matrix first, and MUMPS will then ask for less memory during the factorization phase.
2. For our test problem, we use the serial code to generate the linear system and then use MUMPS to solve it. Thus, our timing data covers only the part of the run in which MUMPS is used.
3. Because of the dynamic scheduling approach and the threshold partial pivoting used in MUMPS, the analysis phase cannot fully predict the space that will be required on each processor, and an upper bound is therefore used for memory allocation. Thus, when we use MUMPS, we may pass the analysis phase but still run into problems during the factorization phase. Recent numerical results show that MUMPS uses more memory than other parallel solvers such as SuperLU.
4. In
MUMPS Version 4.0, communications are asynchronous and based on immediate sends
with explicit buffering in user space. On the destination process, the
reception of the messages will be the key point for the synchronization and
scheduling of the work. In fact, message reception can be invoked in the
following three situations:
·
Dynamic scheduling: Blocking and non-blocking
receives are used to drive the scheduling of the tasks on each process.
·
Task ordering: A process may have to receive and
treat a "late" message to be able to finish its current task.
·
Insufficient space in send buffer: To avoid
deadlock, the corresponding process tries to receive messages until space
becomes available in its local send buffer.
A recent report on MUMPS says that the "overhead" time may take 25% to 50% of the total time, depending on the structure of the matrix and the number of processors used. Here, "overhead" means MPI calls and idle time due to communication or synchronization.
A
Parallel Approach for UWVF
Our second approach is to
parallelize the serial UWVF solver. In doing so, we hope to permit larger
problems to be solved by computing the solution more quickly and by requiring
less memory per computation node. Large scattering problems require a large
mesh, which may not fit into memory. By solving the problem across multiple
processors, we can decompose the mesh into smaller sub-problems, which will fit
in memory.
The
opportunity for parallelism exists where we solve the linear system by the bi-conjugate
gradient method. If the domain is partitioned across a number of processes,
each process can perform the matrix multiplication in parallel. In between each
iteration of computation, data for neighboring tetrahedra across boundaries
will need to be shared. Once all processes have detected convergence of the
iterative method, we have each process report its solution to the managing
process. A sketch of the algorithm follows:
1. MANAGER: Read in meshfile
2. MANAGER: Partition mesh via Metis
3. ALL: Broadcast resulting partition vector
from MANAGER
4. MANAGER: Read physical properties
5. ALL: Broadcast physical properties from
MANAGER
6. MANAGER: Loop for each process
6.1.1. Build local (partitioned) mesh for each
worker
6.1.2. Renumber/re-index tets, faces, & verts
6.1.3. Send renumbered package to each worker
7. ALL: Determine border tets with neighbors
8. ALL: Send/recv data for border tets
9. ALL: Compute matrices D & C
10. ALL: Compute rhs, solve by bi-conjugate
gradients, loop
10.1.1. Calculate values for x
10.1.2. Gather dot-products across all processes at each iteration to find the global variance (see the sketch below)
10.1.3. Share rhs info for border tets after each
iteration
10.1.4. When global variance is within tolerance,
send results to manager, exit
11. MANAGER: Collect rhs from workers
12. MANAGER: Reorder rhs from local ordering
to global ordering
13. MANAGER: Save results to file
The above is a high
level look at how we parallelized the serial UWVF code. Note the use of Metis
to partition the mesh. Metis is discussed in the following section.
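Step 10.1.2 above is where the distributed bi-conjugate gradient iteration needs global information: each dot product is formed from local contributions and summed across all processes. Below is a minimal sketch of such a global dot product (our own illustration, not code from the project); it assumes the MPI implementation provides the MPI_DOUBLE_COMPLEX datatype, and X and Y stand for the locally owned portions of two distributed vectors.

  FUNCTION GLOBAL_DOT(NLOC, X, Y) RESULT(GDOT)
  ! Dot product of two distributed complex vectors (step 10.1.2).  Each
  ! process holds the NLOC entries of its own partition; the local
  ! partial sum is combined with MPI_ALLREDUCE so that every process
  ! obtains the same global value.
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER, INTENT(IN) :: NLOC
  COMPLEX(KIND(1.0D0)), INTENT(IN) :: X(NLOC), Y(NLOC)
  COMPLEX(KIND(1.0D0)) :: GDOT, LDOT
  INTEGER :: IERR

  LDOT = SUM(CONJG(X)*Y)                  ! local contribution
  CALL MPI_ALLREDUCE(LDOT, GDOT, 1, MPI_DOUBLE_COMPLEX, MPI_SUM, &
                     MPI_COMM_WORLD, IERR)
  END FUNCTION GLOBAL_DOT

Because MPI_ALLREDUCE is a collective operation, every such dot product also acts as a synchronization point across the processes.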
Using
METIS
Domain decomposition
is accomplished using the software package METIS. The raw data consists of a
mesh of tetrahedra generated automatically and stored in a file which is read during run time. This file contains information
about the mesh (e.g. number of tetrahedra, faces and vertices ) and sufficient
detail regarding the spatial relationships between tetrahedra for METIS to
partition the mesh among processors. The desired distribution is such that the
number of elements assigned to each processor is approximately the same while
minimizing the number of adjacent elements assigned to different processors.
The objective of the former is to balance the computation time on each
processor while the purpose of the latter is to minimize the amount of
communication between processors. The call METIS_PartGraphKway() is used to attain such a distribution by minimizing the edge cut, which is defined as the number of edges that straddle partitions. According to the authors this method is extremely fast and efficient. In runs on mesh sizes ranging from 6,000 to 72,000 tetrahedra, the partitioning takes less than 0.01% of the overall execution time and thus has a negligible effect. Since the amount of data communicated for each tetrahedron is the same in this implementation, the communication between processors is also minimized. If the number of basis functions, p, varies from tetrahedron to tetrahedron, then the communication volume may not be minimal. This information cannot be determined until run time, but it can be included in the mesh partitioning process by using different weights on the nodes of the METIS adjacency graph (not yet implemented).
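For reference, the METIS 4 k-way partitioning call has the form sketched below. The tiny four-tetrahedron chain is a made-up example; in the real code the adjacency graph is the face-neighbour (dual) graph built from the mesh file, with one graph vertex per tetrahedron.

  PROGRAM PARTITION_EXAMPLE
  ! Minimal illustration of the METIS 4 k-way partitioning call.  The
  ! graph is a made-up chain of four tetrahedra (1-2-3-4) connected
  ! through shared faces, stored in compressed (xadj/adjncy) form with
  ! Fortran-style numbering.
  IMPLICIT NONE
  INTEGER :: NTETS, NPARTS, EDGECUT, WGTFLAG, NUMFLAG
  INTEGER :: OPTIONS(5), XADJ(5), ADJNCY(6), PART(4)
  INTEGER :: VWGT(1), ADJWGT(1)        ! dummies (unweighted graph)

  NTETS   = 4
  NPARTS  = 2
  XADJ    = (/ 1, 2, 4, 6, 7 /)        ! start of each neighbour list
  ADJNCY  = (/ 2, 1, 3, 2, 4, 3 /)     ! face neighbours of each tet
  WGTFLAG = 0                          ! no vertex or edge weights
  NUMFLAG = 1                          ! 1-based (Fortran) numbering
  OPTIONS = 0                          ! METIS default options

  CALL METIS_PartGraphKway(NTETS, XADJ, ADJNCY, VWGT, ADJWGT, &
                           WGTFLAG, NUMFLAG, NPARTS, OPTIONS, &
                           EDGECUT, PART)

  PRINT *, 'Edge cut:         ', EDGECUT
  PRINT *, 'Partition vector: ', PART
  END PROGRAM PARTITION_EXAMPLE

The node weights mentioned above would be supplied through the VWGT array with WGTFLAG set accordingly, if variable p were taken into account.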
An example of the process
of mesh partitioning and subsequent re-mapping of indices can be found at http://www.math.udel.edu/~monk/CISC879/code.html.
This simple example also describes the details of the communication requirements
during computation and what is involved in determining what information needs
to be communicated. Care was taken in the code to achieve parallel communication. Each processor may need to communicate with many other processors to share information about border tetrahedra. The goal during this communication phase was to have as many processor pairs as possible communicating this information simultaneously, rather than forcing the communication to be serial in nature. Considerable coding effort was spent in UseMetis.f90, matops.f90, and init_basis.f90 to parallelize the code.
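A standard way to keep pairs of processes exchanging data without deadlock is to pair up the sends and receives, for example with MPI_SENDRECV. The toy program below is our own illustration of the call pattern, not code from UseMetis.f90: it just trades one number with each neighbouring rank, whereas in the real code the message would be the packed data of the border tetrahedra shared with that neighbouring partition, and the pairs are scheduled so that many exchanges proceed at once.

  PROGRAM BORDER_EXCHANGE_TOY
  ! Toy illustration of deadlock-free pairwise exchange with
  ! MPI_SENDRECV: every process trades one value with each of its
  ! "neighbours" (here simply rank-1 and rank+1).
  IMPLICIT NONE
  INCLUDE 'mpif.h'
  INTEGER :: MYID, NPROC, IERR, NBR, K, STATUS(MPI_STATUS_SIZE)
  DOUBLE PRECISION :: SENDVAL, RECVVAL

  CALL MPI_INIT(IERR)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, MYID, IERR)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
  SENDVAL = DBLE(MYID)
  DO K = -1, 1, 2                      ! lower neighbour, then upper
     NBR = MYID + K
     IF (NBR .GE. 0 .AND. NBR .LT. NPROC) THEN
        CALL MPI_SENDRECV(SENDVAL, 1, MPI_DOUBLE_PRECISION, NBR, 0, &
                          RECVVAL, 1, MPI_DOUBLE_PRECISION, NBR, 0, &
                          MPI_COMM_WORLD, STATUS, IERR)
        PRINT *, 'Process', MYID, 'received', RECVVAL, 'from', NBR
     END IF
  END DO
  CALL MPI_FINALIZE(IERR)
  END PROGRAM BORDER_EXCHANGE_TOY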
Challenges
in Implementation
A
local numbering for faces, tetrahedra, and vertices is established when the
mesh is partitioned. This local numbering must be mapped back to the global
numbering for the mesh. The mapping must also re-index references between
faces, tetrahedra, and vertices within the local partition. Further, when
information for border tetrahedra is received from neighboring processes, this
establishes a second renumbering between local numberings of these border
elements. Keeping track of these somewhat involved numbering schemes is paramount to ensure correctness of the solution. A good deal of time was spent keeping track of this information on each process.
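At its core, the bookkeeping for one partition is a pair of index maps built from the METIS partition vector. The sketch below is our own minimal illustration (the array names are not those used in UseMetis.f90); similar maps are needed for faces and vertices, plus a further layer of maps for the border tetrahedra received from neighbouring processes.

  SUBROUTINE BUILD_MAPS(NTETS, PART, MYPART, NLOC, LOC2GLOB, GLOB2LOC)
  ! Build the local <-> global tetrahedron numbering for one partition.
  ! PART is the METIS partition vector and MYPART is the partition
  ! owned by this process.
  IMPLICIT NONE
  INTEGER, INTENT(IN)  :: NTETS, MYPART, PART(NTETS)
  INTEGER, INTENT(OUT) :: NLOC
  INTEGER, INTENT(OUT) :: LOC2GLOB(NTETS)   ! only the first NLOC entries used
  INTEGER, INTENT(OUT) :: GLOB2LOC(NTETS)   ! 0 if the tetrahedron is not owned
  INTEGER :: G

  GLOB2LOC = 0
  NLOC = 0
  DO G = 1, NTETS
     IF (PART(G) .EQ. MYPART) THEN
        NLOC = NLOC + 1
        LOC2GLOB(NLOC) = G          ! local -> global
        GLOB2LOC(G)    = NLOC       ! global -> local
     END IF
  END DO
  END SUBROUTINE BUILD_MAPS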
Between
each iteration of computation, and even within each round of computation, data
must be shared with neighboring processes. The shared data is used to determine
when to stop the computation and to ensure the solution is correctly computed
across partition boundaries. Because this communication permeates the computation, we experience a synchronization of processes which keeps the rate of computation equal across all processes. This synchronization adversely affects performance, as the speed of computation is tied to the speed of the
One
thought on how to relieve this synchronization is to partition the mesh into
many more partitions than there are processors. The idea then becomes one of
dynamically distributing work (partitions) to processors as they become
available to do work. This approach may prove unappealing, as the amount of information to be communicated increases greatly: we would need to send portions of the matrices D & C along with each partition, as well as the current state of the right-hand side for that partition. This approach also requires the manager to store the entire solution and all matrices in its own memory. Part of the attractiveness of the former approach is that it enables larger problems to be solved that ordinarily would not fit into one machine's memory.
Results
We did not perform a theoretical time complexity analysis of this code because the nature of the computation and communication is rather complex and dynamic. The main problem is that the grid is unstructured, so it is impossible to predict the communication load. Furthermore, the number of unknowns per tetrahedron can vary from one tetrahedron to the next. In the future, the number of unknowns per tetrahedron will be able to change during runtime. This can greatly increase or decrease the amount of communication across border tetrahedra. In addition, the amount of communication depends heavily upon the result of the partitioning code Metis and upon the nature (size and structure) of the original mesh. We could perform an analysis based on an average partition with a fixed number of unknowns for some particular problem, but we feel this analysis would be misleading at best. Perhaps a useful future exercise would be to perform a theoretical analysis on a simple "uniform" tetrahedral grid to give a "best case" for the code.
         | tiny | small | sph1
Serial   |   99 |   844 | 4060
1 Proc   |  103 |   863 | 4080
2 Proc   |   70 |   582 | 2401
3 Proc   |   45 |   441 | 1586
Table 2. Time in seconds for serial and parallel UWVF code
We give timing data for the serial and parallel bi-conjugate gradient codes in Table 2. These codes were run on three mesh sizes: a tiny mesh with 6,000 tetrahedra, a small mesh with 15,581 tetrahedra, and a large mesh (sph1) with 71,528 tetrahedra. The first row gives the times for a serial version of the code on each mesh. Time is recorded in seconds. We use these times as baseline times for the speedup and efficiency computations. The subsequent rows give times for the parallel code on one through three processors. Tables 3 and 4 give speedup and efficiency respectively. Our results are encouraging. While the efficiency is not one, it does increase as the size of the mesh increases. Speedup for the parallel code is also encouraging. Again, the best gains are made with the large mesh. With a larger mesh, there are more internal tetrahedra per border tetrahedron within each partition, so the overhead of communication can be amortized over increased parallel computation.
         | tiny | small | sph1
1 Proc   | 0.96 | 0.98  | 1.00
2 Proc   | 1.42 | 1.45  | 1.69
3 Proc   | 2.22 | 1.91  | 2.56
Table 3. Speedup of parallel UWVF code
         | tiny | small | sph1
1 Proc   | 0.96 | 0.98  | 1.00
2 Proc   | 0.71 | 0.72  | 0.85
3 Proc   | 0.74 | 0.64  | 0.85
Table 4. Efficiency of parallel UWVF code
Figure 5. Speedup of parallel UWVF code
Figure 6. Efficiency of parallel UWVF code
Figures 5 and 6 graph the results presented in Tables 3 and 4 respectively. While we are encouraged by the shape of the speedup curves, the efficiency curves are not as flattering. We do not expect to achieve an efficiency of one, but we do hope to keep the efficiency above 70%. During testing of our code, we were only able to use three processors on an SGI Origin 2000. We believe that the efficiency will stabilize on large meshes when run on a larger number of processors.
Future Work
What
follows is a list of enhancements and areas of exploration for potential
improvement to our current effort.
1. We
notice that creating the matrices C and D can take a large amount of time
during execution. Is there any way of speeding up this process?
2. Would
it be worthwhile to parallelize mesh partitioning? At this point, we conjecture
no. Mesh partitioning takes a negligible amount of time compared with the rest
of the computation.
3. Mesh
generation can take a very long time. This would be a good spot to focus
attention for future improvement.
4. Consider
other solvers (perhaps QMR) that require less synchronization during the
iteration process.
5. Test the parallel UWVF code on a very large mesh over more
processors.
Through review of our results, we conclude that MUMPS is not a desirable solution alternative for our application. Direct methods typically require a great deal of memory to reach a solution. We are trying to find solutions on rather large meshes, so MUMPS requires more memory than we are currently able to provide.
We find that the parallel UWVF fares well. While there are areas for improvement, we are encouraged by our results. The code achieved a speedup over multiple processors, and the efficiency, while not quite stellar, was not found to be offensive.
Appendix
Mesh partitioning example at: http://www.math.udel.edu/~monk/CISC879/code.html
Source code listing of the parallel code at: http://www.math.udel.edu/~monk/CISC879/HTML/ftagshtml/progindex.html
MUMPS Manual and sample code at: http://www.math.udel.edu/~monk/CISC879/HTML/ftagshtml/progindex.html
References
[1] O. Cessenat, Application d'une nouvelle formulation variationnelle aux équations d'ondes harmoniques. Problèmes de Helmholtz 2D et de Maxwell 3D, PhD thesis, Université Paris IX Dauphine, 1996.
[2] O. Cessenat and B. Despres, Application of the ultra-weak variational formulation of elliptic PDEs to the 2-dimensional Helmholtz problem, SIAM J. Numer. Anal., 35 (1998), pp. 255-299.