Final Report for Project 5
Advisors:
Dr. L. P. Wang
Dr. P. Dmitruk
Members:
Orlando Ayala
Ben Breech
Peng Gao
Introduction:
Turbulence
is an unstable state of viscous fluid flow and is the most common flow regime
in nature and in engineering applications. Air motion in the earth's atmosphere,
ship wakes, flows in pipelines and chemical reactors are just a few examples.
Turbulence produces rapid mixing and transport of passive pollutants and
particulate matter. Analytical solutions to even the simplest turbulent flows
do not exist. Direct numerical simulation (DNS) provides a numerical solution
of the flow variables as a function of space and time by numerically solving
the exact mathematical equations governing a turbulent flow. DNS has been used
effectively as a research tool since its invention in the early 1970s (Orszag
and Patterson 1972).
Significant
insight into turbulence physics has been gained from DNS of certain idealized
flows that cannot be easily attained in the laboratory (Moin and Mahesh
1998). For example, for the case of isotropic turbulence, DNS has helped us
(a) identify the small-scale vortex tubes and their physical properties,
(b) examine the hypotheses underlying the classical Kolmogorov-Obukhov theory
of the inertial range, (c) explore the locality of the spectral energy transfer,
(d) quantify mixing, structure, and turbulent combustion of passive scalars,
and (e) reformulate statistical models
for transport and coagulation of heavy particles. In quite a few cases, DNS
data have been used to evaluate measurement accuracy (e.g., Moin & Mahesh
1998). We believe that DNS, viewed as novel numerical experiments, continues to
be a complementary tool to laboratory experiments in turbulence research.
Despite the remarkable contributions made in recent years through DNS, a known
limitation is the relatively low Reynolds number attainable (to date, Re is
about 250 or less). A major effort has been to increase the
flow Reynolds number. This project
represents a first step towards this goal by optimizing DNS codes on scalable
parallel computers.
Objective:
Our
faculty advisor, Dr. Wang, has recently developed two MPI codes to simulate
homogeneous turbulence on scalable parallel computers. The codes are based on
two distinct numerical methods. The
first or spectral code is based on a 3D Fourier representation of the
Navier-Stokes equations (the governing equations for macroscopic fluid
motion). The second is based on the
discretized Boltzmann equation and as such is known as the lattice Boltzmann method
(LBM). The two methods solve completely different equations but macroscopically
give the same turbulent flow field if they share the same initial condition.
The
overall objective of this project is to analyze these codes carefully and test
their performance on the cluster machine so that their capabilities for
simulating turbulence can be explored and compared.
Program
and method description:
Spectral method description (SM)
The spectral method solves the Navier-Stokes equation:
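(in its standard incompressible form; the exact notation of the code may differ)

    \frac{\partial u_i}{\partial t} + u_j \frac{\partial u_i}{\partial x_j} = -\frac{\partial p}{\partial x_i} + \nu \nabla^2 u_i , \qquad \frac{\partial u_j}{\partial x_j} = 0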
The equation is solved in Fourier space (k-space), through the following
transformation:
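(the standard discrete Fourier representation, assuming a periodic box with N^3 grid points)

    u_i(\mathbf{x},t) = \sum_{\mathbf{k}} \hat{u}_i(\mathbf{k},t)\, e^{i\mathbf{k}\cdot\mathbf{x}} , \qquad \hat{u}_i(\mathbf{k},t) = \frac{1}{N^3} \sum_{\mathbf{x}} u_i(\mathbf{x},t)\, e^{-i\mathbf{k}\cdot\mathbf{x}}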
This leads to the following equation in Fourier space:
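(a standard reconstruction, with \hat{A}_i(\mathbf{k},t) the Fourier transform of the nonlinear term and \hat{p}(\mathbf{k},t) the transformed pressure)

    \left( \frac{\partial}{\partial t} + \nu k^2 \right) \hat{u}_i(\mathbf{k},t) = -i k_i\, \hat{p}(\mathbf{k},t) + \hat{A}_i(\mathbf{k},t)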
Calling the sum of the first and second terms on the right-hand side of this
equation C(k,t), and using the following change of variables:
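(the usual integrating-factor transformation)

    Q_i(\mathbf{k},t) = e^{\nu k^2 t}\, \hat{u}_i(\mathbf{k},t)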
The
equation is simplified to:
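(substituting Q removes the explicit viscous term)

    \frac{\partial Q_i(\mathbf{k},t)}{\partial t} = e^{\nu k^2 t}\, C_i(\mathbf{k},t)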
which, when discretized in time, becomes:
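(shown here, for illustration only, as a first-order explicit step; the actual code may well use a higher-order time-integration scheme)

    Q_i(\mathbf{k}, t+\Delta t) \approx Q_i(\mathbf{k},t) + \Delta t\, e^{\nu k^2 t}\, C_i(\mathbf{k},t)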
Using this expression and the transformation from Q back to u (the velocity in
k-space), the velocity field can readily be calculated at each time step. From
the above equation, it is clear that the term C(k,t) is needed. Using the
continuity equation in k-space, the Navier-Stokes equation in k-space can be
simplified to obtain B(k,t), so that C(k,t) can be calculated from A(k,t):
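(the standard pressure elimination, reading B(\mathbf{k},t) as the pressure contribution -i k_i \hat{p}, obtained by dotting the k-space equation with \mathbf{k})

    k_j\, \hat{u}_j(\mathbf{k},t) = 0 \quad\Longrightarrow\quad C_i(\mathbf{k},t) = \left( \delta_{ij} - \frac{k_i k_j}{k^2} \right) \hat{A}_j(\mathbf{k},t)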
To summarize: to advance the velocity field to the next time step, C(k,t) is
needed, and C(k,t) can be calculated from A(k,t).
The process can be reduced to the following steps:
At t=0 we know u(x,0); using an FFT we obtain u(k,0).
At each time step u(k,t) is known and we want u(k,t+dt); then:
1) From u(k,t), form the k-space quantities that enter the nonlinear term.
2) Inverse fast Fourier transforms to bring these quantities to real space.
3) Compute the nonlinear term A(x,t) in real space.
4) Fast Fourier transform A(x,t) back to k-space to obtain A(k,t).
5) Force conjugate symmetry on the kx = 0 plane; this step is necessary to
ensure that the physical results remain real.
6) Compute C(k,t).
7) Compute Q(k,t+dt), from which u(k,t+dt) is obtained. Then repeat all the
steps to calculate the next time step.
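To make the sequence concrete, the following is a minimal single-processor
sketch of one such time step, written in Python/numpy (the actual codes are MPI
FORTRAN). It assumes a 2*pi-periodic box, the rotational form u x omega for the
nonlinear term, and a first-order explicit update; the production code is
distributed over processors and may use a different nonlinear form and a
higher-order time scheme.

import numpy as np

N, nu, dt = 32, 0.01, 1e-3                         # resolution, viscosity, time step
k1 = np.fft.fftfreq(N, d=1.0 / N)                  # integer wavenumbers for a 2*pi box
KX, KY, KZ = np.meshgrid(k1, k1, k1, indexing="ij")
K = np.stack([KX, KY, KZ])                         # wavevector, shape (3, N, N, N)
K2 = np.sum(K * K, axis=0)
K2_safe = np.where(K2 == 0.0, 1.0, K2)             # avoid dividing by zero at k = 0

def project(v_hat):
    """Remove the compressive part: (delta_ij - k_i k_j / k^2) v_hat_j."""
    div = np.sum(K * v_hat, axis=0)
    return v_hat - K * div / K2_safe

def step(u_hat):
    """Advance the k-space velocity u_hat (shape (3, N, N, N)) by one time step."""
    # steps 1-2: bring velocity and vorticity to real space with inverse FFTs
    u = np.real(np.fft.ifftn(u_hat, axes=(1, 2, 3)))
    w = np.real(np.fft.ifftn(1j * np.cross(K, u_hat, axis=0), axes=(1, 2, 3)))
    # step 3: nonlinear term A(x,t) = u x omega (rotational form) in real space
    A = np.cross(u, w, axis=0)
    # step 4: forward FFT of the nonlinear term
    A_hat = np.fft.fftn(A, axes=(1, 2, 3))
    # step 5 (enforcing symmetry on the kx = 0 plane) is omitted in this sketch
    # step 6: project out the pressure to obtain C(k,t)
    C_hat = project(A_hat)
    # step 7: advance Q = exp(nu k^2 t) u_hat with a first-order explicit step,
    # then return the new u_hat (the integrating factor is applied over dt)
    return np.exp(-nu * K2 * dt) * (u_hat + dt * C_hat)

# usage: start from a random divergence-free field and take one step
u_hat = project(np.fft.fftn(np.random.randn(3, N, N, N), axes=(1, 2, 3)))
u_hat = step(u_hat)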
As can be seen from the above, the FFT is an essential computational component of Galerkin and pseudo-spectral codes. For the implementation and performance study of the multidimensional FFT on a distributed-memory Beowulf cluster, we focus on the three-dimensional real transform. The approach studied is a one-dimensional domain decomposition algorithm that relies on a communication-intensive transpose operation involving all P processors. Communication is based upon the standard, portable Message Passing Interface (MPI).
· The transpose algorithm is optimized for simultaneous block communication by all processors;
· Communication is arranged as non-overlapping pair-wise exchanges between processors, thus eliminating blocking when standard fast Ethernet interconnects are employed.
This method provides the basis for implementation of scalable and efficient spectral method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. [1]
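The following small serial numpy example illustrates the idea that the slab
(one-dimensional) decomposition exploits: a 3-D FFT can be computed as 2-D FFTs
within each plane, followed by a single data transpose, followed by 1-D FFTs
along the remaining direction. In the parallel code the transpose is the
communication-intensive MPI step; here it is only an axis permutation, purely
for illustration.

import numpy as np

N = 16
a = np.random.randn(N, N, N)

# (a) 2-D FFTs in the (x, y) planes: each processor could do this on its own z-slabs
stage1 = np.fft.fftn(a, axes=(0, 1))

# (b) transpose so that full z-columns become local; in the parallel code this is
#     the all-processor MPI transpose, here it is just an axis permutation
transposed = np.ascontiguousarray(np.transpose(stage1, (2, 0, 1)))

# (c) 1-D FFTs along what used to be the z direction (now axis 0)
stage2 = np.fft.fft(transposed, axis=0)

# undoing the transpose recovers exactly the direct 3-D FFT
result = np.transpose(stage2, (1, 2, 0))
assert np.allclose(result, np.fft.fftn(a))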
Lattice
Boltzmann method description (LBM)
The lattice Boltzmann method is a parallel and efficient algorithm for
simulating single-phase and multiphase fluid flows and for incorporating
additional physical complexity. The LBM is particularly successful in fluid
flow applications involving interfacial dynamics and complex boundaries.
This method simulates the motion of the fluid through a particle distribution
function (f). The particles are allowed to move and collide with each other,
and their movement is constrained to a particular set of directions.
Some key equations to be solved in the 3D LBM code:
Using the 15-velocity (D3Q15) model, each particle moves along one of a
discrete set of directions a, with the following velocities:
(0, 0, 0), a = 0;
(±1, 0, 0), (0, ±1, 0), (0, 0, ±1), a = 1, 2, 3, 4, 5, 6;
(±1, ±1, ±1), a = 7, 8, 9, 10, 11, 12, 13, 14.
At each time step, the particle distribution function is calculated; from it we
recover the velocity and density fields. The distribution function evolves
according to the following equation:
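(the standard lattice BGK form, with relaxation time \tau)

    f_a(\mathbf{x} + \mathbf{e}_a \Delta t,\; t + \Delta t) - f_a(\mathbf{x},t) = -\frac{1}{\tau} \left[ f_a(\mathbf{x},t) - f_a^{eq}(\mathbf{x},t) \right]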
The right-hand side of the equation is the so-called collision term, and the left-hand side is the streaming term. The collision term contains the particle distribution function at equilibrium, which can be calculated as follows:
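(written in the standard D3Q15 form, in lattice units; the coefficients used in the code may differ in detail)

    f_a^{eq} = w_a\, \rho \left[ 1 + 3 (\mathbf{e}_a \cdot \mathbf{u}) + \frac{9}{2} (\mathbf{e}_a \cdot \mathbf{u})^2 - \frac{3}{2}\, \mathbf{u}\cdot\mathbf{u} \right]

where the weights are w_0 = 2/9, w_a = 1/9 for a = 1, ..., 6, and w_a = 1/72 for a = 7, ..., 14.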
It can be shown that the Navier-Stokes equation is recovered from these equations, which means that by solving them we are solving the Navier-Stokes equation behind the scenes. The density and velocity fields are calculated from the particle distribution function as follows:
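(the standard moment relations)

    \rho = \sum_a f_a , \qquad \rho\, \mathbf{u} = \sum_a f_a\, \mathbf{e}_a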
The process can be reduced to the following simple steps:
At t=0, we calculate the initial particle distribution function from the
initial velocity field u(x,0), using the equilibrium particle distribution
function.
At each time step the distribution function is known and we want it at t+dt;
then:
1) Compute the collision term. At this stage, only information at the same
node location is needed.
2) Compute the streaming. At this stage, some values of the particle
distribution function at neighboring nodes are needed.
3) Compute the density and velocity from the particle distribution function.
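The following is a minimal single-processor Python/numpy sketch of these three
steps for a D3Q15 lattice (the actual code is MPI FORTRAN and distributes the
domain); lattice units (dx = dt = 1), periodic boundaries, and the standard
D3Q15 weights are assumed.

import numpy as np

# the 15 discrete velocities e_a and their weights w_a (D3Q15)
e = np.array(
    [[0, 0, 0],
     [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
    + [[sx, sy, sz] for sx in (1, -1) for sy in (1, -1) for sz in (1, -1)])
w = np.array([2.0 / 9.0] + [1.0 / 9.0] * 6 + [1.0 / 72.0] * 8)

N, tau = 16, 0.6                       # grid size and BGK relaxation time

def feq(rho, u):
    """Equilibrium distribution, shape (15, N, N, N)."""
    eu = np.einsum('ad,dxyz->axyz', e, u)          # e_a . u at every node
    usq = np.sum(u * u, axis=0)
    return w[:, None, None, None] * rho * (1.0 + 3.0 * eu + 4.5 * eu**2 - 1.5 * usq)

def moments(f):
    """Density and velocity recovered from the distribution function."""
    rho = np.sum(f, axis=0)
    u = np.einsum('ad,axyz->dxyz', e, f) / rho
    return rho, u

def step(f):
    """One LBM time step: collision, streaming, then the macroscopic fields."""
    rho, u = moments(f)
    f = f - (f - feq(rho, u)) / tau                       # 1) collision (local)
    for a in range(15):                                   # 2) streaming (neighbors)
        f[a] = np.roll(f[a], shift=tuple(e[a]), axis=(0, 1, 2))
    return (f,) + moments(f)                              # 3) density and velocity

# usage: start from rest with uniform density and take one step
f = feq(np.ones((N, N, N)), np.zeros((3, N, N, N)))
f, rho, u = step(f)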
The approach studied is a one-dimensional domain decomposition algorithm that
relies on communication-intensive operations among the P processors when
calculating step 2, the streaming step: as mentioned above, this step needs
information from neighboring nodes. If the physically neighboring nodes reside
on other processors, this information must be communicated. Communication is
based upon the standard, portable Message Passing Interface (MPI) and involves
only pairs of processors; each processor communicates only with its neighbors.
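A minimal sketch of this pairwise exchange, written with Python/mpi4py rather
than the FORTRAN MPI of the actual code, is shown below; the buffer shapes and
tags are illustrative only.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
up, down = (rank + 1) % size, (rank - 1) % size      # periodic 1-D neighbors

N = 16
send_up = np.random.randn(N, N)        # plane of values streaming out through the top
send_down = np.random.randn(N, N)      # plane of values streaming out through the bottom
recv_from_down = np.empty((N, N))
recv_from_up = np.empty((N, N))

# non-overlapping pairwise exchange: everyone sends up / receives from below,
# then sends down / receives from above; Sendrecv pairs the two and avoids deadlock
comm.Sendrecv(send_up, dest=up, sendtag=0, recvbuf=recv_from_down, source=down, recvtag=0)
comm.Sendrecv(send_down, dest=down, sendtag=1, recvbuf=recv_from_up, source=up, recvtag=1)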
Results
Analysis:
We ran into problems implementing the codes on the DEC Alpha cluster, since we
could not run on more than 2 processors there when using MPI FORTRAN. We
therefore ported both methods to the Beowulf cluster, where the programs run
well. Several runs were done with both the SM and LBM codes, at resolutions of
16^3, 32^3, 64^3, and 128^3, on P = 1, 2, 4, 8, and 16 processors of the
Beowulf cluster.
Physical results:
First of all, we have to verify that both methods give the same results; it
would not make sense to compare two methods computationally if they did not
agree physically.
Fig.
1. Kinetic energy and Dissipation for a
problem size of 32^3.
Fig.
2. Kinetic energy and Dissipation for a
problem size of 64^3.
Fig.
3. The energy spectrum for a problem
size of 32^3.
Fig.
4. The energy spectrum for a problem
size of 64^3.
Figures 1, 2, 3 and 4 show physical results extracted from runs of size 32^3
and 64^3. Both sizes were run using the two methods, LBM and SM. The results
show very good agreement between the two methods: either method gives the same
results, so the choice between them rests on computational aspects. In this
project we try to analyze the computational aspects of LBM and SM, both
computation and communication, when the two programs are run on several
processors.
Timing results:
This section shows the time each program (LBM and SM) spends in communication
and computation, and the total time, which is the sum of the communication and
computation times. The timing was obtained by calling the wall-clock timer
(w_time) before and after each section of the program that either computes or
communicates. Another important point is that the time shown in the plots is
the time taken to compute one time step of the problem with either method.
Generally speaking, something like 5000 steps or more are needed to get
interesting physical results. A few simple calculations make it clear why the
programs need to be parallelized: for a 128^3 run, it could take a day to
obtain good results.
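The instrumentation pattern is simply to bracket each computation and
communication section with wall-clock timer calls and accumulate the two totals
separately; a small Python sketch (with placeholder work standing in for the
real computation and communication) is:

import time

def do_computation():                  # placeholder for the FFT / collision work
    sum(i * i for i in range(100000))

def do_communication():                # placeholder for the MPI transpose / exchange
    time.sleep(0.001)

comp_time = comm_time = 0.0
n_steps = 10
for _ in range(n_steps):
    t0 = time.perf_counter()
    do_computation()
    comp_time += time.perf_counter() - t0

    t0 = time.perf_counter()
    do_communication()
    comm_time += time.perf_counter() - t0

print(comp_time / n_steps, comm_time / n_steps)    # per-step averages, as plotted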
Fig.
5. Communication time as function of number of processors P for different
sizes.
As can be seen above, for the spectral method at data sizes 128^3 and 64^3 the
communication time decreases slowly as the number of processors increases. The
reason is that the amount of data each processor has to communicate becomes
smaller and smaller as the number of processors grows.
For the 32^3 case the SM communication time changes little with the number of
processors, though it still decreases as for the higher-resolution problems.
However, for the 16^3 case the communication time increases slightly with the
number of processors. We think this is due to a characteristic of the Linux
systems used: when the amount of data to communicate is very small, the fixed
per-message overhead makes the transmission time relatively higher.
Interestingly,
for the Lattice Boltzmann code, the communication time should be proportional
to N^2, with no dependence on p. Yet,
for small values of p, the graphs clearly show some variations with p. They seem to settle down into a straight
line once p goes to 8 and above. We
hypothesize that this behavior should continue for larger data sets, but that
has yet to be confirmed.
Fig
6. Computational time as function of number of processors P for different
sizes.
Looking at Fig. 6, for all data sizes (16^3, 32^3, 64^3, 128^3) the computation
time clearly decreases with the number of processors for both methods. This is
expected: as the number of processors increases, each processor has a smaller
amount of data to work on and therefore takes less time to compute.
It is important to point out, comparing Figs. 5 and 6, that for the problems we
solved the computation time is higher than the time each processor spends
communicating. Improvements to the communication scheme will therefore probably
help little.
Fig
7. Total time as function of number of processors P for different sizes.
Fig. 7 shows the total time. Generally, for all data sizes (16^3, 32^3, 64^3,
128^3) the total time clearly decreases with the number of processors for both
methods; parallel programming therefore reduces the time needed for the
scientific computation. One important behavior visible in this plot is that the
two methods take almost the same amount of time per time step, at least for the
higher-resolution problems. What happens at resolutions higher than the ones we
studied (above 128^3) still has to be investigated.
Other important plots are the efficiency and the speed-up, shown in Figs. 8 and
9. The overall computational efficiency increases with problem size (N) and
decreases with P, just as we expected.
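Here the speed-up and efficiency are the usual ones (which we take to be the quantities plotted): with T(P) the time per step on P processors,

    S(P) = \frac{T(1)}{T(P)} , \qquad E(P) = \frac{S(P)}{P}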
The efficiency plot is important since it can be used to evaluate a parallel
program. The efficiency always tends to decrease with the number of processors
(P), but a well-behaved parallel program is one whose efficiency stays roughly
constant or drops only slightly as P grows. Looking at Fig. 8, for the SM the
128^3 and 64^3 problems have good efficiency, whereas for the smaller problems
the efficiency decreases steadily. For the LBM we could not obtain data at
128^3; for 64^3 and 32^3 the efficiency stays close to one, while for 16^3 it
drops steadily.
In general, both methods have well-behaved efficiency when solving
higher-resolution problems.
Fig
8. Efficiency as function of number of processors P for different sizes.
Fig
9. Speed Up as function of number of processors P for different sizes.
From Fig. 9 it can be seen that the LBM has a much better speed-up than the
spectral method. At this point we do not want to claim that the LBM is an
alternative to the spectral method for turbulence simulation, or even a better
method, although it is tempting to conclude from the efficiency and speed-up
plots that the LBM is much better.
Remember that the time each program takes per time step is roughly the same.
Based on this, either method is good. The behavior seen in the speed-up and
efficiency plots may simply be because the lattice Boltzmann method does not
perform well on a single processor; the LBM is always better used with several
processors.
Further studies are needed before we can draw a solid conclusion about which
method is better for the type of problem we are solving.
In the future, both methods need to be run on larger numbers of processors. As
we said, the efficiency and speed-up are of great concern at larger processor
counts. For the LBM, since the communication time does not depend on P, there
will come a point where each process spends more time communicating than doing
useful computation, so we expect the efficiency to start dropping off. For the
SM, looking at the computation and communication time plots, there will also be
such a point, because the computation time drops faster than the communication
time as the number of processors increases. It will be interesting to see how
the efficiency behaves.
Complexity analysis
One important study to perform when analyzing parallelized codes is a
complexity analysis, i.e., developing an approximate quantitative model for the
execution time.
In general, the total time a program takes is the sum of the computation time
and the communication time:
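(in the obvious notation)

    T_{total} = T_{comp} + T_{comm}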
For the spectral method:
The computational time can be modeled as the sum of the time for non-FFT
computations and the time for FFT computations. The following equations show
this [1]:
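(a plausible form for the model of [1]: the non-FFT work scales as the local number of grid points, and the FFT work carries an extra logarithmic factor; the exact expression in [1] may differ)

    T_{non\text{-}FFT} \approx t_1\, \frac{N^3}{P} , \qquad T_{FFT} \approx t_2\, \frac{N^3 \log_2 N}{P}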
with t_1 and t_2 approximately constant.
We decided to plot the ratio of the measured time to the non-constant term of
each equation. If the model is correct, all the curves should collapse onto the
same horizontal line, at the value of t_1 or t_2.
Fig.
10. Complexity Analysis for FFT computations in the SM.
Fig. 10 shows this plot for the FFT computations. Almost all the curves lie at
the same value, except for the 16^3 problem. We think this is probably because
the FFT implementation works much better for bigger problems. It can be
concluded that the model used to quantify the FFT computation time is correct.
Fig.
11. Complexity Analysis for NON FFT computations in the SM.
Fig. 11 shows the same type of plot for the non-FFT computations. For 128^3 and
64^3 the model works fine. For the 16^3 problem, and for the 32^3 problem on 8
and 16 processors, the value is lower than the model predicts. We believe this
is because in those cases the data are small enough that the calculations are
handled by a faster level of the machine's memory (cache).
The communication time can be modeled with the following equation [1]:
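(again a plausible form of the model in [1]: in the transpose each processor moves its local share of roughly N^3/P data)

    T_{comm} \approx t_3\, \frac{N^3}{P}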
with t_3 approximately constant.
Fig.
12. Complexity Analysis for communications in the SM.
Fig. 12 shows the corresponding plot for the communication time. Again, for the
128^3 and 64^3 problems the model works just fine, while 16^3, and 32^3 on 8
and 16 processors, give higher values. As before, this is because when the
amount of data to communicate is very small, the fixed per-message overhead
makes the transmission time relatively higher.
In general, the model for the spectral method is reliable and can be used to
predict the timing behavior for higher numbers of processors and
higher-resolution problems. The constants of proportionality can be determined
experimentally: for the best-behaved runs seen in the plots, we determined the
"constants" and then averaged them together. The non-FFT computation constant
(t_1) works out to be 3.78x10^-6, the FFT computation constant (t_2) is
7.36x10^-7, and the communication constant (t_3) is 7.37x10^-6.
For the lattice Boltzmann method (LBM) we have:
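(consistent with the scalings described below: boundary planes of size N^2 for communication, and N^3/P grid points per processor for computation)

    T_{comm} \approx c_{comm}\, N^2 , \qquad T_{comp} \approx c_{comp}\, \frac{N^3}{P}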
with c_comm and c_comp approximately constant.
The lattice Boltzmann code has a fairly simple complexity analysis. On each
time step, each processor sends and receives on the order of N^2 worth of data,
so the communication time is proportional to N^2, with the constant of
proportionality including the time it takes to send a word of data through the
network. Similarly, the computation time is proportional to N^3/P, where P is
the number of processors.
Fig.
13. Complexity Analysis for computations in the LBM.
Fig.
14. Complexity Analysis for communications in the LBM.
From the plots for the LBM, it is apparent that the model does not work well at
all: the curves do not lie at the same position as one another. Either the
model has to be improved, the code has to be improved, or the timing
measurements have to be reviewed.
Even so, we determined both constants of proportionality experimentally. For
each run we did, we determined the "constants" and then averaged them together.
While obviously not perfect, this method may give a good idea of what the
constants should be. The communication constant (c_comm) works out to be
1.68x10^-5, with the computation constant (c_comp) being 1.22x10^-5.
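For reference, the following small Python sketch evaluates the execution-time
models above using these experimentally determined constants; the functional
forms and the pairing of constants to terms follow our reading of the models
and should be treated as approximate.

import math

# spectral method constants (averaged fits quoted above; units assumed to be
# seconds per grid point or per word communicated)
T1_NONFFT = 3.78e-6      # non-FFT computation
T2_FFT = 7.36e-7         # FFT computation
T3_COMM = 7.37e-6        # communication (transpose)

# lattice Boltzmann constants
C_COMM = 1.68e-5
C_COMP = 1.22e-5

def sm_time_per_step(N, P):
    comp = T1_NONFFT * N**3 / P + T2_FFT * N**3 * math.log2(N) / P
    comm = T3_COMM * N**3 / P
    return comp + comm

def lbm_time_per_step(N, P):
    return C_COMP * N**3 / P + C_COMM * N**2

# usage: predicted seconds per time step for a 128^3 problem on 16 processors
print(sm_time_per_step(128, 16), lbm_time_per_step(128, 16))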
Conclusions:
· It cannot yet be concluded which code performs better computationally, while it was shown that both methods give essentially the same physical results. So far, it can be said that either method will take about the same amount of time to solve the same problem.
· This project is clearly on the right track. We believe we will shortly be able to answer the question of whether the LBM can be an alternative to the spectral method for turbulence simulation.
· The communication time and computation time may be described analytically.
· For a given problem size, CPU speed, and communication bandwidth, there exists an optimum number of nodes for the overall execution time.
Future work:
· Use the approximate quantitative model for the execution time that was determined for the spectral method to predict the timing behavior of this code for higher numbers of processors and higher-resolution problems.
· Improve the quantitative execution-time model for the lattice Boltzmann code.
· Run the lattice Boltzmann and spectral codes on larger numbers of processors. We would like to see how well the actual times match the theoretical times. Also, the efficiency and speed-up would be of great concern with larger numbers of processors.
· Try to count the floating-point operations. This can help to construct, improve and better understand the execution-time model for each code.
· Improve the MPI FFT subroutine in the spectral code: one limitation in the existing spectral code is that it can only run on 2^n nodes when a pairing strategy is used for data communication. In principle, this limitation can be removed.
· Further study different index orderings of the data array in the LBM code: in the LBM code there are several possibilities for the order of the indices of the main data array. The speed of the code can depend on this ordering, since the efficiency of the data retrieval process depends on it.
· Study different approaches for the communication in the lattice Boltzmann method.
References:
[1] Dmitruk, P., et al. Scalable Parallel FFT for Spectral Simulations on a Beowulf Cluster.
[2] Gropp, W., Lusk, E. and Skjellum, A. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1999.
[3] Moin, P. and Mahesh, K. 1998. Annu. Rev. Fluid Mech. 30: 539-578.
[4] Orszag, S.A. and Patterson, G.S. 1972. Phys. Rev. Lett. 28: 76-79.
[5] Wilkinson, B. and Allen, M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall, 1999.