3D Hydro Simulation using Spectral Method and Lattice Boltzmann Method: Accuracy and Performance Analysis

 

CISC879 Parallelization for Scientific Applications
Final Report for Project 5

        Advisors:

        Dr. L. P. Wang

        Dr. P. Dmitruk

 

        Members:

        Orlando Ayala

        Ben Breech

        Peng Gao

 

Introduction:

 

Turbulence is an unstable state of viscous fluid flow and is the most common flow regime in nature and in engineering applications. Air motion in the earth's atmosphere, ship wakes, and flows in pipelines and chemical reactors are just a few examples. Turbulence produces rapid mixing and transport of passive pollutants and particulate matter. Analytical solutions do not exist for even the simplest turbulent flows. Direct numerical simulation (DNS) provides a numerical solution of the flow variables as a function of space and time by numerically solving the exact mathematical equations governing a turbulent flow. DNS has been used effectively as a research tool since its invention in the early 1970s (Orszag and Patterson 1972).

 

Significant insight into turbulence physics has been gained from DNS of certain idealized flows that cannot be easily attained in the laboratory (Moin and Mahesh 1998).  For example, for the case of isotropic turbulence, DNS has helped us (a) identify the small-scale vortex tubes and their physical properties, (b) examine the hypotheses underlying the classical Kolmogorov-Obukhov theory of the inertial range, (c) explore the locality of the spectral energy transfer, (d) quantify mixing, structure, and turbulent combustion of passive scalars, and (e) reformulate statistical models for transport and coagulation of heavy particles. In quite a few cases, DNS data have been used to evaluate measurement accuracy (e.g., Moin and Mahesh 1998).  We believe that DNS, viewed as a novel numerical experiment, continues to be a complementary tool to experiments in turbulence research. Despite the remarkable contributions made in recent years through DNS, a known limitation is the relatively low Reynolds number attained (to date, Re is about 250 or less). A major effort has been to increase the flow Reynolds number.  This project represents a first step towards this goal by optimizing DNS codes on scalable parallel computers.

 

Objective:

 

Our faculty advisor, Dr. Wang, has recently developed two MPI codes to simulate homogeneous turbulence on scalable parallel computers. The codes are based on two distinct numerical methods.  The first, the spectral code, is based on a 3D Fourier representation of the Navier-Stokes equations (the governing equations for macroscopic fluid motion).  The second is based on the discretized Boltzmann equation and as such is known as the lattice Boltzmann method (LBM).  The two methods solve completely different equations but macroscopically give the same turbulent flow field if they share the same initial condition.

 

The overall objective of this project is to analyze these codes carefully and test their performance on the cluster machine so that their capabilities for simulating turbulence can be explored and compared.

 

 

Program and method description:

           

Spectral method description (SM)

 

The spectral method solves the Navier-Stokes equation:
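In standard notation (a sketch of the usual incompressible form, with u the velocity, p the pressure, rho the density and nu the kinematic viscosity; the exact notation in the code may differ), this reads

\[
\frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u}\cdot\nabla)\mathbf{u}
  = -\frac{1}{\rho}\nabla p + \nu \nabla^{2}\mathbf{u},
\qquad \nabla\cdot\mathbf{u} = 0.
\]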

 

It is solved in Fourier space (k-space) through the following transformations:
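A standard transform pair on the periodic N^3 grid (given here for reference; the code's normalization convention may differ) is

\[
\mathbf{u}(\mathbf{x},t) = \sum_{\mathbf{k}} \hat{\mathbf{u}}(\mathbf{k},t)\, e^{i\mathbf{k}\cdot\mathbf{x}},
\qquad
\hat{\mathbf{u}}(\mathbf{k},t) = \frac{1}{N^{3}} \sum_{\mathbf{x}} \mathbf{u}(\mathbf{x},t)\, e^{-i\mathbf{k}\cdot\mathbf{x}}.
\]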

 

 

leading to the following equation in the Fourier space:
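A common way to write the transformed momentum equation (a sketch; the grouping of the two right-hand-side terms follows how they are referred to below) is

\[
\left(\frac{\partial}{\partial t} + \nu k^{2}\right)\hat{\mathbf{u}}(\mathbf{k},t)
  = \mathbf{A}(\mathbf{k},t) + \mathbf{B}(\mathbf{k},t),
\]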

 

where
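presumably (these definitions are inferred from the way A and B are used later in this report; the nonlinear term is written here in advective form, the rotational form u x omega being an equally common choice)

\[
\mathbf{A}(\mathbf{k},t) = \widehat{\left[-(\mathbf{u}\cdot\nabla)\mathbf{u}\right]}(\mathbf{k},t)
\ \ \text{(transformed nonlinear term)},
\qquad
\mathbf{B}(\mathbf{k},t) = -\,\frac{i\mathbf{k}\,\hat{p}(\mathbf{k},t)}{\rho}
\ \ \text{(transformed pressure-gradient term)}.
\]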

 

 

Calling the sum of the first and second terms on the right-hand side of the equation C(k,t), and using the following change of variables:
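the usual integrating-factor substitution (a sketch of its standard form),

\[
\mathbf{Q}(\mathbf{k},t) = \hat{\mathbf{u}}(\mathbf{k},t)\, e^{\nu k^{2} t},
\]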

 

            The equation is simplified to:
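so that the viscous term is absorbed by the exponential factor (in the same sketched notation):

\[
\frac{\partial \mathbf{Q}(\mathbf{k},t)}{\partial t} = \mathbf{C}(\mathbf{k},t)\, e^{\nu k^{2} t},
\]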

 

which, when discretized in time, becomes:
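for illustration, a simple forward-Euler discretization would be (the actual code may use a higher-order scheme such as Adams-Bashforth):

\[
\mathbf{Q}(\mathbf{k},t+dt) = \mathbf{Q}(\mathbf{k},t) + dt\,\mathbf{C}(\mathbf{k},t)\, e^{\nu k^{2} t}.
\]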

 

 

Using this expression and the transformation from Q back to u (the velocity in k-space), we can readily calculate the velocity field at each time step. From the above equation, it is clear that the term C(k,t) is needed.  Using the continuity equation in k-space:
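which for an incompressible flow reads

\[
\mathbf{k}\cdot\hat{\mathbf{u}}(\mathbf{k},t) = 0,
\]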

 

 

the Navier-Stokes equation in k-space can be simplified to:
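taking the dot product of the transformed momentum equation with k and using continuity gives (a sketch of the usual projection step)

\[
\mathbf{k}\cdot\mathbf{A}(\mathbf{k},t) + \mathbf{k}\cdot\mathbf{B}(\mathbf{k},t) = 0,
\]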

 

 

from which we can obtain B(k,t). C(k,t) can then be calculated from A(k,t):
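presumably as (this is the standard projection onto the divergence-free space, written with the symbols used above)

\[
\mathbf{B}(\mathbf{k},t) = -\,\frac{\mathbf{k}\left[\mathbf{k}\cdot\mathbf{A}(\mathbf{k},t)\right]}{k^{2}},
\qquad
\mathbf{C}(\mathbf{k},t) = \mathbf{A}(\mathbf{k},t) - \frac{\mathbf{k}\left[\mathbf{k}\cdot\mathbf{A}(\mathbf{k},t)\right]}{k^{2}}.
\]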

 

 

            To summarize: to calculate the velocity field at the next time step, C(k,t) is needed, and it can be calculated from A(k,t).

 

            The process can be reduced to the following steps:

 

At t = 0 we know u(x,0). Using the FFT, we obtain u(k,0).

 

At each time step u(k,t) is known and we want u(k,t+dt); the steps are:

 

1)

 

2) Fast Fourier Transforms:

           

           

 

3) Compute the nonlinear term A(x,t) in real space:

           

 

4) Fast Fourier Transform:

           

 

5) Force symmetry on the kx = 0 plane; this step is necessary to ensure real (physical) results:

           

 

6) Compute C(k,t)

 

7) Compute Q(k,t+dt) to later obtain u(k,t+dt), and repeat all the steps for the next time step.

 

 

            As can be seen from the above, the FFT is an essential computational component of Galerkin and pseudo-spectral codes. For the implementation and performance study of the multidimensional FFT on a distributed-memory Beowulf cluster, we focus on the three-dimensional real transform. The approach studied is a one-dimensional domain decomposition algorithm that relies on a communication-intensive transpose operation involving all P processors.  Communication is based upon the standard portable Message Passing Interface (MPI).

 

·        The transpose algorithm is optimized for simultaneous block communication by all processors;

·        Communication is arranged as non-overlapping pair-wise exchanges between processors, thus eliminating blocking when standard Fast Ethernet interconnects are employed (a minimal sketch of this pattern is given below).

 

This method provides the basis for the implementation of scalable and efficient spectral-method computations of hydrodynamic and magnetohydrodynamic turbulence on Beowulf clusters assembled from standard commodity components [1].
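As an illustration of the pairwise pattern, the following is a minimal sketch in MPI Fortran. All routine and array names are assumed for the sketch (it is not code from the actual program); the XOR pairing rule shown is one common way to generate non-overlapping pairs, and it implicitly requires the number of processors to be a power of two, consistent with the 2^n-node restriction noted under Future Work.

      subroutine transpose_exchange(sendbuf, recvbuf, blocksize, myid, nproc)
      ! Hypothetical sketch of the pairwise transpose exchange (assumed names).
      ! sendbuf(:,p) holds the block destined for processor p; recvbuf(:,p)
      ! receives the block coming from processor p.
      use mpi
      implicit none
      integer, intent(in)  :: blocksize, myid, nproc
      real(8), intent(in)  :: sendbuf(blocksize, 0:nproc-1)
      real(8), intent(out) :: recvbuf(blocksize, 0:nproc-1)
      integer :: stage, partner, ierr, status(MPI_STATUS_SIZE)

      recvbuf(:, myid) = sendbuf(:, myid)     ! the local block needs no communication
      do stage = 1, nproc - 1
         partner = ieor(myid, stage)          ! XOR pairing: disjoint pairs at every stage
         call MPI_Sendrecv(sendbuf(:, partner), blocksize, MPI_DOUBLE_PRECISION, partner, 0, &
                           recvbuf(:, partner), blocksize, MPI_DOUBLE_PRECISION, partner, 0, &
                           MPI_COMM_WORLD, status, ierr)
      end do
      end subroutine transpose_exchange

At every stage each processor is occupied by exactly one send/receive pair, so no processor blocks waiting for a busy partner.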

 

 

Lattice Boltzmann method description (LBM)

 

The lattice Boltzmann method is a parallel and efficient algorithm for simulating single-phase and multiphase fluid flows and for incorporating additional physical complexities. The LBM is particularly successful in fluid-flow applications involving interfacial dynamics and complex boundaries.

This method simulates the motion of the fluid through a particle distribution function (f). The particles are allowed to move and collide with each other, and their movement is constrained to a particular set of directions.

 

Some key equations to be solved in the 3D LBM code:

 

Using the 15-velocity (D3Q15) model:

 

            The particles can move with the following velocities along a discrete set of directions; each particle population a corresponds to one of these directions. The velocities are:

 

 

            (0, 0, 0);                                               a=0;

            (±1, 0, 0) (0, ±1, 0) (0, 0, ±1);             a=1, 2, 3, 4, 5, 6;

            (±1, ±1, ±1);                                        a=7, 8, 9, 10, 11, 12, 13, 14;

 

 

            At each time step, the particle distribution function is calculated; from it we recover the velocity and density fields.
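The evolution equation presumably takes the standard single-relaxation-time (BGK) form, reproduced here for reference (e_a are the discrete velocities listed above, delta t the time step, and tau the relaxation time, which sets the viscosity):

\[
f_{a}(\mathbf{x}+\mathbf{e}_{a}\,\delta t,\; t+\delta t) - f_{a}(\mathbf{x},t)
  = -\,\frac{1}{\tau}\left[f_{a}(\mathbf{x},t) - f_{a}^{\mathrm{eq}}(\mathbf{x},t)\right].
\]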

 

 

            The right-hand side of the equation is the so-called collision term, and the left-hand side is the streaming term. The collision term contains the particle distribution function at equilibrium, which can be calculated as follows:
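For the D3Q15 lattice, the standard form of the equilibrium distribution (quoted here for reference; the code's notation may differ) is

\[
f_{a}^{\mathrm{eq}} = w_{a}\,\rho\left[1 + \frac{3\,(\mathbf{e}_{a}\cdot\mathbf{u})}{c^{2}}
  + \frac{9\,(\mathbf{e}_{a}\cdot\mathbf{u})^{2}}{2c^{4}}
  - \frac{3\,u^{2}}{2c^{2}}\right],
\]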

 

 

where:
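c = dx/dt is the lattice speed and the w_a are the standard D3Q15 weights,

\[
w_{0} = \tfrac{2}{9}, \qquad
w_{a} = \tfrac{1}{9}\ \ (a=1,\dots,6), \qquad
w_{a} = \tfrac{1}{72}\ \ (a=7,\dots,14).
\]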

 

 

 

            It can be proven that the Navier-Stokes equation is recovered from the equations shown, which means that by solving them we are, behind the scenes, solving the Navier-Stokes equation. The density and velocity fields can be calculated from the particle distribution function as follows:
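these are the usual moment relations:

\[
\rho(\mathbf{x},t) = \sum_{a} f_{a}(\mathbf{x},t),
\qquad
\rho\,\mathbf{u}(\mathbf{x},t) = \sum_{a} f_{a}(\mathbf{x},t)\,\mathbf{e}_{a}.
\]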

 

 

 

           

 

 

            The process can be reduced to the following simple steps:

 

At t = 0, we calculate the initial particle distribution function from the initial velocity field u(x,0) using the equilibrium distribution.

 

At each time step f_a(x,t) is known and we want f_a(x,t+dt); the steps are:

 

1) Compute the collision term. This stage needs only information stored at the same node location.

 

2) Compute the streaming. This stage needs some values of the particle distribution function at neighboring nodes.

 

3) Compute the density and velocity from the particle distribution function.

 

 

            The approach studied is a one-dimensional domain decomposition algorithm. Communication among the P processors is needed in step 2, the streaming calculation: as mentioned, this step requires information from neighboring nodes, and if the physically neighboring nodes reside on other processors, that information has to be communicated. Communication is based upon the standard portable Message Passing Interface (MPI) and involves only pairs of processors; each processor communicates only with its two neighbors. A minimal sketch of this exchange is given below.
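In this MPI Fortran sketch, all names are assumed for illustration (not taken from the actual code), and only one direction of the exchange is shown, the opposite direction being analogous.

      subroutine stream_exchange(plane_up, plane_from_down, nwords, myid, nproc)
      ! Hypothetical sketch (assumed names): send the distribution values that stream
      ! across the upper boundary of this processor's slab to the next processor, and
      ! receive the corresponding values arriving from the processor below (periodic).
      use mpi
      implicit none
      integer, intent(in)  :: nwords, myid, nproc
      real(8), intent(in)  :: plane_up(nwords)
      real(8), intent(out) :: plane_from_down(nwords)
      integer :: up, down, ierr, status(MPI_STATUS_SIZE)

      up   = mod(myid + 1, nproc)             ! neighbor holding the next slab
      down = mod(myid - 1 + nproc, nproc)     ! neighbor holding the previous slab
      call MPI_Sendrecv(plane_up,        nwords, MPI_DOUBLE_PRECISION, up,   1, &
                        plane_from_down, nwords, MPI_DOUBLE_PRECISION, down, 1, &
                        MPI_COMM_WORLD, status, ierr)
      end subroutine stream_exchange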

 

 

 

Results Analysis:

           

We ran into problems implementing the codes on the DEC Alpha cluster, since we could not run on more than 2 processors there when using MPI Fortran. We therefore ported and implemented both methods on the Beowulf cluster, where the programs run well.  Several runs were done with both the SM and LBM codes, at resolutions of 16^3, 32^3, 64^3, and 128^3, on P = 1, 2, 4, 8, and 16 processors.

 

 

 

Physical result:

 

            First of all, we have to show that both methods give the same results; it does not make sense to compare two methods computationally if they do not agree physically.

Fig. 1.  Kinetic energy and Dissipation for a problem size of 32^3.

 

Fig. 2.  Kinetic energy and Dissipation for a problem size of 64^3.

 

Fig. 3.  The energy spectrum for a problem size of 32^3.

 

 

Fig. 4.  The energy spectrum for a problem size of 64^3.

 

 

            Figures 1, 2, 3, and 4 show physical results extracted from runs of size 32^3 and 64^3. Both sizes were run using the two methods, LBM and SM, and the results show very good agreement between them: either method gives the same physical results. The choice between the two methods therefore rests on computational considerations. In this project we try to analyze the computational aspects of the LBM and SM, in terms of both computation and communication, when the two programs are run on several processors.

 

 

Timing results:

 

            In this section we show the time each program (LBM and SM) takes for communication, for computation, and in total (the sum of the communication and computation times). The timing was obtained by calling a wall-clock timer (w_time) before and after each section of the program that performs either computation or communication. Another important point is that the time shown on these plots is the time taken to compute one time step with either method; generally speaking, something like 5000 steps or more are needed to obtain interesting physical results. A few quick calculations show why the programs need to be parallelized: for a 128^3 problem, it could take a day to obtain good results.
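A minimal sketch of this timing pattern is shown below; the structure and names are assumed for illustration, and MPI_Wtime (the standard MPI wall-clock timer) stands in for whatever timer call the codes actually use.

      program timing_sketch
      ! Hypothetical sketch: bracket each computation or communication section with a
      ! wall-clock timer and accumulate the differences separately.
      use mpi
      implicit none
      integer :: ierr
      real(8) :: t0, t_comp, t_comm

      call MPI_Init(ierr)
      t_comp = 0.0d0
      t_comm = 0.0d0

      t0 = MPI_Wtime()
      ! ... computation section of the time step would go here ...
      t_comp = t_comp + (MPI_Wtime() - t0)

      t0 = MPI_Wtime()
      ! ... communication section (MPI calls) would go here ...
      t_comm = t_comm + (MPI_Wtime() - t0)

      call MPI_Finalize(ierr)
      end program timing_sketch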

 

Fig. 5. Communication time as function of number of processors P for different sizes.

 

 

As can be seen above, for the spectral method at data sizes 128^3 and 64^3 the communication time decreases slowly as the number of processors increases. The reason is that the amount of data each processor must communicate becomes smaller and smaller as the number of processors increases.

For the SM at 32^3 the communication time changes little with the number of processors, although it still decreases, as in the higher-resolution problems. However, for the 16^3 data size the communication time increases a little with the number of processors. We attribute this to message-passing overhead on the Linux cluster: when the communicated data is very small, latency dominates and the effective time to transmit the information is higher.

Interestingly, for the lattice Boltzmann code the communication time should be proportional to N^2, with no dependence on P.  Yet, for small values of P, the graphs clearly show some variation with P; the curves seem to settle down into a straight line once P reaches 8 and above.  We hypothesize that this behavior should continue for larger data sets, but that has yet to be confirmed.

 

 

Fig 6. Computational time as function of number of processors P for different sizes.

 

Looking at Fig. 6, for the data sizes 128^3, 64^3, 32^3, and 16^3 the computation time clearly decreases with the number of processors for both methods. This is expected: as the number of processors increases, each processor has a smaller amount of data to work on and therefore takes less time.

It is important to point out, comparing Figs. 5 and 6, that for the problems we solved the computation time is higher than the time each processor spends communicating. Any improvement to the communication scheme would therefore probably help little.

 

 

Fig 7. Total time as function of number of processors P for different sizes.

 

 

Fig. 7 shows the total execution time. Generally, for the data sizes 128^3, 64^3, 32^3, and 16^3, the total time clearly decreases with the number of processors for both methods; parallel programming therefore reduces the time for this scientific computation. One important behavior visible in this plot is that the two methods take almost the same amount of time per time step, at least for the higher-resolution problems. What happens at resolutions higher than the ones we studied, above 128^3, remains to be seen.

 

            Other important plots are the efficiency and the speedup, shown in Figs. 8 and 9. The overall computational efficiency increases with problem size (N) and decreases with P, just as we expected.

            The efficiency plot is important because it can be used to evaluate a parallel program. The efficiency always tends to decrease with the number of processors (P), but a well-behaved parallel program is one whose efficiency stays roughly constant or drops only slightly as P grows. Looking at Fig. 8, it can be said that for the SM the 128^3 and 64^3 problems have good efficiency, whereas for the smaller problems the efficiency decreases steadily. For the LBM we could not obtain data for 128^3; for 64^3 and 32^3 the efficiency stays around one and tends to be maintained, while for 16^3 it drops steadily.

            In general, both methods have well-behaved efficiency when solving the higher-resolution problems.

 

          

Fig 8. Efficiency as function of number of processors P for different sizes.

 

 

          

Fig 9. Speed Up as function of number of processors P for different sizes.

 

 

From Fig. 9 it can be seen that the LBM has a much better speedup than the spectral method. At this point we do not want to claim that the LBM is an alternative to, or even a better method than, the spectral method for turbulence simulation, even though the efficiency and speedup plots might tempt one to that conclusion.

Remember that the two programs take roughly the same time to calculate each time step; on that basis, either method is good. The behavior seen in the speedup and efficiency plots may simply reflect that the lattice Boltzmann code does not perform well on a single processor; the LBM is always better used with several processors.

Further studies have to be done before a solid conclusion can be drawn about which method is better for the type of problem we are solving.

In the future, both methods need to be run on larger numbers of processors. As we said, the efficiency and speedup are of great concern at larger processor counts. For the LBM, since the communication time does not depend on P, there will come a point where each process spends more time communicating than doing useful computation, and we expect the efficiency to start dropping off. For the SM, the computation and communication time plots show that the computation time drops faster than the communication time as the number of processors increases, so there will likewise be a point where communication dominates. It will be interesting to see how the efficiency behaves there.

 

 

 

Complexity analysis

 

           

            One important study to be done when analyzing parallelized codes is the complexity analysis, that is, developing an approximate quantitative model for the execution time.

            In general, it can be said that the total time a program takes is:

 

 

that is, the sum of the time for computation and the time for communication, T_total = T_comp + T_comm.

 

            For Spectral method:

 

            The computational time can be modeled as the sum of the time for non-FFT computations and the time for FFT computations. The following equations show this [1]:
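Written out, a plausible form of these scalings, consistent with the N^3/P and (N^3 log2 N)/P operation counts discussed in [1] (t1 and t2 denote the constants referred to below):

\[
T_{\mathrm{non\text{-}FFT}} \approx t_{1}\,\frac{N^{3}}{P},
\qquad
T_{\mathrm{FFT}} \approx t_{2}\,\frac{N^{3}\log_{2}N}{P},
\]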

 

 

 

 

with t1 and t2 approximately constant.

           

            We decided to plot the ratio of the measured time to the non-constant term of each equation. If the model is correct, all the curves should collapse onto the same horizontal line, at the value of t1 or t2.

Fig. 10. Complexity Analysis for FFT computations in the SM.

 

            Fig. 10 shows this plot for the FFT computations. Almost all the curves lie at the same value, except for the 16^3 problem. We think this is probably because the FFT routine works much better for bigger problems. It can be concluded that the model used to quantify the FFT computational time is essentially correct.

 

 

Fig. 11. Complexity Analysis for NON FFT computations in the SM.

 

Fig. 11 shows the same type of plot for the non-FFT computations. For 128^3 and 64^3 the model works fine. For the 16^3 problem, and for the 32^3 problem on 8 and 16 processors, the values are lower than the model predicts. We believe this is because in those cases the data is small enough that the calculations are served from a faster level of the machine's memory (cache).

 

The communication time can be modeled with the following equation [1]:
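a plausible form, consistent with the one-dimensional transpose described in [1] (each processor exchanges essentially its whole N^3/P slab during each transpose; t3 is our label for the constant, in line with t1 and t2 above):

\[
T_{\mathrm{comm}} \approx t_{3}\,\frac{N^{3}}{P},
\]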

 

 

with t3 approximately constant.

 

Fig. 12. Complexity Analysis for communications in the SM.

 

            Fig. 12 shows the corresponding plot for the communication time. Again, for the 128^3 and 64^3 problems the model works just fine, while for 16^3, and for 32^3 on 8 and 16 processors, it gives higher values. As before, we attribute this to message-passing overhead: when the communicated data is very small, latency dominates and the effective time per transmitted word is higher.

           

            In general, the model for the spectral method is reliable and can be used to predict the timing behavior for larger numbers of processors and higher-resolution problems. The constants of proportionality can be determined experimentally: for the best-behaved runs seen in the plots, we determined the "constants" and then averaged them together. The non-FFT computation constant works out to be 3.78x10^-6, the FFT computation constant 7.36x10^-7, and the communication constant 7.37x10^-6.

 

 

 

            For the lattice Boltzmann method (LBM) we have:
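presumably the model stated in the paragraph below (c1 and c2 are our labels for the constants):

\[
T_{\mathrm{comp}} \approx c_{1}\,\frac{N^{3}}{P},
\qquad
T_{\mathrm{comm}} \approx c_{2}\,N^{2},
\]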

 

 

with c1 and c2 approximately constant.

 

The Lattice Boltzmann codes have a fairly simple complexity analysis. On each time step, each processor is sending and receiving N^2 worth of data.  The communication time is proportional to N^2, with the constant of proportionality obviously including the time it takes to send a word of data through the network.  Similarly, the computation time is proportional to N^3/P, where P is the number of processors.

 

 

Fig. 13. Complexity Analysis for computations in the LBM.

 

 

Fig. 14. Complexity Analysis for communications in the LBM.

 

 

            From the plots for the LBM, it appears that the model does not work well at all: the curves do not lie at the same level as one another. Either the model has to be improved, the code has to be improved, or the timing measurements have to be reviewed.

           

Even so, we determined both constants of proportionality experimentally. For each run, we determined the "constants" and then averaged them together.  While obviously not perfect, this method may give a good idea of what the constants should be.  The communication constant works out to be 1.68x10^-5, and the computation constant 1.22x10^-5.

 

 

 

 

 

Conclusions:

 

·        It cannot yet be concluded which code performs better computationally, although it was shown that both methods give essentially the same physical results. So far, it can be said that either method takes roughly the same amount of time to solve the same problem.

·        This project is heading in a good direction. We believe we will shortly find the answer to the question of whether the LBM can be an alternative to the spectral method for turbulence simulation.

·        The communication time and computation time may be described analytically.

·        For a given problem size, CPU speed, and communication bandwidth, there exists an optimum node number for the overall execution time.

 

Future work:

 

·        Use the approximate quantitative execution-time model determined for the spectral method to predict the timing behavior of this code at larger numbers of processors and higher problem resolutions.

·        Improve the quantitative execution-time model for the lattice Boltzmann code.

·        Run the lattice Boltzmann and spectral codes on larger numbers of processors. We would like to see how well the actual times match up against the theoretical times.  Also, the efficiency and speedup would be of great concern with larger numbers of processors.

·        Try to count the floating-point operations. This can help to construct, improve, and better understand the execution-time model for each code.

·        Improvement of the MPI FFT subroutine in the spectral code: one limitation of the existing spectral code is that it can only run on 2^n nodes when a pairing strategy is used for data communication. In principle, this limitation can be removed.

·        Further study of different index orderings of the data array in the LBM code:  in the LBM code, there are several possibilities for the order of the indices of the main data array.  The speed of the code can depend on this ordering, since the efficiency of the data retrieval process depends on it.

·        Study of different approaches to the communication in the lattice Boltzmann method.

 

References:

 

[1] Dmitruk, P., et al. Scalable Parallel FFT for Spectral Simulations on a Beowulf Cluster.

 

[2] Gropp, W., Lusk, E. and Skjellum, A. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press. 1999.

 

[3] Moin, P. and Mahesh, K. 1998 Annu. Rev. Fluid Mech. 30: 539-578.

 

[4] Orszag, S.A. and Patterson, G.S. 1972 Phys. Rev. Lett. 28: 76-79.

 

[5] Wilkinson, B. and Allen, M. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Prentice Hall. 1999.





Last update: December 08, 2000