Mstack memory diagram
Documentation
Mstack is a benchmark created by Daniel Pressel to simulate median stacking of oceanagraphic seismic data. Mstack was designed to be lightweight, easy to compile, and easy to port from one platform to another. We are currently in the process of writing a paper describing the completed work, for submission to an academic conference.

  • Mstack is described (in work-in-progress form) in ARL-MR-0683.
  • My master's thesis describes the Mstack benchmark in detail, and efforts to port it to the Cray MTA-2 architecture.
  • Powerpoint file explaining the benchmark
  • Powerpoint file #2 (short)
  • CAPSL technical note #21 describing the process of porting Mstack to ParalleX.


  • Source code


    We have created a number of similiar-but-different versions, listed here. Where there happen to be both a C and a Fortran implementation of the same version, both are listed together.



    Early mstack versions
    mstack.c
    mstack.f
    The Mstack reference implementations
    mstack2.c
    mstack2.f
    Identical to reference implementation except for swap implementation
    mstack3.c
    mstack3.f
    Identical to reference implementation except for swap implementation
    mstack4.f
    Identical to reference implementation except for swap implementation
    mstack5.f Identical to reference implementation except for swap implementation
    mstackv.c
    mstackv.f
    Mstack vectorized implementation (Substantially different from reference implementation)
    mstackv2.c Second vectorized implementation


    Ioannis Venetis realized that it would be possible to change the inner loop from sorting every element on every iteration, to sorting numchn-k1 elements on each loop, where numchn is the total number of elements, and k1 is the number of iterations of the outer loop that have been iterated through. Thus, it sorts numchn elements on the zeroeth iteration, numchn-1 on the first, numchn-2 on the second, etc. This effectively reduces the workload by half. These versions are thus dubbed "o" for optimized.

    Optimized mstack versions
    mstacko.c
    mstacko.f
    Optimized version of the reference implementation
    mstack2o.c
    mstack2o.f
    Optimized versions of mstack2 C and Fortran versions.
    mstack3o.c
    mstack3o.f
    Optimized versions of mstack3 C and Fortran versions.
    mstack4o.c
    mstack4o.f
    Optimized versions of mstack4 C and Fortran versions.
    mstack5o.f Optimized version of mstack5 Fortran version
    mstackvo.c
    mstackvo.f
    Optimized versions of mstackv C and Fortan vectorized versions
    mstackv2o.c Optimized version of second vectorized implementation


    In porting Mstack to the Cray MTA-2, we realized that autoparallelizing compilers could benefit from slight modification to how the loop indices were treated (making it easier to determine that there were no dependencies and that the loops could be safely parallelized). These easy-to-autoparallelize versions, based on mstacko, are thus dubbed "mta".

    Easy to auto-parallelizable versions
    mstackomta1.c One extra level of dimensionality added to the scratch array (scratch now has 2 dimensions)
    mstackomta2.c Two extra levels of dimensionality added to the scratch array (scratch now has 3 dimensions)
    mstackomta3.c Scratch array eliminated entirely. All data access and manipulation now done directly on traces (which has 3 dimensions)
    mstackomta5.c This was an attempt to reduce the number of memory operations by using the recurrence. (Note: in practice, this did not produce the expected speed up)
    mstackpomta.c Replaces the bubblesort with an odd-even transposition sort (effectively a parallelized bubblesort). This introduces an additional factor of n/2 parallelism.


    Miscellaneous versions of Mstack.
    Miscellaneous versions
    mstackp.c Same as C reference implementation except uses an odd-even transposition sort instead of bubblesort.
    mstackpo.c Same as mstackp but using Ioannis's optimization
    mstackmo.c A somewhat crude effort to use a merge-sort in combination with odd-even transposition sort.
    mstacko-cyclops.c Similiar to mstackomta2. It does not assume automatic parallelization, but instead relies on nested openmp parallelization. Theoretically has 1 million units of parallelism.


    We implemented Mstack in DIME-C

    Dime-C implementation
    dime-c_8sorts.c DIME-C source code implementing 8 independent sorting units in parallel
    example.c Host C file containing FPGA specific API functions for design execution
    dimetalk_48units.dt3 DIMEtalk file for the network of 48 sort processing elements


    We implemented Mstack in Mitrion-C

    Mitrion C implementation
    host5.c
    mitrion_numchn5.mitc
    Mitrion host C file and FPGA .mitc file - Implements optimized inner loop bubble sort for 5 channels
    host50.c
    mitrion_numchn50.mitc
    Mitrion host C file and FPGA .mitc file - Implements optimized inner loop bubble sort for 50 channels
    host75.c
    mitrion_numchn75.mitc
    Mitrion host C file and FPGA .mitc file - Implements optimized inner loop bubble sort for 75 channels
    host128.c
    mitrion_numchn128.mitc
    Mitrion host C file and FPGA .mitc file - Implements optimized inner loop bubble sort for 128 channels


    We implemented Mstack for Sony's Cell Broadband Engine

    Cell B.E. implementation.
    Mstack_Cell.tar Mstack-on-Cell source code


    Ioannis Venetis implemented Mstack for IBM's Cyclops64. Download the code, build it with make (which outputs an object file), then finish compiling with the Cyclops64 compiler (cyclops64-linux-elf-gcc -o a.out tnt_mstack.o ; cyclops64-linux-elf-sim -p 2 --bw --no-spmd a.out)

    Cyclops64 implementation.
    c64_mstack.tar Mstack reference version
    c64_mstacko.tar Optimized Mstack reference version
    c64_mstack2.tar Identical to reference implementation except for swap implementation
    c64_mstack2o.tar Optimized version of c64_mstack2
    c64_mstack3.tar Identical to reference implementation except for swap implementation
    c64_mstack3o.tar Optimized version of c64_mstack3
    c64_mstackv.tar Vectorized mstack implementation
    c64_mstackvo.tar Optimized vectorized Mstack implementation
    c64_mstackv2.tar Second vectorized version
    c64_mstackv2o.tar Optimized second vectorized version


    We implemented Mstack for ParalleX - the next generation execution model under development by Thomas Sterling's group at LSU.

    ParalleX implementation.
    mstack_classes.cpp C++ classed based implementation (precursor to ParalleX)
    parallex_main.cpp ParalleX main
    mstack.hpp Header file containing Mstack class and return_set struct definitions
    mstack.cpp Class definitions


    We ported Mstack to CUDA

    CUDA implementation.
    mstack_local_mem.cu CUDA mstack implementation ParalleX main