9:00 — 9:15 |
Opening Remarks
John Cavazos
University of Delaware
|
Session 1 |
9:15 — 9:45 |
A Case for a Value-Aware Cache
Per Stenström
Chalmers University of Technology
Replication of values causes poor utilization of on-chip cache memory resources. This talk addresses the question: how much cache capacity can, in theory and in practice, be saved if value replication is eliminated? We introduce the concept of value-aware caches and show that a value-aware cache sixteen times smaller than a conventional cache can yield the same miss rate. We then make a case for a value-aware cache design using Huffman-based compression. Since the value set is rather stable across the execution of an application, one can afford to reconstruct the coding tree in software. The decompression latency is kept short by a novel pipelined Huffman decoder that uses canonical codewords. While the (loose) upper bound on the compression factor is 5.2X, we show that, by eliminating cache-block alignment restrictions, it is possible to achieve a compression factor of 3.4X for practical designs.
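The canonical-codeword property the abstract relies on is standard: canonical codes of a given bit length form one contiguous numeric range, so a decoder needs only per-length first codes and symbol offsets rather than a tree walk. Below is a minimal C sketch of canonical Huffman decoding; the five-symbol code table is an illustrative assumption, not the table from the talk's cache design.

    #include <stdio.h>
    #include <string.h>

    #define MAX_LEN 4

    /* Canonical code for lengths a:2, b:3, c:3, d:4, e:4.
       count[l]: codes of length l; first_code[l]: numeric value of the
       first length-l code; first_sym[l]: its index into symbols[]. */
    static const int count[MAX_LEN + 1]      = {0, 0, 1, 2, 2};
    static const int first_code[MAX_LEN + 1] = {0, 0, 0, 2, 8};
    static const int first_sym[MAX_LEN + 1]  = {0, 0, 0, 1, 3};
    static const char symbols[] = "abcde";

    int main(void) {
        const char *bits = "010001000";            /* encodes "bad" */
        int code = 0, len = 0;
        for (size_t i = 0; i < strlen(bits); i++) {
            code = (code << 1) | (bits[i] - '0');  /* shift in one bit */
            len++;
            if (len <= MAX_LEN && count[len] &&
                code >= first_code[len] &&
                code <  first_code[len] + count[len]) {
                putchar(symbols[first_sym[len] + code - first_code[len]]);
                code = 0;
                len = 0;
            }
        }
        putchar('\n');                             /* prints: bad */
        return 0;
    }

Because recognizing a codeword reduces to one compare-and-subtract per length, a hardware decoder can evaluate all lengths in parallel pipeline stages, which is what keeps the decompression latency short.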
|
9:45 — 10:15 |
Break |
Session 2 |
10:15 — 10:45 |
Rethinking the Architecture of Warehouse-Scale Computers: Improving Efficiency and Utilization
Jason Mars
University of California, San Diego
The class of datacenters known as "warehouse-scale computers" (WSCs) houses large-scale, data-intensive web services such as web search, maps, social networking, docs, and video sharing. Companies like Google, Microsoft, Yahoo, and Amazon spend tens to hundreds of millions of dollars to construct and operate WSCs to provide these services. Maximizing the efficiency of this class of computing reduces cost and has energy implications for a greener planet. However, WSC design and architecture remain in relative infancy.
WSCs are built using commodity processor architectures (Intel/AMD) and software components (Linux, GCC, JVM, etc.) that have been engineered and optimized for traditional computing environments and workloads, such as those found in desktop and laptop environments. However, many characteristics, assumptions, and requirements of the WSC computing domain impact design decisions within these components. In this presentation, we rethink how WSCs are designed and architected, identify sources of inefficiency, and develop solutions to improve WSCs, with a particular focus on the interaction between the application layer, the system software stack, and the underlying hardware platform.
|
10:45 — 11:15 |
Fast Modeling Technology in the Multicore Era
Erik Hagersten
Uppsala University
Multicore introduces many hardware design tradeoffs, and simulation technology is used extensively to guide hardware designs towards near-optimal design points. However, multicore also introduces a number of new design choices for the programmer, compiler, and runtime system, and finding near-optimal design points for these technologies could be just as important. In this talk I will argue that simulation/modeling technology is needed not only by hardware developers, but also by application developers and runtime systems. Such use requires an overhead two to three orders of magnitude lower than that of traditional simulation technology. I will present statistical simulation methods, architecture-independent performance metrics, and performance variability studies as three possible "simulation" technologies for such usage.
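To make the idea concrete, here is a hedged C sketch (synthetic random trace, invented parameters) of one such statistical technique: sample a small subset of memory accesses, measure each sample's LRU stack distance, and estimate the miss ratio for a given cache size from the fraction of samples whose distance exceeds it.

    #include <stdio.h>
    #include <stdlib.h>

    #define TRACE_LEN 50000
    #define MAX_ADDRS 2048

    static unsigned trace[TRACE_LEN];

    /* LRU stack distance: distinct addresses touched since the previous
       access to trace[pos]; -1 for a cold (first) access. A deliberately
       naive backward scan, for clarity. */
    static int stack_distance(int pos) {
        unsigned seen[MAX_ADDRS];
        int nseen = 0;
        for (int i = pos - 1; i >= 0; i--) {
            if (trace[i] == trace[pos]) return nseen;
            int dup = 0;
            for (int j = 0; j < nseen; j++)
                if (seen[j] == trace[i]) { dup = 1; break; }
            if (!dup) seen[nseen++] = trace[i];
        }
        return -1;
    }

    int main(void) {
        for (int i = 0; i < TRACE_LEN; i++)
            trace[i] = (unsigned)(rand() % MAX_ADDRS);  /* synthetic trace */

        int cache_lines = 512, misses = 0, samples = 0;
        for (int i = 0; i < TRACE_LEN; i += 199) {      /* sparse sampling */
            int d = stack_distance(i);
            samples++;
            if (d < 0 || d >= cache_lines) misses++;    /* LRU would miss */
        }
        printf("estimated miss ratio for %d lines: %.3f\n",
               cache_lines, (double)misses / samples);
        return 0;
    }

The naive scan stands in for the lightweight sampling machinery real tools use; the point is the estimator, which needs only a sparse sample of reuses rather than a full cycle-accurate simulation.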
|
11:15 — 12:00 |
A Pathway to Usability in the Face of Extreme Concurrency
ACM Distinguished Lecture
Bronis de Supinski
Lawrence Livermore National Laboratory
We are faced with an explosion in parallelism at all levels of large-scale systems. Multi-core chips have become ubiquitous, and almost all large-scale systems use them. Several systems, such as Lawrence Livermore National Laboratory's BlueGene/P and Oak Ridge National Laboratory's Jaguar, already have over 100,000 processor cores. The recently announced ASC Sequoia will have over 1.5 million cores when it is deployed in FY12. Los Alamos National Laboratory's Roadrunner system features a heterogeneous node architecture that requires the use of three different compilers to build a single application. Hardware support for other novel parallelism mechanisms, such as transactional memory and thread-level speculation, is likely to appear in systems in the near future. Further, future systems are likely to have much less off-chip and off-node bandwidth per core, as well as significantly smaller main memories per core. These trends will necessitate significant changes in applications and in the development environment that supports them. We will require new mechanisms to target applications to these architectures, to identify and fix software defects that arise in those applications, and to understand and improve their performance. In this talk, I will detail several novel directions that we are pursuing at Livermore to overcome many of these challenges.
Please note that the exact content of this talk may be modified when selected from the ACM Speakers' Program to reflect new results.
|
12:00 — 13:30 |
Lunch |
Session 3 |
13:30 — 14:00 |
Performance Analysis in GPUs
Amirali Baniasadi
University of Victoria
We start by discussing three parameters affecting performance: branch divergence, memory access delays, and limited workload parallelism. We suggest machine models to estimate the performance gain potentials obtainable by eliminating each performance-degrading parameter. Such estimates indicate how much improvement designers can expect from investing in different GPU subsections. We conclude that memory is by far the most important of the three parameters impacting performance.
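A hedged sketch of the kind of machine model described, with invented cycle attributions: charge total cycles to each degrading parameter, then report the upper-bound speedup from eliminating one parameter at a time.

    #include <stdio.h>

    int main(void) {
        double useful     = 100e6;  /* cycles of useful work                */
        double branch_div =  30e6;  /* lost to branch divergence            */
        double mem_stall  = 120e6;  /* lost to memory access delays         */
        double low_par    =  15e6;  /* lost to limited workload parallelism */
        double total = useful + branch_div + mem_stall + low_par;

        /* Upper-bound gain from removing each parameter in isolation. */
        printf("no branch divergence: %.2fx\n", total / (total - branch_div));
        printf("ideal memory:         %.2fx\n", total / (total - mem_stall));
        printf("ample parallelism:    %.2fx\n", total / (total - low_par));
        return 0;
    }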
We then investigate the impact of warp size to provide a better understanding of how divergence can be addressed. GPUs execute threads in groups (called warps) to achieve both scheduling simplicity and SIMD efficiency. We study how the number of threads in each warp impacts branch and memory divergence. Small warps reduce the performance penalty associated with branch divergence at the expense of higher memory divergence. Large warps enhance memory coalescing significantly but also increase branch divergence. We present and evaluate two possible choices for addressing both divergences simultaneously: use small warps and invest in new solutions to memory divergence, or use large warps and address branch divergence with effective control-flow solutions.
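The tradeoff can be made concrete with a toy model (all parameters are assumptions for illustration): larger warps execute both branch paths more often, but amortize coalesced memory transactions over more lanes.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double p = 0.9;   /* probability a given thread takes the branch */
        int seg = 32;     /* lanes served by one memory transaction      */
        for (int w = 4; w <= 64; w *= 4) {
            /* A warp pays for both paths unless all lanes agree. */
            double agree  = pow(p, w) + pow(1.0 - p, w);
            double passes = 2.0 - agree;           /* 1.0 = no divergence */
            /* Unit-stride lanes share segment-sized transactions. */
            double trans_per_lane = (double)((w + seg - 1) / seg) / w;
            printf("warp %2d: branch passes %.2f, transactions/lane %.3f\n",
                   w, passes, trans_per_lane);
        }
        return 0;
    }

Running this shows warp size 4 averaging 1.34 path passes but 0.25 transactions per lane, while warp size 64 approaches 2.0 passes with only 0.03 transactions per lane, which is exactly the tension the abstract describes.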
|
14:00 — 14:30 |
A Platform to Help Migrate OpenMP Code to Heterogeneous Systems
François Bodin
CAPS Entreprise
Heterogeneous systems, based on a multicore processor supplemented with accelerator or coprocessor units, can provide high performance at the expense of extra programming complexity. Migrating codes to these systems is usually an intricate, burdensome task, even though new standards such as OpenACC and the OpenMP accelerator extensions greatly simplify programming. Dealing with data transfers and performance tuning, however, remains the programmer's burden.
In this presentation we give an overview of the technology developed at CAPS to help migrate regular OpenMP codes to heterogeneous systems. We address issues such as multiple address spaces, portability, code outlining for auto-tuning, and the integration of domain-specific code transformations.
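As a concrete illustration of the migration in question, the sketch below shows the same loop written for a multicore host with OpenMP and then offloaded with an OpenACC directive; the explicit copyin/copy clauses are precisely the data-transfer burden the abstract mentions. (The loop is an illustrative saxpy, not code from the CAPS toolchain.)

    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static float x[N], y[N];
        float a = 2.0f;
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 1.0f; }

        /* Original multicore version. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) y[i] = a * x[i] + y[i];

        /* Accelerator version: the directive is simple, but the data
           movement (copyin/copy) still needs manual thought. */
        #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
        for (int i = 0; i < N; i++) y[i] = a * x[i] + y[i];

        printf("%f\n", y[0]);
        return 0;
    }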
|
14:30 — 15:00 |
Phalanx: Designing a Unified Programming Model for Heterogeneous Machines
Bryan Catanzaro
NVIDIA
Heterogeneous, massively multithreaded processors are increasingly important at every scale of computing, from smartphones to supercomputers. However, mainstream programming languages reflect a mental model in which processing elements are homogeneous, concurrency is limited, and memory is a flat, undifferentiated pool of storage. Moreover, the current state of the art in programming heterogeneous machines tends towards using separate programming models, such as OpenMP and CUDA, for different portions of the machine. Both of these factors make programming heterogeneous machines unnecessarily difficult. We describe the design of Phalanx, which seeks to provide a unified programming model for heterogeneous machines. It provides constructs for bulk parallelism, synchronization, and data placement that operate across the entire machine. In this talk, I will discuss the programming model and its implementation, giving examples that show how Phalanx's approach to heterogeneity simplifies the task of exploiting heterogeneous machines.
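To convey what a single launch construct spanning the machine could look like, here is a purely hypothetical C sketch. It is emphatically not the Phalanx API; every name in it is invented, and the point is only the shape of the idea: one dispatch point in place of separate OpenMP and CUDA code paths.

    #include <stdio.h>

    typedef enum { CPU_CORES, GPU_DEVICE } place_t;  /* hypothetical placement */

    typedef void (*kernel_fn)(int idx, void *args);

    /* Hypothetical single entry point for bulk parallelism; a real
       runtime would dispatch per target instead of this serial loop. */
    static void launch(place_t where, int n, kernel_fn k, void *args) {
        (void)where;
        for (int i = 0; i < n; i++) k(i, args);
    }

    static void scale(int i, void *args) {
        float *v = args;
        v[i] *= 2.0f;
    }

    int main(void) {
        float v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        launch(GPU_DEVICE, 8, scale, v);   /* same call shape for CPU_CORES */
        printf("%f\n", v[0]);              /* prints 2.0 */
        return 0;
    }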
|
15:00 — 15:30 |
Break |
Session 4 |
15:30 — 16:00 |
Probabilistic Pointer Analysis in SSA Form and Its Applications
Jenq Kuen Lee
National Tsing Hua University
Probabilistic pointer analysis (PPA) is a compile-time analysis that estimates the probability that a points-to relationship will hold at a particular program point. The results are useful for optimizing and parallelizing compilers, which need to quantitatively assess the profitability of transformations when performing aggressive optimizations and parallelization. In this talk, I will report our recent progress on a PPA technique using static single assignment (SSA) form. When computing the probabilistic points-to relationships of a specific pointer, a pointer relation graph (PRG) is first built to represent all of the pointer's possible points-to relationships. The PRG is then transformed by a sequence of reduction operations into a compact graph, from which the probabilistic points-to relationships of the pointer can be determined. In addition, PPA is extended to interprocedural cases by considering function-related statements. We have implemented both static and profile-guided versions of the proposed scheme in the Open64 compiler. I will also describe applications that can potentially benefit from such analysis. (This work is based on our paper in IEEE TPDS, vol. 23, no. 12, Dec. 2012.)
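A small worked example of what PPA computes, with branch probabilities assumed to come from profiling; the numbers are illustrative, not output of the Open64 implementation.

    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, c = 3;
        int *x = &c;
        int cond = 1;             /* say profiling gives P(cond) = 0.6 */

        if (cond) x = &a;         /* taken with probability 0.6 */
        else      x = &b;         /* taken with probability 0.4 */

        /* At this program point, PPA would report
             P(x -> a) = 0.6,  P(x -> b) = 0.4,  P(x -> c) = 0.0,
           letting an optimizer weigh transformations that
           speculate on x == &a. */
        printf("%d\n", *x);
        return 0;
    }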
|
16:00 — 16:30 |
Improving Processor Efficiency by Statically Pipelining Instructions
David Whalley
Florida State University
A new generation of applications requires reduced power consumption without sacrificing performance. Instruction pipelining is commonly used to meet application performance requirements, but some implementation aspects of pipelining are inefficient with respect to energy usage. We propose static pipelining as a new instruction set architecture to enable more efficient instruction flow through the pipeline, which is accomplished by exposing the pipeline structure to the compiler. While this approach simplifies hardware pipeline requirements, significant modifications to the compiler are required. We describe the code generation and compiler optimizations we implemented to exploit the features of this architecture and show that we can achieve performance and code size improvements despite a very low-level instruction representation. We also demonstrate that static pipelining of instructions reduces energy usage by simplifying hardware, avoiding many unnecessary operations, and allowing the compiler to perform optimizations that are not possible on traditional architectures.
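The ordinary C loop below illustrates, in its comments, the kind of per-iteration pipeline work being targeted; the micro-operation commentary is a hedged reading of the exposed-pipeline idea, not the talk's actual instruction set.

    #include <stdio.h>

    int main(void) {
        int a[100], sum = 0;
        for (int i = 0; i < 100; i++) a[i] = i;

        for (int i = 0; i < 100; i++) {
            /* On a conventional pipeline, every iteration re-fetches
               the same source registers, recomputes the same branch
               target, and repeats other effects whose values never
               change across iterations. With the pipeline registers
               exposed in the ISA, the compiler can hoist such
               loop-invariant effects out of the loop while the C
               semantics stay identical. */
            sum += a[i];
        }
        printf("%d\n", sum);   /* prints 4950 */
        return 0;
    }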
|
16:30 — 17:00 |
Using the General-Purpose Part of a Special-Purpose Machine: The Fast Fourier Transform on Anton
Clifford Young
D.E. Shaw
No abstract available at this time.
|
17:00 — 17:15 |
Closing Remarks
John Cavazos
University of Delaware
|