9:00 — 9:15 |
Opening Remarks
John Cavazos
University of Delaware
|
Session 1 |
9:15 — 9:45 |
A Case for a Value-Aware Cache
Per Stenström
Chalmers University of Technology
Replication of values causes poor utilization of on-chip cache memory resources. This talk addresses the question: how much cache capacity can, in theory and in practice, be saved if value replication is eliminated? We introduce the concept of value-aware caches and show that a value-aware cache sixteen times smaller than a conventional cache can yield the same miss rate. We then make a case for a value-aware cache design using Huffman-based compression. Since the value set is rather stable across the execution of an application, one can afford to reconstruct the coding tree in software. The decompression latency is kept short by a novel pipelined Huffman decoder that uses canonical codewords. While the (loose) upper bound on the compression factor is 5.2X, we show that, by eliminating cache-block alignment restrictions, it is possible to achieve a compression factor of 3.4X for practical designs.
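The canonical-codeword property the abstract relies on is standard: canonical codes of a given bit length form one contiguous numeric range, so a decoder needs only per-length first codes and symbol offsets rather than a tree walk. Below is a minimal C sketch of canonical Huffman decoding; the five-symbol code table is an illustrative assumption, not the table from the talk's cache design.

    #include <stdio.h>
    #include <string.h>

    #define MAX_LEN 4

    /* Canonical code for lengths a:2, b:3, c:3, d:4, e:4.
       count[l]: codes of length l; first_code[l]: numeric value of the
       first length-l code; first_sym[l]: its index into symbols[]. */
    static const int count[MAX_LEN + 1]      = {0, 0, 1, 2, 2};
    static const int first_code[MAX_LEN + 1] = {0, 0, 0, 2, 8};
    static const int first_sym[MAX_LEN + 1]  = {0, 0, 0, 1, 3};
    static const char symbols[] = "abcde";

    int main(void) {
        const char *bits = "010001000";            /* encodes "bad" */
        int code = 0, len = 0;
        for (size_t i = 0; i < strlen(bits); i++) {
            code = (code << 1) | (bits[i] - '0');  /* shift in one bit */
            len++;
            if (len <= MAX_LEN && count[len] &&
                code >= first_code[len] &&
                code <  first_code[len] + count[len]) {
                putchar(symbols[first_sym[len] + code - first_code[len]]);
                code = 0;
                len = 0;
            }
        }
        putchar('\n');                             /* prints: bad */
        return 0;
    }

Because recognizing a codeword reduces to one compare-and-subtract per length, a hardware decoder can evaluate all lengths in parallel pipeline stages, which is what keeps the decompression latency short.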
|
9:45 — 10:15 |
Break |
Session 2 |
10:15 — 10:45 |
Rethinking the Architecture of Warehouse-Scale Computers: Improving Efficiency and Utilization
Jason Mars
University of California, San Diego
The class of datacenters known as "warehouse-scale computers" (WSCs) houses large-scale, data-intensive web services such as web search, maps, social networking, docs, and video sharing. Companies like Google, Microsoft, Yahoo, and Amazon spend tens to hundreds of millions of dollars to construct and operate WSCs to provide these services. Maximizing the efficiency of this class of computing reduces cost and has energy implications for a greener planet. However, WSC design and architecture remain in relative infancy.
WSCs are built using commodity processor architectures (Intel/AMD) and software components (Linux, GCC, JVM, etc.) that have been engineered and optimized for traditional computing environments and workloads, such as those found in desktop and laptop environments. However, many characteristics, assumptions, and requirements of the WSC computing domain impact design decisions within these components. In this presentation, we rethink how WSCs are designed and architected, identify sources of inefficiency, and develop solutions to improve WSCs, with a particular focus on the interaction between the application layer, the system software stack, and the underlying hardware platform.
|
10:45 — 11:15 |
Fast Modeling Technology in the Multicore Era
Erik Hagersten
Uppsala University
Multicore introduces many hardware design tradeoffs, and simulation technology is used extensively to guide hardware designs towards near-optimal design points. However, multicore also introduces a number of new design choices for the programmer, compiler, and runtime system, and finding near-optimal design points for these technologies could be just as important. In this talk I will argue that simulation/modeling technology is needed not only by hardware developers, but also by application developers and runtime systems. Such use requires an overhead two to three orders of magnitude lower than that of traditional simulation technology. I will present statistical simulation methods, architecture-independent performance metrics, and performance variability studies as three possible "simulation" technologies for such usage.
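To make the idea concrete, here is a hedged C sketch (synthetic random trace, invented parameters) of one such statistical technique: sample a small subset of memory accesses, measure each sample's LRU stack distance, and estimate the miss ratio for a given cache size from the fraction of samples whose distance exceeds it.

    #include <stdio.h>
    #include <stdlib.h>

    #define TRACE_LEN 50000
    #define MAX_ADDRS 2048

    static unsigned trace[TRACE_LEN];

    /* LRU stack distance: distinct addresses touched since the previous
       access to trace[pos]; -1 for a cold (first) access. A deliberately
       naive backward scan, for clarity. */
    static int stack_distance(int pos) {
        unsigned seen[MAX_ADDRS];
        int nseen = 0;
        for (int i = pos - 1; i >= 0; i--) {
            if (trace[i] == trace[pos]) return nseen;
            int dup = 0;
            for (int j = 0; j < nseen; j++)
                if (seen[j] == trace[i]) { dup = 1; break; }
            if (!dup) seen[nseen++] = trace[i];
        }
        return -1;
    }

    int main(void) {
        for (int i = 0; i < TRACE_LEN; i++)
            trace[i] = (unsigned)(rand() % MAX_ADDRS);  /* synthetic trace */

        int cache_lines = 512, misses = 0, samples = 0;
        for (int i = 0; i < TRACE_LEN; i += 199) {      /* sparse sampling */
            int d = stack_distance(i);
            samples++;
            if (d < 0 || d >= cache_lines) misses++;    /* LRU would miss */
        }
        printf("estimated miss ratio for %d lines: %.3f\n",
               cache_lines, (double)misses / samples);
        return 0;
    }

The naive scan stands in for the lightweight sampling machinery real tools use; the point is the estimator, which needs only a sparse sample of reuses rather than a full cycle-accurate simulation.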
|
11:15 — 12:00 |
A Pathway to Usability in the Face of Extreme Concurrency
ACM Distinguished Lecture
Bronis de Supinski
Lawrence Livermore National Laboratory
We are faced with an explosion in parallelism at all levels of large-scale systems. Multi-core chips have become ubiquitous, and almost all large-scale systems use them. Several systems, such as Lawrence Livermore National Laboratory's BlueGene/P and Oak Ridge National Laboratory's Jaguar, already have over 100,000 processor cores. The recently announced ASC Sequoia will have over 1.5 million cores when it is deployed in FY12. Los Alamos National Laboratory's Roadrunner system features a heterogeneous node architecture that requires the use of three different compilers to build a single application. Hardware support for other novel parallelism mechanisms, such as transactional memory and thread-level speculation, is likely to appear in systems in the near future. Further, future systems are likely to have much less off-chip and off-node bandwidth per core, as well as significantly smaller main memories per core. These trends will necessitate significant changes in applications and in the development environment that supports them. We will require new mechanisms to target applications to these architectures, to identify and fix software defects that arise in those applications, and to understand and improve their performance. In this talk, I will detail several novel directions that we are pursuing at Livermore to overcome many of these challenges.
Please note that the exact content of this talk may be modified when selected from the ACM Speakers' Program to reflect new results.
|
12:00 — 13:30 |
Lunch |
Session 3 |
13:30 — 14:00 |
Performance Analysis in GPUs
Amirali Baniasadi
University of Victoria
We start by discussing three parameters affecting performance: branch divergence, memory access delays, and limited workload parallelism. We suggest machine models to estimate the performance gain potentials obtainable by eliminating each performance-degrading parameter. Such estimates indicate how much improvement designers can expect from investing in different GPU subsections. We conclude that memory is by far the most important of the three parameters impacting performance.
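A hedged sketch of the kind of machine model described, with invented cycle attributions: charge total cycles to each degrading parameter, then report the upper-bound speedup from eliminating one parameter at a time.

    #include <stdio.h>

    int main(void) {
        double useful     = 100e6;  /* cycles of useful work                */
        double branch_div =  30e6;  /* lost to branch divergence            */
        double mem_stall  = 120e6;  /* lost to memory access delays         */
        double low_par    =  15e6;  /* lost to limited workload parallelism */
        double total = useful + branch_div + mem_stall + low_par;

        /* Upper-bound gain from removing each parameter in isolation. */
        printf("no branch divergence: %.2fx\n", total / (total - branch_div));
        printf("ideal memory:         %.2fx\n", total / (total - mem_stall));
        printf("ample parallelism:    %.2fx\n", total / (total - low_par));
        return 0;
    }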
We then investigate the impact of warp size to provide a better understanding of how divergence can be addressed. GPUs execute threads in groups (called warps) to achieve both scheduling simplicity and SIMD efficiency. We study how the number of threads in each warp impacts branch and memory divergence. Small warps reduce the performance penalty associated with branch divergence at the expense of higher memory divergence. Large warps enhance memory coalescing significantly but also increase branch divergence. We present and evaluate two possible choices for addressing both divergences simultaneously: use small warps and invest in new solutions to memory divergence, or use large warps and address branch divergence with effective control-flow solutions.
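The tradeoff can be made concrete with a toy model (all parameters are assumptions for illustration): larger warps execute both branch paths more often, but amortize coalesced memory transactions over more lanes.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double p = 0.9;   /* probability a given thread takes the branch */
        int seg = 32;     /* lanes served by one memory transaction      */
        for (int w = 4; w <= 64; w *= 4) {
            /* A warp pays for both paths unless all lanes agree. */
            double agree  = pow(p, w) + pow(1.0 - p, w);
            double passes = 2.0 - agree;           /* 1.0 = no divergence */
            /* Unit-stride lanes share segment-sized transactions. */
            double trans_per_lane = (double)((w + seg - 1) / seg) / w;
            printf("warp %2d: branch passes %.2f, transactions/lane %.3f\n",
                   w, passes, trans_per_lane);
        }
        return 0;
    }

Running this shows warp size 4 averaging 1.34 path passes but 0.25 transactions per lane, while warp size 64 approaches 2.0 passes with only 0.03 transactions per lane, which is exactly the tension the abstract describes.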
|
14:00 — 14:30 |
A Platform to Help Migrate OpenMP Code to Heterogeneous Systems
François Bodin
CAPS Entreprise
Heterogeneous systems, based on a multicore processor supplemented with accelerator or coprocessor units, can provide high performance at the expense of extra programming complexity. Migrating codes to these systems is usually an intricate, burdensome task, even though new standards such as OpenACC and the OpenMP accelerator extensions greatly simplify programming. Dealing with data transfers and performance tuning, however, remains the programmer's burden.
In this presentation we give an overview of the technology developed at CAPS to help migrate regular OpenMP codes to heterogeneous systems. We address issues such as multiple address spaces, portability, code outlining for auto-tuning, and the integration of domain-specific code transformations.
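As a concrete illustration of the migration in question, the sketch below shows the same loop written for a multicore host with OpenMP and then offloaded with an OpenACC directive; the explicit copyin/copy clauses are precisely the data-transfer burden the abstract mentions. (The loop is an illustrative saxpy, not code from the CAPS toolchain.)

    #include <stdio.h>
    #define N 1000000

    int main(void) {
        static float x[N], y[N];
        float a = 2.0f;
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 1.0f; }

        /* Original multicore version. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) y[i] = a * x[i] + y[i];

        /* Accelerator version: the directive is simple, but the data
           movement (copyin/copy) still needs manual thought. */
        #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
        for (int i = 0; i < N; i++) y[i] = a * x[i] + y[i];

        printf("%f\n", y[0]);
        return 0;
    }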
|
14:30 — 15:00 |
Phalanx: Designing a Unified Programming Model for Heterogeneous Machines
Bryan Catanzaro
NVIDIA
Heterogeneous, massively multithreaded processors are increasingly important at every scale of computing, from smartphones to supercomputers. However, mainstream programming languages reflect a mental model in which processing elements are homogeneous, concurrency is limited, and memory is a flat, undifferentiated pool of storage. Moreover, the current state of the art in programming heterogeneous machines tends towards using separate programming models, such as OpenMP and CUDA, for different portions of the machine. Both of these factors make programming heterogeneous machines unnecessarily difficult. We describe the design of Phalanx, which seeks to provide a unified programming model for heterogeneous machines. It provides constructs for bulk parallelism, synchronization, and data placement that operate across the entire machine. In this talk, I will discuss the programming model and its implementation, giving examples that show how Phalanx's approach to heterogeneity simplifies the task of exploiting heterogeneous machines.
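To convey what a single launch construct spanning the machine could look like, here is a purely hypothetical C sketch. It is emphatically not the Phalanx API; every name in it is invented, and the point is only the shape of the idea: one dispatch point in place of separate OpenMP and CUDA code paths.

    #include <stdio.h>

    typedef enum { CPU_CORES, GPU_DEVICE } place_t;  /* hypothetical placement */

    typedef void (*kernel_fn)(int idx, void *args);

    /* Hypothetical single entry point for bulk parallelism; a real
       runtime would dispatch per target instead of this serial loop. */
    static void launch(place_t where, int n, kernel_fn k, void *args) {
        (void)where;
        for (int i = 0; i < n; i++) k(i, args);
    }

    static void scale(int i, void *args) {
        float *v = args;
        v[i] *= 2.0f;
    }

    int main(void) {
        float v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        launch(GPU_DEVICE, 8, scale, v);   /* same call shape for CPU_CORES */
        printf("%f\n", v[0]);              /* prints 2.0 */
        return 0;
    }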
|
15:00 — 15:30 |
Break |
Session 4 |
15:30 — 16:00 |
Probabilistic Pointer Analysis in SSA Form and Its Applications
Jenq Kuen Lee
National Tsing Hua University
Probabilistic pointer analysis (PPA) is a compile-time analysis that estimates the probability that a points-to relationship will hold at a particular program point. The results are useful for optimizing and parallelizing compilers, which need to quantitatively assess the profitability of transformations when performing aggressive optimizations and parallelization. In this talk, I will report our recent progress on a PPA technique using static single assignment (SSA) form. When computing the probabilistic points-to relationships of a specific pointer, a pointer relation graph (PRG) is first built to represent all of the pointer's possible points-to relationships. The PRG is then transformed by a sequence of reduction operations into a compact graph, from which the probabilistic points-to relationships of the pointer can be determined. In addition, PPA is extended to interprocedural cases by considering function-related statements. We have implemented both static and profile-guided versions of the proposed scheme in the Open64 compiler. I will also describe applications that can potentially benefit from such analysis. (This work is based on our paper in IEEE TPDS, vol. 23, no. 12, Dec. 2012.)
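A small worked example of what PPA computes, with branch probabilities assumed to come from profiling; the numbers are illustrative, not output of the Open64 implementation.

    #include <stdio.h>

    int main(void) {
        int a = 1, b = 2, c = 3;
        int *x = &c;
        int cond = 1;             /* say profiling gives P(cond) = 0.6 */

        if (cond) x = &a;         /* taken with probability 0.6 */
        else      x = &b;         /* taken with probability 0.4 */

        /* At this program point, PPA would report
             P(x -> a) = 0.6,  P(x -> b) = 0.4,  P(x -> c) = 0.0,
           letting an optimizer weigh transformations that
           speculate on x == &a. */
        printf("%d\n", *x);
        return 0;
    }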
|
16:00 — 16:30 |
Improving Processor Efficiency by Statically Pipelining Instructions
David Whalley
Florida State University
A new generation of applications requires reduced power consumption without sacrificing performance. Instruction pipelining is commonly used to meet application performance requirements, but some implementation aspects of pipelining are inefficient with respect to energy usage. We propose static pipelining as a new instruction set architecture to enable more efficient instruction flow through the pipeline, which is accomplished by exposing the pipeline structure to the compiler. While this approach simplifies hardware pipeline requirements, significant modifications to the compiler are required. We describe the code generation and compiler optimizations we implemented to exploit the features of this architecture and show that we can achieve performance and code size improvements despite a very low-level instruction representation. We also demonstrate that static pipelining of instructions reduces energy usage by simplifying hardware, avoiding many unnecessary operations, and allowing the compiler to perform optimizations that are not possible on traditional architectures.
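The ordinary C loop below illustrates, in its comments, the kind of per-iteration pipeline work being targeted; the micro-operation commentary is a hedged reading of the exposed-pipeline idea, not the talk's actual instruction set.

    #include <stdio.h>

    int main(void) {
        int a[100], sum = 0;
        for (int i = 0; i < 100; i++) a[i] = i;

        for (int i = 0; i < 100; i++) {
            /* On a conventional pipeline, every iteration re-fetches
               the same source registers, recomputes the same branch
               target, and repeats other effects whose values never
               change across iterations. With the pipeline registers
               exposed in the ISA, the compiler can hoist such
               loop-invariant effects out of the loop while the C
               semantics stay identical. */
            sum += a[i];
        }
        printf("%d\n", sum);   /* prints 4950 */
        return 0;
    }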
|
16:30 — 17:00 |
Using the General-Purpose Part of a Special-Purpose Machine: The Fast Fourier Transform on Anton
Clifford Young
D.E. Shaw
No abstract available at this time.
|
17:00 — 17:15 |
Closing Remarks
John Cavazos
University of Delaware
|