Designing Scalable Distributed Memory Models: A Case Study

Published in Computing Frontiers, 2017

One promising effort as we progress toward exascale is the development of fine grain execution models. These models display an innate agility, providing new avenues to address the challenges presented by future systems such as extreme parallelism, restrictive power constraints, and fault tolerance. These opportunities, however, may be prematurely abandoned if the system software, particularly a distributed runtime, is incapable of scaling. One potentially limiting factor is the enforcement of the memory model in a runtime.

Extending the Roofline Model for Asynchronous Many-Task Runtimes

Published in CLUSTER, 2016

A common practice for application developers is to experimentally determine the granularity of a task after a code has been parallelized based on the observed overhead of a runtime. Instead, we propose a new methodology based on an extended Roofline model to provide practical upper bounds on the throughput performance of an application. First, we extend the Roofline model to support not only latency hiding analysis, but also a multidimensional amortized analysis. By combining this new methodology with a serial application and an Asynchronous Many Task (AMT) runtime implementation, we can predict the worst case runtime overhead attribution of individual runtime features prior to the development of parallel code.

Asynchronous Runtimes in Action: An Introspective Framework for a Next Gen Runtime

Published in IPDRM, 2016

One of the most critical challenges that new high performance systems face is the lack of system software support for these large scale systems. Investment in system stack components is essential to the development, debugging, and optimization of emerging programming models. These emerging models promise to better utilize the vast hardware resources available in current and future systems. To aid in the development of applications and new system stacks, runtimes, as instances of their respective execution models, need to provide facilities to introspect their inner workings and allow an in-depth attribution of performance bottlenecks and computational patterns. In other words, runtime systems need to reduce their opacity to observers so that users of a novel program execution model can adapt their designs to fit the intended model usage, regardless of the layer that they are working on. This design/development loop (akin to co-design) enables synergistic opportunities across the entire computational stack. This paper presents the design and implementation of a simple “gray” box performance attribution harness running inside a fine grain runtime system: the Open Community Runtime (OCR). We showcase what such a framework can indicate regarding the runtime behavior while running at scale. To this end, we have designed a set of synthetic scenarios aimed at testing the runtime in its best and worst cases. We present an analysis of the most important runtime features, properties, and idiosyncrasies that will affect the development of new runtime features, algorithmic selection, and application development.

Application Characterization at Scale: Lessons Learned from Developing a Distributed Open Community Runtime System for High Performance Computing

Published in Computing Frontiers, 2016

Since 2012, the U.S. Department of Energy’s X-Stack program has been developing solutions including runtime systems, programming models, languages, compilers, and tools for the Exascale system software to address crucial performance and power requirements. Fine grain programming models and runtime systems show great potential to efficiently utilize the underlying hardware. Thus, they are essential to many X-Stack efforts. An abundance of small tasks can better utilize the vast parallelism available on current and future machines. Moreover, finer tasks can recover faster and adapt better, due to a decrease in state and control.

Toward a Unified HPC and Big Data Runtime

Published in STREAM, 2015

The landscape of high performance computing (HPC) has radically changed over the past decade as the community has well surpassed Petascale performance and aims for Exascale. In this effort, chip fabrication and hardware architects have been directly challenged by the fundamental physics of chip manufacturing. The effects of these challenges have extended beyond the underlying hardware, requiring the attention of the entire stack. As the fight for raw performance continues, a new field in computing has emerged. Big Data Analytics has been heralded as the fourth paradigm of science by turning enormous volumes of data into actionable knowledge. Big Data’s influence spans various interests, including commercial, political, and scientific fields.

DARTS: A Runtime Based on the Codelet Execution Model

Published in University of Delaware, 2014

Over the past decade computer architectures have drastically evolved to circumnavigate prevailing physical limitations in chip technology. Energy consumption and heat expenditure have become the predominant concerns for architects and chip manufacturers. Previously anticipated trends such as frequency scaling, deep execution pipelines, and fully consistent caches in future many-core systems have been deemed unsustainable.

Using a Codelet Program Execution Model for Exascale Machines: Position Paper

Published in EXADAPT, 2011

As computing has moved relentlessly through giga-, tera-, and peta-scale systems, exa-scale (a million trillion operations/sec.) computing is currently under active research. DARPA has recently sponsored the “UHPC” [1] (ubiquitous high-performance computing) program, encouraging partnership with academia and industry to explore such systems. Among the requirements is the development of novel techniques in “self-awareness” in support of performance, energy-efficiency, and resiliency.

Toward an Execution Model for Extreme-Scale Systems - Runnemede and Beyond

Published in University of Delaware, 2011

The Intel-led Ubiquitous High-Performance Computer (UHPC) Runnemede project has been proposed to meet the challenges of the design and development of extreme scale computing systems. It must integrate the combined capabilities of hardware and software technologies in order to yield superior operational attributes. The Runnemede architecture redefines the relationship between memory and processors to provide efficient computation in the presence of drastically disparate cycle times and latencies.