

Generating Fine-Grain Multithreaded Applications Using

a Multigrain Approach

JAIME ARTEAGA, University of Delaware

STÉPHANE ZUCKERMAN, Michigan Technological University

GUANG R. GAO, University of Delaware

The recent evolution in the hardware landscape, aimed at producing high-performance computing systems capable of reaching extreme-scale performance, has reignited the interest in fine-grain multithreading, particularly at the intranode level. Indeed, popular parallel programming environments, such as OpenMP, which features a simple interface for the parallelization of programs, are now incorporating fine-grain constructs. However, since coarse-grain directives are still heavily used, the OpenMP runtime is forced to support both coarse- and fine-grain models of execution, potentially reducing the advantages obtained when executing an application in a fully fine-grain environment. To evaluate the type of applications that benefit from executing in a unified fine-grain program execution model, this article presents a multigrain parallel programming environment for the generation of fine-grain multithreaded applications from programs featuring OpenMP’s API, allowing OpenMP programs to be run on top of a fine-grain event-driven program execution model. Experimental results with five scientific benchmarks show that fine-grain applications, generated by and run on our environment with two runtimes implementing a fine-grain event-driven program execution model, are competitive and can outperform their OpenMP counterparts, especially for data-intensive workloads with irregular and dynamic parallelism, reaching speedups as high as 2.6× for Graph500 and 51× for NAS Data Cube.

CCS Concepts: • Computing methodologies → Parallel programming languages; Concurrent programming languages; • Software and its engineering → Compilers;

Additional Key Words and Phrases: Codelets, DARTS, extreme-scale computing, multigrain parallel compiler,

runtime, SWARM

ACM Reference format:

Jaime Arteaga, Stéphane Zuckerman, and Guang R. Gao. 2017. Generating Fine-Grain Multithreaded Applications Using a Multigrain Approach. ACM Trans. Archit. Code Optim. 14, 4, Article 47 (December 2017), 26 pages.

https://doi.org/10.1145/3155288

Extension of conference paper (Arteaga et al. 2017). New content includes a more detailed presentation of the proposed compiler techniques and an expanded experimental section with an additional fine-grain runtime.
This work was partially supported by the National Science Foundation, under award XPS-1439097.
Authors' addresses: J. Arteaga and G. R. Gao, University of Delaware, Department of Electrical and Computer Engineering, Room 140 Evans Hall, 139 The Green, Newark, DE 19716, USA; emails: [email protected], [email protected]; S. Zuckerman, Michigan Technological University, Department of Computer Science, Rekhi Hall, 1400 Townsend Drive, Houghton, MI 49931, USA; email: [email protected].
© 2017 ACM 1544-3566/2017/12-ART47 $15.00
https://doi.org/10.1145/3155288


1 INTRODUCTION

Current and upcoming changes in hardware, and in particular the advent of “true” general-purpose manycore chips (e.g., Intel’s latest Xeon and IBM’s POWER8 processors offering close to a hundred processing elements) designed to help supercomputers reach extreme-scale performance, must be accompanied by a paradigm shift in the programming models (PM) and program execution models (PXM) used to write and execute parallel applications (DoE 2010).1 This has renewed the interest in fine-grain PMs and PXMs because of their potential to better handle the high core count of shared-memory nodes of future extreme-scale systems and to efficiently execute modern workloads with dynamic and unstructured parallelism, like those containing unbounded loops and recursive functions. One such PXM is the Codelet Model (Zuckerman et al. 2011), a hybrid dataflow/von Neumann fine-grain event-driven PXM that provides programmers with a comprehensive application programming interface (API) to develop parallel programs with several levels of granularity and parallelism. The Delaware Adaptive Runtime System (DARTS) (Suettlerlein et al. 2013) and the SWift Adaptive Runtime Machine (SWARM) (Lauderdale and Khan 2012) are two of the runtimes that have been developed based on the Codelet Model. They differ in how relaxed their implementations of the Codelet Model are, their APIs, and their hardware support, among other features.

Fine-grain multithreading has also been added to legacy PMs, like OpenMP (OpenMP4.0 2011). OpenMP is a parallel programming environment comprising three aspects: a set of compiler directives, a runtime library, and a set of environment variables accessible at execution time by the user. OpenMP is particularly attractive because of its straightforward and simple interface, which can be added incrementally to sequential code. At first, OpenMP focused on offering coarse-grain constructs for workloads with structured parallelism (e.g., numerical applications with for loops). But recent versions of the standard now also include fine-grain constructs, named tasks, which are created and scheduled dynamically at runtime and are more suitable for modern workloads with unstructured parallelism.2
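For illustration only (this fragment is not drawn from the article), the following OpenMP C sketch shows the kind of unstructured, recursive parallelism that tasks express: each call spawns child tasks that the runtime creates and schedules dynamically.

#include <stdio.h>

static long fib(int n)
{
    long x, y;
    if (n < 2)
        return n;
    #pragma omp task shared(x)      /* child task, scheduled dynamically by the runtime */
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait            /* wait for both child tasks */
    return x + y;
}

int main(void)
{
    long result;
    #pragma omp parallel
    #pragma omp single              /* one thread seeds the task tree */
    result = fib(30);
    printf("fib(30) = %ld\n", result);
    return 0;
}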

While the enrichment of OpenMP’s programming model is most welcome, it results in a significant increase in complexity with respect to its PXM. Indeed, OpenMP must support both coarse-grain and fine-grain models of execution and allow them to coexist at the same time during execution. We believe that this may be counterproductive when executing certain applications, such as those with irregular and dynamic workloads, which may be more likely to achieve maximum performance when running on top of a unified fine-grain environment. Moreover, we posit that “regular” coarse-grain constructs in the PM can be safely converted to fine-grain ones in the PXM with little or no loss in performance.

It is then important to provide developers with tools that allow them to determine the execution

model that is most appropriate for each application while keeping, at a high level, a common

and well-known interface to write parallel programs. This article presents a multigrain parallel

programming environment designed around the Codelet Model for the fine-grain execution of

OpenMP programs. This environment takes an OpenMP program and generates an equivalent

fine-grain event-driven application, either in DARTS or SWARM, to fully exploit the high core

count available in the underlying hardware. By using this environment, programmers will be

able to keep using OpenMP’s simple and familiar API and, at the same time, benefit from the

1 In this article, we define a Programming Model (PM) as the way to express what parallel constructs and behaviors can be expressed by the programmer, whereas a Program Execution Model (PXM) expresses how such parallel behaviors are created and destroyed, how they are orchestrated, and how they react when faced with memory accesses (shared or distributed).
2 The keyword depend is also an important addition to the standard, as it provides a data-driven way of expressing parallelism.


advantages offered by a model specifically designed for fine-grain execution. Moreover, this

environment may help programmers transition into the use of fine-grain PXMs to write parallel

applications for many and multicore architectures.

The major component in our environment is omp2cd, a multigrain parallel compiler based on clang, which takes an input OpenMP program written in C and transforms it into a fine-grain

application, automatically determining tasks’ granularities at several levels of parallelism. This is

done depending on the type of parallelism present in the application, as well as on the OpenMP

directives and compiler hints in the form of OpenMP clause extensions used by the programmer.

This article presents the following contributions:

• omp2cd , a multigrain parallel compiler for the generation of fine-grain applications from

OpenMP programs to be run on top of the Codelet Model, using DARTS or SWARM.

• An analysis of the parameters and runtime options available in DARTS to fine-tune its

highly configurable abstract machine model (AMM) according to the application’s features

and underlying hardware.

• Experimental evaluation of the multigrain parallel programming environment using a set of scientific benchmarks, showing that the fine-grain applications generated by our environment are highly competitive and can outperform their OpenMP counterparts.

This article is structured as follows: Section 2 introduces the Codelet Model and two fine-grain runtimes implementing it, DARTS and SWARM. Section 3 presents the multigrain parallel programming environment and its compiler, as well as some of the new optimizations introduced to DARTS during the development of this work. Section 4 presents the experimental evaluation of the environment. Section 5 discusses some of the related work found in the literature. The article closes by discussing the major conclusions derived from this work and directions for future work.

2 BACKGROUND

2.1 The Codelet Model

The Codelet Model is a fine-grain event-driven PXM designed to fully exploit architectures featuring a high degree of parallelism by decomposing an application into a great number of lightweight asynchronous tasks called codelets. Codelets can be easily scheduled among cores, depending on resource availability, proximity to the data to compute, and other constraints (Zuckerman et al. 2011). Each codelet has a collection of data inputs (respectively, outputs) coming from (respectively, going to) other codelets. Additionally, a codelet may require a set of specific resources for its execution. When both data and resource dependencies are satisfied, a codelet is said to be ready and is scheduled for non-preemptive execution, potentially reducing the overhead produced by context switching. Also, since codelets only act on their data inputs and outputs, which are local, long-latency memory operations are avoided and power consumption decreased. Interactions between codelets are described in the form of a Codelet Graph (CDG), a directed graph where nodes represent codelets and arcs the dependencies between them. CDGs are usually defined statically,3 leaving to the runtime the instantiation and execution of TPs and codelets upon satisfaction of dependencies.

Codelets may be grouped into Threaded Procedures (TPs). A TP is an asynchronous function that

acts as a container for a CDG and the data accessed by its codelets. This provides a second level of

parallelism in the Codelet Model that can be exploited for increasing locality of data. The number

3 Being statically determined does not mean the CDG is generated at compile time. However, a given CDG will be fully specified when it is instantiated (i.e., according to the model), and a codelet cannot be dynamically added to an already existing CDG.


Fig. 1. Codelet model’s Abstract Machine Model (AMM).

of TPs within an application, as well as the number of codelets per TP and statements per codelet,

may vary, providing users with the ability to decompose an application with several degrees of

granularity at different levels of parallelism.

The Codelet Model relies upon an Abstract Machine Model (AMM), shown in Figure 1, which is

mapped at runtime to the target many-core system. The system is hierarchically decomposed into

compute nodes, each featuring at least one chip. Each chip is composed of clusters of cores. Cores

can be either Synchronization Units (SUs) or Compute Units (CUs), with each cluster containing one

SU and zero or more CUs. An SU is in charge of allocating TPs and scheduling codelets to the CUs

within the same cluster. Additionally, an SU may execute a codelet if all of its associated CUs are

busy. CUs execute codelets and send TP allocation requests to their local SUs. Nevertheless, and

depending on the runtime implementation, such requests may be ultimately fulfilled by an SU in a

different cluster. Once a TP has been officially allocated by a given cluster, the codelets it contains

cannot migrate outside of it. This guarantees a certain degree of locality for code execution.

2.2 Delaware Adaptive Runtime System - DARTS

Delaware Adaptive Runtime System (DARTS) is currently the most faithful runtime implementation

of the Codelet Model and its AMM (Suettlerlein et al. 2013). It is written in C++ and distributed as free and open-source software (CAPSL 2013). In DARTS, TPs and codelets are defined as class objects at compile time and are instantiated at runtime. A TP is instantiated when a codelet performs a TP invocation. This operation creates a TP handle and sends a request to the runtime to allocate memory for the TP instance and the instances of the codelets it contains. Once this is done, codelet instances are executed upon satisfaction of their dependencies.

A two-level scheduling mechanism, comprising TP schedulers and microschedulers, is used for

the scheduling and execution of codelets. A TP scheduler is a core which has been assigned the

role of SU and has a TP queue (TPQ) with TP instantiation requests and a ready-codelets queue

(RCQ). A microscheduler is a core acting as a CU with a local codelet queue (LCQ), which is fed

by its local TP scheduler with codelets from its RCQ.

DARTS allows users to determine the “shape” of the AMM by setting the number of clusters in

the machine and the number of CUs associated with an SU. This can be done based on the topology

of the target architecture, such as the number of sockets or NUMA nodes. Hence, the two-level

parallelism exposed by the Codelet Model is preserved and configurable in DARTS.

Users can also fine-tune DARTS AMM by setting the scheduler affinity policy as COMPACT (SUs

and CUs are pinned down to physically contiguous cores), SPREAD (schedulers are assigned in a

round-robin fashion to the different sockets of the platform), or user-defined using an environment

variable that follows the same notation as GCC’s GOMP_CPU_AFFINITY .

In addition, different scheduling policies are available, namely standard and work-stealing. With a standard policy, a TP scheduler only initializes TPs assigned locally by one of its microschedulers and distributes codelets in round-robin fashion to them. With work-stealing, on

the other hand, a TP scheduler may also steal TP handles from another TP scheduler’s TPQ. A


microscheduler with standard policy only executes codelets pushed into its local queue by its TP

scheduler. However, if its policy is work-stealing, it may steal codelets from the TP scheduler’s

RCQ. Other new optimizations available to fine-tune the AMM that have been developed in conjunction with the design of the environment are presented in Section 3.5.

2.3 SWift Adaptive Runtime Machine - SWARM

The SWift Adaptive Runtime Machine (SWARM) is a commercial implementation of the Codelet

Model written in C and developed by E.T. International, Inc. (Lauderdale and Khan 2012 ). As with

DARTS, the basic unit of computation is also a codelet, which runs to completion without blocking.

However, unlike DARTS, SWARM allows codelets to be preempted in certain circumstances, for instance to prevent a codelet from running indefinitely. TPs are not defined in SWARM, and fine-grain

applications are described only in terms of codelets (this contrasts with DARTS, where the use of

TPs is enforced by its PM). Instead, SWARM uses the concept of codelet complexes, defined as a set of codelets that work cooperatively to complete a specific task and that share a private context frame. SWARM also defines locales to describe the hardware segments into which the runtime environment is divided. With locales, programmers can specify the processing element where a codelet should be executed or the memory area where an object should reside.

Locales are organized hierarchically and each one is associated with a scheduler, which is in

charge of managing and scheduling work for the hardware associated with the locale. When a

codelet is ready to run, it is placed in the scheduler’s queue and then executed when a thread in

the scheduler’s locale is available. If, on the other hand, the scheduler is idle and there are no ready

codelets to run, it may steal work from other schedulers in the locale’s hierarchy. If no work is

found, the thread associated with the scheduler may be placed to sleep until there is work to do.

3 MULTIGRAIN PARALLEL PROGRAMMING ENVIRONMENT

This section presents two of the main components of our multigrain parallel programming environment for the fine-grain execution of OpenMP programs. The first one is omp2cd, a multigrain parallel compiler which generates a fine-grain codelet application targeting a runtime implementation of the Codelet Model, either DARTS or SWARM, depending on the command-line options used during compilation. The second component is an optimized version of the DARTS runtime, which we have tuned during the development of the environment to improve the performance of DARTS applications with respect to OpenMP. For SWARM, the implementation provided and extensively optimized by E.T. International, Inc. is used.

3.1 Compiler Overview

For the generation of fine-grain applications from OpenMP programs, the multigrain parallel compiler omp2cd was designed. The compiler is based on clang 3.9, accepts input source files written in C, and supports OpenMP v2.5 (partially4), as well as the task and taskwait constructs defined in v3.0 and the depend clause from v4.0 (OpenMP4.0 2011).

A diagram of the compiler’s internal structure is presented in Figure 2. The compiler takes a set of OpenMP C source files and uses clang’s front-end to generate an Abstract Syntax Tree (AST) for each translation unit. Each AST is then visited and annotated with information needed by other compiler stages, such as the OpenMP directives and clauses used by the programmer, control flow patterns in the code (e.g., selection, iteration, fork-join), and synchronization points, among others. These annotations allow the compiler to later determine how to implement

4 Missing support from OpenMP v2.5 includes the workshare and ordered directives and runtime routines to handle nestable locks. Work is currently being done to add them to the compiler.


Fig. 2. omp2cd’s internal structure and interaction with other components of the environment.

each section of code in the input program (either as a collection of TPs or codelets), to generate the correct set of statements to invoke TPs and signal codelets, and to apply certain available optimizations.

The annotated ASTs then go through an interprocedural analysis , which identifies the functions

containing omp executable directives and/or being called from omp regions, to correctly implement

these functions and function calls using the target fine-grain programming model’s API. In the case

of DARTS, for example, a function may be converted into a TP, and every call to such function is

translated into a TP invocation.

The next stage takes the annotated ASTs and translates them into a collection of CDGs by grouping AST nodes into codelets. This is done through a multigrain parallel analysis (Section 3.2) based

on the annotations made previously in the ASTs. This stage also creates upper levels of parallelism,

like grouping codelets into TPs, provided the target runtime selected by the user supports it, like

DARTS.

A multigrain optimization stage (see Section 3.3) transforms the CDGs at different levels of parallelism to improve the application’s performance in a fine-grain execution environment. Last,

the optimized CDGs are translated into source files containing the target API, either DARTS or

SWARM. In the case of DARTS CDGs, the user can then configure the abstract machine at runtime

to better suit locality and/or affinity requirements and further improve performance.

All mechanisms and optimizations available in DARTS are currently supported by omp2cd and integrated in the output applications. When using SWARM as a back-end, we only use regular codelets; codelet complexes, codelet chaining, and the use of locales will be included in future versions of the compiler.

3.2 Creation of CDGs Using a Multigrain Parallel Analysis

To generate CDGs from the annotated ASTs, omp2cd uses a multigrain parallel analysis that follows

the specifications of the Codelet Model. This analysis defines the number and granularities of the

parallel actors to be used by the output application to execute the input OpenMP program in a

fine-grain manner. This is done at two levels of parallelism: The top level corresponds to TPs,

with their granularity defined as the number of codelets a TP contains; the bottom level comprises

codelets, with each codelet’s granularity being the number of statements to execute when fired.

Generation of TPs and codelets from each annotated AST is made according to the following

set of rules. These have been defined based on an extensive study of the Codelet Model and of

handwritten applications written in DARTS and SWARM, like those used by Suettlerlein et al. (2013) and those included as examples in the release versions of the runtimes. After

applying these rules on the ASTs of a valid OpenMP program, a set of CDGs is created, which are

later optimized.


Fig. 3. Generation of TPs based on the rules defined by omp2cd. The for region’s implicit barrier in functionA is shown in a dotted rectangle since it is not actually present in the code. The atomic statement in functionC is implemented using locks or GCC atomic built-ins, depending on the data type being processed.

3.2.1 Rules to Define TPs. To determine the nodes for which a TP must be used, the compiler

traverses each annotated AST in depth-first order and creates a TP definition for each node

containing:

TP-1: The declaration of a function, other than main(), that contains an omp executable directive.5

TP-2: An omp region.6

Both rules ignore atomic and critical regions since they are translated into atomic statements or protected sections of code using locks. If a function contains at least one barrier or taskwait directive, that function is converted into a TP to implement the synchronization construct. However, these directives are not themselves translated into TPs since they are stand-alone directives with no associated code. Instead, they are implemented as codelets, as shown later.

The generation of TPs from a program’s AST using these rules is shown with three examples in

Figure 3, where TPs are represented with rectangles. In the first example, functionA is a function

containing an omp for region. Following rule TP-1, a TP is defined for this function; the same

occurs for the omp for region according to rule TP-2. The TP defined for the omp for region is a

child of the function’s TP since that is the same relationship their nodes have in the AST. A TP is also defined for functionB, following rule TP-1. No TP is defined for its omp barrier directive since the compiler uses only codelets for its implementation. Finally, functionC comprises an omp

atomic region. This is implemented using locks or GCC atomic built-ins depending on the data

type of the variable being updated. Since no codelets are used for this region, there is no need

to define a TP to group them. Consequently, the function is not translated into a TP but left in

sequential form.
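The source behind Figure 3 is not listed in the article; the following hypothetical OpenMP sketch illustrates the kind of functions these rules operate on (only the function names follow the figure; the bodies are assumptions).

/* Hypothetical reconstruction of the kinds of functions shown in Figure 3. */
void functionA(double *a, int n)          /* rule TP-1: becomes a TP           */
{
    #pragma omp for                       /* rule TP-2: child TP               */
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;                      /* implicit barrier -> codelets only */
}

void functionB(void)                      /* rule TP-1: becomes a TP           */
{
    #pragma omp barrier                   /* no TP: implemented with codelets  */
}

void functionC(int *counter)              /* left in sequential form           */
{
    #pragma omp atomic                    /* lock or GCC atomic built-in       */
    (*counter)++;
}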

5 An omp executable directive is “An OpenMP directive that...may be placed in an executable context” (OpenMP4.0 2011).
6 An omp region is “All code encountered during a specific instance of the execution of a given construct...,” while an omp construct is “An OpenMP executable directive...and the associated statement, loop or structured block, if any...” (OpenMP4.0 2011).


If the target fine-grain runtime does not support TPs (e.g., SWARM), TPs are not created but the

associated nodes are further annotated for subsequent compiler stages.

3.2.2 Rules to Define Codelets. The compiler defines codelets by traversing the annotated ASTs

and grouping statement nodes using the concept of Codelet Basic Blocks (CBBs). A CBB is a set of statements that can be executed by the same codelet and whose leader (i.e., the first statement in

the CBB) is defined by the presence of nodes annotated as TPs or that contain synchronization

directives. Once a leader has been found, the statements that follow it are attached to the newly

created CBB until the leader of a new CBB is defined. Each CBB is assigned a consecutive ID and

a new codelet, which executes the CBB’s statements when fired.

CBB’s leaders are defined according to the following rules, which, as with the rules to create

TPs, also ignore atomic and critical regions. An AST statement is a CBB leader if it is:

CBB-1: An omp executable directive.

CBB-2: A call to a function containing an omp executable directive.

CBB-3: The first statement in an omp region.

CBB-4: The first statement in a function, other than main(), containing an omp executable directive.

CBB-5: The first statement of a branch, provided any of the branch’s parent nodes are part of a CBB.

CBB-6: The implicit barrier of an omp region.

CBB-7: The statement following a CBB whose leader was created using rules CBB-1, CBB-2, or CBB-6.

Depending on the rule used to determine its leader, a new codelet is defined for the CBB. The

codelet may be one of the following codelet categories:

• Invocation Codelet: Created for CBBs containing a TP invocation. This codelet is also used to monitor the threads that have reached the point in the program where the TP is invoked and are allowed to continue: Rules CBB-1 and CBB-2. Since SWARM does not define TPs, SWARM applications do not contain invocation codelets. The use of these codelets during the instantiation of a TP is explained in Section 3.2.4.

• Computation Codelet: Used for CBBs whose statements are associated with neither TP invocations nor synchronization constructs: Rules CBB-3, CBB-4, CBB-5, and CBB-7.

• Synchronization Codelet: Defined for CBBs representing a synchronization or task-scheduling point in the program, such as barrier and taskwait directives: Rule CBB-6.

Invocation and computation codelets may be instantiated up to N times for concurrent execution, with N being the number of OpenMP threads defined for the current region. Each codelet instance is assigned an ID from 0 to N − 1. Synchronization codelets may be instantiated only once or implemented as a tree to reduce signaling overhead.

Figure 4 shows how CBBs and codelets are defined for the AST of a given OpenMP program (the definition of dependencies between codelets is explained in Section 3.2.3). The figure presents a function containing an omp for region and two sets of statements placed before and after that region. As with functionA in Figure 3, functionD in Figure 4 is translated into a TP using rule TP-1. Then, following rule CBB-4, a new CBB is defined for Statement 0, which becomes the CBB’s leader. Statement 1 is attached to this CBB since no rule applies to it. Because the rule used was CBB-4, the codelet defined for this CBB is a computation one.

The omp for region generates a new TP, according to rule TP-2. In order to invoke it, this directive is translated into a new CBB according to rule CBB-1, which is implemented using an invocation


Fig. 4. Generation of codelets from the AST of an OpenMP program following the rules to define a CBB’s leaders. TPs are represented with rectangles, while circles are used for codelets. Superscripts represent CBB IDs; subscripts represent instance IDs from 0 to N − 1, with N being the number of OpenMP threads in the region. The for region’s implicit barrier is shown in a dotted rectangle since it is not actually present in the code, but it is used by the compiler to define a synchronization codelet.

codelet. The statements in the for region are grouped into a new CBB, for which Statement 2 is defined as the leader, following rule CBB-3. For the region’s implicit barrier, a CBB with a synchronization codelet is created, as indicated by rule CBB-6. Finally, the two statements that follow the region are grouped into a new CBB, implemented as a computation codelet whose leader corresponds to Statement 4, according to rule CBB-7.
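As a companion to Figure 4, the sketch below is a hypothetical reconstruction of functionD (its actual source is not given in the article); the comments map each statement to the CBB rule that applies.

/* Hypothetical sketch of functionD from Figure 4; statement numbering follows the figure. */
void functionD(double *a, const double *b, int n)
{
    double s = 0.0;              /* Statement 0: leader of the first CBB (rule CBB-4)        */
    s = s + 1.0;                 /* Statement 1: attached to the same CBB                    */

    #pragma omp for              /* region -> TP (rule TP-2); invocation codelet (rule CBB-1) */
    for (int i = 0; i < n; i++)
        a[i] = b[i] * s;         /* Statement 2: leader of the region's CBB (rule CBB-3)     */
                                 /* implicit barrier -> synchronization codelet (rule CBB-6)  */

    a[0] += 1.0;                 /* Statement 4: leader of the CBB after the barrier (CBB-7) */
    a[n - 1] += 1.0;             /* Statement 5: attached to the same CBB                    */
}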

3.2.3 Definition of Dependencies between Codelets. To correctly define the dependencies between codelets, omp2cd keeps track of the AST’s locations where CBB’s leaders have been set and

the codelet’s category (i.e., invocation, computation, or synchronization codelet) assigned to each

CBB. With this information, the compiler annotates CBBs with the set of codelets they must signal,

as well as with the number of data dependencies that must be satisfied before their execution.

When the compiler defines a CBB’s leader (and, consequently, a codelet), AST statements are

attached to the current CBB until finding a new leader due to a TP invocation or synchronization

point. When this occurs, the compiler creates a dependence arc between the current codelet (i.e.,

the predecessor codelet) and the new one (i.e., the dependent codelet) to maintain the program’s

flow and its correctness. In the case any of these codelets is meant to be instantiated more than

once at execution time (which depends on the codelet’s category), the compiler adjusts the number

of dependencies each codelet must be initialized with.

For instance, if an arc is defined between a computation codelet and an invocation one (or vice versa), the number of dependencies for both is set to one. This dependency is satisfied by the instance of the predecessor codelet with the same ID as the instance of the dependent codelet, since both computation and invocation codelets are duplicated equally up to the number of OpenMP threads defined for the current region. This can be seen in Figure 4, where there are dependency arcs between computation codelet C^0_0 and invocation codelet I^1_0 (which have the same instance ID = 0), between codelets C^0_1 and I^1_1 (instance ID = 1), and so on. At runtime, an invocation codelet may also reset its dependence counter to one and satisfy it itself with a back edge, to allow the fine-grain application to perform the instantiation of TPs without blocking the codelet (see Section 3.2.4).


Fig. 5. Instantiation at runtime of the for region in Figure 4. The case shown is when the number of threads defined for the region is N = 4 and two instances are defined for the region’s TP. Dotted lines are used to indicate that the corresponding object has not yet been allocated in memory.

A different situation occurs with a synchronization codelet, which may be instantiated only once at runtime. If such a codelet is a predecessor codelet, it must signal all instances of its dependent codelet. This is shown in Figure 4, with S^3_0 signaling all instances of computation codelet C^4. Conversely, if a synchronization codelet is a dependent codelet, its number of dependencies is set to the number of threads in the region. These dependences are satisfied by all the instances of its predecessor codelet. In Figure 4, for instance, the number of dependencies for S^3_0 is set to N. These are satisfied by all instances of C^2. Arcs between two synchronization codelets are usually optimized by merging their CBBs.
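At execution time, satisfying a dependence boils down to decrementing the dependent codelet's counter and handing the codelet to a scheduler once the counter reaches zero. The generic C sketch below illustrates this signaling mechanism; it is not the DARTS or SWARM API, and the enqueue_ready hook is an assumption.

#include <stdatomic.h>

/* Generic illustration of dependence counting (not the DARTS/SWARM API). */
typedef struct codelet {
    atomic_int deps;                      /* remaining unsatisfied dependencies   */
    void (*fire)(struct codelet *self);   /* body executed when the codelet fires */
} codelet_t;

/* Assumed runtime hook that places a ready codelet on some scheduler queue. */
extern void enqueue_ready(codelet_t *c);

static void signal_codelet(codelet_t *c)
{
    /* atomic decrement; the thread that brings the counter to zero
     * hands the codelet to a scheduler for execution */
    if (atomic_fetch_sub(&c->deps, 1) == 1)
        enqueue_ready(c);
}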

A special case occurs when the input program makes use of the depend clause. The clause explicitly defines, at compile time, data dependencies that must be enforced at runtime during the execution of OpenMP tasks. Depending on the type of dependency defined by the clause (either

true, anti, or output dependence), omp2cd creates an arc at compile time between the last codelet

defined for the task region producing the dependence and the first codelet on each of the task

regions expecting it.
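For reference, a minimal OpenMP fragment using the depend clause is shown below (it is not taken from the article's benchmarks); omp2cd would translate the flow dependence on x into a compile-time arc between the codelets generated for the two task regions.

/* Minimal depend example: the flow (true) dependence on x forces the
 * second task to wait for the first. */
int pipeline(int a)
{
    int x = 0, y = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)                 /* producer of x       */
        x = a * 2;

        #pragma omp task depend(in: x) depend(out: y)   /* waits for x, sets y */
        y = x + 1;

        #pragma omp taskwait
    }
    return y;
}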

3.2.4 Instantiation of TPs and Back Edges in Invocation Codelets. As explained in Section 2.2, a TP invocation is a request sent to the runtime for the instantiation of a TP. This request is made

by a microscheduler when it finds a TP invocation operation in the current codelet it is executing.

The request is sent to its local TP scheduler, which allocates memory for the new TP instance and

for the instances of the codelets it contains. The number of instances allocated at runtime for each

TP depends on the type of OpenMP region that it represents. For instance, the TP defined for a

single region may be instantiated only once, whereas several instances of a for region’s TP may

be instantiated to balance the workload across the target machine (Section 3.3 ).

Because currently DARTS relies on the OS’s kernel for any memory allocation, the instantiation

of TPs may be considered a long-latency operation and can cause a codelet to be blocked while

waiting for the instantiation to complete. To avoid this, TP invocation operations are performed

exclusively by invocation codelets. Additionally, once the allocation is made, invocation codelets

are in charge of signaling the codelet’s instances of the new TP instance so they can be scheduled

for execution.

This process is shown in Figure 5. This figure contains a timeline of events occurring at runtime when the for region from Figure 4 is instantiated twice, the C^2 computation codelet defined for the loop’s body is instantiated up to the number of available OpenMP threads in the region (i.e., N = 4), and the codelet’s instances are evenly distributed among the TP instances.

At t0, the invocation codelet with ID = 0, I^1_0, is fired. This instance then checks whether its assigned TP instance for the for region, instance number 0, has been invoked and, assuming it has not, invokes it. I^1_0 now waits until the TP instance is allocated to use the pointer to it to signal one of its


codelets. However, the Codelet Model prohibits having codelets spinning idly until the completion of an event, in this case the allocation of a new TP instance. So the invocation codelet resets itself to be sent back to the local TP scheduler and be scheduled again for execution. By doing this, the invocation codelet releases the processing unit that it was using to allow the execution of another ready codelet. The process of the I^1_0 invocation codelet resetting itself until its TP instance is created is shown at t1 with a back edge.

Once the instance has been allocated at t2, I^1_0 can use the pointer to it to signal the first codelet in the TP with its same instance ID. In Figure 5, this corresponds to the C^2_0 computation codelet, which becomes a ready codelet and is scheduled for execution. Finally, at t3, I^1_1 is fired and, since the pointer to its TP instance is already available, it signals C^2_1 right away, which has its same instance ID. A similar sequence of events occurs for the TP’s instance number 1 when I^1_2 and I^1_3 are fired.

3.3 Multigrain Optimization

To improve the fine-grain execution of the input program, omp2cd may apply a series of optimizations to each CDG, depending on the parallelism available in the application. These optimizations may be complemented at runtime by tuning the fine-grain runtime’s AMM parameters. Two of the main optimizations applied are:

3.3.1 Region Inlining. If the number of statements to be executed by an omp region is less than an instruction granularity threshold set by the user on the command line, the compiler merges the CBB (or TP) for that region with the CBB (or TP) implemented for the enclosing one, and all its statements (or codelets) are moved one level “up.” By doing this, the runtime avoids the overhead of instantiating and scheduling an instance of the original TP. This optimization is mainly applied to single and master regions, as well as to small for regions, like those initializing an array.
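As a hedged illustration (threshold values are user-set and not specified here), a region such as the following array initialization would normally fall under that threshold and be inlined into the enclosing CBB or TP.

/* Candidate for region inlining: the single-statement loop body typically
 * falls under the user-set instruction granularity threshold, so omp2cd
 * would merge this region's CBB/TP into the enclosing one. */
static void init_array(double *a, int n)
{
    #pragma omp for
    for (int i = 0; i < n; i++)
        a[i] = 0.0;
}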

3.3.2 OpenMP Clause Extension – codelet. An omp region can also be inlined by using the codelet clause. With it, developers can instruct the compiler to implement the intended region using only codelets, with no TPs. This clause overrides the instruction granularity threshold, and, consequently, the region is effectively inlined, regardless of the number of statements it contains. Currently, this clause is supported for for and single regions and is ignored when generating SWARM applications since this runtime does not support TPs.
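The exact syntax of the extension is not reproduced in this excerpt; the following sketch shows an assumed placement of the codelet clause on a single region (the clause name comes from the article, its position in the directive is an assumption).

/* Assumed placement of the codelet clause extension: the single region is
 * implemented with codelets only, i.e., inlined, regardless of the
 * instruction granularity threshold. */
static void setup(double *x, double *y, int n)
{
    #pragma omp single codelet
    {
        for (int i = 0; i < n; i++) {
            x[i] = 1.0;
            y[i] = 2.0;
        }
    }
}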

3.3.3 Loops (DARTS Only). A dedicated stage to optimize for regions is included since they are dominant in scientific numerical applications. One of two types of optimizations may be applied. The first optimization (applied by default) translates the for region into a TP containing the CDG generated for the loop’s body. Then, at runtime, the TP is instantiated M times. Users can set the value of M using an environment variable or let the runtime use the number of clusters set in the AMM. These M instances are distributed across the different clusters in the AMM, and each instance is assigned up to ceiling(N/M) codelets with consecutive IDs, where N is the number of OpenMP threads defined for the region. The loop’s iteration range is distributed hierarchically among TP instances and then between codelets, following the loop’s scheduling policy.

The second optimization, which is available through a compiler command-line option, uses a set of special parallel constructs provided by DARTS, named TPLoops (Suettlerlein et al. 2013). With this optimization, the compiler generates three CDGs for the loop’s body. Then, at runtime, the fine-grain application selects one of them to execute the loop’s body depending on the current range and a set of thresholds set by the user. The first CDG contains a single computation codelet, which is instantiated only once at runtime and is in charge of executing all the loop’s iterations. The second CDG, called Codelet Parallel For (CPF) Loop, defines a TP with a single codelet inside;


Fig. 6. Default and optimized multigrain transformation of an OpenMP program. Superscripts represent CBB IDs, whereas subscripts represent instance IDs from 0 to N − 1, with N being the number of OpenMP threads in the region.

during execution, the TP is instantiated P times and the loop’s range is distributed uniformly among the instances. The third CDG is called Parallel For (PF) Loop and defines a TP having a CPF Loop. In other words, a PF Loop is a CPF Loop contained inside another TP. At runtime, Q instances of the PF Loop are created, each one having P instances of their CPFs. The total number of a loop’s iterations is distributed uniformly among the Q instances of the PF Loop, and then each one of these instances divides its subrange among the local P instances of its CPF Loop. Values for P and Q and the TPLoops thresholds are set by the user at runtime depending on the configuration of the AMM, the maximum number of OpenMP threads, and any heuristics or experimental data the user relies on.

Both optimizations aim at improving the locality of the data accessed and processed by the omp for region, but they differ in the type of loops for which they have been designed. The default optimization is intended for loops that have the same range throughout the lifetime of an application,

set either statically or dynamically. An example of such type of application is matrix multiply,

where the number of iterations to be executed is fixed as the number of rows or columns in the

matrix or the tile, in case a tiling algorithm is used. TPLoops, on the other hand, are more suitable

for loops with dynamic ranges, where the number of iterations varies. This is the case, for example,

with Graph500, which makes use of a Breadth-First Search algorithm that uses a parallel loop to

examine a set of nodes in the graph, known as the graph’s frontier (Graph500 2010 ). At runtime,

and depending on the graph’s size, the frontier can be as small as a dozen nodes (for which a single

codelet is more suitable), as big as several hundreds or thousands (for which a PF Loop is better),

or something in the middle (in which case, CPF Loops are preferred).
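The following sketch (not the actual Graph500 kernel; the helper names are assumptions) shows the shape of such a dynamic-range loop, whose frontier size changes from one iteration of the outer while loop to the next.

/* Sketch of the kind of loop TPLoops targets: the range of the parallel
 * loop is the current BFS frontier, which may hold a dozen vertices in
 * one step and thousands in the next, so the best CDG (single codelet,
 * CPF Loop, or PF Loop) changes at runtime. */
extern void visit(long v, long *next, long *next_size);  /* assumed helper; updates
                                                            next_size atomically */

static void bfs_sketch(long *frontier, long *next, long n_frontier)
{
    long n_next = 0;
    while (n_frontier > 0) {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n_frontier; i++)
            visit(frontier[i], next, &n_next);   /* expand the current frontier */

        long *tmp = frontier;                    /* next frontier becomes current */
        frontier = next;
        next = tmp;
        n_frontier = n_next;
        n_next = 0;
    }
}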

3.4 Example

Figure 6 presents the default and optimized versions of the CDG generated by the compiler for a DAXPY OpenMP program. The program contains a main() function with one parallel region, which has a single region for the vector initialization and a for region for the computation. A TP is generated for the parallel region (rule TP-2), but not for the function, since it is main().
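The DAXPY source itself is not reproduced in the article; the following hypothetical OpenMP version matches the structure described above and is the kind of input from which Figure 6's CDG would be derived.

/* Hypothetical DAXPY input: one parallel region with a single region for
 * the initialization and a for region for the computation. */
#include <stdlib.h>

int main(void)
{
    int n = 1 << 20;
    double a = 2.0;
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);

    #pragma omp parallel
    {
        double alpha = a;              /* e.g., statement 4: first computation codelet */

        #pragma omp single             /* vector initialization (statement 5)          */
        for (int i = 0; i < n; i++) { x[i] = i; y[i] = 1.0; }
                                       /* implicit barrier -> synchronization codelet  */
        #pragma omp for nowait         /* DAXPY computation -> TP / TPLoops            */
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }                                  /* parallel region's implicit barrier           */

    free(x);
    free(y);
    return 0;
}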

The first statement in the region, statement 4, is translated into a computation codelet to be instantiated N times (rule CBB-3). In the default CDG, a TP is defined for the single region (rule TP-2) with a single computation codelet, C^2, implementing statement 5 (rule CBB-3). N instances of the invocation codelet I^1 are also defined for this region to determine at runtime the


Fig. 7. DARTS application in C++ for the optimized CDG in Figure 6 (stripped-down code shown for brevity).

thread executing its TP (for simplicity, this thread is N − 1 in Figure 6). Provided the number of statements to execute in the single region is less than the instruction granularity threshold set by the user on the command line, or that the codelet clause is used, this region may be inlined to avoid the overhead of allocating and scheduling a TP. This is shown in the optimized CDG, where the C^2 computation codelet has been merged with the previous invocation codelets. In this case, the invocation operation in the I^1 codelets is replaced with the code from C^2. The implementation of the single region ends with a synchronization codelet, S^3, for its implicit barrier.

The for region is also translated into a TP using rule TP-2. As mentioned previously, two implementations are available for this type of region. In the default implementation, the TP defined for the region is instantiated M times, with each instance handling up to N/M codelets. Iteration chunks are assigned depending on the scheduling policy set for the loop (static by default). The optimized implementation uses TPLoops with three different CDGs, which are selected at runtime depending on the loop’s current range (Section 3.3.3). This optimization is only available in DARTS, so SWARM would only implement the default CDG for this region. N invocation codelets are also instantiated to distribute the available N OpenMP threads among the different TP instances.

The for region does not have an implicit barrier, so the codelets in its TP may continue individually and signal the synchronization codelet defined for the parallel region’s implicit barrier, S^6. This codelet may be implemented as a tree to reduce signaling overhead. However, for simplicity, this is represented as one codelet instance in Figure 6.

The code produced in DARTS for the optimized CDG in Figure 6 is presented in Figure 7. The main function is left in its sequential form with two fundamental changes. The first is the addition, at its beginning, of any statements necessary to initialize the runtime. This is shown in Figure 7 with calls to ThreadAffinity to set up the DARTS AMM (parameters include the number of microschedulers and TP schedulers, their affinity scheme, and the schedulers’ policies) and to Runtime to create the


runtime. The second change is a call to the runtime whenever a parallel region is found. Calls to the runtime require the name of the class defined for the parallel region (TP_parallel_0), along with all variables defined outside the region but used inside of it.

The definition of the TP_parallel_0 class appears after the main function and contains the classes of the TP’s codelets. Each codelet class has a fire method, which is called upon satisfaction of the dependencies of the corresponding instance. It also uses the myTP variable to access the TP’s data frame and getMyID to get its instance ID. Each invocation and computation codelet is instantiated in TP_parallel_0’s constructor up to the number of threads set for the parallel region, whose value is accessed with numThreads, with each instance having one dependency; synchronization codelets are instantiated only once, with this->numThreads dependences. TP_parallel_0’s constructor also signals all instances of the Compute_0 class to start their execution, since they represent the first codelet in the region, C^0. After each of these instances executes statement 4, they signal the instance of the next codelet class, Invoke_1, having the same instance ID.

The Invoke_1 class implements the invocation codelet defined for the single region. An atomic operation is used to select the instance executing the statements in the region, which are contained in the same invocation codelet since the region has been inlined. All Invoke_1 instances signal the only instance defined for the Synch_3 class, which represents the region barrier’s codelet. When this instance is fired, it signals all instances of the Invoke_4 class, which has been defined for the I^4 invocation codelet. This codelet is in charge of invoking NUM_TPS instances of the class defined for the for region’s TP, TP_for_0. If an Invoke_4 instance determines that its TP instance has already been allocated, it calls the TP’s dispatchCodelet function to signal the first codelet in the for region, as explained in Section 3.2.4.

The TP_for_0 class, not shown for brevity, contains the classes of the codelets executing the

loop’s statements. When the instances of these codelets finish, they signal the Synch_6 class’ instance that implements the parallel region’s implicit barrier. Since this is the last codelet in the

region, its instance signals the runtime’s finalCodelet to shut down the runtime and execute the

next statement after the parallel region in the main function.

3.5 Optimized DARTS Runtime

The DARTS runtime used in the environment is an optimized version we modified to improve its

performance with respect to OpenMP. This implementation includes a set of new features available

to users through the API and environment variables. We describe them here.

3.5.1 TP Placement. By default, newly invoked TPs are assigned to the same cluster where they originated. This might not be the best approach for embarrassingly parallel types of code

regions, such as forall loops, where data are evenly distributed across the machine. To support the

assignment of a new TP to a specific cluster, we have added the TP Placement function, which takes

the target cluster as an additional parameter.

3.5.2 Scheduler Downtime and Delayed Context Deletion. When executing a DARTS fine-grain

application, runtime overheads may occur as a result of having schedulers busy-waiting for long

periods of time. This can heavily tax the memory subsystem of compute nodes as they rely on

costly atomic operations. Another source of overhead may be the constant dynamic allocation

and deallocation of underlying runtime context objects. When the last codelet of a TP is done

firing, the TP is ready to be deallocated. As DARTS does not have a custom allocator and relies on

the OS memory manager for allocation, this may result in multiple cores attempting to access it simultaneously, generating several system calls that must be serialized when entering the Linux kernel.


To reduce the runtime overhead from these two sources, we added a Scheduler Downtime and

Delayed Context Deletion (SD+DCD) option. SD+DCD reduces the time schedulers are idle by using

an exponential backoff strategy with a saturation cap, which can be fine-tuned by the user, allowing

schedulers to enter into sleep mode when there are no TPs to create nor codelets to execute.

Additionally, this option delays the deletion of objects until there is no work to perform. When

used, schedulers act as follows: When a TP is marked for deletion, it is added to a deletion queue.

Later, when a scheduler has no work to perform, it checks if the deletion queue has reached a

threshold set in the runtime and proceeds to delete all completed TPs in sequence; otherwise, it

enters into sleep mode.
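The sketch below is a generic C illustration of this behavior, not the actual DARTS implementation; all names and constants are assumptions. An idle scheduler backs off exponentially up to a saturation cap and drains the deletion queue only when it has no other work and the queue has passed a threshold.

#include <stddef.h>
#include <unistd.h>

#define SLEEP_CAP_US  4096u   /* saturation cap for the backoff (user-tunable, assumed) */
#define DELETE_THRESH 64u     /* deletion-queue threshold (assumed value)               */

extern int    try_run_work(void);         /* assumed: runs one TP/codelet if any is ready */
extern size_t deletion_queue_size(void);  /* assumed: completed TPs awaiting deletion     */
extern void   delete_completed_tps(void); /* assumed: frees all queued TP contexts        */

static void scheduler_idle_loop(void)
{
    unsigned backoff_us = 1;
    for (;;) {
        if (try_run_work()) {
            backoff_us = 1;                          /* work found: reset the backoff        */
        } else if (deletion_queue_size() >= DELETE_THRESH) {
            delete_completed_tps();                  /* delayed context deletion             */
        } else {
            usleep(backoff_us);                      /* scheduler downtime (sleep, not spin) */
            if (backoff_us < SLEEP_CAP_US)
                backoff_us *= 2;                     /* exponential backoff with a cap       */
        }
    }
}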

4 EXPERIMENTAL EVALUATION

This section presents the experimental evaluation of the fine-grain applications generated by our

environment. Experiments were performed for the two fine-grain runtimes supported by omp2cd :

DARTS and SWARM. The applications used feature different types of parallelism and memory

access patterns, which makes them ideal to determine the types of workloads for which our envi-

ronment is more suitable and to compare the performance of both fine-grain runtimes.

4.1 Experimental Testbeds

Experiments were performed on two Intel-based shared-memory nodes. All cores feature 32 KiB

private L1 instruction and data caches, and 256KiB private unified L2 caches. The first platform is a

two-socket node, with 10 cores per socket, featuring an Intel Xeon E5-2660V2 (Ivy Bridge) clocked

at 2.60GHz with Hyper-Threading (HT) disabled. A 25MiB unified L3 cache is shared by all the

cores in the same socket (50MiB in total). The total memory is 64GiB of DRAM divided into two

NUMA domains. The second compute node is a four-socket node, with six cores per socket, with

an Intel Xeon E5-4610V1 processor (Sandy Bridge) clocked at 2.2GHz with HT activated. All cores

in the same socket share a 15MiB unified L3 cache (60MiB in total). All 48 threads have access to

128GiB of DDR3 DRAM, distributed across four NUMA domains. The Ivy Bridge testbed runs CentOS v6.6, while the Sandy Bridge one runs Ubuntu v14.04. All programs were compiled with GCC v6.3.0

with optimizations set to -O3 . OpenMP applications were linked against Intel’s OpenMP runtime

released with clang 3.9 (LLVM-OpenMP 2017 ).

4.2 Experimental Design

Experiments were performed on five benchmarks written in C and parallelized and optimized

by their authors using OpenMP 2.5 directives. Unless stated otherwise, no major changes or additional optimizations were introduced in the benchmarks’ source code, in order to perform a fair comparison and to use the benchmarks in the same form that is available to other users.

Each OpenMP benchmark was compiled with omp2cd to generate two equivalent fine-grain applications: one in DARTS and the other in SWARM. These were then compared against the reference

OpenMP application. Since omp2cd generates a CDG describing the input OpenMP code in terms

of the Codelet Model and uses the same CDG to generate the DARTS and SWARM output applica-

tions, experimental results were also used to compare not only the performance of the fine-grain

applications against OpenMP, but also the performance of one fine-grain runtime against the other.

At runtime, threads were pinned down evenly across each testbed using GOMP_CPU_AFFINITY on

OpenMP, DARTS_AFFINITY on DARTS, and SWARM_PIN_THREADS on SWARM. In all experiments, the iteration space in parallel loops was partitioned statically (as specified in the original OpenMP codes) and spread evenly across all NUMA nodes in a round-robin fashion using numactl -i all.

For DARTS applications, around 16 combinations of configurations/optimizations of the AMM

were evaluated at runtime, with the best ones being presented in Table 1. The number of schedulers


Table 1. Fine-Tuning of the AMM Implemented by DARTS for Each Benchmark

                  |                 Ivy Bridge                   |                Sandy Bridge
Bench.            | Clusters per Socket  TP Plac.  SD+DCD [μs]   Affin.   | Clusters per Socket  TP Plac.  SD+DCD [μs]    Affin.
Graph500          | 1                    No        No            compact  | 1                    No        No             compact
NAS-DC            | 10                   No        512; 4,096    sparse   | 12                   No        512; 4,096     sparse
SCF               | 1                    Yes       512; 4,096    compact  | 1                    Yes       16; 4,096      compact
CoMD              | 2                    Yes       No            compact  | 2                    Yes       No             compact
CoSP2             | 1                    No        16; 16,384    compact  | 1                    No        1,024; 8,192   compact

Affinity schemes used are Compact (0-19 on Ivy Bridge; 0-47 on Sandy Bridge) and Sparse (0, 10, 1, 11, . . . on Ivy Bridge; 0, 6, 12, 18, 1, 7, . . . on Sandy Bridge). For benchmarks using SD+DCD, saturation caps for the sleep times of TP and microschedulers are shown.

per cluster was calculated as the ratio between OMP_NUM_THREADS and the number of clusters

used by the application, with one of them being a TP scheduler and the rest microschedulers.

For benchmarks using the SD+DCD option, the table shows the maximum time in microseconds

that TP and microschedulers spend in sleep mode waiting for new work to perform. All DARTS

versions were run using work-stealing policies for the schedulers. For SWARM applications, we

set the number of schedulers equal to OMP_NUM_THREADS. Other SWARM runtime options were left at their defaults; however, SWARM lacks proper documentation to determine these values.

Results shown in the following sections were taken after performing between 20 and 30 runs for

each benchmark version. Unless otherwise specified, bar heights in the following figures represent

the results’ medians calculated from those runs, and error bars represent the results’ maximum and

minimum values. Figures also present each DARTS and SWARM bar with the percentage difference

obtained with respect to OpenMP.

4.3 Graph500

Graph500 performs a parallel Breadth-First Search (BFS) in a large undirected graph. This

benchmark is typically used to measure the performance of a system when dealing with irregular

data-intensive problems. We used the reference OpenMP code from Graph500 (2010) with a minor modification in the search kernel: we moved the parallel region inside the while loop, since we noticed that the instructions preceding the loop in the original code can be executed faster in sequential form without loss of correctness. All versions were executed using OMP_SCHEDULE=static.⁷
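As a concrete illustration of this restructuring, the following minimal C++/OpenMP sketch (not the actual Graph500 source; the CSR arrays, the pred[] convention, and the per-thread frontier buffers are illustrative assumptions) opens the parallel region inside the per-level while loop, leaving the setup that precedes the loop sequential:

```cpp
// Sketch only: a level-synchronous BFS whose parallel region lives inside the
// while loop, as in our modified search kernel. pred[] is assumed to be sized
// to the number of vertices and initialized to -1 (unvisited) by the caller.
#include <omp.h>
#include <atomic>
#include <cstdint>
#include <vector>

void bfs_levels(const std::vector<int64_t>& row_ptr,        // CSR row offsets
                const std::vector<int64_t>& cols,           // CSR adjacency
                std::vector<std::atomic<int64_t>>& pred,    // parent array
                int64_t root) {
  std::vector<int64_t> frontier{root}, next;
  pred[root].store(root);
  while (!frontier.empty()) {                 // one iteration per BFS level
    next.clear();
    #pragma omp parallel                      // parallel region opened per level
    {
      std::vector<int64_t> local;             // per-thread chunk of the next frontier
      #pragma omp for schedule(static) nowait
      for (long i = 0; i < (long)frontier.size(); ++i) {
        const int64_t u = frontier[i];
        for (int64_t e = row_ptr[u]; e < row_ptr[u + 1]; ++e) {
          int64_t v = cols[e], unvisited = -1;
          if (pred[v].compare_exchange_strong(unvisited, u))
            local.push_back(v);               // this thread claimed vertex v
        }
      }
      #pragma omp critical
      next.insert(next.end(), local.begin(), local.end());
    }
    std::swap(frontier, next);
  }
}
```

The per-thread buffers merged under a critical section stand in for the reference code's atomic handling of the next frontier; what matters for the discussion here is only where the parallel region is opened.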

Experimental results for Graph500 are presented in Figure 8, which shows DARTS with and without TPLoops (see Section 3.3) in order to evaluate the impact the TPLoops optimization has on the application. TPLoops' thresholds were experimentally fine-tuned based on the graph's size, the graph's maximum frontier (i.e., the number of vertices to visit), and results obtained from trial runs on each platform. The scale parameter was varied between 10 and 22 (higher scales were not evaluated due to testbed memory constraints); other parameters were left at their default values. Performance is measured in Traversed Edges per Second (TEPS).

The optimized DARTS version with TPLoops performs considerably better than OpenMP and

the other fine-grain versions (i.e., DARTS without TPLoops and SWARM), especially in the lower

end of the scale. This can be attributed to the irregular parallelism present in the application.

Graph500’s search kernel contains a parallel loop whose range varies at runtime depending on

the graph’s frontier. To better balance this work among threads, the size of iteration chunks has

⁷ As explained by Suettlerlein et al. (2013), various chunk sizes using a dynamic policy yielded worse results than the default static one.


Fig. 8. Weak and strong scalability of Graph500 on (a) Ivy Bridge and (b) Sandy Bridge. DARTS results are presented with and without the TPLoops optimization.

to change at execution time, which DARTS does by selecting the most appropriate CDG for the

current range out of the three generated at compile time for the TPLoops. Such capabilities are not present in SWARM, and a similar behavior cannot be obtained with OpenMP unless the search

kernel is significantly rewritten.
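For illustration only, the runtime selection behaves roughly like the sketch below; the three plans and the thresholds t1/t2 are hypothetical stand-ins for the three CDGs emitted by omp2cd, and the real decision is made inside the generated DARTS code rather than by user code:

```cpp
// Hypothetical sketch of the TPLoops idea: choose one of three pre-built loop
// strategies (CDGs) at run time, based on the current loop range. The names
// and thresholds are illustrative, not part of the DARTS or omp2cd API.
#include <cstddef>

enum class LoopPlan {
  Serial,          // range too small to amortize parallel overhead
  CoarseClusters,  // a few TPs, e.g., one per socket
  FullyParallel    // spread iterations over all clusters and microschedulers
};

inline LoopPlan select_plan(std::size_t range,
                            std::size_t t1,    // tuned per platform and scale
                            std::size_t t2) {  // tuned per platform and scale
  if (range < t1) return LoopPlan::Serial;
  if (range < t2) return LoopPlan::CoarseClusters;
  return LoopPlan::FullyParallel;
}
```

In Graph500, the range would be the size of the current frontier, so the small frontiers near the root and at the end of the traversal run serially, while the large middle levels use the fully parallel CDG.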

As with the results presented by Suettlerlein et al. (2013), the weak scalability results in Figure 8 show that DARTS with TPLoops outperforms OpenMP across all scales tested. Suettlerlein's work uses a handwritten DARTS BFS application with TPLoops and fixed thresholds for all scales. In contrast, our fine-grain application was generated by omp2cd, reducing the time needed

to write the program and only requiring users to tune certain parameters according to the target

architecture and the application.

4.4 NAS-DC

The NASA Advanced Supercomputing Division (NAS) Parallel Benchmarks are a suite of applications designed to evaluate the performance of supercomputers (NAS 2016). One of these is Data Cube (DC), which implements a user's view of a decision-support database, with each cube cell holding a set of attributes of interest to the user (Harinarayan et al. 1996). The benchmark is defined for four different problem sizes, or classes (S, W, A, and B), each containing a different number of attributes per cell in the data cube. S is the smallest class (10^3 tuples of up to five attributes), while B is the largest (10^7 tuples of up to 20 attributes).

Results for NAS-DC are shown in Figure 9. DARTS performs considerably better than OpenMP and SWARM in most cases. This may be attributed to three factors. First, unlike other benchmarks

in the experimental set, NAS-DC does not contain parallel loops or barriers, thus reducing coarse-

grain synchronization and allowing threads in the fine-grain runtime to progress individually.

Second, fine-grain PXMs, like the Codelet Model, are more suitable for dynamic data structures,


Fig. 9. Weak and strong scalability of NAS-DC on (a) Ivy Bridge and (b) Sandy Bridge.

such as the lattice of views used in NAS-DC, which is constantly changing at runtime based on

each view’s implementation cost. Finally, as seen later in Section 4.8 , DARTS speedups are obtained

thanks to its ability to tune-up its AMM for maximum performance, whereas SWARM offers users

fewer options to do so. For instance, Figure 13(a) shows that when DARTS uses only two TP

schedulers for NAS-DC with a size class A, its speedup is only 0.2, contrary to the one obtained

when all 20 cores available are used as TP schedulers, which is 6.4.

4.5 PNNL’s NWChem-SCF

NWChem-SCF is an implementation of the Self-Consistent Field (SCF) method from Pacific Northwest National Laboratory (PNNL) and part of the NWChem software package (PNNL 2014), a high-performance computational chemistry package for large chemistry problems. SCF is an iterative fixed-point algorithm that solves the nonlinear system arising from Schrödinger's equation in the form of a self-consistent eigenvalue problem.

Figure 10 presents the results for weak and strong scalability. On both platforms, DARTS and

SWARM performances remain, on average, within ±10% of OpenMP’s, with both fine-grain run-

times performing slightly better on Ivy Bridge. While general implementations of SCF tend to

be loosely structured, the one used for our experiments (obtained from ETI 2013) is very regular, comprising several rectangular nested loops. OpenMP has been extensively tuned for such workloads, which explains why the speedups obtained with DARTS for this application are smaller

than those for Graph500 and NAS-DC.
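The kind of loop nest meant here is illustrated below; this is not code from the NWChem-SCF benchmark, only a generic, statically scheduled Fock-like contraction whose per-iteration work is uniform:

```cpp
// Illustrative rectangular loop nest with uniform work per (i, j) pair, the
// shape of workload for which OpenMP's static schedule is already well tuned.
// Not taken from the benchmark; matrices are dense, row-major buffers.
#include <omp.h>
#include <vector>

void fock_like_update(const std::vector<double>& density,  // n*n
                      const std::vector<double>& eri,      // n*n*n*n integrals
                      std::vector<double>& fock,           // n*n, output
                      int n) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      double acc = 0.0;
      for (int k = 0; k < n; ++k)
        for (int l = 0; l < n; ++l)
          acc += density[k * n + l] * eri[((i * n + j) * n + k) * n + l];
      fock[i * n + j] = acc;                 // uniform work for every (i, j)
    }
}
```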

4.6 ExMatEx’s CoMD

CoMD (a Molecular Dynamics simulation) is one of the proxy applications of the Department of Energy's Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx) (DoE 2012a). These proxy applications are used for the co-design of algorithms, software, and hardware for exascale


Fig. 10. Weak and strong scalability of NWChem-SCF on (a) Ivy Bridge and (b) Sandy Bridge.

Fig. 11. Strong scalability of CoMD using 32,000 atoms on (a) Ivy Bridge and (b) Sandy Bridge. DARTS results on Sandy Bridge with four threads were not computed since the AMM was configured with eight TP schedulers.

simulation frameworks of extreme mechanical and radiation environments. CoMD evaluates the

force exerted on each atom by the rest of the particles in the system within a cutoff radius, using Lennard-Jones and embedded-atom method potentials.
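The structure of that computation is roughly the following; this is an illustrative all-pairs Lennard-Jones sketch, not CoMD's code (the real kernel iterates over link cells and neighbor lists and also supports EAM), but it exhibits the regular, static access pattern discussed in the next paragraph:

```cpp
// Illustrative sketch of a statically scheduled Lennard-Jones force loop.
// Each thread writes only the forces of its own atoms, so no synchronization
// is needed inside the loop.
#include <omp.h>
#include <vector>

struct Atom { double x, y, z, fx, fy, fz; };

void lj_forces(std::vector<Atom>& atoms, double cutoff,
               double epsilon, double sigma) {
  const double cut2 = cutoff * cutoff;
  const int n = (int)atoms.size();
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) {
    double fx = 0.0, fy = 0.0, fz = 0.0;
    for (int j = 0; j < n; ++j) {
      if (j == i) continue;
      const double dx = atoms[i].x - atoms[j].x;
      const double dy = atoms[i].y - atoms[j].y;
      const double dz = atoms[i].z - atoms[j].z;
      const double r2 = dx * dx + dy * dy + dz * dz;
      if (r2 > cut2) continue;                          // outside the cutoff radius
      const double s2 = (sigma * sigma) / r2;
      const double s6 = s2 * s2 * s2;
      const double f = 24.0 * epsilon * s6 * (2.0 * s6 - 1.0) / r2;
      fx += f * dx;  fy += f * dy;  fz += f * dz;
    }
    atoms[i].fx = fx;  atoms[i].fy = fy;  atoms[i].fz = fz;
  }
}
```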

Strong scalability results for this application are shown in Figure 11. As with NWChem-SCF, CoMD's parallelism is highly structured with static memory access patterns, which explains why DARTS performance in Figure 11 is also on par with OpenMP, but not better. In particular, OpenMP provides guarantees for statically scheduled loops that help improve data locality. This is not the case with SWARM, whose performance is greatly reduced on Sandy Bridge (Figure 11(b)). Unlike DARTS, SWARM does not define TPs and, therefore, cannot benefit from the increased locality of data obtained by having codelets access their data, on each loop iteration, from their parent TP's

data frame. Moreover, codelets in SWARM are not bound to a specific socket and may be executed


Fig. 12. Weak and strong scalability of CoSP2 on (a) Ivy Bridge and (b) Sandy Bridge with a 500×500 submatrix.

by any core in the machine, ultimately reducing the locality of the application and its performance,

which is more noticeable in Figure 11.

4.7 ExMatEx’s CoSP2

The CoSP2 (Second-Order Spectral Projection for Electronic Structure Calculations) benchmark is another ExMatEx proxy application (DoE 2012b). CoSP2 implements a parallel sparse linear algebra algorithm that determines density matrices in electronic structure theory using the recursive second-order Fermi-operator expansion method (SP2), defined as a recursive series of matrix multiplications.

Figure 12 presents results for weak and strong scalability. On both architectures, OpenMP has a

lower execution time than the fine-grain runtimes, with SWARM performing better than DARTS.

On Sandy Bridge, however, the performance of both fine-grain runtimes is reduced, particularly

for DARTS. The difference between the two fine-grain applications indicates that sparse matrices (and particularly the ELL format used by CoSP2) are more suitable for fine-grain execution when using only one level of scheduling (as in SWARM, which uses only codelets) rather than two levels (i.e., TPs

and codelets in DARTS). This contrasts with applications with dense matrices, like NWChem-SCF

and CoMD, where the data frame on each TP might improve the application’s performance by

increasing the locality of the data used by the TP’s codelets.
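To make the ELL point concrete, the sketch below shows an ELLPACK-style layout and a row-parallel multiply over it. It is illustrative only (CoSP2 itself performs sparse matrix-matrix products in its SP2 iteration), but the fixed-width rows and the flat, small per-row work are the properties referred to above:

```cpp
// Illustrative ELL (ELLPACK) layout: every row stores at most max_nnz entries
// in fixed-width, padded arrays, so the per-row work is small and uniform and
// a single level of fine-grain scheduling is enough to balance it.
#include <omp.h>
#include <vector>

struct EllMatrix {
  int n;                        // number of rows (and columns)
  int max_nnz;                  // padded entries per row
  std::vector<int> col;         // n * max_nnz column indices, -1 marks padding
  std::vector<double> val;      // n * max_nnz values, 0.0 for padding
};

void ell_spmv(const EllMatrix& A, const std::vector<double>& x,
              std::vector<double>& y) {
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < A.n; ++i) {
    double acc = 0.0;
    for (int k = 0; k < A.max_nnz; ++k) {
      const int c = A.col[i * A.max_nnz + k];
      if (c >= 0) acc += A.val[i * A.max_nnz + k] * x[c];
    }
    y[i] = acc;
  }
}
```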

4.8 Impact of Different DARTS AMM Configuration Options on Application Performance

As seen previously, applications with irregular parallelism and/or dynamic data structures, like

Graph500 and NAS-DC, may perform better when executed in a fine-grain environment, such

as the Codelet Model implemented by DARTS. However, improvements are not only due to the


Fig. 13. DARTS speedup with respect to OpenMP for different numbers of TP Schedulers in the AMM.

fine-grain nature of the runtime, but also to the ability users have to configure the AMM implemented by DARTS and to tune it for each application. This section analyzes two AMM configuration parameters available in DARTS and evaluates their impact on an application's performance. Experiments were performed on Graph500 with a scale of 20, NAS-DC with size class A, NWChem-SCF with 160 atoms, CoMD with 32k atoms, and CoSP2 with a 100k×100k matrix and a 500×500 submatrix. All results used 20 threads on Ivy Bridge and 48 threads on Sandy Bridge.

Other parameters have the same values as those mentioned previously for each benchmark.

4.8.1 Number of TP Schedulers. The number of TP schedulers determines the number of clus-

ters in the AMM, as well as the number of microschedulers per cluster used to implement DARTS’s

two-level scheduling mechanism. This has a direct impact on the locality of data accessed by the codelets and on the scheduling overhead produced by having a cluster's TP scheduler distribute

ready codelets to each one of its microschedulers. To evaluate this, Figure 13 presents the speedup

obtained with respect to OpenMP for each of the previous benchmarks for different AMM shapes.

From examining the results in Figure 13 and the particularities of each benchmark, we can

determine that the impact that the number of TP schedulers has on an application's performance

depends, at least partially, on the presence or absence of parallel loops. In fact, NAS-DC, which

does not contain parallel loops, is the only application whose performance clearly improves for

increasing numbers of TP schedulers. For all other applications, which are designed around parallel loops whose ranges either change dynamically (like Graph500) or are fixed, performance improves when the number of TP schedulers is close to the number of sockets or NUMA nodes in the target architecture: two on Ivy Bridge and four on Sandy Bridge. CoMD is a special case, requiring four TP schedulers on Ivy Bridge but 24 on Sandy Bridge.

This is related to the way that TPs are assigned to clusters and how codelets access their data.

When a TP is invoked, it is assigned to a cluster’s TP scheduler, which instantiates it and distributes

its ready codelets to each one of the microschedulers of the same cluster for execution. If these

microschedulers are mapped to the same socket or NUMA node as their cluster's TP scheduler (as is the case for the applications with compact affinity in Table 1), data locality is increased, since codelets access their data from their TP instance's data frame. This is especially important in applications based on parallel loops, where the increased locality may reduce the cost of the memory accesses performed on each iteration, ultimately improving the application's performance.

However, if more than one cluster is mapped to the same socket or NUMA node, contention over

the memory bank may increase, producing higher data-access latencies on each iteration. Indeed, the results in Figure 13 indicate that the "sweet spot" for most of the applications containing parallel

loops is having only one cluster per socket or NUMA domain in the target architecture (i.e., two


Fig. 14. DARTS speedup with respect to OpenMP for different scheduling policies.

on Ivy Bridge and four on Sandy Bridge). On the other hand, if an application does not contain

such control-flow patterns (i.e., NAS-DC), no benefit is obtained from having clusters of TP and

microschedulers in the same hardware region, and it is actually better to have a single level of

scheduling (i.e., only TP schedulers in the machine with no microschedulers).

4.8.2 Scheduling Policy and Options. In addition to the number of TP and microschedulers in

the abstract machine, users can choose between two main types of scheduling policies (see Section 2.2): standard, where microschedulers only execute codelets pushed into their local queues by their parent TP scheduler, and work-stealing, where they may also steal codelets from their local TP scheduler's RCQ. Additionally, if the SD+DCD option (see Section 3.5.2) is activated, schedulers can go into sleep mode and can reduce the overhead of constantly deleting context objects by enqueueing them and deleting them afterward. The impact that these scheduling policies and

options have on DARTS application performance is presented in Figure 14.

All benchmarks perform better when work-stealing is used instead of the standard policy, and, with the exception of Graph500 and CoMD, SD+DCD further improves the application's performance. When the standard policy is used, schedulers only execute codelets whose TPs have been

invoked in the same cluster. For microschedulers, this policy also means that they only execute

codelets assigned by their local TP scheduler. If a microscheduler’s local queue is empty, it spins

idly until new work is pushed into it, even if its local TP scheduler has ready codelets on its own

RCQ.

With work-stealing, however, a microscheduler may steal those codelets as soon as it has none in its own queue, thus reducing the time it spends idle. Contention on the TP scheduler's RCQ may increase, though, as well as coherence traffic overall, because not only the TP scheduler pops codelets from it, but potentially all of its microschedulers do as well. Nevertheless, the results in Figure 14 show that, for the benchmarks presented, this overhead is small in comparison with the benefit of having schedulers constantly busy executing work. Moreover, when SD+DCD is enabled, the performance obtained with work-stealing is further enhanced by having schedulers sleep or delete previously queued context objects when there are no codelets to execute and no TPs to create.

For Graph500 and CoMD, results show that SD+DCD actually decreases the application’s speedup

obtained with work-stealing alone. This indicates that, for these applications, schedulers are con-

stantly executing codelets and benefit more from deallocating context objects immediately rather

than later.
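The difference between the two policies and the SD+DCD option can be summarized by the following highly simplified scheduler loop; the queue types, the steal operation, and the saturation cap are hypothetical simplifications, not the actual DARTS scheduler code:

```cpp
// Hypothetical sketch of a microscheduler's main loop under the policies
// compared above. The real DARTS scheduler also creates TP instances, honors
// affinity, and recycles context objects (the DCD part), all omitted here.
#include <chrono>
#include <optional>
#include <thread>

struct Codelet { void (*fire)(void*); void* frame; };

template <class LocalQueue, class ParentRCQ>
void microscheduler_loop(LocalQueue& local, ParentRCQ& parent_rcq,
                         bool work_stealing, bool sleep_delay,
                         const bool& shutdown) {
  using namespace std::chrono_literals;
  auto backoff = 1us;                                   // sleep time (SD option)
  while (!shutdown) {
    std::optional<Codelet> c = local.pop();             // standard: local queue only
    if (!c && work_stealing) c = parent_rcq.steal();    // steal from the TP scheduler's RCQ
    if (c) {
      c->fire(c->frame);                                // codelets run to completion
      backoff = 1us;
    } else if (sleep_delay) {
      std::this_thread::sleep_for(backoff);             // sleep instead of spinning
      if (backoff < 4096us) backoff *= 2;               // saturation cap, as in Table 1
    }
    // Without SD, an idle scheduler simply spins and retries.
  }
}
```

Under the standard policy, work_stealing is false and the scheduler may sit idle even while its parent's RCQ holds ready codelets, which is exactly the behavior penalized in Figure 14.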

4.9 Summary of the Environment’s Opportunities and Tradeoffs

Experimental results show that the multigrain parallel programming environment is more suit-

able for certain types of applications and that the performance of the DARTS runtime can be highly


dependent on the set of configuration parameters used for its AMM. Here, we summarize the sce-

narios where an increase in performance with respect to OpenMP can be expected and the tradeoffs

a user of the environment may face:

(1) Applications with irregular parallelism and/or dynamic data structures may perform bet-

ter when written and executed under a fine-grain runtime, such as DARTS.

(2) OpenMP tends to get slightly better performance than the Codelet Model with regular

workloads.

(3) Performance of a DARTS application featuring parallel loops can be increased when the

number of clusters in the AMM matches the number of sockets in the target architecture.

(4) DARTS implements a highly configurable AMM, offering users more flexibility than SWARM to tune applications at execution time.

5 RELATED WORK

The Optimally Scheduled Advanced Multiprocessor (OSCAR) multigrain parallelizing com-

piler (Kimura et al. 2010) decomposes an application into several grains of parallelism. It gen-

erates a parallelized application from a sequential program that may be compiled by an OpenMP

compiler or translated into a Fortran or C program with runtime library calls. In this sense, OS-

CAR output programs are run using OpenMP's coarse-grain PXM, unlike omp2cd, which generates

applications to run over a fine-grain PXM.

Weng’s doctoral dissertation (Weng 2003 ) proposes the translation of OpenMP programs to a

dataflow PXM using the Cougar compiler and SMARTS runtime system, increasing the data local-

ity by finding, at compile time, the best mapping of nodes in the task graph to processors. omp2cd,

on the other hand, improves data locality by automatically determining the size and number of

TPs to be used in the application in order to exploit the implicit locality provided by the two-level

parallelism scheme found in DARTS.

KaCC is an ongoing project for the design of a compiler that parses OpenMP programs into applications running on a dataflow-like runtime with different task grain sizes; this runtime, libKOMP (Broquedis et al. 2012), is based on XKaapi (Gautier et al. 2013). The main difference between KaCC and omp2cd is that the latter performs a two-level granularity analysis of the input OpenMP application (TPs and codelets), while the former decomposes the input application into the single type of parallel construct needed by libKOMP (i.e., tasks).

OpenStream allows programmers to express dynamic dependent tasks by annotating an OpenMP application (Pop and Cohen 2013). In this way, an application's dataflow task graph can be specified using first-class dataflow streams. omp2cd differs in that it does not require users to modify the original OpenMP application in order to obtain a dataflow-like application that uses the Codelet Model.

OmpSs provides OpenMP extensions to support the StarSs programming model in a single in-

terface (BSC 2017). StarSs is a programming model that computes, at runtime, the dependencies

between tasks to execute a sequential task-based program in parallel. The task graph in OmpSs is

thus by nature dynamic. In omp2cd, using DARTS as a back-end, codelet graphs contained within a TP are static (i.e., all the required codelets, along with their dependence relationships, are computed once and for all). This allows for lower-overhead task-graph instantiation.

Cilk/Cilk++ (Leiserson 2010) comprises a set of language extensions for C/C++ for the development of multithreaded applications using a divide-and-conquer model, providing a simple interface through a small set of keywords. Like DARTS and SWARM, the Cilk/Cilk++ scheduler uses a work-stealing algorithm; however, it does not provide locality-aware mechanisms like


those provided by the Codelet Model and DARTS through the use of TPs and their assignment to

specific clusters in the machine.

Rice University’s Habanero (Kumar et al. 2014 ) project comprises a set of parallel programming

languages, an execution model, compilers, and a runtime for the development of parallel applica-

tions for extreme-scale systems in a wide variety of architectures. It supports dynamic creation and

termination of lightweight tasks by transforming at compile-time tasking constructs into runtime

calls. These tasks may block when executing, which is an important difference with respect to the

Codelet Model, which specifies that codelets are non-preemptive and must run until completion

without blocking.

6 CONCLUSION

We have presented a multigrain parallel programming environment based on the Codelet Model

for the translation and fine-grain execution of OpenMP programs. It includes a multigrain parallel

compiler, omp2cd, which automatically determines the granularity of fine-grain tasks at two levels

of parallelism and generates a set of source files to be executed on top of a runtime implementing

the Codelet Model: DARTS or SWARM. The environment also includes an optimized version of

the DARTS runtime that has been modified to improve its performance with respect to OpenMP.

Experimental evaluation showed that DARTS improves the performance of workloads with irregular parallelism and dynamic data structures with respect to OpenMP and SWARM. For applications having more structured parallelism, designed around nested parallel loops and static memory access patterns, and for which OpenMP has been broadly optimized, the output fine-grain applications can match OpenMP's performance with either DARTS or SWARM, or both. Their performance on these types of workloads may vary depending on factors like the configuration of the DARTS AMM and the type of arrays used by the algorithm, with SWARM performing better when the benchmark employs sparse matrices.

Results also showed that fine-tuning the DARTS AMM can have a great impact on an application's

performance by determining the number of codelets executed by each cluster, which has a direct

impact on the locality of data used by the codelets. Applications containing parallel loops are

more likely to improve their performance when the number of clusters in the AMM is close to the

number of sockets in the underlying architecture. Conversely, applications with no parallel loops

benefit more from having only one level of scheduling in DARTS (i.e., having all threads acting as

TP schedulers). Further improvements may be achieved by selecting the appropriate scheduling

policy and options, like allowing schedulers to enter sleep mode when there is no work to do and fine-tuning the corresponding sleep time.

Work is currently being done on improving the performance of the environment for additional scientific applications, such as the EPCC OpenMP micro-benchmark suite (The University of Edinburgh 2016), NAS Integer Sort, and the Barcelona Supercomputing Center's OpenMP Tasks Suite (Duran et al. 2009). Additionally, future work will focus on integrating into omp2cd additional mechanisms offered by SWARM, to complement the experimental evaluation of the environment and the performance comparison between the two fine-grain runtimes. We will also focus on making the omp2cd compiler fully compliant with OpenMP v2.5, as well as on completely supporting the tasking constructs and clauses from OpenMP v3.0 and v4.0. New clauses extending OpenMP will also be added to

guide the compilation process with specific power and resilience constraints.

REFERENCES

Jaime Arteaga, Stéphane Zuckerman, and Guang R. Gao. 2017. Multigrain parallelism: Bridging coarse-grain parallel programs and fine-grain event-driven multithreading. In Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 799–808. DOI: http://dx.doi.org/10.1109/IPDPS.2017.63


François Broquedis, Thierry Gautier, and Vincent Danjean. 2012. libKOMP, an efficient OpenMP runtime system for both fork-join and data flow paradigms. In OpenMP in a Heterogeneous World: Proceedings of the 8th International Workshop on OpenMP (IWOMP'12). Springer, Berlin, 102–115. https://link.springer.com/chapter/10.1007%2F978-3-642-30961-8_8
BSC. 2017. Barcelona Supercomputing Center: The OmpSs Programming Model. Retrieved March 31, 2017 from https://pm.bsc.es/ompss
CAPSL. 2013. Computer Architecture and Parallel Systems Laboratory at University of Delaware: The Codelet Execution Model. Retrieved March 31, 2017 from http://www.capsl.udel.edu/codelets.shtml
DoE. 2010. US Department of Energy, Office of Science, Office of Advanced Scientific Computing Research: The Opportunities and Challenges of Exascale Computing. Retrieved March 31, 2017 from http://science.energy.gov/~/media/ascr/ascac/pdf/reports/Exascale_subcommittee_report.pdf
DoE. 2012a. DoE Exascale Co-Design Center for Materials in Extreme Environments, ExMatEx - Proxy Applications - CoMD. Retrieved October 18, 2017 from http://www.exmatex.org/comd.html
DoE. 2012b. DoE Exascale Co-Design Center for Materials in Extreme Environments, ExMatEx - Proxy Applications - CoSP2. Retrieved October 18, 2017 from http://www.exmatex.org/cosp2.html
Alejandro Duran, Xavier Teruel, Roger Ferrer, Xavier Martorell, and Eduard Ayguade. 2009. Barcelona OpenMP tasks suite: A set of benchmarks targeting the exploitation of task parallelism in OpenMP. In Proceedings of the 2009 International Conference on Parallel Processing (ICPP'09). IEEE Computer Society, Washington, DC, Article 16, 8 pages. DOI: http://dx.doi.org/10.1109/ICPP.2009.64
ETI. 2013. ET International Inc., DynAX - Extreme Scale Software Stack. Retrieved October 18, 2017 from https://wiki.modelado.org/images/1/15/Scf.tar.gz
Thierry Gautier, Joao V. F. Lima, Nicolas Maillard, and Bruno Raffin. 2013. XKaapi: A runtime system for data-flow task programming on heterogeneous architectures. In Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS'13). IEEE Computer Society, Washington, DC, 1299–1308. DOI: http://dx.doi.org/10.1109/IPDPS.2013.66
Graph500. 2010. Graph500 Reference Implementation Version 2.1.4. Retrieved October 18, 2017 from https://github.com/graph500/graph500/releases/tag/graph500-2.1.4
Venky Harinarayan, Anand Rajaraman, and Jeffrey D. Ullman. 1996. Implementing data cubes efficiently. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD'96). ACM, New York, 205–216. DOI: http://dx.doi.org/10.1145/233269.233333
Keiji Kimura, Masayoshi Mase, Hiroki Mikami, Takamichi Miyamoto, Jun Shirako, and Hironori Kasahara. 2010. OSCAR API for real-time low-power multicores and its performance on multicores and SMP servers. In Proceedings of the 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC'09). Springer, Berlin, 188–202. DOI: http://dx.doi.org/10.1007/978-3-642-13374-9_13
Vivek Kumar, Yili Zheng, Vincent Cavé, Zoran Budimlić, and Vivek Sarkar. 2014. HabaneroUPC++: A compiler-free PGAS library. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models (PGAS'14). ACM, New York, Article 5, 10 pages. DOI: http://dx.doi.org/10.1145/2676870.2676879
Christopher Lauderdale and Rishi Khan. 2012. Towards a codelet-based runtime for exascale computing: Position paper. In Proceedings of the 2nd International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (EXADAPT'12). ACM, New York, 21–26. DOI: http://dx.doi.org/10.1145/2185475.2185478
Charles E. Leiserson. 2010. The Cilk++ concurrency platform. The Journal of Supercomputing 51, 3 (2010), 244–257. DOI: http://dx.doi.org/10.1007/s11227-010-0405-3
LLVM-OpenMP. 2017. OpenMP: Support for the OpenMP Language. Retrieved March 1, 2017 from http://openmp.llvm.org/index.html
NAS. 2016. NASA Advanced Supercomputing Division Parallel Benchmarks. Retrieved September 20, 2017 from http://www.nas.nasa.gov/publications/npb.html
The University of Edinburgh. 2016. EPCC OpenMP Micro-benchmark Suite. Retrieved March 31, 2017 from https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmp-micro-benchmark-suite
OpenMP4.0. 2011. OpenMP Application Program Interface Version 4.0. Retrieved March 31, 2017 from http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
PNNL. 2014. Pacific Northwest National Laboratory: Hartree-Fock Theory for Molecules - NWCHEM. Retrieved March 31, 2017 from http://www.nwchem-sw.org/index.php/Release65:Hartree-Fock_Theory_for_Molecules
Antoniu Pop and Albert Cohen. 2013. OpenStream: Expressiveness and data-flow compilation of OpenMP streaming programs. ACM Transactions on Architecture and Code Optimization 9, 4, Article 53 (Jan. 2013), 25 pages. DOI: http://dx.doi.org/10.1145/2400682.2400712
Joshua Suettlerlein, Stéphane Zuckerman, and Guang R. Gao. 2013. An implementation of the Codelet model. In Proceedings of the 19th International Conference on Parallel Processing (Euro-Par'13). Springer, Berlin, 633–644.


Tien-Hsiung Weng. 2003. Translation of OpenMP to Dataflow Execution Model for Data Locality and Efficient Parallel Execution. Ph.D. Dissertation. University of Houston, TX.
Stéphane Zuckerman, Joshua Suetterlein, Rob Knauerhase, and Guang R. Gao. 2011. Using a "codelet" program execution model for exascale machines: Position paper. In Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (EXADAPT'11). ACM, New York, 64–69. DOI: http://dx.doi.org/10.1145/2000417.2000424

Received March 2017; revised September 2017; accepted October 2017
