Personal tools
You are here: Home Science Environment Next Generation Programming Environments for Next Generation Architectures
Document Actions

Next Generation Programming Environments for Next Generation Architectures

by Matt Sottile last modified 2007-04-10 20:50

Addressing Heterogeneity and Large-Scale Parallelism with Compiler Infrastructure Tools and Transitional Languages

Authors: Matthew Sottile (CCS-2), Katharine Chartrand (HPC-4), Craig Rasmussen (CCS-1)

Basis

Highly parallel and heterogeneous architectures will be typical of next generation supercomputing. For example, one proposed architecture is composed of multiple levels of processing scattered throughout the memory hierarchy (where "memory" includes non-volatile memory such as disk or even tape). Machines of this sort, although proposed and prototyped in the past, have never been more than a niche resource and there is a lot of work to be done to make these architectures general purpose data analysis tools - both programmable and usable. Specifically, next generation architectures require languages that 1) both capitalize on and hide the heterogeneous hardware, and 2) can readily incorporate new parallel models and syntactic techniques. We propose a method for achieving this that also retains the enormous investment in algorithm development in traditional languages.

Architectural Candidates

Although one candidate architecture is being discussed currently, it is useful to discuss other potential candidates that may influence the architecture of a next generation machine designed for data analysis. There are at least three candidates: PIMs, a multi-threaded architecture (MTA), and hybrid multicore processors.

The proposed architecture uses Processor-in-Memory (PIM) technology. In the PIM architecture, computations are embedded in the memory, eliminating the "memory-wall" that exists in current mainstream architectures. PIM is not only a solution to latency - one can associate computations with memory operations, turning the typical model of computation inside-out, allowing memory accesses to drive the computational flow. This is the cornerstone of the dataflow model of computation proposed in past decades. PIM brings with it significant considerations, specifically in terms of locality. One must ensure that the memory and computations are co-located in an optimal manner - unstructured and random algorithms (as are often found in data mining) are not necessarily suited for this architecture.

An alternative architecture is the multithreaded architecture (MTA) pioneered by Burton Smith at Tera, and later Cray. In this case, one does not embed processors in memory, but hides potentially severe latency behind a massive set of independent threads of instructions. One must be able to decompose algorithms into a large set of independent threads of execution. Locality is no longer an issue, deferring the difficulty to identifying concurrency.

A final architecture to consider is one composed of hybrid multicore processors. Currently, multicore processors use homogeneous sets of processing elements, essentially implementing a traditional SMP on a single chip. Recent announcements from leading chip-makers suggest they are moving towards heterogeneous processors. It is likely that GPUs will be included on the same die as the processor, and a mix of general purpose superscalar and VLIW cores may appear on the same CPU. Utilizing these machines will require balancing high levels of parallelism with heterogeneity in the instruction stream for the different processing elements.

Computing at the Transport Layer

Some amount of data processing in next generation architectures must occur at the transport layer between levels of storage, such as the disk/memory interface. The largest cost in computing currently is the latency in bringing data from storage (memory or disk) to the processing elements. At the link between layers of storage, the data is available for examination while it traverses the link. No extra cost is required to process it at the link level since we are moving the computation to the data instead of the other way around, making the previously costly operation of data access essentially free. The tradeoff will be designing algorithms to a strictly streaming model of computation instead of the more familiar random access memory model.

The processing elements available for working on this data at these links will be highly heterogeneous -- ranging from embedded processors such as ARM or simple PowerPC, to field programmable gate arrays (FPGAs) embedded in the network fabric. Depending on which layer we wish to perform computations, the processors are likely not instruction-set compatible with each other, and more importantly, have vastly different execution architectures (superscalar, vector, arrays of boolean gates, etc...). Further, no single vendor exists that can provide disk, network, and memory technology. Heterogeneity is going to be a key design requirement in tools that target these systems.

The Problem: How do we make these architectures useable and programmable

The motivation and promise of these architectures is clear. The problem will be to make them useable and programmable:

  • Usability: These architectures will be usable when we have languages that allow programmers to express algorithms in a straight forward manner that preserves mathematical abstraction while utilizing the parallel capabilities of new heterogeneous architectures in all their complexity.
  • Programmability: Heterogeneity is complex from a code and data management perspective. These architectures will be programmable when compilers can automatically and optimally compile code for heterogeneous hardware.

Heterogeneity: The Library Solution

Libraries are frequently proposed as the solution for heterogeneous architectures. In this approach, a suite of libraries that are callable from the main traditional processor move costly subroutines to the coprocessor for execution. This allows a clean separation to be made for the user who programs as they have on traditional systems, keeping them isolated from the implementation details of the coprocessor algorithms. A precedent exists for this concept tracing back to linear algebra libraries developed for early vector machines.

Performance and instruction stream analysis of simulation codes shows that the expected benefit from kernel libraries will be minimal [see Gokhale/Sottile, 2006]. Fortunately, data analysis is often simpler than simulation from an algorithmic perspective; regular, embarrassingly parallel and highly structured algorithms, such as the FFT, are common. Furthermore, the use of kernel libraries is quite viable - the widespread use of Matlab and GNU R are proof of this, as they are essentially kernel libraries that are scriptable by users.

Unfortunately, as Matlab programmers know, analysis algorithms that are not perfectly structured, or cannot be expressed as vectorized code, have abysmal performance when implemented as libraries. The granularity of each library call becomes small, and a great deal of computational time is spent outside the optimized libraries. As algorithmic complexity increases, experimental and folkloric evidence shows that the viability of libraries as a full solution is low. We seek a solution that works generally for all algorithms, including those that exceed the complexity for which libraries are suited.

Current Efforts To Introduce Parallel Programming Constructs

Many people recognize that new language capability is necessary to generate code that can run on new architectures. New languages provide syntactic tools for algorithm expression that eliminate or discourage features that prohibit decomposition onto parallel or heterogeneous architectures. They provide new programming models that are necessary to fully utilize even current generation machines.

DARPA has funded as part of the HPCS effort three languages for high productivity parallel computing - Fortress (Sun), X10 (IBM), and Chapel (Cray). Commercial developers also recognize the importance and promise of language level techniques. Google has published their "map-reduce" programming model (based on primitives found in functional programming) used to implement some of their massive data mining and page ranking algorithms.

However, it has been impossible, historically, to force a wholesale change in the way people think about programming. An old anecdote is that whatever new language emerges, it will be called Fortran. Why is this true? A vast investment has been made in algorithm development in traditional languages, even outside the domain of simulation science. New languages require a wholesale rewrite of existing code. This is not realistic in practice, and has doomed otherwise promising language projects such as Sisal.

None-the-less, some level of change is necessary to respond to the challenges posed by future architectures. One would like to pursue a conservative route and evolve the traditional languages into next generation tools - users must be eased into new models, not forced into them.

Considerable effort has been made to parallelize or vectorize programs written in traditional languages. For example, vectorizing compilers for Fortran have been a research topic since the early 1970s, yet in practice they are found to hit limits in terms of their performance enhancement. The most familiar example of this limitation is due to data aliasing in modern versions of Fortran with pointers, and languages such as C or Java which allow for pass-by-reference.

Efforts to evolve traditional languages currently do so by introducing language extensions which are translated by a third-party tool into their equivalent representation in existing languages. For example, languages like ZPL are compiled down to C source with parallel capabilities provided by MPI or other common libraries. The existing C compiler is then responsible for building the executable program and performing platform-specific optimizations. This approach works, but is limited in that the compiler generating machine code and performing analyses for optimization has no knowledge of the high level constructs from which the C code originated. Therefore no amount of metadata regarding the intent of the code and expected behavior is available for use in parallelization or decomposition onto heterogeneous architectures during binary executable creation.

In order for traditional languages to evolve and integrate new features into the compiler directly, we currently depend on commercial compiler vendors to integrate new capabilities into very complex compiler tools and bring them to the market. While this does happen, it happens very slowly, often with compilers emerging with features introduced in standards some 10 years earlier. We see some parallel constructs now present in Fortran 2008, yet history shows that we should not expect them to be available for many years.

Furthermore, language standard bodies require features proposed for inclusion to be implemented and demonstrated first. There currently is no vehicle for rapidly testing and demonstrating features that are advantageous to algorithms of relevance, making them more likely to be considered for future standards.

Developing a Compiler Infrastructure for Highly Parallel and Heterogeneous Architectures

We currently lack a well-defined path forward to useable and programmable environments for the highly parallel and heterogeneous architectures on the horizon. Solutions such as the library technique and current efforts to re-engineer traditional languages are band-aids that mask the real problem: programmers are working with languages that were never suited for computers that look different than a single CPU with a single hierarchy of memory and storage. The von Neumann model remains pervasive in programming, yet mainstream hardware today is taking on characteristics of novel models such as dataflow that have been relegated to the fringe of computing in the past. How do we transition to these necessary new models without losing our massive investment in traditional languages? We propose a twofold approach.

Compiler Infrastructure

The solution, we believe starts with an open and extensible compiler infrastructure. Tools that manipulate software at the source level and facilitate language extensions require access to data and operations traditionally found in compiler tools. Existing production-quality compilers are often closed source. Those few that are open to tool developers are poorly designed with respect to extensibility, and impose a steep learning curve on tool developers who wish to leverage the technology provided by the compilers. Writing simple tools is far from simple. The elements of a well designed compiler infrastructure as we suggest are presented in this figure.

An extensible and componentized compiler infrastructure can aid in addressing heterogeneity. Research has shown that traditional languages can be directly compiled down to very novel architectures (as opposed to a kernel library approach). One example is the LANL-developed Trident compiler that targets FPGAs from C source code, generating VHDL as the compiler output. This type of compiler produces libraries that are fully optimized for both the particular program and the hardware. One of the most costly aspects of developing compilers for such novel architectures is fulfilling the need for a robust front-end and intermediate layer for representing the code.

Researchers on the Trident project are interested in a robust and extensible compiler infrastructure because it would provide front-end and analysis tools for multiple languages essentially for free. They would then be able to focus on building code generators to target novel hardware architectures. Currently, they are restricted to C due to the limitations of the compiler infrastructure that they chose to work within. This prohibits their techniques from being applied to the vast body of Fortran-based codes in use at LANL. A component compiler infrastructure can potentially make any language available to any type of tool developer.

In summary, a component compiler architecture will enable us to:

  1. implement new language capability necessary to make optimal use of new architectures,
  2. retain our investment in traditional languages,
  3. test and demonstrate features that are advantageous to algorithms of relevance for more rapid inclusion in future standards, and
  4. facilitate the fully optimized implementation of algorithms on heterogeneous architectures.

Education

More than anything, this approach recognizes that people are the key to the changes we need to make and seeks to provide programmers with an infrastructure to rapidly and creatively evolve high-performance computing programming tools. Such an effort needs to be accompanied by educational programs to raise the level of parallel programming expertise. Ultimately, it will be people, not compilers or languages that fully utilize these advanced architectures.

In this regard, even the available educational materials on parallel programming are insufficient. There is a need not only for courses, but also the textbooks required to introduce programmers to the practical use of new abstract techniques for creating algorithms to target new languages and architectures.

Conclusion

A new world of computing is coming, and with the presence of multicore and user programmable GPU coprocessors in commodity laptop computers, it is arguably already here. We are faced with heterogeneity of architectures coupled with a scattering of computations throughout the hierarchy of the computer itself. Two areas are key to such platforms succeeding:

  • Usability: These architectures will be usable when we have languages that allow programmers to express algorithms in a straight forward manner that preserves mathematical abstraction while utilizing the parallel capabilities of new heterogeneous architectures in all their complexity.
  • Programmability: Heterogeneity is complex from a code and data management perspective. These architectures will be programmable when compilers can automatically and optimally compile code for heterogeneous hardware.

To meet these challenges we need compiler infrastructure that can express analysis algorithms based on high-level parallel constructs, accommodate an evolutionary rather than revolutionary transition from traditional programming, and facilitate the optimal implementation of algorithms in executable code for heterogeneous architecture. Furthermore, educating programmers in new parallel programming constructs is essential preparation for this future.