TITLE

A Scalable Farm Skeleton for Hybrid Parallel and Distributed Programming

AUTHOR(S)
Ernsting, Steffen; Kuchen, Herbert
PUB. DATE
December 2014
SOURCE
International Journal of Parallel Programming;Dec2014, Vol. 42 Issue 6, p968
SOURCE TYPE
Academic Journal
DOC. TYPE
Article
ABSTRACT
Multi-core processors and clusters of multi-core processors are ubiquitous. They provide scalable performance but introduce complex, low-level programming models for shared and distributed memory programming. Fully exploiting the potential of shared and distributed memory parallelization can therefore be a tedious and error-prone task: programmers must take care of low-level threading and communication (e.g. message passing) details. Algorithmic Skeletons have been proposed to assist programmers in developing performant and reliable parallel applications. They encapsulate well-defined, frequently recurring parallel and distributed programming patterns, thus shielding programmers from low-level aspects of parallel and distributed programming. In this paper we take on the design and implementation of the well-known Farm skeleton. To address the hybrid architecture of multi-core clusters, we present a two-tier implementation built on top of MPI and OpenMP. On the basis of three benchmark applications, namely a simple ray tracer, an interacting particles system, and an application for calculating the Mandelbrot set, we illustrate the advantages of both skeletal programming in general and this two-tier approach in particular.
ACCESSION #
97731308
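The abstract describes a Farm skeleton (independent tasks handed out to a pool of workers) implemented in two tiers: one across cluster nodes and one across the cores of each node. The sketch below illustrates only the structure of that idea, under stated assumptions. It is not the authors' implementation, which the abstract says is built on MPI and OpenMP. Here, threads stand in for both tiers. The names `thread_farm`, `split`, `two_tier_farm` and `mandelbrot_row` are hypothetical. The choice of a static block split across "nodes" with dynamic task dispatch inside each node is also an assumption, since the abstract does not say how tasks are distributed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def thread_farm(tasks: Sequence[T], worker: Callable[[T], R], n_threads: int) -> List[R]:
    """Inner tier (hypothetical stand-in for OpenMP): farm tasks over threads on one node."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Idle threads pick up the next pending task; results come back in input order.
        return list(pool.map(worker, tasks))


def split(tasks: Sequence[T], n_parts: int) -> List[Sequence[T]]:
    """Cut the task list into n_parts contiguous blocks whose sizes differ by at most one."""
    q, r = divmod(len(tasks), n_parts)
    blocks, start = [], 0
    for i in range(n_parts):
        end = start + q + (1 if i < r else 0)
        blocks.append(tasks[start:end])
        start = end
    return blocks


def two_tier_farm(tasks: Sequence[T], worker: Callable[[T], R],
                  n_nodes: int, threads_per_node: int) -> List[R]:
    """Outer tier (hypothetical stand-in for MPI processes): one block of tasks per node."""
    blocks = split(tasks, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as nodes:
        partial = nodes.map(lambda b: thread_farm(b, worker, threads_per_node), blocks)
        # Concatenating the blocks in node order restores the original task order.
        return [res for block in partial for res in block]


def mandelbrot_row(y: int, width: int = 64, height: int = 32, max_iter: int = 100) -> List[int]:
    """Escape-time iteration counts for one image row (one farm task)."""
    ci = -1.0 + 2.0 * y / (height - 1)
    counts = []
    for x in range(width):
        cr = -2.0 + 3.0 * x / (width - 1)
        zr = zi = 0.0
        n = 0
        while n < max_iter and zr * zr + zi * zi <= 4.0:
            zr, zi = zr * zr - zi * zi + cr, 2.0 * zr * zi + ci
            n += 1
        counts.append(n)
    return counts


if __name__ == "__main__":
    rows = list(range(32))
    image = two_tier_farm(rows, mandelbrot_row, n_nodes=2, threads_per_node=4)
    assert image == [mandelbrot_row(y) for y in rows]
    print(len(image), "rows computed")  # prints "32 rows computed"
```

In this sketch, the skeleton's user supplies only the task list and a sequential `worker` function, and the farm hides all thread management. This mirrors the separation of concerns that the abstract attributes to skeletons. Because of CPython's global interpreter lock, pure-Python workers on threads show the structure of a two-tier farm but not its speedup.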

 

Related Articles

  • Performance scalability and energy consumption on distributed and many-core platforms. Karanikolaou, E.; Milovanović, E.; Milovanović, I.; Bekakos, M. // Journal of Supercomputing;Oct2014, Vol. 70 Issue 1, p349 

    In this paper, the performance evaluation of distributed and many-core computer complexes, in conjunction with their consumed energy, is investigated. The distributed execution of a specific problem on an interconnected processors platform requires a larger amount of energy compared to the...

  • An Improved DSM System Design and Implementation. Ramesh, T.; Sudhakar, Chapram // International Journal of Next-Generation Computing;Nov2012, Vol. 3 Issue 3, p312 

    In this paper, an Improved Distributed Shared Memory (IDSM) system, a hybrid version of shared memory and message passing version is proposed. This version effectively uses the benefits of shared memory in terms of ease of programming and message passing in terms of efficiency. Further it is...

  • Parallelizing Complex Streaming Applications on Distributed Scratchpad Memory Multicore Architecture. Chen, Shin-Kai; Hung, Cheng-Yu; Chen, Ching-Chih; Liu, Chih-Wei // International Journal of Parallel Programming;Dec2014, Vol. 42 Issue 6, p875 

    Multicore processors can provide sufficient computing power and flexibility for complex streaming applications, such as high-definition video processing. For less hardware complexity and power consumption, the distributed scratchpad memory architecture is considered, instead of the cache memory...

  • An Efficient Scalable Runtime System for Macro Data Flow Processing Using S-Net. Gijsbers, Bert; Grelck, Clemens // International Journal of Parallel Programming;Dec2014, Vol. 42 Issue 6, p988 

    S-Net is a declarative coordination language and component technology aimed at radically facilitating software engineering for modern parallel compute systems by near-complete separation of concerns between application (component) engineering and concurrency orchestration. S-Net builds on the...

  • Distributed FlowVisor: a distributed FlowVisor platform for quality of service aware cloud network virtualisation. Lingxia Liao; Shami, Abdallah; Leung, Victor C. M. // IET Networks;2015, Vol. 4 Issue 5, p270 

    Cloud-based virtual networking environments are required to provide fine-grained quality of service (QoS) control without sacrificing scalability. However, no single approach can currently achieve these two goals simultaneously. FlowVisor is a building block to virtualise networks with...

  • Automatic Skeleton-Driven Memory Affinity for Transactional Worklist Applications. Góes, Luís; Ribeiro, Christiane; Castro, Márcio; Méhaut, Jean-François; Cole, Murray; Cintra, Marcelo // International Journal of Parallel Programming;Apr2014, Vol. 42 Issue 2, p365 

    Memory affinity has become a key element to achieve scalable performance on multi-core platforms. Mechanisms such as thread scheduling, page allocation and cache prefetching are commonly employed to enhance memory affinity which keeps data close to the cores that access it. In particular,...

  • Microarchitectural performance comparison of Intel Knights Corner and Intel Sandy Bridge with CFD applications. Che, Yonggang; Zhang, Lilun; Wang, Yongxian; Xu, Chuanfu; Liu, Wei; Wang, Zhenghua // Journal of Supercomputing;Oct2014, Vol. 70 Issue 1, p321 

    This paper comparatively evaluates the microarchitectural performance of two representative Computational Fluid Dynamics (CFD) applications on the Intel Many Integrated Core (MIC) product, the Intel Knights Corner (KNC) coprocessor, and the Intel Sandy Bridge (SNB) processor. Performance...

  • A methodology for speeding up matrix vector multiplication for single/multi-core architectures. Kelefouras, Vasilios; Kritikakou, Angeliki; Papadima, Elissavet; Goutis, Costas // Journal of Supercomputing;Jul2015, Vol. 71 Issue 7, p2644 

    In this paper, a new methodology for computing the Dense Matrix Vector Multiplication, for both embedded (processors without SIMD unit) and general purpose processors (single and multi-core processors, with SIMD unit), is presented. This methodology achieves higher execution speed than ATLAS...

  • A SURVEY: SOFTWARE-MANAGED ON-CHIP MEMORIES. ALAM, Shahid; HORSPOOL, Nigel // Computing & Informatics;2015, Vol. 34 Issue 5, p1168 

    Processors are unable to achieve significant gains in speed using the conventional methods. For example, increasing the clock rate increases the average access time to on-chip caches, which in turn lowers the average number of instructions per cycle of the processor. On-chip memory system will be...
