Available task-level parallelism on the Cell BE

Rico, Alejandro; Ramirez, Alex; Valero, Mateo
January 2009
Scientific Programming;2009, Vol. 17 Issue 1/2, p59
Academic Journal
There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models, in which the programmer identifies parallel tasks and the runtime manages the inter-task dependencies, have been identified as suitable for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to a higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental factor in performance, even though the working set is not that large. We expect a significant performance impact if task management used a fast private memory to store the task dependency graph instead of relying on the cache hierarchy.


Related Articles

  • Speculative High Performance Computation on Heterogeneous Multi-Core. Liu Cong; Wang Wen; Wang Zhiying // Advanced Materials Research;2014, Vol. 1049-1050, p2126 

    Thread-level speculation (TLS) has been proposed and researched to parallelize traditional sequential applications on homogeneous multi-core architectures. In this paper, a heterogeneous multi-core hardware simulation system is presented, which provides a TLS execution mechanism. With a novel TLS...

  • A Review of Transactional Memory in Multicore Processors. Wang, X.; Ji, Z.; Fu, C.; Hu, M. // Information Technology Journal;2010, Vol. 9 Issue 1, p192 

    To develop composable parallel programs easily and get high performance, many transactional memory systems have been proposed to solve the synchronization problem of multicore processors. Transactional memory can be implemented in hardware, software, or a hybrid of the two. There are many hot...

  • MPI runtime error detection with MUST: Advances in deadlock detection. Hilbrich, Tobias; Protze, Joachim; Schulz, Martin; de Supinski, Bronis R.; Müller, Matthias S. // Scientific Programming;2013, Vol. 21 Issue 4, p109 

    The widely used Message Passing Interface (MPI) is complex and rich. As a result, application developers require automated tools to avoid and to detect MPI programming errors. We present the Marmot Umpire Scalable Tool (MUST) that detects such errors with significantly increased scalability. We...

  • Performance scalability and energy consumption on distributed and many-core platforms. Karanikolaou, E.; Milovanović, E.; Milovanović, I.; Bekakos, M. // Journal of Supercomputing;Oct2014, Vol. 70 Issue 1, p349 

    In this paper, the performance evaluation of distributed and many-core computer complexes, in conjunction with their consumed energy, is investigated. The distributed execution of a specific problem on an interconnected processors platform requires a larger amount of energy compared to the...

  • Industrial robot control.  // Control Engineering;May2011, Vol. 58 Issue 5, p16 

    The article focuses on the Microsoft Windows-based KR C4 industrial robotic controls from Kuka AG with improved hardware functions. It says that integrated safety and energy savings software was used for the robot's hardware functions, which reduced hardware by 35% and plug connections by 50% and...

  • Inherent Limitations on Disjoint-Access Parallel Implementations of Transactional Memory. Attiya, Hagit; Hillel, Eshcar; Milani, Alessia // Theory of Computing Systems;Nov2011, Vol. 49 Issue 4, p698 

    Transactional memory (TM) is a popular approach for alleviating the difficulty of programming concurrent applications; TM guarantees that a transaction, consisting of a sequence of operations, appears to be executed atomically. Two fundamental properties of TM implementations are disjoint-access...

  • Hardware and Software Synthesis of Heterogeneous Systems from Dataflow Programs. Roquier, Ghislain; Bezati, Endri; Mattavelli, Marco // Journal of Electrical & Computer Engineering;2012, p1 

    The new generation of multicore processors and reconfigurable hardware platforms provides a dramatic increase of the available parallelism and processing capabilities. However, one obstacle for exploiting all the promises of such platforms is deeply rooted in sequential thinking. The sequential...

  • Concurrent programming in web applications. Erb, Benjamin; Kargl, Frank; Domaschka, Jörg // IT: Information Technology;Jun2014, Vol. 56 Issue 3, p119 

    Modern web applications are concurrently used by many users and provide increasingly interactive features. Multi-core processors, highly distributed backend architectures, and new web technologies force a reconsideration of approaches for concurrent programming in order to fulfil scalability...

  • Partial Runtime Reconfiguration of FPGA, Applications and a Fault Emulation Case Study. Legat, Uroš; Biasizzo, Anton; Novak, Franc // International Review on Computers & Software;Sep2009, Vol. 4 Issue 5, p606 

    The paper surveys the basic principles of partial runtime reconfiguration of FPGAs and comments on their possible applications in practice. The dynamic partial reconfiguration of an FPGA is a process of reconfiguring a part of the FPGA logic, while the rest of the logic is unaffected by the...

