Diego R. Llanos | Universidad de Valladolid
Papers by Diego R. Llanos
The Journal of Supercomputing, Apr 15, 2024
Currently, the generation of parallel codes that are portable to different kinds of parallel computers is a challenge. Many approaches have been proposed in recent years, following two different paths: programming from scratch using new programming languages and models that deal with parallelism explicitly, or automatically generating parallel codes from existing sequential programs. Using the current mainstream parallel languages, the programmer deals with mapping and optimization details, and is forced to take into account details of the execution platform to obtain good performance. With code generators that start from sequential programs, programmers cannot control basic mapping decisions, and often need to transform the code to expose to the compiler the information required to leverage important optimizations. This paper presents a new high-level parallel programming language named CMAPS, designed to be used with the Trasgo parallel programming framework. This language provides a simple and explicit way to express parallelism at a highly abstract level. The programmer does not face decisions about granularity, thread management, or interprocess communication. Thus, the programmer can express different parallel paradigms in an easy, unified, abstract, and portable form. The language supports the features required by transformation models such as Trasgo to generate parallel codes that adapt their communication and synchronization structures to target machines composed of mixed distributed- and shared-memory parallel multicomputers.
Parallel Computing, Nov 1, 2017
Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory and the exploitation of shared-memory models. In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities. Our experimental results show that, despite our runtime calculation, our approach can automatically produce programs whose efficiency is comparable to that of MPI reference codes and of codes generated with auto-parallelizing compilers.
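To make the idea of runtime-computed coarse-grained communication concrete, here is a minimal MPI sketch in C. It is not Trasgo's actual algorithm: it only illustrates the general pattern of deriving the exact boundary elements to exchange at runtime from a block partition that itself depends on the number of processes, so no tile size is fixed at compile time. The 1-D array, stencil radius of 1, and partitioning scheme are illustrative assumptions.

```c
/* Sketch: halo exchange where the communication set is computed at runtime
 * from the block partition, independently of any compile-time tile size. */
#include <mpi.h>
#include <stdio.h>

#define N 1024                                    /* global array size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Block partition computed at runtime: valid for any process count. */
    int chunk = (N + nprocs - 1) / nprocs;
    int lo = rank * chunk;
    int hi = (lo + chunk < N) ? lo + chunk : N;   /* owned range [lo, hi) */

    double local[chunk + 2];                      /* owned block plus halos */
    for (int i = 0; i < chunk + 2; i++) local[i] = rank;

    /* Exact communication set for a radius-1 stencil: each neighbor needs
     * exactly one boundary element of our block, derived from the partition. */
    int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    MPI_Sendrecv(&local[1], 1, MPI_DOUBLE, left, 0,
                 &local[chunk + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&local[chunk], 1, MPI_DOUBLE, right, 1,
                 &local[0], 1, MPI_DOUBLE, left, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) printf("halo exchange done, owned up to %d\n", hi);
    MPI_Finalize();
    return 0;
}
```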
The threadblock size and shape choice is one of the most important user decisions when a parallel problem is coded to run on GPU architectures. In fact, the threadblock configuration has a significant ...
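As a quick illustration of why shape matters independently of size, the following plain-C sketch mimics the CUDA index arithmetic for three threadblock shapes that all contain 256 threads. The problem size and the shapes tried are illustrative assumptions; the point is that the same block size yields different grid geometries and thread-to-element mappings.

```c
/* Sketch: how a 2-D threadblock shape determines grid geometry and the
 * global index a thread computes (blockIdx * blockDim + threadIdx). */
#include <stdio.h>

int main(void) {
    const int width = 4096, height = 4096;          /* problem size */
    const int shapes[3][2] = { {256, 1}, {32, 8}, {16, 16} };  /* 256 threads each */

    for (int s = 0; s < 3; s++) {
        int bx = shapes[s][0], by = shapes[s][1];
        /* Grid dimensions, rounded up as in a CUDA kernel launch. */
        int gx = (width  + bx - 1) / bx;
        int gy = (height + by - 1) / by;
        /* Global coordinates of thread (0,0) of block (1,1). */
        int gxi = 1 * bx, gyi = 1 * by;
        printf("block %3dx%-3d -> grid %4dx%-4d, block(1,1) maps to (%d,%d)\n",
               bx, by, gx, gy, gxi, gyi);
    }
    return 0;
}
```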
Intel Xeon Phi accelerators are among the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance with most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis guaranteeing that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software thread-level speculative parallelization library in the execution of well-known benchmarks. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the low computing power of its computational units when specific vectorization and SIMD instructions are not exploited indicates that further development of new specific techniques ...
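For readers unfamiliar with the technique, here is a toy C/OpenMP sketch of the speculate-validate-recover cycle behind thread-level speculation. It is not the paper's library: real TLS runtimes squash and restart only the offending iterations, while this toy runs all iterations optimistically against the pre-loop state, checks afterwards whether any logically earlier iteration wrote a location that a later one read, and falls back to sequential execution on a violation. The access pattern `idx[]` is an illustrative assumption.

```c
/* Toy thread-level speculation: speculative run, validation, recovery.
 * Compile with -fopenmp; without it the code still runs correctly. */
#include <stdio.h>
#include <string.h>

#define N 1000

int main(void) {
    static int a[N], idx[N], shadow[N], writer[N];
    for (int i = 0; i < N; i++) { a[i] = i; idx[i] = (i * 7) % N; writer[i] = -1; }

    /* Speculative phase: every iteration runs in parallel against the
     * pre-loop state, recording which iteration wrote each element. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        shadow[i] = a[idx[i]] + 1;   /* reads old state, writes its own slot */
        writer[i] = i;
    }

    /* Validation: iteration i read location idx[i]; the speculation is
     * wrong if a logically earlier iteration wrote that location. */
    int violated = 0;
    for (int i = 0; i < N && !violated; i++)
        if (writer[idx[i]] != -1 && writer[idx[i]] < i) violated = 1;

    if (violated)                    /* recovery: discard and rerun serially */
        for (int i = 0; i < N; i++) a[i] = a[idx[i]] + 1;
    else
        memcpy(a, shadow, sizeof(a));

    printf("violation detected: %s\n", violated ? "yes" : "no");
    return 0;
}
```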
OpenACC is a parallel programming model for the automatic parallelization of sequential code using compiler directives or pragmas. OpenACC is intended to be used with accelerators such as GPUs and the Xeon Phi. The different implementations of the standard, although still in early development, are primarily focused on GPU execution. In this study, we analyze how the different OpenACC compilers available under certain premises behave when the clauses affecting the underlying block geometry implementation are modified. These clauses set the gang number, worker number, and vector size defined by the standard.
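The three clauses in question appear directly on an OpenACC compute construct. The following minimal C example shows them on a SAXPY-style loop; the particular values (256 gangs, 4 workers, vector length 128) are illustrative assumptions, and omitting the clauses leaves the geometry choice to the compiler, which is exactly the behavior the study varies.

```c
/* Minimal OpenACC example fixing the gang/worker/vector geometry. */
#include <stdio.h>

#define N (1 << 20)

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* These clauses pin the block geometry that the compiler would
     * otherwise choose on its own. */
    #pragma acc parallel loop num_gangs(256) num_workers(4) vector_length(128)
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```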
Lecture Notes in Computer Science, 2017
Supercomputers are becoming more heterogeneous. They are composed of several machines with different computation capabilities and different kinds and families of accelerators, such as GPUs or Intel Xeon Phi coprocessors. Programming these machines is a hard task that requires a deep study of architectural details in order to exploit each computational unit efficiently. In this paper, we present an extension of a GPU-CPU heterogeneous programming model to include support for Intel Xeon Phi coprocessors. This contribution extends the previous model and its implementation by taking advantage of both the GPU communication model and the CPU execution model of the original approach to derive a new approach for the Xeon Phi. Our experimental results show that, using our approach, the programming effort needed to change the kind of target device is greatly reduced for several case studies. For example, when using our model to program a Mandelbrot benchmark, 97% of the application code is reused between a GPU implementation and a Xeon Phi implementation.
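The code-reuse figure is easier to picture with a small example. The sketch below does not use the paper's model; it shows the same idea with Intel's generic Language Extensions for Offload (LEO) pragmas, where the computational kernel is device-agnostic and only the offload directive changes when retargeting between host and coprocessor. Names and sizes are illustrative assumptions.

```c
/* Sketch of kernel reuse between host and Xeon Phi via Intel LEO pragmas
 * (compile with the Intel compiler; on a host without a Phi the directive
 * falls back to host execution). */
#include <stdio.h>

#define N 1024

/* The computation itself is device-agnostic and fully reused. */
__attribute__((target(mic)))
void scale(float *v, int n, float f) {
    for (int i = 0; i < n; i++) v[i] *= f;
}

int main(void) {
    static float v[N];
    for (int i = 0; i < N; i++) v[i] = (float)i;

    /* Only this directive changes when retargeting: drop it to run the
     * same kernel on the host, keep it to offload to the coprocessor. */
    #pragma offload target(mic) inout(v : length(N))
    scale(v, N, 2.0f);

    printf("v[10] = %f\n", v[10]);
    return 0;
}
```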
International Journal of Parallel Programming, 2018
The Journal of Supercomputing, 2018
Concurrency and Computation: Practice and Experience, 2018
2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2017
Lecture Notes in Computer Science, 2016
The Journal of Supercomputing, 2016
This paper presents an extension that adds XML capabilities to Cetus, a source-to-source compiler developed at Purdue University. In this work, the Cetus Intermediate Representation is converted into an XML DOM tree that, in turn, enables XML capabilities such as searching for specific code features through XPath expressions. As an example, we write XPath code to find private and shared variables for parallel execution in C source code. Loopest is a Java program with embedded XPath expressions. While Cetus needs 2573 lines of internal Java code to locate private variables in an input code, Loopest needs a total of only 425 lines of code to determine the same private variables in the equivalent XML representation. Using XPath as the search method provides a second advantage over Cetus: extensibility. Changes in Cetus require a deep knowledge of Java, of the Cetus internal structure, and of its Intermediate Representation. In contrast, changes in Loopest are easier because it only depends on XPath to ...
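To give a flavor of the approach, here is a minimal C sketch that queries an XML-serialized IR with XPath through libxml2. Loopest itself is a Java program, and the element and attribute names below are hypothetical, since the actual schema of Cetus' XML representation is not shown here; only the query-over-a-DOM-tree pattern is the point.

```c
/* Sketch: XPath query over an XML dump of a compiler IR, using libxml2.
 * The node names (Loop, Declaration, @name) are hypothetical. */
#include <stdio.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

int main(void) {
    xmlDocPtr doc = xmlParseFile("ir.xml");   /* XML dump of the IR */
    if (!doc) return 1;

    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    /* Hypothetical query: names declared inside loop bodies, which would
     * be candidates for privatization analysis. */
    xmlXPathObjectPtr res = xmlXPathEvalExpression(
        BAD_CAST "//Loop//Declaration/@name", ctx);

    if (res && res->nodesetval)
        for (int i = 0; i < res->nodesetval->nodeNr; i++)
            printf("candidate: %s\n",
                   (char *) xmlNodeGetContent(res->nodesetval->nodeTab[i]));

    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return 0;
}
```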