Samuel Thibault - Academia.edu
Papers by Samuel Thibault
HAL (Le Centre pour la Communication Scientifique Directe), Feb 18, 2022
Our analysis of two different voting systems, Neovote and Belenios, showed that even though they appear technically secure, attacks remain conceivable in both cases. Moreover, an attack against an online voting system can quickly compromise the entire election while going completely unnoticed, and is therefore difficult to contest.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 10, 2022
Example with a memory holding 2 data items. The graph of input data dependencies is shown on the left; the figure on the right corresponds to the partition and schedule produced by the Deque Model Data Aware Ready (DMDAR) scheduler.
HAL (Le Centre pour la Communication Scientifique Directe), Feb 16, 2022
Clusters make use of workload schedulers such as the Slurm Workload Manager to allocate computing jobs onto nodes. These schedulers usually aim at a good trade-off between increasing resource utilization and user satisfaction (decreasing job waiting time). However, these schedulers are typically unaware of jobs sharing large input files, which may happen in data-intensive scenarios. The same input files may be loaded several times, leading to a waste of resources. We study how to design a data-aware job scheduler that is able to keep large input files on the computing nodes, without impacting other memory needs, and can use previously loaded files to limit data transfers in order to reduce the waiting times of jobs. We present three schedulers capable of distributing the load between the computing nodes as well as re-using an input file already loaded in the memory of some node as much as possible. We perform simulations using real cluster usage traces to compare them to classical job schedulers. The results show that keeping data in local memory between successive jobs and using data locality information to schedule jobs allows a reduction in job waiting time and a drastic decrease in the amount of data transfers.
HAL (Le Centre pour la Communication Scientifique Directe), Oct 4, 2020
HAL (Le Centre pour la Communication Scientifique Directe), Mar 1, 2015
National audience
HAL (Le Centre pour la Communication Scientifique Directe), Oct 8, 2008
International audience
European Conference on Parallel Processing, 2013
Computational accelerators such as GPUs, FPGAs and many-core accelerators can dramatically improve the performance of computing systems and catalyze highly demanding applications. Many scientific and commercial applications are beginning to integrate computational accelerators in their code. However, programming accelerators for high performance remains a challenge, resulting from the restricted architectural features of accelerators compared to general-purpose CPUs. Moreover, software must jointly use conventional CPUs with accelerators to support legacy code and benefit from general-purpose operating system services. The objective of this topic is to provide a forum for exchanging new ideas and findings in the domain of accelerator-based computing.
Clusters employ workload schedulers such as the Slurm Workload Manager to allocate computing jobs onto nodes. These schedulers usually aim at a good trade-off between increasing resource utilization and user satisfaction (decreasing job waiting time). However, these schedulers are typically unaware of jobs sharing large input files, which may happen in data-intensive scenarios. The same input files may end up being loaded several times, leading to a waste of resources. We study how to design a data-aware job scheduler that is able to keep large input files on the computing nodes, without impacting other memory needs, and can benefit from previously loaded files to decrease data transfers in order to reduce the waiting times of jobs. We present three schedulers capable of distributing the load between the computing nodes as well as re-using input files already loaded in the memory of some node as much as possible. We perform simulations with single-node jobs using traces of real HPC-cluster usage to compare them to classical job schedulers. The results show that keeping data in local memory between successive jobs and using data-locality information to schedule jobs improves performance compared to a widely-used scheduler (FCFS, with and without backfilling): a reduction in job waiting time (a 7.5% improvement in stretch), and a decrease in the amount of data transfers (7%).
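The data-aware placement idea described above can be sketched with a toy greedy rule (this is an illustration, not one of the paper's three schedulers): when a job's input file is already resident in some node's memory, prefer that node; otherwise fall back to the least-loaded one. The `place_job` helper and the node dictionaries are hypothetical names introduced here.

```python
# Hypothetical sketch of data-aware job placement: prefer a node whose
# memory already caches the job's input file, else the least-loaded node.

def place_job(job_file, nodes):
    """nodes: list of dicts with 'cached' (set of file ids) and 'load' (int)."""
    # Nodes that already hold the input file avoid a reload from storage.
    candidates = [n for n in nodes if job_file in n["cached"]]
    target = min(candidates or nodes, key=lambda n: n["load"])
    target["cached"].add(job_file)   # keep the file resident for reuse
    target["load"] += 1
    return target

nodes = [
    {"name": "n0", "cached": {"A"}, "load": 3},
    {"name": "n1", "cached": set(), "load": 0},
]
# A job reading file "A" is placed on n0 despite its higher load,
# saving one data transfer; a job reading "B" goes to the idle n1.
print(place_job("A", nodes)["name"])
print(place_job("B", nodes)["name"])
```

A real scheduler must of course also bound the memory devoted to such caching so that it does not impact other memory needs, which is precisely the constraint the paper studies.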
HAL (Le Centre pour la Communication Scientifique Directe), Sep 9, 2022
Multidimensional scaling (MDS) is an important and robust algorithm for representing individual cases of a dataset from their respective dissimilarities. However, heuristics, possibly trading off robustness, are often preferred in practice due to the potentially prohibitive memory and computational costs of MDS. The recent introduction of random projection techniques within MDS allowed it to become competitive on larger test cases. The goal of this manuscript is to propose a high-performance distributed-memory MDS based on random projection for processing data sets of even larger size (up to one million items). We propose a task-based design of the whole algorithm and we implement it within an efficient software stack including state-of-the-art numerical solvers, runtime systems and communication layers. The outcome is the ability to efficiently apply robust MDS to large data sets on modern supercomputers. We apply the resulting algorithm and software stack to point cloud visualization for analyzing distances between sequences in metabarcoding.
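For intuition about what MDS computes, here is a minimal single-process sketch of classical (Torgerson) MDS; the paper's contribution is the distributed, random-projection-based variant that scales to around a million items, which this toy version does not attempt.

```python
# Minimal classical MDS: embed points from a pairwise-distance matrix.
import numpy as np

def classical_mds(D, k=2):
    """D: n x n matrix of pairwise distances; returns an n x k embedding."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # keep the k largest
    L = np.sqrt(np.maximum(w[idx], 0.0))
    return V[:, idx] * L

# Three collinear points at coordinates 0, 1 and 3: MDS recovers a 1-D
# layout whose pairwise distances match the input dissimilarities.
D = np.array([[0., 1., 3.], [1., 0., 2.], [3., 2., 0.]])
X = classical_mds(D, k=1)
print(abs(X[0, 0] - X[1, 0]), abs(X[0, 0] - X[2, 0]))
```

The full eigendecomposition above is exactly the O(n^2)-memory, O(n^3)-time cost that makes random projection attractive at scale.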
HAL (Le Centre pour la Communication Scientifique Directe), Jul 5, 2022
Task-based systems have gained popularity thanks to their ability to fully exploit the computing power of complex heterogeneous architectures. A common programming model is the Sequential Task Flow (STF) model, which unfortunately can only handle static task graphs. This potentially leads to submission overhead, and the static task graph is not necessarily suited to execution on a heterogeneous system. A standard solution is to find a compromise between the granularity that exploits the power of accelerators and the one needed for good CPU performance. To address these problems, we propose to extend the STF model provided by the STARPU runtime system [4] by adding the ability to turn certain tasks into subgraphs during execution. We call these tasks hierarchical tasks. This approach makes it possible to express more dynamic task graphs. By combining this new model with an automatic data manager, the granularity can be adapted dynamically to provide an optimal size to the various targeted computing resources. We show in this paper that the hierarchical task model is valid, and we give a first evaluation of its performance using the CHAMELEON dense linear algebra library [1].
Future Generation Computer Systems, Jun 1, 2023
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products, Cholesky factorization, randomized task orders, randomized data pairs from the 2D matrix product, as well as a sparse matrix product. We introduce a visual way to understand this performance, together with lower bounds on the number of data loads for the 2D and 3D matrix products.
Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
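The "optimal eviction strategy" mentioned above is, for a fixed task order, in the spirit of the classic Belady/MIN rule: evict the resident datum whose next use is furthest in the future. The sketch below counts data loads under that rule; it is a simplified toy model, not the paper's exact GPU memory model, and `count_loads` is a name introduced here.

```python
# Belady-style eviction for a fixed task order: evict the resident datum
# that is reused latest (or never again). Assumes each task's inputs fit
# in memory together.

def count_loads(tasks, capacity):
    """tasks: list of sets of input data ids; returns the number of loads."""
    memory, loads = set(), 0
    for i, inputs in enumerate(tasks):
        for d in inputs:
            if d in memory:
                continue                      # already resident: no load
            if len(memory) >= capacity:
                def next_use(x):              # index of x's next use, if any
                    for j in range(i, len(tasks)):
                        if x in tasks[j]:
                            return j
                    return len(tasks)         # never used again
                # Never evict an input of the current task.
                memory.remove(max(memory - inputs, key=next_use))
            memory.add(d)
            loads += 1
    return loads

# Two tasks sharing datum "a": with room for 2 data, "a" is loaded once
# and "b" is evicted in favor of "c", for 3 loads in total.
print(count_loads([{"a", "b"}, {"a", "c"}], capacity=2))
```

Note that this only fixes the eviction policy; the hard (NP-complete) part shown in the paper is choosing the task order itself so that such reuse opportunities arise.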
Lecture Notes in Computer Science, 2023
Task-based systems have gained popularity because of their promise of exploiting the computational power of complex heterogeneous systems. A common programming model is the so-called Sequential Task Flow (STF) model, which, unfortunately, has the intrinsic limitation of supporting static task graphs only. This leads to potential submission overhead and to a static task graph which is not necessarily adapted for execution on heterogeneous systems. A standard approach is to find a trade-off between the granularity needed by accelerator devices and the one required by CPU cores to achieve performance. To address these problems, we extend the STF model in the StarPU runtime system to enable task subgraphs at runtime. We refer to these tasks as hierarchical tasks. This approach allows for a more dynamic task graph. This extended model, combined with an automatic data manager, makes it possible to dynamically adapt the granularity to meet the optimal size of the targeted computing resource. We show that the hierarchical task model is correct and we provide an early evaluation on shared-memory heterogeneous systems, using the Chameleon dense linear algebra library.
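The hierarchical-task idea can be illustrated with a toy executor (this is not StarPU's API; the dictionaries and the `run` helper are invented for illustration): a coarse task submitted to the graph may, when reached at run time, expand into a subgraph of finer-grained tasks instead of executing directly, which is how granularity becomes a runtime decision.

```python
# Toy sequential executor: a "hierarchical" task carries an 'expand'
# callback producing a subgraph, spliced in at the point it is reached.

def run(tasks, results=None):
    """Execute tasks in submission order, expanding hierarchical ones."""
    results = [] if results is None else results
    for t in tasks:
        if "expand" in t:                 # hierarchical task: replace it
            run(t["expand"](), results)   # by its subgraph, recursively
        else:
            results.append(t["work"]())
    return results

# A coarse tile operation that splits itself into two finer tile tasks.
coarse = {"expand": lambda: [{"work": lambda: "tile0"},
                             {"work": lambda: "tile1"}]}
print(run([{"work": lambda: "setup"}, coarse]))
```

In a real runtime the expansion is driven by data partitioning and by which resource (CPU core vs. accelerator) will execute the work, but the control flow is the same: the graph grows while it runs.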
HAL (Le Centre pour la Communication Scientifique Directe), Feb 22, 2017
The significant increase in hardware complexity that occurred in the last few years led the high-performance community to design many scientific libraries according to a task-based parallelization. Modeling the performance of the individual tasks (or kernels) they are composed of is crucial for facing multiple challenges as diverse as performing accurate performance predictions, designing robust scheduling algorithms, tuning the applications, etc. Fine-grain modeling such as emulation and cycle-accurate simulation may lead to very accurate results. However, not only may their high cost be prohibitive, but they furthermore require a high-fidelity model of the processor, which makes them hard to deploy in practice. In this paper, we propose an alternative coarse-grain, empirical methodology, oblivious to both the target code and the hardware architecture, which leads to robust and accurate timing predictions. We illustrate our approach with a task-based Fast Multipole Method (FMM) algorithm, whose kernels are highly irregular, implemented in the ScalFMM library on top of the StarPU task-based runtime system and the SimGrid simulator.
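The basic flavor of such coarse-grain empirical modeling can be sketched as follows (a deliberately simple stand-in for the paper's methodology): fit each kernel's duration as an affine function of an input-size parameter from a few measured executions, then predict unseen sizes. The measurements below are fictitious.

```python
# History-based kernel timing model: least-squares affine fit of
# duration versus input size, from a handful of measurements.

def fit_affine(samples):
    """samples: list of (size, time) pairs; returns a predictor size -> time."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * t for s, t in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    b = (sy - a * sx) / n                           # intercept
    return lambda size: a * size + b

# Fictitious measurements (size, milliseconds) for one kernel.
model = fit_affine([(100, 1.2), (200, 2.2), (400, 4.2)])
print(round(model(300), 1))
```

Real runtime systems typically maintain one such model per kernel and per device, refined online as more executions are observed; irregular kernels like those of the FMM are exactly the case where a purely analytical model would fail and measured history pays off.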
HAL (Le Centre pour la Communication Scientifique Directe), Apr 1, 2020
Enables the correction of transient errors on the address lines of the memory channel. Traditional parity is limited to detecting and recovering from single-bit errors. Category: considered a "must have" in any production HPC system. Memory Lockstep: lets two memory channels work as a single channel, moving a data word two channels wide and providing eight bits of memory correction, thereby protecting against both single-bit and multi-bit errors.
HAL (Le Centre pour la Communication Scientifique Directe), Feb 12, 2020
Concurrency and Computation: Practice and Experience
Task-based systems have gained popularity because of their promise of exploiting the computational power of complex heterogeneous systems. A common programming model is the so-called Sequential Task Flow (STF) model, which, unfortunately, has the intrinsic limitation of supporting static task graphs only. This leads to potential submission overhead and to a static task graph which is not necessarily adapted for execution on heterogeneous systems. A standard approach is to find a trade-off between the granularity needed by accelerator devices and the one required by CPU cores to achieve performance. To address these problems, we extend the STF model in the StarPU runtime system to enable task subgraphs at runtime. We refer to these tasks as hierarchical tasks. This approach allows for a more dynamic task graph. This extended model, combined with an automatic data manager, makes it possible to dynamically adapt the granularity to meet the optimal size of the targeted computing resource. We show that the hierarchical task model is correct and we provide an early evaluation on shared-memory heterogeneous systems, using the Chameleon dense linear algebra library.
Future Generation Computer Systems
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products, Cholesky factorization, randomized task orders, randomized data pairs from the 2D matrix product, as well as a sparse matrix product. We introduce a visual way to understand this performance, together with lower bounds on the number of data loads for the 2D and 3D matrix products.
Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
HAL (Le Centre pour la Communication Scientifique Directe), Mar 29, 2022
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products, Cholesky factorization, randomized task orders, randomized data pairs from the 2D matrix product, as well as a sparse matrix product. We introduce a visual way to understand this performance, together with lower bounds on the number of data loads for the 2D and 3D matrix products.
Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
HAL (Le Centre pour la Communication Scientifique Directe), Aug 29, 2021
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is able to order tasks and prefetch their input data in the GPU memory (after possibly evicting some previously-loaded data), while aiming at minimizing data movements, so as to reduce the total processing time. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. These strategies have been implemented in the StarPU runtime system, which allows us to test them on a variety of linear algebra problems. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
HAL (Le Centre pour la Communication Scientifique Directe), Mar 22, 2021
Anticipating the behavior of applications and studying and designing algorithms are among the most important purposes of performance and correctness studies of simulations and applications related to intensive computing. Many frameworks were designed to simulate large distributed computing infrastructures and the applications running on them. At the node level, some frameworks have also been proposed to simulate task-based parallel applications. However, one critical capability missing from these works is the ability to take Non-Uniform Memory Access (NUMA) effects into account, even though virtually every HPC platform nowadays exhibits such effects. We thus enhance an existing simulator for dependency-based task-parallel applications to enable experimenting with multiple data locality models. We also introduce two locality-aware performance models: we update a lightweight communication-oriented model that uses topology information to weight data transfers, and introduce a more complex communications-and-cache model that takes into account data storage in the LLC. We validate both models on dense linear algebra test cases and show that, on average, our simulator reproducibly predicts execution time with a small relative error.
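The "topology information to weight data transfers" idea can be sketched as follows. This is an assumed toy model, not the simulator's actual one: the distance matrix, bandwidth value, and `transfer_time` helper are all hypothetical.

```python
# Illustrative NUMA-aware transfer model: a data movement is penalized
# by the topological distance between the NUMA node holding the data
# and the node whose core executes the task.

# Hypothetical 2-node NUMA distance factors (local = 1.0, remote = 1.6).
DIST = [[1.0, 1.6],
        [1.6, 1.0]]
BYTES_PER_SEC = 10e9    # assumed local memory bandwidth

def transfer_time(nbytes, data_node, exec_node):
    """Predicted time to move nbytes, weighted by NUMA distance."""
    return nbytes / BYTES_PER_SEC * DIST[data_node][exec_node]

# Under this model, a remote access is 60% slower than a local one.
local = transfer_time(1e9, data_node=0, exec_node=0)
remote = transfer_time(1e9, data_node=0, exec_node=1)
print(remote / local)
```

A simulator plugs such per-transfer costs into the task schedule; the paper's second, richer model additionally tracks which data still sit in the last-level cache, which a bandwidth-only model like this one cannot capture.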
HAL (Le Centre pour la Communication Scientifique Directe), Feb 18, 2022
L'analyse de deux systèmes de vote différents, Neovote et Belenios, nous a montré que même si tec... more L'analyse de deux systèmes de vote différents, Neovote et Belenios, nous a montré que même si techniquement ils semblent sûrs, des attaques restent envisageables dans les deux cas. De plus, une attaque contre un système de vote en ligne peut rapidement compromettre l'ensemble de l'élection tout en passant complètement inaperçue, et étant donc difficilement contestable.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 10, 2022
Example with a memory of size 2 data. The graph of input data dependencies is shown on the left. ... more Example with a memory of size 2 data. The graph of input data dependencies is shown on the left. The figure on the right corresponds to the partition and schedule produced by the scheduler ▸ Deque Model Data Aware Ready (DMDAR): Deque Model Data Aware Ready (DMDAR):
HAL (Le Centre pour la Communication Scientifique Directe), Feb 16, 2022
Clusters make use of workload schedulers such as the Slurm Workload Manager to allocate computing... more Clusters make use of workload schedulers such as the Slurm Workload Manager to allocate computing jobs onto nodes. These schedulers usually aim at a good trade-o between increasing resource utilization and user satisfaction (decreasing job waiting time). However, these schedulers are typically unaware of jobs sharing large input les, which may happen in data intensive scenarios. The same input les may be loaded several times, leading to a waste of resources. We study how to design a data-aware job scheduler that is able to keep large input les on the computing nodes, without impacting other memory needs, and can use previously loaded les to limit data transfers in order to reduce the waiting times of jobs. We present three schedulers capable of distributing the load between the computing nodes as well as re-using an input le already loaded in the memory of some node as much as possible. We perform simulations using real cluster usage traces to compare them to classical job schedulers. The results show that keeping data in local memory between successive jobs and using data locality information to schedule jobs allows a reduction in job waiting time and a drastic decrease in the amount of data transfers.
HAL (Le Centre pour la Communication Scientifique Directe), Oct 4, 2020
HAL (Le Centre pour la Communication Scientifique Directe), Mar 1, 2015
National audienc
HAL (Le Centre pour la Communication Scientifique Directe), Oct 8, 2008
International audienc
European Conference on Parallel Processing, 2013
Computational accelerators such as GPUs, FPGAs and many-core accelerators can dramatically improv... more Computational accelerators such as GPUs, FPGAs and many-core accelerators can dramatically improve the performance of computing systems and catalyze highly demanding applications. Many scientific and commercial applications are beginning to integrate computational accelerators in their code. However, programming accelerators for high performance remains a challenge, resulting from the restricted architectural features of accelerators compared to general purpose CPUs. Moreover, software must conjointly use conventional CPUs with accelerators to support legacy code and benefit from general purpose operating system services. The objective of this topic is to provide a forum for exchanging new ideas and findings in the domain of accelerator-based computing.
Clusters employ workload schedulers such as the Slurm Workload Manager to allocate computing jobs... more Clusters employ workload schedulers such as the Slurm Workload Manager to allocate computing jobs onto nodes. These schedulers usually aim at a good trade-off between increasing resource utilization and user satisfaction (decreasing job waiting time). However, these schedulers are typically unaware of jobs sharing large input files, which may happen in data intensive scenarios. The same input files may end up being loaded several times, leading to a waste of resources. We study how to design a data-aware job scheduler that is able to keep large input files on the computing nodes, without impacting other memory needs, and can benefit from previouslyloaded files to decrease data transfers in order to reduce the waiting times of jobs. We present three schedulers capable of distributing the load between the computing nodes as well as re-using input files already loaded in the memory of some node as much as possible. We perform simulations with single node jobs using traces of real HPC-cluster usage, to compare them to classical job schedulers. The results show that keeping data in local memory between successive jobs and using data-locality information to schedule jobs improves performance compared to a widely-used scheduler (FCFS, with and without backfilling): a reduction in job waiting time (a 7.5% improvement in stretch), and a decrease in the amount of data transfers (7%).
HAL (Le Centre pour la Communication Scientifique Directe), Sep 9, 2022
The multidimensional scaling (MDS) is an important and robust algorithm for representing individu... more The multidimensional scaling (MDS) is an important and robust algorithm for representing individual cases of a dataset out of their respective dissimilarities. However, heuristics, possibly trading-off with robustness, are often preferred in practice due to the potentially prohibitive memory and computational costs of the MDS. The recent introduction of random projection techniques within the MDS allowed it to be become competitive on larger test cases. The goal of this manuscript is to propose a high-performance distributed-memory MDS based on random projection for processing data sets of even larger size (up to one million items). We propose a task-based design of the whole algorithm and we implement it within an efficient software stack including state-of-the-art numerical solvers, runtime systems and communication layers. The outcome is the ability to efficiently apply robust MDS to large data sets on modern supercomputers. We assess the resulting algorithm and software stack to the point cloud visualization for analyzing distances between sequences in metabarcoding.
HAL (Le Centre pour la Communication Scientifique Directe), Jul 5, 2022
Les systèmes à base de tâches ont gagné en popularité du fait de leur capacité à exploiter pleine... more Les systèmes à base de tâches ont gagné en popularité du fait de leur capacité à exploiter pleinement la puissance de calcul des architectures hétérogènes complexes. Un modèle de programmation courant est le modèle de soumission séquentielle de tâches (Sequential Task Flow, STF) qui malheureusement ne peut manipuler que des graphes de tâches statiques. Ceci conduit potentiellement à un surcoût lors de la soumission, et le graphe de tâches statique n'est pas nécessairement adapté pour s'exécuter sur un système hétérogène. Une solution standard consiste à trouver un compromis entre la granularité permettant d'exploiter la puissance des accélérateurs et celle nécessaire à la bonne performance des CPU. Pour répondre à ces problèmes, nous proposons d'étendre le modèle STF fourni par le support d'exécution STARPU [4] en y ajoutant la possibilité de transformer certaines tâches en sous-graphes durant l'exécution. Nous appelons ces tâches des tâches hiérarchiques. Cette approche permet d'exprimer des graphes de tâches plus dynamiques. En combinant ce nouveau modèle à un gestionnaire automatique des données, il est possible d'adapter dynamiquement la granularité pour fournir une taille optimale aux différentes ressources de calcul ciblées. Nous montrons dans cet article que le modèle des tâches hiérarchiques est valide et nous donnons une première évaluation de ses performances en utilisant la bibliothèque d'algèbre linéaire dense CHAMELEON [1].
Future Generation Computer Systems, Jun 1, 2023
A now-classical way of meeting the increasing demand for computing speed by HPC applications is t... more A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a tasks processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D, 3D matrix products, Cholesky factorization, randomized task order, randomized data pairs from the 2D matrix product as well as a sparse matrix product. We introduce a visual way to understand these performance and lower bounds on the number of data loads for the 2D and 3D matrix products. 
Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
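The optimal eviction strategy the abstract mentions can be illustrated with a Belady-style "furthest next use" rule adapted to tasks: when a datum must be loaded and memory is full, evict the resident datum whose next using task lies furthest in the future, never evicting data needed by the current task. This is only a minimal sketch of that general idea; the function name, the task list, and the unit-size data model are illustrative assumptions, not the paper's actual formulation.

```python
def furthest_next_use_eviction(task_order, memory_size):
    """Simulate a GPU memory holding `memory_size` equal-size data over
    a fixed task order. When room is needed, evict the resident datum
    whose next use lies furthest in the future; data needed by the
    current task are never evicted. Returns the number of loads."""
    # Precompute, for each datum, the indices of the tasks that use it.
    uses = {}
    for i, inputs in enumerate(task_order):
        for d in inputs:
            uses.setdefault(d, []).append(i)

    memory, loads = set(), 0
    for i, inputs in enumerate(task_order):
        for d in inputs:
            if d in memory:
                continue
            if len(memory) >= memory_size:
                # Next use is counted from the current task, so the
                # current task's inputs rank lowest and stay resident.
                def next_use(x):
                    future = [j for j in uses[x] if j >= i]
                    return future[0] if future else float("inf")
                memory.remove(max(memory, key=next_use))
            memory.add(d)
            loads += 1
    return loads

# Four tasks sharing inputs pairwise (illustrative data names).
order = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]
print(furthest_next_use_eviction(order, memory_size=2))  # → 5
```

With a memory of three data, the same order needs only three loads, showing how the task order and the eviction policy together determine the transfer volume.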
Lecture Notes in Computer Science, 2023
Task-based systems have gained popularity because of their promise of exploiting the computational power of complex heterogeneous systems. A common programming model is the so-called Sequential Task Flow (STF) model, which, unfortunately, has the intrinsic limitation of supporting static task graphs only. This leads to potential submission overhead and to a static task graph which is not necessarily adapted for execution on heterogeneous systems. A standard approach is to find a trade-off between the granularity needed by accelerator devices and the one required by CPU cores to achieve performance. To address these problems, we extend the STF model in the StarPU runtime system to enable task subgraphs at runtime. We refer to these tasks as hierarchical tasks. This approach allows for a more dynamic task graph. This extended model, combined with an automatic data manager, makes it possible to dynamically adapt the granularity to meet the optimal size of the targeted computing resource. We show that the hierarchical task model is correct and we provide an early evaluation on shared-memory heterogeneous systems, using the Chameleon dense linear algebra library.
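The core idea of hierarchical tasks can be sketched as a task that submits a finer-grained subgraph while it executes, so granularity is chosen at runtime rather than frozen at submission. This toy sketch is an assumption-laden illustration: the class and task names are invented, StarPU's actual API is a C interface, and real STF dependency inference from data accesses is omitted.

```python
# Toy sketch of the hierarchical-task idea: a task may unfold into a
# subgraph of finer-grained tasks at execution time. All names here
# are illustrative, not StarPU's API.
class ToyRuntime:
    def __init__(self):
        self.queue = []   # tasks in submission order
        self.log = []     # execution trace

    def submit(self, name, func=None):
        self.queue.append((name, func))

    def run(self):
        while self.queue:
            name, func = self.queue.pop(0)
            self.log.append(name)
            if func is not None:
                func(self)  # a hierarchical task submits its subgraph

def hierarchical_gemm(rt):
    # Split into smaller tiles, e.g. to suit CPU cores; a GPU could
    # keep the coarse-grained version instead.
    for tile in ("gemm.0", "gemm.1"):
        rt.submit(tile)

rt = ToyRuntime()
rt.submit("potrf")
rt.submit("gemm", hierarchical_gemm)
rt.run()
print(rt.log)  # → ['potrf', 'gemm', 'gemm.0', 'gemm.1']
```

The point of the sketch is that "gemm.0" and "gemm.1" do not exist in the task graph until "gemm" runs, which is exactly what a purely static STF graph cannot express.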
HAL (Le Centre pour la Communication Scientifique Directe), Feb 22, 2017
The significant increase of hardware complexity that occurred in the last few years has led the high-performance community to design many scientific libraries according to a task-based parallelization. Modeling the performance of the individual tasks (or kernels) they are composed of is crucial for facing challenges as diverse as performing accurate performance predictions, designing robust scheduling algorithms, tuning the applications, etc. Fine-grain modeling such as emulation and cycle-accurate simulation may lead to very accurate results. However, not only may their high cost be prohibitive, but they furthermore require a high-fidelity model of the processor, which makes them hard to deploy in practice. In this paper, we propose an alternative coarse-grain, empirical methodology, oblivious to both the target code and the hardware architecture, which leads to robust and accurate timing predictions. We illustrate our approach with a task-based Fast Multipole Method (FMM) algorithm, whose kernels are highly irregular, implemented in the ScalFMM library on top of the StarPU task-based runtime system and the SimGrid simulator.
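A coarse-grain empirical timing model of this kind boils down to fitting past kernel measurements and predicting from the fit. As a minimal sketch, assuming a simple linear time-versus-size relation and invented measurements (the paper's actual methodology for irregular FMM kernels is more elaborate):

```python
# Least-squares fit of time = a * size + b from past measurements,
# the simplest form of history-based empirical kernel timing model.
# The sample points below are made up for illustration.
def fit_linear(samples):
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * t for s, t in samples)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

# (size, measured time in ms) pairs from hypothetical past runs.
samples = [(100, 1.2), (200, 2.1), (400, 4.2), (800, 8.1)]
a, b = fit_linear(samples)
predict = lambda size: a * size + b
print(round(predict(600), 2))  # → 6.13
```

Once calibrated, such a model can feed a simulator or a scheduler with a predicted duration for a kernel it has never timed at that exact size.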
HAL (Le Centre pour la Communication Scientifique Directe), Apr 1, 2020
Enables the correction of transient errors on address lines of the memory channel. Traditional parity is limited to detecting, and recovering from, single-bit errors. Category: considered "MUST HAVE" in any production HPC system. Memory Lockstep: Memory Lockstep lets two memory channels work as a single channel, moving a data word two channels wide and providing eight bits of memory correction. Memory Lockstep provides protection against both single-bit and multi-bit errors.
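The limitation of plain parity mentioned above can be shown in a few lines: a single even-parity bit flags any single-bit flip but cannot locate it, and an even number of flips passes unnoticed. This is a generic illustration of the principle, not a model of any specific memory controller.

```python
# One even-parity bit per word: detects any single-bit flip,
# but a double-bit flip leaves the parity unchanged and is missed.
def parity(word: int) -> int:
    return bin(word).count("1") % 2

stored = 0b10110100
p = parity(stored)                    # parity recorded at write time

flipped_once = stored ^ 0b00000100    # single-bit error
flipped_twice = stored ^ 0b00010100   # double-bit error

print(parity(flipped_once) != p)   # → True  (error detected)
print(parity(flipped_twice) != p)  # → False (error missed)
```

This is why stronger schemes such as ECC or lockstepped channels, which spread a word plus check bits across more lines, are needed to correct errors rather than merely detect some of them.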
HAL (Le Centre pour la Communication Scientifique Directe), Feb 12, 2020
Concurrency and Computation: Practice and Experience
Task-based systems have gained popularity because of their promise of exploiting the computational power of complex heterogeneous systems. A common programming model is the so-called Sequential Task Flow (STF) model, which, unfortunately, has the intrinsic limitation of supporting static task graphs only. This leads to potential submission overhead and to a static task graph which is not necessarily adapted for execution on heterogeneous systems. A standard approach is to find a trade-off between the granularity needed by accelerator devices and the one required by CPU cores to achieve performance. To address these problems, we extend the STF model in the StarPU runtime system to enable task subgraphs at runtime. We refer to these tasks as hierarchical tasks. This approach allows for a more dynamic task graph. This extended model, combined with an automatic data manager, makes it possible to dynamically adapt the granularity to meet the optimal size of the targeted computing resource. We show that the hierarchical task model is correct and we provide an early evaluation on shared-memory heterogeneous systems, using the Chameleon dense linear algebra library.
Future Generation Computer Systems
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products, Cholesky factorization, randomized task orders, randomized data pairs from the 2D matrix product, as well as a sparse matrix product. We introduce a visual way to understand these performances, and lower bounds on the number of data loads for the 2D and 3D matrix products.
Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
HAL (Le Centre pour la Communication Scientifique Directe), Mar 29, 2022
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is possible to produce a task processing order aiming at reducing the total processing time through three objectives: minimizing data transfers, overlapping transfers and computation, and optimizing the eviction of previously-loaded data. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a single GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. We prove that the underlying problem of this new strategy is NP-complete, and prove the reasonable complexity of our proposed heuristic. These strategies have been implemented in the StarPU runtime system. We present their performance on tasks from tiled 2D and 3D matrix products, Cholesky factorization, randomized task orders, randomized data pairs from the 2D matrix product, as well as a sparse matrix product. We introduce a visual way to understand these performances, and lower bounds on the number of data loads for the 2D and 3D matrix products.
Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
HAL (Le Centre pour la Communication Scientifique Directe), Aug 29, 2021
A now-classical way of meeting the increasing demand for computing speed by HPC applications is the use of GPUs and/or other accelerators. Such accelerators have their own memory, which is usually quite limited, and are connected to the main memory through a bus with bounded bandwidth. Thus, particular care should be devoted to data locality in order to avoid unnecessary data movements. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has the knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is able to order tasks and prefetch their input data in the GPU memory (after possibly evicting some previously-loaded data), while aiming at minimizing data movements, so as to reduce the total processing time. In this paper, we focus on how to schedule tasks that share some of their input data (but are otherwise independent) on a GPU. We provide a formal model of the problem, exhibit an optimal eviction strategy, and show that ordering tasks to minimize data movement is NP-complete. We review and adapt existing ordering strategies to this problem, and propose a new one based on task aggregation. These strategies have been implemented in the StarPU runtime system, which allows testing them on a variety of linear algebra problems. Our experiments demonstrate that using our new strategy together with the optimal eviction policy reduces the amount of data movement as well as the total processing time.
HAL (Le Centre pour la Communication Scientifique Directe), Mar 22, 2021
Anticipating the behavior of applications and studying and designing algorithms are among the most important purposes of the performance and correctness studies of simulations and applications relating to intensive computing. Many frameworks have been designed to simulate large distributed computing infrastructures and the applications running on them. At the node level, some frameworks have also been proposed to simulate task-based parallel applications. However, one critical capability missing from these works is the ability to take Non-Uniform Memory Access (NUMA) effects into account, even though virtually every HPC platform nowadays exhibits such effects. We thus enhance an existing simulator for dependency-based task-parallel applications to enable experimenting with multiple data locality models. We also introduce two locality-aware performance models: we update a lightweight communication-oriented model that uses topology information to weight data transfers, and introduce a more complex communication-and-cache model that takes into account data storage in the LLC. We validate both models on dense linear algebra test cases and show that, on average, our simulator reproducibly predicts execution time with a small relative error.
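The lightweight communication-oriented model described above can be sketched as follows: the cost of a task is its compute time plus transfer terms weighted by the bandwidth between the NUMA node holding each input and the node running the task. The bandwidth matrix, function names, and numbers below are invented for illustration; they stand in for the topology information a real simulator would obtain from the platform description.

```python
# Hedged sketch of a topology-aware transfer model: data movement is
# weighted by the link between the datum's home NUMA node and the
# node executing the task. Bandwidths (GB/s) are illustrative.
BANDWIDTH = [  # indexed [src_node][dst_node]
    [50.0, 12.0],
    [12.0, 50.0],
]

def transfer_time(size_gb, src, dst):
    return size_gb / BANDWIDTH[src][dst]

def task_time(compute_time, inputs, run_node):
    """inputs: list of (size_gb, home_node) pairs."""
    return compute_time + sum(transfer_time(s, home, run_node)
                              for s, home in inputs)

# The same task is cheaper when its data is local to its node:
local = task_time(1.0, [(0.6, 0)], run_node=0)
remote = task_time(1.0, [(0.6, 1)], run_node=0)
print(round(local, 3), round(remote, 3))  # → 1.012 1.05
```

A cache-aware variant in the spirit of the second model would additionally skip the transfer term for data predicted to still reside in the LLC of the executing node.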