M. Tchiboukdjian - Academia.edu
Papers by M. Tchiboukdjian
Proceedings of the international conference on Supercomputing - ICS '11, 2011
This paper discusses the use of software cache partitioning techniques to study and improve the cache behavior of HPC applications. Cache partitioning is traditionally considered a hardware/OS solution to shared-cache issues, particularly fairness of resource utilization between multiple processes. We believe that, in the HPC context of a single application being studied or optimized on the system, with a single thread per core, cache partitioning can be used in new and interesting ways.
Lecture Notes in Computer Science, 2010
Classical list scheduling is a very popular and efficient technique for scheduling jobs on parallel platforms. However, with the increasing number of processors, the cost of managing a single centralized list becomes prohibitive. The objective of this work is to study the extra cost that must be paid when the list is distributed among the processors. We present a general methodology for computing the expected makespan, based on the analysis of a suitable potential function that represents the load imbalance between the local lists. A bound on the deviation from the mean is also derived. We then apply this technique to show that the expected makespan for scheduling W unit independent tasks on m processors is equal to W/m plus an additional term in $3.65 \log_2 W$. Finally, we analyze the work stealing algorithm of Arora, Blumofe and Plaxton and significantly improve the bound on the number of steals. Moreover, simulations show that our bound is very close to the exact value, approximately 50% off. This new analysis also makes it possible to study the influence of the initial distribution of tasks and the reduction in the number of steals when several thieves can simultaneously steal work from the same processor's list.
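The setting in the abstract above (W unit tasks, m processors, one local list per processor, idle processors stealing work) can be illustrated with a small simulation. This is only a toy sketch and not the model analyzed in the paper: the function name `simulate_work_stealing`, the steal-half policy, the uniformly random victim choice, the sequential handling of steals within a step, and the assumption that all tasks start on processor 0 are all choices made for this example.

```python
import random


def simulate_work_stealing(total_tasks, num_procs, seed=0):
    """Toy simulation of unit tasks on distributed lists with random stealing.

    Returns (makespan, steal_attempts). All modeling choices are assumptions
    made for illustration, not the model from the paper.
    """
    rng = random.Random(seed)
    load = [0] * num_procs
    load[0] = total_tasks          # assumption: all work starts on processor 0
    remaining = total_tasks
    makespan = 0
    steal_attempts = 0

    while remaining > 0:
        makespan += 1
        # Processors whose list is empty at the start of the step are idle.
        idle = [p for p in range(num_procs) if load[p] == 0]

        # Every busy processor executes exactly one unit task.
        for p in range(num_procs):
            if load[p] > 0:
                load[p] -= 1
                remaining -= 1

        # Each idle processor tries to steal half of a random victim's list.
        for p in idle:
            if num_procs == 1:
                continue           # nobody to steal from
            steal_attempts += 1
            victim = rng.randrange(num_procs - 1)
            if victim >= p:
                victim += 1        # skip the thief itself
            stolen = load[victim] // 2
            load[victim] -= stolen
            load[p] += stolen

    return makespan, steal_attempts


if __name__ == "__main__":
    W, m = 10_000, 16
    makespan, steals = simulate_work_stealing(W, m, seed=42)
    print(f"W/m = {W / m:.1f}, simulated makespan = {makespan}, "
          f"steal attempts = {steals}")
```

Comparing the simulated makespan with W/m for several seeds gives a rough feel for the extra cost of distributing the list; it does not reproduce the bounds stated in the abstract.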
IEEE Transactions on Visualization and Computer Graphics, 2000
One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take the memory hierarchy into consideration to design cache-efficient algorithms. CO approaches have the advantage of adapting to unknown and varying memory hierarchies. Recent CA and CO algorithms developed for 3D mesh layouts significantly improve on the performance of previous approaches, but they lack theoretical performance guarantees. We present in this paper an $O(N \log N)$ algorithm to compute a CO layout for unstructured but well-shaped meshes. We prove that a coherent traversal of an N-size mesh in dimension d induces fewer than $N/B + O(N/M^{1/d})$ cache misses, where B and M are the block size and the cache size, respectively. Experiments show that our layout computation is faster and consumes significantly less memory than the best known CO algorithm. Performance is comparable to this algorithm for classical visualization algorithm access patterns, or better when the BSP tree produced while computing the layout is used as an acceleration data structure adjusted to the layout. We also show that cache-oblivious approaches lead to significant performance increases on recent GPU architectures.
Classical list scheduling is a very popular and efficient technique for scheduling jobs in parallel and distributed platforms. It is inherently centralized. However, with the increasing number of processors in new parallel platforms, the cost for managing a single centralized ...
This paper revisits isosurface extraction algorithms, taking into consideration two specific aspects of recent multicore architectures: their intrinsic parallelism, associated with the presence of multiple computing cores, and their cache hierarchy, which often includes private caches as well as caches shared between all cores. Taking advantage of these shared caches requires adapting the parallelization scheme so that the cores collaborate on cache usage rather than compete for it, since competition can impair performance. We propose to have cores work on independent but close data sets that can all fit in the shared cache. We propose two shared-cache-aware parallel isosurface algorithms, one based on marching tetrahedra and one using a min-max tree as an acceleration data structure. We theoretically prove that in both cases the number of cache misses is the same as for the sequential algorithm with the same cache size. The algorithms are based on the FastCOL cache-oblivious data layout for irregular meshes. The CO layout also makes it possible to build a very compact min-max tree, which leads to a reduced number of cache misses. Experiments confirm the value of these shared-cache-aware isosurface algorithms, with the performance gain increasing as the ratio of shared cache size to number of cores decreases.
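The abstract above mentions a min-max tree as an acceleration structure but does not describe its layout. As a rough illustration of the general idea only, the sketch below builds a generic implicit binary min-max tree over per-cell scalar ranges and uses it to skip groups of cells that cannot contain the isovalue. The names `build_min_max_tree` and `active_cells`, the array-based layout, and the input format (one `(min, max)` pair per cell, in layout order) are assumptions for this example; this is not the compact FastCOL-based tree from the paper.

```python
def build_min_max_tree(cell_ranges):
    """Build an implicit binary tree storing (min, max) over groups of cells.

    cell_ranges: list of (min_value, max_value) pairs, one per mesh cell.
    Node 1 is the root; node k has children 2k and 2k+1; leaves start at
    index `size`. Padding leaves use an empty range so they never match.
    """
    size = 1
    while size < len(cell_ranges):
        size *= 2
    inf = float("inf")
    mins = [inf] * (2 * size)
    maxs = [-inf] * (2 * size)
    for i, (lo, hi) in enumerate(cell_ranges):
        mins[size + i] = lo
        maxs[size + i] = hi
    # Fill internal nodes bottom-up from their two children.
    for node in range(size - 1, 0, -1):
        mins[node] = min(mins[2 * node], mins[2 * node + 1])
        maxs[node] = max(maxs[2 * node], maxs[2 * node + 1])
    return size, mins, maxs


def active_cells(tree, isovalue):
    """Return indices of cells whose scalar range contains the isovalue."""
    size, mins, maxs = tree
    result = []
    stack = [1]
    while stack:
        node = stack.pop()
        if not (mins[node] <= isovalue <= maxs[node]):
            continue                     # whole subtree can be skipped
        if node >= size:
            result.append(node - size)   # leaf: an actual cell index
        else:
            stack.append(2 * node + 1)   # push right first so that the
            stack.append(2 * node)       # left child is visited first
    return result


if __name__ == "__main__":
    cells = [(0.0, 1.0), (0.5, 2.0), (3.0, 4.0), (1.5, 3.5), (0.2, 0.4)]
    tree = build_min_max_tree(cells)
    print(active_cells(tree, 1.8))
```

Only the cells returned by `active_cells` would then need to be passed to the actual surface extraction step (for example marching tetrahedra), which is not shown here.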
Abstract: One important bottleneck when visualizing large data sets is the data transfer between processor and memory. Cache-aware (CA) and cache-oblivious (CO) algorithms take into consideration the memory hierarchy to design cache efficient algorithms. CO approaches have the ...