Clemens Grelck | Friedrich-Schiller-Universität Jena
Papers by Clemens Grelck
Lecture Notes in Computer Science, 2020
Lecture Notes in Computer Science, 2001
Sac is a functional array processing language particularly designed with numerical applications in mind. In this field the runtime performance of programs critically depends on the efficient utilization of the memory hierarchy. Cache conflicts due to limited set associativity are one relevant source of inefficiency. This paper describes the realization of an optimization technique which aims at eliminating cache conflicts by adjusting the data layout of arrays to specific access patterns and cache configurations. Its effect on cache utilization and runtime performance is demonstrated by investigations on the PDE1 benchmark.
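The layout adjustment described in this abstract can be illustrated with a small model. The following is a minimal sketch, not the paper's algorithm: it assumes a hypothetical cache with 64-byte lines and 64 sets, 8-byte elements, and a row-major array, and it counts how many distinct cache sets a walk down one column touches. The function name and all parameters are illustrative assumptions.

```python
def cache_sets_touched(row_elems, rows, elem_size=8, line_size=64, num_sets=64):
    """Count distinct cache sets hit when walking down column 0
    of a row-major array with `row_elems` elements per row.

    All cache parameters are hypothetical defaults, not taken from the paper.
    """
    sets = set()
    for r in range(rows):
        addr = r * row_elems * elem_size           # byte offset of element [r][0]
        sets.add((addr // line_size) % num_sets)   # set index of that cache line
    return len(sets)


# Row length of 512 elements gives a 4096-byte stride, a multiple of
# line_size * num_sets, so every row start maps to the same set.
unpadded = cache_sets_touched(row_elems=512, rows=64)

# Padding each row by 8 elements (one cache line) shifts successive
# rows into successive sets.
padded = cache_sets_touched(row_elems=512 + 8, rows=64)

print(unpadded, padded)
```

Under these assumed parameters the unpadded column walk competes for a single set, while the padded layout spreads the same accesses over all 64 sets, which is the kind of conflict a padding-based layout change is meant to remove.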
CEUR workshop proceedings, 2015
Data parallel frameworks (e.g. Hive, Spark or Tez) can be used to execute complex data analyses consisting of many dependent tasks represented by a Directed Acyclic Graph (DAG). Minimising the job completion time (i.e. makespan) is still an open problem for large graphs. We propose a novel deep Q-learning (DQN) approach to statically scheduling DAGs and minimising the makespan. Our approach learns to schedule DAGs from scratch instead of learning how to imitate some heuristic. We show that our current approach learns fast and steadily. Furthermore, our approach can schedule DAGs almost 15 times faster than a Forward List Scheduling (FLS) heuristic.
Functional High-Performance Computing, Sep 23, 2013
It is our great pleasure to welcome you to the 2nd ACM SIGPLAN Workshop on Functional High-Performance Computing. FHPC 2013 brings together researchers who explore declarative high-level programming technology in application domains where large-scale computations arise naturally and high performance is essential. The workshop is in its second year. Our goal is to establish FHPC as a regular annual forum for researchers interested in applying functional programming techniques in the area of high-performance computing. Functional programming is increasingly recognized as presenting a nice sweet spot between expressiveness and efficiency for parallel programming, reconciling execution performance with programming productivity. Making FHPC'13 happen depended on a number of people and organizations, which we would like to acknowledge here. We thank the authors and panelists for providing the content of the program. We would like to express our gratitude to the program committee and the additional reviewers, who worked very hard in reviewing papers and providing suggestions for their improvement. Special thanks go to ACM SIGPLAN and the ICFP workshop chairs for accepting our workshop nomination and being flexible with organizational matters. The call for papers attracted 14 submissions from Asia, the Americas, and Europe. An international program committee selected 8 contributions for publication. These papers cover a variety of topics. Some touch upon optimizing compilation techniques and programming techniques for GPU applications. Others propose novel parallel programming models, libraries, and bespoke runtime management, which take advantage of declarative constructs for better performance and productivity. In addition to the refereed contributions, FHPC'13 features two invited talks. Matthew Fluet from Rochester Institute of Technology will provide an overview of the Manticore project, with focus on programming models and runtime techniques. Manuel Chakravarty from the University of New South Wales will present different strands of work in data-parallel computing, discussing results and issues in Data-Parallel Haskell and Accelerate. The topic of data-parallelism and GPU computing will be further deepened in a panel discussion. We hope to have put together an interesting program, and we look forward to stimulating discussions during the second FHPC workshop and to a successful follow-up FHPC workshop at ICFP 2014.
SAC (Single Assignment C) is a purely functional, data-parallel array programming language that predominantly targets compute-intensive applications. Thus, clusters of workstations, or distributed memory architectures in general, form highly relevant compilation targets. Notwithstanding, SAC as of today only supports shared-memory architectures, graphics accelerators and heterogeneous combinations thereof. In our current work we aim at closing this gap. At the same time, we are determined to uphold SAC's promise of entirely compiler-directed exploitation of concurrency, no matter what the target architecture is. Distributed memory architectures are going to make this promise a particular challenge. Despite SAC's functional semantics, it is generally far from straightforward to infer exact communication patterns from architecture-agnostic code. Therefore, we intend to capitalise on recent advances in network technology, namely the closing of the gap between memory bandwidth and network bandwidth. We aim at a solution based on a custom-designed software distributed shared memory (S-DSM) and large per-node software-managed cache memories. To this effect the functional nature of SAC with its write-once/read-only arrays provides a strategic advantage that we thoroughly exploit. Throughout the paper we further motivate our approach, sketch out our implementation strategy, show preliminary results and discuss the pros and cons of our approach.
We discuss the aspect of synchronisation in the language design of the asynchronous data flow language S-Net. Synchronisation is a crucial aspect of any coordination approach. S-Net provides a particularly simple construct, the synchrocell. The synchrocell is actually too simple to meet regular synchronisation demands itself. We show that in conjunction with other language features, S-Net synchrocells can effectively do the job. Moreover, we argue that their simplistic design in fact is a necessary prerequisite to implement even more interesting scenarios, for which we outline ways of efficient implementation.
Applied Informatics, 2003
Applied computing review, Mar 1, 2023
Parallel and Distributed Processing Techniques and Applications, 2000
Parallel Computing, 2014
Booleans are the most basic values in computing. Machines, however, store Booleans in larger compounds such as bytes or integers due to limitations in addressing memory locations. For individual values the relative waste of memory capacity is huge, but the absolute waste is negligible. The latter radically changes if large numbers of Boolean values are processed in (multidimensional) arrays. Most programming languages, however, only provide sparse implementations of Boolean arrays, thus wasting large quantities of memory and potentially making poor use of cache hierarchies. In the context of the functional data-parallel array programming language SAC we investigate dense implementations of Boolean arrays and compare their performance with traditional sparse implementations. A particular challenge arises in data-parallel execution on today's shared memory multi-core architectures: scheduling of loops over Boolean arrays is unaware of the non-standard addressing of dense Boolean arrays. We discuss our proposed solution and report on experiments analysing the impact of the runtime representation of Boolean arrays both on sequential performance as well as on scalability using up to 32 cores of a large ccNUMA multi-core system.
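To make the distinction in this abstract concrete: a sparse representation spends a whole byte or word per Boolean, whereas a dense one packs eight Booleans into each byte. The sketch below is an illustration only. `DenseBoolArray` and `word_aligned_chunks` are hypothetical names, and splitting loop ranges on byte boundaries is merely one plausible way to keep parallel workers from writing to the same byte; the abstract does not state what the paper's actual solution is.

```python
class DenseBoolArray:
    """Bit-packed Boolean array: 8 values per byte (hypothetical sketch)."""

    def __init__(self, length):
        self.length = length
        self.bits = bytearray((length + 7) // 8)  # round up to whole bytes

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Byte index is i // 8, bit position within the byte is i % 8.
        return bool((self.bits[i >> 3] >> (i & 7)) & 1)

    def __setitem__(self, i, value):
        if not 0 <= i < self.length:
            raise IndexError(i)
        mask = 1 << (i & 7)
        if value:
            self.bits[i >> 3] |= mask
        else:
            self.bits[i >> 3] &= ~mask & 0xFF


def word_aligned_chunks(length, workers, word_bits=8):
    """Split [0, length) into contiguous index ranges whose boundaries
    fall on storage-word boundaries, so no two ranges share a byte.
    This is an assumed scheduling policy, not taken from the paper."""
    words = (length + word_bits - 1) // word_bits
    per_worker = (words + workers - 1) // workers
    chunks = []
    for w in range(workers):
        start = min(w * per_worker * word_bits, length)
        stop = min((w + 1) * per_worker * word_bits, length)
        if start < stop:
            chunks.append((start, stop))
    return chunks


flags = DenseBoolArray(20)
flags[9] = True
print(len(flags.bits), flags[9], word_aligned_chunks(20, 3))
```

A naive split of 20 elements across 3 workers (for example at indices 7 and 14) would place two workers in the same byte, so concurrent read-modify-write updates could be lost; aligning the boundaries to multiples of 8 avoids that in this model.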
Parallel Computing, 2012
The Sparc T3-4 server provides up to 512 concurrent hardware threads, a degree of concurrency that is unprecedented in a single server system. This paper reports on how the automatically parallelising compiler of the data-parallel functional array language SAC copes with up to 512 execution units. We investigate three different numerical kernels that are representative of a wide range of applications: matrix multiplication, convolution and 3-dimensional FFT. We show both the high-level declarative coding style of SAC and the performance achieved on the T3-4 server. Last but not least, we draw conclusions for improving our compiler technology in the future.
International Journal of Parallel Programming, Sep 13, 2013