Graham Riley | The University of Manchester

Papers by Graham Riley

Research paper thumbnail of Concurrency Mapping to FPGAs with OpenCL: A Case Study with a Shallow Water Kernel

FPGAs have been around for over 30 years and are a viable accelerator for compute-intensive workloads on HPC systems. The adoption of FPGAs for scientific applications has been stimulated recently by the emergence of better programming environments, such as High-Level Synthesis (HLS) and OpenCL, available through the Xilinx SDSoC design tool. Mapping the multi-level concurrency available within applications onto HPC systems with FPGAs is a challenge. OpenCL and HLS provide different mechanisms for exploiting concurrency within a node, leading to a concurrency-mapping design problem. In addition to the performance of different mappings, there are also questions of resource usage, programmability (development effort), ease of use and robustness. This paper examines the concurrency levels available in a case-study kernel from a shallow water model and explores the programming options available in OpenCL and HLS. We conclude that the use of SDSoC Dataflow over functions ...
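
As a concrete illustration of the kind of structure "dataflow over functions" implies, the sketch below shows a generic Vivado/Vitis-HLS-style C++ kernel in which read, compute and write stages communicate through streams under a DATAFLOW pragma. The stage bodies and names are hypothetical stand-ins, not the shallow-water kernel studied in the paper.

```cpp
// Hypothetical illustration of "dataflow over functions" in an HLS C++ kernel:
// three functions connected by streams execute concurrently once the DATAFLOW
// pragma is applied. The compute stage is a placeholder, not the real stencil.
#include "hls_stream.h"

static void read_in(const float *in, hls::stream<float> &s, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        s.write(in[i]);
    }
}

static void compute(hls::stream<float> &s_in, hls::stream<float> &s_out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        s_out.write(2.0f * s_in.read());   // stand-in for the real update
    }
}

static void write_out(hls::stream<float> &s, float *out, int n) {
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = s.read();
    }
}

void top(const float *in, float *out, int n) {
#pragma HLS DATAFLOW
    hls::stream<float> s1("s1"), s2("s2");
    read_in(in, s1, n);
    compute(s1, s2, n);
    write_out(s2, out, n);
}
```

With the DATAFLOW pragma, the three functions run concurrently as a pipeline of tasks rather than sequentially, which is the mechanism the abstract contrasts with other OpenCL/HLS concurrency options.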

Research paper thumbnail of First white paper on community guidelines on the use, value and applicability of emerging technologies in climate and weather applications

On June 30, 2020, the Workshop on Emerging Technologies for Weather and Climate Modelling was held as a virtual event within the framework of ESiWACE2, the Centre of Excellence in Simulation of Weather and Climate in Europe. The workshop, organized by Giovanni Aloisio (CMCC), Graham Riley (UNIMAN), Carlos Osuna (METEOSWISS) and Sandro Fiore (CMCC), was hosted by DKRZ with local support from Dela Spickermann and Florian Ziemen, under the supervision of the ESiWACE2 Coordinator Joachim Biercamp. The workshop was funded by the Horizon 2020 project ESiWACE2. Due to the situation with COVID-19, the event was held as a virtual conference with approximately 143 participants, mainly from Europe and the US, but also from Brazil, India and Israel. The workshop brought together scientists from the fields of earth system modeling, machine learning, exascale hardware/computing, and programming mode...

Research paper thumbnail of Fine-Grained Energy and Performance Profiling framework for Deep Convolutional Neural Networks

Research paper thumbnail of Vectorization of Hybrid Breadth First Search on the Intel Xeon Phi

Proceedings of the Computing Frontiers Conference, 2017

Research paper thumbnail of SLAMBench 3.0: Systematic Automated Reproducible Evaluation of SLAM Systems for Robot Vision Challenges and Scene Understanding

2019 International Conference on Robotics and Automation (ICRA), 2019

Research paper thumbnail of Requirements for Automatic Performance Analysis - APART Technical Report

Research paper thumbnail of Feasibility of Instantaneous Time Mirror in Electromagnetics

The Time Reversal Mirror (TRM) typically consists of an array of transducers. These transducers record propagations reflected from a source; the recordings are then processed and re-emitted from the transducers back towards the source location, a process known as time-reversed propagation. The TRM technique has been studied in acoustics, water, and electromagnetic waves for various applications, commonly in medical imaging [1], nondestructive testing [2], underwater target detection [3], and seismic source localization [4]. Although TRM is a well-studied method, it is difficult to achieve perfect time-reversed propagations converging to the source with high resolution due to the limited spatial sampling available [5]. Additionally, in medical applications, the heterogeneous, lossy character of human tissue causes attenuation and dispersion, reducing the potential of obtaining perfect spatial resolution and accuracy at the source [6].
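
A minimal sketch of the core time-reversal step described above (an illustration, not the paper's implementation): each transducer's recorded waveform is reversed in time before being re-emitted.

```cpp
// Time-reversal step at the heart of a TRM: reverse each channel's recorded
// waveform so that, on re-emission, the field back-propagates toward the source.
#include <algorithm>
#include <vector>

// One recording per transducer, each a vector of samples over time.
using Recordings = std::vector<std::vector<double>>;

Recordings time_reverse(Recordings recs) {
    for (auto &channel : recs) {
        std::reverse(channel.begin(), channel.end()); // t -> T - t
    }
    return recs; // reversed signals drive the transducers on re-emission
}
```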

Research paper thumbnail of Energy Predictive Models for Convolutional Neural Networks on Mobile Platforms

Energy use is a key concern when deploying deep learning models on mobile and embedded platforms. Current studies develop energy predictive models based on application-level features to give researchers a way to estimate the energy consumption of their deep learning models. This information is useful for building resource-aware models that make efficient use of the hardware resources. However, previous work on predictive modelling provides little insight into how the choice of features affects the final predictive model's accuracy and complexity. To address this issue, we provide a comprehensive analysis of building regression-based predictive models for deep learning on mobile devices, based on empirical measurements gathered from the SyNERGY framework. Our predictive modelling strategy is based on two types of predictive models used in the literature: individual layers and layer-type. Our analysis of predictive models shows that simple layer-type feature...
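
To make the layer-type idea concrete, here is a deliberately simplified sketch (not the SyNERGY model itself) in which each layer's energy is predicted as a fitted per-type coefficient times a single feature, such as its multiply-accumulate count, and the network total is the sum over layers. The coefficient values are placeholders.

```cpp
// Illustrative layer-type linear energy predictor (coefficients are made up).
#include <map>
#include <string>
#include <vector>

struct Layer {
    std::string type;   // e.g. "conv", "fc", "pool"
    double macs;        // feature: number of multiply-accumulates
};

// Energy per MAC for each layer type, as would be fitted by regression
// against measured energy; the numbers here are placeholders.
const std::map<std::string, double> kJoulesPerMac = {
    {"conv", 2.0e-9}, {"fc", 1.5e-9}, {"pool", 0.5e-9}};

double predict_energy_joules(const std::vector<Layer> &net) {
    double total = 0.0;
    for (const auto &layer : net) {
        total += kJoulesPerMac.at(layer.type) * layer.macs;
    }
    return total;
}
```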

Research paper thumbnail of First Steps in Porting the LFRic Weather and Climate Model to the FPGAs of the EuroExa Architecture

Scientific Programming, 2019

In recent years, there has been renewed interest in the use of field-programmable gate arrays (FPGAs) for high-performance computing (HPC). In this paper, we explore the techniques required by traditional HPC programmers in porting HPC applications to FPGAs, using as an example the LFRic weather and climate model. We report on the first steps in porting LFRic to the FPGAs of the EuroExa architecture. We have used Vivado High-Level Synthesis to implement a matrix-vector kernel from the LFRic code on a Xilinx UltraScale+ development board containing an XCZU9EG multiprocessor system-on-chip. We describe the porting of the code, discuss the optimization decisions, and report performance of 5.34 Gflop/s in double precision and 5.58 Gflop/s in single precision. We discuss sources of inefficiency, comparisons with peak performance, comparisons with CPU and GPU performance (taking into account power and price), comparisons with published techniques, and comparisons with published p...
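
For orientation, a minimal HLS-style matrix-vector kernel of the general shape discussed here is sketched below; the sizes, names and pragmas are illustrative assumptions, and the actual LFRic kernel applies further optimizations not shown.

```cpp
// Minimal double-precision matrix-vector kernel with HLS-style pragmas.
// Dimensions are illustrative placeholders.
#define NROWS 8
#define NCOLS 6

void matvec(const double A[NROWS][NCOLS], const double x[NCOLS], double y[NROWS]) {
    for (int i = 0; i < NROWS; ++i) {
#pragma HLS PIPELINE II=1
        double acc = 0.0;
        for (int j = 0; j < NCOLS; ++j) {
#pragma HLS UNROLL
            acc += A[i][j] * x[j];   // dot product of row i with x
        }
        y[i] = acc;
    }
}
```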

Research paper thumbnail of Estimation of energy consumption in machine learning

Journal of Parallel and Distributed Computing, 2019

Research paper thumbnail of GungHo Phase 1: Computational Science Recommendations

Research paper thumbnail of Coupling technologies for Earth System Modelling

Geoscientific Model Development, 2012

Research paper thumbnail of Portable Multi- and Many-Core Performance for Finite Difference Codes; Application to the Free-Surface Component of NEMO

Geoscientific Model Development Discussions, 2017

We present an approach, which we call PSyKAl, that is designed to achieve portable performance for parallel, finite-difference ocean models. In PSyKAl the code related to the underlying science is formally separated from code related to parallelisation and single-core optimisations. This separation of concerns allows scientists to code their science independently of the underlying hardware architecture, and allows optimisation specialists to tailor the code for a particular machine independently of the science code. We have taken the free-surface part of the NEMO ocean model and created a new, shallow-water model named NEMOLite2D. In doing this we have a code which is of a manageable size and yet incorporates elements of full ocean models (input/output, boundary conditions, etc.). We have then manually constructed a PSyKAl version of this code and investigated the transformations that must be applied to the middle/PSy layer in order to achieve good perf...
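
The sketch below illustrates the separation of concerns in a schematic, language-shifted form (C++ rather than the Fortran of NEMOLite2D, with an invented kernel body): the kernel layer holds pointwise science, the PSy layer owns the loops where parallelisation and optimisation transformations are applied, and the algorithm layer calls the PSy layer without knowing about either.

```cpp
// Schematic PSyKAl-style layering (not the NEMOLite2D code).
#include <vector>

struct Field {
    int nx, ny;
    std::vector<double> data;
    double &at(int i, int j) { return data[j * nx + i]; }
    double at(int i, int j) const { return data[j * nx + i]; }
};

// Kernel layer: pointwise science, no loops, no knowledge of parallelism.
inline void continuity_kernel(Field &ssha, const Field &sshn, int i, int j) {
    ssha.at(i, j) = sshn.at(i, j);   // stand-in for the real free-surface update
}

// PSy layer: owns iteration; this is where OpenMP, blocking or other
// machine-specific transformations are applied.
void invoke_continuity(Field &ssha, const Field &sshn) {
    #pragma omp parallel for
    for (int j = 1; j < ssha.ny - 1; ++j)
        for (int i = 1; i < ssha.nx - 1; ++i)
            continuity_kernel(ssha, sshn, i, j);
}

// Algorithm layer: the science driver, oblivious to how the loops run.
void timestep(Field &ssha, const Field &sshn) { invoke_continuity(ssha, sshn); }
```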

Research paper thumbnail of Exploiting Parallelism and Vectorisation in Breadth-First Search for the Intel Xeon Phi

IEEE Transactions on Parallel and Distributed Systems

Research paper thumbnail of Breadth first search vectorization on the Intel Xeon Phi

Proceedings of the ACM International Conference on Computing Frontiers - CF '16, 2016

Research paper thumbnail of Parallel implementation of a multilevel modelling package

Computational Statistics & Data Analysis, Oct 28, 1999

A portable parallel implementation of MLn, a multilevel modelling package, for shared-memory parallel machines is described. Particular attention is paid to cross-classified and multiple-membership models, which are more computationally demanding than those with a simple hierarchical structure. Performance results are presented for a range of shared-memory parallel architectures, demonstrating a significant increase in the size of models which can ...

Research paper thumbnail of Techniques For Improving The Performance Of Parallel Computations

Research paper thumbnail of Knowledge Specification for Automatic Performance Analysis

Research paper thumbnail of Special Issue: Grid Performance; Licklider and the Grid

Concurrency and Computation: Practice and Experience (Eds. J. Gurd, T. Hey, J. Papay and G. Riley), 17(2-4), 95-98, 2005

Research paper thumbnail of Automatic Overheads Profiler for OpenMP Codes

Developing a good parallel implementation requires understanding where run-time is spent and comparing this to some realistic best possible time.
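
As a toy illustration of that comparison (not the profiler described in the paper), the snippet below derives the total parallel overhead on p threads as the time spent beyond the ideal T_serial / p.

```cpp
// Toy overhead analysis: compare measured parallel time against the ideal
// speedup and report the aggregate overhead across threads.
#include <cstdio>

struct OverheadReport {
    double ideal_time;      // T_serial / p
    double total_overhead;  // p * T_parallel - T_serial, summed over threads
};

OverheadReport analyse(double t_serial, double t_parallel, int threads) {
    OverheadReport r;
    r.ideal_time = t_serial / threads;
    r.total_overhead = threads * t_parallel - t_serial;
    return r;
}

int main() {
    OverheadReport r = analyse(/*t_serial=*/10.0, /*t_parallel=*/3.0, /*threads=*/4);
    std::printf("ideal %.2fs, total overhead %.2fs\n", r.ideal_time, r.total_overhead);
    return 0;
}
```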
