Graham Riley - Profile on Academia.edu
Papers by Graham Riley
FPGAs have been around for over 30 years and are a viable accelerator for compute-intensive workloads on HPC systems. The adoption of FPGAs for scientific applications has been stimulated recently by the emergence of better programming environments, such as High-Level Synthesis (HLS) and OpenCL, available through the Xilinx SDSoC design tool. The mapping of the multi-level concurrency available within applications onto HPC systems with FPGAs is a challenge. OpenCL and HLS provide different mechanisms for exploiting concurrency within a node, leading to a concurrency mapping design problem. In addition to considering the performance of different mappings, there are also questions of resource usage, programmability (development effort), ease of use and robustness. This paper examines the concurrency levels available in a case study kernel from a shallow water model and explores the programming options available in OpenCL and HLS. We conclude that the use of SDSoC Dataflow over functions ...
On June 30, 2020, the Workshop on Emerging Technologies for Weather and Climate Modelling was held as a virtual event within the framework of ESiWACE2, the Centre of Excellence in Simulation of Weather and Climate in Europe. The workshop, organized by Giovanni Aloisio (CMCC), Graham Riley (UNIMAN), Carlos Osuna (METEOSWISS) and Sandro Fiore (CMCC), was hosted by DKRZ with local support from Dela Spickermann and Florian Ziemen, under the supervision of the ESiWACE2 Coordinator Joachim Biercamp. The workshop was funded by the Horizon 2020 project ESiWACE2. Due to the COVID-19 situation, the event was held as a virtual conference with approximately 143 participants, mainly from Europe and the US but also from Brazil, India and Israel. The workshop brought together scientists from the fields of earth system modeling, machine learning, exascale hardware/computing, and programming mode...
Energy use is a key concern when migrating current deep learning applications onto low-power heterogeneous devices such as mobile devices. This is because deep neural networks are typically designed and trained on high-end GPUs or servers and require additional processing steps to deploy them on low-power devices. Such steps include the use of compression techniques to scale down the network size or the provision of efficient device-specific software implementations. Migration is further aggravated by the lack of tools and the inability to measure power and performance accurately and consistently across devices. We present a novel evaluation framework for measuring energy and performance for deep neural networks, using ARM's Streamline Performance Analyser integrated with standard deep learning frameworks such as Caffe and cuDNN v5. We apply the framework to study the execution behaviour of SqueezeNet on the Maxwell GPU of the NVIDIA Jetson TX1, on an image classification task (also known as inference), and demonstrate the ability to measure the energy of specific layers of the neural network.
Proceedings of the Computing Frontiers Conference, 2017
The Breadth-First Search (BFS) algorithm is an important building block for graph analysis of large datasets. Parallelisation of BFS has been shown to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, that limit its scalability. We investigate the optimisation and vectorisation of the hybrid BFS (a combination of top-down and bottom-up approaches to BFS) on the Xeon Phi, which has advanced vector processing capabilities. The results show that our new implementation improves performance by 33%, for a one-million-vertex graph, compared to the state-of-the-art.
2019 International Conference on Robotics and Automation (ICRA), 2019
The first two authors contributed equally; the ordering is alphabetical.
Requirements for Automatic Performance Analysis - APART Technical Report
The Time Reversal Mirror (TRM) typically consists of an array of transducers. These transducers are used for recording propagations reflected from a source, which are then processed by and re-emitted from the transducers back to the source location, known as the time-reversed propagation. The TRM technique has been studied in acoustics, water, and electromagnetic waves for various applications; commonly in medical imaging [1], nondestructive testing [2], underwater target detection [3], and seismic source localization [4]. Although TRM is a well-studied method, it is difficult to achieve perfect time-reversed propagations converging to the source with high resolution, due to the limited spatial sampling available [5]. Additionally, in medical applications, the heterogeneous, lossy characteristics of human tissue cause attenuation and dispersion, reducing the potential for obtaining perfect spatial resolution and accuracy at the source [6].
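The focusing principle behind time reversal can be sketched in a toy, delay-only model (an illustration of the general idea, not the method or data of this work): a pulse reaches each transducer after a propagation delay, and re-emitting the time-reversed recordings makes the delayed copies re-align and sum coherently at the source.

```python
def propagate(signal, delay, length):
    """Delay a discrete signal by `delay` samples and pad/trim to `length`."""
    out = [0.0] * delay + list(signal)
    out += [0.0] * (length - len(out))
    return out[:length]

pulse = [0.0, 1.0, 0.0]          # source emits a unit pulse
delays = [2, 5, 9]               # travel times (in samples) to three transducers
T = max(delays) + len(pulse)     # common recording window for all transducers

# Forward step: each transducer records a delayed copy of the pulse.
recordings = [propagate(pulse, d, T) for d in delays]

# Reverse step: time-reverse each recording over the same window and
# re-emit; after travelling back through the same delay, every copy
# arrives at the source at the same instant and sums coherently.
focused = [0.0] * (T + max(delays))
for rec, d in zip(recordings, delays):
    for i, v in enumerate(propagate(list(reversed(rec)), d, len(focused))):
        focused[i] += v

print(max(focused))  # -> 3.0: the three re-emissions align at one sample
```

The limited-aperture and tissue-attenuation effects discussed above are exactly what this idealised model leaves out.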
Energy use is a key concern when deploying deep learning models on mobile and embedded platforms. Current studies develop energy predictive models based on application-level features to provide researchers with a way to estimate the energy consumption of their deep learning models. This information is useful for building resource-aware models that can make efficient use of the hardware resources. However, previous works on predictive modelling provide little insight into the trade-offs involved in the choice of features on the final predictive model accuracy and model complexity. To address this issue, we provide a comprehensive analysis of building regression-based predictive models for deep learning on mobile devices, based on empirical measurements gathered from the SyNERGY framework. Our predictive modelling strategy is based on two types of predictive models used in the literature: individual layers and layer-type. Our analysis of predictive models shows that simple layer-type feature...
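The core of such a regression-based predictive model can be sketched as a least-squares fit of per-layer energy against a per-layer feature. The feature choice (multiply-accumulate count) and the numbers below are illustrative assumptions, not SyNERGY measurements:

```python
# Hypothetical per-layer measurements: (MAC count in millions, energy in mJ).
layers = [
    (1.0, 0.9), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 5.1),
]

# Fit energy ~ a * macs + b by ordinary least squares (closed form).
n = len(layers)
sx = sum(x for x, _ in layers)
sy = sum(y for _, y in layers)
sxx = sum(x * x for x, _ in layers)
sxy = sum(x * y for x, y in layers)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def predict_energy(macs):
    """Predict the energy (mJ) of an unseen layer from its MAC count."""
    return a * macs + b

print(round(predict_energy(6.0), 2))  # -> 6.19
```

A layer-type model in the sense above would fit one such regression per layer type (convolution, fully connected, ...) instead of pooling all layers together.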
Scientific Programming, 2019
In recent years, there has been renewed interest in the use of field-programmable gate arrays (FPGAs) for high-performance computing (HPC). In this paper, we explore the techniques required by traditional HPC programmers in porting HPC applications to FPGAs, using as an example the LFRic weather and climate model. We report on the first steps in porting LFRic to the FPGAs of the EuroExa architecture. We have used Vivado High-Level Synthesis to implement a matrix-vector kernel from the LFRic code on a Xilinx UltraScale+ development board containing an XCZU9EG multiprocessor system-on-chip. We describe the porting of the code, discuss the optimization decisions, and report performance of 5.34 Gflop/s with double precision and 5.58 Gflop/s with single precision. We discuss sources of inefficiencies, comparisons with peak performance, comparisons with CPU and GPU performance (taking into account power and price), comparisons with published techniques, and comparisons with published p...
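The Gflop/s figures quoted for a matrix-vector kernel follow from counting one multiply and one add per matrix element. A minimal sketch of that accounting (a plain dense matvec timed in Python, not the LFRic kernel or its HLS implementation):

```python
import time

def matvec(A, x):
    """Dense matrix-vector product: one multiply and one add per element."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

n = 256
A = [[1.0] * n for _ in range(n)]
x = [1.0] * n

t0 = time.perf_counter()
y = matvec(A, x)
elapsed = time.perf_counter() - t0

flops = 2.0 * n * n  # n multiplies + (n - 1 ~ n) adds per output element
print(f"{flops / elapsed / 1e9:.3f} Gflop/s")
```

Dividing the same flop count by the kernel's measured runtime on the FPGA is how figures such as 5.34 Gflop/s are obtained.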
Journal of Parallel and Distributed Computing, 2019
Energy consumption has been widely studied in the computer architecture field for decades. While the adoption of energy as a metric in machine learning is emerging, the majority of research is still primarily focused on obtaining high levels of accuracy without any computational constraint. We believe that one reason for this lack of interest is unfamiliarity with approaches to evaluating energy consumption. To address this challenge, we present a review of the different approaches to estimating energy consumption in general and in machine learning applications in particular. Our goal is to provide useful guidelines to the machine learning community, giving them the fundamental knowledge to use and build specific energy estimation methods for machine learning algorithms. We also present the latest software tools that give energy estimation values, together with two use cases that enhance the study of energy consumption in machine learning.
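Whatever the measurement approach, the common final step is that energy is the integral of power over time; given discrete power samples (e.g. from an on-board sensor), the trapezoidal rule gives the estimate. A minimal sketch, with an assumed constant-power trace:

```python
def energy_joules(times_s, power_w):
    """Trapezoidal integration of power samples -> energy in joules."""
    e = 0.0
    for (t0, p0), (t1, p1) in zip(zip(times_s, power_w),
                                  zip(times_s[1:], power_w[1:])):
        e += 0.5 * (p0 + p1) * (t1 - t0)
    return e

# Hypothetical trace: a steady 2 W for one second, sampled every 0.25 s.
times = [0.0, 0.25, 0.5, 0.75, 1.0]
power = [2.0, 2.0, 2.0, 2.0, 2.0]
print(energy_joules(times, power))  # -> 2.0 (2 W x 1 s)
```

Sampling rate matters in practice: a workload shorter than the sensor's sampling interval is invisible to this estimate.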
GungHo Phase 1: Computational Science Recommendations
Geoscientific Model Development, 2012
This paper presents a review of the software currently used in climate modelling in general and in CMIP5 in particular to couple the numerical codes representing the different components of the Earth system. The coupling technologies presented show common features, such as the ability to communicate and regrid data, but also offer different functions and implementations. Design characteristics of the different approaches are discussed as well as future challenges arising from the increasing complexity of scientific problems and computing platforms.
Geoscientific Model Development Discussions, 2017
We present an approach, which we call PSyKAl, that is designed to achieve portable performance for parallel, finite-difference ocean models. In PSyKAl, the code related to the underlying science is formally separated from code related to parallelisation and single-core optimisations. This separation of concerns allows scientists to code their science independently of the underlying hardware architecture, and allows optimisation specialists to tailor the code for a particular machine independently of the science code. We have taken the free-surface part of the NEMO ocean model and created a new, shallow-water model named NEMOLite2D. In doing this we have a code which is of a manageable size and yet which incorporates elements of full ocean models (input/output, boundary conditions, etc.). We have then manually constructed a PSyKAl version of this code and investigated the transformations that must be applied to the middle/PSy layer in order to achieve good perf...
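The separation of concerns can be sketched in miniature (all names and the "physics" below are illustrative, not NEMOLite2D code): the science kernel is pointwise and knows nothing about grid traversal, while the PSy layer owns the loops, which is where parallelisation and optimisation transformations are applied.

```python
def update_kernel(h, u, i):
    """Science layer: a pointwise update on one grid cell (made-up stencil)."""
    return h[i] - 0.5 * (u[i + 1] - u[i])

def psy_layer(h, u):
    """PSy layer: the loop over the grid -- the target for transformations
    such as OpenMP parallelisation, loop fusion, or blocking."""
    return [update_kernel(h, u, i) for i in range(len(h))]

# Algorithm layer: the scientist composes kernels with no loop details.
h = [1.0, 1.0, 1.0]
u = [0.0, 0.2, 0.4, 0.6]
print(psy_layer(h, u))
```

Because only `psy_layer` touches the iteration space, it can be rewritten for a new machine without any change to `update_kernel`.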
IEEE Transactions on Parallel and Distributed Systems
Modern applications generate massive amounts of data that is challenging to process or analyse. Graph algorithms have emerged as a solution for the analysis of such data because they can represent the entities participating in the generation of large-scale datasets in terms of vertices and their relationships in terms of edges. Graph analysis algorithms are used for finding patterns within these relationships, aiming to extract information to be further analysed. The breadth-first search (BFS) is one of the main graph search algorithms used for graph analysis, and its optimisation has been widely researched using different parallel computers. However, the parallelisation of BFS has been shown to be challenging because of its inherent characteristics, including irregular memory access patterns, data dependencies and workload imbalance, that limit its scalability. This paper investigates the optimisation of the BFS on the Xeon Phi (Knights Corner), a modern parallel architecture equipped with an advanced vector processor supporting the AVX-512 instruction set, using a bespoke development framework integrated with the Graph 500 benchmark. In addition, to demonstrate portability, we show results for a direct port of the algorithms to a more recent version of the Xeon Phi (Knights Landing) and to a Skylake CPU which supports most of the AVX-512 instruction set. Optimised parallel versions of two high-level algorithms for BFS were created using vectorisation, starting with the conventional top-down BFS algorithm and, building on this, a hybrid BFS algorithm. On the KNC our best implementations result in speedups of 1.37x (top-down) and 1.37x (hybrid), for a one-million-vertex graph, compared to the state-of-the-art. On the KNL and Skylake, the performance is higher than on KNC.
In addition, we show results of our best hybrid algorithm on real-world graphs from the SNAP datasets with speedups up to 1.3x on KNC. Performance on KNL and Skylake is again higher, demonstrating the robustness and portability of our algorithm. The hybrid BFS algorithm can be further used to speed up other graph analysis algorithms and the lessons learned from vectorisation can be applied to other algorithms targeting existing and future models of the Xeon Phi and other advanced vector architectures.
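The hybrid idea can be sketched in a level-synchronous, sequential form (an illustration of the direction switch only; the paper's contribution is the vectorised, parallel implementation): expand top-down while the frontier is small, and switch to bottom-up when it grows large, so that unvisited vertices search for a parent in the frontier instead. The switching threshold `alpha` is a made-up heuristic here.

```python
def hybrid_bfs(adj, source, alpha=0.25):
    """BFS over an undirected adjacency list, switching between
    top-down and bottom-up expansion based on frontier size."""
    n = len(adj)
    parent = [-1] * n
    parent[source] = source
    frontier = {source}
    while frontier:
        nxt = set()
        if len(frontier) <= alpha * n:
            # Top-down: scan the frontier's edges outward.
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        nxt.add(v)
        else:
            # Bottom-up: each unvisited vertex looks for a frontier parent,
            # stopping at the first hit (this is what cuts edge traversals).
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            nxt.add(v)
                            break
        frontier = nxt
    return parent

adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(hybrid_bfs(adj, 0))  # -> [0, 0, 0, 1, 3]
```

Bottom-up pays off on the large middle levels of low-diameter graphs, where most top-down edge checks would land on already-visited vertices.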
Proceedings of the ACM International Conference on Computing Frontiers - CF '16, 2016
Breadth First Search (BFS) is a building block for graph algorithms and has recently been used for large-scale analysis of information in a variety of applications, including social networks, graph databases and web searching. Due to its importance, a number of different parallel programming models and architectures have been exploited to optimize the BFS. However, due to the irregular memory access patterns and the unstructured nature of the large graphs, its efficient parallelization is a challenge. The Xeon Phi is a massively parallel architecture available as an off-the-shelf accelerator, which includes a powerful 512-bit vector unit with optimized scatter and gather functions. Given its potential benefits, work related to graph traversing on this architecture is an active area of research. We present a set of experiments in which we explore architectural features of the Xeon Phi and how best to exploit them in a top-down BFS algorithm; the techniques can also be applied to the current state-of-the-art hybrid, top-down plus bottom-up, algorithms. We focus on the exploitation of the vector unit by developing an improved, highly vectorized OpenMP parallel algorithm, using vector intrinsics, and understanding the use of data alignment and prefetching. In addition, we investigate the impact of hyperthreading and thread affinity on performance, a topic that appears under-researched in the literature. As a result, we achieve what we believe is the fastest published top-down BFS algorithm on the version of Xeon Phi used in our experiments. The vectorized top-down BFS source code presented in this paper is available on request as free-to-use software.
Parallel implementation of a multilevel modelling package
Computational Statistics &amp; Data Analysis, Oct 28, 1999
A portable parallel implementation of MLn, a multilevel modelling package, for shared memory parallel machines is described. Particular attention is paid to cross-classified and multiple membership models, which are more computationally demanding than those with simple hierarchical structure. Performance results are presented for a range of shared-memory parallel architectures, demonstrating a significant increase in the size of models which can
Special Issue: Grid Performance; Licklider and the Grid
Concurrency and Computation: Practice and Experience (Eds. J. Gurd, T. Hey, J. Papay and G. Riley), 17(2-4), 95-98, 2005
To develop a good parallel implementation requires understanding of where run-time is spent and comparing this to some realistic best possible time.