Dataflow Acceleration of scikit-learn Gaussian Process Regression

FEREBUS: a high-performance modern Gaussian process regression engine

Digital Discovery, 2023

FEREBUS is a highly optimised Gaussian process regression (GPR) engine that provides both model and optimiser flexibility, producing tailored models designed for domain-specific applications. FEREBUS gives the user the tools needed to decide on the trade-off between time and accuracy in order to produce adequately accurate machine-learnt models. FEREBUS has been designed from the ground up for deep integration into the file management pipeline (ICHOR) of the multipolar, machine-learnt, polarisable force field FFLUX. As such it can produce accurate atomistic models for molecular dynamics simulations as efficiently as possible. FEREBUS uses both OpenMP and OpenACC for parallel execution of optimisation routines and for offloading computation to GPU accelerator devices, reaching a parallel efficiency of 99%. Written in Fortran 90, FEREBUS embodies a modern approach to a high-performance GPR engine, providing both flexibility and performance in a single package.
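For orientation, the core quantity such an engine's optimisers evaluate repeatedly is the GP negative log marginal likelihood. Below is a minimal NumPy sketch of that computation via a Cholesky factorisation, assuming a squared-exponential kernel; the function names are illustrative and this is not FEREBUS's actual code, merely the serial form of the work it parallelises with OpenMP/OpenACC.

```python
import numpy as np
from scipy.linalg import cholesky, cho_solve

def rbf_kernel(X, lengthscale, variance):
    """Squared-exponential kernel matrix (a common default in GPR engines)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def neg_log_marginal_likelihood(X, y, lengthscale, variance, noise):
    """0.5*y^T K^-1 y + 0.5*log|K| + (n/2)*log(2*pi), via K = L L^T."""
    n = len(y)
    K = rbf_kernel(X, lengthscale, variance) + noise * np.eye(n)
    L = cholesky(K, lower=True)
    alpha = cho_solve((L, True), y)
    # log|K| = 2 * sum(log diag(L)), so 0.5*log|K| = sum(log diag(L))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)
```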

Fast Gaussian Process Regression for Big Data

Big Data Research

Gaussian processes are widely used for regression tasks. A known limitation in applying Gaussian processes to regression is that computing the solution requires a matrix inversion, as well as the storage of a large matrix in memory. These factors restrict Gaussian process regression to small and moderate-sized data sets. We present an algorithm that combines estimates from models developed using subsets of the data, obtained in a manner similar to the bootstrap. The sample size is a critical parameter for this algorithm, and guidelines for reasonable choices of algorithm parameters, based on a detailed experimental study, are provided. Various techniques have been proposed to scale Gaussian processes to large-scale regression tasks, and the most appropriate choice depends on the problem context. The proposed method is most appropriate for problems where an additive model works well and the response depends on a small number of features; the minimax rate of convergence for such problems is attractive, and effective models can be built with a small subset of the data. The stochastic variational Gaussian process and the sparse Gaussian process are also appropriate choices for such problems; these methods pick a subset of the data based on theoretical considerations, whereas the proposed algorithm uses bagging and random sampling. Results from experiments conducted as part of this study indicate that the algorithm presented in this work can be as effective as these methods.
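The combination scheme lends itself to a short sketch. The example below fits scikit-learn GPs on bootstrap-style random subsets and averages their predictions; `n_models` and `subset_size` stand in for the paper's critical sample-size parameter, and the paper's exact subsampling and combination rules may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def bagged_gp_predict(X, y, X_test, n_models=10, subset_size=500, seed=0):
    """Fit GPs on bootstrap-style subsets and average their predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        # sample with replacement, as in the bootstrap
        idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=True)
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(X[idx], y[idx])
        preds.append(gp.predict(X_test))
    return np.mean(preds, axis=0)   # combined estimate across subset models
```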

A data parallel approach for large-scale Gaussian process modeling

Proceedings of the Second SIAM …, 2002

This paper proposes an enabling data parallel local learning methodology for handling large data regression through the Gaussian process (GP) modeling paradigm. The proposed model achieves parallelism by employing a specialized compactly supported covariance function defined over spatially localized clusters. The associated load balancing constraints arising from data parallelism are satisfied using a novel greedy clustering algorithm, GeoClust, which produces balanced clusters localized in space. Further, the use of the proposed covariance function as a building block for GP models is shown to decompose the maximum likelihood estimation problem into smaller decoupled subproblems. The attendant benefits, which include a significant reduction in training complexity as well as sparse predictive models for the posterior mean and variance, make the present scheme extremely attractive. Experimental investigations on real and synthetic data demonstrate that the current approach can consistently outperform the state-of-the-art Bayesian Committee Machine (BCM), which employs a random data partitioning strategy. Finally, extensive evaluations over a grid-based computational infrastructure using the NetSolve distributed computing system show that the present approach scales well with data and could potentially be used in large-scale data mining applications.
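As a concrete illustration of the key ingredient, here is a standard compactly supported covariance (a Wendland C2 function, used purely as a stand-in, since the paper defines its own specialised function over GeoClust clusters): correlations are exactly zero beyond the support radius, so well-separated clusters decouple and the kernel matrix becomes block-sparse.

```python
import numpy as np

def wendland_c2(X1, X2, support_radius):
    """Compactly supported Wendland C2 covariance (positive definite up to 3-D).

    Exactly zero beyond support_radius, so points in different, well-separated
    clusters decouple and the likelihood factors into per-cluster subproblems.
    """
    d = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    r = d / support_radius
    return np.where(r < 1.0, (1.0 - r) ** 4 * (4.0 * r + 1.0), 0.0)
```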

Efficient Gaussian process regression for large datasets

Biometrika, 2013

Gaussian processes are widely used in nonparametric regression, classification and spatiotemporal modelling, facilitated in part by a rich literature on their theoretical properties. However, one of their practical limitations is expensive computation, typically of order n³ where n is the number of data points, in performing the necessary matrix inversions. For large datasets, storage and processing also lead to computational bottlenecks, and the numerical stability of the estimates and predicted values degrades with increasing n. Various methods have been proposed to address these problems, including predictive processes in spatial data analysis and the subset-of-regressors technique in machine learning. The idea underlying these approaches is to use a subset of the data, but this raises questions concerning sensitivity to the choice of subset and limitations in estimating fine-scale structure in regions that are not well covered by the subset. Motivated by the literature on compressive sensing, we propose an alternative approach that involves linear projection of all the data points onto a lower-dimensional subspace. We demonstrate the superiority of this approach from a theoretical perspective and through simulated and real data examples.
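A hedged sketch of the projection idea: replace the n observations with m ≪ n random linear combinations and condition on those instead. The paper's projection construction and estimators are more refined than this toy version.

```python
import numpy as np

def compressed_gp_mean(K, K_star, y, noise_var, m, seed=0):
    """GP posterior mean after projecting the n data points to m << n dimensions.

    K is the n x n training kernel matrix, K_star is n_test x n. With z = Phi y,
    z ~ N(0, Phi (K + noise*I) Phi^T), and the conditional mean at test points
    is K_star Phi^T C^{-1} z. A sketch of the compressed-GP idea only.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random projection matrix
    C = Phi @ (K + noise_var * np.eye(n)) @ Phi.T    # m x m compressed covariance
    w = np.linalg.solve(C, Phi @ y)                  # solve in the small space
    return K_star @ Phi.T @ w                        # posterior mean at test points
```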

Fast large scale Gaussian process regression using the improved fast Gauss transform

Gaussian processes allow the treatment of non-linear non-parametric regression problems in a Bayesian framework. However, the computational cost of training such a model with N examples scales as O(N³). Iterative methods for the solution of linear systems can bring this cost down to O(N²), which is still prohibitive for large data sets. In this paper we use an ε-exact approximation technique, the improved fast Gauss transform, to reduce the computational complexity to O(N) for the squared exponential covariance function.
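A minimal sketch of the surrounding iterative scheme: conjugate gradients needs only matrix-vector products with the kernel matrix, and the paper's contribution is replacing the exact O(N²) product with an O(N) improved-fast-Gauss-transform approximation. The stand-in below uses the exact product; swapping in the IFGT is the step not reproduced here.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def gp_weights_via_cg(X, y, lengthscale, noise_var):
    """Solve (K + noise*I) alpha = y with conjugate gradients."""
    n = len(y)
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    K = np.exp(-0.5 * sq / lengthscale**2)   # squared-exponential kernel

    def matvec(v):
        # Exact O(N^2) product; the paper substitutes an O(N) IFGT approximation.
        return K @ v + noise_var * v

    A = LinearOperator((n, n), matvec=matvec)
    alpha, info = cg(A, y)
    return alpha
```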

Scalable Hyperparameter Optimization with Lazy Gaussian Processes

2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC), 2019

Most machine learning methods require careful selection of hyperparameters in order to train a high-performing model with good generalization abilities. Hence, several automatic selection algorithms have been introduced to overcome the tedious manual (trial-and-error) tuning of these parameters. Due to its very high sample efficiency, Bayesian optimization over a Gaussian process model of the parameter space has become the method of choice. Unfortunately, this approach suffers from cubic compute complexity due to the underlying Cholesky factorization, which makes it very hard to scale beyond a small number of sampling steps. In this paper, we present a novel, highly accurate approximation of the underlying Gaussian process. Reducing its computational complexity from cubic to quadratic allows efficient strong scaling of Bayesian optimization while outperforming the previous approach in optimization accuracy. First experiments show speedups of a factor of 162 on a single node and a further speedup of a factor of 5 in a parallel environment.
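The paper's own approximation is not spelled out in the abstract, so the sketch below shows a standard trick that achieves the same cubic-to-quadratic reduction per step in sequential Bayesian optimization: growing the Cholesky factor by one row per new sample instead of refactorising. Treat it as an illustrative stand-in, not the paper's method.

```python
import numpy as np
from scipy.linalg import solve_triangular

def extend_cholesky(L, k_new, kappa):
    """Grow the Cholesky factor of a kernel matrix by one row/column in O(n^2).

    If K' = [[K, k], [k^T, kappa]] and K = L L^T, the new factor appends the
    row [l^T, d] with l = L^{-1} k (forward substitution) and
    d = sqrt(kappa - l^T l), avoiding an O(n^3) refactorisation per BO sample.
    """
    l = solve_triangular(L, k_new, lower=True)
    d = np.sqrt(kappa - l @ l)
    n = L.shape[0]
    L_new = np.zeros((n + 1, n + 1))
    L_new[:n, :n] = L
    L_new[n, :n] = l
    L_new[n, n] = d
    return L_new
```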

Scale-out acceleration for machine learning

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

The growing scale and complexity of Machine Learning (ML) algorithms has resulted in prevalent use of distributed general-purpose systems. In a rather disjoint effort, the community is focusing mostly on high performance single-node accelerators for learning. This work bridges these two paradigms and offers CoSMIC, a full computing stack constituting language, compiler, system software, template architecture, and circuit generators, that enables programmable acceleration of learning at scale. CoSMIC enables programmers to exploit scale-out acceleration using FPGAs and Programmable ASICs (P-ASICs) from a high-level and mathematical Domain-Specific Language (DSL). Nonetheless, CoSMIC does not require programmers to delve into the onerous task of system software development or hardware design. CoSMIC achieves three conflicting objectives of efficiency, automation, and programmability, by integrating a novel multi-threaded template accelerator architecture and a cohesive stack that generates the hardware and software code from its high-level DSL. CoSMIC can accelerate a wide range of learning algorithms that are most commonly trained using parallel variants of gradient descent. The key is to distribute partial gradient calculations of the learning algorithms across the accelerator-augmented nodes of the scale-out system. Additionally, CoSMIC leverages the parallelizability of the algorithms to offer multi-threaded acceleration within each node. Multi-threading allows CoSMIC to efficiently exploit the numerous resources that are becoming available on modern FPGAs/P-ASICs by striking a balance between multi-threaded parallelism and single-threaded performance. CoSMIC takes advantage of algorithmic properties of ML to offer a specialized system software that optimizes task allocation, role-assignment, thread management, and inter-node communication. We evaluate the versatility and efficiency of CoSMIC for 10 different machine learning applications from various domains. On average, a 16-node CoSMIC with UltraScale+ FPGAs offers 18.8× speedup over a 16-node Spark system with Xeon processors while the programmer only writes 22-55 lines of code. CoSMIC offers higher scalability compared to the state-of-the-art Spark; scaling from 4 to 16 nodes with CoSMIC yields a 2.7× improvement whereas Spark offers 1.8×. These results confirm that the full-stack approach of CoSMIC takes an effective and vital step towards enabling scale-out acceleration for machine learning.
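The distribution strategy described above, partial gradients per node aggregated across the scale-out system, can be illustrated with a toy NumPy stand-in. The real CoSMIC stack generates hardware and system software from its DSL; nothing below is its API.

```python
import numpy as np

def distributed_gradient_step(shards, w, lr=0.1):
    """One data-parallel gradient descent step for linear least squares.

    Each 'node' computes a partial gradient on its shard (the work CoSMIC
    offloads to a per-node accelerator); the partials are then aggregated.
    Assumes roughly equal-sized shards so the mean approximates the full gradient.
    """
    partials = []
    for X, y in shards:                     # one iteration per node's shard
        r = X @ w - y
        partials.append(X.T @ r / len(y))   # partial gradient on this shard
    grad = np.mean(partials, axis=0)        # aggregation across nodes
    return w - lr * grad
```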

Scalable High-Order Gaussian Process Regression

2019

While most Gaussian process (GP) work focuses on learning single-output functions, many applications, such as physical simulations and gene expression prediction, require estimating functions with many outputs. The number of outputs can be much larger than or comparable to the size of the training sample. Existing multi-output GP models either are limited to low-dimensional outputs and restricted kernel choices, or assume oversimplified low-rank structures within the outputs. To address these issues, we propose HOGPR, a High-Order Gaussian Process Regression model, which can flexibly capture complex correlations among the outputs and scale up to a large number of outputs. Specifically, we tensorize the high-dimensional outputs, introducing latent coordinate features to index each tensor element (i.e., output) and to capture their correlations. We then generalize a multilinear model to a hybrid of a GP and latent GP model. The model is endowed with a Kronecker product structure ove...
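The Kronecker structure the abstract alludes to can be made concrete: with the outputs arranged as a d₁ × … × d_M tensor and a small kernel per tensor mode over latent coordinate features, the covariance over all outputs is the Kronecker product of the per-mode kernels. A hedged NumPy sketch, illustrative only and not the HOGPR implementation:

```python
import numpy as np

def kron_covariance(coord_kernels):
    """Covariance over a tensorised output grid as a Kronecker product of small
    per-mode kernels; structured solvers never need the full matrix explicitly."""
    K = coord_kernels[0]
    for Km in coord_kernels[1:]:
        K = np.kron(K, Km)
    return K

# Example: a 3 x 4 x 5 output tensor indexed by 2-D latent coordinate features.
rng = np.random.default_rng(0)
Ks = []
for d in (3, 4, 5):
    Z = rng.standard_normal((d, 2))                  # latent coordinates per mode
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    Ks.append(np.exp(-0.5 * sq))                     # RBF kernel over coordinates
K_full = kron_covariance(Ks)                         # 60 x 60 output covariance
```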

Efficient development of high performance data analytics in Python

Future Gener. Comput. Syst., 2020

Our society is generating an increasing amount of data at an unprecedented scale, variety, and speed. This also applies to numerous research areas, such as genomics, high energy physics, and astronomy, for which large-scale data processing has become crucial. However, there is still a gap between the traditional scientific computing ecosystem and big data analytics tools and frameworks. On the one hand, high performance computing (HPC) programming models lack productivity and do not provide means for processing large amounts of data in a simple manner. On the other hand, existing big data processing tools have performance issues in HPC environments and are not general-purpose. In this paper, we propose and evaluate PyCOMPSs, a task-based programming model for Python, as an excellent solution for distributed big data processing in HPC infrastructures. Among other useful features, PyCOMPSs offers a highly productive general-purpose programming model, is infrastructure-agnostic, and ...
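A minimal sketch of the task-based style, assuming PyCOMPSs's documented `@task` decorator and `compss_wait_on` synchronisation call; the example itself is mine, not from the paper.

```python
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=1)
def partial_sum(block):
    # Each decorated call becomes an asynchronous task the COMPSs runtime
    # schedules across the available (possibly distributed) resources.
    return sum(block)

def parallel_sum(blocks):
    partials = [partial_sum(b) for b in blocks]   # tasks launched asynchronously
    partials = compss_wait_on(partials)           # synchronise on the futures
    return sum(partials)

# Launch with the runcompss command so the decorators are active, e.g.:
#   runcompss parallel_sum_example.py
```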

Resource-aware Distributed Gaussian Process Regression for Real-time Machine Learning

arXiv, 2021

We study the problem where a group of agents aim to collaboratively learn a common latent function through streaming data. We propose a Resource-aware Gaussian process regression algorithm that is cognizant of agents’ limited capabilities in communication, computation and memory. We quantify the improvement that limited inter-agent communication brings to the transient and steady-state performance in predictive variance and predictive mean. A set of simulations is conducted to evaluate the developed algorithm.
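One generic way to realise such resource-aware collaboration, sketched below under loud assumptions (the paper's algorithm is its own scheme), is for each agent to maintain a cheap local GP on its stream and share only per-point predictive means and variances with neighbours, fused product-of-experts style:

```python
import numpy as np

def fuse_agent_predictions(means, variances):
    """Product-of-experts fusion of per-agent GP predictions at shared test points.

    Agents exchange only (mean, variance) pairs, a small message respecting
    communication/memory budgets; the fused predictive is precision-weighted.
    Inputs are arrays of shape (n_agents, n_test).
    """
    means = np.asarray(means)
    variances = np.asarray(variances)
    precisions = 1.0 / variances
    fused_var = 1.0 / precisions.sum(axis=0)                 # combined precision
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var
```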