Randal Burns - Profile on Academia.edu

Papers by Randal Burns

Research paper thumbnail of Geodesic Forests

Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020

Together with the curse of dimensionality, nonlinear dependencies in large data sets persist as major challenges in data mining tasks. A reliable way to accurately preserve nonlinear structure is to compute geodesic distances between data points. Manifold learning methods, such as Isomap, aim to preserve geodesic distances in a Riemannian manifold. However, as manifold learning algorithms operate on the ambient dimensionality of the data, the essential step of geodesic distance computation is sensitive to high-dimensional noise. Therefore, a direct application of these algorithms to high-dimensional, noisy data often yields unsatisfactory results and does not accurately capture nonlinear structure. We propose an unsupervised random forest approach, called geodesic forests (GF), for geodesic distance estimation in linear and nonlinear manifolds with noise. GF operates on low-dimensional sparse linear combinations of features, rather than the full observed dimensionality. To choose the optimal split in a computationally efficient fashion, we developed Fast-BIC, a fast Bayesian Information Criterion statistic for Gaussian mixture models. We additionally propose geodesic precision and geodesic recall as novel evaluation metrics that quantify how well the geodesic distances of a latent manifold are preserved. Empirical results on simulated and real data demonstrate that GF is robust to high-dimensional noise, whereas other methods, such as Isomap, UMAP, and FLANN, quickly deteriorate in such settings. Notably, GF estimates geodesic distances better than other approaches on a real connectome dataset.
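
The abstract's Fast-BIC statistic scores candidate splits via a BIC for a two-component Gaussian mixture. As a rough illustration of that idea only (my own minimal derivation, not the authors' closed-form implementation), the sketch below computes a hard-assignment BIC for a single threshold on 1-D projected data:

```python
import numpy as np

def split_bic(x, threshold):
    """BIC for the two-component Gaussian model induced by splitting
    the 1-D projected samples x at `threshold`. Illustrative sketch;
    Fast-BIC evaluates this in closed form over sorted candidate
    splits, which is not reproduced here."""
    left, right = x[x <= threshold], x[x > threshold]
    if len(left) < 2 or len(right) < 2:
        return np.inf
    n, ll = len(x), 0.0
    for part in (left, right):
        w = len(part) / n                    # hard-assignment mixture weight
        var = max(part.var(), 1e-12)         # guard against zero variance
        # log-likelihood of the part's samples under its fitted Gaussian
        ll += len(part) * (np.log(w) - 0.5 * np.log(2 * np.pi * var) - 0.5)
    k = 5  # free parameters: two means, two variances, one weight
    return -2.0 * ll + k * np.log(n)

# pick the candidate threshold with the lowest BIC
x = np.concatenate([np.random.randn(100) - 2, np.random.randn(100) + 2])
candidates = np.quantile(x, np.linspace(0.1, 0.9, 17))
best = min(candidates, key=lambda t: split_bic(x, t))
```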

Research paper thumbnail of Linear Optimal Low Rank Projection for High-Dimensional Multi-class Data

arXiv (Cornell University), Sep 5, 2017

Classification of individual samples into one or more categories is critical to modern scientific inquiry. Most modern datasets, such as those used in genetic analysis or imaging, include numerous features, such as genes or pixels. Principal Components Analysis (PCA) is now generally used to find low-dimensional representations of such features for further analysis. However, PCA ignores class label information, thereby discarding data that could substantially improve downstream classification performance. We here describe an approach called "Linear Optimal Low-rank" projection (LOL), which extends PCA by incorporating the class labels. Using theory and synthetic data, we show that LOL leads to a better representation of the data for subsequent classification than PCA while adding negligible computational cost. Experimentally, we demonstrate that LOL substantially outperforms PCA in differentiating cancer patients from healthy controls using genetic data, and in differentiating gender from magnetic resonance imaging data incorporating >500 million features and 400 gigabytes of data. LOL allows the solution of previously intractable problems, yet requires only a few minutes to run on a single desktop computer.
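
One simplified reading of a LOL-style embedding for two classes is to prepend the class-mean difference to the leading principal directions of the class-centered data. The sketch below follows that reading; it is an assumption-laden illustration, not the authors' reference implementation:

```python
import numpy as np

def lol_project(X, y, d):
    """Sketch of a LOL-style supervised embedding for binary labels
    y in {0, 1}: the first direction is the class-mean difference,
    the remaining d-1 are principal directions of the within-class
    (class-centered) variation."""
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    delta = mu1 - mu0
    delta /= np.linalg.norm(delta)
    # center each class so the SVD captures within-class variance
    Xc = np.where((y == 1)[:, None], X - mu1, X - mu0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = np.vstack([delta, Vt[: d - 1]])   # d x p projection matrix
    return X @ W.T                        # n x d embedding
```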

Research paper thumbnail of Optimize Unsynchronized Garbage Collection in an SSD Array

arXiv (Cornell University), Jun 24, 2015

Solid state disks (SSDs) have advanced to outperform traditional hard drives significantly in both random reads and writes. However, heavy random writes trigger frequent garbage collection and decrease the performance of SSDs. In an SSD array, garbage collection of individual SSDs is not synchronized, leading to underutilization of some of the SSDs. We propose a software solution to tackle unsynchronized garbage collection in an SSD array installed in a host bus adaptor (HBA), where individual SSDs are exposed to the operating system. We maintain a long I/O queue for each SSD and flush dirty pages intelligently to fill the long I/O queues, hiding the performance imbalance among SSDs even when there are few parallel application writes. We further define a policy for selecting dirty pages to flush and a policy for discarding stale flush requests to reduce the amount of data written to SSDs. We evaluate our solution in a real system. Experiments show that our solution fully utilizes all SSDs in an array under random write-heavy workloads. It improves I/O throughput by up to 62% under random workloads of mixed reads and writes when SSDs are under active garbage collection, causes little extra data writeback, and increases the cache hit rate.
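
To make the queue-filling idea concrete, here is a toy policy sketch under assumed data structures (a page-to-SSD map and a per-SSD queue-depth counter; hypothetical names, not the paper's implementation):

```python
def pick_flushes(dirty_pages, queue_depth, target_depth=32):
    """Toy sketch of the queue-filling idea: `dirty_pages` maps a page
    id to its destination SSD and `queue_depth` tracks outstanding
    I/Os per SSD. Background flushes are issued only while the
    destination's queue sits below `target_depth`, so a drive stalled
    in garbage collection accumulates fewer requests instead of
    stalling the whole array."""
    plan = []
    for page, ssd in dirty_pages.items():
        if queue_depth[ssd] < target_depth:
            plan.append((page, ssd))
            queue_depth[ssd] += 1          # account for the queued flush
    return plan

# example: SSD "b" is busy collecting garbage, so only "a" gets new flushes
# pick_flushes({1: "a", 2: "b", 3: "a"}, {"a": 3, "b": 32})
```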

Research paper thumbnail of Random Projection Forests

arXiv (Cornell University), Jun 10, 2015

Ensemble methods---particularly those based on decision trees---have recently demonstrated superior performance in a variety of machine learning settings. We introduce a generalization of many existing decision tree methods called "Random Projection Forests" (RPF), which is any decision forest that uses (possibly data-dependent and random) linear projections. Using this framework, we introduce a special case, called "Lumberjack", that uses very sparse random projections, that is, linear combinations of a small subset of features. Lumberjack obtains statistically significantly improved accuracy over Random Forests, Gradient Boosted Trees, and other approaches on a standard benchmark suite for classification with varying dimension, sample size, and number of classes. To illustrate how, why, and when Lumberjack outperforms other methods, we conduct extensive simulated experiments on vectors, images, and nonlinear manifolds. Lumberjack typically yields improved performance over existing decision tree ensembles, while maintaining computational efficiency, scalability, and interpretability. Lumberjack can easily be incorporated into other ensemble methods, such as boosting, to obtain potentially similar gains.
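
The "very sparse random projections" the abstract describes are straightforward to generate. A minimal sketch (parameter values are illustrative, not the paper's defaults):

```python
import numpy as np

def sparse_random_projection(p, d, density=0.05, rng=None):
    """Very sparse random projection matrix in the spirit of
    Lumberjack's split search: each of the d candidate directions is
    a signed linear combination of a small random subset of the p
    features."""
    rng = rng or np.random.default_rng()
    A = np.zeros((p, d))
    nnz = max(1, int(density * p))
    for j in range(d):
        idx = rng.choice(p, size=nnz, replace=False)
        A[idx, j] = rng.choice([-1.0, 1.0], size=nnz)
    return A  # candidate split directions: X @ A gives projected features
```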

Research paper thumbnail of Forest Packing: Fast, Parallel Decision Forests

arXiv (Cornell University), Jun 19, 2018

Machine learning has an emerging critical role in high-performance computing to modulate simulations, extract knowledge from massive data, and replace numerical models with efficient approximations. Decision forests are a critical tool because they provide insight into model operation that is critical to interpreting learned results. While decision forests are trivially parallelizable, the traversals of tree data structures incur many random memory accesses and are very slow. We present memory packing techniques that reorganize learned forests to minimize cache misses during classification. The resulting layout is hierarchical. At low levels, we pack the nodes of multiple trees into contiguous memory blocks so that each memory access fetches data for multiple trees. At higher levels, we use leaf cardinality to identify the most popular paths through a tree and collocate those paths in cache lines. We extend this layout with out-of-order execution and cache-line prefetching to increase memory throughput. Together, these optimizations increase the performance of classification in ensembles by a factor of four over an optimized C++ implementation and a factor of 50 over a popular R-language implementation.
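
A toy serialization order conveys the two levels of the layout: interleave the roots of several trees into one block, then emit each tree's nodes following the higher-cardinality child first. The node fields below are hypothetical, and real Forest Packing works at the byte/cache-line level rather than on Python objects:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    n_samples: int                 # training samples that reached this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def pack_forest(roots: List[Node], trees_per_block: int = 4) -> List[Node]:
    layout: List[Node] = []

    def emit(node: Optional[Node]) -> None:
        # depth-first, higher-cardinality child first, so the
        # statistically popular path is contiguous in memory
        if node is None:
            return
        layout.append(node)
        children = [c for c in (node.left, node.right) if c is not None]
        for child in sorted(children, key=lambda c: -c.n_samples):
            emit(child)

    for i in range(0, len(roots), trees_per_block):
        group = roots[i:i + trees_per_block]
        layout.extend(group)       # roots of several trees share one block
        for root in group:
            children = [c for c in (root.left, root.right) if c is not None]
            for child in sorted(children, key=lambda c: -c.n_samples):
                emit(child)
    return layout
```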

Research paper thumbnail of To the Cloud! A Grassroots Proposal to Accelerate Brain Science Discovery

Neuron, 2016

The revolution in neuroscientific data acquisition is creating an analysis challenge. We propose leveraging cloud-computing technologies to enable large-scale neurodata storing, exploring, analyzing, and modeling. This utility will empower scientists globally to generate and test theories of brain function and dysfunction.

Research paper thumbnail of An architecture for a data-intensive computer

Proceedings of the first international workshop on Network-aware data management, 2011

Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose that processing Petascale-sized datasets will be carried out in a data-intensive computer, a system consisting of an HPC cluster, a massively parallel database, and an intermediate operating system layer. The operating system will run on dedicated servers and will exploit massive parallelism in the database, as well as numerous optimization strategies, to deliver high-throughput, balanced, and regular data flow for I/O operations between the HPC cluster and the database. The programming model of sequential file storage is not appropriate for data-intensive computations, so we propose a data-object-oriented operating system, where support for high-level data objects, such as multi-dimensional arrays, is built in. User application programs will be compiled into code that is executed both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local: user applications running on a remote PC will be compiled into code executing both on the PC and inside the database. This model supports collaborative environments, in which a large data set is typically created and processed by a large group of users. We have implemented a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently being used to ingest the output of the simulation of a turbulent channel flow into the database.

Research paper thumbnail of MPI-DB, A Parallel Database Services Software Library for Scientific Computing

Lecture Notes in Computer Science, 2011

Large-scale scientific simulations generate petascale data sets subsequently analyzed by groups of researchers, often in databases. We developed a software library, MPI-DB, to provide database services to scientific computing applications. As a bridge between CPU-intensive and data-intensive computations, MPI-DB exploits massive parallelism within large databases to provide scalable, fast service. It is built as a client-server framework, using MPI, with the MPI-DB server acting as an intermediary between the user application running an MPI-DB client and the database servers. MPI-DB provides high-level objects, such as multi-dimensional arrays, acting as an abstraction layer that effectively hides the database from the end user.

Research paper thumbnail of I/O streaming evaluation of batch queries for data-intensive computational turbulence

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, 2011

We describe a method for evaluating computational turbulence queries, including Lagrange polynomial interpolation, based on partial sums that allows the underlying data to be accessed in any order and in parts. We exploit these properties to stream data from disk in a single pass and concurrently evaluate batch queries. The combination of sequential I/O and data sharing improves performance by an order of magnitude when compared with direct evaluation of each query. The technique also supports distributed evaluation of queries in a database cluster, assembling the partial sums from each node at the query mediator. Interpolation is fundamental to computational turbulence: over 95% of queries use these routines, and the partial-sums method allows the JHU Turbulence Database Cluster to realize scale and throughput for our scientists' data-intensive workloads.
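
The partial-sums property follows from the structure of Lagrange interpolation itself: each stored grid value contributes weight × value to the result independently, so values can arrive from disk in any order and be shared across a batch of queries. A 1-D sketch (the database evaluates 3-D tensor-product versions):

```python
import numpy as np

def lagrange_weight(i, xq, xs):
    """Lagrange basis polynomial L_i evaluated at query point xq over
    the 1-D grid nodes xs."""
    w = 1.0
    for j, xj in enumerate(xs):
        if j != i:
            w *= (xq - xj) / (xs[i] - xj)
    return w

xs = np.array([0.0, 1.0, 2.0, 3.0])
fs = np.sin(xs)                      # stored field values at the nodes
queries = [0.4, 1.7, 2.9]
acc = np.zeros(len(queries))
for i, (xi, fi) in enumerate(zip(xs, fs)):   # stream nodes in any order
    for q, xq in enumerate(queries):
        acc[q] += lagrange_weight(i, xq, xs) * fi
# acc[q] now equals the Lagrange interpolant at queries[q]
```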

Research paper thumbnail of A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence

Journal of Turbulence, 2008

A public database system archiving a direct numerical simulation (DNS) data set of isotropic, forced turbulence is described in this paper. The data set consists of the DNS output on 1024³ spatial points and 1024 time samples spanning about one large-scale turnover timescale. This complete 1024⁴ space-time history of turbulence is accessible to users remotely through an interface that is based on the Web-services model. Users may write and execute analysis programs on their host computers, while the programs make subroutine-like calls that request desired parts of the data over the network. Users are thus able to perform numerical experiments by accessing the 27 terabytes of DNS data using regular platforms such as laptops. The architecture of the database is explained, as are some of the locally defined functions, such as differentiation and interpolation. Test calculations are performed to illustrate the usage of the system and to verify the accuracy of the methods. The database is then used to analyse a dynamical model for small-scale intermittency in turbulence. Specifically, the dynamical effects of pressure and viscous terms on the Lagrangian evolution of velocity increments are evaluated using conditional averages calculated from the DNS data in the database. It is shown that these effects differ considerably among themselves and thus require different modeling strategies in Lagrangian models of velocity increments and intermittency.

Research paper thumbnail of Analysis of isotropic turbulence using a public database and the Web service model, and applications to study subgrid models

A public database system archiving a direct numerical simulation (DNS) data set of isotropic, forced turbulence is used for studying basic turbulence dynamics. The data set consists of the DNS output on 1024³ spatial points and 1024 time samples spanning about one large-scale turnover timescale. This complete space-time history of turbulence is accessible to users remotely through an interface that is based on the Web-services model.

Research paper thumbnail of BLOCKSET (Block-Aligned Serialized Trees)

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021

We present methods to serialize and deserialize gradient-boosted trees and random forests that optimize inference latency when models are not loaded into memory. This arises when models are larger than memory, but also systematically when models are deployed on low-resource devices in the Internet of Things or run as cloud microservices where resources are allocated on demand. Block-Aligned Serialized Trees (BLOCKSET) introduce the concept of selective access for random forests and gradient-boosted trees, in which only the parts of the model needed for inference are deserialized and loaded into memory. Using principles from external-memory algorithms, we block-align the serialization format in order to minimize the number of I/Os. For gradient-boosted trees, this results in a more than fivefold reduction in inference latency over layouts that do not perform selective access, and a twofold latency reduction over techniques that are selective but do not encode I/O block boundaries in the layout.
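
The benefit of encoding block boundaries is easy to picture: given the byte offsets of only the tree nodes a prediction actually visits, selective access reads just the I/O blocks covering them. A minimal sketch under assumed offsets and record sizes (hypothetical values, not BLOCKSET's format):

```python
BLOCK = 4096  # assumed I/O block size in bytes

def blocks_for_nodes(node_offsets, node_size):
    """Selective access in the BLOCKSET spirit: map the byte ranges of
    the visited tree nodes to the set of I/O blocks that must be read;
    everything else stays on disk."""
    needed = set()
    for off in node_offsets:
        first = off // BLOCK
        last = (off + node_size - 1) // BLOCK
        needed.update(range(first, last + 1))
    return sorted(needed)

# e.g. three visited 32-byte nodes touch only two of the file's blocks:
# blocks_for_nodes([128, 8200, 8216], 32) -> [0, 2]
```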

Research paper thumbnail of Studying Lagrangian dynamics of turbulence using on-demand fluid particle tracking in the JHU turbulence database

Bulletin of the American Physical Society, Nov 20, 2011

The JHU public turbulence database (http://turbulence.pha.jhu.edu) provides access to large datasets generated from DNS of turbulence, at present the output of a 1024³ pseudo-spectral DNS of forced isotropic turbulence (Reλ = 443) with 1024 time steps. The resulting 27 TB dataset can be accessed remotely through an interface based on the Web-services model, allowing remote users to issue subroutine-like calls from their host computers. Here we describe the newly developed getPosition function: given an initial position, an integration time step, and initial and end times, getPosition tracks arrays of fluid particles inside the database and returns particle locations at the end of the trajectory integration time. getPosition is applied to study Lagrangian velocity structure functions as well as tensor-based Lagrangian time correlation functions. The roles of the pressure Hessian and viscous terms in the evolution of the strain-rate and rotation tensors are also explored.
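
What a getPosition-style call computes is the integral of dx/dt = u(x, t) through the stored velocity field. The sketch below uses plain explicit Euler for illustration; the database applies its own integration and interpolation schemes, and `interp_velocity` merely stands in for its velocity lookup:

```python
def get_position(interp_velocity, x0, t0, t1, dt):
    """Integrate a fluid-particle trajectory dx/dt = u(x, t) with
    explicit Euler steps; a conceptual sketch, not the service's
    implementation."""
    x, t = list(x0), t0
    while t < t1 - 1e-12:
        u = interp_velocity(x, t)
        x = [xi + dt * ui for xi, ui in zip(x, u)]
        t += dt
    return x

# usage with a toy solid-body-rotation field:
# get_position(lambda x, t: (-x[1], x[0], 0.0), (1.0, 0.0, 0.0), 0.0, 1.0, 1e-3)
```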

Research paper thumbnail of A low-resource reliable pipeline to democratize multi-modal connectome estimation and analysis

bioRxiv (Cold Spring Harbor Laboratory), Nov 3, 2021

Connectomics, the study of brain networks, provides a unique and valuable opportunity to study the brain. However, research in human connectomics, accomplished via Magnetic Resonance Imaging (MRI), is a resource-intensive practice: typical analysis routines require impactful decision making and significant computational capabilities. Mitigating these issues requires the development of low-resource, easy-to-use, and flexible pipelines which can be applied across data with variable collection parameters. In response to these challenges, we have developed the MRI to Graphs (m2g) pipeline. m2g leverages functional and diffusion datasets to estimate connectomes reliably. To illustrate, m2g was used to process MRI data from 35 different studies (≈6,000 scans) from 15 sites without any manual intervention or parameter tuning. Every single scan yielded an estimated connectome that followed established properties, such as stronger ipsilateral than contralateral connections in structural connectomes, and stronger homotopic than heterotopic correlations in functional connectomes. Moreover, the connectomes generated by m2g are more similar within individuals than between them, suggesting that m2g preserves biological variability. m2g is portable and can run on a single CPU with 16 GB of RAM in less than a couple of hours, or it can be deployed on the cloud using its Docker container. All code is available at https://neurodata.io/mri/.

Research paper thumbnail of FlashR

R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The general approach to obtaining an efficient algorithm in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory-hierarchy-aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution, with only one pass over the data, to increase the ratio of computation to I/O, and (iii) performing two levels of matrix partitioning and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 to 20.
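
The lazy-evaluation point (i) is the key enabler for points (ii) and (iii): if arithmetic only records a DAG, the whole expression can later be fused and executed in one pass. A toy Python picture of that idea (not FlashR, which does this for R's base matrix functions):

```python
import numpy as np

class Lazy:
    """Arithmetic builds a deferred expression (a DAG of thunks);
    nothing executes until the result is materialized, so a chain of
    matrix operations can be fused into one execution."""
    def __init__(self, thunk):
        self.thunk = thunk
    def __add__(self, other):
        return Lazy(lambda: self.thunk() + other.thunk())
    def __matmul__(self, other):
        return Lazy(lambda: self.thunk() @ other.thunk())
    def materialize(self):
        return self.thunk()

A = Lazy(lambda: np.ones((4, 4)))
B = Lazy(lambda: np.eye(4))
expr = (A @ B) + A        # only records the DAG; no arithmetic yet
C = expr.materialize()    # the whole expression runs in one execution
```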

Research paper thumbnail of Graphyti: A Semi-External Memory Graph Library for FlashGraph

arXiv (Cornell University), Jul 7, 2019

Graph datasets exceed the in-memory capacity of most standalone machines. Traditionally, graph frameworks have overcome memory limitations through scale-out distributed computing. Emerging frameworks avoid the network bottleneck of distributed data with Semi-External Memory (SEM), which uses a single multicore node and operates on graphs larger than memory. In SEM, O(m) data resides on disk and O(n) data in memory, for a graph with n vertices and m edges. For developers, this adds complexity because they must explicitly encode I/O within applications. We present principles that are critical for application developers to adopt in order to achieve state-of-the-art performance, while minimizing I/O and memory for algorithms in SEM. We present them in Graphyti, an extensible parallel SEM graph library built on FlashGraph and available in Python via pip. In SEM, Graphyti achieves 80% of the performance of in-memory execution and retains the performance of FlashGraph, which outperforms distributed engines such as PowerGraph and Galois.
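
The O(n)-in-memory / O(m)-on-disk split can be illustrated with a semi-external BFS: per-vertex state stays in RAM while the edge list is re-streamed sequentially each round. This is only an illustration of the SEM split, not Graphyti's engine:

```python
def sem_bfs(num_vertices, edge_stream, source):
    """Semi-external-memory BFS sketch: O(n) vertex state in RAM,
    O(m) edges streamed from disk each round. `edge_stream()` yields
    directed (u, v) pairs."""
    level = [-1] * num_vertices
    level[source] = 0
    frontier, depth = {source}, 0
    while frontier:
        nxt = set()
        for u, v in edge_stream():          # one sequential disk pass
            if u in frontier and level[v] == -1:
                level[v] = depth + 1
                nxt.add(v)
        frontier, depth = nxt, depth + 1
    return level

# sem_bfs(4, lambda: [(0, 1), (1, 2), (2, 3)], 0) -> [0, 1, 2, 3]
```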

Research paper thumbnail of Knor

k-means is one of the most influential and utilized machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The k-means NUMA Optimized Routine (knor) library has (i) in-memory (knori), (ii) distributed-memory (knord), and (iii) semi-external-memory (knors) modules that radically improve the performance of k-means for varying memory and hardware budgets. knori boosts performance for single-machine datasets by an order of magnitude or more. knors improves the scalability of k-means on a memory budget using SSDs, and scales to billions of points on a single machine, using a fraction of the resources that distributed in-memory systems require. knord retains knori's performance characteristics, while scaling in-memory through distributed computation in the cloud. knor modifies Elkan's triangle inequality pruning algorithm so that it can be used on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate that knor outperforms distributed commercial products like H2O, Turi (formerly Dato, GraphLab) and Spark's MLlib by more than an order of magnitude for datasets of 10⁷ to 10⁹ points.
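
The barrier-minimizing structure knor's parallelization is built around can be sketched as per-thread private accumulation with a single merge at the end of each iteration, so the reduction is the only synchronization point. A toy version (not the library's NUMA-aware implementation):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def kmeans_step(X, centers, n_threads=4):
    """One k-means iteration: each worker accumulates partial sums and
    counts into private buffers (no shared writes), then a single
    reduction produces the new centroids."""
    k, d = centers.shape
    chunks = np.array_split(X, n_threads)

    def local(chunk):
        sums, counts = np.zeros((k, d)), np.zeros(k)
        assign = np.argmin(
            ((chunk[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            mask = assign == j
            sums[j] = chunk[mask].sum(0)
            counts[j] = mask.sum()
        return sums, counts

    with ThreadPoolExecutor(n_threads) as ex:
        partials = list(ex.map(local, chunks))
    sums = sum(p[0] for p in partials)       # the single reduction
    counts = sum(p[1] for p in partials)
    counts = np.maximum(counts, 1)           # avoid empty-cluster division
    return sums / counts[:, None]
```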

Research paper thumbnail of A Web services accessible database of turbulent channel flow and its use for testing a new integral wall model for LES

Journal of Turbulence, Dec 2, 2015

The output from a direct numerical simulation (DNS) of turbulent channel flow at Reτ ≈ 1000 is used to construct a publicly and Web-services accessible, spatio-temporal database for this flow. The simulated channel has a size of 8πh × 2h × 3πh, where h is the channel half-height. Data are stored at 2048 × 512 × 1536 spatial grid points for a total of 4000 time samples, every 5 time steps of the DNS. These cover an entire channel flow-through time, i.e. the time it takes to traverse the entire channel length 8πh at the mean velocity of the bulk flow. Users can access the database through an interface that is based on the Web-services model and perform numerical experiments on the slightly over 100 terabytes (TB) of DNS data from their remote platforms, such as laptops or local desktops. Additional technical details about the pressure calculation, database interpolation, and differentiation tools are provided in several appendices. As a sample application of the channel flow database, we use it to conduct an a priori test of a recently introduced integral wall model for Large Eddy Simulation of wall-bounded turbulent flow. The results are compared with those of the equilibrium wall model, showing the strengths of the integral wall model as compared to the equilibrium model.

Research paper thumbnail of clusterNOR: A NUMA-Optimized Clustering Framework

arXiv (Cornell University), Feb 24, 2019

Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. In addition, the performance of parallel implementations suffers from synchronous barriers for each iteration and from skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory access (NUMA) architectures to maximize independent, asynchronous computation. We eliminate many barriers, reduce remote memory accesses, and increase cache reuse. Clustering NUMA Optimized Routines (clusterNOR) is an open-source framework that generalizes the knor library for k-means clustering, providing a uniform programming interface and expanding the scope to hierarchical and linear algebraic algorithms. The algorithms share the Majorize-Minimization or Minorize-Maximization (MM) pattern of computation. We demonstrate nine modern clustering algorithms that have simple implementations running in-memory, with semi-external memory, or distributed. For algorithms that rely on Euclidean distance, we develop a relaxation of Elkan's triangle inequality algorithm that uses asymptotically less memory and halves runtime. Our optimizations produce an order-of-magnitude performance improvement over other systems, such as Spark's MLlib and Apple's Turi.
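
The shared MM pattern the abstract mentions has a simple two-phase shape: an embarrassingly parallel per-point phase followed by one global update, which is the only synchronization point. An illustrative interface only, not the framework's API:

```python
from typing import Any, Iterable, List

class MMStep:
    """Majorize-/Minorize-Maximization shape: per-point `assign`
    (parallelizable), then one global `update` (the barrier)."""
    def assign(self, point: Any) -> Any:               # e.g. nearest centroid
        raise NotImplementedError
    def update(self, assignments: List[Any]) -> None:  # e.g. recompute means
        raise NotImplementedError

def run(algo: MMStep, data: Iterable[Any], iterations: int) -> None:
    for _ in range(iterations):
        assignments = [algo.assign(p) for p in data]   # parallel phase
        algo.update(assignments)                       # single sync point
```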
