Randal Burns | Johns Hopkins University

Papers by Randal Burns

Linear Optimal Low Rank Projection for High-Dimensional Multi-class Data

arXiv (Cornell University), Sep 5, 2017

Classification of individual samples into one or more categories is critical to modern scientific inquiry. Most modern datasets, such as those used in genetic analysis or imaging, include numerous features, such as genes or pixels. Principal Components Analysis (PCA) is now generally used to find low-dimensional representations of such features for further analysis. However, PCA ignores class label information, thereby discarding data that could substantially improve downstream classification performance. Here we describe an approach called "Linear Optimal Low-rank" projection (LOL), which extends PCA by incorporating the class labels. Using theory and synthetic data, we show that LOL leads to a better representation of the data for subsequent classification than PCA while adding negligible computational cost. Experimentally, we demonstrate that LOL substantially outperforms PCA in differentiating cancer patients from healthy controls using genetic data and in differentiating gender from magnetic resonance imaging data incorporating more than 500 million features and 400 gigabytes of data. LOL allows the solution of previously intractable problems, yet requires only a few minutes to run on a single desktop computer.
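The abstract's core idea, supplementing principal components with directions derived from the class means, can be sketched in a few lines. The `lol_project` helper below is a simplified construction of my own under stated assumptions, not the authors' released code:

```python
import numpy as np

def lol_project(X, y, d):
    """Simplified LOL-style projection: stack class-mean difference
    directions (supervised) ahead of principal components (unsupervised),
    keep the first d directions, and orthonormalize."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    deltas = means[1:] - means[0]            # (n_classes - 1, p) label-aware directions
    Xc = X - X.mean(axis=0)                  # center data for the PCA part
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = np.vstack([deltas, Vt])[:d]          # first d candidate directions
    Q, _ = np.linalg.qr(A.T)                 # orthonormalize columns
    return X @ Q                             # n x d embedding

# toy usage: two Gaussian classes in 10 dimensions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(2, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)
Z = lol_project(X, y, 3)
print(Z.shape)  # (100, 3)
```

Because the class-mean direction comes first, the embedding retains the discriminative axis that plain PCA may discard when within-class variance dominates.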

Optimize Unsynchronized Garbage Collection in an SSD Array

arXiv (Cornell University), Jun 24, 2015

Random Projection Forests

arXiv (Cornell University), Jun 10, 2015

Ensemble methods---particularly those based on decision trees---have recently demonstrated superior performance in a variety of machine learning settings. We introduce a generalization of many existing decision tree methods called "Random Projection Forests" (RPF): any decision forest that uses (possibly data-dependent and random) linear projections. Using this framework, we introduce a special case, called "Lumberjack", that uses very sparse random projections, that is, linear combinations of a small subset of features. Lumberjack obtains statistically significant improvements in accuracy over Random Forests, Gradient Boosted Trees, and other approaches on standard benchmark suites for classification with varying dimension, sample size, and number of classes. To illustrate how, why, and when Lumberjack outperforms other methods, we conduct extensive simulated experiments on vectors, images, and nonlinear manifolds. Lumberjack typically yields improved performance over existing decision tree ensembles while maintaining computational efficiency, scalability, and interpretability. Lumberjack can easily be incorporated into other ensemble methods, such as boosting, to obtain potentially similar gains.
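The "very sparse random projections" that generate Lumberjack's candidate split features can be illustrated with a small sketch; the function name and the density parameter here are illustrative assumptions, not the paper's code:

```python
import numpy as np

def sparse_random_projection(p, d, density=0.1, rng=None):
    """Very sparse random projection matrix: each entry is -1, 0, or +1,
    with only about a `density` fraction nonzero, so each projected
    feature is a linear combination of a small subset of the original
    p features."""
    rng = rng or np.random.default_rng()
    mask = rng.random((p, d)) < density          # which entries are nonzero
    signs = rng.choice([-1.0, 1.0], size=(p, d))
    return mask * signs

# at a tree node, project the data and search the d projected
# features for the best split instead of the raw axis-aligned ones
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))
A = sparse_random_projection(50, 5, density=0.1, rng=rng)
X_proj = X @ A
print(X_proj.shape)  # (200, 5)
```

Sparsity keeps the per-node cost close to that of axis-aligned Random Forests while allowing oblique splits.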

Forest Packing: Fast, Parallel Decision Forests

arXiv (Cornell University), Jun 19, 2018

To the Cloud! A Grassroots Proposal to Accelerate Brain Science Discovery

An architecture for a data-intensive computer

Proceedings of the first international workshop on Network-aware data management, 2011

MPI-DB, A Parallel Database Services Software Library for Scientific Computing

Lecture Notes in Computer Science, 2011

Large-scale scientific simulations generate petascale data sets subsequently analyzed by groups of researchers, often in databases. We developed a software library, MPI-DB, to provide database services to scientific computing applications. As a bridge between CPU-intensive and data-intensive computations, MPI-DB exploits massive parallelism within large databases to provide scalable, fast service. It is built as a client-server framework using MPI, with the MPI-DB server acting as an intermediary between the user application, which runs an MPI-DB client, and the database servers. MPI-DB provides high-level objects, such as multi-dimensional arrays, acting as an abstraction layer that effectively hides the database from the end user.
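The abstraction-layer idea, where the client sees a multi-dimensional array while reads are translated into per-partition requests, can be sketched generically. The class, its methods, and the dict-backed store below are hypothetical stand-ins for illustration only, not MPI-DB's actual API:

```python
import numpy as np

class RemoteArray:
    """Client-side view of a large array partitioned into row blocks.
    A dict of blocks stands in for the database partitions that the
    server would query on the client's behalf."""

    def __init__(self, store, block_rows):
        self.store = store            # block id -> ndarray block
        self.block_rows = block_rows

    def read(self, start, stop):
        """Fetch rows [start, stop), touching only overlapping blocks."""
        first = start // self.block_rows
        last = (stop - 1) // self.block_rows
        parts = []
        for b in range(first, last + 1):
            blk = self.store[b]                       # one server round trip
            lo = max(start - b * self.block_rows, 0)
            hi = min(stop - b * self.block_rows, self.block_rows)
            parts.append(blk[lo:hi])
        return np.concatenate(parts)

# toy usage: a 16x3 array partitioned into 4-row blocks
data = np.arange(48).reshape(16, 3)
store = {b: data[b * 4:(b + 1) * 4] for b in range(4)}
ra = RemoteArray(store, block_rows=4)
out = ra.read(2, 10)
print(out.shape)  # (8, 3)
```

The user slices an array; the partitioning, and hence the database, stays hidden behind the read path.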

I/O streaming evaluation of batch queries for data-intensive computational turbulence

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11, 2011

A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence

Journal of Turbulence, 2008

Analysis of isotropic turbulence using a public database and the Web service model, and applications to study subgrid models

A public database system archiving a direct numerical simulation (DNS) data set of isotropic, forced turbulence is used for studying basic turbulence dynamics. The data set consists of the DNS output on 1024-cubed spatial points and 1024 time samples spanning about one large-scale turnover timescale. This complete space-time history of turbulence is accessible to users remotely through an interface that is

BLOCKSET (Block-Aligned Serialized Trees)

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021

Studying Lagrangian dynamics of turbulence using on-demand fluid particle tracking in the JHU turbulence database

Bulletin of the American Physical Society, Nov 20, 2011

A low-resource reliable pipeline to democratize multi-modal connectome estimation and analysis

bioRxiv (Cold Spring Harbor Laboratory), Nov 3, 2021

FlashR

Graphyti: A Semi-External Memory Graph Library for FlashGraph

arXiv (Cornell University), Jul 7, 2019

Knor

k-means is one of the most influential and widely used machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The k-means NUMA Optimized Routine (knor) library has (i) in-memory (knori), (ii) distributed-memory (knord), and (iii) semi-external-memory (knors) modules that radically improve the performance of k-means for varying memory and hardware budgets. knori boosts performance for single-machine datasets by an order of magnitude or more. knors improves the scalability of k-means on a memory budget using SSDs, scaling to billions of points on a single machine with a fraction of the resources that distributed in-memory systems require. knord retains knori's performance characteristics while scaling in memory through distributed computation in the cloud. knor modifies Elkan's triangle-inequality pruning algorithm so that it can be used on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate that knor outperforms distributed commercial products like H2O, Turi (formerly Dato, GraphLab), and Spark's MLlib by more than an order of magnitude for datasets of 10^7 to 10^9 points.
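The Elkan-style pruning the abstract mentions rests on the triangle inequality: if a point's (upper bound on its) distance to its assigned center is at most half the distance from that center to any other center, no other center can be closer, so its distances need not be recomputed. A simplified single-pass sketch of that bound, not knor's implementation:

```python
import numpy as np

def kmeans_assign_pruned(X, centers, assign, upper, shift):
    """One assignment pass with a simplified triangle-inequality prune.
    `upper` holds an upper bound on each point's distance to its assigned
    center; `shift` is how far each center moved since the bound was set.
    Only points whose loosened bound exceeds half the gap to the nearest
    other center recompute all distances."""
    upper = upper + shift[assign]                 # bounds loosen as centers move
    cc = np.linalg.norm(centers[:, None] - centers[None], axis=2)
    np.fill_diagonal(cc, np.inf)
    s = cc.min(axis=1) / 2.0                      # half-gap per center
    recompute = upper > s[assign]
    if recompute.any():
        d = np.linalg.norm(X[recompute, None] - centers[None], axis=2)
        assign[recompute] = d.argmin(axis=1)
        upper[recompute] = d.min(axis=1)
    return assign, upper

# toy usage: exact bounds, stationary centers
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
centers = rng.normal(size=(4, 2))
d0 = np.linalg.norm(X[:, None] - centers[None], axis=2)
assign, upper = d0.argmin(axis=1), d0.min(axis=1)
assign, upper = kmeans_assign_pruned(X, centers, assign, upper, np.zeros(4))
print(assign.shape)  # (500,)
```

Points that pass the bound skip the full distance computation entirely, which is where the savings on billion-point datasets come from; the full Elkan algorithm also keeps per-center lower bounds, which is the memory overhead knor reduces.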

A Web services accessible database of turbulent channel flow and its use for testing a new integral wall model for LES

Journal of Turbulence, Dec 2, 2015

clusterNOR: A NUMA-Optimized Clustering Framework

arXiv (Cornell University), Feb 24, 2019

NUMA-optimized In-memory and Semi-external-memory Parameterized Clustering

arXiv (Cornell University), Jun 28, 2016
