Randal Burns | Johns Hopkins University
Papers by Randal Burns
arXiv (Cornell University), Sep 5, 2017
Classification of individual samples into one or more categories is critical to modern scientific inquiry. Most modern datasets, such as those used in genetic analysis or imaging, include numerous features, such as genes or pixels. Principal Components Analysis (PCA) is now generally used to find low-dimensional representations of such features for further analysis. However, PCA ignores class label information, thereby discarding data that could substantially improve downstream classification performance. We here describe an approach called "Linear Optimal Low-rank" projection (LOL), which extends PCA by incorporating the class labels. Using theory and synthetic data, we show that LOL leads to a better representation of the data for subsequent classification than PCA while adding negligible computational cost. Experimentally, we demonstrate that LOL substantially outperforms PCA in differentiating cancer patients from healthy controls using genetic data, and in differentiating gender from magnetic resonance imaging data incorporating >500 million features and 400 gigabytes of data. LOL allows the solution of previously intractable problems, yet requires only a few minutes to run on a single desktop computer.
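The abstract describes LOL as PCA augmented with class-label information. As a rough illustration of that idea, here is a minimal two-class sketch in NumPy: the first projection direction is the difference of the class means, and the remaining directions are the top principal components of the class-centered data. The function name `lol_project` and this exact construction are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lol_project(X, y, d):
    """Sketch of a Linear Optimal Low-rank (LOL) projection, two classes.

    First direction: difference of class means (the label information
    PCA discards). Remaining directions: top principal components of
    the class-centered data.
    """
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    delta = (means[0] - means[1])[:, None]          # mean-difference direction

    Xc = X - means[np.searchsorted(classes, y)]     # center each sample by its class mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)   # PCA directions of centered data
    A = np.hstack([delta, Vt[: d - 1].T])           # p x d projection matrix
    Q, _ = np.linalg.qr(A)                          # orthonormalize the columns
    return Q

# Usage: project, then run any classifier in the low-dimensional space.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = np.repeat([0, 1], 100)
X[y == 1] += 0.5                                    # shift one class's mean
Z = X @ lol_project(X, y, d=5)                      # 200 x 5 representation
```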
arXiv (Cornell University), Jun 24, 2015
arXiv (Cornell University), Jun 10, 2015
Ensemble methods, particularly those based on decision trees, have recently demonstrated superior performance in a variety of machine learning settings. We introduce a generalization of many existing decision tree methods called "Random Projection Forests" (RPF), defined as any decision forest that uses (possibly data-dependent and random) linear projections. Using this framework, we introduce a special case, called "Lumberjack", that uses very sparse random projections, that is, linear combinations of a small subset of features. Lumberjack obtains statistically significantly improved accuracy over Random Forests, Gradient Boosted Trees, and other approaches on a standard benchmark suite for classification with varying dimension, sample size, and number of classes. To illustrate how, why, and when Lumberjack outperforms other methods, we conduct extensive simulated experiments on vectors, images, and nonlinear manifolds. Lumberjack typically yields improved performance over existing decision tree ensembles while maintaining computational efficiency and scalability, and preserving interpretability. Lumberjack can easily be incorporated into other ensemble methods, such as boosting, to obtain potentially similar gains.
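To make "very sparse random projections" concrete, below is a toy sketch: each derived feature combines a small random subset of the input features with ±1 weights, and an ordinary axis-aligned tree is trained on the projected data. This is a simplification under stated assumptions; the paper's method samples candidate projections node by node within each tree rather than once per tree, and the names here (`sparse_random_projection`, `RPTree`) are invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def sparse_random_projection(p, d, density=0.05, seed=None):
    """Very sparse random projection: each of the d derived features is a
    +/-1 combination of a small random subset of the p original features."""
    rng = np.random.default_rng(seed)
    A = np.zeros((p, d))
    nnz = max(1, int(density * p))                  # features per projection
    for j in range(d):
        idx = rng.choice(p, size=nnz, replace=False)
        A[idx, j] = rng.choice([-1.0, 1.0], size=nnz)
    return A

class RPTree:
    """One tree of a toy random-projection forest: project once, then
    split on the projected features with a standard axis-aligned tree."""
    def __init__(self, d=20, density=0.05, seed=None):
        self.d, self.density, self.seed = d, density, seed

    def fit(self, X, y):
        self.A = sparse_random_projection(X.shape[1], self.d, self.density, self.seed)
        self.tree = DecisionTreeClassifier().fit(X @ self.A, y)
        return self

    def predict(self, X):
        return self.tree.predict(X @ self.A)
```

A forest would bag many such trees, each with an independently drawn projection, and vote over their predictions.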
arXiv (Cornell University), Jun 19, 2018
Proceedings of the first international workshop on Network-aware data management, 2011
Lecture Notes in Computer Science, 2011
Large-scale scientific simulations generate petascale data sets that are subsequently analyzed by groups of researchers, often in databases. We developed a software library, MPI-DB, to provide database services to scientific computing applications. As a bridge between CPU-intensive and data-intensive computations, MPI-DB exploits massive parallelism within large databases to provide scalable, fast service. It is built as a client-server framework, using MPI, with the MPI-DB server acting as an intermediary between the user application running an MPI-DB client and the database servers. MPI-DB provides high-level objects, such as multi-dimensional arrays, acting as an abstraction layer that effectively hides the database from the end user.
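The client-server split described above can be sketched with mpi4py: one rank plays the server brokering array reads, while the other ranks play application clients that see only array slices, never the backing database. Everything in this sketch (the message tag, the slice protocol) is a hypothetical stand-in, not the real MPI-DB API.

```python
# Hypothetical sketch (not the real MPI-DB API): rank 0 brokers array
# reads the way the MPI-DB server brokers database access; other ranks
# are application clients that request sub-arrays by index range.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
TAG_READ = 7            # invented message tag for "read a row range"

if comm.Get_rank() == 0:
    # Server: MPI-DB would translate requests into database queries;
    # here it serves slices of an in-memory stand-in array.
    data = np.arange(1024.0).reshape(32, 32)
    status = MPI.Status()
    for _ in range(comm.Get_size() - 1):
        lo, hi = comm.recv(source=MPI.ANY_SOURCE, tag=TAG_READ, status=status)
        comm.send(data[lo:hi], dest=status.Get_source(), tag=TAG_READ)
else:
    # Client: requests a sub-array, oblivious to how it is stored.
    comm.send((0, 4), dest=0, tag=TAG_READ)
    block = comm.recv(source=0, tag=TAG_READ)
    print(comm.Get_rank(), block.shape)
```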
Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011
Journal of Turbulence, 2008
A public database system archiving a direct numerical simulation (DNS) data set of isotropic, forced turbulence is used for studying basic turbulence dynamics. The data set consists of the DNS output on 1024^3 spatial points and 1024 time samples spanning about one large-scale turnover timescale. This complete space-time history of turbulence is accessible to users remotely through a Web-services interface.
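A quick back-of-the-envelope calculation shows why such a data set is served through a remote query interface rather than shipped to users. The sizing below assumes single-precision velocity components only (an assumption for illustration, not a figure from the paper; pressure and metadata would add more):

```python
# Rough size of the archived DNS output described in the abstract.
points_per_snapshot = 1024 ** 3          # spatial grid
snapshots = 1024                          # time samples
bytes_per_point = 3 * 4                   # three velocity components, float32
total = points_per_snapshot * snapshots * bytes_per_point
print(f"{total / 2**40:.0f} TiB for velocity alone")   # -> 12 TiB
```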
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021
Bulletin of the American Physical Society, Nov 20, 2011
bioRxiv (Cold Spring Harbor Laboratory), Nov 3, 2021
arXiv (Cornell University), Jul 7, 2019
k-means is one of the most influential and widely used machine learning algorithms. Its computation limits the performance and scalability of many statistical analysis and machine learning tasks. We rethink and optimize k-means in terms of modern NUMA architectures to develop a novel parallelization scheme that delays and minimizes synchronization barriers. The k-means NUMA Optimized Routine (knor) library has (i) in-memory (knori), (ii) distributed-memory (knord), and (iii) semi-external-memory (knors) modules that radically improve the performance of k-means for varying memory and hardware budgets. knori boosts performance for single-machine datasets by an order of magnitude or more. knors improves the scalability of k-means on a memory budget using SSDs, and scales to billions of points on a single machine using a fraction of the resources that distributed in-memory systems require. knord retains knori's performance characteristics while scaling in memory through distributed computation in the cloud. knor modifies Elkan's triangle inequality pruning algorithm so that it can be used on billion-point datasets without the significant memory overhead of the original algorithm. We demonstrate that knor outperforms distributed commercial products like H2O, Turi (formerly Dato, GraphLab) and Spark's MLlib by more than an order of magnitude for datasets of 10^7 to 10^9 points.
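The pruning rule the abstract refers to is Elkan's triangle-inequality bound: if d(c_i, c_j) >= 2 d(x, c_i), then center c_j cannot be closer to x than its current center c_i, so d(x, c_j) never needs to be computed. The minimal sketch below keeps only the k x k center-center distances, which loosely echoes how one can avoid Elkan's O(nk) per-point lower-bound storage; it illustrates the prune, not knor's NUMA-optimized implementation.

```python
import numpy as np

def kmeans_triangle_prune(X, k, iters=20, seed=0):
    """Toy k-means with Elkan-style triangle-inequality pruning."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), k, replace=False)].copy()
    assign = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        cc = np.linalg.norm(C[:, None] - C[None], axis=2)   # center-center distances
        for i, x in enumerate(X):
            best = assign[i]
            d_best = np.linalg.norm(x - C[best])
            for j in range(k):
                if j == best or cc[best, j] >= 2 * d_best:
                    continue                                 # pruned: c_j cannot win
                d = np.linalg.norm(x - C[j])
                if d < d_best:
                    best, d_best = j, d
            assign[i] = best
        for j in range(k):                                   # standard center update
            members = X[assign == j]
            if len(members):
                C[j] = members.mean(axis=0)
    return C, assign
```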
Journal of Turbulence, Dec 2, 2015
arXiv (Cornell University), Feb 24, 2019
arXiv (Cornell University), Jun 28, 2016