Forest Packing: Fast, Parallel Decision Forests

Learning Random Forests on the GPU

Random Forests are a popular and powerful machine learning technique, with several fast multi-core CPU implementations. Since many other machine learning methods have seen impressive speedups from GPU implementations, applying GPU acceleration to random forests seems like a natural fit. Previous attempts to use GPUs have relied on coarse-grained task parallelism and have yielded inconclusive or unsatisfying results. We introduce CudaTree, a GPU Random Forest implementation which adaptively switches between data and task parallelism. We show that, for larger datasets, this algorithm is faster than highly tuned multi-core CPU implementations.
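The adaptive switch the abstract describes can be sketched as a size-based heuristic: large nodes are built with data parallelism (workers cooperate over samples), while small subtrees are each handed to a single task. This is an illustrative toy, not CudaTree's actual code; the threshold and names are assumptions.

```python
def choose_mode(n_samples, threshold=1000):
    # Large nodes: evaluate candidate splits with data parallelism
    # (many workers scan one node's samples together).
    # Small nodes: one task builds the whole subtree (task parallelism).
    return "data" if n_samples >= threshold else "task"

def simulate(n_samples, threshold=1000):
    """Count how many nodes each mode would handle, assuming every
    data-parallel node splits its samples exactly in half."""
    counts = {"data": 0, "task": 0}
    stack = [n_samples]
    while stack:
        n = stack.pop()
        mode = choose_mode(n, threshold)
        counts[mode] += 1
        if n > 1 and mode == "data":
            # only data-parallel nodes keep splitting in this toy model;
            # a task-parallel node is counted once for its whole subtree
            stack += [n // 2, n - n // 2]
    return counts
```

For example, starting from 8000 samples the upper levels of the tree run data-parallel and the many small subtrees near the leaves become independent tasks.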

PACSET (Packed Serialized Trees): Reducing Inference Latency for Tree Ensemble Deployment

ArXiv, 2020

We present methods to serialize and deserialize tree ensembles that optimize inference latency when models are not already loaded into memory. This arises whenever models are larger than memory, but also systematically when models are deployed on low-resource devices, such as in the Internet of Things, or run as Web micro-services where resources are allocated on demand. Our packed serialized trees (PACSET) encode reference locality in the layout of a tree ensemble using principles from external memory algorithms. The layout interleaves correlated nodes across multiple trees, uses leaf cardinality to collocate the nodes on the most popular paths and is optimized for the I/O blocksize. The result is that each I/O yields a higher fraction of useful data, leading to a 2-6 times reduction in classification latency for interactive workloads.
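The core layout idea, interleaving correlated nodes across trees so the first I/O blocks hold the nodes every inference touches first, can be sketched as follows. This is a simplified illustration under assumed names and structure, not PACSET's actual serialization format.

```python
def interleaved_layout(trees, block_size):
    """trees: list of trees, where trees[t][d] is the list of node ids
    at depth d of tree t. Returns fixed-size I/O blocks of node ids."""
    order = []
    max_depth = max(len(t) for t in trees)
    for depth in range(max_depth):      # shallow levels first...
        for t in trees:                 # ...interleaved across trees
            if depth < len(t):
                order.extend(t[depth])
    # pack the linear order into blocks matching the I/O blocksize
    return [order[i:i + block_size] for i in range(0, len(order), block_size)]
```

With two tiny trees and a block size of 3, both roots land in the very first block, so the first read already serves the first step of traversal in every tree.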

Implementing Decision Trees and Forests on a GPU

2008

We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition. Our strategy for evaluation involves mapping the data structure describing a decision forest to a 2D texture array. We navigate through the forest for each point of the input data in parallel using an efficient, non-branching pixel shader. For training, we compute the responses of the training data to a set of candidate features, and scatter the responses into a suitable histogram using a vertex shader. The histograms thus computed can be used in conjunction with a broad range of tree learning algorithms. We demonstrate results for object recognition which are identical to those obtained on a CPU, obtained in about 1% of the time. To our knowledge, this is the first time a method has been proposed which is capable of evaluating or training decision trees on a GPU. Our method leverages the full parallelism of the GPU. Although we use features common to computer vision to demonstrate object recognition, our framework can accommodate other kinds of features for more general utility within computer science.
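The non-branching traversal the abstract describes can be illustrated with a complete binary tree stored in flat arrays in heap order (node i's children are 2i+1 and 2i+2): the comparison result, 0 or 1, selects the child arithmetically, so the loop body contains no data-dependent branch. Array names and layout here are assumptions for illustration, not the paper's texture format.

```python
def evaluate(feature, threshold, leaf_value, depth, x):
    """feature[i], threshold[i]: split test at internal node i (heap
    order); leaf_value[i]: prediction stored at leaf index i."""
    i = 0
    for _ in range(depth):
        # (x[feature[i]] > threshold[i]) is 0 or 1, so the next node
        # index is computed arithmetically -- no branch to diverge on
        i = 2 * i + 1 + (x[feature[i]] > threshold[i])
    return leaf_value[i]
```

A GPU runs this loop in lockstep for every input point, which is why avoiding divergent branches matters.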

Linking parallel algorithmic thinking to many-core memory systems and speedups for boosted decision trees

Proceedings of the International Symposium on Memory Systems, 2018

The current focus of research on parallel computing takes current commercial hardware for granted. Here, we consider an alternative approach: start with a time-tested algorithmic theory and develop a supporting computer architecture and toolchain. This paper focuses on the hybrid memory architecture of this computer platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two. A key part of this architecture is a flexible all-to-all interconnection network that connects processors to shared memory modules. To understand some recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks including list ranking. A second part of this work contrasts the scalability of applications with that of routines. In particular, regardless of the scalability needs of full applications, some routines may involve smaller problem sizes, and in particular smaller levels of parallelism...
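List ranking, the microbenchmark mentioned above, is classically parallelized by pointer jumping (Wyllie's algorithm). The sketch below runs the jumping rounds serially; on a PRAM-style machine each list comprehension would execute for all nodes in parallel. This is a textbook illustration, not the paper's benchmark code.

```python
def list_rank(nxt):
    """nxt[i] = successor of node i in a linked list, with the tail
    pointing to itself. Returns rank[i] = distance from i to the tail."""
    n = len(nxt)
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    nxt = list(nxt)
    for _ in range(max(1, n).bit_length()):   # O(log n) jumping rounds
        # each round, every node absorbs its successor's rank and then
        # jumps its pointer past that successor (updates read old values)
        rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        nxt = [nxt[nxt[i]] for i in range(n)]
    return rank
```

The pointer chasing in each round is exactly the irregular, latency-bound memory access pattern that makes list ranking a useful probe of a memory system.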

Porting Decision Tree Algorithms to Multicore using FastFlow

Proc. of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Barcelona, Spain, 2010, pp. 7-23. doi:10.1007/978-3-642-15880-3_7

The whole computer hardware industry has embraced multicores. For these machines, extreme optimisation of sequential algorithms is no longer sufficient to exploit the real machine power, which can only be tapped via thread-level parallelism. Decision tree algorithms exhibit natural concurrency that makes them well suited to parallelisation. This paper presents an approach for easy yet efficient porting of an implementation of the C4.5 algorithm to multicores. The parallel port requires minimal changes to the original sequential code and achieves up to a 7x speedup on an Intel dual quad-core machine.
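The natural concurrency in tree induction is that candidate split attributes can be scored independently. The toy below farms one task per attribute over a thread pool, in the spirit of a FastFlow farm; it uses a simplified Gini impurity rather than C4.5's gain ratio, and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def gini(labels):
    """Gini impurity of a label multiset (0 means pure)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(rows, labels, attr):
    # weighted impurity of the partition induced by attribute `attr`
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    n = len(labels)
    return attr, sum(len(g) / n * gini(g) for g in groups.values())

def best_split(rows, labels):
    """Score every attribute concurrently; return the best attribute."""
    attrs = range(len(rows[0]))
    with ThreadPoolExecutor() as pool:   # one task per attribute
        scores = pool.map(lambda a: split_score(rows, labels, a), attrs)
    return min(scores, key=lambda s: s[1])[0]
```

Because each attribute's score depends only on read-only training data, the tasks need no synchronization beyond collecting results, which is why this porting pattern requires so few changes to sequential code.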