Big Data Classification: Aspects on Many Features and Many Observations
Abstract
In this paper we discuss the performance of classical classification methods on Big Data. We distinguish between two cases: many features and many observations. For the many-features case we look at projection methods, distance-based methods, and feature selection. For the many-observations case we mainly consider subsampling. The examples in this paper show that standard classification methods should not be blindly applied to Big Data.
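To make the subsampling idea for the many-observations case concrete, here is a minimal sketch. It is not taken from the paper: the synthetic two-class Gaussian data, the nearest-centroid classifier, the sample sizes, and all function names are assumptions chosen only for illustration. It fits the same simple classifier once on all training observations and once on a small random subsample, then compares test accuracy.

```python
import random


def make_data(n, rng):
    """Hypothetical toy data: two 2-D Gaussian classes centred at (0, 0) and (2, 2)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 2.0 * label
        x = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((x, label))
    return data


def fit_centroids(data):
    """Return the per-class mean vector (a nearest-centroid classifier)."""
    sums, counts = {}, {}
    for x, y in data:
        s = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            s[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def accuracy(centroids, data):
    """Fraction of observations in data that are classified correctly."""
    hits = sum(predict(centroids, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)               # fixed seed so the run is reproducible
train = make_data(20000, rng)        # the "many observations" training set
test = make_data(2000, rng)          # held-out data for evaluation
subsample = rng.sample(train, 500)   # simple random subsample without replacement

acc_full = accuracy(fit_centroids(train), test)
acc_sub = accuracy(fit_centroids(subsample), test)
print(f"full data: {acc_full:.3f}, subsample of {len(subsample)}: {acc_sub:.3f}")
```

On data this simple, the subsample is expected to give nearly the same accuracy as the full training set. That does not carry over automatically to other data or classifiers, which is in line with the abstract's warning against applying standard methods blindly.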
Notes
- Thanks to T. Glasmachers for suggesting this definition.
- This part of the paper was supported by the Mercator Research Center Ruhr, grant Pr-2013-0015, see http://www.largescalesvm.de/.
- This simulation was carried out using the R-packages BatchJobs (Bischl et al. 2015) and mlr on the SLURM cluster of the Statistics Department of TU Dortmund University.
- This example is inspired by Fan et al. (2011).
References
- Bair, E., Hastie, T., Paul, D., & Tibshirani, R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119–137.
- Bickel, P. J., & Levina, E. (2004). Some theory for Fisher’s linear discriminant function, “naive Bayes”, and some alternatives when there are many more variables than observations. Bernoulli, 10, 989–1010.
- Bischl, B., Lang, M., Mersmann, O., Rahnenfuehrer, J., & Weihs, C. (2015). BatchJobs and BatchExperiments: Abstraction mechanisms for using R in batch environments. Journal of Statistical Software, 64(11). doi:10.18637/jss.v064.i11
- Boulesteix, A. L. (2004). PLS dimension reduction for classification with microarray data. Statistical Applications in Genetics and Molecular Biology, 3, 1–33.
- Fan, J., Fan, Y., & Wu, Y. (2011). High-dimensional classification. In T. T. Cai, & X. Shen (Eds.), High-dimensional data analysis (pp. 3–37). New Jersey: World Scientific.
- Graf, H. P., Cosatto, E., Bottou, L., Durdanovic, I., & Vapnik, V. (2005). Parallel support vector machines: The cascade SVM. Advances in Neural Information Processing Systems, 17, 521–528.
- Kiiveri, H. T. (2008). A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations. BMC Bioinformatics, 9, 195. doi:10.1186/1471-2105-9-195
- Meyer, O., Bischl, B., & Weihs, C. (2013). Support vector machines on large data sets: Simple parallel approaches. In M. Spiliopoulou, L. Schmidt-Thieme, & R. Jannings (Eds.), Data analysis, machine learning, and knowledge discovery (pp. 87–95). Berlin: Springer.
- R Core Team (2014). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/
Author information
Authors and Affiliations
- Chair of Computational Statistics, Faculty of Statistics, TU Dortmund, Dortmund, Germany
  Claus Weihs
- Department of Statistics, TU Dortmund University, Dortmund, Germany
  Daniel Horn & Bernd Bischl
Corresponding author
Correspondence to Claus Weihs.
Editor information
Editors and Affiliations
- Jacobs University Bremen, Bremen, Germany
  Adalbert F.X. Wilhelm
- Universität Ulm, Institute of Medical Systems Biology, Ulm, Baden-Württemberg, Germany
  Hans A. Kestler
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Weihs, C., Horn, D., Bischl, B. (2016). Big Data Classification: Aspects on Many Features and Many Observations. In: Wilhelm, A., Kestler, H. (eds) Analysis of Large and Complex Data. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-25226-1_10
- DOI: https://doi.org/10.1007/978-3-319-25226-1_10
- Published: 04 August 2016
- Publisher Name: Springer, Cham
- Print ISBN: 978-3-319-25224-7
- Online ISBN: 978-3-319-25226-1
- eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0); Springer Nature Proceedings excluding Computer Science