The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
References
Achtert E, Bernecker T, Kriegel H-P, Schubert E, Zimek A (2009) ELKI in time: ELKI 0.2 for the performance evaluation of distance measures for time series. In: Proceedings of the 11th international symposium on spatial and temporal databases (SSTD), Aalborg, Denmark, pp 436–440
Achtert E, Böhm C, Kriegel H-P, Kröger P, Zimek A (2007) Robust, complete, and efficient correlation clustering. In: Proceedings of the 7th SIAM international conference on data mining (SDM), Minneapolis, MN, pp 413–418
Achtert E, Goldhofer S, Kriegel H-P, Schubert E, Zimek A (2012) Evaluation of clusterings—metrics and visual support. In: Proceedings of the 28th international conference on data engineering (ICDE), Washington, DC, pp 1285–1288
Achtert E, Hettab A, Kriegel H-P, Schubert E, Zimek A (2011) Spatial outlier detection: data, algorithms, visualizations. In: Proceedings of the 12th international symposium on spatial and temporal databases (SSTD), Minneapolis, MN, pp 512–516
Achtert E, Kriegel H-P, Reichert L, Schubert E, Wojdanowski R, Zimek A (2010) Visual evaluation of outlier detection models. In: Proceedings of the 15th international conference on database systems for advanced applications (DASFAA), Tsukuba, Japan, pp 396–399
Achtert E, Kriegel H-P, Schubert E, Zimek A (2013) Interactive data mining with 3D-parallel-coordinate-trees. In: Proceedings of the ACM international conference on management of data (SIGMOD), New York City, NY, pp 1009–1012
Achtert E, Kriegel H-P, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Proceedings of the 20th international conference on scientific and statistical database management (SSDBM), Hong Kong, China, pp 580–585
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th international conference on very large data bases (VLDB), Santiago de Chile, Chile, pp 487–499
Alsabti K, Ranka S, Singh V (1998) An efficient k-means clustering algorithm. In: Proceedings of IPPS/SPDP workshop on high performance data mining
Anderberg MR (1973) Cluster analysis for applications. Probability and mathematical statistics. Academic Press, Cambridge
Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of the 18th Annual ACM-SIAM symposium on discrete algorithms (SODA), New Orleans, LA, pp 1027–1035
Arya S, Mount DM (1993) Approximate nearest neighbor queries in fixed dimensions. In: Proceedings of the 4th annual ACM/SIGACT-SIAM symposium on discrete algorithms (SODA), Austin, TX, pp 271–280
Bayardo Jr RJ, Goethals B, Zaki MJ (eds) (2005) FIMI ’04, Proceedings of the IEEE ICDM workshop on frequent itemset mining implementations, Brighton, UK, November 1, 2004, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org
Beckmann N, Kriegel H-P, Schneider R, Seeger B (1990) The R*-tree: an efficient and robust access method for points and rectangles. In: Proceedings of the ACM international conference on management of data (SIGMOD), Atlantic City, NJ, pp 322–331
Bentley JL (1975) Multidimensional binary search trees used for associative searching. Commun ACM 18(9):509–517
Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbors. In: Proceedings of the 23rd international conference on machine learning (ICML), Pittsburgh, PA, pp 97–104
Bezanson J, Edelman A, Karpinski S, Shah VB (2014) Julia: a fresh approach to numerical computing. CoRR, arXiv:1411.1607
Bock H (2007) Clustering methods: a history of k-means algorithms. In: Brito P, Cucumel G, Bertrand P, Carvalho F (eds) Selected contributions in data analysis and classification. Springer, Berlin, pp 161–172
Bodon F (2003) A fast APRIORI implementation. In: Proceedings of the ICDM workshop on frequent itemset mining implementations (FIMI ’03), Melbourne, Florida, USA
Borgelt C (2003) Efficient implementations of Apriori and Eclat. In: Proceedings of the ICDM workshop on frequent itemset mining implementations (FIMI ’03), Melbourne, Florida, USA
Breunig MM, Kriegel H-P, Ng R, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the ACM international conference on management of data (SIGMOD), Dallas, TX, pp 93–104
Budak C, Georgiou T, Agrawal D, El Abbadi A (2013) GeoScope: online detection of geo-correlated information trends in social networks. Proc VLDB Endow 7(4):229–240
Campello RJGB, Moulavi D, Zimek A, Sander J (2015) Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans Knowl Discov Data (TKDD) 10(1):5:1–51
Campos GO, Zimek A, Sander J, Campello RJGB, Micenková B, Schubert E, Assent I, Houle ME (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Discov 30:891–927
Ciaccia P, Patella M, Zezula P (1997) M-tree: an efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd international conference on very large data bases (VLDB), Athens, Greece, pp 426–435
Cordeiro RLF, Traina AJM, Faloutsos C, Traina C Jr (2013) Halite: fast and scalable multiresolution local-correlation clustering. IEEE Trans Knowl Data Eng 25(2):387–401
Eaton JW, Bateman D, Hauberg S, Wehbring R (2014) GNU Octave version 3.8.1 manual: a high-level interactive language for numerical computations. CreateSpace Independent Publishing Platform
Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the 20th international conference on machine learning (ICML), Washington, DC, pp 147–153
Eppstein D (1998) Fast hierarchical clustering and other applications of dynamic closest pairs. In: Proceedings of the 9th annual ACM-SIAM symposium on discrete algorithms (SODA), San Francisco, CA, pp 619–628
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd ACM international conference on knowledge discovery and data mining (KDD), Portland, OR, pp 226–231
Forgy EW (1965) Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21:768–769
Fournier-Viger P, Gomariz A, Gueniche T, Soltani A, Wu C, Tseng VS (2014) SPMF: a Java open-source pattern mining library. J Mach Learn Res 15(1):3389–3393
Färber I, Günnemann S, Kriegel H-P, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: MultiClust: 1st International workshop on discovering, summarizing and using multiple clusterings held in conjunction with KDD 2010, Washington, DC
Gan J, Tao Y (2015) DBSCAN revisited: mis-claim, un-fixability, and approximation. In: Proceedings of the ACM international conference on management of data (SIGMOD), Melbourne, Australia, pp 519–530
Geusebroek JM, Burghouts GJ, Smeulders AWM (2005) The Amsterdam library of object images. Int J Comput Vis 61(1):103–112
Goethals B, Zaki MJ (eds) (2003) FIMI ’03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 workshop on frequent itemset mining implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR workshop proceedings. CEUR-WS.org
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. ACM SIGKDD Explor 11(1):10–18
Hamerly G (2010) Making k-means even faster. In: Proceedings of the 10th SIAM international conference on data mining (SDM), Columbus, OH, pp 130–140
Hamerly G, Elkan C (2003) Learning the k in k-means. In: Proceedings of the Annual conference on neural information processing systems (NIPS), Vancouver, BC, pp 281–288
Hartigan JA (1975) Clustering algorithms. Wiley, New York
Hartigan JA, Wong MA (1979) Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc Ser C (Appl Stat) 28(1):100–108
Jones E, Oliphant T, Peterson P et al (2001) SciPy: open source scientific tools for Python
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24(7):881–892
Kaufman L, Rousseeuw PJ (1990) Finding groups in data: an introduction to cluster analysis. Wiley, New York
Kriegel H-P, Kröger P, Sander J, Zimek A (2011) Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 1(3):231–240
Leutenegger ST, Edgington JM, Lopez MA (1997) STR: a simple and efficient algorithm for R-tree packing. In: Proceedings of the 13th international conference on data engineering (ICDE), Birmingham, UK, pp 497–506
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, vol 1, pp 281–297
Mahran S, Mahar K (2008) Using grid for accelerating density-based clustering. In: Proceedings of 8th IEEE international conference on computer and information technology, CIT 2008, Sydney, Australia, pp 35–40
Murtagh F (1985) A survey of algorithms for contiguity-constrained clustering and related problems. Comput J 28(1):82–88
Müllner D (2011) Modern hierarchical, agglomerative clustering algorithms. arXiv preprint, arXiv:1207.0016
Nijssen S, Kok JN (2006) Frequent subgraph miners: runtimes don’t say everything. In: Proceedings of the 4th workshop on mining and learning with graphs (MLG), Berlin, Germany, pp 173–180
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
Pelleg D, Moore A (1999) Accelerating exact k-means algorithms with geometric reasoning. In: Proceedings of the 5th ACM international conference on knowledge discovery and data mining (SIGKDD), San Diego, CA, pp 277–281
Pelleg D, Moore A (2000) X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th international conference on machine learning (ICML), Stanford University, CA, vol 1, pp 727–734
Phillips SJ (2002) Acceleration of k-means and related clustering algorithms. In: The 4th international workshop on algorithm engineering and experiments (ALENEX) 2002, San Francisco, CA, pp 166–177
Prim RC (1957) Shortest connection networks and some generalizations. Bell Syst Tech J 36(6):1389–1401
R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing. http://www.r-project.org/
Rohlf FJ (1973) Algorithm 76: hierarchical clustering using the minimum spanning tree. Comput J 16(1):93–95
Sander J, Ester M, Kriegel H-P, Xu X (1998) Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications. Data Min Knowl Discov 2(2):169–194
Schubert E, Koos A, Emrich T, Züfle A, Schmid KA, Zimek A (2015) A framework for clustering uncertain data. Proc VLDB Endow 8(12):1976–1979
Schubert E, Zimek A, Kriegel H-P (2013) Geodetic distance queries on R-trees for indexing geographic data. In: Proceedings of the 13th international symposium on spatial and temporal databases (SSTD), Munich, Germany, pp 146–164
Schubert E, Zimek A, Kriegel H-P (2014) Generalized outlier detection with flexible kernel density estimates. In: Proceedings of the 14th SIAM international conference on data mining (SDM), Philadelphia, PA, pp 542–550
Schubert E, Zimek A, Kriegel H-P (2014) Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min Knowl Discov 28(1):190–237
Sculley D (2010) Web-scale k-means clustering. In: Proceedings of the 19th international conference on world wide web (WWW), Raleigh, NC, pp 1177–1178
Sibson R (1973) SLINK: an optimally efficient algorithm for the single-link cluster method. Comput J 16(1):30–34
Šidlauskas D, Jensen CS (2014) Spatial joins in main memory: implementation matters! Proc VLDB Endow 8(1):97–100
Slonim N, Aharoni E, Crammer K (2013) Hartigan’s k-means versus Lloyd’s k-means-is it time for a change? In: Proceedings of the 23rd international joint conference on artificial intelligence (IJCAI), Beijing, China
Sneath PHA (1957) The application of computers to taxonomy. J Gen Microbiol 17:201–226
Sonnenburg S, Rätsch G, Henschel S, Widmer C, Behr J, Zien A, De Bona F, Binder A, Gehl C, Franc V (2010) The SHOGUN machine learning toolbox. J Mach Learn Res 11:1799–1802
Sowell B, Salles MAV, Cao T, Demers AJ, Gehrke J (2013) An experimental analysis of iterated spatial joins in main memory. Proc VLDB Endow 6(14):1882–1893
Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD workshop on text mining, vol 400, pp 525–526
Steinhaus H (1956) Sur la division des corps matériels en parties. Bull Acad Pol Sci 1:801–804
Ward JH Jr (1963) Hierarchical grouping to optimize an objective function. J Am Stat Assoc 58(301):236–244
Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, Burlington
Wolpert DH (1996) The lack of a priori distinctions between learning algorithms. Neural Comput 8(7):1341–1390
Wörlein M, Meinl T, Fischer I, Philippsen M (2005) A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In: Proceedings of the 9th European conference on principles and practice of knowledge discovery in databases (PKDD), Porto, Portugal, pp 392–403
Yu C, Ooi BC, Tan K-L, Jagadish V (2001) Indexing the distance: an efficient method to KNN processing. In: Proceedings of the 27th international conference on very large data bases (VLDB), Roma, Italy, pp 421–430
Zheng Z, Kohavi R, Mason L (2001) Real world performance of association rule algorithms. In: Proceedings of the 7th ACM international conference on knowledge discovery and data mining (SIGKDD), San Francisco, CA, pp 401–406