DECODE: a new method for discovering clusters of different densities in spatial data
Abstract
When clusters with different densities and noise lie in a spatial point set, the major obstacle to classifying these data is the determination of the thresholds for classification, which may form a series of bins for allocating each point to different clusters. Much of the previous work has adopted a model-based approach, but is either incapable of estimating the thresholds automatically, or limited to only two point processes, i.e. noise and clusters with the same density. In this paper, we present a new density-based clustering method (DECODE), in which a spatial data set is presumed to consist of different point processes, and clusters with different densities belong to different point processes. DECODE is based upon a reversible jump Markov Chain Monte Carlo (MCMC) strategy and is divided into three steps. The first step is to map each point in the data to its _m_th nearest distance, defined as the distance between a point and its _m_th nearest neighbor. In the second step, classification thresholds are determined via a reversible jump MCMC strategy. In the third step, clusters are formed by spatially connecting the points whose _m_th nearest distances fall into a particular bin defined by the thresholds. Four experiments, on two simulated data sets and two seismic data sets, are used to evaluate the algorithm. Results on simulated data show that our approach is capable of discovering the clusters automatically. Results on seismic data suggest that the clustered earthquakes identified by DECODE either imply the epicenters of forthcoming strong earthquakes or indicate the areas with the most intensive seismicity, which is consistent with the tectonic states and estimated stress distribution in the associated areas.
The comparison between DECODE and other state-of-the-art methods, such as DBSCAN, OPTICS and Wavelet Cluster, illustrates the contribution of our approach: although DECODE can be computationally expensive, it is capable of identifying the number of point processes and simultaneously estimating the classification thresholds with little prior knowledge.
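The first and third steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the brute-force distance computation, and the `link_radius` parameter are assumptions made for illustration, and the thresholds are passed in by hand because the reversible jump MCMC step that estimates them is not reproduced here.

```python
import numpy as np

def mth_nearest_distance(points, m):
    """Step 1: distance from each point to its m-th nearest neighbor (brute force)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column m holds the distance to the m-th nearest neighbor.
    return np.sort(dists, axis=1)[:, m]

def assign_bins(mth_dists, thresholds):
    """Allocate each point to a bin delimited by ascending thresholds.

    In DECODE the thresholds come from reversible jump MCMC; here they are
    assumed to be supplied by the caller.
    """
    return np.digitize(mth_dists, thresholds)

def connect_points(points, bins, target_bin, link_radius):
    """Step 3 (simplified): group points of one bin into connected components.

    Two points are linked when they lie within link_radius of each other;
    link_radius is a hypothetical parameter, not taken from the paper.
    Points outside target_bin keep the label -1.
    """
    labels = np.full(len(points), -1)
    unvisited = set(np.flatnonzero(bins == target_bin).tolist())
    cluster = 0
    while unvisited:
        seed = unvisited.pop()
        labels[seed] = cluster
        stack = [seed]
        while stack:
            cur = stack.pop()
            near = [j for j in unvisited
                    if np.linalg.norm(points[cur] - points[j]) <= link_radius]
            for j in near:
                unvisited.remove(j)
                labels[j] = cluster
                stack.append(j)
        cluster += 1
    return labels
```

With a single threshold, points whose _m_th nearest distance falls below it land in bin 0 (the denser process) and the rest in bin 1; calling `connect_points` on bin 0 then yields spatially connected clusters of the dense points.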
Author information
Authors and Affiliations
- Institute of Geographical Sciences and Natural Resources Research, 11A, Datun Road Anwai, Beijing, 100101, China
  Tao Pei, A.-Xing Zhu & Chenghu Zhou
- Institute for Mathematical Sciences, Imperial College, London, SW7 2PG, UK
  Tao Pei
- Department of Mathematics, Imperial College, London, UK
  Ajay Jasra
- Department of Mathematics and Institute for Mathematical Sciences, Imperial College, London, UK
  David J. Hand
- Department of Geography, University of Wisconsin Madison, 550N, Park Street, Madison, WI, 53706-1491, USA
  A.-Xing Zhu
Authors
- Tao Pei
- Ajay Jasra
- David J. Hand
- A.-Xing Zhu
- Chenghu Zhou
Corresponding author
Correspondence to Chenghu Zhou.
Additional information
Responsible editor: Charu Aggarwal.
About this article
Cite this article
Pei, T., Jasra, A., Hand, D.J. et al. DECODE: a new method for discovering clusters of different densities in spatial data. Data Min Knowl Disc 18, 337–369 (2009). https://doi.org/10.1007/s10618-008-0120-3
- Received: 05 November 2007
- Accepted: 21 October 2008
- Published: 20 November 2008
- Issue date: June 2009
- DOI: https://doi.org/10.1007/s10618-008-0120-3