Automatic Database Clustering: Issues and Algorithms

Dynamic Clustering in Object-Oriented Databases: An Advocacy for Simplicity

2000

We present in this paper three dynamic clustering techniques for Object-Oriented Databases (OODBs). The first two, Dynamic, Statistical & Tunable Clustering (DSTC) and StatClust, exploit both comprehensive usage statistics and the inter-object reference graph. They are quite elaborate. However, they are also complex to implement and induce a high overhead. The third clustering technique, called Detection & Reclustering of Objects (DRO), is based on the same principles, but is much simpler to implement. These three clustering algorithms have been implemented in the Texas persistent object store and compared in terms of clustering efficiency (i.e., overall performance increase) and overhead using the Object Clustering Benchmark (OCB). The results obtained showed that DRO induced a lighter overhead while still achieving better overall performance.

Data base reorganization by clustering method

Information Systems, 1978

This paper is concerned with the problem of data re-allocation on a moving-head disk in order to minimize the average access time. From the analysis of the chronological accesses to the records during the reference period, the initial exploitation cost of the implementation is situated among all the possible ones, and the benefit of a reorganization may be evaluated. The reorganization itself is a two-stage process: first the file is partitioned by a clustering algorithm, and then the clusters are allocated to cylinders. Applying this method to an 11,000-record file reduced the mean access time by a factor of 2.
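The two-stage process described in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the greedy co-access merging, the `capacity` parameter, the one-cluster-per-cylinder allocation, and the use of cylinder switches as a stand-in for access time are all assumptions made for illustration.

```python
from collections import Counter

def cluster_by_coaccess(trace, capacity):
    """Stage 1 (assumed heuristic): merge records accessed consecutively,
    strongest pairs first, keeping at most `capacity` records per cluster."""
    pair_counts = Counter(
        tuple(sorted((a, b))) for a, b in zip(trace, trace[1:]) if a != b
    )
    cluster_of = {r: {r} for r in trace}
    for (a, b), _ in pair_counts.most_common():
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is not cb and len(ca) + len(cb) <= capacity:
            ca |= cb
            for r in cb:
                cluster_of[r] = ca
    # Collect each distinct cluster once, in order of first appearance.
    seen, clusters = set(), []
    for r in trace:
        c = cluster_of[r]
        if id(c) not in seen:
            seen.add(id(c))
            clusters.append(sorted(c))
    return clusters

def allocate_to_cylinders(clusters):
    """Stage 2 (assumed policy): place each cluster on its own cylinder."""
    return {rec: cyl for cyl, cluster in enumerate(clusters) for rec in cluster}

def cylinder_switches(trace, cylinder_of):
    """Proxy cost: number of consecutive accesses that change cylinder."""
    return sum(cylinder_of[a] != cylinder_of[b] for a, b in zip(trace, trace[1:]))

# Hypothetical chronological access trace over four records.
trace = ["a", "b", "a", "b", "c", "d", "c", "d", "a", "b"]
clusters = cluster_by_coaccess(trace, capacity=2)
reorganized = allocate_to_cylinders(clusters)
initial = {"a": 0, "c": 0, "b": 1, "d": 1}  # hypothetical initial layout
cost_before = cylinder_switches(trace, initial)
cost_after = cylinder_switches(trace, reorganized)
```

Comparing `cost_before` with `cost_after` mirrors the abstract's idea of evaluating the benefit of a reorganization from the access history before committing to it.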

An evaluation model for clustering strategies in the O2 object-oriented database system

Lecture Notes in Computer Science, 1990

This paper addresses the problem of clustering complex data on disk to minimize the number of I/Os in data-intensive applications. It describes the clustering strategies adopted in the O2 system. As clustering depends on both structural aspects (composition hierarchy of the classes) and dynamic aspects (the methods associated with the classes), the paper details a cost model in order to evaluate the benefits of the clustering strategies. This model makes it possible to automatically derive new clustering strategies. To this end, a derivation algorithm which builds an optimal strategy in linear time is presented.

Performance evaluation for clustering algorithms in object-oriented database systems

Database and Expert Systems Applications, 1995

It is widely acknowledged that good object clustering is critical to the performance of object-oriented databases. However, object clustering always involves some kind of overhead for the system. The aim of this paper is to propose a modelling methodology in order to evaluate the performances of different clustering policies. This methodology has been used to compare the performances of three clustering algorithms found in the literature (Cactis, CK and ORION) that we considered representative of the current research in the field of object ...

Automating the design of multi-dimensional clustering tables in relational databases

The ability to physically cluster a database table on multiple dimensions is a powerful technique that offers significant performance benefits in many OLAP, warehousing, and decision-support systems. An industrial implementation of this technique for the DB2® Universal Database™ (DB2 UDB) product, called multidimensional clustering (MDC), which co-exists with other classical forms of data storage and indexing methods, was described in VLDB 2003. This paper describes the first published model for automating the selection of clustering keys in single-dimensional and multidimensional relational databases that use a cell/block storage structure for MDC. For any significant dimensionality (3 or more), the possible solution space is combinatorially complex.

Optimizing and Enhancing Performance of Database Engine Using Data Clustering Technique

The sizes of databases are increasing every day. Hence, nowadays, accessing data in an acceptable time is one of the biggest challenges in centralized databases. In centralized databases, the records can be categorized according to their access frequencies: least accessed records (cold data) and most accessed records (hot data). One study shows that in more than 90% of cases queries request hot data, and that 99% of insertion operations are performed on hot data. Thus, categorizing the data set may improve data accessibility. In this paper, we propose a data clustering mechanism based on data access frequency. We consider only hot data and cold data, and divide the whole database into two separate files. The first file contains only hot data and the second file contains only cold data. The time period that defines hot and cold data will vary across application domains. The database engine has direct access to the first database file and, if the data is not available there, looks in the second database file. Finally, the experimental results show how and why the data access time of this approach should be better than that of other available data clustering techniques.
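The lookup order described in this abstract (hot file first, cold file on a miss) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the `HotColdStore` class, the in-memory dictionaries standing in for the two files, and the `hot_threshold` promotion rule are all assumptions.

```python
from collections import Counter

class HotColdStore:
    """Toy two-tier store: the 'hot' tier is consulted before the 'cold' tier."""

    def __init__(self, records, hot_threshold=3):
        self.hot = {}                       # stands in for the hot-data file
        self.cold = dict(records)           # stands in for the cold-data file
        self.hits = Counter()               # access frequency per key
        self.hot_threshold = hot_threshold  # hypothetical promotion threshold

    def get(self, key):
        """Return (value, tier), where tier names the file that served the read."""
        self.hits[key] += 1
        if key in self.hot:
            return self.hot[key], "hot"
        value = self.cold[key]              # fall back to the cold tier
        if self.hits[key] >= self.hot_threshold:
            # Frequently accessed record: move it to the hot tier.
            self.hot[key] = self.cold.pop(key)
        return value, "cold"

store = HotColdStore({"a": 1, "b": 2, "c": 3}, hot_threshold=2)
tiers = [store.get("a")[1] for _ in range(3)]
```

In this sketch the first two reads of `"a"` are served from the cold tier and the third from the hot tier, because the second read crosses the assumed threshold and promotes the record.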

Study of Algorithms for Clustering Records in Document Databases

1997

Response time of an information system can be improved by reducing the number of buckets accessed when retrieving a document set. One approach is to restructure the document base in such a way that similar documents are placed close together in the file space. This ensures a greater probability that the records identified by a query will be collocated within the same bucket. This paper examines two algorithms proposed to solve the clustering problem, and analyzes and predicts the resulting density and retrieval times using random probability theory. Results suggest that, given an acceptable confidence interval, the prediction of file properties before and after clustering is fairly accurate when the characteristic parameters of a file are known.
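The effect described in this abstract, fewer buckets touched once similar documents are stored together, can be shown with a small Python sketch. The record identifiers, the fixed `bucket_size`, and the two file layouts are made up for illustration and do not come from the paper.

```python
def buckets_touched(layout, wanted, bucket_size):
    """Count the distinct buckets that hold at least one wanted record.

    layout: record ids in file order; bucket i holds positions
    [i * bucket_size, (i + 1) * bucket_size).
    """
    position = {rec: i for i, rec in enumerate(layout)}
    return len({position[rec] // bucket_size for rec in wanted})

# Hypothetical file with four similar documents (d*) and four unrelated ones (x*).
unclustered = ["d1", "x1", "d2", "x2", "d3", "x3", "d4", "x4"]
clustered = ["d1", "d2", "d3", "d4", "x1", "x2", "x3", "x4"]
wanted = ["d1", "d2", "d3", "d4"]

before = buckets_touched(unclustered, wanted, bucket_size=2)
after = buckets_touched(clustered, wanted, bucket_size=2)
```

With two records per bucket, retrieving the four similar documents touches four buckets in the interleaved layout and two in the clustered layout.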

A Simple Algorithm for Information Clustering in Network Attached Storage

The increasing scale of networked storage has exposed issues such as data integrity and access latency to distributed data and computation. One approach to reducing wasted network bandwidth is, following the classic locality-of-data principle, to partition interrelated data objects between networked storage nodes. The paper describes an algorithm for data clustering based on the relative interconnectivity between data objects. Because of its simplicity, the method could be incorporated in network attached hard disk storage across the full range of complexity, from embedded controllers to complex network file servers.

Data clustering algorithms: A second look

With the huge volume of digital data, clustering algorithms provide efficient tools for organizing and analyzing data. Clustering algorithms are used in various domains such as bioinformatics, speech recognition, and information retrieval. Clustering is an automatic technique that divides a set of data objects into smaller groups such that the objects within a group are as similar to each other, and as dissimilar to objects in other groups, as possible. This paper reviews and discusses different clustering algorithms, their concepts, advantages, and limitations. A comparison among clustering algorithms is also presented based on certain criteria.

Certain Investigation on Dynamic Clustering in Dynamic Datamining

Clustering is the process of grouping a set of objects into classes of similar objects. Dynamic clustering is a new research area concerned with datasets that have dynamic aspects. It requires the clusters to be updated whenever new data records are added to the dataset, which may change the clustering over time. When updates are continuous and the amount of dynamic data is huge, rescanning the database is not feasible in static data mining, but it is possible in a dynamic data mining process. Dynamic data mining occurs when the derived information is present for the purpose of analysis and the environment is dynamic, i.e., many updates occur. Now that most researchers have established this, they are moving on to solving some of the open problems, and research is concentrating on the problem of applying data mining to dynamic databases. This paper surveys existing work related to dynamic clustering and incremental data clustering.
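The idea of updating clusters as records arrive, without rescanning earlier data, can be illustrated with a simple leader-style incremental scheme. This is a generic sketch and not a method taken from the surveyed papers: the `IncrementalClusters` class, the one-dimensional data, and the `radius` parameter are assumptions.

```python
class IncrementalClusters:
    """Assign each new value to the nearest centroid within `radius`,
    otherwise open a new cluster. Earlier records are never revisited."""

    def __init__(self, radius):
        self.radius = radius   # hypothetical distance threshold
        self.centroids = []    # running mean of each cluster
        self.counts = []       # number of records in each cluster

    def add(self, x):
        """Insert one value and return the index of its cluster."""
        best, best_dist = None, None
        for i, c in enumerate(self.centroids):
            d = abs(x - c)
            if best_dist is None or d < best_dist:
                best, best_dist = i, d
        if best is not None and best_dist <= self.radius:
            # Update the running mean in place instead of recomputing it.
            self.counts[best] += 1
            self.centroids[best] += (x - self.centroids[best]) / self.counts[best]
            return best
        self.centroids.append(float(x))
        self.counts.append(1)
        return len(self.centroids) - 1

model = IncrementalClusters(radius=1.0)
labels = [model.add(x) for x in [1.0, 1.5, 10.0, 10.5, 1.2]]
```

Each insertion costs time proportional to the number of clusters rather than the number of stored records, which is the property that makes this kind of update usable when the data changes continuously.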