K means Clustering Research Papers

A library serves as a source of information and knowledge, storing materials that users draw on to learn. This study was conducted at one of the libraries in the city of Batam. The library holds a varied collection that includes general books, scholarly works, language, history, and more. A recurring problem is that a requested book is sometimes unavailable. In addition, library staff do not know how many books have been borrowed, so they have to go back through the borrowing transaction records in the guest book. To address this, a system was built that processes large amounts of data using the k-means data mining method. In the results obtained from the processed borrowing data, the most frequently borrowed books fall in cluster 1 with 9 items, the least frequently borrowed books fall in cluster 2 with 15 items, the fairly frequently borrowed books...

- Computer Science, Data Mining, Big Data, Cluster Analysis

The idea of evidence accumulation for the combination of multiple clusterings was recently proposed. Taking K-means as the basic algorithm for the decomposition of data into a large number, k, of compact clusters, evidence on pattern association is accumulated, by a voting mechanism, over multiple clusterings obtained by random initializations of the K-means algorithm. This produces a mapping of the clusterings into a new similarity measure between patterns. The final data partition is obtained by applying the single-link method over this similarity matrix. In this paper we further explore and extend this idea, by proposing: (a) the combination of multiple K-means clusterings using variable k; (b) using cluster lifetime as the criterion for extracting the final clusters; and (c) the adaptation of this approach to string patterns. This leads to a more robust clustering technique, with fewer design parameters than the previous approach and potential applications in a wider range of problems.
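The voting scheme described in the abstract above can be illustrated with a minimal sketch. It assumes a fixed k and a fixed similarity threshold for the single-link cut, whereas the paper proposes variable k and a cluster-lifetime criterion; the function names `kmeans_labels`, `co_association`, and `single_link` are hypothetical and not taken from the paper.

```python
import random
from itertools import combinations


def kmeans_labels(points, k, rng, iters=20):
    """Lloyd's K-means from a random initialization; returns one label per point."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def co_association(points, k, runs, seed=0):
    """Fraction of runs in which each pair of points lands in the same cluster."""
    rng = random.Random(seed)
    n = len(points)
    votes = [[0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans_labels(points, k, rng)
        for i, j in combinations(range(n), 2):
            if labels[i] == labels[j]:
                votes[i][j] += 1
                votes[j][i] += 1
    return [[1.0 if i == j else votes[i][j] / runs for j in range(n)] for i in range(n)]


def single_link(similarity, threshold):
    """Single-link cut: connected components of the graph with similarity >= threshold."""
    n = len(similarity)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and similarity[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

A typical call would be `single_link(co_association(points, k=3, runs=50), threshold=0.5)`, where `points` is a list of coordinate tuples. The threshold of 0.5 is an arbitrary illustrative choice.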

- by Anil Jain
- Pattern Recognition, Similarity, K Means, K means algorithm

Color-based region segmentation of skin lesions is one of the key steps for correctly collecting statistics that can help clinicians in their diagnosis. This study describes the use of the differential evolution algorithm for segmenting wounds on the skin. The strengths of differential evolution as an optimization algorithm, such as ease of use, simple operations, effectiveness, and convergence to the global optimum, are carried over to wound image segmentation. The system does not have the disadvantages of classical approaches such as the K-means clustering algorithm, and the results obtained from different wound images are discussed.

- by Mehmet Tunckanat
- Evolutionary Computation, Differential Evolution, Image segmentation, Skin

Classification is one of the most predominant tasks in a wide range of applications such as sentiment analysis in text, voice recognition, image recognition, genetic engineering, and data classification. Though many efficient classification algorithms have been introduced in the past few decades, the drastic increase in the amount of data generated across industry and academia has created a demand for classification algorithms with very high accuracy and robustness. This paper presents a new approach to enhance the accuracy of the classifier by combining a Support Vector Machine (classification algorithm) with the K-Means clustering algorithm and, finally, using K Nearest Neighbours to make the optimal choice on the classification problem. Experiments have shown that this new methodology increases the accuracy of the classification problem and thus serves the intended purpose.

- by Mohamed Hussein
- Psychology, Cognitive Science, Social Interaction, Bullying

This paper reports on recent work applying data mining to the task of finding interesting patterns in earth science data derived from global observing satellites, terrestrial observations, and ecosystem models. Patterns are "interesting" if ecosystem scientists can use them to better understand and predict changes in the global carbon cycle and climate system. The initial goal of the work reported here (which is only part of the overall project) is to use clustering to divide the land and ocean areas of the earth into disjoint regions in an automatic, but meaningful, way that enables the direct or indirect discovery of interesting patterns. Finding "meaningful" clusters requires an approach that is aware of various issues related to the spatial and temporal nature of earth science data: the "proper" measure of similarity between time series, removing seasonality from the data to allow detection of non-seasonal patterns, and the presence of spatial and temporal autocorrelation (i.e., measured values that are close in time and space tend to be highly correlated, or similar). While we have techniques to handle some of these spatiotemporal issues (e.g., removing seasonality) and some issues are not a problem (e.g., spatial autocorrelation actually helps our clustering), other issues require more study (e.g., temporal autocorrelation and its effect on time series similarity). Nonetheless, by using K-means as our clustering algorithm and taking linear correlation as our measure of similarity between time series, we have been able to find some interesting ecosystem patterns, including some that are well known to earth scientists and some that require further investigation.
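Two ingredients named in the abstract above, removing seasonality and using linear correlation as the similarity between time series, can be sketched as follows. The abstract does not say how seasonality is removed; subtracting the mean of each calendar position is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def remove_seasonality(series, period=12):
    """Subtract the mean of each calendar position (e.g. the mean of all Januaries)."""
    means = []
    for m in range(period):
        values = series[m::period]
        means.append(sum(values) / len(values))
    return [x - means[i % period] for i, x in enumerate(series)]


def pearson(a, b):
    """Linear (Pearson) correlation between two equal-length series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)
```

One way to plug this into a clustering routine is to treat `1 - pearson(a, b)` as the dissimilarity between two deseasonalized series, so that strongly correlated locations end up close together.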

- by Alicia Torregrosa
- Data Mining, Time Series, Spatial autocorrelation, Seasonality
- by Larry Gigliotti and +1
- Zoology, Typology, South Dakota, Attitudes

Short-term electricity load forecasting is of paramount importance for estimating the next day's electricity load, which results in energy savings and environmental protection. Electricity demand is influenced (among other things) by the day of the week, the time of year, and special periods and/or days such as Ramadhan, all of which must be identified prior to modeling. This identification, known as day-type identification, must be included in the modeling stage either by segmenting the data and modeling each day-type separately or by including the day-type as an input. This paper investigates a day-type identification approach for the Algerian electricity load. Kohonen maps are used to identify day-types. The K-means clustering method is used as a complementary method to precisely identify the obtained classes. Clustering validity is assessed using a quality criterion. This work has allowed the identification of six different classes.

We describe the use of a binary hierarchical clustering (BHC) framework for clustering of gene expression data. The BHC algorithm involves two major steps. Firstly, the K-means algorithm is used to split the data into two classes. Secondly, the Fisher criterion is applied to the classes to assess whether the splitting is acceptable. The algorithm is applied to the sub-classes recursively and ends when all clusters cannot be split any further. BHC does not require the number of clusters to be known. It does not place any assumption about the number of samples in each cluster or the class distribution. The hierarchical framework naturally leads to a tree structure representation. We show that by arranging the BHC clustered gene expression data in a tree structure, we can easily visualize the cluster results. In addition, the tree structure display allows user judgement in finalizing the clustering result using prior biological knowledge.
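The two-step recursion in the abstract above (split with K-means, then accept or reject the split with the Fisher criterion) can be sketched on 1-D data. The exact form of the criterion, the acceptance `threshold`, and the `min_size` guard are assumptions for illustration; the abstract does not specify them.

```python
def mean(values):
    return sum(values) / len(values)


def two_means(values, iters=50):
    """K-means with k = 2 on 1-D data, seeded at the minimum and maximum.

    Assumes the values are not all identical, so both groups stay non-empty.
    """
    c0, c1 = min(values), max(values)
    left, right = list(values), []
    for _ in range(iters):
        left = [v for v in values if abs(v - c0) <= abs(v - c1)]
        right = [v for v in values if abs(v - c0) > abs(v - c1)]
        n0, n1 = mean(left), mean(right)
        if (n0, n1) == (c0, c1):  # centres stopped moving
            break
        c0, c1 = n0, n1
    return left, right


def fisher_ratio(left, right):
    """Squared distance between class means over the sum of class variances."""
    m0, m1 = mean(left), mean(right)
    scatter = mean([(v - m0) ** 2 for v in left]) + mean([(v - m1) ** 2 for v in right])
    return float("inf") if scatter == 0 else (m0 - m1) ** 2 / scatter


def bhc(values, threshold=8.0, min_size=4):
    """Return a nested tuple tree whose leaves are sorted lists of values."""
    if len(values) < min_size or min(values) == max(values):
        return sorted(values)
    left, right = two_means(values)
    if fisher_ratio(left, right) < threshold:  # split rejected: keep as one cluster
        return sorted(values)
    return (bhc(left, threshold, min_size), bhc(right, threshold, min_size))
```

The nested tuples mirror the tree representation mentioned in the abstract: each accepted split becomes an internal node, and the number of leaves is found by the recursion rather than supplied in advance.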

- by Alan Liew
- Cognitive Science, Gene expression, Computer Software, Cluster Analysis

The paper presents a model based on fuzzy methods for churn prediction in retail banking. The study was done on the real, anonymised data of 5000 clients of a retail bank. The use of real data is a great strength of the study, as many studies use old, irrelevant, or artificial data. Canonical discriminant analysis was applied to reveal variables that provide maximal separation between clusters of churners and non-churners. A combination of standard deviation, canonical discriminant analysis, and k-means clustering results was used for outlier detection. Due to the fuzzy nature of practical customer relationship management problems it was expected, and shown, that fuzzy methods performed better than the classical ones. According to the results of the preliminary data exploration and fuzzy clustering with different values of the input parameters for the fuzzy c-means algorithm, the best parameter combination was chosen and applied to the training data set. Four different prediction models, called prediction engines, have been developed. The definitions of clients in the fuzzy transitional conditions and the distance of k instances fuzzy sums were introduced. The prediction engine using these sums performed best in churn prediction, applied to both balanced and non-balanced test sets.

- by Bojana Dalbelo
- Computer Science, Customer Relationship Management, Retail Banking, Fuzzy

This paper presents a new enhanced algorithm for extracting text from degraded document images on the basis of probabilistic models. The observed document image is considered as a mixture of Gaussian densities which represent the foreground and background document image components. The EM algorithm is introduced in order to estimate and improve the parameters of the mixture of densities recursively. The initial parameters of the EM algorithm are estimated by the k-means clustering method. After the parameter estimation, the document image is partitioned into text and background classes by means of an ML approach. The performance of the proposed approach is evaluated on a variety of degraded documents from the collections of the National Library of Tunisia.
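The pipeline in the abstract above (k-means initialization, EM refinement of a Gaussian mixture, then a per-pixel class decision) can be sketched for grey levels in one dimension. This is a simplified two-component illustration, not the paper's implementation: the function names are hypothetical, and the final decision rule (pick the component with the largest weighted density) is an assumption about what the class assignment step looks like.

```python
import math


def kmeans_two_classes(xs, iters=20):
    """1-D K-means with k = 2, seeded at the darkest and brightest value."""
    centers = [min(xs), max(xs)]
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for x in xs:
            groups[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
        centers = [sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers, groups


def gaussian(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_mixture(xs, iters=30, min_var=1e-3):
    """EM for a two-component Gaussian mixture, initialized from K-means."""
    means, groups = kmeans_two_classes(xs)
    weights = [len(g) / len(xs) for g in groups]
    variances = [
        max(sum((x - m) ** 2 for x in g) / len(g), min_var) if g else 1.0
        for g, m in zip(groups, means)
    ]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in xs:
            p = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pi / total for pi in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk > 0:
                weights[k] = nk / len(xs)
                means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
                var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
                variances[k] = max(var, min_var)
    return weights, means, variances


def classify(x, weights, means, variances):
    """Index of the component with the largest weighted density at x."""
    scores = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
    return scores.index(max(scores))
```

With grey levels as input, component 0 collects the darker pixels (typically ink) and component 1 the brighter ones (typically paper), because the K-means seeds are the minimum and maximum values.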

An automated approach to degradation analysis is proposed that uses a rotating machine's acoustic signal to determine Remaining Useful Life (RUL). High resolution spectral features are extracted from the acoustic data collected over the entire lifetime of the machine. A novel approach to the computation of Mutual Information based Feature Subset Selection is applied, to remove redundant and irrelevant features, that does not require class label boundaries of the dataset or spectral locations of developing defect to be known or pre-estimated. Using subsets of the feature space, multi-class linear and Radial Basis Function (RBF) Support Vector Machine (SVM) classifiers are developed and a comparison of their performance is provided. Performance of all classifiers is found to be very high, 85 to 98%, with RBF SVMs outperforming linear SVMs when a smaller number of features are used. As larger numbers of features are used for classification, the problem space becomes more linearly separable and the linear SVMs are shown to have comparable performance. A detailed analysis of the misclassifications is provided and an approach to better understand and interpret costly misclassifications is discussed. While defining class label boundaries using an automated k-means clustering algorithm improves performance with an accuracy of approximately 99%, further analysis shows that in 88% of all misclassifications the actual class of failure had the next highest probability of occurring. Thus, a system that incorporates probability distributions as a measure of confidence for the predicted RUL would provide additional valuable information for scheduling preventative maintenance.

Most scientific data analyses involve analyzing voluminous data collected from various instruments. Efficient parallel/concurrent algorithms and frameworks are the key to meeting the scalability and performance requirements entailed in such scientific data analyses. The recently introduced MapReduce technique has gained a lot of attention from the scientific community for its applicability in large parallel data analyses. Although there are many evaluations of the MapReduce technique using large textual data collections, there have been only a few evaluations for scientific data analyses. The goals of this paper are twofold. First, we present our experience in applying the MapReduce technique to two scientific data analyses: (i) High Energy Physics data analyses; (ii) K-means clustering. Second, we present CGL-MapReduce, a streaming-based MapReduce implementation, and compare its performance with Hadoop.
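To show how one K-means iteration decomposes into map and reduce steps, the following single-process sketch simulates the pattern in plain Python. It does not use the Hadoop or CGL-MapReduce APIs; `map_point`, `reduce_cluster`, and `kmeans_iteration` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_point(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))
    return nearest, (point, 1)


def reduce_cluster(values):
    """Reduce: sum the vectors and counts for one key, then average."""
    total = [0.0] * len(values[0][0])
    count = 0
    for vector, n in values:
        total = [t + x for t, x in zip(total, vector)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One K-means step: map every point, group by key, reduce each group."""
    shuffled = defaultdict(list)  # stands in for the framework's shuffle phase
    for p in points:
        key, value = map_point(p, centroids)
        shuffled[key].append(value)
    updated = list(centroids)  # centroids with no assigned points are kept
    for key, values in shuffled.items():
        updated[key] = reduce_cluster(values)
    return updated
```

Because the reducer only needs vector sums and counts, map outputs can be combined locally before the shuffle, which is what makes K-means a natural fit for this programming model. The full algorithm repeats `kmeans_iteration` until the centroids stop changing.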

- by Geoffrey Fox
- Parallel Algorithms, Parallel Programming, Data Analysis, High Energy Physics

Watershed transformation is a common technique for image segmentation. However, its use for automatic medical image segmentation has been limited particularly due to oversegmentation and sensitivity to noise. Employing prior shape knowledge has demonstrated robust improvements to medical image segmentation algorithms. We propose a novel method for enhancing watershed segmentation by utilizing prior shape and appearance knowledge. Our method iteratively aligns a shape histogram with the result of an improved k-means clustering algorithm of the watershed segments. Quantitative validation of magnetic resonance imaging segmentation results supports the robust nature of our method.

- by SHANMUGA PRIYA B
- Computer Science, Cluster Analysis, Computer Applications, K means Clustering

Improving students' academic performance is not an easy task for the academic community of higher learning. The academic performance of engineering and science students during their first year at university is a turning point in their educational path and usually affects their Grade Point Average (GPA) decisively. Student evaluation factors such as class quizzes, mid-term and final exams, assignments, and lab work are studied. It is recommended that all this correlated information be conveyed to the class teacher before the final exam is conducted. This study will help teachers to reduce the drop-out ratio to a significant level and improve the performance of students. In this paper, we present a hybrid procedure based on the Decision Tree data mining method and Data Clustering that enables academicians to predict students' GPA, so that the instructor can take the necessary steps to improve student academic performance. Grade Point Average (GPA) is a co...

- by Md. Hedayetul Islam
- Bioinformatics, Mathematics, Computer Science, Data Mining


Mercu Buana University Campus D is part of Mercu Buana University and began operating in 2013. From 2013 until 2017, Campus D fell short of its target for enrolling new students. This can be due to various things, including the lack of a precise marketing strategy. Therefore, in this study the authors build an application that implements data mining concepts, using clustering and forecasting methods to obtain information from existing registrant data. The information can then be used by decision makers to determine effective and efficient marketing strategies.

- by Fajar Masya
- Computer Science, Data Mining, Clustering and Classification Methods, Forecasting

We demonstrate here the development of a non-invasive optical forward-scattering system, called 'scatterometer', for rapid identification of bacterial colonies. The system is based on the concept that variations in refractive indices and size, relative to the arrangement of cells in bacterial colonies growing on a semi-solid agar surface, will generate different forward-scattering patterns. A 1.2-1.5 mm colony size for a 1 mm laser beam and brain heart infusion agar as substrate were used as fixed variables. The current study is focused on exploring identification of Listeria monocytogenes and other Listeria species by exploiting the known differences in their phenotypic characters. Using diffraction theory, we could model the scattering patterns and explain the appearance of radial spokes and the rings seen in the scattering images of L. monocytogenes. Further, we developed software to extract features (scalar values) from images of the scattering patterns using Zernike moment invariants and principal component analysis; the features were then grouped using K-means clustering. We achieved 91-100% accuracy in detecting different species. It was also observed that substrate variations affect the scattering patterns of Listeria. Finally, a database was constructed based on the scattering patterns from 108 different strains belonging to six species of Listeria. The overall system proved to be simple, non-invasive, and virtually reagent-less, and has the potential for automated, user-friendly detection and differentiation of L. monocytogenes and other Listeria species colonies grown on agar plates within 5-10 min analysis time.

The growing demand for link bandwidth and node capacity is a frequent phenomenon in IP network backbones. Within this context, traffic prediction is essential for the network operator. Traffic prediction can be undertaken based on link traffic or on origin-destination (OD) traffic which presents better results. This work investigates a methodology for traffic prediction based on multidimensional OD traffic, focusing on the stage of short-term traffic prediction using Principal Components Analysis as a technique for dimensionality reduction and a Local Linear Model based on K-means as a technique for prediction and trend analysis. The results validated with data on a real network present a satisfactory margin of error for use in practical situations.

- by José Everardo Bessa Maia
- Spine, Principal Component Analysis, Linear Model, Predictive models

Gait analysis is an important aspect of Biomedical Engineering. In the recent past, researchers have applied several signal processing methods for the analysis of gait activities. Sensors such as accelerometers, gyroscopes and pressure sensors are more commonly used to identify gait activities remotely. Most of the applications have multiple sensors placed on a single board which is used for gait assessment. However, the problem with multiple sensors is the cross talk introduced by one sensor due to another sensor. Some of the applications use a single sensor such as accelerometer with dual axis measuring the gait activity in horizontal and vertical planes. Depending on the orientation of the accelerometer, the two axial outputs could have overlapping spectra which is very difficult to observe. Spectral and temporal filtering is not suitable for this because of overlapping spectra due to simultaneous movements of the foot in the horizontal and vertical planes. To reliably identify the gait activities, there is a need to decompose and separate the two vertical and horizontal acceleration signals. The earlier research has described a novel method which can be used remotely to identify the gait in ITW children. This paper discusses a lab based automated classification method using Blind Source Separation (BSS) technique to identify toe walking gait from normal gait in Idiopathic Toe Walkers (ITW) children. The outcome of the research study reveals that the BSS techniques in association with K-means classifier can suitably distinguish toe-walking gait from normal gait in ITW children with 97.9 ± 0.2% accuracy.

- by Hung Nguyen and +2
- Biomedical Engineering, Gait Analysis, Medical Biotechnology, K means Clustering

A wide range of computational methods and tools for data analysis are available. In this study we took advantage of those available technological advancements to develop prediction models for the prediction of a Type-2 Diabetic Patient. We aim to investigate how the diabetes incidents are affected by patients' characteristics and measurements. Efficient predictive modeling is required for medical researchers and practitioners. This study proposes Hybrid Prediction Model (HPM) which uses Simple K-means clustering algorithm aimed at validating chosen class label of given data (incorrectly classified instances are removed, i.e. pattern extracted from original data) and subsequently applying the classification algorithm to the result set. C4.5 algorithm is used to build the final classifier model by using the k-fold cross-validation method. The Pima Indians diabetes data was obtained from the University of California at Irvine (UCI) machine learning repository datasets. A wide range of different classification methods have been applied previously by various researchers in order to find the best performing algorithm on this dataset. The accuracies achieved have been in the range of 59.4-84.05%. However the proposed HPM obtained a classification accuracy of 92.38%. In order to evaluate the performance of the proposed method, sensitivity and specificity performance measures that are used commonly in medical classification studies were used.

- by Durga Toshniwal
- Computer Science, Machine Learning, Data Mining, Data Analysis

Cluster analysis is one of the most widely used analytical methods in data mining, and the choice of method directly influences the clustering result. This paper discusses the standard k-means clustering algorithm and analyzes its shortcomings: in particular, standard k-means calculates the distance from each data point to every cluster centre, and repeating this calculation in each iteration makes the algorithm inefficient. This paper introduces an optimized algorithm that addresses this problem by using a simple data structure to store some information in every iteration and reusing that information in the next iteration. The introduced algorithm does not need to calculate the distance from each data point to every cluster centre in each iteration, which saves running time. Experimental results show that the improved algorithm can improve both the speed and the accuracy of clustering by reducing the computational complexity of the standard k-means algorithm.
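The abstract above does not say which information is stored between iterations. One plausible reading, sketched below purely as an assumption, is to cache each point's current label and its distance to that cluster's centre: if the point is no farther from its updated centre than the cached distance, it keeps its label and the scan over all centres is skipped. Note that this shortcut is a heuristic and may not always reproduce the assignments of standard k-means. The name `cached_kmeans` is hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cached_kmeans(points, centers, iters=10):
    """K-means that skips the full centre scan when a point moved no farther away.

    Returns (labels, centers, number of full scans performed).
    """
    centers = [tuple(c) for c in centers]
    labels = [None] * len(points)
    cached = [float("inf")] * len(points)  # squared distance to the assigned centre
    full_scans = 0
    for _ in range(iters):
        for i, p in enumerate(points):
            if labels[i] is not None:
                d = sq_dist(p, centers[labels[i]])
                if d <= cached[i]:  # not farther than before: keep the label
                    cached[i] = d
                    continue
            dists = [sq_dist(p, c) for c in centers]  # fall back to a full scan
            labels[i] = dists.index(min(dists))
            cached[i] = dists[labels[i]]
            full_scans += 1
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers, full_scans
```

Standard k-means would perform `len(points) * iters` full scans; comparing that figure with the returned `full_scans` gives a rough sense of the work saved on a given data set.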

- by Sadhana Tiwari
- Kd Tree, K means Clustering

Overlapping is one of the topics in wireless sensor networks that is considered by researchers in the last decades. An appropriate overlapping management system can prolong network lifetime and decrease network recovery time. This paper proposes an intelligent and knowledge‐based overlapping clustering protocol for wireless sensor networks, called IKOCP. This protocol uses some of the intelligent and knowledge‐based systems to construct a robust overlapping strategy for sensor networks. The overall network is partitioned to several regions by a proposed multicriteria decision‐making controller to monitor both small‐scale and large‐scale areas. Each region is managed by a sink, where the whole network is managed by a base station. The sensor nodes are categorized by various clusters using the low‐energy adaptive clustering hierarchy (LEACH)‐improved protocol in a way that the value of p is defined by a proposed support vector machine–based mechanism. A proposed fuzzy system determines that noncluster heads associate with several clusters in order to manage overlapping conditions over the network. Cluster heads are changed into clusters in a period by a suggested utility function. Since network lifetime should be prolonged and network traffic should be alleviated, a data aggregation mechanism is proposed to transmit only crucial data packets from cluster heads to sinks. Cluster heads apply a weighted criteria matrix to perform an inner‐cluster routing for transmitting data packets to sinks. Simulation results demonstrate that the proposed protocol surpasses the existing methods in terms of the number of alive nodes, network lifetime, average time to recover, dead time of first node, and dead time of last node.

- by Mohammad Samadi
- Sensors and Sensing, Data Mining, Sensor, Clustering and Classification Methods

In this paper, the different general motivations of gamers for playing video games are explored. Surprisingly, to date little research has been devoted to the characterization of the gamer, based on general game motivations. By means of an online survey, we questioned 2985 Flemish gamers on 11 general game motivations. K-means clustering was used to distinguish four distinctive gamer profiles: the overall convinced gamer, the convinced competitive gamer, the escapist gamer and the passtime gamer.

- by Dimitri Schuurman
- Game studies, Video Game, Digital Games, Online survey
- by Michael Galarnyk
- Machine Learning, Data Analysis, Principal Component Analysis, K-means

Kernel k-means is an extension of the standard k-means clustering algorithm that identifies nonlinearly separable clusters. In order to overcome the cluster initialization problem associated with this method, in this work we propose the global kernel k-means algorithm, a deterministic and incremental approach to kernel-based clustering. Our method adds one cluster at each stage through a global search procedure consisting of several executions of kernel k-means from suitable initializations. This algorithm does not depend on cluster initialization, identifies nonlinearly separable clusters and, due to its incremental nature and search procedure, locates near-optimal solutions while avoiding poor local minima. Furthermore, a modification is proposed to reduce the computational cost that does not significantly affect the solution quality. We test the proposed methods on artificial data and also, for the first time, we employ kernel k-means for MRI segmentation along with a novel kernel. The proposed methods compare favorably to kernel k-means with random restarts.
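For readers unfamiliar with kernel k-means, the key step is that the distance from a point to a cluster centre in feature space can be computed from kernel values alone. The identity below is the standard one; the notation is assumed here rather than taken from the paper: $\phi$ is the feature map, $K_{ij} = \phi(x_i)^\top \phi(x_j)$ is the kernel matrix, $C_k$ is the k-th cluster, and $m_k$ is its centre in feature space.

```latex
\lVert \phi(x_i) - m_k \rVert^2
  = K_{ii}
  - \frac{2}{|C_k|} \sum_{x_j \in C_k} K_{ij}
  + \frac{1}{|C_k|^2} \sum_{x_j \in C_k} \sum_{x_l \in C_k} K_{jl},
\qquad
m_k = \frac{1}{|C_k|} \sum_{x_j \in C_k} \phi(x_j)
```

Each point is assigned to the cluster that minimizes this quantity, so the centres never need to be formed explicitly.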

- by Aristidis Likas
- Data Mining, Pattern Recognition, Clustering, Global Optimization

Purpose -This paper aims to propose a solution for recommending digital library services based on data mining techniques (clustering and predictive classification). Design/methodology/approach -Data mining techniques are used to recommend digital library services based on the user's profile and search history. First, similar users were clustered together, based on their profiles and search behavior. Then predictive classification for recommending appropriate services to them was used. It has been shown that users in the same cluster have a high probability of accepting similar services or their patterns. Findings -The results indicate that k-means clustering and Naive Bayes classification may be used to improve the accuracy of service recommendation. The overall accuracy is satisfying, while average accuracy depends on the specific service. The results were better for frequently occurring services. Research limitations/implications -Datasets were used from the KOBSON digital library. Only clustering and predictive classification was applied. If the correlation between the service and the institution were higher, it would have better accuracy. Originality/value -The paper applied different and efficient data mining techniques for clustering digital library users based on their profiles and their search behavior, i.e. users' interaction with library services, and obtain user patterns with respect to the library services they use. A digital library may apply this approach to offer appropriate services to new users more easily. The recommendations will be based on library items that similar users have already found useful.

- by Ana Kovacevic
- Data Mining, Databases, Digital Library, Library and Information Studies

Image processing methods are now widely adopted in the medical field to support earlier detection of abnormalities such as breast cancer, lung cancer, and brain cancer. This paper concentrates on the segmentation of lung cancer tumors from X-ray images, Computed Tomography (CT) images, and MRI images. In the pre-processing stage, mean and median filters are used. In the image segmentation stage, Otsu's thresholding and K-means clustering are used to segment the lung images and locate the tumors. To evaluate the two segmentation methods, Signal-to-Noise Ratio (SNR), Mean Square Error (MSE), and Peak Signal-to-Noise Ratio (PSNR) are computed on the segmented images. Better results are obtained for the K-means segmentation irrespective of the images.
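Two of the evaluation measures named above, MSE and PSNR, follow standard definitions and can be computed as in the sketch below. Images are represented as flat lists of pixel intensities, and an 8-bit range (`max_value=255`) is assumed. The abstract does not state which image serves as the reference, so that choice is left to the caller.

```python
import math


def mse(reference, test):
    """Mean squared error between two equal-length pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)


def psnr(reference, test, max_value=255):
    """Peak signal-to-noise ratio in decibels; infinite for identical images."""
    error = mse(reference, test)
    if error == 0:
        return float("inf")
    return 10 * math.log10(max_value ** 2 / error)
```

A lower MSE and a higher PSNR both indicate that the test image is closer to the reference.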

- by IAEME Publication
- Computed Tomography, MRI, Lung Cancer, PSNR

This paper applies different lean strategies to a real production problem at a furniture manufacturing company. The objective of the study is to improve the productivity of the factory floor. Initially, the existing production process was analyzed to get a clear picture of its current condition. It was observed that workers performed some redundant tasks, which resulted in longer waiting periods. Various lean strategies such as Single-Minute Exchange of Dies (SMED), Gemba (the real place), and Short Interval Control were proposed and implemented on the floor, which resulted in significant improvement in monetary terms and a reduction in the processing time of different lots. Multifactor productivity increased from 1.85 to 2.26; the average distance travelled by the workers, OEE, wastage of materials, and rejected quantity reduced significantly; and additional production equivalent to more than 1,48,705 taka per day was made possible.

The paper is based on data from a questionnaire survey (interviews) conducted in the western part of Poland on 183 rural tourism and agri-tourism small and medium enterprises. The classification of enterprises was based on the methodology proposed by Wysocki (1996) and included the k-means clustering algorithm. As the result of the research three types of SMEs were separated, including the top resilient enterprises aimed mainly at tourism activity and usually connected with horse recreation, a cluster of mixed SMEs for which tourism activity was an additional and less important source of income, and a group of SMEs for which tourism activity was an additional but important source of income. The classification may be used as a hint for rural development policy makers for future support of rural tourism / agri-tourism development.

The normalised cut method has been used effectively for image segmentation by representing an image as a weighted graph in a global view. It performs segmentation by partitioning the graph into sub-graphs. A clustering algorithm is applied so that sub-graphs with common similarities are grouped into one cluster, while dissimilar sub-graphs are separated into distinct clusters. Clustered segments from the normalised cuts are then produced. Because cluster initialisation influences the segmentation result, the clustering algorithm is optimised to achieve better segmentation. With this approach applied to normalised cuts based image segmentation, the constraint of using the normalised cuts algorithm in image segmentation can be alleviated. In this paper, the clustering algorithm is evaluated with normalised cuts image segmentation, and the effect of image complexity on the normalised cuts segmentation process is presented.

- by F Wong
- Graph Theory, Clustering Algorithms, Image segmentation, Fuzzy Clustering

Although several studies have assessed Land Degradation (LD) states in the Mediterranean basin through the use of composite indices, relatively few have evaluated the impact of specific LD drivers at the local scale. In this work, a computational strategy is introduced to define homogeneous areas at risk and the main factors acting as determinants of LD. The procedure consists of three steps and is applied to a set of ten environmental indicators available at the municipality scale in Latium, central Italy. A principal component analysis extracting latent patterns and simplifying data complexity was carried out on the original data matrix. Subsequently, a k-means cluster analysis was applied on a restricted number of meaningful, latent factors extracted by PCA in order to produce a classification of the study area into homogeneous regions. Finally, a stepwise discriminant analysis was performed to determine which indicators contributed the most to the definition of homogeneous regions. Three classes of "risky" regions were identified according to the main drivers of LD acting at the local scale. These include: (i) soil sealing (coupled with landscape fragmentation, fire risk, and related processes), (ii) soil salinization due to agricultural intensification, and (iii) soil erosion due to farmland depopulation and land abandonment in sloping areas. Areas at risk for LD covered 56 and 63% of the investigated areas in 1970 and 2000, respectively.

- by Marco Zitti
- •
- Geography, Socioeconomics, Principal Component Analysis, Environmental Management

Image segmentation and classification are two fundamental steps in pattern recognition. Performing medical image segmentation or classification with deep learning models requires training on a large annotated image dataset. The dermoscopy images (ISIC archive) considered in this work do not have ground truth information for lesion segmentation, and manually labelling this dataset is time-consuming. To overcome this issue, a self-learning annotation scheme was proposed as part of a two-stage deep learning algorithm. The two-stage algorithm consists of a U-Net segmentation model with the annotation scheme and a CNN classifier model. The annotation scheme uses a K-means clustering algorithm along with merging conditions to obtain initial labelling information for training the U-Net model. The classifier models, namely ResNet-50 and LeNet-5, were trained and tested on the image dataset without segmentation for comparison, and with the U-Net segmentation to implement the proposed self-learning Artificial Intelligence (AI) framework. The classification results of the proposed AI framework achieved a training accuracy of 93.8% and a testing accuracy of 82.42% when compared with the two classifier models trained directly on the input images.
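
As a rough illustration of how K-means can produce an initial label mask without ground truth, the sketch below clusters the grey levels of a tiny hypothetical image into two groups and marks the darker cluster as foreground. The image values, the function name `kmeans_1d`, and the evenly spaced initialisation are assumptions made for this example; the merging conditions and the U-Net stage described in the abstract are not reproduced.

```python
def kmeans_1d(values, k=2, iters=20):
    """Plain Lloyd's k-means on scalar values (k >= 2); returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic initialisation: spread the centroids evenly over the value range
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: mean of each cluster (an empty cluster keeps its centroid)
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels


# Hypothetical 4x4 greyscale image with a dark blob on a bright background
image = [
    [200, 198, 205, 210],
    [199,  60,  55, 207],
    [202,  58,  50, 204],
    [201, 203, 206, 208],
]
pixels = [p for row in image for p in row]
centroids, labels = kmeans_1d(pixels, k=2)

# Treat the cluster with the lower mean intensity as the candidate lesion
dark = min(range(2), key=lambda c: centroids[c])
mask = [[1 if labels[r * 4 + c] == dark else 0 for c in range(4)] for r in range(4)]
for row in mask:
    print(row)
```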

Given a metric d defined on a set V of points (a metric space), we define the ball B(v, r) centered at v ∈ V and having radius r ≥ 0 to be the set {q ∈ V |d(v, q) ≤ r}. In this work, we consider the problem of computing a minimum cost k-cover for a given set P ⊆ V of n points, where k > 0 is some given integer which is also part of the input. For κ ≥ 0, a κ-cover for subset Q ⊆ P is a set of at most κ balls, each centered at a point in P , whose union covers (contains) Q. The cost of a set D of balls, denoted cost(D), is the sum of the radii of those balls.

- by Gaurav Kanade
- •
- Pure Mathematics, Multidisciplinary, Clustering, APPROXIMATION ALGORITHM
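
The definitions in the minimum cost k-cover abstract above (ball, cover, and cost as the sum of radii) translate directly into code. The following sketch only checks and prices a given set of balls; it does not search for a minimum cost cover. The point set and function names are hypothetical.

```python
def ball(points, dist, center, radius):
    """B(center, radius): every point within `radius` of `center` under `dist`."""
    return {q for q in points if dist(center, q) <= radius}


def is_cover(points, dist, balls, targets):
    """True if the union of the balls contains every target point."""
    covered = set()
    for center, radius in balls:
        covered |= ball(points, dist, center, radius)
    return set(targets) <= covered


def cost(balls):
    """Cost of a set of balls: the sum of their radii."""
    return sum(radius for _, radius in balls)


# Hypothetical points on the real line with the absolute-difference metric
P = [0, 1, 2, 10, 11]
d = lambda a, b: abs(a - b)

two_balls = [(1, 1), (10, 1)]  # a 2-cover of P with cost 2
one_ball = [(2, 9)]            # a 1-cover of P with cost 9

print(is_cover(P, d, two_balls, P), cost(two_balls))
print(is_cover(P, d, one_ball, P), cost(one_ball))
```

Here allowing two balls instead of one lowers the cost, which is why the bound on the number of balls is part of the problem input.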

Reference evapotranspiration (RET), an indicator of atmospheric evaporating capability over a hypothetical reference surface, was calculated using the Penman-Monteith method for 75 stations across the Qinghai-Tibetan Plateau between 1971 and 2004. Generally, both annual and seasonal RET decreased over most of the plateau during the study period. Multivariate linear models were used to determine the contributions of climate factors to RET change, including air temperature, air humidity, solar radiation, and wind speed. Spatial differences in the causes of RET change were detected by K-means clustering analysis. The analysis indicates that wind speed dominated the changes in RET almost throughout the year, especially in the north of the study region, whereas radiation was the leading factor in the southeast, especially during the summertime. Although the recent warming trend over the plateau would have increased RET, the combined effect of reduced wind speed and shortened sunshine duration negated the effect of rising temperature and caused RET to decrease in general. The significant decrease in surface wind speed corresponded to the decreasing trends of upper-air zonal wind and the decline of the pressure gradient, possibly as a result of the recent warming.

- by Zhi-yong Yin
- •
- Climate variability, Multidisciplinary, Seasonality, Evapotranspiration

In a study conducted on the extraction of protein from the leaves of 30 freshwater aquatic plants, the highest standing crop fresh yield was found in Typha latifolia (2650 g/m²). The Bio-Medical Data Processing (BMDP) K-means clustering program with K = 2 showed that 11 of the 30 plants had a high protein nitrogen extractability as well as a high nitrogen content of the extracted protein. Among these, leaf protein from Allmania nodiflora had the highest content of crude protein (62.7%) and β-carotene (782.4 µg/g). Leaf protein prepared from Hygrophila spinosa, Ottelia alismoides and Polygonum barbatum had low in-vitro digestibility. The levels of alkaloids and polyphenols were lower in the extracted protein compared to those present in the original leaf sample.

- by Anjana Dewanji
- •
- Aquatic Plants, Digestion, Nitrogen, Wild edible plants

In this paper we propose a method for road extraction. Road extraction involves two main steps: detection of the road, which may include non-road parts such as buildings and parking lots, followed by morphological operations that remove the non-road parts based on their features. We use K-means clustering to detect the road area, possibly along with some non-road area. Morphological operations are then used to remove the non-road area, based on the assumption that road regions form an elongated area that is the largest connected component.

- by Rohit Maurya
- •
- Computer Science, Data Mining, Image segmentation, Information Processing
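
The road extraction abstract above relies on the assumption that the road is the largest connected component of the detected area. A minimal sketch of that filtering step follows, using a breadth-first search over a binary mask with 4-connectivity. The mask is a hypothetical stand-in for a K-means road cluster, and the choice of 4-connectivity is an assumption; the paper's actual morphological operations are not specified here.

```python
from collections import deque


def largest_component(mask):
    """Return a copy of `mask` that keeps only its largest 4-connected region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first search collects one connected region
                region = []
                queue = deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(region) > len(best):
                    best = region
    out = [[0] * cols for _ in range(rows)]
    for y, x in best:
        out[y][x] = 1
    return out


# Hypothetical cluster mask: a long "road" in the top row and a small "building" blob
candidate = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
]
road = largest_component(candidate)
for row in road:
    print(row)
```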

This paper presents ongoing work on using data mining to evaluate a software system's maintainability according to the ISO/IEC-9126 quality standard. More specifically, it proposes a methodology for knowledge acquisition by integrating data from source code with the expertise of a software system's evaluators. A process for the extraction of elements from source code and Analytical Hierarchical Processing for assigning

- by Evangelos Theodoridis and +2
- •
- Software Engineering, Programming Languages, Data Mining, Open Source

Anomaly detection refers to methods that provide warnings of unusual behaviors which may compromise the security and performance of communication networks. This paper proposes a novel model for network anomaly detection that combines a baseline, K-means clustering and particle swarm optimization (PSO). The baseline consists of profiles of normal network traffic behavior, generated by applying the Baseline for Automatic Backbone Management (BLGBA) model to a historical SNMP network data set, while K-means is an unsupervised clustering algorithm used to recognize patterns or features in data sets. To escape the local optima problem, K-means is combined with PSO, a metaheuristic whose main characteristics include low computational complexity and dependence on a small number of input parameters. The proposed anomaly detection approach classifies data clusters from baseline and real traffic using K-means combined with PSO. Anomalous behaviors can be identified by comparing the distance between real traffic and cluster centroids. Tests were performed on the network of the State University of Londrina, and the detection and false alarm rates obtained are promising.

- by Lucas Sampaio and +2
- •
- Computational Complexity, Anomaly Detection, Software, Supervised Learning
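
The final detection step in the anomaly detection abstract above, comparing the distance between observed traffic and cluster centroids, can be sketched as a simple threshold test. The centroids, samples, and threshold below are hypothetical values chosen for illustration; the BLGBA baseline and the PSO-assisted clustering that would produce the centroids are not reproduced.

```python
import math


def nearest_centroid_distance(sample, centroids):
    """Euclidean distance from a sample to its closest centroid."""
    return min(math.dist(sample, c) for c in centroids)


def flag_anomalies(samples, centroids, threshold):
    """Mark a sample as anomalous when it lies farther than `threshold` from every centroid."""
    return [nearest_centroid_distance(s, centroids) > threshold for s in samples]


# Hypothetical centroids of normal traffic, e.g. (packets/s, bytes/s) in arbitrary units
centroids = [(10.0, 5.0), (50.0, 20.0)]
traffic = [(11.0, 5.0), (49.0, 21.0), (90.0, 60.0)]

print(flag_anomalies(traffic, centroids, threshold=5.0))
```

The threshold trades detection rate against false alarms, so in practice it would have to be tuned on historical data.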

Our aim is to find clusters of spatial patterns of criminality among young people and the total population in Medellin, Colombia, in the period between October 2013 and November 2014. For this purpose, a hexagonal city network was created and we looked for groupings into clusters among thirteen tort/delict variables. To find the clusters, we used subtractive clustering and fuzzy c-means clustering. Running them revealed territorial microcorridors where high criminality is consolidated over several periods of time, as well as temporal patterns showing how some high criminality zones are gradually taking shape. Additionally, spatial patterns of criminality were sought among youths, and it was found that this age group usually tends to exhibit higher variability in criminal dynamics and to operate in smaller territories than the rest of the population.

- by Juan Diego Jaramillo-Morales
- •
- Geography, Violence, Clustering and Classification Methods, Jóvenes

It is important to reveal the relationship between internal migration and the unemployment rate in the settlements receiving this migration in order to understand the causes and consequences of unemployment in these areas. However, a separate analysis is needed to show the direction of this relationship for both the regions receiving migrants and the regions losing them. To this end, both migration and unemployment statistics should be examined at the settlement level. In this study, we carried out a comparative analysis of cities in Turkey in terms of the mobility of people among cities and the unemployment rates in these areas, and touched on the discussions in the literature. The main purpose of the study is to cluster the provinces in terms of unemployment and migration using the k-means algorithm as a data mining technique and to compare the resulting clusters. Unemployment and migration statistics from TÜİK (Turkish Statistical Institute) were used in the analysis. To assess the performance of the k-means algorithm, the number of clusters was determined in each trial so as to minimize intra-cluster distances and maximize inter-cluster distances. Analyses were performed in RStudio with the R programming language, and the findings were visualized to make them clearer and easier to understand.

- by Muhammet Atalay
- •
- Immigration, Migration, Unemployment, K-means
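
One common way to pick the number of clusters along the lines described in the abstract above is to run k-means for several values of k and compare the within-cluster sum of squares (WCSS). The sketch below does this in plain Python rather than in R, which the study used. The data points are hypothetical, and the farthest-point initialisation is an assumption made so the example is deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Lloyd's k-means; returns (centroids, clusters)."""
    # Deterministic farthest-point initialisation (an assumption, for reproducibility)
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    clusters = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: centroid becomes the cluster mean (empty clusters keep theirs)
        centroids = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


def wcss(centroids, clusters):
    """Within-cluster sum of squared distances to the centroids."""
    return sum(sq_dist(p, c) for c, members in zip(centroids, clusters) for p in members)


# Hypothetical (unemployment, net migration) values forming three loose groups
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9), (1, 8)]
scores = {k: wcss(*kmeans(data, k)) for k in range(1, 5)}
for k, score in scores.items():
    print(k, round(score, 3))
```

WCSS always shrinks as k grows, so the usual heuristic is to look for the value of k after which the improvement levels off rather than for the minimum itself.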

The K-means algorithm is very popular in the machine learning community due to its inherent simplicity. However in its basic form it is not suitable for use in problems which contain periodic attributes, such as oscillator phase, hour of day or directional heading. A commonly used technique of trigonometrically encoding periodic input attributes to artificially generate the required topology introduces a systematic error. In this paper, a metric which induces a conceptually correct topology for periodic attributes is embedded into the K-means algorithm. This requires solving a non-convex minimization problem in the maximization step. Results of numerical experiments comparing the proposed algorithm to K-means with trigonometric encoding on synthetically generated data are reported. The advantage of using the proposed K-means algorithm is also shown on a real example using gas load data to build simple predictive models.

- by Emil Pelikan
- •
- Cognitive Science, K Means, K means Clustering
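
To illustrate why periodic attributes need special handling in K-means, as discussed in the abstract above, the sketch below compares three distances for hour-of-day values: the naive difference, the arc distance on the circle, and the Euclidean (chord) distance after a cosine/sine encoding. The chord length is not proportional to the arc length, which is one plausible source of the distortion the abstract mentions; this is an illustrative assumption, not the paper's analysis or its proposed metric.

```python
import math


def circular_distance(a, b, period):
    """Shortest distance between a and b on a circle of circumference `period`."""
    d = abs(a - b) % period
    return min(d, period - d)


def chord_distance(a, b, period):
    """Euclidean distance after encoding each value as (cos, sin) on the unit circle."""
    ta = 2 * math.pi * a / period
    tb = 2 * math.pi * b / period
    return math.hypot(math.cos(ta) - math.cos(tb), math.sin(ta) - math.sin(tb))


# 23:00 and 01:00 are two hours apart, but the naive difference says 22
print(abs(23 - 1), circular_distance(23, 1, 24))

# Doubling the arc (6 h to 12 h) does not double the chord length
print(chord_distance(0, 6, 24), chord_distance(0, 12, 24))
```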

Customer churn is a significant issue that is regularly associated with the life cycle of a business. When a business is in the growth phase of its life cycle, sales increase exponentially and the number of new customers far exceeds the number of churners. On the other hand, organizations in the mature phase of their life cycle focus on reducing the rate of customer churn. This research work proposes an efficient computational intelligence model comprising clustering achieved through an improvised K-Means algorithm and classification achieved through a Non Linear Support Vector Machine.

- by IJCSMC Journal
- •
- Mathematics, Computer Science, Algorithms, Information Technology

The paper studies the pattern of financial performance for listed companies originating from different industries - financial intermediation, beverage and food industry, energy, pharmaceuticals and chemicals - in four Central and Eastern European countries - Czech Republic, Hungary, Poland and Romania over a four year period (2003-2006). The financial performance is addressed by taking into account companies' return on assets (ROA) and return on equity (ROE). The research methodology consists of hierarchical and k-means clustering amalgamation techniques, in order to distinguish between naturally occurring similar groups that are statistically significant in terms of industry and/or national influences. Our analysis encompasses a dynamic approach, as it refers to changes in clusters' structure in time and searches for possible explanations of corporate financial performance in this region.

- by Alexandra Horobet
- •
- Eastern Europe, Research Methodology, Czech Republic, Profitability
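
The two performance measures named in the abstract above are usually computed as simple ratios: ROA as net income divided by total assets, and ROE as net income divided by shareholders' equity. The sketch below uses these textbook definitions, which may differ in detail from the ones used in the paper; the company figures are hypothetical.

```python
def return_on_assets(net_income, total_assets):
    """ROA: net income as a fraction of total assets (textbook definition)."""
    return net_income / total_assets


def return_on_equity(net_income, shareholders_equity):
    """ROE: net income as a fraction of shareholders' equity (textbook definition)."""
    return net_income / shareholders_equity


# Hypothetical company: 12 net income, 200 total assets, 80 equity (same currency unit)
roa = return_on_assets(12.0, 200.0)
roe = return_on_equity(12.0, 80.0)
print(f"ROA = {roa:.1%}, ROE = {roe:.1%}")
```

Each company-year then becomes a point (ROA, ROE) that a clustering method such as k-means can group.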

To obtain better control of market trends and profit for the company, timely identification of sales is very important for businesses. Upward and downward trends in sales signify new market trends, and understanding sales trends is important for marketing as well as for customer retention. This research develops a hybrid model that integrates K-means clustering and a fuzzy neural network (KFNN) to forecast the future sales of a printed circuit board factory. Using the K-means clustering technique, the historical data can be grouped into different clusters. The accuracy of the forecasting model can be further improved by referring the new data to be forecast to a more focused region, i.e., a smaller region after clustering. Numerical data on various influencing factors and the actual demand of the printed circuit board (PCB) factory over the past 5 years are collected and input into the hybrid model to forecast future monthly sales. The experimental results derived from the proposed model show the effectiveness of the hybrid model when compared with other approaches.

- by Chen-hao Liu and +1
- •
- Customer Retention, Printed Circuit Board, Case Study, Profitability

Developing intelligent systems to prevent car accidents can be very effective in minimizing the accident death toll. One of the factors that plays an important role in accidents is human error, including driving fatigue relying on new...

- by Javad Haddadnia
- •
- Video Processing, HSV, K means Clustering

Data mining has been used extensively in various business applications over the last few years. In this paper, a data mining technique is applied to the interpretation of weather forecasts for one of the most disastrous weather phenomena, the cloudburst. Every year, cloudbursts over hilly areas and coastal regions cause loss of lives and property. Forecasting and warning of these events is very difficult. A cloudburst warning can only be provided at a short lead time, say a few hours in advance, based on the interpretation of the latest satellite imagery data, powerful radar (Doppler category), if available, or Model Output Statistics (MOS) models. Another dimension to forecasting this weather event has been identified by applying a clustering technique to primary data forecast by global and regional weather forecasting models. A recent cloudburst over Leh that caused huge losses has been analyzed using the k-means clustering technique of data mining. It has been observed that, by mining Numerical Weather Prediction model forecast data, the signals of cloudburst formation can be found 5-6 days in advance.

- by Dr. Kavita Pabreja
- •
- Climate Change, Atmospheric Science, Climatology, Meteorology