A Novel Stream Mining Approach as Stream-Cluster Feature Tree Algorithm: A Case Study in Turkish Job Postings
Related papers
Fast Feature Selection for Naive Bayes Classification in Data Stream Mining
2013
Stream mining is the process of mining a continuous, ordered sequence of data items in real-time. Naive Bayes (NB) classification is one of the popular classification methods for stream mining because it is an incremental classification method whose model can be easily updated as new data arrives. It has been observed in the literature that the performance of the NB classifier improves when irrelevant features are eliminated from the modeling process. This paper reports studies that were conducted to identify efficient computational methods for selecting relevant features for NB classification based on the sliding window method of stream mining. The paper also provides experimental results which demonstrate that continuous feature selection for NB stream mining provides high levels of predictive performance.
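The incremental update and sliding window mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the class name `SlidingWindowNB`, the parameters `window` and `alpha`, the restriction to categorical features, and the Laplace smoothing are all assumptions made for illustration, and the feature selection step the paper studies is omitted.

```python
from collections import Counter, deque
import math


class SlidingWindowNB:
    """Categorical Naive Bayes whose counts cover only the last `window` instances.

    Hypothetical illustration; names and smoothing choice are assumptions.
    """

    def __init__(self, window=1000, alpha=1.0):
        self.window = window              # maximum number of instances kept
        self.alpha = alpha                # Laplace smoothing constant
        self.buffer = deque()             # instances currently in the window
        self.class_counts = Counter()     # label -> count
        self.feature_counts = Counter()   # (feature index, value, label) -> count
        self.values = {}                  # feature index -> Counter of values

    def _update(self, x, y, delta):
        # Add (delta=+1) or remove (delta=-1) one instance from the counts.
        self.class_counts[y] += delta
        for i, v in enumerate(x):
            self.feature_counts[(i, v, y)] += delta
            self.values.setdefault(i, Counter())[v] += delta

    def learn_one(self, x, y):
        # Incremental update: count the new instance, forget the oldest one
        # once the window is full.
        self.buffer.append((x, y))
        self._update(x, y, +1)
        if len(self.buffer) > self.window:
            old_x, old_y = self.buffer.popleft()
            self._update(old_x, old_y, -1)

    def predict_one(self, x):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for y, n_y in self.class_counts.items():
            if n_y <= 0:
                continue  # label no longer present in the window
            score = math.log(n_y / total)
            for i, v in enumerate(x):
                seen = self.values.get(i, {})
                n_values = max(sum(1 for c in seen.values() if c > 0), 1)
                likelihood = (self.feature_counts[(i, v, y)] + self.alpha) / (
                    n_y + self.alpha * n_values
                )
                score += math.log(likelihood)
            if score > best_score:
                best_label, best_score = y, score
        return best_label
```

Because the model is only a set of counts, adding or forgetting an instance costs time proportional to the number of features, which is what makes this family of classifiers convenient for streams.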
An efficient stream mining technique
WSEAS Transactions on …, 2008
Stream analysis is considered a crucial component of strategic control over a broad variety of disciplines in business, science and engineering. Stream data is a sequence of observations collected over intervals of time. Each data stream describes a phenomenon. Analysis of stream data includes discovering trends (or patterns) in a stream sequence. In the last few years, data mining has emerged and been recognized as a new technology for data analysis. Data mining is the process of discovering potentially valuable patterns, associations, trends, sequences and dependencies in data. Data mining techniques can discover information that many traditional business analysis and statistical techniques fail to deliver. In our study, we emphasize the use of data mining techniques on data streams, where mining techniques and tools are used in an attempt to recognize, anticipate and learn the stream behavior with different directly related or seemingly unrelated factors. Targeted data are sequences of observations collected over intervals of time. Each sequence describes a phenomenon or a factor. Such factors could have either a direct or indirect impact on the stream data under study. Examples of factors with direct impact include the yearly budgets and expenditures, taxation, local stock prices, unemployment rates, inflation rates, fallen angels, and rising odds for upgrades. Indirect factors could include any phenomena in the local or global environments, such as global stock prices, education expenditures, weather conditions, employment strategies, and medical services. Analysis of the data includes discovering trends (or patterns) and associations between sequences in order to generate non-trivial knowledge. In this paper, we propose a data mining technique to predict the dependency between factors that affect performance.
The proposed technique consists of three phases: (a) for each data sequence that represents a chosen phenomenon, generate its trend sequences, (b) discover maximal frequent trend patterns and generate pattern vectors (to keep information of frequent trend patterns), and (c) use trend pattern vectors to predict future factor sequences.
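The first two phases can be sketched roughly as follows. The abstract does not say how trends are encoded, so the `U`/`D`/`S` symbols (up, down, stable), the `tolerance` parameter, and both function names are assumptions. The second function counts fixed-length patterns only, which is a simplification of the maximal frequent pattern discovery the abstract describes.

```python
from collections import Counter


def trend_sequence(values, tolerance=0.0):
    """Encode consecutive changes as 'U' (up), 'D' (down) or 'S' (stable).

    The symbol set and the tolerance threshold are illustrative assumptions.
    """
    trends = []
    for prev, curr in zip(values, values[1:]):
        change = curr - prev
        if change > tolerance:
            trends.append("U")
        elif change < -tolerance:
            trends.append("D")
        else:
            trends.append("S")
    return "".join(trends)


def frequent_patterns(trends, length, min_support):
    """Count every contiguous pattern of the given length and keep those
    that occur at least `min_support` times (not restricted to maximal ones)."""
    counts = Counter(
        trends[i:i + length] for i in range(len(trends) - length + 1)
    )
    return {pattern: n for pattern, n in counts.items() if n >= min_support}


if __name__ == "__main__":
    series = [1, 2, 3, 3, 2, 3, 4]
    encoded = trend_sequence(series)
    print(encoded)
    print(frequent_patterns(encoded, length=2, min_support=2))
```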
Knowledge Discovery Using Data Stream Mining
Advances in Business Information Systems and Analytics, 2018
In recent years, advances in technology have made it possible for most present-day organizations to store and record large streams of data. Such data sets, which continuously and rapidly grow over time, are referred to as data streams. Mining such data streams is a unique opportunity and also a challenging task. Data stream mining is the process of gaining knowledge from continuous and rapid records of data. Due to the increase in streaming information, data stream mining has attracted the research community in the recent past. A voluminous literature has been published in this domain over the past few years. Because of this, isolating the right study would be a grueling task for researchers and practitioners. While addressing a real-world problem, it would be difficult to find relevant information, as it would be hidden in data streams. This chapter tries to provide a solution, as it is an amalgamation of all techniques used for data stream mining.
Enhancing Decision Trees for Data Stream Mining
Advances in Science, Technology and Engineering Systems Journal
Data streams have attracted considerable research attention for years. Mining this type of data poses special challenges because of its unusual nature. Data stream flows are continuous, infinite and unbounded in size. Because of its accuracy, the decision tree is one of the most common methods for classifying data streams. The aim of classification is to find a set of models that can be used to differentiate and label different classes of objects. The discovered models are used to predict the class membership of objects in a data set. Although many efforts have been made to classify stream data using decision trees, special attention is still needed to enhance their performance, especially regarding time, which is an important factor for data streams. This fast type of data requires the shortest possible processing time. This paper presents VFDT-S1.0 as an extension of VFDT (Very Fast Decision Trees). Bagging and sampling techniques are used to improve the running time of the algorithm while maintaining accuracy. The experimental results show that the proposed modification reduces classification time by more than 20% in more than one dataset. The effect on accuracy was less than 1% in some datasets. The time results demonstrate the suitability of the algorithm for handling fast stream mining.
A survey on data preprocessing for data stream mining: Current status and future directions
Neurocomputing, 2017
Data preprocessing and reduction have become essential techniques in current knowledge discovery scenarios, dominated by increasingly large datasets. These methods aim at reducing the complexity inherent to real-world datasets, so that they can be easily processed by current data mining solutions. Advantages of such approaches include, among others, a faster and more precise learning process, and a more understandable structure of raw data. However, data preprocessing techniques for data streams have a long road ahead of them, even though online learning is growing in importance thanks to the development of the Internet and technologies for massive data collection. Throughout this survey, we summarize, categorize and analyze those contributions on data preprocessing that cope with streaming data. This work also takes into account the existing relationships between the different families of methods (feature and instance selection, and discretization). To enrich our study, we conduct thorough experiments using the most relevant contributions and present an analysis of their predictive performance, reduction rates, computational time, and memory usage. Finally, we offer general advice about existing data stream preprocessing algorithms, and discuss emerging future challenges to be faced in the domain of data stream preprocessing.
A Survey of Stream Data Mining
At present, a growing number of applications that generate massive streams of data need intelligent data processing and online analysis. Real-time surveillance systems, telecommunication systems, sensor networks and other dynamic environments are such examples. The imminent need for turning such data into useful information and knowledge drives the development of systems, algorithms and frameworks that address streaming challenges. The storage, querying and mining of such data sets are highly computationally challenging tasks. Mining data streams is concerned with extracting knowledge structures represented in models and patterns in non-stopping streams of information. In this paper, we present the theoretical foundations of data stream analysis and identify potential directions of future research. Data stream mining techniques are critically reviewed.
An Efficient Way for Scrutinizing the Job Seekers Data to Select a Right Candidate
Decision support systems play a vital role in business, science, medicine, markets, research and many more. The advances in analytical systems of data changed the way and pace of decision making process. Data mining in general and decision trees in particular are contributing a lot to decision support systems. In this paper efforts are made to introduce a simple and useful decision support system based on decision trees. Hypothetical data is considered to explain the methodology and elevate the power of the results. The proposed process can be extended to big data sets by availing the pruning techniques for decision tree construction.
DATA STREAM MINING ALGORITHMS – A REVIEW OF ISSUES AND EXISTING APPROACHES
More and more applications, such as traffic modeling, military sensing and tracking, and online data processing, generate a large amount of data streams every day. Efficient knowledge discovery from such data streams is an emerging, active research area in data mining with broad applications. Unlike data in traditional static databases, data streams typically arrive continuously, at high speed, in huge volumes, and with a changing data distribution. This raises new issues that need to be considered when developing association rule mining techniques for stream data. Due to the unique features of data streams, traditional data mining techniques, which require multiple scans of the entire data set, cannot be applied directly to mine stream data, which usually allows only one scan and demands fast response time.
Critical evaluation of classifiers in data stream mining
International Journal of Engineering & Technology
Over the past decade there has been a significant increase in the volume of online data. Extracting meaningful knowledge from this high-volume data is considered an important aspect of research. It is very difficult to store the full data completely, because of its perpetual nature. Therefore, analysis is needed while the “data is moving”. This moving data is known as a data stream, and analyzing it without storing it completely is termed data stream mining. In recent years, many new techniques have been proposed to overcome the challenges of data stream mining. In this paper, we review the operation of popular streaming algorithms, highlighting their strengths and weaknesses. We also evaluate the classifiers used in these algorithms against two popular benchmark datasets, namely (a) forest cover (forest) and (b) German credit, available at the UCI repository. Finally, we present our critical observations and draw conclusions on the basis of our analysis.
An experimental comparison of decision trees in traditional data mining and data stream mining
… Management and Service (IMS), 2010 6th …, 2010
Data Stream Mining (DSM) is claimed to be the successor of traditional data mining, in that it is capable of mining continuously incoming data streams in real time with acceptable performance. Nowadays many computer applications have evolved to an online and on-demand basis, and fresh data are fed in at high speeds. Not only does a decision response need to be made rapidly, but the trained decision tree models also have to be updated as frequently as the latest data arrive. By the nature of traditional data mining, training datasets are assumed to be structured and static, and the decision tree models are either refreshed in batches or never. That is, the full dataset is completely scanned (sometimes in multiple repetitions), with rules induced by a greedy algorithm that proceeds in a divide-and-conquer manner, as in the construction of a C4.5 decision tree. DSM, on the other hand, progressively builds and renews the decision tree model each time a new pass of data comes by. In this paper, we evaluated the performance of a popular decision tree in DSM, known as the Hoeffding Tree, vis-à-vis that of C4.5. A good mix of types of datasets was used in the experiments for investigating the apparent differences between the decision trees. An open-source DSM simulator was programmed in Java for the experiments.
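For context, the Hoeffding Tree named in this abstract decides when to split a leaf using the Hoeffding bound, ε = sqrt(R² ln(1/δ) / (2n)). The sketch below shows only that split test. The function names and the example numbers are illustrative assumptions, not taken from the paper or its simulator.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range: range R of the split metric (for example log2 of the number
                 of classes when the metric is information gain)
    delta:       allowed probability that the chosen attribute is not the best
    n:           number of instances observed at the leaf so far
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the gap between the two best attributes exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # The bound shrinks as more instances arrive, so a leaf that cannot
    # decide yet may be able to split later in the stream.
    for n in (100, 1000, 10000):
        print(n, round(hoeffding_bound(1.0, 1e-7, n), 4))
    print(should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000))
```

This is the main contrast with C4.5 described above: the split decision uses only statistics accumulated from the instances seen so far, so no rescan of the full dataset is needed.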