Efficiently Mining Constrained Subsequence Patterns

RPM: Representative Pattern Mining for Efficient Time Series Classification

2016

Time series classification is an important problem that has received a great amount of attention from researchers and practitioners in the past two decades. In this work, we propose a novel algorithm for time series classification based on the discovery of class-specific representative patterns. We define representative patterns of a class as a set of subsequences that has the greatest discriminative power to distinguish one class of time series from another. Our approach rests upon two techniques with linear complexity: symbolic discretization of time series, which generalizes the structural patterns, and grammatical inference, which automatically finds recurrent correlated patterns of variable length, producing an initial pool of common patterns shared by many instances in a class. From this pool of candidate patterns, our algorithm selects the most representative patterns that capture the class specificities, and that can be used to effectively discriminate between time series classes. Through a...
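The symbolic discretization step can be illustrated with a small sketch. The abstract does not name the discretization scheme, so this example assumes a SAX-style procedure (z-normalization, piecewise aggregate approximation, then equiprobable Gaussian breakpoints). The function name `sax` and its parameters are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet="abcd"):
    """SAX-style word for a numeric series (illustrative helper, not the paper's code).

    Assumes len(series) >= n_segments so that every segment is non-empty.
    """
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series is mapped to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Piecewise Aggregate Approximation: mean of each (near-)equal-width segment
    n = len(z)
    paa = [mean(z[i * n // n_segments:(i + 1) * n // n_segments])
           for i in range(n_segments)]
    # breakpoints that cut the standard normal into equiprobable regions
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / k) for j in range(1, k)]
    # each segment mean becomes the symbol of the region it falls into
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)

word = sax([1, 1, 1, 1, 9, 9, 9, 9], n_segments=2, alphabet="ab")
```

Here the low half of the series maps to `a` and the high half to `b`. A grammar-inference step would then operate on words like this one rather than on raw values.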

Searching and mining trillions of time series subsequences under dynamic time warping

2012

Most time series data mining algorithms use similarity search as a core subroutine, and thus the time taken for similarity search is the bottleneck for virtually all time series data mining algorithms. The difficulty of scaling search to large datasets largely explains why most academic work on time series data mining has plateaued at considering a few million time series objects, while much of industry and science sits on billions of time series objects waiting to be explored. In this work we show that by using a combination of four novel ideas we can search and mine truly massive time series for the first time. We demonstrate the following extremely unintuitive fact: in large datasets we can exactly search under DTW much more quickly than the current state-of-the-art Euclidean distance search algorithms. We demonstrate our work on the largest set of time series experiments ever attempted. In particular, the largest dataset we consider is larger than the combined size of all of the time series datasets considered in all data mining papers ever published. We show that our ideas allow us to solve higher-level time series data mining problems such as motif discovery and clustering at scales that would otherwise be untenable. In addition to mining massive datasets, we will show that our ideas also have implications for real-time monitoring of data streams, allowing us to handle much faster arrival rates and/or use cheaper and lower powered devices than are currently possible.
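For readers unfamiliar with DTW, the following is a minimal sketch of the textbook dynamic-programming distance with an optional warping window. It does not implement any of the four ideas mentioned in the abstract; the function name `dtw_distance` and the `window` parameter are assumptions made for illustration.

```python
import math

def dtw_distance(a, b, window=None):
    """Plain O(n*m) DTW with squared point costs (illustrative sketch only)."""
    n, m = len(a), len(b)
    # the band must be at least |n - m| wide for any warping path to exist
    w = max(window if window is not None else max(n, m), abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: insertion, deletion, match
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return math.sqrt(prev[m])

shifted = dtw_distance([0, 0, 1, 0], [0, 1, 0, 0])            # warping allowed
rigid = dtw_distance([0, 0, 1, 0], [0, 1, 0, 0], window=0)    # no warping
```

With `window=0` and equal-length inputs the result reduces to the Euclidean distance, which is why the two calls above differ: warping absorbs the one-step shift, the rigid alignment does not.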

Mining Sequential Pattern with Time-Constraint

2020

Sequential pattern mining is an important data mining task, and different algorithms have been proposed to perform this task efficiently. The problem is to find all sequential patterns whose support is greater than or equal to a predefined minimum support threshold in a data sequence database. Here we present a new methodology to mine sequential patterns with a time constraint. Our study shows that constraints can be effectively and efficiently pushed deep into sequential pattern mining under this new framework.
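To make the support definition concrete, here is a small sketch of support counting under one possible time constraint. The abstract does not say which constraint it uses, so a maximum gap between consecutive matched events is assumed here; the names `occurs`, `support`, and `max_gap` are hypothetical.

```python
def occurs(sequence, pattern, max_gap):
    """True if `pattern` appears in order in `sequence` with every gap <= max_gap.

    `sequence` is a list of (timestamp, item) pairs sorted by timestamp.
    Backtracking is needed: the earliest match of one item may violate the gap
    while a later match of the same item satisfies it.
    """
    def match(p_idx, start, last_time):
        if p_idx == len(pattern):
            return True
        for k in range(start, len(sequence)):
            t, item = sequence[k]
            if last_time is not None and t - last_time > max_gap:
                break  # sorted by time, so later events are even further away
            if item == pattern[p_idx] and match(p_idx + 1, k + 1, t):
                return True
        return False
    return match(0, 0, None)

def support(database, pattern, max_gap):
    """Fraction of sequences in `database` that contain `pattern`."""
    return sum(occurs(seq, pattern, max_gap) for seq in database) / len(database)

db = [
    [(1, "a"), (2, "b"), (9, "c")],
    [(1, "a"), (5, "a"), (6, "b")],   # only the second "a" is close enough to "b"
    [(1, "b"), (2, "a")],
]
s = support(db, ["a", "b"], max_gap=2)
```

A pattern is reported as frequent when this value reaches the minimum support threshold. Checking the constraint inside `occurs`, rather than filtering results afterwards, is one simple way of pushing a constraint into the mining step.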

Efficient mining of sequential patterns with time constraints: Reducing the combinations

Expert Systems with Applications, 2009

In this paper we consider the problem of discovering sequential patterns by handling time constraints as defined in the GSP algorithm. While sequential patterns could be seen as temporal relationships between facts embedded in the database where considered facts are merely characteristics of individuals or observations of individual behavior, generalized sequential patterns aim at providing the end user with a more flexible handling of the transactions embedded in the database.

Ranked Subsequence Matching in Time-Series Databases

2007

Existing work on similar sequence matching has focused on either whole matching or range subsequence matching. In this paper, we present novel methods for ranked subsequence matching under time warping, which finds top-k subsequences most similar to a query sequence from data sequences. To the best of our knowledge, this is the first and most sophisticated subsequence matching solution mentioned in the literature. Specifically, we first provide a new notion of the minimum-distance matching-window pair (MDMWP) and formally define the mdmwp-distance, a lower bound between a data subsequence and a query sequence. The mdmwp-distance can be computed prior to accessing the actual subsequence. Based on the mdmwp-distance, we then develop a ranked subsequence matching algorithm to prune unnecessary subsequence accesses. Next, to reduce random disk I/Os and bad buffer utilization, we develop a method of deferred group subsequence retrieval. We then derive another lower bound, the window-group distance, that can be used to effectively prune unnecessary subsequence accesses during deferred group-subsequence retrieval. Through extensive experiments with many data sets, we showcase the superiority of the proposed methods.

An Algorithm for Mining High Utility Sequential Patterns with Time Interval

Cybernetics and Information Technologies, 2019

Mining High Utility Sequential Patterns (HUSP) is an emerging topic in data mining which attracts many researchers. The HUSP mining algorithms can extract sequential patterns having high utility (importance) in a quantitative sequence database. In real world applications, the time intervals between elements are also very important. However, recent HUSP mining algorithms cannot extract sequential patterns with time intervals between elements. Thus, in this paper, we propose an algorithm for mining high utility sequential patterns with the time interval problem. We consider not only sequential patterns’ utilities, but also their time intervals. The sequence weight utility value is used to ensure the important downward closure property. Besides that, we use four time constraints for dealing with time interval in the sequence to extract more meaningful patterns. Experimental results show that our proposed method is efficient and effective in mining high utility sequential pattern with t...

AN ALGORITHM FOR DISCOVERING SIMILAR SUBSEQUENCES IN TIME SERIES DATA USING CID (Complexity-Invariant Distance)

Discovering subsequences (motifs) in time series data has attracted the interest of researchers. Numerous algorithms that use a distance function or another (dis)similarity measure between two time series have been proposed. We present an algorithm to detect the subsequence (of length m) that is most often repeated in a time series (of length n). The length m of the subsequence is assigned dynamically: it is selected by the user according to characteristics of the time series (e.g. seasonality, periodicity) or from a previous detailed analysis of that time series. The algorithm allows the user to choose between two (dis)similarity measures: Euclidean distance and CID (Complexity-Invariant Distance, proposed by Batista G. and Keogh E. (2013)). The proposed algorithm is tested on real world time series data and simulated time series in R...
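The two measures can be sketched as follows. CID scales the Euclidean distance by the ratio of the two series' complexity estimates, where the complexity estimate is the length of the line obtained by "stretching out" the series. The function names are hypothetical, and returning infinity when exactly one series is flat is an assumption made here to avoid division by zero, not something stated in the abstract.

```python
import math

def euclidean(q, c):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(q, c)))

def complexity_estimate(s):
    """CE(s) = sqrt(sum of squared first differences)."""
    return math.sqrt(sum((s[i] - s[i + 1]) ** 2 for i in range(len(s) - 1)))

def cid(q, c):
    """Complexity-Invariant Distance: ED(q, c) * max(CE) / min(CE)."""
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    lo, hi = min(ce_q, ce_c), max(ce_q, ce_c)
    if hi == 0:
        return euclidean(q, c)   # both series flat: correction factor is 1
    if lo == 0:
        return math.inf          # assumption: one flat series, factor undefined
    return euclidean(q, c) * (hi / lo)

d = cid([0, 1, 0, 1], [0, 2, 0, 2])
```

In this example the second series is twice as "complex" as the first, so CID is twice the Euclidean distance. A motif search over subsequences of length m could call either `euclidean` or `cid` on each candidate pair.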

Pre-processing time constraints for efficiently mining generalized sequential patterns

Proceedings. 11th International Symposium on Temporal Representation and Reasoning, 2004. TIME 2004., 2004

In this paper we consider the problem of discovering sequential patterns by handling time constraints. While sequential patterns could be seen as temporal relationships between facts embedded in the database, generalized sequential patterns aim at providing the end user with a more flexible handling of the transactions embedded in the database. We propose a new efficient algorithm, called GTC (Graph for Time Constraints), for mining such patterns in very large databases. It is based on the idea that handling time constraints in the earlier stage of the algorithm can be highly beneficial since it minimizes computational costs by preprocessing data sequences. Our test shows that the proposed algorithm performs significantly faster than a state-of-the-art sequence mining algorithm.

New methods for mining sequential and time series data

2009

Data mining is the process of extracting knowledge from large amounts of data. It covers a variety of techniques aimed at discovering diverse types of patterns on the basis of the requirements of the domain. These techniques include association rules mining, classification, cluster analysis and outlier detection. The availability of applications that produce massive amounts of spatial, spatio-temporal (ST) and time series data (TSD) is the rationale for developing specialized techniques to excavate such data. In spatial data mining, the spatial co-location rule problem is different from the association rule problem, since there is no natural notion of transactions in spatial datasets that are embedded in continuous geographic space. Therefore, we have proposed an efficient algorithm (GridClique) to mine interesting spatial co-location patterns (maximal cliques). These patterns are used as the raw transactions for an association rule mining technique to discover complex co-location rules. Our proposal includes certain types of complex relationships, especially negative relationships, in the patterns. The relationships can be obtained from only the maximal clique patterns, which have never been used until now. Our approach is applied on a well-known astronomy dataset obtained from the Sloan Digital Sky Survey (SDSS). ST data is continuously collected and made accessible in the public domain. We present an approach to mine and query large ST data with the aim of finding interesting patterns and understanding the underlying process of data generation. An important class of queries is based on the flock pattern. A flock is a large subset of objects moving

IJERT-Closed Sequential Pattern Mining using Length Constraint

International Journal of Engineering Research and Technology (IJERT), 2016

https://www.ijert.org/closed-sequential-pattern-mining-using-length-constraint https://www.ijert.org/research/closed-sequential-pattern-mining-using-length-constraint-IJERTV4IS120420.pdf Sequential pattern mining is an approach to finding sequences that occur frequently in a dataset. These sequences are later used to predict the occurrence of the next event or item in a sequence. Sequential pattern mining is widely used in areas such as healthcare, education, web usage mining, text mining, bioinformatics and telecommunications. Traditional approaches mine frequent closed sequences (sequences that have no super-sequence with the same support). BIDE is an efficient algorithm for mining closed sequences. These approaches generate a large number of sequences, many of which are useless. This paper proposes an approach to mine sequences efficiently by incorporating a length constraint into the algorithms during mining. Incorporating length constraints reduces the number of patterns generated. The length constraint is incorporated into BIDE, which mines frequent closed sequences from a multidimensional dataset. It is also added to another algorithm, which mines sequences from a dataset while taking time into account. This also reduced the number of sequences generated.
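The definition of a closed sequence and the effect of a length constraint can be shown with a short sketch. This is a post-hoc filter over an already mined pattern table, not BIDE and not the paper's method, which applies the constraint during mining. The names `is_subsequence` and `closed_with_length` and the example supports are hypothetical.

```python
def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting items."""
    it = iter(long)
    return all(item in it for item in short)

def closed_with_length(patterns, min_len, max_len):
    """Keep closed patterns whose length lies in [min_len, max_len].

    `patterns` maps a tuple of items to its support count. A pattern is closed
    if no strictly longer super-sequence in the table has the same support.
    """
    closed = {
        p: s for p, s in patterns.items()
        if not any(len(q) > len(p) and t == s and is_subsequence(p, q)
                   for q, t in patterns.items())
    }
    return {p: s for p, s in closed.items() if min_len <= len(p) <= max_len}

# made-up supports for illustration only
patterns = {("a",): 3, ("a", "b"): 3, ("a", "b", "c"): 2, ("b",): 4}
result = closed_with_length(patterns, min_len=2, max_len=2)
```

Here `("a",)` is not closed because `("a", "b")` has the same support, and the length bounds then discard everything except `("a", "b")`.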