The SPMF Open-Source Data Mining Library
Related papers
The SPMF Open-Source Data Mining Library Version 2
Machine Learning and Knowledge Discovery in Databases, 2016
SPMF is an open-source data mining library, specialized in pattern mining, offering implementations of more than 120 data mining algorithms. It has been used in more than 310 research papers to solve applied problems in a wide range of domains, from authorship attribution to restaurant recommendation. Its implementations are also commonly used as benchmarks in research papers, and it has been integrated in several data analysis software programs. After three years of development, this paper introduces the second major revision of the library, named SPMF 2, which provides (1) more than 60 new algorithm implementations (including novel algorithms for sequence prediction), (2) an improved user interface with pattern visualization, (3) a novel plug-in system, (4) improved performance, and (5) support for text mining.
SPMF: a Java Open-Source Pattern Mining Library
2014
We present SPMF, an open-source data mining library offering implementations of more than 55 data mining algorithms. SPMF is a cross-platform library implemented in Java, specialized for discovering patterns in transaction and sequence databases such as frequent itemsets, association rules and sequential patterns. The source code can be integrated in other Java programs. Moreover, SPMF offers a command line interface and a simple graphical interface for quick testing. The source code is available under the GNU General Public License, version 3. The website of the project offers several resources such as documentation with examples of how to run each algorithm, a developer’s guide, performance comparisons of algorithms, data sets, an active forum, a FAQ and a mailing list.
Data and text mining KODAMA: an R package for knowledge discovery and data mining
KODAMA, a novel learning algorithm for unsupervised feature extraction, is specifically designed for analysing noisy and high-dimensional datasets. Here we present an R package of the algorithm with additional functions that allow improved interpretation of high-dimensional data. The package requires no additional software and runs on all major platforms. Availability and Implementation: KODAMA is freely available from the R archive CRAN (http://cran.r-project.org). The software is distributed under the GNU General Public License (version 3 or later).
A unified data mining solution for authorship analysis in anonymous textual communications
Information Sciences, 2013
The cyber world provides an anonymous environment for criminals to conduct malicious activities such as spamming, sending ransom e-mails, and spreading botnet malware. Often, these activities involve textual communication between a criminal and a victim, or between criminals themselves. The forensic analysis of online textual documents for addressing the anonymity problem, called authorship analysis, is the focus of most cybercrime investigations. Authorship analysis is the statistical study of linguistic and computational characteristics of the written documents of individuals. This paper is the first work that presents a unified data mining solution to address authorship analysis problems based on the concept of frequent pattern-based writeprint. Extensive experiments on real-life data suggest that our proposed solution can precisely capture the writing styles of individuals. Furthermore, the writeprint is effective to identify the author of an anonymous text from a group of suspects and to infer sociolinguistic characteristics of the author. This is the preprint version. The official version is published in Elsevier Information Sciences.
TMT: Object-Oriented Text Classification Library
2007
The purpose of the TMT (Text Mining Tools) library is to enable the use of modern text-mining techniques for natural languages on cross-platform environments that can be applied equally well to research and development of end-user text-mining applications. The paper is structured as follows. Section 2 discusses the related work. Section 3 describes the functionalities of the library, whereas Section 4 describes its usage. Section 5 concludes the paper.
Profile-based Authorship Analysis
Literary and Linguistic Computing (now Digital Scholarship in the Humanities), 2015
This article presents a profile-based authorship analysis method which first categorizes texts according to social and conceptual characteristics of their author (e.g., Sex and Political Ideology) and then combines these profiles for two authorship analysis tasks: (1) determining shared authorship of pairs of texts without a set of candidate authors and (2) clustering texts according to characteristics of their authors in order to provide an analysis of the types of individuals represented in the dataset. The first task outperforms Burrows’ Delta by a wide margin on short texts and a small margin on long texts. The second task has no such benchmark with existing methods. The dataset for evaluating the method consists of speeches from the U.S. House and Senate from 1995 to 2013. This dataset contains both a large number of texts (42,000 in the test sets) and a large number of speakers (over 800). The article shows that this approach to authorship analysis is more accurate than existing approaches given a dataset with hundreds of authors. Further, this profile-based method makes new types of analysis possible by looking at types of individuals as well as at specific individuals.
Journal of Computer Science, 2010
Problem statement: Stylometric authorship attribution is a text mining approach that analyzes texts, e.g., novels and plays by famous authors, to measure an author's style. It selects attributes that characterize the author's writing style, assuming that each writer has a distinctive way of writing that no other writer shares; authorship attribution is thus the task of identifying the author of a given text. In this study, we propose an authorship attribution algorithm that improves the accuracy of stylometric features so that different writers can be discriminated nearly as well as different persons are by their fingerprints. Approach: The main goal of this study is to build an algorithm that supports a decision-making system, enabling users to predict and choose the right author for a specific anonymous novel under consideration, by using a learning procedure to teach the system the stylometric map of the author so that it behaves like an expert opinion. Stylometric Authorship Attribution (AA) usually relies on the frequent word as the best available attribute. Many studies have sought other beneficial attributes, yet the frequent word still gives better results in research and experiments, and the best technique used so far is still bag-of-words counting with the maximum item set. Results: To improve AA techniques, we need a new pack of attributes with a new measurement tool. The first pack of attributes used in this study is the frequent pair, meaning a pair of words that always appear together. This attribute is clearly not new, but it was not a successful attribute compared with the frequent word when using maximum item set counters. The word pair made some mistakes, as the experimental results show. We improved the Winnow algorithm by combining it with the computational approach, using the CV statistical tool as a conditional threshold for attribute selection; by doing so, the frequent pair result improved from 50% error to 0%, with a clearly higher score than the frequent word attribute. Conclusion/Recommendations: The improvement obtained with the new CV algorithm may allow several attributes that previously gave unsatisfying results to be used, which might help solve some hard cases that could not be solved until now.
TRUMIT: a tool to support large-scale mining of text association rules
2011
Due to the nature of textual data, the application of association rule mining to text corpora has attracted the attention of the scientific research community for years. In this paper we demonstrate a system that can efficiently mine association rules from text. The system annotates terms using several annotators, and extracts text association rules between terms or categories of terms. An additional contribution of this work is the inclusion of novel unsupervised evaluation measures for weighting and ranking the importance of the text rules. We demonstrate the functionalities of our system with two text collections, a set of Wikileaks documents, and one from TREC-7.
Review of data, text and web mining software
Kybernetes
Purpose – The purpose of this paper is to review and compare selected software for data mining, text mining (TM), and web mining that are not available as free open-source software. Design/methodology/approach – Selected software packages are compared in terms of their common and unique features. The software for data mining are SAS® Enterprise Miner™, Megaputer PolyAnalyst® 5.0, NeuralWare Predict®, and BioDiscovery GeneSight®. The software for TM are CompareSuite, SAS® Text Miner, TextAnalyst, VisualText, Megaputer PolyAnalyst® 5.0, and WordStat. The software for web mining are Megaputer PolyAnalyst®, SPSS Clementine®, ClickTracks, and QL2. Findings – This paper discusses and compares the existing features, characteristics, and algorithms of selected software for data mining, TM, and web mining, respectively. These software packages are also applied to available data sets. Research limitations/implications – The limitations are the inclusion of selected software and datasets rather than considering the ...