Adjusting fuzzy automata for string similarity measuring

A Fuzzy Approach to Approximate String Matching for Text Retrieval in NLP

Journal of Computational Information Systems, 2019

Approximate string matching has many applications in Natural Language Processing. This paper compares various algorithms for approximate string matching, most of which are based on the edit distance between the characters of two strings. It also covers the challenges of using these algorithms for text retrieval. The authors propose an alternative approach to approximate string matching that is better suited to text retrieval. The approach compares two strings using a matrix that is updated for each character position at which the two strings overlap. An overlap counter is incremented at each overlapping character position and reset to 0 when a position without overlap is encountered. The maximum counter value is then used in a ratio to calculate the degree of similarity. The algorithm is implemented in Python. The results indicate that the proposed approach can be used to identify lexically similar words. This type of approach will find use in lemmatization, text summarization, topic modelling and data mining solutions.
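
The overlap-counter idea described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `overlap_similarity` is hypothetical, and dividing the maximum counter value by the length of the longer string is an assumption, since the abstract does not say which ratio is used.

```python
def overlap_similarity(a: str, b: str) -> float:
    """Longest run of consecutively overlapping characters, as a ratio.

    Assumption: the ratio divides by the length of the longer string.
    """
    if not a or not b:
        return 0.0
    # prev[j] is the overlap counter for the previous row of the matrix.
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0] * (len(b) + 1)  # counters start at 0 (the "reset" value)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous character pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best / max(len(a), len(b))


if __name__ == "__main__":
    print(overlap_similarity("running", "runner"))
```

Here "running" and "runner" share the run "runn" of length 4, so the sketch returns 4/7 under the assumed ratio.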

Proposal and study of statistical features for string similarity computation and classification

International Journal of Data Mining, Modelling and Management

Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language-related information. They are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically significantly better than the second-best group, which is based on distances (P-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
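
The abstract does not specify how the co-occurrence and run-length matrices are adapted to strings, so the sketch below is only one plausible reading, borrowed from how these matrices are usually built for images: count ordered symbol pairs at a fixed offset, and count maximal runs of a repeated symbol. The function names and the cosine comparison of the resulting counts are assumptions made for illustration.

```python
import math
from collections import Counter


def cooccurrence_counts(s: str, offset: int = 1) -> Counter:
    """Count ordered symbol pairs (s[i], s[i + offset]); offset must be >= 1."""
    return Counter(zip(s, s[offset:]))


def run_length_counts(s: str) -> Counter:
    """Count (symbol, run length) pairs over maximal runs of one symbol."""
    runs: Counter = Counter()
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1
        runs[(s[i], j - i)] += 1
        i = j
    return runs


def counter_cosine(x: Counter, y: Counter) -> float:
    """Cosine similarity between two sparse count vectors (an assumed metric)."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    norm = math.sqrt(sum(v * v for v in x.values())) * math.sqrt(
        sum(v * v for v in y.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    print(cooccurrence_counts("abab"))
    print(run_length_counts("aabccc"))
    print(counter_cosine(cooccurrence_counts("abab"), cooccurrence_counts("abba")))
```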

Fuzzy Segmentations of a String

ArXiv, 2022

This article discusses a particular case of the data clustering problem, where it is necessary to find groups of adjacent text segments of the appropriate length that match a fuzzy pattern represented as a sequence of fuzzy properties. To solve this problem, a heuristic algorithm for finding a sufficiently large number of solutions is proposed. The key idea of the proposed algorithm is the use of the prefix structure to track the process of mapping text segments to fuzzy properties. An important special case of the text segmentation problem is the fuzzy string matching problem, when adjacent text segments have unit length and, accordingly, the fuzzy pattern is a sequence of fuzzy properties of text characters. It is proven that the heuristic segmentation algorithm in this case finds all text segments that match the fuzzy pattern. Finally, we consider the problem of a best segmentation of the entire text based on a fuzzy pattern, which is solved using the dynamic programming method.
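
For the special case of fuzzy string matching, where every segment has unit length, the problem can be illustrated with a brute-force sliding window. This is not the prefix-structure heuristic from the article. It assumes that a fuzzy property is a function from a character to a degree in [0, 1] and that a window's match degree is the minimum of the per-character degrees. The properties `vowel` and `consonant` are hypothetical examples.

```python
from typing import Callable, List, Sequence, Tuple

FuzzyProperty = Callable[[str], float]


def vowel(ch: str) -> float:
    """Hypothetical fuzzy property: 'y' is only partly a vowel."""
    if ch in "aeiou":
        return 1.0
    return 0.5 if ch == "y" else 0.0


def consonant(ch: str) -> float:
    """Hypothetical fuzzy property: complement of vowel on letters."""
    return 1.0 - vowel(ch) if ch.isalpha() else 0.0


def fuzzy_matches(
    text: str, pattern: Sequence[FuzzyProperty], threshold: float = 0.0
) -> List[Tuple[int, float]]:
    """Return (start, degree) for every window whose degree exceeds threshold."""
    m = len(pattern)
    if m == 0:
        return []
    hits = []
    for start in range(len(text) - m + 1):
        # Assumed semantics: the weakest character decides the window's degree.
        degree = min(p(text[start + k]) for k, p in enumerate(pattern))
        if degree > threshold:
            hits.append((start, degree))
    return hits


if __name__ == "__main__":
    print(fuzzy_matches("bay cat", [consonant, vowel]))
    print(fuzzy_matches("sky", [consonant, consonant, vowel]))
```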

Determining the Degree of Fuzzy Regularity of a String

Mathematical Problems of Computer Science, 2021

The paper deals with the issue of determining the degree of fuzzy regularity of a crisp string. It is assumed that the concept of fuzzy regularity is formalized by a pattern given as a finite automaton with fuzzy properties of alphabet characters on transitions. Proceeding from this, we replace the problem of determining the degree of fuzzy regularity of a crisp string with the problem of determining the degree of belonging of such a string to the language of the corresponding automaton and propose an effective method for solving it using the dynamic programming approach. The solution to the considered problem makes it possible to fuzzify the set of strings in a given alphabet based on a pattern defining fuzzy regularity. This work is a continuation of the author’s previous works related to finding occurrences of a fuzzy pattern in the text. It may have applications in the field of pattern recognition, data clustering, bio-informatics, etc.
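
A dynamic programming computation of the degree to which a crisp string belongs to the language of such an automaton might look like the sketch below. It assumes max-min semantics (a path is as strong as its weakest transition, and the string takes the degree of its best accepting path), which the abstract does not state. The automaton encoding, the example properties and all names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

FuzzyProperty = Callable[[str], float]
Transition = Tuple[int, FuzzyProperty, int]  # (source, property, target)


def vowel(ch: str) -> float:
    """Hypothetical fuzzy property: 'y' is only partly a vowel."""
    if ch in "aeiou":
        return 1.0
    return 0.5 if ch == "y" else 0.0


def consonant(ch: str) -> float:
    """Hypothetical fuzzy property: complement of vowel on letters."""
    return 1.0 - vowel(ch) if ch.isalpha() else 0.0


def membership_degree(
    s: str, transitions: Iterable[Transition], start: int, accepting: Set[int]
) -> float:
    """Max-min degree with which s is accepted (assumed semantics)."""
    transitions = list(transitions)
    # degree[q]: best degree with which the prefix read so far reaches q.
    degree: Dict[int, float] = {start: 1.0}
    for ch in s:
        nxt: Dict[int, float] = {}
        for src, prop, dst in transitions:
            if src in degree:
                d = min(degree[src], prop(ch))
                if d > nxt.get(dst, 0.0):
                    nxt[dst] = d
        degree = nxt
    return max((d for q, d in degree.items() if q in accepting), default=0.0)


# Example pattern: one or more consonant-vowel pairs; state 2 is accepting.
CV_TRANSITIONS: List[Transition] = [(0, consonant, 1), (1, vowel, 2), (2, consonant, 1)]

if __name__ == "__main__":
    for word in ("baba", "baby", "ab"):
        print(word, membership_degree(word, CV_TRANSITIONS, 0, {2}))
```

The table `degree` holds one value per reachable state, so the running time is proportional to the string length times the number of transitions.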

A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures

ARO-THE SCIENTIFIC JOURNAL OF KOYA UNIVERSITY, 2017

This paper introduces an optimized Damerau-Levenshtein and Dice coefficient using enumeration operations (ODADNEN) to provide a fast string similarity measure while maintaining the accuracy of the results. Searching for specific words within a large text is a hard job that takes a lot of time and effort, and the string similarity measure plays a critical role in many searching problems. In this paper, different experiments were conducted to handle some spelling mistakes, and an enhanced algorithm for string similarity assessment was proposed. This algorithm is a combined set of well-known algorithms with some improvements (e.g. the Dice coefficient was modified to deal with numbers instead of characters under certain conditions). These algorithms were adopted after a number of experimental tests were conducted to check their suitability. The ODADNEN algorithm was tested using real data; its performance was compared with the original similarity measure. The results indicated that the most convincing measure is the proposed hybrid measure, which uses the Damerau-Levenshtein and Dice distances based on the n-grams of each word; it also requires less processing time than the standard algorithms. Furthermore, it provides efficient results to assess the similarity between two words without the need to restrict the word length.
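
For reference, the two standard measures that the hybrid builds on can be sketched as below. This is not the optimized ODADNEN algorithm, whose enumeration step and combination rule are not described in the abstract. The sketch shows the restricted Damerau-Levenshtein distance (optimal string alignment, which counts adjacent transpositions) and the Dice coefficient over sets of character bigrams. The function names are illustrative.

```python
from typing import Set


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def bigrams(s: str) -> Set[str]:
    """Set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_coefficient(a: str, b: str) -> float:
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to have bigrams: fall back to exact equality.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


if __name__ == "__main__":
    print(osa_distance("form", "from"))
    print(dice_coefficient("night", "nacht"))
```

A spelling mistake such as "form" for "from" costs a single transposition here, whereas plain Levenshtein distance would charge two substitutions.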