N-Grams Research Papers - Academia.edu

When an ambiguous lexical item appears within a familiar string of words, it can instantly receive an appropriate interpretation from this context, thus being saturated by it. Such a context may also short-circuit illocutionary and other pragmatic aspects of interpretation. We here extract from the British National Corpus over 500 internally highly collocating and high-frequency lexical n-grams of up to 5 words containing have to, must, need to, and/or should. These contexts-as-constructions go some way toward allowing us to group these four necessity modals into clusters with similar semantic and pragmatic properties and to determine which of them is semantico-pragmatically most unlike the others. It appears that have to and need to cluster most closely together thanks to their shared environments (e.g., you may have/need to…, expressing contingent, mitigated necessity), while should has the largest share of unique n-grams (e.g., rhetorical Why shouldn't I…?, used as a defiant self-exhortation).
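
The extraction step lends itself to a compact illustration. The sketch below (Python, with a toy tokenized corpus and a made-up frequency threshold, not the authors' BNC pipeline or their collocation statistics) simply counts 2- to 5-grams that contain one of the four necessity modals and keeps the frequent ones.

```python
# A minimal sketch (not the authors' BNC pipeline): extract frequent n-grams
# (2-5 tokens) that contain a target necessity modal, then rank by frequency.
from collections import Counter

TARGETS = {"must", "should"}                        # single-token modals
TARGET_PHRASES = [("have", "to"), ("need", "to")]   # multi-token modals

def ngrams(tokens, n):
    return zip(*(tokens[i:] for i in range(n)))

def contains_target(gram):
    if TARGETS & set(gram):
        return True
    return any(phrase == gram[i:i + 2] for phrase in TARGET_PHRASES
               for i in range(len(gram) - 1))

def modal_ngrams(sentences, n_values=(2, 3, 4, 5), min_count=3):
    counts = Counter()
    for tokens in sentences:
        tokens = [t.lower() for t in tokens]
        for n in n_values:
            for gram in ngrams(tokens, n):
                if contains_target(gram):
                    counts[gram] += 1
    return [(g, c) for g, c in counts.most_common() if c >= min_count]

# Toy usage with a hypothetical tokenized corpus:
corpus = [["you", "may", "have", "to", "wait"],
          ["you", "may", "have", "to", "ask"],
          ["you", "may", "have", "to", "wait"],
          ["why", "should", "i", "care"]]
for gram, count in modal_ngrams(corpus, min_count=2):
    print(" ".join(gram), count)
```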

We show how to take similarity between features into account when calculating the similarity of objects in the Vector Space Model (VSM) for machine learning algorithms and other classes of methods that involve similarity between objects. Unlike LSA, we assume that similarity between features is known (say, from a synonym dictionary) and does not need to be learned from the data. We call the proposed similarity measure soft similarity. Similarity between features is common, for example, in natural language processing: words, n-grams, or syntactic n-grams can be somewhat different (which makes them different features) but still have much in common: for example, the words "play" and "game" are different but related. When there is no similarity between features, our soft similarity measure is equal to the standard similarity. To this end, we generalize the well-known cosine similarity measure in the VSM by introducing what we call the "soft cosine measure". We propose various formulas for exact or approximate calculation of the soft cosine measure. For example, in one of them we consider for the VSM a new feature space consisting of pairs of the original features weighted by their similarity. Again, for features that bear no similarity to each other, our formulas reduce to the standard cosine measure. Our experiments show that our soft cosine measure provides better performance in our case study: the entrance exams question answering task at CLEF. In these experiments, we use syntactic n-grams as features and Levenshtein distance as the similarity between n-grams, measured either in characters or in elements of n-grams.
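
The soft cosine itself is easy to state in code: the ordinary dot products of the cosine are replaced by bilinear forms through a feature-similarity matrix. The sketch below is a minimal illustration with an assumed similarity matrix, not the paper's Levenshtein-based setup.

```python
# A minimal sketch of the soft cosine idea: the usual dot products are
# replaced by bilinear forms a.S.b, where S holds pairwise feature
# similarities. The similarity matrix below is an illustrative assumption.
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine of feature vectors a, b given feature-similarity matrix S."""
    a, b, S = np.asarray(a, float), np.asarray(b, float), np.asarray(S, float)
    num = a @ S @ b
    den = np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)
    return num / den if den else 0.0

# Features: ["play", "game", "weather"]; "play" and "game" are related.
S = np.array([[1.0, 0.6, 0.0],
              [0.6, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
a = [1, 0, 1]   # document mentioning "play" and "weather"
b = [0, 1, 1]   # document mentioning "game" and "weather"
print(soft_cosine(a, b, S))            # higher than the ordinary cosine below
print(soft_cosine(a, b, np.eye(3)))    # with S = I this is the standard cosine
```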

The rich availability of ancient Greek texts in the Thesaurus Linguae Graecae (TLG) has opened up new types of research for Biblical and Patristic scholars. A very helpful feature on the TLG's website is the option to trace quotations with the help of an n-grams module (Intertextual Phrase Matching). However, it is virtually unknown how well this module performs and what scholars might expect from the results it produces. The core of this article, therefore, is devoted to a critical examination of the algorithm and of its results. The gospel according to John has been compared with the Paedagogus of Clement of Alexandria, with the gospel according to Matthew, and with the complete works of Plutarch. As it turns out, the algorithm performs well in cases of longer quotations with no or very few interpolations. Short quotations, however, are missed, while interpolated or adapted quotations are poorly handled. It is suggested that the algorithm might perform better if the team of the TLG were to revisit its decision to ignore stopwords and if the algorithm were to allow for foreign words in its n-grams. Finally, more transparency about the algorithm's mechanisms and the possibility of manually tuning its parameters might improve its applicability.
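
For readers unfamiliar with how such a module works in principle, the sketch below shows the bare n-gram matching idea on toy strings: index the word 4-grams of one text and report where they recur in another. It is only an illustration of why exact n-gram matching misses short or adapted quotations, not a reconstruction of the TLG algorithm.

```python
# A rough sketch of n-gram based quotation spotting (not the TLG module):
# index word 4-grams of the source text and report matching 4-grams in the
# target text. Interpolated or adapted quotations break exact n-grams.
from collections import defaultdict

def word_ngrams(text, n=4):
    tokens = text.lower().split()
    return [(i, tuple(tokens[i:i + n])) for i in range(len(tokens) - n + 1)]

def shared_ngrams(source, target, n=4):
    index = defaultdict(list)
    for i, gram in word_ngrams(source, n):
        index[gram].append(i)
    matches = []
    for j, gram in word_ngrams(target, n):
        if gram in index:
            matches.append((index[gram], j, " ".join(gram)))
    return matches

source = "in the beginning was the word and the word was with god"
target = "clement writes that in the beginning was the word according to john"
for src_pos, tgt_pos, phrase in shared_ngrams(source, target):
    print(f"source {src_pos} / target {tgt_pos}: {phrase}")
```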

In this chapter, it is shown how we can develop a new type of learner's or student's grammar based on n-grams (sequences of 2, 3, 4, etc. items) automatically extracted from a large corpus, such as the Corpus of Contemporary American English (COCA). The notion of n-gram and its primary role in statistical language modelling is first discussed. The part-of-speech (POS) tagging provided for lexical n-grams in COCA is then demonstrated to be useful for the identification of frequent structural strings in the corpus. We propose using the hundred most frequent POS-based 5-grams as the content around which an 'n-grammar' of English can be constructed. We counter some obvious objections to this approach (e.g. that these patterns only scratch the surface, or that they display much overlap among them) and describe extra features for this grammar, relating to the patterns' productivity, corpus dispersion, functional description and practice potential.
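
A rough sense of how POS-based 5-grams are obtained can be given with off-the-shelf tools. The sketch below uses NLTK's Penn Treebank tagger on two invented sentences, which is an assumption for illustration (resource names may differ across NLTK versions); COCA's own tagging and frequency data are what the chapter actually relies on.

```python
# A small illustration of counting POS-based 5-grams with NLTK (the chapter
# uses COCA and its own tagset; Penn Treebank tags are used here instead).
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
from collections import Counter
import nltk

def pos_5grams(sentences):
    counts = Counter()
    for sent in sentences:
        tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sent))]
        for i in range(len(tags) - 4):
            counts[tuple(tags[i:i + 5])] += 1
    return counts

sentences = ["The old man walked slowly to the station.",
             "A young woman drove quickly to the airport."]
for gram, count in pos_5grams(sentences).most_common(5):
    print(" ".join(gram), count)
```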

Android malware has been on the rise in recent years due to the increasing popularity of Android and the proliferation of third-party application markets. Emerging Android malware families are increasingly adopting sophisticated detection-avoidance techniques, and this calls for more effective approaches to Android malware detection. Hence, in this paper we present and evaluate an n-gram opcode feature-based approach that utilizes machine learning to identify and categorize Android malware. This approach enables automated feature discovery without relying on prior expert or domain knowledge for predetermined features. Furthermore, by using a data segmentation technique for feature selection, our analysis is able to scale up to 10-gram opcodes. Our experiments on a dataset of 2520 samples showed an f-measure of 98% using the n-gram opcode based approach. We also provide empirical findings that illustrate factors that have a probable impact on the overall performance trends of n-gram opcode features.
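
The general shape of such a pipeline can be sketched as follows, with fabricated opcode traces and labels standing in for the real dataset, and 2-/3-grams instead of the 10-grams reached in the paper.

```python
# A toy sketch of the general idea (not the paper's pipeline or data):
# represent each app as a string of opcodes, build n-gram count features,
# and train a classifier. The opcode traces and labels below are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

opcode_traces = [
    "move const invoke return",             # benign (assumed)
    "move const invoke invoke return",      # benign (assumed)
    "goto xor invoke-static goto xor",      # malicious (assumed)
    "xor goto xor invoke-static goto",      # malicious (assumed)
]
labels = [0, 0, 1, 1]

# 2- and 3-gram opcode features; the paper scales to 10-grams via segmentation.
vectorizer = CountVectorizer(ngram_range=(2, 3), token_pattern=r"\S+")
X = vectorizer.fit_transform(opcode_traces)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
test = vectorizer.transform(["goto xor invoke-static return"])
print(clf.predict(test))   # expected to lean toward the malicious class
```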

Shaoul, C., C. F. Westbury, and R. H. Baayen
When asked to think about the subjective frequency of an n-gram (a group of n words), what properties of the n-gram influence the respondent? It has been recently shown that n-grams that occurred more frequently in a large corpus of English were read faster than n-grams that occurred less frequently (Arnon & Snider, 2010), an effect that is analogous to the frequency effects in word reading and lexical decision. The subjective frequency of words has also been extensively studied and linked to performance on linguistic tasks. We investigated the capacity of people to gauge the absolute and relative frequencies of n-grams. Subjective frequency ratings collected for 352 n-grams showed a strong correlation with corpus frequency, in particular for n-grams with the highest subjective frequency. These n-grams were then paired up and used in a relative frequency decision task (e.g. Is green hills more frequent than weekend trips?). Accuracy on this task was reliably above chance, and the trial-level accuracy was best predicted by a model that included the corpus frequencies of the whole n-grams. A computational model of word recognition (Baayen, Milin, Djurdjevic, Hendrix, & Marelli, 2011) was then used to attempt to simulate subjective frequency ratings, with limited success. Our results suggest that human n-gram frequency intuitions arise from the probabilistic information contained in n-grams.

Taking a cross-linguistic perspective to the analysis of original cinematic speech, this book examines the dialogues of English and Italian films with the objective of assessing their degree of lexico-grammatical and phraseological comparability.

Place, as one of the most basic semantic categories, plays an important role in children’s literature. This contrastive corpus-based study aims to examine and compare how place, in its widest sense, is expressed in children’s literature in English and Czech. The study is data driven and the main methodological approach taken is through n-gram extraction. At the same time, it aims to further test the method, which in previous applications in contrastive analysis has raised a number of methodological issues: while giving reassuring results when applied to typologically closer languages, it proves to be challenging in the study of typologically different languages, such as English and Czech. The second objective of this study is therefore to further address these issues and explore the potential of this methodology. The analysis is based on both comparable and parallel corpora: comparable corpora of English and Czech children’s literature and a parallel corpus of English children’s literature and its translations into Czech.

This paper presents TrendStream, a versatile architecture for very large word n-gram datasets. Designed for speed, flexibility, and portability, TrendStream uses a novel trie-based architecture, features lossless compression, and provides optimization for both speed and memory use. In addition to literal queries, it also supports fast pattern matching searches (with wildcards or regular expressions), on the same data structure, without any additional indexing. Language models are updateable directly in the compiled binary format, allowing rapid encoding of existing tabulated collections, incremental generation of n-gram models from streaming text, and merging of encoded compiled files. This architecture offers flexible choices for loading and memory utilization: fast memory-mapping of a multi-gigabyte model, or on-demand partial data loading with very modest memory requirements. The implemented system runs successfully on several different platforms, under different operating systems, even when the n-gram model file is much larger than available memory. Experimental evaluation results are presented with the Google Web1T collection and the Gigaword corpus.
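
As a point of comparison, a toy uncompressed trie already shows why this layout supports both literal lookups and wildcard queries over one structure; the sketch below is only that, not TrendStream's compressed binary format.

```python
# A toy in-memory trie for n-gram counts, far from TrendStream's compressed
# binary format, but illustrating why a trie supports both literal lookups
# and simple wildcard queries over the same structure.
class NGramTrie:
    def __init__(self):
        self.children = {}
        self.count = 0

    def add(self, ngram, count=1):
        node = self
        for word in ngram:
            node = node.children.setdefault(word, NGramTrie())
        node.count += count

    def lookup(self, ngram):
        node = self
        for word in ngram:
            node = node.children.get(word)
            if node is None:
                return 0
        return node.count

    def match(self, pattern, prefix=()):
        """Yield (ngram, count) pairs; '*' in the pattern matches any word."""
        if not pattern:
            if self.count:
                yield prefix, self.count
            return
        head, rest = pattern[0], pattern[1:]
        items = self.children.items() if head == "*" else \
                [(head, self.children[head])] if head in self.children else []
        for word, child in items:
            yield from child.match(rest, prefix + (word,))

trie = NGramTrie()
trie.add(("new", "york", "city"), 42)
trie.add(("new", "york", "times"), 17)
print(trie.lookup(("new", "york", "city")))      # 42
print(list(trie.match(("new", "*", "times"))))   # wildcard search
```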

The paper examines the role of corpora in linguistic research using the example of two Croatian language corpus interfaces, Philologic and Bonito, for language inquiries about the document-content relation, as well as the level of character and information display. For specialized linguistic search queries we have built a sports newspaper database made of Sportske novosti online texts (http://sportske.jutarnji.hr/), containing 3.6 million tokens published from April 2008 to July 2009.
The computational procedures of information retrieval and n-gram SQL/regex queries are shown in order to extract token co-frequencies and reveal phrases, collocations and more constant syntagmemes. The JavaScript wiring library WireIt is used to visualize token frequencies in the browser.
We have compared the output with Google search results, on the basis of which we point out seven shortcomings of Google search for linguistic investigations and conclude that our approach can produce unique results in linguistic research.

Sentiment analysis is an interdisciplinary field between natural language processing, artificial intelligence and text mining. The key element of sentiment analysis is polarity, that is, whether the sentiment is positive or negative (Chen, 2012). This study uses the support vector machine classification method on 648 consumer reviews. The data were obtained from consumer reviews on a marketplace where the products sold are mobile phones. The study identifies three aspects of marketplace sentiment: service, delivery and product. The slang dictionary used for the normalization process contains 552 slang words. The study compares feature analyses to obtain the best classification result, because classification accuracy is influenced by the feature analysis process. Comparing n-gram and TF-IDF features with the support vector machine method, unigrams achieved the highest accuracy, at 80.87%. The results show that, for aspect-level sentiment analysis with this feature comparison, the combination of unigram features and a support vector machine classifier is the best model.
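
The comparison of unigram counts against TF-IDF features with an SVM can be sketched with scikit-learn on a handful of invented reviews (the real study uses 648 marketplace reviews and its own preprocessing):

```python
# A schematic comparison in the spirit of the study (toy data, not the 648
# marketplace reviews): unigram counts vs. TF-IDF features with a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

reviews = ["fast delivery great product", "product arrived broken",
           "excellent service and packaging", "late delivery poor service",
           "great phone works perfectly", "terrible product never again"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = positive, 0 = negative (toy labels)

for name, vec in [("unigram counts", CountVectorizer(ngram_range=(1, 1))),
                  ("tf-idf", TfidfVectorizer(ngram_range=(1, 1)))]:
    X = vec.fit_transform(reviews)
    scores = cross_val_score(LinearSVC(), X, labels, cv=3)
    print(name, scores.mean())
```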

The task is to identify, in a given suspicious document, the fragments that are not consistent with the rest of the text in terms of writing style.

When it is not possible to compare the suspicious document to the source document(s) from which plagiarism has been committed, the evidence of plagiarism has to be looked for intrinsically in the document itself. In this paper, we introduce a novel language-independent intrinsic plagiarism detection method which is based on a new text representation that we call n-gram classes. The proposed method was evaluated on three publicly available standard corpora. The obtained results are comparable to the ones obtained by the best state-of-the-art methods.

This thesis deals with two major topics: plagiarism detection in Arabic documents, and plagiarism detection based on the writing style changes in the suspicious document, which is called intrinsic plagiarism detection. This approach is an alternative to the text-matching approach, notably, in the absence of the plagiarism source. Our key contributions in these two areas lie first, in the development of Arabic corpora to allow for the evaluation of plagiarism detection software on this language and, second, in the development of a language-independent intrinsic plagiarism detection method that exploits the character n-grams in a machine learning approach while avoiding the curse of dimensionality. Our third key contribution is an investigation on which character n-grams, in terms of their frequency and length, are the best to detect plagiarism intrinsically. We carried out our experiments on standardised English corpora and also on the developed Arabic corpora using the method we developed and one of the most prominent intrinsic plagiarism detection methods. The findings of our analysis can be exploited by the future intrinsic plagiarism detection methods that use character n-grams. In addition to the above-mentioned technical contributions, we provide the reader with comprehensive and critical surveys of the literature of Arabic plagiarism detection and intrinsic plagiarism detection, which were lacking in both topics.

Information explosion has resulted in the need for more advanced methods for managing information. Text stream mining is very important as people and organizations are trying to process and understand as much information as possible. The generalised suffix tree is a data structure capable of solving a number of text stream mining tasks, such as detecting changes in the text stream, identifying reuse of text, and detecting events by identifying when the frequencies of phrases change in a statistically significant way. An efficient method with polynomial time complexity that uses suffix trees to analyse streams of data in an online setting is discussed.
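
A naive stand-in for the suffix-tree idea is shown below: insert every length-capped token suffix of each document into a trie so that any phrase's stream frequency can be read off one path. A real generalised suffix tree does this far more compactly and supports the change-detection statistics mentioned above; this sketch only illustrates the phrase-counting use.

```python
# Naive phrase-frequency trie built from length-capped token suffixes of
# each document; a stand-in for a generalised suffix tree, not an
# implementation of one.
class PhraseTrie:
    def __init__(self):
        self.children = {}
        self.count = 0

    def _insert(self, tokens):
        node = self
        for tok in tokens:
            node = node.children.setdefault(tok, PhraseTrie())
            node.count += 1          # every prefix of the suffix is a phrase

    def add_document(self, tokens, max_phrase_len=5):
        for i in range(len(tokens)):
            self._insert(tokens[i:i + max_phrase_len])

    def phrase_count(self, phrase):
        node = self
        for tok in phrase:
            node = node.children.get(tok)
            if node is None:
                return 0
        return node.count

trie = PhraseTrie()
trie.add_document("the stock market fell sharply".split())
trie.add_document("the stock market rallied today".split())
print(trie.phrase_count(("the", "stock", "market")))   # 2
print(trie.phrase_count(("market", "fell")))           # 1
```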

N-gram based language models are very popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominently used and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results outperforming previous methods. In this work, we propose an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and perform a comparative performance analysis between the Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking the grammatical correctness of Bangla sentences.
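
To make the modelling side concrete, here is a compact, self-contained sketch of an interpolated Kneser-Ney bigram model used to score sentences against a probability threshold. The toy English corpus, discount of 0.75, and threshold are illustrative assumptions, not the Bangla setup or the thresholding technique of the paper.

```python
# Interpolated Kneser-Ney bigram model used to label sentences as
# grammatical/ungrammatical via a probability threshold (all values toy).
from collections import Counter

class KNBigram:
    def __init__(self, sentences, discount=0.75):
        self.d = discount
        self.bigrams = Counter()
        self.unigrams = Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent + ["</s>"]
            self.unigrams.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.followers = Counter(v for (v, w) in self.bigrams)        # N1+(v .)
        self.continuations = Counter(w for (v, w) in self.bigrams)    # N1+(. w)
        self.total_bigram_types = len(self.bigrams)

    def prob(self, v, w):
        p_cont = self.continuations[w] / self.total_bigram_types or 1e-9
        c_v = self.unigrams[v]
        if c_v == 0:                       # unseen context: pure continuation
            return p_cont
        discounted = max(self.bigrams[(v, w)] - self.d, 0) / c_v
        backoff_weight = self.d * self.followers[v] / c_v
        return discounted + backoff_weight * p_cont

    def sentence_prob(self, sent):
        tokens = ["<s>"] + sent + ["</s>"]
        p = 1.0
        for v, w in zip(tokens, tokens[1:]):
            p *= self.prob(v, w)
        return p

corpus = [["she", "reads", "books"], ["he", "reads", "papers"],
          ["she", "writes", "books"]]
lm = KNBigram(corpus)
threshold = 1e-3                 # illustrative; the paper tunes this value
for sent in (["he", "reads", "books"], ["books", "he", "reads"]):
    p = lm.sentence_prob(sent)
    print(" ".join(sent), p, "grammatical" if p > threshold else "ungrammatical")
```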

Learning context and its effect on learner pragmatics: Degree modifiers as lexical teddy bears
The article focuses on the role of learning context in language acquisition. Learning context here refers to the environments in which languages are learned: either a foreign language environment (often isolated from the target society, culture and language; mostly in a classroom and educational setting and consisting of formal learning) or a second language environment (surrounded by the target language, culture and society; often a natural setting outside of the classroom and comprising informal learning). Previous studies have shown that educational settings have an effect on how language is learned and which skills (e.g. pragmatic or grammatical competence) are mastered earlier.
The data for this study come from four corpora. There were two learner corpora: the International Corpus of Learner Finnish (ICLFI), consisting of foreign language data (texts produced by learners studying Finnish outside Finland); and the National Certificate Corpus (NCC), consisting of second language data (texts produced for proficiency tests in Finland). Both sets of data are rated according to the Common European Framework of Reference for Languages (CEFR, Council of Europe, 2001). The data for the present study comprise texts rated at level B1. The size of the ICLFI B1 data is 102,000 tokens and that of the NCC B1 data 23,500. In addition, two native language corpora and Internet data are used in the study.
Keyword analysis showed that certain degree modifiers are overused in the NCC compared to the ICLFI, which indicates that the learning context may affect learner production. The results, however, seem to be contradictory as well as more complex than they appear. They support the conclusions of previous studies that suggest learners overuse degree modifiers in general. However, the range of degree modifiers seems to be more restricted in the second language production data than in the foreign language data. Furthermore, both learner groups tend to favour certain lexical teddy bears, but these differ according to the learner group and learning context. Finally, the study reveals that the usage of degree modifiers is related to syntagmatic associations (collocations) and situational context (spoken vs. written language) and that the associations deviate from lexical structures produced by native speakers.

A frustrating aspect of software development is that compiler error messages often fail to locate the actual cause of a syntax error. An errant semicolon or brace can result in many errors reported throughout the file. We seek to find the actual source of these syntax errors by relying on the consistency of software: valid source code is usually repetitive and unsurprising. We exploit this consistency by constructing a simple N-gram language model of lexed source code tokens. We implemented an automatic Java syntax-error locator using the corpus of the project itself and evaluated its performance on mutated source code from several projects. Our tool, trained on the past versions of a project, can effectively augment the syntax error locations produced by the native compiler. Thus we provide a methodology and tool that exploits the naturalness of software source code to detect syntax errors alongside the parser.
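
The underlying idea can be illustrated briefly: train an n-gram model on known-good token sequences and flag the most surprising token pair in a new stream. The sketch below uses a crude regex lexer, a bigram model with add-one smoothing, and invented snippets, so it is only a simplified analogue of the tool described.

```python
# Simplified sketch: a bigram model over "known-good" code tokens flags the
# most surprising bigram in a new token stream as the likely error location.
# The tokenizer here is a crude regex, not a real Java lexer.
import math
import re
from collections import Counter

def lex(code):
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)

def train(snippets):
    bigrams, unigrams = Counter(), Counter()
    for code in snippets:
        toks = lex(code)
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return bigrams, unigrams

def surprise(tokens, bigrams, unigrams, vocab_size):
    # add-one smoothed negative log-probabilities per bigram
    return [-math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:])]

good = ["int x = 1 ; int y = 2 ; x = x + y ;",
        "int z = 3 ; z = z + 1 ; x = z + 1 ;"]
bigrams, unigrams = train(good)

buggy = lex("int z = 3 ; ; x = z + 1 ;")     # stray semicolon
scores = surprise(buggy, bigrams, unigrams, len(unigrams))
worst = max(range(len(scores)), key=scores.__getitem__)
print("most surprising token pair:", buggy[worst], buggy[worst + 1])
```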

Objective: To evaluate the impact of changing perinatal practices on survival rates and 4-year neurodevelopmental outcome for infants of birthweight 500-999 g. Methodology: The study was a tertiary hospital-based prospective cohort study that compared survival, impairment and handicap rates between two eras. All 348 live, inborn infants and 49 outborn infants of birthweight 500-999 g were prospectively enrolled in a study of survival and outcome. Rates of survival, neurodevelopmental impairment and functional handicap at 4 years were compared between eras. Perinatal risk factors for handicap were also compared between eras.

Malware is malicious code which is developed to harm a computer or network. The number of malware samples is growing so fast that this growth is forcing computer security researchers to invent new methods to protect computers and networks. There are three main methods used for malware detection: signature-based, behavioural-based and heuristic ones. Signature-based malware detection is the most common method used by commercial antiviruses, but it can be used only in cases which are completely known and documented. Behavioural malware detection was introduced to cover the deficiencies of the signature-based method. However, because of some shortcomings, heuristic methods have been introduced. In this paper, we discuss the state-of-the-art heuristic malware detection methods and briefly overview various features used in these methods, such as API calls, opcodes, n-grams, etc., and discuss their advantages and disadvantages.

This paper analyzes and compares the results of usable methods for discrepancy detection based on character n-gram profiles (the set of normalized character n-gram frequencies of a text) for English and Arabic documents. English and Arabic texts were analyzed from the point of view of many statistical characteristics. We cover some statistical differences between the two languages and apply some heuristics for measuring the dissimilarities between text parts. The results for each text can call attention to whether its parts were written by the same author. We evaluate some Arabic and English documents and show which of their parts contain discrepancies and need further analysis for plagiarism detection. The analysis depends on selected parameters prepared in experiments.
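
One generic way to operationalise such profiles is sketched below: compare the normalized character 3-gram profile of each sliding window against the whole-document profile and flag the most deviant windows. The dissimilarity function and parameters are illustrative stand-ins for the heuristics the paper evaluates.

```python
# Generic character n-gram profile comparison for intrinsic discrepancy
# detection; the symmetric relative-difference dissimilarity below is an
# assumption, not the paper's specific heuristics.
from collections import Counter

def profile(text, n=3):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def dissimilarity(p, q):
    grams = set(p) | set(q)
    return sum(((p.get(g, 0) - q.get(g, 0)) / (p.get(g, 0) + q.get(g, 0))) ** 2
               for g in grams) / len(grams)

def window_scores(text, window=200, step=100):
    doc_profile = profile(text)
    scores = []
    for start in range(0, max(len(text) - window, 1), step):
        chunk = text[start:start + window]
        scores.append((start, dissimilarity(profile(chunk), doc_profile)))
    return scores

text = ("the cat sat on the mat and the dog lay by the door " * 10 +
        "whereupon aforementioned quadrupeds perambulated henceforth " * 4)
for start, score in window_scores(text):
    print(start, round(score, 3))   # later windows (different style) score higher
```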

Although the phraseological coverage of dictionaries has improved considerably in recent years, bilingual dictionaries are still lagging behind. The objective of our paper is to show that including a range of multi-word units (MWUs) extracted via the n-gram method can considerably enhance the quality of English<>French bilingual dictionaries. We show how multiword units extracted from monolingual corpora can enhance the phraseological coverage of bilingual dictionaries and suggest ways in which the presentation of these units can be improved. We also focus on the role of translation corpora to enhance the accuracy and diversity of MWU translations in bilingual dictionaries.
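
As a minimal illustration of n-gram-based MWU candidate extraction (not the authors' procedure), the sketch below ranks corpus bigrams by pointwise mutual information on a toy corpus; in practice such lists are filtered and reviewed before entering a dictionary.

```python
# Rank bigram MWU candidates by pointwise mutual information (PMI) on a toy
# corpus; the minimum count and corpus are illustrative assumptions.
import math
from collections import Counter

def pmi_bigrams(sentences, min_count=2):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        toks = [t.lower() for t in sent.split()]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    n = sum(unigrams.values())
    ranked = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log((c / n) / ((unigrams[a] / n) * (unigrams[b] / n)))
        ranked.append(((a, b), c, round(pmi, 2)))
    return sorted(ranked, key=lambda item: item[2], reverse=True)

corpus = ["he took it for granted that she would come",
          "she took it for granted he agreed",
          "he took the bus to work",
          "she took the train to work"]
for gram, count, pmi in pmi_bigrams(corpus):
    print(" ".join(gram), count, pmi)
```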

Colors have a very important role in our perception of the world. We often associate colors with various concepts at different levels of consciousness, and these associations can be relevant to many fields such as education and advertisement. However, to the best of our knowledge, there are no systematic approaches to aid the automatic development of resources encoding this kind of knowledge. In this paper, we propose three computational methods based on image analysis, language models, and latent semantic analysis to automatically associate colors to words. We compare these methods against a gold standard obtained via crowdsourcing. The results show that each method is effective in capturing different aspects of word-color associations.

This work investigates using n-gram processing and a temporal relation encoding to provide relational information about events extracted from media streams. The event information is temporal and nominal in nature, being categorized by a descriptive label or symbolic means, and can be difficult to compare relationally or to rank. Given a parsed sequence of events, relational information pertinent to comparison between events can be obtained through the application of n-gram techniques borrowed from speech processing and temporal relation logic. The procedure is discussed along with results computed using a representative data set characterized by nominal event data.

In recent years there has been an increased interest in the modelling and recognition of human activities involving highly structured and semantically rich behaviour such as dance, aerobics, and sign language. A novel approach is presented for automatically acquiring stochastic models of the high-level structure of an activity without the assumption of any prior knowledge. The process involves temporal segmentation into plausible atomic behaviour components and the use of variable length Markov models for the efficient representation of behaviours. Experimental results are presented which demonstrate the synthesis of realistic sample behaviours and the performance of models for long-term temporal prediction.

We present strategies and results for identifying the symbol type of every character in a text document. Assuming reasonable word and character segmentation for shape clustering, we designed several type recognition methods that depend on cluster n-grams, characteristics of neighbors, and within-word context. On an ASCII test corpus of 925 articles, these methods represent a substantial improvement over default assignment of all characters to lower case.

In this paper, we present our research on dialog dependent language modeling. In accordance with a speech (or sentence) production model in a discourse, we split language modeling into two components; namely, dialog dependent concept modeling and syntactic modeling. ...

In the detection of web attacks, it is necessary that Web Application Firewalls (WAFs) are effective and, at the same time, efficient. In this paper, we propose a new methodology for web attack detection that enhances these two aspects of WAFs. It involves both feature construction and feature selection. For the feature construction phase, many professionals rely on their expert knowledge to define a set of important features, which normally leads to high and reliable attack detection rates. Nevertheless, it is a manual process and not quickly adaptive to changing network environments. Alternatively, automatic feature construction methods (such as n-grams) overcome this drawback, but they provide unreliable results. Therefore, in this paper, we propose to combine expert knowledge with the n-gram feature construction method for reliable and efficient web attack detection. However, the number of n-grams grows exponentially with n, which usually leads to high dimensionality problems. Hence, we propose to apply feature selection to reduce the number of redundant and irrelevant features. In particular, we study the recently proposed Generic Feature Selection (GeFS) measure, which has been successfully tested in intrusion detection systems. Additionally, we use several decision tree algorithms as classifiers of WAFs. The experiments are conducted on the publicly available ECML/PKDD 2007 dataset. The results show that the combination of expert knowledge and n-grams outperforms each separate technique and that the GeFS measure can greatly reduce the number of features, thus enhancing both the effectiveness and efficiency of WAFs.
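
The overall pipeline can be sketched schematically with scikit-learn, using fabricated requests, character 2-/3-grams, chi-squared feature selection in place of GeFS, and a single decision tree:

```python
# Schematic sketch of the pipeline shape (not the ECML/PKDD 2007 setup or the
# GeFS measure): n-gram feature construction, feature selection, decision tree.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

requests = [
    "id=42&name=alice",
    "page=3&sort=asc",
    "id=1 OR 1=1 --",
    "q=<script>alert(1)</script>",
    "user=bob&lang=en",
    "file=../../etc/passwd",
]
labels = [0, 0, 1, 1, 0, 1]     # 0 = legitimate, 1 = attack (toy labels)

pipeline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 3)),  # n-gram construction
    SelectKBest(chi2, k=20),                               # feature selection
    DecisionTreeClassifier(random_state=0),
)
pipeline.fit(requests, labels)
print(pipeline.predict(["name=carol&page=1", "q=<script>steal()</script>"]))
```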

Linguistic regularity and repetitiveness are widely studied in data-driven linguistic research by applying the n-gram method. Together with language variation, they comprise key features of a natural language. The present paper discusses the applicability of the n-gram method on historical data in the legal domain, for the consolidation of linguistic constituents and their pleonastic usage are indicators of the professional register development.
Due to a restricted stock of n-grams, the paradigmatic data arrangement has been selected as opposed to the existing practice of the syntagmatic one. Not only has this procedure impacted the frequency and the accuracy of those counts compared with previous research, but it has also paved the way for combining qualitative and quantitative investigation of n-grams. Finally, we explain the advantages of applying the proposed approach to n-gram analysis in the given settings, which allows for studying domain-specific linguistic regularity beyond the lexis.

N-grams are used to quantify the similarity between two documents or the similarity between two collections of words. This paper shows how N-grams of length 3 and N-grams of length 4, both coupled with text pre-processing (including stop word removal and stemming according to MXit spelling conventions), can be used to categorize very short mathematical conversations conducted in MXit lingo into broad mathematical groups such as algebra, geometry, trigonometry, and calculus. MXit lingo is an abbreviated form of written English which children, teenagers and young adults utilise when communicating using the popular MXit chat mechanism over cell phones. Conversations from the "Dr Math" project were used for this analysis. "Dr Math" is a mathematics tutoring service which links primary and secondary school pupils to tutors from local universities. The tutors assist the pupils with their mathematics homework.

Soft cardinality (SC) is a softened version of the classical cardinality of set theory. However, given its prohibitive cost of computing (exponential order), an approximation quadratic in the number of terms in the text has been proposed in the past. SC Spectra is a new method of approximation in linear time for text strings, which divides text strings into consecutive substrings (i.e., q-grams) of different sizes. Thus, SC in combination with resemblance coefficients allowed the construction of a family of similarity functions for text comparison. These similarity measures have been used in the past to address a problem of entity resolution (name matching), outperforming the SoftTFIDF measure. The SC Spectra method improves on the previous results, using less time and achieving better performance. This allows the new method to be used with relatively large documents such as those included in classic information retrieval collections. The SC Spectra method exceeded the SoftTFIDF and cosine tf-idf baselines with an approach that requires no term weighting.

Real-time dose rate measurements along with the route followed by the radiation monitoring vehicle and the quick analysis of the data are of crucial importance during a nuclear or radiological emergency. To develop a timely response capability in different threat scenarios, such as the release of radioactive materials to the environment during any nuclear or radiological accident, the Radiation Safety Systems Division, BARC has developed an advanced online radiation measurement cum vehicle tracking system. For preparedness and response to any nuclear/radiological emergency scenario which may occur anywhere, the system designed is a global system for mobile (GSM) based radiation monitoring system (GRaMS) along with a global positioning system (GPS). It uses an energy compensated GM detector for radiation monitoring and is attached with commercially available GPS for online acquisition of positional coordinates with time, and GSM modem for online data transfer to a remote cont...

We are interested in protein classification based on their primary structures. The goal is to automatically classify protein sequences according to their families. This task goes through the extraction of a set of descriptors that we present to the supervised learning algorithms. There are many types of descriptors used in the literature. The most popular one is the n-gram, which corresponds to a series of characters of length n. The standard n-gram approach consists of first setting the parameter n, extracting the corresponding n-gram descriptors, and working with this value during the whole data mining process. In this paper, we propose a hierarchical approach to n-gram construction. The goal is to obtain descriptors of varying length for a better characterization of the protein families. This approach tries to answer to the domain knowledge of the biologists: the patterns which characterize a protein family most of the time have varying lengths. Our idea is to transpose the frequent itemset extraction principle, mainly used for association rule mining, to n-gram extraction in the protein classification context. The experimentation shows that the new approach is consistent with the biological reality and has the same accuracy as the standard approach.
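
The level-wise construction borrowed from frequent itemset mining can be sketched as follows, on toy sequences: keep frequent 1-grams, then repeatedly extend frequent n-grams by one residue and retain only the extensions that stay above the support threshold, yielding descriptors of varying length. The sequences, support threshold, and right-only extension are illustrative assumptions, not the authors' exact procedure.

```python
# Apriori-style extraction of frequent variable-length character n-grams
# from toy protein-like sequences (support counted as document frequency).
from collections import Counter

def count_support(sequences, grams):
    support = Counter()
    for gram in grams:
        support[gram] = sum(gram in seq for seq in sequences)
    return support

def frequent_variable_ngrams(sequences, min_support=2, max_len=5):
    alphabet = {c for seq in sequences for c in seq}
    current = {g for g, s in count_support(sequences, alphabet).items()
               if s >= min_support}
    frequent = set(current)
    for _ in range(max_len - 1):
        candidates = {g + c for g in current for c in alphabet}
        current = {g for g, s in count_support(sequences, candidates).items()
                   if s >= min_support}
        if not current:
            break
        frequent |= current
    return sorted(frequent, key=len, reverse=True)

proteins = ["MKVLAAGIK", "MKVLSAGIR", "GIKMKVLAA", "TTPLSAGIR"]
for gram in frequent_variable_ngrams(proteins):
    print(gram)
```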

Mining databases is becoming more essential nowadays. Extracting knowledge through the data mining process faces some issues when its predictions are expected to reach a certain level of perfection or accuracy. In recent years, rough set theory (RST) has been accepted as an effective technique to discover hidden patterns in data, and it is also known for its simplicity. Its concept of attribute reduction is executed using approximations that have been used in many areas of mining. One of the main issues in data mining is dimensionality reduction, for which researchers have proposed many methods. RST is an efficient approach that simplifies some of the problems that other processes create in text mining tasks. In text mining, the primary process is preprocessing, where we need to apply some filters to remove irrelevant words and normalize words using properties that keep the classifier uncomplicated. In this paper, we mine a database of product reviews using the rough set reduction concept, testing models created from combinations of n-grams that capture the word dependency of RST in text mining. Experimental results confirm that the combination of unigrams and bigrams performs well in comparison with the other models.

This paper examines original film dialogue from a cross-linguistic perspective. More specifically, the paper will identify and compare the most frequent 3-grams, i.e. 3-word clusters, in a corpus of original English and original Italian films. This will be done with the specific aim of exploring the dimensions of comparability between the language of English films and the language of Italian films. It will be shown that the dialogues of English and Italian films exhibit a pronounced degree of similarity not only in terms of their decidedly clausal 'texture' and markedly interactional focus but also at the level of individual 3-grams, namely the English I don't know and the Italian non lo so, whose various functions will be described. --- Zago, Raffaele. 2017. Cross-linguistic dimensions of comparability in film dialogue. Illuminazioni 41: 250-274.

The main issue in Text Document Clustering (TDC) is document similarity. In order to measure similarity, documents must be transformed into numerical values. The Vector Space Model (VSM) is one technique capable of converting documents into numerical values. In the VSM, documents are represented by the frequencies of the terms inside them, and it works like a Bag of Words (BOW). BOW gives rise to two major problems since it ignores term relationships by treating terms as single and independent. These two problems are the concepts of polysemy and synonymy, which reflect the relationships between terms. This study combined WordNet and N-grams to overcome both problems. By modifying document features from single terms to the polysemy and synonymy concepts, it improved VSM performance. There are four experimental steps: text document selection, preprocessing, clustering, and cluster evaluation using the F-measure. With the reuters50_50 dataset obtained from the UCI repository, the experiment was successful and the results are promising.
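
A compact sketch of such a pipeline, on a four-document toy corpus rather than reuters50_50, might look as follows: expand documents with WordNet synonyms, add bigram features next to unigrams, cluster with k-means, and score the clustering with a pairwise F-measure. All parameters, the corpus, and the synonym-expansion heuristic are assumptions for illustration.

```python
# Toy sketch: WordNet synonym expansion + unigram/bigram TF-IDF features,
# k-means clustering, and a pairwise F-measure against toy gold labels.
# Requires: nltk.download("wordnet")
from itertools import combinations
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def expand_with_synonyms(text, max_syns=2):
    extra = []
    for word in text.split():
        syns = {l.name().lower() for s in wordnet.synsets(word)
                for l in s.lemmas() if "_" not in l.name()}
        extra.extend(sorted(syns - {word})[:max_syns])
    return text + " " + " ".join(extra)

docs = ["the car raced down the road", "a fast automobile on the highway",
        "the chef cooked a tasty meal", "dinner was prepared by the cook"]
true_labels = [0, 0, 1, 1]

expanded = [expand_with_synonyms(d) for d in docs]
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(expanded)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Pairwise F-measure: pairs in the same cluster and same class are true
# positives; same cluster / different class are false positives; different
# cluster / same class are false negatives.
tp = fp = fn = 0
for i, j in combinations(range(len(docs)), 2):
    same_cluster, same_class = pred[i] == pred[j], true_labels[i] == true_labels[j]
    tp += same_cluster and same_class
    fp += same_cluster and not same_class
    fn += same_class and not same_cluster
f_measure = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
print("pairwise F-measure:", round(f_measure, 3))
```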