Identification of Non-lexicon Slang Unigrams in Body-enhancement Medicinal UBE

Identification of Non-Lexicon Non-Slang Unigrams in Body-enhancement Medicinal UBE

Email has become a fast and cheap means of online communication. The main threat to email is Unsolicited Bulk Email (UBE), commonly called spam email. The current work aims at the identification of unigrams in more than 2700 UBE that advertise body-enhancement drugs. The identification is based on the requirement that the unigram is neither present in a dictionary nor a slang term. The motives of the paper are manifold. This is an attempt to analyze spamming behaviour and the employment of the word-mutation technique. Alongside this, we have attempted to better understand the spam, the slang and their interplay. The problem has been addressed by employing the tokenization technique and the unigram BOW model. We found that non-lexicon words constitute nearly 66% of the total number of lexis in the corpus, whereas non-slang words constitute nearly 2.4% of the non-lexicon words. Further, non-lexicon non-slang unigrams composed of 2 lexicon words form more than 71% of the total number of such unigrams. To the best of our knowledge, this is the first attempt to analyze the usage of non-lexicon non-slang unigrams in any kind of UBE.
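The filtering step described in this abstract can be illustrated with a minimal sketch. It assumes a toy dictionary and a toy slang list (`LEXICON` and `SLANG` are hypothetical stand-ins, as are all function names); the paper does not state which dictionary, slang source, or splitting procedure was actually used, so this is only one plausible reading of "neither present in a dictionary nor a slang term" and of "composed of 2 lexicon words".

```python
import re

# Hypothetical toy word lists; a real study would load a full dictionary
# and a slang lexicon instead.
LEXICON = {"buy", "cheap", "pills", "now", "best", "price", "online", "order"}
SLANG = {"meds", "pharma"}


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens (unigrams)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def non_lexicon_non_slang(messages, lexicon, slang):
    """Return the sorted unigrams that are in neither the lexicon nor the slang list."""
    vocabulary = set()
    for message in messages:
        vocabulary.update(tokenize(message))
    return sorted(t for t in vocabulary if t not in lexicon and t not in slang)


def split_into_two_lexicon_words(token, lexicon):
    """Return (head, tail) if the token is two lexicon words run together, else None."""
    for i in range(1, len(token)):
        head, tail = token[:i], token[i:]
        if head in lexicon and tail in lexicon:
            return head, tail
    return None


messages = ["Buy cheap meds now", "Bestprice pills, order online", "buynow p1lls"]
candidates = non_lexicon_non_slang(messages, LEXICON, SLANG)
for token in candidates:
    print(token, split_into_two_lexicon_words(token, LEXICON))
```

On this toy input, `bestprice` and `buynow` are reported as concatenations of two lexicon words, while `p1lls` is flagged as a mutated token that no split can explain.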

Slang Unigrams Based Classification of Male Body-enhancement Medicinal UBE

E-mail has become a fast and cheap means of online communication. The main threat to e-mail is Unsolicited Bulk E-mail (UBE), commonly known as spam e-mail. The current work aims at classification of UBE that advertises body enhancement drugs for males. The classification is based on the presence of slang unigrams of male body parts for which the enhancement drug is advertised. The motives of the paper are manifold. This is an attempt to provide sub-categories for medicinal UBE and analyze the spamming behavior. It is also an attempt to better understand the UBE, the slang and their interplay. The problem has been addressed by employing the tokenization technique and unigram Bag of ...

Identification of Most Frequently Occurring Lexis in Body-enhancement Medicinal Unsolicited Bulk e-mails

E-mail has become an important means of electronic communication, but the viability of its usage is marred by Unsolicited Bulk E-mail (UBE) messages. UBE consists of many types, such as pornographic, virus-infected and 'cry-for-help' messages, as well as fake and fraudulent offers for jobs, winnings and medicines. UBE poses technical and socio-economic challenges to the usage of e-mail. To meet this challenge and combat this menace, we need to understand UBE. Towards this end, the current paper presents a content-based textual analysis of more than 2700 body-enhancement medicinal UBE. Technically, this is an application of text parsing and tokenization to an unstructured textual document, and we approach it using Bag of Words (BOW) and Vector Space Document Model techniques. We have attempted to identify the most frequently occurring lexis in the UBE documents that advertise various products for body enhancement. An analysis of the top 100 such lexis is also presented. We exhibit the relationship between the occurrence of a word from the identified lexis-set in a given UBE and the probability that the given UBE is one advertising a fake medicinal product. To the best of our knowledge and survey of related literature, this is the first formal attempt at identification of the most frequently occurring lexis in such UBE by textual analysis. Finally, this is a sincere attempt to raise alertness against, and mitigate the threat of, such luring but fake UBE.
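As a rough illustration of the two quantities this abstract mentions, the sketch below counts the most frequent unigrams in a BOW representation and estimates how often a message containing a given word belongs to the target class. The toy messages, labels, and function names are all hypothetical, and the simple document-frequency ratio used here is an assumption; the abstract does not say how the authors computed their probability.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_lexis(docs, k):
    """Return the k most frequent unigrams across docs as (word, count) pairs."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts.most_common(k)


def prob_target_given_word(word, labeled_docs):
    """Estimate P(target class | word occurs) as a ratio of document counts.

    labeled_docs is a list of (text, is_target) pairs. Returns None when
    no document contains the word.
    """
    labels = [is_target for text, is_target in labeled_docs
              if word in set(tokenize(text))]
    if not labels:
        return None
    return sum(labels) / len(labels)


# Hypothetical toy corpus: True marks a medicinal UBE, False any other mail.
labeled_docs = [
    ("cheap pills online", True),
    ("pills pills best price", True),
    ("meeting notes online", False),
    ("cheap flights", False),
]
medicinal = [text for text, is_target in labeled_docs if is_target]

print(top_lexis(medicinal, 1))
print(prob_target_given_word("pills", labeled_docs))
print(prob_target_given_word("online", labeled_docs))
```

With a real corpus, the same ratio could be computed for each of the top 100 lexis to see which words are the strongest indicators of the class.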

Identification of Slang Words Used in Pornographic Unsolicited Bulk Emails

Email has become a fast and cheap means of online communication. The main threat to email is Unsolicited Bulk Email (UBE), commonly known as spam email. The current work aims at identification of slang words in Pornographic UBE. The motives of the paper are manifold. This is an attempt to better understand the UBE, the slang and their interplay. The problem has been addressed by employing the tokenization technique and the unigram BOW model. The current paper reports the first results on identification of 115 slang words from more than 1850 Pornographic UBE analyzed by us. To the best of our knowledge, this is the first attempt to identify slang words in a corpus of Pornographic UBE.

Identification of Hindi Words Used in Pornographic Unsolicited Bulk Emails

E-mail has become a fast and cheap means of online communication. The main threat to e-mail is Unsolicited Bulk E-mail (UBE), commonly known as spam e-mail. The current work aims at identification of Hindi words in pornographic UBE. The motives of the paper are manifold. This is an attempt to better understand the UBE and its interplay with the regional language in the perspective of international spamming. The problem has been addressed by employing the tokenization technique and the unigram BOW model. The current paper reports the first results on the identification of 87 Hindi words from more than 1,850 pornographic UBE analyzed by us. To the best of our knowledge, this is the first attempt to identify Hindi words in a corpus of pornographic UBE. Introduction: E-mail has become an efficient and popular communication mechanism as the number of Internet users has increased. It has provided both a faster and a cheaper form of communication. But a large part of e-mail traffic consists of non-personal, non-time-critical and unsolicited information. This type of e-mail is called Unsolicited Bulk E-mail (UBE) and is commonly known by various other synonymous names like spam

Modelling Abbreviation In Internet Slang: a Comparison Study of Indonesian Internet Slang and English Internet Slang

ETERNAL (English Teaching Journal), 2019

People all around the world now use the internet, and internet usage changes language usage as well. People tend to use short words, so they abbreviate some words or phrases. Many new words have been coined lately because of internet use. The formation of new words belongs to the study of morphology. The main theory of this research is morphology; it also uses the theory of abbreviation. Several types of abbreviation, such as blends, acronyms, alphabetisms and clippings, are found in this research. This research is entitled ‘Modelling Abbreviation in Internet Slang: A Comparison Study of Indonesian Internet Slang and English Internet Slang’. The data are taken from websites that host internet slang dictionaries. The writer took only 20 data items for this research from those online internet slang dictionaries: ten (10) for English internet slang and ten (10) for Indonesian internet slang. Conducting the analysis, the writer ...

A corpus-based investigation of junk emails

2002

Almost everyone who has an email account receives unwanted emails from time to time. These emails can be jokes from friends or commercial product offers from unknown people. In this paper we focus on those unwanted messages that try to promote a product or service, or to offer some "hot" business opportunities. These messages are called junk emails. Several methods to filter junk emails have been proposed, but none considers the linguistic characteristics of junk emails. In this paper, we investigate the linguistic features of a corpus of junk emails, and try to decide if they constitute a distinct genre. Our corpus of junk emails was built from the messages received by the authors over a period of time. Initially, the corpus consisted of 1563 files, but after eliminating duplicates automatically we kept only 673 files, totalling just over 373,000 tokens. In order to decide if junk emails constitute a different genre, a comparison with a corpus of leaflets extracted from the BNC and with the whole BNC corpus is carried out. Several characteristics at the lexical and grammatical levels were identified.
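The corpus-cleaning step mentioned here, automatic duplicate elimination followed by a token count, might look something like the sketch below. It assumes duplicates are messages that are identical after lowercasing and whitespace normalization, and that tokens are whitespace-separated strings; the abstract does not describe the authors' actual criteria, so both choices and all names are illustrative only.

```python
import hashlib


def deduplicate(messages):
    """Keep the first copy of each message, comparing normalized text by hash."""
    seen = set()
    unique = []
    for message in messages:
        # Assumption: case and whitespace differences do not make a new message.
        normalized = " ".join(message.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(message)
    return unique


def count_tokens(messages):
    """Count whitespace-separated tokens across all messages."""
    return sum(len(message.split()) for message in messages)


# Hypothetical toy corpus with one near-verbatim duplicate.
corpus = [
    "Hot  business opportunity",
    "hot business opportunity",
    "Lose weight fast",
]
unique = deduplicate(corpus)
print(len(corpus), "->", len(unique), "files,", count_tokens(unique), "tokens")
```

Hashing the normalized text keeps memory use low for large corpora, although it only catches exact duplicates and would miss messages that differ by a few characters.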

Identification of Most Frequently Occurring Lexis in Winnings-announcing Unsolicited Bulk e-mails

E-mail has become an important means of electronic communication, but the viability of its usage is marred by Unsolicited Bulk E-mail (UBE) messages. UBE consists of many types, such as pornographic, virus-infected and 'cry-for-help' messages, as well as fake and fraudulent offers for jobs, winnings and medicines. UBE poses technical and socio-economic challenges to the usage of e-mail. To meet this challenge and combat this menace, we need to understand UBE. Towards this end, the current paper presents a content-based textual analysis of nearly 3000 winnings-announcing UBE. Technically, this is an application of text parsing and tokenization to an unstructured textual document, and we approach it using Bag of Words (BOW) and Vector Space Document Model techniques. We have attempted to identify the most frequently occurring lexis in the winnings-announcing UBE documents. An analysis of the top 100 such lexis is also presented. We exhibit the relationship between the occurrence of a word from the identified lexis-set in a given UBE and the probability that the given UBE is one announcing fake winnings. To the best of our knowledge and survey of related literature, this is the first formal attempt at identification of the most frequently occurring lexis in winnings-announcing UBE by textual analysis. Finally, this is a sincere attempt to raise alertness against, and mitigate the threat of, such luring but fake UBE.