Machine Learning, Natural Language Processing Research Papers

Graham Greene, the twentieth-century British author, demonstrated an interest in the problems of evil, violence and alienation from the very beginning of his writing career. In his novels he created a unique world of isolation, oppression and mistrust, which was later given the name Greeneland. It reflects Greene’s belief in the reality of another world, which is removed from people in the same way God is, but which undoubtedly exists and provides rich material for the human imagination to feed on. Since for Greene imagination and intuition are significantly more important than objective measurement, the ordinary, run-down and third-rate are given a deeper, almost allegorical significance, as the three analysed novels (The Heart of the Matter, Our Man in Havana and The Human Factor) show. As a consequence of the Greenean method of permeating the facts of reality with an omnipresent sense of suffering, unhappiness and impending catastrophe, the border between reality and imagination becomes blurry, fades and finally disappears.
Key words: Greeneland, reality, imagination

Named Entity Recognition (NER) plays a significant role in Information Extraction (IE). In English, NER systems have achieved excellent performance, but for the Indonesian language the systems still need considerable improvement. To create a reliable NER system using a machine learning approach, a massive dataset for training the classifier is a must. Several studies have proposed methods for automatically building a dataset for Indonesian NER, using Indonesian Wikipedia articles as the source of the dataset and DBpedia as the reference for determining entity types automatically. The objective of our research is to improve the quality of the automatically tagged dataset. We propose a new method of using DBpedia as the reference for named entities: we created rules for expanding the DBpedia entity corpus for the categories person, place, and organization. The resulting training dataset is used to train an Indonesian NER classifier with the Stanford NER tool. The evaluation shows that our method improves recall significantly but has lower precision compared to the previous research.
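As a hedged illustration of the idea, the sketch below auto-tags tokens against a gazetteer expanded by a simple rule. The entity lists, the surname-expansion rule, and the sample sentence are hypothetical stand-ins for the paper's DBpedia-derived data, and the output format merely approximates the CoNLL-style input the Stanford NER tool expects.

```python
# Hypothetical sketch: gazetteer-based auto-tagging for building NER training data.
PERSON = {"Joko Widodo", "Soekarno"}
PLACE = {"Jakarta", "Bandung"}

# Example expansion rule: let a multi-word person name also match by surname.
# This is one plausible way to raise recall at some cost in precision.
for name in list(PERSON):
    parts = name.split()
    if len(parts) > 1:
        PERSON.add(parts[-1])

def auto_tag(sentence):
    """Tag each token with PERSON/PLACE/O in a CoNLL-like token\\tLABEL format."""
    rows = []
    for token in sentence.split():
        if token in PERSON:
            label = "PERSON"
        elif token in PLACE:
            label = "PLACE"
        else:
            label = "O"
        rows.append(f"{token}\t{label}")
    return "\n".join(rows)

print(auto_tag("Soekarno lahir dan dibesarkan jauh dari Jakarta"))
```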

As heart disease is the number one killer in the world today, it has become one of the most difficult diseases to diagnose. If heart disease is diagnosed early, many lives can be saved. Machine learning classification techniques can significantly benefit the medical field by providing an accurate, unambiguous and quick diagnosis of diseases, and hence save time for both doctors and patients. We start by overviewing machine learning and briefly describing the most commonly used classification techniques for diagnosing heart disease. We used different attributes that relate well to heart disease in order to find the better prediction method, and we also used classification algorithms for prediction.
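As a minimal sketch of this kind of classifier comparison, the code below trains two common classifiers with scikit-learn. The synthetic data stands in for clinical attributes such as age, cholesterol, and resting blood pressure; the feature set and models are assumptions for illustration, not the paper's actual setup.

```python
# Minimal sketch: comparing two classifiers on synthetic "heart disease" data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 13 features, roughly mirroring the attribute count of common heart datasets.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__, accuracy_score(y_test, pred))
```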

We present Tambr, a new piece of software that automatically generates musical pieces from text, translating literature into sound using multiple synthesized voices selected for the way in which their timbre relates to the meaning and sentiment of the topics conveyed in the story. It achieves this by leveraging a large lexical semantic database to implement a machine-learning-based synthesizer search engine that selects the synthesizers whose meaning best reflects the ideas of the novel. Tambr uses sentiment analysis to generate the pitches, durations, and intervals of the output melodies in a way corresponding to the sentiment of the novel, implementing algorithmic composition of literature-based music at a level of musicality not previously explored.
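As a purely hypothetical sketch of a sentiment-to-melody mapping (Tambr's actual lexical database and mapping rules are not described here in enough detail to reproduce), the code below chooses a major or minor scale by sentiment polarity and widens melodic leaps as sentiment strength grows.

```python
# Hypothetical sketch: map a sentiment score in [-1, 1] to a short melody.
MINOR = [57, 59, 60, 62, 64, 65, 67]   # A minor scale, MIDI note numbers
MAJOR = [60, 62, 64, 65, 67, 69, 71]   # C major scale

def melody_for(sentiment, length=8):
    """Positive text gets a major scale, negative a minor one;
    stronger sentiment produces larger melodic intervals."""
    scale = MAJOR if sentiment >= 0 else MINOR
    step = 1 + int(abs(sentiment) * 3)          # bigger leaps when intense
    return [scale[(i * step) % len(scale)] for i in range(length)]

print(melody_for(0.8))    # bright, leaping major line
print(melody_for(-0.2))   # subdued minor line
```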

Natural Language Processing is a programmed approach to analyzing text that is based on both a set of theories and a set of technologies. This forum aims to bring together researchers who have designed and built software that analyzes, understands, and generates the languages humans use naturally to address computers.

Learning a new language is an integral part of human life. Even after years of learning, a person is prone to committing mistakes. These errors are due to a lack of knowledge of the target language and the influence of a previously learnt language. As a consequence, it has been felt that automatic detection and correction of grammatical errors would be of immense help as an aid for language learning.
Automatic detection and correction of grammatical errors in a morphologically rich and free-word-order language like Bangla is a non-trivial task. Little research has been done on detection and correction of grammatical errors in such languages; for Bangla, this work needs to be done de novo. The problem is to automatically detect and correct an ungrammatical Bangla sentence containing postpositional and nominal inflectional errors. A methodology needs to be devised for correcting the mistakes committed by users and also for providing relevant examples to support the suggested correction. To have an idea of how strongly such a correction can be relied upon, it is useful to devise a measure of sentence complexity with respect to the grammar correction task. If a sentence is complex, the user should not be overly reliant on the correction suggested by the system; conversely, if the complexity measure is low, the user can confidently choose the suggestion.
A sufficiently large error corpus is essential for training and testing a grammar correction methodology. Manual collection of huge error corpora is a tedious and time-consuming task, and there is a dearth of error corpora for the Bangla language. Therefore, a synthetic error corpora creation methodology has been proposed.
Divergence between two languages influences second language learners to commit grammatical mistakes. It has been widely studied that the divergence between a pair of languages has a profound effect on various fields of NLP, and the effect becomes more pronounced and acute for widely varying languages like English and Bangla. Bangla is a morphologically rich language and has a free word order; therefore, state-of-the-art Context Free Grammar (CFG) approaches are not applicable here. In addition, the lack of robust parsers, insufficient linguistic rules and the dearth of error-annotated parallel corpora make this grammar correction task much more challenging. To address these issues, a novel approach has been proposed for automatic detection and correction of Bangla grammatical errors using a Natural Language Generation (NLG) technique.
Evaluation of grammar correction systems is one of the challenges in this area of research. The performance of most available grammar checkers cannot be compared, as different systems address different types of errors; moreover, testing on a common dataset is particularly problematic when different grammar checkers are designed for different languages. To circumvent these problems, a Methodology for Evaluation of Grammar Assessment (MEGA), combining a Graded Acceptability Assessment Metric (GAAM) and a Complexity Measurement Metric (CMM), has been introduced. Initially, MEGA has been applied to our NLG-based Bangla grammar checker. Since direct comparison between available English grammar checkers and the NLG-based Bangla grammar checker is not possible, the NLG-based system has been compared against a prototype Bangla grammar checker based on standard Naïve Bayes classification. Results show that the NLG-based approach for our Bangla grammatical error detection and correction system outperforms the Naïve Bayes classifier system.
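As a hedged sketch of what such a Naïve Bayes baseline might look like (the thesis does not publish its implementation), the code below classifies sentences as grammatical or not from word n-gram features; the toy English examples merely stand in for Bangla sentences with postpositional and inflectional errors.

```python
# Minimal sketch: a Naive Bayes grammaticality classifier on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative training pairs: 1 = grammatical, 0 = ungrammatical.
sentences = ["she goes home", "she go home", "they are here", "they is here"]
labels = [1, 0, 1, 0]

# Word unigrams and bigrams capture simple agreement patterns.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(sentences, labels)
print(clf.predict(["he go home"]))   # expect: ungrammatical (0)
```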

Computer Science & Information Technology (CS & IT) is an open access, peer-reviewed Computer Science Conference Proceedings (CSCP) series that welcomes conferences to publish their proceedings / post-conference proceedings. The series intends to focus on publishing high-quality papers that help the scientific community, furthering our goal to preserve and disseminate scientific knowledge. Conference proceedings are accepted for publication in CS & IT - CSCP on the basis of peer-reviewed full papers and revised short papers that target the international scientific community and the latest IT trends. Our mission is to provide the most valuable publication service.

The paper aims to examine the extent to which student-centred ESP (English for Specific Purposes) teaching approaches can be and are being applied in higher vocational schools in Serbia at the moment, and what the prospects for the future promotion of these approaches are. In order to achieve this objective, the paper relies on data collected by means of an e-mail questionnaire consisting of 30 questions, both open- and close-ended, sent to 22 respondents, all of whom are currently employed as English teachers in 22 higher vocational schools in Serbia. The results of the research indicate that although traditional approaches are still present in ESP teaching, constant efforts are being made and significant results are being achieved in the implementation of communicative language learning and similar strategies, which make ESP learning a student- rather than teacher-centred process.
Keywords: ESP, student-centred teaching, higher vocational schools

Data mining and machine learning have become a vital part of crime detection and prevention. In this research, we use WEKA, an open source data mining software, to conduct a comparative study between the violent crime patterns from the Communities and Crime Unnormalized Dataset provided by the University of California-Irvine repository and actual crime statistical data for the state of Mississippi that has been provided by neighborhoodscout.com. We implemented the Linear Regression, Additive Regression, and Decision Stump algorithms using the same finite set of features, on the Communities and Crime Dataset. Overall, the linear regression algorithm performed the best among the three selected algorithms. The scope of this project is to prove how effective and accurate the machine learning algorithms used in data mining analysis can be at predicting violent crime patterns.
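The study itself runs in WEKA; as a hedged analogue only, the sketch below reproduces the same three-way comparison with scikit-learn stand-ins (AdaBoost over depth-1 trees playing the role of WEKA's Additive Regression, and a depth-1 tree the Decision Stump) on synthetic stand-in data rather than the Communities and Crime Dataset.

```python
# Hedged analogue of the WEKA comparison, using scikit-learn on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "AdditiveRegression": AdaBoostRegressor(DecisionTreeRegressor(max_depth=1)),
    "DecisionStump": DecisionTreeRegressor(max_depth=1),
}
for name, model in models.items():
    # Mean R^2 over 5-fold cross-validation, as a rough quality measure.
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(name, round(score, 3))
```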


In today's scenario, files are not secure. They can be obtained through many kinds of attack by an eavesdropper, such as cracking PINs or crashing the OS with viruses, malware, and plenty of other means. We cannot be sure today that file-protection wizards are secure and that data cannot reach an attacker. But if files are encrypted, then even if the files are accessed, the original data remains confidential. Therefore, this paper presents a File Encryption System based on Symmetric Key Cryptography. I propose a strategy for encrypting files (multiple files can be encrypted by compressing them into one 'rar' file) that uses Blowfish as the encryption/decryption standard and Cipher Block Chaining (CBC) mode to perform the operations. I implemented a compression function, a 64-bit Initialization Vector (IV), CBC mode with Blowfish, and RC4 for a 256-bit keystream. It is more efficient and secure than other general encryption processes.
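As a minimal sketch of the Blowfish-CBC core with a 64-bit IV, assuming PyCryptodome is installed; the rar compression step and the RC4 keystream stage described above are omitted, and the key below is a placeholder, not a recommendation.

```python
# Minimal sketch: Blowfish in CBC mode with a 64-bit (8-byte) IV via PyCryptodome.
from Crypto.Cipher import Blowfish
from Crypto.Random import get_random_bytes

def encrypt(data: bytes, key: bytes):
    iv = get_random_bytes(8)                 # Blowfish block size = 64 bits
    pad_len = 8 - len(data) % 8              # PKCS#7-style padding to a full block
    data += bytes([pad_len]) * pad_len
    cipher = Blowfish.new(key, Blowfish.MODE_CBC, iv)
    return iv, cipher.encrypt(data)

def decrypt(iv: bytes, ct: bytes, key: bytes):
    cipher = Blowfish.new(key, Blowfish.MODE_CBC, iv)
    pt = cipher.decrypt(ct)
    return pt[:-pt[-1]]                      # strip the padding

key = b"0123456789abcdef"                    # placeholder 128-bit key
iv, ct = encrypt(b"secret file contents", key)
print(decrypt(iv, ct, key))
```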

The development of microarray technology has supplied a large volume of data to many fields. Gene microarray analysis and classification have proved an effective way to diagnose diseases and cancers. Inasmuch as the data obtained from microarray technology are very noisy and have thousands of features, feature selection plays an important role in removing irrelevant and redundant features and in reducing computational complexity. There are two important approaches to gene selection in microarray data analysis: filters and wrappers. To select a concise subset of informative genes, we introduce a hybrid feature selection method which combines the two approaches. Candidate features are first selected from the original set via several effective filters; the candidate feature set is then further refined by more accurate wrappers. Thus, we can take advantage of both the filters and the wrappers. Experimental results based on 11 microarray datasets show that our mechanism can be effective with a smaller feature set, and that these feature subsets can be obtained in a reasonable time.
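A minimal sketch of the filter-then-wrapper pipeline is below, using scikit-learn on synthetic data: a univariate filter prunes the gene set, then a wrapper (recursive feature elimination with cross-validation) refines the candidates. The specific filter, wrapper, and sizes are assumptions, not the paper's exact choices.

```python
# Minimal sketch: hybrid filter + wrapper gene selection on synthetic data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFECV
from sklearn.svm import LinearSVC

# Microarray-like shape: few samples, thousands of features.
X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           random_state=0)

# Stage 1 (filter): keep the 100 genes with the strongest univariate signal.
X_filtered = SelectKBest(f_classif, k=100).fit_transform(X, y)

# Stage 2 (wrapper): recursively eliminate genes, scoring with a linear SVM.
wrapper = RFECV(LinearSVC(max_iter=5000), step=10, cv=3)
wrapper.fit(X_filtered, y)
print("genes kept:", wrapper.n_features_)
```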

The performance of traffic systems is greatly dependent on their ability to react to changing traffic patterns and different situations. In traditional traffic systems, the lights run green for fixed intervals of time no matter what the density of the traffic is. Here, we implement an intelligent-agent traffic model that controls the amount of time a light stays green based on the number of cars (the density) waiting at the light.
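A hypothetical sketch of such a density-based rule follows; the minimum, maximum, and per-car increment are assumptions for illustration, not values from the paper.

```python
# Hypothetical sketch: green time grows with queue length between fixed bounds.
MIN_GREEN, MAX_GREEN, PER_CAR = 10, 60, 2   # seconds (assumed parameters)

def green_time(cars_waiting: int) -> int:
    """Allocate green time proportional to queue density, clamped to limits."""
    return max(MIN_GREEN, min(MAX_GREEN, MIN_GREEN + PER_CAR * cars_waiting))

for queue in (0, 5, 40):
    print(queue, "cars ->", green_time(queue), "s green")
```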

This paper reports on the empirical evaluation of five machine learning algorithms, namely J48, BayesNet, OneR, NB and ZeroR, using ten performance criteria: accuracy, precision, recall, F-measure, incorrectly classified instances, kappa statistic, mean absolute error, root mean squared error, relative absolute error, and root relative squared error. The aim of this paper is to find out which classifier performs better for an intrusion detection system (IDS); machine learning is one of the methods used in intrusion detection. Based on this study, it can be concluded that the J48 decision tree is the most suitable algorithm of the five. In this paper we also compared the performance of the IDS classifiers using seven feature reduction techniques.
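The paper evaluates its five classifiers in WEKA; as a hedged analogue, the sketch below runs a comparable cross-validated comparison with scikit-learn stand-ins (a decision tree for J48, Gaussian Naive Bayes for NB, a majority-class dummy for ZeroR) on synthetic stand-in data.

```python
# Hedged analogue of the WEKA classifier comparison, on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier       # ~ J48
from sklearn.naive_bayes import GaussianNB            # ~ NB
from sklearn.dummy import DummyClassifier             # ~ ZeroR

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
for name, clf in [("J48-like tree", DecisionTreeClassifier()),
                  ("Naive Bayes", GaussianNB()),
                  ("ZeroR baseline", DummyClassifier(strategy="most_frequent"))]:
    acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    print(name, round(acc, 3))
```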

The objective of path planning algorithms is to find the optimal path from a source position to a target position. This paper proposes a real-time path planner for UAVs based on the genetic algorithm. The proposed approach does not identify any specific points outside or between obstacles to solve the problems of the invisible path. In addition, this approach uses no additional steps in the genetic algorithm to handle the problems resulting from generating points inside the obstacles, or from the intersection of path segments with obstacles. For these reasons, this paper introduces a simple evaluation method that takes into account the intersections between path segments and obstacles to find a collision-free and near-optimal path; this evaluation method also handles overlapped and intersecting obstacles. The sequential implementation of all the genetic algorithm steps is detailed. The paper then explores Parallel Genetic Algorithm (PGA) models and introduces a parallel implementation of the proposed path planner on multi-core processors using OpenMP. The execution time of the proposed parallel implementation is reduced compared to sequential execution.
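A minimal sketch of such an evaluation function is given below: a candidate path's fitness is its length plus a heavy penalty for every segment that crosses an obstacle. Circular obstacles and the penalty weight are assumptions; the paper's obstacle representation may differ.

```python
# Minimal sketch: penalty-based fitness for GA path planning (lower is fitter).
import math

def seg_hits_circle(p, q, c, r):
    """True if segment p-q passes within radius r of circle centre c."""
    (px, py), (qx, qy), (cx, cy) = p, q, c
    dx, dy = qx - px, qy - py
    t = max(0, min(1, ((cx - px) * dx + (cy - py) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px + t * dx - cx, py + t * dy - cy) <= r

def fitness(path, obstacles, penalty=1e3):
    length = sum(math.dist(path[i], path[i + 1]) for i in range(len(path) - 1))
    hits = sum(seg_hits_circle(path[i], path[i + 1], c, r)
               for i in range(len(path) - 1) for c, r in obstacles)
    return length + penalty * hits   # each collision adds a large cost

# A two-segment path skirting one circular obstacle at (5, 2) with radius 1.
print(fitness([(0, 0), (5, 5), (10, 0)], [((5, 2), 1.0)]))
```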

In this paper we focus on helping editors in the newspaper industry by making their work easier: processing the huge chunks of data they receive in the form of articles sent to them by multiple news reporters from different locations. With these kinds of huge data, and in this kind of industry, only those organizations that extract good insights will be successful. This type of data can be processed only by using Natural Language Processing (NLP). Our model begins by taking all the articles as input, preprocessing them with NLP, adding patterns, and finally finding the insights that are needed. The main aim of our model is to reach high accuracy as well as to maintain robustness. The entities of importance discussed in an article can be highlighted using displaCy, so that it is easy for the news industry to maintain high standards by establishing a strong connection to reach people. The articles are finally presented in the form of tabular data containing all the important attributes, which is our final output. This paper helps the news industry to analyze articles properly in an accessible format, which makes the job easy mainly for editors in that industry.
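A minimal sketch of the displaCy highlighting step mentioned above follows, assuming spaCy and its small English model are installed (python -m spacy download en_core_web_sm); the sample headline is illustrative, not from the paper.

```python
# Minimal sketch: extract entities for a tabular view and render with displaCy.
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Reuters reported from Mumbai that Infosys shares rose 4% on Monday.")

# Collect the important attributes for the editor's tabular output...
rows = [(ent.text, ent.label_) for ent in doc.ents]
print(rows)

# ...and render the article with entities highlighted (returns HTML markup).
html = displacy.render(doc, style="ent")
```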

Data available on the web is growing at an exponential rate, so creating knowledge or extracting information is of paramount importance. Information Retrieval (IR) plays a crucial role in knowledge management, as it helps us find the relevant information in the existing data. This paper compares the performance of keyword-based retrieval and other architectural styles of information retrieval systems with ontology-based retrieval on documents in a regional language.

Auto dealerships receive thousands of calls daily from customers interested in sales, service, vendors and jobseekers. With so many calls, it is very important for auto dealers to understand the intent of these calls to provide positive customer experiences that ensure customer satisfaction, deeper customer engagement to boost sales and revenue, and optimum allocation of agents or customer service representatives across the business. In this paper, we define the problem of customer phone call intent as a multi-class classification problem stemming from a large database of recorded phone call transcripts. To solve this problem, we develop a convolutional neural network (CNN)-based supervised learning model to classify the customer calls into four intent categories: sales, service, vendor or jobseeker. Experimental results show that, with the thrust of our scalable data labeling method to provide sufficient training data, the CNN-based predictive model performs very well on long text classification according to tests that measure the model's quantitative metrics of F1-score, precision, recall, and accuracy.
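A hedged sketch of a CNN text classifier for the four intent classes follows, in the spirit of the paper's model; the vocabulary size, sequence length, and layer sizes are assumptions, since the paper's exact architecture is not reproduced here.

```python
# Hedged sketch: a 1-D CNN over token sequences for 4-way call-intent classification.
import tensorflow as tf

NUM_CLASSES = 4                 # sales, service, vendor, jobseeker
VOCAB, MAXLEN, EMB = 20000, 500, 128   # assumed hyperparameters

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB, EMB),
    tf.keras.layers.Conv1D(128, 5, activation="relu"),   # n-gram-like filters
    tf.keras.layers.GlobalMaxPooling1D(),                # strongest feature per filter
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.build(input_shape=(None, MAXLEN))
model.summary()
```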

Analysis of high-dimensional data is a central problem in several applications, such as content-based retrieval, speech signals, fMRI scans, electrocardiogram signal analysis, multimedia retrieval, and market-based applications. To improve the performance of such systems, the data should be reduced to a lower dimension. There are many techniques for both linear and non-linear dimensionality reduction. Some techniques are suitable for linear sample data and not for non-linear data, and sample size is another criterion in dimensionality reduction. Each technique has its own features and limitations. This paper presents the various techniques used to reduce the dimensions of the data.
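As a minimal illustration of the linear/non-linear distinction the survey draws, the sketch below contrasts PCA with Isomap in scikit-learn: PCA suits linearly structured data, while a manifold method like Isomap can unfold a non-linear shape such as the S-curve. The choice of these two techniques is an assumption for illustration.

```python
# Minimal sketch: linear (PCA) vs. non-linear (Isomap) dimensionality reduction.
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, _ = make_s_curve(n_samples=500, random_state=0)   # 3-D non-linear manifold

X_pca = PCA(n_components=2).fit_transform(X)         # linear projection
X_iso = Isomap(n_components=2).fit_transform(X)      # non-linear embedding
print(X_pca.shape, X_iso.shape)
```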

The number of Rheumatoid Arthritis (RA) patients has increased recently in Japan. Early treatment improves a patient's prognosis and quality of life, and treatment appropriate to the progression of RA is required for a better prognosis. The modified Total Sharp Score (mTSS), based on hand X-ray images, is widely used for the diagnosis of RA progression. mTSS measurement is essential to achieving appropriate treatment, but its assessment is time-consuming. There are some finger joint detection and mTSS estimation methods for fully automated mTSS measurement, which focus on mild RA patients. This paper proposes an automatic joint detection method and discusses mTSS estimation for mild-to-severe RA patients. Experimental results on 90 RA patients' hand X-ray images showed that the proposed method detected finger joints with an accuracy of 91.8%, and estimated the erosion and JSN scores with accuracies of 53.3% and 60.8%, respectively.

A corpus is an arbitrary sample of language, whereas a dictionary aims to be a systematic account of the lexicon of a language. Children learn language through encountering arbitrary samples, and using them to build systematic representations. These banal observations suggest a relationship between corpus and dictionary in which the former is a provisional and dispensable resource used to develop the latter. In this paper we use the idea to, first, review the Word Sense Disambiguation (WSD) research paradigm, and second, guide our current activity in the development of the Sketch Engine, a corpus query tool. We develop a model in which a database of mappings between collocations and meanings acts as an interface between corpus and dictionary.

To aid literary fiction enthusiasts, this research aims to classify Filipino short stories according to their genre using the K-Means algorithm and Artificial Neural Networks (ANN). The study limits the input to stories under at least one of the specified genres: fantasy, horror, and romance. Two (2) sets of data are used for testing the system. The study concluded that the K-Means algorithm provides better accuracy for classifying Filipino fictions according to their genre when two outputs are used, with respect to both the genre given by the source and that assigned by a professional; otherwise, ANN classifies the stories more accurately.
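A hedged sketch of the K-Means side of this pipeline is below, clustering TF-IDF vectors of story text into two groups; the English snippets and the two-cluster setup are stand-ins for the study's Filipino stories and genre labels.

```python
# Minimal sketch: genre clustering with K-Means over TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

stories = [
    "the dragon guarded an enchanted mountain kingdom",          # fantasy-like
    "a ghost whispered in the abandoned, blood-stained house",   # horror-like
    "the wizard cast a spell over the enchanted forest",
    "screams echoed as the shadow crept toward her bed",
]
X = TfidfVectorizer().fit_transform(stories)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # stories grouped, ideally by genre
```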

Machine Learning and Applications: An International Journal (MLAIJ) is a quarterly open access peer-reviewed journal that publishes articles which contribute new results in all areas of machine learning. The journal is devoted to the publication of high-quality papers on theoretical and practical aspects of machine learning and applications. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on machine learning advancements and to establish new collaborations in these areas. Original research papers and state-of-the-art reviews are invited for publication in all areas of machine learning. Authors are solicited to contribute to the journal by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the areas of machine learning. Topics of interest include, but are not limited to, the following: applications; learning in knowledge-intensive systems; learning methods and analysis; learning problems.

International Conference on NLP & Big Data (NLPD 2020) will provide an excellent international forum for sharing knowledge and results in theory, methodology and applications of Natural Language Computing, Big Data, Linked Data and Social Networks.

International Journal of Computer Vision and Machine Learning (IJCVML) is an open access, peer-reviewed journal that publishes articles which contribute new results in all areas of advanced vision computing. The goal of this journal is to bring together researchers and practitioners from academia and industry to focus on understanding advances in vision computing and establishing new collaborations in these areas. Authors are solicited to contribute to the journal by submitting articles that illustrate research results, projects, surveying works and industrial experiences that describe significant advances in the areas of vision computing.

We introduce the strategies used by the Accenture Team for the CLEF2020 CheckThat! Lab, Task 1, on English and Arabic. This shared task evaluated whether a claim in social media text should be professionally fact checked. To a journalist, a statement presented as fact, which would be of interest to a large audience, requires professional fact-checking before dissemination. We utilized BERT and RoBERTa models to identify claims in social media text a professional fact-checker should review, and rank these in priority order for the fact-checker. For the English challenge, we fine-tuned a RoBERTa model and added an extra mean pooling layer and a dropout layer to enhance generalizability to unseen text. For the Arabic task, we fine-tuned Arabic-language BERT models and demonstrate the use of back-translation to amplify the minority class and balance the dataset. The work presented here was scored 1st place in the English track, and 1st, 2nd, 3rd, and 4th place in the Arabic track.
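A hedged sketch of the English-track architecture described above follows: RoBERTa with an extra mean-pooling layer and dropout before the classification head, built with Hugging Face transformers and PyTorch. The dropout rate and head size are assumptions, not the team's published hyperparameters.

```python
# Hedged sketch: RoBERTa + mean pooling + dropout for check-worthiness classification.
import torch
from transformers import AutoModel, AutoTokenizer

class ClaimRanker(torch.nn.Module):
    def __init__(self, name="roberta-base", num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.dropout = torch.nn.Dropout(0.3)   # assumed rate
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean over real tokens only
        return self.head(self.dropout(pooled))

tok = AutoTokenizer.from_pretrained("roberta-base")
batch = tok(["COVID-19 vaccines contain microchips."], return_tensors="pt")
logits = ClaimRanker()(batch["input_ids"], batch["attention_mask"])
print(logits.shape)   # (1, 2): check-worthy vs. not
```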

This paper reports the results of a survey I carried out on the beliefs and attitudes held by Italian upper-secondary school students about foreign language learning. The survey was prompted by my experience in teacher training courses, where teachers often wondered what factors were responsible for unsatisfactory learning outcomes, even in contexts where teaching strategies and materials seemed to be grounded on sound methodological choices. This led me to consider aspects of learning which lie “below the surface” of students’ behaviour, and in particular the role that their beliefs and attitudes play in explaining how a curriculum is perceived, interpreted, and implemented in a school context. In this paper I will first illustrate the cognitive nature of beliefs, their corresponding affective component (i.e., attitudes), and their influence on the learning process. I will then introduce the use of metaphors as a useful tool for probing learners’ beliefs and attitudes, and outline how the research was designed to explore students’ conceptualisations of both the knowledge of foreign languages and the process of language learning. Results show that students tend to describe language knowledge in terms of motivation, intercultural communicative competence, affective implications, mastery of a system, equivalence of L1 and L2 learning, and similarity with other skills. Language learning is mainly seen in terms of a very demanding task, which is perceived either as a productive experience or as a (nearly) impossible undertaking, but also as “learning from scratch” and as a game and pleasant experience. I conclude by considering ways in which these insights can be used by teachers to address their students’ “hidden agenda” and highlighting the role that an increased awareness of beliefs and attitudes can play in the language classroom.
Full text: https://ldjournalsite.files.wordpress.com/2017/11/ldj-1-1-mariani.pdf
Learner Development Journal: http://ld-sig.org/ld-journal-concept/

This paper presents an unsupervised approach to the development of a stemmer (for the case of the Urdu and Marathi languages). Especially during the last few years, a wide range of information in Indian regional languages has been made available on the web in the form of e-data. But access to these data repositories is very low, because efficient search engines/retrieval systems supporting these languages are very limited; hence automatic information processing and retrieval has become an urgent requirement. To train the system, training datasets taken from CRULP [22] and a Marathi corpus [23] are used. For generating suffix rules, two different approaches, namely frequency-based stripping and length-based stripping, have been proposed. The evaluation has been made on 1200 words extracted from the Emille corpus. The experimental results show that in the case of the Urdu language, the frequency-based suffix generation approach gives a maximum accuracy of 85.36%, whereas the length-based suffix stripping algorithm gives a maximum accuracy of 79.76%. In the case of the Marathi language, the system gives 63.5% accuracy with frequency-based stripping and achieves a maximum accuracy of 82.5% with the length-based suffix stripping algorithm.
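As a minimal sketch of the frequency-based idea, the code below harvests frequent word endings from a corpus as candidate suffixes and strips the longest match; the English toy word list stands in for the Urdu and Marathi corpora, and the frequency threshold is an assumption.

```python
# Minimal sketch: frequency-based suffix generation and stripping.
from collections import Counter

words = ["walking", "walked", "talker", "talking", "jumped", "jumper"]

# Count all word endings of length 1-4; endings seen at least twice become suffixes.
counts = Counter(w[-k:] for w in words for k in range(1, 5) if len(w) > k)
suffixes = sorted((s for s, c in counts.items() if c >= 2), key=len, reverse=True)

def stem(word):
    """Strip the longest known frequent suffix, leaving a non-empty stem."""
    for s in suffixes:
        if word.endswith(s) and len(word) > len(s) + 1:
            return word[: -len(s)]
    return word

print([stem(w) for w in words])   # note: a toy corpus can over-strip (e.g. "-king")
```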