Naive Bayes Research Papers - Academia.edu

- •
- Text Mining, Engineering Systems, Text Categorization, Asset Management

Data mining is a fertile field for researchers because it offers many approaches to knowledge discovery in enormous volumes of data stored in different formats. At present, data are used widely all over the world, in areas such as education, industry, medicine, banking, insurance, research laboratories, business and the military. The major gain from applying data mining techniques is the discovery of unknown patterns and relations in the data, which can then support decision-making. Two forms of data analysis are used to extract models that describe important classes or predict future data trends: classification and prediction. In this paper, the authors present a comparative study of classification algorithms (i.e. Decision Tree, Naïve Bayes and Random Forest) applied to demographic data on death statistics using the KNIME Analytics Platform. Our study was based on statistical data provided by the Nation...

- by Irina Ionita
- •
- Computer Science, Data Mining, Decision Tree, Random Forests

Named entity recognition (NER) is one of the fundamental tasks in natural-language processing (NLP). Though the combination of different classifiers has been widely applied in several well-studied languages, this is the first time this method has been applied to Vietnamese. In this article, we describe how voting techniques can improve the performance of Vietnamese NER. By combining several state-of-the-art machine-learning algorithms using voting strategies, our final result outperforms the individual algorithms and achieves an F-measure of 89.12. A detailed discussion of the challenges of NER in Vietnamese is also presented.

The term non-permanent employee first appeared in law in Act Number 13 Year 2003 concerning Manpower. This Act has affected the clarity of staffing status, so that the salaries employees receive do not match their workload. Therefore, this study aims to determine which employees are eligible to earn the same income at a private university in Palembang, based on the university's strategic plan, namely class, employment status, membership, and education permit, using the Naïve Bayes method. The results showed that the highest prediction accuracy for non-permanent and permanent employees is 83.33%, while the lowest is 50%.

- by Boy Ramadhan
- •
- University, Staffing, Naive Bayes, Naïve Bayes

The DC Universe is a fictional universe containing a collection of superheroes and supervillains based on characters that appear in comic books published by DC Comics. DC Comics is the largest and oldest comic book publisher producing superhero and supervillain titles. There are a number of business models that can serve as references for starting a profitable superhero-themed business, namely renting comics, selling merchandise, creating clothing lines, making cosplay costumes, making superhero-themed foods and selling action figures. To start a superhero-themed business, especially one with a DC Universe theme, it is necessary to listen to DC Universe consumers in Indonesia. Opinion classification, or sentiment analysis, is one way to find out how a person or group of people feels about certain products, services, issues or groups, based on various social media platforms and the internet. Twitter is one of the social media platforms most loved by the people of Indonesia. This research makes use of what Twitter users write, better known as tweets. The tweets are processed with text mining and then classified using the Naïve Bayes Classifier algorithm.

Naive Bayes is one of the most effective classification algorithms. In many applications, however, a ranking of examples is more desirable than just a classification. How to extend naive Bayes to improve its ranking performance is an interesting and useful question in practice. Weighted naive Bayes is an extension of naive Bayes in which attributes have different weights. This paper investigates how to learn a weighted naive Bayes with accurate ranking from data, or more precisely, how to learn the weights of a weighted naive Bayes to produce accurate ranking. We explore various methods: the gain ratio method, the hill climbing method, the Markov chain Monte Carlo method, the hill climbing method combined with the gain ratio method, and the Markov chain Monte Carlo method combined with the gain ratio method. Our experiments show that a weighted naive Bayes trained to produce accurate ranking outperforms naive Bayes.

- by Harry Zhang
- •
- Data Mining, Markov Processes, Monte Carlo Methods, Decision Trees
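
The weighted naive Bayes idea in the abstract above can be illustrated with a small sketch. It assumes the common formulation in which each attribute's conditional probability is raised to the power of its weight, so the class score is P(c) multiplied by the product of P(x_i|c)^w_i. The abstract does not give the exact formula, the toy data are invented, and the weights below are set by hand rather than learned with gain ratio, hill climbing or MCMC.

```python
import math
from collections import Counter, defaultdict

def train_counts(rows, labels):
    """Collect class counts and per-attribute value counts from categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def weighted_nb_scores(row, model, weights):
    """Normalized class scores under P(c) * prod_i P(x_i | c) ** w_i."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    log_scores = {}
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, value in enumerate(row):
            # Laplace-smoothed estimate of P(x_i | c)
            p = (value_counts[(i, c)][value] + 1) / (n_c + len(values_seen[i]))
            score += weights[i] * math.log(p)  # the weight acts as an exponent
        log_scores[c] = score
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: u / z for c, u in unnormalized.items()}

# Hypothetical toy data: (outlook, temperature) -> class
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_counts(rows, labels)

uniform = weighted_nb_scores(("sunny", "cool"), model, [1.0, 1.0])     # plain naive Bayes
reweighted = weighted_nb_scores(("sunny", "cool"), model, [2.0, 0.5])  # trust outlook more
```

With all weights equal to 1 this reduces to ordinary naive Bayes. Because the normalized scores can be used to order examples, changing the weights changes the ranking even when the predicted class stays the same.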

The purpose of this study is to find the optimal features and classifier model selection for sleep apnea detection using ECG signals. We want to determine whether a set of unknown ECG signals (test data) belongs to the heavy apnea, mild apnea, or healthy category. We examine two recent approaches to feature selection: an approach proposed by Chazal et al. (2004), which is based on the RR-interval mean and time-series analysis; and an approach proposed by Yilmaz et al. (2010), which is based on the RR-interval median. We also examine cross-validation and random sampling methods for selecting the classifier's probability model. We evaluate the approaches using three classifiers: k-Nearest Neighbor (kNN), Naive Bayes and Support Vector Machine (SVM). In addition, we use self-organizing map (SOM) clustering as preprocessing to provide a better sample that can represent the entire training data. Our experiment using ECG data from PhysioNet shows that classification using only the 3 features proposed by Yilmaz et al. (2010) gives a gain of about 3.59% in overall classification accuracy (CA) and 7.5% in area under the ROC curve (AUC) over classification using the 8 features proposed by Chazal et al. (2004).

- by Mohamad Ivan Fanany and +2
- •
- Computer Science, Artificial Intelligence, Medical Sciences, Biomedical Engineering
- by Urszula Markowska-Kaczmar and +1
- •
- Knowledge Representation, Fuzzy Clustering, Unsupervised Learning, Computational
- by W. Verhaegh
- •
- User Modeling, Machine Learning, Reliability, Confidence intervals
- by Ana Kovacevic
- •
- Data Mining, Databases, Digital Library, Library and Information Studies
- by Harry Zhang
- •
- Experimental Study, Structure learning, Naive Bayes
- by Waheeda Almayyan
- •
- Mechanical Engineering, Mathematics, Computer Science, Digital Signal Processing

Naive Bayes is very popular in commercial and open-source anti-spam e-mail filters. There are, however, several forms of Naive Bayes, something the anti-spam literature does not always acknowledge. We discuss five different versions of Naive Bayes, and compare them on six new, non-encoded datasets, that contain ham messages of particular Enron users and fresh spam messages. The new datasets, which we make publicly available, are more realistic than previous comparable benchmarks, because they maintain the temporal order of the messages in the two categories, and they emulate the varying proportion of spam and ham messages that users receive over time. In this paper we have discovered various aspects of Naïve Bayes Classifier and smoothing techniques for extraction of useful data along with our research criteria.

- by IJCSMC Journal
- •
- Naive Bayes
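
As an illustration of one form of Naive Bayes used in spam filtering, the sketch below implements a multinomial model over word counts with additive (Laplace) smoothing. It is not claimed to match any of the five versions compared in the abstract above; the messages, the `alpha` parameter and the function names are invented for the example.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Count words per class; alpha is the additive smoothing constant."""
    vocab = {word for doc in docs for word in doc.split()}
    word_counts = {c: Counter() for c in set(labels)}
    doc_counts = Counter(labels)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc.split())
    return vocab, word_counts, doc_counts, alpha

def predict(doc, model):
    """Return the class with the highest log posterior score."""
    vocab, word_counts, doc_counts, alpha = model
    n_docs = sum(doc_counts.values())
    best_class, best_score = None, -math.inf
    for c in sorted(doc_counts):
        total_words = sum(word_counts[c].values())
        score = math.log(doc_counts[c] / n_docs)  # log prior
        for word in doc.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Smoothed estimate of P(word | class); never zero
            p = (word_counts[c][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy messages
docs = ["win cash prize now", "cheap cash offer",
        "meeting agenda attached", "lunch meeting tomorrow"]
labels = ["spam", "spam", "ham", "ham"]
model = train_multinomial_nb(docs, labels)

first = predict("cash prize offer", model)
second = predict("meeting tomorrow", model)
```

Without smoothing, a single word that never occurred in one class would drive that class's probability to zero, which is why smoothing techniques matter for this family of filters.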

This paper focuses on malware detection for the Android mobile system through reverse engineering of Java code. The characteristics of malicious software were identified based on a collected set of applications. A total of 1958 applications were tested (including 996 malware apps). A unique set of features was chosen. Five classification algorithms (Random Forest, SVM, K-NN, Naive Bayes, Logistic Regression) and three attribute selection algorithms were examined in order to choose those that would provide the most effective malware detection.

The theory of dual numerical means of random and experienced variables is briefly described in the framework of the new theory of experience and the chance that arises as an axiomatic synthesis of two dual theories — the Kolmogorov theory of probability and the theory of believability. A new term is introduced for the numerical mean of the experienced variable — mathematical reflection, which is dual to the mathematical expectation of a random variable within the framework of the new theory. The basic properties and examples of dual numerical means are considered.

- by Oleg Yu Vorobyev
- •
- Sociology, Probability Theory, Quantum Computing, Expert Systems

- by Rendra Gustriansyah
- •
- University, Staffing, Naive Bayes, Naïve Bayes
- by Daniele Loiacono and +1
- •
- Machine Learning, User Interface, Neural Network, Competitive advantage
- by Lee Jau
- •
- Machine Learning, Semiconductor Manufacturing, Hybrid Intelligent Systems, Case Study

Here we introduce a classifier that takes in multidimensional data consisting of real-world measurements of continuous physical, environmental and vehicular features obtained from a number of driving sessions. We show that a Naive Bayes classifier that assumes the data follow a Gaussian distribution can predict whether or not the driver is alert while driving, and achieves a reasonably low misclassification rate for the given data. We inspect how insight into the relevant features was obtained using Principal Component Analysis (PCA) and a simple correlation matrix. We were able to obtain a misclassification rate as low as 12.03% and 27.07% for the test and training data respectively.

- by francesco vaggi
- •
- Data Mining, Driving, Naive Bayes, Kaggle
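
A Naive Bayes classifier with a Gaussian assumption, as described in the abstract above, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the two features, the sample values and the label names are invented, and the PCA step is omitted.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(row, model):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(sorted(model), key=score)

# Hypothetical continuous measurements with two features per sample
X = [[0.9, 1.1], [1.0, 0.8], [1.2, 1.0], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
y = ["alert", "alert", "alert", "not_alert", "not_alert", "not_alert"]
model = fit_gaussian_nb(X, y)

predictions = [predict(row, model) for row in X]
misclassification_rate = sum(p != t for p, t in zip(predictions, y)) / len(y)
```

The misclassification rate computed at the end is the fraction of samples whose predicted label differs from the true label, the same kind of figure the abstract reports for its test and training data.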

Player selection is one of the most important tasks in any sport, and cricket is no exception. The performance of a player depends on various factors such as the opposition team, the venue and the player's current form. The team management, the coach and the captain select 11 players for each match from a squad of 15 to 20 players. They analyze different characteristics and statistics of the players to select the best playing 11 for each match. Each batsman contributes by scoring as many runs as possible, and each bowler contributes by taking as many wickets and conceding as few runs as possible. This paper attempts to predict the performance of players: how many runs each batsman will score and how many wickets each bowler will take, for both teams. Both problems are treated as classification problems in which the number of runs and the number of wickets are classified into ranges. We used naïve Bayes, random forest, multiclass SVM and decision tree classifiers to generate the prediction model...

- by Kalpdrum Passi
- •
- Computer Science, Cricket, Analytic Hierarchy Process, Decision Trees
- by Harry Zhang
- •
- Text Classification, Scaling up, Time Complexity, Naive Bayes

It has been observed that traditional decision trees produce poor probability estimates. In many applications, however, a probability estimation tree (PET) with accurate probability estimates is desirable. Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. To our observation, however, the representation also plays an important role. Indeed, the representation of decision trees is fully expressive theoretically, but it is often impractical to learn such a representation with accurate probability estimates from limited training data. In this paper, we extend decision trees to represent a joint distribution and conditional independence, called conditional independence trees (CITrees), which is a more suitable model for PETs. We propose a novel algorithm for learning CITrees, and our experiments show that the CITree algorithm outperforms C4.5 and naive Bayes significantly in classification accuracy.

- by Harry Zhang
- •
- Artificial Intelligence, Classification, Decision Tree, Naive Bayes

Data mining is defined as the process in which useful information is extracted from raw data. Acquiring essential knowledge requires extracting it from large amounts of data. This process of extraction is also known as misnomer. Currently, every field has a large amount of data, and analyzing all of it is very difficult and time-consuming. Prediction analysis is the most useful type of data analysis performed today. To perform prediction analysis, patterns need to be generated from the dataset with machine learning. Prediction analysis can be done by gathering historical information to project future trends, so knowledge of what has happened previously is used to provide the best estimate of what will happen in the future. Crop production analysis is one of the applications of prediction analysis. The techniques designed so far are machine learning techniques, which are applied together with feature extraction. In this paper, machine learning techniques are reviewed in terms of technical description and outcomes.

- by IJCSMC Journal
- •
- Computer Science, Algorithms, Information Technology, Technology

This master's thesis covers topics such as the architecture of energy metering systems, artificial neural networks, naive Bayes classifiers, energy monitoring and archiving, and electrical appliance recognition. It describes the whole system...

- by Konrad Gębka
- •
- Electrical Engineering, Artificial Intelligence, Pattern Recognition, Energy

The biggest challenge for e-commerce companies in understanding their market is to chart their level of service quality according to customer perception. Collecting user perceptions through online user reviews is considered faster than conducting direct sampling. To understand the service quality level, sentiment analysis is used to classify the reviews into positive and negative sentiment for five dimensions of electronic service quality (e-Servqual). As a case study in this research, we use Tokopedia, one of the biggest e-commerce services in Indonesia. We collected online review comments about Tokopedia's service quality over several months of observation. The Naïve Bayes classification method is applied because of its high accuracy and its support for processing large amounts of data. The results revealed that the personalization and reliability dimensions require more attention because they have high negative sentiment. Meanwhile, the trust and web design dimensions have high positive sentiment, which indicates very good service. The responsiveness dimension has balanced positive and negative sentiment.

- by Andry Alamsyah
- •
- Sentiment Analysis, Service Quality, Naive Bayes, E-Commerce

Today, classification techniques in data mining are among the most popular for prediction and data exploration. This Heart Disease Prediction System (HDPS) uses Naive Bayesian classification and compares simple probability estimates with Jelinek-Mercer (JM) smoothing. It is implemented as an Android-based application: the user answers a set of questions and can then view the result in different ways, either whether heart disease is present or not, or as a prediction of No, Low, Average, High or Very High. The system also provides suggestions such as doctor details and medications for patients. It is also shown that Naive Bayes enhanced with Jelinek-Mercer smoothing is effective at eliminating noise when predicting heart disease. The system can also calculate classifier accuracy using precision and recall. Nan Yu Hlaing | Phyu Pyar Moe "Android Based Questionnaires Application for Heart Disease Prediction S...

- by Nan Yu Hlaing
- •
- Computer Science, Android, Heart Disease, Naive Bayes
- by Feras Al-Obeidat and +2
- •
- Support Vector Machines, Decision Trees, Mathematical Sciences, Naive Bayes
- by Anjali Jivani
- •
- Computer Science, Data Mining, Cancer, Information Extraction
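
The heart disease abstract above contrasts simple probability estimates with Jelinek-Mercer smoothing. A minimal sketch of that contrast follows, assuming the usual linear interpolation between the class-level estimate and a background (whole-collection) estimate. The counts, the answer categories and the value of `lam` are hypothetical, and some texts attach the interpolation weight to the class model instead of the background model.

```python
from collections import Counter

def simple_probability(value, class_counts):
    """Unsmoothed maximum-likelihood estimate of P(value | class)."""
    total = sum(class_counts.values())
    return class_counts[value] / total if total else 0.0

def jm_probability(value, class_counts, corpus_counts, lam=0.2):
    """Jelinek-Mercer estimate: (1 - lam) * P(value | class) + lam * P(value | corpus)."""
    p_class = simple_probability(value, class_counts)
    p_corpus = simple_probability(value, corpus_counts)
    return (1 - lam) * p_class + lam * p_corpus

# Hypothetical answer counts for one questionnaire item
disease = Counter({"high": 3, "average": 1})
no_disease = Counter({"low": 3, "average": 1})
corpus = disease + no_disease  # background counts over all records

p_simple_low = simple_probability("low", disease)   # unseen value gets zero
p_jm_low = jm_probability("low", disease, corpus)   # borrows mass from the background
p_jm_total = sum(jm_probability(v, disease, corpus) for v in corpus)
```

The unsmoothed estimate assigns zero to an answer never observed with a class, which would zero out the whole naive Bayes product; the interpolated estimate stays positive and still sums to one over the observed values.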

Oral cancer is one of the most dangerous cancers; it originates in and affects the oral cavity and neck. Overuse of tobacco and cigarette smoking are the primary risk factors for developing oral cancer. The proposed technique derives a group of features that help classifiers identify the state of an image automatically. Various machine learning methods are applied to the datasets and their performance is analyzed. The derived features were classified using a CNN, which is compared against standard classification approaches such as SVM and Naive Bayes. The results show that the different stages of oral cancer can be classified effectively, and that classification of various oral cancers can be achieved more efficiently by means of a CNN.

- by IAEME Publication
- •
- CNN, Segmentation, Classification, Svm

Mathematical analysis is becoming ever more useful when dealing with large amounts of archaeological data, due to the precision and certainty with which results can be produced. This article discusses the use of a relatively new tool for deciphering and handling archaeological data, the Naïve Bayes Classifier. The ‘Bayesian’ approach was first proposed in the early 1990s by archaeological statisticians including Clive Orton (see Orton 1992:139; Buck et al. 1996:1), though at that time the lack of available computational power made use of the Classifier prohibitively difficult. Today a Naïve Bayes Classifier can be employed by anyone with a computer, without any need for particularly specialised computer skills. Programs such as Orange use a graphical interface to circumvent the need for specific mathematical knowledge of the process, and the use of this program is detailed in the paper. The Naïve Bayes Classifier is most useful in attempting to identify unseen patterns in a large amount of data, such as a database with thousands of entries, and potential uses are illustrated here. This paper presents a case study using a Naïve Bayes Classifier in an attempt to date the Viking-age rune-stones of Sweden that remain undated through conventional methods. Two variables were identified as showing some small trace of temporal evolution, the Christian crosses and the runic inscriptions on the stones, and the Classifier was used to explore their further use.

- by Shreyas Renga
- •
- Discourse Analysis, Computer Science, Architecture, Sentiment Analysis
- by Joao Paulo Papa
- •
- Engineering, Support Vector Machines, Artificial Neural Networks, Shape Analysis

Every second, a plethora of reviews on various product lines is posted on popular e-commerce websites. The purpose of the review section on such websites is to gauge customer satisfaction for sales growth and to help buyers make the right purchase decisions. This becomes a herculean task for the business analyst when the search space for interesting patterns is vast. Buyers could benefit from a trusted recommendation label for the product and an overall product rating based purely on data mining of the live, customer-generated review dataset. The proposed sentiment analysis system makes use of data mining and natural language processing algorithms. The words in the corpus selected for sentiment classification are associated with fuzzy scores that indicate the degree of polarity strength. The Naïve Bayes algorithm forms the basis of the project. It is simple yet yields effective results. The sentiment score of a review is used to produce an automated five-star rating that replaces the conventional manual rating. Context-based anomaly detection is used to remove irrelevant portions of each review. Selective-feature analysis is performed to assist vendors as well as buyers. The overall rating is compared against a threshold value to produce the recommendation label. The polarity detection gave fair results. It was observed that the richness of the corpus and the grammar rules contributed to the accuracy of the model. The threshold chosen for product recommendation was only moderately reliable when it was based solely on reviews, so the threshold was set using more complex criteria that involve purchase patterns. The results obtained from the proposed model are well suited to e-commerce websites. The model can be improved by adding a sarcasm detection algorithm, which is quite challenging.

Today social media has grown to be a big player in the way businesses and organizations operate, especially with the coronavirus pandemic increasing the online footprint of organizations. The use of data from social media to drive business intelligence is now of growing interest to both researchers and business owners. Business owners can now use platforms like Twitter to learn about their target audience and improve their business processes to meet that audience's growing needs. Twitter makes it easy to see what is going viral or about to go viral, along with vital details such as why it is going viral and the players behind it. This research aims to help business owners, especially small and medium enterprises and start-ups, gain a competitive advantage in their industry by using the "crowd wisdom" opportunity offered by social media. The proposed system is based on Twitter and crawls the platform for relevant data, including locations, trends, and important actors (influencers) within a specified field; the system cleans the data and presents the information in an actionable format. Python was used for Twitter data mining, and sentiment analysis of the tweets was done using Naive Bayes classifiers.

Accurate bacteria identification requires extensive knowledge of and experience in microbiology. Automation of bacteria identification is needed because there might be a shortage of skilled microbiologists and clinicians at a time of great need. There have been several attempts to perform automatic bacteria identification. This paper reviews state-of-the-art automatic bacteria identification techniques. It also discusses the limitations of state-of-the-art automatic bacteria identification systems and recommends future directions for automatic bacteria identification.

High-quality human resources (HR) affect the ability of an institution or company to achieve its goals. To develop high-quality human resources, employee performance is appraised. A grade promotion is one of the incentives for improving employee performance at an institution or company. Grade promotion assessments are usually carried out by HR staff. Computerized grade promotion assessment makes it easier for HR staff to manage employee performance data. Employee grade promotion can be determined through fuzzification and classification. Fuzzification is performed before the employee performance data are classified. Naïve Bayes is one classification method that can be used to determine employee grade promotion. The tests carried out show that the Naïve Bayes classification method is able to classify employee grade promotions, as evidenced by the increasingly high accuracy values obtained in testing.

- by Yoga Agung Baktiar
- •
- Information Technology, Fuzzy Logic, Naive Bayes

With the growth and development of the social networking era, the Internet has become a promising platform for online learning, exchanging ideas and sharing opinions. Social media contains an enormous quantity of sentiment data in the form of tweets, blogs, status updates, posts, etc. In this paper, the most popular microblogging platform, Twitter, is used. Twitter sentiment analysis is an application of sentiment analysis to data from Twitter (tweets) in order to extract users' sentiments and opinions. The key goal is to explore how text analysis methods can be used to dig into some of the information in a sequence of markers concentrating on different trends in tweet dialects and tweet dimensions on Twitter. New evaluations show that the proposed machine learning classifiers are effective and perform better in terms of accuracy and time. The proposed algorithm is implemented in Python.

- by IAEME Publication
- •
- Machine Learning, Sentiment Analysis, Python, Algorithm

Data mining is the process of extracting interesting, non-trivial, implicit, previously unknown and potentially useful patterns or knowledge from various data sources with the help of various techniques. Classification is the process of finding a model that describes and distinguishes data classes or concepts. Several algorithms exist for classification in data mining; each has its strengths and weaknesses, and no single algorithm is the most suitable for all classes of data. This project evaluates the performance of three classification algorithms: the decision tree algorithm, the naïve Bayes algorithm, and the k-nearest neighbour algorithm.
The Waikato Environment for Knowledge Analysis (WEKA) was used to analyze the algorithms; the performance parameters include classification accuracy, error rate, execution time, confusion matrix, and area under the curve. Five datasets were used for the analysis: the Iris dataset, a chronic kidney disease dataset, a breast cancer dataset, a diabetes dataset, and a hypothyroid dataset. The datasets were obtained from the UCI Machine Learning Repository and split into training and testing sets at ratios of 60%/40% and 70%/30%.
The decision tree algorithm was found to be more accurate than the naïve Bayes and K-NN algorithms. In terms of execution time, K-NN outperforms naïve Bayes and decision trees on the five datasets. However, K-NN also recorded the highest average error rate across the five datasets.
Therefore, no single algorithm is best suited to every situation; the performance of classification algorithms depends on the type and size of the dataset, i.e., an algorithm that is appropriate for one dataset may not be appropriate for another.
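
Some of the performance parameters named in the abstract above (classification accuracy, error rate and confusion matrix), together with a percentage train/test split, can be computed as in the following sketch. It is written in plain Python as an illustration and is not WEKA's implementation; the labels, predictions and seed are made up.

```python
import random
from collections import Counter

def split(items, train_fraction, seed=0):
    """Shuffle a copy of the data and split it into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_matrix(y_true, y_pred):
    """Rows are true labels, columns are predicted labels, both in sorted order."""
    labels = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_true, y_pred))
    return labels, [[counts[(t, p)] for p in labels] for t in labels]

# Hypothetical true labels and classifier output
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]

acc = accuracy(y_true, y_pred)
error_rate = 1 - acc
labels, matrix = confusion_matrix(y_true, y_pred)

train_60, test_40 = split(range(10), 0.6)
train_70, test_30 = split(range(10), 0.7)
```

Running the same split and the same metrics for every algorithm on every dataset is what makes results such as those reported above comparable.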

A comparative study of algorithms is essential before implementing them to meet the needs of any organization. The comparison depends on various parameters such as data frequency, the types of data and the relationships among the attributes in a given data set. A number of learning and classification algorithms are available to analyse data, learn patterns and categorize data. The problem is to find the best algorithm for the given problem and the desired output. The desired result has always been higher accuracy in predicting future values or events from the given dataset. The algorithms chosen for this comparative study are neural net, SVM, Naïve Bayes, BFT and decision stump. These are among the most influential data mining algorithms in the research community, and they are widely used in the field of knowledge discovery and data mining.

Lung cancer is a malignant lung tumour that is characterised by the uncontrolled growth of cells in the lung tissue. It is the most commonly diagnosed cancer worldwide, and it causes more deaths than any other kind of cancer. Early diagnosis and care greatly improve the survival of cancer patients. Different image processing and soft computing methods may be used to identify cancer cells in medical images. Classification depends on features extracted from the images, so to produce better classification results the focus is on the feature extraction level. The extracted features are given to machine learning algorithms in order to find patterns that provide useful insight into which combinations of features are most likely to indicate an abnormality. The prediction of lung cancer is analysed using various machine learning classification algorithms such as Naive Bayes, Support Vector Machine, Artificial Neural Network and Logistic Regression. The key aim of this paper is to diagnose lung cancer early by examining the performance of existing classification algorithms.

- by IJCSMC Journal and +1
- •
- Computer Science, Algorithms, Information Technology, Technology

The main objective of this research is to develop an intelligent system using a data mining modeling technique, namely Naive Bayes. It is implemented as a web-based application in which the user answers predefined questions. It retrieves hidden data from a stored database and compares the user's values with a trained data set. It can answer complex queries for diagnosing heart disease and thus help healthcare practitioners make intelligent clinical decisions and provide effective treatments, which traditional decision support systems cannot do.

Since man invented agriculture, plant disease epidemics have been a major challenge for crop
growers. Their rapid spread can cause huge crop losses, threatening the food and nutritional
security of millions of people at the same time.
In this report, we discuss the role of machine learning techniques in the control and monitoring
of plant diseases. In fact, thanks to classification algorithms, it is possible to build a predictive
model of a plant disease by examining the historical data of climatic conditions that have already
promoted its occurrence. Therefore, this will help detect the danger sufficiently in advance to
inform farmers of its imminence so they can take necessary precautions.
The objective of our study is to determine the evaluation criteria for the performance of
classification algorithms, to compare them and to select the best one for plant disease forecasting.
Through in-depth research into classification algorithms and available comparison
methods, we have shown that the best way to select an algorithm is to try them all on the same
data set. The steps of this approach are illustrated by an example of a comparative test made with
the Weka tool on a set of apple scab observations.
In parallel, throughout the literature we find that systems using machine learning to predict
the occurrence of plant diseases are used frequently in some parts of the globe, particularly the
United States. This is because of the collaboration between universities in different fields and the
exchange of the knowledge needed to implement such systems. In Algeria, by contrast,
studies on plant epidemics are limited to their chemical and biological aspects.

- by Bouhenni Sarra and +1
- •
- Artificial Intelligence, Machine Learning, Forecasting, Logistic Regression

The use of intelligent models for predicting diseases, whether to help doctors or to prevent the spread of a disease in an area or globally, is increasing day by day. Here we present a novel approach to predicting disease-prone areas using text analysis and machine learning. The main focus of this work is an epidemic search model that analyses social network data and uses it to provide a probability score for the spread and to assess whether an area is going to suffer from an epidemic outbreak. We analyse and showcase how the model predicts the output with different kinds of pre-processing and algorithms. We used combinations of word n-grams, word embeddings and TF-IDF with different data mining and deep learning algorithms such as SVM, Naïve Bayes and RNN-LSTM. Naïve Bayes with TF-IDF performed better than the others.
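
The TF-IDF weighting mentioned in the abstract above can be sketched in a few lines. The version below uses relative term frequency and the plain logarithmic inverse document frequency log(N / df). The abstract does not say which variant the authors used, library implementations often add smoothing, and the example posts are invented.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical short posts
posts = ["flu outbreak in city", "flu cases rising", "city council meeting"]
weights = tf_idf(posts)
```

Terms that occur in many posts, such as "flu" here, receive lower weights than terms concentrated in a single post, so the resulting vectors emphasise distinctive words before they are passed to a classifier such as Naive Bayes.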