Random Forests Research Papers - Academia.edu

Objective: The objective of this paper is to highlight the state-of-the-art machine learning (ML) techniques in computational docking. The use of smart computational methods in the life cycle of drug design is a relatively recent development that has gained much popularity and interest over the last few years. Central to this methodology is the notion of computational docking, which is the process of predicting the best pose (orientation + conformation) of a small molecule (drug candidate) when bound to a larger target receptor molecule (protein) in order to form a stable complex molecule. In computational docking, a large number of binding poses are evaluated and ranked using a scoring function. The scoring function is a mathematical predictive model that produces a score representing the binding free energy, and hence the stability, of the resulting complex molecule. Generally, such a function should produce a set of plausible ligands ranked according to their binding stability, along with their binding poses. In more practical terms, an effective scoring function should produce promising drug candidates that can then be synthesized and physically screened using a high-throughput screening process. Therefore, the key to computer-aided drug design is the design of an efficient, highly accurate scoring function (using ML techniques).
Methods: The methods presented in this paper are specifically based on ML techniques. Although many traditional techniques have been proposed, their performance was generally poor. Only in the last few years has ML technology been applied to the design of scoring functions, and the results have been very promising.
Material: The ML-based techniques are based on various molecular features extracted from the abundance of protein-ligand information in public molecular databases, e.g., the PDBbind database.
Results: In this paper, we present this paradigm shift, elaborating on the main constituent elements of the ML approach to molecular docking along with the state-of-the-art research in this area. For instance, the best random forest (RF)-based scoring function (Li, 2014) on PDBbind v2007 achieves a Pearson correlation coefficient between the predicted and experimentally determined binding affinities of 0.803, while the best conventional scoring function achieves 0.644 (Cheng, 2009). The best RF-based ranking power (Ashtawy, 2012) ranks the ligands correctly based on their experimentally determined binding affinities with an accuracy of 62.5% and identifies the top binding ligand with an accuracy of 78.1%.
Conclusions: We conclude with open questions and potential future research directions that can be pursued in smart computational docking: using molecular features of different natures (geometrical, energy terms, pharmacophore), applying advanced ML techniques (e.g., deep learning), and combining more than one ML model.
Keywords:
machine learning, random forest, support vector machine, drug discovery, computational docking, scoring function, virtual screening, complex binding affinity, ligands ranking accuracy, force field interaction, pharmacophore fingerprint.
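
To make the scoring-function idea concrete, here is a minimal sketch in Python with scikit-learn (the papers above do not publish their code, so the descriptors, data, and model settings below are invented stand-ins, not the features used by Li, 2014): a random forest regressor is fitted to feature vectors describing protein-ligand complexes and evaluated with the same Pearson correlation metric quoted in the Results.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Hypothetical descriptor matrix: one row per protein-ligand complex,
# columns are intermolecular features (e.g., atom-pair contact counts).
X = rng.random((500, 36))
# Synthetic "experimentally determined" binding affinities (e.g., pKd).
y = X[:, :6].sum(axis=1) + 0.1 * rng.standard_normal(500)

model = RandomForestRegressor(n_estimators=500, random_state=0)
model.fit(X[:400], y[:400])
r, _ = pearsonr(y[400:], model.predict(X[400:]))
print(f"Pearson correlation on held-out complexes: {r:.3f}")
```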

Data mining is a generous field for researchers due to its various approaches to knowledge discovery in enormous volumes of data stored in different formats. At present, data are widely used all over the world, covering areas such as education, industry, medicine, banking, insurance companies, research laboratories, business, the military domain, etc. The major gain from applying data mining techniques is the discovery of unknown patterns and relations between data, which can further help in decision-making processes. There are two forms of data analysis used to extract models by describing important classes or to predict future data trends: classification and prediction. In this paper, the authors present a comparative study of classification algorithms (i.e. Decision Tree, Naïve Bayes and Random Forest) that are currently applied to demographic data referring to death statistics using the KNIME Analytics Platform. Our study was based on statistical data provided by the Nation...
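
The study above was carried out in the KNIME Analytics Platform; as a rough equivalent, the following Python/scikit-learn sketch compares the same three classifiers by cross-validation on a synthetic stand-in for the demographic dataset (all data and settings here are illustrative assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a demographic classification dataset.
X, y = make_classification(n_samples=1000, n_features=12, random_state=42)

for name, clf in [("Decision Tree", DecisionTreeClassifier(random_state=42)),
                  ("Naive Bayes", GaussianNB()),
                  ("Random Forest", RandomForestClassifier(random_state=42))]:
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```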

With the explosive growth in the world's population, which has seen little or no corresponding rise in food production, food insecurity has become imminent; hence, the need to seek opportunities to increase food production in order to cater for this population is paramount. The second goal of the Sustainable Development Goals (SDGs) (i.e., ending hunger, achieving food security and improved nutrition, and promoting sustainable agriculture) set by the United Nations (UN) for the year 2030 clearly acknowledges this fact. Improving food production cannot be achieved using the obsolete conventional methods of agriculture employed by our farmers; hence, this study focuses on developing a model for predicting climatic conditions with a view to reducing their negative impact and boosting crop yield. Temperature, wind, humidity and rainfall were considered, as the effect of these factors is more devastating in Nigeria compared to sunlight, which is always in abundance. We implemented ...

A computer vision approach to classifying garbage into recycling categories could be an efficient way to process waste. This project aims to take garbage waste images and classify them into four classes: glass, paper, metal, and plastic. We use a garbage image database that contains around 400 images for each class. The models used in the experiments are pre-trained VGG-16 (VGG16), AlexNet, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Random Forest (RF). Experiments showed that our models reached an accuracy of around 93%.

A terrifying spread of COVID-19 (also known as severe acute respiratory syndrome coronavirus 2, or SARS-CoV-2) has led scientists to make tremendous efforts to reduce the pandemic's effects. COVID-19, discovered in 2019, was declared a pandemic and has affected millions of people. Infected people may experience headache, body pain, and sometimes difficulty in breathing. For older people, the symptoms can get worse. It can also cause death because of its huge effect on some parts of the human body, particularly for those who have chronic diseases like diabetes. Machine learning algorithms are applied to patients diagnosed with the coronavirus to estimate the severity of the disease at an early stage, based on their chronic diseases. Chronic diseases can raise the severity of COVID-19, and that is what has been proved in this paper. This paper applies different machine learning techniques such as random forest, decision tree, linear regression, binary search, and k-nearest neighbor on a dataset of Mexican patients to find out the impact of lifelong illnesses on increasing the symptoms of the virus in the human body. Besides, the paper demonstrates that in some cases, especially for older people, the virus can cause inevitable death.

Acute Myocardial Infarction (heart attack), a cardiovascular disease (CVD) that leads to ischemic heart disease (IHD), is one of the major killers worldwide. A proficient approach is proposed in this paper that can predict the chances of heart attack when a person is experiencing chest pain or equivalent symptoms. We have developed a prototype by integrating clinical data collected from patients admitted to different hospitals with Acute Myocardial Infarction (AMI). 25 attributes related to symptoms of heart attack are collected and analyzed, where chest pain, palpitation, breathlessness, and syncope with nausea, sweating, and vomiting are the prominent symptoms of a person having a heart attack. The data mining techniques decision tree and random forest are used to analyze the heart attack dataset: classification of the more common symptoms related to heart attack is done using the C4.5 decision tree algorithm, while random forest is applied to improve the accuracy of the classification result of heart attack prediction. A guiding system that assesses whether chest pain indicates a heart attack may help the many people who tend to neglect chest pain and later end up in the catastrophe of a heart attack.
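
scikit-learn has no exact C4.5 implementation, so the sketch below approximates the paper's setup with an entropy-criterion decision tree and a random forest on synthetic data standing in for the 25 symptom attributes; it only illustrates the tree-versus-forest comparison, not the authors' actual prototype.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 25 symptom attributes described above.
X, y = make_classification(n_samples=600, n_features=25, n_informative=8,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Entropy-based tree: scikit-learn's closest analogue to C4.5's
# information-gain splitting (true C4.5 uses the gain ratio).
tree = DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=1).fit(X_tr, y_tr)
print("Single tree accuracy:", tree.score(X_te, y_te))
print("Random forest accuracy:", forest.score(X_te, y_te))
```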

Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, like packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features like packet size, arrival time, source and destination addresses, and other such metadata to detect malware. Such information can be used to train machine learning classifiers to classify malicious and benign packets. In this paper, we offer an efficient malware detection approach using classification algorithms in machine learning such as support vector machine, random forest and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. The random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998 respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems.
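
A hedged sketch of the described workflow (feature selection, train/test split, hyperparameter tuning, AUC evaluation) using scikit-learn's random forest; the flow-metadata features are simulated, and the selection and grid settings are illustrative choices, not those of the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for flow metadata (packet sizes, timing, ports, ...).
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=7)

# Dimensionality reduction via univariate feature selection, then RF.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("rf", RandomForestClassifier(random_state=7))])
grid = GridSearchCV(pipe, {"rf__n_estimators": [100, 300],
                           "rf__max_depth": [None, 10]},
                    scoring="roc_auc", cv=3)
grid.fit(X_tr, y_tr)
proba = grid.predict_proba(X_te)[:, 1]
print("Test AUC:", roc_auc_score(y_te, proba))
```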

Southland English has historically been New Zealand’s only (partially) rhotic variety. There has only been one large-scale study of Southland (r), which suggested a resurgence of rhoticity following NURSE among young women (Bartlett 2002). We build on this work, using modern statistical methods to better understand the linguistic and social conditioning of change in Southland (r).
We analysed over 20,000 tokens of non-prevocalic Southland (r), coded as present/absent. 20% of tokens were hand-coded; the rest were automatically coded via a random-forest classifier trained on the hand-coded tokens to predict (r) presence/absence based on 180 acoustic measures (this auto-coder achieved over 80% accuracy on the hand-coded training set). Data were modelled via logistic mixed-effects regression, with a three-way generation distinction (birth years 1900–30, 1931–55, 1956–80).
As expected, this analysis reveals a significant effect for vowel, with greater rhoticity for NURSE than other vowels, and a significant effect for generation, which indicates a change in apparent time. The statistical modelling allows us to see further fine-grained phonological and grammatical conditioning of the change as it progressed throughout the speech community. For instance, we find that NURSE before non-sibilant fricatives lags in rhoticity among the oldest speakers but catches up to other NURSE environments among middle and young speakers. We also find that the increase of rhoticity in NURSE appeared first in content words, then spread to function words. We discuss the full trajectory of change for Southland (r) and highlight some implications for theories of phonological change more generally.
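
The auto-coding workflow can be sketched as follows, assuming Python/scikit-learn and fabricated acoustic measures (the study's 180 real measures and its classifier settings are not public): train on the hand-coded subset, estimate accuracy by cross-validation, then label the remaining tokens.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_tokens, n_features = 2000, 180       # 180 acoustic measures, as in the study
acoustics = rng.standard_normal((n_tokens, n_features))
# Hypothetical hand-coded labels for 20% of tokens (1 = /r/ present).
hand_idx = rng.choice(n_tokens, size=n_tokens // 5, replace=False)
hand_labels = (acoustics[hand_idx, 0]
               + 0.5 * rng.standard_normal(len(hand_idx)) > 0).astype(int)

coder = RandomForestClassifier(n_estimators=100, random_state=3)
# Estimate coder accuracy on the hand-coded tokens, then code the rest.
print("CV accuracy:",
      cross_val_score(coder, acoustics[hand_idx], hand_labels, cv=5).mean())
coder.fit(acoustics[hand_idx], hand_labels)
auto_labels = coder.predict(np.delete(acoustics, hand_idx, axis=0))
```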

Double parking is a common occurrence in dense urban areas. It routinely causes danger for cyclists and pedestrians and short-term traffic disruptions that impede traffic flow. Using New York City as a case study, this paper introduces a novel data-driven framework for understanding the influential factors and estimating the actual frequency of double parking by utilizing parking violation tickets, 311 service requests, and social media information together with surrounding street characteristics. Three feature selection methods, LASSO, stability selection, and Random Forests, are applied to identify the contributing factors: the number of hotel rooms, traffic volume, commercial usage, block length, and curbside parking spaces are ranked as the top five factors contributing to double parking. Random Forests, one of the most effective machine learning techniques, is also applied to predict the double parking performance of 50 locations in Midtown Manhattan, New York, where ground truth data is available. The Random Forests model achieves 85% prediction accuracy. The study demonstrates that violation tickets and 311 service requests, supplemented with additional street characteristics, are able to offer a higher level of prediction accuracy for double parking events. This predictive power can be further applied to a macroscopic or microscopic traffic simulation model to evaluate double parking impacts on traffic delay and safety. In addition, this study can provide transportation agencies with insights into effective data collection strategies to identify potential double parking hotspots for better policy-making, enforcement, and management.
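
As an illustration of ranking contributing factors, the sketch below compares LASSO coefficients with random-forest importances on synthetic data; the predictor names are hypothetical labels echoing the factors above, not the paper's actual variables (stability selection is omitted, as it has no standard scikit-learn implementation).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

# Hypothetical street-level predictors of double-parking frequency.
names = ["hotel_rooms", "traffic_volume", "commercial_use", "block_length",
         "curb_spaces", "bus_stops", "bike_lanes", "population"]
X, y = make_regression(n_samples=300, n_features=len(names), n_informative=5,
                       noise=5.0, random_state=0)

lasso = LassoCV(cv=5).fit(X, y)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
# Rank predictors by random-forest importance; show LASSO coefficients too.
for name, coef, imp in sorted(zip(names, lasso.coef_, rf.feature_importances_),
                              key=lambda t: -t[2]):
    print(f"{name:15s} lasso={coef:8.2f} rf_importance={imp:.3f}")
```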

This chapter discusses popular non-parametric methods in corpus linguistics: conditional inference trees and conditional random forests. These methods, which allow the researcher to model and interpret the relationships between a numeric or categorical response variable and various predictors, are particularly attractive in 'tricky' situations, when the use of parametric methods (in particular, regression models) can be problematic, for example, in the situations of 'small n, large p', complex interactions, non-linearity and correlated predictors. For illustration, the chapter discusses a case study of T and V politeness forms in Russian based on a corpus of film subtitles.

In this paper we present an approach to detect whether an MRI scan of a brain contains a tumor or not using machine learning. Once a tumor is detected, the system then classifies it as either benign or malignant. In any medical field, one of the most important resources used by doctors is medical imaging, a tool with high accuracy. In this work, the system correctly classifies MRI images into images with a tumor and images without a tumor, with no human intervention. In order to apply several types of classifiers, several aspects of the images need to be pre-processed, such as the color, area of interest, image file extension, and contrast level.

In simulation-based realization of complex systems, we are forced to address the issue of computational complexity. One critical issue that must be addressed is the approximation of reality using surrogate models to replace expensive simulation models of engineering problems. In this paper, we critically review over two hundred papers. We find that a framework for selecting appropriate surrogate modeling methods for a given function with specific requirements is lacking. To address this gap, we hypothesize that a trade-off among three main drivers, namely, size (how much information is necessary to compute the surrogate model), accuracy (how accurate the surrogate model must be) and computational time (how much time is required for the surrogate modeling process) is needed. In the context of this hypothesis, we review the state-of-the-art surrogate modeling literature to answer the following three questions: 1. What are the main classes of the design of experiment (DOE) methods, surrogate modeling methods and model-fitting methods based on the requirements of size, computational time, and accuracy? 2. Which surrogate modeling method is suitable based on the critical characteristics of the requirements of size, computational time and accuracy? 3. Which DOE is suitable based on the critical characteristics of the requirements of size, computational time and accuracy?
Based on these three characteristics, six different categories for the surrogate models are framed through a critical evaluation of the literature. These categories provide a framework for selecting an efficient surrogate modeling process to assist those who wish to select more appropriate surrogate modeling techniques for a given function (summarized in Table 4 and Figures 3 and 5). Artificial neural networks, response surface models, and kriging are more appropriate for large problems, less computation time, and high accuracy, respectively. Latin hypercube, fractional factorial, and D-optimal designs are appropriate experimental designs. Our proposed framework is a qualitative evaluation and a mental model based on the quantitative results and findings of authors in the published literature. The value of such a framework is in providing practical guidance for researchers and practitioners in industry to choose the most appropriate surrogate model based on incomplete information about the engineering design problem. Our contribution is to use three drivers, namely computational time, accuracy, and problem size, instead of the single measure that authors have generally used in the published literature.
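
As a toy example of the surrogate-modeling idea, the sketch below fits a kriging surrogate (a Gaussian process in scikit-learn) to a handful of runs of a stand-in "expensive" function; the function, design points, and kernel are illustrative assumptions, not from the reviewed papers.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Stand-in for an expensive simulation: one call = one simulator run.
def expensive_simulation(x):
    return np.sin(3 * x) + 0.5 * x

# Small design of experiments (uniform here; Latin hypercube is a common
# alternative mentioned above).
X_doe = np.linspace(0, 2, 8).reshape(-1, 1)
y_doe = expensive_simulation(X_doe).ravel()

# Kriging surrogate fitted to the DOE points; predictions come with an
# uncertainty estimate, useful for adaptive sampling.
surrogate = GaussianProcessRegressor(kernel=RBF(length_scale=0.5)).fit(X_doe, y_doe)
x_new = np.array([[1.37]])
mean, std = surrogate.predict(x_new, return_std=True)
print(f"surrogate: {mean[0]:.3f} +/- {std[0]:.3f}, "
      f"truth: {expensive_simulation(1.37):.3f}")
```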

Late second language (L2) learners report difficulties in specific linguistic areas such as syntactic processing, presumably because brain plasticity declines with age (following the critical period hypothesis). While there is also evidence that L2 learners can achieve native-like online processing with sufficient proficiency (following the convergence hypothesis), considering multiple mediating factors and their impact on language processing has proven challenging. We recorded EEG while native (n = 36) and L2 speakers of French (n = 40) read sentences that were either well-formed or contained a syntactic-category error, a lexical-semantic anomaly, or both. Consistent with the critical period hypothesis, group differences revealed that while native speakers elicited a biphasic N400-P600 in response to ungrammatical sentences, L2 learners as a group only elicited an N400. However, individual data modeling using a Random Forests approach revealed that language exposure and proficiency are the most reliable predictors in explaining ERP responses, with N400 and P600 effects becoming larger as exposure to French and proficiency increased, as predicted by the convergence hypothesis.

Machine learning (ML) is a subject that focuses on data analysis using various statistical tools and learning processes in order to gain more knowledge from the data. The objective of this research was to apply one of the ML techniques to low birth weight (LBW) data in Indonesia. This research conducts two ML tasks: prediction and classification. The binary logistic regression model was first employed on the training and test data. Then, the random forest approach was also applied to the data set. The results showed that binary logistic regression had a good performance for prediction, but it was a poor approach for classification. On the other hand, the random forest approach had a very good performance for both prediction and classification of the LBW data set.
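
A minimal sketch of the comparison, assuming Python/scikit-learn and a synthetic imbalanced dataset in place of the Indonesian LBW data (the study's variables and settings are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for LBW risk factors; class 1 (low birth weight) is rare.
X, y = make_classification(n_samples=1500, n_features=10, weights=[0.85],
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=5)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=5))]:
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: accuracy={accuracy_score(y_te, pred):.3f} auc={auc:.3f}")
```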

The purpose of this study is to build machine learning models to predict the band gap of binary compounds, using known properties such as molecular weight, electronegativity, atomic fraction, and the group of the constituent elements in the periodic table. Regression techniques, namely linear regression, Ridge regression, and Random Forest, were used to build the models. These models can be used by students and researchers in experiments involving unknown band gaps or new compounds.
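
A hedged sketch of the modeling setup with scikit-learn; the eight features are random stand-ins for the compound descriptors listed above, not real materials data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for compound descriptors (molecular weight,
# electronegativity, atomic fraction, periodic-table group, ...).
X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

for name, model in [("linear", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("random forest", RandomForestRegressor(random_state=2))]:
    model.fit(X_tr, y_tr)
    print(f"{name}: R^2 = {r2_score(y_te, model.predict(X_te)):.3f}")
```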

Land-Use/Land-Cover (LULC) products are a common source of information and a key input for spatially explicit models of ecosystem service (ES) supply and demand. Global, continental, and regional, readily available, free land-cover products generated from Earth Observation (EO) data can potentially be used as relevant inputs to ES mapping and assessment processes from regional to national scales. However, several limitations exist in these products, highlighting the need for timely land-cover extraction on demand that could replace or complement existing products. This study focuses on the development of a classification workflow for fine-scale, object-based land cover mapping, employed for terrestrial ES mapping within the Greek terrestrial territory. The processing was implemented in the Google Earth Engine cloud computing environment using 10 m spatial resolution Sentinel-1 and Sentinel-2 data. Furthermore, the relevance of different training data extraction strategies and temporal EO information for increasing the classification accuracy was also evaluated. The different classification schemes demonstrated differences in overall accuracy ranging from 0.88% to 4.94%, with the most accurate classification scheme being the manual sampling/monthly feature classification, achieving a 79.55% overall accuracy. The classification results suggest that existing LULC data must be cautiously considered for automated extraction of training samples in the case of new supervised land cover classifications that also aim to discern complex vegetation classes. The code used in this study is available on GitHub and runs on the Google Earth Engine web platform.

Movie success prediction plays a vital role in the movie industry, as it involves huge amounts of investment. However, the success rate of a movie cannot be predicted based on a single attribute. Hence, a model is built based on interesting relationships between the attributes. The movie industry can use this model to adjust movie criteria to increase the likelihood of a blockbuster. Every criterion involved is given a weight, and the prediction is made based on these weights. For example, if a movie's budget was below 5 million, the budget was given a lower weight. Based on the number of past successful movies by the actors, directors, and producers, each category is given equal weightage. If the movie is to be released on a weekend, it is given a higher weight, because the chances of success are greater. If another highly successful movie was released at the same time, a lower weight was given to the release time, indicating that the movie's chance of success is low as a result of the competition. The criteria were not limited just to the ones mentioned; additional factors are discussed in this work.
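
The weighting idea can be illustrated with a toy weighted sum; the weights, scores, and threshold below are invented for illustration and are not the paper's actual values:

```python
# Hypothetical weights and criterion scores; the paper's actual weighting
# scheme and values are not specified in the abstract.
criteria = {
    "budget":       (0.15, 0.4),  # (weight, score): low budget -> low score
    "cast_history": (0.25, 0.8),  # past successes of actors/directors/producers
    "release_slot": (0.30, 0.9),  # weekend release -> higher score
    "competition":  (0.30, 0.3),  # strong competing release -> lower score
}
success_score = sum(w * s for w, s in criteria.values())
print(f"predicted success score: {success_score:.2f}")  # e.g., threshold at 0.5
```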

The literature review on random forests and text mining provided in this paper makes clear the link and relevance that exists between these two fields, and shows how academia and industry are conducting an increasing number of studies on these subjects, keeping them an interesting area for further development.

This study explores the forecasting of Major League Baseball game ticket sales and identifies important attendance predictors by means of random forests that are grown from classification and regression trees (CART) and conditional inference trees. Unlike previous studies that predict sport demand, I consider different forecasting horizons and only use information that is publicly accessible in advance of a game or season. Models are trained using data from 2013 to 2014 to make predictions for the 2015 regular season. The static within-season approach is complemented by a dynamic month-ahead forecasting strategy. Out-of-sample performance is evaluated for individual teams and tested against least-squares regression and a naive lagged attendance forecast. My empirical results show high variation in team-specific prediction accuracy with respect to both models and forecasting horizons. Linear and tree-ensemble models, on average, do not vary substantially in predictive accuracy; however, OLS regression fails to account for various team-specific peculiarities.

The goal of this project is to predict housing prices in Melbourne (Australia) using several statistical/machine learning prediction models. Supervised machine learning is used for all models. In total, five statistical learning models are used: three of them are variations of linear regression models, while the other two are decision tree-based models. The models based on linear regression are ordinary least squares (OLS), OLS with Ridge regression, and the LASSO model. The decision tree-based models are the Decision Tree and the Random Forest. The dataset used for the project was downloaded from www.kaggle.com and contains 34,847 observations and 21 variables. First, the data is cleaned and explored using the principles of Exploratory Data Analysis. The target variable in the given dataset is PRICE. The models based on linear regression gave solid results, with Ridge regression being the best one. The best performing model among all models used for the project is the Random Forest. The model improves results until 300 trees are reached; afterwards, no significant changes were observed. The MAPE of the Random Forest is 9.5% and the MAPE of the Ridge regression is 25.5%.
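
A minimal sketch of the evaluation, assuming scikit-learn and synthetic data shifted to positive "prices" so that MAPE is well defined; the models and metric mirror the description above, not the project's exact code:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=15, noise=20.0, random_state=4)
y = y - y.min() + 100_000        # shift to positive "prices" so MAPE is defined
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=4)

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100 * np.mean(np.abs((y_true - y_pred) / y_true))

for name, model in [("ridge", Ridge()),
                    ("random forest",
                     RandomForestRegressor(n_estimators=300, random_state=4))]:
    model.fit(X_tr, y_tr)
    print(f"{name}: MAPE = {mape(y_te, model.predict(X_te)):.1f}%")
```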

Internet usage has become intensive during the last few decades; this has given rise to the use of email, one of the fastest yet cheapest modes of communication. The growing demand for email communication has given rise to spam email, also known as unsolicited mail. In this paper we propose an ensemble model that uses majority voting on top of several classifiers to detect spam. The classification algorithms used for this purpose are Naïve Bayes, Support Vector Machines, Random Forest, Decision Stump and k-Nearest Neighbor. Majority voting generates the final decision of the ensemble by obtaining the majority of votes from the classifiers. The sample dataset used for this task is taken from UCI, and the RapidMiner tool is used for the validation of the results.
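
The paper's experiments were run in RapidMiner; a rough scikit-learn equivalent of the majority-voting ensemble looks like this (the 57 synthetic features merely echo the dimensionality of common UCI spam data, an assumption on our part):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for email features from a UCI spam dataset.
X, y = make_classification(n_samples=2000, n_features=57, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)

# A decision stump is a depth-1 decision tree.
ensemble = VotingClassifier(estimators=[
    ("nb", GaussianNB()),
    ("svm", SVC()),
    ("rf", RandomForestClassifier(random_state=6)),
    ("stump", DecisionTreeClassifier(max_depth=1)),
    ("knn", KNeighborsClassifier()),
], voting="hard")                 # majority vote over the five classifiers
ensemble.fit(X_tr, y_tr)
print("ensemble accuracy:", ensemble.score(X_te, y_te))
```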

Small Excel and VBA Demonstration of Random Forest using the ALGLIB Library

Image segmentation is a topic of paramount importance in our society; it finds applications in computer vision, medical image analysis, robotic perception, video surveillance, and many other areas. Currently, various algorithms are deployed for the semantic segmentation task, and the most common are deep learning models such as convolutional neural networks (CNNs). In this report, we present, implement, and study three different algorithms to perform semantic segmentation of aerial images under the constraint of limited data and few classes. The first approach is a fully convolutional neural network (FCNN) designed by us, taking inspiration from U-Net. The second approach adapts the Xception pretrained classifier using transfer learning and fine-tuning (XTFT). The third and last approach is a Random Forest classifier (RF). The models are trained on the same dataset and in the same environment (same system specifics). Thanks to this, we provide a complete comparison of the three models, showing that the best approach in our case is an FCNN with a contained number of parameters.

Stock price prediction has always been a challenging task for researchers in the financial domain. While the Efficient Market Hypothesis claims that it is impossible to predict stock prices accurately, there are works in the literature that have demonstrated that stock price movements can be forecasted with a reasonable degree of accuracy, if appropriate variables are chosen and suitable predictive models are built using those variables. In this work, we present a robust and accurate framework for stock price prediction using statistical, machine learning, and deep learning methods. We use daily data on stock prices at five-minute intervals from the National Stock Exchange (NSE) of India and aggregate these granular data suitably to build the forecasting framework for stock prices. We contend that this framework, by combining several machine learning and deep learning methods, can accurately model the volatility of the stock price movement, and hence it can be utilized for short-term forecasting of the stock price. Eight classification and eight regression models, including one deep learning-based approach, have been built using data from two stocks listed on the NSE: Tata Steel and Hero Moto. Extensive results are presented on the performance of these models.

Random Forest is a supervised algorithm used for both classification and regression problems. It works by building a forest of decision trees and injecting randomness into their construction. The larger the number of trees, the more accurate the results, as illustrated below.
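
A minimal sketch showing this effect with scikit-learn on synthetic data: accuracy typically rises as trees are added and then plateaus.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Accuracy typically improves, then plateaus, as trees are added.
for n_trees in (1, 10, 100, 500):
    rf = RandomForestClassifier(n_estimators=n_trees,
                                random_state=0).fit(X_tr, y_tr)
    print(f"{n_trees:4d} trees: accuracy = {rf.score(X_te, y_te):.3f}")
```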

The main objective of this project is to predict groundwater levels in various areas under various circumstances. Various machine learning techniques have been used in this project to predict and forecast groundwater levels. India has recorded a critical fall in groundwater levels, ranging between 75 and 85 percent, according to a monitoring report by the Central Ground Water Board (CGWB). Groundwater levels in different parts of the country are declining as a result of continuous withdrawal driven by increased demand for fresh water for various uses, erratic rainfall, population growth, industrialization, and urbanization. The southern states of Kerala, Telangana, and Pondicherry recorded a decline of 40 to 46 percent. Andhra Pradesh and Tamil Nadu, which are facing an enormous water crisis, saw a 60 percent fall. In this application, the KNN and Random Forest algorithms are used for prediction and forecasting. We procured a dataset from a trusted resource to analyze the groundwater levels in various areas. The data training is implemented using an ANN, and a clustering method removes unwanted and irrelevant data from the dataset. The analysis is then performed on the pre-processed dataset using the random forest algorithm, which plays the major role in predicting the groundwater level. The random forest model analyzes various factors and attributes from the dataset, including annual rainfall, soil type, temperature, humidity, industrial areas, and the number of bore wells, lakes, and ponds. All of these fields are analyzed together to plot and predict present and future groundwater levels, and the information is displayed in charts and graphs for visual representation.

The suspended sediment load (SSL) is one of the major hydrological processes affecting the sustainability of river planning and management. Moreover, sediments have a significant impact on dam operation and reservoir capacity. To this end, reliable and applicable models are required to compute and classify the SSL in rivers. The application of machine learning models has become common to solve complex problems such as SSL modeling. The present research investigated the ability of several models to classify the SSL data. This investigation aims to explore a new version of machine learning classifiers for SSL classification at Johor River, Malaysia. Extreme gradient boosting, random forest, support vector machine, multi-layer perceptron and k-nearest neighbors classifiers have been used to classify the SSL data. The sediment values are divided into multiple discrete ranges, where each range can be considered as one category or class. This study illustrates two different scenarios related to the number of categories, which are five and 10 categories, with two time scales, daily and weekly. The performance of the proposed models was evaluated by several statistical indicators. Overall, the proposed models achieved excellent classification of the SSL data under various scenarios.
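
The categorization scenario can be sketched as follows, assuming Python/scikit-learn: continuous SSL values are discretized into quantile bins (five classes here, ten in the second scenario) and a random forest classifies them; the predictors and data are fabricated stand-ins for the Johor River series.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
# Hypothetical daily predictors (e.g., discharge, rainfall) and SSL values.
X = rng.random((2000, 4))
ssl = 1000 * X[:, 0] * X[:, 1] + 50 * rng.random(2000)

# Discretize the continuous load into 5 quantile classes, as in the paper's
# 5-category scenario; use 9 inner quantiles for the 10-category scenario.
classes = np.digitize(ssl, np.quantile(ssl, [0.2, 0.4, 0.6, 0.8]))
X_tr, X_te, y_tr, y_te = train_test_split(X, classes, random_state=8)
rf = RandomForestClassifier(n_estimators=300, random_state=8).fit(X_tr, y_tr)
print("5-class accuracy:", rf.score(X_te, y_te))
```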

The temperate-forest cover distributed across Mexican territory constitutes an important carbon sink, with the potential to mitigate adverse effects that contribute negatively to climate change. The state of Durango contains a wide extent of temperate forest and is therefore the main timber producer of the Mexican republic, a relevant activity in the management of forest resources that contribute to carbon capture. The objectives of this thesis were, through the implementation of geomatic methods, remote sensing, statistical machine learning techniques (RF and SVR), and biomass estimated from forest-inventory variables, to establish a geospatial model capable of predicting the spatial distribution of biomass in the state of Durango, specifically in the temperate forest area. Multiple fields of geomatics research were developed through programming, satellite statistical modeling, and geostatistical methods. Processing the information with these methodologies made it possible to build the geospatial database comprising the forty-four predictive variables. The statistical performance of the evaluated machine learning models shows that SVR is a robust and stable model for the evaluation of large spatial databases, since no variable needs to be excluded during its training and optimization. The computational cost of SVR is better than that of RF. With SVR, a statistical reliability of 80% is obtained in the estimation of forest biomass in the state of Durango, so it can be considered a model of easy application, low computational cost, and very good performance in the estimation of forest attributes.

The concept of machine learning has quickly become very attractive to the healthcare industry. Predictions and analyses made by the research community on medical datasets help people take proper care and precautions to prevent disease. This work explains various aspects of machine learning and the types of algorithms that can help in decision-making and prediction. We also discuss various applications of machine learning in the medical field, with a focus on diabetes prediction. Diabetes is one of the fastest-growing diseases in the world and requires continuous monitoring. To this end, we explore various machine learning algorithms that will help in the early prediction of this disease.

Thyroid disorder, which occurs due to a lack of thyroid hormone, is more common among women than men. A thyroid test report includes a number of attributes such as TSH, T3, TT4, T4U and more. Manually determining the disorder from the test reports of many people is not easy, so using a data mining approach makes this task simpler by predicting the disorder from a large dataset. Traditionally, the Linear Discriminant Analysis (LDA) data mining technique has been used to predict thyroid disorder. In our proposed work, the random forest approach is utilized to predict hypothyroid disorder, using a dataset collected from the UCI repository. The performance measure is calculated from the confusion matrix along with the accuracy. The experimental results are obtained using the Weka tool.

Enzymes play an important role in metabolism, helping to catalyze biochemical reactions. A computational method is required to predict the function of enzymes. Many feature selection techniques, identified by examining previous research papers, have been used in this paper. This paper presents a supervised machine learning approach to predict the functional classes and subclasses of enzymes based on a set of 857 sequence-derived features. It uses seven sequence-derived properties: amino acid composition, dipeptide composition, correlation features, composition, transition, distribution, and pseudo amino acid composition. Support Vector Machine Recursive Feature Elimination (SVM-RFE) is used to select the optimal number of features. Random Forest has been used to construct a three-level model with the optimal number of features selected by SVM-RFE, where the top level distinguishes a query protein as an enzyme or non-enzyme, the second level predicts the enzyme functional class, and the third level predicts the functional subclass. The proposed model reported an overall accuracy of 100%, precision of 100%, and MCC value of 1.00 for the first level; accuracy of 90.1%, precision of 90.5%, and MCC value of 0.88 for the second level; and accuracy of 88.0%, precision of 88.7%, and MCC value of 0.87 for the third level.
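
A hedged sketch of the SVM-RFE + Random Forest pipeline for the first level (enzyme vs non-enzyme), using scikit-learn and synthetic data in place of the 857 real sequence-derived features; the selection size and model settings are illustrative choices, not the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 857 sequence-derived features.
X, y = make_classification(n_samples=1200, n_features=857, n_informative=30,
                           random_state=9)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=9)

# SVM-RFE: recursively drop the features with the smallest |SVM weight|.
selector = RFE(SVC(kernel="linear"), n_features_to_select=50, step=0.1)
selector.fit(X_tr, y_tr)

# Level-1 classifier of the hierarchy: enzyme vs non-enzyme.
rf = RandomForestClassifier(n_estimators=300, random_state=9)
rf.fit(selector.transform(X_tr), y_tr)
print("level-1 accuracy:", rf.score(selector.transform(X_te), y_te))
```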

Sentiment analysis is an opinion mining process in which computational analysis and categorization of the opinion in a piece of text is done to obtain an unbiased understanding of the writer's opinion towards a specific topic. In this paper, sentiment analysis of the Twitter user demographic towards the Citizenship Amendment Act (CAA), which came into effect in India on January 10th, 2020, has been carried out. The CAA was chosen because it had garnered mixed opinions from different sections of the Indian demographic, so there was no clear understanding of the overall sentiment of the public towards it. It had also led to protests and riots in various parts of India, which the Government struggled to handle as it was unexpected.
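
A minimal sketch of such a sentiment classifier, assuming a TF-IDF representation and a random forest in scikit-learn; the tweets and labels are invented examples, and the paper's actual feature extraction and models may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny hypothetical tweets; a real study would use thousands of
# collected, labelled tweets about the CAA.
tweets = ["I fully support this act",
          "this law is unjust and divisive",
          "great decision by the government",
          "protesting against this act"]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features feeding a random forest classifier.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(tweets, labels)
print(model.predict(["the government made a great call"]))
```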