Anderson Ara - Academia.edu

Papers by Anderson Ara

Expert Systems With Applications, Oct 1, 2012

Fraud is a global problem that has demanded more attention as modern technology and communication have expanded. When statistical techniques are used to detect fraud, a critical factor is whether the fraud detection model is accurate enough to correctly classify a case as fraudulent or legitimate. In this context, the concept of bootstrap aggregating (bagging) arises. The basic idea is to generate multiple classifiers by fitting models to several replicated datasets, obtaining their predicted values, and then combining them into a single predictive classification in order to improve classification accuracy. In this paper, for the first time, we present a study of the performance of discrete and continuous k-dependence probabilistic networks within the context of bagging predictors for classification. Through a large simulation study and various real datasets, we found that probabilistic networks are a strong modeling option with high predictive capacity, and with a large gain from the bagging procedure when compared to traditional techniques.
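The bagging idea described in this abstract can be illustrated with a minimal sketch. It is not the paper's method: the base learner here is a hypothetical one-feature threshold classifier (`fit_stump`) rather than a k-dependence probabilistic network, and the toy data, function names, and number of models are all assumptions made for illustration.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Fit a toy base learner: predict 1 when x > threshold.

    The threshold is chosen among the observed x values to minimize
    training errors on the given sample of (x, label) pairs.
    """
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_fit(data, n_models=25, seed=0):
    """Fit one base learner per bootstrap replicate of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # A bootstrap replicate: same size as the data, drawn with replacement.
        replicate = [rng.choice(data) for _ in data]
        models.append(fit_stump(replicate))
    return models


def bagging_predict(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(int(x > threshold) for threshold in models)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (feature value, label), 1 = fraudulent, 0 = legitimate.
data = [(0.1, 0), (0.4, 0), (0.35, 0), (0.8, 1),
        (0.9, 1), (0.7, 1), (0.2, 0), (0.6, 1)]
models = bagging_fit(data)
print(bagging_predict(models, 0.05), bagging_predict(models, 0.95))
```

Majority voting is one common way to combine classifiers; averaging predicted probabilities is another, and the abstract does not say which the authors used.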

Surveys in Operations Research and Management Science, Dec 1, 2016

The need to control and effectively manage credit risk has led financial institutions to improve the techniques designed for this purpose, resulting in the development of various quantitative models by financial institutions and consulting companies. As a result, the growing number of academic studies on credit scoring shows a variety of classification methods applied to discriminate between good and bad borrowers. This paper presents a systematic literature review relating the theory and application of binary classification techniques for credit scoring in financial analysis. The general results show the use and importance of the main techniques for credit rating, as well as some of the changes in scientific paradigm over the years.

Springer eBooks, 2015

In this chapter we propose a simulation-based method for predicting football match outcomes. We adopt a Bayesian perspective, modeling the number of goals of two opposing teams as a Poisson distribution whose mean is proportional to the relative technical level of the opponents. Fédération Internationale de Football Association (FIFA) ratings were taken as the measure of the technical level of the teams, and experts' opinions on the scores of the matches were taken into account to construct the prior distributions of the parameters. Tournament simulations were performed in order to estimate the probabilities of winning the tournament, assuming different values for the weight attached to the experts' information and different choices for the sequence of weights attached to the previously observed matches. The methodology is illustrated on the 2010 Football World Cup.
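A rough sketch of the simulation idea follows. It covers only the Poisson goal model and leaves out the Bayesian prior built from experts' opinions. The `base_rate` value, the rating ratio used as the "relative technical level", the ratings themselves, and all function names are assumptions for illustration, not taken from the chapter.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_match(rating_a, rating_b, base_rate=1.3, n_sims=10000, seed=0):
    """Estimate win/draw/loss probabilities for team A by simulation.

    Assumption: each team's mean number of goals is base_rate times the
    ratio of its own rating to the opponent's rating.
    """
    rng = random.Random(seed)
    mean_a = base_rate * rating_a / rating_b
    mean_b = base_rate * rating_b / rating_a
    wins = draws = losses = 0
    for _ in range(n_sims):
        goals_a = sample_poisson(mean_a, rng)
        goals_b = sample_poisson(mean_b, rng)
        if goals_a > goals_b:
            wins += 1
        elif goals_a == goals_b:
            draws += 1
        else:
            losses += 1
    return wins / n_sims, draws / n_sims, losses / n_sims


# Hypothetical ratings, not actual FIFA values.
p_win, p_draw, p_loss = simulate_match(1500, 1000)
print(p_win, p_draw, p_loss)
```

A tournament simulation would repeat this match-level draw along the bracket and count how often each team ends up as champion.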

JMIR mental health, Nov 1, 2022

Recent developments in artificial intelligence technologies have come to a point where machine learning algorithms can infer mental status based on someone's photos and texts posted on social media. More than that, these algorithms are able to predict, with a reasonable degree of accuracy, future mental illness. They potentially represent an important advance in mental health care for preventive and early diagnosis initiatives, and for aiding professionals in the follow-up and prognosis of their patients. However, important issues call for major caution in the use of such technologies, namely, privacy and the stigma related to mental disorders. In this paper, we discuss the bioethical implications of using such technologies to diagnose and predict future mental illness, given the current scenario of swiftly growing technologies that analyze human language and the online availability of personal information through social media. We also suggest future directions to be taken to minimize the misuse of such important technologies.

Journal of data science, 2021

Ensemble techniques have been gaining strength among machine learning models for supervised tasks, due to their great predictive capacity when compared with some traditional approaches. The random forest is considered to be one of the off-the-shelf algorithms due to its flexibility and robust performance in both regression and classification tasks. In this paper, the random machines method is applied to simulated data sets and benchmarking data sets in order to be compared with the established random forest models. The results from the simulated models show that the random machines method has better predictive performance than random forest in most of the investigated data sets. Three real data applications demonstrate that random machines may be used to solve real-world problems with competitive performance.

Journal of data science, 2021

Improving statistical learning models in order to solve classification or regression problems more efficiently is still a goal pursued by the scientific community. The support vector machine model is one of the most successful and powerful algorithms for those tasks. However, its performance depends directly on the choice of the kernel function and its hyperparameters. In practice, the traditional approach to kernel choice and hyperparameter tuning can be computationally expensive. This article proposes a novel framework, called Random Machines, to deal with kernel function selection. The results show improved accuracy and reduced computational time. The study was performed on simulated data and on 27 real benchmarking datasets.

Stats, Oct 19, 2020

The analysis of massive databases is a key issue for most applications today, and the use of parallel computing techniques is one of the suitable approaches for it. Apache Spark is a widely employed tool within this context, aiming at processing large amounts of data in a distributed way. For the Statistics community, R is one of the preferred tools. Despite its growth in recent years, it still has limitations for processing large volumes of data on single local machines. In general, the data analysis community has difficulty handling a massive amount of data on local machines, often requiring high-performance computing servers. One way to perform statistical analyses over massive databases is to combine both tools (Spark and R) via the sparklyr package, which allows an R application to use Spark. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP, a conditional cash transfer programme), comprising a large data set with 1.26 billion observations. Our goal was to understand how this social program acts in different cities, as well as to identify potentially important variables reflecting its utilization rate. Statistical modeling was performed using random forest to predict the utilization rate of BFP. Variable selection was performed through a recent method based on the importance and interpretation of variables in the random forest model. Among the 89 variables initially considered, the final model presented high predictive performance with 17 selected variables, and indicated the high importance for the observed utilization rate of some variables related to income, education, job informality, and inactive youth, namely: family income, education, occupation, and density of people in the homes.
In this work, using a local machine, we highlighted the potential of combining Spark and R for the analysis of a large database of 111.6 GB. This can serve as a proof of concept or reference for other similar works within the Statistics community, and our case study can provide important evidence for further analysis of this important social support programme.

arXiv (Cornell University), Mar 27, 2020

Machine learning techniques always aim to reduce the generalized prediction error. To reduce it, ensemble methods offer a good approach, combining several models in a way that results in greater forecasting capacity. Random Machines have already been demonstrated to be a strong technique, i.e., one with high predictive power, for classification tasks. In this article we propose a procedure for using the bagged-weighted support vector model in regression problems. Simulation studies were carried out on artificial datasets and on real data benchmarks. The results showed good performance of Regression Random Machines, with lower generalization error and without the need to choose the best kernel function during the tuning process.

arXiv (Cornell University), Apr 29, 2016

In this paper we present the development of a modulated web-based statistical system, hereafter MWStat, which shifts the statistical paradigm of analyzing data into a real-time structure. The MWStat system is useful for both online data storage and questionnaire analysis, as well as for providing real-time delivery of results from analyses based on several statistical methodologies in a customizable fashion. Overall, it can be seen as a useful technical solution that can be applied to a large range of statistical applications that need a scheme for returning real-time results, accessible to anyone with internet access. We give step-by-step instructions for implementing the system. The structure is accessible, built with an easily interpretable language, and it can be strategically applied to online statistical applications. We rely on the combination of several free tools, namely PHP, R, a MySQL database, and an Apache HTTP server, and on the use of software tools such as phpMyAdmin. We present three didactic examples of the MWStat system on institutional evaluation, statistical quality control, and multivariate analysis. The methodology is also illustrated in a real example on institutional evaluation.

arXiv (Cornell University), Feb 5, 2016

Research Square (Research Square), Mar 15, 2023

Background: Nonverbal communication (NVC) is a complex behavior that involves different modalities that are impaired in the schizophrenia spectrum, including gesticulation. However, there are few studies that evaluate it in individuals with at-risk mental states (ARMS) for psychosis, mostly in developed countries. Given our prior findings of reduced movement during speech seen in Brazilian individuals with ARMS, we now aim to determine if this can be accounted for by reduced gesticulation behavior. Methods: 56 medication-naïve ARMS individuals and 64 healthy controls were filmed during speech tasks. The frequencies of specific coded gestures across four categories (and self-stimulatory behaviors) were compared between groups and tested for correlations with prodromal symptoms of the Structured Interview for Prodromal Syndromes (SIPS) and with the variables previously published. Results: ARMS individuals showed a reduction in one gesture category. Gesture frequency was negatively correlated with prodromal symptoms and positively correlated with the previously analyzed variables of amount of movement. Conclusion: The observed reduction in gesture performance agrees with literature findings in other cultural contexts in ARMS and schizophrenia subjects. The lack of differences for other categories might be related to differences within the ARMS group itself and the course of the disorder. These findings show the importance of analyzing NVC in ARMS and of considering different cultural and sociodemographic contexts in the search for markers of these states.

Brazilian Applied Science Review, 2020

The objective of this paper is to provide applied research comparing the traditional and bootstrap methods for calculating measurement uncertainty. For this purpose, a dimensional analysis of internal and external diameter was performed on one lot of around one hundred parts from a Brazilian company. Next, resamples with replacement (bootstrap samples) were drawn for each dimension obtained, followed by the uncertainty calculation. It was concluded that the proposed method is the most appropriate, because it decreases the estimation bias when working with a small sample size, which is common in metrology work. The companies researched assert that they do not perform the uncertainty calculation on dimension
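The resampling step in this abstract can be sketched as follows. This is only an illustration of a bootstrap estimate of the uncertainty of a mean, not the paper's procedure. The measurement values, the number of resamples, and the function name are hypothetical, and the classical value shown for comparison is the usual standard error of the mean.

```python
import math
import random
import statistics


def bootstrap_standard_uncertainty(measurements, n_resamples=2000, seed=0):
    """Estimate the standard uncertainty of the mean by bootstrap.

    Each bootstrap sample is drawn with replacement and has the same
    size as the original sample; the uncertainty is taken as the
    standard deviation of the resampled means.
    """
    rng = random.Random(seed)
    n = len(measurements)
    means = [
        statistics.fmean(rng.choices(measurements, k=n))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(means)


# Hypothetical diameter measurements in millimetres (not data from the paper).
diameters = [25.02, 24.98, 25.01, 25.03, 24.97, 25.00, 25.04, 24.99, 25.01, 25.02]

boot_u = bootstrap_standard_uncertainty(diameters)
classical_u = statistics.stdev(diameters) / math.sqrt(len(diameters))
print(boot_u, classical_u)
```

With well-behaved data the two values are typically close, so the comparison is most informative for small or skewed samples.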

Pesquisa Operacional, 2019

In this paper, new software for Statistical Process Control (SPC) is proposed. The system, called CEP Online, was developed using the statistical computing resources of well-known free tools, such as HTML, PHP, R, and MySQL, on an online server running the Ubuntu Linux operating system. The main univariate and multivariate SPC tools are available for monitoring and evaluating manufacturing and non-manufacturing production processes over time. Some advantages of the new software are: (i) low operational cost, since it is cloud-based and needs only a computer connected to the Internet; (ii) ease of use, with a high degree of interaction with the user; (iii) no required investment in any specific hardware or software; (iv) real-time generation of reports on process condition monitoring and process capability. Thus, CEP Online offers SPC practitioners fast, efficient, and accurate SPC procedures, and becomes an important resource for those who have no access to non-free software such as SAS, SPSS, Minitab, and STATISTICA. To the best of our knowledge, CEP Online is unique with respect to its characteristics.

Revista Brasileira de Biometria, Mar 28, 2018

In the biomedical area, a critical factor is whether a classification model is accurate enough to correctly classify whether or not a patient has a certain disease. Several techniques may be used to address this situation. In this context, Bayesian networks have emerged as a practical classification technology with successful applications in many fields. At the same time, logistic regression is a widely used statistical classification method that is well documented in the literature. In the current paper we focus on investigating the predictive performance of a probabilistic network in its simplest particular case, the so-called naive Bayes network, compared to logistic regression. A systematic simulation study is performed and the procedures are illustrated on some benchmark biomedical data sets.

RBPG, Dec 18, 2012

This article presents a characterization of the basic training of teachers who teach in undergraduate courses in Statistics in Brazil. It also discusses the need for statistics personnel within these courses, based on the current state of graduate training in statistics in Brazil, contrasting the supply of graduates with the number required to fill the teaching vacancies within undergraduate courses in Statistics. The study was conducted using statistical sampling procedures, and its importance lies in strategic planning, since it indicates a real imbalance between supply and demand for graduates in Statistics within Statistics undergraduate courses in the country. Furthermore, it points to the need for a procedure of doctoral induction within the area. Otherwise, even in 2020, the deficit of graduates in Statistics to meet the current vacancies within the undergraduate courses in this field may persist.

E-tech, Oct 24, 2012

High market competitiveness has required companies to add quality to their products in pursuit of high performance, waste minimization, and a lower final product cost. In this context, the methodology of Statistical Process Control (SPC; in Portuguese, Controle Estatístico de Processo, CEP) takes shape. Companies that use SPC need specific software, whose use involves performing statistical calculations, constructing charts, and measuring process capability, among other needs. They therefore need to allocate financial resources to purchase licenses for that software, a cost that is often passed on to the final product. This work proposes the use of a free computing environment available on the World Wide Web, known as the R language or simply R software, in order to apply SPC at no cost. This article seeks to demonstrate the applicability of SPC, via the R software, within an industrial context. A real data set on the surface texture of shafts manufactured through a machining process is presented in detail.

Bolema, Apr 1, 2012

The teaching of statistical science is mandatory in practically all undergraduate programs at Brazilian universities. In addition, there are several undergraduate programs in Statistics, distributed across the various national universities. However, despite the importance of this science, the national literature contains no systematic studies aimed at characterizing the faculty responsible for teaching statistical science in the country. In this context, this article presents a description of these faculty members, particularly with regard to undergraduate programs in Statistics. The description was produced through a descriptive sample survey covering aspects of their training and scientific output, and concludes with a forecast of the demand for PhDs in Statistics needed to fill the vacancies that will open as faculty in the country's undergraduate Statistics programs retire.

Expert Systems With Applications, Sep 1, 2022

Information, Nov 26, 2020

The disease caused by the new coronavirus (COVID-19) has been plaguing the world for months, and the number of cases is growing more rapidly as the days go by. Therefore, finding a way to identify who has the causative virus is essential in order to stop its proliferation. In this paper, a complete and applied study of convolutional support vector machines is presented to classify patients infected with COVID-19 using X-ray data, comparing them with the traditional convolutional neural network (CNN). Based on the fitted models, it was possible to observe that the convolutional support vector machine with the polynomial kernel (CSVM Pol) has better predictive performance. In addition to the results obtained from real images, the behavior of the models studied was observed through simulated images, where it was possible to observe the advantages of support vector machine (SVM) models.
