Feature Selection in Big Data Using an Enhancement of the Mahalanobis-Taguchi System; Case Study: Identifying Bad-Credit Clients of a Private Bank of the Islamic Republic of Iran
Related papers
Mathematics, 2020
Measuring credit risk is essential for financial institutions because there is a high risk level associated with incorrect credit decisions. The Basel II agreement recommended the use of advanced credit scoring methods in order to improve the efficiency of capital allocation. The latest Basel agreement (Basel III) states that the requirements for reserves based on risk have increased. Financial institutions currently have exhaustive datasets regarding their operations; this is a problem that can be addressed by applying a good feature selection method combined with big data techniques for data management. A comparative study of selection techniques is conducted in this work to find the selector that reduces the mean square error and requires the least execution time.
A Hybrid Approach for Feature Selection in Data Mining Modeling of Credit Scoring
2020
Recent research shows that data mining techniques can be applied across broad areas of the economy and, in particular, in the banking sector. One of the most pressing issues banks face is non-repayment of loans by the population, which is closely related to the credit scoring problem. The main goal of this paper is to show the importance of applying feature selection in data mining modeling of credit scoring. The study shows processes of data pre-processing, feature creation, and feature selection that are applicable in real-life business situations for binary classification problems, using nodes from IBM SPSS Modeler. The results show that applying a hybrid feature selection model, which yields an optimal number of features, increases credit scoring accuracy. Compared with an expert judgmental approach, the proposed hybrid model is harder to explain but shows better accuracy and more flexible factor selection, which is an advantage in fast cha...
Understanding Mahalanobis distance criterion for feature selection
2015
Distance criteria are widely applied in cluster analysis and classification techniques. One of the best known and most commonly used distance criteria is the Mahalanobis distance, introduced by P. C. Mahalanobis in 1936. This distance has been extended to problems such as detection of multivariate outliers, multivariate statistical testing, and class prediction. In class prediction problems, the researcher is usually burdened with an excess of features, where useful and useless features alike are drawn into the classification task. This paper therefore highlights the procedure for exploiting this criterion to select the best features for the subsequent classification process. Classification performance for feature subsets of the features ordered by the Mahalanobis distance criterion is included.
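As a concrete illustration of the criterion described in the abstract above, the sketch below computes the Mahalanobis distance of a point from a two-feature reference group. The data, function names, and the closed-form 2x2 covariance inverse are illustrative assumptions, not taken from the paper:

```python
# Minimal sketch: Mahalanobis distance of a point from a 2-D reference group.
# All numbers below are made up for illustration.

def mahalanobis_2d(point, group):
    """group: list of (x, y) samples defining the reference class."""
    xs = [p[0] for p in group]
    ys = [p[1] for p in group]
    n = len(group)
    mx, my = sum(xs) / n, sum(ys) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in group) / (n - 1)
    det = sxx * syy - sxy * sxy
    # inverse of the 2x2 covariance matrix
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    dx, dy = point[0] - mx, point[1] - my
    d2 = dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy
    return d2 ** 0.5

group = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.9), (1.1, 2.2), (1.0, 2.05)]
print(mahalanobis_2d((1.05, 2.05), group))  # small: the point is typical of the group
print(mahalanobis_2d((3.0, 0.5), group))    # large: the point is an outlier
```

Features can then be ranked by how much removing each one changes such distances between the classes, which is the ordering the paper evaluates.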
2022
Financial distress prediction (FDP) has been a subject of extensive and ongoing research because of its significance to both internal and external stakeholders of enterprises, including investors and creditors. Financial institutions must be able to foresee financial difficulty in order to evaluate the financial health of businesses and individuals. Data pre-processing techniques have been found to increase the efficacy of prediction models, and many studies treat feature selection as a pre-processing step before building the models. The creation of efficient feature selection algorithms is one of the main challenges facing FDP. In this study, we present a hybrid methodology for predicting financial distress using a Multi-Layer Perceptron and Genetic Algorithm (MLP_GA) model with Boruta automated feature selection. The proposed model relies on genetic-algorithm-based tuning of the crucial MLP hyperparameters, including network depth, dense-layer activation function, network width, and network optimizer, for a reliable prediction. This paper investigates how the Boruta feature selection method improves the accuracy of our MLP_GA algorithm. We assess FDP performance using samples of enterprises based in the MENA region. Resampling with k-fold evaluation is employed in the experiments. The experimental results indicate that adopting the Boruta automated feature selection method significantly enhances the prediction performance and accuracy of the FDP model.
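The core idea behind Boruta-style selection, as used in the abstract above, can be sketched in a few lines. This is not the paper's implementation: as a stand-in importance we use the absolute Pearson correlation with the target, whereas real Boruta uses random-forest importances; a feature is kept only if its importance beats the best "shadow" (randomly permuted) copy. All data below is made up.

```python
import random

random.seed(0)

def abs_corr(xs, ys):
    """Absolute Pearson correlation, used here as a stand-in importance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return abs(cov / (sx * sy))

y  = [0, 1] * 10                         # binary target (e.g. distressed or not)
f1 = [0.12, 0.88, 0.10, 0.93, 0.15, 0.85, 0.11, 0.91, 0.14, 0.87,
      0.13, 0.90, 0.09, 0.86, 0.16, 0.92, 0.10, 0.89, 0.12, 0.94]  # informative
f2 = [random.random() for _ in y]        # pure noise

features = {"f1": f1, "f2": f2}
shadow_best = 0.0
for vals in features.values():
    for _ in range(10):                  # several shadow rounds per feature
        sh = vals[:]
        random.shuffle(sh)               # same values, but link to y destroyed
        shadow_best = max(shadow_best, abs_corr(sh, y))

selected = [name for name, vals in features.items()
            if abs_corr(vals, y) > shadow_best]
print(selected)  # the informative feature should survive the shadow test
```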
Design Engineering, Parts A and B, 2005
The Mahalanobis-Taguchi System is a diagnosis and predictive method for analyzing patterns in multivariate cases. The goal of this study is to compare the ability of the Mahalanobis-Taguchi System and a neural-network to discriminate using small data sets. We examine the discriminant ability as a function of data set size using an application area where reliable data is publicly available. The study uses the Wisconsin Breast Cancer study with nine attributes and one class.
An effective credit scoring model based on feature selection approaches
— Recent finance and debt crises have made credit risk management one of the most important issues in financial research, and credit scoring is one of the most important issues in financial decision-making. Reliable credit scoring models are crucial for financial agencies evaluating credit applications and have been widely studied in machine learning and statistics. In this paper, we propose an effective credit scoring model based on feature selection approaches. Feature selection is the process of selecting a subset of relevant features, which can decrease dimensionality, shorten running time, and/or improve classification accuracy. Using the standard k-nearest-neighbors (kNN) rule as the classification algorithm, the feature selection methods are evaluated on classification tasks. Two well-known and readily available datasets, the Australian and German credit datasets, were used to test the algorithm. The results show that the feature selection approaches are superior to state-of-the-art classification algorithms in credit scoring.
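The evaluation scheme this abstract describes can be sketched as follows: score a candidate feature subset by the accuracy of a nearest-neighbour classifier (k = 1 here for brevity) under leave-one-out testing. The six-row toy dataset is illustrative, not the Australian or German data.

```python
# Score a feature subset by leave-one-out 1-NN accuracy (illustrative sketch).

def euclid(a, b, feats):
    """Euclidean distance restricted to the feature indices in feats."""
    return sum((a[f] - b[f]) ** 2 for f in feats) ** 0.5

def loo_accuracy(X, y, feats):
    """Leave-one-out 1-NN accuracy using only the features in feats."""
    correct = 0
    for i in range(len(X)):
        # nearest neighbour among all other rows
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: euclid(X[i], X[k], feats))
        correct += (y[j] == y[i])
    return correct / len(X)

# feature 0 separates the two classes; feature 1 is noise
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 9.0], [0.9, 5.5], [1.0, 0.5], [1.1, 9.5]]
y = [0, 0, 0, 1, 1, 1]

print(loo_accuracy(X, y, [0]))  # → 1.0 (informative feature alone)
print(loo_accuracy(X, y, [1]))  # → 0.0 (noise feature alone)
```

A wrapper selector simply searches over subsets `feats` to maximize this score.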
Improved Feature Selection Model for Big Data Analytics
IEEE Access, 2020
Although there are many attempts to build an optimal model for feature selection in Big Data applications, the complex nature of processing such data makes this a big challenge. The data mining process may be obstructed by the high dimensionality and complexity of huge data sets. To find the most informative features and optimize classification accuracy, feature selection constitutes a mandatory pre-processing phase that reduces dataset dimensionality, since an exhaustive search over the relevant features is time-consuming. In this paper, a new binary variant of wrapper feature selection combining grey wolf optimization and particle swarm optimization is proposed. The K-nearest neighbor classifier with a Euclidean distance metric is used to evaluate candidate solutions. A tent chaotic map helps keep the algorithm from becoming locked into local optima, and a sigmoid function converts the search space from a continuous vector to a binary one, as the feature selection problem requires. K-fold cross-validation is used to counter overfitting. Comparisons have been made with well-known algorithms, namely the particle swarm optimization (PSO) algorithm and the grey wolf optimization (GWO) algorithm. Twenty datasets are used for the experiments, and statistical analyses are conducted to confirm the performance and effectiveness of the proposed model on measures such as selected-features ratio, classification accuracy, and computation time. Across the twenty datasets, the proposed model selected a cumulative 196 out of 773 features, as opposed to 393 for GWO and 336 for PSO. Its overall accuracy is 90%, versus 81.6% and 86.8% for the other algorithms, and its total processing time over all datasets is 184.3 seconds, compared with 272 for GWO and 245.6 for PSO.
INDEX TERMS Particle swarm optimization (PSO), grey wolf optimization (GWO), data mining, big data analytics, feature selection.
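The sigmoid binarisation step mentioned in the abstract above can be sketched as follows: each continuous position value of a particle (or wolf) is mapped through a sigmoid to a selection probability, and each dimension then becomes a 0/1 "use this feature" bit. The position values and names below are hypothetical.

```python
import math
import random

random.seed(1)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def binarise(position):
    """Map a continuous position vector to a binary feature mask."""
    return [1 if random.random() < sigmoid(v) else 0 for v in position]

position = [4.0, -4.0, 0.0, 2.5]   # one continuous value per candidate feature
mask = binarise(position)
print(mask)  # strongly positive dimensions are almost certain to be selected
```

The resulting mask is what the KNN classifier evaluates in the wrapper loop.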
Feature Selection in a Credit Scoring Model
Mathematics, 2021
This paper proposes different classification algorithms - logistic regression, support vector machine, K-nearest neighbors, and random forest - in order to identify which candidates are likely to default in a credit scoring model. Three different feature selection methods are used in order to mitigate the overfitting associated with the curse of dimensionality in these classification algorithms: one filter method (Chi-squared test and correlation coefficients) and two wrapper methods (forward stepwise selection and backward stepwise selection). The performance of these three methods is discussed using two measures, the mean absolute error and the number of selected features. The methodology is applied to a valuable database from Taiwan. The results suggest that forward stepwise selection yields superior performance with each of the classification algorithms used. The conclusions obtained are related to those in the literature, and their managerial implications are analyzed.
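The forward stepwise wrapper named above can be sketched generically: greedily add whichever feature most improves the model score and stop when no addition helps. The score function below is a stand-in lookup table with hypothetical feature names and scores; in the paper, the score would come from refitting the chosen classifier.

```python
# Generic forward stepwise selection (illustrative sketch).

def forward_stepwise(features, score):
    """Greedily grow a feature list until the score stops improving."""
    selected, best = [], score(frozenset())
    while True:
        gains = [(score(frozenset(selected + [f])), f)
                 for f in features if f not in selected]
        if not gains:
            return selected
        top_score, top_f = max(gains)
        if top_score <= best:          # no candidate improves the model
            return selected
        selected.append(top_f)
        best = top_score

# toy scores: "age" helps a lot, "income" adds a little, "id" adds nothing
TABLE = {
    frozenset(): 0.50,
    frozenset({"age"}): 0.80,
    frozenset({"income"}): 0.65,
    frozenset({"id"}): 0.50,
    frozenset({"age", "income"}): 0.85,
    frozenset({"age", "id"}): 0.80,
    frozenset({"income", "id"}): 0.65,
    frozenset({"age", "income", "id"}): 0.85,
}
print(forward_stepwise(["age", "income", "id"], TABLE.__getitem__))
# → ['age', 'income']
```

Backward stepwise selection is the mirror image: start from all features and greedily drop the least useful one.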
Expert Systems with Applications, 2010
Mahalanobis-Taguchi System (MTS) is a pattern recognition method applied to classify data into two categories: "healthy" and "unhealthy", or "acceptable" and "unacceptable". MTS has found applications in a wide range of problem domains. Dimensionality reduction of the input set of attributes forms an important step in MTS. The current practice is to apply Taguchi's design of experiments (DOE) and orthogonal array (OA) method to this end, with maximization of the Signal-to-Noise (S/N) ratio forming the basis for selecting the optimal combination of variables. However, the DOE-OA method has been found inadequate for this purpose. In this research study, we propose a dimensionality reduction method that treats the problem as a feature selection exercise. The optimal combination of attributes minimizes a weighted sum of the total fractional misclassification and the percentage of the total number of variables employed. Mahalanobis distances (MDs) of the "healthy" and "unhealthy" conditions are used to compute the misclassification. The feature selection approach is formulated as a mathematical model and solved by binary particle swarm optimization (PSO). Data from an Indian foundry shop are used to test the mathematical model and the swarm heuristic. Results are compared with those of the DOE-OA method of MTS.
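The weighted-sum objective described in the abstract can be sketched as below. This is an illustrative simplification, not the paper's model: the "Mahalanobis distance" here ignores cross-covariances (a diagonal shortcut), and the data, weights, and threshold are made up.

```python
def cost(mask, healthy, unhealthy, w1=0.8, w2=0.2, threshold=2.0):
    """Weighted sum of misclassification fraction and fraction of attributes used."""
    feats = [i for i, bit in enumerate(mask) if bit]
    if not feats:
        return w1 + w2                       # worst case: nothing selected
    # per-attribute mean/std of the "healthy" reference group
    stats = []
    for i in feats:
        col = [row[i] for row in healthy]
        m = sum(col) / len(col)
        s = (sum((v - m) ** 2 for v in col) / (len(col) - 1)) ** 0.5
        stats.append((i, m, s))
    def md(row):                             # diagonal-covariance shortcut
        return (sum(((row[i] - m) / s) ** 2 for i, m, s in stats)
                / len(stats)) ** 0.5
    errors = sum(md(r) > threshold for r in healthy)      # healthy flagged bad
    errors += sum(md(r) <= threshold for r in unhealthy)  # unhealthy missed
    frac_err = errors / (len(healthy) + len(unhealthy))
    return w1 * frac_err + w2 * len(feats) / len(mask)

healthy   = [[1.0, 50], [1.1, 52], [0.9, 48], [1.0, 51]]
unhealthy = [[3.0, 50], [2.8, 49], [3.2, 51]]
# attribute 0 discriminates; attribute 1 does not
print(cost([1, 0], healthy, unhealthy) < cost([0, 1], healthy, unhealthy))  # → True
```

Binary PSO then searches over masks like `[1, 0]` to minimize this cost.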
Modified Mahalanobis Taguchi System for Imbalance Data Classification
Computational Intelligence and Neuroscience, 2017
The Mahalanobis Taguchi System (MTS) is considered one of the most promising binary classification algorithms for handling imbalanced data. Unfortunately, MTS lacks a method for determining an efficient threshold for the binary classification. In this paper, a nonlinear optimization model is formulated that minimizes the distance between the MTS Receiver Operating Characteristic (ROC) curve and the theoretical optimal point; the resulting method is named the Modified Mahalanobis Taguchi System (MMTS). To validate the MMTS classification efficacy, it has been benchmarked against Support Vector Machines (SVMs), Naive Bayes (NB), Probabilistic Mahalanobis Taguchi Systems (PTM), the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Conformal Transformation (ACT), Kernel Boundary Alignment (KBA), Hidden Naive Bayes (HNB), and other improved Naive Bayes algorithms. MMTS outperforms the benchmarked algorithms, especially when the imbalance ratio is greater than 400. A real-life case study in the manufacturing sector is us...
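The threshold rule this abstract describes can be sketched as follows: sweep candidate thresholds on a score (here a stand-in for the Mahalanobis distance), build the ROC point (FPR, TPR) for each, and keep the threshold whose point lies closest to the ideal corner (FPR = 0, TPR = 1). The scores and labels are illustrative, and the real MMTS solves this as a nonlinear optimization rather than a discrete sweep.

```python
def best_threshold(scores, labels):
    """labels: 1 = minority/'unhealthy' class, which should score high."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for t in sorted(set(scores)):
        tpr = sum(s >= t for s, l in zip(scores, labels) if l == 1) / pos
        fpr = sum(s >= t for s, l in zip(scores, labels) if l == 0) / neg
        d = (fpr ** 2 + (1 - tpr) ** 2) ** 0.5   # distance to the ideal (0, 1)
        if best is None or d < best[0]:
            best = (d, t)
    return best[1]

scores = [0.2, 0.4, 0.5, 1.8, 2.2, 3.1]   # e.g. Mahalanobis distances
labels = [0,   0,   0,   1,   1,   1  ]
print(best_threshold(scores, labels))  # → 1.8
```

Because the criterion uses the ROC point rather than raw accuracy, it stays meaningful even under the severe class imbalance the paper targets.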