Exploring the performance of feature selection method using breast cancer dataset

Predictive modeling for breast cancer based on machine learning algorithms and features selection methods

International Journal of Electrical and Computer Engineering (IJECE), 2024

Breast cancer is one of the leading causes of death among women worldwide. Early prediction of breast cancer therefore plays a crucial role, and there is a strong need for automatic, accurate tools for it. In this paper, machine learning (ML) classifiers combined with feature selection methods are used to build an intelligent tool for breast cancer prediction. The Wisconsin diagnostic breast cancer (WDBC) dataset is used to train and test the model. Classification algorithms, including support vector machine (SVM), light gradient boosting machine (LightGBM), random forest (RF), logistic regression (LR), k-nearest neighbors (k-NN), and naïve Bayes, were employed. The following performance measures were obtained for each of them: accuracy, precision, recall, F-score, Kappa, Matthews correlation coefficient (MCC), and time. The results indicate that without feature selection, LightGBM achieves the highest accuracy at 95%. With minimum redundancy maximum relevance (mRMR) feature selection (15 features), LightGBM outperforms the other classifiers, achieving an accuracy of 98%. With Pearson correlation coefficient feature selection (15 features), LightGBM also excels with a 95% accuracy rate. Lasso feature selection (5 features) produces varied results across classifiers, with logistic regression achieving the highest accuracy at 96%. These findings underscore the importance of feature selection in refining model performance and in improving breast cancer detection.
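The performance measures listed in this abstract can all be derived from a binary confusion matrix. The following is a minimal sketch using only the Python standard library; the function name `binary_metrics` and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_score = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "kappa": kappa,
        "mcc": mcc,
    }

# Hypothetical counts, chosen only to illustrate the calculation.
metrics = binary_metrics(tp=40, fp=5, fn=10, tn=45)
```

Kappa and MCC are useful alongside accuracy because both account for agreement that would occur by chance or through class imbalance.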

Influence of Feature Selection Methods on Breast Cancer Early Prediction Phase using Classification and Regression Tree

2022 International Conference on Engineering & MIS (ICEMIS)

In recent years, healthcare data has been growing exponentially. The major challenge is to predict and analyze all this data effectively. Feature selection is a solution in which a subset of informative features is selected from a high-dimensional dataset. It helps to increase accuracy and remove irrelevant features. In the medical domain, selecting important features is essential because it directly affects human health. Several filter, wrapper, and embedded feature selection techniques are examined in this study, including generic univariate select, select percentile, select k best, Pearson correlation coefficient, mutual information, Relief-F, recursive feature elimination, recursive feature elimination with cross-validation, sequential forward selection, sequential backward selection, and select-from-model. The aim is to make a healthcare prediction model, the classification and regression tree, more accurate by employing feature selection methods, so that breast cancer can be detected accurately in its early stages; the data was collected from the Sebha oncology center in the south of Libya. The performance of the classification and regression tree was noticeably enhanced when irrelevant features were eliminated. Using the optimal subset of features identified by recursive feature elimination, the model also outperforms other classification methods, namely logistic regression, naive Bayes, and k-nearest neighbors.
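Of the wrapper techniques named in this abstract, sequential forward selection is the simplest to illustrate. The sketch below is a generic greedy version, not the study's implementation; `score_fn` is a hypothetical callback that, in practice, would return something like the cross-validated accuracy of the classifier on the given feature subset, and `toy_score` is a stand-in used only to make the example runnable.

```python
def sequential_forward_selection(n_features, score_fn, k):
    """Greedily add the feature that most improves score_fn until k are chosen."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < k:
        # Try each remaining feature together with those already selected.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def toy_score(subset):
    """Hypothetical score: each feature contributes a fixed, made-up weight."""
    weights = [0.10, 0.50, 0.30, 0.05]
    return sum(weights[i] for i in subset)

chosen = sequential_forward_selection(n_features=4, score_fn=toy_score, k=2)
```

Sequential backward selection and recursive feature elimination work in the opposite direction: they start from the full feature set and remove the least useful feature at each step.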

Performance analysis of machine learning based optimized feature selection approaches for breast cancer diagnosis

2021

Healthcare systems around the world face huge challenges in responding to the rise of chronic diseases. The objective of our research study is the adaptation of data science and its approaches for the prediction of various diseases in their early stages. In this study we review recently proposed approaches, along with some of their limitations and possible solutions for future work. This study also shows the importance of finding significant features that improve the results of existing methodologies. This work aimed to build classification models such as Naïve Bayes, logistic regression, k-nearest neighbor, support vector machine, decision tree, random forest, artificial neural network, AdaBoost, XGBoost, and gradient boosting. The experimental study chooses groups of features by means of three feature selection approaches: correlation-based selection, information-gain-based selection, and sequential feature selection. Various machine learning classifiers are applied to these feature subsets, and the best feature subset is selected based on their performance. Finally, an ensemble-based max voting classifier is proposed on top of the three best-performing models. The proposed model produces enhanced performance, with an accuracy score of 99.41%.
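Max voting (also called hard voting) combines models by taking the most frequent predicted label for each sample. The following is a minimal sketch under the assumption that each model has already produced a list of class labels; the model names and predictions are made up for illustration and do not come from the study.

```python
from collections import Counter

def max_voting(*model_predictions):
    """Return the majority label per sample across several models' predictions."""
    lengths = {len(p) for p in model_predictions}
    if len(lengths) != 1:
        raise ValueError("all models must predict the same number of samples")
    combined = []
    for votes in zip(*model_predictions):
        # most_common(1) returns the label with the highest vote count;
        # on a tie, the label that was seen first wins.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from three classifiers (1 = malignant, 0 = benign).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = max_voting(model_a, model_b, model_c)
```

With an odd number of binary classifiers, as in the three-model ensemble described above, ties cannot occur.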

Breast cancer diagnosis improvement using feature selection

Advances in Intelligent Systems, 2014

The objective of this research is to investigate the effect of data randomization on computer-based feature selection for diagnosing coronary artery disease. Randomization was conducted on the Cleveland dataset because the performance value differs from one experiment to the next. To handle the different performance values produced by randomizing the dataset, the performance values are assumed to follow a Gaussian probability distribution, and the final performance is taken as the mean of all performance values. In this research, computer-based feature selection (CFS), medical-expert-based feature selection (MFS), and a combination of both (MFS+CFS) are also conducted to improve the performance of the classification algorithm. This research also found a characteristic of the Cleveland dataset that differs from previous work; this difference can affect the feature selection result and the final performance. In summary, randomizing the dataset and computing the final performance in this way can generally represent the performance of the classification algorithm.
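The procedure of repeating an experiment on reshuffled data and reporting the mean performance can be sketched as follows. This is an illustration only: the function names, the 30% test fraction, and the majority-class baseline are assumptions introduced here, not details from the abstract.

```python
import random
import statistics

def repeated_random_evaluation(samples, evaluate, n_runs=10, test_fraction=0.3, seed=0):
    """Shuffle, split, and evaluate n_runs times; return mean and std of the scores."""
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    scores = []
    for _ in range(n_runs):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(evaluate(train, test))
    return statistics.mean(scores), statistics.pstdev(scores)

def majority_baseline(train, test):
    """Hypothetical evaluator: predict the most frequent training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)

# Made-up (feature, label) pairs used only to exercise the function.
data = [(i, 1 if i % 3 else 0) for i in range(30)]
mean_acc, std_acc = repeated_random_evaluation(data, majority_baseline)
```

Reporting the standard deviation next to the mean indicates how much a single random split could differ from the averaged result.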

A Novel Feature Selection Method for Effective Breast Cancer Diagnosis and Prognosis

A major area of current research in data mining is the field of medical diagnosis. In the present study, using the Breast Cancer Wisconsin data sets, a feature selection algorithm, Modified Correlation Rough Set Feature Selection (MCRSFS), is applied to both diagnosis and prognosis prediction, and several data mining classification algorithms are compared. In the proposed approach, level 1 of feature selection selects features based on rough sets with different starting values of the reduct. Level 2 selects features from the reduced set based on Correlation Feature Selection (CFS). Experiments show that the proposed method is effective compared with others in terms of the number of selected features and classification performance.

Comparative Study of Different Machine Learning Classifiers Using Multiple Feature Selection Techniques for Breast Cancer Classification

International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2022

This research investigates the use of several machine learning classifiers under four feature selection settings: without dimensionality reduction, using Correlation Coefficient Score, using Voting Classifier, and using Tree Based Feature Selection. The ML classifiers used in this research are Logistic Regression, Decision Trees, Support Vector Machine (SVM), Random Forest, K-Nearest Neighbours (KNN), and Naïve Bayes Classifier. These classification models are run on data generated by processing mammography scans to extract shape, texture, size, and other spatial features from the tumour contour. The performance of these ML classifiers is evaluated with the following metrics: Precision Score, Recall Score, F1 Score, and Accuracy Score. The dataset used for both training and testing in our study was the Wisconsin Breast Cancer Dataset. Comparing these results helps us better understand the nature of these classifiers for such classification problems and gives us more insight into feature engineering and selection and their potential use in clinical trials. After computing the results, we obtained accuracy levels as high as 97.9% and generally reached accuracy between 90% and 95%.
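One common reading of correlation-coefficient-based selection is to rank features by the absolute Pearson correlation between each feature and the class label and keep the top k. The sketch below follows that reading as an assumption; the study may use the score differently (for example, to drop features that are highly correlated with each other). The feature names and values are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0  # constant columns get correlation 0

def top_k_by_correlation(columns, labels, k):
    """Return the names of the k features most correlated (in magnitude) with labels."""
    ranked = sorted(columns, key=lambda name: -abs(pearson(columns[name], labels)))
    return ranked[:k]

# Hypothetical feature columns and binary labels.
columns = {
    "radius": [1, 2, 3, 4],
    "noise": [5, 1, 4, 2],
    "texture": [4, 3, 2, 1],
}
labels = [0, 0, 1, 1]
selected = top_k_by_correlation(columns, labels, k=2)
```

Using the absolute value keeps strongly negatively correlated features, which are as informative for a classifier as positively correlated ones.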

Improving the performance of machine learning classifiers for Breast Cancer diagnosis based on feature selection

This paper proposes a comprehensive algorithm for building machine learning classifiers for breast cancer diagnosis based on a suitable combination of feature selection methods that provides high performance in terms of the area under the receiver operating characteristic curve (AUC). The newly developed method allows both for exploring and ranking search spaces of image-based features, and for selecting subsets of optimal features to feed machine learning classifiers (MLCs). The method was evaluated using six mammography-based datasets (containing calcification and mass lesions) with different configurations extracted from two public breast cancer databases. According to the Wilcoxon statistical test, the proposed method was shown to provide competitive breast cancer classification schemes while reducing the number of features employed for each experimental dataset.

OPTIMUM FEATURE SELECTION BASED BREAST CANCER PREDICTION USING MODIFIED LOGISTIC REGRESSION MODEL

JATIT, 2023

Patients with breast cancer are more likely to experience severe health issues and have a higher mortality rate. Breast cancer (BC) is one of the main causes of cancer-related deaths in women. Early diagnosis of breast cancer enables patients to obtain proper care, enhancing their chance of survival. The main explanation could be that different breast densities and technical imaging quality issues cause radiologists to misinterpret concerning lesions, increasing the false-positive and false-negative ratios. In this work, a new optimum feature selection-based model is developed to efficiently predict breast cancer using a modified logistic regression model. Our proposed model consists of two phases: a) feature selection and b) prediction. As a first step, the dataset is preprocessed to find missing values and remove unwanted noise, outliers, and so on. In this research work, a first dataset with 568 records and 30 features and a second dataset with 952 records and 26 features are considered for diagnosis and analysis. To select features from the dataset's N features, an improved grey wolf population algorithm is used. Hence, 26 sets of features are selected for further processing. Our proposed model performed well on both datasets, with 92.9% and 93.38% accuracy for the first and second datasets, respectively. The novelty of this research work is to provide the best accuracy in disease diagnosis and prediction by selecting an optimum set of meaningful features.

Detection of Breast Cancer Through Clinical Data Using Supervised and Unsupervised Feature Selection Techniques

IEEE Access, 2021

Breast cancer is one of the most critical diseases and affects many people around the world. Efficient and correct detection of breast cancer is still needed; although researchers around the world have proposed different diagnostic methods for detecting this disease, these existing methods still need further improvement to detect it correctly and efficiently. In this study, we propose a new breast cancer identification method using machine learning algorithms and clinical data. In the proposed method, supervised (Relief algorithm) and unsupervised (Autoencoder, PCA algorithms) techniques are used to select relevant features from the data set, and the selected features are then used to train and test a support vector machine classifier for accurate and timely detection of breast cancer. Additionally, the k-fold cross-validation method is used for model validation and selection of the best hyperparameters. Model performance evaluation metrics are used to evaluate the models, and the BC data sets are used to test the proposed method. The analysis of experimental results demonstrates that the features selected by the Relief algorithm are more relevant for accurate detection of breast cancer than the features selected by the Autoencoder and PCA algorithms. The proposed method attained high accuracy on the features selected by the Relief algorithm, achieving 99.91% accuracy. We employed McNemar's statistical test to compare the performance of our different models. Further, the performance of the proposed method was compared with baseline methods in the literature and found to be higher. Due to the high performance of the proposed method (Relief-support vector machine), we highly recommend it for the diagnosis of breast cancer. In addition, the proposed method can be easily incorporated into the healthcare system for reliable diagnosis of breast cancer.
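McNemar's test compares two classifiers evaluated on the same test samples by looking only at the samples on which they disagree. The sketch below uses the continuity-corrected chi-square form with one degree of freedom; whether the authors used this variant or the exact binomial form is not stated, so this is an assumption, and the disagreement counts are hypothetical.

```python
import math

def mcnemar_test(b, c):
    """Continuity-corrected McNemar test.

    b: samples model A classified correctly and model B incorrectly.
    c: samples model A classified incorrectly and model B correctly.
    Returns (statistic, p_value).
    """
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree, so there is no evidence of a difference
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical disagreement counts between two models.
stat, p = mcnemar_test(b=10, c=2)
```

A small p-value suggests that the two models' error rates differ by more than chance would explain; for small disagreement counts, the exact binomial form is generally preferred.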

COMPARATIVE STUDY ON DIFFERENT CLASSIFICATION TECHNIQUES FOR BREAST CANCER DATASET

Breast cancer is one of the most common cancers among women in the world. Early detection of breast cancer is essential in reducing loss of life. Data mining is the process of analyzing massive data and summarizing it into useful knowledge. The role of data mining approaches is growing rapidly; classification techniques in particular are a very effective way of classifying data, which is essential in the decision-making process of medical practitioners. This study applies different data mining classifiers to a breast cancer database and compares classification accuracy with and without feature selection techniques. Feature selection increases the accuracy of the classifier because it eliminates irrelevant attributes. The experiment shows that feature selection enhances the accuracy of all three classifiers, reduces the Mean Standard Error (MSE), and increases the Receiver Operating Characteristics (ROC).