A comparison study of machine learning techniques for phishing detection

Machine Learning-Based Phishing Attack Detection

International Journal of Advanced Computer Science and Applications, 2020

This paper explores machine learning techniques and evaluates their performance when trained on datasets whose features can differentiate a phishing website from a safe one. The ability to tell these sites apart is vital in modern-day internet surfing. As more and more of our resources shift online, a single vulnerability or leak of sensitive information could bring down everything in a connected network. The objective of this research is to highlight the best technique for identifying one of the most commonly occurring cyberattacks and thus allow faster identification and blacklisting of such sites, leading to a safer and more secure web surfing experience for everyone. To achieve this, we describe each of the techniques we examine in detail and use different evaluation techniques to portray their performance visually. After comparing these techniques against each other, we conclude that the Random Forest Classifier works best for phishing website detection.

Impact of Current Phishing Strategies in Machine Learning Models for Phishing Detection

13th International Conference on Computational Intelligence in Security for Information Systems (CISIS 2020), 2020

Phishing is one of the most widespread attacks based on social engineering. The detection of phishing using machine learning approaches is more robust than blacklist-based detection, which needs regular reports and updates. However, current supervised learning approaches also have drawbacks, such as the datasets used to create the models. These datasets contain only the landing pages of legitimate domains and do not include the login forms of the websites, which is the most common situation in a real case of phishing. As we show in this work, when a model is trained with old datasets, its performance on today's phishing attacks decreases meaningfully, especially when it is tested using login pages. In this paper, we demonstrate that a machine learning model trained with datasets collected some years ago can have high performance when tested with the same outdated datasets, but its performance decreases notably with current datasets, using the same features in both cases. We also demonstrate that, among the commonly applied machine learning algorithms, SVM is the most resilient to the new strategies used by current phishing attacks. To prove these statements, we created a new dataset, the Phishing Index Login URL dataset (PILU-60K), containing 60K URLs from legitimate index and login URLs, together with phishing samples. We evaluated several machine learning methods with the known datasets PWD2016 and Ebbu2017, and also with two subsets of PILU, PIU-40K and PLU-40K, which contain only index pages and only login pages respectively, showing that the accuracy decreases remarkably. We also found that Random Forest is the recommended approach among all the evaluated methods with the newly created dataset.

Prediction of phishing websites using machine learning

Spatial Informing Research, 2022

With the growing popularity of information science, more applications are being integrated with websites that can be accessed directly through the internet. This has increased the possibility of attacks by malicious persons seeking to steal personal information. Several strategies have been presented to identify a phishing attack. However, there is still room for progress in the fight against phishing. The objective of this research paper is to develop a more accurate prediction model using Decision Tree (DT), Random Forest (RF) and Gradient Boosting Classifier (GBC) with three feature selection techniques: Extra Tree (ET), Chi-Square and Recursive Feature Elimination (RFE). Since the phishing websites dataset contains 89 features, we applied the extra tree and chi-square feature selection methods to identify a limited set of important features, and then used recursive feature elimination to reduce the dataset to an optimal set of important features. We compared the performance of the developed models and found the best prediction performance with GBC, followed by RF and DT. These algorithmic models capture the trends from various cases of phishing with over R-square, Root Mean Square Error (RMSE), and Mean Absolute Error (MAE), in each case.
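As a rough illustration of the chi-square feature selection step mentioned in the abstract above, the sketch below scores categorical features with the classic Pearson chi-square statistic and ranks them. It is not the authors' implementation: the function names, the feature names (`has_ip_in_url`, `uses_https`) and the toy data are all hypothetical, and library implementations may compute the score differently.

```python
from collections import Counter

def chi_square_score(feature, labels):
    """Pearson chi-square statistic for one categorical feature vs. the class label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))  # counts per (feature value, label) cell
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    score = 0.0
    for f, f_count in feature_totals.items():
        for y, y_count in label_totals.items():
            # Expected cell count if the feature and the label were independent.
            expected = f_count * y_count / n
            score += (observed[(f, y)] - expected) ** 2 / expected
    return score

def rank_features(columns, labels):
    """Return feature names sorted from most to least label-dependent."""
    scores = {name: chi_square_score(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy data: 1 = phishing, 0 = legitimate.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "has_ip_in_url": [1, 1, 1, 0, 0, 0, 0, 0],  # mostly present on phishing rows
    "uses_https": [1, 0, 1, 0, 1, 0, 1, 0],     # evenly spread across both classes
}

print(rank_features(columns, labels))
```

A higher score means the feature's distribution differs more between classes, so a selection step would keep the top-ranked features and drop the rest before any further elimination.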

DETECTION OF PHISHING ATTACKS USING MACHINE LEARNING TECHNIQUES

International Research Journal of Modernization in Engineering Technology and Science, 2024

The study evaluated several machine learning techniques for detecting phishing attacks, including Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). Two datasets were used: one from PhishTank and another from the UCI machine learning repository. Results showed that the Random Forest model achieved the highest accuracy across multiple metrics. On the PhishTank dataset, RF had the best K-fold cross-validation accuracy at 99.55%, feature selection accuracy at 99.00%, and hyperparameter tuning accuracy at 99.45%. The XGBoost model performed well too, with 99.16% K-fold accuracy on PhishTank. On the UCI dataset, XGBoost had the highest K-fold accuracy at 97.16%, while RF still demonstrated maximum accuracy for feature selection and hyperparameter tuning. Logistic Regression consistently showed the lowest accuracy across datasets and metrics. The proposed approach was validated against other researchers' work on PhishTank, achieving 98.80% accuracy, which compared favorably. ROC curves further illustrated the strong performance, especially for the top-performing models. The study demonstrated that using selected features and hyperparameter tuning could enhance detection accuracy. The machine learning algorithms, particularly Random Forest, outperformed other state-of-the-art techniques in accurately identifying phishing attacks. The high accuracy metrics indicate the proposed framework's effectiveness in detecting phishing attempts.
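The K-fold cross-validation accuracy reported in the abstract above can be sketched as follows. This is a minimal, assumption-laden example: the one-dimensional toy feature, the midpoint-threshold classifier and all function names are hypothetical stand-ins, not the models or data used in the study, and a real evaluation would normally shuffle or stratify the folds.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_val_accuracy(X, y, k, fit, predict):
    """Mean accuracy over k folds; each fold is held out for testing exactly once."""
    accuracies = []
    for train, test in k_fold_splits(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Hypothetical stand-in classifier: threshold halfway between the two class means.
def fit_midpoint(xs, ys):
    pos = [x for x, label in zip(xs, ys) if label == 1]
    neg = [x for x, label in zip(xs, ys) if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_midpoint(threshold, x):
    return 1 if x > threshold else 0

# Hypothetical toy data: one suspiciousness score per URL, 1 = phishing.
X = [0.1, 0.9, 0.2, 0.8, 0.15, 0.85, 0.05, 0.95, 0.3, 0.7]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

print(cross_val_accuracy(X, y, k=5, fit=fit_midpoint, predict=predict_midpoint))
```

Averaging over folds gives a less optimistic estimate than a single train/test split, which is why accuracy figures like those above are usually reported per evaluation protocol.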

MACHINE LEARNING TECHNIQUES FOR IDENTIFYING AND MITIGATING PHISHING ATTACKS

IAEME PUBLICATION, 2024

One of the most prevalent forms of social engineering, phishing attempts to fraudulently obtain sensitive information from users' email accounts. Such attempts can be integrated into larger-scale attacks aimed at penetrating government or business networks. To identify and lessen the impact of these attacks, several anti-phishing methods have been suggested within the past ten years. Nevertheless, they continue to be inaccurate and inefficient. Many different channels can be used for phishing, including email, phone, instant messaging, advertisements, website pop-up windows, and DNS poisoning. Significant damage, such as the disclosure of sensitive information, theft of personal or company identities, or even state secrets, can be inflicted upon victims of phishing attempts. This essay aims to evaluate these attacks by looking at how phishing is done now and how it is currently perceived. This article presents a new, comprehensive model of phishing that considers various aspects of attacks, including stages, threats, targets, media, and tactics. Here, we use machine learning methods like Logistic Regression, Random Forest, and XGBoost to classify websites as either legitimate or phishing. In addition to helping readers understand the lifecycle of a phishing attack, the proposed anatomy will make people more aware of these attacks, the tactics used, and how to build a thorough anti-phishing system.

Detection of Phishing Websites using Machine Learning Techniques

IJCSIS Vol 18 No. 7 July Issue, 2020

Abstract: With the growing interaction between the Internet and public life, the Internet is changing how people learn and work, but it also exposes us to increasingly serious security threats. How to recognize different network attacks, especially attacks not seen previously, is a key issue that must be solved urgently. The aim of phishing website URLs is to gather personal data such as user names, passwords and online banking transactions. Phishers use sites that are visually and semantically similar to genuine sites. Since a large portion of users go online to access the services provided by government and financial institutions, there has been a significant increase in phishing attacks in the last few years. Machine learning is a useful tool for countering phishing attacks. There are several strategies or approaches for identifying phishing sites. The main aim of this paper is to implement the system with high efficiency and accuracy, and cost-effectively. The project is implemented using four supervised ML classification models: K-Nearest Neighbor, Kernel Support Vector Machine, Decision Tree and Random Forest classifier. It was found that the Random Forest classifier is the most accurate for the chosen dataset and gives an accuracy score of 96.82%. Keywords: Machine Learning, classification, Cyber security, Phishing, KNN, Kernel SVM, Decision Tree, Random Forest Classifier
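To make one of the classifiers named in the abstract above concrete, here is a minimal K-Nearest Neighbor sketch in plain Python. The two-dimensional feature vectors and labels are invented toy data, and the function name `knn_predict` is hypothetical; the paper's own features, distance metric and value of k are not stated in the abstract.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query and keep the closest k.
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric features per website.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["legitimate", "legitimate", "legitimate", "phishing", "phishing", "phishing"]

print(knn_predict(train_X, train_y, (0.5, 0.5)))
print(knn_predict(train_X, train_y, (5.5, 5.5)))
```

KNN stores the training set and defers all work to prediction time, so in practice features are usually scaled first so that no single feature dominates the distance.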

Machine Learning-Based Phishing Detection

IRJET, 2023

Millions of users have been successfully connected globally by the internet today, and as a result, users' reliance on this platform for data browsing, online transactions, and information downloads has grown. Cybersecurity is a term for a collection of technologies and procedures used to safeguard software and hardware against intrusion, harm, and attacks. DoS attacks, Man-in-the-Middle attacks, Phishing attacks, SQL Injection attacks, etc. are some of the most often seen cybersecurity threats. There has been an uptick in consumers losing access to their very sensitive and private information over the past few years. These days, fraudsters utilise such methods to trick their victims in an effort to steal personal information including their username, password, bank account information, and credit card information. Attacks against users are frequently delivered via spoofing emails, illegal websites, malware, etc. To handle complicated and massive amounts of data, a structured automated technique is necessary. The most common and effective approach that can be used to address this issue is machine learning, according to research. The most widely used machine learning methods include neural networks, decision trees, logistic regression, and support vector machines (SVM). A group of deep learning and machine learning models will be trained in this study to identify phishing websites.

Phishing Detection with Machine Learning

International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2022

The goal of our project is to implement a machine learning solution to the problem of detecting phishing and malicious web links. The end result of our project will be a software product which uses a machine learning algorithm to detect malicious URLs. Phishing is the technique of extracting user credentials and sensitive data from users by masquerading as a genuine website. In phishing, the user is provided with a mirror website which is identical to the legitimate one but with malicious code to extract and send user credentials to phishers. Phishing attacks can lead to huge financial losses for customers of banking and financial services. The traditional approach to phishing detection has been either to use a blacklist of known phishing links or to heuristically evaluate the attributes of a suspected phishing page to detect the presence of malicious code. The heuristic function relies on trial and error to define the threshold, which is used to separate malicious links from benign ones. The drawback of this approach is poor accuracy and low adaptability to new phishing links. We plan to use machine learning to overcome these drawbacks by implementing some classification algorithms and comparing the performance of these algorithms on our dataset. We will test algorithms such as Logistic Regression, SVM, Decision Trees and Neural Networks on a dataset of phishing links from the UCI Machine Learning repository and pick the best model to develop a browser plugin, which can be published as a browser extension.
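The heuristic baseline described in the abstract above (hand-picked attributes compared against a trial-and-error threshold) might look roughly like the sketch below. Everything specific here is an assumption for illustration: the feature names, the equal weighting, the threshold of 2, and the example URLs are hypothetical and are not taken from the project.

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """Extract a few simple lexical attributes from a URL (hypothetical feature set)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": int("@" in url),
        "host_is_ip": int(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host) is not None),
        "uses_https": int(parsed.scheme == "https"),
    }

def heuristic_is_phishing(url, threshold=2):
    """Flag a URL when enough suspicious attributes are present.

    The threshold is hand-tuned by trial and error, which is the weakness
    the abstract attributes to heuristic approaches.
    """
    f = url_features(url)
    score = (
        f["has_at_symbol"]
        + f["host_is_ip"]
        + int(f["num_dots_in_host"] > 3)
        + int(not f["uses_https"])
    )
    return score >= threshold

print(heuristic_is_phishing("https://www.example.com/"))
print(heuristic_is_phishing("http://198.51.100.7/login"))
```

A learned classifier would consume the same kind of feature dictionary but fit the weights and decision boundary from labeled data instead of relying on a fixed, manually chosen threshold.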

Phish Catch: Machine Learning Way of Detecting Phishing Websites

2020

With the advent of 4G technology, the internet became available to the masses. Everyone started to use internet services in different spheres of their life, making them vulnerable to diverse threats. One of the primary risks for internet users is phishing websites. Instead of breaching the security of systems, phishing websites try to fool users into giving away credentials that they are not supposed to share with anyone. In this study, we took 21 features and tried to predict their class, i.e., legitimate or phish, using a supervised learning algorithm. Index Terms: Phishing, Machine Learning, SVM, Decision Tree, Random Forest, Internet, Security

Phishing Attacks and Websites Classification Using Machine Learning and Multiple Datasets (A Comparative Analysis)

Intelligent Computing Methodologies, 2020

Phishing attacks are the most common type of cyber-attack used to obtain sensitive information and have been affecting individuals as well as organizations across the globe. Various techniques have been proposed to identify phishing attacks, specifically the deployment of machine intelligence in recent years. However, the deployed algorithms and discriminating factors are very diverse in existing works. In this study, we present a comprehensive analysis of various machine learning algorithms to evaluate their performance over multiple datasets. We further investigate the most significant features within multiple datasets and compare the classification performance with the reduced-dimensional datasets. The statistical results indicate that random forest and artificial neural network outperform other classification algorithms, achieving over 97% accuracy using the identified features.