HETEROGENEOUS ENSEMBLE FEATURE SELECTION: AN ENHANCEMENT APPROACH TO MACHINE LEARNING FOR PHISHING DETECTION
DOI:
https://doi.org/10.15282/ijsecs.10.1.2024.6.0124
Keywords:
Phishing detection, Cybersecurity, Machine learning, Feature selection, Ensemble
Abstract
Phishing attacks are now recognized as a global pandemic that undermines global security and sets back the global economy. A successful phishing attack (a form of cybercrime) can have devastating effects, including bankruptcy for individuals and corporations, most often through the loss of information and money. In the search for accurate defenses against phishing threats, machine learning (ML) techniques have proven well suited to the detection process. One of the most important sub-tasks in supervised ML is feature selection, which eliminates unnecessary features from the dataset without sacrificing data quality. Feature selection is a serious challenge in phishing detection and in other classification tasks: the quality of the selected attributes plays a key role in building powerful models, and poor-quality data undermines the process. This work explores the use of ensemble feature selection in data mining to select meaningful features. A novel feature selection technique for phishing detection is proposed, based on frequent, necessary, and correlated items. The Heterogeneous Ensemble Feature Selection (HEFS) framework produced a new, highly informative set of webpage features that goes beyond the common features usually used for phishing detection. Two experiments were conducted, and the results show that both the classical models and their ensemble versions performed well when evaluated on the baseline features compared with the component features. Boosted_NB recorded the highest accuracy, 0.974 (97.4%). HEFS is recommended as an efficient feature selection method for identifying correlated, frequent, and phishing-behaved features for machine learning-based detectors.
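The general idea of heterogeneous ensemble feature selection can be illustrated with a small sketch. It is not the HEFS procedure described in the paper, whose details are not given in this abstract. It only assumes that several different filter measures (here absolute Pearson correlation, mutual information, and a chi-square statistic) each rank the features, and that the rankings are combined by mean rank. The feature names, the toy data, and the choice of aggregation rule are all hypothetical.

```python
import math
from collections import Counter

# Three different filter scorers: the "heterogeneous" part of the ensemble.
def pearson_score(col, y):
    n = len(y)
    mx, my = sum(col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    vx = sum((a - mx) ** 2 for a in col)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no information
    return abs(cov) / math.sqrt(vx * vy)

def mutual_info_score(col, y):
    n = len(y)
    joint, px, py = Counter(zip(col, y)), Counter(col), Counter(y)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

def chi2_score(col, y):
    n = len(y)
    joint, px, py = Counter(zip(col, y)), Counter(col), Counter(y)
    stat = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat

def ranks(scores):
    # Rank 1 is the highest score; ties keep the original feature order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        result[i] = position
    return result

def ensemble_select(X, y, names, k):
    # Assumed aggregation rule: average the ranks given by each scorer.
    cols = list(zip(*X))
    scorers = [pearson_score, mutual_info_score, chi2_score]
    all_ranks = [ranks([score(c, y) for c in cols]) for score in scorers]
    mean_rank = [sum(r[j] for r in all_ranks) / len(scorers)
                 for j in range(len(cols))]
    best = sorted(range(len(cols)), key=lambda j: mean_rank[j])[:k]
    return [names[j] for j in best]

# Hypothetical binary webpage features; label 1 = phishing, 0 = legitimate.
names = ["has_ip_in_url", "external_form_action", "random_noise"]
X = [
    [1, 1, 1], [1, 1, 0], [1, 1, 1], [1, 0, 0],
    [0, 0, 1], [0, 0, 0], [0, 0, 1], [0, 1, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

selected = ensemble_select(X, y, names, k=2)
print(selected)
```

Mean-rank aggregation is only one option; majority voting over each scorer's top-k list or a weighted combination of scores would fit the same structure.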
License
Copyright (c) 2024 Bamidele Musiliu Olukoya, Gabriel Opeyemi Ogunleye, Patrick Olaniyi Olabisi, and Adesoye Sikiru Adegoke
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.