OPTIMIZING SUPPORT VECTOR MACHINE FOR IMBALANCED DATASETS BY COMBINING POSTERIOR PROBABILITY AND CORRELATION METHODS

Authors

  • Canggih Ajika Pamungkas, Malaysian Institute of Information Technology, Universiti Kuala Lumpur; Politeknik Indonusa Surakarta, Jalan KH. Samanhudi No. 31, Surakarta, Indonesia
  • Megat Farez Azril, Malaysian Institute of Information Technology, Universiti Kuala Lumpur

DOI:

https://doi.org/10.15282/ijsecs.11.1.2025.2.0134

Keywords:

Supervised classification, Imbalanced dataset, Posterior Probability, Correlation

Abstract

The challenge of classifying imbalanced data persists in machine learning, particularly in critical applications such as medical diagnosis, fraud detection, and anomaly identification, where detecting the minority class is essential. Conventional classifiers like Support Vector Machine (SVM) tend to favor the majority class, leading to reduced sensitivity in identifying minority instances. This study introduces Posterior Probability and Correlation-Support Vector Machine (PC-SVM), a novel approach that integrates posterior probability estimation with correlation analysis to enhance SVM’s performance on imbalanced datasets. Unlike traditional SVM models, which struggle with class imbalance and require additional data balancing techniques, PC-SVM dynamically adjusts classification thresholds using posterior probability values and correlation-weighted features, simplifying the classification process while improving its effectiveness. The effectiveness of PC-SVM was evaluated using multiple imbalanced datasets from KEEL, UCI, and Kaggle repositories. Results demonstrate that PC-SVM achieves 100% recall for the minority class, significantly outperforming traditional SVM, which attained only 80% recall on average. This 20% improvement in recall underscores PC-SVM’s ability to mitigate the imbalance issue without relying on oversampling or cost-sensitive adjustments. Furthermore, PC-SVM exhibits consistent performance across various evaluation metrics, including accuracy, precision, recall, and F1-score, ensuring robust classification results. By improving the detection of minority classes, PC-SVM offers a transformative solution for real-world applications that demand high sensitivity in identifying rare but crucial instances. Its ability to maintain classification integrity without additional balancing techniques positions it as a valuable model for industries such as healthcare, finance, and cybersecurity, where accurate minority class recognition is critical.
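The two ideas the abstract combines can be illustrated with a minimal sketch (not the authors' implementation): features are weighted by their absolute Pearson correlation with the class label, an SVM with probability estimates is fitted on the weighted features, and the minority class is then flagged whenever its estimated posterior exceeds a threshold lowered below the default 0.5. The threshold value and the weighting scheme here are illustrative assumptions; the paper adjusts the threshold dynamically.

```python
# Illustrative sketch of posterior-probability thresholding with
# correlation-weighted features, in the spirit of PC-SVM (assumptions noted above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced toy data: roughly 90% majority class, 10% minority class.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Correlation weighting: scale each feature by the absolute Pearson
# correlation between that feature and the class label (training set only).
w = np.array([abs(np.corrcoef(X_tr[:, j], y_tr)[0, 1])
              for j in range(X_tr.shape[1])])

svm = SVC(probability=True, random_state=0).fit(X_tr * w, y_tr)
posterior = svm.predict_proba(X_te * w)[:, 1]

# Posterior-probability thresholding: a threshold below 0.5 trades some
# precision for higher minority-class recall.
threshold = 0.3  # hypothetical fixed value for illustration
y_pred = (posterior >= threshold).astype(int)

minority_recall = (y_pred[y_te == 1] == 1).mean()
default_recall = ((posterior >= 0.5).astype(int)[y_te == 1] == 1).mean()
```

Because every instance accepted at the 0.5 cut-off is also accepted at 0.3, `minority_recall` can never fall below `default_recall`; the cost is paid in precision, which is why the paper evaluates across accuracy, precision, recall, and F1 jointly.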

References

[1] J. Alcaraz, M. Labbé, and M. Landete, “Support Vector Machine with feature selection: A multiobjective approach,” Expert Syst. Appl., vol. 204, p. 117485, 2022, doi: 10.1016/j.eswa.2022.117485.

[2] J. Liu, “Fuzzy support vector machine for imbalanced data with borderline noise,” Fuzzy Sets Syst., vol. 1, pp. 1–10, 2020, doi: 10.1016/j.fss.2020.07.018.

[3] C. Wu, N. Wang, and Y. Wang, “Increasing Minority Recall Support Vector Machine Model for Imbalanced Data Classification,” Discret. Dyn. Nat. Soc., vol. 2021, 2021, doi: 10.1155/2021/6647557.

[4] H. Liu, Z. Liu, W. Jia, D. Zhang, and J. Tan, “A Novel Imbalanced Data Classification Method Based on Weakly Supervised Learning for Fault Diagnosis,” IEEE Trans. Ind. Informatics, vol. 18, no. 3, pp. 1583–1593, 2022, doi: 10.1109/TII.2021.3084132.

[5] S. Shaikh, S. M. Daudpota, A. S. Imran, and Z. Kastrati, “Towards improved classification accuracy on highly imbalanced text dataset using deep neural language models,” Appl. Sci., vol. 11, no. 2, pp. 1–20, 2021, doi: 10.3390/app11020869.

[6] R. A. Hamad, M. Kimura, and J. Lundström, “Efficacy of Imbalanced Data Handling Methods on Deep Learning for Smart Homes Environments,” SN Comput. Sci., vol. 1, no. 4, pp. 1–10, 2020, doi: 10.1007/s42979-020-00211-1.

[7] V. Rupapara, F. Rustam, H. F. Shahzad, A. Mehmood, I. Ashraf, and G. S. Choi, “Impact of SMOTE on Imbalanced Text Features for Toxic Comments Classification Using RVVC Model,” IEEE Access, vol. 9, pp. 78621–78634, 2021, doi: 10.1109/ACCESS.2021.3083638.

[8] H. Qin, H. Zhou, and J. Cao, “Imbalanced learning algorithm based intelligent abnormal electricity consumption detection,” Neurocomputing, vol. 402, pp. 112–123, 2020, doi: 10.1016/j.neucom.2020.03.085.

[9] S. S. Mullick, S. Datta, S. G. Dhekane, and S. Das, “Appropriateness of performance indices for imbalanced data classification: An analysis,” Pattern Recognit., vol. 102, p. 107197, 2020, doi: 10.1016/j.patcog.2020.107197.

[10] H. Shamsudin, U. K. Yusof, A. Jayalakshmi, and M. N. Akmal Khalid, “Combining oversampling and undersampling techniques for imbalanced classification: A comparative study using credit card fraudulent transaction dataset,” in Proc. IEEE Int. Conf. Control Autom. (ICCA), 2020, pp. 803–808, doi: 10.1109/ICCA51439.2020.9264517.

[11] K. H. Kim and S. Y. Sohn, “Hybrid neural network with cost-sensitive support vector machine for class-imbalanced multimodal data,” Neural Networks, vol. 130, pp. 176–184, 2020, doi: 10.1016/j.neunet.2020.06.026.

[12] X. Tao et al., “Affinity and class probability-based fuzzy support vector machine for imbalanced data sets,” Neural Networks, vol. 122, pp. 289–307, 2020, doi: 10.1016/j.neunet.2019.10.016.

[13] C. Jimenez-Castaño, A. Alvarez-Meza, and A. Orozco-Gutierrez, “Enhanced automatic twin support vector machine for imbalanced data classification,” Pattern Recognit., vol. 107, 2020, doi: 10.1016/j.patcog.2020.107442.

[14] R. Abo Zidan and G. Karraz, “Gaussian Pyramid for Nonlinear Support Vector Machine,” Appl. Comput. Intell. Soft Comput., vol. 2022, 2022, doi: 10.1155/2022/5255346.

[15] Y. S. Solanki et al., “A hybrid supervised machine learning classifier system for breast cancer prognosis using feature selection and data imbalance handling approaches,” Electron., vol. 10, no. 6, pp. 1–16, 2021, doi: 10.3390/electronics10060699.

[16] R. Kumar R et al., “Investigation of nano composite heat exchanger annular pipeline flow using CFD analysis for crude oil and water characteristics,” Case Stud. Therm. Eng., vol. 49, p. 104908, 2023, doi: 10.1016/j.csite.2023.103297.

[17] B. Richhariya and M. Tanveer, “A reduced universum twin support vector machine for class imbalance learning,” Pattern Recognit., vol. 102, p. 107150, 2020, doi: 10.1016/j.patcog.2019.107150.

[18] M. Li, A. Xiong, L. Wang, S. Deng, and J. Ye, “ACO Resampling: Enhancing the performance of oversampling methods for class imbalance classification,” Knowledge-Based Syst., vol. 196, p. 105818, 2020, doi: 10.1016/j.knosys.2020.105818.

[19] A. S. Hussein, T. Li, C. W. Yohannese, and K. Bashir, “A-SMOTE: A new preprocessing approach for highly imbalanced datasets by improving SMOTE,” Int. J. Comput. Intell. Syst., vol. 12, no. 2, pp. 1412–1422, 2019, doi: 10.2991/ijcis.d.191114.002.

[20] P. Gnip, L. Vokorokos, and P. Drotár, “Selective oversampling approach for strongly imbalanced data,” PeerJ Comput. Sci., vol. 7, pp. 1–22, 2021, doi: 10.7717/PEERJ-CS.604.

[21] X. W. Liang, A. P. Jiang, T. Li, Y. Y. Xue, and G. T. Wang, “LR-SMOTE — An improved unbalanced data set oversampling based on K-means and SVM,” Knowledge-Based Syst., vol. 196, 2020, doi: 10.1016/j.knosys.2020.105845.

[22] A. S. Desuky and S. Hussain, “An Improved Hybrid Approach for Handling Class Imbalance Problem,” Arab. J. Sci. Eng., vol. 46, no. 4, pp. 3853–3864, 2021, doi: 10.1007/s13369-021-05347-7.

[23] K. Qi, H. Yang, Q. Hu, and D. Yang, “A new adaptive weighted imbalanced data classifier via improved support vector machines with high-dimension nature,” Knowledge-Based Syst., vol. 185, p. 104933, 2019, doi: 10.1016/j.knosys.2019.104933.

[24] H. Shamsudin, U. K. Yusof, Y. Haijie, and I. S. Isa, “An optimized support vector machine with genetic algorithm for imbalanced data classification,” J. Teknol., vol. 85, no. 4, pp. 67–74, 2023, doi: 10.11113/jurnalteknologi.v85.19695.

[25] Y. Park and J. S. Lee, “A Learning Objective Controllable Sphere-Based Method for Balanced and Imbalanced Data Classification,” IEEE Access, vol. 9, pp. 158010–158026, 2021, doi: 10.1109/ACCESS.2021.3130272.

[26] H. Patel, D. Singh Rajput, G. Thippa Reddy, C. Iwendi, A. Kashif Bashir, and O. Jo, “A review on classification of imbalanced data for wireless sensor networks,” Int. J. Distrib. Sens. Networks, vol. 16, no. 4, 2020, doi: 10.1177/1550147720916404.

[27] S. Strasser and M. Klettke, “Transparent Data Preprocessing for Machine Learning,” HILDA 2024 - Work. Human-In-the-Loop Data Anal. Co-located with SIGMOD 2024, 2024, doi: 10.1145/3665939.3665960.

[28] J. Nalic and A. Svraka, “Importance of data pre-processing in credit scoring models based on data mining approaches,” in Proc. 41st Int. Conv. Inf. Commun. Technol. Electron. Microelectron. (MIPRO), 2018, pp. 1046–1051, doi: 10.23919/MIPRO.2018.8400191.

[29] H. F. Tayeb, M. Karabatak, and C. Varol, “Time Series Database Preprocessing for Data Mining Using Python,” 8th Int. Symp. Digit. Forensics Secur. ISDFS 2020, pp. 20–23, 2020, doi: 10.1109/ISDFS49300.2020.9116260.

[30] S. Albahra et al., “Artificial intelligence and machine learning overview in pathology & laboratory medicine: A general review of data preprocessing and basic supervised concepts,” Semin. Diagn. Pathol., vol. 40, no. 2, pp. 71–87, 2023, doi: 10.1053/j.semdp.2023.02.002.

[31] Z. Liu, “Research on data preprocessing method for artificial intelligence algorithm based on user online behavior,” J. Comput. Electron. Inf. Manag., vol. 12, no. 3, pp. 74–78, 2024, doi: 10.54097/qf6fv8j1.

[32] A. Q. Md, S. Kulkarni, C. J. Joshua, T. Vaichole, S. Mohan, and C. Iwendi, “Enhanced Preprocessing Approach Using Ensemble Machine Learning Algorithms for Detecting Liver Disease,” Biomedicines, vol. 11, no. 2, 2023, doi: 10.3390/biomedicines11020581.

[33] K. Maharana, S. Mondal, and B. Nemade, “A review: Data pre-processing and data augmentation techniques,” Glob. Transitions Proc., vol. 3, no. 1, pp. 91–99, 2022, doi: 10.1016/j.gltp.2022.04.020.

[34] H. T. Duong and T. A. Nguyen-Thi, “A review: preprocessing techniques and data augmentation for sentiment analysis,” Comput. Soc. Networks, vol. 8, no. 1, pp. 1–16, 2021, doi: 10.1186/s40649-020-00080-x.

[35] C. Fan, M. Chen, X. Wang, J. Wang, and B. Huang, “A review on data preprocessing techniques toward efficient and reliable knowledge discovery from building operational data,” Front. Energy Res., vol. 9, pp. 1–17, 2021, doi: 10.3389/fenrg.2021.652801.

[36] V. Chernykh, A. Stepnov, and B. O. Lukyanova, “Data preprocessing for machine learning in seismology,” CEUR Workshop Proc., vol. 2930, pp. 119–123, 2021.

[37] A. J. Mohammed, “Improving Classification Performance for a Novel Imbalanced Medical Dataset using SMOTE Method,” Int. J. Adv. Trends Comput. Sci. Eng., vol. 9, no. 3, pp. 3161–3172, 2020, doi: 10.30534/ijatcse/2020/104932020.

[38] A. Kulkarni, D. Chong, and F. A. Batarseh, Foundations of data imbalance and solutions for a data democracy. Elsevier Inc., 2020. doi: 10.1016/B978-0-12-818366-3.00005-8.

[39] D. Makowski, M. Ben-Shachar, I. Patil, and D. Lüdecke, “Methods and Algorithms for Correlation Analysis in R,” J. Open Source Softw., vol. 5, no. 51, p. 2306, 2020, doi: 10.21105/joss.02306.

[40] M. S. Vural and M. Telceken, “Modification of posterior probability variable with frequency factor according to Bayes Theorem,” J. Intell. Syst. with Appl., vol. 5, no. 1, pp. 19–26, 2022, doi: 10.54856/jiswa.202205195.

Published

2025-05-02

How to Cite

[1]
C. A. Pamungkas and M. F. Azril, “OPTIMIZING SUPPORT VECTOR MACHINE FOR IMBALANCED DATASETS BY COMBINING POSTERIOR PROBABILITY AND CORRELATION METHODS”, IJSECS, vol. 11, no. 1, pp. 16–31, May 2025, doi: 10.15282/ijsecs.11.1.2025.2.0134.
