(Publisher of Peer Reviewed Open Access Journals)

International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print): 2394-5443    ISSN (Online): 2394-7454
Volume-10 Issue-107 October-2023
Paper Title : Performance analysis of samplers and calibrators with various classifiers for asymmetric hydrological data
Author Name : C. Kaleeswari, K. Kuppusamy and A. Senthilrajan
Abstract :

Asymmetric data classification presents a significant challenge in machine learning (ML). While ML algorithms classify symmetric data effectively, handling data asymmetry remains an ongoing concern in classification tasks. This research paper aims to select an appropriate method for classifying and predicting asymmetric data, focusing on label and probability predictions. To this end, various ML classifiers, calibration techniques, and sampling methods are systematically analyzed. The classifiers under consideration include logistic regression (LR), k-nearest neighbour (KNN), Gaussian naive Bayes (GNB), random forest (RF), decision tree (DT), and support vector classifier (SVC). The calibration techniques explored encompass isotonic regression (IR) and Platt scaling (PS), while the sampling techniques comprise the synthetic minority oversampling technique (SMOTE), Tomek links (T-link), adaptive synthetic sampling (AdaSyn), the integration of SMOTE and the edited nearest neighbour rule (SMOTEENN), and the integration of SMOTE and T-link (SMOTETomek). Simulation results for label prediction consistently favour the SMOTEENN approach: the RF classifier combined with SMOTEENN delivers outstanding performance, achieving a balanced random accuracy (BRA) of 98.07%, sensitivity of 98.02%, specificity of 99.01%, an area under the curve (AUC) of 0.98, and a geometric mean (G-mean) of 98.50%. For probability prediction, IR calibration consistently excels; specifically, the GNB classifier combined with IR yields the best performance, with a low Brier score (BS), expected calibration error (ECE), and maximum calibration error (MCE), and achieves perfect calibration as demonstrated by the reliability curve. In light of these findings, this study recommends SMOTEENN for data resampling and IR calibration for probability prediction as superior methods for addressing data asymmetry. The comparative analysis presented in this research offers valuable insights for selecting appropriate techniques for asymmetric data classification.
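The headline metrics above follow directly from the confusion matrix and the predicted probabilities. As an illustrative sketch (not the authors' code), assuming binary labels and predicted probabilities held in NumPy arrays, G-mean, Brier score, and ECE can be computed as:

```python
import numpy as np

def g_mean(y_true, y_pred):
    # geometric mean of sensitivity (true positive rate) and
    # specificity (true negative rate)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return np.sqrt(sensitivity * specificity)

def brier_score(y_true, p):
    # mean squared difference between predicted probability and outcome
    return np.mean((p - y_true) ** 2)

def ece(y_true, p, n_bins=10):
    # expected calibration error: per-bin gap between mean confidence
    # and observed accuracy, weighted by the fraction of samples in the bin
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    err = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (p >= lo) & (p <= hi) if lo == 0.0 else (p > lo) & (p <= hi)
        if mask.sum() == 0:
            continue
        confidence = p[mask].mean()
        accuracy = y_true[mask].mean()
        err += mask.mean() * abs(accuracy - confidence)
    return err
```

The resampling and calibration steps studied in the paper are also available off the shelf, e.g. `SMOTEENN` in the imbalanced-learn library and isotonic calibration via scikit-learn's `CalibratedClassifierCV(method='isotonic')`.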

Keywords : Machine learning, Calibration, Asymmetric data, Classification, Probability, Prediction.
Cite this article : Kaleeswari C, Kuppusamy K, Senthilrajan A. Performance analysis of samplers and calibrators with various classifiers for asymmetric hydrological data. International Journal of Advanced Technology and Engineering Exploration. 2023; 10(107):1316-1335. DOI:10.19101/IJATEE.2023.10101138.
References :
[1]Tazoe H. Water quality monitoring. Analytical Sciences. 2023; 39(1):1-3.
[2]Adeleke IA, Nwulu NI, Ogbolumani OA. A hybrid machine learning and embedded IoT-based water quality monitoring system. Internet of Things. 2023; 22:100774.
[3]Wang Z, Jia D, Song S, Sun J. Assessments of surface water quality through the use of multivariate statistical techniques: a case study for the watershed of the Yuqiao reservoir, China. Frontiers in Environmental Science. 2023; 11:1-15.
[4]Banerjee P, Dehnbostel FO, Preissner R. Prediction is a balancing act: importance of sampling methods to balance sensitivity and specificity of predictive models based on imbalanced chemical data sets. Frontiers in Chemistry. 2018; 6:1-11.
[5]Chinnakkaruppan K, Krishnamoorthy K, Agniraj S. A hybrid approach for forecasting the technical anomalies in sensor-based water quality distribution data. In international conference on power, instrumentation, energy and control 2023 (pp. 1-5). IEEE.
[6]Piao C, Wang N, Yuan C. Rebalance weights adaboost-SVM model for imbalanced data. Computational Intelligence and Neuroscience. 2023; 2023:1-26.
[7]Pandey S, Kumar K. Software fault prediction for imbalanced data: a survey on recent developments. Procedia Computer Science. 2023; 218:1815-24.
[8]Douzas G, Bacao F, Fonseca J, Khudinyan M. Imbalanced learning in land cover classification: improving minority classes’ prediction accuracy using the geometric SMOTE algorithm. Remote Sensing. 2019; 11(24):1-14.
[9]Liang Z, Wang H, Yang K, Shi Y. Adaptive fusion based method for imbalanced data classification. Frontiers in Neurorobotics. 2022; 16:1-8.
[10]Ahmed J, Green II RC. Predicting severely imbalanced data disk drive failures with machine learning models. Machine Learning with Applications. 2022; 9:1-12.
[11]Basora L, Bry P, Olive X, Freeman F. Aircraft fleet health monitoring with anomaly detection techniques. Aerospace. 2021; 8(4):1-33.
[12]Muharemi F, Logofătu D, Leon F. Machine learning approaches for anomaly detection of water quality on a real-world data set. Journal of Information and Telecommunication. 2019; 3(3):294-307.
[13]Bao F, Wu Y, Li Z, Li Y, Liu L, Chen G. Effect improved for high-dimensional and unbalanced data anomaly detection model based on KNN-SMOTE-LSTM. Complexity. 2020; 2020:1-7.
[14]Muntasir NM, Faisal F, Jahan RI, Al-monsur A, Ar-rafi AM, Nasrullah SM, et al. A comprehensive investigation of the performances of different machine learning classifiers with SMOTE-ENN oversampling technique and hyperparameter optimization for imbalanced heart failure dataset. Scientific Programming. 2022; 2022:1-7.
[15]Wang ZH, Wu C, Zheng K, Niu X, Wang X. SMOTETomek-based resampling for personality recognition. IEEE Access. 2019; 7:129678-89.
[16]Huang L, Zhao J, Zhu B, Chen H, Broucke SV. An experimental investigation of calibration techniques for imbalanced data. IEEE Access. 2020; 8:127343-52.
[17]Joloudari JH, Marefat A, Nematollahi MA, Oyelere SS, Hussain S. Effective class-imbalance learning based on SMOTE and convolutional neural networks. Applied Sciences. 2023; 13(6):1-34.
[18]Zheng X, Jia J, Chen J, Guo S, Sun L, Zhou C, et al. Hyperspectral image classification with imbalanced data based on semi-supervised learning. Applied Sciences. 2022; 12(8):1-19.
[19]Schmidt L, Heße F, Attinger S, Kumar R. Challenges in applying machine learning models for hydrological inference: a case study for flooding events across Germany. Water Resources Research. 2020; 56(5):1-10.
[20]Rahman AA, Prasetiyowati SS, Sibaroni Y. Performance analysis of the imbalanced data method on increasing the classification accuracy of the machine learning hybrid method. JIPI (Jurnal Ilmiah Penelitian dan Pembelajaran Informatika). 2023; 8(1):115-26.
[21]Werner DVV, Schneider AJA, Dos SCR, Da SPPR, Victória BJL. Imbalanced data preprocessing techniques for machine learning: a systematic mapping study. Knowledge and Information Systems. 2023; 65(1):31-57.
[22]Thabtah F, Hammoud S, Kamalov F, Gonsalves A. Data imbalance in classification: experimental evaluation. Information Sciences. 2020; 513:429-41.
[23]Swana EF, Doorsamy W, Bokoro P. Tomek link and SMOTE approaches for machine fault classification with an imbalanced dataset. Sensors. 2022; 22(9):1-21.
[24]Bennin KE, Tahir A, Macdonell SG, Börstler J. An empirical study on the effectiveness of data resampling approaches for cross‐project software defect prediction. IET Software. 2022; 16(2):185-99.
[25]Aggarwal U, Popescu A, Belouadah E, Hudelot C. A comparative study of calibration methods for imbalanced class incremental learning. Multimedia Tools and Applications. 2022:1-20.
[26]Kuhn M, Johnson K. Remedies for severe class imbalance. Applied Predictive Modeling. 2013:419-43.
[27]Liu L, Wu X, Li S, Li Y, Tan S, Bai Y. Solving the class imbalance problem using ensemble algorithm: application of screening for aortic dissection. BMC Medical Informatics and Decision Making. 2022; 22(1):1-6.
[28]Davagdorj K, Lee JS, Pham VH, Ryu KH. A comparative analysis of machine learning methods for class imbalance in a smoking cessation intervention. Applied Sciences. 2020; 10(9):1-20.
[29]Xu Z, Shen D, Nie T, Kou Y. A hybrid sampling algorithm combining M-SMOTE and ENN based on random forest for medical imbalanced data. Journal of Biomedical Informatics. 2020; 107:103465.
[30]Johnson JM, Khoshgoftaar TM. The effects of data sampling with deep learning and highly imbalanced big data. Information Systems Frontiers. 2020; 22(5):1113-31.
[31]Shaikh S, Daudpota SM, Imran AS, Kastrati Z. Towards improved classification accuracy on highly imbalanced text dataset using deep neural language models. Applied Sciences. 2021; 11(2):1-20.
[32]Chyon FA, Suman MN, Fahim MR, Ahmmed MS. Time series analysis and predicting COVID-19 affected patients by ARIMA model using machine learning. Journal of Virological Methods. 2022; 301:1-6.
[33]Ahmed DM, Hassan MM, Mstafa RJ. A review on deep sequential models for forecasting time series data. Applied Computational Intelligence and Soft Computing. 2022; 2022:1-19.
[34]Susan S, Kumar A. The balancing trick: optimized sampling of imbalanced datasets-a brief survey of the recent state of the art. Engineering Reports. 2021; 3(4):1-24.
[35]Alharbi F, Ouarbya L, Ward JA. Comparing sampling strategies for tackling imbalanced data in human activity recognition. Sensors. 2022; 22(4):1-20.
[36]Liang XW, Jiang AP, Li T, Xue YY, Wang GT. LR-SMOTE—an improved unbalanced data set oversampling based on K-means and SVM. Knowledge-Based Systems. 2020; 196:105845.
[37]Hussein HI, Anwar SA, Ahmad MI. Imbalanced data classification using SVM based on improved simulated annealing featuring synthetic data generation and reduction. CMC-Computers Materials & Continua. 2023; 75(1):547-64.
[38]Zhao C, Shuai R, Ma L, Liu W, Wu M. Improving cervical cancer classification with imbalanced datasets combining taming transformers with T2T-ViT. Multimedia Tools and Applications. 2022; 81(17):24265-300.
[39]Christianto Y, Rusli A. Evaluating RNN architectures for handling imbalanced dataset in multi-class text classification in Bahasa Indonesia. International Journal of Advanced Trends in Computer Science and Engineering. 2020:8418-23.
[40]Gao H, Li Y, Lu H, Zhu S. Water potability analysis and prediction. Highlights in Science, Engineering and Technology. 2022; 16:70-7.
[41]https://www.kaggle.com/datasets/adityakadiwal/water-potability. Accessed 02 March 2023.
[42]Patel J, Amipara C, Ahanger TA, Ladhva K, Gupta RK, Alsaab HO, et al. A machine learning-based water potability prediction model by using synthetic minority oversampling technique and explainable AI. Computational Intelligence and Neuroscience. 2022; 2022:1-15.
[43]Rawat N, Kazembe MD, Mishra PK. Water quality prediction using machine learning. International Journal for Research in Applied Science and Engineering Technology. 2022; 10(VI):4173-87.
[44]Wang H, Zhao Y, Zhou Y, Wang H. Prediction of urban water accumulation points and water accumulation process based on machine learning. Earth Science Informatics. 2021; 14:2317-28.
[45]Moeini M, Shojaeizadeh A, Geza M. Supervised machine learning for estimation of total suspended solids in urban watersheds. Water. 2021; 13(2):1-24.
[46]Mensi A, Tax DM, Bicego M. Detecting outliers from pairwise proximities: proximity isolation forests. Pattern Recognition. 2023; 138:109334.
[47]Buschjäger S, Honysz PJ, Morik K. Randomized outlier detection with trees. International Journal of Data Science and Analytics. 2022; 13(2):91-104.
[48]Gao R, Zhang T, Sun S, Liu Z. Research and improvement of isolation forest in detection of local anomaly points. In journal of physics: conference series 2019 (pp. 1-6). IOP Publishing.
[49]Niculescu-Mizil A, Caruana R. Predicting good probabilities with supervised learning. In proceedings of the 22nd international conference on machine learning 2005 (pp. 625-32).
[50]Mulugeta G, Zewotir T, Tegegne AS, Juhar LH, Muleta MB. Classification of imbalanced data using machine learning algorithms to predict the risk of renal graft failures in Ethiopia. BMC Medical Informatics and Decision Making. 2023; 23(1):1-7.
[51]Alipour A, Ahmadalipour A, Abbaszadeh P, Moradkhani H. Leveraging machine learning for predicting flash flood damage in the Southeast US. Environmental Research Letters. 2020; 15(2):1-12.
[52]Ahmed U, Mumtaz R, Anwar H, Shah AA, Irfan R, García-Nieto J. Efficient water quality prediction using supervised machine learning. Water. 2019; 11(11):1-14.
[53]Gakii C, Jepkoech J. A classification model for water quality analysis using decision tree. Euro Journal of Computer Science and Information Technology. 2019; 7(3):1-8.
[54]Jaloree S, Rajput A, Gour S. Decision tree approach to build a model for water quality. Binary Journal of Data Mining & Networking. 2014; 4(1):25-8.
[55]Khan TM, Xu S, Khan ZG. Implementing multilabeling, ADASYN, and relieff techniques for classification of breast cancer diagnostic through machine learning: efficient computer-aided diagnostic system. Journal of Healthcare Engineering. 2021; 2021:1-15.
[56]Peng CY, Park YJ. A new hybrid under-sampling approach to imbalanced classification problems. Applied Artificial Intelligence. 2022; 36(1):1-18.
[57]Zhang A, Yu H, Huan Z, Yang X, Zheng S, Gao S. SMOTE-RkNN: a hybrid re-sampling method based on SMOTE and reverse K-nearest neighbors. Information Sciences. 2022; 595:70-88.
[58]Pan T, Zhao J, Wu W, Yang J. Learning imbalanced datasets based on SMOTE and Gaussian distribution. Information Sciences. 2020; 512:1214-33.
[59]Wang K, Tian J, Zheng C, Yang H, Ren J, Li C, et al. Improving risk identification of adverse outcomes in chronic heart failure using SMOTE+ENN and machine learning. Risk Management and Healthcare Policy. 2021:2453-63.
[60]Branco P, Torgo L, Ribeiro RP. A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR). 2016; 49(2):1-50.
[61]Silva FT, Song H, Perello-Nieto M, Santos-Rodriguez R, Kull M, Flach P. Classifier calibration: a survey on how to assess and improve predicted class probabilities. Machine Learning. 2023:1-50.
[62]Alqarni AA, Yadav OP, Rathore AP. Application of isotonic regression in predicting corrosion depth of the oil refinery pipelines. In annual reliability and maintainability symposium 2022 (pp. 1-6). IEEE.
[63]Mahmudah KR, Indriani F, Takemori-Sakai Y, Iwata Y, Wada T, Satou K. Classification of imbalanced data represented as binary features. Applied Sciences. 2021; 11(17):1-13.
[64]Wegier W, Ksieniewicz P. Application of imbalanced data classification quality metrics as weighting methods of the ensemble data stream classification algorithms. Entropy. 2020; 22(8):1-17.
[65]Ri J, Kim H. G-mean based extreme learning machine for imbalance learning. Digital Signal Processing. 2020; 98:102637.
[66]Aridas CK, Karlos S, Kanas VG, Fazakis N, Kotsiantis SB. Uncertainty based under-sampling for learning naive Bayes classifiers under imbalanced data sets. IEEE Access. 2019; 8:2122-33.
[67]Alaraj M, Abbod MF, Majdalawieh M. Modelling customers credit card behaviour using bidirectional LSTM neural networks. Journal of Big Data. 2021; 8(1):1-27.
[68]Rožanec JM, Bizjak L, Trajkova E, Zajec P, Keizer J, Fortuna B, et al. Active learning and novel model calibration measurements for automated visual inspection in manufacturing. Journal of Intelligent Manufacturing. 2023:1-22.