
International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print): 2394-5443    ISSN (Online): 2394-7454
Volume-8 Issue-85 December-2021
Paper Title : A two-phase feature selection technique using mutual information and XGB-RFE for credit card fraud detection
Author Name : C. Victoria Priscilla and D. Padma Prabha
Abstract :

With the rapid increase in online transactions, credit card fraud has become a serious menace. Machine Learning (ML) algorithms are beneficial in building a good model to detect fraudulent transactions. However, dealing with high-dimensional and imbalanced datasets is a hindrance in real-world applications such as credit card fraud detection. To overcome this issue, feature selection, a pre-processing technique, is adopted with both classification performance and computational efficiency in mind. This paper proposes a new two-phase feature selection approach that integrates filter and wrapper methods to identify significant feature subsets. In the first phase, Mutual Information (MI), chosen for its computational efficiency, is used to rank the features by their importance. However, MI ranking alone cannot drop the less important features. Thus, a second phase is added to eliminate the redundant features using Recursive Feature Elimination (RFE), a wrapper method employed with 5-fold cross-validation. eXtreme Gradient Boosting (XGBoost), with adjusted class weights, is adopted as the estimator for RFE. The optimal features obtained from the proposed method were used in four boosting algorithms, namely XGBoost, Gradient Boosting Machine (GBM), Categorical Boosting (CatBoost) and Light Gradient Boosting Machine (LGBM), to analyse the classification performance. The proposed approach was applied to the credit card fraud detection dataset obtained from the IEEE-CIS, which has an imbalanced binary target class. The experimental outcome shows promising results in terms of Geometric Mean (G-Mean) for XGBoost (84.8%) and LGBM (83.7%); the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) increased from 79.8% to 85.5% for XGBoost, and the computation time for training the classifiers was also reduced.
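To illustrate the kind of pipeline the abstract describes, the sketch below chains an MI-based ranking phase with XGBoost-driven RFE under 5-fold cross-validation and reports AUC and G-Mean. It is a minimal sketch using scikit-learn and xgboost on synthetic data; the MI cut-off, hyper-parameters and data are illustrative assumptions, not the authors' actual settings or the IEEE-CIS dataset.

```python
# Minimal sketch of a two-phase feature selection pipeline: MI ranking (filter)
# followed by XGBoost-based RFE with 5-fold cross-validation (wrapper).
# All settings below (MI cut-off, hyper-parameters, synthetic data) are
# illustrative assumptions, not the values used in the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, RFECV
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score, confusion_matrix
from xgboost import XGBClassifier

# Synthetic imbalanced data standing in for the credit card transactions.
X, y = make_classification(n_samples=5000, n_features=50, n_informative=10,
                           weights=[0.97, 0.03], random_state=42)

# Phase 1 (filter): rank features by mutual information, keep the top half.
mi = mutual_info_classif(X, y, random_state=42)
top_idx = np.argsort(mi)[::-1][:25]
X_mi = X[:, top_idx]

X_train, X_test, y_train, y_test = train_test_split(
    X_mi, y, test_size=0.3, stratify=y, random_state=42)

# Phase 2 (wrapper): RFE with a class-weighted XGBoost estimator and
# 5-fold stratified cross-validation.
scale = (y_train == 0).sum() / (y_train == 1).sum()   # class-weight adjustment
estimator = XGBClassifier(n_estimators=200, scale_pos_weight=scale,
                          eval_metric="logloss", random_state=42)
selector = RFECV(estimator, step=1, cv=StratifiedKFold(5), scoring="roc_auc")
selector.fit(X_train, y_train)

# Train a classifier on the selected subset and report AUC and G-Mean.
clf = XGBClassifier(n_estimators=200, scale_pos_weight=scale,
                    eval_metric="logloss", random_state=42)
clf.fit(X_train[:, selector.support_], y_train)
proba = clf.predict_proba(X_test[:, selector.support_])[:, 1]
pred = (proba >= 0.5).astype(int)
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
g_mean = np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))  # sqrt(recall * specificity)
print(f"features kept: {selector.n_features_}, "
      f"AUC: {roc_auc_score(y_test, proba):.3f}, G-Mean: {g_mean:.3f}")
```

In this sketch the cross-validated AUC inside RFECV decides how many of the MI-ranked features survive, which mirrors the filter-then-wrapper idea of the proposed XGB-RFE approach.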

Keywords : Recursive feature elimination, Hyper-parameter optimization, Class imbalance, XGBoost, Binary classification.
Cite this article : Priscilla CV, Prabha DP. A two-phase feature selection technique using mutual information and XGB-RFE for credit card fraud detection. International Journal of Advanced Technology and Engineering Exploration. 2021; 8(85):1656-1668. DOI:10.19101/IJATEE.2021.874615.