
International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print):2394-5443    ISSN (Online):2394-7454
Volume-10 Issue-106 September-2023
Paper Title : A hybrid approach for generative process model with topic modelling towards efficient and dynamic document clustering
Author Name : Gugulothu Venkanna and K.F Bharati
Abstract :

Clustering text documents has applications across many domains. However, the diversity and rapid growth of textual data make clustering a given text corpus increasingly challenging. Several existing approaches to text document clustering rely on natural language processing (NLP) and text similarity measures. A generative process model is still needed to handle text corpora systematically and progressively, along with a hybrid approach that improves clustering performance. It is therefore important to build a model for a given text corpus and update it dynamically as new documents arrive, rather than clustering from scratch each time. This paper proposes a framework called the hybrid approach for dynamic document clustering (HADDC). The framework is realized through two algorithms that work together to achieve dynamic document clustering. The first algorithm, similar document identification (SDI), uses the WordNet lexical dictionary and similarity measures to identify similar documents. The second algorithm, topic modelling for efficient and dynamic document clustering (TM-EDDC), is a dynamic process model based on latent Dirichlet allocation (LDA) that clusters documents incrementally as new ones become available. The framework and its underlying algorithms were evaluated on the 20 Newsgroups dataset. Experimental results show that the proposed methods outperform existing ones, as evidenced by a lower mean absolute error (MAE). The empirical study indicates that the framework is both useful and efficient, and that organizations can integrate it into their existing applications.
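The idea of updating clusters as documents arrive, rather than re-clustering the whole corpus, can be illustrated with a minimal sketch. This is not the SDI or TM-EDDC algorithm from the paper: it swaps in plain bag-of-words cosine similarity for the WordNet-based similarity and a nearest-centroid rule for the LDA-based model. The class name IncrementalClusterer, the threshold value of 0.3, and the sample documents are all hypothetical and chosen only for illustration. A small MAE helper is included because the abstract reports that metric.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased, whitespace-tokenised term counts (no stemming, no WordNet)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical stand-in for dynamic clustering: each arriving document
    joins the most similar existing cluster, or opens a new cluster when no
    existing cluster reaches the similarity threshold."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value, not taken from the paper
        self.centroids = []         # one summed term-count vector per cluster
        self.members = []           # document texts grouped by cluster

    def add(self, text):
        vec = bag_of_words(text)
        scores = [cosine(vec, centroid) for centroid in self.centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < self.threshold:
            # Nothing similar enough: start a new cluster with this document.
            self.centroids.append(Counter(vec))
            self.members.append([text])
            return len(self.centroids) - 1
        # Fold the document into the existing cluster without re-clustering.
        self.centroids[best].update(vec)
        self.members[best].append(text)
        return best


def mean_absolute_error(predicted, actual):
    """Average absolute difference between paired values."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


clusterer = IncrementalClusterer(threshold=0.3)
labels = [clusterer.add(doc) for doc in [
    "rocket launch reached orbit",
    "hockey team won game",
    "new rocket reached orbit today",
]]
print(labels)  # [0, 1, 0]
```

In this sketch the third document shares three terms with the first cluster and none with the second, so it is added to the first cluster without revisiting earlier assignments. A topic-model version would replace the term-count centroids with per-topic word distributions that are updated as each batch arrives.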

Keywords : Document clustering, Natural language processing, Generative process model, Document similarity, Dynamic document clustering.
Cite this article : Venkanna G, Bharati K. A hybrid approach for generative process model with topic modelling towards efficient and dynamic document clustering. International Journal of Advanced Technology and Engineering Exploration. 2023; 10(106):1184-1197. DOI:10.19101/IJATEE.2023.10101071.