
International Journal of Advanced Technology and Engineering Exploration (IJATEE)

ISSN (Print): 2394-5443    ISSN (Online): 2394-7454
Volume-10 Issue-99 February-2023
Paper Title : Improvisation in opinion mining using data preprocessing techniques based on consumer’s review
Author Name : Kartika Makkar, Pardeep Kumar, Monika Poriye and Shalini Aggarwal
Abstract :

In today's digital age, an enormous volume of data is generated daily from various internet sources, including social media sites, emails, and consumer reviews. With competition on the rise, it has become essential for organizations to understand their customers' needs and preferences. Sentiment analysis is an effective method for gaining meaningful insights from human-language data, such as reviews, and for understanding consumer perceptions. This research article presents a text preprocessing approach consisting of three stages: data collection, cleaning, and transformation. The approach was applied to three datasets (restaurant, cell phone, and garments) and evaluated using various machine learning classifiers for sentiment prediction. A comparison was made between two sets of techniques: set1 employed data cleaning and transformation with stemming, while set2 used data cleaning and transformation with lemmatization. The results indicated that set2 (data cleaning and transformation with lemmatization) performed better during preprocessing when evaluated with machine learning classifiers such as support vector machine (SVM), logistic regression (LR), decision tree (DT), random forest (RF), and Naïve Bayes (NB). Specifically, SVM, LR, RF, and NB performed better on the restaurant dataset, while DT, LR, and RF performed better on the cell phone dataset. On the garments dataset, LR, DT, and RF performed better for set2 than for set1, making set2 the preferred preprocessing technique for the subsequent comparison. Additionally, another comparison was made between two further sets of techniques: set3 combined text cleaning, transformation with lemmatization, and unigram features, while the other set combined text cleaning, transformation with lemmatization, and bigram features. These sets were evaluated using the same machine learning classifiers, and the results revealed that set3 performed better with most classifiers.
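
To make the evaluated pipeline concrete, the sketch below shows, under assumptions, how a set2-style preprocessing chain (data cleaning plus lemmatization) and the unigram-versus-bigram feature comparison could be wired together in Python with NLTK and scikit-learn. The paper does not publish its code, so the cleaning rules, the WordNetLemmatizer/PorterStemmer choice, the logistic regression settings, and the toy reviews are illustrative assumptions rather than the authors' implementation.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer, PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    nltk.download("omw-1.4", quiet=True)

    STOPWORDS = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    def clean(text):
        # Data cleaning: lowercase, then drop URLs, punctuation, digits and stopwords.
        text = re.sub(r"http\S+|[^a-z\s]", " ", text.lower())
        return [tok for tok in text.split() if tok not in STOPWORDS]

    def transform(tokens, use_lemmatization=True):
        # Transformation step: set1 stems the tokens, set2 lemmatizes them.
        if use_lemmatization:
            return " ".join(lemmatizer.lemmatize(tok) for tok in tokens)
        return " ".join(stemmer.stem(tok) for tok in tokens)

    def evaluate(reviews, labels, ngram_range):
        # Vectorize with the given n-gram range ((1, 1) = unigram, (2, 2) = bigram)
        # and report test accuracy for one classifier (logistic regression here).
        # The vectorizer is fitted on all documents for brevity.
        docs = [transform(clean(r)) for r in reviews]
        features = CountVectorizer(ngram_range=ngram_range).fit_transform(docs)
        x_tr, x_te, y_tr, y_te = train_test_split(
            features, labels, test_size=0.5, stratify=labels, random_state=42)
        model = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
        return accuracy_score(y_te, model.predict(x_te))

    # Toy reviews standing in for the restaurant/cell phone/garments datasets,
    # which are not reproduced on this page.
    reviews = ["The food was great and the service was friendly",
               "Terrible battery life, the phone died within two hours",
               "Lovely fabric, the dress fits perfectly and looks stylish",
               "Awful experience, the staff ignored us completely"]
    labels = [1, 0, 1, 0]

    print("unigram features (set3):", evaluate(reviews, labels, (1, 1)))
    print("bigram features        :", evaluate(reviews, labels, (2, 2)))

Passing use_lemmatization=False to transform() reproduces the set1-style stemming variant, and the other classifiers named in the keywords (SVM, DT, RF, NB) can be compared by swapping the LogisticRegression line for the corresponding scikit-learn estimator.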

Keywords : Support vector machine (SVM), Random forest (RF), Decision tree (DT), Logistic regression (LR), Naïve Bayes (NB).
Cite this article : Makkar K, Kumar P, Poriye M, Aggarwal S. Improvisation in opinion mining using data preprocessing techniques based on consumer’s review. International Journal of Advanced Technology and Engineering Exploration. 2023; 10(99):257-277. DOI:10.19101/IJATEE.2021.875886.