(Publisher of Peer Reviewed Open Access Journals)

International Journal of Advanced Computer Research (IJACR)

ISSN (Print):2249-7277    ISSN (Online):2277-7970
Volume-8 Issue-34 January-2018
DOI:10.19101/IJACR.2017.733030
Paper Title : Construction of a generic stopwords list for Hindi language without corpus statistics
Author Name : Sifatullah Siddiqi and Aditi Sharan
Abstract :

Most research in information retrieval (IR) has focused on the English language, but recently there has been considerable effort to develop IR systems for other languages. Research and experimentation on IR in Hindi are relatively new and limited compared with the work done for English, which has long dominated the field. A fundamental tool in IR is the stop word list. Stop words have no retrieval value in IR. To date, many stop word lists have been developed for English, European and Chinese languages. However, no standard stop word list has been constructed for the Hindi language. In this paper, an approach to constructing a generic stop word list for the Hindi language is presented. Our list contains more than 800 stop words.

Keywords : Stop word, Stop words list, Hindi language, Information retrieval, Text mining, Corpus statistics.
Cite this article : Sifatullah Siddiqi and Aditi Sharan, "Construction of a generic stopwords list for Hindi language without corpus statistics", International Journal of Advanced Computer Research (IJACR), Volume-8, Issue-34, January-2018, pp. 35-40. DOI:10.19101/IJACR.2017.733030