ACCENTS Journals

Detection of offensive language in the Moroccan dialect using BERT-based models

Moussaoui Otman¹, Yacine El Younoussi² and Naoufal Rtayli³

Research Scholar, Department of Computer Engineering, ENSA of Tetouan, Abdelmalek Essaadi University, Tetouan, Morocco¹
Professor, Department of Computer Engineering, ENSA of Tetouan, Abdelmalek Essaadi University, Tetouan, Morocco²
Professor, Department of Computer Engineering, Faculty of Sciences Tetouan, Abdelmalek Essaadi University, Tetouan, Morocco³

Corresponding Author: Moussaoui Otman

Received: 14-Sep-2024; Revised: 25-Jun-2025; Accepted: 04-Jul-2025

Abstract

Detecting offensive language and hate speech in online communication has become increasingly important because harmful content spreads rapidly on social media. The challenge is especially acute for low-resource languages such as the Moroccan dialect. This study addresses the need for effective automated systems that detect offensive language, rudeness, hate speech, and toxicity in the Moroccan dialect. Six transformer-based models were fine-tuned and evaluated for this task: Darija BERT mix (DBERT-mix), MARBERT, multilingual BERT (mBERT), Moroccan BERT (MorrBERT), cross-lingual language model - RoBERTa (XLM-R), and Moroccan RoBERTa (MorRoBERTa). DBERT-mix achieved the highest performance among the six models. The analysis also showed better performance on text written in Latin script than on text written in Arabic script, which points to a need for further optimization of models for Arabic script. These findings underscore the importance of adapting models to specific dialects and scripts, and they provide insights for improving offensive language detection in the Moroccan context.
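The script-level comparison mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the authors separated Latin-script from Arabic-script text or which metric they reported per script, so everything below is an assumption made for illustration: the helper names `dominant_script` and `f1_by_script`, the character-counting heuristic, the choice of F1, and the toy inputs are all hypothetical.

```python
from collections import defaultdict


def dominant_script(text: str) -> str:
    """Guess the script of a comment by counting characters (hypothetical heuristic).

    Counts characters in the Arabic and Arabic Supplement Unicode blocks
    against ASCII letters. Ties between the two go to 'latin'.
    """
    arabic = sum(
        1 for ch in text
        if "\u0600" <= ch <= "\u06FF" or "\u0750" <= ch <= "\u077F"
    )
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "other"
    return "arabic" if arabic > latin else "latin"


def f1_by_script(texts, gold, pred, positive=1):
    """Compute F1 for the positive (offensive) class separately per script."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for text, g, p in zip(texts, gold, pred):
        c = counts[dominant_script(text)]
        if p == positive and g == positive:
            c["tp"] += 1
        elif p == positive:
            c["fp"] += 1
        elif g == positive:
            c["fn"] += 1
    scores = {}
    for script, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives
        scores[script] = 2 * c["tp"] / denom if denom else 0.0
    return scores


# Toy data: neutral greetings with made-up labels, for illustration only
texts = ["salam khoya", "labas 3lik", "سلام خويا", "لاباس عليك"]
gold = [1, 0, 1, 0]
pred = [1, 0, 0, 0]
scores = f1_by_script(texts, gold, pred)
```

Given predictions from any of the fine-tuned classifiers, a breakdown of this kind shows whether errors concentrate in one script; the toy labels above carry no information about the paper's actual results.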

Keywords

Offensive language detection, Hate speech classification, Moroccan dialect, Transformer models, BERT-based models, Script-specific NLP (Latin and Arabic).

Cite this article

Otman M, Younoussi YE, Rtayli N. Detection of offensive language in the Moroccan dialect using BERT-based models. International Journal of Advanced Technology and Engineering Exploration. 2025; 12(128):1075-1085. DOI: 10.19101/IJATEE.2024.111101679
