Detection of offensive language in the Moroccan dialect using BERT-based models
Moussaoui Otman (1), Yacine El Younoussi (2) and Naoufal Rtayli (3)
(2) Professor, Department of Computer Engineering, ENSA of Tetouan, Abdelmalek Essaadi University, Tetouan, Morocco
(3) Professor, Department of Computer Engineering, Faculty of Sciences Tetouan, Abdelmalek Essaadi University, Tetouan, Morocco
Corresponding Author : Moussaoui Otman
Received : 14-Sep-2024; Revised : 25-Jun-2025; Accepted : 04-Jul-2025
Abstract
The detection of offensive language and hate speech in online communication has become increasingly important due to the rapid spread of harmful content on social media. The challenge is especially acute for low-resource languages such as the Moroccan dialect. This study addresses the need for effective automated systems to detect offensive language, rudeness, hate speech, and toxicity in the Moroccan dialect. Six transformer-based models were fine-tuned and evaluated for this task: Darija BERT mix (DBERT-mix), MARBERT, multilingual BERT (mBERT), Moroccan BERT (MorrBERT), cross-lingual language model RoBERTa (XLM-R), and Moroccan RoBERTa (MorRoBERTa). DBERT-mix achieved the highest performance of the six models. The analysis also showed better performance on text written in Latin script than on text written in Arabic script, indicating that the models need further optimization for Arabic script. These findings underline the importance of adapting models to specific dialects and scripts, and they provide insights for improving offensive language detection in the Moroccan context.
Keywords
Offensive language detection, Hate speech classification, Moroccan dialect, Transformer models, BERT-based models, Script-specific NLP (Latin and Arabic).
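The per-script comparison mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names (`dominant_script`, `f1_offensive`, `f1_by_script`), the Unicode-range heuristic for assigning a script, and the choice of F1 on the offensive class as the metric are all assumptions made for illustration.

```python
from collections import defaultdict


def dominant_script(text: str) -> str:
    """Hypothetical heuristic: label a text by whichever letters dominate."""
    # Letters in the basic Arabic Unicode block (U+0600 to U+06FF).
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # ASCII letters only; digits used in Latin-script Darija (3, 7, 9) are ignored.
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "other"
    return "arabic" if arabic >= latin else "latin"


def f1_offensive(y_true, y_pred, positive=1):
    """F1 score for the positive (offensive) class; 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_script(texts, y_true, y_pred):
    """Group examples by dominant script and score each group separately."""
    groups = defaultdict(lambda: ([], []))
    for text, t, p in zip(texts, y_true, y_pred):
        gold, pred = groups[dominant_script(text)]
        gold.append(t)
        pred.append(p)
    return {script: f1_offensive(g, p) for script, (g, p) in groups.items()}
```

Given the predictions of any fine-tuned classifier, `f1_by_script(texts, y_true, y_pred)` returns one score per script, which makes a gap between Latin-script and Arabic-script inputs visible. Mixed-script comments are assigned to whichever script contributes more letters, a simplification that a real evaluation might handle differently.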