MOTEC: The Malay Offensive Text Classification using Extra Tree and Language Standardization

Fairuz Amalina
Faiz Zaki
Hamza H. M. Altarturi
Hazim Hanif
Nor Badrul Anuar

Abstract

Cyberbullying has increased globally, and offensive text contributes significantly to it. Detecting offensive text in Malay is challenging because of non-standard Malay text, the distinctive writing styles of social media, a lack of standardization, and limited language resources. This study proposes the Malay Offensive Text Classification (MOTEC) framework to address these challenges. The MOTEC framework incorporates a Malay standardization preprocessing task that uses three specialized dictionaries: (a) abbreviations, (b) noisy text, and (c) Malaysian dialects. This task improves data quality by converting non-standard text into standardized Malay sentences before classification. For feature extraction, the framework employs Term Frequency-Inverse Document Frequency (TF-IDF), a statistical method that evaluates the importance of a word in a document relative to a collection of documents. The resulting features are then classified with an Extra Tree classifier. Evaluated on a private dataset collected from Twitter, the MOTEC framework achieved a classification accuracy of 94%, significantly outperforming other studies, which reported an accuracy of 84%. The MOTEC framework substantially improves the classification of offensive Malay text by increasing accuracy, reducing execution time, and improving data quality through effective language standardization.
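
The standardization and TF-IDF steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary entries, the example texts, and the function names `standardize` and `tf_idf` are hypothetical, and the weighting uses the plain tf × log(N/df) form, whereas library implementations often apply smoothed variants. The Extra Tree classification step is omitted here; in practice the resulting weights would be passed to a tree-ensemble classifier.

```python
import math
from collections import Counter

# Toy stand-ins for the three dictionaries named in the abstract.
# The entries are illustrative assumptions, not the MOTEC dictionaries.
ABBREVIATIONS = {"yg": "yang", "sgt": "sangat"}
NOISY_TEXT = {"bodooh": "bodoh"}
DIALECTS = {"gapo": "apa", "demo": "awak"}


def standardize(text):
    """Lowercase, split on whitespace, and map known non-standard tokens."""
    lookup = {**ABBREVIATIONS, **NOISY_TEXT, **DIALECTS}
    return [lookup.get(token, token) for token in text.lower().split()]


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical example texts written in non-standard Malay.
texts = ["demo sgt bodooh", "gapo yg demo buat", "makanan ini sgt sedap"]
docs = [standardize(t) for t in texts]
weights = tf_idf(docs)

print(docs[0])
print(weights[0])
```

In this toy corpus, a term that appears in only one document receives a higher weight than a term of equal frequency that appears in several documents, which is the behaviour TF-IDF relies on to highlight distinctive words.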

Article Details

How to Cite
Amalina, F., Zaki, F., Altarturi, H. H. M., Hanif, H., & Anuar, N. B. (2025). MOTEC: The Malay Offensive Text Classification using Extra Tree and Language Standardization. Malaysian Journal of Computer Science, 38(1), 82–99. Retrieved from https://adab.um.edu.my/index.php/MJCS/article/view/56105
Section
Articles