Prompt-Based Data Augmentation with Large Language Models for Indonesian Gender-Based Hate Speech Detection

Muhammad Amien Ibrahim; Faisal; Zefanya Delvin Sulistiya; Tora Sangputra Yopie Winarto

doi:10.3844/jcssp.2024.819.826

Research Article Open Access

Muhammad Amien Ibrahim¹, Faisal², Zefanya Delvin Sulistiya¹ and Tora Sangputra Yopie Winarto¹

¹ Department of Computer Science, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
² Department of Mathematics, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia

Abstract

The growing volume of content on social media makes automatic moderation crucial for preserving a healthy online community and reducing the spread of offensive and abusive content, such as gender-based hate speech. Developing automated social media moderation with machine learning demands a large and balanced dataset. However, data scarcity and class imbalance have hindered the development of gender-based hate speech detection for Indonesian Twitter communities, and creating and annotating a new dataset would be time-consuming and costly. One practical alternative is to use data augmentation to help address the under-representation of the minority class. This study investigates how prompt-based data augmentation with a large language model can generate organic tweet samples for gender-based hate speech detection. It also examines whether labels are preserved in the augmented tweet samples. Compared with the benchmark back translation approach, the results show that prompt-based data augmentation with a large language model can generate new, organic tweet samples while preserving labels and avoiding memorization. With conventional machine learning models, prompt-based augmentation achieves accuracy competitive with back translation. These results suggest that prompting a large language model is an alternative data augmentation strategy that yields new tweet samples with less memorization, maintains label integrity, and achieves competitive accuracy.
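As a rough illustration of the workflow the abstract describes, the Python sketch below builds an augmentation prompt from minority-class seed tweets and flags generated samples that copy a seed too closely. The abstract does not give the prompt wording, the language model, or the memorization measure used in the study, so the function names, the prompt text, the trigram overlap measure, and the 0.5 threshold are all hypothetical. The call to the language model is deliberately left out.

```python
from collections import Counter


def minority_label(labels):
    """Return the least frequent label in the dataset."""
    counts = Counter(labels)
    return min(counts, key=counts.get)


def build_prompt(seed_tweets, label_name, n_new=5):
    """Assemble a few-shot prompt (hypothetical wording, not the paper's)."""
    examples = "\n".join(f"- {tweet}" for tweet in seed_tweets)
    return (
        f"The following Indonesian tweets are labeled '{label_name}':\n"
        f"{examples}\n"
        f"Write {n_new} new Indonesian tweets that carry the same label. "
        "Do not copy the examples."
    )


def ngram_set(text, n=3):
    """Set of lowercase whitespace-token n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate, source, n=3):
    """Fraction of the candidate's n-grams that also occur in the source."""
    cand = ngram_set(candidate, n)
    if not cand:
        return 0.0
    return len(cand & ngram_set(source, n)) / len(cand)


def is_memorized(candidate, sources, threshold=0.5, n=3):
    """Flag a generated sample that overlaps heavily with any seed tweet.

    The threshold is an assumed value chosen only for illustration.
    """
    return any(overlap_ratio(candidate, s, n) >= threshold for s in sources)


if __name__ == "__main__":
    # Placeholder texts and labels stand in for real annotated tweets.
    tweets = ["contoh tweet satu", "contoh tweet dua", "contoh tweet tiga"]
    labels = ["neutral", "neutral", "hate"]
    target = minority_label(labels)
    seeds = [t for t, y in zip(tweets, labels) if y == target]
    print(build_prompt(seeds, target, n_new=3))
```

Generated samples that pass the overlap check would still need a separate label check, for example manual review of a sample, before being added to the training set.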

Journal of Computer Science

Volume 20 No. 8, 2024, 819-826

Submitted On: 5 March 2024 Published On: 27 May 2024

How to Cite: Ibrahim, M. A., Faisal, Sulistiya, Z. D. & Winarto, T. S. Y. (2024). Prompt-Based Data Augmentation with Large Language Models for Indonesian Gender-Based Hate Speech Detection. Journal of Computer Science, 20(8), 819-826. https://doi.org/10.3844/jcssp.2024.819.826

Copyright: © 2024 Muhammad Amien Ibrahim, Faisal, Zefanya Delvin Sulistiya and Tora Sangputra Yopie Winarto. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Hate Speech Detection
Data Augmentation
Large Language Models