Exploring Data Augmentation for Gender-Based Hate Speech Detection
- 1 Department of Computer Science, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
- 2 Department of Statistics, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
Abstract
Social media moderation is crucial for establishing healthy online communities and ensuring online safety from hate speech and offensive language. In many cases, hate speech targets a specific gender and may be expressed in many different languages on social media platforms, for example in Indonesian on Twitter. However, difficulties such as data scarcity and class imbalance in gender-based hate speech datasets of Indonesian tweets have slowed the development and implementation of automatic social media moderation. Obtaining more samples may be costly in terms of the resources required to gather and annotate the data. This study examines the use of data augmentation methods to increase the size of a text dataset while maintaining the quality of the augmented data. Three augmentation strategies are explored: random insertion, back translation, and a sequential combination of back translation and random insertion. Additionally, the study examines whether the augmented data preserve their original labels. The results show that classification models trained on data augmented with the random insertion strategy outperform those trained with the other approaches. In terms of label preservation, all three augmentation approaches are shown to preserve labels sufficiently without compromising the meaning of the augmented data. The findings suggest that, by increasing dataset size while preserving the original labels, data augmentation could help address issues such as data scarcity and dataset imbalance.
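As a rough illustration of the random insertion strategy named in the abstract, the sketch below inserts synonyms of words already present in a sentence at random positions and copies the original label to the augmented sample. It is a minimal sketch under stated assumptions, not the authors' implementation: the synonym table `SYNONYMS`, the function names `random_insertion` and `augment_dataset`, and the whitespace tokenization are all hypothetical, and the abstract does not specify the lexical resource used to choose the inserted words.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only);
# a real setup would rely on a thesaurus or lexical resource for the target language.
SYNONYMS = {
    "bad": ["awful", "terrible"],
    "happy": ["glad", "cheerful"],
}


def random_insertion(tokens, n_insertions, synonyms, rng):
    """Return a copy of `tokens` with up to `n_insertions` synonyms inserted.

    Each inserted word is a synonym of a randomly chosen token that has an
    entry in `synonyms`, placed at a random position. If no token has a
    synonym, the tokens are returned unchanged.
    """
    augmented = list(tokens)
    candidates = [t for t in tokens if t in synonyms]
    if not candidates:
        return augmented
    for _ in range(n_insertions):
        source = rng.choice(candidates)
        new_word = rng.choice(synonyms[source])
        # randint is inclusive on both ends, so the word may also be appended.
        position = rng.randint(0, len(augmented))
        augmented.insert(position, new_word)
    return augmented


def augment_dataset(samples, n_insertions, synonyms, seed=0):
    """Return the original (text, label) pairs plus one augmented copy each.

    The augmented copy keeps the label of its source sample, which is the
    label-preservation assumption the abstract sets out to check.
    """
    rng = random.Random(seed)
    out = []
    for text, label in samples:
        out.append((text, label))
        new_tokens = random_insertion(text.split(), n_insertions, synonyms, rng)
        out.append((" ".join(new_tokens), label))
    return out


if __name__ == "__main__":
    data = [("that is bad", 1), ("i am happy today", 0)]
    for text, label in augment_dataset(data, n_insertions=1, synonyms=SYNONYMS):
        print(label, text)
```

Because the original tokens are never removed or reordered, the augmented sentence stays close to its source, which is one plausible reason this kind of augmentation tends to keep the original label. Whether it actually does so for a given dataset still has to be checked, as the study does.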
DOI: https://doi.org/10.3844/jcssp.2023.1222.1230
Copyright: © 2023 Muhammad Amien Ibrahim, Samsul Arifin and Eko Setyo Purwanto. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Dataset
- Data Augmentation
- Hate Speech Detection