Long Short-Term Memory Model for Classification of English-PtBR Cross-Lingual Hate Speech

Thiago D. Bispo; Hendrik T. Macedo; Flávio de O. Santos; Rafael P. da Silva; Leonardo N. Matos; Bruno O.P. Prado; Gilton J.F. da Silva; Adolfo Guimarães

doi:10.3844/jcssp.2019.1546.1571

Research Article Open Access

Thiago D. Bispo¹, Hendrik T. Macedo², Flávio de O. Santos³, Rafael P. da Silva², Leonardo N. Matos², Bruno O.P. Prado², Gilton J.F. da Silva² and Adolfo Guimarães⁴

¹ Instituto Federal de Sergipe, Brazil
² Universidade Federal de Sergipe, Brazil
³ Universidade Federal de Pernambuco, Brazil
⁴ Universidade Tiradentes, Brazil

Abstract

Automatic and accurate recognition of hate speech is a difficult task. Beyond the inherent ambiguity of natural language, it requires a deep understanding of linguistic structure. Discriminatory discourse usually avoids typical expressions and often relies on sarcasm, so good world knowledge and assessment of context are essential. Several approaches have been proposed to automate hate speech recognition. Many of them combine strategies to achieve better results: character-based or word-based N-grams; lexical features such as the presence or absence of negative words, classes, or expressions indicative of insult; punctuation marks; repetition of letters; the presence of emoji; and so on. Linguistic features such as POS tagging, used alone, have proven ineffective. The recent use of neural networks to create a distributed representation of the sentences within a hate speech corpus is a promising path. Unfortunately, such corpora are hard to obtain: hate speech corpora are rarely available for languages other than English. This work proposes a cross-lingual approach to automatically recognize hate speech in Portuguese by leveraging the knowledge in English corpora. A deep Long Short-Term Memory (LSTM) model was trained, and many experimental scenarios were set up involving embeddings, TF-IDF, N-grams, the GloVe vocabulary and so on. Finally, a Gradient Boosting Decision Tree (GBDT) was used to improve classification results. We achieved accuracy of up to 70% in the best scenarios. Two important contributions of this work are: (i) an effective approach to deal with the lack of hate speech corpora in the desired language and (ii) a hate speech database in Portuguese as a contribution to the research community.
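
To make the character N-gram and TF-IDF features mentioned above concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names `char_ngrams` and `tfidf_vectors`, the trigram default, the unsmoothed weighting tf × log(N / df), and the toy sentences are all assumptions chosen for illustration, since the abstract does not state the exact weighting scheme.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Compute one sparse TF-IDF dict per document over character n-grams.

    Assumption: plain, unsmoothed weighting tf * log(N / df); real
    toolkits usually add smoothing and normalization.
    """
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: number of documents containing each n-gram.
    df = Counter(gram for c in counts for gram in c)
    num_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append({
            gram: (freq / total) * math.log(num_docs / df[gram])
            for gram, freq in c.items()
        })
    return vectors


# Hypothetical toy corpus mixing English and Portuguese sentences.
toy_docs = ["good morning", "good evening", "bom dia"]
toy_vectors = tfidf_vectors(toy_docs, n=3)
```

In this sketch, N-grams shared by every document receive a weight of zero, while N-grams specific to one document receive the highest weights. Sparse vectors of this kind could then be passed to a downstream classifier; the LSTM and GBDT stages described in the abstract are not shown here.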

Journal of Computer Science

Volume 15 No. 10, 2019, 1546-1571

DOI: https://doi.org/10.3844/jcssp.2019.1546.1571

Submitted On: 28 August 2019
Published On: 1 November 2019

How to Cite: Bispo, T. D., Macedo, H. T., Santos, F. O., da Silva, R. P., Matos, L. N., Prado, B. O., da Silva, G. J. & Guimarães, A. (2019). Long Short-Term Memory Model for Classification of English-PtBR Cross-Lingual Hate Speech. Journal of Computer Science, 15(10), 1546-1571. https://doi.org/10.3844/jcssp.2019.1546.1571

Copyright: © 2019 Thiago D. Bispo, Hendrik T. Macedo, Flávio de O. Santos, Rafael P. da Silva, Leonardo N. Matos, Bruno O.P. Prado, Gilton J.F. da Silva and Adolfo Guimarães. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Hate Speech
Portuguese Language
Deep Learning
(Bi) LSTM
GBDT