DAD: A Detailed Arabic Dataset for Online Text Recognition and Writer Identification, a New Type

Said S. Saloum

doi:10.3844/jcssp.2021.19.32

Research Article Open Access

Said S. Saloum¹

¹ Jouf University, Saudi Arabia

Abstract

This paper presents a novel Arabic dataset that accounts for the characteristics of the Arabic language and fills gaps left by existing datasets. Conventional datasets treat Arabic in much the same way as Latin-script languages: they either delete diacritics and supplement marks, treating them as defects, or keep them without considering their actual meaning. Yet more than half of all Arabic characters carry diacritics above or below the letter body. To bridge these gaps, this work presents the Detailed Arabic Dataset (DAD). The additional marks included in this dataset are the single dot, two dots "-", three dots "^", Hamza, and two supplement marks: the bar for Tah or Zah and the complement bar for Kaf. A dedicated application, OFMArabicDatasetBuilder, was built to generate the dataset for Arabic online recognition and writer identification. In total, the ground truth contains 93,064 entries based on sub-words and letter parts (not on words or lines, as in other datasets). The dataset gives researchers a strong tool for online Arabic text recognition, especially in the segmentation phase, and for writer identification. The paper also presents benchmarking results obtained with a k-nearest neighbours classifier on DAD.

Journal of Computer Science

Volume 17 No. 1, 2021, 19-32

DOI: https://doi.org/10.3844/jcssp.2021.19.32

Submitted On: 2 November 2020
Published On: 21 January 2021

How to Cite: Saloum, S. S. (2021). DAD: A Detailed Arabic Dataset for Online Text Recognition and Writer Identification, a New Type. Journal of Computer Science, 17(1), 19-32. https://doi.org/10.3844/jcssp.2021.19.32

Copyright: © 2021 Said S. Saloum. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Keywords

Arabic Dataset
Arabic Benchmark
Arabic Recognition
Arabic Writer Identification
Diacritics Marks
Hamza
Supplement Marks
Tah
Zah