UzNER: A Human-Reviewed Benchmark for Uzbek Named Entity Recognition With Gazetteer-Augmented Transformer Models
- 1 Faculty of Mechanics and Mathematics, Novosibirsk State University, Novosibirsk, Russia
- 2 Faculty of Computer Engineering, Urgench State University, Urgench, Uzbekistan
- 3 Federal Research Center for Information and Computational Technologies, Novosibirsk, Russia
- 4 Faculty of Software Engineering, University of Information Technologies, Tashkent, Uzbekistan
Abstract
UzNER-100K is a large-scale, human-reviewed benchmark for Uzbek named entity recognition. It comprises 114,269 sentences in total (100,000 of them for training) and 200,083 entity mentions across 18 fine-grained entity types. The corpus was constructed through an LLM-assisted, expert-reviewed annotation pipeline that achieved strong reliability on the main audit subset while substantially reducing corpus-construction effort. The benchmark includes a standard test split, a gold-audited subset, and a hard subset designed to stress long, ambiguous, and structurally complex cases. We evaluate 10 Uzbek NER systems spanning recurrent, monolingual Uzbek, multilingual transformer, and hybrid architectures. The best model, XLM-R + Gazetteer + CRF, reaches 91.03 Micro-F1 on the standard test set, 89.67 on the gold-audited subset, and 83.21 on the hard subset. Quality control included a dedicated inter-annotator agreement audit, which yielded 91.3% span-level agreement, 93.7% entity-type agreement, and a Cohen’s kappa of 0.914. In addition, a qualitative native-speaker assessment confirmed the linguistic naturalness of the model outputs while highlighting remaining challenges in legal, administrative, and event-related expressions.
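For readers unfamiliar with the two headline metrics, the following is a minimal sketch of how entity-level Micro-F1 and Cohen's kappa are commonly computed. It assumes BIO-tagged sequences and exact-match span scoring; the abstract does not state the tagging scheme or matching criterion, so these are assumptions. The function names and the toy labels (`PER`, `LOC`, `ORG`) are hypothetical and not taken from the paper.

```python
from collections import Counter


def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold_sentences, pred_sentences):
    """Exact-match entity-level Micro-F1, pooling counts over all sentences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)   # spans with matching boundaries and type
        fp += len(p - g)   # predicted spans not in the gold set
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label distribution.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): one sentence, one entity type confused.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"]]
f1 = micro_f1(gold, pred)

annotator_a = ["PER", "PER", "LOC", "LOC"]
annotator_b = ["PER", "PER", "LOC", "PER"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

Pooling true positives, false positives, and false negatives before computing precision and recall is what makes the score "micro"; frequent entity types therefore weigh more than rare ones.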
DOI: https://doi.org/10.3844/jcssp.2026.1894.1911
Copyright: © 2026 Bobur Saidov, Vladimir Barakhnin, Zarnigor Fayzullaeva, Umid Ibragimov and Ulugbek Tursunov. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Keywords
- Uzbek NER
- Low-Resource NLP
- Benchmark Dataset
- Multilingual Transformers
- Gazetteer-Enhanced Decoding