
Automatic Text Normalization in Uzbek: Problems, Tools, and Solutions
Abstract
In recent years, research in Natural Language Processing (NLP) has increased the demand for automated text analysis across many languages, including Uzbek. Texts written in Uzbek are multi-form, morphologically complex, and stylistically diverse, which poses challenges for automatic analysis. This article focuses on the automatic normalization of Uzbek texts and examines the linguistic and technological issues that arise in the process. Complex morphological structures, polyform words, dialectal variants, Cyrillic-Latin script differences, and non-standard expressions all complicate normalization. The results of this research contribute to deeper digital processing of the Uzbek language and to improving the quality of systems for machine translation, speech-to-text conversion, and text analysis.
Keywords
Uzbek language, text normalization, natural language processing
Copyright License
Copyright (c) 2025 Sobirova Nazira G‘anijon qizi

This work is licensed under a Creative Commons Attribution 4.0 International License.