Articles | Open Access | https://doi.org/10.37547/ijll/Volume05Issue06-33

Automatic Text Normalization in Uzbek: Problems, Tools, And Solutions

Sobirova Nazira G‘anijon qizi, PhD candidate at Alisher Navoiy Tashkent State University of Uzbek Language and Literature, Uzbekistan

Abstract

In recent years, research in Natural Language Processing (NLP) has increased the demand for automated text analysis in many languages, including Uzbek. Uzbek texts are morphologically complex, stylistically diverse, and rich in variant word forms, all of which pose challenges for automatic analysis. This article focuses on the automatic normalization of Uzbek texts and examines the linguistic and technological issues that arise in the process. Complex morphological structures, words with multiple forms, dialectal variants, differences between the Cyrillic and Latin scripts, and non-standard expressions all complicate normalization. The results contribute to deeper digital processing of the Uzbek language and to improving the quality of machine translation, speech-to-text, and text analysis systems.
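
Two of the issues named above, Cyrillic-Latin script differences and non-standard spellings, can be illustrated with a minimal Python sketch. It is not the article's method. The function names, the list of apostrophe variants, and the partial transliteration table are assumptions made for illustration. A real converter would need context rules (for example for "е", "ц", and the hard sign) that are omitted here.

```python
import re
import unicodedata

# Modifier letter U+02BB, used in the Uzbek Latin letters oʻ and gʻ.
OKINA = "\u02BB"

# Apostrophe-like characters often typed instead of U+02BB
# (an illustrative list, not an exhaustive one).
APOSTROPHE_VARIANTS = "'`\u2018\u2019\u02BC"
_APOSTROPHE_RE = re.compile("([oOgG])[" + APOSTROPHE_VARIANTS + "]")

# Partial Cyrillic-to-Latin table (assumption: context-free letters only).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ж": "j",
    "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "x", "ш": "sh", "ч": "ch",
    "ў": "o" + OKINA, "ғ": "g" + OKINA, "қ": "q", "ҳ": "h",
}


def cyrillic_to_latin(text):
    """Transliterate mapped Cyrillic letters; leave everything else as-is."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)                # unmapped character: keep unchanged
        elif ch.isupper():
            out.append(lat.capitalize())  # "Ш" -> "Sh", "Ў" -> "Oʻ"
        else:
            out.append(lat)
    return "".join(out)


def normalize_apostrophes(text):
    """Heuristic: an apostrophe variant right after o or g becomes U+02BB.

    An apostrophe in that position that marks something else would be
    converted incorrectly; a lexicon check would be needed to avoid that.
    """
    return _APOSTROPHE_RE.sub(lambda m: m.group(1) + OKINA, text)


def normalize(text):
    """Compose Unicode (NFC), unify the script, then unify apostrophes."""
    text = unicodedata.normalize("NFC", text)
    return normalize_apostrophes(cyrillic_to_latin(text))
```

With this sketch, inputs such as `normalize("o'zbek")` and `normalize("қишлоқ")` end up in a single Latin-script spelling, so later processing steps see one form instead of several.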

Keywords

Uzbek language, text normalization, natural language processing

How to Cite

Sobirova Nazira G‘anijon qizi. (2025). Automatic Text Normalization in Uzbek: Problems, Tools, And Solutions. International Journal Of Literature And Languages, 5(06), 114–118. https://doi.org/10.37547/ijll/Volume05Issue06-33