Abstract
Part-of-speech (POS) tagging is a core task in natural language processing (NLP) and a foundation for applications such as information retrieval, named entity recognition (NER), and machine translation. However, owing to the lack of standardized tools and annotated resources, it has not been explored for Pashto, which is inherently a low-resource language (LRL). To fill this gap, we introduce a hybrid POS tagging framework for Pashto. To address the scarcity of annotated resources and the absence of an adequate POS tagset, we developed a manually annotated dataset of 10,660 sentences and 242,449 words/tokens, together with a customized tagset of 37 POS tags designed to represent the linguistic nuances of Pashto. The dataset includes words with multiple meanings (ambiguous words), which makes it more comprehensive by covering almost all possible words and their associated tags. Conventional rule-based and machine learning (ML) models often struggle to capture the morphological and contextual representations and fine-grained linguistic features of Pashto. To overcome these challenges, we propose a hybrid architecture for automatic Pashto POS tagging that combines contextual embeddings (PashtoBERT), a Pashto-specific morphological feature encoder, a BiLSTM, a syntax-aware transformer, and sequential modelling with a conditional random field (CRF). The approach is motivated by the need to capture hierarchical linguistic information across morphological, syntactic, and semantic dimensions, resulting in more robust and accurate language-specific POS tagging. The proposed model achieved a testing accuracy of 97.86% and an F1 score of 97.82%, the highest performance reported by any model for Pashto POS tagging to date. These experimental results demonstrate the significance and robustness of the proposed model.
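To illustrate the role a CRF layer typically plays at the end of such a tagging pipeline, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF. It is not the authors' implementation: the three tag names, the emission scores (standing in for the output of the upstream encoder stack), and the transition scores are all hypothetical values chosen for illustration, and the real tagset has 37 tags.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag index sequence for one sentence.

    emissions[t][j]   : score of tag j at token t (e.g. from an encoder).
    transitions[i][j] : score of moving from tag i to tag j.
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each tag
    backpointers = []                   # backpointers[t-1][j] = best previous tag
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Hypothetical three-tag example (assumed values, not taken from the paper).
TAGS = ["NOUN", "VERB", "ADJ"]
emissions = [
    [2.0, 0.5, 1.0],   # token 1
    [1.0, 1.2, 0.9],   # token 2
    [0.3, 2.5, 0.2],   # token 3
]
transitions = [
    [0.5, 1.0, -0.5],  # from NOUN
    [0.2, -1.0, 0.1],  # from VERB
    [1.2, -0.5, 0.0],  # from ADJ
]
predicted = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
print(predicted)  # ['NOUN', 'NOUN', 'VERB']
```

In this toy example, picking the best tag for each token independently would label the second token VERB, whereas the jointly decoded sequence labels it NOUN because the assumed VERB-to-VERB transition is penalized. This is the kind of sequence-level consistency that a CRF output layer is generally used to enforce.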
| Original language | English |
|---|---|
| Pages (from-to) | 222-229 |
| Number of pages | 8 |
| Journal | Proceedings of the IEEE International Conference on Big Data and Smart Computing, BIGCOMP |
| Issue number | 2026 |
| State | Published - 2026 |
| Event | 2026 IEEE International Conference on Big Data and Smart Computing, BigComp 2026 - Guangzhou, China. Duration: 2 Feb 2026 → 5 Feb 2026 |
Keywords
- Artificial intelligence
- Hybrid DL models
- Machine learning
- Pashto POS dataset
- Pashto POS tagging
Fingerprint
Dive into the research topics of 'HybPashto-POS: A Hybrid Framework for Improved Part of Speech Tagging in the Low-Resourced Pashto Language'. Together they form a unique fingerprint.