HybPashto-POS: A Hybrid Framework for Improved Part of Speech Tagging in the Low-Resourced Pashto Language

  • Yar Muhammad
  • Richong Zhang*
  • Asmat Ullah
  • *Corresponding author for this work
  • Beihang University

Research output: Contribution to journal › Conference article › peer-review

Abstract

Part-of-speech (POS) tagging is a fundamental task in natural language processing (NLP) and serves as a foundation for applications such as information retrieval, named entity recognition (NER), and machine translation. However, owing to the lack of standardized tools and annotated resources, it has not been explored for the Pashto language. To fill this gap, we introduce a hybrid POS tagging framework for Pashto, a low-resource language (LRL). To address the scarcity of annotated resources and the lack of an adequate POS tagset, we developed a manually annotated dataset of 10,660 sentences and 242,449 words/tokens, together with a customized tagset of 37 POS tags that represents the linguistic nuances of Pashto. The dataset includes ambiguous words (words with multiple meanings), which makes it more comprehensive by covering almost all possible words and their associated tags. Furthermore, conventional rule-based and machine learning (ML) models often struggle to capture the morphological and contextual representations and the fine-grained linguistic features of Pashto. To overcome these challenges, we propose a hybrid architecture for automatic Pashto POS tagging that combines contextual embeddings (PashtoBERT), a Pashto-specific morphological feature encoder, a BiLSTM, a syntax-aware transformer, and sequential modelling with a conditional random field (CRF). The approach is motivated by the need to capture hierarchical linguistic information across morphological, syntactic, and semantic dimensions, resulting in more robust and accurate language-specific POS tagging. The proposed model achieved a testing accuracy of 97.86% and an F1 score of 97.82%, which is the highest performance of any model for Pashto POS tagging to date. The experimental outcomes demonstrate the significance and robustness of the proposed model.
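The abstract names a CRF as the final sequence-modelling stage but does not give its details. As a rough illustration of what such a layer contributes, the sketch below runs Viterbi decoding over per-token tag scores plus tag-to-tag transition scores. It is not the authors' implementation: the function name `viterbi_decode`, the two-tag set, and all scores are hypothetical toy values chosen only to show how a transition score can override a locally preferred tag.

```python
from typing import Dict, List, Tuple


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[i][t] is the score of tag t for token i (in a hybrid tagger these
    would come from the upstream encoder; here they are toy numbers).
    transitions[(p, c)] is the score of moving from tag p to tag c;
    missing pairs are treated as 0.0.
    """
    if not emissions:
        return []

    # best[i][t]: score of the best path that ends in tag t at token i.
    best = [{t: emissions[0][t] for t in tags}]
    back: List[Dict[str, str]] = []

    for scores in emissions[1:]:
        cur_best: Dict[str, float] = {}
        cur_back: Dict[str, str] = {}
        for cur in tags:
            # Pick the previous tag that maximises path score into `cur`.
            prev = max(
                tags,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            cur_best[cur] = (
                best[-1][prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            )
            cur_back[cur] = prev
        best.append(cur_best)
        back.append(cur_back)

    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example (hypothetical tags and scores, not taken from the paper).
TAGS = ["NOUN", "VERB"]
EMISSIONS = [
    {"NOUN": 2.0, "VERB": 1.0},
    {"NOUN": 1.0, "VERB": 0.9},  # token 2 slightly prefers NOUN on its own
]
TRANSITIONS = {("NOUN", "NOUN"): -2.0, ("NOUN", "VERB"): 1.0}

print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))
```

Tagging each token independently would label the second token NOUN, whereas decoding over the whole sequence selects VERB because the NOUN to VERB transition is rewarded. This is the general reason sequence-level decoding is often placed on top of per-token scores.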

Original language: English
Pages (from-to): 222-229
Number of pages: 8
Journal: Proceedings of the IEEE International Conference on Big Data and Smart Computing, BIGCOMP
Issue number: 2026
State: Published - 2026
Event: 2026 IEEE International Conference on Big Data and Smart Computing, BigComp 2026 - Guangzhou, China
Duration: 2 Feb 2026 to 5 Feb 2026

Keywords

  • Artificial intelligence
  • Hybrid DL models
  • Machine learning
  • Pashto POS dataset
  • Pashto POS tagging
