HybPashto-POS: A Hybrid Framework for Improved Part of Speech Tagging in the Low-Resourced Pashto Language

  • Yar Muhammad
  • Richong Zhang*
  • Asmat Ullah
  • *Corresponding author for this work
  • Beihang University

Research output: Contribution to journal, conference article, peer-reviewed

Abstract

Part-of-speech (POS) tagging is a core task in natural language processing (NLP) and underpins applications such as information retrieval, named entity recognition (NER), and machine translation. However, owing to the lack of standardized tools and annotated resources, it has not been explored for the Pashto language. To fill this gap, we introduce a hybrid POS tagging framework for Pashto, a low-resource language (LRL). To address the scarcity of annotated resources and the lack of an adequate POS tagset, we developed a manually annotated dataset of 10,660 sentences and 242,449 words/tokens, together with a customized tagset of 37 POS tags that represents the linguistic nuances of Pashto. The dataset includes words with multiple meanings (ambiguous words), which makes it more comprehensive by covering almost all possible words and their associated tags. Furthermore, conventional rule-based and machine learning (ML) models often struggle to capture the morphological and contextual representations and fine-grained linguistic features of Pashto. To overcome these challenges, we propose a hybrid architecture that integrates contextual embeddings (PashtoBERT), a Pashto-specific morphological feature encoder, a BiLSTM, a syntax-aware transformer, and sequential modelling with a conditional random field (CRF) for automatic POS tagging in Pashto. The approach is motivated by the need to capture hierarchical linguistic information across morphological, syntactic, and semantic dimensions, yielding more robust and accurate language-specific POS tagging. The proposed model achieves a test accuracy of 97.86% and an F1 score of 97.82%, the highest performance reported for any Pashto POS tagging model to date. The experimental results demonstrate the effectiveness and robustness of the proposed model.
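The final stage named in the abstract, sequential modelling with a CRF, can be illustrated with a minimal sketch of Viterbi decoding for a linear-chain CRF. This is not the authors' implementation: the three-tag set, the emission scores (standing in for the output of the upstream encoders), and the transition scores are all hypothetical toy values chosen for illustration, and the function name `viterbi_decode` is an assumption.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][tag]  : score of `tag` at token position t
    transitions[a][b]  : score of moving from tag a to tag b
    """
    if not emissions:
        return []
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []  # back[t - 1][tag] = previous tag on that best path
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best-scoring final tag
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy inputs (NOT the paper's 37-tag tagset or learned scores)
tags = ["ADJ", "NOUN", "VERB"]
transitions = {
    "ADJ": {"ADJ": 0.0, "NOUN": 1.0, "VERB": -1.0},
    "NOUN": {"ADJ": -0.5, "NOUN": 0.0, "VERB": 1.0},
    "VERB": {"ADJ": 0.0, "NOUN": 0.5, "VERB": -2.0},
}
emissions = [
    {"ADJ": 1.0, "NOUN": 0.8, "VERB": 0.1},
    {"ADJ": 0.2, "NOUN": 0.9, "VERB": 1.0},
    {"ADJ": 0.1, "NOUN": 0.3, "VERB": 1.2},
]

print(viterbi_decode(emissions, transitions, tags))  # ['ADJ', 'NOUN', 'VERB']
```

In this toy case, picking the best tag per token independently would give ADJ, VERB, VERB, whereas decoding over the whole sequence prefers ADJ, NOUN, VERB because the transition scores penalize VERB following VERB. That sequence-level consistency is the usual reason for placing a CRF after token-level encoders.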

Original language: English
Pages (from-to): 222-229
Number of pages: 8
Journal: Proceedings of the IEEE International Conference on Big Data and Smart Computing, BIGCOMP
2026
DOI
Publication status: Published - 2026
Event: 2026 IEEE International Conference on Big Data and Smart Computing, BigComp 2026 - Guangzhou, China
Duration: 2 Feb 2026 to 5 Feb 2026
