Learning to Follow Domain-specific Instruction with Verifiable Rewards

  • Beihang University

Research output: Contribution to journal, Conference article, peer-review

Abstract

In this paper, we address the challenge of enabling large language models (LLMs) to effectively follow domain-specific instructions, a critical requirement for their successful deployment across various industries. We propose a novel pipeline for constructing verifiable instructions tailored to specific domains. This pipeline consists of three key stages: the creation of meta-requirement templates, the generation of custom instructions using GPT-4 with seed prompts, and manual refinement to ensure clarity, precision, and relevance. A unique aspect of our approach is the incorporation of verifiability into the instruction-following tuning process. Specifically, we design a verified reward mechanism within the Direct Preference Optimization (DPO) framework. This mechanism leverages the ability to automatically verify whether the generated responses adhere to the given instructions. By integrating this verified reward, we enable more effective alignment of LLM behavior with domain-specific requirements, ensuring higher reliability and consistency in outputs. Our study also explores various strategies to enhance the instruction-following capabilities of LLMs, with a focus on fine-tuning methodologies and data augmentation techniques. We provide a comprehensive analysis of domain-specific requirements to better understand how LLMs can be adapted for practical, real-world applications. The efficacy of our approach is empirically validated on GPT-4 and the LLaMA2 series. Notably, the LLaMA-7B model demonstrates a significant performance improvement of over 19% compared to zero-shot settings, underscoring the effectiveness of our methods. This work contributes to the field by bridging the gap between the general capabilities of LLMs and the nuanced demands of domain-specific instruction following. Our findings pave the way for more reliable and adaptable LLM applications across diverse industries.
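The abstract does not spell out how the verified reward enters DPO, so the following is only a minimal sketch under one plausible assumption: that an automatic checker labels sampled responses as compliant or non-compliant, and that each compliant/non-compliant combination becomes a (chosen, rejected) preference pair for DPO training. The names `Instruction`, `PreferencePair`, `max_words_and_keyword`, and `build_pairs` are hypothetical and do not come from the paper, and the word-limit-plus-keyword constraint is an illustrative stand-in for a domain-specific requirement.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List


@dataclass
class Instruction:
    """A prompt bundled with a programmatic check (hypothetical structure)."""
    prompt: str
    verify: Callable[[str], bool]


@dataclass
class PreferencePair:
    """One DPO training example: a preferred and a dispreferred response."""
    prompt: str
    chosen: str
    rejected: str


def max_words_and_keyword(max_words: int, keyword: str) -> Callable[[str], bool]:
    """Build a checker for an assumed constraint: a word limit plus a required keyword."""
    def check(response: str) -> bool:
        within_limit = len(response.split()) <= max_words
        has_keyword = keyword.lower() in response.lower()
        return within_limit and has_keyword
    return check


def build_pairs(instruction: Instruction, candidates: List[str]) -> List[PreferencePair]:
    """Pair every verified response with every response that fails verification."""
    passed = [c for c in candidates if instruction.verify(c)]
    failed = [c for c in candidates if not instruction.verify(c)]
    return [
        PreferencePair(instruction.prompt, good, bad)
        for good, bad in product(passed, failed)
    ]


if __name__ == "__main__":
    instruction = Instruction(
        prompt="State the dosage in at most 8 words and use the word 'dosage'.",
        verify=max_words_and_keyword(max_words=8, keyword="dosage"),
    )
    candidates = [
        "Dosage is 5 mg daily.",
        "Take the medicine as your doctor recommends each day.",
        "The recommended dosage depends on many factors including age weight and history.",
    ]
    for pair in build_pairs(instruction, candidates):
        print(pair.chosen, "| preferred over |", pair.rejected)
```

Pairs built this way could be passed to any DPO implementation that accepts prompt/chosen/rejected triples. If no candidate passes, or none fails, `build_pairs` returns an empty list, so such prompts would simply contribute no training signal in this sketch.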

Keywords

  • Domain Adaptation
  • Instruction Following
  • Large Language Models
  • Verifiable Rewards
