基于大语言模型和机器学习模型协作的特征筛选管道助力缓蚀剂精准预测

Translated title of the contribution: Collaborative feature screen with large language and machine learning model to enhance corrosion inhibitor prediction
  • Jingzhi Yang
  • Diandian Liu
  • Haiyan Gong
  • Xin Guo
  • Yuting Jin
  • Lingwei Ma
  • Dawei Zhang*
  • Xiaogang Li

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Corrosion affects every sector of the national economy, from industrial and agricultural production to defense technology. It poses a serious threat to the safety of equipment in service, leads to substantial economic losses, and presents significant risks to human life and health. Metal corrosion inhibitors can modify the surface characteristics of metals, increase the activation energy barrier of corrosion reactions, affect surface electrochemical behavior, and slow down the corrosion rate. These inhibitors have advantages such as low dosage, low cost, and high efficiency, making them one of the most widely used methods for corrosion control. However, inhibitors come in many types, and their mechanisms are complex and closely tied to environmental factors. Conventional laboratory methods, such as precise weight loss analysis and electrochemical measurements (potentiodynamic polarization and electrochemical impedance spectroscopy), are labor-intensive, time-consuming, and costly, which greatly hinders the design and application of high-performance inhibitors. There is an urgent need for a more efficient approach to advance inhibitor research. A recent paradigm shift driven by advancements in materials genome engineering (MGE) is enabling researchers to move beyond the traditional trial-and-error approach. By integrating high-throughput computational tools with fundamental chemical principles, MGE facilitates a more systematic and intelligent exploration of materials science. At the core of this transformation lies machine learning (ML), which serves as a powerful pattern recognition engine. ML algorithms can analyze vast historical experimental data to predict the performance of novel materials and uncover the often hidden, nonlinear relationships between molecular features and their functional properties. In this study, we developed a novel methodology that synergizes a state-of-the-art large language model (LLM) with a predictive ML framework.
The LLM was employed to systematically parse and extract meaningful molecular features from thousands of unstructured research papers and experimental datasets, specifically focusing on inhibitors used in CO2-saturated environments. We constructed a comprehensive corrosion inhibitor research dataset by extracting 1152 data entries from 174 peer-reviewed articles on inhibitor development and application in CO2-saturated environments. These entries contain detailed information on molecular structures, corrosion environment parameters, inhibitor concentrations, experimental temperatures, and inhibition efficiency metrics. Statistical analysis revealed that the target variables in our dataset exhibited relatively uniform distributions without significant skewness or clustering, indicating a balanced data structure that supports robust model training and generalization. Our methodology implements a two-stage feature selection strategy based on a collaborative large-small model pipeline. We first established a domain-specific knowledge framework by injecting corrosion science expertise into the DeepSeek-R1 LLM, enabling systematic analysis of unstructured scientific texts. This LLM-based approach allowed us to efficiently screen an initial set of 204 molecular descriptors down to 50 candidates that demonstrate clear relevance to CO2 corrosion inhibition mechanisms. We then applied quantitative statistical techniques using a smaller specialized model to further refine the feature set through correlation analysis and recursive feature elimination. This two-phase process reduced the final feature count to 13 non-redundant descriptors that comprehensively captured the interplay between molecular structure, inhibitor concentration, and environmental parameters. With the 13 selected features, the mean squared error of the models fell from 121 to 11.
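As a quick arithmetic check on the reported error reduction: the abstract does not state the unit of the mean squared error or the exact baseline feature set, so the labels "before" and "after" below are only placeholders for the two reported values. Under that reading, the figures correspond to the following relative reduction and root-mean-square errors:

```latex
\[
\frac{\mathrm{MSE}_{\text{before}} - \mathrm{MSE}_{\text{after}}}{\mathrm{MSE}_{\text{before}}}
  = \frac{121 - 11}{121} \approx 0.909,
\qquad
\mathrm{RMSE}_{\text{before}} = \sqrt{121} = 11,
\qquad
\mathrm{RMSE}_{\text{after}} = \sqrt{11} \approx 3.32 .
\]
```

If the target is inhibition efficiency expressed in percent (an assumption), the typical prediction error would drop from about 11 to about 3.3 percentage points.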
To validate our approach, we built a gradient boosting model incorporating both the selected molecular features and environmental parameters. We identified five representative molecules and their corresponding corrosion environments for experimental testing. The results demonstrated the good generalization ability of the model, confirming its potential for practical application in corrosion inhibitor design and development.
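The correlation-analysis part of the second screening stage can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy filter, the 0.9 threshold, the synthetic data, and the descriptor names (`logp`, `logp_copy`, `concentration`) are all assumptions made for illustration, and the recursive feature elimination step is not shown.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / (sx * sy)


def drop_redundant(features, target, threshold=0.9):
    """Greedy correlation filter (illustrative, threshold is an assumption).

    features: dict mapping descriptor name -> list of values
    target:   list of target values (e.g. inhibition efficiency)

    Descriptors are visited in order of decreasing |correlation with target|.
    A descriptor is kept only if its |correlation| with every descriptor
    already kept is below `threshold`.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic stand-in data: hypothetical descriptors, not values from the paper.
rng = random.Random(0)
n = 200
logp = [rng.gauss(0, 1) for _ in range(n)]
logp_copy = [v + rng.gauss(0, 0.01) for v in logp]  # near-duplicate of logp
concentration = [rng.gauss(0, 1) for _ in range(n)]
target = [2 * a + b + rng.gauss(0, 0.1) for a, b in zip(logp, concentration)]

features = {"logp": logp, "logp_copy": logp_copy, "concentration": concentration}
selected = drop_redundant(features, target)
print(selected)
```

On this toy data the filter keeps `concentration` and only one of the two near-duplicate descriptors, which is the kind of redundancy removal that a correlation analysis is meant to provide before a model-based elimination step.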

Translated title of the contribution: Collaborative feature screen with large language and machine learning model to enhance corrosion inhibitor prediction
Original language: Chinese (Traditional)
Pages (from-to): 2456-2469
Number of pages: 14
Journal: Gongcheng Kexue Xuebao/Chinese Journal of Engineering
Volume: 47
Issue number: 12
State: Published - Dec 2025
Externally published: Yes

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 2 - Zero Hunger
  2. SDG 3 - Good Health and Well-being
