Sheep disease knowledge graph construction method based on CaRoMHPE

张泽嘉; 孙小华; 王超; 王斌; 袁万哲; 王福顺

doi:10.11975/j.issn.1002-6819.202504180

Abstract

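Abstract: A domain knowledge graph for sheep diseases is a prerequisite for sheep disease prevention, control, and intelligent diagnosis and treatment. To address the fuzzy semantic boundaries, overlapping entity roles, and complex relational semantics of sheep disease texts, this study proposes a knowledge graph construction method based on the CaRoMHPE model (CasRel-based model combined with RoBERTa, multi-scale cross-attention mechanism, and hybrid position encoding in multi-head attention). First, based on the characteristics of the sheep disease corpus, a sheep disease dataset containing 9 entity types and 8 relation types was constructed; it covers the key entities and relations across the whole diagnosis and treatment process and provides data support for the entity-relation extraction task. Next, CasRel (cascade relational triple extraction) was taken as the base model, and RoBERTa-wwm-ext (robustly optimized BERT approach) replaced BERT (bidirectional encoder representations from transformers) as the pre-trained encoder to strengthen the model's understanding of context and its handling of complex linguistic structures. A multi-scale cross-attention mechanism was added after the subject tagging module to refine the semantic relations between entities, and hybrid position encoding (HPE) was incorporated to improve the multi-head attention mechanism, strengthening entity boundary delimitation and role discrimination in relation extraction. The results show that the precision, recall, and F1 score of the model's knowledge extraction reached 94.70%, 94.04%, and 94.37%, respectively, which are 9.14, 9.21, and 9.18 percentage points higher than those of the CasRel model, improving entity-relation extraction for sheep disease information. Finally, on the basis of the extracted triples, semantic embedding and cosine similarity were combined to eliminate duplicate synonyms and resolve potential ambiguity, producing a normalized knowledge graph that provides strong support for intelligent sheep disease diagnosis and treatment.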
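The final fusion step (embedding entity names, comparing them by cosine similarity, and unifying synonyms) can be illustrated with a minimal sketch. This is not the authors' implementation: the entity names, the relation `has_symptom`, the three-dimensional toy vectors standing in for RoBERTa embeddings, the threshold of 0.9, and the greedy merge order are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fuse_entities(names, vectors, threshold=0.9):
    """Greedily map each entity name to a canonical name.

    A name is merged into the first earlier canonical name whose embedding
    has cosine similarity >= threshold; otherwise it becomes canonical itself.
    The threshold value is a hypothetical choice, not taken from the paper.
    """
    canonical = []  # indices of names that act as canonical entities
    mapping = {}
    for i, name in enumerate(names):
        for j in canonical:
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                mapping[name] = names[j]
                break
        else:
            canonical.append(i)
            mapping[name] = name
    return mapping

def normalize_triples(triples, mapping):
    """Rewrite subjects and objects to canonical names and drop duplicates."""
    seen, result = set(), []
    for subj, rel, obj in triples:
        triple = (mapping.get(subj, subj), rel, mapping.get(obj, obj))
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result

# Toy data: hypothetical names and hand-made vectors in place of real embeddings.
names = ["sheep pox", "ovine pox", "fever"]
vectors = np.array([[1.0, 0.0, 0.0],
                    [0.98, 0.1, 0.0],
                    [0.0, 1.0, 0.0]])
mapping = fuse_entities(names, vectors)
fused = normalize_triples(
    [("sheep pox", "has_symptom", "fever"),
     ("ovine pox", "has_symptom", "fever")],
    mapping,
)
```

In this toy run the two near-parallel vectors are treated as synonyms, so the two input triples collapse into one. A real pipeline would need a threshold tuned per entity type and a review step for borderline pairs.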

Abstract: A knowledge graph of the sheep disease domain is a prerequisite for disease prevention, control, and intelligent diagnosis and treatment. Sheep disease texts, however, still pose challenges such as ambiguous semantic boundaries, overlapping entity roles, and complex relational semantics. In this study, a knowledge graph was constructed using the CaRoMHPE model (CasRel-based Model Combined with RoBERTa, Multi-scale Cross-Attention Mechanism, and Hybrid Position Encoding in Multi-Head Attention). Firstly, according to the features of the sheep disease corpus, a dataset with 9 entity types and 8 relation types was constructed. It covers the key entities and relations throughout the entire process from diagnosis to treatment and provides data support for the entity-relation extraction task. Secondly, within the Cascade Relational Triple Extraction (CasRel) framework, the Robustly Optimized BERT Approach with Whole Word Masking (RoBERTa-wwm-ext) replaced the conventional Bidirectional Encoder Representations from Transformers (BERT) as the text encoder, in order to enhance contextual understanding and the handling of complex linguistic structures. A multi-scale cross-attention mechanism was introduced after the subject tagging module. Taking the subject embeddings and the full-sentence embeddings as inputs, it computes interaction scores with multi-head attention, fuses multi-source information through residual connections and layer normalization, and extracts high-level semantic representations with a Feed-Forward Network (FFN) consisting of two fully connected layers with a GeLU activation in between. The multi-scale feature fusion module obtains low-dimensional features by linear projection, concatenates the original and compressed features along the last dimension, and restores the original dimensionality with a further linear transformation, thereby integrating global and local semantic information. This yields a more refined characterization of the semantic relations between entities and strengthens the perception of long-range dependencies and local semantic features. Additionally, a Hybrid Position Encoding (HPE), which combines absolute and relative position encoding, was incorporated into the multi-head attention mechanism. It consists of a relative position bias and Rotary Position Encoding (RoPE). The relative position bias adjusts the shape of the bias matrix dynamically according to the sequence length and maps positional relations into the attention scores between queries and keys. RoPE applies sine and cosine functions to the even and odd dimensions of the query and key vectors as a rotational transformation, explicitly encoding positional information into the attention calculation. Modeling the structural relations between words more accurately strengthens the perception of positional information and improves entity boundary recognition and role differentiation. Experimental results show that precision, recall, and F1 score on the knowledge extraction task reached 94.70%, 94.04%, and 94.37%, respectively, which are 9.14, 9.21, and 9.18 percentage points higher than those of the original CasRel model, a clear improvement in entity-relation extraction for the sheep disease domain. Finally, starting from the structured triples extracted from the text, semantic embedding vectors were generated with the RoBERTa model, and cosine similarity was computed for synonym discrimination and entity fusion. Knowledge fusion corpora were obtained for the different entity types. Synonymous expressions were identified and unified to eliminate duplicate entities and potential ambiguities, and a retrieval check confirmed that no duplicates remained. The fused triples were stored in a Neo4j graph database, resulting in a standardized, consistent, and well-structured sheep disease knowledge graph. The knowledge graph can provide reliable data support for the subsequent intelligent diagnosis and treatment of sheep diseases.
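The hybrid position encoding described in the abstract pairs a rotary transformation of queries and keys with a relative position bias added to the attention scores. The NumPy sketch below shows one way such a combination might look for a single attention head. It is an assumption-laden illustration rather than the paper's implementation: the frequency base of 10000, the clipping of offsets to `max_distance`, the fixed (non-learned) bias `table`, and all function names are hypothetical.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair of x by a position-dependent angle.

    x has shape (seq_len, dim) with dim even; row i is the vector at position i.
    """
    x = np.asarray(x, dtype=float)
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE needs an even feature dimension"
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def relative_position_bias(seq_len, table, max_distance):
    """Build a (seq_len, seq_len) bias matrix from clipped offsets j - i.

    table has 2 * max_distance + 1 entries; in a trained model it would be
    learned, here it is just an array supplied by the caller.
    """
    pos = np.arange(seq_len)
    offsets = np.clip(pos[None, :] - pos[:, None], -max_distance, max_distance)
    return table[offsets + max_distance]

def hybrid_attention_scores(q, k, table, max_distance):
    """Scaled dot-product scores with RoPE on q and k plus a relative bias."""
    seq_len, dim = q.shape
    scores = apply_rope(q) @ apply_rope(k).T / np.sqrt(dim)
    return scores + relative_position_bias(seq_len, table, max_distance)

# Toy usage with random queries and keys and an all-zero bias table.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
k = rng.normal(size=(5, 8))
scores = hybrid_attention_scores(q, k, table=np.zeros(7), max_distance=3)
```

Because the rotation angle grows linearly with position, the dot product of a rotated query and a rotated key depends only on their offset, which is what lets the bias matrix and the rotary term both express relative position. The bias matrix is rebuilt for whatever sequence length is passed in, matching the abstract's description of a shape that adapts to the sequence length.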
