Abstract:
A knowledge graph is often required in the field of sheep diseases, particularly for disease prevention, control, and intelligent diagnosis. However, sheep disease texts still pose common challenges, such as ambiguous semantic boundaries, overlapping entity roles, and complex relational semantics. In this study, a knowledge graph was constructed using the CaRoMHPE model (CasRel-based Model Combined with RoBERTa, Multi-scale Cross-Attention Mechanism, and Hybrid Position Encoding in Multi-Head Attention). Firstly, a dataset with 9 types of entities and 8 types of relationships was constructed according to the features of sheep disease corpora. It covered the key entities and their relationships throughout the entire process from disease diagnosis to treatment, providing substantial data support for the entity-relationship extraction task. Secondly, within the Cascade Relational Triple Extraction (CasRel) framework, the Robustly Optimized BERT Approach with Whole Word Masking (RoBERTa-wwm-ext) replaced the conventional Bidirectional Encoder Representations from Transformers (BERT) as the text encoder, in order to enhance contextual semantic representation and handle complex linguistic structures. A multi-scale cross-attention mechanism was introduced after the subject annotation module. Taking the subject embeddings and full-sentence embeddings as inputs, interaction scores were computed using multi-head attention, and multi-source information was fused through residual connections and layer normalization. High-level semantic representations were then extracted via a Feed-Forward Network (FFN) consisting of two fully connected layers with a GELU activation function in between. In the multi-scale feature fusion module, low-dimensional features were obtained by linear projection, and the original and compressed features were concatenated along the last dimension. 
The original dimensionality was then restored by a further linear transformation, thereby integrating global and local semantic information. This yielded a more refined characterization of the semantic relationships between entities and enhanced the perception of long-range dependencies and local semantic features. Additionally, a Hybrid Position Encoding (HPE), comprising a relative position bias and Rotary Position Encoding (RoPE), was incorporated into the multi-head attention mechanism in order to combine absolute and relative position encoding. The relative position bias dynamically adjusted the shape of the bias matrix according to the sequence length and mapped positional relationships into the attention scores between queries and keys. RoPE applied sine and cosine functions to the even and odd dimensions of the query and key vectors as a rotational transformation, explicitly encoding positional information into the attention calculation. By modeling the structural relationships between words more accurately, HPE enhanced the perception of positional information and significantly improved entity boundary recognition and role differentiation. Experimental results showed that the model achieved a precision, recall, and F1 score of 94.70%, 94.04%, and 94.37%, respectively, in the knowledge extraction task, improvements of 9.14, 9.21, and 9.18 percentage points over the original CasRel model. Entity-relationship extraction in the sheep disease domain was thus significantly optimized. Finally, for the structured triples extracted from the text, semantic embedding vectors were generated by the RoBERTa model, and cosine similarity was calculated for synonym discrimination and entity fusion. Knowledge fusion corpora were obtained for the different entity types. Synonymous expressions were effectively identified and unified to eliminate duplicate entities and potential ambiguities. 
Verification by retrieval confirmed that no duplicate entities remained. The fused triples were stored in a Neo4j graph database, resulting in a standardized, consistent, and well-structured sheep disease knowledge graph. The resulting knowledge graph can provide reliable data support for the subsequent intelligent diagnosis and treatment of sheep diseases.
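
The knowledge fusion step described above (embedding entities, comparing them by cosine similarity, and unifying synonyms before storage) can be illustrated with a minimal sketch. Everything concrete in it is an assumption for illustration only: the entity names, the relation names, the three-dimensional toy vectors standing in for RoBERTa embeddings, the 0.95 threshold, and the helper names `cosine`, `fuse_entities`, and `fuse_triples` are hypothetical and are not taken from the study.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fuse_entities(embeddings, threshold):
    """Map each entity name to a canonical name.

    Entities whose embeddings reach the similarity threshold are merged
    with a small union-find; the first-listed name of a group is kept.
    """
    order = {name: i for i, name in enumerate(embeddings)}
    parent = {name: name for name in embeddings}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                keep = min(ra, rb, key=order.get)
                drop = rb if keep == ra else ra
                parent[drop] = keep
    return {name: find(name) for name in embeddings}


def fuse_triples(triples, canonical):
    """Rewrite subjects and objects to canonical names and drop duplicates."""
    seen, fused = set(), []
    for subj, rel, obj in triples:
        triple = (canonical.get(subj, subj), rel, canonical.get(obj, obj))
        if triple not in seen:
            seen.add(triple)
            fused.append(triple)
    return fused


# Hypothetical toy vectors; in the study these would come from RoBERTa.
embeddings = {
    "sheep pox": [0.90, 0.10, 0.05],
    "ovine pox": [0.88, 0.12, 0.06],
    "fever": [0.10, 0.95, 0.20],
    "pyrexia": [0.12, 0.93, 0.22],
    "vaccination": [0.05, 0.20, 0.97],
}
# Hypothetical triples with made-up relation names.
triples = [
    ("sheep pox", "has_symptom", "fever"),
    ("ovine pox", "has_symptom", "pyrexia"),
    ("sheep pox", "prevented_by", "vaccination"),
]

canonical = fuse_entities(embeddings, threshold=0.95)  # assumed threshold
fused = fuse_triples(triples, canonical)
print(fused)
```

In this toy run the two synonymous triples collapse into one, which mirrors the duplicate elimination reported above. A real pipeline would likely tune the threshold per entity type and compare only entities of the same type.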