Introduction
Named entity recognition (NER) is a fundamental task in natural language processing (NLP) that identifies general entities such as persons, times, and locations in text. In the judicial domain, legal named entity recognition (LNER) is a specialized task that focuses on case-specific entities closely related to legal proceedings. These entities are typically terms or phrases that carry significant meaning in the legal context, such as “suspect” and “victim,” which are subtypes of the “person” entity identified in general NER. With advances in NLP, extracting named entities from large volumes of unstructured legal text has become a critical step in constructing legal knowledge graphs and developing intelligent justice systems (Correia et al., 2021; Guo et al., 2021). LNER also plays a foundational role in downstream tasks such as judicial summarization, question answering, and case recommendation.
However, the specialized terminology, unclear entity boundaries, long entities, and nested entities found in legal texts pose significant challenges. Most existing NER models struggle to address these issues, resulting in suboptimal performance in legal entity recognition (Shen et al., 2022). Chinese legal texts pose additional challenges because multi-word phrases and lengthy noun entities occur frequently. These long entities complicate word segmentation because Chinese, unlike English, does not separate words with spaces.
Moreover, legal texts often contain nested entities, where one entity is embedded within another. For instance, in the phrase “a gold ring from the victim Lin’s home,” the entity “Lin” (victim) is nested within “the victim Lin’s home” (location), which is in turn nested within the full stolen-item mention that includes “a gold ring.” General NER methods may correctly identify “a gold ring” as a stolen item but fail to recognize the nested victim and location entities, resulting in incomplete recognition and a limited understanding of the relationships among location, person, and stolen item.
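To make the notion of nesting concrete, the following minimal sketch represents entities as token spans and lists every pair in which one span lies inside another. The token offsets, the label names, and the choice to let the stolen-item mention cover the whole phrase are illustrative assumptions over the English gloss, not the annotation scheme of any particular dataset.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int  # inclusive token index
    end: int    # exclusive token index
    label: str


def contains(outer: Span, inner: Span) -> bool:
    """True if `inner` lies entirely inside `outer` and is not the same span."""
    same = (outer.start, outer.end) == (inner.start, inner.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same


def nested_pairs(spans: List[Span]) -> List[Tuple[str, str]]:
    """Return (inner_label, outer_label) for every nested pair of spans."""
    return [(i.label, o.label) for o in spans for i in spans if contains(o, i)]


tokens = ["a", "gold", "ring", "from", "the", "victim", "Lin's", "home"]

# Hypothetical span annotations (assumed for illustration only).
spans = [
    Span(0, 8, "stolen_item"),  # the whole phrase
    Span(4, 8, "location"),     # "the victim Lin's home"
    Span(6, 7, "victim"),       # "Lin's"
]

for inner, outer in nested_pairs(spans):
    print(f"{inner} is nested within {outer}")
```

A flat sequence tagger assigns a single label to each token, so it cannot mark the token "Lin's" as part of all three entities at once. This is why span-based or query-based formulations are commonly used for nested NER.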
LNER faces the following main challenges:
- Due to the specificity of the legal domain, legal documents contain long entities and nested entities. Long entities are composed of multiple nouns or phrases, which complicates their segmentation. Nested entities have multi-layer structures in which entity boundaries intertwine and overlap, making them particularly hard to recognize.
- General NER methods primarily predict entity labels from context and often overlook the semantic relationships between the textual context and the entity label types. The machine reading comprehension (MRC) approach addresses some of these limitations, but it is inefficient because it can identify only one entity type per inference. In addition, the quality of manually constructed queries varies significantly, which affects the accuracy of this approach.
To address these issues, this paper introduces an LNER method designed specifically for recognizing entities in Chinese legal documents. The method is based on the parallel instance query network-NER (PIQN-NER), which replaces the fixed queries of MRC with trainable queries and extracts all entities simultaneously. Unlike previous methods, these queries can be constructed in advance without relying on external knowledge. A linear label assignment mechanism aligns gold entities with the instance queries. First, PIQN-NER fine-tunes Bidirectional Encoder Representations from Transformers (BERT) to encode character sequences. Then, a bidirectional long short-term memory (BiLSTM) network combined with an attention mechanism assigns different attention weights to the context and the instance queries, which improves the model's ability to determine entity boundaries. Finally, the entity prediction component uses a pointer network to capture both the span boundaries and the types of legal entities. Experiments demonstrate that the proposed method outperforms related methods on legal datasets.
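The label assignment step can be pictured as a linear assignment problem: each gold entity is matched to a distinct instance query so that the total matching cost is minimal. The sketch below is only an illustration under stated assumptions. The cost values are invented, the function name `assign_queries` is hypothetical, and a brute-force search over permutations stands in for an efficient solver such as the Hungarian algorithm. It is not the paper's implementation.

```python
from itertools import permutations
from typing import List, Tuple


def assign_queries(cost: List[List[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Match each gold entity (row) to a distinct query (column) at minimum total cost.

    Brute force over permutations: adequate for a toy example, but
    exponential in general, so a real system would use a dedicated solver.
    """
    if not cost:
        return [], 0.0
    n_gold, n_query = len(cost), len(cost[0])
    if n_gold > n_query:
        raise ValueError("need at least as many queries as gold entities")

    best_pairs: List[Tuple[int, int]] = []
    best_total = float("inf")
    for cols in permutations(range(n_query), n_gold):
        total = sum(cost[g][q] for g, q in enumerate(cols))
        if total < best_total:
            best_pairs, best_total = list(enumerate(cols)), total
    return best_pairs, best_total


# Hypothetical costs (assumed): rows are gold entities, columns are instance
# queries. A lower cost means the query's prediction fits the entity better.
cost = [
    [0.9, 0.2, 0.7, 0.8],  # gold 0: victim
    [0.1, 0.6, 0.9, 0.5],  # gold 1: location
    [0.8, 0.3, 0.4, 0.1],  # gold 2: stolen item
]

pairs, total = assign_queries(cost)
matched = {q for _, q in pairs}
unmatched = [q for q in range(len(cost[0])) if q not in matched]

print("gold -> query:", pairs)
print("queries left without a gold entity:", unmatched)
```

In this toy setting there are more queries than gold entities, so one query remains unmatched. The sketch assumes that such queries would be trained to predict "no entity", which is a common choice in set-prediction models.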
The main contributions of this work are as follows: