OpenAlex · Updated hourly · Last updated: 08.04.2026, 02:27

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Medical knowledge representation enhancement in large language models through clinical tokens optimization

2026 · 0 citations · Scientific Reports · Open Access

Citations: 0 · Authors: 6 · Year: 2026

Abstract

During the training of medical large language models (LLMs), conventional tokenizers frequently segment domain-specific medical terms into multiple subword tokens, resulting in suboptimal recognition and representation of specialized vocabulary. As a consequence, the model encounters difficulties in effectively acquiring medical domain knowledge during the fine-tuning process. To address this limitation, the present study introduces “clinical tokens”—medical subword units—by augmenting the vocabulary of the original LLaMA2 tokenizer. This adapted tokenizer retains medical terms as whole tokens wherever feasible, thereby enhancing tokenization accuracy and enabling the model to learn and interpret medical knowledge more effectively. For downstream task adaptation, this study employs the Byte Pair Encoding (BPE) algorithm to construct a domain-specific vocabulary and tokenization model, ensuring the inclusion of medical subword units (clinical tokens). We compare the tokenization performance of three variants: the original LLaMA2 tokenizer, the Chinese-LLaMA2 tokenizer (expanded with an extended Chinese vocabulary), and the clinical token-augmented tokenizer. This was followed by fine-tuning the large language models on curated medical datasets. The experimental results indicate that the enhanced tokenizer improves encoding and decoding efficiency, extends the model’s effective context window, and yields superior performance on downstream medical tasks.
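The core idea described in the abstract — keeping whole medical terms as single tokens rather than letting the tokenizer split them into generic subwords — can be illustrated with a toy greedy longest-match tokenizer. This is a minimal sketch, not the paper's implementation (the authors extend the LLaMA2 BPE vocabulary); the vocabularies and terms below are illustrative assumptions.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest substring first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

# Base vocabulary: generic subwords only (illustrative).
base_vocab = {"hyper", "lip", "id", "emia", "card", "io", "myo", "pathy"}
# Augmented vocabulary: the same subwords plus whole clinical terms.
clinical_vocab = base_vocab | {"hyperlipidemia", "cardiomyopathy"}

term = "hyperlipidemia"
print(tokenize(term, base_vocab))      # ['hyper', 'lip', 'id', 'emia']
print(tokenize(term, clinical_vocab))  # ['hyperlipidemia']
```

The augmented vocabulary encodes the term in one token instead of four, which is the mechanism behind the reported gains in encoding efficiency and effective context length: fewer tokens per medical term means more clinical text fits in the same context window.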
