Beyond Words: How Specialised Tokenisation Transforms Legal AI

Why Tokenisation Matters
Legal tech conferences, webinars, and LinkedIn feeds are dominated by flashy innovations: Gen AI-powered client interfaces, automated contract review dashboards, and sophisticated NLP systems that extract key provisions from thousands of agreements. Yet underneath these advanced technologies sits tokenisation, a foundational component that rarely attracts the same attention.
Tokenisers determine how AI models process text, affecting accuracy, speed, and computational resources. For those in legal dealing with nuanced documents, effective tokenisation is particularly crucial.
What's Wrong with Current Tokenisers?
GPT, BERT, and other popular AI models use tokenisers trained on general data like Wikipedia, forums, and websites. These systems (Byte Pair Encoding, WordPiece, Unigram) break text into subword units, often in ways that don't respect legal terminology. Consider a citation like "Fed. R. Civ. P. 56(a)" - general tokenisers might fragment it into:
"Fed", ".", " R", ".", " Civ", ".", " P", ".", " 56", "(a)".
This fragmentation destroys the structural meaning that's vital in legal and financial documents. When precise terminology loses its semantic cohesion, models struggle with consistent recognition across documents and can't properly infer legal implications. This undermines tasks like entity recognition, clause classification, and citation detection.
Introducing KL3M: A Better Approach
The ALEA Institute developed KL3M (pronounced "Clem") to address these shortcomings. Unlike generic systems, this tokeniser family was built specifically for legal and financial texts. By training on millions of tokens from contracts, financial disclosures, and OCR-processed documents, KL3M preserves critical terminology that would otherwise be fragmented.
How KL3M Tokenisers Make a Difference
KL3M reduces token counts by up to 29% compared to traditional methods like GPT-2's BPE. By merging domain-specific terms into single tokens, it allows more context to fit within AI model input constraints, improving both accuracy and processing efficiency.
Revisiting the earlier example: instead of breaking "Fed. R. Civ. P. 56(a)" into disconnected fragments, KL3M tokenises it as
"Fed.", " ", "R.", " ", "Civ.", " ", "P.", " 56", "(a)".
This approach respects the citation's structure and improves consistency in legal retrieval, linking, and summarisation tasks.
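The core idea can be sketched in a few lines: treat known domain abbreviations as unsplittable units before any further tokenisation. The abbreviation list below is a tiny illustrative sample, not KL3M's actual learned vocabulary:

```python
import re

# Sketch of the idea behind domain-aware tokenisation: known legal
# abbreviations (with their trailing period) are matched first, so they
# survive as single units. Hypothetical sample list, not KL3M's vocabulary.
LEGAL_ABBREVS = ["Fed.", "R.", "Civ.", "P.", "U.S.C.", "C.F.R."]
ABBREV_PATTERN = re.compile(
    "|".join(re.escape(a) for a in sorted(LEGAL_ABBREVS, key=len, reverse=True))
    + r"| \d+|\(\w+\)|\S+|\s"  # fall back to numbers, parentheticals, words, spaces
)

def domain_pretokenise(text: str) -> list[str]:
    return ABBREV_PATTERN.findall(text)

print(domain_pretokenise("Fed. R. Civ. P. 56(a)"))
# ['Fed.', ' ', 'R.', ' ', 'Civ.', ' ', 'P.', ' 56', '(a)']
```

Because the abbreviation alternatives are tried before the generic fallbacks, "Fed." and "Civ." stay whole, matching the KL3M output shown above.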
KL3M also offers a character-level variant that excels with messy data. Legal professionals frequently work with documents affected by OCR issues, scanning artifacts, or formatting problems. The character-level tokeniser handles these imperfections smoothly, reducing preprocessing work while improving downstream reliability.
Practical Benefits
KL3M offers several practical benefits:
- Lower costs and faster processing: With fewer tokens per document, computational expenses drop and processing speeds up.
- Better accuracy: Legal terminology stays intact, leading to more reliable document review, summarisation, and entity recognition.
- Simple integration: As an open-source tool, KL3M integrates into existing AI workflows without major system changes.
- Better handling of messy documents: The character-level variant manages real-world data quality issues that are common in legal and financial work.
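A quick back-of-envelope calculation shows how the token reduction translates into cost. The 29% figure comes from the reduction claim above; the volume and the price per 1,000 tokens are hypothetical placeholders, not real vendor rates:

```python
# Back-of-envelope illustration of the cost effect of a smaller token
# count. All inputs (document volume, price) are hypothetical examples.
def monthly_token_cost(docs: int, tokens_per_doc: int, price_per_1k: float,
                       reduction: float = 0.0) -> float:
    effective_tokens = tokens_per_doc * (1 - reduction)
    return docs * effective_tokens / 1000 * price_per_1k

baseline = monthly_token_cost(10_000, 5_000, 0.01)         # generic tokeniser
with_kl3m = monthly_token_cost(10_000, 5_000, 0.01, 0.29)  # up to 29% fewer tokens
print(baseline, with_kl3m)  # 500.0 vs ~355.0
```

At scale, the same percentage saving applies to latency and context-window headroom, not just spend.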
When and How to Swap Tokenisers in Your Workflow
Since tokenisers are typically bound to how models are trained, complete replacement usually requires retraining. However, teams can incorporate KL3M into preprocessing or retrieval stages without retraining entire models. This approach improves results while maintaining context and clarity, particularly for lengthy legal documents and financial reports.
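One concrete way to do this is at the chunking stage of a retrieval pipeline: segment documents with a domain-aware tokeniser so that citations are never split across chunk boundaries. Here `domain_tokenise` is a stand-in for any KL3M-style tokeniser, approximated by whitespace splitting that keeps abbreviations whole:

```python
# Sketch: use a domain-aware tokeniser only for chunking in a retrieval
# pipeline, leaving the downstream model and its own tokeniser untouched.
def domain_tokenise(text: str) -> list[str]:
    # Stand-in for a KL3M-style tokeniser: whitespace split keeps
    # abbreviations like "Fed." intact as single tokens.
    return text.split(" ")

def chunk_for_retrieval(text: str, max_tokens: int = 5) -> list[str]:
    tokens = domain_tokenise(text)
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

doc = "Summary judgment is governed by Fed. R. Civ. P. 56(a) in federal court."
print(chunk_for_retrieval(doc))
```

With this toy input the full citation "Fed. R. Civ. P. 56(a)" lands inside a single chunk, so a retriever indexing these chunks can match it exactly.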
Where Character-Level Tokenisation Shines
The character-level tokeniser is particularly valuable for older scanned documents - especially decade-old material pulled from filing cabinets. These often contain inconsistent formatting, OCR errors, and broken layouts that confuse standard tokenisers. Rather than feeding problematic text directly to models (leading to poor results), using KL3M's character-level variant at the preprocessing stage helps clean and segment text consistently. This works well for scanned board minutes, archived contracts, and government filings, reducing noise while preserving structure.
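The kind of character-level cleanup involved can be sketched as a small normalisation pass. This is an illustrative stand-in for the preprocessing step, not KL3M's actual character-level tokeniser:

```python
import re

# Sketch of character-level preprocessing for OCR-damaged text: normalise
# stray whitespace and common scan artifacts before tokenisation.
# Illustrative only; KL3M's character-level tokeniser is a trained model.
def clean_ocr_text(text: str) -> str:
    text = text.replace("\x0c", " ")        # form-feed page-break characters
    text = re.sub(r"-\s*\n\s*", "", text)   # re-join words hyphenated at line breaks
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r"\n{2,}", "\n\n", text)  # cap runs of blank lines
    return text.strip()

raw = "The par-\n ties agree   that\x0cthis  Agreement..."
print(clean_ocr_text(raw))  # The parties agree that this Agreement...
```

Working at the character level means no assumption about word boundaries is needed, which is exactly why it copes with scans where those boundaries are already broken.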
Getting Started with KL3M
Implementation is straightforward since KL3M is open-source. Legal tech teams can add these tokenisers to existing document-processing pipelines and see immediate improvements in output quality and efficiency without extensive technical changes.
So while tokenisation may not be the most exciting aspect of legal tech, it's fundamental to how AI performs in specialised fields. KL3M represents a meaningful advancement that delivers clear performance improvements. For legal tech teams working with complex documents, it offers a practical path to more reliable, accurate, and efficient AI systems.
Source Paper for Article: https://arxiv.org/pdf/2503.17247