Finally, a Legal AI Dataset That Doesn’t Shrug at Copyright

Every so often, a project lands that resets the standard. No big hype cycle, just deliberate work that says: here’s how to do it properly.
That’s what the KL3M Data Project delivers: a full-stack, legally grounded dataset for language models, built not by scraping the web and hoping for fair use, but by actually doing the work, the legal analysis, the data engineering, and the ethics. For legal tech, it couldn’t have come at a better time.
Let’s Be Honest: Most AI Training Data Is a Legal Mess
Nearly every major LLM is trained on copyrighted material, often taken without consent or a proper licence. That’s not a footnote. It’s the foundation, and it’s built on shaky assumptions that fall apart the moment someone asks hard questions.
KL3M doesn’t duck this. It puts the issue front and centre:
“Practically all existing LLMs use copyrighted materials obtained without consent… Worse yet, the data has often been obtained from individuals and organisations who have expressed preferences through licenses or terms that limit or prohibit their use.”
It’s not just theoretical risk either. Look at the cases already in motion, like NYT v. Microsoft and Kadrey v. Meta. The legal ambiguity isn’t going anywhere, and if anything, it’s only getting more expensive.
A Protocol That Actually Means Something
Rather than throwing around “fair use” and hoping for the best, KL3M starts from first principles. Copyright. Contract. Attribution. It applies a three-part test to every source:
1. Was it free from copyright at creation?
2. Has it entered the public domain?
3. If still protected, does the licence grant clean rights for commercial use, modification, and AI training?
Fail any of those, and the content gets excluded. No corner-cutting. No vague justifications. And no pretending that “publicly accessible” means “free to use.”
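The three-part test above can be sketched as a simple inclusion filter. This is illustrative only: the record fields and function names here are invented for the sketch, not KL3M’s actual schema or tooling.

```python
from dataclasses import dataclass

# Hypothetical source-document metadata; KL3M's real schema will differ.
@dataclass
class SourceDoc:
    title: str
    copyrighted_at_creation: bool  # e.g. US federal works are not
    public_domain: bool            # copyright expired or dedicated
    licence_allows: set[str]       # rights granted, if still protected

# Step 3 requires all of these rights to be granted cleanly.
REQUIRED_RIGHTS = {"commercial", "modification", "ai_training"}

def passes_kl3m_test(doc: SourceDoc) -> bool:
    """Sketch of the three-part inclusion test described above."""
    # 1. Was it free from copyright at creation?
    if not doc.copyrighted_at_creation:
        return True
    # 2. Has it since entered the public domain?
    if doc.public_domain:
        return True
    # 3. Still protected: the licence must grant every required right.
    return REQUIRED_RIGHTS <= doc.licence_allows
```

Note the shape of the logic: a document is excluded by default and must affirmatively clear one of the three gates, which is the opposite of the “publicly accessible, so probably fine” posture most datasets take.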
Wikipedia, for instance, fails the test, and so do many Creative Commons licences once you read the fine print. It’s the most rigorous, transparent dataset curation process I’ve seen for legal use cases.
Full Pipeline. Real Provenance. Built to Last.
This isn’t a zipped-up dump of PDFs with a permissive licence tacked on. KL3M’s pipeline covers everything.
- Original formats (PDF, HTML, XML) with metadata and legal context
- Standardised text extraction, with Markdown used to retain structure
- Pre-tokenised using a domain-specific tokenizer built for legal and financial text
- Mid- and post-training resources: Q&A, classification, clause drafting, and hearing transcripts
All of it open source. All of it traceable back to the original document. Nothing vague or half-documented. Just solid engineering and actual legal clarity.
Why It Matters to Legal Tech
Legal teams aren’t just consumers of AI. They’re the ones expected to sign off on how others use it. And that creates a problem when the models on the table were trained on data with unclear or outright broken rights.
KL3M gives legal professionals something rare in this space: confidence. Not vague assurances or footnotes. Actual clarity. A clean record of what’s in the dataset, where it came from, and whether you’re allowed to use it.
That shifts what’s possible. You can train or fine-tune without worrying whether a client or regulator is going to ask awkward questions later. You can build workflows that deal with real documents because KL3M includes nearly 500,000 of them in enterprise formats like Word, Excel, and PDF, all sourced from US government domains. That’s the kind of content lawyers actually deal with, and now we can work with it without relying on private datasets or mystery licensing.
The Quiet Rebuttal to the AI Hype
KL3M isn’t playing the leaderboard game. It’s not trying to be flashy. It’s trying to be correct: transparent, legal and aligned with the values legal professionals are supposed to protect.
Most model providers can’t point to their training data; they handwave the problem with “public availability” arguments and vague claims about transformative use. KL3M doesn’t buy into that. It just says: here’s the data, here’s the code, here’s the licence, and here’s the proof we’re allowed to use it. Surprisingly, in 2025, that still feels radical.
If you work in legal tech, especially anywhere near model development or AI deployment, this matters. KL3M solves a problem we’ve all quietly stepped around for years. It proves that you can build something useful, powerful, and open, without breaking the rules or rewriting them after the fact.
It’s not hype. It’s not spin. It’s just done properly.
And it’s about time.