Custom AI Benchmarks: A Critical Approach to Evaluating and Implementing Legal Tech Tools

Artificial Intelligence (AI) presents a significant opportunity for businesses looking to improve efficiency, reduce costs and stay competitive. However, the success of AI depends largely on how well it aligns with your business’s specific needs.

Relying on generic (academic) AI benchmarks often doesn’t cut it, as they can fail to capture the unique challenges your organisation faces (how often do you need to classify whether something is a cucumber or not?). By developing custom AI benchmarks tailored to the tasks you actually perform, you can ensure your AI solutions are truly relevant and effective.

In this post I'll explore how we build these benchmarks, with some practical examples. We'll also look at how these principles apply when buying legal tech tools, ensuring their performance is evaluated just as carefully as the AI tools you build in-house.


Why Custom AI Benchmarks Matter

  • Real-World Relevance: Custom benchmarks make sure you're measuring AI based on tasks that are actually relevant to your business, not some generic examples that don't apply.
  • Improved Accuracy: Tailoring benchmarks means capturing the nuances specific to your industry, which generic benchmarks are likely to miss.
  • Proactive Risk Management: You’ll catch potential issues early, reducing the chance of problems down the road when rolling out AI across your operations.
  • Better Resource Allocation: By pinpointing where improvements are needed, you can better focus your time, effort and budget.

Whether you’re developing AI tools internally or evaluating an off-the-shelf legal tech solution, custom benchmarks help ensure that the technology you’re considering is not only fit for purpose but can also be adapted to fit the specific requirements of your firm.


Steps to Build Your Benchmarks

1. Identify Core Business Tasks

Before you even begin with AI, it’s critical to understand what you’re trying to achieve. You can’t measure performance if you don’t know what success looks like for your business.

  • Task Inventory: Start by listing out all the tasks performed within your firm, whether it’s administrative work, client meetings, research, or case management.
    • Why this matters: Knowing your day-to-day operations gives you a solid foundation for identifying where AI could make the biggest difference.
  • Prioritise Tasks for AI: Not every task is suitable for AI, so focus on those that are resource-intensive or repetitive.
    • If your team spends hours sifting through documents for due diligence, AI might be a good fit here.
    • Some tasks, especially routine ones, are easier to automate than others. Automating the client intake process might be simpler than building something to handle nuanced legal arguments.
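
One simple way to make the prioritisation explicit is to score each task by how resource-intensive and routine it is. The tasks, figures and scoring weights below are purely illustrative, not a prescribed method:

```python
# Illustrative prioritisation: score each task by hours spent and how
# routine it is, so resource-intensive, repetitive work surfaces first.
tasks = [
    {"name": "due diligence document review", "hours_per_week": 20, "routine": 0.9},
    {"name": "client intake",                 "hours_per_week": 5,  "routine": 0.95},
    {"name": "nuanced legal argument",        "hours_per_week": 8,  "routine": 0.2},
]

# Higher score = more hours and more routine = better AI candidate.
ranked = sorted(tasks, key=lambda t: t["hours_per_week"] * t["routine"], reverse=True)
for t in ranked:
    print(t["name"])
```

Even a rough score like this gives the conversation with solicitors and paralegals a concrete starting point.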

It's obvious, but you'll need to engage with the people actually doing the work (solicitors, paralegals and administrative staff) to get a realistic view of where AI could help.


2. Define Performance Metrics

Once you’ve identified tasks for AI, it’s crucial to define clear metrics for success. These metrics will help you measure the AI’s effectiveness and guide improvements.

Key Metrics to Track:

  • Accuracy: Does the AI deliver the correct result? For example, when identifying relevant case law, how often does it get it right?
  • Precision: How often is the AI correct when flagging something as important? Does it correctly identify relevant documents without too many false positives?
  • Recall: Does the AI catch everything it should? For example, when reviewing contracts, recall measures how thoroughly the AI identifies key clauses like confidentiality or indemnity.
  • Processing Speed: Is the AI faster than a human and can it maintain quality at that speed?
  • Error Rate: How often does the AI make mistakes or miss important information?
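
As a minimal sketch, these metrics can be computed from a set comparison between the items the AI flags and a human-reviewed answer set. The document IDs below are invented for illustration:

```python
# Hypothetical benchmark scoring: compare AI-flagged items against a
# solicitor-reviewed "gold" answer set over a finite set of candidates.
def score_benchmark(predicted: set, gold: set, universe: set) -> dict:
    tp = len(predicted & gold)                # correctly flagged
    fp = len(predicted - gold)                # flagged but irrelevant
    fn = len(gold - predicted)                # relevant but missed
    tn = len(universe - predicted - gold)     # correctly ignored
    total = len(universe)
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "error_rate": (fp + fn) / total,
    }

# Example: 10 documents, the AI flags 4, of which 3 are genuinely relevant.
docs = {f"doc{i}" for i in range(10)}
flagged = {"doc0", "doc1", "doc2", "doc3"}
relevant = {"doc0", "doc1", "doc2", "doc5"}
print(score_benchmark(flagged, relevant, docs))
```

The same scoring function works for any flagging task, whether it's relevant case law, disclosure documents or contract clauses.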

Collaborating to Define ‘Right’

Legal professionals, including paralegals and contract managers, are crucial (obviously) in defining what "right" means for AI in contract review. Their expertise helps determine how thorough the AI needs to be: whether it should flag every minor clause or concentrate on key legal provisions. By setting these parameters, they ensure the AI focuses on the most relevant legal points, aligning its output with the firm’s expectations and standards.

This input becomes especially important if you're using large language models (LLMs) where prompting plays a key role. These teams will ultimately help shape the prompts that define what’s considered correct within your firm, ensuring that AI outputs align with your actual legal needs and standards.

Always benchmark the AI's performance against human results, as the goal is to not only automate tasks but also improve accuracy, speed and consistency compared to manual work.


3. Collect Relevant Data

AI is only as good as the data it’s trained on. High-quality, representative data is key to building benchmarks that reflect your business’s unique requirements.

  • Data Types:
    • Structured Data: Reference numbers, dates, financial records.
    • Unstructured Data: Emails, case summaries, contracts.
  • Data Preparation: Clean and standardise your data. If your data’s a mess, the AI’s performance will be, too.
    • Annotation: Label your data carefully. For example, mark up contracts with labels like ‘confidentiality clauses’ or ‘termination clauses’.

Focus on quality over quantity. Clean, well-labelled data is more valuable than vast amounts of poor-quality data.
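
As an illustration of annotation, a labelled clause dataset might look like the sketch below. The schema and label names are assumptions for this example, not a standard format:

```python
import json

# Illustrative annotation format: each record labels a passage of
# contract text with a clause type. Fields and labels are invented.
annotations = [
    {"doc": "msa_2023.pdf",
     "text": "Each party shall keep the terms of this agreement confidential.",
     "label": "confidentiality_clause"},
    {"doc": "msa_2023.pdf",
     "text": "Either party may terminate on 30 days' notice.",
     "label": "termination_clause"},
]

# A closed label vocabulary, agreed with the legal team, catches
# labelling drift early and keeps the dataset clean.
ALLOWED_LABELS = {"confidentiality_clause", "termination_clause", "indemnity_clause"}
assert all(a["label"] in ALLOWED_LABELS for a in annotations)

print(json.dumps(annotations[0], indent=2))
```

Small consistency checks like the vocabulary assertion are cheap and pay for themselves as the dataset grows.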


4. Design Benchmark Tasks

When designing benchmark tasks, it's important to ensure they are not only realistic but also broken down into small, well-defined steps. This makes it easier to measure the AI’s performance on discrete parts of a larger piece of work, ensuring more accurate and useful insights.

  • Realistic Scenarios: Create tasks that closely mimic the day-to-day applications within your legal practice. Instead of testing an AI on an entire project at once, break it down into smaller tasks that the AI can process step by step.

    Example: Rather than asking the AI to review all documents in a disclosure process at once, define a task where it first classifies documents by type (e.g., contracts, witness statements, emails). Once the documents are categorised, the AI can then identify key clauses within contracts or flag relevant information in witness statements for further review by the legal team.
  • Task Complexity: Incorporate tasks of varying difficulty so you can gauge the AI’s effectiveness at different levels. Start with simple, well-defined tasks before moving to more complex scenarios.
    • Example: First, test the AI on identifying basic clauses within a contract, such as confidentiality terms. Then, gradually introduce more nuanced tasks, such as recognising implied obligations or more intricate legal constructs.

Use a mix of common tasks and edge cases to push the AI to its limits. By creating smaller, well-defined tasks that build up to larger, more complex evaluations, you ensure that the AI’s capabilities are thoroughly tested while also making it easier to diagnose and refine its performance.
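
The classify-then-extract breakdown above can be sketched as two separately benchmarkable steps. The keyword rules below are placeholders for whatever model actually performs each step; the point is the structure, not the rules:

```python
# Sketch of a step-by-step benchmark task: each function is one
# discrete step whose output can be scored in isolation.
def classify_document(text: str) -> str:
    """Step 1: classify a document by type (toy keyword rules)."""
    lowered = text.lower()
    if "witness" in lowered:
        return "witness_statement"
    if "agreement" in lowered or "clause" in lowered:
        return "contract"
    return "email"

def extract_key_clauses(text: str) -> list[str]:
    """Step 2: flag clause types present in a contract (toy rules)."""
    found = []
    for label, keyword in [("confidentiality", "confidential"),
                           ("indemnity", "indemnify")]:
        if keyword in text.lower():
            found.append(label)
    return found

doc = "This Agreement contains a clause: the parties shall keep all terms confidential."
doc_type = classify_document(doc)
clauses = extract_key_clauses(doc) if doc_type == "contract" else []
print(doc_type, clauses)
```

Because each step produces its own output, a misclassified document and a missed clause show up as separate, diagnosable failures rather than one opaque error.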


5. Implement it (if you're building)

Once your benchmarks are in place, deploy the AI system in a controlled environment to test its performance against them. It’s important to select the right tool for each task, and sometimes that means considering approaches other than large language models. Different tasks require different types of AI models, so maintaining flexibility is crucial.

Selection: Depending on the task, a variety of AI methods may be suitable:

  • LLMs: These models are ideal for tasks like document summarisation, understanding natural language, or generating detailed text-based outputs. LLMs are especially useful when dealing with large volumes of unstructured data or when extracting meaning from text-heavy legal documents. For example, summarising a set of contracts to identify key clauses can be efficiently handled by an LLM.
  • Traditional Machine Learning Models: For tasks involving structured data, such as making predictions or classifying information, simpler AI models may be more effective. These models are often faster and more efficient for straightforward, data-driven tasks. For instance, categorising contracts by risk level or predicting the likelihood of contract renewal could be achieved using traditional models.
  • Pre-trained Models: In some cases, pre-trained models can be a simple and effective solution, especially in natural language processing. For example, extracting specific information like party names or contract dates may not require a complex LLM; a pre-trained model (e.g., Blackstone NLP) can handle this efficiently.
  • Hybrid Approaches: Sometimes combining different models is the best strategy. You could use an LLM to understand and extract key clauses from a document, then pass those structured outputs to a more specialised model for further processing, such as categorising clauses by type or risk.
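
A hybrid handoff might look like the sketch below, where a lightweight extractor produces structured fields for a second-stage classifier. In practice the first stage could be an LLM or a pre-trained model; the regexes and risk rule here are purely illustrative:

```python
import re

# Hybrid sketch: a cheap regex extractor pulls structured fields, then
# a simple rule-based second stage processes the result. The handoff
# pattern is the point, not these toy rules.
def extract_fields(text: str) -> dict:
    date = re.search(r"\d{1,2} \w+ \d{4}", text)
    parties = re.findall(r"between (\w+) and (\w+)", text)
    return {
        "date": date.group(0) if date else None,
        "parties": list(parties[0]) if parties else [],
    }

def classify_risk(fields: dict) -> str:
    # Second stage works on structured output: missing key fields
    # push the contract to human review.
    return "review" if not fields["date"] or not fields["parties"] else "ok"

contract = "This agreement, made 1 March 2024 between Acme and Bolt, ..."
fields = extract_fields(contract)
print(fields, classify_risk(fields))
```

Keeping the stages separate means either one can be swapped out (say, replacing the regexes with an LLM call) without retesting the whole pipeline.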

Deployment Environment: Set up an environment that mirrors your firm’s actual systems. This will help you see how the AI integrates with existing workflows, and it ensures that any performance issues are caught early rather than causing an outage on day one.

    • Example: Deploy the AI within your document management system workflow to see how it handles the real-time review of contracts or case files, to ensure it works smoothly with your current tools and doesn’t disrupt existing processes.

Start small, seriously. Roll out the AI on a limited scale first, in one team (those enthusiastic about AI) or on a specific project, before committing to full-scale deployment. This lets you identify and address any issues without causing chaos.


6. Evaluate Performance

Once the AI is live, it’s time to assess its performance using the metrics you put together.

  • Quantitative Analysis: Measure accuracy, precision, recall and other metrics.
  • Qualitative Analysis: Look at the types of errors the AI makes. Are they trivial, or do they undermine its overall utility? Does it align with the deliverables you'd want as a firm?
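
Part of the qualitative analysis can be systematised by bucketing each mistake by kind and clause type, so reviewers can see at a glance whether errors are trivial or serious. The error records below are invented for illustration:

```python
from collections import Counter

# Illustrative error breakdown: alongside headline metrics, count each
# mistake by (error kind, clause type) so patterns stand out.
errors = [
    ("false_negative", "indemnity"),
    ("false_negative", "indemnity"),
    ("false_positive", "confidentiality"),
]
breakdown = Counter(errors)
for (kind, clause), count in sorted(breakdown.items()):
    print(f"{kind:15} {clause:17} {count}")
```

A cluster of missed indemnity clauses, for instance, tells you far more about where to focus refinement than an aggregate error rate does.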

Involve legal professionals in the evaluation. Their feedback can help fine-tune the AI’s effectiveness in a practical, real-world setting.


7. Iterate and Improve

AI implementations, like any other project we do, aren’t a "release it and forget it" endeavour. Use your evaluation results to refine the system, improve performance and ultimately better align it with your business goals.

  • Model Refinement: Tweak the process, adjust the data, or even chain multiple models for better results.
  • Data Enhancement: Improve your dataset by adding more examples or cleaning up labels where needed.

As with any project, you need to keep the improvement loop going: regular updates and continuous monitoring will keep your AI performing well and reveal places to make further improvements.


Example: Contract Risk Assessment

  • Task: Identify high-risk clauses in contracts.
    • Step 1: Extract key clauses from the contract (e.g., confidentiality, indemnity, termination).
    • Step 2: Evaluate the clauses based on pre-defined risk factors (e.g., clauses with unclear obligations or heavy penalties).
    • Step 3: Rank clauses by risk level, flagging high-risk areas for solicitor review.
  • Benchmark: Assess the accuracy in identifying specific clauses first, then separately measure how well the AI classifies the level of risk for each clause. Also, test its speed and scalability by progressively increasing the volume of contracts.
  • Outcome: This approach leads to faster and more reliable risk identification, as each step is tested in isolation and can be fine-tuned. It supports compliance efforts and provides a clear focus for negotiations.
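
The three steps above can be sketched as a pipeline in which each stage is benchmarked independently. The clause fields and risk weights are hypothetical, standing in for whatever models and risk factors a firm would actually use:

```python
# Sketch of the three-step risk assessment: each function maps to one
# benchmarkable step; the risk weights are invented for illustration.
def extract_clauses(contract: dict) -> list[dict]:
    # Step 1: in practice a model extracts these; here they're given.
    return contract["clauses"]

def assess_risk(clause: dict) -> int:
    # Step 2: score against pre-defined risk factors (toy weights).
    score = 0
    if clause.get("unclear_obligations"):
        score += 2
    if clause.get("heavy_penalties"):
        score += 3
    return score

def rank_for_review(contract: dict, threshold: int = 3) -> list[str]:
    # Step 3: rank clauses by risk, flagging high-risk ones for review.
    scored = [(assess_risk(c), c["type"]) for c in extract_clauses(contract)]
    scored.sort(reverse=True)
    return [clause_type for score, clause_type in scored if score >= threshold]

contract = {"clauses": [
    {"type": "indemnity", "unclear_obligations": True, "heavy_penalties": True},
    {"type": "confidentiality", "unclear_obligations": False},
]}
print(rank_for_review(contract))
```

Because extraction, scoring and ranking are separate functions, each can be benchmarked against its own answer set and fine-tuned without disturbing the other stages.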

Custom AI benchmarks are critical whether you're deploying in-house AI or assessing legal tech tools from vendors. By focusing on real-world tasks and setting benchmarks that reflect your specific operations, you can ensure the tools you implement deliver tangible value.

In the legal sector, where accuracy and compliance are non-negotiable, customised benchmarks help avoid costly mistakes and ensure a smooth, productive deployment. With ongoing refinement, AI and tech tools alike can become integral to your firm’s operations.