Quick Start

RAG Workbench is a platform for evaluating advanced RAG systems. It allows you to efficiently debug and optimize your application, so you can confidently deploy to production.

This quick start demonstrates a basic evaluation run. Try it yourself in our Quick Start Notebook. For a detailed guide on implementing tracing for a RAG system, check out our Getting Started Notebook.

1. Install

pip install lastmile-eval
pip install "lastmile-eval[ui]"

2. Create an API Token

Go to the LastMile Settings page. Click 'Create new token'.

3. Set up your environment

From your terminal, export your API Token.

macOS:

export LASTMILE_API_TOKEN=<your-api-token>

Windows (Command Prompt):

set LASTMILE_API_TOKEN=<your-api-token>
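If a later step fails with an authentication error, it can help to confirm from Python that the variable is visible to the process. The sketch below is illustrative only: `require_api_token` is a hypothetical helper and not part of lastmile-eval.

```python
import os


def require_api_token(env=os.environ, name="LASTMILE_API_TOKEN"):
    """Return the token stored under `name`, or raise if it is missing or empty.

    Hypothetical helper for illustration; not part of the lastmile-eval package.
    """
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(
            f"{name} is not set; export it in this shell before running the evaluation."
        )
    return token
```

Calling `require_api_token()` at the top of a script surfaces a missing token early, before any evaluation work starts.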

4. Run an evaluation

Evaluation requires data to serve as test cases and evaluators to grade the results. Here we are using our built-in faithfulness evaluator and our semantic similarity evaluator.

Faithfulness evaluates if the LLM is hallucinating or deviating from the provided context.
Semantic Similarity measures the textual similarity between two strings, regardless of their meaning.
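The test cases in this quick start are dictionaries with `input`, `groundTruth`, and `output` keys. Before running an evaluation, a small check like the following sketch can catch incomplete test cases. It assumes every test case should carry all three keys; `find_incomplete_cases` is a hypothetical helper, not a lastmile-eval function.

```python
# Keys used by the test cases in this quick start.
REQUIRED_KEYS = ("input", "groundTruth", "output")


def find_incomplete_cases(dataset, required_keys=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for cases lacking a required, non-empty field."""
    problems = []
    for index, case in enumerate(dataset):
        missing = [key for key in required_keys if not case.get(key)]
        if missing:
            problems.append((index, missing))
    return problems


sample = [
    {
        "input": "What instrument did Einstein play?",
        "groundTruth": "Einstein played the violin.",
        "output": "Einstein played the piano.",
    },
    {
        # groundTruth is deliberately missing here
        "input": "What is Einstein famous for in physics?",
        "output": "Einstein is famous for the theory of relativity.",
    },
]
print(find_incomplete_cases(sample))  # [(1, ['groundTruth'])]
```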

from lastmile_eval.rag.debugger.api.evaluation import evaluate
import pandas as pd

dataset = [
   {
       'input': 'What is Einstein famous for in physics?',
       'groundTruth': 'Albert Einstein is famous for the theory of relativity.',
       'output': 'Einstein is famous for the theory of relativity.'
   },
   {
       'input': 'What instrument did Einstein play?',
       'groundTruth': 'Einstein played the violin.',
       'output': 'Einstein played the piano.'
   }
]

evaluate_result = evaluate(
   project_name="my project",
   evaluators={
       "faithfulness",
       "similarity"
   },
   test_dataset=pd.DataFrame(dataset)
)

This outputs the following:

Input	Ground Truth	Output	Faithfulness Score	Similarity Score
What is Einstein famous for in physics?	Albert Einstein is famous for the theory of relativity.	Einstein is famous for the theory of relativity.	0.98325	0.9
What instrument did Einstein play?	Einstein played the violin.	Einstein played the piano.	0.00073	0.5

The first test case has a high Faithfulness Score (0.98) and a high Similarity Score (0.9), indicating that the output adheres to the ground truth and is also textually similar to it.

In contrast, the second test case has a low Faithfulness Score (~0) and a neutral Similarity Score (0.5). The low Faithfulness Score indicates that the output diverges from the ground truth, while the Similarity Score stays at 0.5 because the sentence structure remains somewhat similar.
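To build intuition for how two strings can look alike while stating different facts, here is a minimal sketch using Python's standard-library `difflib`. This is a stand-in for illustration, not the LastMile similarity evaluator, so its score will not match the table above.

```python
from difflib import SequenceMatcher


def surface_similarity(a, b):
    """Character-level similarity ratio in [0, 1]; it ignores meaning entirely."""
    return SequenceMatcher(None, a, b).ratio()


ground_truth = "Einstein played the violin."
output = "Einstein played the piano."

# The two sentences share most of their characters even though the
# instrument, the one fact that matters, is different.
score = surface_similarity(ground_truth, output)
print(f"surface similarity: {score:.2f}")
```

A surface-level measure like this rewards the shared wording, which is why a separate check on factual agreement is still needed.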

This is why it's important to use multiple evaluators to assess your application's performance and help identify areas for optimization.
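One way to act on several scores at once is to flag any test case that falls below a per-metric threshold. The sketch below uses the scores from the results table in this quick start. The dictionary layout, the `flag_cases` helper, and the threshold values are all assumptions for illustration, since the structure returned by `evaluate` is not shown here.

```python
# Scores copied from the results table in this quick start.
results = [
    {"input": "What is Einstein famous for in physics?", "faithfulness": 0.98325, "similarity": 0.9},
    {"input": "What instrument did Einstein play?", "faithfulness": 0.00073, "similarity": 0.5},
]


def flag_cases(rows, thresholds):
    """Map each flagged input to the list of metrics scoring below their threshold."""
    flagged = {}
    for row in rows:
        low = [metric for metric, minimum in thresholds.items() if row[metric] < minimum]
        if low:
            flagged[row["input"]] = low
    return flagged


# Hypothetical cut-offs; tune them for your application.
print(flag_cases(results, {"faithfulness": 0.5, "similarity": 0.7}))
# {'What instrument did Einstein play?': ['faithfulness', 'similarity']}
```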

5. View Results in RAG Workbench!

From your terminal, launch the RAG Workbench UI.

rag-debug launch

Next steps

Try out the Getting Started Notebook
Learn about our custom LastMile Evaluators
Read our API reference
Join LastMile's Discord Community to ask questions and get help
Follow LastMile AI on Twitter (@lastmile) for updates
