
Quick Start

RAG Workbench is a platform for evaluating advanced RAG systems. It allows you to efficiently debug and optimize your application, so you can confidently deploy to production.

This quick start demonstrates a basic evaluation run. Try it yourself in our Quick Start Notebook. For a detailed guide on implementing tracing for a RAG system, check out our Getting Started Notebook.

1. Install

pip install lastmile-eval
pip install "lastmile-eval[ui]"

2. Create an API Token

Go to the LastMile Settings page. Click 'Create new token'.

3. Set up your environment

From your terminal, export your API Token.

export LASTMILE_API_TOKEN=<your-api-token>
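
Before running an evaluation, it can help to confirm from Python that the token is actually visible to your process. The following is a minimal sketch, not part of the lastmile-eval API: the helper name `get_lastmile_token` is hypothetical, and only the `LASTMILE_API_TOKEN` variable name comes from the step above.

```python
import os


def get_lastmile_token(env=os.environ):
    """Return the API token from the environment, or raise if it is missing.

    Hypothetical helper for illustration; it only reads the
    LASTMILE_API_TOKEN variable exported in the step above.
    """
    token = env.get("LASTMILE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "LASTMILE_API_TOKEN is not set; export it before running an evaluation."
        )
    return token
```

Calling `get_lastmile_token()` at the top of a script or notebook fails early with a clear message if the export was made in a different shell session.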

4. Run an evaluation

Evaluation requires data to serve as test cases and evaluators to grade the results. Here we are using our built-in faithfulness evaluator and our semantic similarity evaluator.

  • Faithfulness evaluates if the LLM is hallucinating or deviating from the provided context.
  • Semantic Similarity measures the textual similarity between two strings, regardless of their meaning.

from lastmile_eval.rag.debugger.api.evaluation import evaluate
import pandas as pd

# Each test case pairs an input with the expected answer (groundTruth)
# and the output your application actually produced.
dataset = [
    {
        'input': 'What is Einstein famous for in physics?',
        'groundTruth': 'Albert Einstein is famous for the theory of relativity.',
        'output': 'Einstein is famous for the theory of relativity.'
    },
    {
        'input': 'What instrument did Einstein play?',
        'groundTruth': 'Einstein played the violin.',
        'output': 'Einstein played the piano.'
    }
]

evaluate_result = evaluate(
    project_name="my project",
    # Built-in evaluators, selected by name.
    evaluators={
        "faithfulness",
        "similarity"
    },
    test_dataset=pd.DataFrame(dataset)
)

This outputs the following:

| Input | Ground Truth | Output | Faithfulness Score | Similarity Score |
|---|---|---|---|---|
| What is Einstein famous for in physics? | Albert Einstein is famous for the theory of relativity. | Einstein is famous for the theory of relativity. | 0.98325 | 0.9 |
| What instrument did Einstein play? | Einstein played the violin. | Einstein played the piano. | 0.00073 | 0.5 |

The first test case has a high Faithfulness Score (0.98) and a high Similarity Score (0.9): the output adheres to the ground truth and is also textually similar to it.

In contrast, the second test case has a Faithfulness Score near 0 and a middling Similarity Score (0.5). The output contradicts the ground truth (piano instead of violin), but because the sentence structure is largely the same, the Similarity Score stays at 0.5.

This is why it's important to use multiple evaluators to assess your application's performance and help identify areas for optimization.
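
To make this concrete, here is a minimal sketch of checking each test case against more than one evaluator. It does not use the lastmile-eval API: the scores are copied by hand from the results table above, and the dictionary keys, the `THRESHOLDS` values, and the `failing_evaluators` helper are all assumptions chosen for illustration.

```python
# Scores copied from the results table above; the key names are hypothetical
# and do not reflect the structure returned by evaluate().
results = [
    {"input": "What is Einstein famous for in physics?",
     "faithfulness": 0.98325, "similarity": 0.9},
    {"input": "What instrument did Einstein play?",
     "faithfulness": 0.00073, "similarity": 0.5},
]

# Assumed cutoffs; tune these for your own application.
THRESHOLDS = {"faithfulness": 0.5, "similarity": 0.5}


def failing_evaluators(row, thresholds=THRESHOLDS):
    """Return the names of evaluators whose score is below its threshold."""
    return [name for name, cutoff in thresholds.items() if row[name] < cutoff]


report = {row["input"]: failing_evaluators(row) for row in results}
for question, failures in report.items():
    status = ", ".join(failures) if failures else "ok"
    print(f"{question} -> {status}")
```

With these assumed cutoffs, the second test case passes on similarity alone and is caught only by the faithfulness check, which is the situation the paragraph above describes.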

5. View Results in RAG Workbench!

From your terminal, launch the RAG Workbench UI.

rag-debug launch

Next steps