Quick Start
RAG Workbench is a platform for evaluating advanced RAG systems. It allows you to efficiently debug and optimize your application, so you can confidently deploy to production.
This quick start demonstrates a basic evaluation run. Try it yourself in our Quick Start Notebook. For a detailed guide on implementing tracing for a RAG system, check out our Getting Started Notebook.
1. Install
```shell
# Core evaluation library
pip install lastmile-eval

# Include the RAG Workbench UI
pip install "lastmile-eval[ui]"
```
2. Create an API Token
Go to the LastMile Settings page. Click 'Create new token'.
3. Set up your environment
From your terminal, set your API token as an environment variable.

macOS / Linux:

```shell
export LASTMILE_API_TOKEN=<your-api-token>
```

Windows:

```shell
set LASTMILE_API_TOKEN=<your-api-token>
```
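If you prefer to configure the token from Python (for example, inside a notebook), a minimal sketch is shown below; the variable name matches the shell export above, and the placeholder value is just that — replace it with a real token from the Settings page.

```python
import os

# Fall back to a placeholder if the token wasn't exported from the shell.
# Replace "<your-api-token>" with a real token before running an evaluation.
os.environ.setdefault("LASTMILE_API_TOKEN", "<your-api-token>")

token = os.environ["LASTMILE_API_TOKEN"]
```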
4. Run an evaluation
An evaluation run requires data to serve as test cases and evaluators to grade the results. Here we use the built-in faithfulness evaluator and the semantic similarity evaluator.
- Faithfulness evaluates whether the LLM is hallucinating or deviating from the provided context.
- Semantic Similarity measures how textually similar two strings are, independent of whether their meanings agree.
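As a rough intuition for how a surface-level similarity score can stay non-zero even when meanings diverge, here is a minimal sketch using Python's standard-library `difflib`. This is an illustration only, not LastMile's evaluator implementation:

```python
from difflib import SequenceMatcher

def surface_similarity(a: str, b: str) -> float:
    """Rough character-level similarity in [0, 1] (illustrative only)."""
    return SequenceMatcher(None, a, b).ratio()

# The two answers disagree in meaning, but share most of their wording,
# so a surface-level score lands well above zero.
score = surface_similarity(
    "Einstein played the violin.",
    "Einstein played the piano.",
)
```

This is why a middling similarity score on its own cannot tell you whether an answer is correct — hence pairing it with faithfulness.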
```python
from lastmile_eval.rag.debugger.api.evaluation import evaluate
import pandas as pd

dataset = [
    {
        'input': 'What is Einstein famous for in physics?',
        'groundTruth': 'Albert Einstein is famous for the theory of relativity.',
        'output': 'Einstein is famous for the theory of relativity.',
    },
    {
        'input': 'What instrument did Einstein play?',
        'groundTruth': 'Einstein played the violin.',
        'output': 'Einstein played the piano.',
    },
]

evaluate_result = evaluate(
    project_name="my project",
    evaluators={
        "faithfulness",
        "similarity",
    },
    test_dataset=pd.DataFrame(dataset),
)
```
This outputs the following:
| Input | Ground Truth | Output | Faithfulness Score | Similarity Score |
|---|---|---|---|---|
| What is Einstein famous for in physics? | Albert Einstein is famous for the theory of relativity. | Einstein is famous for the theory of relativity. | 0.98325 | 0.9 |
| What instrument did Einstein play? | Einstein played the violin. | Einstein played the piano. | 0.00073 | 0.5 |
The first test case has a high Faithfulness Score (0.98) and a high Similarity Score (0.9): the output adheres to the ground truth and is also textually similar to it.
In contrast, the second test case has a near-zero Faithfulness Score, indicating that the output diverges from the ground truth. Its Similarity Score is a neutral 0.5, however, because the sentence structure remains largely the same.
This is why it's important to use multiple evaluators to assess your application's performance and help identify areas for optimization.
5. View Results in RAG Workbench!
From your terminal, launch the RAG Workbench UI.
```shell
rag-debug launch
```
Next steps
- Try out the Getting Started Notebook
- Learn about our custom LastMile Evaluators
- Read our API reference
- Join LastMile's Discord Community to ask questions and get help
- Follow LastMile AI on Twitter (@lastmile) for updates