Evaluation Runs

Evaluation Runs measure an LLM application's performance by passing the user query and LLM response to an evaluator function. An evaluation run can also take in ground truth and other metadata; which fields are required usually depends on the specific evaluator used. Here are a couple of key terms:

  • Test Sets: A dataset of user inputs, outputs, and ground truth. A Test Set doesn't have to contain outputs or ground truth; it can consist of inputs alone (see the sketch after this list).
  • Evaluation Sets: A Test Set with evaluation metrics computed for each entry.
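
Conceptually, a Test Set entry pairs an input with an optional output and optional ground truth. A rough illustration in Python (this shows only the shape of the data, not the SDK's actual schema):

test_set_entry = {
    "input": "Only link to this file if it exists",                              # user query
    "output": None,                                                              # optional: LLM-generated response
    "ground_truth": "Introducing the Option to Access File with Existing Link",  # optional: expected answer
}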

Types of Evaluation Runs

There are two ways to start an evaluation run with the LastMile Eval SDK.

  • Real-time Evaluation: The function run_and_evaluate() takes inputs, runs your RAG system, generates outputs, and evaluates them using the specified evaluators. Use this function when you want to evaluate your RAG system's performance on the fly.
  • Evaluation on Pre-generated Outputs: The function evaluate() takes a Test Set with pre-generated outputs and evaluates them using the specified evaluators. Use this function when you have a dataset with existing outputs (e.g., a Test Set created from existing traces) and want to evaluate them without running the RAG system again.

Real-time Evaluation

The code snippet below shows how to use run_and_evaluate(). In this case, we run a simple function, generate_pr_title(), that uses an LLM to generate PR titles from input descriptions. We specify LastMile Evaluators to assess the quality of the titles generated for these inputs.
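
The snippet assumes generate_pr_title() is already defined. A minimal sketch of such a function is shown here; the OpenAI client, model name, and prompt are illustrative assumptions rather than part of the LastMile SDK, and we assume run_and_evaluate() calls the function once per input string and collects the returned title.

from openai import OpenAI

client = OpenAI()

def generate_pr_title(description: str) -> str:
    # Illustrative LLM call: ask the model for a concise PR title
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Write a concise pull request title for the change described."},
            {"role": "user", "content": description},
        ],
    )
    return response.choices[0].message.content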

from functools import partial
from lastmile_eval.rag.debugger.api.evaluation import run_and_evaluate

# Define test inputs and expected outputs
test_inputs = [
    "Only link to this file if it exists",
    "If there's no ID, remove the copy button",
]

expected_outputs = [
    "Introducing the Option to Access File with Existing Link",
    "Enhanced User Experience: Removing Copy Button When ID is Absent",
]

# Specify evaluation metrics. These are LastMile Evaluators.
metrics = {"relevance", "qa", "similarity"}

# Run generate_pr_title() and evaluate responses
eval_results = run_and_evaluate(
    project_name="Example-Project",
    evaluators=metrics,
    run_query_fn=partial(generate_pr_title),
    inputs=test_inputs,
    ground_truths=expected_outputs,
)

This generates an Evaluation Set, which you can view in the RAG Workbench UI (shown below).

Evaluation on Pre-generated Outputs

The code snippet below shows how to use evaluate().

The input to this function is a Test Set, which contains user inputs, LLM-generated outputs, and optionally ground truth. It does not contain evaluation metrics.

In this example, we use a Test Set generated from Traces, identified by its Test Set ID (available in the RAG Workbench UI). We will run evaluators on this Test Set.

# Assumption: evaluate() is imported from the same module as run_and_evaluate()
from lastmile_eval.rag.debugger.api.evaluation import evaluate

metrics = {"relevance", "qa", "similarity"}

# Get Test Set ID from RAG Workbench UI
test_set_id = 'clxkjvg4w003vqphoupza38gq'

# Run Evaluation on Test Set
evaluate(
    test_dataset_id=test_set_id,
    project_name="Evaluation-Example",
    evaluators=metrics,
)

This generates an Evaluation Set, which you can view in the RAG Workbench UI (shown below).

View Evaluation Results in UI

Run the following command in your terminal to launch the UI:

rag-debug launch

Navigate to the URL provided by RAG Workbench (it opens in your web browser); it will look like http://localhost:8080/.

  1. Click the Evaluation Console Tab.
  2. Select your Project.
  3. Click on the latest evaluation run.

More Resources