Test Sets
Test Sets are collections of example data used to evaluate the performance of your RAG system. They provide a standardized set of inputs, LLM-generated responses, and, where available, expected outputs against which to assess the quality of your RAG system's responses. Each example within a Test Set is called a Test Case.
Test Sets can be created from various sources, such as production data, previous evaluations, or manually curated examples. In addition to evaluation, Test Sets can be used to assess the performance and scalability of your prompt templates across diverse example data points.
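Conceptually, each Test Case pairs an input with the LLM-generated response and, when available, a reference answer. A minimal sketch of that shape in Python (the field names here are illustrative, not the SDK's exact schema):

```python
# Illustrative shape of a Test Set: a list of Test Cases.
# Field names are assumptions for illustration; the actual SDK schema may differ.
test_set = [
    {
        "input": "What is the capital of France?",    # query sent to the RAG system
        "output": "The capital of France is Paris.",  # LLM-generated response
        "expected_output": "Paris",                   # optional reference answer
    },
    # ... one entry per Test Case
]
```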
Example
The example below is a Test Set as viewed in the RAG Workbench UI. This Test Set was created from production data. It includes input questions, generated outputs, and associated traces. Each row represents an individual Test Case.
Create a Test Set
You can create a Test Set from existing traces, implicitly when running evaluations, or manually using the SDK.
From Existing Traces
Creating a Test Set from existing traces is straightforward once you have set up tracing for your RAG system with the LastMile Tracing SDK. Follow these steps:
- Launch the RAG Workbench UI by running `rag-debug launch` in your terminal.
- Go to the 'Traces' tab, select the desired traces for your Test Set, and click 'Create Test Set'.
- Go to the 'Test Sets' tab and copy the Test Set ID for your newly created Test Set.
- Use the LastMile Eval SDK to download the Test Set and run evaluations:
```python
# Import the Test Set helpers from the LastMile Eval SDK.
# NOTE: the import path below is an assumption; check your installed SDK version.
from lastmile_eval.rag.debugger.api import download_test_set, run_and_store_evaluations

# `rouge1` is a user-defined trace-level evaluator (see the sketch after this block).
trace_level_evaluators = {
    "rouge1": rouge1,
}

# Download the Test Set using the test_set_id copied from the 'Test Sets' tab
test_set = download_test_set(
    test_set_id=test_set_id,
    lastmile_api_token=LASTMILE_API_TOKEN,
)

# Run evaluation on the Test Set and store the results
run_and_store_evaluations(
    test_set_id=test_set_id,
    project_id=project_id,
    trace_level_evaluators=trace_level_evaluators,
    dataset_level_evaluators={},
    lastmile_api_token=LASTMILE_API_TOKEN,
    evaluation_set_name="Evaluation Run 1 - Friday Test Set",
)
```
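The `rouge1` evaluator referenced above is user-defined. Below is a minimal sketch built on the `rouge-score` package; it assumes an evaluator receives lists of generated outputs and reference answers and returns one score per Test Case (check your SDK version for the exact evaluator interface it expects):

```python
from rouge_score import rouge_scorer

# ROUGE-1 scorer from the rouge-score package.
_scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def rouge1(outputs: list[str], ground_truths: list[str]) -> list[float]:
    # Assumed interface: generated outputs and references in, F1 scores out.
    # This is an illustrative sketch, not the SDK's guaranteed contract.
    return [
        _scorer.score(target=truth, prediction=output)["rouge1"].fmeasure
        for output, truth in zip(outputs, ground_truths)
    ]
```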
The `run_and_store_evaluations` function runs your specified evaluators on the Test Set, which consists of the input questions and the corresponding generated outputs from your RAG system, since the Test Set is derived from pre-existing Traces.
You can view the evaluation results and the Test Set in the RAG Workbench UI under the "Evaluation Console" and "Test Sets" tabs, respectively.