Generative AI Custom Evaluation - Quickstart

This tutorial demonstrates how to use SAP AI Core Custom Evaluation to benchmark Large Language Models (LLMs) using **Prompt Registry**. It guides you through environment setup, configuration creation, execution, and result analysis in a unified and simplified workflow.
You will learn
  • How to prepare and organize datasets for evaluation.
  • How to configure and run evaluations in SAP AI Core.
  • How to analyze and interpret aggregated evaluation results.
Smita Naik (I321506), April 1, 2026
Created by I321506, January 16, 2026
Contributors: I321506

Prerequisites

  1. BTP Account
    Set up your SAP Business Technology Platform (BTP) account.
    Create a BTP Account
  2. For SAP Developers or Employees
    Internal SAP stakeholders should refer to the following documentation: How to create BTP Account For Internal SAP Employee, SAP AI Core Internal Documentation
  3. For External Developers, Customers, or Partners
    Follow this tutorial to set up your environment and entitlements: External Developer Setup Tutorial, SAP AI Core External Documentation
  4. Create BTP Instance and Service Key for SAP AI Core
    Follow the steps to create an instance and generate a service key for SAP AI Core:
    Create Service Key and Instance
  5. AI Core Setup Guide
    Step-by-step guide to set up and get started with SAP AI Core:
    AI Core Setup Tutorial
  6. An Extended SAP AI Core service plan is required, as the Generative AI Hub is not available in the Free or Standard tiers. For more details, refer to:
    SAP AI Core Service Plans
  7. Orchestration Deployment
    Ensure at least one orchestration deployment is ready to be consumed during this process.
    Refer to this tutorial to understand the basic consumption of GenAI models using orchestration.
  8. Basic Knowledge
    Familiarity with the orchestration workflow is recommended.
  9. Install Dependencies
    Install the required Python packages using the requirements.txt file provided.
    Download requirements.txt

💡 Right-click the link above and choose “Save link as…” to download it directly.

Pre-Read

This tutorial is intended for users who are new to SAP AI Core services and whose use case does not require a customized setup. The evaluation is set up automatically, so the only input you must provide is the dataset.
It demonstrates a simplified quickstart workflow that uses AI Core's custom evaluation capabilities to benchmark Large Language Models (LLMs) and to compare different prompts for a specific use case. It uses the public MedicationQA dataset to show how to compute industry-standard metrics and assess the reliability of LLM-generated responses.

  • Step 1
  • Step 2
  • Step 3

    ⚠️ Important Note (Must Read)

    • You must create an object store secret with a user-defined name (for example, default) to store output artifacts from orchestration runs. This is mandatory.
    • For input artifacts, you may create additional object store secrets with different names if needed.
    • If no object store secret with a user-defined name (for example, default) is configured, orchestration runs will fail because no output target is set up.
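As a rough illustration of the note above, the following Python sketch assembles a request body for an S3-backed object store secret and checks that the output secret exists before a run is started. The field names follow the S3 variant of the SAP AI Core object store secret API as it is commonly documented, but they, the helper names (`build_object_store_secret`, `check_output_secret`), and every placeholder value are assumptions; verify them against the current API reference before use.

```python
def build_object_store_secret(name, bucket, endpoint, region, path_prefix,
                              access_key_id, secret_access_key):
    """Assemble a request body for an S3-backed object store secret.

    Field names are assumed from the S3 variant of the API; no request is sent.
    """
    return {
        "name": name,
        "type": "S3",
        "bucket": bucket,
        "endpoint": endpoint,
        "region": region,
        "pathPrefix": path_prefix,
        "data": {
            "AWS_ACCESS_KEY_ID": access_key_id,
            "AWS_SECRET_ACCESS_KEY": secret_access_key,
        },
    }


def check_output_secret(existing_names, required_name="default"):
    """Fail early if the secret used for output artifacts is not configured."""
    if required_name not in existing_names:
        raise ValueError(
            f"Object store secret '{required_name}' is missing; "
            "orchestration runs would have no output target."
        )


# Placeholder values for illustration only.
payload = build_object_store_secret(
    name="default",
    bucket="my-evaluation-bucket",
    endpoint="s3.eu-central-1.amazonaws.com",
    region="eu-central-1",
    path_prefix="evaluation/output",
    access_key_id="<access-key-id>",
    secret_access_key="<secret-access-key>",
)
check_output_secret([payload["name"]])
```

Running the check before every execution turns a late orchestration failure into an immediate, readable error.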
  • Step 4
  • Step 5

    🔑 Tip: Always increment the version (e.g., 1.0.1, 1.0.2) when updating a template. This ensures reproducibility across evaluations.
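The versioning tip can be automated with a small helper. This is a minimal sketch that assumes template versions are plain `major.minor.patch` strings; the function name `bump_patch` is hypothetical and not part of any SAP SDK.

```python
def bump_patch(version: str) -> str:
    """Return the next patch version for a 'major.minor.patch' string."""
    parts = version.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"Expected 'major.minor.patch', got {version!r}")
    major, minor, patch = (int(p) for p in parts)
    return f"{major}.{minor}.{patch + 1}"


next_version = bump_patch("1.0.1")  # "1.0.2"
```

Deriving the next version from the last one you registered avoids accidentally reusing a version number across evaluations.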

  • Step 6

    Metrics determine how your model outputs are evaluated during an evaluation run. They define the scoring logic that SAP AI Core uses to compare models, measure quality, and validate improvements over time.

    Metrics must be supplied before creating an Evaluation Configuration.

    Note:

    To compare different models and generate a leaderboard, you must select more than one model.
    When multiple models are provided, the evaluation system automatically creates separate
    evaluation runs for each model within the same execution. This enables the evaluation workflow
    to compare the runs and compute head-to-head win rates across the selected models.
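To make the idea of a head-to-head win rate concrete, here is a small Python sketch. It is not SAP AI Core's implementation: it assumes each model has one numeric score per dataset instance, that a higher score is better, and that a tie counts as half a win. The model names and scores are invented for illustration.

```python
from itertools import permutations


def head_to_head_win_rates(scores):
    """Return {(a, b): win rate of model a against model b}.

    `scores` maps a model name to its per-instance scores, in the same
    instance order for every model. Ties count as half a win (assumption).
    """
    lengths = {len(values) for values in scores.values()}
    if len(lengths) != 1:
        raise ValueError("All models must be scored on the same instances.")

    rates = {}
    for a, b in permutations(scores, 2):
        points = 0.0
        for score_a, score_b in zip(scores[a], scores[b]):
            if score_a > score_b:
                points += 1.0
            elif score_a == score_b:
                points += 0.5
        rates[(a, b)] = points / len(scores[a])
    return rates


# Invented scores for two hypothetical models on four instances.
example_scores = {
    "model-a": [0.9, 0.4, 0.7, 0.8],
    "model-b": [0.6, 0.5, 0.7, 0.2],
}
win_rates = head_to_head_win_rates(example_scores)
# win_rates[("model-a", "model-b")] is 0.625 (2 wins, 1 tie, 1 loss)
```

With only one model selected there are no pairs to compare, which is why a leaderboard needs at least two.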

  • Step 7
  • Step 8

    After creating the Evaluation Configuration, the next step is to execute it.

    Execution triggers the evaluation workflow, which:

    - Reads the test dataset
    
    - Generates submissions to the orchestration service
    
    - Collects model outputs
    
    - Computes all selected metrics
    
    - Produces aggregate and raw evaluation results
    

    The process is identical for SAP AI Launchpad, Python, and Bruno, with only the invocation method differing.
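The five stages listed above can be pictured with a purely local stand-in. The sketch below does not call SAP AI Core or the orchestration service: `generate` is a stub in place of a deployed model, `exact_match` is a toy metric, and the dataset fields (`question`, `reference`) are assumed names chosen for illustration.

```python
from statistics import mean


def exact_match(output: str, reference: str) -> float:
    """Toy metric: 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def run_evaluation(dataset, generate, metrics):
    """Local stand-in for the evaluation workflow stages."""
    raw = []
    for row in dataset:                        # 1. read the test dataset
        prompt = row["question"]               # 2. build one submission
        output = generate(prompt)              # 3. collect the model output
        scores = {name: fn(output, row["reference"])
                  for name, fn in metrics.items()}   # 4. compute each metric
        raw.append({"question": prompt, "output": output, **scores})
    aggregate = {name: mean(r[name] for r in raw) for name in metrics}
    return {"aggregate": aggregate, "raw": raw}      # 5. aggregate + raw results


# Invented two-row dataset and a stub model that always gives the same answer.
dataset = [
    {"question": "Placeholder question 1", "reference": "answer one"},
    {"question": "Placeholder question 2", "reference": "answer two"},
]
results = run_evaluation(dataset, lambda prompt: "Answer one",
                         {"exact_match": exact_match})
```

In the real service each stage runs remotely; the sketch only shows how raw per-instance results and aggregate values relate to each other.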

  • Step 9

    Once the evaluation execution is complete, SAP AI Core generates both aggregated metrics and detailed instance-level results.
    These results help compare model performance, understand quality metrics, and debug issues.
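One way to work with instance-level results once they have been downloaded is sketched below. The row layout (`model`, `instance_id`, `score`) is a hypothetical simplification and not the actual SAP AI Core result schema; adapt the field names to the files your execution produces.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_model(rows, metric):
    """Average one metric per model from instance-level rows."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["model"]].append(row[metric])
    return {model: mean(values) for model, values in grouped.items()}


def worst_instances(rows, metric, model, n=1):
    """Return the n lowest-scoring instances of a model, useful for debugging."""
    own_rows = [row for row in rows if row["model"] == model]
    return sorted(own_rows, key=lambda row: row[metric])[:n]


# Invented instance-level results for two hypothetical models.
rows = [
    {"model": "model-a", "instance_id": 1, "score": 1.0},
    {"model": "model-a", "instance_id": 2, "score": 0.5},
    {"model": "model-b", "instance_id": 1, "score": 0.5},
    {"model": "model-b", "instance_id": 2, "score": 0.0},
]
summary = aggregate_by_model(rows, "score")
weakest = worst_instances(rows, "score", "model-b")
```

The aggregated view answers which model performs better overall, while the lowest-scoring instances point to the prompts worth inspecting by hand.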

  • Step 10

    Over time, your workspace may accumulate old configurations, executions, and metrics.
    SAP AI Core allows you to safely delete these resources once they are no longer needed.

    This section explains how to delete:

    - Evaluation Executions
    
    - Evaluation Configurations
    

    ⚠️ Important:

    Deletions are permanent and cannot be undone.
