
Custom Evaluation for Generative AI – Comprehensive Guide

This tutorial demonstrates how to use SAP AI Core Custom Evaluation to benchmark Large Language Models (LLMs) using **Orchestration Registry**. It guides you through environment setup, configuration creation, execution, and result analysis in a unified and simplified workflow.
You will learn
  • How to prepare and organize datasets for evaluation.
  • How to configure and run evaluations in SAP AI Core.
  • How to analyze and interpret aggregated evaluation results.
Created by Smita Naik (I321506), April 1, 2026
Contributors: I321506

Prerequisites

  1. BTP Account
    Set up your SAP Business Technology Platform (BTP) account.
    Create a BTP Account
  2. For SAP Developers or Employees
    Internal SAP stakeholders should refer to the following documentation: How to create BTP Account For Internal SAP Employee, SAP AI Core Internal Documentation
  3. For External Developers, Customers, or Partners
    Follow this tutorial to set up your environment and entitlements: External Developer Setup Tutorial, SAP AI Core External Documentation
  4. Create BTP Instance and Service Key for SAP AI Core
    Follow the steps to create an instance and generate a service key for SAP AI Core:
    Create Service Key and Instance
  5. AI Core Setup Guide
    Step-by-step guide to set up and get started with SAP AI Core:
    AI Core Setup Tutorial
  6. An Extended SAP AI Core service plan is required, as the Generative AI Hub is not available in the Free or Standard tiers. For more details, refer to
    SAP AI Core Service Plans
  7. Orchestration Deployment
    Ensure at least one orchestration deployment is ready to be consumed during this process.
    Refer to this tutorial to understand the basic consumption of GenAI models using orchestration.
  8. Basic Knowledge
    Familiarity with the orchestration workflow is recommended
  9. Install Dependencies
    Install the required Python packages using the requirements.txt file provided.
    Download requirements.txt

💡 Right-click the link above and choose “Save link as…” to download it directly.

This tutorial extends the Quick Start tutorial and is intended for Application Developers and Data Scientists who already know the basics of GenAI workflows in SAP AI Core.

Pre-Read

This tutorial showcases how you can use SAP AI Core Custom Evaluation to benchmark large language models and evaluate orchestration configurations or prompts for your use case.
It uses the publicly available MedicationQA dataset, which consists of commonly asked consumer questions about medications. The workload computes industry-standard metrics to check the reliability of the responses generated by the LLM.

  • Step 1
  • Step 2
  • Step 3

    ⚠️ Important Note (Must Read)

    • You must create an object store secret with a user-defined name (for example, default) to store output artifacts from orchestration runs. This is mandatory.
    • For input artifacts, you may create additional object store secrets with different names if needed.
    • If a secret with such a user-defined name (for example, default) is not configured, orchestration runs will fail because no output target is set up.
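
The object store secret can be created in SAP AI Launchpad or via the SAP AI Core Admin API. The Python sketch below shows the general shape of such a call, assuming an S3 bucket; the bucket, endpoint, path prefix, and credentials are placeholders, and the exact payload fields for your storage provider should be checked against the SAP AI Core API reference.

```python
import requests

AI_API_URL = "https://<your-ai-api-host>"   # from the SAP AI Core service key
TOKEN = "<oauth-access-token>"              # obtained via the client-credentials flow
RESOURCE_GROUP = "default"

# Minimal sketch: create an object store secret named "default" (S3 assumed).
# All values below are placeholders; adjust them to your storage provider.
payload = {
    "name": "default",                      # user-defined name referenced by the runs
    "type": "S3",
    "bucket": "<bucket-name>",
    "endpoint": "s3.eu-central-1.amazonaws.com",
    "region": "eu-central-1",
    "pathPrefix": "evaluation-output",
    "data": {
        "AWS_ACCESS_KEY_ID": "<access-key-id>",
        "AWS_SECRET_ACCESS_KEY": "<secret-access-key>",
    },
}

resp = requests.post(
    f"{AI_API_URL}/v2/admin/objectStoreSecrets",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "AI-Resource-Group": RESOURCE_GROUP,
        "Content-Type": "application/json",
    },
    json=payload,
)
resp.raise_for_status()
print(resp.json())
```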
  • Step 4
  • Step 5

    In this evaluation workflow, prompts can be provided in two different ways.
    Before proceeding, understand the available approaches and choose the one that fits your requirement.

    🔹 Option 1 – Prompt Template + Model (Prompt Registry)

    - The prompt is stored in the Prompt Registry
    
    - The model is referenced directly in the evaluation configuration
    
    - Prompts are reusable and version-controlled
    
    - Best suited for standardized or production-grade workflows
    

    📌 When to use this?

    If you want reusable, versioned prompts that can be managed independently.

    👉 If you would like to see this approach in action, refer to the [Evaluation Quickstart tutorial](LINK TO ADD), where we demonstrate the Prompt Registry method.

    🔹 Option 2 – Orchestration Registry (Inline Prompt)

    - The prompt is defined directly inside the orchestration configuration
    
    - No separate prompt registry entry is required
    
    - Ideal for ad-hoc, experimental, or one-time evaluations
    

    📌 When to use this?

    If the prompt is specific to this evaluation and does not need reuse or versioning.

  • Step 6

    In this tutorial, we will use the Orchestration Registry (Inline Prompt) approach.

    Create Orchestration Registry Configuration
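
As a rough illustration of the inline-prompt approach, an orchestration configuration typically combines a templating module (holding the prompt messages) with an LLM module (naming the model). The sketch below builds such a configuration as a Python dict; the field names follow the public SAP AI Core orchestration configuration format but should be verified against the step instructions, and the MedicationQA-style placeholder variable and model name are assumptions.

```python
# Minimal sketch of an orchestration configuration with an inline prompt.
# Verify field names against the current SAP AI Core orchestration documentation;
# "{{?question}}" is an assumed input variable for the MedicationQA records.
orchestration_config = {
    "module_configurations": {
        "templating_module_config": {
            "template": [
                {
                    "role": "system",
                    "content": "You are a medical assistant. Answer the consumer "
                               "question about medications concisely and accurately.",
                },
                {"role": "user", "content": "{{?question}}"},
            ]
        },
        "llm_module_config": {
            "model_name": "gpt-4o",   # placeholder; use any model available in your region
            "model_params": {"temperature": 0.0, "max_tokens": 512},
        },
    }
}

# This configuration is then registered in the Orchestration Registry (via SAP AI
# Launchpad or the API call described in this step), and the returned registry ID
# is referenced later in the evaluation configuration.
```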

  • Step 7

    Metrics determine how your model outputs are evaluated during an evaluation run. They define the scoring logic that SAP AI Core uses to compare models, measure quality, and validate improvements over time.

    In SAP AI Core, metrics are configured during the Create Evaluation Configuration step:

```json
"metrics": "Content Filter on Input,Pointwise Instruction Following,Content Filter on Output"
```

    You can specify one or multiple metrics (comma-separated).

    Types of Metrics

    SAP AI Core supports two major types:

    1. System-defined Metrics (Ready to use)

    2. Custom Metrics (User-defined)

    1. System-defined Metrics

    These are built-in metrics provided by SAP AI Core. No additional setup required.

    They are grouped into two categories:

    Computed Metrics

    These use reference data, schema validation, or deterministic logic.

    | Name | Description | Reference required |
    | --- | --- | --- |
    | BERT Score | https://huggingface.co/spaces/evaluate-metric/bertscore | Yes |
    | BLEU | https://huggingface.co/spaces/evaluate-metric/bleu | Yes |
    | ROUGE | https://huggingface.co/spaces/evaluate-metric/rouge | Yes |
    | JSON Schema Match | Validates the LLM-generated response against a predefined JSON schema; returns a boolean result | Yes |
    | Content Filter on Input | Whether the orchestration input was rejected by the input filter | No |
    | Content Filter on Output | Whether the orchestration output was rejected by the output filter | No |
    | Exact Match | Whether the output exactly matches the reference | Yes |
    | Language Match | Returns true/false to indicate whether the text matches the given language | No |
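
To make the deterministic checks concrete, the snippet below mimics locally what Exact Match and JSON Schema Match evaluate for a single response. It uses the `jsonschema` package purely for illustration; it is not the service's implementation, and the sample texts and schema are made up.

```python
import json
from jsonschema import validate, ValidationError  # pip install jsonschema

reference = "Take ibuprofen with food to reduce stomach irritation."
response = "Take ibuprofen with food to reduce stomach irritation."

# Exact Match: boolean comparison of the model output against the reference answer.
exact_match = response.strip() == reference.strip()

# JSON Schema Match: validate a structured LLM response against a predefined schema.
schema = {
    "type": "object",
    "properties": {"drug": {"type": "string"}, "advice": {"type": "string"}},
    "required": ["drug", "advice"],
}
structured_response = '{"drug": "ibuprofen", "advice": "take with food"}'
try:
    validate(json.loads(structured_response), schema)
    schema_match = True
except (ValidationError, json.JSONDecodeError):
    schema_match = False

print(exact_match, schema_match)  # True True
```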

    👉 Use computed metrics when:

    - You have ground truth/reference answers
    
    - You need deterministic validation
    
    - You want schema validation
    

    Model-as-a-Judge Metrics

    These use a judge LLM to evaluate responses qualitatively.

    | Name | Description | Reference required |
    | --- | --- | --- |
    | Pointwise Instruction Following | Assesses the model's ability to follow the instructions provided in the user prompt | No |
    | Pointwise Correctness | Assesses the model's ability to provide a correct response based on the user prompt | Yes |
    | Pointwise Answer Relevance | Assesses whether the model's response is relevant to the user prompt | No |
    | Pointwise Conciseness | Assesses whether the model's response is a short and concise answer to the user prompt | No |

    Entries marked with an asterisk (*) are experimental metrics.

    👉 Use model-as-a-judge metrics when:

    - You need qualitative evaluation
    
    - No exact ground truth exists
    
    - You want human-like evaluation logic
    

    Custom Metrics (User-defined metrics)

    When system metrics are insufficient, you can define your own metric.

    Custom metrics can be used to evaluate LLM outputs according to the unique needs of a use case. A user-defined LLM-as-a-judge metric uses a judge LLM along with a rubric to compute a metric rating. The output of an LLM-as-a-judge metric can be numeric or text.

    The system defines a structure for the judge prompts and users provide the metric definition in the pre-defined format. Relevant instructions, such as output instructions, are automatically added to ensure the desired output from the LLM.

    Custom Metric Definition Structure

```json
{
  "scenario": "genai-evaluations",
  "metricName": "my_custom_metric",
  "version": "0.0.1",
  "type": "structured",
  "model_configuration": {
    "model_name": "string",
    "model_version": "string"
  },
  "prompt_configuration": {
    "evaluation_task": "Describe the goal of this evaluation.",
    "criteria": "Explain how evaluation is performed.",
    "rating_rubric": [
      {
        "rating": 1,
        "rule": "Poor quality response"
      },
      {
        "rating": 5,
        "rule": "Excellent response"
      }
    ],
    "include_properties": ["prompt", "reference"],
    "examples": [
      {
        "prompt": "Sample prompt",
        "response": "Sample response",
        "reference": "Expected answer",
        "rating": 5,
        "explanation": "Why this rating was given"
      }
    ]
  }
}
```

    NOTE: "scenario", "metricName", and "version" are required parameters for the custom metric in the evaluation configuration.

    NOTE: You must provide at least one prompt; the system prompt, the user prompt, or both can be provided.

    Model Availability Notice

    ⚠️ If gpt-4.1 (2025-04-14) is not available in your region:

    - LLM-as-a-Judge metrics cannot be executed
    
    - The evaluation service depends on this specific model version
    
  • Step 8


    Metrics must be supplied before creating an Evaluation Configuration.

    Note

    To evaluate and compare multiple models in a single execution, you must create a distinct orchestration registry ID for each model you wish to test. Assign a different foundation model to each registry ID, and then pass this list of registry IDs into your evaluation configuration. This ensures the system generates separate, comparable runs for each model simultaneously.
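
Purely as an assumed illustration of that note, the payload below sketches how an evaluation configuration might list several orchestration registry IDs alongside the comma-separated metrics string shown earlier. Only the "metrics" value mirrors this tutorial verbatim; the remaining field names (configuration name, dataset reference, registry ID list) are placeholders to verify against the actual Create Evaluation Configuration step.

```python
# Hypothetical sketch of an evaluation configuration payload comparing two models.
# Field names other than "metrics" are assumptions; check them against the
# Create Evaluation Configuration step before use.
evaluation_configuration = {
    "name": "medicationqa-benchmark",
    "dataset_artifact": "<artifact-id-of-medicationqa-test-set>",
    # One registry ID per model to compare; each ID points to an orchestration
    # configuration with a different foundation model assigned.
    "orchestration_registry_ids": [
        "<registry-id-model-a>",
        "<registry-id-model-b>",
    ],
    "metrics": "Content Filter on Input,Pointwise Instruction Following,Content Filter on Output",
}
```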

  • Step 9
  • Step 10

    After creating the Evaluation Configuration, the next step is to execute it.

    Execution triggers the evaluation workflow, which:

    - Reads the test dataset
    
    - Generates submissions to the orchestration service
    
    - Collects model outputs
    
    - Computes all selected metrics
    
    - Produces aggregate and raw evaluation results
    

    The process is identical for SAP AI Launchpad, Python, and Bruno, with only the invocation method differing.
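
In Python this typically amounts to a single authenticated POST that references the evaluation configuration ID. The sketch below uses the generic AI Core execution pattern as a placeholder; substitute the endpoint path and body documented for Custom Evaluation in the step instructions or the API reference.

```python
import requests

AI_API_URL = "https://<your-ai-api-host>"
TOKEN = "<oauth-access-token>"
RESOURCE_GROUP = "default"
CONFIGURATION_ID = "<evaluation-configuration-id>"

# Hypothetical invocation sketch: the actual evaluation-execution endpoint may differ,
# so replace the path below with the one documented for Custom Evaluation.
resp = requests.post(
    f"{AI_API_URL}/v2/lm/executions",   # placeholder path
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "AI-Resource-Group": RESOURCE_GROUP,
        "Content-Type": "application/json",
    },
    json={"configurationId": CONFIGURATION_ID},
)
resp.raise_for_status()
execution_id = resp.json().get("id")
print("Started evaluation execution:", execution_id)
```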

  • Step 11

    Once the evaluation execution is complete, SAP AI Core generates both aggregated metrics and detailed instance-level results.
    These results help compare model performance, understand quality metrics, and debug issues.
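
Once the output artifacts are available in the object store, the instance-level results can be inspected with standard tooling. The sketch below assumes a results file has already been downloaded locally; the file name, column names, and metric key are hypothetical and must be adapted to the actual output artifact.

```python
import pandas as pd

# Assumed local copy of the instance-level results produced by the evaluation run.
# File name and column names are hypothetical; adapt them to the actual artifact.
results = pd.read_json("evaluation_results.json")

# Aggregate a judge metric per model to compare runs side by side.
summary = (
    results.groupby("model_name")["pointwise_instruction_following"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)

# Drill into low-scoring instances for debugging.
low = results[results["pointwise_instruction_following"] <= 2]
print(low[["prompt", "response", "explanation"]].head())
```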

  • Step 12

    Over time, your workspace may accumulate old configurations, executions, and metrics.
    SAP AI Core allows you to safely delete these resources once they are no longer needed.

    This section explains how to delete:

    - Evaluation Executions
    
    - Evaluation Configurations
    

    ⚠️ Important:

    Deletions are permanent and cannot be undone.
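
Programmatic cleanup follows the usual AI Core pattern of one DELETE call per resource, removing the execution before its configuration. The endpoint paths below are placeholders; use the evaluation execution and configuration endpoints documented for Custom Evaluation.

```python
import requests

AI_API_URL = "https://<your-ai-api-host>"
TOKEN = "<oauth-access-token>"
HEADERS = {
    "Authorization": f"Bearer {TOKEN}",
    "AI-Resource-Group": "default",
}

# Placeholder paths: substitute the documented Custom Evaluation endpoints.
# Deletion is permanent and cannot be undone.
for url in (
    f"{AI_API_URL}/v2/lm/executions/<execution-id>",
    f"{AI_API_URL}/v2/lm/configurations/<configuration-id>",
):
    resp = requests.delete(url, headers=HEADERS)
    print(url, resp.status_code)
```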
