
Using Evaluation Service available in SAP AI Core

This tutorial demonstrates how to use SAP AI Core Custom Evaluation to benchmark Large Language Models (LLMs) using two different approaches: **Prompt Registry** and **Orchestration Registry**. It guides you through dataset preparation, environment setup, configuration creation, execution, and result analysis in a unified and simplified workflow.
You will learn
  • How to prepare and organize datasets for evaluation.
  • How to choose between Prompt Registry and Orchestration Registry approaches.
  • How to configure and run evaluations in SAP AI Core.
  • How to analyze and interpret aggregated evaluation results.
Created by
Smita Naik (I321506)
January 16, 2026
Contributors
I321506

Prerequisites

  • Setup Environment:
    Ensure your instance and AI Core credentials are properly configured according to the steps provided in the initial tutorial.
  • Orchestration Deployment:
    Ensure at least one orchestration deployment is ready to be consumed during this process.
    Refer to this tutorial to understand the basic consumption of GenAI models using orchestration.
  • Basic Knowledge: Familiarity with the orchestration workflow is recommended.
  • Install Dependencies: Install the required Python packages using the requirements.txt file provided.
    Download requirements.txt
    πŸ’‘ Right-click the link above and choose β€œSave link as…” to download it directly.

This tutorial extends the Quick Start tutorial and is intended for Application Developers and Data Scientists who are already familiar with the basics of GenAI workflows in SAP AI Core.

Below are the steps to run a GenAI evaluation in SAP AI Core.

Pre-Read

The structure of the input data should be as follows:

Root
β”œβ”€β”€ PUT_YOUR_PROMPT_TEMPLATE_HERE
β”‚   └── prompt_template.json
β”œβ”€β”€ PUT_YOUR_DATASET_HERE
β”‚   └── medicalqna_dataset.csv
└── PUT_YOUR_CUSTOM_METRIC_HERE
    β”œβ”€β”€ custom-llm-metric.json
    └── custom-llm-metric.jsonl
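Before uploading anything, it can help to sanity-check that your local folders match this layout. A minimal sketch (the folder names come from the tree above; the file-extension rules mirror the descriptions below, and the optional-metrics assumption is mine):

```python
from pathlib import Path

# Expected layout from the tree above; accepted extensions per folder.
EXPECTED = {
    "PUT_YOUR_PROMPT_TEMPLATE_HERE": [".json"],
    "PUT_YOUR_DATASET_HERE": [".csv", ".json", ".jsonl"],
    "PUT_YOUR_CUSTOM_METRIC_HERE": [".json", ".jsonl"],
}

def check_layout(root: str) -> list[str]:
    """Return a list of problems; an empty list means the layout looks OK."""
    problems = []
    for folder, suffixes in EXPECTED.items():
        d = Path(root) / folder
        if not d.is_dir():
            problems.append(f"missing folder: {folder}")
            continue
        files = [p for p in d.iterdir() if p.suffix in suffixes]
        # Custom metrics are optional, so an empty metrics folder is fine.
        if not files and folder != "PUT_YOUR_CUSTOM_METRIC_HERE":
            problems.append(f"no usable files in: {folder}")
    return problems
```

Running `check_layout(".")` from the repository root prints nothing to fix when all three folders are present and populated.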

Dataset and Configuration:
To run this evaluation, all required input files must be placed inside the folder structure provided in the repository.

You can download or clone the complete folder from the link below and place your files inside the respective folders: Download / Open Full Folder Structure

1.  **Prompt Template Configuration (`PUT_YOUR_PROMPT_TEMPLATE_HERE`)**
    *   Place one or more prompt template configurations as JSON files in this folder. 
2.  **Test Dataset (`PUT_YOUR_DATASET_HERE`)**
    *   The test dataset should be a CSV, JSON, or JSONL file containing prompt variables, ground truth references, and other data required for evaluation. 
3.  **Custom Metrics (`PUT_YOUR_CUSTOM_METRIC_HERE`)**
    *   (Optional) You can provide custom metric definitions in a single JSON or JSONL file. For JSONL, each line should be a JSON object defining one metric. For JSON, it should be an array of metric-definition objects. 
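To make the file formats concrete, here is a small sketch that writes one dataset row and one custom-metric line in the shapes described above. The column names (`question`, `reference_answer`) and the metric fields are illustrative placeholders, not a required schema — use whichever variables your prompt template and metrics actually expect:

```python
import csv
import json

# Hypothetical dataset row; column names are placeholders for the
# prompt variables and ground-truth reference your evaluation uses.
rows = [
    {"question": "What is hypertension?",
     "reference_answer": "Persistently elevated blood pressure."},
]

with open("medicalqna_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)

# Custom metrics as JSONL: one JSON object per line, each defining one metric.
metrics = [
    {"name": "clinical_accuracy", "description": "Scores medical correctness."},
]
with open("custom-llm-metric.jsonl", "w") as f:
    for m in metrics:
        f.write(json.dumps(m) + "\n")
```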
  • Step 1
  • Step 2

    ⚠️ Important Note (Must Read)

    • You must create an object store secret with a user-defined name (e.g., default) to store output artifacts from orchestration runs. This is mandatory.
    • For input artifacts, you may create additional object store secrets with different names if needed.
    • If a user-defined name (e.g., default) is not configured, orchestration runs will fail because no output target is set up.
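As a sketch of what such a secret looks like, the helper below builds a request body for the AI API's `POST /v2/admin/objectStoreSecrets` endpoint. The field names follow the AI API object-store-secret schema as I understand it; the bucket, endpoint, region, and path prefix values are placeholders you must replace with your own S3 details:

```python
def build_object_store_secret(name: str, bucket: str,
                              access_key: str, secret_key: str) -> dict:
    """Body for POST {ai_api_url}/v2/admin/objectStoreSecrets (a sketch;
    verify field names against your AI API version). All values below
    except `name` are placeholders."""
    return {
        "name": name,  # must match the name your runs expect, e.g. "default"
        "type": "S3",
        "bucket": bucket,
        "endpoint": "s3.amazonaws.com",   # placeholder
        "region": "eu-central-1",          # placeholder
        "pathPrefix": "evaluation",        # placeholder
        "data": {
            "AWS_ACCESS_KEY_ID": access_key,
            "AWS_SECRET_ACCESS_KEY": secret_key,
        },
    }

payload = build_object_store_secret("default", "my-bucket",
                                    "<access-key>", "<secret-key>")
```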
  • Step 3
  • Step 4

    In this evaluation workflow, you can provide prompts in two different ways.
    Choose only one option based on your requirement.

    Here are your two options:

    | Option | Approach | Description | When to Use |
    | --- | --- | --- | --- |
    | Option 1 | Prompt Template + Model Directly | Prompt stored in Prompt Registry and model referenced directly. | When you want reusable, versioned prompts. |
    | Option 2 | Orchestration Registry (Inline Prompt) | Prompt provided as part of the orchestration config. | When the prompt is ad hoc or not reused. |

    After selecting your option:

    - Follow only the steps for that option.
    
    - Skip the other options.
    
    - After completing your selected option, go directly to Create Evaluation Configuration.
    
  • Step 5

    βœ” Follow this step ONLY IF you want to use Prompt Template.

    If not, skip this step and go to Option 2.

    πŸ”‘ Tip: Always increment the version (e.g., 1.0.1, 1.0.2) when updating a template. This ensures reproducibility across evaluations.
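The version bump from the tip is just a patch increment on a `major.minor.patch` string. A minimal helper, in case you script your template updates (the function name is mine):

```python
def bump_patch(version: str) -> str:
    """Increment the patch part of a semantic version, e.g. '1.0.1' -> '1.0.2'."""
    major, minor, patch = version.split(".")
    return f"{major}.{minor}.{int(patch) + 1}"
```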

  • Step 6

    Follow this step only if you want to store prompt + model configuration inside Orchestration Registry.

    Create Orchestration Registry Configuration

    After completing Option 2:

    - Proceed directly to the β€œCreate Evaluation Configuration” section
    
  • Step 7

    Metrics determine how your model outputs are evaluated during an evaluation run. They define the scoring logic that SAP AI Core uses to compare models, measure quality, and validate improvements over time.

    In SAP AI Core, you can use:

    - System-defined metrics (ready-made, no setup needed)
    
    - Custom metrics (your own definitions stored in the metric registry)
    

    How Metrics Apply in Each Approach

    | Approach | How Metrics Apply |
    | --- | --- |
    | Option 1 – Prompt Template | Metrics score responses generated using the prompt template + selected model. |
    | Option 2 – Orchestration Registry | Metrics score responses generated through the orchestration configuration. |

    Metrics are provided later during Create Evaluation Configuration:

```json
"metrics": "BERT, answer_relevance"
```
    

    You can specify one or multiple metrics (comma-separated).
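If you keep your metric choices in a list, producing that comma-separated string is a one-liner (a trivial sketch; the function name is mine):

```python
def metrics_field(names: list[str]) -> str:
    """Join metric names into the comma-separated string that the
    evaluation configuration's "metrics" field expects."""
    return ",".join(n.strip() for n in names)
```

For example, `metrics_field(["BERT Score", "answer_relevance"])` yields `"BERT Score,answer_relevance"`.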

    Types of Metrics

    1. System-defined Metrics

    These come in two categories:

    Computed Metrics

    Score outputs using reference data or validation logic.

    | Metric | Description | Needs Reference? |
    | --- | --- | --- |
    | BERT Score | Embedding similarity to reference | Yes |
    | BLEU | N-gram overlap | Yes |
    | ROUGE | Recall-based overlap | Yes |
    | Exact Match | Checks if output exactly matches reference | Yes |
    | JSON Schema Match | Validates output against a schema | Yes |
    | Language Match | Detects language | No |
    | Content Filter | Safety filter triggered (input/output) | No |

    2. LLM-as-a-Judge Metrics

    These metrics use a judge LLM to score responses based on a rubric.
    They are ideal for open-ended tasks with no exact references.

    | Metric | What It Measures | Needs Reference? |
    | --- | --- | --- |
    | Instruction Following | How well the prompt was followed | No |
    | Correctness | Factual accuracy | Yes |
    | Answer Relevance | Relevance of the generated answer | No |
    | Conciseness | Brevity + clarity | No |
    | RAG Groundedness | Grounding in the provided context | No |
    | RAG Context Relevance | Usefulness of retrieved context | No |

    Custom Metrics

    Create them when system metrics are insufficient.

    Two ways to define custom metrics:

    1. Structured metrics (recommended)

    - Provide task, criteria, rubric, optional examples
    
    - AI Core constructs the judge prompt
    

    2. Free-form metrics

    - You define prompts and scoring logic manually
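A structured custom metric bundles the inputs named above (task, criteria, rubric, optional examples) into one definition. A sketch of such a definition as a Python dict — the exact field names are my assumption, so check the metric-registry schema for the real ones:

```python
import json

# Illustrative structured custom metric; field names are assumptions,
# the content (task, criteria, rubric, examples) mirrors the text above.
metric = {
    "name": "clinical_accuracy",
    "task": "Evaluate answers to medical questions.",
    "criteria": "The answer must be factually correct and clinically safe.",
    "rubric": {
        "1": "incorrect or unsafe",
        "3": "partially correct",
        "5": "correct and safe",
    },
    "examples": [],  # optional few-shot examples for the judge LLM
}
body = json.dumps(metric)
```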
    

    Custom metric registration:

```bash
POST {{ai_api_url}}/v2/lm/evaluationMetrics
```
    

    Once registered, use them like system metrics:

```json
"metrics": "my_custom_metric"
```
    

    Example β€” Prompt Template Approach

```json
"metrics": "BERT Score,answer_relevance"
```
    

    Example β€” Orchestration Registry Approach

```json
"metrics": "Pointwise Conciseness"
```
    

    The chosen metrics determine:

    - scoring
    
    - dashboard visualizations
    
    - aggregated results
    
    - model ranking logic
    
  • Step 8

    Metrics must be supplied before creating an Evaluation Configuration.
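Configurations in SAP AI Core are created via the AI API's `POST /v2/lm/configurations` endpoint, where the metrics string goes into a parameter binding. A sketch of the request body — the `parameterBindings` / `inputArtifactBindings` shape follows the AI API configuration schema, while the scenario and executable IDs for the evaluation service are placeholders you must take from your own tenant:

```python
def build_evaluation_configuration(name: str, metrics: str,
                                   scenario_id: str, executable_id: str) -> dict:
    """Body for POST {ai_api_url}/v2/lm/configurations (a sketch).

    The scenario/executable IDs and the artifact bindings are
    placeholders; only the overall shape is shown here.
    """
    return {
        "name": name,
        "scenarioId": scenario_id,
        "executableId": executable_id,
        "parameterBindings": [
            {"key": "metrics", "value": metrics},  # e.g. "BERT Score,answer_relevance"
        ],
        "inputArtifactBindings": [],  # bind your dataset/prompt artifacts here
    }

cfg = build_evaluation_configuration("medqna-eval",
                                     "BERT Score,answer_relevance",
                                     "<scenario-id>", "<executable-id>")
```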

  • Step 9
  • Step 10

    After creating the Evaluation Configuration, the next step is to execute it.
    Execution triggers the evaluation workflow, which:

    - Reads the test dataset
    
    - Generates submissions to the orchestration service
    
    - Collects model outputs
    
    - Computes all selected metrics
    
    - Produces aggregate and raw evaluation results
    

    The process is identical for SAP AI Launchpad, Python, and Bruno, with only the invocation method differing.
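From Python, triggering the run amounts to a `POST /v2/lm/executions` call referencing the configuration. A hedged sketch — the endpoint and `configurationId` body field follow the AI API, while the host, resource group, and token are placeholders:

```python
AI_API_URL = "https://<your-ai-api-host>"   # placeholder
RESOURCE_GROUP = "default"                  # placeholder

def execution_request(configuration_id: str) -> tuple[str, dict, dict]:
    """URL, headers, and body for POST /v2/lm/executions, which starts a
    run for an existing configuration. The bearer token is a placeholder."""
    url = f"{AI_API_URL}/v2/lm/executions"
    headers = {
        "Authorization": "Bearer <token>",
        "AI-Resource-Group": RESOURCE_GROUP,
        "Content-Type": "application/json",
    }
    body = {"configurationId": configuration_id}
    return url, headers, body

url, headers, body = execution_request("<configuration-id>")
# requests.post(url, headers=headers, json=body)  # uncomment to actually run
```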

  • Step 11

    Once the evaluation execution is complete, SAP AI Core generates both aggregated metrics and detailed instance-level results.
    These results help compare model performance, understand quality metrics, and debug issues.

  • Step 12

    Over time, your workspace may accumulate old configurations, executions, and metrics.
    SAP AI Core allows you to safely delete these resources once they are no longer needed.

    This section explains how to delete:

    - Evaluation Executions
    
    - Evaluation Configurations
    
    - Custom Metrics (if created)
    

    ⚠️ Important:

    Deletions are permanent and cannot be undone.
    System-defined metrics cannot be deleted β€” only your custom metrics.
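For scripted cleanup, the resources above map to `DELETE` calls. A sketch of the target URLs — the executions path matches the AI API, while the configuration and custom-metric paths are inferred from the endpoints used earlier in this tutorial and may differ on your tenant, so verify before deleting:

```python
def delete_urls(ai_api_url: str, execution_id: str,
                configuration_id: str, metric_id: str) -> list[str]:
    """DELETE targets for cleanup (a sketch; verify the configuration
    and evaluationMetrics paths against your AI API version)."""
    return [
        f"{ai_api_url}/v2/lm/executions/{execution_id}",
        f"{ai_api_url}/v2/lm/configurations/{configuration_id}",
        f"{ai_api_url}/v2/lm/evaluationMetrics/{metric_id}",
    ]
```

Remember that these deletions are permanent, so double-check IDs before issuing the requests.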
