Custom Evaluation for Generative AI – Comprehensive Guide
This tutorial demonstrates how to use SAP AI Core Custom Evaluation to benchmark Large Language Models (LLMs) using the **Orchestration Registry**. It guides you through environment setup, configuration creation, execution, and result analysis in a unified and simplified workflow. It extends the Quick Start tutorial and is intended for Application Developers and Data Scientists who already know the basics of GenAI workflows in SAP AI Core.
You will learn
- How to prepare and organize datasets for evaluation.
- How to configure and run evaluations in SAP AI Core.
- How to analyze and interpret aggregated evaluation results.
Prerequisites
- BTP Account
  Set up your SAP Business Technology Platform (BTP) account.
  Create a BTP Account
- For SAP Developers or Employees
  Internal SAP stakeholders should refer to the following documentation: How to create BTP Account For Internal SAP Employee, SAP AI Core Internal Documentation
- For External Developers, Customers, or Partners
  Follow this tutorial to set up your environment and entitlements: External Developer Setup Tutorial, SAP AI Core External Documentation
- Create BTP Instance and Service Key for SAP AI Core
  Follow the steps to create an instance and generate a service key for SAP AI Core:
  Create Service Key and Instance
  (This service key is exercised in the sanity-check sketch after this list.)
- AI Core Setup Guide
  Step-by-step guide to set up and get started with SAP AI Core:
  AI Core Setup Tutorial
  An Extended SAP AI Core service plan is required, as the Generative AI Hub is not available in the Free or Standard tiers. For more details, refer to SAP AI Core Service Plans.
- Orchestration Deployment
  Ensure at least one orchestration deployment is ready to be consumed during this process.
  Refer to this tutorial to understand the basic consumption of GenAI models using orchestration; the sanity-check sketch after this list also sends one test request to such a deployment.
- Basic Knowledge
  Familiarity with the orchestration workflow is recommended.
- Install Dependencies
  Install the required Python packages using the requirements.txt file provided, for example with `pip install -r requirements.txt`.
  Download requirements.txt
  💡 Right-click the link above and choose “Save link as…” to download it directly.
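
Before moving on, you can sanity-check the service key and the orchestration deployment together. The sketch below is a minimal, unofficial example, not the official client: it assumes a service key JSON with the usual `url`, `clientid`, and `clientsecret` fields and the URL of a running orchestration deployment; the model name, template, and placeholder values are illustrative assumptions, not fixed requirements.

```python
# Minimal sanity check for the prerequisites above (illustrative sketch).
# It exchanges the service-key credentials for an OAuth token, then sends one
# completion request to a running orchestration deployment. All angle-bracket
# values are placeholders from YOUR service key and deployment; the model name
# and template below are assumptions for this demo.
import requests

AUTH_URL = "<url from the uaa section of your service key>"  # XSUAA endpoint
CLIENT_ID = "<clientid from your service key>"
CLIENT_SECRET = "<clientsecret from your service key>"
DEPLOYMENT_URL = "<URL of your running orchestration deployment>"
RESOURCE_GROUP = "default"  # adjust if you use a different resource group

# 1. Client-credentials flow: trade the service key for a bearer token.
token_resp = requests.post(
    f"{AUTH_URL}/oauth/token",
    data={"grant_type": "client_credentials"},
    auth=(CLIENT_ID, CLIENT_SECRET),
    timeout=30,
)
token_resp.raise_for_status()
access_token = token_resp.json()["access_token"]

# 2. One orchestration completion call: an LLM module plus a templating
#    module with a single user message containing a {{?question}} placeholder.
payload = {
    "orchestration_config": {
        "module_configurations": {
            "llm_module_config": {"model_name": "gpt-4o", "model_params": {}},
            "templating_module_config": {
                "template": [{"role": "user", "content": "{{?question}}"}]
            },
        }
    },
    "input_params": {"question": "What does ibuprofen treat?"},
}
resp = requests.post(
    f"{DEPLOYMENT_URL}/completion",
    headers={
        "Authorization": f"Bearer {access_token}",
        "AI-Resource-Group": RESOURCE_GROUP,
    },
    json=payload,
    timeout=60,
)
resp.raise_for_status()
# The generated answer is expected under orchestration_result.choices.
print(resp.json()["orchestration_result"]["choices"][0]["message"]["content"])
```

If this prints an answer, both the service key and the orchestration deployment are ready for the evaluation steps that follow.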
Pre-Read
This tutorial showcases how you can use SAP AI Core Custom Evaluation to benchmark your large language models and evaluate orchestration configurations or prompts for your use case.
It uses the publicly available MedicationQA dataset, which consists of commonly asked consumer questions about medications. The workload computes industry-standard metrics to check the reliability of the responses generated by the LLM.
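
To make the dataset shape concrete before the setup steps, the sketch below shows one way such question/answer pairs could be organized for evaluation. The column names, sample rows, and file name are illustrative assumptions, not a format mandated by Custom Evaluation; align them with what your evaluation configuration expects.

```python
# Illustrative layout for an evaluation dataset (assumed columns and file
# name). Each row pairs a consumer medication question with a reference
# answer that the computed metrics can score the LLM response against.
import pandas as pd

rows = [
    {
        "question": "What are the common side effects of ibuprofen?",
        "reference_answer": "Common side effects include upset stomach, "
                            "nausea, heartburn, and dizziness.",
    },
    {
        "question": "Can I take antihistamines with alcohol?",
        "reference_answer": "Combining antihistamines with alcohol can "
                            "increase drowsiness and is generally discouraged.",
    },
]

df = pd.DataFrame(rows)
df.to_csv("medication_qa_eval.csv", index=False)  # hypothetical file name
print(df)
```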