Generative AI Custom Evaluation - Quickstart
This tutorial demonstrates how to use SAP AI Core Custom Evaluation to benchmark Large Language Models (LLMs) using **Prompt Registry**. It guides you through environment setup, configuration creation, execution, and result analysis in a unified and simplified workflow.
You will learn
- How to prepare and organize datasets for evaluation.
- How to configure and run evaluations in SAP AI Core.
- How to analyze and interpret aggregated evaluation results.
Prerequisites
- BTP Account
  Set up your SAP Business Technology Platform (BTP) account.
  Create a BTP Account
  - For SAP Developers or Employees
    Internal SAP stakeholders should refer to the following documentation: How to create BTP Account For Internal SAP Employee, SAP AI Core Internal Documentation
  - For External Developers, Customers, or Partners
    Follow this tutorial to set up your environment and entitlements: External Developer Setup Tutorial, SAP AI Core External Documentation
- Create BTP Instance and Service Key for SAP AI Core
  Follow the steps to create an instance and generate a service key for SAP AI Core:
  Create Service Key and Instance
- AI Core Setup Guide
  Step-by-step guide to set up and get started with SAP AI Core:
  AI Core Setup Tutorial
- An Extended SAP AI Core service plan is required, as the Generative AI Hub is not available in the Free or Standard tiers. For more details, refer to SAP AI Core Service Plans
- Orchestration Deployment
  Ensure at least one orchestration deployment is ready to be consumed during this process.
  Refer to this tutorial to understand the basic consumption of GenAI models using orchestration.
- Basic Knowledge
  Familiarity with the orchestration workflow is recommended.
- Install Dependencies
  Install the required Python packages using the requirements.txt file provided.
  Download requirements.txt
  💡 Right-click the link above and choose “Save link as…” to download it directly.
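Once the packages are installed (for example with `pip install -r requirements.txt`), it can help to confirm that nothing is missing before starting the evaluation. The following is a minimal sketch, not part of the tutorial's own tooling: it assumes requirements.txt sits in the current working directory and contains plain package specifiers, and the helper names `parse_requirement_names` and `find_missing` are hypothetical. It only checks that each package is present; it does not verify version pins.

```python
from importlib import metadata
from pathlib import Path
import re


def parse_requirement_names(text):
    """Return the distribution names listed in requirements.txt-style text."""
    names = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as -r or -e
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def find_missing(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumption: the downloaded file is in the current working directory.
    req_file = Path("requirements.txt")
    if req_file.exists():
        missing = find_missing(parse_requirement_names(req_file.read_text()))
        print("Missing packages:", ", ".join(missing) if missing else "none")
    else:
        print("requirements.txt not found in the current directory")
```

If the script reports missing packages, rerunning the install command in the same Python environment is usually enough to resolve it.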
Pre-Read
This tutorial is intended for users who are new to AI Core services and do not need to customize the evaluation for their use case. The evaluation is set up automatically; the only input you must provide is the dataset.
It demonstrates a simplified quick-start workflow that uses AI Core’s custom evaluation capabilities to benchmark Large Language Models (LLMs) and to compare different prompts for a specific use case. It uses the public MedicationQA dataset to show how to compute industry-standard metrics and assess the reliability of LLM-generated responses.
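Because the dataset is the one input you supply yourself, it is worth checking that every row is complete before uploading it. The sketch below is illustrative only: the field names `question`, `answer`, and `reference`, the JSON Lines output, and the helper `to_jsonl` are all assumptions, not the format this tutorial or SAP AI Core prescribes, and the sample rows are made-up placeholders rather than MedicationQA content.

```python
import json


def to_jsonl(records, question_key="question", answer_key="answer"):
    """Serialize question/answer records as JSON Lines, skipping incomplete rows."""
    lines = []
    for record in records:
        question = (record.get(question_key) or "").strip()
        answer = (record.get(answer_key) or "").strip()
        if not question or not answer:
            continue  # a row without a question or a reference answer cannot be scored
        # Hypothetical output field names; adjust to whatever your evaluation expects.
        lines.append(
            json.dumps({"question": question, "reference": answer}, ensure_ascii=False)
        )
    return "\n".join(lines)


# Made-up placeholder rows, not taken from MedicationQA.
sample = [
    {"question": "What is drug X used for?", "answer": "Placeholder reference answer."},
    {"question": "Incomplete row", "answer": ""},
]
print(to_jsonl(sample))
```

Dropping incomplete rows up front keeps the aggregated metrics from being skewed by entries that have no reference answer to compare against.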