
Using Multimodal Inputs with GPT-4o for Image Recognition on SAP AI Core

In this tutorial, you will learn how to consume the GPT-4o LLM deployed on SAP AI Core.
You will learn
  • How to run inference on GPT-4o with multimodal inputs on SAP AI Core
Created by Dhrubajyoti Paul (dhrubpaul), August 13, 2024. Last updated July 10, 2025.
Contributors: sharmaneeleshsap, dhrubpaul, I321506

Prerequisites

  • A BTP global account
    If you are an SAP developer or SAP employee, please refer to the following links (for internal SAP stakeholders only):
    How to create a BTP Account (internal)
    SAP AI Core
    If you are an external developer, customer, or partner, refer to this tutorial
  • AI Core setup and basic knowledge: Link to documentation
  • An AI Core instance with the Standard Plan or Extended Plan

Multimodality refers to the ability of a model to process and interpret different types of inputs, such as text, images, audio, or video. In the context of GPT-4o on SAP AI Core, multimodal input allows the model to understand and generate responses that incorporate both text and visual data. This enhances the model’s ability to perform complex tasks, such as scene detection, object recognition, and image analysis, by combining the strengths of both language processing and image recognition.
In this tutorial, we demonstrate these capabilities with GPT-4o, using a sample input and output for each task that you can adapt later to various use cases.
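All of the tasks in the steps below share one request shape: a chat-completions payload in which a single user message carries both a text part and an image part. The sketch below follows the OpenAI-style multimodal message format that GPT-4o deployments commonly expose; the exact field names and the example image URL are assumptions to verify against your deployment's documentation.

```python
# Minimal sketch of a multimodal chat-completions payload for GPT-4o.
# The image URL is a placeholder; substitute a publicly reachable image.
import json


def build_multimodal_payload(prompt: str, image_url: str, max_tokens: int = 300) -> dict:
    """Combine a text prompt and an image reference in one user message."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


payload = build_multimodal_payload(
    "Describe this scene.",
    "https://example.com/street.jpg",  # placeholder image URL
)
print(json.dumps(payload, indent=2))
```

The same builder is reused in every step; only the prompt text and the image change.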

  • Step 1

    In this step, we demonstrate how to use GPT-4o to describe a scene depicted in an image. By providing both text and an image URL as input, the model generates a descriptive response that captures the key elements of the scene. This capability is particularly useful for applications such as automated content tagging, visual storytelling, and enhancing the user experience on multimedia platforms.

    Follow the steps below to replicate scene detection using GPT-4o.

    For more information on the model, refer to Hello GPT-4o
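A scene-description call could be sketched as below. The deployment URL, `api-version` value, and resource group are placeholders (assumptions): take the real host and deployment ID from your AI Core deployment, and the token from your AI Core service key. No request is sent when the snippet runs as-is.

```python
# Hedged sketch of a scene-description request to a GPT-4o deployment on
# SAP AI Core. URL, api-version, and resource group are placeholders.
import json
import urllib.request

CHAT_URL = (
    "https://<ai-core-host>/v2/inference/deployments/<deployment-id>"
    "/chat/completions?api-version=2023-05-15"  # path and api-version are assumptions
)


def make_headers(token: str, resource_group: str = "default") -> dict:
    # Typical headers for an AI Core inference call: a bearer token from the
    # service key plus the resource group the deployment belongs to.
    return {
        "Authorization": f"Bearer {token}",
        "AI-Resource-Group": resource_group,
        "Content-Type": "application/json",
    }


def describe_scene(token: str, image_url: str) -> str:
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the scene in this image."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": 300,
    }
    req = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=make_headers(token),
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.loads(resp.read())
    # Chat-completions responses carry the answer under choices[0].message.content.
    return body["choices"][0]["message"]["content"]


headers = make_headers("dummy-token")  # illustration only; no request is sent here
```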

  • Step 2

    This step focuses on identifying and labeling objects within an image. The multimodal input allows GPT-4o to analyze the visual data and generate a list of objects detected in the scene. Object detection is crucial for tasks such as inventory management, autonomous driving, and augmented reality applications.

    Follow the steps below to replicate object detection using GPT-4o.

    For more information on the model, refer to Hello GPT-4o
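For object detection it helps to ask GPT-4o for a machine-readable answer. The prompt below requests a JSON array of object names, and the helper parses the reply, tolerating the markdown code fence the model sometimes wraps around JSON. The sample response string is fabricated for illustration; a real reply would come from your AI Core deployment.

```python
# Sketch: asking GPT-4o for detected objects as JSON, then parsing the reply.
import json

OBJECT_PROMPT = (
    "List every distinct object you can identify in this image. "
    "Answer with a JSON array of lowercase object names only."
)


def parse_object_list(message_content: str) -> list:
    """Parse the model's JSON answer, tolerating a surrounding code fence."""
    text = message_content.strip()
    if text.startswith("```"):
        # Strip a markdown fence such as ```json ... ``` the model may add.
        text = text.strip("`").removeprefix("json").strip()
    return json.loads(text)


# Fabricated example of what the assistant message might contain:
sample = '```json\n["car", "bicycle", "traffic light"]\n```'
print(parse_object_list(sample))  # -> ['car', 'bicycle', 'traffic light']
```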

  • Step 3

    Here, the tutorial demonstrates how GPT-4o can be used to interpret and analyze data presented in graphical form. By combining text and image input, the model can extract meaningful insights from charts, graphs, and other visual data representations. This step is valuable for data analysis, reporting, and decision-making processes.

    Follow the steps below to replicate graph analysis using GPT-4o.

    For more information on the model, refer to Hello GPT-4o
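For chart analysis, only the prompt changes; the response shape stays the same. The snippet below shows one possible analysis prompt and a helper that pulls the assistant's answer out of an OpenAI-style chat-completions response body. The response dict is a fabricated example of that shape, not real model output.

```python
# Sketch: a chart-analysis prompt plus a helper to extract the answer
# from a chat-completions response body.
CHART_PROMPT = (
    "This image contains a chart. Summarize the trend it shows and "
    "report the highest and lowest values you can read from it."
)


def extract_answer(response_body: dict) -> str:
    # Chat-completions responses carry the text under choices[0].message.content.
    return response_body["choices"][0]["message"]["content"]


# Fabricated response body, for illustration only:
fake_response = {
    "choices": [
        {"message": {"role": "assistant",
                     "content": "Revenue rises steadily; peak ~120, low ~35."}}
    ]
}
print(extract_answer(fake_response))
```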

  • Step 4

    In this step, we explore how GPT-4o handles mathematical problems that involve both textual descriptions and visual data. The model can solve equations, interpret mathematical expressions in images, and provide detailed explanations of its reasoning. This capability is useful in educational tools, scientific research, and engineering applications.

    Follow the steps below to replicate mathematical operations using GPT-4o.

    For more information on the model, refer to Hello GPT-4o
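Equations are often screenshots on your local machine rather than hosted images. Besides public URLs, GPT-4o accepts images inline as base64 data URLs, so a local file can be embedded directly in the request. The helper below builds such a URL; the few PNG magic bytes stand in for a real image file.

```python
# Sketch: embedding a local image as a base64 data URL for GPT-4o input.
import base64


def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


fake_png = b"\x89PNG\r\n\x1a\n"  # PNG magic bytes only, for illustration
url = to_data_url(fake_png)
print(url[:30])

# The data URL slots into the same content structure as a web URL:
math_content = [
    {"type": "text",
     "text": "Solve the equation shown in this image and explain each step."},
    {"type": "image_url", "image_url": {"url": url}},
]
```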

  • Step 5

    The final step focuses on converting visual information into text. By providing an image as input, GPT-4o generates a textual description or transcription of the content. This step is particularly beneficial for accessibility tools, content creation, and archiving visual data.

    Follow the steps below to replicate Optical Character Recognition (OCR) using GPT-4o.

    For more information on the model, refer to Hello GPT-4o
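An OCR-style request can reuse the same payload shape with a transcription prompt. The `"detail": "high"` option in the image part asks the model to process the image at higher resolution, which tends to help with small text; whether your deployment honors it is something to verify against its documentation. The example URL is a placeholder.

```python
# Sketch: an OCR request body asking for a verbatim transcription.
def build_ocr_payload(image_url: str) -> dict:
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text visible in this image verbatim. "
                         "Preserve line breaks; do not paraphrase."},
                {"type": "image_url",
                 # "detail": "high" hints at higher-resolution processing.
                 "image_url": {"url": image_url, "detail": "high"}},
            ],
        }],
        "max_tokens": 500,
    }


ocr_payload = build_ocr_payload("https://example.com/receipt.jpg")  # placeholder URL
```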
