
Ingest Live Data into your House Price Predictor with SAP AI Core

Requires Customer/Partner License
Build data pipelines and reuse code to train and generate models on different datasets.
You will learn
  • How to create placeholders for datasets in your code and associated AI workflow.
  • How to register datasets stored in AWS S3 to SAP AI Core.
  • How to use datasets with placeholders.
  • How to generate models and store them in AWS S3 for later use.
Dhrubajyoti Paul, November 17, 2022
Created by
helenaaaaaaaaaa
June 7, 2022
Contributors
LunaticMaestro
helenaaaaaaaaaa
maximilianone

Prerequisites

  • You have knowledge on connecting code to AI workflows of SAP AI Core.
  • You have created your first pipeline with SAP AI Core, using this tutorial.

By the end of the tutorial you will have two models trained on two different datasets of house price data. It is possible to change the names of components and file paths mentioned in this tutorial, without breaking the functionality, unless stated explicitly.

IMPORTANT Before you start this tutorial with SAP AI Launchpad, it is recommended that you set up at least one other tool, either Postman or Python (SAP AI Core SDK), because some steps of this tutorial cannot be performed with SAP AI Launchpad.

  • Step 1

    Create a new directory named hello-aicore-data. The code differs from the previous tutorial in that it reads data from folders (volumes, virtual storage spaces). The contents of these volumes are loaded dynamically during workflow execution.

    Create a file named main.py, and paste the following snippet there:

    PYTHON
    Copy
    import os
    #
    # Variables
    DATA_PATH = '/app/data/train.csv'
    DT_MAX_DEPTH = int(os.getenv('DT_MAX_DEPTH'))
    MODEL_PATH = '/app/model/model.pkl'
    #
    # Load Datasets
    import pandas as pd
    df = pd.read_csv(DATA_PATH)
    X = df.drop('target', axis=1)
    y = df['target']
    #
    # Partition into Train and test dataset
    from sklearn.model_selection import train_test_split
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3)
    #
    # Init model
    from sklearn.tree import DecisionTreeRegressor
    clf = DecisionTreeRegressor(max_depth=DT_MAX_DEPTH)
    #
    # Train model
    clf.fit(train_x, train_y)
    #
    # Test model
    test_r2_score = clf.score(test_x, test_y)
    # Output will be available in logs of SAP AI Core.
    # Not the ideal way of storing/reporting metrics in SAP AI Core, but that is not the focus of this tutorial
    print(f"Test Data Score {test_r2_score}")
    #
    # Save model
    import pickle
    pickle.dump(clf, open(MODEL_PATH, 'wb'))
    
  • Step 2

    Your code reads the data file train.csv from the location /app/data, which will be prepared in a later step. It also reads the variable (hyper-parameter) DT_MAX_DEPTH from the environment variables. When generated, your model will be stored in the location /app/model/. You will also learn how to transport this model from SAP AI Core to your own cloud storage.

    Recommendation: Although the dataset file train.csv is not present locally, it will be copied dynamically during execution into the volume mounted at /app/data. It is recommended to pass the filename (train.csv) to your code through an environment variable, so that if your dataset filename changes you can set it dynamically.
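Following that recommendation, the hard-coded DATA_PATH in main.py could be assembled from environment variables instead. This is a minimal sketch; the variable names DATA_DIR and DATA_FILENAME are illustrative and not part of the tutorial's workflow:

```python
import os

# Hypothetical environment variables -- the names are illustrative.
# DATA_DIR matches the artifact path declared in the workflow;
# DATA_FILENAME lets you rename the dataset without rebuilding the image.
DATA_DIR = os.getenv('DATA_DIR', '/app/data')
DATA_FILENAME = os.getenv('DATA_FILENAME', 'train.csv')
DATA_PATH = os.path.join(DATA_DIR, DATA_FILENAME)
print(DATA_PATH)  # /app/data/train.csv when neither variable is set
```

With this pattern, switching to a file named feb.csv requires only a new environment variable value in the workflow, not a new Docker image.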


    Create a file named requirements.txt as shown below. If you don't specify a particular version, as shown here for pandas, the latest version of the package will be fetched automatically.

    TEXT
    Copy
    sklearn==0.0
    pandas
    

    Create a file called Dockerfile with following contents.

    This filename cannot be changed, and it has no file extension.

    TEXT
    Copy
    # Specify which base layers (default dependencies) to use
    # You may find more base layers at https://hub.docker.com/
    FROM python:3.7
    #
    # Creates directory within your Docker image
    RUN mkdir -p /app/src/
    # Don't place anything in the folders below yet, just create them
    RUN mkdir -p /app/data/
    RUN mkdir -p /app/model/
    #
    # Copies file from your Local system TO path in Docker image
    COPY main.py /app/src/
    COPY requirements.txt /app/src/  
    #
    # Installs dependencies within your Docker image
    RUN pip3 install -r /app/src/requirements.txt
    #
    # Enable permission to execute anything inside the folder app
    RUN chgrp -R 65534 /app && \
        chmod -R 777 /app
    

    IMPORTANT Your Dockerfile creates empty folders to store your datasets and models (/app/data and /app/model/ in the example above). Contents from cloud storage will be copied to and from these folders later. Any contents in these folders will be overwritten by the Docker image build.

    Build and upload your Docker image to Docker repository, using the following code in the terminal.

    BASH
    Copy
    docker build -t docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 .
    docker push docker.io/<YOUR_DOCKER_USERNAME>/house-price:03
    
  • Step 3

    Create a pipeline (YAML file) named house-price-train.yaml in your GitHub repository. Use the existing GitHub path which is already synced by your application of SAP AI Core.

    YAML
    Copy
    apiVersion: argoproj.io/v1alpha1
    kind: WorkflowTemplate
    metadata:
      name: data-pipeline # executable id, must be unique across all your workflows (YAML files)
      annotations:
        scenarios.ai.sap.com/description: "Learning how to ingest data to workflows"
        scenarios.ai.sap.com/name: "House Price (Tutorial)" # Scenario name should be the use case
        executables.ai.sap.com/description: "Train with live data"
        executables.ai.sap.com/name: "training" # Executable name should describe the workflow in the use case
        artifacts.ai.sap.com/housedataset.kind: "dataset" # Helps in suggesting the kind of inputs that can be attached.
      labels:
        scenarios.ai.sap.com/id: "learning-datalines"
        ai.sap.com/version: "1.0"
    spec:
      imagePullSecrets:
        - name: credstutorialrepo # your docker registry secret
      entrypoint: mypipeline
      templates:
      - name: mypipeline
        steps:
        - - name: mypredictor
            template: mycodeblock1
      - name: mycodeblock1
        inputs:
          artifacts:  # placeholder for cloud storage attachments
            - name: housedataset # a name for the placeholder
              path: /app/data/ # where to copy in the Dataset in the Docker image
        container:
          image: docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 # Your docker image name
          command: ["/bin/sh", "-c"]
          env:
            - name: DT_MAX_DEPTH # name of the environment variable inside Docker container
              value: "3" # will make it as variable later
          args:
            - "python /app/src/main.py"
    
  • Step 4

    This change to your workflow creates a placeholder through which you can specify a data path (volume) to the container (Docker image in execution).

    1. A placeholder named housedataset is created.
    2. You specify the kind of artifact that the placeholder can accept. Artifacts are covered in detail later in this tutorial.
    3. You use a placeholder to specify the path that you created in your Dockerfile, which is where you will copy files to your Docker image.

    Why do we need to create placeholders?

    SAP AI Core uses your workflows only as an interface, so it is unaware of the volumes/attachments specified in your Docker image. The data path is specified in your Dockerfile, the workflow declares a placeholder for that path, and the data is then supplied to the Docker image at execution time.

  • Step 5

    In your workflow, you set a static value for the environment variable DT_MAX_DEPTH. Let's turn it into a workflow variable.


    Replace the contents of the above AI workflow with this snippet.

    YAML
    Copy
    apiVersion: argoproj.io/v1alpha1
    kind: WorkflowTemplate
    metadata:
      name: data-pipeline # executable id, must be unique across all your workflows (YAML files)
      annotations:
        scenarios.ai.sap.com/description: "Learning how to ingest data to workflows"
        scenarios.ai.sap.com/name: "House Price (Tutorial)" # Scenario name should be the use case
        executables.ai.sap.com/description: "Train with live data"
        executables.ai.sap.com/name: "training" # Executable name should describe the workflow in the use case
        artifacts.ai.sap.com/housedataset.kind: "dataset" # Helps in suggesting the kind of inputs that can be attached.
      labels:
        scenarios.ai.sap.com/id: "learning-datalines"
        ai.sap.com/version: "1.0"
    spec:
      imagePullSecrets:
        - name: credstutorialrepo # your docker registry secret
      entrypoint: mypipeline
      arguments:
        parameters: # placeholder for string like inputs
            - name: DT_MAX_DEPTH # identifier local to this workflow
      templates:
      - name: mypipeline
        steps:
        - - name: mypredictor
            template: mycodeblock1
      - name: mycodeblock1
        # Add your resource plan here. The annotation should follow metadata > labels > ai.sap.com/resourcePlan: <plan>
        inputs:
          artifacts:  # placeholder for cloud storage attachments
            - name: housedataset # a name for the placeholder
              path: /app/data/ # where to copy in the Dataset in the Docker image
        container:
          image: docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 # Your docker image name
          command: ["/bin/sh", "-c"]
          env:
            - name: DT_MAX_DEPTH # name of the environment variable inside Docker container
              value: "{{workflow.parameters.DT_MAX_DEPTH}}" # value to set from local (to workflow) variable DT_MAX_DEPTH
          args:
            - "python /app/src/main.py"
    

    The following shows the new important lines in the workflows.


    Understanding these changes

    1. A placeholder named DT_MAX_DEPTH is created locally in the workflow. It accepts numeric content of input type string. The input is later type cast to an integer in your code, so it must be a string containing an integer. For example, "4" is acceptable because it is a string whose content can be type cast to an integer.
    2. You create an input env (environment) variable for your Docker image, named DT_MAX_DEPTH. The value of this variable is fed in from workflow.parameters.DT_MAX_DEPTH, the local name from the previous point.
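Because the workflow delivers DT_MAX_DEPTH as a string, the bare int(os.getenv('DT_MAX_DEPTH')) in main.py raises a TypeError if the variable is missing. A hedged sketch of a more defensive read (the helper name and the fallback value are illustrative, not part of the tutorial):

```python
import os

def read_int_env(name: str, default: int) -> int:
    """Read an environment variable and cast it to int, with a fallback default."""
    raw = os.getenv(name)
    if raw is None:
        return default  # variable not set in the workflow: use the fallback
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name}={raw!r} is not a string containing an integer")

DT_MAX_DEPTH = read_int_env('DT_MAX_DEPTH', default=3)
print(DT_MAX_DEPTH)
```

This fails with a clear message on a bad value like "three", instead of an opaque traceback in the execution logs.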

    Commit the changes in GitHub.

  • Step 6

    Add the following snippet to your workflow to specify a resource plan. The resource plan specifies the computing resources (GPU, RAM and processor) required to run your Docker image. If not specified, the resource plan defaults to starter, the entry-level plan.

    BASH
    Copy
    spec:
        ...
        templates:
        ...
        - name: mycodeblock1
          metadata:
            labels:
                ai.sap.com/resourcePlan: starter
        ...
    

    INFORMATION: You can verify the computing resources allocated by running echo $(lscpu) within your Docker image. lscpu is the Linux shell command that prints system configuration.

  • Step 7

    Observe the value of the name variable in both inputArtifacts and parameters. These are the placeholder names that were specified earlier. You are required to use these names later when creating your configuration.

  • Step 8

    Why use cloud storage?

    SAP AI Core only provides ephemeral (short-lived) storage while training or inferencing a model. Amazon Web Services (AWS) S3 Object Store is the cloud storage used by SAP AI Core for storing datasets and models. There, they can be stored over a longer time period, and can be transferred to and from SAP AI Core during training or online inferencing.

    You need to create an AWS S3 object store, using one of the following options:

    • If you are a BTP user, create your storage through the SAP Business Technology Platform. While BTP offers alternative storage solutions, this tutorial uses AWS S3.
    • If you are not a BTP user, go directly to the AWS site.
  • Step 9

    Download and Install the AWS Command Line Interface (CLI).

    To configure settings for your AWS CLI, open your terminal and run:

    BASH
    Copy
    aws configure
    

    Enter your AWS credentials. Note that the appearance of the screen will not change as you type. You can leave the Default output format entry as blank. Press enter to submit your credentials.

    Your credentials are stored in your system and used by the AWS CLI to interact with AWS. For more information, see Configuring the AWS CLI.

  • Step 10

    Download the train.csv dataset. You need to right click, and save the page as train.csv.


    INFORMATION The data used is from Scikit Learn. The source of the data is here.

    To upload the datasets to your AWS S3 Storage, paste and edit the following command in the terminal:

    BASH
    Copy
    aws s3 cp train.csv s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/data/jan/train.csv
    

    This command uploads the data to a folder called jan. Upload the file one more time to another folder called feb, by changing your command as shown:

    BASH
    Copy
    aws s3 cp train.csv s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/data/feb/train.csv
    

    You now know how to upload and use multiple datasets with SAP AI Core.

    List your files in your AWS S3 bucket by editing the following command:

    BASH
    Copy
    aws s3 ls s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/data/
    

    CAUTION: Ensure your file names and format match what you have specified in your code. For example, if you specify train.csv in your code, the system expects a file named train of type comma-separated values.
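A quick local sanity check of the CSV can catch such mismatches before upload. This sketch uses Python's standard csv module with an in-memory stand-in for train.csv; in practice you would open your local file instead:

```python
import csv
import io

# In-memory stand-in for train.csv; replace with open('train.csv') in practice.
csv_data = io.StringIO("feature1,feature2,target\n1.0,2.0,100\n3.0,4.0,200\n")
reader = csv.DictReader(csv_data)
rows = list(reader)

# main.py drops 'target' as the label column, so it must be present.
assert 'target' in reader.fieldnames, "train.csv must contain a 'target' column"
print(f"{len(rows)} rows, columns: {reader.fieldnames}")
```

Running this before aws s3 cp costs seconds and avoids a failed execution later.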

  • Step 11

    An object store secret is required to store credentials to access your AWS S3 buckets, and limit access to a particular directory.

    • The Resource Group must be default
    • The Name field is your choice of identifier for your secret within SAP AI Core.
    • Entries to the other fields are found in your AWS account.

    Why not put a complete path to train.csv as pathPrefix?

    You might have noticed that you previously uploaded data to example-dataset/house-price-toy/data/jan/train.csv, but here in the object store secret you set pathPrefix to example-dataset/house-price-toy. This is because pathPrefix restricts access to a particular directory of your cloud storage.

    Why define a more complete path later instead of using the pathPrefix alone? Using only the root prefix would mean that all files and subdirectories under it are copied from AWS S3 to SAP AI Core. This is rarely useful when you have multiple datasets and only one of them is in use. Extending the pathPrefix to a more specific path allows you to reference individual artifacts.

    With your object store secret created, you can now reference any sub-folders of pathPrefix containing artifacts such as datasets or models.

    INFORMATION You may create any number of object store secrets, each with a unique name. They can point to the same or different object stores.

  • Step 12

    You have learnt to add data artifacts, allowing you to ingest more data over time.

  • Step 13
    1. Notice the url used in the above snippet is ai://mys3/data/jan, where mys3 is the object store secret name that you created previously. Hence the path translates to ai://<PATH_PREFIX_OF_mys3>/data/jan, which is the directory containing your dataset file.
    2. The url points to a directory, not a file. This gives you the advantage that you can store multiple files in an AWS S3 directory and register the directory containing all of them as a single artifact.
    3. All files present in the path referenced by the artifact, including subfolders (except where Kind = MODEL), will be copied from your S3 storage to your SAP AI Core instance during training or inferencing.
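The translation from an artifact url to the full S3 location can be sketched as a small helper. SAP AI Core performs this resolution internally; the function below is purely illustrative, using the pathPrefix example from the previous step:

```python
def resolve_artifact_url(url: str, secrets: dict) -> str:
    """Translate ai://<secret_name>/<subpath> into the S3 prefix it denotes."""
    assert url.startswith('ai://'), "artifact urls use the ai:// scheme"
    secret_name, _, subpath = url[len('ai://'):].partition('/')
    path_prefix = secrets[secret_name]  # the pathPrefix stored in the object store secret
    return f"s3://<YOUR_BUCKET_NAME>/{path_prefix}/{subpath}"

# mys3 maps to the pathPrefix set when the object store secret was created.
secrets = {'mys3': 'example-dataset/house-price-toy'}
print(resolve_artifact_url('ai://mys3/data/jan', secrets))
```

The printed prefix matches the directory you uploaded train.csv to with the AWS CLI, which is why the artifact picks up that file.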
  • Step 14

    INFORMATION Artifacts appear in the default resource group and the Datasets menu, because you had registered artifacts with resource_group = default and Kind = Dataset in Step 9.

    Copy the artifact ID of the January dataset. You will use this value in the placeholders of your workflows to create your execution. The ID of artifacts allows SAP AI Core to ingest data into workflows.

  • Step 15

    [OPTION BEGIN [Postman]]

    Use the artifact ID of the jan dataset and the placeholder names to create a configuration, based on the code snippet. The key-value pair for DT_MAX_DEPTH allows you to use the configuration to pass values to the hyper-parameter placeholders that you prepared earlier in your workflows. In this case, type "3".

    BODY

    JSON
    Copy
    {
        "name": "House Price January 1",
        "scenarioId": "learning-datalines",
        "executableId": "data-pipeline",
        "inputArtifactBindings": [
            {
                "key": "housedataset",
                "artifactId": "<YOUR_JAN_ARTIFACT_ID>"
            }
        ],
        "parameterBindings": [
            {
                "key": "DT_MAX_DEPTH",
                "value": "3"
            }
        ]
    }
    
  • Step 16
    1. You bind the artifact in the section inputArtifactBindings, where key denotes the placeholder name from your workflow and artifactId is the unique ID of the artifact that you registered. In later steps, you will bind the feb dataset’s artifact ID, to learn how the same workflow can be used with multiple datasets.

    2. You provide the value to hyper-parameters using the section parameterBindings, where the key denotes the placeholder and the value is the string. Your code converts this value to an integer type before utilizing it.

    [OPTION END]

    [OPTION BEGIN [SAP AI Core SDK]]

    Paste and edit the code snippet. The key-value pair for DT_MAX_DEPTH allows you to use the configuration to pass values to the hyper-parameter placeholders that you prepared earlier in your workflows. In this case, type "3". Locate your jan dataset artifact ID by listing all artifacts and using the relevant ID.

    PYTHON
    Copy
    from ai_api_client_sdk.models.parameter_binding import ParameterBinding
    from ai_api_client_sdk.models.input_artifact_binding import InputArtifactBinding
    
    response = ai_core_client.configuration.create(
        name = "House Price January 1",
        scenario_id = "learning-datalines",
        executable_id = "data-pipeline",
        input_artifact_bindings = [
            InputArtifactBinding(key = "housedataset", artifact_id = "<YOUR_JAN_ARTIFACT_ID>") # placeholder as name
        ],
        parameter_bindings = [
            ParameterBinding(key = "DT_MAX_DEPTH", value = "3") # placeholder name as key
        ],
        resource_group = "default"
    )
    print(response.__dict__)
    
  • Step 17
    1. You bind the artifact in the section inputArtifactBindings, where key denotes the placeholder name from your workflow and artifactId is the unique ID of the artifact that you registered. In later steps, you will bind the feb dataset’s artifact ID, to learn how the same workflow can be used with multiple datasets.

    2. You provide the value to hyper-parameters using the section parameterBindings, where the key denotes the placeholder and the value is a string. Your code converts this value to an integer type before utilizing it.

    [OPTION END]

    This is how you bind values to placeholders in your workflows.


  • Step 18

    Until now you have ingested data and specified variables in SAP AI Core. To save your model to use later, you need to extract the model to cloud storage. We will complete this in the next step.

  • Step 19

    In your GitHub repository, edit the workflow house-price-train.yaml and replace the contents with the below snippet. Make sure to add your Docker credentials and artifact details to the relevant fields.

    YAML
    Copy
    apiVersion: argoproj.io/v1alpha1
    kind: WorkflowTemplate
    metadata:
      name: data-pipeline # executable id, must be unique across all your workflows (YAML files)
      annotations:
        scenarios.ai.sap.com/description: "Learning how to ingest data to workflows"
        scenarios.ai.sap.com/name: "House Price (Tutorial)" # Scenario name should be the use case
        executables.ai.sap.com/description: "Train with live data"
        executables.ai.sap.com/name: "training" # Executable name should describe the workflow in the use case
        artifacts.ai.sap.com/housedataset.kind: "dataset" # Helps in suggesting the kind of artifact that can be attached.
        artifacts.ai.sap.com/housemodel.kind: "model" # Helps in suggesting the kind of artifact that can be generated.
      labels:
        scenarios.ai.sap.com/id: "learning-datalines"
        ai.sap.com/version: "2.0"
    spec:
      imagePullSecrets:
        - name: credstutorialrepo # your docker registry secret
      entrypoint: mypipeline
      arguments:
        parameters: # placeholder for string like inputs
            - name: DT_MAX_DEPTH # identifier local to this workflow
      templates:
      - name: mypipeline
        steps:
        - - name: mypredictor
            template: mycodeblock1
      - name: mycodeblock1
        inputs:
          artifacts:  # placeholder for cloud storage attachments
            - name: housedataset # a name for the placeholder
              path: /app/data/ # where to copy in the Dataset in the Docker image
        outputs:
          artifacts:
            - name: housepricemodel # name of the artifact generated, and folder name when placed in S3; the complete directory will be `../<execution_id>/housepricemodel`
              globalName: housemodel # local identifier name to the workflow, also used above in annotation
              path: /app/model/ # from which folder in docker image (after running workflow step) copy contents to cloud storage
              archive:
                none:   # specify not to compress while uploading to cloud
                  {}
        container:
          image: docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 # Your docker image name
          command: ["/bin/sh", "-c"]
          env:
            - name: DT_MAX_DEPTH # name of the environment variable inside Docker container
              value: "{{workflow.parameters.DT_MAX_DEPTH}}" # value to set from local (to workflow) variable DT_MAX_DEPTH
          args:
            - "python /app/src/main.py"
    
  • Step 20

    You added a new outputs section, specifying the directory whose contents, created during execution, will be uploaded to AWS S3 and automatically registered as artifacts in SAP AI Core. You also added a line to the annotations section, specifying the kind of artifact that would be generated. In this case, a model.

    All of the contents of your /app/model/ directory, as defined in your Docker image, will be uploaded to AWS S3. This means you may generate multiple files of any format after training, for example class_labels.npy, model.h5, classifier.pkl or tokens.json.
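For example, a training script could write several files into the model directory and they would all become part of the single output artifact. A sketch with illustrative filenames, using a temporary directory as a stand-in for /app/model/:

```python
import json
import os
import pickle
import tempfile

# Temporary stand-in for /app/model/; in SAP AI Core everything written
# here is uploaded to S3 as part of the 'housepricemodel' output artifact.
model_dir = tempfile.mkdtemp()

with open(os.path.join(model_dir, 'model.pkl'), 'wb') as f:
    pickle.dump({'max_depth': 3}, f)  # placeholder object for the trained regressor
with open(os.path.join(model_dir, 'metadata.json'), 'w') as f:
    json.dump({'framework': 'scikit-learn'}, f)  # illustrative side file

print(sorted(os.listdir(model_dir)))  # ['metadata.json', 'model.pkl']
```

Keeping auxiliary files (label maps, tokenizers, metadata) next to the model in the same directory means they arrive in S3 together, under the same execution ID.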

  • Step 21

    It is compulsory to create an object store secret named default within your resource group for your executable to generate models and store them in AWS S3. After execution, the model will be saved to PATH_PREFIX_of_default/<execution_id>/housepricemodel in your AWS S3 bucket. housepricemodel is the name given in the workflow in the previous step.

  • Step 22

    This time, train the model with the feb dataset and with a different hyper-parameter value.

  • Step 23

    Use your new configuration to create an execution.

    When your execution shows status COMPLETED, you will see that a new model artifact called housepricemodel has been generated. Note that the outputArtifacts are automatically registered and copied to AWS S3, and that your artifact is of the kind model; its ID, not its name, is its unique identifier.

    Generating and associating metrics (model quality) will be covered in a separate tutorial.

  • Step 24

    List your new files by pasting and editing the following snippet in your terminal.

    BASH
    Copy
    aws s3 ls s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/model/<YOUR_EXECUTION_ID>/housepricemodel
    

    You are listing the files in the path example-dataset/house-price-toy/model/ because this is the value you set earlier for the pathPrefix variable, for your object store secret named default.
