Ingest Live Data into your House Price Predictor with SAP AI Core
- How to create placeholders for datasets in your code and associated AI workflow.
- How to register datasets stored in AWS S3 to SAP AI Core.
- How to use datasets with placeholders.
- How to generate models and store them in AWS S3 for later use.
Prerequisites
- You know how to connect code to AI workflows of SAP AI Core.
- You have created your first pipeline with SAP AI Core, using this tutorial.
By the end of this tutorial, you will have two models trained on two different house price datasets. You can change the names of components and file paths mentioned in this tutorial without breaking the functionality, unless stated otherwise.
IMPORTANT Before you start this tutorial with SAP AI Launchpad, it is recommended that you set up at least one other tool, either Postman or Python (SAP AI Core SDK), because some steps of this tutorial cannot be performed with SAP AI Launchpad.
- Step 1
Create a new directory named `hello-aicore-data`. The code differs from the previous tutorial because it reads its data from a folder (a volume, that is, a virtual storage space). The contents of these volumes are dynamically loaded during execution of workflows.

Create a file named `main.py`, and paste the following snippet into it:

```python
import os
#
# Variables
DATA_PATH = '/app/data/train.csv'
DT_MAX_DEPTH = int(os.getenv('DT_MAX_DEPTH'))
MODEL_PATH = '/app/model/model.pkl'
#
# Load Datasets
import pandas as pd
df = pd.read_csv(DATA_PATH)
X = df.drop('target', axis=1)
y = df['target']
#
# Partition into Train and test dataset
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.3)
#
# Init model
from sklearn.tree import DecisionTreeRegressor
clf = DecisionTreeRegressor(max_depth=DT_MAX_DEPTH)
#
# Train model
clf.fit(train_x, train_y)
#
# Test model
test_r2_score = clf.score(test_x, test_y)
# Output will be available in logs of SAP AI Core.
# Not the ideal way of storing/reporting metrics in SAP AI Core, but that is not the focus of this tutorial
print(f"Test Data Score {test_r2_score}")
#
# Save model
import pickle
pickle.dump(clf, open(MODEL_PATH, 'wb'))
```
- Step 2
Your code reads the data file `train.csv` from the location `/app/data`, which will be prepared in a later step. It also reads the variable (hyper-parameter) `DT_MAX_DEPTH` from the environment variables. When generated, your model will be stored in the location `/app/model/`. You will also learn how to transport this model from SAP AI Core to your own cloud storage.

Recommendation: Although the dataset file `train.csv` is not present yet, it will be dynamically copied during execution to the volume mentioned (`/app/data`). It is recommended to pass the filename (`train.csv`) to your code through an environment variable, so that if your dataset filename changes, you can set it dynamically.

Create a file named `requirements.txt` as shown below. If you don't specify a particular version, as shown for `pandas`, then the latest version of the package is fetched automatically.

```text
sklearn==0.0
pandas
```
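Following the recommendation above, here is a minimal sketch of how `main.py` could pick up the dataset filename from an environment variable. The variable name `DATA_FILENAME` is an assumption for illustration, not part of this tutorial's workflow; you would also need to declare it in the workflow's `env` section.

```python
import os

# Sketch only: read the dataset filename from a (hypothetical) environment
# variable, falling back to train.csv if it is not set.
DATA_DIR = '/app/data'
DATA_FILENAME = os.getenv('DATA_FILENAME', 'train.csv')
DATA_PATH = os.path.join(DATA_DIR, DATA_FILENAME)
```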
Create a file called `Dockerfile` with the following contents. This filename cannot be amended, and it does not have a file extension.

```dockerfile
# Specify which base layers (default dependencies) to use
# You may find more base layers at https://hub.docker.com/
FROM python:3.7
#
# Creates directory within your Docker image
RUN mkdir -p /app/src/
# Don't place anything in below folders yet, just create them
RUN mkdir -p /app/data/
RUN mkdir -p /app/model/
#
# Copies file from your Local system TO path in Docker image
COPY main.py /app/src/
COPY requirements.txt /app/src/
#
# Installs dependencies within your Docker image
RUN pip3 install -r /app/src/requirements.txt
#
# Enable permission to execute anything inside the folder app
RUN chgrp -R 65534 /app && \
    chmod -R 777 /app
```
IMPORTANT Your `Dockerfile` creates empty folders to store your datasets and models (`/app/data` and `/app/model/` in the example above). Contents from cloud storage will be copied to and from these folders later. Any contents already in these folders will be overwritten by the Docker image build.

Build and upload your Docker image to your Docker repository, using the following commands in the terminal.
```bash
docker build -t docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 .
docker push docker.io/<YOUR_DOCKER_USERNAME>/house-price:03
```
- Step 3
Create a pipeline (YAML file) named `house-price-train.yaml` in your GitHub repository. Use the existing GitHub path which is already synced by your application of SAP AI Core.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: data-pipeline # executable id, must be unique across all your workflows (YAML files)
  annotations:
    scenarios.ai.sap.com/description: "Learning how to ingest data to workflows"
    scenarios.ai.sap.com/name: "House Price (Tutorial)" # Scenario name should be the use case
    executables.ai.sap.com/description: "Train with live data"
    executables.ai.sap.com/name: "training" # Executable name should describe the workflow in the use case
    artifacts.ai.sap.com/housedataset.kind: "dataset" # Helps in suggesting the kind of inputs that can be attached.
  labels:
    scenarios.ai.sap.com/id: "learning-datalines"
    ai.sap.com/version: "1.0"
spec:
  imagePullSecrets:
    - name: credstutorialrepo # your docker registry secret
  entrypoint: mypipeline
  templates:
    - name: mypipeline
      steps:
        - - name: mypredictor
            template: mycodeblock1
    - name: mycodeblock1
      inputs:
        artifacts: # placeholder for cloud storage attachments
          - name: housedataset # a name for the placeholder
            path: /app/data/ # where to copy in the Dataset in the Docker image
      container:
        image: docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 # Your docker image name
        command: ["/bin/sh", "-c"]
        env:
          - name: DT_MAX_DEPTH # name of the environment variable inside Docker container
            value: "3" # will make it a variable later
        args:
          - "python /app/src/main.py"
```
- Step 4
This change to your workflow creates a placeholder through which you can specify a data path (volume) to the container (Docker image in execution).
- A placeholder named `housedataset` is created.
- You specify the kind of artifact that the placeholder can accept. Artifacts are covered in detail later in this tutorial.
- You use the placeholder to specify the path that you created in your Dockerfile, which is where files will be copied into your Docker image.
Why do we need to create placeholders?
SAP AI Core only uses your workflows as an interface, so it is unaware of the volumes/attachments specified in your Docker image. Your data path is specified in your Dockerfile, the placeholder for it is declared in your workflow, and the data is then expected by the Docker image at execution time.
- Step 5
In your workflow, you have so far passed a static value to the environment variable `DT_MAX_DEPTH`. Let's turn it into a workflow variable.

Replace the contents of the above AI workflow with this snippet.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: data-pipeline # executable id, must be unique across all your workflows (YAML files)
  annotations:
    scenarios.ai.sap.com/description: "Learning how to ingest data to workflows"
    scenarios.ai.sap.com/name: "House Price (Tutorial)" # Scenario name should be the use case
    executables.ai.sap.com/description: "Train with live data"
    executables.ai.sap.com/name: "training" # Executable name should describe the workflow in the use case
    artifacts.ai.sap.com/housedataset.kind: "dataset" # Helps in suggesting the kind of inputs that can be attached.
  labels:
    scenarios.ai.sap.com/id: "learning-datalines"
    ai.sap.com/version: "1.0"
spec:
  imagePullSecrets:
    - name: credstutorialrepo # your docker registry secret
  entrypoint: mypipeline
  arguments:
    parameters: # placeholder for string like inputs
      - name: DT_MAX_DEPTH # identifier local to this workflow
  templates:
    - name: mypipeline
      steps:
        - - name: mypredictor
            template: mycodeblock1
    - name: mycodeblock1
      # Add your resource plan here. The annotation should follow metadata > labels > ai.sap.com/resourcePlan: <plan>
      inputs:
        artifacts: # placeholder for cloud storage attachments
          - name: housedataset # a name for the placeholder
            path: /app/data/ # where to copy in the Dataset in the Docker image
      container:
        image: docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 # Your docker image name
        command: ["/bin/sh", "-c"]
        env:
          - name: DT_MAX_DEPTH # name of the environment variable inside Docker container
            value: "{{workflow.parameters.DT_MAX_DEPTH}}" # value to set from local (to workflow) variable DT_MAX_DEPTH
        args:
          - "python /app/src/main.py"
```
The important new lines in the workflow are explained below.
#### Understanding these changes
- A placeholder named `DT_MAX_DEPTH` is created locally in the workflow. It accepts numeric content with input type `string`. The input is type cast to an integer in your code, so it must be a string containing an integer. For example, `"4"` is acceptable because it is a string whose content can be type cast to an integer.
- You create an `env` (environment) variable in your Docker image, named `DT_MAX_DEPTH`. The value of this variable is fed in from `workflow.parameters.DT_MAX_DEPTH`, the local name from the previous point.
Commit the changes in GitHub.
- Step 6
Add the following snippet to your workflow to specify a resource plan. The resource plan specifies the computing resources (GPU, RAM and processor) required to run your Docker image. If not mentioned, the resource plan defaults to `starter`, which is the entry-level resource plan.

```yaml
spec:
  ...
  templates:
    ...
    - name: mycodeblock1
      metadata:
        labels:
          ai.sap.com/resourcePlan: starter
      ...
```
INFORMATION: You can always verify the computing resources allocated by running the command `echo $(lscpu)` inside your Docker image. This is the Linux shell command that prints the system's CPU configuration.

- Step 7
Observe the value of the `name` variable in both `inputArtifacts` and `parameters`. These represent the placeholder names which were specified earlier in the process. You are required to use these names later when creating your configuration.

- Step 8
Why use cloud storage?
SAP AI Core only provides ephemeral (short-lived) storage while training or inferencing a model. Amazon Web Services (AWS) S3 object store is the cloud storage used by SAP AI Core for storing datasets and models. There, they can be stored over a longer time period, and can be transferred to and from SAP AI Core during training or online inferencing.
You need to create an AWS S3 object store, using one of the following links:
- If you are a BTP user, create your storage through the SAP Business Technology Platform. While BTP offers alternative storage solutions, this tutorial uses AWS S3.
- If you are not a BTP user, go directly to the AWS site.
- Step 9
Download and Install the AWS Command Line Interface (CLI).
To configure settings for your AWS CLI, open your terminal and run:
```bash
aws configure
```
Enter your AWS credentials. Note that the appearance of the screen will not change as you type. You can leave the `Default output format` entry blank. Press enter to submit your credentials.

Your credentials are stored on your system and used by the AWS CLI to interact with AWS. For more information, see Configuring the AWS CLI.
- Step 10
Download the `train.csv` dataset. You need to right click and save the page as `train.csv`.

INFORMATION The data used is from Scikit Learn. The source of the data is here.

To upload the dataset to your AWS S3 storage, paste and edit the following command in the terminal:
```bash
aws s3 cp train.csv s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/data/jan/train.csv
```
This command uploaded the data to a folder called `jan`. Upload it one more time to another folder called `feb`, by changing your command as shown:

```bash
aws s3 cp train.csv s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/data/feb/train.csv
```
You now know how to upload and use multiple datasets with SAP AI Core.
List your files in your AWS S3 bucket by editing the following command:
```bash
aws s3 ls s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/data/
```
CAUTION: Ensure your file names and format match what you have specified in your code. For example, if you specify `train.csv` in your code, the system expects a file called `train` of type comma-separated values (CSV).
- Step 11
An object store secret is required to store credentials to access your AWS S3 buckets, and limit access to a particular directory.
- The Resource Group must be `default`.
- The `Name` field is your choice of identifier for your secret within SAP AI Core.
- Entries to the other fields are found in your AWS account.
Why not put the complete path to `train.csv` as the `pathPrefix`?

You might have noticed that you previously uploaded data to `example-dataset/house-price-toy/data/jan/train.csv`, but in the object store secret you set the `pathPrefix` value to `example-dataset/house-price-toy`. This is because the purpose of the `pathPrefix` is to restrict access to a particular directory of your cloud storage.

Why define a complete path later instead of using only the `pathPrefix`? Using only a root path would mean that all files and subdirectories in that path are copied from your AWS S3 to SAP AI Core. This might not be useful when you have multiple datasets and only one of them is in use. Extending the `pathPrefix` to a complete path allows you to specify specific artifacts.

With your object store secret created, you can now reference any sub-folder of the `pathPrefix` containing artifacts such as datasets or models.

INFORMATION You may create any number of object store secrets, each with a unique `name`. They can point to the same or different object stores.
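If you are using the SAP AI Core SDK rather than SAP AI Launchpad or Postman, creating the object store secret might look like the sketch below. It assumes an existing `ai_core_client` (AICoreV2Client) instance, as in the SDK snippets later in this tutorial; the parameter names follow the pattern used in related SAP tutorials, so verify them against your SDK version and replace the placeholders with your own AWS details.

```python
# Sketch: create an object store secret named `mys3`, restricted to the
# example-dataset/house-price-toy prefix of your bucket.
response = ai_core_client.object_store_secrets.create(
    resource_group = "default",
    name = "mys3",                                   # secret name, referenced later as ai://mys3/...
    type = "S3",
    bucket = "<YOUR_BUCKET_NAME>",
    endpoint = "<YOUR_S3_ENDPOINT>",                 # for example s3.eu-central-1.amazonaws.com
    region = "<YOUR_S3_REGION>",
    path_prefix = "example-dataset/house-price-toy",
    data = {
        "AWS_ACCESS_KEY_ID": "<YOUR_ACCESS_KEY_ID>",
        "AWS_SECRET_ACCESS_KEY": "<YOUR_SECRET_ACCESS_KEY>"
    }
)
print(response.__dict__)
```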
- Step 12
You have learnt to add data artifacts, allowing you to ingest more data over time.
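As a sketch of what registering the January folder as a dataset artifact looks like with the SAP AI Core SDK (assuming an existing `ai_core_client` instance; the artifact name and description below are illustrative):

```python
from ai_api_client_sdk.models.artifact import Artifact

# Sketch: register the S3 folder data/jan (relative to the mys3 path prefix)
# as a dataset artifact for the learning-datalines scenario.
response = ai_core_client.artifact.create(
    resource_group = "default",
    name = "House Price Dataset (Jan)",       # illustrative name
    kind = Artifact.Kind.DATASET,
    url = "ai://mys3/data/jan",               # object store secret name + folder
    scenario_id = "learning-datalines",
    description = "January house price data"  # illustrative description
)
print(response.__dict__)
```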
- Step 13
- Notice that the `url` used in the above snippet is `ai://mys3/data/jan`. Here, `mys3` is the object store secret name that you created previously. Hence the path translates to `ai://<PATH_PREFIX_OF_mys3>/data/jan`, which is the directory that your dataset file is located in.
- The `url` points to a directory, not a file. This gives you the advantage that you can store multiple files in an AWS S3 directory and register the directory containing all the files as a single artifact.
- All the files present in the path referenced by the artifact will be copied from your S3 storage to your SAP AI Core instance during training or inferencing. This includes subfolders, except where `Kind = MODEL`.
- Step 14
INFORMATION Artifacts appear in the `default` resource group and the Datasets menu, because you registered the artifacts with `resource_group = default` and `Kind = Dataset` in Step 9.

Copy the artifact ID of the January dataset. You will use this value in the placeholders of your workflows to create your execution. The artifact ID is what allows SAP AI Core to ingest data into workflows.
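If you prefer the SAP AI Core SDK over the launchpad, a sketch for listing artifacts and reading their IDs follows (method and attribute names as used in related SAP tutorials; verify against your SDK version):

```python
# Sketch: list registered artifacts in the default resource group
# and print their IDs so you can copy the one for the January dataset.
response = ai_core_client.artifact.query(resource_group = "default")
for artifact in response.resources:
    print(artifact.id, artifact.name, artifact.kind)
```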
- Step 15
[OPTION BEGIN [Postman]]
Use the artifact ID of the `jan` dataset and the placeholder names to create a configuration, based on the code snippet below. The key-value pair for `DT_MAX_DEPTH` allows you to use the configuration to pass values to the hyper-parameter placeholders that you prepared earlier in your workflows. In this case, type `"3"`.

BODY
```json
{
  "name": "House Price January 1",
  "scenarioId": "learning-datalines",
  "executableId": "data-pipeline",
  "inputArtifactBindings": [
    { "key": "housedataset", "artifactId": "<YOUR_JAN_ARTIFACT_ID>" }
  ],
  "parameterBindings": [
    { "key": "DT_MAX_DEPTH", "value": "3" }
  ]
}
```
- Step 16
You bind the artifact in the `inputArtifactBindings` section, where `key` denotes the placeholder name from your workflow and `artifactId` is the unique ID of the artifact that you registered. In later steps, you will bind the `feb` dataset's artifact ID, to learn how the same workflow can be used with multiple datasets.

You provide the value of hyper-parameters using the `parameterBindings` section, where `key` denotes the placeholder and the value is a string. Your code converts this value to an integer before using it.
[OPTION END]
[OPTION BEGIN [SAP AI Core SDK]]
Paste and edit the code snippet. The key-value pair for `DT_MAX_DEPTH` allows you to use the configuration to pass values to the hyper-parameter placeholders that you prepared earlier in your workflows. In this case, type `"3"`. Locate your `jan` dataset artifact ID by listing all artifacts and using the relevant ID.

```python
from ai_api_client_sdk.models.parameter_binding import ParameterBinding
from ai_api_client_sdk.models.input_artifact_binding import InputArtifactBinding

response = ai_core_client.configuration.create(
    name = "House Price January 1",
    scenario_id = "learning-datalines",
    executable_id = "data-pipeline",
    input_artifact_bindings = [
        InputArtifactBinding(key = "housedataset", artifact_id = "<YOUR_JAN_ARTIFACT_ID>") # placeholder name as key
    ],
    parameter_bindings = [
        ParameterBinding(key = "DT_MAX_DEPTH", value = "3") # placeholder name as key
    ],
    resource_group = "default"
)
print(response.__dict__)
```
- Step 17
You bind the artifact in the `inputArtifactBindings` section, where `key` denotes the placeholder name from your workflow and `artifactId` is the unique ID of the artifact that you registered. In later steps, you will bind the `feb` dataset's artifact ID, to learn how the same workflow can be used with multiple datasets.

You provide the value of hyper-parameters using the `parameterBindings` section, where `key` denotes the placeholder and the value is a string. Your code converts this value to an integer before using it.
[OPTION END]
This is how you bind values to the placeholders in your workflows.
- Step 18
Until now, you have ingested data and specified variables in SAP AI Core. To save your model for later use, you need to save it to cloud storage. We will complete this in the next step.
- Step 19
In your GitHub repository, edit the workflow `house-price-train.yaml` and replace the contents with the snippet below. Make sure to add your Docker credentials and artifact details to the relevant fields.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: data-pipeline # executable id, must be unique across all your workflows (YAML files)
  annotations:
    scenarios.ai.sap.com/description: "Learning how to ingest data to workflows"
    scenarios.ai.sap.com/name: "House Price (Tutorial)" # Scenario name should be the use case
    executables.ai.sap.com/description: "Train with live data"
    executables.ai.sap.com/name: "training" # Executable name should describe the workflow in the use case
    artifacts.ai.sap.com/housedataset.kind: "dataset" # Helps in suggesting the kind of artifact that can be attached.
    artifacts.ai.sap.com/housemodel.kind: "model" # Helps in suggesting the kind of artifact that can be generated.
  labels:
    scenarios.ai.sap.com/id: "learning-datalines"
    ai.sap.com/version: "2.0"
spec:
  imagePullSecrets:
    - name: credstutorialrepo # your docker registry secret
  entrypoint: mypipeline
  arguments:
    parameters: # placeholder for string like inputs
      - name: DT_MAX_DEPTH # identifier local to this workflow
  templates:
    - name: mypipeline
      steps:
        - - name: mypredictor
            template: mycodeblock1
    - name: mycodeblock1
      inputs:
        artifacts: # placeholder for cloud storage attachments
          - name: housedataset # a name for the placeholder
            path: /app/data/ # where to copy in the Dataset in the Docker image
      outputs:
        artifacts:
          - name: housepricemodel # name of the artifact generated, and folder name when placed in S3, complete directory will be `../<execution_id>/housepricemodel`
            globalName: housemodel # local identifier name to the workflow, also used above in annotation
            path: /app/model/ # from which folder in docker image (after running workflow step) copy contents to cloud storage
            archive:
              none: # specify not to compress while uploading to cloud
                {}
      container:
        image: docker.io/<YOUR_DOCKER_USERNAME>/house-price:03 # Your docker image name
        command: ["/bin/sh", "-c"]
        env:
          - name: DT_MAX_DEPTH # name of the environment variable inside Docker container
            value: "{{workflow.parameters.DT_MAX_DEPTH}}" # value to set from local (to workflow) variable DT_MAX_DEPTH
        args:
          - "python /app/src/main.py"
```
Log in to complete tutorial - Step 20
You added a new `outputs` section, where you specified the files and directories that are created during execution; these will be uploaded to AWS S3 and automatically registered as artifacts in SAP AI Core. You also added a line in the `annotations` section, which specifies the `kind` of artifact that will be generated. In this case, a model.

All of the contents of your `/app/model/` directory, as defined in your Docker image, will be uploaded to AWS S3. This means you may generate multiple files of any format after training, for example `class_labels.npy`, `model.h5`, `classifier.pkl` or `tokens.json`.
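For example, here is a hedged sketch of extra lines you could append to `main.py` to write an additional file next to the model; the file name `metadata.json` and its contents are illustrative, not part of the tutorial's workflow.

```python
# Sketch: anything written under /app/model/ is uploaded to S3 together with
# model.pkl as part of the same output artifact.
import json

with open('/app/model/metadata.json', 'w') as f:
    json.dump({'test_r2_score': test_r2_score, 'max_depth': DT_MAX_DEPTH}, f)
```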
- Step 21
It is compulsory to create an object store secret named `default` within your resource group for your executable to generate models and store them in AWS S3. After execution, the model will be saved to `PATH_PREFIX_of_default/<execution_id>/housepricemodel` in your AWS S3 bucket. The name `housepricemodel` is mentioned in the workflow in the previous step.
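A sketch of creating this `default` secret with the SAP AI Core SDK, mirroring the `mys3` secret from earlier, is shown below. The `model` sub-folder used as `path_prefix` matches the path you list in the final step; verify the parameter names against your SDK version.

```python
# Sketch: object store secret named `default`, pointing at the folder
# where generated models should land.
response = ai_core_client.object_store_secrets.create(
    resource_group = "default",
    name = "default",                                 # must be named `default` for model outputs
    type = "S3",
    bucket = "<YOUR_BUCKET_NAME>",
    endpoint = "<YOUR_S3_ENDPOINT>",
    region = "<YOUR_S3_REGION>",
    path_prefix = "example-dataset/house-price-toy/model",
    data = {
        "AWS_ACCESS_KEY_ID": "<YOUR_ACCESS_KEY_ID>",
        "AWS_SECRET_ACCESS_KEY": "<YOUR_SECRET_ACCESS_KEY>"
    }
)
print(response.__dict__)
```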
- Step 22
This time, train the model with the `feb` dataset and with a different hyper-parameter value, as sketched below.
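Using the SAP AI Core SDK, a second configuration might look like the following sketch, which mirrors the one you created earlier but binds the `feb` artifact and a different depth. The configuration name and the value `"5"` are illustrative.

```python
from ai_api_client_sdk.models.parameter_binding import ParameterBinding
from ai_api_client_sdk.models.input_artifact_binding import InputArtifactBinding

# Sketch: same executable, different dataset artifact and hyper-parameter value.
response = ai_core_client.configuration.create(
    name = "House Price February 1",
    scenario_id = "learning-datalines",
    executable_id = "data-pipeline",
    input_artifact_bindings = [
        InputArtifactBinding(key = "housedataset", artifact_id = "<YOUR_FEB_ARTIFACT_ID>")
    ],
    parameter_bindings = [
        ParameterBinding(key = "DT_MAX_DEPTH", value = "5")
    ],
    resource_group = "default"
)
print(response.__dict__)
```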
- Step 23
Use your new configuration to create an execution.
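With the SAP AI Core SDK, starting an execution from a configuration might look like this sketch (assuming an existing `ai_core_client`; replace the placeholder with the configuration ID returned in the previous step):

```python
# Sketch: start an execution from an existing configuration.
response = ai_core_client.execution.create(
    configuration_id = "<YOUR_CONFIGURATION_ID>",
    resource_group = "default"
)
print(response.__dict__)
```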
When your execution shows status COMPLETED, you will see that a new model artifact called `housepricemodel` has been generated. The `outputArtifacts` are automatically registered and copied to AWS S3. Note that your artifact is of kind model and that its ID, not its name, is its only unique identifier.

Generating and associating metrics (model quality) will be covered in a separate tutorial.
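To check on the execution from the SDK instead of the launchpad, a sketch like the following could help; the attribute names may vary by SDK version.

```python
# Sketch: fetch the execution and inspect its status and registered outputs.
response = ai_core_client.execution.get(
    execution_id = "<YOUR_EXECUTION_ID>",
    resource_group = "default"
)
print(response.status)
print(response.__dict__)
```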
- Step 24
List your new files by pasting and editing the following snippet in your terminal.
```bash
aws s3 ls s3://<YOUR_BUCKET_NAME>/example-dataset/house-price-toy/model/<YOUR_EXECUTION_ID>/housepricemodel
```
You are listing the files in the path `example-dataset/house-price-toy/model/` because this is the value you set earlier for the `pathPrefix` variable of your object store secret named `default`.
- Modify AI code
- Understanding your code
- Create placeholders for datasets in workflows
- Understanding changes in your workflow
- Create placeholders for hyperparameters
- Set resource plan
- Observe your scenario and placeholder
- Create cloud storage for datasets and models
- Connect local system to AWS S3
- Upload datasets to AWS S3
- Store an object store secret in SAP AI Core
- Create artifact to specify folder of dataset
- Important points to notice
- Locate artifacts
- Use artifacts with workflows using a configuration
- Important points
- Important points
- Run your workflow using an execution
- Set model pipeline in workflow
- Description of changes
- Create required object store secret `default` for model
- Create another configuration with new data
- Create another execution
- Locate your model in AWS S3