Create Workflow (Part 1): Enrich Data with Data Transform in SAP Data Intelligence, Trial Edition
- How to create a Data Transform task
- How to enrich a dataset
Please note: in this tutorial, GCP refers to Google Cloud Platform, AWS to Amazon Web Services, and Azure to Microsoft Azure.
- Step 1
You are logged in as a `system` user to a `default` tenant and are at the Application Launchpad page. Open the modeling environment for building pipelines via SAP Data Intelligence Modeler.
- Click Modeler to open the Modeler UI.
- Create a new graph:
  - Go to the navigation pane.
  - Select the Graphs tab in the navigation pane toolbar.
  - Choose + (Create Graph). The tool opens an empty graph editor in the same window, where you can define your graph.
- Add a `Workflow Trigger` operator to the graph by drag and drop.
- Add a `Spark Data Transform` operator to the graph by drag and drop.
- Connect the output port of the Workflow Trigger to the input port of the Spark Data Transform operator.
- Step 2
- Double click the Spark Data Transform operator to open its editor. Here you model your data workflow by creating data sources and targets along with transformation operators.
- From the Nodes menu on the left, drag and drop a Data Source to the editor.
  Hint: As this is a Spark Data Transform task, only Nodes can be added to it.
- Double click the Data Source to open the Data Source Editor.
- Configure the details for the data source:
  - Click Browse.
  - Select the `CLOUD_STORAGE` connection from the list.
    Hint: The connection appears here only if you have configured it under Connection Management.
- As Source, browse AWS S3, Google Cloud Storage, or Windows Azure Storage Blob and choose the `Devices.csv` file. As soon as the file is selected, its configuration parameters are auto-proposed.
- Click Back at the top left corner to navigate back to the Spark Data Transform editor page.
- Add another Data Source:
  - Using drag and drop, add another Data Source to the task.
  - Double click it to open the Data Source editor.
  - Click Browse.
  - Select the `CLOUD_STORAGE` connection from the list.
  - As Source, choose the `Customers.csv` file.
- Navigate back to the Spark Data Transform editor page. (A quick local sanity check of the two source files is sketched below.)
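If you want to sanity-check the two source files outside the Modeler, a minimal pandas sketch could look like the following. The local file names and the default comma delimiter are assumptions; in the Modeler, the actual file parameters are auto-proposed from the files in cloud storage.

```python
import pandas as pd

# Hypothetical local copies of the two source files; in the tutorial they
# live in cloud storage behind the CLOUD_STORAGE connection.
devices = pd.read_csv("Devices.csv")      # becomes DataSource1
customers = pd.read_csv("Customers.csv")  # becomes DataSource2

# Inspect the first rows of each source.
print(devices.head())
print(customers.head())
```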
- Step 3
In this step, you join the two data sources you have created and configure the Join operator.
- From the Nodes tab in the left side menu, drag and drop a Join to the task.
- Connect the `DataSource1_Output1` output port of `DataSource1` to the `Join1_Input1` input port of the `Join1` operator.
- Similarly, connect the `DataSource2_Output1` output port of `DataSource2` to the `Join1_Input2` input port of the `Join1` operator.
- Double click the Join operator to open its configuration. Both Data Sources connected as inputs are displayed as Join inputs.
- Configure the join:
  - Click on the `Join1_Input1` table.
  - Click on the join connector highlighted in Step 1 of the image.
  - Holding the click, drag to the connecting point of the `Join1_Input2` table highlighted in Step 2 of the image, and release once the second table turns green, indicating a successful connection. The resulting join will look similar to Step 3 of the image.
- As soon as the two tables are connected, a Join Definition dialog box is displayed in the bottom part of the screen. Paste the following join condition into the dialog box:

```sql
"Join1_Input1"."CUSTOMER" = "Join1_Input2"."CUSTOMER"
```
- Open the Columns page. This page displays a graphical representation of the selected columns, where you can select the output columns of the join.
- Drag and drop the columns from source to target. Make sure that you select only the following columns:

  | Source | Source Column |
  | --- | --- |
  | Join1_Input1 (Devices) | DEVICE |
  | Join1_Input1 (Devices) | TYPE |
  | Join1_Input1 (Devices) | CUSTOMER |
  | Join1_Input1 (Devices) | TIME |
  | Join1_Input2 (Customers) | COUNTRY |
- Make sure the output looks similar to the screenshot, and that the names of the Target columns match those shown. If not, you can edit the target column names on this page.
- Navigate back to the Spark Data Transform editor page. (A local sketch of the join you just configured follows below.)
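Conceptually, the join you configured behaves like the following pandas sketch. This is only an illustration under the same assumptions as before (local files, comma delimiter); the Modeler executes the task on Spark instead.

```python
import pandas as pd

# Hypothetical local copies of the sources (Join1_Input1 and Join1_Input2).
devices = pd.read_csv("Devices.csv")
customers = pd.read_csv("Customers.csv")

# Inner join on CUSTOMER, mirroring the Join Definition
# "Join1_Input1"."CUSTOMER" = "Join1_Input2"."CUSTOMER".
enriched = devices.merge(customers[["CUSTOMER", "COUNTRY"]], on="CUSTOMER")

# Keep only the columns mapped on the Columns page.
enriched = enriched[["DEVICE", "TYPE", "CUSTOMER", "TIME", "COUNTRY"]]
print(enriched.head())
```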
- Step 4
You now have to provide a destination for the results of the Join operation. This is achieved by specifying a target.
- From the Nodes tab in the left side menu, drag and drop a Data Target to the task.
- Connect the `Join1_Output1` output port of the `Join1` node to the `DataTarget1_Input1` input port of the `DataTarget1` node.
- Double click the newly added Data Target node to open its configuration page. Because you have connected the Join and Data Target nodes, the Modeler automatically detects the columns for the Data Target.
- Click the Browse button for the connection and select `CLOUD_STORAGE`. Then maintain the following configuration for the target:

  | Field | Value |
  | --- | --- |
  | Target | Directory under which the new `.csv` file will be created. Type in the path manually if you want to create a new directory, e.g. `/CSV/EnrichedDevices` |
  | File Format | CSV |
  | Column Delimiter | `;` |
  | Character Set | ISO-8859-2 |
  | Text Delimiter | `"` |
  | Escape Character | `"` |
  | Includes Header | Yes |
- Using the back button at the top left, navigate back to the pipeline, where you have a Workflow Trigger connected to a Spark Data Transform. (An equivalent local write is sketched below.)
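Expressed in pandas, the Data Target settings above correspond roughly to the write below (continuing the sketch from Step 3; the local file name is a stand-in for the `/CSV/EnrichedDevices` target directory in cloud storage).

```python
import pandas as pd

# Rebuild the joined data set from the Step 3 sketch.
devices = pd.read_csv("Devices.csv")
customers = pd.read_csv("Customers.csv")
enriched = devices.merge(customers[["CUSTOMER", "COUNTRY"]], on="CUSTOMER")
enriched = enriched[["DEVICE", "TYPE", "CUSTOMER", "TIME", "COUNTRY"]]

# Write with the Data Target settings: ';' column delimiter, ISO-8859-2
# character set, '"' text delimiter, '"' escape character (quotes are
# escaped by doubling), and an included header row.
enriched.to_csv(
    "EnrichedDevices.csv",
    sep=";",
    encoding="iso-8859-2",
    quotechar='"',
    doublequote=True,
    index=False,
    header=True,
)
```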
- Step 5
- From the Operators tab in the left side menu, drag and drop a Workflow Terminator to the graph.
- Connect the `output` out port of the Spark Data Transform to the `stop` in port of the Workflow Terminator.
- Save the graph with the name `mytest.workflow1` and the description `Workflow 1`.
- Execute the graph using the buttons at the top of the page. The execution status is displayed in the bottom part of the screen; it changes from Running to Completed once the execution finishes, which takes a few minutes.
- Step 6
Now check the result of the Data Transform task. You can do this from the Modeler itself.
- Open the pipeline you have created.
- Double click the Spark Data Transform operator.
- Double click the Data Target node to open its configuration page.
- Navigate to the DATA PREVIEW using the button at the top right corner of the page.
- You may need to increase the total fetched row count to 300 or more, using the funnel icon in the upper right corner, in order to see the entire data set.
You see that, in contrast to the `Devices` data set, the `EnrichedDevices` data set does not include any null values. The original file has been enriched. Using the screenshot and your generated `EnrichedDevices` data set, answer the following question: What is the value of COUNTRY (column C4 in the screenshot) for customer number 1000000001 (row 1 in the screenshot)? Ideally it should be the first row in the `EnrichedDevices` data set. A quick way to look this up locally is sketched below.
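If you keep a local copy of the output, a quick pandas check (same assumptions as the earlier sketches) can confirm the enrichment and look up the row the question asks about:

```python
import pandas as pd

# Read the result back with the settings it was written with.
check = pd.read_csv("EnrichedDevices.csv", sep=";", encoding="iso-8859-2")

# The enriched data set should contain no null values...
print(check.isna().sum())

# ...and COUNTRY can be read off for customer 1000000001
# (assuming CUSTOMER is parsed as an integer column).
print(check.loc[check["CUSTOMER"] == 1000000001, "COUNTRY"])
```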