Use SQL on Files (SOF) to Query Data from a Single Directory
- How to define the database objects required to query structured data files.
- How to add a data source for a single data file and query the output of that file.
- Have access to a licensed managed or standalone SAP HANA data lake.
- Have the HDLFSCLI installed. See the tutorial Getting Started with Data Lake Files HDLFSCLI | Tutorials for SAP Developers.
- Have some structured data files in an SAP HANA data lake file container.
- Have some data on hand to upload to the SAP HANA data lake file container.
- Have already set up an HDLFS connection in Database Explorer.
Querying structured data files (CSV, ORC, Parquet) in an SAP HANA data lake file container can be done using SQL on Files. Below are all of the steps required to start using SQL on Files.
If you have not yet provisioned an SAP HANA data lake, here is a great tutorial on how to do so!
- Step 1
To start off with this tutorial, you will need a SQL console connected directly to your SAP HANA data lake instance. This can be done easily by clicking the Actions tab of your data lake instance in the BTP cockpit or Cloud Central and selecting Open in SAP HANA Database Explorer.
Connect using the HDLADMIN user and grant a database user the role and privileges required for SQL on Files:
CREATE USER SOF_USER IDENTIFIED BY Password1;
GRANT ROLE HDL_FILES_SERVICE_ADMIN TO SOF_USER;
GRANT ALTER ANY TABLE, CREATE ANY TABLE, DROP ANY TABLE, DELETE ANY TABLE,
    INSERT ANY TABLE, LOAD ANY TABLE TO SOF_USER WITH NO ADMIN OPTION;
GRANT MANAGE ANY REMOTE SERVER TO SOF_USER WITH NO ADMIN OPTION;
Now add a new database connection in Database Explorer (DBX) for the user you just created, and log in with that user.
From Database Explorer, right-click your data lake instance, select Properties, copy the entire Host field, and click OK. Then click the + symbol in DBX and, from the dropdown under Database Type, select Data Lake Relational Engine. Paste the copied host value, enter SOF_USER as the username and Password1 as the password, and check the box to save the password.
- Step 2
Now open a new SQL console window, connect it to your HANA data lake instance as the new user SOF_USER, and create a SQL on Files remote server.
CREATE REMOTE SERVER MY_REMOTE_SERVER CLASS 'FILES_SERVICE' READ ONLY VALUE 'ON';
- Step 3
Next, we create a schema in the Files Service. Creating a schema adds a layer of access control over who can reach your SQL on Files tables, and a good schema design reduces database complexity and query time.
CREATE SCHEMA TPCH_SOF IN FILES_SERVICE;
This remote schema will contain a collection of remote tables; each remote table must be associated with an existing remote schema.
Create a SQL on Files remote table in the Files Service.
Note that the table definition must match the schema and data types of the columns in the structured file you are trying to query.
CREATE TABLE TPCH_SOF.CUSTOMER (
    C_CUSTKEY integer,
    C_NAME varchar(300),
    C_ADDRESS varchar(600),
    C_NATIONKEY integer,
    C_PHONE varchar(150),
    C_ACCTBAL decimal(15,3),
    C_MKTSEGMENT varchar(100),
    C_COMMENT varchar(5000)
) IN FILES_SERVICE;
For information about the parameter definitions and supported data types, see the SAP HANA data lake documentation.
- Step 4
Next, we create a virtual table. Notice in the SQL below where the remote server name goes and where the reference to the table in the Files Service goes. Here we create a CUSTOMER virtual table in the data lake Relational Engine (HDLRE) that points to the CUSTOMER table we just created in the SQL on Files service.
The EXISTING keyword is used in conjunction with CREATE TABLE and the AT clause to indicate that the table already exists remotely in SQL on Files and that its metadata should be imported into the new virtual table. This syntax makes the remote table visible to database users.
Note that the table definition must match the schema and data types of the columns in the structured file.
CREATE EXISTING TABLE CUSTOMER (
    C_CUSTKEY integer,
    C_NAME varchar(300),
    C_ADDRESS varchar(600),
    C_NATIONKEY integer,
    C_PHONE varchar(150),
    C_ACCTBAL decimal(15,3),
    C_MKTSEGMENT varchar(100),
    C_COMMENT varchar(5000)
) AT 'MY_REMOTE_SERVER..TPCH_SOF.CUSTOMER';
For the full syntax of clauses available to create an existing table, see CREATE EXISTING TABLE Statement for Data Lake Relational Engine & SQL on Files or CREATE EXISTING TABLE Statement for Data Lake Relational Engine (HANA DB-Managed) & SQL on Files.
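A quick way to confirm the metadata import succeeded is to query the virtual table right away. Since no data source has been attached yet (that happens in Step 6), the query should simply come back empty; this check is an optional sketch, not part of the required setup.

```sql
-- Should return a count of 0: the virtual table exists and its column
-- definitions were imported, but no data source is attached yet.
SELECT COUNT(*) FROM CUSTOMER;
```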
- Step 5
Make sure you have everything set up with respect to the HDLFSCLI. If not, go through the tutorial Getting Started with Data Lake Files HDLFSCLI | Tutorials for SAP Developers.
Use the command below to upload a local file to the data lake:
hdlfscli -config <your config> upload <local path> <target path>
For example, uploading the Lineitem file:
hdlfscli -config myconfig upload C:\Users\<YOUR SAP ID>\Lineitem myPath\Lineitem
Here C:\Users\<YOUR SAP ID>\Lineitem is the local path and myPath\Lineitem is the target path in the file container.
Verify that the file has been uploaded:
hdlfscli -config <your config> lsr
Make sure you have already set up an HDLFS connection in Database Explorer. To learn how, go through the tutorial Setting Up HDLFS Connection In Database Explorer.
- Step 6
Add a data source. This can be done multiple times with multiple files.
Note that in this step the file path can point to an exact file or to a directory. If it points to a directory, SQL on Files will try to parse every data file in that directory, so to avoid parse errors make sure all files in the directory match the schema of the table to which the data source is being added.
For further reference, see ALTER (Remote) TABLE ADD DATASOURCE Statement for Data Lake Relational Engine (HANA DB-Managed) and SQL on Files.
ALTER TABLE TPCH_SOF.CUSTOMER IN FILES_SERVICE
    ADD DATASOURCE AS CUSTOMER_DATASOURCE
    CSV(webhdfs('hdlfs:///<your path>')) DELIMITED BY ',' ENCODING 'utf-8' SKIP 1;
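Since the path may also point to a directory, a data source covering a whole directory can be added the same way. The sketch below is illustrative: the path hdlfs:///myPath/ and the data source name CUSTOMER_DIR_DATASOURCE are hypothetical placeholders, and every file under the directory must match the table's schema.

```sql
-- Illustrative directory-wide data source: SQL on Files will attempt to
-- parse every file under 'hdlfs:///myPath/' (a placeholder path), so all
-- files there must match the CUSTOMER table definition.
ALTER TABLE TPCH_SOF.CUSTOMER IN FILES_SERVICE
    ADD DATASOURCE AS CUSTOMER_DIR_DATASOURCE
    CSV(webhdfs('hdlfs:///myPath/')) DELIMITED BY ',' ENCODING 'utf-8' SKIP 1;
```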
Now that you have created all the foundational database objects for querying your file, you can query it as if it were a table.
Note that the table name here is the name of the virtual (EXISTING) table created in Step 4 above.
SELECT * FROM CUSTOMER;
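Once a data source is attached, the virtual table behaves like any other table in queries. As a small sketch (column names taken from the CUSTOMER definition above), an aggregate works directly against the file-backed data:

```sql
-- Aggregate over the file-backed virtual table exactly as over a
-- regular table: customers per market segment with average balance.
SELECT C_MKTSEGMENT,
       COUNT(*)      AS NUM_CUSTOMERS,
       AVG(C_ACCTBAL) AS AVG_BALANCE
FROM CUSTOMER
GROUP BY C_MKTSEGMENT
ORDER BY NUM_CUSTOMERS DESC;
```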
- Step 7
If you want to drop a data source from a SQL on Files table, you can do so as well.
ALTER TABLE TPCH_SOF.CUSTOMER IN FILES_SERVICE DROP DATASOURCE CUSTOMER_DATASOURCE;
To clean up the SQL on Files objects created above, drop the virtual table, then the Files Service table, and lastly the Files Service schema.
DROP TABLE CUSTOMER;
DROP TABLE TPCH_SOF.CUSTOMER IN FILES_SERVICE;
DROP SCHEMA TPCH_SOF IN FILES_SERVICE;
In what order do you need to create the database objects required to use SQL on Files to query a file from a single directory?