# Analyze Graph Workspace in SAP HANA Cloud, SAP HANA Database

Learn to use algorithms to analyze and process a Graph Workspace in SAP HANA Cloud, SAP HANA database.
You will learn
• How to analyze and process your graph using algorithms
• How to create visualization for sub-graphs of components in a Graph Workspace
• How to perform dependency analysis using a graph
Created by
August 3, 2021
Contributors

## Prerequisites

The hana-ml library provides functionalities to analyze and process the graph. In this tutorial, you will learn about algorithms that can help you create sub-graphs of components in a Graph Workspace. Through a dependency analysis of a sample problem, you can also learn about a few special algorithms that assist analysis and visualization of graphs.

The following terms are used in specific contexts:

data-frame : when referring to the logical concept of a data frame

`dataframe`: when referring to a Python Object

• Step 1

Execute `g_storm.describe()` in a new cell of your Jupyter Notebook. This gives you statistical information about the graph like the node’s degree, number of triangles, density, etc.

In this example, let’s focus on `IS_CONNECTED`, which indicates that you don’t have one connected graph, but multiple independent ones.

• Step 2

The hana-ml library provides algorithms that can be executed on the graph like the shortest path, neighbors, topological sort or weakly connected components. You can find them in the `hana_ml.graph.algorithms` package. The execution pattern for these algorithms is always the same as given below:

Python
Copy
``````result = hana_ml.graph.algorithms.<algorithm_name>(graph=<graph_instance>).execute(<parameters>)
``````

Every algorithm is implemented in a separate Python class which you can instantiate using `<algorithm_name>`. This class expects the graph instance as a constructor parameter specified by `graph=<graph_instance>`, so that it knows on which graph to perform the algorithms. Finally, every algorithm class implements an `execute()` method which triggers the execution of the algorithm. The result gives you access to the results of the algorithm execution. The properties available depend on the algorithm you execute.

• Step 3

In this case, you will use the `WeaklyConnectedComponents` algorithm to evaluate which connected components you have.

Each weakly connected component represents a sub-graph that is not connected to the other components.

Execute the following code in a new cell:

Python
Copy
``````import hana_ml.graph.algorithms as hga

wcc = hga.WeaklyConnectedComponents(graph=g_storm).execute()

print(f'There are {wcc.components_count} components in the Graph.')
``````

The result here indicates that you have 8332 independent components in the graph. Let’s have a closer look at the algorithm as it also exposes other details about these components.

The above result contains a property called components. Run the following command in a new cell:

Python
Copy
``````# Which are the largest components (i.e. sub networks)?
``````

Since these components exist as Pandas `dataframe` (every object that materializes data from the database is represented as a Pandas `dataframe` in hana-ml), you need to use Pandas capabilities to only return the five largest components.

Every object that materializes data from the database is represented as a Pandas `dataframe` in hana-ml.

Notice the two components- number 25 and number 5. They look interesting.

The algorithm `wcc.vertices` provides a data-frame that maps each vertex to the component it belongs to. So far, you only have the information about the components of a vertex and therefore can’t visualize a graph without the missing edges. So, let’s store the components to the database and use the `subgraph()` method for the visualization.

You can use the same method to upload the data to the database that you used to create the graph. Run the following code in a new cell:

Python
Copy
``````hdf_wcc = create_dataframe_from_pandas(
connection_context=cc,
pandas_df=wcc.vertices,
drop_exist_tab=True,
table_name='LM_STORMWATER_WCC',
force=True,
allow_bigint=True,
primary_key='ID')
``````

• Step 4

You will visualize the two biggest components in the graph - number 25 and number 5, to gain further insights. The Graph object provides a method to create sub-graphs based on filter criteria on vertices or edges. Let’s run the following in a new cell:

Python
Copy
``````g_storm_comp1 = g_storm.subgraph(
workspace_name = "LM_STORMWATER_COMP1",
vertices_filter='ID IN (SELECT ID FROM LM_STORMWATER_WCC WHERE COMPONENT = 25)',
force = True
)
``````

This creates a new graph with the name `LM_STORMWATER_COMP1` based on a vertex filter. The filter selects only vertices from the components you selected in the step before which belongs to component 25. In other words, you now have a new graph representing the vertices and edges belonging to component 25.

Let’s do the same for the vertices of component 5:

Python
Copy
``````g_storm_comp2 = g_storm.subgraph(
workspace_name = "LM_STORMWATER_COMP2",
vertices_filter='ID IN (SELECT ID FROM LM_STORMWATER_WCC WHERE COMPONENT = 5)',
force = True
)
``````

Since you have two complete graphs now, you can visualize them like you did before:

Python
Copy
``````pdf_storm_comp1_edges = g_storm_comp1.edges_hdf \
.select('ID', 'SOURCE', 'TARGET', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
pdf_storm_comp2_edges = g_storm_comp2.edges_hdf \
.select('ID', 'SOURCE', 'TARGET', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
map = KeplerGl(height=600, width=800)
map
``````

After completion, your notebook should look like this:

In the next step, you learn how to evaluate which sections of the water network could be the causal factor of a problem reported on a specific access point (i.e. vertex). In that step you will get to know the algorithms `Neighbors`, `NeighborsSubgraphs` and `ShortestPath`.

• Step 5

Let’s assume somebody reported a problem with the node `WCC_SW002719`. Suppose there is a reduced flow rate which might indicate a broken pipe somewhere else. You will want to analyze the scenario further.

First, let’s load the vertex itself. You can use the same procedure used before. Since you want to visualize it, you must materialize the position as a `GeoJSON` string. Run the following in a new cell:

Python
Copy
``````start_vertex_id = 'WCC_SW002719'
# Get the details of that vertex
start_vertex = g_storm_comp2.vertices_hdf \
.filter(f"ID = '{start_vertex_id}'") \
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
start_vertex
``````
• Step 6

Let’s use the `Neighbors` algorithm to find all neighbors to the problem vertex.
To display the 5 closest vertices:

Python
Copy
``````neighbors = hga.Neighbors(graph=g_storm_comp2).execute(
start_vertex=start_vertex_id,
direction='ANY',
lower_bound=1,
upper_bound=5)

``````

If you want to display the vertices on the map, you need one additional step. Here, the vertices data-frame can only return vertex IDs. So, you must read the additional columns you need for visualizing separately from the database. To do so, you can use the `filter()` method of the HANA data-frame:

Python
Copy
``````vkc=g_storm_comp2.vertex_key_column
in_list = neighbors.vertices.ID.str.cat(sep="','")
filter = f"{vkc} IN ('{in_list}')"  # Dynamically build the filter condition as SQL WHERE
print(filter)
pdf_storm_comp2_neighbors = g_storm_comp2.vertices_hdf \
.filter(filter) \
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
``````

With that, you query and materialize all positions of all the vertices which are 5 hops away from the start vertex. Visualizing follows the well-known pattern:

Python
Copy
``````map = KeplerGl(height=600, width=800)
map
``````

The final view of the Notebook should look like this:

• Step 7

The image in the last step only reveals the vertices directly surrounding the start vertex. However, when you want to find out what the dependent factors of your reported problem is, you need to know where to start looking and how these vertices are connected.

The water network is a directed graph. This means that you’re only interested in upstream nodes assuming that a reduced rate of flow must be caused by an issue along the upstream region. Therefore, instead of individual points let’s look at sub-graphs again:

Python
Copy
``````g_neighbors_upstream = hga.NeighborsSubgraph(graph=g_storm_comp2).execute(
start_vertex=start_vertex_id, direction='INCOMING',
lower_bound=0, upper_bound=10000)
``````

This creates a sub-graph that only evaluates the incoming edges (upstream) starting from a given vertex. Same is the case for a sub-graph evaluating only outgoing edges (downstream), as shown below:

Python
Copy
``````g_neighbors_downstream = hga.NeighborsSubgraph(graph=g_storm_comp2).execute(
start_vertex=start_vertex_id, direction='OUTGOING',
lower_bound=0, upper_bound=10000)
``````

You can see from the above code that you will get only the source and target IDs without the additional information, for example, `g_neighbors_downstream.edges.head(5)`. Therefore, you’ve got to load the spatial information of the edges in the same way you did in the above sections.

Python
Copy
``````ekc = g_storm_comp2.edge_key_column

in_list = g_neighbors_upstream.edges.ID.astype(str).str.cat(sep=',' )
pdf_storm_comp2_neighbors_upstream_edges = g_storm_comp2.edges_hdf \
.filter(f"{ekc} IN ({in_list})") \
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()

in_list = g_neighbors_downstream.edges.ID.astype(str).str.cat(sep=',' )
pdf_storm_comp2_neighbors_downstream_edges = g_storm_comp2.edges_hdf \
.filter(f"{ekc} IN ({in_list})") \
.select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
``````

With that information you can visualize it again:

Python
Copy
``````map = KeplerGl(height=600, width=800)
map
``````

Now you understand which vertices are upstream of your problem. Moving on, the next question is- In which order should you check the incoming vertices to find the one which is the dependent factor? You will use another algorithm in the next step to solve this question.

• Step 8

The algorithm `ShortestPathsOneToAll` creates a list of the shortest paths from one vertex to all the other vertices in the graph.

Python
Copy
``````# The Shortest Path One to All, could give an indication about what to check first
spoa = hga.ShortestPathsOneToAll(graph=g_storm_comp2).execute(source=start_vertex_id, direction='INCOMING', weight='LENGTH_M')
spoa.vertices.sort_values('DISTANCE')
``````

By having `direction`= `INCOMING` you specify that you’re only interested in the upstream (i.e. incoming) vertices. By having `weight`=`LENGTH_M` you specify what information (i.e. column) is used to calculate the `DISTANCE`. In this case, you are taking the length of a segment while if let unspecified, the algorithm takes the number of hops.

With this information you could traverse through your graph along the nodes, assuming that the issue is close to the reported intake. Hence, you can check the upstream nodes one-by-one in the order of their distances from the problem node.

In this tutorial, you have explored the SAP HANA Cloud, SAP HANA database smart multi-model features of the hana-ml library for Python. Hopefully, you’ve are curious to learn more about these features. If you want to use them in your daily work, you can find the complete notebook here.