
Analyze Graph Workspace in SAP HANA Cloud, SAP HANA Database

Learn to use algorithms to analyze and process a Graph Workspace in SAP HANA Cloud, SAP HANA database.
You will learn
  • How to analyze and process your graph using algorithms
  • How to create visualization for sub-graphs of components in a Graph Workspace
  • How to perform dependency analysis using a graph
Created by
VijayKrishnanSR
August 3, 2021
Contributors
VijayKrishnanSR

Prerequisites

The hana-ml library provides functionalities to analyze and process the graph. In this tutorial, you will learn about algorithms that can help you create sub-graphs of components in a Graph Workspace. Through a dependency analysis of a sample problem, you can also learn about a few special algorithms that assist analysis and visualization of graphs.

The following terms are used in specific contexts:

data-frame: when referring to the logical concept of a data frame

dataframe: when referring to a Python object

  • Step 1

    Execute g_storm.describe() in a new cell of your Jupyter Notebook. This gives you statistical information about the graph, such as node degrees, number of triangles, density, etc.

    In this example, let’s focus on IS_CONNECTED, which indicates that you don’t have one connected graph, but multiple independent ones.

    statistical Node Information
  • Step 2

    The hana-ml library provides algorithms that can be executed on the graph, such as shortest path, neighbors, topological sort, or weakly connected components. You can find them in the hana_ml.graph.algorithms package. The execution pattern for these algorithms is always the same, as given below:

    Python
    result = hana_ml.graph.algorithms.<algorithm_name>(graph=<graph_instance>).execute(<parameters>)
    

    Every algorithm is implemented in a separate Python class which you can instantiate using <algorithm_name>. This class expects the graph instance as a constructor parameter specified by graph=<graph_instance>, so that it knows on which graph to perform the algorithms. Finally, every algorithm class implements an execute() method which triggers the execution of the algorithm. The result gives you access to the results of the algorithm execution. The properties available depend on the algorithm you execute.

  • Step 3

    In this case, you will use the WeaklyConnectedComponents algorithm to evaluate which connected components you have.

    Each weakly connected component represents a sub-graph that is not connected to the other components.
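    To build intuition for what the algorithm computes, here is a toy, self-contained sketch of a weakly connected components computation in plain Python. The vertices and edges below are invented for illustration; hana-ml runs the real computation inside the database.

```python
# Toy weakly connected components via union-find (plain Python, not hana-ml).
# Edge direction is ignored: an edge in either direction links two vertices.

def weakly_connected_components(vertices, edges):
    """Return a dict mapping each vertex to a representative component id."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for source, target in edges:
        parent[find(source)] = find(target)  # union the two components

    return {v: find(v) for v in vertices}

vertices = ['A', 'B', 'C', 'D', 'E']
edges = [('A', 'B'), ('B', 'C'), ('D', 'E')]  # two independent sub-graphs
components = weakly_connected_components(vertices, edges)
component_count = len(set(components.values()))
print(component_count)  # 2
```

    A graph is connected exactly when this count is 1, which is what the IS_CONNECTED flag from describe() reported as False for the stormwater graph.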

    Execute the following code in a new cell:

    Python
    import hana_ml.graph.algorithms as hga
    
    wcc = hga.WeaklyConnectedComponents(graph=g_storm).execute()
    
    print(f'There are {wcc.components_count} components in the Graph.')
    
    Total components

    The result here indicates that you have 8332 independent components in the graph. Let’s have a closer look at the algorithm as it also exposes other details about these components.

    The above result contains a property called components. Run the following command in a new cell:

    Python
    # Which are the largest components (i.e. sub networks)?
    wcc.components.sort_values(by='NUMBER_OF_VERTICES', ascending=False).head(5)
    

    Since these components exist as a Pandas dataframe (every object that materializes data from the database is represented as a Pandas dataframe in hana-ml), you use Pandas capabilities to return only the five largest components.
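    If you want to try the Pandas pattern stand-alone, here is a minimal sketch with a mock components table. The COMPONENT and NUMBER_OF_VERTICES values below are invented for illustration, not the stormwater results.

```python
import pandas as pd

# Mock component statistics (illustrative values only).
components = pd.DataFrame({
    'COMPONENT': [1, 2, 3, 4, 5, 6],
    'NUMBER_OF_VERTICES': [12, 3407, 9, 58, 2911, 4],
})

# Same Pandas pattern as above: the five largest components by vertex count.
largest = components.sort_values(by='NUMBER_OF_VERTICES', ascending=False).head(5)
print(largest)
```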


    Notice the two components, number 25 and number 5. They are the two largest and worth a closer look.

    Two components

    The result's wcc.vertices property provides a data-frame that maps each vertex to the component it belongs to. So far, you only know which component each vertex belongs to, and without the edges you can't visualize a graph. So, let's store the components in the database and use the subgraph() method for the visualization.

    You can use the same method to upload the data to the database that you used to create the graph. Run the following code in a new cell:

    Python
    hdf_wcc = create_dataframe_from_pandas(
        connection_context=cc,
        pandas_df=wcc.vertices,
        drop_exist_tab=True,
        table_name='LM_STORMWATER_WCC',
        force=True,
        allow_bigint=True,
        primary_key='ID')
    

    Now, you have uploaded the algorithm results to your database.

    Upload results
  • Step 4

    You will visualize the two biggest components in the graph, number 25 and number 5, to gain further insights. The Graph object provides a method to create sub-graphs based on filter criteria on vertices or edges. Let's run the following in a new cell:

    Python
    g_storm_comp1 = g_storm.subgraph(
        workspace_name = "LM_STORMWATER_COMP1",
        vertices_filter='ID IN (SELECT ID FROM LM_STORMWATER_WCC WHERE COMPONENT = 25)',
        force = True
    )
    

    This creates a new graph with the name LM_STORMWATER_COMP1 based on a vertex filter. The filter selects only the vertices that belong to component 25, using the component mapping you uploaded in the step before. In other words, you now have a new graph representing the vertices and edges belonging to component 25.
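    Conceptually, a vertex filter induces a sub-graph: the selected vertices are kept, together with only those edges whose endpoints both survive the filter. The sketch below illustrates this client-side with invented data; subgraph() performs this work inside the database.

```python
import pandas as pd

# Mock vertex-to-component mapping and edge list (illustrative data only).
vertices = pd.DataFrame({'ID': ['A', 'B', 'C', 'D'],
                         'COMPONENT': [25, 25, 5, 5]})
edges = pd.DataFrame({'SOURCE': ['A', 'C'],
                      'TARGET': ['B', 'D']})

# A vertex filter induces the sub-graph: keep the selected vertices and
# only those edges whose endpoints both survive the filter.
keep = set(vertices.loc[vertices['COMPONENT'] == 25, 'ID'])
sub_edges = edges[edges['SOURCE'].isin(keep) & edges['TARGET'].isin(keep)]
print(sorted(keep))    # vertices of component 25
print(len(sub_edges))  # edges fully inside the component
```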

    Let’s do the same for the vertices of component 5:

    Python
    g_storm_comp2 = g_storm.subgraph(
        workspace_name = "LM_STORMWATER_COMP2",
        vertices_filter='ID IN (SELECT ID FROM LM_STORMWATER_WCC WHERE COMPONENT = 5)',
        force = True
    )
    
    Separate graphs

    Since you have two complete graphs now, you can visualize them like you did before:

    Python
    pdf_storm_comp1_edges = g_storm_comp1.edges_hdf \
        .select('ID', 'SOURCE', 'TARGET', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
    pdf_storm_comp2_edges = g_storm_comp2.edges_hdf \
        .select('ID', 'SOURCE', 'TARGET', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
    map = KeplerGl(height=600, width=800)
    map.add_data(pdf_storm_comp1_edges, 'Stormwater Component 1')
    map.add_data(pdf_storm_comp2_edges, 'Stormwater Component 2')
    map
    

    After completion, your notebook should look like this:

    Visualize two graphs

    In the next steps, you will learn how to evaluate which sections of the water network could be the causal factor of a problem reported on a specific access point (i.e. vertex). Along the way, you will get to know the algorithms Neighbors, NeighborsSubgraphs, and ShortestPathsOneToAll.

  • Step 5

    Let's assume somebody reported a problem with the node WCC_SW002719: a reduced flow rate, which might indicate a broken pipe somewhere else. You want to analyze the scenario further.

    First, let’s load the vertex itself. You can use the same procedure used before. Since you want to visualize it, you must materialize the position as a GeoJSON string. Run the following in a new cell:

    Python
    start_vertex_id = 'WCC_SW002719'
    # Get the details of that vertex
    start_vertex = g_storm_comp2.vertices_hdf \
        .filter(f"ID = '{start_vertex_id}'") \
        .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
    start_vertex
    
    Load Vertex
  • Step 6

    Let's use the Neighbors algorithm to find all neighbors of the problem vertex within one to five hops. To display the first five of them:

    Python
    neighbors = hga.Neighbors(graph=g_storm_comp2).execute(
        start_vertex=start_vertex_id,
        direction='ANY',
        lower_bound=1,
        upper_bound=5)
    
    neighbors.vertices.head(5)
    
    Neighbors algorithm
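    Conceptually, Neighbors performs a breadth-first search bounded by the hop limits. Here is a toy version in plain Python with an invented adjacency list; direction='ANY' means each pipe is traversable both ways, so both directions are listed.

```python
from collections import deque

# Toy bounded breadth-first search (plain Python, illustrative graph).
# Conceptually similar to Neighbors with lower_bound/upper_bound hop limits.

def neighbors_within(adjacency, start, lower_bound, upper_bound):
    """Vertices whose hop distance from start lies in [lower_bound, upper_bound]."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if distance[vertex] == upper_bound:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(vertex, []):
            if nxt not in distance:
                distance[nxt] = distance[vertex] + 1
                queue.append(nxt)
    return sorted(v for v, d in distance.items() if lower_bound <= d <= upper_bound)

# Both directions of each pipe are listed, mimicking direction='ANY'.
adjacency = {'X': ['A'], 'A': ['X', 'B'], 'B': ['A', 'C'], 'C': ['B']}
print(neighbors_within(adjacency, 'X', 1, 2))  # ['A', 'B']
```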

    If you want to display the vertices on the map, you need one additional step. The neighbors.vertices data-frame contains only vertex IDs, so you must read the additional columns you need for visualization separately from the database. To do so, you can use the filter() method of the HANA data-frame:

    Python
    vkc=g_storm_comp2.vertex_key_column
    in_list = neighbors.vertices.ID.str.cat(sep="','")
    filter = f"{vkc} IN ('{in_list}')"  # Dynamically build the filter condition as SQL WHERE
    print(filter)
    pdf_storm_comp2_neighbors = g_storm_comp2.vertices_hdf \
        .filter(filter) \
        .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
    
    Separate Visualization
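    The dynamic filter construction can be tried stand-alone. Below is a minimal sketch with invented vertex IDs; vertex_key_column stands in for the value that g_storm_comp2.vertex_key_column would return.

```python
import pandas as pd

# Stand-alone version of the filter construction above, with mock vertex IDs.
ids = pd.Series(['WCC_SW000001', 'WCC_SW000002', 'WCC_SW000003'], name='ID')

vertex_key_column = 'ID'  # stand-in for g_storm_comp2.vertex_key_column
in_list = ids.str.cat(sep="','")  # join IDs with quote-comma-quote
filter_condition = f"{vertex_key_column} IN ('{in_list}')"
print(filter_condition)
# ID IN ('WCC_SW000001','WCC_SW000002','WCC_SW000003')
```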

    With that, you query and materialize the positions of all vertices that are up to 5 hops away from the start vertex. Visualizing follows the well-known pattern:

    Python
    map = KeplerGl(height=600, width=800)
    map.add_data(pdf_storm_comp2_neighbors, '5-hop neighbors')
    map.add_data(start_vertex, 'Start Vertex')
    map
    

    The final view of the Notebook should look like this:

    Vertex Visualization
  • Step 7

    The image in the last step only reveals the vertices directly surrounding the start vertex. However, when you want to find out what the dependent factors of your reported problem are, you need to know where to start looking and how these vertices are connected.

    The water network is a directed graph. This means that you're only interested in upstream nodes, assuming that a reduced rate of flow must be caused by an issue along the upstream region. Therefore, instead of individual points, let's look at sub-graphs again:

    Python
    g_neighbors_upstream = hga.NeighborsSubgraphs(graph=g_storm_comp2).execute(
        start_vertex=start_vertex_id, direction='INCOMING',
        lower_bound=0, upper_bound=10000)
    

    This creates a sub-graph that evaluates only the incoming edges (upstream) starting from the given vertex. The same works for a sub-graph evaluating only the outgoing edges (downstream), as shown below:

    Python
    g_neighbors_downstream = hga.NeighborsSubgraphs(graph=g_storm_comp2).execute(
        start_vertex=start_vertex_id, direction='OUTGOING',
        lower_bound=0, upper_bound=10000)
    
    Upstream and Downstream
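    The effect of the direction parameter can be illustrated with a toy directed edge list in plain Python (invented data): 'OUTGOING' follows edges from source to target, while 'INCOMING' walks them in reverse.

```python
# Toy illustration of direction on a directed edge list (plain Python).
# 'OUTGOING' follows edges source -> target (downstream); 'INCOMING'
# follows them in reverse (upstream).

def reachable(edges, start, direction):
    if direction == 'INCOMING':
        edges = [(t, s) for s, t in edges]  # walk edges backwards
    adjacency = {}
    for s, t in edges:
        adjacency.setdefault(s, []).append(t)
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return sorted(seen - {start})

# Water flows A -> B -> C -> D; the problem is reported at C.
edges = [('A', 'B'), ('B', 'C'), ('C', 'D')]
print(reachable(edges, 'C', 'INCOMING'))  # upstream:   ['A', 'B']
print(reachable(edges, 'C', 'OUTGOING'))  # downstream: ['D']
```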

    As you can see from the above code (for example, by running g_neighbors_downstream.edges.head(5)), you only get the source and target IDs without the additional information. Therefore, you have to load the spatial information of the edges in the same way you did in the sections above.

    Python
    ekc = g_storm_comp2.edge_key_column
    
    in_list = g_neighbors_upstream.edges.ID.astype(str).str.cat(sep=',' )
    pdf_storm_comp2_neighbors_upstream_edges = g_storm_comp2.edges_hdf \
        .filter(f"{ekc} IN ({in_list})") \
        .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
    
    in_list = g_neighbors_downstream.edges.ID.astype(str).str.cat(sep=',' )
    pdf_storm_comp2_neighbors_downstream_edges = g_storm_comp2.edges_hdf \
        .filter(f"{ekc} IN ({in_list})") \
        .select('ID', ('SHAPE_GEO.ST_TRANSFORM(4326).ST_ASGEOJSON()', 'GJ')).collect()
    
    Load Spatial Information

    With that information you can visualize it again:

    Python
    map = KeplerGl(height=600, width=800)
    map.add_data(start_vertex, 'Start Vertex')
    map.add_data(pdf_storm_comp2_neighbors_upstream_edges, 'Upstream')
    map.add_data(pdf_storm_comp2_neighbors_downstream_edges, 'Downstream')
    map
    
    Upstream Downstream Visualization

    Now you understand which vertices are upstream of your problem. Moving on, the next question is: in which order should you check the incoming vertices to find the one that is the dependent factor? You will use another algorithm in the next step to solve this question.

  • Step 8

    The algorithm ShortestPathsOneToAll creates a list of the shortest paths from one vertex to all the other vertices in the graph.

    Python
    # The Shortest Path One to All, could give an indication about what to check first
    spoa = hga.ShortestPathsOneToAll(graph=g_storm_comp2).execute(source=start_vertex_id, direction='INCOMING', weight='LENGTH_M')
    spoa.vertices.sort_values('DISTANCE')
    
    Shortest Upstream Vertices

    By setting direction='INCOMING', you specify that you're only interested in the upstream (i.e. incoming) vertices. With weight='LENGTH_M', you specify which information (i.e. column) is used to calculate the DISTANCE. In this case, the length of each segment is used; if left unspecified, the algorithm counts the number of hops instead.
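    To see how the weight changes the result, here is a toy one-to-all shortest-path computation in plain Python: a small Dijkstra sketch with invented pipe lengths. When every weight is set to 1, the distances reduce to hop counts.

```python
import heapq

# Toy one-to-all shortest paths (plain Python). With weights this is Dijkstra;
# with every weight set to 1 the distances reduce to hop counts.

def shortest_distances(adjacency, source):
    distances = {source: 0}
    queue = [(0, source)]
    while queue:
        dist, vertex = heapq.heappop(queue)
        if dist > distances.get(vertex, float('inf')):
            continue  # stale queue entry
        for nxt, weight in adjacency.get(vertex, []):
            cand = dist + weight
            if cand < distances.get(nxt, float('inf')):
                distances[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return distances

# Mock pipe lengths in meters (illustrative only).
weighted = {'S': [('A', 120.0), ('B', 30.0)], 'B': [('A', 40.0)]}
hops = {v: [(t, 1) for t, _ in nbrs] for v, nbrs in weighted.items()}

print(shortest_distances(weighted, 'S'))  # A via B: 70.0 < 120.0 direct
print(shortest_distances(hops, 'S'))      # A in 1 hop (direct edge)
```

    Note how the ordering changes: by hop count A is as close as B, but by pipe length the shortest route to A runs through B, which is exactly why weight='LENGTH_M' can reorder which upstream node to inspect first.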

    With this information you could traverse through your graph along the nodes, assuming that the issue is close to the reported intake. Hence, you can check the upstream nodes one-by-one in the order of their distances from the problem node.

    In this tutorial, you have explored the SAP HANA Cloud, SAP HANA database smart multi-model features of the hana-ml library for Python. Hopefully, you are now curious to learn more about these features. If you want to use them in your daily work, you can find the complete notebook here.

    For more learning materials on SAP HANA Cloud, click here. Follow our tag in the SAP Community to stay up-to-date on the latest updates and newest content!


  • Step 9

    The execution pattern for graph algorithms in hana-ml is given below. (Fill in the blank)
    ```
    result = hana_ml.graph.algorithms.<algorithm_name>(graph=<graph_instance>).________(<parameters>)
    ```
