Graph Databases in SQL Server

A little theoretical background is in order before diving into the mechanics of Graph databases. In relational databases, tables typically connect or join based on a unique value in one table to another table where the same value may appear one or more times. 

Called a one-to-one or one-to-many relationship, it functions very well for most purposes. Its enough to say that Graph databases seek to model similar information by expressing the relationship as part of the model. So using the student class example, a graph database design could be visualized like the figure above.

Dealing with real world situations with many-to-many relationship representations has been more awkward. As an example, this image shows a representation of a classic many-to-many situation: class enrollment. 

 


A single student may be enrolled in many classes, and a single class may contain many students. In order to model this in a relational database, you would need to use an intermediary table, with one row for each student and each class they are enrolled in. By doing it this way, you can query the tables in either direction to get a list of all the students with the classes they are enrolled in, or all of the classes and which students are enrolled in those classes.

While graph databases have existed almost as long as relational databases, its only since the advent of NoSQL databases that they have become more widespread. An introduction to Graph databases with all of its nuances is beyond the scope of this series, but for those who are interested, Neo4J, Inc (manufacturer of possibly the most popular graph databases system) provides a free ebook covering many Graph database concepts.

Note that the enrollment relationship shows arrows on both ends, meaning that its bidirectional. Students can be enrolled in a class, and a classes can have many students enrolled. Some situations don’t make sense to have a bidirectional relationship, say for example ownership of a book. A person may own many books in their lifetime, but an individual books doesn’t belong to many people. For bidirectional relationships, use arrows on both ends of the relationship, otherwise just arrows in one direction.

CREATE A GRAPH DATABASE IN MICROSOFT SQL SERVER

In SQL Server 2017, Microsoft started incorporating graph processing technology. While still new and not without some warts, it does hold some promise and the basic functionality is sufficient for what I wanted to accomplish here. Often times when working on a project, I don’t have access to the underlying OLTP database model and have to do a lot of discovery to understand how data is processed through the system. A visual representation of the graph tables is a quick way to understand the relationships and the entities in a graph database. The initial genesis of this series was to show how to generate a visualization of a SQL Server graph database to provide that understanding. It grew into producing a reusable process to generate visualizations from any Microsoft graph database just from knowing the name of the database.

For initial testing, here is the code to create a simple graph database, based on the Student <- enrolled -> Class model.


CREATE DATABASE StudentClass; 
GO 
USE StudentClass; 
GO 

--Create Node tables and populate them 
CREATE TABLE Student( StudentID int PRIMARY KEY, StudentName varchar (100) ) as NODE; 

INSERT INTO Student Values (1, 'Will Shakespeare'); 
INSERT INTO Student Values (2, 'Percy Shelley'); 
INSERT INTO Student Values (3, 'Chuck Dickens'); 
INSERT INTO Student Values (4, 'Art Doyle'); 
INSERT INTO Student Values (5, 'Gina Wolfe'); 

CREATE TABLE Class( ClassID int PRIMARY KEY, ClassName varchar(100) ) as NODE; 

INSERT INTO Class Values (1, 'History of Surfing'); 
INSERT INTO Class Values (2, 'Joy of Garbage'); 
INSERT INTO Class Values (3, 'Art of Walking'); 
INSERT INTO Class Values (4, 'Street Fighting Mathematics'); 
INSERT INTO Class Values (5, 'Fermentation Studies'); 

--Create first EDGE table and populate it. 
CREATE TABLE enrolledIn as Edge; 

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 1), (Select $node_id from Class WHERE ClassID = 1));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 1), (Select $node_id from Class WHERE ClassID = 2));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 1), (Select $node_id from Class WHERE ClassID = 3));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 2), (Select $node_id from Class WHERE ClassID = 3));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 2), (Select $node_id from Class WHERE ClassID = 4));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 3), (Select $node_id from Class WHERE ClassID = 1));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 3), (Select $node_id from Class WHERE ClassID = 4));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 3), (Select $node_id from Class WHERE ClassID = 5));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 4), (Select $node_id from Class WHERE ClassID = 2));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 4), (Select $node_id from Class WHERE ClassID = 5));

INSERT INTO enrolledIn VALUES ((SELECT $node_id FROM Student WHERE StudentID = 5), (Select $node_id from Class WHERE ClassID = 3));

--Create second EDGE table and populate it. 
CREATE TABLE enrolledStudent as Edge;
 
INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 1), (SELECT $node_id FROM Student WHERE StudentID = 1));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 1), (SELECT $node_id FROM Student WHERE StudentID = 3));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 2), (SELECT $node_id FROM Student WHERE StudentID = 1));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 2), (SELECT $node_id FROM Student WHERE StudentID = 4));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 2), (SELECT $node_id FROM Student WHERE StudentID = 3));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 3), (SELECT $node_id FROM Student WHERE StudentID = 1));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 3), (SELECT $node_id FROM Student WHERE StudentID = 2));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 4), (SELECT $node_id FROM Student WHERE StudentID = 2));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 4), (SELECT $node_id FROM Student WHERE StudentID = 3));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 5), (SELECT $node_id FROM Student WHERE StudentID = 3));

INSERT INTO enrolledStudent VALUES ((Select $node_id from Class WHERE ClassID = 5), (SELECT $node_id FROM Student WHERE StudentID = 4));


This code creates two Node tables (Student and Class), and the Edge table that bridges them (enrolledIn). Sample data for five students, five classes, and who is enrolled in each class is added. The edge data goes in both directions, showing which classes each student is enrolled in, and for each class, which students are enrolled.

In SQL Server Management Studio or Azure Data Studio, you can tell the difference between the tables by looking at the icon for the table. Node tables use an icon with a small enclosed dot in the lower left corner. Edge tables use an icon with an open set of dots with a bar between them.

We now have a functional, but simple graph database. The database has two edge tables (enrolledIn and enrolledStudent) and two node tables (Class and Student).

GRAPH DATABASE ADDITIONS TO SQL SERVER

When creating graph database tables, SQL Server generates a unique node_id and edge_id for the relevant tables. To make querying easier, generic functions $node_id and $edge_id have been added to T-SQL allowing for reusable code without having to determine that unique ID. You can see the $node_id being used in the code for the DML for the sample database.

Above is a screenshot of the enrolledIn edge table on my system. Note that the edge_ID field is really just a substitute for a PK and the data for the from_ID and to_ID fields are essentially FKs to help in locating the relevant records in the node tables. As in a normal relational database this table acts as a bridge table. The $node_id and $edge_id functions will be used later to identify the relationships between the graph tables.

In SQL Server 2017, Microsoft modified some of the system views and functions that provide information on the various components in a database to support the new graph functionality. This Microsoft webpage provides a considerable amount of information for those interested in learning more.

For the purpose of this article, I am working only with a subset of the new information provided. Specifically additional information added to sys.tables where tables now have two new bit datatype columns: is_node and is_edge, and the node_id$ and edge_id$ information added to the edge tables.

When querying the sys.tables view for a table, if is_node = 1, the table is a node. If is_edge = 1, the table is an edge. If both values are zero, then the table is neither an edge or a node. The values cannot be 1 for both fields.

After creating the database StudentClass and the tables from the provided code, I ran the query below to see the sys.tables information, which identifies which tables are nodes and edges. For databases with non graph tables, a WHERE clause excluding tables where is_edge and is_node are both zero would replicate the same functionality:

Use StudentClass; 
go 
select name, object_id, is_edge, is_node 
from sys.tables where type = 'U';




Six new system functions were also added to SQL Server to provide information on graph database components. Of these, I used two to determine the information needed to map out the structure of the graph database. OBJECT_ID_FROM_NODE_ID will take a $node_id as input and will return the object_id of the table that the node belongs to. OBJECT_ID_FROM_EDGE_ID is similar, returning the object_id of a table that and edge belongs to.

CURSOR TO ACQUIRE THE GRAPH DATA

Using the information from the sys.table view and the two system functions discussed above, I was able to create a cursor in TSQL to generate the above output. As you can see, it shows that there are two paths in this graph database:

Student -> (enrolledIn) -> Class 
Class -> (enrolledStudent) -> Student.

Because the edge table acts like a cross reference between the node tables, the cursor only needs to get information from the edge table for the initial compilation of information. All of the data is stored in a temporary table so I can return one data set rather than a separate data set for each node table once everything has been pulled. The completed code for the cursor is shown below:


SET nocount on; 
IF OBJECT_ID('tempdb.dbo.#NodeEdgeXRef', 'U') IS NOT NULL 
    DROP TABLE ##NodeEdgeXRef; 

CREATE TABLE ##NodeEdgeXRef (FromNode INT, ToNode INT, EdgeID Int); 

DECLARE @NodeID as INT;
DECLARE @NodeName as NVARCHAR(128);
DECLARE @EdgeID as INT;
DECLARE @SchemaName as NVARCHAR(128);
DECLARE @sSQL as nvarchar(MAX);
DECLARE @NodeCursor as CURSOR;  

SET @NodeCursor = CURSOR FOR
SELECT tb.name, tb.object_id, tb.object_id, sc.name
FROM sys.tables tb
JOIN sys.schemas sc ON tb.schema_id = sc.schema_id
WHERE is_edge = 1; 

OPEN @NodeCursor; 
FETCH NEXT FROM @NodeCursor 
    INTO @NodeName, @NodeID, @EdgeID, @SchemaName; 
 WHILE @@FETCH_STATUS = 0

BEGIN set @sSQL=N'select distinct object_id_from_node_id($from_id) 
    as FromNode, object_id_from_node_id($to_id) as ToNode,'
    + CAST(@EdgeID as NVARCHAR(MAX))+ ' as EdgeID from ' +      
    @SchemaName + '.' +@NodeName; 
INSERT INTO ##NodeEdgeXRef EXEC sp_executesql @sSQL; 
FETCH NEXT FROM @NodeCursor 
    INTO @NodeName, @NodeID, @EdgeID, @SchemaName; 
END CLOSE @NodeCursor; 
DEALLOCATE @NodeCursor; 

SELECT distinct CASE WHEN frs.name = 'dbo' THEN fr.name
    ELSE frs.name +'.'+ fr.name
END as FromName,
       FromNode,
CASE WHEN es.name = 'dbo' THEN en.name
    ELSE es.name +'.'+ en.name
END as EdgeName,
EdgeID,
CASE WHEN ts.name = 'dbo' THEN tn.name
    ELSE ts.name + '.' + tn.name
END as ToName,
ToNode
FROM ##NodeEdgeXRef xr
JOIN sys.tables fr ON xr.FromNode = fr.object_id
JOIN sys.tables tn ON xr.ToNode = tn.object_id
JOIN sys.tables en ON xr.EdgeID = en.object_id
JOIN sys.schemas frs ON fr.schema_id = frs.schema_id
JOIN sys.schemas ts ON tn.schema_id = ts.schema_id
JOIN sys.schemas es ON en.schema_id = es.schema_id; 

DROP TABLE ##NodeEdgeXRef;

Comments