Representing different kinds of graph in a SQL database

Representing a graph in a SQL database

Look at the diagram of an example undirected cyclic graph on this page's parent page. You can see that it has seven edges thus:

(n1-n2), (n2-n3), (n2-n4), (n3-n5), (n4-n5), (n4-n6), (n5-n6)

The minimal representation in a SQL database uses just a table of table of edges, created thus:

cr-edges.sql
drop table if exists edges cascade;

create table edges(
  node_1 text not null,
  node_2 text not null,
  constraint edges_pk primary key(node_1, node_2));

A typical special case, like computing Bacon Numbers, would specify "node_1" and "node_2" with names that are specific to the use case (like "actor" and "co_actor") whose values are the primary keys in a separate "actors" table. This implies foreign key constraints, of course. The "actors" table would have other columns (like "given_name", "family_name", and so on). Similarly, the "edges" table would have at least one other column: the array of movies that the two connected actors have been in.

How to represent the undirected nature of the edges

There are two obvious design choices. One design populates the "edges" table sparsely by recording each edge just once. For example, if node "a" is recorded in column "node_1", then it must not be recorded in column "node_2". Then you must understand that though the column names "node_1" and "node_2" imply a direction, this is not the case.

Though this design has the appeal of avoiding denormalization, it makes the SQL design tricky. When the traversal arrives at a node and needs to find the nodes at the other ends of the edges that have the present node at one of their ends, you don't know whether to restrict on "node_1" and read "node_2" or to restrict on "node_2" and read "node_1" and so you must try both.

The other design records each edge twice. For example, the edge between nodes "a" and "b" will be recorded both as (node_1 = 'a', node_2 = 'b') and as (node_1 = 'b', node_2 = 'a'). This denormalization makes the SQL straightforward—and is therefore preferred.

An implementation of the design that represents each edge just once is described for completeness. But the approach that leads to simpler SQL, the design that represents each edge twice—once in each direction is described first.

The representations for the more specialized kinds of graph—directed cyclic graph, directed acyclic graph, and rooted tree—all use the same table structure but with appropriately different rules to which the content must conform.