Stress testing different find_paths() implementations on maximally connected graphs
Before trying the code in this section, make sure that you have created the "edges" table (see
cr-edges.sql) and installed all the code shown in the section Common code for traversing all kinds of graph. Make sure that you start by choosing the option that adds a tracing column and a trigger to the "raw_paths" table. (You will later re-create the "raw_paths" table without the tracing code before you do some timing tests.)
Definition of "maximally connected graph"
Look at this undirected cyclic graph:
Each of its six nodes has an edge to each of the other five nodes—in other words, it's maximally connected.
A "maximally connected graph" is a connected undirected graph where every node has an edge to every other node. Such a graph is inevitably cyclic.
("Connected" means that there exists a path from every node to every other node in the graph.)
The following code automates the creation of such a maximally connected undirected cyclic graph for the specified number of nodes. First create a helper procedure:
drop procedure if exists populate_maximum_connectvity_edges(int) cascade; create procedure populate_maximum_connectvity_edges(nr_nodes in int) language plpgsql as $body$ declare o constant text := '(''n'; p constant text := ''', ''n'; c constant text := '''),'; stmt text not null := 'insert into edges(node_1, node_2) values'; begin for j in 1..nr_nodes loop for k in j..(nr_nodes - 1) loop stmt := stmt|| o||ltrim(to_char(j, '009'))|| p||ltrim(to_char(k + 1, '009'))||c; end loop; end loop; stmt := rtrim(stmt, ','); execute stmt; end; $body$;
Now invoke it to populate the "edges" table and inspect the outcome:
delete from edges; alter table edges drop constraint if exists edges_chk cascade; alter table edges add constraint edges_chk check(node_1 <> node_2); call populate_maximum_connectvity_edges(nr_nodes=>6); -- Implement the denormalization. insert into edges(node_1, node_2) select node_2 as node_1, node_1 as node_2 from edges; select node_1, node_2 from edges order by node_1, node_2;
This is the result:
node_1 | node_2 --------+-------- n001 | n002 n001 | n003 n001 | n004 n001 | n005 n001 | n006 n002 | n001 n002 | n003 n002 | n004 n002 | n005 n002 | n006 n003 | n001 n003 | n002 n003 | n004 n003 | n005 n003 | n006 n004 | n001 n004 | n002 n004 | n003 n004 | n005 n004 | n006 n005 | n001 n005 | n002 n005 | n003 n005 | n004 n005 | n006 n006 | n001 n006 | n002 n006 | n003 n006 | n004 n006 | n005
The whitespace was added by hand to group the results into the sets of five edges that connect to each of the six nodes.
The number of paths in a maximally connected graph grows very much faster than linearly with the number of nodes
Recall how the logic of the recursive CTE from
cr-find-paths-with-nocycle-check.sql is expressed:
with recursive paths(path) as ( select array[start_node, node_2] from edges where node_1 = start_node union all select p.path||e.node_2 from edges e, paths p where e.node_1 = terminal(p.path) and not e.node_2 = any(p.path) -- <<<<< Prevent cycles. )
Using node "n001" as the start node for the six-node example maximally connected graph, the result of the non-recursive term will be five paths. Then, the result of the first repeat of the recursive term will be four longer paths for each of the input five paths because the cycle prevention logic will exclude, for each input path in turn, the node that can be reached from its terminal that has already been found. And so it goes on: the result of the next repeat of the recursive term will be three longer paths for each input path; the next result will be two longer paths for each input path; the next result will be one longer path for each input path; and the next repeat will find no paths so that the repetition will end. The number of paths that each repetition finds will therefore grow like this:
Repeat number Number of paths Found 0 5 -- non-recursive term 1 5*4 = 20 -- 1st recursive term 2 5*4*3 = 60 -- 2nd recursive term 3 5*4*3*2 = 120 -- 3rd recursive term 4 5*4*3*2*1 = 120 -- 4th recursive term --- Final total number of paths: 325
You can see this in action by re-creating the "find_paths()" implementation shown in
cr-find-paths-with-pruning.sql and invoking it like this:
call find_paths(start_node => 'n001', prune => false);
Now inspect the repetition history:
select repeat_nr, count(*) as n from raw_paths group by repeat_nr order by 1;
This is the result:
repeat_nr | n -----------+----- 0 | 5 1 | 20 2 | 60 3 | 120 4 | 120
This agrees with what the analysis presented above predicts.
The following function implements the rule described above to compute, for a maximally connected graph, the total number of paths from an arbitrary start node versus the number of nodes in the graph. It follows from the definition that the number of paths for a particular graph is the same for every possible start node.
drop function if exists total_paths_versus_number_of_nodes(int) cascade; create function total_paths_versus_number_of_nodes(nr_nodes in int) returns table(t text) language plpgsql as $body$ begin t := 'no. of nodes no. of paths'; return next; t := '------------ ------------'; return next; for j in 4..nr_nodes loop declare -- Non-recursive term result new_paths numeric not null := j - 1; total numeric not null := new_paths; begin -- recursive term repetitions. for k in 1..(j - 2) loop new_paths := new_paths * (j - k - 1)::bigint; total := total + new_paths; end loop; t := case total < 10000000 when true then lpad(to_char(j, '9999'), 12)||lpad(to_char(total, '9,999,999'), 15) else lpad(to_char(j, '9999'), 12)||lpad(to_char(total, '9.9EEEE'), 15) end; return next; end; end loop; end; $body$;
Invoke it like this:
\t on select t from total_paths_versus_number_of_nodes(20); \t off
This is the result:
no. of nodes no. of paths ------------ ------------ 4 15 5 64 6 325 7 1,956 8 13,699 9 109,600 10 986,409 11 9,864,100 12 1.1e+08 13 1.3e+09 14 1.7e+10 15 2.4e+11 16 3.6e+12 17 5.7e+13 18 9.7e+14 19 1.7e+16 20 3.3e+17
The growth rate is astonishing. You can see immediately that there must be a limit to the viability of using a "find_paths()" implementation that doesn't do early path pruning as the size of the maximally connected graph increases. The following thought experiment makes this clear.
Suppose that a clever, highly compressive, representation scheme were invented to store paths with an average space consumption per path of, say, just 10 bytes. And suppose that you invest in a 1 TB external SSD drive for your laptop to hold paths represented using this clever scheme. Your SSD's capacity is
10^12 bytes. This isn't enough to store the
2.4*10^11 paths that a maximally connected graph with just 15 nodes has.
Consider trying this test again and again with increasing values for
-- Sketch of stress-test kernel. delete from edges; call populate_maximum_connectvity_edges(nr_nodes=>:nr_nodes); insert into edges(node_1, node_2) select node_2 as node_1, node_1 as node_2 from edges; call find_paths(start_node => 'n001', prune => false);
Even if you did create a single-node YugabyteDB cluster so that it used your 1 TB SSD for its data files, you doubtless wouldn't have the capacity for the paths for even a 14-node maximally connected graph because the hypothetical dedicated highly compressive path representation scheme isn't in use. This can only mean that the "find_paths()" execution would crash for some run with fewer nodes than 14.
The stress test experiment
The thought experiment, above, was described to show that the "find_paths()" scheme that uses schema-level tables to implement the intermediate and final results for the recursive CTE algorithm, without pruning, will inevitably reach a limit and crash when the target graph has too many paths. Similar reasoning shows that the ordinary, explicit, use of the recursive CTE will inevitably crash, too, under similar circumstances. There's no point in trying to make more realistic estimates of the various sizes that will jointly conspire to bring eventual failure. Rather, an empirical test better serves the purpose.
Before doing the stress-test experiment, make sure that you have created the "edges" table (see
cr-edges.sql) and installed all the code shown in the section Common code for traversing all kinds of graph. But this time, make sure that you start by choosing the option that re-creates the "raw_paths" table without the tracing code before you do some timing tests.)
Note: The stress-test is implemented by a few scripts where all but one call other script(s). In order to run the whole test, you should create a directory on your computer and, when told to, save each script there with the name that's given. The downloadable code zip arranges all of the scripts it includes in a directory tree that implements a useful classification scheme. This means that, in the files from the zip, the arguments of the
\o meta-commands are spelled differently than they are here to include directory names.
Create some helpers
A scripting scheme is needed to call the basic test (see the code, above, that starts with the comment "Sketch of stress-test kernel") repeatedly for each value of interest of
:nr_nodes and after installing, in turn, each of the three methods. It will also be useful if the stress-test kernel times each invocation of "find_paths()".
The "find_paths()" invocation syntax depends on the present method. A procedure manages this nicely:
drop procedure if exists invoke_find_paths(text) cascade; create procedure invoke_find_paths(method in text) language plpgsql as $body$ begin case method when 'pure-with' then call find_paths(start_node => 'n001'); when 'prune-false' then call find_paths(start_node => 'n001', prune => false); when 'prune-true' then call find_paths(start_node => 'n001', prune => true); end case; end; $body$;
The stress-test kernel must spool the output to a file whose name encodes the present number of nodes and method name. A well-known pattern comes to the rescue: use a table function to write the script that turns on spooling and invoke that script. This is the function:
drop function if exists start_spooling_script(int, text) cascade; create function start_spooling_script(nr_nodes in int, method in text) returns text language plpgsql as $body$ begin return '\o '||ltrim(nr_nodes::text)||'-nodes--'||method||'.txt'; end; $body$;
Create a script to do the stress-test kernel
With the two building blocks, the procedure "invoke_find_paths()" and the function "start_spooling_script()", in place, the following script implements the stress-test kernel:
Save this script.
delete from edges; call populate_maximum_connectvity_edges(nr_nodes=>:nr_nodes); -- Implement the denormalization. insert into edges(node_1, node_2) select node_2 as node_1, node_1 as node_2 from edges; \t on \o start_spooling.sql select start_spooling_script(:nr_nodes, :method); \o \i start_spooling.sql select :method||' -- '||ltrim(:nr_nodes::text)||' nodes'; call start_stopwatch(); call invoke_find_paths(:method); select 'elapsed time: '||stopwatch_reading() as t; select 'count(*) from raw_paths: '||ltrim(to_char(count(*), '9,999,999')) from raw_paths; \o \t off
You can manually test a single use of it. First decide on the method. These are the choices:
prune => false
prune => true
Decide on the number of nodes, for example 7. And choose the method, for example "pure-with". (Remember to install it.) Then do this:
\set nr_nodes 7 \set method '\''prune-false'\'' \i do-stress-test-kernel.sql
This will produce the
7-nodes--prune-false.txt spool file. It will look like this:
prune-false -- 7 nodes elapsed time: 40 ms. count(*) from raw_paths: 1,956
Of course, the reported elapsed time that you see will doubtless differ from "40 ms".
Implement scripts to execute the stress-test kernel for each method for each of the numbers of nodes in the range of interest
First, implement a script to invoke each of the three methods for a particular, pre-set, value of
:nr_nodes. Get a clean slate for the "paths" tables before each timing test by dropping and re-creating each of them.
Save this script. You'll also need to save the scripts that it calls:
\set method '\''pure-with'\'' \i cr-raw-paths-no-tracing.sql \i cr-supporting-path-tables.sql \i cr-find-paths-with-nocycle-check.sql \i do-stress-test-kernel.sql \set method '\''prune-false'\'' \i cr-raw-paths-no-tracing.sql \i cr-supporting-path-tables.sql \i cr-find-paths-with-pruning.sql \i do-stress-test-kernel.sql \set method '\''prune-true'\'' \i cr-raw-paths-no-tracing.sql \i cr-supporting-path-tables.sql \i cr-find-paths-with-pruning.sql \i do-stress-test-kernel.sql
Finally, create a script to invoke
do-stress-test-for-all-methods.sql for each of the numbers of nodes in the range of interest.
Save this script.
\set nr_nodes 4 \i do-stress-test-for-all-methods.sql \set nr_nodes 5 \i do-stress-test-for-all-methods.sql \set nr_nodes 6 \i do-stress-test-for-all-methods.sql \set nr_nodes 7 \i do-stress-test-for-all-methods.sql \set nr_nodes 8 \i do-stress-test-for-all-methods.sql \set nr_nodes 9 \i do-stress-test-for-all-methods.sql \set nr_nodes 10 \i do-stress-test-for-all-methods.sql
do-stress-test-kernel.sql by hand for each of the three methods, setting the number of nodes to 11. You'll see that each of "pure-with" and "prune-false" fails, with error messages that reflect the fact that the implementation can't handle as many as 9,864,100 nodes. But "prune-true" completes without error very quickly.
do-stress-test-kernel.sql by hand using the "prune-true" method and 100 nodes. You'll see that it completes without error in about the same time as for any smaller number of nodes.
Here are the results:
The elapsed times are in seconds.
|no. of nodes||no. of paths||pure-with||prune-false||prune-true|
|11||9,864,100||crash ||crash ||0.2|
 Failed after 12 min. "Timed out: Read RPC (request call id 42549) to 127.0.0.1:9100 timed out after 1.117s"
 Failed after 48:52 min.with errors like this: "Timed out: Read RPC (request call id 9911242) to 127.0.0.1:9100 timed out after 0.983s."
The experiment shows that the "prune-false" approach is never preferable to the "pure-with" approach. Each fails beyond a 10-node graph; and the "prune-false" method is slower. This is very much to be expected. It was implemented only for pedagogic purposes and, more importantly, to provide the framework that allows the implementation of early pruning—the "prune-true" approach.
The experiment also shows that the "prune-true" approach is viable and quick even for graphs with unimaginably huge numbers of possible paths.
This leads to two conclusions:
It simply isn't feasible to use the native recursive CTE features of YugabyteDB, those of PostgreSQL, or indeed those of any SQL database, to discover all the paths in any typically highly connected graph that's likely to be interesting in practice—especially, for example, the IMDb data.
It is straightforward to use ordinary SQL and stored procedure features to discover the shortest paths in even very large and highly connected graphs. As noted, you cannot use the recursive CTE for this purpose because it doesn't let you implement early pruning. Rather, you must use a table-based approach that implements the recursive CTE's algorithm explicitly by hand.
Implication of the stress-test experiment for the implementation of the computation of Bacon Numbers
The introduction to the section Using a recursive CTE to compute Bacon Numbers for actors listed in the IMDb mentions this data set:
- imdb.small.txt: a... file with just a handful of performers (161), fully connected
It seems very likely that a fully connected undirected cyclic graph with 161 nodes, even if it isn't maximally connected will be highly connected—very many of the nodes will have edges to very many of the other nodes. It's therefore likely that the implementation of the computation of Bacon Numbers will have to use the "prune-true" approach.