Planning parallel downloads with TopologicalSorter#

For complicated reasons I found myself wanting to write Python code to resolve a graph of dependencies and produce a plan for efficiently executing them, in parallel where possible.

I figured out how to do it using graphlib.TopologicalSorter, which was added to the Python standard library in Python 3.9.

Since I want my code to work on older versions of Python, I was pleased to see that the graphlib.py module is standalone and can be easily vendored (though it uses the walrus := operator so I had to modify it for compatibility with Python 3.6 and 3.7).

For this example: imagine I know the dependencies of some packages, and I want to fire off some downloads - running them in parallel where possible.

1
from graphlib import TopologicalSorter
2

3
graph = {
4
    "datasette": {"httpx", "sqlite-utils", "click"},
5
    "sqlite-utils": {"click", "httpx"}
6
}
7

8
ts = TopologicalSorter(graph)
9
ts.prepare()
10
plan = []
11
while ts.is_active():
12
    nodes = ts.get_ready()
13
    plan.append(nodes)
14
    ts.done(*nodes)
15

16
print(plan)

This outputs:

1
[('click', 'httpx'), ('sqlite-utils',), ('datasette',)]

So the plan would be to download click and httpx in parallel, then sqlite-utils, then datasette.

In this usage of the module we’re passing the full graph as an argument to TopologicalSorter. It’s also possible to build it up a step at a time like this:

1
ts = TopologicalSorter()
2
# add(node, *predecessors)
3
ts.add("datasette", "httpx", "sqlite-utils", "click")
4
ts.add("sqlite-utils", "click", "httpx")
5
ts.prepare()
6
plan = []
7
while ts.is_active():
8
    nodes = ts.get_ready()
9
    plan.append(nodes)
10
    ts.done(*nodes)

Having called .prepare() no further .add() calls are allowed.

If you just want the nodes ordered such that none of them come before a dependency you can use .static_order():

1
>>> graph = {
2
...     "datasette": {"httpx", "sqlite-utils", "click"},
3
...     "sqlite-utils": {"click", "httpx"}
4
... }
5
>>> ts = TopologicalSorter(graph)
6
>>> list(ts.static_order())
7
['click', 'httpx', 'sqlite-utils', 'datasette']

The .prepare() method raises graphlib.CycleError if it detects a cycle:

1
>>> TopologicalSorter({"datasette": {"sqlite-utils"}, "sqlite-utils": {"datasette"}}).prepare()
2
Traceback (most recent call last):
3
  File "<stdin>", line 1, in <module>
4
  File "/Users/simon/.pyenv/versions/3.10.0/lib/python3.10/graphlib.py", line 104, in prepare
5
    raise CycleError(f"nodes are in a cycle", cycle)
6
graphlib.CycleError: ('nodes are in a cycle', ['datasette', 'sqlite-utils', 'datasette'])