Planning parallel downloads with TopologicalSorter#
For complicated reasons I found myself wanting to write Python code to resolve a graph of dependencies and produce a plan for efficiently executing them, in parallel where possible.
I figured out how to do it using graphlib.TopologicalSorter, which was added to the Python standard library in Python 3.9.
Since I want my code to work on older versions of Python, I was pleased to see that the graphlib.py module is standalone and can be easily vendored (though it uses the walrus :=
operator so I had to modify it for compatibility with Python 3.6 and 3.7).
For this example: imagine I know the dependencies of some packages, and I want to fire off some downloads - running them in parallel where possible.
1from graphlib import TopologicalSorter2
3graph = {4 "datasette": {"httpx", "sqlite-utils", "click"},5 "sqlite-utils": {"click", "httpx"}6}7
8ts = TopologicalSorter(graph)9ts.prepare()10plan = []11while ts.is_active():12 nodes = ts.get_ready()13 plan.append(nodes)14 ts.done(*nodes)15
16print(plan)
This outputs:
1[('click', 'httpx'), ('sqlite-utils',), ('datasette',)]
So the plan would be to download click
and httpx
in parallel, then sqlite-utils
, then datasette
.
In this usage of the module we’re passing the full graph
as an argument to TopologicalSorter
. It’s also possible to build it up a step at a time like this:
1ts = TopologicalSorter()2# add(node, *predecessors)3ts.add("datasette", "httpx", "sqlite-utils", "click")4ts.add("sqlite-utils", "click", "httpx")5ts.prepare()6plan = []7while ts.is_active():8 nodes = ts.get_ready()9 plan.append(nodes)10 ts.done(*nodes)
Having called .prepare()
no further .add()
calls are allowed.
If you just want the nodes ordered such that none of them come before a dependency you can use .static_order()
:
1>>> graph = {2... "datasette": {"httpx", "sqlite-utils", "click"},3... "sqlite-utils": {"click", "httpx"}4... }5>>> ts = TopologicalSorter(graph)6>>> list(ts.static_order())7['click', 'httpx', 'sqlite-utils', 'datasette']
The .prepare()
method raises graphlib.CycleError
if it detects a cycle:
1>>> TopologicalSorter({"datasette": {"sqlite-utils"}, "sqlite-utils": {"datasette"}}).prepare()2Traceback (most recent call last):3 File "<stdin>", line 1, in <module>4 File "/Users/simon/.pyenv/versions/3.10.0/lib/python3.10/graphlib.py", line 104, in prepare5 raise CycleError(f"nodes are in a cycle", cycle)6graphlib.CycleError: ('nodes are in a cycle', ['datasette', 'sqlite-utils', 'datasette'])