Finding the Strongly Connected Components of a Digraph

Applications

When looking at code/library dependencies, one may want to see which code pieces form strongly connected components, and package them separately.
When looking at food and energy cycles (what eats what, etc), strongly connected components give some information about sustainability in the ecology.

Kosaraju’s Algorithm

Let \(G^{R}\) be the same graph as \(G\), but with all the edges reversed. Note that the strongly connected components are the same in \(G\) and \(G^{R}\).

The algorithm involves two stages of DFS:

Run DFS on \(G^{R}\)
Run DFS on \(G\), but visit unmarked nodes in the reverse post order of the first DFS on \(G^{R}\). Give each strongly connected component the same ID (or number) - similar to what we did in finding the connected components. (In detail, if you have \(N\) nodes, have an array of size \(N\). When executing the second DFS, for all nodes in the same strongly connected component, assign the corresponding indices the same value. If two indices have the same value, they are in the same component).

From a big picture view, note that if we create a new graph where the nodes are the strongly connected components of \(G\), then this graph is a DAG.

This algorithm is linear: \(O(N+M)\) where \(M\) is the number of edges.

Analysis

When the first DFS is run, at the end, the first node that pops out of the stack is a node that cannot leave its connected component. We will then find all the nodes in it through the 2nd DFS. The next unvisited node that pops out of the stack will have a similar property - the only way it can leave its strongly connected component is by going into one it’s already been to.

From a bigger picture, we kind of did a topological sort on the strongly connected components, and we are now going through each strongly connected component in reverse topological order.

Formally: If we have two strongly connected components \(C_{i}\) and \(C_{j}\) and we have a node going from \(C_{i}\) to \(C_{j}\), then the first node that gets popped out of the stack from either of these will be from \(C_{j}\). To prove this, simply consider the case of the first pass DFS: What happens if the first encountered node is from \(C_{i}\)? Or from \(C_{j}\)?

A corollary of this is that the first node that is popped will be in a strongly connected component that has no edges leading out of it. Let’s call it a sink scc (also, the highest in the topological sort). Once that has been traversed, the next unvisited node to pop out will be from a strongly connected component that is just beneath the sink scc in the topological ordering (there could be multiple ones, but it doesn’t matter which is popped out).

Notes

There are probably more efficient algorithms out there. Some can find them in one traversal.

Implementations

Python

Use the strongly_connected_components function in NetworkX