Feedback on G5BADS Coursework: Updateable Directed Graph

Solutions submitted used many different class designs. The main variation was in whether some kind of an inner class was used to keep information about nodes (their neighbours, colours or in-degrees, etc.) or not, and whether the loops() method used depth first search or topological sort.

Most people used depth-first search. However, the two fastest programs were using topological sort ( Chris Allsopp's program got the prize, and Daniel Pike's program was just a few percentage points slower in my tests). Some of the depth-first search programs also looked very good but unfortunately had minor bugs in them.

The most common problem with DFS programs was not exploring all of the graph: they would do DFS from some node and stop, potentially missing a cycle. For example, if a graph looked as follows:

1 --> 2
2 --> 1
3 -->
and the enumeration of nodes happened to start with 3, then doing DFS from 3 would not discover the cycle.

Below I discuss some of the solutions and their complexity.

Solution 1: simple DFS

This solution follows suggestions made in the lecture. Here is the program . The graph is represented as a hash table, keys are Objects (node labels) and values are LinkedLists of adjacent nodes (more precisely, labels of adjacent nodes).

To colour the nodes white, grey or black in the loops() method, we put them in corresponding hash tables. If a node is inserted in a white hash table, its colour is white, and so on. We have to make sure the node is always in only one of the three hash tables.

The outline of the method is as follows:

colour all nodes white

for each node in the graph
  if the node is white
     do DFS with cycle detection from this node
return false
The DFS cycle detection from node s goes like this:
create a stack
colour s grey
put it on the stack

while the stack is non-empty
   look at node n on top of the stack
   get n's adjacency list
   iterate through the list until find a white node m or reached the end
     if found a grey node, return true
     if found a white node m
        colour m grey
        push m on the stack
        stop iterating
   if reached the end without finding a white node
        colour n black
        pop the stack

The worst case time complexity of this algorithm is O(V + E2), as shown below.

The worst case is when there is no cycle, so we explore the whole graph. Each node gets to go on the stack once (it only goes on the stack if it is white, and doing DFS from it re-colours it black eventually). For each node on the stack, we backtrack to it as many times as there are outgoing edges from it. This is the number of times the node is found on top of the stack. When a node is on top of the stack, we iterate through its adjacency list to find an unvisited neighbour. This implementation does it in an inefficient way, always starting from the beginning of the adjacency list and going forward until such a neighbour is found. For example, if we have a graph like this:

1 --> 2,3,4,5
then the first time when node 1 is on top of the stack we make one step in the iteration until we find 2, the second time we make two steps to find node 3, next time we do three steps etc.:
traversals of 1's adjacency list:
2         first traversal
2,3       second traversal
2,3,4     third traversal
2,3,4,5   fourth traversal
So the total number of steps in traversal of a node's adjacency list (over the whole DFS) is quadratic in the length of the list (to be precise, (e1+1)e1/2 where e1 is the size of 1's adjacency list, let's just say e12).

Since we do those traversals for every node reachable from s, the complexity of DFS from s depends on how many nodes are reachable from s and the sizes of their adjacency lists. Assuming there are n nodes reachable from s, we push, colour and pop n nodes on the stack and do e12+ e22+ ... + en2 checks, where ei is the size of the adjacency list of the ith node. Complexity of this DFS from s is O(vs + es2) where vs is the size of the subgraph reachable from s and es is the number of edges in it (e12+ e22+ ... + en2 is less than es2 = (e1+...+en)2).

Now for the whole algorithm: since there is an outer loop over all nodes, it looks as if we need to multiply time complexity for each DFS search by the number of nodes. In fact, we do not do DFS from each node, only from the white ones, that is, those not involved in the previous DFS search. Essentially the graph is split into subgraphs (nodes reachable from the first node in the enumeration and all the rest, for example) and each subgraph is searched once, so we just add up the times it takes to search each subgraph. Supposing there are k subgraphs, and each is searched in time (vi + ei2), the whole graph can be searched in time v1 + v2 + ... + vk + e12 + ... + ek2 which is O((V + E2) (V = v1 + v2 + ... + vk and E = e1 + ... + ek).

Optimised DFS

This solution is similar, but we only iterate through each adjacency list once because we remember the place in it which was reached last time. I could have stored this "bookmark" for each node in yet another hash table, but instead I put all the data for a node (colour, adjacency list, bookmark) in a separate class. The program is here . Complexity of this implementation is O(V + E) because each adjacency list is traversed once. On a graph with a large out degree of size 1000 the difference in running time is 2 ms for optimised version vs 115 ms for a non-optimised one.

Topological sort

The program using topological sort (by Chris Allsopp) is here . It also uses an inner class to store information about nodes, only for topological sort we need to store indegree of a node (the number of incoming edges and its copy so that we can restore the graph to its previous state after loops() is finished). The outline of the algorithm is as follows:
create a queue
enqueue all nodes with in-degree 0 
set deleted counter to be 0
while queue is non-empty
   dequeue a node n
   increment deleted counter
   for each node in n's adjacency list
      decrement its in-degree; if it becomes 0, enqueue the node


for each node, make in-degree equal to saved copy (restore the graph to original state)
return (deleted counter < graph size)
The first two operations are linear in V; each node is enqueued once; each adjacency list traversed once; complexity O(V+E). Running time is better than my model solution for DFS.