EE382C Embedded Software Systems - Looped Scheduling

Complexity of SDF Scheduling

As was previously stated, no known algorithm schedules all Synchronous Dataflow (SDF) graphs in polynomial time in the size of the graph. The situation, however, is not as bad as it seems. According to Prof. Shuvra S. Bhattacharyya (University of Maryland):
We know that a schedule can be constructed in polynomial time if no tightly interdependent components exist. Equivalently, we know that a schedule can be constructed in polynomial time if a single appearance schedule exists. For graphs that do not have single appearance schedules, no known algorithm is guaranteed to construct a schedule in polynomial time, but we don't know if the problem is NP-complete.
The good news is that tightly interdependent components have not been observed in practice. Two heuristic algorithms that exploit these facts are Recursive Partitioning by Minimum Cuts (RPMC) and Pairwise Grouping of Adjacent Nodes (PGAN). Next, we set the stage for explaining "tightly interdependent components" as well as the RPMC and PGAN algorithms.

Looped Scheduling

A looped schedule is a sequence of the form V1 V2 ... Vm, where each Vi is either an actor or a schedule loop. A schedule loop has the form (n T1 T2 ... Tm), which means execute the sequence T1 T2 ... Tm n times. In a schedule loop, each Ti is either an actor or another schedule loop, and n is the iteration count. Iteration counts of 1 may be removed from a schedule without changing its behavior.
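As a minimal sketch (the data encoding here is ours, not from the notes), a looped schedule can be represented as a nested structure and expanded into the flat firing sequence it denotes: an actor is a name string, and a schedule loop (n T1 ... Tm) is the tuple (n, [T1, ..., Tm]).

```python
def expand(schedule):
    """Expand a looped schedule into the flat list of actor firings."""
    flat = []
    for term in schedule:
        if isinstance(term, str):          # an actor
            flat.append(term)
        else:                              # a schedule loop (n, body)
            n, body = term
            flat.extend(expand(body) * n)
    return flat

# The schedule A(2B(3AB)C)A(2B) used as an example below:
s = ["A", (2, ["B", (3, ["A", "B"]), "C"]), "A", (2, ["B"])]
print(expand(s))   # 20 firings: 8 of A, 10 of B, 2 of C
```

Expanding a schedule this way also makes it easy to count how many times each actor appears in the flat sequence.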

A null schedule is an empty schedule, i.e., one with m = 0. A subschedule is any contiguous subsequence of a schedule, e.g., (3AB)C is a subschedule of (2B(3AB)C)A, which in turn is a subschedule of A(2B(3AB)C)A(2B). We define the operator appearances(A, S) as the number of times that actor A appears textually in schedule S.

So, a schedule S is a single appearance schedule if appearances(A, S) = 1 for every actor A in S.

For single appearance schedules, we define the operator position(A, S) as the position of the single appearance of actor A in S.

In a valid single appearance schedule S for an acyclic SDF graph without delays, whenever actor X is an ancestor of actor Y, position(X, S) < position(Y, S).

As we have already seen in previous lectures, the choice of schedule has a dramatic impact on the amount of buffer memory required on the arcs of an SDF graph. We will assume that each token on each arc takes one unit of memory. Several models for buffering exist. One model uses shared buffers. Chain-structured graphs can share one buffer. In the chain-structured graph in Figure 1, we can use the schedule (9A)(12B)(12C)(8D). This style of schedule, in which all of the firings of an actor complete before another actor is considered, simplifies the management of a shared global buffer because there is only one writer and one reader at a time. With shared buffers, the schedule requires max(9*4, 12*1, 12*2) = 36 tokens. The total buffer size would be 9*4 + 12*1 + 12*2 = 72 tokens if the buffers were independent. When the SDF graph is not chain-structured, it becomes more difficult to allocate and manage shared buffers.

  ----          ----          ----         ---- 
 | A  | -----> | B  | -----> | C  | ----> | D  |
 |    |        |    |        |    |       |    |
  ----          ----          ----         ---- 
       4      3      1      1     2      3      
Figure 1: A Chain-Structured SDF Graph.
The repetitions vector is [9 12 12 8]' for A, B, C, and D.
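The buffer bounds above can be checked by simulating the flat schedule on the graph of Figure 1 and recording the peak token count on each arc. This is a sketch under our own encoding (arc tuples and function names are assumptions, not from the notes):

```python
# Figure 1 arcs: (source, sink, tokens produced per source firing,
#                 tokens consumed per sink firing)
arcs = [("A", "B", 4, 3), ("B", "C", 1, 1), ("C", "D", 2, 3)]

def peak_buffers(schedule, arcs):
    """Simulate a flat schedule; return the peak token count per arc."""
    tokens = {(s, t): 0 for s, t, _, _ in arcs}
    peaks = dict(tokens)
    for actor in schedule:
        for s, t, prod, cons in arcs:      # consume inputs first
            if t == actor:
                assert tokens[(s, t)] >= cons, "schedule not valid"
                tokens[(s, t)] -= cons
        for s, t, prod, cons in arcs:      # then produce outputs
            if s == actor:
                tokens[(s, t)] += prod
                peaks[(s, t)] = max(peaks[(s, t)], tokens[(s, t)])
    return peaks

# The flat schedule (9A)(12B)(12C)(8D):
flat = ["A"] * 9 + ["B"] * 12 + ["C"] * 12 + ["D"] * 8
p = peak_buffers(flat, arcs)
print(p)                  # per-arc peaks: 36, 12, 24
print(sum(p.values()))    # independent buffers: 72 tokens
print(max(p.values()))    # one shared buffer:   36 tokens
```

Because only one arc is live at a time under this flat schedule, the shared-buffer cost is the maximum per-arc peak rather than the sum.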

In chain-structured graphs, the total buffer size under the shared memory model is always less than or equal to the total buffer size under the independent memory model. For example, for the schedule A(50B(2C))(4D) of the graph in Figure 2, we would require 200 tokens using shared memory and 250 tokens using independent memory. Under the shared memory model, the program memory required for buffer management increases in exchange for reduced data memory, because memory locations are reused.

Buffer memory management may be statically scheduled. If we use the schedule (A)(50B)(100C)(4D) to simplify shared memory management, then the shared memory model would require 5000 tokens, a 20-fold increase in the data size. Note that the independent memory model would require 5150 tokens.

Memory management for the shared memory model becomes significantly more complex when delays are on edges or when the graph has feedback or multiple acyclic paths, but the complexity does not change for the independent memory model. The SDF scheduling algorithms discussed hereafter will use an independent buffer memory model. In the independent buffer memory model, each buffer is allocated in contiguous memory and implemented as a circular buffer.
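Each independent buffer can be implemented as a fixed-size array with read and write indices that wrap around. The following is a minimal circular-buffer sketch of our own, not the notes' implementation:

```python
class CircularBuffer:
    """Fixed-capacity FIFO over contiguous memory with wrapping indices."""

    def __init__(self, size):
        self.data = [None] * size
        self.size = size
        self.read = 0      # index of the oldest token
        self.write = 0     # index of the next free slot
        self.count = 0     # tokens currently stored

    def push(self, token):
        assert self.count < self.size, "buffer overflow"
        self.data[self.write] = token
        self.write = (self.write + 1) % self.size
        self.count += 1

    def pop(self):
        assert self.count > 0, "buffer underflow"
        token = self.data[self.read]
        self.read = (self.read + 1) % self.size
        self.count -= 1
        return token

buf = CircularBuffer(3)
for t in (1, 2, 3):
    buf.push(t)
assert buf.pop() == 1
buf.push(4)        # the write index wraps into the slot freed by the pop
assert [buf.pop() for _ in range(3)] == [2, 3, 4]
```

With a static schedule, the required capacity of each such buffer is exactly the per-arc peak token count computed at compile time.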

  ----          ----          ----         ---- 
 | A  | -----> | B  | -----> | C  | ----> | D  |
 |    |        |    |        |    |       |    |
  ----          ----          ----         ---- 
       50     1      100   50      1    25     
Figure 2: Another Chain-Structured SDF Graph.
The repetitions vector is [1 50 100 4]' for A, B, C, and D.

When we attempt to find an SDF schedule with minimum buffer cost, the problem is provably hard. One reason is that an SDF schedule for a graph G = (V, E) can contain a number of actor appearances equal to the sum of the elements of the repetitions vector, which can be exponential in the size of the graph. In fact, the problem of computing the minimum buffer size over all valid schedules of an SDF graph is NP-complete, because the problem of minimizing buffer sizes for Homogeneous SDF graphs (abbreviated HSDF-MIN-BUFFER) is NP-complete.
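The repetitions vector itself, and hence the minimum number of firings per schedule iteration, follows from balancing each arc: q[src] * produced = q[snk] * consumed. A hedged sketch for chain-structured graphs (function and variable names are ours), applied to Figure 1:

```python
from fractions import Fraction
from math import gcd, lcm

def repetitions(actors, arcs):
    """Solve the balance equations of a chain-structured SDF graph.
    arcs: (source, sink, produced per firing, consumed per firing)."""
    q = {actors[0]: Fraction(1)}
    for src, snk, prod, cons in arcs:          # propagate along the chain
        q[snk] = q[src] * prod / cons
    scale = lcm(*(f.denominator for f in q.values()))
    ints = {a: int(f * scale) for a, f in q.items()}
    g = gcd(*ints.values())                    # reduce to the smallest solution
    return {a: n // g for a, n in ints.items()}

# Figure 1 rates: A -4/3-> B -1/1-> C -2/3-> D
q = repetitions("ABCD", [("A", "B", 4, 3), ("B", "C", 1, 1), ("C", "D", 2, 3)])
print(q)                 # {'A': 9, 'B': 12, 'C': 12, 'D': 8}
print(sum(q.values()))   # 41 actor appearances in a fully unrolled schedule
```

The sum of the entries, 41 here, is exactly how long a flat (loop-free) schedule for one iteration must be, which is why schedule length alone can blow up.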

Clustering SDF Graphs

SDF graphs do not generally compose. As a consequence, we cannot schedule a hierarchical combination of SDF graphs independently. Composition is a desirable property because one could schedule each subsystem autonomously, which would dramatically reduce the complexity. For M clusters containing N1, N2, ..., NM actors, the scheduling complexity reduces from
O((N1 + N2 + ... + NM)^3)
to
O(N1^3) + O(N2^3) + ... + O(NM^3)
Unfortunately, in order to load balance a hierarchy of SDF graphs, we must first flatten the hierarchy. Once we have load balanced the graph, we have the option of clustering the flattened graph into a hierarchy that composes. A proper clustering of SDF graphs must maintain the load balance and must not introduce deadlock into the graph.

Clustering nodes B and C in the homogeneous SDF graph in Figure 3 into a supernode introduces deadlock because the resulting feedback loop contains no delay. The SDF Composition Theorem gives sufficient conditions for clustering SDF graphs without introducing deadlock.

       ----            ----            ----        
 ---> | A  | -------> | B  | -- D --> | C  | ----- 
|     |    |          |    |          |    |      |
|      ----            ----            ----       |
|    1      1        1      1         1     1     |
|                                                 |
 ------------------------------------------------- 
Figure 3: An example of an SDF graph that deadlocks when clustering BC into a supernode. In the graph, D on the B-C arc represents a delay of one token.
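The deadlock can be demonstrated by greedily firing whichever actor has enough tokens on every input arc. This is a sketch under our own encoding (arc tuples carry the initial delay count); it fires each actor at most once, which suffices for a homogeneous graph:

```python
def can_run(actors, arcs):
    """arcs: (source, sink, initial delay tokens). Greedily fire each
    actor once when all its inputs hold a token; return the firing order."""
    tokens = {(s, t): d for s, t, d in arcs}
    fired = []
    progress = True
    while progress and len(fired) < len(actors):
        progress = False
        for a in actors:
            if a in fired:
                continue
            if all(tokens[(s, t)] >= 1 for s, t in tokens if t == a):
                for s, t in tokens:
                    if t == a:
                        tokens[(s, t)] -= 1
                    if s == a:
                        tokens[(s, t)] += 1
                fired.append(a)
                progress = True
    return fired

# Original Figure 3 graph: one delay (D) on the B -> C arc.
print(can_run("ABC", [("A", "B", 0), ("B", "C", 1), ("C", "A", 0)]))
# -> ['C', 'A', 'B']: the delay lets C fire first.

# After clustering B and C, the delay is internal to the supernode.
print(can_run(["A", "BC"], [("A", "BC", 0), ("BC", "A", 0)]))
# -> []: neither actor can ever fire, i.e., deadlock.
```

The delay hidden inside the cluster can no longer break the feedback cycle at the top level, which is exactly why the clustered graph deadlocks.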

Clustering of SDF graphs is also key to generating efficient schedules for a single SDF graph that may be disconnected.

Blocking factors apply to connected SDF graphs but do not extend directly to disconnected graphs. We can generalize blocking factors to blocking vectors, in which there is one entry for each connected SDF graph. We can also apply this concept to subgraphs that are clustered hierarchically.

Example

Figure 4a shows a valid SDF graph with a repetitions vector of [2 2 4 4 1]' for nodes A, B, C, D, and E, respectively. The graph is clustered into another valid SDF graph in Figure 4b. The clustered graph has two levels of hierarchy: the top-level graph ({F, E}) and F = subgraph({A,B,C,D}). The repetitions vector for the subgraph is formed from the entries for A, B, C, and D in the original repetitions vector divided by their greatest common divisor (2), giving [1 1 2 2]'; the gcd itself becomes the repetition count of F in the clustered graph. Other clusterings constructed in the same way are also possible. However, if we had grouped C and D into one cluster and A and B into another and generated a repetitions vector for each cluster in isolation, then the combined repetitions vector would be [1 1 1 1]', and the clustering cannot be load balanced. So, the graph must first be load balanced, and any subsequent clustering must preserve the load balancing.
  ----    e1    ----                                              
 | C  | -----> | D  |                                             
 |    |        |    |       4                  2      4           
  ----          ----\  e2     ----        ----    e5    ----      
       1      1    1 ------> |    |      |    | -----> |    |     
                             | E  |      | F  |        | E  |     
                   1 ------> |    |      |    | -----> |    |     
  ----    e3    ----/  e4     ----        ----    e6    ----      
 | A  | -----> | B  |       2                  1      2           
 |    |        |    |                                             
  ----          ----                                              
       1      1                                                   
                                                                  
  (a) The SDF graph                    (b) F = subgraph({A,B,C,D})
      q = [2 2 4 4 1]'                  q = [2 1] for F, E        
      for A, B, C, D                    q = [1 1 2 2] for A,B,C,D 
Figure 4: An example of clustering a subgraph in an SDF graph.
[Figure 4.5 in Software Synthesis from Dataflow Graphs]

Mathematical Description

The concept of clustering will arise frequently in discussing the scheduling of SDF graphs. Given a valid SDF graph G = (V, E), we cluster a subset Z of the vertices into a new actor F (where F is not in V) to obtain the graph ((V - Z) + {F}, E*), where E* is E with every edge connecting an actor in Z to an actor outside of Z redirected to connect to F instead, and every edge internal to Z moved into the subgraph represented by F. This operation is called clustering Z into F and is denoted clust(Z, G, F).

Using Figure 4 as an example, we cluster the graph in Figure 4a, with V = { A, B, C, D, E } and E = { e1, e2, e3, e4 }, to create a new graph (V', E'). The repetitions vector is q = [2 2 4 4 1]'. We cluster the subgraph {A, B, C, D} using a blocking factor of 1 to create a new node F. We form V' by starting with V, removing all of the vertices in the subgraph, and adding the new node, so V' = { E, F }. We form E' by starting with E, removing all edges incident to vertices in the subgraph, and adding the new edges, so E' = { e5, e6 }. The repetitions vector for the subgraph ( {A, B, C, D}, {e1, e3} ) is formed from the elements of the original repetitions vector for { A, B, C, D }, made coprime: [2 2 4 4]' divided by its greatest common divisor (2) gives [1 1 2 2]'. On one firing of F, two tokens are output on e5 and one token is output on e6. We update the original repetitions vector by replacing the entries for A, B, C, and D with a single entry for F equal to their greatest common divisor, so q' = [2 1]' for F and E. Hence, we arrive at the system in Figure 4b.
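The repetitions-vector bookkeeping for clust(Z, G, F) can be sketched directly from the Figure 4 example (the function and the name "F" are our illustration):

```python
from math import gcd

def cluster_reps(q, Z):
    """q: repetitions vector as a dict; Z: actors clustered into F.
    Returns (q for the clustered graph, q for the actors inside F)."""
    g = gcd(*(q[a] for a in Z))                      # firings of F per iteration
    q_clustered = {a: n for a, n in q.items() if a not in Z}
    q_clustered["F"] = g
    q_internal = {a: q[a] // g for a in Z}           # firings per firing of F
    return q_clustered, q_internal

q = {"A": 2, "B": 2, "C": 4, "D": 4, "E": 1}
print(cluster_reps(q, ["A", "B", "C", "D"]))
# ({'E': 1, 'F': 2}, {'A': 1, 'B': 1, 'C': 2, 'D': 2})
```

Dividing by the gcd keeps the two levels consistent: each of F's 2 firings runs the internal vector [1 1 2 2]' once, reproducing the original [2 2 4 4]'.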


Updated 04/19/04.