Signal Processing System Design
The goal in signal processing system design is to model signal processing
systems using formal models that are decoupled from implementation details.
In formal modeling, a fundamental tradeoff exists between the
expressiveness of the model and the ability to analyze properties of the
model.
The more expressive the model, the more difficult it is to analyze it.
An appropriate compromise is to use several different models of computation
to model different parts of a signal processing system.
Signal Processing Algorithms
The most fundamental numeric computation in signal processing algorithms
is a vector inner (dot) product.
It is the basis for Fast Fourier Transforms, Finite Impulse Response
filters, and Infinite Impulse Response filters, which are commonly
called kernels.
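As an illustration, the sketch below writes a FIR filter as the inner product
of a coefficient vector with a window of input samples; the 4-tap moving
average, its coefficients, and the sample values are made up for this example,
but the multiply-accumulate loop is the operation DSP hardware is built around.

#include <stdio.h>

#define NTAPS 4

/* One output sample of a FIR filter: the inner product of the coefficient
   vector with the most recent NTAPS input samples. */
static double fir(const double coeff[NTAPS], const double window[NTAPS])
{
    double acc = 0.0;
    for (int k = 0; k < NTAPS; k++)
        acc += coeff[k] * window[k];    /* multiply-accumulate */
    return acc;
}

int main(void)
{
    double coeff[NTAPS]  = {0.25, 0.25, 0.25, 0.25};   /* 4-tap moving average */
    double window[NTAPS] = {1.0, 2.0, 3.0, 4.0};       /* most recent samples  */
    printf("output sample = %g\n", fir(coeff, window)); /* prints 2.5 */
    return 0;
}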
In an algorithm, these kernels communicate data to each other.
The flow of data is generally very regular and has a fixed static pattern.
The order of computation in signal processing algorithms can be
specified loosely in terms of data dependencies between kernels.
Many different orders of execution of the kernels are possible.
Therefore, it is beneficial not to use a model of computation which
forces the designer to pick one fixed order in which to perform the
operations, as an imperative programming language such as C does.
Rather, one should use a model of computation that can capture the
flexibility in a signal processing algorithm by only specifying the
dataflow in the algorithm.
These dataflow models of computation come in many varieties.
One set of dataflow models can always be scheduled statically, e.g.
Synchronous Dataflow, Cyclo-static Dataflow, and Static Dataflow.
Using the static schedule, code generators can synthesize
software, hardware, or software and hardware from the same specification.
Programmable Embedded Processors
A common implementation architecture for signal processing algorithms
is the embedded programmable processor.
In embedded programmable processors, memory is severely limited,
and the schedulers should search for schedules that will require
a minimal amount of memory for code and data.
For example, the Motorola 56000 Digital Signal Processor contains
three separate memory banks: one for code, and two for data.
On any instruction, three memory reads or two memory reads and one
memory write can be performed.
Each bank of memory is typically the same size, commonly
between 4 kB and 64 kB.
The goal for automatically generating code for an embedded DSP
processor is to jointly optimize the amount of program and data
memory required.
A common practice in industry is to develop kernels and applications
in a high-level language and cross-compile the application to an
embedded processor.
Compilers excel at optimizing local computation and data dependencies,
and perform fairly well on the small blocks of code that implement
kernels.
Compared to manual coding of kernels in assembly language, the overhead
required by the best compilers is 0-20% on data size and 50-60% on program
size.
Compilers are not well-suited to optimizing the global structure of
programs.
Compilers also have the following additional problems when generating
code for embedded programmable processors:
- require stack sizes that are too large for the available memory
- no division operation in hardware
- difficulty in expressing fixed-point operations in the high-level
language (see the sketch after this list)
- special data input and output architecture
- custom DSP operations
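To make the fixed-point problem concrete, the sketch below shows one common
workaround: a Q15 fractional multiply written out by hand in C. The function
name q15_mul and the values are invented for illustration. A DSP performs this
in a single fractional-multiply instruction, but in C it must be spelled out
as a widening multiply followed by a shift, and the compiler may or may not
map the idiom back onto the native instruction.

#include <stdint.h>
#include <stdio.h>

/* Q15 fixed-point multiply: 16-bit operands interpreted as fractions in
   [-1, 1).  The widening multiply and shift spell out what a DSP does in
   one instruction. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;   /* 32-bit product          */
    return (int16_t)(wide >> 15);             /* renormalize back to Q15 */
}

int main(void)
{
    int16_t half    = 0x4000;                 /* 0.5 in Q15              */
    int16_t quarter = q15_mul(half, half);    /* 0.25, i.e. 0x2000       */
    printf("0.5 * 0.5 = %#x (Q15)\n", quarter);
    return 0;
}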
The key to the generation of efficient software is to model the global
structure of an application using a static dataflow model in which
kernels are connected together.
Scheduling algorithms would then determine an efficient ordering of
the kernels.
To generate the software for the kernels themselves, we could then use a compiler.
This approach leverages the best of both types of tools.
Static Dataflow Models
Many varieties of dataflow models exist.
The ones that can be statically scheduled are preferred, of course, because
the resource requirements to implement the algorithms can be determined
in advance at compile time, thereby avoiding the overhead and uncertainty
associated with run-time scheduling.
We will discuss one particular static dataflow model known as
Synchronous Dataflow.
Synchronous Dataflow (SDF) is a model first proposed by Edward A. Lee in 1986.
In SDF, all computation and data communication is scheduled statically.
That is, algorithms expressed as SDF graphs can always be converted into
an implementation that is guaranteed to take finite-time to complete all
tasks and use finite memory.
Thus, an SDF graph can be executed over and over again in a periodic
fashion without requiring additional resources as it runs.
This type of operation is well-suited to digital signal processing and
communications systems which often process an endless supply of data.
An SDF graph consists of nodes and arcs.
Nodes represent operations which are called actors.
Arcs represent data values called tokens, which are stored in
first-in first-out (FIFO) queues.
The word token is used because each data value can represent any
data type (e.g. integer or real) or any data structure (e.g. matrix
or image).
SDF graphs obey the following rules:
- An actor is enabled for execution when enough tokens are available
at all of the inputs.
- When an actor executes, it always produces and consumes the same
fixed number of tokens.
- The flow of data through the graph may not depend on values of the data.
Because of the second rule, the data that an actor consumes is removed
from the buffers on the input arcs and not restored.
The consequence of the last rule is that an SDF graph may not contain
data-dependent branching (such as an if-then-else construct) or
data-dependent iteration (such as a for loop).
However, the actors themselves may contain these constructs because the scheduling
of an SDF graph is independent of what tasks the actors perform.
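The following sketch restates the firing rules in C; the data structures
(Fifo, Actor) and functions (can_fire, fire) are invented for illustration
and are not part of any particular SDF tool. The rates and token counts are
arbitrary.

#include <stdio.h>

typedef struct {
    int tokens;              /* number of tokens currently queued on the arc  */
} Fifo;

typedef struct {
    Fifo *in, *out;          /* one input arc and one output arc, for brevity */
    int consume, produce;    /* fixed rates: rule 2 says these never change   */
    void (*work)(void);      /* the actor's internal computation              */
} Actor;

/* Rule 1: an actor is enabled only when enough tokens are on its inputs. */
static int can_fire(const Actor *a)
{
    return a->in == NULL || a->in->tokens >= a->consume;
}

/* Rule 2: each firing consumes and produces fixed token counts; consumed
   tokens are removed from the input FIFO and not restored. */
static void fire(Actor *a)
{
    if (a->in)  a->in->tokens  -= a->consume;
    a->work();
    if (a->out) a->out->tokens += a->produce;
}

static void b_work(void) { /* e.g., an FIR kernel would run here */ }

int main(void)
{
    Fifo ab = {6}, bc = {0};               /* 6 tokens already on the input arc */
    Actor b = {&ab, &bc, 2, 3, b_work};    /* consumes 2, produces 3 per firing */
    while (can_fire(&b)) fire(&b);         /* fires three times */
    printf("tokens on the output arc: %d\n", bc.tokens);   /* prints 9 */
    return 0;
}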
Example
This example is taken from Figure 1.5 of [1].
Consider the feedforward (acyclic) synchronous dataflow graph shown below:
A ---20-----10---> B ---20-----10---> C
The notation means that when A executes, it produces 20 tokens.
When B executes, it consumes 10 tokens and produces 20 tokens.
When C executes, it consumes 10 tokens.
The first step in scheduling an SDF graph for execution is to figure
out how many times to execute each actor so that all of the intermediate
tokens that are produced get consumed.
This process is known as load balancing.
Load balancing is implemented by an algorithm that is linear in time
and memory in the size of the SDF graph (number of vertices plus
number of edges plus three times the base-two logarithm of the number
of edges).
In the example above, we must
- Fire A 1 time
- Fire B 2 times
- Fire C 4 times
to balance the number of tokens produced and consumed.
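A minimal sketch of load balancing for a chain-structured graph follows; the
representation (rate arrays indexed by arc, repetition counts kept as fractions
until the end) is one possible implementation, not the specific algorithm of [1].

#include <stdio.h>

static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
static long lcm(long a, long b) { return a / gcd(a, b) * b; }

int main(void)
{
    /* The chain A ---20-----10---> B ---20-----10---> C:
       prod[i] tokens go onto arc i per firing of its source actor,
       cons[i] tokens come off arc i per firing of its sink actor. */
    long prod[] = {20, 20};
    long cons[] = {10, 10};
    enum { N_ACTORS = 3, N_ARCS = 2 };

    /* Keep each repetition count as a fraction num[i]/den[i], pinning the
       first actor to 1/1.  Each balance equation
           q[i] * prod[i] = q[i+1] * cons[i]
       determines the next fraction from the previous one. */
    long num[N_ACTORS] = {1, 0, 0}, den[N_ACTORS] = {1, 1, 1};
    for (int i = 0; i < N_ARCS; i++) {
        num[i + 1] = num[i] * prod[i];
        den[i + 1] = den[i] * cons[i];
        long g = gcd(num[i + 1], den[i + 1]);
        num[i + 1] /= g;
        den[i + 1] /= g;
    }

    /* Scale by the least common multiple of the denominators to obtain the
       smallest integer solution. */
    long scale = 1;
    for (int i = 0; i < N_ACTORS; i++) scale = lcm(scale, den[i]);
    for (int i = 0; i < N_ACTORS; i++)
        printf("fire %c %ld time(s)\n", (char)('A' + i), num[i] * scale / den[i]);
    return 0;   /* prints: fire A 1, fire B 2, fire C 4 */
}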
However, load balancing does not tell us the order in which to schedule
the firings.
If there were no constraints on the order, then the number of possible
schedules would be combinatoric in the total number of executions (seven
in this case).
Because of the data dependencies, the worst case is a polynomial function
of an exponential function of the size of the SDF graph.
As we will also discover later, the size of an SDF graph is
#nodes + #arcs * (1 + log delayPerArc +
log inputTokensPerArc + log outputTokensPerArc)
where log is the base-two logarithm (i.e., the number of bits).
The next step is to schedule the firings required by load balancing.
Several scheduling algorithms have been developed including
- list scheduling (a quadratic algorithm)
- looped scheduling (a cubic algorithm)
There are many variants on looped schedulers, such as the complementary
algorithms called pairwise grouping of adjacent nodes [2] and recursive
partitioning based on minimum cuts [2], which are discussed in [1].
Possible schedules for the above SDF graph are ABCCBCC for the list
scheduler and A (2 B(2 C)) for the looped scheduler.
The generated code to execute the schedule A (2 B(2 C)) would be
the following:
/* code block for A */
for (i = 0; i < 2; i++) {
    /* code block for B */
    for (j = 0; j < 2; j++) {
        /* code block for C */
    }
}
The schedule A (2 B(2 C)) is an example of a single-appearance
schedule since the invocation of each actor only appears once.
When generating code that is "stitched" together, a single-appearance
schedule requires the minimal amount of program memory because the
code for each actor only appears once.
The scheduling algorithms could actually return several different valid
schedules, such as those shown below.
#   Scheduler          Schedule        Buffer Memory (tokens)
1   List Scheduler     ABCBCCC         50
2   Looped Scheduler   A (2 B(2 C))    40
3   Looped Scheduler   A(2 B)(4 C)     60
4   Looped Scheduler   A(2 BC)(2 C)    50
The smallest amount of buffer memory possible is 40, which is met
by schedule #2.
It is optimal in terms of data memory usage.
The list scheduler could also have created a data optimal schedule
of ABCCBCC, which is just the expanded version of schedule #2.
Because schedule #2 is a single-appearance schedule, we know that it
is optimal in terms of program memory usage.
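The buffer memory figures above can be checked with a small sketch like the
one below, which replays a flat (fully expanded) schedule for the example
chain and sizes each arc's buffer at the maximum number of tokens it ever
holds. The helper name buffer_memory and the assumption of one statically
sized buffer per arc are ours, not from [1].

#include <stdio.h>

/* Replay a flat schedule for A ---20-----10---> B ---20-----10---> C and
   return the total buffer memory needed when each arc gets its own
   statically sized buffer (the sum of the per-arc maxima). */
static int buffer_memory(const char *schedule)
{
    int ab = 0, bc = 0, max_ab = 0, max_bc = 0;
    for (const char *p = schedule; *p; p++) {
        switch (*p) {
        case 'A': ab += 20;           break;
        case 'B': ab -= 10; bc += 20; break;
        case 'C': bc -= 10;           break;
        }
        if (ab > max_ab) max_ab = ab;
        if (bc > max_bc) max_bc = bc;
    }
    return max_ab + max_bc;
}

int main(void)
{
    printf("ABCBCCC : %d\n", buffer_memory("ABCBCCC"));  /* 50 (schedules #1, #4) */
    printf("ABCCBCC : %d\n", buffer_memory("ABCCBCC"));  /* 40 (schedule #2)      */
    printf("ABBCCCC : %d\n", buffer_memory("ABBCCCC"));  /* 60 (schedule #3)      */
    return 0;
}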
References
[1] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee,
Software Synthesis from Dataflow Graphs,
Kluwer Academic Press, Norwell, MA, ISBN 0-7923-9722-3, 1996.
[2] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee,
"APGAN and RPMC: Complementary Heuristics for Translating DSP Block Diagrams
into Efficient Software Implementations",
Design Automation for Embedded Systems Journal, to appear.
Updated 07/31/99.