Signal Processing System Design
The goal in signal processing system design is to model signal processing
systems using formal models that are decoupled from implementation details.
In formal modeling, a fundamental tradeoff exists between the
expressiveness of the model and the ability to analyze properties of the
model.
The more expressive the model, the more difficult it is to analyze it.
An appropriate compromise is to use several different models of computation
to model different parts of a signal processing system.
Signal Processing Algorithms
The most fundamental numeric computation in signal processing algorithms
is a vector inner (dot) product.
It is the basis for Fast Fourier Transforms, Finite Impulse Response
filters, and Infinite Impulse Response filters, which are commonly
called kernels.
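As an illustration, the sketch below writes a FIR filter as the inner product
of a coefficient vector with a window of input samples; the 4-tap moving
average, its coefficients, and the sample values are made up for this example,
but the multiply-accumulate loop is the operation DSP hardware is built around.

#include <stdio.h>

#define NTAPS 4

/* One output sample of a FIR filter: the inner product of the coefficient
   vector with the most recent NTAPS input samples. */
static double fir(const double coeff[NTAPS], const double window[NTAPS])
{
    double acc = 0.0;
    for (int k = 0; k < NTAPS; k++)
        acc += coeff[k] * window[k];    /* multiply-accumulate */
    return acc;
}

int main(void)
{
    double coeff[NTAPS]  = {0.25, 0.25, 0.25, 0.25};   /* 4-tap moving average */
    double window[NTAPS] = {1.0, 2.0, 3.0, 4.0};       /* most recent samples  */
    printf("output sample = %g\n", fir(coeff, window)); /* prints 2.5 */
    return 0;
}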
In an algorithm, these kernels communicate data to each other.
The flow of data is generally very regular and has a fixed static pattern.
The order of computation in signal processing algorithms can be
specified loosely in terms of data dependencies between kernels.
Many different orders of execution of the kernels are possible.
Therefore, it is beneficial not to use a model of computation which
forces the designer to pick one fixed order in which to perform the
operations, as an imperative programming language such as C does.
Rather, one should use a model of computation that can capture the
flexibility in a signal processing algorithm by only specifying the
dataflow in the algorithm.
These dataflow models of computation come in many varieties.
One set of dataflow models can always be scheduled statically, e.g.
Synchronous Dataflow, Cyclo-static Dataflow, and Static Dataflow.
Using the static schedule, code generators can synthesize
software, hardware, or software and hardware from the same specification.
Programmable Embedded Processors
A common implementation architecture for signal processing algorithms
is the embedded programmable processor.
In embedded programmable processors, memory is severely limited,
and the schedulers should search for schedules that will require
a minimal amount of memory for code and data.
For example, the Motorola 56000 Digital Signal Processor contains
three separate memory banks: one for code, and two for data.
On any instruction, three memory reads or two memory reads and one
memory write can be performed.
Each bank of memory is typically the same size, commonly
between 4 kB and 64 kB.
The goal for automatically generating code for an embedded DSP
processor is to jointly optimize the amount of program and data
memory required.
A common practice in industry is to develop kernels and applications
in a high-level language and cross-compile the application to an
embedded processor.
Compilers excel at optimizing local computation and data dependencies,
and perform fairly well on the small blocks of code that implement
kernels.
Compared to manual coding of kernels in assembly language, the overhead
required by the best compilers is 0-20% on data size and 50-60% on program
size.
Compilers are not well-suited to optimizing the global structure of
programs.
Compilers also have the following additional problems when generating
code for embedded programmable processors:
- require stack sizes that are too large for the available memory
- no division operation in hardware
- difficulty in expressing fixed-point operations in the high-level
language (see the sketch after this list)
- special data input and output architecture
- custom DSP operations
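To make the fixed-point problem concrete, the sketch below shows one common
workaround: a Q15 fractional multiply written out by hand in C. The function
name q15_mul and the values are invented for illustration. A DSP performs this
in a single fractional-multiply instruction, but in C it must be spelled out
as a widening multiply followed by a shift, and the compiler may or may not
map the idiom back onto the native instruction.

#include <stdint.h>
#include <stdio.h>

/* Q15 fixed-point multiply: 16-bit operands interpreted as fractions in
   [-1, 1).  The widening multiply and shift spell out what a DSP does in
   one instruction. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;   /* 32-bit product          */
    return (int16_t)(wide >> 15);             /* renormalize back to Q15 */
}

int main(void)
{
    int16_t half    = 0x4000;                 /* 0.5 in Q15              */
    int16_t quarter = q15_mul(half, half);    /* 0.25, i.e. 0x2000       */
    printf("0.5 * 0.5 = %#x (Q15)\n", quarter);
    return 0;
}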
The key to the generation of efficient software is to model the global
structure of an application using a static dataflow model in which
kernels are connected together.
Scheduling algorithms would then determine an efficient ordering of
the kernels.
To generate the software for the kernels themselves, we could then use a compiler.
This approach leverages the best of both types of tools.
Static Dataflow Models
Many varieties of dataflow models exist.
The ones that can be statically scheduled are preferred, of course, because
the resource requirements to implement the algorithms can be determined
in advance at compile time, thereby avoiding the overhead and uncertainty
associated with run-time scheduling.
We will discuss one particular static dataflow model known as
Synchronous Dataflow.
Synchronous Dataflow (SDF) is a model first proposed by Edward A. Lee in 1986.
In SDF, all computation and data communication is scheduled statically.
That is, algorithms expressed as SDF graphs can always be converted into
an implementation that is guaranteed to take finite-time to complete all
tasks and use finite memory.
Thus, an SDF graph can be executed over and over again in a periodic
fashion without requiring additional resources as it runs.
This type of operation is well-suited to digital signal processing and
communications systems which often process an endless supply of data.
An SDF graph consists of nodes and arcs.
Nodes represent operations which are called actors.
Arcs represent data values called tokens, which are stored in
first-in first-out (FIFO) queues.
The word token is used because each data value can represent any
data type (e.g. integer or real) or any data structure (e.g. matrix
or image).
SDF graphs obey the following rules:
- An actor is enabled for execution when enough tokens are available
at all of the inputs.
- When an actor executes, it always produces and consumes the same
fixed number of tokens.
- The flow of data through the graph may not depend on values of the data.
Because of the second rule, the data that an actor consumes is removed
from the buffers on the input arcs and not restored.
The consequence of the last rule is that an SDF graph may not contain
data-dependent branching (such as an if-then-else construct) or
data-dependent iteration (such as a for loop).
However, the actors themselves may contain these constructs because the scheduling
of an SDF graph is independent of what tasks the actors perform.
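The following sketch restates the firing rules in C; the data structures
(Fifo, Actor) and functions (can_fire, fire) are invented for illustration
and are not part of any particular SDF tool. The rates and token counts are
arbitrary.

#include <stdio.h>

typedef struct {
    int tokens;              /* number of tokens currently queued on the arc  */
} Fifo;

typedef struct {
    Fifo *in, *out;          /* one input arc and one output arc, for brevity */
    int consume, produce;    /* fixed rates: rule 2 says these never change   */
    void (*work)(void);      /* the actor's internal computation              */
} Actor;

/* Rule 1: an actor is enabled only when enough tokens are on its inputs. */
static int can_fire(const Actor *a)
{
    return a->in == NULL || a->in->tokens >= a->consume;
}

/* Rule 2: each firing consumes and produces fixed token counts; consumed
   tokens are removed from the input FIFO and not restored. */
static void fire(Actor *a)
{
    if (a->in)  a->in->tokens  -= a->consume;
    a->work();
    if (a->out) a->out->tokens += a->produce;
}

static void b_work(void) { /* e.g., an FIR kernel would run here */ }

int main(void)
{
    Fifo ab = {6}, bc = {0};               /* 6 tokens already on the input arc */
    Actor b = {&ab, &bc, 2, 3, b_work};    /* consumes 2, produces 3 per firing */
    while (can_fire(&b)) fire(&b);         /* fires three times */
    printf("tokens on the output arc: %d\n", bc.tokens);   /* prints 9 */
    return 0;
}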
Example
This example is taken from Figure 1.5 of [1].
Consider the feedforward (acyclic) synchronous dataflow graph shown below:
A ---20-----10---> B ---20-----10---> C
The notation means that when A executes, it produces 20 tokens.
When B executes, it consumes 10 tokens and produces 20 tokens.
When C executes, it consumes 10 tokens.
The first step in scheduling an SDF graph for execution is to figure
out how many times to execute each actor so that all of the intermediate
tokens that are produced get consumed.
This process is known as load balancing.
Load balancing is implemented by an algorithm that is linear in time
and memory in the size of the SDF graph (number of vertices plus
number of edges plus three times the base-two logarithm of the number
of edges).
In the example above, we must
- Fire A 1 time
- Fire B 2 times
- Fire C 4 times
to balance the number of tokens produced and consumed.
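A minimal sketch of load balancing for a chain-structured graph follows; the
representation (rate arrays indexed by arc, repetition counts kept as fractions
until the end) is one possible implementation, not the specific algorithm of [1].

#include <stdio.h>

static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
static long lcm(long a, long b) { return a / gcd(a, b) * b; }

int main(void)
{
    /* The chain A ---20-----10---> B ---20-----10---> C:
       prod[i] tokens go onto arc i per firing of its source actor,
       cons[i] tokens come off arc i per firing of its sink actor. */
    long prod[] = {20, 20};
    long cons[] = {10, 10};
    enum { N_ACTORS = 3, N_ARCS = 2 };

    /* Keep each repetition count as a fraction num[i]/den[i], pinning the
       first actor to 1/1.  Each balance equation
           q[i] * prod[i] = q[i+1] * cons[i]
       determines the next fraction from the previous one. */
    long num[N_ACTORS] = {1, 0, 0}, den[N_ACTORS] = {1, 1, 1};
    for (int i = 0; i < N_ARCS; i++) {
        num[i + 1] = num[i] * prod[i];
        den[i + 1] = den[i] * cons[i];
        long g = gcd(num[i + 1], den[i + 1]);
        num[i + 1] /= g;
        den[i + 1] /= g;
    }

    /* Scale by the least common multiple of the denominators to obtain the
       smallest integer solution. */
    long scale = 1;
    for (int i = 0; i < N_ACTORS; i++) scale = lcm(scale, den[i]);
    for (int i = 0; i < N_ACTORS; i++)
        printf("fire %c %ld time(s)\n", (char)('A' + i), num[i] * scale / den[i]);
    return 0;   /* prints: fire A 1, fire B 2, fire C 4 */
}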
However, load balancing does not tell us the order in which to schedule
the firings.
If there were no constraints on the order, then the number of possible
schedules would be combinatoric in the total number of executions (seven
in this case).
Because of the data dependencies, the worst case is a polynomial function
of an exponential function of the size of the SDF graph.
As we will also discover later, the size of an SDF graph is
#nodes + #arcs * (1 + log delayPerArc +
log inputTokensPerArc + log outputTokensPerArc)
where log is the base-two logarithm (i.e., the number of bits).
The next step is to schedule the firings required by load balancing.
Several scheduling algorithms have been developed including
- list scheduling (a quadratic algorithm)
- looped scheduling (a cubic algorithm)
There are many variants on looped schedulers, such as the complementary
algorithms called pairwise grouping of adjacent nodes [2] and recursive
partitioning based on minimum cuts [2], which are discussed in [1].
Possible schedules for the above SDF graph are ABCCBCC for the list
scheduler and A (2 B(2 C)) for the looped scheduler.
The generated code to execute the schedule A (2 B(2 C)) would be
the following:
/* code block for A */
for (i = 0; i < 2; i++) {
    /* code block for B */
    for (j = 0; j < 2; j++) {
        /* code block for C */
    }
}
The schedule A (2 B(2 C)) is an example of a single-appearance
schedule since the invocation of each actor only appears once.
When generating code that is "stitched" together, a single-appearance
schedule requires the minimal amount of program memory because the
code for each actor only appears once.
The scheduling algorithms could actually return several different valid
schedules, such as those shown below.
#   Scheduler          Schedule        Buffer Memory (tokens)
1   List Scheduler     ABCBCCC         50
2   Looped Scheduler   A (2 B(2 C))    40
3   Looped Scheduler   A(2 B)(4 C)     60
4   Looped Scheduler   A(2 BC)(2 C)    50
The smallest amount of buffer memory possible is 40, which is met
by schedule #2.
It is optimal in terms of data memory usage.
The list scheduler could also have created a data optimal schedule
of ABCCBCC, which is just the expanded version of schedule #2.
Because schedule #2 is a single-appearance schedule, we know that it
is optimal in terms of program memory usage.
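The buffer memory figures above can be checked with a small sketch like the
one below, which replays a flat (fully expanded) schedule for the example
chain and sizes each arc's buffer at the maximum number of tokens it ever
holds. The helper name buffer_memory and the assumption of one statically
sized buffer per arc are ours, not from [1].

#include <stdio.h>

/* Replay a flat schedule for A ---20-----10---> B ---20-----10---> C and
   return the total buffer memory needed when each arc gets its own
   statically sized buffer (the sum of the per-arc maxima). */
static int buffer_memory(const char *schedule)
{
    int ab = 0, bc = 0, max_ab = 0, max_bc = 0;
    for (const char *p = schedule; *p; p++) {
        switch (*p) {
        case 'A': ab += 20;           break;
        case 'B': ab -= 10; bc += 20; break;
        case 'C': bc -= 10;           break;
        }
        if (ab > max_ab) max_ab = ab;
        if (bc > max_bc) max_bc = bc;
    }
    return max_ab + max_bc;
}

int main(void)
{
    printf("ABCBCCC : %d\n", buffer_memory("ABCBCCC"));  /* 50 (schedules #1, #4) */
    printf("ABCCBCC : %d\n", buffer_memory("ABCCBCC"));  /* 40 (schedule #2)      */
    printf("ABBCCCC : %d\n", buffer_memory("ABBCCCC"));  /* 60 (schedule #3)      */
    return 0;
}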
References
[1] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee,
Software Synthesis from Dataflow Graphs,
Kluwer Academic Press, Norwell, MA, ISBN 0-7923-9722-3, 1996.
[2] S. S. Bhattacharyya, P. K. Murthy, and E. A. Lee,
"APGAN and RPMC: Complementary Heuristics for Translating DSP Block Diagrams
into Efficient Software Implementations",
Design Automation for Embedded Systems Journal, to appear.
Updated 07/31/99.