#### HW/SW Co-Design

© Copyright Margarida Jacome

#### Outline

- Introduction
- When to Use Accelerators
- Real Time Scheduling
- Accelerated System Design
- Architecture Selection
- Partitioning and Scheduling
- Key Recent Trends

Margarida Jacome - UT Austin

#### Embedded Systems

- Signal processing systems
  - radar, sonar, real-time video, set-top boxes, DVD players, medical equipment, residential gateways
- Mission critical systems
  - avionics, spacecraft control, nuclear plant control
- Distributed control
  - network routers & switches, mass transit systems, elevators in large buildings
- "Small" systems
  - cellular phones, pagers, home appliances, toys, smart cards, MP3 players, PDAs, digital cameras and camcorders, sensors, smart badges

#### Typical Characteristics of Embedded Systems

- Part of a larger system
  - not a "computer with keyboard, display, etc."
- HW & SW perform an application-specific function, not a general-purpose (G.P.) one
  - application is known a priori
  - but definition and development are concurrent
- Some degree of re-programmability is essential
  - flexibility in upgrading, bug fixing, product differentiation, product customization
- Interact (sense, manipulate, communicate) with the external world
- Never terminate (ideally)
- Operation is time constrained: latency, throughput
- Other constraints: power, size, weight, heat, reliability, etc.
- Increasingly high-performance (DSP) & networked

#### Modern Embedded Systems?

- Embedded systems employ a combination of
  - application-specific h/w (boards, ASICs, FPGAs, etc.)
    - performance, low power
  - s/w on programmable processors: DSPs, μcontrollers, etc.
    - flexibility, complexity
  - mechanical transducers and actuators

#### Accelerating Systems

- Use additional computational unit(s) dedicated to some functions
  - Hardwired logic.
  - Extra CPU.
- Hardware/Software Co-design: joint design of hardware and software architectures.
  - performance analysis
  - scheduling and allocation

#### Accelerated System Architecture

[Figure: accelerated system architecture]

#### Accelerator vs. Co-Processor

- A co-processor executes instructions.
  - Instructions are dispatched by the CPU.
- An accelerator appears as a device on the bus.
  - The accelerator is controlled via registers.

#### Accelerator Implementations

- Application-specific integrated circuit.
- Field-programmable gate array (FPGA).
- Standard component.
  - Example: graphics processor.
- SoCs enable multiple accelerators, CPUs, peripherals, and some memory to be placed within a single chip.

#### System Design Tasks

- Design a heterogeneous multiprocessor architecture that satisfies the design requirements.
  - Processing element (PE): CPU, accelerator, etc.
- Program the system.

#### Why Accelerators?

- Better cost/performance.
  - Custom logic may be able to perform operation faster or at lower power than a CPU of equivalent cost.
  - CPU cost is a non-linear function of performance.

#### Why Accelerators? cont'd.

- Better real-time performance.
  - Put time-critical functions on less-loaded processing elements.
  - Rate Monotonic Scheduling (RMS) utilization is *limited*: extra CPU cycles must be reserved to meet deadlines (*see next section*).

#### Why Accelerators? cont'd.

- Good for processing I/O in real-time.
- May consume less energy.
- May be better at streaming data.
- May not be able to do all the work on even the largest single CPU...
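As a rough illustration of the accelerator side of the accelerator vs. co-processor distinction in this section, the sketch below models a device that the CPU drives purely through registers. The register names (`DATA_IN`, `CTRL`, `STATUS`, `DATA_OUT`), the squaring function, and the polling loop are all hypothetical choices made for this example; a real accelerator would sit behind bus-mapped registers and run concurrently with the CPU.

```python
class ToyAccelerator:
    """Hypothetical bus device; the CPU can only touch its registers."""

    def __init__(self):
        self.regs = {"DATA_IN": 0, "CTRL": 0, "STATUS": 0, "DATA_OUT": 0}

    def write_reg(self, name, value):
        self.regs[name] = value
        if name == "CTRL" and value == 1:
            # Start bit set: run the hardwired function (squaring, by assumption).
            self.regs["DATA_OUT"] = self.regs["DATA_IN"] ** 2
            self.regs["STATUS"] = 1  # signal completion
            self.regs["CTRL"] = 0    # clear the start bit


    def read_reg(self, name):
        return self.regs[name]


def square_on_accelerator(dev, x):
    """CPU-side driver: load input, start the device, poll, fetch the result."""
    dev.write_reg("STATUS", 0)
    dev.write_reg("DATA_IN", x)
    dev.write_reg("CTRL", 1)
    while dev.read_reg("STATUS") != 1:
        pass  # busy-wait; this toy device finishes immediately
    return dev.read_reg("DATA_OUT")


print(square_on_accelerator(ToyAccelerator(), 7))  # prints 49
```

A co-processor, by contrast, would be reached through instructions dispatched by the CPU rather than through register reads and writes.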
#### Real Time Scheduling

- Scheduling Policies
  - RMS (Rate Monotonic Scheduling):
    - Task Priority = Rate = 1/Period
    - RMS is the optimal preemptive *fixed-priority* scheduling policy.
  - EDF (Earliest Deadline First):
    - Task Priority = Current Absolute Deadline
    - EDF is the optimal preemptive *dynamic-priority* scheduling policy.

#### Real Time Scheduling Assumptions

- Scheduling Assumptions
  - Single Processor
  - All Tasks are Periodic
  - Zero Context-Switch Time
  - Worst-Case Task Execution Times are Known
  - No Data Dependencies Among Tasks
- RMS and EDF have both been extended to relax these assumptions.

#### Metrics

- How do we evaluate a scheduling policy?
  - Ability to satisfy all deadlines.
  - CPU utilization: percentage of time devoted to useful work.
  - Scheduling overhead: time required to make a scheduling decision.

#### Rate Monotonic Scheduling

- RMS (Liu and Layland): widely used, analyzable scheduling policy.
- Analysis is known as Rate Monotonic Analysis (RMA).

#### RMA model

- All processes run on a single CPU.
- Zero context switch time.
- No data dependencies between processes.
- Process execution time is constant.
- Deadline is at end of period.
- Highest-priority ready process runs.

#### Process Parameters

- $T_i$ is the execution time of process $i$; $\tau_i$ is the period of process $i$.

#### Rate-Monotonic Analysis

- Response time: time required to finish a process/task.
- Critical instant: scheduling state that gives worst response time.
  - Critical instant occurs when all higher-priority processes are ready to execute.

#### RMS priorities

- Optimal (fixed) priority assignment:
  - shortest-period process gets highest priority;
    - priority-based preemption can be used...
  - priority inversely proportional to period;
  - break ties arbitrarily.
- No fixed-priority scheme does better.
  - RMS provides the highest worst-case CPU utilization while ensuring that all processes meet their deadlines.

#### RMS CPU utilization

- Utilization for $n$ processes is $\sum_{i=1}^{n} T_i / \tau_i$.
- As the number of tasks approaches infinity, the **worst-case** maximum utilization approaches 69% ($\ln 2 \approx 0.693$).
- Yet it is not uncommon to find total utilizations around 0.90 or more (0.69 is the worst-case behavior of the algorithm).
- Achievable utilization is strongly dependent upon the relative values of the periods of the tasks comprising the task set...

#### RMS: example 3

| Process | Execution time $T_i$ | Period $\tau_i$ |
|---------|----------------------|-----------------|
| $P_1$ | 1 | 4 |
| $P_2$ | 6 | 8 |

Is this task set schedulable? If yes, give the CPU utilization.

#### RMS CPU utilization, cont'd.

- RMS cannot asymptotically **guarantee** use of 100% of the CPU, even with zero context switch overhead.
- Must keep idle cycles available to handle the worst-case scenario.
- However, RMS guarantees that all processes will always meet their deadlines as long as total utilization stays within that worst-case bound.

#### RMS implementation

- Efficient implementation:
  - scan processes;
  - choose highest-priority active process.

#### Earliest-deadline-first scheduling

- **EDF**: **dynamic** priority scheduling scheme.
- Process closest to its deadline has highest priority.
- Requires recalculating process priorities at every timer interrupt.

#### EDF analysis

- EDF can use 100% of the CPU in the worst case.
- But EDF may still miss deadlines, e.g. when the task set overloads the CPU (utilization above 100%).
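The RMS utilization figures above can be checked numerically. The sketch below computes total utilization, the Liu and Layland worst-case bound $n(2^{1/n}-1)$ (which tends to the 69% limit), and an exact check based on the standard response-time iteration started at the critical instant. The response-time iteration is not presented in these slides; it is brought in here as one common way to test a task set under the RMA model (single CPU, zero context switch time, deadline at end of period). All function names are illustrative.

```python
import math


def utilization(tasks):
    """Total CPU utilization: sum of T_i / tau_i over (T_i, tau_i) pairs."""
    return sum(t / period for t, period in tasks)


def liu_layland_bound(n):
    """Worst-case RMS utilization bound for n tasks; tends to ln 2 (about 0.693)."""
    return n * (2 ** (1 / n) - 1)


def rms_schedulable(tasks):
    """Exact RMS test: iterate each task's worst-case response time."""
    ordered = sorted(tasks, key=lambda task: task[1])  # shortest period first
    for i, (t_i, period_i) in enumerate(ordered):
        response = t_i
        while True:
            # Time stolen by every higher-priority task released so far.
            interference = sum(
                math.ceil(response / p_j) * t_j for t_j, p_j in ordered[:i]
            )
            next_response = t_i + interference
            if next_response > period_i:
                return False          # deadline (end of period) is missed
            if next_response == response:
                break                 # converged: response time found
            response = next_response
    return True


example3 = [(1, 4), (6, 8)]           # (T_i, tau_i) for P1 and P2
print(utilization(example3))
print(round(liu_layland_bound(len(example3)), 3))
print(rms_schedulable(example3))
print(rms_schedulable([(2, 5), (4, 7)]))  # hypothetical non-harmonic set
```

For example 3, utilization is 1.0, above the two-task bound of about 0.83, yet the exact test reports the set schedulable because the periods are harmonic. The second, made-up task set has utilization below 1.0 but fails the exact test, which illustrates how strongly achievable utilization depends on the relative periods.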
#### EDF implementation

- On each timer interrupt:
  - compute time to deadline;
  - choose process closest to deadline.
- Generally considered too expensive to use in practice, unless the task count is small.

#### Priority Inversion

- Priority inversion: a low-priority process keeps a high-priority process from running.
- Improper use of system resources can cause scheduling problems:
  - Low-priority process grabs I/O device.
  - High-priority process needs the I/O device, but can't get it until the low-priority process is done.
- Can cause deadlock.

#### Solving priority inversion

- Give priorities to system resources.
- Have a process inherit the priority of a resource that it requests.
  - Low-priority process inherits priority of device if higher.

#### Context-switching time

- Non-zero context switch time can push the limits of a tight schedule.
- Hard to calculate effects: depends on order of context switches.
- In practice, OS context switch overhead is small.

#### What about interrupts?

- Interrupts take time away from processes.
- Other event processing may be masked during the interrupt service routine (ISR).
- Perform the minimum work possible in the interrupt handler.

[Figure: execution timeline showing P1, OS, intr, OS, P3]

#### Device processing structure

- Interrupt service routine (ISR) performs minimal I/O.
  - Get register values, put register values.
- Interrupt service process/thread performs most of the device function.

#### Evaluating performance

- May want to test
  - context switch time assumptions on real platform
  - scheduling policy

#### Processes and caches

- Processes can cause additional caching problems.
- Even if individual processes are well-behaved, processes may interfere with each other.
- Worst-case execution time with bad cache behavior is usually much worse than execution time with good cache behavior.

#### Fixing scheduling problems

- What if your set of processes is unschedulable?
  - Change deadlines in requirements.
  - Reduce execution times of processes.
  - Get a faster CPU.
  - Get an accelerator.

#### Accelerated system design

- First, determine that the system really needs to be accelerated.
  - How much faster is the accelerator on the core function?
  - How much data transfer overhead?
- Design the accelerator itself.
- Design CPU interface to accelerator.

#### Performance analysis

- Critical parameter is speedup: how much faster is the system with the accelerator?
- Must take into account:
  - Accelerator execution time.
  - Data transfer time.
  - Synchronization with the master CPU.

#### Accelerator execution time

- Total accelerator execution time: $t_{accel} = t_{in} + t_x + t_{out}$, where $t_x$ is the accelerated computation time and $t_{in}$, $t_{out}$ are the data input and output times.

#### Data input/output times

- Bus transactions include:
  - flushing register/cache values to main memory;
  - time required for CPU to set up transaction;
  - overhead of data transfers by bus packets, handshaking, etc.

#### Accelerator speedup

- Assume loop is executed $n$ times.
- Compare accelerated system to non-accelerated system:
  - Saved Time $= n(t_{CPU} - t_{accel}) = n[t_{CPU} - (t_{in} + t_x + t_{out})]$
  - $t_{CPU}$ is the execution time of the equivalent function on the CPU.
- Speed-Up = Original Ex. Time / Accelerated Ex. Time

#### Single- vs. multi-threaded

- One critical factor is available parallelism:
  - single-threaded/blocking: CPU waits for accelerator;
  - multithreaded/non-blocking: CPU continues to execute along with accelerator.
- To multithread, CPU must have useful work to do.
  - But software must also support multithreading.

#### Total execution time

[Figures: execution timelines for the single-threaded and multi-threaded cases]

#### Execution time analysis

- Single-threaded:
  - Count execution time of all component processes.
- Multi-threaded:
  - Find longest path through execution.

#### Sources of parallelism

- Overlap I/O and accelerator computation.
  - Perform operations in batches; read in second batch of data while computing on first batch.
- Find other work to do on the CPU.
  - May reschedule operations to move work after accelerator initiation.

#### Accelerated systems

- Several off-the-shelf boards are available for acceleration in PCs:
  - FPGA-based core;
  - PC bus interface.

#### Accelerator/CPU interface

- Accelerator registers provide control registers for CPU.
- Data registers can be used for small data objects.
- Accelerator may include special-purpose read/write logic (DMA hardware).
  - Especially valuable for large data transfers.

#### Caching problems

- Main memory provides the primary data transfer mechanism to the accelerator.
- Programs must ensure that caching does not invalidate main memory data.
  - CPU reads location S.
  - Accelerator writes location S.
  - CPU writes location S. **BAD** (the CPU works from its stale cached copy of S and never sees the value the accelerator wrote)
- The bus interface may provide mechanisms for accelerators to tell the CPU of required cache changes...

#### Synchronization

- As with cache, main memory writes to shared memory may cause invalidation:
  - CPU reads S.
  - Accelerator writes S.
  - CPU writes S.
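The read/write/write sequence listed above can be reproduced with a toy model. This is only a sketch under simplifying assumptions: main memory is a dictionary, the CPU has a private write-through cache that bus writes never invalidate on their own, and the accelerator writes straight to memory. The class and function names, and the values 10 and 99, are made up for the example.

```python
class CachedCPU:
    """CPU with a private cache that bus writes do not invalidate (assumed)."""

    def __init__(self, memory):
        self.memory = memory  # shared main memory, modeled as a dict
        self.cache = {}

    def read(self, addr):
        if addr not in self.cache:            # miss: fetch from main memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]               # hit: may return a stale value

    def write(self, addr, value):
        self.cache[addr] = value
        self.memory[addr] = value             # write-through, for simplicity

    def invalidate(self, addr):
        self.cache.pop(addr, None)            # drop the cached copy


def run(invalidate_before_update):
    memory = {"S": 10}
    cpu = CachedCPU(memory)
    cpu.read("S")                      # CPU reads S (now cached as 10)
    memory["S"] = 99                   # accelerator writes S directly to memory
    if invalidate_before_update:
        cpu.invalidate("S")            # stand-in for a coherence mechanism
    cpu.write("S", cpu.read("S") + 1)  # CPU updates S from the value it sees
    return memory["S"]


print(run(False))  # prints 11: the accelerator's 99 was overwritten
print(run(True))   # prints 100: the CPU saw the accelerator's value
```

The `invalidate` call stands in for whatever mechanism keeps the CPU's view of S coherent; it is a placeholder, not a description of a specific bus protocol.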
Many CPU buses implement test-and-set atomic operations that the accelerator can use to implement a semaphore. This can serve as a highly efficient means of synchronizing inter-process communications (IPC).

#### Partitioning/Decomposition

- Divide functional specification into units.
  - Map units onto PEs.
  - Units may become processes.
- Determine proper level of parallelism: f3(f1(),f2()) vs. f1() f2()

#### "Typical" Decomposition Methodology

- Divide Control-Data Flow Graph (CDFG) into pieces, shuffle functions between pieces.
- Hierarchically decompose CDFG to identify possible partitions.

#### Decomposition example

[Figure: decomposition example; labels recovered: cond 1, Block 1, Block 2, Block 3, P1, P2, P4]

#### Scheduling and allocation

- Must:
  - schedule operations in time;
  - allocate computations to processing elements.
- Scheduling and allocation interact, but separating them helps.
  - Alternatively allocate, then schedule.

#### Example process execution times

| Process | M1 | M2 |
|---------|----|----|
| P1 | 5 | 5 |
| P2 | 5 | 6 |
| P3 | - | 5 |

#### Example communication model

- Assume communication within a PE is free.
- Cost of communication from P1 to P3 is d1 = 2; cost of P2 to P3 communication is d2 = 4.

#### System integration and debugging

- Try to debug the CPU/accelerator interface separately from the accelerator core.
- Build scaffolding to test the accelerator (the Hardware Abstraction Layer is a good place for this functionality, under compile switches).
- Hardware/software co-simulation can be useful.

#### Hardware vs. Software Modules

- Hardware = functionality implemented via a custom architecture (e.g. datapath + FSM)
- Software = functionality implemented in software on a programmable processor
- Key differences:
  - Multiplexing
    - software modules are multiplexed with others on a processor, e.g. using an OS
    - hardware modules are typically mapped individually on dedicated hardware
  - Concurrency
    - processors usually have one "thread of control"
    - dedicated hardware often has concurrent datapaths

#### Many Types of Programmable Processors

- Past/Now
  - Microprocessor
  - Microcontroller
  - DSP
  - Graphics Processor
- Now/Future
  - Network Processor
  - Sensor Processor
  - Cryptoprocessor
  - Game Processor
  - Wearable Processor
  - Mobile Processor

#### Application-Specific Instruction Processors (ASIPs)

- Processors with instruction sets tailored to specific applications or application domains
  - instruction-set generation as part of synthesis
  - e.g. Tensilica
- Pluses:
  - customization yields lower area, power, etc.
- Minuses:
  - higher h/w & s/w development overhead
    - design, compilers, debuggers

#### Other Examples

- Atmel's FPSLIC (AVR + FPGA)
- Altera's Nios (configurable RISC on a PLD)

#### H/W-S/W Architecture

- A significant part of the problem is deciding which parts should be in s/w on programmable processors, and which in specialized h/w
- Today:
  - Ad hoc approaches based on earlier experience with similar products, & on manual design
  - H/W-S/W partitioning decided at the beginning, and then designs proceed separately

#### Embedded System Design

- CAD tools take care of HW fairly well (at least in relative terms)
  - Although a productivity gap is emerging
- But SW is a different story...
  - HLLs such as C help, but can't cope with complexity and performance constraints

Holy Grail for Tools People: H/W-like synthesis & verification from a behavior description of the whole system at a high level of abstraction using formal computation models

#### Productivity Gap in Hardware Design

[Figure: a growing gap between design complexity and design productivity. Source: sematech97]