# **AnySP: Anytime Anywhere Anyway Signal Processing**

Mark Woh<sup>1</sup>, Sangwon Seo<sup>1</sup>, Scott Mahlke<sup>1</sup>, Trevor Mudge<sup>1</sup>, Chaitali Chakrabarti<sup>2</sup>, Krisztian Flautner<sup>3</sup>

University of Michigan – ACAL<sup>1</sup>, Arizona State University<sup>2</sup>, ARM, Ltd.<sup>3</sup>

#### The Modern Mobile Phone

**Advanced Image Processing**

Photos from:

- http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/
- http://www.apple.com/iphone

## Inside Today's Smart Phones

#### Power/Performance Requirements for Multiple Systems

## The Applications

Is there anything we can learn from the applications themselves?

#### H.264 Basics

#### **4G Wireless Basics**

- Three kernels make up the majority of the work
  - FFT: extracts data from signals
  - STBC: combines data into a more reliable stream
  - LDPC: error correction on the data stream

#### **Mobile Signal Processing Algorithm Characteristics**

| Standard | Algorithm | SIMD Workload (%) | Scalar Workload (%) | Overhead Workload (%) | SIMD Width (Elements) | Amount of TLP |
|----------|-----------|-------------------|---------------------|-----------------------|-----------------------|---------------|
| 4G | FFT | 75 | 5 | 20 | 1024 | Low |
| 4G | STBC | 81 | 5 | 14 | 4 | High |
| 4G | LDPC | 40 | 10 | 22 | | Low |
| H.264 | Deblocking Filter | | | | | Medium |
| H.264 | Intra-prediction | | | | | |
| H.264 | Invers… | | | | | High |
| H.264 | | | | | | High |

**SIMD comes at a cost!**

- Register file power
- Data movement/alignment cost

SIMD architectures have to deal with this!

- SIMD widths range from very large to very small
- Though SIMD width varies, all algorithms can exploit it
- A large percentage of the work can be SIMDized
- Larger SIMD widths tend to have less TLP

## Pentium III – SIMD code for Discrete Cosine Transform (DCT)

Only the instructions labeled **True Computation** (shown in red on the slide) are MMX computations. All other instructions are simply supporting these computations.

| Instruction | Category |
|-------------|----------|
| `lea ebx, DWORD PTR [ebp+128]` | load/address overhead |
| `mov DWORD PTR [esp+28], ebx` | load/address overhead |
| `$B1$2:` | |
| `xor eax, eax` | address overhead |
| `mov edx, ecx` | address overhead |
| `lea edi, DWORD PTR [ecx+16]` | load/address overhead |
| `mov DWORD PTR [esp+24], ecx` | load/address overhead |
| `$B1$3:` | |
| `movq mm1, MMWORD PTR [ebp]` | load overhead |
| `pxor mm0, mm0` | initialization overhead |
| `pmaddwd mm1, MMWORD PTR [eax+esi]` | **True Computation** |
| `movq mm2, MMWORD PTR [ebp+8]` | load overhead |
| `pmaddwd mm2, MMWORD PTR [eax+esi+8]` | **True Computation** |
| `add eax, 16` | address overhead |
| `paddw mm1, mm0` | **True Computation** |
| `paddw mm2, mm1` | **True Computation** |
| `movq mm0, mm2` | load related overhead |
| `psrlq mm2, 32` | SIMD reduction overhead |
| `movd ecx, mm0` | SIMD load overhead |
| `movd ebx, mm2` | SIMD load overhead |
| `add ecx, ebx` | SIMD conversion overhead |
| `mov WORD PTR [edx], cx` | store overhead |
| `add edx, 2` | address overhead |
| `cmp edi, edx` | branch related overhead |
| `jg $B1$3` | loop branch overhead |
| `$B1$4:` | |
| `mov ecx, DWORD PTR [esp+24]` | load/address overhead |
| `add ebp, 16` | address overhead |
| `add ecx, 16` | address overhead |
| `mov eax, DWORD PTR [esp+28]` | load/address overhead |
| `cmp eax, ebp` | branch related overhead |
| `jg $B1$2` | loop branch overhead |

### a) Deblocking Filter Subgraph

### c) Subgraph for Bit Node and Check Node Operation

#### **Traditional SIMD Power Breakdown**

The register file consumes a large share of the power in a traditional 32-wide SIMD architecture.

## Register File Access

Many register file accesses do not have to go back to the main register file.

## **Instruction Pair Frequency**

a) Intra-prediction and Deblocking Filter Combined

| Rank | Instruction Pair | Frequency |
|------|------------------|-----------|
| 1 | multiply-add | 26.71% |
| 2 | add-add | 13.74% |
| 3 | shuffle-add | 8.54% |
| 4 | shift right-add | 6.90% |
| 5 | subtract-add | 6.94% |
| 6 | add-shift right | 5.76% |
| 7 | multiply-subtract | 4.00% |
| 8 | shift right-subtract | 3.75% |
| 9 | add-subtract | 3.07% |
| 10 | Others | 20.45% |

b) LDPC

| Rank | Instruction Pair | Frequency |
|------|------------------|-----------|
| 1 | shuffle-move | 32.07% |
| 2 | abs-subtract | 8.54% |
| 3 | move-subtract | 8.54% |
| 4 | shuffle-subtract | 3.54% |
| 5 | add-shuffle | 3.54% |
| 6 | Others | 43.77% |

c) FFT

| Rank | Instruction Pair | Frequency |
|------|------------------|-----------|
| 1 | shuffle-shuffle | 16.67% |
| 2 | add-multiply | 16.67% |
| 3 | multiply-subtract | 16.67% |
| 4 | multiply-add | 16.67% |
| 5 | subtract-multiply | 16.67% |
| 6 | shuffle-add | 16.67% |

As with the Multiply-Accumulate (MAC) instruction, there is an opportunity to fuse other instruction pairs.

A few instruction pairs (3-5) make up the majority of all instruction pairs!
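The claim that a handful of pairs dominates can be checked against the three frequency tables. The sketch below is not from the slides: it sorts each kernel's listed pairs by frequency and counts how many are needed to reach half of all pairs. The helper name `pairs_to_cover` and the 50% threshold are illustrative assumptions, and the "Others" bucket is excluded because it aggregates many distinct pairs.

```python
# Pair frequencies (percent) copied from the instruction pair tables.
# "Others" is left out because it is not a single fusable pair.
H264_PAIRS = {
    "multiply-add": 26.71, "add-add": 13.74, "shuffle-add": 8.54,
    "shift right-add": 6.90, "subtract-add": 6.94, "add-shift right": 5.76,
    "multiply-subtract": 4.00, "shift right-subtract": 3.75,
    "add-subtract": 3.07,
}
LDPC_PAIRS = {
    "shuffle-move": 32.07, "abs-subtract": 8.54, "move-subtract": 8.54,
    "shuffle-subtract": 3.54, "add-shuffle": 3.54,
}
FFT_PAIRS = {
    "shuffle-shuffle": 16.67, "add-multiply": 16.67,
    "multiply-subtract": 16.67, "multiply-add": 16.67,
    "subtract-multiply": 16.67, "shuffle-add": 16.67,
}


def pairs_to_cover(freqs, threshold=50.0):
    """Return how many of the most frequent pairs are needed to reach
    `threshold` percent, or None if the listed pairs never reach it."""
    total = 0.0
    ranked = sorted(freqs.values(), reverse=True)
    for count, pct in enumerate(ranked, start=1):
        total += pct
        if total >= threshold:
            return count
    return None


if __name__ == "__main__":
    tables = (("H.264", H264_PAIRS), ("LDPC", LDPC_PAIRS), ("FFT", FFT_PAIRS))
    for name, table in tables:
        print(name, pairs_to_cover(table))
```

With the table values, four pairs cover half of the combined H.264 kernels, four cover half of LDPC, and three cover half of FFT, which falls within the 3-5 range quoted above.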
## **Data Alignment Problem!**

- H.264 Intra-prediction has 9 different prediction modes
- Each prediction mode requires a specific permutation

#### **Summary**

- Conclusion about 4G and H.264
  - Lots of different sized parallelism
    - From 4 wide to 96 wide to 1024 wide SIMD
    - Which means many different SIMD widths need to be supported
  - Very short-lived values
  - Lots of potential for instruction fusion
  - Limited set of shuffle patterns required for each kernel

## **AnySP Design**

#### **Traditional SIMD Architectures**

32-Wide SIMD with Simple Shuffle Network

## **AnySP Architecture – High Level**

## **Multi-Width Support**

Each 8-wide SIMD group works on different memory locations of the same 8-wide code (AGU offsets).

## **AnySP FFU Datapath**

1. Exploit pipeline parallelism by joining two lanes together
2. Handle register bypass and the temporary buffer
3. Join multiple pipelines to process deeper subgraphs
4. Fuse instruction pairs

## **AnySP Results**

#### **Simulation Environment**

- Traditional SIMD architecture comparison
  - SODA at 90nm technology
- AnySP
  - Synthesized at 90nm TSMC
  - Power, timing, and area numbers were extracted
- Performance and power for each kernel were generated using synthesized data on an in-house simulator
- 4G based on an NTT DoCoMo 4G test setup
- H.264 4CIF@30fps

### **AnySP Speedup vs SIMD-based Architecture**

For all benchmarks, AnySP performs more than 2x better than a SIMD-based architecture.

#### **AnySP Energy-Delay vs SIMD-based Architecture**

More importantly, energy efficiency is much better!
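For readers unfamiliar with the energy-delay metric, the following minimal sketch shows why a speedup matters more than linearly. It uses the standard definition (energy multiplied by delay) with purely hypothetical numbers that are not measurements from the slides; `energy_delay_product` is an illustrative helper name.

```python
def energy_delay_product(power_w, delay_s):
    """Energy-delay product: energy (power * delay) multiplied by delay."""
    energy_j = power_w * delay_s
    return energy_j * delay_s


# Hypothetical example, NOT data from the slides: two designs drawing the
# same power, where the second finishes the same work in half the time.
baseline = energy_delay_product(power_w=1.0, delay_s=2.0)
faster = energy_delay_product(power_w=1.0, delay_s=1.0)
improvement = baseline / faster

if __name__ == "__main__":
    print(f"Energy-delay improvement: {improvement:.1f}x")
```

Under these assumptions, halving the delay at equal power improves the energy-delay product by 4x, because delay enters the metric twice.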
## **AnySP Power Breakdown**

| Group | Component | Units | Area (mm²) | Area (%) | Power, 4G + H.264 Decoder (mW) | Power (%) |
|-------|-----------|-------|------------|----------|--------------------------------|-----------|
| PE | SIMD Data Mem (32KB) | 4 | 9.76 | 38.78% | 102.88 | 7.24% |
| PE | SIMD Register File (16x1024bit) | 4 | 3.17 | 12.59% | 299.00 | 21.05% |
| PE | SIMD ALUs, Multipliers, and SSN | 4 | 4.50 | 17.88% | 448.51 | 31.58% |
| PE | SIMD Pipeline+Clock+Routing | 4 | 1.18 | 4.69% | 233.60 | 16.45% |
| PE | SIMD Buffer (128B) | 4 | 0.82 | 3.25% | 84.09 | 5.92% |
| PE | SIMD Adder Tree | 4 | 0.18 | <1% | 10.43 | <1% |
| PE | Intra-processor Interconnect | 4 | 0.94 | 3.73% | 93.44 | 6.58% |
| PE | Scalar/AGU Pipeline & Misc. | 4 | 1.22 | 4.85% | 134.32 | 9.46% |
| System | ARM (Cortex-M3) | 1 | 0.6 | 2.38% | 2.5 | <1% |
| System | Global Scratchpad Memory (128KB) | 1 | 1.8 | 7.15% | 10 | <1% |
| System | Inter-processor Bus with DMA | 1 | 1.0 | 3.97% | 1.5 | <1% |
| Total | 90nm (1V @ 300MHz) | | 25.17 | 100% | 1347.03 | 100% |
| Est. | 65nm (0.9V @ 300MHz) | | 13.14 | | 1091.09 | |
| Est. | 45nm (0.8V @ 300MHz) | | 6.86 | | 862.09 | |

We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm.

#### **Conclusion & Future Work**

- Conclusion
  - We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform
  - Under the power budget and meeting the performance at 45nm
- Future and Ongoing Work
  - Application-specific language
  - Larger class of algorithms for AnySP
  - Better utilization of resources for non-parallel kernels
  - Speed up sequential parts
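The 65nm and 45nm power estimates in the power breakdown table are numerically consistent with scaling the 90nm total by the square of the supply voltage at a fixed 300MHz clock. The slides do not state how the estimates were derived, so the sketch below is only a sanity check under that assumption; `scale_dynamic_power` is a hypothetical helper, not something from the original work.

```python
def scale_dynamic_power(power_mw, v_from, v_to):
    """Scale power with the square of supply voltage at a fixed clock
    frequency (assumed model; the slides do not state their method)."""
    return power_mw * (v_to / v_from) ** 2


# Total power at 90nm, 1V, 300MHz, taken from the power breakdown table.
POWER_90NM_MW = 1347.03

est_65nm = scale_dynamic_power(POWER_90NM_MW, v_from=1.0, v_to=0.9)
est_45nm = scale_dynamic_power(POWER_90NM_MW, v_from=1.0, v_to=0.8)

if __name__ == "__main__":
    print(f"65nm @ 0.9V: {est_65nm:.2f} mW")
    print(f"45nm @ 0.8V: {est_45nm:.2f} mW")
```

Under this model the results agree with the table's 1091.09 mW and 862.09 mW entries to within rounding, and the 45nm figure stays below the 1 Watt budget. The area estimates do not follow from this calculation.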