• Higher throughput requirement
• Energy efficiencies ~1 TOPS/W
• Flexibility needed for seamless mobility across LTE, Wi-Fi and WiMAX
• Flexible hardware is less energy efficient than custom hardware
Standard-agnostic terminals enabled by software defined radio.
Need to reduce the area and power penalty of flexible portions of the radio.
[Figure: Energy efficiency (Mops/mW, i.e. Gops/W) versus flexibility]
• FPGA: 10 – 100 Gops/W
• DSP: 1 – 10 Gops/W
• GPP: 0.1 – 1 Gops/W
Rabaey, J.M., "Wireless beyond the third generation-facing the energy challenge," Low Power Electronics and Design, International Symposium on, pp. 1-3, 2001.
[Block diagram: Analog Front-End → Programmable ΣΔ ADC → Flexible Digital Front-end → SDR Baseband]
A “fixed digitization bandwidth” relaxes the flexibility requirement of the analog front-end in multistandard radios.
[Block diagram: Analog Front-End → Programmable ΣΔ ADC → Flexible Digital Front-end → SDR Baseband]
• Highly computationally intensive
• Operates on highly oversampled ADC output
• Needs to be implemented using a flexible ASIC HW accelerator
[Figure: Single Mode HW Accelerator, with Area and Power labels]
Behavioral Optimizations
• Constant propagation
• Common subexpression elimination
• Operator strength reduction
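As an illustration of operator strength reduction for a fixed-coefficient filter, the sketch below replaces a multiplication by a constant with shifts and adds/subtracts. The function names are hypothetical, and the canonical signed digit recoding is an assumption; the slides do not say which recoding the accelerator uses.

```python
def shift_add_terms(coeff):
    """Return (sign, shift) pairs of a signed-digit recoding of an integer constant.

    The recoding is the non-adjacent (canonical signed digit) form, so no two
    consecutive bit positions are both nonzero.
    """
    terms = []
    shift = 0
    c = coeff
    while c != 0:
        if c & 1:
            # c mod 4 == 1 -> digit +1, c mod 4 == 3 -> digit -1,
            # which leaves a remainder divisible by 4.
            digit = 2 - (c & 3)
            terms.append((digit, shift))
            c -= digit
        c >>= 1
        shift += 1
    return terms


def multiply_by_constant(x, coeff):
    """Compute x * coeff using only shifts, additions and subtractions."""
    return sum(sign * (x << shift) for sign, shift in shift_add_terms(coeff))


# 7 = 8 - 1, so one subtraction replaces a full multiplier.
print(shift_add_terms(7))
print(multiply_by_constant(5, 7))
```

In a hardwired filter stage each coefficient is known at design time, so every multiplier can be reduced in this way; a programmable stage cannot use the trick because its coefficients change at run time.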
[Figure: FIR filter datapath. Data and coefficient register files, addressed over the data address bus and coefficient address bus, feed multiply-add units separated by z^-1 delay elements; input x(n), output y(n).]
... IN DIFFERENT MULTIMODE HARDWARE ACCELERATORS
In order of finer granularity of reuse:
• Velcro based multimode accelerator: no reuse
• Multimode ASIC and filter coprocessor: reuse of coarse-grained datapath operators
• Fine-grained reconfigurable fabric: reuse of fine-grained bit-level operators
[Figure: reconfiguration data of the multimode accelerators, arranged by finer granularity of reuse (Velcro based multimode accelerator, multimode ASIC, filter coprocessor, fine-grained reconfigurable fabric). Annotations: "Reconfiguration data is a function of the filter length"; "Low reconfiguration data".]
[Figure: dynamic power consumption of the multimode accelerators, arranged by finer granularity of reuse (Velcro based multimode accelerator, multimode ASIC, filter coprocessor, fine-grained reconfigurable fabric). Annotations: "High dynamic power consumption"; "Low dynamic power consumption"; "Lowest dynamic power consumption".]
• The area, power and reconfiguration latency overheads need to be minimized without compromising on the flexibility to support a new specification.
• Identify opportunities for reusing hardware at coarser levels of granularity across multiple standards, with low parameterization overheads.
• The reused hardwired functional blocks should not be a bottleneck for supporting a new standard.
• Use area/power optimizations to minimize the overheads associated with functional blocks which demand a high degree of flexibility.
• The coprocessor approach and the multimode ASIC approach reuse coarse-grained datapath operators: adders, multipliers, registers, MAC units.
• All the functionally different channelization tasks of filtering, sample rate conversion, interference attenuation and pulse shaping can be simultaneously performed by a multistage decimation filter.
• The filter stages in a multistage decimation filter represent a coarser granularity level for investigating hardware reuse than simple datapath operators.
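Consider an arbitrary factorization of the SRC factor, $M_j$:
$M_j = m^j_1 \times m^j_2 \times m^j_3 \times \dots \times m^j_{n_j}$
[Figure: Multistage decimation filter for the jth standard. The ΣΔ ADC output at sample rate $M_j F_j$ passes through the stages $H^j_1(z), H^j_2(z), \dots, H^j_{n_j}(z)$, which decimate by $m^j_1, m^j_2, \dots, m^j_{n_j}$, down to the symbol rate $F_j$.]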
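The factorization above can be explored with a short sketch that enumerates the ordered integer factorizations of an SRC factor and lists the input rate and OSR seen by each stage. The function names, the SRC factor of 12 and the 1 MHz symbol rate are illustrative assumptions, not values from the slides.

```python
from math import prod


def ordered_factorizations(m):
    """All ordered factorizations of m into integer factors >= 2."""
    if m == 1:
        return [[]]
    result = []
    for f in range(2, m + 1):
        if m % f == 0:
            result += [[f] + rest for rest in ordered_factorizations(m // f)]
    return result


def stage_rates(symbol_rate, factors):
    """Input rate and input OSR (relative to the symbol rate) of each stage."""
    rate = symbol_rate * prod(factors)  # ADC sample rate M_j * F_j
    rows = []
    for m in factors:
        rows.append({"decimate_by": m,
                     "input_rate": rate,
                     "input_osr": rate // symbol_rate})
        rate //= m  # exact, because rate is a multiple of m by construction
    return rows


for factors in ordered_factorizations(12):
    print(factors, [row["input_osr"] for row in stage_rates(1_000_000, factors)])
```

Each factorization places its stages at different OSRs, which is what makes the choice of factorization a design variable when looking for stages that several standards can share.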
the decimation filter of any standard which needs:
• A decimation-by-‘p’ stage at an OSR of ‘pq’
• Required stopband attenuation for the above stage less than or equal to Amax
Can we manipulate the factorization of the SRC factor to exploit the above observation?
The fixed factorization method factorizes the SRC factor in a manner which maximizes the number of filter stages that operate at the same OSR and decimate by the same factor for different standards.
$M_j = K_j \times m_1 \times m_2 \times \dots \times m_{n-1} \times m_n$
In $M_j = K_j \times m_1 \times m_2 \times \dots \times m_{n-1} \times m_n$, the factor $K_j$ is a standard-dependent rational factor:
• Integral load: CIC filter
• Fractional load: transposed Farrow filter
• Weakly parameterizable
In $M_j = K_j \times m_1 \times m_2 \times \dots \times m_{n-1} \times m_n$, the factors $m_1, \dots, m_{n-1}$ are fixed integral factors, common to all standards:
• Hardwired FIR filter stages
• No reconfiguration overheads
In $M_j = K_j \times m_1 \times m_2 \times \dots \times m_{n-1} \times m_n$, the last factor $m_n$ is a fixed integral factor, common to all standards:
• Programmable FIR filter
• Incurs reconfiguration latency, area and power penalties
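A minimal sketch of the fixed factorization: given the fixed integral factors shared by all standards, the standard-dependent factor $K_j$ is whatever remains of each SRC factor $M_j$. The function name, the fixed factors (2, 2, 2) and the SRC factors 32 and 20 are hypothetical values chosen only for illustration.

```python
from fractions import Fraction
from math import prod


def standard_dependent_factor(src_factor, fixed_factors):
    """K_j = M_j / (m_1 * m_2 * ... * m_n), kept as an exact rational."""
    return Fraction(src_factor) / prod(fixed_factors)


FIXED_FACTORS = (2, 2, 2)                              # hypothetical m_1 .. m_n
SRC_FACTORS = {"standard A": 32, "standard B": 20}     # hypothetical M_j values

for name, m_j in SRC_FACTORS.items():
    k_j = standard_dependent_factor(m_j, FIXED_FACTORS)
    kind = "integral" if k_j.denominator == 1 else "fractional"
    print(name, "K_j =", k_j, "(" + kind + ")")
```

Because the product of the fixed factors is the same for every standard, the fixed stages always see the same OSR at their input; only the first, weakly parameterizable stage has to absorb the standard-dependent rate change.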
Design                                Area
802.11a                               154812
WCDMA                                 232752
WiMAX                                 210656
Velcro Approach                       864205
Proposed Multistandard Accelerator    299701
• Synthesis results obtained from implementation using a TSMC 0.18 μm process
• Nearly 65% reduction in area compared to a Velcro approach for 4 standards.
• The percentage area reduction can be expected to increase with an increasing number of supported standards.
• The fixed and weakly parameterizable portions of the architecture need to be designed for the worst-case attenuation requirements.
• The paradigm is scalable to an arbitrary number of standards with low reconfiguration overheads.
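The quoted reduction can be checked directly from the synthesis areas reported for the Velcro approach (864205) and the proposed multistandard accelerator (299701). The variable names below are illustrative; the area units are whatever the synthesis report used, which the slides do not state.

```python
VELCRO_AREA = 864205      # Velcro approach, from the synthesis results
PROPOSED_AREA = 299701    # proposed multistandard accelerator

# Fraction of the Velcro area that the proposed accelerator saves.
reduction = 1 - PROPOSED_AREA / VELCRO_AREA
print(f"Area reduction: {reduction:.1%}")
```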
• Programmability in the last stage filter necessitates the use of generic MAC units.
• The last stage filter can be implemented as a time-shared MAC FIR filter.
• Power reduction strategies for time-shared MAC based FIR filters have generally focused on reducing the switching activity.
• In nanoscale CMOS technologies, the leakage power also needs to be taken into account.
Supply voltage (VDD) and threshold voltage (Vth)
• Subthreshold leakage power: increases exponentially with reduced Vth.
• Gate leakage power: increases exponentially with increased VDD.
• Dynamic power: increases quadratically with increased VDD.
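The dependences above can be made concrete with a first-order sketch. It assumes the textbook models P_dyn ∝ VDD² (at fixed frequency and capacitance) and P_sub ∝ exp(−Vth / (n·vT)); the function names, the subthreshold slope factor n = 1.5 and the thermal voltage of 26 mV are illustrative assumptions, not values from the slides. Gate leakage is left out because its exponential dependence on VDD needs a technology-specific fitting constant.

```python
import math


def dynamic_power_ratio(vdd_new, vdd_old):
    """Ratio of dynamic power after a supply change, assuming P ~ VDD^2."""
    return (vdd_new / vdd_old) ** 2


def subthreshold_leakage_ratio(vth_new, vth_old, n=1.5, thermal_voltage=0.026):
    """Ratio of subthreshold leakage after a threshold change,
    assuming P ~ exp(-Vth / (n * vT))."""
    return math.exp(-(vth_new - vth_old) / (n * thermal_voltage))


# Halving VDD quarters the dynamic power under this model.
print(dynamic_power_ratio(0.9, 1.8))
# Raising Vth by 100 mV cuts subthreshold leakage by more than 10x here.
print(subthreshold_leakage_ratio(0.4, 0.3))
```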
• Increasing the number of fixed MAC units results in a reduced operating frequency while maintaining the same throughput.
• Reduced operating frequency translates to increased timing slack in the critical paths.
• Timing slack can be exploited for increasing Vth or reducing VDD.
• Subthreshold leakage power: has a linear dependence on total gate width.
• Gate leakage power: has a linear dependence on total gate width.
• Dynamic power: has a linear dependence on the total physical capacitance.
• Total gate width and total physical capacitance are strongly correlated to the total circuit area.
• Parallelism trades increased area for a lower operating frequency and increased timing slack.
• Increased timing slack can be traded for lower VDD and increased Vth, and hence reduced total power consumption.
• The increased area penalty of parallelism lowers the possible reduction in total power consumption.
Area-slack efficiency = (amount of timing slack increment) / (amount of area increment)
[Figure: Time-shared direct form FIR filter with M MAC units (MAC-1, MAC-2, ..., MAC-M) clocked at $N f_s / M$. Data and coefficient register files, addressed over the data address bus and coefficient address bus, feed the MAC units; input x(n/fs), output y(n/fs).]
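A behavioral sketch of a time-shared MAC FIR filter follows: N taps are multiplexed onto M MAC units, so each MAC unit performs N/M multiply-accumulates per output sample and must be clocked at N·fs/M. The function name and the way taps are assigned to MAC units are assumptions made for illustration; the register-file addressing and delay chain of the actual datapath are not modeled.

```python
def time_shared_fir(coeffs, samples, num_macs):
    """Filter `samples` with `coeffs`, sharing the taps across `num_macs` MAC units.

    Returns the output samples and the number of MAC clock cycles per output.
    """
    n = len(coeffs)
    assert n % num_macs == 0, "taps must divide evenly across the MAC units"
    cycles_per_output = n // num_macs
    history = [0] * n          # most recent input sample first
    outputs = []
    for sample in samples:
        history = [sample] + history[:-1]
        accumulators = [0] * num_macs
        for cycle in range(cycles_per_output):
            for mac in range(num_macs):
                # Hypothetical tap assignment: MAC unit `mac` owns a
                # contiguous block of N/M taps.
                tap = mac * cycles_per_output + cycle
                accumulators[mac] += coeffs[tap] * history[tap]
        outputs.append(sum(accumulators))
    return outputs, cycles_per_output


outputs, cycles = time_shared_fir([1, 2, 3, 4], [1, 0, 0, 0, 2], num_macs=2)
print(outputs, cycles)
```

Doubling the number of MAC units halves the cycles needed per output, which is the frequency reduction that the timing-slack argument relies on.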
Cycle period of an M-MAC based time-shared direct form filter of length N:
$T_M = \frac{M}{N f_s}$
Extra timing slack obtained by adding P MAC units, each of area $A_m$:
$T_{M+P} - T_M = \frac{P}{N f_s}$
Area-slack efficiency of a time-shared direct form filter:
$E_{DF} = \frac{P}{N f_s} \cdot \frac{1}{P A_m} = \frac{1}{N f_s A_m}$
Can we design filter structures that have a higher area-slack efficiency?
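The direct form expressions can be checked numerically with exact rational arithmetic. The function names and the numeric values (64 taps, 1 MHz sample rate, a MAC area of 100 units) are illustrative assumptions.

```python
from fractions import Fraction


def cycle_period_df(num_macs, taps, fs):
    """T_M = M / (N * f_s) for an M-MAC time-shared direct form filter."""
    return Fraction(num_macs, taps * fs)


def area_slack_efficiency_df(taps, fs, mac_area, num_macs=1, extra_macs=1):
    """Timing slack gained per unit of added area when adding `extra_macs` MAC units."""
    slack = (cycle_period_df(num_macs + extra_macs, taps, fs)
             - cycle_period_df(num_macs, taps, fs))
    return slack / (extra_macs * mac_area)


TAPS, FS, MAC_AREA = 64, 1_000_000, 100   # hypothetical values
# The efficiency does not depend on M or P: it is always 1 / (N * f_s * A_m).
print(area_slack_efficiency_df(TAPS, FS, MAC_AREA, num_macs=2, extra_macs=3))
print(Fraction(1, TAPS * FS * MAC_AREA))
```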
• Algorithmic strength reduction: FFA (fast FIR algorithm) structures have a lower number of expensive MAC operations at the cost of increased add operations.
• FFA structures can be derived by exploiting the redundancies in the FIR subfilters of a K-parallel FIR filter.
[Figure: parallel FIR structure, with an input labeled x(2k+1)]
Each FIR subfilter is of length N/2.
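A minimal sketch of the 2-parallel case, assuming the standard polyphase identity Y1 = (H0 + H1)(X0 + X1) − H0·X0 − H1·X1: three subfilters of length N/2 replace the four that a plain 2-parallel filter needs, at the cost of extra pre- and post-additions. The function names are hypothetical, and the code works on whole blocks rather than modeling a streaming datapath.

```python
def convolve(a, b):
    """Plain linear convolution, used both as subfilter and as reference."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def ffa_2_parallel(h, x):
    """Filter x with h using the 2-parallel FFA (three length-N/2 subfilters)."""
    assert len(h) % 2 == 0 and len(x) % 2 == 0, "even lengths assumed"
    h0, h1 = h[0::2], h[1::2]          # polyphase components of the filter
    x0, x1 = x[0::2], x[1::2]          # polyphase components of the input
    u = convolve(h0, x0)                                   # H0 * X0
    v = convolve(h1, x1)                                   # H1 * X1
    w = convolve([a + b for a, b in zip(h0, h1)],
                 [a + b for a, b in zip(x0, x1)])          # (H0+H1) * (X0+X1)
    # Odd-indexed outputs: H0*X1 + H1*X0, recovered without a fourth subfilter.
    y1 = [wi - ui - vi for wi, ui, vi in zip(w, u, v)]
    # Even-indexed outputs: H0*X0 plus H1*X1 delayed by one low-rate sample.
    y0 = [0] * (len(u) + 1)
    for i, ui in enumerate(u):
        y0[i] += ui
    for i, vi in enumerate(v):
        y0[i + 1] += vi
    y = []
    for even, odd in zip(y0, y1 + [0]):
        y += [even, odd]
    return y[:len(h) + len(x) - 1]


h = [1, -2, 3, 4, 5, -6]
x = list(range(1, 11))
print(ffa_2_parallel(h, x) == convolve(h, x))
```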
• Time-shared FFA structures can be obtained by implementing each of the FIR subfilters as a time-shared FIR filter, while implementing the irregular addition network in parallel.
• A KxK FFA structure of an N-tap filter has $S_K$ subfilters of length N/K, and $A_K$ postprocessing/preprocessing adders.
• The notation KxK|L is used to indicate a structure in which each of the $S_K$ subfilters is multiplexed onto L MAC units.
Input sample rate of the subfilters in a KxK FFA is $f_s / K$.
Cycle period of the MAC units in a KxK|L FFA:
$T_{K \times K | L} = \frac{K^2 L}{N f_s}$
Extra timing slack obtained by adding P MAC units in each of the $S_K$ subfilters:
$T_{K \times K | (L+P)} - T_{K \times K | L} = \frac{K^2 P}{N f_s}$
Area-slack efficiency of a KxK FFA structure:
$E_{FFA} = \frac{K^2 P}{N f_s} \cdot \frac{1}{S_K P A_m} = \frac{K^2}{S_K} \cdot \frac{1}{N f_s A_m} = \frac{K^2}{S_K} E_{DF}$
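The ratio $E_{FFA} / E_{DF} = K^2 / S_K$ can be checked with exact rational arithmetic. The function names and numeric values are illustrative assumptions, and $S_K = 3$ for K = 2 is the subfilter count of the standard 2-parallel FFA rather than a figure quoted on the slides.

```python
from fractions import Fraction


def cycle_period_ffa(k, macs_per_subfilter, taps, fs):
    """T_{KxK|L} = K^2 * L / (N * f_s)."""
    return Fraction(k * k * macs_per_subfilter, taps * fs)


def efficiency_ffa(k, num_subfilters, taps, fs, mac_area, base=1, extra=1):
    """Slack gained per unit area when each subfilter gets `extra` more MAC units."""
    slack = (cycle_period_ffa(k, base + extra, taps, fs)
             - cycle_period_ffa(k, base, taps, fs))
    return slack / (num_subfilters * extra * mac_area)


def efficiency_df(taps, fs, mac_area):
    """E_DF = 1 / (N * f_s * A_m) for the time-shared direct form filter."""
    return Fraction(1, taps * fs * mac_area)


TAPS, FS, MAC_AREA = 64, 1_000_000, 100   # hypothetical values
ratio_2x2 = (efficiency_ffa(2, 3, TAPS, FS, MAC_AREA)
             / efficiency_df(TAPS, FS, MAC_AREA))
print(ratio_2x2)  # K^2 / S_K with K = 2, S_K = 3
```

Any structure with $S_K < K^2$ subfilters has a ratio above one, so each unit of added MAC area buys more timing slack than it would in the direct form.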
• MAC units in the FFA based time-shared structures have a greater timing slack than those of a time-shared direct form filter with the same number of MAC units.
• Adding MAC units to a time-shared FFA structure offers a greater timing slack increment than adding the same number of MAC units to a time-shared direct form structure.
• Proposed a design strategy for efficient implementation of a channelization accelerator for a flexible mobile radio.
• The design has a high degree of hardware reuse across multiple standards.
• The proposed design strategy is scalable to an arbitrary number of standards.
• We have investigated strategies for reducing the area/power penalty of the last stage programmable filter.
• Moy and Jacques Palicot, “Flexibility and reusability in the digital front-end of cognitive radio terminals,” Circuits, Systems and Signal Processing Journal, Springer, accepted in August 2010.
• Navin Michael, Christophe Moy, A. P. Vinod and Jacques Palicot, “Area-Power tradeoffs for flexible filtering in green radios,” Journal of Communications and Networks, vol. 12, no. 2, pp. 158-167, April 2010.
International Conferences
• Christophe Moy, Wassim Jouini, Navin Michael, “Cognitive Radio Equipments Supporting Spectrum Agility,” International Workshop on Cognitive Radio and Advanced Spectrum Management (CogART 2010), Italy, 7-10 November 2010.
• Navin Michael, A. P. Vinod, Christophe Moy and Jacques Palicot, “Low power, flexible FIR filters in the digital front-end of green radios,” Proceedings of IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Istanbul, Turkey, September 2010.
• Navin Michael, A. P. Vinod, Christophe Moy and Jacques Palicot, “Area-efficient time-shared FIR filters in nanoscale CMOS,” Proceedings of IEEE International Conference on Green Circuits and Systems, Shanghai, China, June 2010.
• Navin Michael, A. P. Vinod, Christophe Moy and Jacques Palicot, “Design paradigm for standard agnostic channelization in flexible mobile radios,” Proceedings of IEEE International Symposium on Circuits and Systems, Paris, France, May-June 2010.
• Navin Michael, A. P. Vinod, Christophe Moy and Jacques Palicot, “Design of low power multimode time-shared filters,” Proceedings of 7th IEEE International Conference on Information, Communications and Signal Processing, pp. 1-5, Macau, December 2009.
• Navin Michael and A. P. Vinod, “Reconfigurable architecture for arbitrary sample rate conversion in software defined radios,” Proceedings of 19th IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, pp. 1-6, Cannes, France, September 2008.