June 6, 2011

Error Resilient Processor Design

Sandip Kundu, Department of Electrical & Computer Engineering, University of Massachusetts, Amherst, MA, USA

Abstract: Relentless advancement in process technology during the last four decades has led to processor designs with progressively higher transistor counts and increased clock frequencies. However, sustaining this explosive growth in on-chip device count is predicted to be difficult due to yield and reliability problems. We showed earlier, through architectural performance evaluation, that it does not make sense to add dedicated redundancy for floating-point and integer division instructions that consume large amounts of resources. Instead, we proposed a shared-resource approach for multicore environments. This concept has since been validated in a recently announced product in which multiple cores share a common FP unit. In this talk we will describe a set of solutions for the general problem of resilient processor design, namely (i) functional error detection schemes to identify failures, (ii) isolation techniques to contain such failures, and (iii) a graceful degradation mechanism that has negligible impact on the area and power of the processor. Results show that a system can degrade gracefully in the presence of defects, with a 5-15% performance degradation.
 
About the Speaker: Sandip Kundu is a Professor at the University of Massachusetts at Amherst. Prior to joining academia, he spent 17 years in industry: first as a Research Staff Member at IBM Research in Yorktown Heights and then at Intel Corporation as a Principal Engineer. He has published more than 170 papers, holds several key patents, and has given more than a dozen tutorials at various conferences.