Soft Error Analysis Toolset (SEAT) Development
Participants:
- V. Narayanan, Professor of Computer Science and Engineering
- M. J. Irwin, Professor of Computer Science and Engineering
- Kenan Ünlü, Professor of Mechanical and Nuclear Engineering
- Y. Xie, Professor of Computer Science and Engineering
- S. Çetiner, Ph.D. Student of Mechanical and Nuclear Engineering
- V. Degalahal, Ph.D. Student of Computer Science and Engineering
- F. Alim, Ph.D. Student of Mechanical and Nuclear Engineering
Services Provided:
- Neutron Beam Laboratory
Sponsors:
- Radiation Science and Engineering Center
- Department of Computer Science and Engineering
Introduction
Soft errors, or single event effects (SEE), are transient circuit errors caused by excess charge carriers induced primarily by external radiation. Radiation may, directly or indirectly, induce localized ionization that can flip the stored values of memory cells. The major radiation source behind this temporary malfunction in semiconductor devices is cosmic rays.
Figure 1. A 65-nm DRAM (left) and a schematic that conceptualizes the soft error phenomenon (right): electron-hole pairs created through ionization by radiation may be drawn to node terminals before they recombine in the substrate, causing a transient glitch at the device node. This temporary pulse may flip the internal state of the memory bit.
Figure 2. Integrated circuits (ICs) are becoming a major component of modern societies.
Cosmic ray particles can either toggle the state of memory elements or create unwanted glitches in combinational logic that may be latched by memory elements. As supply voltages decrease and feature sizes shrink in future technologies, soft error tolerance is considered a significant challenge for designing future electronic systems. For example, a 1 GB memory system based on 64 Mbit DRAMs has a combined error rate of 3435 FIT (failures in 10⁹ hours of operation) when using single error correction and double error detection. An even higher soft error rate of 4000 FIT was reported for a typical processor, with approximately half of the errors affecting the processor core and the rest affecting the cache. Such errors also affect the fast-growing FPGA (Field Programmable Gate Array) segment.
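To put the FIT figures quoted above in more familiar units, the short sketch below converts a FIT rate into a mean time to failure and into an expected number of failures for a fleet of devices. It assumes a constant failure rate; the function names and the fleet size of 1000 systems are illustrative choices, not part of the SEAT.

```python
HOURS_PER_FIT_WINDOW = 1e9   # one FIT = one failure per 10^9 device-hours
HOURS_PER_YEAR = 24 * 365    # 8760 h, ignoring leap years


def fit_to_mttf_hours(fit: float) -> float:
    """Mean time to failure in hours, assuming a constant failure rate in FIT."""
    if fit <= 0:
        raise ValueError("FIT must be positive")
    return HOURS_PER_FIT_WINDOW / fit


def expected_failures(fit: float, units: int, hours: float) -> float:
    """Expected failure count for `units` identical devices over `hours`."""
    return fit * units * hours / HOURS_PER_FIT_WINDOW


if __name__ == "__main__":
    # The two rates quoted in the text; the labels are only for printing.
    for label, fit in (("1 GB DRAM system", 3435.0), ("typical processor", 4000.0)):
        mttf_h = fit_to_mttf_hours(fit)
        print(f"{label}: {fit:.0f} FIT, MTTF {mttf_h:,.0f} h "
              f"({mttf_h / HOURS_PER_YEAR:.1f} years)")

    # Hypothetical fleet of 1000 memory systems observed for one year.
    fleet = expected_failures(3435.0, units=1000, hours=HOURS_PER_YEAR)
    print(f"Expected failures per year in a fleet of 1000: {fleet:.1f}")
```

Under these assumptions, 3435 FIT corresponds to a mean time to failure of roughly 33 years for a single memory system, yet to about 30 failures per year across 1000 such systems, which is why the rate matters for large installations.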
Because Earth’s atmosphere shields most cosmic ray particles from reaching the ground, and because the charge per circuit node used to be large, SEE in terrestrial devices were not a significant concern until recently. The galactic flux of primary cosmic rays (mainly protons) is very large, about 100,000 particles/m²·s, compared with the much lower final flux (mainly neutrons) at sea level of about 360 particles/m²·s [1]. Only a few of the galactic particles have adequate energy to penetrate Earth’s atmosphere. However, with continued scaling of feature sizes and the use of more complex systems, soft errors in terrestrial applications are becoming an increasing concern and have drawn attention since the late 1990s.
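The two flux values above can be turned into a rough sense of scale with a few lines of arithmetic. The sketch below computes the atmospheric attenuation factor and the number of particles crossing a chip per hour at sea level. The 1 cm² die area is an assumed example value, and the function names are illustrative.

```python
PRIMARY_FLUX = 100_000.0   # particles / (m^2 s), galactic primaries (from the text)
SEA_LEVEL_FLUX = 360.0     # particles / (m^2 s), at sea level (from the text)


def attenuation_factor(primary: float, ground: float) -> float:
    """Ratio of the primary flux to the flux that reaches the ground."""
    return primary / ground


def particles_through_area(flux_per_m2_s: float, area_cm2: float, seconds: float) -> float:
    """Number of particles crossing a flat area during the given time."""
    area_m2 = area_cm2 * 1e-4          # 1 cm^2 = 1e-4 m^2
    return flux_per_m2_s * area_m2 * seconds


if __name__ == "__main__":
    ratio = attenuation_factor(PRIMARY_FLUX, SEA_LEVEL_FLUX)
    print(f"Atmospheric attenuation factor: about {ratio:.0f}x")

    # Assumed die area of 1 cm^2, observed for one hour.
    hits = particles_through_area(SEA_LEVEL_FLUX, area_cm2=1.0, seconds=3600.0)
    print(f"Particles crossing 1 cm^2 per hour at sea level: about {hits:.0f}")
```

The count of particles crossing the die is an upper bound on the number of potential events; only a small fraction of them interact with the silicon in a way that upsets a node.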
The issue of SEE was first studied in the context of scaling trends of microelectronics in 1962 [2]. Interestingly, the forecast from this study, that SEE will impose the lower limit on supply voltage reduction, is shared by recent work from researchers at Intel [3]. However, most work on radiation effects since 1962 has focused on space applications rather than terrestrial applications.
There have been various documented failures due to soft errors, ranging from memories used in large servers and aircraft to implantable medical devices such as cardiac defibrillators [4]. A widely cited soft error episode involves L2 caches with no error correction or protection that caused Sun Microsystems’ flagship servers to crash suddenly and mysteriously [5]. This problem cost Sun Microsystems several customers. Errors in embedded devices such as cardiac defibrillators, which are becoming an integral part of our society, can be even more ominous. As computing systems become an indispensable part of critical applications ranging from medical implants to fly-by-wire aircraft, immunity against soft errors becomes more critical for society as a whole.
The importance of the soft error problem is evidenced by the large number of papers and articles published over the last decades. However, most researchers are impeded by a lack of access to realistic fault models and real soft error data. This limitation results from the confidentiality of soft error data for chips tested by semiconductor companies and from the limited access academics have to accelerated soft error testing facilities. Most commercial soft error testing in the U.S. is performed at the Los Alamos test facility, access to which is expensive and cumbersome because of the security clearances required.
The Impetus Behind the Soft Error Analysis Toolset (SEAT)
Radiation-induced SEE may seem to be easily solved through techniques such as radiation-hardened processing. Such countermeasures have traditionally and successfully been adopted to remedy radiation effects in space applications. However, they are not suitable for commercial manufacturers of terrestrial devices, as many of the solutions consume more power, reduce manufacturability, and severely degrade IC performance [6]. Even space applications are moving away from radiation-hardened process technology. For cost and performance reasons, they are using commercial off-the-shelf components that employ soft error protection techniques at the software and architecture levels. As a result, many researchers have been focusing on new soft error countermeasures ranging from the process level to the software level.
Advances in process technology, such as the adoption of silicon-on-insulator (SOI) and the elimination of boron-10 impurities, are expected to mitigate the soft error problem to a certain extent. However, solutions at higher levels are still essential for reliable operation of the computing system. Two gaps hinder researchers in their quest to tame the soft error problem: the lack of fault models that abstract the physical phenomena of soft errors accurately in a form accessible to computer engineers, and the absence of tools that analyze the effectiveness of soft error countermeasures.
There is an obvious need for a community resource for researchers and industrial practitioners studying radiation-induced SEE in computing systems. Existing tools either do not address the problem to its full extent or are proprietary to commercial entities, and are therefore not available to the research community. The SEAT will serve a critical purpose by providing researchers from electrical engineering, computer engineering, information sciences, and nuclear engineering with an open, modular, flexible, yet comprehensive tool.
The SEAT has emerged as a complementary tool that provides a theoretical foundation for the experimental radiation-induced soft error research conducted at the Penn State Breazeale Nuclear Reactor by the Mechanical and Nuclear Engineering and the Computer Science and Engineering Departments. More details can be found in [7] in this annual report. The experiments performed are compiled into an “accelerated soft error testing dataset”. Researchers can then try to reproduce these observations with the SEAT, or vice versa.
The strength of the SEAT is that it is built upon the combined expertise of computer and nuclear engineers. The SEAT hierarchy starts with modeling the ionization effects of particle strikes on semiconductor devices, and then creates higher-level abstractions of these effects for analysis at the circuit and architecture levels. This infrastructure will enable researchers working on circuit, architectural, and software countermeasures for soft errors to obtain a better perspective of the physical phenomena, and will help them tune their techniques accordingly. If the fault model used at the architecture or circuit level fails to model SEE accurately, the solutions proposed at higher levels of abstraction lose their value.
Implementation of the SEAT Code Bundle
The SEAT code bundle has a hierarchical structure, as shown in Figure 5. Information collected at each level is relayed to the next higher level for analysis.
Neutron transport constitutes the basis of the lowest-level analysis of the toolset. The creation of neutron-induced charged particles is simulated at this stage of the analysis. We chose MCNP5 to perform the neutron transport simulation. MCNP5 is a very powerful Monte Carlo code and comes with an extensive set of utilities. Figure 4 shows the geometry that we used for the multi-bit SEE analysis in memories. The transparent blue material corresponds to bulk silicon, and the red and yellow cubes are boron- and phosphorus-doped p and n wells, respectively.
The MCNP output is analyzed, and the reactions that ultimately create charged particles are collected. These ions are saved to a file in a format compatible with TRIM [8], the next code at this level of analysis.
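The hand-off described above can be pictured as a filter followed by a file writer. The sketch below is only an illustration of that idea: the `Product` record, the function names, and the column layout are hypothetical and represent neither the actual MCNP output format nor the actual TRIM input specification. The sample event uses the well-known products of thermal neutron capture in boron-10 (a 1.47 MeV alpha particle, a 0.84 MeV lithium-7 ion, and a 0.478 MeV gamma ray).

```python
from dataclasses import dataclass
from typing import Iterable, List, TextIO, Tuple
import io

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Product:
    """One reaction product (hypothetical record, not an MCNP data structure)."""
    name: str
    charge_number: int      # Z of the product; 0 for neutrons and photons
    energy_ev: float
    position_um: Vec3       # where the reaction occurred
    direction: Vec3         # direction cosines of the emitted particle


def charged_products(products: Iterable[Product]) -> List[Product]:
    """Keep only the products that can ionize the medium directly."""
    return [p for p in products if p.charge_number > 0]


def write_ion_table(ions: Iterable[Product], out: TextIO) -> int:
    """Write one ion per line in an illustrative column layout; return the count."""
    out.write("# id name Z energy_eV x_um y_um z_um cos_x cos_y cos_z\n")
    count = 0
    for count, ion in enumerate(ions, start=1):
        x, y, z = ion.position_um
        u, v, w = ion.direction
        out.write(f"{count} {ion.name} {ion.charge_number} {ion.energy_ev:.4e} "
                  f"{x:.4f} {y:.4f} {z:.4f} {u:.4f} {v:.4f} {w:.4f}\n")
    return count


def demo_event() -> List[Product]:
    """Products of one boron-10 capture event at an arbitrary example site."""
    site = (0.50, 0.25, 0.10)
    return [
        Product("alpha", 2, 1.47e6, site, (1.0, 0.0, 0.0)),
        Product("Li-7", 3, 0.84e6, site, (-1.0, 0.0, 0.0)),   # emitted back to back
        Product("gamma", 0, 0.478e6, site, (0.0, 1.0, 0.0)),  # dropped by the filter
    ]


if __name__ == "__main__":
    buffer = io.StringIO()
    written = write_ion_table(charged_products(demo_event()), buffer)
    print(buffer.getvalue(), end="")
    print(f"{written} ions written")
```

Keeping the filter and the writer separate means the same ion list could be handed to a different downstream code by swapping only the writer.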
TRIM is another Monte Carlo code that simulates ion transport in solid media. TRIM accepts a 3D ion distribution as input. Since it is given the same geometry as MCNP, it simulates the behavior of the ions that emerge from the neutron-induced reactions. The analysis of ion tracks shows where in the medium the energy is deposited. This is a first-order approximation to the ionization caused by charged particle transport, and is known as the Bragg approximation.
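Once the energy deposited along a track is known, it can be converted into generated charge, which is the quantity the device-level simulation needs. The sketch below assumes the commonly quoted average of 3.6 eV per electron-hole pair in silicon. The function names and the sample track segments are illustrative and are not taken from TRIM output.

```python
from typing import List, Tuple

ELEMENTARY_CHARGE_C = 1.602176634e-19   # coulombs per electron
EV_PER_PAIR_SI = 3.6                     # assumed mean energy per e-h pair in silicon


def deposited_charge_fc(energy_ev: float) -> float:
    """Charge of one carrier type, in femtocoulombs, generated by `energy_ev`."""
    pairs = energy_ev / EV_PER_PAIR_SI
    return pairs * ELEMENTARY_CHARGE_C * 1e15


def charge_profile_fc(segments: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Map (depth_um, energy_lost_ev) track segments to (depth_um, charge_fc)."""
    return [(depth, deposited_charge_fc(energy)) for depth, energy in segments]


if __name__ == "__main__":
    print(f"1 MeV deposited in silicon: {deposited_charge_fc(1e6):.1f} fC")

    # Made-up track: energy lost in successive 1 um slices, rising toward the
    # end of the range as in a Bragg curve.
    track = [(0.5, 2.0e5), (1.5, 2.5e5), (2.5, 3.5e5), (3.5, 6.7e5)]
    for depth_um, charge_fc in charge_profile_fc(track):
        print(f"depth {depth_um:.1f} um: {charge_fc:.2f} fC")
```

With this conversion, 1 MeV of deposited energy corresponds to roughly 44.5 fC.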
Once the list of ions induced by external radiation and the positions of the deposited charge are obtained, the phenomenon can be simulated at the device level. Figure 6 shows the output of a sample simulation performed with TCAD. In the figure, color contours represent the electron concentrations, and the line contours correspond to equipotential lines. The ion strike is initiated at time 1 psec. Before the ion strikes, the equipotentials are parallel to the junction. The sequence of plots shows that the charge column widens through the outward diffusion of carriers. The charge column also pinches off at the junction due to the collection of charge from the depletion region. At time 1 psec, the equipotentials begin to extend into the substrate due to the voltage drop along the charge column, forming a funnel. After some time, the funnel starts to collapse and the drift current starts to decrease as charge is swept away from the depletion region. At 100 psec, the depletion region is effectively restored and only diffusion charge collection occurs.
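For circuit-level analysis, a device-level transient of this kind is often abstracted as a double-exponential current pulse whose time integral equals the collected charge. The source does not state which abstraction the SEAT uses, so the sketch below should be read as one common choice rather than the toolset's method. The 40 fC charge and the 5 psec and 50 psec time constants are illustrative values only.

```python
import math


def double_exp_current(t_s: float, charge_c: float,
                       tau_rise_s: float, tau_fall_s: float) -> float:
    """Current (A) of a double-exponential pulse that integrates to `charge_c`."""
    if not 0.0 < tau_rise_s < tau_fall_s:
        raise ValueError("require 0 < tau_rise_s < tau_fall_s")
    if t_s < 0.0:
        return 0.0
    scale = charge_c / (tau_fall_s - tau_rise_s)
    return scale * (math.exp(-t_s / tau_fall_s) - math.exp(-t_s / tau_rise_s))


def collected_charge(charge_c: float, tau_rise_s: float, tau_fall_s: float,
                     t_end_s: float, steps: int = 10_000) -> float:
    """Trapezoidal integral of the pulse from 0 to `t_end_s`, in coulombs."""
    dt = t_end_s / steps
    total = 0.0
    for i in range(steps):
        left = double_exp_current(i * dt, charge_c, tau_rise_s, tau_fall_s)
        right = double_exp_current((i + 1) * dt, charge_c, tau_rise_s, tau_fall_s)
        total += 0.5 * (left + right) * dt
    return total


if __name__ == "__main__":
    q, tau_r, tau_f = 40e-15, 5e-12, 50e-12   # illustrative values
    for t_ps in (1, 10, 50, 100, 200):
        i_ua = double_exp_current(t_ps * 1e-12, q, tau_r, tau_f) * 1e6
        print(f"t = {t_ps:3d} psec: I = {i_ua:8.2f} uA")
    total_fc = collected_charge(q, tau_r, tau_f, t_end_s=2e-9) * 1e15
    print(f"Integrated charge over 2 nsec: {total_fc:.2f} fC")
```

A pulse like this can be injected at a circuit node in a circuit simulator. The normalization by the difference of the two time constants ensures that the total charge delivered matches the value obtained from the lower-level analysis.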
Conclusions and Future Work
The SEAT appears to fill a critical need, particularly in the research community. Even at its current stage, it has received a great deal of interest from both academia and industry. We received many positive comments at the conference where we first presented the SEAT [9], and many industry affiliates expressed interest in the tool.
Even though neutrons account for the majority of cosmic particles at sea level, the contribution of other particles, particularly protons, becomes dominant in the soft error problem at higher altitudes. For a more thorough analysis and wider applicability, other particle interactions should also be incorporated into the simulation. At this stage, the tool includes the proton flux of cosmic rays in the analysis. It accounts for the direct ionization effect of protons but not for their nuclear-level interactions with the medium. For better modeling of the physics, nuclear interactions of protons with the host nuclei must be accounted for, since these dominate proton-related single-event effects.
The problem can be tackled by using medium-, high-, and very-high-energy intranuclear cascade codes, which simulate reactions between the incident particle and the nucleons within the nucleus based on nucleon-to-nucleon cross sections. Adding a cascade code will extend the capability of MCNP, which, with the current cross section library, supports incident neutron energies up to 150 MeV. Cross sections for incident neutron energies beyond 150 MeV are extrapolated. The nuclear cascade codes, however, can simulate interactions for incident particle energies even beyond 1 GeV.
References
1. J. F. Ziegler, Terrestrial Cosmic Ray Intensities, IBM Journal of Research and Development, v. 40, no. 1, pp. 19-39, 1996.
2. J. Wallmark, S. Marcus, Minimum Size and Maximum Packing Density of Non-redundant Semiconductor Devices, Proceedings of IRE, v. 50, 1962.
3. S. Borkar, T. Karnik, V. De, Design and Reliability Challenges in Nanometer Technologies, Proceedings of 41st Design Automation Conference, San Diego, 2004.
4. P. D. Bradley, E. Normand, Single Event Upsets in Implantable Cardioverter Defibrillators, IEEE Trans. Nucl. Sci., v. 45, pp. 2929-2940, Dec. 1998.
5. Sun Microsystems, Soft Memory Errors and Their Effect on Sun Fire Systems, 2002.
6. P. E. Dodd, L. W. Massengill, Basic Mechanisms and Modeling of Single-event Upset in Digital Microelectronics, IEEE Trans. Nucl. Sci., v. 50, pp. 583-602, June 2003.
7. N. Vijaykrishnan, M. J. Irwin, K. Ünlü, V. Degalahal, S. M. Çetiner, Testing Neutron-induced Soft Errors in Semiconductor Memories, Breazeale Nuclear Annual Report, December 2004.
8. J. F. Ziegler, http://www.srim.org/
9. V. Degalahal, S. M. Çetiner, F. Alim, N. Vijaykrishnan, M. J. Irwin, K. Ünlü, SESEE: A Soft Error Simulation and Estimation Engine, 7th Int. Conf. on Military and Aerospace Programmable Logic Devices (MAPLD), September 8-10, 2004.