The Wright Flyer contained no electronics whatsoever. Sputnik I was quite simple, consisting of two transmitters based on tube amplifiers. Modern aerospace vehicles are far more capable than the two aforementioned but cannot do their job without sophisticated microchips. Ironically, the increased capability of electronics has also introduced a new form of susceptibility: Single Event Effects (SEE’s). What is this?
SEE’s are a very intricate interaction between electronics and small particles coming from very far away, some even from outside our galaxy. This is not science fiction; it is real. In fact, we will look at the operational experience from a Low Earth Orbit (LEO) satellite and the in-flight upset of an A330 off the coast of Australia. No worries: the phenomenon is understood quite well nowadays, and the industry requires SEE consideration for critical applications.
Particles – From where?
Space is a very hostile environment containing an abundance of ionizing radiation (subatomic particles) travelling at extreme velocities. Luckily, the Earth’s magnetic field and atmosphere provide significant shielding against this. Some particles are “trapped” in specific regions around the Earth, called the Van Allen belts. While the effects of radiation are generally smallest close to the Earth’s surface, stronger effects will be observed with increasing altitude or when penetrating the Van Allen belts. This affects both humans and electronics. So, where are these particles coming from?
Cosmic radiation
Galactic Cosmic Rays (GCR’s) mainly consist of nuclei of atoms produced outside of the solar system [1]. Energy levels can reach several hundred to several thousand MeV and the effects can be modulated by solar activity [1][2]. GCR’s are very relevant for space applications and high-altitude operations of aircraft but less significant for terrestrial applications, as the Earth’s atmosphere provides good protection.
Coronal mass ejections (CME’s)
Transient solar effects, producing protons and nuclei of heavy atoms with energies in the order of several MeV [1].
Solar wind
Steady flow of particles (mainly electrons and protons) with several keV of energy [1]. Interactions with the Earth’s atmosphere can sometimes be observed as the polar lights.
Alpha particles
These are equivalent to the nucleus of a Helium atom, consisting of two neutrons and two protons. They are mainly produced by radioactive decay processes of certain materials (e.g., Thorium, Radium, and Actinium), which can sometimes be found at very small trace levels in packaging materials of microchips [4]. These can also be relevant for terrestrial applications.
Neutrons
These particles are found frequently at commercial aircraft operating altitudes and lower, since they are often produced in the atmosphere through particle collisions. They can affect semiconductors in an indirect manner, when they hit atoms of the semiconductor material. Due to their neutral charge, neutrons are very difficult to shield against [4].
SEE distinction from Total Ionizing Dose (TID)
SEE’s differ fundamentally from dose-related radiation effects, which are cumulative and cause device degradation over time. A typical TID effect would be the slow decrease in gain of a transistor over the course of a space mission [4]. A chip that is susceptible to SEE could equally likely be affected on day one or day 1000 of the mission.
The probability of an external particle SEE in a certain component is independent of the component age.
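One way to make this statement concrete is to model SEE’s as a Poisson process with a constant upset rate. This is a simplifying assumption for illustration only (the real rate depends on the radiation environment), and the rate value and function names below are hypothetical. Under that assumption, the probability of an upset in the next 24 hours is the same for a brand-new component as for one that has already survived 999 days:

```python
import math

RATE_PER_HOUR = 1e-5  # hypothetical upset rate, for illustration only


def survival(hours, rate=RATE_PER_HOUR):
    """Probability of no upset during `hours` at a constant upset rate."""
    return math.exp(-rate * hours)


def upset_probability(age_hours, window_hours, rate=RATE_PER_HOUR):
    """Probability of at least one upset in the next `window_hours`,
    given that the component reached `age_hours` without an upset."""
    still_fine_later = survival(age_hours + window_hours, rate)
    still_fine_now = survival(age_hours, rate)
    return 1.0 - still_fine_later / still_fine_now


# Day 1 of the mission versus day 1000 of the mission.
p_day_1 = upset_probability(age_hours=0, window_hours=24)
p_day_1000 = upset_probability(age_hours=999 * 24, window_hours=24)
print(p_day_1, p_day_1000)
```

A cumulative effect such as TID would not behave this way: there, the chance of trouble grows with the time already spent in the radiation environment.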
On the other hand, there is a strong correlation of SEE susceptibility and the physical construction of a chip. Due to manufacturing tolerances, significant differences can occur between production batches and even single chips [4]. (Keep that in mind for the A330 example later on).
SEE susceptibility can vary greatly amongst a set of chips having the same part number.
From an aircraft certification standpoint, SEE’s are considered to be random, independent events and thus do not constitute a new form of common cause [5]. In the aviation world, SEE analysis is currently required for systems whose failure could lead to a condition classified as “hazardous” or “catastrophic” as per CS2X.1309 [5]. In the space engineering world, SEE analysis requirements are much more stringent, as the SEE rate can vary greatly depending on the region of operation, as we shall see later.
Particle interaction
The effect of particles penetrating semiconductors is rather complex and we will therefore only consider a simplified example. The interested reader is directed to [1] and [2] for more details.
If you are not familiar with semiconductors, a Metal Oxide Semiconductor Field Effect Transistor (MOSFET) is a very small electronic “switch”. The “gate” is the control for that “switch” and depending on the gate voltage, the “drain” and “source” are either connected to, or isolated from, each other. This and similar transistor technologies form the basis of modern electronics and are used for digital memories, processors, power switching and many more applications. Your smartphone alone typically contains more than a billion transistors!
When an ionizing particle passes through a semiconductor, it leaves a trace of electron-hole pairs behind. When this happens in a sensitive region, the electric field will cause a subsequent charge depletion [2]. The key to understanding this: In semiconductor materials, the “electron mobility” is much higher than the “hole mobility”. The effects of this can range from a short “spike” to a “bit flip” or even permanent damage [1].
SEE’s are therefore classified in different subcategories, depending on their effects, as depicted in Figure 4 below:
Not all SEE’s result in permanent damage. In fact, most of them do not. That is one of the reasons why it can be so difficult to trace SEE’s.
The typical SEE is some form of “misbehavior” of an electronic circuit during operation, which in a well-designed system will be immediately rectified and there will be no consequences.
In some cases, a unit may provide invalid data for a short period (as we will see in the A330 story), but a reset of that particular unit will usually clear the fault. Any subsequent maintenance investigation will result in “No Fault found” (NFF). Equally important: SEE’s affect both analog and digital circuits, but digital circuits are often more vulnerable [4].
SEE countermeasures
So, what to do about this? The bad news is that the problem tends to get worse as electronics get smaller and smaller. The good news is that the industry (led by the space and nuclear sector) has developed sophisticated SEE mitigation techniques and system design guidelines. In the following, we will look at some of these techniques:
Shielding
Adding some form of physical shielding is mainly used to protect against TID effects [1]. Weight is a significant problem here: while it may be acceptable to add heavy shielding around an electronic circuit in a nuclear power plant, the same would be unrealistic on a spacecraft. In fact, shielding works quite well against low energy particles, but can have counter-intuitive effects for heavy ions: If a high-energy heavy ion is slowed somewhat by a shielding layer, it may actually cause more damage to electronics than without shielding [1] [3]. This is related to an effect called “Bremsstrahlung” and the way particles interact with semiconductors [6]. Some research is also being carried out to investigate the use of magnetic field shielding on spacecraft.
Hardware modifications on chip
The aerospace industry currently uses a mix of special “radiation tolerant” Application Specific Integrated Circuits (ASIC’s) and Commercial Off-The-Shelf (COTS) chips, to keep the cost at a reasonable level [1]. The number of chips used in aerospace applications is simply too small to allow for specific modifications on all chips. For critical applications, some chip-level hardware modifications are used, such as Triple Modular Redundancy (TMR) or Quadruple Modular Redundancy (QMR) [1].
It is neither feasible, nor desirable to achieve SEE tolerance by modifying the chip layout alone.
As an example, let us look at TMR implementation on chip. The redundancy can be built in partially or fully, depending on the criticality and/or hardware limitations:
On many occasions, there is a physical limitation on the Input/Output (I/O) so TMR can only be implemented partially (left side in Figure 5). A full TMR implementation requires triplication of all parts involved and obviously adds significant spatial demand, cost, power consumption etc. to the system (Figure 5 right side).
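The voting step at the heart of TMR can be sketched in a few lines. This is a software model for illustration only: on a chip the voter is a logic circuit, and the names and values below are hypothetical. Each output bit takes the value that at least two of the three copies agree on, so a single upset copy is outvoted:

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 vote: each result bit is the value that
    at least two of the three inputs share."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single event upset by inverting one bit of `word`."""
    return word ^ (1 << bit)


stored = 0b10110010          # value held in three redundant registers
copy_a = stored
copy_b = flip_bit(stored, 5)  # one copy suffers a bit flip
copy_c = stored

voted = majority_vote(copy_a, copy_b, copy_c)
print(voted == stored)

# Limitation: the same bit upset in two copies defeats the vote.
double_hit = majority_vote(flip_bit(stored, 5), flip_bit(stored, 5), stored)
print(double_hit == stored)
```

The second case is one reason why the voted value is usually written back to all three copies before further upsets can accumulate.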
Software adaptations
With the widespread use of COTS hardware in aerospace applications, the implementation of resilience against hardware faults at software level has gained significant importance in recent years. There is a term for this: Software Implemented Hardware Fault Tolerance (SIHFT) [4]. Sometimes the term Error Detection And Correction (EDAC) is also used. Basic concepts include adding redundancy to the data (Hamming, Reed-Solomon coding), duplicating variables and storing them in different locations, and temporal redundancy [8]. Also, the periodic re-writing of configurations is sometimes applied, a process called “scrubbing” [1].
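To illustrate what “adding redundancy to the data” means, here is a minimal sketch of a Hamming(7,4) code, which protects four data bits with three parity bits and can correct any single flipped bit. It is a textbook example chosen for brevity, not the coding scheme of any particular flight system, and the function names are hypothetical:

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (list of bits, positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 is unused
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # parity over positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(codeword):
    """Return (nibble, syndrome). A non-zero syndrome is the position
    (1..7) of the single bit that was corrected."""
    code = [0] + list(codeword)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    if syndrome:
        code[syndrome] ^= 1             # correct the flipped bit
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome


word = hamming74_encode(0b1011)
word[4] ^= 1                            # simulate an upset at position 5
recovered, syndrome = hamming74_decode(word)
print(recovered == 0b1011, syndrome)
```

Two flipped bits in the same codeword would be miscorrected by this basic code, which is one motivation for scrubbing memory before errors accumulate.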
An example of temporal redundancy is shown in Figure 6. The same calculation is carried out multiple times, using the same hardware. To protect against specific “bit errors” in the processor, the data is encoded in a different way each time [8].
Note that with temporal redundancy, there is no additional hardware required, apart from the “sample and hold” voter implementation. The drawback is that the process gets slower with each additional computation.
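The idea can be sketched as follows. The encoding used here (shifting the operands by a different amount on each pass) is only one possible choice and is an assumption for illustration, as are the fault model and all names. A fault that always corrupts the same physical output bit then lands on a different logical bit in each pass, so a bitwise vote over the three passes can still recover the result:

```python
def faulty_add(x: int, y: int, fault_mask: int = 0) -> int:
    """Hypothetical adder model: bits set in `fault_mask` are inverted
    in every raw result, imitating a fault tied to fixed output bits."""
    return (x + y) ^ fault_mask


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 vote."""
    return (a & b) | (a & c) | (b & c)


def time_redundant_add(x: int, y: int, fault_mask: int = 0) -> int:
    """Run the same addition three times on the same 'hardware',
    with the operands shifted left by 0, 1 and 2 bits."""
    results = []
    for shift in (0, 1, 2):
        raw = faulty_add(x << shift, y << shift, fault_mask)
        results.append(raw >> shift)    # undo the encoding
    return majority_vote(*results)


# Each pass on its own is wrong, but each is wrong in a different bit.
single_pass = faulty_add(1000, 234, fault_mask=0b1000)
voted = time_redundant_add(1000, 234, fault_mask=0b1000)
print(single_pass, voted)
```

Note that voting on whole values would not help in this example, since all three passes return different numbers; the vote has to be taken bit by bit.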
Testing
Candidate electronics for aerospace applications are tested in labs for the effects of ionizing radiation. Some manufacturers provide complete “radiation test datasheets”; when such data is not available, the component in question has to undergo dedicated tests [4]. The usual setup is depicted in Figure 7 and includes some form of “particle source”, such as a cyclotron. The device under test (DUT) is then exposed to the particle beam and its functionality is monitored. Tests can be static (load a chip with data, irradiate and check data afterwards) or dynamic (have the chip running while being irradiated) [4].
Other testing techniques involve software alterations to simulate bit upsets or laser-induced bit alterations.
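The software-based approach can be sketched as a miniature “static test”: load a known pattern, inject bit flips, then compare against the reference. The buffer below merely stands in for device memory, and the pattern, the number of flips and the function names are hypothetical; a real campaign would target the actual memory or registers of the system under test:

```python
import random


def inject_bit_flips(memory: bytearray, n_flips: int, rng: random.Random):
    """Invert `n_flips` distinct, randomly chosen bits in `memory`.
    Returns the list of (byte_index, bit_index) positions that were hit."""
    hits = []
    for bit_position in rng.sample(range(len(memory) * 8), n_flips):
        byte_index, bit_index = divmod(bit_position, 8)
        memory[byte_index] ^= 1 << bit_index
        hits.append((byte_index, bit_index))
    return hits


def count_upsets(reference: bytes, observed: bytes) -> int:
    """Count the bits that differ between the reference and the readback."""
    return sum(bin(r ^ o).count("1") for r, o in zip(reference, observed))


pattern = bytes([0x55] * 1024)          # known test pattern
memory = bytearray(pattern)             # "load the chip with data"
hits = inject_bit_flips(memory, n_flips=5, rng=random.Random(42))
upsets = count_upsets(pattern, memory)  # "check data afterwards"
print(upsets, len(hits))
```

Seeding the random generator makes a campaign repeatable, which helps when a detected upset needs to be traced back to its injection point.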
The goal of radiation testing is not to show that there are no radiation effects, but to demonstrate that the component in question is sufficiently tolerant to the expected radiation levels.
Case study: SEASTAR spacecraft
Figure 8 shows the locations of SEU’s over several years in the solid-state recorder of the SEASTAR satellite. The region east of the South American coast is the well-known South Atlantic Anomaly (SAA), where the inner Van Allen belt is closer to the Earth than elsewhere [9].
There are quite a number of upsets, the ones in polar regions mainly caused by GCR’s and the ones in the SAA caused by trapped protons [9]. These effects have to be mitigated by using error correction codes.
Case study: A330 in-flight upset
The following is sourced from the official accident report [10] and highlights the aspects relevant to SEE. The interested reader is directed to [10] for the complete report.
In 2008, Qantas flight 72 from Singapore to Perth was en route at 37’000 ft when one of the aircraft’s Air Data Inertial Reference Units (ADIRU’s) started to output invalid data for short periods. More specifically, an unrealistic AOA-value of 50.6° was transmitted on the ARINC 429 bus, together with other invalid parameters. The flight control computers subsequently commanded an undue pitch-down maneuver, lasting less than two seconds, but strong enough to cause more than 100 injuries as many passengers were not wearing their seatbelts. After a second, less severe pitch-down, the crew declared an emergency and diverted to Learmonth.
In a perfect world, the aircraft’s flight computers would have detected the misleading nature of these values and flagged them as invalid. Unfortunately, the A330 flight computers were not able to detect this particular anomaly back in 2008, due to a very specific timing issue. This was later changed with a software upgrade.
When investigators analyzed the ADIRU data in detail, they found the “smoking gun”: The binary data word of the 50.6° AOA value corresponded EXACTLY to the binary data value of the altitude 37’000 ft!
The ADIRU (LTN-101) had a common processor unit for the IR and ADR part, as shown in Figure 11 below. It appears that there was a temporary error in the data packaging for the ARINC 429 words.
Altitude data was labelled as AOA data for short periods that were 1.2 s apart.
The Failure Mode and Effects Analysis (FMEA) of the ADIRU manufacturer had not identified this type of malfunction and subsequently, the aircraft’s monitoring functions were not up to the task of immediately identifying this condition. Eventually, the flight control computers did identify it and the flight control law changed to alternate law, which in turn disabled the affected parts of the envelope protection. There was also misleading data in the IR parameters, but investigators were unable to pinpoint the exact interactions that might have caused this. Most manufacturers use some proprietary baro-aiding for inertial navigation units...
In 2008, more than 8000 units of the LTN-101 had been produced and had accumulated over 128 million flight hours. There were only three documented cases of such data spikes. This is well within the certification criteria. However, as a famous aviation safety engineer once stated:
The fact that a unit meets the certification requirements, does not automatically imply that it is a good design.
Important: the LTN-101 ADIRU at the time did not incorporate EDAC, nor was it required to do so. It only contained a BITE and a parity bit, which was added in the I/O module. Later in the production cycle, a new CPU software was used, capable of EDAC.
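The difference between a parity bit and EDAC can be illustrated with a small sketch. Odd parity over an 8-bit value is assumed purely for illustration; this is not the LTN-101 implementation, and the names are hypothetical. A parity bit can detect a single flipped bit, but it cannot correct it, and it is blind to an even number of flips:

```python
def odd_parity_bit(data: int) -> int:
    """Return the bit that makes the total number of ones odd."""
    return 0 if bin(data).count("1") % 2 == 1 else 1


def parity_ok(data: int, parity_bit: int) -> bool:
    """Check that data plus parity bit contain an odd number of ones."""
    return (bin(data).count("1") + parity_bit) % 2 == 1


value = 0b10110010
parity = odd_parity_bit(value)

single_flip = value ^ 0b00001000        # one bit upset
double_flip = value ^ 0b00011000        # two bits upset

print(parity_ok(value, parity))         # intact word passes
print(parity_ok(single_flip, parity))   # single flip is detected
print(parity_ok(double_flip, parity))   # double flip goes unnoticed
```

A parity check also says nothing about whether the content of a word is meaningful: a word that is assembled incorrectly before the parity bit is computed will pass the check.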
A definitive conclusion was not reached by the investigation, but this event has all the “fingerprints” of an SEE: a temporary malfunction of a microchip, which appears logical in terms of bit manipulations. After a reset, no faults were found. Also, the serial number involved appears to have been particularly susceptible to SEE as was confirmed by testing.
This shows the importance of SIHFT and EDAC in critical applications. Further, the system-level tolerance for corrupted data is paramount. If a certain unit is affected, this fault shall not propagate further into the system. This incident also highlights a very important aspect for operators of complex machinery:
It took investigators two years to find out what had happened. The pilots had to react within seconds.
Revision/20210414
References
[1] Velazco, McMorrow, Estela, Radiation Effects on Integrated Circuits and Systems for Space Applications, Springer, 2019
[2] Petersen, Single Event Effects in Aerospace, Wiley, 2011
[3] Johnston, Reliability and Radiation Effects in Compound Semiconductors, World Scientific, 2010
[4] Nicolaidis, Soft Errors in Modern Electronic Systems, Frontiers in Electronic Testing, Vol. 41, Springer, 2011
[5] EASA, CM–AS-004, Single Event Effects (SEE) Caused by Atmospheric Radiation, Issue 01, 2018
[6] S.E. Kerns et al., The Design of Radiation Hardened ICs for Space: A Compendium of Approaches, Proc. IEEE, v. 76, n. 11, pp. 1470-1509, 1988
[7] Finn, System effects of Single Event Upsets, Computers in Aerospace, pp. 994-1001, AIAA, 1989
[8] Iniewski, Radiation Effects in Semiconductors, CRC Press, 2011
[9] Ladbury et al., Lessons Learned from Radiation Induced Effects on Solid State Recorders (SSR) and Memories, NASA Electronic Parts and Packaging (NEPP) Program, 2013
[10] ATSB, Aviation Occurrence Investigation AO-2008-070, In-flight upset 154 km west of Learmonth, WA 7 October 2008 VH-QPA Airbus A330-303, 2011