Projects:2021s1-13001 Improving the Resilience of Autonomous Satellite Networks against High-Energy Disruptions

Revision as of 14:46, 21 April 2021 by A1686655 (talk | contribs) (Restructured page. Added outcomes of literature review and proposed system architecture.)
Artist's impression of the Buccaneer Risk Mitigation Mission (BRMM) satellite.

While FPGAs offer a number of benefits for aerospace applications, they are highly susceptible to single event effects (SEE) when exposed to high-radiation environments. These upsets can cause undesirable behaviour within the system, and potentially lead to catastrophic system failure. Students will design and develop a novel FPGA configuration scrubber to overcome these effects using an external microcontroller. Radiation testing will be conducted to verify system performance in a (simulated) space environment.

Introduction

This project is sponsored by the Defence Science and Technology Group (DST). Students will gain experience in an industry environment, while supporting Defence capabilities within DST.

Project team

Project students

  • Jack Nelson
  • Albert Pistorius

Supervisors

  • Dr. Said Al-Sarawi
  • Dr. Dharmapriya Bandara (DST)

Advisors

  • Dr. Brayden Phillips

Project Objectives

  • To design and develop a novel system architecture to detect and correct single event upsets, and to restore system operation in a failure event.
  • To provide sufficient fault protection such that an industry-rated FPGA may be used in space applications for a minimum period of 2 years (in Low Earth Orbit) without loss of functionality.
  • To provide clearly defined research outcomes which can be incorporated into the development process for future CubeSat launches.

Literature Review

Single Event Effects

Traditional avionics and ground-based electronic systems are shielded from the effects of solar radiation thanks to the Earth's atmosphere and magnetic field. However, systems operating within a space environment do not receive the same level of protection and therefore are subjected to extremely high levels of radiation. This radiation can be produced by a wide variety of phenomena, but cosmic rays and high-energy protons are the most prevalent sources in space applications.

File:Filename.jpg
SEE Diagram (TBC)

When one of these high-energy radiation particles travels through a semiconductor, the resulting ionisation produces free charge carriers within the substrate. These charge carriers diffuse through the material, altering the shape and size of the depletion region. This can cause transient voltages within the gate, known as Single Event Transients (SETs), which can ultimately lead to a variety of highly disruptive effects, known as Single Event Effects (SEEs).

These effects may be classified according to two categories: soft errors, which are reversible and may or may not interrupt normal operation, and hard errors, which are irreversible and can cause catastrophic damage to the device. This categorisation of errors is presented below.

File:Filename.jpg
Types of SEE (TBC)

Soft Errors

Single Event Upset (SEU)

SEUs are non-destructive soft errors which cause a change of state within a memory cell. A SET can propagate through combinational logic and, if it arrives at a storage element at the same time as a clock edge, the transient is read as an incorrect logic state and latched into memory. In memory cells and registers this generally appears as a bit-flip.

If only a single bit-flip occurs, it is classified as a Single Bit Upset (SBU). However, it is also possible for a single, high-energy particle to collide with multiple transistors as it passes through a memory bank. This can cause multiple bit-flips in a single event, and is known as a Multi Bit Upset (MBU). MBUs may cause single errors across multiple words/frames, in which case they may be treated in the same manner as SBUs. However, they can also cause multiple errors within a single word, in which case the upset must be handled as a special case, as discussed in later sections.
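The distinction between the two MBU cases can be illustrated with a short sketch. It assumes that an event is reported as a list of absolute bit addresses and that words are 32 bits wide; the function name and the returned labels are hypothetical and chosen only for illustration.

```python
from collections import Counter

WORD_BITS = 32  # assumed word width for this illustration


def classify_upset(flipped_bits):
    """Classify one event from the absolute bit addresses it flipped."""
    if not flipped_bits:
        return "none"
    if len(flipped_bits) == 1:
        return "SBU"
    # Count how many flipped bits fall inside each word.
    flips_per_word = Counter(bit // WORD_BITS for bit in flipped_bits)
    if max(flips_per_word.values()) == 1:
        return "MBU (one flip per word, handle as SBUs)"
    return "MBU (several flips in one word, special case)"


print(classify_upset([5]))      # one bit flipped
print(classify_upset([5, 40]))  # bits 5 and 40 lie in different words
print(classify_upset([5, 6]))   # bits 5 and 6 lie in the same word
```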

Single Event Functional Interrupt (SEFI)

While the effects of SEUs are often negligible, they have the potential to cause catastrophic system failure if the upset occurs in a critical system, such as FPGA configuration memory or the POWER/RESET bit in a microcontroller. An upset which interrupts or otherwise prevents the normal operation of a system is known as a Single Event Functional Interrupt (SEFI). These events generally require power cycling the system or reloading the configuration memory to recover normal system operation.

Single Event Latch-Up (SEL)

It is possible for the ionisation of a SEE to create a low impedance path within the circuit and form a parasitic structure within the device. This parasitic structure may 'latch' one or more transistors in a forward-biased state, causing them to conduct current.

A SEL may be cleared by power cycling the device. However, if the device is allowed to conduct current for too long, or the current through the transistor exceeds device specifications, this fault may cause irreparable damage to the device, such as Single Event Burnout (SEB) or Single Event Gate Rupture (SEGR).

Hard Errors

Single Event Burnout (SEB)

When a SEL occurs, the resulting current through the transistor causes excessive heating. If the SEL is not cleared quickly, catastrophic failure may occur due to bond wire failure. This is most likely to occur if the power and ground rails are shorted, leading to an extremely high current through the device, although other shorts can lead to equally destructive results.

Single Event Gate Rupture (SEGR)

SEGRs occur when the gate oxide in a transistor is destroyed due to a SEL. This causes device burnout similar to a SEB. Similar effects can occur in non-transistor devices, such as capacitors, in which case it is known as a Single Event Dielectric Rupture (SEDR).

Microprocessors, FPGAs and memory devices are vital components in any space system; however, they are also the components most sensitive to SEEs. Therefore, we need a reliable way to protect these components against SEEs in order for any space mission to be successful.

One approach to this problem is to use physical manufacturing techniques to reduce device vulnerability to ionisation. Alternatively, we can introduce error detection and correction (EDAC) systems designed to mitigate the effects of SEEs when they occur. As with the physical manufacturing techniques, there are a wide variety of proven approaches to this problem, most of which involve either Triple Modular Redundancy (TMR), an external ‘scrubber’ circuit, or some combination of the two. These solutions, and their many variations, are explored below.

SEE Prevention and Mitigation Strategies

Radiation Hardening

Components and circuits which have been designed and manufactured to be less susceptible to SEEs are known as radiation hardened, or RadHard, components. There are many different hardening techniques available depending on the device. For example, a semiconductor may be shielded against radiation using impervious materials, such as aluminium or depleted boron, and mounted to a substrate material with a wide band gap, such as Silicon Carbide or Gallium Nitride, instead of conventional silicon wafers.

While RadHard devices provide robust, reliable performance in space applications, they are often orders of magnitude more expensive than their industrial-grade equivalents, and tend to lag roughly a generation behind the most recent developments due to the extensive development and testing required for each design. For this reason, we would prefer to use industrial-grade components coupled with some kind of EDAC subsystem wherever possible. This simplifies the development process and provides a substantial reduction in cost.

Triple Modular Redundancy

TMR is a method of upset mitigation used to reduce single point failures by triplicating the original circuit (i.e. adding two additional redundant copies of the circuit). All three circuits run in parallel to one another, feeding their outputs into a shared voter circuit, which then compares each circuit’s output and chooses the value output by the majority of the circuits. In the event of an upset occurring in one of the modules, the other two modules will remain unaffected and the voter circuit will still deliver the correct output.
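The voting step can be sketched in a few lines. The example below is a minimal software model, not the hardware voter itself: it assumes each module output is an integer and takes a bitwise majority, so that an upset in any one module is outvoted by the other two.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three module outputs (each an int)."""
    return (a & b) | (a & c) | (b & c)


correct = 0b10110010
upset = correct ^ 0b00001000  # one module suffers a single bit-flip

# The two unaffected modules outvote the upset module.
print(bin(majority_vote(correct, upset, correct)))  # 0b10110010
```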

File:Filename.jpg
SEE Diagram (TBC)

While single voter TMR configurations, such as the one shown in Figure 3, greatly reduce the risk of failure by reducing the size of the vulnerable area, the voter circuits are just as vulnerable to upsets as the modules themselves. Hence, single voter configurations can still be corrupted by SEEs as they maintain a single point of failure.

File:Filename.jpg
SEE Diagram (TBC)

To eliminate this vulnerability, the voter circuit can also be triplicated. This requires the output of each module to be fed to all three voter circuits, which then produce three majority outputs, as shown in Figure 4. Multiple instances of this configuration can be chained together to create a robust circuit.

One comparison between single and triple voter circuits found that the use of triple voters lowered the failure rate from 0.85% to just 0.35% [1].

TMR implementations which make use of additional area to run two redundant copies of the circuit in parallel to the original are known as space-TMR. All three modules execute simultaneously and independently of one another. As such, there is very little overhead on processor performance, but the area and power requirements of the circuit are tripled.

Time-TMR is an alternative approach which uses temporal methods to triplicate the module execution. Instead of two redundant circuits running in parallel, time-TMR uses a single copy of the circuit to perform the same instruction three times. This can be achieved by locking the program counter to execute the same command three times and storing the results of the first two executions in memory. After the last execution the three outputs are compared, and the majority result is selected. Time-TMR has lower area and power requirements than space-TMR, as it requires little to no extra hardware. However, the performance of the circuit is decreased by a factor of three, as the additional instructions are executed sequentially instead of in parallel.
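The idea behind time-TMR can be shown with a small software analogy. This is only a sketch: the helper names are hypothetical, the "circuit" is an ordinary function, and the transient fault is simulated by making one chosen call return a wrong value.

```python
from collections import Counter


def time_tmr(operation, *args):
    """Run `operation` three times in sequence and return the majority result."""
    results = [operation(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: all three executions disagree")
    return value


def make_flaky_adder(fail_on_call):
    """Build an adder whose result is wrong on one chosen call (simulated upset)."""
    calls = {"n": 0}

    def add(x, y):
        calls["n"] += 1
        error = 1 if calls["n"] == fail_on_call else 0
        return x + y + error

    return add


# The second execution is corrupted, but the other two still agree.
print(time_tmr(make_flaky_adder(fail_on_call=2), 2, 3))  # 5
```

The three sequential calls are where the threefold performance penalty comes from.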

Regardless of the selected implementation, the use of TMR in a circuit can greatly improve the reliability and increase the mean time to failure (MTTF). However, if an error in a failed module is not repaired, the whole circuit is at risk of failure if either of the other two modules experiences an upset. In the event that two of the three modules fail simultaneously, the TMR voter will be unable to provide a trustworthy output and thus the TMR will fail. TMR must therefore be accompanied by a repair mechanism in order to substantially improve the MTTF.
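The need for a repair mechanism can be illustrated with a standard textbook reliability model, which is not taken from the project sources. It assumes that each module fails independently at a constant rate `lam`, that the voter is perfect, and that failed modules are never repaired. Under these assumptions a module has reliability R(t) = exp(-lam·t), the TMR system has reliability 3R² - 2R³, and integrating gives an MTTF of 5/(6·lam), which is lower than the 1/lam of a single module.

```python
import math


def module_reliability(lam, t):
    """Probability that a single module is still working at time t."""
    return math.exp(-lam * t)


def tmr_reliability(lam, t):
    """Probability that at least two of three modules work (perfect voter, no repair)."""
    r = module_reliability(lam, t)
    return 3 * r**2 - 2 * r**3


def mttf_single(lam):
    return 1 / lam


def mttf_tmr_no_repair(lam):
    # Integral of 3*exp(-2*lam*t) - 2*exp(-3*lam*t) over t from 0 to infinity.
    return 5 / (6 * lam)


lam = 1.0  # illustrative failure rate, arbitrary units
print(tmr_reliability(lam, 0.1) > module_reliability(lam, 0.1))  # True: better early on
print(mttf_tmr_no_repair(lam) < mttf_single(lam))                # True: worse MTTF
```

Without repair, TMR improves reliability over short missions but not the MTTF, which is why it is paired with scrubbing.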

Scrubbing

The process of periodically reprogramming an FPGA to avoid an accumulation of errors is known as scrubbing. This can be achieved using a dedicated circuit, commonly known as a scrubber, whose primary purpose is to mitigate errors in the configuration memory before they can disrupt the overall system. These scrubbers are often coupled with a ‘golden’ copy of the configuration memory which is not susceptible to SEEs (e.g. NAND Flash or RadHard memory) and is therefore known to be correct.

There are a variety of proven scrubber architectures available in literature, each with its own distinct benefits and drawbacks. The following section will discuss these benefits and drawbacks to identify the architecture which is most suited to this project.

Internal vs External Scrubbing
File:Filename.jpg
SEE Diagram (TBC)

A scrubber may be implemented internally within an FPGA using configurable logic blocks, or external to the FPGA using additional hardware such as a microcontroller or secondary FPGA to store and execute the scrubbing program instructions. As the internal scrubber architecture is housed entirely within the FPGA, it is much faster than an external scrubber, and the lack of additional hardware also reduces space and power requirements. However, this also means that more resources are required on the FPGA to implement the scrubber logic, resulting in less available space for the user’s program.

Internal scrubbers make use of the Internal Configuration Access Port (ICAP) to perform continuous readback of the configuration memory. The Xilinx Kintex 7-series FPGAs include a built-in readback CRC circuit which utilises this interface to provide single error correction and double error detection (SECDED) capabilities. External scrubbers, however, cannot access the ICAP and must instead use either the SelectMAP or JTAG interface to perform readback operations.

When a MBU is detected by a SECDED circuit, some additional scrubbing capability is required in order to repair the upset. This may involve simply reconfiguring the entire FPGA, or may use a more precise method, as explored in Section 3.3.2. Logic such as this can be implemented easily in an external microcontroller, whereas an internal operation would require a softcore processor within the FPGA (e.g. PicoBlaze).

The key issue with internal scrubbers is that the scrubbing hardware is just as susceptible to SEEs as the rest of the FPGA. The scrubber circuit is unable to repair itself, and therefore if a fault occurs within this portion of the configuration memory the entire scrubber may fail. Internal scrubbing can be implemented in combination with TMR, which has been shown to reduce failure rates by 30%; however, external scrubbers are still considered to be more robust [7]. Of course, external scrubbers are also vulnerable to SEEs, but they can be designed using RadHard components to overcome this problem at a fraction of the cost of a full RadHard FPGA.

Scrubbing Strategies
Blind Scrubbing

Blind scrubbing is a relatively simple scrubbing strategy as it does not require error detection. Instead, the entire configuration memory is overwritten with data from the golden memory at fixed intervals. Xilinx Virtex FPGAs include a dynamic reconfiguration capability which allows scrubbing to occur without interrupting the application layer operations. Since error detection is not required, blind scrubbing can be performed at a reasonably fast speed. However, it is still considered to be an inefficient method as the scrubber is constantly occupying processing bandwidth to correct memory frames which contain no errors. The advantage is that any errors that occur within the scrubbed memory are guaranteed to be corrected since the entire memory is rewritten.
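A single blind scrubbing pass can be sketched as follows. This is a simplified model under stated assumptions: the configuration memory and the golden copy are represented as lists of frames, each frame a list of bits, and the function name is hypothetical. A real scrubber would write frames through the configuration interface and repeat the pass on a timer, both of which are omitted here.

```python
def blind_scrub(config_memory, golden_copy):
    """Overwrite every frame from the golden copy without any error detection."""
    frames_written = 0
    for index, golden_frame in enumerate(golden_copy):
        config_memory[index] = list(golden_frame)
        frames_written += 1
    return frames_written


golden = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
memory = [list(frame) for frame in golden]
memory[1][2] ^= 1  # simulate an upset in frame 1

written = blind_scrub(memory, golden)
print(written, memory == golden)  # 3 True
```

All three frames are rewritten even though only one contained an error, which is the inefficiency described above.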

Global CRC

Cyclic Redundancy Checks (CRCs) are commonly used when error detection is required for large blocks of data. In 7-series FPGAs, a single 32-bit CRC word is calculated for the entire bitstream. The CRC word is calculated using the remainder of a polynomial division circuit, so even a single bit-flip within the bitstream will result in a drastically different CRC value, which makes it particularly effective at detecting upsets, including MBUs. However, CRC is only an error detection tool: it can indicate that an error is present within a block of data, but not where the error is. For this reason, CRCs are often used as the final defence against configuration upsets, as correcting any detected errors requires a full scrub of the configuration memory.
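The detect-but-not-locate behaviour of a CRC can be demonstrated with Python's `zlib.crc32`. This is for illustration only: the CRC-32 polynomial used by `zlib` is a stand-in and is not claimed to match the one used by the FPGA, and the byte array is placeholder data rather than a real bitstream.

```python
import zlib

bitstream = bytearray(range(256)) * 4  # placeholder for configuration data
golden_crc = zlib.crc32(bitstream)     # CRC computed while the data is known good


def has_error(data, expected_crc):
    """Return True if the data no longer matches the stored CRC."""
    return zlib.crc32(data) != expected_crc


corrupted = bytearray(bitstream)
corrupted[100] ^= 0x01  # single bit-flip somewhere in the data

print(has_error(bitstream, golden_crc))  # False
print(has_error(corrupted, golden_crc))  # True
```

The check reports that the data has changed, but the CRC value alone gives no indication that byte 100 was the one affected.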

Frame ECC

7-series FPGAs contain built-in error correction codes (ECCs) which provide local EDAC functionality for each individual frame of the configuration memory. Each frame is made up of 100 32-bit data words and a single 32-bit ECC word which can be used for SECDED. Each time a frame is written to or read from, the ECC syndrome is recalculated. If the syndrome is equal to zero, it implies that zero errors were detected, whereas a non-zero syndrome indicates that an error has occurred and the syndrome value can then be used to determine the location of the error within the frame. A limitation of the frame ECC is that a MBU affecting an odd number of bits in a frame will alias to a SBU at an incorrect location. The scrubber will then try to repair this false SBU, creating an additional error. This operation will result in a zero syndrome for the ECC, making the scrubber believe that the error has been successfully repaired, but the global CRC will still show that an error has occurred and trigger a scrub of the full configuration memory.
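The aliasing problem can be reproduced with a generic extended Hamming (SECDED) code. The sketch below is not the Xilinx frame ECC layout: it assumes a small 8-bit data word, places data bits at non-power-of-two positions, and stores an overall parity bit at index 0. All function names are illustrative.

```python
def syndrome(code):
    """XOR of the positions of all set bits (index 0 is the overall parity bit)."""
    s = 0
    for pos in range(1, len(code)):
        if code[pos]:
            s ^= pos
    return s


def encode(data_bits):
    """Build an extended Hamming codeword from a list of data bits."""
    n_parity = 0
    while (1 << n_parity) < len(data_bits) + n_parity + 1:
        n_parity += 1
    total = len(data_bits) + n_parity
    code = [0] * (total + 1)
    data = iter(data_bits)
    for pos in range(1, total + 1):
        if pos & (pos - 1):  # not a power of two, so a data position
            code[pos] = next(data)
    s = syndrome(code)
    for bit in range(n_parity):
        if s & (1 << bit):
            code[1 << bit] = 1  # set parity bits so the syndrome becomes zero
    code[0] = sum(code[1:]) % 2  # overall parity makes the total even
    return code


def check(code):
    """Return ('clean', None), ('single', position) or ('double', None)."""
    s = syndrome(code)
    parity_even = sum(code) % 2 == 0
    if s == 0 and parity_even:
        return ("clean", None)
    if not parity_even:
        return ("single", s)  # odd number of flips is assumed to be one flip
    return ("double", None)


def flip(code, positions):
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


golden = encode([1, 0, 1, 1, 0, 0, 1, 0])
print(check(flip(golden, [6])))        # ('single', 6): located correctly
triple = flip(golden, [3, 5, 7])
print(check(triple))                   # ('single', 1): aliased to the wrong bit
repaired = flip(triple, [1])           # the scrubber "repairs" position 1
print(check(repaired), repaired == golden)  # ('clean', None) False
```

After the false repair the syndrome is zero although four bits are now wrong, which is the situation that only the global CRC can catch.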

Readback CRC

7-series FPGAs also contain an internal built-in hardware mechanism for performing continuous readback and SECDED during device operation, referred to as readback CRC. This mechanism is responsible for computing the ECC syndrome for each frame as well as the global CRC. After all frames are checked, the CRC value is compared against the previously calculated CRC to determine whether an unidentified error has occurred. Since the readback CRC is implemented using dedicated circuitry, it operates much faster than other readback mechanisms which often require additional resources.

Proposed System Architecture

File:Filename.jpg
SEE Diagram (TBC)

To maximise the reliability of the scrubber, while maintaining the highest possible performance, a hybrid scrubbing approach has been selected, as proposed in [1]. The internal readback CRC mechanism will be used to perform continuous readback using the ICAP interface and subsequently correct SBUs. When the readback CRC detects an error it cannot correct, including MBUs, it will pass control over to the external scrubbing hardware which will perform the necessary operations to correct the error and then pass control back to the readback CRC. This allows us to utilise the speed of the internal readback hardware, while maintaining the robustness of the external scrubber.

Hardware Requirements

The internal scrubber will operate entirely using the FPGA’s readback CRC circuit, and so no external hardware is required. A separate microcontroller and memory bank will be used to store and execute the scrubbing logic for the external scrubber. An auxiliary FPGA could also be used instead of a microcontroller, but as this component would need to be radiation hardened, it would be much more expensive. Using a microcontroller should also simplify the development process, as the development team has far more experience working with microcontrollers than with FPGAs. The model of microcontroller has not yet been determined; however, it will need to have enough GPIO pins to drive the SelectMAP interface on the FPGA. This requires 6 control pins, as well as either 8, 16 or 32 data pins, as detailed in [9-10]. A RadHard memory bank will also be required to store the ‘golden’ copy of the configuration bitstream. This is likely to be implemented using NAND Flash memory which is not susceptible to SEEs. The size of this memory has not yet been determined.

Software Requirements

Scrubbing Algorithm

File:Filename.jpg
SEE Diagram (TBC)

The scrubbing algorithm will roughly follow the framework proposed in [1]. The internal readback CRC mechanism will be used to perform continuous readback of the configuration memory and correct any SBUs. If a MBU is detected by the readback CRC, an error event is generated and passed to a FIFO queue. The external scrubbing circuit will monitor this queue and initiate scrubbing via the SelectMAP interface when an error event is detected.

The scrubbing process performed by the external scrubber varies depending on the type of error detected. This process is clearly outlined in [1], but can be represented at a high-level as seen in Figure 8. This process will always scrub a specific frame in the configuration memory if possible, and only scrub the entire configuration memory if there is no alternative. This makes the overall operation of the device as efficient as possible while maintaining full functionality.

The scrubbing architecture described above does not detect or respond to SEFIs, although there are scrubbers capable of handling these events. If time permits, common SEFIs such as the power-on reset SEFI, frame address SEFI and SelectMAP SEFI may be addressed in the scrubbing algorithm; however, that is not within the scope of the project at this time.
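The frame-first policy of the external scrubber can be sketched as a simple queue handler. This is a rough model built on assumptions, not the decision process from [1]: each error event is assumed to carry either the index of the affected frame or `None` when the location is unknown, and SelectMAP writes are modelled as list assignments. The function name and event format are hypothetical.

```python
from collections import deque


def service_error_queue(events, config_memory, golden_copy):
    """Drain the FIFO of error events and scrub as little memory as possible."""
    actions = []
    while events:
        frame = events.popleft()  # frame index, or None if the location is unknown
        if frame is not None and 0 <= frame < len(golden_copy):
            # Preferred case: rewrite only the affected frame.
            config_memory[frame] = list(golden_copy[frame])
            actions.append(("frame", frame))
        else:
            # Fallback: rewrite the entire configuration memory.
            for index, golden_frame in enumerate(golden_copy):
                config_memory[index] = list(golden_frame)
            actions.append(("full", None))
    return actions


golden = [[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
memory = [list(frame) for frame in golden]
memory[1][0] ^= 1
memory[1][3] ^= 1  # two flips in frame 1, as an MBU might cause

events = deque([1])
print(service_error_queue(events, memory, golden))  # [('frame', 1)]
print(memory == golden)                             # True
```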

Additional Features

Implementing a scrubber circuit using an external microcontroller allows for much greater flexibility in terms of debugging. As long as the computational power of the microcontroller is not exceeded, and there are sufficient GPIO pins available, a wide range of debugging features can be added. There are two debugging tools which have been identified as being critical to the development process.

The first tool is the ability to simulate SEUs for testing purposes. The system will be thoroughly tested in a lab environment before undergoing radiation testing, and therefore we must be able to simulate both SBUs and MBUs in various locations within the configuration memory.
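A fault-injection tool of this kind could look roughly like the sketch below. It assumes the configuration memory is modelled as a list of frames, each a list of bits; on real hardware the flips would be applied through readback and partial reconfiguration instead. The function names are hypothetical.

```python
import random


def inject_upset(config_memory, frame, bit_positions):
    """Flip the given bits in one frame (one position = SBU, several = MBU)."""
    for bit in bit_positions:
        config_memory[frame][bit] ^= 1


def inject_random_mbu(config_memory, n_bits, rng):
    """Flip n_bits distinct, randomly chosen bits within one random frame."""
    frame = rng.randrange(len(config_memory))
    bits = rng.sample(range(len(config_memory[frame])), n_bits)
    inject_upset(config_memory, frame, bits)
    return frame, sorted(bits)


memory = [[0] * 32 for _ in range(4)]

inject_upset(memory, frame=2, bit_positions=[5])  # simulated SBU
print(memory[2][5])                               # 1

rng = random.Random(0)                            # seeded so tests are repeatable
frame, bits = inject_random_mbu(memory, n_bits=3, rng=rng)
print(frame, bits)
```

Seeding the random generator makes a failing test case reproducible, which is useful when comparing scrubber behaviour between runs.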

The second tool is the ability to connect to an external PC for easy FPGA configuration and testing. This would most likely be implemented over a UART interface, which would allow a user to input commands (such as the fault-injection commands outlined above) and retrieve device status information (such as the type and location of any upsets detected).
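One possible shape for such a command interface is sketched below. The command names (`INJECT`, `STATUS`), the response strings and the function name are all hypothetical and are not part of the project specification; the sketch only parses one text line as it might arrive over the UART and models the configuration memory as a list of frames.

```python
def handle_command(line, config_memory, upset_log):
    """Parse one text command and return a one-line response."""
    parts = line.strip().split()
    if not parts:
        return "ERR empty command"
    command, args = parts[0].upper(), parts[1:]

    if command == "INJECT" and len(args) == 2 and all(a.isdigit() for a in args):
        frame, bit = int(args[0]), int(args[1])
        if frame >= len(config_memory) or bit >= len(config_memory[frame]):
            return "ERR address out of range"
        config_memory[frame][bit] ^= 1  # fault injection: flip the requested bit
        return f"OK flipped frame {frame} bit {bit}"

    if command == "STATUS" and not args:
        return f"OK {len(upset_log)} upsets logged"

    return "ERR unknown command"


memory = [[0] * 8 for _ in range(2)]
log = [("SBU", 0, 3)]  # example log entry: (type, frame, bit)

print(handle_command("inject 1 4", memory, log))  # OK flipped frame 1 bit 4
print(handle_command("STATUS", memory, log))      # OK 1 upsets logged
```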

Development and Testing

Project Outcomes

References

[1] A. Stoddard, A. Gruwell, P. Zabriskie, and M. J. Wirthlin, "A Hybrid Approach to FPGA Configuration Scrubbing", IEEE Transactions on Nuclear Science, vol. 64, no. 1, pp. 497-503, 2017.