Difference between revisions of "Projects:2021s1-13001 Improving the Resilience of Autonomous Satellite Networks against High-Energy Disruptions"
Revision as of 11:35, 18 October 2021
While FPGAs offer a number of benefits for aerospace applications, they are highly susceptible to single event effects (SEEs) when exposed to high-radiation environments. These upsets can cause undesirable behaviour within the system, and potentially lead to catastrophic system failure. Students will design and develop a novel FPGA configuration scrubber to overcome these effects using an external microcontroller. Radiation testing will be conducted to verify system performance in a (simulated) space environment.
Introduction
This project is sponsored by the Defence Science and Technology Group (DST). Students will gain experience in an industry environment, while supporting Defence capabilities within DST.
Project team
Project students
- Jack Nelson
- Albert Pistorius
Supervisors
- Dr. Said Al-Sarawi
- Dr. Dharmapriya Bandara (DST)
Project Objectives
- To design and develop a novel system architecture to detect and correct single event upsets, and to restore system operation in a failure event.
- To provide sufficient fault protection such that an industry-rated FPGA may be used in space applications for a minimum period of 2 years (in Low Earth Orbit) without loss of functionality.
- To provide clearly defined research outcomes which can be incorporated into the development process for future CubeSat launches.
Background
Buccaneer Main Mission
One of the biggest changes in the space domain in recent years has been the move from large satellites costing billions of dollars and decades in development, to small disposable satellites that can cost less than one million dollars and have development cycles measured in months. This shift has led to the creation and popularisation of the Cube Satellite (CubeSat) form factor. These satellites are roughly the size of a shoe box and have a typical launch mass of 5 to 15 kg.
DST are currently undertaking their own CubeSat mission, called Buccaneer, in collaboration with the University of New South Wales (UNSW) and various other industry, academic and international partners. The Buccaneer program consists of two separate launches: the first satellite, launched in November 2017, focused on proving the technologies involved, while the second satellite, due to be launched in Q3 2022, is designed to conduct the main mission.
Single Event Effects
Traditional avionics and ground-based electronic systems are shielded from the effects of solar radiation thanks to the Earth's atmosphere and magnetic field. However, systems operating within a space environment do not receive the same level of protection and therefore are subjected to extremely high levels of radiation. This radiation can be produced by a wide variety of phenomena, but cosmic rays and high-energy protons are the most prevalent sources in space applications.
When one of these high-energy radiation particles travels through a semiconductor, the resulting ionisation produces free charge carriers within the substrate. These charge carriers diffuse through the material, altering the shape and size of the depletion region. This can cause transient voltages within the gate, known as Single Event Transients (SETs), which can ultimately lead to a variety of highly disruptive effects, known as Single Event Effects (SEEs).
In this project, we are primarily concerned with Single Event Upsets (SEUs), which are non-destructive, soft errors resulting in a change of state within a memory cell. If a SET occurs at the same time as a clock edge, the impulse will be read as an incorrect logic state and the pulse will propagate through combinational logic, where it may be latched into memory. In memory cells and registers this generally appears as a bit-flip.
If only a single bit-flip occurs, it is classified as a Single Bit Upset (SBU). However, it is also possible for a single, high-energy particle to collide with multiple transistors as it passes through a memory bank. This can cause multiple bit-flips in a single event, and is known as a Multi Bit Upset (MBU). MBUs may cause single errors across multiple words/frames, in which case they may be treated in the same manner as SBUs. They can also cause multiple errors within a single word, in which case the upset must be handled as a special case.
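The distinction between these cases can be illustrated with a minimal sketch. It assumes the memory is available as a list of integer words alongside a known-good reference copy; the function name `classify_upset` and the returned labels are hypothetical and chosen only for illustration.

```python
def classify_upset(golden, readback):
    """Classify an upset by comparing readback words against a known-good copy."""
    if len(golden) != len(readback):
        raise ValueError("golden and readback must have the same length")
    # XOR exposes the flipped bits; count them separately for each word.
    flips = [bin(g ^ r).count("1") for g, r in zip(golden, readback)]
    total = sum(flips)
    if total == 0:
        return "none"
    if total == 1:
        return "SBU"
    if max(flips) == 1:
        # Several words affected, but each one has only a single error,
        # so each word can be treated like an SBU.
        return "MBU (one flip per word)"
    # At least one word holds two or more errors: the special case.
    return "MBU (multiple flips in one word)"


if __name__ == "__main__":
    print(classify_upset([0b1010, 0b0001], [0b1011, 0b0001]))
    print(classify_upset([0b0000, 0b0000], [0b0001, 0b0001]))
    print(classify_upset([0b0000], [0b0011]))
```

In a real device the golden words would not normally be compared directly like this; the sketch only shows why the number of flips per word, rather than the total number of flips, determines how the upset has to be handled.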
While the effects of SEUs are often negligible, they have the potential to cause catastrophic system failure if the upset occurs in a critical system, such as FPGA configuration memory or the POWER/RESET bit in a microcontroller. An upset which interrupts or otherwise prevents the normal operation of a system is known as a Single Event Functional Interrupt (SEFI). These events generally require power cycling the system or reloading the configuration memory to recover normal system operation.
SEE Prevention and Mitigation Strategies
Radiation Hardening
Components and circuits which have been designed and manufactured to be less susceptible to SEEs are known as radiation hardened, or RadHard, components. There are many different hardening techniques available depending on the device. For example, a semiconductor may be shielded against radiation using impervious materials, such as aluminium or depleted boron, and mounted to a substrate material with a wide band gap, such as Silicon Carbide or Gallium Nitride, instead of conventional silicon wafers.
While RadHard devices provide robust, reliable performance in space applications, they are often orders of magnitude more expensive than their industrial-grade equivalents, and tend to lag roughly a generation behind the most recent developments due to the extensive development and testing required for each design. For this reason, we would prefer to use industrial-grade components coupled with an error detection and correction (EDAC) subsystem wherever possible. This simplifies the development process and provides a substantial reduction in cost.
Scrubbing
The process of periodically reprogramming an FPGA to avoid an accumulation of errors is known as scrubbing. This can be achieved using a dedicated circuit, commonly known as a scrubber, whose primary purpose is to mitigate errors in the configuration memory before they can disrupt the overall system. These scrubbers are often coupled with a ‘golden’ copy of the configuration memory which is not susceptible to SEEs (e.g. NAND Flash or RadHard memory) and is therefore known to be correct.
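As a rough illustration of scrubbing against a golden copy, the sketch below models the configuration memory as a list of frames (byte strings) and rewrites any frame that differs from the reference. The frame representation and the function name `scrub` are assumptions made for this example and do not reflect the interface of any particular FPGA.

```python
def scrub(config_frames, golden_frames):
    """Rewrite every frame that differs from the golden copy.

    Returns the indices of the frames that were repaired.
    """
    repaired = []
    for index, golden in enumerate(golden_frames):
        if config_frames[index] != golden:
            config_frames[index] = golden  # restore the known-good contents
            repaired.append(index)
    return repaired


if __name__ == "__main__":
    golden = [b"\x00\x01", b"\xff\x10", b"\x42\x42"]
    config = list(golden)
    config[1] = b"\xfe\x10"  # simulate a single flipped bit in frame 1
    print(scrub(config, golden))
```

A scrubber that simply rewrites every frame on a timer, without comparing first, would follow the same structure with the comparison removed.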
A scrubber may be implemented internally within an FPGA using configurable logic blocks, or external to the FPGA using additional hardware such as a microcontroller or secondary FPGA to store and execute the scrubbing program instructions. As the internal scrubber architecture is housed entirely within the FPGA, it is much faster than an external scrubber, and the lack of additional hardware also reduces space and power requirements. However, this also means that more resources are required on the FPGA to implement the scrubber logic, resulting in less available space for the user’s program.
Internal scrubbers make use of the Internal Configuration Access Port (ICAP) to perform continuous readback of the configuration memory. The Xilinx Kintex 7-series of FPGAs includes a built-in readback CRC circuit which utilises this interface to provide single error correction and double error detection (SECDED) capabilities. External scrubbers, however, cannot access ICAP and must instead use either the SelectMAP or JTAG interface to perform readback operations.
When an MBU is detected by a SECDED circuit, some additional scrubbing capability is required in order to repair the upset. This may involve simply reconfiguring the entire FPGA, or may use a more precise method, as explored in Section 3.3.2. Logic such as this can be implemented easily in an external microcontroller, whereas an internal operation would require a softcore processor within the FPGA (e.g. PicoBlaze).
The key issue with internal scrubbers is that the scrubbing hardware is just as susceptible to SEEs as the rest of the FPGA. The scrubber circuit is unable to repair itself, and therefore if a fault occurs within this portion of the configuration memory the entire scrubber may fail. Internal scrubbing can be implemented in combination with Triple Modular Redundancy (TMR), which has been shown to reduce failure rates by 30%, but external scrubbers are still considered to be more robust [7]. Of course, external scrubbers are also vulnerable to SEEs. However, they can be designed using RadHard components to overcome this problem at a fraction of the cost of a full RadHard FPGA.
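To make the SECDED idea concrete, the following sketch implements an extended Hamming (8,4) code, which is one of the simplest codes with this property. It is purely illustrative: it is not the code used by the FPGA's readback CRC circuit, and the function names `encode` and `decode` are hypothetical.

```python
def encode(data):
    """Encode a 4-bit value into an 8-bit extended Hamming codeword (list of bits)."""
    bits = [0] * 8  # index 0: overall parity; indices 1..7: Hamming positions
    bits[3], bits[5], bits[6], bits[7] = [(data >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2            # makes the total parity even
    return bits


def decode(codeword):
    """Return (data, status); data is None when a double error is detected."""
    bits = list(codeword)
    syndrome = 0
    for position in range(1, 8):
        if bits[position]:
            syndrome ^= position  # XOR of the positions of all set bits
    parity_error = sum(bits) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        # Odd number of flips: assume one, located at `syndrome`
        # (a syndrome of 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: cannot be corrected.
        return None, "double error detected"
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1          # one upset: corrected
    print(decode(word))
    word[2] ^= 1          # second upset in the same word: detected only
    print(decode(word))
```

The second case is the reason a SECDED circuit alone is not sufficient: it can report that a word contains two errors, but something else has to repair it.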
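Triple Modular Redundancy (TMR) keeps three copies of a circuit or value and takes a majority vote, so that an upset in any one copy is masked. A minimal bitwise sketch of the voter is shown below; the function name `majority_vote` is hypothetical, and a hardware implementation would also triplicate the voter itself.

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: each output bit matches at least two inputs."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    correct = 0b1010
    upset = correct ^ 0b1100  # one copy has suffered two bit-flips
    print(bin(majority_vote(correct, correct, upset)))
```

The vote only masks the error. The corrupted copy still has to be repaired, for example by scrubbing, before a second copy is hit.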
System Architecture
To maximise the reliability of the scrubber, while maintaining the highest possible performance, a hybrid scrubbing approach has been selected, as proposed in [1]. The internal readback CRC mechanism will be used to perform continuous readback using the ICAP interface and subsequently correct SBUs. When the readback CRC detects an error it cannot correct, including MBUs, it will pass control over to the external scrubbing hardware which will perform the necessary operations to correct the error and then pass control back to the readback CRC. This allows us to utilise the speed of the internal readback hardware, while maintaining the robustness of the external scrubber.
Hardware Design
The internal scrubber will operate entirely using the FPGA’s readback CRC circuit, and so no external hardware is required. A separate microcontroller and memory bank will be used to store and execute the scrubbing logic for the external scrubber. An auxiliary FPGA could also be used instead of a microcontroller, but as this component would need to be radiation hardened, it would be much more expensive. Using a microcontroller should also simplify the development process, as the development team has far more experience working with microcontrollers than with FPGAs.
The model of microcontroller has not yet been determined. However, it will need to have enough GPIO pins to drive the SelectMAP interface on the FPGA. This requires 6 control pins, as well as either 8, 16 or 32 data pins, as detailed in [9-10].
A RadHard memory bank will also be required to store the ‘golden’ copy of the configuration bitstream. This is likely to be implemented using NAND Flash memory which is not susceptible to SEEs. The size of this memory has not yet been determined.
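The GPIO requirement follows directly from the figures above: 6 control pins plus the chosen data bus width. The short sketch below tabulates the totals; the constant and function names are hypothetical, and any pins needed for other purposes (for example a debug interface) are not included.

```python
SELECTMAP_CONTROL_PINS = 6  # control pin count as stated in the text


def selectmap_gpio_pins(data_width):
    """Total GPIO pins needed to drive SelectMAP at the given bus width."""
    if data_width not in (8, 16, 32):
        raise ValueError("data width must be 8, 16 or 32")
    return SELECTMAP_CONTROL_PINS + data_width


if __name__ == "__main__":
    for width in (8, 16, 32):
        print(width, "data pins:", selectmap_gpio_pins(width), "GPIO pins in total")
```

This gives 14, 22 or 38 pins respectively, which is one of the constraints to check when selecting the microcontroller.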
Software Design
Implementing a scrubber circuit using an external microcontroller allows for much greater flexibility in terms of debugging. As long as the computational power of the microcontroller is not exceeded, and there are sufficient GPIO pins available, a wide range of debugging features can be added in software. Two debugging tools have been identified as being critical to the development process.
The first tool is the ability to simulate SEUs for testing purposes. The system will be thoroughly tested in a lab environment before undergoing radiation testing, and therefore we must be able to simulate both SBUs and MBUs in various locations within the configuration memory.
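A fault-injection tool of this kind could look roughly like the sketch below, which flips chosen bits in an in-memory model of the configuration memory. The function names `inject_upsets` and `random_mbu` are hypothetical, and modelling an MBU as a run of logically adjacent bits is a simplifying assumption: physically adjacent cells are not necessarily adjacent in the address space.

```python
import random


def inject_upsets(memory, bit_positions):
    """Flip the given absolute bit positions in a bytearray, in place."""
    for position in bit_positions:
        memory[position // 8] ^= 1 << (position % 8)


def random_mbu(memory_bits, cluster_size, rng):
    """Pick `cluster_size` adjacent bit positions to mimic an MBU."""
    start = rng.randrange(memory_bits - cluster_size + 1)
    return list(range(start, start + cluster_size))


if __name__ == "__main__":
    memory = bytearray(4)                 # 32 bits of zeroed "configuration memory"
    inject_upsets(memory, [5])            # SBU at bit 5
    rng = random.Random(1)                # seeded so that test runs are repeatable
    inject_upsets(memory, random_mbu(32, 3, rng))  # 3-bit MBU
    print(memory.hex())
```

Seeding the random generator keeps test campaigns reproducible, which makes it easier to compare the scrubber's behaviour between firmware revisions.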
The second tool is the ability to connect to an external PC for easy FPGA configuration and testing. This would most likely be implemented over a UART interface, which would allow a user to input commands (such as the fault-injection commands outlined above) and retrieve device status information (such as the type and location of any upsets detected).
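The command handling for such an interface might be structured as in the following sketch, which parses one line of text received from the PC. The command set shown (`STATUS` and `INJECT <frame> <bit>`) is entirely hypothetical, as the project has not defined a protocol, and the serial transport itself is left out.

```python
def parse_command(line):
    """Parse one text command into a tuple describing the requested action."""
    parts = line.strip().split()
    if not parts:
        return ("error", "empty command")
    name, args = parts[0].upper(), parts[1:]
    if name == "STATUS" and not args:
        return ("status",)
    if name == "INJECT" and len(args) == 2:
        try:
            # Base 0 accepts both decimal ("26") and hexadecimal ("0x1A") input.
            frame, bit = int(args[0], 0), int(args[1], 0)
        except ValueError:
            return ("error", "arguments must be integers")
        return ("inject", frame, bit)
    return ("error", "unknown command")


if __name__ == "__main__":
    for text in ("status\r\n", "INJECT 0x1A 5", "INJECT frame 5", "reboot"):
        print(repr(text), "->", parse_command(text))
```

Keeping the parser separate from the UART driver means it can be tested on a PC without any hardware attached.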
Scrubbing Process
The scrubbing algorithm will roughly follow the framework proposed in [1]. The internal readback CRC mechanism will be used to perform continuous readback of the configuration memory and correct any SBUs. If an MBU is detected by the readback CRC, an error event is generated and passed to a FIFO queue. The external scrubbing circuit will monitor this queue and initiate scrubbing via the SelectMAP interface when an error event is detected.
The scrubbing process performed by the external scrubber varies depending on the type of error detected. This process is clearly outlined in [1], but can be represented at a high-level as seen in Figure 8. This process will always scrub a specific frame in the configuration memory if possible, and only scrub the entire configuration memory if there is no alternative. This makes the overall operation of the device as efficient as possible while maintaining full functionality.
The scrubbing architecture described above does not detect or respond to SEFIs, although there are scrubbers capable of handling these events. If time permits, common SEFIs such as the power-on reset SEFI, frame address SEFI and SelectMAP SEFI may be addressed in the scrubbing algorithm. However, that is not within the scope of the project at this time.
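The overall flow can be sketched as follows. This is a simplified model rather than the algorithm from [1]: error events are represented as dictionaries in a FIFO queue, the configuration memory and golden copy as lists of frames, and the names `handle_event` and `drain` are hypothetical.

```python
from collections import deque


def handle_event(event, config, golden):
    """Repair one error event; return 'frame' or 'full' for the action taken."""
    frame = event.get("frame")
    if frame is not None and 0 <= frame < len(golden):
        config[frame] = golden[frame]  # preferred: rewrite only the faulty frame
        return "frame"
    config[:] = golden                 # fallback: rewrite the entire memory
    return "full"


def drain(queue, config, golden):
    """Process queued error events in arrival (FIFO) order."""
    actions = []
    while queue:
        actions.append(handle_event(queue.popleft(), config, golden))
    return actions


if __name__ == "__main__":
    golden = [b"a", b"b", b"c"]
    config = [b"a", b"X", b"c"]
    events = deque([{"frame": 1}])     # the faulty frame address is known
    print(drain(events, config, golden), config == golden)
```

An event without a usable frame address falls through to the full rewrite, which mirrors the stated preference for frame-level scrubbing wherever the location of the error is known.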
Manufacturing and Testing
Project Outcomes
Future Work
References
[1] A. Stoddard, A. Gruwell, P. Zabriskie, M. J. Wirthlin, "A Hybrid Approach to FPGA Configuration Scrubbing", IEEE Transactions on Nuclear Science, vol. 64, no. 1, pp. 497-503, 2017.