Difference between revisions of "Projects:2021s1-13001 Improving the Resilience of Autonomous Satellite Networks against High-Energy Disruptions"

Latest revision as of 17:05, 22 October 2021

Artist's depiction of the Buccaneer Main Mission (BMM) CubeSat in low Earth orbit. Courtesy of Inovor Technologies.

While Field Programmable Gate Arrays (FPGAs) offer a number of benefits for aerospace applications, they are highly susceptible to single event upsets (SEUs) when exposed to high-radiation environments. These upsets can cause undesirable behaviour within the system, and potentially even lead to catastrophic system failure. Students have built upon existing research to develop a ‘scrubber’ circuit which uses an external microcontroller to detect and repair upsets within a Xilinx 7-Series FPGA. This system will be deployed in a space environment as part of the upcoming CubeSat mission, Buccaneer.

Introduction

This project is sponsored by the Defence Science and Technology Group (DST). Students will gain valuable experience working in an industry environment, while supporting Defence capabilities within DST.

Project team

Project students

Jack Nelson
Albert Pistorius

Supervisors

Dr. Said Al-Sarawi
Dr. Dharmapriya Bandara (DST)

Project Objectives

To design and develop a novel system architecture to detect and correct single event upsets, and to restore system operation in a failure event.

To provide sufficient fault protection such that an industry-rated FPGA may be used in space applications for a minimum period of 2 years (in Low Earth Orbit) without loss of functionality.

To provide clearly defined research outcomes which can be incorporated into the development process for future CubeSat launches.

Background

Buccaneer Main Mission

One of the biggest changes in the space domain in recent years has been the move from large satellites, costing billions of dollars and decades in development, to small disposable satellites that can cost less than one million dollars and have development cycles measured in months. This research has led to the popularisation of the Cube Satellite (CubeSat) form factor. These designs measure roughly the same size as a shoe box and have a typical launch mass of 5 to 15 kg.

DST are currently undertaking their own CubeSat mission, called Buccaneer, in collaboration with the University of New South Wales (UNSW), and various other industry, academia and international partners. The Buccaneer program consists of two separate launches. The first satellite, the Buccaneer Risk Mitigation Mission (BRMM), was launched in November 2017 and was used to prove the technologies involved. The second satellite, the Buccaneer Main Mission (BMM), is scheduled for launch in 2023 and will be used to obtain calibration data for the Jindalee Operational Radar Network (JORN).

Single Event Effects

Ionization within a semiconductor due to a single event effect.

Traditional avionics and ground-based electronic systems are shielded from the effects of solar radiation thanks to the Earth's atmosphere and magnetic field. However, systems operating within a space environment do not receive the same level of protection and therefore are subjected to extremely high levels of radiation. This radiation can be produced by a wide variety of phenomena, but cosmic rays and high-energy protons are the most prevalent sources in space applications.

When one of these high-energy radiation particles travels through a semiconductor, the resulting ionisation produces free charge carriers within the substrate. These charge carriers diffuse through the material and alter the shape and size of the depletion region. This causes transient voltages within the gate, and can ultimately lead to a variety of highly disruptive effects known as Single Event Effects (SEEs).

If a transient voltage occurs at the same time as a clock edge, the impulse will be read as an incorrect logic state and the pulse will propagate through combinational logic, where it may become latched into memory. In memory cells and registers this generally appears as a bit-flip, and is referred to as a Single Event Upset (SEU).

While the effects of SEUs are often negligible, they have the potential to cause catastrophic system failure if the upset occurs in a critical system, such as FPGA configuration memory or the POWER/RESET bit in a microcontroller. An upset which interrupts or otherwise prevents the normal operation of a system is known as a Single Event Functional Interrupt (SEFI). These events generally require power cycling the system or reloading the configuration memory to recover normal system operation.

SEE Prevention and Mitigation Strategies

Radiation Hardening

Components and circuits which have been designed and manufactured to be less susceptible to SEEs are known as radiation hardened, or RadHard, components. While these devices provide robust, reliable performance in space applications, they are often orders of magnitude more expensive than their industrial-grade equivalents, and tend to lag roughly a generation behind the most recent developments due to the extensive development and testing required for each design. For this reason, it may be preferable to use industrial-grade components coupled with some kind of error detection and correction (EDAC) subsystem wherever possible. This simplifies the development process and provides a substantial reduction in cost.

Scrubbing

The process of periodically reprogramming an FPGA to avoid an accumulation of errors is known as scrubbing. This can be achieved using a dedicated circuit, commonly known as a scrubber, whose primary purpose is to mitigate errors in the configuration memory before they can disrupt the overall system. These scrubbers are often coupled with a ‘golden’ copy of the configuration memory which is not susceptible to SEEs (e.g. NAND Flash or RadHard memory) and is therefore known to be correct.

A scrubber may be implemented internally within an FPGA using configurable logic blocks, or external to the FPGA using additional hardware such as a microcontroller or secondary FPGA to store and execute the scrubbing logic. As the internal scrubber architecture is housed entirely within the FPGA, it is much faster than an external scrubber, and the lack of additional hardware also reduces space and power requirements. However, this also means that more resources are required on the FPGA to implement the scrubber logic, resulting in less available space for the user’s program.

Xilinx 7-series FPGAs, such as those used with this project, include a built-in Readback CRC circuit which provides single error correction and double error detection (SECDED) capabilities without the need for additional hardware. However, when a multi-bit upset (MBU) is detected by a SECDED circuit, some additional scrubbing capability is required in order to repair the upset. This could involve simply reconfiguring the entire FPGA, or may use a more precise method such as locating and repairing only the erroneous memory frame. Logic such as this can be implemented easily in an external microcontroller, whereas an internal operation would require a softcore processor to be implemented within the FPGA.
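The SECDED behaviour can be illustrated with a toy example. The sketch below uses an extended Hamming (8,4) code, chosen only because it is small enough to read; it is not the code used by the 7-series frame ECC, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2          # makes the total number of ones even
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos              # XOR of the positions of all set bits
    parity_broken = sum(code) % 2 == 1
    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: assumed to be a single flip at index `syndrome`
        # (index 0 is the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits have flipped.
        return "uncorrectable", None
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data
```

A single flipped bit is located and repaired, while two flipped bits are detected but cannot be located, so an outside repair is needed. Note that three flipped bits are indistinguishable from one in this scheme, so the decoder would "correct" the wrong bit.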

The key issue with internal scrubbers is that the scrubbing hardware is just as susceptible to SEEs as the rest of the FPGA. The scrubber circuit is unable to repair itself, and therefore if a fault occurs within this portion of the configuration memory the entire scrubber may fail. Of course, external scrubbers are also vulnerable to SEEs; however, they can be designed using RadHard components to overcome this problem at just a fraction of the cost of a full RadHard FPGA.
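As a rough illustration of scrubbing against a golden copy, the sketch below models the configuration memory as a list of frames and repairs it by overwriting frames from the golden copy. The frame contents and function names are assumptions made for the example; a real scrubber writes frames through the FPGA's configuration interface.

```python
def scrub_frame(config, golden, frame_index):
    """Overwrite one frame of the live configuration with the golden copy."""
    config[frame_index] = golden[frame_index]


def scrub_all(config, golden):
    """Full scrub: rewrite every frame and report which frames had differed."""
    differed = [i for i, (live, good) in enumerate(zip(config, golden)) if live != good]
    for i in range(len(golden)):
        scrub_frame(config, golden, i)
    return differed


# Hypothetical 4-frame bitstream, 8 bytes per frame.
golden = [bytes([i] * 8) for i in range(4)]
config = list(golden)

# Simulate an upset: flip one bit in frame 2 of the live configuration.
upset = bytearray(config[2])
upset[3] ^= 0x10
config[2] = bytes(upset)

print(scrub_all(config, golden))  # [2]
```

Rewriting only the affected frame with `scrub_frame` is faster than `scrub_all`, but it requires the scrubber to know which frame was upset.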

Design Process

System Architecture

Scrubber system architecture diagram.

To maximise the reliability of the scrubber, while maintaining the highest possible performance, a hybrid scrubbing approach was selected. The FPGA's internal Readback CRC mechanism is used to perform continuous readback of the configuration memory and subsequently correct single-bit upsets (SBUs). When the Readback CRC detects an error it cannot correct, including MBUs, it sends the details of the error to an external microcontroller, which can then perform the necessary operations to correct the error. This allows us to utilise the speed of the internal readback hardware, while maintaining the robustness of the external scrubber.

Component Specification

The BMM secondary payload will use a Xilinx UltraScale FPGA in the final design, but a Xilinx 7-Series FPGA is being used for the purposes of this project as it is far more affordable. Xilinx's Soft Error Mitigation (SEM) IP-Core is used to provide the SECDED and fault injection capabilities required to implement the internal portion of the scrubber.

An MSP430FR5969 microcontroller from Texas Instruments is used to store and execute the scrubbing logic for the external portion of the scrubber. This is a 16 MHz microcontroller based on the MSP430 platform, which is popular for its high performance-to-cost ratio and ultra-low power operation. This particular model has undergone thorough radiation testing [2] and can safely be used in a space environment without the need for additional radiation hardening.

The configuration bitstream for the FPGA will be stored in the 'golden' memory bank. This memory is radiation hardened and therefore not susceptible to SEUs, which means that we will always have access to an uncorrupted copy of the bitstream. This is referred to as the 'golden' bitstream. A 3DFN8G08VS1706 8 Gb RadHard NAND Flash memory module from 3D Plus has been selected for this purpose.

Finally, a NOR Flash memory bank acts as a buffer between the microcontroller and the FPGA. While this is not strictly necessary in the design of the scrubber, it has been included to mirror the hardware used on the BMM secondary payload. An S25FL128S 128 Mb SPI NOR Flash memory module from Cypress Semiconductor has been selected for this purpose.

Manufacturing and Testing

3D render of the completed Scrubber PCB.

Our custom PCB and Arty S7-50 connected in a test environment.

In order to test our scrubber design, a two-layer printed circuit board (PCB) was developed by the project team, with the assistance of DST's Research Engineering branch. This PCB was designed to interface with a Digilent Arty S7-50 evaluation board, which contained the FPGA used in our tests.

The schematic and layout for our custom PCB were completed using Altium Designer, and the board was then manufactured and loaded by electronic technicians from DST. The PCB underwent a visual inspection by the project team, but we failed to recognise that the NOR flash IC had been oriented incorrectly. This resulted in a short circuit, pulling the 3.3 V rail down to approximately 0.8 V. The issue was quickly identified and corrected; however, the microcontroller had already been damaged. After several more days of troubleshooting, this issue was also identified and the damaged part was replaced.

The Arty S7-50 also underwent minor modifications to allow it to interface with our custom PCB. The NOR flash IC on the Arty S7-50 was removed and leads were soldered to the exposed pads. These leads could then be connected to the NOR flash IC on our custom board, allowing the microcontroller and FPGA to share the same memory space. Additionally, three thin wires were soldered to the reverse side of the Arty S7-50 board, which allowed the microcontroller to interface directly with the FPGA's core programming pins (DONE, INIT_B and PROGRAM_B).

Scrubbing Process

Error Detection and Classification

To detect and classify errors in the FPGA's configuration memory, the scrubber makes use of the Readback CRC, a built-in feature of 7-series FPGAs. The Readback CRC operates by performing repeated read cycles on the configuration memory. In the first cycle, performed during device initialisation, a 32-bit checksum is calculated for each frame within the configuration memory and inserted into that frame. This allows the Readback CRC to perform a frame-by-frame scan for errors. After the second cycle is completed, a global checksum is calculated, which allows the Readback CRC to perform a global CRC check on the entire device.
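The two levels of checking can be sketched in software as follows. This is only an analogy: it uses `zlib.crc32` from the Python standard library in place of the FPGA's dedicated hardware, and the frame contents and function names are assumptions made for the example.

```python
import zlib


def frame_checksums(frames):
    """First pass: one 32-bit checksum per configuration frame."""
    return [zlib.crc32(frame) for frame in frames]


def global_checksum(frames):
    """A single 32-bit checksum computed over every frame in order."""
    crc = 0
    for frame in frames:
        crc = zlib.crc32(frame, crc)   # continue the running CRC across frames
    return crc


def scan(frames, stored_frame_crcs, stored_global_crc):
    """One readback pass: recompute the checksums and compare with the stored values.

    Returns the indices of mismatching frames and whether the global check failed.
    """
    bad_frames = [
        i for i, frame in enumerate(frames)
        if zlib.crc32(frame) != stored_frame_crcs[i]
    ]
    return bad_frames, global_checksum(frames) != stored_global_crc
```

The per-frame comparison localises an error to a single frame, whereas the global comparison can only report that something in the device has changed.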

Once all the checksums have been calculated and placed into memory, the Readback CRC continues to cycle through the configuration memory, recalculating and comparing checksums. An error is indicated when a checksum calculated from the memory contents does not match the stored checksum. When this occurs, a separate hardware primitive named FRAME_ECC outputs a set of signals that can be used to determine the error's location, its classification, and whether it was detected by the frame-by-frame CRC or the global CRC. These signals are concatenated, along with a CRC checksum and a padding bit, into a 64-bit word called an error event. Each error event is stored in a FIFO queue within the FPGA memory. The FIFO queue uses a signal connected directly to the MSP430 to indicate the presence of an entry within the queue. When this signal is asserted, the MSP430 retrieves the error event and classifies the error type using an algorithm developed by the project team.

Error event structure (left) and associated error classification flowchart (right).
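A minimal sketch of how an error event might be packed into a 64-bit word and queued is given below. The bit positions and field widths are assumptions made purely for illustration (the actual layout is defined by the error event structure figure), only the signals named in the text are included, and the CRC checksum and padding bit are omitted.

```python
from collections import deque
from dataclasses import dataclass

# Assumed field widths and flag positions; not taken from the project design.
FAR_BITS = 26
SYNWORD_BITS = 7
ECCERROR_POS, ECCERRORSINGLE_POS, CRCERROR_POS = 33, 34, 35


@dataclass(frozen=True)
class ErrorEvent:
    far: int                 # Frame Address Register value
    synword: int             # index of the word flagged by the syndrome
    ecc_error: bool
    ecc_error_single: bool
    crc_error: bool


def pack(event):
    """Concatenate the fields of an error event into one integer below 2**64."""
    word = event.far & ((1 << FAR_BITS) - 1)
    word |= (event.synword & ((1 << SYNWORD_BITS) - 1)) << FAR_BITS
    word |= int(event.ecc_error) << ECCERROR_POS
    word |= int(event.ecc_error_single) << ECCERRORSINGLE_POS
    word |= int(event.crc_error) << CRCERROR_POS
    return word


def unpack(word):
    """Recover the fields from a packed error event."""
    return ErrorEvent(
        far=word & ((1 << FAR_BITS) - 1),
        synword=(word >> FAR_BITS) & ((1 << SYNWORD_BITS) - 1),
        ecc_error=bool((word >> ECCERROR_POS) & 1),
        ecc_error_single=bool((word >> ECCERRORSINGLE_POS) & 1),
        crc_error=bool((word >> CRCERROR_POS) & 1),
    )


# The FIFO is modelled with a deque; a non-empty queue plays the role of the
# "entry present" signal wired to the microcontroller.
fifo = deque()
fifo.append(pack(ErrorEvent(0x012345, 42, True, False, True)))
if fifo:
    event = unpack(fifo.popleft())
```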

Error Correction

The algorithm which we have developed is capable of classifying each error into one of five types of upset. Error classification is an important step, as the type of upset determines the scrubbing method used to correct the error. Once the device has been initialised, the scrubber periodically queries the FIFO queue containing the error events. When an error event appears in the queue, the first thing the algorithm checks is the Frame Address Register (FAR). If the FAR is zero, then no actual upset has been detected and the scrubber may move on to the next error event. However, should the FAR point to a valid frame address, the algorithm moves on to the next step. The easiest type to identify is an ‘unknown error’, which has been detected using the global CRC. This is determined using the ECCERROR and CRCERROR signals: if ECCERROR is false and CRCERROR is true, then the upset is an unknown error. Due to the limited information available through the global CRC, an unknown error is corrected by scrubbing the entire device.

However, if the ECCERROR signal is true, then the upset has been detected by the frame-by-frame CRC, and the error event will contain much more information about its location. For example, the ECCERRORSINGLE value indicates whether the error contains an even or odd number of bit upsets. The ECCERRORSINGLE bit is set to true for any error containing an odd number of upsets. Therefore, if the signal is low, the error can be classified as an Even-Numbered MBU.

When a frame contains an odd number of upsets greater than one, the built-in error correction will recognise the error as a single-bit upset and attempt to flip the affected bit. However, because the frame contains multiple upset bits, the syndrome will contain incorrect information regarding the location of the upset. This does not stop the automatic error correction from attempting to repair the upset and, depending on the SYNWORD of the error event, the error correction may attempt to repair a bit outside the bounds of the frame. This is called an Odd-Numbered, Out-of-Bounds MBU.

If the SYNWORD points to a valid word within the frame, then one more check must be performed before the classification of the error is certain: the value of CRCERROR. Due to the manner in which MBUs are reported by the Readback CRC, CRCERROR is set high when an upset cannot be fixed by the built-in correction. So if CRCERROR is true, then the upset is an Odd-Numbered, In-Bounds MBU. If CRCERROR is false, then the error was caused by an SBU and will have already been corrected.
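The decision sequence described in this section can be summarised in a short sketch. It is written in Python for readability rather than as MSP430 firmware, and it is not the project's implementation. Two points are assumptions: the frame length of 101 words, which is the 7-series frame size, and the handling of an event in which both ECCERROR and CRCERROR are false, which the text does not cover.

```python
from enum import Enum

WORDS_PER_FRAME = 101  # assumed: a 7-series configuration frame holds 101 32-bit words


class Upset(Enum):
    NONE = "no upset"
    UNKNOWN = "unknown error (global CRC)"
    EVEN_MBU = "even-numbered MBU"
    ODD_MBU_OUT_OF_BOUNDS = "odd-numbered, out-of-bounds MBU"
    ODD_MBU_IN_BOUNDS = "odd-numbered, in-bounds MBU"
    SBU = "single-bit upset (already corrected)"


def classify(far, synword, ecc_error, ecc_error_single, crc_error):
    """Classify one error event using the checks described in the text."""
    if far == 0:
        return Upset.NONE                     # FAR of zero: nothing to repair
    if not ecc_error:
        # Only the global CRC can have fired. Treating "neither flag set" as
        # NONE is an assumption of this sketch.
        return Upset.UNKNOWN if crc_error else Upset.NONE
    if not ecc_error_single:
        return Upset.EVEN_MBU                 # flag is set only for odd counts
    if synword >= WORDS_PER_FRAME:
        return Upset.ODD_MBU_OUT_OF_BOUNDS    # syndrome points outside the frame
    return Upset.ODD_MBU_IN_BOUNDS if crc_error else Upset.SBU


def needs_full_scrub(upset):
    """Per the text, an unknown error is repaired by scrubbing the entire device."""
    return upset is Upset.UNKNOWN
```

For example, `classify(0x1234, 120, True, True, True)` yields `Upset.ODD_MBU_OUT_OF_BOUNDS`, because word 120 lies beyond the assumed 101-word frame.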

Note that the current algorithm includes only two scrubbing methods: scrubbing the full configuration memory, or scrubbing only the affected frame. There are further methods, specific to the different types of upset, which can optimise the way the system resumes once the upset has been repaired. For example, each type of upset produces a different number of duplicate error events. By classifying the upset, we are able to remove the duplicate error events and avoid wasting time reclassifying upsets which have already been repaired.
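One simple way to discard duplicate events is sketched below. It assumes that duplicates can be recognised by sharing the frame address of the frame just repaired; the text does not specify how duplicates are identified, so this rule, the event representation and the function name are all hypothetical.

```python
from collections import deque


def drop_stale_events(fifo, repaired_far=None):
    """Discard queued events made stale by a repair and return how many were dropped.

    Each event is modelled as a (far, synword) tuple. Passing repaired_far=None
    models a full scrub, after which every queued event is stale.
    """
    before = len(fifo)
    if repaired_far is None:
        fifo.clear()
    else:
        kept = [event for event in fifo if event[0] != repaired_far]
        fifo.clear()
        fifo.extend(kept)
    return before - len(fifo)


fifo = deque([(0x10, 3), (0x10, 3), (0x22, 7)])
drop_stale_events(fifo, repaired_far=0x10)   # leaves only the event for frame 0x22
```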

Project Outcomes

Proof of Concept

While we initially set out to develop a complete system, it quickly became apparent that our initial scope was very ambitious given the time constraints of this project. This, combined with significant delays in the development of the greater BMM system (beyond the scope of our project), made it infeasible to develop a space-ready system.

The decision was then made to revise the scope and aim to develop a 'proof of concept' system, which could later be expanded to meet the remaining system requirements. One of the major changes resulting from this decision was the development of an 'interim' scrubber PCB, rather than integrating our design directly into the BMM secondary payload.

This process is described in the previous section, and allowed us to successfully implement our proposed system architecture at a much smaller scale. This system has all the core elements of the hybrid scrubber architecture and has been designed in such a way that additional functionality can be added at a later date, without interfering with existing functionality.

Future Work

To prepare the developed scrubber system for use with the BMM secondary payload, as intended in the original project scope, there are a number of objectives which will need to be met:

The hardware from our custom PCB will need to be integrated into the design of the BMM secondary payload PCB. This process is already underway, but has faced significant delays.

The functionality of the microcontroller will need to be expanded to include features which were considered out-of-scope for this project. This includes features such as a 'sleep' mode to reduce power consumption while the scrubber is waiting for an error to be detected.

The FPGA software will need to be modified for use with a Xilinx UltraScale FPGA, such as the one used onboard Buccaneer. The Xilinx SEM IP-Core is available for both UltraScale and 7-Series devices, so any changes should be minimal.

Once all of the above objectives have been met, the completed system will undergo radiation testing at the National Space Test Facility (NSTF) in Canberra, Australia to verify system performance in response to real-world upsets. This involves subjecting the scrubber circuit to ion beams of various energy levels to simulate those found in a space environment. It is critical that the secondary payload system passes this test before it can be certified for launch.

References

[1] A. Stoddard, A. Gruwell, P. Zabriskie and M. J. Wirthlin, "A Hybrid Approach to FPGA Configuration Scrubbing," IEEE Transactions on Nuclear Science, vol. 64, no. 1, pp. 497-503, 2017.
[2] S. Guertin, M. Amrbar and S. Vartanian, "Radiation Test Results for Common CubeSat Microcontrollers and Microprocessors," Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, 2015.

@@ Line 2: / Line 2: @@
 [[Category:Final Year Projects]]
 [[Category:2021s1|13001]]
-[[File:BRMM.jpg|300px|thumb|right|Artist's impression of the Buccaneer Risk Mitigation Mission (BRMM) satellite.]]
+[[File:BMM_Cropped.jpg|300px|thumb|right|Artist's depiction of the Buccaneer Main Mission (BMM) CubeSat in low Earth orbit. Courtesy of Inovor Technologies.]]
-While FPGAs offer a number of benefits for aerospace applications, they are highly susceptible to single event effects (SEE) when exposed to high-radiation environments. These upsets can cause undesirable behaviour within the system, and potentially lead to catastrophic system failure. Students will design and develop a novel FPGA configuration scrubber to overcome these effects using an external microcontroller. Radiation testing will be conducted to verify system performance in a (simulated) space environment.
+While Field Programmable Gate Arrays (FPGAs) offer a number of benefits for aerospace applications, they are highly susceptible to single event upsets (SEUs) when exposed to high-radiation environments. These upsets can cause undesirable behaviour within the system, and potentially even lead to catastrophic system failure. Students have built upon existing research to develop a ‘scrubber’ circuit which uses an external microcontroller to detect and repair upsets within a Xilinx 7-Series FPGA. This system will be deployed in a space environment as part of the upcoming CubeSat mission, Buccaneer.
 == Introduction ==
-This project is sponsored by the Defence Science and Technology Group (DST). Students will gain experience in an industry environment, while supporting Defence capabilities within DST.
+This project is sponsored by the Defence Science and Technology Group (DST). Students will gain valuable experience working in an industry environment, while supporting Defence capabilities within DST.
 === Project team ===
@@ Line 21: / Line 21: @@
 * Dr. Said Al-Sarawi
 * Dr. Dharmapriya Bandara (DST)
-==== Advisors ====
-* Dr. Brayden Phillips
 === Project Objectives ===
@@ Line 34: / Line 30: @@
 * To provide clearly defined research outcomes which can be incorporated into the development process for future CubeSat launches.
-== Literature Review ==
+== Background ==
 === Buccaneer Main Mission ===
-One of the biggest changes in the space domain in recent years has been the move from large satellites costing billions of dollars and decades in development, to small disposable satellites that can cost less than one million dollars and have development cycles measured in months. This research has led to the creation and popularisation of the Cube Satellite (CubeSat) form factor. These designs measure roughly the same size as a shoe box and have a typical launch mass of 5 to 15 kg.
+One of the biggest changes in the space domain in recent years has been the move from large satellites, costing billions of dollars and decades in development, to small disposable satellites that can cost less than one million dollars and have development cycles measured in months. This research has led to the popularisation of the Cube Satellite (CubeSat) form factor. These designs measure roughly the same size as a shoe box and have a typical launch mass of 5 to 15 kg.
-DST are currently undertaking their own CubeSat mission, called Buccaneer, in collaboration with the University of New South Wales (UNSW), and various other Industry, Academia and International partners.  The Buccaneer program consists of two separate launches with the first satellite, launched in November 2017, focused on proving the technologies involved and the second satellite, due to be launched in Q3 2022, designed to conduct the main mission.
+DST are currently undertaking their own CubeSat mission, called Buccaneer, in collaboration with the University of New South Wales (UNSW), and various other industry, academia and international partners. The Buccaneer program consists of two separate launches. The first satellite, the Buccaneer Risk Mitigation Mission (BRMM), was launched in November 2017 and was proving the technologies involved. The second satellite, the Buccaneer Main Mission (BMM), is scheduled for launch in 2023 and will be used to obtain calibration data for the Jindalee Operational Radar Network (JORN).
 === Single Event Effects ===
+[[File:Scrubber_SEE.png|350px|thumb|right|Ionization within a semiconductor due to a single event effect.]]
 Traditional avionics and ground-based electronic systems are shielded from the effects of solar radiation thanks to the Earth's atmosphere and magnetic field. However, systems operating within a space environment do not receive the same level of protection and therefore are subjected to extremely high levels of radiation. This radiation can be produced by a wide variety of phenomena, but cosmic rays and high-energy protons are the most prevalent sources in space applications.
-[[File:filename.jpg|300px|thumb|right|SEE Diagram (TBC)]]
+When one of these high-energy radiation particles travels through a semiconductor, the resulting ionisation produces free charge carriers within the substrate. These charge carriers diffuse through the material and alter the shape and size of the depletion region. This causes transient voltages within the gate, and can ultimately lead to a variety of highly disruptive effects known as Single Event Effects (SEEs).
-When one of these high-energy radiation particles travels through a semiconductor, the resulting ionisation produces free charge carriers within the substrate. These charge carriers diffuse through the material, altering the shape and size of the depletion region. This can cause transient voltages within the gate, known as Single Event Transients (SETs), which can ultimately lead to a variety of highly disruptive effects, known as Single Event Effects (SEEs).
-These effects may be classified according to two categories: soft errors, which are reversible and may or may not interrupt normal operation, and hard errors, which are irreversible and can cause catastrophic damage to the device. This categorisation of errors is presented below.
-[[File:filename.jpg|600px|thumb|centre|Types of SEE (TBC)]]
-==== Soft Errors ====
-===== Single Event Upset (SEU) =====
-SEUs are non-destructive, soft errors which cause a change of state within a memory cell. If a SET occurs at the same time as a clock edge, the impulse will be read as an incorrect logic state and the pulse will propagate through combinational logic, where it may be latched into memory. In memory cells and registers this generally appears as a bit-flip.
-If only a single bit-flip occurs, it is classified as a Single Bit Upset (SBU). However, it is also possible for a single, high-energy particle to collide with multiple transistors as it passes through a memory bank. This can cause multiple bit-flips in a single event, and is known as a Multi Bit Upset (MBU). MBUs may cause single errors across multiple words/frames, in which case they may be treated in the same manner as SBUs, however they can also cause multiple errors within a single word, in which case the upset must be handled as a special case, as discussed in later sections.
-===== Single Event Functional Interrupt (SEFI) =====
+If a transient voltage occurs at the same time as a clock edge, the impulse will be read as an incorrect logic state and the pulse will propagate through combinational logic, where it may become latched into memory. In memory cells and registers this generally appears as a bit-flip, and is referred to as a Single Event Upset (SEU).
 While the effects of SEUs are often negligible, they have the potential to cause catastrophic system failure if the upset occurs in a critical system, such as FPGA configuration memory or the POWER/RESET bit in a microcontroller. An upset which interrupts or otherwise prevents the normal operation of a system is known as a Single Event Functional Interrupt (SEFI). These events generally require power cycling the system or reloading the configuration memory to recover normal system operation.
-===== Single Event Latch-Up (SEL) =====
-It is possible for the ionisation of a SEE to create a low impedance path within the circuit and form a parasitic structure within the device. This parasitic structure may 'latch' one or more transistors in a forward-biased state, causing them to conduct current.
-A SEL may be cleared by power cycling the device. However, if the device is allowed to conduct current for too long, or the current through the transistor exceeds device specifications, this fault may cause irreparable damage to the device, including leading to Single Event Burnout (SEB) and Single Event Gate Rupture (SEGR).
-==== Hard Errors ====
-===== Single Event Burnout (SEB) =====
-When a SEL occurs, the resulting current through the transistor causes excessive heating. If the SEL is not cleared quickly, catastrophic failure may occur due to bond wire failure. This is most likely to occur if the power and ground rails are shorted, leading to an extremely high current through the device, although other shorts can lead to equally destructive results.
-===== Single Event Gate Rupture (SEGR) =====
-SEGRs occur when the gate oxide in a transistor is destroyed due to a SEL. This causes device burnout similar to a SEB. Similar effects can occur in non-transistor devices, such as capacitors, in which case it is known as a Single Event Dielectric Rupture (SEDR).
-Microprocessors, FPGAs and memory devices are vital components in any space system, however they are also the most sensitive components to SEEs. Therefore, we need a reliable way to protect these components against SEEs in order for any space mission to be successful.
-One approach to this problem is to use physical manufacturing techniques to reduce device vulnerability to ionisation. Alternatively, we can introduce error detection and correction (EDAC) systems designed to mitigate the effects of SEEs when they occur. As with the physical manufacturing techniques, there are a wide variety of proven approaches to this problem; most of which involve either Triple Modular Redundancy (TMR), an external ‘scrubber’ circuit, or some combination of the two. These solutions, and their many variations, are explored below.
 === SEE Prevention and Mitigation Strategies ===
 ==== Radiation Hardening ====
-Components and circuits which have been designed and manufactured to be less susceptible to SEEs are known as radiation hardened, or RadHard, components. There are many different hardening techniques available depending on the device. For example, a semiconductor may be shielded against radiation using impervious materials, such as aluminium or depleted boron, and mounted to a substrate material with a wide band gap, such as Silicon Carbide or Gallium Nitride, instead of conventional silicon wafers.
+Components and circuits which have been designed and manufactured to be less susceptible to SEEs are known as radiation hardened, or RadHard, components. While these devices provide robust, reliable performance in space applications, they are often orders of magnitude more expensive than their industrial-grade equivalents, and tend to lag roughly a generation behind the most recent developments due to the extensive development and testing required for each design. For this reason, it may be preferable to use industry-grade components coupled with some kind of error detection and correction (EDAC) subsystem wherever possible. This simplifies the development process and provides a substantial reduction in cost.
-While RadHard devices provide robust, reliable performance in space applications, they are often orders of magnitude more expensive than their industrial-grade equivalents, and tend to lag roughly a generation behind the most recent developments due to the extensive development and testing required for each design. For this reason, we would prefer to use industrial-grade components coupled with some kind of EDAC subsystem wherever possible. This simplifies the development process and provides a substantial reduction in cost.
+==== Scrubbing ====
+The process of periodically reprogramming an FPGA to avoid an accumulation of errors is known as scrubbing. This can be achieved using a dedicated circuit, commonly known as a scrubber, whose primary purpose is to mitigate errors in the configuration memory before they can disrupt the overall system. These scrubbers are often coupled with a ‘golden’ copy of the configuration memory which is not susceptible to SEEs (e.g. NAND Flash or RadHard memory) and is therefore known to be correct.
-==== Triple Modular Redundancy ====
+A scrubber may be implemented internally within an FPGA using configurable logic blocks, or external to the FPGA using additional hardware such as a microcontroller or secondary FPGA to store and execute the scrubbing logic. As the internal scrubber architecture is housed entirely within the FPGA, it is much faster than an external scrubber, and the lack of additional hardware also reduces space and power requirements. However, this also means that more resources are required on the FPGA to implement the scrubber logic, resulting in less available space for the user’s program.
-TMR is a method of upset mitigation used to reduce single point failures by triplicating the original circuit (i.e. adding two additional redundant copies of the circuit). All three circuits run in parallel to one another, feeding their outputs into a shared voter circuit, which then compares each circuit’s output and chooses the value output by the majority of the circuits. In the event of an upset occurring in one of the modules, the other two modules will remain unaffected and the voter circuit will still deliver the correct output.
+Xilinx 7-series FPGAs, such as those used in this project, include a built-in Readback CRC circuit which provides single error correction and double error detection (SECDED) capabilities without the need for additional hardware. However, when an MBU is detected by a SECDED circuit, some additional scrubbing capability is required in order to repair the upset. This could involve simply reconfiguring the entire FPGA, or may use a more precise method such as locating and repairing only the erroneous memory frame. Logic such as this can be implemented easily in an external microcontroller, whereas an internal operation would require a softcore processor to be implemented within the FPGA.
-[[File:filename.jpg|300px|thumb|right|SEE Diagram (TBC)]]
+The key issue with internal scrubbers is that the scrubbing hardware is just as susceptible to SEEs as the rest of the FPGA. The scrubber circuit is unable to repair itself, and therefore if a fault occurs within this portion of the configuration memory the entire scrubber may fail. Of course, external scrubbers are also vulnerable to SEEs; however, they can be designed using RadHard components to overcome this problem at just a fraction of the cost of a full RadHard FPGA.
-While single voter TMR configurations, such as the one shown in Figure 3, greatly reduce the risk of failure by reducing the size of the vulnerable area, the voter circuits are just as vulnerable to upsets as the modules themselves. Hence, single voter configurations can still be corrupted by SEEs as they maintain a single point of failure.
-[[File:filename.jpg|300px|thumb|right|SEE Diagram (TBC)]]
+== Design Process ==
-To eliminate this vulnerability, the voter circuit can also be triplicated. This requires the output of each module to be fed to all three voter circuits, which then produce three majority outputs, as shown in Figure 4. Multiple instances of this configuration can be chained together to create a robust circuit.
+=== System Architecture ===
-One comparison between single and triple voter circuits found that the use of triple voters lowered the failure rate from 0.85% to just 0.35% [1].
+[[File:Scrubber_SystemArch.png|350px|thumb|right|Scrubber system architecture diagram.]]
-TMR implementations which make use of additional area to run two redundant copies of the circuit in parallel to the original are known as space-TMR. All three modules execute simultaneously and independently of one another. As such, there is very little overhead on processor performance, but the area and power requirements of the circuit are tripled.
-Time-TMR is an alternative approach which uses temporal methods to triplicate the module execution. Instead of two redundant circuits running in parallel, time-TMR uses a single copy of the circuit to perform the same instruction three times. This can be achieved by locking the program counter to execute the same command three times and storing the results of the first two executions in memory. After the last execution the three outputs are compared, and the majority result is selected. Time-TMR has lower area and power requirement than space-TMR, as it requires little to no extra hardware, however the performance of the circuit is decreased by a factor of three due to the additional instructions being executed sequentially instead of in parallel.
+To maximise the reliability of the scrubber, while maintaining the highest possible performance, a hybrid scrubbing approach was selected. The FPGA's internal Readback CRC mechanism is used to perform continuous readback of the configuration memory and subsequently correct SBUs. When the Readback CRC detects an error it cannot correct, including MBUs, it sends the details of the error to an external microcontroller, which can then perform the necessary operations to correct the error. This allows us to utilise the speed of the internal readback hardware, while maintaining the robustness of the external scrubber.
-Regardless of the selected implementation, the use of TMR in a circuit can greatly improve the reliability and increase the mean time to failure (MTTF). However, if an error in a failed module is not repaired, the whole circuit is at risk of failure if either one of the other two modules experience an upset. In the event that two of the three modules fail simultaneously, the TMR voter will be unable to provide a trustworthy output and thus the TMR will fail. TMR must therefore be accompanied by a repair mechanism in order to substantially improve the MTTF.
+=== Component Specification ===
-==== Scrubbing ====
-The process of periodically reprogramming an FPGA to avoid an accumulation of errors is known as scrubbing. This can be achieved using a dedicated circuit, commonly known as a scrubber, whose primary purpose is to mitigate errors in the configuration memory before they can disrupt the overall system. These scrubbers are often coupled with ‘golden’ copy of the configuration memory which is not susceptible to SEEs (e.g. NAND Flash or RadHard memory) and is therefore known to be correct.
-There are a variety of proven scrubber architectures available in literature, each with its own distinct benefits and drawbacks. The following section will discuss these benefits and drawbacks to identify the architecture which is most suited to this project.
+The BMM secondary payload will use a Xilinx Ultrascale FPGA in the final design, but a Xilinx 7-Series FPGA is being used for the purposes of this project as it is far more affordable. Xilinx's Soft Error Mitigation (SEM) IP-Core is used to provide the SECDED and fault injection capabilities required to implement the internal portion of the scrubber.
-===== Internal vs External Scrubbing =====
+An MSP430FR5969 microcontroller from Texas Instruments is used to store and execute the scrubbing logic for the external portion of the scrubber. This is a 16MHz microcontroller based on the MSP430 platform, which is popular for its high performance to cost ratio and ultra-low power operation. This particular model has undergone thorough radiation testing [2] and can safely be used in a space environment without the need for additional radiation hardening.
-[[File:filename.jpg|300px|thumb|right|SEE Diagram (TBC)]]
+The configuration bitstream for the FPGA will be stored in the 'golden' memory bank. This memory is radiation hardened and therefore not susceptible to SEUs, which means that we will always have access to an uncorrupted copy of the bitstream. This is referred to as the 'golden' bitstream. A 3DFN8G08VS1706 8Gb RadHard NAND Flash memory module from 3D Plus has been selected for this purpose.
-A scrubber may be implemented internally within an FPGA using configurable logic blocks, or external to the FPGA using additional hardware such as a microcontroller or secondary FPGA to store and execute the scrubbing program instructions. As the internal scrubber architecture is housed entirely within the FPGA, it is much faster than an external scrubber, and the lack of additional hardware also reduces space and power requirements. However, this also means that more resources are required on the FPGA to implement the scrubber logic, resulting in less available space for the user’s program.
+Finally, a NOR Flash memory bank acts as a buffer between the microcontroller and the FPGA. While this is not strictly necessary in the design of the scrubber, it has been included to mirror the hardware used on the BMM secondary payload. An S25FL128S 128Mb SPI NOR Flash memory module from Cypress Semiconductor has been selected for this purpose.
-Internal scrubbers make use of Internal Configuration Access Port (ICAP) to perform continuous readback of the configuration memory. The Xilinx Kintex 7-series of FPGAs includes a built-in readback CRC circuit which utilizes this interface to provide single error correction and double error detection (SECDED) capabilities. External scrubbers however, cannot access ICAP and must instead use either the SelectMAP or JTAG interface to perform readback operations.
-When a MBU is detected by a SECDED circuit, some additional scrubbing capability is required in order to repair the upset. This may involve simply reconfiguring the entire FPGA, or may use a more precise method, as explored in Section 3.3.2. Logic such as this can be implemented easily in an external microcontroller, whereas an internal operation would require a softcore processor within the FPGA (e.g. PicoBlaze).
+=== Manufacturing and Testing ===
-The key issue with internal scrubbers is that the scrubbing hardware is just as susceptible to SEE’s as the rest of the FPGA. The scrubber circuit is unable to repair itself, and therefore if a fault occurs within this portion of the configuration memory the entire scrubber may fail. Internal scrubbing can be implemented in combination with TMR, which has been shown to reduce failure rates by 30%, however external scrubbers are still considered to be more robust [7]. Of course, external scrubbers are also vulnerable to SEEs, however they can be designed using RadHard components to overcome this problem at a fraction of the cost of a full RadHard FPGA.
+[[File:Scrubber_PCBRender.png|300px|thumb|right|3D render of the completed Scrubber PCB.]]
-===== Scrubbing Strategies =====
+[[File:Scrubber_TestSetup.jpg|300px|thumb|right|Our custom PCB and Arty S7-50 connected in a test environment.]]
-====== Blind Scrubbing ======
+In order to test our scrubber design, a two-layer printed circuit board (PCB) was developed by the project team, with the assistance of DST's Research Engineering branch. This PCB was designed to interface with a Digilent Arty S7-50 evaluation board, which contained the FPGA used in our tests.
-Blind scrubbing is a relatively simple scrubbing strategy as it does not require error detection. Instead, the entire configuration memory is overwritten with data from the golden memory at fixed intervals. Xilinx Virtex FPGAs include a dynamic reconfiguration capability which allows scrubbing to occur without interrupting the application layer operations. Since error detection is not required, blind scrubbing can performed at a reasonably fast speed. However, it is still considered to be an inefficient method as the scrubber is constantly occupying processing bandwidth to correct memory frames which contain no errors. The advantage is that any errors that occur within the scrubbed memory are guaranteed to be corrected since the entire memory is rewritten.
+The schematic and layout for our custom PCB were completed using Altium Designer, and the board was then manufactured and loaded by electronic technicians from DST. The PCB underwent a visual inspection by the project team, but we failed to recognize that the NOR flash IC had been oriented incorrectly. This resulted in a short circuit, pulling the 3.3V rail down to approximately 0.8V. The orientation fault was quickly identified and corrected; however, the microcontroller had already been damaged. After several more days of troubleshooting, this damage was also identified and the microcontroller was replaced.
-====== Global CRC ======
+The Arty S7-50 also underwent minor modifications to allow it to interface with our custom PCB. The NOR flash IC on the Arty S7-50 was removed and leads were soldered to the exposed pads. These leads could then be connected to the NOR flash IC on our custom board, allowing the microcontroller and FPGA to share the same memory space. Additionally, three thin wires were soldered to the reverse side of the Arty S7-50 board, which allowed the microcontroller to interface directly with the FPGA's core programming pins (DONE, INIT_B and PROGRAM_B).
-Cyclic Redundancy Checks (CRCs) are commonly used when error detection is required for large blocks of data. In 7-series FPGAs, a single 32-bit CRC word is calculated for the entire bitstream. The CRC word is calculated using the remainder of a polynomial division circuit, so a single bit-flip within the frame or bitstream will result in drastically different result which makes it particularly effective at detecting MBU’s. However, CRC is only an error detection tool as it cannot locate where an error is within a block of data, only that an error is present. For this reason, CRCs are often used as the final defence against configuration upsets as correcting any detected errors would require a full frame scrubbing.
+== Scrubbing Process ==
-====== Frame ECC ======
-7-series FPGAs contain built-in error correction codes (ECCs) which provide local EDAC functionality for each individual frame of the configuration memory. Each frame is made up of 100 32-bit data words and a single 32-bit ECC word which can be used for SECDED. Each time a frame is written to or read from, the ECC syndrome is recalculated. If the syndrome is equal to zero, it implies that zero errors were detected, whereas a non-zero syndrome indicates that an error has occurred and the syndrome value can then be used to determine the location of the error within the frame.
+=== Error Detection and Classification ===
-A limitation of the frame ECC is that odd MBUs in a frame will alias a SBU at an incorrect location. The scrubber will then try to repair this false SBU, creating an additional error. This operation will result in a zero syndrome for the ECC, making the scrubber believe that the error has been successfully repaired, but the global CRC will still show that an error has occurred and trigger a scrub of the full configuration memory.
+In order to detect and classify errors in the FPGA's configuration memory, the scrubber makes use of the Readback CRC, a built-in feature of the 7-series FPGAs which operates by performing read cycles on the configuration memory. In the first cycle, performed during device initialisation, a 32-bit checksum is calculated for each frame within the configuration memory and inserted into that frame. This allows the Readback CRC to perform a frame-by-frame scan for errors. After the second cycle is completed, a global checksum is calculated to allow the Readback CRC to perform a global CRC on the entire device.
-====== Readback CRC ======
+Once all the checksums have been calculated and placed into memory, the Readback CRC continues to cycle through the configuration memory, recalculating each checksum and comparing it against the stored value. An error is indicated when a checksum calculated from the memory contents does not match the stored checksum. When this occurs, a separate hardware primitive named FRAME_ECC outputs a set of signals that can be used to determine the error's location, its classification, and whether it was detected by the frame-by-frame CRC or the global CRC.
+These signals are concatenated together, along with a CRC checksum and a padding bit, into a 64-bit word called an error event. Each error event is stored in a FIFO queue within the FPGA memory. The FIFO queue uses a signal connected directly to the MSP430 to indicate the presence of an entry within the queue. When this occurs, the MSP430 retrieves the error event and classifies the error type using an algorithm developed for this project.
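The store-then-compare scan described above can be illustrated with a minimal software sketch. This is not the Readback CRC hardware: <code>zlib.crc32</code> is used purely as a stand-in checksum, and the function names and the 404-byte frame size are assumptions made for the example.

```python
import zlib

FRAME_BYTES = 404  # assumption: 101 words of 32 bits per frame


def frame_checksums(frames):
    """Checksum every frame once, as in the initial read cycle."""
    return [zlib.crc32(frame) for frame in frames]


def global_checksum(frames):
    """Checksum the whole configuration memory in one pass."""
    return zlib.crc32(b"".join(frames))


def scan(frames, stored_frame_sums, stored_global_sum):
    """Recompute all checksums and compare them with the stored values.

    Returns the indices of frames whose checksum no longer matches,
    and a flag saying whether the global checksum mismatches.
    """
    bad_frames = [
        index
        for index, frame in enumerate(frames)
        if zlib.crc32(frame) != stored_frame_sums[index]
    ]
    global_mismatch = global_checksum(frames) != stored_global_sum
    return bad_frames, global_mismatch


if __name__ == "__main__":
    memory = [bytes(FRAME_BYTES) for _ in range(4)]
    frame_sums = frame_checksums(memory)
    global_sum = global_checksum(memory)

    # Simulate a single bit flip in frame 2.
    memory[2] = bytes([0x01]) + bytes(FRAME_BYTES - 1)
    print(scan(memory, frame_sums, global_sum))
```

A frame-level mismatch narrows the fault down to one frame, whereas a global mismatch on its own only says that something in the device has changed.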
-7-series FPGAs also contain an internal built-in hardware mechanism for performing continuous readback and SECDED during device operation, referred to as readback CRC. This mechanism is responsible for computing the ECC syndrome for each frame as well as the global CRC. After all frames are checked, the CRC value is compared against the previously calculated CRC to determine whether an unidentified error has occurred. Since the readback CRC is implemented using dedicated circuitry, it operates much faster than other readback mechanisms which often require additional resources.
-== Proposed System Architecture ==
+[[File:Scrubber_ClassificationFlowchart.png|500px|thumb|center|Error event structure (left) and associated error classification flowchart (right).]]
-[[File:filename.jpg|300px|thumb|right|SEE Diagram (TBC)]]
+=== Error Correction ===
+The algorithm which we have developed is capable of classifying each error into one of five types of upsets. Error classification is an important step as the type of upset will affect the method of scrubbing which will be used when we correct the errors.
+Once the device has been initialized the scrubber will periodically query the FIFO queue containing the error events. Once an error event appears in the queue, the first thing the algorithm checks is the Frame Address Register (FAR). If the FAR is zero, then no actual upset has been detected and the scrubber may move onto the next error event. However, should the FAR point to a valid frame address, the algorithm will move on to the next step.
+One of the first and easiest types to distinguish is an ‘unknown error’, which has been detected using the global CRC. This is determined using the ECCERROR and CRCERROR signals: if ECCERROR is false and CRCERROR is true, then the upset is caused by an unknown error. Due to the limited information available through the global CRC, an unknown error is corrected by scrubbing the entire device.
-To maximise the reliability of the scrubber, while maintaining the highest possible performance, a hybrid scrubbing approach has been selected, as proposed in [1]. The internal readback CRC mechanism will be used to perform continuous readback using the ICAP interface and subsequently correct SBUs. When the readback CRC detects an error it cannot correct, including MBUs, it will pass control over to the external scrubbing hardware which will perform the necessary operations to correct the error and then pass control back to the readback CRC. This allows us to utilise the speed of the internal readback hardware, while maintaining the robustness of the external scrubber.
+However, if the ECCERROR signal is true, then the upset has been detected by the frame-by-frame CRC, and the error event will contain much more information about its location. For example, the ECCERRORSINGLE value indicates whether the frame contains an even or odd number of bit upsets. The ECCERRORSINGLE bit is set to true for any error containing an odd number of upsets. Therefore, if the signal is low, the error can be classified as an Even-Numbered MBU.
-=== Hardware Requirements ===
+When a frame contains an odd number of upsets, the built-in error correction will recognize the error as being a single bit upset and attempt to flip the affected bit. However, due to the frame containing multiple bits upset, the syndrome will contain incorrect information regarding the location of the upset. This does not stop the automatic error correction from attempting to repair the upset and depending on the SYNWORD of the error event the error correction may attempt to repair a bit outside of the bounds of the frame. This is called an Odd-Numbered, Out-of-Bounds MBU.
-The internal scrubber will operate entirely using the FPGA’s readback CRC circuit, and so no external hardware is required. A separate microcontroller and memory bank will be used to store and execute the scrubbing logic for the external scrubber.
+If the SYNWORD points to a valid word within the frame, then one more check is required before we can be certain of the classification: the value of CRCERROR. Due to the manner in which MBUs are reported by the Readback CRC, CRCERROR is set high when an upset cannot be fixed by the built-in correction. So if CRCERROR is true, then the upset is an Odd-Numbered, In-Bounds MBU. If CRCERROR is false, then the error was caused by an SBU and will have already been corrected.
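The classification steps above can be summarised as a short decision function. This is a minimal sketch rather than the project's MSP430 firmware: the <code>ErrorEvent</code> fields, the <code>Upset</code> names and the 101-word frame bound are assumptions made for illustration, ECCERRORSINGLE is taken to be true for an odd number of upsets, and the case where neither ECCERROR nor CRCERROR is set is not covered by the text, so it is treated here as "no upset".

```python
from dataclasses import dataclass
from enum import Enum

WORDS_PER_FRAME = 101  # assumption: valid SYNWORD values are 0..100


class Upset(Enum):
    NONE = "no upset"
    UNKNOWN = "unknown error (global CRC)"
    EVEN_MBU = "even-numbered MBU"
    ODD_MBU_OUT_OF_BOUNDS = "odd-numbered, out-of-bounds MBU"
    ODD_MBU_IN_BOUNDS = "odd-numbered, in-bounds MBU"
    SBU = "SBU (already corrected)"


@dataclass(frozen=True)
class ErrorEvent:
    """Hypothetical decoded view of one error event."""
    far: int                # Frame Address Register value
    ecc_error: bool         # ECCERROR
    ecc_error_single: bool  # ECCERRORSINGLE
    crc_error: bool         # CRCERROR
    syn_word: int           # SYNWORD (word index within the frame)


def classify(event: ErrorEvent, words_per_frame: int = WORDS_PER_FRAME) -> Upset:
    # A zero FAR means no actual upset was detected.
    if event.far == 0:
        return Upset.NONE
    # Detected by the global CRC only: the location is unknown.
    if not event.ecc_error:
        return Upset.UNKNOWN if event.crc_error else Upset.NONE
    # ECCERRORSINGLE low means an even number of upset bits.
    if not event.ecc_error_single:
        return Upset.EVEN_MBU
    # Odd number of upsets: check where the syndrome points.
    if not 0 <= event.syn_word < words_per_frame:
        return Upset.ODD_MBU_OUT_OF_BOUNDS
    # In bounds: CRCERROR separates a failed repair from a true SBU.
    return Upset.ODD_MBU_IN_BOUNDS if event.crc_error else Upset.SBU


if __name__ == "__main__":
    event = ErrorEvent(far=0x1234, ecc_error=True, ecc_error_single=True,
                       crc_error=False, syn_word=17)
    print(classify(event).value)
```

Ordering the checks from least to most specific means each upset type is reached by exactly one path, which mirrors the flowchart structure.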
-An auxiliary FPGA could also be used instead of a microcontroller, but as this component would need to be radiation hardened, it would be much more expensive. Using a microcontroller should also simplify the development process, as the development team far more experience working with microcontrollers compared to FPGAs. The model of microcontroller has not yet been determined, however it will need to have enough GPIO pins to drive the SelectMAP interface on the FPGA. This requires 6 control pins, as well as either 8, 16 or 32 data pins, as detailed in [9-10].
-A RadHard memory bank will also be required to store the ‘golden’ copy of the configuration bitsteam. This is likely to be implemented using NAND Flash memory which is not susceptible to SEEs. The size of this memory has not yet been determined.
-=== Software Requirements ===
+Note that the current algorithm includes only two scrubbing methods: scrubbing the full configuration memory, or scrubbing only the affected frame. There are further methods, specific to each type of upset, which can optimize the way the system resumes once the upset has been repaired. For example, each type of upset produces a different number of duplicated error events. By classifying the upset, we are able to remove these duplicate error events and avoid wasting time reclassifying upsets which have already been repaired.
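As a rough illustration of the two scrubbing methods and the duplicate handling, consider the sketch below. Only the full scrub for unknown errors and the absence of further action for SBUs are stated in the text. The mapping of every MBU type to a frame scrub, the string labels, and the rule that duplicates are pending events sharing the repaired frame's address are all assumptions made for this example.

```python
from collections import deque

FULL_SCRUB = "full"
FRAME_SCRUB = "frame"
NO_SCRUB = "none"

# Assumed labels for the MBU types that carry a usable frame address.
MBU_TYPES = ("even_mbu", "odd_mbu_in_bounds", "odd_mbu_out_of_bounds")


def scrub_method(upset_type):
    """Pick one of the two scrubbing methods for a classified upset."""
    if upset_type == "unknown":
        return FULL_SCRUB   # no location information is available
    if upset_type in MBU_TYPES:
        return FRAME_SCRUB  # assumption: repair only the affected frame
    return NO_SCRUB         # an SBU has already been corrected internally


def drop_duplicates(pending, repaired_far=None):
    """Discard queued events made redundant by a completed scrub.

    With repaired_far=None a full scrub is assumed, so every pending
    event is dropped; otherwise only events for that frame are removed.
    """
    if repaired_far is None:
        return deque()
    return deque(event for event in pending if event["far"] != repaired_far)


if __name__ == "__main__":
    queue = deque([{"far": 0x20}, {"far": 0x20}, {"far": 0x31}])
    print(scrub_method("even_mbu"), list(drop_duplicates(queue, 0x20)))
```

Filtering the queue after each repair keeps the microcontroller from reclassifying events that describe an upset it has just fixed.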
-==== Scrubbing Algorithm ====
+== Project Outcomes ==
-[[File:filename.jpg|300px|thumb|right|SEE Diagram (TBC)]]
+=== Proof of Concept ===
-The scrubbing algorithm will roughly follow the framework proposed in [1]. The internal readback CRC mechanism will be used to perform continuous readback of the configuration memory and correct any SBUs. If a MBU is detected by the readback CRC, an error event is generated and passed to a FIFO queue. The external scrubbing circuit will monitor this queue and initiate scrubbing via the SelectMAP interface when an error event is detected.
+While we initially set out to develop a complete system, it quickly became apparent that our initial scope was very ambitious given the time constraints of this project. This, combined with significant delays in the development of the greater BMM system (beyond the scope of our project), made it infeasible to develop a space-ready system.
-The scrubbing process performed by the external scrubber varies depending on the type of error detected. This process is clearly outlined in [1], but can be represented at a high-level as seen in Figure 8. This process will always scrub a specific frame in the configuration memory if possible, and only scrub the entire configuration memory if there is no alternative. This makes the overall operation of the device as efficient possible while maintaining full functionality.
+The decision was then made to revise the scope and aim to develop a 'proof of concept' system, which could later be expanded to meet the remaining system requirements. One of the major changes resulting from this decision was the development of an 'interim' scrubber PCB, rather than integrating our design directly into the BMM secondary payload.
-The scrubbing architecture described above does not detect or respond to SEFIs, although there are scrubbers capable of handling these events. If time permits, common SEFIs such as the power-on reset SEFI, frame address SEFI and SelectMAP SEFI may be addressed in the scrubbing algorithm, however that is not within the scope of the project at this time.
+This process is described in the previous section, and allowed us to successfully implement our proposed system architecture at a much smaller scale. This system has all the core elements of the hybrid scrubber architecture and has been designed in such a way that additional functionality can be added at a later date, without interfering with existing functionality.
-==== Additional Features ====
+=== Future Work ===
-Implementing a scrubber circuit using an external microcontroller allows for much greater flexibility in terms of debugging. As long as the computational power of the microcontroller is not exceeded, and there are sufficient GPIO pins available, the possibilities are endless. There are two debugging tools which have been identified as being critical to the development process.
+To prepare the developed scrubber system for use with the BMM secondary payload, as intended in the original project scope, there are a number of objectives which will need to be met:
-The first tool is the ability to simulate SEUs for testing purposes. The system will be thoroughly tested in a lab environment before undergoing radiation testing, and therefore we must be able to simulate both SBUs and MBUs in various locations within the configuration memory.
+* The hardware from our custom PCB will need to be integrated into the design of the BMM secondary payload PCB. This process is already underway, but has faced significant delays.
-The second tool is the ability to connect to an external PC for easy FPGA configuration and testing. This would most likely be implemented over a UART interface, which would allow a user to input commands (such as the fault-injection commands outlined above) and retrieve device status information (such as the type and location of any upsets detected).
+* The functionality of the microcontroller will need to be expanded to include features which were considered out-of-scope for this project. This includes features such as a 'sleep' mode to reduce power consumption while the scrubber is waiting for an error to be detected.
-== Development and Testing ==
+* The FPGA software will need to be modified for use with a Xilinx Ultrascale FPGA, such as the one used onboard Buccaneer. The Xilinx SEM IP-Core is available for both Ultrascale and 7-Series devices, so any changes should be minimal.
-== Project Outcomes ==
+Once all of the above objectives have been met, the completed system will undergo radiation testing at the National Space Test Facility (NSTF) in Canberra, Australia to verify system performance in response to real-world upsets. This involves subjecting the scrubber circuit to ion beams of various energy levels to simulate those found in a space environment. It is critical that the secondary payload system passes this test before it can be certified for launch.
 == References ==
-[1] A. Stoddard, A. Gruwell, P. Zabriskie, M. J. Wirthlin, "A Hybrid Approach to FPGA Configuration Scrubbing", Nuclear Science IEEE Transactions on, vol. 64, no. 1, pp. 497-503, 2017.
+# A. Stoddard, A. Gruwell, P. Zabriskie and M. J. Wirthlin, "A Hybrid Approach to FPGA Configuration Scrubbing," IEEE Transactions on Nuclear Science, vol. 64, no. 1, pp. 497-503, 2017.
+# S. Guertin, M. Amrbar and S. Vartanian, “Radiation Test Results for Common CubeSat Microcontrollers and Microprocessors,” Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, 2015.
