Projects:2017s1-111 OTHR Alternative Computing Architecture

From Projects
Revision as of 15:16, 29 October 2017 by A1646441 (talk | contribs) (Comparison of Final Architectures Images Added)
Jump to: navigation, search

Project Team

Daniel Lawson

Supervisors

Dr Braden Phillips
Mr Shane Breandler BAE Systems (External)

Abstract

The aim of this project was to investigate the four computer architectures most commonly used for signal processing and radar to determine which would be most suitable for use in the JORN Phase 6 upgrade using auto code generation tools.

The architectures chosen were CPU, GPU, FPGA and ASIC with each compared across a range of metrics including run-time, utilisation, power, thermal and cost, base on previous work in computer architecture comparisons. Each system was explored for feasibility through experimentation on a wave propagation algorithm to evaluate device parameters and the measurement tools. Feasible architectures were then compared against each other using a second algorithm representative of expected workload where it was found that a GPU based system produced the best results with due to high performance with large data sets and strong developer support through tools and testing

As the project involved mapping a single algorithm to a number of different architectures, insight into the general process of high level synthesis was also discovered. It was found that even architectures designed for maximum portability required some manual rewriting of code to fully take advantage of parallelism with correct output.

Introduction

As part of the latest (Phase 6) upgrade to the Jindalee Operational Radar Network (JORN) project, BAE Systems Australia expect that the demand for radar simulation algorithms will increase over the duration of the networks lifetime. In expectation for this increased demand, this project will investigate the feasibility or a range of different architectures to identify the most suitable architecture for radar simulation. The methods found and results gained from this project will provide a basis for both strategic decisions for the JORN 6 upgrade, as well as a potential design guide for future algorithms.

Project Constrains and Assumptions

In was noted that JORN already possesses a large pre-existing code base designed for a CPU architecture. To reduce engineering costs of mapping the entire code base to a new architecture automatic code generation tools were used. This constraint acknowledges the fact that results may change if optimal implementations of each architecture were designed manually.

Methodology

1) An initial reference algorithm was mapped to each of the four architectures to show that it was possible to generate an implementation on that system with the tools available. 2) The system was then explored and the effect of different device parameters experimented on to determine optimal settings. 3) If the architecture was deemed feasible for use with JORN then the second algorithm that represents the expected system workload was mapped and data recorded for latency, utilisation, power, thermal and cost. 4) Each of these results was then compared against each other and plots produced. From these results and the observations made during implementation, the architecture that was most suitable for JORN was determined.


Reference Algorithm: Wave Simulation
The initial reference algorithm was based on simulating a finite array for a specified number of time steps implementing the wave equations for a Gaussian Pulse source. This algorithm was chosen based on its simplicity with only a few operations on a large data set, the array size and number of simulations steps were both changeable and the application is somewhat related to radar.

Data Flow Diagram for the Wave Simulation Algorithm

Reference Equation.JPG


Comparison Algorithm: Cooley-Tukey FFT
It was known that the JORN system uses a Polyphase channeliser to down convert the received signal, filter out noise and extract the relevant data from each channel in one system block. As this system was not implementable due to time constraints, a portion of the channeliser in the form of the FFT block was used. The Cooley-Tukey algorithm is a well known standard that computes an NxN complex Fourier Series with Nlog(N) time complexity rather than N^2 such as occurred in older implementations. Both a recursive implementation based on divide and conquer and an in-place bit shifting algorithm was generated with the recursive version preferred due to intuitiveness.

FFT Algorithm.JPG

CPU Experimentation

Devices Available: Intel i5-3570k (Ivy Bridge) and Intel Xeon E5-2698 Tools Used: Compiler directives in the form of OpenACC through the PGI Compiler and OpenMP through the Intel Compiler.


Experiment 1: Choice of Compiler with No Directives used

Effect of Compiler on Run Time

A number compilers were chosen and compared against the single core implementation of the reference algorithm where the following was found:

   1. With no optimisation the best performer was the PGI compiler while the worst was Intel.
   2. As optimisation levels increased this difference became saturated and minimal difference in run time was found.


Experiment 2: Effect of Compiler Directives
Using OpenACC compiler directives through the PGI compiler, a number of options for where to accelerate the code was tested with differentiating results as seen by the Table below: CompilerSummary.JPG

Based on these results the following observations were made:

   1. Race conditions were discovered if trying to parallelism arrays over multiple simulations or at the output text file.
   2. Adding directives to small loops can decrease overall performance due to overhead in setting up memory.
   3. Large code blocks correctly parallelised can be sped up by a large amount.


Effect of Number of Cores on Runtime

Experiment 3: Effect of the number of Cores
The final experiment conducted was to evaluate the number of cores. It was found that the compiler can be set to target any number of cores on the system, with difference latency results. As the number of cores were increased, up to a maximum of 4 the total latency decreased. However beyond this point the increased overhead started to dominate the program and thus performance decreased.


GPU Experimentation

Devices Available: NVIDIA GTX 1070 and NVIDIA Tesla K80 Tools Used: CUDA Toolkit Version 8.0. This provided access to the CUDA compiler and Visual Profiler to monitor all metrics.


Experiment 1: Effect of Block Size on Run Time

Effect of Block Size on Runtime

It was found that the GTX 1070 has a maximum block size of 1024 threads per block. As such an experiment was made to compare how the block size affects the run time. It was found that as the block size was increased, then more threads could be actively computed at once hence decreasing the overall latency. This relationship was not linear however as the use of more threads per block added additional overhead.



Experiment 2: Effect of Grid Size on Run Time

Effect of Block Size on Runtime for increasing Grid Sizes

The second parameter that can be controlled in CUDA was the grid size. Each grid is made up of a number of blocks which are not able to communicate through shared memory. Thus using larger grids meant that initially the code performance was faster as block size increased, this speed up eventually reached a maximum value where further increases in block size caused decreased performance due to too much overhead. Using a constant relationship between grid and block size removed this decreases at the expense of worse initial performance.

FPGA Experimentation

Devices Available: Xilinx branded FPGA's Tools Used: Vivado High Level Synthesis (HLS), which is able to take a high level algorithm written in C++ and translate to a Verilog implementation on a target device.


Experiment 1: Developing a Correct Verilog Implementation of Algorithm
A test bench was first created using the initial single core implementation of the wave propagation algorithm. Then once a hardware block was implemented it was ran against the output of the test bench at both high level and as a Verilog design.


Experiment 2: Effect of Clock Target on Synthesis
Once it was shown that the design could be implemented, the next stage was to optimise the design. It was found that Vivado HLS would only optimise the design just enough to pass the clock constraint. Thus to get an optimal design this constraint would have to be reduced until it was no longer possible to reach the target. However decreasing the clock cycle could cause some blocks to take more than one cycle and hence the overall latency could increase. It was found that a clock target of 10ns produced the optimal results.

FPGA Clocks.JPG


Experiment 3: Effect of Choice of FPGA on Results
The final area of experimentation for FPGA was to test which device was the best choice. As simulations were used, we had access to a large number of Xilinx devices, each with different costs and resources. Through running the design against each device, as per the table below the following was found:

FPGA Devices.JPG

   1. Even when using a small algorithm with less than 500 lines of code, synthesis targets were not met by all devices. The two cheapest available models, the series 7 Spartan and Artix both provided fewer `Digital Signal Processing blocks' than needed for synthesis.
   2. The Spartan 7 was the only device unable to reach the timing constraint of 10 ns.
   3. Each device produced a slightly different estimated clock despite having the same target constraint
   4. The most expensive device was not the fastest.

Based on a mixture of cost, utilisation and performance the Kintex7 was chosen as the preferred FPGA device.

ASIC Experimentation

Tools Used: As an ASIC is a fully custom device, Cadence Genus was used to take the Verilog output produced in the FPGA stage and produce a fully customised system.


Experiment 1: Effect of Timing Estimates
Once an initial implementation of the ASIC design was generated, the first experiment was to determine the best timing result. Similar to Vivado HLS, Genus performed lazy optimisation where it will only optimise enough to hit the clock targets. Thus the clock target was varied until an optimal value that meets the timing requirements were produced as seen in the Table below.

ASIC timing.JPG

Based on this a number of observations were made:

   1. The optimal timing value was found to be 250ns to allow for 50ns of slack due to changes in the environment.
   2. As the clock setting was loosened, the total area increased, due to changing between large and small cell sizes based on the available library.
   3. Timing constraints did not affect the leakage power across the design.
   4. As timing constraint was loosened, the dynamic power decreased as large cell sizes were used.


Experiment 2: Effect of Power Constraints
The second constraint that could be modified in the design setup was the ratio of leakage to dynamic power. As the ratio was tightened such that less leakage power was allowed in the design, the overall power decreased. However at very tight bounds (1% leakage), the dynamic power was found to increase suggesting changes to the cells used in the design.

ASIC Power.JPG


Feasibility of ASIC
While it was found that it was possible to generate a successful implementation of the design, due to the high cost of tools and the inability to accurately measure all metrics, it was concluded that an ASIC system was not feasible for use with JORN and hence was not included in the final design.

Comparison of Final Architectures

Through the initial experiments above, it was found that CPU, GPU and FPGA were feasible solutions to use in JORN, while ASIC was not due to high tool costs and not all metrics could be measured. The table below shows a summary of the implementation results of each architecture using the final Cooley-Tukey FFT algorithm.

Final Results.JPG


Run Time
Latency was measured by running the algorithm on each architecture against a growing number of data points, increasing by a power of 2 from an initial 1024 points. In the FPGA case each data size was taken as a separate design and did not include I/O overhead. Both the GPU and CPU only timed the function call thus did not include set-up costs. It was found that the GPU performed slightly better than an FPGA at large data sizes, while both single and multi-core devices performed far worse. At small data sizes (less than 1 million) it was seen that the single core CPU outperformed the GPU and FPGA due to lower overhead. Latency.JPG


Utilisation

Utilisation Results

Utilisation was defined as the total percentage of the system an algorithm uses during run-time. In the software cases, this was measured during run time of the largest data size program through resource monitoring. In the case of GPU this did not include overhead such as memory transfers. For the case of the FPGA, the largest percentage of total resources (Look-up tables, RAM, etc.) for the largest data size was taken. It was found that the worst resource usage occurred with the multi-core CPU due to the compiler explicitly telling the Operating System to use all four cores whilst in all other cases the supervisor (OS or GPU scheduler) allocated the resources as needed. The lost utilisation was found to be the GPU suggesting most of the GPU was idle and a smaller device could be used.


Temperature Results

Thermal Measurements
Temperature was recorded for each device through either the use of MSI Afterburner for the CPU or inbuilt temperature measurement systems. It was noted that the FPGA measurements have low overall confidence due to problems detecting the clock signal in Vivado while simulating the power report. It was found that the FPGA ran the hottest (if the results are considered valid), whilst the single core implementation ran the coolest on average, with the same maximum temperature as the GPU. It is noted however that the GPU fans do not even turn on until temperatures of 60 degrees Celsius are reached thus were considered to be within normal range.

Power Results


Power Measurements
As gaining accurate power measurements of a single component of a larger computer system was not possible with the tools available, power measurements were based on manufacturer recommendations. The summary of power measurements show that the FPGA and CPU consume similar amounts of power whilst the GPU is considerably more power intensive leading to increased running costs.


Cost Considerations

Cost Results

As each generation of architecture is constantly being updated and released, accurate cost measurements were hard to acquire with new models released across the duration of the project. When considering the relative differences in cost, the price of the GPU and FPGA were similar with CPU costing about half the price of the others. However it is likely that both the GPU and FPGA would act as a co-processor and thus still requires a CPU to run further increasing the cost of these architectures.

General Case Mapping Insights

Lessons Learnt

Future Work