In this opening chapter of Part 3, you will be introduced to some advanced topics of SoC design that present many challenges to design engineers, given their multidimensional nature. The chapter continues the practical approach of Part 2 of this book by adding more complex elements to the hardware design. The hardware will also be built so that it can host an embedded Operating System (OS). You will learn how to use advanced hardware acceleration techniques to augment system performance, and you will be equipped with the fundamental knowledge that makes this design step less challenging. You will examine the different ways these techniques can be applied at the system level, and which aspects need to be considered at the architectural stage in the shared data paradigm.
In this chapter, we’re going to cover the following main topics:
The GitHub repo for this title can be found here: https://github.com/PacktPublishing/Architecting-and-Building-High-Speed-SoCs.
Code in Action videos for this chapter: http://bit.ly/3TlFU1I.
We will start from the Electronic Trading System (ETS) SoC hardware design and add more features to it, such as connecting a master interface from within the Programmable Logic (PL) side of the FPGA SoC to the Accelerator Coherency Port (ACP) of the Processing Subsystem (PS) side. We will also make sure that the PS design includes all the hardware features necessary to run an embedded OS such as the timers, the storage devices, the Input/Output (IO) peripherals, and the communication interfaces. Let’s get started:
Figure 10.1 – ETS SoC PS block diagram
Figure 10.2 – Configuring the ACP slave AXI interface in the Vivado IDE
Figure 10.3 – Enabling the ACP transaction checker for the ACP port
Figure 10.4 – Using Run Connection Automation to connect the ACP port
Figure 10.5 – ACP slave port connection wizard
Figure 10.6 – Adding the SD IO to the PS subsystem in the Vivado IDE
Figure 10.7 – Adding the ACP slave port to the MicroBlaze subsystem address map
Figure 10.8 – Validating the ETS SoC design in Vivado IDE
To perform the SoC system performance analysis and study it quantitatively, we need to refer to Chapter 8, FPGA SoC Software Design Flow, specifically, all the details from the section titled Defining the distributed software microarchitecture for the ETS SoC processors. We have provided a full ETS SoC microarchitecture, as shown in the following diagram:
Figure 10.9 – ETS SoC microarchitecture simplified diagram
In this analysis, we aim to understand whether the proposed IPC mechanism between the Cortex-A9 CPU and the MicroBlaze PP processor is optimal and makes use of all the possible hardware capabilities of the SoC FPGA. We would like to figure out whether using the ACP would be a better alternative to the microarchitecture proposal implemented via the PS AXI GP interface. The current IPC mechanism from the Cortex-A9 toward the MicroBlaze PP uses the circular buffer queue hosted in the AXI BRAM, where the Acceleration Requests (AREs) are built by the Cortex-A9 first and then written as entries into the ARE circular buffer. This is followed by a notification from the Cortex-A9 to the MicroBlaze PP in the form of an interrupt, raised by writing to the AXI INTC0. Once the MicroBlaze PP reads the ARE entry, it uses the pointers it contains to retrieve the Ethernet frame to filter from the OCM memory, where the Ethernet DMA engine has written it. The following figure summarizes this interaction with the sequencing of events details:
Figure 10.10 – ETS SoC Cortex-A9 and MicroBlaze PP IPC data flow and notification sequencing
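The producer side of this mechanism can be pictured with a short C sketch. This is a minimal illustration only: the ARE field layout, the queue depth, the base addresses, and the INTC0 notification register shown here are hypothetical placeholders rather than the exact ETS SoC definitions, and the exact notification write depends on how the AXI INTC0 is configured.

#include <stdint.h>
#include "xil_io.h"          /* Xil_Out32() from the Xilinx standalone BSP */

/* Hypothetical layout of one Acceleration Request Entry (ARE). */
typedef struct {
    uint32_t frame_addr;     /* OCM address where the Ethernet DMA wrote the frame */
    uint32_t frame_len;      /* frame length in bytes */
    uint32_t status;         /* marks the entry as valid for the MicroBlaze PP */
    uint32_t reserved;
} are_entry_t;

#define ARE_QUEUE_DEPTH   16u
#define ARE_QUEUE_BASE    0x40000000u   /* placeholder: AXI BRAM base address */
#define INTC0_NOTIFY_REG  0x41200000u   /* placeholder: AXI INTC0 notification register */

static volatile are_entry_t *const are_queue =
        (volatile are_entry_t *)ARE_QUEUE_BASE;
static uint32_t are_wr_idx;             /* producer (Cortex-A9) write index */

/* Cortex-A9 side: build an ARE, post it into the circular buffer hosted in
 * the AXI BRAM, then notify the MicroBlaze PP through the AXI INTC0. */
void post_are(uint32_t ocm_frame_addr, uint32_t len)
{
    volatile are_entry_t *e = &are_queue[are_wr_idx];

    e->frame_addr = ocm_frame_addr;
    e->frame_len  = len;
    e->status     = 1u;                          /* mark the entry as valid */

    are_wr_idx = (are_wr_idx + 1u) % ARE_QUEUE_DEPTH;

    Xil_Out32(INTC0_NOTIFY_REG, 0x1u);           /* interrupt the MicroBlaze PP */
}

In the real design, the queue indices and entry ownership would also be shared so that the MicroBlaze PP can track the consumer side of the queue.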
The ETS SoC Cortex-A9 and MicroBlaze PP IPC data flow and the notification sequencing include the following steps:
In the preceding approximation, we observe that the time taken by the IPC communication associated with a single Ethernet frame, from its reception by the Ethernet DMA to the delivery of the filtering results, is the sum of all the estimated segments' times:
Obviously, this only includes the IPC times, which are interleaved with the packet inspection time by the MicroBlaze; the filtering time is another performance metric we can estimate, but it won't affect the decision to move to using the ACP in our case.
In the ETS SoC design, we have the following:
The estimated IPC required time is therefore as follows:
Since [n x CC (ca9_clk)] is roughly the same even if we modify the IPC mechanism used, we can then use the preceding result as a base figure to compare against.
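Written out in general terms (the actual cycle counts come from the sequencing steps listed earlier, and the clock names here are generic placeholders), the estimate is simply the sum of each segment's cycle count multiplied by the period of the clock it runs from:

Estimated IPC time ≈ [n1 x CC(clk1)] + [n2 x CC(clk2)] + … + [nk x CC(clkk)]

Here, ni is the number of cycles spent in segment i and CC(clki) is the cycle period of that segment's clock domain.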
Another point worth highlighting here is that the IPC mechanism dictates the use of non-cacheable memory, since the system interconnect isn't coherent within the PS nor between the PS and the PL. If cacheable memory regions are used for any of the data involved in the IPC mechanism, cache management operations should be carefully used to flush the Cortex-A9 data cache whenever a descriptor-related field is updated or an ARE is constructed. The use of cache management is fine but will come at a cost to the Cortex-A9 CPU performance. We can study this case as an exercise, but we will need to look at the resulting assembly language instructions to be able to estimate the amount of time taken by the cache management instructions. This is easily done using the emulation platform and running code on it that allows us to view the associated disassembled instructions.
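To give a feel for the cost being discussed, the following is a minimal sketch of what the Cortex-A9 producer path would look like if a shared descriptor or ARE region were made cacheable without the ACP. It uses the standalone BSP cache routine; the function name and the entry size are illustrative placeholders, not part of the ETS SoC code.

#include <stdint.h>
#include "xil_cache.h"       /* Xil_DCacheFlushRange() from the standalone BSP */

/* Hypothetical size of one shared descriptor/ARE entry, for illustration only. */
#define ARE_ENTRY_SIZE   16u

/* With non-coherent PS-PL paths, every update the Cortex-A9 makes to a
 * cacheable shared structure must be followed by an explicit flush so the
 * MicroBlaze PP reads the up-to-date copy rather than a stale one. */
void publish_shared_update(void *shared_entry)
{
    /* ... fill in the descriptor/ARE fields here ... */

    /* This is the extra work (and the extra Cortex-A9 cycles) that the
     * ACP path lets us avoid. */
    Xil_DCacheFlushRange((INTPTR)shared_entry, ARE_ENTRY_SIZE);
}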
In this chapter, the focus for the hardware acceleration is to find ways to closely integrate the PL hardware accelerator with the Cortex-A9 cluster and build a more direct path between the software and the acceleration hardware. This direct path should not pay the penalty of using non-cacheable memory for the data shared between the Cortex-A9 and its hardware accelerators. Using cacheable memory without any cache coherency support from the hardware imposes some performance penalties: such sharing requires some form of synchronization between the Cortex-A9 and the PL accelerators, for example, by using cache maintenance operations in the Cortex-A9 software following every update to the data variables shared between the Cortex-A9 software and the PL hardware accelerator. The way this close integration can be achieved in the Zynq-7000 SoC is through the ACP, which provides a direct coherent path from the PL accelerator to the Cortex-A9 caches and doesn't require the Cortex-A9 to use any cache maintenance operations after accessing a data variable shared with the PL hardware accelerator. The following diagram provides an overview of the envisaged topology to connect the MicroBlaze PP-based PL hardware accelerator to the PS:
Figure 10.11 – ETS SoC Cortex-A9 and MicroBlaze PP IPC via the ACP
The ACP is a 64-bit AXI slave port on the PS that allows masters implemented in the PL to access the Cortex-A9 CPU cluster L2 cache and the OCM memory via transactions that are coherent with the L1 data caches of the Cortex-A9 cores and the shared L2 cache. This is possible because the ACP is connected to the Cortex-A9 CPU cluster Snoop Control Unit (SCU). Within the SCU, address filtering is implemented by default: transactions targeting the upper 1 MB or the lower 1 MB of the 4 GB system address space are routed to the OCM memory, whereas the remaining addresses are routed to the L2 cache controller.
Information
For more information on the SCU address filtering, please consult Chapter 29 of the Zynq-7000 SoC Technical Reference Manual: https://docs.xilinx.com/v/u/en-US/ug585-Zynq-7000-TRM.
The ACP write Issuing Capability (IC) is three transactions, and its read IC is seven transactions.
The ACP read or write requests can be coherent or non-coherent depending on the setting of the AXI AxUSER[0] and AxCACHE[1] signals, where x can be either R for read or W for write transactions. We distinguish the following request types:
As already introduced, the ACP interface allows the use of cacheable memory to store the data variables shared between the Cortex-A9 and the MicroBlaze PP to implement the IPC data flow and its associated notifications. It is worth noting that the coherency is only in one direction; that is, if the PL master were caching any data, the Cortex-A9 would have no way of knowing whether it has changed without the PL master explicitly notifying it. This is because the AXI bus is not a coherent interconnect; it is only because the transactions are routed first through the Cortex-A9 SCU that they can be served coherently from the Cortex-A9 caches. We can now move all the content of the AXI BRAM to the OCM or DRAM memory and access it coherently through the ACP port of the Cortex-A9. The sequencing diagram showing the different data mappings becomes the following:
Figure 10.12 – ETS SoC Cortex-A9 and MicroBlaze PP IPC sequencing when using the ACP
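On the Cortex-A9 software side, the main change is that the shared queue now lives in OCM (or DRAM) mapped as normal cacheable memory, and no flush or invalidate calls are needed around the IPC accesses. Below is a minimal sketch of that setup, assuming a placeholder OCM address for the relocated ARE queue and using the standalone BSP MMU helper; the attribute value is the one commonly used in Xilinx Cortex-A9 examples for cacheable, shareable sections, but it should be checked against the actual BSP translation table.

#include <stdint.h>
#include "xil_mmu.h"         /* Xil_SetTlbAttributes() from the standalone BSP */

/* Placeholder address for the ARE queue relocated from the AXI BRAM into OCM. */
#define ARE_QUEUE_OCM_BASE   0xFFFC0000u

void setup_acp_shared_region(void)
{
    /* Map the 1 MB section holding the shared queue as normal, cacheable,
     * shareable memory (0x15DE6 is the attribute value typically used for
     * such sections in the Cortex-A9 standalone BSP translation table). */
    Xil_SetTlbAttributes(ARE_QUEUE_OCM_BASE, 0x15DE6u);

    /* From here on, the Cortex-A9 reads and writes the queue through its
     * caches, and the MicroBlaze PP's ACP transactions snoop those caches,
     * so no Xil_DCacheFlushRange()/InvalidateRange() calls are required
     * around the IPC accesses. */
}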
Let’s now quantify the ETS SoC Cortex-A9 and MicroBlaze PP IPC data flow and the notification sequencing when using the ACP. We will follow the same method we established earlier. The sequence includes the following steps:
In the preceding approximation, when using the ACP, we observe that the time taken by the IPC communication associated with a single Ethernet frame, from its reception by the Ethernet DMA to the delivery of the filtering results, is the sum of all the estimated segments' times:
Obviously, this only includes the IPC times, which are interleaved with the packet inspection time by the MicroBlaze.
In the ETS SoC design, we have the following:
The estimated IPC required time is therefore as follows:
Since [n x CC (ca9_clk)] is roughly the same even though we have modified the IPC mechanism used, we can compare the preceding result directly against the previously computed one.
We observe that we have improved the IPC time by 571 nanoseconds. We are also able to use shared SoC memories that are cacheable by the Cortex-A9, without needing cache management operations to maintain data coherency with the MicroBlaze PP.
In general, using the ACP is beneficial in system designs such as packet processing, where the reduced IPC latency and the ability to cache shared data greatly improve the application performance.
In this chapter, we added a few hardware elements to the ETS SoC design to prepare it for hosting an embedded OS and improved the IPC communication between the Cortex-A9 CPU and the MicroBlaze PP. We also delved into the system performance analysis by first providing a detailed sequencing diagram of the IPC mechanism and then using it as a base to perform a quantitative study. We used time estimates to measure how much the IPC communication associated with a received Ethernet frame to be filtered by the PL logic would cost. We found that a significant amount of time is needed to move the data and its associated descriptors from the PS domain to the PL domain. We studied the case of the IPC mechanism when using the PS AXI GP port and then studied the alternative solution of using the ACP port of the Cortex-A9. We also exposed the issues of using cacheable memory in these scenarios and how, without the ACP, this requires cache management operations to keep the PS and PL operating on the same copy of the shared data at all times. We computed that the use of the ACP, with its coherency capability, greatly reduces the time consumed by the IPC mechanism. We also looked at most of the technical details of the ACP to provide a good overview of its features and supported transactions.
In the next chapter, we will explore more advanced topics of FPGA-based SoCs, with a focus on security.
Answer the following questions to test your knowledge of this chapter: