6

What Goes Where in a High-Speed SoC Design

In this opening chapter of Part 2 of this book, you will learn about the SoC architecture definition phase, which precedes the design and implementation phase of the required SoC. This phase is performed by the system architects, who translate a given set of product requirements into a high-level description of the SoC design to be accomplished. We will also detail the criteria used during the functional decomposition stage, in which a trade-off is reached between what is better suited for implementation in hardware and what is a better target for software implementation. Finally, we will provide an overview of SoC system modeling, which can use many available tools and environments.

In this chapter, we’re going to cover the following main topics:

  • The SoC architecture exploration phase
  • SoC hardware and software partitioning
  • Hardware and software interfacing and communication
  • Introducing the Semi-Soft algorithm
  • Early SoC architecture modeling and the golden model

The SoC architecture exploration phase

This is the beginning of the purely technical stage in a project aiming to design an SoC. Usually, the technology to use isn't specified at this stage, but there could be clear business reasons, as covered in Chapter 1, Introducing FPGA Devices and SoCs, that make the FPGA the primary target technology for the SoC to design. These reasons can include (but are not limited to) the following:

  • The expected production volume is low.
  • The time to market is tight and the product opportunity window is narrow.
  • The non-recurring engineering (NRE) cost of an FPGA technology is within the project’s budget.
  • Using an ASIC in this project has no competitive advantage; it only brings disadvantages and adds project uncertainty and risk.

There could be many other reasons for making the FPGA the best target for the SoC to design, which then benefits from the time to market and flexibility such a choice offers. At this stage of the thinking process, the architecture to use is also centered around the Zynq-7000 SoC or the Zynq UltraScale+ SoC with a starting processing system (PS) block. We still need to perform our feasibility study and confirm that the project objectives and the FPGA SoC capabilities are in line. Each FPGA SoC device family has a known set of features, capabilities, and associated device costs. An initial idea of the SoC capabilities to include and the major intellectual properties (IPs) to design is put together based on the input from the marketing team, which conducts many interviews with the key target customers of the product under design and gathers the key product requirements.

It is at this moment that broad guidelines from the business team are needed to define the cost of the overall solution and product that the company is making. However, the business decision could be delayed slightly until the SoC cost is defined, the performance requirements are refined, and the time to design it is estimated. It is at this stage that the overall product cost could be revisited to figure out alternative business strategies if further adaptations are required.

The overall product system architecture definition and the business strategy are outside the scope of this book, so we assume these have already been set up by the time the SoC architecture definition is considered. The remaining tasks are to decide which SoC can provide the required interfaces and deliver the required product performance. This SoC architecture exploration phase will look at all the possible alternatives and the associated design costs in terms of the SoC hardware and software. Then, it will provide a report to the business decision-makers to approve the optimal choice for the project that meets the company strategy. There could be other ways and methods by which this decision is made according to the company's culture, but that decision process is outside the scope of this book.

Additionally, there is also a need to define how much custom work will be needed to complement the SoC PS block's functionalities, features, and performance. A detailed report on this will be provided as part of the architecture specification, and we will cover it in the following section, where we address the SoC hardware and software partitioning stage. It's expected that IPs will be designed for this project, implemented in the programmable logic (PL) side of the FPGA, and integrated with the PS block of the SoC. This adds another criterion for choosing between SoCs within the same family or across the available families.

As introduced in Chapter 1, Introducing FPGA Devices and SoCs, some SoCs specifically target some applications and industries in terms of their capabilities and features, as well as the availability of certain packages and certifications in their portfolio. This choice should also make sure that enough input/output (I/O) is available in the SoC package to use, and that the SoC can physically interface with the neighboring integrated circuits (ICs) in terms of supported PHYs and inter-device communication protocols.

These interfaces include the ones covered in Chapter 4, Connecting High-Speed Devices Using Busses and Interconnects, and Chapter 5, Basic and Advanced SoC Interfaces, as well as those briefly introduced in Chapter 1 of this book, such as the following:

  • The Serial Peripheral Interface (SPI)
  • The Inter-Integrated Circuit (I2C)
  • The Peripheral Component Interconnect Express (PCIe)
  • The Universal Asynchronous Receiver Transmitter (UART)
  • The General-Purpose Input/Output (GPIO)
  • The Ethernet controller
  • The DDR memory controller
  • The flash controller

The list should also state which version of the protocol standard these interfaces support, as well as what kind of backward compatibility is available if the standard revisions and generations differ between the SoC's available interfaces and the neighboring ICs on the product's electronics board. The physical characteristics of the FPGA device in terms of temperature range, package size, mechanical properties, and all required certifications for the industry vertical targeted by the product need to be considered before we start any architecture design work. For these, the list of device requirements should have been established by the project team at the technology target selection phase, before the SoC architecture design phase. This selection process is also outside the scope of this book. To give you an easier and clearer way to decide between the different Xilinx SoCs covered by this book, four tables (Tables 6.1 to 6.4) are provided in this chapter summarizing their available features.

There are clear differences in the processing capabilities of the Zynq-7000 SoC and the Zynq UltraScale+ SoC PS blocks. There is no Cortex-R processor cluster in the Zynq-7000 SoC FPGAs, but applications that require some form of real-time profile and a deterministic processor can use a soft processor built in the PL side of the FPGA, such as the MicroBlaze, as introduced in Chapter 1. Some RTL integration is then needed to interface it to the Cortex-A9 cluster over the ACP or simply over the AXI ports available for bridging between the PS block and the PL block in both directions. We will cover this design methodology and these techniques later in this book when we cover the available co-processing methods in both Zynq SoC FPGA families.

SoC PS block processor features

This subsection lists the PS block processors and their features per FPGA SoC type.

SoC PS Cortex-A CPU

The following table lists the Cortex-A CPU features for both the Zynq-7000 and Zynq UltraScale+ SoC FPGAs:

| Feature | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| Cortex-A cluster type | A9 ARMv7-A | A53 ARMv8-A |
| Cores per cluster | Single core or cluster of 2 cores | Cluster of 2 or 4 cores |
| ISA | AArch32 with 16-bit and 32-bit Thumb instructions | AArch64, AArch32, and 32-bit Thumb instructions |
| Core performance | 2.5 DMIPS/MHz | 2.3 DMIPS/MHz |
| Operation modes | Both SMP and AMP | Both SMP and AMP |
| L1 caches | 32 KB instruction cache, 32 KB data cache | 32 KB instruction cache, 32 KB data cache |
| L1 instruction cache associativity | 4-way set-associative | 2-way set-associative |
| L1 data cache associativity | 4-way set-associative | 4-way set-associative |
| L2 cache | 512 KB shared | 1,024 KB shared |
| L2 cache associativity | 8-way set-associative | 16-way set-associative |
| SIMD and FPU | NEON | NEON |
| Accelerator coherency port | ACP | ACP |
| Core frequency per speed grade or device type | (-1): up to 667 MHz; (-2): up to 766 MHz or 800 MHz; (-3): up to 866 MHz or 1 GHz | (CG): up to 1.3 GHz; (EG): up to 1.5 GHz; (EV): up to 1.5 GHz |
| Security | TrustZone | TrustZone and PS SMMU |
| Interrupts | GIC v1 | GIC v2 |
| Debug and trace | CoreSight | CoreSight |

Table 6.1 – The SoC PS block’s processor features

Memory and storage interfaces

The following table lists the memory controllers and the storage interfaces available in both the Zynq-7000 and Zynq UltraScale+ SoC FPGAs:

| Feature | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| DRAM controller ports | 4 | 6 |
| DRAM controller standards | DDR2/LPDDR2, DDR3/DDR3L | DDR3/DDR3L/LPDDR3, DDR4/LPDDR4 |
| DRAM controller maximum capacity | 1 GB | 8 GB (up to 16 GB for DDR4) |
| DRAM controller ranks | 1 | 1 and 2 |
| QSPI flash controller | I/O and linear modes | DMA, linear, and SPI modes |
| OCM controller | 256 KB SRAM and 128 KB BootROM | 256 KB SRAM |
| SRAM controller | 1 | 1 |
| NOR flash controller | 1 | N/A |
| NAND flash controller | 1 | 1 |
| SD/SDIO controller | 1 | 1 |
| SATA controller | N/A | 1 |

Table 6.2 – The SoC PS block’s memory and storage controllers

Communication interfaces

The following table lists the communication interfaces available in both the Zynq-7000 and Zynq UltraScale+ SoC FPGAs:

| Interface | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| USB | Device, host, and OTG | Device, host, and OTG |
| Ethernet | 2x (10/100/1,000 Mbps) | 4x (10/100/1,000 Mbps) |
| SPI | 2 | 2 |
| CAN | 2 | 2 |
| I2C | 2 | 2 |
| PCIe | N/A in the PS | Gen2 x1/x2/x4 within the PS |
| UART | 2 | 2 |

Table 6.3 – The SoC PS block’s communication interfaces

PS block dedicated hardware functions

There are also many dedicated hardware functions built into the PS of both Zynq SoCs, as summarized in the following table:

| Feature | Zynq-7000 SoC | Zynq UltraScale+ SoC |
| --- | --- | --- |
| DMA | 8x channels | 8x channels LPD, 8x channels FPD |
| ADC | 1x XADC | 1x SYSMON |
| GPU | N/A | ARM Mali-400 MP2 |
| PMU | N/A | MicroBlaze-based PMU |
| Display controller | N/A | 1x VESA DisplayPort v1.2a |

Table 6.4 – The SoC PS block’s dedicated hardware functions

FPGA SoC device general characteristics

In the SoC architecture exploration phase, we also need to know the following:

  • What are the available densities in terms of PL logic elements?
  • Which packages are available for the target FPGA SoC?
  • What is the maximum I/O offered per package?
  • What are the temperature ranges for a specific FPGA SoC package?
  • What are the speed grades of the SoC FPGAs that we can potentially target?

The speed grade classifies the FPGA SoCs in terms of the maximum frequency the design elements in the PS and PL can run at; there is a higher price tag attached to a higher FPGA SoC speed grade. These details are too vast to summarize here, but they are all provided by Xilinx in the FPGA product selection guide documentation. You are encouraged to read the Zynq-7000 SoC FPGAs Product Selection Guide at https://docs.xilinx.com/v/u/en-US/zynq-7000-product-selection-guide and the Zynq UltraScale+ SoC FPGAs Product Selection Guide at https://docs.xilinx.com/v/u/en-US/zynq-ultrascale-plus-product-selection-guide to learn more.

The key defining elements of the specific FPGA SoC to choose are the PS block's required processing power and the number of FPGA logic elements necessary to build the custom hardware that implements the key acceleration functions or the company IPs. It is logical to start with the most cost-effective option and build on it by adding more features as more details are unveiled, moving to the next target device when needed. Since the hardware and software partitioning phase hasn't been accomplished yet, it is a good idea to perform a technical assessment to estimate which main functions will be executed in hardware because they are impractical to run in software, or because they are company or third-party IPs forming part of the overall SoC architecture. Any new company IP that is to be built into the hardware will also be listed, which will give us an idea of the direction in which the choice should be heading.

For this chapter, a practical example is the best approach to apply the ideas and suggestions listed thus far since we will be implementing this first simple but complete design in the next chapter. Therefore, we will start with the SoC architecture, which is based on a Zynq-7000 SoC since we would like this exercise to be simple and illustrative.

We need to perform the following tasks:

  1. Decide upon the SoC architecture.
  2. Perform the hardware and software partitioning.
  3. Define the hardware-to-software and software-to-hardware interfaces and communication protocols.
  4. Configure the SoC PS with the necessary IPs for our processor system architecture.
  5. Build the custom IP section (if any).
  6. Integrate the custom IP into the overall SoC hardware design.
  7. Simulate any part of the design that is subject to the RTL design flow.
  8. Build some test software to verify that the hardware is functioning as expected.
  9. Implement the design.
  10. Finally, if we have a demo board available, we can use it to check that the design is fully functional by downloading the configuration binary and executable files.

In Part 3 of this book, we will be building more complex SoCs that require higher processing power, therefore potentially targeting the Zynq UltraScale+ SoC. We will also look at performing system profiling to help us implement custom acceleration hardware that we will also integrate into the SoC. We will also build the necessary software drivers for these custom functions, which will run under an operating system such as Embedded Linux.

To conclude the architecture exploration phase, we must come up with a list of possible solutions with their advantages and disadvantages. By doing this, we can compare them in terms of cost, design and verification effort, and time. We usually discuss these with the business stakeholders and decide on the best option. After this, we start mapping the functions of the SoC onto the processing elements (PEs), which are either hardware blocks or software functions, and start the next phase of the SoC architecture development.

SoC hardware and software partitioning

As mentioned in the previous section, the architecture development task of mapping the functions of the SoC to the available PEs is better exercised using a practical example.

A simple SoC example – an electronic trading system

To perform hardware and software partitioning, let’s design an SoC that implements the intelligent parts of a dummy financial Electronic Trading System (ETS). It’s a dummy since it isn’t a system that we can use to perform financial transactions in an Electronic Trading Market (ETM) managed by a specific private organization; it just behaves like one. Most financial ETSs are co-located in a data center managed by a private organization. The interface between the ETM and the trading clients is a network switch where the trading clients plug in their network interfaces, which connect them to the ETM. The market itself is a network of servers that broadcasts the market data over, for example, the user datagram protocol (UDP) and receives trade transactions and their confirmation over, for example, the transmission control protocol/internet protocol (TCP/IP). Both UDP and TCP/IP are part of the Internet Protocol (IP) suite, commonly known as the TCP/IP stack, and are widely used in computer networking and communications architectures. Further details on the TCP/IP protocol suite can be found at https://datatracker.ietf.org/doc/html/rfc1180.

In the envisaged ETM wider network architecture, every client is connected to the trading market switch over Ethernet interfaces and listens to the market information over the broadcast UDP packets. The information is specific to the ETM itself and is formatted as it sees fit. Our system should be able to cope with any formatting used and be able to adapt to it if it's updated by the ETM. The ETM organization uses trading symbols that represent a financial market product. Information about these symbols, such as the asking prices, the volumes, the transactions on a given symbol, and many other details, is broadcast by the ETM. This information is what the ETM has pre-formatted and what the clients listen to while electronic trading is open. The clients receive the ETM information over UDP via their Ethernet interfaces from the ETM switch, decode its content, filter it, make decisions in software (or in software accelerated by hardware), and then inform the market of any buying or selling decisions over their TCP/IP connection. As we can imagine, the client with the fastest round-trip communication and processing is the one who can maximize their gains and be the first to exploit a good opportunity, such as a low asking price for a symbol they are targeting and wish to invest in. This race to zero latency is at the heart of the low-latency and high-frequency ETMs that exploit the superior technological solutions a trading organization may possess to drive the market in one direction or another. The following diagram depicts the simplified electronic financial market concept:

Figure 6.1 – Electronic trading data center concept

For our SoC architecture development exercise, we need to design an SoC that can do the following:

  • [CP1]: Listen to the market data over the UDP stream using the Ethernet port.
  • [CP2]: Extract the information from the received Ethernet packets.
  • [CP3]: Make sure that the UDP packet is valid and that its fields are as expected.
  • [CP4]: Understand the ETM protocol and act accordingly.
  • [CP5]: Distinguish between the trading information data and the ETM systems management data that’s broadcast over a different UDP stream but over the same Ethernet link.
  • [CP6]: Extract the information of interest to the software (running on the Cortex-A9) and feed it to our trading algorithm, which is also running in the software.
  • [CP7]: Consume the data fed via the receiving mechanism and use it to feed our trading algorithm implemented in the software.
  • [CP8]: Maintain a database where traded symbols of interest are stored alongside the associated information (date, time, volume, and price); a sketch of such a record layout is shown after this list. This database should be stored in non-volatile memory.
  • [CP9]: Make trade decisions when certain conditions are met for a specific trading symbol.
  • [CP10]: Send the trading decision over the TCP/IP connection to the ETM organization using the Ethernet link to the market interface trading switch.
  • [CP11]: Maintain a database of its transactions, including the date, time, volume, and price.
  • [CP12]: The transactions database is also stored in non-volatile memory, but it is encrypted using a private key stored in a secure location.
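To make [CP8], [CP11], and [CP12] more concrete, here is a minimal sketch of the kind of records such databases could hold. The field names, widths, and encodings are illustrative assumptions for this book's example, not the ETM's actual format.

```cpp
// Illustrative record layouts for the symbol and transaction databases
// ([CP8] and [CP11]). Field names and widths are assumptions for this
// sketch, not the real ETM message format.
#include <array>
#include <cstdint>

struct MarketDataRecord {
    std::array<char, 8> symbol;   // trading symbol, padded with '\0'
    uint64_t timestamp_ns;        // date/time as nanoseconds since epoch
    uint32_t volume;              // traded volume for this update
    uint32_t price;               // price in fixed-point ticks
};

struct TradeRecord {
    std::array<char, 8> symbol;
    uint64_t timestamp_ns;
    uint32_t volume;
    uint32_t price;
    uint8_t  side;                // 0 = buy, 1 = sell
    // [CP12]: this record is encrypted before being written to non-volatile
    // memory, using a key held in secure storage.
};
```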

The electronic trading SoC is part of a trading server:

  • [CP13]: This is hosted in one of the server PCIe slots and communicates with the server over a PCIe endpoint integrated as part of the SoC.
  • [CP14]: The trading server sets the trading algorithm policies and manages the databases previously mentioned using PCIe over a predefined protocol between the server and the SoC software.
  • [CP15]: The server sends regular updates to the SoC trading algorithm and can modify the way UDP packet filtering is performed. This is needed to make sure the SoC trading engine is always up to date with the ETM organization’s latest policies and updates.

The preceding list of capabilities is just the bare minimum the SoC must implement for the ETS design. The objective is to design a system with the lowest latency possible and use all the possible techniques to make such a trading system as fast and secure as possible. We will have many other questions as we design the SoC architecture, but these should concern details rather than fundamental architecture issues. We will start by putting all the listed capabilities and the options we can use to implement them in a table. Then, we will cover the overall implementation for every capability before seeing whether it is better suited to software, hardware, or a combination of both in our low-latency ETS. The following table classifies these capabilities and their possible implementation options:

| Capability | Hardware | Software | Both |
| --- | --- | --- | --- |
| [CP1] | ✓ | ✓ | ✓ |
| [CP2] | ✓ | ✓ | ✓ |
| [CP3] | ✓ | ✓ | ✓ |
| [CP4] | ✓ | ✓ | ✓ |
| [CP5] | ✓ | ✓ | ✓ |
| [CP6] | ✓ | ✓ | ✓ |
| [CP7] | ✓ | ✓ | ✓ |
| [CP8] |  | ✓ |  |
| [CP9] | ✓ | ✓ | ✓ |
| [CP10] | ✓ | ✓ | ✓ |
| [CP11] |  | ✓ |  |
| [CP12] | ✓ | ✓ | ✓ |
| [CP13] | ✓ | ✓ | ✓ |
| [CP14] | ✓ | ✓ | ✓ |
| [CP15] | ✓ | ✓ | ✓ |

Table 6.5 – Electronic trading SoC capabilities classification

As you can see, all the capabilities except database management can be implemented in hardware only, software only, or using both software and hardware. We have excluded database management as it is a background task not worth designing a mechanism for from scratch in RTL. At this stage, we are also looking to assess the effort required to design a capability in hardware, since we are assuming that it will be faster when executed in a hardware PE specifically designed for it. Most of the time, this is true, but the real question to ask is whether this speed-up is worth the effort, the time spent, and the implementation cost. To answer it, we need to draw a back-to-back data communication pipeline for the ETS, highlight the critical paths in this communication pipeline that are sensitive to time, and understand what it would mean to move a capability out of software and execute it in a custom hardware PE designed for it. We are assuming that the easiest implementation (but not necessarily the optimal one in terms of speed) is putting everything in software. Once a capability is moved from software to hardware, we can understand what the interface between the two looks like. This is important and by itself requires further consideration.

If we look at our ETS, the back-to-back communication pipeline that is sensitive to latency runs from the time a UDP packet hits the Ethernet port of the SoC to the moment a TCP/IP packet with a trade decision is sent from the SoC back to the switch acting as the interface with the ETM, via the same or another Ethernet port. The fastest solution is to design everything in hardware and use IPs that are designed for very low latency, but that is the spirit of a pure high-frequency trading system and defeats our book's objective, which is learning how to design SoCs and integrate PL-based IPs with them using both the PS and PL blocks of the FPGA. We would still like to design a low-latency solution, but we don't want to design an exotic TCP/IP stack and middleware, or any hardware-based real-time operating system (RTOS), in RTL. Therefore, we will use a simpler approach where we apply hardware acceleration when it makes sense to meet our design objectives. Our back-to-back data communication pipeline does the following:

  • Receives the UDP packet over the Ethernet interface.
  • Checks the content of the UDP packet and makes sure it is a valid packet that matches either the management or the market data format, and that it has no errors in it, by computing the electronic trading protocol (ETP) error-checking code, such as a 32-bit Cyclic Redundancy Check (CRC32), over the whole packet (see the validation sketch after this list). The UDP payload transports the ETP, which defines the format of the payload, the length of the packet, a packet number, a timestamp, all the fields along with their length and meaning, and a CRC32 that's computed over the packet and inserted at the end of it. We will revisit the design of the ETP later in this book when we start building, simulating, and integrating the ETS.
  • If the content is a management packet, then it's sent to the management queue, which is associated with a task, called the management task, that runs in software on the Cortex-A9; the task is also notified using an interrupt.
  • If the content is market data, then it is filtered using the information set by the trading algorithm task, which runs as a high-priority software task on the Cortex-A9.
  • If the preceding market data is a symbol our trading algorithm is interested in and for which the filter returned true (such as a specific price being lower than the price set in the filter), then the filtering PE should put it in the urgent buying queue and notify the trading algorithm task via a high-priority interrupt.
  • If the preceding market data is a symbol our trading algorithm is interested in and for which the filter returned true (such as a specific price being higher than the price set in the filter), then the filtering PE should put it in the urgent selling queue and notify the trading algorithm task via an even higher-priority interrupt.
  • All market data symbols of interest to the trading algorithm that met the filter conditions (a symbol matching a set of symbols) are also put in a queue, called the market data queue, and sent to the market database manager task running in the software on the Cortex-A9 to store them in non-volatile memory. There is also a low-priority interrupt associated with this notification that’s sent to the market database manager task.
  • If a trade is performed on a target symbol by the trading algorithm task, then the trading algorithm task puts it in a trade queue and notifies the trade database manager task via a software interrupt to add it to the trade database, but first, it will encrypt it.
  • The trade database manager puts the trade information it received from the trading algorithm task securely in an encryption queue that another task will manage as a low-priority task in software, or even as a task subject to offload to the hardware. For this, we need to implement a PE in the FPGA PL block. This is done not to accelerate it, but to free up heavy CPU usage in such a task and leave the performance for urgent tasks, such as dealing with the trading algorithm and interaction with the trading queues. This is subject to profiling to decide what to do and can always be changed at a subsequent design stage as an improvement or an optimization of the CPU usage and its shared resources.
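As a first illustration of the validation and filtering steps above, the following minimal C++ sketch computes a CRC32 over an ETP payload and applies a simple buy filter. It assumes the ETP uses the common reflected CRC-32 polynomial (0xEDB88320) and a fixed-point price encoding; both are assumptions for this example, and in the real design this logic is what the PL filtering PE implements in RTL.

```cpp
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320), assumed here to be the
// error-checking code appended to every ETP packet.
uint32_t crc32(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b) {
            if (crc & 1u) crc = (crc >> 1) ^ 0xEDB88320u;
            else          crc >>= 1;
        }
    }
    return ~crc;
}

// Hypothetical filter set by the trading algorithm task: returns true when a
// symbol of interest is quoted below the configured buy threshold.
struct PriceFilter {
    uint64_t symbol_id;   // symbol encoded as a 64-bit key
    uint32_t max_price;   // buy when the asking price drops below this value
};

bool match_buy(const PriceFilter& f, uint64_t symbol_id, uint32_t ask_price) {
    return symbol_id == f.symbol_id && ask_price < f.max_price;
}
```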

The server side will not be included in our simple example design of this electronic trading SoC, but it may be a good addition to cover PCIe. This will be covered in the advanced applications of this book in Part 3.

The following diagram summarizes our electronic trading receive communication path:

Figure 6.2 – Electronic trading receive communication path

The preceding diagram illustrates the frontend of the electronic trading communication path, where the data of interest for trading can be easily highlighted, analyzed, and then mapped to either a hardware PE, a software PE, or a combination of both. The backend communication path is assumed to be well adapted to software, so it will be implemented as such; this is why it isn't shown in the preceding diagram. It needs a TCP/IP stack to communicate trades to the ETM. The TCP/IP protocol is an enormous task to implement in hardware, so it isn't worth the cost and effort; we could find out whether it is available as a third-party IP to license for our project, but the latency gains it may give us aren't important for this specific ETS. The software trading algorithm has three high-priority tasks that run on the Cortex-A9 (a minimal sketch of one such queue-consumer task is shown after the list):

  • The Sell task, which consumes entries from the urgent selling queue.
  • The Buy task, which consumes entries from the urgent buying queue.
  • The server interface task, via which algorithm policy changes from the traders are forwarded to the electronic trading algorithm.
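The following is a minimal sketch of how one of these tasks could drain its urgent queue. The single-producer/single-consumer ring buffer and the send_order() hook are assumptions standing in for the real queue shared with the hardware PE and the TCP/IP trade path.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical entry pushed by the hardware filtering PE into the urgent
// selling (or buying) queue.
struct UrgentQuote {
    uint64_t symbol_id;
    uint32_t price;
    uint32_t volume;
};

// Single-producer (hardware PE) / single-consumer (software task) ring buffer.
// In the real design this would live in OCM or PL BRAM shared with the PE.
template <std::size_t N>
struct SpscQueue {
    UrgentQuote entries[N];
    std::atomic<uint32_t> head{0};  // written by the producer
    std::atomic<uint32_t> tail{0};  // written by the consumer

    bool pop(UrgentQuote& out) {
        uint32_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;  // empty
        out = entries[t % N];
        tail.store(t + 1, std::memory_order_release);
        return true;
    }
};

// Placeholder: the real implementation submits the order over the TCP/IP
// stack to the ETM ([CP10]).
void send_order(const UrgentQuote&) { /* assumed trade-submission hook */ }

// Body of the Sell (or Buy) task: drain the urgent queue and fire orders.
void sell_task(SpscQueue<256>& urgent_sell_queue) {
    UrgentQuote q;
    while (urgent_sell_queue.pop(q)) {
        send_order(q);
    }
}
```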

As mentioned previously, we are not focusing on the server side of the system in this architecture definition example.

Now, let’s analyze these paths and highlight the critical paths in this ETS to evaluate which PE type or combination will be used to implement them.

From Figure 6.2 and the previous descriptions, we can conclude that the critical paths for our low-latency trading SoC are as follows:

  • [Path 1]: From the reception of a UDP packet to the issuing of a sell decision.
  • [Path 2]: From the reception of a UDP packet to the issuing of a buy decision.

Therefore, to build a low-latency solution, all the PEs required to fulfill the tasks involved in these two critical paths need to be implemented in hardware; putting them in software isn't going to be fast enough. Now, let's revisit Table 6.5 and update it so that it suits our hardware and software partitioning exercise:

| Capability | Hardware | Software | Both |
| --- | --- | --- | --- |
| [CP1] |  |  |  |
| [CP2] |  |  |  |
| [CP3] |  |  |  |
| [CP4] |  |  |  |
| [CP5] |  |  |  |
| [CP6] |  |  |  |
| [CP7] |  |  |  |
| [CP8] |  |  |  |
| [CP9] |  |  |  |
| [CP10] |  |  |  |
| [CP11] |  |  |  |
| [CP12] |  |  |  |

Table 6.6 – ETS capabilities partitioned between the hardware and software PEs

Please note that we have removed [CP13], [CP14], and [CP15] from the table as we are not designing the Server PCIe integration side of the SoC in this initial architecture design example.

We still need to define how we will be managing the split between hardware PEs and their peer software PEs, and what the interfacing between them should be. We also need to implement a way by which Ethernet packets that aren't of interest to our hardware acceleration paths are returned to be consumed by the software directly, as would be the case if no hardware acceleration were implemented. We also need to study the consequences of our partitioning on the Ethernet controller software driver and make the necessary changes to adapt it to our new hardware design. All these important details will be covered in the following section.

Hardware and software interfacing and communication

The acceleration path introduced in the end-to-end communication path between the ETM switch and the software running on the Cortex-A9 is only necessary for the UDP packets received over the Ethernet port. We have no interest in accelerating anything else received over this Ethernet port, such as Ethernet management frames and ARP frames, so we would like the acceleration path to be transparent to them. However, we can't achieve this by simply returning the received Ethernet packets that aren't of interest to our acceleration hardware to the Ethernet controller. This is because the receive buffer within the Ethernet controller is a FIFO that Ethernet frames can't be written back to once they've been consumed from it. One approach would be to let the Cortex-A9 processor perform the Ethernet frame receive management and pass the frames to the hardware acceleration PE, which behaves like a packet processor. This is the approach that introduces the least work for the implementation, specifically for the Ethernet controller software driver and for the mechanism by which the hardware acceleration PE is notified of the arrival of new frames. The hardware DMA interrupt is hard-wired to the Cortex-A9 GIC, and as such the easy way to pass notifications is via the Cortex-A9, which will be acting as a proxy in this respect.
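From the Cortex-A9 side, the proxy notification could look like the following sketch, where the accelerator's doorbell register is exposed in the PL over an AXI general-purpose port. The base address, offsets, and register names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical register map of the PL packet-filtering accelerator, exposed
// to the Cortex-A9 over an AXI general-purpose port. Addresses and offsets
// are assumptions for this sketch. Under Linux, this region would typically
// be mapped with mmap() on /dev/mem or via a UIO driver.
constexpr uintptr_t ACCEL_BASE      = 0x43C00000;  // example PL base address
constexpr uintptr_t DOORBELL_OFFSET = 0x00;        // write: number of frames
constexpr uintptr_t STATUS_OFFSET   = 0x04;        // read: accelerator busy/done

static inline volatile uint32_t* reg(uintptr_t offset) {
    return reinterpret_cast<volatile uint32_t*>(ACCEL_BASE + offset);
}

// Called from the Ethernet DMA receive interrupt handler on the Cortex-A9:
// the processor acts as a proxy and forwards the notification to the PL PE.
void notify_accelerator(uint32_t frames_received) {
    *reg(DOORBELL_OFFSET) = frames_received;  // ring the doorbell with the count
}
```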

Data path models of the ETS

In the data reception model of the Ethernet controller in the Zynq-7000 SoC, which uses the controller's DMA, the Cortex-A9 software sets up the DMA engine within the Ethernet controller to transfer the received Ethernet frames to a destination memory. Then, the DMA engine notifies the Cortex-A9 via an interrupt. We want this Ethernet frame transfer to be done to memory located within the PL, to the OCM, or to the DDR DRAM. When the Cortex-A9 receives the interrupt from the Ethernet DMA engine indicating that the received Ethernet frames have been transferred to the nominated memory, it rings a doorbell register within the hardware accelerator domain to notify it that a number of Ethernet frames have been received, and how many of them. We first want to filter these Ethernet frames, extract the UDP frames from them, and leave the other Ethernet frames where the Cortex-A9 expects them. Only after the hardware acceleration engine has filtered the frames and the non-UDP packets are left in the receive memory is a second notification sent to the Cortex-A9 via another interrupt. This subsequent interrupt tells the Cortex-A9 that Ethernet frames have been received that it needs to deal with; the first DMA receive interrupt was just for the Cortex-A9 to forward the frames to the hardware filtering and processing PE.

Breaking the received data path model using the Ethernet DMA engine and its associated interrupt notification still has another aspect that we must deal with, which is the data exchange interface mechanism between the Ethernet DMA engine and the Cortex-A9 software. This exchange is done via the DMA descriptors that are prepared by software for the DMA engine to use when data is received by the Ethernet controller. As mentioned in Chapter 4, Connecting High-Speed Devices Using Busses and Interconnects, these descriptors specify the data's local destination and the next descriptor pointer, as well as an important field known as the Ownership field. The Ownership field tells the DMA engine that the DMA descriptor is valid, has been consumed by the software from its previous use, and is ready to be reused.
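To make the Ownership mechanism more tangible, here is a simplified descriptor sketch. It is loosely inspired by the Zynq GEM receive buffer descriptors, but the exact layout and bit positions are illustrative assumptions, not a reference for the real controller.

```cpp
#include <cstdint>

// Simplified receive DMA descriptor sketch. The bit positions are
// illustrative assumptions, not the real GEM descriptor layout.
struct RxDescriptor {
    uint32_t addr_word;    // bits [31:2] buffer address, bit 0 = Ownership
    uint32_t status_word;  // frame length, start/end-of-frame flags, etc.
};

constexpr uint32_t OWNERSHIP_BIT = 0x1;

// Set by the DMA engine when the descriptor holds a freshly received frame
// waiting to be consumed by the software or by the hardware PE.
inline bool holds_received_frame(const volatile RxDescriptor& d) {
    return (d.addr_word & OWNERSHIP_BIT) != 0;
}

// Flipped by the consumer (the Cortex-A9, or the DDRT on behalf of the
// hardware PE) once the frame has been dealt with, marking the descriptor
// as recyclable for the DMA engine.
inline void mark_recyclable(volatile RxDescriptor& d) {
    d.addr_word &= ~OWNERSHIP_BIT;
}
```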

We need to make sure that DMA descriptor recycling, which happens after the associated received data has been consumed, is still performed properly for the UDP packets that are now consumed by the hardware PE rather than by the Cortex-A9 software. This is fine, as we simply need to connect the hardware PE to the DMA Descriptors Recycling Task (DDRT) in software through a DMA Descriptor Recycling Queue (DDRQ), via which the hardware accelerator engine sends recycling requests to the DDRT running on the Cortex-A9. Another issue we have introduced in this reception model over the DMA engine of the Ethernet frames is that the consumer of the Ethernet frames is not only the Cortex-A9 but both the hardware acceleration engine and the Cortex-A9. This model introduces out-of-order consumption of the Ethernet frames, but this isn't a problem as long as the DMA descriptor pool has enough entries to absorb the reordering. That is, by the time we get to reuse a contiguous set of DMA descriptors for subsequent receive operations, both the software and the hardware will have finished consuming their associated data; even if the hardware finishes first and flips the Ownership field of its descriptors, the software will have had time to catch up.
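A back-of-the-envelope check of whether the shared descriptor pool is large enough can be sketched as follows; the line rate, worst-case consumer lag, and safety margin are assumptions to be replaced with profiled values.

```cpp
#include <cstdint>

// Rough sizing of the shared receive descriptor pool: the pool must absorb
// frames arriving at line rate for as long as the slowest consumer may lag.
// All inputs below are assumptions for this sketch.
constexpr uint64_t LINE_RATE_BPS     = 1'000'000'000;  // 1 Gbps Ethernet
constexpr uint64_t MIN_WIRE_FRAME_B  = 64 + 20;        // min frame + preamble/IFG
constexpr double   WORST_LAG_SECONDS = 0.001;          // slowest consumer lag: 1 ms
constexpr double   SAFETY_MARGIN     = 2.0;            // allow for bursts/jitter

constexpr double   max_frames_per_s  = LINE_RATE_BPS / (8.0 * MIN_WIRE_FRAME_B);
constexpr uint32_t min_pool_entries  =
    static_cast<uint32_t>(max_frames_per_s * WORST_LAG_SECONDS * SAFETY_MARGIN);
// Roughly 1.49 million minimum-size frames per second at 1 Gbps, so about
// 3,000 descriptors here; real ETP market-data frames are larger, so the
// pool can usually be smaller.
```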

If there is a dependency between the received Ethernet frames, this shouldn't be an issue as the ETP guarantees that no change is introduced in the protocol until all the clients have acknowledged the reception of its update and have adjusted to it. This condition requires a DMA descriptor pool large enough that the slowest consumer is allowed to finish while the Ethernet controller keeps up with the speed at which the Ethernet frames are arriving; this can easily be computed from the maximum rate at which we expect the Ethernet frames to arrive, as sketched above. The Ethernet interface's default receive path includes the controller hardware, the associated receive DMA engine, and the Ethernet software drivers. To minimize the changes in this receive path, which now includes the added UDP packet filtering, the hardware acceleration engine will deal with the filtering task on a per-set basis, sending a job completion notification to the Cortex-A9 when it has consumed all the Ethernet frames received within the last set. When the hardware acceleration engine finds a UDP packet within the received Ethernet frames set from the ETM, it does the following:

  • It processes the Ethernet frame.
  • It puts its DMA descriptor in the DDRQ for the DDRT to mark it as consumed (returned to the pool of DMA descriptors).
  • It sends the notification to the Cortex-A9 processor when it reaches the last frame in the last received set.

All non-treated Ethernet frames (which are not UDP packets) are left for the Cortex-A9 to consume, as follows:

  • They remain in situ in memory, where the Ethernet DMA receive engine put them.
  • Their DMA descriptors are still marked as owned by the DMA engine, which means that the Cortex-A9 can consume them.
  • When the Cortex-A9 processes them, it flips the Ownership field of their DMA descriptors to mark them as recyclable.

In this model, the processing flow looks almost the same as it would without hardware acceleration of the UDP packet processing. We are just delaying the Cortex-A9's processing of the non-UDP packets until the hardware acceleration engine has had the chance to inspect the received Ethernet packets, deal with the ones found to be market data UDP packets, notify the market trading tasks about the UDP packets of interest, and request the DDRT to mark the DMA descriptors of these frames as consumed. The engine then sends a final notification to the Cortex-A9 so that it can deal with the remaining non-UDP packets, if any, recycle their corresponding DMA descriptors back to the pool, and call the Ethernet driver to finish the receive flow.

The changes in the hardware model will require some associated changes to be made in the Ethernet controller software model, but this can easily be done since the Ethernet receive path can function in delayed mode. Therefore, the changes we are introducing by adding the filtering path can be added almost silently from the Ethernet controller driver's perspective. What changes is just what the Cortex-A9 does when receiving the Ethernet DMA receive notification: instead of calling the Ethernet frame processing directly, it passes the frames to the hardware PE and then waits for the hardware PE's notification before processing any remaining received Ethernet frames that weren't accelerated. It is in this window that the UDP packets found to be urgent and matching the filters are dealt with by the hardware PE. To summarize, the hardware-to-software interfacing and communication process goes as follows (a condensed software-side sketch is shown after the list):

  1. The software sets the Ethernet receive DMA engine’s DMA descriptors to use a destination memory to be within the PL block, OCM, or DDR memory.
  2. The software waits for the Ethernet receive DMA engine’s interrupt.
  3. The software prepares a request for acceleration by filling in information about the DMA descriptors to use and the number of receive packets.
  4. The software notifies the hardware accelerator engine that there are received Ethernet frames within the memory that need to be filtered and their UDP packets processed.
  5. The hardware accelerator engine processes the Ethernet frames that are found to carry UDP packets from the electronic trading market.
  6. The hardware accelerator populates the urgent buying queue and notifies the trading algorithm task via an interrupt when it finds a matching symbol.
  7. The hardware accelerator puts the DMA descriptor associated with the UDP packets in the DDRQ for the DDRT to recycle them.
  8. The hardware accelerator populates the urgent selling queue and notifies the trading algorithm task via an interrupt when it finds a matching symbol.
  9. The hardware accelerator puts the DMA descriptor associated with the UDP packets in the DDRQ for the DDRT to recycle them.
  10. The hardware accelerator populates the market data queue and sends a notification to the market database manager task via an interrupt when it finds a matching symbol.
  11. The hardware accelerator puts the DMA descriptor associated with the UDP packets in the DDRQ for the DDRT to recycle them.
  12. Once the Ethernet frames have all been inspected, the hardware accelerator notifies the Cortex-A9 via an interrupt that it has processed all the Ethernet frames from the last set it was asked to deal with.
  13. The Cortex-A9 checks whether any received Ethernet frames haven't been dealt with yet and, if so, consumes them.
  14. The Cortex-A9 flips the Ownership field of the DMA descriptors for the Ethernet frames that it dealt with itself.
  15. The Cortex-A9 notifies the Ethernet drivers or the DDRT to perform any DMA descriptor recycling actions.
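A condensed software-side sketch of these steps (roughly steps 1 to 4 and 12 to 15) is shown below. The driver hooks, queue helpers, and their names are hypothetical placeholders for the real Ethernet driver and accelerator interface, declared here only to show the control flow.

```cpp
#include <cstdint>

// Hypothetical hooks standing in for the real Ethernet driver, accelerator
// interface, and DDRT; names and signatures are assumptions for this sketch.
void     setup_rx_descriptors(uintptr_t dest_memory_base);   // step 1
uint32_t wait_for_eth_dma_interrupt();                       // step 2: frames received
void     ring_accelerator_doorbell(uint32_t frame_count);    // steps 3-4 (see earlier sketch)
void     wait_for_accelerator_done_interrupt();              // step 12
uint32_t consume_remaining_frames();                         // step 13: non-UDP frames
void     release_descriptors(uint32_t frame_count);          // step 14: flip Ownership
void     notify_ddrt_recycle();                              // step 15

// One pass of the receive flow as seen from the Cortex-A9.
void ethernet_receive_pass(uintptr_t rx_buffer_base) {
    setup_rx_descriptors(rx_buffer_base);               // destination in PL/OCM/DDR
    uint32_t received = wait_for_eth_dma_interrupt();   // DMA engine raised its IRQ
    ring_accelerator_doorbell(received);                // hand the set to the PL PE

    // The PL PE filters the UDP market-data packets, feeds the trading queues,
    // and asks the DDRT to recycle their descriptors (steps 5 to 11, not shown).

    wait_for_accelerator_done_interrupt();              // PE finished the whole set
    uint32_t leftover = consume_remaining_frames();     // non-UDP frames, if any
    release_descriptors(leftover);                      // give them back to the DMA
    notify_ddrt_recycle();                              // let the driver/DDRT finish
}
```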

The following diagram illustrates these steps:

Figure 6.3 – ETS low-latency path hardware to software interaction

Introducing the Semi-Soft algorithm

The Semi-Soft algorithm idea isn't new; it has been around for decades, ever since FPGA technology became prevalent in the electronics industry. It combines hardware and software from the initial architecture development stages and is used to implement compute algorithms.

Using the Semi-Soft algorithm approach in the Zynq-based SoCs

When targeting a Zynq SoC, the hardware and software split between the PEs is considered from the start of the project, rather than implementing all the computing algorithms in software, profiling them, pinpointing the bottlenecks (if any), and then hardware-accelerating them. There is nothing wrong with the latter approach – it is just limiting in what can be achieved when targeting an ASIC technology to implement the design. Once the overall SoC architecture and microarchitecture have been defined, it is hard to modify the design at a later stage and introduce the optimal communication paths between the hardware and the software. When the SoC targets an FPGA, as in our case, the simpler approach is to consider acceleration from the architecture phase, as we have done, so that we can put Semi-Soft packet processing compute algorithms in place. This means we can modify them easily in both the hardware and the software, so long as we have the appropriate interfacing between the two.

A Semi-Soft algorithm is a computational methodology that uses both the hardware and the software from the start of the architecture design phase. In our example, we are focusing on packet processing, which can greatly benefit from the Ethernet packets being inspected in parallel. Since all the UDP fields can be checked in parallel, many packets can also be checked simultaneously, and many decisions and results can be found and forwarded to the Cortex-A9 in parallel. Although the Cortex-A9 will have to deal with them sequentially when using a single-core CPU, it is capable of processing them in parallel if we deploy a dual-core CPU cluster, where each core is dedicated to a trade (buy or sell) queue.
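As one possible mapping of this idea under Embedded Linux, the buy and sell tasks could each be pinned to their own Cortex-A9 core, as in the following sketch; the task bodies and core assignments are assumptions for illustration.

```cpp
// GNU/Linux-specific sketch: pin the buy and sell tasks to separate
// Cortex-A9 cores so each core is dedicated to one trade queue.
#include <pthread.h>
#include <thread>

static void pin_to_core(std::thread& t, int core_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// Placeholder task bodies; the real ones drain the urgent buying and selling
// queues (see the earlier queue-consumer sketch).
void buy_task_main()  { /* consume the urgent buying queue */ }
void sell_task_main() { /* consume the urgent selling queue */ }

int main() {
    std::thread buy(buy_task_main);
    std::thread sell(sell_task_main);
    pin_to_core(buy, 0);   // core 0 dedicated to buy decisions
    pin_to_core(sell, 1);  // core 1 dedicated to sell decisions
    buy.join();
    sell.join();
    return 0;
}
```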

Consequently, we continue maximizing parallelism and making faster trade decisions. The possibilities are enormous when we consider the compute as a mixture of generic software sequential compute resources of the PS and the parallel and multi-instance possible hardware acceleration engines of the PL. These could augment the PS capabilities to make a powerful custom hybrid processor. In our ETS, the Cortex-A9 manages the Ethernet DMA engine, recycles the DMA descriptors on behalf of both the hardware and software, and runs the middleware for the TCP/IP stack and other Ethernet link management. On the other hand, the hardware performs the Ethernet frame filtering and UDP packet processing, and then notifies the software to complete any remaining packet processing that does not need to be low latency or is complex to perform in hardware.

Using system-level alternative solutions

A design methodology that can flexibly move PEs in and out of the PL, replacing elements of the software tasks or moving them back into software, can be compared with other, more complex, hardware electronic systems. One example is using a PCIe interconnect hosting a hardware accelerator such as a GPU, which can be used to accelerate parallel compute operations for the main processor. We can think of another architecture that uses discrete electronic components that can do the packet processing task. A network processor (NP) hosted over the PCIe interconnect of a server platform can perform the packet processing tasks but, as can be imagined, the cost is at least an order of magnitude higher, the design framework is complex, and the required technical skills to put such a system together are also demanding in comparison to using an FPGA-based SoC. Also, a server-based PCIe approach won't be low latency; it may be a good approach for high-volume traffic processing, but not for low latency, since we have added more layers to the communication stack between the software and the hardware accelerators. Moreover, such a platform can only efficiently perform the packet processing; it does not help if we need to accelerate another aspect of the software tasks that system profiling found to be inefficient. The FPGA SoC provides us with flexibility and a paradigm shift that brings early system implementation into the architecture design phase, which is what we call the Semi-Soft algorithm concept.

Introduction to OpenCL

Other solutions, such as OpenCL, provide a whole framework for performing this mixing by defining hardware functions and integrating them as kernels. OpenCL tries to abstract compute operations from the compute engines performing them and defines methods to match the two. You can find out more about OpenCL at https://www.khronos.org/opencl/.

Exploring FPGA partial reconfiguration as an alternative method

Another methodology that was also used in the past is to exploit the FPGA partial reconfiguration feature, where a block of the FPGA is defined as a reprogrammable function and used on demand by software running on the same FPGA or externally and interfaced to the FPGA. This means that the design can define many computational functions that can fit within this reprogrammable block of the FPGA, and when a computation acceleration is required, the reprogrammable block is reconfigured to perform this specific hardware acceleration task. This reprogrammable block should have a predefined interface with software through which data and commands are provided to the hardware function by software, and results and statuses are generated by the hardware accelerator hosted in this reprogrammable block.
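From the software side, the predefined interface to such a reprogrammable block often reduces to a small command/status register window plus data buffers. The following sketch assumes a hypothetical register layout; loading the partial bitstream itself (through the PCAP or the FPGA manager) is not shown.

```cpp
#include <cstdint>

// Hypothetical command/status window of a reconfigurable partition (RP) in
// the PL. Offsets and bit meanings are assumptions for this sketch; the real
// interface is fixed by the static design wrapped around the RP.
constexpr uintptr_t RP_BASE   = 0x43C10000;  // example AXI-mapped base address
constexpr uintptr_t RP_CMD    = 0x00;        // write: command code
constexpr uintptr_t RP_ARG    = 0x04;        // write: argument / buffer address
constexpr uintptr_t RP_STATUS = 0x08;        // read: bit 0 = done, bit 1 = error

static inline volatile uint32_t* rp_reg(uintptr_t off) {
    return reinterpret_cast<volatile uint32_t*>(RP_BASE + off);
}

// Issue one job to whichever accelerator is currently loaded in the RP and
// busy-wait for completion. The matching partial bitstream is assumed to have
// been loaded beforehand.
bool run_reconfigurable_accelerator(uint32_t command, uint32_t buffer_addr) {
    *rp_reg(RP_ARG) = buffer_addr;
    *rp_reg(RP_CMD) = command;                  // kick off the hardware function
    while ((*rp_reg(RP_STATUS) & 0x1) == 0) {}  // poll the done bit
    return (*rp_reg(RP_STATUS) & 0x2) == 0;     // false if the error bit is set
}
```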

More information on the FPGA partial reconfiguration can be found at https://docs.xilinx.com/v/u/2018.1-English/ug909-vivado-partial-reconfiguration.

Early SoC architecture modeling and the golden model

When targeting an FPGA for an SoC implementation, reimplementing the design takes a matter of days or sometimes even just hours, whether it involves changing the behavior of an RTL block or dropping in a verified IP. Doing the same when the target technology is an ASIC is a lengthy process with significant costs in terms of resources and budget.

We will introduce system modeling in the closing section of this chapter, as it is part of architecture development in general and is also becoming a time-to-market solution for system implementations that take a long time to accomplish, specifically when targeting an ASIC technology. The industry is exploiting the availability of detailed system models of processors, interconnects, and all the IP elements of an SoC to build virtual SoCs. These are system models that emulate the functional behavior of an SoC in software simulation. Many frameworks are available that can put system models together. These system models behave functionally like the target hardware SoC, and software can be prototyped on them before the SoC hardware and electronics board are available. Timing accuracy is sometimes sacrificed to speed up the execution of the system model in simulation when it is used for early software development. In this case, we produce a version that is called a virtual prototype (VP); quantum or simulation time is sometimes fast-forwarded to the areas of the prototyped software execution of interest to the developer.

The major frameworks of system modeling are as follows:

  • System modeling using Accellera SystemC and TLM2.0
  • System modeling using Synopsys Platform Architect
  • System modeling using the gem5 framework
  • System modeling using the QEMU framework and SystemC/TLM2.0

The final timing-accurate system model is what we call a golden model. It is usually built according to the SoC architecture specification, and the system's final RTL can be checked against it during verification to benchmark its correct functionality and achieved performance.

System modeling using Accellera SystemC and TLM2.0

Many IP vendors provide SystemC and TLM2.0 models of their IPs so that they can be used in system modeling to perform the early architecture exploration activities and build VPs for early software development. As mentioned previously, this captures the system architecture specification and helps make a golden model against which the functional verification of the system’s RTL can be compared.

SystemC and TLM2.0 are becoming one of the de facto standards for building IP models and connecting them in a transaction-oriented way using TLM2.0 socket-based interfaces. SystemC and TLM2.0 are available for free from Accellera, which also provides the SystemC basic simulator. For more information on this framework, go to https://www.accellera.org/downloads/standards/systemc.

This framework originated as the Open SystemC Initiative (OSCI) and is now owned and maintained by Accellera. It is an Institute of Electrical and Electronics Engineers (IEEE) standard under IEEE 1666-2011. SystemC is a C++ class library that provides macros allowing you to build concurrent processes that can communicate between themselves through function calls and returns. It is associated with transaction-level modeling (TLM), a socket-based programming methodology also based on a C++ library and macros. It provides many ways of mimicking the handshaking of data and control information exchange protocols and helps in building SoC interconnects. While this framework is available for free and can be used to build initial system models, there is a lack of freely available system models for specific processors and interconnect IPs, so the IP vendors need to be contacted in case they have them available for licensing.
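To give a flavor of how SystemC/TLM2.0 models talk to each other through sockets, the following minimal sketch shows an initiator issuing a blocking write transaction to a memory-like target. It uses only standard Accellera classes; the module names and the 10 ns access latency are arbitrary choices for this example.

```cpp
#include <cstdint>
#include <systemc>
#include <tlm>
#include <tlm_utils/simple_initiator_socket.h>
#include <tlm_utils/simple_target_socket.h>

// Minimal TLM2.0 target: a 256-byte "register file" reached through a socket.
struct SimpleTarget : sc_core::sc_module {
    tlm_utils::simple_target_socket<SimpleTarget> socket;
    unsigned char mem[256] = {};

    SC_CTOR(SimpleTarget) : socket("socket") {
        socket.register_b_transport(this, &SimpleTarget::b_transport);
    }

    void b_transport(tlm::tlm_generic_payload& trans, sc_core::sc_time& delay) {
        uint64_t addr = trans.get_address() % 256;  // keep within the toy memory
        unsigned char* data = trans.get_data_ptr();
        if (trans.is_write())
            mem[addr] = *data;       // store the written byte
        else
            *data = mem[addr];       // return the stored byte
        delay += sc_core::sc_time(10, sc_core::SC_NS);  // model access latency
        trans.set_response_status(tlm::TLM_OK_RESPONSE);
    }
};

// Minimal initiator: issues one blocking write over its socket.
struct SimpleInitiator : sc_core::sc_module {
    tlm_utils::simple_initiator_socket<SimpleInitiator> socket;

    SC_CTOR(SimpleInitiator) : socket("socket") { SC_THREAD(run); }

    void run() {
        tlm::tlm_generic_payload trans;
        sc_core::sc_time delay = sc_core::SC_ZERO_TIME;
        unsigned char value = 0x5A;
        trans.set_command(tlm::TLM_WRITE_COMMAND);
        trans.set_address(0x10);
        trans.set_data_ptr(&value);
        trans.set_data_length(1);
        socket->b_transport(trans, delay);   // blocking transport call
    }
};

int sc_main(int, char*[]) {
    SimpleInitiator init("init");
    SimpleTarget    targ("targ");
    init.socket.bind(targ.socket);           // stitch the two models together
    sc_core::sc_start();
    return 0;
}
```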

Accellera (or OSCI) SystemC/TLM2.0 is great for building system models of IPs that we would otherwise design directly in RTL, so it provides a rapid prototyping methodology for custom IPs. It is also very useful in building system-level prototypes where no specific CPU ISA is required, as it has a SystemC/TLM2.0 model of a generic RISC CPU. This can be used with assembly language for sequential operations, but it will only allow accurate system modeling that isn't CPU-centric, which means it's not very useful for SoC VPs on its own. However, it can be combined with other frameworks, as most VPs have a bridge to SystemC/TLM2.0, which means it can still be part of the tools a system architect uses to build a final system model of the targeted SoC with freely available tools.

System modeling using Synopsys Platform Architect

Platform Architect is a graphical user interface (GUI)-based system modeling environment from Synopsys. It is based on the SystemC/TLM2.0 framework but has a lot of utilities that make the system modeling and architecture exploration tasks easier than they are on the plain OSCI framework. Synopsys has built many system models for their IPs and integrated many third-party IPs within Platform Architect, which can be dragged and dropped in the GUI to stitch a system model together. This comes with a significant price tag, and some third-party IP models – specifically, the CPUs and interconnects – need an additional license from their providers before they can be used in any system modeling work. For major SoC projects targeting ASICs, Synopsys Platform Architect could be a good solution, but for FPGA-based SoCs, other alternatives for system modeling should be considered.

Further information on the Platform Architect framework is available at https://www.synopsys.com/verification/virtual-prototyping/platform-architect.html.

System modeling using the gem5 framework

gem5 is used in computer architecture research and is a framework for performing multi-processor system simulation, which it does by assembling computer system elements. The framework is centered on the CPU models that are built into C++ and configured using a Python script. This framework is suitable for SoC system modeling as it is also a method by which SoC component models such as processor cores, interconnect, memory interfaces, and peripherals can be connected using the configuration Python script to form a custom SoC emulating the real SoC hardware. Custom IPs can be written to extend the portfolio of gem5 and allow the user to produce a system model for the SoC target architecture. gem5 is free and provided under a BSD-like license. A full system can be built, including hardware, operating systems, and application software, to be run on the gem5 simulator. It supports two operating modes:

  • Full System (FS) mode
  • Syscall Emulation (SE) mode

For the SoC architecture development tasks, FS mode is suitable for system modeling, whereas SE mode is better suited for early software prototyping and software-centered research. The CPU models aren't an exact translation of their RTL architecture implementation; instead, they use an execution model that supports the specific Instruction Set Architecture (ISA), which is a good enough execution environment for architecture exploration work. Using gem5 requires a good understanding of the CPU microarchitecture, including its internal memory hierarchy and its settings, to customize the CPU using the Python script and approximate a system model that emulates the targeted CPU architecture to be used within the SoC. SystemC/TLM2.0 models are supported by gem5 either by bridging from gem5 to SystemC/TLM2.0 using a full bridge model available in gem5, or by running the SystemC/TLM models within the built-in SystemC/TLM2.0 simulation kernel. This last method is near native in the integration and configuration of SystemC/TLM2.0 models since they can be integrated and configured into the SoC system model using Python, just like the native IP models of gem5. More details on the gem5 simulator can be found at https://www.gem5.org/.

System modeling using the QEMU framework and SystemC/TLM2.0

QEMU is a machine emulator with multiple operating modes:

  • User-mode emulation
  • System emulation
  • KVM hosting
  • Xen hosting

System emulation mode is the one that’s relevant to SoC system modeling as it emulates the full SoC elements. It can boot guest operating systems and emulate many ISAs, including ARMv7, ARMv8, and MicroBlaze. QEMU is a free and open source framework licensed under GPL-2.0.

For more information on QEMU, go to https://www.qemu.org/.

Xilinx has a QEMU port that’s provided as a VP for SoCs built using the MicroBlaze processor, and both the Zynq-7000 and UltraScale+ SoC FPGAs. The platform can connect to custom IPs written in SystemC/TLM via an interface from the Xilinx QEMU.

For more information on Xilinx QEMU, check out its User Guide at https://docs.xilinx.com/v/u/2020.1-English/ug1169-xilinx-qemu.

Summary

This chapter opened Part 2 of this book, which has a practical aspect to it since we will be putting the theoretical topics that were introduced in Part 1 to use. This chapter was purely architectural since we need to understand why certain choices that we implement in an SoC design are the way they are. We also need to be capable of making certain changes to the design microarchitecture while considering the overall aspect of the system we are designing and whether we have met the stated objectives. This chapter covered all the major steps involved in SoC architecture design. We started by covering the exploration phase, where the possible design options are studied and compared in terms of cost, implementation effort, and time. We proposed a comparative method by which the initial theoretical analysis can be conducted and how the thinking process of choosing a potential solution can be driven.

Then, we moved on to the next stage of the architecture definition, which was very analytical and was conducted practically on an example SoC, known as the ETS, which implements a dummy low-latency trading engine that behaves very much like a real one. We performed the hardware and software partitioning tasks on this trading engine by decomposing the SoC microarchitecture into many elements classified by tasks. While targeting the possible processing elements, we also looked at the end-to-end data path and what would make it low latency and easy to implement before putting together a table containing the choices we made. After that, we had to figure out how to make these processing elements communicate with each other by considering their specific characteristics. We listed these interfaces and how they should be dimensioned when such quantification is needed and makes sense.

This collaboration between the software and hardware processing elements naturally led us to cover the Semi-Soft algorithm concept and its importance for FPGA-based SoC designs. We also looked at existing approaches that relate to it, from OpenCL and FPGA partial reconfiguration to the simple hardware acceleration method we introduced in this chapter. We concluded this chapter by introducing the last stage of the architecture definition, known as system modeling, and covered how helpful it is nowadays for complex designs that specifically target ASIC technologies. We closed by listing the major frameworks currently used in the industry to perform system modeling and virtual prototyping.

In the next chapter, we will continue in the same vein by taking the ETS from its architecture definition to its implementation on the Zynq-7000 SoC FPGA. By doing so, you will learn how to translate the architecture choices into an implementable SoC for the ETS.

Questions

Answer the following questions to test your knowledge of this chapter:

  1. Name some of the reasons that make an FPGA-based SoC the primary target technology for a full SoC system architecture implementation.
  2. How does the architecture definition phase interact with the company’s business strategy and its decision-making process?
  3. List the main technical criteria that make a specific FPGA SoC a potential target device to implement the projected SoC architecture.
  4. Summarize the methodology to use to start the architecture exploration phase, as presented in this chapter.
  5. How is the architecture exploration phase concluded and what is its overall objective?
  6. What step follows the architecture exploration phase and what is its main objective?
  7. Describe the main functions that are performed by the ETS introduced in this chapter.
  8. Why is UDP packet filtering chosen to be performed in hardware rather than in software?
  9. How does the hardware-based UDP packet filtering performance compare to doing the same operation purely in software?
  10. What is the main reason we choose to manage the Ethernet controller DMA descriptors in software using the pre-existing software drivers?
  11. What kind of hardware-to-software and software-to-hardware communication and interfacing do we need to make hardware-based packet filtering work?
  12. How did we make the UDP packet processing lower latency by introducing minimal disturbance to the software-only processing model? Why do we always aim to minimize the changes in any existing model?
  13. Describe the Semi-Soft algorithm approach.
  14. How was the Semi-Soft algorithm architecture design methodology used in the ETS?
  15. List some of the main hardware acceleration techniques that are used at the system level and how they compare to using the Semi-Soft algorithm method.
  16. What is OpenCL?
  17. What is the purpose of system modeling and when is it performed?
  18. What is a VP and why did we choose to use one?
  19. List some of the major system modeling frameworks.
  20. What are the operating modes of gem5?