3

Embedded and Multicore System Architecture—Design and Optimization

Michael C. Brogioli    Polymathic Consulting, Austin, TX, United States

Abstract

As Donald Knuth famously said, “… Premature optimization is the root of all evil.” System designers must consider myriad aspects of the application and the underlying hardware architecture when bringing a new application or technology to market, including available compute resources, power consumption limits, timing requirements, and tooling capabilities and limitations. This chapter explores a pragmatic and systematic approach to decomposing a target application in terms of these requirements and the process of methodically implementing and optimizing it for a given target architecture.

Keywords

Optimization; Software developers; Real-time tasks; Compiler; Build tools; Embedded software; System architecture; Wireless

1 Introduction

When implementing a given application on a specific hardware target, system architects and managers must consider several factors, ranging from hardware capabilities and application requirements to software requirements and even the technical ability of the engineering teams. This chapter explores how to take an application that demands a specific number of channels and data rates and systematically decompose it for implementation on the target architecture. By properly accounting for the compute resources available and the timing/bandwidth requirements of the application, system architects and managers can delegate implementation and analysis to the appropriate engineering resources. In addition, by formally understanding the underlying application, optimization efforts can be applied pragmatically to yield the best outcome, rather than prematurely optimizing the application in ways that may adversely affect the resultant system. By exploring the intersection of hardware resources, application requirements, software tooling capabilities and limitations, and power requirements, system architects and managers can effectively and efficiently bring well-optimized systems to market.

2 The Right Way and the Wrong Way

Like many things in the areas of embedded and multicore software and system design, there are often right ways and wrong ways to go about things. Programmers and developers all too often set out to optimize various aspects of the system far too prematurely, often with less than acceptable results.

There is a topical quote by Donald Knuth, author of The Art of Computer Programming, that sums up this phenomenon succinctly; it is reproduced below:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.

Donald Knuth

This is by no means to say that optimization is not required in embedded and multicore design; quite the opposite. Optimization must, however, be performed with a disciplined and iterative approach. Regarding serial performance tuning specifically, there are several key factors to consider to ensure that optimization is applied only where there is a firm understanding of the software's behavior and bottlenecks. A good iterative optimization approach therefore includes measurement and careful analysis to guide informed decision making, changing only one thing at a time, and meticulous remeasurement of the modified system to confirm that each change has been beneficial. These activities should be carried out as part of software development and validation, using measurement, simulation, and profiling tools to gain insight into runtime behavior and the architecture's response.

There are several common metrics that are associated with embedded system design at the hardware and software level. These include, but are not limited to, nonrecurring engineering cost, size, performance, power, flexibility, as well as time to prototype, time to market, maintainability, and system correctness. Considering these complex design challenges, domain expertise in both the hardware and software is needed to optimize design metrics. The designer must be comfortable with various technologies to choose what is best for a given application and constraints.

As such, premature optimization, as well as excessive optimization, can consume precious development resources, delaying the prototype or product release and compromising the software design, often without any measurable improvement in system performance. To remedy this, system modeling before optimization is required to appropriately plan and deploy system design resources. Once modeling is in place, a combination of measurement, regression testing, and tuning can be employed.

3 Understanding Requirements

It is important for system architects, managers, and engineers to spend time up front to understand the functional and nonfunctional requirements of the system. Fig. 1 shows an example of a functional requirement and the various metrics and attributes that should be associated with it.

Fig. 1 Functional requirements of an application.

An example of a functional requirement could be that the embedded software shall or must perform a specific task: monitoring a certain interface or subsystem, controlling a peripheral or subcomponent, or other mandates on what the system must do. Nonfunctional requirements, on the other hand, could be that the system shall be fast, reliable, scalable, etc. In summary, functional requirements represent what the system should do, whereas nonfunctional requirements represent how well the system should do it.

Fig. 2 shows a concrete illustration of this. Here the system dimension is IP forwarding, otherwise known as Internet routing. This dimension has the nonfunctional requirement of being “fast.” The functional requirements are the inner block, quantified in kilopackets per second (kpps). Here the rate is shown as 600 kpps; however, the hard requirement is that it must be at least 550 kpps.

Fig. 2 System dimensions and questions.

Following on from the above metrics, it is important to point out that there is a difference between system latency and system throughput. In general, it is not possible to design a system that simultaneously provides the lowest latency and the highest throughput. However, many real-world systems require a balance of both, such as media and wireless systems, for example, the eNodeB in LTE and LTE Advanced. As such, designers must be able to tune the system for the appropriate balance of latency and throughput. An example is illustrated in Fig. 3 for an eNodeB implementation.

Fig. 3 eNodeB real-time and pseudo real-time tasks.

Here we can see that the system has hard real-time tasks (to be completed within 1 ms, the TTI interval for LTE), whereby an external interrupt triggers radio link control and medium access control processing. The system also has pseudo real-time tasks such as Packet Data Convergence Protocol and IPSec. An example set of requirements for this functionality is a maximum wake-up latency of 10 μs for real-time tasks, with 50 users supported. Similarly, throughput requirements could be as much as 50 Mbps in the uplink and 100 Mbps in the downlink for 512-byte packet sizes. By firmly codifying these requirements, both latency and throughput, for hard real-time and pseudo real-time tasks alike, system designers have firm criteria with which to implement the system and focus tuning and optimization.

In summary, and as touched upon previously, system architects and implementers must know the architecture and know the algorithms. As we will see shortly, they are also well advised to know the tools and compilers.

4 Mapping the Application

When mapping an application to the underlying system architecture, one must consider the various types of processing components available within the system. Some may be latency oriented, like general-purpose CPUs. Others may be throughput oriented, such as GPUs, GPGPUs, FPGAs, or accelerators. The system may also include VLIW-based DSPs. Deciding which parts of the application map to which components is a central task in mapping the application at hand to the underlying system architecture.

Fig. 4 illustrates examples of some of the application components one might need to map to a given signal-processing or wireless system. Here we can see numerous blocks that are common in wireless and multimedia systems, such as finite impulse response, convolution, discrete Fourier transform, and so forth.

Fig. 4 Algorithmic breakdown of computational and memory bottlenecks.

Generally, when considering these types of application blocks, system performance estimation should be done before any code is implemented. System designers will need to account for questions such as:

  •  Maximum CPU performance. What is the maximum number of times the CPU can execute the algorithm per unit of time? How many channels can be supported simultaneously?
  •  Maximum I/O performance. Can the I/O system keep up with this proposed maximum number of channels?
  •  High-speed memory. Is there enough high-speed internal memory to support the desired system performance?
  •  CPU load percentage. At a given CPU load percentage, what other functions might the CPU be able to support?

4.1 Performance Calculations to Map the Application to Hardware

In this subsection, we will take the FIR algorithm component of the above table as an example of mapping the application software component to system resources. For a particular FIR benchmark, let us assume that there is a 200-tap (nh) low-pass FIR filter. Let’s also assume that the frame size is 256 (nx) 16-bit elements. Lastly, let’s assume that the sampling frequency is 48 kHz.

There are two main questions that this exercise aims to answer, listed below; each includes a table of calculations showing the mathematics used to compute the final answer.

  • Question 1: How Many Channels Can the Core Handle Given the Complexity of the Algorithm?
  • Question 2: Are the I/O and Memory Capable of Handling This Many Channels?

4.1.1 How Many Channels Can the Core Handle?

Referring to the computations shown in Figs. 5 and 6, the goal here is to determine the maximum number of channels that this processor can handle given a specific algorithm. To do this, we must first determine the benchmark of the chosen algorithm. Again, in this case we chose a 200-tap FIR filter. The DSPLIB documentation gives us the benchmark in terms of two variables: nx, the size of the buffer, and nh, the number of coefficients. In Table MCB-1, we have plugged these numbers in.

Fig. 5 CPU mapping of compute per channel.
Fig. 6 I/O and channel mappings per compute and memory resource.

It turns out that this FIR routine takes about 26 K cycles/frame. Now the sampling frequency comes into play: how many times is a frame filled each second? Here, we divide the sampling frequency, which specifies how often a new data item is sampled, by the size of the buffer. Plugging in the numbers, 48,000/256 gives about 188 frames/s. Next is one of the most important calculations: how many MIPS does this algorithm require of the processor? In other words, we need to find out how many cycles this algorithm will require per second. Here, we multiply frames per second by cycles per frame; plugging in the numbers gives about 5 MIPS. Assuming this is the only thing running on the processor, a 300-MHz core can support a maximum of 300/5 = 60 channels. This completes the CPU calculation. We’ll use this number (60 channels) in the I/O calculations below.
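To make the arithmetic concrete, the short C program below reproduces the calculation. It is a minimal sketch: the 26 K cycles/frame benchmark, the 48-kHz/256-sample framing, and the 300-MHz (300-MIPS) clock are the example values from the text, and the variable names are ours.

#include <stdio.h>

int main(void)
{
    /* Example parameters from the 200-tap FIR benchmark above. */
    const double cycles_per_frame = 26000.0;  /* ~26 K cycles per frame       */
    const double sample_rate_hz   = 48000.0;  /* 48 kHz sampling frequency    */
    const double frame_size       = 256.0;    /* nx: samples per frame        */
    const double cpu_mips         = 300.0;    /* assumed 300-MHz/300-MIPS CPU */

    /* How many times is a frame filled each second?  48,000/256 = 187.5 */
    double frames_per_sec = sample_rate_hz / frame_size;

    /* Cycles required per second for one channel, expressed in MIPS: ~4.9 */
    double mips_per_channel = frames_per_sec * cycles_per_frame / 1.0e6;

    /* Budgeting 5 MIPS per channel (rounding up) gives 300/5 = 60 channels. */
    double mips_budget  = 5.0;
    int    max_channels = (int)(cpu_mips / mips_budget);

    printf("frames/s %.1f, MIPS/channel %.2f, max channels %d\n",
           frames_per_sec, mips_per_channel, max_channels);
    return 0;
}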

4.1.2 Are the I/O and Memory Capable of This Many Channels?

The next question is whether the I/O interface can feed the CPU fast enough to support the 60-channel goal. To determine this, we must first calculate the bit rate required of the serial port. Here, we take the required sampling rate, 48 kHz, multiply it by the maximum number of channels (60), and then multiply by 16 (the word size in bits, given the chosen algorithm). This calculation yields a requirement of about 46 Mbps for 60 channels operating at 48 kHz.

Next, we must determine what the target architecture’s serial port can support. For our target architecture, the maximum bit rate is 50 Mbps (1/2 the CPU clock rate up to 50 Mbps). It looks like we are OK here. Next, we must determine whether the DMA can move these samples to memory fast enough. This appears to not be an issue. Now, we come to the issue of required data memory. This calculation is somewhat confusing and is explained below.

First, we are assuming that all 60 channels use different filters, i.e., 60 different sets of coefficients and 60 double buffers. In other words, the system is ping-ponging on both the receive and transmit sides, for four buffers per channel in total, hence the multiplication by four in the fourth row of Table MCB-2, pertaining to the required data memory. We also need to account for the delay buffers for each channel; in this exercise, only the receive side has delay buffers. This calculation is the number of channels * 2 * delay buffer size, which is 60 * 2 * 199. Yes, this is extremely conservative, and you could save some memory if these assumptions do not hold, but this is a worst-case scenario. So, we’ll have 60 sets of 200 coefficients, 60 double buffers (ping and pong on receive and transmit, hence the * 4), and a delay buffer of #coeffs - 1 = 199 elements for each channel. The calculation is:

(#Ch * #coeffs) + (#Ch * 4 * frame size) + (#Ch * #delay_buffers * delay_buffer_size)
(60 * 200) + (60 * 4 * 256) + (60 * 2 * 199)

This results in a requirement of roughly 97 K 16-bit words of memory (about 190 KB). System designers must ensure that the target architecture has at least this much memory to support this configuration. If it does not, the calculations can be performed again assuming only a single type of filter is used, reducing the coefficient and delay-buffer overhead and therefore the memory requirement.
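The I/O and memory checks can be sketched in the same way. The 60-channel goal, the 16-bit word size, and the worst-case buffer accounting all come from the discussion above; the names and the 50-Mbps port limit mentioned in the comment are assumptions of this sketch.

#include <stdio.h>

int main(void)
{
    const int channels    = 60;     /* from the CPU calculation above */
    const int sample_bits = 16;     /* word size of the FIR data      */
    const int sample_rate = 48000;  /* 48 kHz                         */
    const int frame_size  = 256;    /* nx                             */
    const int num_coeffs  = 200;    /* nh                             */

    /* Serial-port load: 48 kHz * 60 channels * 16 bits ~= 46 Mbps, which
       must fit under the assumed 50-Mbps port limit. */
    long bits_per_sec = (long)sample_rate * channels * sample_bits;

    /* Worst-case data memory in 16-bit words: coefficient sets, four
       ping-pong buffers per channel (receive and transmit), and the
       (#coeffs - 1)-element delay buffers. */
    long words = (long)channels * num_coeffs
               + (long)channels * 4 * frame_size
               + (long)channels * 2 * (num_coeffs - 1);

    printf("serial port load: %.1f Mbps\n", bits_per_sec / 1.0e6);
    printf("data memory: %ld 16-bit words (~%.0f KB)\n",
           words, words * 2.0 / 1024.0);
    return 0;
}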

4.2 How the Estimation Results Drive Options

Following on from the analysis detailed earlier, we can see that this quantitative analysis can now drive various system implementation options. For example, if we were analyzing a low-end, simple application that might consume only 5%–20% of the total CPU cycles, what might a system designer do with the remaining 80% or more of the compute cycles? Perhaps add additional functions or tasks? Perhaps increase the sampling rate, which would result in increased accuracy? The system designer might also decide to add channels, or to decrease the voltage and clock speed to lower system power.

Conversely, what if the application analyzed were a complex, very high-end application that required a CPU load of more than 100%? The system designer would need to split up the tasks wisely based on the data at hand. Perhaps a general-purpose microcontroller could handle the user interface while all signal processing migrates to the DSP. Maybe the DSP could handle the user interface and most of the signal processing while an FPGA handles the high-speed, heavy-lifting portions of the workload. Even more aggressive partitioning could be used, whereby a general-purpose processor handles the user interface, a DSP handles most but not all of the signal processing, and an FPGA performs the high-speed, heavy-lifting portion of the signal-processing workload. By performing application mapping in a quantitative manner, before code implementation occurs, optimizations can be applied effectively to meet key metrics.

5 Helping the Compiler and Build Tools

When it comes to finally optimizing the application, after the exercises above have mapped it to the target architecture, software developers must become familiar with the build tools and specifically the compiler. As was mentioned in Chapter [ ], the job of the compiler, at a high level, is to map high-level application code to the target platform while preserving the defined behavior of the high-level language. At the same time, the target architecture may provide functionality that is not directly expressible in the high-level language: fractional arithmetic, packed data moves to/from memory, fused multiply-accumulate operations, and various addressing modes, for example. In addition, the application may contain algorithmic concepts that are not handled natively by the high-level language, such as fractional arithmetic and vector operations.
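As a concrete illustration of this gap, fractional (Q15) multiplication has no native C operator, but it can be written as a small integer idiom that many DSP compilers recognize and map to a single saturating fractional-multiply instruction. The code below is a generic sketch in portable C, not a vendor intrinsic.

#include <stdint.h>

/* Q15 fractional multiply: (a * b) >> 15, saturating the single overflow
   case (-1.0 * -1.0).  Many DSP compilers collapse this idiom into one
   fractional-multiply instruction. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;     /* 32-bit intermediate product */
    if (p == 0x40000000)                     /* -1.0 * -1.0 would overflow  */
        return INT16_MAX;
    return (int16_t)(p >> 15);               /* rescale back to Q15         */
}

/* Multiply-accumulate loop written so the compiler can use a MAC unit and
   packed 16-bit loads; restrict promises the arrays do not alias. */
int32_t dot_q15(const int16_t *restrict x, const int16_t *restrict h, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)h[i];
    return acc;
}

Writing the multiply-accumulate loop with restrict-qualified pointers and a 32-bit accumulator likewise leaves the compiler free to use packed 16-bit loads and a hardware MAC unit where one exists.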

Software engineers must understand how the compiler generates code, as it plays an important role in terms of writing code for a desired result. Fig. 7 illustrates a typical compilation tool chain.

Fig. 7 Example of a modern compilation tool chain.

While compiler optimization is discussed in detail elsewhere in this book, this will serve as a recap for the reader of this chapter. As can be seen in Fig. 7, high-level source code files are parsed by the front end and then optimized by both a high-level and a low-level optimizer. Finally, assembly files are output by the code generator and passed through an assembler. These assembly files are then combined with libraries, as well as various command files, to produce the resulting executable. It is important to note that many build tool chains also support assembly optimization, link-time optimization, and various other optimizations that can be specified in the linker command file. The user should refer to the build tool documentation to see which features are supported. Chapter [ ] offers additional information on compiler optimizations that are common to most tool chains.

5.1 Choosing Algorithmic Components to Work With Compilers and Architectures

Small, loop-focused parts of your application can often be tailored to have a big impact. By implementing these portions of the computation in an architecture- and compiler-friendly manner, big improvements can be achieved in often unexpected ways. For instance, 16-bit arithmetic can be slow on 32- and 64-bit architectures compared with a packed-arithmetic equivalent. Inlining of functions can also yield gains if sufficient instruction cache is available; this is especially true for code inside heavily nested loops, where the caller/callee overhead can be reduced. Arithmetic operations, such as a multiply followed by a shift, can be written so that the compiler can compress them down to a single native instruction on the target architecture rather than multiple instructions, or worse. If input data types are known, it may also be advisable to avoid generic functions. Again, referring to the compiler, assembler, and linker documentation for a given architecture is advised; the reader may wish to revisit Chapter [ ] for more in-depth reading.
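The sketch below, using hypothetical function names, shows the kind of substitution being described: a generic floating-point scaling routine versus a specialized fixed-point version that the compiler can reduce to a single shift, wrapped in a small inlinable helper for use inside a hot loop.

#include <stdint.h>

/* Generic version: promotes to floating point, which is expensive on a
   fixed-point DSP and hides the power-of-two divisor from the compiler. */
double scale_generic(double x) { return x / 8.0; }

/* Specialized version: with a known 32-bit integer input and a power-of-two
   divisor, the compiler can emit a single arithmetic-shift instruction.
   (An arithmetic shift rounds toward negative infinity rather than toward
   zero, which is usually acceptable for signal scaling.) */
static inline int32_t scale_by_8(int32_t x) { return x >> 3; }

/* Hot inner loop: a small inlinable helper avoids call/return overhead and
   lets the compiler see the whole loop for software pipelining or SIMD. */
void scale_block(int32_t *restrict out, const int32_t *restrict in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = scale_by_8(in[i]);
}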

6 Power Optimization

As many embedded devices are battery operated or operate under low-power constraints, power optimization is also important for embedded and mobile devices. This section is not meant to be an exhaustive exploration of power optimization, for which an entire text could be written. Rather, it highlights key considerations for system developers and refers the reader to other chapters of this text for more in-depth analysis.

There are several power optimizations that system and software developers should keep in mind when implementing embedded software.

  • Software architecture. It may be advisable to architect system software to have natural idle points. This includes low-power booting, or intelligently powering down PCI Express links and buffering transmissions on the uplink (UL) and downlink (DL). Power can be conserved by only powering up these costly resources when needed by a specific application.
  • Interrupt-driven design. Using interrupts intelligently can reduce system power consumption. By using interrupts to wake up certain functionality, rather than implementing polling loops, significant power consumption can often be saved. Use the operating system's blocking primitives in this context rather than busy-waiting (see the sketch after this list).
  • Code and data placement. By placing code and data close to the processor, one can often minimize off-chip accesses. Look into overlays from nonvolatile memory to fast memory. If the device has fast scratchpad memory, it may be advisable to perform computations there.
  • Code size. By performing code-size optimizations, the application size can be significantly reduced. These optimizations may involve using a compressed instruction set, which trades some functionality for a denser instruction encoding. This also reduces the memory required for the application and the resulting leakage current.
  • Speed and idle modes. Often, one can optimize for speed in the computationally intensive parts of the application. Even when raw speed is not itself a requirement, finishing the work sooner increases the time during which the system can be put into idle mode, or allows the clock rate at which the CPU and other system components operate to be reduced.
  • Overcalculation. By having a deep understanding of the application requirements, as described previously in this chapter, programmers can elect to use the minimal data widths required. This in turn can permit the use of smaller multipliers and arithmetic operations. It may also decrease the amount of bus activity and switching required during memory transfers.
  • Direct memory access. While it may be easier to use programmed, CPU-driven I/O, using the DMA engines for block memory transfers can be significantly more efficient in both time and resource utilization.
  • Coprocessors. Coprocessors are often designed to accelerate computation. By using coprocessors to efficiently handle and accelerate frequent computation, or application-specific computation, runtime can be reduced. This may increase the opportunity to put CPUs into idle mode.
  • Batch and buffer. By buffering work and then processing it in batches, one may increase the amount of computation performed during a block of time. Like the PCI Express link use case described above, this may increase the amount of time during which a device can be placed in idle/low-power mode while still meeting real-time deadlines.
  • Voltage and frequency. Use the operating system to your advantage, in this case by scaling voltage and frequency. Again, this requires deep knowledge of the application requirements and runtime performance; be sure to analyze and benchmark your application to arrive at the right configuration.
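As referenced in the interrupt-driven design item above, the sketch below shows a blocking, interrupt-backed read in place of a polling loop. The device node /dev/sensor0 and its driver behavior are hypothetical; a real system would use whatever blocking or event primitive its operating system provides.

#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical character device used purely for illustration. */
    int fd = open("/dev/sensor0", O_RDONLY);
    if (fd < 0)
        return 1;

    uint8_t sample[64];
    for (;;) {
        /* Blocking read: the OS puts this task to sleep until the driver's
           interrupt handler signals that data has arrived, so the CPU can
           drop into an idle/low-power state instead of spinning in a
           polling loop that repeatedly checks a status register. */
        ssize_t n = read(fd, sample, sizeof sample);
        if (n <= 0)
            break;
        /* ... process the batch of samples, then block again ... */
    }
    close(fd);
    return 0;
}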