Chapter 13
Conclusions

In this book, we presented novel methods for modeling and optimizing parallel and distributed embedded systems. We illustrated our modeling and optimization of distributed embedded systems through our research and experimental evaluation, which focused on distributed embedded wireless sensor networks (EWSNs). Specifically, we developed and tested our dynamic optimization methodologies on an embedded sensor node's tunable parameter value settings for EWSNs and modeled application metrics, such as lifetime and reliability. We demonstrated our modeling and optimization methods for parallel embedded systems using our research on multicore-based parallel embedded systems.

Chapter 1 introduced the modeling and optimization of embedded systems and discussed diverse embedded system application domains, including cyber-physical systems (CPSs), space, medical, and automotive. The chapter presented an overview of modeling, modeling objectives, and various modeling paradigms.

Chapter 2 presented our proposed architecture for heterogeneous hierarchical multicore embedded wireless sensor networks (MCEWSNs). The increased computation power afforded by multicore embedded sensor nodes benefits a myriad of compute-intensive tasks, such as information fusion, encryption, network coding, and software-defined radio, which are prevalent in many application domains. MCEWSNs are especially beneficial for wireless sensor networking application domains, such as wireless video sensor networks, wireless multimedia sensor networks, satellite-based sensor networks, space shuttle sensor networks, aerial–terrestrial hybrid sensor networks, and fault-tolerant sensor networks. Both academia and industry have recognized the MCEWSN's potential benefits and have undertaken several initiatives to develop multicore embedded sensor nodes, such as InstraNode, satellite-based sensor node, and smart camera mote. The chapter elaborated on these endeavors.

Chapter 3 proposed an application metric estimation model to estimate high-level application metrics, such as lifetime, throughput, and reliability, from an embedded sensor node's parameters. This estimation model's main purpose is to assist dynamic optimization methodologies in comparing different operating states. Our model provides a prototype for application metric estimation that can be easily extended to other application metrics.
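To make the idea of estimating a high-level metric from low-level parameter settings concrete, the sketch below computes a lifetime estimate from voltage, frequency, and sensing-rate settings. The power model, coefficients, and battery capacity are all invented for illustration; they are not the book's estimation model.

```python
def estimate_lifetime(vdd, freq_mhz, sense_hz, battery_j=16200.0):
    """Estimated node lifetime (hours) from tunable parameter settings.

    Power model (hypothetical): active power grows with Vdd^2 * f, and
    the duty cycle scales with the sensing frequency.
    """
    active_w = 1e-3 * vdd ** 2 * freq_mhz      # hypothetical active power (W)
    duty = min(1.0, sense_hz / 10.0)           # fraction of time active
    sleep_w = 3e-6                             # hypothetical sleep power (W)
    avg_w = duty * active_w + (1 - duty) * sleep_w
    return battery_j / avg_w / 3600.0

# Lower voltage/frequency/sensing settings yield a longer estimated lifetime.
print(estimate_lifetime(1.8, 2, 1), estimate_lifetime(3.3, 8, 4))
```

A dynamic optimization methodology would evaluate such a model for each candidate operating state and compare the estimates against the application's lifetime requirement.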

Chapter 4 provided a comprehensive study of the modeling and analysis of fault detection and tolerance in EWSNs. To elucidate fault detection in EWSNs, we presented a taxonomy for fault diagnosis in EWSNs. Using the ns-2 simulator, we simulated several prominent fault detection algorithms to evaluate the algorithms' accuracies and false alarm rates. We developed comprehensive Markov models that hierarchically encompassed individual sensor nodes, WSN clusters, and the overall WSN. Our models characterized WSN reliability and mean time to failure (MTTF) for different sensor failure probabilities, and can assist WSN designers in achieving closer adherence to application requirements by evaluating reliability and MTTF in the pre-deployment phase.
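As a minimal illustration of the absorbing-Markov-chain machinery behind such MTTF models, the sketch below computes the expected time to absorption for a toy node with two transient states. The states and failure probabilities are invented for illustration and are not the book's models.

```python
# Hypothetical sketch: MTTF of a sensor node modeled as an absorbing
# discrete-time Markov chain, via the fundamental-matrix relation
# (I - Q) t = 1, solved here by Gaussian elimination.

def mttf(Q):
    """Expected steps to absorption from each transient state, where Q is
    the transient-to-transient transition probability submatrix."""
    n = len(Q)
    # Augmented matrix for (I - Q) t = [1, ..., 1]^T.
    A = [[(1.0 if i == j else 0.0) - Q[i][j] for j in range(n)] + [1.0]
         for i in range(n)]
    for col in range(n):                      # forward elimination with pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]
    t = [0.0] * n
    for r in range(n - 1, -1, -1):            # back substitution
        t[r] = (A[r][n] - sum(A[r][c] * t[c] for c in range(r + 1, n))) / A[r][r]
    return t

# Transient states: 0 = both sensors up, 1 = degraded (one sensor down);
# the absorbing "failed" state is implicit. p = per-step failure probability.
p = 0.05
Q = [[1 - 2 * p, 2 * p],
     [0.0, 1 - p]]
print(mttf(Q))   # MTTF in time steps from each transient state
```

Hierarchical models compose such chains: a node-level MTTF feeds the cluster-level chain, which in turn feeds the WSN-level chain.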

Chapter 5 detailed our development of closed product-form queueing network models for performance evaluation of multicore embedded architectures for different workload characteristics. The performance evaluation results indicated that shared last-level cache (LLC) architectures provide better cache response times and MFLOPS/W as compared to private LLC architectures, regardless of the specific cache miss rate, especially as the number of cores increases. The results also revealed that shared LLCs have some disadvantages, such as being more susceptible to causing main memory response time bottlenecks for large cache miss rates as compared to private LLCs. However, results indicated that these bottlenecks can be mitigated using a hybrid combination of private and shared LLCs (i.e., sharing LLCs among only a subset of cores, rather than all cores). The trade-off for this hybrid LLC architecture is increased power consumption as compared to shared LLCs and, consequently, lower MFLOPS/W. The performance per watt and performance per unit area results for the multicore embedded architectures revealed that shared LLC multicore architectures become more area and power efficient as compared to private LLC architectures as the number of cores increases.
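Closed product-form networks like these can be evaluated exactly with Mean Value Analysis (MVA). The sketch below uses invented service demands for an LLC queue and a main-memory queue (not the book's parameters) to show how per-queue response times and system throughput are obtained as the number of circulating requests grows.

```python
# Illustrative sketch (not the book's exact model): exact MVA for a
# single-class closed product-form queueing network.

def mva(demands, jobs):
    """Return (per-queue mean response times, system throughput) for
    `jobs` circulating customers and per-visit service demands."""
    k = len(demands)
    q = [0.0] * k                 # mean queue lengths with 0 customers
    for n in range(1, jobs + 1):
        # Arrival theorem: an arriving job sees the (n-1)-job queue lengths.
        r = [demands[i] * (1.0 + q[i]) for i in range(k)]
        x = n / sum(r)            # system throughput (Little's law)
        q = [x * r[i] for i in range(k)]
    return r, x

# Queues: shared LLC and main memory (hypothetical demands in ms per visit).
resp, thr = mva([0.2, 1.0], jobs=4)
print(resp, thr)
```

As the job population grows, throughput saturates at the bottleneck bound (the reciprocal of the largest service demand), which is how the memory bottleneck for shared LLCs manifests in such models.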

Chapter 6 explored optimization strategies for distributed EWSNs at various design levels to meet disparate application requirements. To aid in easy design incorporation, we discussed commercial off-the-shelf (COTS) embedded sensor node components and the components' associated tunable parameter value settings that can be specialized to provide component-level optimizations. We explored data link-level and network-level optimization strategies focusing on medium access control (MAC) and routing protocols, respectively. The MAC protocols presented targeted load balancing, throughput, and energy optimizations, and the routing protocols focused on query dissemination, real-time data delivery, and network topology. We illustrated sensor node operating system (OS)-level optimizations, such as power management and fault tolerance, using state-of-the-art sensor node OSs. Finally, we described dynamic optimizations, such as dynamic voltage and frequency scaling (DVFS) and dynamic network reprogramming.

Chapter 7 gave a holistic survey of high-performance energy-efficient parallel embedded computing (HPEPEC) techniques that enable meeting diverse embedded application requirements. We presented novel architectural approaches for core layout (e.g., heterogeneous chip multiprocessors (CMPs), tiled multicore architectures (TMAs), 3D multicore architectures), memory design (e.g., cache partitioning, cooperative caching), interconnection networks (e.g., 2D mesh, hypercube), and reduction techniques (e.g., leakage current reduction, peak power reduction), which enhance performance and reduce energy consumption in parallel embedded systems. We discussed hardware-assisted middleware techniques, such as DVFS, advanced configuration and power interface (ACPI), threading techniques (hyper-threading, helper threading, and speculative threading), dynamic thermal management (DTM), and various low-power gating techniques for performance and energy optimizations of parallel embedded systems. We also considered software approaches, such as task scheduling, task migration, and load balancing, to improve the attainable parallel performance and power efficiency. Finally, we discussed some prominent multicore-based parallel processors, emphasized these processors' HPEPEC features, and concluded with HPEPEC research challenges and future research directions.

Chapter 8 focused on our proposed EWSN-based application-oriented dynamic optimization methodology using Markov decision processes (MDPs). Our MDP-based optimal policy tuned the sensor node processor voltage, frequency, and sensing frequency in accordance with application requirements during a sensor node's lifetime. Our methodology was highly adaptive to changing application requirements and determined a new MDP-based policy whenever these requirements changed, which may also reflect changing environmental stimuli. We compared our MDP-based policy with four fixed-heuristic policies and concluded that our proposed MDP-based policy outperformed each heuristic policy for all sensor node lifetimes, state transition costs, and application metric weight factors. We also provided implementation guidelines to assist embedded system designers in designing and architecting appropriate sensor nodes given application requirements.
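The MDP machinery behind such a policy can be sketched with value iteration on a toy two-state model. The states, actions, rewards, and discount factor below are invented for illustration; the book's actual formulation of states (parameter settings) and rewards (weighted application metrics minus transition costs) is richer.

```python
# Hypothetical sketch: value iteration for a toy parameter-tuning MDP.

def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """Return the optimal value function and greedy policy.

    P[s][a] maps next states to probabilities; R[s][a] is the expected
    immediate reward (e.g., a weighted mix of lifetime and throughput
    metrics minus a state-transition cost).
    """
    V = {s: 0.0 for s in states}
    while True:
        Vn = {s: max(R[s][a] + gamma * sum(p * V[t] for t, p in P[s][a].items())
                     for a in actions) for s in states}
        if max(abs(Vn[s] - V[s]) for s in states) < eps:
            break
        V = Vn
    policy = {s: max(actions, key=lambda a: R[s][a] + gamma *
                     sum(p * V[t] for t, p in P[s][a].items())) for s in states}
    return V, policy

states, actions = ["low_freq", "high_freq"], ["stay", "switch"]
P = {s: {"stay": {s: 1.0},
         "switch": {("high_freq" if s == "low_freq" else "low_freq"): 1.0}}
     for s in states}
R = {"low_freq": {"stay": 1.0, "switch": 0.5},   # modest reward, low power
     "high_freq": {"stay": 2.0, "switch": 0.5}}  # meets tighter requirements
V, policy = value_iteration(states, actions, P, R)
print(policy)
```

The base station would solve such an MDP and distribute the resulting policy to the sensor nodes, which is precisely the resource burden discussed next.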

Although our MDP-based methodology presented in Chapter 8 provided a sound foundation for high-quality sensor node parameter tuning with respect to diverse application requirements, this methodology was only a first step toward holistic EWSN dynamic optimization. Since our initially proposed MDP-based methodology required high computational and memory resources for large design spaces and necessitated a high-performance base station/sink node to compute the optimal operating state, much research remained to architect a less computationally complex methodology. The original methodology determined operating states at the base station, which then communicated these states to the other sensor nodes. Given the constrained resources of individual sensor nodes, these high resource requirements made the MDP-based methodology infeasible for autonomous dynamic optimization of large design spaces on individual nodes.

Chapter 9 extended the initial work on sensor node parameter tuning discussed in Chapter 8 and proposed lightweight online greedy and simulated annealing (SA) algorithms that are suitable for dynamic optimization of distributed, resource-constrained EWSNs. As compared to prior work, our refined methodology considered an extensive embedded sensor node design space, which allowed embedded sensor nodes to adhere more closely to diverse application requirements. Our experimental results revealed that our online algorithms were lightweight, requiring little computational, memory, and energy resources, and thus are amenable to sensor nodes with highly constrained resources and energy budgets. Furthermore, our online algorithms could perform in situ parameter tuning to adapt to changing environmental stimuli and meet application requirements.
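A lightweight greedy explorer of this kind can be sketched as a one-parameter-at-a-time search. The parameter names, value lists, and objective function below are hypothetical placeholders, not the book's design space or metric model.

```python
# Illustrative sketch: greedy one-parameter-at-a-time tuning, which visits
# far fewer states than exhaustive search over the full cross product.

def greedy_tune(params, objective):
    """Fix each parameter at its best value in turn, keeping the others
    at their current settings; returns (best state, states explored)."""
    state = {p: vals[0] for p, vals in params.items()}
    explored = 0
    for p, vals in params.items():
        best_v, best_s = state[p], objective(state)
        for v in vals[1:]:
            explored += 1
            cand = dict(state, **{p: v})
            s = objective(cand)
            if s > best_s:
                best_v, best_s = v, s
        state[p] = best_v
    return state, explored

# Toy design space: processor voltage (V), frequency (MHz), sensing rate (Hz).
params = {"Vdd": [1.8, 2.7, 3.3], "freq": [2, 4, 8], "sense": [1, 2, 4]}

def objective(s):   # invented weighted lifetime/performance trade-off
    lifetime = 10.0 / (s["Vdd"] ** 2 * s["freq"])
    perf = s["freq"] * s["sense"]
    return 0.5 * lifetime + 0.5 * perf

best, n = greedy_tune(params, objective)
print(best, n)      # explored states vs. 3 * 3 * 3 = 27 for exhaustive search
```

Because the objective is evaluated in situ from measured or estimated metrics, the same loop can rerun whenever environmental stimuli or application requirements change.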

Chapter 10 proposed a lightweight dynamic optimization methodology for EWSNs that provided high-quality solutions in one shot using intelligent initial tunable parameter value settings for highly constrained application domains. To improve on the one-shot solution for less constrained application domains, we proposed an additional online greedy optimization algorithm that leveraged intelligent design space exploration techniques to iteratively improve on the one-shot solution. Results showed the near-optimality of our one-shot solution, which was within 8% of the optimal solution on average. Results indicated that our greedy algorithm converged to the optimal or near-optimal solution after exploring only 1% and 0.04% of the design space, whereas SA explored 55% and 1.3% of the design space, for design space cardinalities of 729 and 31,104, respectively. Data memory and execution time analysis revealed that the one-shot solution required 361% and 85% less data memory and execution time, respectively, as compared to using all three steps of our dynamic optimization methodology. Furthermore, the one-shot methodology consumed 1679% and 166,510% less energy as compared to an exhaustive search for the design space cardinalities of 729 and 31,104, respectively. The improvement mode using the greedy online algorithm consumed 926% and 83,135% less energy as compared to the exhaustive search for these two design space cardinalities, respectively. Computational complexity analysis, coupled with the execution time, data memory, and energy consumption analyses, confirmed that our methodology was indeed lightweight and thus feasible for resource-constrained sensor nodes.

Chapter 11 compared the performance of symmetric multiprocessors (SMPs) and TMAs with respect to a parallelized information fusion application, Gaussian elimination (GE), and embarrassingly parallel (EP) benchmarks. We provided a quantitative comparison of these architectures using various device metric calculations, including computational density (CD), computational density per watt (CD/W), internal memory bandwidth (IMB), and external memory bandwidth (EMB). Although a quantitative comparison provides a high-level evaluation of the architectures' computational capabilities, our parallelized benchmark-driven evaluation provided deeper insights. Our results revealed that the SMPs outperform the TMAs in terms of overall execution time; however, TMAs can deliver comparable or better performance per watt. Specifically, our results indicated that the TILEPro64 exhibited better scalability and attained better performance per watt as compared to SMPs for applications consisting of integer operations and those that operate primarily on private data with little communication between operating cores by exploiting data locality, such as the information fusion application. The SMPs exhibited better scalability and performance for benchmarks that required extensive inter-core communication and synchronization operations, such as the GE benchmark. Results from the EP benchmark revealed that the SMPs provided higher peak floating-point performance per watt as compared to the TMAs, primarily because the TMAs we studied did not contain dedicated floating-point units.

Chapter 12 provided an overview of TMAs and evaluated several contemporary TMA chips, such as Intel's TeraFLOPS research chip, IBM's C64, and Tilera's TILEPro64. Our analysis focused on Tilera's TILEPro64 to demonstrate TMA performance optimizations. We highlighted platform considerations for parallel performance optimizations, such as chip locality, cache locality, tile locality, translation look-aside buffer locality, and memory balancing. We also investigated compiler-based optimizations for attaining high performance, such as function inlining, alias analysis, loop unrolling, loop nest optimizations, software pipelining, and feedback-based optimizations.

To demonstrate TMAs' high-performance optimization capabilities, we optimized a dense matrix multiplication (MM) application for Tilera's TILEPro64, and results indicated that blocking and parallelization alone are not sufficient to achieve maximum TMA performance. Compiler-based optimizations were required to provide substantial enhancements in attainable performance and performance per watt. Results demonstrated that an algorithm exploiting horizontal communication, such as Cannon's algorithm, provided an effective means of attaining high performance on TMAs.
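The blocking idea can be sketched in a few lines. The pure-Python version below is illustrative only; on the TILEPro64 the block size would be chosen to fit each tile's cache, and the blocks would be distributed across tiles (as in Cannon's algorithm) rather than iterated serially.

```python
# Illustrative sketch: cache-blocked dense matrix multiplication.

def blocked_matmul(A, B, n, bs):
    """Multiply n x n matrices (lists of lists) using bs x bs blocks so
    that each inner phase reuses a small, cache-resident working set."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # One block product: A[ii:ii+bs, kk:kk+bs] * B[kk:kk+bs, jj:jj+bs]
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[float(i == j) for j in range(n)] for i in range(n)]   # identity matrix
B = [[float(i * n + j) for j in range(n)] for i in range(n)]
assert blocked_matmul(A, B, n, bs=2) == B                   # I * B == B
```

As the chapter's results indicate, such source-level blocking must still be combined with compiler optimizations (inlining, unrolling, software pipelining) to approach peak TMA performance.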

Future Research Directions: System optimizations have always played a key role in meeting design goals, such as reducing a system's power/energy consumption. Given the relative simplicity of past systems (e.g., single core, few tunable parameters, small design spaces), determining the best parameter configuration could reasonably be accomplished by directly evaluating different configurations at runtime and selecting the configuration that most closely adhered to the desired design goals. However, increasing system complexity (e.g., manycore systems result in explosively large design spaces) makes this direct evaluation infeasible, even with highly efficient design space exploration heuristics. This complexity, coupled with the fact that much prior work has shown that maximum design goal adherence requires per-application and/or per-execution-phase parameter tuning, necessitates ultra-fast configuration selection/determination mechanisms in order to successfully integrate optimization into future systems.

In order to enable future optimizations in highly complex systems with a myriad of tunable parameters and parameter values, and considering massively manycore systems (e.g., hundreds to thousands of cores), physical design space exploration must be replaced with fast, analytical, predictive mechanisms. In this book, we have presented several methods that intend to bridge the complexity gap between modern and future systems, providing high-level modeling and predictive optimization methods. However, these methods are only a small step toward sustaining optimizations in future massively complex systems, and there is an urgent need to advance the optimization methods, underlying evaluation frameworks, and modeling techniques in order to keep pace with the progression of system complexity. We summarize some of these challenges as follows:

  • Since the cache hierarchy is a key system resource with large impacts on power/energy consumption, performance, and chip area, large optimization impacts are possible via cache hierarchy specialization for specific application/phase requirements. Even though there are a large number of cache simulators and evaluation frameworks, each offering different coverage, accuracy, and simulation speed, an accurate, fast cache simulator designed for an arbitrary cache hierarchy in a system with an arbitrary number of cores and intercore communication is still needed.
  • In the massively manycore era, accurately capturing intercore communication, dependencies, interactions, and synchronizations will play an enormous role in cross-core system optimizations. The main challenge will be in capturing the dynamic and timing-dependent behavior for simulations of out-of-order processors and multithreaded/multicore architectures.
  • Given the vast number of tunable parameters, it will be important to clearly delineate independent and dependent parameters, since independent parameters can be explored in isolation, whereas dependent parameters must be explored together, which vastly increases the design space. In order to minimize the design space, it will be necessary to decouple this parameter interference and clearly categorize independent and dependent parameters. Furthermore, even for dependent parameters, the level of dependence must be quantified with respect to the desired design constraints, such that dependent parameters with only a small effect on the design constraints can be classified as independent to reduce the design space and tuning complexity.
  • Runtime parameter tuning is essential for future systems since runtime tuning can dynamically react to changing execution patterns, application requirements, application phases, and so on. Existing runtime tuning and phase change detection techniques introduce hardware overhead, can be intrusive to normal system execution, and are not geared toward massively complex, manycore systems. Reducing the impact of parameter tuning and phase change detection on energy consumption is critical.