6
Parallel Computing for Support Vector Machines

6.1. Introduction

As introduced previously, the essential computation of SVMs is solving a quadratic problem, which is costly in both time and memory. This presents a challenge when solving large-scale problems. Despite several optimization or heuristic methods, such as shrinking, chunking [OSU 97], kernel caching [JOA 99], approximation of the kernel matrix [CHA 07], sequential minimal optimization (SMO) [PLA 99a] and a primal estimated sub-gradient solver [SHA 07], a more sophisticated and satisfactory resolution of this challenging problem is still expected. As stated previously, a building's energy system is extremely complex and involves a large number of influencing factors, so tackling large-scale datasets is common in this field.

With the development of chip technologies, computers with multiple cores or multiple processors are becoming more available and affordable on the modern market. This chapter, therefore, investigates and demonstrates how SVMs can benefit from this platform when predicting building energy consumption. A new parallel SVM that is particularly suitable for this platform is proposed. The decomposition method and an inner SMO solver compose the main training procedure, and a shared cache is designed to store the kernel columns. To achieve easy implementation without sacrificing performance, the parallel programming framework MapReduce is chosen to perform the underlying parallelism. The proposed system is therefore called MRPsvm, short for MapReduce parallel SVM. We parallelize both the classification and the regression algorithms. Comparative experiments are conducted on five benchmark datasets for SVC and on three energy consumption datasets for SVR, showing significant performance improvements of our system over the sequential implementation Libsvm [CHA 01] and the state-of-the-art parallel implementation Pisvm [BRU 07].

This chapter is organized as follows. Section 6.2 reviews the related work. Section 6.3 introduces the decomposition method as a QP solver for SVC. Section 6.4 presents the MPI programming model and two MPI-based parallel SVMs. Section 6.5 explains the important issues in MRPsvm's implementation and reports numerical experiments and a comparative analysis on benchmark datasets. Section 6.6 uses the same mechanism to parallelize SVR and shows its application to predicting building energy consumption. Conclusions are drawn in section 6.7.

6.2. Overview of parallel support vector machines

Several approaches have been proposed to parallelize SVM, mostly for solving classification problems. They can be classified into several categories according to the type of QP solver.

Based on the stochastic gradient descent method, P-packSVM optimizes SVM training directly on the primal form of SVM for arbitrary kernels [ZHU 09]; very high efficiency and competitive accuracy have been achieved. Psvm, proposed in [CHA 07], is based on an interior point QP solver. It approximates the kernel matrix by incomplete Cholesky factorization; the memory requirement is reduced and scalable performance has been achieved. Bickson et al. [BIC 08] solve the problem by Gaussian belief propagation, a method from the complex systems domain; their parallel solver brings competitive speedups on large-scale problems.

The decomposition method has attracted more attention than the above solvers. Graf et al. [GRA 05] train several SVMs on small data partitions, then aggregate the support vectors from pairs of SVMs to form new training samples on which another training is performed; the aggregation is repeated until only one SVM remains. A similar idea is adopted by Dong et al. [DON 05b]: in their work, sub-SVMs are trained on block diagonal matrices which are regarded as approximations of the original kernel matrix, so that non-support vectors are removed while dealing with these subproblems. Zanni et al. [ZAN 06] parallelize SVM-Light with an improved working set selection and inner QP solver. Hazan et al. [HAZ 08] propose a parallel decomposition solver using Fenchel duality. Lu et al. [LU 08] parallelize randomized sampling algorithms for SVM and SVR. Cao et al. [CAO 06] and Catanzaro et al. [CAT 08] parallelize the SMO solver for training classification SVMs. Both works mainly focus on updating the gradient for the KKT condition evaluation and on the working set selection; the difference between them lies in the implementation details and the programming models. Specifically, the first work uses the message passing interface (MPI) on clusters, while the second uses MapReduce threads on a modern GPU platform. In our work, we also adopt the SMO algorithm, but we use it as the inner QP solver without any parallel computation; the parallelization is performed on the external decomposition procedure. The main advantage of this coarse-grained parallelism is that it significantly reduces the overheads, since the number of iterations of the global decomposition procedure (where n ≫ 2) is much smaller than that of the pure SMO algorithm (where n = 2). Although both GPU SVM [CAT 08] and our system are implemented with threads, we will not compare them in the experiments since they are designed for different platforms and GPU SVM is dedicated to classification problems.

6.3. Parallel quadratic problem solver

In section 3.3, we introduced the interior point and gradient descent methods as QP solvers. As stated there, the third option is the decomposition method. We present this approach in the following and then develop our parallel implementation based on it.

The decomposition method breaks the problem into smaller tasks, solves these small tasks and finally achieves global convergence. It has attracted a lot of attention as the QP solver for SVMs in recent years, since it is quite efficient for large-scale problems and its memory requirement is much lower. It was first proposed by Osuna et al. to decompose the dual problem of SVMs [OSU 97]. In each small task, a working set containing certain components of α is selected to be optimized, while the rest of α remains constant. The program repeats this select-optimize process iteratively until the global optimality conditions are satisfied. In each iteration, only the involved part of the kernel matrix needs to stay in memory.

Similar to the dual form of SVC defined in equation [3.17], we can write the general dual form of SVMs as:

[6.1]  \min_{\alpha} \; \frac{1}{2}\,\alpha^T Q\,\alpha + p^T\alpha
       \quad \text{subject to} \quad y^T\alpha = 0, \;\; 0 \le \alpha_i \le C, \; i = 1, \dots, l

Let B denote the working set, which contains n variables, and N denote the non-working set, which contains the remaining (l − n) variables. Then, α, y, Q and p can be correspondingly partitioned as:

\alpha = \begin{bmatrix} \alpha_B \\ \alpha_N \end{bmatrix}, \quad
y = \begin{bmatrix} y_B \\ y_N \end{bmatrix}, \quad
Q = \begin{bmatrix} Q_{BB} & Q_{BN} \\ Q_{NB} & Q_{NN} \end{bmatrix}, \quad
p = \begin{bmatrix} p_B \\ p_N \end{bmatrix}

Accordingly, the small task of the dual form in this case can be written as:

[6.2]  \min_{\alpha_B} \; \frac{1}{2}\,\alpha_B^T Q_{BB}\,\alpha_B
       + (p_B + Q_{BN}\,\alpha_N)^T \alpha_B
       + \frac{1}{2}\,\alpha_N^T Q_{NN}\,\alpha_N + p_N^T \alpha_N

under the constraints

[6.3]  y_B^T \alpha_B = -\,y_N^T \alpha_N
[6.4]  0 \le \alpha_i \le C, \;\; \forall i \in B

Since the last two terms of equation [6.2], \tfrac{1}{2}\alpha_N^T Q_{NN}\alpha_N + p_N^T\alpha_N, remain constant in each iteration, they can be omitted during the calculation, so that the minimization problem [6.2] has essentially the same form as the original objective function [3.17]. One advantage of this decomposition method is that the newly generated task is small enough to be solved by most off-the-shelf methods and requires little storage space, which modern computers can easily afford. In the extreme case where the working set contains only two variables, the derived algorithm is SMO. This is in fact an efficient solver for the inner small tasks due to its relative simplicity and high performance; such a binary subproblem can easily be solved analytically [PLA 99a, CHA 01]. As stated in [JOA 99], solving equation [6.2] strictly improves the objective of the global problem defined in equation [3.17] while remaining feasible. This feature guarantees the global convergence of the decomposition method.

The KKT optimality conditions are verified through evaluating the gradient of [6.1], i.e.

\nabla f(\alpha)_i = \sum_{j=1}^{l} Q_{ij}\,\alpha_j + p_i, \qquad i = 1, \dots, l

This procedure can be summarized as follows. First, we classify the training samples into two categories:

I_{up}(\alpha) = \{\, i \mid \alpha_i < C, \, y_i = 1 \;\; \text{or} \;\; \alpha_i > 0, \, y_i = -1 \,\}
I_{low}(\alpha) = \{\, i \mid \alpha_i < C, \, y_i = -1 \;\; \text{or} \;\; \alpha_i > 0, \, y_i = 1 \,\}

Then, we search two extreme values m(α) and M(α):

[6.5]  m(\alpha) = \max_{i \in I_{up}(\alpha)} \; -y_i \nabla f(\alpha)_i
[6.6]  M(\alpha) = \min_{i \in I_{low}(\alpha)} \; -y_i \nabla f(\alpha)_i

And finally, we define the stopping criterion as:

[6.7]  m(\alpha) - M(\alpha) \le \epsilon

The selection of the working set directly influences the speed of convergence. For the inner SMO solver, a maximal violating pair is selected as the binary working set according to first- or second-order information [FAN 05]. We do not describe here how the inner SMO solver works, since it has been discussed in detail in [PLA 99a] and [CHA 01].

For the selection of the working set B, we simply consider the first-order information and select the maximal violating pairs, as proposed by Zanni [ZAN 06]. Suppose the required size of B is n; we choose q (q < n) variables from α by sequentially selecting pairs of variables that satisfy equations [6.5] and [6.6]. The remaining (n − q) variables are those that entered B in the last iteration but have not yet been selected in the current B. The selection of these (n − q) variables follows the sequence: free variables (0 < αi < C), lower-bound variables (αi = 0) and upper-bound variables (αi = C). The reason for restricting the number of new variables entering the working set is to avoid certain variables frequently entering and leaving it; otherwise, the speed of convergence would slow down considerably [ZAN 06].

After the working set is optimized, f is updated using the newly optimized αj, ∀j ∈ B. This procedure is crucial as it prepares f for the optimality condition evaluation and the working set selection of the next iteration. In fact, this is the most computationally expensive step in SVM training, due to the heavy work of computing Qij. The updating procedure can be written as follows:

[6.8]  f_i \leftarrow f_i + \sum_{j \in B} Q_{ij}\,\Delta\alpha_j, \qquad i = 1, \dots, l

where Δαj is the newly optimized αj minus its old value. The whole decomposition method is summarized in Algorithm 6.1.
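To make the flow of Algorithm 6.1 concrete, the following C sketch outlines the select-optimize-update loop described above. It is only a structural sketch: select_working_set, solve_smo_subproblem, update_gradient, m_alpha and M_alpha are hypothetical helpers standing in for equations [6.5]-[6.8], not the actual MRPsvm code.

#include <stddef.h>

typedef struct {
    double *alpha;   /* l dual variables                                */
    double *f;       /* gradient vector, f_i = sum_j Q_ij alpha_j + p_i */
    double *p;       /* linear term of the dual objective               */
    int    *y;       /* labels (+1 / -1)                                */
    size_t  l;       /* number of training samples                      */
} svm_state;

/* hypothetical helpers standing in for the steps described in the text */
size_t select_working_set(const svm_state *s, size_t n, size_t *B);
void   solve_smo_subproblem(svm_state *s, const size_t *B, size_t nB,
                            double *delta_alpha);
void   update_gradient(svm_state *s, const size_t *B, size_t nB,
                       const double *delta_alpha);   /* equation [6.8]  */
double m_alpha(const svm_state *s);                  /* equation [6.5]  */
double M_alpha(const svm_state *s);                  /* equation [6.6]  */

void decomposition_train(svm_state *s, size_t n, double eps)
{
    size_t B[1024];            /* indices of the current working set    */
    double delta_alpha[1024];

    /* repeat the select-optimize-update loop until criterion [6.7]     */
    while (m_alpha(s) - M_alpha(s) > eps) {
        size_t nB = select_working_set(s, n, B);
        solve_smo_subproblem(s, B, nB, delta_alpha);  /* inner SMO      */
        update_gradient(s, B, nB, delta_alpha);       /* dominant cost  */
    }
}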

6.4. MPI-based parallel support vector machines

6.4.1. Message passing interface programming model

MPI is a library specification for message-passing parallel computation. Basically, processes have separate address spaces and communicate by sending and receiving messages. MPI is mainly designed for distributed memory parallel architectures, in which each process runs on a separate node and communication goes through a high-performance switch. Its rich libraries make it applicable to parallel computers, clusters and heterogeneous networks. However, it also supports the shared memory programming model, in which multiple processes can read from and write to the same memory space. Several free yet well-tested and efficient implementations of MPI are available in the public domain, such as MPICH, LAM and IBM's MPI. These fostered the development of a parallel software industry and encouraged the development of portable and scalable large-scale parallel applications.

MPI functions require us to specify the type of the data sent between processes, because these functions pass variables, not defined types. If the data type is a standard one, such as int, char or double, we can use the predefined MPI datatypes MPI_INT, MPI_CHAR and MPI_DOUBLE.

Here is an example in C in which all the processors send an array of ints to the root with MPI_Gather:

figure029.jpg
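The original listing is given in the figure above; the following is only a minimal reconstruction of the kind of call it describes, assuming each process contributes an array of 100 ints that rank 0 gathers.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int send_array[100];
    int *recv_array = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < 100; i++)
        send_array[i] = rank;              /* dummy data to send        */

    if (rank == 0)                         /* only the root receives    */
        recv_array = malloc(100 * size * sizeof(int));

    /* every process sends 100 MPI_INTs; rank 0 gathers all of them     */
    MPI_Gather(send_array, 100, MPI_INT,
               recv_array, 100, MPI_INT, 0, MPI_COMM_WORLD);

    free(recv_array);
    MPI_Finalize();
    return 0;
}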

However, we may instead wish to send the data as one block, as opposed to 100 separate ints. To do this, we define a "contiguous block" derived data type.

figure028.jpg
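A minimal sketch of this idea, continuing the previous example: the 100 ints are registered once as a derived datatype with MPI_Type_contiguous, and a single element of that type is then gathered from each process.

MPI_Datatype block_of_100;
MPI_Type_contiguous(100, MPI_INT, &block_of_100);   /* 100 contiguous ints */
MPI_Type_commit(&block_of_100);

/* one element of the derived type is gathered from every process */
MPI_Gather(send_array, 1, block_of_100,
           recv_array, 1, block_of_100, 0, MPI_COMM_WORLD);

MPI_Type_free(&block_of_100);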

Passing a class or a data structure cannot be done with a predefined data type. MPI_Type_create_struct creates an MPI derived datatype from predefined MPI datatypes, as follows:

int MPI_Type_create_struct(int count, int blocklen[], MPI_Aint disp[],
                           MPI_Datatype type[], MPI_Datatype *newtype)

where count is the number of blocks and also the number of entries in blocklen, disp and type; blocklen gives the number of elements in each block (array of integers); disp gives the byte displacement of each block (array of integers); and type gives the type of the elements in each block (array of handles to datatype objects).
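As an illustration only (this is not the book's original listing), the following hypothetical example builds a derived datatype for a C struct mixing an int and three doubles:

#include <mpi.h>
#include <stddef.h>   /* offsetof */

typedef struct {
    int    id;
    double x[3];
} particle_t;           /* hypothetical structure used for illustration */

void build_particle_type(MPI_Datatype *particle_type)
{
    int          count       = 2;                    /* two blocks       */
    int          blocklen[2] = { 1, 3 };             /* 1 int, 3 doubles */
    MPI_Aint     disp[2]     = { offsetof(particle_t, id),
                                 offsetof(particle_t, x) };
    MPI_Datatype type[2]     = { MPI_INT, MPI_DOUBLE };

    MPI_Type_create_struct(count, blocklen, disp, type, particle_type);
    MPI_Type_commit(particle_type);
}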

MPI uses objects called communicators and groups to define which collection of processes may communicate with each other. Most MPI routines require us to specify a communicator as an argument. Simply use MPI_COMM_WORLD whenever a communicator is required – it is the predefined communicator that includes all of our MPI processes.

Within a communicator, every process has its own unique integer identifier, assigned by the system when the process initializes. A rank is sometimes also called a task ID. Ranks are contiguous and begin at zero. Ranks are used by the programmer to specify the source and destination of messages, and are often used conditionally by the application to control program execution (if rank = 0 do this / if rank = 1 do that), as sketched below.
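For instance, a minimal sketch of such rank-based branching looks as follows (the work done in each branch is, of course, application specific):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0) {
    /* master: e.g. distribute the work and collect the results */
} else {
    /* workers: e.g. compute on the local data partition        */
}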

Most MPI routines include a return/error code parameter, as described in the above functions. However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means we will probably not be able to capture a return/error code other than MPI_SUCCESS (zero). The standard does provide a means to override this default error handler. The types of errors displayed to the user are implementation-dependent.

Most of the MPI point-to-point routines can be used in either blocking or non-blocking mode. In blocking mode, the send routine will only “return” after it is “safe” to modify the application buffer (your send data) for reuse. “Safe” means that modifications will not affect the data intended for the receive task. Safe does not imply that the data were actually received – it may very well be sitting in a system buffer. A blocking receive only “returns” after the data have arrived and are ready for use by the program.

While in non-blocking mode, the send and receive routines behave similarly – they will return almost immediately. They do not wait for any communication events to complete, such as message copying from user memory to system buffer space or the actual arrival of message. Non-blocking operations simply “request” the MPI library to perform the operation when it is able. The user cannot predict when that will happen. It is unsafe to modify the application buffer (our variable space) until we know for a fact the requested non-blocking operation was actually performed by the library. There are “wait” routines used to do this. Non-blocking communications are primarily used to overlap computation with communication and exploit possible performance gains.
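A minimal sketch of this pattern, assuming at least two processes and a hypothetical buffer of 1,000 doubles, is:

double buf[1000];             /* assumed to be filled before the send */
MPI_Request req;
MPI_Status  status;
int rank;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0)
    MPI_Isend(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
else if (rank == 1)
    MPI_Irecv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

/* ... useful computation that does not touch buf can run here ...    */

MPI_Wait(&req, &status);      /* only now is buf safe to reuse or read */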

In the next sections, we will introduce two parallel implementations of SVM, Pisvm and Psvm, which are MPI-based implementations. We briefly introduce their features and advantages without going into implementation details; since they are open source projects, readers can find their documentation and code on the Internet.

6.4.2. Pisvm

Pisvm is based on a decomposition method and allows efficient training and testing with various kernel functions. Its implementation is based on MPI. It is compatible with Libsvm, using the same input format and command line switches, and is a good substitute for Libsvm when solving larger problems. It runs on any cluster where MPI is available and achieves superlinear speedups on some large-scale datasets. It supports the training of all SVM formulations, including C-SVC and ν-SVC for classification and ε-SVR and ν-SVR for regression.

Pisvm explores three different parallelization strategies, for the interior point algorithm, the gradient projection algorithm and SMO, each in combination with the decomposition approach for SVM training. The gradient projection algorithm is used successfully for parallel C-SVC, and its extension to QP problems with two linear constraints, which arise when training ν-SVC and ν-SVR, is explored. Unfortunately, this extension converges slowly, making it inapplicable in practice. Similarly slow convergence is found in the parallel implementation of the interior point algorithm. Therefore, parallel SMO is proposed and adopted in Pisvm.

This parallel implementation achieves superlinear speedups for C-SVC and ν-SVR. In the regression setting, Pisvm showed close to linear speedup on the Kddcup99 dataset benchmark, as explained later.

6.4.3. Psvm

Psvm tries to reduce memory use and to parallelize both data loading and computation. The key step of Psvm is parallel incomplete Cholesky factorization (ICF). Traditional column-based ICF can reduce the computational cost, but its initial memory requirement is high. Psvm therefore devises a parallel row-based ICF (PICF) as its initial step, which loads the training instances onto parallel machines and performs the factorization on these machines simultaneously. Once PICF has loaded the n training instances distributed over m machines and reduced the size of the kernel matrix through factorization, the interior point method can be run on the parallel machines simultaneously.

Psvm first loads the training data in a round-robin manner onto the m machines. Then, it performs the parallel row-based ICF on the loaded data; at the end of parallel ICF, each machine stores only a fraction of the factorized matrix. It then runs the parallel interior point method to solve the quadratic optimization problem. The computation time is improved compared to sequential and traditional parallel SVM implementations. However, linear speedup cannot be maintained when the number of machines increases beyond a data-size-dependent threshold.

6.5. MapReduce-based parallel support vector machines

This section discusses some important issues in the implementation of MRPsvm [PAN 12]. The underlying parallelism is based on the MapReduce framework. The communication and data decomposition are especially designed for multicore and multiprocessor systems.

6.5.1. MapReduce programming model

MapReduce is a parallel programming framework originally proposed by Google in [DEA 08]. It allows users to write code in a functional style: map computations over separate pieces of data, generate intermediate key-value pairs and then reduce the intermediate values assigned to the same key, for example by summation. A runtime system automatically handles the low-level mapping, scheduling, parallel processing and fault tolerance. It is a simple yet very useful framework that helps extract the parallelism of computations on large datasets by taking advantage of distributed systems, as illustrated in [PAN 12, PAN 10] and [PAN 13]. The Map and Reduce functions are defined with respect to data structured in key-value pairs.

The Map function takes one pair of data with a type in one data domain and returns a list of pairs in a different domain:

\text{Map}: (k_1, v_1) \rightarrow \text{list}(k_2, v_2)

The Map function is applied in parallel to every item in the input dataset. This produces a list of (k2, v2) pairs for each call. After this, the MapReduce framework collects all pairs with the same key from all lists and groups them together, thus creating one group for each one of the different generated keys.

The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

\text{Reduce}: (k_2, \text{list}(v_2)) \rightarrow \text{list}(v_3)

Each Reduce call typically produces either one value v3 or an empty return, though one call is allowed to return more than one value. The returns of all calls are collected as the desired result list. Reduce invocations are distributed by partitioning the intermediate key space into R pieces using a partitioning function, for instance hash(key) mod R. The number of partitions R and the partitioning function are specified by the user.
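As an illustration only (this is not the Phoenix API itself), the two signatures above can be rendered in C roughly as follows:

#include <stddef.h>

typedef struct { const char *key; double value; } kv_pair;
typedef struct { kv_pair *items; size_t count; } kv_list;

/* Map:    (k1, v1)       -> list(k2, v2) */
typedef kv_list (*map_fn)(const char *k1, const void *v1);

/* Reduce: (k2, list(v2)) -> list(v3)     */
typedef kv_list (*reduce_fn)(const char *k2, const double *v2, size_t n);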

Thus, the MapReduce framework transforms a list of key-value pairs into a list of values. This behavior is different from the typical functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.

It is necessary but not sufficient to have implementations of the map and reduce abstractions in order to implement MapReduce. Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases. This may be a distributed file system. Other options are possible, such as direct streaming from mappers to reducers, or for the mapping processors to serve up their results to reducers that query them.

The problem defined in equation [6.8] can be regarded as a sum of several computationally expensive terms, as shown in the top right corner of Figure 6.1. Therefore, MapReduce is naturally suitable for this problem. The working set B is uniformly decomposed into several small pieces, and the calculation of f is divided into several parts in the same manner as B. Each part is then assigned to a mapper. After the parallel calculations of these mappers, the final f is added up by the reducer. Here, j_k, k = 1, ..., n, is the index in the kernel matrix Q of the kth variable of the working set B. In practice, since some of the Δα_{j_k} are so small that they can be neglected, it is not necessary to update f with all n variables.

figure036.jpg

Figure 6.1. Architecture of the parallelization in one iteration

In some recent work, MapReduce has proved to be an effective parallel computing framework on multicore systems. Chu et al. [CHU 06] have developed MapReduce as a general programming framework for machine-learning applications on multicore systems. Phoenix, designed in [RAN 07], implements a common API for MapReduce. It allows users to easily parallelize their applications without handling concurrency management themselves: the MapReduce tasks are performed in threads on multicore systems, and an efficient integrated runtime system handles the parallelization, resource management and fault recovery. This system is adopted as the underlying MapReduce handler in MRPsvm. We show the parallel architecture of one iteration in Figure 6.1. The small tasks are uniformly distributed to the mappers, whose number matches the number of processors. Phoenix takes the role of creating and managing mappers and reducers, making the system easy to implement.
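To make the data flow of Figure 6.1 concrete, the following pthread-based C sketch mimics the map and reduce roles for the gradient update of equation [6.8]. The helper kernel_column, the fixed number of mappers and the data layout are hypothetical placeholders; this is not the actual MRPsvm or Phoenix code.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_MAPPERS 4

extern size_t l;                          /* number of training samples  */
const double *kernel_column(size_t j);    /* hypothetical: column Q[:,j] */

typedef struct {
    const size_t *B;            /* slice of working set indices          */
    const double *delta_alpha;  /* corresponding alpha updates           */
    size_t        nB;           /* slice length                          */
    double       *partial_f;    /* private partial result, length l      */
} map_task;

static void *map_update(void *arg)        /* one mapper                  */
{
    map_task *t = arg;
    memset(t->partial_f, 0, l * sizeof(double));
    for (size_t k = 0; k < t->nB; k++) {
        const double *Qj = kernel_column(t->B[k]);
        for (size_t i = 0; i < l; i++)
            t->partial_f[i] += Qj[i] * t->delta_alpha[k];  /* eq. [6.8]  */
    }
    return NULL;
}

void parallel_update_f(double *f, const size_t *B,
                       const double *delta_alpha, size_t nB)
{
    pthread_t tid[NUM_MAPPERS];
    map_task  task[NUM_MAPPERS];
    size_t chunk = (nB + NUM_MAPPERS - 1) / NUM_MAPPERS;

    for (int m = 0; m < NUM_MAPPERS; m++) {        /* map phase          */
        size_t lo = m * chunk;
        if (lo > nB) lo = nB;
        size_t hi = (lo + chunk < nB) ? lo + chunk : nB;
        task[m].B = B + lo;
        task[m].delta_alpha = delta_alpha + lo;
        task[m].nB = hi - lo;
        task[m].partial_f = malloc(l * sizeof(double));
        pthread_create(&tid[m], NULL, map_update, &task[m]);
    }
    for (int m = 0; m < NUM_MAPPERS; m++) {        /* reduce phase       */
        pthread_join(tid[m], NULL);
        for (size_t i = 0; i < l; i++)
            f[i] += task[m].partial_f[i];
        free(task[m].partial_f);
    }
}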

6.5.2. Caching technique

As stated in the previous sections, for large-scale problems the kernel matrix Q is too large to be stored in memory, and the calculation of the kernel elements Qij is the dominant work that slows down the training. Caching as many kernel elements in memory as possible is therefore an effective technique. MRPsvm maintains a fixed-size cache that stores recently accessed or generated kernel columns. The cache replacement policy is a simple least-recently-used strategy, the same as that of Libsvm. Only a column that is currently needed but not found in the cache is calculated. All parallel maps share a single copy of the cache in the shared memory. In consequence, the insertion of a new column into the cache, by whichever map performs it, must be synchronized.
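A minimal sketch of this shared-cache access pattern, with hypothetical cache_lookup and cache_insert_lru helpers and a pthread mutex guarding the cache, might look as follows:

#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* hypothetical helpers: lookup returns NULL on a miss */
const double *cache_lookup(size_t j);
void          cache_insert_lru(size_t j, const double *col);
void          compute_kernel_column(size_t j, double *col);

const double *get_column(size_t j, double *scratch)
{
    pthread_mutex_lock(&cache_lock);
    const double *col = cache_lookup(j);       /* cache hit?            */
    pthread_mutex_unlock(&cache_lock);
    if (col != NULL)
        return col;

    compute_kernel_column(j, scratch);         /* cache miss: compute   */
    pthread_mutex_lock(&cache_lock);           /* synchronized insert   */
    cache_insert_lru(j, scratch);              /* LRU replacement       */
    pthread_mutex_unlock(&cache_lock);
    return scratch;
}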

For the inner SMO solver, the kernel matrix size depends on the size of the working set B, which is normally set to 1,024 based on experience; it is therefore practical to precompute and cache the full version of this small kernel matrix.

6.5.3. Sparse data representation

To reduce the storage requirements, the sample vectors xi are stored in a sparse representation. When calculating a kernel column Qij , i = 1, 2, ..., l, we need to unroll the jth sample vector to dense format and then calculate the dot products of this vector with the other l sample vectors.
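For illustration, a Libsvm-style sparse node layout and the dense-sparse dot product used when unrolling a column could be sketched as follows (the names are illustrative, not the exact MRPsvm structures):

#include <stddef.h>

typedef struct {
    int    index;    /* feature index                               */
    double value;    /* non-zero feature value                      */
} svm_node;

typedef struct {
    svm_node *nodes; /* non-zero entries, terminated by index == -1 */
} sparse_vec;

/* dot product of a dense (unrolled) vector with a sparse vector */
double dot_dense_sparse(const double *dense, const sparse_vec *x)
{
    double sum = 0.0;
    for (const svm_node *n = x->nodes; n->index != -1; n++)
        sum += dense[n->index] * n->value;
    return sum;
}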

6.5.4. Comparison of MRPsvm with Pisvm

Pisvm also uses the decomposition method to train SVMs in parallel. It is an efficient tool to analyze the energy behavior of multiple buildings, as stated in our previous work [ZHA 10]. However, its implementation differs from our new implementation MRPsvm in many aspects. First, Pisvm is based on MPI and aims at extracting parallelism from distributed memory systems, while our parallel algorithm is conducted by MapReduce threads on a shared memory system; the two implementations are based on totally different models. Second, in Pisvm, each process stores a copy of the data samples, whereas MRPsvm stores only one copy in the shared memory. This means that MRPsvm can save a large amount of storage when the dataset is huge, and the saved space can be used to cache more of the kernel matrix in order to further improve the training speed. Third, Pisvm adopts a distributed cache strategy in order to share the saved kernel elements among all of the processes: each process stores a piece of the cache locally, and consequently the work of updating the gradients is divided and assigned globally to the proper processors according to cache locality. In contrast, MRPsvm has only one copy of the cache, and each processor accesses it equally, so that the overhead of global assignment is avoided. However, we have to note that synchronization is required when writing to the cache.

In the next section, we compare the performance of Pisvm with that of MRPsvm on real application datasets, providing direct evidence that our system is more efficient and better suited to multicore systems than the MPI implementation.

We test MRPsvm by comparing it with the parallel implementation Pisvm and the serial implementation Libsvm on five widely used benchmark datasets. Although this comparison is not based on systems especially designed for multicore architectures, we still have good reasons for doing so. First, to the best of our knowledge, there is no existing parallel implementation of general SVMs that is especially developed for multicore systems; therefore, there is a strong need to verify whether our system can outperform the state-of-the-art parallel implementation. Second, most of the systems surveyed in section 6.2 are not available to the public, while Pisvm, as a typical parallel implementation of SVM, is easy to obtain. Finally, the quadratic problem solver of Pisvm is the same as that of MRPsvm; hence, by comparing our system with Pisvm, the advantage of the MapReduce framework is more convincing.

Two computers with different hardware architectures are adopted to check hardware effects. As shown in Table 6.2, the first computer has 4 cores with a shared L2 cache and memory. The second computer is a dual-processor system with 4 cores in each processor; the cores in the same processor share one cache, and the main memory is shared among all of the cores. Both computers run Linux 2.6.27-7.

Table 6.2. Characteristics of the multi-core systems

Features      Computer I       Computer II
# of CPUs     1                2
# of cores    4                8
Frequency     1,600 MHz × 4    2,327 MHz × 8
Memory        2 GB             4 GB
L2 cache      4 MB             6 MB × 2

The five datasets are shown in Table 6.3. They vary in sample size and dimension. We train all SVMs with the Gaussian kernel. The tolerance of the termination criterion is set to 0.01. Since we focus on comparing the three systems, the outputs of these classifiers may not be optimal; in other words, we do not guarantee that the SVM parameters C and γ reach optimal values. They are simply taken from the literature, as shown in the last two rows of Table 6.3.

Table 6.3. Description of the five datasets and the two parameters of support vector machines on each dataset

                     Web      Adult    Mnist    Covtype   Kddcup99
# training samples   24,692   32,561   60,000   435,759   898,430
# testing samples    25,075   16,281   10,000   145,253   311,029
# Classes            2        2        2        8         2
# Dimensions         300      123      576      54        122
C                    64       100      10       10        2
γ                    7.8152   0.5      1.667    2e-5      0.6

Since the caching technique is crucial for performance, for a reliable comparison we set the cache size to be the same for all three systems. Furthermore, we restrict the cache size to be far smaller than the memory size in order to minimize page faults at runtime. We have to emphasize that the performance reported below might therefore not be optimal for any of the three systems; it serves only for comparison purposes.

Table 6.4. Training time and accuracy of the three systems on five datasets performed on computer I. The unit of time is second

figure035.jpg

Table 6.4 shows the results of the three implementations performed on 4 processors. The time columns represent the whole training time, i.e. from reading the problem to writing the outputs. Here, the speedup denotes how many times faster the parallel implementation is than the sequential implementation:

\text{speedup} = \frac{t_{\text{sequential}}}{t_{\text{parallel}}}

By analyzing the results, we can see that MRPsvm successfully parallelizes SVM training. For all five datasets, much more time is saved when running MRPsvm than Libsvm; in the first four cases, MRPsvm is more than four times faster than Libsvm. In all of the cases, MRPsvm achieves a markedly higher speedup than Pisvm, indicating that MRPsvm is better suited than Pisvm to multicore systems.

We note that in these experiments, the accuracy of the three classifiers is almost the same, and the numbers of support vectors generated by these classifiers are also quite close. This is because the three implementations essentially share the same mechanism for solving the quadratic problem, i.e. iteratively optimizing one pair of variables until global optimality is reached. The difference in runtime is mainly caused by the selected working set: selecting different variables to optimize may produce quite different results in a given iteration, but, generally speaking, as long as global convergence is reached, the influence is marginal.

Figure 6.2 shows the speedups of the two parallel solvers over the sequential solver when running on the second computer. MRPsvm again outperforms Pisvm on all of the datasets. Among them, the best speedup is achieved on Adult, while the worst is found on Kddcup99, indicating that MRPsvm performs better on smaller datasets. The main reasons for the weaker performance on larger problems are locality and the overhead of the reduction. In each map, updating f requires accessing all the data samples and several temporary vectors of size close to l. Therefore, for large datasets, it is difficult to guarantee locality in the L2 cache, especially when the cache is shared by several threads. Moreover, since we partition the global problem in columns, each map generates l intermediate values fi, so the reduction is costly when l is very large.

figure034.jpg

Figure 6.2. Speedup of Pisvm and MRPsvm over Libsvm when running on computer II

In fact, the parallel performance on 8 cores only slightly outperforms that on 4 cores. As explained at the beginning of this section, this is because we did not make full use of the memory on computer II. More time could be saved by carefully increasing the cache size toward the available memory. In this optimal case, the cache size of MRPsvm and Libsvm would be larger than that of Pisvm, since the former two systems generally require less memory; this implies that larger improvements can be achieved for MRPsvm and Libsvm than for Pisvm.

6.6. MapReduce-based parallel ε-support vector regression

6.6.1. Implementation aspects

In this section, we use the same mechanism to parallelize SVR training process, and then apply this new parallel algorithm to predict building energy consumption. The above three implementations are compared again in this application.

First, we reformulate ε-SVR in the same form as the one defined in equation [6.1]. Let us write the training data as (x1, z1), ..., (xl, zl), where the vector xi is the ith sample, zi is the target value corresponding to xi and l is the number of samples. The dual form of the SVR can be written in the following quadratic form:

[6.9]  \min_{\alpha} \; \frac{1}{2}\,\alpha^T Q\,\alpha + p^T\alpha

under the constraints

[6.10]  y^T\alpha = 0
[6.11]  0 \le \alpha_i \le C, \;\; i = 1, \dots, 2l

where α is a vector of 2l variables and Q is a 2l by 2l kernel matrix. Each element of Q has the following form:

Q_{ij} = y_i\, y_j\, K(x_{\bar{i}}, x_{\bar{j}}), \qquad
\bar{i} = \begin{cases} i & \text{if } i \le l \\ i - l & \text{if } i > l \end{cases}

where K(xi, xj) is a kernel function. The parameter p in equation [6.9] is defined as:

[6.12]  p_i = \begin{cases} \epsilon + z_i & i = 1, \dots, l \\ \epsilon - z_{i-l} & i = l+1, \dots, 2l \end{cases}

and the variable y in the constraint [6.10] is defined as:

[6.13]  y_i = \begin{cases} +1 & i = 1, \dots, l \\ -1 & i = l+1, \dots, 2l \end{cases}

In the constraint [6.11], C is the upper bound used to trade off between model performance on training data and its generalization ability. The objective of the problem is to find the solution of α which minimizes equation [6.9] and fulfills constraint equations [6.10] and [6.11]. After the optimum α is found, the decision function can be formulated as:

f(x) = \sum_{i=1}^{l} \left(-\alpha_i + \alpha_{i+l}\right) K(x_i, x) + b

where only support vectors satisfy the relation: −αi + αi+l ≠ 0.
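For illustration, evaluating this decision function from the 2l-dimensional α could be sketched as follows; kernel() and the data layout are hypothetical placeholders:

#include <stddef.h>

double kernel(const double *xi, const double *x, int dim);  /* K(x_i, x) */

double svr_predict(const double *x, const double *const *X, /* l samples */
                   const double *alpha,   /* 2l dual variables           */
                   double b, size_t l, int dim)
{
    double sum = b;
    for (size_t i = 0; i < l; i++) {
        double coef = -alpha[i] + alpha[i + l]; /* non-zero only for SVs */
        if (coef != 0.0)
            sum += coef * kernel(X[i], x, dim);
    }
    return sum;
}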

6.6.2. Energy consumption datasets

Three datasets are prepared for model training. They contain the historical energy consumption of one building, 20 buildings and 50 buildings, respectively. All of these buildings are located in urban areas, evenly distributed over five typical cities in France; the dry bulb temperatures of these five places are drawn in Figure 5.3. All buildings have a similar structure, i.e. single-story, mass-built, one rectangular room with an attic roof and four windows without shading. Electrical equipment, including lighting systems, fans and water heaters, is scheduled as for common office use. In the winter season (from November 1 to March 31), district heating is applied in order to keep the room temperature at a constant level, and ventilation is adopted for indoor thermal comfort. The number of occupants depends on the housing space and people density, with an average of 0.2 people per unit of zone floor area. During the simulation, some input variables, such as size, orientation, window area and scheduling, are set differently to achieve diversity among the buildings.

The dataset of one building contains the hourly energy dynamics over a period of one year. We select eight important features, shown in the Case I column of Table 5.2, according to the evaluation of our feature selection method. For the two other datasets, the recording period is from November to March, which is the winter season in France, and we record four more features that generate the building diversity, i.e. height, length, width and window/wall area ratio. Since weekends and holidays have totally different energy behaviors from working days, to simplify the model we only use the consumption data of working days in the experiments. One more building is simulated for model evaluation purposes. The attributes of the three datasets are shown in Table 6.5.

Table 6.5. Description of the three datasets and the three parameters of support vector regression on each dataset. #tr: number of training samples, #te: number of testing samples

Dataset #tr #te Dimensions C γ ε
One building 5064 1008 8 32 0.594 0.01
20 buildings 49940 2498 12 16 0.4193 0.01
50 buildings 124850 2498 12 16 0.4357 0.01

6.6.3. Evaluation for building energy prediction

We perform the experiments on two shared memory systems, outlined in Table 6.6. To vary from the computers used above, this time we use two different ones: the first has two cores while the second has four cores. Both have a shared L2 cache and memory and run a 64-bit Linux system (kernel version 2.6.27-7). Their memory size is the same.

We train all SVRs with the Gaussian kernel, with the parameters shown in the last three columns of Table 6.5. Again, we restrict the cache size to be far smaller than the memory size.

On each dataset, we train Libsvm, Pisvm and MRPsvm on both computer I and computer II. Table 6.7 shows the results of the three implementations on the dual-core system, including the number of support vectors (nSVs), MSE, SCC and training time. The training times on the quad-core system are shown in Table 6.8; since nSVs, MSE and SCC in the quad-core case are the same as those in the dual-core case, we omit them in Table 6.8.

Table 6.6. Characteristics of the experimental environment

Features      Computer I     Computer II
# of CPUs     1              1
# of cores    2              4
Frequency     3.4 GHz × 2    1.6 GHz × 4
Memory        2 GB           2 GB
L2 cache      2 MB           4 MB

Table 6.7. Training time and performance of the three predictors on three datasets performed on computer I. bd: building, nSVs: number of support vectors, MSE: mean squared error, SCC: squared correlation coefficient. The unit of time is second

figure033.jpg

We can see that, for all three datasets, the accuracy and the nSVs of MRPsvm are quite close to those of Libsvm and Pisvm, and that MRPsvm runs faster than Libsvm and Pisvm in all tests.

We show the speedup of MRPsvm and that of Pisvm comparatively in Figures 6.3, 6.4 and 6.5. For the case of one building, the speed of MRPsvm is twice as high as that of Libsvm. For the case of 20 and 50 buildings, MRPsvm performs more than 16 times faster than Libsvm. On both computers and on all three datasets, MRPsvm achieves remarkably higher speedup than Pisvm, indicating that MRPsvm is more suitable than Pisvm on multicore systems. The speed improvement by MRPsvm is particularly obvious in multiple building cases, indicating that MRPsvm performs better than Pisvm on much larger datasets. However, the better performance is not guaranteed on even larger problems due to locality and overheads of reduction as stated previously.

Table 6.8. Training time of the three predictors performed on computer II. Time unit is second

Dataset Libsvm Pisvm MRPsvm
1 building 18.0 7.3 6.9
20 buildings 2532.1 214.1 133.5
50 buildings 32952.2 2325.5 1699.0
figure032.jpg

Figure 6.3. Speedup of Pisvm and MRPsvm on one building’s data

6.7. Concluding remarks

This chapter proposes a parallel implementation of SVMs for multicore and multiprocessor systems. It implements the decomposition method and uses SMO as the inner solver. The parallelism is applied to updating the vector f in the decomposition step and is programmed in the simple yet pragmatic programming framework MapReduce. A shared cache is designed to save the kernel matrix columns when the data size is very large. First, we use this mechanism to implement SVC, and extensive experiments show that our system is very efficient in solving large-scale problems; for instance, on 4 processors it runs more than 4 times faster than Libsvm for most of the applications. It outperforms the state-of-the-art Pisvm in all benchmark tests in terms of both speed and memory requirement.

figure031.jpg

Figure 6.4. Speedup of Pisvm and MRPsvm on 20 buildings’ data

figure030.jpg

Figure 6.5. Speedup of Pisvm and MRPsvm on 50 buildings’ data

We also parallelize SVR for solving large-scale problems in building energy analysis. Experimental results show that the new parallel system provides the same accuracy as Libsvm, yet performs far more efficiently than the sequential implementation in solving the stated problems. On the smallest dataset, MRPsvm runs more than twice as fast as Libsvm, while on the largest dataset the speedup reaches 16-fold on the dual-core system and 19-fold on the quad-core system. Again, the proposed implementation is superior to Pisvm in all tests in terms of both speed and memory requirement.

Since multicore systems dominate the trend of processor development and are widely available on the modern market, MRPsvm is a practical and feasible option for solving large-scale regression problems. Furthermore, the success of MRPsvm indicates that MapReduce is a suitable option for parallelizing machine-learning algorithms.
