5. Model Reduction for Support Vector Machines

5.1. Introduction

In Chapter 1, we stated that a great number of factors potentially influence the energy dynamics of a building. In Chapter 4, when predicting a single building’s energy profiles, we selected 24 features to train the model, including variables describing weather conditions, the energy profiles of each sublevel component, ventilation, water temperature, etc. When predicting the consumption of multiple buildings, four more features representing building structure characteristics were added. However, there is no guarantee that these features are the right choices, nor even that they are all useful. How to choose a suitable subset of features for model learning is one of the key issues in machine learning. On the one hand, using different sets of features can change the accuracy and learning speed of the models. On the other hand, an optimal set of features makes the predictive models more practical.

In this chapter, we discuss how to select subsets of features for SVR applied to the prediction of building energy consumption [ZHA 12a]. We present a heuristic approach for selecting feature subsets and systematically analyze how it influences model performance. The motivation is to develop a feature set that is simple and can be recorded easily in practice. The models are trained by SVR with different kernel methods on three datasets. The model reduction, or feature selection (FS), method is evaluated by comparing the models’ performance before and after FS is performed.

This chapter is organized as follows. Section 5.2 gives an overview of model reduction, and introduces three directions of feature selection. Section 5.3 discusses general FS methods and in particular the ones introduced in this work. Sections 5.4 and 5.5 illustrate, with several numerical experiments, the robustness and efficiency of the proposed method. Finally, Section 5.6 gives the conclusion and discussion.

5.2. Overview of model reduction

The past few years have witnessed significant changes in data science, as domains with hundreds to tens of thousands of variables or features are now routinely explored. Two typical examples of these new application domains serve as an illustration here: gene selection from microarray data and text categorization. For the former, the variables are gene expression coefficients corresponding to the abundance of mRNA in a sample, e.g. a tissue biopsy, for a certain number of patients. A typical classification task is to separate healthy patients from those with cancer, based on their gene expression profiles. Under normal circumstances, fewer than 100 examples (patients) are available altogether for training and testing, whereas the number of variables in the raw data ranges from 6,000 to 60,000. Initial filtering usually reduces this number to a few thousand. Because the abundance of mRNA varies by several orders of magnitude depending on the gene, the variables are usually standardized. In the text classification problem, documents are represented by a bag-of-words, a vector whose dimension is the size of the vocabulary and whose entries are word frequency counts (proper normalization of the variables also applies). Common vocabularies contain hundreds of thousands of words, so an initial pruning of the most and least frequent words may reduce the effective number of words to around 15,000. Furthermore, large document collections of 5,000 to 800,000 documents are available for research. Typical tasks include the automatic sorting of URLs into a web directory and the detection of unsolicited emails (spam).

New techniques are being developed to address these challenging tasks, which involve many irrelevant and redundant variables and often comparatively few training examples. The potential benefits of variable and feature selection include facilitating data visualization and understanding, reducing measurement and storage requirements, reducing training and utilization time, and defying the curse of dimensionality to improve prediction performance. Feature selection methods are generally divided into wrappers, filters and embedded methods. Wrappers use the learning machine of interest as a black box to score subsets of variables according to their predictive power. Filters select subsets of variables as a preprocessing step, independently of the chosen predictor. Embedded methods perform variable selection during the training process and are usually specific to the given learning machine.

5.2.1. Wrapper methods

Wrapper methods offer a simple and powerful way to address the problem of variable selection, regardless of the chosen learning machine. The learning machine is treated as a perfect black box, and the method lends itself to the use of off-the-shelf machine-learning software packages. In its most general formulation, the wrapper methodology consists of using the prediction performance of a given learning machine to assess the relative usefulness of subsets of variables. In practice, we need to define: (1) how to search the space of all possible variable subsets; (2) how to assess the prediction performance of a learning machine for the purpose of guiding the search or halting it; and (3) which predictor to choose. An exhaustive search can conceivably be performed if the number of variables is not too large; however, the problem is known to be NP-hard [AMA 98] and the search quickly becomes computationally intractable. A wide range of search strategies has therefore been proposed, including best-first, branch-and-bound, simulated annealing and genetic algorithms. Performance assessment is usually carried out on a validation set or by cross-validation. Popular predictors include decision trees, naive Bayes, least-squares linear predictors and support vector machines.

Wrappers are often criticized as a brute-force approach requiring massive amounts of computation, but this is not always the case: more efficient search strategies can be devised, and their use does not necessarily come at the expense of prediction performance. In fact, the opposite holds in some cases, since coarse search strategies may alleviate the problem of overfitting. Greedy search strategies seem to be particularly computationally advantageous and robust against overfitting. They come in two flavors: forward selection, where variables are progressively incorporated into larger and larger subsets, and backward elimination, which starts with the set of all variables and progressively eliminates the least promising ones. Both approaches yield nested subsets of variables. On the one hand, by using the learning machine as a black box, wrappers are remarkably universal and simple. On the other hand, embedded methods that incorporate variable selection as part of the training process may be more efficient in several respects: they make better use of the available data, in the sense that they do not need to split the training data into training and validation sets, and they reach a solution faster by avoiding retraining a predictor from scratch for every variable subset investigated. Embedded methods are not new; decision trees such as classification and regression trees (CARTs), for instance, have a built-in mechanism to perform variable selection. The next two sections discuss filter and embedded methods in more detail.
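
To make the wrapper idea concrete, the following sketch (in Python with scikit-learn, not part of the original study) implements a greedy forward-selection wrapper around an SVR predictor scored by cross-validation. The data, the stopping rule and the feature budget are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, max_features=8, cv=5):
        """Greedy forward selection: at each step, add the feature that most
        improves the cross-validated score of an SVR predictor."""
        selected, remaining = [], list(range(X.shape[1]))
        best_score = -np.inf
        while remaining and len(selected) < max_features:
            candidates = [(cross_val_score(SVR(kernel="rbf"), X[:, selected + [j]],
                                           y, cv=cv).mean(), j)
                          for j in remaining]
            score, j = max(candidates)
            if score <= best_score:      # stop when no candidate improves the score
                break
            best_score = score
            selected.append(j)
            remaining.remove(j)
        return selected, best_score

    # Illustrative usage on synthetic data
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=200)
    print(forward_selection(X, y, max_features=4))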

5.2.2. Filter methods

Several justifications for using filters in subset selection have been put forward. It is argued that filters are faster than wrappers, although recently proposed efficient embedded methods are competitive in this respect. Another argument is that some filters, e.g. those based on mutual information criteria, provide a generic selection of variables, not tuned for or by a given learning machine. A third argument is that filtering can be used as a preprocessing step to reduce space dimensionality and overcome overfitting. In this respect, it seems reasonable to use a wrapper (or embedded method) with a linear predictor as the filter and then train a more complex nonlinear predictor on the resulting variables. A classic example of this approach is found in the paper of Bi et al. [BI 03]: a linear 1-norm SVM is used for variable selection, while a nonlinear 1-norm SVM is used for prediction. The complexity of linear filters can be increased by adding products of input variables (monomials of a polynomial) to the selection process and retaining the variables that appear in any selected monomial. Another predictor, e.g. a neural network, is eventually substituted for the polynomial to perform predictions using the selected variables. In some cases, however, we may want to reduce the complexity of linear filters to overcome overfitting problems. When the number of examples is small compared to the number of variables (for instance, in the case of microarray data), we may need to resort to selecting variables with correlation coefficients. Information-theoretic filtering methods such as Markov blanket algorithms form another broad family. The justification for classification problems is that the mutual information measure does not rely on any particular prediction process, but provides a bound on the error rate of any prediction scheme for the given distribution. We do not illustrate such methods here.
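
As a rough illustration of this filter-then-predict idea (not the exact formulation of [BI 03]), a sparse linear model can screen variables before a nonlinear predictor is trained. The sketch below, for a classification task on synthetic data, uses scikit-learn’s L1-penalized linear SVM as the filter and an RBF SVM as the predictor.

    import numpy as np
    from sklearn.svm import LinearSVC, SVC
    from sklearn.feature_selection import SelectFromModel
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 50))
    y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)   # only 3 informative variables

    # L1-penalized linear SVM as the filter, nonlinear RBF SVM as the predictor
    model = make_pipeline(
        SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.1)),
        SVC(kernel="rbf"),
    )
    model.fit(X, y)
    print(model.score(X, y))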

5.2.3. Embedded methods

Embedded methods are very different from other feature selection methods in how feature selection and learning interact. Filter methods do not incorporate learning at all. Wrapper methods use a learning machine to measure the quality of subsets of features without incorporating knowledge about the specific structure of the classification or regression function, and can therefore be combined with any learning machine. In contrast to filter or wrapper approaches, in embedded methods the learning part and the feature selection part cannot be separated: the structure of the class of functions under consideration plays a crucial role. Embedded methods fall into three families:

  1. Explicit removal or addition of features: the scaling factors are optimized over the discrete set {0, 1}^n in a greedy iteration;
  2. Optimization of scaling factors: the optimization is performed over the compact set [0, 1]^n;
  3. Linear approaches: these approaches directly enforce sparsity of the model parameters.

Each family of feature selection methods, i.e. filter, wrapper and embedded, has its own advantages and drawbacks. In general, filter methods are fast, since they do not incorporate learning. Most wrapper methods are slower than filter methods, since they typically need to evaluate a cross-validation scheme at every iteration. Whenever the function that measures the quality of a scaling factor can be evaluated faster than a cross-validation error estimate, embedded methods are expected to be faster than wrapper approaches. Embedded methods tend to have higher capacity than filter methods and are therefore more likely to overfit. We thus expect filter methods to perform better if only small amounts of training data are available; as the number of training points increases, however, embedded methods will eventually outperform filter methods.

5.3. Model reduction for energy consumption

5.3.1. Introduction

FS is a challenging subject and is widely studied in the machine-learning community. Principal component analysis (PCA) and kernel principal component analysis (KPCA) are two widely used methods in exploratory data analysis and in training predictive models [ROS 01]. In a raw dataset, there are often correlations between variables. PCA aims at removing these correlations while retaining as much of the data variance as possible. It converts a set of possibly correlated features, through an orthogonal transformation, into a set of uncorrelated features called principal components. After PCA processing, new features are created and the total number of features is reduced. KPCA extends PCA with kernel methods in order to extract nonlinear principal components [SCH 98]. This allows us to obtain new features that capture higher order correlations between the original variables.
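
For illustration only, the following sketch shows how PCA and KPCA are typically applied with scikit-learn; the data, the component counts and the kernel parameter are placeholder assumptions.

    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 23))          # hypothetical raw feature matrix

    X_std = StandardScaler().fit_transform(X)

    # Linear PCA: keep enough orthogonal components to explain 95% of the variance
    X_pca = PCA(n_components=0.95).fit_transform(X_std)

    # Kernel PCA: nonlinear principal components through an RBF kernel
    X_kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.1).fit_transform(X_std)

    print(X_pca.shape, X_kpca.shape)

Note that both transforms produce new composite features rather than a subset of the original ones.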

Factor analysis is similar to PCA in that it can be used to reduce dimensionality. It investigates whether a number of variables are linearly related to a smaller number of unobservable variables (factors). The interdependencies obtained between the original variables can then be used to reduce the set of variables. Independent component analysis (ICA) is another feature extraction method and a powerful solution to the problem of blind signal separation. In this model, the original variables are viewed as linear mixtures of some unknown latent variables that are statistically independent; these latent variables are called the independent components of the original data. Unlike PCA, which attempts to decorrelate the data, ICA aims at finding statistically independent features [LIU 99].

The above four methods have been used as data preprocessing methods for SVMs in various applications [CAO 03, QI 01, DÉN 03, FAR 09]. However, they are not suitable choices for us. Our aim is to find a set of features that is not only optimal for the learning algorithms, but also obtainable in practice. This means that we need to select features from the original set without creating new features.

Some FS methods especially designed for SVMs have been proposed. Weston et al. [WES 00] reduced features by minimizing the radius-margin bound on the leave-one-out error via a gradient method. Fröhlich and Zell [FRÖ 04] incrementally chose features based on the regularized risk and a combination of backward elimination and an exchange algorithm. Gold et al. [GOL 05] used a Bayesian approach, automatic relevance determination (ARD), to select relevant features. In [MAN 07], Mangasarian and Wild proposed a mixed-integer algorithm considered to be straightforward and easily implementable. All of these methods focus on eliminating irrelevant features or improving generalization ability. However, they do not consider the feasibility of the selected features in a specific application domain, such as predicting energy consumption. Furthermore, they were implemented only for classification problems.

To the best of our knowledge, there is little work concerning FS for building energy consumption with regard to machine-learning methods. Most of the existing work derives models based on previously established sets of features. Madsen et al. [MAD 95] derived their continuous-time models from five variables: room air temperature, surface temperature, ambient dry bulb temperature, energy input from the electrical heaters and solar radiation on the southern surface. Neto et al. [NET 08] built their neural network on the input of daily average values of dry bulb temperature, relative humidity, global solar radiation and diffuse solar radiation. Azadeh et al. [AZA 08] and Maia et al. [MAI 09] forecast electrical energy consumption by analyzing the variation of the target itself, without involving any contributory variables. Yokoyama et al. [YOK 09] considered only two features, air temperature and relative humidity, in their neural network model. Tso et al. [TSO 07] used more than 15 features in their assessment of traditional regression analysis, decision trees and neural networks. Similar approaches can be found in [DON 05a, BEN 04, WON 10] and [LAI 08].

5.3.2. Algorithm

FS aims at selecting the most useful feature set to establish a good predictor for the learning algorithm concerned. Irrelevant and unimportant features are discarded in order to reduce the dimensionality. Several advantages are achieved if we wisely select the best subset of features. The first is the simplification of the computation, since keeping the dimensionality low helps to avoid the curse of dimensionality. The second is a possible improvement in the accuracy of the developed model. The third is the improved interpretability of the models. The last is the feasibility of obtaining accurate feature samples in practice, especially for time series problems.

Two FS methods are used in our approach to preprocess the raw data before model training. The first ranks the features individually by the correlation coefficient between each feature and the target; we refer to this method as CC. The correlation coefficient between two vectors X and Y of size N is defined as:

$$\mathrm{CC}(X,Y) = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i-\bar{y})^2}}$$

where $\bar{x}$ and $\bar{y}$ denote the means of X and Y.
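
A minimal sketch of the CC ranking, assuming a generic feature matrix X (one column per feature) and a target vector y:

    import numpy as np

    def cc_ranking(X, y):
        """Rank features by the absolute Pearson correlation between each
        feature column and the target vector."""
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        cc = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
        order = np.argsort(-np.abs(cc))      # feature indices, highest |CC| first
        return cc, order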

The other method is called regression gradient guided feature selection (RGS), which was developed by Navot et al. for the analysis of brain neural activity [NAV 06b]. We chose this method because it is designed specifically for regression and has shown a competitive ability to handle complicated dependencies of the target function on groups of features. The basic idea is to assign a weight to each feature and to evaluate the weight vector of all features simultaneously by gradient ascent. A nonlinear K-nearest-neighbor (KNN) estimator is used as the predictor to evaluate the dependency of the target on the features. The estimated target of a sample x under KNN is defined as:

$$\hat{f}(x) = \frac{1}{Z}\sum_{x' \in N(x)} f(x')\, e^{-d_w(x,x')/\beta}$$

where N(x) is the set of K nearest neighbors of sample x. Quantity

$$d_w(x,x') = \sum_{i=1}^{n} w_i^2 (x_i - x'_i)^2$$

measures the weighted distance between sample x and one of its nearest neighbors x′; n is the number of features, w is the weight vector and w_i is the weight assigned to the ith feature. Quantity

$$Z = \sum_{x' \in N(x)} e^{-d_w(x,x')/\beta}$$

is a normalization factor and β is a Gaussian decay factor. Then, the optimal w can be found by maximizing the following evaluation function:

$$e(w) = -\frac{1}{2}\sum_{x \in S}\left(f(x) - \hat{f}(x)\right)^2$$

where S is the set of training samples. Since e(w) is smooth almost everywhere in a continuous domain, the maximization problem can be solved by a gradient ascent method. More details can be found in [NAV 06b].
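
The sketch below illustrates the idea behind RGS under simplifying assumptions: the weighted-KNN estimate and e(w) follow the formulas above, each training point is excluded from its own neighbor set, and the gradient is approximated by finite differences rather than by the analytic expression derived in [NAV 06b], which is what makes the original algorithm efficient. The parameter values are arbitrary.

    import numpy as np

    def knn_estimate(X, y, w, x, beta=1.0, k=5):
        """Weighted-KNN estimate of the target at point x (see the formulas above)."""
        d = ((w ** 2) * (X - x) ** 2).sum(axis=1)    # weighted squared distances d_w
        nn = np.argsort(d)[:k]                       # the K nearest neighbors N(x)
        a = np.exp(-d[nn] / beta)
        return (a * y[nn]).sum() / a.sum()           # a.sum() is the normalization Z

    def evaluation(X, y, w, beta=1.0, k=5):
        """e(w): minus one half of the sum of squared estimation errors over S."""
        errors = [y[i] - knn_estimate(np.delete(X, i, 0), np.delete(y, i),
                                      w, X[i], beta, k)
                  for i in range(len(y))]
        return -0.5 * np.sum(np.square(errors))

    def rgs(X, y, steps=50, lr=0.1, eps=1e-3, beta=1.0, k=5):
        """Gradient ascent on e(w) with a finite-difference gradient."""
        w = np.ones(X.shape[1])
        for _ in range(steps):
            base = evaluation(X, y, w, beta, k)
            grad = np.array([(evaluation(X, y, w + eps * np.eye(len(w))[i], beta, k)
                              - base) / eps
                             for i in range(len(w))])
            w += lr * grad
        return w       # larger |w_i| indicates a more useful feature

In practice the features should be normalized beforehand, and the analytic gradient of [NAV 06b] makes the optimization far cheaper than this finite-difference version.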

5.3.3. Feature set description

We use the data generated in Chapter 2. For a single building, the hourly electric demand, together with the hourly profiles of 23 features, is recorded over one year. Table 5.1 shows the recorded features and their units. It is known that building structure parameters such as room space, wall thickness and window area play important roles in the total energy consumption of a building. However, for one particular building, these variables have constant values throughout the simulation period, which means they do not contribute to SVR model learning. It is therefore practical to discard these variables in this dataset without losing model accuracy. Later, in the data of multiple buildings, we will take these factors into consideration. The data for model training are the first 10 months of the year’s consumption (from January 1 to October 31), and the data for model testing are the remaining 2 months (from November 1 to December 31).

Since people do not usually work at weekends and on holidays, the energy requirement on these days is quite low compared to normal working days, which means that weekends and holidays exhibit totally different energy behaviors. Take the 56th day as an example: it is a Saturday, and the energy consumption for that day is 0.18, compared to a normal consumption of more than 4 on working days, so it can safely be ignored. It has been shown that distinguishing these two types of days when training predictive models such as neural networks can yield considerable performance improvements [KAR 06, YAN 05]. Therefore, to simplify the model, we only use the consumption data of working days for model training and testing. Consequently, the number of samples is 5,064 for training and 1,008 for testing.
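
The preprocessing described above can be sketched with pandas as follows; the file name, the column name “consumption” and the year are placeholder assumptions, and the holiday calendar is omitted.

    import pandas as pd

    # Hypothetical hourly dataset: datetime index, 23 feature columns
    # and one 'consumption' target column
    df = pd.read_csv("building_hourly.csv", index_col=0, parse_dates=True)

    # Keep working days only (Monday=0 ... Friday=4); a holiday calendar
    # would be needed to drop public holidays as well
    df = df[df.index.dayofweek < 5]

    # First 10 months for training, last 2 months for testing
    train = df.loc["2002-01-01":"2002-10-31"]        # the year is a placeholder
    test = df.loc["2002-11-01":"2002-12-31"]

    X_train, y_train = train.drop(columns="consumption"), train["consumption"]
    X_test, y_test = test.drop(columns="consumption"), test["consumption"]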

Table 5.1. Twenty-three features for the model training and testing on one building’s consumption

Features                                 Unit
Outdoor Dry Bulb                         °C
Outdoor Relative Humidity                %
Wind Speed                               m/s
Direct Solar                             W/m²
Ground Temperature                       °C
Outdoor Air Density                      kg/m³
Water Mains Temperature                  °C
Zone Total Internal Total Heat Gain      J
People Number Of Occupants               -
People Total Heat Gain                   J
Lights Total Heat Gain                   J
Electric Equipment Total Heat Gain       J
Window Heat Gain (for each wall)         W
Window Heat Loss (for each wall)         W
Zone Mean Air Temperature                °C
Zone Infiltration Volume                 m³
District Heating Outlet Temp             °C

5.4. Model reduction for single building energy

5.4.1. Feature set selection

In this section, we experimentally analyze our approach to selecting the best subset of features for training statistical models on building energy consumption data. Since the number of features does not have a significant effect on the computational cost of SVM training, we focus primarily on the following two aspects, which are also the two evaluation criteria of our method. The first is that the selected features should be the most important ones for the predictor; in other words, the model generalization error should still be acceptable after FS. For this purpose, we apply FS algorithms and choose the features with the highest rankings or scores. The second is to make sure that the selected features can be easily obtained in practice. For energy data, the values of the chosen features can normally be collected from measurements, surveys and related documents such as building plans. In practice, however, complete and accurate data are difficult to obtain; reducing the number of required features is therefore always welcome.

The two methods RGS and CC described in section 5.3 are applied to evaluate the usefulness of the features. The dataset considered is the working-day consumption described above. The scores for each feature are listed in columns two and three of Table 5.2. We can see that the same feature can have totally different scores under the two FS algorithms. For example, the outdoor dry bulb temperature is the most important feature according to RGS, while it is almost useless according to the CC ranking. As our experimental results show, the features with the highest scores under RGS are generally more useful than those with the highest CC ranks, which indicates that the RGS method is more suitable for SVR than the CC method. However, since feature subsets with low scores may still be useful for the learning algorithms [GUY 03], we take both RGS and CC into consideration when choosing the features.

The weather data can be recorded on site or gathered from the meteorological office. We keep the two weather features with the highest scores under RGS, namely dry bulb temperature and outdoor air density, and discard relative humidity, wind speed, direct solar radiation and ground temperature, even though one might naturally expect their variations to contribute to the energy requirement. The heat gain of the room comes from the water mains temperature, the electrical equipment and the occupants’ schedules. The part from the water mains corresponds to the water temperature delivered by the underground water main pipes. The part from electrical equipment, such as lights and TVs, is determined by the power of this equipment. These can probably be measured or assessed in actual buildings. We divide the room into several zones according to their thermal dynamics. Two further features, the zone mean air temperature, which is the effective bulk air temperature of the zone, and the zone infiltration volume, which denotes the hourly air infiltration of the zone, can also be measured or estimated in a normally operated building. All of the features selected above have scores of not less than one. A special case is the number of occupants: this feature takes a middle place under RGS, but since it can easily be counted in real life and has a very high score under CC, we keep it in the final subset. All other features are discarded because they obtain low scores or are hard to collect in actual buildings. For example, the zone total internal total heat gain is difficult to obtain directly, and the district heating outlet temperature is useless according to CC. The selected features are indicated with stars in column Case I of Table 5.2.

Table 5.2. Scores of features evaluated by the RGS and CC selection methods. The stars indicate the features selected in each case.


5.4.2. Evaluation in experiments

New training and testing datasets are generated by eliminating the discarded features from the datasets used in the previous experiment. The model is then retrained on the new training data and applied to the testing data; the results are as follows: the MSE is 6.19 × 10⁻⁴ and the SCC is 0.97. To obtain a clear view of how the model performance changes before and after FS, we plot the measured and predicted daily consumption in Figure 5.1. The relative errors are within the range [−16%, 12%], as shown in Figure 5.2. We note that after FS, the number of features is 8, about one-third of the original set of 23 features. Compared to the results before FS, the model’s prediction ability remains very high, and the selected subset is therefore regarded as acceptable.
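
For reference, the two reported metrics can be computed as follows, assuming y_true holds the measured values and y_pred the model predictions:

    import numpy as np

    def mse(y_true, y_pred):
        """Mean squared error."""
        return np.mean((y_true - y_pred) ** 2)

    def scc(y_true, y_pred):
        """Squared correlation coefficient between measurements and predictions."""
        return np.corrcoef(y_true, y_pred)[0, 1] ** 2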


Figure 5.1. Comparison of measured and predicted daily electricity consumption for a particular building on working days, with feature selection performed


Figure 5.2. Relative error for the prediction

Four other subsets are formed in order to further evaluate whether the selected feature set is optimal. They are indicated by columns Case II, Case III, Case IV and Case V in Table 5.2. In Case II, we select the top eight features under the evaluation of CC alone, in order to check whether CC by itself is sufficient to select the best feature set; the zone total internal total heat gain feature is again ignored, as in Case I. In Case III, we replace three of the selected features with three unselected ones: outdoor air density, water mains temperature and zone mean air temperature, which are selected in Case I, are substituted with outdoor relative humidity, wind speed and district heating outlet temperature. In Case IV, all of the selected features are substituted with unselected ones, except the zone total internal total heat gain, which is not regarded as directly obtainable in practice. In the last case, the two features with the lowest scores, the number of occupants and the zone infiltration volume, are removed from the selected subset.

Based on these considerations, four new datasets are generated for both training and testing, and a model is retrained for each case. The results of all five cases are shown in Table 5.3. Two conclusions can be drawn. The first is that the designed FS method is valid, since the model performance in Case I outperforms the other cases. The second is that the SVR model with the RBF kernel has a stable performance, since high prediction accuracy is achieved on all of the subsets.

Table 5.3. Comparison of model performance on different feature sets. NF: Number of features, MSE: Mean squared error, SCC: Squared correlation coefficient

        Case I    Case II   Case III  Case IV   Case V
NF      8         8         8         14        6
MSE     6.2e-4    1.9e-3    7.5e-4    2.1e-3    9.2e-4
SCC     0.97      0.93      0.96      0.90      0.96

5.5. Model reduction for multiple buildings energy

Previously, we tested the FS method on one particular building’s consumption over 1 year. In this section, we investigate how the subset of features influences the model performance on multiple buildings’ consumption.

We choose the consumption data of 50 buildings in the winter season. The differences among these buildings mainly come from the weather conditions, the building structures and the number of occupants. We suppose these buildings are randomly distributed across five cities in France: Paris-Orly, Marseilles, Strasbourg, Bordeaux and Lyon. As shown in Figure 5.3, the five cities vary remarkably in ambient dry bulb temperature, so the datasets represent energy requirements under five typical weather conditions. The buildings have diverse characteristics with randomly generated length, width, height and window-to-wall area ratio. The number of occupants is determined by the ground area and the occupant density of the buildings. The time series data of these buildings are combined to form the training sets. One more building is simulated for model evaluation purposes.


Figure 5.3. Dry bulb temperature in the first 11 days of January. For a color version of the figure, see www.iste.co.uk/magoules/mining.zip

Two consumption datasets are designed: the first contains 20 buildings and the second contains all 50 buildings. To fully investigate how FS on these two datasets influences the SVR models, two kernels are used. In addition to the RBF kernel, we also test the performance of FS on SVR with a polynomial kernel, which is also applicable to nonlinear problems. The kernel parameter r is set to zero and d is estimated by fivefold cross-validation over the search space {2, 3, ..., 7}. The features selected for multiple buildings are the feature set for a single building plus the building structure features; FS for multiple buildings therefore reduces the number of features from 28 to 12. The changes in MSE and SCC on these datasets are shown in Table 5.4. For comparison, the results for the single building are given in the same table.
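
A sketch of this parameter search with scikit-learn is given below; the training data are synthetic placeholders, and scikit-learn’s coef0 and degree play the roles of r and d, respectively.

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 12))      # hypothetical multi-building training set
    y_train = rng.normal(size=500)

    # Polynomial kernel with r (coef0) fixed to zero; the degree d is chosen
    # by fivefold cross-validation over {2, ..., 7}
    search = GridSearchCV(SVR(kernel="poly", coef0=0.0),
                          param_grid={"degree": list(range(2, 8))},
                          cv=5)
    search.fit(X_train, y_train)
    print(search.best_params_)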

Table 5.4. Prediction results of support vector regression with two kernel methods on three data sets. BF: Before feature selection, AF: After feature selection, MSE: Mean squared error, SCC: Squared correlation coefficient


After FS, the accuracy of the prediction of the 50 buildings’ consumption improves significantly. For the 20 buildings’ consumption, the MSE increases to a certain extent, indicating a decrease in prediction accuracy. However, from the standpoint of SCC, the performance of the model with the RBF kernel is quite close to the situation without FS, as shown in Figure 5.4. With regard to the polynomial kernel, when training on the original datasets, the prediction ability of the model is just as good as with the RBF kernel, indicating that the polynomial kernel is also applicable to this problem. After adopting FS, the performance of the model improves in the case of 50 buildings, but decreases considerably in the case of 20 buildings. It seems that the polynomial kernel is not as stable as the RBF kernel when applied to such problems. However, both kernels perform better on the 50-building dataset than on the 20-building one. These observations indicate that the proposed FS approach gives better performance when more training samples are involved.

Another advantage of FS for statistical models is the reduction in training time. Figure 5.5 shows the time consumed for training the SVR models with the RBF kernel, plotted on a logarithmic scale. The training time after FS is clearly less than before FS, although the reduction is modest. This can be explained by the different parameter values assigned to the learning algorithm, which always have a great influence on the training speed. We note that the time spent choosing the predictor’s parameters via cross-validation is too long to be ignored when evaluating a learning algorithm. However, since this chapter focuses primarily on the influence of FS on the predictors, the labor and time for choosing model parameters are not considered here, as they are approximately the same before and after FS.


Figure 5.4. Comparison of model performance from the standpoint of SCC before and after feature selection for radial basis function kernel


Figure 5.5. Comparison of training time before and after FS for RBF kernel

5.6. Concluding remarks

This chapter introduces a new feature selection method for applying support vector regression to predict the energy consumption of office buildings.

To evaluate the proposed feature selection method, three datasets are first generated by EnergyPlus. They are the time series consumption of 1, 20 and 50 buildings, respectively. We assume that the developed models are to be applied to predict the energy requirements of actual buildings; therefore, the features are selected according to their feasibility in practice. To support the selection, we adopt two filter methods, the gradient-guided feature selection and the correlation coefficient, which give each feature a score according to its usefulness to the predictor. Extensive experiments show that the selected subset is valid and provides acceptable predictors. Performance improvements are achieved in some cases, e.g. the accuracy is remarkably enhanced for the models with either the radial basis function or the polynomial kernel on the 50 buildings’ data, and the time for model learning decreases to a certain extent. We also observe that the performance improves when more training samples are involved. Besides the radial basis function kernel, we showed that a polynomial kernel is also applicable to our application; however, it does not seem as stable as the radial basis function kernel, and it requires more preprocessing work since more kernel parameters need to be estimated.

This preliminary work on feature selection for building energy consumption has paved the way for further progress. It serves as a first guide for selecting an optimal subset of features when applying machine-learning methods to the prediction of building energy consumption.
