The rule of thumb is, of course, linear separability. However, this is usually very difficult to determine, unless you have sufficient prior knowledge or the features are of low dimension (1 to 3).
From prior knowledge, we know that text data is often linearly separable, while data generated by the XOR function is not, as the quick sketch below demonstrates. We will then look at three scenarios where the linear kernel is favored over RBF.
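As a brief illustration (the synthetic data and variable names here are purely for demonstration), the following sketch fits an SVC with each kernel on XOR-labeled points. No separating hyperplane exists for such data, so the linear kernel can do no better than chance, while RBF handles it easily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(200, 2))          # 2-D points in [-1, 1]^2
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)   # XOR labels

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X, y)
    # Expect roughly 0.5 training accuracy for linear (no separating
    # hyperplane exists) and close to 1.0 for RBF
    print(kernel, clf.score(X, y))
```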
Case 1: both the number of features and the number of instances are large (more than 10⁴ or 10⁵). Since the dimension of the feature space is already high enough, the additional features resulting from the RBF transformation will not provide any performance improvement, but will increase computational expense (see the sketch after the following examples). Some examples from the UCI Machine Learning Repository are of this type:
- URL Reputation Data Set: https://archive.ics.uci.edu/ml/datasets/URL+Reputation (number of instances: 2,396,130; number of features: 3,231,961) for malicious URL detection based on lexical and host information
- YouTube Multiview Video Games Data Set: https://archive.ics.uci.edu/ml/datasets/YouTube+Multiview+Video+Games+Dataset (number of instances: 120,000; number of features: 1,000,000) for topic classification
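As a hedged sketch of this case (synthetic sparse data with illustrative sizes, standing in for the UCI sets above), the snippet below trains scikit-learn's LinearSVC, which never materializes a kernel matrix and scales roughly linearly with the data, on a large, high-dimensional problem:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n_samples, n_features = 10_000, 100_000   # illustrative stand-ins for Case 1 scales
X = sp.random(n_samples, n_features, density=1e-4, format='csr', random_state=rng)
w = rng.randn(n_features)                 # hidden linear rule generating the labels
y = (X @ w > 0).astype(int)

# LinearSVC (liblinear) handles sparse input directly; a kernelized SVC would
# need an n_samples x n_samples kernel matrix, which quickly becomes prohibitive
clf = LinearSVC(dual=True).fit(X, y)      # dual solver suits n_features > n_samples
print(clf.score(X, y))
```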
Case 2: the number of features is noticeably large compared to the number of training samples. Apart from the reasons stated in Case 1, the RBF kernel is significantly more prone to overfitting in this setting (see the sketch after the following examples). Such a scenario occurs in, for example:
- Dorothea Data Set: https://archive.ics.uci.edu/ml/datasets/Dorothea (number of instances: 1,950; number of features: 100,000) for drug discovery, classifying chemical compounds as active or inactive based on structural molecular features
- Arcene Data Set: https://archive.ics.uci.edu/ml/datasets/Arcene (number of instances: 900; number of features: 10,000), a mass-spectrometry dataset for cancer detection
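The following hedged sketch (synthetic data with hypothetical sizes, not the Dorothea or Arcene sets) mimics this regime: far more features than samples, with only a handful of informative ones. The cross-validated score for the linear kernel typically comes out higher, as the RBF kernel has more capacity to fit the noise:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples, n_features = 100, 10_000            # features >> samples
X = rng.randn(n_samples, n_features)
w = np.zeros(n_features)
w[:10] = 1.0                                   # only 10 features carry signal
y = (X @ w + 0.5 * rng.randn(n_samples) > 0).astype(int)

for kernel in ('linear', 'rbf'):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    # The linear kernel usually generalizes better here; RBF tends to overfit
    print(kernel, scores.mean())
```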
Case 3: the number of instances is significantly large compared to the number of features. For a dataset of low dimension, the RBF kernel will, in general, boost performance by mapping the data to a higher-dimensional space. However, due to its training complexity, it is usually no longer efficient on a training set with more than 10⁶ or 10⁷ samples (see the sketch after the following examples). Some example datasets include:
- Heterogeneity Activity Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition (number of instances: 43,930,257; number of features: 16) for human activity recognition
- HIGGS Data Set: https://archive.ics.uci.edu/ml/datasets/HIGGS (number of instances: 11,000,000; number of features: 28) for distinguishing between a signal process that produces Higgs bosons and a background process that does not
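As a rough sketch of Case 3 (synthetic data; the sizes are illustrative and scaled down from the datasets above), recall that kernelized SVC training scales somewhere between quadratically and cubically with the number of samples, so for millions of low-dimensional instances a linear solver is the practical choice:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 200,000 samples x 16 features: the shape of a Case 3 problem, scaled down
# from the tens of millions of rows in the datasets above
X, y = make_classification(n_samples=200_000, n_features=16, random_state=0)

# The primal solver (dual=False) is recommended when n_samples >> n_features;
# fitting SVC(kernel='rbf') on data of this size would take far longer
clf = LinearSVC(dual=False).fit(X, y)
print(clf.score(X, y))
```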
Aside from these three cases, RBF is practically the first choice.
The rules for choosing between the linear and RBF kernels can be summarized as follows:
| Case | Linear | RBF |
| --- | --- | --- |
| Expert prior knowledge | If linearly separable | If nonlinearly separable |
| Visualizable data of 1 to 3 dimensions | If linearly separable | If nonlinearly separable |
| Both numbers of features and instances are large | First choice | |
| Features ≫ Instances | First choice | |
| Instances ≫ Features | First choice | |
| Others | | First choice |
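For convenience, here is a hypothetical helper that encodes the table above as code. The thresholds mirror the rough 10⁴ to 10⁷ figures quoted earlier; they are rules of thumb, not part of any library API:

```python
def choose_kernel(n_instances, n_features, known_separability=None):
    """Suggest 'linear' or 'rbf' per the preceding rules of thumb."""
    if known_separability == 'linear':       # expert knowledge or visualized data
        return 'linear'
    if known_separability == 'nonlinear':
        return 'rbf'
    if n_features >= 1e4 and n_instances >= 1e4:
        return 'linear'                      # Case 1: both large
    if n_features >= 10 * n_instances:       # Case 2: features >> instances
        return 'linear'                      # (factor of 10 is a hypothetical cutoff)
    if n_instances >= 1e6:                   # Case 3: instances >> features
        return 'linear'
    return 'rbf'                             # otherwise, RBF is the first choice

print(choose_kernel(n_instances=11_000_000, n_features=28))  # HIGGS -> 'linear'
print(choose_kernel(n_instances=5_000, n_features=30))       # small, low-dim -> 'rbf'
```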