The rule of thumb is, of course, linear separability. However, this is usually very difficult to determine, unless you have sufficient prior knowledge or the features are of low dimension (1 to 3).
From prior knowledge, we know that text data is often linearly separable, while data generated by the XOR function is not, as the quick sketch below demonstrates. We will then look at three scenarios where the linear kernel is favored over RBF.
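As a brief illustration (the synthetic data and variable names here are purely for demonstration), the following sketch fits an SVC with each kernel on XOR-labeled points. No separating hyperplane exists for such data, so the linear kernel can do no better than chance, while RBF handles it easily:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(200, 2))          # 2-D points in [-1, 1]^2
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)   # XOR labels

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X, y)
    # Expect roughly 0.5 training accuracy for linear (no separating
    # hyperplane exists) and close to 1.0 for RBF
    print(kernel, clf.score(X, y))
```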
Case 1: both the number of features and the number of instances are large (more than 10⁴ or 10⁵). Since the dimension of the feature space is already high enough, the additional features resulting from the RBF transformation will not provide any performance improvement, but will increase computational expense (see the sketch after the following examples). Some examples from the UCI Machine Learning Repository are of this type:
- URL Reputation Data Set: https://archive.ics.uci.edu/ml/datasets/URL+Reputation (number of instances: 2,396,130; number of features: 3,231,961) for malicious URL detection based on lexical and host information
- YouTube Multiview Video Games Data Set: https://archive.ics.uci.edu/ml/datasets/YouTube+Multiview+Video+Games+Dataset (number of instances: 120,000; number of features: 1,000,000) for topic classification
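As a hedged sketch of this case (synthetic sparse data with illustrative sizes, standing in for the UCI sets above), the snippet below trains scikit-learn's LinearSVC, which never materializes a kernel matrix and scales roughly linearly with the data, on a large, high-dimensional problem:

```python
import numpy as np
import scipy.sparse as sp
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n_samples, n_features = 10_000, 100_000   # illustrative stand-ins for Case 1 scales
X = sp.random(n_samples, n_features, density=1e-4, format='csr', random_state=rng)
w = rng.randn(n_features)                 # hidden linear rule generating the labels
y = (X @ w > 0).astype(int)

# LinearSVC (liblinear) handles sparse input directly; a kernelized SVC would
# need an n_samples x n_samples kernel matrix, which quickly becomes prohibitive
clf = LinearSVC(dual=True).fit(X, y)      # dual solver suits n_features > n_samples
print(clf.score(X, y))
```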
Case 2: the number of features is noticeably large compared to the number of training samples. Apart from the reasons stated in Case 1, the RBF kernel is significantly more prone to overfitting in this setting (see the sketch after the following examples). Such a scenario occurs in, for example:
- Dorothea Data Set: https://archive.ics.uci.edu/ml/datasets/Dorothea (number of instances: 1,950; number of features: 100,000) for drug discovery, classifying chemical compounds as active or inactive based on structural molecular features
- Arcene Data Set: https://archive.ics.uci.edu/ml/datasets/Arcene (number of instances: 900; number of features: 10,000), a mass-spectrometry dataset for cancer detection
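The following hedged sketch (synthetic data with hypothetical sizes, not the Dorothea or Arcene sets) mimics this regime: far more features than samples, with only a handful of informative ones. The cross-validated score for the linear kernel typically comes out higher, as the RBF kernel has more capacity to fit the noise:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples, n_features = 100, 10_000            # features >> samples
X = rng.randn(n_samples, n_features)
w = np.zeros(n_features)
w[:10] = 1.0                                   # only 10 features carry signal
y = (X @ w + 0.5 * rng.randn(n_samples) > 0).astype(int)

for kernel in ('linear', 'rbf'):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    # The linear kernel usually generalizes better here; RBF tends to overfit
    print(kernel, scores.mean())
```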
Case 3: the number of instances is significantly large compared to the number of features. For a dataset of low dimension, the RBF kernel will, in general, boost performance by mapping the data to a higher-dimensional space. However, due to its training complexity, it is usually no longer efficient on a training set with more than 10⁶ or 10⁷ samples (see the sketch after the following examples). Some example datasets include:
- Heterogeneity Activity Recognition Data Set: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition (number of instances: 43,930,257; number of features: 16) for human activity recognition
- HIGGS Data Set: https://archive.ics.uci.edu/ml/datasets/HIGGS (number of instances: 11,000,000; number of features: 28) for distinguishing between a signal process that produces Higgs bosons and a background process that does not
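As a rough sketch of Case 3 (synthetic data; the sizes are illustrative and scaled down from the datasets above), recall that kernelized SVC training scales somewhere between quadratically and cubically with the number of samples, so for millions of low-dimensional instances a linear solver is the practical choice:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 200,000 samples x 16 features: the shape of a Case 3 problem, scaled down
# from the tens of millions of rows in the datasets above
X, y = make_classification(n_samples=200_000, n_features=16, random_state=0)

# The primal solver (dual=False) is recommended when n_samples >> n_features;
# fitting SVC(kernel='rbf') on data of this size would take far longer
clf = LinearSVC(dual=False).fit(X, y)
print(clf.score(X, y))
```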
Aside from these three cases, RBF is practically the first choice.
The rules for choosing between the linear and RBF kernels can be summarized as follows:
| Case | Linear | RBF |
| --- | --- | --- |
| Expert prior knowledge | If linearly separable | If nonlinearly separable |
| Visualizable data of 1 to 3 dimensions | If linearly separable | If nonlinearly separable |
| Both numbers of features and instances are large | First choice | |
| Features ≫ Instances | First choice | |
| Instances ≫ Features | First choice | |
| Others | | First choice |
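For convenience, here is a hypothetical helper that encodes the table above as code. The thresholds mirror the rough 10⁴ to 10⁷ figures quoted earlier; they are rules of thumb, not part of any library API:

```python
def choose_kernel(n_instances, n_features, known_separability=None):
    """Suggest 'linear' or 'rbf' per the preceding rules of thumb."""
    if known_separability == 'linear':       # expert knowledge or visualized data
        return 'linear'
    if known_separability == 'nonlinear':
        return 'rbf'
    if n_features >= 1e4 and n_instances >= 1e4:
        return 'linear'                      # Case 1: both large
    if n_features >= 10 * n_instances:       # Case 2: features >> instances
        return 'linear'                      # (factor of 10 is a hypothetical cutoff)
    if n_instances >= 1e6:                   # Case 3: instances >> features
        return 'linear'
    return 'rbf'                             # otherwise, RBF is the first choice

print(choose_kernel(n_instances=11_000_000, n_features=28))  # HIGGS -> 'linear'
print(choose_kernel(n_instances=5_000, n_features=30))       # small, low-dim -> 'rbf'
```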