As a rule of thumb, linear separability is the main criterion when choosing the kernel to start with. However, most of the time it is difficult to identify, unless you have sufficient prior knowledge of the dataset or its features are of low dimension (one to three).
Now, let's look at the following three scenarios where the linear kernel is favored over RBF:
Scenario 1: Both the number of features and the number of instances are large (more than 10⁴ or 10⁵). Since the dimension of the feature space is already high, the additional features resulting from the RBF transformation will not provide any performance improvement, but will increase computational expense. Some examples from the UCI machine learning repository are of this type:
- URL Reputation Dataset: https://archive.ics.uci.edu/ml/datasets/URL+Reputation (number of instances: 2,396,130; number of features: 3,231,961). This is designed for malicious URL detection based on their lexical and host information.
- YouTube Multiview Video Games Dataset: https://archive.ics.uci.edu/ml/datasets/YouTube+Multiview+Video+Games+Dataset (number of instances: 120,000; number of features: 1,000,000). This is designed for topic classification.
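For data of this shape, a linear SVM is the practical choice. The following is a minimal sketch, assuming scikit-learn is available; the synthetic data generated by `make_classification` is a small stand-in for the millions of sparse features in the real datasets above:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic stand-in for a large, high-dimensional dataset (scaled down
# so it runs quickly; the real datasets have millions of features)
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=50, random_state=42)

# LinearSVC uses the liblinear solver, which handles high-dimensional
# data efficiently without computing any kernel matrix
clf = LinearSVC(max_iter=10000, random_state=42)
clf.fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.2f}")
```

Training a kernelized `SVC` on the same data would require an n-by-n kernel matrix, which becomes prohibitive as the number of instances grows.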
Scenario 2: The number of features is noticeably large compared to the number of training samples. Apart from the reasons stated in Scenario 1, the RBF kernel is significantly more prone to overfitting in this setting. Such a scenario occurs, for example, in the following datasets:
- Dorothea Dataset: https://archive.ics.uci.edu/ml/datasets/Dorothea (number of instances: 1,950; number of features: 100,000). This is designed for drug discovery that classifies chemical compounds as active or inactive according to their structural molecular features.
- Arcene Dataset: https://archive.ics.uci.edu/ml/datasets/Arcene (number of instances: 900; number of features: 10,000). This represents a mass-spectrometry dataset for cancer detection.
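One way to see the overfitting risk is to compare training accuracy against cross-validated accuracy. The following is a hedged sketch, assuming scikit-learn; it uses synthetic data with far more features than samples and deliberately random labels, so any apparent fit on the training set is spurious:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in: far more features than samples, labels are pure noise
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 5000))
y = rng.integers(0, 2, size=100)

clf = SVC(kernel='rbf')
train_acc = clf.fit(X, y).score(X, y)             # accuracy on the training set
cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # estimate of generalization
print(f"train: {train_acc:.2f}, cross-validation: {cv_acc:.2f}")
# A large gap between the two scores is a symptom of overfitting
```

Since the labels carry no signal, cross-validated accuracy hovers around chance level while the RBF model can still fit the training set, illustrating why the linear kernel is safer when features vastly outnumber samples.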
Scenario 3: The number of instances is significantly large compared to the number of features. For a dataset of low dimension, the RBF kernel will, in general, boost performance by mapping it to a higher-dimensional space. However, due to its training complexity, it usually becomes inefficient on a training set with more than 10⁶ or 10⁷ samples. Example datasets include the following:
- Heterogeneity Activity Recognition Dataset: https://archive.ics.uci.edu/ml/datasets/Heterogeneity+Activity+Recognition (number of instances: 43,930,257; number of features: 16). This is designed for human activity recognition.
- HIGGS Dataset: https://archive.ics.uci.edu/ml/datasets/HIGGS (number of instances: 11,000,000; number of features: 28). This is designed to distinguish between a signal process that produces Higgs bosons and a background process that does not.
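When instances vastly outnumber features, a linear model trained by stochastic gradient descent scales roughly linearly with the number of samples. The following is a minimal sketch, assuming scikit-learn; the synthetic data mimics a HIGGS-like shape (many rows, few columns), scaled down to run quickly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in: many instances, few features (HIGGS-like shape)
X, y = make_classification(n_samples=100_000, n_features=28,
                           n_informative=20, random_state=0)

# loss='hinge' makes SGDClassifier a linear SVM trained by stochastic
# gradient descent, whose cost grows linearly with the number of samples
clf = SGDClassifier(loss='hinge', random_state=0)
clf.fit(X, y)
print(f"Training accuracy: {clf.score(X, y):.2f}")
```

By contrast, a kernelized `SVC` has training complexity between quadratic and cubic in the number of samples, which is why it stops being practical beyond roughly 10⁶ instances.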
Aside from these three scenarios, RBF is ordinarily the first choice.
The rules for choosing between the linear and RBF kernels can be summarized as follows:
- If both the number of features and the number of instances are large, the linear kernel is the first choice.
- If the number of features is noticeably large compared to the number of instances, the linear kernel is the first choice.
- If the number of instances is significantly large compared to the number of features, the linear kernel is the first choice.
- In all other cases, the RBF kernel is the first choice.
Once again, first choice means that we can begin with this option; it does not mean that it is the only option moving forward.
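The decision rules above can be sketched as a small helper function. This is a hypothetical illustration: the function name and the thresholds are assumptions drawn from the three scenarios, not hard limits.

```python
def suggest_kernel(n_instances, n_features):
    """Suggest a starting kernel following the three scenarios above.

    Thresholds are illustrative guidelines, not hard limits.
    """
    # Scenario 1: both the feature and instance counts are large -> linear
    if n_features >= 10**4 and n_instances >= 10**4:
        return 'linear'
    # Scenario 2: features noticeably outnumber instances -> linear
    if n_features >= 10 * n_instances:
        return 'linear'
    # Scenario 3: more instances than RBF training can handle -> linear
    if n_instances >= 10**6:
        return 'linear'
    # Otherwise, RBF is ordinarily the first choice
    return 'rbf'

print(suggest_kernel(1950, 100_000))  # Dorothea-like shape -> linear
print(suggest_kernel(5000, 30))       # low-dimensional, modest size -> rbf
```

Whichever kernel the rules suggest, it remains only a starting point to be validated, for example by cross-validation.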