However, all is not lost with feature selection, and I want to take some space to show you a quick way to begin exploring this matter. It will require some trial and error on your part. Again, the caret
package helps out here, as it will run a cross-validation on a linear SVM based on the kernlab
package.
To do this, we will need to set the random seed, specify the cross-validation method in caret's rfeControl()
function, perform a recursive feature selection with the rfe()
function, and then test how the model performs on the test
set. In rfeControl()
, you will need to specify the function based on the model being used. There are several different functions that you can use. Here we will need lrFuncs
. To see a list of the available functions, your best bet is to explore the documentation with ?rfeControl
and ?caretFuncs
. The code for this example is as follows:
> set.seed(123)
> rfeCNTL = rfeControl(functions = lrFuncs, method = "cv", number = 10)
> svm.features = rfe(train[, 1:7], train[, 8], sizes = c(7, 6, 5, 4),
    rfeControl = rfeCNTL, method = "svmLinear")
To create the svm.features
object, it was important to specify the inputs, the response factor, the number of input features via sizes
, and the linear method from kernlab
, selected with the svmLinear
syntax. Other options are available with this method, such as svmPoly
. No method for a sigmoid kernel is available. Calling the object allows us to see how the various feature sizes perform, as follows:
> svm.features
Recursive feature selection

Outer resampling method: Cross-Validated (10 fold)

Resampling performance over subset size:

 Variables Accuracy  Kappa AccuracySD KappaSD Selected
         4   0.7797 0.4700    0.04969  0.1203
         5   0.7875 0.4865    0.04267  0.1096        *
         6   0.7847 0.4820    0.04760  0.1141
         7   0.7822 0.4768    0.05065  0.1232

The top 5 variables (out of 5):
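To read the resampling table, note that caret stars the subset size with the best cross-validated accuracy. A quick sanity check of those numbers, copied from the output above, confirms the five-feature subset wins:

```r
# Cross-validated accuracy by subset size, taken from the rfe output above
acc <- c(`4` = 0.7797, `5` = 0.7875, `6` = 0.7847, `7` = 0.7822)
names(which.max(acc))  # "5": the starred five-feature subset
```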
Counter-intuitive as it may seem, the five variables perform quite well on their own, slightly better than when skin
and bp
are included. Let's try this out on the test
set, remembering that the accuracy in the full model was 76.2 percent:
> svm.5 <- svm(type ~ glu + ped + npreg + bmi + age, data = train,
    kernel = "linear")
> svm.5.predict <- predict(svm.5, newdata = test[c(1, 2, 5, 6, 7)])
> table(svm.5.predict, test$type)

svm.5.predict No Yes
          No  79  21
          Yes 14  33
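As a quick check, the reduced model's accuracy can be computed directly from the confusion matrix printed above (predictions in rows, truth in columns):

```r
# Rebuild the confusion matrix from the table above and compute accuracy:
# correct predictions sit on the diagonal
conf <- matrix(c(79, 14, 21, 33), nrow = 2,
               dimnames = list(svm.5.predict = c("No", "Yes"),
                               type = c("No", "Yes")))
accuracy <- sum(diag(conf)) / sum(conf)
round(accuracy * 100, 1)  # 76.2
```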
This performed no better than the full model, so we can stick with the full model. You can see through trial and error how this technique can be used to get a simple sense of feature importance. If you want to explore other techniques and methods that you can apply here, and for blackbox techniques in particular, I recommend that you start by reading the work by Guyon and Elisseeff (2003) on this subject.