Comparing classification methods

In this chapter we have examined classification using logistic regression, support vector machines, and gradient boosted decision trees. In what scenarios should we prefer one algorithm over another?

For logistic regression, the data should ideally be linearly separable (the exponent in the logistic regression formula, after all, is essentially the same as the SVM equation for a separating hyperplane). If our goal is inference (quantifying the change in response per one-unit increase in an input measurement, as we described in Chapter 1, From Data to Decisions – Getting Started with Analytic Applications), then the coefficients and log-odds values will be helpful. Stochastic gradient descent is also useful in cases where we are unable to process all of the data concurrently, while the second-order methods we discussed may be easier to apply to un-normalized data. Finally, when it comes to serializing model parameters and using them to score new data, logistic regression is attractive because the fitted model is simply a vector of numbers and is thus easily stored.
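As a concrete illustration of these points, the following minimal sketch (assuming scikit-learn, not code taken from the chapter's examples) reads the fitted coefficients as log-odds, streams the same data through a stochastic gradient learner in chunks, and persists the model as nothing more than an array of numbers. The simulated dataset, chunking scheme, and the logit_weights.npy file name are arbitrary choices for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Simulated data standing in for the chapter's examples
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Batch fit: each coefficient is the change in the log-odds of the positive
# class per one-unit increase in the corresponding input
logit = LogisticRegression().fit(X, y)
print("log-odds coefficients:", logit.coef_)
print("odds ratios:", np.exp(logit.coef_))

# Stochastic gradient fit: partial_fit streams the data in chunks when we
# cannot process it all concurrently (older scikit-learn releases spell the
# loss as "log" rather than "log_loss")
sgd = SGDClassifier(loss="log_loss", random_state=0)
for chunk in np.array_split(np.arange(len(X)), 10):
    sgd.partial_fit(X[chunk], y[chunk], classes=np.unique(y))

# The fitted model is just a vector of numbers, so serializing it for later
# scoring is trivial
weights = np.concatenate([logit.intercept_, logit.coef_.ravel()])
np.save("logit_weights.npy", weights)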

Support vector machines, as we discussed, can accommodate complex nonlinear boundaries between classes. They can also be used on data without a natural vector representation, or on data points of different lengths, making them quite flexible. However, they require more computational resources both for fitting and for scoring.
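The sketch below, again assuming scikit-learn rather than reproducing the chapter's code, illustrates both points: an RBF kernel separates data that no single hyperplane could, and a precomputed kernel matrix lets an SVM operate on variable-length strings with no fixed-length vector representation. The string_kernel function and the toy documents are hypothetical examples for illustration, not part of any library.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Nonlinear boundary: concentric circles are not linearly separable, but an
# RBF kernel handles them easily
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
rbf_svm = SVC(kernel="rbf").fit(X, y)
print("RBF SVM training accuracy:", rbf_svm.score(X, y))

# Non-vector data: a toy similarity over variable-length strings, equal to
# the inner product of their character-indicator vectors, so it is a valid
# (positive semi-definite) kernel
docs = ["spam spam spam", "cheap pills", "meeting at noon", "lunch tomorrow?"]
labels = np.array([1, 1, 0, 0])

def string_kernel(a, b):
    # Number of distinct characters the two strings share
    return len(set(a) & set(b))

K = np.array([[string_kernel(a, b) for b in docs] for a in docs], dtype=float)
string_svm = SVC(kernel="precomputed").fit(K, labels)
print("predictions on the training kernel:", string_svm.predict(K))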

Gradient boosted decision trees can also fit nonlinear boundaries, but only of a certain kind. Consider that a decision tree splits the dataset into two groups at each decision node, and that each split examines a single feature against a single threshold. The resulting boundaries are therefore a series of axis-aligned hyperplanes in the m-dimensional space of the dataset: the model can only cut along one dimension at a time, and only in a straight line. These piecewise planes will not necessarily capture the smooth nonlinearity possible with the SVM, but if the data can be separated in this piecewise fashion, a GBM may perform well.
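To make the axis-aligned nature of these splits concrete, the short sketch below (assuming scikit-learn's GradientBoostingClassifier as a stand-in for the chapter's GBM) fits an ensemble and prints the splits of its first tree; every internal node tests one feature against one threshold.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=4, random_state=0)
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
gbm.fit(X, y)

# Inspect the first tree in the ensemble: each internal node tests a single
# feature against a single threshold, so the implied boundary is built from
# axis-aligned cuts
first_tree = gbm.estimators_[0, 0].tree_
for node in range(first_tree.node_count):
    if first_tree.children_left[node] != -1:  # internal (splitting) node
        print("node %d: split on feature %d at threshold %.3f"
              % (node, first_tree.feature[node], first_tree.threshold[node]))

Each line of output names a single feature index and a single cut point, which is exactly the piecewise, axis-aligned structure described above.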

The flowchart below gives a general overview for choosing among the classification methods we have discussed. Also, keep in mind that the Random Forest algorithm we discussed in Chapter 4, Connecting the Dots with Models – Regression Methods may also be applied to classification, while the SVM and GBM models described in this chapter have forms that may be applied to regression.

Figure: Comparing classification methods