Scaling and centering

For some machine learning algorithms, it is preferable to transform not only the categorical variables (using one-hot encoding, discussed previously) but also the continuous variables. Recall from Chapter 1, Introduction to Healthcare Analytics, that a continuous variable is numerical and can take on any real value within its range (although in many cases it is restricted to integers). A particularly common practice is to standardize each continuous variable so that its mean is zero and its standard deviation is one. Take the AGE variable, for example: it typically ranges from 0 to about 100. Suppose that for a particular population, the mean of our AGE variable is 40 with a standard deviation of 20. If we were to center and rescale AGE, a person aged 40 would be represented as zero in the transformed variable. A person aged 20 would be represented as -1, a person aged 60 as 1, a person aged 80 as 2, and a person aged 50 as 0.5. This transformation prevents variables with larger ranges from dominating variables with smaller ranges in the machine learning algorithm.
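The arithmetic above can be sketched directly. Here, the mean of 40 and standard deviation of 20 are the assumed population values from the example, not values estimated from data:

```python
import numpy as np

# Assumed population parameters from the example above
mean_age = 40.0
std_age = 20.0

ages = np.array([40.0, 20.0, 60.0, 80.0, 50.0])

# Standardize: subtract the mean, then divide by the standard deviation
standardized = (ages - mean_age) / std_age
print(standardized)  # [ 0.  -1.   1.   2.   0.5]
```

In practice, the mean and standard deviation are estimated from the training data rather than assumed, which is exactly what scikit-learn's scaler classes do.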

scikit-learn has many built-in classes and functions for centering and scaling data, including sklearn.preprocessing.StandardScaler(), sklearn.preprocessing.MinMaxScaler(), and sklearn.preprocessing.RobustScaler(). These various tools are specialized for centering and scaling different types of continuous data, such as normally distributed variables, or variables that have many outliers.
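A minimal sketch of the three classes applied to the same toy column (the values, including the outlier of 200, are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# A single continuous column with one outlier (200)
ages = np.array([[20.0], [40.0], [60.0], [80.0], [200.0]])

# StandardScaler: subtracts the mean and divides by the standard deviation,
# so the result has mean 0 and standard deviation 1
standardized = StandardScaler().fit_transform(ages)

# MinMaxScaler: rescales the column to the [0, 1] range
min_maxed = MinMaxScaler().fit_transform(ages)

# RobustScaler: centers on the median and scales by the interquartile
# range, so the outlier has far less influence on the result
robust = RobustScaler().fit_transform(ages)

print(standardized.mean(), standardized.std())  # ~0.0, 1.0
print(min_maxed.min(), min_maxed.max())         # 0.0, 1.0
```

Each scaler follows the usual scikit-learn pattern: fit on the training data (to learn the statistics), then transform both the training and test sets with the same fitted object.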

For details on how to use the scaling classes, see the scikit-learn documentation: http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling.
