A one-hot encoding maps a column of label indices to a column of binary vectors, with at most a single value. This encoding allows algorithms that expect continuous features, such as Logistic Regression, to use categorical features. Suppose you have some categorical data in the following format (the same that we used for describing the StringIndexer in the previous section):
Now, we want to index the name column so that the most frequent name in the dataset (that is, Jason in our case) gets index 0. However, what's the use of just indexing them? In other words, you can further vectorize them and then you can feed the DataFrame to any ML models easily. Since we have already seen how to create a DataFrame in the previous section, here, we will just show how to encode them toward Vectors:
val indexer = new StringIndexer()
.setInputCol("name")
.setOutputCol("categoryIndex")
.fit(df)
val indexed = indexer.transform(df)
val encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec")
Now let's transform it into a vector using Transformer and then see the contents, as follows:
val encoded = encoder.transform(indexed)
encoded.show()
The resulting DataFrame containing a snap is as follows:
Now you can see that a new column containing feature vectors has been added in the resulting DataFrame.