Let's do some (model) training!

At this point, we have a numeric representation of the textual data that captures the structure of the reviews in a simple way. Now it is time to build models. First, we select the columns we need for training and split the resulting dataset. We keep the generated row_id column in the dataset, not as an input feature, but only as a simple unique row identifier:

val splits = tfIdfTokens.select("row_id", "label", idf.getOutputCol).randomSplit(Array(0.7, 0.1, 0.1, 0.1), seed = 42)
val (trainData, testData, transferData, validationData) = (splits(0), splits(1), splits(2), splits(3))
Seq(trainData, testData, transferData, validationData).foreach(_.cache())

Notice that we have created four different subsets of our data: a training dataset, testing dataset, transfer dataset, and a final validation dataset. The transfer dataset will be explained later on in the chapter, but everything else should appear very familiar to you already from the previous chapters.

Also, the cache call is important: most of the algorithms will query the data iteratively, and we want to avoid re-evaluating all of the data preparation operations on every pass.
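To get a feel for what the split and the cache call do, the following is a minimal sketch on a toy dataset rather than on the real tfIdfTokens data. The names toyData, toySplits, and coveredRows are hypothetical and used only for illustration, and the local SparkSession is an assumption (in spark-shell, a session named spark already exists). Note that randomSplit treats the weights as approximate proportions, so the actual subset sizes will only be close to 70/10/10/10, and that cache is lazy: nothing is stored until the first action, such as count, runs.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; getOrCreate reuses an existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("split-sketch")
  .getOrCreate()

// Hypothetical stand-in for the prepared dataset: 10,000 rows with only a row_id column.
val toyData = spark.range(0, 10000).toDF("row_id")

// Same weights and seed as in the text above.
val weights = Array(0.7, 0.1, 0.1, 0.1)
val toySplits = toyData.randomSplit(weights, seed = 42)

// cache() only marks the subsets for caching; the count() calls below materialize them.
toySplits.foreach(_.cache())

val total = toyData.count().toDouble
val names = Seq("train", "test", "transfer", "validation")

// Compare the requested weight of each subset with the fraction of rows it received.
names.zip(toySplits).zip(weights).foreach { case ((name, split), weight) =>
  val fraction = split.count() / total
  println(f"$name%-10s requested=$weight%.2f actual=$fraction%.3f")
}

// Every row of this deterministic toy dataset should land in exactly one subset.
val coveredRows = toySplits.map(_.count()).sum
```

Running the same comparison on trainData, testData, transferData, and validationData is a cheap sanity check before training, and it has the side effect of populating the cache.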
