Useless columns

The first step is to remove columns that contain a unique value in every row. Typical examples are user IDs or transaction IDs, which identify a record rather than describe it. In our case, we identify them manually from the data description:

import com.packtpub.mmlwspark.utils.Tabulizer.table
val idColumns = Seq("id", "member_id")
println(s"Columns with Ids: ${table(idColumns, 4, None)}")

The output is as follows:

The next step is to identify useless columns, such as the following:

  • Constant columns
  • Bad columns (containing only missing values)
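To make the two criteria concrete, here is a minimal plain-Scala sketch that does not depend on H2O. It is only an illustration under stated assumptions: the `UselessColumns` object, the `Column` type, and the `sample` data are hypothetical, a column is modeled as a sequence of optional numbers with `None` marking a missing value, and a column with one distinct observed value plus some missing values is treated as constant.

```scala
// Hypothetical in-memory stand-in for a data frame (not the H2O API):
// each column is a name paired with its cells, where None marks a missing value.
object UselessColumns {
  type Column = Seq[Option[Double]]

  // A "bad" column contains no observed value at all.
  def isBad(col: Column): Boolean = col.forall(_.isEmpty)

  // A constant column has exactly one distinct observed value
  // (assumption: missing cells are ignored when counting).
  def isConst(col: Column): Boolean = col.flatten.distinct.size == 1

  // Names of the columns that are constant or bad, in frame order.
  def uselessColumns(frame: Seq[(String, Column)]): Seq[String] =
    frame.collect { case (name, col) if isConst(col) || isBad(col) => name }

  // Made-up sample data, for illustration only.
  val sample: Seq[(String, Column)] = Seq(
    ("constant", Seq(Some(1.0), Some(1.0), Some(1.0))),
    ("empty",    Seq(None, None, None)),
    ("varying",  Seq(Some(1.0), Some(2.0), None))
  )
}
```

Calling `UselessColumns.uselessColumns(UselessColumns.sample)` keeps only the names `constant` and `empty`. The two checks are kept separate because a column with no observed values has no distinct value to count, so the constant test alone would not flag it.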

The following code will help us do so:

val constantColumns = loanDataHf.names().indices
  .filter(idx => loanDataHf.vec(idx).isConst || loanDataHf.vec(idx).isBad)
  .map(idx => loanDataHf.name(idx))
println(s"Constant and bad columns: ${table(constantColumns, 4, None)}")

The output is as follows:
