Useless columns

The first step is to remove columns whose values are unique for every row. Typical examples are user IDs or transaction IDs; such columns identify a record but carry no pattern a model can generalize from. In our case, we will identify them manually based on the data description:

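// Helper from the book's utilities that renders a sequence as a text table
import com.packtpub.mmlwspark.utils.Tabulizer.table
// ID columns, picked manually from the data description
val idColumns = Seq("id", "member_id")
println(s"Columns with Ids: ${table(idColumns, 4, None)}")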

The output is a table listing the two ID columns, id and member_id.
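A manual selection like this can be sanity-checked by testing whether a column really holds a distinct value in every row. The following is a minimal sketch in plain Scala, not the book's code: the toy data in toyColumns and the helper isUniquePerRow are hypothetical names introduced only for illustration, and the check runs on in-memory sequences rather than on the loan data frame.

```scala
// Hypothetical toy data: column name -> values, one value per row.
val toyColumns: Map[String, Seq[String]] = Map(
  "id"        -> Seq("1001", "1002", "1003", "1004"),
  "member_id" -> Seq("77", "78", "79", "80"),
  "grade"     -> Seq("A", "B", "A", "C")
)

// A column is ID-like when it is non-empty and no value repeats.
def isUniquePerRow(values: Seq[String]): Boolean =
  values.nonEmpty && values.distinct.size == values.size

// Collect the names of ID-like columns, sorted for a stable result.
val idLike: Seq[String] = toyColumns.collect {
  case (name, values) if isUniquePerRow(values) => name
}.toSeq.sorted

println(s"ID-like columns: ${idLike.mkString(", ")}")
// prints "ID-like columns: id, member_id"
```

A check like this only flags candidates: a continuous numeric feature can also be unique in every row, so the data description remains the deciding source.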

The next step is to identify useless columns, such as the following:

Constant columns
Bad columns (containing only missing values)

The following code will help us do so:

// Collect the names of columns that are constant or contain only missing values
val constantColumns = loanDataHf.names().indices
  .filter(idx => loanDataHf.vec(idx).isConst || loanDataHf.vec(idx).isBad)
  .map(idx => loanDataHf.name(idx))
println(s"Constant and bad columns: ${table(constantColumns, 4, None)}")

The output is a table listing the names of the constant and bad columns.
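To make the two criteria concrete, here is a minimal sketch in plain Scala that applies them to a few in-memory columns. It is an illustration under stated assumptions, not the frame library's implementation: numericColumns, isBadColumn, and isConstColumn are hypothetical names, missing values are modeled as None, and a column with one repeated value plus some missing entries is treated as constant here, which may differ from how the library handles that case.

```scala
// Hypothetical toy columns; None stands for a missing value.
val numericColumns: Seq[(String, Seq[Option[Double]])] = Seq(
  "amount" -> Seq(Some(5000.0), Some(2400.0), Some(10000.0)),
  "flag"   -> Seq(Some(1.0), Some(1.0), Some(1.0)),
  "empty"  -> Seq(None, None, None)
)

// "Bad": every value in the column is missing.
def isBadColumn(values: Seq[Option[Double]]): Boolean =
  values.forall(_.isEmpty)

// "Constant": at least one value is present and all present values are equal.
// Assumption: missing entries are ignored when comparing values.
def isConstColumn(values: Seq[Option[Double]]): Boolean = {
  val present = values.flatten
  present.nonEmpty && present.distinct.size == 1
}

// Keep the names of columns that meet either criterion, in input order.
val uselessColumns: Seq[String] = numericColumns.collect {
  case (name, values) if isConstColumn(values) || isBadColumn(values) => name
}

println(s"Constant and bad columns: ${uselessColumns.mkString(", ")}")
// prints "Constant and bad columns: flag, empty"
```

Both kinds of column have zero variance over the observed rows, so neither can help a model separate one row from another.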
