Cancer-prognosis pipeline with Spark

In the previous section, we developed a cancer diagnosis pipeline that predicts one of two labels (benign or malignant). In this section, we develop a cancer prognosis pipeline with the Spark ML and MLlib APIs. We use the Wisconsin Prognostic Breast Cancer (WPBC) dataset to predict whether a breast cancer case is recurrent or non-recurrent. Again, the dataset was downloaded from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Prognostic). To understand the problem formalization, please refer to Figure 1 once again, since we follow almost the same stages when developing the cancer prognosis pipeline.

Dataset exploration

The attributes of the WPBC dataset, documented at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names, are as follows:

  • 1: ID number
  • 2: Outcome (R = recurrent, N = non-recurrent)
  • 3: Time (recurrence time if field 2 is R, disease-free time if field 2 is N)
  • 4 to 33: Ten real-valued features are computed for each cell nucleus: Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave points, Symmetry, and Fractal dimension. The mean, standard error, and worst (largest) value of each are recorded, which gives 30 feature columns. Field 34 is the Tumor size and field 35 is the Lymph node status, as follows:
    • Tumor size: Diameter of the excised tumor in centimeters
    • Lymph node status: The number of positive axillary lymph nodes

If you compare Figure 3 and Figure 9, you will see that the diagnosis and prognosis datasets share the same features, but the prognosis dataset has two additional ones (fields 34 and 35, mentioned previously). Note that these were observed at the time of surgery, between 1988 and 1995. Of the 198 instances, 151 are non-recurring (N) and 47 are recurring (R), as shown in Figure 8.
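The class counts above matter when judging any later accuracy figure. The following is a minimal sketch (plain Java, no Spark) that computes the accuracy of a trivial baseline that always predicts the larger class; the class and method names are hypothetical and only the counts 151 and 47 come from the text.

```java
// Sketch: how imbalanced the WPBC labels are, and what accuracy a trivial
// "always predict the larger class" baseline would reach.
final class ClassBalance {

    // Fraction of all instances that belong to the larger of the two classes.
    static double majorityBaseline(long nonRecurrent, long recurrent) {
        long total = nonRecurrent + recurrent;
        return (double) Math.max(nonRecurrent, recurrent) / total;
    }

    public static void main(String[] args) {
        long nonRecurrent = 151; // outcome N, from the dataset description
        long recurrent = 47;     // outcome R, from the dataset description
        System.out.println("total instances: " + (nonRecurrent + recurrent));
        System.out.println("baseline accuracy: " + majorityBaseline(nonRecurrent, recurrent));
    }
}
```

With these counts the baseline is roughly 76%, so a classifier needs to do clearly better than that before its accuracy says much about recurrent cases.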

Of course, a real cancer diagnosis and prognosis dataset today contains many more features and fields, both structured and unstructured:


Figure 8: Snapshot of the data (partial)

Tip

For a more detailed discussion and further insights, interested readers can refer to the following research paper: Ioannis A. et al., The Wisconsin Breast Cancer Problem: Diagnosis and DFS time prognosis using probabilistic and generalized regression neural classifiers, Oncology Reports, special issue on Computational Analysis and Decision Support Systems in Oncology, last quarter 2005, available at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.2463&rep=rep1&type=pdf.

Breast-cancer-prognosis pipeline with Spark ML/MLlib

In this subsection, we develop a breast cancer prognosis machine learning pipeline step by step, from reading the input dataset to making predictions, in 10 steps that follow the data workflow described in Figure 1.

Tip

Readers are advised to download the dataset and the project files, along with the pom.xml file for the Maven project configuration, from the Packt materials. We explained how to make the code work in previous chapters, for example, Chapter 1, Introduction to Data Analytics with Spark.

Step 1: Import necessary packages/libraries/APIs

import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.ml.Pipeline; 
import org.apache.spark.ml.PipelineModel; 
import org.apache.spark.ml.PipelineStage; 
import org.apache.spark.ml.classification.LogisticRegression; 
import org.apache.spark.ml.feature.LabeledPoint; 
import org.apache.spark.ml.linalg.DenseVector; 
import org.apache.spark.ml.linalg.Vector; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession; 

Step 2: Initialize the necessary Spark environment

static SparkSession spark = SparkSession 
        .builder() 
        .appName("BreastCancerDetectionPrognosis") 
        .master("local[*]") 
        .config("spark.sql.warehouse.dir", "E:/Exp/") 
        .getOrCreate(); 

Here we set the application name to BreastCancerDetectionPrognosis and the master URL to local[*]. The SparkSession is the entry point of the program. Please set these parameters, including the warehouse directory, according to your environment.

Step 3: Take the breast cancer data as input and prepare a JavaRDD from the data

String path = "input/wpbc.data"; 
JavaRDD<String> lines = spark.sparkContext().textFile(path, 3).toJavaRDD(); 

Tip

To learn more about the data, please refer to Figure 5 and its description and the Dataset exploration subsection.

Step 4: Create LabeledPoint RDDs

Create LabeledPoint RDDs for the prognosis, labeling R (recurrent) as 1.0 and N (non-recurrent) as 0.0, using the following code segment:

JavaRDD<LabeledPoint> linesRDD = lines.map(new Function<String, LabeledPoint>() { 
      public LabeledPoint call(String line) { 
        String[] tokens = line.split(","); 
        // tokens[0] = ID, tokens[1] = outcome, tokens[2] = time; 
        // the 30 nucleus features occupy tokens[3] to tokens[32]. 
        double[] features = new double[30]; 
        for (int i = 0; i < features.length; i++) { 
          features[i] = Double.parseDouble(tokens[i + 3]); 
        } 
        Vector v = new DenseVector(features); 
        if (tokens[1].equals("R")) { 
          return new LabeledPoint(1.0, v); // recurrent 
        } else { 
          return new LabeledPoint(0.0, v); // non-recurrent 
        } 
      } 
    });  
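To check the parsing logic without starting Spark, it can be isolated in a few static methods. The sketch below assumes the field layout from the Dataset exploration subsection (ID, outcome, time, 30 nucleus features, tumor size, lymph node status) and assumes that a missing lymph node status is written as "?", which is how the dataset documentation describes missing values as far as we can tell. The class name, the method names, and the synthetic record are hypothetical.

```java
// Sketch: Spark-free parsing of one WPBC record, under the assumed field layout.
final class WpbcRecordParser {

    static final int FIRST_FEATURE = 3;  // zero-based index of field 4
    static final int NUM_FEATURES = 30;  // nucleus features, fields 4 to 33
    static final int LYMPH_NODE = 34;    // zero-based index of field 35

    // 1.0 for recurrent (R), 0.0 for non-recurrent (N).
    static double label(String[] tokens) {
        return "R".equals(tokens[1].trim()) ? 1.0 : 0.0;
    }

    // Copies the 30 nucleus features into a dense array.
    static double[] features(String[] tokens) {
        double[] f = new double[NUM_FEATURES];
        for (int i = 0; i < NUM_FEATURES; i++) {
            f[i] = Double.parseDouble(tokens[FIRST_FEATURE + i].trim());
        }
        return f;
    }

    // Returns the lymph node status, or the fallback when the value is "?".
    static double lymphNodes(String[] tokens, double fallback) {
        String t = tokens[LYMPH_NODE].trim();
        return "?".equals(t) ? fallback : Double.parseDouble(t);
    }

    // Builds a synthetic 35-field record (NOT real data) for demonstration.
    static String syntheticRecord(String outcome) {
        StringBuilder sb = new StringBuilder("1001," + outcome + ",27");
        for (int i = 1; i <= NUM_FEATURES; i++) {
            sb.append(',').append(i); // feature values 1..30
        }
        return sb.append(",2.5,?").toString(); // tumor size, missing lymph nodes
    }

    public static void main(String[] args) {
        String[] tokens = syntheticRecord("R").split(",");
        System.out.println("fields: " + tokens.length);
        System.out.println("label: " + label(tokens));
        System.out.println("first feature: " + features(tokens)[0]);
        System.out.println("lymph nodes (fallback -1): " + lymphNodes(tokens, -1.0));
    }
}
```

Handling "?" explicitly matters only if you decide to add field 35 to the feature vector; Double.parseDouble would otherwise throw a NumberFormatException on those rows.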

Step 5: Create the Dataset from the lines RDD and show the top features

Dataset<Row> data = spark.createDataFrame(linesRDD,LabeledPoint.class); 
data.show(); 

The top features and their corresponding labels are shown in Figure 9:


Figure 9: Top features and their corresponding labels

Step 6: Split the Dataset to prepare the training and test sets

Here we split the dataset into training and test sets of 60% and 40%, respectively. Please adjust these based on your requirements:

Dataset<Row>[] splits = data.randomSplit(new double[] { 0.6, 0.4 }, 12345L); 
Dataset<Row> trainingData = splits[0];   
Dataset<Row> testData = splits[1]; 

To see a quick snapshot of these two sets, just write trainingData.show() and testData.show(), for training and test sets, respectively.
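Keep in mind that a seeded random split assigns rows by sampling, so the two sets are only approximately 60% and 40% of the data, while the fixed seed makes the assignment repeatable. The sketch below illustrates that idea in plain Java; it is not Spark's implementation, and the class and method names are hypothetical.

```java
import java.util.Random;

// Sketch: seeded, per-row random assignment to a training or test set.
final class SeededSplit {

    // For each row, one independent draw decides training (true) or test (false).
    static boolean[] assign(int rows, double trainFraction, long seed) {
        Random rng = new Random(seed);
        boolean[] inTraining = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            inTraining[i] = rng.nextDouble() < trainFraction;
        }
        return inTraining;
    }

    // Number of rows that ended up in the training set.
    static int countTraining(boolean[] inTraining) {
        int n = 0;
        for (boolean b : inTraining) {
            if (b) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        boolean[] split = assign(198, 0.6, 12345L);
        int train = countTraining(split);
        System.out.println("training rows: " + train + ", test rows: " + (198 - train));
    }
}
```

Running it twice with the same seed yields the same assignment, which is why the step above passes 12345L: results stay comparable between runs.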

Step 7: Create a Logistic Regression classifier

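Create a logistic regression classifier by specifying the maximum number of iterations, the regularization parameter, and the elastic net mixing parameter:

LogisticRegression logisticRegression = new LogisticRegression() 
        .setMaxIter(100)            // upper bound on optimizer iterations 
        .setRegParam(0.01)          // regularization strength 
        .setElasticNetParam(0.4);   // mix between L1 (1.0) and L2 (0.0) penalties 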
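To see what the two regularization settings do together, the sketch below evaluates the elastic net penalty in the form given in the Spark MLlib documentation, regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2), where alpha is the elastic net parameter. The weight vector is hypothetical, and the class is an illustration, not part of Spark.

```java
// Sketch: elastic net penalty for a given weight vector.
final class ElasticNetPenalty {

    // lambda = regParam, alpha = elasticNetParam (0 = pure L2, 1 = pure L1).
    static double penalty(double[] weights, double lambda, double alpha) {
        double l1 = 0.0;
        double l2Squared = 0.0;
        for (double w : weights) {
            l1 += Math.abs(w);
            l2Squared += w * w;
        }
        return lambda * (alpha * l1 + (1.0 - alpha) / 2.0 * l2Squared);
    }

    public static void main(String[] args) {
        double[] weights = {1.0, -2.0}; // hypothetical coefficients
        System.out.println("penalty: " + penalty(weights, 0.01, 0.4));
    }
}
```

With regParam 0.01 and elasticNetParam 0.4, the penalty is 40% L1 (which pushes some coefficients to exactly zero) and 60% L2 (which shrinks all coefficients smoothly).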

Step 8: Create a pipeline and train the pipeline model

Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{logisticRegression}); 
PipelineModel model=pipeline.fit(trainingData); 

Here, as in the diagnosis pipeline, the prognosis pipeline has a single stage: the logistic regression, which is an estimator.

Step 9: Apply the model to the test Dataset

Transform the test Dataset with the fitted model to obtain a new Dataset that contains the predictions:

Dataset<Row> predictions=model.transform(testData); 
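Conceptually, a binary logistic regression model turns each feature vector into a probability with the sigmoid function and then compares that probability with a threshold (0.5 by default in Spark's LogisticRegression). The sketch below shows that computation in plain Java; the weights, intercept, and input are hypothetical, not values learned from the WPBC data.

```java
// Sketch: how a fitted logistic regression scores one feature vector.
final class LogisticScore {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Probability of the positive class (label 1.0) for input x.
    static double probability(double[] weights, double intercept, double[] x) {
        double z = intercept;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * x[i];
        }
        return sigmoid(z);
    }

    // Predicts 1.0 when the probability exceeds the threshold, else 0.0.
    static double predict(double probability, double threshold) {
        return probability > threshold ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, -1.0}; // hypothetical coefficients
        double[] x = {2.0, 1.0};        // hypothetical feature vector
        double p = probability(weights, 0.0, x);
        System.out.println("probability=" + p + ", prediction=" + predict(p, 0.5));
    }
}
```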

Step 10: Show the prediction with prediction precision

predictions.show(); 

Figure 10: Prediction with prediction precision

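long correct = 0; 
long total = 0; 
for (Row r : predictions.select("features", "label", "prediction").collectAsList()) { 
      System.out.println("(" + r.get(0) + ", label=" + r.get(1) + ", prediction=" + r.get(2) + ")"); 
      if (r.getDouble(1) == r.getDouble(2)) { 
        correct++; // count only rows where the prediction matches the label 
      } 
      total++; 
    } 

This code segment will produce an output similar to that shown in Figure 7, with different features, labels, and predictions. The share of correct predictions is then printed as follows:

System.out.println("precision: " + (double) (correct * 100) / total);  

Strictly speaking, this ratio is the accuracy (the percentage of test rows whose prediction equals the label). Its value depends on the random split and on the data preparation, so you might receive different results. Because 151 of the 198 instances are non-recurrent, a high value alone does not show that recurrent cases are predicted well.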
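A single percentage hides which kinds of errors were made. The sketch below, in plain Java with hypothetical labels and predictions (not output from the pipeline), counts the four confusion matrix cells and derives accuracy, precision, and recall from them; the class name is an assumption.

```java
// Sketch: confusion matrix counts and the metrics derived from them.
final class BinaryMetrics {

    final long tp, fp, tn, fn;

    BinaryMetrics(double[] labels, double[] predictions) {
        long tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < labels.length; i++) {
            boolean actual = labels[i] == 1.0;
            boolean predicted = predictions[i] == 1.0;
            if (actual && predicted) tp++;
            else if (!actual && predicted) fp++;
            else if (!actual) tn++;
            else fn++;
        }
        this.tp = tp; this.fp = fp; this.tn = tn; this.fn = fn;
    }

    // Share of all rows predicted correctly.
    double accuracy() {
        return (double) (tp + tn) / (tp + fp + tn + fn);
    }

    // Share of positive predictions that are truly positive (0.0 if none made).
    double precision() {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Share of truly positive rows that were found (0.0 if there are none).
    double recall() {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    public static void main(String[] args) {
        double[] labels      = {1, 0, 0, 1, 0, 0, 1, 0}; // hypothetical
        double[] predictions = {1, 0, 0, 0, 0, 1, 1, 0}; // hypothetical
        BinaryMetrics m = new BinaryMetrics(labels, predictions);
        System.out.println("TP=" + m.tp + " FP=" + m.fp + " TN=" + m.tn + " FN=" + m.fn);
        System.out.println("accuracy=" + m.accuracy()
                + " precision=" + m.precision() + " recall=" + m.recall());
    }
}
```

For a prognosis task, recall on the recurrent class is usually the number to watch, because a false negative means a recurrence that the model missed.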

If anything remains unclear, note that the following chapter demonstrates how to tune parameters to increase the prediction accuracy, since the current model might still produce many false-negative predictions.

Tip

In his book Machine Learning with R, Packt Publishing, 2015, Brett Lantz argues that it is possible to eliminate false negatives completely by classifying every mass as malignant (or, in our case, every tumor as recurrent). Obviously, this is not a realistic strategy. Still, it illustrates the fact that prediction involves striking a balance between the false-positive rate and the false-negative rate.
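The balance described in this tip can be made concrete by varying the decision threshold. The sketch below uses hypothetical predicted probabilities and labels (not pipeline output) and counts false positives and false negatives at several thresholds; the class and method names are assumptions.

```java
// Sketch: false positives versus false negatives as the threshold moves.
final class ThresholdTradeoff {

    // Returns {falsePositives, falseNegatives} when predicting positive
    // for every probability greater than or equal to the threshold.
    static int[] errors(double[] probabilities, int[] labels, double threshold) {
        int fp = 0;
        int fn = 0;
        for (int i = 0; i < probabilities.length; i++) {
            boolean predictedPositive = probabilities[i] >= threshold;
            if (predictedPositive && labels[i] == 0) fp++;
            if (!predictedPositive && labels[i] == 1) fn++;
        }
        return new int[] {fp, fn};
    }

    public static void main(String[] args) {
        double[] probabilities = {0.9, 0.7, 0.4, 0.35, 0.2, 0.1}; // hypothetical
        int[] labels           = {1,   1,   0,   1,    0,   0};   // hypothetical
        for (double threshold : new double[] {0.5, 0.3, 0.0}) {
            int[] e = errors(probabilities, labels, threshold);
            System.out.println("threshold=" + threshold + " FP=" + e[0] + " FN=" + e[1]);
        }
    }
}
```

At threshold 0.0 every case is predicted positive, so false negatives vanish while every negative case becomes a false positive, which is exactly the unrealistic strategy the tip describes.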

If you are still unsatisfied, we tune several parameters in Chapter 7, Tuning Machine Learning Models, to increase the prediction accuracy. We also move toward more sophisticated methods for measuring predictive accuracy, which can identify where the error rate can be optimized depending on the cost of each type of error.
