In the previous section, we showed how to develop a cancer diagnosis pipeline for predicting cancer based on two labels (Benign and Malignant). In this section, we will look at how to develop a cancer prognosis pipeline with the Spark ML and MLlib APIs. The Wisconsin Prognostic Breast Cancer (WPBC) dataset will be used to predict breast cancer prognosis, that is, whether a tumor is recurrent or non-recurrent. Again, the dataset was downloaded from https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Prognostic). To understand the problem formalization, please refer to Figure 1 once again, as we will follow almost the same stages during the cancer prognosis pipeline development.
The details of the attributes found in the WPBC dataset at https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.names are as follows:

1) ID number
2) Outcome (R = recurrent, N = non-recurrent)
3) Time (recurrence time if the outcome is R, disease-free time if the outcome is N)
4-33) Ten real-valued features computed for each cell nucleus (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension), each reported as the mean, standard error, and worst (largest) value
34) Tumor size
35) Lymph node status
If you compare Figure 3 and Figure 9, you will see that the diagnosis and the prognosis share the same features, but the prognosis has two additional ones (attributes 34 and 35 listed previously). Note that these were observed at the time of surgery between 1988 and 1995, and out of the 198 instances, 151 are non-recurring (N) and 47 are recurring (R), as shown in Figure 8.
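By the way, once the data has been loaded as an RDD of lines (see Step 3 below), you can verify this class distribution yourself. The following is a minimal sketch, assuming the file layout described previously, where the outcome is the second comma-separated field:

long recurring = lines.filter(line -> line.split(",")[1].equals("R")).count();
long nonRecurring = lines.filter(line -> line.split(",")[1].equals("N")).count();
System.out.println("R = " + recurring + ", N = " + nonRecurring); // expect R = 47, N = 151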
Of course, a real cancer diagnosis and prognosis dataset today contains many other features and fields, in structured or unstructured form.
For a more detailed discussion and meaningful insights, interested readers can refer to the following research paper: Ioannis A. et al., The Wisconsin Breast Cancer Problem: Diagnosis and DFS Time Prognosis Using Probabilistic and Generalized Regression Neural Classifiers, Oncology Reports, special issue on Computational Analysis and Decision Support Systems in Oncology, last quarter 2005, available at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.65.2463&rep=rep1&type=pdf.
In this subsection, we will look at how to develop a breast cancer prognosis machine learning pipeline step by step, from taking the dataset as input through to prediction, in the 10 steps described in Figure 1 as a data workflow.
Readers are advised to download the dataset and the project files, along with the pom.xml file for the Maven project configuration, from the Packt materials. We have already shown how to make the code work in previous chapters, for example, in Chapter 1, Introduction to Data Analytics with Spark.
Step 1: Import necessary packages/libraries/APIs
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.LabeledPoint;
import org.apache.spark.ml.linalg.DenseVector;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
Step 2: Initialize the necessary Spark environment
static SparkSession spark = SparkSession
    .builder()
    .appName("BreastCancerDetectionPrognosis")
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "E:/Exp/")
    .getOrCreate();
Here we set the application name as BreastCancerDetectionPrognosis and the master URL as local[*]. The SparkSession is the entry point of the program. Please set these parameters accordingly.
Step 3: Take the breast cancer data as input and prepare a JavaRDD out of it
String path = "input/wpbc.data";
JavaRDD<String> lines = spark.sparkContext().textFile(path, 3).toJavaRDD();
Step 4: Create LabeledPoint RDDs
Create LabeledPoint RDDs for the prognosis, where R = recurrent and N = non-recurrent, using the following code segment:
JavaRDD<LabeledPoint> linesRDD = lines.map(new Function<String, LabeledPoint>() {
    public LabeledPoint call(String line) {
        String[] tokens = line.split(",");
        double[] features = new double[30];
        // fields 0 and 1 hold the ID and the outcome; the feature values start at field 2
        for (int i = 0; i < features.length; i++) {
            features[i] = Double.parseDouble(tokens[i + 2]);
        }
        Vector v = new DenseVector(features);
        if (tokens[1].equals("N")) {
            return new LabeledPoint(1.0, v); // non-recurrent
        } else {
            return new LabeledPoint(0.0, v); // recurrent
        }
    }
});
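One caveat worth knowing: in wpbc.data, a few instances mark the lymph node status (attribute 35) with a ? to indicate a missing value. Since the parser above only reads fields 2 through 31, those rows still parse fine here; however, if you extend the feature vector to include the last two attributes, you would first want to drop (or impute) such rows. A minimal sketch for dropping them:

// drop any line containing a missing-value marker before parsing
JavaRDD<String> cleanLines = lines.filter(line -> !line.contains("?"));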
Step 5: Create the Dataset from the LabeledPoint RDD and show the top features
Dataset<Row> data = spark.createDataFrame(linesRDD, LabeledPoint.class);
data.show();
The top features and their corresponding labels are shown in Figure 9:
Step 6: Split the Dataset to prepare the training and test sets
Here we split the dataset into training and test sets of 60% and 40%, respectively. Please adjust these proportions based on your requirements:
Dataset<Row>[] splits = data.randomSplit(new double[] { 0.6, 0.4 }, 12345L);
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];
To see a quick snapshot of these two sets, just call trainingData.show() and testData.show() for the training and test sets, respectively.
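For example, the following quick sanity check (our own addition, not part of the original workflow) prints the split sizes and the first few rows of each set:

System.out.println("training rows: " + trainingData.count() + ", test rows: " + testData.count());
trainingData.show(5); // first 5 rows of the training set
testData.show(5);     // first 5 rows of the test set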
Step 7: Create a Logistic Regression classifier
Create a logistic regression classifier by specifying the maximum number of iterations, the regularization parameter, and the elastic net mixing parameter:
LogisticRegression logisticRegression = new LogisticRegression()
    .setMaxIter(100)
    .setRegParam(0.01)
    .setElasticNetParam(0.4);
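If you are unsure what these knobs mean, you can ask the estimator itself: explainParams() prints every parameter together with its documentation, default, and current value:

System.out.println(logisticRegression.explainParams());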
Step 8: Create a pipeline and train the pipeline model
Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { logisticRegression });
PipelineModel model = pipeline.fit(trainingData);
Here, similar to the diagnosis pipeline, we have created the prognosis pipeline, whose only stage is the logistic regression, which is an estimator and, of course, a pipeline stage.
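One advantage of the pipeline abstraction is that further stages can be chained in front of the estimator. For instance, the WPBC features live on very different scales (area is in the hundreds, whereas fractal dimension is below 1), which can matter for a regularized model such as ours. The following sketch, which is optional and not part of the original pipeline (the column name scaledFeatures is our own choice), adds a standardization stage:

import org.apache.spark.ml.feature.StandardScaler;

StandardScaler scaler = new StandardScaler()
    .setInputCol("features")
    .setOutputCol("scaledFeatures")
    .setWithMean(true)
    .setWithStd(true);

LogisticRegression scaledLr = new LogisticRegression()
    .setFeaturesCol("scaledFeatures") // train on the scaled column instead
    .setMaxIter(100)
    .setRegParam(0.01)
    .setElasticNetParam(0.4);

Pipeline scaledPipeline = new Pipeline()
    .setStages(new PipelineStage[] { scaler, scaledLr });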
Step 9: Create a Dataset of predictions by transforming the test set
Apply the trained pipeline model to the test set to create a Dataset of predictions:
Dataset<Row> predictions=model.transform(testData);
Step 10: Show the predictions and compute the prediction accuracy
predictions.show();
long correct = 0;
for (Row r : predictions.select("features", "label", "prediction").collectAsList()) {
    System.out.println("(" + r.get(0) + ", " + r.get(1) + "), prediction=" + r.get(2));
    // count the rows where the predicted label matches the true label
    if (r.getDouble(1) == r.getDouble(2)) {
        correct++;
    }
}
This code segment will produce an output similar to that shown in Figure 7, with different features, labels, and predictions:
System.out.println("precision: " + (double) (count * 100) / predictions.count()); Precision: 100.0
Therefore, the accuracy is almost 100%, which is fantastic. However, depending upon the data preparation and the random split, you might receive different results.
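Rather than computing the metric by hand, you can also use Spark ML's built-in evaluators, which read the metric directly off the predictions Dataset. A minimal sketch, assuming the default column names (label, prediction, and rawPrediction) produced by the pipeline above:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;

// overall fraction of correct predictions
double accuracy = new MulticlassClassificationEvaluator()
    .setMetricName("accuracy")
    .evaluate(predictions);

// area under the ROC curve, a threshold-independent measure
double auc = new BinaryClassificationEvaluator()
    .setMetricName("areaUnderROC")
    .evaluate(predictions);

System.out.println("accuracy = " + accuracy + ", AUC = " + auc);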
If you have any doubts, the following chapters demonstrate how to tune parameters so that the prediction accuracy increases, since a model like this might still make many false-negative predictions.
In his book titled Machine Learning with R, Packt Publishing, 2015, Brett Lantz argues that it is possible to eliminate false negatives completely by classifying every mass as malignant (or, in our case, every tumor as recurrent). Obviously, this is not a realistic strategy. Still, it illustrates the fact that prediction involves striking a balance between the false-positive rate and the false-negative rate.
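To see where that balance currently sits for our model, we can break the test predictions down into the four confusion-matrix cells. A minimal sketch, reusing the predictions Dataset from Step 9 and treating recurrent (label 0.0 in our encoding) as the positive class:

// recurrences correctly caught
long truePositives  = predictions.filter("label = 0.0 AND prediction = 0.0").count();
// recurrences missed: the clinically dangerous false negatives
long falseNegatives = predictions.filter("label = 0.0 AND prediction = 1.0").count();
// non-recurrent tumors correctly identified
long trueNegatives  = predictions.filter("label = 1.0 AND prediction = 1.0").count();
// non-recurrent tumors flagged as recurrent
long falsePositives = predictions.filter("label = 1.0 AND prediction = 0.0").count();

System.out.println("TP=" + truePositives + ", FN=" + falseNegatives
    + ", TN=" + trueNegatives + ", FP=" + falsePositives);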
If you are still unsatisfied, in Chapter 7, Tuning Machine Learning Models, we will tune several parameters so that the prediction accuracy increases, and we will introduce more sophisticated methods for measuring predictive accuracy that can be used to identify where the error rate can be optimized, depending on the cost of each type of error.
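As a small taste of what is to come, the following sketch shows how Spark's CrossValidator can search over the regularization and elastic net parameters; note that the grid values below are illustrative assumptions, not tuned choices:

import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.tuning.CrossValidator;
import org.apache.spark.ml.tuning.CrossValidatorModel;
import org.apache.spark.ml.tuning.ParamGridBuilder;

// candidate values for the two hyperparameters
ParamMap[] paramGrid = new ParamGridBuilder()
    .addGrid(logisticRegression.regParam(), new double[] { 0.001, 0.01, 0.1 })
    .addGrid(logisticRegression.elasticNetParam(), new double[] { 0.0, 0.4, 0.8 })
    .build();

// 5-fold cross-validation over the whole pipeline
CrossValidator cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setNumFolds(5);

CrossValidatorModel cvModel = cv.fit(trainingData);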