Before accepting a data science model, we need to evaluate it to establish whether it is ready for production. Model evaluation is the process of assessing whether a trained model performs as expected. Usually, we perform model evaluation on a dataset different from the one on which the model was trained.
In this chapter, you will review the basic concepts behind model evaluation, such as data splitting, how to choose metrics for evaluation, and basic concepts behind error analysis. In addition, you will see the main model evaluation techniques for the different data science tasks (classification, regression, and clustering).
Finally, you will learn how to perform model evaluation in Comet by deepening some concepts that you already know, such as experiments, panels, and reports, as well as introducing new concepts, including hyperparameter tuning, model registry, and queries.
Throughout the chapter, you will also implement a practical example.
The chapter is organized as follows:
Before reviewing the concepts behind model evaluation, let’s install all the Python packages needed to run the code and the experiments contained in this chapter.
We will run all the experiments and code in this chapter using Python 3.8. You can download it from the official website (https://www.python.org/downloads/) and choose the 3.8 version.
The examples described in this chapter use the following Python packages:
We have already described the first five packages and how to install them in Chapter 1, An Overview of Comet. So, please refer back to that for further details on installation.
Now that you have installed all the software needed in this chapter, let’s move toward how to use Comet for model evaluation, starting from reviewing some basic concepts on model evaluation.
Model evaluation is the process of assessing the performance of one or more data science models to decide which is the best one to solve a given task. Model evaluation is an iterative task because we run it over and over again, until we reach a satisfactory model.
Model evaluation depends on the task we want to solve. In general, there are two types of tasks:
In the case of supervised learning, model evaluation involves comparing the values predicted by the model with the actual values in the test data and measuring the committed errors. We can calculate different metrics that depend on the specific task, as we will see in the next sections. In the case of unsupervised learning, model evaluation is not a trivial task because we do not have a reference dataset for comparison. However, we can still calculate some metrics in this case, as we will see in the following sections.
If we suppose that we already have a set of models to test, model evaluation is composed of the following two steps:
Let’s investigate each step separately, by starting from the first step – data splitting.
Let’s suppose that we want to build an application where users upload pictures and the system recognizes cars in them. To train a classification model, we need a dataset of pictures. We can collect them from the web, split them into training and test sets (for example, 70% of the images for the training set and the remaining 30% for the test set), and train a neural network with the training set. Then, we perform model evaluation on the test set by calculating, for example, the accuracy. Let’s suppose that we obtain a good accuracy of 90%. But when we run our system in a real-case scenario, the performance is poor, with an accuracy of only 50%. What went wrong with our model?
In the previous example, we have split our original dataset into two parts, a training set and a test set, and we have used the training set to train the model and the test set to evaluate its performance. However, in the era of big data, the practice of dividing data into training and test sets with a fixed ratio such as 70–30 is obsolete: the test set only needs enough samples to reliably detect small improvements in performance, so with very large datasets a much smaller fraction is sufficient. Also, the best practice is to use three datasets:
We use both the dev and test sets to perform model evaluation, but while we exploit the performance calculated on the dev set to improve the model, we use the test set only to assess the final result. Ideally, the test set represents the real-case scenario; thus we should not extract it from the original dataset but from the real world, where we will use our system. In the previous example of car recognition, we should extract the test set directly from the application. In the beginning, we will have only a few pictures, so our test set will be very small. Then, as users upload pictures, our test set will grow, and we will be able to make more accurate assessments.
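When a real-world test set is not yet available, a simple way to obtain the three datasets from a single one is to call scikit-learn’s train_test_split() twice. The following is a minimal sketch, where X and y are assumed to be the already prepared features and target, and the 80/10/10 proportions are only an example:

from sklearn.model_selection import train_test_split

# First split: 80% training set, 20% held out
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.20, random_state=42)
# Second split: the held-out 20% is divided equally into dev and test sets
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)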
All three datasets should have the same distribution. Data distribution describes all the possible values the data can assume and how often they occur. Although many theoretical distributions exist, such as the normal, beta, and gamma distributions, real-world data rarely follows any of them exactly. For example, referring to the previous example of car recognition, if the dev set contains pictures of racing cars and the test set contains pictures of vintage cars, the two datasets follow different distributions.
Many techniques exist to test whether two variables follow the same distribution. However, they need to test each column of the two datasets independently. In this section, we propose a strategy to check whether the entire datasets follow the same distribution, not just the single columns.
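As a reference for the column-by-column approach mentioned above, the following minimal sketch applies the two-sample Kolmogorov–Smirnov test from SciPy to each numerical column; dataset_a and dataset_b are assumed to be pandas DataFrames with the same columns:

from scipy.stats import ks_2samp

# Test each numerical column of the two datasets independently
for col in dataset_a.select_dtypes(include="number").columns:
    statistic, p_value = ks_2samp(dataset_a[col], dataset_b[col])
    # A small p-value suggests that the column does not follow the same distribution in the two datasets
    print(f"{col}: statistic={statistic:.3f}, p-value={p_value:.3f}")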
We can use the strategy shown in the following figure.
Figure 3.1 – A possible strategy to check whether two datasets have the same distribution
Let’s suppose that we have two datasets, namely Dataset A and Dataset B (for example, the dev and test sets):
Note that in the previous strategy, we have transformed the problem of calculating whether two datasets follow the same distribution into a classification problem.
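The following is a minimal sketch of this idea (a strategy sometimes called adversarial validation); it assumes that dataset_a and dataset_b are two pandas DataFrames with the same, already encoded, columns:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Label each record with the dataset it comes from and stack the two datasets
data = pd.concat([dataset_a.assign(origin=0), dataset_b.assign(origin=1)], ignore_index=True)
X_check = data.drop("origin", axis=1)
y_check = data["origin"]

# Train a classifier to distinguish the two datasets
scores = cross_val_score(RandomForestClassifier(), X_check, y_check, cv=5, scoring="roc_auc")

# An AUC close to 0.5 means the classifier cannot tell the datasets apart (same distribution);
# an AUC close to 1 means they are easy to distinguish (different distributions)
print(scores.mean())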
If the dev and test datasets do not follow the same distribution, we have a situation called covariate shift. In this case, we can encounter the following problems:
Now that you have learned how to split the data into training, dev, and test sets, as well as the general problems of covariate shift, we can move to the next step, choosing metrics.
Let’s suppose that you have implemented four models that solve the same problem of car recognition in pictures. Now, you want to choose the best model and move it into production. You may decide to calculate different metrics for all the models (for instance, precision and recall) and then choose the model that has the best metrics. However, it may happen that a model performs better than the others for one metric, while another model outperforms the others for another metric, as shown in the following table:
Figure 3.2 – Precision and recall for the four models of the car recognition example
We note that Model 1 has the best precision value, and Model 4 has the best recall value. Which one is the best model? According to the calculated metrics, the previous example suggests that there is no absolute best model.
To solve the previous problem, you should combine all the metrics you calculate to define a single metric. Your best model will be that with the best value for your defined metric. You can define your own metric, which depends on your specific task. For example, you can write the following metric:
M = w_P × P + w_R × R

In the previous formula, M is the combined metric, P and R are precision and recall respectively, and w_P and w_R are the weights assigned to P and R respectively. If, for example, precision is more important than recall for your task, you can set w_P greater than w_R.
In the previous example, choosing the best model is simple, as shown in the following table:
Figure 3.3 – Precision, Recall, and Combined Metric for the car recognition example
The previous table shows that Model 2 is the best model.
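As a small illustration, here is how such a combined metric could be computed in Python to pick the best model; the precision and recall values and the equal weights below are only placeholders and do not come from the figures above:

# Placeholder (precision, recall) pairs for four hypothetical models
models = {'Model 1': (0.96, 0.89), 'Model 2': (0.95, 0.93),
          'Model 3': (0.94, 0.91), 'Model 4': (0.93, 0.94)}
# Arbitrary weights for precision and recall
w_p, w_r = 0.5, 0.5

# Compute the combined metric for each model and select the best one
combined = {name: w_p * p + w_r * r for name, (p, r) in models.items()}
best_model = max(combined, key=combined.get)
print(best_model, combined[best_model])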
Once you have defined your combined metric, you should optimize it on the dev set. However, it may turn out that your combined metric is not the best way to measure the performance of your model. Usually, this happens when at least one of the following situations occurs:
If at least one of the previous cases occurs, it is advisable to change your combined metric.
Now that you have learned how to choose the best metric to evaluate your model, we can move toward the next section, exploring model evaluation techniques.
Depending on the problem we want to solve, there are different model evaluation techniques. In this section, we will consider three types of problems: regression, classification, and clustering.
The first two problems fall within the scope of supervised learning, while the third method falls within the scope of unsupervised learning.
In this section, you will review the main metrics used for model evaluation in the previously cited problems. We will implement a practical example in Python to illustrate how to calculate each metric. To review the main evaluation metrics, we will use only two datasets: the training and test sets.
Regarding supervised learning, there is also an additional technique to perform model evaluation, called cross validation. The basic idea behind cross validation is to split the original dataset into several subsets. The model is trained on all the subsets except one; when the training phase is completed, the model is tested on the remaining subset. The procedure is repeated iteratively so that each subset is used once as the test set. We will discuss cross validation and how Comet supports it in detail in Chapter 8, Comet for Machine Learning.
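As a preview, the following minimal sketch shows k-fold cross validation with scikit-learn; X and y are assumed to be an already prepared feature matrix and target, and the choice of 5 folds is arbitrary:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Each of the 5 folds is used once as the evaluation set while the model is trained on the other 4
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores.mean(), scores.std())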
This section is organized as follows:
Let’s start from the first step, loading and preparing the dataset.
We will use the Diamonds dataset, provided by ggplot2 under the MIT license (https://ggplot2.tidyverse.org/reference/diamonds.html) and available on Kaggle as a CSV file (https://www.kaggle.com/shivam2503/diamonds):
import pandas as pd
df = pd.read_csv('source/diamonds.csv')
The dataset contains 53,940 rows and 11 columns. The following figure shows the first 10 rows of the dataset:
Figure 3.4 – The first 10 rows of the diamonds dataset
Note that the diamonds dataset contains some categorical columns, including cut, color, and clarity, and some numerical columns (the others).
df = df.drop(["Unnamed: 0"], axis=1)
We use the drop() method provided by pandas with axis=1 to indicate that we want to drop columns.
from sklearn.preprocessing import LabelEncoder
def encode_labels(data):
    categories = (data.dtypes == "object")
    cat_cols = list(categories[categories].index)
    feature_label_encoder_dict = {}
    for col in cat_cols:
        feature_label_encoder_dict[col] = LabelEncoder()
        data[col] = feature_label_encoder_dict[col].fit_transform(data[col])
    return feature_label_encoder_dict
The function receives the DataFrame as input. Firstly, we import the LabelEncoder class, which will permit us to convert categorical values into numerical ones. Then, we select all the categorical columns and store them in the cat_cols variable. Next, we build a LabelEncoder() object for each categorical column and store it in a dictionary named feature_label_encoder_dict. Finally, for each categorical column, we fit the corresponding encoder and transform the column. The function returns the dictionary of fitted encoders so that they can be reused later, for example, when logging them in Comet.
from sklearn.preprocessing import StandardScaler
def scale_numerical(data):
    scaler = StandardScaler()
    data[data.columns] = scaler.fit_transform(data[data.columns])
    return scaler
The function receives the DataFrame as input, scales all the columns (which are all numerical after label encoding) through a StandardScaler() object, and returns the fitted scaler.
Now that we have prepared the data, we can move to the next step, evaluation metrics for regression.
Regression analysis is a type of supervised machine learning that tries to predict a continuous target variable, named Y, on the basis of one or more input variables, named X. To evaluate a regression task, we can calculate many metrics.
As a regression task example, we want to build a model that predicts a diamond’s price on the basis of other features. Before calculating these metrics, we build the training and test sets as follows:
X = df.drop("price", axis = 1)
y = df["price"]
encode_labels(X)
scale_numerical(X)
Note that we have used the previously defined functions.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
We have reserved 20% of samples for the test set and the remaining 80% of samples for the training set. Also, we have set random_state to 42 to make the experiment reproducible.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
In this example, we do not care about model optimization; thus we use the default model.
Now that we have extracted the predicted values, we can calculate the main metrics used to evaluate a regression model. In this section, we calculate three of the most popular metrics: Mean Absolute Error, Root Mean Squared Error, and R Squared:
from sklearn.metrics import mean_absolute_error
MAE = mean_absolute_error(y_test,y_pred)
In the previous example, we can calculate RMSE as follows:
from sklearn.metrics import mean_squared_error
import numpy as np
RMSE = np.sqrt(mean_squared_error(y_test, y_pred))
from sklearn.metrics import r2_score
R2 = r2_score(y_test, y_pred)
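For reference, the three metrics we have just computed are defined as follows, where n is the number of test samples, y_i are the actual values, ŷ_i are the predicted values, and ȳ is the mean of the actual values:

MAE = (1/n) × Σ |y_i − ŷ_i|

RMSE = sqrt((1/n) × Σ (y_i − ŷ_i)²)

R² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)²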
Now that you have seen the most popular metrics used to evaluate regression models, we can analyze the most common metrics for classification.
Classification is a type of supervised learning that tries to predict the target class label, named Y, on the basis of one or more input variables, named X. If the number of class labels is two, we have binary classification; otherwise, if the number of labels is greater than two, we have multiclass classification. In this chapter, we consider only binary classification, but the general concepts described can also be extended to multiclass classification.
As a classification task example, we want to build a model that predicts a diamond’s cut on the basis of other features. The diamonds dataset contains five types of diamond cuts (Ideal, Premium, Very Good, Good, and Fair); thus the problem is multiclass classification. For simplicity, we transform the multiclass classification problem into a binary classification problem.
Before calculating the metrics for binary classification, we prepare the dataset, as follows:
def set_target(x):
    golden_set = ['Ideal', 'Premium', 'Very Good']
    if x in golden_set:
        return 'Gold'
    return 'Silver'
df['target'] = df['cut'].apply(lambda x: set_target(x))
df.drop("cut", axis = 1,inplace=True)
We define a function, named set_target(), which receives as input a variable, named x, checks whether it belongs to golden_set, and returns 'Gold' if it does or 'Silver' otherwise. Then, we create a new column in the original DataFrame, called target, which contains the output of the set_target() function. We also drop the original cut column, which is not needed anymore.
X = df.drop("target", axis = 1)
y = df["target"]
As the input features X, we consider all the columns except the target one, which instead is associated with the y output feature.
encode_labels(X)
scale_numerical(X)
Since we have previously defined the encode_labels() and scale_numerical() functions, this operation is quite simple.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
We use the train_test_split() function provided by scikit-learn.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
We fit the model with the training set and predict the values of the test set. Similar to the case of regression, we do not care about model optimization, since in this section, our objective is to show the evaluation metrics.
Now that we have built the model, we can calculate the most popular metrics for classification – confusion matrix, precision, recall, accuracy, the F1-score, and ROC curve:
Figure 3.5 – The confusion matrix
The table shows the predicted values (rows) versus the actual values (columns). Each cell of the table corresponds to the number of correct or wrong classifications:
from sklearn.metrics import confusion_matrix
# scikit-learn returns the matrix as [[tn, fp], [fn, tp]] (rows are actual values, columns are predicted values)
[tn, fp], [fn, tp] = confusion_matrix(y_test, y_pred)
from sklearn.metrics import plot_confusion_matrix
import matplotlib.pyplot as plt
plot_confusion_matrix(model, X_test, y_test, cmap='GnBu')
plt.show()
The function receives the model and the test set as input. As an additional parameter, we can pass the color map ('GnBu', in our case). The following figure shows the output of the plot_confusion_matrix() function for our example:
Figure 3.6 – The output of the plot_confusion_matrix() function
The matrix shows the number of records for each cell of the table. For example, the first cell indicates that there are 9,300 true positives. The table also colors the cells according to a gradient of colors.
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred)
We use the precision_score() function, which receives the test set and the predicted values as input, and returns a number corresponding to the precision.
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred)
We use the recall_score() function, which receives the test set and the predicted values as input, and returns a number corresponding to the recall.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
We use the accuracy_score() function, which receives the test set and the predicted values as an input, and returns a number corresponding to the accuracy. Many data scientists use accuracy as a single metric to test the validity of their model.
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
We use the f1_score() function, which receives the test set and the predicted values as input, and returns a number corresponding to the F1-score.
from sklearn.metrics import roc_curve,roc_auc_score
y_pred_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label='auc=%.3f' % auc, color='#084081')
axis_ranges = [0,1]
plt.plot(axis_ranges, axis_ranges, linestyle='--', color='k', scalex=False, scaley=False)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()
Firstly, we calculate the predicted probabilities through the predict_proba() method and keep the probability of the positive class. Then, we calculate the ROC curve through the roc_curve() function, which receives as input the actual test labels and the predicted probabilities. We also calculate the AUC score through the roc_auc_score() function. The Area Under the Curve (AUC) score measures the ability of a classifier to distinguish between the target classes. The higher the AUC, the better the performance of a model in distinguishing between positive and negative targets. The following figure shows the output of the previous code:
Figure 3.7 – The ROC curve
Note that in our example the model performs quite well because the ROC curve is pushed toward the top-left corner. The dashed line represents an AUC score of 0.5, which corresponds to a random guessing model.
Now that we have reviewed the most popular metrics used to evaluate classification models, we can analyze the most common metrics for clustering.
Clustering analysis is a type of unsupervised machine learning, where there is no training set. Clustering is used to group records according to similarity criteria, such as distance. A clustering model takes a dataset as input and returns a list of labels as output, corresponding to the associated clusters.
Evaluating the performance of a clustering model is not easy because you should verify that each record has been assigned the right cluster. In other words, you should verify that each record is much more similar to the records belonging to its cluster than to the records belonging to the other clusters.
Before calculating the metrics, we will prepare the diamonds dataset, as follows:
X = df[['price', 'carat']]
plt.scatter(X['price'],X['carat'])
plt.xlabel('Price')
plt.ylabel('Carat')
plt.grid()
plt.show()
The following figure shows the resulting plot:
Figure 3.8 – A scatter plot showing Carat against Price
from sklearn.cluster import KMeans
model = KMeans(n_clusters=2)
labels = model.fit_predict(X)
We use the KMeans() class provided by scikit-learn, and then we call the fit_predict() method to calculate the clusters.
from matplotlib.colors import ListedColormap
cmap = ListedColormap(['#40B7AD', '#084081'])
plt.scatter(X['price'],X['carat'], c=labels, cmap=cmap)
plt.xlabel('Price')
plt.ylabel('Carat')
plt.grid()
plt.show()
We assign to each point of the scatter plot a color, corresponding to the associated label. The following figure shows the resulting plot:
Figure 3.9 – The dataset after clustering
Now that we have clustered our dataset, we can evaluate the model. There are two types of evaluation:
Regarding the intrinsic methods, we can calculate many metrics, including the following ones:
from sklearn.metrics import silhouette_score
score = silhouette_score(X, labels)
The function receives the X dataset and the labels as input and returns a number representing the silhouette score. In the previous example, the score is 0.708.
sse = {}
for i in range(2, 10):
    model = KMeans(n_clusters=i)
    model.fit(X)
    sse[i] = model.inertia_
We define a loop with different values of the number of clusters. After fitting the model, we calculate SSE (model.inertia_) for each iteration. We plot the results as follows:
plt.plot(list(sse.keys()), list(sse.values()))
plt.grid()
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
plt.show()
The previous code produces the following figure:
Figure 3.10 – The elbow method
We can identify the best number of clusters as the point after which the curve stops decreasing rapidly and starts to flatten (the elbow). In our case, the best number of clusters could be two.
Now that we have reviewed all the main techniques for model evaluation, we are ready to move to the next section, using Comet for model evaluation.
Comet provides the following features to deal with model evaluation:
The following figure shows how to combine the features provided by Comet to compare different models and then choose the best one for production:
Figure 3.11 – How to use Comet for model evaluation
Let’s suppose that you want to compare N models and then choose the best model for deployment. You build your experiments and then you track them in Comet. Through Comet Dashboard, you can compare models by building panels, charts, tables, and other similar objects. You can also store your models in the Comet registry. You can even export a report showing the results of comparison from the Comet platform. Once you have selected the best model, you can export it from the registry and make it available for production.
To show how we can use Comet for model evaluation, you will use the previously defined diamonds dataset, and you will implement four classification models – Random Forest, Decision Tree, Gaussian Naive Bayes, and K-Nearest Neighbors. We will use the basic version of these models, without optimizing them, because the objective of this chapter is to show how to perform model evaluation. For more details on how to optimize classification models, you can refer to Chapter 8, Comet for Machine Learning. You will use the cleaned version of the dataset, described in the previous section, Classification, that we built as follows:
options = ['Ideal', 'Premium']
df2 = df[df['cut'].isin(options)]
X = df2.drop("cut", axis = 1)
y = df2["cut"]
We have supposed that df contains the diamonds dataset. We have selected only two possible values for cut, to deal with binary classification. Then, we have built the input (X) and target (y) variables. We also suppose that we have encoded labels and scaled numerical values, as described in the previous section.
We can now move on to analyze each feature provided by Comet separately, by starting from the first, Log.
A Comet Log is an object that stores a metric, a parameter, or an object in general in Comet. We have already described the basic concepts behind a Comet Log in Chapter 1, An Overview of Comet, and Chapter 2, Exploratory Data Analysis in Comet. Thus, you can refer to those chapters for basic concepts. In this section, we will review the most useful Comet logs for model evaluation.
We will use the following methods provided by the Experiment class:
Let’s suppose that we have already split our dataset into training and test sets, as described in the previous section:
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
def compute_metrics(y_pred, y_true):
    metrics = {}
    metrics['precision'] = precision_score(y_true, y_pred)
    metrics['recall'] = recall_score(y_true, y_pred)
    metrics['f1-score'] = f1_score(y_true, y_pred)
    metrics['accuracy'] = accuracy_score(y_true, y_pred)
    return metrics
The function returns a dictionary, named metrics, which stores all the calculated metrics. We calculate precision, recall, the F1-score, and accuracy. Following the discussion in the Choosing metrics section, we will use just one metric to perform the comparison among the different models: accuracy. However, for completeness, we also calculate the other metrics.
from comet_ml import Experiment
from sklearn.metrics import roc_curve
def run_experiment(ModelClass, name):
    experiment = Experiment()
    experiment.set_name(name)
    experiment.add_tag(name)
    model = ModelClass()
    with experiment.train():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_train)
        metrics = compute_metrics(y_pred, y_train)
        experiment.log_metrics(metrics)
        experiment.log_confusion_matrix(y_train, y_pred)
    with experiment.validate():
        y_pred = model.predict(X_test)
        metrics = compute_metrics(y_pred, y_test)
        experiment.log_metrics(metrics)
        experiment.log_confusion_matrix(y_test, y_pred)
        fpr, tpr, _ = roc_curve(y_test, y_pred)
        experiment.log_curve(name, fpr, tpr)
Firstly, we create a new experiment object to permit communication with Comet. To configure the experiment parameters, including the workspace and the API key, you can refer to Chapter 1, An Overview of Comet. We also set the experiment name to a name passed as input, to make the experiment recognizable from the Comet Dashboard. We use the set_name() method of the Experiment class to perform this operation. In addition, we add a new tag to the experiment through the add_tag() method. Then, we create our model through the model = ModelClass() statement. Note that ModelClass is a parameter of the function; you will pass the class of the model you want to test when calling the function.
Once we have created the model, we can train it. We use the with experiment.train() statement to let Comet know that all the logged objects belong to the training phase. We calculate metrics through the compute_metrics() function, and then we log them in Comet through the log_metrics() method provided by the Experiment class. In addition, we log the confusion matrix through the log_confusion_matrix() method. Once training is complete, we can test the model. Similar to the training phase, we use the with experiment.validate() statement to let Comet know that all the logged objects belong to the validation phase. We call the predict() method provided by the model on the test set, and we calculate all the metrics, as already done during the training phase. In addition, in this phase, we calculate the ROC curve through the roc_curve() function, and we log it in Comet through the log_curve() method, provided by the Experiment class.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
run_experiment(RandomForestClassifier, 'RandomForest')
run_experiment(DecisionTreeClassifier, 'DecisionTreeClassifier')
run_experiment(GaussianNB, 'GaussianNB')
run_experiment(KNeighborsClassifier, 'KNeighborsClassifier')
Simply, we call the run_experiment() function for each of the classifiers we want to test. In the example, we test Random Forest, Decision Tree, Gaussian Naive Bayes, and K-Nearest Neighbors.
The log_metrics() method also permits us to log epochs and steps. An epoch is a hyperparameter available only for certain types of algorithms, such as those based on gradient descent. The number of epochs corresponds to the number of times the algorithm works through the entire training dataset.
Setting the number of epochs to 1 means that each sample of the training set has just one opportunity to update the algorithm during the training phase. A step (or batch) defines the number of samples to use before updating the model parameters. We can set the step equal to the training set size or to a smaller number.
def run_experiment_with_epoch(ModelClass, name, n_epochs):
    ...
    with experiment.train():
        for i in range(n_epochs):
            # Train a model for i+1 epochs so that the logged metrics show how performance evolves across epochs
            model = ModelClass(max_iter=i+1)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_train)
            metrics = compute_metrics(y_pred, y_train)
            experiment.log_metrics(metrics, epoch=i)
            experiment.log_confusion_matrix(y_train, y_pred, epoch=i)
    ...
We add another argument to the function, which is the number of epochs (n_epochs). Then, during the training phase, we build a loop over the number of epochs and, at each iteration, we build a model trained for an increasing number of epochs. The previous function works only with models supporting the number of epochs, such as the SGD classifier:
from sklearn.linear_model import SGDClassifier
run_experiment_with_epoch(SGDClassifier, 'SGD',1000)
We build a classifier and set the number of epochs up to 1,000.
Similar to the number of epochs, we can set the number of steps. For example, we can decide to train our model with different sizes of the dataset. Thus, we can modify the run_experiment() function, as follows:
import numpy as np
def run_experiment(ModelClass, name):
    step_size = len(X_train)
    min_steps = 20
    ...
    with experiment.train():
        for i in np.arange(min_steps, step_size + 1, step=5000):
            model = ModelClass()
            X_t = X_train[0:i]
            y_t = y_train[0:i]
            model.fit(X_t, y_t)
            y_pred = model.predict(X_t)
            metrics = compute_metrics(y_pred, y_t)
            experiment.log_metrics(metrics, step=i)
            experiment.log_confusion_matrix(y_t, y_pred, step=i)
    ...
We set the maximum step size to the length of the training set and the minimum number of samples in a step to 20. Then, we loop over the different steps (in increments of 5,000 samples) and train a different model at each step, each time with an increasing number of samples.
Now that you have logged all the needed metrics, we can analyze them in the Comet Dashboard.
Comet Dashboard is the Comet online website, which stores all your experiments. Let’s suppose that you have run all four experiments, described before. Under the Panels or Experiments tab, you will see them, as shown in the following figure:
Figure 3.12 – The four experiments shown in Comet
We can easily identify each experiment because we have set the experiment name for each model.
We can perform the following types of comparison among the experiments:
Let’s look at each type of comparison separately, starting with the first one, ordering.
We can order experiments by a specific parameter or metric:
Figure 3.13 – The ordering button in Comet Dashboard
Figure 3.14 – How to select the ordering criterion in the drop-down menu
We can also choose whether to order the experiments in ascending (the up arrow) or descending (the down arrow) order. In our case, we select the descending order. Now, the four models are ordered, as shown in the following figure:
Figure 3.15 – The four experiments after sorting by validate_accuracy
Thanks to this simple operation, we know which is the best model, according to our evaluation metric.
Now, we can perform a raw comparison among the experiments through a table. We can see detailed information about each evaluation metric.
Figure 3.16 – A portion of the Experiments tab
The figure shows only some details about each experiment, such as TAGS, SERVER END TIME, FILE NAME, and DURATION. However, Comet provides the user with all the logged parameters and metrics, 15 columns in our case, as indicated by the Columns button at the top of the previous figure.
Figure 3.17 – The pop-up window to customize columns
Figure 3.18 – The Experiments tab after selecting some columns
We note that the Random Forest model reaches an accuracy of 0.957, followed by the Decision Tree model with an accuracy of 0.942, and then the other two models. Through the Comet Dashboard, it is very simple to compare experiments.
Now, we can group experiments by some criteria:
Figure 3.19 – The Group by button in the Experiments tab
Figure 3.20 – The result of grouping by duration
There are three groups – 1, 4, and 10 seconds. In the 4-second group, there are two models, while in the other groups, there is just one model.
Finally, we can filter experiments by some criteria:
Figure 3.21 – The Filters button in the Experiments tab
Figure 3.22 – The Experiments tab after applying the validate_accuracy greater than 0.92 filter
Note that Random Forest, Decision Tree, and K-Nearest Neighbors satisfy the applied filter.
For all the described operations, you can apply as many criteria as you want from the pop-up window.
Now that you have learned how to compare experiments in Comet, we can move onto the next step, model registry.
A Comet Registry or Model Registry is a place available in the Comet platform that stores all the registered models. There are at least two advantages of registering models in Comet. Firstly, you can keep track of all the stages of your project, and secondly, you can use the registry as secure storage.
To make a model available in the Model Registry, firstly, you need to register it during an experiment by using one of the following methods provided by the Experiment class:
To log the model, we can modify the previously defined run_experiment() function, as follows:
import pickle
def save_file_to_comet(obj, obj_name, file_name, experiment):
    with open(file_name, 'wb') as file:
        pickle.dump(obj, file)
    experiment.log_model(obj_name, file_name)
The function receives as input the object to save (obj), its name (obj_name), the output filename (file_name), and the Comet experiment object. We save the object as a pickle file through the dump() function provided by the pickle library. Then, we log the saved file through the log_model() method of the Experiment class.
def run_experiment(ModelClass, name):
    ...
    with experiment.train():
        ...
        file_name = 'model.pkl'
        save_file_to_comet(model, name, file_name, experiment)
    ...
def run_experiment(ModelClass, name, feature_label_encoder_dict, scaler, label_encoder):
    ...
    if feature_label_encoder_dict:
        for k, v in feature_label_encoder_dict.items():
            obj_name = f"{k}FeatureLabelEncoder"
            file_name = f"{obj_name}.pkl"
            save_file_to_comet(v, obj_name, file_name, experiment)
    if label_encoder:
        obj_name = "labelEncoder"
        file_name = f"{obj_name}.pkl"
        save_file_to_comet(label_encoder, obj_name, file_name, experiment)
    if scaler:
        obj_name = "scaler"
        file_name = f"{obj_name}.pkl"
        save_file_to_comet(scaler, obj_name, file_name, experiment)
    ...
We save each object separately. For each object, we define its name (obj_name) and the output filename (file_name) in the local filesystem.
Once we have run all the experiments, we can access the saved models, as follows:
Figure 3.23 – The models directory under the Assets and Artifacts menu items
The figure shows all the logged objects, with a focus on the models directory. We can download the model if we want.
Figure 3.24 – The popup to register a new model
Since it is the first model of our project, we need to register it as a new model – for example, we can call it Diamonds classification. If we have already registered a previous model, we can add the new model to an existing Registry.
Figure 3.25 – A view of the Model Registry
Figure 3.26 – How to set the stage as an experiment
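Once a model is in the registry, you can also retrieve it programmatically from a script or a production environment. The following is a minimal sketch based on the Comet Python API; the workspace name, registry name, and version below are placeholders, and you should verify the exact arguments of the download_registry_model() method in the official Comet documentation:

from comet_ml import API

api = API()
# Placeholders: replace with your workspace, the registry name, and the model version
api.download_registry_model("my-workspace", "diamonds-classification", version="1.0.0", output_path="./model")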
Now that you have learned how to save your models in Comet, we can move towards the next step, a report.
A Comet Report is an interactive document that contains experiments, panels, and text. You already learned the basic concepts behind the Comet Report in Chapter 2, Exploratory Data Analysis in Comet. In this section, you will learn how to build a report for model evaluation. The report will contain the following panels:
Let’s start from the first panel, precision, recall, the F1-score, and accuracy graphs. Let’s suppose that we have logged the involved metrics using the step parameter, as described in the previous sections:
Figure 3.27 – The Panels menu item in the Comet Dashboard
The window contains four line charts, showing training accuracy, training F1-score, training precision, and training recall, each plotted against the step.
Now that we have added the basic panels to your report, we can move onto the next step, the ROC curve. Since Comet does not provide a default panel for the ROC curve, the idea is to build a custom panel that shows it:
from comet_ml import API, ui
import matplotlib.pyplot as plt
api = API()
experiment_keys = api.get_panel_experiment_keys()
We build an API() object, and we get all the experiment keys through the get_panel_experiment_keys() method.
colors = iter(['#40B7AD', '#A1CDB3', '#508DED', '#454372'])
for experiment_key in experiment_keys:
    for curve in api.get_experiment_curves(experiment_key):
        curve_json = api.get_experiment_asset(experiment_key, curve['assetId'], return_type='json')
        plt.plot(curve_json['x'], curve_json['y'], color=next(colors), label=curve_json['name'])
We use the get_experiment_curves() method to access all the logged curves, and the get_experiment_asset() method to retrieve each curve as JSON, which contains the FPR values (x) and the TPR values (y). We plot the curve through the plot() function provided by matplotlib.
plt.plot([0, 1], [0, 1], color="navy", linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
Firstly, we plot the line y = x, and then we set the axis ranges through the xlim() and ylim() functions. Finally, we define the axis titles through xlabel() and ylabel(), as well as the legend.
ui.display(plt)
Figure 3.28 – The custom panel showing the ROC curve for all the experiments
The following figure shows the final report:
Figure 3.29 – The final report
The report is interactive; thus, we can modify it dynamically by selecting the involved experiments in the Experiment tab.
We have just completed the journey to perform model evaluation in Comet!
Throughout this chapter, we described some general concepts regarding model evaluation, as well as the main techniques to evaluate regression, classification, and clustering. We also illustrated the importance of model evaluation in a data science project; model evaluation permits us to define some metrics to choose the best model for production.
In the third part of the chapter, you learned which features Comet provides to perform model evaluation and how you can use them through a practical example. We deepened the concepts of logs and reports, which you already knew about, and illustrated two new concepts, the Comet Dashboard and the Model Registry.
Throughout this chapter, you learned how easy it is to use Comet to run model evaluation, as Comet provides very intuitive features that can be combined to build fantastic reports, as well as how to keep track of the best model for production.
Now that you have learned how to perform model evaluation in Comet, we can continue our journey toward the discovery of Comet for data science. In the next chapter, we will learn about some advanced concepts regarding workspaces, projects, experiments, and models in Comet.