The template method pattern is used to create a well-defined process that can use different kinds of algorithms or operations. As a template, it can be customized with whatever algorithm or functions the client requires.
Here, we will explore how the template method pattern can be utilized in a machine learning (ML) pipeline use case. For those who are unfamiliar with ML pipelines, here is a simplified version of what a data scientist might do:
A dataset is first split into two separate datasets for training and testing purposes. The training dataset is fed into a process that fits the data into a statistical model. Then, the validate function uses the model to predict the response (also called the target) variable in the test set. Finally, it compares the predicted values against the actual values and determines how accurate the model is.
Let's say we have the pipeline already set up as follows:
function run(data::DataFrame, response::Symbol, predictors::Vector{Symbol})
    train, test = split_data(data, 0.7)
    model = fit(train, response, predictors)
    validate(test, model, response)
end
For the sake of brevity, the specific functions split_data, fit, and validate are not shown here; you can look them up on this book's GitHub site if you wish. The pipeline concept, however, is demonstrated in the preceding logic. Let's take the pipeline for a quick spin by predicting Boston house prices:
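To make the example self-contained, here is a minimal sketch of what the three helper functions might look like. This is a hypothetical implementation, not the book's actual code; it assumes the DataFrames, GLM, Random, and Statistics packages, and uses ordinary least squares via GLM's lm as the default fitting function:

```julia
using DataFrames, GLM, Random, Statistics

# Randomly split `data` into training and test sets by `ratio`.
function split_data(data::DataFrame, ratio::Float64)
    idx = shuffle(1:nrow(data))
    cut = round(Int, nrow(data) * ratio)
    return data[idx[1:cut], :], data[idx[cut+1:end], :]
end

# Fit a linear regression model predicting `response` from `predictors`.
function fit(df::DataFrame, response::Symbol, predictors::Vector{Symbol})
    formula = Term(response) ~ +(Term.(predictors)...)
    return lm(formula, df)
end

# Return the root mean squared error of the model's predictions on the test set.
function validate(test::DataFrame, model, response::Symbol)
    ŷ = predict(model, test)
    y = test[!, response]
    return sqrt(mean((y .- ŷ) .^ 2))
end
```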
In this example, the response variable is :MedV, and we will build a statistical model based on :Rm, :Tax, and :Crim:
MedV: Median value of owner-occupied homes in $1,000's
Rm: Average number of rooms per dwelling
Tax: Full-value property tax rate per $10,000
Crim: Per capita crime rate by town
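Assuming the dataset has been loaded into a DataFrame (here given the hypothetical name boston), the pipeline can be invoked like this:

```julia
# Run the full pipeline: split, fit, and validate.
rmse = run(boston, :MedV, [:Rm, :Tax, :Crim])
```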
The accuracy of the model is captured in the rmse variable (meaning the root mean squared error). The default implementation uses linear regression as the fitting function.
To implement the template method pattern, we should allow the client to plug in any part of the process. For that reason, we can modify the function with keyword arguments:
function run2(data::DataFrame, response::Symbol, predictors::Vector{Symbol};
              fit = fit, split_data = split_data, validate = validate)
    train, test = split_data(data, 0.7)
    model = fit(train, response, predictors)
    validate(test, model, response)
end
Here, we have added three keyword arguments: fit, split_data, and validate. The function is named run2 to avoid confusion; the client can customize any part of the process by passing in a custom function. To illustrate how this works, let's create a new fit function that uses a generalized linear model (GLM):
using GLM
function fit_glm(df::DataFrame, response::Symbol, predictors::Vector{Symbol})
    formula = Term(response) ~ +(Term.(predictors)...)
    return glm(formula, df, Normal(), IdentityLink())
end
Now that we have customized the fitting function, we can rerun the program by passing it via the fit keyword argument:
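Assuming the Boston housing data lives in a DataFrame (again, using the hypothetical name boston), the call might look like this:

```julia
# Plug the custom fitting function into the pipeline
# via the `fit` keyword argument.
rmse = run2(boston, :MedV, [:Rm, :Tax, :Crim]; fit = fit_glm)
```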
As you can see, the client can customize the pipeline easily by just passing in functions. This is possible because Julia supports first-class functions.
In the next section, we will review a few other traditional behavioral patterns.