Tracking your data and metrics with version control

As with all ML projects, there is always room for improvement, especially as we converge on the actual use case. But let's switch gears and talk about the technical side of the question.

As you probably noticed, in this chapter we had to iterate constantly, adding and removing features from the data or changing the model's settings. And, as we mentioned, only one-third of the initial experiments made it into this book. This is probably fine for a toy dataset and a modest amount of code, but eventually we might be swamped by the different versions and iterations of the model.

In Chapter 9, Shell, Git, Conda, and More – at Your Command, of this book, we learned about git, a system that stores versions of code, so you can safely switch to a previous version or even keep working on different versions of the code in parallel. This will definitely work for the code behind the model, especially if we carefully explain the differences in the commit messages.

However, for real-world ML pipelines, versioning the code alone won't be enough. We need to track metrics and store the data and models for each version of the code, especially if the models take hours or even days to train, which is quite common. Reproducibility requires storing not only the code but also the data, and by data we mean not only the datasets but also any derivatives, models, and metrics, so you can compare different iterations (experiments) and reproduce any of them on demand. It may be tempting to use git itself for this and, for small datasets, it will work. It won't work for even a medium-sized dataset, however, let alone a large one.

There are a few systems and technologies that help track experiments, but the field is young and dynamic. The most popular solutions seem to be sacred, mlflow, and dvc. While all three products generally address similar goals (experimentation and reproducibility), each operates under its own set of predefined conditions and opinions. For example, sacred is a Python library that stores the outcomes and settings of experiments and visualizes them later on a dashboard, while mlflow is a powerful framework that prefers to have a separate tracking server and supports several languages.

The last one, dvc, focuses on data version control (DVC literally stands for Data Version Control); it is small and language-agnostic, and it does not require any servers: everything is communicated via flat files. It also does not require any changes or additions to the code itself, which is good. dvc keeps its interface very similar to git's and relies on git itself for many of its features. It supports multiple cloud providers but, like git, can be used without a remote as well. Let's now try to use dvc on our small pipeline.
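Before we do, it helps to understand what those flat files are. dvc stores the data itself in a content-addressed cache, keyed by a hash of the file's contents, and commits only a small pointer file to git, so git stays fast while every data version remains recoverable. Here is a minimal sketch of that idea using only the standard library; the helper names and the pointer format are our simplification for illustration, not dvc's actual API:

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path) -> str:
    """Compute the MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_path: Path, cache_dir: Path) -> Path:
    """Copy a data file into a content-addressed cache and write a small
    pointer file next to it. Only the tiny pointer file would go into git;
    the bulk data lives in the cache (or a remote)."""
    md5 = file_md5(data_path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_path, cache_dir / md5)
    pointer = data_path.with_suffix(data_path.suffix + ".dvc")
    pointer.write_text(f"md5: {md5}\npath: {data_path.name}\n")
    return pointer
```

Because the cache key is derived from the content, re-tracking an unchanged file costs nothing, and checking out an old commit simply means following the hash in the pointer file back to the cached copy.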
