In the previous chapters, we learned about Pachyderm's architecture, the internals of the Pachyderm solution, and version control primitives such as repositories, branches, and commits. We reviewed why reproducibility is essential and why it should be a part of a successful data science process. We also learned how to do this on all three major platforms – macOS, Linux, and Windows.
There are many ways and a variety of platforms that enable you to run your end-to-end Machine Learning (ML) workflows using Pachyderm. We will start with the most common and easiest-to-configure method, a local deployment on your computer; then, in the following chapters, we will review the deployment process on cloud platforms.
This chapter will walk you through the process of installing Pachyderm locally so that you can get started quickly and test Pachyderm. This chapter will prepare you to run your first pipeline. We will provide an overview of the system requirements and guide you through the process of installing all the prerequisite software needed for Pachyderm to run smoothly.
In this chapter, we're going to cover the following main topics:
Whether you are on macOS, Windows, or Linux, you need to install the following tools:
We will get into the specifics of installing and configuring these tools as we go through this chapter. If you already know how to do this, you can go ahead and set them up now.
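If some of these tools are already on your machine, a quick check can show what is still missing. The following is a minimal sketch rather than part of Pachyderm: the function name missing_tools is hypothetical, and the tool list is assumed from the tools covered in this chapter.

```shell
# missing_tools: print each given command name that is not found on PATH.
# The function name is illustrative; it is not part of any Pachyderm tooling.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tool list assumed from this chapter; trim it to match your platform.
missing_tools kubectl helm minikube docker pachctl
```

Any name that is printed still needs to be installed; no output means every listed tool was found.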
In this section, we will cover how to install the system tools that we will use to prepare our environment before installing Pachyderm.
While Linux distributions have many package management options, there is no default package manager for macOS users. Homebrew (brew) fills this gap and provides a great solution to easily install and manage software from the macOS Terminal and Linux shell as an alternative to apt, yum, or flatpak, which are available in Linux distributions.
Homebrew uses Git to download its updates. In Homebrew, packages are installed based on definitions known as Formulae. Homebrew installs software packages to the Cellar, which is located under the /usr/local/Cellar directory. Another term you will hear often is Tap. A Tap is a Git repository of Formulae.
In this chapter, we will frequently use brew to install various software packages on macOS. Therefore, you need to install it if you are using macOS. The same brew commands we will use in this chapter run on Linux as well, but we will keep the use of brew for Linux optional:
$ /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
$ brew commands
The following screenshot shows the system's output:
$ brew update
$ brew outdated
$ brew upgrade
Now that you have installed the Homebrew package manager on your computer, let's install kubectl.
WSL is a tool that enables Windows users to run Linux commands and utilities natively in Windows. If you are using Windows, you can install WSL on your machine by following these steps:
wsl --install
Important note
If you are on Windows, run all the Linux and Pachyderm commands described in this book from WSL.
For more information, see the official Microsoft Windows documentation at https://docs.microsoft.com/en-us/windows/wsl/install.
Before you create your first Kubernetes cluster, you need to install the Kubernetes command-line tool, kubectl, to execute commands against the cluster. Now, let's learn how to install kubectl on a computer.
See the official Kubernetes documentation for more information: https://kubernetes.io/docs/home/.
Follow these steps:
If you are using Linux, run the following command:
$ curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
$ chmod +x ./kubectl && sudo mv ./kubectl /usr/local/bin/kubectl
If you are on macOS (Intel), run the following command:
$ curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/darwin/amd64/kubectl
$ chmod +x ./kubectl && sudo mv ./kubectl /usr/local/bin/kubectl
If you are on Windows, the following command will do the trick:
curl -LO https://dl.k8s.io/release/v1.22.0/bin/windows/amd64/kubectl.exe
$ kubectl version --short --client
Here is an example of the system's output:
Client Version: v1.22.3
To be able to perform the following commands, the kubectl version must be v1.19 or later.
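The version requirement can also be checked in a script. The following is a minimal sketch, assuming the version string has the vMAJOR.MINOR.PATCH shape shown in the sample output; the helper name version_at_least is hypothetical, and only the major and minor numbers are compared.

```shell
# version_at_least CURRENT MINIMUM
# Succeeds when CURRENT's major.minor is >= MINIMUM's. A leading "v" is ignored.
# Hypothetical helper for illustration; patch levels are not compared.
version_at_least() {
  current=${1#v}
  minimum=${2#v}
  cur_major=${current%%.*}
  cur_rest=${current#*.}
  cur_minor=${cur_rest%%.*}
  min_major=${minimum%%.*}
  min_rest=${minimum#*.}
  min_minor=${min_rest%%.*}
  if [ "$cur_major" -ne "$min_major" ]; then
    [ "$cur_major" -gt "$min_major" ]
  else
    [ "$cur_minor" -ge "$min_minor" ]
  fi
}

# Sample line taken from the output shown above.
line='Client Version: v1.22.3'
client_version=${line##* }   # keep the text after the last space
if version_at_least "$client_version" v1.19; then
  echo "kubectl $client_version is recent enough"
else
  echo "kubectl $client_version is too old; upgrade to v1.19 or later"
fi
```

To check a real installation, replace the sample line with the output of the kubectl version command used in the previous step.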
Now that you have kubectl installed on your computer to execute commands against your Kubernetes cluster, let's install Helm.
Helm is a popular package manager for Kubernetes clusters. Before you deploy Pachyderm by using its Helm chart, you need to install the Helm binary in your environment so that you can manage the life cycle of your Helm chart. Follow these steps to install Helm on your computer:
$ curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3
$ chmod 700 get_helm.sh
$ ./get_helm.sh
$ helm version --short
Here is an example of the system's output:
v3.7.1+g1d11fcb
Next, you must install the necessary tools to prepare your local Kubernetes cluster environment before you can deploy Pachyderm. If you are familiar with containers in Linux, you are probably already familiar with these tools. If you are using Linux as your local machine, follow the instructions provided in the Installing minikube section to prepare your environment. If you are using macOS, follow the instructions provided in the Installing Docker Desktop section. Using Docker Desktop is recommended due to its simplicity.
Minikube is a popular, lightweight, cross-platform Kubernetes implementation that helps users quickly create a single-node local Kubernetes cluster. It supports multiple runtimes, including CRI-O, containerd, and Docker, and can be deployed as a Virtual Machine (VM), a container, or on bare metal. Since Pachyderm supports the Docker runtime only, we will cover how to use the Docker container runtime and deploy it as a container. For additional configuration details, you can refer to the official minikube documentation at https://minikube.sigs.k8s.io/docs/start/. Let's install the latest version of minikube:
If you are using Linux, run the following command:
$ curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
$ sudo install minikube-linux-amd64 /usr/local/bin/minikube
If you are on Windows (Chocolatey package manager is required), run the following command:
choco install minikube
$ minikube version
The following is an example of the command's response:
minikube version: v1.22.0
commit: a03fbcf166e6f74ef224d4a63be4277d017bb62e
Now that you have installed minikube, let's install Docker Desktop.
Docker simplifies developing, delivering, and running applications by separating applications from the infrastructure and its dependencies. Pachyderm supports the Docker container runtime only; therefore, Docker tools must be installed before you deploy Pachyderm.
Docker runs as a native application using the macOS sandbox security model and installs all Docker tools on your macOS, including Docker Engine, the CLI, Docker Compose, Credential Helper, Notary, and Kubernetes.
If you do not have Docker Desktop already installed, you can follow the instructions provided in the next section to install it. Otherwise, you can skip to the Preparing your Kubernetes environment section. You can also refer to the official Docker documentation at https://docs.docker.com/get-docker/.
Follow these steps to install Docker Desktop on macOS. The latest version of Docker Desktop is supported on the three most recent major versions of macOS. If your macOS version is older than that, you need to upgrade it to the latest version of macOS:
Installing Docker Desktop on Windows
Install Docker Desktop on your Windows machine by following these steps:
Now that you have installed Docker Desktop on your machine, let's install the Pachyderm CLI, called pachctl.
The Pachyderm CLI, pachctl, is used to deploy and interact with Pachyderm clusters. Follow these steps to install pachctl:
$ PACHYDERMVERSION=$(curl --silent "https://api.github.com/repos/pachyderm/pachyderm/releases/latest" | grep '"tag_name":' |
sed -E 's/.*"v([^"]+)".*/\1/')
If you are using macOS, run the following command:
$ brew tap pachyderm/tap && brew install pachyderm/tap/pachctl@${PACHYDERMVERSION%.*}
If you are using Debian Linux or WSL on Windows 10, run the following command:
$ curl -o /tmp/pachctl.deb -L https://github.com/pachyderm/pachyderm/releases/download/v${PACHYDERMVERSION}/pachctl_${PACHYDERMVERSION}_amd64.deb && sudo dpkg -i /tmp/pachctl.deb
$ pachctl version --client-only
The following is an example of the system's output:
COMPONENT VERSION
pachctl 2.0.1
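To see what the version-detection step at the start of this procedure extracts, you can run the same sed expression against a sample line instead of the live GitHub API response. This is a minimal sketch: the JSON line is a made-up sample shaped like the tag_name field, and the expression assumes that the replacement is the back-reference \1, that is, the text captured after the leading v.

```shell
# A sample line shaped like the "tag_name" field of the GitHub releases API
# response (made up for illustration; no network access is needed).
sample='  "tag_name": "v2.0.1",'

# Capture everything between "v and the next double quote, then replace the
# whole line with that captured group (\1).
PACHYDERMVERSION=$(printf '%s\n' "$sample" | grep '"tag_name":' | sed -E 's/.*"v([^"]+)".*/\1/')

echo "$PACHYDERMVERSION"   # prints 2.0.1
```

If the variable comes back empty or unexpected on your machine, printing it this way before running the install command is a cheap way to catch the problem early.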
With that, you have installed the prerequisites to run Pachyderm locally. Now, let's prepare our cluster and deploy Pachyderm on our local Kubernetes cluster.
Autocompletion is a shell feature that suggests or completes partially typed commands and parameters, usually when you press the Tab key. Pachyderm supports autocompletion for the Bourne Again Shell (bash) and the Z shell (zsh), an extended Bourne shell. bash and zsh are the most common Unix command-line interpreters on macOS and Linux. In this section, you will learn how to enable the Pachyderm autocompletion feature and the parameters that are available from the pachctl command.
If you don't know which shell you are using, type the following command to find out:
$ echo "$SHELL"
If you are using bash, the output of the preceding command should look as follows:
/bin/bash
If you are using zsh, the output of the preceding command should look as follows:
/bin/zsh
Since we now know which shell we are using, we can install Pachyderm autocompletion.
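If you want to make this decision in a script, the shell path can be mapped to a name. The following is a minimal sketch, assuming that the SHELL variable holds the path of your login shell; the helper name shell_kind is hypothetical.

```shell
# shell_kind PATH: print "bash", "zsh", or "other" based on the shell's file name.
# Hypothetical helper; it inspects only the path string it is given.
shell_kind() {
  case "${1##*/}" in
    bash) echo bash ;;
    zsh)  echo zsh ;;
    *)    echo other ;;
  esac
}

echo "Detected shell: $(shell_kind "${SHELL:-}")"
```

The result tells you which of the two following procedures applies to you.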
Follow these steps to enable Pachyderm autocompletion on your computer:
If you are using macOS or Linux with Homebrew, use the following command:
$ brew install bash-completion
If you are on Ubuntu Linux, use the following command:
$ sudo apt install bash-completion
If you are using RHEL or CentOS Linux, use the following command:
$ sudo yum install bash-completion bash-completion-extras
If you are on macOS, run the following command:
$ brew info bash-completion
If you are using Linux, run the following command:
$ complete -p
If you are on macOS, run the following command:
$ pachctl completion bash --install --path /usr/local/etc/bash_completion.d/pachctl
If you are using Linux, run the following command:
$ pachctl completion bash --install --path /usr/share/bash-completion/completions/pachctl
With that, Pachyderm's command-line autocompletion has been enabled in your bash shell.
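The only difference between the macOS and Linux variants of the install step is the target path. The following minimal sketch picks the path from the operating system name and prints the resulting command instead of running it. The helper name bash_completion_target is hypothetical, and the paths are the default locations used in the preceding steps; on Apple Silicon Macs, Homebrew typically uses /opt/homebrew rather than /usr/local, so the macOS path may need adjusting (brew --prefix shows the prefix in use).

```shell
# bash_completion_target OS_NAME: print the completion file path used in the
# steps above for the given `uname -s` value. Hypothetical helper; the paths
# assume default install locations and may differ on your system.
bash_completion_target() {
  case "$1" in
    Darwin) echo /usr/local/etc/bash_completion.d/pachctl ;;
    Linux)  echo /usr/share/bash-completion/completions/pachctl ;;
    *)      return 1 ;;
  esac
}

if target=$(bash_completion_target "$(uname -s)"); then
  echo "pachctl completion bash --install --path $target"
else
  echo "No known completion path for $(uname -s)" >&2
fi
```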
Z shell, or zsh, is an improved interactive login shell with many advanced features. It has been the default interactive shell on Macs since macOS Catalina. Follow these steps to enable Pachyderm autocompletion on your computer:
Important note
If you do not wish to enable autocompletion, you can try using pachctl shell instead. To enable this feature, type pachctl shell.
If you are using macOS or Linux with Homebrew, use the following command:
$ brew install zsh-completions
If you are on Linux, visit the https://github.com/zsh-users/zsh-completions page and follow the instructions for your Linux distribution to enable zsh completion. As an example, for Ubuntu 19.10, this would look as follows:
$ echo 'deb http://download.opensuse.org/repositories/shells:/zsh-users:/zsh-completions/xUbuntu_19.10/ /' | sudo tee /etc/apt/sources.list.d/shells:zsh-users:zsh-completions.list
$ curl -fsSL https://download.opensuse.org/repositories/shells:zsh-users:zsh-completions/xUbuntu_19.10/Release.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/shells_zsh-users_zsh-completions.gpg > /dev/null
$ sudo apt update && sudo apt install zsh-completions
If you are on macOS, run the following command:
$ brew info zsh-completions
If you are using Linux, run the following command:
$ complete -p
On macOS, run the following command:
$ pachctl completion zsh --install --path /usr/local/share/zsh-completions/_pachctl
If you are using Linux, run the following command:
$ pachctl completion zsh --install --path /home/linuxbrew/.linuxbrew/share/zsh-completions/_pachctl
With that, Pachyderm command-line autocompletion is now enabled in your zsh shell. Next, let's prepare the Kubernetes environment.
In this section, you will provision a Kubernetes cluster by using the preferred tools that you deployed in the Installing the required tools section.
Follow these steps to enable Kubernetes if you're using Docker Desktop as your container platform to deploy your Kubernetes cluster on both Windows and macOS:
$ kubectl get node
The following is an example of the system's response:
NAME STATUS ROLES AGE VERSION
docker-desktop Ready control-plane,master 7m9s v1.21.5
With that, you have a single-node Kubernetes cluster configured on Docker Desktop. Now, we are ready to deploy Pachyderm on our local Kubernetes environment.
Follow these steps to run Kubernetes locally when using minikube:
$ minikube config set driver docker
$ minikube start
$ kubectl get node
NAME STATUS ROLES AGE VERSION
minikube Ready control-plane,master 29m v1.20.2
With that, your Kubernetes cluster has been configured using minikube. Now, we are ready to deploy Pachyderm on our local Kubernetes environment.
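Before deploying anything, it can help to confirm in a script that every node reports Ready. The following is a minimal sketch that parses the kubectl get node output format shown above; the helper name all_nodes_ready is hypothetical, and it assumes STATUS is the second column.

```shell
# all_nodes_ready: read `kubectl get node` output (with its header row) on
# stdin. Succeeds only if at least one node is listed and every node's STATUS
# column is exactly "Ready". Hypothetical helper for illustration.
all_nodes_ready() {
  awk 'NR > 1 { seen = 1; if ($2 != "Ready") bad = 1 }
       END { if (seen && !bad) exit 0; exit 1 }'
}

# Sample output copied from above. Against a live cluster, use:
#   kubectl get node | all_nodes_ready
sample='NAME STATUS ROLES AGE VERSION
minikube Ready control-plane,master 29m v1.20.2'

if printf '%s\n' "$sample" | all_nodes_ready; then
  echo "All nodes are Ready"
else
  echo "Cluster is not ready yet"
fi
```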
When running Pachyderm in production, it is recommended to start in an environment where resources can scale up to handle the computational needs of larger pipelines. Pachyderm can be installed on any Kubernetes cluster, including managed Kubernetes services provided by AWS, Google Cloud, Microsoft Azure, IBM Cloud, and OpenShift, as well as locally on your workstation. In this section, we are going to focus on a smaller test deployment; therefore, a local cluster is good enough to get started.
Pachyderm provides sample Helm charts to help you deploy Pachyderm to all major cloud platforms. You can read more about Helm Charts in Chapter 2, Pachyderm Basics. Because Helm charts are flexible, you can pick the components that you want to install. For example, you can install the Pachyderm in-browser interface called the Console.
The Pachyderm Console is the Pachyderm user interface and provides a bird's-eye view of your pipelines through the Directed Acyclic Graph (DAG), as well as other useful features.
Some components, such as the Pachyderm Console, require an Enterprise license but are also available for testing with a free trial license for 30 days. You can request a free trial license at https://www.pachyderm.com/trial/.
Follow these steps to install Pachyderm on your local Kubernetes cluster:
$ helm repo add pach https://helm.pachyderm.com
$ helm repo update
$ helm install pachd pach/pachyderm --set deployTarget=LOCAL
If you have an Enterprise key and you would like to deploy it with Pachyderm's console user interface, create a file called license.txt and paste your Enterprise token into that file. Then, run the following commands:
$ helm install pachd pach/pachyderm --set deployTarget=LOCAL --set pachd.enterpriseLicenseKey=$(cat license.txt) --set console.enabled=true
Once the Console has been deployed successfully, follow the instructions provided in the Accessing the Pachyderm Console section to access the Console.
The preceding commands return the following output:
$ kubectl get deployments
The output of the preceding command should look as follows:
$ kubectl get pods
The output of the preceding command should look as follows:
pachctl config import-kube local --overwrite
pachctl config set active-context local
pachctl port-forward
pachctl auth activate
You'll be prompted to log into the UI again. Log in with the mock user called admin and use password as your password.
pachctl version
The output of the preceding command should look as follows:
COMPONENT VERSION
pachctl 2.0.1
pachd 2.0.1
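The --set flags used in the Helm installation step earlier in this section can also be kept in a values file, which keeps the license key off the command line. The following is a minimal sketch, assuming the same three chart keys that were passed with --set (deployTarget, pachd.enterpriseLicenseKey, and console.enabled); the function name write_pachyderm_values and the values.yaml file name are hypothetical.

```shell
# write_pachyderm_values LICENSE_FILE OUTPUT_FILE
# Writes a Helm values file equivalent to the --set flags used earlier.
# Hypothetical helper; assumes the token contains no double quotes or backslashes.
write_pachyderm_values() {
  license=$(cat "$1") || return 1
  cat > "$2" <<EOF
deployTarget: LOCAL
pachd:
  enterpriseLicenseKey: "$license"
console:
  enabled: true
EOF
}

# Usage (commented out so that nothing is deployed by accident):
#   write_pachyderm_values license.txt values.yaml
#   helm install pachd pach/pachyderm -f values.yaml
```

A values file is also easier to review and reuse than a long command line, although the --set form shown earlier works just as well for a quick local test.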
Now that we have installed Pachyderm on our cluster, we are ready to create our first pipeline.
If you have installed the Console with your Pachyderm cluster, you can access it and view your pipelines, repositories, and other Pachyderm objects in the UI. The Pachyderm Console is available as a free trial for 30 days. Follow these steps to access the Pachyderm Console:
pachctl enterprise get-state
The output of the preceding command should look as follows:
Pachyderm Enterprise token state: ACTIVE
Expiration: 2022-02-02 22:35:21 +0000 UTC
pachctl port-forward
Because we have not created any Pachyderm objects, this page is empty.
Now that you have learned how to access the Pachyderm Console, you are ready to create your first pipeline in Pachyderm.
Only perform the steps in this section if you want to delete your cluster. If you want to continue working on the examples in other chapters, then please skip this section.
If you need to delete your deployment and start afresh, wipe out your environment and begin again from the steps provided in the Preparing the Kubernetes environment section. When you delete an existing Pachyderm deployment, all the components, except for the Helm repository and pachctl, are removed from your machine.
Follow these steps to delete your existing Pachyderm deployment:
$ helm ls | grep pachyderm
The output of the preceding command should look as follows:
pachd default 1 2021-11-08 21:33:44 deployed pachyderm-2.0.1 2.0.1
$ helm uninstall pachd
$ minikube stop
$ minikube delete
With that, you have completely removed Pachyderm and the local Kubernetes cluster from your computer.
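Because these commands are destructive, you may want to preview them before running them. The following is a minimal sketch of a dry-run wrapper, assuming the Helm release is named pachd and the cluster was created with minikube, as in the preceding steps; the names run, teardown_local, and DRY_RUN are hypothetical.

```shell
# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# teardown_local: remove the Helm release, then the minikube cluster.
# Hypothetical helper; assumes the release is named pachd and minikube is in use.
teardown_local() {
  run helm uninstall pachd &&
  run minikube stop &&
  run minikube delete
}

# Preview the commands without executing them.
DRY_RUN=1
teardown_local
```

Setting DRY_RUN=0 before calling teardown_local runs the commands for real, stopping at the first one that fails.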
In this chapter, we learned about the software prerequisites for getting Pachyderm up and running on your local computer for testing purposes.
We gained basic knowledge about minikube and Docker Desktop and learned how to install them on our local machine. We also learned how to install the Pachyderm CLI and enable autocompletion on different operating systems.
We then installed Helm and the Pachyderm Helm repository on our system. We learned about Helm charts and how to obtain a free trial Pachyderm license.
We deployed a single-node, local Kubernetes cluster by using the most popular options available based on our desktop operating system. Finally, we deployed Pachyderm and learned how to access the Pachyderm Console.
We also learned how to do so on all three major platforms – macOS, Linux, and Windows.
In the next chapter, we will learn how to install Pachyderm in the cloud and explain the software requirements needed to run a Pachyderm cluster in production. We will also learn about Pachyderm Hub, the Software-as-a-Service (SaaS) version of Pachyderm that is great for both testing and production environments.
Please refer to the following links for more information about the topics that were covered in this chapter: