In this chapter we’ll continue some of the themes from the previous chapter, but now apply the concepts to deploying in the cloud.
Most organizations will have some sort of preference for one of the 3 major clouds, based either on technology differentiators or on dogma around their history with a particular vendor. Many times it really comes down to the question: “which is your organization’s preferred cloud?”.
There are many places that offer hosted VMs, but for the purposes of talking about kubeflow operations we’re going to focus on the big 3 cloud vendors:
Other vendors offer managed kubernetes as a service and are likely good candidates for deploying kubeflow, but for the sake of brevity we’ll focus on the big 3 clouds. Our focus for each cloud offering is how their managed kubernetes is deployed and then what products on the cloud are relevant for integration.
Over the course of this chapter we’ll educate the reader about what each major cloud offers and how it integrates with that cloud’s managed kubernetes offering. This will give the reader a solid view of how appropriate their preferred cloud is for running their kubernetes and kubeflow infrastructure.
Installing kubeflow on a public cloud requires a few things, including:
Beyond vendor dogma, a dominant narrative in infrastructure is how well a system integrates with what already exists, as legacy momentum in enterprise infrastructure holds considerable sway.
All 3 major clouds offer an open source compatible version of managed kubernetes:
Each system has similarities to, and differences from, the others in how we install and integrate kubeflow. There are of course variations in how to do a kubeflow installation on every cloud, such as:
Over the course of the next 3 chapters we’ll introduce the reader to the core concepts for each cloud offering and then show how to install kubeflow specifically on each of the cloud platforms.
In this chapter we give a review of the relevant components to kubeflow operations for Google Cloud and then point out the specific aspects of the cloud to keep in mind.
Kubeflow can run on the Google Cloud Platform via the managed Google Kubernetes Engine (GKE). Let’s start off by taking a tour of the GCP platform.
The Google Cloud Platform (GCP) is a suite of modular services that include:
These services represent a set of physical assets, such as servers, memory, and hard disks, as well as virtual resources such as virtual machines. These assets live in Google data centers around the world, and each data center belongs to a global region.
Each region is a collection of zones, which are separate and distinct from other zones in the same region1. A zone identifier is a combination of the region ID and a letter representing the zone within the region.
An example would be zone “c” in the region “us-east1” and its full ID would be:
us-east1-c
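The construction rule is simple enough to express directly in the shell; this sketch just joins the two parts from the example above:

```shell
REGION="us-east1"   # the region ID
ZONE_LETTER="c"     # the letter identifying the zone within the region

# The full zone identifier is the two parts joined with a hyphen:
ZONE_ID="${REGION}-${ZONE_LETTER}"
echo "$ZONE_ID"   # -> us-east1-c
```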
Some regions have different types of available resources (e.g., some regions do not have GPUs of certain types available, different storage options, etc).
Some of the major services offered on top of the base cloud infrastructure include:
This is not an exhaustive list of services, but we want to highlight for the reader a few relevant services for operating Kubeflow and Kubernetes on Google Cloud Platform. In the following sub-sections we’ll dig deeper in some of the services that are relevant to Kubeflow.
Google Cloud Storage is a RESTful service for the Google Cloud Platform that allows us to store and access data. It is considered Infrastructure as a Service (IaaS) and combines scalability and performance with the security functionality of the Google Cloud Platform. Similar to how Amazon’s S3 is set up, Google Cloud Storage is organized into buckets that are identified by a unique key. Within each unique bucket, we can have individual objects.
All objects are addressable using HTTP URLs with patterns such as:
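As an illustration, a bucket name and an object key combine into a URL like the following (the bucket and object names here are hypothetical):

```shell
BUCKET="my-kubeflow-data"     # hypothetical bucket name
OBJECT="datasets/train.csv"   # hypothetical object key within the bucket

# Objects are commonly addressed with URLs of this shape:
echo "https://storage.googleapis.com/${BUCKET}/${OBJECT}"
```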
There are 4 classes of storage2 offered by the Google Storage service:
All of these storage classes offer low latency and high durability. The main consideration for what type of storage offering we want to use for a particular dataset is how often we’d want to access the dataset.
For data that is accessed most frequently we’d want to use the multi-regional storage offering (e.g., “serving web content”), but we should also consider that the multi-regional storage class costs the most.
When we’re processing data with Google Cloud DataProc or a Google Compute Engine instance then we’d want to use regional storage. In cases where we’re only going to access the data once a month, we can opt to use the nearline storage offering. In the case that we only plan on accessing a dataset once a year (e.g., “disaster recovery”), then we can go with the coldline storage offering which is the cheapest of the 4 offerings.
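The decision rule described above can be sketched as a small helper function; the access-frequency thresholds here are illustrative assumptions, not GCP rules:

```shell
# Hypothetical helper: map expected accesses per month to the storage
# class discussed above. Thresholds are illustrative only.
pick_storage_class() {
  accesses_per_month=$1
  if   [ "$accesses_per_month" -ge 30 ]; then echo "multi-regional"  # e.g., serving web content
  elif [ "$accesses_per_month" -ge 4  ]; then echo "regional"        # active data processing
  elif [ "$accesses_per_month" -ge 1  ]; then echo "nearline"        # about once a month
  else                                        echo "coldline"        # about once a year
  fi
}

pick_storage_class 1   # -> nearline
```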
For more details on how pricing works for the Google Cloud Storage offering, check out the Google Cloud page at:
Google storage upload operations are atomic, providing a consistency model that has strong read-after-write consistency for our upload operations. As described above, the Google storage API is consistent regardless of the storage class selected for the dataset.
To manage access to applications running in Google Cloud’s App Engine standard environment, App Engine flexible environment, Compute Engine, and GKE we need to use the Cloud Identity-Aware Proxy (Cloud IAP).
We are able to use an application-level access control model because Cloud IAP establishes a central authorization layer for applications, such as kubeflow, that are accessed over HTTPS. This gives us an alternative to using network-level firewalls.
We want to use Cloud IAP when we want to manage our users with access control policies for applications and resources. This means we can set up a group policy allowing members of one group, such as our data scientists, to access the system, and another group policy so that engineers working on a different system cannot use kubeflow. In the Figure below we can see how Cloud IAP manages the Authentication and Authorization flow.
Once we’ve granted a user access to an application, Cloud IAP will execute authentication and authorization checks when the user tries to access the Cloud IAP-secured application, as we can see in the figure above.
Roles in the Cloud Identity and Access Management8 (Cloud IAM) control who can access systems protected by Cloud IAP. This allows us to set up fine-grained access controls for users of a product or resource such as kubeflow without requiring a VPN.
A system determines a client’s identity through methods of authentication. We can authenticate to a GCP API by either using a normal user account or by using a GCP service account. We cover service accounts further on in this chapter.
For GKE applications such as Kubeflow, HTTPS requests to kubeflow are sent to the Load Balancer and then routed to Cloud IAP. If Cloud IAP is enabled then the Cloud IAP Authentication Server is queried for information on Kubeflow such as GCP project number, request URL, and Cloud IAP credentials in the request headers or cookies.
Cloud IAP uses OAuth 2.0 to manage the Google Account sign-in flow that the user is directed to if no credentials are found. Once the user signs in, a token is set in the kubeflow user’s browser cookie and stored for later use.
All Google Cloud Platform applications use the OAuth 2.0 protocol for authentication and authorization.
Cloud IAP checks for browser credentials for the current user and, if none are found, redirects the user to the Google Account sign-in web page. Typically this access flow is kicked off by authenticating from the gcloud SDK tools on your local laptop with the command:
gcloud auth login
This command will bring up a browser window loaded with the page shown in the image below.
If you have multiple gmail accounts you may see an account selection screen before you see this screen.
The system checks to see if the request credentials are stored in the system and if so then the system uses these credentials to pull the user’s identity. The user identity is defined as both their email address and user ID. Beyond pulling the user’s identity, the system checks the user’s Cloud IAM role to see if the user has access to the given resource.
Authorization in enterprise infrastructure security9 is about checking to see if a user has access to certain resources. Authorization determines what an authenticated client can access with respect to GCP resources. In the world of authorization we want to create policies showing which user can access which resource and to what extent.
On the Google Cloud Platform the relevant Cloud IAM policies are applied by Cloud IAP to confirm whether the user is allowed to access the desired resource. For example, when we log into the GCP Console project, the system checks to see if we have the IAP-secured Web App User role. If so, we can access the GCP Console project.
A kubeflow user will need the proper IAP-secured Web App User role in our GCP Console project if they want to access the kubeflow application.
We need a GCP project to contain any application that we want to run on GCP. We can access GCP from either the command-line or from the web user interface, which we see below.
From the GCP Console we can create a project that holds any arbitrary application. For Kubeflow, we’ll need a project to contain all of the resources related to the Kubeflow application. To see a list of current projects or create a new GCP project, click on the project drop down list in the top blue bar to the right of the “Google Cloud Platform” logo. We will see a dialog popup similar to the image below.
We’ll use both the GCP Console and the GCP project screen later on in this chapter when we start to set up Kubeflow on our own GCP account. Let’s continue our overview of the GCP platform by taking a look at service accounts.
Sometimes we want an account to represent an application on GCP as opposed to a user on GCP. For this purpose we use “service accounts” on GCP. These service accounts allow our application to access other GCP APIs on our behalf. Service accounts are supported by all GCP applications and are recommended to be used for most server applications.
Whether we are developing locally or building a production system, using service accounts means we don’t end up hard-coding transient user credentials or private API keys into our applications.
Given that kubeflow runs as a long-lived application in a GCP project, it is a good candidate for using service accounts as we’ll see later in this chapter.
GCP exposes its Infrastructure as a Service (IaaS) offering in the form of the Google Compute Engine (GCE). One of the primary features of GCE is how it enables users to launch virtual machines on demand as standard images or custom user-defined images. GCE also uses OAuth 2.0 for authenticating user accounts, as described above, before launching the VM images. GCE can be accessed via the Console application, the REST API for GCP, or via the command-line interface.
The Google Compute Engine can leverage persistent disk storage from the Google Cloud Storage system on GCP. There are many stock images for your favorite operating systems, along with images customized for specific tasks such as deep learning, as seen in the image below.
Some regions offer VMs with the option to attach GPUs10 or TPUs11 for added horsepower for your compute workloads12. A nice feature of the GCE VM system is how you can quickly launch an instance and then use the GCP web ssh terminal to jump right into your instance by leveraging your pre-existing OAuth credentials, as seen in the image below.
Beyond just VMs, however, with Kubeflow deployment we’re interested in specifically deploying containers on a managed kubernetes cluster. For that we’ll need to take a look at the managed Google Kubernetes Engine on GCP.
When dealing with a managed platform, pricing is always a complex subject, but one we need to review. For more information and transparency on understanding what using instances on the Google Compute Engine will cost, check out the pricing guide at:
https://cloud.google.com/compute/all-pricing
This guide gives more information about how different compute resources are priced such as:
Google Compute Engine charges based on a per-second usage metric, so while the billing granularity is fine, we should still understand what we’re signing up for before cranking a cluster up. To better forecast your project’s total cost, also check out the Google Cloud Platform Pricing Calculator:
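Whichever tool you use, the arithmetic behind per-second billing is simple. As a rough back-of-the-envelope sketch (the hourly rate below is a hypothetical placeholder, not a quoted GCP price):

```shell
HOURLY_RATE=0.38   # hypothetical USD/hour rate; check the pricing guide for real numbers
MINUTES_USED=90    # per-second billing means we pay only for time actually used

awk -v rate="$HOURLY_RATE" -v mins="$MINUTES_USED" \
  'BEGIN { printf "approx cost: $%.2f\n", rate / 60 * mins }'
# -> approx cost: $0.57
```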
Managed kubernetes is a kubernetes cluster that runs in a vendor’s data center and can be expanded and shrunk on demand. Google offers a managed kubernetes engine called the Google Kubernetes Engine13.
Later on in this chapter we’ll set up our GCP project, but we’ll still need a managed kubernetes cluster to deploy the kubeflow application onto, and that is where GKE comes into play. GKE offers features14 such as network security with VPC and cluster autoscaling. Google also has Site Reliability Engineers (SREs) working behind the scenes to make sure your cluster has enough networking, compute, and storage resources to operate at a high level. In the image below we can see the main screen for the Google Kubernetes Engine.
From this interface we can create new kubernetes clusters, expand existing clusters, or completely shut down clusters we don’t need to use right now. Later on in this chapter we’ll revisit this part of GCP to start up our kubeflow kubernetes cluster. Let’s now move on to getting the reader signed up to use the Google Cloud Platform so we can get started.
If you don’t already have an account on Google Cloud Platform (GCloud), you can sign up for a free trial at:
https://cloud.google.com/free/
We can see the main sign-up splash screen below:
You can get $300 USD in free credit and you’ll have 12 months to use it which gives you plenty of time to try out kubeflow (just don’t leave those instances running!).
Kubeflow can quickly consume GCP resources. Once you get going it’s easy to setup Kubeflow and then forget you allocated all of those GCP resources. If you don’t plan on using Kubeflow for an extended period of time you’ll likely need to tear down the Kubeflow install or you may get an unexpected GCP bill at the end of the month.
The Google Cloud SDK is a set of client side tools that allow us to work with the Google Cloud Platform. Some of the main command-line tools include:
These tools allow us to use services such as:
There is also a robust set of web user interfaces for the Google Cloud Platform, and we have the option to do most functions from either the command-line or the web user interface.
Before you install Google Cloud’s SDK, make sure to upgrade Python to the latest version to avoid issues (e.g., SSL issues). There are multiple ways to manage Python and its dependencies, and the options will vary depending on what platform you are running on. For OSX users (a typical client), this may be one of a number of package managers, including:
For other operating systems there will be other options; the key point is that Python should be updated to the latest version regardless of what platform you are using.
After you have python updated, take a look at the instructions for getting the GCloud SDK working.
For Mac OSX users, a simple way to do this is to use the interactive installer.
Once you have the GCloud SDK working, log into your GCloud account from the command line with the gcloud auth tool:
gcloud auth login
This will pop up some screens in your web browser asking for permission via OAuth for GCloud tools to access the Google Cloud Platform.
The high-level version of the kubeflow install process is covered in the following steps:
To set up a basic kubeflow install on GCP we have the option of letting the kubeflow installation scripts handle the GKE cluster part for us as part of the kubeflow install.
Production enterprise installations of kubeflow will likely want more control over how kubernetes is set up, so we include instructions in this chapter for setting up kubeflow both ways.
Now we’ll take a closer look at how to perform each step from the kubeflow install step list above.
We need to create a new GCP project to contain our kubeflow installation. In the image below we can see the projects dialog window from previously in this chapter but with a focus on the top half.
To create a new project we click on the “New Project” button in the top right of the dialog window, bringing up the “New Project Dialog Window” as seen in the figure below.
After we have a new project created for GCP, we need to do some configuration work. Specifically we need to enable certain APIs as we’ll see in the next section.
The easiest install of Kubeflow does the GKE cluster installation for us, so we’ll skip those instructions for now. We cover the specifics of creating managed GKE clusters later in this chapter.
We need to enable the following APIs for our new project:
The compute engine API creates and runs VMs on GCP. The Kubernetes engine API allows our tools to create and manage kubernetes clusters on GCP and the Identity APIs allow us to manage users. Finally, the Google Cloud Deployment Manager API allows our tools to programmatically configure, deploy, and view Google Cloud Services.
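For readers who prefer the command line over the console, APIs can typically also be enabled with the gcloud CLI. The service names below are assumptions mapped from the API names above, so verify them in your project; the actual gcloud call is commented out so the loop is safe to run without a GCP project:

```shell
# Service names assumed from the API names listed above; verify before use.
for api in compute.googleapis.com \
           container.googleapis.com \
           iam.googleapis.com \
           deploymentmanager.googleapis.com; do
  echo "enabling ${api}"
  # gcloud services enable "${api}"   # uncomment to actually enable the API
done
```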
Most APIs are disabled by default for new GCP projects, so we need to enable each of the above APIs so that kubeflow can operate correctly. To find the page for enabling APIs, go to the main page for “APIs and Services” on GCP by using the top left drop down navigation panel. Once there, we’ll see a page similar to the image below, and it will have a button at the top that says “Enable APIs and Services”.
When we click on this button it will take us to the GCP API Library page. We can either use the links above to find the specific API pages to enable, or use the search box on the API library as seen below in the image.
When we enable each API we’ll see a screen such as the figure below.
The GCP Free Tier will give us a limited number of resources for our initial kubeflow project, but likely you will want to move past that as you build out your kubeflow deployment.
Some resources (such as CPUs, GPUs, and in-use IP addresses) have quotas on them, and you will hit those quotas periodically, limiting what you can do with kubeflow. As you hit the limitations on a project you’ll have to request increases in a service’s quotas. For more information on how to check and adjust resource quotas, check out the Google Cloud documentation:
The next step in the kubeflow install process for GCP is to set up Cloud IAP (written about previously in this chapter). For testing and evaluation purposes we can skip this step and just use basic authentication (e.g., “username and password”) with kubeflow instead of Cloud IAP.
For production kubeflow installations we’re going to want to use Cloud Identity-Aware Proxy (Cloud IAP).
When we set up Cloud IAP we create an OAuth client for Cloud IAP on GCP that uses the email address to verify the user’s identity. We need to do 4 tasks to get Cloud IAP working with kubeflow:
Let’s now work through each of the above steps.
To get Cloud IAP working we first need to set up our consent screen. We can find this page in the GCP portal at the URL:
https://console.cloud.google.com/apis/credentials/consent
This screen will allow a user of kubeflow to choose whether or not to grant access to their private data. All applications in the associated GCP project will work through this consent screen. We see the OAuth consent screen configuration panel below.
We need to fill out a few specific fields:
For Authorized domains, we want to use something of the pattern:
<GCP Project ID>.cloud.goog
Where the <GCP Project ID> is the GCP Project ID for the project containing the kubeflow deployment, as we can see in the figure below.
In the image above we can see an authorized domain entered for the project ID “kubeflow-book”. We can find our project ID at any time in the GCP Console by clicking on the “project settings” link in the top right drop down menu.
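Using that project ID, the authorized domain can be assembled mechanically; a quick shell sketch using the project ID from the text:

```shell
GCP_PROJECT_ID="kubeflow-book"   # the project ID from the example above

# The authorized domain is the project ID plus the .cloud.goog suffix:
echo "${GCP_PROJECT_ID}.cloud.goog"   # -> kubeflow-book.cloud.goog
```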
Some project configurations will not present the Authorized Domains configuration. If you are using your own domain then add that as well.
Once you have filled out the configuration fields for the OAuth consent screen hit the “Save” button.
On the same page for the OAuth consent screen tab we have the “Credentials” tab. When we click on this tab we’ll see a “create credentials” dialog as shown in the image below.
We need to click the “Create credentials” button and then click “OAuth client ID”. We then select “Web application” under the “Application type” and in the “Name” box we’ll enter any name we choose for the OAuth client ID.
The name we choose here is not the name of the application nor the name of the Kubeflow deployment but a handy label for the OAuth client ID.
Next let’s configure the “Authorized redirect URIs” field with the following URI:
https://iap.googleapis.com/v1/oauth/clientIds/<CLIENT_ID>:handleRedirect
where <CLIENT_ID> is the OAuth client ID (similar to: “abc.apps.googleusercontent.com”). To get the CLIENT_ID we need to first create the OAuth Client ID and save it. Then we will see a dialog similar to the image in the figure below.
The post-creation pop-up dialog will give us the client ID we need to update the redirect URL. We’ll want to save both the client ID and the client secret: we use them to enable Cloud IAP and again as environment variables during the kubeflow install process later in the chapter.
To complete the Redirect URI entry click on the edit button for the newly created OAuth 2.0 client ID as in the image below.
When we click on the pencil icon in the image above, it takes us back to the client ID edit page and we can insert the new client ID in the template of:
https://iap.googleapis.com/v1/oauth/clientIds/<CLIENT_ID>:handleRedirect
which ends up looking like the image below based on the client ID in the image above:
This should complete your OAuth 2.0 setup.
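As a quick sanity check, the redirect URI can be assembled in the shell from the client ID; the client ID here is the same placeholder used earlier in the text:

```shell
CLIENT_ID="abc.apps.googleusercontent.com"   # placeholder OAuth client ID

# The redirect URI embeds the OAuth client ID in a fixed template:
echo "https://iap.googleapis.com/v1/oauth/clientIds/${CLIENT_ID}:handleRedirect"
```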
The URI for the Authorized redirect URI is not dependent on the kubeflow deployment or endpoint. We may have multiple kubeflow deployments using the same OAuth client and we would not have to modify the redirect URIs.
Now we’re ready to continue our installation of kubeflow.
To deploy kubeflow from the command-line on GCP we’ll use the kfctl command-line tool included with the kubeflow project. The command-line install process gives us more control over how the deployment process works.
If you’d like to use a GUI for the install, check out the online resource on deploying kubeflow using the deployment UI.
The prerequisites for the command-line installation are:
If you are using Cloud shell15 for certain command-line operations you should enable boost mode16. Boost mode temporarily boosts the power of your Cloud Shell VM which helps with certain install operations.
To install kubeflow on GCP we have to complete the following steps:
Let’s now work through each of these high-level steps in detail.
We need to authenticate our local client with OAuth2 credentials for the Google Cloud SDK tools. We do this with the command:
gcloud auth application-default login
This command will immediately pop-up the Google Authentication web page below in your web browser.
Once we press the “Allow” button the command-line will show output similar to:
Now let’s move on and create the required local environment variables.
Normally as a developer we interact with GCP via the gcloud command with the auth command:
gcloud auth login
Once we run this command from our terminal, the system gets our credentials and stores them in ~/.config/gcloud/, where the gcloud tool will find them automatically. However, code interacting with GCP via the SDK, in a system such as Kubeflow, will not automatically pick up these credentials; for that we want to use the variation of the auth command:
gcloud auth application-default login
The difference in this variation of the auth command is that the credentials are stored in ‘the well-known location for Application Default Credentials’17. This enables our SDK-based code to find the credentials in a consistent fashion18 automatically.
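On a typical Linux or macOS workstation, the Application Default Credentials land in a well-known file under the gcloud config directory. This sketch shows the default path where SDK code will look (the path can be overridden with the GOOGLE_APPLICATION_CREDENTIALS environment variable):

```shell
# Default ADC location on Linux/macOS; Windows keeps it under %APPDATA%\gcloud.
ADC_FILE="${HOME}/.config/gcloud/application_default_credentials.json"
echo "$ADC_FILE"

# SDK code can also be pointed at a specific key file explicitly:
# export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
```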
Now that we have our Cloud IAP system set up on GCP we need to configure the local kfctl tool to be able to use it. Let’s create environment variables for CLIENT_ID and CLIENT_SECRET, as we can see in the sample code below. This is where we’ll use the information we saved previously for the OAuth client ID on GCP.
export CLIENT_ID=<CLIENT_ID from OAuth page>
export CLIENT_SECRET=<CLIENT_SECRET from OAuth page>
We can see the same information saved previously being used below for CLIENT_ID:
josh$ export CLIENT_ID=1009144991723-3hf8m978i03v1tfep28302afnilufe1i.apps.googleusercontent.com
josh$ echo $CLIENT_ID
1009144991723-3hf8m978i03v1tfep28302afnilufe1i.apps.googleusercontent.com
And then the same export command being used for CLIENT_SECRET:
josh$ export CLIENT_SECRET=fLT_u5KnCc1oQVYKbMmkoh0d
josh$ echo $CLIENT_SECRET
fLT_u5KnCc1oQVYKbMmkoh0d
We now have our environment variables set up for Cloud IAP authentication on GCP.

Now let’s move on and set up the kfctl tool.
We need to download and set up the kfctl tool before we can continue our kubeflow installation. kfctl is similar to kubectl but specialized for kubeflow.
Download a kfctl release from the Kubeflow releases page at:
https://github.com/kubeflow/kubeflow/releases/
Now unpack the downloaded tarball for the release with the command:
tar -xvf kfctl_<release tag>_<platform>.tar.gz
Optionally we can add the kfctl location to our path, which is recommended unless we want to type the full path to kfctl each time we use it.
export PATH=$PATH:<path to kfctl in your kubeflow installation>
Now when we type the kfctl command from anywhere, we should see:
josh$ kfctl
A client CLI to create kubeflow applications for specific platforms or 'on-prem' to an existing k8s cluster.

Usage:
  kfctl [command]

Available Commands:
  apply       Deploy a generated kubeflow application.
  completion  Generate shell completions
  delete      Delete a kubeflow application.
  generate    Generate a kubeflow application where resources is one of 'platform|k8s|all'.
  help        Help about any command
  init        Create a kubeflow application under <[path/]name>
  show        Show a generated kubeflow application.
  version     Print the version of kfctl.

Flags:
  -h, --help   help for kfctl

Use "kfctl [command] --help" for more information about a command.
We’re now ready to configure kfctl for our specific environment.
We need to set up the following environment variables for use with kfctl:
The first environment variable we need is the zone in which we want to deploy our GKE cluster and kubeflow installation.
If we recall from earlier in the chapter, a zone identifier19 is a combination of the region ID and a letter representing the zone within the region. An example would be zone “c” in the region “us-east1”, whose full ID would be:
us-east1-c
Some regions have different types of available resources (e.g., some regions do not have GPUs of certain types available, different storage options, etc).
To configure our zone, check out the options in the Google cloud docs at:
https://cloud.google.com/compute/docs/regions-zones/
Once you have selected a zone ID to use, set up the $ZONE environment variable with the command below:
export ZONE=<your target zone>
Now that we have a zone selected, let’s move on and configure the project ID for the GCP project.
We’ll do this with the command below.
export PROJECT=<your GCP project ID>
We can find our GCP project ID (not to be confused with the “project name”) by clicking on the vertical 3 dots in the upper right-hand corner of the Google Cloud Console screen, as seen in the figure below:
From here we click on the “Project settings” item and we should see a panel similar to the figure below.
Grab the “Project ID” from the middle input box and paste it into the export line above. Once you execute this line in your local terminal you should have your project ID as an environment variable.
Now we need to set the environment variable for the name of the directory where we want our kubeflow configurations to be stored ($KFAPP). We do this with the export command below.
export KFAPP=<your choice of application directory name>
When we start using the kfctl command in the next section, we will pass this directory name to “kfctl init”. This environment variable value also becomes the name of your kubeflow deployment.
The value of KFAPP cannot be greater than 25 characters and must consist of lowercase alphanumeric characters or '-', starting and ending with an alphanumeric character (for example, ‘kubeflow-test’ or ‘kfw-test’). It must also contain only the directory name (not the full path to the directory).
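Those naming rules can be checked mechanically before running kfctl. Here is a small hypothetical validator; the regular expression simply encodes the constraints above:

```shell
# Returns success if the name is <= 25 chars of lowercase alphanumerics
# or '-', starting and ending with an alphanumeric character.
valid_kfapp() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]([a-z0-9-]{0,23}[a-z0-9])?$'
}

valid_kfapp "kubeflow-test" && echo "ok"         # passes the rules
valid_kfapp "-bad-name-"    || echo "rejected"   # starts/ends with '-'
```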
We need to specify a configuration file during the ‘init’ phase of the kubeflow install. We see this in the form of the example kfctl command below.
kfctl init ${KFAPP} --config=${CONFIG} ...
We need to set our CONFIG environment variable to point to the kubeflow deployment configuration:
export CONFIG=<url of your configuration file for init>
Kubeflow deployment configuration files are yaml files for specific ways to deploy kubeflow (such as “on GCP with IAP enabled”20).
Once we have the above environment variables set we can use kfctl to deploy kubeflow on GCP.
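Pulling the pieces together, a complete environment setup might look like the following; every value here is illustrative and must be replaced with your own:

```shell
# Illustrative values only -- substitute your own zone, project, and names.
export ZONE="us-east1-c"
export PROJECT="kubeflow-book"
export KFAPP="kubeflow-test"
export CONFIG="<url of your configuration file for init>"

# Quick sanity check of what we are about to deploy:
echo "deploying ${KFAPP} to ${PROJECT} in ${ZONE}"
```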
Kubeflow deployment consists of 3 major steps:
We’ll start with the init sub-command to perform the one-time setup for our project install.
To run the default initialization of your kubeflow install using kfctl and Cloud IAP run the following command:
# Run the following commands for the default installation which uses Cloud IAP:
export CONFIG="https://raw.githubusercontent.com/kubeflow/kubeflow/c54401e/bootstrap/config/kfctl_gcp_iap.0.6.2.yaml"
kfctl init ${KFAPP} --project=${PROJECT} --config=${CONFIG} -V
If you want to only use basic authentication (e.g., for evaluation) then we’d use the command:
# Alternatively, run these commands if you want to use basic authentication:
export CONFIG="https://raw.githubusercontent.com/kubeflow/kubeflow/c54401e/bootstrap/config/kfctl_gcp_basic_auth.0.6.2.yaml"
kfctl init ${KFAPP} --project=${PROJECT} --config=${CONFIG} -V --use_basic_auth
The 'kfctl init ...' command above will generate a subdirectory with the ${KFAPP} name that stores our kubeflow configurations during deployment. The directory contains the following contents:
drwxr-xr-x  6 josh  staff   204 Jul 19 15:36 .
drwxr-xr-x  4 josh  staff   136 Jul 18 17:37 ..
drwxr-xr-x  3 josh  staff   102 Jul 18 17:37 .cache
-rw-r--r--  1 josh  staff  2151 Jul 19 15:36 app.yaml
We can see in the subdirectory the app.yaml file that defines configurations related to the kubernetes kubeflow application deployment. If we take a look at the contents of the app.yaml file we’ll see something like the console output below.
apiVersion: kfdef.apps.kubeflow.org/v1alpha1
kind: KfDef
metadata:
  creationTimestamp: null
  name: sept-kf-test-0
  namespace: kubeflow
spec:
  appdir: /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0
  componentParams:
    application:
    - name: overlay
      value: application
    argo:
    - name: overlay
      value: istio
    basic-auth-ingress:
    - name: namespace
      value: istio-system
This app.yaml file stores the primary kubeflow configuration in the form of a KfDef kubernetes object. The values in the KfDef object are set when we run ‘kfctl init’ and are made self-contained in the app by being snapshotted during the init process. The lines of the yaml file define each kubeflow application as a kustomize package.
If we change our working directory into the ${KFAPP} subdirectory and then run the generate sub-command we can create configuration files defining various resources as seen in the next lines.
cd ${KFAPP}
kfctl generate all -V --zone ${ZONE}
This command will generate output as seen in the listing below:
josh$ kfctl generate all -V --zone ${ZONE}
INFO[0000] Downloading /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/app.yaml to /var/folders/_x/fn4y2g8s0155ksjkl96xm8fc0000gn/T/793110509/app.yaml  filename="v1alpha1/application_types.go:334"
WARN[0000] Defaulting Spec.IpName to sept-kf-test-0-ip. This is deprecated; IpName should be explicitly set in app.yaml  filename="coordinator/coordinator.go:572"
WARN[0000] Defaulting Spec.Hostame to sept-kf-test-0.endpoints.kubeflow-book.cloud.goog. This is deprecated; Hostname should be explicitly set in app.yaml  filename="coordinator/coordinator.go:583"
WARN[0000] Defaulting Spec.Zone to us-east1-c. This is deprecated; Zone should be explicitly set in app.yaml  filename="coordinator/coordinator.go:593"
INFO[0000] Writing stripped KfDef to /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/app.yaml  filename="v1alpha1/application_types.go:626"
INFO[0000] Downloading /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/app.yaml to /var/folders/_x/fn4y2g8s0155ksjkl96xm8fc0000gn/T/596191336/app.yaml  filename="v1alpha1/application_types.go:334"
INFO[0000] ****************************************************************
Notice anonymous usage reporting enabled using spartakus
To disable it
If you have already deployed it run the following commands:
  cd $(pwd)
  kubectl -n ${K8S_NAMESPACE} delete deploy -l app=spartakus
For more info: https://www.kubeflow.org/docs/other-guides/usage-reporting/
****************************************************************  filename="coordinator/coordinator.go:184"
INFO[0000] /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/.cache/kubeflow exists; not resyncing  filename="v1alpha1/application_types.go:424"
This will produce the following directory structure:
josh$ ls -la
drwxr-xr-x   6 josh  staff    204 Sep  5 16:43 .
drwxr-xr-x   8 josh  staff    272 Sep  5 15:49 ..
drwxr-xr-x   4 josh  staff    136 Sep  5 15:49 .cache
-rw-r--r--   1 josh  staff  12003 Sep  5 16:43 app.yaml
drwxr-xr-x   9 josh  staff    306 Sep  5 16:43 gcp_config
drwxr-xr-x  40 josh  staff   1360 Sep  5 16:43 kustomize
Now we see two new directories alongside the existing app.yaml file:
The gcp_config subdirectory contains the Deployment Manager configuration files21 for the GCP platform. This directory holds configurations specific to the chosen cloud platform (in this case, GCP). If we want to customize the platform configuration, we need to edit the app.yaml file and re-run both 'kfctl generate ...' and 'kfctl apply ...' (covered below).
The other new subdirectory, kustomize/, contains the kubeflow application manifests: the kustomize packages for the kubeflow applications included in your deployment. If we want to further customize our application manifests, we'd edit the KfDef object in the app.yaml file and re-run 'kfctl generate ...' and 'kfctl apply ...'.
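Since the KfDef in app.yaml is plain YAML, these customizations can also be scripted. The sketch below is illustrative only: it stubs out a pared-down app.yaml shaped like the one 'kfctl init' generated (the replacement namespace value is hypothetical) and edits one componentParams entry with sed.

```shell
# Illustrative only: create a minimal app.yaml like the one kfctl init
# generated, then script an edit to one componentParams value.
cat > app.yaml <<'EOF'
apiVersion: kfdef.apps.kubeflow.org/v1alpha1
kind: KfDef
metadata:
  name: sept-kf-test-0
  namespace: kubeflow
spec:
  componentParams:
    basic-auth-ingress:
    - name: namespace
      value: istio-system
EOF

# Point the ingress at a different namespace (hypothetical value):
sed -i.bak 's/value: istio-system/value: my-ingress-ns/' app.yaml
grep 'value:' app.yaml
```

After an edit like this we would re-run 'kfctl generate all -V' and 'kfctl apply all -V' to regenerate and push the change.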
Let’s now deploy our locally defined application to the GCP platform with the command:
kfctl apply all -V
When we run the above command we should see output similar to the listing below.
josh$ kfctl apply all -V
INFO[0000] Downloading /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/app.yaml to /var/folders/_x/fn4y2g8s0155ksjkl96xm8fc0000gn/T/921294592/app.yaml  filename="v1alpha1/application_types.go:334"
INFO[0000] Writing stripped KfDef to /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/app.yaml  filename="v1alpha1/application_types.go:626"
INFO[0000] Downloading /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/app.yaml to /var/folders/_x/fn4y2g8s0155ksjkl96xm8fc0000gn/T/030150751/app.yaml  filename="v1alpha1/application_types.go:334"
INFO[0000] Creating default token source  filename="gcp/gcp.go:170"
INFO[0000] Creating GCP client.  filename="gcp/gcp.go:182"
INFO[0000] Reading config file: /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/gcp_config/storage-kubeflow.yaml  filename="gcp/gcp.go:250"
INFO[0000] Reading import file: /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/gcp_config/storage.jinja  filename="gcp/gcp.go:286"
INFO[0001] Creating deployment sept-kf-test-0-storage  filename="gcp/gcp.go:424"
INFO[0001] Reading config file: /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/gcp_config/cluster-kubeflow.yaml  filename="gcp/gcp.go:250"
INFO[0001] Reading import file: /Users/josh/Documents/workspace/PattersonConsulting/kubeflow_book_tests/sept-kf-test-0/gcp_config/cluster.jinja  filename="gcp/gcp.go:286"
INFO[0001] Creating deployment sept-kf-test-0  filename="gcp/gcp.go:424"
...
The kubeflow application should now be deployed on GCP. It may take a few minutes to show up as deployed in the GCP console, but from the administrator's perspective the deployment commands are done.
If we want to confirm we successfully deployed kubeflow we can use the command:
kubectl -n kubeflow get all
which should show output similar to:
NAME                                                         READY   STATUS    RESTARTS   AGE
pod/admission-webhook-bootstrap-stateful-set-0               1/1     Running   0          8m
pod/admission-webhook-deployment-57bf887886-5pwtw            1/1     Running   0          7m49s
pod/application-controller-stateful-set-0                    1/1     Running   0          8m36s
pod/argo-ui-67659f4795-k4khf                                 1/1     Running   0          8m21s
pod/centraldashboard-85dbd5d544-5qnqk                        1/1     Running   0          8m13s
pod/cert-manager-59d9f6f68f-6bw96                            1/1     Running   0          5m27s
pod/cloud-endpoints-controller-64587687b4-tmknc              1/1     Running   0          5m19s
pod/jupyter-web-app-deployment-5c8dc58f-2qxht                1/1     Running   0          7m52s
pod/katib-controller-79578f6f89-mfwr8                        1/1     Running   1          7m34s
pod/katib-db-7dcbc9c964-j6tnd                                1/1     Running   0          7m47s
pod/katib-manager-7b8f875b8f-2wg9n                           1/1     Running   1          7m42s
pod/katib-manager-rest-58867b555f-ggg9s                      1/1     Running   0          7m42s
pod/katib-suggestion-bayesianoptimization-5b86c549f9-5q6cw   1/1     Running   0          7m7s
pod/katib-suggestion-grid-db47c7869-7v68m                    1/1     Running   0          7m6s
pod/katib-suggestion-hyperband-749c777fc7-pvmlf              1/1     Running   0          7m6s
...
And then the services deployed:
...
NAME                                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service/admission-webhook-service               ClusterIP   10.35.252.135   <none>        443/TCP        8m6s
service/application-controller-service          ClusterIP   10.35.245.6     <none>        443/TCP        8m37s
service/argo-ui                                 NodePort    10.35.241.46    <none>        80:30067/TCP   8m22s
service/centraldashboard                        ClusterIP   10.35.249.34    <none>        80/TCP         8m14s
service/cloud-endpoints-controller              ClusterIP   10.35.244.164   <none>        80/TCP         5m20s
service/jupyter-web-app-service                 ClusterIP   10.35.249.25    <none>        80/TCP         7m53s
service/katib-controller                        ClusterIP   10.35.244.62    <none>        443/TCP        7m35s
service/katib-db                                ClusterIP   10.35.241.104   <none>        3306/TCP       7m48s
service/katib-manager                           ClusterIP   10.35.251.241   <none>        6789/TCP       7m43s
service/katib-manager-rest                      ClusterIP   10.35.251.221   <none>        80/TCP         7m44s
service/katib-suggestion-bayesianoptimization   ClusterIP   10.35.249.241   <none>        6789/TCP       7m11s
service/katib-suggestion-grid                   ClusterIP   10.35.245.31    <none>        6789/TCP       7m10s
service/katib-suggestion-hyperband              ClusterIP   10.35.248.253   <none>        6789/TCP       7m9s
service/katib-suggestion-nasrl                  ClusterIP   10.35.246.193   <none>        6789/TCP       7m9s
service/katib-suggestion-random                 ClusterIP   10.35.250.250   <none>        6789/TCP       7m8s
...
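Eyeballing long pod listings gets tedious, so a one-liner can flag anything not yet Running. The sketch below is hypothetical: sample_pods feeds in a few rows shaped like the pod listing from this deployment; against a live cluster you'd pipe in 'kubectl -n kubeflow get pods --no-headers' instead.

```shell
# Count pods whose STATUS column (field 3) is not "Running".
# sample_pods stands in for: kubectl -n kubeflow get pods --no-headers
sample_pods() {
cat <<'EOF'
admission-webhook-bootstrap-stateful-set-0   1/1   Running   0   8m
centraldashboard-85dbd5d544-5qnqk            1/1   Running   0   8m13s
katib-db-7dcbc9c964-j6tnd                    0/1   Pending   0   7m47s
EOF
}

not_running=$(sample_pods | awk '$3 != "Running" {n++} END {print n+0}')
echo "pods not Running: $not_running"
```

A count of zero is a reasonable quick signal that the deployment has settled.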
Once we have confirmed that kubeflow is deployed on GKE, let's take a look at the main user interface and check out the kubeflow application.
Once we have kubeflow installed, we can use a web browser to log into the main user interface included with kubeflow as seen in the image below.
We can access this page from the following URI (typically several minutes after the deployment process completes on GCP):
https://<deployment-name>.endpoints.<project>.cloud.goog/
The delay before the URI becomes available is due to kubeflow needing to provision a signed SSL certificate and register a DNS name on GCP.
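If we're scripting the post-deploy steps, a simple retry loop covers this wait. The probe below is a stand-in that succeeds on its third attempt, so the loop's behavior can be seen offline; against a real deployment the probe would be something like `curl -sSf -o /dev/null https://<deployment-name>.endpoints.<project>.cloud.goog/` with a longer sleep interval.

```shell
# Hypothetical probe: fails twice, then succeeds, simulating the SSL/DNS
# provisioning delay. Replace with a real curl check in practice.
attempts=0
probe() {
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

until probe; do
  echo "endpoint not ready (attempt $attempts); retrying..."
  sleep 1    # use a longer interval (e.g. 30s) against a real endpoint
done
echo "endpoint ready after $attempts attempts"
```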
Now let’s move on to look at how the deployment process works and what was deployed in more detail.
The kfctl command works similarly to the kubectl command in practice. kfctl has 4 sub-commands:
All of the sub-commands except init take an argument, which can be one of the following:
The "platform" argument represents anything that does not run on kubernetes, meaning all GCP resources. Resources that do run on kubernetes are indicated by the "k8s" argument. The union of both groups is represented by the "all" argument.
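This split matters when only one half changes: touching a GCP resource doesn't require redeploying the kubernetes applications, and vice versa. The stub below only echoes the kfctl invocations so the workflow can be shown without a live project; remove the stub function to run the real commands.

```shell
# Stub kfctl so the invocations below only print what they would run.
kfctl() { echo "would run: kfctl $*"; }

kfctl apply platform -V   # GCP (non-kubernetes) resources only
kfctl apply k8s -V        # kubernetes resources only
kfctl apply all -V        # the union of both groups above
```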
Deploying kubeflow requires 3 service accounts to be created on the GCP platform in your project:
Each of the above service accounts is created following the principle of least privilege22.
If we take a look at the main GKE console page23 on GCP, we’ll see the image below with our newly deployed GKE cluster listed.
From this page we can investigate the deployments, workloads, and other components installed on GCP as part of the kubeflow installation process.
The kubeflow installation process creates a separate deployment24 for the data storage in our kubeflow application. Post-install we should see something similar to the following image in our GCP console page for deployments.
When we deploy kubeflow we get two deployments:
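These deployments can also be inspected from the command line with the Deployment Manager CLI. As before, the gcloud stub below only echoes the calls; remove it to run them against your project (the describe target follows the <app-name>-storage naming seen in the kfctl apply output earlier).

```shell
# Stub gcloud so the calls can be shown without cloud credentials.
gcloud() { echo "would run: gcloud $*"; }

gcloud deployment-manager deployments list
gcloud deployment-manager deployments describe sept-kf-test-0-storage
```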
Now let’s take a quick look at the workloads deployed on GCP as part of kubeflow.
In kubernetes clusters a workload represents a deployable unit of compute that can be created and administered in the cluster. In the image below we can see a list of the available workloads associated with our newly deployed default kubeflow application on GCP.
These workloads were deployed via kfctl in the previous set of kubeflow installation commands and represent the applications, daemons, and batch jobs needed to operate our custom kubeflow installation on GCP.
We can see there are two types of workloads deployed for our kubeflow application, Stateful Sets and Deployments.
Stateful sets are applications that need their state to persist; they use persistent storage (such as persistent volumes) to save data for later use by the applications.
Under our kubernetes cluster detail page if we click on the storage tab we can see the associated persistent volumes and storage classes for our cluster as seen in the image below.
At the bottom of the page we can see the storage classes associated with the persistent volumes.
Kubeflow and kubernetes use storage classes to specify a class of storage that has specific performance capabilities (depending on the use case).
In the image above we can see the default storage class type “pd-standard” that comes with the stock install on GCP.
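If pd-standard is too slow for a workload, the GCE persistent disk provisioner also supports SSD-backed disks. The sketch below writes a hypothetical "fast-ssd" StorageClass manifest (the name is our own; the provisioner and type parameter are standard GKE values); in practice you'd feed it to 'kubectl apply -f'.

```shell
# Hypothetical StorageClass using SSD persistent disks (type: pd-ssd)
# instead of the default pd-standard.
cat > fast-storageclass.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
EOF

grep 'type:' fast-storageclass.yaml
```

Persistent volume claims that name this class will then be provisioned on SSD disks.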
In some cases we'll want to create our GKE cluster manually. This section describes how to create a GKE cluster from the command line with the Google Cloud SDK tools installed locally.
We need a project on Google Cloud Platform to organize our resources, as described previously in this chapter. Once you have the project created, confirm it shows up from the command line with:
gcloud projects list
This command lists all of the projects in our Google Cloud Platform account. The output should be similar to:
PROJECT_ID               NAME              PROJECT_NUMBER
kubeflow-project         kubeflow-project  781217520374
kubeflow-codelab-229018  kubeflow-codelab  919126119217
Now we need to create a kubernetes cluster for our project on Google Cloud Platform. First we set our current working project from the command line, then create the cluster:
PROJECT_ID=kubeflow-project
gcloud config set project $PROJECT_ID
gcloud container clusters create [your-cluster-name-here] \
    --zone us-central1-a --machine-type n1-standard-2
Note that the name of the project and the project ID may not be exactly the same, so be careful: most of the time we want to use the PROJECT_ID of our project when working from the command line. It may take 3-5 minutes for GCP to complete the kubernetes cluster setup.
Instructions for installing kubectl locally with the brew command are in chapter 3. kubectl works the same for on-premise kubernetes clusters as it does for cloud-managed clusters; we just have to configure it properly.
When using kubectl from the command-line we need permission for it to talk to our remote managed kubernetes cluster on GCP. We get the credentials for our new kubernetes cluster with the command:
gcloud container clusters get-credentials kubeflow-codelab --zone us-central1-a
This command writes a context entry into our local ~/.kube/config file so kubectl knows which cluster we're currently working with. If you work with multiple clusters, each cluster's context information is stored in this file as well.
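When juggling several clusters, the relevant commands are 'kubectl config get-contexts' to list contexts and 'kubectl config use-context <name>' to switch between them. The sketch below stubs a tiny kubeconfig in a temp file (the context name follows gcloud's gke_<project>_<zone>_<cluster> convention) just to show where the current-context entry lives; against your real config, drop the KUBECONFIG override.

```shell
# Write a minimal kubeconfig to a temp file so we can inspect it safely.
export KUBECONFIG="$(mktemp)"
cat > "$KUBECONFIG" <<'EOF'
apiVersion: v1
kind: Config
current-context: gke_kubeflow-project_us-central1-a_kubeflow-codelab
EOF

# kubectl config get-contexts        # list every context in the file
# kubectl config use-context NAME    # rewrites the current-context entry
grep 'current-context' "$KUBECONFIG"
```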
Once we can connect to our kubernetes cluster with kubectl, we can check out the status of the running cluster with the command:
kubectl cluster-info
We should see output similar to below:
Kubernetes master is running at https://31.239.115.73
GLBCDefaultBackend is running at https://31.239.115.73/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy
Heapster is running at https://31.239.115.73/api/v1/namespaces/kube-system/services/heapster/proxy
KubeDNS is running at https://31.239.115.73/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://31.239.115.73/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
Now that we have an operational GKE cluster, we can install kubeflow or other applications manually.
We’ll quickly cover a few key cluster options for GKE clusters in this section.
Check out the kubeflow resource page for troubleshooting GKE:
We can resize a GKE cluster25 from the command line with the gcloud command or we can use the GCP web interface.
From the command line with the Google Cloud SDK tools:
gcloud container clusters resize [CLUSTER_NAME] --node-pool [POOL_NAME] --size [SIZE]
From the Google Cloud Web Console:
To delete a GKE cluster26, we can use either the command-line or the web console:
From the command-line:
gcloud container clusters delete [CLUSTER_NAME]
From the web console:
1 https://cloud.google.com/compute/docs/regions-zones/
2 https://cloud.google.com/storage/docs/storage-classes
3 https://cloud.google.com/storage/docs/storage-classes#multi-regional
4 https://cloud.google.com/storage/docs/storage-classes#regional
5 https://cloud.google.com/storage/docs/storage-classes#nearline
6 https://cloud.google.com/storage/docs/storage-classes#coldline
7 https://cloud.google.com/iap/docs/concepts-overview
8 https://cloud.google.com/iam/docs/understanding-roles
9 https://en.wikipedia.org/wiki/Authorization
10 https://cloud.google.com/gpu/
11 https://cloud.google.com/tpu/
12 http://www.pattersonconsultingtn.com/blog/datascience_guide_tensorflow_gpus.html
13 https://cloud.google.com/kubernetes-engine/
14 https://cloud.google.com/kubernetes-engine/docs/concepts/
15 https://cloud.google.com/shell/
16 https://cloud.google.com/shell/docs/features#boost_mode
17 https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login
18 https://cloud.google.com/docs/authentication/production
19 https://cloud.google.com/compute/docs/regions-zones/
20 https://raw.githubusercontent.com/kubeflow/kubeflow/c54401e/bootstrap/config/kfctl_gcp_iap.0.6.2.yaml
21 https://cloud.google.com/deployment-manager/docs/configuration/
22 https://en.wikipedia.org/wiki/Principle_of_least_privilege
23 https://console.cloud.google.com/kubernetes
24 https://console.cloud.google.com/dm/deployments
25 https://cloud.google.com/kubernetes-engine/docs/how-to/resizing-a-cluster
26 https://cloud.google.com/kubernetes-engine/docs/how-to/deleting-a-cluster