Chapter 2. Planning a Kubeflow Installation

Planning a new Kubeflow installation is key to building a successful platform for your machine learning operations. This chapter introduces the key topics your team needs to consider when planning a new Kubeflow cluster.

It’s important to keep in mind that Kubeflow itself runs atop Kubernetes, and though a deep understanding of Kubernetes is not a prerequisite to following the topics and installation and configuration steps outlined here, a working knowledge of Kubernetes is beneficial.

Additionally, though Kubeflow can be deployed on top of an existing Kubernetes cluster, this chapter assumes that a new Kubernetes cluster is being created specifically for Kubeflow. Although the narrative is written for a dedicated Kubeflow installation, the topics transfer easily if Kubeflow is being deployed to an existing Kubernetes environment.

This chapter discusses the various types of users of a Kubeflow installation, deciding which components of Kubeflow to deploy, how storage will be allocated to users, installation on-premise vs. the cloud, security requirements, and hardware considerations.

Users

Prior to planning a Kubeflow installation, it is important to take a moment to consider the user community who will be using Kubeflow on a day-to-day basis [for machine learning], as well as the users responsible for installing, managing, and overseeing its operation.

Understanding the varying skillsets of these different user communities, and how Kubeflow can bridge potential gaps between them, helps in fostering input and buy-in from each of these groups.

Profiling Users

Historically, Data Science has been an insular practice in which a Data Scientist is huddled away with their own laptop or personal server, many times using tools like Anaconda locally. Eventually, after considerable effort by the Data Scientist, a machine learning model emerges that, ideally, is useful to a line of business in the organization. The Data Scientist then looks to incorporate the model into a production pipeline or application, such as by integrating the model or its outputs into an application.

As Data Science workflows and model deployment expectations have grown more complex, organizations have struggled with the inherent overhead of maintaining the infrastructure to facilitate model training and model deployments. This complexity is further compounded when the Data Scientist has been off on their own, eventually emerging with a model produced using their own tooling that Operations Teams may struggle to integrate and support.

A classic example of this is where a Data Scientist builds a model with a specific version of TensorFlow and then sends the frozen model graph to the Operations Team. The Operations Team, not being experts in TensorFlow, are unable to load the model in a standardized manner, such as in the workflow of deploying to an ordinary webserver or database system, and are now faced with two options:

  1. design, deploy, and maintain a custom model serving process
  2. duplicate the model to each application that wants to integrate with the model outputs

Many times this scenario puts the Data Science team and the Operations Team at odds: the Data Scientist has specific expectations for their models, while the Operations Team will attempt to generalize the deployment for ease of maintainability.

Other factors that might make the relationship between the Data Scientists and the Operations Teams more complex include:

  • versions of tooling the Data Scientists want to use in their workflows
  • shared access to GPUs for modeling jobs
  • secure access to sensitive corporate and customer data
  • Data Scientists wanting to run distributed training jobs on shared infrastructure
  • container management
  • portability: Data Scientists want to try a workload out on the public cloud and then move it back on-premise, or vice versa

At a higher level, we can roll most of these issues into the following requirements:

  1. provide secure access to data
  2. provide a shared multi-tenant compute infrastructure (CPU, GPU)
  3. allow for flexibility in versioning of tools
  4. a generally supportable method to deploy and integrate models
  5. portability in where model training and inference can be run (on-premise vs. cloud, etc.)

Fortunately, a properly installed and working Kubeflow system alleviates many of these potential issues; Kubeflow provides tooling for all of the above items, and as operators, our focus needs to be mostly on the Kubeflow system itself and how it affects the various users.

Varying Skillsets

To better characterize our users, we can roll them into three major groups:

  1. Data Scientists
  2. Data Engineers1
  3. DevOps Engineers

The Data Scientists typically want to take the Python code they are running locally and run it on more powerful hardware, with access to data they cannot move to their own laptop or server.

Data scientists typically know the basics of:

  • working with a terminal and shell scripting
  • Python
  • Jupyter Notebooks
  • machine learning fundamentals

The Data Scientists are supported by both Data Engineers and DevOps Engineers.

A Data Engineer2 works with3 a Data Scientist (though, from an organizational perspective, they may be under the Data Warehouse team, or may be attached to the Data Science team). A Data Engineer typically knows:

  • where to get certain data from the system of record
  • how to build ETL jobs to get the data in a form to be vectorized
  • SQL, Hadoop, Kafka, Spark, Hive, Sqoop, and other ETL tooling

Data Engineers are many times the interface between the data warehouse system (where the system of record is located) and the consumers of that system and its data.

Though Data Engineers and Data Scientists both need a platform they can work on together, this system, or platform, is built and supported by yet another team: the DevOps team.

A DevOps Engineer typically has the following skills:

  • hardware management (GPUs, drivers, storage, network)
  • networking
  • shared infrastructure management for clusters (e.g., Kubernetes, Hadoop)
  • container management
  • general Linux (or other operating system) skills
  • general overview of application & network security

A DevOps Engineer (though more widely applied to DevOps teams) faces a range of challenges, including:

  • The ability to support multiple versions of machine learning libraries
  • Providing ways to secure sensitive customer data, while also allowing Data Engineers flexibility to build workflows that support Data Scientists
  • Managing  infrastructure for different workflow strategies, such as Notebooks vs. straight Python code
  • Running a software [and hardware] stack that provides portability between on-premise and the cloud
  • Supporting emerging hardware [and associated software drivers], such as ASICs and TPUs, in a shared multi-tenant environment
  • Providing secure and separate storage on a per-user basis within a multi-tenant environment

In some ways, the DevOps team’s job is the hardest of the lot. They are required to provide an agnostic infrastructure that satisfies the machine learning requirements, the data security requirements, and the portability requirements of the system. It is for this reason that container management systems such as Kubernetes have grown to where they are today, and it is also for these reasons that we believe Kubernetes will continue to grow as a platform.

Kubeflow Components

Understanding the various Kubeflow components and how they interact with the underlying Kubernetes system provides a keen perspective on how to operate, secure, and otherwise manage a Kubeflow installation.

At a very high level, the distinct Kubeflow components can be broken into two broad categories:

  • Components that extend the Kubernetes API
  • Components that are applications that run atop of Kubernetes

Later in this chapter, it will become apparent why this distinction is important, especially when planning to secure the Kubeflow installation and/or integrate it with other security systems.

Components that Extend the Kubernetes API

Most, if not all, of the training components of Kubeflow, be it Chainer Training, MPI Training, TensorFlow Training, etc., provide Kubernetes Custom Resource Definitions (CRDs). The purpose of a CRD is to extend the Kubernetes API so that a desired state of infrastructure, beyond what Kubernetes provides out of the box, can be declaratively written and achieved.

As an example, the TensorFlow Training CRD defines a TFJob that describes the number of TensorFlow Chiefs, Workers, and Parameter Servers that should be run for a given training job. By simply declaring the desired state and defining the container that will run, Kubernetes is given the knowledge of how to deploy the desired number of Chiefs, Workers, and Parameter Servers for a training job. In the case of TFJob, it is the role of the “TF Operator” component to continuously monitor Kubernetes for the addition of (or changes to) TFJob objects and act accordingly: requesting Kubernetes to deploy additional containers, injecting TensorFlow-specific primitives into the containers, and so on.

By virtue of providing a CRD, it can be said that the TFJob extends the Kubernetes API, so that a user deploying a TensorFlow training job (or, conversely, a Chainer training job, an MPI training job, etc.) needs only to submit a declaration of the desired state to Kubernetes, and Kubernetes (with the help of the controller watching the CRDs) works to keep the deployed components at that desired state.
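
To make this concrete, the sketch below shows roughly what such a declaration can look like as a TFJob manifest. It is illustrative only: the API version and exact fields vary by Kubeflow release, and the namespace and image name are hypothetical placeholders.

    apiVersion: kubeflow.org/v1
    kind: TFJob
    metadata:
      name: example-training-job
      namespace: alice                  # hypothetical per-user namespace
    spec:
      tfReplicaSpecs:
        Chief:
          replicas: 1
          template:
            spec:
              containers:
              - name: tensorflow        # tf-operator conventionally expects this container name
                image: registry.example.com/ml/train:latest   # hypothetical training image
        Worker:
          replicas: 2
          template:
            spec:
              containers:
              - name: tensorflow
                image: registry.example.com/ml/train:latest
        PS:
          replicas: 1
          template:
            spec:
              containers:
              - name: tensorflow
                image: registry.example.com/ml/train:latest

Submitting this manifest (for example, with kubectl apply) is the entire “deployment” step from the user’s point of view; the operator takes care of creating and monitoring the underlying pods.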

Components Running atop Kubernetes

Other components of Kubeflow, namely Jupyter Notebooks, Hyperparameter Tuning (Katib), Pipelines, and others, are applications that are deployed atop Kubernetes. These are applications that, in theory, could be deployed standalone and are not Kubernetes-specific. Rather, they are a collection of tooling that has been specifically chosen to integrate with other areas of Kubeflow, to weave together an end-to-end system for machine learning.

Workloads

Thinking about the target workloads that Kubeflow will be supporting aids in considering the placement of the infrastructure, how to allocate storage, and other planning and ongoing operational needs.

In this section two major usage patterns are outlined. One describes Kubeflow as being used more along the lines of a dedicated platform to ingest and house data as well as do machine learning. The other pattern considers Kubeflow as more utilitarian: a tool that provides machine learning capabilities but not a data silo; utilitarian by choice in that it provides machine learning infrastructure with a “Bring Your Own Data” mindset.

Cluster Utilization

Depending on which components are installed on the cluster, Kubeflow can be used to solve a number of problems. As a thought exercise, we can group usage into three sub-patterns:

  1. ad-hoc exploratory data science workflows
  2. ETL and data science modeling workflows that are run daily (or on some temporal pattern4)
  3. model serving (“inference”) transactions

Analyzing these patterns further, we can put them into two groups:

  1. batch (or “analytical”) workloads
  2. transactional workloads

Both the exploratory data science workflows and the temporal ETL and data science workflows are considered batch workloads. Batch workloads are workloads that start up and run for a few minutes or even hours. While they tend to consume a set of resources for longer periods of time than transactional workloads, they are (typically) more efficient in how they use the hardware. Batch workloads typically deal in larger amounts of data, so they are able to do things like read data sequentially from disk and operate at the transfer rate of the disk, which is more effective than lots of small disk seeks.

Jeff Dean’s “Numbers Everyone Should Know”

To learn more about writing software to use hardware more effectively in distributed systems, check out Jeff Dean’s5 “Numbers Everyone Should Know”6.

If Kubeflow is serving inferences to external applications, then it can be considered to follow a typical pattern of transactional operations. Transactional operations are short-lived and tend to do quick (e.g., sub-500ms) data operations (e.g., pull data from a database and send it back over an HTTP connection). It’s a challenge in itself to make transactional operations more efficient (and many books have been written on the topic).

Model inferences are transactions in which input data (a vector or tensor) is sent to a model server hosting a copy of the model we wish to query for prediction output. The model server uses the incoming data as input to the model and then sends the result back across the network. With smaller models (e.g., 100k parameters in a neural network) the inference latency is not bad (and in some cases is dominated by the time required to deserialize the input and re-serialize the output). However, when dealing with larger models (e.g., an R-CNN7 in TensorFlow) we can see inference latencies in the hundreds of milliseconds even when using a GPU such as an Nvidia Titan X. Planning for inference loads for applications is covered in more detail in a later chapter [x]. Model server instances are typically pinned to a specific set of resources (e.g., a single machine with a GPU), so in most cases we can simply forecast the load and then set aside a specific number of instances to run model server processes.

Data Patterns

Depending on the type of utilization, the amount of data required varies. When planning an installation, a key consideration is whether Kubeflow will act as a dedicated repository of data (a “data silo”) or will only provide transient storage for immediate processing needs.

Dedicated vs. Transient

In the case where users want to have large amounts of data housed within Kubeflow a dedicated data storage story is needed. Questions such as how much space to provision, quotas, and other traditional data issues arise.

When considering whether to have Kubeflow function as a data repository, it should be considered whether other data lakes or larger data repositories exist, and whether Kubeflow is the right place to store the data. If other data warehouses or data lakes already exist, then it makes less sense to duplicate this data in various places, and Kubeflow should only house the data it needs for machine-learning purposes. Thought of another way: the data should live in a data warehouse or data lake, and either a subset of data should be brought into Kubeflow for processing, or perhaps only vectorized data should arrive in Kubeflow, with the vectorization prepared elsewhere.

On the flip side, if the data being used within Kubeflow is derived data, data that doesn’t already have a home, or data that has been specifically engineered for machine learning purposes, then long-term storage of the data within Kubeflow might be appropriate.

Ultimately, deciding on whether data should live in Kubeflow or elsewhere is a function of considering whether Kubeflow will be functioning merely as an “application layer” or as a more über-like infrastructure, providing data storage, compute, and processing power all together.

GPU Planning

Part of the planning process for a Kubeflow cluster involves understanding what our users will need to support their workflows in terms of processing power. The base case in processing is the CPU, but newer workloads are hungrier for linear algebra processing, so we see GPUs becoming more common every day. As the industry grows, we’re seeing new processing chips such as FPGAs and ASICs (e.g., Tensor Processing Units, or TPUs). While they are not as common in use today, it’s likely we’ll have to consider them in our forecasts at some point. In the rest of this section we’ll take a look at the considerations for GPU workloads on a Kubeflow cluster.

Planning for GPUs

A core use case for Kubeflow is to have scheduled and secure access to a cluster of GPUs in a multi-tenant environment.

There are three major ways to use GPUs with TensorFlow:

  1. single machine single GPU
  2. single machine multiple GPUs
  3. multiple machines each with a single GPU

Deciding on a GPU training strategy depends on considerations such as:

  • training data size
  • GPU hardware setup

Many times, when we are dealing with data that fits on a single machine, we will use a single GPU or multiple GPUs on a single machine. This is because a single machine with multiple GPUs can use fast inter-GPU communication, which tends to outperform distributed training: the network overhead of distributed training is more expensive in terms of latency than inter-GPU communication.

However, if we are dealing with a huge amount of data (say, 1PB) that is too large to fit on a single machine, then distributed training may be the only option.

GPU Use Cases

There are three major areas where GPUs can help your workloads:

  1. traditional HPC applications
  2. Deep Learning and machine learning modeling
  3. applications ported to use GPUs specifically (e.g., using NVIDIA’s RAPIDS8 system)

Examples of HPC-related workloads would be things like Fluid Dynamics and Protein Folding.

Most deep learning models are valid candidates for use with GPUs, as we’ll detail further on in this section. Some traditional machine learning models are also candidates for speedup with GPUs, and we cover certain known cases for a few popular models below as well. There are also cases where workloads that are neither HPC nor machine learning are being ported to GPUs for speedup with NVIDIA’s RAPIDS system.

The best way to leverage GPUs for your jobs is to first confirm that a job is a good candidate for GPU acceleration. If the task is not a great candidate for GPUs but still needs access to data that lives only on the cluster, you can still run the job via Kubeflow as a container on CPUs and process the data securely.

Should I Write CUDA Code Directly?

We’d advise against trying to write CUDA code directly, as there are already many machine learning and deep learning libraries with CUDA backends for linear algebra acceleration.

GPU Anti-Use Cases

The most common error we see is someone assuming that, because GPUs sped up their complex linear algebra problem, GPUs can speed up any arbitrary system by extension. If the system is not written with some sort of CUDA-aware9 libraries, then it likely will not be able to take advantage of GPUs out of the box.

Models that Benefit from GPUs

In terms of GPU-usage, we largely need to consider how interconnected the “model” is, and how large the model’s memory space is. If your model is thousands of completely independent algorithms/equations, then each GPU can work on its own problem as fast as it can and then come back to the workload manager with its results and get another batch of work to do. Think of the SETI@Home model, where you can take chunks and distribute them around over the internet. These workloads just need lots of GPUs, but the GPUs don’t need to talk to each other.

Most deep learning models are extremely interconnected, so the interconnect between GPUs becomes important (when we’re using multiple GPUs on a single model). NVIDIA’s NVLink gives GPUs massive bandwidth to talk to each other, so if you’re doing simulations where millions of particles are bouncing off each other, or CNNs where neurons are sending data between them, that’s where this interconnect becomes important. The DGX-1, DGX-2, and other 8-way V100 NVLink boxes, like Cisco’s or HPE’s, have this capability.
 
For reference, the DGX-2 is unique right now in that it has NVSwitch, which gives each GPU (each of which has 6 NVLink “lanes”) maximum bandwidth to every other GPU. Instead of having point-to-point links between GPUs, where you might only have one or two lanes between a given pair of GPUs, the non-blocking switch gives six lanes from any GPU to any other GPU. This secondarily has the benefit of exposing a single memory space of 512GB for the whole system, so GPUs can transparently access memory on another GPU without having to use CUDA memory management functions to copy data between memory spaces. If you have a particular model (as in NLP) where GPU memory is the bottleneck, this can speed up training by 10x.

The two major deep learning architectures to note would be:

  1. Convolutional Neural Networks
  2. Long Short-Term Memory Networks

Both of these are models that we see in common use today and both are tremendous candidates for GPU-usage. We touch on details around GPU-usage for both architectures in the following sub-sections.

While you can train any deep learning model on a CPU, certain workloads take far less time to train using GPUs. Given the dense parameter and connection counts in deep learning models, this is why we typically associate deep learning with GPU usage.

Distributed vs. Multi-GPU Training

When discussing multi-GPU training, it is important to make the distinction between distributed training (using multiple physical machines together) and multi-GPU training (using a single machine with multiple GPUs).

As an example, “Distributed TensorFlow” typically refers to the situation where the user uses Kubeflow and the TFJob CRD to run a distributed training job across multiple machines, with or without GPUs. That is, the training (regardless of CPU or GPU) can be distributed across many machines.

To think of it another way: we can define “multi-GPU” in the context of a single machine with multiple GPUs directly attached, and “distributed-GPU” as many machines, each with zero or more GPUs.

We want to highlight these definitions because they are sometimes conflated with the scenario where a distributed training job, specifically when using TensorFlow, has multiple hosts and each host has a GPU (which we differentiate by calling it “distributed-multi-GPU”). That case is technically a multi-GPU setup, but it has different execution semantics, so we give it a different moniker.

For most cases (defined as “use cases using 8 GPUs or fewer”10), training on a single machine with multiple GPUs is going to outperform both a single GPU and distributed TensorFlow11 with GPUs. Due to network overhead, for instance, we see that it takes more than 16 distributed GPUs to equal the performance of a single machine with 8 GPUs onboard.

The place that distributed training across multiple machines comes into play is when we need to leverage more than 8 GPUs at once; at that point we are ready to amortize the cost of the network overhead, jump above 16 GPUs, and gain the added scale.

Infrastructure Planning

When planning a Kubeflow installation, certain considerations must be given to the underlying infrastructure that will be running Kubeflow. Though Kubeflow runs on top of Kubernetes, and as such many of the topics discussed in this section will, to a certain degree, overlap with core Kubernetes concepts, the requirements and thought process are framed more from a Kubeflow perspective and less from a general Kubernetes one.

Kubernetes Considerations

To frame our discussion on how we want to set up our Kubeflow installation, be it on-premise or in the cloud, we’ll first look at how Kubernetes clusters are typically arranged from a logical standpoint. In the diagram below (from the official Kubernetes site), we can see different types of clusters broken up into logical layers:

Figure 2-1. Production environment options for Kubernetes clusters (image from kubernetes.io documentation)

The route chosen (cloud, on-premise, etc.) will drive the amount of investment required to stand up and manage Kubernetes, and consequently Kubeflow.

On-Premise

Installing Kubeflow on on-premise infrastructure comes with all the requirements that any other on-premise hardware has: it requires the datacenter, the workforce to manage the hardware, the network, the software, and so on.

Deploying Kubeflow on-premise generally means we want to use our existing hardware in a multi-tenant way, such that many users can share expensive infrastructure.

In this section we’ll look at some of the ways an on-premise cluster can be set up, to give more context around what “on-premise” really means.

The DGX

The NVIDIA DGX-112 is a line of NVIDIA-produced servers and workstations that specialize in using GPGPUs to accelerate deep learning applications. The servers feature 8 GPUs based on Pascal or Volta daughter cards with HBM2 memory, connected by an NVLink mesh network.

The product line is intended to bridge the gap between GPUs and AI accelerators, in that the devices have specific features specializing them for deep learning workloads. NVIDIA recommends Kubernetes as a great way to manage a DGX between users:

For NVIDIA® DGX™ servers, Kubernetes is an especially useful way of efficiently allowing users to distribute their work across a cluster. For example, a deep learning (DL) training job can be submitted that makes the request to use eight GPUs and Kubernetes will schedule that job accordingly as GPUs become available in the cluster. Once the job is complete, another job can start using the same GPUs. Another example is a long-standing service can be set up to receive live input data and output inferenced results.

Deploying Kubernetes and forming DGX servers as a cluster requires some setup, but it is preferable to giving users direct access to individual machines. Instead of users needing to ensure that they reserve a server, Kubernetes handles scheduling their work. It also can split up a single node so that multiple users can use it at the same time. All of this ensures that GPUs are being used as efficiently as possible. User access to the cluster can still be managed, certain nodes can be tagged for privileged use, specific jobs can have resource priority over others, and jobs can write to network storage.

Normally, if you were using TensorFlow (as an example) and GPUs locally, you could set up the TensorFlow configuration13 to run directly on the GPUs.

However, when running on Kubeflow with GPUs on the DGX-1, you can simply set up your custom job CRD YAML file with certain flags and a container with the CUDA dependencies built in.
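
For instance, a container can request GPUs through the extended resource that the NVIDIA device plugin exposes to Kubernetes. The minimal sketch below is illustrative only; the image tag and training script are hypothetical, and the same resources block applies equally inside a TFJob replica template.

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-training-example
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow
        image: nvcr.io/nvidia/tensorflow:19.03-py3    # hypothetical CUDA-enabled image tag
        command: ["python", "/workspace/train.py"]    # hypothetical training script
        resources:
          limits:
            nvidia.com/gpu: 2    # ask the scheduler for a node with two free GPUs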

Datacenter Considerations

GPUs consume an enormous amount of power and generate a tremendous amount of heat. Depending on the power and heat dissipation available per rack, the number of GPUs (and specifically the number of DGXs) per rack varies. It is not uncommon to find only two DGXs in a 42U rack.

Depending on the layout and storage, additional considerations need to be made. For example, if using flash-backed storage served over InfiniBand, consideration needs to be given to the distance of the interconnects, the required switching hardware, and so on.

Cloud

Given that GPU workloads tend to be ad-hoc (e.g., “every so often we’ll want to run a GPU for 2 days solid”), running a machine learning workload on the cloud with GPUs makes a lot of sense. All three major clouds offer GPUs:

  • Google Cloud14
  • Microsoft Azure15
  • AWS16

The common Nvidia GPUs offered on the cloud are:

  • K80
  • P4
  • T4
  • P100
  • V100

We should also consider what kind of instance (example: Azure GPUs17) we’re running our workloads on, beyond just what kind of GPU.

By nature of being the cloud, there are no traditional datacenter considerations, though consideration does need to be given to transporting data to the cloud (assuming the data is coming from another datacenter or another form of “on-premise”). Many cloud providers offer dedicated interconnects to their clouds for exactly such purposes.

Another consideration for using GPUs in the cloud is the speed at which GPUs can consume data. In a fully decked-out on-premise installation, where the GPU cluster is interconnected with flash-backed, low-latency, high-speed storage, GPUs can obtain data very quickly. In the cloud, and certainly depending on the cloud provider, there may be additional overhead, potentially by orders of magnitude. The consequence of this relates more to training time, and might only be an issue for very large models or when training on very large amounts of data.

Placement

The choice of where to place the infrastructure for Kubeflow, and consequently Kubernetes, can boil down to capital vs. operational costs (CapEx vs. OpEx). Certainly, the cloud provides enormous benefit in terms of elasticity and flexibility, while an on-premise solution may satisfy other business or regulatory requirements.

Ultimately, regardless of where the underlying infrastructure is placed, Kubeflow runs on top of Kubernetes, which provides a rather large degree of portability. It would be a safe bet to say that at any given point a Kubeflow installation could be moved from the cloud to on-premise, from an on-premise installation to the cloud, or even extended between the two.

Container Management

There are public and private registries for container images. Docker is currently the most popular container platform, and Docker Hub is currently a major repository for public containers. Docker Hub provides a repository where container images can be stored, searched, and retrieved.

Other repositories include Google’s Container Registry and on-premise Artifactory installs. Unless specified otherwise, Docker Hub is the default location Docker will pull images from.

For many types of jobs (e.g., a basic TensorFlow model training job in Python), we can just leverage a public container repository like Docker Hub or the Google Container Registry (gcr.io). However, in other cases where our code may be more sensitive, we need to consider other measures, such as an on-premise container registry like Artifactory.
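
When images do live in a private registry, the cluster needs credentials to pull them. A common minimal approach (the registry host and secret name below are hypothetical) is to create a docker-registry type Secret ahead of time and reference it from the pod spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: private-image-example
    spec:
      imagePullSecrets:
      - name: artifactory-credentials      # a docker-registry Secret created beforehand
      containers:
      - name: training
        image: artifactory.example.com/ml/training:1.0   # hypothetical private image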

Security

This section reviews the current state of enterprise security integration (authentication and authorization) for Kubernetes, and then specifically for Kubeflow and its associated components. Ideally we want to provide a holistic, fully encompassing security approach across these components. For purposes of this section, the term “security” means authentication and authorization generally, and not topics such as container security, digital signatures, encryption, etc.

Background & Motivation

As a direct consequence of the various components of Kubeflow, how users access the environment can differ dramatically between components.

As described, Kubeflow has the following major usage patterns:

  1. interacting with Kubeflow components via the API, using the CRDs that extend the Kubernetes API
  2. interacting with Kubeflow components that are applications deployed atop Kubernetes

Each of these workflows has a series of specific steps required to execute in a Kubeflow environment, especially if they are to run in a secure manner. Often, though not a concrete requirement, a gateway machine is used to access the Kubernetes cluster that Kubeflow runs atop.

Before discussing how these access patterns translate to Kubeflow, it’s important to take a moment and review some of the methods used to access Kubernetes itself, as well as applications that run atop Kubernetes.

All access to the Kubernetes “control plane” goes through the Kubernetes API. Two of the most common ways to access the API are the Kubernetes command-line utility (kubectl) and the Kubernetes Dashboard.

When an application has been deployed to Kubernetes, access to that application is governed by how the application was exposed when it was deployed. For example, a simple web application can be exposed via a Load Balancer or via a Kubernetes Ingress. The distinction between the two is rather subtle: in the Load Balancer scenario, requests to the application will typically bypass Kubernetes itself (at Layer 7, that is; networking and network policies aside), while in the Ingress scenario, access is piped through a Kubernetes primitive (the Ingress Controller).

Kubeflow is an application that, in some ways, extends Kubernetes while also deploying applications atop Kubernetes, and this is an important point when thinking through a security strategy. To properly secure a Kubeflow installation, the Kubernetes API (the “control plane”) must be secured, in addition to the deployed Kubeflow applications (JupyterHub, etc.).

Regardless of which access pattern a user takes, the security goal is that the cluster is accessed securely. The root of the requirements can be anything from corporate policy to auditability to securing customer data. A key motivator (besides the obvious) is data security: users in a Kubeflow installation will (or should) have access to highly sensitive information, because such access is key to building accurate models, and their ability to share any of that information (intentionally or accidentally) should be extremely limited.

As such, controls must be in place so that, regardless of how a user deploys Kubeflow-specific infrastructure, data isn’t inadvertently exposed by some HTTP service or otherwise.

Control Plane

The “Control Plane” is hereon defined as the Kubernetes Control Plane, which is responsible for receiving instructions from users and acting upon them. Again, at a high level, there are two levels of action: a user submits a request to the Control Plane, and the Control Plane acts upon the request. For purposes of security, the Control Plane is required to authenticate and authorize a request prior to acting upon it.

End users (and systems interacting with the Control Plane) do so by communicating with the Kubernetes API server. Typically this is done either by using tools such as kubectl or by using the API directly. Authentication and authorization are decoupled at the Control Plane: authorization is typically handled by RBAC (Role-Based Access Control), and authentication is “pluggable.” What is important is that authorization defines what a user is able to do at the Control Plane level (e.g., launch pods, use volumes, etc.), but it makes no assumption as to how authentication occurred.
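
As an illustration of the authorization side, a namespace-scoped Role and RoleBinding might grant a data scientist the ability to manage pods and TFJobs in their own namespace. This is a minimal sketch; the namespace, user name, and resource list are hypothetical and would be tailored to your installation.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: ml-user
      namespace: alice
    rules:
    - apiGroups: [""]
      resources: ["pods", "pods/log", "services"]
      verbs: ["get", "list", "watch", "create", "delete"]
    - apiGroups: ["kubeflow.org"]
      resources: ["tfjobs"]
      verbs: ["get", "list", "watch", "create", "delete"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: ml-user-binding
      namespace: alice
    subjects:
    - kind: User
      name: alice@example.com
      apiGroup: rbac.authorization.k8s.io
    roleRef:
      kind: Role
      name: ml-user
      apiGroup: rbac.authorization.k8s.io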

As mentioned, authentication is “pluggable,” which simply means there are a variety of built-in methods that can be used to authenticate a user. There are a number of such methods: X509 client certificates, tokens, username/password, OIDC, and webhooks.

Briefly: tokens refers to static tokens, username/password refers to static username/password pairs, and webhooks simply pass any tokens received on to a third-party webhook for processing. For the purposes of practical security, only X509 and OIDC will be discussed further.

Of these two possibilities for authentication, X509 client certificates are one option: clients present a certificate signed by a trusted authority. The username is determined from the certificate subject, and group membership is determined from the OU fields within the certificate. OpenID Connect (OIDC) is the other option, whereby users present a signed JWT, issued by a trusted authority, with all the required claims present in the token. Kubernetes itself simply verifies the validity of the token and does no further OIDC processing or interactions.
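
On clusters you manage yourself, OIDC validation is enabled with flags on the kube-apiserver. The excerpt below is purely illustrative; the issuer URL, client ID, and claim names are placeholders to be matched to your identity provider.

    # excerpt from a kube-apiserver manifest; values are placeholders
    - --oidc-issuer-url=https://idp.example.com
    - --oidc-client-id=kubernetes
    - --oidc-username-claim=email
    - --oidc-groups-claim=groups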

Example OIDC Authentication Scenario with Kerberos

We’ll consider the case where a Kubernetes cluster backing the Kubeflow installation has been configured with OIDC authentication. Users obtain an OIDC token based on their local Kerberos ticket, which is then authenticated by Kubernetes. Authorization happens as described elsewhere, via Kubernetes RBAC roles.

This could support kubectl interactions using a custom client-go credential plugin. The plugin is invoked prior to contacting the API server and is responsible for facilitating the Kerberos/OIDC token exchange. Once an OIDC token has been acquired, kubectl uses that token when communicating with the API server.
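
A minimal sketch of what the kubeconfig entry for such a setup might look like follows; the cluster address and the credential plugin binary are hypothetical, and the exec mechanism is the standard client-go credential plugin hook.

    apiVersion: v1
    kind: Config
    clusters:
    - name: kubeflow-cluster
      cluster:
        server: https://k8s-api.example.com:6443
    users:
    - name: data-scientist
      user:
        exec:
          apiVersion: client.authentication.k8s.io/v1beta1
          command: kerberos-oidc-helper          # hypothetical credential plugin binary
          args: ["--issuer=https://idp.example.com"]
    contexts:
    - name: kubeflow
      context:
        cluster: kubeflow-cluster
        user: data-scientist
    current-context: kubeflow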

Options for Securing the Control Plane

As described above, Control Plane security is an area that is well addressed. It is Kubeflow and the other deployed applications that require additional effort. To provide the desired level of security for deployed applications, one or more of the following is required:

  • Limit which containers can be deployed
    • This will effectively eliminate a large surface area of risk, but is highly restrictive to users.
  • Disallow users to expose Ingresses of their own
    • Allow users to deploy what they need, and expose any service *within* the cluster, but disallow any “outside” access or cross-namespace access.
  • Allow users to deploy Ingresses, but deploy an authenticating proxy or system
    • This will allow users to deploy what they need and expose services to the “outside” world, but those services will be fronted by an authenticating proxy or service (such as Istio).
  • A hybrid of above
    • For example: create a finite list of services that are exposed, secure them, and disallow any other Ingress into the cluster.

Most on-premise installations will have to make some situation-dependent decisions based on the above paths. 

Kubeflow and Deployed Applications

Once an application has been deployed to Kubernetes (e.g., a user had the necessary authorization to deploy a Deployment, a Pod, or a Service) and has in turn launched, for example, a web server, it is the responsibility of that server or service to provide its own methods of authentication and authorization. That is, Kubernetes only authenticates and authorizes the infrastructure aspect here: whether a user is allowed to submit a desired state for an application’s infrastructure. Once that has been authorized, if the application requires any level of authentication and authorization, it must provide it on its own; Kubernetes does not facilitate that at all.

As an example, consider the Kubernetes Dashboard. The Dashboard is a JavaScript application that allows users to interact with the Kubernetes API via a GUI. As previously mentioned, the API server requires requests to be authenticated. The Dashboard can pass along authentication tokens (e.g., Bearer tokens), but it has no native method of obtaining such tokens. While a client-go credential plugin can be used for kubectl commands, no such method is available out of the box for the Dashboard.

To provide a similar experience, the HTTP endpoint of the Dashboard is exposed to the “world” (that is, anything outside the Kubernetes cluster) via an Ingress, which is typically a reverse proxy. Prior to forwarding requests to the Dashboard application itself, the Ingress invokes an OAuth proxy to check whether the user is authenticated. It is the OAuth proxy [filter] that checks whether a token is present and, if not, initiates an OIDC flow. Once a token is obtained (browser-based), that token is presented to the Ingress, which in turn copies it into the upstream request to the Dashboard. The Dashboard at that point simply forwards the request to the API server, along with the obtained token, and the flow continues the same as the kubectl interaction.
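
Below is a sketch of how such an Ingress can delegate authentication, assuming the NGINX Ingress Controller and an oauth2-proxy style deployment already running in the cluster; the hostnames and service names are hypothetical, and API versions vary by cluster release.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: dashboard
      annotations:
        nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/oauth2/auth"
        nginx.ingress.kubernetes.io/auth-signin: "https://auth.example.com/oauth2/start"
    spec:
      rules:
      - host: dashboard.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kubernetes-dashboard    # the Dashboard's in-cluster Service
                port:
                  number: 443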

It is important to once again note that the Kubernetes Dashboard is not a core component of Kubernetes or the Control Plane, and is not even required. It is merely a JavaScript application that *interacts* with the API server. It is deployed like “any other application.”

Conversely, Kubeflow is a set of applications deployed atop Kubernetes. For example, part of the Kubeflow deployment is JupyterHub (and Jupyter Notebooks), which are themselves exposed as HTTP servers. These exposed servers require authentication and authorization to be configured independently. The Kubeflow dashboard (another HTTP service) is also deployed as part of Kubeflow; it too requires authentication and authorization to be configured for it.

The challenge at this juncture is precisely the security around these applications: we must ensure that only authorized users are allowed to access the applications exposed by Kubeflow. This cannot be handled at the “Control Plane” level, and must be addressed for each component (or via a higher-level, rolled-up method, which is yet to be determined).

There is an additional case that has to be considered: when users deploy their own containers and potentially expose an HTTP service via said container. These endpoints also require security, with (ideally) little to no configuration required on the user’s part.

For certain Kubeflow components, this can be handled using Istio integration; other parts may require custom integrations.

Multitenancy & Isolation

Multitenancy in the context of Kubeflow can be thought of as multiple users sharing a single Kubeflow installation, or as a number of Kubeflow installations within a single Kubernetes cluster.

In the context of security, if absolute isolation is required, then it can be achieved by deploying a separate Kubeflow installation per user in their own namespace. To facilitate multitenancy (in part), users are provided their own namespace in the cluster, and all their Pods, Services, Deployments, Jobs, etc. are encapsulated in that namespace. Control Plane security (as described above) is provided at the namespace level, while application-level security (also described above) is a large motivating factor for this discussion.
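
For example, each user namespace can carry a ResourceQuota so that no single tenant can consume every GPU in the cluster. The quota values below are arbitrary illustrations and the namespace name is hypothetical:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: ml-quota
      namespace: alice
    spec:
      hard:
        requests.cpu: "32"
        requests.memory: 128Gi
        requests.nvidia.com/gpu: "4"    # cap the GPUs this namespace can request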

Multitenancy is how we set up a cluster of shared resources and then allow multiple tenants (“users”) to access the cluster through scheduling. While there is overhead in setting up a multi-tenant cluster, it is preferable from an operations standpoint to giving users direct access to individual computers.

Other properties of multitenancy include:

  • Isolation of user storage and workloads
  • load-balancing
  • users sharing a single machine’s resources

Advantages of setting up a shared cluster with Kubernetes include:

  • resource scheduling
  • security integration with Kerberos and Active Directory
  • heterogeneous GPU support19
  • more efficient use of expensive resources
  • node-level user access so certain types of jobs have priority over others

Dealing with the implications of multitenancy falls under the direction of the DevOps team and, as shown above, involves a lot of considerations. The data science team really only cares whether they can run their Python jobs on GPUs; the DevOps team has to make that happen while also worrying about running these container-based workloads in a secure, scheduled environment along with other users.

How Do the Users Want to Use the System?

The three major ways users will submit jobs to the Kubernetes/Kubeflow cluster are:

  1. By submitting a basic CRD-based container job via kubectl from the command line on a gateway host
  2. Through the web browser, by logging into a Jupyter Notebook and running (likely) Python code on CPUs or GPUs
  3. By submitting an operator-specific CRD-based job (e.g., a TFJob to the tf-operator) via kubectl from the command line on a gateway host

In practice you will likely see a mixture of all three of these job workflows submitted to the cluster.

Integration

At the Control Plane level, several forms of integration are available, namely:

  • X509 Certificates
  • OIDC Connectors

X509 certificates work like any other PKI certificate infrastructure: clients require a valid certificate to authenticate to the Kubernetes Control Plane. Managing, issuing, revoking, and the other aspects of a PKI infrastructure are a topic of their own, and are well documented.

OIDC integration can be achieved with an existing Identity Provider installation, such as one that authenticates Kerberos tickets and exchanges them for OIDC tokens. Additionally, Kubeflow integrates well with Dex, an open-source OIDC federation (or bridge) that allows users to use various authentication methods; a sketch of a connector configuration follows the list below. Some of the integrations that Dex provides include:

  • LDAP
  • GitHub
  • SAML 2.0
  • OIDC
  • LinkedIn
  • Microsoft
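
As a brief illustration, an LDAP connector section of a Dex configuration file might look roughly like this; the host, bind DN, and attribute mappings are placeholders that depend entirely on the directory being integrated:

    connectors:
    - type: ldap
      id: ldap
      name: Corporate LDAP
      config:
        host: ldap.example.com:636
        bindDN: cn=dex-service,ou=services,dc=example,dc=com
        bindPW: "<service account password>"
        userSearch:
          baseDN: ou=people,dc=example,dc=com
          filter: "(objectClass=person)"
          username: uid
          idAttr: uid
          emailAttr: mail
          nameAttr: cn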

Sizing & Growing

Forecasting is the exercise of taking the concepts from the previous sections and applying them in a sizing heuristic. If we summarize the workloads from the user groups profiled above (in sections such as the multitenancy discussion) that we’re likely to see on our cluster, we could say:

  • There will be periods when a set of users will be trying different ideas in data science and will be kicking off a batch training job that could take minutes or hours
  • There will be a group of jobs (ETL, data science) that run every day as batch workloads
  • There will be transactional workloads that are sustained by a set amount of resources and are always running the model server processes

So if we carve off the model server (transactional) portion of things, we can focus on modeling the batch workloads for the ad-hoc jobs and then the scheduled jobs.

For the ad-hoc exploratory data science jobs, ideally we’ll create “lanes” for different sub-groups of users in our data science teams, so a single user cannot hog the cluster with a “super-job”. These exploratory jobs represent future output from the data science organization, but we cannot run them at the expense of the present output of the organization. So we’ll also want to create a lane for the scheduled jobs that need to run everyday and likely need to finish by a set time.

Forecasting

As a compute-planning exercise, it’s generally a good idea to build a spreadsheet with time buckets covering a 24-hour day, where each row of the spreadsheet represents a user group. We’ll consider the daily (temporal) jobs as their own group and enter them as the first group.

Figure 2-2. Example cluster forecasting spreadsheet

List the other groups you can think of as subsequent rows in the spreadsheet. In each cell, add up the number of GPU instances you think the group will need for that time block. It’s good to do this exercise over a 24-hour set of time blocks because of overlapping time zones and the times of the workday when teams tend to run jobs.

Once you have all of the groups as rows and all of the time blocks filled out per group, sum up the columns across the 24-hour window. Look for the column with the largest total; that number is your peak GPU usage for a typical day in your organization.
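
As an illustration with made-up numbers: suppose the ad-hoc research group peaks at 12 GPUs between 09:00 and 12:00, the nightly ETL and training jobs need 8 GPUs from 02:00 to 06:00, and a second modeling team needs 6 GPUs from 10:00 to 15:00. The largest column total is then 12 + 6 = 18 GPUs in the late-morning blocks, so 18 GPUs is the peak daily demand to plan around.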

To forecast what your team will need a year from now, make a copy of the spreadsheet and then add a growth metric per cell (which may differ per group). These two spreadsheets will give you cluster usage today and peak usage a year from now.

These two numbers will give you a decent estimate for planning GPU capacity, whether as a cloud budget for running Kubeflow in the cloud or as an on-premise infrastructure hardware budget.

Storage

Kubeflow clusters tend to cater to compute-hungry workloads, and so far in this section we’ve focused on sizing with respect to the compute budget our users will need. However, compute cannot exist without data so we need to consider how much data we’ll be processing.

Key storage considerations are:

  • Ingest rate of data (per day, per year)
  • Current total data size in the storage system
  • Average data size per job
  • Average intermediate data size per job
  • Compression impact on file size on disk
  • Replication factor20 (if any) of the storage system

An example spreadsheet of these factors in use is shown below:

Figure 2-3. Example spreadsheet forecasting storage required for cluster

If we multiply the daily ingest rate by 365 days and then add the existing data stored, we have a storage number that roughly tells us how much space we’ll need at the end of the year. Beyond just the data stored, we want to consider space for running jobs (potentially a copy of the data as it is processed) and any intermediate data the jobs produce. If we add all three of these numbers together, we get the total forecasted storage required at the end of the year.
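
As an illustration with made-up numbers: a daily ingest of 50 GB works out to roughly 50 GB × 365 ≈ 18 TB of new data over the year; added to, say, 20 TB of existing data and an allowance of 10 TB for in-flight copies and intermediate job outputs, the year-end forecast lands at roughly 48 TB, before accounting for compression or the replication factor of the storage system.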

Obviously this number won’t be perfect, but it gets us in the ballpark. We also should consider the fact that we want to have extra space for operations beyond just what we’re storing, so when we go to make the hardware purchase (or cloud provision) we’ll likely allocate more space than this number directly. 

Scaling

For most systems that are running constantly, we build a sizing and growth plan (as shown previously in this chapter) and then make direct considerations for how many machines we need of which type. We’ll call this “planned scaling” based on a direct growth forecast, and most systems require this type of planning.

We contrast this with dynamically scaling systems that automatically provision more hosts for a certain type of code or function deployed to the system (todo: footnote example). An example of dynamic scaling is how AWS21 can add more webservers to a group to handle more traffic as a response to changing demand.

Although there are some emerging design patterns where workloads can “burst” to the cloud from on-premise when more capacity is needed, most of the time when we’re dealing with enterprise infrastructure we’re talking about a planned scaling situation.

Situations Where Scale is Over-Marketed

Many times the term “scale” is thrown around with conflated meanings. Some vendors in the enterprise arena want to give the impression that their application (somehow) has no limit to its scale. This typically is just not true, because “network switches have limits.”

Beyond that, just parallelizing an operation across “more hardware” can hit other bottlenecks, such as:

  • Algorithm design
  • GPU memory transfer rates
  • Neural network architecture design
  • Communication overhead

There really is just no free lunch here, so always take scale guarantees with a grain of salt.

1 http://www.jesse-anderson.com/2018/11/creating-a-data-engineering-culture/

2 https://www.oreilly.com/ideas/data-engineering-a-quick-and-simple-definition

3 https://www.oreilly.com/ideas/data-engineers-vs-data-scientists

4 https://kubernetes.io/docs/concepts/workloads/controllers/cron-jobs/

5 https://en.wikipedia.org/wiki/Jeff_Dean_(computer_scientist)

6 http://brenocon.com/dean_perf.html

7 https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

8 https://rapids.ai/

9 https://en.wikipedia.org/wiki/CUDA

10 https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf

11 https://www.tensorflow.org/guide/distribute_strategy

12 https://images.nvidia.com/content/technologies/deep-learning/pdf/61681-DB2-Launch-Datasheet-Deep-Learning-Letter-WEB.pdf

13 https://www.tensorflow.org/guide/using_gpu

14 https://cloud.google.com/gpu/

15 https://azure.microsoft.com/en-us/pricing/details/virtual-machines/series/

16 https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html

17 https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-gpu

18 https://console.cloud.google.com/marketplace/details/click-to-deploy-images/deeplearning

19 https://docs.nvidia.com/datacenter/kubernetes/kubernetes-on-dgx-user-guide/index.html

20 https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

21 https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html
