Preface

Pachyderm is a distributed version control platform for building end-to-end data science workflows. Since its creation in 2016, Pachyderm has become a go-to solution for large and small organizations. The core functionality of Pachyderm is open source and has a vibrant community of engineers around it. This book walks you through basic and advanced examples of Pachyderm usage, and will help you get started quickly and integrate a reliable data science solution into your infrastructure.

Reproducible Data Science with Pachyderm provides a clear overview of Pachyderm, as well as instructions on how to install and run Pachyderm in the cloud and how to use Pachyderm Hub, the Software-as-a-Service (SaaS) version of Pachyderm. This book has practical examples of data science techniques running on a Pachyderm cluster.

Who this book is for

This book is for new and more experienced data scientists and machine learning engineers who want to build scalable infrastructures for their data science projects. Basic knowledge of Python programming and Kubernetes will be beneficial. Familiarity with Go is nice to have.

What this book covers

Chapter 1, The Problem of Data Reproducibility, discusses the problem of reproducibility in modern science and data science and how it aligns with the Pachyderm mission.

Chapter 2, Pachyderm Basics, describes basic Pachyderm concepts and primitives.

Chapter 3, Pachyderm Pipeline Specification, provides a detailed overview of the Pachyderm specification file, the main configuration file of Pachyderm pipelines.

Chapter 4, Installing Pachyderm Locally, walks you through the process of installing Pachyderm locally on your computer.

Chapter 5, Installing Pachyderm on a Cloud Platform, describes how to install Pachyderm on three major cloud platforms: Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), and Microsoft Azure Kubernetes Service (AKS).

Chapter 6, Creating Your First Pipeline, covers how to create a simple pipeline that processes images.

Chapter 7, Pachyderm Operations, looks at the most frequently used Pachyderm operations.

Chapter 8, Creating an End-to-End Machine Learning Workflow, shows how to deploy an end-to-end ML workflow, using a Natural Language Processing (NLP) pipeline as the example.

Chapter 9, Distributed Hyperparameter Tuning with Pachyderm, looks at performing distributed hyperparameter tuning with a Named-Entity Recognition (NER) pipeline.

Chapter 10, Pachyderm Language Clients, walks you through the most common examples of using the Pachyderm Python and Go clients.

Chapter 11, Using Pachyderm Notebooks, discusses Pachyderm Hub, Pachyderm's Software-as-a-Service (SaaS) platform, and introduces Pachyderm Notebooks, an Integrated Development Environment (IDE) for data scientists.

To get the most out of this book

You will need to have the latest Pachyderm version installed on your computer. All operations were tested using Pachyderm 2.0 on macOS. However, they should work with future releases too. If you are on Windows, all operations must be performed in Windows Subsystem for Linux (WSL).

You will need to request an enterprise version of Pachyderm to use the Pachyderm Console. Pachyderm provides a free trial license for first-time users. However, most examples will work without an enterprise license. To test Pachyderm Notebooks in Chapter 11, Using Pachyderm Notebooks, you will need to create a Pachyderm Hub account.

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801074483_ColorImages.pdf.

Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Mount the downloaded WebStorm-10*.dmg disk image file as another disk in your system."

A block of code is set as follows:

image = displacy.render(textfile, style='dep', options={"compact": True, "distance": 70})
f = open('/pfs/out/pos-tag-dependency.svg', "w")
f.write(image)
f.close()

Any command-line input or output is written as follows:

$ minikube start

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select System info from the Administration panel."

Tips or Important notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at [email protected] and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.

Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.

Share Your Thoughts

Once you've read Reproducible Data Science with Pachyderm, we'd love to hear your thoughts! Please visit the Amazon review page for this book and share your feedback.

Your review is important to us and the tech community and will help us make sure we're delivering excellent quality content.
