In Chapter 10, Pachyderm Language Clients, we learned how to use Pachyderm language clients, including the Pachyderm Go client and the Pachyderm Python client. The latter is probably more popular among data scientists, as Python is the language most of them work in. And if you write in Python, you are likely familiar with the open source tool JupyterLab.
JupyterLab is an open source platform that provides an Interactive Development Environment (IDE) in which you can not only author your code but also execute it. This makes JupyterLab an ideal tool for data science experiments. However, while JupyterLab provides basic version control for its notebooks, it does not offer the level of data provenance that Pachyderm does. Pachyderm Hub, the SaaS version of Pachyderm, lets you integrate your Pachyderm cluster with Pachyderm Notebooks, a built-in version of JupyterLab coupled with Pachyderm.
This chapter demonstrates how to configure a Pachyderm cluster in Pachyderm Hub and use Pachyderm Notebooks. By the end of this chapter, you will know how to run basic Pachyderm operations in Pachyderm Notebooks and will have created a sentiment analysis pipeline.
This chapter covers the following topics:
You should have already installed the following components:
All code samples used in this section are stored in the GitHub repository created for this book at https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/tree/main/Chapter11-Using-Pachyderm-Notebooks.
The Dockerfile used in this section is stored at https://hub.docker.com/repository/docker/svekars/pachyderm-ide.
Before you can take advantage of Pachyderm Notebooks, you need to create an account in Pachyderm Hub and a Pachyderm workspace. Pachyderm Hub provides a trial period for all users to test its functionality.
After the trial period ends, you need to upgrade to the Pachyderm Pro version to continue using Pachyderm Notebooks.
In Pachyderm Hub, your work is organized in workspaces. A workspace is a grouping in which multiple Pachyderm clusters can run. Your organization might decide to assign each workspace to a team of engineers.
Before you can create a workspace, you need a Pachyderm Hub account, so let's create one. Pachyderm Hub supports authentication with Gmail and GitHub. You must have one of these accounts to sign up for Pachyderm Hub.
To create a Pachyderm Hub account, complete the following steps:
When you create a workspace, you automatically deploy a Pachyderm cluster. If you are using a trial version of Pachyderm, a single-node cluster is deployed, which should be enough for testing.
Now that you have created your first workspace, you need to connect to your cluster using pachctl.
If this is not the first chapter that you are reading in this book, you should already have pachctl installed on your computer. Otherwise, install pachctl as described in Chapter 4, Installing Pachyderm Locally.
To connect to your Pachyderm Hub workspace, do the following:
pachctl version
You should get the following response:
COMPONENT VERSION
pachctl 2.0.1
pachd 2.0.1
pachctl config get active-context
This command should return the name of your Pachyderm Hub workspace.
You can now communicate with your cluster deployed on Pachyderm Hub through pachctl from a terminal on your computer.
Now that we have configured our cluster, let's connect to a Pachyderm notebook.
Pachyderm Notebooks is an IDE for data scientists that provides easy access to familiar Python libraries. You can run and test your code in cells while Pachyderm backs your pipeline.
To connect to a Pachyderm notebook, complete the following steps:
You should see the following screen:
Now that we have access to Pachyderm Notebooks, we can create Pachyderm pipelines directly from the Pachyderm Notebooks UI, experiment with Python code and python-pachyderm, run pachctl commands, and even create Markdown files to document our experiments. We'll look into this functionality in the next section.
The main advantage of Pachyderm Notebooks is that it provides a unified experience. Not only can you run your experiments inside of it, but you can also access your Pachyderm cluster through the integrated terminal by using both pachctl and python-pachyderm. All the pipelines that you create through the JupyterLab UI will be reflected in your Pachyderm cluster whether it runs locally or on a cloud platform.
Now let's see how we can use Pachyderm Notebooks to access our Pachyderm cluster.
You can run the integrated terminal from within Pachyderm Notebooks and use it to execute pachctl or any other UNIX commands.
To use the integrated terminal, complete the following steps:
pachctl version
Try running other Pachyderm commands that you have learned in previous chapters to see how it works.
Important note
The versions of pachctl and pachd may differ from the ones you run directly from your computer's terminal because Pachyderm Notebooks has a preinstalled version of pachctl, which sometimes might not match the version of your cluster. This should not affect your work with Pachyderm.
Now that we know how to use the terminal, let's try to create a Pachyderm notebook.
A notebook is an interactive document in which users can write Python code, run it, and visualize the results. These features make notebooks a great experimentation tool that many data scientists use in their work. After the experiment is finished, you might want to export the notebook as a Python script or library.
Pachyderm Notebooks supports the following types of notebooks:
These three languages seem to be the most popular among data scientists. You can create Julia and R notebooks to experiment with the code that you want to use in your pipeline. With Python notebooks, not only can you test your code, but you can also use the python-pachyderm client to interact with the Pachyderm cluster.
Important note
The code described in this section can be found in the https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/blob/main/Chapter11-Using-Pachyderm-Notebooks/example.ipynb file.
Let's create a Python notebook and run a few commands:
import python_pachyderm
client = python_pachyderm.Client()
print(client.get_remote_version())
print(list(client.list_repo()))
client.create_repo("data")
print(list(client.list_repo()))
Here is the output that you should see:
Note that you do not need to import python_pachyderm and define the client in the second and subsequent cells since you have already defined it in the first cell.
with client.commit('data', 'master') as i:
    client.put_file_url(i, 'total_vaccinations_dec_2020_24-31.csv', 'https://raw.githubusercontent.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/main/Chapter11-Using-Pachyderm-Notebooks/total_vaccinations_dec_2020_24-31.csv')
print(list(client.list_file(("data", "master"), "")))
Here is the output that you should see:
This dataset contains statistics about COVID-19 vaccinations from December 24 to 31, 2020.
import pandas as pd
pd.read_csv(client.get_file(("data", "master"), "total_vaccinations_dec_2020_24-31.csv"))
You should see the following output:
This CSV file has the following columns: location, date, vaccine (manufacturer), and total_vaccinations.
import pandas as pd
df = pd.read_csv(client.get_file(("data", "master"), "total_vaccinations_dec_2020_24-31.csv"))
data_top = df.head()
print(data_top)
You should see the following output:
from python_pachyderm.service import pps_proto
client.create_pipeline(
    pipeline_name="find-vaccinations",
    transform=pps_proto.Transform(
        cmd=["python3"],
        stdin=[
            "import pandas as pd",
            "df = pd.read_csv('/pfs/data/total_vaccinations_dec_2020_24-31.csv')",
            "max_vac = df['total_vaccinations'].idxmax()",
            "row = df.iloc[[max_vac]]",
            "row.to_csv('/pfs/out/max_vaccinations.csv', header=None, index=None, sep=' ', mode='a')",
        ],
        image="amancevice/pandas",
    ),
    input=pps_proto.Input(
        pfs=pps_proto.PFSInput(glob="/", repo="data")
    ),
)
print(list(client.list_pipeline()))
You should see the following output:
client.get_file(("find-vaccinations", "master"), "/max_vaccinations.csv").read()
You should see the following output:
Our pipeline has determined that, during the period from December 24 to December 31, the most vaccinations were administered in Germany on December 31. The number of vaccinations was 206,443 and the manufacturer was Pfizer/BioNTech.
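You can reproduce the pipeline's selection logic locally with a toy DataFrame. The rows below are invented for illustration (only the Germany figure matches the dataset):

```python
import pandas as pd

# Invented rows in the shape of the vaccinations dataset.
df = pd.DataFrame({
    "location": ["Germany", "Canada", "Russia"],
    "total_vaccinations": [206443, 12000, 52000],
})

# idxmax returns the index label of the largest value; iloc[[...]] keeps
# the result as a one-row DataFrame, exactly as the pipeline code does.
max_vac = df["total_vaccinations"].idxmax()
row = df.iloc[[max_vac]]
print(row["location"].item())  # → Germany
```

This is the same idxmax/iloc pattern the pipeline runs inside its container, just without the /pfs paths.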
client.delete_repo("data", force=True)
client.delete_pipeline("find-vaccinations")
print(list(client.list_repo()))
print(list(client.list_pipeline()))
You should see the following output:
In this section, we learned how to perform basic Pachyderm operations in Pachyderm Python Notebooks. Next, we'll create another example pipeline.
In the previous section, we learned how to use Pachyderm Notebooks to create repositories, add data, and even build a simple pipeline. In this section, we will create a pipeline that performs sentiment analysis on a Twitter dataset.
Important note
The code described in this section can be found in the https://github.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/blob/main/Chapter11-Using-Pachyderm-Notebooks/sentiment-pipeline.ipynb file.
We will use a modified version of the International Women's Day Tweets dataset from Kaggle available at https://www.kaggle.com/michau96/international-womens-day-tweets. Our modified version includes only two columns—tweet number # and text. The dataset includes 51,480 rows.
Here is an extract of the first few rows of the dataset:
# text
0 3 "She believed she could, so she did." #interna...
1 4 Knocking it out of the park again is @marya...
2 5 Happy International Women's Day! Today we are ...
3 6 Happy #InternationalWomensDay You're all power...
4 7 Listen to an experimental podcast recorded by ...
Here is a diagram of the workflow:
In the next section, we will learn about the methodology that is used to build this pipeline.
We will use Natural Language Toolkit (NLTK), which is familiar to us from previous chapters, to clean the data. Then, we will use TextBlob, an open source Python library for text processing, to perform sentiment analysis on the tweets in the dataset.
Sentiment analysis is a technique that helps you understand the overall mood of the individuals involved in a conversation, whether they are discussing products and services or rating a movie. It is widely used across businesses and industries by marketers and sociologists to get a quick assessment of customer sentiment. In this example, we will look at the emotions expressed in a selection of tweets about International Women's Day.
TextBlob provides two metrics for sentiment analysis—polarity and subjectivity. Each word in a sentence is assigned a score and then the mean score is assigned to the whole sentence. In this example, we will only determine the polarity of the tweets. Polarity defines the positivity or negativity of a sentence based on a predefined word intensity.
Polarity values range from -1 to 1, with -1 meaning negative sentiment, 0 being neutral, and 1 being positive. If we were to show this on a scale, it would look like this:
If we were to put the words into a table and assign them polarity scores, here is what we might get:
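To make the word-averaging idea concrete, here is a toy sketch in plain Python. The lexicon scores below are invented for illustration; they are not TextBlob's actual values:

```python
# Invented mini-lexicon; TextBlob ships a much larger one with its own scores.
lexicon = {"happy": 0.8, "great": 0.8, "sad": -0.5, "terrible": -1.0}

def sentence_polarity(sentence):
    """Score each word (unknown words count as 0.0) and return the mean."""
    scores = [lexicon.get(w, 0.0) for w in sentence.lower().split()]
    return sum(scores) / len(scores)

print(sentence_polarity("what a great day"))  # only 'great' scores: 0.8 / 4 words
```

TextBlob's real scoring also accounts for modifiers and negation, but the principle is the same: word-level scores averaged into a sentence-level polarity.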
Let's run a quick TextBlob example on a simple sentence to see how it works. Use the code in the sentiment-test.py file to try this example:
from textblob import TextBlob
text = '''Here is the most simple example of a sentence. The rest of the text is autogenerated. This gives the program some time to perform its computations and then tries to find the shortest possible sentence. Finally, let's look at the output that is used for the rest of the process.'''
blob = TextBlob(text)
blob.tags
blob.noun_phrases
for i in blob.sentences:
    print(i.sentiment.polarity)
To run this script, complete the following steps:
pip install textblob && python -m textblob.download_corpora
Here is the output that you should see:
As you can see in the output, TextBlob assigns a score for each sentence.
Now that we have reviewed the methodology of our example, let's create our pipelines.
Our first pipeline will use NLTK to clean the Twitter data in our data.csv file. We will create a standard Pachyderm pipeline by using python-pachyderm. The pipeline will consume files from the data repository, run the data-clean.py script on it, and output the cleaned text to the data-clean output repository. The pipeline will use the svekars/pachyderm-ide:1.0 Docker image to run the code.
The first part of data-clean.py imports the components that are familiar to us from Chapter 8, Creating an End-to-End Machine Learning Workflow. These components include NLTK and pandas, which we will use to preprocess our data. We will also import re to specify regular expressions:
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('wordnet')
nltk.download('punkt')
import re
The second part of the script performs the data cleaning using NLTK's word_tokenize method, stopword removal with a lambda function, and re.split to strip URLs. Finally, the script saves the cleaned text to cleaned-data.csv in the output repository:
stopwords = stopwords.words("english")
data = pd.read_csv("/pfs/data/data.csv", delimiter=",")
tokens = data['text'].apply(word_tokenize)
remove_stopwords = tokens.apply(lambda x: [w for w in x if w not in stopwords and w.isalpha()])
remove_urls = remove_stopwords.apply(lambda x: re.split('https://.*', str(x))[0])
remove_urls.to_csv('/pfs/out/cleaned-data.csv', index=True)
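The two cleaning steps can be tried in isolation. This toy sketch uses a tiny hardcoded stopword list (NLTK's real list is much larger) and a pre-tokenized, made-up tweet:

```python
import re

# Tiny stand-in for NLTK's English stopword list.
stopwords = {"the", "a", "is", "so", "she", "could"}
tokens = ["She", "believed", "she", "could", "so", "she", "did"]

# Keep alphabetic tokens that are not stopwords, mirroring the lambda above.
# Note the comparison is case-sensitive, so capitalized "She" survives.
kept = [w for w in tokens if w not in stopwords and w.isalpha()]
print(kept)  # → ['She', 'believed', 'did']

# re.split with a greedy URL pattern returns everything before the first
# URL, so taking element [0] strips the URL and anything after it.
no_url = re.split('https://.*', "happy women's day https://t.co/abc123")[0]
print(no_url)  # → "happy women's day "
```

The case-sensitivity of the stopword check is worth noting: as in the real script, tweets would normally be lowercased first for a stricter filter.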
Our second pipeline will perform sentiment analysis on the cleaned data with the TextBlob Python library. The sentiment.py script imports the following components:
from textblob import TextBlob
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from contextlib import redirect_stdout
We'll use pandas to manipulate the data frames. We'll use matplotlib and seaborn to visualize our results, and we'll use redirect_stdout to save our results to a file.
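The redirect_stdout pattern can be tried on its own. This minimal sketch captures print() output in an in-memory buffer instead of a file:

```python
import io
from contextlib import redirect_stdout

# Capture print() output in a buffer; the pipeline script uses the same
# pattern with a real file handle to produce number_of_tweets.txt.
buf = io.StringIO()
with redirect_stdout(buf):
    print("Number of Positive tweets:", 3)

print(repr(buf.getvalue()))  # → 'Number of Positive tweets: 3\n'
```

Inside the pipeline, the buffer is simply replaced by the open file handle for /pfs/out/number_of_tweets.txt.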
Next, our script performs sentiment analysis and creates two new columns—polarity_score and sentiment. The resulting table is saved to a new CSV file called polarity.csv:
data = pd.read_csv('/pfs/data-clean/cleaned-data.csv', delimiter=',')
data = data[['text']]
data["polarity_score"] = data["text"].apply(lambda data: TextBlob(data).sentiment.polarity)
data['sentiment'] = data['polarity_score'].apply(lambda x: 'Positive' if x >= 0.1 else ('Negative' if x <= -0.1 else 'Neutral'))
print(data.head(10))
data.to_csv('/pfs/out/polarity.csv', index=True)
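The polarity-to-label mapping in the lambda above can be written out as a plain function to make the thresholds explicit:

```python
def polarity_to_sentiment(score):
    # Same cutoffs as the pipeline's lambda: 0.1 and above is Positive,
    # -0.1 and below is Negative, everything in between is Neutral.
    if score >= 0.1:
        return 'Positive'
    if score <= -0.1:
        return 'Negative'
    return 'Neutral'

print(polarity_to_sentiment(0.6))    # → Positive
print(polarity_to_sentiment(0.05))   # → Neutral
print(polarity_to_sentiment(-0.5))   # → Negative
```

The narrow Neutral band around zero means weakly scored tweets are not forced into a positive or negative bucket.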
Then, the script saves the tweets of each sentiment category to its own variable, calculates the total for each category, and saves the totals to the number_of_tweets.txt file:
positive = [tweet for index, tweet in enumerate(data['text']) if data['polarity_score'][index] > 0]
neutral = [tweet for index, tweet in enumerate(data['text']) if data['polarity_score'][index] == 0]
negative = [tweet for index, tweet in enumerate(data['text']) if data['polarity_score'][index] < 0]
with open('/pfs/out/number_of_tweets.txt', 'w') as file:
    with redirect_stdout(file):
        print("Number of Positive tweets:", len(positive))
        print("Number of Neutral tweets:", len(neutral))
        print("Number of Negative tweets:", len(negative))
The last part of the script builds a pie chart with the percentages of tweets in each category and saves it to the plot.png file:
colors = ['#9b5de5','#f15bb5','#fee440']
figure = pd.DataFrame({'percentage': [len(positive), len(negative), len(neutral)]},
                      index=['Positive', 'Negative', 'Neutral'])
plot = figure.plot.pie(y='percentage', figsize=(5, 5), autopct='%1.1f%%', colors=colors)
circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(circle)
plot.axis('equal')
plt.tight_layout()
plot.figure.savefig("/pfs/out/plot.png")
import python_pachyderm
client = python_pachyderm.Client()
client.create_repo("data")
print(list(client.list_repo()))
You should see the following output:
with client.commit('data', 'master') as i:
    client.put_file_url(i, 'data.csv', 'https://raw.githubusercontent.com/PacktPublishing/Reproducible-Data-Science-with-Pachyderm/main/Chapter11-Using-Pachyderm-Notebooks/data.csv')
print(list(client.list_file(("data", "master"), "")))
This script returns the following output:
list(client.list_file(("data", "master"), ""))
You should see the following response:
from python_pachyderm.service import pps_proto
client.create_pipeline(
    pipeline_name="data-clean",
    transform=pps_proto.Transform(
        cmd=["python3", "data-clean.py"],
        image="svekars/pachyderm-ide:1.0",
    ),
    input=pps_proto.Input(
        pfs=pps_proto.PFSInput(glob="/", repo="data")
    ),
)
client.create_pipeline(
    pipeline_name="sentiment",
    transform=pps_proto.Transform(
        cmd=["python3", "sentiment.py"],
        image="svekars/pachyderm-ide:1.0",
    ),
    input=pps_proto.Input(
        pfs=pps_proto.PFSInput(glob="/", repo="data-clean")
    ),
)
print(list(client.list_pipeline()))
You should see the following output:
The output is truncated and shows only the data-clean pipeline. You should see similar output for the sentiment pipeline.
import pandas as pd
pd.read_csv(client.get_file(("data-clean", "master"), "cleaned-data.csv"), nrows=10)
This script returns the following output:
You can see that the text was broken down into tokens.
list(client.list_file(("sentiment","master"), ""))
You should see a lengthy output. Each file appears under a path: field, similar to the following response:
There should be three files, as follows:
pd.read_csv(client.get_file(("sentiment","master"), "polarity.csv"), nrows=10)
This script returns the following output:
You can see the two new columns appended to our original table, giving a polarity score in the range [-1, 1] and a sentiment category.
client.get_file(("sentiment", "master"),"number_of_tweets.txt").read()
You should see the following output:
from IPython.display import display
from PIL import Image
display(Image.open(client.get_file(("sentiment", "master"), "/plot.png")))
This script returns the following output:
Based on this chart, we can tell that the majority of the tweets contain positive sentiments and the percentage of negative tweets can be considered insignificant.
This concludes our sentiment analysis example.
In this chapter, we have learned how to create Pachyderm notebooks in Pachyderm Hub, a powerful addition to Pachyderm that enables data scientists to leverage the benefits of an integrated environment with the Pachyderm data lineage functionality and pipelines. Data scientists spend hours performing exploratory data analysis and do so in notebooks. Combining Pachyderm and notebooks brings data scientists and data engineers together on one platform, letting them speak the same language and use the same tools.
In addition, we created a pipeline that performs basic sentiment analysis of Twitter data and ran it entirely in a Pachyderm notebook. We also expanded our knowledge of python-pachyderm and how it can be used in conjunction with other tools and libraries.