Let's begin with the basics. We'll import our dataset and get a sense of the quantity of data that we are working with. We will do this by using pandas to import our data:
# pandas is a powerful Python-based data package that can handle large quantities of row/column data
# we will use pandas many times during these videos. a 2D group of data in pandas is called a 'DataFrame'
# import pandas
import pandas as pd
# use the read_csv method to read in a local file of leaked passwords
# here we specify `header=None` to tell pandas that the file has no header row (no titles of columns)
# we also specify that if any row gives us a parsing error, skip over it (this is done in on_bad_lines='skip';
# pandas versions before 1.3 spelled this option error_bad_lines=False)
data = pd.read_csv('../data/passwords.txt', header=None, on_bad_lines='skip')
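To see what these two options do, here is a minimal sketch on a small in-memory sample rather than the real file. The sample strings and the variable names `raw` and `sample` are made up for illustration, and the sketch assumes pandas 1.3 or newer, where the skip option is spelled `on_bad_lines="skip"` (older releases spell the same option `error_bad_lines=False`).

```python
import io
import pandas as pd

# Hypothetical stand-in for the passwords file: one password per line.
# The third line contains a comma, so it parses as two fields instead of one.
raw = "rambo144\nprimoz123\nbad,line\nsal1387\n"

# header=None: treat the first line as data, not as column names.
# on_bad_lines="skip": silently drop rows that have too many fields.
sample = pd.read_csv(io.StringIO(raw), header=None, on_bad_lines="skip")

print(sample.shape)        # (3, 1): the malformed row was skipped
print(sample[0].tolist())  # the single column is labeled 0 by default
```

Because no header row is read, pandas labels the only column with the integer 0.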
Now that we have our data imported, let's look at the shape attribute of the DataFrame to see how many rows and columns we have:
# shape attribute gives us tuple of (# rows, # cols)
# 1,048,489 passwords
print(data.shape)
(1048489, 1)
Since we only have one column to worry about (the actual text of the password), as a good practice, let's call on the dropna method of the DataFrame to remove any null values:
# the dropna method will remove any null values from our dataset. We pass inplace=True so that the
# change is applied to data itself instead of being returned as a new DataFrame
data.dropna(inplace=True)
# still 1,048,485 passwords after dropping null values
print(data.shape)
(1048485, 1)
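When null values turn up in a file of plain strings, it can be worth checking where they came from. By default, read_csv treats certain strings, such as NA and null, as missing values, so a password that happens to be one of those strings is read as null. Whether that is what happened in this dataset is an assumption, not something the output above confirms. The sketch below uses a made-up sample to show the behavior and how dropna responds to it:

```python
import io
import pandas as pd

# Hypothetical sample: "NA" and "null" are plausible passwords, but
# read_csv parses both strings as missing values by default.
raw = "rambo144\nNA\nprimoz123\nnull\n"

parsed = pd.read_csv(io.StringIO(raw), header=None)
print(parsed.shape)            # (4, 1)
print(parsed[0].isna().sum())  # 2 values were parsed as missing

# Without inplace=True, dropna returns a new DataFrame and leaves parsed unchanged
cleaned = parsed.dropna()
print(cleaned.shape)           # (2, 1)

# keep_default_na=False keeps such strings as literal text instead
literal = pd.read_csv(io.StringIO(raw), header=None, keep_default_na=False)
print(literal[0].tolist())
```

If the dropped rows matter for the analysis, reading the file with `keep_default_na=False` is one way to keep them.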
We only lost four passwords. Now let's take a look at what we are working with. Let's give our only column a descriptive name, text, and call the head method:
# let's change the name of our column to something more descriptive
data.columns = ['text']
# the head method will return the first n rows (default 5)
data.head()
Running the head method reveals the first five passwords in our dataset:
| | text |
|---|---|
| 0 | 7606374520 |
| 1 | piontekendre |
| 2 | rambo144 |
| 3 | primoz123 |
| 4 | sal1387 |
Let's isolate our only column as a pandas 1-D Series object and name the variable text. Once we have our Series object in hand, we can use value_counts to see the most common passwords in our dataset:
# we will grab a single column from our DataFrame.
# A 1-Dimensional version of a DataFrame is called a Series
text = data['text']
# show the type of the variable text
print(type(text))
# the value_counts method counts how often each unique value occurs in a Series and sorts the result
# by frequency, so the most used passwords come first. In this case, the most common password (0) appears 21 times
text.value_counts()[:10]
0            21
123          12
1            10
123456        8
8             8
5             7
2             7
1230          7
123456789     7
12345         6
This is interesting because we see some expected passwords (12345), but also odd because several of the most common entries are only one character long, and most sites would not allow one-character passwords. Therefore, in order to get a better picture, we will have to do some manual feature extraction.
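As a first, hedged illustration of a hand-built feature, the sketch below counts how many passwords in a small made-up Series are exactly one character long. The `sample` Series and its values are hypothetical stand-ins for `text`. On the real data, digit-only passwords may have been parsed as numbers, in which case casting with `astype(str)` before using the `.str` accessor is a reasonable precaution.

```python
import pandas as pd

# Hypothetical mini-sample standing in for the `text` Series
sample = pd.Series(["0", "123", "0", "12345", "123", "0", "rambo144"], name="text")

# value_counts sorts by frequency, most common first
counts = sample.value_counts()
print(counts.head(2))

# Length of each password, cast to str first in case any value is not a string
lengths = sample.astype(str).str.len()

# Number of one-character passwords in the sample
print((lengths == 1).sum())  # 3
```

The same two lines applied to `text` would show how much of the dataset consists of one-character entries.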