Introduction to our password dataset

Let's begin with the basics. We'll import our dataset and get a sense of the quantity of data that we are working with. We will do this by using pandas to import our data:

# pandas is a powerful Python-based data package that can handle large quantities of row/column data
# we will use pandas many times during these videos. a 2D group of data in pandas is called a 'DataFrame'

# import pandas
import pandas as pd

# use the read_csv function to read in a local file of leaked passwords
# here we specify `header=None` to tell pandas that the file has no header row (no column titles)
# we also tell pandas to skip over any row that raises a parsing error (on_bad_lines='skip';
# pandas versions before 1.3 spelled this option error_bad_lines=False)
data = pd.read_csv('../data/passwords.txt', header=None, on_bad_lines='skip')
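
To see what these two options do without needing the real file, here is a minimal sketch that parses a few made-up lines from an in-memory string. The sample text and the names raw and demo are illustrative assumptions, not part of the dataset, and on_bad_lines='skip' assumes pandas 1.3 or later:

```python
import io

import pandas as pd

# made-up sample standing in for the password file (an assumption, not real data);
# the third line contains a comma, so it parses as two fields instead of one
raw = "rambo144\nprimoz123\nbad,line\nsal1387\n"

# header=None: treat the first line as data rather than as column titles
# on_bad_lines='skip': drop any line with more fields than expected
demo = pd.read_csv(io.StringIO(raw), header=None, on_bad_lines="skip")

print(demo.shape)          # the malformed line is skipped, leaving 3 rows and 1 column
print(list(demo.columns))  # with no header row, pandas labels the single column 0
```

Because no header row was read, the only column is addressed by the integer label 0 until it is renamed.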

Now that we have our data imported, let's check the shape attribute of the DataFrame to see how many rows and columns we have:

# shape attribute gives us tuple of (# rows, # cols)

# 1,048,489 passwords
print(data.shape)

(1048489, 1)

Since we only have one column to worry about (the actual text of the password), as a good practice, let's call on the dropna method of the DataFrame to remove any null values:

# the dropna method removes any rows that contain null values from our dataset. We pass inplace=True so that the
# DataFrame itself is modified instead of a cleaned copy being returned
data.dropna(inplace=True)

# still 1,048,485 passwords after dropping null values
print(data.shape)
(1048485, 1)
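
The role of inplace is easier to see on a tiny example. The following is a minimal sketch on made-up rows (the names toy and cleaned are illustrative assumptions) showing that, without inplace=True, dropna returns a cleaned copy and leaves the original untouched:

```python
import pandas as pd

# made-up rows (an assumption for illustration); None marks a missing password
toy = pd.DataFrame({"text": ["rambo144", None, "sal1387"]})

# without inplace=True, dropna returns a new DataFrame and does not modify toy
cleaned = toy.dropna()

print(toy.shape)      # the original still has all 3 rows
print(cleaned.shape)  # the copy has 2 rows once the null is removed
```

Reassigning the result (data = data.dropna()) is an equivalent alternative to passing inplace=True.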

We lost only four passwords. Now let's take a look at what we are working with. We will give our only column a descriptive name, text, and then call the head method:

# let's rename our only column to something more meaningful
data.columns = ['text']

# the head method will return the first n rows (default 5)

data.head()

Running the head method reveals the first five passwords in our dataset:

           text
0    7606374520
1  piontekendre
2      rambo144
3     primoz123
4       sal1387

 

Let's isolate our only column as a one-dimensional pandas Series object and store it in a variable called text. With the Series in hand, we can use value_counts to see the most common passwords in our dataset:

# we will grab a single column from our DataFrame. 
# A 1-Dimensional version of a DataFrame is called a Series
text = data['text']

# show the type of the variable text
print(type(text))

# the value_counts method counts how often each unique value occurs in a Series and sorts the results from most
# to least common; in this case, even the most common password appears only 21 times
text.value_counts()[:10]


0            21
123          12
1            10
123456        8
8             8
5             7
2             7
1230          7
123456789     7
12345         6

This is interesting because we see some expected passwords (such as 12345), but it is also odd because most sites would not allow one-character passwords. Therefore, in order to get a better picture, we will have to do some manual feature extraction.
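
One quick way to check how common such short entries are is to measure the length of each password. The following is a minimal sketch on a small made-up Series rather than the real dataset; the names sample, lengths, and n_single are illustrative assumptions:

```python
import pandas as pd

# made-up sample (an assumption); values are kept as strings so that
# length means the number of characters in the password
sample = pd.Series(["0", "123", "1", "rambo144", "12345"])

# the str.len accessor returns the character count of each entry
lengths = sample.str.len()

# count how many entries are exactly one character long
n_single = int((lengths == 1).sum())

print(lengths.tolist())
print(n_single)
```

On the real data, any entries that were parsed as numbers would likely need converting to strings first (for example with astype(str)); whether that is necessary depends on how the column was loaded.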
