strip() and split()

The strip() function is handy in data cleaning. It removes leading and trailing whitespace, or any specified set of characters, from each string in a series. If the characters to remove are not specified, whitespace is trimmed by default. The following example creates a series with stray whitespace:

strip_series = pd.Series(["\tChina", "U.S.A ", "U\nK"])
strip_series

The following is the output:

Series with stray whitespaces

The following example applies the strip() function to remove leading and trailing whitespace:

strip_series.str.strip()

The following is the output:

Stripping leading and trailing whitespace

This shows that strip() removes whitespace only at the start and end of each string, not in the middle. Now, let's use strip() to remove specific characters:

sample_df["Movie"].str.strip("opnf")

The following is the output:

The strip function for removing string sequences

In the preceding example, the argument is treated as a set of characters rather than as a substring: strip() removes any of those characters from both ends of each element and stops at the first character that is not in the set.
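The two behaviors of strip() can be compared side by side in a minimal sketch. The titles series and its values are hypothetical and were chosen only for illustration; they are not the sample_df used earlier.

```python
import pandas as pd

# Series with a leading tab, a trailing space, and an embedded newline.
raw = pd.Series(["\tChina", "U.S.A ", "U\nK"])

# With no argument, whitespace is removed from both ends of each string.
# The newline inside "U\nK" is not at either end, so it is kept.
cleaned = raw.str.strip()
print(cleaned.tolist())  # ['China', 'U.S.A', 'U\nK']

# Hypothetical titles padded with stray symbols (illustrative data only).
titles = pd.Series(["--Avatar--", "**Up*-"])

# With an argument, each character in "-*" is stripped from both ends,
# in any order; the argument is a character set, not a substring.
trimmed = titles.str.strip("-*")
print(trimmed.tolist())  # ['Avatar', 'Up']
```

Note that the second title ends in "*-", which is not the literal string "-*", yet both characters are still removed.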

The split() function splits a string at specified delimiters. Consider the following series:

split_series = pd.Series(["Black, White", "Red, Blue, Green", "Cyan, Magenta, Yellow"])
split_series

The following is the output:

Sample series for the split function

Each element has two to three items separated by a comma followed by a space. Let's use this as a delimiter to separate the items stacked together in each row of the series:

split_series.str.split(", ")

The following is the output:

Splitting as a list

The result is a list of items in each row. The expand parameter creates a separate column for each item instead. By default, expand is set to False, which leads to a list being created in each row. Setting it to True returns a DataFrame:

split_series.str.split(", ", expand=True)

The following is the output:

Splitting multiple columns
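As a rough sketch of the difference between the two modes, the following uses a shortened, hypothetical version of the series above. It also shows what happens when rows contain different numbers of items, under the assumption that the delimiter is exactly a comma followed by one space.

```python
import pandas as pd

# Shortened, illustrative version of the series used above.
colors = pd.Series(["Black, White", "Red, Blue, Green"])

# Default (expand=False): each row holds a list of the split items.
as_lists = colors.str.split(", ")
print([list(items) for items in as_lists])
# [['Black', 'White'], ['Red', 'Blue', 'Green']]

# expand=True: each item gets its own column in a DataFrame.
as_columns = colors.str.split(", ", expand=True)
print(as_columns.shape)  # (2, 3)

# Rows with fewer items are padded with missing values.
print(pd.isna(as_columns.iloc[0, 2]))  # True
```

The number of columns is set by the row with the most items, so shorter rows end with missing values that may need handling later.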