The strip() function can come in quite handy in data cleaning. Called through the .str accessor, it removes leading and trailing whitespace, or any of a specified set of characters, from both ends of each string in a series. If no characters are specified, whitespace is trimmed by default. The following example creates a series with stray whitespace:
strip_series = pd.Series([" China", "U.S.A ", "U K"])
strip_series
The following is the output:
Applying strip() through the .str accessor removes the whitespace at both ends of each element:
strip_series.str.strip()
The following is the output:
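To reproduce this step end to end, a minimal self-contained sketch follows. The name `stripped` is introduced here only for illustration and does not appear in the original example.

```python
import pandas as pd

strip_series = pd.Series([" China", "U.S.A ", "U K"])

# Trim whitespace from both ends of every element via the .str accessor
stripped = strip_series.str.strip()

print(stripped.tolist())  # ['China', 'U.S.A', 'U K']
```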
This shows that strip() removes only leading and trailing whitespace and not whitespace in the middle of a string. Now, let's use strip() to remove specific characters:
sample_df["Movie"].str.strip("opnf")
The following is the output:
In the preceding example, strip() does not search for the substring "opnf" as a whole. Instead, it removes any of the characters o, p, n, and f found at either end of each series element, and stops at the first character that is not in that set.
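The character-set behavior can be checked with a small sketch. The series `titles` below uses hypothetical values chosen for illustration (it is not the sample_df from the text), and lstrip() and rstrip() are shown alongside strip() as the one-sided variants.

```python
import pandas as pd

# Hypothetical values chosen for illustration; they are not taken from sample_df
titles = pd.Series(["popcorn", "find", "piano"])

both_ends = titles.str.strip("opnf")    # trims o, p, n, f from both ends
left_only = titles.str.lstrip("opnf")   # trims the same characters from the start only
right_only = titles.str.rstrip("opnf")  # trims the same characters from the end only

print(both_ends.tolist())   # ['cor', 'ind', 'ia']
print(left_only.tolist())   # ['corn', 'ind', 'iano']
print(right_only.tolist())  # ['popcor', 'find', 'pia']
```

Note that "popcorn" becomes "cor" even though it never contains "opnf" as a substring, because each end is trimmed character by character.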
The split() function splits a string at specified delimiters. Consider the following series:
split_series = pd.Series(["Black, White", "Red, Blue, Green", "Cyan, Magenta, Yellow"])
split_series
The following is the output:
Each element has two or three items separated by a comma followed by a space. Let's use ", " as the delimiter to separate the items stacked together in each row of the series:
split_series.str.split(", ")
The following is the output:
The result is a list of items in each row, because the expand parameter is set to False by default. Setting expand to True creates a separate column for each item:
split_series.str.split(", ", expand=True)
The following is the output:
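The difference between the two calls can be seen in a short sketch. The names `as_lists` and `as_columns` are introduced here only for illustration. Because the first row has two items while the others have three, the expanded result pads the shorter row with a missing value.

```python
import pandas as pd

split_series = pd.Series(["Black, White", "Red, Blue, Green", "Cyan, Magenta, Yellow"])

# Default (expand=False): one list per row, returned as a Series
as_lists = split_series.str.split(", ")

# expand=True: one column per item, returned as a DataFrame
as_columns = split_series.str.split(", ", expand=True)

print(as_lists.tolist())
print(as_columns.shape)                  # (3, 3)
print(pd.isna(as_columns.iloc[0, 2]))    # True: the first row has no third item
```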