Understanding the need for pattern recognition

The simplest way to process the values of text fields is to treat them as categorical variables. In a categorical variable, each data entry takes on one of a fixed number of values. To illustrate working with categorical variables, consider a categorical field such as the US state. If the state of Connecticut, for instance, were to appear in a large enough number of data entries, you might expect to see certain characteristic misspellings, such as the following:

  • Conecticut
  • Conneticut
  • Connetict

An easy way to fix all of the misspellings might be to iterate through the data entries and check each value against a list of common misspellings, as is done in the following demonstration. Note that the following code sample is just for demonstration purposes and doesn't belong to a particular file:

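# Known misspellings of "Connecticut" to correct
misspellings = ["Conecticut", "Conneticut", "Connetict"]
for ind in range(len(data)):
    # Replace an exact match with the correct spelling
    if data[ind]["state"] in misspellings:
        data[ind]["state"] = "Connecticut"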
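
The same exact-match idea can be written as a self-contained sketch. The text does not define data, so the sample records, the corrections dictionary, and the fix_states helper below are all hypothetical names chosen for illustration. Using a dictionary rather than a list is one possible design choice: it lets a single lookup table cover misspellings of more than one state.

```python
# Hypothetical sample records; the original text does not define `data`.
data = [
    {"name": "Alice", "state": "Conecticut"},
    {"name": "Bob", "state": "Connecticut"},
    {"name": "Carol", "state": "Connetict"},
    {"name": "Dave", "state": "New York"},
]

# Map each known misspelling to its correction (illustrative values only).
corrections = {
    "Conecticut": "Connecticut",
    "Conneticut": "Connecticut",
    "Connetict": "Connecticut",
}


def fix_states(records, corrections):
    """Replace exact-match misspellings in each record's "state" field, in place."""
    for record in records:
        state = record["state"]
        # Fall back to the original value when no correction is known.
        record["state"] = corrections.get(state, state)
    return records


fix_states(data, corrections)
```

Values that are not in the lookup table, such as "New York" here, are left unchanged, so any misspelling that was not anticipated will also pass through uncorrected.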

In the previous demonstration, incorrect text fields are corrected based on their exact value. It is often the case, however, that you will need to analyze and change text fields on a more structural level. Working with addresses is a good example of such a task. No two addresses are alike, but addresses generally fall under a certain structure.

Specifically, in the US, proper addresses are written as follows:

<house number> <street name> <city>, <state> <zipcode>

Text fields that contain structured information, such as addresses, are, in a way, like their own data type. Such text fields are represented in Python on an atomic level as a single string, but they may contain several pieces of information. Working with text fields such as addresses requires a new programming tool called regular expressions, which can search for patterns within strings.
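
As a rough preview of what such a pattern might look like, the following sketch uses Python's built-in re module to pull the pieces out of an address string. The pattern, the parse_address helper, and the sample address are assumptions made for illustration, not a definitive address parser. Because the format shown above has no delimiter between the street name and the city, this sketch captures the two together rather than guessing where one ends and the other begins.

```python
import re

# Illustrative pattern for "<house number> <street name> <city>, <state> <zipcode>".
# Street name and city are captured together because nothing separates them.
ADDRESS_PATTERN = re.compile(
    r"^(?P<house_number>\d+)\s+"          # leading digits: the house number
    r"(?P<street_and_city>.+?),\s*"       # everything up to the comma before the state
    r"(?P<state>[A-Za-z ]+?)\s+"          # state name or abbreviation
    r"(?P<zipcode>\d{5}(?:-\d{4})?)$"     # 5-digit ZIP code, optional 4-digit extension
)


def parse_address(text):
    """Return a dict of address parts, or None if the text does not fit the pattern."""
    match = ADDRESS_PATTERN.match(text.strip())
    return match.groupdict() if match else None


# Hypothetical sample address used only to exercise the pattern.
parts = parse_address("123 Main Street Hartford, Connecticut 06103")
```

Here parts holds the house number, the combined street and city, the state, and the ZIP code as separate strings, while a string that does not follow the structure yields None.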
