Looking for patterns

Creating a good regular expression is a bit of a design process. A regular expression that is too rigid may miss some of the strings it should match. On the other hand, a regular expression that is not specific enough may match a large number of strings incorrectly.

The key is to look for a well-defined pattern in the data that easily distinguishes the correct matches from otherwise incorrect matches. It is usually a helpful first step to look through the data itself. This allows you to get an intuitive sense for the existence and frequency of certain patterns.

The following Python script uses pandas to read the dataset into a pandas DataFrame, extract the address column, and print a random sample of 100 addresses using the pandas Series.sample() function. A random seed of 0 is used so that the printout is the same on every run. The script is available in the external resources as explore_addresses.py.

import pandas

## read data to a dataframe
addresses = pandas.read_csv("data/scf_address_data.csv")

## extract address column and print random sample to
## output; random_state is an argument of sample(), not print()
print(addresses["address"].sample(100, random_state=0))

Note that I could have just used the default printout of the address column rather than a random sample. It's always a good idea, however, to anticipate that the data might be skewed in some way. It's possible that the first 100 addresses have one format while the other addresses have another format. Randomizing the sample makes it more likely that you get a representative view of the data.
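The difference between the first rows and a random sample can be sketched on made-up data. The series below is hypothetical and is not taken from the dataset: its first 50 entries use one format and its last 50 use another, which is the kind of ordering the previous paragraph warns about.

```python
import pandas

# Hypothetical toy data: 50 standard-looking addresses followed by
# 50 intersection-style addresses (not from the real dataset).
toy = pandas.Series(
    ["12 Oak St Springfield, IL"] * 50 + ["Corner Of Elm And Pine"] * 50,
    name="address",
)

# The first rows only ever show the first format.
first_rows = toy.head(10)

# A seeded random sample draws from the whole series and is reproducible.
random_rows = toy.sample(10, random_state=0)

print(first_rows.tolist())
print(random_rows.tolist())
```

Looking only at first_rows would suggest that every address starts with a street number, while a sample drawn from the whole series is far more likely to show both formats.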

After running the previous program, you should see a sample of addresses printed out to the terminal.

Looking over the output, it appears that most addresses fall under the pattern that is standard for US addresses, with the street number first, then the street name, and then the city, state, and zip code.

However, not all of the addresses follow this pattern. For example:

Spring St. & Market St.  Greensboro, North Car...
Pleasant And Main St Malden, Massachusetts
Corner Of Lima Ave And Cowan

It will therefore be important to filter out the entries where the street name cannot be determined by a pattern. 
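One simple way to start separating the two groups is sketched below. It assumes, as a first approximation only, that an address in the standard format begins with a street number; the variable names and the four-row series (copied from the examples in this section) are for illustration.

```python
import pandas

# A few addresses taken from the examples in this section.
sample = pandas.Series([
    "6836 Mortenview Dr Taylor, MI 48180, USA",
    "3401 Garland Ave Richmond, Virginia",
    "Pleasant And Main St Malden, Massachusetts",
    "Corner Of Lima Ave And Cowan",
])

# Assumption: a standard-format address starts with one or more digits
# followed by whitespace. Series.str.match anchors at the string start.
starts_with_number = sample.str.match(r"\d+\s")

standard = sample[starts_with_number]
nonstandard = sample[~starts_with_number]

print(standard.tolist())
print(nonstandard.tolist())
```

This check is deliberately loose. It only sets aside entries that clearly lack a street number, and a stricter pattern is still needed to confirm that a street name follows.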

Of the addresses that do fit the standard format, the simplest case seems to be where the street name consists of one word, followed by a street suffix:

6836 Mortenview Dr Taylor, MI 48180, USA
3401 Garland Ave Richmond, Virginia
8982 Channing Avenue Westminster, California
230 Quincy Ave Quincy, MA 02169, USA

In this case, the address begins with three components that make up the street address:

<street number> <initial street name> <street suffix>

These three components provide an excellent way of pinning down the street name. The street number indicates the beginning of the street name. The street suffix both indicates the end of the street name and confirms that the identified string refers to a street name.

A good approach, then, is to first identify and extract the street address from the full address string. From there, the street name, consisting of the initial street name and the street suffix, can be extracted from the street address.
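As a minimal sketch of this idea, the three components can be written as named groups in a single regular expression. The suffix list, the pattern, and the function name extract_street_name are assumptions made for illustration; a real suffix list would need to be much longer, and the pattern only covers one-word street names.

```python
import re

# Assumption: a short, illustrative suffix list; real data needs more entries.
SUFFIXES = ["Avenue", "Ave", "Dr", "St"]

# <street number> <initial street name> <street suffix>, anchored at the start.
STREET_ADDRESS = re.compile(
    r"^(?P<number>\d+)\s+"
    r"(?P<name>[A-Za-z]+)\s+"
    r"(?P<suffix>" + "|".join(SUFFIXES) + r")\b"
)


def extract_street_name(address):
    """Return '<initial street name> <street suffix>', or None if no match."""
    match = STREET_ADDRESS.match(address)
    if match is None:
        return None
    return f"{match.group('name')} {match.group('suffix')}"


print(extract_street_name("6836 Mortenview Dr Taylor, MI 48180, USA"))
print(extract_street_name("8982 Channing Avenue Westminster, California"))
print(extract_street_name("Corner Of Lima Ave And Cowan"))
```

The first two calls return the street name with its suffix, and the last returns None because the address does not begin with a street number. The word boundary after the suffix keeps a short suffix such as Ave from matching only the first part of a longer word such as Avenue.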
