Quantifying the existence of patterns

If you look through the addresses output from the previous step, you may notice that not all of them have a street address as outlined earlier.

A common deviation from the street address pattern is the addition of an N or S to a street name. Another deviation is initial street names that contain more than one word:

3649 N Southport Ave
4022 N Mozart St Irving Park
260-300 Osceola Ave S St Paul, MN 55102, USA
103 & 105 Misty Morning Way Savannah, Georgia
1656 Mount Eagle Place Alexandria, Virginia

Yet another deviation is the omission of street numbers:

West Outer Drive Dearborn, Michigan
Crown St New Haven, CT, USA

Depending on the project, you will usually need to decide how far to go to capture all of the variations in the data. The more complex the pattern, the more work it will take to capture.

Due to this trade-off, it is helpful to quantify how much of the data is captured by a particular pattern. In the next few subsections, I will walk through the process of creating a pattern string to capture the street address. 

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset