Chapter 30

Handling Bad Data with the Fuzzy Lookup

More often than not, when you are working in the real world, data is not going to be perfect like it is in the AdventureWorks2012 database. Real-world situations call for cleansing dirty data or data that has abnormalities like misspellings or truncation.

Imagine you are attempting to retrieve a foreign key from a dimension table, but, strangely, you find rows without a match. Upon investigation, you find bad data is being supplied to you. One technique might be to divert these rows without matches to a table to be dealt with later; another might be to just add the bad data regardless of misspellings and other mishaps that occur during data entry.

The Fuzzy Lookup Transform, discussed in this lesson, and the Fuzzy Grouping Transform, discussed in the next lesson, gives other alternatives to dealing with dirty data while reducing your number of unmatched rows. The Fuzzy Lookup Transform matches input records with data that has already been cleansed in a reference table. It returns the match and can also indicate the quality of the match. This way you know the likelihood of the match being correct.


NOTE A best practice tip is to use the Fuzzy Lookup Transform only after trying a regular lookup on the field first. The Fuzzy Lookup Transform is a very expensive operation that builds specialized indexes of the input stream and the reference data for comparison purposes. Therefore, it is recommended to first use a regular Lookup Transform and then divert only those rows not matching to the Fuzzy Lookup Transform.

During the configuration of the transform, you must specify a reference table to be used for comparison. Figure 30-1 shows the reference table selection being made in the Fuzzy Lookup Transformation Editor. The transform uses this reference data and builds a token-based index (which, despite its name, is actually a table) before it begins the process of comparing entries.

Using the Fuzzy Lookup Transform requires at least one field to be a string, either a DT_WSTR or DT_STR data type. On the Columns tab in the editor, you need to map at least one text field from the input to the reference table for comparison.

The Advanced tab contains the settings that control the fuzzy logic algorithms. You can set the maximum number of matches to output per incoming row. The default is set to 1, which pulls only the best record out of the reference table that meets the similarity threshold. Incrementing this setting higher than the default might generate more results that you’ll have to sift through, but it might be required if you have too many closely matching strings in your data. A slider controls the similarity threshold. When you are experimenting, a good strategy is to start this setting at 0.5 and move up or down as you review the results. This setting is normally decided based on a businessperson’s review of the data, not the developer’s review. If a row cannot be found that’s similar enough, the columns that you checked in the Columns tab will be set to NULL. The token delimiters can also be set if, for example, you don’t want the comparison process to break up incoming strings with a period (.) or spaces. The default for this setting is all common delimiters. See Figure 30-2 for an example of an Advanced tab.

The transform creates several output columns that you may or may not decide are useful to store in a table. Either way, they are important to understand:

  • Input and Pass-Through Field Names and Values—This column contains the name and value of the text input provided to the Fuzzy Lookup Transform or passed through during the lookup.
  • Reference Field Name and Value—This column contains the name and value(s) of the matched results from the reference table.
  • Similarity—This column contains a number between 0 and 1 representing similarity. Similarity is a threshold calculated by comparing one word with another; you set this when configuring the Fuzzy Lookup Transform. The closer this number is to 1, the closer the two text fields match. A similarity of 1 would indicate an exact match.
  • Confidence—This column contains a number between 0 and 1 representing confidence of the match relative to the set of matched results. Confidence is different from similarity; it is not calculated by comparing just one string against another, but rather by comparing the chosen string match against all the other possible matches. Confidence gets better the more accurately your reference data represents your subject domain, and it can change based on the sample of the data coming into the ETL process.

You may not want to use each of these fields, but it is important to appreciate the value they could provide.

Try It

In this Try It, you use the Fuzzy Lookup Transform to attempt to correct some bad data that you receive in a flat file. After this lesson, you should have an idea of how useful the Fuzzy Lookup Transform can be in cleansing your data.

You can download the completed Lesson30.dtsx and other sample files for this lesson from www.wrox.com.

Lesson Requirements

Create a table in the AdventureWorks2012 database named Occupation, using the following code (which you can find as part of this lesson’s download on the book’s website at www.wrox.com):

CREATE TABLE [dbo].[Occupation](
    [OccupationID] [smallint] IDENTITY(1,1) NOT NULL,
    [OccupationLabel] [varchar](50) NOT NULL,
 CONSTRAINT [PK_Occupation_OccupationID] PRIMARY KEY CLUSTERED
(
    [OccupationID] ASC
) ON [PRIMARY]
) ON [PRIMARY]

GO
INSERT INTO [dbo].[Occupation] Select ‘CUSTOMER SERVICE REPRESENTATIVE’
INSERT INTO [dbo].[Occupation] Select ‘SHIFT LEADER’
INSERT INTO [dbo].[Occupation] Select ‘ASSISTANT MANAGER’
INSERT INTO [dbo].[Occupation] Select ‘STORE MANAGER’
INSERT INTO [dbo].[Occupation] Select ‘DISTRICT MANAGER’
INSERT INTO [dbo].[Occupation] Select ‘REGIONAL MANAGER’

Download the flat file named FuzzyExample.txt from www.wrox.com to use as your source. Save this file to the C:ProjectsSSISPersonalTrainer directory. Correct the bad data from this flat file and insert it to a new table called EmployeeRoster.

Hints

  • Remember the best practice tip mentioned earlier in this lesson. First, attempt to use a regular Lookup and then use the Fuzzy Lookup to catch the bad data.

Step-by-Step

1. Create a new package and name it Lesson30, or download the completed Lesson30.dtsx package from www.wrox.com.
2. Drag a Data Flow Task onto your designer and name it DFT - Fuzzy Lookup.
3. Create a new Flat File Connection Manager (creating connection managers is discussed in Lesson 6), name it New Employee, and point it to C:ProjectsSSISPersonalTrainerFuzzyExample.txt. Check the Column names in the first data row option. The editor should look like Figure 30-3:
4. In the Data Flow, bring a new Flat File Source over and name it New Employee Load. Open the editor and make the connection manager the newly created New Employee.
5. On the Columns tab, change the name of the output columns to LastName, FirstName, and OccupationLabel.
6. Open Management Studio and run the following query to create a new table called Occupation (you can download the query from www.wrox.com):
CREATE TABLE [dbo].[Occupation](
    [OccupationID] [smallint] IDENTITY(1,1) NOT NULL,
    [OccupationLabel] [varchar](50) NOT NULL,
 CONSTRAINT [PK_Occupation_OccupationID] PRIMARY KEY CLUSTERED
(
    [OccupationID] ASC
) ON [PRIMARY]
) ON [PRIMARY]

GO
INSERT INTO [dbo].[Occupation] Select ‘CUSTOMER SERVICE REPRESENTATIVE’
INSERT INTO [dbo].[Occupation] Select ‘SHIFT LEADER’
INSERT INTO [dbo].[Occupation] Select ‘ASSISTANT MANAGER’
INSERT INTO [dbo].[Occupation] Select ‘STORE MANAGER’
INSERT INTO [dbo].[Occupation] Select ‘DISTRICT MANAGER’
INSERT INTO [dbo].[Occupation] Select ‘REGIONAL MANAGER’
7. Next, create another connection manager, this time an OLE DB Connection Manager, using the AdventureWorks2012 database.
8. Drag a Lookup Transform on the design surface and use the new [dbo].[Occupation] table to select the OccupationID based on the OccupationLabel that exists in both the source and the reference table. (Refer back to Lesson 24 if you need help with a regular Lookup.) Figure 30-4 shows what your mapping should look like. Lastly, before closing the editor, make sure to specify in the General tab that non-matching entries should redirect rows to no match output.
9. You already know from the lesson description that the source data is dirty, so now you’re going to use a Fuzzy Lookup Transform to catch all the bad data the regular Lookup doesn’t recognize. Drag a new Fuzzy Lookup Transform in the Data Flow and connect the blue no match output arrow from the Lookup Transform to it.
10. Open the Fuzzy Lookup and select [dbo].[Occupation] for the Reference table name property. Figure 30-5 shows the Fuzzy Lookup Transformation Editor using the Occupation table as the reference table.
11. The Columns tab should be joined by OccupationLabel as shown in Figure 30-6. It should also return the OccupationID and OccupationLabel from the reference table, which you can ensure by checking the boxes in the Available Lookup Columns box. The OccupationLabel from the reference table should replace the same column from the input stream to correct bad data. To do this, uncheck the OccupationLabel column from the Available Input Columns.
12. Next, in the Advanced tab, leave the Similarity threshold at the default setting and change the token delimiters to use only a period in the Additional delimiters box, as reflected in Figure 30-7. Also, modify the Similarity threshold to 0.50 and then click OK.
13. To bring together the data from both lookup transforms, drag a Union All over and connect the two lookups to it. First, connect the blue arrow from the Fuzzy Lookup Transform and then connect the blue arrow from the regular Lookup Transform. Then open the Union All Transformation Editor and delete the unneeded columns by right-clicking and selecting Delete on the columns that are not pictured in Figure 30-8. You may also need to rename the output of OccupationLabel to not include (1) in the name.
14. To finish off this package you need to load the results into a new table. Bring an OLE DB Destination onto the design surface, and from within the editor, select New to create a new table. Use the following code to create the EmployeeRoster table or download the code from www.wrox.com:
CREATE TABLE [EmployeeRoster] (
    [EmployeeID] [smallint] IDENTITY(1,1) NOT NULL,
    [LastName] varchar(50),
    [FirstName] varchar(50),
    [OccupationID] smallint,
    [OccupationLabel] varchar(50)
)
15. Once the mapping has been set in the destination, click OK and your package is complete. A successful run of this package should look like Figure 30-9. Compare the EmployeeRoster table to the original flat file you started with, and you will see the Fuzzy Lookup using the reference table corrected 10 rows of dirty data.

Please select Lesson 30 on the DVD, or online at www.wrox.com/go/ssis2012video, to view the video that accompanies this lesson.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset