Chapter 39

Using the DQS Cleansing Transform

In the previous lesson, you created a DQS knowledge base and cleansed some data, all interactively. As you improve the quality, domain coverage, and capability of your knowledge base, it will be able to correct a larger and larger percentage of values from new incoming data. As this occurs, you will benefit from automating as much of the cleansing as possible. You may want values that are correct, or corrected with high confidence, to move directly into the destination; then you can review and fix only the remaining values. This capability exists in SSIS as the DQS Cleansing Transform. Additionally, you can correct and approve or reject the remaining values using the DQS Client. The Cleansing Transform and the DQS Client cooperate intelligently.

The Cleansing Transform accepts a Data Flow as input, cleanses the data using the knowledge base of your choice, adds output metadata, and passes the Data Flow forward. A commonly used Data Flow for the Cleansing Transform is shown in Figure 39-1.


NOTE The DQS Client is interactive and multithreaded, written to run as fast as possible, because most of us are impatient. That fast run time comes at the cost of high memory use on the client. Because no one sits and waits for the Cleansing Transform to run in SSIS, it was written single-threaded to reduce its memory footprint, and it therefore runs for a longer period of time.

When you double-click the DQS Cleansing Transform, its editor opens, containing only three tabs: Connection Manager, Mapping, and Advanced.

The Connection Manager tab is where you choose the connection manager that points to your DQS server. After you choose the data quality connection manager, you can select which of the knowledge bases on that server should be used by the task. At the bottom of the window is the Configure error output check box. By default, the task will fail if there is an error, but you can choose to redirect the error rows.

In the Mapping tab, you choose which columns from the input should be cleansed and associate each of those input columns with its corresponding domain. The standard output columns for each input are Source, Output, Status, Confidence, and Reason. The default column names are supplied, but you can change these names if you want. Figure 39-2 shows the Mapping tab.

While each column has a status associated with it, there is also a status associated with the entire row. This enables you to use either or both statuses. The row status tells you the worst column status, which should indicate whether the data steward should trust the row or review the values in the row. For instance, if all the column statuses are Correct, the row status should be Correct. If all column statuses are Correct, except for one which is Corrected, the row status is Corrected. Though you are free to do whatever you want with these row statuses, most examples indicate Correct and Corrected as trusted values, which are sent forward without additional scrutiny. Any other row status can require individual examination, correction and approval, or rejection.
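The "worst column wins" rule described above can be sketched in a few lines of Python. This is an illustration only, not DQS code: the precedence ordering below is an assumption for the example, and DQS itself assigns the row status during cleansing.

```python
# Hypothetical sketch of deriving a row-level status from per-column DQS
# statuses. Lower index = more trusted; higher index = needs more scrutiny.
PRECEDENCE = ["Correct", "Corrected", "Suggested", "New", "Invalid"]

def row_status(column_statuses):
    """Return the least-trusted status among the row's column statuses."""
    return max(column_statuses, key=PRECEDENCE.index)

print(row_status(["Correct", "Correct", "Correct"]))    # Correct
print(row_status(["Correct", "Corrected", "Correct"]))  # Corrected
print(row_status(["Correct", "New", "Corrected"]))      # New
```

A row of all-Correct columns stays Correct; a single less-trusted column status drags the whole row down to that status.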

The Advanced tab shown in Figure 39-3 contains a list of additional options that you can set.

  • Standardize output—When selected, your output will be formatted using the “Format Output To” domain properties chosen for the domain. Formatting properties for domains depend on the data type of the domain:
    • String—Uppercase, lowercase, or capitalized
    • Date—Format of the day, month, and year
    • Integer—Apply a format mask
    • Decimal—Precision and format mask
  • Confidence—When chosen, this value is provided for each source column. The confidence score between 0 and 100 indicates the level of trust that the DQS Server or relevant Reference Data Source has in the accuracy of the suggested value. Higher values indicate a higher level of confidence.
  • Reason—When chosen, this value is provided for each source column. This column explains the output column value. The reason “Domain Value” indicates a valid value found in the domain. “New Value” indicates the value is not marked as error or invalid in the domain, nor does it exist as a valid value. “New Value” also includes low confidence suggestions. Other examples of reasons are “Unknown,” “Invalid,” and “Corrected to leading value.”
  • Appended Data—If your domain uses a reference data source, the reference can return additional information about the domain. For instance, an address validation reference data source can provide additional longitude/latitude information.
  • Appended Data Schema—Automatically selected when Appended Data is selected, this adds the schema definition you would use to interpret the information provided in the appended data column.
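To make the optional output columns concrete, here is an illustrative sketch of how a downstream step might use the Confidence and Status columns together. The column names and values are assumptions for the example, not actual DQS Cleansing Transform output.

```python
# Illustrative only: a cleansed row as a dict, with hypothetical
# Source/Output/Status/Confidence/Reason columns for one mapped column.
row = {
    "FirstName_Source": "Jhon",
    "FirstName_Output": "John",
    "FirstName_Status": "Corrected",
    "FirstName_Confidence": 77.0,  # DQS confidence is between 0 and 100
    "FirstName_Reason": "Corrected to leading value",
}

# Trust an automatic correction only above a chosen confidence threshold.
THRESHOLD = 70.0
trusted = (row["FirstName_Status"] == "Corrected"
           and row["FirstName_Confidence"] >= THRESHOLD)
print(trusted)  # True
```

The threshold itself is a design choice for your data steward; lower-confidence corrections would fall through to manual review.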

With the Cleansing Transform configured, you need to route the output to complete the package. You will use a Conditional Split Task to route the rows into two Data Flows: one Data Flow that will continue and be automatically added to the data warehouse, and another Data Flow that requires manual review and decisions about the data quality. Using the row status as the source, values of Correct and Corrected are routed to a “ready to use” flow. Any other value in the row status causes the rows to be routed to the “items for review” Data Flow.
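The routing logic the Conditional Split performs can be sketched as follows. This is a minimal illustration, assuming each cleansed row is a dict carrying the DQS Record Status output column; the names and sample values are hypothetical.

```python
# Sketch of the Conditional Split routing: Correct/Corrected rows go to
# the "ready to use" flow, everything else to "items for review."
TRUSTED = {"Correct", "Corrected"}

def split_rows(rows):
    ready, review = [], []
    for row in rows:
        (ready if row["Record Status"] in TRUSTED else review).append(row)
    return ready, review

rows = [
    {"FirstName": "Anna", "Record Status": "Correct"},
    {"FirstName": "Jon",  "Record Status": "Corrected"},
    {"FirstName": "Xzq",  "Record Status": "New"},
]
ready, review = split_rows(rows)
print(len(ready), len(review))  # 2 1
```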

Every time your package runs, a data cleansing project is created. This means the data steward can use the Client tool to interactively review the data from the SSIS cleansing in the same way that was done during knowledge discovery. You can use the DQS Client tool to open the data quality project and review, correct, approve, or reject values.

After the DQS project review is complete, you can import the new information into the knowledge base, continuously improving the knowledge base. To bring the project information into the knowledge base, open the knowledge base in the Client tool and choose Domain Management. In the Domain Values tab, choose the Import Values icon, as shown in Figure 39-4.

Try It

In this Try It, you create an SSIS package that uses the DQS Cleansing Transform. After the data is cleansed, you split the output into two Data Flows, one with acceptable values and another that requires further inspection. After you complete this lesson, you will be able to use SSIS to automatically clean your data.

You can download the completed Lesson39.dtsx and sample files for this lesson from www.wrox.com. You should copy these files to the C:\Projects\SSISPersonalTrainer directory.

Lesson Requirements

You must have elevated permissions to complete this lesson. The minimum permission is membership in the dqs_kb_editor security role in the DQS_Main database. To create the knowledge base that will be used for cleansing, import the Lesson39KB.dqs file.

Hints

  • Import Lesson39KB.dqs, which you downloaded from the www.wrox.com site to create the knowledge base.
  • Use the DQS Cleansing Transform to cleanse the data.
  • Use the Conditional Split Transform to partition the output and direct it to two Data Flows.

Step-by-Step

1. Run the Data Quality Services Client. You can find it under All Programs > Microsoft SQL Server 2012 > Data Quality Services. You will see a Connect To Server dialog box. Type the name of the server where DQS is installed. Click Connect.
2. In the Knowledge Base Management tab, click New Knowledge Base. Use the following to complete the dialog box and click Next.
  • Name—Lesson39KB
  • Create Knowledge Base From—Import from DQS file
  • Select data file—Use Lesson39KB.dqs in your folder
3. Create a new SSIS package named Lesson39.dtsx. Drag a Data Flow Task onto the Control Flow tab. Double-click it to open the Data Flow tab.
4. Drag a Flat File Source onto the Data Flow tab, and double-click it to open the editor. The Flat File Source Editor window opens. Click New to create a connection manager. Name the connection manager Lesson39SampleData. Click Browse, navigate to your Lesson39SampleData.txt file, and select it. Click Preview and take a look at the rows to import. Notice the sample file contains some misspellings, abbreviations, and new data. Click OK to exit the Preview, and click OK again to exit the editor.
5. Drag the DQS Cleansing Transform onto the Data Flow tab, and connect the Flat File Source to it by dragging the blue line from the Flat File Source to the DQS Cleansing Transform. Then double-click it to open the editor.
6. In the Connection Manager tab, click New to create a connection manager that points to the DQS server that contains your knowledge base. Type in the name of your DQS server. Test the connection and click OK. In the Data Quality Knowledge Base tab, select Lesson39KB.
7. Choose the Mapping tab. Select all of the columns and map them to their corresponding domains. Refer back to Figure 39-2 to see what your tab should look like.
8. Choose the Advanced tab. Make the following selections:
  • Standardize output—Checked
  • Confidence—Checked
  • Reason—Checked
  • Appended Data—Checked
Your tab should look like the one shown back in Figure 39-3. Click OK to exit the editor.
9. Right-click the DQS Connection Manager in the Connection Managers window and rename it to DQS Cleansing Connection Manager.
10. Drag a Conditional Split Transform onto the Data Flow, and connect it to the Cleansing Transform by dragging the blue line from the Cleansing Transform. Double-click the conditional split to open the editor. Make the following selections to complete the configuration:
  • Output Name—Ready to Use
  • Condition—[Record Status] == "Correct" || [Record Status] == "Corrected"
  • Default Output Name—Items for Review
Your Conditional Split Transformation Editor window should look like Figure 39-5. Click OK to complete the configuration. You will now have two good Data Flows from the conditional split: Ready to Use and Items for Review.
11. Now you will connect each of the two data flows you created in the Conditional Split Task to an output. Drag a Flat File Destination onto the Data Flow tab, and name it Ready to Use. Connect it to the Conditional Split by dragging the blue line from the Conditional Split to the Flat File Destination Ready to Use. An Input Output Selection dialog box appears, enabling you to choose which output from the conditional split to use. Choose Ready to Use Output. Then click OK.
12. Double-click the Ready to Use Destination to open the editor. Choose New for Flat File Connection Manager. Then choose Delimited in the Flat File Format dialog box. Click OK to exit the dialog box. The Flat File Connection Manager appears.
  • Connection Manager Name—Ready to Use
  • File Name—C:\Projects\SSISPersonalTrainer\Lesson39ReadytoUse.txt
  • Column names in the first data row—Checked
Click OK to exit the Connection Manager Transformation Editor. You are returned to the Flat File Destination Editor. Click the Mappings tab so the columns will be configured. Click OK.
13. Create a destination for the Items for Review Data Flow. Drag a Flat File Destination onto the Data Flow tab, and name it Items for Review. Connect it to the Conditional Split by dragging the blue line from the Conditional Split to the Flat File Destination Items for Review. The last time you dragged a Conditional Split blue output, a dialog box allowed you to choose which of the two outputs to map. Because there is only one remaining blue output—Items for Review—no dialog box is presented. The only remaining output is used.
14. Double-click the Items for Review Destination to open the editor. Choose New for Flat File Connection Manager. Then choose Delimited in the Flat File Format dialog box. Click OK to exit the dialog box. The Flat File Connection Manager appears. Complete the configuration using the following information:
  • Connection Manager Name—Items for Review
  • File Name—C:\Projects\SSISPersonalTrainer\Lesson39ItemsforReview.txt
  • Column names in the first data row—Checked
Click OK to exit the Connection Manager Editor. You are returned to the Flat File Destination Editor. Click the Mappings tab so the columns will be configured. Click OK. Your Data Flow should look like Figure 39-1 earlier in this lesson.
15. Save and run the package. The completed Data Flow should look like Figure 39-6. Nine rows were read, and seven of them were correct or corrected, leaving two rows in the Items for Review file.

NOTE You can open these files in Excel and review them if you want. If you do, they are delimited files with a tab delimiter.
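If you prefer scripted inspection over Excel, the output files are plain tab-delimited text. A sketch using Python's csv module (the file path is assumed to match the one created in this lesson):

```python
# Read a tab-delimited DQS output file into a list of rows.
import csv

def load_tsv(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f, delimiter="\t"))

# Example (path assumed from this lesson's configuration):
# rows = load_tsv(r"C:\Projects\SSISPersonalTrainer\Lesson39ReadytoUse.txt")
# The first row holds the column names, the rest hold the cleansed values.
```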

16. To review the work done by this run of the Cleansing Transform, run the DQS Client application. Click Open Data Quality Project, then scroll through the Open Project list, select your SSIS package run, and select Next.
17. Here you can change, approve, and reject values just as you did in the previous lesson. Figure 39-7 shows that two new values in the First Name domain came from this cleansing exercise. Once you have completed your changes, approvals, and rejections, you can export this information to be used as input for the next cleansing activity.

Please select Lesson 39 on the DVD, or online at www.wrox.com/go/ssis2012video, to view the video that accompanies this lesson.
