Chapter 5. Loading Data into the Warehouse

The ETL Process

Populating a data warehouse involves all of the tasks related to getting the data from the source operational systems, cleansing and transforming the data to the right format and level of detail, loading it into the target data warehouse, and preparing it for analysis purposes.

Figure 5.1 shows the steps making up the extraction, transformation, and load (ETL) process. Data is extracted from the source operational systems and transported to the staging area. The staging area is a temporary holding place used to prepare the data. The staging area may be a set of flat files, temporary staging tables in the Oracle warehouse, or both. The data is integrated with other data, cleansed, and transformed into a common representation. It is then loaded into the target data warehouse tables. Sometimes this process is also referred to as ETT—extraction, transportation, and transformation.

Figure 5.1. The ETL Process

During the initial population of the data warehouse, historical data is loaded that may have accumulated over several years of business operation. The data in the operational systems may often be in multiple formats. If, for instance, the point-of-sale operational system was replaced two years ago, the most recent two years of history will be in one format, while data older than two years will be in another format.

After the initial historical load, new transaction and event data needs to be loaded on a periodic basis. This is typically done on a regular time schedule, such as at the end of the day, week, or month. During the load, and while the indexes and materialized views are being refreshed, the data is generally unavailable to warehouse users for querying. The period of time allowed for inserting the new data is called the batch window. The batch window continues to shrink as more businesses are on-line for longer periods of time. Now that the warehouse can be refreshed in real time, or near real time, minimizing the batch window has become an even higher priority, because businesses want and need to react to changes in a more immediate fashion. Higher availability can be achieved by partitioning the fact table: while data is loaded into a new partition, or updated in an existing partition, the remaining partitions are still available for use.

A large portion of the work in building a data warehouse will be devoted to the ETL process. Finding the data from the operational systems, creating extraction processes to get it out, transporting, filtering, cleansing, transforming, integrating data from multiple sources, and loading it into the warehouse can take a considerable amount of time.

If you are not already familiar with the company’s data, part of the difficulty in developing the ETL process is gaining an understanding of the data. One object can have different names in different systems. Even worse, two different things could have the same name. It can be a challenge to discover all of this, particularly in a company where the systems are not well documented. Each column in the target data warehouse must be mapped to the corresponding column in the source system. Some of the data will be mapped directly; other data will need to be derived and transformed into a different format.

Once the data is loaded into the warehouse, further processing to integrate it with existing data, update indexes, gather statistics, and refresh materialized views needs to take place prior to it being “published” as ready for users to access. Once the data is published, it should not be updated again until the next batch window, to ensure users do not receive different answers to the same query asked at a different time.
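
As a hedged illustration of this publication step, the post-load processing might gather fresh optimizer statistics on the newly loaded fact table and refresh a dependent materialized view; the materialized view name below is purely illustrative.

-- Gather optimizer statistics on the newly loaded fact table
EXECUTE DBMS_STATS.GATHER_TABLE_STATS(ownname => 'EASYDW', tabname => 'PURCHASES');

-- Refresh a dependent materialized view (name is hypothetical)
EXECUTE DBMS_MVIEW.REFRESH('EASYDW.MONTHLY_SALES_MV');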

Extracting Data from the Operational Systems

Once you have identified the data you need in the warehouse for analysis purposes, you need to locate the operational systems within the company that contain that data. The data needed for the warehouse is extracted from the source operational systems and written to the staging area, where it will later be transformed. To minimize the performance impact on the source database, data is generally extracted without applying any transformations to it.

Often the owners of the operational systems will not allow the warehouse developers direct access to those systems but will provide periodic extracts. These extracts are generally in the form of flat, sequential operating system files, which will make up the staging area.

In order to extract the fields and records needed for the warehouse, specialized application programs may need to be developed. If the data is stored in a legacy system, these programs may require special logic—for example, COBOL programs may need to handle repeating fields defined with the OCCURS clause. The data warehouse designers need to work closely with the OLTP application developers who are building the extract scripts, to ensure the necessary columns and formats of the data are provided.

As part of designing the ETL process, you need to determine how frequently data should be extracted from the operational systems. It may be at the end of some time period or business event, such as at the end of the day or week or upon closing of the fiscal quarter. It should be clearly defined what is meant by the “end of the day” or the “last day of the week,” particularly if you have a system used across different time zones. The extraction may be done at different times for different systems and staged to be loaded into the warehouse during an upcoming batch window. Another aspect of the warehouse design process involves deciding which level of aggregation is needed to answer the business queries. This also has an impact on which and how much data is extracted and transported across the network.

Some operational systems may be in relational databases, such as Oracle 8i, 9i, or 10g; Oracle Rdb; DB2/MVS; Microsoft SQL Server; Sybase; or Informix. Others may be in a legacy database format, such as IMS or Oracle DBMS. Others may be in VSAM, RMS indexed files, or some other structured file system.

If extracting and transporting the data from the source systems must be done by writing the data to flat files, then there is also the issue of defining:

  1. The file naming specification.

  2. Which files constitute the batch—for example, if more than one source table is being read from each source system, then data in each table will probably be written to its own flat file. All of the data extracted at the same time from the source system should normally be batched together and transferred and loaded into the data warehouse as a logical unit of work.

  3. The method for transporting the files between the source system and the warehouse—for example, is the data pushed from the source system or pulled by the warehouse system? If FTP is being used and the source system pushes the data, a new operating system account will typically be required on the warehouse server; this account should normally only be able to write to a very restricted directory area, from which the warehouse load processes can then read.

Let’s consider the file naming convention, and, in particular, the situation if multiple different source systems provide their data using flat files. With this situation, it is quite typical for the file naming convention to incorporate the following:

  1. The source system name

  2. The date of extraction

  3. A file batch number—particularly if there can be more than one data extraction in a business day

  4. The source table name

  5. A single character indicator to show whether this is an original extraction or a repeat—for example, if some corruption occurred and the data had to be reextracted and resent
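For example, a hypothetical file following such a convention might be named ORDERSYS_20040105_01_ORDERS_O.dat, encoding the source system, the extraction date, the batch number, the source table, and an “O” for an original extraction; the exact format is, of course, site specific.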

Alternatively, if you are able to access the source systems directly without recourse to flat files, you can retrieve the data using a variety of techniques, depending on the type of system. For small quantities of data, a gateway such as ODBC can be used. For larger amounts of data, a custom program can be written that connects directly to the source database using the database’s native application programming interface (API). Many ETL tools simplify the extraction process by providing connectivity to the source.

Identifying Data That Has Changed

After the initial load of the warehouse, as the source data changes, the data in the warehouse must be updated or refreshed to reflect those changes on a regular basis. A mechanism needs to be put into place to monitor and capture changes of interest from the operational systems. Rather than rebuilding the entire warehouse periodically, it is preferable to apply only the changes. By isolating changes as part of the extraction process, less data needs to be moved across the network and loaded into the data warehouse.

Changed data includes both new data that has been added to the operational system as well as updates and deletes to existing data. For example, in the EASYDW warehouse, we are interested in all new orders as well as updates to existing product information and customers. If we are no longer selling a product, the product is deleted from the order-entry system, but we still want to retain the history in the warehouse. This is why surrogate keys are recommended for use in the data warehouse. If the product_key is reused in the production system, it does not affect the data warehouse records.

In the data warehouse, it is not uncommon to change the dimension tables, because a column such as a product description may change. Part of the warehouse design involves deciding how changes to the dimensions will be reflected. If you need to keep one version of the old product description, you could have an additional column in the table to store both the current description and the previous description. If you need to keep all the old product descriptions, you would have to create a new row for each change, assigning different key values. In general, you should try to avoid updates to the fact table.

There are various ways to identify the new or changed data. One technique to determine the changes is to include a time stamp to record when each row in the operational system was changed. The data extraction program then selects the source data based on the time stamp of the transaction and extracts all rows that have been updated since the time of the last extraction. For example, when moving orders from the order processing system into the EASYDW warehouse, this technique can be used by selecting rows based on the PURCHASE_DATE column, as illustrated later in this chapter.
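
A minimal sketch of such a timestamp-based extraction is shown below; the bind variables used to remember the previous and current extraction times are assumptions for illustration.

SELECT order_id, product_id, customer_id, purchase_date,
       purchase_time, purchase_price, shipping_charge,
       today_special_offer, sales_person_id, payment_method
FROM   oltp.orders
WHERE  purchase_date >  :last_extract_date
AND    purchase_date <= :current_extract_date;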

However, this technique does have some potential disadvantages:

  1. If multiple updates have occurred to a record since the date and time of the last extraction, then only the current version of the record is read and not all of the interim versions. Typically, this may not have a significant impact on the warehouse unless it is necessary for analysis purposes to track all changes.

  2. The query to select the changed data based on the time stamp can have an impact on the source system.

  3. If records are deleted in the source system, then this mechanism is unsuitable because you cannot select a record that is no longer present. Converting the delete into a “logical delete” by setting a flag is normally not practical and involves considerable application change. This is where the next technique using triggers can help.

If the source is a relational database, triggers can be used to identify the changed rows. Triggers are stored procedures that can be invoked before or after an event, such as when an insert, update, or delete occurs on each record. The trigger can be used to save the changed records into a separate table from where the extract process can later retrieve the changed rows. One advantage of triggers is that the same transaction is used for writing the changed record to another table as is used to alter the source record itself. If this transaction aborts and rolls back for any reason, then our change record is also rolled back. However, be very careful of triggers in high-volume applications, as they can add significant overhead to the operational system.
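
As a minimal sketch (not part of Oracle Change Data Capture, which is described in the next section), such a trigger might copy the key of each changed order into a hand-built ORDERS_CHANGES table; both the trigger and the table are assumptions for illustration.

CREATE OR REPLACE TRIGGER oltp.orders_change_trg
AFTER INSERT OR UPDATE OR DELETE ON oltp.orders
FOR EACH ROW
BEGIN
  -- Record the operation type, the key of the affected row, and when the change happened
  INSERT INTO oltp.orders_changes (operation, order_id, change_date)
  VALUES (CASE WHEN INSERTING THEN 'I'
               WHEN UPDATING  THEN 'U'
               ELSE 'D'
          END,
          NVL(:new.order_id, :old.order_id),
          SYSDATE);
END;
/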

Sometimes it may not be possible to change the schema to add a timestamp or trigger. The system may already be heavily loaded, and you do not want to degrade the performance in any way. Or the source may be a legacy system, which does not have triggers. Therefore, you may need to use a file comparison to identify changes. This involves keeping before and after images of the extract files to find the changes. For example, you may need to compare the recent extract with the current product or customer list to identify the changes.
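
If both the previous and current extracts are first loaded into staging tables, the comparison itself can be expressed with SQL set operations; the staging table names here are hypothetical.

-- Rows that are new or have changed since the previous extract
SELECT * FROM product_stage_current
MINUS
SELECT * FROM product_stage_previous;

-- Rows that have been deleted from the source since the previous extract
SELECT product_id FROM product_stage_previous
MINUS
SELECT product_id FROM product_stage_current;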

Changes to the metadata, or data definitions, must also be identified. Changes to the structure of the operational system, such as adding or dropping a column, impact the extraction and load programs, which may need to be modified to account for the change.

Oracle Change Data Capture

Oracle Database 10g uses a feature called Change Data Capture, often referred to as CDC, to facilitate identifying changes when the source system is also an Oracle 9i or 10g database. CDC was introduced in Oracle 9i with just the synchronous form, where generation of the change records is tied to the original transaction. In Oracle Database 10g the new asynchronous form is introduced; this disassociates the generation of the change records from the original transaction and reduces the impact on the source system for collecting the change data. In this section, we will look at both forms and work through examples of each mechanism.

With CDC, the results of all INSERT, UPDATE, and DELETE operations can be saved in tables called change tables. The data extraction programs can then select the data from the change tables. CDC uses a publish-subscribe interface to capture and distribute the change data, as illustrated in Figure 5.2. The publisher, usually a DBA, determines which user tables in the operational system are used to load the warehouse and sets up the system to capture and publish the change data. A change table is created for each source table with data that needs to be moved to the warehouse.

Figure 5.2. Change Data Capture Publish/Subscribe Architecture

The extract programs then subscribe to the source tables, and there can be any number of subscribers. Each subscriber is given his or her own view of the change table. This isolates the subscribers from each other while they are simultaneously accessing the same change tables. The subscribers use SQL to select the change data from their subscriber views; they see just the columns that they are interested in and only the rows that they have not yet processed. If the updates of a set of tables are dependent on each other, the change tables can be grouped into a change set. If, for example, you had an order header and an order detail table, these two tables would be grouped together in the same change set to maintain transactional consistency. In order to create a change table, you must therefore first create the parent change set. A change source is a logical representation of the source database that contains one or more change sets.

In Oracle 9i only synchronous data capture existed, where changes on the operational system are captured in real time as part of the source transaction. The change data is generated as DML operations are performed on the source tables. When a new row is inserted into the user table, it is also stored in the change table. When a row is updated in a user table, the updated columns are stored in the change table. The old values, new values, or both can be written to the change table. When a row is deleted from a user table, the deleted row is also stored in the change table. The change data records are only visible in the change table when the source transaction is committed.

Synchronous CDC is based on triggers, which fire for each row as the different DML statements are executed. This simplifies the data extraction process; however, it adds an overhead to the transaction performing the DML. In Asynchronous Change Data Capture, which is introduced in Oracle Database 10g, the capture of the changes is not dependent upon the DML transaction on the source system. With asynchronous CDC, the changes are extracted from the source system’s redo logs, which removes the impact on the actual database transaction. The Oracle redo logs are special files used by the database that record all of the changes to the database as they occur. There are two forms of redo logs:

  • On-line redo logs

  • Archive redo logs

The on-line redo logs are groups of redo logs that are written to in a round-robin fashion as transactions occur on the database. When a redo log fills up, the database swaps to the next one, and if archiving of the logs is turned on (which is practically a certainty on any production database), then the database writes the completed redo log file to a specified destination where it can subsequently be safely stored. This is the archived redo log. Asynchronous CDC reads redo logs to extract the change data for the tables that we are interested in and therefore it is a noninvasive technique, which does not need to alter the source schema to add triggers to tables. The action of reading the logs in this fashion is often called mining the logs. In addition, the mining of the log files for the change data is disassociated from the source transaction itself, which reduces, but doesn’t remove, the impact on the source system.

Asynchronous CDC needs an additional level of redo logging to be enabled on the source system, which adds its own level of impact on the source database, but this is much less than that of synchronous CDC.

Asynchronous CDC can be used in two ways:

  1. HOTLOG, where the changes are extracted from the on-line redo logs on the source database and then moved (by Oracle Streams processes) into local CDC tables also in the source database. The changed data in the CDC tables will still need to be transported to our warehouse database.

  2. AUTOLOG, where the changed data is captured from the redo logs as they are moved between databases by the Log Transport Services. The changes are then extracted from these logs and made available in change tables on this other database. If this other database is our data warehouse database, then, by using AUTOLOG, we have removed the transportation step to make our changed data available in the staging area of the warehouse.

Log Transport Services are a standard part of the operation of the database—for example, they are used by Oracle Data Guard for moving the log files to other servers in order to maintain standby databases and as such do not add any additional impact to the source database operation.

The time between the change being committed by the transaction and it being detected by CDC and moved into the change table is called the latency of change. This latency is smaller with the HOTLOG method than with the AUTOLOG method, because HOTLOG reads the on-line redo logs and publishes to tables on the same source system, whereas with AUTOLOG the logs must first be switched and then transported between databases—in which case the frequency of transporting the logs is the determining factor. For HOTLOG, mining the redo logs occurs on a transaction-by-transaction basis when the source transaction commits.

Publishing Change Data

In the EASYDW warehouse, we are interested in all new orders from the order-entry system. The DBA creates the change tables, using the DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE procedure, and specifies a list of columns that should be included. A change table contains changes from only one source table.

In many of the examples in this chapter, we will use a schema named OLTP, which represents a part of the operational database. In the following example, the DBA uses the CREATE_CHANGE_TABLE procedure to capture the changes to the columns from the ORDERS table in the OLTP schema.

The next two sections provide an example of the two different methods for setting up synchronous and asynchronous CDC and, following that, the common mechanism for subscribing to the change data. To do this we are going to use three new schemas:

  • OLTP for the owner of our warehouse source tables

  • OLTPPUB for the publisher of the change data that has occurred on our OLTP tables

  • OLTPSUBSCR for the subscriber to the change data

Our source table for both methods is ORDERS, and the other two differences between the methods to note are:

  • For synchronous, the change set is EASYDW_SCS and the change table is ORDERS_SYNCH_CT

  • For asynchronous, the change set is EASYDW_ACS and the change table is ORDERS_ASYNCH_CT

In the subscriber section, we will highlight what needs to change in order to subscribe to the change data created by these two methods.

From a DBA account, create the new accounts and create a source table in the OLTP account:

CREATE USER oltp IDENTIFIED BY oltp
  DEFAULT TABLESPACE users
  TEMPORARY TABLESPACE temp
  QUOTA UNLIMITED ON users;
GRANT connect, resource TO oltp ;

CREATE USER oltpsubscr IDENTIFIED BY oltpsubscr
  DEFAULT TABLESPACE users
  TEMPORARY TABLESPACE temp
  QUOTA UNLIMITED ON users;
GRANT connect, resource TO oltpsubscr ;

CREATE USER oltppub IDENTIFIED BY oltppub
  QUOTA UNLIMITED ON SYSTEM
  QUOTA UNLIMITED ON SYSAUX;

GRANT CREATE SESSION TO oltppub;
GRANT CREATE TABLE TO oltppub;
GRANT CREATE TABLESPACE TO oltppub;
GRANT UNLIMITED TABLESPACE TO oltppub;
GRANT SELECT_CATALOG_ROLE TO oltppub;
GRANT EXECUTE_CATALOG_ROLE TO oltppub;
GRANT CREATE SEQUENCE TO oltppub;
GRANT CONNECT, RESOURCE, DBA TO oltppub;

CREATE TABLE oltp.orders
(order_id              varchar2(8)  NOT NULL,
 product_id            varchar2(8)  NOT NULL,
 customer_id           varchar2(10) NOT NULL,
 purchase_date         date         NOT NULL,
 purchase_time         number(4,0)  NOT NULL,
 purchase_price        number(6,2)  NOT NULL,
 shipping_charge       number(5,2)  NOT NULL,
 today_special_offer   varchar2(1)  NOT NULL,
 sales_person_id       varchar2(20) NOT NULL,
 payment_method        varchar2(10) NOT NULL
)
TABLESPACE users ;

Synchronous CDC

For synchronous CDC, creation of the change set must use the predefined change source, SYNC_SOURCE, which represents the source database. The following PL/SQL block creates the change set and then creates the change table:

BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET
    (change_set_name =>'EASYDW_SCS',
     description => 'Synchronous Change set for EasyDW',
     change_source_name => 'SYNC_SOURCE'
    );
  DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE
    (
      owner => 'oltppub',
      change_table_name => 'ORDERS_SYNCH_CT',
      change_set_name => 'EASYDW_SCS',
      source_schema => 'oltp',
      source_table => 'ORDERS',
      column_type_list =>
         'order_id varchar2(8),product_id varchar2(8),'
       ||'customer_id varchar2(10), purchase_date date,'
       ||'purchase_time number(4,0),purchase_price number(6,2),'
       ||'shipping_charge number(5,2), '
       ||'today_special_offer varchar2(1),'
       ||'sales_person varchar2(20), '
       ||'payment_method varchar2(10)',
      capture_values => 'both',
      rs_id => 'y',
      row_id => 'n',
      user_id => 'n',
      timestamp => 'n',
      object_id => 'n',
      source_colmap => 'y',
      target_colmap => 'y',
      options_string => 'TABLESPACE USERS'
    );
END;
/

This script creates the change set EASYDW_SCS for changes to the table ORDERS owned by account OLTP to publish the changes into the change table ORDERS_SYNCH_CT owned by OLTPPUB. We must now grant select privilege on the change table to the subscriber account, OLTPSUBSCR, so that it can see the contents of the table.

GRANT SELECT ON ORDERS_SYNCH_CT TO OLTPSUBSCR ;

When creating the change table, you must specify the column list, which indicates the columns you are interested in capturing. In addition, there are a number of other parameters that allow you to specify:

  • Whether you want the change table to contain the old values for the row, the new values, or both

  • Whether you want a row sequence number, which provides the sequence of operations within a transaction

  • The rowid of the changed row

  • The user who changed the row

  • The time stamp of the change

  • The object id of the change record

  • A source column map, which indicates the source columns that have been modified

  • A target column map to track which columns in the change table have been modified

  • An options string to append to the CREATE TABLE DDL statement

An application can check either the source column map or the target column map to determine which columns have been modified.

A sample of the output will be seen later in the chapter. To see a list of change tables that have been published, query the CHANGE_TABLES dictionary view.

SQL> SELECT CHANGE_TABLE_NAME FROM CHANGE_TABLES;

CHANGE_TABLE_NAME
------------------
ORDERS_SYNCH_CT

The DBA then grants SELECT privileges on the change table to the subscribers.

Asynchronous CDC

For the asynchronous CDC example, we will use the HOTLOG method, where the change data is mined from the on-line redo logs of the source system.

First, make sure that your source database is in ARCHIVELOG mode, where the log files are being archived to a separate destination area on your file system. It would be very unusual for a source production system not to be already operating in archive log mode, because this is fundamental to any recovery of the database in the event of failure. To put the database in ARCHIVELOG mode, you will need to shut down the database and restart it, as summarized in the following code segment. These commands are executed from SQL*Plus using a sysdba account—for example, SYS:

shutdown immediate
startup mount
alter database archivelog ;
alter database open ;

Hint

Simply changing the ARCHIVELOG mode like this may be fine on a sandpit, test, or play system, but this operation can invalidate your backups—so on any production or similarly important system, you will want to redo your backups.

The next step requires altering the database in order to create the additional logging information into the log files.

ALTER DATABASE FORCE LOGGING;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER TABLE oltp.orders
     ADD SUPPLEMENTAL LOG GROUP log_group_orders
          (order_id,product_id,customer_id,
           purchase_date,purchase_time,purchase_price,
           shipping_charge,today_special_offer) ALWAYS;

The FORCE LOGGING clause of the ALTER DATABASE statement specifies that the database will always generate redo, even when operations are performed with the NOLOGGING clause. This ensures that asynchronous CDC always has the necessary redo log data to mine for the changes. The ADD SUPPLEMENTAL LOG DATA clause enables minimal supplemental logging, which does not add a significant overhead to database performance and makes the Oracle log mining features usable. The ALTER TABLE statement creates an unconditional log group for the source table columns that are to be captured in a change table. Without the unconditional log group, CDC records unchanged column values in UPDATE operations as NULL, making it ambiguous whether a NULL means the value was unchanged or was changed to NULL. With the unconditional log group, CDC records the actual value of unchanged columns in UPDATE operations, so that a NULL always means the value was changed to NULL.

Asynchronous CDC utilizes Oracle Streams to propagate the change data within the database, and to use Streams the publisher must be granted certain privileges. These privileges enable the user to use Streams and the underlying Oracle Advanced Queuing objects, such as queues, propagations, and rules. Execute the following command from your SYSDBA (e.g., SYS) account:

EXECUTE DBMS_STREAMS_AUTH.GRANT_ADMIN_PRIVILEGE -
                       (GRANTEE=>'oltppub');

Each source table must be instantiated with Oracle Streams so that Streams can capture the information it requires to record the source table data changes. This is done from a DBA account by calling the PREPARE_TABLE_INSTANTIATION procedure, as shown here:

EXECUTE DBMS_CAPTURE_ADM.PREPARE_TABLE_INSTANTIATION -
                       (TABLE_NAME=>'oltp.orders');

Now we must create our change set, which we are going to call EASYDW_ACS, using the predefined change source HOTLOG_SOURCE, which represents the current redo log files of the source database. If we were performing an AUTOLOG type of asynchronous CDC, then there would be no predefined change source to be used in this step. Instead, the DBA on the target staging database must define and create the change source using the CREATE_AUTOLOG_CHANGE_SOURCE procedure in the DBMS_CDC_PUBLISH package. The call to this procedure uses:

  1. The global name of the source database

  2. The SCN number of the data dictionary build, which is determined by a call to DBMS_CAPTURE_ADM.BUILD() on the source database

In order to interpret the redo logs—for example, to know which internal reference number identifies which table—the log mining functionality needs a version of the source system data dictionary. This SCN identifies a source system redo log that contains the LogMiner dictionary and therefore the correct definitions of the tables.
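
As a hedged sketch of this AUTOLOG setup step (the change source name, source database name, and SCN are illustrative only; consult the DBMS_CDC_PUBLISH documentation for the full parameter list), the call on the staging database might look like this:

BEGIN
  DBMS_CDC_PUBLISH.CREATE_AUTOLOG_CHANGE_SOURCE
   (
    change_source_name => 'EASYDW_AUTOLOG_SOURCE',
    description        => 'AUTOLOG change source for EasyDW',
    source_database    => 'OLTPDB.WORLD',
    first_scn          => 609338
   );
END;
/

Returning to our HOTLOG example, we can now create the EASYDW_ACS change set.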

Execute the following from the OLTPPUB publisher account. This command creates the change set and its associated Oracle Streams processes but does not start them.

BEGIN
  DBMS_CDC_PUBLISH.CREATE_CHANGE_SET
   (
    change_set_name => 'EASYDW_ACS',
    description => 'Asynchronous Change set for purchases info',
    change_source_name => 'HOTLOG_SOURCE',
    stop_on_ddl => 'y');
END;
/

Now we can create the change table, ORDERS_ASYNCH_CT, in the OLTPPUB publisher account, which will contain the changes that have been mined from the on-line redo log. Execute the following from the OLTPPUB account:

BEGIN
 DBMS_CDC_PUBLISH.CREATE_CHANGE_TABLE
  (
   owner => 'oltppub',
   change_table_name => 'ORDERS_ASYNCH_CT',
   change_set_name => 'EASYDW_ACS',
   source_schema => 'OLTP',
   source_table => 'ORDERS',
   column_type_list =>
     'order_id varchar2(8),product_id varchar2(8),'
   ||'customer_id varchar2(10), purchase_date date,'
   ||'purchase_time number(4,0),purchase_price number(6,2),'
   ||'shipping_charge number(5,2),today_special_offer varchar2(1),'
   ||'sales_person varchar2(20), payment_method varchar2(10)',
   capture_values => 'both',
   rs_id => 'y',
   row_id => 'n',
   user_id => 'n',
   timestamp => 'n',
   object_id => 'n',
   source_colmap => 'n',
   target_colmap => 'y',
   options_string => 'TABLESPACE USERS');
END;
/

GRANT SELECT ON ORDERS_ASYNCH_CT TO OLTPSUBSCR ;

Finally, we will enable our change set EASYDW_ACS, which starts the underlying Oracle Streams processes for moving our change data:

BEGIN
  DBMS_CDC_PUBLISH.ALTER_CHANGE_SET
   (
    change_set_name => 'EASYDW_ACS',
    enable_capture => 'y'
   );
END;
/

To verify that everything is working, perform a simple test: insert and commit a record into your OLTP.ORDERS source table. After a few minutes you will be able to see the change record in the publisher table OLTPPUB.ORDERS_ASYNCH_CT. We have now used the database asynchronous CDC mechanisms to capture changes on our ORDERS table without needing to invasively alter the source schema to create triggers or amend any application.
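
For example, a quick check along these lines (the inserted values are arbitrary) might be:

INSERT INTO oltp.orders
       (order_id, product_id, customer_id, purchase_date,
        purchase_time, purchase_price, shipping_charge,
        today_special_offer, sales_person_id, payment_method)
VALUES ('124', 'SP1032', 'AB123496',
        to_date('02-JAN-2004', 'dd-mon-yyyy'),
        1145, 19.99, 3.95, 'N', 'SMITH', 'VISA');
COMMIT;

-- After a short delay, the change record should appear
SELECT COUNT(*) FROM oltppub.orders_asynch_ct;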

Subscribing to Change Data

The extraction programs create subscriptions to access the change tables. A subscription can contain data from one or more change tables in the same change set.

The ALL_SOURCE_TABLES dictionary view lists the source tables that have already been published by the DBA. In this example, changes for the ORDERS table in the OLTP schema have been published.

SQL> SELECT * FROM ALL_SOURCE_TABLES;

SOURCE_SCHEMA_NAME             SOURCE_TABLE_NAME
------------------------------ ------------------
OLTP                           ORDERS

Creating a Subscription

There are several steps to creating a subscription.

  1. Create a subscription.

  2. List all the tables and columns the extract program wants to subscribe to.

  3. Activate the subscription.

The first step is to create a subscription. The following is performed from the subscriber account OLTPSUBSCR. Note that for our example, we are going to access the change data created by the asynchronous CDC method via the EASYDW_ACS change set. If we wanted to access the synchronous change data, then we would simply have to use the change set that we created for synchronous CDC, i.e., EASYDW_SCS.

SQL> BEGIN
DBMS_CDC_SUBSCRIBE.CREATE_SUBSCRIPTION
     (SUBSCRIPTION_NAME => 'ORDERS_SUB',
      CHANGE_SET_NAME => 'EASYDW_ACS',
      DESCRIPTION => 'Changes to orders table');
END;
/

Next, specify the source tables and columns of interest using the SUBSCRIBE procedure. A subscription can contain one or more tables from the same change set. The SUBSCRIBE procedure lists the schema, table, and columns of change data that the extract program will use to load the warehouse. In this example, the subscribe procedure is used to get changes from all the columns in the ORDERS table in the OLTP schema. The subscribe procedure is executed once for each table in the subscription, and in this example we were only interested in changes from one table. However, you could repeat this procedure to subscribe to changes to other tables in the change set.

Instead of accessing the change tables directly, the subscriber creates a subscriber view for each source table of interest. This is also done in the call to the SUBSCRIBE procedure.

SQL> BEGIN
DBMS_CDC_SUBSCRIBE.SUBSCRIBE
  (SUBSCRIPTION_NAME => 'ORDERS_SUB',
   SOURCE_SCHEMA => 'oltp',
   SOURCE_TABLE => 'orders',
   COLUMN_LIST => 'order_id,product_id,'
                     ||'customer_id, purchase_date,'
                     ||'purchase_time,purchase_price,'
                     ||'shipping_charge, today_special_offer,'
                     ||'sales_person, payment_method',
   SUBSCRIBER_VIEW => 'ORDERS_VIEW'
  );
END;
/

After subscribing to all the change tables, the subscription is activated using the ACTIVATE_SUBSCRIPTION procedure. Activating a subscription is done to indicate that all tables have been added, and the subscription is now complete.

SQL> EXECUTE DBMS_CDC_SUBSCRIBE.ACTIVATE_SUBSCRIPTION -
 (SUBSCRIPTION_NAME => 'ORDERS_SUB');

Once a subscription has been activated, as new data gets added to the source tables it is made available for processing via the change tables.

Processing the Change Data

To illustrate how change data is processed, let us assume two rows are inserted into the ORDERS table. As data is inserted into the ORDERS table, the changes are also stored in the change table, ORDERS_ASYNCH_CT.

SQL> INSERT INTO oltp.orders
           (order_id,product_id,customer_id,
            purchase_date,
            purchase_time, purchase_price,shipping_charge,
            today_special_offer,
            sales_person_id,payment_method)
    VALUES ('123','SP1031', 'AB123495',
            to_date('01-JAN-2004', 'dd-mon-yyyy'),
             1031,156.45,6.95,'N','SMITH','VISA');

1 row created.

SQL> INSERT INTO oltp.orders
             (order_id,product_id,customer_id,
              purchase_date,
              purchase_time,purchase_price,shipping_charge,
              today_special_offer,
              sales_person_id,payment_method)
     VALUES ('123','SP1031','AB123495',
             to_date('01-FEB-2004', 'dd-mon-yyyy'),
              1031,156.45,6.95,'N','SMITH','VISA');

1 row created.

SQL> commit;

In order to process the change data, a program loops through the steps described in the following text and illustrated in Figure 5.3. A change table is dynamic; new change data is appended to the change table at the same time the extraction programs are reading from it. In order to present a consistent view of the contents of the change table, change data is viewed for a window of source database transactions. Prior to accessing the data, the window is extended. In Figure 5.3, rows 1–8 are available in the first window. While the program was processing these rows, rows 9–13 were added to the change table. Rows 9–13 can then be accessed by purging the first window and extending the window again.

Figure 5.3. Querying Change Data

Rather than accessing the change table directly, the program selects the data from the change table using the subscriber view specified earlier in the SUBSCRIBE procedure.

Step 1: Extend the Window

Change data is only available for a window of time: from the time the EXTEND_WINDOW procedure is invoked until the PURGE_WINDOW procedure is invoked. To see new data added to the change table, the window must be extended using the EXTEND_WINDOW procedure.

SQL> BEGIN
  DBMS_CDC_SUBSCRIBE.EXTEND_WINDOW
     (SUBSCRIPTION_NAME => 'ORDERS_SUB');
END;
/

Step 2: Select Data from the Subscriber View

In this example, the contents of the subscriber view will be examined.

SQL> describe ORDERS_VIEW
Name                    Null?    Type
----------------------- -------- ----------------
OPERATION$                       CHAR(2)
CSCN$                            NUMBER
COMMIT_TIMESTAMP$                DATE
RSID$                            NUMBER
SOURCE_COLMAP$                   RAW(128)
TARGET_COLMAP$                   RAW(128)
CUSTOMER_ID                      VARCHAR2(10)
ORDER_ID                         VARCHAR2(8)
PAYMENT_METHOD                   VARCHAR2(10)
PRODUCT_ID                       VARCHAR2(8)
PURCHASE_DATE                    DATE
PURCHASE_PRICE                   NUMBER(6,2)
PURCHASE_TIME                    NUMBER(4)
SALES_PERSON                     VARCHAR2(20)
SHIPPING_CHARGE                  NUMBER(5,2)
TODAY_SPECIAL_OFFER              VARCHAR2(1)

The first column of the output shows the operation: I for insert. Next, the commit SCN and commit time are listed. The new data is listed for each row, and the row sequence ID, indicated by RSID$, shows the order of the statements within the transaction.

SQL> SELECT OPERATION$, CSCN$, COMMIT_TIMESTAMP$, RSID$,
            CUSTOMER_ID, ORDER_ID, PAYMENT_METHOD,
            PRODUCT_ID, PURCHASE_DATE, PURCHASE_PRICE,
            PURCHASE_TIME, SALES_PERSON, SHIPPING_CHARGE,
            TODAY_SPECIAL_OFFER
     FROM ORDERS_VIEW;

OP      CSCN$ COMMIT_TIMESTAMP      RSID$ CUSTOMER_ID
-- ---------- ---------------- ---------- -----------
ORDER_ID PAYMENT_METHOD PRODUCT_ID PURCHASE_DATE PURCHASE_PRICE PURCHASE_TIME
-------- -------------- ---------- ------------- -------------- -------------
SALES_PERSON         SHIPPING_CHARGE TODAY_SPECIAL_OFFER
-------------------- --------------- -------------------
I     6848693 05-JUN-04             10001 AB123495
123      VISA           SP1031     01-JAN-04     156.45         1031
                                6.95 N

I     6848693 05-JUN-04             10002 AB123495
123      VISA           SP1031     01-FEB-04     156.45         1031
                                6.95 N

Step 3: Purge the Window

The window is purged when the data is no longer needed, using the PURGE_WINDOW procedure. When all subscribers have purged their windows, the data in those windows is automatically deleted.

SQL> EXECUTE DBMS_CDC_SUBSCRIBE.PURGE_WINDOW -
          (SUBSCRIPTION_NAME => 'ORDERS_SUB');

Ending the Subscription

When an extract program is no longer needed, you can end the subscription, using the DROP_SUBSCRIPTION procedure.

SQL> EXECUTE DBMS_CDC_SUBSCRIBE.DROP_SUBSCRIPTION -
    (SUBSCRIPTION_NAME => 'ORDERS_SUB');

Transporting the Changes to the Staging Area

For synchronous and HOTLOG asynchronous CDC, now that the changes have been captured from the operational system, they need to be transported to the staging area on the warehouse database. The extract program could write the changes to a data file outside the database, copy the file using FTP, and then use SQL*Loader or external tables to load the change data into the staging area. Alternatively, the changes could be written to a table and moved to the staging area using transportable tablespaces. Both of these techniques are discussed later in this chapter. Of course, if we were using the AUTOLOG form of asynchronous CDC, then we would have already moved the changes to the warehouse database and into the staging area as part of the CDC operation.

Transforming the Data into a Common Representation

Once the data has been extracted from the operational systems, it is ready to be cleansed and transformed into a common representation. Differences between naming conventions, storage formats, data types, and encoding schemes must all be resolved. Duplicates are removed, relationships are validated, and unique key identifiers are added. In this section, various types of transformations will be introduced; later in the chapter, we’ll see specific examples of transformations.

Integrating Data from Multiple Sources

Often the information needed to create a table in the warehouse comes from multiple source systems. If there is a field in common between the systems, the data can be joined via that column.

Integrating data from multiple sources can be very challenging. Different people may have designed the operational systems, at different times, and using different styles, standards, and methodologies. They may use different technology (e.g., hardware platforms, database management systems, and operating system software). If data is coming from an IBM mainframe, the data may need to be converted from EBCDIC to ASCII or from big endian to little endian or vice versa.

To compound the problem, there may not be a common identifier in the source systems. For example, when creating the customer dimension, there may not be a CUSTOMER_ID in each system. You may have to look at customer names and addresses to determine that it is the same customer. These may have different spacing, case, and punctuation. Oracle Warehouse Builder helps address the customer deduplication problem.

Cleansing Data

The majority of operational systems contain some dirty data, which means that there may be:

  • Duplicate records

  • Data missing

  • Data containing invalid values

  • Data pointing to primary keys that do not exist

Sometimes, business rules are enforced by the applications; other times, by integrity constraints within the database; and sometimes there may be no enforcement at all.

Data must be standardized. For example, any given street address, such as 1741 Coleman Ave., can be represented in many ways. The word “Avenue” may be stored as “Ave,” “Ave.,” “Avenue,” or “AVE.”. Search & Replace transforms allow you to search for any of these values and replace them with the standard value you’ve chosen for your warehouse.

You may want to check the validity of certain types of data. If a product is sold in only three colors, you can validate that the data conforms to a list of values, such as red, yellow, and green. This list may change over time to include new values; for example, in February 2002 the product may also become available in blue. You may want to validate data against a larger list of values stored in a table, such as the states within the United States.

Some types of cleansing involve combining and separating character data. You may need to concatenate two string columns—for example, combining LAST_NAME, comma, and FIRST_NAME into the CUSTOMER_NAME column. Or you may need to use a substring operation to divide a string into separate parts, such as separating the area code from a phone number.
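
These operations map directly onto SQL string functions, as in the following sketch (the staging table and column names are illustrative):

-- Combine the name parts into a single customer name
SELECT last_name || ', ' || first_name AS customer_name
FROM   customer_stage;

-- Separate the area code from a ten-digit phone number
SELECT SUBSTR(phone_number, 1, 3) AS area_code,
       SUBSTR(phone_number, 4)    AS local_number
FROM   customer_stage;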

An important data integrity step involves enforcement of one-to-one and one-to-many relationships. Often these are checked as part of the transformation process rather than by using referential integrity constraints in the warehouse.

Deriving New Data

While loading the data, you may want to perform calculations or derive new data from the existing data. For example, you may want to keep a running total or count of records as they are moved from the source to the target database.

During the design process, the appropriate level of granularity for the warehouse is determined. It is often best to store data at various levels of granularity with different retention and archive periods. The most fine-grained transaction data will usually be retained for a much shorter period of time than data aggregated at a higher level. Transaction-granular sales data is necessary to analyze which products are purchased together. Daily sales of a product by store are used to analyze regional trends and product performance.

Data may be aggregated as part of the transformation process. If you did not want to store the detailed transactions in your data warehouse, the data can be aggregated prior to moving it to the data warehouse.
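
For example, if detailed order rows have been staged in a table, they could be summarized to daily sales per product before being moved to the warehouse; the staging table names here are hypothetical.

INSERT INTO daily_sales_stage (product_id, sales_date, total_sales, order_count)
SELECT product_id,
       TRUNC(purchase_date),
       SUM(purchase_price),
       COUNT(*)
FROM   orders_stage
GROUP BY product_id, TRUNC(purchase_date);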

Generating Warehouse Keys

Instead of using the keys that were used in the operational system, a common design technique is to make up a new key, called the surrogate or synthetic key, to use in the warehouse. The surrogate key is usually a generated sequence of integers.
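
In an Oracle warehouse the surrogate key is usually generated from a sequence, as in this sketch (the dimension table and column names are illustrative):

CREATE SEQUENCE product_key_seq START WITH 1 INCREMENT BY 1 CACHE 100;

-- Assign the next surrogate key as a new product is added to the dimension
INSERT INTO product_dim (product_key, source_product_code, product_name)
VALUES (product_key_seq.NEXTVAL, 'SR125', 'Standard Camera');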

Surrogate keys are used for a variety of reasons. The keys used in the operational system may be long character strings, with meaning embedded into the components of the key. Because surrogate keys are integers, the fact tables and B*tree indexes are smaller, with fewer levels, and take less space, improving query response time. Surrogate keys provide a degree of isolation from changes in the operational system.

If the operational system changes the product-code naming conventions, or format, all data in the warehouse does not have to be changed. When one company acquires another, you may need to load products from a newly acquired company into the warehouse. It is highly unlikely that both companies used the same product encoding schemes. If there is a chance that the two companies used the same product key for different products, then the product key in the warehouse may need to be extended to add the company id as well. The use of surrogate keys can greatly help integrate the data in these types of situations.

Both the surrogate keys and operational system keys are stored in the dimension table, as shown in Figure 5.4, where product code SR125 is known in the data warehouse as PRODUCT_ID 1. Therefore, we can see that in the fact table, the product key is stored as 1. However, users can happily query using code SR125, completely unaware of the transformation being done within the data warehouse.

Figure 5.4. The Use of Surrogate Keys in the Warehouse

The surrogate key is used in the fact table as the column that joins the fact table to the dimension table. In this example, there are two different formats for product codes. Some are numeric, separated by a dash, “654-123”. Others are a mix of alphanumeric and numeric characters, “SR125”. As part of the ETL process, as each fact record is loaded, the surrogate key is looked up in the dimension tables and stored in the fact table.
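
A sketch of this lookup step is shown below, assuming the incoming fact rows sit in a staging table keyed by the operational codes and that the dimension tables store those codes alongside their surrogate keys; all table and column names are illustrative.

INSERT INTO purchases_fact (product_key, customer_key, purchase_date, purchase_price)
SELECT d.product_key,
       c.customer_key,
       s.purchase_date,
       s.purchase_price
FROM   purchases_stage s
JOIN   product_dim  d ON d.source_product_code  = s.product_code
JOIN   customer_dim c ON c.source_customer_code = s.customer_id;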

Choosing the Optimal Place to Perform the Transformations

Transformations of the data may be done at any step in the ETL process. You need to decide the most efficient place to do each transformation: at the source, in the staging area, during the load operation, or in temporary tables once the data is loaded into the warehouse. Several powerful features, available in Oracle 9i and new in Oracle Database 10g, facilitate performing these transformations.

  • Transformations can be done as part of the extraction process. In general, it is best to do filtering types of transformations whenever possible at the source. This allows you to select only the records of interest for loading into the warehouse and consequently this also reduces the impact on the network or other mechanism used to transport the data into the warehouse. Ideally, you want to extract only the data that has been changed since your last extraction. While transformations could be done at the source operational system, an important consideration is to minimize the additional load the extraction process puts on the operational system.

  • Transformations can be done in a staging area prior to loading the data into the data warehouse. When data needs to be integrated from multiple systems, it cannot be done as part of the extraction process. You can use flat files as your staging area, your Oracle database as your staging area, or a combination of both. If your incoming data is in a flat file, it is probably more efficient to finish your staging processes prior to loading the data into the Oracle warehouse. Transformations that require sorting, sequential processing and row-at-a-time operations can be done efficiently in the flat file staging area.

  • Transformations can be done during the load process. Some important types of transformations can be done as the data is being loaded using SQL*Loader—for example, converting the “endian-ness” of the data, or, changing the case of a character column to uppercase. This is best done when a small number of rows need to be added—for example, when initially loading a dimension table. Oracle external tables facilitate more complex transformations of the data as part of the load process. Sections 5.4.1 and 5.4.3 discuss some examples of transformations while loading the data into the warehouse.

    If your source is an Oracle 8i, 9i, or 10g database, transportable tablespaces make it easy to move data into the warehouse without first extracting the data into an external table. In this case, it makes sense to do the transformations in temporary staging tables once the data is in the warehouse. By doing transformations in Oracle, if the data can be processed in bulk using SQL set operations, they can be done in parallel.

  • Transformations can be done in the warehouse staging tables. Conversion of the natural key to the surrogate key should be performed in the warehouse, where the surrogate key is generated. In addition, Oracle Database 10g introduces new SQL regular expression functions (such as REGEXP_LIKE and REGEXP_REPLACE) for processing character data. These powerful functions operate alongside the simpler, existing text search and replace functions and operators, and they enable true regular expression matching, substitution, and manipulation to be performed on character data, as illustrated in the sketch following this list.
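
For example, the address standardization described earlier could be performed with a single regular expression rather than a series of simple search-and-replace calls; the staging table and column names here are hypothetical.

UPDATE customer_stage
SET    street_address = REGEXP_REPLACE(street_address,
                                       '(Ave|Ave\.|AVE|AVE\.)$',
                                       'Avenue');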

Loading the Warehouse

When loading the warehouse, the dimension tables are generally loaded first. The dimension tables contain the surrogate keys or other descriptive information needed by the fact tables. When loading the fact tables, information is looked up from the dimension tables and added to the columns in the fact table.

When loading the dimension table, you need both to add new rows and make changes to existing rows. For example, a customer dimension may contain tens of thousands of customers. Usually, only 10 percent or less of the customer information changes. You will be adding new customers and sometimes modifying the information about existing customers.

When adding new data to the dimension table, you need to determine if the record already exists. If it does not, you can add it to the table. If it does exist, there are various ways to handle the changes, based on whether you need to keep the old information in the warehouse for analysis purposes.

If a customer’s address changes, there is generally no need to retain the old address, so the record can simply be updated. In a rapidly growing company the sales regions will change often. For example, “Canada” rolled up into the “rest of the world” until 1990 and then, after a reorganization, rolled up into the “Americas.” If you need to understand both the old geographical hierarchy and the new one, you can create a new dimension record containing all the old data plus the new hierarchy, giving the record a new surrogate key. Alternatively, you could create columns in the original record to hold both the previous and current values.

One dimension that will usually be present in any data warehouse is the time dimension, which contains one row for each unit of time that is of interest in the warehouse. In the EASYDW shopping example, purchases can be made on-line 365 days a year, so every day is of interest. For each date, information about the day is stored, including the day of the week, the week number, the month, the quarter, the year, and whether it is a holiday. The time dimension may be loaded on a yearly basis.
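
A simple way to load a year of the time dimension is a short PL/SQL loop over the dates; the table and column names below are illustrative and will differ from one warehouse design to another.

DECLARE
  d DATE;
BEGIN
  -- 2004 is a leap year, so the loop covers 366 days
  FOR i IN 0 .. 365 LOOP
    d := TO_DATE('01-JAN-2004','DD-MON-YYYY') + i;
    INSERT INTO time_dimension
      (time_key, day_of_week, week_number, month_number, quarter, year)
    VALUES
      (d,
       TRIM(TO_CHAR(d, 'DAY')),
       TO_NUMBER(TO_CHAR(d, 'IW')),
       TO_NUMBER(TO_CHAR(d, 'MM')),
       TO_NUMBER(TO_CHAR(d, 'Q')),
       TO_NUMBER(TO_CHAR(d, 'YYYY')));
  END LOOP;
  COMMIT;
END;
/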

When loading the fact table, you typically append new information to the end of the existing fact table. You do not want to alter the existing rows, because you want to preserve that data. For example, the PURCHASES fact table contains three months of data. New data from the source order-entry system is appended to the purchases fact table monthly. Partitioning the data by month facilitates this type of operation.
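
For example, before the monthly load a new partition could be added to the PURCHASES fact table; the partition name, bound, and tablespace shown here are illustrative.

ALTER TABLE purchases
  ADD PARTITION purchases_apr2004
  VALUES LESS THAN (TO_DATE('01-MAY-2004', 'DD-MON-YYYY'))
  TABLESPACE purchases_apr2004;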

In the next sections, we will take a look at different ways to load data, including:

  • SQL*Loader—which inserts data into a new table or appends to an existing table when your data is in a flat file that is external to the database.

  • Data Pump utilities for export and import.

  • External Tables—inserts data into a new table or appends to an existing table when your data is in a flat file that is external to the database and you want to transform it while loading.

  • Transportable Tablespaces—used to move data between two Oracle databases, such as the operational system and the warehouse, which may reside on different operating system platforms.

Using SQL*Loader to Load the Warehouse

One of the most popular tools for loading data is SQL*Loader, because it has been designed to load records as fast as possible. Its particular strength is that the format of the records that it can load is fully user definable, which makes it an ideal mechanism for loading data from non-Oracle source systems. For example, one possible format is the comma-separated value (CSV) file. SQL*Loader can be used either from the operating system command line or via its wizard in Oracle Enterprise Manager (OEM), which we will discuss now.

Using Oracle Enterprise Manager Load Wizard

Figure 5.5 shows the Oracle Enterprise Manager Maintenance screen, from where you can launch the load wizard from the Utilities section. Click on the Load Data from File link. The wizard guides you through the process of loading data from an external file into the database, according to a set of instructions in a control file, and then submits a batch job through Enterprise Manager to execute the load.

Figure 5.5. Accessing SQL*Loader from Enterprise Manager

The Control File

The control file is a text file that describes the load operation. The role of the control file, which is illustrated in Figure 5.6, is to tell SQL*Loader which datafile to load, how to interpret the records and columns, and into which tables to insert the data. At this point you also need to specify a host server account, which Enterprise Manager can use to execute your SQL*Loader job.

Figure 5.6. Identifying the Control File

Hint

This is a server account and not the database account that you wish to use to access the tables that you’re loading into.

The control file is written in SQL*Loader’s data definition language. The following example shows the control file that would be used to add new product data into the product table in the EASYDW warehouse. The data is stored in the file product.dat. New rows will be appended to the existing table.

-- Load product dimension
LOAD DATA
INFILE 'product.dat' append
INTO TABLE product
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'"
(product_id,
 product_name,
 category,
 cost_price,
 sell_price,
 weight,
 shipping_charge,
 manufacturer,
 supplier)

The Data File

The following example shows a small sample of the datafile product.dat. Each field is separated by a comma and optionally enclosed in a single quote. Each field in the input file is mapped to the corresponding columns in the table. As the data is read, it is converted from the data type in the input file to the data type of the column in the database.

'SP1242', 'CD LX1','MUSC', 8.90, 15.67, 2.5, 2.95, 'RTG', 'CD Inc'
'SP1243', 'CD LX2','MUSC', 8.90, 15.67, 2.5, 2.95, 'RTG', 'CD Inc'
'SP1244', 'CD LX3','MUSC', 8.90, 15.67, 2.5, 2.95, 'RTG', 'CD Inc'
'SP1245', 'CD LX4','MUSC', 8.90, 15.67, 2.5, 2.95, 'RTG', 'CD Inc'

The datafile is an example of data stored in Stream format. A record separator, often a line feed or carriage return/line feed, terminates each record. A delimiter character, often a comma, separates each field. The fields may also be enclosed in single or double quotes.

In addition to the Stream format, SQL*Loader supports fixed-length and variable-length format files. In a fixed-length file, each record is the same length. Normally, each field in the record is also the same length. In the control file, the input record is described by specifying the starting position, length, and data type. In a variable-length file, each record may be a different length. The first field in each record is used to specify the length of that record.
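
As a rough sketch of how these formats are declared, here are alternative INFILE clauses for each record format; the file names and record lengths are illustrative assumptions:

-- Stream format (the default): each record ends at the record separator
INFILE 'product.dat'
-- Fixed-length format: every record is exactly 50 bytes long
INFILE 'product_fix.dat' "fix 50"
-- Variable-length format: the first 3 bytes of each record hold its length
INFILE 'product_var.dat' "var 3"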

In OEM, the screen shown in Figure 5.7 is where the datafile is specified, and here it is c:\easydw\load\products.dat. The alternative mechanism to specify the data file location is within the control file, as shown previously by the INFILE parameter.

The Data File

Figure 5.7. The Data File

SQL*Loader Modes of Operation

Figure 5.8 shows SQL*Loader’s three modes of operation:

  • Conventional path

  • Direct path

  • Parallel direct path

SQL*Loader Modes of Operation

Figure 5.8. SQL*Loader Modes of Operation

Conventional path load should only be used to load small amounts of data, such as initially loading a small dimension table or when loading data with data types not supported by direct path load. The conventional path load issues SQL INSERT statements. As each row is inserted, the indexes are updated, triggers are fired, and constraints are evaluated. When loading large amounts of data in a small batch window, direct path load can be used to optimize performance.

Direct path load bypasses the SQL layer. It formats the data blocks directly and writes them to the database files. When running on a system with multiple processors, the load can be executed in parallel, which can result in significant performance gains.

Data Load Options

In the next step in the wizard, you are presented with the screen shown in Figure 5.9, which contains various options to control the execution of your SQL*Loader operation.

SQL*Loader Advanced Options

Figure 5.9. SQL*Loader Advanced Options

A common method when extracting datafiles from the source system is to have the first record in the file contain the names of the fields in the subsequent data records. This mechanism helps to self-document the datafile, which can be particularly useful when resolving errors for files with large and complex record formats. However, we do not want to have to remove this header record manually every time such a file is received. The Skip Initial Rows option instructs SQL*Loader to skip it automatically.

You can create additional files during the load operation to aid in the diagnosis and correction of any errors that may occur during the load. A log file is created to record the status of the load operation. This should always be reviewed to ensure the load was successful. Copies of the records that could not be loaded into the database because of data integrity violations can be saved in a “bad” file. The records in this file can be corrected and reloaded once the data integrity problems have been fixed. If you receive an extract file with more records than you are interested in, you can load a subset of records from the file. The WHEN clause in the control file is used to select the records to load. Any records that are skipped are written to a discard file.

Select which optional files you would like created using the advanced option, as shown in Figure 5.9. In this example, we’ve selected a bad file, a discard file and a log file.

In order to specify the database account that SQL*Loader will log on with, you need to use the advanced options. Clicking on the Show Advanced Options link displays a hidden part of the page, where you can specify the account and password in the Parameters field using the USERID parameter, as shown.
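
These wizard settings correspond to ordinary SQL*Loader command-line parameters. As a sketch, an equivalent command line, assuming a product extract file with one header record to skip and the illustrative file names used earlier, might be:

sqlldr USERID=easydw/easydw CONTROL=product.ctl LOG=product.log
       BAD=product.bad DISCARD=product.dis SKIP=1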

Scheduling the Load Operation

Enterprise Manager’s job scheduling system allows you to create and manage jobs, schedule them to run, and monitor their progress. You can run a job once or choose how frequently you would like it to run. If you will run the job multiple times, you can save it in Enterprise Manager’s job library so that it can be rerun in the future. In Figure 5.10, the job is scheduled to run immediately. Click on Next to see the review screen (not shown); clicking Submit Job on that screen submits your job for execution and displays a status screen. From there, you can click the View Job button to monitor your job on the screen shown in Figure 5.11.

Scheduling the Load

Figure 5.10. Scheduling the Load

Monitoring Jobs in Enterprise Manager

Figure 5.11. Monitoring Jobs in Enterprise Manager

Monitoring Progress of the Load Operation

You can also monitor the progress of a job while it is running by going to the bottom of the Administration screen and clicking on Jobs in the Related Links section. The Job Activity page shown in Figure 5.11 lists all jobs that are either running, are scheduled to run, or have completed.

The Results section contains a list of jobs and their status. You can check to see if the job ran successfully after it has completed and look at the output to see any errors when a job has failed. By selecting the job and clicking on the View or Edit buttons you can view information about the job’s state and progress, as shown in Figure 5.12.

Monitoring Jobs in Enterprise Manager

Figure 5.12. Monitoring Jobs in Enterprise Manager

Inspecting the SQL*Loader Log

When SQL*Loader executes, it creates a log file, which can be inspected. The following example shows a log file from a sample load session that can be accessed by clicking on the Load name on the screen shown in Figure 5.12. The log file is also written to the file system (by default into the directory where the loader control file is held). Four rows were appended to the product table; one record was rejected due to invalid data. Copies of those bad records can be found in the file products.bad.

SQL*Loader: Release 10.1.0.2.0 - Production on Sat Jul 31 11:43:56 2004

Copyright (c) 1982, 2004, Oracle. All rights reserved.

Control File:   C:\easydw\load\products.ctl
Data File:      c:\easydw\load\products.dat
  Bad File:     C:\easydw\load\products.bad
  Discard File: none specified

 (Allow all discards)
Number to load: ALL
Number to skip: 0
Errors allowed: 50
Bind array:     64 rows, maximum of 64512 bytes
Continuation:    none specified
Path used:      Conventional

Table PRODUCT, loaded from every logical record.
Insert option in effect for this table: APPEND

Column Name                      Position   Len Term Encl Datatype
------------------------------ ---------- ----- ---- ---- --------
PRODUCT_ID                          FIRST     *   ,  O(') CHARACTER
PRODUCT_NAME                         NEXT     *   ,  O(') CHARACTER
CATEGORY                             NEXT     *   ,  O(') CHARACTER
COST_PRICE                           NEXT     *   ,  O(') CHARACTER
SELL_PRICE                           NEXT     *   ,  O(') CHARACTER
WEIGHT                               NEXT     *   ,  O(') CHARACTER
SHIPPING_CHARGE                      NEXT     *   ,  O(') CHARACTER
MANUFACTURER                         NEXT     *   ,  O(') CHARACTER
SUPPLIER                             NEXT     *   ,  O(') CHARACTER

value used for ROWS parameter changed from 64 to 27
Record 5: Rejected - Error on table PRODUCT, column PRODUCT_ID.
Column not found before end of logical record (use TRAILING
NULLCOLS)

Table PRODUCT:
  4 Rows successfully loaded.
  1 Row not loaded due to data errors.
  0 Rows not loaded because all WHEN clauses were failed.
  0 Rows not loaded because all fields were null.

Space allocated for bind array:        62694 bytes(27 rows)
Read buffer bytes:     64512

Total logical records skipped:              0
Total logical records read:                 5
Total logical records rejected:             1
Total logical records discarded:            0

Run began on Sat Jul 31 11:43:56 2004
Run ended on Sat Jul 31 11:43:56 2004

Elapsed time was:     00:00:00.66
CPU time was:         00:00:00.08

Optimizing SQL*Loader Performance

When loading large amounts of data in a small batch window, a variety of techniques can be used to optimize performance:

  • Using direct path load. Formatting the data blocks directly and writing them to the database files eliminate much of the work needed to execute a SQL INSERT statement. Direct path load requires exclusive access to the table or partition being loaded. In addition, triggers are automatically disabled, and constraint evaluation is deferred until the load completes.

  • Disabling integrity constraint evaluation prior to loading the data. When loading data with direct path, SQL*Loader automatically disables all CHECK and REFERENCES integrity constraints. When using parallel direct path load or loading into a single partition, other types of constraints must be disabled. You can manually disable evaluation of not null, unique, and primary-key constraints during the load process as well. When the load completes, you can have SQL*Loader reenable the constraints, or do it yourself manually.

  • Loading the data in sorted order. Presorting the data minimizes the amount of temporary storage needed during the load and allows optimizations that reduce the processing during the merge phase. To tell SQL*Loader which indexes the data is sorted on, use the SORTED INDEXES clause in the control file (see the sketch following this list).

  • Deferring index maintenance. Indexes are maintained automatically whenever data is inserted or deleted, or the key column is updated. When loading large amounts of data with direct path load, it may be faster to defer index maintenance until after the data is loaded. You can either drop the indexes prior to the beginning of the load or skip index maintenance by setting SKIP_INDEX_MAINTENANCE=TRUE on the SQL*Loader command line. Index partitions that would have been updated are marked “index unusable,” because the index segment is inconsistent with respect to the data it indexes. After the data is loaded, the indexes must be rebuilt separately.

  • Disabling redo logging by using the UNRECOVERABLE option in the control file. By default, all changes made to the database are also written to the redo log so they can be used to recover the database after failures. Media recovery is the process of recovering after the loss of a database file, often due to a hardware failure such as a disk head crash. By disabling redo logging, the load is faster.

However, if the system fails in the middle of loading the data, you need to restart the load, since you cannot use the redo log for recovery. If you are using Oracle Data Guard to protect your data with a logical or physical standby database, you may not want to disable redo logging. Any data not logged cannot be automatically applied to the standby site.

After the data is loaded, using the UNRECOVERABLE option, it is important to do a backup to make sure you can recover the data in the future if the need arises.

  • Loading the data into a single partition. While you are loading a partition of a partitioned or subpartitioned table, other users can continue to access the other partitions in the table. Loading the April transactions will not prevent users from querying the existing data for January through March. Thus, overall availability of the warehouse is increased.

  • Loading the data in parallel. When a table is partitioned, it can be loaded into multiple partitions in parallel. You can also set up multiple, concurrent sessions to perform a load into the same table or into the same partition of a partitioned table.

  • Increasing the STREAMSIZE parameter can lead to better direct path load times, since larger amounts of data will be passed in the data stream from the SQL*Loader client to the Oracle server.

  • If the data being loaded contains many duplicate dates, using the DATE_CACHE parameter can lead to better performance of direct path load. Use the date cache statistics (entries, hits, and misses) contained in the SQL*Loader log file to tune the size of the cache for future similar loads.
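
To pull several of these techniques together, the following sketch shows what a direct path load of the PURCHASES table might look like with sorted input and tuned stream and date-cache sizes. The control file is based on the purchases example shown later in this chapter; the STREAMSIZE and DATE_CACHE values are illustrative assumptions rather than recommendations.

-- purchases_sorted.ctl -- illustrative control file
UNRECOVERABLE LOAD DATA
INFILE 'purchases.dat' BADFILE 'purchases.bad'
APPEND
INTO TABLE purchases
SORTED INDEXES (purchase_time_index)
(product_id          position (1-6)    char,
 time_key            position (7-17)   date "DD-MON-YYYY",
 customer_id         position (18-25)  char,
 ship_date           position (26-36)  date "DD-MON-YYYY",
 purchase_price      position (37-43)  decimal external,
 shipping_charge     position (44-49)  integer external,
 today_special_offer position (50)     char)

sqlldr USERID=easydw/easydw CONTROL=purchases_sorted.ctl LOG=purchases.log
       DIRECT=TRUE STREAMSIZE=512000 DATE_CACHE=5000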

SQL*Loader Direct Path Load of a Single Partition

Next, we will look at an example of loading data into a single partition. In the EASYDW warehouse, the fact table is partitioned by date, with a partition for each month. At the end of each month, that month’s sales transactions are loaded into the EASYDW warehouse.

In the following example, we create a tablespace, add a partition to the purchases table, and then use SQL*Loader direct path load to insert the data into the January 2005 partition.

Step 1: Create a Tablespace

CREATE TABLESPACE purchases_jan2005
 DATAFILE
  'C:\ORACLE\PRODUCT\10.1.0\ORADATA\EASYDW\PURCHASESJAN2005.f'
 SIZE 5M
 REUSE AUTOEXTEND ON
 DEFAULT STORAGE
  (INITIAL 64K NEXT 64K PCTINCREASE 0 MAXEXTENTS UNLIMITED);

Step 2: Add a Partition

If our new partition is higher than the last partition in the table (i.e., based on the boundary clauses), then we can add a partition, as follows:

ALTER TABLE easydw.purchases
      ADD PARTITION purchases_jan2005
      VALUES LESS THAN (TO_DATE('01-02-2005', 'DD-MM-YYYY'))
      PCTFREE 0 PCTUSED 99
      STORAGE (INITIAL 64K NEXT 64K PCTINCREASE 0)
      TABLESPACE purchases_jan2005;

Step 3: Disable All Referential Integrity Constraints and Triggers

When using direct path load of a single partition, referential and check constraints on the table partition must be disabled, along with any triggers.

SQL> ALTER TABLE purchases DISABLE CONSTRAINT fk_time;
SQL> ALTER TABLE purchases DISABLE CONSTRAINT fk_product_id;
SQL> ALTER TABLE purchases DISABLE CONSTRAINT fk_customer_id;

The status column in the USER_CONSTRAINTS view can be used to determine if the constraint is currently enabled or disabled. Here we can see that the special_offer constraint is still enabled.

SQL> SELECT TABLE_NAME, CONSTRAINT_NAME, STATUS
     FROM USER_CONSTRAINTS
     WHERE TABLE_NAME = 'PURCHASES';

TABLE_NAME    CONSTRAINT_NAME         STATUS
------------- ----------------------- --------
PURCHASES     NOT_NULL_PRODUCT_ID     DISABLED
PURCHASES     NOT_NULL_TIME           DISABLED
PURCHASES     NOT_NULL_CUSTOMER_ID    DISABLED
PURCHASES     SPECIAL_OFFER           ENABLED
PURCHASES     FK_PRODUCT_ID           DISABLED
PURCHASES     FK_TIME                 DISABLED
PURCHASES     FK_CUSTOMER_ID          DISABLED

7 rows selected.

The STATUS column in the USER_TRIGGERS (or ALL_TRIGGERS) view can be used to determine whether any triggers must be disabled. There are no triggers on the PURCHASES table.

SQL> SELECT TRIGGER_NAME, STATUS
     FROM ALL_TRIGGERS
     WHERE TABLE_NAME = 'PURCHASES';

no rows selected

Step 4: Load the Data

The following example shows the SQL*Loader control file to load new data into a single partition. Note that the partition clause is used.

OPTIONS (DIRECT=TRUE)
UNRECOVERABLE LOAD DATA
INFILE 'purchases.dat' BADFILE 'purchases.bad'
APPEND
INTO TABLE purchases
PARTITION (purchases_jan2005)
(product_id          position (1-6)    char,
 time_key            position (7-17)   date "DD-MON-YYYY",
 customer_id         position (18-25)  char,
 ship_date           position (26-36)  date "DD-MON-YYYY",
 purchase_price      position (37-43)  decimal external,
 shipping_charge     position (44-49)  integer external,
 today_special_offer position (50)     char)

The UNRECOVERABLE keyword disables redo logging for this operation, and hence media recovery, for the table being loaded; it also requires that the DIRECT option be used. Database changes being made by other users will continue to be logged. After disabling media recovery, it is important to do a backup to make it possible to recover the data in the future if the need arises. If you attempted media recovery before the backup was taken, you would discover that the data blocks that were loaded have been marked as logically corrupt. To recover the data, if you haven’t performed the backup following the load operation, you would have to drop the partition and reload the data.

Any data that cannot be loaded will be written to the file purchases.bad. Data will be loaded into the PURCHASES_JAN2005 partition. This example shows loading a fixed-length file named purchases.dat. Each field in the input record is described by specifying the starting position, its ending position, and its data type. Note: These are SQL*Loader data types representing the data formats in the file, not the data types in an Oracle table. When the data is loaded into the tables, each field is converted to the data type of the Oracle table column, if necessary.

The following example shows a sample of the purchases data file. The PRODUCT_ID starts in column 1 and is six bytes long. The TIME_KEY starts in column 7 and is 11 bytes long. The data mask “DD-MON-YYYY” is used to describe the input format of the date fields.

12345678901234567890123456789012345678901234567890
      |          |       |          |      |     |
SP100001-jan-2005AB12367501-jan-20050067.23004.50N
SP101001-jan-2005AB12367301-jan-20050047.89004.50N

The alternative method of invoking SQL*Loader direct path mode is from the command line, using DIRECT=TRUE. In this example, SKIP_INDEX_MAINTENANCE is also set to TRUE, so the indexes will need to be rebuilt after the load.

sqlldr USERID=easydw/easydw CONTROL=purchases.ctl
LOG=purchases.log DIRECT=TRUE SKIP_INDEX_MAINTENANCE=TRUE

Step 5: Inspect the Log

The following example shows a portion of the SQL*Loader log file from the load operation. Rather than generating redo to allow recovery, invalidation redo was generated to let Oracle know this table cannot be recovered. The indexes were made unusable. The column starting position and length are described.

SQL*Loader: Release 10.1.0.2.0 - Production on Mon Jun 7 14:59:24 2004

Copyright (c) 1982, 2004, Oracle. All rights reserved.

Control File:   purchases.ctl
Data File:      purchases.dat
  Bad File:     purchases.bad
  Discard File:  none specified

 (Allow all discards)
Number to load: ALL
Number to skip: 0
Errors allowed: 50
Continuation:    none specified
Path used:      Direct

Load is UNRECOVERABLE; invalidation redo is produced.
Table PURCHASES, partition PURCHASES_JAN2005, loaded from every logical record.
Insert option in effect for this partition: APPEND
   Column Name                  Position   Len  Term Encl Datatype
------------------------------ ---------- ----- ---- ---- ---------------------
PRODUCT_ID                            1:6     6           CHARACTER
TIME_KEY                             7:17    11           DATE DD-MON-YYYY
CUSTOMER_ID                         18:25     8           CHARACTER
SHIP_DATE                           26:36    11           DATE DD-MON-YYYY
PURCHASE_PRICE                      37:43     7           CHARACTER
SHIPPING_CHARGE                     44:49     6           CHARACTER
TODAY_SPECIAL_OFFER                    50     1           CHARACTER

Record 3845: Discarded - all columns null.
The following index(es) on table PURCHASES were processed:
index EASYDW.PURCHASE_CUSTOMER_INDEX partition PURCHASES_JAN2005 was made unusable
due to:
SKIP_INDEX_MAINTENANCE option requested
index EASYDW.PURCHASE_PRODUCT_INDEX partition PURCHASES_JAN2005 was made unusable due
to:
SKIP_INDEX_MAINTENANCE option requested
index EASYDW.PURCHASE_SPECIAL_INDEX partition PURCHASES_JAN2005 was made unusable due
to:
SKIP_INDEX_MAINTENANCE option requested
index EASYDW.PURCHASE_TIME_INDEX partition PURCHASES_JAN2005 was made unusable due
to:
SKIP_INDEX_MAINTENANCE option requested

Step 6: Reenable All Constraints and Triggers, Rebuild Indexes

After loading the data into a single partition, any constraints and triggers that were disabled must be reenabled. All local indexes for the partition can be maintained by SQL*Loader. Global indexes are not maintained on single partition or subpartition direct path loads and must be rebuilt. In the previous example, the indexes must be rebuilt, since index maintenance was skipped.

These steps are discussed in more detail later in the chapter.
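
As a quick preview, and using the constraint and index names from this example, the statements involved might look like the following sketch; whether to validate the constraints at this point is discussed under the postload operations:

ALTER TABLE purchases ENABLE CONSTRAINT fk_time;
ALTER TABLE purchases ENABLE CONSTRAINT fk_product_id;
ALTER TABLE purchases ENABLE CONSTRAINT fk_customer_id;

ALTER INDEX purchase_time_index     REBUILD PARTITION purchases_jan2005;
ALTER INDEX purchase_product_index  REBUILD PARTITION purchases_jan2005;
ALTER INDEX purchase_customer_index REBUILD PARTITION purchases_jan2005;
ALTER INDEX purchase_special_index  REBUILD PARTITION purchases_jan2005;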

SQL*Loader Parallel Direct Path Load

When a table is partitioned, the direct path loader can be used to load multiple partitions in parallel. Each parallel direct path load session should load into a partition stored on a separate disk to minimize I/O contention.

Since data is extracted from multiple operational systems, you will often have several input files that need to be loaded into the warehouse. These files can be loaded in parallel, and the workload distributed among several concurrent SQL*Loader sessions.

Figure 5.13 shows an example of how parallel direct path load can be used to initially load the historical transactions into the purchases table. You need to invoke multiple SQL*Loader sessions. Each SQL*Loader session takes a different datafile as input. In this example, there are three data files, each containing the purchases for one month: January, February, and March. These will be loaded into the purchases table, which is also partitioned by month. Each datafile is loaded in parallel into its own partition.

SQL*Loader Parallel Direct Path Load

Figure 5.13. SQL*Loader Parallel Direct Path Load

The following steps are suggested when loading data in parallel using SQL*Loader.

Step 1: Disable All Constraints and Triggers

Constraints cannot be evaluated, and triggers cannot be fired during a parallel direct path load. If you forget, SQL*Loader will issue an error.

Step 2: Drop all Indexes

Indexes cannot be maintained during a parallel direct path load. However, if we are loading only a few partitions out of many, then it is probably better to skip index maintenance and have them marked as unusable instead.

Step 3: Load the Data

By invoking multiple SQL*Loader sessions and setting DIRECT and PARALLEL to TRUE, the processes will load concurrently. Depending on your operating system, you may need to put an “&” at the end of each line so that each sqlldr command starts without waiting for the previous one to complete.

sqlldr userid=easydw/easydw CONTROL=jan.ctl DIRECT=TRUE PARALLEL=TRUE
sqlldr userid=easydw/easydw CONTROL=feb.ctl DIRECT=TRUE PARALLEL=TRUE
sqlldr userid=easydw/easydw CONTROL=mar.ctl DIRECT=TRUE PARALLEL=TRUE

Step 4: Inspect the Log

A portion of one of the log files follows. Note that the mode is direct with the parallel option.

SQL*Loader: Release 10.1.0.2.0 - Production on Sat Jun 5 18:30:43 2004

Copyright (c) 1982, 2004, Oracle. All rights reserved.

Control File:   c:\feb.ctl
Data File:      c:\feb.dat
  Bad File:     c:\feb.bad
  Discard File:  none specified

 (Allow all discards)

Number to load: ALL
Number to skip: 0
Errors allowed: 50
Continuation:    none specified
Path used:      Direct - with parallel option.

Load is UNRECOVERABLE; invalidation redo is produced.

Table PURCHASES, partition PURCHASES_FEB2004, loaded from every logical record.
Insert option in effect for this partition: APPEND

Step 5: Reenable All Constraints and Triggers, Recreate All Indexes

After using parallel direct path load, reenable any constraints and triggers that were disabled for the load. Recreate any indexes that were dropped.

Transformations Using SQL*Loader

If you receive extract files that have data in them that you do not want to load into the warehouse, you can use SQL*Loader to filter the rows of interest. You select the records that meet the load criteria by specifying a WHEN clause to test an equality or inequality condition in the record. If a record does not satisfy the WHEN condition, it is written to the discard file. The discard file contains records that were filtered out of the load, because they did not match any record-selection criteria specified in the control file. Note that these records differ from rejected records written to the BAD file. Discarded records do not necessarily have any bad data. The WHEN clause can be used with either conventional or direct path load.

You can use SQL*Loader to perform simple types of transformations on character data. For example, portions of a string can be inserted using the substring function; two fields can be concatenated together using the CONCAT operator. You can trim leading or trailing characters from a string using the trim operator. The control file in the following example illustrates both the use of the WHEN clause to discard any rows where the PRODUCT_ID is blank and how to uppercase the PRODUCT_ID column.

LOAD DATA
INFILE 'product.dat' append
INTO TABLE product WHEN product_id != BLANKS
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'"
(product_id "upper(:product_id)",
 product_name,
 category,
 cost_price,
 sell_price,
 weight,
 shipping_charge,
 manufacturer,
 supplier)

The discard file is specified when invoking SQL*Loader from the command line, as shown here, or with the OEM Load wizard, as seen previously.

sqlldr userid=easydw/easydw CONTROL=product.ctl
LOG=product.log BAD=product.bad DISCARD=product.dis DIRECT=true

These types of transformations can be done with both direct path and conventional path modes; however, since they are applied to each record individually, they do have an impact on the load performance.
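
As a further illustration of the string functions mentioned above, the same control file could apply TRIM, SUBSTR, and CONCAT expressions in the field list; the particular expressions chosen here are assumptions intended only to show the syntax:

LOAD DATA
INFILE 'product.dat' append
INTO TABLE product WHEN product_id != BLANKS
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'"
(product_id      "UPPER(TRIM(:product_id))",
 product_name    "SUBSTR(:product_name, 1, 30)",
 category        "UPPER(:category)",
 cost_price,
 sell_price,
 weight,
 shipping_charge,
 manufacturer    "CONCAT(TRIM(:manufacturer), ' Inc')",
 supplier)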

SQL*Loader Postload Operations

After the data is loaded, you may need to process exceptions, reenable constraints, and rebuild indexes.

Step 1: Inspect the Logs

Always look at the logs to ensure that the data was loaded successfully. Validate that the correct number of rows have been added.

Step 2: Process the Load Exceptions

Look in the .bad file to find out which rows were not loaded. Records that fail NOT NULL constraints are rejected and written to the SQL*Loader bad file.

Step 3: Reenable Data Integrity Constraints

Ensure that referential integrity is reenabled, if you have not already done so. Make sure that each foreign key in the fact table has a corresponding primary key in each dimension table. For the EASYDW warehouse, each row in the PURCHASES table needs to have a valid CUSTOMER_ID in the CUSTOMER table and a valid PRODUCT_ID in the PRODUCT table.

When using direct path load, CHECK and REFERENCES integrity constraints were disabled. When using parallel direct path load, all constraints were disabled.

Step 4: Handle Constraint Violations

To find the rows with bad data, you can create an exceptions table. Create the table named “exceptions” by running the script UTLEXCPT.SQL. When enabling the constraint, specify the name of the table the exceptions should be written to.

SQL> ALTER TABLE purchases
       ENABLE CONSTRAINT fk_product_id
       EXCEPTIONS INTO exceptions;

In our example, two rows had bad PRODUCT_IDs. A sale was made for a product that does not exist in the product dimension.

SQL> SELECT * FROM EXCEPTIONS;

ROW_ID             OWNER      TABLE_NAME  CONSTRAINT
------------------ ---------- ----------- -------------
AAAOZAAAkAAAABBAAB EASYDW     PURCHASES   FK_PRODUCT_ID
AAAOZAAAkAAAABBAAA EASYDW     PURCHASES   FK_PRODUCT_ID

To find out which rows have violated referential integrity, select the rows from the purchases table where the rowid is in the exception table. In this example, there are two rows where there is no matching product in the products dimension.

SQL> SELECT * from purchases WHERE rowid in (select row_id from
exceptions);

PRODUCT TIME_KEY  CUSTOMER_ID SHIP_DATE PURCHASE SHIPPING TODAY_SPECIAL
     ID                                    PRICE   CHARGE         OFFER
------- --------- ----------- --------- -------- -------- -------------
XY1001  01-JAN-05 AB123675    01-JAN-05    67.23      4.5 N
AB1234  01-JAN-05 AB123673    01-JAN-05    47.89      4.5 N

Hint

It is important to fix any referential integrity constraint problems, particularly if you are using summary management, which relies on the correctness of these relationships to perform query rewrite.
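
One way to resolve such violations is sketched below, under the assumption that the offending fact rows should simply be removed; in many cases the better fix is to add the missing products to the PRODUCT dimension instead. After the cleanup, the constraint can be enabled with full validation:

DELETE FROM purchases
 WHERE rowid IN (SELECT row_id FROM exceptions);

ALTER TABLE purchases ENABLE VALIDATE CONSTRAINT fk_product_id;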

Step 5: Enabling Constraints without Validation

If integrity checking is maintained in an application, or the data has already been cleansed, and you know it will not violate any integrity constraints, enable the constraints with the NOVALIDATE clause. Include the RELY clause for query rewrite (which is discussed in Chapter 9). Since summary management and other tools depend on the relationships defined by referential integrity constraints, you should always define the constraints, even if they are not validated.

ALTER TABLE purchases ENABLE NOVALIDATE CONSTRAINT fk_product_id;
ALTER TABLE purchases MODIFY CONSTRAINT fk_product_id RELY;

Check for Unusable Indexes

Prior to publishing data in the warehouse, you should check to determine if any indexes are in an unusable state. An index becomes unusable when it no longer contains index entries to all the data. An index may be marked unusable for a variety of reasons.

  • You requested that index maintenance be deferred, using the skip_index_maintenance parameter when invoking SQL*Loader.

  • UNIQUE constraints were not disabled when using SQL*Loader direct path. At the end of the load, the constraints are verified when the indexes are rebuilt. If any duplicates are found, the index is not correct and will be left in an “Index Unusable” state.

The index must be dropped and recreated or rebuilt to make it usable again. If one partition is marked UNUSABLE, the other partitions of the index are still valid.

To check for unusable indexes, query the USER_INDEXES view, as illustrated in the following code. Here we can see that the PRODUCT_PK_INDEX is unusable. If an index is partitioned, its status is N/A.

SQL> SELECT INDEX_NAME, STATUS FROM USER_INDEXES;

INDEX_NAME                    STATUS
------------------------------------
CUSTOMER_PK_INDEX             VALID
I_SNAP$_CUSTOMER_SUM          VALID
PRODUCT_PK_INDEX              UNUSABLE
PURCHASE_CUSTOMER_INDEX       N/A
PURCHASE_PRODUCT_INDEX        N/A
PURCHASE_SPECIAL_INDEX        N/A
PURCHASE_TIME_INDEX           N/A
TIME_PK_INDEX                 VALID
TSO_PK_INDEX                  VALID

9 rows selected.

Next, check to see if any index partitions are in an unusable state by querying the USER_IND_PARTITIONS view. In the following example, the PURCHASE_TIME_INDEX partitions for April, May, and June 2004 are unusable.

SQL> SELECT INDEX_NAME, PARTITION_NAME, STATUS FROM
     USER_IND_PARTITIONS WHERE STATUS != 'VALID';
INDEX_NAME                PARTITION_NAME       STATUS
------------------------- -------------------- --------
PURCHASE_TIME_INDEX       PURCHASES_MAR2004    USABLE
PURCHASE_TIME_INDEX       PURCHASES_APR2004    UNUSABLE
PURCHASE_TIME_INDEX       PURCHASES_MAY2004    UNUSABLE
PURCHASE_TIME_INDEX       PURCHASES_JUN2004    UNUSABLE

Rebuild Unusable Indexes

If you have any indexes that are unusable, you must rebuild them prior to accessing the table or partition. In the following example, the PURCHASE_TIME_INDEX partition for the newly loaded PURCHASES_APR2004 partition is rebuilt.

SQL> ALTER INDEX purchase_time_index
     REBUILD PARTITION purchases_apr2004;

Loading the Warehouse Using Data Pump

Data Pump is a new facility in Oracle Database 10g that enables very rapid movement of data and metadata. In any data warehouse environment, there is the problem of transporting data from the source database to the warehouse database. When your source system is also an Oracle Database 10g database, it makes sense to use the fast Data Pump import and export utilities to transport your data rather than resorting to character-based operating system files. The performance increase afforded by Data Pump import and export enables the necessary rapid transfer of data from one database to another, which is very important when moving the large volumes found when sourcing data for a warehouse. Data Pump reads and writes dump files in its own binary file format; this is in contrast to SQL*Loader, which can use files of different formats as defined by the control file.

If you have used the existing import and export utilities, imp and exp, then the new Data Pump import and export will be very familiar to you. However, in addition to the improved engine for shifting the data, they also provide a lot more functionality and control.

Hint

The dump files created and used by imp and exp cannot be used by the new Data Pump import and export.

Data Pump Import/Export—impdp and expdp

The first, and most important, thing to note about Data Pump impdp and expdp is that they are client-side utilities: the data movement itself is performed by a job running on the database server, which accesses the dump files in directories on the database host. This detachment of the client utility from the server-side job means that more control over, and independence from, the job is now available. For example, the client no longer has to be dedicated to just one job on the server; for long-running jobs it can detach from the job and reattach, monitor and control a second job, or invoke a completely separate import/export operation.

Let’s have a brief look at the main new features of Data Pump impdp and expdp before working through an example of how they are used in our warehouse.

  • Restartability. The new utilities may stop and start the database server-side job. The new commands are STOP_JOB and START_JOB.

  • Attaching and detaching from a job. The client impdp and expdp can attach and detach from the database job, which provides a much needed improvement in the control over these potentially long-running operations.

  • Performing a network import. A new parameter, NETWORK_LINK, when used on an import, instructs a local database to access a remote database, retrieve data from it, and insert it into the local database.

  • Performing a network export. On export operations, NETWORK_LINK allows data from a remote database to be written to dump files on the local database server.

  • Mapping of datafile names. For import operations it is now possible to control the transformation of a datafile name on the source database to a different datafile name on the import database. Of particular importance, this feature can be used to change the file name formats used by different operating systems.

  • Remapping of tablespaces during import.

  • Filtering the objects that are exported and imported. Two new parameters are provided, called INCLUDE and EXCLUDE, which enable specification of the type of object and also control over the name of the objects.

  • Filtering the data that is exported. By using the Data Pump QUERY parameter, a subset of the data can be exported via a WHERE clause.

  • The ability to estimate the space an export job would use prior to performing the export.

  • The ability to use multiple dump files.

  • The ability to control the degree of parallelism by using the PARALLEL parameter. If multiple dump files are used, then PARALLEL is normally set to the number of files, as sketched in the example following this list.
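
As an illustration of the last two points, a parallel schema-level export might be invoked as shown below. The dump file template (using the %U substitution variable to number the files), the degree of parallelism, and the directory objects (created in the next section) are assumptions for this sketch:

expdp easydw/easydw@easydw SCHEMAS=easydw DIRECTORY=data_file_dir
      DUMPFILE=exp%U.dmp PARALLEL=4
      LOGFILE=log_file_dir:exp_parallel.log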

Specifying the Location of the Datafile and Log Files for Data Pump Tools

Before using Data Pump, a directory object must be created in the database to specify the location of the dump files and log files. The dump and log files must be on the same machine as the database server, or must be accessible to the server (i.e., on a networked drive). This is different from SQL*Loader, where the SQL*Loader client sends the data to the database server. Directories are created in a single namespace and are not owned by an individual’s schema. In the following example, the DBA creates directories for the data and log files.

CREATE OR REPLACE DIRECTORY data_file_dir AS 'C:\datafiles';
CREATE OR REPLACE DIRECTORY log_file_dir AS 'C:\logfiles';

Read and write access must be granted to the users that will be accessing the dump files in the directory. In the example below, the DBA grants read and write access to user easydw.

GRANT READ, WRITE ON DIRECTORY data_file_dir TO easydw;
GRANT READ, WRITE ON DIRECTORY log_file_dir TO easydw;

To find the directories defined in a database, check the DBA_DIRECTORIES view.

SELECT * FROM DBA_DIRECTORIES;

OWNER        DIRECTORY_NAME          DIRECTORY_PATH
------------ ----------------------- ---------------
SYS          LOG_FILE_DIR            c:\logfiles
SYS          DATA_FILE_DIR           c:\datafiles

Moving Data between Databases

Suppose we want to export all of the OLTP ORDERS data for December 2004 from our OLTP database and move it to our warehouse database. Let us create a job to export the data using the new Data Pump export utility. We will use a parameter file called expdp_par.txt for this, which will contain all of the command-line arguments.

Hint

It is advisable to use a parameter file when the command contains keywords with values that contain spaces or quoted strings, such as QUERY and INCLUDE.

SCHEMAS=(OLTP)
INCLUDE=TABLE:"IN ('ORDERS')"
QUERY=OLTP.ORDERS:"WHERE purchase_date BETWEEN to_date('01-dec-2004','dd-mon-yyyy') AND to_date('31-dec-2004','dd-mon-yyyy')"
DIRECTORY=data_file_dir
DUMPFILE=exp1.dmp
LOGFILE=log_file_dir:exporders2004.log

The QUERY clause allows us to specify a WHERE clause to filter the data. If the source table ORDERS were partitioned by date, with a partition for each month, an alternative to using the QUERY parameter would be to use the TABLES clause to specify the partition, as shown here, where DEC04 is the name of the partition.

TABLES=OLTP.ORDERS:DEC04

Now we launch Data Pump export with the following command:

expdp oltp/oltp@easydw parfile=expdp_par.txt

The output at the start of the expdp execution is shown below:

Export: Release 10.1.0.2.0 - Production on Thursday, 08 July, 2004
21:11

Copyright (c) 2003, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release
10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options
FLASHBACK automatically enabled to preserve database integrity.
Starting "OLTP"."SYS_EXPORT_SCHEMA_01": oltp/********@easydw
parfile=expdp_par.txt
Estimate in progress using BLOCKS method...
Processing object type SCHEMA_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 64 KB
Processing object type SCHEMA_EXPORT/TABLE/PROCACT_INSTANCE
Processing object type SCHEMA_EXPORT/TABLE/TABLE
Processing object type SCHEMA_EXPORT/TABLE/CONSTRAINT/CONSTRAINT
Processing object type SCHEMA_EXPORT/TABLE/STATISTICS/
TABLE_STATISTICS

Here we can see that Data Pump export will automatically use the new Oracle Database 10g Flashback capability of the database to preserve the integrity of our exported data. Flashback is part of a much larger feature in Oracle Database 10g that enables earlier versions of the data to be retrieved by rewinding the view of the data to an earlier point in time. We will look at Flashback in more detail in Chapter 17, but for our export, Data Pump is using Flashback to maintain a consistent view of the data as of the point when we issued the expdp command.

This should not be confused with the Data Pump export parameter called FLASHBACK_TIME that enables export of data that is consistent with an earlier point in time.

Another new item to note in the output from Data Pump export is that an estimate of the export file size is provided for each of the tables exported. There are two mechanisms that are used, BLOCKS and STATISTICS, which are specified by the ESTIMATE parameter. Associated with this, there is also the ability to cap the size of the generated dump file using the FILESIZE parameter.
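
As a sketch of how these might be used, the first command below only estimates the export size, while the second caps each dump file at two gigabytes; the FILESIZE value and the choice of ESTIMATE=STATISTICS are illustrative assumptions:

expdp oltp/oltp@easydw SCHEMAS=OLTP DIRECTORY=data_file_dir ESTIMATE_ONLY=Y

expdp oltp/oltp@easydw SCHEMAS=OLTP DIRECTORY=data_file_dir
      DUMPFILE=exp%U.dmp FILESIZE=2G ESTIMATE=STATISTICS
      LOGFILE=log_file_dir:exp_size.log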

Now that we have exported our data, we can move the dump file exp1.dmp between our source server and our warehouse server using any conventional means such as FTP.

On the warehouse server, we must now import the data using impdp. Let’s look at some new options that are available to us. During the import we may want to put our source data in a different tablespace to keep it separated from the tablespaces that we are using for our warehouse tables. We do this using the REMAP_TABLESPACE parameter, which is used to specify the mapping between the tablespace on the source system (USERS) and the tablespace on the warehouse (STAGE).

Similarly, we want to change the ownership of the table from the OLTP account on the source system to the EASYDW account on the warehouse, and this mapping is specified using the REMAP_SCHEMA parameter.

impdp easydw/easydw@easydw DIRECTORY=data_file_dir
DUMPFILE=exp1.dmp
LOGFILE=log_file_dir:imporders2004.log
REMAP_TABLESPACE=users:stage REMAP_SCHEMA=oltp:easydw

In the same fashion that control can be exercised over what Data Pump export writes, Data Pump import can control the type of metadata that is imported. For example, by using the EXCLUDE=GRANT parameter and keyword during Data Pump import, it is possible to exclude object grants from being imported. In this way, Data Pump import will not try to regrant privileges that were only valid on the source database.

Improved Job Monitoring and Control

Let’s look in more detail at the new improvements to monitoring and controlling our import and export jobs that the new impdp and expdp provide.

In Oracle Database 10g, impdp and expdp are client-side tools that communicate with a database server-side job that actually does the import or export. This gives the client the opportunity to attach to the job that is running on the database, monitor its status, suspend it or alter some of its execution parameters, detach from it and reattach, and monitor and control another job. This provides a lot more flexibility than the preceding imp and exp utilities. Note that impdp and expdp can be attached to, and control, only one server-side job at a time; however, you can have multiple impdp/expdp clients running, each of which can be attached to the same or different server-side jobs.

When these tools are running in normal noninteractive mode, you can move into interactive mode by typing Control-C.

In interactive mode, the commands that we can issue are:

  • ADD_FILE to add a new dump file for the export

  • KILL_JOB to delete the job

  • STOP_JOB, which will suspend the processing of our job and exit back to the operating system

  • PARALLEL to alter the degree of parallelism

If we want to reconnect Data Pump export to a suspended job, then we can reinvoke expdp with the ATTACH command-line parameter. If we originally provided our export job with a job name (e.g., JOB_NAME=expfull), then we can specifically name the job that we want to attach to. If there is only one job executing for the database account, then Data Pump export will automatically attach to it.

An example of this activity is shown in the following code. We are running Data Pump export using the JOB_NAME=expfull parameter.

C:> expdp easydw/easydw@easydw full=y directory=data_file_dir dumpfile=
full_dump.dat job_name=expfull

Export: Release 10.1.0.2.0 - Production on Saturday, 10 July, 2004 10:32

Copyright (c) 2003, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options
FLASHBACK automatically enabled to preserve database integrity.
Starting "EASYDW"."EXPFULL": easydw/********@easydw full=y directory=data_file_dir
dumpfile=full_dump.dat job_name=expfull
Estimate in progress using BLOCKS method...
                                     ← Issue a Control-C here
Export> stop_job=immediate
Are you sure you wish to stop this job ([y]/n): y
                                     ← Returned to the operating system


C:\easydw\dp> expdp easydw/easydw@easydw attach=expfull  ← Restart expdp and name
                                                             the job to attach to
Export: Release 10.1.0.2.0 - Production on Saturday, 10 July, 2004 10:32

Copyright (c) 2003, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options

Job: EXPFULL                  ← Automatic report of the job status when we attach
  Owner: EASYDW
  Operation: EXPORT
  Creator Privs: FALSE
  GUID: F3A0628C541B4DD5B35F34C164A8408A
  Start Time: Saturday, 10 July, 2004 10:32
  Mode: FULL
  Instance: easydw
  Max Parallelism: 1
  EXPORT Job Parameters:
  Parameter Name      Parameter Value:
     CLIENT_COMMAND        easydw/********@easydw full=y directory=data_file_dir
 dumpfile=full_dump.dat job_name=expfull
     DATA_ACCESS_METHOD    AUTOMATIC
     ESTIMATE              BLOCKS
     INCLUDE_METADATA      1
     LOG_FILE_DIRECTORY    DATA_FILE_DIR
     LOG_FILE_NAME         export.log
     TABLE_CONSISTENCY     0
  State: IDLING
  Bytes Processed: 0
  Current Parallelism: 1
  Job Error Count: 0
  Dump File: c:\easydw\external\full_dump.dat
    bytes written: 4,096

Worker 1 Status:
  State: UNDEFINED

Export> continue_client      ← Restart the job and the logging.
                                 Return to non-interactive mode
Job EXPFULL has been reopened at Saturday, 10 July, 2004 10:32
Restarting "EASYDW"."EXPFULL": easydw/********@easydw full=y
directory=data_file_dir dumpfile=full_dump.dat job_name=expfull
Estimate in progress using BLOCKS method...
Processing object type DATABASE_EXPORT/SCHEMA/TABLE/TABLE_DATA
                                   ← Issue a Control-C here
Export> stop_job=immediate
Are you sure you wish to stop this job ([y]/n): y

We have only touched upon some of the new, powerful features available with Data Pump import and export, but it should be apparent that the features provided are very important in a warehouse environment when dealing with large data volumes. There are three main areas that benefit our warehouse management from the new Data Pump import and export utilities:

  1. The significantly improved speed and performance from the new Data Pump architecture used to move the data in and out of the database

  2. The new parameters, which provide additional control of the import and export jobs

  3. The new client architecture and commands, which provide better monitoring and management of the jobs

Loading the Warehouse Using External Tables

An external table is a table stored outside the database in a flat file. The data in an external table can be queried just like a table stored inside the database. You can select columns, rows, and join the data to other tables using SQL. The data in the external table can be accessed in parallel just like tables stored in the database.

External tables are read-only. No DML operations are allowed, and you cannot create indexes on an external table. If you need to update the data or access it more than once, you can load it into the database, where you can update it and add indexes to improve query performance. By loading the data into the database, you can also manage it as part of the database; RMAN, for example, will not back up the data of any external tables that are defined in the database.

Since you can query the data in the external table you have just defined using SQL, you can also load the data using an INSERT SELECT statement. External tables provide an alternative to SQL*Loader for loading data from flat files. They can be used to perform more complex transformations while the data is loaded, and they simplify many of the operational aspects of loading data in parallel and managing triggers, constraints, and indexes. However, in Oracle Database 10g, SQL*Loader direct path load may still be faster in many cases. There are many similarities between the two methods, and if you’ve used SQL*Loader, you will find it easy to learn to use external tables.

External tables require the same kind of directory objects that we defined earlier.

Creating an External Table

In order to create an external table you must specify the following:

  • The metadata, which describes how the data looks to Oracle, including the external table name, the column names, and Oracle data types. These are the names you will use in the SQL statements to access the external table. The metadata is stored in the Oracle data dictionary. In the example in Figure 5.14, the external table name is NEW_PRODUCTS. The data is stored outside the database. This differs from the other regular tables in the database, where the data and metadata are both stored in the database.

    External Table

    Figure 5.14. External Table

  • The access parameters, which describe how the data is stored in the external file (i.e., where it’s located, its format, and how to identify the fields and records). The access driver uses the access parameters. An example of the access parameters is shown in the CREATE EXTERNAL TABLE statement.

In Oracle 9i, there was only one type of access driver, called ORACLE_LOADER, which provides read-only access to flat files. In Oracle Database 10g, a new ORACLE_DATAPUMP access driver is introduced, which provides greater performance, as we have already discussed, and new features, which we will discuss later. The access driver is specified in the TYPE clause of the CREATE EXTERNAL TABLE statement.

The following example shows the SQL to create an external table. In the CREATE TABLE statement, the ORGANIZATION EXTERNAL clause identifies the table as external, and the TYPE clause specifies the access driver and begins the description of the structure of the file stored on disk. The metadata describing the column names and data types is listed and looks very similar to what is stored in a SQL*Loader control file. The locations of the datafile, log file, and bad file are specified here, and in this example we specify that each field is separated by a comma and optionally enclosed in a single quote. By specifying the PARALLEL clause, the data will be loaded in parallel.

CREATE TABLE new_products
(product_id             VARCHAR2(8),
 product_name           VARCHAR2(30),
 category               VARCHAR2(4),
 cost_price             NUMBER (6,2),
 sell_price             NUMBER (6,2),
 weight                 NUMBER (4,2),
 shipping_charge        NUMBER (5,2),
 manufacturer           VARCHAR2(20),
 supplier               VARCHAR2(10))
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
   DEFAULT DIRECTORY data_file_dir
   ACCESS PARAMETERS
        (RECORDS DELIMITED BY NEWLINE
         CHARACTERSET US7ASCII
         BADFILE log_file_dir:'product.bad'
         LOGFILE log_file_dir:'product.log'
         FIELDS TERMINATED BY ','
                OPTIONALLY ENCLOSED BY "'")
   LOCATION  ('product.dat')
)
REJECT LIMIT UNLIMITED PARALLEL;

A portion of the data file, product.dat, is shown in the following example. Each field is separated by a comma and optionally enclosed in a single quote. As the data is read, each field in the input file is mapped to the corresponding column in the external table definition and converted from the data type of the input file to the data type of the column in the database, as necessary.

'SP1000', 'Digital Camera','ELEC', 45.67, 67.23, 15.00, 4.50, 'Ricoh','Ricoh'
'SP1001', 'APS Camera','ELEC', 24.67, 36.23,5.00, 4.50, 'Ricoh','Ricoh'
'SP1010', 'Camera','ELEC', 35.67, 47.89, 5.00,4.50, 'Agfa','Agfa'

After executing the CREATE TABLE command shown previously, the metadata for the NEW_PRODUCTS table is stored in the database. The table is as follows:

SQL> DESCRIBE new_products;

 Name                  Null?    Type
 --------------------- -------- -------------
 PRODUCT_ID                     VARCHAR2(8)
 PRODUCT_NAME                   VARCHAR2(30)
 CATEGORY                       VARCHAR2(4)
 COST_PRICE                     NUMBER(6,2)
 SELL_PRICE                     NUMBER(6,2)
 WEIGHT                         NUMBER(4,2)
 SHIPPING_CHARGE                NUMBER(5,2)
 MANUFACTURER                   VARCHAR2(20)
 SUPPLIER                       VARCHAR2(10)

The USER_EXTERNAL_TABLES dictionary view shows which external tables have been created, along with a description of them. The USER_EXTERNAL_LOCATIONS dictionary view shows the location of the datafile.

SQL> SELECT * FROM USER_EXTERNAL_TABLES;

TABLE_NAME      TYPE_OWNER TYPE_NAME
--------------- ---------- ------------------------------
DEFAULT_DIRECTORY_OWNER DEFAULT_DIRECTORY_NAME
----------------------- ------------------------------
REJECT_LIMIT                             ACCESS_TYPE
---------------------------------------- -----------
ACCESS_PARAMETERS
-----------------------------------------------------------
PROPERTY
----------
NEW_PRODUCTS    SYS        ORACLE_LOADER
SYS                     DATA_FILE_DIR
UNLIMITED                               CLOB
RECORDS DELIMITED BY NEWLINE
       CHARACTERSET US7ASCII
       BADFILE log_file_dir:'product.bad'
       LOGFILE log_file_dir:'product.log'
       FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY "'"
ALL

SQL> SELECT * FROM USER_EXTERNAL_LOCATIONS;

TABLE_NAME    LOCATION DIR DIRECTORY_NAME
------------- ------------ --- ------------
NEW_PRODUCTS  product.dat  SYS DATA_FILE_DIR

Accessing Data Stored in an External Table

After defining the external table, it can be accessed using SQL, just as if it were any other table in the database, although the data is actually being read from the file outside the database. A portion of the output is as follows:

SQL> SELECT * FROM NEW_PRODUCTS;

PRODUCT PRODUCT_NAME    CATE  COST  SELL  WEIGHT SHIPPING MANUF SUPPLIER
ID                           PRICE PRICE           CHARGE
------- --------------- ---- ----- ----- ------- -------- ----- --------
SP1000  Digital  Camera ELEC 45.67 67.23      15      4.5 Ricoh Ricoh
SP1001  APS  Camera     ELEC 24.67 36.23       5      4.5 Ricoh Ricoh
SP1010  Camera          ELEC 35.67 47.89       5      4.5 Agfa Agfa

Loading Data From an External Table

The next example will use an INSERT SELECT statement to load the data into the PRODUCT dimension table. It is very easy to perform transformations during the load using SQL functions and arithmetic operators. In this example, after the datafile was created, the cost of fuel rose, and our shipping company increased the shipping rates by 10 percent. Data is therefore loaded into the EASYDW.PRODUCT table by selecting the columns from the NEW_PRODUCTS external table and multiplying the shipping charge by 1.1 for each item as the data is loaded.

SQL> INSERT INTO easydw.product
       (product_id, product_name, category,
        cost_price, sell_price, weight,
        shipping_charge, manufacturer, supplier)
     SELECT product_id, product_name, category,
            cost_price, sell_price, weight,
            (shipping_charge * 1.10),
            manufacturer, supplier
     FROM new_products;

The data has now been loaded into the EASYDW.PRODUCT table, and we can see here that the shipping charge increased from $4.50 to $4.95 for the items displayed.

SQL> select * from easydw.product;

PRODUCT PRODUCT_NAME    CATE  COST  SELL  WEIGHT SHIPPING MANUF SUPPLIER
ID                           PRICE PRICE           CHARGE
------- --------------- ---- ----- ----- ------- -------- ----- --------
SP1000  Digital  Camera ELEC 45.67 67.23      15     4.95 Ricoh    Ricoh
SP1001  APS  Camera     ELEC 24.67 36.23       5     4.95 Ricoh    Ricoh
SP1010  Camera          ELEC 35.67 47.89       5     4.95  Agfa     Agfa

Loading Data in Parallel Using External Tables

Another advantage of using external tables is the ability to load the data in parallel without having to split a large file into smaller files and start multiple sessions, as you must do with SQL*Loader. The degree of parallelism is set using the standard parallel hints or with the PARALLEL clause when creating the external table, as shown previously. The output of an EXPLAIN PLAN shows the parallel access, as follows:

SQL> EXPLAIN PLAN FOR
     INSERT INTO easydw.product
         (product_id, product_name, category,
         cost_price, sell_price, weight,
         shipping_charge, manufacturer, supplier)
     SELECT product_id, product_name, category,
            cost_price, sell_price, weight,
            (shipping_charge * 1.10),
            manufacturer, supplier
     FROM new_products;

The utlxplp.sql script is used to show the EXPLAIN PLAN output with the columns pertaining to parallel execution, which have been edited and highlighted for readability. Chapter 10 provides more of an explanation on how to read parallel execution plans.

SQL> @c:\oracle\product\10.1.0\db_1\rdbms\admin\utlxplp.sql

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------------------
| Id  | Operation                     | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT              |              |  8168 |   781K|    13   (0)| 00:00:01 |
|   1 |  PX COORDINATOR               |              |       |       |            |          |
|   2 |   PX SEND QC (RANDOM)         | :TQ10000     |  8168 |   781K|    13   (0)| 00:00:01 |
|   3 |    PX BLOCK ITERATOR          |              |  8168 |   781K|    13   (0)| 00:00:01 |
|   4 |     EXTERNAL TABLE ACCESS FULL| NEW_PRODUCTS |  8168 |   781K|    13   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
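
If the external table was not created with a PARALLEL clause, the degree of parallelism can also be requested explicitly with hints. The following is a minimal sketch only; the degree of 4 is an arbitrary value chosen for illustration, and parallel DML must be enabled in the session for the insert itself to run in parallel:

SQL> ALTER SESSION ENABLE PARALLEL DML;

SQL> -- degree of 4 is illustrative; choose a value suited to your system
SQL> INSERT /*+ APPEND PARALLEL(product, 4) */ INTO easydw.product
       (product_id, product_name, category,
        cost_price, sell_price, weight,
        shipping_charge, manufacturer, supplier)
     SELECT /*+ PARALLEL(new_products, 4) */
            product_id, product_name, category,
            cost_price, sell_price, weight,
            (shipping_charge * 1.10),
            manufacturer, supplier
     FROM new_products;

SQL> COMMIT;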

Using Data Pump External Tables to Move and Load Data

Data Pump external tables are another fast method for moving data between databases. With the ORACLE_DATAPUMP access driver we can actually write to the external file as the external table is created, which is not possible with the ORACLE_LOADER access driver. We will demonstrate this with an example.

First, let’s create our external table based on the current PURCHASES table. To do this we will use a database directory object called xt_dir, where our external data file, called purch_xt.dmp, will reside.

CREATE TABLE purchases_xt ORGANIZATION EXTERNAL
( TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY xt_dir
  LOCATION ('purch_xt.dmp')
)
AS SELECT * FROM purchases;

This statement creates the Data Pump dump file, purch_xt.dmp, as the external table PURCHASES_XT is created. Once it has been created, you can select from this table just like any other; however, because it is an external table, no further changes to its records are possible after the initial CREATE statement. Any subsequent DELETE, INSERT, or UPDATE operation will result in an Oracle error, as illustrated here.

SQL> desc purchases_xt
 Name                       Null?    Type
 -------------------------- -------- -----------------------
 PRODUCT_ID                 NOT NULL VARCHAR2(8)
 TIME_KEY                   NOT NULL DATE
 CUSTOMER_ID                NOT NULL VARCHAR2(10)
 SHIP_DATE                           DATE
 PURCHASE_PRICE                      NUMBER(6,2)
 SHIPPING_CHARGE                     NUMBER(5,2)
 TODAY_SPECIAL_OFFER                 VARCHAR2(1)

SQL> SELECT count(*) FROM purchases_xt;

  COUNT(*)
----------
     94619

SQL> SELECT * FROM purchases_xt WHERE rownum < 3;


PRODUCT  TIME_KEY  CUSTOMER_I SHIP_DATE PURCHASE SHIPPING T
ID                                         PRICE   CHARGE
-------- --------- ---------- --------- -------- -------- -
SP1001   01-JAN-03 AB123899   01-JAN-03    36.23      4.5 N
SP1011   01-JAN-03 AB123897   01-JAN-03    47.89      4.5 N

SQL> UPDATE purchases_xt SET purchase_price=37 WHERE
product_id='SP1001';
UPDATE purchases_xt SET purchase_price=37 WHERE
product_id='SP1001'
       *
ERROR at line 1:
ORA-30657: operation not supported on external organized table

We can now move this Data Pump formatted dump file from our specified directory to another server and access it from another database to use it as the basis of a new external table. This time, however, to create our external table we will need to specify the columns explicitly, as shown here.

CREATE TABLE purchases_xt2
 (product_id          VARCHAR2(8) ,
  time_key            DATE ,
  customer_id         VARCHAR2(10) ,
  ship_date           DATE,
  purchase_price      NUMBER(6,2),
  shipping_charge     NUMBER(5,2),
  today_special_offer VARCHAR2(1)
 )
 ORGANIZATION EXTERNAL
 (
  TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY xt_dir
  LOCATION ('purch_xt.dmp')
 );

Table created.

SQL> SELECT count(*) FROM purchases_xt2;

  COUNT(*)
----------
     94619

By using a Data Pump external table, we have reaped the benefit of the performance improvements from Data Pump for both the writing of the dump file and also the reading of it on the target database. In addition, the syntax of the commands for creating this version of an external table is very succinct and readable.

Because the creation of the initial external table and the dump file can use a CREATE TABLE AS SELECT operation, we also have the capability to perform filtering with a WHERE clause and to use joins when the external table is created. This is a powerful mechanism for controlling exactly which data is transferred to the external table. For example, when transferring data from a refreshed warehouse schema into a data mart that requires only a smaller subset of the fact data, the data can be filtered with a WHERE clause as the external table is created.
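
For instance, here is a hedged sketch of such a filtered unload. It assumes a hypothetical data mart that only needs the fact rows for the electronics category and reuses the xt_dir directory object; the table and file names are illustrative only.

-- Hypothetical names: elec_purchases_xt and elec_purch_xt.dmp
CREATE TABLE elec_purchases_xt ORGANIZATION EXTERNAL
( TYPE ORACLE_DATAPUMP
  DEFAULT DIRECTORY xt_dir
  LOCATION ('elec_purch_xt.dmp')
)
AS SELECT pu.*
   FROM purchases pu, product pr
   WHERE pu.product_id = pr.product_id
   AND   pr.category   = 'ELEC';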

Loading the Warehouse Using Transportable Tablespaces

The fastest way to move data from one Oracle database to another is using transportable tablespaces. Transportable tablespaces provide a mechanism to move one or more tablespaces from one Oracle database into another. Rather than processing the data a row at time, the entire file or set of files is physically copied from one database and integrated into the second database by importing the metadata describing the tables from the files themselves. In addition to data in tables, the indexes can also be moved.

Because the data is not unloaded and reloaded, just detached, it can be moved quickly. In Oracle 9i, you could only transport tablespaces between Oracle databases that were on the same operating system. In Oracle Database 10g, this limitation is removed, and transportable tablespaces can be moved between different operating system platforms—for example, from Windows to Linux.

Transportable tablespaces can be used to move data from the operational database to the staging area if you are using an Oracle database to do your staging. Transportable tablespaces are also useful to move data from the data warehouse to a dependent data mart.

Figure 5.15 shows the steps involved in using transportable tablespaces.

These steps are as follows:

  1. Create a new tablespace.

  2. Move the data you want to transfer into its own tablespace.

  3. Alter the tablespace to read-only.

  4. Use the export utility to unload the metadata describing the objects in the tablespace.

  5. If moving between different operating systems, then convert the files using RMAN (this conversion step can alternatively be performed on the target platform after step 6).

  6. Copy the datafiles and export dump file containing the metadata to the target system.

  7. Use the import utility to load the metadata descriptions into the target database.

  8. Alter the tablespace to read/write.

  9. Perform transformations.

  10. Move the data from the staging area to the warehouse fact table.

Transportable Tablespaces in Oracle Database 10g

Figure 5.15. Transportable Tablespaces in Oracle Database 10g

This section will discuss steps 1–8. In the following sections, we’ll look at steps 9 and 10.

One of the sources for data in the EASYDW Warehouse is an Oracle Database 10g order-entry system. In this example, the order-entry system has a record for every order stored in the ORDERS table. At the end of April, all the orders for April 2004 will be copied into a new table, called APR_ORDERS, in the ORDERS tablespace stored in the datafile “orders.f.”

Here are the steps that are required to transport a tablespace from one database to another. In our example, our source OLTP system is on Linux and our target warehouse system is on a Windows platform.

Step 1: Create a Tablespace in the OLTP System

Choose a name that is unique on both the source and target system. In this example, the tablespace ORDERS is created, and its corresponding datafile is “orders.f.”

SQL> CREATE TABLESPACE orders
     DATAFILE 'C:\ORACLE\PRODUCT\10.1.0\ORADATA\EASYDW\orders.f'
     SIZE 5M REUSE
     AUTOEXTEND ON
     DEFAULT STORAGE
     (INITIAL 64K PCTINCREASE 0 MAXEXTENTS UNLIMITED);

Step 2: Move the Data for April 2004 into a Table in the Newly Created Tablespace

In this example, a table is created in the ORDERS tablespace and populated using the CREATE TABLE AS SELECT statement. There are 3,720 orders for April.

SQL> CREATE TABLE oltp.apr_orders TABLESPACE orders
     AS
     SELECT * FROM purchases
     WHERE ship_date BETWEEN to_date('01-APR-2004', 'dd-mon-yyyy')
                         AND to_date('30-APR-2004', 'dd-mon-yyyy');

SQL> SELECT COUNT(*) FROM apr_orders;

  COUNT(*)
----------
      3720

Each tablespace must be self-contained and cannot reference anything outside the tablespace. If there were a global index on the April ORDERS table, it would not be self-contained, and the index would have to be dropped before the tablespace could be moved.
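
One way to check this in advance is the DBMS_TTS package; a minimal sketch follows. The TRUE argument asks for constraints to be considered, and any problems are reported in the TRANSPORT_SET_VIOLATIONS view.

SQL> EXECUTE DBMS_TTS.TRANSPORT_SET_CHECK('orders', TRUE);

SQL> SELECT * FROM transport_set_violations;

If this query returns no rows, the tablespace set is self-contained.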

Step 3: Alter the Tablespace So That It Is Read-Only

If you do not do this, the export in the next step will fail.

SQL> ALTER TABLESPACE orders READ ONLY;

Step 4: EXPORT the Metadata

By using the Data Pump expdp command, the metadata definitions for the orders tablespace are extracted and stored in the export dump file “expdat.dmp.”

expdp system/manager TRANSPORT_TABLESPACES=orders
DIRECTORY=data_file_dir
DUMPFILE=expdat.dmp

When transporting a set of tablespaces, you can choose to include referential integrity constraints. However, if you do include these, you must move the tables with both the primary and foreign keys.

Hint

When exporting and importing tablespaces, be sure to use an account that has been granted the EXP_FULL_DATABASE role.

The following code is a copy of the run-time information generated by the export. Only the tablespace metadata is exported, not the data.

$ expdp system/magic9 transport_tablespaces=orders
directory=data_file_dir dumpfile=expdat.dmp

Export: Release 10.1.0.2.0 - Production on Saturday, 10 July, 2004 8:01

Copyright (c) 2003, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options
Starting "SYSTEM"."SYS_EXPORT_TRANSPORTABLE_01": system/********
transport_tablespaces=orders directory=data_file_dir dumpfile=expdat.dmp
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
Processing object type TRANSPORTABLE_EXPORT/TABLE
Processing object type TRANSPORTABLE_EXPORT/TABLE_STATISTICS
Processing object type TRANSPORTABLE_EXPORT/TTE_POSTINST/PLUGTS_BLK
Master table "SYSTEM"."SYS_EXPORT_TRANSPORTABLE_01" successfully
loaded/
unloaded****************************************************************************
**
Dump file set for SYSTEM.SYS_EXPORT_TRANSPORTABLE_01 is:
  /home/oracle/tt/expdat.dmp
Job "SYSTEM"."SYS_EXPORT_TRANSPORTABLE_01" successfully completed at 08:04

Step 5: Convert the Datafiles (Optional)

If you are transporting between different platforms and the endian formats of your platforms are different, you will need to convert the data files. The endian nature of your platform is the format in which multibyte data is stored. Big endian platforms store the most significant byte in the lowest address and little endian platforms store it in the highest address.

Oracle provides views and a simple query to determine the endian format of your source and target platforms.

SELECT d.PLATFORM_NAME, tp.ENDIAN_FORMAT
FROM   V$TRANSPORTABLE_PLATFORM tp,
       V$DATABASE d
WHERE  tp.PLATFORM_NAME = d.PLATFORM_NAME;

When the query is run on the source Linux platform database, we see that the platform uses little endian format.

PLATFORM_NAME                            ENDIAN_FORMAT
---------------------------------------- --------------
Linux IA (32-bit)                        Little

When the query is run on the target warehouse database, we see that it is also little endian.

PLATFORM_NAME                            ENDIAN_FORMAT
---------------------------------------- --------------
Microsoft Windows IA (32-bit)            Little
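
The full list of platform and endian combinations supported for cross-platform transport can be seen by querying V$TRANSPORTABLE_PLATFORM on its own, for example:

SQL> SELECT platform_id, platform_name, endian_format
     FROM   v$transportable_platform
     ORDER BY platform_name;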

If the endian format were different on the two platforms, then conversion would be required. In our example, even though both of our platforms are of the same endian format, we will still demonstrate the step for the conversion, which uses the Recovery Manager utility RMAN.

$ rman target /

Recovery Manager: Release 10.1.0.2.0 - Production
 Copyright (c) 1995, 2004, Oracle. All rights reserved.
 connected to target database: ORCL (DBID=1058909169)

RMAN> convert tablespace orders
2> to platform 'Microsoft Windows IA (32-bit)'
3> FORMAT '/home/oracle/tt/%N_%f' ;
Starting backup at 10-JUL-04
using target database controlfile instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=244 devtype=DISK
channel ORA_DISK_1: starting datafile conversion
input datafile fno=00005 name=/u01/app/oracle/oradata/orcl/
orders.f
converted datafile=/home/oracle/tt/ORDERS_5
channel ORA_DISK_1: datafile conversion complete, elapsed time:
00:00:01
Finished backup at 10-JUL-04

The conversion format mask '%N_%f' that was used will create a file name in the specified directory using the original tablespace name (%N) and the file ID number in the database (%f).

If the conversion is performed on the source system, RMAN can reference the tablespaces by logging on to the database. If, however, the conversion is performed after the files have been transported to the warehouse server, RMAN can no longer access the source database to resolve the tablespace names, and a slightly different form of the command must be used, referencing the datafile names, as follows:

RMAN> CONVERT DATAFILE
2> '/u01/app/oracle/oradata/easydw/orders.f'
3> TO PLATFORM='Microsoft Windows IA (32-bit)'
4> FROM PLATFORM='Linux IA (32-bit)';

Step 6: Transport the Tablespace

Now copy the data file, ORDERS_5, and the export dump file, expdat.dmp, to the physical location on the system containing the staging database. You can use any facility for copying flat files, such as an operating system copy utility or FTP. These should be copied in binary mode, since they are not ASCII files.

In our example, ORDERS_5 was copied via FTP to c:\oracle\product\10.1.0\oradata\easydw\ORDERS_5 on the server hosting the EASYDW warehouse database.

Step 7: Import the Metadata

By importing the metadata, you are plugging the tablespace into the target database, which is why you should take care to place the datafile in the correct directory for your database. Note that we have also specified that the tablespace contents be remapped from the OLTP schema used on the source database to the EASYDW schema in our warehouse database.

impdp easydw/easydw@easydw
TRANSPORT_DATAFILES=c:\oracle\product\10.1.0\oradata\easydw\ORDERS_5
DIRECTORY=tt2 DUMPFILE=expdat.dmp
LOGFILE=log_file_dir:imporders2004.log REMAP_SCHEMA=(oltp:easydw)

Check the run-time information or import log to ensure that no errors have occurred. Note that only the transportable tablespace metadata was imported.

Import: Release 10.1.0.2.0 - Production on Saturday, 10 July, 2004 8:45

Copyright (c) 2003, Oracle. All rights reserved.
;;;
Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 - Production
With the Partitioning, OLAP and Data Mining options
Master table "EASYDW"."SYS_IMPORT_TRANSPORTABLE_01" successfully loaded/unloaded
Starting "EASYDW"."SYS_IMPORT_TRANSPORTABLE_01": easydw/********@easydw
TRANSPORT_DATAFILES=c:\oracle\product\10.1.0\oradata\easydw\ORDERS_5 DIRECTORY=tt2
DUMPFILE=expdat.dmp logfile=log_file_dir:imporders2004.log
REMAP_SCHEMA=(oltp:easydw)
Processing object type TRANSPORTABLE_EXPORT/PLUGTS_BLK
Processing object type TRANSPORTABLE_EXPORT/TABLE
Processing object type TRANSPORTABLE_EXPORT/TABLE_STATISTICS
Processing object type TRANSPORTABLE_EXPORT/TTE_POSTINST/PLUGTS_BLK
Job "EASYDW"."SYS_IMPORT_TRANSPORTABLE_01" successfully completed at 08:45

Check the count to ensure that the totals match the OLTP system.

SQL> SELECT COUNT(*) FROM apr_orders;

  COUNT(*)
----------
      3720

Step 8: Alter the Tablespace to Read/Write

Alter the ORDERS tablespace so that it is in read/write mode. You are ready to perform your transformations!

SQL> ALTER TABLESPACE orders READ WRITE;

As you can see by following these steps, the individual rows of a table are never unloaded and reloaded into the database. Thus, using transportable tablespaces is the fastest way to move data between two Oracle databases.

Loading the Dimensions Using SQL MERGE

When loading new data into existing dimension tables, you may need to add new rows and make changes to existing rows. In the past, special programming logic was required to differentiate a new row from a changed row, which typically involved separate INSERT and UPDATE statements. Oracle 9i added a capability that makes this process much easier: the SQL MERGE. If a row exists, it is updated; if it doesn't exist, it is inserted. This is often called an upsert operation.

In the following example, new data is added to the customer dimension. The input file has both new customers who have been added and changes to existing customer data. In this case, there is no need to retain the old customer information, so it will be updated. An external table, named CUSTOMER_CHANGES, is created for the file containing the updates to the customer dimension.

CREATE TABLE easydw.customer_changes
(customer_id            VARCHAR2(10),
 gender                 VARCHAR2(1),
 tax_rate               NUMBER,
 city                   VARCHAR2(15),
 state                  VARCHAR2(10),
 region                 VARCHAR2(15),
 postal_code            VARCHAR2(10),
 country                VARCHAR2(20),
 occupation             VARCHAR2(15))
ORGANIZATION EXTERNAL
(TYPE ORACLE_LOADER
      DEFAULT DIRECTORY data_file_dir
      ACCESS PARAMETERS
        ( records delimited by newline
          characterset us7ascii
          badfile log_file_dir:'cus_changes.bad'
          logfile log_file_dir:'cust_changes.log'
          fields terminated by ','
          optionally enclosed by "'")
      LOCATION ('customer_changes.dat')
)
REJECT LIMIT UNLIMITED NOPARALLEL;

Instead of an INSERT statement, the MERGE statement is used to load the data. The MERGE statement has two parts. When the customer_id in the CUSTOMER table matches the customer_id in the CUSTOMER_CHANGES table, the row is updated. When the customer_ids do not match, a new row is inserted.

MERGE INTO easydw.customer c
USING easydw.customer_changes cc
ON (c.customer_id = cc.customer_id)
WHEN MATCHED THEN UPDATE SET
      c.city=cc.city, c.state=cc.state,
      c.postal_code=cc.postal_code,
      c.gender=cc.gender, c.country=cc.country,
      c.region=cc.region, c.tax_rate=cc.tax_rate,
      c.occupation=cc.occupation
WHEN NOT MATCHED THEN INSERT
      (customer_id, city, state, postal_code,
       gender, region, country, tax_rate,
       occupation)
    VALUES
      (cc.customer_id, cc.city, cc.state, cc.postal_code,
       cc.gender, cc.region, cc.country, cc.tax_rate,
       cc.occupation) ;

Before merging the customer changes, we had 500 customers.

SQL> SELECT COUNT(*) FROM customer;

  COUNT(*)
----------
       500

Once the external table is created, we can use SQL to look at the customer_changes data file. The first two rows will be updated: the first customer was previously an astronomer and is now returning to work as an astrophysicist, and the second customer moved from postal code W1 1QC to W1-2BA. The last two rows are new customers who will be inserted.

SQL> SELECT * FROM customer_changes;

CUSTOMER_I G TAX_RATE CITY      STATE      REGION   POSTAL_COD COUNTRY OCCUPATION
---------- - -------- --------- ---------- -------- ---------- ------- -----------
AB123459   M        5 Phoenix   AZ         AmerWest 85001      USA     Astro-Physicist
AB123460   F       15 London    London     EuroWest W1-2BA     UK      Engineer
AA114778   M       40 Reading   Berkshire  EuroWest RG1 1BB    UK      Astronomer
AA123478   F       25 Camberley Surrey     Eurowest GU14 2DR   UK      DB Consultant

Looking at the following portion of the customer dimension, we can see the rows where the CUSTOMER_ID column matches the CUSTOMER_ID column in the CUSTOMER_CHANGES external table. These rows will be updated.

SQL> SELECT * FROM customer;

CUSTOMER_I CITY    STATE  POSTAL_COD G REGION    COUNTRY TAX_RATE OCCUPATION
---------- ------- ------ ---------- - --------- ------- -------- -----------
AB123459   Phoenix AZ     85001      M AmerWest  USA            5 Astronomer
AB123460   London  London W1 1QC     F EuroWest  UK            15 Engineer

After the MERGE, we now have 502 customers. A portion of the output, just the rows that have changed from the customer table, is displayed here.

SQL> SELECT COUNT(*) FROM customer;

  COUNT(*)
----------
       502

SQL> SELECT * FROM customer;

CUSTOMER_I CITY      STATE     POSTAL_COD G REGION    COUNTRY TAX_RATE OCCUPATION
---------- --------- --------- ---------- - --------- ------- -------- -----------
AB123459   Phoenix   AZ        85001      M AmerWest  USA            5 Astro-Physicist
AB123460   London    London    W1-2BA     F EuroWest  UK            15 Engineer
AA114778   Reading   Berkshire RG1-1BB    M EuroWest  UK            40 Astronomer
AA123478   Camberley Surrey    GU14 2DR   F Eurowest  UK            25 DB Consultant

In this example, the MERGE statement was used with external tables, but it can also be used with any user tables. By using MERGE we avoid the multiple passes over the source data required by separate INSERT and UPDATE statements; therefore, there is a significant potential saving in I/O and processor resources.
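
For comparison, here is a sketch of the two-statement approach that MERGE replaces. This is not how the EASYDW load is performed; it is included only to illustrate the extra pass over the data.

-- Pass 1: update customers that already exist
UPDATE easydw.customer c
   SET (city, state, postal_code, gender, country,
        region, tax_rate, occupation) =
       (SELECT cc.city, cc.state, cc.postal_code, cc.gender,
               cc.country, cc.region, cc.tax_rate, cc.occupation
        FROM   easydw.customer_changes cc
        WHERE  cc.customer_id = c.customer_id)
 WHERE EXISTS (SELECT 1 FROM easydw.customer_changes cc
               WHERE  cc.customer_id = c.customer_id);

-- Pass 2: insert customers that are not yet in the dimension
INSERT INTO easydw.customer
   (customer_id, city, state, postal_code, gender,
    region, country, tax_rate, occupation)
SELECT cc.customer_id, cc.city, cc.state, cc.postal_code, cc.gender,
       cc.region, cc.country, cc.tax_rate, cc.occupation
FROM   easydw.customer_changes cc
WHERE  NOT EXISTS (SELECT 1 FROM easydw.customer c
                   WHERE  c.customer_id = cc.customer_id);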

We can also use the MERGE statement and omit either the INSERT or the UPDATE clauses. For example, if, in our warehouse, we are not allowed to change data that is historical and can only insert new data, we can then omit the UPDATE part of the MERGE statement, as follows:

MERGE INTO easydw.customer c
USING easydw.customer_changes cc
ON (c.customer_id = cc.customer_id)
WHEN NOT MATCHED THEN INSERT
      (customer_id, city, state, postal_code,
       gender, region, country, tax_rate, occupation)
    VALUES
      (cc.customer_id, cc.city, cc.state, cc.postal_code,
       cc.gender, cc.region, cc.country,
       cc.tax_rate, cc.occupation) ;

In addition, MERGE also provides the capability to conditionally execute the INSERT or UPDATE clauses on a record-by-record basis. For example, you’ll notice that one of our records to be updated, as well as a new record to be inserted, has the postal code in an illegal format using a hyphen “-” separator. We can add a conditional clause to both the update and the insert parts of the merge, so that records with a hyphen are not merged, as follows:

MERGE INTO easydw.customer c
USING easydw.customer_changes cc
ON (c.customer_id = cc.customer_id)
WHEN MATCHED THEN UPDATE SET
      c.city=cc.city, c.state=cc.state,
      c.postal_code=cc.postal_code,
      c.gender=cc.gender, c.country=cc.country,
      c.region=cc.region, c.tax_rate=cc.tax_rate,
      c.occupation=cc.occupation
  WHERE cc.postal_code not like '%-%'
WHEN NOT MATCHED THEN INSERT
      (customer_id, city, state, postal_code,
       gender, region, country, tax_rate, occupation)
    VALUES
      (cc.customer_id, cc.city, cc.state, cc.postal_code,
       cc.gender, cc.region, cc.country,
       cc.tax_rate, cc.occupation)
  WHERE cc.postal_code not like '%-%' ;

Now we have only one record inserted (AA123478), which had a valid postal code in the change record, but our existing record (AB123460) was not updated with its change; it had a bad postal code and remains W1 1QC.

CUSTOMER_I CITY       STATE      POSTAL_COD G REGION    COUNTRY TAX_RATE OCCUPATION
---------- ---------- ---------- ---------- - --------- ------- -------- -----------
AA123478   Camberley  Surrey     GU14 2DR   F Eurowest  UK            25 DB Consultant
AB123459   Phoenix    AZ         85001      M AmerWest  USA            5 Astro-Physicist
AB123460   London     London     W1 1QC     F EuroWest  UK            15 Engineer

Finally, under certain circumstances, we can also use the MERGE statement to delete rows in our target table. When the update fires (because the match clause succeeds), we can add an additional clause that deletes the matched rows when a further delete criterion is also met. For example, let's assume that our legacy system for the customer data cannot add a new field to the table to indicate that a record has been deleted, and instead the delete flag is encoded in the tax rate: if the tax rate is -1, then the record has been deleted on the source system. Our new MERGE statement now looks like this:

MERGE INTO easydw.customer c
USING easydw.customer_changes cc
ON (c.customer_id = cc.customer_id)
WHEN MATCHED THEN UPDATE SET c.city=cc.city, c.state=cc.state,
      c.postal_code=cc.postal_code,
      c.gender=cc.gender, c.country=cc.country,
      c.region=cc.region, c.tax_rate=cc.tax_rate,
      c.occupation=cc.occupation
  DELETE WHERE (cc.tax_rate=-1)
WHEN NOT MATCHED THEN INSERT
      (customer_id, city, state, postal_code,
       gender, region, country, tax_rate, occupation)
    VALUES
      (cc.customer_id, cc.city, cc.state, cc.postal_code,
       cc.gender, cc.region, cc.country,
       cc.tax_rate, cc.occupation) ;

If the update match succeeds, then the delete criterion is tested; if this also succeeds, then the customer record is deleted. Note, however, that this is a delete under special circumstances: only the rows that are updated are candidates for the delete.

The operation that MERGE performs is one of the most common in warehouse ETL: insert if it does not exist; otherwise, update. Therefore, instead of having to do this using a number of passes of your data, MERGE enables you to perform this operation with one call to the database and one pass over your data. A very useful feature in the warehouse!

Transformations inside the Oracle Database

Section 5.3 introduced transformations and discussed choosing the optimal place to perform the transformations. We’ve seen examples of transforming the data while loading it using both SQL*Loader and external tables. This section will discuss performing transformations inside the Oracle server. If you are doing transformations in the Oracle data warehouse, you typically load data into temporary staging tables, transform the data, then move it to the warehouse detail fact tables.

When using transportable tablespaces, as shown in the previous section, the data was moved from the OLTP system to the staging area in the Oracle warehouse. Data that was loaded using external tables or SQL*Loader can be further transformed in a staging area inside the Oracle server. Oracle Database 10g provides tools that can be used to implement transformations in SQL, PL/SQL, or Java stored procedures.

In this section, we will first look at the SQL to perform the following simple transformations.

  • Remove the hyphen from the product id and increase the shipping charges by 10 percent.

  • Check for invalid product_ids.

  • Look up the warehouse key, and substitute it for the PRODUCT_ID.

We will look at a new SQL feature to use regular expressions for searching and manipulating character data. Then, we will rewrite the first transformation as a table function.

Transformations That Cleanse Data and Derive New Data

The SQL UPDATE statement, together with built-in functions, can be used to perform some simple transformations. Continuing with our example, the APR_ORDERS table must be cleansed. Some of the PRODUCT_IDs contain a hyphen, which needs to be removed, and because fuel costs have increased, we must add 10 percent to the shipping charges. Both of these operations can be done in one step using the UPDATE statement. A few of the rows are shown here before they are transformed.

SQL> SELECT * FROM apr_orders;

PRODUCT TIME_KEY  CUSTOMER SHIP_DATE PURCHASE SHIPPING TODAY_SPECIAL
     ID                 ID              PRICE   CHARGE OFFER
------- --------- -------- --------- -------- -------- -------------
SP1001  02-APR-04 AB123457 02-APR-04    28.01     5.45 N
SP-1000 01-APR-04 AB123456 01-APR-04    67.23     5.45 N
SP-1000 01-APR-04 AB123457 01-APR-04    67.23     5.45 N

The SQL multiplication operator, *, will be used to update the shipping charge, and the REPLACE function will be used to replace the hyphen, '-', with an empty string, '', thus removing it. In the output that follows, the shipping charge has increased from 5.45 to 6 and the hyphen has been removed from the PRODUCT_ID.

SQL> UPDATE apr_orders
     SET shipping_charge = (shipping_charge * 1.10),
         product_id = REPLACE(product_id, '-','');

SQL> SELECT * FROM apr_orders;

PRODUCT TIME_KEY  CUSTOMER SHIP_DATE PURCHASE SHIPPING TODAY_SPECIAL
     ID                 ID              PRICE   CHARGE OFFER
------- --------- -------- --------- -------- -------- -------------
SP1001  02-APR-04 AB123457 02-APR-04    28.01        6 N
SP1000  01-APR-04 AB123456 01-APR-04    67.23        6 N
SP1000  01-APR-04 AB123457 01-APR-04    67.23        6 N

Processing With More Power: The REGEXP Functions

In Oracle Database 10g, a powerful set of text search and replace functions is introduced; these use regular expressions and significantly improve our ability to search on, and process, character data.

The new regular expressions extend our ability to define the rules for the types of strings we are searching for and are also used in functions to manipulate character data. This provides us with a new, powerful mechanism for parsing more sophisticated strings in order to validate our warehouse data. It also gives us greater control and flexibility with the operations to transform and reorganize our source data to make it conform to the warehouse rules and representation.

Prior to Oracle Database 10g, the fundamental ability to process textual strings in the database was limited to relatively simple search and replace functionality, such as the LIKE operator and the INSTR, SUBSTR, and REPLACE functions, which are really just based on string matching or simple wildcards. Even with the facility to define new, more powerful functions in PL/SQL and call them from INSERT, UPDATE, and SELECT statements, we still have the burden of creating those functions in the first place. Having a more powerful set of text search and manipulation functions as standard is extremely valuable in the warehouse.

Let's look at an example: the criterion LIKE '%CARD%' searches for any string of characters (represented by the first '%') followed by the string 'CARD', followed by another string of any characters (the second '%'). We can also use the underscore character '_' as a single-character wildcard, but, other than that, this is more or less the limit of our standard search capability.

Regular Expression Basics and Searching

But what is a regular expression? Regular expressions work by giving certain characters a special meaning which, when combined with ordinary characters, lets us define an expression that represents the string we want to search for. These special characters are often called metacharacters. For example, '%' and '_' in the simple LIKE example above are the only two metacharacters we had prior to Oracle Database 10g.

Some of the basic metacharacters in regular expressions are:

  .        A single character match
  ^        The start of a line or string
  $        The end of a line or string
  *        Repeat the pattern 0 or more times
  ?        Repeat the pattern 0 or 1 times
  +        Repeat the pattern 1 or more times
  {m}      Repeat the pattern exactly m times
  {m,}     Repeat the pattern at least m times
  {m,n}    Repeat the pattern at least m times but not more than n times

If you are familiar with UNIX systems, you will recognize and feel comfortable with many of the constructs used in Oracle regular expressions, which are based on the POSIX rules.

Let’s take a simple example based on the following table and rows.

SQL> CREATE TABLE regexp (chardata varchar2(50));

SQL> INSERT INTO regexp VALUES ('A theory concerning');

1 row created.

SQL> INSERT INTO regexp
     VALUES ('the origin of the universe, is the Big ');

1 row created.

SQL> INSERT INTO regexp VALUES ('Bang.');

1 row created.

SQL> COMMIT;
Commit complete.

Now let's see how we define a search condition using the '.' and '*' metacharacters:

SQL> SELECT * FROM regexp
     WHERE regexp_like (chardata, '.*the.*');

CHARDATA
--------------------------------------------------
A theory concerning
the origin of the universe is the

Here, '.' matches any single character and '*' repeats it zero or more times, so '.*' represents any string of characters; we have therefore specified a criterion matching any record where the CHARDATA column contains the string 'the'. The REGEXP_LIKE function can actually take three parameters, as follows:

REGEXP_LIKE(search_string, pattern_string, match_parameter)

In our example, we have already seen how search_string and pattern_string are used, but the match_parameter provides some extra functionality that adds to the power and flexibility of REGEXP_LIKE. This parameter can take a small set of values, which change the matching behavior:

  'i'      case-insensitive matching
  'c'      case-sensitive matching
  'n'      allows the period '.' to match the newline character
  'm'      treats the search_string as containing multiple lines

Without 'm', Oracle treats the search_string as a single line; by specifying 'm', Oracle will interpret it as multiple lines, so that '^' and '$' match the start and end of each line within it.

If we want to find the record where 'the' occurs at the start of the line, then we use the '^' character, which represents the start of the line:

SQL> SELECT * FROM regexp
     WHERE regexp_like (chardata, '^the');

CHARDATA
--------------------------------------------------
the origin of the universe is the

If we combine the '.' metacharacter, which matches any single character, with the '{}' repetition capability, we can easily construct regular expressions that find exactly one character between 'B' and 'g', and then exactly two characters between 'B' and 'g'.

SQL> SELECT * FROM regexp
     WHERE regexp_like(chardata, 'B.{1}g');

CHARDATA
--------------------------------------------------
the origin of the universe, is the Big

SQL> SELECT * FROM regexp
     WHERE regexp_like(chardata, 'B.{2}g');

CHARDATA
--------------------------------------------------
Bang.

As another example of the power of regular expressions, lists of characters can be specified in the pattern string by bracketing the list values with '[' and ']'. For example, to find the records that contain either of the characters 'y' or 'v', the list '[yv]' would be used. The real power of lists comes from the predefined character classes: lists representing all alphanumeric characters, all uppercase characters, or all control characters, for example, can each be referenced in one clear construct.
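
The following quick sketches run against the REGEXP table used above; the exact rows returned will depend on the data inserted so far. The first query finds rows containing either a 'y' or a 'v'; the second uses the predefined [:upper:] character class to find rows containing at least one uppercase letter.

SQL> SELECT * FROM regexp
     WHERE regexp_like(chardata, '[yv]');

SQL> SELECT * FROM regexp
     WHERE regexp_like(chardata, '[[:upper:]]');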

Regular Expressions and Substrings

We have shown the basic forms of regular expressions, and you can see that very powerful searching capability can be defined with their use. There is also a REGEXP equivalent of the existing INSTR function, called REGEXP_INSTR, which tells us at what position our pattern occurs in the search string. For example:

SQL> SELECT regexp_instr(chardata, 'B.{1}g')
     FROM regexp
     WHERE regexp_like(chardata, 'B.{1}g') ;

REGEXP_INSTR(CHARDATA,'B.{1}G')
-------------------------------
                             36

Similarly, there is also a regular expression version of the familiar SUBSTR function, REGEXP_SUBSTR. For example, to extract the string matched by a similar pattern:

SQL> SELECT regexp_substr(chardata, 'B.{2}g')
     FROM regexp
     WHERE regexp_like(chardata, 'B.{2}g') ;

REGEXP_SUBSTR(CHARDATA,'B.{2}G')
--------------------------------
Bang

If we had used '1' instead of '2', then we would have extracted the string 'Big' from our other record.

Regular Expressions to Manipulate Data

Finally, let's take a look at the REGEXP_REPLACE function, which extends the REPLACE functionality. Now that we understand regular expressions a bit better, we will take an example that is closer to a real warehouse parsing and manipulation requirement and add some data to our table: U.S. phone numbers. Some of our phone numbers arrive with the area code in parentheses, which is not the required format for our warehouse. By using REGEXP_REPLACE, we can define a regular expression that matches the parenthesized phone number and removes the parentheses.

This example needs a bit more explanation; it also demonstrates another powerful feature of the regular expression implementation, called back references, where subexpressions in the pattern string can be referenced individually.

In our REGEXP_REPLACE() call, we have actually defined five subexpressions, as follows, to parse the string '(123)-456-7890':

SQL> SELECT regexp_replace('(123)-456-7890',
                           '(\()(.*)(\))-(.*)-(.*)',
                           '\2-\4-\5') AS transformed_string
     FROM dual ;


TRANSFORMED_STRING
------------------
123-456-7890

Each of the five subexpressions is defined as follows:

  (\()     the '(' character, where the backslash removes its special meaning
  (.*)     any string of characters
  (\))     the ')' character, again with its special meaning removed
  (.*)     any string of characters
  (.*)     any string of characters

The hyphens act as themselves and are not part of any subexpression. In the replace string we can therefore refer to the subexpressions, reorder them, or even remove them entirely from the resulting string. The replace string '\2-\4-\5' in our example means take the second, fourth, and fifth subexpressions and separate them with hyphens.

To run our phone number transformation against the data in our table, we will first add a few records containing phone numbers.

SQL> INSERT INTO regexp VALUES ('123-456-7890');

1 row created.

SQL> INSERT INTO regexp VALUES ('(111)-222-3333');

1 row created.

SQL> INSERT INTO regexp VALUES ('333-444-5555');

1 row created.

SQL> COMMIT;

Commit complete.

SQL> SELECT regexp_replace(chardata,
                           '(\()(.*)(\))-(.*)-(.*)',
                           '\2-\4-\5') AS transformed_string
     FROM regexp;

TRANSFORMED_STRING
-------------------------------------------
A theory concerning
the origin of the universe, is the Big
Bang.
123-456-7890
111-222-3333
333-444-5555

Hence, REGEXP_REPLACE has restructured just the strings that match our regular expression, removing the parentheses so that they conform to our warehouse representation, and has left alone all of the other records that don't match. If you consider how you would code that operation using the normal SQL functions, you can see that regular expressions are very powerful. We have only touched the tip of the iceberg with these simple examples, but the REGEXP functions use a succinct and convenient notation to deliver some very useful functionality for our warehouse transformations.

Validating Data Using a Dimension

Often the incoming data must be validated using information that is in the dimension tables. While this is not strictly a transformation, it is discussed here, since it is often done at this stage, prior to loading the warehouse tables. In our example, we want to ensure that all the PRODUCT_IDs for the April orders are valid: the PRODUCT_ID for each order must match a PRODUCT_ID in the product dimension.

This query shows that the April data does have an invalid PRODUCT_ID, one with no matching PRODUCT_ID in the product dimension. Any data that is invalid should be corrected prior to loading it from the staging area into the warehouse.

SQL> SELECT DISTINCT product_id FROM apr_orders
     WHERE product_id NOT IN (SELECT product_id FROM product);

PRODUCT_ID
----------
SP1036

Note that in many production warehouse systems, the designer makes a conscious decision to map unknown natural dimension codes (SP1036 in our previous example) to a dimension record representing "unknown." For example, if the product surrogate key is an incrementing integer sequence starting from 1, then records with unknown codes are mapped to a special record with the ID -1. With this technique, every new source product can always be mapped to a product dimension record and, consequently, the referential integrity constraints can always remain enabled; however, a cleanup step is required when the true dimension record eventually arrives in the warehouse.

Looking up the Warehouse Key

Now that we have cleansed the PRODUCT_ID column, we will modify it to use the warehouse key. For the next example, a PRODUCT_CODE has been added to the product table; this will be used to look up the PRODUCT_ID, which is the surrogate key for the warehouse. Figure 5.4 showed the use of surrogate keys in the warehouse. A portion of the product dimension is displayed.

SQL> SELECT PRODUCT_ID, PRODUCT_CODE FROM PRODUCT;

PRODUCT_ID PRODUCT_CODE
---------- ------------
         1 SP1000
         2 SP1001
         3 SP1010
         4 SP1011
         5 SP1012

In this next transform, we are going to use the PRODUCT_ID in the APR_ORDERS table to look up the warehouse key from the PRODUCT dimension. The PRODUCT_ID column in the APR_ORDERS table will be replaced with the warehouse key.

SQL> SELECT * FROM APR_ORDERS;

PRODUCT TIME_KEY  CUSTOMER SHIP_DATE PURCHASE SHIPPING TODAY_SPECIAL
     ID                 ID              PRICE   CHARGE OFFER
------- --------- -------- --------- -------- -------- -------------
SP1001  01-APR-04 AB123456 01-APR-04    28.01        6 Y
SP1001  01-APR-04 AB123457 01-APR-04    28.01        6 Y
SP1061  01-APR-04 AB123456 01-APR-04    28.01     8.42 Y
SP1062  01-APR-04 AB123457 01-APR-04    28.01     3.58 Y

SQL> UPDATE APR_ORDERS A
     SET A.PRODUCT_ID = (SELECT P.PRODUCT_ID
                         FROM PRODUCT P
                         WHERE A.PRODUCT_ID = P.PRODUCT_CODE);

4 rows updated.

Note that the original PRODUCT_IDs have been replaced with the warehouse key.

SQL> SELECT * FROM APR_ORDERS;

PRODUCT TIME_KEY  CUSTOMER SHIP_DATE PURCHASE SHIPPING TODAY_SPECIAL
     ID                 ID              PRICE   CHARGE OFFER
------- --------- -------- --------- -------- -------- -------------
2       01-APR-04 AB123456 01-APR-04    28.01        6 Y
2       01-APR-04 AB123457 01-APR-04    28.01        6 Y
54      01-APR-04 AB123456 01-APR-04    28.01     8.42 Y
55      01-APR-04 AB123457 01-APR-04    28.01     3.58 Y
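
Tying this back to the earlier note about unknown product codes: a variation of the preceding UPDATE could default any code that has no match in the dimension to a placeholder record. The following is a hedged sketch only; it assumes such an "unknown" product row, with PRODUCT_ID -1, has already been created in the PRODUCT dimension.

SQL> -- assumes a placeholder PRODUCT row with PRODUCT_ID -1 exists
SQL> UPDATE apr_orders a
     SET a.product_id = (SELECT NVL(MIN(p.product_id), '-1')
                         FROM   product p
                         WHERE  p.product_code = a.product_id);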

Table Functions

The results of one transformation are often stored in a database table. This table is then used as input to the next transformation. The process of transforming and storing intermediate results, which are used as input to the next transformation, is repeated for each transformation in the sequence.

The drawback to this technique is performance. The goal is to perform all transformations so that each record is read, transformed, and updated only once. Of course, there are times when this may not be possible, and the data must be passed through multiple times.

A table function is a function whose input is a set of rows and whose output is a set of rows, which could be a table—hence, the name table function. The sets of rows can be processed in parallel, and the results of one function can be pipelined to the next before the transformation has been completed on all the rows in the set, eliminating the need to pass through the data multiple times.

Table functions use Oracle’s object technology and user-defined data types. First, new data types must be defined for the input record and output table. In the following example, the PURCHASES_RECORD data type is defined to describe the records in the PURCHASES table.

SQL> CREATE TYPE purchases_record as OBJECT
(product_id                     VARCHAR2(8),
 time_key                       DATE,
 customer_id                    VARCHAR2(10),
 ship_date                      DATE,
 purchase_price                 NUMBER(6,2),
 shipping_charge                NUMBER(5,2),
 today_special_offer            VARCHAR2(1));

Next, the PURCHASES_TABLE data type is defined. It contains a collection of PURCHASES_RECORDS, which will be returned as output from the function.

SQL> CREATE TYPE purchases_table
     AS TABLE of purchases_record;

Next, define a type for a cursor variable, which will be used to pass a set of rows as input to the table function. Cursor variables are pointers, which hold the address of some item instead of the item itself. In PL/SQL, a pointer is created using the data type of REF. Therefore, a cursor variable has the data type of REF CURSOR. To create cursor variables, you first define a REF CURSOR type.

SQL> CREATE PACKAGE cur_pack
     AS TYPE ref_cur_type IS REF CURSOR;
     END cur_pack;

Here is our table function, named TRANSFORM, which performs the search and replace, removing the hyphen from the PRODUCT_ID and increasing the shipping charge. There are some things that differentiate it from other functions. The function uses PIPELINED in its definition and PIPE ROW in the body. This causes the function to return each row as it is completed, instead of waiting until all rows are processed. The input to the function is a cursor variable, INPUTRECS, which is of type ref_cur_type, defined previously. The output of the function is a table of purchase records of type PURCHASES_TABLE, defined previously. The REF CURSOR, INPUTRECS, is used to fetch the input rows, the transformation is performed, and the results for each row are piped out. The function ends with a RETURN statement, which does not specify any return value.

CREATE OR REPLACE FUNCTION
           transform(inputrecs IN cur_pack.ref_cur_type)
RETURN purchases_table
PIPELINED
IS
  product_id                     VARCHAR2(8);
  time_key                       DATE;
  customer_id                    VARCHAR2(10);
  ship_date                      DATE;
  purchase_price                 NUMBER(6,2);
  shipping_charge                NUMBER(5,2);
  today_special_offer            VARCHAR2(1);
BEGIN
LOOP
   FETCH inputrecs INTO product_id, time_key,customer_id,
     ship_date,purchase_price,shipping_charge,
     today_special_offer;

   EXIT WHEN INPUTRECS%NOTFOUND;

   product_id := REPLACE(product_id, '-','');
   shipping_charge :=(shipping_charge+shipping_charge*.10);
   PIPE ROW (purchases_record( product_id,
             time_key,
             customer_id,
             ship_date,
             purchase_price,
             shipping_charge,
             today_special_offer));
END LOOP;
CLOSE inputrecs;
RETURN;
END;

We’ve rolled back the changes from the previous transforms and are going to do them again using the table function instead. Here is the data prior to being transformed.

SQL> SELECT * FROM apr_orders;

PRODUCT TIME_KEY  CUSTOMER SHIP_DATE PURCHASE SHIPPING TODAY_SPECIAL
     ID                 ID              PRICE   CHARGE OFFER
------- --------- -------- --------- -------- -------- ----------
SP-1001 01-APR-04 AB123456 01-APR-04    28.01     5.45 Y
SP-1001 01-APR-04 AB123457 01-APR-04    28.01     5.45 Y
SP1061  01-APR-04 AB123456 01-APR-04    28.01     8.42 Y
SP1062  01-APR-04 AB123457 01-APR-04    28.01     3.58 Y

To invoke the function, use it as part of a SELECT statement, with the TABLE keyword before the function name in the FROM clause. For example:

SQL> SELECT * FROM
     TABLE(transform(CURSOR(SELECT * FROM apr_orders)));

PRODUCT TIME_KEY  CUSTOMER SHIP_DATE PURCHASE SHIPPING TODAY_SPECIAL
     ID                 ID              PRICE   CHARGE OFFER
------- --------- -------- --------- -------- -------- ----------
SP1001  01-APR-04 AB123456 01-APR-04    28.01        6 Y
SP1001  01-APR-04 AB123457 01-APR-04    28.01        6 Y
SP1061  01-APR-04 AB123456 01-APR-04    28.01     9.26 Y
SP1062  01-APR-04 AB123457 01-APR-04    28.01     3.94 Y

If we needed to save the data, the following example shows creating a table to save the results of the table function.

SQL> CREATE TABLE TEST
     AS SELECT * FROM TABLE(transform (CURSOR(SELECT * FROM
apr_orders)));

Table created.
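
Because a table function can appear wherever a table can, the transformed rows could equally well be inserted straight into the fact table, avoiding the intermediate table altogether. A minimal sketch, assuming the target partition for these orders already exists in EASYDW.PURCHASES:

SQL> INSERT /*+ APPEND */ INTO easydw.purchases
     SELECT * FROM TABLE(transform(CURSOR(SELECT * FROM apr_orders)));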

Transformations That Split One Data Source into Multiple Targets

Sometimes transformations involve splitting a data source into multiple targets, as illustrated in Figure 5.16. The multitable INSERT statement facilitates this type of transformation.

Multitable Insert

Figure 5.16. Multitable Insert

In the following example, the APR_ORDERS will be split into two tables. All orders that took advantage of today’s special offer will be written to the SPECIAL_PURCHASES table. All regular sales will be written to the PURCHASES table. This information could be used to target future advertising. A new table, SPECIAL_PURCHASES, has been created with the same columns as the APR_ORDERS and PURCHASES tables.

The INSERT statement specifies a condition, which is evaluated to determine which table each row should be inserted into. In this example, there is only one WHEN clause, but you can have multiple WHEN clauses if there are multiple conditions to evaluate. If no WHEN clause evaluates to true, the ELSE clause is executed.

By specifying FIRST, Oracle stops evaluating the WHEN clause when the first condition is met. Alternatively, if ALL is specified, all conditions will be checked for each row. ALL is useful when the same row is stored in multiple tables.

SQL> INSERT FIRST WHEN today_special_offer = 'Y'
     THEN INTO special_purchases
     ELSE INTO purchases
     SELECT * FROM apr_orders;

15004 rows created.

The four rows with today_special_offer = 'Y' have been inserted into the SPECIAL_PURCHASES table. The remaining rows have been inserted into the PURCHASES table.

SQL> SELECT COUNT(*) FROM purchases;

  COUNT(*)
----------
     15000

SQL> SELECT COUNT(*) FROM special_purchases;

  COUNT(*)
----------
         4

From the following queries, we can see that the data has been split between the two tables. Purchases made with the value of N in TODAY_SPECIAL_OFFER are stored in the PURCHASES table. Those with the value of Y are stored in the SPECIAL_PURCHASES table.

SQL> SELECT DISTINCT(today_special_offer) FROM purchases;

TODAY_SPECIAL_OFFER
-------------------
N

SQL> SELECT DISTINCT(today_special_offer) FROM special_purchases;

TODAY_SPECIAL_OFFER
-------------------
Y
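
If a row needed to be captured in more than one target table, ALL would be used instead of FIRST. The following is a sketch only; it assumes a hypothetical HIGH_VALUE_PURCHASES table with the same columns as PURCHASES.

SQL> -- HIGH_VALUE_PURCHASES is a hypothetical table used for illustration
SQL> INSERT ALL
       WHEN today_special_offer = 'Y' THEN INTO special_purchases
       WHEN purchase_price > 50 THEN INTO high_value_purchases
       ELSE INTO purchases
     SELECT * FROM apr_orders;

Here a row that is both a special offer and a high-value purchase is written to both tables, while a row matching neither condition goes to PURCHASES.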

Moving Data from a Staging Table into the Fact Table

Once the data has been transformed, it is ready to be moved to the warehouse tables. For example, if the orders data for January 2005 has been moved into temporary staging tables in the warehouse, then, once the data has been cleansed and transformed, it is ready to be moved into the purchases fact table, as shown in Figure 5.17.

Moving Data from a Staging Table into the Fact Table

Figure 5.17. Moving Data from a Staging Table into the Fact Table

When the fact table is partitioned, new tablespaces are created for the datafile and indexes, a new partition is added to the fact table, and the data is moved into the new partition. The steps are illustrated using the EASYDW example.

Step 1: Create a New Tablespace for the Jan Purchases and the Jan Purchases Index

Since the purchases table is partitioned by month, a new tablespace will be created to store the January purchases. Another tablespace is created for the indexes.

CREATE TABLESPACE purchases_jan2005
DATAFILE
'C:\ORACLE\PRODUCT\10.1.0\ORADATA\EASYDW\purchasesjan2005.f'
SIZE 5M REUSE AUTOEXTEND ON
DEFAULT STORAGE (INITIAL 64K NEXT 64K PCTINCREASE 0
                 MAXEXTENTS UNLIMITED);

CREATE TABLESPACE purchases_jan2005_idx
DATAFILE
'C:\ORACLE\PRODUCT\10.1.0\ORADATA\EASYDW\purchasesjan2005_IDX.f'
SIZE 3M REUSE AUTOEXTEND ON
DEFAULT STORAGE (INITIAL 16K NEXT 16K PCTINCREASE 0
                 MAXEXTENTS UNLIMITED);

Step 2: Add a Partition to the Purchases Table

The PURCHASES table is altered, and the purchases_jan2005 partition is created via an add partition operation.

SQL> ALTER TABLE purchases ADD PARTITION purchases_jan2005
     VALUES LESS THAN (to_date('01-feb-2005', 'dd-mon-yyyy'))
     PCTFREE 0 PCTUSED 99 STORAGE (INITIAL 64k NEXT 64k PCTINCREASE 0)
     TABLESPACE purchases_jan2005;

Figure 5.18 shows the new partition that has been added to the PURCHASES table. This page is reached from the standard Tables page for the EASYDW schema by clicking the View button for the PURCHASES table and scrolling down the resulting page (there is more information above and below than we have shown, and the Options information extends further down the page). For the corresponding PURCHASE_PRODUCT_INDEX, you can view the index in a similar fashion by selecting Indexes on the Administration tab, selecting the indexes for EASYDW, and, again, clicking the View button. The presentation of the index information is similar to that shown for the table.

Viewing a Table’s Partitions Using Enterprise Manager

Figure 5.18. Viewing a Table’s Partitions Using Enterprise Manager

Step 3: Move the Table into the New Partition

There are a variety of ways to move the data from one table to another in the same database:

  • Exchange Partition (when the fact table is partitioned)

  • Direct Path Insert

  • Create Table as Select

Moving Data Using Exchange Partition

The ALTER TABLE EXCHANGE PARTITION clause is generally the fastest way to move the data of a nonpartitioned table into a partition of a partitioned table. It can be used to move both the data and local indexes from a staging table into a partitioned fact table. It is so fast because the data is not actually moved; instead, the metadata is updated to reflect the change.

If the data has not previously been cleansed, it can be validated to ensure it meets the partitioning criteria, or this step can be skipped using the WITHOUT VALIDATION clause. Figure 5.19 shows moving the data from the APR_ORDERS staging table into the PURCHASES fact table using exchange partition.

Exchange Partition

Figure 5.19. Exchange Partition

The following example shows moving the data from the JAN_ORDERS into the PURCHASES_JAN2005 partition of the EASYDW.PURCHASES table.

SQL> ALTER TABLE easydw.purchases
     EXCHANGE PARTITION purchases_jan2005 WITH TABLE jan_orders
     WITHOUT VALIDATION;

Assuming that the PURCHASES_JAN2005 partition was newly created and empty at the start, then, after exchanging the partition, it will be full and there will be no rows left in the JAN_ORDERS table. The JAN_ORDERS table can now be dropped.

SQL> SELECT * FROM jan_orders;

no rows selected

SQL> DROP TABLE jan_orders;
Table dropped.

We discuss table partition operations more extensively in Chapter 11 where we also look at the Enterprise Manager screens to support these operations.

Moving Data Between Tables Using Direct Path Insert

If the fact table is not partitioned, you can add more data to it by using direct path insert. A direct path insert enhances performance during insert operations by formatting and writing data directly into the datafiles without using the buffer cache. This functionality is similar to SQL*Loader direct path mode.

Direct path insert appends the inserted data after existing data in a table; free space within the existing table is not reused when executing direct path operations. Data can be inserted into partitioned or nonpartitioned tables, either in parallel or serially. Direct path insert updates the indexes of the table.

In the EASYDW database, since the purchases table already exists and is partitioned by month, we could use the direct path insert to move the data into the table. Direct path INSERT is executed when you include the APPEND hint and are using the SELECT syntax of the INSERT statement.

SQL> INSERT /*+ APPEND */ INTO easydw.purchases
     SELECT * FROM jan_orders;

In our example, the purchases fact table already existed, and new data was added into a separate partition.

Hint

Be sure you have disabled all referential integrity constraints before executing the direct path insert. If you do not, the APPEND hint will be ignored, no warnings will be issued, and a conventional insert will be used; the insert may then take a long time if there is a lot of data. Conventional path is also used for INSERT statements with the VALUES clause, even if you specify the APPEND hint.

Creating a New Table Using Create Table As Select

If the detail fact table does not yet exist, you can create a new table selecting a subset of one or more tables using the CREATE TABLE AS SELECT statement. The table creation can be done in parallel. You can also disable logging of redo. In the following example, TEMP_PRODUCTS is the name of the staging table. After performing the transformations and data cleansing, the products table is created by copying the data from the TEMP_PRODUCTS table.

SQL> CREATE TABLE products
     PARALLEL NOLOGGING
     AS
     SELECT * FROM temp_products;
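
Note that NOLOGGING becomes a persistent attribute of the newly created table, so later direct path operations against it will also skip redo generation. Once the initial copy is complete (and backed up, as discussed in Step 5 of the postload operations below), you may want to switch the table back to logged mode.

SQL> ALTER TABLE products LOGGING;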

Postload Operations

After the data is loaded, you may need to validate its quality, reenable constraints and triggers, rebuild indexes, update the cost-based optimizer statistics, and refresh the materialized views prior to making the data available to the warehouse users. Some of these tasks have been discussed previously in this chapter.

Step 1: Gather Optimizer Statistics for the Tables

Run the DBMS_STATS package to update the optimizer statistics on any tables to which you have added a significant amount of data. Statistics can be gathered for an index, an index partition, a table, a table partition or subpartition, or a materialized view.

Since we added an entire partition to the purchases fact table, the following example will gather statistics for the purchases_jan2005 partition.

SQL> EXECUTE DBMS_STATS.GATHER_TABLE_STATS('easydw','purchases',
'purchases_jan2005',DBMS_STATS.AUTO_SAMPLE_SIZE);
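
To confirm that the statistics are in place, the NUM_ROWS and LAST_ANALYZED values recorded for the partition can be checked in the data dictionary, for example:

SQL> SELECT partition_name, num_rows, last_analyzed
     FROM   all_tab_partitions
     WHERE  table_owner = 'EASYDW'
     AND    table_name = 'PURCHASES'
     AND    partition_name = 'PURCHASES_JAN2005';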

Step 2: Verify the Dimensions

If using summary management, run the DBMS_OLAP.VALIDATE_DIMENSION procedure for each dimension to verify that the hierarchical, attribute, and join relationships are correct. It can also be used to determine if any level columns in a dimension are NULL. You can either verify the newly added rows or all the rows. Any exceptions are logged in the table MVIEW$_EXCEPTIONS, which is created in the user’s schema.

In the following example, the product dimension (parameter 1) in the EASYDW schema (parameter 2) is verified for correctness. All rows will be validated (parameter 3 is set to false), and we’ll check for nulls (parameter 4 is set to true).

SQL> EXECUTE
dbms_olap.validate_dimension('product','easydw',false,true);

SQL> SELECT * FROM mview$_exceptions;

no rows selected
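
Alternatively, to verify only the newly added rows rather than the entire dimension, the third parameter can be set to true.

SQL> EXECUTE
dbms_olap.validate_dimension('product','easydw',true,true);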

Step 3: Refresh the Materialized Views

After loading the new data into the PURCHASES table, the PRODUCT_SUM materialized view became stale. It contains summary information only for the previously loaded data and must be refreshed to incorporate the newly loaded rows. The operations and options for refreshing materialized views are discussed extensively in Chapter 7.
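
As a brief illustration, the refresh could be performed from SQL*Plus with the DBMS_MVIEW.REFRESH procedure. The fast refresh method ('F') shown here assumes that suitable materialized view logs exist on the detail tables; otherwise, a complete refresh ('C') can be requested. Chapter 7 describes these options in detail.

SQL> EXECUTE DBMS_MVIEW.REFRESH('easydw.product_sum','F');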

Step 4: Gather Optimizer Statistics for the Materialized Views

Next, update the optimizer statistics on any materialized views that may have significantly changed in size. If you used Oracle Enterprise Manager to refresh your materialized views, you have the option of analyzing the materialized view by using the Analyze button on the Materialized View General tab.

To gather optimizer statistics outside of Oracle Enterprise Manager, use the DBMS_STATS package. In this example, table statistics will be gathered from the PRODUCT_SUM materialized view in the EASYDW schema.

SQL> EXECUTE
DBMS_STATS.GATHER_TABLE_STATS('easydw','product_sum');

Step 5: Back Up the Database, Table, or Partition

If you used the UNRECOVERABLE or NOLOGGING options, back up the database, table, or partition after the load is complete. Because media recovery is disabled for data loaded without redo logging, you will not be able to recover it in the event of a media failure until it has been backed up. Refer to Chapter 12 for a discussion of backup techniques.
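
To see which datafiles have received unrecoverable (NOLOGGING) changes, and when, you can query V$DATAFILE; if those changes are more recent than your last backup of the affected files, a fresh backup is needed. For example:

SQL> SELECT file#, unrecoverable_change#, unrecoverable_time
     FROM   v$datafile
     WHERE  unrecoverable_time IS NOT NULL;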

Step 6: Publish the Data

Notify the users which data was loaded the previous day and is now ready for their use.

Using Tools for the ETL Process

Several tools are available in the marketplace to help automate parts of the ETL process, including Oracle Warehouse Builder and tools from Informatica, AbInitio, Ascential, and Sagent. These tools provide the mechanisms to connect to heterogeneous data sources, generally either relational databases or flat files, and perform the data extraction. They control the transfer of data across the network, provide data transformation and data cleansing capabilities, and load the data into your data warehouse. We discuss Oracle Warehouse Builder in more detail in Chapter 13.

Summary

In this chapter, we looked at the ETL process. Various techniques were discussed for identifying rows that changed on the source system as part of the extraction process, and both synchronous and asynchronous forms of Change Data Capture were introduced. A number of transformations that are common in a data warehouse were described, with examples of performing transformations during the load process, using staging tables inside the Oracle database, and using the new regular expression functionality. Techniques used to load the warehouse were illustrated, including SQL*Loader, Data Pump, external tables, and transportable tablespaces. Moving data from the staging area to the warehouse tables can be done using a variety of techniques, such as exchange partition, direct path insert, or create table as select. In Chapter 13, we’ll look at Oracle Warehouse Builder, which uses many of the functions discussed in this chapter to help you load your data warehouse.

In the next chapter, we will discuss more about how to query the data we have taken so much care in loading.
