Data Movement Utilities

Before a database becomes useful for real work, it has to have data. At first this sounds like a statement of the obvious, but it is really the place where we have to start in working with OnLine. The data does not just magically appear in the tables ready to be used in our applications. It has to be put there, most often by being imported from other, non-OnLine sources.

Maybe your system is well established and the tables are already populated. You may be tempted to believe that data movement utilities are no longer of use to you. Wrong. You will have many instances in which these utilities will be used. You may need to upgrade your hardware to another machine. You may need to provide data dumps that can be read by another database system on another machine. Your users may want data that they can use in their PC programs. Maybe they want files that they can read into their spreadsheets or word processors. Maybe you need to move large tables from one instance to another or from one dbspace to another. Maybe you need to provide extra archive protection for a few tables. The data movement utilities are essential for all of these applications.

General Concepts

There are some concepts that apply to moving data within OnLine or to and from other systems, no matter what methods you choose to use. Data migration requires knowledge of your particular OnLine setup and of the contents of your data tables. For the most part, unloading data from your OnLine system is fairly simple. It is when you need to load data from another source into OnLine that complications may arise. Choosing the proper tools for data migration depends upon several factors:

  • Requirements of the other system

  • Contents of the data

  • Logging status of your OnLine system

  • Indexing requirements

  • Degree of logical consistency required between tables

  • Time frame

  • UNIX resources

  • OnLine resources

Requirements of the Other System

Whether the other system is providing data to OnLine or whether OnLine is providing data to the other system, each has to be aware of the other's needs. A basic need is that they both use the same character representations. If you output data in ASCII and the other system wants it in EBCDIC, you will need to use the proper UNIX utilities to translate the data.
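
If you do find yourself facing an EBCDIC requirement, the standard UNIX dd utility can handle the translation. A minimal sketch, with hypothetical file names:

   # ASCII unload file to EBCDIC for the remote system
   dd if=customer.unl of=customer.ebc conv=ebcdic

   # and the reverse, for data arriving in EBCDIC
   dd if=incoming.ebc of=incoming.unl conv=ascii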

Contents of the Data

You need to know how the data is currently formatted and how it needs to be formatted for the other system. If the other system wants multiline records or is sending you multiline records, you will use different tools than if all records are single lines. For the methods that use ASCII dumps, you need to be able to choose a delimiter character. Your data will be junk if you try to use a UNIX pipe "|" symbol to separate your fields and your data actually contains the pipe symbol. Some of the means of unloading data are smart enough to watch out for this, but you need to double-check. For example, the UNLOAD TO xxxx SQL statement will escape a delimiter character if it finds one in a character field. You also need to be aware of how the source and target systems handle null fields.
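
For example, here is a hedged sketch of an unload that uses the pipe symbol as its delimiter (the table and file names are hypothetical). Whatever reads the file on the other end must understand the escaping convention:

   UNLOAD TO "/tmp/customer.unl" DELIMITER "|"
       SELECT * FROM customer;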

Logging Status of your OnLine System

Logging status is one of the "gotchas" of moving data to and from OnLine. This usually gives you trouble when you are loading data into an OnLine system. Some of the utilities require that the logging modes of the sending and receiving database (OnLine to OnLine) be the same. No matter what tools you're using, it is usually better to load into a database that has logging turned off. Of course, this is not practical if you are loading into a production system using logging and you can't or don't want to turn the logging off. It's tempting to try to get around this problem by loading data into a nonlogging database and then either creating synonyms from a logging database into the new table or using INFORMIX-Star to load data from the new, nonlogging table into your production system. This does not work, as OnLine enforces rules that require similarity of logging status on each side of synonyms and INFORMIX-Star links.

Indexing Requirements

You should load into nonindexed tables and index the tables after the data is in place. If you load and index at the same time, you will have data pages interspersed with index pages in your tablespace. This fragments the data and makes access to it slower, which is especially noticeable with long, sequential, nonindexed reads: if the data pages had all been contiguous, the reads would be quite efficient, but with data pages and index pages interspersed, the disk head has to jump around quite a lot to read the data. The indexes would also have been more compact, since they would have been presorted. In addition, the entire process goes faster if the indexing is done later. Especially in versions later than 4.11, the engine supports parallel sorting via the PSORT environment variables. If you have multiple processors, using this parallel sorting capability will make your sorts go much faster. There are three separate PSORT variables (see the sketch after this list):

  • PSORT_DBTEMP = list of filesystems for sorting and working storage

  • PSORT_NPROCS = number of sort threads to use

  • PSORT_MALLOC = amount of data to keep in memory before writing work files. This is an undocumented variable and performs best at 10240.
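
As a minimal Bourne shell sketch, here is one way to set the three variables before a large index build. The directory names and thread count are examples only, and the exact list format for PSORT_DBTEMP may vary by release, so check your manual:

   PSORT_DBTEMP=/fs1/tmp:/fs2/tmp   # spread sort work files over filesystems
   PSORT_NPROCS=4                   # roughly one sort thread per CPU
   PSORT_MALLOC=10240               # undocumented; 10240 per the note above
   export PSORT_DBTEMP PSORT_NPROCS PSORT_MALLOC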

If you can, when you are unloading a table to a flat file for loading into another OnLine database, sort the output in the primary key order that is desired for the target table. That way, when the table is loaded, it is already clustered on the primary key. In addition, if a table is already in the proper sorted order, there is an environment variable that you can set to tell the engine that the table is already sorted, so it does not need to perform a sort for this index. This variable is $NOSORTINDEX. Set it to 1 before creating the primary index for this table, and the sorting portion of your index build will go even faster.
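
For example, a sketch of the unload side, run from isql or dbaccess (the table and column names are hypothetical):

   UNLOAD TO "/tmp/customer.unl"
       SELECT * FROM customer ORDER BY customer_num;

After the rows are loaded on the target, set NOSORTINDEX=1 in your shell and then create the index on customer_num; the engine can skip the sort because the rows arrived presorted.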

Degree of Logical Consistency Between Tables

If you are dumping just one table or loading just one table, you usually don't have to worry too much about consistency, but it may bear some thought. If you are moving multiple tables that depend upon one another, you definitely need to consider whether you can unload them while database activity is occurring. First, you need to decide whether the data movement occurs when the database is online or when it is quiescent. Are others accessing or needing access to the tables? Does the table depend upon other tables for join information or lookup information? Can others access the table while the data movement is happening? Does the utility lock the table when unloading? All of these factors make a difference in deciding how you move data.

If you try to move two tables while the database is running, you could be facing some serious problems if changes were made to the first and second tables after the first had been unloaded and before the second had been unloaded. The tables could contain logically inconsistent data.
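
In a logging database, one hedged way to take a consistent snapshot of related tables (the table and file names here are hypothetical) is to hold share locks on both tables for the duration of the unloads. Be aware that writers will block while the locks are held:

   BEGIN WORK;
   LOCK TABLE orders IN SHARE MODE;
   LOCK TABLE items IN SHARE MODE;
   UNLOAD TO "/tmp/orders.unl" SELECT * FROM orders;
   UNLOAD TO "/tmp/items.unl" SELECT * FROM items;
   COMMIT WORK;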

Time Frame

How much time do you have to do this job? Do you need to have a table locked for hours? Do you need the database quiescent for hours? Just how long will it take to move a few million rows?

UNIX Resources

If you're moving data to disk, do you have enough disk space? How much of a load will the job place on your system?

OnLine Resources

This is an important factor. First and foremost, will loading the data create a long transaction problem? If you're loading into a logging database, this could definitely be a problem. Make sure that you have enough logfiles and that other activities that generate heavy logging are not happening at the same time. Make sure that your LTXHWM and LTEHWM logfile high water marks are set lower than the defaults, as the defaults are too high for most real-world applications. Use at most 60 percent and 70 percent for the respective numbers. Make sure that you have enough locks in the system to handle the locking load. If you're loading using SQL into an existing or newly created table, lock the target table before beginning the load. That way, you'll use only one table lock instead of thousands of row or page locks.
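
As a sketch, assuming a logging database and a hypothetical customer table, a load wrapped this way takes one exclusive table lock instead of thousands of row or page locks. Note that it also runs as a single transaction, so for really big loads the dbload utility with its -n commit interval (covered below) is the safer route:

   BEGIN WORK;
   LOCK TABLE customer IN EXCLUSIVE MODE;
   LOAD FROM "/tmp/customer.unl" INSERT INTO customer;
   COMMIT WORK;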

We'll now look briefly at the tools for moving data. For full details on the operations of each of the utilities, look to your manuals. This is just an overview of the capabilities and applicability of each tool. We'll explore the tools from fastest to slowest.

Tbtape -s (archive and restore)

This is by far your tool of choice if you are moving an entire instance. It will move everything: all databases, all indexes, all extent data, even all wasted space. If you want to move just one database from one machine to another, you can always move everything and drop the ones you don't want. You may even be able to use your normal backup archives as a medium of transferring data to the new computer without doing any extra work on your source machine.

Obviously, this works only from one OnLine system to another. Depending upon the differences between versions, it may work from one OnLine version to another. If the versions are close enough and if the machines have similar page sizes and similar ways of representing data, it may even work from one brand of computer to another. If you're upgrading to a bigger computer using the same operating system, the same OnLine version, and the same architecture, this may be the way to go.

At least with OnLine versions prior to version 6.0, the host and target machines must have similar layouts. If you have a one gigabyte chunk on /dev/chunk1 on the source machine, you must have a similar (or larger) size device on the target machine. The general rule is that the target machine must have everything that the host machine has. If you're working on a disaster recovery plan, make sure that your backup machine at least equals your source in disk space allocated to OnLine.

Along with the archives that are necessary for recovery, be sure that you have copied to a safe place any other information that would be necessary to recreate your system's layout. Keep a copy of the tbconfig file and a copy of the tbstat -a output. Also keep a copy of the output of the tbcheck -pr command.
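
A minimal sketch of such a snapshot, assuming the default tbconfig location under $INFORMIXDIR/etc and a /safe directory that lives somewhere other than the machine being protected:

   tbstat -a > /safe/tbstat_a.out
   tbcheck -pr > /safe/tbcheck_pr.out
   cp $INFORMIXDIR/etc/tbconfig /safe/tbconfig.copy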

If you can use tbtape's archives and restores to move your data, you can expect to restart on the new system with it looking exactly like the old system with little or no effort.

Tbunload/Tbload

The tbload and tbunload utilities are a very fast means of moving tables and databases. They can work either on single tables or on entire databases. Like archive and restore, they work with binary pages of data and are comparable in speed. Like archives and restores using tbtape, they require compatible versions of OnLine, compatible page sizes and page formats, and compatible integer representations. Unlike tbtape archives and restores, they do not require similar resources or identical chunk sizes and naming.

The tbunload and tbload programs also transfer index, extent, and other table information. One potential problem in moving indexes is the requirement that index names be unique within a database. You cannot have two indexes named index_1 within a database, even if they are on separate tables. The names must be unique. The tbload utility handles this with a -i parameter that allows you to rename indexes during the move to preserve the uniqueness of index names.

The tbload and tbunload utilities are designed to move data to and from tape. If you want to use a disk file instead, you should first create the file and make sure that it has proper ownership and permissions and then treat the file just like a tape device. Of course, you should be certain that you will have enough space on the device to hold all the data.
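
For example, a hedged sketch of preparing such a file (the path is hypothetical, and the ownership assumes the usual informix user and group):

   touch /bigdisk/transfer/customer.tbu
   chown informix /bigdisk/transfer/customer.tbu
   chgrp informix /bigdisk/transfer/customer.tbu
   chmod 660 /bigdisk/transfer/customer.tbu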

If you succeed in moving a table or a database using tbunload/tbload, always run the tbcheck utility on the resulting table or database. I have seen situations where the tbload completed with no problems, the tables showed up on the new system, and the data seemed to be there. However, when the tbcheck utility was run, there were internal problems with the data. Check it out carefully before putting your trust in it.
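
A hedged sketch of such a check on a hypothetical database and table; see your manual for the full list of tbcheck options:

   tbcheck -cd mydb:customer   # check the table's data pages
   tbcheck -ci mydb:customer   # check the table's indexes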

Dbexport/Dbimport

This utility pair moves entire databases between OnLine and/or INFORMIX-SE systems in the form of ASCII files. They transfer entire databases and cannot be used to add tables to existing databases. They create database schema files and database export files. The schema files and the export files can be individually directed either to disk or to tape. The schema files used by these utilities are basic and do not include information about extent sizes, locking modes, dbspaces, or logging modes.

These schemas are generalized so that they can be used to transfer data to and from both OnLine and INFORMIX-SE systems. If you are planning to send the data files to disk, be sure that you have enough disk space to hold your entire database.

Schema files can be edited if they are sent to disk, or they can be replaced with complete schema files such as those generated by the newschema shell script that is included on the CD-ROM that accompanies this book.

While dbexport is running, the source database is locked in exclusive mode, so it cannot be used to export a database that is in active use.
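
A minimal sketch of the round trip, with hypothetical names; the exact flags vary somewhat between releases, so verify them against your manual:

   # on the source machine: write the schema and data files to disk
   dbexport mydb -o /data/export

   # on the target machine: recreate the database, placing it in a
   # chosen dbspace with -d
   dbimport mydb -i /data/export -d datadbs1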

Dbload

The dbload utility does not have a reciprocal dbunload program. The dbload utility is used to insert ASCII formatted data into existing tables, with the loading parameters controlled by a command file. This utility is best suited for moving complex ASCII data into OnLine tables.

This utility is executed from the UNIX command line. It has many option flags that give you a lot of control over how the utility runs.

The dbload utility gives you detailed control over delimiters, frequency of commits to avoid long transaction problems, the acceptable number of errors before aborting, the number of rows to skip at the beginning of a data file, and whether or not to lock the target table when loading.

The usage of the dbload utility is:

   INFORMIX : 19> dbload
   DBLOAD Load Utility           INFORMIX-SQL Version 5.00.UC2
   Copyright (C) Informix Software, Inc., 1984-1991
   Software Serial Number XYZ#C2999999

   Usage:
   dbload [-d dbname] [-c cfilname] [-l logfile] [-e errnum] [-n nnum]
          [-i inum] [-s] [-p] [-r]

   d      database name
   c      command file name
   l      bad row(s) log file
   e      bad row(s) # before abort
   s      syntax error check only
   n      # of row(s) before commit
   p      prompt to commit or not on abort
   i      # of row(s) to ignore before starting
   r      loading without locking table

The command files can handle two main types of data: data that is delimited by a defined delimiter character and tabular data that is character-position dependent. This flexibility of input formats makes dbload a very powerful tool for loading ASCII files from sources other than an OnLine database. The dbload utility can be combined with other UNIX utilities to create flexible systems that take a datafile or report from another source, massage it with UNIX utilities into a format that dbload understands, and finally load the data into OnLine through dbload.
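
Here is a hedged sketch of both command file forms, with hypothetical table, column, and file names; check the dbload section of your manual for the full syntax. The delimiter form names the input file, the delimiter character, and the number of fields per line:

   FILE "/tmp/customer.unl" DELIMITER "|" 10;
   INSERT INTO customer;

The character-position form defines each field by its column positions:

   FILE "/tmp/report.txt" (city 1-15, state 16-17, zipcode 18-22);
   INSERT INTO locations (city, state, zip)
       VALUES (city, state, zipcode);

Either command file is then run with something like:

   dbload -d mydb -c customer.cmd -l customer.bad -e 10 -n 500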

One of the main features of the dbload command is the -n parameter. If you try to load a very large table, you may run out of locks or create a long transaction. The -n parameter allows you to get around either of these problems: the value of n tells dbload to perform a COMMIT WORK every n rows. These periodic commits prevent dbload from using all of your locks and from trying to do the entire load in one transaction, which could cause a long transaction and possibly fill up all your logfiles.

Another very useful capability of dbload is the -e parameter. With most loading methods, if you attempt to load duplicate rows into a table with unique indexes or unique constraints, the load will bomb out with an error message telling you that you have duplicate data. This can become a real pain if you are trying to eliminate these duplicates. You can use the -e parameter to tell dbload to ignore bad rows and to place them in an error file (the name of which is determined by the -l parameter). The value of e is the total number of bad rows you will accept before allowing dbload to fail.

If you have partially loaded a table, or if you have two tables that contain duplicates of each other, and you want to consolidate them, you can use dbload to create one table. Unload the first table into a flat file using the SQL UNLOAD TO command. Then run dbload with a very large value for the -e parameter, loading the second table from the flat file that contains the first table's contents. The dbload command will complain about the duplicate rows, but it will load all of the nonduplicate rows into the second table.
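
A hedged sketch of the consolidation, assuming hypothetical names and a merge.cmd command file that reads first_table.unl and inserts into the second table:

   # after unloading the first table with UNLOAD TO "/tmp/first_table.unl"
   # -e is set absurdly high so duplicates never abort the load;
   # the rejected duplicate rows land in dups.log for inspection
   dbload -d mydb -c merge.cmd -l dups.log -e 9999999 -n 500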

Writing and debugging the command files for dbload is as much an exercise in programming as is writing SQL statements or writing C programs. If you need to use dbload to move data, sit down with the manuals and the examples and see how it is done. You may want to look at the *.unl files in the $INFORMIXDIR/demo file tree for examples of the use of this program.

The command file for dbload contains the name of the file from which the dbload program gets the data to load into the table. This does not necessarily have to be the name of a file. If your flavor of UNIX supports named pipes created by the mkfifo UNIX command, the data source can be a named pipe. You can use this to move data from table to table or from database to database without worrying about long transactions or running out of locks. Create the named pipe and start a SQL UNLOAD TO "named_pipe" command from your source table. From another terminal, run dbload with "named_pipe" as the source of your data. The data will be unloaded from the source table and inserted into the target table.
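
A minimal sketch, assuming a hypothetical move.cmd command file whose FILE statement names the pipe as its data source:

   mkfifo /tmp/move_pipe

   # terminal 1, from isql or dbaccess on the source database:
   #    UNLOAD TO "/tmp/move_pipe" SELECT * FROM customer;

   # terminal 2: read the pipe, committing every 500 rows
   dbload -d target_db -c move.cmd -l move.bad -e 10 -n 500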

Unload/Load To/From filename select_statement

This pair of SQL statements allows you to create delimited ASCII files from tables and to place the contents of delimited ASCII files into tables. They can be used either with isql or with dbaccess. The select_statement in the UNLOAD command can be any legal OnLine SELECT statement, so it is possible to create unload files of partial tables or fragments of tables.

These commands are suitable for use on an operating production system provided you have covered the problems of ensuring adequate resources and have adequately addressed the issues of logging and concurrency of access. The UNLOAD SQL statement is very seldom a problem in any manner, unless you don't have adequate disk space, but using LOAD can create long transactions or can cause you to run out of locks. In cases like this, the dbload program may be better for your use.
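
For example, with hypothetical table, column, and file names:

   UNLOAD TO "/tmp/ca_cust.unl" DELIMITER "|"
       SELECT * FROM customer WHERE state = "CA";

   LOAD FROM "/tmp/ca_cust.unl" DELIMITER "|"
       INSERT INTO customer_archive;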

Reports to a File

ASCII files can be generated using a report writer. If you use isql, you can use the ACE report writer that is included with isql. These report files, if used with the default report format, will create character-position dependent ASCII output files suitable for use with dbload. By modifying the SELECT statement in the report, the report writer could create a delimited report.

Output to filename WITHOUT HEADINGS

This SQL statement is similar to UNLOAD except that it does not place delimiters in the output file. This makes it suitable for creating output files that are character-position dependent. It can also be run with the WITHOUT HEADINGS clause, which causes OnLine to omit the names of the fields. Either the OUTPUT TO or the UNLOAD TO SQL statement can create files that can be read by the dbload program.
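
For example, with hypothetical names:

   OUTPUT TO "/tmp/cust_list.txt" WITHOUT HEADINGS
       SELECT customer_num, lname, fname FROM customer;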

An interesting and useful feature of the OUTPUT TO statement in SQL is the ability to use the PIPE statement in place of an output file. This statement takes the output of the SQL command and pipes it into a UNIX pipeline. Use of this construct is one way that you can cause a UNIX command to execute from within your SQL statements.

For example:

   OUTPUT TO PIPE " > /dev/null; ls -la" select * from dummy

Since SQL does not have the equivalent of a system command, this trick can simulate one. The command in the pipe needs to do something with the standard output of the SQL statement.

If you just want to execute a system command and don't care about the output of the SQL command, just redirect its output to /dev/null as in the example.

Of course, this command is also very handy for its intended purpose, which is to pipe the output of the SQL to a UNIX command. This allows for further processing of the output by UNIX. This pipeline can be as complicated or as detailed as UNIX will allow.

If you wish, your SQL output can provide the starting point for extensive post-processing outside of the database. In some instances, this can be quite useful, as there are a lot of tasks that can be efficiently completed in UNIX but are difficult, expensive, or impractical to do within SQL.

In fact, it is possible to pipe the output to a long series of UNIX commands, massage the output, pipe the output into files, and then use one of the IDS LOAD commands to retrieve the data back into the OnLine database. This gives you the capability of choosing your working tools based upon the job to be done instead of trying to attack everything using the OnLine tools.

Isql and Dbaccess

For many users, the isql or dbaccess program is all they know of the OnLine system. This menu-based program is ubiquitous and is often taken for granted. There is a lot of utility in Informix's implementation of isql, and much of this utility is often overlooked. This seems to be especially true with users who have moved over from another database system to an OnLine system. Dbaccess is a stripped-down version of isql that is included with the 5.0 series of products. Dbaccess does not include the menu-creation and report-writing capabilities of isql.

Different database management programs all use extensions to ANSI SQL to achieve more functionality. Users coming from database management systems that add many extensions to their SQL products sometimes feel restricted in the SQL query portion of OnLine. Such things as extensive string functions are not present as they are in the SQL dialects of products such as Oracle. Heavy users of ESQL/C seem to feel this loss the most. They may use only the embedded SQL capabilities of OnLine and may miss out on some of the other functions available through isql.

The I-SQL program contains five major areas of functionality. For the SQL developer, the one that gets the most attention is the Query Language section. This system allows the user to create, modify, save, run, and redirect the output of SQL statements. A user who spends most of her time in the Query Language module may believe that some of the more sophisticated features of other database management systems are not present in OnLine.

Of course, this is not true. The isql module is a suite of packages. The main features are a form-based query and user-interface program, a report generator, the Query Language module, a user menu generator and runner, and a database and table status module. As an example, much of the character manipulation and string management functionality is found in the report writer, where you can do things that you cannot do in the Query Language module, such as concatenating strings and controlling output formats.

The forms module provides a very handy "quick and dirty" method of dealing with the data in the tables. You can generate, compile, and run complex, multipage screen forms quickly and easily. This forms module is not as comprehensive as the INFORMIX-4GL product, but it is a handy tool for rapid development and management of database systems.

The user menu module is also very useful in developing a user interface for various purposes. Using the user menu options, you can easily integrate Informix programs, scripts, UNIX utilities, and other executables into a seamless whole. It doesn't give you all the bells and whistles that custom development in C or Microsoft Windows™ will give you, but it doesn't take you months to do it either.

In my opinion, these tools are underrated and underused. As a working DBA, you will need to provide many services and rapid response to your user community. You will find yourself needing to develop and use many programs, scripts, and data management schemes to do your job. You can make the choice to develop your tools with lots of options and user-interface gadgets or to develop them quickly and simply. If you decide to go for the utility over the flash, you probably have all the tools you need using isql and INFORMIX-4GL.

This book will not go into detail about isql, INFORMIX-4GL, and the SQL query language. We are assuming that the DBA either has experience with, or can read the manuals for, these programs. However, we will include examples of user tools developed with INFORMIX-4GL, forms built in the forms generator, and menu programs running under the menu module. All of these examples are tools that will be helpful or informative to the DBA. As you study the examples, remain open to other things that you can do using these tools. I'm taking a typical UNIX approach to this: show you the tools, show you how they can work together, and then turn you loose to develop your own variants.

If you develop some particularly interesting or useful variants, get on the Internet and post them to the comp.databases.informix newsgroup. If they're handy for you, others can probably use them. Besides, it'll give you a moment or two of fame!
