Agile Database Techniques

The goal of data architecture is to define the entities that are relevant to the enterprise, not to design the logical and physical storage systems. At this point, you may be wondering how several of the topics discussed within this chapter apply to your database administrators. We feel that a slight deviation is warranted here. We will discuss techniques that database administrators (DBAs) may use to become more agile and to empower development teams.

For additional reading, see Agile Modeling: Effective Practices for Extreme Programming and the Unified Process by Scott Ambler and Ron Jeffries.


The traditional database management approach has multiple problems, including but not limited to the following:

  • Iterations of the data model or database are not allowed.

  • Changes to the production environment are rare or not allowed.

  • Migrating from one data model to another is a project in and of itself.

  • There is usually only one development, QA, or other such database.

  • The database administrator is a roadblock.

When considering how important databases are to the development of applications, one finds multiple opportunities for taking advantage of agile principles to save time and speed development. Now let's look at some agile principles we can apply to this discipline and then show how they can be realized.

Applying the Agile Approach

The agile approach to database management acknowledges that functionality in any undertaking should be delivered in increments. This is the first challenge to the previous bullet points. The agile approach also advocates automated testing. If testing procedures can be automated, one could consider having more than one database for each release level (QA, development, etc.). If we could solve for this, developers and QA could not only develop and test independently of each other but could also test different versions of future functionality. Finally, the agile approach acknowledges the importance of team interaction. This challenges the notion that database administrators have to do everything related to databases, which sometimes leaves developers waiting around, twiddling their thumbs.

Over time, both applications and databases grow more complex and undergo frequent rounds of refactoring. It becomes very important for architects to focus on the flexibility of the database. If an organization has an application in which business users can create complex scenarios within a half hour and run acceptance tests immediately, the importance of being agile is reduced. However, if the testing process requires intervention by multiple parties and running test scripts (even automated ones) for hours, if not days, then an enterprise can benefit by using agile approaches.

In an agile database approach, everyone gets his or her own copy of the database. A database should be thought of no differently than an application. In many Java shops, all developers are equipped with an application server on their desktops. The same should hold true for databases. This allows developers to work independently of other developers. Developers who understand how databases work (the vast majority) can have their own copies of the database so they can try out various changes to the database structure without involving the DBA. This allows developers to use the database as a scratch pad.

To learn about sandbox diagrams, see www.agiledata.org/essays/tools.html.


If each person were to have an individual copy of the database, matters could get out of hand quickly—unless scripts (automation) allowed everyone to recreate a clean database at will. This illustrates a concept that can be manifested differently, depending upon the database in use (e.g., Oracle, MySQL, Pick, etc.).

Database administrators may ask “What's in it for me?” The answer is simple. In the agile approach, DBAs will receive fewer requests for work, allowing them to concentrate on what is truly important: maintaining the quality of the data and the structure of the production database. DBAs can maintain specialized data sets for each context (development, load testing, demos, etc.) that further empower the development team. By allowing each person to have his or her own copy of the database, developers can independently test incremental changes that allow for the migration of older data sets to newer versions of the application. For organizations that employ a release management process, the database can be thought of as simply another component within the application and can be assigned a build number and context. In an agile approach, the DBA takes on more of a stewardship role toward databases rather than that of a worker bee.

To make this work, several procedures must be put into place. First, modifications to the schema and setup data must be logged chronologically and placed under version control in the same manner as the application. By using version control practices, any member of the team can simply check out the appropriate scripts needed to support the version level they require and can execute them to get the proper version of the database. Since the schemas are within version control, this can happen on demand without DBA intervention. This same version control mechanism will allow old database instances to be rolled forward simply by applying the logged changes that were made to the schema and setup data.
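
As an illustration of what one of these logged, version-controlled change scripts might look like, consider the sketch below. It follows the style of the Canaxia scripts shown later in this section, but the fax column, the schema_change_log table, and the numbering scheme are hypothetical, and the exact ALTER TABLE syntax varies slightly from product to product.

/*
**    0412_ADD_SUPPLIER_FAX.SQL
**    Illustrative incremental change script. The column being added, the
**    change-log table, and the numbering scheme are hypothetical.
**
**    Release: 4.0
*/

use CRMDB
;

-- The incremental schema change itself.
alter table supplier add fax varchar(80) null
;

-- Record the change in a change-log table (created once by the baseline
-- script) so that any database instance can report which scripts have
-- already been applied to it and can be rolled forward from any point.
insert into schema_change_log (change_id, release_label, description, applied_on)
values (412, '4.0', 'add fax column to supplier', current_timestamp)
;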

In an agile database, everything starts with the notion of a database instance. Database instances are equivalent to application instances because they are a working copy of the system at a point in time. The vast majority of large projects use some form of source control (e.g., PVCS, SourceSafe, ClearCase, etc.) that becomes the system of record for all project assets. For example, if the Canaxia development team were currently working on the sixth release of its enterprise customer relationship management system but expressed a desire to analyze the code base as of the fourth release, one would simply use the source control product and check out the code as of that release (Figure 11-7).

Figure 11-7. Development timeline.


As we can see, the development team is working on the current release of software (version 6.0). When this build of software has been signed off on, exactly one version of the code base will be present at the end of the iteration. Part of this iteration includes an instance of the database, which comprises the database schema and setup data. Since Canaxia uses scripts to create its database schemas and maintain setup data, the development team is able to recreate a functional database instance at any point in time and in the same manner as the code.

Table 11-4. Database iterations.
Version/Iteration    Function
4.0—Iteration 1      DBA makes a copy of certified schema from previous version and loads the database with programmatically generated data.
4.0—Iteration 2      DBA creates a new data set to allow for automated regression testing.
4.0—Iteration 3      DBA creates a new data set to allow for acceptance testing.
4.0—Iteration 4      Version has been certified. DBA creates production database.

The version control scheme should also introduce the notion of context. Context represents the notion that multiple parties may require different sets of data loaded into the database for varying reasons, such as development, regression testing, acceptance testing, and production deployment. A development timeline is two-dimensional in that releases, and the iterations within them, occur in order. Context establishes a third dimension that governs how data are used within a development timeline. Table 11-4 presents a timeline to further explain this point.

Based on this scenario, you may have noted that each iteration had only one version of the code and schema but multiple data set lineages (regression and acceptance testing). The schemas were the same throughout the iterations, yet the data sets could contain radically different data. For example, the development data set could contain only employees with the first name of Ernie. The acceptance data set could contain only data required to fulfill test cases that require preexisting data. Likewise, production databases created in the last iteration would be sizable and could require data conversion.

A lineage is a preestablished context that is tracked across time and inherits its attributes from its ancestors.


It is important to track the changes in database lineages in a manner different from what is done for typical code. Ideally, lineages should be maintained using scripts and stored in version control. Sometimes, using lineages is not possible, so you may have to revert to snapshots. In any case, a lineage includes a master instance, a change log, an update list, and a collection of child instances that were derived from it. The first lineage of any application is the lineage that supports the application itself. Ideally, a new lineage is created every time an application branches or whenever the team requires a unique instance or specialized data set.

Working with Scripts

Over time, the quantity of scripts required to create a database can become excessive. In such a situation, we recommend periodically creating new master scripts that essentially start a new lineage. This is referred to as the master lineage. The master should contain all changes to schemas and setup data from prior lineages.

Let's look at a database script and show the information that we will record within it.

/*
**    SUPPLIER.SQL
**    Creates table for supplier and signon.
**    Copyright 2003, Canaxia. All rights reserved.
**
**    Build: 2226
**    Release: 4.0
**    Date: September 10, 2001
**    Author: Little James
*/

use CRMDB
;

create table supplier (
      suppid int not null,
      name varchar(80) null,
      status char(2) not null,
      addr1 varchar(80) null,
      addr2 varchar(80) null,
      city varchar(80) null,
      state varchar(80) null,
      zip char(5) null,
      phone varchar(80) null,
      constraint pk_supplier primary key (suppid)
)
;

create table signon (
      username varchar(25) not null,
      password varchar(25) not null,
      constraint pk_signon primary key (username)
)
;

Scripts should be employed that handle the following tasks:

  • Copy last certified release of database locally.

  • Create a local copy of database specifying a version number.

  • Create a new baseline and assign it a version number.

  • Perform differential of local database versus last certified release.

  • Perform differential of local database versus baseline.

Having the ability to use version control for databases will allow an organization to do the following:

  • Conduct acceptance testing against a copy of current production data using a modified schema.

  • Allow production databases to be migrated to a future state of development.

  • Allow for changes to production databases to happen on a weekly or even daily basis since it is very easy to do so.

Normalization

Another important step toward having agile databases is not to be savage in the pursuit of normalization. In decision support applications, a star schema is typically used. Star schemas usually require fact tables to be in third normal form and dimension tables to be in second normal form. With most such databases, 99 percent of the access occurs in a read-only fashion, with updates representing the other 1 percent. Using star schemas results in queries that are easy to write and run efficiently.
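
As an illustration only, a minimal star schema might look like the sketch below. The table and column names are hypothetical, and a real schema would carry many more dimensions and measures, but it shows why the resulting queries stay simple: one join per dimension.

create table dim_calendar (
      calendar_id int not null,
      calendar_date date not null,
      month_name varchar(20) not null,
      year_number int not null,
      constraint pk_dim_calendar primary key (calendar_id)
)
;

create table dim_supplier (
      supplier_id int not null,
      name varchar(80) null,
      city varchar(80) null,
      state varchar(80) null,
      constraint pk_dim_supplier primary key (supplier_id)
)
;

-- Fact table: one row per supplier per day, holding only keys and measures.
create table fact_order (
      calendar_id int not null,
      supplier_id int not null,
      order_count int not null,
      order_amount decimal(12,2) not null,
      constraint fk_order_calendar foreign key (calendar_id)
            references dim_calendar (calendar_id),
      constraint fk_order_supplier foreign key (supplier_id)
            references dim_supplier (supplier_id)
)
;

-- A typical decision support query: total order amount by year and state.
select d.year_number, s.state, sum(f.order_amount) as total_amount
  from fact_order f, dim_calendar d, dim_supplier s
 where f.calendar_id = d.calendar_id
   and f.supplier_id = s.supplier_id
 group by d.year_number, s.state
;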

Agile modeling techniques are presented in detail in Chapter 8.


Online transaction processing (OLTP) databases are typically used nowadays in conjunction with application servers for data access. For example, J2EE-based application servers support the notion of entity beans, which provide an object representation to a relational database. Since entity beans map one for one with a database's underlying table/view structure, it makes sense to consider, where appropriate, a less normalized view of databases. Many application servers can enforce the integrity rules that would otherwise be contained within the database, so following strict rules of normalization is no longer required. Traditionally, following rules of normalization resulted in an increase in performance. However, using this same mind-set with J2EE applications, and specifically with entity beans, could result not only in additional coding but also in lost performance, since the application server has to manage relationships as well as create an object representation of each table/row at run-time.

Star schemas are also easier to refactor because they are less normalized. They also map better to an object-oriented run-time model, especially when used in conjunction with adaptive object models that can build their representations using metadata obtained from a database.

When using star schemas, it is better to have a column containing precalculated values instead of performing calculations at run-time. This also has the effect of simplifying queries. For example, if your database supported a calendar table represented as a dimension in which each row at the day level has a “number of days since time began” column, your applications would be made simpler because they would not have to calculate across month boundaries and so on. Dimension tables that are wide and contain pre-exploded values allow one to develop applications faster. Besides, columns are very easy to add to a database today.
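
Assuming a calendar dimension like the one sketched above, the precalculated counter might be added and used as follows; the column name and the dates are purely illustrative.

-- Columns are easy to add: a precalculated "days since time began" counter.
alter table dim_calendar add day_number int
;

-- The number of days between two dates becomes a simple subtraction; no
-- month or year boundary arithmetic is needed in the application.
select c2.day_number - c1.day_number as days_between
  from dim_calendar c1, dim_calendar c2
 where c1.calendar_date = date '2003-01-28'
   and c2.calendar_date = date '2003-03-03'
;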

One of the motivations behind normalization was to increase the quality of data. Using normalization beyond Third Normal Form (3NF) does have the effect of potentially increasing data quality but comes at the expense of being agile. We believe a better approach is to build the quality information directly into the tables. The quality information could record when and where the data were captured and how reliable they are, which is simply additional metadata.
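
A rough sketch of what this might look like for the supplier table follows; the metadata columns shown are hypothetical and would vary by organization.

-- Data-quality metadata built directly into the table: where each row came
-- from, when it was captured, and how trustworthy it is considered to be.
alter table supplier add source_system varchar(40) null
;
alter table supplier add captured_on timestamp null
;
alter table supplier add quality_rating char(1) null
;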

Automating processes using metadata, generating database scripts, and giving everyone the tools they need to become productive are the key outcomes of an agile process. DBAs should automate every process they perform on a repeated basis and, whenever possible, should ensure that these processes and the tools they use are accessible to team members. Process automation is the only activity that will allow DBAs to respond in a more timely manner.

We now present final best practices, which range across all the topics discussed previously in this chapter.

Establish an Infrastructure That Allows for Rapid Changes in Business Requirements and Database Technologies

In most organizations, business requirements change frequently. The design of your data infrastructure must be agile and allow for easy and rapidly implemented changes to data models. Likewise, technology advances continue to outpace all but the most flexible organizations. Vendors are constantly adding new features to their engines that allow them to perform previously impossible tasks. The infrastructure must allow for replacing the database technology if necessary to gain advantage.

Data That Need to Be Shared and Current Should Be Centralized

Transactional data that are high volume and shared across locations should be centralized. Additionally, if the data for all locations must be current, centralization becomes mandatory. Some organizations have tried to work around this by replicating updates to distributed databases in remote locations with limited success. Using a replication approach increases the complexity of an organization's data architecture and results in increased network traffic.

If any of the following criteria are present within your organization, a centralized database strategy is mandatory:

  • More than a handful of users requiring access to up-to-date data. This usually includes access to OLTP systems.

  • A handful of users and no distributed locations. Sometimes an organization adopts the departmental mind-set, but this should be avoided.

  • Multiple locations but a lack of appropriate skill sets and/or tools to manage distributed data.

  • A business requirement to provide consolidated data for federated metadata on an open platform.

Avoid Monolithic Database Designs

A good database design is aligned with the principles of a Service-Oriented Architecture whereby data are associated with a particular business service. When data for a particular process (service) need to change, the change is localized and does not affect the entire application. Mapping your database design to a Service-Oriented Architecture allows an organization to realize quality attributes including the following:

  • Scalability

  • Performance

  • Availability

  • Flexibility

  • Modularity

A modular database design allows databases to be backed up and recovered more quickly than in a monolithic approach. Scalability and availability are also increased when data are partitioned along the lines of a service, because doing so allows tasks to be performed in parallel. A complex request can be divided into smaller chunks and processed by multiple databases simultaneously.

To align the data architecture strategy with the principles of service-orientation, the following steps must occur:

  • Clear definition of business processes.

  • Clear definition of data requirements that the service will use.

  • Accountability of business units for the integrity of data.

  • Establishment of service levels to ensure consistency.

Use Access Rules to Protect Data

Ideally, all data access should occur through a Service-Oriented Architecture that will have its own access rules. Data should not be directly accessible except through the processes that own the data. This practice will help prevent unauthorized access to data as well as their accidental destruction.
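
Within a relational database, this principle can be approximated with the privilege mechanism, as in the sketch below. The account names are hypothetical, and the exact GRANT/REVOKE syntax differs slightly across products.

-- End users and reporting tools receive no direct access to the table.
revoke all privileges on supplier from reporting_user
;

-- Only the account used by the service that owns the data may touch it.
grant select, insert, update, delete on supplier to crm_service
;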

For best practices on modeling, see Chapter 8, Agile Modeling.


Data Validation Should Occur at Multiple Tiers within an N-Tier Architecture

Data can be represented in many forms, including EDI streams, XML documents, and SOAP messages. Validation in these scenarios should occur before data are sent across the network. This reduces errors within a distributed architecture and can increase performance in many situations. For example, if a GUI-based application captures user-entered data in a data entry screen, the data can be corrected before they are written to the database. In a distributed environment, it may be more efficient to validate the data within the GUI rather than having them travel through multiple, geographically dispersed systems, only to be rejected by a database constraint.
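
The database tier's share of that validation can often be expressed declaratively. The sketch below adds two illustrative constraints to the supplier table; the permitted status codes are hypothetical.

-- The same rules the data entry screen enforces, restated as constraints,
-- so that bad data is rejected even if an upstream tier misbehaves.
alter table supplier add constraint ck_supplier_status
      check (status in ('AC', 'IN'))
;

alter table supplier add constraint ck_supplier_suppid
      check (suppid > 0)
;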

For best practices on validation in the presentation layer, see Chapter 9, Presentation Tier Architecture.


Data Should Be Replicated Based on Business Requirements and Only When Necessary

Our next best practice states that it is better to have a single version of a data source whenever possible. The only viable reasons for maintaining a replicated environment are either to increase performance or to create a decision support database that mirrors an online transaction processing (OLTP) database.

A data architecture that includes a replicated data source is much harder to design and maintain than a design that assumes a single stable location for data. Likewise, if data that are replicated are also updated frequently, the potential for latency and the potential for services to return inaccurate data are introduced.

A replication strategy is appropriate when users in different geographic locations need similar data that are not required to be current and a central data source is not possible. In such a situation, your business requirements should address nonfunctional quality attributes such as data availability, recoverability, freshness (near real-time versus 1 week old), and so on.

Replicated Data Should Be Read-only

It is preferable that data architecture includes an authoritative source for data. This is where all updates should be directed. Allowing users to update replicated data unnecessarily complicates the replication environment and introduces potential data integrity problems. In a distributed environment, using a single source to guarantee data consistency is easier than managing updates in a multiple data source environment.

For example, if you have a replicated database that can be updated from multiple locations, you must consider the expected behavior when two people update the same record at the same time. Possible outcomes could include the following:

  • Allowing the first update to succeed and undoing the last update (FIFO)

  • Allowing the last update to overwrite the first update (LIFO)

  • Canceling both updates

  • Moving both updates to another table and allowing an administrator to determine the outcome

Obviously, multiple variations on the above outcomes are possible. What is important to note is that they will make your data architecture a lot more complicated than it needs to be. By utilizing an authoritative source, other data sources can synchronize with the source of record, ensuring consistency.
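
One common variation, sketched below, detects conflicting updates at the authoritative source by carrying a row-version column; the column and the values are hypothetical.

-- The update succeeds only if the row has not changed since it was read.
-- If another location updated the row first, zero rows are affected and the
-- caller knows a conflict occurred and can resynchronize before retrying.
update supplier
   set phone = '860-555-0147',
       row_version = row_version + 1
 where suppid = 1001
   and row_version = 7
;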

The Replication Topology Implementation Should Meet Existing Business Needs

Business requirements should define the following:

  • Acceptable lag times

  • Transformation

  • Monitoring

Documenting the acceptable replication lag times in a data architecture allows service level agreements to be established and helps determine the appropriate replication schemes and schedules. Data replication from an OLTP system to a decision support system (DSS) usually involves data transformation. The business requirements should clearly articulate transformation requirements. This will allow an architect to determine any additional processing overhead on both the source and destination data sources and to make appropriate adjustments for performance and scalability.

Additionally, the business requirements should outline any monitoring requirements. Sound enterprise architecture will have a systems management component that will address configuration, monitoring, and administration.

Location Transparency of Data Should Be Encouraged for Accessing Applications

In an N-tier architecture, data access will usually occur through some form of middle tier (Figure 11-8). This frees applications from having to know the location of their data sources. By introducing this principle into an architecture, one can restructure databases by changing schemas on the fly on backup sources, using promotion/demotion schemes, without the application knowing this occurred. For disaster recovery and high-availability situations, the underlying data source can be taken offline and restarted in another location, also without affecting the application. Furthermore, one can use this strategy for replatforming data sources, such as moving from Oracle on Solaris to Microsoft SQL Server on Windows 2000.

Figure 11-8. Separation of tiers.


For example, a client should never send SQL requests directly to a database. It is preferable that a client send requests to services that handle processing. The service will receive the request from a client and send a message to the middle tier. The middle tier in turn will send an SQL call to the database. This approach means the client does not need to know data-specific language constructs, whether SQL, XML, or proprietary. The client is simply responsible for sending a request for work.

The Command Pattern as outlined by the Gang of Four should be used. See Design Patterns: Elements of Reusable Object-Oriented Software by E. Gamma, R. Helm, R. Johnson, and J. Vlissides.


Implement Tools, Processes, and Policies to Ensure Data Integrity

The responsibility for data integrity resides in the hands of both business and information technology personnel. Business users should determine functional security requirements, and IT should provide the mechanisms that guarantee that these requirements are met. It is the primary responsibility of IT to safeguard all data under its control.

The principles of systems management will ensure that all data are protected against corruption and loss by providing a mechanism to back up and restore data to an intact state after a system failure. Good backup and recovery strategies are mandatory so that data can be recovered in a timely manner regardless of the cause of loss. Systems management may also require reserving a portion of bandwidth between locations or even providing redundant connections to support replication in a disaster recovery environment. If the replicated/distributed data are mission-critical, it becomes important to make sure the data architecture considers performance and survivability under normal circumstances as well as during degraded operations. In the latter situation, it may be necessary to record the flow of data across tiers (e.g., breadcrumbs).

When Designing a Database for Performance, Consider Total Performance and Not Just Local Performance

Performance should be viewed holistically. The performance of your data tier should be considered in conjunction with the capabilities of other tiers, such as the network and application tiers. Sometimes, performance of the data tier can be increased by employing caching strategies at other tiers, using compression techniques at the network level, and so on.

However, when tuning the data tier for performance, consider the following whenever possible:

  • Limit the number of indexes in a database.

  • Limit ad hoc data access.

  • Reduce the number of rows in a result set.

  • Limit the number of joins.

  • Avoid sorting.

For relational databases, an index is updated each time a record is updated. Logically, if a record has one corresponding index, the update will occur faster than if the record had ten corresponding indexes. By limiting ad hoc data access to a database, one can make data access patterns more deterministic and predictable. Ad hoc access can have an impact on the performance of a database.

Reducing the number of rows in a result set, or at least limiting the amount returned to the calling tier, can also increase performance. In an OLTP system, users typically work with a single row displayed on the screen or with a couple of rows displayed within a grid or list box. Returning only the rows and columns that are needed at that instant will reduce the amount of network traffic, transformation, and other forms of processing associated with the request.

Limiting the number of joins in a query will also have a profound effect on the performance of the data tier. With fewer tables involved, the optimizer algorithms that many relational databases use can look at a smaller set of indexes and create a better data access plan. Sorts should be avoided on large amounts of data whenever possible. Additionally, sorting should occur only on indexed fields.
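
The sketch below pulls several of these suggestions together. The row-limiting clause shown (FETCH FIRST) is the SQL standard form; individual products use variants such as LIMIT or TOP.

-- Return only the columns the screen displays and only one page of rows,
-- sorted on an indexed column (the primary key).
select suppid, name, city, state
  from supplier
 where state = 'CT'
 order by suppid
 fetch first 20 rows only
;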

For additional information on mapping, refer to www.agiledata.org/essays/mappingObjects.html.


Prefer Open Standards to Proprietary Extensions

The temptation to use vendor-specific extensions to SQL should be avoided at all costs. Advocate the use of ANSI-standard SQL, rather than proprietary extensions, when accessing relational database stores. This concept applies to other forms of databases as well. For example, an XML database should be accessed exclusively using XPath or XQuery.
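
A small example of the difference: Oracle's NVL function versus the ANSI-standard COALESCE, which any compliant database will accept.

-- Vendor-specific (Oracle-style) null handling:
--     select suppid, nvl(phone, 'unknown') from supplier;

-- ANSI-standard equivalent, portable across relational databases:
select suppid, coalesce(phone, 'unknown') as phone
  from supplier
;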

For additional information on XPath and XQuery, see XQuery: Kick Start by J. McGovern, C. Kagle, P. Bothner, J. Linn, and V. Nagarajan.


By utilizing a separate data access tier, a data architecture can minimize the impact on calling applications, since the APIs used by higher tiers do not change. The implementation of stored procedures is another form of proprietary access to a database that should be avoided. Stored procedures are specific to the database implementation in use and are difficult to migrate if a different database implementation is required.

For more information on implementing a separate data access tier, see www.agiledata.org/essays/implementationStrategies.html.


Those familiar with Java and J2EE will see this principle represented by Enterprise JavaBeans and will notice the preference of the Java community for container-managed persistence, which does not support the use of stored procedures. The Java community is also slowly adopting APIs such as the Java Data Objects (JDO) specification, which allows for data store independence.

For a great book on Java Data Objects, see Core Java Data Objects by Sameer Tyagi, Michael Vorburger, Keiron McCannon, and Heiko Bobzin.


The use of database triggers is another form of proprietary database extension. It is a good practice to use database triggers for the sole purpose of supporting referential integrity and no more.

Protect Credit Card Information Using Encryption

Credit card information is of special concern to enterprise architects as these data are usually the most vulnerable. Ideally, the protection of credit card information starts with the protection of the entry point into your organization. This may be in the form of a Web site that should be protected using a secure sockets layer (SSL). For one-time transactions, the credit card information should never be stored. Instead, it is preferable to pass this information to the payment gateway (provided by a bank) that the enterprise uses. Every transport used between systems should also support some form of encryption.

At times, applications must process recurring charges, which mandates storing credit card information. In this situation, it becomes mandatory to encrypt the credit card number in the database. Likewise, related personal information, such as name and address, should be stored in a database separate from the credit card information. To further increase the security of the data, consider using separate user IDs and passwords for the database that stores personal information and the database that stores credit card information. Ideally, neither database should use default database passwords.
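
As a rough sketch only, the separation described above might look like the following. The PAYMENTDB database and the table and column names are hypothetical, type names vary by product, and the card number would be encrypted by the application (or a vendor facility) before it is written. The vault database would use its own, non-default credentials.

use PAYMENTDB
;

-- Card vault kept apart from CRMDB, where names and addresses live.
-- Only the surrogate card_token is shared between the two databases.
create table card_vault (
      card_token char(36) not null,
      card_number varbinary(256) not null,    -- encrypted before insert
      expiry_month smallint not null,
      expiry_year smallint not null,
      constraint pk_card_vault primary key (card_token)
)
;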
