Chapter 3. Architecture of a Data Warehouse

Introduction

Data warehouses have evolved because, in order for businesses to remain competitive in the marketplace, they need access to a wealth of information to help them make the right business decisions. To assist with those decisions, data may go back for many years and could entail keeping the details of every item that a business ever sold. Performance optimizations for update-centric OLTP systems are well understood; however, data warehouses, being query-centric, have vastly different requirements, and a single business intelligence query may need to retrieve and aggregate many records from the warehouse.

In order to understand how the database can efficiently access this large amount of data, we need to look at a more fundamental aspect of the warehouse implementation: the technical architecture and physical hardware. Good performance in a data warehouse, and the ability to retrieve and process the data quickly, is dependent on a sound physical database design, which must be supported by a solid foundation of server and infrastructure hardware. This hardware platform, in combination with specific features and techniques within Oracle Database 10g, can be used to significantly improve query performance in a data warehouse. This chapter introduces the technical architecture of a data warehouse, discusses the significant changes that Oracle has implemented with 10g, and shows how they can be beneficially deployed in the data warehouse. In Chapter 4, we will look in more detail at physical database design techniques, such as partitioning, and how they can use the strengths of the underlying architecture.

An important component of a data warehouse architecture is its ability to scale. A data warehouse will grow as users and reporting requirements increase and as more data is loaded to address new business areas. The architecture must be able to handle this growth and process the new data without any detrimental impact on query response times for our growing user community. Growing our architecture normally means that we will need to add more processors to handle the increased processing requirements, more memory to accommodate the extra processes, and more disks to handle the larger data volumes.

There are various approaches that can be used to scale a system. Many servers scale simply by allowing more processors and memory to be added, though, ultimately, there is a physical limit to what can actually be added into the server box. An alternative method of scaling is clustering, where multiple, possibly smaller, servers operate together in a coordinated fashion to service the increased demands. Oracle provides the Real Application Clusters (RAC) technology for clustering the database, i.e., to have more than one set of database processes executing on separate servers but operating together as a whole, with an effective pooling of the separate server resources. Oracle Database 10g includes significant enhancements to the RAC technology and uses the RAC resources in an optimal fashion.

We will discuss RAC in more detail later in this chapter and show how RAC helps the scalability and robustness of our data warehouse environment. Next, we will look at the primary hardware components of our architecture.

Hardware Configurations for a Warehouse

Correct utilization of the available hardware is paramount to being able to run the resource-intensive queries found in a typical production data warehouse. As in the building industry, a solid foundation is critical.

Server Architectures

The Single-Processor, Single-Disk Architecture

In its simplest hardware deployment, the database can be implemented on a single-processor server, as shown in Figure 3.1. This configuration may be perfectly viable for some small data warehouses or data marts, can provide valuable benefits, and should not be dismissed. However, it is obviously not going to run a large data warehouse and support the demands of a large user population requiring a quick response time to their queries.

A Single-Processor Server

Figure 3.1. A Single-Processor Server

This architecture has a number of inherent problems, not least of which is the risk to the data with this deployment, because it is dependent upon a single disk with no built-in redundancy. In addition, because there is only one CPU, there is no facility for true parallelism in the warehouse, where more than one process is actually executing concurrently. There is a single CPU running a single process at a time, pulling data from one disk through one I/O channel from disk to server. Adding more disks does not necessarily enable any improvement if the queries that are running at peak times are already saturating that I/O channel. Therefore, we need to see how we can improve the delivery of the data from the disks.

The Multidisk Architecture

Although the single-disk architecture will suffice for a very small data warehouse, it is obviously unsuitable for a typical data warehouse with very large volumes of data. Now we need to look at a configuration that uses multiple disks and a higher-capacity I/O channel between our disks and our server in order to transfer more data. An example of this is shown in Figure 3.2, where multiple separate controllers are used to interface with our multiple disks. The advantage of this approach is that more disks can be used to provide the data to answer the users’ queries, and the larger I/O channel enables more data to be transferred between the disks and the server at the same time. There are a number of technologies that can connect multiple disks to our server, from SCSI and Ethernet to Fibre Channel, and there are different architectures, such as storage area networks and network attached storage, for utilizing this connectivity. We will look at these in more detail later in this chapter.

A Single-Processor, Multidisk Architecture

Figure 3.2. A Single-Processor, Multidisk Architecture

We also need to consider carefully how we use our multiple disks. For example, where do we put the data files for each of our tablespaces? Traditionally, a DBA with a number of separate disks will try to split the tablespaces onto different disks so that their I/O requirements don’t interfere with each other. A typical approach used by many DBAs is to place index tablespaces and data tablespaces on separate disks. In this way the I/O load is spread over multiple disks, improving performance.
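
For example, the traditional separation might look something like the following sketch, where the drive letters, file names, and sizes are purely illustrative and the EASYDW naming follows the examples used elsewhere in this book:

SQL> -- data tablespace on one disk (D:)
SQL> CREATE TABLESPACE easydw_data
       DATAFILE 'D:\oradata\easydw\easydw_data01.dbf' SIZE 500M;

SQL> -- index tablespace on another disk (E:)
SQL> CREATE TABLESPACE easydw_idx
       DATAFILE 'E:\oradata\easydw\easydw_idx01.dbf' SIZE 250M;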

However, even with careful placement of tablespaces to avoid heavily loaded disks, we can still get hot spots, where certain disks are always heavily in demand and other disks are rarely used; what we really want is to smooth the I/O out across all of the disks. The sheer size of a typical data warehouse will necessitate a large number of disks, which quickly makes this type of manual placement an impractical, error-prone, and demanding chore.

A solution to these problems is to use some form of RAID. This will transparently split our data files across multiple disks, improve response times, and also provide a higher level of protection against disk failure. Oracle Database 10g introduces an important feature called Automatic Storage Management (ASM), which can be thought of as RAID within the database. We will look at RAID and ASM in more detail later in this chapter.

The Multiprocessor Server Architecture

Simply adding more disks is not sufficient for our large data warehouse if we only have a single processor handling all of the user workload. We may have multiple user queries being executed at the same time, but this parallelism is actually an illusion if we have only a single processor, because the processor can only really execute one process at a time. It provides the illusion of parallelism by swapping between, and executing small increments of, the processes in very quick succession. Adding more processors to the server enables Oracle to actually run more of the operations truly in parallel. This can mean executing Oracle-specific processes, running multiple separate users’ queries, or even just a single query that the database has transparently split among multiple processing threads operating on different parts of the warehouse data. Adding more disks and improving the I/O channels between the disks and server improves the ability of these multiple processing threads to concurrently access more data.

The server architecture that uses multiple CPUs able to address a common memory structure is called a symmetric multiprocessor (SMP) architecture. The operating system runs across all CPUs concurrently and schedules and load balances its separate processes on all of its CPUs. A normal dual-processor PC is an example of an SMP server. Figure 3.3 shows a four-processor SMP system, where each processor accesses the same system memory via the system bus.

A Multiprocessor, Multidisk Architecture

Figure 3.3. A Multiprocessor, Multidisk Architecture

Let’s look at an example of why multiple processors are required. The Oracle database can run a query by transparently splitting it into a number of cooperating parallel processes, where each operates on a different portion of the warehouse data. Executing these processes in parallel needs multiple processors if any speedup is to be achieved. In addition, this parallelism in the database can be better utilized by physical database structures such as table partitions. However, the improved ability to use parallelism can only be realized if it is supported by the underlying hardware: multiple processors to provide the processing capability and handle the multiple process threads, more memory to support the multiple parallel processes, and multiple disks to maximize the available I/O and deliver the data in parallel to those parallel processes.
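
As a simple illustration, a query can be given a degree of parallelism either with a hint or by setting a default degree on the table, and the database will then split the work among that many parallel processes when the resources are available. This is only a sketch; the SALES table and the degree of four are hypothetical.

SQL> -- ask for four parallel processes for this query only
SQL> SELECT /*+ PARALLEL(s, 4) */ prod_id, SUM(amount_sold)
     FROM   sales s
     GROUP  BY prod_id;

SQL> -- or set a default degree of parallelism on the table
SQL> ALTER TABLE sales PARALLEL 4;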

All of our server architectures so far have been SMP architectures, which are limited in their ability to scale by the internal system bus: it must handle all of the data traffic, and the contention and demand that the processors place upon it also limits how many processors can be added. It is unusual to see SMP servers with more than 16 to 32 processors, because going beyond that necessitates a more sophisticated, and therefore more expensive, internal system bus design.

Clustered Servers

Another architecture to consider is clustering, which involves multiple independent servers that work together to perform a common set of functions and may appear to client users and applications as a single server. The servers are physically connected by a network. We can identify two groups:

  • Those that have disks shared and accessible to all servers

  • Those that are “shared nothing” and do not have disks shared between the servers

In Figure 3.4, we have shown a simple two-node cluster that is accessing a common disk bus. We have shown our cluster nodes as having only two processors, but, of course, there is nothing preventing them from having more, as previously discussed. There are a number of technologies to share and connect external disks to our servers, and we will look at some of these in more detail later in this chapter.

A Two-Node Cluster Using Shared External Disks

Figure 3.4. A Two-Node Cluster Using Shared External Disks

Clustering provides the benefits of:

  • Improved scalability. Within limits, new nodes can be added to scale the architecture.

  • Improved availability. If one node fails, then the other nodes are available to take over.

Massively Parallel Processor Architectures

Another alternative architecture is the massively parallel processor (MPP) architecture, where separate nodes or servers, each with its own CPUs and memory, and each of which may or may not have its own disks, are connected to each other by a high-speed interconnect. Each node runs its own operating system, and the application typically coordinates its processing across the nodes in an MPP architecture utilizing the inter-node connectivity. Figure 3.5 shows just seven nodes in an MPP architecture using a single common interconnect, but large systems can have hundreds of nodes, which can be interconnected in a variety of different topologies.

A Massively Parallel Architecture

Figure 3.5. A Massively Parallel Architecture

The Oracle Database Architecture

Before we proceed, now is the time to explain some of the various components of the Oracle database architecture.

The Oracle Instance and Database

An Oracle database is the set of database files that comprise the data warehouse, including the data files, control files, and redo log files. An Oracle instance, as shown in Figure 3.6, consists of the memory area (known as the System Global Area or SGA) and background processes—for example, SMON, PMON, and ARCH. The background processes access data files and manage user sessions. Background processes are shared by all database users.

The Oracle SGA and Background Processes

Figure 3.6. The Oracle SGA and Background Processes

The background processes each perform a specific set of tasks for the database. For example:

  • SMON is the System Monitor and, among other things, performs instance recovery, cleans up the database transactions, and tidies up space utilization.

  • PMON is the Process Monitor and cleans up failed user processes and their resources.

  • DBWn are the Database Writer processes, and they write the changed data blocks from the SGA cache to the data files. There can actually be more than one of these background processes.

  • LGWR is the Log Writer Process and is responsible for writing to, and managing, the redo logs.

  • ARCn are the Archiver processes, which copy completed redo log files to a separate disk location.

  • CKPT is the Checkpoint Process. When a special database event called a checkpoint occurs, this process synchronizes all of the headers of the data files with the new checkpoint information.

This is only a partial list. There are quite a number of Oracle processes, and each has its own area of responsibility and interacts with its own part of the Oracle database. For example, only the DBWn processes are responsible for writing out to the data files. A diagram of a single instance on a server and an overview of the background processes is shown in Figure 3.6.
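
If you want to see which background processes your own instance is running, one quick way is a query such as the following against the V$BGPROCESS view; the PADDR filter is a common idiom to restrict the output to processes that are actually running.

SQL> SELECT name, description
     FROM   v$bgprocess
     WHERE  paddr <> '00';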

When an Oracle instance is started, the database initialization parameter file is read, the SGA memory area is allocated from the server’s main memory and configured, the background processes are started, and the database files are opened. The SGA is where the important shared information is held in memory while the instance is running. It contains the buffer and dictionary caches used internally by Oracle and the redo log buffer. However, it contains more than just data; it also contains the memory regions for the shared pool for SQL and PL/SQL and the Java pool for Java objects.
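
You can see the overall sizes of these SGA areas from SQL*Plus once you are connected to the instance; for example:

SQL> SHOW SGA

SQL> -- the same summary as a query
SQL> SELECT name, value FROM v$sga;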

The Scalable Database: Oracle Database 10g RAC

When one database and instance are running on a single server, even on a very powerful multiprocessor server with a lot of memory, performance is still inherently limited by the server itself and by the number of processors and memory that it can contain. In order to scale our database beyond one server, we need to be able to cluster the database using Oracle Database 10g RAC.

Clustering the Oracle database requires more than one Oracle instance to access the same database data files and therefore uses the shared disk type of server cluster we identified previously. A clustered Oracle database offers a number of benefits:

  • More CPU and memory resources are available for running the queries.

  • The architecture is more robust: if one server falls over, then the others are available to continue to provide some system availability, albeit a reduced one.

The simplest example of this is shown in Figure 3.7, with two servers, each with their own CPUs, running their own database instance but both instances accessing the same database files stored on disks shared between the servers.

Overview of a two-node Oracle RAC Architecture

Figure 3.7. Overview of a two-node Oracle RAC Architecture

RAC was introduced in Oracle 9i as a replacement technology for Oracle Parallel Server. A key component of the Oracle RAC architecture is Cache Fusion. This uses a dedicated, high-speed interconnect to coordinate nodes in a RAC cluster and to keep data in each node’s memory synchronized. Cache Fusion effectively enables the individual caches from the separate instances to act as a single cache.

Cache Fusion significantly improves the ability for clustered database instances to scale. Prior to Cache Fusion, one of the major problems involved with more than one Oracle instance accessing the same database files concerned the interinstance communication, such as for passing Oracle data blocks between nodes. For example, if the same disk block were being accessed and updated by two separate database instances on different servers at the same time, then, in order for the second instance to access the block, it had to be transferred, via disk, from the first instance. Block pinging (as this was known) could occur repeatedly, since blocks were frequently written to disk in order to transfer them between instances and disk accesses are relatively slow in a database. To solve this without Cache Fusion necessitated clever partitioning of the data, often by splitting the user community between the two instances by the data that they were accessing. Partitioning in this fashion was not often a realistic possibility, was time consuming, and required skilled planning.

Cache Fusion, with its dedicated high-speed communication link, improves the communication between the instances. A number of technologies can be used for the interconnect. For example, the simplest, but still effective, one is high-speed Gigabit or 10 Gigabit Ethernet. If Ethernet is used, then the Cache Fusion network should be dedicated to just the interconnect traffic and kept separate from the Ethernet network used for the normal LAN that connects the users to the servers. There are also other new technologies, such as InfiniBand, that offer low-latency, high-speed interconnectivity. This improved, faster communication of the blocks between the instances in a RAC cluster removes the need for the slower block communication via the disks and therefore almost always removes the need for the careful partitioning of the data. This enables real applications, i.e., normal third-party applications and not just highly tuned and balanced ones, to be deployed on RAC and to fully benefit from the scalable architecture.

Scaling the RAC architecture involves adding a new server and Oracle RAC instance into an existing cluster. This makes a new set of CPU and memory resources available to the cluster. As new nodes are added, the RAC cluster exhibits very good scalability, and queries are transparently run across all of the nodes in the cluster to utilize any free resource on other nodes.

An additional benefit to our warehouse architecture from Oracle RAC is in the improved reliability of the whole environment. With RAC, instead of a single instance on a single server accessing our database, we have multiple instances on multiple servers. So, if one server fails, we still have the other servers in the RAC cluster that are able to pick up and continue the failed database sessions. For a production system that cannot afford downtime due to server failures, this is a very attractive and important feature.

Oracle Database 10g Grid

With Oracle Database 10g, Oracle has based its Database Grid upon the RAC architecture concept and has rearchitected many of its components as a consequence.

Declining hardware costs, leading to the increased availability of powerful, lower-cost servers, coupled with easier access to shared storage within the enterprise architecture, make the use of interacting servers and clusters cheaper and therefore more likely. This means that isolated servers, with their isolated pools of storage and their own application systems, can be consolidated into a lower-cost cluster offering better performance and reliability. This has evolved into the grid concept in Oracle Database 10g.

But plugging many servers and shared storage together to create a large cluster does not automatically provide you with the computing power that you need or use the computing resources to their best effect. The ability to harness and manage the power of this grid of multiple servers and storage comes from management software sophisticated enough to specifically address the issues that this type of environment presents. Oracle Database 10g Grid is a new, integrated software infrastructure that enables the improved management and use of resources provided by a grid of servers and storage.

Previously, we have been administering our database using Enterprise Manager Database Control, but the new Oracle Enterprise Manager Grid Control administers and manages multiple database components in the grid as well as Oracle Application Server 10g components. The 10g Grid management by Enterprise Manager provides new features, such as:

  • A unified management environment. A single environment to manage database servers as well as Oracle Application Server 10g servers, to manage groups of servers as easily as you could manage a single server, and to manage storage across the grid.

  • Improved management of the grid by simplifying and improving the management of the individual database nodes in the grid. Oracle Database 10g includes a whole new framework for monitoring, advising, and managing the administration of the separate databases.

  • Software provisioning to automate the installation and configuration of the Oracle software across the multiple nodes in the grid, and also to automate the application of patches and upgrades.

  • Dynamic Provisioning, which enables the balancing of the computing demands in one part of the grid with the availability of resources in another part by the use of policies to control and balance the allocation of resources.

  • Integrated software. The new Oracle Database 10g clusterware, which enables the services and communication between servers in a cluster, is provided for all operating system platforms and eliminates the need to purchase and integrate with third-party clusterware. Having one software version from one vendor simplifies the environment and reduces the possibility of cluster failure.

We will examine Enterprise Manager in more detail in Chapter 11.

Hardware Components

Now that we have a better understanding of the different types of architectures that we need to build our data warehouse, let’s take a step back and consider what our warehouse is trying to do:

  1. Support multiple users

  2. Execute large queries in parallel, with each query possibly split into smaller parallel threads automatically by the database

  3. Access very large volumes of data

  4. Load large quantities of data from operational systems

Therefore, the next step is to determine the requirements for the following three important components of our architecture:

  • Memory

  • Processing power (i.e., CPUs)

  • Storage

We will now have a look at these components in more detail.

Memory

Our database will be servicing queries from users, running reports, and executing batch jobs to refresh the warehouse. Each of these queries may be split into multiple parallel processes by the database, and each of these processes will require a certain amount of memory. The main areas here are the memory required for:

  • The SGA, for storing all of the cache of the database data blocks and other memory structures

  • The Program Global Area (PGA), which is a private memory space used by the Oracle code for each user’s server process, for example, for sorts and hash operations

On top of the memory that Oracle needs, we must also ensure that there is sufficient memory for the operating system and for any other applications that must also execute on our warehouse server.

Oracle Database 10g has the new Automatic Shared Memory Management for dynamically tuning the memory used by the SGA and PGA (which we will look at in more detail in Chapter 10). Memory requirements can grow quickly and become large. Imagine 50 users running parallel queries, where each parallel process may require around 10MB of memory. If each query were split into four parallel threads, then the server would require 50 × 4 × 10MB, or roughly 2GB, of memory just to support these 50 users.
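
With Automatic Shared Memory Management, much of this tuning can be driven by two parameters. The following is only a sketch with illustrative sizes; it assumes an spfile is in use (for SCOPE=BOTH) and that SGA_MAX_SIZE is large enough to accommodate the target.

SQL> -- let Oracle manage the individual SGA pools within a 4GB total
SQL> ALTER SYSTEM SET sga_target = 4G SCOPE=BOTH;

SQL> -- target roughly 2GB in total for the PGA work areas (sorts, hash joins)
SQL> ALTER SYSTEM SET pga_aggregate_target = 2G SCOPE=BOTH;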

Processors

We have looked at the different server architectures that enable us to have different numbers, and configurations, of processors, but what about the processors themselves? It is not necessarily the case that we must have the newest, fastest processors. Significant warehouse performance can be obtained from using servers with processors that are not the fastest ones available, particularly if we also deploy them in a clustered architecture using RAC, which minimizes the contention and use of the internal system bus on a single server.

An equally important criterion for a processor is the amount of memory that it can address. Thirty-two-bit processors can address up to 4GB of memory, but 64-bit processors can address significantly more. Even though a database is unlikely to ever need the maximum addressable space, many databases can benefit from the increased memory capacity provided by 64-bit addressing.

Storage Configurations for a Warehouse

Now that we have seen the various architectures that are available to us to implement our data warehouse, the next step is to decide upon the storage requirements for our data warehouse.

The I/O Subsystem

One of the main objectives with our I/O subsystem is to keep the processors in our server constantly supplied with data when processing queries. With data warehouses, this is much more relevant than with OLTP systems, because in a warehouse the queries typically span a much larger set of the data, whereas in an OLTP system the data access generally targets specific records. This means that both the disk subsystem and the I/O subsystem must have sufficient bandwidth to transport the data to the warehouse server quickly enough to support the processing needs. This requirement is made more severe in a parallel environment, where multiple processes are accessing data at the same time.

The other factor that must be considered for the I/O subsystem is the latency of the technology used to connect the disks to the server. Latency is the time it takes a packet of information to travel from the disks to the server. Latency and bandwidth together define the speed and capacity of the interconnectivity. In addition to this, there is the latency of the disk itself, which is the time the disk takes to position the disk head over the required data block before the data can actually be read.

When considering the number of disks for a warehouse, you must take into account both the capacity and the maximum transfer rate. If we were to base our disks for our warehouse purely on the disk capacity and go for a small number of large disks, then we are inherently limiting the ability of the disks to provide the data. Small numbers of disks may have sufficient storage capacity but are unlikely to have sufficient I/O capacity to meet our needs. For example, if we chose four disks, each with a 320Mbps transfer rate, then the maximum data that can be transferred, even if everything is working ideally (which is probably not the case), is 4 × 320Mbps, or 1,280Mbps. Alternatively, if we select a larger number of smaller disks, then we have more disks, each able to run up to its own maximum data transfer limit to provide the data to the servers.

Following this simple rule, we should, therefore, be looking at using a larger number of smaller disks that can deliver the data that we want “in parallel.” However, the I/O subsystem should not be considered solely on the basis of the ability of the disks to provide data: the capability of the communication channel (the data bus and controllers) between the disks and the servers and the I/O channels at the server end are equally important. One rule of thumb often seen is that there should be at least two disks per processor. Considering that the size of a warehouse will typically be very large, we will probably be using many times that multiplier.

The capability to use disks in this fashion, in parallel, comes from the ability to stripe our data across many disks, and for that we need to take a brief look at RAID to understand how to use RAID and striping in our data warehouse.

Striping and RAID

Striping, or RAID, is the ability to spread I/O requests across multiple disks in parallel. RAID is an acronym for Redundant Arrays of Inexpensive Disks, and the basic concept of RAID is to combine small, inexpensive disks to yield improved performance and reliability while having them appear to the server as a single, larger disk. We touched upon this problem earlier in the chapter when we looked at adding more disks into our warehouse architecture; RAID provides a solution for better throughput and reliability by using multiple disks.

RAID is described in terms of levels, which progress from 0 to 10, and each level offers a different balance of performance and reliability. For example, rather than storing a file’s data on a single disk, the data is striped across multiple disks, which yields better read performance, because all of the disks can potentially be read in parallel rather than just one.

Striping is the process where the disks are split into separate sections and the sections are utilized in a round-robin style across the disks. So, for example, if your stripe size is 1M, then the first 1M of your file is written to disk 1, the second 1M stripe is written to disk 2, and so on. This splitting of the data into stripes that are written to different disks is done either at the byte level or the block level and is performed automatically and completely transparently to the applications, whether the file belongs to a word processor or a sophisticated database. Striping has the effect of reducing the contention for disk areas and improving I/O throughput to the server. Figure 3.8 shows a four-disk RAID array, where the stripe width is four and the stripe size is the amount of data written to each disk in turn.

Four-way RAID Showing Stripe Size and Width

Figure 3.8. Four-way RAID Showing Stripe Size and Width

The process of mirroring the data across multiple disks involves duplicating the information that we write to one disk by writing a complete copy to a second disk so we always have a copy of our data.

The RAID levels are:

  • RAID 0: striping (i.e., striping our files so that each successive stripe occurs on a different disk in a group of disks), which improves performance but provides no redundancy

  • RAID 1: disk mirroring, where an exact copy of each disk is automatically maintained. If one disk fails, then the data is available on the other disk until a new disk can be inserted and the copy remade while the system is on-line. The cost of this is that it doubles the number of disks that are required.

  • RAID 3: striping with parity at the byte level. Uses an additional, dedicated parity disk, where the parity data is simply a numerical value calculated from the data on the actual data disks, so that if any one of the data disks fails, then the remaining data in conjunction with the parity data can rebuild the lost data. The parity information is held on a separate, dedicated disk; for example, five data disks and one parity disk.

  • RAID 4: block data striping with parity. Same as RAID 3 but working at the block and not byte level.

  • RAID 5: block striping with rotated parity. Both the parity and the data are striped across the disks. This removes the need for a dedicated parity disk, because the parity information is rotated across all disks. Read performance is good, but write performance can be poor. At least three disks are needed for a RAID 5 array.

  • RAID 10 (often grouped with RAID 0+1): mirrored stripe sets, providing better protection from the mirroring and better performance from the striping.

RAID can be implemented either in hardware, via a RAID controller for the disks, or in software, via a tool called a logical volume manager (LVM). An LVM is a piece of software that could be provided either as part of the operating system, from a third-party vendor (e.g., Veritas), or from the disk storage vendor (e.g., from EMC). The LVM combines areas on separate disks and makes them appear as a single unit to the operating system, and in the process it provides software striping and mirroring. Performing RAID in hardware may be easier, but it has the disadvantage that it can be more difficult to change at a later stage (for example, if more disks are to be added), whereas performing RAID in software can take advantage of a number of powerful volume managers. Regardless of the mechanism, the I/O request from the server is transparently split by the controller or LVM into separate I/Os for the individual disks.

From a performance perspective, the DBA and systems administrator must carefully consider not just the type of RAID that they wish to employ and on which disks, but they must also consider items such as the RAID stripe size (i.e., the byte size of each stripe on the disks) and the number of disks in the stripe set.

A high-level perspective of the operations needed to use a volume manager to prepare the disks for Oracle can be seen in the following steps. Different hardware vendors will have their own volume managers, there are third-party ones, and different operating systems may take a slightly different approach, so the steps listed here are intended to be very generic (an illustrative Linux-based sketch follows the list). The DBA and the system administrator must:

  1. Create or initialize the physical volumes. Define the disk as a physical volume and write certain configuration information to it.

  2. Create the physical volume groups. This is the process where the physical volumes are collected together to create the physical volume group. A physical volume can only belong to one volume group, but a volume group may contain many physical volumes. During this step the physical volume is partitioned into units of space called physical extents: These are the smallest units of storage used on the disk, and an extent is a contiguous section of the disk.

  3. Create logical volumes in the volume groups. This is the step when striping and mirroring are implemented.

  4. Create the file systems on the logical volumes. The file system is the structure on the disk necessary for storing and accessing the data. Without the file system it is not possible to create directories and files. For example, the file systems can be FAT32 or NTFS, which will be familiar to Windows users, or Linux Ext2, Linux Ext 3, and Linux Swap, which will be familiar to Linux users.

  5. Mount the file systems. Make the file systems known and usable to the operating system.

  6. Then, finally, use the file systems for the database files (i.e., create the database tablespaces using files on these file systems).
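
As promised, here is a very generic sketch of those steps using the Linux LVM command set; the device names, volume names, sizes, and mount point are purely illustrative, and your own volume manager will have its own equivalents.

# 1. Initialize the physical volumes (hypothetical devices)
pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde

# 2. Create a volume group from those physical volumes
vgcreate dw_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde

# 3. Create a logical volume striped across all four disks
#    (-I is the stripe size in KB, here 1MB; mirroring could be
#    added separately or provided by the storage hardware)
lvcreate -n dw_lv -L 200G -i 4 -I 1024 dw_vg

# 4. Create a file system on the logical volume
mkfs.ext3 /dev/dw_vg/dw_lv

# 5. Mount the file system
mkdir -p /u02/oradata
mount /dev/dw_vg/dw_lv /u02/oradata

# 6. Use it for database files, for example from SQL*Plus:
#    CREATE TABLESPACE easydw_data
#      DATAFILE '/u02/oradata/easydw_data01.dbf' SIZE 500M;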

Stripe and Mirror Everything

After much research on the storage configurations that are optimal for the Oracle database, for both OLTP and warehouse workloads, Oracle recommends using the SAME (Stripe And Mirror Everything) method. This method involves all of the disks being mirrored and striped using a 1M stripe size, with all of the database files being placed on these disks.

  • By striping all database files across all disks using a 1M stripe size, the use of the bandwidth across all of the disks is maximized and the occurrence of disk hotspots and bottlenecks is reduced. The 1M stripe size has been carefully analyzed as a balance point between the time the disks take to position to the data location and the time to transfer the data (1M in this case).

  • By mirroring, we increase the availability of the database by reducing the risk due to data loss from disk failure.

Equally important, SAME is a very simple storage configuration concept to grasp and is independent of any third-party storage product; it provides significant benefits to our warehouse database.

Shared Storage

The architecture we select for the data warehouse will influence the type of storage that is needed. If we are using RAC, then we need to be able to access the disks from more than one server. We can use technologies such as SCSI, a bus technology that can connect up to 16 devices (i.e., disks in the case of our warehouse) to our warehouse server. SCSI comes in many forms and can offer transfer speeds up to 160 MBps (megabytes/second); shared SCSI enables the bus to be shared between two servers. We will now have a brief look at two alternative solutions, Storage Area Networks (SAN) and Network Attached Storage (NAS), which offer better capacity, scalability, and robustness.

A Storage Area Network, or SAN, can be viewed as a dedicated, high-performance network connecting servers and storage. An example of this is a disk array accessed using a technology such as Fibre Channel to link the disk array to the different servers (though other connection technologies can also be used).

Fibre Channel is a technology for transmitting data at rates measured in gigabits per second (Gbps) between the devices in our warehouse architecture (i.e., between servers and between servers and disks). In addition, Fibre Channel allows the devices to be separated by distances measured in kilometers, which enables a large physical separation of the hardware components for our warehouse, if required. It may help to think of the Fibre Channel (or other technology) as a specialist data bus between the servers and the shared disk storage devices. High-availability systems will use multiple Fibre Channel routes from the servers to the disks and use hubs and switches to minimize the chance of system failure if any part of the SAN fails.

An alternative shared storage technology is Network Attached Storage, or NAS, which can be viewed as a set of disks running from its own special server platform and accessed via a normal network to which it has been assigned its own network IP address. For our warehouse architecture we may prefer to keep the normal company LAN network traffic separate from the warehouse server to NAS traffic and therefore use a separate LAN network between our database servers and our NAS.

It can be seen that the SAN concept encapsulates NAS. The Ethernet-connected disk devices in a NAS architecture are just a very specific example of storage available on a dedicated network (which is what a SAN is). NAS performance, however, will be limited by the capacity of the network to move the data between the NAS storage devices and the servers; typically, this limitation is readily overcome by the network communication technology used in a SAN. In addition, newer protocols, such as iSCSI, perform much better than old-style NFS.

The use of SAN and NAS in the Oracle warehouse architecture separates the disk storage system from the servers. This also means that these technologies are not just for the single-server environment. NAS and SAN storage are attached to a network that can have multiple database servers attached to it; therefore, it provides the required shared storage for Oracle RAC.

An example RAC architecture with shared NAS storage is shown in Figure 3.9. Taking our simple RAC architecture one stage further for this example, we’ll base the interconnects on Ethernet, add in the shared storage using NAS, and include two separate connections between the servers and the NAS storage, so that if one fails, the other is still present to enable the RAC to operate. Of course, the same redundancy could be introduced for the network providing the RAC Cache Fusion interconnect as well. The reason that three separate networks are used (i.e., the normal network to connect to the users, the RAC interconnect, and the NAS storage network) is to ensure that the three different types of network traffic do not interfere with each other and degrade the overall system performance. Of course, our example is very simplified to demonstrate the point, and typical large warehouse production systems can be significantly more complex.

Overview of RAC and NAS Shared Storage

Figure 3.9. Overview of RAC and NAS Shared Storage

In this section, we have discussed the various hardware components of our architecture and how our storage can be configured and deployed. In the next section, we will see how a new storage feature in Oracle Database 10g can build upon the strengths of the RAID striping and mirroring.

Automatic Storage Management

Automatic Storage Management (ASM) is new in Oracle Database 10g and provides the control and management of disks in conjunction with a purpose-built file system for the database files. Essentially, you provide the disks to Oracle and specify the degree of protection with mirroring that you require (unmirrored, mirrored once, or mirrored twice) and ASM manages the rest. ASM will also perform the striping by controlling the layout of the blocks across the disks, and it will effectively balance the load for you automatically. Furthermore, when new disks are added, removed, or simply fail, Oracle ASM will automatically redistribute the blocks and the load so that the database is open and available all the time. On top of this striped and mirrored set of disks, a file system for the Oracle files is implemented.

Prior to Oracle Database 10g, if you wanted to spread your I/O across multiple disks, the DBA would work closely with the system administrator to carefully configure the underlying disks, for example, to provide the required RAID level that we have just discussed. While doing this, the administrator must plan and track the volume sizes, the stripe sizes, and the extent sizes. There are a large number of steps in this process, and the complexity increases with the number of disks being used, the number and size of the logical volumes, and the resultant file systems to be placed on these volumes. Generally, the bigger the database (and warehouses can get very, very large), the more disks and mount points required and, consequently, the greater the complexity.

This administration can quickly become a headache and a chore. What is worse is that if the database subsequently needs to be expanded at some point in the future, then the administrator will need to add more disks, and hence physical volumes, into an existing volume group. This process of rebuilding the underlying volumes may even necessitate our warehouse database being off-line, and the contents of certain areas of the database may need to be exported and reimported, which is a very undesirable situation.

The purpose of this task of implementing RAID is to stripe and mirror the underlying disks to get higher throughput and availability. It is easy to see that it can be a complex and labor-intensive task and, even with a knowledgeable and experienced administrator, it can still be prone to errors because of its manual nature.

Now, it would be unfair to paint an overly negative picture, because there are powerful logical volume managers and other tools available to help the administrator with these tasks. There are also tools available to assist in tracking where the database objects in our warehouse actually reside (on which physical disks—Oracle File Mapper in 9i assists with this, for example) and for adding new disks into the physical volumes. The good news is that ASM in Oracle Database 10g, fundamentally, does the work of an LVM and all of these tasks for you, and, even better, ASM is a standard part of the Oracle Database 10g server and not a separately installed option. Using ASM allows all of your disk infrastructure, configuration, and definition to be handled by software from one vendor, Oracle, without introducing another layer of software from a third party. You provide the disks to the database and ASM, and the database will stripe and mirror the disks to provide the level of redundancy and protection that you require.

ASM Overview

There are two main concepts in understanding ASM:

  • The disk group

  • The ASM instance

The disk group can be viewed as the basic unit of control of the disks within ASM and is the logical collection of disks that you want to be managed by Oracle. Disk groups are either created via a new SQL statement (which we will look at shortly) or from Enterprise Manager and are then used as the repository for Oracle database files. One thing to bear in mind is that all of the disks in a disk group should have the same characteristics in terms of size and performance (i.e., you want to group together the same type of disks with the same performance characteristics). When you create files on an ASM disk group, these are known as ASM files and the names are automatically generated by Oracle (though you can retain some control of the format of the names).

In order to use ASM, a special form of an Oracle instance, called an ASM instance, must be up and running. The ASM instance is responsible for:

  • Discovering and acquiring the ASM disks at startup

  • Managing the disk groups

  • Automatically providing the I/O balancing

  • Protecting the data within the disk group

An ASM instance is much smaller than a normal database instance and it does not have a data dictionary: it has the very focused task of finding the disks at startup, managing the disks, and presenting the disk groups to the main database instances (i.e., our actual warehouse instance). You can have multiple normal instances on a server, but a server will only contain one ASM instance.

One important step that must be performed when the disk group is created is to specify which of the following three redundancy methods is to be used:

  • External Redundancy—where Oracle will not mirror at all and mirroring is done externally and separately on the disk storage

  • Normal Redundancy (i.e., mirror once)

  • High Redundancy (i.e., mirror twice)

In order to use either of the ASM mirroring options (normal or high redundancy), more than one failure group must be specified. A failure group is some extra information that indicates to Oracle which disks may fail collectively due to a single fault. For example, consider Figure 3.3, which uses two disk controllers to access a set of disks; a failure group would be those disks that are accessed through the same I/O disk controller. The definition of failure groups enables Oracle to know how to store the redundant data required to protect the database data, or, in other words, which alternative disks are safe to use for redundancy (i.e., mirroring) because they are in a separate failure group. Therefore, for the normal redundancy option, there must be at least two failure groups, and for the high redundancy option, there must be at least three failure groups.
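
To make this concrete, a disk group created with explicit failure groups might look like the following sketch (the full syntax is covered later in this chapter); the failure group names and Windows-style disk paths are illustrative, with each failure group containing the disks behind one controller.

SQL> CREATE DISKGROUP data NORMAL REDUNDANCY
       FAILGROUP controller1 DISK '\\.\H:', '\\.\I:'
       FAILGROUP controller2 DISK '\\.\J:', '\\.\K:' ;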

Administering ASM

Now, we will show how to administer ASM either using the GUI interface or from within SQL*Plus.

Setting up ASM

In Oracle Database 10g, ASM is an attractive alternative to using the previously described Logical Volume Manager (LVM), so let’s see how easy ASM is to use to manage our disks. The first step is to ensure that the disks that you want to use are made known to Oracle as candidate ASM disks, and this is done by using the ASM Tool to stamp an ASM header on the disks. On Windows systems you will find this in the bin directory as asmtool.exe (the command-line version) or as asmtoolg.exe (the graphical version); on UNIX based systems you will find oracleasm in the /etc/init.d directory. Our examples are based on the Windows graphical version.

Hint

If you are creating an ASM-based database as part of the initial Oracle installation, you must still remember to use the ASM tool to stamp the disks; otherwise, they will not be visible to the installation program.

The ASM Tool marks the disks with a header, which identifies the disks as ASM disks: to do this the disks need to be configured with a single disk partition and without a file system. Figure 3.10 shows the initial ASM Tool screen, where we will begin by adding the header to our ASM disks.

The ASM Tool

Figure 3.10. The ASM Tool

Clicking the Next button displays a list of the disks with their different partitions and what ASM Tool considers to be potential candidates, as shown in Figure 3.11.

ASM Tool Disk Selection

Figure 3.11. ASM Tool Disk Selection

In Figure 3.11, we have four disks available, shown as Harddisk0 to Harddisk3, and ASM Tool has identified disks 2 and 3 (the last two in the list), which are actually two small SCSI disks and are the ones we want to use for ASM. Note that ASM Tool identifies and displays the file system on each partition, but it has marked three partitions as candidates. On our system, which is a dual operating system machine, the extra candidate partitions are actually Linux partitions and not ones we want to use for ASM. Generally, you must provide whole disks, and not partitions of part of a disk, to ASM, so you should not encounter this situation; in particular, any production system will probably not be a dual-boot system. However, this example makes the point that you do need to be very aware of your disk and partition layout when you are using ASM.

Select both disks, click Next, and you will see the screen in Figure 3.12, which shows the ASM stamp that will be applied to the disks. Note that the prefix stamp, DATA in our case, from the field in Figure 3.11, is incorporated into the ASM link name.

ASM Tool: Stamp Disks Confirmation Screen

Figure 3.12. ASM Tool: Stamp Disks Confirmation Screen

Configuring ASM during Installation

If we were creating the database as an ASM database using DBCA or the Universal Installer (during Oracle installation), then we could take a different choice for the storage option of the file system selection that we made in Chapter 2 (Figure 2.8). Now, we will choose ASM, as shown in Figure 3.13.

Database Creation: ASM Storage Option

Figure 3.13. Database Creation: ASM Storage Option

With ASM selected, clicking the Next button displays the Backup choice screen as before, but after that we get a different screen, shown in Figure 3.14, which enables us to select the ASM disks that we stamped with the ASM Tool earlier. Because we are performing an actual installation at this point, we will need to have previously invoked ASM Tool directly from our installation disk and stamped our disks so that they are recognized as ASM disks. For Windows installations, ASM Tool can be found on the installation disk in the ASM Tool subdirectory.

Configure Automatic Storage Management

Figure 3.14. Configure Automatic Storage Management

Here we have selected the External redundancy option, so no mirroring will be performed by ASM, and we can check the tick boxes for both of the disks that we previously stamped and that we want to be members of our new ASM disk group. The disk group that we are creating is named DATA, which we will also refer to later when we manage the disk group via SQL and create tablespaces in it.

When we click Next and continue with the installation and database creation, these two ASM disks will now be used to hold our database files.

Managing ASM from the Command Line

The ASM disks and disk groups can also be created, deleted, and managed manually, and to do this we need to be logged on to the ASM instance and not the normal warehouse instance. We will work through a simple example, but if you try to run these commands on a normal warehouse instance, you will get Oracle error ORA-15000, saying, effectively, that it is the wrong command for the instance type. This is because there is a new database initialization parameter introduced with ASM called INSTANCE_TYPE, which takes the value RDBMS for a database instance and the value ASM for an ASM instance. In our previous example, where we created a database using an ASM disk group, the ASM instance was created automatically for us as part of the database creation.

Before any ASM management tasks can be performed, you must first log on to the ASM instance. To log on to the ASM instance from the Windows operating system command line, you will need to set your ORACLE_SID environment variable to that of the ASM instance. By default, this takes the value +ASM.

Hint

Note the + in front of ASM. If you omit this, you will not be able to log on, because your SID will not be set correctly.

A disk group is created as follows:

C:\> SET ORACLE_SID=+ASM
C:\> sqlplus /nolog
SQL*Plus: Release 10.1.0.2.0 - Production on Thu May 27 19:51:13
2004

Copyright (c) 1982, 2004, Oracle. All rights reserved.

SQL> connect sys/password as sysdba

SQL> ALTER SYSTEM SET asm_diskstring = '\\.\*:' ;

SQL> CREATE DISKGROUP data
           NORMAL REDUNDANCY DISK '\\.\H:', '\\.\I:' ;

Here, the disk group is being created with the name DATA and with normal redundancy, i.e., a single mirrored copy of the data will be maintained in addition to the primary copy. The disk group is formed from the two disks, which are visible on our Windows system as drives H and I.

The new ASM_DISKSTRING parameter controls and limits the disks that Oracle discovers at ASM instance startup; it is normally placed in the ASM instance initialization parameter file. To illustrate its use, we have specifically set this parameter using a wildcard. The format of this string, as shown, is specific to the Windows platform and will be different for UNIX and other platforms: the \\.\ indicates the current server, and the *: indicates any disk on the server.

If we want to add a new disk into our disk group, then we issue the following statement, which adds disks J and K into the precreated disk group DATA.

SQL> ALTER DISKGROUP data ADD DISK '\\.\J:', '\\.\K:' ;

Then Oracle will automatically distribute our blocks and rebalance the load from the initial set of two disks, which we specified earlier when the database was created, to include the new disks. All the time, the database is kept open to the users.

Creating tablespaces is now simplified, because we just need to refer to the ASM disk group. For example, to create a tablespace on our new DATA disk group we issue the following. The + used in +DATA for the datafile name indicates that it is a disk group data file.

SQL> CREATE TABLESPACE easydw_asmtest DATAFILE '+DATA' SIZE 5M;
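
If you then want to confirm where the new tablespace’s data file actually resides, a simple dictionary query will show the ASM-generated file name within the +DATA disk group (the exact generated name will differ on your system):

SQL> SELECT tablespace_name, file_name
     FROM   dba_data_files
     WHERE  tablespace_name = 'EASYDW_ASMTEST';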

Furthermore, using ASM for your database doesn’t force the database to only use ASM. Oracle Database 10g enables a database to use both the standard file system method and the new ASM method at the same time. For example, previously we created our database as ASM during installation, but if we had created our database using the file system and subsequently decided that we wanted to start using ASM, then we would need to create the ASM instance manually. On Windows this is done using the ORADIM utility, which also creates the Windows service, and there are some new ORADIM parameters in Oracle Database 10g for ASM for specifying the ASM SID name and the ASM service name.

oradim -NEW -ASMSID +ASM -SYSPWD change_on_install
-PFILE c:\oracle\product\10.1.0\admin\+ASM\pfile\init+ASM.ora

The init.ora file for the ASM instance is much simpler, and, if the defaults for the parameters are used, it can be reduced to only the following parameters:

INSTANCE_TYPE = ASM
DB_UNIQUE_NAME = +ASM
ASM_DISKSTRING = '\\.\*:'
COMPATIBLE = 10.1.0.2.0
ASM_POWER_LIMIT = 1
USER_DUMP_DEST = 'c:\oracle\product\10.1.0\admin\+ASM\udump'
BACKGROUND_DUMP_DEST = 'c:\oracle\product\10.1.0\admin\+ASM\bdump'
CORE_DUMP_DEST = 'c:\oracle\product\10.1.0\admin\+ASM\cdump'

Once the ASM instance is created, it is accessed and used in exactly the same way as described previously, and an ASM-based tablespace can be created alongside our file system tablespaces without any problems.

ASM is an important new feature, which is easy to administer once you understand its deployment and operation. Two other areas of its administration are worth investigating to understand ASM further:

  • The ASM instance initialization parameters. There are a very small number of parameters that control how ASM operates. For example, ASM_POWER_LIMIT is used to prioritize the disk rebalancing operation: the higher the number assigned to this parameter, the higher the rebalancing priority (though a high priority can have a detrimental effect on other system operations due to the impact on processor and I/O resources).

  • Special dynamic performance (V$) views in the data dictionary for monitoring ASM operations and viewing the status of the disks and disk groups: for example, V$ASM_DISK, V$ASM_DISKGROUP, and V$ASM_OPERATION; a sample query sketch follows this list.
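For example, the following sketch shows the kind of queries a DBA might run against these views to check disk group space and to watch a rebalance in progress (the column selections are simply illustrative):

SELECT name, state, type, total_mb, free_mb
FROM   v$asm_diskgroup;

SELECT operation, state, power, est_minutes
FROM   v$asm_operation;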

We think that you will agree that ASM is very easy to administer from the command line, but, as we shall now show, it is even easier using Enterprise Manager.

Managing ASM from Enterprise Manager Grid Control

Now that we have a good understanding of how ASM operates and how it is controlled from the command line, we will have a look at how ASM is administered and controlled from Enterprise Manager.

To administer ASM, either the Enterprise Manager Database Control or the Grid Control version can be used. We are going to use Grid Control for our ASM examples in this chapter, and in Chapter 11 we will discuss more about how EM Grid Control is deployed, started, and used. To complete this section on ASM, we want to show how Enterprise Manager also supports ASM and the screens that complement what we have discussed so far.

From within EM Grid Control, you are able to drill down to view information on individual servers and then drill further to the Oracle software that is executing on the server. From the Hosts Target page within EM Grid Control, clicking the link that identifies an ASM instance will drill you down to the Home page (not shown) for that ASM instance, which displays the general status and alerts for the instance. From this page we can follow the Configuration link to display the screen shown in Figure 3.15, where you can see the parameters for the instance, some of which we have talked about earlier, and which can easily be set from this screen.

Figure 3.15. EM Grid Control ASM Instance Configuration

If the Administration link is followed, then we will see the screen shown in Figure 3.16, where the disk groups handled by this ASM instance are shown. Here we can see our DATA disk group, and we have expanded the pick list on the right to show the operations that can be performed on it. Mounting and dismounting a disk group makes it available or unavailable to the database instances running on that server.

Figure 3.16. EM Grid Control ASM Instance Administration

The third option involves the distribution of the file blocks across the disks in the group. When a disk is added or removed from the group, either intentionally or due to disk failure, ASM automatically redistributes, or rebalances, the file blocks. This Rebalance option enables you to manually perform a rebalance operation, though, because it is generally done automatically, you should rarely need to do this yourself.

By drilling down on the DATA disk group we see the screen in Figure 3.17.

Figure 3.17. EM Grid Control ASM Disk Group Members

This screen shows more information about the individual disks that make up the disk group: their path on the server, their status, and their capacity and free space. It provides a comprehensive, at-a-glance view of your disk group members.

The final screen in our quick tour of ASM via Enterprise Manager is shown in Figure 3.18, which shows the hierarchical contents of the disk group. Here you can expand the hierarchy to see the databases using the disk group, their folders, and the files that they have created. This screen also shows that ASM is as much about implementing a file system on the disk groups as it is about the mirroring and striping of the underlying disks.

Figure 3.18. EM Grid Control ASM Disk Group Files

You can see that Enterprise Manager supports the easy configuration of the ASM parameters, and its clear, logical screens allow you to traverse and examine the ASM information with ease. You can drill up and down the hierarchies of ASM information from the instance, to the disk group, to the disks, and to the actual database files in the file system on the disk group in order to fully appreciate the structure of your storage.

In this section, we have only touched upon ASM to demonstrate its power and ease of use. The important question is how ASM can be used to help us in our data warehouse, and this is what we are going to look at in the next section.

Using ASM in the Warehouse

Now that we understand what ASM is, how does it help us in our data warehouse physical architecture? Well, we can define different disk groups for the major areas of our warehouse:

  • Staging

  • Warehouse data

  • Warehouse indexes

  • Summary data and indexes

  • Flash recovery disk area (for backup and recovery files, which must be distinct from the disk areas for the database data, control, or on-line redo log files)

  • Temporary tablespace (for sorting)

  • UNDO

  • Redo logs and archived redo logs

One of the challenges in the data warehouse is determining the appropriate solution for each of these areas. For example, staging is where data is loaded from disk files and then cleansed and transformed prior to being loaded into the warehouse tables. Within staging we may not necessarily require mirroring, because if a disk fails the data on it can always be recovered from the source data files and reloaded. However, we will want the warehouse data to be mirrored (and striped) in order to protect the data that we have expended considerable time and resources loading into the warehouse. In addition, following the guideline of keeping disks with the same characteristics in the same disk group and not mixing and matching, it may be preferable to use the faster disks for the high I/O files (e.g., TEMP or UNDO), other disks for the warehouse, index, or summary disk groups, and the slower disks for the flash recovery disk group.
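As a sketch of this approach (the disk group names and Windows device letters here are purely illustrative and are not taken from our earlier examples), the staging area could use external redundancy, while the warehouse data is mirrored across two failure groups:

CREATE DISKGROUP staging EXTERNAL REDUNDANCY
  DISK '\\.\L:', '\\.\M:';

CREATE DISKGROUP wh_data NORMAL REDUNDANCY
  FAILGROUP fg1 DISK '\\.\N:', '\\.\O:'
  FAILGROUP fg2 DISK '\\.\P:', '\\.\Q:';

If a staging disk fails, its contents are simply reloaded from the source files, whereas the warehouse disk group always holds a mirrored copy in the other failure group.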

However, recall our earlier discussion on RAID and the existing Oracle recommendation for the SAME method, which stands for Stripe And Mirror Everything. In many ways ASM is an extension of the SAME concept, in that it transparently divides our Oracle data into 1MB chunks, which are automatically distributed and balanced across all disks in the group. The reasoning behind SAME holds true for ASM, and we should be defining one disk group into which we place all of our disks; this disk group then contains all of our database files. You may ask, in that case, why we need multiple disk groups if we are only going to use one. Recall that an ASM instance on a server manages the disks and disk groups for all database instances on that server, and using multiple disk groups can help to segregate and manage those disks that are specific to each of our database instances.

An exception to the rule about striping and mirroring everything in one disk group is the placement of the flash recovery area. The flash recovery area can reside on ASM disk groups (as an alternative to the normal file system), but, in order to protect our warehouse, it must be on a disk group separate from those used by the warehouse. This is a sensible additional precaution to avoid any possibility of a disk failure losing both the main warehouse data and the backup files. This is discussed further in Chapter 12.
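As a sketch (the +RECOVERY disk group name and the 100G quota are illustrative assumptions, not values from our earlier examples), pointing the flash recovery area at its own ASM disk group looks like this; note that the size parameter must be set before the destination:

SQL> ALTER SYSTEM SET db_recovery_file_dest_size = 100G;
SQL> ALTER SYSTEM SET db_recovery_file_dest = '+RECOVERY';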

An important consideration when using ASM is that the ASM file system manages only Oracle database files. If your data warehouse receives files from external sources, such as a flat file from another database or an application that is not integrated with the warehouse, these files will need to be managed via another solution, such as, traditionally, the operating system file system.

Finally, note that the use of ASM to protect your disks via mirroring does not remove the need for a carefully designed backup strategy and plan. An individual disk may fail and be recoverable because of mirroring, but if a disaster, such as fire or flood, were to befall the entire data center, then you would need a proper backup and recovery strategy in place in order to restore your warehouse onto another server environment.

File Management in Oracle

With ASM, Oracle manages the disks, but to complete the process of helping our DBA manage the many different types of files typically found in a data warehouse, we also need a better way to manage the files themselves.

Oracle Managed Files

With Oracle Managed Files (OMF), the database server takes on the management of the individual datafiles that comprise the database; this removes a considerable burden from the DBA. This is particularly true in a very large, complex warehouse environment, which may have many hundreds of files. In reality, OMF is not a true architectural consideration but a managerial feature to aid the DBA; however, in many ways the path to ASM started with OMF, so we are going to discuss OMF briefly in this chapter.

To use OMF there are a small number of initialization parameters that specify the default location of the files that you want Oracle to manage.

Initialization Parameter         Brief Description
DB_CREATE_FILE_DEST              The default file location for database datafiles
DB_CREATE_ONLINE_LOG_DEST_n      The default file location for database redo logs and control files
DB_RECOVERY_FILE_DEST            The default file location for database RMAN backups

If you perform a database operation that creates a file—for example, creating a tablespace—but do not specify a file name and location, then Oracle will use the parameter relevant to the operation and create the file. It is now an Oracle Managed File.
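Note that these parameters are dynamic, so they can be set with ALTER SYSTEM or, as in the test that follows, at the session level. As a sketch, with purely illustrative Windows paths, a DBA might set a datafile destination and two multiplexed destinations for the online redo logs and control files:

ALTER SYSTEM SET db_create_file_dest = 'f:\oracle\oradata\easydw';
ALTER SYSTEM SET db_create_online_log_dest_1 = 'g:\oracle\oradata\easydw';
ALTER SYSTEM SET db_create_online_log_dest_2 = 'h:\oracle\oradata\easydw';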

For example, try a simple test: Identify where your database files currently reside and add a subdirectory called "test." Then, from SQL*Plus, alter the value of DB_CREATE_FILE_DEST to use that subdirectory:

ALTER SESSION SET db_create_file_dest
             = 'f:\oracle\product\10.1.0\oradata\easydw\test';

Now create a test tablespace:

CREATE TABLESPACE omf_test DATAFILE SIZE 1m;

Within your new test directory, you should now have the subdirectory "EASYDW\datafile" (where EASYDW will be the unique name of your database) and, within this directory, a new datafile—for example, o1_mf_omf_test_05r8ggy9_.dbf. The string "05r8ggy9" is, in fact, a unique eight-character string generated by Oracle; it ensures that no two files will have the same name.
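If you would rather not hunt for the file on disk, a quick sketch of checking the generated name from the data dictionary is:

SELECT file_name
FROM   dba_data_files
WHERE  tablespace_name = 'OMF_TEST';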

If, at some point in the future, this tablespace has to be removed from the database, then when the tablespace is dropped its datafiles are removed from the server as well. To demonstrate this, drop the tablespace from within SQL*Plus, and the new datafile is automatically removed by the database server.

DROP TABLESPACE omf_test;

Other database operations (such as “ALTER DATABASE ADD LOGFILE”) will use the other OMF initialization parameters in a similar manner.
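For example, with DB_CREATE_ONLINE_LOG_DEST_1 set, a sketch of adding an Oracle-managed redo log group, without naming any files, is simply (the 100M size is illustrative):

ALTER DATABASE ADD LOGFILE SIZE 100M;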

OMF is a very useful and powerful aid to the DBA. At the very least, this automatic removal prevents unneeded old files from proliferating on your file system, and it ensures that a manual clean-up, which would otherwise be necessary, cannot inadvertently remove a file that is still being used by the database. As a bonus, Oracle will also tidy up any partially created files that may result from an operation that failed, so with OMF you should never be in doubt about the validity and use of the Oracle datafiles found on your file system.

Bigfiles and Big Databases

Some data warehouses use terabytes of storage, which means that the database will consist of some very large datafiles. On many systems the size of an individual file is limited, and this results in a larger number of smaller files, which consequently require careful management. However, on operating systems that support large files, Oracle can use them in the database and replace the many small files with a significantly smaller number of much larger ones.

Bigfile tablespaces are exactly that: tablespaces with a single, very large datafile. A bigfile tablespace using 8K blocks can have a datafile of up to 32 terabytes; if 32K blocks are used, this increases to a datafile of 128 terabytes. Since the database can have up to 65,536 datafiles, the maximum database size supported by Oracle Database 10g is extremely large, at 8 exabytes. Bigfile tablespaces are intended to be used with locally managed tablespaces (which track space usage in the tablespace itself rather than in the data dictionary) and with ASM (to handle the underlying mirroring and striping). Obviously, large files must be supported by your underlying operating system for bigfiles to be used.

A bigfile tablespace is created simply by including the keyword BIGFILE in the CREATE TABLESPACE command; because no file name is specified here, the datafile location is taken from the OMF or ASM defaults:

CREATE BIGFILE TABLESPACE data_ts DATAFILE SIZE 20G;

Bigfile tablespaces can be resized and set to autoextend, as per normal, but now this is done by controlling the tablespace and not the datafile—for example:

ALTER TABLESPACE data_ts AUTOEXTEND ON;
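Similarly, a sketch of growing the whole tablespace in one operation (the 40G figure is illustrative) is:

ALTER TABLESPACE data_ts RESIZE 40G;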

Summary

In this chapter, we have looked at the range of architectures available for our data warehouse and provided information to help you decide which is the one that is most appropriate for your business. At the hardware level, you should now have an appreciation of the storage issues and options available both inside and outside of the data warehouse. Within the database, you should now also be familiar with Oracle features to aid in the administration and use of these storage options.

A well-designed warehouse, making proper use of these features, can deliver excellent query performance for large amounts of data. In the next chapter, we will discuss the physical database design of the various schema objects and how these make use of the underlying technical architecture.
