Chapter 9
SQL Data Types

 

Not everything that counts can be counted, and not everything that can be counted counts.

 
  --Albert Einstein

Welcome to the first of three chapters in this book about database design. If you’ve followed along faithfully until now, well done. Several chapters were needed to cover the SELECT statement in detail, so that we could gain an appreciation for how tabular data is extracted from the database, filtered, summarized, presented, and sequenced. Now it’s time to turn our attention to the challenges of creating database tables.

Creating tables is straightforward, with only a few tricky aspects to watch out for. These are encountered primarily when deciding how tables should be related to each other, and we’ll cover table relationships in Chapter 10. In this chapter, we’ll examine table columns in isolation, and discuss the options available to define them.

In our sample applications, we’ve seen several examples of the CREATE TABLE statement. When we create a table, we must give it one or more columns, and once the table has been defined, we can go ahead and insert rows of data into it, and then use it in our SELECT queries.

This chapter looks at how to choose a column’s data type. A data type must be assigned to each column, and we’ll cover the choices available. We’ll also discuss briefly some of the constraints that we may employ to tailor the columns more to our requirements.

An Overview of Data Types

When we create a column, we must give it a data type. The data type will correspond to one of these basic categories of data:

  1. numeric

  2. character

  3. temporal (date and time)

Each of these data type categories allows for a wide range of possible values, and each of them is, by its very nature, different from the others. Numeric data type columns are used to store amounts, prices, counts, ratings, temperatures, measurements, latitudes and longitudes, shoe sizes, scores, salaries, identifier numbers, and so on. Character data type columns are used to store names, descriptions, text, strings, words, source code, symbols, identifier codes, and so on. Temporal data type columns are used to store a date, a time, or a timestamp (which has both date and time components). Although the concept is easy, temporal data types are often the most troublesome for novices.

The process of choosing an appropriate data type begins with an analysis of the data values that we wish to store in the column. Because the categories of data types are so inherently different from each other, this is often a trivially easy step. Perhaps the only difficulty arises in a few edge cases, where it may look like numeric data but should actually be defined with a character data type. There’s an example later in the chapter.

So let’s start discussing the data types in detail.

Numeric Data Types

Numeric data types can be divided into in two types: exact and approximate. Before you begin to wonder how a number can be approximate, let me reassure you that most of the numeric data types we use in web development are exact.

Exact numbers are those like 42 and 9.37. When you store a numeric value in an exact numeric column, you’ll always be able to retrieve exactly the same value in a SELECT query. This is not the case with approximate numbers, where the value you retrieve might be a different number, although it would be very, very close.

Let’s start with the exact numeric data types, which are either integers or decimals.

Integers

Integers are the whole numbers that we have been accustomed to from the earliest days of our childhood: 1, 2, 3, and so on. In standard SQL, there are three integer data types. INTEGER and SMALLINT have been standard all along, and BIGINT appears to have been added in either the SQL-1999 or SQL-2003 standard.[9]

INTEGER

INTEGER columns can hold both positive and negative numbers (and zero, of course). The range of numbers that can be supported is usually from -2,147,483,648 to 2,147,483,647. This is the range of numbers that can be implemented in binary notation using 32 bits (4 bytes). Curiously, the SQL standard does not actually specify a range for INTEGER, but all database systems uniformly use 32 bits.

SMALLINT

SMALLINT columns will support—you guessed it—a smaller range of integers than INTEGER. As with INTEGER, standard SQL does not specify the range, merely stipulating that the range be smaller. SMALLINT is usually implemented in 8 bits (2 bytes), leading to a range of -32,768 to 32,767.

BIGINT

BIGINT columns support a much larger range of integers than INTEGER. Database systems that support BIGINT usually use 64 bits (8 bytes), resulting in a range of numbers from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807. That's over nine quintillion. Hence, it’s extremely unlikely that you’ll need to use BIGINT. We'll see BIGINT again in the section on autonumbers in Chapter 10.

Tip: Pros and Cons of Non-standard Data Types

Some database systems have implemented additional, non-standard integer data types.

MySQL has added MEDIUMINT, implemented in 12 bits (3 bytes), giving a range of –8,388,608 to 8,388,607. This slots MEDIUMINT in between SMALLINT and INTEGER.

MySQL and SQL Server also support TINYINT, although they have different implementations. Both are based on 4 bits (1 byte). MySQL’s range is –128 to 127, while SQL Server disallows negative TINYINT values and so has a range of 0 to 255.

If your database system supports TINYINT, using it can seem irresistible. Why declare a numeric column with the 2-byte SMALLINT data type, when you know that there will be only a few small values, comfortable fitting within the 1-byte TINYINT range of -128 to 127 or 0 to 255?

One benefit of using TINYINT over SMALLINT or INTEGER comes from the reduced disk space requirements. Of course, our table will need to have many millions of rows in order for the saved space to amount to more than a few megabytes, and we’ll also need to take the total space requirements of all other columns into consideration to determine if the overall savings are meaningful.

One disadvantage is that we’ll need to change the data type if we have to port our tables to a database system that doesn’t support TINYINT. This is mitigated by the fact that changing the data type is easily accomplished; for example, we could use a text editor on the source DDL, changing all occurrences of TINYINT to SMALLINT in one command.

So while the best practice strategy is to use either SMALLINT or INTEGER because these are portable to all database systems, many SQL developers will use TINYINT anyway, if it’s available, even though the space saved is rarely considerable. Perhaps we’re just being neat and tidy.

Decimals

Decimal numbers have two parts: the total number of digits, and the number of digits to the right of the decimal point; the decimal point isn’t actually stored. For example, the number 9.37 has three total digits, of which two are to the right of the decimal point.

There are two, almost identical, kinds of decimal data type: NUMERIC and DECIMAL. Both data types have the same format:

NUMERIC(p[,s])
DECIMAL(p[,s])

The mandatory parameter p above, represents the precision: the total number of digits allowed. The optional parameters (which defaults to 0 if omitted) represents the scale: the total number of digits to the right of the decimal point.

Standard SQL says that the difference between NUMERIC and DECIMAL is implementation dependent. NUMERIC columns must have the exact precision specified, but DECIMAL columns might have a larger precision than specified if this is more efficient or convenient for the database system. In practice, they behave identically. My personal preference is DECIMAL.

NUMERIC and DECIMAL data types each allow both positive and negative values, and have the same range of possible values. However, the size of this range varies from one database system to another. PostgreSQL, for example, allows a precision of 1,000 digits. In practice, you’ll rarely approach the limits of the range, whatever they are.

Tip: Use DECIMAL but Consult Your SQL Reference Manual

Check your manual for details about the DECIMAL data types available to you. The maximum precision (total number of digits) and maximum scale (number of digits to the right of the decimal point) can vary from one database system to another.

DECIMAL data types are almost always preferred over floating-point data types (discussed further on), simply because decimals are exact and floating-point numbers are approximate.

DECIMAL data types are also preferred over non-standard ones such as SQL Server’s MONEY data type, which is deprecated. (Deprecated means that you shouldn’t use it because it will be removed in a future release of this SQL standard or product, even though you can at present.)

Let’s look at a few quick examples of DECIMAL data types.

To define a column which will hold a value such as 9.37, we could employ DECIMAL(3,2) as the data type. The precision and scale of 3 and 2 mean that:

  1. 3 digits in total are allowed

  2. 2 of those digits are to the right of the decimal point

Note that DECIMAL(3,2) is inadequate for holding a value such as 12.34, because 12.34 has two digits to the left of the decimal point, and we allowed for only one. Attempting to insert this value usually results in an error message about “arithmetic overflow.”

Nor can DECIMAL(3,2) properly hold a value such as 0.567, because even though there are only three significant digits in total, the column can hold only two positions to the right of the decimal point. Attempting to insert this value, however, does proceed, with the value being rounded to 0.57 to fit into the column. The column can hold the value, but with an accuracy of only two decimal digits. As to what your particular database system will do, in the case where you attempt to insert a value that does not conform to the column data type, you’ll just have to test it to make sure.

Tip: Test Your Database System

Depending on your database system, attempting to insert the value above might be allowed silently. To confirm how your database system handles this situation, you might like to run a test query like the following. In this query we create a table called test_decimals, add a column called d, and try to insert various decimal values into it:

test_02_DECIMAL.sql (excerpt)
CREATE TABLE test_decimals
( 
  d   DECIMAL(3,2)  NOT NULL PRIMARY KEY
);

INSERT INTO test_decimals (d) VALUES (   9.37  );
INSERT INTO test_decimals (d) VALUES (   0.567 );
INSERT INTO test_decimals (d) VALUES (  12.34  );
INSERT INTO test_decimals (d) VALUES ( 888.88  );

SELECT
  d 
FROM
  test_decimals
;

The two emphasized INSERT statements above will fail when run on SQL Server with the error "arithmetic overflow error converting numeric to data type numeric", but MySQL will allow them. Interestingly, when running the SELECT query, MySQL will return:

 0.57
 9.37
12.34
99.99  

In answer to the question, why, I’ll leave it as an exercise for you.

When using decimal data types, always choose a precision that comfortably holds the maximum range of data that the column is expected to contain. Make the scale adequate for your needs, too, considering that rounding will take place, especially where arithmetic calculations are performed.

For financial amounts, some people like to specify four decimal places instead of two for greater decimal accuracy, for example, interest calculations. Accuracy here refers to the decimal portion of the number; 12.0625 is more accurate than 12.06 if the number being represented is twelve and one sixteenth.

Tip: PS: Precision, scale, and accuracy

It’s easy to confuse the words accuracy and precision in this context, because in everyday language they are synonyms. The syntax of the decimal and numeric data type keywords is often written as:

DECIMAL(p,s)
NUMERIC(p,s)

A more accurate decimal number has more digits to the right of the decimal point, but precision (the first parameter above: p) means the total number of significant digits. A more accurate decimal number has a larger scale, but since scale digits are counted within the total number of precision digits, a more accurate number means a larger precision as well.

Scale (the second parameter above: s) can also be misunderstood as the range of values describing how large or small the number can be; in everyday language, to scale something up means to allow for it to enlarge. In decimal numbers, to allow for a larger range, we need to increase the number of digits to the left of the decimal point, which is equal to p minus s. So to increase the range also means increasing the total number of significant digits, the p in DECIMAL(p,s).

PS: An easy way to remember which words to use is with the mnemonic, PS.

Example: Latitude and Longitude

Latitude and longitude (see Figure 9.1) are often expressed as decimals. Suppose we wanted to keep 6 positions to the right of the decimal point. The values we’re planning to store look like 43.697677 and -79.371643. Maybe that’s too accurate, because specifying 6 digits to the right of the decimal point corresponds to pinpointing a location on earth with a level of accuracy as refined as to the size of a grapefruit. To locate buildings, a scale of 4 (4 digits to the right of the decimal point) is sufficient.

Latitude and longitude

Figure 9.1. Latitude and longitude

We could use DECIMAL(6,4) for latitude, which has values that range from –90° to +90°, but we’d need DECIMAL(7,4) for longitude, which has values that range from –180° to +180°.

Having seen the exact numeric data types—integer and decimal—let’s move on to the approximate numeric data types.

Floating-point Numbers

Approximate numbers are implemented as floating-point numbers, and are usually either very, very large, or very, very small. Floating-point numbers are often used for scientific data, where absolute accuracy is neither required nor assumed.

Consider this example of a very large number: a glass of water has approximately 7,900,000,000,000,000,000,000,000 water molecules. This number is much larger than a BIGINT column will allow. A decimal specification to hold this number would be DECIMAL(25,0) and that’s quite a large precision value—each digit will require extra storage space, but only two of the 25 digits are significant.

A floating-point number is compatible with scientific notation. That humongous number of water molecules can also be written as 7.9 x 1024, where 7.9 is called the mantissa and 24 is called the exponent. Scientific notation is useful because it separates the accuracy of the number from its largeness or smallness. Floating-point numbers also have a precision, but it applies to the mantissa only. Thus, 7.91 x 1024 is more accurate than 7.9 x 1024.

Why are floating-point numbers called approximate? Simply because of rounding errors, which depend in part on the underlying hardware architecture of the computer. A more detailed explanation is beyond the scope of this book; see Wikipedia’s page on IEEE Standard 754.

FLOAT, REAL, and DOUBLE PRECISION

As with the decimal data types DECIMAL and NUMERIC, Standard SQL has several kinds of floating-point data types: FLOAT, REAL, and DOUBLE PRECISION. As with DECIMAL and NUMERIC, the differences are minor and implementation defined. In practice, they all behave the same. It’s common for database systems to use either 4 or 8 bytes to store a floating-point number. DOUBLE PRECISION, as you might have guessed, has greater precision than FLOAT or REAL. Check your SQL reference manual for the full details of floating-point numbers in your database system.

Imagine a table called test_floats with a FLOAT column called f; when storing numbers into a floating-point column, we can specify the value of the number like so:

test_03_FLOAT.sql (excerpt)
INSERT INTO test_floats 
  ( f ) 
VALUES 
  ( 7900000000000000000000000 )

We can also do the same using exponent notation:

test_03_FLOAT.sql (excerpt)
INSERT INTO test_floats 
  ( f ) 
VALUES 
  ( 7.9E24 )

Exponent notation uses the letter E between the decimal mantissa and integer exponent. The mantissa can be signed, giving a positive or negative number, while the exponent can also be signed, giving a very large or very small number.

Tip: When to Use Floating-point Data Types

Use a floating-point data type to store only very large or very small numbers.

As well, you may also use floating-point when accuracy is not crucial. In the earlier example of latitudes and longitudes, floating-point numbers could be used because the location specified by the latitude and the longitude numbers is only approximate anyway. Think of it like this: with a large enough precision, you can specify a location accurate to within the size of a grapefruit, while the rounding errors due to using an approximate data type are equivalent to the width of the grapefruit peel.

Conversions in Numeric Calculations

Whenever we perform a numeric calculation, data type conversion may occur depending on the calculation. A numeric calculation performed in an SQL statement will have a resulting value that’s typically given a data type large enough to accommodate it.

For example, the numbers 123,456,789 and 555 are perfectly good integers on their own, but if we multiply them together, the result is 68,518,517,895. This is also an integer, but will exceed an INTEGER column. If this multiplication is done in the SELECT clause of a query, the result is simply returned as a BIGINT data type. If we try to insert the result into an INTEGER column, we’ll receive an error message, as the value is too big.

As another example, suppose we multiply the two perfectly ordinary DECIMAL(3,2) numbers—1.23 and 4.56—together. The answer is 5.6088, and we know this value will be rounded if it’s stored into another DECIMAL(3,2) column. (An error would occur if the multiplication result is larger than 9.99.) If we simply return the calculation in a SELECT clause, the expression will have a data type of DECIMAL(5,4) or DECIMAL(6,4).

Numeric Functions

Now that we’ve learned how and when to assign the various numeric data types, let’s have a quick look at the functions that are available to work with them.

In the section called “Numeric Operators” in Chapter 7, we saw the basic arithmetic operators, +, -, *, and / (for addition, subtraction, multiplication, and division) used to combine numeric values when creating expressions in the SELECT clause. There are, in all database systems, a large number of additional numeric functions that can be used. Here are a few:

Table 9.1. Commonly available numeric functions

Numeric Function Purpose

ABS

returns the absolute value of a number

CEILING

returns the smallest integer greater than or equal to the number specified

FLOOR

returns the largest integer less than or equal to the number specified

MOD

returns the modulus (the remainder) of a division expression

RAND

returns a random number

ROUND

rounds a number to a specified precision

Database systems will vary according to which functions they offer. Some will have many more, including algebraic functions, geometric functions, and so on. Check your SQL reference manual for details.

Now let’s move on to the second of the three major data type categories: character.

Character Data Types

Character data types should be very familiar to anyone who has worked with computers. Character data types are used to store any values containing letters, symbols, punctuation marks, and so on. Digits are permitted too, of course, and therein lies one of the few pitfalls that you might encounter in selecting an appropriate data type. We’ll discuss this issue in a moment.

CHAR

The first of the three character data types is CHAR, also called CHARACTER, which is used to define a fixed-width character column. When specifying a CHAR column, we must give a width. Thus, CHAR(1) and CHAR(21) are examples of character data types resulting in columns which can hold character strings that are 1 and 21 characters long, respectively.

Again, the terminology here might be confusing. We speak of strings that have a length of so many characters, but we speak of character columns as being so many characters wide. This is mere convention, and both words should convey the same concept in this context.

If a character string value is inserted into a CHAR column with fewer characters than the column allows, it’s positioned at the left and padded with spaces on the right until it fills the column. Fortunately, we don’t have to specify those spaces when evaluating column values. Consider this condition:

WHERE country = 'Cuba'

If country is a CHAR(50) column, then the condition will evaluate as TRUE even though the value stored in the column consists of the letters Cuba followed by 46 spaces. Trailing spaces are ignored. All values in the column are 50 characters long.

If a character string value is inserted into a CHAR column with more characters than the column allows, excess characters are truncated from the right until it fits.

VARCHAR

VARCHAR, also called VARYING CHARACTER, requires a width too, but in this case it’s the maximum width of the values allowed in the column. The actual width of the column value in each row depends on the value.

Thus, the value Cuba stored in a VARCHAR(50) column will require only 4 characters. This can make the rows substantially shorter. There is a very small amount of overhead required for VARCHAR columns—an additional number is required to indicate how long each value is. This number is typically a one byte binary number, which is why many database implementations allow VARCHAR widths only up to 255.

In Figure 9.2, the lengths for the VARCHAR values are actually stored in the row, whereas the lengths for the CHAR values are not.

How CHAR(50) and VARCHAR(50) are stored in the database

Figure 9.2. How CHAR(50) and VARCHAR(50) are stored in the database

Check your SQL reference manual for the maximum column width allowed for both CHAR and VARCHAR columns.

The additional resource overhead for VARCHAR is typically more than offset by the fact that the rows are, in general, shorter. Thus, more rows can fit onto a single block of disk space, meaning that overall input and output operations performed by the database system will be faster for the same number of rows.

If there is variation in the lengths of a column’s values, my own preference is to use CHAR for widths up to 4, and VARCHAR for widths from 5 up, but that’s just a hunch as to where the break even point might be. The difference in row lengths with CHAR(4) versus VARCHAR(4) would be marginal, and I’ve yet to tested the differences in performance.

Of course, you can also have large CHAR columns if they’re appropriate. If all the values you wish to store are always the same length, then it only makes sense to use CHAR with a fixed width. Fixed length standard codes, like the EAN number used on product bar codes, are good candidates.

Numeric or Character?

One of the few problems in deciding on the data type to use for a column is an edge case in which there’s a choice between numeric and character. Obviously, this choice is possible only when the values consist of digits and no other characters; as soon as any letter or symbol is involved, a numeric data type is out of the question.

The classic examples are phone numbers and American ZIP codes (postal codes). Neither contains any characters other than digits, so both appear to be good candidates for a numeric data type, especially given that:

  1. numeric data types often require fewer bytes of storage than the equivalent number of digits in a character column

  2. numeric values are more efficient when used in indexes for searching

These properties are often attractive to novices. Nevertheless, to paraphrase Einstein, not everything that can fit into a numeric column should be numeric. One good rule of thumb is: “if you’re not going to do arithmetic with it, leave it as a character data type.” Since finding an average phone number or the sum of all ZIP codes is a ridiculous proposition, numeric properties are not needed for these columns. The disadvantages of using numeric data types include formatting issues and sorting problems.

When formatting is required, a character data type is better. The phone number 9375551212 is a valid BIGINT value, but we’d rather see it formatted as (937) 555-1212. Doing this in every SELECT statement where we want to display the phone number becomes tedious. With a character data type, the formatting can be imposed upon data entry, storing the parentheses and dashes right in the value. Sorting the formatted values works too, since they would all have the same parentheses and dashes in the same positions.

If we store ZIP codes as a numeric data type, applying sorting would be disastrous. For ZIP codes, the presence of the optional last 4 digits—indicating a more specific segment within a delivery area—is problematic, since 5-digit and 9-digit codes would be in separate parts of the sorted results. Table 9.2 shows us that while sorting ZIP codes numerically is accurate for numbers, what we’d really prefer is alphabetical sorting.

Table 9.2. ZIP codes sorted numerically and alphabetically

Sorted Numerically Sorted Alphabetically
12345 12345
12346 12345-0112
12347 12345-0114
long list of values… 12345-0116
99901 12346
99902 12347
123450112 long list of values…
123450114 99901
123450116 99902

As always, if you’re careful to analyse the needs of your application, you’ll be guided in your decision about which data type to use.

NCHAR and NVARCHAR

The SQL standard also provides for NATIONAL CHARACTER and NATIONAL CHARACTER VARYING data types. The NCHAR and NVARCHAR data types are just like CHAR and VARCHAR, but use a larger character set, most often one of the Unicode character sets such as UTF-8. You’d need this for internationalization purposes, for example, if you store characters from foreign languages.

These data types behave just like CHAR and VARCHAR, except the NCHAR and NVARCHAR columns require 2 bytes to store every character, rather than the 1 byte we’re used to with character sets like ASCII.

Consult your SQL reference manual for specifics about the character sets supported.

CLOB and BLOB

When the contents of a character column are expected to be relatively large, larger than can fit into the maximum CHAR or VARCHAR data type, a large-object character data type is required. In web development, this type of column is used to store things like the source text of an article entry (perhaps in plain text, perhaps with HTML or XML tags), the full description of a shopping cart item (as opposed to its shorter name or description), the content of a forum posting, and so on.

In standard SQL, the two data types CHARACTER LARGE OBJECT and BINARY LARGE OBJECT, also affectionately known as CLOB and BLOB, are provided for large column values. CLOBs are used to store character data (using character sets), while BLOBs are used to store binary data, such as images, sound, and video. Only some database systems implement both CLOBs and BLOBs, and these can vary in the way they’re implemented across systems; MySQL, for instance, implements TEXT instead of CLOB. Typical space limitations, if any, are in the megabyte range or even larger.

Both of these data types have restrictions in database systems that implement them. For example, CLOBs and BLOBs can’t be indexed or sorted—but then, doing so hardly makes sense.[10] Imagine all the novels ever written stored in a large database; why would you want to list them sorted by their text content?

There are also some limitations with character functions and operators working with CLOBs or BLOBs. Speaking of which, it’s time for a quick review of character functions, more commonly known as string functions.

String Functions

In Chapter 7, we saw two basic string functions, SUBSTRING and the concatenation operator (||), used to extract portions of, or combine character values when creating expressions in the SELECT clause. There are, in all database systems, a large number of additional string functions that can be used. Table 9.3 lists just a few.

Table 9.3. Some common string functions

String Function Purpose

LEFT

returns a substring from the left of the string value

RIGHT

returns a substring from the right of the string value

CHAR_LENGTH

returns the length of the string in bytes

REVERSE

reverses the characters of the string

TRIM

removes characters from the string

REPLACE

replaces characters in the string with specified characters

LOWER, UPPER

change the characters to lower or upper case

It will vary amongst database systems who has what function, or even whether they use the same function name. There are many other useful string functions, particularly regular expressions, supported by some database systems, which allow for very sophisticated pattern matching. You know the drill by now: check your SQL reference manual for details of the string functions available to you.

Now let’s move on to the third major data type category: temporal.

Temporal Data Types

Temporal data types are dates, times, and timestamps (which consist of both date and time components). These are intuitive data types that we can all understand, but working with them causes new SQL developers more trouble than other data types. The situation is not helped by the fact that temporal data types can have vastly different implementations from one database system to another.

On the surface, dates and times are dead simple. The DATE data type is used to store dates, and the TIME data type is used to store times. It gets progressively more complicated after that, so let’s take things one step at a time (no pun intended).

DATE

The DATE data type is used to store date values from the Common Era calendar, which is the standard, 365-day Gregorian calendar. A date value has three components: year, month, and day. An example of a date constant can be seen in the following query:

SELECT
  customer_id
FROM
  carts
WHERE
  cartdate = DATE '2008-09-21'

The important point to note here is that, in the above query (and in the SQL standard), the keyword DATE is actually part of the date constant value. However, because most database systems were developed before dates and times were standardized in SQL, standard SQL doesn’t always work. For example, the above query works fine in MySQL, but fails in SQL Server until the DATE keyword is removed:

SELECT
  customer_id
FROM
  carts
WHERE
  cartdate = '2008-09-21'

This form of the date constant looks like a simple character string value, and will work in all database systems, so I recommend using it.

The next issue with dates is even more problematic for novices.

Input Format, Storage Format, and Display Format

Each database system has its own specific rules for how to specify a date value. Checking these rules in your SQL reference manual is a must. We specify input date values when inserting new values into a column, whether with an INSERT or an UPDATE statement, and when writing WHERE conditions such as the one in the previous example.

The allowable date formats for input vary considerably from one database system to another.

All of the following are valid input date value formats in at least one database system:

DATE '2008-09-21'
'2008-09-21' 
'20080921' 
20080921 
'09-21-2008' 
'21-SEP-2008' 
'2008/09/21' 
'2008$09$21'

Tip: Specify Input Date Values Using the YYYY-MM-DD Format

The YYYY-MM-DD format is recognized by all database systems: 4 digits for the year, 2 for the month, and 2 for the day. It’s compatible with ISO-8601, which is the international standard for date and time formats.

Note that the dashes are optional, so, for example, we can specify either '2008-09-21' or '20080921' for the 21st of September, 2008.

One of the biggest surprises regarding date and time values in database systems is that they’re not stored the way they’re entered.

Usually, dates are stored as integers. The integer for a given date is calculated as the number of days since a base or zero date. So if January 1, 1900 is the base date—day 1, then January 2, 1900 would be day 2, and December 31, 1900 would be day 365, not 366, since 1900 was not a leap year. (Date handling in database systems is fully aware of the Gregorian leap year rules.) January 1, 1901 would then be day 366, and we can count the days like this all the way from January 1, 1900 to today, and on into the future for as long as we like. If the database system uses an INTEGER internally for this day count, the day numbers can go for almost 12 million years before the size of the INTEGER counter becomes inadequate.

So when we insert a date value, we must specify it using an allowed date constant format, and realize that it will be converted upon entry into an internal storage format. We never see the actual storage format, however, because every time we SELECT a date value, it comes out in a display format. The database system performs a conversion both on input and on output.

This is where many people stumble. They enter a date as 09/21/2008 (assuming this format is allowed), and are surprised when it comes out as 2008-09-21 in a SELECT query. I’ve seen people change the data type from DATE to VARCHAR just so they can retrieve exactly the same format they put in! But I’d caution against this; if you value the possibility of doing date calculations, or returning dates in a proper chronological sequence, you’ll always store dates in a proper temporal data type and not a character data type.

There are three options for dealing with display formats:

  1. only use the default display format of your database system (or find a way to change the default)

  2. use whatever formatting functions are provided by your database system to achieve the format you want

  3. format the date in your application

The first option is, of course, the easiest. Fortunately, the default format is usually YYYY-MM-DD anyway, which, in my opinion, is the easiest to understand. If formatting is required, the third option is best practice because web application languages (like PHP or ASP) have formatting functions built in. The second option may be appropriate if you’re writing the SELECT query to extract data that will be sent elsewhere, like in an XML file.

TIME

Time values are similar to date values, in that there are differences between input format, storage format, and display format. An input time format might look like TIME '09:37' and a display format might look like 9:37 AM. Internally, time values are often stored as another integer, representing the number of clock ticks after midnight. (A clock tick might be a millisecond, or three milliseconds, or some similar value.)

Times, however, have another aspect that makes them a bit trickier. That’s because adding dates together is pure folly, yet adding times can make sense.

Times as Duration

TIME values are assumed to be points in time on a time scale, but what if we need to store durations—measures of elapsed time. Suppose we want to have a web site for displaying triathlon race results. We’ll need a type of column to record different times like these:

swim 20:35
bike 1:49:59
run 1:28:32

These times will then be added, and the total needs to come out as 3:39:06.

When dealing with durations like these, there are several choices for the data type to use:

  1. We could use TIME, but few database systems allow times to be added. Similarly, DATE values are considered points on a calendric scale, not durations.

  2. We could store three separate TINYINT values, for hours, minutes, and seconds, but this requires complex expressions to calculate totals.

  3. We could store the equivalent total seconds—instead of hours, minutes, and seconds—in a single SMALLINT; this makes calculating the total easy, but we’d still need to convert it back into hours, minutes, and seconds formatting.

Right about here, those of us from a programming background will start thinking of complex ways to implement the second and third option in our chosen web application languages. And it’s right about here that the lazy programmers among us will look for a way to make the first option work. We look for a time function, provided by the database system, to convert times to seconds. We also make sure there’s another one for converting back. If we’re lucky, we find them both, and the problem of adding times becomes very, very simple:

test_04_SUM_times.sql (excerpt)
SELECT
  SEC_TO_TIME( SUM( TIME_TO_SEC(splittime) ) ) AS total_time
FROM
  raceresults

This example shows the TIME_TO_SEC function is used to convert individual TIME values in the splittime column to seconds. These seconds are then added up by the SUM aggregate function. The result of the SUM is then converted back to a TIME value using the SEC_TO_TIME function.

TIME_TO_SEC and SEC_TO_TIME are MySQL functions, but the same approach can be used in database systems that have different functions. All it takes is a couple of expressions to perform the same calculations. Converting a time to seconds will require use of the EXTRACT function, to pull out the hours, minutes, and seconds separately, with some familiar multiplication (hours multiplied by 3600 and minutes multiplied by 60), as well as addition. Converting seconds to time can be accomplished easily by using the TIMEADD or DATEADD function—which every database system has—to add those seconds to a base time of 00:00:00 (midnight).

The point of this example was to demonstrate, step by step, the thinking process that leads to simplifying an application; because there are no functions to develop in your application programming language, we can do the calculations with SQL instead.

Times as Points in Time

Also known as clock time, this is used for single points in time, independent of any date.

For example, a bricks-and-mortar store would have an opening time and a closing time. These might vary by day of the week, but one feature of a clock time value is that it often recurs. So for this particular store, an opening time of 8:30 AM is the same on every day to which it applies.

TIMESTAMP

Timestamps are data types that contain both a date and a time component. What we’ve learned about dates and times separately applies equally to dates and times combined in timestamps: be careful with input formats, and reformat for display in the application if necessary.

Tip: When to Use DATE, TIME, or TIMESTAMP

Use DATE when the event or activity has a date only, and the time is irrelevant. For example, in most database applications where people’s birth dates are stored, the time of birth is not applicable.

Use TIME for recurring clock times and for durations as required. Remember that duration calculations may require conversion.

Use TIMESTAMP when an event has a specific date and time. Tables which store system logins and similar events should use the greatest timestamp precision available.

You should refrain from using separate DATE and TIME columns for the same event. For example, avoid organizing columns like this:

event_date   DATE
event_start  TIME
event_end    TIME

This may appear to be worthwhile because it avoids repeating the date, but it can cause serious headaches to calculate intervals from one event to another. Instead organize your columns like this:

event_start  TIMESTAMP
event_end    TIMESTAMP

Calculating intervals is discussed in the next section, where you’ll see how using separate DATE and TIME columns make that task much too difficult.

Intervals

Intervals are like the time duration examples we saw earlier in the athletic race: swim 20:35, bike 1:49:59, and run 1:28:32. Naturally, there are date intervals as well. The interval including January 1st through to March 1st is either 59 or 60 days, depending on the year.

In standard SQL, intervals have their own special syntax. However, few database systems have adopted the standard interval syntax, primarily because—as stated previously—the need for date calculations was anticipated by every database system, and implemented as date functions, long before the standard was agreed to.

We saw one example of an interval calculation back in the section called “BETWEEN: It haz a flavr” in Chapter 4:

CURRENT_DATE - INTERVAL 5 DAY

This is an expression in standard SQL that calculates the date that is 5 days earlier than the current date. If it fails to work in your particular database system, there’ll be equivalent date functions for the same purpose. As we mentioned back in the section called “Temporal Operators” in Chapter 7, you’ll have to read the documentation for your system.

Date Functions

Database system implementations have a rich variety of date functions. We call them date functions, but they also include time functions, and timestamp functions. Standard SQL has few date functions, EXTRACT being the main one, in addition to functions that perform interval calculations. Standard SQL also has the three functions, CURRENT_DATE, CURRENT_TIME, and CURRENT_TIMESTAMP, designed expressly to return the corresponding date and time value from the computer that database system is running on.

Each database system also has a number of other non-standard date functions. Some, such as WEEKDAY, are decidedly useful in real world applications. Table 9.4 lists some of the date functions available in one database system or another.

Table 9.4. Some common date functions

Date Function Purpose

DATEADD or ADDDATE

adjusts a date by a specified interval

DATEDIFF

returns the interval between two dates

YEAR, MONTH, DAY

performs the same function as EXTRACT but more intuitively named

WEEKDAY

returns the day of the week of a specified date as a number from 1 through 7

DAYNAME

returns the name of the day of a specified date, for example Sunday, Monday, and so on

Date functions are, in general, very comprehensive, but it’s important to use them correctly. Refer to your SQL reference manual for more details.

Column Constraints

Column constraints enable us to specify additional data integrity criteria for columns than what is permitted by their data type.

For example, a SMALLINT column can hold values between -32,768 and 32,767, but we might want to restrict this to a range that is meaningful. We might, for example, have a rule that the maximum purchase of any particular item in a single shopping cart, is 10. This can be implemented with a CHECK constraint, as we’ll see in a moment.

NULL or NOT NULL

The first constraint that we should think about for any column, is whether the column should allow NULLs. Any attempt to insert a row in which a value is missing for a column designated as NOT NULL will fail, and the database system will return an error message.

How do we decide if a column should be NULL or NOT NULL? Simply, if we need to have a value in every possible instance. For example, it’s impossible to have an item on a customer cart without a selected quantity:

Cart_04_ANDs_and_ORs.sql (excerpt)
CREATE TABLE cartitems
(
  cart_id  INTEGER  NOT NULL
, item_id  INTEGER  NOT NULL
, qty      SMALLINT NOT NULL
);

The qty column is NOT NULL because it’s senseless to have a customer cart for a null quantity of an item. Key columns, like cart_id and item_id, must also be NOT NULL, but we’ll cover them in Chapter 10.

DEFAULT

The DEFAULT constraint allows us to specify a default value for a column. This default value will be used in those instances where a NULL is about to be inserted. Let’s adjust the cartitems table so that the default qty value for any item is 1:

Cart_04_ANDs_and_ORs.sql (excerpt)
CREATE TABLE cartitems
(
  cart_id  INTEGER  NOT NULL
, item_id  INTEGER  NOT NULL
, qty      SMALLINT NOT NULL  DEFAULT 1
);

We’ve also used a DEFAULT constraint in our customers table for the shipping address:

Cart_04_ANDs_and_ORs.sql (excerpt)
CREATE TABLE customers
(
  id        INTEGER      NOT NULL PRIMARY KEY
, name      VARCHAR(99)  NOT NULL
, billaddr  VARCHAR(255) NOT NULL
, shipaddr  VARCHAR(255) NOT NULL DEFAULT 'See billing address.'
);

The default for the shipping address is the constant ‘See billing address’, and this string would be inserted into the shipaddr column when a customer is added to the table without specifying a shipping address.

CHECK Constraints

CHECK constraints are even more useful, because they can be as complex as needed by the application. A CHECK constraint consists of the keyword CHECK followed by a parenthesized condition. The neat part is that this condition can be a compound condition, involving AND and OR, just like in the WHERE clause.

Lets adjust the CREATE query for the cartitems table so that the maximum qty value for any item is 10:

Cart_04_ANDs_and_ORs.sql (excerpt)
CREATE TABLE cartitems
(
  cart_id  INTEGER  NOT NULL
, item_id  INTEGER  NOT NULL
, qty      SMALLINT NOT NULL  DEFAULT 1  CHECK ( qty <= 10 )
);

In our forums application, the CHECK constraint was used to ensure the TIMESTAMP value in the revised column was always after (chronologically speaking) the value in the created column:

Forums_01_Setup.sql (excerpt)
  created  TIMESTAMP  NOT NULL  DEFAULT CURRENT_TIMESTAMP
, revised  TIMESTAMP  NULL      CHECK ( revised >= created )

Wrapping Up: SQL Data Types

In this chapter, we learned about numeric, character, and temporal data types. We also learned when—and in some cases when not—to use them. We did a very quick tour of the functions that are available when working with the different types of data. Selecting an appropriate data type for each column in the tables we’re designing is fairly straightforward. Implementing appropriate constraints can ensure the integrity of the data in our database.

In the next chapter, we’ll tackle the more difficult task of selecting which columns to combine into which tables, and how to relate the tables properly.



[9] The various versions of the SQL standard, as I mentioned before, are not freely available and must be purchased. What matters much more than minutiae like this, of course, is whether your particular database system has implemented a given feature. MySQL, PostgreSQL, SQL Server, and DB2 all support BIGINT.

[10] This refers to normal performance-oriented indexes. Several database systems support full text indexes for large text objects.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset