In this chapter, we’ll explain a process by which you can create a good configuration file for your MySQL server. It is a roundabout trip, with many points of interest and side trips to scenic overlooks. These are necessary, because determining the shortest path to a good configuration doesn’t start with studying configuration options and asking which ones you should set or how you should change them, nor does it start with examining server behavior and asking whether any configuration options can improve it. It’s best to begin with an understanding of MySQL’s internals and behavior. You can then use that knowledge as a guide for how MySQL should be configured. Finally, you can compare the desired configuration to the current configuration and correct any differences that are important and worthwhile.
People often ask, “What’s the optimal configuration file for my server with 32 GB of RAM and 12 CPU cores?” Unfortunately, it’s not that simple. The server should be configured for the workload, data, and application requirements, not just the hardware. MySQL has scores of settings that you can change—but you shouldn’t. It’s usually better to configure the basic settings correctly (and there are only a few that really matter in most cases) and spend more time on schema optimization, indexes, and query design. After you’ve set MySQL’s basic configuration options correctly, the potential gains from further changes are usually small.
On the other hand, the potential downside of fiddling with the configuration can be great. We’ve seen more than one “highly tuned” server that was crashing constantly, stalling, or performing slowly due to unwise settings. We’ll spend a bit of time on why that can happen and what not to do.
So what should you do? Make sure the basics such as the InnoDB buffer pool and log file size are appropriate, set a few safety and sanity options if you wish to prevent bad behavior (but note that these usually won’t improve performance—they’ll only avoid problems), and then leave the rest of the settings alone. If you begin to experience a problem, diagnose it carefully with the techniques shown in Chapter 3. If the problem is caused by a part of the server whose behavior can be corrected with a configuration option, then you might need to change it.
Sometimes you might also need to set specific configuration options that can have a significant performance impact in special cases. However, these should not be part of a basic server configuration file. You should set them only when you find the specific performance problems they address. That’s why we don’t suggest that you approach configuration options by looking for bad things to improve. If something needs to be improved, it should show up in query response times. It’s best to start your search with queries and their response times, not with configuration options. This could save you a lot of time and prevent many problems.
Another good way to save time and trouble is to use the defaults unless you know you shouldn’t. There is safety in numbers, and a lot of people are running with default settings. That makes them the most thoroughly tested settings. Unexpected bugs can arise when you change things needlessly.
We’ll begin by explaining MySQL’s configuration mechanisms, before covering what you should configure in MySQL. MySQL is generally pretty forgiving about its configuration, but following these suggestions might save you a lot of work and time.
The first thing to know is where MySQL gets configuration information: from command-line arguments and settings in its configuration file. On Unix-like systems, the configuration file is typically located at /etc/my.cnf or /etc/mysql/my.cnf. If you use your operating system’s startup scripts, this is typically the only place you’ll specify configuration settings. If you start MySQL manually, which you might do when you’re running a test installation, you can also specify settings on the command line. The server actually reads the contents of the configuration file, removes any comment lines and newlines, and then processes it together with the command-line options.
A note on terminology: because many of MySQL’s command-line options correspond to server variables, we sometimes use the terms option and variable interchangeably. Most variables have the same names as their corresponding command-line options, but there are a few exceptions. For example, --memlock sets the locked_in_memory variable.
Any settings you decide to use permanently should go into the global configuration file, instead of being specified at the command line. Otherwise, you risk accidentally starting the server without them. It’s also a good idea to keep all of your configuration files in a single place so that you can inspect them easily.
Be sure you know where your server’s configuration file is located! We’ve seen people try unsuccessfully to configure a server with a file it doesn’t read, such as /etc/my.cnf on Debian servers, which look in /etc/mysql/my.cnf for their configuration. Sometimes there are files in several places, perhaps because a previous system administrator was confused as well. If you don’t know which files your server reads, you can ask it:
$ which mysqld
/usr/sbin/mysqld
$ /usr/sbin/mysqld --verbose --help | grep -A 1 'Default options'
Default options are read from the following files in the given order:
/etc/mysql/my.cnf ~/.my.cnf /usr/etc/my.cnf
This applies to typical installations, where there’s a single server on a host. You can design more complicated configurations, but there’s no standard way to do this. The MySQL server distribution used to include a now-deprecated program called mysqlmanager, which can run multiple instances from a single configuration with separate sections. (This was a replacement for the even older mysqld_multi script.) However, many operating system distributions don’t include or use this program in their startup scripts. In fact, many don’t use the MySQL-provided startup script at all.
The configuration file is divided into sections, each of which begins with a line that contains the section name in square brackets. A MySQL program will generally read the section that has the same name as that program, and many client programs also read the client section, which gives you a place to put common settings. The server usually reads the mysqld section. Be sure you place your settings in the correct section in the file, or they will have no effect.
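To make the section rule concrete, here is a small Python sketch that reads a sample file and reports which settings a given program would see. It is a simplification under stated assumptions: the sample contents, the helper name settings_for, and the "client section first, then the program's own section" rule are ours for illustration, and real option-file handling has more rules than this.

```python
import configparser

# Hypothetical sample file contents; the paths are placeholders.
SAMPLE_CNF = """
[mysqld]
datadir = /var/lib/mysql
port = 3306

[client]
port = 3306
socket = /var/lib/mysql/mysql.sock

[mysqldump]
quick
"""


def settings_for(cnf_text, program, is_client=False):
    """Return the settings a program sees under a simplified rule:
    client programs read [client] first, then every program reads the
    section named after itself (later sections override earlier ones)."""
    # allow_no_value lets bare options such as "quick" parse without "=".
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    sections = (["client"] if is_client else []) + [program]
    settings = {}
    for name in sections:
        if parser.has_section(name):
            settings.update(parser.items(name))
    return settings


server_view = settings_for(SAMPLE_CNF, "mysqld")
dump_view = settings_for(SAMPLE_CNF, "mysqldump", is_client=True)
print(server_view)
print(dump_view)
```

In this sketch the server never sees the socket setting, because it appears only in the client section. That is the same failure mode as putting a server setting in the wrong section.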
Configuration settings are written in all lowercase, with words separated by underscores or dashes. The following are equivalent, and you might see both forms in command lines and configuration files:
/usr/sbin/mysqld --auto-increment-offset=5
/usr/sbin/mysqld --auto_increment_offset=5
We suggest that you pick a style and use it consistently. This makes it easier to search for settings in your files.
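If you inherit files that mix both styles, a tiny helper can make settings comparable before you search or diff them. The following Python sketch is ours, not part of MySQL; normalize_option is a hypothetical name, and it rewrites only the option name, never the value.

```python
def normalize_option(argument):
    """Split '--some-option=value' (or 'some_option = value') into a
    (name, value) pair, with dashes in the name turned into underscores."""
    name, _, value = argument.lstrip("-").partition("=")
    return name.strip().replace("-", "_"), value.strip()


print(normalize_option("--auto-increment-offset=5"))
print(normalize_option("auto_increment_offset = 5"))
```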
Configuration settings can have several scopes. Some settings are server-wide (global scope); others are different for each connection (session scope); and others are per-object. Many session-scoped variables have global equivalents, which you can think of as defaults. If you change the session-scoped variable, it affects only the connection from which you changed it, and the changes are lost when the connection closes. Here are some examples of the variety of behaviors of which you should be aware:
The query_cache_size variable is globally scoped.
The sort_buffer_size variable has a global default, but you can set it per-session as well.
The join_buffer_size variable has a global default and can be set per-session, but a single query that joins several tables can allocate one join buffer per join, so there might be several join buffers per query.
In addition to setting variables in the configuration files, you can also change many (but not all) of them while the server is running. MySQL refers to these as dynamic configuration variables. The following statements show different ways to change the session and global values of sort_buffer_size dynamically:
SET sort_buffer_size = <value>;
SET GLOBAL sort_buffer_size = <value>;
SET @@sort_buffer_size := <value>;
SET @@session.sort_buffer_size := <value>;
SET @@global.sort_buffer_size := <value>;
If you set variables dynamically, be aware that those settings will be lost when MySQL shuts down. If you want to keep the settings, you’ll have to update your configuration file as well.
If you set a variable’s global value while the server is running, the values for the current session and any other existing sessions are not affected. This is because the session values are initialized from the global value when the connections are created. You should inspect the output of SHOW GLOBAL VARIABLES after each change to make sure it’s had the desired effect.
Variables use different kinds of units, and you have to know the correct unit for each variable. For example, the table_cache variable specifies the number of tables that can be cached, not the size of the table cache in bytes. The key_buffer_size is specified in bytes, whereas still other variables are specified in number of pages or other units, such as percentages.
Many variables can be specified with a suffix, such as 1M for one megabyte. However, this works only in the configuration file or as a command-line argument. When you use the SQL SET command, you must use the literal value 1048576, or an expression such as 1024 * 1024. You can’t use expressions in configuration files.
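The arithmetic behind the suffixes is easy to check. This Python sketch converts a configuration-file style value into the literal byte count you would need in a SET statement. The function name size_to_bytes is ours, and the sketch handles only the K, M, and G suffixes.

```python
# Binary multipliers for the suffixes handled by this sketch.
SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def size_to_bytes(text):
    """Convert a value such as '1M' or '1048576' to an integer byte count."""
    text = text.strip()
    multiplier = SUFFIXES.get(text[-1:].upper())
    if multiplier is None:
        # No recognized suffix: treat the whole string as a plain number.
        return int(text)
    return int(text[:-1]) * multiplier


print(size_to_bytes("1M"))
print(size_to_bytes("32M"))
```

For example, the 32M used for tmp_table_size in a configuration file corresponds to the literal 33554432 in a SET statement.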
There is also a special value you can assign to variables with the SET command: the keyword DEFAULT. Assigning this value to a session-scoped variable sets that variable to the corresponding globally scoped variable’s value; assigning it to a globally scoped variable sets the variable to the compiled-in default (not the value specified in the configuration file). This is useful for resetting session-scoped variables back to the values they had when you opened the connection. We advise you not to use it for global variables, because it probably won’t do what you want: it doesn’t set the values back to what they were when you started the server.
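The scope rules described above can be summarized in a toy model. The Python class below is purely illustrative: ToyServer and its methods are hypothetical names, the sizes are made-up numbers, and nothing here reflects how the server is implemented internally. It only encodes the rules stated in the text: sessions copy the global value at connect time, a session DEFAULT means the current global value, and a global DEFAULT means the compiled-in default.

```python
DEFAULT = object()  # stands in for the SQL keyword DEFAULT
MB = 1024 * 1024


class ToyServer:
    """Toy model of one variable's scopes; not how mysqld is implemented."""

    def __init__(self, compiled_default, configured=None):
        self.compiled_default = compiled_default
        # A configuration-file value overrides the compiled-in default at startup.
        self.global_value = compiled_default if configured is None else configured
        self.sessions = {}

    def connect(self, session_id):
        # Session values are initialized from the global value at connect time.
        self.sessions[session_id] = self.global_value

    def set_session(self, session_id, value):
        # DEFAULT at session scope means "the current global value".
        self.sessions[session_id] = self.global_value if value is DEFAULT else value

    def set_global(self, value):
        # DEFAULT at global scope means "the compiled-in default".
        self.global_value = self.compiled_default if value is DEFAULT else value


server = ToyServer(compiled_default=2 * MB, configured=4 * MB)
server.connect("a")
server.set_global(8 * MB)          # existing session "a" is not affected
server.connect("b")                # new session "b" picks up the new global
server.set_session("a", 16 * MB)
server.set_session("a", DEFAULT)   # back to the current global value
server.set_global(DEFAULT)         # compiled-in default, not the configured one
print(server.sessions, server.global_value)
```

The last call is the surprising one: the global value ends at the compiled-in 2 MB, not the 4 MB that the hypothetical configuration file specified.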
Setting variables dynamically can have unexpected side effects, such as flushing dirty blocks from buffers. Be careful which settings you change online, because this can cause the server to do a lot of work.
Sometimes you can infer a variable’s behavior from its name. For example, max_heap_table_size does what it sounds like: it specifies the maximum size to which implicit in-memory temporary tables are allowed to grow. However, the naming conventions aren’t completely consistent, so you can’t always guess what a variable will do by looking at its name.
Let’s take a look at some commonly used variables and the effects of changing them dynamically:
key_buffer_size
Setting this variable allocates the designated amount of space for the key buffer (or key cache) all at once. However, the operating system doesn’t actually commit memory to it until it is used. Setting the key buffer size to one gigabyte, for example, doesn’t mean you’ve instantly caused the server to actually commit a gigabyte of memory to it. (We discuss how to watch the server’s memory usage in the next chapter.)
MySQL lets you create multiple key caches, as we explain later in this chapter. If you set this variable to 0 for a nondefault key cache, MySQL discards any indexes cached in the specified cache, begins to cache them in the default cache, and deletes the specified cache when nothing is using it anymore. Setting this variable for a nonexistent cache creates it. Setting the variable to a nonzero value for an existing cache will flush the specified cache’s memory. This blocks all operations that try to access the cache until the flush is finished.
table_cache_size
Setting this variable has no immediate effect—the effect is delayed until the next time a thread opens a table. When this happens, MySQL checks the variable’s value. If the value is larger than the number of tables in the cache, the thread can insert the newly opened table into the cache. If the value is smaller than the number of tables in the cache, MySQL deletes unused tables from the cache.
thread_cache_size
Setting this variable has no immediate effect—the effect is delayed until the next time a connection is closed. At that time, MySQL checks whether there is space in the cache to store the thread. If so, it caches the thread for future reuse by another connection. If not, it kills the thread instead of caching it. In this case, the number of threads in the cache, and hence the amount of memory the thread cache uses, does not immediately decrease; it decreases only when a new connection removes a thread from the cache to use it. (MySQL adds threads to the cache only when connections close and removes them from the cache only when new connections are created.)
query_cache_size
MySQL allocates and initializes the specified amount of memory for the query cache all at once when the server starts. If you update this variable (even if you set it to its current value), MySQL immediately deletes all cached queries, resizes the cache to the specified size, and reinitializes the cache’s memory. This can take a long time and stalls the server until it completes, because MySQL deletes all of the cached queries one by one, not instantaneously.
read_buffer_size
MySQL doesn’t allocate any memory for this buffer until a query needs it, but then it immediately allocates the entire chunk of memory specified here.
read_rnd_buffer_size
MySQL doesn’t allocate any memory for this buffer until a query needs it, and then it allocates only as much memory as needed. (The name max_read_rnd_buffer_size would describe this variable more accurately.)
sort_buffer_size
MySQL doesn’t allocate any memory for this buffer until a query needs to do a sort. However, when there’s a sort, MySQL allocates the entire chunk of memory immediately, whether the full size is required or not.
We explain what these variables do in more detail elsewhere, and this isn’t an exhaustive list. Our goal here is simply to show you what behavior to expect when you change a few common variables.
You should not raise the value of a per-connection setting globally unless you know it’s the right thing to do. Some buffers are allocated all at once, even if they’re not needed, so a large global setting can be a huge waste. Instead, you can raise the value when a query needs it.
The most common example of a variable that you should probably keep small and raise only for certain queries is sort_buffer_size, which controls how large the sort buffer should be for filesorts. MySQL performs some work to initialize the sort buffer after allocating it.
In addition, the sort buffer is allocated to its full size even for very small sorts, so if you make it much larger than the average sort requires, you’ll be wasting memory and adding allocation cost. This can be surprising to those readers who think of memory allocation as an inexpensive operation. Without digging into all of the technical details, it’s enough to say that memory allocation includes setting up the address space, which can be relatively expensive; in Linux in particular, memory allocation uses a couple of strategies with varying cost depending on the size.
In summary, a large sort buffer can be very expensive, so don’t increase its size unless you know it’s needed.
When you find a query that needs a larger sort buffer to perform well, you can raise the sort_buffer_size value just before the query and then restore it to DEFAULT afterward. Here’s an example of how to do this:
SET @@session.sort_buffer_size := <value>;
-- Execute the query...
SET @@session.sort_buffer_size := DEFAULT;
Wrapper functions can be handy for this type of code. Other variables you might set on a per-connection basis are read_buffer_size, read_rnd_buffer_size, tmp_table_size, and myisam_sort_buffer_size (if you’re repairing tables).
If you need to save and restore a possibly customized value, you can do something like the following:
SET @saved_<unique_variable_name> := @@session.sort_buffer_size;
SET @@session.sort_buffer_size := <value>;
-- Execute the query...
SET @@session.sort_buffer_size := @saved_<unique_variable_name>;
The sort buffer size is one of the settings that is the focus of far too much “tuning.” Some people seem to have the idea that bigger is better, and we’ve even seen servers with this variable set to 1 GB. Perhaps not surprisingly, this can cause the server to try to allocate too much memory and crash, or simply to burn a lot of CPU time when initializing the sort buffer for a query; see MySQL bug 37359 for more on this.
Don’t assign too much importance to the sort buffer size. Do you really need your queries to allocate 128 MB of memory to sort 10 rows and return them to the client? Think about what kinds of sorting your queries are doing, and how much, and try to avoid them with proper indexing and query design (see Chapter 5 and Chapter 6) rather than trying to make the sorting operation itself faster. And you should definitely profile your queries to see whether sorting is where you should focus your attention anyway; see Chapter 3 for an example of a query that performs a sort but doesn’t spend much of its time sorting.
Be careful when setting variables. More is not always better, and if you set the values too high, you can easily cause problems: you might run out of memory, causing your server to swap, or run out of address space.[109]
You should always have a monitoring system in place to measure whether a change improves or hurts your server’s overall performance in real life. Benchmarks aren’t enough, because they’re not real. If you don’t measure your server’s actual performance, you might hurt performance without knowing it. We’ve seen many cases where someone changed a server’s configuration and thought it improved performance, when in fact the server’s performance worsened overall because of a different workload at a different time of day or day of the week.
If you take notes, perhaps with comments in the configuration file, you might save yourself (and your colleagues) a lot of work. An even better idea is to place your configuration file under version control. This is a good practice anyway, because it lets you undo changes. To reduce the complexity of managing many configuration files, simply create a symbolic link from the configuration file to a central version control repository.
Before you start changing your configuration, you should optimize your queries and your schema, addressing at least the obvious things such as adding indexes. If you get deep into tweaking the configuration and then change your queries or schema, you might need to reevaluate the configuration. Keep in mind that unless your hardware, workload, and data are completely static, chances are you’ll need to revisit your configuration later. And in fact, most people’s servers don’t even have a steady workload throughout the day—meaning that the “perfect” configuration for the middle of the morning is not right for midafternoon! Obviously, chasing the mythical “perfect” configuration is completely impractical. Thus, you don’t need to squeeze every last ounce of performance out of your server; in fact, the return for such an investment of time will probably be very small. We suggest that you stop at “good enough,” unless you have reason to believe you’re forgoing a significant performance improvement.
You might be expected (or believe that you’re expected) to set up a benchmark suite and “tune” your server by changing its configuration iteratively in search of optimal settings. This usually is not something we advise most people to do. It requires so much work and research, and the potential payoff is so small in most cases, that it can be a huge waste of time. You are probably better off spending that time on other things such as checking your backups, monitoring changes in query plans, and so on.
It’s also very hard to know what side effects your changes might have over the long run. If you change an option and it appears to improve your benchmark, but your benchmark doesn’t measure everything that’s important, or you don’t run it long enough to detect changes in the system’s long-term steady-state behavior, you might cause problems such as periodic server stalls or sporadic slow queries. These can be difficult to detect.
We do sometimes run sets of benchmarks to examine or stress particular parts of the server so we can understand their behavior better. A good example is the many benchmarks we’ve run over the years to understand InnoDB’s flushing behavior, in our quest to develop better flushing algorithms for various workloads and types of hardware. It often happens that we benchmark extensively with different settings to understand their effects and how to optimize them. But this is not a small undertaking—it can take many days or weeks—and it is also not beneficial for most people to do, because such tunnel vision about a specific part of the server often obscures other concerns. For example, sometimes we find that specific combinations of settings enable better performance in edge cases, but the configuration options are not really practical for production usage, due to factors such as wasting a huge amount of memory or optimizing for throughput while ignoring the impact on crash recovery altogether.
If you must do this, we suggest that you develop a custom benchmark suite before you begin configuring your server. You need something that represents your overall workload and includes edge cases such as very large and complex queries. Replaying your actual workload against your actual data is usually a good approach. If you have identified a particular problem spot—such as a single query that runs slowly—you can also try to optimize for that case, but you risk impacting other queries negatively without knowing it.
The best way to proceed is to change one or two variables, a little at a time, and run the benchmarks after each change, being sure to run them long enough to observe the steady-state behavior. Sometimes the results will surprise you; you might increase a variable a little and see an improvement, then increase it a little more and see a sharp drop in performance. If performance suffers after a change, you might be asking for too much of some resource, such as too much memory for a buffer that’s frequently allocated and deallocated. You might also have created a mismatch between MySQL and your operating system or hardware. For example, we’ve found that the optimal sort_buffer_size might be affected by how the CPU cache works, and the read_buffer_size needs to be matched to the server’s read-ahead and general I/O subsystem configuration. Larger is not always better, and can be much worse. Some variables are also dependent on others, which is something you learn with experience and by understanding the system’s architecture.
Before we get started with server configuration, we want to encourage you to avoid a few common practices that we’ve found to be risky or harmful. Warning: rants ahead!
First, you should not “tune by ratio.” The classic “tuning ratio” is the rule of thumb that your key cache hit ratio should be higher than some percentage, and you should increase the cache size if the hit rate is too low. This is very wrong advice. Regardless of what anyone tells you, the cache hit ratio has nothing to do with whether the cache is too large or too small. To begin with, the hit ratio depends on the workload—some workloads simply aren’t cacheable no matter how big the cache is—and secondly, cache hits are meaningless, for reasons we’ll explain later. It sometimes happens that when the cache is too small, the hit rate is low, and increasing the cache size increases the hit rate. However, this is an accidental correlation and does not indicate anything about performance or proper sizing of the cache.
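As a rough illustration of why the ratio alone says little, consider two hypothetical servers with made-up numbers. The Python sketch below is our own arithmetic, not a measurement and not the fuller explanation promised for later: both servers report the same hit ratio, yet one of them goes to disk a thousand times more often per second.

```python
def hit_ratio(requests, misses):
    """Fraction of cache requests served without a miss."""
    return 1 - misses / requests


# (requests per second, misses per second); both rows are invented numbers.
quiet_server = (1_000, 10)
busy_server = (1_000_000, 10_000)

for label, (requests, misses) in (("quiet", quiet_server), ("busy", busy_server)):
    print(f"{label}: hit ratio {hit_ratio(requests, misses):.2%}, "
          f"{misses} misses per second")
```

A rule based on the ratio would treat these two servers identically, even though the amount of work their misses generate differs by three orders of magnitude.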
The problem with correlations that sometimes appear to be true is that people begin to believe they will always be true. Oracle DBAs abandoned ratio-based tuning years ago, and we wish MySQL DBAs would follow their lead.[110] We wish even more fervently that people wouldn’t write “tuning scripts” that codify these dangerous practices and teach them to thousands of people. This leads to our second suggestion of what not to do: don’t use tuning scripts! There are several very popular ones that you can find on the Internet. It’s probably best to ignore them.[111]
We also suggest that you avoid the word “tuning,” which we’ve used liberally in the past few paragraphs. We favor “configuration” or “optimization” instead (as long as that’s what you’re actually doing; see Chapter 3). The word “tuning” conjures up images of an undisciplined novice who tweaks the server and sees what happens. We suggested in the previous section that this practice is best left to those who are researching server internals. “Tuning” your server can be a stunning waste of time.
On a related topic, searching the Internet for configuration advice is not always a great idea. You can find a lot of bad advice in blogs, forums, and so on.[112] Although many experts contribute what they know online, it is not always easy to tell who is qualified. We can’t give unbiased recommendations about where to find real experts, of course. But we can say that the credible, reputable MySQL service providers are a safer bet in general than what a simple Internet search turns up, because people who have happy customers are probably doing something right. Even their advice, however, can be dangerous to apply without testing and understanding, because it might have been directed at a situation that differed from yours in a way you don’t understand.
Finally, don’t believe the popular memory consumption formula—yes, the very one that MySQL itself prints out when it crashes. (We won’t repeat it here.) This formula is from an ancient time. It is not a reliable or even useful way to understand how much memory MySQL can use in the worst case. You might see some variations on this formula on the Internet, too. These are similarly flawed, even though they add in more factors that the original formula doesn’t have. The truth is that you can’t put an upper bound on MySQL’s memory consumption. It is not a tightly regulated database server that controls memory allocation. You can prove that very simply by logging into the server and running a number of queries that consume a lot of memory:
mysql> SET @crash_me_1 := REPEAT('a', @@max_allowed_packet);
mysql> SET @crash_me_2 := REPEAT('a', @@max_allowed_packet);
# ... run a lot of these ...
mysql> SET @crash_me_1000000 := REPEAT('a', @@max_allowed_packet);
Run that in a loop, creating new variables each time, and you’ll eventually run the server out of memory and crash it! And it requires no privileges to execute.
The points we’ve tried to illustrate in this section have sometimes made us unpopular with people who perceive us as arrogant, think that we’re trying to discredit others and set ourselves up as the sole authority, or feel that we’re trying to promote our services. It is not our intention to be self-serving. We have simply seen so much bad advice that appears legitimate if you are not experienced enough to know better, and helped clean up the wreckage so many times, that we think it is important to debunk a few myths and warn our readers to be careful whose expertise they trust. We’ll try to avoid ranting from here on.
As we mentioned at the beginning of this chapter, we don’t have a one-size-fits-all “best configuration file” for, say, a 4-CPU server with 16 GB of memory and 12 hard drives. You really do need to develop your own configurations, because even a good starting point will vary widely depending on how you’re using the server.
MySQL’s compiled-in default settings aren’t all great, although most of them are fine. They are designed not to use a lot of resources, because MySQL is intended to be very versatile, and it does not assume it is the only thing running on the server on which it is installed. By default, MySQL uses just enough resources to start and run simple queries with a little bit of data. You’ll certainly need to customize it if you have more than a few megabytes of data.
You can start with one of the sample configuration files included with the MySQL server distribution, but they have their own problems. For example, they have a lot of commented-out settings that might tempt you to think that you should choose values and uncomment them (it’s a bit reminiscent of an Apache configuration file). And they have a lot of prose comments that explain the options, but these explanations are not always well-balanced, complete, or even correct. Some of the options don’t even apply to popular operating systems at all! Finally, the samples are perpetually out of date for modern hardware and workloads.
MySQL experts have had many conversations about how to fix these problems over the years, but the issues remain. Here’s our suggestion: don’t use those files as a starting point, and don’t use the samples that ship with your operating system’s packages either. It’s better to start from scratch.
That’s what we’ll do in this chapter. It’s actually a weakness that MySQL is so configurable, because it makes it seem as though you should spend a lot of time on configuration, when in fact most things are fine at their defaults, and you are often better off setting and forgetting. That’s why we’ve created a sane minimal sample configuration file for this book, which you can use as a good starting point for your own servers.[113] You must choose values for a few of the settings; we’ll explain those later in this chapter. Our base file looks like this:
[mysqld]
# GENERAL
datadir                 = /var/lib/mysql
socket                  = /var/lib/mysql/mysql.sock
pid_file                = /var/lib/mysql/mysql.pid
user                    = mysql
port                    = 3306
storage_engine          = InnoDB
# INNODB
innodb_buffer_pool_size = <value>
innodb_log_file_size    = <value>
innodb_file_per_table   = 1
innodb_flush_method     = O_DIRECT
# MyISAM
key_buffer_size         = <value>
# LOGGING
log_error               = /var/lib/mysql/mysql-error.log
log_slow_queries        = /var/lib/mysql/mysql-slow.log
# OTHER
tmp_table_size          = 32M
max_heap_table_size     = 32M
query_cache_type        = 0
query_cache_size        = 0
max_connections         = <value>
thread_cache_size       = <value>
table_cache_size        = <value>
open_files_limit        = 65535
[client]
socket                  = /var/lib/mysql/mysql.sock
port                    = 3306
This might seem too minimal in comparison to what you’re used to seeing,[114] but it’s actually more than many people need. There are a few other types of configuration options that you are likely to use as well, such as binary logging; we’ll cover those later in this and other chapters.
The first thing we configured is the location of the data. We chose /var/lib/mysql for this, because it’s a popular location on many Unix variants. There is nothing wrong with choosing another location; you decide. We’ve put the PID file into the same location, but many operating systems will want to place it in /var/run instead. That’s fine, too. We simply needed to have something configured for these settings. By the way, don’t let the socket and PID file be located according to the server’s compiled-in defaults; there are some bugs in various MySQL versions that can cause problems with this. It’s best to set these locations explicitly. (We’re not advising you to choose different locations; we’re just advising you to make sure the my.cnf file mentions those locations explicitly, so they won’t change and break things if you upgrade the server.)
We also specified that mysqld should run as the mysql user account on the operating system. You’ll need to make sure this account exists, and that it owns the data directory. The port is set to the default of 3306, but sometimes that is something you’ll want to change.
We’ve chosen the default storage engine to be InnoDB, and this is worth explaining. We think InnoDB is the best choice in most situations, but that’s not always the case. Some third-party software, for example, might assume the default is MyISAM, and will create tables without specifying the engine. This might cause the software to malfunction if, for example, it assumes that it can create full-text indexes. And the default storage engine is used for explicitly created temporary tables, too, which can cause quite a bit of unexpected work for the server. If you want your permanent tables to use InnoDB but any temporary tables to use MyISAM, you should be sure to specify the engine explicitly in the CREATE TABLE statement.
In general, if you decide to use a storage engine as your default, it’s best to configure it as the default. Many users think they use only a specific storage engine, but then discover another engine has crept into use because of the configured default.
We’ll illustrate the basics of configuration with InnoDB. All InnoDB really needs to run well in most cases is a proper buffer pool size and log file size. The defaults are far too small. All of the other settings for InnoDB are optional, although we’ve enabled innodb_file_per_table for manageability and flexibility reasons. Setting the InnoDB log file size is a topic that we’ll discuss later in this chapter, as is the setting of innodb_flush_method, which is Unix-specific.
There’s a popular rule of thumb that says you should set the buffer pool size to around 75% or 80% of your server’s memory. This is another accidental ratio that seems to work okay sometimes, but isn’t always correct. It’s a better idea to set the buffer pool roughly as follows:
Begin with the amount of memory in the server.
Subtract out a bit for the operating system and perhaps for other programs, if MySQL isn’t the only thing running on the server.
Subtract some more for MySQL’s memory needs; it uses various buffers for per-query operations, for example.
Subtract enough for the InnoDB log files, so the operating system has enough memory to cache them, or at least the recently accessed portion thereof. (This advice applies to standard MySQL; in Percona Server, you can configure the log files to be opened with O_DIRECT, bypassing the operating system caches.) It might also be a good idea to leave some memory free for caching at least the tail of the binary logs, especially if you have replicas that are delayed, because they can sometimes read old binary log files on the master, causing some pressure on its memory.
Subtract enough for any other buffers and caches that you configure inside MySQL, such as the MyISAM key cache or the query cache.
Divide by 105%, which is an approximation of the overhead InnoDB adds on to manage the buffer pool itself.
Round the result down to a sensible number. Rounding down won’t change things much, but overallocating can be a bad thing.
We were a bit blasé about some of the amounts of memory involved here—what exactly is “a bit for the operating system,” anyway? That varies, and we’ll discuss it in some depth later in this chapter and the rest of this book. You need to understand your system and estimate how much memory you think it’ll need to run well. This is why one-size-fits-all configuration files are not possible. Experience and sometimes a bit of math will be your guide.
Here’s an example. Suppose you have a server with 192 GB of memory, and you want to dedicate it to MySQL and to use only InnoDB, with no query cache and not very many connections to the server. If your log files are 4 GB in total, you might proceed as follows: “I think that 2 GB or 5% of overall memory, whichever is larger, should be enough for the OS and for MySQL’s other memory needs; subtract 4 GB for the log files; use everything else for InnoDB.” The result is about 177 GB, but it’s probably a good idea to round that down a bit. You might configure the server with 168 GB or so of buffer pool. If the server tends to run with a fair amount of unallocated memory in practice, you might set the buffer pool larger when there is an opportunity to restart it for some other purpose.
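The sizing steps listed earlier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `suggest_buffer_pool_gb` and the 8 GB rounding granularity are hypothetical choices made for this sketch, and the 2 GB or 5% reserve is the rule of thumb quoted in the example rather than a fixed requirement.

```python
import math


def suggest_buffer_pool_gb(total_gb, log_files_gb, other_caches_gb=0.0,
                           round_to_gb=8):
    """Rough InnoDB buffer pool size in GB (hypothetical helper)."""
    # Reserve 2 GB or 5% of RAM, whichever is larger, for the operating
    # system and MySQL's other memory needs.
    reserved_gb = max(2.0, 0.05 * total_gb)
    # Leave room for the OS to cache the InnoDB log files, plus any other
    # MySQL caches (key cache, query cache) that were configured.
    remaining_gb = total_gb - reserved_gb - log_files_gb - other_caches_gb
    # Divide by 105% to allow for InnoDB's own buffer pool overhead.
    remaining_gb = max(remaining_gb, 0.0) / 1.05
    # Round down: overallocating is worse than underallocating.
    # The 8 GB granularity is an assumption of this sketch.
    return math.floor(remaining_gb / round_to_gb) * round_to_gb


# Dedicated 192 GB server, 4 GB of log files, InnoDB only.
print(suggest_buffer_pool_gb(192, 4))
```

With these inputs the sketch prints 168, which matches the rounded-down figure in the example. Treat the output as a starting point to verify against real memory usage, not as a final answer.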
The result would be very different if you had a number of MyISAM tables and needed to cache their indexes, naturally. It would also be quite different on Windows, which has trouble using large amounts of memory in most MySQL versions (although it’s improved in MySQL 5.5), or if you chose not to use O_DIRECT for some reason.
As you can see, it’s not crucial to get this setting precisely right from the beginning. It’s better to start with a safe value that’s larger than the default but not as large as it could be, run the server for a while, and see how much memory it really uses. These things can be hard to anticipate, because MySQL’s memory usage isn’t always predictable: it can depend on factors such as the query complexity and concurrency. With a simple workload, MySQL’s memory needs are pretty minimal—around 256 KB per connection. But complex queries that use temporary tables, sorting, stored procedures, and so forth can use a lot more RAM.
That’s why we chose a pretty safe starting point. You can see that even the conservative setting for InnoDB’s buffer pool is actually 87.5% of the server’s installed RAM—more than 75%, which is why we said that simple ratios aren’t the right approach.
We suggest that when it comes to configuring the memory buffers, you err on the side of caution, rather than making them too large. If you make the buffer pool 20% smaller than it could be, you’ll likely impact performance only a small amount—maybe a few percent. If you set it 20% too large, you’ll probably cause much more severe problems: swapping, thrashing the disks, or even running out of memory and crashing hard.
This InnoDB configuration example illustrates our preferred approach to configuring the server: understand what it does internally and how that interacts with the settings, and then decide.
Most of the other settings in our sample file are pretty self-explanatory, and many of them are a matter of judgment. We’ll explore several of them in the rest of this chapter. You can see that we’ve enabled logging, disabled the query cache, and so on. We’ll also discuss some safety and sanity settings later in this chapter, which can be very helpful for making your server more robust and helping prevent bad data and other problems. We don’t show those settings here.
One setting to explain here is the open_files_limit option. We’ve set this as large as possible on a typical Linux system. Open filehandles are very cheap on modern operating systems. If this setting isn’t large enough, you’ll see error 24, “too many open files.”
Skipping all the way to the end, the last section in the configuration file is for client programs such as mysql and mysqladmin, and simply lets them know how to connect to the server. You should set the values for client programs to match those you chose for the server.
Sometimes you can use the output from SHOW GLOBAL STATUS as input to your configuration to help customize the settings better for your workload. For the best results, look both at absolute values and at how the values change over time, preferably with several snapshots at peak and off-peak times. You can use the following command to see incremental changes to status variables every 60 seconds:
$ mysqladmin extended-status -ri60
We will frequently refer to changes in status variables over time as we explain various configuration settings. We will usually expect you to be examining the output of a command such as the one we just showed. Other helpful tools that can provide a compact display of status counter changes are Percona Toolkit’s pt-mext or pt-mysql-summary.
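If you collect status snapshots yourself, the same incremental view is easy to compute. The following Python sketch assumes you already have two snapshots as dictionaries of counter names to integers; the function name `counter_deltas` and the sample values are hypothetical, and fetching the snapshots from the server is left out.

```python
def counter_deltas(previous, current):
    """Return the change in each integer counter present in both snapshots."""
    deltas = {}
    for name, value in current.items():
        old = previous.get(name)
        if isinstance(value, int) and isinstance(old, int):
            deltas[name] = value - old
    return deltas


# Two hypothetical snapshots taken 60 seconds apart.
interval = 60
before = {"Questions": 1_000_000, "Threads_created": 410, "Key_reads": 5_200}
after = {"Questions": 1_090_000, "Threads_created": 413, "Key_reads": 5_500}

for name, delta in sorted(counter_deltas(before, after).items()):
    print(f"{name}: {delta:+d} ({delta / interval:.2f}/s)")
```

Comparing deltas from peak and off-peak intervals usually says more about a workload than the absolute counters, which only accumulate since startup.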
Now that we’ve shown you the preliminaries, we’ll take you on a guided tour of some server internals, interleaved with advice on configuration. This will give you the background you’ll need to choose appropriate values for configuration options when we return to the sample configuration file later.
Configuring MySQL to use memory correctly is vital to good performance. You’ll almost certainly need to customize MySQL’s memory usage for your needs. You can think of MySQL’s memory consumption as falling into two categories: the memory you can control, and the memory you can’t. You can’t control how much memory MySQL uses merely to run the server, parse queries, and manage its internals, but you have a lot of control over how much memory it uses for specific purposes. Making good use of the memory you can control is not hard, but it does require you to know what you’re configuring.
As shown previously, you can approach memory configuration in steps:
Determine the absolute upper limit of memory MySQL can possibly use.
Determine how much memory MySQL will use for per-connection needs, such as sort buffers and temporary tables.
Determine how much memory the operating system needs to run well. Include memory for other programs that run on the same machine, such as periodic jobs.
Assuming that it makes sense to do so, use the rest of the memory for MySQL’s caches, such as the InnoDB buffer pool.
We go over each of these steps in the following sections, and then we take a more detailed look at the various MySQL caches’ requirements.
There is a hard upper limit on the amount of memory that can possibly be available to MySQL on any given system. The starting point is the amount of physically installed memory. If your server doesn’t have it, MySQL can’t use it.
You also need to think about operating system or architecture limits, such as restrictions 32-bit operating systems place on how much memory a given process can address. Because MySQL runs in a single process with multiple threads, the amount of memory it can use overall might be severely limited by such restrictions—for example, 32-bit Linux kernels limit the amount of memory any one process can address to a value that is typically between 2.5 and 2.7 GB. Running out of address space is very dangerous and can cause MySQL to crash. This is pretty rare to see these days, but it used to be common.
There are many other operating system–specific parameters and oddities that must be taken into account, including not just the per-process limits, but also stack sizes and other settings. The system’s glibc libraries can also impose limits per single allocation. For example, you might not be able to set innodb_buffer_pool_size larger than 2 GB if that’s all your glibc libraries support in a single allocation.
Even on 64-bit servers, some limitations still apply. For example, many of the buffers we discuss, such as the key buffer, are limited to 4 GB on a 64-bit server in 5.0 and older MySQL versions. Some of these restrictions are lifted in MySQL 5.1, and the MySQL manual documents each variable’s maximum value.
MySQL needs a small amount of memory just to hold a connection (thread) open. It also requires a base amount of memory to execute any given query. You’ll need to set aside enough memory for MySQL to execute queries during peak load times. Otherwise, your queries will be starved for memory, and they will run poorly or fail.
It’s useful to know how much memory MySQL will consume during peak usage, but some usage patterns can unexpectedly consume a lot of memory, which makes this hard to predict. Prepared statements are one example, because you can have many of them open at once. Another example is the InnoDB data dictionary (more about this later).
You don’t need to assume a worst-case scenario when trying to predict peak memory consumption. For example, if you configure MySQL to allow a maximum of 100 connections, it theoretically might be possible to simultaneously run large queries on all 100 connections, but in reality this probably won’t happen. Likewise, if you set myisam_sort_buffer_size to 256M, your worst-case usage across those 100 connections is at least 25 GB, but this level of consumption is highly unlikely to actually occur. Queries that use many large temporary tables, or complex stored procedures, are the most likely causes of high per-connection memory consumption.
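The worst-case figure is simple multiplication, shown here as a small Python sketch. It assumes every one of the 100 connections allocates a full 256 MB sort buffer at the same moment, which is exactly the scenario described above as unlikely.

```python
MB = 1024 ** 2
GB = 1024 ** 3

max_connections = 100
myisam_sort_buffer_size = 256 * MB

# Theoretical worst case: every connection uses its full buffer at once.
worst_case_bytes = max_connections * myisam_sort_buffer_size
print(f"{worst_case_bytes / GB:.0f} GB")
```

The same arithmetic applies to any per-connection buffer, which is why large per-connection settings look alarming on paper even when they rarely matter in practice.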
Rather than calculating worst cases, a better approach is to watch your server under a real workload and see how much memory it uses, which you can see by watching the process’s virtual memory size. In many Unix-like systems, this is reported in the VIRT column in top, or VSZ in ps. The next chapter has more information on how to monitor memory usage.
Just as with queries, you need to reserve enough memory for the operating system to do its work. The best indication that the operating system has enough memory is that it’s not actively swapping (paging) virtual memory to disk. (See the next chapter for more on this topic.)
You should reserve at least a gigabyte or two for the operating system—more for machines with a lot of memory. We suggest starting with 2 GB or 5% of total memory as the baseline, whichever is greater. Add in some extra for safety, and add in some more if you’ll be running periodic memory-intensive jobs on the machine (such as backups). Don’t add any memory for the operating system’s caches, because they can be very large. The operating system will generally use any leftover memory for these caches, and we consider them separately from the operating system’s own needs in the following sections.
If the server is dedicated to MySQL, any memory you don’t reserve for the operating system or for query processing is available for caches.
MySQL needs more memory for caches than anything else. It uses caches to avoid disk access, which is orders of magnitude slower than accessing data in memory. The operating system might cache some data on MySQL’s behalf (especially for MyISAM), but MySQL needs lots of memory for itself, too.
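The following are the most important caches to consider for most installations:
The InnoDB buffer pool
The operating system caches for InnoDB log files and MyISAM data
MyISAM key caches
The query cache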
There are other caches, but they generally don’t use much memory. We discussed the query cache in detail in the previous chapter, so the following sections concentrate on the caches InnoDB and MyISAM need to work well.
It is much easier to configure a server if you’re using only one storage engine. If you’re using only MyISAM tables, you can disable InnoDB completely, and if you’re using only InnoDB, you need to allocate only minimal resources for MyISAM (MySQL uses MyISAM tables internally for some operations). But if you’re using a mixture of storage engines, it can be very hard to figure out the right balance between them. The best approach we’ve found is to make an educated guess and then observe the server in operation.
If you use mostly InnoDB tables, the InnoDB buffer pool probably needs more memory than anything else. The InnoDB buffer pool doesn’t just cache indexes: it also holds row data, the adaptive hash index, the insert buffer, locks, and other internal structures. InnoDB also uses the buffer pool to help it delay writes, so it can merge many writes together and perform them sequentially. In short, InnoDB relies heavily on the buffer pool, and you should be sure to allocate enough memory to it, typically with a process such as that shown earlier in this chapter. You can use variables from SHOW commands or tools such as innotop to monitor your InnoDB buffer pool’s memory usage.
If you don’t have much data, and you know that your data won’t grow quickly, you don’t need to overallocate memory to the buffer pool. It’s not really beneficial to make it much larger than the size of the tables and indexes that it will hold. There’s nothing wrong with planning ahead for a rapidly growing database, of course, but sometimes we see huge buffer pools with a tiny amount of data. This isn’t necessary.
Large buffer pools come with some challenges, such as long shutdown and warmup times. If there are a lot of dirty (modified) pages in the buffer pool, InnoDB can take a long time to shut down, because it writes the dirty pages to the data files upon shutdown. You can force it to shut down quickly, but then it just has to do more recovery when it restarts, so you can’t actually speed up the shutdown and restart cycle time. If you know in advance when you need to shut down, you can change the innodb_max_dirty_pages_pct variable at runtime to a lower value, wait for the flush thread to clean up the buffer pool, and then shut down once the number of dirty pages becomes small. You can monitor the number of dirty pages by watching the Innodb_buffer_pool_pages_dirty server status variable or using innotop to monitor SHOW ENGINE INNODB STATUS.
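As a rough illustration of the "wait until few dirty pages remain" step, the Python sketch below computes the dirty-page percentage from two status counters. It assumes the counters have already been read into a dictionary; the helper names and the 1% threshold are hypothetical, and the ratio is only an approximation of what InnoDB itself tracks.

```python
def dirty_page_pct(status):
    """Approximate percentage of buffer pool pages that are dirty."""
    total = int(status["Innodb_buffer_pool_pages_total"])
    dirty = int(status["Innodb_buffer_pool_pages_dirty"])
    return 100.0 * dirty / total if total else 0.0


def few_dirty_pages_left(status, threshold_pct=1.0):
    """True once the dirty-page share is at or below the chosen threshold."""
    return dirty_page_pct(status) <= threshold_pct


# Hypothetical counter values before and after lowering the dirty-page limit.
busy = {"Innodb_buffer_pool_pages_total": "1000000",
        "Innodb_buffer_pool_pages_dirty": "250000"}
flushed = {"Innodb_buffer_pool_pages_total": "1000000",
           "Innodb_buffer_pool_pages_dirty": "5000"}

print(dirty_page_pct(busy), few_dirty_pages_left(busy))
print(dirty_page_pct(flushed), few_dirty_pages_left(flushed))
```

A planned shutdown script could poll a check like this in a loop, but the right threshold depends on how long a shutdown you can tolerate.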
Lowering the value of the innodb_max_dirty_pages_pct variable doesn’t actually guarantee that InnoDB will keep fewer dirty pages in the buffer pool. Instead, it controls the threshold at which InnoDB stops being “lazy.” InnoDB’s default behavior is to flush dirty pages with a background thread, merging writes together and performing them sequentially for efficiency. This behavior is called “lazy” because it lets InnoDB delay flushing dirty pages in the buffer pool, unless it needs to use the space for some other data. When the percentage of dirty pages exceeds the threshold, InnoDB will flush pages as quickly as it can to try to keep the dirty page count lower. InnoDB will also go into “furious flushing” mode when there isn’t enough space left in the transaction logs, which is one reason that large logs can improve performance.
When you have a large buffer pool, especially in combination with slow disks, the server might take a long time (many hours or even days) to warm up after a restart. In such cases, you might benefit from using Percona Server’s feature to reload the pages after restart. This can reduce warmup times to a few minutes. MySQL 5.6 will introduce a similar feature. This is especially beneficial on replicas, which pay an extra warmup penalty due to the single-threaded nature of replication.
If you can’t use Percona Server’s fast warmup feature, you can do what some people do and issue full-table scans or index scans immediately after a restart to load indexes into the buffer pool. This is crude, but can sometimes be better than nothing. You can use the init_file setting to accomplish this: place the SQL into a file that’s executed when MySQL starts up. The filename must be specified in the init_file option, and the file can include multiple SQL commands, each on a single line (no comments are allowed).
The MyISAM key caches are also referred to as key buffers; there is one by default, but you can create more. Unlike InnoDB and some other storage engines, MyISAM itself caches only indexes, not data (it lets the operating system cache the data). If you use mostly MyISAM, you should allocate a lot of memory to the key caches.
The most important option is key_buffer_size. Any memory not allocated to it will be available for the operating system caches, which the operating system will usually fill with data from MyISAM’s .MYD files. MySQL 5.0 has a hard upper limit of 4 GB for this variable, no matter what architecture you’re running. MySQL 5.1 allows larger sizes. Check the current documentation for your version of the server.
When you’re deciding how much memory to allocate to the key caches, it might help to know how much space your MyISAM indexes are actually using on disk. You don’t need to make the key buffers larger than the data they will cache. You can query the INFORMATION_SCHEMA tables and sum up the INDEX_LENGTH column to find out the size of the files storing the indexes:
SELECT SUM(INDEX_LENGTH) FROM INFORMATION_SCHEMA.TABLES WHERE ENGINE='MYISAM';
If you have a Unix-like system, you can also use a command like the following:
$ du -sch `find /path/to/mysql/data/directory/ -name "*.MYI"`
How big should you set the key caches? No bigger than the total index size or 25% to 50% of the amount of memory you reserved for operating system caches, whichever is smaller.
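That rule can be written down as a small Python sketch. The function name `suggest_key_buffer_bytes` is hypothetical, and the choice of a fraction between 25% and 50% remains a judgment call; the sketch only enforces the bounds stated above.

```python
GB = 1024 ** 3


def suggest_key_buffer_bytes(total_index_bytes, os_cache_bytes, fraction=0.25):
    """Upper bound for the key caches: the smaller of the total index size
    and a 25% to 50% share of the memory reserved for OS caches."""
    if not 0.25 <= fraction <= 0.5:
        raise ValueError("fraction should be between 0.25 and 0.5")
    return min(total_index_bytes, int(os_cache_bytes * fraction))


# Hypothetical server: 16 GB reserved for operating system caches.
print(suggest_key_buffer_bytes(3 * GB, 16 * GB) / GB)   # indexes are the limit
print(suggest_key_buffer_bytes(10 * GB, 16 * GB) / GB)  # OS cache share is the limit
```

The first call is capped by the 3 GB of indexes; the second is capped by the 25% share of the operating system cache memory.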
By default, MyISAM caches all indexes in the default key buffer, but you can create multiple named key buffers. This lets you keep more than 4 GB of indexes in memory at once. To create key buffers named key_buffer_1 and key_buffer_2, each sized at 1 GB, place the following in the configuration file:
key_buffer_1.key_buffer_size = 1G
key_buffer_2.key_buffer_size = 1G
Now there are three key buffers: the two explicitly created by those lines and the default buffer. You can use the CACHE INDEX command to map tables to caches. You can tell MySQL to use key_buffer_1 for the indexes from tables t1 and t2 with the following SQL statement:
mysql> CACHE INDEX t1, t2 IN key_buffer_1;
Now when MySQL reads blocks from the indexes on these tables, it will cache the blocks in the specified buffer. You can also preload the tables’ indexes into the cache with the init_file option and the LOAD INDEX command:
mysql> LOAD INDEX INTO CACHE t1, t2;
Any indexes you don’t explicitly map to a key buffer will be assigned to the default buffer the first time MySQL needs to access the .MYI file.
You can monitor key buffer usage with information from SHOW STATUS and SHOW VARIABLES. You can calculate the percentage of the buffer in use with this equation:
100 - ( (Key_blocks_unused * key_cache_block_size) * 100 / key_buffer_size )
If the server doesn’t use all of its key buffer after it’s been running for a long time, you can consider making the buffer smaller.
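The equation translates directly into code. In this Python sketch the function name is hypothetical and the sample numbers are made up; in practice the three inputs come from the status and system variables named in the equation.

```python
def key_buffer_pct_used(key_blocks_unused, key_cache_block_size, key_buffer_size):
    """Percentage of the key buffer currently in use."""
    unused_bytes = key_blocks_unused * key_cache_block_size
    return 100 - unused_bytes * 100 / key_buffer_size


# Hypothetical values: a 1 GB key buffer with 1 KB cache blocks,
# of which 262,144 blocks have never been used.
print(key_buffer_pct_used(262_144, 1024, 1024 ** 3))
```

For these sample values the buffer is three-quarters full, so there would be little reason to shrink it.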
What about the key buffer hit ratio? As we explained previously, this number is useless. For example, the difference between 99% and 99.9% looks small, but it really represents a tenfold difference in the miss rate (1% versus 0.1%). The cache hit ratio is also application-dependent: some applications might work fine at 95%, whereas others might be I/O-bound at 99.9%. You might even be able to get a 99.99% hit ratio with properly sized caches.
The number of cache misses per second is much more empirically useful. Suppose you have a single hard drive that can do 100 random reads per second. Five misses per second will not cause your workload to be I/O-bound, but 80 per second will likely cause problems. You can use the following equation to calculate this value:
Key_reads / Uptime
Calculate the number of misses incrementally over intervals of 10 to 100 seconds, so you can get an idea of the current performance. The following command will show the incremental values every 10 seconds:
$ mysqladmin extended-status -r -i 10 | grep Key_reads
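To turn two Key_reads samples into a miss rate and compare it with what the disk can sustain, a sketch like the following is enough. The function name and sample numbers are hypothetical, and the figure of 100 random reads per second is the single-drive example from the text, not a measured value.

```python
def key_read_misses_per_second(key_reads_before, key_reads_after, interval_seconds):
    """Key cache misses per second over one sampling interval."""
    return (key_reads_after - key_reads_before) / interval_seconds


# Hypothetical samples taken 10 seconds apart.
misses = key_read_misses_per_second(5_200, 6_000, 10)

# Assumed capacity of a single hard drive, as in the example above.
disk_random_reads_per_second = 100

print(f"{misses:.0f} misses/s, "
      f"{100 * misses / disk_random_reads_per_second:.0f}% of disk capacity")
```

A result this close to the drive's capacity suggests the workload is likely to become I/O-bound, whereas a handful of misses per second would not.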
Remember that MyISAM uses the operating system cache for the data files, which are often larger than the indexes. Therefore, it often makes sense to leave more memory for the operating system cache than for the key caches. Even if you have enough memory to cache all the indexes, and the key cache miss rate is very low, cache misses when MyISAM tries to read from the data files (not the index files!) happen at the operating system level, which is completely invisible to MySQL. Thus, you can have a lot of data file cache misses independently of your index cache miss rate.
Finally, even if you don’t have any MyISAM tables, bear in mind that you still need to set key_buffer_size to a small amount of memory, such as 32M. The MySQL server sometimes uses MyISAM tables for internal purposes, such as temporary tables for GROUP BY queries.
The key block size is important (especially for write-intensive workloads) because of the way it causes MyISAM, the operating system cache, and the filesystem to interact. If the key block size is too small, you might encounter read-around writes, which are writes that the operating system cannot perform without first reading some data from the disk. Here’s how a read-around write happens, assuming the operating system’s page size is 4 KB (typically true on the x86 architecture) and the key block size is 1 KB:
1. MyISAM requests a 1 KB key block from disk.
2. The operating system reads 4 KB of data from the disk and caches it, then passes the desired 1 KB of data to MyISAM.
3. The operating system discards the cached data in favor of some other data.
4. MyISAM modifies the 1 KB key block and asks the operating system to write it back to disk.
5. The operating system reads the same 4 KB of data from the disk into the operating system cache, modifies the 1 KB that MyISAM changed, and writes the entire 4 KB back to disk.
The read-around write happened in step 5, when MyISAM asked the operating system to write only part of a 4 KB page. If MyISAM’s block size had matched the operating system’s, the disk read in step 5 could have been avoided.[115]
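The condition that triggers a read-around write can be expressed as a simple alignment check. This Python sketch is a simplified model, not how any operating system actually implements it: it assumes the affected page is no longer cached and that a write needs a prior read whenever it covers only part of a page. The function name is hypothetical.

```python
def needs_read_before_write(offset, length, page_size=4096):
    """True if a write of `length` bytes at `offset` covers any page only
    partially, so the OS must read that page before it can write it back."""
    return offset % page_size != 0 or length % page_size != 0


KB = 1024

# A 1 KB key block inside a 4 KB page: partial page, read required.
print(needs_read_before_write(0, 1 * KB))
# A 4 KB key block aligned to a page boundary: no read required.
print(needs_read_before_write(4 * KB, 4 * KB))
# A 4 KB key block that starts 1 KB into a page: misaligned, read required.
print(needs_read_before_write(1 * KB, 4 * KB))
```

The third case shows why block size alone is not enough: a block as large as a page still causes a read-around write if it straddles a page boundary.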
Unfortunately, in MySQL 5.0 and earlier there’s no way to configure the key block size. However, in MySQL 5.1 and later you can avoid read-around writes by making MyISAM’s key block size the same as the operating system’s. The myisam_block_size variable controls the key block size. You can also specify the size for each key with the KEY_BLOCK_SIZE option in a CREATE TABLE or CREATE INDEX statement, but because all keys are stored in the same file, you really need all of them to have blocks as large as or larger than the operating system’s to avoid alignment issues that could still cause read-around writes. (For example, if one key has 1 KB blocks and another has 4 KB blocks, the 4 KB block boundaries might not match the operating system’s page boundaries.)
The thread cache holds threads that aren’t currently associated with a connection but are ready to serve new connections. When there’s a thread in the cache and a new connection is created, MySQL removes the thread from the cache and gives it to the new connection. When the connection is closed, MySQL places the thread back into the cache, if there’s room. If there isn’t room, MySQL destroys the thread. As long as MySQL has a free thread in the cache it can respond rapidly to connection requests, because it doesn’t have to create a new thread for each connection.
The thread_cache_size variable specifies the number of threads MySQL can keep in the cache. You probably won’t need to configure this value unless your server gets many connection requests. To check whether the thread cache is large enough, watch the Threads_created status variable. We generally try to keep the thread cache large enough that we see fewer than 10 new threads created each second, but it’s often pretty easy to get this number lower than 1 per second.
A good approach is to watch the Threads_connected variable and try to set thread_cache_size large enough to handle the typical fluctuation in your workload. For example, if Threads_connected usually stays between 100 and 120, you can set the cache size to 20. If it stays between 500 and 700, a thread cache of 200 should be large enough. Think of it this way: at 700 connections, there are probably no threads in the cache; at 500 connections, there are 200 cached threads ready to be used if the load increases to 700 again.
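The fluctuation rule amounts to taking the spread of observed connection counts. The Python sketch below assumes you have a list of Threads_connected samples covering typical load; the function name and sample values are hypothetical, and outliers in the samples would inflate the result.

```python
def suggest_thread_cache_size(threads_connected_samples):
    """Size the thread cache to the typical swing in connection count."""
    if not threads_connected_samples:
        raise ValueError("need at least one sample")
    return max(threads_connected_samples) - min(threads_connected_samples)


# Hypothetical samples of Threads_connected over a typical period.
print(suggest_thread_cache_size([100, 112, 120, 105]))
print(suggest_thread_cache_size([500, 640, 700, 580]))
```

The two calls reproduce the examples above, giving 20 and 200 respectively.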
Making the thread cache very large is probably not necessary for most uses, but keeping it small doesn’t save much memory, so there’s little benefit in doing so. Each thread that’s in the thread cache or sleeping typically uses around 256 KB of memory. This is not very much compared to the amount of memory a thread can use when a connection is actively processing a query. In general, you should keep your thread cache large enough that Threads_created doesn’t increase very often. If that requires a very large cache, however (e.g., many thousands of threads), you might want to set it lower because some operating systems don’t handle very large numbers of threads well, even when most of them are sleeping.
The table cache is similar in concept to the thread cache, but it stores objects that represent tables. Each object in the cache contains the associated table’s parsed .frm file, plus other data. Exactly what else is in the object depends on the table’s storage engine. For example, for MyISAM, it holds the table data and/or index file descriptors. For merge tables it might hold many file descriptors, because merge tables can have many underlying tables.
The table cache can help you reuse resources. For instance, when a query requests access to a MyISAM table, MySQL might be able to give it a file descriptor from the cached object. Although this does avoid the cost of opening a file descriptor, that’s not as expensive as you might think. Opening and closing file descriptors is very fast on local storage; the server should be able to do it a million times a second easily (it’s different on network-attached storage, though). The real benefit of the table cache is for MyISAM tables, where it lets the server avoid modifying the MyISAM file headers to mark a table as “in use.”[116]
The table cache’s design is one of the areas where the separation between the server and the storage engines is not completely clean, for historical reasons. The table cache is a little less important for InnoDB, because InnoDB doesn’t rely on it for as many purposes (such as holding file descriptors; it has its own version of a table cache for this purpose). However, even InnoDB benefits from caching the parsed .frm files.
In MySQL 5.1, the table cache is separated into two parts: a cache of open tables and a table definition cache (configured via the table_open_cache and table_definition_cache variables). Thus, the table definitions (the parsed .frm files) are separated from the other resources, such as file descriptors. Opened tables are still per-thread, per-table-used, but the table definitions are global and can be shared among all connections efficiently. You can generally set table_definition_cache high enough to cache all your table definitions. Unless you have tens of thousands of tables, this is likely to be the easiest approach.
If the Opened_tables status variable is large or increasing, the table cache might not be large enough, and you can consider increasing the table_cache system variable (or table_open_cache, in MySQL 5.1). However, note that this counter increases when you create and drop temporary tables, so if you do that a lot, you’ll never get the counter to stop increasing.
One downside to making the table cache very large is that it might cause longer shutdown times when your server has a lot of MyISAM tables, because the key blocks have to be flushed and the tables have to be marked as no longer open. It can also make FLUSH TABLES WITH READ LOCK take a long time to complete, for the same reason.
More seriously, the algorithms that check the table cache aren’t very efficient; more on this later.
If you get errors indicating that MySQL can’t open any more files (use the perror utility to check what the error number means), you might need to increase the number of files MySQL is allowed to keep open. You can do this with the open_files_limit server variable in your my.cnf file.
The thread and table caches don’t really use much memory, and they can be beneficial when they conserve resources. Although creating a new thread and opening a new table aren’t really expensive compared to other things MySQL might do, the overhead can add up. Caching threads and tables can sometimes improve efficiency.
InnoDB has its own per-table cache, variously called a table definition cache or data dictionary, which you cannot configure in current versions of MySQL. When InnoDB opens a table, it adds a corresponding object to the data dictionary. Each table can take up 4 KB or more of memory (although much less space is required in MySQL 5.1). Tables are not removed from the data dictionary when they are closed.
As a result, the server can appear to leak memory over time, due to an ever-increasing number of entries in the dictionary cache. It isn’t truly leaking memory; it just isn’t implementing any kind of cache expiration. This is normally a problem only when you have many (thousands or tens of thousands) large tables. If this is a problem for you, you can use Percona Server, which has an option to limit the data dictionary’s size by removing tables that are unused. There is a similar feature in the yet-to-be-released MySQL 5.6.
The other performance issue is computing statistics for the tables when opening them for the first time, which is expensive because it requires a lot of I/O. In contrast to MyISAM, InnoDB doesn’t store statistics in the tables permanently; it recomputes them each time it starts, and thereafter when various intervals expire or events occur (changes to the table’s contents, queries against the INFORMATION_SCHEMA, and so on). If you have a lot of tables, your server can take hours to start and fully warm up, during which time it might not be doing much other than waiting for one I/O operation after another. You can enable the innodb_use_sys_stats_table option in Percona Server (also in MySQL 5.6, but called innodb_analyze_is_persistent) to store the statistics persistently on disk and solve this problem.
Even after startup, InnoDB statistics operations can have an impact on the server and on individual queries. You can turn off the innodb_stats_on_metadata option to avoid time-consuming refreshes of table statistics. This can make a big difference when tools such as IDEs are querying the INFORMATION_SCHEMA tables.
If you use InnoDB’s innodb_file_per_table option (described later), there’s a separate limit on the number of .ibd files InnoDB can keep open at any time. This is handled by the InnoDB storage engine, not the MySQL server, and is controlled by innodb_open_files.
InnoDB doesn’t open files the same way MyISAM does: whereas MyISAM uses the table cache to hold file descriptors for open tables, in InnoDB there is no direct relationship between open tables and open files. InnoDB uses a single, global file descriptor for each .ibd file. If you can afford it, it’s best to set innodb_open_files large enough that the server can keep all .ibd files open simultaneously.
A few configuration options affect how MySQL synchronizes data to disk and performs recovery. These can affect performance dramatically, because they involve expensive I/O operations. They also represent a trade-off between performance and data safety. In general, it’s expensive to ensure that your data is written to disk immediately and consistently. If you’re willing to risk the danger that a disk write won’t really make it to permanent storage, you can increase concurrency and/or reduce I/O waits, but you’ll have to decide for yourself how much risk you can tolerate.
InnoDB permits you to control not only how it recovers, but also how it opens and flushes its data, which greatly affects recovery and overall performance. InnoDB’s recovery process is automatic and always runs when InnoDB starts, though you can influence what actions it takes. Leaving aside recovery and assuming nothing ever crashes or goes wrong, there’s still a lot to configure for InnoDB. It has a complex chain of buffers and files designed to increase performance and guarantee ACID properties, and each piece of the chain is configurable. Figure 8-1 illustrates these files and buffers.
A few of the most important things to change for normal usage are the InnoDB log file size, how InnoDB flushes its log buffer, and how InnoDB performs I/O.
InnoDB uses its log to reduce the cost of committing transactions. Instead of flushing the buffer pool to disk when each transaction commits, it logs the transactions. The changes transactions make to data and indexes often map to random locations in the tablespace, so flushing these changes to disk would require random I/O. InnoDB assumes it’s using conventional disks, where random I/O is much more expensive than sequential I/O because of the time it takes to seek to the correct location on disk and wait for the desired part of the disk to rotate under the head.
InnoDB uses its log to convert this random disk I/O into sequential I/O. Once the log is safely on disk, the transactions are permanent, even though the changes haven’t been written to the data files yet. If something bad happens (such as a power failure), InnoDB can replay the log and recover the committed transactions.
Of course, InnoDB does ultimately have to write the changes to the data files, because the log has a fixed size. It writes to the log in a circular fashion: when it reaches the end of the log, it wraps around to the beginning. It can’t overwrite a log record if the changes contained there haven’t been applied to the data files, because this would erase the only permanent record of the committed transaction.
InnoDB uses a background thread to flush the changes to the data files intelligently. This thread can group writes together and make the data writes sequential, for improved efficiency. In effect, the transaction log converts random data file I/O into mostly sequential log file and data file I/O. Moving flushes into the background makes queries complete more quickly and helps cushion the I/O system from spikes in the query load.
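The circular-log behavior described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not InnoDB's implementation: the class name CircularLog and its methods are hypothetical, positions are plain byte counters, and the background thread is reduced to an explicit checkpoint() call.

```python
class CircularLog:
    """Toy model of a fixed-size circular log (hypothetical, not InnoDB code)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.write_pos = 0       # total bytes ever appended to the log
        self.checkpoint_pos = 0  # bytes whose changes already reached the data files

    def free_space(self):
        # Only log space behind the checkpoint may be overwritten safely.
        return self.capacity - (self.write_pos - self.checkpoint_pos)

    def append(self, nbytes):
        """Return True if the record fits, False if the writer would have to wait."""
        if nbytes > self.free_space():
            return False
        self.write_pos += nbytes
        return True

    def checkpoint(self, nbytes):
        """Model the background thread applying nbytes of logged changes."""
        self.checkpoint_pos = min(self.write_pos, self.checkpoint_pos + nbytes)


log = CircularLog(capacity_bytes=10)
results = [log.append(6), log.append(6)]  # the second record does not fit yet
log.checkpoint(4)                         # flushing frees 4 bytes of log space
results.append(log.append(6))
print(results)
```

The second append fails because overwriting unapplied records would destroy the only durable copy of those changes; it succeeds once the checkpoint has advanced.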
The overall log file size is controlled by innodb_log_file_size and innodb_log_files_in_group, and it’s very important for write performance. The total size is the sum of each file’s size. By default there are two 5 MB files, for a total of 10 MB. This is much too small for a high-performance workload. You need hundreds of megabytes, or even gigabytes, of log files.
InnoDB uses multiple files as a single circular log. You usually don’t need to change the default number of logs, just the size of each log file. To change the log file size, shut down MySQL cleanly, move the old logs away, reconfigure, and restart. Be sure MySQL shuts down cleanly, or the log files will actually have entries that need to be applied to the data files! Watch the MySQL error log when you restart the server. After you’ve restarted successfully, you can delete the old log files.
To determine the ideal size for your log files, you’ll have to weigh the overhead of routine data changes against the recovery time required in the event of a crash. If the log is too small, InnoDB will have to do more checkpoints, causing more log writes. In extreme cases, write queries might stall and have to wait for changes to be applied to the data files before there is room to write into the log. On the other hand, if the log is too large, InnoDB might have to do a lot of work when it recovers. This can greatly increase recovery time, although this process is much more efficient in newer MySQL versions.
Your data size and access patterns will influence the recovery time, too. Suppose you have a terabyte of data and 16 GB of buffer pool, and your total log size is 128 MB. If you have a lot of dirty pages (i.e., pages whose changes have not yet been flushed to the data files) in the buffer pool and they are uniformly spread across your terabyte of data, recovery after a crash might take a long time. InnoDB will have to scan through the log, examine the data files, and apply changes to the data files as needed. That’s a lot of reading and writing! On the other hand, if the changes are localized—say, if only a few hundred megabytes of data are updated frequently—recovery might be fast, even when your data and log files are huge. Recovery time also depends on the size of a typical modification, which is related to your average row length. Short rows let more modifications fit in the log, so InnoDB might need to replay more modifications on recovery.[117]
When InnoDB changes any data, it writes a record of the change into its log buffer, which it keeps in memory. InnoDB flushes the buffer to the log files on disk when the buffer gets full, when a transaction commits, or once per second, whichever comes first. Increasing the buffer size, which is 1 MB by default, can help reduce I/O if you have large transactions. The variable that controls the buffer size is called innodb_log_buffer_size.
You usually don’t need to make the buffer very large. The recommended range is 1 to 8 MB, and this usually will be enough unless you write a lot of huge BLOB records. The log entries are very compact compared to InnoDB’s normal data. They are not page-based, so they don’t waste space storing whole pages at a time. InnoDB also makes log entries as short as possible. They are sometimes even stored as the function number and parameters of a C function!
There’s an additional circumstance where a larger value might be beneficial: when it can reduce contention during allocation of space in the buffer. When we’re configuring servers with a large amount of memory, we’ll sometimes allocate 32 to 128 MB of log buffer simply because spending such a relatively small amount of extra memory is not detrimental and it can help avoid pressure on a bottleneck. The bottleneck shows up as contention on the log buffer mutex when it’s a problem.
You can monitor InnoDB’s log and log buffer I/O performance by inspecting the LOG section of the output of SHOW INNODB STATUS, and by watching the Innodb_os_log_written status variable to see how much data InnoDB writes to the log files. A good rule of thumb is to watch it over intervals of 10 to 100 seconds and note the peak value. You can use this to judge whether your log buffer is sized right. For example, if you see a peak of 100 KB written to the log per second, a 1 MB log buffer is probably plenty.
You can also use this metric to decide on a good size for your log files. If the peak is 100 KB per second, a 256 MB log file is enough to store at least 2,560 seconds of log entries, which is likely to be enough. As a rule of thumb, you can make your total log file size large enough to hold an hour’s worth of server activity.
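The arithmetic behind this rule of thumb can be sketched as follows. The counter readings are made-up sample values, the function names are hypothetical, and binary units (1 KB = 1,024 bytes) are an assumption; only the Innodb_os_log_written variable and the 100 KB per second and 256 MB figures come from the text.

```python
KB = 1024
MB = 1024 * KB


def peak_bytes_per_second(samples, interval_seconds):
    """samples: successive Innodb_os_log_written readings, interval_seconds apart."""
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    return max(deltas) / interval_seconds


def seconds_of_log_capacity(total_log_bytes, bytes_per_second):
    """How long the log can absorb writes at the given rate before wrapping."""
    return total_log_bytes / bytes_per_second


# Hypothetical readings of the counter, taken every 10 seconds.
readings = [0, 400 * KB, 1400 * KB, 1900 * KB]
peak = peak_bytes_per_second(readings, interval_seconds=10)
capacity_seconds = seconds_of_log_capacity(256 * MB, peak)
one_hour_mb = peak * 3600 / MB

print(peak / KB)         # peak rate in KB per second
print(capacity_seconds)  # seconds of peak activity that fit in 256 MB
print(one_hour_mb)       # total log size, in MB, that holds an hour at the peak rate
```

With these sample readings the peak is 100 KB per second, so 256 MB holds a little over 2,600 seconds of log, and an hour of activity at that rate needs roughly 350 MB in total.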
When InnoDB flushes the log buffer to the log files on disk, it locks the buffer with a mutex, flushes it up to the desired point, and then moves any remaining entries to the front of the buffer. It is possible that more than one transaction will be ready to flush its log entries when the mutex is released. InnoDB has a group commit feature that can commit all of them to the log in a single I/O operation, but this is broken in MySQL 5.0 when the binary log is enabled. We wrote about group commit in the previous chapter.
The log buffer must be flushed to durable storage to ensure that committed transactions are fully durable. If you care more about performance than durability, you can change innodb_flush_log_at_trx_commit to control where and how often the log buffer is flushed. Possible settings are as follows:
0
Write the log buffer to the log file and flush the log file every second, but do nothing at transaction commit.
1
Write the log buffer to the log file and flush it to durable storage every time a transaction commits. This is the default (and safest) setting; it guarantees that you won’t lose any committed transactions, unless the disk or operating system “fakes” the flush operation.
2
Write the log buffer to the log file at every commit, but don’t flush it. InnoDB schedules a flush once every second. The most important difference from the 0 setting (and what makes 2 the preferable setting) is that 2 won’t lose any transactions if the MySQL process crashes. If the entire server crashes or loses power, however, you can still lose transactions.
It’s important to know the difference between writing the log buffer to the log file and flushing the log to durable storage. In most operating systems, writing the buffer to the log simply moves the data from InnoDB’s memory buffer to the operating system’s cache, which is also in memory. It doesn’t actually write the data to durable storage. Thus, settings 0 and 2 usually result in at most one second of lost data if there’s a crash or a power outage, because the data might exist only in the operating system’s cache. We say “usually” because InnoDB tries to flush the log file to disk about once per second no matter what, but it is possible to lose more than a second of transactions in some cases, such as when a flush gets stalled.
In contrast, flushing the log to durable storage means InnoDB asks the operating system to actually flush the data out of the cache and ensure it is written to the disk. This is a blocking I/O call that doesn’t complete until the data is completely written. Because writing data to a disk is slow, this can dramatically reduce the number of transactions InnoDB can commit per second when innodb_flush_log_at_trx_commit is set to 1. Today’s high-speed drives[118] can perform only a couple of hundred real disk transactions per second, simply because of the limitations of drive rotation speed and seek time.
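The distinction between writing and flushing is not specific to InnoDB; any program that appends to a file faces it. The following sketch shows the two steps with ordinary file I/O. The file name toy_log and the append_record helper are hypothetical, and the example says nothing about how InnoDB itself structures its log writes.

```python
import os
import tempfile


def append_record(path, record, durable):
    """Append record to path; if durable, also ask the OS to flush it to the device."""
    with open(path, "ab") as f:
        f.write(record)           # copy the bytes into the process's own buffer
        f.flush()                 # hand them to the operating system's cache
        if durable:
            os.fsync(f.fileno())  # block until the OS reports the data is on the device


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "toy_log")
    append_record(log_path, b"commit 1\n", durable=False)  # written, not flushed
    append_record(log_path, b"commit 2\n", durable=True)   # written and flushed
    with open(log_path, "rb") as f:
        contents = f.read()

print(contents)
```

Both records are visible to readers immediately, because reads are served from the operating system's cache. The difference matters only if the machine loses power before the cache is written out, which is exactly the window that settings 0 and 2 leave open.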
Sometimes the hard disk controller or operating system fakes a flush by putting the data into yet another cache, such as the hard disk’s own cache. This is faster but very dangerous, because the data might still be lost if the drive loses power. This is even worse than setting innodb_flush_log_at_trx_commit to something other than 1, because it can cause data corruption, not just lost transactions.
Setting innodb_flush_log_at_trx_commit to anything other than 1 can cause you to lose transactions. However, you might find the other settings useful if you don’t care about durability (the D in ACID). Maybe you just want some of InnoDB’s other features, such as clustered indexes, resistance to data corruption, and row-level locking. This is not uncommon when using InnoDB to replace MyISAM solely for performance reasons.
The best configuration for high-performance transactional needs is to leave innodb_flush_log_at_trx_commit set to 1 and place the log files on a RAID volume with a battery-backed write cache. This is both safe and very fast. In fact, we dare say that any production database server that’s expected to handle a serious workload needs to have this kind of hardware.
Percona Server extends innodb_flush_log_at_trx_commit to make it a per-session variable, instead of global for the whole server. This allows applications with varying performance and durability needs to use the same database, and avoids the one-size-fits-all solution offered by standard MySQL.
The innodb_flush_method option lets you configure how InnoDB actually interacts with the filesystem. Despite its name, it can affect how InnoDB reads data, not just how it writes it. The Windows and non-Windows values for this option are mutually exclusive: you can use async_unbuffered, unbuffered, and normal only on Windows, and you cannot use any other values on Windows. The default value is async_unbuffered on Windows and fdatasync on all other systems. (If SHOW GLOBAL VARIABLES shows the variable with an empty value, that means it’s set to the default.)
Changing how InnoDB performs I/O operations can impact performance greatly, so be sure you understand what you’re doing before you change anything!
This is a slightly confusing option, because it affects both the log files and the data files, and it sometimes does different things to each kind of file. It would be nice to have one configuration option for the logs and another for the data files, but they’re combined. Here are the possible values:
fdatasync
The default value on non-Windows systems: InnoDB uses fsync() to flush both data and log files.
InnoDB generally uses fsync() instead of fdatasync(), even though this value seems to indicate the contrary. fdatasync() is like fsync(), except it flushes only the file’s data, not its metadata (last modified time, etc.). Therefore, fsync() can cause more I/O. However, the InnoDB developers are very conservative, and they found that fdatasync() caused corruption in some cases. InnoDB determines which methods can be used safely; some options are set at compile time and some are discovered at runtime. It uses the fastest safe method it can.
The disadvantage of using fsync() is that the operating system buffers at least some of the data in its own cache. In theory, this is wasteful double buffering, because InnoDB manages its own buffers more intelligently than the operating system can. However, the ultimate effect is very system- and filesystem-dependent. The double buffering might not be a bad thing if it lets the filesystem do smarter I/O scheduling and batching. Some filesystems and operating systems can accumulate writes and execute them together, reorder them for efficiency, or write to multiple devices in parallel. They might also do read-ahead optimizations, such as instructing the disk to preread the next sequential block if several have been requested in sequence.
Sometimes these optimizations help, and sometimes they don’t. You can read your system’s manpage for fsync(2) if you’re curious about exactly what your version of fsync() does.
innodb_file_per_table causes each file to be fsync()ed separately, which means writes to multiple tables can’t be combined into a single I/O operation. This might require InnoDB to perform a higher total number of fsync() operations.
O_DIRECT
InnoDB uses the O_DIRECT flag, or directio(), depending on the system, on the data files. This option does not affect the log files and is not necessarily available on all Unix-like operating systems. At least GNU/Linux, FreeBSD, and Solaris (late 5.0 and newer) support it. Unlike the O_DSYNC flag, it affects both reads and writes.
This setting still uses fsync() to flush the files to disk, but it instructs the operating system not to cache the data and not to use read-ahead. This disables the operating system’s caches completely and makes all reads and writes go directly to the storage device, avoiding double buffering.
On most systems, this is implemented with a call to fcntl() to set the O_DIRECT flag on the file descriptor, so you can read the fcntl(2) manpage for your system’s details. On Solaris, this option uses directio().
If your RAID card does read-ahead, this setting will not disable that. It disables only the operating system’s and/or filesystem’s read-ahead capabilities.
You generally need a RAID card with a write cache set to a write-back policy if you use O_DIRECT, because that’s typically the only thing that keeps performance good. Using O_DIRECT when there is no buffer between InnoDB and the actual storage device, such as when you have no write cache on your RAID card, can cause performance to degrade greatly. This is a bit less of a problem nowadays with multiple write threads (and native asynchronous I/O introduced in MySQL 5.5), but it’s still the case in general.
This setting can cause the server’s warmup time to increase significantly, especially if the operating system’s cache is very large. It can also make a small buffer pool (e.g., a buffer pool of the default size) much slower than buffered I/O would. This is because the operating system won’t “help out” by keeping more of the data in its own cache. If the desired data isn’t in the buffer pool, InnoDB will have to read it directly from disk.
This setting does not impose any extra penalty on the use of innodb_file_per_table. However, the reverse can be true: if you do not use innodb_file_per_table, you can suffer from some serialization of I/O when you use O_DIRECT. This happens because some filesystems (including all of Linux’s ext filesystems) have a per-inode mutex. When you use O_DIRECT with such filesystems, you really need innodb_file_per_table to be enabled. We delve more into filesystems in the next chapter.
ALL_O_DIRECT
This option is available in Percona Server and MariaDB. It lets the server open the log files, not just the data files, in the same way that standard MySQL opens the data files.
O_DSYNC
This option sets the O_SYNC flag on the open() call for the log files. It makes all writes synchronous; in other words, writes do not return until the data is written to disk. This option does not affect the data files.
The difference between the O_SYNC flag and the O_DIRECT flag is that O_SYNC doesn’t disable caching at the operating system level. Therefore, it doesn’t avoid double buffering, and it doesn’t make writes go directly to disk. With O_SYNC, writes modify the data in the cache, and then it is sent to the disk.
While synchronous writes with O_SYNC might sound very similar to what fsync() does, the two can be implemented very differently on both the operating system and the hardware level. When the O_SYNC flag is used, the operating system might pass a “use synchronous I/O” flag down to the hardware level, telling the device not to use caches. On the other hand, fsync() tells the operating system to flush modified buffers to the device, followed by an instruction for the device to flush its own caches, if applicable, so it is certain that the data has been recorded on the physical media. Another difference is that with O_SYNC, every write() or pwrite() operation syncs data to disk before it finishes, blocking the calling process. In contrast, writing without the O_SYNC flag and then calling fsync() allows writes to accumulate in the cache (which makes each write fast), and then flushes them all at once.
Again, despite its name, this option sets the O_SYNC flag, not the O_DSYNC flag, because the InnoDB developers found bugs with O_DSYNC. O_SYNC and O_DSYNC are similar to fsync() and fdatasync(): O_SYNC syncs both data and metadata, whereas O_DSYNC syncs data only.
async_unbuffered
This is the default value on Windows. This option causes InnoDB to use unbuffered I/O for most writes; the exception is that it uses buffered I/O to the log files when innodb_flush_log_at_trx_commit is set to 2.
This setting causes InnoDB to use the operating system’s native asynchronous (overlapped) I/O for both reads and writes on Windows 2000, XP, and newer. On older Windows versions, InnoDB uses its own asynchronous I/O, which is implemented with threads.
unbuffered
Windows-only. This option is similar to async_unbuffered but does not use native asynchronous I/O.
normal
Windows-only. This option causes InnoDB not to use native asynchronous I/O or unbuffered I/O.
nosync
and littlesync
For development use only. These options are undocumented and unsafe for production; they should not be used.
If that all seemed like a lot of explanation with no advice, here’s the advice: if you use a Unix-like operating system and your RAID controller has a battery-backed write cache, we recommend that you use O_DIRECT. If not, either the default or O_DIRECT will probably be the best choice, depending on your application.
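The contrast drawn under O_DSYNC between opening a file with O_SYNC and writing normally followed by fsync() can be tried out at the system-call level. This is a sketch of the two calling patterns only, with hypothetical helper and file names; it assumes a Unix-like system (where os.O_SYNC exists, otherwise the flag is silently omitted) and does not reproduce InnoDB's own I/O code.

```python
import os
import tempfile

# O_SYNC is not defined on every platform; fall back to 0 (no flag) where it is missing.
O_SYNC = getattr(os, "O_SYNC", 0)


def write_with_o_sync(path, records):
    """Each os.write() returns only after the OS reports that record as synced."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | O_SYNC, 0o600)
    try:
        for record in records:
            os.write(fd, record)
    finally:
        os.close(fd)


def write_then_fsync(path, records):
    """Writes accumulate in the OS cache; one fsync() at the end flushes them together."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for record in records:
            os.write(fd, record)
        os.fsync(fd)
    finally:
        os.close(fd)


records = [b"a\n", b"b\n", b"c\n"]
with tempfile.TemporaryDirectory() as tmp:
    sync_path = os.path.join(tmp, "o_sync_log")
    fsync_path = os.path.join(tmp, "fsync_log")
    write_with_o_sync(sync_path, records)
    write_then_fsync(fsync_path, records)
    with open(sync_path, "rb") as f1, open(fsync_path, "rb") as f2:
        same = f1.read() == f2.read() == b"a\nb\nc\n"

print(same)
```

The files end up identical; what differs is when each call blocks. The first pattern pays the synchronization cost on every write, and the second pays it once for the whole batch.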
InnoDB keeps its data in a tablespace, which is essentially a virtual filesystem spanning one or many files on disk. InnoDB uses the tablespace for many purposes, not just for storing tables and indexes. It keeps its undo log (old row versions), insert buffer, doublewrite buffer (described in an upcoming section), and other internal structures in the tablespace.
You specify the tablespace files with the innodb_data_file_path configuration option. The files are all contained in the directory given by innodb_data_home_dir. Here’s an example:
innodb_data_home_dir = /var/lib/mysql/
innodb_data_file_path = ibdata1:1G;ibdata2:1G;ibdata3:1G
That creates a 3 GB tablespace in three files. Sometimes people wonder whether they can use multiple files to spread load across drives, like this:
innodb_data_file_path = /disk1/ibdata1:1G;/disk2/ibdata2:1G;...
While that does indeed place the files in different directories, which represent different drives in this example, InnoDB concatenates the files end-to-end. Thus, you usually don’t gain much this way. InnoDB will fill the first file, then the second when the first is full, and so on; the load isn’t really spread in the fashion you need for higher performance. A RAID controller is a smarter way to spread load.
To allow the tablespace to grow if it runs out of space, you can make the last file autoextend as follows:
...ibdata3:1G:autoextend
The default behavior is to create a single 10 MB autoextending file. If you make the file autoextend, it’s a good idea to place an upper limit on the tablespace’s size to keep it from growing very large, because once it grows, it doesn’t shrink. For example, the following setting limits the autoextending file to 2 GB:
...ibdata3:1G:autoextend:max:2G
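To check what a given innodb_data_file_path value adds up to, a small parser is enough. The one below is a hypothetical helper written for illustration, not part of MySQL; it handles only the name:size[:autoextend[:max:size]] form shown in the examples above, assumes binary units for the K, M, and G suffixes, and would need extra care for file names that themselves contain colons.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a size such as '1G' or '10M' to bytes."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def parse_data_file_path(value):
    """Return one dict per file named in an innodb_data_file_path value."""
    files = []
    for spec in value.split(";"):
        parts = spec.split(":")
        entry = {"name": parts[0], "size": parse_size(parts[1]),
                 "autoextend": False, "max": None}
        options = parts[2:]
        if options and options[0] == "autoextend":
            entry["autoextend"] = True
            if len(options) == 3 and options[1] == "max":
                entry["max"] = parse_size(options[2])
        files.append(entry)
    return files


files = parse_data_file_path("ibdata1:1G;ibdata2:1G;ibdata3:1G:autoextend:max:2G")
initial = sum(f["size"] for f in files)               # space allocated up front
ceiling = sum(f["max"] or f["size"] for f in files)   # largest the tablespace can become
print(initial // UNITS["G"], ceiling // UNITS["G"])
```

For the three-file example with the last file capped at 2 GB, the tablespace starts at 3 GB and can grow to at most 4 GB.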
Managing a single tablespace can be a hassle, especially if it autoextends and you want to reclaim the space (for this reason, we recommend disabling the autoextend feature, or at least setting a reasonable cap on the space). The only way to reclaim space is to dump your data, shut down MySQL, delete all the files, change the configuration, restart, let InnoDB create new empty files, and restore your data. InnoDB is completely unforgiving about its tablespace—you cannot simply remove files or change their sizes. It will refuse to start if you corrupt its tablespace. It is likewise very strict about its log files. If you’re used to casually moving files around with MyISAM, take heed!
The innodb_file_per_table option lets you configure InnoDB to use one file per table in MySQL 4.1 and later. It stores the data in the database directory as tablename.ibd files. This makes it easier to reclaim space when you drop a table, and it can be useful for spreading tables across multiple disks. However, placing the data in multiple files can actually result in more wasted space overall, because it trades internal fragmentation in the single InnoDB tablespace for wasted space in the .ibd files. This is more of an issue for very small tables, because InnoDB’s page size is 16 KB. Even if your table has only 1 KB of data, it will still require at least 16 KB on disk.
Even if you enable the innodb_file_per_table option, you’ll still need the main tablespace for the undo logs and other system data. It will be smaller if you’re not storing all the data in it, but it’s still a good idea to disable autoextend, because you can’t shrink the file without reloading all your data.
Some people like to use innodb_file_per_table just because of the extra manageability and visibility it gives you. For example, it’s much faster to find a table’s size by examining a single file than it is to use SHOW TABLE STATUS, which has to perform more complex work to determine how many pages are allocated to a table.
There is a dark side to innodb_file_per_table: slow DROP TABLE performance. This can be severe enough to cause a noticeable server-wide stall, for two reasons:
Dropping the table unlinks (deletes) the file at the filesystem level, which can be very slow on some filesystems (ext3, we’re looking at you). You can shorten the duration of this with tricks on the filesystem: link the .ibd file to a zero-sized file, then delete the file manually, instead of waiting for MySQL to do it.
When you enable this option, each table gets its own tablespace inside InnoDB. It turns out that removing the tablespace actually requires InnoDB to lock and scan the buffer pool while it looks for pages belonging to this tablespace, which is very slow on a server with a large buffer pool. If you’re going to be dropping a lot of InnoDB tables (including temporary tables) and you use innodb_file_per_table, you might benefit from the fix included with Percona Server, which lets the server lazily invalidate the pages belonging to the dropped tables. You just need to set the innodb_lazy_drop_table option.
What’s the final recommendation? We suggest that you use innodb_file_per_table and cap the size of your shared tablespace to make your life easier. If you run into any circumstances that make this painful, as noted above, consider one of the fixes we suggested.
We should also note that you don’t actually have to store your InnoDB files in a traditional filesystem. Like many traditional database servers, InnoDB offers the option of using a raw device—i.e., an unformatted partition—for its storage. However, today’s filesystems can handle sufficiently large files that you shouldn’t need to use this option. Using raw devices might improve performance by a few percentage points, but we don’t think this small increase justifies the disadvantages of not being able to manipulate the data as files. When you store your data on a raw partition, you can’t use mv, cp, or any other tools on it. Ultimately, the tiny performance gains you get from using raw devices aren’t worth the extra hassle.
InnoDB’s tablespace can grow very large in a write-heavy environment. If transactions stay open for a long time (even if they’re not doing any work) and they’re using the default REPEATABLE READ transaction isolation level, InnoDB won’t be able to remove old row versions, because the uncommitted transactions will still need to be able to see them. InnoDB stores the old versions in the tablespace, so it continues to grow as more data is updated. Sometimes the problem isn’t uncommitted transactions, but just the workload: the purge process is only a single thread until recent versions of MySQL, and it might not be able to keep up with the number of old row versions that need to be purged.
In either case, the output of SHOW INNODB STATUS can help you pinpoint the problem. Look at the history list length; it shows the size of the undo log, in units of pages.
You can corroborate this by examining the first and second lines of the TRANSACTIONS section, which show the current transaction number and the point to which the purge has completed. If the difference is large, you might have a lot of unpurged transactions. Here’s an example:
------------
TRANSACTIONS
------------
Trx id counter 0 80157601
Purge done for trx's n:o <0 80154573 undo n:o <0 0
The transaction identifier is a 64-bit number composed of two 32-bit numbers (it’s a hexadecimal number in newer versions of InnoDB), so you might have to do a little math to compute the difference. In this case it’s easy, because the high bits are just zeros: there are 80,157,601 – 80,154,573 = 3,028 potentially unpurged transactions (innotop can do this math for you). We said “potentially” because a large difference doesn’t necessarily mean there are a lot of unpurged rows. Only transactions that change data will create old row versions, and there might be many transactions that haven’t changed any data (conversely, a single transaction could have changed many rows).
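The “little math” can be written out explicitly. In this sketch the helper name trx_id is hypothetical; it simply treats the two printed numbers as the high and low 32-bit halves of one 64-bit value, as described above, and uses the values from the sample output.

```python
def trx_id(high, low):
    """Combine the two 32-bit halves printed by older InnoDB versions."""
    return (high << 32) | low


counter = trx_id(0, 80157601)       # from the "Trx id counter" line
purged_up_to = trx_id(0, 80154573)  # from the "Purge done for trx's n:o" line
unpurged = counter - purged_up_to
print(unpurged)  # 3028
```

When the high halves are nonzero the same subtraction still works, because the shift places each half in its proper position. For versions that print a single hexadecimal identifier, int(text, 16) gives the number directly.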
If you have a large undo log and your tablespace is growing because of it, you can force MySQL to slow down enough for InnoDB’s purge thread to keep up. This might not sound attractive, but there’s no alternative. Otherwise, InnoDB will keep writing data and filling up your disk until the disk runs out of space or the tablespace reaches the limits you’ve defined.
To throttle the writes, set the innodb_max_purge_lag variable to a value other than 0. This value indicates the maximum number of transactions that can be waiting to be purged before InnoDB starts to delay further queries that update data. You’ll have to know your workload to decide on a good value. As an example, if your average transaction affects 1 KB of rows and you can tolerate 100 MB of unpurged rows in your tablespace, you could set the value to 100000.
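The estimate is a single division, shown here so you can substitute your own numbers. The function name is hypothetical, and decimal units (1 KB = 1,000 bytes) are assumed so that the result matches the round figure in the text.

```python
def max_purge_lag(tolerable_unpurged_bytes, avg_bytes_per_transaction):
    """Number of unpurged transactions that fits within the tolerated space."""
    return tolerable_unpurged_bytes // avg_bytes_per_transaction


# 100 MB of tolerated unpurged rows, 1 KB changed per transaction on average.
lag = max_purge_lag(100 * 1000 * 1000, 1000)
print(lag)  # 100000
```

The average transaction size is the uncertain input here, so treat the result as a starting point to refine by observation.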
Bear in mind that unpurged row versions impact all queries, because they effectively make your tables and indexes larger. If the purge thread simply can’t keep up, performance can decrease dramatically. Setting the innodb_max_purge_lag variable will slow down performance too, but it’s the lesser of the two evils.[119]
In newer versions of MySQL, and even in older versions of Percona Server and MariaDB, the purging process is significantly improved and separated from other internal housekeeping tasks. You can even create multiple dedicated purge threads to do this background work more quickly. This is a better option than throttling the server, if you can take advantage of it.
InnoDB uses a doublewrite buffer to avoid data corruption in case of partial page writes. A partial page write occurs when a disk write doesn’t complete fully, and only a portion of a 16 KB page is written to disk. There are a variety of reasons (crashes, bugs, and so on) that a page might be partially written to disk. The doublewrite buffer guards against data corruption if this happens.
The doublewrite buffer is a special reserved area of the tablespace, large enough to hold 100 pages in a contiguous block. It is essentially a backup copy of recently written pages. When InnoDB flushes pages from the buffer pool to the disk, it writes (and flushes) them first to the doublewrite buffer, then to the main data area where they really belong. This ensures that every page write is atomic and durable.
Doesn’t this mean that every page is written twice? Yes, it does, but because InnoDB writes several pages to the doublewrite buffer sequentially and only then calls fsync() to sync them to disk, the performance impact is relatively small—generally a few percentage points, not double, although the overhead is more noticeable on solid-state drives, as we’ll discuss in the next chapter. More importantly, this strategy allows the log files to be much more efficient. Because the doublewrite buffer gives InnoDB a very strong guarantee that the data pages are not corrupt, InnoDB’s log records don’t have to contain full pages; they are more like binary deltas to pages.
If there’s a partial page write to the doublewrite buffer itself, the original page will still be on disk in its real location. When InnoDB recovers, it will use the original page instead of the corrupted copy in the doublewrite buffer. However, if the doublewrite buffer succeeds and the write to the page’s real location fails, InnoDB will use the copy in the doublewrite buffer during recovery. InnoDB knows when a page is corrupt because each page has a checksum at the end; the checksum is the last thing to be written, so if the page’s contents don’t match the checksum, the page is corrupt. Upon recovery, therefore, InnoDB just reads each page in the doublewrite buffer and verifies the checksums. If a page’s checksum is incorrect, it reads the page from its original location.
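The recovery decision just described can be illustrated with a toy model. This is a sketch under loud assumptions: the helper names are hypothetical, the "pages" are a few bytes long, and zlib's CRC32 stands in for InnoDB's real page checksum, which works differently. The only point is the selection rule: trust the doublewrite copy if its checksum verifies, otherwise fall back to the page in its real location.

```python
import zlib


def make_page(body):
    """Build a toy page: the body followed by a 4-byte checksum trailer."""
    return body + zlib.crc32(body).to_bytes(4, "big")


def is_intact(page):
    """A page is intact when its trailer matches the checksum of its body."""
    if len(page) < 4:
        return False
    body, stored = page[:-4], page[-4:]
    return zlib.crc32(body).to_bytes(4, "big") == stored


def recover_page(doublewrite_copy, in_place_copy):
    """Pick the page version to keep after a crash (toy model)."""
    if is_intact(doublewrite_copy):
        # The doublewrite write completed, so this copy is safe to use
        # even if the later write to the real location was torn.
        return doublewrite_copy
    # The doublewrite write itself was torn, so the real location was
    # never touched and still holds the previous intact version.
    return in_place_copy


old = make_page(b"balance=100")
new = make_page(b"balance=200")
torn = new[:-4] + old[-4:]  # new body, stale checksum: a partial write
```

With these toy pages, `recover_page(torn, old)` returns the old page, and `recover_page(new, torn)` returns the new one.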
In some cases, the doublewrite buffer really isn’t necessary—for example, you might want to disable it on replicas. Also, some filesystems (such as ZFS) do the same thing themselves, so it is redundant for InnoDB to do it. You can disable the doublewrite buffer by setting innodb_doublewrite to 0. In Percona Server, you can configure the doublewrite buffer to be stored in its own file, so you can separate this workload from the rest of the server’s work by placing it on separate disk drives.
The sync_binlog option controls how MySQL flushes the binary log to disk. Its default value is 0, which means MySQL does no flushing and it’s up to the operating system to decide when to flush its cache to durable storage. If the value is greater than 0, it specifies how many binary log writes happen between flushes to disk (each write is a single statement if autocommit is set, and otherwise a transaction). It’s rare to set this option to anything other than 0 or 1.
If you don’t set sync_binlog to 1, it’s likely that a crash will cause your binary log to be out of sync with your transactional data. This can easily break replication and make point-in-time recovery impossible. However, the safety provided by setting this option to 1 comes at a high price. Synchronizing the binary log and the transaction log requires MySQL to flush two files in two distinct locations. This might require a disk seek, which is relatively slow.
As with the InnoDB log file, placing the binary log on a RAID volume with a battery-backed write cache can give a huge performance boost. In fact, writing and flushing the binary logs is actually more expensive than writing and flushing the InnoDB transaction logs, because unlike the InnoDB transaction logs, every write to the binary logs increases their size. That requires a metadata update at the filesystem level for every write. Thus, setting sync_binlog=1 can be much more detrimental to performance than setting innodb_flush_log_at_trx_commit=1, especially on network filesystems such as NFS.
A non-performance-related note on the binary logs: if you want to use the expire_logs_days option to remove old binary logs automatically, don’t remove them with rm. The server will get confused and refuse to remove them automatically, and PURGE MASTER LOGS will stop working. The solution, should you find yourself entangled in this situation, is to manually resync the hostname-bin.index file with the list of files that still exist on disk.
We cover RAID in more depth in the next chapter, but it’s worth repeating here that good-quality RAID controllers, with battery-backed write caches set to use the write-back policy, can handle thousands of writes per second and still give you durable storage. The data gets written to a fast cache with a battery, so it will survive even if the system loses power. When the power comes back, the RAID controller will write the data from the cache to the disk before making the disk available for use. Thus, a good RAID controller with a large enough battery-backed write cache can improve performance dramatically and is a very good investment. Of course, solid-state storage is another option; we also cover that in the next chapter.
Let’s begin by considering how MyISAM performs I/O for its indexes. MyISAM normally flushes index changes to disk after every write. If you’re going to make many modifications to a table, however, it might be faster to batch these writes together.
One way to do this is with LOCK TABLES, which defers writes until you unlock the tables. This can be a valuable technique for improving performance, because it lets you control exactly which writes are deferred and when the writes are flushed to disk. You can defer writes for precisely the statements you want.
You can also defer index writes by using the delay_key_write variable. If you do this, modified key buffer blocks are not flushed until the table is closed.[120] The possible settings are as follows:
OFF
MyISAM flushes modified blocks in the key buffer (key cache) to disk after every write, unless the table is locked with LOCK TABLES.
ON
Delayed key writes are enabled, but only for tables created with the DELAY_KEY_WRITE option.
ALL
All MyISAM tables use delayed key writes.
Delaying key writes can be helpful in some cases, but it doesn’t usually create a big performance boost. It’s most useful with smaller data sizes, when the key cache’s read hit ratio is good but the write hit ratio is bad. It also has quite a few drawbacks:
If the server crashes and the blocks haven’t been flushed to disk, the index will be corrupt.
If many writes are delayed, it’ll take longer for MySQL to close a table, because it will have to wait for the buffers to be flushed to disk. This can cause long table cache locks in MySQL 5.0.
FLUSH TABLES can take a long time, for the reason just mentioned. This in turn can increase the time it takes to run FLUSH TABLES WITH READ LOCK for a logical volume manager (LVM) snapshot or other backup operation.
Unflushed dirty blocks in the key buffer might not leave any room in the buffer for new blocks to be read from disk. Therefore, queries might stall while waiting for MyISAM to free up some space in the key buffer.
In addition to configuring MyISAM’s index I/O, you can configure how MyISAM tries to recover from corruption. The myisam_recover option controls how MyISAM looks for and repairs errors. You have to set this option in the configuration file or at the command line. You can view, but not change, the option’s value with this SQL statement (this is not a typo—the system variable has a different name from the corresponding command-line option):
mysql> SHOW VARIABLES LIKE 'myisam_recover_options';
Enabling this option instructs MySQL to check MyISAM tables for corruption when it opens them, and to repair them if problems are found. You can set the following values:
DEFAULT (or no setting)
Instructs MySQL to try to repair any table that is marked as having crashed or not marked as having been closed cleanly. The default setting performs no other actions upon recovery. In contrast to how most variables work, this DEFAULT value is not an instruction to reset the variable to its compiled-in value; it essentially means “no setting.”
BACKUP
Makes MySQL write a backup of the data file into a .BAK file, which you can examine afterward.
FORCE
Makes recovery continue even if more than one row will be lost from the .MYD file.
QUICK
Skips recovery unless there are delete blocks. These are blocks of deleted rows that are still occupying space and can be reused for future INSERT statements. This can be useful because MyISAM recovery can take a very long time on large tables.
You can use multiple settings, separated by commas. For example, BACKUP,FORCE will force recovery and create a backup. This is what we used in our sample configuration file earlier in this chapter.
We recommend that you enable this option, especially if you have just a few small MyISAM tables. Running a server with corrupted MyISAM tables is dangerous, because they can sometimes cause more data corruption and even server crashes. However, if you have large tables, automatic recovery might be impractical: it causes the server to check and repair all MyISAM tables when they’re opened, which is inefficient. During this time, MySQL tends to block connections from performing any work. If you have a lot of MyISAM tables, it might be a good idea to use a less intrusive process that runs CHECK TABLES and REPAIR TABLES after startup.[121] Either way, it is very important to check and repair the tables.
Enabling memory-mapped access to data files is another useful MyISAM option. Memory mapping lets MyISAM access the .MYD files directly via the operating system’s page cache, avoiding costly system calls. In MySQL 5.1 and newer, you can enable memory mapping with the myisam_use_mmap option. Older versions of MySQL use memory mapping for compressed MyISAM tables only.
When you’re running MySQL in a high-concurrency workload, you might run into bottlenecks you wouldn’t otherwise experience. This section explains how to detect these problems when they happen, and how to get the best performance possible under these workloads for MyISAM and InnoDB.
InnoDB is designed for high concurrency, and it has improved dramatically in the last few years, but it’s still not perfect. The InnoDB architecture still shows some roots in limited-memory, single-CPU, single-disk systems. Some aspects of InnoDB’s performance can degrade in high-concurrency situations, and your only recourse is to limit concurrency. You can use the techniques shown in Chapter 3 to diagnose concurrency problems.
If you have problems with InnoDB concurrency, the solution is usually to upgrade the server. In comparison with current versions, older versions such as MySQL 5.0 and early MySQL 5.1 were an unmitigated disaster under high concurrency. Everything queued on global mutexes such as the buffer pool mutex, and the server practically ground to a halt. If you upgrade to one of the newer versions of MySQL, you don’t need to limit concurrency in most cases.
If you do, here’s how it works. InnoDB has its own “thread scheduler” that controls how threads enter its kernel to access data, and what they can do once they’re inside the kernel. The most basic way to limit concurrency is with the innodb_thread_concurrency variable, which limits how many threads can be in the kernel at once. A value of 0 means there is no limit on the number of threads. If you are having InnoDB concurrency problems in older MySQL versions, this variable is the most important one to configure.[122]
It’s impossible to name a good value for any given architecture and workload. In theory, the following formula gives a good value:
concurrency = Number of CPUs * Number of Disks * 2
But in practice, it can be better to use a much smaller value. You will have to experiment to find the best value for your system.
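As a quick illustration of the theoretical formula, here is the calculation in Python. The function name and the sample hardware figures are hypothetical, and as noted above, the result is at best an upper bound to start experimenting from, since smaller values often work better in practice.

```python
def theoretical_thread_concurrency(num_cpus, num_disks):
    """Apply the rule of thumb: CPUs * disks * 2."""
    if num_cpus < 1 or num_disks < 1:
        raise ValueError("need at least one CPU and one disk")
    return num_cpus * num_disks * 2


# A hypothetical server with 8 CPUs and 4 disks
print(theoretical_thread_concurrency(8, 4))  # 64
```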
If more than the allowed number of threads are already in the kernel, a thread can’t enter the kernel. InnoDB uses a two-phase process to try to let threads enter as efficiently as possible. The two-phase policy reduces the overhead of context switches caused by the operating system scheduler. The thread first sleeps for innodb_thread_sleep_delay microseconds, and then tries again. If it still can’t enter, it goes into a queue of waiting threads and yields to the operating system.
The default sleep time in the first phase is 10,000 microseconds. Changing this value can help in high-concurrency environments, when the CPU is underused with a lot of threads in the “sleeping before entering queue” status. The default value can also be much too large if you have a lot of small queries, because it adds 10 milliseconds to query latency.
Once a thread is inside the kernel, it has a certain number of “tickets” that let it back into the kernel for “free,” without any concurrency checks. This limits how much work it can do before it has to get back in line with other waiting threads. The innodb_concurrency_tickets option controls the number of tickets. It rarely needs to be changed unless you have a lot of extremely long-running queries. Tickets are granted per-query, not per-transaction. Once a query finishes, its unused tickets are discarded.
In addition to the bottlenecks in the buffer pool and other structures, there’s another concurrency bottleneck at the commit stage, which is largely I/O-bound because of flush operations. The innodb_commit_concurrency variable governs how many threads can commit at the same time. Configuring this option might help if there’s a lot of thread thrashing even when innodb_thread_concurrency is set to a low value.
Finally, there’s a new solution that might be worth considering: using a thread pool to limit concurrency. The original thread pool implementation was in the abandoned MySQL 6.0 source tree, and had serious flaws. But it’s been reimplemented in MariaDB, and Oracle has recently released a commercial plugin to provide a thread pool for MySQL 5.5. We don’t have enough experience with either of these to guide you, so we’ll confuse you further by pointing out that neither implementation seemed to satisfy Facebook, which has met its unique needs with so-called “admission control” features in its own private branch of MySQL. Hopefully by the fourth edition of this book we’ll have some more knowledge to share on thread pools and when they work or don’t work.
MyISAM allows concurrent inserts and reads under some conditions, and it lets you “schedule” some operations to try to block as little as possible.
Before we look at MyISAM’s concurrency settings, it’s important to understand how MyISAM deletes and inserts rows. Delete operations don’t rearrange the entire table; they just mark rows as deleted, leaving “holes” in the table. MyISAM prefers to fill the holes if it can, reusing the spaces for inserted rows. If there are no holes, it appends new rows to the end of the table.
Even though MyISAM has table-level locks, it can append new rows concurrently with reads. It does this by stopping the reads at the last row that existed when they began. This avoids inconsistent reads.
However, it is much more difficult to provide consistent reads when something is changing the middle of the table. MVCC is the most popular way to solve this problem: it lets readers read old versions of data while writers create new versions. However, MyISAM doesn’t support MVCC as InnoDB does, so it doesn’t support concurrent inserts unless they go at the end of the table.
You can configure MyISAM’s concurrent insert behavior with the concurrent_insert variable, which can have the following values:
0
MyISAM allows no concurrent inserts; every insert locks the table exclusively.
1
This is the default value. MyISAM allows concurrent inserts, as long as there are no holes in the table.
2
This value is available in MySQL 5.0 and newer. It forces concurrent inserts to append to the end of the table, even when there are holes. If there are no threads reading from the table, MySQL will place the new rows in the holes. The table can become more fragmented than usual with this setting.
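The three settings can be summarized as a small decision function. This is a simplified model of the rules described above, not MySQL's actual locking code; the function name, its parameters, and the tuple it returns are all hypothetical.

```python
def plan_insert(concurrent_insert, table_has_holes, readers_active):
    """Model where a MyISAM insert goes and whether it can overlap reads.

    Returns (runs_concurrently_with_reads, placement), where placement is
    "end" (append to the table) or "holes" (reuse space from deleted rows).
    """
    if concurrent_insert not in (0, 1, 2):
        raise ValueError("concurrent_insert must be 0, 1, or 2")

    if concurrent_insert == 2 and readers_active:
        # Value 2: append even if holes exist, so readers are not blocked.
        return (True, "end")

    if table_has_holes:
        # Filling a hole changes the middle of the table, which cannot
        # safely overlap with reads because MyISAM has no MVCC.
        return (False, "holes")

    # No holes: the row is appended. Appends can overlap with reads
    # unless concurrent inserts are disabled entirely (value 0).
    return (concurrent_insert != 0, "end")
```

For example, with the default value of 1, `plan_insert(1, True, True)` returns `(False, "holes")`, while the same table under value 2 gives `(True, "end")`, at the cost of extra fragmentation.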
You can also configure MySQL to delay some operations to a later time, when they can be combined for greater efficiency. For instance, you can delay index writes with the delay_key_write variable, which we mentioned earlier in this chapter. This involves the familiar trade-off: write the index right away (safe but expensive), or wait and hope the power doesn’t fail before the write happens (faster, but likely to cause massive index corruption in the event of a crash because the index file will be very out of date).
You can also give INSERT, REPLACE, DELETE, and UPDATE queries lower priority than SELECT queries with the low_priority_updates option. This is equivalent to globally applying the LOW_PRIORITY modifier to UPDATE queries. It’s actually a very important option when you use MyISAM; it lets you get decent concurrency for SELECT queries that would otherwise starve in the presence of a very small number of queries getting top priority for write locks.
Finally, even though InnoDB’s scalability issues are more often talked about, MyISAM has also had problems with mutexes for a long time. In MySQL 4.0 and earlier, a global mutex protected any I/O to the key buffer, which caused scalability problems with multiple CPUs and multiple disks. MySQL 4.1’s key buffer code is improved and doesn’t have this problem anymore, but it still holds a mutex on each key buffer. This is an issue when a thread copies key blocks from the key buffer into its local storage, rather than reading from the disk. The disk bottleneck is gone, but there’s still a bottleneck when accessing data in the key buffer. You can sometimes work around this problem with multiple key buffers, but this approach isn’t always successful. For example, there’s no way to solve the problem when it involves only a single index. As a result, concurrent SELECT queries can perform significantly worse on multi-CPU machines than on a single-CPU machine, even when these are the only queries running. MariaDB offers segmented (partitioned) key buffers, which can help significantly when you experience this problem.
One goal of configuring your server is to customize it for your specific workload. This requires intimate knowledge of the number, type, and frequency of all kinds of server activities—not just queries, but other activities too, such as connecting to the server and flushing tables.
The first thing you should do, if you haven’t done it already, is become familiar with your server. Know what kinds of queries run on it. Monitor it with tools such as innotop, and use pt-query-digest to create a query report. It’s helpful to know not only what your server is doing overall, but what each MySQL query spends a lot of time doing. Chapter 3 explains how to find this out.
Try to log all queries when your server is running at full capacity, because that’s the best way to see what kinds of queries suffer most. At the same time, capture snapshots of the process list and aggregate them by their state or command (innotop can do this for you, or you can use the scripts shown in Chapter 3). For example, are there a lot of queries copying results to temporary tables, or sorting results? If so, you might need to optimize the queries, and potentially look at the configuration settings for temporary tables and sort buffers.
BLOB and TEXT columns are a special type of workload for MySQL. (We’ll refer to all of the BLOB and TEXT types as BLOB here for simplicity, because they belong to the same class of data types.) There are several restrictions on BLOB values that make the server treat them differently from other types. One of the most important considerations is that the server cannot use in-memory temporary tables for BLOB values.[123] Thus, if a query involving BLOB values requires a temporary table—no matter how small—it will go to disk immediately. This is very inefficient, especially for otherwise small and fast queries. The temporary table could be most of the query’s cost.
There are two ways to ease this penalty: convert the values to VARCHAR with the SUBSTRING() function (see Chapter 4 for more on this), or make temporary tables faster.
The best way to make temporary tables faster is to place them on a memory-based filesystem (tmpfs on GNU/Linux). This removes some overhead, although it’s still much slower than using in-memory tables. Using a memory-based filesystem is helpful because the operating system tries to avoid writing data to disk.[124] Normal filesystems are cached in memory too, but the operating system might flush normal filesystem data every few seconds. A tmpfs filesystem never gets flushed. The tmpfs filesystem is also designed for low overhead and simplicity. For example, there’s no need for the filesystem to make any provisions for recovery. That makes it faster.
The server setting that controls where temporary tables are placed is tmpdir. Monitor how full the filesystem gets to ensure you have enough space for temporary tables. If necessary, you can even specify several temporary table locations, which MySQL will use in a round-robin fashion.
If your BLOB columns are very large and you use InnoDB, you might also want to increase InnoDB’s log buffer size. We wrote more about this earlier in this chapter.
For long variable-length columns (e.g., BLOB, TEXT, and long character columns), InnoDB stores a 768-byte prefix in-page with the rest of the row.[125] If the column’s value is longer than this prefix length, InnoDB might allocate external storage space outside the row to store the rest of the value. It allocates this space in whole 16 KB pages, just like all other InnoDB pages, and each column gets its own page (columns do not share external storage space). InnoDB allocates external storage space to a column a page at a time until 32 pages are used; then it allocates 64 pages at a time.
Note that we said InnoDB might allocate external storage. If the total length of the row, including the full value of the long column, is shorter than InnoDB’s maximum row length (a little less than 8 KB), InnoDB will not allocate external storage even if the long column’s value exceeds the prefix length.
Finally, when InnoDB updates a long column that is placed in external storage, it doesn’t update it in place. Instead, it writes the new value to a new location in external storage and deletes the old value.
All of this has the following consequences:
Long columns can waste a lot of space in InnoDB. For example, if you store a column value that is one byte too long to fit in the row, it will use an entire page to store the remaining byte, wasting most of the page. Likewise, if you have a value that is slightly more than 32 pages long, it might actually use 96 pages on disk.
External storage disables the adaptive hash index, which needs to compare the full length of columns to verify that it has found the right data. (The hash helps InnoDB find “guesses” very quickly, but it must check that its “guess” is correct.) Because the adaptive hash index is completely in-memory and is built directly “on top of” frequently accessed pages in the buffer pool, it doesn’t work with external storage.
Long values can make any query with a WHERE clause that doesn’t use an index run slowly. MySQL reads all columns before it applies the WHERE clause, so it might ask InnoDB to read a lot of external storage, then check the WHERE clause and throw away all the data it read. It’s never a good idea to select columns you don’t need, but this is a special case where it’s even more important to avoid doing so. If you find your queries are suffering from this limitation, you can try to use covering indexes to help.
If you have many long columns in a single table, it might be better to combine the data they store into a single column, perhaps as an XML document. That lets all the values share external storage, rather than using their own pages.
You can sometimes gain significant space and performance benefits by storing long columns in a BLOB and compressing them with COMPRESS(), or compressing them in the application before sending them to MySQL.
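To make the space-waste consequence concrete, the allocation rule described earlier (one page at a time up to 32 pages, then 64 pages at a time) can be modeled roughly as follows. This is a simplified sketch: the names are hypothetical, `overflow_bytes` means only the part of the value that spills into external storage, and per-page header overhead is ignored.

```python
PAGE_SIZE = 16 * 1024      # InnoDB page size used in the text
SINGLE_PAGE_LIMIT = 32     # pages allocated one at a time
EXTENT_PAGES = 64          # after that, pages come 64 at a time


def ceil_div(a, b):
    """Integer division rounded up."""
    return (a + b - 1) // b


def external_pages_allocated(overflow_bytes):
    """Estimate how many external pages one long column value occupies."""
    if overflow_bytes <= 0:
        return 0
    needed = ceil_div(overflow_bytes, PAGE_SIZE)
    if needed <= SINGLE_PAGE_LIMIT:
        return needed
    extra = needed - SINGLE_PAGE_LIMIT
    return SINGLE_PAGE_LIMIT + ceil_div(extra, EXTENT_PAGES) * EXTENT_PAGES


def wasted_bytes(overflow_bytes):
    """Allocated external space that holds no data."""
    return external_pages_allocated(overflow_bytes) * PAGE_SIZE - max(overflow_bytes, 0)


print(external_pages_allocated(1))                   # 1 page for a single byte
print(external_pages_allocated(32 * PAGE_SIZE + 1))  # 96 pages
```

The two printed cases correspond to the examples in the first consequence above: a value one byte too long still consumes a whole page, and a value just over 32 pages consumes 96.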
Recall from Chapter 6 that MySQL has two filesort algorithms. It uses the two-pass algorithm if the total size of all the columns needed for the query, plus the ORDER BY columns, exceeds max_length_for_sort_data bytes. It also uses this algorithm when any of the required columns—even those not used for the ORDER BY—is a BLOB or TEXT column. (You can use SUBSTRING() to convert such columns to types that can work with the single-pass algorithm.)
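The selection rule can be written down as a short function. This is only a restatement of the two conditions above, not the optimizer's real logic; the function and parameter names are hypothetical, and the threshold of 1024 in the usage lines is just a sample value for max_length_for_sort_data.

```python
def filesort_algorithm(total_column_bytes, has_blob_or_text, max_length_for_sort_data):
    """Predict which filesort algorithm the rule in the text would pick.

    total_column_bytes: size of all columns the query needs plus the
        ORDER BY columns, counting VARCHAR columns at their maximum length.
    has_blob_or_text: True if any required column is a BLOB or TEXT.
    """
    if has_blob_or_text:
        return "two-pass"
    if total_column_bytes > max_length_for_sort_data:
        return "two-pass"
    return "single-pass"


print(filesort_algorithm(600, False, 1024))   # single-pass
print(filesort_algorithm(600, True, 1024))    # two-pass
print(filesort_algorithm(4000, False, 1024))  # two-pass
```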
MySQL has two variables that can help you control how it performs filesorts. You can influence MySQL’s choice of algorithm by changing the value of the max_length_for_sort_data variable.[126] Because the single-pass algorithm creates a fixed-size buffer for each row it will sort, the maximum length of VARCHAR columns is what counts toward max_length_for_sort_data, not the actual size of the stored data. This is one of the reasons why we recommend you make these columns only as large as necessary.
When MySQL has to sort on BLOB or TEXT columns, it uses only a prefix and ignores the remainder of the values. This is because it has to allocate a fixed-size structure to hold the values and copy the prefix from external storage into that structure. You can specify how large this prefix should be with the max_sort_length variable.
Unfortunately, MySQL doesn’t really give you any visibility into which sort algorithm it uses. If you increase the max_length_for_sort_data variable and your disk usage goes up, your CPU usage goes down, and the Sort_merge_passes status variable begins to grow more quickly than it did before the change, you’ve probably forced more sorts to use the single-pass algorithm.
We’re done with the tour of server internals—hope you enjoyed the trip! Now let’s return to our sample configuration file and see how to choose values for the settings that remain.
We’ve already discussed how to choose values for the general settings such as the data directory, the InnoDB and MyISAM caches, logs, and a few other things. Let’s go over what remains:
tmp_table_size and max_heap_table_size
These settings control how large an in-memory temporary table using the Memory storage engine can grow. If an implicit temporary table’s size exceeds either of these settings, it will be converted to an on-disk MyISAM table so it can keep growing. (An implicit temporary table is one that you don’t create yourself; the server creates it for you to hold an intermediate result while executing a query.)
You should simply set both of these variables to the same value. We’ve chosen the value 32M for our sample configuration file. This might not be enough, but beware of setting this variable too large. It’s good for temporary tables to live in memory, but if they’re simply going to be huge, it’s actually best for them to just use on-disk tables, or you could run the server out of memory.
Assuming that your queries aren’t creating enormous temporary tables (which you can often avoid with proper indexing and query design), it’s a good idea to set these variables large enough that you don’t have to go through the process of converting an in-memory table to an on-disk table. This procedure will show up in the process list.
You can look at how the server’s SHOW STATUS counters change over time to understand how often you create temporary tables and whether they go to disk. You can’t tell whether a table was created in memory and then converted to on-disk or just created on-disk to begin with (perhaps because of a BLOB column), but you can at least see how often the tables go to disk. Examine the Created_tmp_disk_tables and Created_tmp_tables variables.
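One way to compare the two counters is to take two snapshots some time apart and look at the share of temporary tables that went to disk in between. The sketch below assumes you have already collected the counter values into dictionaries by whatever means you prefer; the function name and the sample numbers are hypothetical, and it assumes Created_tmp_tables counts every implicit temporary table, including those on disk.

```python
def tmp_disk_table_ratio(before, after):
    """Fraction of temporary tables created between two snapshots
    that were on-disk tables."""
    created = after["Created_tmp_tables"] - before["Created_tmp_tables"]
    on_disk = after["Created_tmp_disk_tables"] - before["Created_tmp_disk_tables"]
    if created <= 0:
        return 0.0  # no temporary tables were created in the interval
    return on_disk / created


# Hypothetical snapshots taken a few minutes apart
earlier = {"Created_tmp_tables": 1000, "Created_tmp_disk_tables": 100}
later = {"Created_tmp_tables": 1500, "Created_tmp_disk_tables": 200}
ratio = tmp_disk_table_ratio(earlier, later)  # 100 of 500 went to disk
```

Working from the difference between snapshots rather than the raw totals matters, because the totals accumulate from server startup and hide recent changes in behavior.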
max_connections
This setting acts like an emergency brake to keep your server from being overwhelmed by a surge of connections from the application. If the application misbehaves, or the server encounters a problem such as a stall, a lot of new connections can be opened. But opening a connection does no good if it can’t execute queries, so being denied with a “too many connections” error is a way to fail fast and fail cheaply.
Set max_connections high enough to accommodate the usual load that you think you’ll experience, as well as a safety margin to permit logging in and administering the server. For example, if you think you’ll have 300 or so connections in normal operations, you might set this to 500 or so. If you don’t know how many connections you’ll get, 500 is not an unreasonable starting point anyway. The default is 100, but that’s not enough for a lot of applications.
Beware also of surprises that might make you hit the limit of connections. For example, if you restart an application server, it might not close its connections cleanly, and MySQL might not realize they’ve been closed. When the application server comes back up and tries to open connections to the database, it might be refused due to the dead connections that haven’t timed out yet.
Watch the Max_used_connections status variable over time. It is a high-water mark that shows you if the server has had a spike in connections at some point. If it reaches max_connections, chances are a client has been denied at least once, and you should probably use the techniques shown in Chapter 3 to capture server activity when that occurs.
thread_cache_size
You can compute a reasonable value for this variable by observing the server’s behavior over time. Watch the Threads_connected status variable and find its typical maximum and minimum. You might want to set the thread cache large enough to hold the difference between the peak and off-peak usage, and go ahead and be generous, because if you set it a bit too high it’s not a big problem. You might set it two or three times as large as needed to hold the fluctuations in usage. For example, if the Threads_connected status variable seems to vary between 150 and 175, you could set the thread cache to 75. But you probably shouldn’t set it very large, because it isn’t really useful to keep around a huge amount of spare threads waiting for connections; a ceiling of 250 is a nice round number (or 256, if you prefer a power of two).
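That rule of thumb is easy to express as a calculation. The function below is a hypothetical helper, not a MySQL feature; the multiplier of three and the ceiling of 250 are simply the generous end of the ranges suggested above.

```python
def suggest_thread_cache_size(peak, off_peak, generosity=3, ceiling=250):
    """Size the thread cache from observed Threads_connected values.

    Takes the fluctuation between peak and off-peak usage, multiplies it
    by a generosity factor, and caps the result.
    """
    fluctuation = max(peak - off_peak, 0)
    return min(fluctuation * generosity, ceiling)


print(suggest_thread_cache_size(175, 150))   # 75, as in the example above
print(suggest_thread_cache_size(1000, 100))  # 250, capped by the ceiling
```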
You can also watch the change over time in the Threads_created status variable. If this value is large or increasing, it’s another clue that you might need to increase the thread_cache_size variable. Check Threads_cached to see how many threads are in the cache already.
A related status variable is Slow_launch_threads. A large value for this status variable means that something is delaying new threads upon connection. This is a clue that something is wrong with your server, but it doesn’t really indicate what. It usually means there’s a system overload, causing the operating system not to schedule any CPU time for newly created threads. It doesn’t necessarily indicate that you need to increase the size of the thread cache. You should diagnose the problem and fix it rather than masking it with a cache, because it might be affecting other things, too.
table_cache_size
This cache (or the two caches into which it was split in MySQL 5.1) should be set large enough to keep from reopening and reparsing table definitions all the time. You can check this by inspecting the value of Open_tables and the change over time in the value of Opened_tables. If you see many Opened_tables per second, your table_cache value might not be large enough. Explicit temporary tables can also cause a growing number of opened tables even when the table cache isn’t fully used, though, so it might be nothing to worry about. Your clue would be that Opened_tables grows constantly even though Open_tables isn’t as large as table_cache_size.
Even if the table cache is useful, you should not set this variable too large. It turns out that the table cache can be counterproductive in two circumstances.
First, MySQL doesn’t use a very efficient algorithm to check the cache, so if it’s really big, it can get really slow. You probably shouldn’t set it higher than 10,000 in most cases, or 10,240 if you like those powers of two.[127]
The second reason to avoid setting this very large is that some workloads simply aren’t cacheable. If the workload isn’t cacheable, and everything is going to be a cache miss no matter how large you make the cache, forget the cache and set it to zero! This helps you avoid making the situation worse; a cache miss is better than an expensive cache check followed by a cache miss. What kinds of workloads aren’t cacheable? If you have tens or hundreds of thousands of tables and you use them all pretty uniformly, you probably can’t cache them all, and you’re better off setting this variable small. This is sometimes appropriate on systems that have a very large number of collocated applications, none of which is very busy.
A reasonable starting value for this setting is 10 times as big as
max_connections, but again, keep it under 10,000 or so in most cases.
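As a rough sketch of that rule of thumb, a configuration fragment might look like the following. The max_connections value is hypothetical, and we assume MySQL 5.1 or newer, where the option is spelled table_open_cache (older servers call it table_cache).

```ini
[mysqld]
# Hypothetical connection limit, chosen only for illustration.
max_connections  = 500
# Starting point of 10 x max_connections, which stays well under 10,000.
table_open_cache = 5000
```

Treat the result as a starting point only: watch how Opened_tables changes over time before deciding whether the value needs to move.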
There are several other kinds of settings that you will frequently include in your configuration file, including binary logging and replication settings. Binary logging is useful for enabling point-in-time recovery and for replication, and replication has a few settings of its own. We’ll cover the important settings in the chapters on replication and backups, later in this book.
After your basic configuration settings are in place, you might wish to enable a number of settings that make the server safer and more reliable. Some of them influence performance, because safety and reliability are often more costly to guarantee. Some are just sensible, however: they prevent silly mistakes such as inserting nonsensical data into the server. And some don’t make a difference in day-to-day operation, but prevent bad things from happening in edge cases.
Let’s look at a collection of useful options for general server behavior first:
expire_logs_days
If you enable binary logging, you should enable this option, which causes the server to purge old binary logs after the specified number of days. If you don’t enable it, you will eventually run the server out of disk space, and it will freeze or crash. We suggest setting this option large enough that you can recover from at least two backups ago (in case the most recent backup fails). Even if you take backups every day, still leave yourself at least 7 to 14 days’ worth of binary logs. Our experience shows that you’ll be grateful for a week or two of binary logs when you have some unusual problem, such as rebuilding a replica and then trying to get it caught up again with the master. You want to keep enough binary logs around to give yourself some breathing room for operations such as these.
max_allowed_packet
This setting prevents the server from sending too large a packet, and also controls how large a packet it will accept. The default is probably too small, but it can also be set dangerously large. If it’s set too small, sometimes problems can occur in replication, typically when the replica can’t retrieve data from the master that it needs for replication. You might increase the setting from its default to 16 MB or so.
It’s not documented, but this option also controls the maximum size of
a user-defined variable, so if you need very large variables, be
careful: they can be truncated or set to NULL if they exceed the size
of this variable.
max_connect_errors
If something goes wrong with your networking for a moment, there is an application or configuration error, or there is another problem such as privileges that prevent connections from completing successfully for a brief period of time, clients can get blacklisted and will be unable to connect again until you flush the host cache. The default setting for this option is so small that this problem can happen too easily. You might want to increase it, and in fact, if you know that the server is adequately secured against brute-force attacks, you can just make it very large to effectively disable host blacklisting.
skip_name_resolve
This setting disables another networking- and authentication-related trap: DNS lookups. DNS is one of the weak points in MySQL’s connection process. When you connect to the server, by default it tries to determine the hostname from which you’re connecting and uses that as part of the authentication credentials. (That is, your credentials are your username, hostname, and password—not just your username and password.) But to verify your hostname, the server needs to perform both a reverse and a forward DNS lookup. This is all fine until DNS starts to have problems, which is pretty much a certainty at some point in time. When that happens, everything piles up and eventually the connection times out. To prevent this, we strongly recommend that you set this option, which disables DNS lookups during authentication. However, if you do this you will need to convert all of your hostname-based grants to use IP addresses, wildcards, or the special hostname “localhost,” because hostname-based accounts will be disabled.
sql_mode
This setting can accept a variety of options that modify server
behavior. We don’t recommend changing these just for the fun of it;
it’s better to let MySQL be MySQL in most ways and not try to make it
behave like other database servers. (Many client and GUI tools expect
MySQL to have its own flavor of SQL, for example, so if you change it
to speak more ANSI-compliant SQL some things might break.) However,
several of the settings are very useful, and some might be worth
considering in your specific cases. You might want to look at the
documentation for the following options and consider using them:
STRICT_TRANS_TABLES, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER,
NO_AUTO_VALUE_ON_ZERO, NO_ENGINE_SUBSTITUTION, NO_ZERO_DATE,
NO_ZERO_IN_DATE, and ONLY_FULL_GROUP_BY.
However, be aware that it might not be a good idea to change these
settings for existing applications, because doing so might make the
server incompatible with the application’s expectations. It’s pretty
common for people to unwittingly write queries that select columns
that are neither named in the GROUP BY clause nor wrapped in aggregate
functions, for example, so if you want to enable the
ONLY_FULL_GROUP_BY option it’s a good idea to do it in a development
or staging server first, and only deploy it in production once you’re
sure everything is working.
sysdate_is_now
This is another setting that might be backward-incompatible with
applications’ expectations. But if you don’t explicitly desire the
SYSDATE() function to have nondeterministic behavior, which can break
replication and make point-in-time recovery from backups unreliable,
you might want to enable this option and make its behavior
deterministic.
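Pulled together, the general options above might appear in a configuration file roughly as follows. This is only a sketch: the binary log base name, the max_connect_errors value, and the particular sql_mode selection are illustrative assumptions, not recommendations for your server.

```ini
[mysqld]
# Binary logging, with old logs purged after two weeks.
# The log base name is a hypothetical example.
log_bin            = mysql-bin
expire_logs_days   = 14
# Raised from the default to roughly 16 MB.
max_allowed_packet = 16M
# Illustrative value only; choose one that suits your security posture.
max_connect_errors = 10000
# Disable DNS lookups during authentication. Convert hostname-based
# grants to IP addresses, wildcards, or "localhost" first.
skip_name_resolve
# Example selection for a new application; test existing applications
# on a staging server before enabling any of these.
sql_mode           = STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION,ONLY_FULL_GROUP_BY
# Make SYSDATE() return the same value as NOW().
sysdate_is_now
```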
A few options control replication behavior and are very helpful for preventing problems on replicas:
read_only
This option prevents unprivileged users from making changes on replicas, which should be receiving changes only from the master, not from the application. We strongly recommend setting replicas to read-only mode.
skip_slave_start
This option prevents MySQL from taking the bit between its teeth and attempting to start replication automatically. You want to disable automatic starting because it is unsafe after a crash or other problem; a human needs to examine the server manually and determine that it is safe to start replication.
slave_net_timeout
This option controls how long it’ll be before a replica notices that its connection to its master has failed and needs to be reconnected. The default value, one hour, is way too long. Set it to a minute or less.
sync_master_info, sync_relay_log, and sync_relay_log_info
These options, available in MySQL 5.5 and newer, correct longstanding
problems with replicas: they don’t sync their status files to disk, so
if the server crashes it can be anyone’s guess what the replica’s
position relative to the master actually was, and there can be
corruption in the relay logs. These options make replicas much more
likely to be recoverable after a crash. They are not enabled by
default, because they cause extra fsync() operations on replicas,
which can slow them down. We suggest enabling these options if you
have decent hardware, and disabling them if there is a problem with
replication that you can trace to latency caused by fsync().
There’s a less intrusive way to do this in Percona Server, enabled
with the innodb_overwrite_relay_log_info option. This makes InnoDB
store the replication position in the InnoDB transaction logs, which
is fully transactional and doesn’t require any extra fsync()
operations. During crash recovery, InnoDB will check the replication
metadata files and update them to have the correct position if they’re
out of date.
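On a replica, the options in this list might be combined as in the following sketch. The value of 1 for the three sync options is our assumption about the safest setting (sync after every event or transaction); larger values sync less often.

```ini
[mysqld]
# Refuse writes from unprivileged users on this replica.
read_only           = 1
# Do not start replication automatically at server startup.
skip_slave_start
# Seconds to wait before treating the master connection as failed.
slave_net_timeout   = 60
# MySQL 5.5 and newer: sync replication state to disk.
# 1 is the strictest setting and costs extra fsync() calls.
sync_master_info    = 1
sync_relay_log      = 1
sync_relay_log_info = 1
```

If profiling traces replication lag to fsync() latency, these last three lines are the ones to revisit.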
Recall our discussion of InnoDB’s history in Chapter 1: it was first built in, then
available in two versions, and now the newer version of the engine is
once again built into the server. The newer InnoDB code has more
features and is much more scalable. If you’re using MySQL 5.1, you
should configure MySQL explicitly to ignore the old version of InnoDB
and use the newer version. It will improve server performance greatly.
You’ll need to enable the ignore_builtin_innodb option, and then
configure the plugin_load option to enable InnoDB as a plugin. Consult
the InnoDB manual for the exact syntax for your platform.[128]
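For orientation only, the MySQL 5.1 settings described above might look something like this sketch. The library file name is an assumption for a Unix-like system, and the manual lists further name=library pairs for InnoDB’s INFORMATION_SCHEMA tables that are not shown here, so verify the exact line for your platform before using it.

```ini
[mysqld]
# Do not load the older InnoDB that is compiled into MySQL 5.1.
ignore_builtin_innodb
# Load the newer InnoDB from a plugin library instead.
# ha_innodb_plugin.so is an assumed file name for a Unix-like system.
plugin_load = innodb=ha_innodb_plugin.so
```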
Several options are available in the newer version of InnoDB, once you’ve enabled it. Some of these are quite important for server performance, and there are also a couple of safety and sanity options:
innodb
This rather innocuous-looking option is actually very important. If
you set its value to FORCE, the server will not be able to start
unless InnoDB can start. If you use InnoDB as your default storage
engine, this is definitely what you want. You do not want the server
to start when InnoDB fails because of some error such as a
misconfiguration, because a badly behaved application could then
connect to the server and cause who knows what harm and confusion.
It’s much better for the server to fail as a whole, which will force
you to look at the error log instead of believing that the server
started okay.
innodb_autoinc_lock_mode
This option controls how InnoDB generates autoincrementing primary key
values, which can be a bottleneck in some cases, such as
high-concurrency inserts. If you have many transactions waiting on the
autoincrement lock (you can see this in SHOW ENGINE INNODB STATUS),
you should investigate this setting. We won’t repeat the manual’s
explanation of the options and their behaviors.
innodb_buffer_pool_instances
This setting divides the buffer pool into multiple segments in MySQL 5.5 and newer, and is probably one of the most important ways to improve MySQL’s scalability on multicore machines with a highly concurrent workload. Multiple buffer pools partition the workload so that some of the global mutexes are not such hot contention points.
It is not yet clear what kind of guidelines we should develop for choosing the number of buffer pool instances. We have run most of our benchmarks with eight instances, but we probably won’t understand some of the subtleties of multiple buffer pool instances until MySQL 5.5 has been deployed more widely for a longer time.
We don’t mean that to imply that MySQL 5.5 isn’t deployed widely in production. It’s just that the most extreme cases of mutex contention we’ve helped solve have been for very large, very conservative users, for whom an upgrade can require many months to plan, validate, and execute. These users are sometimes running a highly customized version of MySQL, which makes it doubly important for them to be careful with upgrades. When more of these folks upgrade to MySQL 5.5 and stress it in their own unique ways, we’ll probably learn some interesting things about multiple buffer pools that we haven’t seen yet. Until then, we can say that it appears to be very beneficial to run with eight buffer pool instances.
It’s worth noting that Percona Server takes a different approach to solving InnoDB’s mutex contention issues. Instead of partitioning the buffer pool—an admittedly tried-and-true approach in many systems like InnoDB—we opted to divide some of the global mutexes into smaller, more special-purpose mutexes. Our benchmarks show that the best improvement of all comes from a combination of the two approaches, which is available in Percona Server version 5.5: multiple buffer pools and more fine-grained mutexes.
innodb_io_capacity
InnoDB used to be hardcoded to assume that it ran on a single hard disk capable of 100 I/O operations per second. This was a bad default. Now you can inform InnoDB how much I/O capacity is available to it. InnoDB sometimes needs this set quite high (tens of thousands on extremely fast storage such as PCI-E flash devices) to flush dirty pages in a steady fashion, for reasons that are quite complex to explain.
innodb_read_io_threads and innodb_write_io_threads
These options control how many background threads are available for I/O operations. The default in recent versions of MySQL is to have four read threads and four write threads, which is enough for a lot of servers, especially with the native asynchronous I/O available in MySQL 5.5. If you have many hard drives and a high-concurrency workload, and you see that the threads are having a hard time keeping up, you can increase the number of threads, or you can simply set them to the number of physical spindles you have for I/O (even if they’re behind a RAID controller).
innodb_strict_mode
This setting makes InnoDB throw errors instead of warnings for some
conditions, especially invalid or possibly dangerous CREATE TABLE
options. If you enable this option, be certain to check all of your
CREATE TABLE options, because it might not let you create some tables
that used to be fine. Sometimes it’s a bit pessimistic and overly
restrictive. You wouldn’t want to find this out when you were trying
to restore a backup.
innodb_old_blocks_time
InnoDB has a two-part buffer pool least recently used (LRU) list,
which is designed to prevent ad hoc queries from evicting pages that
are used many times over the long term. A one-off query such as those
issued by mysqldump will typically bring a page into the buffer pool
LRU list, read the rows from it, and move on to the next page. In
theory, the two-part LRU list will prevent this page from displacing
pages that will be needed for a long time by placing it into the “old”
sublist and only moving it to the “young” sublist after it has been
accessed multiple times. But InnoDB is not configured to prevent this
by default, because the page has multiple rows, and thus the multiple
accesses to read rows from the page will cause it to be moved to the
“young” sublist immediately, placing pressure on pages that need a
long lifetime. This variable specifies the number of milliseconds that
must elapse after a page’s first access before it can move from the
“old” part of the LRU list to the “young” part. It’s set to 0 by
default, and setting it to a small value such as 1000 (one second) has
proven very effective in our benchmarks.
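The InnoDB options in this list could be written into a configuration file along these lines. Treat it as a sketch: the innodb_io_capacity figure is a hypothetical value for illustration, and the thread counts simply restate the defaults mentioned above.

```ini
[mysqld]
# Refuse to start the server at all if InnoDB cannot start.
innodb                       = FORCE
# MySQL 5.5 and newer; eight is the count used in the benchmarks
# described above.
innodb_buffer_pool_instances = 8
# Hypothetical figure; set it according to what your storage can do.
innodb_io_capacity           = 1000
# The defaults in recent versions; raise only if the threads fall behind.
innodb_read_io_threads       = 4
innodb_write_io_threads      = 4
# Turn some warnings into errors. Check your CREATE TABLE options first.
innodb_strict_mode           = 1
# Milliseconds; one second.
innodb_old_blocks_time       = 1000
```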
After you’ve worked through this chapter, you should have a server configuration that is much better than the defaults. Your server should be fast and stable, and you should not need to tweak the configuration unless you run into an unusual circumstance.
To review, we suggest that you begin with our sample configuration file, set the basic options for your server and workload, add safety and sanity options as desired, and, if appropriate, configure the new options available in the InnoDB plugin and in MySQL 5.5. That’s really all you need to do.
The most important options are these two, assuming that you use InnoDB, which most people should:
innodb_buffer_pool_size
innodb_log_file_size
Congratulations—you just solved the vast majority of real-world configuration problems we’ve seen! If you use our configuration tool at http://tools.percona.com, you will get good suggestions for a starting point on these and other configuration options.
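In a configuration file, those two options are a pair of lines such as the following. The sizes shown are placeholders for illustration only, not recommendations; the right values depend on your memory, data, and workload.

```ini
[mysqld]
# Placeholder sizes; choose values for your own server.
innodb_buffer_pool_size = 12G
innodb_log_file_size    = 256M
```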
We’ve also made a lot of suggestions about what not to do. The most
important of these are not to “tune” your server; not to use ratios,
formulas, or “tuning scripts” as a basis for setting the configuration
variables; not to trust advice from unknown people on the Internet;
and not to go hunting in SHOW STATUS counters for things that look
bad. If something is actually wrong, it’ll show up in your server
profiling.
There are a few significant settings we didn’t discuss in this chapter, which are important for specific types of hardware and workloads. We delayed discussion of these settings because we believe that any advice on settings needs to be paired with an explanation of the internal processes at work. This brings us to the next chapter, which will show you how to optimize your hardware and operating system for MySQL, and vice versa.
[109] A common mistake we’ve seen is to set up a server with twice as much memory as your existing server, and—using the old server’s configuration as a baseline—create the new server’s configuration by multiplying everything by two. This doesn’t work.
[110] If you are not convinced that “tuning by ratio” is bad, please read Optimizing Oracle Performance by Cary Millsap (O’Reilly). He even devotes an appendix to the topic, with a tool that can artificially generate any cache hit ratio you wish, no matter how badly your system is performing! Of course, it’s all for the purpose of illustrating how useless the ratio is.
[111] An exception: we maintain a (good) free online configuration tool at http://tools.percona.com. Yes, we’re biased.
[112] Q: How is query formed? A: They need to do way instain DBAs who kill thier querys, becuse these querys cant frigth back?
[113] Please note that newer versions of MySQL remove, deprecate, and change some options; check the docs for details.
[114] Question: where are the settings for the sort buffer size and read buffer size? Answer: they’re off minding their own business. Leave them at their defaults unless you can prove the defaults are not good enough.
[115] Theoretically, if you could ensure that the original 4 KB of data was still in the operating system’s cache, the read wouldn’t be needed. However, you have no control over which blocks the operating system decides to keep in its cache. You can find out which blocks are in the cache with the fincore tool, available at http://net.doit.wisc.edu/~plonka/fincore/.
[116] The concept of an “opened table” can be a little confusing. MySQL counts a table as opened many times when different queries are accessing it simultaneously, or even when a single query refers to the same table more than once, as in a subquery or a self-join. MyISAM’s index files contain a counter that MyISAM increments when the table is opened and decrements when it is closed. This lets MyISAM see when the table wasn’t closed cleanly: if it opens a table for the first time and the counter is not zero, the table wasn’t closed cleanly.
[117] For the curious, Percona Server’s innodb_recovery_stats option
can help you understand your server’s workload from the standpoint of
performing crash recovery.
[118] We’re talking about spindle-based disk drives with rotating platters, not solid-state hard drives, which have completely different performance characteristics.
[119] Note that the way this ought to be implemented is a topic of some debate; see MySQL bug 60776 for the details.
[120] The table can be closed for several reasons. For example, the
server might close the table because there’s not enough room in the
table cache, or someone might execute FLUSH TABLES.
[121] Some Debian systems do this automatically, which is a swing of the pendulum too far in the other direction. It’s not a good idea to just configure this behavior by default as Debian does; the DBA should decide.
[122] In fact, in some workloads, the system that implements the concurrency limits itself can become a bottleneck, so sometimes it needs to be enabled, and at other times it needs to be disabled. Profiling will show you which to do.
[123] Recent versions of Percona Server lift this restriction in some cases.
[124] Data can still go to disk if the operating system swaps it.
[125] This is long enough to create a 255-character index on a column,
even if it’s utf8, which might require up to three bytes per
character. This prefix is specific to the Antelope InnoDB file format;
it doesn’t apply to the Barracuda format, which is available in MySQL
5.1 and newer (though not enabled by default).
[126] MySQL 5.6 will introduce changes to the way the sort buffer is
used in queries with a LIMIT clause and will fix a problem that caused
a large sort buffer to perform an expensive setup routine, so when you
upgrade to MySQL 5.6 you should carefully check any customizations
you’ve made to these settings.
[127] Have you heard the joke about powers of two? There are 10 types of people in the world: those who understand binary, and those who don’t. There are also another 10 types of people: those who think binary/decimal jokes are funny, and those who have sex. We won’t say whether or not we think that’s hilarious.
[128] In Percona Server, there’s only one version of InnoDB and it’s built in, so you don’t need to disable one version and load another one to replace it.