Instrumenting Your System

Many different applications can show you what’s going on, provided you feed them the right data. Although we can’t recommend which tool you should use, we can give you an overview of the data you can gather. In short, instead of walking through how each tool displays your system’s performance, we’d like to focus on what to measure. We’ll show you what useful data you can coax out of the VM and how to use it to inform your decisions. Let’s get started.

Using Observer as a Guide

Because Elixir chose early on to build on the Erlang ecosystem, you can take advantage of many of its tools. One of those is Observer, a tool for understanding how your application is using resources like processes and memory. While you won’t use Observer to gather metrics in production, it’s a great tool for exploring what the VM offers you. If the information is available to Observer, it is available to you.

In this section, we will create a new Phoenix application and use it throughout the rest of the chapter. We will start by observing this application and translating the ideas we find into code.

We chose a Phoenix application because it comes with enough code for us to jump straight into measuring. That said, the lessons here apply to any Elixir application.

If you are not yet familiar with Phoenix, see their website to get started.[108] Once you have the Phoenix installer available on your machine, create a new application like this:

 $ mix phx.new demo

You’ll then need to follow the printed instructions to get your app up and running with iex -S mix. When the iex prompt becomes available, type :observer.start(). That command starts Observer in all of its glory, as shown in the figure.

[Figure: Observer’s System tab]

When Observer opens, you’ll see several tabs, with the System tab open by default. Most of the information in the two panels on the left comes from a function called :erlang.system_info/1.[109] Those values are fairly static, and they’re just a small subset of all the information system_info/1 returns.

Measuring Memory Usage

The first pane on the right shows memory usage. That’s definitely the kind of information you want to push to your metrics system. You can retrieve all of this information programmatically by calling :erlang.memory/0.[110] Try it in your terminal, like this:

 iex> :erlang.memory()
 [
   total: 17479216,
   processes: 4837512,
   processes_used: 4831320,
   system: 12641704,
   atom: 264529,
   atom_used: 248278,
   binary: 64888,
   code: 5903532,
   ets: 350960
 ]

The total key is the total amount of memory dynamically allocated, not including the VM itself or the system libraries the VM has started. It is the sum of the memory currently allocated by processes and the system. The processes key shows the amount of memory allocated for processes, and processes_used shows how much of that memory is in use.

The system memory is broken down into the memory allocated for atoms, the binaries that are not stored in process heaps, the code loaded by the VM, and finally the memory allocated for ETS tables.

You can use this information to identify resource leaks. For example, if the amount of memory under atom keeps growing, your application may be leaking atoms. The same goes for the code key: if your application dynamically defines modules, you want to make sure to purge them from the system, otherwise the amount of memory used by code will grow without bound. Having this information in your dashboards can help you identify leaks before they bring the system down.
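As a concrete example, here is a minimal sketch of a process that periodically polls :erlang.memory/0 and forwards each reading to a metrics backend. Demo.Metrics.push/2 is a hypothetical placeholder; swap in whatever your metrics library provides:

 defmodule Demo.MemoryReporter do
   use GenServer

   @interval :timer.seconds(10)

   def start_link(opts \\ []) do
     GenServer.start_link(__MODULE__, :ok, opts)
   end

   @impl true
   def init(:ok) do
     schedule_report()
     {:ok, %{}}
   end

   @impl true
   def handle_info(:report, state) do
     # Push every memory category (total, processes, atom, code, and
     # so on) to the metrics system. Demo.Metrics.push/2 is a placeholder.
     for {key, bytes} <- :erlang.memory() do
       Demo.Metrics.push("vm.memory.#{key}", bytes)
     end

     schedule_report()
     {:noreply, state}
   end

   defp schedule_report do
     Process.send_after(self(), :report, @interval)
   end
 end

Add it to your supervision tree and every category reported by :erlang.memory/0 becomes a time series you can graph and alert on.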

Some of those resources have hard limits. For example, the last pane on the right shows statistics about the system, including how many processes exist and the maximum number of processes allowed. If you reach that limit, the VM will simply refuse to start new processes. In a web application, that means you are unable to accept more requests. Therefore, you want to measure the number of processes and ensure it stays safely below the maximum, say at 80% of your process capacity.

Tracking Process, Port, and Atom Limits

We can compute the ratio of existing processes to the maximum number of allowed processes like this:

 iex> 100 * :erlang.system_info(:process_count) /
 iex> :erlang.system_info(:process_limit)
 0.0167

If your servers are reaching the stipulated threshold and the machine still has plenty of resources available, you can increase this limit at boot time by passing flags via --erl. For example, to set the limit north of one million processes:

 $ iex --erl "+P 1000000"
 Erlang/OTP 19 [erts-8.0] [source] [64-bit] [smp:4:4] [ds:4:4:10]
  [async-threads:10] [hipe] [kernel-poll:false]

 Interactive Elixir (1.5.0) - press Ctrl+C to exit (type h() ENTER for help)
 iex(1)> :erlang.system_info(:process_limit)
 1048576

You can also use :port_count and :port_limit to track the number of ports your system is using. This metric is especially useful if you are integrating with external code using ports, as outlined in Strategy 2: Communicating via I/O with Ports.
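The computation mirrors the one for processes. Here’s the same ratio as a sketch, using the port keys system_info/1 already provides:

 iex> 100 * :erlang.system_info(:port_count) / :erlang.system_info(:port_limit)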

Erlang/OTP 20 also introduced the ability to compute usage rates for atoms:

 iex> 100 * :erlang.system_info(:atom_count) / :erlang.system_info(:atom_limit)
 0.0167

Most applications should expect their atom usage to remain constant after the application has warmed up in production. Pairing the ratio above with atom memory usage can help you quickly discover if your application is leaking atoms.
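If you’d rather track all three limits in one place, a small helper like the sketch below computes every ratio in one pass. The module name and shape are our own invention, not part of any library (and remember that the atom keys require Erlang/OTP 20 or later):

 defmodule Demo.Limits do
   # Returns the usage of each limited resource as a percentage
   # of its configured maximum.
   def usage do
     for resource <- [:process, :port, :atom], into: %{} do
       count = :erlang.system_info(:"#{resource}_count")
       limit = :erlang.system_info(:"#{resource}_limit")
       {resource, 100 * count / limit}
     end
   end
 end

Pushing Demo.Limits.usage/0 to your dashboards and alerting at around 80% gives you time to react before the VM starts refusing new processes or ports.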

Getting the Run Queue Length

Another important statistic to track is the run queue. When your VM boots, it starts one scheduler per core, and each scheduler has a queue of actions it should perform. That’s the run queue. An overloaded system will show a steadily increasing number of actions in its run queues.

To understand the impact of the run queue, let’s revisit a discussion that happened on the Elixir Forum.[111] In that thread, Myron Marston reported that some calls to a GenServer were exceeding the default timeout of 5 seconds and timing out. Throughout the week, the team tried to find the source of the slowdown but remained stumped. After gathering more information, they noticed that the GenServer message queue was not getting backed up and that each GenServer callback executed quickly. The numbers didn’t add up: if the GenServer was never busy and the callbacks were fast, why were the calls still timing out?

José Valim jumped into the discussion and suggested that Myron and his team look at the run queue metric. If the system is overloaded, it may take a while until each process gets a chance to run. So even if the GenServer is not busy and can answer fairly fast, by the time the GenServer executes, the 5-second timeout may have already passed! After measuring the run queue, they concluded the system was indeed overloaded. They could fix it either by getting more powerful machines (scaling vertically) or by adding more nodes (scaling horizontally).

You can retrieve the run queue by calling :erlang.statistics/1.[112] Use :erlang.statistics(:total_run_queue_lengths) to get the total run queue length. Avoid :erlang.statistics(:run_queue): it gathers the information from all schedulers atomically, and therefore can be quite expensive.
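In a metrics reporter such as the Demo.MemoryReporter sketched earlier, sampling it is a one-liner (Demo.Metrics.push/2 is still our hypothetical metrics backend):

 # Sum of the lengths of all normal run queues; a steadily
 # climbing value signals an overloaded system.
 run_queue = :erlang.statistics(:total_run_queue_lengths)
 Demo.Metrics.push("vm.run_queue", run_queue)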

If you are expecting to push the VM to the limit, it is worth carefully reading the docs for the statistics function to learn more about all of the available metrics.

At this point you may be wondering what an appropriate value for the run queue is. That’s a very hard question to answer, since it depends on your machine, your application, and the kind of load you expect. However, graphing the run queue can still be very useful when diagnosing problems. For example, if error rates increase or requests start taking too long, and you notice a simultaneous surge in the run queue, you’ll have much more insight into what may be happening.

Tracking Process Health

Another area worth exploring is the Processes tab. The following figure shows it in action:

[Figure: Observer’s Processes tab]

By default, Observer lists all processes in your system, showing their memory usage, message queue length, and the number of reductions (instructions) they have executed. A high value in any of those columns may indicate a bottleneck or a memory leak.

You can find all processes in the system by calling Process.list/0, which returns a list of PIDs, or fetch the names of all locally registered processes with Process.registered/0. You can use these PIDs to get additional information with Process.info/1. For example, you can get the top five processes by memory usage like this:

 iex> Process.list() |> Enum.sort_by(&Process.info(&1, :memory)) |> Enum.take(-5)
 [#PID<0.48.0>, #PID<0.81.0>, #PID<0.36.0>, #PID<0.4.0>, #PID<0.31.0>]
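Process.info/2 also accepts a list of keys, returning a keyword list with exactly the fields Observer displays. Here we inspect the shell’s own process just as a stand-in, but any PID from the list above works:

 iex> Process.info(self(), [:memory, :message_queue_len, :reductions])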

In practice, it is unlikely that you will instrument all of the processes in your system. Instead, you want to choose the processes that are most likely to be a central part of the system. Those often come up when stress testing the system.
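For instance, in our demo app the repository is a natural candidate. Assuming the Demo.Repo name that phx.new generates, you can look up its PID and check its message queue length; the same pattern works for any registered process:

 iex> Demo.Repo |> Process.whereis() |> Process.info(:message_queue_len)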

Observer has many other tabs and we won’t explore them all. The lesson here, though, applies regardless of the tab: for any information you see in Observer, you can likely find an API to push it to your metrics system as well.
