Using architectural trade-offs

When your code cannot be improved any further by reducing its complexity or choosing better data structures, a good approach may be to consider some trade-offs. If we review users' problems and define what is really important to them, we can relax some of the application's requirements. Performance can often be improved by:

  • Replacing exact solution algorithms with heuristics and approximation algorithms
  • Deferring some work to delayed task queues
  • Using probabilistic data structures

Using heuristics and approximation algorithms

Some algorithmic problems simply don't have state-of-the-art solutions that run in a time acceptable to the user. For example, consider a program that deals with complex optimization problems, such as the Traveling Salesman Problem (TSP) or the Vehicle Routing Problem (VRP). Both are NP-hard problems in combinatorial optimization, and no low-complexity exact algorithms for them are known. This greatly limits the size of the problems that can be solved in practice. For very large inputs, it is very unlikely that an exact solution can be provided in a time that would be acceptable to any user.

Fortunately, it is very probable that the user is not interested in the best possible solution but in one that is good enough and can be obtained in a timely manner. So, it really makes sense to use heuristics or approximation algorithms whenever they provide an acceptable quality of results:

  • Heuristics solve given problems by trading optimality, completeness, accuracy, or precision for speed. They concentrate on speed, but it may be really hard to prove their solution quality compared to the results of exact algorithms.
  • Approximation algorithms are similar in idea to heuristics, but unlike heuristics, they have provable solution quality and run-time bounds.

For instance, there are known good heuristics and approximation algorithms that can solve extremely large TSP instances within a reasonable time, with a high probability of producing results within 2-5% of the optimal solution.
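
To make this concrete, the following is a minimal sketch of the classic nearest-neighbor heuristic for the TSP. It assumes cities are points on a plane; the coordinates are made-up illustrative values:

    import math

    def nearest_neighbor_tour(cities):
        # Greedy O(n^2) heuristic: always travel to the closest
        # unvisited city. Fast, but with no strong quality guarantee.
        unvisited = list(cities)
        tour = [unvisited.pop(0)]
        while unvisited:
            last = tour[-1]
            nearest = min(unvisited, key=lambda city: math.dist(last, city))
            unvisited.remove(nearest)
            tour.append(nearest)
        return tour

    # Made-up city coordinates, purely for illustration.
    cities = [(0, 0), (4, 1), (1, 3), (5, 5), (2, 2), (6, 0)]
    print(nearest_neighbor_tour(cities))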

Another good thing about heuristics is that they don't always need to be constructed from scratch for every new problem you need to solve. Their higher-level versions, called metaheuristics, provide strategies for solving mathematical optimization problems that are not problem-specific and can thus be applied in many situations. Some popular metaheuristic algorithms include (the first one is sketched in code right after the list):

  • Simulated annealing
  • Genetic algorithms
  • Tabu search
  • Ant colony optimization
  • Evolutionary computation
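
As an illustration, here is a minimal, simplified sketch of simulated annealing applied to the TSP; the temperature schedule and step count are arbitrary values chosen only for demonstration:

    import math
    import random

    def tour_length(tour):
        # Total length of the closed tour over 2-D points.
        return sum(math.dist(tour[i], tour[(i + 1) % len(tour)])
                   for i in range(len(tour)))

    def simulated_annealing(cities, start_temp=10.0, cooling=0.999, steps=20000):
        # Start from a random tour and repeatedly propose 2-opt reversals.
        # Worse tours are sometimes accepted, with a probability that
        # shrinks as the "temperature" cools; this helps escape local optima.
        tour = list(cities)
        random.shuffle(tour)
        best = tour[:]
        temp = start_temp
        for _ in range(steps):
            i, j = sorted(random.sample(range(len(tour)), 2))
            candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
            delta = tour_length(candidate) - tour_length(tour)
            if delta < 0 or random.random() < math.exp(-delta / temp):
                tour = candidate
                if tour_length(tour) < tour_length(best):
                    best = tour[:]
            temp *= cooling
        return best

    cities = [(0, 0), (4, 1), (1, 3), (5, 5), (2, 2), (6, 0)]
    print(simulated_annealing(cities))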

Using task queues and delayed processing

Sometimes it's not about doing a lot but about doing things at the right time. A good example is sending e-mails from a web application. In this case, increased response times may not necessarily be the fault of your implementation: the response time may be dominated by some third-party service, such as an e-mail server. Can you optimize your application if it spends most of its time waiting for other services to reply?

The answer is both yes and no. If you don't have any control over the service that is the main contributor to your processing time and there is no faster alternative you could use, then of course you cannot speed it up any further. You cannot simply skip ahead in time to get the replies you are waiting for. A simple example of processing an HTTP request that results in sending an e-mail is presented in Figure 1. You cannot reduce the waiting time, but you can change the way users perceive it!

Figure 1 An example of synchronous e-mail delivery in a web application

The usual pattern for this type of problem is using message/task queues. When you need to do something that may take an indefinite amount of time, add it to the queue of work that needs to be done and immediately respond to the user that the request was accepted. Here we come to the reason why sending e-mails is such a great example: e-mails are already task queues! When you submit a new message to an e-mail server using the SMTP protocol, a successful response does not mean that your e-mail was delivered to the addressee. It means that the e-mail was delivered to the e-mail server, which will later try to deliver it further.

So, if the response from the server does not guarantee that the e-mail was delivered at all, you don't need to wait for it in order to generate an HTTP response for the user. The updated request-processing flow using a task queue is presented in the following figure:

Figure 2 An example of asynchronous e-mail delivery in a web application

Of course, your e-mail server may respond blazingly fast, but you may need some more time to generate the message that needs to be sent. Perhaps you are generating yearly reports in XLS format, or maybe delivering invoices as PDF files. If the e-mail transport you use is already asynchronous, then put the whole message-generation task into the message processing system too. If you cannot guarantee the exact time of delivery, then you should not bother generating your deliverables synchronously.
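
To show the shape of this pattern without any external services, here is a minimal in-process sketch using the standard library's queue module and a worker thread; generate_and_send_report is a made-up placeholder for the slow work, and a real application would use an external broker instead:

    import queue
    import threading
    import time

    task_queue = queue.Queue()

    def generate_and_send_report(recipient):
        # Placeholder for the slow part: rendering a PDF/XLS report
        # and handing it to the e-mail server.
        time.sleep(2)
        print(f"report delivered to {recipient}")

    def worker():
        # Pull jobs off the queue and execute them one at a time.
        while True:
            func, args = task_queue.get()
            func(*args)
            task_queue.task_done()

    threading.Thread(target=worker, daemon=True).start()

    # The request handler only enqueues the job and responds at once.
    task_queue.put((generate_and_send_report, ("user@example.com",)))
    print("HTTP 202 Accepted: the report will be e-mailed shortly")
    task_queue.join()  # only needed so this demo script waits before exiting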

The proper usage of task/message queues in critical sections of the application can also give you other benefits:

  • Web workers that serve HTTP requests will be relieved of additional work and will process requests faster. This means that you will be able to process more requests with the same resources and thus handle a greater load.
  • Message queues are generally more immune to transient failures of external services. For instance, if your database or e-mail server times out from time to time, you can always re-queue the currently processed task and retry it later.
  • With a good message queue implementation, you can easily distribute the work across multiple machines. This approach may improve the scalability of some of your application's components.

As you can see in Figure 2, adding asynchronous task processing to your application inevitably increases the complexity of the whole system's architecture. You will need to set up some new backing services (a message queue such as RabbitMQ) and create workers that will be able to process these asynchronous jobs. Fortunately, there are some popular tools for building distributed task queues. The most popular one among Python developers is Celery (http://www.celeryproject.org/). It is a full-fledged task queue framework with support for multiple message brokers that also allows for the scheduled execution of tasks (it can replace your cron jobs). If you need something simpler, then RQ (http://python-rq.org/) might be a good alternative. It is a lot simpler than Celery and uses the Redis key/value store as its message broker (RQ actually stands for Redis Queue).
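
For illustration, here is a minimal Celery task sketch; it assumes a Redis broker running at the default local address, and the task body is a placeholder for the real report-generation and e-mail code:

    # tasks.py
    from celery import Celery

    # Assumes a Redis broker running at the default local address.
    app = Celery("tasks", broker="redis://localhost:6379/0")

    @app.task(bind=True, max_retries=3)
    def send_report_email(self, recipient):
        # Placeholder body: render the report and hand it to the
        # e-mail server here.
        try:
            ...
        except TimeoutError as exc:
            # Transient failure: re-queue the task and retry in a minute.
            raise self.retry(exc=exc, countdown=60)

    # In the web request handler, enqueue instead of executing inline:
    #     send_report_email.delay("user@example.com")
    # And run a worker process with:
    #     celery -A tasks worker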

Although there are some good and battle-tested tools, you should always carefully consider your approach to task queues. Not every kind of work should be processed in queues. They are good at solving a few types of issues but also introduce a load of new problems:

  • Increased complexity of system architecture
  • Dealing with messages that may be delivered more than once
  • More services to maintain and monitor
  • Larger processing delays
  • More difficult logging

Using probabilistic data structures

Probabilistic data structures are designed to store collections of values in a way that lets you answer specific questions within time or resource constraints that other data structures cannot meet. The most important fact is that the answer is only probably true, or is an approximation of the real value. However, the probability of a correct answer, or its accuracy, can be easily estimated. So, despite not always giving the correct answer, such a structure can still be useful if we accept some level of error.

There are a lot of data structures with such probabilistic properties. Each of them solves specific problems and, due to its stochastic nature, cannot be used in every situation. But to give a practical example, let's talk about one of them that is especially popular: HyperLogLog.

HyperLogLog (refer to https://en.wikipedia.org/wiki/HyperLogLog) is an algorithm that approximates the number of distinct elements in a multiset. With ordinary sets, you need to store every element, which may be very impractical for very large datasets. HLL is distinct from the classical way of implementing sets as programming data structures. Without digging into implementation details, let's just say that it concentrates solely on providing an approximation of the set's cardinality. Real values are never stored, so they cannot be retrieved, iterated over, or tested for membership. HyperLogLog trades accuracy and correctness for time complexity and size in memory. For instance, the Redis implementation of HLL takes only 12 KB per key, with a standard error of 0.81% and no practical limit on collection size.
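
As a quick illustration, here is how the Redis HLL commands can be used from the redis-py client; it assumes a Redis server running locally:

    # Assumes a local Redis server and the redis-py package.
    import redis

    r = redis.Redis()

    # Register observed elements; duplicates do not grow the structure.
    for user_id in ["alice", "bob", "alice", "carol", "bob"]:
        r.pfadd("visitors:2024-01-01", user_id)

    # Approximate number of distinct elements -- here 3,
    # within the ~0.81% standard error.
    print(r.pfcount("visitors:2024-01-01"))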

Using probabilistic data structures is a very interesting way of solving performance problems. In most cases, it is about trading some accuracy or correctness for faster processing or better resource usage. But it does not always need to be that way. Probabilistic data structures are very often used in key/value storage systems to speed up key lookups. One of the popular techniques used in such systems is called approximate member query (AMQ). An interesting data structure that can be used for that purpose is the Bloom filter (refer to https://en.wikipedia.org/wiki/Bloom_filter).
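
To give a feel for how this works, here is a tiny, simplified Bloom filter sketch; the size and hash count are arbitrary illustrative values:

    import hashlib

    class BloomFilter:
        # May report false positives but never false negatives,
        # so it can cheaply rule out keys that are surely absent.
        def __init__(self, size=1024, hash_count=3):
            self.size = size
            self.hash_count = hash_count
            self.bits = 0  # an integer used as a fixed-size bit array

        def _positions(self, item):
            # Derive hash_count independent bit positions for the item.
            for i in range(self.hash_count):
                digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def __contains__(self, item):
            return all(self.bits & (1 << pos) for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("alice")
    print("alice" in bf)    # True
    print("mallory" in bf)  # almost certainly False

Because a negative answer from a Bloom filter is always correct, a key/value store can use it to skip an expensive disk lookup entirely whenever the filter says a key is absent.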
