Drawbacks of DiskCache

Our disk-based caching system was relatively simple to implement, does not depend on installing additional modules, and the results are viewable in our file manager. However, it has the drawback of depending on the limitations of the local filesystem. Earlier in this chapter, we applied various restrictions to map URLs to safe filenames, but an unfortunate consequence of this system is that some URLs will map to the same filename. For example, replacing unsupported characters in the following URLs will map them all to the same filename:

  • http://example.com/?a+b
  • http://example.com/?a*b
  • http://example.com/?a=b
  • http://example.com/?a!b

This means that if one of these URLs were cached, the other three would appear to be cached as well, because they all map to the same filename. Similarly, if some long URLs differed only after the 255th character, their truncated versions would also map to the same filename. This is a particularly important problem since there is no defined limit on the maximum length of a URL. In practice, however, URLs over 2,000 characters are rare, and older versions of Internet Explorer did not support URLs longer than 2,083 characters.
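Both kinds of collision can be reproduced in a few lines. The following is a minimal sketch, not the exact implementation from earlier in this chapter: the function name url_to_filename, the whitelist of safe characters, and the 255-character limit per path segment are assumptions chosen to illustrate the problem.

```python
import re
from urllib.parse import urlsplit


def url_to_filename(url, max_len=255):
    """Map a URL to a filesystem-safe name (hypothetical, simplified mapping)."""
    components = urlsplit(url)
    path = components.path
    # Give directory-like URLs a concrete filename
    if not path or path.endswith('/'):
        path += 'index.html'
    filename = components.netloc + path + components.query
    # Replace every character outside a conservative whitelist with '_'
    filename = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', filename)
    # Truncate each path segment to the assumed filesystem limit
    return '/'.join(segment[:max_len] for segment in filename.split('/'))


urls = [
    'http://example.com/?a+b',
    'http://example.com/?a*b',
    'http://example.com/?a=b',
    'http://example.com/?a!b',
]
# All four distinct URLs collapse to a single filename
print({url_to_filename(url) for url in urls})

# Two long URLs that differ only after the 255th character of a segment
long_a = 'http://example.com/' + 'a' * 255 + 'x'
long_b = 'http://example.com/' + 'a' * 255 + 'y'
print(url_to_filename(long_a) == url_to_filename(long_b))
```

With this mapping, the set printed first contains only one entry and the final comparison is true, so a cache keyed on these filenames could not tell the pages apart.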

One potential solution to avoid these limitations is to take the hash of the URL and use the hash as the filename. This may be an improvement; however, we will eventually face a larger problem that many filesystems have: a limit on the number of files allowed per volume and per directory. If this cache is used in a FAT32 filesystem, the maximum number of files allowed per directory is just 65,535. This limitation could be avoided by splitting the cache across multiple directories; however, filesystems can also limit the total number of files. My current ext4 partition supports a little over 31 million files, whereas a large website may have in excess of 100 million web pages. Unfortunately, the DiskCache approach has too many limitations to be of general use. What we need instead is to combine multiple cached web pages into a single file and index them with a B+tree or a similar data structure. Instead of implementing our own, we will use an existing key-value store in the next section.
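For reference, the hashing idea combined with splitting the cache across directories could look roughly like the following. This is only a sketch under stated assumptions: the function name url_to_hashed_path, the choice of SHA-256, the cache directory name, and the two-level directory layout are all illustrative and not part of the DiskCache class built in this chapter.

```python
import hashlib
import os


def url_to_hashed_path(url, cache_dir='cache', levels=2):
    """Return a cache path derived from a hash of the URL (hypothetical helper)."""
    # The hex digest has a fixed length and contains only the characters 0-9a-f,
    # so it is a safe filename regardless of what the URL contains
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()
    # Use the leading pairs of hex characters as nested subdirectories,
    # spreading files over 256 ** levels directories
    shards = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return os.path.join(cache_dir, *shards, digest)


for url in ['http://example.com/?a+b', 'http://example.com/?a*b']:
    print(url, '->', url_to_hashed_path(url))
```

Distinct URLs now receive distinct filenames of fixed length, and no single directory has to hold the whole cache. The limit on the total number of files per volume still applies, which is why this remains a partial fix.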
