In this chapter, you will learn how to use Beautiful Soup, a lightweight Python library, to extract and navigate HTML content easily and forget overly complex regular expressions and text parsing.
Before I let you jump right into coding, I will tell you a few things about this tool to help you familiarize yourself with it.
Feel free to jump to the next section if you are not in the mood for reading dry introductory text or basic tutorials; and if you don’t understand something in my later approach or the code, come back here.
I find Beautiful Soup easy to use, and it is a perfect tool for handling HTML DOM elements: you can navigate, search, and even modify a document with this tool. It has a superb user experience, as you will see in the first section of this chapter.
Installing Beautiful Soup
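If you don’t have the library yet, it can typically be installed with pip; note the 4 at the end of the package name (shown here as a general reminder rather than the book’s exact listing):

```
pip install beautifulsoup4
```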
The number 4 is crucial because I developed and tested the examples in this book with version 4.6.0.
Simple Examples
After a lengthy introduction, it is time to start coding now, with simple examples to familiarize yourself with Beautiful Soup and try out some basic features without creating a complex scraper.
These examples will show the building blocks of Beautiful Soup and how to use them if needed.
You won’t scrape an existing site, but instead will use HTML text prepared for each use case.
For these examples, I assume you’ve already entered from bs4 import BeautifulSoup into your Python script or interactive command line, so you have Beautiful Soup ready to use.
Parsing HTML Text
The most basic usage of Beautiful Soup, which you will see in every tutorial, is parsing and extracting information from an HTML string.
This is the basic step: when you download a website, you send its content to Beautiful Soup to parse, and it makes no difference whether you pass a string literal or a variable to the parser.
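A minimal sketch of this step, using an illustrative HTML snippet (the markup and variable names are my own, not the book’s listing):

```python
from bs4 import BeautifulSoup

example_html = """
<html><head><title>Example title</title></head>
<body><p class="first">A simple paragraph</p></body></html>
"""

# Omitting the second argument makes Beautiful Soup pick a parser itself
# and emit the warning discussed below; naming one keeps things predictable.
soup = BeautifulSoup(example_html, 'html.parser')
print(soup.title.text)   # Example title
```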
This warning is well defined and tells you everything you need to know. Because you can use different parsers with Beautiful Soup (see later in this chapter), you cannot assume it will always use the same parser; if a better one is installed, it will use that. This can lead to unexpected behavior—for example, your script could slow down.
Now you can use the soup variable to navigate through the HTML.
Parsing Remote HTML
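Beautiful Soup itself does not download pages, so you fetch the content first and hand it to the parser. A minimal sketch (the URL is illustrative; the book’s own listing may differ):

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

content = urlopen('http://www.example.com/').read()
soup = BeautifulSoup(content, 'html.parser')
```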
Parsing a File
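Beautiful Soup also accepts an open file object directly. A hedged sketch, assuming a local example.html file (the file name is an assumption):

```python
from bs4 import BeautifulSoup

with open('example.html') as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')
```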
Difference Between find and find_all
You will use two methods extensively with Beautiful Soup: find and find_all.
The difference between the two lies in their behavior and return type: find returns only one result—if multiple nodes match the criteria, the first is returned, or None if nothing is found—while find_all returns all results matching the provided arguments as a list; this list can be empty.
This means that every time you search for a tag with a certain id, you can use find, because you can assume that an id is used only once in a page. Alternatively, if you are looking for the first occurrence of a tag, you can use find too. If you are unsure, use find_all and iterate through the results.
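A quick illustration of the difference (tag names and the id value are illustrative assumptions):

```python
soup.find('ul', id='menu')   # the single tag with this id, or None
soup.find('p')               # the first paragraph only

for paragraph in soup.find_all('p'):   # every paragraph; possibly an empty list
    print(paragraph.text)
```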
Extracting All Links
The core function of a scraper is to extract links from the website that lead to other pages or other websites.
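A minimal sketch of such a link extractor, with an illustrative HTML snippet (not the book’s exact listing):

```python
from bs4 import BeautifulSoup

example_html = """
<html><body>
  <a href="http://www.example.com/">A link</a>
  <a>An empty anchor</a>
</body></html>
"""

soup = BeautifulSoup(example_html, 'html.parser')
links = soup.find_all('a', href=True)
for link in links:
    print(link['href'])   # safe: every returned anchor has an href attribute
```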
The find_all method call includes the href=True argument. This tells Beautiful Soup to return only those anchor tags that have an href attribute. This gives you the freedom to access this attribute on the resulting links without checking for its existence.
To verify this, try running the preceding code, but remove the href=True argument from the function call. It results in an exception because the empty anchor doesn’t have an href attribute.
You can add any attribute to the find_all method, and you can search for tags where the attribute is not present too.
Extracting All Images
The second biggest use case for scrapers is to extract images from websites and download them or just store their information, like where they are located, their display size, alternative text, and much more.
Looking for a present src attribute helps to find images that have something to display. Naturally, sometimes the source attribute is added through JavaScript, and you must do some reverse engineering—but this is not the subject of this chapter.
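A short sketch of such an image extractor (the attribute names printed are the common ones; which of them exist depends on the page):

```python
for image in soup.find_all('img', src=True):
    # get() returns None instead of raising an error when an attribute is missing.
    print(image['src'], image.get('alt'), image.get('width'), image.get('height'))
```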
Finding Tags Through Their Attributes
Sometimes you must find tags based on their attributes. For example, we identified HTML blocks for the requirements in the previous chapter through their class attribute.
The previous sections have shown you how to find tags where an attribute is present. Now it’s time to find tags whose attributes have certain values.
You can use any attribute in the find and find_all methods. The only exception is class, because it is a keyword in Python; as shown below, you can use class_ instead.
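A hedged sketch of the kinds of calls discussed here; the class name and search text are illustrative, and the last two calls are the two text-matching examples compared next:

```python
import re

soup.find_all('p', class_='first')                 # class is a keyword, so class_
soup.find_all('p', text='paragraph')               # exact-text match
soup.find_all('p', text=re.compile('paragraph'))   # paragraphs containing the word
```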
The difference between the two preceding examples is their result. Because in the example HTML there is no paragraph that contains only the text “paragraph”, an empty list is returned. The second method call returns a list of paragraph tags that contain the word “paragraph.”
Finding Multiple Tags Based on Property
Previously, you have seen how to find one kind of tag (<p>, <img>) based on its properties.
Here you use the True keyword to match all tags. If you don’t provide an attribute to narrow the search, you will get back a list of all tags in the HTML document.
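For example (a sketch; the src attribute is just one possible filter):

```python
soup.find_all(True)            # every tag in the document
soup.find_all(True, src=True)  # every tag, of any kind, that has a src attribute
```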
Changing Content
I rarely use this function of Beautiful Soup, but valid use cases exist, so I think you should learn how to change the contents of a soup. Because I don’t use this functionality a lot, this section is brief and won’t go into deep detail.
Adding Tags and Attributes
Adding tags to the HTML is easy, though it is seldom used. If you add a tag, you must take care where and how you do it. You can use two methods: insert and append. Both work on a tag of the soup.
insert requires a position at which to insert the new tag, and the new tag itself.
append requires only the new tag; it appends the new tag to the end of the tag on which the method is called.
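A small sketch of both methods, assuming the soup from the earlier examples (the paragraph texts are illustrative):

```python
new_tag = soup.new_tag('p')
new_tag.string = 'A freshly added paragraph'
soup.body.append(new_tag)         # goes to the end of <body>

another_tag = soup.new_tag('p')
another_tag.string = 'Inserted at a given position'
soup.body.insert(0, another_tag)  # becomes the first child of <body>
```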
For the two methods just shown, there are convenience methods too: insert_before, insert_after.
The only difference is that the insert_after method is not implemented on soup objects, just on tags.
Anyway, with these methods you must pay attention to where you insert or append new tags in the document.
Changing Tags and Attributes
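A minimal sketch of the kind of change discussed below; the class name withid comes from the text, the loop itself is my illustration:

```python
for tag in soup.find_all(id=True):
    tag['class'] = 'withid'   # adds the class attribute, or overwrites an existing one
```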
The preceding example changes (or adds) the class withid to all tags that have an id attribute.
Deleting Tags and Attributes
If you want to delete a tag, you can use either extract() or decompose() on the tag.
extract() removes the tag from the tree and returns it, so you can use it in the future or add it to the HTML content at a different position.
Deletion doesn’t only work for tags; you can remove attributes of tags too.
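A brief sketch of both kinds of deletion (which tags exist depends on your HTML; the selectors are illustrative):

```python
paragraph = soup.find('p')
if paragraph:
    extracted = paragraph.extract()   # removed from the tree, but still usable
    # paragraph.decompose()           # alternative: destroy the tag completely

link = soup.find('a', href=True)
if link:
    del link['href']                  # remove just the attribute, keep the tag
```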
Finding Comments
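One way to do this is to filter text nodes by their type; a sketch with an illustrative snippet:

```python
from bs4 import BeautifulSoup, Comment

comment_html = '<p>A paragraph <!-- with a comment --></p>'
soup = BeautifulSoup(comment_html, 'html.parser')

for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
    print(comment)
```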
The preceding code finds and prints the contents of all comments. To make it work, you need to import Comment from the bs4 package too.
Converting a Soup to HTML Text
This is one of the easiest parts of Beautiful Soup because, as you may know from your Python studies, everything in Python is an object, and objects have a __str__ method that returns the string representation of the object.
Instead of writing something like soup.__str__() every time, this method is called automatically whenever you convert the object to a string—for example, when you print it to the console: print(soup).
However, this results in the same string representation you provided as the HTML content. You can do better and produce a formatted string.
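A quick comparison of the two forms:

```python
print(soup)             # the same flat markup you passed in
print(soup.prettify())  # indented, human-readable HTML
```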
Extracting the Required Information
Now it is time to prepare your fingers and keyboard, because you are about to create your first dedicated scraper, which will extract the required information, introduced in Chapter 2, from the Sainsbury’s website.
All the source code shown in this chapter can be found in the file called bs_scraper.py in this book’s source code.
However, I suggest you start by trying to implement each piece of functionality yourself, with the tools and knowledge you have learned from this book already. I promise it is not hard—and if your solution differs a bit from mine, don’t worry. This is coding; every one of us has his or her own style and approach. What matters is the result in the end.
Identifying, Extracting, and Calling the Target URLs
The first step in creating the scraper is to identify the links that lead us to product pages. In Chapter 2 we used Chrome’s DevTools to find the corresponding links and their locations.
You now have the links that lead to pages listing products, each showing 36 at most.
The navigation goes from “Chicken & turkey” to “Sauces, marinades & Yorkshire puddings,” which leads to the third layer of links.
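A hedged sketch of this URL gathering; the class names 'categories shelf' and 'categories aisles' come from the discussion below, while the function names, the BFS bookkeeping, and the rest of the selectors are illustrative assumptions rather than the book’s exact listing:

```python
from collections import deque
from urllib.request import urlopen

from bs4 import BeautifulSoup


def get_page(url):
    """Download a page and return its parsed soup (error handling omitted)."""
    return BeautifulSoup(urlopen(url).read(), 'html.parser')


def gather_product_list_urls(start_url):
    queue = deque([start_url])
    visited = set()
    product_list_urls = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        soup = get_page(url)
        # Shelves are looked up first, aisles only as a fallback (see below).
        navigation = (soup.find_all('ul', class_='categories shelf')
                      or soup.find_all('ul', class_='categories aisles'))
        if navigation:
            for ul in navigation:
                queue.extend(a['href'] for a in ul.find_all('a', href=True))
        else:
            product_list_urls.append(url)  # no deeper navigation: a product list
    return product_list_urls
```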
The preceding code uses the simple Breadth First Search (BFS) from the previous chapter to navigate through all the URLs until it finds the product lists. You can change the algorithm to Depth First Search (DFS); this results in a logically cleaner solution, because if your code finds a URL that points to a navigation layer, it digs deeper until it finds all the pages.
The code looks first for shelves (categories shelf), which are the last layer of navigation, before extracting aisles (categories aisles). This is because if it extracted aisles first, all those URLs would already be marked as visited, and the shelves and their content would be missing.
Navigating the Product Pages
In Chapter 2 you saw that products can be listed on multiple pages. To gather information about every product, you need to navigate between these pages.
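A hedged sketch of this navigation, reusing the get_page helper from the previous sketch; the class names for the product headings and the “next page” link are illustrative assumptions, not the site’s verified markup:

```python
def gather_product_urls(product_list_urls):
    product_urls = []
    for url in product_list_urls:
        while url:
            soup = get_page(url)
            for heading in soup.find_all('h3', class_='productNameAndPromotions'):
                link = heading.find('a', href=True)
                if link:
                    product_urls.append(link['href'])
            # Follow the "next page" link instead of the page numbering.
            next_item = soup.find('li', class_='next')
            next_link = next_item.find('a', href=True) if next_item else None
            url = next_link['href'] if next_link else None
    return product_urls
```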
The preceding code block navigates through all the product lists and adds the URLs of the product pages to the list of products.
I used a BFS again; a DFS would be OK too. The interesting thing is the handling of the next pages: you don’t search for the page numbering of the navigation but consecutively look for the link pointing to the next page. This is useful for bigger sites, where you have umpteen-thousand pages that won’t all be listed on the first page.1
Extracting the Information
You arrived at the product page. Now it is time to extract all the information required.
Because you already identified and noted the locations in Chapter 2, it will be a simple task to wire everything together.
Depending on your preferences, you can use dictionaries, named tuples, or classes to store information on a product. Here, you will create code using dictionaries and classes.
Using Dictionaries
The first solution you create will store the extracted information of products in dictionaries.
The keys in the dictionary will be the fields’ names (which will later be used as headers in a CSV [comma-separated values] file, for example), and the values will be the extracted information.
I could list here how to extract all the information required, but I will only list the tricky parts. The other building blocks you should figure out yourself, as an exercise.
You can take a break, put down the book and try to implement the extractor. If you struggle with nutrition information or product origin, you will find help below.
If you are lazy, you can go ahead and find my whole solution later in this section or look at the source code provided for this book.
After implementing a solution, I hope you’ve got something similar to the following code:
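A condensed, hedged sketch of the dictionary-based extractor (reusing the get_page helper from earlier; the class names such as pricePerUnit and nutritionTable are illustrative assumptions, and only a few of the required fields are shown):

```python
def extract_product_information(product_url):
    soup = get_page(product_url)
    product = {'url': product_url}

    # Every tag is verified before use: if a block is missing on a page,
    # the corresponding field is skipped instead of crashing the scraper.
    name_tag = soup.find('h1')
    if name_tag:
        product['name'] = name_tag.text.strip()

    price_tag = soup.find('p', class_='pricePerUnit')
    if price_tag:
        product['price'] = price_tag.text.strip()

    nutrition_table = soup.find('table', class_='nutritionTable')
    if nutrition_table:
        for row in nutrition_table.find_all('tr')[1:]:
            label = row.find('th')
            value = row.find('td')
            if label and value:
                product[label.text.strip()] = value.text.strip()

    return product
```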
As you can see in the preceding code, this is the biggest part of the scraper. But hey! You finished your very first scraper, which extracts meaningful information from a real website.
What you have probably noticed is the caution implemented in the code: every HTML tag is verified, and if it does not exist, no processing happens—otherwise the application would crash, which would be a disaster.
Using Classes
You can implement the class-based solution similarly to the dictionary-based one. The only difference is in the planning phase: with a dictionary you don’t have to plan much ahead, but with classes, you need to define the class model up front.
For my solution, I used a simple, pragmatic approach and created two classes: one holds the basic information; the second is a key-value pair for nutrition details.
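A hedged sketch of such a pair of classes; the class and field names are illustrative, not the book’s exact model:

```python
class Product:
    def __init__(self, url):
        self.url = url
        self.name = None
        self.price = None
        self.nutritions = []  # filled with NutritionEntry objects


class NutritionEntry:
    def __init__(self, name, value):
        self.name = name
        self.value = value
```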
I don’t plan to go deep into OOP2 concepts. If you want to learn more, you can refer to different Python books .
As you already know, filling these objects is different too. There are different options for how to solve such a problem,3 but I used a lazy version where I access and set every field directly.
Unforeseen Changes
While implementing the source code yourself, you may have found some problems and had to react to them.
What should you do in such cases? Well, first, mention to your customer (if you have one) that you’ve found tables that contain nutrition information but in varying detail and format. Then think of a solution that is good for the outcome and doesn’t require extra workarounds in your code.
The exceptional case of Energy and Energy kcal (when they are not in a th) in the preceding code is handled automatically for tables that provide labels for every row.
Such changes are inevitable. Even though you get requirements and prepare your scraping process, exceptions in the pages can occur. Therefore, always be prepared and write code that can handle the unexpected, so you don’t have to redo all the work. You can read more about how I deal with such things later in this chapter.
Exporting the Data
Now that all information is gathered, we want to store it somewhere because keeping it in memory does not have much use for our customer.
In this section, you will see basic approaches to how you can save your information into a CSV or JSON file, or into a relational database, which will be SQLite.
Each subsection will create code for both export objects: dictionaries and classes.
To CSV
A good old friend to store data is CSV. Python provides built-in functionality to export your information into this file type.
Because you implemented two solutions in the previous section, you will now create exports for both. But don’t worry; you will keep both solutions simple.
The common part is the csv module of Python. It is integrated and has everything you need.
Quick Glance at the csv Module
Here you get a quick introduction into the csv module of the Python standard library. If you need more information or reference, you can read it online.4
I will focus on writing CSV files in this section; here I present the basics to give you a smooth landing on the examples where you write the exported information into CSV files .
For the code examples, I assume you did import csv.
dialect: With the dialect parameter, you can specify formatting attributes grouped together to represent a common formatting. Such dialects are excel (the default dialect), excel_tab, or unix_dialect. You can define your own dialects too.
delimiter: If you do/don’t specify a dialect, you can customize the delimiter through this argument. This can be needed if you must use some special character for delimiting purposes because comma and escaping don’t do the trick, or your specifications are restrictive.
quotechar: As its name already mentions, you can override the default quoting. Sometimes your texts contain quote characters and escaping results in unwanted representations in MS Excel.
quoting: Quoting occurs automatically if the writer encounters the delimiter inside a field’s value. You can override the default behavior, and you can completely disable quoting (although I don’t encourage you to do this).
lineterminator: This setting enables you to change the character at the line’s ending. It defaults to '\r\n', but on Windows you don’t want this, just '\n'.
Most of the time, you are good to go without changing any of these settings (and relying on the Excel configuration). However, I encourage you to take some time and try out different settings. If something is wrong with your dataset and the export configuration, you’ll get an exception from the csv module—and this is bad if your script already scraped all the information and dies at the export.
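An illustrative combination of the options listed above (the file name, delimiter, and data are assumptions, not a recommended configuration):

```python
import csv

rows = [['name', 'price'], ['Example product', '2.50']]

with open('export.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=';', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
    writer.writerows(rows)
```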
Line Endings
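As a general csv-module note (stated here in my own words rather than the book’s listing): on Windows, open the target file with newline='' so the writer’s own line terminator is not doubled into blank lines between rows.

```python
import csv

with open('export.csv', 'w', newline='') as csv_file:
    csv.writer(csv_file).writerow(['name', 'price'])
```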
Headers
What are CSV files without a header? Useful for those who know what to expect in which order, but if the order or number of columns changes, you can expect nothing good.
Saving a Dictionary
To save a dictionary, Python has a custom writer object that handles this key-value pair object: the DictWriter.
This writer object properly handles mapping dictionary elements to lines, using the keys to write the values into the right columns. Because of this, you must provide an extra argument to the constructor of DictWriter: the list of field names. This list determines the order of the columns, and Python raises an error if the dictionary you want to write contains a key that is not among these field names.
If the order of the result doesn’t matter, you can simply use the keys of the dictionary you want to write as the field names. However, this can lead to various problems: the order is not defined, so it is mostly random on every machine you run the script on (sometimes even on the same machine); and if the dictionary you choose is missing some keys, then your whole export is missing those values.
Alternatively, you can define the set of headers to use beforehand. In this case, you have power over the order of the fields, but you must know all the fields possible. This is not easy if you deal with dynamic key-value pairs just like the nutrition tables.
As you can see, with both options you must create the list (set) of possible headers before you write your CSV file. You can do this by iterating through all the product information and putting the keys of each into a set, or by adding the keys to a global set in the extraction method.
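A hedged sketch of such an export; the function and file names are assumptions, while the idea of collecting the header fields in a separate pass follows the text below:

```python
import csv


def get_field_names(products):
    field_names = set()
    for product in products:
        field_names.update(product.keys())
    return list(field_names)


def save_to_csv(file_name, products):
    with open(file_name, 'w', newline='') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=get_field_names(products))
        writer.writeheader()
        writer.writerows(products)
```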
I hope your code looks like this one. As you can see, I used an extra method to gather all the header fields. However, as mentioned earlier, use the version that fits you better. My solution is slower because I iterate multiple times over the rows.
Saving a Class
Even with this structure, you will need a minimal key mapping from the table to the properties of the Product class. This is because some properties need to be filled with values from table fields that have a different name; for example, total_sugars will get the value from the field Total Sugars.
As you can see, the code didn’t change much; I highlighted the parts that are different. And you must modify your code in a similar fashion to fill the class’ fields.
Using the get_field_names method may seem like a bit of overkill. If you feel like it, you can inline the function’s body instead of the method call, or create a method in the Product class that returns the field names.
Again, this approach results in an unpredictable order of columns in your CSV file. To ensure the same order between runs and computers, you should define a fixed list for the fieldnames and use it for the export.
Another interesting part of the code is the use of the __dict__ attribute of the Product class. This is a handy built-in attribute that exposes the properties of an instance object as a dictionary. The vars built-in function works the same way and returns the attributes of the given instance object as a dictionary.
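A tiny illustration, using the illustrative Product class sketched earlier (the URL and name are made up):

```python
product = Product('http://www.example.com/product')
product.name = 'Example product'

row = vars(product)        # equivalent to product.__dict__
print(row['name'])         # 'Example product'
```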
To JSON
An alternative and more popular way to hold data is as JSON files. Therefore, you will create code blocks to export both dictionaries and classes to JSON files.
Quick Glance at the json Module
This will be a quick introduction too. The json module of the Python standard library is huge, and you can find more information online.7
As in the CSV section, I’ll focus on writing JSON files because the application writes the product information into JSON files.
I assume you did import json for the examples in this section.
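A minimal sketch of writing to a JSON file (the contents are illustrative; only the result.json file name follows the text below):

```python
import json

contents = [{'name': 'Product A'}, {'name': 'Product B'}]

with open('result.json', 'w') as outfile:
    json.dump(contents, outfile)
```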
The preceding example writes the content (two dictionaries in a list) to the result.json file.
And this is everything you need to know for now about writing data to JSON files.
Saving a Dictionary
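A hedged sketch of such an export function (the function name and the indent option are my own choices):

```python
import json


def save_to_json(file_name, products):
    with open(file_name, 'w') as outfile:
        json.dump(products, outfile, indent=2)
```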
The preceding code saves the list filled with product information into the designated JSON file.
Saving a Class
Saving a class to a JSON file is not a trivial task, because classes are not your typical object to save into a JSON file.
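One way around this, sketched here under the assumption that the objects follow the illustrative Product and NutritionEntry classes from earlier, is to let json.dump convert non-serializable objects through their __dict__:

```python
import json


def save_products_to_json(file_name, products):
    with open(file_name, 'w') as outfile:
        # default= is called for objects the json module cannot serialize itself.
        json.dump(products, outfile, default=lambda obj: obj.__dict__, indent=2)
```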
To a Relational Database
Now you will learn how to connect to a database and write data into it. For the sake of simplicity, all the code will use SQLite because it doesn’t require any installation or configuration.
The code you will write in this section will be database agnostic; you can port your code to populate any relational database (MySQL, Postgres).
The data you extracted in this chapter (and you will see throughout this book) doesn’t need a relational database because it has no relations defined. I won’t go into deeper detail on relational databases because my purpose is to get you going on your way to scraping, and many clients need their data in a MySQL table. Therefore, in this section, you will see how you can save the extracted information into an SQLite 3 database. The approach is similar to other databases. The only difference is that those databases need more configuration (like username, password, connection information), but there are plenty of resources available.
The first step is to decide on a database schema. One option is to put everything in a single table. In this case, you will have some empty columns, but you don’t have to deal with dynamic names from the nutrition table. The other approach is to store common information (everything but the nutrition table) in one table and reference a second table with the key-value pairs.
The first approach is good when using dictionaries in the way this chapter uses them, because there you have all entries in one dictionary and it is hard to split the nutrition table from the other content. The second approach is good for classes, because there you already have two classes storing common information and the dynamic nutrition table.
Sure, there is a third approach: set the columns in stone, and then you can skip the unneeded or unknown keys that result from different nutrition tables across the site. With this, you must take care of error handling and missing keys—but it keeps the schema maintainable.
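A hedged sketch of such a single-table schema in Python with sqlite3; apart from code, url, and name (whose constraints are described below), the column names are illustrative assumptions:

```python
import sqlite3

table_ddl = """
CREATE TABLE IF NOT EXISTS sainsburys_product (
    code TEXT PRIMARY KEY,
    url TEXT NOT NULL,
    name TEXT NOT NULL,
    price TEXT,
    unit TEXT
)"""

connection = sqlite3.connect('sainsburys.db')
connection.execute(table_ddl)
connection.commit()
connection.close()
```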
This DDL is for SQLite 3; you may need to change it according to the database you’re using. As you can see, the table is created only if it does not exist. This avoids errors and error handling when running the application multiple times. The primary key of the table is the product code. URL and product name cannot be null; for the other attributes you can allow null.
The interesting code comes when you add entries to the database. There can be two cases: you insert a new value, or the product is already in the table and you want to update it.
When you insert a new value, you must make sure the information contains every column by name, and if not, you must avoid exceptions. For the products of this chapter you could create a mapper that maps keys to their database representation prior to saving. I won’t do this, but you are free to extend the examples as you wish.
When updating, there is already an entry in the database. Therefore, you must find the entry and update the relevant (or all) fields. Naturally, if you work with a historical dataset, then you don’t need any updates, just inserts.
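A hedged sketch of saving the rows; the function names follow the description below, the column subset is illustrative, and SQLite’s INSERT OR REPLACE is used here as one way to cover the insert-or-update case:

```python
import sqlite3


def save_to_sqlite(database_path, rows):
    # database_path: path to the SQLite file (created if missing);
    # rows: the extracted product dictionaries.
    connection = sqlite3.connect(database_path)
    connection.execute(table_ddl)  # the DDL from the previous listing
    for row in rows:
        __save_row(connection, row)
    connection.commit()
    connection.close()


def __save_row(connection, row):
    # dict.get avoids KeyErrors when a column is missing from the row.
    connection.execute(
        'INSERT OR REPLACE INTO sainsburys_product (code, url, name, price) '
        'VALUES (?, ?, ?, ?)',
        (row.get('code'), row.get('url'), row.get('name'), row.get('price')))
```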
The preceding code is a simple example of saving the entries in the database.
The main entry point is the save_to_sqlite function. The database_path variable holds the path to the target SQLite database. If it doesn’t exist, the code will create it for you. The rows variable contains the data-dictionaries in a list.
The interesting part is the __save_row function. It saves a row, and as you can see, it requires a lot of information about the object you want to save. I use the get method of the dict class to avoid KeyErrors when a given key is not present in the row to persist.
If you are using classes , I suggest you look at peewee,8 an ORM9 tool that helps you map objects to the relational database schema. It has built-in support for MySQL, PostgreSQL, and SQLite. In the examples, I will use peewee too because I like the tool.10
Here you can find a quick primer on peewee, where we save the data gathered into classes to the same SQLite database schema as before.
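A hedged sketch of such a model; the database file name and the field list are illustrative, only the class name ProductOrm follows the text:

```python
from peewee import Model, SqliteDatabase, TextField

db = SqliteDatabase('sainsburys.db')


class ProductOrm(Model):
    url = TextField()
    name = TextField()
    price = TextField(null=True)

    class Meta:
        database = db
```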
This structure enables you to use the class later with peewee and store the information using ORM without any conversion. I named the class ProductOrm to show the difference from the previously used Product class.
To save an instance of the class, you simply must adapt the functions of the previous section.
Any way you proceed, you can use peewee to take over all the action of persisting the data: creating the table and saving the data.
To create the table, you must call the create_table method on the ProductOrm class. With the True parameter provided, this method call will ensure that your target database has the table; if the table isn’t there, it will be created. How will the table be created? This is based on the ORM model provided by you, the developer: peewee creates the DDL based on the ProductOrm class—TextField fields become TEXT database columns, and IntegerField fields generate an INTEGER column.
And to save the entity itself, you must call the save method on the instantiated object. This relieves you of knowing the name of the target table, which values go into which column, and how to construct the INSERT statement… And this is just great if you ask me.
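Continuing the sketch above (the product values are made up):

```python
db.connect()
ProductOrm.create_table(True)  # True: only create the table if it is missing

product = ProductOrm(url='http://www.example.com/product',
                     name='Example product', price='2.50')
product.save()
```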
To a NoSQL Database
It would be a shame to forget about modern, state-of-the-art databases. Therefore, in this section, you will export the gathered information into MongoDB.
If you are familiar with this database and followed along with my examples in this book, you already know how I will approach the solution: I will use previous building blocks. In this case, the JSON export.
A NoSQL database is a good fit because most of the time such databases are designed to store documents that share few or no relations with other entries in the database—at least they shouldn’t do so excessively.
Installing MongoDB
Unlike SQLite, MongoDB must be installed on your computer to get it running.
In this section, I won’t go into detailed instructions on how to install and configure MongoDB; it is up to you, and their homepage has very good documentation,11 especially for Python developers.
I assume for this section you installed MongoDB and the Python library: PyMongo. Without this, it will be hard for you to follow the code examples.
Writing to MongoDB
As previously, I will focus only on writing to the target database because the scraper stores information but won’t read any entries from the database.
Writing to a NoSQL database like MongoDB is easier because it doesn’t require a real structure and you can put everything into it as you wish. Sure, it would be ridiculous to do that; we need structure to avoid chaos. But theoretically, you can just jam everything into your database.
My version is like the SQL version. I open the connection to the provided database and insert each product into the MongoDB database. To get the JSON representation of the product, I use the __dict__ variable.
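A hedged sketch of this approach with PyMongo; the database and collection names are illustrative, and the host and port are the MongoDB defaults:

```python
from pymongo import MongoClient


def save_to_mongodb(products, database_name='sainsburys'):
    client = MongoClient('localhost', 27017)
    collection = client[database_name]['products']
    for product in products:
        collection.insert_one(product.__dict__)
```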
If you want to insert a collection into the database, use insert_many instead of insert_one.
If you are interested in using a library like peewee just for MongoDB and ODM (Object-Document Mapping), you can take a look at MongoEngine.
Performance Improvements
If you put the code of this chapter together and run the extractor, you will see how slow it is.
Serial operations are always slow, and depending on your network connection, they can be slower than slow. The parser behind Beautiful Soup is another point where you can gain some performance, but this is not a big boost. Moreover, what happens if you encounter an error right before the application finishes? Will you lose all your data?
In this section, I’ll try to give you options for how you can handle such cases, but it is up to you to implement them.
You could create benchmarks of the different solutions in this section, but as I mentioned earlier in this book, it makes no sense because the environment always changes, and you cannot ensure that your scripts run in exactly the same conditions.
Changing the Parser
One way to improve Beautiful Soup is to change the parser that it uses to create the object model out of the HTML content.
html.parser
lxml (install with pip install lxml)
html5lib (install with pip install html5lib)
The default parser, which is already installed with the Python standard library, is html.parser—as you have already seen in this book.
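Switching parsers is only a matter of the second argument (assuming html_text holds the downloaded markup):

```python
soup = BeautifulSoup(html_text, 'html.parser')  # built-in default
soup = BeautifulSoup(html_text, 'lxml')         # requires pip install lxml
soup = BeautifulSoup(html_text, 'html5lib')     # requires pip install html5lib
```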
Changing the parser doesn’t give such a speed boost that you will see the difference right away, just some minor improvements. However, to show some flawed benchmarking, I added a timer that starts at the beginning of the script and prints the time needed to extract all 3,005 products without writing them to any storage.
Some Execution Speed Comparisons
Parser | Entries | Time taken (in seconds)
---|---|---
html.parser | 3,005 | 2,347.9281
lxml | 3,005 | 2,167.9156
lxml-xml | 3,005 | 2,457.7533
html5lib | 3,005 | 2,544.8480
As you can see, the difference is significant. lxml wins the game because it is a well-defined parser written in C, and therefore it can work extremely fast on well-structured documents.
html5lib is very slow; its only advantage is that it creates valid HTML5 code from any input.
Choosing a parser has trade-offs. If you need speed, I suggest you install lxml. If you cannot rely on installing any external modules to Python, then you should go with the built-in html.parser.
Any way you decide, you must remember: if you change the parser, the parse tree of the soup changes. This means you must revisit and perhaps change your code.
Parse Only What’s Needed
Even with an optimized parser, creating the document model of the HTML text takes time. The bigger the page, the more slowly this model is created.
One option to tune the performance a bit is to tell Beautiful Soup which part of the whole page you will need, and it will create the object model from the relevant part. To do this, you can use a SoupStrainer object.
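A minimal sketch; the class value 'productLister gridView' follows the description below, while html_text is assumed to hold the downloaded markup:

```python
from bs4 import BeautifulSoup, SoupStrainer

strainer = SoupStrainer('ul', attrs={'class': 'productLister gridView'})
soup = BeautifulSoup(html_text, 'html.parser', parse_only=strainer)
```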
The preceding code creates a simple SoupStrainer that limits the parse tree to unordered lists having a class attribute 'productLister gridView'—which helps to reduce the site to the required parts—and it uses this strainer to create the soup.
Because you already have a working scraper, you can replace the soup calls with ones using a strainer to speed things up.
The link leads to product pages.
The link leads to a first-level sublist.
The link leads from a first-level sublist to a second-level sublist.
Here you have all three versions of the lists that can occur, and the soup contains all the relevant information for each.
A (flawed) benchmark using a hard cache:12 my script gained 100% speedup (from 158.907 seconds to 79.109 seconds) using strainers.
Saving While Working
If your application encounters an exception while running, the current version breaks on the spot and all your gathered information is lost.
One approach is to use DFS. With this approach, you go straight down the target graph and extract the products by the shortest route. And when you encounter a product, you save it to your target medium (CSV, JSON, relational, or NoSQL database).
Another approach keeps the BFS and saves the products as they are extracted. This works just like the DFS approach; the only difference is when you reach the products.
Both approaches need a mechanism to restart the work, or at least to save some time by skipping already written products. For this, you create a function that loads the contents of the target file, stores the extracted URLs in memory, and skips the download of already extracted products.
Staying with the BFS solution of this chapter, you must modify the extract_product_information function to yield every piece of product information when it is ready. Then you wrap the call of this method into a loop and save the results to your target.
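A hedged sketch of that change, reusing the get_page helper from earlier; save_product stands in for whichever export function you use, and only a couple of fields are shown:

```python
def extract_product_information(product_urls):
    for url in product_urls:
        soup = get_page(url)
        product = {'url': url}
        name_tag = soup.find('h1')
        if name_tag:
            product['name'] = name_tag.text.strip()
        # ... extract the remaining fields as before ...
        yield product  # hand back each product as soon as it is ready


for product in extract_product_information(product_urls):
    save_product(product)  # write to CSV/JSON/database piece by piece
```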
Surely, this creates some overhead: you open a file handle every time you save a piece of data, you must take care of saving the entries into a JSON array, you open and close database connections for every write… Alternatively, you can do the opening and closing (of the file handle or database connection) around the whole extraction. In that case, you must take care of flushing or committing the results, so that if something happens, your extracted data is already saved.
What about try-except?
Well, wrapping the whole extraction code in a try-except block is a solution too, but you must ensure that you don’t forget about the exceptions that happened and that you can get the missing data later. Such exceptions can happen while you’re at a main page that leads to detail pages—and from experience I know that once you wrap code in an exception-handling block, you will forget to revisit the issues in the future.
Developing for the Long Run
Sometimes you develop scrapers for bigger projects, and you cannot launch your script after every change because it takes too much time.
Even though this scraper you implemented is short and extracts around 3,000 products, it takes some time to finish—and if you have an error in the data extraction, it is always time-consuming to fix the error and start over.
In such cases I utilize caching of intermediate step results; sometimes I cache the HTML content itself. This section is about my approach and my opinions.
Because you already have deep Python knowledge, this section is again an optional read: feel free to skip it if you know how to apply such approaches.
Caching Intermediate Step Results
The first thing I always did when I started working with a basic, self-written spider just like the one in this example was to cache intermediate step results.
Applying this approach to this chapter’s code, you export the resulting URLs after each step into a file and change the application so that it reads the file of the last step back when it starts and skips the scraping until the following step.
Your challenge in such cases is to write your code so that it continues where it left off. With intermediate results, this can mean you have to scrape the biggest part of the website again, because your script died before it could save all the information on products—or it died while it was about to save the extracted information.
This step is not bad, because you have a checkpoint where you can continue if a step messes up. But honestly, it requires a lot of extra work, like saving the intermediate steps and loading them back for each stage. And because I am lazy and learned a lot along my development journey, I use the next solution as the basis for all my scraping tasks.
Caching Whole Websites
A better approach is to cache whole websites locally. This gives better performance in the long run when you rerun your script over and over.
When implementing this approach, I extend the functionality of the website gathering method to route over a cache: if the requested URL is in the cache, return the cached version; if it’s not present, gather the site and store the result in the cache.
You can use file-based or database caches to store the websites while you’re developing. In this section you will learn both approaches.
The basic idea for the cache is to create a key that identifies the website. Keys are unique identifiers, and a web page’s URL is unique too. Therefore, let’s use this as the key, and the content of the page is the value.
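A hedged sketch of a file-based cache built on this idea; the cache directory and the crude filename sanitization are illustrative, and the filename limits it runs into are exactly what the next section discusses:

```python
from pathlib import Path
from urllib.request import urlopen

CACHE_DIR = Path('.site_cache')
CACHE_DIR.mkdir(exist_ok=True)


def get_page_cached(url):
    # The URL is the cache key; turning it into a file name runs into the
    # operating-system limits discussed in the next section.
    cache_file = CACHE_DIR / url.replace('/', '_').replace(':', '_')
    if cache_file.exists():
        return cache_file.read_text(encoding='utf-8')
    content = urlopen(url).read().decode('utf-8')
    cache_file.write_text(content, encoding='utf-8')
    return content
```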
Limitations by Operating Systems
Operating system | File system | Invalid filename characters | Maximum filename length
---|---|---|---
Linux | Ext3/Ext4 | / and the null character (NUL) | 255 bytes