HTML in a nutshell

You are probably aware that web pages are written in three main languages: HTML, CSS, and JavaScript, of which only the last is an actual programming language. In this triad, CSS styles elements visually, for example, by setting their color or font. JavaScript runs on the client's machine and makes the page interactive, for example, by sending information back to the server or reacting when you select an item from a drop-down menu.

The main body of a page is described with HTML. Its goal is to describe the hierarchy and the layout of the page. HTML is a markup language designed specifically for web pages. It is a close relative of XML, a general-purpose markup language, although it is not a strict subset of it (both trace back to SGML). Like XML, HTML describes the document as a nested series of objects, defined with tags. It has a large vocabulary of those objects, each with its own behavior. For example, the <ol> tag describes an ordered list; its child elements, <li>, will be numbered. Similarly, <table> describes a table (duh), and the <a> element represents a link. Each element can have the following:

  • Internal text.
  • Child elements.
  • An ID attribute – unique identification for a specific element.
  • A class attribute – non-unique type identification. One or more classes and IDs help to apply specific designs and interactions to the right elements.
  • Any other attribute. For example, <a> elements encode their link as an href attribute (a hyperlink reference).
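
To make the list above concrete, here is a minimal sketch that walks a small snippet and records each element's text, children, ID, classes, and other attributes. It uses Python's built-in html.parser rather than Beautiful Soup, only to stay dependency-free. The snippet, the ElementLister class, and the example.com link are all made up for illustration, and the code assumes well-formed markup in which every tag is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical snippet, invented for this illustration.
SNIPPET = """
<ol id="steps" class="numbered compact">
  <li class="step">Fetch the page</li>
  <li class="step">Visit <a href="https://example.com/docs">the docs</a></li>
</ol>
"""


class ElementLister(HTMLParser):
    """Records every element with its attributes, direct text, and child tags."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per element, in document order
        self._open = []     # stack of elements that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element = {
            "tag": tag,
            "id": attrs.pop("id", None),
            "classes": (attrs.pop("class", None) or "").split(),
            "other": attrs,  # whatever is left, for example href
            "text": "",
            "children": [],
        }
        if self._open:
            self._open[-1]["children"].append(tag)
        self.elements.append(element)
        self._open.append(element)

    def handle_endtag(self, tag):
        # Assumes every tag is explicitly closed (no void tags like <br>).
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Attach text to the innermost open element.
        if self._open:
            self._open[-1]["text"] += data.strip()


parser = ElementLister()
parser.feed(SNIPPET)
for el in parser.elements:
    print(el["tag"], el["id"], el["classes"], el["other"],
          repr(el["text"]), el["children"])
```

Note that the <ol> element carries two classes at once, while its ID appears only once in the document, and that the link target lives in the href attribute of the <a> element rather than in its text.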

For the sake of scraping, we don't really need to understand how elements behave and differ from one another. Scrapers usually find specific elements using a combination of element names, attributes (IDs, classes, and so on), their relative position, and their content. As pages tend to differ slightly in both structure and content, in order to scrape them reliably, we have to understand the logic of the page builder, which might be tricky.
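
The four ways of locating elements mentioned above (name, attributes, relative position, and content) can be sketched as follows. This is only an illustration under a strong assumption: the made-up page below happens to be well-formed XML, so Python's built-in xml.etree.ElementTree can parse it. Real pages often are not, which is why a dedicated HTML parser such as Beautiful Soup is used in practice.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment, invented for this illustration.
PAGE = """
<div id="content">
  <table class="prices">
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Tea</td><td>3.50</td></tr>
    <tr><td>Coffee</td><td>4.20</td></tr>
  </table>
  <a class="next" href="/page/2">Next page</a>
  <a href="/about">About</a>
</div>
"""

root = ET.fromstring(PAGE)

# By element name and attribute: the table whose class is "prices".
table = root.find(".//table[@class='prices']")

# By relative position: skip the first (header) row, then read each row's cells.
rows = [[cell.text for cell in tr.findall("td")]
        for tr in table.findall("tr")[1:]]

# By attribute alone: the link marked with the class "next".
next_link = root.find(".//a[@class='next']").get("href")

# By content: the link whose visible text is "About".
about_links = [a.get("href") for a in root.iter("a") if a.text == "About"]

print(rows)
print(next_link)
print(about_links)
```

Each of these lookups encodes a guess about how the page was built (the header is always the first row, the pagination link always has the class "next"), which is exactly the logic that has to be worked out for every new site.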

It is important to note that this approach works only with the pre-generated, static content on the page.

Most modern pages use JavaScript, that is, client-side code, to perform some interactions; for example, to send an event to Google Analytics, or to adjust the layout depending on the window size. Some, however, use JavaScript intensively, even to load the data itself (the Facebook news feed is one example). Beautiful Soup does not run JavaScript, so it won't work with these kinds of pages. There are ways to overcome this limitation, for example, by using headless browsers. This approach, however, is significantly more complex and computationally demanding. We'll talk a little more about this topic, and about how to scrape JavaScript-intensive pages, in the final section of this chapter.
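
To see why this matters, consider the small sketch below. The page is hypothetical: its feed container is empty in the markup and would be filled in only by a script running in a browser. The VisibleText class is likewise made up for this example, and it uses Python's built-in html.parser, which, like Beautiful Soup, never executes scripts.

```python
from html.parser import HTMLParser

# Hypothetical page: the "feed" container is empty in the static markup and
# would only be populated by the script when run inside a browser.
PAGE = """
<html><body>
  <h1>News</h1>
  <div id="feed"></div>
  <script>
    document.getElementById("feed").textContent = "Post loaded by JavaScript";
  </script>
</body></html>
"""


class VisibleText(HTMLParser):
    """Collects the text found outside <script> elements."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.in_script:
            self.chunks.append(text)


parser = VisibleText()
parser.feed(PAGE)
print(parser.chunks)  # only the heading; the post text never reaches the parser
```

The parser sees the heading but finds nothing inside the feed, because the script is just text to it. A scraper working on the raw HTML would face the same empty container.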
