Scraping with Beautiful Soup 4

Any publicly accessible web page can be retrieved over HTTP with the requests library. As you may remember, if the response body is JSON, requests has a built-in method for parsing it (the json() method of the response object). HTML is different: parsing it is no simple task. It is far more complex than ordinary JSON; HTML files are often large and can be invalid (browsers will often still "fix" and render them).

To handle this, we'll be using Beautiful Soup 4 (BS4), one of the two main libraries for parsing HTML, the other being lxml. Beautiful Soup knows how to parse HTML and can even repair invalid files. Once the document has a Pythonic representation, we can drill down and retrieve the specific elements we're interested in by combining element ID, class, attributes, position in the document, and so on, using either CSS selectors or the XPath mini-language (XPath is available through lxml rather than through Beautiful Soup itself).
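As a rough illustration of the ideas above, here is a minimal sketch that parses a small, deliberately invalid HTML snippet and retrieves elements by ID, by class, and with CSS selectors. The snippet, its IDs, and its class names are made up for this example, and the sketch assumes the beautifulsoup4 package is installed; it uses Python's built-in "html.parser" backend so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately invalid HTML: the <b>, <p>, <body> and <html>
# tags are never closed. The parser closes them when it reaches the end.
html = """
<html><body>
<h1 id="title">Fruit prices</h1>
<ul class="prices">
  <li class="item">Apple</li>
  <li class="item">Pear</li>
  <li class="item sale">Plum</li>
</ul>
<p class="note">Updated <b>daily
"""

# Build the Pythonic representation of the document.
soup = BeautifulSoup(html, "html.parser")

# Look up a single element by its ID.
title = soup.find(id="title").get_text(strip=True)

# Look up all elements with a given tag and class.
items = [li.get_text(strip=True) for li in soup.find_all("li", class_="item")]

# CSS selectors: by class combination, and by order among siblings.
on_sale = [li.get_text(strip=True) for li in soup.select("ul.prices li.sale")]
second = soup.select_one("ul.prices li:nth-of-type(2)").get_text(strip=True)

# The unclosed <p> and <b> were still parsed into a usable tree.
note = soup.find("p", class_="note").get_text(" ", strip=True)

print(title, items, on_sale, second, note, sep="\n")
```

In a real scraper the html string would typically come from the text attribute of a requests response rather than from a literal; the lookups themselves stay the same.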
