HTML parsers

For parsing HTML, the recommended third-party package is lxml. Although it is primarily an XML parser, it includes a very good HTML parser: it's quick, it offers several ways of navigating documents, and it is tolerant of broken HTML.

On Debian and Ubuntu distributions, lxml can be installed through the python-lxml package (python3-lxml for Python 3). If you need a more up-to-date version than the distribution provides, install it through pip with the pip install lxml command.
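As a minimal sketch of what this looks like in practice, the following parses a deliberately sloppy fragment (the list items and the body are never closed) and pulls out the links in two ways: by iterating over elements and with an XPath query. The SAMPLE markup and the function names are made up for illustration, and the example assumes lxml is installed.

```python
from lxml import html

# Deliberately sloppy markup: <li>, <body> and <html> are never closed.
SAMPLE = """
<html><body>
<h1>Parsers</h1>
<ul>
  <li><a href="https://lxml.de/">lxml</a>
  <li><a href="https://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>
</ul>
"""


def extract_links(markup):
    """Return (text, href) pairs by iterating over every <a> element."""
    root = html.fromstring(markup)
    return [(a.text_content().strip(), a.get("href")) for a in root.iter("a")]


def extract_hrefs_xpath(markup):
    """Return only the href values, selected with an XPath expression."""
    root = html.fromstring(markup)
    return [str(href) for href in root.xpath("//a/@href")]


if __name__ == "__main__":
    for text, href in extract_links(SAMPLE):
        print(text, href)
```

Element iteration is convenient when you want to inspect each node, while XPath is usually shorter when you only need one attribute or a specific path through the tree.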

Another option is BeautifulSoup. It is pure Python, so it can be installed with pip (the current package name is beautifulsoup4), and it should run anywhere Python does. It has its own API rather than following lxml's, but it's a well-respected and capable library, and it can, in fact, use lxml as its backend parser.
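To give a feel for the BeautifulSoup API, here is a short sketch that collects link targets from a fragment with unclosed list items. The markup and the link_targets helper are hypothetical. It assumes the beautifulsoup4 package is installed and defaults to the standard library's html.parser backend; passing "lxml" as the parser name selects the lxml backend instead, provided lxml is also installed.

```python
from bs4 import BeautifulSoup

# Illustrative fragment; the <li> elements are left unclosed on purpose.
SAMPLE = "<ul><li><a href='/a'>First</a><li><a href='/b'>Second</a></ul>"


def link_targets(markup, parser="html.parser"):
    """Return the href of every <a> element, in document order."""
    soup = BeautifulSoup(markup, parser)
    return [a["href"] for a in soup.find_all("a")]


if __name__ == "__main__":
    print(link_targets(SAMPLE))
    # To use the lxml backend (requires lxml): link_targets(SAMPLE, "lxml")
```

Naming the parser explicitly is worthwhile because different backends can repair broken markup in slightly different ways.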
