Automated scraping with Scrapely

For scraping the annotated fields, Portia uses a library called Scrapely (https://github.com/scrapy/scrapely), a useful open-source tool developed independently of Portia. Scrapely uses training data to build a model of what to scrape from a web page. The trained model can then be applied to scrape other web pages with the same structure.

You can install it using pip:

pip install scrapely

Here is an example to show how it works:

>>> from scrapely import Scraper
>>> s = Scraper()
>>> train_url = 'http://example.webscraping.com/view/Afghanistan-1'
>>> s.train(train_url, {'name': 'Afghanistan', 'population': '29,121,286'})
>>> test_url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> s.scrape(test_url)
[{u'name': [u'United Kingdom'], u'population': [u'62,348,447']}]

First, Scrapely is given the data we want to scrape from the Afghanistan web page to train the model (here, the country name and population). This model is then applied to a different country page and Scrapely uses the trained model to correctly return the country name and population here as well.
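To build intuition for how this kind of training-based extraction can work, here is a toy sketch (this is an illustration of the general idea, not Scrapely's actual algorithm): remember the HTML immediately surrounding each annotated value on the training page, then look for the same surrounding context on a structurally similar page. The HTML snippets below are simplified stand-ins for the country pages.

```python
# Toy illustration of training-based extraction, NOT Scrapely's real
# implementation: record the HTML context around each annotated value,
# then locate the same context on another page with the same structure.
def train(html, fields, context=20):
    model = {}
    for name, value in fields.items():
        i = html.index(value)
        # store up to `context` characters before and after the value
        model[name] = (html[max(0, i - context):i],
                       html[i + len(value):i + len(value) + context])
    return model

def scrape(html, model):
    result = {}
    for name, (prefix, suffix) in model.items():
        start = html.index(prefix) + len(prefix)
        end = html.index(suffix, start)
        result[name] = html[start:end]
    return result

# Simplified fragments of two pages sharing the same layout
train_html = '<tr><td>Name:</td><td class="w2p_fw">Afghanistan</td></tr>'
test_html = '<tr><td>Name:</td><td class="w2p_fw">United Kingdom</td></tr>'

model = train(train_html, {'name': 'Afghanistan'})
print(scrape(test_html, model))  # {'name': 'United Kingdom'}
```

Because the model keys on surrounding markup rather than the values themselves, it generalizes to any page sharing the training page's structure, which is the property that lets Scrapely scrape the United Kingdom page after training only on Afghanistan.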

This workflow allows scraping web pages without needing to know their structure; you only need the desired content for the training case (or multiple training cases). This approach can be particularly useful when the content of a web page is stable but the layout changes. For example, with a news website, the text of a published article will most likely not change, though the layout may be updated. In this case, Scrapely can be retrained using the same data to generate a model for the new structure. For this to work, you would need to store your training data somewhere for reuse.
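One simple way to store training data for later retraining is to keep the training URL and field values in a JSON file. The helper functions and file name below are hypothetical conveniences, not part of the Scrapely API:

```python
import json

# Hypothetical helpers to persist the (url, field-values) pair used to
# train Scrapely, so the model can be rebuilt if the site layout changes.
def save_training_data(path, url, fields):
    with open(path, 'w') as f:
        json.dump({'url': url, 'fields': fields}, f)

def load_training_data(path):
    with open(path) as f:
        record = json.load(f)
    return record['url'], record['fields']

save_training_data(
    'train.json',
    'http://example.webscraping.com/view/Afghanistan-1',
    {'name': 'Afghanistan', 'population': '29,121,286'})

# Later, after a layout change, rebuild the model from the saved data
# (this part requires network access and scrapely installed):
#   from scrapely import Scraper
#   s = Scraper()
#   url, fields = load_training_data('train.json')
#   s.train(url, fields)
```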

The example web page used here to test Scrapely is well structured, with separate tags and attributes for each data type, so Scrapely was able to train a model correctly and easily. For more complex web pages, Scrapely can fail to locate the content correctly; the Scrapely documentation warns that you should "train with caution". As machine learning becomes faster and easier, a more robust automated web scraping library may eventually be released; for now, it is still quite useful to know how to scrape a website directly using the techniques covered throughout this book.
