Waiting for results

The final part of implementing our WebKit crawler is scraping the search results, which turns out to be the most difficult part because it isn't obvious when the AJAX event is complete and the country data is loaded. There are three possible approaches to deal with this conundrum:

  • Wait a set amount of time and hope the AJAX event is complete
  • Override Qt's network manager to track when URL requests are complete
  • Poll the web page for the expected content to appear

The first option is the simplest to implement but is inefficient: a timeout long enough to be safe means the script usually spends most of its time waiting unnecessarily. A fixed timeout can also still fail when the network is slower than usual. The second option is more efficient but cannot be applied when there are client-side delays, for example, when the download is complete but a button needs to be pressed before the content is displayed. The third option is more reliable and straightforward to implement, with the minor drawback of wasting CPU cycles while checking whether the content has loaded. Here is an implementation of the third option:

>>> elements = None 
>>> while not elements:
...     app.processEvents()
...     elements = frame.findAllElements('#results a')
...
>>> countries = [e.toPlainText().strip() for e in elements]
>>> print(countries)
['Afghanistan', 'Aland Islands', ... , 'Zambia', 'Zimbabwe']

Here, the code remains in the while loop until the country links are present in the results div. On each iteration, app.processEvents() is called to give the Qt event loop time to perform tasks, such as responding to click events and updating the GUI. We could additionally add a short sleep, a fraction of a second, to this loop to give the CPU intermittent breaks.
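
The polling loop with a sleep could be sketched as a small reusable helper. This is a minimal sketch and not part of the original example: the name wait_for and its parameters (check, tick, interval, timeout) are hypothetical, and the upper time limit is an extra safeguard added here so the loop cannot spin forever if the content never appears.

```python
import time


def wait_for(check, tick=None, interval=0.1, timeout=10.0):
    """Poll check() until it returns a truthy value or timeout seconds pass.

    wait_for and all of its parameters are hypothetical names used for
    illustration; they do not come from Qt or from the original example.
    """
    deadline = time.monotonic() + timeout
    while True:
        if tick is not None:
            tick()  # for example app.processEvents, so Qt can do its work
        result = check()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('content did not appear within %s seconds' % timeout)
        time.sleep(interval)  # give the CPU a short break between checks
```

Assuming the app and frame objects from the session above, the loop would then reduce to elements = wait_for(lambda: frame.findAllElements('#results a'), tick=app.processEvents). Note that the Qt event loop is not serviced while the script sleeps, so the interval should stay small.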

A full example of the code so far can be found at https://github.com/kjam/wswp/blob/master/code/chp5/pyqt_search.py.
