Selenium

With the WebKit library used in the previous section, we have full control over how the browser renderer behaves. If this level of flexibility is not needed, Selenium is a good alternative that is easier to install and provides an API to automate several popular web browsers. Selenium can be installed using pip with the following command:

pip install selenium

To demonstrate how Selenium works, we will rewrite the previous search example in Selenium. The first step is to create a connection to the web browser:

>>> from selenium import webdriver 
>>> driver = webdriver.Firefox()

When this command is run, an empty browser window will pop up. If you receive an error instead, you likely need to install geckodriver (https://github.com/mozilla/geckodriver/releases) and ensure that it is available on your PATH environment variable.
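One quick way to check whether the driver executable can be found is to look it up on PATH from Python. The following is a minimal sketch using only the standard library; find_driver_binary is a hypothetical helper name, not part of Selenium:

```python
import shutil


def find_driver_binary(name="geckodriver"):
    """Return the full path of the named executable, or None if it is not on PATH."""
    return shutil.which(name)
```

If find_driver_binary() returns None, the directory containing geckodriver has most likely not been added to PATH yet.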

Using a browser you can see and interact with (rather than a Qt widget) is handy because with each command, the browser window can be checked to see if the script worked as expected. Here, we used Firefox, but Selenium also provides interfaces to other common web browsers, such as Chrome and Internet Explorer. Note that you can only use a Selenium interface for a web browser that is installed on your system.
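Switching browsers mostly comes down to instantiating a different class from the webdriver module. The sketch below wraps that choice in a small function; make_driver and the short browser names are our own illustrative conventions, and each browser still needs to be installed along with its matching driver:

```python
# Short names (our own convention) mapped to class names in selenium.webdriver.
SUPPORTED_BROWSERS = {"firefox": "Firefox", "chrome": "Chrome", "ie": "Ie"}


def make_driver(browser="firefox"):
    """Start a session for the named browser and return its driver."""
    key = browser.lower()
    if key not in SUPPORTED_BROWSERS:
        raise ValueError("unsupported browser: %r" % browser)
    # Imported here so that an invalid name is rejected before Selenium is needed.
    from selenium import webdriver
    return getattr(webdriver, SUPPORTED_BROWSERS[key])()
```

Calling make_driver('chrome') would then be equivalent to webdriver.Chrome().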

To see if your system's browser is supported and what other dependencies or drivers you may need to install to use Selenium, check the Selenium documentation on supported platforms: http://www.seleniumhq.org/about/platforms.jsp

To load a web page in the chosen web browser, the get() method is called:

>>> driver.get('http://example.webscraping.com/search') 

Next, to select the search textbox, we can use its ID. Selenium also supports selecting elements with a CSS selector or XPath. Once the search textbox is found, we can enter content with the send_keys() method, which simulates typing:

>>> driver.find_element_by_id('search_term').send_keys('.') 
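Note that the find_element_by_id() helper used here belongs to older Selenium releases. In Selenium 4, the find_element_by_* helpers were deprecated and later removed in favor of find_element() called with a locator strategy and a value. The following sketch shows that call style; type_into is a hypothetical helper, and the strategy strings are written out so that the snippet needs no imports (in real code, you would normally import By from selenium.webdriver.common.by instead):

```python
# Locator strategy strings; these match the values of By.ID, By.CSS_SELECTOR
# and By.XPATH in selenium.webdriver.common.by.
BY_ID = "id"
BY_CSS = "css selector"
BY_XPATH = "xpath"


def type_into(driver, by, value, text):
    """Find one element with the given locator and simulate typing into it."""
    element = driver.find_element(by, value)
    element.send_keys(text)
    return element
```

With a live driver, type_into(driver, BY_ID, 'search_term', '.') should behave like the command above. The same textbox could also be located with the CSS selector '#search_term' or the XPath '//*[@id="search_term"]'.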

To return all results in a single search, we want to set the page size to 1000. However, this is not straightforward because Selenium is designed to interact with the browser, rather than to modify the web page content. To get around this limitation, we can use JavaScript to set the select box content:

>>> js = "document.getElementById('page_size').options[1].text = '1000';" 
>>> driver.execute_script(js)
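If the page size needs to vary, the JavaScript string can be built from Python values rather than hard-coded. This is a minimal sketch; page_size_script is a hypothetical helper, and json.dumps() is used only to produce correctly quoted JavaScript string literals:

```python
import json


def page_size_script(size, option_index=1, select_id="page_size"):
    """Build JavaScript that rewrites the text of one option in a select box."""
    return "document.getElementById({id}).options[{idx}].text = {text};".format(
        id=json.dumps(select_id),    # quoted element ID
        idx=int(option_index),       # which <option> to overwrite
        text=json.dumps(str(size)),  # new option text, as a quoted string
    )
```

The result can be passed straight to driver.execute_script(), for example driver.execute_script(page_size_script(1000)).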

Now that the form inputs are ready, the search button can be clicked to perform the search:

>>> driver.find_element_by_id('search').click() 

We need to wait for the AJAX request to complete before loading the results, which was the hardest part of the script in the previous WebKit implementation. Fortunately, Selenium provides a simple solution to this problem by setting a timeout with the implicitly_wait() method:

>>> driver.implicitly_wait(30) 

Here, a timeout of 30 seconds was set. This is an upper bound rather than a fixed delay: if we now search for elements that are not yet available, Selenium will keep retrying for up to 30 seconds before raising an exception, and will return as soon as the elements appear. Selenium also allows for more detailed polling control using explicit waits (which are well-documented at http://www.seleniumhq.org/docs/04_webdriver_advanced.jsp).
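To make the polling idea concrete, here is a plain-Python sketch of waiting for a condition with a timeout. It illustrates the general technique only and is not Selenium's implementation; wait_until and its parameters are hypothetical names:

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if the timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)
```

The condition could be any callable, such as a lambda that looks up the result links and returns an empty list until they exist. In practice, Selenium's own WebDriverWait class (in selenium.webdriver.support.ui) is the usual way to get this behavior for explicit waits.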

To select the country links, we use the same CSS selector that we used in the WebKit example:

>>> links = driver.find_elements_by_css_selector('#results a') 

Then, the text of each link can be extracted to create a list of countries:

>>> countries = [link.text for link in links] 
>>> print(countries)
['Afghanistan', 'Aland Islands', ... , 'Zambia', 'Zimbabwe']

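Finally, the browser window can be closed by calling the close() method (Selenium also provides quit(), which ends the whole browser session, including any other open windows):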

>>> driver.close() 
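In a script, as opposed to an interactive session, an exception partway through would leave the browser window open. One way to guard against this is a small context manager that always ends the session. This is a sketch under our own naming (browser_session is hypothetical); it calls the driver's quit() method, which ends the whole session rather than closing a single window:

```python
from contextlib import contextmanager


@contextmanager
def browser_session(driver):
    """Yield the driver, then always end its session, even after an error."""
    try:
        yield driver
    finally:
        driver.quit()
```

The scraping steps would then go inside a block such as with browser_session(webdriver.Firefox()) as driver: so that the browser is shut down whether or not the steps succeed.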

The source code for this example is available at https://github.com/kjam/wswp/blob/master/code/chp5/selenium_search.py. For further details about Selenium, the Python bindings are documented at https://selenium-python.readthedocs.org/.
