Simulating web browsing

Corporate websites are usually built by teams or departments using specialized tools and templates. Much of the content is generated on the fly and relies heavily on JavaScript and CSS. This means that even after downloading the content, we still have to at least evaluate the JavaScript code. One way to do this from a Python program is with the Selenium API. Selenium's main purpose is testing websites, but nothing stops us from using it to scrape them.

Instead of scraping a website, we will scrape an IPython Notebook: the test_widget.ipynb file in this book's code bundle. To simulate browsing this page, the test_simulating_browsing.py file provides a unit test class. In case you wondered, this is not the recommended way to test IPython Notebooks.

For historical reasons, I prefer using XPath to find HTML elements. XPath is a query language for XML documents that also works with HTML. It is not the only option; you can also use CSS selectors, tag names, or IDs. To find the right XPath expression, you can install a relevant plugin for your favorite browser or, in Google Chrome for instance, inspect an element and copy its XPath.
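To get a feel for XPath without launching a browser, the following minimal sketch queries a small hand-written HTML fragment with Python's standard xml.etree.ElementTree module. ElementTree supports only a limited XPath subset (attribute predicates, but not functions such as contains()), so the text filter is done in plain Python here. The fragment and the find_links name are made up for illustration. Selenium hands the full XPath expression to the browser, so it does not have this limitation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML fragment (not taken from the notebook).
HTML = """
<div id="menus">
  <ul>
    <li><a data-toggle="tab" href="#fig">Figure</a></li>
    <li><a data-toggle="tab" href="#axes">Axes</a></li>
    <li><a data-toggle="collapse" href="#size">figure.figsize</a></li>
  </ul>
</div>
"""


def find_links(root, toggle, text):
    """Return <a> elements with the given data-toggle whose text contains `text`."""
    # Attribute predicate: part of the XPath subset that ElementTree supports.
    xpath = ".//a[@data-toggle='{0}']".format(toggle)
    # contains(text(), ...) is not supported, so filter the text in Python.
    return [a for a in root.findall(xpath) if text in (a.text or '')]


root = ET.fromstring(HTML)
tabs = find_links(root, 'tab', 'Figure')
print([a.get('href') for a in tabs])  # prints ['#fig']
```

The same two conditions (a data-toggle attribute and a text fragment) are what the XPath expression in the recipe's wait_and_click() method combines in a single query.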

Getting ready

Install Selenium with the following command:

$ pip install selenium

I tested the code with Selenium 2.47.1.

How to do it…

The following steps show you how to simulate web browsing using an IPython widget that I made. The code for this recipe is in the test_simulating_browsing.py file in this book's code bundle:

  1. The first step is to start the notebook server (from the directory that contains test_widget.ipynb, so that the URL used later resolves):
    $ ipython notebook
    
  2. The imports are as follows:
    from selenium import webdriver
    import time
    import unittest
    import dautil as dl
    
    NAP_SECS = 10
  3. Define the following test class; its setUp() method creates a Firefox browser instance:
    class SeleniumTest(unittest.TestCase):
        def setUp(self):
            self.logger = dl.log_api.conf_logger(__name__)
            self.browser = webdriver.Firefox()
  4. Define the following method to clean up when the test is done:
        def tearDown(self):
            self.browser.quit()
  5. The following method clicks on the widget tabs (we have to wait for the user interface to respond):
        def wait_and_click(self, toggle, text):
            xpath = "//a[@data-toggle='{0}' and contains(text(), '{1}')]"
            xpath = xpath.format(toggle, text)
            elem = dl.web.wait_browser(self.browser, xpath)
            elem.click()
  6. Define the following method, which runs the test: it evaluates the notebook cells and clicks on a couple of tabs in the IPython widget (we use port 8888):
        def test_widget(self):
            self.browser.implicitly_wait(NAP_SECS)
            self.browser.get('http://localhost:8888/notebooks/test_widget.ipynb')
    
            try:
                # Cell menu
                xpath = '//*[@id="menus"]/div/div/ul/li[5]/a'
                link = dl.web.wait_browser(self.browser, xpath)
                link.click()
                time.sleep(1)
    
                # Run all
                xpath = '//*[@id="run_all_cells"]/a'
                link = dl.web.wait_browser(self.browser, xpath)
                link.click()
                time.sleep(1)
    
                self.wait_and_click('tab', 'Figure')
                self.wait_and_click('collapse', 'figure.figsize')
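            except Exception:
                self.logger.warning('Error while waiting to click', exc_info=True)
                # Re-raise so the test fails; tearDown() closes the browser.
                # Quitting here would break the screenshot call below.
                raise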
    
            time.sleep(NAP_SECS)
            self.browser.save_screenshot('widgets_screenshot.png')
    
    if __name__ == "__main__":
        unittest.main()

The following screenshot is created by the code:

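The recipe relies on dl.web.wait_browser() to pause until an element is available. Its implementation is not shown here, so the following is only a rough, browser-free sketch of the general idea behind such a wait: poll a lookup function until it returns something or a timeout expires. The wait_for name, its parameters, and the fake lookup are hypothetical and not part of dautil or Selenium. With a real browser, the lookup would be a callable that asks the browser for the element.

```python
import time


def wait_for(lookup, timeout=10.0, poll=0.5):
    """Call lookup() repeatedly until it returns a truthy value.

    Hypothetical helper: raises TimeoutError if nothing is found in time.
    """
    deadline = time.monotonic() + timeout

    while True:
        try:
            result = lookup()
        except LookupError:
            result = None  # treat "not found yet" like an empty result

        if result:
            return result

        if time.monotonic() >= deadline:
            raise TimeoutError('Gave up after {0} seconds'.format(timeout))

        time.sleep(poll)


# Fake lookup that only succeeds on the third attempt, standing in for a
# page element that appears after the user interface has finished loading.
attempts = []


def fake_lookup():
    attempts.append(1)
    return 'element' if len(attempts) >= 3 else None


found = wait_for(fake_lookup, timeout=2.0, poll=0.01)
print(found, len(attempts))  # prints: element 3
```

Polling with a deadline is usually preferable to a fixed time.sleep() call, because the code continues as soon as the element shows up instead of always waiting for the worst case.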

See also
