Corporate websites are usually built by teams or departments using specialized tools and templates. Much of the content is generated on the fly, and a large part of each page consists of JavaScript and CSS. This means that even after we download the content, we still have to evaluate the JavaScript code at the very least. One way to do this from a Python program is to use the Selenium API. Selenium's main purpose is testing websites, but nothing stops us from using it to scrape them.
Instead of scraping a website, we will scrape an IPython Notebook: the test_widget.ipynb file in this book's code bundle. To simulate browsing this web page, we provided a unit test class in test_simulating_browsing.py. In case you wondered, this is not the recommended way to test IPython Notebooks.
For historical reasons, I prefer using XPath to find HTML elements. XPath is a query language that also works with HTML. It is not the only option; you can also use CSS selectors, tag names, or IDs. To find the right XPath expression, you can either install a relevant plugin for your favorite browser or use the browser's built-in tools; in Google Chrome, for instance, you can inspect an element and copy its XPath.
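To get a feel for XPath without starting a browser, here is a minimal sketch that uses the standard library's xml.etree.ElementTree module, which supports only a limited subset of XPath. The markup in SNIPPET is hypothetical; I made it up to loosely resemble the tab links that the recipe clicks, and it is not taken from the notebook page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup (an assumption, not the real notebook page).
SNIPPET = """
<div id="menus">
  <ul>
    <li><a data-toggle="tab" href="#fig">Figure</a></li>
    <li><a data-toggle="tab" href="#axes">Axes</a></li>
    <li><a data-toggle="collapse" href="#size">figure.figsize</a></li>
  </ul>
</div>
"""

root = ET.fromstring(SNIPPET)

# Attribute predicate: every <a> whose data-toggle attribute equals 'tab'.
tabs = root.findall(".//a[@data-toggle='tab']")
tab_labels = [a.text for a in tabs]

# ElementTree does not implement contains(text(), ...), so the text
# condition is applied in Python instead of inside the expression.
figure_targets = [a.get('href') for a in tabs if 'Figure' in (a.text or '')]

# Selecting by tag name alone, for comparison with the XPath queries.
all_links = list(root.iter('a'))

print(tab_labels)
print(figure_targets)
print(len(all_links))
```

A real browser driven by Selenium evaluates full XPath 1.0, so there the text condition can stay inside the expression, as in contains(text(), 'Figure').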
Install Selenium with the following command:
$ pip install selenium
I tested the code with Selenium 2.47.1.
The following steps show you how to simulate web browsing using an IPython widget that I made. The code for this recipe is in the test_simulating_browsing.py file in this book's code bundle:
$ ipython notebook
from selenium import webdriver
import time
import unittest
import dautil as dl

# Seconds to wait for elements to appear and for the widget to render
NAP_SECS = 10


class SeleniumTest(unittest.TestCase):
    def setUp(self):
        self.logger = dl.log_api.conf_logger(__name__)
        self.browser = webdriver.Firefox()

    def tearDown(self):
        self.browser.quit()

    def wait_and_click(self, toggle, text):
        # Match a link by its data-toggle attribute and part of its text
        xpath = "//a[@data-toggle='{0}' and contains(text(), '{1}')]"
        xpath = xpath.format(toggle, text)
        elem = dl.web.wait_browser(self.browser, xpath)
        elem.click()

    def test_widget(self):
        self.browser.implicitly_wait(NAP_SECS)
        self.browser.get('http://localhost:8888/notebooks/test_widget.ipynb')

        try:
            # Cell menu
            xpath = '//*[@id="menus"]/div/div/ul/li[5]/a'
            link = dl.web.wait_browser(self.browser, xpath)
            link.click()
            time.sleep(1)

            # Run all
            xpath = '//*[@id="run_all_cells"]/a'
            link = dl.web.wait_browser(self.browser, xpath)
            link.click()
            time.sleep(1)

            self.wait_and_click('tab', 'Figure')
            self.wait_and_click('collapse', 'figure.figsize')
        except Exception:
            # Do not quit here: tearDown() closes the browser, and the
            # screenshot below still needs a live session.
            self.logger.warning('Error while waiting to click', exc_info=True)

        # Give the widget time to render before taking the screenshot
        time.sleep(NAP_SECS)
        self.browser.save_screenshot('widgets_screenshot.png')

if __name__ == "__main__":
    unittest.main()
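The test relies on dl.web.wait_browser() to wait until an element is available before clicking it. I do not reproduce that function here. The following is only a rough sketch of the general idea of an explicit wait: poll a lookup until it succeeds or a timeout expires. The names wait_for and fake_lookup are hypothetical, and a stand-in function replaces the browser so that the sketch runs without Selenium.

```python
import time


def wait_for(find, timeout=10.0, poll=0.5):
    """Call find() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = find()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('nothing found within {0} s'.format(timeout))
        time.sleep(poll)


# Stand-in for a browser lookup (an assumption for illustration only):
# the "element" becomes available on the third attempt.
attempts = []


def fake_lookup():
    attempts.append(1)
    return 'element' if len(attempts) >= 3 else None


found = wait_for(fake_lookup, timeout=2.0, poll=0.01)
print(found, len(attempts))
```

With a real browser, the find argument could be something like lambda: browser.find_elements('xpath', xpath), which returns an empty list while nothing matches. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose, which is probably the better choice in practice.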