Selenium and Headless Browsers

Although it's convenient and fairly easy to install and use Selenium with common browsers, doing so can present problems when running scripts on servers. On servers, it's more common to use headless browsers, which also tend to be faster and more configurable than fully-featured web browsers.

The most popular headless browser at the time of this publication is PhantomJS, which runs its own WebKit-based engine and is scripted with JavaScript. PhantomJS can be installed easily on most servers, and can be installed locally by following the latest download instructions (http://phantomjs.org/download.html).

Using PhantomJS with Selenium merely requires a different initialization:

>>> from selenium import webdriver
>>> driver = webdriver.PhantomJS()
>>> # If this raises an error, pass the path to the PhantomJS executable,
>>> # e.g. webdriver.PhantomJS('/Downloads/pjs')

The first difference you'll notice is that no browser window opens, but a PhantomJS instance is now running. To test the code, we can visit a page and take a screenshot.

>>> driver.get('http://python.org')
>>> driver.save_screenshot('../data/python_website.png')
True

Now, if you open the saved PNG file, you can see what the PhantomJS browser has rendered.

You'll notice the rendered page is a long, narrow window. You can change this by calling maximize_window or by setting an explicit size with set_window_size, both of which are documented in the Selenium Python documentation on the WebDriver API.
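For example, here is a minimal sketch of resizing the window before taking another screenshot (the dimensions and file path are illustrative):

>>> driver.set_window_size(1024, 768)  # width and height in pixels
>>> driver.save_screenshot('../data/python_website_resized.png')
True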

Screenshots are great for debugging any Selenium issues you run into, even when you are using Selenium with a real browser, since a script may fail because of a slow-loading page or changes to the page's structure or JavaScript. Having a screenshot of the page exactly as it was at the time of the error can be very helpful. Additionally, you can use the driver's page_source attribute to save or inspect the current page source.
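For example, to save the current page source to a file for later inspection (the output path here is illustrative):

>>> html = driver.page_source
>>> with open('../data/python_website.html', 'w') as f:
...     f.write(html)
...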

Another reason to use a browser-based tool like Selenium is that it makes it harder for sites to detect that you are a scraper. Some sites use scraper-avoidance techniques such as honeypots, where a page includes a hidden toxic link that gets your scraper banned if your script clicks it. Because of its browser-based architecture, Selenium handles these problems well: if you cannot see or click a link in the browser, you also cannot interact with it via Selenium. Additionally, your request headers will reflect whichever browser you are using, and you'll have access to normal browser features like cookies and sessions, as well as loading images and interactive elements, which are sometimes required to load particular forms or pages. If your scraper must interact with the page and seem "human-like", Selenium is a great choice.
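As a sketch, you can guard clicks with an explicit visibility check so that hidden honeypot links are never followed (the CSS selector here is hypothetical):

>>> link = driver.find_element_by_css_selector('a.next-page')
>>> if link.is_displayed():  # only click links a human could actually see
...     link.click()
...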
