Analyzing a web page

To understand how a web page is structured, we can examine its source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option.

For our example website, the data we are interested in is found on the country pages. Take a look at the page source (via the browser menu or the right-click context menu). In the source for the example page for the United Kingdom (http://example.webscraping.com/view/United-Kingdom-239), you will find a table containing the country data (you can use search to find it in the page source):

<table> 
<tr id="places_national_flag__row"><td class="w2p_fl"><label for="places_national_flag" id="places_national_flag__label">National Flag:</label></td>
<td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td><td class="w2p_fc"></td></tr>
...
<tr id="places_neighbours__row"><td class="w2p_fl"><label for="places_neighbours" id="places_neighbours__label">Neighbours: </label></td><td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td><td class="w2p_fc"></td></tr></table>

The lack of white space and formatting is not an issue for a web browser, but it makes the source difficult for us to read. To help us interpret this table, we can use the browser's developer tools. You can usually open them by right-clicking on the page and selecting an option such as Developer Tools. The available tools vary from browser to browser, but nearly every browser has a tab titled Elements or HTML. In Chrome and Firefox, you can right-click on an element on the page (the one you are interested in scraping) and select Inspect Element. In Internet Explorer, you need to open the Developer toolbar by pressing F12; you can then select items by pressing Ctrl + B. If you use a different browser without built-in developer tools, you may want to try the Firebug Lite extension, which is available for most web browsers (https://getfirebug.com/firebuglite).

When I right-click on the table on the page and click Inspect Element in Chrome, a panel opens showing the surrounding HTML hierarchy of the selected element.

In this panel, I can see that the table element sits inside a form element. I can also see that the attributes for the country are held in tr (table row) elements with different id attributes (for example, id="places_national_flag__row"). Depending on your browser, the coloring or layout might be different, but you should be able to click on the elements and navigate through the hierarchy to see the data on the page. If I expand the tr elements further by clicking on the arrows next to them, I notice that the data for each of these rows is contained in a <td> element of class w2p_fw, which is a child of the <tr> element.
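If you prefer to check the same hierarchy from a script, the nesting can be printed as an indented outline. The following is a minimal sketch that uses only the html.parser module from Python's standard library. The OutlinePrinter class, the outline function, and the SAMPLE string are names made up for this illustration, and SAMPLE contains only the Neighbours row from the table excerpt above, not the full page.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not change the depth.
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class OutlinePrinter(HTMLParser):
    """Collect one indented line per opening tag, labelled by id or class."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = tag
        if attrs.get("id"):
            label += "#" + attrs["id"]
        elif attrs.get("class"):
            label += "." + attrs["class"]
        self.lines.append("  " * self.depth + label)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical sample: the Neighbours row from the excerpt, wrapped in a table.
SAMPLE = (
    '<table>'
    '<tr id="places_neighbours__row">'
    '<td class="w2p_fl"><label for="places_neighbours" '
    'id="places_neighbours__label">Neighbours: </label></td>'
    '<td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td>'
    '<td class="w2p_fc"></td></tr></table>'
)


def outline(html):
    """Return the tag hierarchy of html as a list of indented lines."""
    parser = OutlinePrinter()
    parser.feed(html)
    parser.close()
    return parser.lines


if __name__ == "__main__":
    print("\n".join(outline(SAMPLE)))
    # Prints:
    # table
    #   tr#places_neighbours__row
    #     td.w2p_fl
    #       label#places_neighbours__label
    #     td.w2p_fw
    #       div
    #         a
    #     td.w2p_fc
```

The outline makes the relationship visible at a glance: the td of class w2p_fw is a direct child of the tr, and the value we want sits inside it. The sketch assumes well-formed markup in which every non-void element is explicitly closed.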

Now that we have investigated the page with our browser tools, we know the HTML hierarchy of the country data table, and have the necessary information to scrape that data from the page.
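To make this concrete, here is a minimal sketch of how the hierarchy we just identified (a tr with an id, containing a td of class w2p_fw) could be used to pull a value out of the markup. It relies only on the html.parser module from Python's standard library and is not necessarily the most convenient approach. The names RowValueParser, extract_values, and SAMPLE are hypothetical, and SAMPLE holds only the Neighbours row from the excerpt above. The sketch also assumes the table contains no nested tables.

```python
from html.parser import HTMLParser


class RowValueParser(HTMLParser):
    """Collect the text of td.w2p_fw cells, keyed by the id of the parent tr."""

    def __init__(self):
        super().__init__()
        self.row_id = None     # id of the <tr> that is currently open
        self.in_value = False  # True while inside a <td class="w2p_fw">
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.row_id = attrs.get("id")
        elif tag == "td" and self.row_id is not None:
            classes = (attrs.get("class") or "").split()
            if "w2p_fw" in classes:
                self.in_value = True
                self.values.setdefault(self.row_id, "")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_value = False
        elif tag == "tr":
            self.row_id = None

    def handle_data(self, data):
        # Text inside nested tags (such as the <a> link) is collected too.
        if self.in_value:
            self.values[self.row_id] += data


def extract_values(html):
    """Return a dict mapping each row id to the stripped text of its value cell."""
    parser = RowValueParser()
    parser.feed(html)
    parser.close()
    return {row_id: text.strip() for row_id, text in parser.values.items()}


# Hypothetical sample: the Neighbours row from the excerpt, wrapped in a table.
SAMPLE = (
    '<table>'
    '<tr id="places_neighbours__row">'
    '<td class="w2p_fl"><label for="places_neighbours" '
    'id="places_neighbours__label">Neighbours: </label></td>'
    '<td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td>'
    '<td class="w2p_fc"></td></tr></table>'
)

if __name__ == "__main__":
    print(extract_values(SAMPLE))  # {'places_neighbours__row': 'IE'}
```

Keying the result by the row id mirrors what we saw in the developer tools: the id identifies which attribute a row holds, and the w2p_fw cell holds its value. A row whose value cell contains only an image, such as the national flag, would yield an empty string with this approach.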
