Annotation

At the Portia start page, you are prompted to enter a project name. Once you have entered it, a textbox appears for the URL of the website you want to scrape, such as http://example.webscraping.com.

Once you have entered the URL, Portia loads the project view:

Once you click the New Spider button, you will see the following Spider view:

You will recognize some of the fields from the Scrapy spider we built earlier in this chapter (such as start pages and link crawling rules). By default, the spider name is set to the domain (example.webscraping.com), which can be modified by clicking on the labels.
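As a reminder of what link crawling rules do, here is a minimal sketch using only the standard library. The patterns and the should_follow helper are hypothetical and chosen for illustration; they are not taken from Portia's settings or from the spider built earlier.

```python
import re

# Hypothetical patterns for illustration only: follow index and
# country pages, never follow user (login/register) pages.
FOLLOW_PATTERNS = [r"/index/", r"/view/"]
EXCLUDE_PATTERNS = [r"/user/"]


def should_follow(url, follow=FOLLOW_PATTERNS, exclude=EXCLUDE_PATTERNS):
    """Return True if url matches a follow pattern and no exclude pattern."""
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return any(re.search(pattern, url) for pattern in follow)


# Made-up links of the kind a crawler might find on a page.
links = [
    "http://example.webscraping.com/places/default/index/1",
    "http://example.webscraping.com/places/default/view/Brazil-32",
    "http://example.webscraping.com/places/default/user/login",
]
to_crawl = [url for url in links if should_follow(url)]
```

Checking the exclude patterns first means an excluded link is dropped even if it also happens to match a follow pattern.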

Next, click on the "New Sample" button to start collecting data from the page:

Now when you roll over the different elements of the page, you will see them highlighted. You can also see the CSS selector in the Inspector tab to the right of the website area.
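To make the idea of a CSS selector concrete, the following sketch collects the text of every element that matches a simple tag.class selector, using only Python's standard library. The SelectorText class, the markup, the class names, and the values are all invented for this example. A real project would more likely use a full selector library, and the selector Portia displays may be more specific.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SelectorText(HTMLParser):
    """Collect the text inside elements matching a simple 'tag.class' selector."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.results = []   # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1  # nested element inside a match
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1
                self.results.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


# Made-up sample markup; the class names and values are hypothetical.
html = """
<table>
  <tr><td class="label">Country:</td><td class="value">Exampleland</td></tr>
  <tr><td class="label">Population:</td><td class="value">1,234,567</td></tr>
</table>
"""

parser = SelectorText("td", "value")  # equivalent to the selector td.value
parser.feed(html)
print(parser.results)
```

The selector td.value matches both value cells here, which is why a more specific selector (for example, one that also names the row) is usually needed to pick out a single field.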

Because we want to scrape the population elements on the individual country pages, we first need to navigate from this homepage to one of those pages. To do so, click "Close Sample" and then click on any country. When the country page loads, we can once again click "New Sample".

To start adding fields to our items for extraction, we can click on the population field. When we do, an item is added and we can see the extracted information:


We can rename the field by typing the new name, "population", into the text field on the left. Then, we can click the "Add Field" button. To add more fields, such as the country name, first click on the large + button and then select the field values in the same way. The annotated fields are highlighted in the web page, and the extracted data appears in the extracted items section.

If you want to delete any fields, simply use the red - sign next to the field name. When the annotations are complete, click on the blue "Close Sample" button at the top. If you then want to download the spider to run in a Scrapy project, you can do so by clicking the link next to the spider name:



You can also see all of your spiders and the settings in the mounted folder ~/portia_projects.
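One way to inspect that folder from Python is sketched below. It assumes the spiders and settings are saved as JSON files somewhere under the mounted folder; that layout is an assumption, so adjust the pattern to match what you find on disk. The list_project_files helper is a hypothetical name.

```python
from pathlib import Path


def list_project_files(root, pattern="*.json"):
    """Return the sorted paths (relative to root) of files matching pattern."""
    root = Path(root).expanduser()
    if not root.is_dir():
        return []  # folder not mounted or not created yet
    return sorted(p.relative_to(root).as_posix() for p in root.rglob(pattern))


# Prints an empty list if the folder does not exist on this machine.
for name in list_project_files("~/portia_projects"):
    print(name)
```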
