Item pipelines and export formats

Item pipelines can be thought of as the channels, or pipes, that items travel through. They are Scrapy components that receive the Items previously extracted and processed by a spider. Each pipeline is a class with a simple objective: to process further the item it receives, either rejecting it for some reason or letting it pass on through the channel.

The typical uses of pipelines are as follows:

  • Cleaning HTML data
  • Validating scraped data by checking that the items contain certain fields
  • Checking for duplicate items
  • Storing the data in a database

Each item that is obtained is sent to the corresponding pipeline, which processes it either to save it in the database or to send it on to another pipeline. For details, see the official documentation: https://doc.scrapy.org/en/latest/topics/item-pipeline.html.
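As an illustration of the duplicate-checking use case listed above, here is a minimal sketch of a pipeline that drops items it has already seen. The class name DuplicatesPipeline and the 'id' field are assumptions made for this example, not names from the project. The fallback DropItem class is only there so that the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed (assumption).
    class DropItem(Exception):
        pass


class DuplicatesPipeline:
    """Hypothetical pipeline that rejects items whose 'id' was already seen."""

    def __init__(self):
        # Identifiers of the items that have passed through this pipeline.
        self.seen_ids = set()

    def process_item(self, item, spider):
        item_id = item.get("id")  # 'id' is an assumed field name
        if item_id in self.seen_ids:
            raise DropItem("Duplicate item found: %r" % (item_id,))
        self.seen_ids.add(item_id)
        return item
```

The set lives on the pipeline instance, so duplicates are detected only within a single crawl; a persistent store would be needed to detect them across runs.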

An item pipeline is a Python class that overrides some specific methods and needs to be activated in the settings of the Scrapy project. When you create a Scrapy project with the scrapy startproject myproject command, you'll find a pipelines.py file already available for creating your own pipelines. It isn't mandatory to create your pipelines in this file, but it is good practice. We'll explain how to create a pipeline using the pipelines.py file.

These objects are Python classes that must implement the process_item(item, spider) method. This method must either return an item object (or an instance of a subclass) or raise a DropItem exception to indicate that the item will not be processed any further. An example of this component is as follows:

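#!/usr/bin/python
# -*- coding: utf-8 -*-
from scrapy.exceptions import DropItem


class MyPipeline(object):
    def process_item(self, item, spider):
        # Let the item through only if it has a non-empty 'key' field.
        # item.get() avoids a KeyError when the field is missing altogether.
        if item.get('key'):
            return item
        else:
            raise DropItem("Missing or empty 'key' in %s" % item)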
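The database-storage use case mentioned earlier can be sketched in the same way. Besides process_item, a pipeline may also define the optional open_spider and close_spider methods, which Scrapy calls when the spider starts and finishes. The example below is a minimal sketch under stated assumptions: the class name SQLitePipeline, the db_path parameter, the items table, and the 'key' field are all hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores each item's 'key' field in SQLite."""

    def __init__(self, db_path=":memory:"):
        # db_path is an assumed parameter; ":memory:" keeps the sketch
        # self-contained by using an in-memory database.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider is opened: set up the connection.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS items (key TEXT)")

    def close_spider(self, spider):
        # Called once when the spider is closed: flush and release.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO items (key) VALUES (?)", (item.get("key"),)
        )
        # Return the item so that any later pipeline still receives it.
        return item
```

Returning the item after storing it keeps the pipeline composable: storage becomes one step in the chain rather than its end.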

One more point to keep in mind is that a pipeline does nothing until it is activated in the project settings. Go to your settings.py file and search for (or add) the ITEM_PIPELINES variable. Update it with the path to your pipeline class and an integer that sets its order relative to other pipelines; pipelines with lower values run first:

ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}
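To make the role of the integer values concrete, the sketch below registers two pipelines and computes the order in which items would pass through them. The second entry, DuplicatesPipeline, is a hypothetical class added only for illustration, and the sorted() call is not part of Scrapy's configuration; it simply mimics the rule that lower values run first. By convention these values are chosen in the 0 to 1000 range.

```python
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
    # Hypothetical second pipeline; it runs after MyPipeline because 400 > 300.
    'myproject.pipelines.DuplicatesPipeline': 400,
}

# Illustration only: items pass through pipelines in ascending order of value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(path)
```

Leaving gaps between the values (300, 400, and so on) makes it easy to slot another pipeline in between later without renumbering the rest.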