Web content extraction

Among the techniques available to extract content from the web, we can highlight the following:

  • Screen scraping: A technique for obtaining information from what is displayed on the screen, for example by capturing screen output and recording the user's keystrokes and interactions.
  • Web scraping: The aim is to obtain a resource, such as a web page in HTML, and process its contents to extract the relevant data (a minimal sketch follows this list).
  • Report mining: A technique that also extracts information, but in this case from a file (HTML, RDF, CSV, and so on). This approach provides a simple and fast mechanism without the need to write an API. Its main characteristic is that no connection is required, since the information can be extracted offline from a file without using any API. This simplifies analysis, avoids excessive use of equipment and computing time, and increases efficiency and speed when prototyping and developing customized reports.
  • Spiders: Scripts that follow specific rules to move around a website and gather information, imitating the interactions a user would perform with the site. The idea is that developers only need to write the rules for managing the data, leaving automated tools such as Scrapy to fetch the contents of the website for them (a minimal spider is sketched at the end of this section).
  • Crawlers: Processes that automatically parse and extract content from a website and provide that content to search engine providers for building their page indexes.
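
As a minimal sketch of the web scraping idea, the following fetches a page's HTML and processes it to extract data (here, the title and the links). It assumes the third-party requests and BeautifulSoup (bs4) libraries are installed; the URL is only an illustrative placeholder:

    import requests
    from bs4 import BeautifulSoup

    # Fetch the raw HTML of the resource (placeholder URL)
    response = requests.get("https://example.com")
    response.raise_for_status()

    # Parse the HTML and extract the relevant data
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.title.string if soup.title else None
    links = [a["href"] for a in soup.find_all("a", href=True)]

    print(title)
    print(links)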

In this chapter, we will focus on web scraping and spiders, techniques that allow data to be collected or extracted from web pages automatically. These are very active and developing fields that share objectives with the semantic web, automatic text processing, artificial intelligence, and human-computer interaction.
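
To illustrate the spider technique referenced in the list above, here is a minimal Scrapy spider sketch. The developer writes only the rules for managing the data (the parse method), while Scrapy handles fetching the pages. The target site (quotes.toscrape.com, a public scraping sandbox) and the CSS selectors are assumptions chosen for illustration, not part of this chapter's examples:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # Every spider needs a unique name; Scrapy uses it to identify the spider
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # The rules for managing the data: extract each quote on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the pagination link, imitating a user clicking "Next"
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Saved as a single file (for example, quotes_spider.py), such a spider can be run without creating a full Scrapy project using scrapy runspider quotes_spider.py -o quotes.json, which writes the gathered items to a JSON file.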
