"Humanizing" methods for Web Scraping

Some sites detect web scrapers by watching for particular behaviors. In Chapter 5, Dynamic Content, we covered how to avoid honeypots by not clicking on hidden links. Here are a few other tips for appearing more like a human while scraping content online.

  • Utilize Headers: Most of the scraping libraries we have covered can alter the headers of your requests, allowing you to modify things like User-Agent, Referer, Host, and Connection. Also, when using browser-based scrapers like Selenium, your scraper will look like a normal browser with normal headers. You can always take a look at what headers your browser is sending by opening your browser's developer tools and viewing one of the recent requests in the Network tab. This should give you a good idea of what headers the site is expecting; the first sketch after this list shows how to set them.
  • Add Delays: Some scraper detection techniques use timing to determine whether a form was filled out too quickly or links were clicked too soon after page load. To appear more "human-like", add reasonable delays when interacting with forms, or use sleep to add delays between requests. This is also the polite way to scrape a site so as not to overload the server; the second sketch after this list shows one way to add such delays.
  • Use Sessions and Cookies: As we have covered in this chapter, using sessions and cookies will help your scraper navigate a site more easily and make it appear more like a normal browser. By saving sessions and cookies locally, you can pick up where you left off and resume scraping with saved data; the final sketch after this list shows one way to do this.
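
As a rough illustration of the first tip, here is a minimal sketch using the requests library. The User-Agent string and the example.com URLs are placeholders, not required values; substitute headers copied from a recent request in your own browser's Network tab.

    import requests

    # Placeholder header values copied from a typical desktop browser.
    headers = {
        "User-Agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Referer": "https://www.example.com/",
        # requests sets Host automatically; it is shown here only because
        # the tip above mentions it as a header you can control.
        "Host": "www.example.com",
        "Connection": "keep-alive",
    }

    response = requests.get("https://www.example.com/page", headers=headers)
    print(response.status_code)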
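To make the delay tip concrete, here is a small sketch, again using requests. The URL list and the two-to-five-second range are arbitrary assumptions, not recommended values for any particular site; a randomized pause looks less mechanical than a fixed one.

    import random
    import time

    import requests

    # Placeholder URLs standing in for whatever pages you are scraping.
    urls = [
        "https://www.example.com/page/1",
        "https://www.example.com/page/2",
    ]

    for url in urls:
        response = requests.get(url)
        # ... parse and store response.text here ...
        # Sleep a random 2-5 seconds so requests are not perfectly periodic.
        time.sleep(random.uniform(2, 5))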
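Finally, a sketch of saving and restoring cookies locally using a requests Session and the standard-library pickle module. The cookies.pkl filename and the example.com URL are arbitrary choices for illustration.

    import pickle

    import requests

    # First run: browse with a Session so cookies accumulate automatically.
    session = requests.Session()
    session.get("https://www.example.com/")  # placeholder URL

    # Save the cookie jar so a later run can pick up where this one stopped.
    with open("cookies.pkl", "wb") as f:
        pickle.dump(session.cookies, f)

    # Later run: load the saved cookies into a fresh Session before scraping.
    session = requests.Session()
    with open("cookies.pkl", "rb") as f:
        session.cookies.update(pickle.load(f))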