Some sites detect web scrapers by watching for particular behaviors. In Chapter 5, Dynamic Content, we covered how to avoid honeypots by not clicking hidden links. Here are a few other tips for appearing more human while scraping content online.
- Utilize Headers: Most of the scraping libraries we have covered let you alter the headers of your requests, so you can modify fields such as User-Agent, Referer, Host, and Connection. When you use a browser-based scraper like Selenium, your scraper will already look like a normal browser sending normal headers. You can always see which headers your own browser sends by opening the developer tools and inspecting a recent request in the Network tab; this gives you a good idea of what headers the site expects.
- Add Delays: Some scraper detection techniques use timing to determine whether a form was filled out too quickly or a link was clicked too soon after page load. To appear more "human-like", add reasonable delays when interacting with forms, and sleep between requests. This is also the polite way to scrape a site, since it avoids overloading the server.
- Use Sessions and Cookies: As we have covered in this chapter, using sessions and cookies helps your scraper navigate the site more easily and makes it look more like a normal browser. By saving sessions and cookies locally, you can pick up a session where you left off and resume scraping with the saved data.
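The headers tip above can be sketched with the `requests` library. This is a minimal example, not a definitive recipe: the header values are illustrative samples copied in the style of a real browser's Network tab, and `https://example.com/page` is a hypothetical URL.

```python
import requests

# A session applies these headers to every request it makes.
session = requests.Session()
session.headers.update({
    # An example User-Agent string; copy the one your own browser sends.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    # Note the header name is "Referer" (the HTTP spec's spelling).
    "Referer": "https://example.com/",
    "Connection": "keep-alive",
})

# session.get("https://example.com/page")  # hypothetical URL; the request
# now carries browser-like headers instead of the library defaults.
```

Using a `Session` rather than passing `headers=` to each call keeps every request consistent, which is itself less suspicious than headers that change between requests.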
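For the delay tip, one simple sketch is a helper that sleeps a random interval between requests, so the timing is not machine-regular. The function name and the 2-5 second defaults are assumptions, not values from this chapter; tune them to the site you are scraping.

```python
import random
import time

def human_delay(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random interval so request timing looks less robotic.

    Returns the delay actually used, which is handy for logging.
    """
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay

# Typical usage between page fetches (fetch_page is a hypothetical helper):
# for url in urls:
#     page = fetch_page(url)
#     human_delay()
```

A randomized delay is slightly more convincing than a fixed `time.sleep(3)`, since perfectly even spacing is itself a bot signature.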
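Finally, the sessions-and-cookies tip can be sketched by persisting a `requests` cookie jar to disk with `pickle`. The file name `cookies.pkl` and the helper names are assumptions for illustration; any serialization scheme that survives restarts would do.

```python
import pickle
import requests

COOKIE_FILE = "cookies.pkl"  # hypothetical local path

def save_cookies(session, path=COOKIE_FILE):
    """Persist the session's cookies so scraping can resume later."""
    with open(path, "wb") as f:
        pickle.dump(session.cookies, f)

def load_cookies(session, path=COOKIE_FILE):
    """Restore previously saved cookies into a (possibly new) session."""
    with open(path, "rb") as f:
        session.cookies.update(pickle.load(f))
    return session

# Typical flow: log in once, save_cookies(session); on the next run,
# load_cookies(requests.Session()) and continue where you left off.
```

Reloading cookies avoids re-running login flows on every run, which both speeds up your scraper and generates less suspicious traffic.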