Front Matter

Gábor László Hajba

Website Scraping with Python: Using BeautifulSoup and Scrapy

Gábor László Hajba

Sopron, Hungary

ISBN 978-1-4842-3924-7

e-ISBN 978-1-4842-3925-4

https://doi.org/10.1007/978-1-4842-3925-4

Library of Congress Control Number: 2018957273

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To those who are restless, like me, and always want to learn something new.

Introduction

Welcome to our journey together exploring website scraping solutions using the Python programming language!

As the title already tells you, this book is about website scraping with Python. I distilled my knowledge into this book to give you a useful manual if you want to start data gathering from websites.

Website scraping is (in my opinion) an emerging topic.

I expect you to have Python programming knowledge, so I won’t explain every code block I write or every construct I use. This also means you’re free to do things differently: every programmer has their own coding style, and your code may look different from mine.

This book is split into six chapters:

1. Getting Started gets you started with this book: you learn what website scraping is and why it is worth writing a book about this topic.
2. Enter the Requirements introduces the requirements we will use to implement website scrapers in the follow-up chapters.
3. Using Beautiful Soup introduces you to Beautiful Soup, an HTML content parser that you can use to write website scraper scripts. We will implement a scraper to gather the requirements of Chapter 2 using Beautiful Soup.
4. Using Scrapy introduces you to Scrapy, the (in my opinion) best website scraping toolbox available for the Python programming language. We will use Scrapy to implement a website scraper to gather the requirements of Chapter 2.
5. Handling JavaScript shows you options for dealing with websites that use JavaScript to load data dynamically and thereby give users a better experience. Unfortunately, this makes basic website scraping torture, but there are options that you can rely on.
6. Website Scraping in the Cloud moves your scrapers from running locally on your computer to remote computers in the Cloud. I’ll show you free and paid providers where you can deploy your spiders and automate the scraping schedules.

You can read this book from cover to cover if you want to learn the different approaches to website scraping with Python. If you’re interested only in a specific topic, like Scrapy for example, you can jump straight to Chapter 4, although I recommend reading Chapter 2 first because it describes the data gathering task we will implement throughout most of the book.

Acknowledgments

Many people have contributed to what is good in this book. Remaining errors and problems are the author’s alone.

Thanks to Apress for making this book happen. Without them, I’d have never considered approaching a publisher with my book idea.

Thanks to the editors, especially Jill Balzano and James Markham. Their advice made this book much better.

Thanks to Chaim Krause, who pointed out missing technical information that may be obvious to me but not to the readers.

Last but not least, a big thank you to my wife, Ágnes, for enduring the time invested in this book.

I hope this book will be a good resource to get your own website scraping projects started!

Chapter 1: Getting Started

Website Scraping

Projects for Website Scraping

Websites Are the Bottleneck

Tools in This Book

Preparation

Terms and Robots

Technology of the Website

Using Chrome Developer Tools

Tool Considerations

Starting to Code

Parsing robots.txt

Creating a Link Extractor

Extracting Images

Summary

Chapter 2: Enter the Requirements

The Requirements

Preparation

Navigating Through “Meat & fish”

Outlining the Application

Navigating the Website

Creating the Navigation

The requests Library

Switching to requests

Putting the Code Together

Summary

Chapter 3: Using Beautiful Soup

Installing Beautiful Soup

Simple Examples

Parsing HTML Text

Parsing Remote HTML

Parsing a File

Difference Between find and find_all

Extracting All Links

Extracting All Images

Finding Tags Through Their Attributes

Finding Multiple Tags Based on Property

Changing Content

Finding Comments

Converting a Soup to HTML Text

Extracting the Required Information

Identifying, Extracting, and Calling the Target URLs

Navigating the Product Pages

Extracting the Information

Unforeseen Changes

Exporting the Data

To CSV

To JSON

To a Relational Database

To an NoSQL Database

Performance Improvements

Changing the Parser

Parse Only What’s Needed

Saving While Working

Developing on a Long Run

Caching Intermediate Step Results

Caching Whole Websites

Source Code for this Chapter

Summary

Chapter 4: Using Scrapy

Installing Scrapy

Creating the Project

Configuring the Project

Terminology

Middleware

Pipeline

Extension

Selectors

Implementing the Sainsbury Scraper

What’s This allowed_domains About?

Preparation

def parse(self, response)

Navigating Through Categories

Navigating Through the Product Listings

Extracting the Data

Where to Put the Data?

Running the Spider

Exporting the Results

To CSV

To JSON

To Databases

Bring Your Own Exporter

Caching with Scrapy

Storage Solutions

Cache Policies

Downloading Images

Using Beautiful Soup with Scrapy

Logging

(A Bit) Advanced Configuration

LOG_LEVEL

CONCURRENT_REQUESTS

DOWNLOAD_DELAY

Autothrottling

COOKIES_ENABLED

Summary

Chapter 5: Handling JavaScript

Reverse Engineering

Thoughts on Reverse Engineering

Summary

Splash

Set-up

A Dynamic Example

Integration with Scrapy

Adapting the basic Spider

What Happens When Splash Isn’t Running?

Summary

Selenium

Prerequisites

Basic Usage

Integration with Scrapy

Summary

Solutions for Beautiful Soup

Splash

Selenium

Summary

Chapter 6: Website Scraping in the Cloud

Scrapy Cloud

Creating a Project

Deploying Your Spider

Start and Wait

Accessing the Data

API

Limitations

Summary

PythonAnywhere

The Example Script

PythonAnywhere Configuration

Uploading the Script

Running the Script

This Works Just Manually…

Storing Data in a Database?

Summary

What About Beautiful Soup?

Summary

Index

About the Author and About the Technical Reviewer

About the Author

Gábor László Hajba is a Senior Consultant at EBCONT enterprise technologies who specializes in Java, Python, and Crystal. He is responsible for designing and developing solutions for customer needs in the enterprise software world. He has also held roles as an Advanced Software Engineer with Zühlke Engineering, and as a freelance developer with Porsche Informatik. He considers himself a workaholic, a (hard)core and well-grounded developer, pragmatic minded, and a freak for portable apps and functional code.

He currently resides in Sopron, Hungary with his loving wife, Ágnes.

About the Technical Reviewer

Chaim Krause is an expert computer programmer with over thirty years of experience to prove it. He has worked as a lead tech support engineer for ISPs as early as 1995, as a senior developer support engineer with Borland for Delphi, and has worked in Silicon Valley for over a decade in various roles, including technical support engineer and developer support engineer. He is currently a military simulation specialist for the US Army’s Command and General Staff College, working on projects such as developing serious games for use in training exercises.

He has also authored several video training courses on Linux topics and has been a technical reviewer for over twenty books, including iOS Code Testing, Android Apps for Absolute Beginners (4ed), and XML Essentials for C# and .NET Development (all Apress). It seems only natural then that he would be an avid gamer and have his own electronics lab and server room in his basement. He currently resides in Leavenworth, Kansas with his loving partner, Ivana, and a menagerie of four-legged companions: their two dogs, Dasher and Minnie, and their three cats, Pudems, Talyn, and Alaska.
