Scraping WWII battles

The goal of this chapter is to collect information on all World War II battles from Wikipedia. A corresponding list is provided: https://en.wikipedia.org/wiki/List_of_World_War_II_battles. As you can see, it contains links to a large set of pages, one for each battle, operation, and campaign. Furthermore, the list is structured: battles are grouped by campaign or operation, which are, in turn, grouped by theater. It would be great to preserve this hierarchy! Most elements of the list also have a date. We'll work with those lists in a minute.

Now, if you check a couple of pages for specific battles, you may notice that they have a similar structure. For most of them, the large information card on the right has a set of similar subsections, including the main section with dates, locations, and outcomes, and a few additional sections, such as strengths, commanders, casualties, and belligerents. This is great news – we can use this structure to write generally applicable code, and collect specific information for each battle in a uniform fashion.
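To make the idea of a shared card structure concrete, here is a minimal sketch that reads a table into named sections using only the standard library. It is not the scraper built in this chapter. The markup, the section name, and the placeholder values are assumptions made for illustration, and the card on a real Wikipedia page is considerably messier. The function name parse_card and the rule "a single-cell row starts a new section" are likewise hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Collapse whitespace only; the content itself stays raw.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def parse_card(html):
    """Group rows by section. Assumption: a one-cell row is a section header."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    card, section = {}, "main"
    for row in parser.rows:
        if len(row) == 1:
            section = row[0]
        elif row:
            card.setdefault(section, []).append(row)
    return card


# Invented markup that mimics the layout described above.
SAMPLE_CARD = """
<table>
  <tr><th>Date</th><td>placeholder date</td></tr>
  <tr><th>Location</th><td>placeholder place</td></tr>
  <tr><th>Commanders and leaders</th></tr>
  <tr><td>Commander X</td><td>Commander Y</td></tr>
</table>
"""

card = parse_card(SAMPLE_CARD)
print(card)
```

Because every section is stored as a plain list of rows, the same function can be applied to any page that follows this layout, and the interpretation of each row can be deferred.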

Given all that, the task can be executed in three steps:

  1. First, we'll collect all the links and names from the initial list of battles, preserving the nested nature.
  2. Next, we will create a scraper that will extract specific information, such as locations, dates, sides, leaders, and casualties from a page pertaining to a particular battle.
  3. Finally, we will loop over all the links we collected in the first part and collect information for each.
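As an illustration of the first step, the sketch below turns nested list markup into nested dictionaries, so that each campaign keeps its battles as children. It uses only the standard library and runs on an invented HTML fragment. The names NestedListParser and parse_nested_list, the dictionary keys, and the sample links are all hypothetical, and the actual Wikipedia page may need extra handling.

```python
from html.parser import HTMLParser


def _new_node():
    return {"text": "", "url": None, "children": []}


class NestedListParser(HTMLParser):
    """Build a tree of <li> items; each node keeps its text and first link."""

    def __init__(self):
        super().__init__()
        self.root = _new_node()
        self.stack = [self.root]  # the innermost open <li> is last

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            node = _new_node()
            self.stack[-1]["children"].append(node)
            self.stack.append(node)
        elif tag == "a" and len(self.stack) > 1:
            node = self.stack[-1]
            if node["url"] is None:  # keep only the first link of an item
                node["url"] = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "li" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if len(self.stack) > 1:
            self.stack[-1]["text"] += data


def _tidy(node):
    """Collapse whitespace left over from the markup indentation."""
    node["text"] = " ".join(node["text"].split())
    for child in node["children"]:
        _tidy(child)


def parse_nested_list(html):
    parser = NestedListParser()
    parser.feed(html)
    parser.close()
    _tidy(parser.root)
    return parser.root["children"]


# Invented fragment: one campaign containing two battles.
SAMPLE_LIST = """
<ul>
  <li><a href="/wiki/Example_campaign">Example campaign</a> placeholder date
    <ul>
      <li><a href="/wiki/Example_battle_1">Example battle 1</a></li>
      <li><a href="/wiki/Example_battle_2">Example battle 2</a></li>
    </ul>
  </li>
</ul>
"""

tree = parse_nested_list(SAMPLE_LIST)
print(tree[0]["text"], [child["url"] for child in tree[0]["children"]])
```

The resulting tree can then be walked recursively in the third step, visiting each stored link while keeping track of its parent campaign.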

In doing so, we will try to use several approaches that we have found to be useful:

  • Write simple, universal functions first, moving all decisions and opinions to the functions of a higher level.
  • Collect and store raw data – clean and process it afterward. Any exception or error raised while processing inside the scraper might lead to the loss of data that was already collected.
  • Don't clean data within the scraper – it will be way easier to do that afterward, in bulk, having the "raw" data as a reference.
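These principles can be sketched in a few lines. The example below is only an illustration under stated assumptions: scrape_all, fake_scrape, and clean are hypothetical names, the links and values are placeholders, and fake_scrape stands in for a real page scraper so that nothing is downloaded. The low-level function makes no decisions, failures are recorded instead of stopping the run, and cleaning happens afterward on the stored raw data.

```python
import json


def scrape_all(links, scrape_one):
    """Apply scrape_one to every link, keeping raw results and recording failures."""
    results, errors = {}, {}
    for url in links:
        try:
            results[url] = scrape_one(url)
        except Exception as exc:  # broad on purpose: one bad page must not stop the run
            errors[url] = repr(exc)
    return results, errors


def fake_scrape(url):
    """Stand-in for a real page scraper; returns an uncleaned value."""
    if url.endswith("Broken"):
        raise ValueError("no information card found")
    return {"raw_date": "  placeholder date [1] "}


links = ["/wiki/Example_battle", "/wiki/Broken"]
raw, errors = scrape_all(links, fake_scrape)

# Store the raw data first; this string could be written to a file as is.
raw_json = json.dumps(raw, indent=2)


def clean(record):
    """Cleaning is a separate pass that can be rerun without scraping again."""
    return {"date": record["raw_date"].replace("[1]", "").strip()}


cleaned = {url: clean(record) for url, record in raw.items()}
print(cleaned)
print(errors)
```

If the cleaning rule later turns out to be wrong, only clean needs to change, and the raw records remain available as a reference.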

Let's do it!

The scraper we're building in this chapter works at the time of writing. However, any design change to the Wikipedia pages may break it in the future. This is the unfortunate nature of scrapers.