Accessing elements through the DOM

DOM is the acronym for Document Object Model and is the way in which a browser interprets an HTML document inside a window.

The DOM presents a structure similar to that of a tree trunk from which branches emerge. An HTML element that contains other elements is said to be the parent of those elements:

parent
├── child element
└── child element (sibling of the one above)

When searching through the DOM, BeautifulSoup returns the first item with the matching HTML tag. An interesting feature of the library is that it allows the user to search for specific elements in the structure of the document; in this way, we can search for meta tags, forms, and links.
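
For instance, assuming bs is a BeautifulSoup object already built from an HTML document, accessing a tag name as an attribute returns only the first matching element:

>>> bs.meta              # first <meta> tag in the document
>>> bs.form              # first <form> tag (None if the page has no form)
>>> bs.a.get('href')     # href attribute of the first link, assuming one exists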

bs.find_all() is a method that allows us to find all the HTML elements of a certain type and returns a list of tags that match the search pattern.

For example, to search for all meta tags in an HTML document, use the following code:

>>> meta_tags = bs.find_all("meta")
>>> for tag in meta_tags:
...     print(tag)

To search all the forms of an HTML document, use the following code:

>>> form_tags = bs.find_all("form")
>>> for form in form_tags:
...     print(form)

To search all links in an HTML document, use the following code:

>>> link_tags = bs.find_all("a")
>>> for link in link_tags:
...     print(link)

The find_all function (formerly findAll) returns all the elements of the document that match the specified argument. If you want to return a single element, you can use the find function, which returns only the first element that matches.
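
For example, assuming bs has been built from a page that contains several links, the two calls below differ only in what they return: find gives back a single tag, whereas find_all gives back a list of tags:

>>> first_link = bs.find("a")      # single Tag object, or None if there is no <a>
>>> all_links = bs.find_all("a")   # list with every <a> tag in the document
>>> len(all_links)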

In this example, we extract all the links from a given URL. The idea is to make the request with the requests module and parse the data that the request returns with BeautifulSoup.

You can find the following code in the extract_links_from_url.py file inside the beautifulSoup folder:

#!/usr/bin/env python3

from bs4 import BeautifulSoup
import requests

url = input("Enter a website to extract the URLs from: ")

# Send the request with a browser-like User-Agent header
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get("http://" + url, headers=headers)
data = response.text

# Parse the response and print the href attribute of every <a> tag
soup = BeautifulSoup(data, 'lxml')
for link in soup.find_all('a'):
    print(link.get('href'))

In this screenshot, we can see the output of the previous script:

We can also extract images directly with BeautifulSoup, in the same way that we extracted the images with the lxml module in the previous section.

In this example, we make the request to the URL with the requests module. We then build the BeautifulSoup object and extract the <img> tags from it. Each valid image URL is then downloaded, again using the requests package.

You can find the following code in the download_images.py file inside the beautifulSoup folder:

#!/usr/bin/env python3

import requests
from bs4 import BeautifulSoup
import urllib.parse
import sys
import os

response = requests.get('http://www.freeimages.co.uk/galleries/transtech/informationtechnology/index.htm')
parse = BeautifulSoup(response.text, 'lxml')

# Get all image tags
image_tags = parse.find_all('img')

# Get the URLs of the images
images = [img.get('src') for img in image_tags]

# Exit if no images were found on the page
if not images:
    sys.exit("Found No Images")

# Convert relative URLs to absolute URLs if any
images = [urllib.parse.urljoin(response.url, url) for url in images]
print('Found %s images' % len(images))


In the previous code block, we obtained the image URLs using BeautifulSoup with the lxml parser. Now we are going to create a folder for storing the images and download each image into that folder using the requests package.

# Create the download_images folder if it does not exist
file_path = "download_images"

if not os.path.exists(file_path):
    try:
        os.makedirs(file_path)
        print("Creation of the directory %s OK" % file_path)
    except OSError:
        print("Creation of the directory %s failed" % file_path)
else:
    print("download_images directory exists")

# Download each image into the download_images folder
for url in images:
    response = requests.get(url)
    file = open('download_images/%s' % url.split('/')[-1], 'wb')
    file.write(response.content)
    file.close()
    print('Downloaded %s' % url)

In this screenshot, we can see the output of the previous script:

In this example, we are going to extract titles and links from the Hacker News site: https://news.ycombinator.com. In this case, we use the find_all function to obtain the elements that match a specific set of attributes, and later we use the find function to get the anchor elements and read their href attributes.

You can find the following code in the extract_links_hacker_news.py file inside the beautifulSoup folder:

#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

def get_front_page():
    target = "https://news.ycombinator.com"
    frontpage = requests.get(target)
    if not frontpage.ok:
        raise RuntimeError("Can't access hacker news, you should go outside")
    news_soup = BeautifulSoup(frontpage.text, "lxml")
    return news_soup

def find_interesting_links(soup):
    items = soup.find_all('td', {'align': 'right', 'class': 'title'})
    links = []
    for i in items:
        try:
            siblings = list(i.next_siblings)
            post_id = siblings[1].find('a')['id']
            link = siblings[2].find('a')['href']
            title = siblings[2].text
            links.append({'link': link, 'title': title, 'post_id': post_id})
        except Exception:
            # Skip rows that do not follow the expected structure
            pass
    return links

if __name__ == '__main__':
    soup = get_front_page()
    results = find_interesting_links(soup)
    for r in results:
        if r is not None:
            print(r['link'] + " " + (r['title']))

In this screenshot, we can see the output of the previous script:
