
Implementing association tables

An association table acts as a bridge between two database tables that have a many-to-many relationship. It contains foreign keys that reference the primary keys of the tables it connects.
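To make the idea concrete before bringing in SQLAlchemy, here is a minimal sketch that uses only the sqlite3 module from the standard library. The table and column names mirror the ones used later in this recipe; the sample rows and the pages_for_link() helper are made up for illustration.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, filename TEXT NOT NULL UNIQUE);
    CREATE TABLE links (id INTEGER PRIMARY KEY, url TEXT NOT NULL UNIQUE);
    -- The association table: one row per (page, link) pair
    CREATE TABLE page_links (
        page_id INTEGER NOT NULL REFERENCES pages(id),
        link_id INTEGER NOT NULL REFERENCES links(id),
        PRIMARY KEY (page_id, link_id)
    );
""")

# Hypothetical sample rows
conn.executemany('INSERT INTO pages VALUES (?, ?)',
                 [(1, 'a.html'), (2, 'b.html')])
conn.executemany('INSERT INTO links VALUES (?, ?)',
                 [(1, 'http://example.com'), (2, 'http://example.org')])
# Page 1 contains both links; page 2 shares link 1 with page 1
conn.executemany('INSERT INTO page_links VALUES (?, ?)',
                 [(1, 1), (1, 2), (2, 1)])


def pages_for_link(conn, url):
    """Return the filenames of all pages that contain the given URL."""
    rows = conn.execute("""
        SELECT p.filename FROM pages p
        JOIN page_links pl ON pl.page_id = p.id
        JOIN links l ON l.id = pl.link_id
        WHERE l.url = ? ORDER BY p.filename""", (url,))
    return [row[0] for row in rows]


print(pages_for_link(conn, 'http://example.com'))
```

Neither the pages table nor the links table refers to the other directly; the relationship lives entirely in page_links, which is why the same link can be attached to any number of pages.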

In this recipe, we will associate web pages with the links they contain. A page has many links, and a link can appear on many pages. We will concern ourselves only with links to other websites, but this is not a requirement. If you are trying to reproduce a website on your local machine for testing or analysis, you will want to store image and JavaScript links as well. Have a look at the following relational schema diagram:

Getting ready

I installed SQLAlchemy 0.9.9 with Anaconda, as follows:

$ conda install sqlalchemy

If you prefer, you can also install SQLAlchemy with the following command:

$ pip install sqlalchemy

How to do it…

The following code from the impl_association.py file in this book's code bundle implements the association table pattern:

The imports are as follows:

from sqlalchemy import create_engine
from sqlalchemy import Column
from sqlalchemy import ForeignKey
from sqlalchemy import Integer
from sqlalchemy import String
from sqlalchemy import Table
from sqlalchemy.orm import backref
from sqlalchemy.orm import relationship
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.exc import IntegrityError
import dautil as dl
import os

Base = declarative_base()

Define the following class to represent a web page:

class Page(Base):
    __tablename__ = 'pages'
    id = Column(Integer, primary_key=True)
    filename = Column(String, nullable=False, unique=True)
    links = relationship('Link', secondary='page_links')

    def __repr__(self):
        return "Id=%d filename=%s" %(self.id, self.filename)

Define the following class to represent a web link:

class Link(Base):
    __tablename__ = 'links'
    id = Column(Integer, primary_key=True)
    url = Column(String, nullable=False, unique=True)

    def __repr__(self):
        return "Id=%d url=%s" %(self.id, self.url)

Define the following class to represent the association between pages and links:

class PageLink(Base):
    __tablename__ = 'page_links'
    page_id = Column(Integer, ForeignKey('pages.id'), primary_key=True)
    link_id = Column(Integer, ForeignKey('links.id'), primary_key=True)
    page = relationship('Page', backref=backref('link_assoc'))
    link = relationship('Link', backref=backref('page_assoc'))

    def __repr__(self):
        return "page_id=%s link_id=%s" %(self.page_id, self.link_id)
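Because both page_id and link_id are marked with primary_key=True, the two columns together form a composite primary key, so each page/link pair can be stored only once. The following sketch shows that constraint at the database level with the standard library's sqlite3 module rather than SQLAlchemy; the associate() helper is hypothetical. With SQLAlchemy, the same violation surfaces as the sqlalchemy.exc.IntegrityError imported earlier.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE page_links (
        page_id INTEGER NOT NULL,
        link_id INTEGER NOT NULL,
        PRIMARY KEY (page_id, link_id)
    )""")


def associate(conn, page_id, link_id):
    """Store a page/link pair; return False if it is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute('INSERT INTO page_links VALUES (?, ?)',
                         (page_id, link_id))
        return True
    except sqlite3.IntegrityError:
        # The composite primary key rejected a duplicate pair
        return False


print(associate(conn, 1, 1))  # new pair
print(associate(conn, 1, 1))  # same pair again
print(associate(conn, 2, 1))  # same link, different page
```

Only the exact duplicate is rejected; the same link on a different page is a new pair and is accepted.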

Define the following function to go through HTML files and update the related tables:

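def process_file(fname, session):
    with open(fname) as html_file:
        text = html_file.read()

    if dl.db.count_where(session, Page.filename, fname):
        # Page already stored; cowardly refusing to continue
        return

    page = Page(filename=fname)
    # Add the page explicitly so it is saved even without http links
    session.add(page)
    hrefs = dl.web.find_hrefs(text)

    for href in set(hrefs):
        # Only saving http links
        if href.startswith('http'):
            # Reuse a link that is already stored, so that a link
            # shared by several pages is associated with each of them
            link = session.query(Link).filter_by(url=href).first()

            if link is None:
                link = Link(url=href)

            session.add(PageLink(page=page, link=link))

    session.commit()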
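The dl.web.find_hrefs() helper comes from the dautil package, and its implementation is not shown in this recipe. If you want to follow along without dautil, the sketch below is one possible stand-in built on the standard library's html.parser module. The HrefParser class and the find_http_hrefs() function are hypothetical names, and the sketch assumes that only the href attributes of <a> tags matter, which may differ from what dautil does.

```python
from html.parser import HTMLParser


class HrefParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def find_http_hrefs(text):
    """Return the sorted, de-duplicated http(s) links found in text."""
    parser = HrefParser()
    parser.feed(text)
    return sorted({h for h in parser.hrefs if h.startswith('http')})


# Made-up HTML with a duplicate link and a relative link
html = ('<a href="http://example.com">one</a>'
        '<a href="/local/page.html">two</a>'
        '<a href="http://example.com">three</a>'
        '<a href="https://example.org/x">four</a>')
print(find_http_hrefs(html))
```

As in process_file(), duplicates are dropped with a set and relative links are filtered out by the startswith('http') check.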

Define the following function to populate the database:

def populate():
    # Avoid shadowing the dir() and file builtins
    data_dir = dl.data.get_data_dir()
    path = os.path.join(data_dir, 'crawled_pages.db')
    engine = create_engine('sqlite:///' + path)
    DBSession = sessionmaker(bind=engine)
    Base.metadata.create_all(engine)
    session = DBSession()

    files = ['460_cc_phantomjs.html', '468_live_phantomjs.html']

    for fname in files:
        process_file(fname, session)

    return session

The following code snippet uses the functions and classes that we defined:

if __name__ == "__main__":
    session = populate()
    printer = dl.log_api.Printer(nelems=3)
    pages = session.query(Page).all()
    printer.print('Pages', pages)

    links = session.query(Link).all()
    printer.print('Links', links)

    page_links = session.query(PageLink).all()
    printer.print('PageLinks', page_links)

Refer to the following screenshot for the end result:
