Optical character recognition

Optical character recognition (OCR) is the process of extracting text from images. In this section, we will use the open source Tesseract OCR engine, which was originally developed at HP and is now primarily developed at Google. Installation instructions for Tesseract are available at https://github.com/tesseract-ocr/tesseract/wiki. The pytesseract Python wrapper can be installed with pip:

pip install pytesseract

If the original CAPTCHA image is passed to pytesseract, the results are terrible:

>>> import pytesseract 
>>> img = get_captcha_img(html.content) 
>>> pytesseract.image_to_string(img) 
''

An empty string was returned, which means Tesseract failed to extract any characters from the input image. Tesseract was designed to extract more typical text, such as book pages with a consistent background. If we want to use Tesseract effectively, we will need to first modify the CAPTCHA images to remove the background noise and isolate the text.

To better understand the CAPTCHA system we are dealing with, here are some more samples:

The samples in the previous image show that the CAPTCHA text is always black while the background is lighter, so this text can be isolated by checking each pixel and only keeping the black ones, a process known as thresholding. This process is straightforward to achieve with Pillow:

>>> img.save('captcha_original.png') 
>>> gray = img.convert('L') 
>>> gray.save('captcha_gray.png') 
>>> bw = gray.point(lambda x: 0 if x < 1 else 255, '1') 
>>> bw.save('captcha_thresholded.png')

First, we converted the image to grayscale using the convert method. Then, we passed a lambda function to the point method, which applies the function to every pixel value in the image. In the lambda function, a threshold of less than 1 is used, which keeps only completely black pixels and turns everything else white. This snippet saved three images: the original CAPTCHA image, the image in grayscale, and the image after thresholding.
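To make the per-pixel rule concrete, here is a minimal pure-Python sketch of the same thresholding applied to a nested list of grayscale values instead of a Pillow image. The threshold function and the sample data are hypothetical and serve only to illustrate the rule used in the lambda function.

```python
def threshold(pixels, cutoff=1):
    """Map each grayscale value (0-255) to 0 (black) or 255 (white).

    Values below the cutoff stay black; everything else becomes white.
    """
    return [[0 if value < cutoff else 255 for value in row] for row in pixels]


# Hypothetical 2x4 grayscale image: 0 is pure black, 255 is pure white.
sample = [
    [0, 12, 200, 0],
    [255, 0, 1, 90],
]
result = threshold(sample)
print(result)
```

With a cutoff of 1, only pixels whose value is exactly 0 survive as black. A higher cutoff would also keep dark gray pixels, which may or may not help depending on how noisy the background is.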

The text in the final image is much clearer and is ready to be passed to Tesseract:

>>> pytesseract.image_to_string(bw) 
'strange'

Success! The CAPTCHA text has been extracted. In my test of 100 images, this approach correctly interpreted the CAPTCHA image 82 times.

Since the sample text is always lowercase ASCII characters, the performance can be improved further by restricting the result to these characters:

>>> import string 
>>> word = pytesseract.image_to_string(bw) 
>>> ascii_word = ''.join(c for c in word.lower() if c in string.ascii_lowercase)

In my test on the same sample images, this raised the number of correctly interpreted CAPTCHAs to 88 out of 100.
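A comparison like this can be scripted. The following is a rough sketch of one way to score OCR output against hand-labelled answers, before and after filtering. The clean and accuracy helpers are hypothetical names, and the sample strings are made-up illustrations, not the test set or results described above.

```python
import string


def clean(text):
    """Keep only lowercase ASCII letters, as in the filtering step above."""
    return ''.join(c for c in text.lower() if c in string.ascii_lowercase)


def accuracy(predictions, answers):
    """Return the fraction of predictions that exactly match the answers."""
    correct = sum(1 for p, a in zip(predictions, answers) if p == a)
    return correct / len(answers)


# Hypothetical raw OCR outputs and hand-labelled answers (not real results).
raw = ['strange', 'Pl-ace\n', 'w0rld', 'middle.']
answers = ['strange', 'place', 'world', 'middle']

raw_score = accuracy(raw, answers)
clean_score = accuracy([clean(r) for r in raw], answers)
print(raw_score, clean_score)
```

Note that filtering only removes characters. In this toy data, 'w0rld' becomes 'wrld' and is still wrong, so filtering alone cannot recover a letter that was misread as a digit.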

Here is the full code of the registration script so far:

import requests
import string
import pytesseract
from lxml.html import fromstring
from chp6.login import parse_form
from chp7.image_processing import get_captcha_img, img_to_bw

REGISTER_URL = 'http://example.webscraping.com/user/register'


def register(first_name, last_name, email, password):
    session = requests.Session()
    html = session.get(REGISTER_URL)
    form = parse_form(html.content)
    form['first_name'] = first_name
    form['last_name'] = last_name
    form['email'] = email
    form['password'] = form['password_two'] = password
    img = get_captcha_img(html.content)
    captcha = ocr(img)
    form['recaptcha_response_field'] = captcha
    resp = session.post(html.url, form)
    success = '/user/register' not in resp.url
    if not success:
        form_errors = fromstring(resp.content).cssselect('div.error')
        print('Form Errors:')
        # Print one form error per line
        print('\n'.join(
              (' {}: {}'.format(f.get('id'), f.text) for f in form_errors)))
    return success


def ocr(img):
    bw = img_to_bw(img)
    captcha = pytesseract.image_to_string(bw)
    cleaned = ''.join(c for c in captcha.lower() if c in string.ascii_lowercase)
    if len(cleaned) != len(captcha):
        print('removed bad characters: {}'.format(set(captcha) - set(cleaned)))
    return cleaned

The register() function downloads the registration page and scrapes the form as usual, then sets the desired name, e-mail, and password for the new account. The CAPTCHA image is then extracted and passed to the ocr() function, and the result is added to the form. This form data is then submitted, and the response URL is checked to see whether the registration was successful.

If registration fails (that is, we are not redirected to the homepage), the form errors are printed, since we may need to use a longer password or a different e-mail, or the CAPTCHA may have been read incorrectly. We also print the characters we removed, to help debug and improve our CAPTCHA parser. These logs may help us identify common OCR errors, such as mistaking l for 1, and similar errors that require fine distinction between similarly drawn characters.
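If the logs do show a consistent confusion such as 1 for l, one option is to substitute the likely letter before filtering, instead of discarding the character. The following is only a sketch under that assumption: the CONFUSIONS table and the clean_with_substitutions name are hypothetical, and a real mapping would need to come from the characters actually logged.

```python
import string

# Hypothetical map of characters that OCR might produce in place of letters.
CONFUSIONS = str.maketrans({'1': 'l', '0': 'o', '5': 's', '|': 'l'})


def clean_with_substitutions(text):
    """Replace commonly confused characters, then keep lowercase letters."""
    corrected = text.lower().translate(CONFUSIONS)
    return ''.join(c for c in corrected if c in string.ascii_lowercase)


print(clean_with_substitutions('he1lo\n'))
```

This only works because the CAPTCHA text is known to contain lowercase letters alone. If digits could legitimately appear, such substitutions would introduce errors instead of fixing them.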

Now, to register an account, we simply need to call the register() function with the new account details:

>>> register(first_name, last_name, email, password) 
True
