Using boilerpipe to extract text from HTML

There are several libraries available for extracting text from HTML documents. We will demonstrate how to use boilerpipe (https://code.google.com/p/boilerpipe/) to perform this operation. This is a flexible API that not only extracts the entire text of an HTML document but can also extract selected parts of an HTML document, such as its title and individual text blocks. We will use the HTML page at http://en.wikipedia.org/wiki/Berlin to illustrate the use of boilerpipe. Part of this page is shown in the following screenshot:

In order to use boilerpipe, you will need to download the binary for the Xerces Parser, which can be found at http://xerces.apache.org/index.html.

We will use two classes to extract text. The first is the HTMLDocument class, which represents the HTML document. The second is the TextDocument class, which represents the text within an HTML document. It consists of one or more TextBlock objects that can be accessed individually if needed.

We start by creating a URL object that represents the Berlin page and use the HTMLFetcher class to fetch it as an HTMLDocument instance. The document's toInputSource method returns an InputSource, which the BoilerpipeSAXInput class uses to create a TextDocument instance. We then call the TextDocument class's getText method to retrieve the text. This method takes two boolean arguments. The first specifies whether to include the TextBlock instances marked as content. The second specifies whether non-content TextBlock instances should be included. In this example, both types of TextBlock instances are included. The following is the working code:

try {
    // Fetch the page and wrap it as an HTMLDocument
    URL url = new URL("https://en.wikipedia.org/wiki/Berlin");
    HTMLDocument htmldoc = HTMLFetcher.fetch(url);
    // Parse the HTML into a TextDocument made up of TextBlock objects
    InputSource is = htmldoc.toInputSource();
    TextDocument document = new BoilerpipeSAXInput(is).getTextDocument();
    // Print both content and non-content blocks
    System.out.println(document.getText(true, true));
} catch (MalformedURLException ex) {
    System.out.println(ex);
} catch (IOException ex) {
    System.out.println(ex);
} catch (SAXException | BoilerpipeProcessingException ex) {
    System.out.println(ex);
}

The output is lengthy, but a few lines are shown here:

Berlin
From Wikipedia, the free encyclopedia
Jump to navigation Jump to search
This article is about the capital of Germany. For other uses, see Berlin (disambiguation) .
State of Germany in Germany
Berlin
State of Germany
From top: Skyline including the TV Tower ,
City West skyline with Kaiser Wilhelm Memorial Church , Brandenburg Gate ,
East Side Gallery ( Berlin Wall ),
Oberbaum Bridge over the Spree ,
Reichstag building ( Bundestag )
.......
This page was last edited on 18 June 2018, at 11:18 (UTC).
Text is available under the Creative Commons Attribution-ShareAlike License ; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy . Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. , a non-profit organization.
Privacy policy
About Wikipedia
Disclaimers
Contact Wikipedia
Developers
Cookie statement
Mobile view
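
To make the effect of the two getText arguments concrete, the following is a minimal sketch that uses only the JDK. It does not call boilerpipe: the Block class and the getText helper are hypothetical stand-ins for TextBlock and TextDocument.getText, and the content flags are assigned by hand purely for illustration. It shows only how the two boolean arguments select blocks, under the assumption that each block carries a single content/non-content mark.

```java
import java.util.ArrayList;
import java.util.List;

public class GetTextFlagsSketch {

    /** Hypothetical stand-in for a TextBlock: some text plus a content mark. */
    static final class Block {
        final String text;
        final boolean isContent;

        Block(String text, boolean isContent) {
            this.text = text;
            this.isContent = isContent;
        }
    }

    /**
     * Hypothetical stand-in for getText(includeContent, includeNonContent):
     * joins the selected blocks, one per line.
     */
    static String getText(List<Block> blocks,
                          boolean includeContent,
                          boolean includeNonContent) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            // A content block is governed by the first flag,
            // a non-content block by the second.
            boolean selected = b.isContent ? includeContent : includeNonContent;
            if (selected) {
                sb.append(b.text).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Lines borrowed from the sample output; the flags are assumptions.
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Jump to navigation Jump to search", false));
        blocks.add(new Block("This article is about the capital of Germany.", true));
        blocks.add(new Block("Privacy policy", false));

        // Both flags true: every block is printed, as in the example above.
        System.out.print(getText(blocks, true, true));
        System.out.println("---");
        // Only content blocks: the navigation and footer lines are dropped.
        System.out.print(getText(blocks, true, false));
    }
}
```

With both arguments set to true, all three lines are printed. With the second argument set to false, only the line marked as content remains, which is why the flags matter when a page carries a lot of navigation and footer text.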