Using POI to extract text from Word documents

The Apache POI project (http://poi.apache.org/index.html) is an API used to extract information from Microsoft Office products. It is an extensive library that allows information extraction from Word documents and other office products, such as Excel and Outlook. When downloading the API for POI, you will also need to use XMLBeans (http://xmlbeans.apache.org/), which supports POI. The binaries for XMLBeans can be downloaded from http://www.java2s.com/Code/Jar/x/Downloadxmlbeans524jar.htm. Our interest is in demonstrating how to use POI to extract text from word documents.

To demonstrate this, we will use a file called TestDocument.docx, with some text, tables, and other stuff, as shown in the following screenshot (we have taken the English home page of Wikipedia):

There are several different file formats used by different versions of Word. To simplify the selection of which text extraction class to use, we will use the ExtractorFactory factory class. Although the POI's capabilities are considerable, the process of extracting text is simple. As shown here, a FileInputStream object representing the file, TestDocument.docx, is used by the ExtractorFactory class' createExtractor method to select the appropriate POITextExtractor instance. This is the base class for several different extractors. The getText method is applied to the extractor to get the text:

private static String getResourcePath(){
File currDir = new File(".");
String path = currDir .getAbsolutePath();
path = path.substring(0, path.length()-2);
String resourcePath = path + File.separator + "src/chapter11/TestDocument.docx";
return resourcePath;
}
public static void main(String args[]){
try {
FileInputStream fis = new FileInputStream(getResourcePath());
POITextExtractor textExtractor = ExtractorFactory.createExtractor(fis);
System.out.println(textExtractor.getText());
} catch (FileNotFoundException ex) {
Logger.getLogger(WordDocExtractor.class.getName()).log(Level.SEVERE, null, ex);
} catch (IOException ex) {
System.out.println(ex);
} catch (OpenXML4JException ex) {
System.out.println(ex);
} catch (XmlException ex) {
System.out.println(ex);
}
}

The output is as follows:

Jump to navigation Jump to search
Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
5,673,388 articles in English
Arts
Biography
Geography
History
Mathematics
Science
Society
Technology
All portals
From today's featured article George Steiner The Portage to San Cristobal of A.H. is a 1981 literary and philosophical novella by George Steiner (pictured). The story is about Jewish Nazi hunters who find a fictional Adolf Hitler (A.H.) alive in the Amazon jungle thirty years after the end of World War II. The book was controversial, particularly among reviewers and Jewish scholars, because the author allows Hitler to defend himself when he is put on trial in the jungle by his captors. There Hitler maintains that Israel owes its existence to the Holocaust and that he is the "benefactor of the Jews". A central theme of The Portage is the nature of language, and revolves around Steiner's lifelong work on the subject and his fascination in the power and terror of human speech. Other themes include the philosophical and moral analysis of history, justice, guilt and revenge. Despite the controversy, it was a 1983 finalist in the PEN/Faulkner Award for Fiction. It was adapted for the theatre by British playwright Christopher Hampton. (Full article...) Recently featured: Monroe Edwards C. R. M. F. Cruttwell Russulaceae Archive By email More featured articles Did you know... Maria Bengtsson ... that a reviewer found Maria Bengtsson (pictured) believable and expressive when she first performed the title role of Arabella by Strauss? ... that the 2018 Osaka earthquake disrupted train services during the morning rush hour, forcing passengers to walk between the tracks? ... that funding for Celia Brackenridge's research into child protection in football was ended because the sport "was not ready for a gay former lacrosse international rummaging through its dirty linen"? ... that the multi-armed Heliaster helianthus sheds several of its arms when attacked by the six-armed predatory starfish Meyenaster gelatinosus? ... that if elected, Democratic candidate Deb Haaland would be the first Native American woman to become a member of the United States House of Representatives? ... that 145 Vietnamese civilians were killed during the 1967 Thuy Bo massacre? ... that Velvl Greene, a University of Minnesota professor of public health, taught more than 30,000 students? ... that a group of Fijians placed a newspaper ad to recruit skiers for Fiji at the 2002 Olympic Games after discussing it at a New Year's Eve party? Archive Start a new article Nominate an article In the news Lake Toba Saudi Arabia lifts its ban on women driving. Canada legalizes the cultivation of cannabis for recreational use with effect from October 2018, making it the second country to do so. An overloaded tourist ferry capsizes in Lake Toba (pictured), Indonesia, killing at least 3 people and leaving 193 others missing. In golf, Brooks Koepka wins the U.S. Open at the Shinnecock Hills Golf Club. Ongoing: FIFA World Cup Recent deaths: Joe Jackson Richard Harrison Yan Jizhou John Mack Nominate an article On this day June 28: Vidovdan in Serbia Anna Pavlova as Giselle 1776 – American Revolutionary War: South Carolina militia repelled a British attack on Charleston. 1841 – Giselle (Anna Pavlova pictured in the title role), a ballet by French composer Adolphe Adam, was first performed at the Théâtre de l'Académie Royale de Musique in Paris. 1911 – The first meteorite to suggest signs of aqueous processes on Mars fell to Earth in Abu Hummus, Egypt. 1978 – In Regents of the Univ. of Cal. v. Bakke, the U.S. Supreme Court barred quota systems in college admissions but declared that affirmative action programs giving advantage to minorities are constitutional. 2016 – Gunmen attacked Istanbul's Atatürk Airport, killing 45 people and injuring more than 230 others. Primož Trubar (d. 1586) · Paul Broca (b. 1824) · Yvonne Sylvain (b. 1907) More anniversaries: June 27 June 28 June 29 Archive By email List of historical anniversaries

Today's featured picture
Henry VIII of England (1491–1547) was King of England from 1509 until his death. Henry was the second Tudor monarch, succeeding his father, Henry VII. Perhaps best known for his six marriages, his disagreement with the Pope on the question of annulment led Henry to initiate the English Reformation, separating the Church of England from papal authority and making the English monarch the Supreme Head of the Church of England. He also instituted radical changes to the English Constitution, expanded royal power, dissolved monasteries, and united England and Wales. In this, he spent lavishly and frequently quelled unrest using charges of treason and heresy. Painting: Workshop of Hans Holbein the Younger Recently featured: Lion of Al-lāt Sagittarius Japanese destroyer Yamakaze (1936) Archive More featured pictures

Other areas of Wikipedia
Community portal – Bulletin board, projects, resources and activities covering a wide range of Wikipedia areas.
Help desk – Ask questions about using Wikipedia.

Furthermore, metadata about the document can also be extracted using metaExtractor, as shown in the following code:

POITextExtractor metaExtractor = textExtractor.getMetadataTextExtractor();
System.out.println(metaExtractor.getText());

It will generate the following output:

Created = Thu Jun 28 06:36:00 UTC 2018
CreatedString = 2018-06-28T06:36:00Z
Creator = Ashish
LastModifiedBy = Ashish
LastPrintedString =
Modified = Thu Jun 28 06:37:00 UTC 2018
ModifiedString = 2018-06-28T06:37:00Z
Revision = 1
Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 26588
CharactersWithSpaces = 31190
Company =
HyperlinksChanged = false
Lines = 221
LinksUpToDate = false
Pages = 8
Paragraphs = 62
Template = Normal.dotm
TotalTime = 1

The other approach is to create an instance of the POIXMLPropertiesTextExtractor class using XWPFDocument, which can be used for CoreProperties and ExtendedProperties, as shown in the following code:

fis = new FileInputStream(getResourcePath());
POIXMLPropertiesTextExtractor properties = new POIXMLPropertiesTextExtractor(new XWPFDocument(fis));
CoreProperties coreProperties = properties.getCoreProperties();
System.out.println(properties.getCorePropertiesText());

ExtendedProperties extendedProperties = properties.getExtendedProperties();
System.out.println(properties.getExtendedPropertiesText());

The output is as follows:

Created = Thu Jun 28 06:36:00 UTC 2018
CreatedString = 2018-06-28T06:36:00Z
Creator = Ashish
LastModifiedBy = Ashish
LastPrintedString =
Modified = Thu Jun 28 06:37:00 UTC 2018
ModifiedString = 2018-06-28T06:37:00Z
Revision = 1

Application = Microsoft Office Word
AppVersion = 12.0000
Characters = 26588
CharactersWithSpaces = 31190
Company =
HyperlinksChanged = false
Lines = 221
LinksUpToDate = false
Pages = 8
Paragraphs = 62
Template = Normal.dotm
TotalTime = 1

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset