Using Apache Tika for content analysis and extraction

Apache Tika can detect the type of a file and extract its metadata and text. It supports more than a thousand file types, such as .doc, .docx, .ppt, .pdf, and .xls, which makes it useful for search engines, indexing, content analysis, translation, and similar tasks. It can be downloaded from https://tika.apache.org/download.html. This section explores how Tika can be used to extract text from various formats. We will use only TestDocument.docx and TestDocument.pdf.

Using Tika is very straightforward, as shown in the following code:

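import java.io.File;
import org.apache.tika.Tika;

// Both calls can throw IOException; parseToString can also throw TikaException
File file = new File("TestDocument.pdf");
Tika tika = new Tika();

// Detect the media (MIME) type of the file
String filetype = tika.detect(file);
System.out.println(filetype);

// Extract the document's content as plain text
System.out.println(tika.parseToString(file));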

Simply create an instance of Tika, then call detect to get the file's media type and parseToString to get its text content. For TestDocument.pdf, this produces the following output:

application/pdf
Jump to navigation Jump to search

Welcome to Wikipedia,
the free encyclopedia that anyone can edit.

5,673,388 articles in English

Arts

Biography

Geography

History

Mathematics

Science

Society

Technology

All portals

From today's featured article


George Steiner

The Portage to San Cristobal of A.H. is a 1981

literary and philosophical novella by George Steiner

(pictured). The story is about Jewish Nazi hunters

who find a fictional Adolf Hitler (A.H.) alive in the

Amazon jungle thirty years after the end of World

War II. The book was controversial, particularly
....

Internally, Tika first detects the type of the document, selects the appropriate parser, and then extracts the text from the document. Tika also provides the Parser interface and implementing classes for parsing documents directly. We can use Tika's AutoDetectParser or CompositeParser to achieve the same result. A parser also gives access to the document's metadata. More information on Tika is available at https://tika.apache.org/.
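
The parser-based approach can be sketched as follows. This is a minimal illustration rather than the book's own listing: the class name TikaParserSketch and the helper extractText are hypothetical, and it assumes that the Tika parser modules (not only tika-core) are on the classpath, since tika-core alone detects types but ships no format parsers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this sketch
public class TikaParserSketch {

    // Parses the stream with AutoDetectParser, returns the extracted text,
    // and fills the supplied Metadata object as a side effect.
    static String extractText(InputStream stream, Metadata metadata)
            throws IOException, SAXException, TikaException {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 removes the handler's default limit on the number of characters kept
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(stream, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Defaults to the sample file used in this section
        String path = args.length > 0 ? args[0] : "TestDocument.pdf";
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(path))) {
            String text = extractText(stream, metadata);
            // Print every metadata field the parser populated
            for (String name : metadata.names()) {
                System.out.println(name + ": " + metadata.get(name));
            }
            System.out.println(text);
        }
    }
}
```

The metadata fields that are printed depend on the document type and on the Tika version, so the exact names should not be relied on without checking them against a real file.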
