Apache Tika can detect the type of, and extract metadata and text from, over a thousand different file types, such as .doc, .docx, .ppt, .pdf, and .xls. This broad format support makes it useful for search engines, indexing, content analysis, and translation. It can be downloaded from https://tika.apache.org/download.html. This section explores how Tika can be used to extract text from various formats. We will use only Testdocument.docx and TestDocument.pdf.
Using Tika is very straightforward, as shown in the following code:
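import java.io.File;
import org.apache.tika.Tika;

File file = new File("TestDocument.pdf");
Tika tika = new Tika();
// Detect the media (MIME) type of the file
String filetype = tika.detect(file);
System.out.println(filetype);
// Extract the plain text; parseToString can throw IOException and TikaException
System.out.println(tika.parseToString(file));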
Simply create an instance of Tika, then call detect to get the media type and parseToString to get the extracted text. This produces the following output:
application/pdf
Jump to navigation Jump to search
Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
5,673,388 articles in English
Arts
Biography
Geography
History
Mathematics
Science
Society
Technology
All portals
From today's featured article
George Steiner
The Portage to San Cristobal of A.H. is a 1981
literary and philosophical novella by George Steiner
(pictured). The story is about Jewish Nazi hunters
who find a fictional Adolf Hitler (A.H.) alive in the
Amazon jungle thirty years after the end of World
War II. The book was controversial, particularly
....
Internally, Tika first detects the type of the document, selects the appropriate parser, and then extracts the text from the document. Tika also provides the Parser interface and implementing classes for parsing documents directly; AutoDetectParser or CompositeParser can be used to achieve the same result. When a parser is used, the metadata of the document is also available. More information on Tika is available at https://tika.apache.org/.
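The detect, select, and extract sequence described above can be illustrated with a small toy sketch that uses only the Java standard library. This is not Tika's implementation: the class DetectThenParseSketch, its methods, the parser registry, and the fallback to text/plain are all hypothetical names and simplifications introduced here for illustration. It only shows the general idea of inspecting the leading bytes of a document, mapping them to a media type, and dispatching to a parser registered for that type.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Toy illustration only; every name in this class is hypothetical.
public class DetectThenParseSketch {

    // Step 1: detect a media type from the leading ("magic") bytes.
    public static String detect(byte[] data) {
        if (startsWith(data, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        // ZIP containers, which include .docx files, begin with the bytes "PK\003\004".
        if (startsWith(data, new byte[] {0x50, 0x4B, 0x03, 0x04})) {
            return "application/zip";
        }
        // Simplification of this sketch: treat everything else as plain text.
        return "text/plain";
    }

    // Returns true when data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    // Step 2: a registry that maps media types to (stub) parsers.
    static final Map<String, Function<byte[], String>> PARSERS = new LinkedHashMap<>();
    static {
        PARSERS.put("text/plain", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Placeholder: a real PDF parser would decode the document structure here.
        PARSERS.put("application/pdf", bytes -> "[PDF text would be extracted here]");
    }

    // Step 3: detect the type, select the matching parser, and extract text.
    public static String parseToString(byte[] data) {
        String type = detect(data);
        Function<byte[], String> parser = PARSERS.get(type);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + type);
        }
        return parser.apply(data);
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7 ...".getBytes(StandardCharsets.US_ASCII);
        byte[] txt = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(pdf));        // prints application/pdf
        System.out.println(parseToString(txt)); // prints hello
    }
}
```

Keeping detection separate from the parser registry means that support for a new format only needs one more detection rule and one more registry entry, and the calling code stays a single parseToString call, much like the facade used earlier in this section.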