Using Apache Tika for content analysis and extraction

Apache Tika can detect and extract metadata and text from over a thousand different file types, such as .doc, .docx, .ppt, .pdf, and .xls. This breadth of format support makes it useful for search engines, indexing, content analysis, translation, and similar tasks. It can be downloaded from https://tika.apache.org/download.html. This section explores how Tika can be used to extract text from various formats. We will use Testdocument.docx and TestDocument.pdf only.

Using Tika is very straightforward, as shown in the following code:

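import java.io.File;
import org.apache.tika.Tika;

File file = new File("TestDocument.pdf");
Tika tika = new Tika();
// Detect the media (MIME) type of the file; may throw IOException
String filetype = tika.detect(file);

System.out.println(filetype);
// Extract the plain text; may throw IOException or TikaException
System.out.println(tika.parseToString(file));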

Simply create an instance of Tika and use the detect and parseToString methods to get the following output:

application/pdf
Jump to navigation Jump to search  

Welcome to Wikipedia, 
the free encyclopedia that anyone can edit. 

5,673,388 articles in English 

 Arts 

 Biography 

 Geography 

 History 

 Mathematics 

 Science 

 Society 

 Technology 

 All portals 

From today's featured article 

 
George Steiner 

The Portage to San Cristobal of A.H. is a 1981 

literary and philosophical novella by George Steiner 

(pictured). The story is about Jewish Nazi hunters 

who find a fictional Adolf Hitler (A.H.) alive in the 

Amazon jungle thirty years after the end of World 

War II. The book was controversial, particularly 
....

Internally, Tika first detects the type of the document, selects the appropriate parser, and then extracts the text from the document. Tika also provides the Parser interface and implementing classes for parsing documents directly. We can use Tika's AutoDetectParser or CompositeParser to achieve the same result, and a parser additionally gives access to the metadata of the document. More information on Tika is available at https://tika.apache.org/.
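
The parser-based approach described above can be sketched as follows. This is a minimal illustration, not the only way to do it: the class name TikaParserSketch, the helper class Parsed, and the parse method are hypothetical names introduced here, and the sketch assumes that the Tika core and parser modules are on the classpath and that TestDocument.pdf is in the working directory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical class name, used only for this illustration.
public class TikaParserSketch {

    // Hypothetical holder for the two results of a single parse.
    static final class Parsed {
        final String text;
        final Metadata metadata;

        Parsed(String text, Metadata metadata) {
            this.text = text;
            this.metadata = metadata;
        }
    }

    // Parses one file and returns its body text together with its metadata.
    static Parsed parse(Path path) throws IOException, SAXException, TikaException {
        // AutoDetectParser detects the document type and delegates to a matching parser.
        AutoDetectParser parser = new AutoDetectParser();
        // The handler collects the body text; -1 disables its default write limit.
        BodyContentHandler handler = new BodyContentHandler(-1);
        // The parser fills this object with metadata while it reads the document.
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(path)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }
        return new Parsed(handler.toString(), metadata);
    }

    public static void main(String[] args) throws Exception {
        // Assumes TestDocument.pdf exists in the working directory.
        Parsed parsed = parse(Paths.get("TestDocument.pdf"));
        // Print every metadata entry the parser reported.
        for (String name : parsed.metadata.names()) {
            System.out.println(name + " = " + parsed.metadata.get(name));
        }
        System.out.println(parsed.text);
    }
}
```

The metadata names that are printed depend on the document format and on the Tika version, so no particular output is shown here.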
