Part 1. Getting started

“The Babel fish,” said The Hitchhiker’s Guide to the Galaxy quietly, “is small, yellow and leech-like, and probably the oddest thing in the Universe. It feeds on brainwave energy not from its carrier but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centers of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language.”

Douglas Adams, The Hitchhiker’s Guide to the Galaxy

This first part of the book explains why it matters to be able to rapidly process, integrate, compare, and, most importantly, understand the variety of content available in the digital world. You’ve likely encountered only a subset of the thousands of media types that exist (PDF, Word, Excel, and HTML, to name a few), and you likely need dozens of different applications to read these types, edit and add text to them, view the text, copy and paste between documents, and include that information in your software programs (if you’re a programmer geek like us).

We’ll try to help you tackle this problem by introducing you to Apache Tika, a software framework focused on automatic media type identification, text extraction, and metadata extraction. Our goal for this part of the book is to equip you with historical knowledge (Tika’s motivation, history, and inception), practical knowledge (how to download and install Tika and use it in your application), and the steps required to start using Tika to deal with the proliferation of files available at your fingertips.
