Chapter 1. The case for the digital Babel fish
Chapter 2. Getting started with Tika
Chapter 3. The information landscape
Chapter 4. Document type detection
Chapter 5. Content extraction
Figure 5.1. Overview of Tika’s parsing process
Figure 5.4. Class diagram that shows the extra functionality provided by the TikaInputStream class
Chapter 6. Understanding metadata
Chapter 7. Language detection
Figure 7.3. Title page of a 16th-century printing of Romeo and Juliet by William Shakespeare
Figure 7.4. Frequency of letters in many languages based on the Latin alphabet
Chapter 8. What’s in a file?
Figure 8.1. Several areas where content can be gleaned from a file
Figure 8.6. The semantics of extracting file and directory metadata
Chapter 9. The big picture
Figure 9.1. Overview of a search engine. The arrows indicate flows of information.
Figure 9.3. Overview of a document management system
Figure 9.4. Overview of a text mining system
Figure 9.5. Tika deployment with parser implementations from multiple different sources
Figure 9.7. Building an inverted index as a map-reduce operation
Chapter 10. Tika and the Lucene search stack
Figure 10.2. The Apache ManifoldCF home page from the Apache Incubator
Figure 10.3. The Apache Open Relevance home page
Figure 10.4. The Apache Lucene Top Level Project home page
Figure 10.5. The Apache Solr Project home page
Figure 10.6. The Apache Nutch Top Level Project home page
Chapter 11. Extending Tika
Figure 11.2. Overview of a generic type detector
Figure 11.3. Custom parser classes for handling digital prescription documents
Chapter 12. Powering NASA science data systems
Figure 12.1. The flow of data through NASA’s Planetary Data System
Chapter 13. Content management with Apache Jackrabbit
Chapter 14. Curating cancer research data with Tika
Chapter 15. The classic search engine example
Figure 15.1. Searching the web for restaurant reviews discussing Chinese food by geographic location
Figure 15.3. Evaluating Tika’s charset detection with a web-scale data set