Table of Contents

Copyright

Brief Table of Contents

Table of Contents

Foreword

Preface

Acknowledgments

About this Book

About the Authors

About the Cover Illustration

1. Getting started

Chapter 1. The case for the digital Babel fish

1.1. Understanding digital documents

1.1.1. A taxonomy of file formats

1.1.2. Parser libraries

1.1.3. Structured text as the universal language

1.1.4. Universal metadata

1.1.5. The program that understands everything

1.2. What is Apache Tika?

1.2.1. A bit of history

1.2.2. Key design goals

1.2.3. When and where to use Tika

1.3. Summary

Chapter 2. Getting started with Tika

2.1. Working with Tika source code

2.1.1. Getting the source code

2.1.2. The Maven build

2.1.3. Including Tika in Ant projects

2.2. The Tika application

2.2.1. Drag-and-drop text extraction: the Tika GUI

2.2.2. Tika on the command line

2.3. Tika as an embedded library

2.3.1. Using the Tika facade

2.3.2. Managing dependencies

2.4. Summary

Chapter 3. The information landscape

3.1. Measuring information overload

3.1.1. Scale and growth

3.1.2. Complexity

3.2. I’m feeling lucky—searching the information landscape

3.2.1. Just click it: the modern search engine

3.2.2. Tika’s role in search

3.3. Beyond lucky: machine learning

3.3.1. Your likes and dislikes

3.3.2. Real-world machine learning

3.4. Summary

2. Tika in detail

Chapter 4. Document type detection

4.1. Internet media types

4.1.1. The parlance of media type names

4.1.2. Categories of media types

4.1.3. IANA and other type registries

4.2. Media types in Tika

4.2.1. The shared MIME-info database

4.2.2. The MediaType class

4.2.3. The MediaTypeRegistry class

4.2.4. Type hierarchies

4.3. File format diagnostics

4.3.1. Filename globs

4.3.2. Content type hints

4.3.3. Magic bytes

4.3.4. Character encodings

4.3.5. Other mechanisms

4.4. Tika, the type inspector

4.5. Summary

Chapter 5. Content extraction

5.1. Full-text extraction

5.1.1. Abstracting the parsing process

5.1.2. Full-text indexing

5.1.3. Incremental parsing

5.2. The Parser interface

5.2.1. Who knew parsing could be so easy?

5.2.2. The parse() method

5.2.3. Parser implementations

5.2.4. Parser selection

5.3. Document input stream

5.3.1. Standardizing input to Tika

5.3.2. The TikaInputStream class

5.4. Structured XHTML output

5.4.1. Semantic structure of text

5.4.2. Structured output via SAX events

5.4.3. Marking up structure with XHTML

5.5. Context-sensitive parsing

5.5.1. Environment settings

5.5.2. Custom document handling

5.6. Summary

Chapter 6. Understanding metadata

6.1. The standards of metadata

6.1.1. Metadata models

6.1.2. General metadata standards

6.1.3. Content-specific metadata standards

6.2. Metadata quality

6.2.1. Challenges/Problems

6.2.2. Unifying heterogeneous standards

6.3. Metadata in Tika

6.3.1. Keys and multiple values

6.3.2. Transformations and views

6.4. Practical uses of metadata

6.4.1. Common metadata for the Lucene indexer

6.4.2. Give me my metadata in my schema!

6.5. Summary

Chapter 7. Language detection

7.1. The most translated document in the world

7.2. Sounds Greek to me—theory of language detection

7.2.1. Language profiles

7.2.2. Profiling algorithms

7.2.3. The N-gram algorithm

7.2.4. Advanced profiling algorithms

7.3. Language detection in Tika

7.3.1. Incremental language detection

7.3.2. Putting it all together

7.4. Summary

Chapter 8. What’s in a file?

8.1. Types of content

8.1.1. HDF: a format for scientific data

8.1.2. Really Simple Syndication: a format for rapidly changing content

8.2. How Tika extracts content

8.2.1. Organization of content

8.2.2. File header and naming conventions

8.2.3. Storage affects extraction

8.3. Summary

3. Integration and advanced use

Chapter 9. The big picture

9.1. Tika in search engines

9.1.1. The search use case

9.1.2. The anatomy of a search index

9.2. Managing and mining information

9.2.1. Document management systems

9.2.2. Text mining

9.3. Buzzword compliance

9.3.1. Modularity, Spring, and OSGi

9.3.2. Large-scale computing

9.4. Summary

Chapter 10. Tika and the Lucene search stack

10.1. Load-bearing walls

10.1.1. ManifoldCF

10.1.2. Open Relevance

10.2. The steel frame

10.2.1. Lucene Core

10.2.2. Solr

10.3. The finishing touches

10.3.1. Nutch

10.3.2. Droids

10.3.3. Mahout

10.4. Summary

Chapter 11. Extending Tika

11.1. Adding type information

11.1.1. Custom media type configuration

11.2. Custom type detection

11.2.1. The Detector interface

11.2.2. Building a custom type detector

11.2.3. Plugging in new detectors

11.3. Customized parsing

11.3.1. Customizing existing parsers

11.3.2. Writing a new parser

11.3.3. Plugging in new parsers

11.3.4. Overriding existing parsers

11.4. Summary

4. Case studies

Chapter 12. Powering NASA science data systems

12.1. NASA’s Planetary Data System

12.1.1. PDS data model

12.1.2. The PDS search redesign

12.2. NASA’s Earth Science Enterprise

12.2.1. Leveraging Tika in NASA Earth Science SIPS

12.2.2. Using Tika within the ground data systems

12.3. Summary

Chapter 13. Content management with Apache Jackrabbit

13.1. Introducing Apache Jackrabbit

13.2. The text extraction pool

13.3. Content-aware WebDAV

13.4. Summary

Chapter 14. Curating cancer research data with Tika

14.1. The NCI Early Detection Research Network

14.1.1. The EDRN data model

14.1.2. Scientific data curation

14.2. Integrating Tika

14.2.1. Metadata extraction

14.2.2. MIME type identification and classification

14.3. Summary

Chapter 15. The classic search engine example

15.1. The Public Terabyte Dataset Project

15.2. The Bixo web crawler

15.2.1. Parsing fetched documents

15.2.2. Validating Tika’s charset detection

15.3. Summary

Appendix A. Tika quick reference

A.1. Tika facade

A.2. Command-line options

A.3. ContentHandler utilities

Appendix B. Supported metadata keys

B.1. Climate Forecast

B.2. Creative Commons

B.3. Dublin Core

B.4. Geographic metadata

B.5. HTTP headers

B.6. Microsoft Office

B.7. Message (email)

B.8. TIFF (Image)

Index

List of Figures

List of Tables

List of Listings
