Appendix A. Tika quick reference

All the key interfaces in Tika were described in detail earlier in this book and their Javadocs are all available online, but it’s often useful to have a quick reference for looking up some of the more commonly used functionality. This appendix answers that need by providing a summary of the key parts of the Tika API.

A.1. Tika facade

As discussed in chapter 2 and later in this book, the org.apache.tika.Tika facade class is designed to make simple Tika use cases as easy as possible. The facade class supports the methods shown in table A.1.

Table A.1. Key methods of the Tika facade class

Method                   Description
detect(...)              Returns the automatically detected media type of the given document. The return value is a string like application/pdf.
parse(...)               Parses the given document and returns the extracted plain text content. The return value is a java.io.Reader instance and the parsing happens in a background thread while the text stream is read.
parseToString(...)       Parses the given document and returns the extracted plain text content. The return value is a string whose length is limited by default to avoid memory issues with large documents.
setMaxStringLength(int)  Sets the maximum length of the parseToString return value.
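
The following is a minimal sketch of how the facade methods in table A.1 might be called. The class name FacadeQuickStart, the helper method names, and the in-memory sample text are assumptions made for illustration; the example also assumes that the Tika libraries are on the classpath, and the extracted text will be empty if no parser implementations are available.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

public class FacadeQuickStart {

    // Hypothetical sample document held in memory.
    static final byte[] SAMPLE =
            "Hello from the Tika quick reference.".getBytes(StandardCharsets.UTF_8);

    // detect(...): returns the media type as a string.
    static String detectType(Tika tika) throws Exception {
        try (InputStream in = new ByteArrayInputStream(SAMPLE)) {
            return tika.detect(in);
        }
    }

    // parseToString(...): returns the extracted text as a string,
    // truncated to the length set with setMaxStringLength(int).
    static String extractString(Tika tika, int maxLength) throws Exception {
        tika.setMaxStringLength(maxLength);
        try (InputStream in = new ByteArrayInputStream(SAMPLE)) {
            return tika.parseToString(in);
        }
    }

    // parse(...): returns a Reader that is consumed here into a string.
    static String extractViaReader(Tika tika) throws Exception {
        StringBuilder text = new StringBuilder();
        try (Reader reader = tika.parse(new ByteArrayInputStream(SAMPLE))) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        System.out.println(detectType(tika));
        System.out.println(extractString(tika, 10));
        System.out.println(extractViaReader(tika));
    }
}
```

The Reader form is usually the better fit for large documents, because the text can be consumed incrementally instead of being held in a single string.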

The type detection and text extraction methods accept the document to be processed in several ways. Table A.2 lists the most common ways of specifying a document.

Table A.2. Document arguments to the Tika facade methods

Argument type        Description
java.io.InputStream  The document is read from the given byte stream. You can also optionally specify an explicit Metadata instance to be used along with the document stream.
java.io.File         The document is read from the given file. The filename and other file metadata are passed along with the document stream.
java.net.URL         The document is read from the given URL. The possible filename at the end of the URL and any content type and other metadata hints included in the access protocol are passed along with the document stream.
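
As a rough illustration of the argument types in table A.2, the sketch below detects the type of the same document three times: from a stream with an explicit Metadata instance, from a File, and from a URL. The class name DocumentArguments, the helper methods, and the temporary sample file are assumptions made for this example. The filename hint is set through the literal key "resourceName", which is the value behind Tika's RESOURCE_NAME_KEY constant; the literal is used here only because the class that declares the constant differs between Tika versions.

```java
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

public class DocumentArguments {

    // Creates a small temporary text file to act as the document (hypothetical sample data).
    static File createSample() throws Exception {
        File file = File.createTempFile("tika-sample", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), "Plain text sample.".getBytes(StandardCharsets.UTF_8));
        return file;
    }

    // java.io.InputStream plus an explicit Metadata instance carrying a filename hint.
    static String detectFromStream(Tika tika, File file) throws Exception {
        Metadata metadata = new Metadata();
        metadata.set("resourceName", file.getName());
        try (InputStream in = Files.newInputStream(file.toPath())) {
            return tika.detect(in, metadata);
        }
    }

    // java.io.File: the facade passes the filename along with the content.
    static String detectFromFile(Tika tika, File file) throws Exception {
        return tika.detect(file);
    }

    // java.net.URL: here a file: URL pointing at the same temporary file.
    static String detectFromUrl(Tika tika, File file) throws Exception {
        URL url = file.toURI().toURL();
        return tika.detect(url);
    }

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File sample = createSample();
        System.out.println(detectFromStream(tika, sample));
        System.out.println(detectFromFile(tika, sample));
        System.out.println(detectFromUrl(tika, sample));
    }
}
```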

A.2. Command-line options

The tika-app runnable JAR file allows you to use Tika as a command-line tool. To use this JAR, run it as follows:

java -jar tika-app.jar [options] [file|URL]

The most commonly used command-line options are summarized in table A.3.

Table A.3. Tika command-line options

Option            Description
--xml or -x       Outputs the extracted document content as XHTML. This is the default mode.
--text or -t      Outputs the extracted document content as plain text.
--metadata or -m  Outputs the extracted document metadata using a simple key: value format.
--json or -j      Outputs the extracted document metadata as a JSON object.
--detect or -d    Outputs only the detected document type.
--gui or -g       Starts the Tika GUI. Useful for quick manual testing or experimentation.
--help or -?      Prints a detailed listing of all the available command-line options.

A.3. ContentHandler utilities

As discussed in chapter 5, the document content extracted by a parser is returned as XHTML SAX events to the client application. Handling these events can be complicated, so Tika provides a number of utility classes in the org.apache.tika.sax package. Table A.4 summarizes the most commonly used utility classes.

Table A.4. ContentHandler utility classes

Class                Description
BodyContentHandler   Captures the contents of the <body> tag of the incoming XHTML document and writes it to another ContentHandler instance, a character or a byte stream, or to an internal string buffer that can be accessed using the toString() method.
LinkContentHandler   Collects all links from the incoming XHTML document. The collected links are available as a list of Link records from the getLinks() method.
TeeContentHandler    Forwards the incoming XHTML document to any number of ContentHandler instances. Useful when you want to, for example, combine link extraction with other types of content processing.
XHTMLContentHandler  Utility class used by Parser implementations to make it easier to produce valid and complete XHTML output.
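
The four classes in table A.4 can be combined in one small sketch. Here an XHTMLContentHandler plays the role of a parser and emits a short XHTML document by hand, a TeeContentHandler forwards the events to two targets, and a BodyContentHandler and a LinkContentHandler collect the text and the links. The class name HandlerUtilities, the run() helper, and the sample paragraph and link are assumptions made for illustration; in a real application the events would normally come from a Parser rather than from hand-written calls.

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;

public class HandlerUtilities {

    // Returns the body text, the URI of the first link, and the text of the first link.
    static String[] run() throws Exception {
        BodyContentHandler body = new BodyContentHandler();   // buffers the <body> text
        LinkContentHandler links = new LinkContentHandler();  // collects Link records

        // Forward every SAX event to both handlers.
        TeeContentHandler tee = new TeeContentHandler(body, links);

        // Produce XHTML events the way a Parser implementation would.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(tee, new Metadata());
        xhtml.startDocument();
        xhtml.startElement("p");
        xhtml.characters("See the ");
        xhtml.startElement("a", "href", "https://tika.apache.org/");
        xhtml.characters("Tika site");
        xhtml.endElement("a");
        xhtml.endElement("p");
        xhtml.endDocument();

        Link first = links.getLinks().get(0);
        return new String[] { body.toString(), first.getUri(), first.getText() };
    }

    public static void main(String[] args) throws Exception {
        String[] result = run();
        System.out.println(result[0].trim());
        System.out.println(result[1] + " (" + result[2] + ")");
    }
}
```

Passing the tee handler to a parser's parse method instead of driving it by hand would collect the same two views of any parsed document in a single pass.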