The key interfaces in Tika were described in detail earlier in this book, and their Javadocs are available online, but it’s often useful to have a quick reference for looking up the more commonly used functionality. This appendix answers that need by summarizing the key parts of the Tika API.
As discussed in chapter 2 and later in this book, the org.apache.tika.Tika facade class is designed to make simple Tika use cases as easy as possible. The facade class supports the methods shown in table A.1.
Method | Description |
---|---|
detect(...) | Returns the automatically detected media type of the given document. The return value is a string like application/pdf. |
parse(...) | Parses the given document and returns the extracted plain text content. The return value is a java.io.Reader instance, and the parsing happens in a background thread while the text stream is read. |
parseToString(...) | Parses the given document and returns the extracted plain text content. The return value is a string whose length is limited by default to avoid memory issues with large documents. |
setMaxStringLength(int) | Sets the maximum length of the parseToString return value. |
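The methods in table A.1 can be combined as in the following minimal sketch. It assumes that tika-core is on the classpath and uses method signatures as found in Tika 1.x; the class name TikaFacadeExample and its helper methods are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;

// Hypothetical example class; not part of Tika itself.
public class TikaFacadeExample {

    // Returns the detected media type of an in-memory document.
    static String detectType(byte[] document) throws Exception {
        Tika tika = new Tika();
        try (InputStream stream = new ByteArrayInputStream(document)) {
            return tika.detect(stream);
        }
    }

    // Extracts plain text, capped at maxLength characters.
    static String extractText(byte[] document, int maxLength) throws Exception {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxLength);
        try (InputStream stream = new ByteArrayInputStream(document)) {
            return tika.parseToString(stream);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] document = "Hello, Tika!".getBytes(StandardCharsets.UTF_8);
        System.out.println("Type: " + detectType(document));
        System.out.println("Text: " + extractText(document, 1000));
    }
}
```

Detection works with tika-core alone, but extracting text from real document formats most likely also requires the parser modules (for example tika-parsers) on the classpath.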
The type detection and text extraction methods accept the document to be processed in several ways. Table A.2 lists the most common ways of specifying a document.
Argument type | Description |
---|---|
java.io.InputStream | The document is read from the given byte stream. You can also optionally specify an explicit Metadata instance to be used along with the document stream. |
java.io.File | The document is read from the given file. The filename and other file metadata are passed along with the document stream. |
java.net.URL | The document is read from the given URL. The possible filename at the end of the URL and any content type and other metadata hints included in the access protocol are passed along with the document stream. |
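As a rough illustration of the argument types in table A.2, the sketch below runs type detection on a stream with an explicit Metadata instance, on a file, and on a URL. It assumes tika-core on the classpath; the class and method names are hypothetical, and the literal "resourceName" is assumed to be the metadata key that Tika uses for filename hints (exposed as a constant whose location differs between Tika versions).

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

// Hypothetical example class; not part of Tika itself.
public class DocumentSourcesExample {

    private static final Tika TIKA = new Tika();

    // Stream plus explicit metadata: the name acts as a detection hint.
    static String fromStream(byte[] content, String name) throws IOException {
        Metadata metadata = new Metadata();
        metadata.set("resourceName", name); // assumed key for the filename hint
        try (InputStream stream = new ByteArrayInputStream(content)) {
            return TIKA.detect(stream, metadata);
        }
    }

    // File: Tika passes the filename along automatically.
    static String fromFile(File file) throws IOException {
        return TIKA.detect(file);
    }

    // URL: Tika opens the connection and reads the document itself.
    static String fromUrl(URL url) throws IOException {
        return TIKA.detect(url);
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "Plain text sample".getBytes(StandardCharsets.UTF_8);
        File file = File.createTempFile("sample", ".txt");
        file.deleteOnExit();
        Files.write(file.toPath(), content);

        System.out.println(fromStream(content, "sample.txt"));
        System.out.println(fromFile(file));
        System.out.println(fromUrl(file.toURI().toURL()));
    }
}
```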
The tika-app runnable JAR file allows you to use Tika as a command-line tool. To use this JAR, start it as follows:
java -jar tika-app.jar [options] [file|URL]
The most commonly used command-line options are summarized in table A.3.
Option | Description |
---|---|
--xml or -x | Outputs the extracted document content as XHTML. This is the default mode. |
--text or -t | Outputs the extracted document content as plain text. |
--metadata or -m | Outputs the extracted document metadata using a simple key: value format. |
--json or -j | Outputs the extracted document metadata as a JSON object. |
--detect or -d | Outputs only the detected document type. |
--gui or -g | Starts the Tika GUI. Useful for quick manual testing or experimentation. |
--help or -? | Prints a detailed listing of all the available command-line options. |
As discussed in chapter 5, the document content extracted by a parser is returned to the client application as XHTML SAX events. Handling these events can be complicated at times, so Tika provides a number of utility classes in the org.apache.tika.sax package for different purposes. Table A.4 summarizes the most commonly used utility classes.
Class | Description |
---|---|
BodyContentHandler | Captures the contents of the <body> tag of the incoming XHTML document and writes it to another ContentHandler instance, a character or a byte stream, or to an internal string buffer that can be accessed using the toString() method. |
LinkContentHandler | Collects all links from the incoming XHTML document. The collected links are available as a list of Link records from the getLinks() method. |
TeeContentHandler | Forwards the incoming XHTML document to any number of ContentHandler instances. Useful when you want to, for example, combine link extraction with other types of content processing. |
XHTMLContentHandler | Utility class used by Parser implementations to make it easier to produce valid and complete XHTML output. |
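The following sketch shows one way the classes in table A.4 might fit together. Instead of running a real parser, it uses XHTMLContentHandler to emit a tiny XHTML document, and a TeeContentHandler forwards the events to both a BodyContentHandler and a LinkContentHandler. It assumes tika-core on the classpath; the class name SaxHandlerExample, the method emitSampleDocument, and the sample content are made up for illustration.

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class; not part of Tika itself.
public class SaxHandlerExample {

    // Stands in for a Parser: emits a small XHTML document as SAX events.
    static void emitSampleDocument(ContentHandler handler) throws SAXException {
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, new Metadata());
        xhtml.startDocument();
        xhtml.element("p", "Hello from a hypothetical parser.");
        xhtml.startElement("a", "href", "http://tika.apache.org/");
        xhtml.characters("Apache Tika");
        xhtml.endElement("a");
        xhtml.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        BodyContentHandler body = new BodyContentHandler();   // internal string buffer
        LinkContentHandler links = new LinkContentHandler();  // collects <a> links

        // Forward the same event stream to both handlers.
        emitSampleDocument(new TeeContentHandler(body, links));

        System.out.println("Text: " + body.toString().trim());
        for (Link link : links.getLinks()) {
            System.out.println("Link: " + link.getUri() + " (" + link.getText() + ")");
        }
    }
}
```

In a real application the TeeContentHandler would be passed to a parser's parse method in place of the emitSampleDocument call, so that text and links are collected in a single pass over the document.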