Appendix A. Tika quick reference

All the key interfaces in Tika were described in detail earlier in this book and their Javadocs are all available online, but it’s often useful to have a quick reference for looking up some of the more commonly used functionality. This appendix answers that need by providing a summary of the key parts of the Tika API.

A.1. Tika facade

As discussed in chapter 2 and later in this book, the org.apache.tika.Tika facade class is designed to make simple Tika use cases as easy as possible. The facade class supports the methods shown in table A.1.

Table A.1. Key methods of the Tika facade class

Method	Description
detect(...)	Returns the automatically detected media type of the given document. The return value is a string like application/pdf.
parse(...)	Parses the given document and returns the extracted plain text content. The return value is a java.io.Reader instance and the parsing happens in a background thread while the text stream is read.
parseToString(...)	Parses the given document and returns the extracted plain text content. The return value is a string whose length is limited by default to avoid memory issues with large documents.
setMaxStringLength(int)	Sets the maximum length of the parseToString return value.
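
As a rough illustration of the methods in table A.1, the following sketch calls detect, setMaxStringLength, and parseToString. The class name FacadeSketch and its two helper methods are hypothetical names chosen for this example. The code assumes tika-core is on the classpath; extracting actual text also requires the Tika parser modules, without which parseToString will likely return little or no text.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical example class; not part of Tika itself.
public class FacadeSketch {

    /** Detects a media type from the file name alone; no content is read. */
    static String typeOfName(String fileName) {
        return new Tika().detect(fileName);
    }

    /** Extracts plain text, truncated to at most maxChars characters. */
    static String textOf(byte[] document, int maxChars)
            throws IOException, TikaException {
        Tika tika = new Tika();
        tika.setMaxStringLength(maxChars);
        try (InputStream in = new ByteArrayInputStream(document)) {
            return tika.parseToString(in);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "Hello from a plain text document."
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(typeOfName("report.pdf"));
        System.out.println(textOf(doc, 10));
    }
}
```

Lowering the limit with setMaxStringLength is one way to keep memory use predictable when only the start of a large document is needed. When the whole text is needed, the Reader returned by parse(...) lets it be consumed as a stream instead.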

The type detection and text extraction methods accept the document to be processed in several ways. Table A.2 lists the most common ways of specifying a document.

Table A.2. Document arguments to the Tika facade methods

Argument type	Description
java.io.InputStream	The document is read from the given byte stream. You can also optionally specify an explicit Metadata instance to be used along with the document stream.
java.io.File	The document is read from the given file. The filename and other file metadata are passed along with the document stream.
java.net.URL	The document is read from the given URL. The possible filename at the end of the URL and any content type and other metadata hints included in the access protocol are passed along with the document stream.
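
The argument types in table A.2 can be sketched as follows, using detect as the example method. ArgumentSketch and its helpers are hypothetical names for this example, and the code assumes tika-core is on the classpath. The string "resourceName" is used here as the metadata key for the file name hint; Tika also exposes it as a RESOURCE_NAME_KEY constant whose location depends on the Tika version.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

// Hypothetical example class; not part of Tika itself.
public class ArgumentSketch {

    /** Passes a java.io.File; Tika sees both the bytes and the file name. */
    static String detectTempTextFile() throws IOException {
        Path path = Files.createTempFile("tika-sketch", ".txt");
        try {
            Files.write(path,
                    "Just some plain text.\n".getBytes(StandardCharsets.UTF_8));
            return new Tika().detect(path.toFile());
        } finally {
            Files.deleteIfExists(path);
        }
    }

    /** Passes an InputStream plus a Metadata instance carrying a name hint. */
    static String detectWithNameHint(byte[] document, String fileName)
            throws IOException {
        Metadata metadata = new Metadata();
        // Assumption: "resourceName" is the key behind RESOURCE_NAME_KEY.
        metadata.set("resourceName", fileName);
        try (InputStream in = new ByteArrayInputStream(document)) {
            return new Tika().detect(in, metadata);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(detectTempTextFile());
        byte[] csv = "a,b\n1,2\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectWithNameHint(csv, "table.csv"));
    }
}
```

A bare stream carries no name, so the explicit Metadata instance is how a caller supplies the hint that a File or URL argument would provide automatically. A java.net.URL can be passed to detect in the same way as the File above.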

A.2. Command-line options

The tika-app runnable JAR file allows you to use Tika as a command-line tool. To use the JAR, start it as follows:

java -jar tika-app.jar [options] [file|URL]

The most commonly used command-line options are summarized in table A.3.

Table A.3. Tika command-line options

Option	Description
--xml or -x	Outputs the extracted document content as XHTML. This is the default mode.
--text or -t	Outputs the extracted document content as plain text.
--metadata or -m	Outputs the extracted document metadata using a simple key: value format.
--json or -j	Outputs the extracted document metadata as a JSON object.
--detect or -d	Outputs only the detected document type.
--gui or -g	Starts the Tika GUI. Useful for quick manual testing or experimentation.
--help or -?	Prints a detailed listing of all the available command-line options.

A.3. ContentHandler utilities

As discussed in chapter 5, the document content extracted by a parser is returned as XHTML SAX events to the client application. Handling these events can be complicated at times, so Tika provides a number of utility classes in the org.apache.tika.sax package. Table A.4 summarizes the most commonly used utility classes.

Table A.4. ContentHandler utility classes

Class	Description
BodyContentHandler	Captures the contents of the <body> tag of the incoming XHTML document and writes it to another ContentHandler instance, a character or a byte stream, or to an internal string buffer that can be accessed using the toString() method.
LinkContentHandler	Collects all links from the incoming XHTML document. The collected links are available as a list of Link records from the getLinks() method.
TeeContentHandler	Forwards the incoming XHTML document to any number of ContentHandler instances. Useful when you want to, for example, combine link extraction with other types of content processing.
XHTMLContentHandler	Utility class used by Parser implementations to make it easier to produce valid and complete XHTML output.
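
The classes in table A.4 can be combined as in the following sketch, which collects body text and links in a single pass. HandlerSketch and its Result type are hypothetical names for this example. To keep the example self-contained, an XHTMLContentHandler stands in for a real parser and emits a small hand-written document; in normal use the TeeContentHandler would be passed to a Parser instead. Only tika-core is assumed to be on the classpath.

```java
import java.util.List;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.Link;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class; not part of Tika itself.
public class HandlerSketch {

    /** Holds the two results collected from one stream of SAX events. */
    static final class Result {
        final String text;
        final List<Link> links;

        Result(String text, List<Link> links) {
            this.text = text;
            this.links = links;
        }
    }

    static Result collect() throws SAXException {
        BodyContentHandler body = new BodyContentHandler();   // text buffer
        LinkContentHandler links = new LinkContentHandler();  // link list
        TeeContentHandler tee = new TeeContentHandler(body, links);

        // Stand-in for a parser: emit one paragraph containing one link.
        XHTMLContentHandler xhtml = new XHTMLContentHandler(tee, new Metadata());
        xhtml.startDocument();
        xhtml.startElement("p");
        xhtml.characters("See the ");
        xhtml.startElement("a", "href", "https://tika.apache.org/");
        xhtml.characters("Tika site");
        xhtml.endElement("a");
        xhtml.endElement("p");
        xhtml.endDocument();

        return new Result(body.toString(), links.getLinks());
    }

    public static void main(String[] args) throws Exception {
        Result result = collect();
        System.out.println(result.text.trim());
        for (Link link : result.links) {
            System.out.println(link.getUri() + " (" + link.getText() + ")");
        }
    }
}
```

Because TeeContentHandler forwards every event to both handlers, the document only has to be parsed once even though two kinds of output are collected.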
