<a> tag
AbstractParser class
add() method
agglutinative languages
alias
analyzers.
See search engines.
annotations
Ant build.
See also source code.
Apache Droids
Apache Gora
Apache Hadoop, 2nd
Bixo
Apache Incubator, 3rd
podlings
Apache Lucene
Document class
ecosystem
Field class
Lucene Core
Apache Mahout, 2nd
Apache Manifold Connectors.
See ManifoldCF.
Apache Nutch, 2nd, 5th
and Bixo
Apache Gora
Protocol plugins
Apache PDFBox, 2nd
Apache Solr
Apache Tika, history of, 2nd
Apache UIMA
annotations
application programming interfaces (APIs)
Java ROME API
Parser API
pull APIs
push APIs.
See also Content Repository for Java API.
application/* MIME type
Architectural Styles and the Design of Network-based Software Architectures
audio/* MIME type
AutoDetectParser
AutoDetectParser class, 2nd
Babel fish
Behemoth
/bin/ls output
biomarkers
Bixo, 9th
and Apache Nutch
and TagSoup
Cascading
Fetch subassembly
Parse subassembly
parsing documents, 2nd
robots.txt file
black lists
BodyContentHandler class
BoilerpipeContentHandler class
BOM markers
Brin, Sergey
build.xml file
byte frequency matching
byte order marks.
See BOM markers; magic bytes.
callback functions
cancer research, 3rd
biomarkers.
See also Early Detection Research Network.
Cardinality property
Cascading
Cascading Style Sheets
categorization
CF.
See Climate Forecast model.
character encodings
BOM markers
byte frequency matching
statistical encoding detection.
See also character sets; charsets.
character sets, 2nd
validating character set detection.
See also character encodings.
CharsetDetector class
charsets.
See also character encodings; character sets.
classpath
Climate Forecast model
ClimateForecast interface
cloud computing, 2nd
clustering
collaborative filtering
command-line interface
--language option.
See also Tika CLI.
composite design pattern
CompositeDetector class
CompositeParser class
compression
content, 12th
how Tika extracts it, 2nd
organization of, 2nd
random access, 2nd
streaming, 2nd
types, 2nd
content management.
See document management systems.
content repositories, 2nd
text extraction pool
Content Repository for Java API
content type hints
Content-Encoding headers
ContentHandler argument
ContentHandler interface, 2nd
in Apache Jackrabbit
content-specific metadata standards, 4th
compared to general standards
difficulty of comparing across standards
Content-Type header.
See content type hints.
context-free interaction
OCO.
See Orbiting Carbon Observatory.
corpus
distance from
Hamshahri
OHSUMED
Tempo
Content-Type header
crawlers.
See search engine.
CSS.
See Cascading Style Sheets.
custom parsers, 4th
creating
customizing existing parsers
DAACs.
See Distributed Active Archive Centers.
data curation
data mining, text mining
data models, Planetary Data System, 2nd
data, linked
databases, MIME-info
deduplication
Definition property
DelegatingParser class
dependencies, managing, 2nd
DeploymentAreaParser class
design goals, 9th
fast processing
flexible metadata
flexible MIME type detection
language detection
low memory footprint
MIME database
parser libraries
unified parsing interface
detect() method, 2nd, 3rd
detecting custom MIME types, 3rd
custom type detectors, 2nd
detecting file formats, 2nd
detecting MIME types
Detector interface
custom type detectors, 2nd
dictionary-based profiling
digital asset management.
See also document management systems.
Distributed Active Archive Centers
document analysis
Document class, 2nd
document management systems
Content-Type headers
Document Object Model
documents, 3rd
analyzing
as text
custom
document management systems
input stream, 2nd
language detection, 2nd
parsing with Bixo, 2nd
text mining.
See also files.
DOM.
See Document Object Model.
downloading
Git
Subversion
Tika source code
drag and drop
Droids.
See Apache Droids.
Dublin Core, 2nd
Early Detection Research Network, 14th
data model
data sets
eCAS Curator
EDRN Catalog and Archive Service
identifying MIME types
linked data
metadata extraction, 2nd
protocols
scientific data curation
use of Tika, 2nd
Earth Science Enterprise, 6th
Distributed Active Archive Centers
how Tika fits in, 2nd
principal investigator
Science Information Processing Systems
eCAS Curator
Ingester
references.
See also EDRN Catalog and Archive Service.
e-commerce, useful user data
EDRN Catalog and Archive Service
eCAS Curator.
See also Early Detection Research Network.
ElementMetadataHandler class
embedding Tika, 4th
Tika facade, 2nd
encoding, output encoding
endDocument function
endElement function
environment settings
errors, TextExtractionError.
See also exceptions.
events, STOP
example/* MIME type
exceptions
IOException, 2nd
SAXException
TikaException, 2nd
extending Tika, 3rd
adding MIME types
Extensible Hypertext Markup Language
in Tika CLI
structured output, 2nd
Extensible Markup Language (XML), 2nd
Resource Description Framework.
See also XML files.
Extensible Metadata Platform (XMP), 2nd
properties and property types
extracting text
full text, 2nd
with Apache Jackrabbit
ExtractingRequestHandler class
facade class
Facebook
fast processing
FeedParser class, 2nd
Fetch subassembly
Field class, 2nd
Fielding, Roy
file extensions.
See also glob patterns.
file formats
combined heuristics
content type hints
detecting, 2nd
filename globs, 2nd
HDF, 2nd, 3rd
headers
magic bytes
OLE
RSS, 2nd
XML
file headers
File Manager catalog
file naming conventions, 2nd
file storage.
See storage.
FileInputStream class
filenames, glob pattern, 2nd
formatted text
full-text extraction, 5th
incremental parsing
indexing, 2nd
full-text indexes, for large-scale systems
general metadata standards, 2nd
compared to content-specific standards
Dublin Core
Geographic interface
get() method
getContentHandler() method
getDefaultRegistry() method
getFile() method
getLanguage() method
getLinks() method
getSupertype() method.
See MediaTypeRegistry class.
getSupportedTypes() method, 2nd, 3rd
Git
glob patterns, 2nd
graphical user interface.
See Tika GUI.
Hamshahri corpus
handling custom documents
hasFile() method
HDF
matrix data
organization of content, 2nd
scalar data
vector data.
See also Hierarchical Data Format.
HDFParser class, 2nd
<head> tag
heuristics
Hierarchical Data Format
organization of content, 2nd
Hitchhiker’s Guide to the Galaxy
HTML.
See Hypertext Markup Language.
HtmlMapper interface
HTMLParser class, 2nd
HtmlParser class
Hypertext Markup Language, 3rd
<head> tag
in Tika CLI
IANA.
See Internet Assigned Numbers Authority.
IdentityHtmlMapper class
image/* MIME type
implementing parsers, 2nd
incremental language detection
incremental parsing, streaming
indexers
full-text indexing.
See also search engines.
indexing, full-text search, 2nd
IndexReader class
IndexWriter class
information overload
Ingester
references
input, standardizing
InputStream argument
InputStream class, 3rd, 4th
and parse() method
intermediaries, promotion
International Organization for Standardization
Internet Assigned Numbers Authority, 3rd
MIME type registry
inverse indexes
IOException, 2nd
input error
isMultiValued() method
ISO 639
isReasonablyCertain() method
isSpecializationOf() method.
See MediaTypeRegistry class.
Jackrabbit.
See Apache Jackrabbit.
Java
embedding Tika, 2nd
managing dependencies, 2nd
ROME API
service providers
Java Beans
Java ROME API
java.io.Writer class
java.util.zip package
JCR.
See Content Repository for Java API.
language detection, 2nd, 17th
advanced algorithms
agglutinative languages
corpus
distance
in Tika, 2nd
incremental
ISO 639 standards
language profiles
N-gram algorithm
profiling algorithms, 2nd
theory, 2nd
UDHR example
language detection theory, 2nd
--language option
language profiles
LanguageIdentifier class
LanguageIdentifierUpdateProcessor class
LanguageProfile class
LinkContentHandler class, 2nd
getLinks() method
linked data
LinkHandler class
links, between files, 2nd
Linnaean taxonomy.
See taxonomy.
Linnaeus, Carl
locale
LookaheadInputStream class
Lucene
Lucene Core
Lucene ecosystem
Apache Droids
Apache Mahout
Apache Nutch, 2nd
Apache Solr
ManifoldCF
Open Relevance, 2nd
LuceneIndexer class
and metadata
converting metadata to RSS
Luke
machine learning, 8th
categorization
clustering
collaborative filtering
predicting user likes and dislikes, 2nd
real-world examples, 2nd
magic bytes
Mahout.
See Apache Mahout.
ManifoldCF
mark feature
mark() method
matrix data
Maven build.
See also source code.
Maven, memory problems
media type registries
MediaTypeRegistry class
media types.
See also MIME types.
MediaType class.
See also media types.
MediaTypeRegistry class
memory footprint
message/* MIME type
<meta> tag, name attribute
metadata, 5th, 7th, 10th
and Early Detection Research Network, 2nd
and LuceneIndexer
and rest
and Tika CLI
and Tika facade
Cardinality property
challenges of acquiring, 2nd
Climate Forecast model
Content-Type header
converting to RSS, 2nd
Definition property
Extensible Metadata Platform
flexibility
how it’s created, 2nd
in Lucene Document objects
instances
metadata models
Metadata.LANGUAGE entry, 2nd
Name property
practical uses for, 2nd
quality of, 2nd
Relationships property
representing
standards, 2nd
transforming
Valid values property
Metadata argument
Metadata class, 2nd, 3rd
metadata instances
representing
transforming
metadata models
Climate Forecast model
Dublin Core.
See also metadata standards.
metadata quality, 2nd
metadata schema
metadata standards, 6th
content-specific standards, 2nd
Dublin Core
general standards, 2nd
MIME database
MIME type identifiers
MIME types, 9th
adding new types to Tika
adding to MIME-info database
and Early Detection Research Network
and Parser interface
application/*
audio/*
categories of
custom, 2nd
custom MIME type detectors, 2nd
detecting, 2nd
example/*
identifiers
image/*
Internet Assigned Numbers Authority
media type registries
MediaType class
MediaTypeRegistry class
message/*
MIME database
MIME-info database
model/*
multipart/*
parent and child types
registration
syntax
text/*
Tika MIME repository
top-level
video/*
working with, 2nd
MIME-info database
adding new types to
MimeType class
MimeTypes class, 2nd
MimeTypesFactory class
ML.
See machine learning.
model/* MIME type
modularity
multipart/* MIME type
Multipurpose Internet Mail Extensions.
See MIME types.
Name property
NASA, 12th
Earth Science Enterprise, 2nd
how they use Tika, 2nd
National Polar-orbiting Operational Environmental Satellite System
Orbiting Carbon Observatory
PDS search redesign, 2nd
Planetary Data System, 2nd
Product Evaluation and Analysis Tool Element
Soil Moisture Active Passive
National Cancer Institute, Early Detection Research Network, 2nd
National Polar-orbiting Operational Environmental Satellite System
NetCDF
N-gram algorithm
nodes
NPOESS.
See National Polar-orbiting Operational Environmental Satellite System.
Nutch.
See Apache Nutch.
Object Linking and Embedding
OHSUMED corpus
OLE format
OLE.
See Object Linking and Embedding.
OODT
Open Relevance, 5th
Hamshahri corpus
OHSUMED corpus
Tempo corpus
Open Services Gateway Initiative (OSGi), 2nd
Orbiting Carbon Observatory
computing resources
org.apache.tika.language package, 2nd
org.apache.tika.metadata package, 2nd
org.apache.tika.mime package
org.apache.tika.parser.Parser interface, 2nd
org.apache.tika.parser package, 2nd
org.apache.tika.sax package
org.xml.sax.ContentHandler interface
organization of content, 2nd
OSGi.
See Open Services Gateway Initiative.
output serialization
overriding parsers
packages
org.apache.tika.language, 2nd
org.apache.tika.metadata, 2nd
org.apache.tika.mime
org.apache.tika.parser, 2nd
org.apache.tika.sax.
See also java.util.zip.
Page, Lawrence
Parse subassembly
parse() method, 2nd, 9th, 10th, 11th, 12th, 13th
and input streams
ContentHandler argument
in Apache Jackrabbit
InputStream argument
Metadata argument
ParseContext argument, 2nd
ParseContext
ParseContext argument
ParseContext class, 2nd
Parser API
Parser class
parser libraries, 2nd, 3rd
parser override
parser selection
ParserDecorator class
parseToString() method, 2nd
parsing context, 4th
environment settings
locale
ParsingReader class
PDF files, parsing
PDFBox library
PDFParser class, 2nd
PdfParser class
PDFTextStripper class
PDS
PDSRDFParser class
PEATE.
See Product Evaluation and Analysis Tool Element.
plain text
Planetary Data System, 11th
data model, 2nd
Instruments
labels
Missions
PDS Data Distribution System
products
search redesign, 2nd
Targets
Planetary Data System Data Distribution System (PDS-D).
See Planetary Data System.
plugins, parser plugins
podlings
principal investigator.
See Science Information Processing Systems.
Product class
Product Evaluation and Analysis Tool Element
and Tika
profiling algorithms, 3rd
advanced
N-gram algorithm
ProfilingHandler class, 2nd
ProfilingWriter class
promotion of intermediaries
properties, in Apache Jackrabbit
Property class
property types
property values, 2nd
PropertyType class
PropertyType enum
PropertyValue class
Protocol plugins
protocols
in Early Detection Research Network
provider configuration files
Public Terabyte Dataset (PTD)
Public Terabyte Dataset Project
pull API
purchase history
push API
random access
ratings
RDF.
See Resource Description Framework format.
Reader class, 2nd
Really Simple Syndication (RSS), 4th
from metadata, 2nd
organization of content, 2nd
Reference class, 2nd
Relationships property
Representational State Transfer.
See REST.
reset() method
Resource Description Framework format
resource management, close() method
REST
context-free interaction
principles of
promotion of intermediaries
use of metadata
RFC 5646
robots.txt file
ROME
root element detection, XML root detection
root elements
RSS
channels
organization of content, 2nd.
See also Really Simple Syndication.
SAX events
parsing.
See also Simple API for XML.
SAXException
output error
SAXTransformerFactory class
scalability, 2nd
scalar data
Science Information Processing Systems (SIPS)
how Tika fits in, 2nd
principal investigator
search engine
search engines, 2nd, 3rd, 10th, 11th, 17th
analyzers
and Tika
Bixo, 2nd
black lists
crawlers
deduplication
indexers
inverse indexes
Public Terabyte Dataset Project
structure of
URL filtering
web crawlers
white lists
service providers
provider configuration files
set methods
setMaxStringLength() method
setMediaTypeRegistry() method
shared MIME-info database.
See MIME-info database.
Simple API for XML
callback functions
parse() method ContentHandler argument
structured output
SimpleTypeDetector class
SIPS.
See Science Information Processing Systems.
SMAP.
See Soil Moisture Active Passive.
social media
Soil Moisture Active Passive
Solr.
See Apache Solr.
SolrCell
source code, 5th
downloading
Git
Subversion
Spring framework
bean configuration
startDocument function
startElement function
statistical encoding detection
STOP event, in Apache Jackrabbit
storage
how it affects extraction, 2nd
logical representation, 2nd
physical representation
streaming, 2nd
structured text, 2nd, 3rd, 6th
as SAX output
semantic structure
sub-class-of
Subversion
trunk checkout
TagSoup
Taste.
See Apache Mahout.
taxonomy
TeeContentHandler class, 2nd, 3rd
Tempo corpus
text mining
Text Retrieval Conference.
See TREC standards.
text, structured
text/* MIME type
TextExtractionError
TIFF interface
Tika Annotator.
See Apache UIMA.
Tika application, 4th
documentation
tika-app.
See also Tika CLI; Tika GUI.
Tika bundle.
See Open Services Gateway Initiative.
Tika facade, 2nd
and metadata
detect() method
parse() method
parseToString() method, 2nd
setMaxStringLength() method
Tika MIME repository
tika-app
tika-bundle
TikaCallable class
tika-core
TikaException, 2nd
parse error
TikaInputStream class, 2nd.
See also document stream.
tika-mimetypes.xml.
See media type registries.
tika-parent
tika-parsers
top level project
TransformerHandler class
transforming metadata
TREC standards
Twitter
type hints, content type hints
type/subtype
UDHR.
See Universal Declaration of Human Rights.
Unicode
BOM markers
uniform resource locators, URL filtering
Universal Declaration of Human Rights (UDHR)
Unix pipeline.
See Tika CLI.
unravelStringMet function
UpdateHandler class
updateVersion function
URL filtering
users
characteristics
item ratings
purchase history
Valid values property
ValueType enum
vector data
video/* MIME type
web browsers
web crawlers
protocol layer
web servers
Web-based Distributed Authoring and Versioning Protocol (WebDAV)
when to use
white lists
World Wide Web
architecture
complexity of, 2nd
scale and growth of, 2nd
XHTML output, 2nd
XHTML.
See Extensible Hypertext Markup Language.
XML files, root elements
XML.
See Extensible Markup Language.
XMLParser class
XmlRootExtractor class
XMP dynamic media
XMP.
See Extensible Metadata Platform.
xmpDM.
See XMP dynamic media.