Chapter 12. Powering NASA science data systems


Welcome to the first of four deep dives showing Tika’s use in a real-world system. We’ll assume that, by now, you have a firm grasp of what Tika can do, how you can use its functionality in your application, and how you can extend Tika and add new functionality to it.

In this chapter, we’ll spend less time covering Tika’s nuts and bolts, and we’ll spend more time showing you how a real-world, huge-scale organization like the National Aeronautics and Space Administration (NASA) uses Tika in some of its newer, large-scale data system efforts.

One of Tika’s flagship deployments has been within NASA. We’ve used Tika to help power search for NASA’s Planetary Data System, the archive for all planetary science information collected over the past 40 years. Tika’s helped us extract information from PDS datasets and index them for a revamp of PDS’s search architecture, helping to turn its online data distribution system into a Google-like, free-text and facet-based search. We’ll explain Tika’s role in this revamp early in the chapter.

Besides planetary science, Tika has also helped NASA in the Earth science domain. Tika now helps power many of NASA’s Earth Science Ground Data Systems, augmenting the power of another Apache technology called Object Oriented Data Technology (OODT) to identify files for cataloging and archiving, delivery to geospatial information systems, and processing and distribution to the general public. We’ll briefly discuss examples of Tika’s use within the Orbiting Carbon Observatory (OCO), the National Polar-orbiting Operational Environmental Satellite System (NPOESS) Preparatory Project (NPP), Sounder Product Evaluation and Analysis Tool Element (PEATE), and the Soil Moisture Active Passive (SMAP) missions. Yes, that was a ton of acronyms—welcome to the world of NASA and let’s dive in!

12.1. NASA’s Planetary Data System

We’ll cover the important aspects of NASA’s Planetary Data System (PDS) search engine redesign project in this section, starting with some basic information about PDS, including a discussion of its core data model and its improved search engine architecture. Along the way, we’ll explain where Tika fits in, and how it helped.

The PDS is NASA’s archive for all of its planetary science information. All of the planetary missions as far back as Viking[1] are cataloged in the PDS.

1 For a full list of NASA planetary missions, see http://science.nasa.gov/planetary-science/missions/.

The system accepts data and metadata processed from instrument science data teams after those teams receive raw data records downlinked from the spacecraft. The data can be arbitrarily represented and formatted (Word documents, engineering datasets, images—sound familiar?) as long as there’s a plain-text, ASCII metadata file (called a label) describing the data delivered along with it. This architecture is depicted in figure 12.1.

Figure 12.1. The flow of data through NASA’s Planetary Data System

In the next section, we’ll provide some brief background on the PDS data model.

12.1.1. PDS data model

All metadata in the Planetary Data System is guided by a domain data model built around familiar NASA mission concepts. A Mission is flown with one or more science Instruments, each concerned with observing a Target, which could be a planet, a star, or a small celestial body (a comet or asteroid). The full PDS data model is beyond the scope of this chapter (let alone this whole book), but the concepts above are enough to go on.

The Great Thing About Standards The full PDS standards reference is a 14 MB, 531-page document describing the PDS data and metadata model in glorious detail. If you’re interested in learning more about PDS data, check out http://mng.bz/6r1B.

Every set of data files delivered to PDS (a product, in PDS and NASA parlance) must have an associated label that captures this data model in some form. So, for example, if the Cassini mission sends some data to the system, that data will carry the basic metadata shown in table 12.1.

Table 12.1. A PDS label for Cassini

Metadata field   Value
--------------   -----
Mission          Cassini–Huygens
Instrument       Cassini Plasma Spectrometer (CAPS); Cosmic Dust Analyzer (CDA); Composite Infrared Spectrometer (CIRS); Ion and Neutral Mass Spectrometer (INMS); Imaging Science Subsystem (ISS); Dual Technique Magnetometer (MAG); Magnetospheric Imaging Instrument (MIMI); Radar, Radio and Plasma Wave Science instrument (RPWS); Radio Science Subsystem (RSS); Ultraviolet Imaging Spectrograph (UVIS); Visible and Infrared Mapping Spectrometer (VIMS)
Target           Saturn
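The label itself is a plain-text ODL file of keyword = value assignments. As a rough illustration, here’s a simplified sketch of pulling those assignments into a map; the label text and parsing logic below are our own invention for this example, not PDS code (real ODL labels have nesting, units, and multi-line values that this toy ignores):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Toy parser for the KEY = VALUE lines found in a PDS ODL-style label.
 * The sample label in main() is illustrative, not a real Cassini label.
 */
public class LabelSketch {
    static Map<String, String> parseLabel(String label) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : label.split("\\r?\\n")) {
            int eq = line.indexOf('=');
            if (eq > 0) {
                String key = line.substring(0, eq).trim();
                // Strip surrounding quotes from quoted values
                String value = line.substring(eq + 1).trim().replaceAll("^\"|\"$", "");
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String label = "MISSION_NAME = \"CASSINI-HUYGENS\"\n"
                     + "INSTRUMENT_ID = ISS\n"
                     + "TARGET_NAME = SATURN\n"
                     + "END";
        System.out.println(parseLabel(label).get("TARGET_NAME")); // SATURN
    }
}
```

A real parser would of course validate against the PDS standards reference rather than accept arbitrary keys.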

Metadata is made accessible via the PDS Data Distribution System, or PDS-D for short. Each PDS product has an associated metadata entry available in PDS-D, made available in a variety of formats ranging from the Object Description Language (ODL) to the W3C standard Resource Description Framework (RDF).

A core understanding of the PDS data model is necessary to see how to improve search over, and access to, the information described by instances of that model. On the surface, most PDS searches are domain-specific, powered by form elements corresponding to some subset of the model. For the expert user, this is precisely the kind of search that remains useful today for finding information in the system. But naive users of PDS were quickly put off by the difficulty of these domain-specific search utilities, and wanted something as simple as a “Google box” where they could type keywords and get back pointers to PDS data.

Next, we’ll describe the redesign of the PDS search system that we began in 2005. Here’s where Tika comes in.

12.1.2. The PDS search redesign

In 2005, the PDS team began an effort in response to the growing desire for free-text, Google-like search within the PDS. To construct this capability, the decision was made to dump PDS metadata from PDS-D in the emerging W3C standard RDF format. RDF is commonly serialized as XML, similar to the RSS example[2] we showed you in chapter 8.

2 Earlier versions of RSS actually leveraged RDF schema.

Once the RDF files were dumped for datasets in the PDS, the files would be indexed in the Apache Lucene (and eventually Apache Solr) search technology, making them easily searchable and available for the PDS-D website and portal to leverage as shown in figure 12.2.

Figure 12.2. The architecture resulting from the PDS search engine redesign. Metadata is dumped from the PDS-D catalog, transformed to RDF by a custom Tika PDS parser, and then sent to Lucene/Solr for indexing.

This should sound eerily familiar from the examples in chapters 1, 3, 8, and 9, where similar search pipelines made content available using the Lucene ecosystem of technologies. In the case of PDS, we ended up writing a parser for PDS metadata outside of the context of Tika (it was only a glimmer in our collective eyes at that point) that was eventually translated into a Tika Parser interface implementation. The PDSRDFParser extracted PDS dataset metadata and text that we then sent to Solr for indexing.

One nifty portion of this example is that we were able to leverage the PDS and its rich data model to identify facets that you see on the main PDS website at http://pds.nasa.gov, an example of which is shown in figure 12.3. Those facets are extracted by the Tika PDSRDFParser class and then sent to Solr, where the field names are specified as facet fields in the Solr schema. This allows the values to be counted and “bucketed,” allowing interested PDS users to use a combination of the facets and free-text search to find the PDS data (called products) of interest. Once those products are found, a user may click a link to find and download the product from a particular PDS discipline node site.
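The real PDSRDFParser lives behind Tika’s Parser interface, but the heart of its job is walking the RDF/XML dump and collecting field values to hand to Solr as facet fields. Here’s a stdlib-only sketch of that extraction step; the sample RDF and tag names are invented for illustration and aren’t the actual PDS-D schema:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class RdfFacetSketch {
    /** Collect leaf-element name -> text values: the raw material for Solr facet fields. */
    static Map<String, List<String>> extractFields(String rdfXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rdfXml.getBytes(StandardCharsets.UTF_8)));
            Map<String, List<String>> fields = new LinkedHashMap<>();
            collect(doc.getDocumentElement(), fields);
            return fields;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private static void collect(Element e, Map<String, List<String>> fields) {
        NodeList children = e.getChildNodes();
        boolean hasElementChild = false;
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                hasElementChild = true;
                collect((Element) child, fields);
            }
        }
        if (!hasElementChild) { // leaf element: record its text under its tag name
            fields.computeIfAbsent(e.getTagName(), k -> new ArrayList<>())
                  .add(e.getTextContent().trim());
        }
    }

    public static void main(String[] args) {
        String rdf = "<rdf><mission>Cassini-Huygens</mission>"
                   + "<target>Saturn</target><target>Titan</target></rdf>";
        System.out.println(extractFields(rdf).get("target")); // [Saturn, Titan]
    }
}
```

In the real pipeline these field values would be added to a Tika Metadata object and posted to Solr, whose schema marks them as facet fields so values can be counted and bucketed.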

Figure 12.3. The NASA Planetary Data System (PDS) main web page and its drill-down (facet-based) search interface

Porting the original PDS parser to Tika was natural and didn’t require any additional overhead. And, as we mentioned in chapter 10, since Tika is one of the load-bearing walls for a number of other Lucene technologies, we were able to easily integrate it with Lucene and Solr to help create the new PDS search architecture.

Now that we’ve described the planetary use case, let’s switch gears and tell you how Tika is used to power data systems focused on our planet, rather than our neighbors in the solar system!

NASA Earth science data systems are traditionally more computationally intensive and more focused on data processing than on data archival, so they represent another important and relevant domain in which to see where and how Tika is being used. The great thing is that Tika is just as useful in Earth science data processing systems (for file identification, classification, parsing, and more) as it is for extracting text and metadata from planetary data files and images.

12.2. NASA’s Earth Science Enterprise

NASA’s Earth Science Enterprise is vast, consisting of three major families of software systems, as depicted in figure 12.4.

Figure 12.4. NASA’s Earth Science Enterprise, consisting of three families of software systems: SIPS take raw data and process it; DAACs distribute that data to the public; proposal systems do ad hoc analyses.

One of the biggest challenges in keeping up with the data volume, the new instruments and missions, and the sheer pace of scientific discovery in NASA’s Earth Science Enterprise is the classification and archival of science information files. Tika is a welcome friend when confronted with this challenge. We’ll explain how in this section, but first you’ll need some background on NASA jargon to understand what’s going on.

Science Information Processing Systems (SIPS) are typically directed by a principal investigator (PI), along with a science team, co-located at a particular institution along with the SIPS. The PI and the science team get early access to the data, help to develop the processing system and science algorithms that transform data from raw data records, to geolocated, calibrated, physically meaningful Earth science data files (or products), which are disseminated to the broader community. SIPS typically include components for file management (labeled FM in figure 12.4), workflow management (labeled WM in the figure), and for resource management (labeled RM in the figure). SIPS also include ingestion components and delivery components, which take in raw data records (ingest), and deliver processed science data products to long-term archives and dissemination centers (called DAACs, and discussed next).

Distributed Active Archive Centers (DAACs) are NASA’s long-term Earth science data archives, geographically distributed around the United States and co-located with science expertise in oceans, land processes, carbon, and atmospheres, to name a few. Each DAAC includes the same basic software stack (FM+WM+RM, ingest, and delivery) as a SIPS, yet typically has different requirements. For example, SIPS aren’t expected to preserve their data over the long term; DAACs are, which demands more disk space, richer metadata, and more thought in general with respect to software development.[3] DAACs are the recommended NASA dissemination centers for all of the agency’s Earth science data.

3 Imagine if you had to make sure that data produced from a Java algorithm would still be reproducible in 30 years!

Rounding out the Earth science enterprise are NASA proposal-funded systems, typically conducting ad hoc analyses or generating value-added data products to distribute to the community. These systems may contain different combinations of the core data system software stack, and may add specific foci, such as science data portals, ad hoc workflows, or data extraction tools. These systems are direct consumers of data made available by DAACs.

12.2.1. Leveraging Tika in NASA Earth Science SIPS

Next, we’ll look at a few Earth science SIPS where Tika has been directly leveraged to help identify files for ingestion, to identify files to pull down from remote sources, and to extract information from those files during pipeline processing.

The Orbiting Carbon Observatory

NASA’s Orbiting Carbon Observatory (OCO) mission is focused on obtaining high-resolution measurements of carbon dioxide sources and sinks at global scale. OCO is a first-of-its-kind mission, set to produce never-before-seen estimates of carbon throughout the entire world.

The first version of the OCO mission failed to launch in 2009, but is being rapidly reconstructed as OCO2 for a relaunch in the 2013 timeframe. OCO2’s SIPS is under construction at NASA’s Jet Propulsion Laboratory, heavily leveraging the system developed for the original 2009 launch.

Next up is the NPP Sounder PEATE mission.

NPOESS Preparatory Project (NPP) Sounder Peate

The NPOESS Preparatory Project is a joint NASA, Department of Defense, and National Oceanic and Atmospheric Administration (NOAA) satellite meant to take the United States into the next generation of weather and climate measurements.

NPP Sounder PEATE is one of five Product Evaluation and Analysis Tool Element (PEATE) projects that support the overall NPP mission, assessing the climate quality of several important science data products (vertical temperature, moisture, and pressure).

The last Earth mission that we’ll talk about in this case study is the Soil Moisture Active Passive (SMAP) mission.

Soil Moisture Active Passive (SMAP) Mission

NASA’s Soil Moisture Active Passive (SMAP) mission is one of two current missions identified in the National Research Council’s Earth science decadal survey as Tier 1 missions, making the objective measurements required to better understand our planet over the next decade (the other Tier 1 mission is ICESat-2).

Priorities for the Next Decade of Earth Measurements In 2007, the United States National Research Council produced a study identifying the most important Earth-related measurements that the nation should focus its attention on over the next decade. The study proposes missions that should be flown to make these national-priority measurements, and assigns tier-based priorities to those missions (Tier 1, Tier 2, and so on). You can read more about the Earth science decadal study report at http://mng.bz/Lj24.

SMAP’s main focus is on increasing the accuracy of freeze-thaw measurements, providing information that will improve the overall measurements and predictive capabilities of regional water models.

The good news is that all of the aforementioned Earth science missions are in great shape: Tika’s helping out their data systems!

12.2.2. Using Tika within the ground data systems

So, where does Tika fit in? All throughout the architecture of each of the aforementioned NASA Earth science missions! File management typically needs to both identify files for ingestion and extract metadata from those files. In addition, ingestion leverages Tika to help identify what files to pull down into the system (based on MIME type). Further, many of the science algorithms that are pipelined together as workflows require the ingestion of pipeline-produced data files and metadata. This information is provided by leveraging Tika to extract metadata and text from these data files, and to either send them for cataloging to a file management component or to marshal the extracted metadata to the next science algorithm, which leverages it to make some sort of decision (how to geolocate the data file, how to calibrate it, and so on).

So what’s unique about each of these NASA Earth science missions? Typically the uniqueness comes from the areas shown in figure 12.5.

Figure 12.5. Tika’s use in the NASA Earth Science Enterprise. Tika helps classify files for file management, metadata extraction, and cataloging as shown in the upper left of the diagram. In the lower right, Tika helps workflow tasks share metadata and information used to trigger science algorithms.

Different Data File Types

File types vary from mission to mission. For example, SMAP’s data files and OCO’s data files, though both formatted using the HDF5 standard, store vastly different data. OCO stores vertical columns of computed CO2, and thus stores matrices that represent those columns of data (recall chapter 8 to jog your memory about how information is stored in HDF5 files). On the other hand, data files from SMAP are radar-oriented, and may store data in matrices with different sizes, may use vectors with different names, or may choose some other representation supported by HDF. Tika is a huge help here in normalizing the extracted HDF information into a Tika Metadata object instance that can be introspected and transformed (recall chapter 6). This use case is shown in the upper-left portion of figure 12.5.
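To make the normalization idea concrete, here’s a toy sketch of mapping mission-specific HDF attribute names onto a shared vocabulary, the way a Tika parser would populate a common Metadata object. The attribute names and the alias table are invented for illustration; they aren’t the real OCO or SMAP vocabularies:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of normalizing mission-specific HDF attribute names into one
 * shared metadata vocabulary. The aliases below are hypothetical.
 */
public class HdfNormalizeSketch {
    // Map each mission-specific attribute name to a common field name
    static final Map<String, String> ALIASES = Map.of(
            "co2_column", "observation",        // hypothetical OCO name
            "radar_backscatter", "observation", // hypothetical SMAP name
            "granule_start", "start_time",
            "StartTime", "start_time");

    static Map<String, String> normalize(Map<String, String> rawAttributes) {
        Map<String, String> metadata = new LinkedHashMap<>();
        rawAttributes.forEach((name, value) ->
                metadata.put(ALIASES.getOrDefault(name, name), value));
        return metadata;
    }

    public static void main(String[] args) {
        Map<String, String> oco = Map.of("co2_column", "matrix[1000x20]",
                                         "StartTime", "2010-001T00:00:00");
        // Downstream code can now ask for "start_time" regardless of mission
        System.out.println(normalize(oco).get("start_time"));
    }
}
```

The payoff is that downstream workflow tasks can query one field name regardless of which mission produced the file.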

Different Processing Algorithms and Workflows

The NPP Sounder PEATE project tests and executes on the order of 5–10 science algorithms, with vastly different control and data flow from OCO, which, for example, uses on the order of 10–20 different science algorithms arranged into different workflows. To support these differences, Tika helps send metadata between each workflow task (a step in the overall pipeline) by using a common representation. This is depicted in the lower-/middle-right portion of figure 12.5.

Computing Resources

OCO’s projected 100-node cluster, and the similarly sized cluster that will be purchased for SMAP, are roughly five times larger than the NPP Sounder PEATE execution environment, which consists of around 20 machines, individually networked and shared among several environments rather than collectively partitioned and clustered together.

Identifying Files for Ingestion and the Overall Ingestion Process

Each mission requires different ancillary datasets and must pull this information from various sources. Tika saves the day here because we’ve been able to use its MIME identification system to automatically decide which files to pull down remotely, and to then decide how to extract metadata from those files and how to ingest them into the system. This is shown in the upper/middle portion of figure 12.5.
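In the real systems this decision rides on Tika’s MIME detection and its glob-based type registry. The following stdlib-only sketch shows the shape of the decision; the extension-to-type table stands in for Tika’s registry, and the type names and filenames are illustrative:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/**
 * Toy sketch of MIME-driven remote file selection: detect a type for each
 * remote filename, then keep only the files the pipeline knows how to ingest.
 */
public class IngestFilterSketch {
    static final Map<String, String> TYPE_BY_EXTENSION = Map.of(
            "h5", "application/x-hdf",   // illustrative type names
            "hdf", "application/x-hdf",
            "xml", "application/xml",
            "txt", "text/plain");

    static String detect(String filename) {
        int dot = filename.lastIndexOf('.');
        String ext = dot < 0 ? "" : filename.substring(dot + 1).toLowerCase();
        return TYPE_BY_EXTENSION.getOrDefault(ext, "application/octet-stream");
    }

    /** Keep only remote files whose detected type is in the wanted set. */
    static List<String> filesToPull(List<String> remoteListing, Set<String> wantedTypes) {
        return remoteListing.stream()
                .filter(name -> wantedTypes.contains(detect(name)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = List.of("granule1.h5", "readme.txt", "ancillary.xml");
        System.out.println(filesToPull(listing, Set.of("application/x-hdf"))); // [granule1.h5]
    }
}
```

Tika improves on this naive approach by combining glob patterns with magic-byte and content-based detection, so files with misleading names are still classified correctly.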

Requirements for Data Delivery and Dissemination

The missions all have their own requirements for dissemination to the public, for delivery to an archive such as a DAAC, and for long-term archival. These requirements and functions are supported by Tika, where it’s used to augment metadata models already captured by the processing system, add flags for data quality, and decide when the time is right to ship a data product (or set of data products) to a DAAC or to the general scientific community. This is shown (partially) in the upper-left/middle portion of figure 12.5.

As you’ve seen, Tika has infected quite a number of areas within the walls of NASA. We’ll summarize lessons learned in the next section and get ready for the next case study!

12.3. Summary

We introduced you to two vastly different domains within NASA: the planetary domain, with its rich data model, and its search-focused virtual data system called PDS-D; and the Earth science domain, with its processing-centric family of Earth science data systems.

One of the major lessons learned from our experience using Tika at NASA is that large-scale data archival, processing, and dissemination almost always need the core capabilities that Tika provides along the road. MIME type identification of files was a huge help because we could leverage not just standards provided by organizations like IANA, but also the rigor and detail that NASA itself put into file naming conventions, file type identification, and documentation.

In addition, we’ve found that NASA data systems are metadata-centric, requiring rich descriptions of datasets as a front line of defense. Since the metadata is at a greatly reduced scale compared to the data (we’re talking the difference between hundreds of kilobytes and hundreds of terabytes!), science users appreciate the ability to browse and identify the data they’d like to start crunching on, before they have to download it.

We hope that this case study has helped generate ideas about how to leverage Tika within your own data system, and about the types of situations and needs that arise when using Tika in data ingestion, archival, and dissemination. In the next case study, we’ll introduce you to another related (but entirely different) use of Tika’s abilities: in the realm of digital content management!
