CHAPTER 11 Using OWL in the Wild

OWL provides a wide variety of modeling capabilities for relating information in flexible and powerful ways. We have seen a number of examples of how these constructs can be combined to represent complex relationships among various data sources. In this chapter, we delve into two detailed examples of how OWL can be used in real-world modeling situations.

Our examples feature applications from two government ontologies: one for modeling enterprise architectures and one in the life sciences. In both cases, the semantic model provides a set of reference concepts to be used as a basis for other work. In the first case, the model provides guidance for a description of the enterprise architecture of a government agency. The model has to mediate the simultaneous challenges of providing centralized advice for the development and maintenance of an enterprise architecture (after all, this is the government), while allowing a degree of autonomy for the agencies. We will see how a combination of RDF and OWL can be used to satisfy these requirements. In the life sciences case, the model provides a central repository for a controlled vocabulary. In this case, the challenge is to build and maintain a model that can serve the needs of a widely differentiated community, while still providing some degree of unity in their operation. When you read this chapter, you will not learn a lot about enterprise architecture or cancer research.

This chapter is not about solving the problems essential to these fields but rather about how they can use Semantic Web modeling to bring the advantages of a Web solution to their practice. For example, the value of explicit, executable enterprise architecture is controversial in many circles; we will not resolve this issue here. Once you accept (as the U.S. government has) that executable enterprise architecture is valuable, there still remains the problem of how hundreds of semiautonomous agencies working in a federated way can achieve the value of a distributed representation of their architecture. Allowing each agency the autonomy it needs, while respecting the central commonality required by participation in a single government, is a key challenge to instituting an executable federal enterprise architecture.

In contrast to an executable enterprise architecture, the value of cancer research is not usually contested. But this field also has requirements having to do with distribution of information and work. Each research team around the world has its own approach and methodology for pursuing its research. How can each team have the autonomy it needs to make its advances yet still participate in a worldwide community so that efforts from various teams can be used synergistically? The Semantic Web is about enabling the network effect in these efforts. The purpose of the models described here is to achieve an effective balance between federation (commonality) and autonomy (variability).

In both of these cases, we find once again that a little bit of OWL goes a long way. Most of the value in the federal enterprise architecture case comes from the use of RDF to describe how a centralized model can be extended by multiple agencies. But in addition to the use of RDF to extend the model, there are some constraints on the enterprise architecture that can be described using a little bit of OWL. There is no need for elaborate recombining of models; only a handful of modeling patterns are used again and again. In the case of the cancer research ontology, a centralized model has been created in which OWL is used to describe how the various concepts involved in research into the genetic basis of cancer are related to one another. These definitions, maintained in an ongoing effort by the U.S. National Cancer Institute, provide a common reference point for research teams from around the world to coordinate their results.

THE FEDERAL ENTERPRISE ARCHITECTURE REFERENCE MODEL ONTOLOGY

The federal government in the United States determined that some coherence was necessary among numerous government agencies in terms of information systems, their form, and their content. Toward this end, the government instituted an ongoing effort called the Federal Enterprise Architecture. The idea of an Enterprise Architecture is that it should be possible to describe, in a coherent, formal and machine-readable way, the information systems, components, and information content of a government agency. Even once the agencies represent their enterprise architecture in such a way, there is no guarantee that the decisions made by various agencies will be consistent. For this reason, the government defined the Federal Enterprise Architecture Reference Model (FEA-RM). The idea behind a reference model is that it is not an enterprise architecture itself but is a starting point for someone who plans to design an enterprise architecture. If every agency uses the same reference model as a starting point for its own enterprise architecture, then the hope is that we can guarantee, or at least encourage, some degree of consistency among the architectures of the various agencies.

The first edition of the Federal Enterprise Architecture was delivered as a series of documents written in natural language with an assortment of diagrams. Although the documents were very well organized, it was still possible for an enterprise architect in any particular agency to simply pay lip service to the federal reference model by “spinning” a story about why a nonconformant model is conformant to the reference model. It was felt that a more effective way to deliver the model would be in a formal, machine-readable (and, as far as possible, automatically verifiable) form.

REFERENCE MODELS AND COMPOSABILITY

Toward this end, the U.S. government sponsored a project to cast the FEA-RM into OWL (FEARMO). The FEARMO project chose RDF as the data modeling language to support the composability that is required by a reference model. The reference model itself is represented as an RDF graph; each agency customization is represented as a set of triples, which is merged with this graph. This ensures that each agency has a core structure (based on the FEA-RM) on which they all agree, but at the same time, each agency can add its own extra structure as it sees fit. For example, consider the fragment of the FEARMO shown in Figure 11-1.

The FEA-RM defines a business area called Management of Government Resources. In the original FEA document, components in the model are said to “comprise” one another or to be “comprised of” one another. Figure 11-1 shows that this business area is comprised of five components. Components at this level are called Lines of Business. The line called SupplyChainManagement consists of four other components, called subfunctions.

FIGURE 11-1    Sample business areas from the FEARMO.

The RDF representation of the FEARMO followed the modularity of the textual FEA-RM by defining a number of distinct namespaces for the various parts of the FEA-RM. Although these namespaces are used consistently in the published models, we will take some liberties for the sake of simplicity of diagrams and triples in this description by leaving out the particulars of the namespaces.

An agency can extend the FEA-RM by adding new lines of business in a business area, or new subfunctions to a line of business. Figure 11-2 shows such an extension, in which an agency has added an extra subfunction called FleetManagement to the SupplyChainManagement line of business. This extension is expressed with the single triple:

:SupplyChainManagement :isComprisedOf :FleetManagement.

So far, the FEARMO has not used any feature of OWL at all, only basic RDF. Even at this level, the model provides a valuable service in that it expresses how an agency can extend the model, and it provides a compact way for the agency to express just its extensions.
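The merge described here can be sketched in a few lines of Python, treating an RDF graph as nothing more than a set of (subject, predicate, object) tuples. This is only an illustration under stated assumptions, not how the FEARMO itself is processed: the subfunction name InventoryControl is a placeholder for the subfunctions drawn in Figure 11-1, and components_of is a helper invented for this example.

```python
# An RDF graph is modeled as a set of (subject, predicate, object) tuples.
# Namespaces are omitted, as in the surrounding text.
reference_model = {
    ("ManagementOfGovernmentResources", "isComprisedOf", "SupplyChainManagement"),
    # Placeholder subfunction name (an assumption), standing in for Figure 11-1.
    ("SupplyChainManagement", "isComprisedOf", "InventoryControl"),
}

# The agency publishes only its own extension: the single triple from the text.
agency_extension = {
    ("SupplyChainManagement", "isComprisedOf", "FleetManagement"),
}

# Merging two RDF graphs amounts to taking the union of their triples.
agency_model = reference_model | agency_extension


def components_of(graph, component):
    """Return, sorted, everything that `component` is directly comprised of."""
    return sorted(o for s, p, o in graph if s == component and p == "isComprisedOf")


print(components_of(agency_model, "SupplyChainManagement"))
```

Because the merge is a plain set union, the reference model is left untouched, and every agency that starts from it shares the same core triples.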

FIGURE 11-2    An agency extension to Figure 11-1. This agency has an additional subfunction of SupplyChainManagement for FleetManagement.

RESOLVING AMBIGUITY IN THE MODEL: SETS VERSUS INDIVIDUALS

The bulk of the FEA-RM is made up of tree structures of this sort, in which a number of components are described as consisting of other components. There are four sections to the FEA-RM: the Performance Reference Model, the Business Reference Model (of which a small excerpt was shown in Figure 11-1), the Service Component Reference Model, and the Technology Reference Model.

The original FEA-RM was expressed in English. One of the challenges of recasting an informal model (expressed in natural language) into a formal model (e.g., expressed in OWL) is sorting out the ambiguities in the informal model. One recurring source of ambiguity in the FEA-RM is the distinction between sets of components and components. For example, in Figure 11-1, should we view SupplyChainManagement as a set of four subfunctions, or should we see it as a component in its own right, with properties of its own (e.g., a responsible party or a budget line item)? Both of these viewpoints are viable and useful in the model. How do we deal with this?

In the FEARMO, this ambiguity is dealt with by noticing that in a graph like the one in Figure 11-1, we can view every node as a component but still acknowledge that there is utility in explicitly naming “the set of all things that ManagementOfGovernmentResources is comprised of.” Fortunately, it is a simple matter to express such a class using a hasValue restriction, as shown in Figure 11-3. We need to define an inverse for isComprisedOf to make the restriction; comprises is an obvious name for this property.

FIGURE 11-3    OWL definition of “the set of all things that ManagementOfGovernmentResources is comprised of.”

This pattern should be familiar by now; we saw similar uses of hasValue in Chapter 9 (pages 202, 204), where we defined “The children of Shakespeare” and “the terms narrower than Milk.” The hasValue restriction gives us a simple way to move from an individual (like ManagementOfGovernmentResources, Milk, or Shakespeare) to the set of individuals related to it by some given property.

When we combine Figures 11-2 and 11-3 and show the inferences entailed by OWL, we get the set of triples shown in Figure 11-4. The owl:hasValue restriction ensures that the members of the class LOB_ManagementOfGovernmentResources are exactly the individuals that comprise the business area ManagementOfGovernmentResources. Since these relationships are inferred, they will be maintained even when new members of the class LOB_ManagementOfGovernmentResources are asserted, or if we learn of a new line of business that comprises ManagementOfGovernmentResources.

There are ample opportunities in the FEARMO to use this pattern; in fact, at each point in any of the FEA-RM trees, we can refer to the set of components of which some other component is comprised. Since the FEARMO is intended to be a reference model, any of these sets are likely to be used in some agency extension. The FEARMO therefore includes definitions like the one shown in Figure 11-3 for every intermediate node in any tree. (This design pattern is a special case of the relationship transfer pattern we saw previously.) In this case, the pattern transfers a relationship to some individual (in this example, the relationship comprises for ManagementOfGovernmentResources) to the rdf:type relation for a specified class (in this example, LOB_ManagementOfGovernmentResources). This particular use of the relationship transfer pattern is so specific and pervasive that we give it its own name. Since it defines a class whose membership tracks a property of an individual, we call it the Class-Individual Mirror pattern.
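To make the Class-Individual Mirror pattern concrete, the following Python sketch imitates the two inferences involved: the inverse relationship between isComprisedOf and comprises, and the two directions of the hasValue restriction. It is a hand-written toy under stated assumptions, not an OWL reasoner; the function name mirror_closure and the dictionary encoding of the restriction are inventions of this example.

```python
def mirror_closure(triples, mirrors):
    """Apply the inverse-property and hasValue rules until nothing new appears.

    `mirrors` maps a class name to the individual it mirrors; the class stands
    for the restriction "comprises hasValue <individual>".
    """
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            # comprises is the inverse of isComprisedOf.
            if p == "isComprisedOf":
                new.add((o, "comprises", s))
            elif p == "comprises":
                new.add((o, "isComprisedOf", s))
        for cls, individual in mirrors.items():
            for s, p, o in graph:
                # hasValue, one direction: x comprises v  =>  x rdf:type C
                if p == "comprises" and o == individual:
                    new.add((s, "rdf:type", cls))
                # hasValue, other direction: x rdf:type C  =>  x comprises v
                if p == "rdf:type" and o == cls:
                    new.add((s, "comprises", individual))
        if new <= graph:
            return graph
        graph |= new


mirrors = {"LOB_ManagementOfGovernmentResources": "ManagementOfGovernmentResources"}
asserted = {
    ("ManagementOfGovernmentResources", "isComprisedOf", "SupplyChainManagement"),
    ("SupplyChainManagement", "isComprisedOf", "FleetManagement"),
}
inferred = mirror_closure(asserted, mirrors)
members = sorted(
    s for s, p, o in inferred
    if p == "rdf:type" and o == "LOB_ManagementOfGovernmentResources"
)
print(members)
```

Note that FleetManagement does not become a member of the class: it comprises SupplyChainManagement, not ManagementOfGovernmentResources, and nothing in the pattern makes comprises transitive.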

FIGURE 11-4    After inferencing. The members of the class LOB_ManagementOfGovernmentResources are exactly the individuals that comprise ManagementOfGovernmentResources.

CONSTRAINTS BETWEEN MODELS

The FEA-RM describes several layers of components in the four reference models: for Business, Performance, Service Components, and Technology. This is in itself a remarkable undertaking, since these models identify hundreds of specific entities that can play a role in the enterprise architecture of an agency. But the FEA-RM goes even further to describe constraints between these models.

For example, one level of the Performance Reference Model describes a number of Measurement Categories. Instead of specifying all the things that each measurement category is comprised of (as we saw for the Business Reference Model in Figure 11-1), for certain measurement categories, the Performance Reference Model stipulates that they are comprised of things that come from the Business Reference Model.

Let’s look at a specific example. The Performance Reference Model defines a Measurement Area called the Mission and Business Results Measurement Area (this is called MA_MissionAndBusinessResults in the FEARMO). Rather than listing the components that comprise it, the PRM stipulates that the things that comprise it are exactly the lines of business that comprise three specific business areas, including the Management of Government Resources business area outlined in Figures 11-1 through 11-4. To state this as a constraint in natural language, we have

Anything that comprises ManagementOfGovernmentResources should also comprise MA_MissionAndBusinessResults.

How can this be modeled in OWL?

Fortunately, since the FEARMO already uses the Class-Individual Mirror pattern throughout the model, there is a simple way to express this relationship directly in OWL. That is, it already has a definition (through Class-Individual Mirror) of “the set of all individuals that comprise ManagementOfGovernmentResources” (in FEARMO, that class is called LOB_ManagementOfGovernmentResources). It also already has a definition of “the set of all individuals that comprise MA_MissionAndBusinessResults” (in FEARMO, this class is called prm:LineOfBusinessMeasurementCategory). Given that these two classes have already been defined, and recalling that the type propagation rule of rdfs:subClassOf means that members of one class will be inferred to also be members of the other, FEARMO can model the constraint that all individuals that comprise ManagementOfGovernmentResources should also comprise MA_MissionAndBusinessResults with the single triple

LOB_ManagementOfGovernmentResources rdfs:subClassOf prm:LineOfBusinessMeasurementCategory.

Let’s have a closer look at how this works.

First, we look at the definition of prm:LineOfBusinessMeasurementCategory in Figure 11-5. This is defined with the same pattern as the one in Figure 11-3. Any individual that comprises prm:MA_MissionAndBusinessResults is a member of this class, and vice versa.

Now suppose we have a new individual that comprises ManagementOfGovernmentResources. The chain of inferences is shown in Figure 11-6. Reading counterclockwise from the bottom of the figure, we have the new line of business that comprises ManagementOfGovernmentResources. Because of the Class-Individual Mirror pattern for LOB_ManagementOfGovernmentResources (detailed in Figure 11-3 and summarized here in Figure 11-6), we can infer that the new line of business is a member of (i.e., has rdf:type) LOB_ManagementOfGovernmentResources. Since we just asserted that LOB_ManagementOfGovernmentResources is rdfs:subClassOf prm:LineOfBusinessMeasurementCategory, we can infer that the new line of business is also a member of prm:LineOfBusinessMeasurementCategory. Now we can use the Class-Individual Mirror pattern again but this time to infer that the new line of business comprises MA_MissionAndBusinessResults, as desired.
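The chain of inferences just described can be traced with a small Python sketch. It hand-codes only the three rules involved (the two directions of the mirror pattern and the type propagation of rdfs:subClassOf) and is not a substitute for an OWL inference engine. The individual NewLineOfBusiness and the function infer are hypothetical names introduced for this illustration.

```python
def infer(triples, mirrors, subclass_of):
    """Forward-chain the mirror rules and rdfs:subClassOf type propagation."""
    graph = set(triples)
    while True:
        new = set()
        for s, p, o in graph:
            if p == "comprises":
                # Mirror pattern: x comprises v  =>  x rdf:type C
                for cls, individual in mirrors.items():
                    if o == individual:
                        new.add((s, "rdf:type", cls))
            elif p == "rdf:type":
                # Mirror pattern, reversed: x rdf:type C  =>  x comprises v
                if o in mirrors:
                    new.add((s, "comprises", mirrors[o]))
                # Type propagation: x rdf:type A, A subClassOf B  =>  x rdf:type B
                for sub, sup in subclass_of:
                    if o == sub:
                        new.add((s, "rdf:type", sup))
        if new <= graph:
            return graph
        graph |= new


mirrors = {
    "LOB_ManagementOfGovernmentResources": "ManagementOfGovernmentResources",
    "prm:LineOfBusinessMeasurementCategory": "MA_MissionAndBusinessResults",
}
subclass_of = {
    ("LOB_ManagementOfGovernmentResources", "prm:LineOfBusinessMeasurementCategory"),
}
# Hypothetical new line of business added by an agency.
asserted = {("NewLineOfBusiness", "comprises", "ManagementOfGovernmentResources")}

result = infer(asserted, mirrors, subclass_of)
for triple in sorted(result - asserted):
    print(triple)
```

The three inferred triples correspond to the three steps in the figure: membership in LOB_ManagementOfGovernmentResources, membership in prm:LineOfBusinessMeasurementCategory, and finally the comprises link to MA_MissionAndBusinessResults.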

FIGURE 11-5    LineOfBusinessMeasurementCategory is defined using the Class-Individual Mirror pattern. The Class is called LineOfBusinessMeasurementCategory; the individual is MA_MissionAndBusinessResults. The Restriction is an anonymous class, shown here as a formulaic node A34.

FIGURE 11-6    Inferences for a new Line Of Business that comprises ManagementOfGovernmentResources.

OWL AND COMPOSITION

As a reference model, one of the main values of the FEARMO is the capability to combine it with agency models in a modular way. As a complex document, the original FEA-RM was already divided into four sections, each of which has value on its own and also as part of an integrated whole. This is not an unusual situation in software deployment. Most software languages have features for managing modularity of this sort. OWL is no different in this regard, and it has language features for modularizing semantic models. These language features have no semantics for the model (they allow no new triples to be inferred), but they help us, as humans, to organize a model in a modular way.

owl:Ontology

OWL provides a built-in class whose members correspond to modular parts of a semantic model. It is customary for the URI of an Ontology to correspond to the URL of the file on the web where the ontology is stored. This makes use of a slightly different syntax in N3 than we have used so far. It is possible to spell out a URI by enclosing it in angle brackets:

<http://www.workingontologist.com/Examples/ch14/shakespeare.owl> a owl:Ontology.

Unlike the other constructs in OWL, the meaning of membership in owl:Ontology is not given by inference. In fact, one could say that it has no formal meaning at all. Informally, an instance of owl:Ontology corresponds to a set of RDF triples. In particular, it corresponds to exactly the triples that are stored in the file that is found at the URL specified by the URI of the Ontology instance. There is no connection in the model between an instance of owl:Ontology and the triples to which it corresponds.

Although such an individual has no significance from the point of view of model semantics, it can be quite useful when specifying modularity of semantic models. The primary way to specify modularity is with the property owl:imports.

owl:imports

This is a property that connects two instances of the class owl:Ontology. Just as is the case with owl:Ontology itself, no inferences are drawn based on owl:imports. But the meaning in terms of modularity of models is clear: When any system loads triples from the file corresponding to an instance of owl:Ontology, it can also find any file corresponding to an imported ontology and load that as well. This load can, in turn, trigger further imports, which trigger further loads, and so on. There is no need to worry about the situation in which there is a circuit of imports (e.g., prm imports brm imports fea imports prm). A simple policy of taking no action when a file is imported for a second time will guarantee that no vicious loops will occur. The resulting set of triples is the union of all triples in all imported files.
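The loading policy described above is easy to sketch. In the following Python fragment, the dictionary ONTOLOGIES is a hypothetical stand-in for files on the Web, the function load_with_imports is an invented name, and the triples are placeholders; only the circuit prm, brm, fea is taken from the text. The point is that remembering which ontologies have already been loaded is enough to make the circuit harmless.

```python
# Hypothetical stand-in for files on the Web: name -> (imported names, triples).
# The circuit prm -> brm -> fea -> prm is the one mentioned in the text;
# the triples themselves are placeholders.
ONTOLOGIES = {
    "prm": (["brm"], {("prm:A", "rdf:type", "owl:Class")}),
    "brm": (["fea"], {("brm:B", "rdf:type", "owl:Class")}),
    "fea": (["prm"], {("fea:C", "rdf:type", "owl:Class")}),
}


def load_with_imports(name, ontologies, loaded=None):
    """Return the union of the triples of `name` and everything it imports."""
    if loaded is None:
        loaded = set()
    if name in loaded:
        # Second import of the same file: take no action.
        return set()
    loaded.add(name)
    imports, triples = ontologies[name]
    result = set(triples)
    for imported in imports:
        result |= load_with_imports(imported, ontologies, loaded)
    return result


all_triples = load_with_imports("prm", ONTOLOGIES)
print(len(all_triples))
```

Starting the load from any of the three ontologies yields the same union of triples, since each one reaches the other two through the circuit.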

In the case of FEARMO, there is a somewhat elaborate import structure, as shown in Figure 11-7. The four main divisions of FEARMO are called srm, prm, brm, and trm. The rdfs:subClassOf triples that connect the PRM to the BRM, as illustrated in Figure 11-6, are included in a model called BRM2PRM, which, naturally enough, imports brm and prm. The srm imports brm and prm, and everything imports some common triples from a module called feac. Any part of this structure can be referenced independently; all the necessary modules can then be found by tracing the owl:imports links from one ontology to the next.

FIGURE 11-7    Import structure of FEARMO.

Although owl:imports is the workhorse of model modularity, OWL includes a handful of properties for version control. They also have no meaning for the inference semantics of OWL and so have no significance in terms of modeling, but they are useful to OWL as a computer language. (Note that only minimal support for these constructs is provided in most OWL tools, and they are not widely used.)

These are, for the most part, self-explanatory:

versionInfo: An annotation property for specifying version information, either human readable or for use by other version control systems.

priorVersion: Refers one ontology to another ontology that is a prior version.

backwardCompatibleWith: Like priorVersion but further states the new ontology is backward compatible with the previous one.

incompatibleWith: Like priorVersion but further states that the new ontology is incompatible with the previous one.

DeprecatedClass and DeprecatedProperty: Used to specify that a class or property, respectively, is deprecated in a particular version (and should no longer be used).

ADVANTAGES OF THE MODELING APPROACH

The inferences in Figure 11-6 follow from the OWL standard and can be done automatically by any OWL inference engine. What is the advantage of modeling the enterprise architecture like this?

To see the advantages of modeling in this way, we need to examine the alternatives. How else might the constraints between the performance and business reference models be maintained? One way to maintain this relationship is through work practice. Whenever a new line of business is established that comprises ManagementOfGovernmentResources, a person could be given the task to make a corresponding update to the MA_MissionAndBusinessResults measurement area. This solution requires documentation of that work practice and a reliance that it will continue to be done in the same way, even if personnel in the organization change. This is difficult to achieve in practice.

Another way to maintain the relationship would be to write a special-purpose program that watches for additions to the business model and makes corresponding changes to the performance model. It is easier to keep such a system working in the face of new personnel, but it has the disadvantage that because the solution is written in general-purpose program code, it is difficult to maintain and evolve the software or to make certain that the process is done the same way throughout the work flow. The relationship between the business and performance models is not explicitly stated anywhere, and it is difficult for future personnel to maintain software they don’t understand.

In the presence of an inference engine, the modeling solution given by the FEARMO is very similar to the programmatic solution. The inference engine plus the model together constitute a program that takes the appropriate action. Whenever a new line of business is established, a corresponding measurement area is also updated. The difference is that the FEARMO makes explicit the relationship between the business and performance reference models in a way that is separate from any other processing around the enterprise architecture.

The code that supports this constraint is not embedded in a general-purpose language with the rest of the processing of data or user interfaces, or any other aspect of an agency’s information system. The constraint is maintained by a standard inference engine. The relationship is expressed, in this case, in a single statement whose meaning is given by standard semantics.

This advantage is especially meaningful in the context of a reference model like FEA-RM. The utility of a reference model lies in its extensibility. When an agency makes an addition to the model (e.g., by adding a new line of business, which might just comprise ManagementOfGovernmentResources), it should make a corresponding addition to the performance measurement areas. How can this stipulation be unambiguously communicated and enforced? Custom code is not a solution to this problem at all—no single piece of code specified centrally (i.e., by the federal government) can be expected to run in the context of every agency’s systems. A semantic model can make such a specification and can do it unambiguously because of the standard meaning of the constructs in OWL. An agency can choose to enforce it in any way that it likes, as long as it respects the formal meaning of the model. Thus, one agency might choose to use, say, an Oracle implementation of OWL, while another might use some other OWL reasoner built by a custom contractor, but the semantic model of OWL guarantees interoperability between them. A special-purpose program does not provide this capability.

THE NATIONAL CANCER INSTITUTE ONTOLOGY

The NCI Thesaurus is a public domain terminology produced by the U.S. National Cancer Institute (one of the National Institutes of Health). It is currently released in a number of forms, including OWL encoding. OWL is a natural model for this vocabulary, as we shall see, because it provides a means for specifying in a formal and unambiguous way the relationships between terms.

The need for a comprehensive NCI-wide terminology arose because NCI staff require access to timely and accurate information about activities related to the scientific mission of the Institute. The collection, storage, and retrieval of data related to NCI research programs are necessary to analyze, manage, and report about these activities. Although centralized coding of NCI-supported research-related activities met some of these needs, supplementary data coding had become common. This coding was assigned independently within various components of the Institute and was frequently based on locally developed term lists or other informal vocabulary, making it difficult to find and combine information across programs.

The NCI source vocabulary within the NCI ontology encompasses the terminology used by the various offices and divisions within the Institute, with the goal of providing a common vocabulary to increase the interoperability of information systems. The NCI vocabulary provides not only an initial Institute-wide integrated vocabulary but also rich mappings of NCI terminology to numerous other biomedical vocabularies.

The NCI ontology itself does not take advantage of the distributed nature of the Semantic Web in that it is stored and published in the form of one very large file, with all the class and property definitions within it. This has the advantage that it makes it easier to keep the ontology consistent and to do version control (a new version is released monthly), but it comes with some cost. At present, the NCI employs several full-time workers to maintain the ontology and uses a complex work flow control system to manage the builds.

The NCI ontology primarily provides class definitions (and relationships between classes) that can be used by others to link their data. By the middle of 2007, the ontology had over 50,000 class definitions, and it has been growing by several thousands of classes a year over the past few years.

REQUIREMENTS OF THE NCI ONTOLOGY

Cancer research draws on a number of disciplines in the life sciences, including genetics, chemistry, and biology (among many others). Research in each of these fields includes a wide variety of specialized terminology. For this research to yield actionable results, some connection among the various fields must be made in a systematic way. But because of the complexity of each field, it is difficult to track what information in one field is relevant in another.

A small example of the situation is shown in Figure 11-8. The figure shows fragments of the terminology hierarchies for Genes, Species, and Biological Processes. In addition to listing terms in each of these areas, the NCI ontology also specifies that the special case of Gene called Oncogene occurs only in the species Human, as opposed to a number of other possible species in which other genes may occur. Furthermore, the genes in the more specific class called Oncogenes_Protein_Kinase have functions Protein_Phosphorylation and Signal_Transduction. Finally, the even more specific Oncogene_ErbB2 is associated with the disease Adenocarcinoma. In managing this ontology, it is important to note that there are tens of thousands of terms from several different disciplines and it is quite a daunting task to track all of these associations.

FIGURE 11-8    Parts of the NCI ontology and a few relationships between them. The meaning of each link depends on the particular taxonomies that are being linked—for example, an Oncogene occurs in species Human.

Simple lists of corresponding terms cannot effectively address this problem. In the example in Figure 11-8, it isn’t only the term Oncogene that is associated with the species Human but indeed every term below it in the terminology tree. The tree structure of each terminology space, as well as the ability to link the spaces together, is essential for the effective management of terminology for cancer research.
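A short Python sketch shows why the tree structure matters. The parent links below follow the gene fragment described for Figure 11-8; the link labels are informal placeholders rather than property names from the NCI ontology, and the function associations is an invented helper. An association stated once on a term is picked up by every term below it.

```python
# Parent links for the gene fragment described in the text.
PARENT = {
    "Oncogene": "Gene",
    "Oncogenes_Protein_Kinase": "Oncogene",
    "Oncogene_ErbB2": "Oncogenes_Protein_Kinase",
}

# Associations stated directly on a term. The link labels are informal
# placeholders (an assumption), not identifiers from the NCI ontology.
DIRECT = {
    "Oncogene": [("occurs in species", "Human")],
    "Oncogenes_Protein_Kinase": [
        ("has function", "Protein_Phosphorylation"),
        ("has function", "Signal_Transduction"),
    ],
    "Oncogene_ErbB2": [("associated with disease", "Adenocarcinoma")],
}


def associations(term):
    """Collect associations stated on `term` or on any term above it."""
    found = []
    while term is not None:
        found.extend(DIRECT.get(term, []))
        term = PARENT.get(term)
    return found


for link, target in associations("Oncogene_ErbB2"):
    print(link, target)
```

A flat list of term pairs would have to repeat the link to Human for every descendant of Oncogene; walking the tree makes the single statement sufficient.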

UPPER-LEVEL CLASSES

The NCI ontology is organized into several high-level classes that correspond to the various kinds of things that it describes. Each of these high-level classes is called a Kind. Each kind is defined as an OWL class that can be used to organize many subclasses. Some of the kinds are related to the biological aspects of oncology—for example:

NCI:Organism_Kind a owl:Class.

NCI:Gene_Kind a owl:Class.

and others. In addition, there are more general kinds that are used for classifying treatments and processes in cancer care, for example:

NCI:Chemicals_and_Drugs_Kind a owl:Class.

NCI:Clinical_or_Research_Activity_Kind a owl:Class.

NCI:Chemotherapy_Regimen_Kind a owl:Class.

and finally, some that are used for classifying things used in cancer research or treatment, such as:

NCI:Equipment_Kind a owl:Class.

NCI:Technique_Kind a owl:Class.

There’s also a kind for those things that don’t really fit into other kinds or which are specific to NCI research:

NCI:NCI_Kind a owl:Class.

The kinds are linked to each other by a set of properties and their domains and ranges—for example:

NCI:Gene_In_Chromosomal_Location a rdf:Property;
    rdfs:domain NCI:Gene_Kind;
    rdfs:range NCI:Anatomy_Kind.

to assert that chromosomal locations link genes to parts of the anatomy.
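The effect of these domain and range statements under the standard RDFS rules can be sketched in Python as follows. The resources ex:someGene and ex:someLocation are hypothetical, introduced only to have a triple to reason about, and infer_types is an invented helper rather than part of any library.

```python
# Domain and range as declared in the text for one property.
DOMAIN = {"NCI:Gene_In_Chromosomal_Location": "NCI:Gene_Kind"}
RANGE = {"NCI:Gene_In_Chromosomal_Location": "NCI:Anatomy_Kind"}


def infer_types(triples):
    """Apply the RDFS domain and range rules to a set of triples."""
    inferred = set()
    for s, p, o in triples:
        if p in DOMAIN:
            # rdfs:domain: the subject of the property belongs to the domain class.
            inferred.add((s, "rdf:type", DOMAIN[p]))
        if p in RANGE:
            # rdfs:range: the object of the property belongs to the range class.
            inferred.add((o, "rdf:type", RANGE[p]))
    return inferred


# Hypothetical data: the ex: names are assumptions made for this example.
data = {("ex:someGene", "NCI:Gene_In_Chromosomal_Location", "ex:someLocation")}
for triple in sorted(infer_types(data)):
    print(triple)
```

Anything that appears as the subject of the property is inferred to belong to NCI:Gene_Kind, and anything that appears as its object to NCI:Anatomy_Kind, which is how the property ties the two kinds together.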

As we saw in Figure 11-8, the primary use of the ontology is to define more specific classes and to put constraints on how each property can relate classes in one tree to classes in another. How are the relationships shown in Figure 11-8 expressed in OWL? Let’s take a closer look at the example of Oncogene_ErbB2:

NCI:Oncogene_ErbB2 a owl:Class;
    rdfs:subClassOf NCI:Gene_Kind;
    rdfs:subClassOf
        [a owl:Restriction;
         owl:onProperty NCI:Gene_Found_In_Organism;
         owl:someValuesFrom NCI:Human];
    rdfs:subClassOf
        [a owl:Restriction;
         owl:onProperty NCI:Gene_In_Chromosomal_Location;
         owl:someValuesFrom NCI:_17q21_1];
    rdfs:subClassOf
        [a owl:Restriction;
         owl:onProperty NCI:Gene_Has_Function;
         owl:someValuesFrom NCI:Protein_Phosphorylation];
    rdfs:subClassOf
        [a owl:Restriction;
         owl:onProperty NCI:Gene_Has_Function;
         owl:someValuesFrom NCI:Signal_Transduction];
    rdfs:subClassOf
        [a owl:Restriction;
         owl:onProperty NCI:Gene_is_Biomarker_Type;
         owl:someValuesFrom NCI:Tumor_Marker];
    rdfs:subClassOf
        [a owl:Restriction;
         owl:onProperty NCI:Gene_Associated_With_Disease;
         owl:someValuesFrom NCI:Adenocarcinoma].

The triples in this example are shown graphically in Figure 11-9. In this way, the primary classification hierarchy is used to specify classes and the restrictions on the properties between them.

The pattern we see six times in Figure 11-9 (using rdfs:subClassOf and owl:someValuesFrom) occurs roughly 50,000 times in the NCI thesaurus; that’s roughly once per class. This is the predominant modeling pattern in this ontology. So what does it mean in this context, and how is it used?

To see how this pattern is used in the NCI ontology, let’s have a close look at a small part of the ontology: a pair of mappings from the Gene_Kind to the Biological_Process_Kind (see Figure 11-10). In this case, we are looking at two applications of the design pattern that maps one tree to another. In Figure 11-10, we see something of an odd situation: a Cancer_Gene has function Tumorigenesis, while an Oncogene has function Oncogenesis. This seems odd because although Cancer_Gene is more general than Oncogene, Tumorigenesis is less general than Oncogenesis. When we draw it in a diagram, the mappings cross one another.

FIGURE 11-9    Definition of Oncogene_ErbB2. Each owl:Restriction class is shown here as a single node with a label in the Manchester syntax. Each one is a someValuesFrom restriction class, restricting a property to values from a particular class.

FIGURE 11-10    Mapping from the Gene_Kind to the Biological_Process_Kind. The links correspond to the Has_Function property in the ontology. Notice that the mapping “crosses levels,” a higher-level Gene class is mapped to a lower-level process class.

Just how odd is this? Is it something to worry about? What should be done about it? One value that OWL brings to a modeling effort like the NCI is clarity of the logical meaning of mappings of this sort. When we model this situation in OWL, we give it a formal meaning. More important, we can use that formal meaning to understand just what, if anything, is odd about the situation in which the mappings cross as they do in this case. More precisely, the formalism allows us to determine a formal description of the situation so that we have a clear understanding of what informally could only be understood as vaguely “odd.”

In the NCI ontology, each of these mappings was represented with owl:someValuesFrom in a manner similar to what we see in Figure 11-9. Figure 11-11 shows a closer look at the Oncogene/Cancer_Gene situation and how it was modeled in OWL.

FIGURE 11-11    Representation of Figure 11-10 in OWL, NCI Ontology v. 3.09d.

Each of the gene classes is defined to be a subclass of a restriction class on the property Gene_Has_Function. Notice that the subclass relationship between the biological processes points in the opposite direction from the one between the genes; the more general gene is mapped to the more specific process.

What conclusions can we draw from this diagram? If we think this looks odd, perhaps it is because the model is inconsistent and there is an unsatisfiable class. As it happens, this model has none of these problems. Every class in Figure 11-11 is satisfiable, and the model is consistent. To see this, consider a single Oncogene that has a single Tumorigenesis function; both gene classes are nonempty (they contain the gene), the two genesis classes are nonempty (they contain the function), and every member of each restriction class is known to satisfy the restriction. The model is in fine shape.
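The witness described above can be written down as instance data. The following Turtle is a minimal sketch: the namespace IRI and the individual names :gene_1 and :function_1 are hypothetical, and the class and property names are taken from the description of Figure 11-11 rather than from the NCI ontology files themselves.

```turtle
@prefix : <http://example.org/nci#> .   # hypothetical namespace, for illustration only

# One gene with one function is enough to populate every class in the figure.
:gene_1 a :Oncogene ;
    :Gene_Has_Function :function_1 .

:function_1 a :Tumorigenesis .

# Given Oncogene subClassOf Cancer_Gene, :gene_1 is also a Cancer_Gene.
# Given Tumorigenesis subClassOf Oncogenesis, :function_1 is also an Oncogenesis.
# So :gene_1 has some function in Tumorigenesis and some function in Oncogenesis,
# which satisfies both someValuesFrom restrictions.
```

Because a single pair of individuals satisfies every axiom, no class is forced to be empty.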

Let’s investigate a bit more closely. What else can we see from this model? We know that the subClassOf relationship can propagate through a someValuesFrom restriction. In this case, since Tumorigenesis is a subclass of Oncogenesis, we can conclude that one of the restriction classes is actually a subclass of the other, as shown in Figure 11-12.

FIGURE 11-12    The subclass relation between Tumorigenesis and Oncogenesis propagates to the corresponding restrictions (dashed line).

What can we infer from this situation? It has already been asserted that Oncogene is a subclass of Cancer_Gene, which is a subclass of the Tumorigenesis restriction class, which is a subclass of the Oncogenesis restriction class. According to the type propagation rule for subclass (see Chapter 5), we already know that Oncogene is a subclass of the Oncogenesis restriction class. That is, the assertion in Figure 11-11 that Oncogene is a subclass of the Oncogenesis restriction class is redundant.
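The chain of reasoning can be laid out in Turtle. This is a sketch reconstructed from the description of Figures 11-11 and 11-12, not a copy of the NCI source; the namespace IRI is a hypothetical placeholder, and the inferred statements are shown only as comments.

```turtle
@prefix :     <http://example.org/nci#> .   # hypothetical namespace
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Asserted (as described for Figure 11-11)
:Oncogene      rdfs:subClassOf :Cancer_Gene .
:Tumorigenesis rdfs:subClassOf :Oncogenesis .

:Cancer_Gene rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :Gene_Has_Function ;
      owl:someValuesFrom :Tumorigenesis ] .

:Oncogene rdfs:subClassOf            # the assertion that turns out to be redundant
    [ a owl:Restriction ;
      owl:onProperty :Gene_Has_Function ;
      owl:someValuesFrom :Oncogenesis ] .

# Inferred (Figure 11-12): because Tumorigenesis subClassOf Oncogenesis,
#   (Gene_Has_Function some Tumorigenesis) subClassOf (Gene_Has_Function some Oncogenesis)
# Following the subclass chain from Oncogene:
#   Oncogene -> Cancer_Gene -> (some Tumorigenesis) -> (some Oncogenesis)
# which already yields the second assertion above.
```

Deleting the second restriction axiom would leave the set of entailed triples unchanged, which is what makes the assertion redundant.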

Redundancy in a model is not problematic; after all, any query or inference engine will treat an inferred triple the same as an asserted triple. Since the NCI ontology acts as a record of terminology decisions made by the NCI ontology committee, a situation like this does make you wonder if there might be a mistake somewhere. Why did someone feel the need to assert a (redundant) connection between Oncogene and Oncogenesis? Were they unaware that it was redundant? Is it a mistake that this was a redundant assertion? Did they intend something more specific but were unable to express it in the current model?

We now have a better handle on our vague notion of “odd”—someone asserted something that could have been easily inferred. This could indicate an error in thinking or communication that could have an impact on the model. Now we have some idea what to investigate.

The resolution of questions of this sort, like the resolution of questions of consistency or satisfiability, cannot be done with OWL itself but must be considered by the authors and intended users of the model. In this case, a more recent version of the NCI ontology shows that indeed this situation has been rectified, as shown in Figure 11-13.

FIGURE 11-13    Newer version of Oncogene fragment of the NCI ontology, v. 7.05e, ca. 2007.

The relationship between Oncogene and Cancer_Gene is unchanged from the previous version, as are the subclass relationships with the restriction classes. But the relationships of the target classes Tumorigenesis and Oncogenesis have changed. Neither is now a subclass of the other, but they are both subclasses of the common superclass Pathogenesis. This resolves the redundancy from the earlier version while making a more specific statement about the relationships between the terms.

A further relationship between Tumorigenesis and Oncogenesis appears in the new version (but is not shown in Figure 11-13), in which the process Tumorigenesis is described as part of the process Oncogenesis, preserving the intended relationship between these two processes. The new model still uses two occurrences of the mapping pattern from a class in the Gene tree to a class in the Process tree, but in the new version, the anomaly of the crossing mappings has been resolved. Because of the semantics of OWL, we know exactly what aspects of the model were changed between the earlier and later versions.
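For comparison, the revised fragment might be written as follows. This is a sketch based only on the description of Figure 11-13: the namespace IRI is a hypothetical placeholder, the property name Gene_Has_Function is carried over from the description of the earlier version, and the part-of relationship between the two processes is left out because the text does not name the property used for it.

```turtle
@prefix :     <http://example.org/nci#> .   # hypothetical namespace
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Unchanged from the earlier version
:Oncogene rdfs:subClassOf :Cancer_Gene .

:Cancer_Gene rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :Gene_Has_Function ;
      owl:someValuesFrom :Tumorigenesis ] .

:Oncogene rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :Gene_Has_Function ;
      owl:someValuesFrom :Oncogenesis ] .

# Changed: the two processes are now siblings under a common superclass
:Tumorigenesis rdfs:subClassOf :Pathogenesis .
:Oncogenesis   rdfs:subClassOf :Pathogenesis .

# With no subclass link between Tumorigenesis and Oncogenesis, neither
# restriction class is a subclass of the other, so the restriction on
# Oncogene is no longer derivable from the restriction on Cancer_Gene.
```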

DESCRIBING CLASSES IN THE NCI ONTOLOGY

Many of the classes in the NCI ontology correspond to genes (in particular, certain subclasses of the Gene_Kind class). Because of their importance in the life sciences, genes have been identified by a number of classification systems like Swiss_Prot and GenBank. It is essential for the interoperability of the NCI ontology that these identifiers be associated with the genes in the ontology whenever they are known. The obvious solution to this is to assert triples such as

:FABP3_Gene a :Gene_Kind;
    :Swiss_Prot "P05413".

What inferences should we expect from such a statement? Since FABP3_Gene is a class, it could have subclasses. Would they or should they share the Swiss_Prot number of FABP3_Gene? The answer is certainly not! The Swiss_Prot number is supposed to be an identifier of a particular gene.

In Chapter 13 we will discuss the logical details of making assertions of this sort about classes, but at this point all we need to observe is that the property Swiss_Prot is deliberately meant to stay out of inferencing. In OWL, we can indicate that a property is not to be used for inferencing by asserting that it is an AnnotationProperty, thus:

:Swiss_Prot a owl:AnnotationProperty.

By making this declaration, we inform readers of the model as well as inference engines that this property is intended to add extra information to a class without having any impact on inferencing.

INSTANCE-LEVEL INFERENCING IN THE NCI ONTOLOGY

The combination of rdfs:subClassOf and owl:someValuesFrom is pervasive in the NCI ontology, but it does not entail any inferences about individual members of classes. Figure 11-14 shows an example of why this is the case.

Suppose we were to assert membership of two instances: Gene_001 in Gene_Kind and Patient001 in Human, as shown. Since there is some value in the class Human on the property Gene_Found_In_Organism for Gene_001, an OWL inference engine would infer that it is indeed a member of the restriction class as shown with the dashed line in Figure 11-14. But that is as far as the inferencing can go; no inference rules apply at this point. In particular, the type propagation rule for subclass does not apply. So far, we have inferred that Gene_001 is a member of the superclass; the type propagation rule only applies if we know that it is a member of the subclass.
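The situation can be sketched in Turtle as follows. The namespace IRI is a hypothetical placeholder, and the triple linking Gene_001 to Patient001 is an assumption read off the description of the figure; the inferred statement is given as a comment.

```turtle
@prefix :     <http://example.org/nci#> .   # hypothetical namespace
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Class-level pattern from the ontology (one of the Oncogene_ErbB2 restrictions)
:Oncogene_ErbB2 rdfs:subClassOf
    [ a owl:Restriction ;
      owl:onProperty :Gene_Found_In_Organism ;
      owl:someValuesFrom :Human ] .

# Instance-level assertions
:Gene_001   a :Gene_Kind ;
    :Gene_Found_In_Organism :Patient001 .   # assumed link between the two instances
:Patient001 a :Human .

# Inferred: :Gene_001 is a member of (Gene_Found_In_Organism some Human).
# Not inferred: :Gene_001 a :Oncogene_ErbB2 .
# rdfs:subClassOf carries membership from subclass to superclass only, and the
# restriction class is the superclass here.
```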

Perhaps it is not surprising that no useful instance-level inferences follow from the structure of the NCI ontology, since the NCI ontology was built as a way of managing terminology in cancer research and not the progress of individual patients. Nevertheless, more recent work on the NCI ontology has refined certain definitions to allow for more specific, instance-level inferencing.

FIGURE 11-14    Potential instance level assertions in the NCI Ontology. Solid lines are asserted triples; dotted lines are inferred triples.

For instance, the definition of the Oncogene ErbB2 in the newer version of NCI is given by

:Oncogene_ErbB2 owl:equivalentClass
    [a owl:Class;
     owl:intersectionOf
        (:ERB_Oncogene_Family
         [a owl:Restriction;
          owl:onProperty :Allele_In_Chromosomal_Location;
          owl:someValuesFrom :_17q21_1]
         [a owl:Restriction;
          owl:onProperty :Gene_Found_In_Organism;
          owl:someValuesFrom :Human]
         [a owl:Restriction;
          owl:onProperty :Gene_Plays_Role_In_Process;
          owl:someValuesFrom :Cell_Proliferation]
         [a owl:Restriction;
          owl:onProperty :Gene_Plays_Role_In_Process;
          owl:someValuesFrom :Tyrosine_Phosphorylation]
         [a owl:Restriction;
          owl:onProperty :Gene_Plays_Role_In_Process;
          owl:someValuesFrom :Receptor_Signaling]
         [a owl:Restriction;
          owl:onProperty :Gene_Is_Biomarker_Type;
          owl:someValuesFrom :Tumor_Marker]
         [a owl:Restriction;
          owl:onProperty :Gene_Associated_With_Disease;
          owl:someValuesFrom :Adenocarcinoma]
        )
    ].

If we contrast this with the definition of ErbB2 in the earlier version of the NCI ontology, we see that ErbB2 is still a subclass of seven restriction classes (since the intersection of all the restriction classes is a subclass of each of them). Now ErbB2 is defined as being equivalent to the intersection of all those restriction classes. This means that should there be an instance that satisfies all of them (that is, a member of ERB_Oncogene_Family with an allele in chromosome location _17q21_1 found in organism Human, etc.), then an inference engine would conclude that it is a member of Oncogene_ErbB2. In this version of NCI, such capabilities are only beginning to be explored.
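To make the instance-level consequence concrete, here is a sketch of data that would trigger the inference. Every individual name below (:gene_x, :loc_1, and so on) and the namespace IRI are hypothetical; only the class and property names come from the definition above.

```turtle
@prefix : <http://example.org/nci#> .   # hypothetical namespace

# A hypothetical gene instance that meets every condition in the intersection
:gene_x a :ERB_Oncogene_Family ;
    :Allele_In_Chromosomal_Location :loc_1 ;
    :Gene_Found_In_Organism         :patient_1 ;
    :Gene_Plays_Role_In_Process     :proc_1, :proc_2, :proc_3 ;
    :Gene_Is_Biomarker_Type         :marker_1 ;
    :Gene_Associated_With_Disease   :disease_1 .

# Each value belongs to the class named in the corresponding restriction
:loc_1     a :_17q21_1 .
:patient_1 a :Human .
:proc_1    a :Cell_Proliferation .
:proc_2    a :Tyrosine_Phosphorylation .
:proc_3    a :Receptor_Signaling .
:marker_1  a :Tumor_Marker .
:disease_1 a :Adenocarcinoma .

# With the owl:equivalentClass definition loaded, a reasoner would be
# expected to conclude:  :gene_x a :Oncogene_ErbB2 .
```

Under the earlier rdfs:subClassOf formulation, the same data would make :gene_x a member of each restriction class but would not classify it as an Oncogene_ErbB2.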

There are a number of aspects of the NCI ontology that could be criticized as being misleading or even, in some cases, incorrect. For instance, what is the significance of naming something Gene_Kind instead of just Gene? Why does the model provide class-level inferences but no instance-level inferences? Is this really a problem or not? We have seen an example of how a particular issue with the NCI ontology (crossing mappings) has been resolved in later versions and how certain instance-level inferencing is being treated in current research.

In Chapter 12, we will explore some basics of ontology engineering and design, which suggest that some requirements be spelled out in advance of beginning the construction of a model. In the case of an ongoing project like the NCI ontology, the requirements are likely to shift as the project matures and is used by more and more people. We see evidence of that shift in the current move to include the definitions needed for inferencing about individuals as well as classes.

SUMMARY

On the face of it, the Federal Enterprise Architecture Reference Model Ontology and the NCI Ontology serve very different functions. The FEARMO is intended to be a starting point for several agencies, each of which will extend it. As such, it is specifically designed for distributed maintenance. The NCI ontology, on the other hand, is managed by a single body as a centralized controlled vocabulary.

If we look a bit more deeply, we see that these differences are superficial. In both cases, the important aspect of the model is that it can be used by many different people who have two conflicting needs: On the one hand, they need to have some commonality among their work. In the case of the agencies and their enterprise architecture, the federal government wants some unity among the agencies. In the case of the NCI, researchers around the world want to be able to correlate their results. On the other hand, each user of the models has some independent needs; the agencies have their own lines of business to pursue, and the researchers have their own methodologies and agendas.

Each ontology mediates these conflicting needs by providing a formal, unambiguous, and reusable model of the constraints between the concepts in the respective domains. Core concepts that are shared among the stakeholders are represented in a core model. Constraints that hold between these core concepts are also represented. In the case of FEARMO, these constraints govern what happens when the model is extended. In the NCI case, these constraints help to disambiguate terms and to keep track of which term is used in what way. Both ontologies take advantage of the formal semantics of OWL to balance their conflicting requirements.

Another similarity between these two ontologies is found in their design. Each of these ontologies repeats a particular ontology design pattern over and over again. In the case of the NCI ontology, it is the owl:someValuesFrom pattern that links classes in one tree to classes in another. In the case of FEARMO, it is an owl:hasValue pattern that gathers up all the entities that comprise another into a class. In FEARMO, the pattern is repeated over 200 times. In the NCI Ontology, its pattern is repeated over 50,000 times! What role does such repetition play in ontology design?

In both cases, the respective pattern reveals an underlying pattern of how information is organized in the domain that the ontology describes. FEARMO is concerned with managing the composition of systems: What components are combined to form a system? What higher-level system do these systems participate in? The repetition of this pattern in the domain (systems composition) is reflected as repetition of a pattern in the model. Similarly, the terminology space that the NCI Ontology describes has many facets. Certain terms in each facet have known relationships to terms of another facet. The relationship between each pair of classes is the same—that is, how we can find our way through the terminology space.

From the point of view of model maintenance, these repeated patterns give future modelers a chance of understanding how the models work and what could be done to modify them. No single person can understand a model of 50,000 classes all at once. If there were 50,000 distinct logical relationships, understanding what inferences result from the model would be a similarly daunting task. Inference engines can, of course, compute the inferences, but understanding at a high level exactly what is going on is still a challenge for someone who wants to maintain or modify a model. Repetition of modeling patterns simplifies this task, making model maintenance possible.

Fundamental Concepts

The following fundamental concepts were introduced in this chapter:

owl:imports—Allows one ontology to refer explicitly to another. Triples from the imported ontology are available for inferencing in the importing ontology.

Versioning—OWL provides a number of resources for tracking changes in model versions and the dependencies between them.

Annotation—Properties in OWL that do not participate in inferencing. The versioning properties are examples of annotations.

Ontology Design Patterns—Repeated modeling idioms that provide coherence and unity to a large model.
