Chapter 12: Metadata and provenance management
Ewa Deelman1, Bruce Berriman2, Ann Chervenak1, Oscar Corcho3, Paul Groth1, Luc Moreau4,
1USC Information Sciences Institute, Marina del Rey, CA
2Caltech, Pasadena, CA
3Universidad Politécnica de Madrid, Madrid, ES
4University of Southampton, Southampton, UK
Scientists today collect, analyze, and generate terabytes and petabytes of data. These data are often shared among collaborators, who process and analyze them further. To facilitate sharing and interpretation, data need to carry metadata about how they were collected or generated, and provenance information about how they were processed. This chapter describes metadata and provenance in the context of the data lifecycle. It also gives an overview of the approaches to metadata and provenance management, followed by examples of how applications use metadata and provenance in their scientific processes.
1 Metadata and Provenance
Today data are being collected by a vast number of instruments in every discipline of science. In addition to raw data, new products are created every day as a result of processing existing data and running simulations in order to understand observed data. As the sizes of the data sets grow into the petascale range, and as data are being shared among and across scientific communities, the importance of diligently recording the meaning of data and the way they were produced increases dramatically.
One can think of metadata as data descriptions that assign meaning to the data, and of data provenance as the information about how the data were derived. Both are critical to the ability to interpret a particular data item. Metadata and provenance are important even when the same individual collects and interprets the data. Today, however, the key drivers for the capture and management of data descriptions are the scientific collaborations that bring collective knowledge and resources to solve a particular problem or explore a research area. Because sharing data in collaborations is essential, these data need to contain enough information for other members of the collaboration to interpret them and then use them for their own research. Metadata and provenance information are also important for the automation of scientific analysis, where software needs to be able to identify the data sets appropriate for a particular analysis and then annotate new, derived data with metadata and provenance information.
Figure 12.1 depicts a generic data lifecycle in the context of a data processing environment where data are first discovered by the user with the help of metadata and provenance catalogs. Next, the user finds available analyses that can be performed on the data, relying on software component libraries that provide component metadata, a logical description of the component capabilities. During the data processing phase, data replica information may be entered in replica catalogs (which contain metadata about the data location), data may be transferred between storage and execution sites, and software components may be staged to the execution sites as well. While data are being processed, provenance information can be automatically captured and then stored in a provenance store. The resulting derived data products (both intermediate and final) can also be stored in an archive, with metadata about them stored in a metadata catalog and location information stored in a replica catalog.
Figure 12.1: The Data Lifecycle.
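The lifecycle in Figure 12.1 can be illustrated with a minimal sketch in which three in-memory dictionaries stand in for the metadata catalog, the replica catalog, and the provenance store. The class name, method names, attribute names, and site paths are all hypothetical and are chosen only for illustration; real catalogs are typically separate services with their own interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Catalogs:
    """In-memory stand-ins for the three kinds of catalog in the lifecycle."""
    metadata: dict = field(default_factory=dict)    # logical name -> descriptive attributes
    replicas: dict = field(default_factory=dict)    # logical name -> physical locations
    provenance: dict = field(default_factory=dict)  # logical name -> derivation record

    def register(self, name, attributes, location, derived_from=(), process=None):
        """Record a data product in all three catalogs."""
        self.metadata[name] = dict(attributes)
        self.replicas.setdefault(name, []).append(location)
        self.provenance[name] = {"inputs": list(derived_from), "process": process}

    def discover(self, **criteria):
        """Return logical names whose metadata match every criterion."""
        return sorted(
            name for name, attrs in self.metadata.items()
            if all(attrs.get(key) == value for key, value in criteria.items())
        )


catalogs = Catalogs()

# A raw product arrives and is archived (all values are made up).
catalogs.register("raw_001", {"instrument": "camera_a", "level": "raw"},
                  "site1:/archive/raw_001")

# A derived product is stored together with a record of how it was produced.
catalogs.register("mosaic_001", {"instrument": "camera_a", "level": "derived"},
                  "site2:/archive/mosaic_001",
                  derived_from=["raw_001"], process="mosaic v1")

# Discovery uses only the metadata; location and derivation are looked up separately.
found = catalogs.discover(instrument="camera_a", level="derived")
```

Keeping the three mappings separate mirrors the separation in the figure: discovery queries descriptive attributes, while the replica and provenance entries answer where a product is and how it came to be.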
From a general point of view, metadata may be defined as “data about data”. However, this definition is too broad; hence other, more specific definitions of the term have been provided in the literature, each of them focusing on different aspects of metadata capture, storage and use. Probably one of the most comprehensive definitions is the one from , which defines metadata as “structured data about an object that supports functions associated with the designated object”. The structure implies a systematic data ordering according to a metadata schema specification; the object can be any entity or form for which contextual data can be recorded; and the associated functions can be activities and behaviors of the object. One of the characteristics of this definition is that it covers the dual function that metadata can have: describing the objects from a logical point of view as well as describing their physical and operational attributes.
We can cite a wide range of objects that metadata can be attached to, such as databases, documents, processes and workflows, instruments and other resources. These objects may be available in different formats. For example, documents may be available electronically in the form of HTML, PDF, LaTeX, etc., on the Web, in a data Grid, on a PC hard disk, or on paper in a library, among other formats. At the same time, metadata can also be expressed in a wide range of languages (from natural to formal ones) and with a wide range of vocabularies (from simple ones, based on a set of agreed keywords, to complex ones, with agreed taxonomies and formal axioms). Metadata can be available in different formats, both electronic and on paper, for example, written in a scientist’s lab notebook or in the margins of a textbook. Metadata can also be created and maintained using different types of tools, from text editors to metadata generation tools, either manually or automatically.
Given all this variety in representation formats, described resources, and approaches for metadata capture, storage and use, there is no commonly agreed taxonomy of types of metadata or types of described resources; instead, there are different points of view about how metadata can be generated and used. We will now go through some of these points of view, illustrating them with examples.
One of the properties of metadata is that it can be organized in layers: metadata can refer to raw data (e.g., coming from an instrument or available in a database), to information about the process of obtaining the raw data, or to derived data products. This allows us to distinguish different layers (or chains) of metadata: primary, secondary, tertiary, etc. As an example, let us consider an application in the satellite imaging domain, such as the one described in . Raw data coming from satellites (e.g., images taken by instruments on the satellite) are sent to the ground stations so that they can be stored and processed. A wide range of metadata can be associated with these data, such as the times when they were obtained and transferred, the instrument used for capturing them, the time period when the image was taken, the position to which it refers, etc. This is considered the primary metadata of the images received. Later on, this metadata can be used to check whether all the images that were supposed to be obtained from an instrument in a period of time have actually been obtained or whether there are any gaps, and new metadata can be generated regarding the grouping of pieces of metadata for an instrument, the quality of the results obtained for that time period, statistical summaries, etc. This is considered secondary metadata, since it does not describe the raw data directly, but rather the metadata, analyses, summaries, and observations about the raw data; together these form a set of layers or a chain of metadata descriptions. Another common example of this organization of metadata into layers is that of provenance, which is described in the next section.
In all these cases, it is important to determine which type (layer) of metadata we use for searching, querying, etc., and which type of metadata we show to users, so that metadata coming from different layers is not merged together and is shown with the appropriate level of detail, as discussed in  .
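The satellite example above, in which secondary metadata are derived from primary metadata, can be sketched as follows. The record fields, the instrument name, and the assumption that an image is expected once per hour are all hypothetical; the point is only that the summary is computed from the primary metadata and never touches the images themselves.

```python
from datetime import datetime, timedelta

# Primary metadata: one record per raw image (all values are made up).
primary = [
    {"image": "img_01", "instrument": "sensor_x", "acquired": datetime(2008, 1, 1, 0, 0)},
    {"image": "img_02", "instrument": "sensor_x", "acquired": datetime(2008, 1, 1, 1, 0)},
    {"image": "img_03", "instrument": "sensor_x", "acquired": datetime(2008, 1, 1, 4, 0)},
]


def summarize(records, instrument, expected_interval):
    """Derive secondary metadata (coverage summary and gaps) from primary metadata."""
    times = sorted(r["acquired"] for r in records if r["instrument"] == instrument)
    # A gap is any pair of consecutive acquisitions further apart than expected.
    gaps = [
        (earlier, later)
        for earlier, later in zip(times, times[1:])
        if later - earlier > expected_interval
    ]
    return {
        "instrument": instrument,
        "image_count": len(times),
        "period": (times[0], times[-1]) if times else None,
        "gaps": gaps,
    }


secondary = summarize(primary, "sensor_x", timedelta(hours=1))
```

Storing the result as a record of its own, rather than merging it into the per-image records, keeps the two layers distinguishable when they are later searched or displayed.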
The organization of metadata into layers also reflects an interesting characteristic of how metadata is used. To some extent, what is metadata for one application may be considered as data for another. In the previous example in the satellite domain, metadata about the positions of images on the Earth is considered part of the primary metadata that is captured and stored for the satellite mission application when the information arrives at the ground station. However, the same spatial information would be considered a data source for other applications, such as a map visualization service (e.g., Google Earth) that positions those resources on a map. In contrast, the dates when the images were taken or the instruments with which they were produced may still be considered metadata in both cases.
Another aspect of metadata comes into play when new data products are being generated either as a result of analysis or simulation. These derived data are now scientific products and need to be described with appropriate metadata information. In some disciplines such as astronomy (see Section 5.1), the community has developed standard file formats that include metadata information in the header of the file. These are often referred to as “self-describing data formats”, since each file stored in such a format has all the necessary metadata in its header. Software is then able to read this metadata and to generate new data products with the appropriate headers. One of the difficulties of this approach is cataloging the derived data automatically: some process needs to read the file headers, extract the information, and place it in a metadata catalog. In terms of metadata management, astronomy seems to be ahead of other disciplines, possibly because, in addition to the astronomy community, the discipline appeals to many amateurs. As a result, astronomers needed to face the issue of data and metadata publication early on, making their data broadly accessible. Other disciplines of science are still working on the development of metadata standards and data formats. Without those, software cannot generate descriptions of the derived data. Even when the standards are formed within the community, there are often a number of legacy codes that need to be retrofitted (or wrapped) to be able to generate the necessary metadata descriptions as they generate new data products.
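The cataloging step described above, in which a process reads file headers and copies selected fields into a metadata catalog, might look like the following sketch. The header layout used here (plain "KEY = value" lines closed by an END marker) is a deliberately simplified, hypothetical format, not the layout of any particular astronomy standard, and the keyword names and file name are likewise invented; real self-describing formats are normally read with their own libraries.

```python
def read_header(text):
    """Parse 'KEY = value' lines, stopping at an END marker, into a dict."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "END":
            break  # everything after END is payload, not metadata
        if "=" not in line:
            continue  # skip lines that are not key/value pairs
        key, _, value = line.partition("=")
        header[key.strip()] = value.strip()
    return header


def catalog_file(catalog, filename, text, wanted=("TELESCOPE", "DATE_OBS", "OBJECT")):
    """Copy the selected header fields into the catalog entry for a file."""
    header = read_header(text)
    catalog[filename] = {key: header[key] for key in wanted if key in header}
    return catalog[filename]


# A made-up file whose header precedes its data.
sample = """TELESCOPE = survey_scope
DATE_OBS = 2008-01-01
OBJECT = M31
EXPOSURE = 30
END
(binary data would follow here)
"""

catalog = {}
entry = catalog_file(catalog, "image_0001.dat", sample)
```

Only the fields named in `wanted` reach the catalog, which reflects a common design choice: the catalog holds what users search on, while the full header stays with the file.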
Finally, another perspective on metadata is whether a piece of metadata reflects an objective or only a subjective point of view about the resources being described. While in the former case the term “metadata” is generally used, in the latter the more specific term “annotation” is more common. Annotations are normally produced manually by humans and reflect the point of view of those humans with respect to the objects being described. These annotations are also known as “social annotations”, to reflect the fact that they can be provided by a large number of individuals. They normally consist of sets of tags that are manually attached to the resources being described, without a structured schema or a controlled vocabulary to be used as a reference. These types of annotations provide an additional point of view over existing data and metadata, reflecting the common views of a community, which can be extracted from the most common tags used to describe a resource. Flickr and del.icio.us are examples of services used to generate this type of metadata for images and bookmarks respectively, and are being used in some cases in scientific domains.
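As a minimal sketch of how the common view of a community might be extracted from free-form tags, the following counts how many individuals applied each tag to a resource and keeps those used by more than one person. The tag data and the threshold are invented for illustration, and this is not a description of how any particular tagging service works.

```python
from collections import Counter


def common_tags(tag_sets, min_users=2):
    """Return (tag, user_count) pairs for tags used by at least `min_users` people."""
    counts = Counter(
        tag
        for tags in tag_sets
        for tag in {t.lower() for t in tags}  # count each person at most once per tag
    )
    selected = [(tag, n) for tag, n in counts.items() if n >= min_users]
    # Most widely used first; ties are broken alphabetically for a stable order.
    return sorted(selected, key=lambda item: (-item[1], item[0]))


# Tags attached to one image by three different people (made-up data).
tags_by_user = [
    ["galaxy", "spiral", "M31"],
    ["Galaxy", "andromeda"],
    ["galaxy", "spiral", "pretty"],
]

consensus = common_tags(tags_by_user)
```

Because there is no controlled vocabulary, the only normalization applied is lower-casing; synonyms such as "M31" and "andromeda" remain distinct tags, which is one of the limitations of unstructured annotation.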
There are also other types of annotations that are present in the scientific domain. For example, researchers in genetics annotate the Mouse Genome Database with information about the various genes, sequences, and phenotypes. All annotations in the database are curated and supported with experimental evidence and citations. The annotations also draw from a standard vocabulary (normally in the form of controlled vocabularies, thesauri or ontologies, as described in the following subsection), so that they can be consistent. Another example of scientific annotations is in the neuroscience domain, where scientists are able to annotate a number of brain images. The issue in brain imaging is that there are very few automated techniques that can extract the features in an image. Rather, the analysis of the image is often done by a scientist. In some cases, the images need to be classified or annotated based on the functional properties of the brain, information which cannot be automatically extracted. As with other annotations, brain images can be annotated by various individuals, using different terms from an overall vocabulary. An advantage of using a controlled vocabulary for the annotations is that the annotations can be queried and thus data can be discovered based on these annotations.
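The advantage of a controlled vocabulary for querying can be illustrated with a small sketch in which annotations are accepted only if their term belongs to an agreed set. The vocabulary terms, image names, and annotator names below are hypothetical, and a real system would typically draw its terms from a curated thesaurus or ontology rather than a hard-coded set.

```python
# Hypothetical controlled vocabulary shared by all annotators.
VOCABULARY = {"hippocampus", "cerebellum", "frontal_lobe"}


def annotate(annotations, image, term, author):
    """Attach a vocabulary term to an image, rejecting terms outside the vocabulary."""
    if term not in VOCABULARY:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    annotations.setdefault(image, set()).add((term, author))


def find_images(annotations, term):
    """Return the images that any annotator has labeled with the given term."""
    return sorted(
        image for image, notes in annotations.items()
        if any(noted_term == term for noted_term, _ in notes)
    )


annotations = {}
annotate(annotations, "scan_17", "hippocampus", "alice")
annotate(annotations, "scan_17", "cerebellum", "bob")
annotate(annotations, "scan_42", "hippocampus", "bob")

matches = find_images(annotations, "hippocampus")
```

Because every annotator uses the same terms, an exact-match query is enough to find all relevant images; with free-form tags the same query would miss spelling variants and synonyms.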
Annotations can also be used in the context of scientific workflows (see Chapter 13), where workflow components or entire workflows can be annotated so that they can be more readily discovered and evaluated for suitability. The myGrid project has a particular emphasis on bioinformatics workflows composed of services broadly available to the community. These services are annotated with information about their functionality and characteristics. myGrid annotations can be both in free-text form and drawn from a controlled vocabulary.