Granule Metadata, netCDF, and ISO

From NOAA Environmental Data Management Wiki

The world of documentation has long recognized that there are many types of documentation and that they serve multiple purposes. Three classic documentation types include directory, inventory, and granule. One of the significant differences between these three levels is the role or applicability of standards. There are many standards at the directory level (the classic in this framework is the Directory Interchange Format, DIF) and fewer, more general standards (like the CF Conventions) at the granule level. This difference reflects the fact that the benefits of using standards increase with the breadth of applicability of those standards. At the directory level many products share documentation needs. At the granule level there are fewer shared needs.

Granule documentation needs to be represented within the granules themselves, and at least some of it will also be represented in the directory (or collection) metadata. We first consider how that documentation is represented in the granules, which are assumed to be written in netCDF.

netCDF

There are two ways to represent documentation in netCDF files. Global attributes relate to the entire file; variable attributes relate to particular variables.

Global Attributes

Global attributes are written into netCDF files directly under the root <netcdf> element. They have names and values as shown in this example. The names of the global attributes should be selected from the NetCDF Attribute Convention for Dataset Discovery and the Climate Forecast (CF) Conventions. Note that the content of these metadata elements is written into the value attribute of the attribute element regardless of its length. In some cases this results in long values.

This example shows discovery metadata (the attributes above the quality information comment) as well as sample quality information for the granule. The example is schematic and not intended as a recommendation (see NetCDF for netCDF resources). In this case the quality information includes overall numbers (percent_clear, percent_day, and percent_night) and several simple statistics for the parameters named "air_temperature_at_cloud_top" and "cloud_top_altitude". The name of the parameter must be included in each attribute name in order to distinguish the statistics from one another.

This approach is included here primarily because it is conceptually similar to the idea of file headers and header variables that exist in many legacy formats and processing systems. It is probably the weakest of the choices presented here, primarily because the global attributes cannot be grouped.

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- discovery information -->
  <attribute name="title" value="A dataset that includes cloud top temperature" />
  <attribute name="history" value="Created by some processing system" />
  <attribute name="algorithm_version" value="2.0" />
  <attribute name="institution" value="Producer of the product" />
  <attribute name="satellite_name" value="satelliteIdentifier" />
  <attribute name="sensor_name" value="instrumentIdentifier" />
  <attribute name="id" value="unique identifier for the granule" />
  <attribute name="Conventions" value="CF-1.4" />
  <attribute name="standard_name_vocabulary" value="CF-1.4" />
  <attribute name="cdm_data_type" value="Swath" />
  <attribute name="geospatial_lat_min" type="float" value="-87.74" />
  <attribute name="geospatial_lat_max" type="float" value="87.72" />
  <attribute name="geospatial_lon_min" type="float" value="-179.99" />
  <attribute name="geospatial_lon_max" type="float" value="180.0" />
  <attribute name="geospatial_lat_units" value="degrees_north" />
  <attribute name="geospatial_lon_units" value="degrees_east" />
  <!-- quality information -->
  <attribute name="percent_clear" type="int" value="" />
  <attribute name="percent_day" type="int" value="" />
  <attribute name="percent_night" type="int" value="" />
  <attribute name="air_temperature_at_cloud_top_minimum" type="float" value="" />
  <attribute name="air_temperature_at_cloud_top_mean" type="float" value="" />
  <attribute name="air_temperature_at_cloud_top_maximum" type="float" value="" />
  <attribute name="air_temperature_at_cloud_top_standard_deviation" type="float" value="" />
  <attribute name="cloud_top_altitude_minimum" type="float" value="" />
  <attribute name="cloud_top_altitude_mean" type="float" value="" />
  <attribute name="cloud_top_altitude_maximum" type="float" value="" />
  <attribute name="cloud_top_altitude_standard_deviation" type="float" value="" />
  ...
</netcdf>
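The consequence of the flat layout can be sketched in a few lines of Python. This is a minimal illustration, not part of the original page: the sample values and the helper names global_attributes and statistics_for are hypothetical. Because the global attributes are not grouped, the statistics for one parameter can only be recovered by matching on the attribute name prefix.

```python
import xml.etree.ElementTree as ET

NCML_NS = "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"

# Hypothetical sample: a trimmed copy of the example above with made-up values.
SAMPLE_NCML = f"""
<netcdf xmlns="{NCML_NS}">
  <attribute name="title" value="A dataset that includes cloud top temperature" />
  <attribute name="air_temperature_at_cloud_top_minimum" type="float" value="190.5" />
  <attribute name="air_temperature_at_cloud_top_maximum" type="float" value="301.2" />
  <attribute name="cloud_top_altitude_minimum" type="float" value="0.3" />
</netcdf>
"""


def global_attributes(ncml_text):
    """Return {name: value} for attributes directly under the root element."""
    root = ET.fromstring(ncml_text)
    attribute_tag = f"{{{NCML_NS}}}attribute"
    return {a.get("name"): a.get("value") for a in root.findall(attribute_tag)}


def statistics_for(attrs, parameter):
    """Select the global attributes whose names start with '<parameter>_'."""
    prefix = parameter + "_"
    return {
        name[len(prefix):]: value
        for name, value in attrs.items()
        if name.startswith(prefix)
    }


attrs = global_attributes(SAMPLE_NCML)
temperature_stats = statistics_for(attrs, "air_temperature_at_cloud_top")
print(temperature_stats)
```

The prefix matching is the fragile part: it depends on every producer spelling the parameter name identically in each attribute name.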

Variable Attributes

Variable attributes are written into netCDF files under the physical variable element to which they pertain. They have names and values as shown in this example. Recommendations for these names are included in the NetCDF Attribute Convention for Dataset Discovery and in the Climate Forecast Conventions. This example shows representative variable attributes for a cloud top temperature parameter. Note that the standard name for the variable comes from the CF Conventions, as indicated by the Conventions and standard_name_vocabulary global attributes shown above.

In this case, the quality information has only the statistic names because the attributes are included in the variable named cldTopTemp. This reflects the logical grouping that is enabled using the variable element as a container for related attributes. This also makes it possible to retrieve all information related to the variable using standard XML tools as described below.

<variable name="cldTopTemp" type="float">
  <attribute name="long_name" value="Cloud Top Temperature" />
  <attribute name="standard_name" value="air_temperature_at_cloud_top" />
  <attribute name="scale_factor" type="float" value="0.01" />
  <attribute name="valid_range" type="short" value="5000 32500" />
  <attribute name="units" value="kelvin" />
  <attribute name="Note" value="Low resolution channels" />
  <attribute name="minimum" type="float" value="" />
  <attribute name="mean" type="float" value="" />
  <attribute name="maximum" type="float" value="" />
  <attribute name="standard_deviation" type="float" value="" />
</variable>
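As a minimal sketch of retrieving everything attached to one variable with standard XML tools, the following uses Python's xml.etree.ElementTree. The sample values, the second variable name cldTopAlt, and the helper variable_attributes are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# Hypothetical sample with made-up statistics; cldTopAlt is an invented name.
SAMPLE_NCML = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <variable name="cldTopTemp" type="float">
    <attribute name="standard_name" value="air_temperature_at_cloud_top" />
    <attribute name="units" value="kelvin" />
    <attribute name="minimum" type="float" value="190.5" />
    <attribute name="maximum" type="float" value="301.2" />
  </variable>
  <variable name="cldTopAlt" type="float">
    <attribute name="standard_name" value="cloud_top_altitude" />
  </variable>
</netcdf>
"""


def variable_attributes(ncml_text, variable_name):
    """Return {name: value} for every attribute under the named variable."""
    root = ET.fromstring(ncml_text)
    path = f"nc:variable[@name='{variable_name}']/nc:attribute"
    return {a.get("name"): a.get("value") for a in root.findall(path, NS)}


cloud_top_temperature = variable_attributes(SAMPLE_NCML, "cldTopTemp")
print(cloud_top_temperature)
```

A single path expression selects the whole set of related attributes, so no naming convention is needed to tie a statistic to its variable.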

A Group of Attributes

NcML is the XML representation of the metadata in a netCDF file. The "rules" of NcML are described in the NcML Schema. The most recent version of that schema introduces the concept of groups as part of the Common Data Model. Groups can contain any number of related attributes, so they provide a mechanism for storing granule metadata for all variables. The group would have an agreed-upon name in order to facilitate retrieval of this information:

Using a group to associate variables:

<group name="qualityInformation">
  <attribute name="percent_clear" type="int" value="" />
  <attribute name="percent_day" type="int" value="" />
  <attribute name="percent_night" type="int" value="" />
  <attribute name="air_temperature_at_cloud_top_minimum" type="float" value="" />
  <attribute name="air_temperature_at_cloud_top_mean" type="float" value="" />
  <attribute name="air_temperature_at_cloud_top_maximum" type="float" value="" />
  <attribute name="air_temperature_at_cloud_top_standard_deviation" type="float" value="" />
  <attribute name="cloud_top_altitude_minimum" type="float" value="" />
  <attribute name="cloud_top_altitude_mean" type="float" value="" />
  <attribute name="cloud_top_altitude_maximum" type="float" value="" />
  <attribute name="cloud_top_altitude_standard_deviation" type="float" value="" />
  ...
</group>
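Retrieval by an agreed-upon group name might look like the sketch below. It is illustrative only: the sample values and the helper group_attributes are hypothetical, and the type conversion covers just the two types used in the example.

```python
import xml.etree.ElementTree as ET

NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# Hypothetical sample with made-up values.
SAMPLE_NCML = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <attribute name="title" value="A dataset that includes cloud top temperature" />
  <group name="qualityInformation">
    <attribute name="percent_clear" type="int" value="42" />
    <attribute name="air_temperature_at_cloud_top_mean" type="float" value="250.0" />
  </group>
</netcdf>
"""

# Converters for the attribute types used here; "integer" is accepted as a
# lenient alias. Any other type is left as a string.
CONVERTERS = {"int": int, "integer": int, "float": float}


def group_attributes(ncml_text, group_name="qualityInformation"):
    """Return typed {name: value} for the attributes in the named group."""
    root = ET.fromstring(ncml_text)
    group = root.find(f"nc:group[@name='{group_name}']", NS)
    if group is None:
        return {}
    result = {}
    for attribute in group.findall("nc:attribute", NS):
        convert = CONVERTERS.get(attribute.get("type"), str)
        result[attribute.get("name")] = convert(attribute.get("value"))
    return result


quality = group_attributes(SAMPLE_NCML)
print(quality)
```

The lookup depends only on the group name, which is why that name needs to be agreed upon in advance.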

ISO Metadata

The examples above include three types of granule metadata: discovery metadata, percentages that apply to the entire file, and specific quality information about each parameter. Translating the discovery metadata into ISO is very straightforward. See NetCDF Attribute Convention for Dataset Discovery for the crosswalk. Translating the quality information is a bit more challenging.

Data Quality Objects

Like the netCDF standard, the ISO Standard offers several approaches to representing this granule information. Most of them involve the ISO Quality section which is shown at right. We are focusing on the DQ_Element in this discussion.

The DQ_Element includes information about the quality measure and the method for applying it. This information could be supplied at almost any level of detail or scope, as described in the DQ_Scope object. In the most general case, the scope is the entire dataset and the MD_ScopeCode = "dataset". In that case all of the quality information for a product is reported in one piece. The content of the DQ_Element would include a reference to a document that provides a complete description of the quality fields and how they were calculated. This reference would be a citation in the DQ_Element/evaluationProcedure element of the metadata. It is probably a good idea to include an identifier for this procedure in the measureIdentification object.

The numbers in the quality information are reported in the DQ_Element/result. The authors of the ISO Standard understood that flexibility would be required when describing quality information for any dataset, so they used the most flexible elements in the ISO Standard, the Record and the RecordType. The definitions of these types are given in ISO 19139: "It is not appropriate to include a full description of Record and RecordType in this Technical Specification but it is important to recognize a few characteristics of these classes. A RecordType is the physical expression of a semantic definition (typically a feature type). A Record physically expresses an instance of the semantic definition corresponding to its RecordType." This description is arcane, but it means that the quality report can be expressed in terms of an object defined in the NcML schema, i.e. a group, and a blob of XML that is an instance of an object of that type. The XML representation of the result references the group in the netCDF file:

<gmd:report>
    <gmd:DQ_QuantitativeAttributeAccuracy>
        <gmd:result>
            <gmd:DQ_QuantitativeResult>
                <gmd:valueType>
                    <gco:RecordType xlink:href="http://www.unidata.ucar.edu/schemas/netcdf/ncml-2.2.xsd#xpointer(//element[@name='group'])">netCDF Group</gco:RecordType>
                </gmd:valueType>
                <gmd:valueUnit gco:nilReason="inapplicable"/>
                <gmd:value>
                    <gco:Record xlink:href="http://www.ngdc.noaa.gov/ncmlService/granuleIdentifier#xpointer(/netcdf/group[@name='qualityInformation'])">Quality Information for granule = granuleIdentifier</gco:Record>
                </gmd:value>
            </gmd:DQ_QuantitativeResult>
        </gmd:result>
    </gmd:DQ_QuantitativeAttributeAccuracy>
</gmd:report>

In these examples the URL http://www.ngdc.noaa.gov/ncmlService/granuleIdentifier is a REST service that provides the NcML for the granule identified by granuleIdentifier. Several groups are presently working on implementing such a service.
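How a client might follow such a reference can be sketched as follows. This is an assumption-laden illustration, not a description of the service: the helpers split_record_href and resolve_in_ncml are hypothetical, the NcML is supplied as a local string instead of being fetched, and the group name in the XPointer is written in quotes, as an XPath string comparison requires.

```python
import re
import xml.etree.ElementTree as ET

NS = {"nc": "http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"}

# Hypothetical stand-in for the NcML the REST service would return.
SAMPLE_NCML = """
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <group name="qualityInformation">
    <attribute name="percent_clear" type="int" value="42" />
  </group>
</netcdf>
"""


def split_record_href(href):
    """Split an xlink:href into (service URL, path inside xpointer(...))."""
    url, _, fragment = href.partition("#")
    match = re.fullmatch(r"xpointer\((.*)\)", fragment.strip())
    if match is None:
        raise ValueError(f"no xpointer() fragment in {href!r}")
    return url.strip(), match.group(1)


def resolve_in_ncml(ncml_text, xpath):
    """Evaluate a simple /netcdf/<step>[...] path against NcML text.

    ElementTree supports only a subset of XPath, so the absolute path is
    rewritten as a namespaced path relative to the root element. Predicates
    containing '/' are not handled by this sketch.
    """
    root = ET.fromstring(ncml_text)
    steps = xpath.strip("/").split("/")
    if steps[0] != "netcdf":
        raise ValueError("path must start at the netcdf root element")
    if len(steps) == 1:
        return [root]
    relative = "/".join("nc:" + step for step in steps[1:])
    return root.findall(relative, NS)


href = (
    "http://www.ngdc.noaa.gov/ncmlService/granuleIdentifier"
    "#xpointer(/netcdf/group[@name='qualityInformation'])"
)
service_url, pointer_path = split_record_href(href)
matches = resolve_in_ncml(SAMPLE_NCML, pointer_path)
print(service_url, pointer_path, [m.get("name") for m in matches])
```

In a real client the NcML text would come from a request to the service URL; that step is left out here because the service interface is not specified in this page.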

It is important to note that both of these approaches access information in the netCDF file directly, without 1) mapping that information into the ISO standard or 2) any understanding of ISO on the part of the system writing the netCDF. This is a critical step, as it completely separates the netCDF and ISO representations of the metadata.