GEOlabel

DMP-3: Data encoding

The concept

Data should be structured using encodings that are widely accepted in the target user community and aligned with organizational needs and observing methods, with preference given to non-proprietary international standards.

Related terms: not defined.

Category: Usability

Explanation of the principle

Usability of data, and especially automated use, depends strongly on the extent to which end users (both human and machine) can rely on standardised encodings, because tools, applications, and algorithms are typically designed to work with them. Use of standardised encodings limits the time end users spend transforming data and is therefore key to interoperability.

Complete interoperability needs three conditions to be met (Hugo 2009):

Schematic Interoperability defines the structure (schema) in which the data will be offered by a service. For many applications, this schema is critical for correct binding, but schemas are likely to vary within a common framework depending on specific applications.

Syntactic Interoperability defines the way in which data services will be invoked (Hugo 2008). In many cases, such standards make provision for query parameters and sub-setting of data sets. OPeNDAP has started working on an additional refinement, in which requests for derived data (“offerings”), for example based on statistical analysis, can also be included in the service syntax. Such concepts, which allow requests for processing to be sent to the data instead of the other way round, are a major requirement in the field of Big Data applications (Fulker and Gallagher 2013). Definition of the parameters depends to some extent on semantic interoperability and conventions.

Semantic Interoperability ensures that the content of the schema (the data itself) can be understood by humans or machines (Heflin and Hendler 2000). It is the most complex of the interoperability requirements, and attempts to establish common ontologies, vocabularies, and frameworks such as “essential variables” (OOPC 2015) are all designed to address it. A subset or refinement of semantic interoperability concerns the protocols or methodologies used to gather the data, which are sometimes critical for valid collations or combinations. Some frameworks for essential variables in Earth and environmental observation science attempt to provide such protocols and methodologies.
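As an illustration of syntactic interoperability, the sketch below builds a sub-setting request in the style of an OPeNDAP (DAP2) constraint expression, where each dimension of a variable is sliced as [start:stride:stop]. The server URL, the variable name sst, and the helper functions are hypothetical and serve only to show how query parameters can express a subset; a real service should be consulted for the response types and variables it actually offers.

```python
from urllib.parse import quote


def hyperslab(start, stop, stride=1):
    """Format one dimension of an index range as [start:stride:stop]."""
    if start < 0 or stop < start or stride < 1:
        raise ValueError("expected 0 <= start <= stop and stride >= 1")
    return f"[{start}:{stride}:{stop}]"


def subset_url(dataset_url, variable, ranges, response="ascii"):
    """Build a request URL asking the server for a slice of one variable."""
    constraint = variable + "".join(hyperslab(*r) for r in ranges)
    # Brackets, colons, and commas stay readable; anything else unsafe
    # in a query string is percent-encoded.
    return f"{dataset_url}.{response}?{quote(constraint, safe='[]:,')}"


# Hypothetical dataset and variable: first time step, a block of rows,
# and every second column between index 30 and 40.
url = subset_url(
    "https://data.example.org/opendap/sst.nc",
    "sst",
    [(0, 0), (10, 20), (30, 40, 2)],
)
print(url)
```

Because the subset is expressed in the request itself, only the requested slice needs to travel to the client, which is the property the paragraph on syntactic interoperability describes.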

In practice, true semantic interoperability is difficult to achieve, often requiring brokering and mediation to align with a standard. A future consideration is the extent to which it will be possible to persist such mediations for re-use. Agreement on a workable set of syntactic (service), schematic, and semantic standards for the typical data families in use by the community can help in some cases.
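One way to think about persisting a mediation for re-use is to store the mapping itself as data rather than as code. The sketch below is a minimal example under stated assumptions: the local field names (temp_c, wspd_kmh), the rule layout, and the target terms are illustrative choices, not a published mediation format.

```python
import json


def apply_mediation(record, rules):
    """Rename fields and convert units according to a set of mediation rules."""
    out = {}
    for name, value in record.items():
        rule = rules.get(name)
        if rule is None:
            out[name] = value  # no rule for this field: pass through unchanged
        else:
            out[rule["target"]] = value * rule["scale"] + rule["offset"]
    return out


# Hypothetical rules: local name -> target term, target units, linear conversion.
rules = {
    "temp_c": {"target": "air_temperature", "units": "K",
               "scale": 1.0, "offset": 273.15},
    "wspd_kmh": {"target": "wind_speed", "units": "m s-1",
                 "scale": 1 / 3.6, "offset": 0.0},
}

# Persist the mediation as JSON text (it could equally be written to a file
# or published in a registry) and load it again for re-use.
saved = json.dumps({"version": 1, "rules": rules}, indent=2, sort_keys=True)
reloaded = json.loads(saved)["rules"]

mediated = apply_mediation(
    {"temp_c": 20.0, "wspd_kmh": 36.0, "station": "A1"}, reloaded
)
print(mediated)
```

Keeping the rules in a serialisable form means the same mediation can be shared, versioned, and applied by other practitioners without re-deriving it.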

Guidance on Implementation, with Examples

The availability and acceptance of syntactic encoding standards are at a high level of maturity, and these standards cover the majority of data families that the GEO community uses routinely. Examples include the map and sensor services defined by OGC SWE (2011), OPeNDAP and NetCDF services (OGC Network Common Data Form 2015; Common Data Model 2015), and the work done by WMO in respect of globally available meteorological data (WMO Information System 2015). The extent to which the community has implemented these standards is, however, highly variable, with implementation of Sensor Observation Services lagging seriously behind Web Map Services and the use of OPeNDAP and NetCDF. As a preferred means of providing access to publicly available data sets, practitioners should select the standards, and open-source implementations of them, that are appropriate to their data family, internal information technology platforms, and capabilities.

Communities have also developed a portfolio of content standards in support of schematic interoperability. Examples include KML (OGC KML 2008), GML (OpenGIS® Geography Markup Language 2007), GeoJSON (GeoJSON Format Specification 2015), and other similar standards for the encoding of spatial data, and the SensorML (OGC® SensorML 2014) suite for encoding of time series and sensor observations. Interoperability is highly mature for spatial data sets in particular, whether vector or raster, and it is common for applications and web components to support a wide variety of data schemas. Best practice and guidance should stress the application of these widely adopted standards whenever possible.
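To make the idea of a widely supported spatial schema concrete, the following sketch encodes a single point observation as a GeoJSON Feature inside a FeatureCollection, using only the Python standard library. The station identifier, property names, and coordinates are invented for illustration; the structural members (type, geometry, coordinates in longitude, latitude order, and properties) follow the GeoJSON format.

```python
import json


def observation_to_feature(lon, lat, properties):
    """Wrap one point observation as a GeoJSON Feature (longitude first)."""
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("longitude/latitude out of range")
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": dict(properties),
    }


# Hypothetical observation used only as sample input.
feature = observation_to_feature(
    18.42, -33.92,
    {"station": "demo-01", "sea_surface_temperature_c": 16.4},
)
collection = {"type": "FeatureCollection", "features": [feature]}

text = json.dumps(collection, indent=2)
print(text)
```

Output in this form can typically be loaded directly by web mapping components and desktop GIS tools, which is the practical benefit of choosing a widely adopted schema.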

The most diverse landscape is found in respect of semantic interoperability and the content standard encodings that support it. Some communities have access to mature content standards, for example the Biodiversity community through TDWG (TDWG Standards 2015), the Climate Modelling community through essential climate variables (GCOS Essential Climate Variable(s) 2015), and the hydrology community through WaterML (OGC® WaterML 2015), and there are significant efforts to establish ontologies, vocabularies, and name services for a wide variety of disciplines. A major concern centres on this diversity: it is often difficult for implementers and end users to select from the large number of options available. GEO is in a position to address this problem, first by creating definitive registries of the resources that are available, and second by working towards community consensus on the most appropriate resources to use. In general, best practice in the absence of such guidance will be to use any published vocabularies, ontologies, and name services appropriate to the field of study rather than none at all.

Metrics to measure level of adherence to the principle

Measuring adherence to a schema offered by a data service depends on the data format (MIME type): in the case of XML encodings, the structure and vocabulary (in other words, both schematic and to some extent semantic interoperability) can be tested against the XSD (XML Schema Definition). Other encodings (GeoJSON, text, or binary encodings) do not support such automated validation and have to be explicitly tested.
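Where an encoding has to be explicitly tested, the test can be as simple as a script that checks the expected columns and value types. The sketch below does this for a delimited text encoding; the required columns (time, lat, lon, value) and their types are assumptions chosen for the example, not part of any standard.

```python
import csv
import io

# Hypothetical expectations for a simple observation table.
REQUIRED = {"time": str, "lat": float, "lon": float, "value": float}


def check_csv(text, required=REQUIRED):
    """Return a list of problems found; an empty list means all checks passed."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in required if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        for column, caster in required.items():
            try:
                caster(row[column])
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_no}: {column}={row[column]!r} "
                    f"is not {caster.__name__}"
                )
    return problems


good = "time,lat,lon,value\n2015-01-01T00:00:00Z,-33.9,18.4,16.4\n"
bad = "time,lat,lon,value\n2015-01-01T00:00:00Z,south,18.4,16.4\n"

print(check_csv(good))
print(check_csv(bad))
```

A check of this kind covers structure only; whether the values mean what a consumer expects remains a question of semantic interoperability.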

It will often only be possible to evaluate or test the compliance of a data set and/or service by submitting it to a validation service, but to our knowledge only a few such services exist or are in practical use. OGC makes several test services and suites available (OGC Validator 2007).

Resource Implications of Implementation

Free and open source software can aid implementation in the Earth and environmental observation domain and reduce the cost of deploying standardised data services. Offerings range from spatial databases (PostgreSQL), through data servers (GeoServer, Sensor Observation Services, OPeNDAP), to visualisation tools (Global Imagery Browse Services, OpenLayers).

Implementation requires staff with experience and knowledge in the domain of interest, spatial data, and computing. There is a growing need for this combination of skills, as seen in the emergence of careers in data science. Such staff contribute to systems development, configuration, and maintenance, as well as to content publication and standardisation. They may also assist with the development of vocabularies, name services, and content standards.

In practice, none of these ideal aspects of interoperability is likely to be fully realised, so brokering and mediation will be required. The target of such brokering or mediation can be any of the three types of interoperability. A major consideration is the extent to which it will be possible to persist such mediations for future re-use.

From this, we deduce that a truly interoperable environment can only be realised if communities of practice converge towards a workable set of syntactic (service), schematic, and semantic standards for the typical data families that the community uses, and that brokering and mediation services and definitions are visible and available to practitioners.

Text extracted from the Data Management Principles Implementation Guidelines