DiSHACLed Dataset and Data Service Discovery

Living Document,

This version:
https://dishacled.github.io/discovery-specification/
Issue Tracking:
GitHub
Editor:
(Ghent University - imec)

Abstract

This specification defines an algorithm and the logical processes for a client application to discover a (set of) RDF dataset(s) and/or data service(s) from one or more DCAT-AP data catalogs, based on a given input SHACL shape or SPARQL query.

1. Introduction

The DCAT Application Profile for data portals (DCAT-AP) [DCAT-AP] is the official standard in Europe for describing datasets and data services in catalogues on the Web. It provides a common model that enables interoperability across data portals, making it easier to publish and share metadata consistently. This interoperability has been instrumental in improving the visibility and accessibility of open data across domains and EU member states.

However, the current design and usage of DCAT-AP are primarily oriented towards human-driven discovery. Catalogues are typically accessed through graphical interfaces where users browse, filter, and search for resources based on keywords or metadata attributes. While this supports transparency and accessibility for a wide audience, it limits the potential for automated, machine-based discovery and integration of datasets and data services. In practice, many use cases—such as federated analytics, cross-domain applications, or dynamic data integration pipelines—require more than keyword search: they need support for automated identification of datasets and services that match specific structural and semantic requirements.

This specification aims to address this gap by defining (i) a data model extension for DCAT-AP that allows describing semantically what is contained within RDF [RDF11-CONCEPTS] datasets and what is expected or produced by RDF data services; and (ii) an algorithm for automated discovery and selection of RDF datasets and RDF data services over DCAT-AP catalogues, based on formal input constraints expressed as either a SHACL shape [SHACL] or a SPARQL query [SPARQL-QUERY].

By enabling discovery through such formal specifications, the proposed approach allows clients to automatically determine the relevance of available resources, thereby supporting machine-to-machine interoperability. This contributes to realising the full potential of DCAT-AP not only as a tool for publishing metadata for humans but also as a foundation for automated data ecosystems.

1.1. Document Conventions

Within this document, the following namespace prefix bindings are used:

Prefix     Namespace IRI                                  Description
dcat:      http://www.w3.org/ns/dcat#                     [VOCAB-DCAT-3]
dcterms:   http://purl.org/dc/terms/                      [DCTERMS]
rdf:       http://www.w3.org/1999/02/22-rdf-syntax-ns#    [RDF11-SCHEMA]
rdfs:      http://www.w3.org/2000/01/rdf-schema#          [RDF11-SCHEMA]
sh:        http://www.w3.org/ns/shacl#                    [SHACL]
skos:      http://www.w3.org/2004/02/skos/core#           [SKOS-REFERENCE]
xsd:       http://www.w3.org/2001/XMLSchema#              [XMLSCHEMA11-2]

1.2. Terminology

2. Describing DCAT(-AP) resource content with shapes

This section defines a set of alternative ways in which DCAT(-AP) resources, specifically referring to RDF datasets or data services, can be extended to include references to data shapes that semantically describe their content. The extensions are designed to be minimal and compatible with existing DCAT-AP profiles.

Note: The § 3 Discovery logical flow and related § 4 Algorithm specification are designed to support all the following data model extensions. In the presence of more than one alternative, a discovery client will not prioritise any of them and should treat them as a logical OR relation.

2.1. Relating to shapes via dcterms:conformsTo

In this approach we establish the use of the existing property dcterms:conformsTo [DCTERMS], as described by [VOCAB-DCAT-3] and [DCAT-AP], to link an RDF resource defined as part of a catalog to a shape that describes its content.

In [VOCAB-DCAT-3], the use of dcterms:conformsTo is informally suggested for linking a resource to a relevant standard that specifies its model, schema, ontology, or application profile, among others. SHACL shapes may be considered a type of schema, as they define the structure and constraints that RDF instance data must adhere to.

No further clarifications are provided in [DCAT-AP] regarding the expected use of dcterms:conformsTo, beyond the informal suggestion to link to relevant standards, understood as instances of dcterms:Standard.

Figure 1 shows how any kind of RDF resource that has been declared in a catalog may be linked to a related shape via dcterms:conformsTo.

Figure 1: DCAT resources linked to a SHACL shape via dcterms:conformsTo.

Example 1 shows corresponding RDF representations for a dataset, distribution, and data service, respectively linked to their shapes.

RDF representation in Turtle format of a dcat:Dataset linked to a sh:NodeShape via dcterms:conformsTo
ex:myDataset a dcat:Dataset;
    dcterms:title "My Dataset";
    dcterms:conformsTo ex:myShape.

ex:myShape a sh:NodeShape;
    sh:targetClass ex:MyClass;
    sh:property [
        sh:path ex:myProperty;
        sh:datatype xsd:string;
        sh:minCount 1;
    ].

Note: A shape related to a data service may describe the constraints of the service's input data, its output data, or both. With the dcterms:conformsTo property, no distinction can be made about the role of a related shape (see § 2.2 Relating to shapes via dcat:qualifiedRelation for an alternative where this is possible). However, linking multiple shapes to a data service should not cause conflicts in the discovery process, given its inclusive nature based on a logical OR operation.

2.2. Relating to shapes via dcat:qualifiedRelation

In this approach we establish the use of the property dcat:qualifiedRelation, as described by [VOCAB-DCAT-3] and [DCAT-AP], to indirectly link an RDF resource defined as part of a catalog to a shape that describes its content. This approach follows an n-ary [SWBP-N-ARYRELATIONS] or qualified [LDPATTERNS] data modelling pattern.

In [VOCAB-DCAT-3], the use of dcat:qualifiedRelation is informally specified as a way to define semantically richer relations among resources. This is accomplished with the introduction of the dcat:Relationship class, on which specific roles can be set to further describe the nature of the relationship between a resource and another asset. For specifying types of roles, the use of controlled vocabularies is encouraged, which is reflected by the existing subclass relation between the dcat:Role class and skos:Concept.

No further guidelines or usage notes are given for dcat:qualifiedRelation and dcat:Relationship in [DCAT-AP].

Figure 2 shows how a shape can be linked to any type of resource using the dcat:qualifiedRelation property.

Figure 2: DCAT resources linked to a SHACL shape via dcat:qualifiedRelation.

In practice, for the discovery process, a client needs to traverse the property path dcat:qualifiedRelation/dcterms:relation in search of an associated shape. This specification does not mandate a specific role concept value via dcat:hadRole. A discovery client must assess all existing qualified relations (see § 3 Discovery logical flow).

Note: [VOCAB-DCAT-3] suggests the use of coded term lists such as [IANA-RELATIONS], or [ISO-19115-1], to specify a dcat:Relationship role via the dcat:hadRole property. However, none of the recommended vocabularies defines a specific role type that conveys a relationship between a resource and an associated shape. A semantically similar role type is the describedby relation defined by [IANA-RELATIONS], but we opt for a lenient position since no established standard exists for this yet.

Example 2 shows corresponding RDF representations for a dataset and data service, respectively linked to their shapes via dcat:qualifiedRelation. According to [VOCAB-DCAT-3], the domain of dcat:qualifiedRelation is dcat:Resource, which excludes dcat:Distribution instances.

RDF representation in Turtle format of a dcat:Dataset linked to a sh:NodeShape via dcat:qualifiedRelation/dcterms:relation
ex:myDataset a dcat:Dataset;
    dcterms:title "My Dataset";
    dcat:qualifiedRelation [
        a dcat:Relationship;
        dcat:hadRole <http://www.iana.org/assignments/relation/describedby>; # this is optional
        dcterms:relation ex:myShape
    ].

ex:myShape a sh:NodeShape;
    sh:targetClass ex:MyClass;
    sh:property [
        sh:path ex:myProperty;
        sh:datatype xsd:string;
        sh:minCount 1;
    ].

Note: The qualified relation pattern allows for a more granular description of the type of relation that a given shape has with a resource. For example, in Example 2 (Data Service Example), the qualified relations describe the shapes of the input and output data of a data service independently. This has no impact on the discovery process, as both relations shall be assessed to find relevant matches with respect to the given discovery constraints.

2.3. Relating to shapes via sh:shapesGraph

In this approach we establish the use of the property sh:shapesGraph, as defined by [SHACL], to link an RDF resource defined as part of a catalog to a shape that describes its content.

The sh:shapesGraph property is used in [SHACL] to indicate a graph where the validation shapes are defined. From the perspective of DCAT, the use of sh:shapesGraph constitutes a model extension that is not currently contemplated by either [VOCAB-DCAT-3] or [DCAT-AP].

Note: An example of such an extension can be observed in the work of Frank Michiel et al. [WEB-API-DISC], where they propose the use of sh:shapesGraph to link a dcat:Distribution to a graph containing its shape definitions.

Figure 3 shows how any kind of RDF resource that has been declared in a catalog may be linked to a related shape via sh:shapesGraph.

Figure 3: DCAT resources linked to a SHACL shape via sh:shapesGraph.

An associated shapes graph linked via sh:shapesGraph may contain one or more shape definitions, but these must be related only to the resource that links to them. This is necessary to avoid false positives during the discovery process.

Example 3 shows corresponding RDF representations for a dataset, distribution, and data service, respectively linked to their shapes via sh:shapesGraph.

RDF representation in TriG format of a dcat:Dataset linked to a shape via sh:shapesGraph
ex:myDataset a dcat:Dataset;
    dcterms:title "My Dataset";
    sh:shapesGraph ex:myShapesGraph.

ex:myShapesGraph {
    ex:myShape a sh:NodeShape;
        sh:targetClass ex:MyClass;
        sh:property [
            sh:path ex:myProperty;
            sh:datatype xsd:string;
            sh:minCount 1;
        ].
}

3. Discovery logical flow

This section describes the logical steps that a client needs to follow to perform a discovery process. An overview of the logical flow is shown in Figure 4.

Figure 4: Overview of the logical sequence of a discovery process.

In the next subsections a detailed description of each step is provided.

3.1. Discovery request

This is the initial step (labeled as 1️⃣ in Figure 4). A client starts a discovery process when it receives as input:

  1. a set of one or more catalog URLs (IRI), which must point to a dereferenceable RDF representation of a catalog; and

  2. a set of formal constraints, expressed either as an input SHACL shape (a string in any RDF serialization) or as an input SPARQL query (a string).
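The two inputs listed above can be sketched as a small data structure. This is only an illustrative sketch: the class name `DiscoveryRequest`, its field names, and the `constraint_kind` flag are hypothetical and not defined by this specification. The check for absolute HTTP(S) URLs is an assumption derived from the retrieval step, which relies on HTTP GET.

```python
from dataclasses import dataclass
from typing import Literal
from urllib.parse import urlparse


@dataclass(frozen=True)
class DiscoveryRequest:
    """Hypothetical container for the inputs of a discovery request."""

    # One or more catalog URLs, each expected to dereference to RDF.
    catalog_urls: tuple[str, ...]
    # The formal constraints as a string: a SHACL shape in some RDF
    # serialization, or a SPARQL query.
    constraints: str
    # Assumed flag telling the client how to interpret `constraints`.
    constraint_kind: Literal["shacl", "sparql"]

    def __post_init__(self) -> None:
        if not self.catalog_urls:
            raise ValueError("at least one catalog URL is required")
        for url in self.catalog_urls:
            parsed = urlparse(url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                raise ValueError(f"not an absolute HTTP(S) URL: {url!r}")
        if self.constraint_kind not in ("shacl", "sparql"):
            raise ValueError(f"unknown constraint kind: {self.constraint_kind!r}")
        if not self.constraints.strip():
            raise ValueError("constraints must not be empty")
```

A client could then construct a request such as `DiscoveryRequest(("https://example.org/catalog",), "ASK { ?s ?p ?o }", "sparql")`, where the catalog URL is a placeholder.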

3.2. Catalog retrieval

This step (labeled as 2️⃣ in Figure 4) involves retrieving the RDF representation of each catalog provided as input. A client must perform an HTTP GET request to each catalog URL, which must be dereferenceable to any RDF serialization (e.g. [TURTLE], [RDF11-XML], or [JSON-LD]).

Note: Even though in Figure 4 the catalog retrieval step is shown as a sequential process, a client may choose to perform the retrieval of multiple catalogs in parallel, to optimise the overall performance of the discovery process.
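The retrieval step, including the optional parallel retrieval mentioned in the note, could look roughly like the following sketch based on the Python standard library. The function names, the `Accept` header preference order, the timeout, and the worker count are all assumptions made for illustration; the specification only requires an HTTP GET that yields some RDF serialization.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed preference order; any RDF serialization is acceptable.
RDF_ACCEPT = "text/turtle, application/ld+json;q=0.9, application/rdf+xml;q=0.8"


def build_catalog_request(catalog_url: str) -> urllib.request.Request:
    """Prepare an HTTP GET request that asks for an RDF serialization."""
    return urllib.request.Request(
        catalog_url, headers={"Accept": RDF_ACCEPT}, method="GET"
    )


def fetch_catalog(catalog_url: str, timeout: float = 30.0) -> tuple[str, bytes]:
    """Dereference one catalog URL and return (content type, raw body)."""
    request = build_catalog_request(catalog_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get_content_type(), response.read()


def fetch_catalogs(catalog_urls: list[str], max_workers: int = 4) -> dict[str, tuple[str, bytes]]:
    """Retrieve several catalogs in parallel, keyed by catalog URL."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(catalog_urls, pool.map(fetch_catalog, catalog_urls)))
```

The returned content type tells the client which RDF parser to apply to the body; parsing itself is left out of this sketch.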

3.3. Extracting associated shapes

This step (labeled as 3️⃣ in Figure 4) involves parsing the RDF representation of each retrieved catalog to extract all the shapes associated with RDF resources declared in the catalog. A client must identify all resources of type dcat:Dataset, dcat:Distribution, and dcat:DataService, and for each resource, extract any associated shape(s) using the three approaches defined in § 2 Describing DCAT(-AP) resource content with shapes. Given the mutually inclusive relation of these approaches, a client must consider all of them when extracting shapes. That is, for every resource a client must:

  1. follow the property path dcterms:conformsTo and assess the existence of sh:NodeShape instances (§ 2.1 Relating to shapes via dcterms:conformsTo). The following SPARQL query expresses an equivalent operation to extract the relevant quads from a catalog instance.

    EQUIVALENT SPARQL QUERY
    PREFIX sh: <http://www.w3.org/ns/shacl#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dcat: <http://www.w3.org/ns/dcat#>

    SELECT ?shape WHERE {
        VALUES ?type { dcat:Dataset dcat:Distribution dcat:DataService }

        ?resource a ?type ;
            dcterms:conformsTo ?shape .

        ?shape a sh:NodeShape .
    }
    
  2. follow the property path dcat:qualifiedRelation/dcterms:relation and assess the existence of sh:NodeShape instances (§ 2.2 Relating to shapes via dcat:qualifiedRelation). The following SPARQL query expresses an equivalent operation to extract the relevant quads from a catalog instance.

    EQUIVALENT SPARQL QUERY
    PREFIX sh: <http://www.w3.org/ns/shacl#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dcat: <http://www.w3.org/ns/dcat#>

    SELECT ?shape WHERE {
        VALUES ?type { dcat:Dataset dcat:DataService }

        ?resource a ?type ;
            dcat:qualifiedRelation [
                dcterms:relation ?shape
            ] .

        ?shape a sh:NodeShape .
    }
    
  3. follow the property path sh:shapesGraph and assess the existence of sh:NodeShape instances within the referenced graph (§ 2.3 Relating to shapes via sh:shapesGraph). The following SPARQL query expresses an equivalent operation to extract the relevant quads from a catalog instance.

    EQUIVALENT SPARQL QUERY
    PREFIX sh: <http://www.w3.org/ns/shacl#>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dcat: <http://www.w3.org/ns/dcat#>

    SELECT ?shape WHERE {
        VALUES ?type { dcat:Dataset dcat:Distribution dcat:DataService }

        ?resource a ?type ;
            sh:shapesGraph ?shapesGraph .

        GRAPH ?shapesGraph {
            ?shape a sh:NodeShape .
        }
    }
    

The equivalent SPARQL queries listed above assume that all RDF quads for both resources and associated shapes are locally available as part of a catalog RDF document. However, a related shape may be linked as a remote resource. In such a case, a client must also dereference the shape’s IRI to retrieve its RDF representation.
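The three extraction paths can also be sketched procedurally. The following minimal sketch assumes, purely for illustration, that a parsed catalog is available as a set of `(subject, predicate, object)` string tuples for the default graph plus a dictionary of named graphs; the function and variable names are hypothetical. It covers only locally available shapes and does not dereference remote shape IRIs.

```python
DCAT = "http://www.w3.org/ns/dcat#"
DCTERMS = "http://purl.org/dc/terms/"
SH = "http://www.w3.org/ns/shacl#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
NODE_SHAPE = SH + "NodeShape"
RESOURCE_TYPES = {DCAT + "Dataset", DCAT + "Distribution", DCAT + "DataService"}

Triple = tuple[str, str, str]


def objects(triples: set[Triple], subject: str, predicate: str) -> set[str]:
    """Return all objects of triples matching the given subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def extract_shapes(
    default_graph: set[Triple], named_graphs: dict[str, set[Triple]]
) -> dict[str, set[str]]:
    """Map each catalog resource to the node shapes associated with it."""
    result: dict[str, set[str]] = {}
    for resource, predicate, rtype in default_graph:
        if predicate != RDF_TYPE or rtype not in RESOURCE_TYPES:
            continue
        # 1. dcterms:conformsTo
        candidates = objects(default_graph, resource, DCTERMS + "conformsTo")
        # 2. dcat:qualifiedRelation/dcterms:relation (not for distributions)
        if rtype != DCAT + "Distribution":
            for relation in objects(default_graph, resource, DCAT + "qualifiedRelation"):
                candidates |= objects(default_graph, relation, DCTERMS + "relation")
        # Keep only candidates typed as sh:NodeShape in the catalog graph.
        shapes = {c for c in candidates if (c, RDF_TYPE, NODE_SHAPE) in default_graph}
        # 3. sh:shapesGraph: node shapes declared inside the referenced graph
        for graph_name in objects(default_graph, resource, SH + "shapesGraph"):
            graph = named_graphs.get(graph_name, set())
            shapes |= {s for s, p, o in graph if p == RDF_TYPE and o == NODE_SHAPE}
        if shapes:
            result.setdefault(resource, set()).update(shapes)
    return result
```

Because the three approaches are combined as a logical OR, the shapes found through each path are simply merged into one set per resource.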

3.4. Determining resource relevance

This step (labeled as 4️⃣ in Figure 4) involves determining the relevance of each resource, for which one or more shapes were extracted, according to the given input SHACL shape or input SPARQL query expressing the client’s formal constraints.

A client must evaluate each resource’s shape against the input SHACL shape or input SPARQL query constraints by applying the query-shape subsumption algorithm described in § 4 Algorithm specification.

3.5. Presenting results

In this last step (labeled as 5️⃣ in Figure 4), the client collects and returns the set of resource identifiers (IRIs) that were deemed relevant with respect to the input constraints (input SHACL shape or input SPARQL query).
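Steps 2 to 5 of the logical flow can be tied together in a short orchestration sketch. The function `discover` and its callable parameters are hypothetical names introduced here for illustration; in particular, `is_relevant` merely stands in for the query-shape subsumption algorithm of § 4 and is not an implementation of it.

```python
from typing import Callable, Iterable


def discover(
    catalog_urls: Iterable[str],
    constraints: str,
    retrieve: Callable[[str], object],
    extract: Callable[[object], dict[str, set[str]]],
    is_relevant: Callable[[str, str], bool],
) -> list[str]:
    """Return the IRIs of resources deemed relevant to the constraints."""
    relevant: set[str] = set()
    for url in catalog_urls:
        catalog = retrieve(url)  # step 2: catalog retrieval
        # step 3: shapes associated with each resource in the catalog
        for resource, shapes in extract(catalog).items():
            # step 4: a resource is relevant if ANY of its shapes matches,
            # reflecting the logical OR between associated shapes
            if any(is_relevant(shape, constraints) for shape in shapes):
                relevant.add(resource)
    return sorted(relevant)  # step 5: present the resource identifiers
```

Passing the retrieval, extraction, and relevance functions as parameters keeps the flow independent of any particular RDF library or matching strategy.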

4. Algorithm specification

This section defines a query-shape subsumption algorithm that aims to determine the relevance of an RDF resource based on a given set of formal constraints, expressed in the form of a SHACL shape or a SPARQL query, and considering a set of associated SHACL shapes to which the RDF resource conforms. The algorithm decomposes queries and shapes into their constituent graph star patterns and compares them in search of logical overlaps that indicate relevance relations between the input query or shape and the set of shapes related to catalog resources.

First, a set of preliminary concepts is defined, followed by a description of the logical steps taken by the algorithm.

4.1. Preliminaries

The algorithm operation is specified upon the following concepts:

Conformance

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

References

Normative References

[DCAT-AP]
DCAT Application Profile for data portals in Europe. 10 July 2025. URL: https://interoperable-europe.ec.europa.eu/collection/semic-support-centre/solution/dcat-application-profile-data-portals-europe
[DCTERMS]
DCMI Usage Board. DCMI Metadata Terms. 20 January 2020. DCMI Recommendation. URL: https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
[RDF11-CONCEPTS]
Richard Cyganiak; David Wood; Markus Lanthaler. RDF 1.1 Concepts and Abstract Syntax. URL: https://w3c.github.io/rdf-concepts/spec/
[RDF11-SCHEMA]
Dan Brickley; Ramanathan Guha. RDF Schema 1.1. URL: https://w3c.github.io/rdf-schema/spec/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[SHACL]
Holger Knublauch; Dimitris Kontokostas. Shapes Constraint Language (SHACL). URL: https://w3c.github.io/data-shapes/shacl/
[SPARQL-QUERY]
Eric Prud'hommeaux; Andy Seaborne. SPARQL Query Language for RDF. URL: https://w3c.github.io/sparql-query/spec/
[VOCAB-DCAT-3]
Simon Cox; et al. Data Catalog Vocabulary (DCAT) - Version 3. URL: https://w3c.github.io/dxwg/dcat/
[XMLSCHEMA11-2]
David Peterson; et al. W3C XML Schema Definition Language (XSD) 1.1 Part 2: Datatypes. 5 April 2012. REC. URL: https://www.w3.org/TR/xmlschema11-2/

Informative References

[IANA-RELATIONS]
Link Relations. URL: https://www.iana.org/assignments/link-relations/
[ISO-19115-1]
ISO/TC 211. Geographic information -- Metadata -- Part 1: Fundamentals. 2014. International Standard. URL: https://www.iso.org/standard/53798.html
[JSON-LD]
Manu Sporny; Gregg Kellogg; Markus Lanthaler. JSON-LD 1.0. 3 November 2020. REC. URL: https://www.w3.org/TR/json-ld/
[LDPATTERNS]
Leigh Dodds; Ian Davis. Linked Data Patterns: A pattern catalogue for modelling, publishing, and consuming Linked Data. May 2012. URL: http://patterns.dataincubator.org/book/
[RDF11-XML]
Frank Manola; Eric Miller. RDF Primer. URL: https://w3c.github.io/rdf-primer/spec/
[SKOS-REFERENCE]
Alistair Miles; Sean Bechhofer. SKOS Simple Knowledge Organization System Reference. 18 August 2009. REC. URL: https://www.w3.org/TR/skos-reference/
[STAR-PATTERNS]
Farah Karim; Maria-Esther Vidal; Sören Auer. Compacting frequent star patterns in RDF graphs. April 2020. URL: http://dx.doi.org/10.1007/s10844-020-00595-9
[SWBP-N-ARYRELATIONS]
Natasha Noy; Alan Rector. Defining N-ary Relations on the Semantic Web. 12 April 2006. NOTE. URL: https://www.w3.org/TR/swbp-n-aryRelations/
[TURTLE]
Eric Prud'hommeaux; Gavin Carothers. RDF 1.1 Turtle. URL: https://w3c.github.io/rdf-turtle/spec/
[WEB-API-DISC]
Frank Michiel; et al. Enabling Automatic Discovery and Querying of Web APIs at Web Scale using Linked Data Standards. May 2019. URL: https://doi.org/10.1145/3308560.3317073