In a previous blog post I gave an introduction to the work we were doing on establishing multilingual vocabularies for key parts of the reference database. It’s now nearly 18 months later and, as we head into the final stages of the project, things have taken shape.
As a quick recap, each of the project partners contributing data to the reference database has worked through their catalogues and identified the descriptive terms used within that refer to or classify significant elements of a ceramic form. Learning from the methodology and using the tools developed for the (European Commission-funded) ARIADNE project by the Hypermedia Research Group at the University of South Wales, we deemed a neutral spine to which partners could map these terms preferable to establishing a whole new ontology bespoke to this project. As with the ARIADNE project, the Getty Research Institute’s Art & Architecture Thesaurus (AAT) proved a suitable candidate, with the added value of interoperability with ARIADNE and any other data also mapped to the AAT, thus giving the mapping work done by ArchAIDE a strong, sustainable base. In the initial phase of the project all partners agreed on a subset of AAT terms that would be used for this neutral spine, covering the following aspects of recording pottery from archaeological excavations:
- The sherd type
- The vessel form
- The decoration type
- The decoration colour
Later in the project, partners established a need for controlled terminologies to describe the type or characteristics of specific parts of a ceramic vessel. At the time of writing the AAT does not include this level of detail. As an alternative, and as a concerted effort to ensure that ArchAIDE data was interoperable with past and future projects, the project used the concepts of recording established by the original creators of the Roman Amphorae database (Simon Keay and David Williams). The concepts defined by Keay and Williams have since been converted to SKOS by a separate research project (our own Holly Wright!) and then made available as Linked Open Data via the ADS (UoY) triplestore. The following vocabularies were thus used as the spine to be followed by the ArchAIDE project:
- The characteristics of the rim
- The characteristics of the neck
- The characteristics of the shoulder
- The characteristics of the body
- The characteristics of the base
- The characteristics of the handle
As noted above, it is important to highlight the benefit of aligning the terms used by ArchAIDE with existing vocabularies. In the short term this offers concordance and definition; in the longer term it allows the archive to be re-used with an inherent and unambiguous understanding of the terminologies used within. The mapping of ‘native’ to neutral terms was undertaken by each project partner using the mapping tool created by the ARIADNE project. The template uses a simple iteration of SKOS which maps the native label (source) to the ArchAIDE term (target) and defines the level of match. In the illustrative example below from the Italian mapping, one can see the benefit of this approach: unguentario and balsamario can both be mapped to ointment vessel, while the precision of each match is also recorded in conceptual terms. Thus, for example, balsamario has no close or exact match, being a very particular way of describing a small container used to store balsams, but hierarchically has a broad match to what the Getty calls an ointment vessel.
Screenshot of the SKOS matching tool being used for Italian descriptions of ceramic form types.
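For readers who think in code, the source, target and match-level structure described above could be modelled roughly as follows. This is only a minimal sketch and not the ARIADNE mapping tool itself: the class and function names are my own, and the match level given for unguentario is an assumption made for illustration (only the broad match for balsamario is stated above).

```python
from dataclasses import dataclass
from enum import Enum


class Match(Enum):
    """SKOS mapping properties, listed from most to least precise."""
    EXACT = "skos:exactMatch"
    CLOSE = "skos:closeMatch"
    BROAD = "skos:broadMatch"


@dataclass(frozen=True)
class Mapping:
    native_label: str   # term as used in the partner's catalogue (source)
    language: str       # ISO 639-1 code of the native label
    target_label: str   # neutral spine term (target)
    match: Match        # how precisely source and target correspond


# The broad match for balsamario comes from the text above; the match
# level for unguentario is an ASSUMPTION made purely for illustration.
mappings = [
    Mapping("unguentario", "it", "ointment vessel", Match.CLOSE),
    Mapping("balsamario", "it", "ointment vessel", Match.BROAD),
]


def terms_for(target: str, rows: list[Mapping]) -> dict[str, str]:
    """Return the native labels mapped to a spine term, with their match level."""
    return {m.native_label: m.match.value for m in rows if m.target_label == target}


print(terms_for("ointment vessel", mappings))
```

Keeping the match level alongside each mapping means a later search can choose whether to include only exact and close matches or to widen the net to broad ones.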
The final mappings were reviewed to identify and resolve any misunderstandings or inconsistencies. At a later date a contribution to my 2017 ArchAIDE blog by Eleni Schindler Kaudelka drew attention to previous work in this sphere by Caroline Sourzat, and subsequently a copy of her Master’s thesis was used to add French terms and enhance the German terminologies. A further contribution, an e-print of an article from the Journal of Roman Pottery Studies supplied by Nicholas Cooper, an attendee of the December 2017 multiplier event, provided a basis for a mapping in Dutch. This was edited and refined by Leontien Talboom, a Digital Archivist at the ADS (UoY) and a native Dutch speaker. In late 2018, a mapping in Portuguese was also contributed by Guilherme DAndrea Curra. At the end of this phase, a total of 1338 mappings in seven European languages had been completed. Sincere thanks to all those who have contributed!
The mapped vocabularies have been uploaded to the ArchAIDE reference database as a reference resource and are being used for the manual and automated (via text recognition) recording of paper and digital ceramic catalogues. The vocabularies will also be the spine which allows users of the public application to cross-search the reference database in their own language and, as noted, to reconcile differences in understanding or classification.
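To give a feel for how a neutral spine supports cross-searching, here is a small sketch of a lookup that pivots from a term in one language, through the spine concept, to the terms in another. It is not how the ArchAIDE application is implemented; the function names are hypothetical, and apart from the two Italian ointment vessel terms the sample rows are invented placeholders rather than project data.

```python
from collections import defaultdict

# (native label, language, spine concept). Illustrative rows only: the
# German and "jug" entries are invented placeholders, NOT project data.
ROWS = [
    ("unguentario", "it", "ointment vessel"),
    ("balsamario", "it", "ointment vessel"),
    ("Salbgefäß", "de", "ointment vessel"),
    ("brocca", "it", "jug"),
]


def build_indexes(rows):
    """Index rows by (label, language) and by spine concept."""
    to_concept = {}
    by_concept = defaultdict(lambda: defaultdict(list))
    for label, lang, concept in rows:
        to_concept[(label.casefold(), lang)] = concept
        by_concept[concept][lang].append(label)
    return to_concept, by_concept


def translate(label, source_lang, target_lang, rows=ROWS):
    """Find terms in target_lang that share a spine concept with label."""
    to_concept, by_concept = build_indexes(rows)
    concept = to_concept.get((label.casefold(), source_lang))
    if concept is None:
        return []  # the term is not in the vocabulary for that language
    return list(by_concept[concept][target_lang])


print(translate("Salbgefäß", "de", "it"))
```

The point of the design is that no language is mapped directly to any other: every term only needs one mapping to the spine, and the spine does the rest.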
The Linked Data approach
As a separate phase of work to supporting the reference database and ArchAIDE application, UoY have persevered to create a sustainable application to allow the searching and integration of the multilingual vocabularies outside of the immediate project database. It has done this by taking a Linked Data approach: a style of publishing data on the World Wide Web that makes it easy to interlink, discover and consume data. This approach, as part of a wider vision of a Semantic Web, has been growing for nearly 20 years, as web pioneers have pushed for a move away from siloed data and a web of documents to a fully integrated, machine-readable web of data. A variety of archaeological data resources already make use of Semantic Web principles and technologies, and notably the ongoing collaborative European initiatives to reconcile diverse datasets are exploring and incorporating these methodologies.
Each of the 1338 ArchAIDE terms has then been assigned a unique uniform resource identifier (URI) by the ADS. Each URI follows a generic syntax built from the following components:
- The ADS Linked Data domain.
- The unique archaide repository.
- The language.
- The scheme (e.g. type_form).
- The concept identifier (e.g. platte)
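As a rough sketch, the components listed above could be assembled into a URI like this. The function name is hypothetical, and both the domain used in the example (taken from the ADS query interface address) and the exact order and layout of the path segments are assumptions on my part rather than a statement of the actual ADS scheme.

```python
from urllib.parse import quote


def concept_uri(domain, repository, language, scheme, concept):
    """Join the URI components, percent-encoding each path segment.

    The segment order mirrors the list above; the real ADS layout
    may differ (ASSUMPTION).
    """
    parts = [repository, language, scheme, concept]
    path = "/".join(quote(part, safe="") for part in parts)
    return f"http://{domain}/{path}"


# Domain assumed from the ADS query interface address; illustrative only.
print(concept_uri("data.archaeologydataservice.ac.uk",
                  "archaide", "nl", "type_form", "platte"))
```

Percent-encoding each segment matters for multi-word concepts, where spaces would otherwise produce an invalid URI.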
The complete dataset (with SKOS mappings and URIs) was converted to Comma Separated Values, with individual files for each language. This was then converted to the N-Triples format, a plain text serialisation format for RDF (Resource Description Framework) graphs, commonly used in the Linked Data approach. Graph data are organised into a three-part relationship of subject, predicate and object, also referred to as a triple. Thus, graph databases are often referred to as triplestores. The conversion was achieved using an XSLT, openly available as part of the tools developed by the ARIADNE project. Thanks to Ceri Binding for his help in this process!
Within the XSLT, the user is able to select the native language of the file to be converted using ISO language identifiers (e.g. nl for Dutch), which ensures that any labels are accompanied by this information, for example: “platte (nl)”. The resulting N-Triples were also output using UTF-8 encoding to ensure that non-ASCII characters were retained. Due to the detailed nature of the recording undertaken by project partners and collaborators, 2560 N-Triples were created.
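The actual conversion was done with the ARIADNE XSLT, but the idea of turning CSV rows into language-tagged triples can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the column names (uri, label, match, target), the example.org URIs, and the choice of skos:prefLabel for the label.

```python
import csv
import io

SKOS = "http://www.w3.org/2004/02/skos/core#"


def escape_literal(text):
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def rows_to_ntriples(csv_text, lang):
    """Emit one label triple and one mapping triple per CSV row.

    Each line is a subject, predicate and object terminated by " .".
    Column names are hypothetical, not those of the project files.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{row['uri']}>"
        label = escape_literal(row["label"])
        lines.append(f'{subject} <{SKOS}prefLabel> "{label}"@{lang} .')
        lines.append(f"{subject} <{SKOS}{row['match']}> <{row['target']}> .")
    return "\n".join(lines) + "\n"


# Placeholder URIs throughout; none of these are real identifiers.
SAMPLE = (
    "uri,label,match,target\n"
    "http://example.org/archaide/nl/type_form/platte,platte,"
    "closeMatch,http://example.org/spine/placeholder\n"
)

ntriples = rows_to_ntriples(SAMPLE, "nl")
utf8_bytes = ntriples.encode("utf-8")  # UTF-8 keeps non-ASCII labels intact
print(ntriples)
```

Note how the language selected for the file ends up as a tag on the label literal (the @nl suffix), which is what lets a consumer tell a Dutch label from an Italian one.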
On its own, the graph data created by ArchAIDE is not yet Linked Open Data. To achieve this next step the N-Triples were uploaded to the ADS triplestore (AllegroGraph 4.6.1). Upon upload they are stored in the RDF/XML format.
RDF for the ArchAIDE concept of “fondo ad anello”.
At the time of writing the ArchAIDE data have their own repository within the triplestore (to keep the data discretely packaged), which has been copied into the publicly accessible repository which the ADS use for dissemination of all their triples. The ADS use Pubby, an Open Source software (https://github.com/cygri/pubby), to provide a Linked Data interface on top of a SPARQL endpoint. SPARQL is a declarative query language (like SQL) for performing data manipulation and data definition operations on data represented in a Linked Data format. SPARQL endpoints can be accessed only by SPARQL client applications that use the SPARQL protocol. They cannot be accessed by the growing variety of Linked Data clients. Pubby is designed to provide a Linked Data interface to those RDF data sources. In this instance the ADS interface (http://data.archaeologydataservice.ac.uk/query/) offers a very simple web-based interface to allow users to generate their own SPARQL, which in turn generates an XML output. For ArchAIDE, a simple query to find all concepts mapped to “funnel” would be:
SPARQL to find all ArchAIDE terms mapped to AAT concept for a funnel.
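For those who would rather script this than use the web form, a query of that general shape can be built and wrapped in a request URL as sketched below. This is not the query shown in the figure: the target URI is a placeholder (the real AAT identifier for funnel would need to be looked up), the endpoint address is hypothetical, and I am assuming the data use the skos:exactMatch, skos:closeMatch and skos:broadMatch properties together with skos:prefLabel.

```python
from urllib.parse import urlencode

MATCH_PROPERTIES = ("exactMatch", "closeMatch", "broadMatch")


def build_query(target_uri):
    """Build a SPARQL query for concepts mapped to target_uri at any match level."""
    branches = " UNION ".join(
        f"{{ ?concept skos:{prop} <{target_uri}> }}" for prop in MATCH_PROPERTIES
    )
    return (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT DISTINCT ?concept ?label WHERE {\n"
        f"  {branches}\n"
        "  ?concept skos:prefLabel ?label .\n"
        "}"
    )


def request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a SPARQL protocol GET."""
    return endpoint + "?" + urlencode({"query": query})


# Placeholder target and endpoint; neither is a real address (ASSUMPTION).
FUNNEL_URI = "http://example.org/aat/funnel-placeholder"
query = build_query(FUNNEL_URI)
print(query)
print(request_url("http://example.org/sparql", query))
```

The sketch stops short of sending the request, so it can be run without a network connection; any HTTP client could fetch the resulting URL.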
At the time of writing it is more common for users of Linked Open Data to use their own desk-based software or web applications to cross-search multiple triplestores and SPARQL endpoints. Common applications allow a user to simply point their application at whatever sources exist. Alternatively, it is now common for programmers to design bespoke services or widgets to interact with triplestores in a more easily understood fashion, effectively acting as an intermediary to provide a lookup of a term against a LOD vocabulary. There are now a growing number of powerful tools and APIs for visualising a single dataset, and the relationships between the concepts. For example, a user can connect to the ADS triplestore (under the terms of a CC0 licence) using a tool with a built-in API for working with LOD datasets (for example LodLive, although other tools exist). Within this application they can see the node for the ArchAIDE vocabularies, with concepts defining what they are, what they contain, licence, creators and so on.
Simple view of the ArchAIDE Linked Open Data, using the LodLive API.
The structure of the data allows a user to explore nodes graphically or semantically, and in addition to view and query the relationships between whatever other data the application is either incorporating (via other SPARQL endpoints) or linked to by the ArchAIDE data. In the simple example presented below, the ArchAIDE node for “Form type” is displayed with several nodes expanded to include their link to the AAT Linked Data concepts. The AAT data is effectively live, with the user able to see any other nodes connected to that concept, effectively allowing them to find other data similarly aligned or related. Conversely, a user new to ArchAIDE would be able to see how their data relates to that generated by this project, and through querying establish similarities and new routes of inquiry.
Example of ArchAIDE concepts describing the form of a vessel, linked to live AAT data (the concept of a jug or pitcher), with further nodes and branches from the AAT concept allowing further exploration. LodLive API.
The data held in the ADS graph database (triplestore) now forms an easily queried and integrated resource that it is hoped will be incorporated within future projects, either as a small resource for alignment within future catalogues, or for cross-searching against the wider contents of the Semantic Web.
The ADS will continue to curate the triples as part of its ongoing work on developing its capacity for delivery of Linked Open Data. In addition (and for those who really aren’t keen on Linked Data!) the original data has been deposited with the ADS in CSV format as part of the overall package for ensuring the key outputs of ArchAIDE are archived in perpetuity.
On a final note, I’d like to send my sincere thanks to all the project partners (Gabrielle, Francesca, Eva, Marisol and Michael) who persevered with my SKOS work and helped create such a rich resource. They’ve certainly increased my Italian, Spanish, Catalan and German vocabularies! And of course, to Massimo, who has helped build these into the final database.
Overall this has been an incredibly fun piece of work, and I’m grateful to all in the ArchAIDE project for facilitating and supporting this small piece of research and development. Hopefully others will take this forward!