Members of the team have recently been undertaking work examining existing digital catalogues, principally Roman Amphorae and CeramAlex, and using the extant data to build a reference database to facilitate filtering of results alongside the image-based recognition. Some of this has been very simple, for example having room in the database to support classification of the labelling the part of the sherd being depicted, e.g. 'handle' or 'rim'. Others that at first seemed simple have transpired to be more difficult than I envisioned. For example terms used to describe the appearance or form of a vessel or sherd (or even decoration) may be consistent within a single schema, but not portable or applicable to other catalogues. In addition, how is the description of a term that is often subjective such as 'beaded' going to help an automated system?
Recently, we've also been considering of the use of geographical terms and/or geometries to assist in filtering results. The broad concept being that an extremely localised type 'xxx' will not usually appear (or has not been documented as found) in area 'xzy. For example, to look at the record from the Teliţa type of amphora, we can see that it has a distribution limited to 'Scythia' (also where it is manufactured), and mapped to the concept of 'Black Sea' in that particular system. Looking more closely, it seems there are 38 countries/areas used to classify distribution. These range from smaller entities such as Cyprus or Belgium to much broader terms such as 'The Levant' or 'Mediterranean region'. When dealing with more common types such as Mid Roman Amphora 5, these broad regions become more understandable (see below).
Record for Teliţa amphora type (http://dx.doi.org/10.5284/1028192)
Record for id Roman Amphora 5 type (http://dx.doi.org/10.5284/1028192)
So, thinking ahead. How could we build on this to try and help the application filter by where the sherd was found? And not only for Roman Amphora, but also collections from across archaeological periods and continental Europe, the Middle East and North Africa? To my mind there are three issues: text versus geometry, scale and consistency.
The initial proposal was to record countries or regions similar to that of the Amphorae collection. This is somewhat simplistic, and open to inconsistencies as the catalogue grows with the digitization of paper or museum collections. For example from a British perspective, do I record "Great Britain" "British isles" "United Kingdom", "England" or "Yorkshire"? And although I may pick a term that suits me, someone else digitizing a catalogue may well choose a different term based purely on subjective reasoning. A simple database like or equals statement then potentially misses a positive match, or perhaps even returns a false positive.
To get around this at the ADS we use the Getty Thesaurus of Geographical Names (TGN), a structured vocabulary, including names, descriptions, and other metadata for extant and historical cities, empires, archaeological sites, and physical features important to research of art and architecture. Mapping terms to Getty subjects (for example see the entry for 'England' or even 'Northern England') not only allows greater consistency, but also flexibility in searching and subsequent results. For the recent ARIADNE project, Holly successfully mapped spatial terms - including those within Roman Amphorae - to Geonames. Although my personal preference is for TGN, purely because it records historical regions such as Scythia or Britannia, I think mapping to modern terms in Geonames would be more useful. Especially as that system supports bounding boxes and polygons for higher tier administrative regions.
Polygon(s) for the extent of Repubblica Italiana in Geonames (http://www.geonames.org/3175395/repubblica-italiana.html)
Although this helps the accuracy of a 'spatial' filter, we're still just restricted to modern administrative regions, which may bear little resemblance to archaeological distribution. If we did want to move towards utilizing capacity of a spatial database, then we'd have to think about the following issues of using our own geometries.
- We could create overarching zones such as Roman Amphora (e.g. Baleric Islands or southern Britain) but with a spatial extent stored as a polygon. Types/Classes would then be mapped to these if appropriate. The positives are that these relatively simple to create and administer, the negatives are that we would have to decide on, and then create these 'zones' which for most of Europe will be a hassle. Plus, how detailed do we go?
- Each class/type has its own extent polygon(s). This is potentially more accurate than Option 1, but also time consuming and potentially inconsistent if the extents are done by different people (as with use of text terms, this could vary between detailed and very broad!)
- The recording of X/Y values for findspots of a particular class/type. This has the potential for a higher level of accuracy, and capacity to build more intuitive map searches. However this is again potentially time consuming for non-digital collections as well as catalogues (such as Amphora) that do not reference sites at all. There's also the danger that the bias of the coverage of some catalogues may unintentionally produce a skewed distribution that is not truly representative of the pottery type.
After talking this through with the project team we're going to investigate (in addition to the mapping of any text terms to Geonames) option 2. Although this will require a certain amount of creation and curation it will really help the application move beyond the restrictions of modern borders. In addition, the database will also look to support individual sites as points where they already exist in the catalogues we're using. There's also the potential for results from the user application to feed back into this, enhancing the coverage of the reference dataset.