Creating the ORE 2014 Dataset

The Reasoning Competition as part of the OWL Reasoner Evaluation Workshop 2013 was quite exciting and led to happy smiles on the faces of many young reasoner developers. For us, the organising team, however, it posed a series of quite difficult and often unanticipated challenges, some of which were solved quite elegantly, some a bit ad hoc and perhaps not to the degree of quality we would have liked. Among the latter we counted the creation of a challenging, interesting and representative dataset to fuel our benchmarking framework. The 2013 dataset was not bad, probably better than most things we could have obtained by other means, but it was skewed by many technical details, some of which we set out to address for the fresh dataset to be released this year. This article discusses some of the challenges in creating the dataset and our attempts to address them:

  1. Data Gathering (How do we get the raw data?)
  2. Data Curation (How do we get to a set of potential candidates to sample from?)
  3. Data Set Assembly (Filtering and Binning)
  4. Determining an execution order for the competition (specific to ORE)

Challenge 1: Gathering the data

The first, and often biggest, problem is to find suitable data in the first place. There are generally four ways to obtain data, in our case, relevant ontologies:

  1. We can ask the community to submit ontologies they might find interesting in some way.
  2. We can tap known curated repositories such as Tones or BioPortal.
  3. We can tap known indexes of potential ontologies and filter them (Swoogle, Manchester OWL Repository, Watson).
  4. We can generate ontologies artificially:
    1. From existing ontologies (subsets, modules, approximations).
    2. From patterns (KCNF generator, TBox generating techniques).

The first approach has the advantage that the resulting set will be highly relevant (at least one expert in the community found the ontology relevant for our case) and might yield ontologies that are not accessible by any other means, for example because they are not directly available on the web. The main disadvantage, however, is that such “calls for data” are generally only answered by a few merciful souls, and the resulting set will be small and insufficient for a comprehensive competition such as the ORE Reasoner Competition. Moreover, due to the extreme bias introduced by the limited reach of our call, the resulting set can hardly be called representative of ontologies on the web, let alone ontologies in general.

The second approach looks slightly more promising: some hubs of ontologies, most notably BioPortal, provide rich sources, and despite the topical bias towards bio-health and life-science knowledge bases, they offer a much larger, more diverse and less biased collection of relevant data (for example due to their increased reach and the authenticity of the knowledge engineers’ intentions in providing the data). However, individual repositories are again limited in their representativeness, their diversity (also due to their mere size) and possibly their topical scope (whether that introduces relevant bias if we are merely interested in logical structure is at the very least unknown).

This leads us to the third approach: tapping a comprehensive index of potential ontologies, for example one gathered by a large-scale web crawl. Here we generally do the following:

  1. We obtain an index of an existing crawl (in principle simply a list of URLs that reference files we suspect may contain axioms), such as Swoogle, the Manchester OWL Repository or Watson.
  2. We create a snapshot (downloading all accessible files, by some reasonable notion of accessibility), essentially creating a corpus of unfiltered files.
  3. We parse and filter the snapshot according to some predefined minimal criteria.
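The three steps above can be sketched in a few lines of Python. The index format (a plain list of URLs) and the minimal filter criterion used here (a byte-size threshold) are illustrative assumptions, not the actual ORE pipeline:

```python
import os
import urllib.request

def build_snapshot(index_urls, out_dir, min_bytes=100):
    """Download every reachable URL from a crawl index (step 2),
    keeping only files that pass a minimal filter (step 3)."""
    os.makedirs(out_dir, exist_ok=True)
    kept = []
    for i, url in enumerate(index_urls):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
        except Exception:
            continue  # unreachable URLs are simply skipped
        if len(data) < min_bytes:
            continue  # minimal criterion: discard near-empty files
        path = os.path.join(out_dir, "candidate_%06d.owl" % i)
        with open(path, "wb") as f:
            f.write(data)
        kept.append(path)
    return kept
```

In the real setting, step 3 would additionally attempt to parse each file as an ontology; the size threshold merely stands in for such a check.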

The advantage of the resulting set is its diversity (also due to its size), its lack of topical bias, and its high degree of representativeness of ontologies on the web, and thus at least to some degree of ontologies in general. However, these desirable properties come at a great cost: the filtering procedure is far from straightforward; the snapshot is generally polluted by duplicate files, toy (educational) ontologies and ontologies published in a faceted fashion, and no cleaning procedure will be able to deal with this problem completely.

The fourth approach, creating ontologies artificially, has the advantage that the size of the resulting corpus is potentially infinite, generally scales nicely in terms of problem input size (for example, the size of the TBox or the modal depth of KCNF clauses), and is almost readily available. On the other hand, the resulting set can hardly be called representative, as there is no easy way to make sure that the problem distribution corresponds in any way to the distribution of the population (i.e., all ontologies). However, some generating techniques such as modularisation, approximation and random subsets generate problems that are at least to some degree relevant (modules in particular are becoming quite essential to modern reasoning techniques), and the potential to create large test sets remains appealing.

For ORE 2014, we decided to resort to all four approaches.

  1. We have issued a call for ontologies to the ORE community, gathering some hard and interesting ontologies from knowledge engineers all over the world.
  2. We have obtained a fresh snapshot of BioPortal using BioPortal’s new, improved web services, defining a policy to deal with archives (if an archive contains one file, extract that; if it contains more than one, merge them together).
  3. We have created a fresh version of the Manchester OWL Corpus.
  4. We have scraped a copy of the Oxford Ontology Library.
  5. We have augmented our corpus with a set of approximations, using approximation techniques developed by the TrOWL team, mainly to cast highly expressive ontologies into the EL, RL and QL profiles of OWL 2.

All ontologies are collected in a pool of uncurated potential candidates for the ORE corpus.

Challenge 2: Data Curation

One of the main problems with our competition in 2013 was correctly determining the profile an ontology belongs to. Mere syntactic violations, such as multi-element expressions like DifferentIndividuals with a single element, led to a lot of ontologies being filtered out needlessly. Many such cases can be repaired without impairing the logical structure of an ontology. For this year’s ORE, we decided to fully DL-ify the entire corpus (the pool) by:

  1. Replacing illegal non-absolute IRIs with absolute ones;
  2. Declaring undeclared entities (classes, data properties, object properties, datatypes);
  3. Dropping all other axioms that violate the OWL 2 DL profile (in the order they appear), using the repair functionality of the OWL API’s OWLProfileViolation tools.
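The actual repairs were performed with the OWL API (in Java). As a rough illustration only, steps 2 and 3 of the repair loop might look like the following Python sketch, with a toy axiom representation and a caller-supplied profile check standing in for the OWLProfileViolation machinery:

```python
def dlify(axioms, declared_entities, is_dl_legal):
    """Return (repaired_axioms, added_declarations).
    `axioms` is a list of (kind, entities) pairs; `is_dl_legal`
    decides whether an axiom conforms to the OWL 2 DL profile."""
    declarations = set()
    repaired = []
    for kind, entities in axioms:
        # step 2: declare any entity the axiom uses but the
        # ontology never declared
        for e in entities:
            if e not in declared_entities:
                declarations.add(e)
        # step 3: drop axioms that still violate the profile
        if is_dl_legal(kind, entities):
            repaired.append((kind, entities))
    return repaired, declarations
```

For example, a DifferentIndividuals axiom with a single element would fail `is_dl_legal` and be dropped, while its lone individual still gets declared.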

Because this potentially changes the logical structure of an ontology significantly, the ORE corpus no longer lends itself to language surveys and ontology pattern surveys (resort to MOWLCorp instead, which keeps ontologies in their original form). Moreover, since we expect reasoners to deal with ontologies only up to the expressivity of OWL 2 DL, we decided to drop all (DL-safe) rules from the ontologies: there is no requirement for reasoners to deal with them in general, which might introduce an unnecessary computational disadvantage for those reasoners that do. Annotations are dropped as well, as they should have no effect on reasoning, to reduce the size of the serialised corpus.

A second problem with the corpus is the fluctuating accessibility of imports. To address this, the final version of our corpus contains each ontology merged with its imports closure. This is necessary because reasoning services such as classification require access to the imports closure, and we need to be certain it is available at the time of the competition.

The third problem we frequently encountered was bugs in the serialisers shipped with the OWL API: ontologies might be exportable into a particular format, but not readable again. We verified the faithfulness of the transformation into another serialisation by approximating round-trippability: load the old version, serialise it into the desired format, load the ontology from that serialisation, and check whether the old and the new version contain the same number of TBox axioms.

Among the other problems we encountered along the way:

  • Some ontologies were shipped with local imports.
  • Some ontologies had resolvable imports to redirects that contained gibberish.
  • Some ontologies caused OWL Full problems upon the first loading attempt (empty or single-entity n-ary operators, i.e. expressions and n-ary axioms).
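The round-tripping check is simple to sketch; the serialise/parse/count helpers here are hypothetical stand-ins for the corresponding OWL API calls:

```python
def roundtrips(ontology, serialise, parse, count_tbox):
    """Approximate round-trippability: serialise into the target
    format, parse the result back, and compare TBox axiom counts."""
    blob = serialise(ontology)
    reloaded = parse(blob)
    return count_tbox(reloaded) == count_tbox(ontology)
```

Comparing axiom counts rather than axiom sets is deliberately cheap: it will not catch every unfaithful serialiser, but it reliably flags the common case where axioms are silently lost on reload.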

The final set was exported into OWL Functional Syntax, the one format supported by all competing reasoners.

Challenge 3: Data Set Assembly

The next big step in the corpus creation pipeline is to assemble the actual corpus. This step consists of two main sub-processes:

  1. Filtering
  2. Binning

The first process (filtering) can, in all generality, be described as follows:

  1. Define inclusion/exclusion criteria for each individual ontology (size, expressivity)
  2. Implement criteria, filter corpus
  3. Cluster Sample
    1. Define clustering function C(O1,O2,…), where O1, O2, … are the ontologies in the curated set and C() returns a set of clusters
    2. Define sampling strategy to obtain representatives of the clusters returned by C
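The cluster-sample step (3) can be sketched minimally, assuming the clustering function C is induced by a feature key and the sampling strategy is one random representative per cluster (both are illustrative simplifications of the general scheme):

```python
import random

def cluster_sample(ontologies, feature, rng=None):
    """Group ontologies by a feature key (a stand-in for C) and
    draw one random representative per cluster."""
    rng = rng or random.Random(0)
    clusters = {}
    for o in ontologies:
        clusters.setdefault(feature(o), []).append(o)
    return [rng.choice(members) for members in clusters.values()]
```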

This process can be iterative: it might be that we first apply one set of cluster-sampling techniques and then another set on the resulting set. For ORE 2014, however, we omitted sophisticated cleaning procedures and merely reduced sets of byte-identical ontologies (after serialisation into a common format, OWL Functional Syntax) to a single representative, and excluded ontologies

  • with fewer than 50 logical axioms
  • with fewer than 5 class names
  • that were not OWL 2 DL (nothing was actually excluded by this criterion, because the corpus had been fully DL-ified, as described before)
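This filter can be sketched directly, assuming simple accessor functions for the serialised text, the logical axiom count, and the number of class names (all illustrative):

```python
import hashlib

def filter_corpus(ontologies, serialise, logical_axioms, class_names):
    """Collapse byte-identical duplicates (after common serialisation)
    and drop ontologies below the two size thresholds."""
    seen, kept = set(), []
    for o in ontologies:
        digest = hashlib.sha256(serialise(o).encode()).hexdigest()
        if digest in seen:
            continue  # byte-identical duplicate after serialisation
        seen.add(digest)
        if logical_axioms(o) < 50 or class_names(o) < 5:
            continue  # too small to be a meaningful challenge
        kept.append(o)
    return kept
```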

Binning generally works as follows.

  1. Specify the desired bins:
    1. bin size
    2. exclusion/inclusion criteria
    3. if-too-small-policy (what to do if the corpus contains fewer ontologies matching the criteria in (2) than the desired size specified in (1); for example, filling up with artificial instances, or no action at all)
    4. if-too-large-policy (what to do if the corpus contains more ontologies matching the criteria in (2) than the desired size specified in (1); for example, random sampling, or stratified sampling)
    5. possibly an ordering describing a prioritisation of which ontologies to include first (harder ones, inconsistent ones, etc.); the ordering might be tied to some additional bins and some sampling function
  2. Implement a method to assign a given ontology O to a bin
  3. Implement the specified policies to obtain the desired bins
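The general procedure might be sketched as follows; the bin specification format and the policy signatures are assumptions made for illustration:

```python
import random

def make_bins(corpus, specs, rng=None):
    """specs: list of (name, include, size, too_small, too_large).
    `include` is the inclusion predicate; `too_small`/`too_large`
    are policies applied when a bin misses or exceeds its size."""
    rng = rng or random.Random(0)
    bins = {}
    for name, include, size, too_small, too_large in specs:
        members = [o for o in corpus if include(o)]   # step 2: assign
        if size is not None and len(members) > size:  # step 3: policies
            members = too_large(members, size, rng)
        elif size is not None and len(members) < size:
            members = too_small(members, size, rng)
        bins[name] = members
    return bins
```

Leaving `size` as None corresponds to the open bin sizes we use for ORE below.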

For ORE, the binning was done as follows:

  1. Specification: We need a total of 6 test sets: for both DL and EL, we create one set for each of the three reasoning services we cover:
    1. Classification
    2. Realisation
    3. Consistency checking
  2. We leave the bin size open, to get the largest possible number of challenges for the reasoners
  3. Inclusion criteria by bin:
    1. Classification: no special criteria
    2. Consistency: no special criteria
    3. Realisation: more than 100 ABox axioms
    4. DL: ontologies that fall under the DL profile, but not the QL or EL profile
    5. EL: ontologies that fall under the EL profile
  4. The if-too-small and if-too-large policies are omitted (we take as much as we can get); prioritisation is omitted because of Challenge 4.
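Put together, the ORE bin assignment can be sketched as a single function; the profile set and the ABox axiom count are assumed to be provided by the caller:

```python
def ore_bins(profiles, abox_axioms):
    """Return the ORE test sets an ontology enters, given the set of
    OWL 2 profiles it falls under and its number of ABox axioms."""
    if "EL" in profiles:
        lang = "EL"
    elif "DL" in profiles and "QL" not in profiles:
        lang = "DL"
    else:
        return []
    sets = [lang + "-classification", lang + "-consistency"]
    if abox_axioms > 100:
        sets.append(lang + "-realisation")
    return sets
```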

Challenge 4: Sorting the order of the test sets

The last challenge in the creation of the ORE dataset stems from the necessity of reducing the probability that a given reasoner encounters only problems it is particularly optimised for. In other words, we want to avoid having, due to chance, the first 100 ontologies to be classified all contain a particular (always the same) problem that is easy for reasoner 1 but hard for reasoner 2, while another arbitrary problem is easier for reasoner 2 than for reasoner 1. This problem is specific to the competition and will be of no relevance to other attempts to assemble a corpus. The main reason we need to address it here is that we did not apply any further clustering methods in the filtering process: that way, we deliberately allow large clusters of similar problems in our corpus, which a mere (randomisation-based) shuffle would not be able to hide. To address this, we determine a file order for all reasoning problems in the competition by first binning the ontologies in the ORE corpus by size

  • Very small: < 100 logical axioms
  • Small: 100 – 1000 logical axioms
  • Medium: 1001 – 10000 logical axioms
  • Large: 10001 – 100000 logical axioms
  • Very large: > 100000 logical axioms

and then iteratively selecting one candidate at random from each bin until all bins are empty. This results in a higher probability of interesting and diverse cases at the beginning of the competition and a simpler tail, since no reasoner is really expected to get through the entire set in a reasonable amount of time.
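The whole ordering procedure can be sketched as follows (the fixed random seed is an illustrative choice, not part of the actual setup):

```python
import random

# bin edges: very small (<100), small (100-1000), medium (1001-10000),
# large (10001-100000), very large (>100000)
SIZE_EDGES = (99, 1000, 10000, 100000)

def execution_order(ontologies, logical_axioms, rng=None):
    """Bin by logical axiom count, then repeatedly draw one random
    candidate per non-empty bin until all bins are exhausted."""
    rng = rng or random.Random(42)
    bins = [[] for _ in range(len(SIZE_EDGES) + 1)]
    for o in ontologies:
        n = logical_axioms(o)
        bins[sum(n > e for e in SIZE_EDGES)].append(o)
    order = []
    while any(bins):
        # one random draw per non-empty bin, round-robin style
        for b in bins:
            if b:
                order.append(b.pop(rng.randrange(len(b))))
    return order
```

Note that the round-robin draw guarantees that each of the first five problems comes from a different size bin (as long as all bins are non-empty), which is exactly the diverse head the competition wants.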