MOWLCorp Curation Pipeline

The Manchester OWL Corpus (MOWLCorp) forms the backbone of the Manchester OWL Repository. It is assembled through wide-spread seeding, an ongoing web crawl, calls to the Google Custom Search API, and manual submissions.

Data Gathering
The data gathering layer has two main components that fill the MOWLCorp: a web crawl based on crawler4j, a Java-based framework for custom web crawling, and daily calls to the Google Custom Search API. Both components scan web pages for URLs that conform to a wide range of file extension patterns commonly associated with ontologies, such as .owl, .rdf, .owl.xml, .ttl, and so on. These URLs are stored as candidates, together with a timestamp and an HTTP status code. The candidates identified in this way are pooled with the manually collected seeds and fed to a downloader, which attempts to obtain the file corresponding to each URL, assigns it an identifier, and stores it, together with some metadata, in a specified directory.
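As an illustration, a candidate filter for the crawl could look like the following minimal sketch. The class name, the extension list, and the recordCandidate hook are our own illustrative choices, not the pipeline's actual code, and the override signatures assume the crawler4j 4.x API.

    import java.util.regex.Pattern;

    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.url.WebURL;

    public class OntologyCandidateCrawler extends WebCrawler {

        // File extensions commonly associated with ontologies; the real
        // pipeline matches a wider range of patterns.
        private static final Pattern ONTOLOGY_FILE =
            Pattern.compile(".*\\.(owl|rdf|owl\\.xml|ttl)$");

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            // Follow every page; candidate URLs are picked out in
            // visit(). Domain restrictions are omitted here.
            return true;
        }

        @Override
        public void visit(Page page) {
            String url = page.getWebURL().getURL().toLowerCase();
            if (ONTOLOGY_FILE.matcher(url).matches()) {
                // Hypothetical hook: store the candidate URL with a
                // timestamp (the real pipeline also records the HTTP
                // status code).
                recordCandidate(url, System.currentTimeMillis());
            }
        }

        private void recordCandidate(String url, long timestamp) {
            System.out.printf("%d %s%n", timestamp, url);
        }
    }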

Data Curation
The curation pipeline makes sure that the files in the repository are

  • byte unique
  • OWL API parseable
  • OWL/XML unique
  • and have a non-empty TBox

The first step in the curation pipeline discards empty and byte-identical files, reducing the set of files under consideration to byte-unique ones. The second step attempts to load each file into the OWL API and, if successful, gathers some metrics about it (OWL API parseable). The third and last step serialises every ontology that contains at least one TBox axiom into a common format (OWL/XML) and checks for byte identity again. This works because the OWL API orders axioms and declarations in a principled way before writing them to file. Surprisingly many files still get filtered out at this last stage.
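For the second and third steps, a minimal sketch of parsing and re-serialising a single file with the OWL API could look as follows (assuming the OWL API 4.x; the class name and command-line arguments are illustrative):

    import java.io.File;
    import java.io.FileOutputStream;

    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.formats.OWLXMLDocumentFormat;
    import org.semanticweb.owlapi.model.OWLOntology;
    import org.semanticweb.owlapi.model.OWLOntologyManager;
    import org.semanticweb.owlapi.model.parameters.Imports;

    public class CurateFile {
        public static void main(String[] args) throws Exception {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();

            // Step two: if loading fails, the file is discarded as
            // unparseable.
            OWLOntology ontology =
                manager.loadOntologyFromOntologyDocument(new File(args[0]));

            // Step three: keep only ontologies with at least one TBox
            // axiom and re-serialise them to OWL/XML; because the OWL API
            // writes axioms in a principled order, logically identical
            // ontologies become byte-identical files.
            if (!ontology.getTBoxAxioms(Imports.EXCLUDED).isEmpty()) {
                manager.saveOntology(ontology, new OWLXMLDocumentFormat(),
                        new FileOutputStream(args[1]));
            }
        }
    }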

At this point, the MOWLCorp is fully curated and ready for use. However, it is recommended to think carefully about a suitable sampling method for this rather large set of currently (July 2014) around 21,000 files. One method is to bin the ontologies by logical axiom count (into small, medium, and large bins) and sample randomly from within each bin, as in the sketch below. A similar sampling strategy was employed for the ORE 2014 run order for each of the disciplines.
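A minimal sketch of such binned sampling; the bin boundaries and the per-bin sample size are illustrative choices, not the ones used for ORE 2014:

    import java.util.*;
    import java.util.stream.Collectors;

    public class BinnedSampler {

        // Hypothetical thresholds separating small, medium, and large
        // ontologies by logical axiom count.
        static String bin(int axiomCount) {
            if (axiomCount < 100) return "small";
            if (axiomCount < 10_000) return "medium";
            return "large";
        }

        // counts maps an ontology identifier to its logical axiom count;
        // perBin ontologies are drawn uniformly at random from each bin.
        static List<String> sample(Map<String, Integer> counts,
                                   int perBin, long seed) {
            Map<String, List<String>> bins = counts.entrySet().stream()
                .collect(Collectors.groupingBy(
                    e -> bin(e.getValue()),
                    Collectors.mapping(Map.Entry::getKey,
                                       Collectors.toList())));
            Random rnd = new Random(seed);
            List<String> sampled = new ArrayList<>();
            for (List<String> members : bins.values()) {
                Collections.shuffle(members, rnd);
                sampled.addAll(
                    members.subList(0, Math.min(perBin, members.size())));
            }
            return sampled;
        }
    }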

Advice

We keep each file in the MOWLCorp in its original serialisation. This means that the imports closure of an ontology is neither redirected nor merged, so ontologies that contain imports are not guaranteed to be parseable at all times. There exists a merged OWL/XML serialisation of the entire corpus, which will be made available shortly.
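If unresolvable imports are a concern when working with the corpus, the OWL API can be configured to ignore missing imports rather than fail the parse. A sketch, again assuming the OWL API 4.x:

    import java.io.File;

    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.io.FileDocumentSource;
    import org.semanticweb.owlapi.model.MissingImportHandlingStrategy;
    import org.semanticweb.owlapi.model.OWLOntology;
    import org.semanticweb.owlapi.model.OWLOntologyCreationException;
    import org.semanticweb.owlapi.model.OWLOntologyLoaderConfiguration;
    import org.semanticweb.owlapi.model.OWLOntologyManager;

    public class TolerantLoading {
        public static OWLOntology load(File file)
                throws OWLOntologyCreationException {
            OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
            // Treat unresolvable imports as warnings instead of errors.
            OWLOntologyLoaderConfiguration config =
                new OWLOntologyLoaderConfiguration()
                    .setMissingImportHandlingStrategy(
                        MissingImportHandlingStrategy.SILENT);
            return manager.loadOntologyFromOntologyDocument(
                new FileDocumentSource(file), config);
        }
    }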