Over the last days you have learned how to store data with Catmandu. Storing data is cool, but sharing data is awesome. Interoperability is important, as other people may use your data (and you will profit from other people’s interoperable data).
In the Day 13 tutorial we learned the basic principles of metadata harvesting via OAI-PMH.
We will set up our OAI service with the Perl Dancer framework and an easy-to-use plugin called Dancer::Plugin::Catmandu::OAI. To install the required modules run:
$ cpanm Dancer
$ cpanm Dancer::Plugin::Catmandu::OAI
You might also need:
$ cpanm Template
Let’s start and index some data with Elasticsearch as learned in the previous post:
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc to Elasticsearch --index_name oai --bag publication
After this, you should have some data in your Elasticsearch index. Run the following command to check this:
$ catmandu export Elasticsearch --index_name oai --bag publication
Everything is fine, so let’s create a simple webservice which exposes the collected data via OAI-PMH. The following code can be downloaded from this gist.
Download this gist and create a symbolic link:
$ ln -s catmandu.yml config.yml
This is necessary for the Dancer app: in this case Catmandu and Dancer use the same configuration file.
store:
  oai:
    package: Elasticsearch
    options:
      index_name: oai
      bags:
        publication:
          cql_mapping:
            default_index: basic
            indexes:
              _id:
                op:
                  'any': true
                  'all': true
                  '=': true
                  'exact': true
                field: '_id'
              basic:
                op:
                  'any': true
                  'all': true
                  '=': true
                  '<>': true
                field: '_all'
                description: "index with common fields..."
              datestamp:
                op:
                  '=': true
                  '<': true
                  '<=': true
                  '>=': true
                  '>': true
                  'exact': true
                field: '_datestamp'
      index_mappings:
        publication:
          properties:
            _datestamp: {type: date, format: date_time_no_millis}

plugins:
  'Catmandu::OAI':
    store: oai
    bag: publication
    datestamp_field: datestamp
    repositoryName: "My OAI DataProvider"
    uri_base: "http://oai.service.com/oai"
    adminEmail: me@example.com
    earliestDatestamp: "1970-01-01T00:00:01Z"
    deletedRecord: persistent
    repositoryIdentifier: oai.service.com
    cql_filter: "datestamp>2014-12-01T00:00:00Z"
    limit: 200
    delimiter: ":"
    sampleIdentifier: "oai:oai.service.com:1585315"
    metadata_formats:
      -
        metadataPrefix: oai_dc
        schema: "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
        metadataNamespace: "http://www.openarchives.org/OAI/2.0/oai_dc/"
        template: oai_dc.tt
        fix:
          - nothing()
        sets:
          -
            setSpec: openaccess
            setName: Open Access
            cql: 'oa=1'
#!/usr/bin/env perl

use Dancer;
use Catmandu;
use Dancer::Plugin::Catmandu::OAI;

Catmandu->load;
Catmandu->config;

oai_provider '/oai';

dance;
<oai_dc:dc xmlns="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
[%- FOREACH var IN ['title' 'creator' 'subject' 'description' 'publisher' 'contributor' 'date' 'type' 'format' 'identifier' 'source' 'language' 'relation' 'coverage' 'rights'] %]
[%- FOREACH val IN $var %]
    <dc:[% var %]>[% val | html %]</dc:[% var %]>
[%- END %]
[%- END %]
</oai_dc:dc>
What’s going on here? Well, the script oai-app.pl defines a route /oai via the plugin Dancer::Plugin::Catmandu::OAI.
The template oai_dc.tt defines the XML output of the records. And finally the configuration file catmandu.yml holds the settings for the Dancer plugin as well as for the Elasticsearch indexing and querying.
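The template loops over the Dublin Core field names and emits one <dc:...> element per value, escaping the text for XML. As a rough illustration of that logic outside Template Toolkit (this is not Catmandu code, and the sample record is invented), a Python sketch:

```python
import xml.sax.saxutils as saxutils

# The fifteen Dublin Core elements the template iterates over.
DC_FIELDS = ["title", "creator", "subject", "description", "publisher",
             "contributor", "date", "type", "format", "identifier",
             "source", "language", "relation", "coverage", "rights"]

def to_oai_dc(record):
    """Mimic oai_dc.tt: for every Dublin Core field, emit one
    <dc:...> element per value, escaping the text for XML."""
    lines = ['<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"',
             '           xmlns:dc="http://purl.org/dc/elements/1.1/">']
    for field in DC_FIELDS:
        for value in record.get(field, []):
            lines.append("  <dc:%s>%s</dc:%s>" % (field, saxutils.escape(value), field))
    lines.append("</oai_dc:dc>")
    return "\n".join(lines)

print(to_oai_dc({"title": ["Flandrica & friends"], "creator": ["Doe, J."]}))
```

Note how each field can repeat: a record with two creators simply yields two <dc:creator> elements, which is exactly what the nested FOREACH in the template does.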
Run the following command to start a local web server:
$ perl oai-app.pl
and point your browser to http://localhost:3000/oai?verb=Identify. To get some records go to http://localhost:3000/oai?verb=ListRecords&metadataPrefix=oai_dc.
Yes, it’s that easy. You can extend this simple example by adding fixes to transform the data as you need it.
Continue to Day 15: MARC to Dublin Core >>
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol to harvest metadata records from OAI compliant repositories. It was developed by the Open Archives Initiative as a low-barrier mechanism for repository interoperability. The Open Archives Initiative maintains a registry of OAI data providers.
Every OAI server must provide metadata records in Dublin Core; other (bibliographic) formats like MARC may be supported additionally. Available metadata formats can be detected with the “ListMetadataFormats” verb. You can set the metadata format for the Catmandu OAI client via the --metadataPrefix parameter.
The OAI server may support selective harvesting, so OAI clients can retrieve only subsets of records from a repository. Client requests can be limited via datestamps (--from, --until) or set membership (--set).
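Under the hood an OAI-PMH request is a plain HTTP GET with a verb parameter plus optional arguments. As a rough illustration (not Catmandu code; the dates and set are just for demonstration), a Python sketch that builds such request URLs:

```python
from urllib.parse import urlencode

def oai_request_url(base_url, verb, **params):
    """Build an OAI-PMH request URL: an HTTP GET with a 'verb'
    parameter plus optional arguments such as metadataPrefix,
    set, from and until."""
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)

# Selective harvesting: restrict by set membership and datestamps.
# 'from' is a Python keyword, so it is passed via a dict expansion.
url = oai_request_url(
    "https://lib.ugent.be/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
    set="flandrica",
    **{"from": "2014-01-01", "until": "2014-12-31"},
)
print(url)
# https://lib.ugent.be/oai?verb=ListRecords&metadataPrefix=oai_dc&set=flandrica&from=2014-01-01&until=2014-12-31
```

The Catmandu OAI client builds exactly this kind of request from its command-line options, and additionally follows the protocol’s resumption tokens to page through large result sets.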
To get some Dublin Core records from the collection of Ghent University Library and convert them to JSON (the default) run the following Catmandu command:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix oai_dc --set flandrica --handler oai_dc
You can also harvest MARC data and store it in a file:
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml to MARC --type USMARC > ugent.mrc
Instead of harvesting the full metadata you can fetch only the record identifiers (--listIdentifiers):
$ catmandu convert OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --listIdentifiers 1 to YAML
You can also transform incoming data and immediately store/index it with MongoDB or Elasticsearch. For the transformation you need to create a fix (see Day 6):
$ nano simple.fix
Add the following fixes to the file:
marc_map(245,title)
marc_map(100,creator.$append)
marc_map(260c,date)
remove_field(record)
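As a rough illustration of what these fixes do, here is a Python sketch that applies the same mapping to a MARC-in-JSON style record. This is not Catmandu code: the sample record is invented, and Catmandu’s exact subfield joining rules differ in detail.

```python
def apply_simple_fix(record):
    """Illustrate what simple.fix does: map field 245 to title,
    each field 100 to creator.$append, subfield 260c to date,
    then drop the raw 'record' array (remove_field)."""
    marc = record["record"]  # list of [tag, ind1, ind2, code, value, code, value, ...]
    out = {}
    for field in marc:
        tag, subfields = field[0], field[3:]
        # subfields come as alternating code/value pairs
        pairs = list(zip(subfields[::2], subfields[1::2]))
        if tag == "245":
            # simplified: join all subfield values with a space
            out["title"] = " ".join(v for _, v in pairs)
        elif tag == "100":
            out.setdefault("creator", []).append(" ".join(v for _, v in pairs))
        elif tag == "260":
            for code, v in pairs:
                if code == "c":
                    out["date"] = v
    # remove_field(record): only the mapped fields remain
    return out

rec = {"record": [
    ["245", "1", "0", "a", "Flandrica :", "b", "a digital library"],
    ["100", "1", " ", "a", "Doe, John"],
    ["260", " ", " ", "a", "Ghent :", "c", "2014"],
]}
print(apply_simple_fix(rec))
# {'title': 'Flandrica : a digital library', 'creator': ['Doe, John'], 'date': '2014'}
```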
Now you can run an ETL process (extract, transform, load) with one command:
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to Elasticsearch --index_name oai --bag ugent
$ catmandu import OAI --url https://lib.ugent.be/oai --metadataPrefix marcxml --set flandrica --handler marcxml --fix simple.fix to MongoDB --database_name oai --bag ugent
The Catmandu OAI client provides special handlers (--handler) for Dublin Core (oai_dc) and MARC (marcxml). For other metadata formats use the default handler (raw) or implement your own. Read our documentation for further details.
Continue to Day 14: Set up your own OAI data service >>