Day 12: Index your data with ElasticSearch
ElasticSearch is a flexible and powerful open source, distributed, real-time search and analytics engine. You can store structured JSON documents, and by default ElasticSearch will try to detect the data structure and index the data. ElasticSearch uses Lucene to provide full-text search capabilities with a powerful query language. Install guides for various platforms are available in the ElasticSearch reference. To install the corresponding Catmandu module and a matching Elasticsearch client, run:
$ cpanm Catmandu::Store::ElasticSearch
$ cpanm Search::Elasticsearch::Client::5_0::Direct
[For those of you running the Catmandu VirtualBox, this installation is not required. ElasticSearch is installed by default.]
Now get some JSON data to work with:
$ wget -O banned_books.json https://lib.ugent.be/download/librecat/data/verbannte-buecher.json
First index the data with ElasticSearch. You have to specify an index (--index_name), a type (--bag) and a client version (--client). For Elasticsearch versions >= 2.0 you also have to add a prefix to your IDs (--key_prefix):
$ catmandu import -v JSON --multiline 1 to ElasticSearch --index_name books --bag banned --key_prefix my_ --client '5_0::Direct' < banned_books.json
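To confirm that the import reached the index, you can also ask Elasticsearch directly. The following is a minimal sketch, assuming Elasticsearch listens on its default address http://localhost:9200 and that curl is installed. The variable names are only illustrative.

```shell
# Assumption: Elasticsearch runs locally on its default port 9200.
es_url="http://localhost:9200"
# The _count endpoint reports how many documents the "books" index holds.
count_url="$es_url/books/_count"

if command -v curl >/dev/null 2>&1; then
    # Ask for a pretty-printed JSON answer; give up after five seconds.
    curl -s --max-time 5 "$count_url?pretty" \
        || echo "Elasticsearch is not reachable at $es_url" >&2
else
    echo "curl is not installed; open $count_url in a browser instead" >&2
fi
```

If the server runs on another host or port, only es_url needs to change.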
Now you can export all items from an index to different formats, like XLSX, YAML and XML:
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' to YAML
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' to XML
$ catmandu export -v ElasticSearch --index_name books --bag banned --client '5_0::Direct' to XLSX --file banned_books.xlsx
You can count all indexed items or those which match a query:
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct'
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationYear: "1937"'
$ catmandu count ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"'
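The field: "value" form used above follows the Elasticsearch query string syntax, which also allows conditions to be combined with Boolean operators such as AND. The sketch below assumes the query is handed to Elasticsearch unchanged. The place and year are illustrative values taken from the examples above.

```shell
# Illustrative values; any place and year present in the data would do.
place="Berlin"
year="1937"
# Combine two field conditions with AND (Elasticsearch query string syntax).
query="firstEditionPublicationPlace: \"$place\" AND firstEditionPublicationYear: \"$year\""
echo "$query"

# Run the count only when catmandu is available on this machine.
if command -v catmandu >/dev/null 2>&1; then
    catmandu count ElasticSearch --index_name books --bag banned \
        --client '5_0::Direct' --query "$query" \
        || echo "count failed; is Elasticsearch running?" >&2
fi
```

Keeping the query in a shell variable avoids quoting mistakes when the same query is reused for count and export.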
You can search an index for a specific value and export all matching items:
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationYear: "1937"' to JSON
$ catmandu export ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"' to CSV --fields 'my_id,authorFirstname,authorLastname,title,firstEditionPublicationPlace'
You can delete just the items which match a query, or the whole collection:
$ catmandu delete ElasticSearch --index_name books --bag banned --client '5_0::Direct' --query 'firstEditionPublicationPlace: "Berlin"'
$ catmandu delete ElasticSearch --index_name books --bag banned --client '5_0::Direct'
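Every command above repeats the same store options. Catmandu can read them from a catmandu.yml file in the working directory instead. The sketch below assumes a store named banned_books, which is only an illustrative label, and reuses the index name, client version and key prefix from the import step.

```shell
# Write a Catmandu configuration file into the current directory.
# "banned_books" is an illustrative store name, not one Catmandu requires.
cat > catmandu.yml <<'EOF'
---
store:
  banned_books:
    package: ElasticSearch
    options:
      index_name: books
      client: '5_0::Direct'
      key_prefix: my_
EOF

# With the file in place, the store name replaces the repeated options.
if command -v catmandu >/dev/null 2>&1; then
    catmandu count banned_books --bag banned \
        || echo "count failed; is Elasticsearch running?" >&2
fi
```

The same short form should work for import, export and delete, as long as the commands are run from the directory that holds catmandu.yml.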
Catmandu::Store::ElasticSearch supports CQL as a query language. For setup and usage, see the module documentation.
Continue to Day 13: Harvest data with OAI-PMH >>



















