With Catmandu we create ETL pipelines for library workflows. Read data from OAI, SRU, Z39.50, PubMed or arXiv, transform it with Catmandu Fixes, and load the results into Solr, MongoDB or CouchDB, or serialize them into YAML, CSV, XML or whatever you like. Read my blog post about the Catmandu Cheat Sheet to get a quick recap.
Today I want to show you how you can create your own Fix routines in any programming language using Catmandu::Fix::cmd, which Nicolas Steenlant created.
First we create a small Perl script to generate some sample JSON we will use in our examples (you can use your own JSON file or translate this trivial script into Python, Ruby, Java, C, Clojure, Go …).
Here is our little JSON generator:
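As a rough Python translation of such a generator, the sketch below prints one JSON object per line. The field names (`id`, `name`, `color`), the value lists and the fixed random seed are assumptions chosen for illustration; any line-delimited JSON will do for the rest of this post.

```python
#!/usr/bin/env python3
"""Print sample JSON records, one object per line."""
import json
import random
import sys

# Hypothetical sample values; replace them with whatever suits your data.
COLORS = ["red", "green", "blue", "yellow"]
NAMES = ["Alice", "Bob", "Carol", "Dave"]


def make_record(i, rng):
    """Build one sample record with a sequential id and random fields."""
    return {"id": i, "name": rng.choice(NAMES), "color": rng.choice(COLORS)}


def generate(count=1000, seed=0, out=None):
    """Write `count` records to `out` (stdout by default), one per line."""
    out = sys.stdout if out is None else out
    rng = random.Random(seed)  # seeded so repeated runs give the same data
    for i in range(count):
        out.write(json.dumps(make_record(i, rng)) + "\n")


if __name__ == "__main__":
    generate()
```

Saved under a name of your choice and made executable, it writes one thousand records to the terminal.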
When we execute the script we’ll get one thousand lines of JSON in our terminal:
It is now easy to use Catmandu Fixes to transform these JSON records. E.g. we can add a new field ‘title’ with content ‘test’:
This add_field() Fix was written in Perl. What if you need to write a new complicated Fix routine and don’t want to use Perl? Well, we have Catmandu::Fix::cmd to the rescue! You can create fixes in any language you like: as long as your program can read JSON records from standard input and write JSON records to standard output, you are cool. Let’s try that out.
As an example we create a Python script that reads JSON from stdin, adds a title field and writes the JSON back to stdout.
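A minimal sketch of such a script follows. It assumes the records arrive as one JSON object per line and that one output line is expected for each input line; the function names are mine. The output is flushed after every record so that a program on the other end of the pipe is not left waiting on a buffer.

```python
#!/usr/bin/env python3
"""Read JSON records from stdin, add a title field, write them to stdout."""
import json
import sys


def add_title(record, title="test"):
    """Set the 'title' field on a record and return the record."""
    record["title"] = title
    return record


def run(infile, outfile):
    """Process line-delimited JSON from infile and write results to outfile."""
    for line in infile:
        line = line.strip()
        if not line:
            continue  # skip blank lines instead of failing on them
        outfile.write(json.dumps(add_title(json.loads(line))) + "\n")
        outfile.flush()  # emit each record immediately


if __name__ == "__main__":
    run(sys.stdin, sys.stdout)
```

Keeping the transformation in its own function (`add_title` here) makes it easy to swap in something more complicated later without touching the I/O loop.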
If we run this we can see the expected result.
With the Catmandu Fix ‘cmd’ we can make this Python program part of an ETL-pipeline. In the simple example below we will repeat the previous test:
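Independently of the Catmandu command line, it can help to check that an external program behaves well on a pipe before wiring it into a pipeline. The harness below is a sketch for local testing only, not Catmandu's implementation: it assumes a strict one-line-in, one-line-out exchange, and `CHILD` is a hypothetical inline stand-in for your own fix script.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for an external fix program (adds a title field).
CHILD = r'''
import json, sys
for line in sys.stdin:
    rec = json.loads(line)
    rec["title"] = "test"
    sys.stdout.write(json.dumps(rec) + "\n")
    sys.stdout.flush()
'''


def pipe_records(records, argv):
    """Send records to a child process one at a time and collect the replies."""
    proc = subprocess.Popen(
        argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1
    )
    results = []
    try:
        for rec in records:
            proc.stdin.write(json.dumps(rec) + "\n")
            proc.stdin.flush()
            reply = proc.stdout.readline()  # blocks until the child answers
            if not reply:
                raise RuntimeError("child closed its output early")
            results.append(json.loads(reply))
    finally:
        proc.stdin.close()   # signals end of input to the child
        proc.stdout.close()
        proc.wait()
    return results


if __name__ == "__main__":
    cmd = [sys.executable, "-c", CHILD]
    print(pipe_records([{"id": 0}, {"id": 1}], cmd))
```

If the child forgets to flush its output, `readline()` in the harness hangs, which is the same symptom you would see from a misbehaving script inside a pipeline.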
Now that this is working you can add the whole Catmandu stack to this pipeline. Add different importers, add new fixes, store into Elasticsearch or MongoDB. E.g. we can do an SRU query and use our Python and Perl fixes together in one pipeline:
Here is what the same program might look like in Lua:
With the same expected results:
Using Catmandu::Fix::cmd you can write complicated fix routines in the language of your choice to meet your data crunching needs.