Extend Catmandu without Perl

05 July 2013 by Patrick

With Catmandu we create ETL-pipelines for library workflows. Read data from OAI, SRU, Z39.50, PubMed, arXive, transform it with Catmandu Fixes and load the results into Solr, MongoDB, CouchDB or serialize into YAML, CSV, XML whatever you like. Read my blog post about the Catmandu Cheat Sheet to get a quick recap.

Today I want to show you how you can create your own Fix routines in any programming language using the Catmandu::Fix::cmd which Nicolas Steenlant created.

First we create a small Perl script to generate some sample JSON we will use in our examples (you can use your own JSON file or translate this trivial script into Python, Ruby, Java, C, Clojore, Go …).

Here is our little JSON generator:

#!/usr/bin/env perl
# file: generate.pl

use JSON;

for (1...1000) {
    print encode_json({ random => rand }) , "\n";
}

When we execute the script we’ll get one thousand lines of JSON in our terminal:

$ ./generate.pl
{"random":0.721613357218615}
{"random":0.491180438229559}
{"random":0.868290266595814}
.
.
.

It is now easy to use Catmandu Fixes to transform these JSON records. E.g. we can add a new field ‘title’ with content ‘test’:

$ ./generate.pl | catmandu convert JSON --fix 'add_field("title","test")'
{"random":0.611390470122803,"title":"test"}
{"random":0.915937067437753,"title":"test"}
{"random":0.461684127836374,"title":"test"}
.
.
.

This add_field() Fix was written in Perl. What if you need to write a new complicated Fix-routine and don’t want to use Perl? Well, we have Catmandu::Fix::Cmd to the rescue! You can create fixes in any language you like, as long as your program can read JSON records from the standard input and can write JSON records to the standard output you are cool. Lets try that out.

As example we create a Python script to read JSON from the stdin, add a title field and write the JSON back to stdout.

#!/usr/bin/env python
# file: catjson.py
import sys
import json

while 1:
    line = sys.stdin.readline()
    if not line: break
    data = json.loads(line.strip())
    data['title'] = "test";
    print json.dumps(data)

If we run this we can see the expected result.

$ ./generate.pl | ./catjson.py
{"random": 0.530965947974309, "title": "test"}
{"random": 0.371021223752646, "title": "test"}
{"random": 0.0907161737840951, "title": "test"}
.
.
.

With the Catmandu Fix ‘cmd’ we can make this Python program part of an ETL-pipeline. In the simple example below we will repeat the previous test:

$ ./generate.pl | catmandu convert JSON --fix 'cmd("./catjson.py")'
{"random":0.554686750713572,"title":"test"}
{"random":0.275637603863029,"title":"test"}
{"random":0.318374223918873,"title":"test"}
.
.
.

Now this is working you can add the whole Catmandu stack to this pipeline. Add different importers, new fixes, store into ElasticSearch or MongoDB. E.g. we can do an SRU query and use our Python and Perl fixes simultaneously:

$ catmandu convert SRU --base http://www.unicat.be/sru --query dna --fix 'cmd("./catjson.py");remove_field("recordData")'
{"recordPacking":"xml","recordPosition":"1","title":"test","recordSchema":"info:srw/schema/1/dc-schema"}
{"recordPacking":"xml","recordPosition":"2","title":"test","recordSchema":"info:srw/schema/1/dc-schema"}
{"recordPacking":"xml","recordPosition":"3","title":"test","recordSchema":"info:srw/schema/1/dc-schema"}
.
.
.

Here is how the same program might look like in Lua

#!/usr/bin/env luajit
# file: catjson.lua
-- requires dkjson http://chiselapp.com/user/dhkolf/repository/dkjson/home
local json = require ("dkjson")

for line in io.lines() do
    local obj, pos, err = json.decode (line, 1, nil)
    obj['title'] = 'test'
    print(json.encode(obj))
end

With the same expected results:

$ ./generate.pl | catmandu convert JSON --fix 'cmd("./catjson.lua")'
{"random":0.54868770433573,"title":"test"}
{"random":0.26483418097243,"title":"test"}
{"random":0.15708750198151,"title":"test"}
.
.
.

Using Catmandu::Fix::cmd you can create complicated fix routines to extend your data crunching needs.

comments powered by Disqus