Catmandu

a data toolkit

This handbook contains the aggregated content of the Catmandu documentation wiki. Feel free to improve the documentation there!

1 Introduction

Catmandu is a command line tool to access and convert data from your digital library, research services or any other open data sets. The toolkit was originally developed as part of the LibreCat project and now attracts an international development team with many participating institutions.

Catmandu has the following features: one can download data via protocols such as OAI-PMH, SRU and SPARQL; convert formats such as MARC, JSON, YAML, CSV and Excel; transform the data with a simple Fix language; and store or index the results in databases such as MongoDB, Elasticsearch and Solr.

Catmandu is used in the LibreCat project to build institutional repositories and search engines. Catmandu is used on the command line for quick and dirty reports but also as part of larger programming projects processing millions of records per day. For a short overview of use-cases, see our Homepage.

As of 15 Aug 2022, there are:

  • 98 Catmandu-related repositories available at GitHub LibreCat
  • 112 Catmandu-related modules on MetaCPAN
  • 227 Catmandu-related repositories across all of GitHub

2 Installation

To get Catmandu running on your system you need to download and install at least the CPAN Catmandu module. Additional modules add support for more input and output formats, databases, and processing options.

To install Catmandu modules, select at least Catmandu (and probably Catmandu::MARC, Catmandu::OAI, Catmandu::RDF, Catmandu::XLS):

$ sudo cpanm Catmandu Catmandu::MARC

Extra Catmandu modules can be installed at any point in time with the cpanm command:

$ sudo cpanm  Catmandu::OAI
$ sudo cpanm  Catmandu::RDF
$ sudo cpanm  Catmandu::Store::MongoDB
$ sudo cpanm  Catmandu::XLS

To make full use of the capabilities of Catmandu, databases and search engines such as MongoDB, Elasticsearch, Solr, Postgres and MySQL can be installed on the system together with the corresponding Catmandu tools. How to install these databases on your local system falls outside the scope of this documentation. Please consult the installation guide of the database product for more information. For more information on the available Catmandu packages consult our Distributions list.

Here are some Catmandu installation hints for various platforms.

2.0.1 Debian

Several Catmandu packages are officially included in Debian but not all (see Voting Catmandu packages to be included in Debian and this search of currently available packages).

You can install all packages officially included in Debian:

sudo apt-get update
sudo apt-get install libcatmandu*-perl

Alternatively, you can build the newest Catmandu and its dependencies from source:

sudo apt-get update
sudo apt-get install cpanminus build-essential libexpat1-dev libssl-dev libxml2-dev libxslt1-dev libgdbm-dev libmodule-install-perl
cpanm Catmandu Catmandu::MARC

Alternatively, you can build the newest Catmandu as unofficial packages, using as many official packages as possible:

sudo apt update
sudo apt install dh-make-perl liblocal-lib-perl apt-file
sudo apt-file update
sudo apt install libtest-fatal-perl libmodule-build-tiny-perl libmoo-perl libmodule-pluggable-perl libcapture-tiny-perl libclass-load-perl libgetopt-long-descriptive-perl libio-tiecombine-perl libstring-rewriteprefix-perl libio-handle-util-perl
cpan2deb --vcs '' MooX::Aliases
cpan2deb --vcs '' Log::Any
cpan2deb --vcs '' App::Cmd
cpan2deb --vcs '' LaTeX::ToUnicode
cpan2deb --vcs '' PICA::Data
cpan2deb --vcs '' LV
cpan2deb --vcs '' MODS::Record
sudo dpkg -i lib*-perl_*.deb
cpan2deb --vcs '' BibTeX::Parser
sudo dpkg -i libbibtex-parser-perl_*.deb
sudo apt install libexporter-tiny-perl
cpan2deb --vcs '' JSON::Path
sudo dpkg -i libjson-path-perl_*.deb
cpan2deb --vcs '' JSON::Hyper
sudo dpkg -i libjson-hyper-perl_*.deb
sudo apt install libhttp-link-parser-perl libautovivification-perl libmatch-simple-perl
cpan2deb --vcs '' JSON::Schema
sudo dpkg -i libjson-schema-perl_*.deb
sudo apt install libjson-xs-perl libtest-exception-perl libtest-deep-perl libfile-slurp-tiny-perl liburi-template-perl libtry-tiny-byclass-perl libdata-util-perl libdata-compare-perl libhash-merge-simple-perl libthrowable-perl libclone-perl libdata-uuid-perl libmarpa-r2-perl libconfig-onion-perl libmodule-info-perl libtext-csv-perl libcgi-expand-perl
dh-make-perl --vcs '' --cpan Catmandu
perl -i -pe 's/libossp-uuid-perl[^,\n]*/libdata-uuid-perl/g' libcatmandu-perl/debian/control
( cd libcatmandu-perl && dpkg-buildpackage -b -us -uc -d )
sudo dpkg -i libcatmandu-perl_*.deb
dh-make-perl --vcs '' --cpan Catmandu::Twitter
perl -i -pe 's/liburi-perl\K[^,\n]*//g' libcatmandu-twitter-perl/debian/control
( cd libcatmandu-twitter-perl && dpkg-buildpackage -b -us -uc -d )
sudo apt install libchi-perl libnet-ldap-perl libdatetime-format-strptime-perl libxml-libxslt-perl libxml-struct-perl libnet-twitter-perl libxml-parser-perl libspreadsheet-xlsx-perl libexcel-writer-xlsx-perl libdevel-repl-perl libio-pty-easy-perl
cpan2deb --recursive --vcs '' Task::Catmandu
sudo apt install 'libcatmandu-*'
sudo dpkg -i libcatmandu-twitter-perl_*.deb
sudo dpkg -i ~/.cpan/build/libcatmandu-*-perl_*.deb

Catmandu::OAI

Alternatively, if you want to install as many packages as possible from the Debian repositories but also need an additional package like Catmandu::OAI, install the prerequisite packages and build just that module (together with any dependency that would conflict if installed from the repositories):

sudo apt-get install build-essential libcatmandu*-perl libexpat1-dev libssl-dev libxml2-dev libxslt1-dev libgdbm-dev libmodule-install-perl dh-make-perl liblocal-lib-perl apt-file libtest-fatal-perl libmodule-build-tiny-perl libmoo-perl libmodule-pluggable-perl libcapture-tiny-perl libclass-load-perl libgetopt-long-descriptive-perl libio-tiecombine-perl libstring-rewriteprefix-perl libio-handle-util-perl libtest-simple-perl libtest-needsdisplay-perl libtest-lwp-useragent-perl cpanminus
sudo cpanm Catmandu::OAI

(Tested in Debian 8 / Jessie and Ubuntu 17.10. Compared to the advice above, we add libtest-simple-perl, libtest-needsdisplay-perl and libtest-lwp-useragent-perl, and avoid libhttp-oai-perl, which produces the error “Installed version (3.27) of HTTP::OAI is not in range '4.03'”.)

2.0.2 Ubuntu Server 12.04.4 LTS

apt-get install make
apt-get install libmodule-install-perl
apt-get install libyaz-dev
apt-get install libwrap0-dev
apt-get install libxml2-dev zlib1g zlib1g-dev
apt-get install libexpat1-dev
apt-get install libxslt1-dev
apt-get install libssl-dev
apt-get install libgdbm-dev
apt-get install perl-doc
yes | cpan Test::More
yes | cpan YAML
yes | cpan App::cpanminus
/usr/local/bin/cpanm Catmandu Catmandu::MARC

2.0.3 CentOS 6.4

yum groupinstall "Development Tools"
yum install perl-ExtUtils-MakeMaker
yum install perl-CPAN -y
yum install gcc -y
yum install gdbm gdbm-devel -y
yum install openssl-devel -y
yum install tcp_wrappers-devel -y
yum install expat expat-devel -y
yum install libxml2 libxml2-devel libxslt libxslt-devel -y
yes | cpan YAML
yes | cpan App::cpanminus
/usr/local/bin/cpanm Catmandu Catmandu::MARC

2.0.4 CentOS 7

yum group install "Development Tools"
yum install perl-devel perl-YAML perl-CPAN perl-App-cpanminus -y
yum install openssl-devel tcp_wrappers-devel expat expat-devel libxml2 libxml2-devel libxslt libxslt-devel -y
cpanm autodie Catmandu Catmandu::MARC

2.0.5 openSUSE

sudo zypper install --type pattern devel_basis
sudo zypper install libxml2-devel libxslt-devel
curl -L http://cpanmin.us | perl - App::cpanminus  ## unless you already have cpanm
cpanm Catmandu Catmandu::MARC

2.0.6 OpenBSD 5.3

cpan App::cpanminus
cpanm Catmandu Catmandu::MARC

2.0.7 OSX

Install Xcode from the App Store first and Homebrew from https://brew.sh:

brew install libxml++ libxml2 xml2 libxslt
# Install plenv from https://github.com/tokuhirom/plenv
git clone https://github.com/tokuhirom/plenv.git ~/.plenv
echo 'export PATH="$HOME/.plenv/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(plenv init -)"' >> ~/.bash_profile
exec $SHELL -l
git clone https://github.com/tokuhirom/Perl-Build.git ~/.plenv/plugins/perl-build/
# Install a modern Perl
plenv install 5.22.0
plenv rehash
plenv install-cpanm
plenv global 5.22.0

# Install catmandu
cpanm Catmandu Catmandu::MARC
plenv rehash

2.0.8 Windows, Mac OSX, Linux

A Docker image of Catmandu is built with each release. After installing Docker, get and use the Catmandu image like this:

# Upgrade to the latest version
docker pull librecat/catmandu

# Run the docker command
docker run -it librecat/catmandu

Or, in case you want a native install, use Strawberry Perl. Catmandu installations have been tested up to version 5.24.1.1. After installing the EXE, reboot your machine, start the cmd.exe command line and execute:

cpanm Catmandu Catmandu::MARC 

2.0.9 Raspbian GNU/Linux 7 on the Raspberry Pi (armhf)

Since Raspbian is based on Debian stable, you could follow the instructions there. Unfortunately, you will run into timeouts, so it is advisable to install some prerequisites via apt-get first:

sudo apt-get install libboolean-perl libdevel-repl-perl libnet-twitter-perl 
sudo apt-get install libxml-easy-perl libxslt1-dev libgdbm-dev

3 Command line client

Most of the Catmandu processing doesn’t require you to write any code. With our command line tools you can store data files into databases, index your data, export data in various formats and provide basic data cleanup operations.

convert

The convert command is used to transform one format to another, or to download data from the Internet. For example, to extract all titles from a MARC record one can write:

$ catmandu convert MARC to CSV --fix 'marc_map(245a,title); retain(title)' < data.mrc

In the example above, we import MARC and export it again as CSV, extracting the 245a field from each record and deleting all the rest. The convert command can transform between many formats:

Transform JSON to YAML:

$ catmandu convert JSON to YAML < data.json

Transform YAML to JSON:

$ catmandu convert YAML to JSON < data.yml

Convert Excel to CSV:

$ catmandu convert XLS to CSV < data.xls

The Fix language can be used to extract the fields of an input you are interested in.

Convert Excel to CSV and only keep the titles, authors, and year columns:

$ catmandu convert XLS to CSV --fix 'retain(titles,authors,year)' < data.xls

In formats such as JSON or YAML the data can be deeply nested. All these fields can be accessed and converted.

$ catmandu convert JSON --fix 'upcase(my.nested.field.1)' < data.json

In the example above the JSON input contains a field my, which contains a field nested, which contains a field field, which contains a list; the second item of that list (indicated by index 1) is upcased.
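
For instance, assuming this hypothetical one-record input, the command above would produce:

$ echo '{"my":{"nested":{"field":["perl","python"]}}}' | catmandu convert JSON --fix 'upcase(my.nested.field.1)'
{"my":{"nested":{"field":["perl","PYTHON"]}}}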

The convert command can also be used to extract data from a database. For example to download the Dublin Core data from the UGent institutional repository type:

$ catmandu convert OAI --url http://biblio.ugent.be/oai

To get a CSV export of all identifiers in this OAI-PMH service type:

$ catmandu convert OAI --url http://biblio.ugent.be/oai to CSV --fix 'retain(_id)'

Or a YAML file with all titles:

$ catmandu convert OAI --url http://biblio.ugent.be/oai --set public to YAML --fix 'retain(title)'

import

The import command is used to import data into a database. Catmandu provides support for NoSQL databases such as MongoDB, Elasticsearch and CouchDB, which require no pre-configuration before they can be used. There is also support for relational databases such as Oracle, MySQL and Postgres via DBI, and for search engines like Solr, but these need to be configured first (databases, tables and schemas need to be created).

Importing a JSON document into a MongoDB database can be as simple as:

$ catmandu import JSON  to MongoDB --database_name bibliography < books.json

Importing into a database can be done for every format that is supported by Catmandu. For instance, MARC can be imported with this command:

$ catmandu import MARC to MongoDB --database_name marc_data < data.mrc

Or, XLS

$ catmandu import XLS to MongoDB --database_name my_xls_data < data.xls

Even a download from a website can be directly stored into a database.

$ catmandu import -v OAI --url http://biblio.ugent.be/oai to MongoDB --database_name oai_data

In the example above a copy of the institutional repository of Ghent University was loaded into a MongoDB database. Use the option -v to see a progress report.

Before the data is imported, a Fix can be applied to extract or transform fields before they are stored in the database. For instance, we can extract the publication year from a MARC import and store it as a separate year field:

$ catmandu import MARC to MongoDB --database_name marc_data --fix 'marc_map("008/7-10",year)' < data.mrc

export

The export command is used to retrieve data from a database. See the import command above for a list of databases that are supported.

For instance we can export all the MARC records we have imported with this command:

$ catmandu export MongoDB --database_name marc_data 

In case we only need the title field from the marc records and want the results in a CSV format we can add some fixes:

$ catmandu export MongoDB --database_name marc_data to CSV --fix 'marc_map(245a,title); retain(title)'

Some databases support a query syntax to select the records to be exported. For instance, in the example above we extracted the year field from the MARC import. This can be used to export only the records of a particular year:

$ catmandu export MongoDB --database_name marc_data --query '{"year": "1971"}'

configuration

It is often handy to store the configuration options of importers, exporters and stores in a file. This allows you to create shorter, easier commands. To do this, a file ‘catmandu.yml’ needs to be created in your working directory with content like:

---
importer:
  ghent:
     package: OAI
     options:
        url: http://biblio.ugent.be/oai
        set: public
        handler: marcxml
        metadataPrefix: marc21
store:
  ghentdb:
     package: MongoDB
     options:
        database_name: oai_data
        default_bag: data

When this file is available, an OAI-PMH harvest could be done with the shortened command:

$ catmandu convert ghent

To store the ghent OAI-PMH import into the MongoDB database, one could write:

$ catmandu import ghent to ghentdb

To extract the data from the database, one can write:

$ catmandu export ghentdb

Next

See the Command line client Cheat Sheet for more examples of command line commands.

4 Concepts

To make better use of Catmandu it helps to first understand its core concepts:

Items are the basic unit of data processing in Catmandu. Items can be read, stored, and accessed in many formats. An item can be a MARC record, an RDF triple or one row in an Excel file.

Importers are used to read items. There are importers for MARC, JSON, YAML, CSV, Excel, and many other input formats. One can also import from remote sources such as SPARQL, Atom and OAI-PMH endpoints.

Exporters are used to transform items back into JSON, YAML, CSV, Excel or any format you like.

Stores are databases to store your data. With databases such as MongoDB and ElasticSearch it becomes really easy to store quite complicated, deeply nested items.

Fixes transform items, reshaping the data into any form you like. See Fix language and Fix packages for details.

4.1 Items

An item is the basic unit of data processing in Catmandu. Items are data structures built of key-value-pairs (aka objects), lists (aka arrays), strings, numbers, and null-values. All items can be expressed in JSON and YAML, among other formats.
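
For instance, a single (hypothetical) item could look like this in YAML, mixing key-value-pairs, a list, strings, a number and a null value:

---
_id: '123'
title: My Little Pony
authors:
  - John
  - Mary
edition: 2
isbn: ~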

Internally all data processing in Catmandu uses a generic data format not unlike JSON. Whether one imports MARC, XML, Excel, OAI-PMH, SPARQL, data from a database or any other format, everything can be expressed as JSON.

For example:

  • JSON/YAML - when importing a large JSON/YAML collection as an array, every item in the array is a Catmandu item.
  • Text - for text imports every line of text is one Catmandu item.
  • MARC - when importing MARC data, every record in a MARC file is one Catmandu item.
  • XLS, CSV - for tabular formats such as Excel, CSV and TSV, each row in a table is one Catmandu item.
  • RDF - for linked data formats such as RDF/XML, RDF/nTriples and RDF/Turtle, each triple is one Catmandu item.
  • SPARQL - for a result set of a SPARQL or LDF query, every result (with its variable bindings) is one Catmandu item.
  • MongoDB, ElasticSearch, Solr, DBI - for databases every record in the database is one Catmandu item.

To transform items with the Fix language one points to the fields in items with a JSONPath expression (Catmandu uses an extension of JSONPath actually). The fixes provided to a catmandu command operate on all individual items.

For instance, the command below will upcase the publisher field for every item (row) in the data.xls file:

$ catmandu convert XLS --fix 'upcase(publisher)' < data.xls

This command will select only the JSON items that contain ‘Tsjechov’ in a nested authors field:

$ catmandu convert JSON --fix 'select any_match(authors.*,"Tsjechov.*")' < data.json

This command will delete all the uppercase A characters from a Text file:

$ catmandu convert Text to Text --fix 'replace_all(A,"")' < data.txt

To see the internal representation of a MARC file in Catmandu, transform it for instance to YAML

$ catmandu convert MARC to YAML < data.mrc

One will see that a MARC record is treated as an array of arrays for each item.
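
A sketch of what this looks like in YAML, using values from the camel.usmarc example shown later in this handbook:

---
_id: 'fol05882032 '
record:
  - - LDR
    - ~
    - ~
    - _
    - '00755cam  22002414a 4500'
  - - '650'
    - ' '
    - '0'
    - a
    - Cross-platform software development.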

4.2 Importers

Importers are Catmandu packages to read a specific data format. Catmandu provides importers for MARC, JSON, YAML, CSV, Excel, and many other input formats. One can also import from remote sources for instance via protocols such as SPARQL and OAI-PMH.

The name of a Catmandu importer should be provided as the first argument to the convert command.

Read JSON input:

$ catmandu convert JSON

Read YAML input

$ catmandu convert YAML

Read MARC input

$ catmandu convert MARC

Importers accept configurable options. E.g. you can pass the --type argument to the MARC importer, where the following types are currently supported:

  • USMARC (use ISO as an alias)
  • MicroLIF
  • MARCMaker
  • Line (for line-oriented MARC)
  • MiJ (for MARC-in-JSON)
  • XML (for MARCXML)
  • RAW
  • Lint (for importing ISO and checking validity)
  • ALEPHSEQ (for Aleph Sequential)

Read MARC-XML input

$ catmandu convert MARC --type XML < marc.xml

Read Aleph sequential input

$ catmandu convert MARC --type ALEPHSEQ < marc.txt

Read more about the configuration options of importers in their manual pages:

$ catmandu help import JSON
$ catmandu help import YAML

4.3 Exporters

Exporters are Catmandu packages to export data in a specific format. See Importers for the opposite action.

Some exporters such as JSON and YAML can handle any type of input. It doesn’t matter how the input is structured; it is always possible to create a JSON or YAML file.

Exporters are given after the to argument of the convert command:

$ catmandu convert OAI --url http://biblio.ugent.be/oai to JSON
$ catmandu convert MARC to JSON
$ catmandu convert XLS to JSON

For most exporters however, the input data needs to be structured in a specific format. For instance, tabular formats such as Excel, CSV and TSV don’t allow for nested fields. In the example below, catmandu tries to convert a list into a simple value, which fails:

$ echo '{"colors":["red","green","blue"]}' | catmandu convert JSON to CSV
colors
ARRAY(0x7f8885a16a50)

The ARRAY(...) output indicates that the colors field is nested. To fix this, a transformation needs to be provided:

$ echo '{"colors":["red","green","blue"]}' | catmandu convert JSON to CSV --fix 'join_field(colors,",")'
colors
"red,green,blue"

MARC output needs an input in the Catmandu MARC format, RDF exports need the aREF format, etc.
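
For instance, a round trip through the internal MARC format is possible; a small sketch (assuming the MARC exporter accepts the same --type option as the importer):

$ catmandu convert MARC to MARC --type XML < data.mrc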

Exporters also accept options to configure the various kinds of exports. For instance, JSON can be exported as an array or in a line-by-line format:

$ catmandu convert MARC to JSON --array 1 < data.mrc
$ catmandu convert MARC to JSON --line_delimited 1 < data.mrc
$ catmandu convert MARC to JSON --pretty 1 < data.mrc

The Catmandu::Template package can be used to generate any type of structured output given an input using the Template Toolkit language.

For instance, to create a JSON array of colors an echo command can be used on Linux:

$ echo '{"colors":["red","green","blue"]}'

To transform this JSON into XML, the Template exporter can be used with a template file as a command line argument:

$ echo '{"colors":["red","green","blue"]}' | catmandu convert JSON to Template --template `pwd`/xml.tt

and xml.tt like:

<colors>
[% FOREACH c IN colors %]
  <color>[% c %]</color>
[% END %]
</colors>

will produce:

<colors>
  <color>red</color>
  <color>green</color>
  <color>blue</color>
</colors>

Consult the manual pages of catmandu to see the output options of the different Exporters:

$ catmandu help export JSON
$ catmandu help export YAML
$ catmandu help export CSV

4.4 Stores

Stores are Catmandu packages to store Catmandu items in a database. These databases need to be installed separately from Catmandu. Databases such as MongoDB, ElasticSearch and CouchDB work out-of-the-box with hardly any configuration. For other databases such as Solr, MySQL, Postgres and Oracle, extra configuration steps are needed to define the database schemas.

Catmandu stores such as MongoDB, ElasticSearch and CouchDB can accept any type of input. They are perfect tools to store the output of data conversions.

Without defining any database schema, JSON, YAML, MARC, Excel, CSV, OAI-PMH or any other Catmandu-supported format can be stored:

$ catmandu import JSON to MongoDB --database_name test < data.json
$ catmandu import YAML to MongoDB --database_name test < data.yml
$ catmandu import MARC to MongoDB --database_name test < data.mrc
$ catmandu import XLS to MongoDB --database_name test  < data.xls

Many Catmandu stores can be queried with their native query language:

$ catmandu export MongoDB --database_name test --query '{"my.deep.field":"abc"}'

To delete data from a store the delete command can be used.

# Delete everything
$ catmandu delete MongoDB --database_name test  
# Delete record with _id = 1234 and _id = 1235
$ catmandu delete MongoDB --database_name test --id 1234 --id 1235

Use the count command to show the size of a database.

$ catmandu count MongoDB --database_name test  

One important use-case for Catmandu is indexation of data in search engines such as Solr. To do this, Solr needs to be configured for the fields you want to make searchable. Your data collection can be indexed in the Solr engine by mapping the fields in your data to the fields available in Solr.

$ catmandu import MARC to Solr --fix marc2solr.fix < data.mrc

where marc2solr.fix is a Fix script containing all the fixes required to transform your input data into the Solr format:

# marc2solr.fix
marc_map('008_/7-10','year')
marc_map('020a','isbn.$append')
marc_map('022a','issn.$append')
marc_map('245a','title_short')
.
.
.

In reality the Fix script will contain many mappings and data transformations to clean data. See Example Fix Script for a long example of such data cleaning in action.

4.5 FileStore

Stores are Catmandu packages to store Catmandu Items in a database. A FileStore is a Store where you can store binary content (unstructured data). Out of the box, one FileStore implementation is provided: File::Simple which stores files in a directory structure on the local file system.

The command below stores the /tmp/myfile.txt in the File::Simple FileStore in the “container” 1234 with the file identifier myfile.txt:

$ catmandu stream /tmp/myfile.txt to File::Simple --root t/data --bag 1234 --id myfile.txt

The root parameter is mandatory for the File::Simple FileStore. It defines the location where all stored files are written. The other two parameters bag and id are mandatory for every FileStore (see below).

To extract a file from a FileStore the stream command can be used in the opposite direction:

$ catmandu stream File::Simple --root t/data --bag 1234 --id myfile.txt to /tmp/myfile.txt

From the File::Simple FileStore the file myfile.txt is extracted from the container with identifier 1234.

Every FileStore inherits the functionality of a Store. In this way the drop and delete commands can be used to delete data from a FileStore:

# Delete a "file"
$ catmandu delete File::Simple --root t/data --bag 1234 --id myfile.txt

# Delete a "folder"
$ catmandu drop File::Simple --root t/data --bag 1234

4.5.1 Bag

A FileStore contains one or more Bags. These Bags are containers (or “folders”) that store zero or more files. The name of such a container, indicated with the bag option in the Catmandu commands, is an identifier. In the case of File::Simple this identifier needs to be a number or, when the uuid option is set, a UUID.

The binary data (files) stored in these Bags also needs an identifier, indicated with the id option. Usually the file name is a good choice to use.

Both the bag and id options are required when uploading data to or streaming data from a FileStore.

Within a FileStore Bag no deeper hierarchy is possible. A Bag contains a flat list of files. To store deeply nested folders and files, archives such as ZIP files need to be created and imported:

$ zip -r /tmp/files.zip /mnt/data/files
$ catmandu stream /tmp/files.zip to File::Simple --root t/data --bag 1234 --id files.zip

4.5.2 Index

Every FileStore has a default Bag called index which contains a list of all available Bags in the store (like the listing of all folders). Using the export command a listing of bags can be requested from the FileStore:

$ catmandu export File::Simple --root t/data to YAML

To retrieve a listing of all files stored in a bag the bag option needs to be provided:

$ catmandu export File::Simple --root t/data --bag 1234 to YAML

4.5.3 Technical Metadata

Each Bag (“container”) in a FileStore contains at least the _id as metadata. Some FileStores may contain more metadata. To retrieve a listing of all containers use the export command on the FileStore:

$ catmandu export File::Simple --root t/data 
[{"_id":"1234"},{"_id":"1235"},{"_id":"1236"}]

Every “file” in a FileStore contains at least the following fields:

  • _id : the name of the file
  • _stream : a callback function to download the contents of the file (pass it an IO::Handle)
  • created : the creation date time of the file as a UNIX timestamp
  • modified : the last modification date time of the file as a UNIX timestamp
  • content_type : the content type of the file
  • size : the file size in bytes
  • md5 : an MD5 checksum if the FileStore supports it, or an empty string

NOTE: Not every exporter can serialise the code reference in the _stream field. For instance, when exporting to JSON this error message will show up:

$ catmandu export File::Simple --root t/data --bag 1234
Oops! encountered CODE(0x7f99685f4390), but JSON can only represent references to arrays or hashes at /Users/hochsten/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/Catmandu/Exporter/JSON.pm line 36.

This field can be removed from the output using the remove_field fix:

$ catmandu export File::Simple --root t/data --bag 1234 --fix 'remove_field(_stream)'
[{"_id":"files.pdf","content_type":"application/pdf","modified":1498122646,"md5":"","size":883202,"created":1498122646}]

Always use the stream command in Catmandu to extract files from a FileStore:

$ catmandu stream File::Simple --root t/data --bag 1234 --id 'files.pdf' > output.pdf

4.5.4 Configuration

As for Stores, the configuration parameters of a FileStore can be written in a catmandu.yml configuration file. In this way the Catmandu commands can be shortened:

$ cat catmandu.yml
---
store:
  files:
    package: File::Simple
    options:
        root: t/data

# Get a "directory" listing
$ catmandu export files to YAML

# Get a "file" listing
$ catmandu export files --bag 1234 to YAML

# Add a file
$ catmandu stream /tmp/myfile.txt to files --bag 1234 --id myfile.txt

# Download a file
$ catmandu stream files --bag 1234 --id myfile.txt to /tmp/myfile.txt

4.6 Fixes

Fixes are used for easy data transformations by non-programmers. Using the small Fix language, non-programmers can manipulate Catmandu items.

To introduce the capabilities of Fix, an example will be provided below to extract data from a MARC input.

First, make sure that Catmandu::MARC is installed on your system.

 $ sudo cpanm Catmandu::MARC

We will use the Catmandu command line client to extract data from an example USMARC file, camel.usmarc.

With the convert command one can read items from a MARC importer and convert them into a new format. By default, convert will output JSON:

$ catmandu convert MARC < camel.usmarc
{"record":[["LDR",null,null,"_","00755cam  22002414a 4500"],["001",null,null...
...
["650"," ","0","a","Cross-platform software development."]],"_id":"fol05882032 "}

You can make this conversion explicit:

$ catmandu convert MARC to JSON < camel.usmarc

To transform this MARC data we will first create a Fix file which contains all the Fix commands we will use. Create a text file ‘fixes.txt’ on your system with this input:

remove_field('record');

and execute the following command:

$ catmandu convert MARC --fix fixes.txt < camel.usmarc
{"_id":"fol05731351 "}
{"_id":"fol05754809 "}
{"_id":"fol05843555 "}
{"_id":"fol05843579 "}

We have removed the field ‘record’ (containing the MARC data) from the JSON record. This is what the ‘remove_field’ Fix does: remove one field from a JSON record. We will use this remove_field(‘record’) to make our output a bit more terse and easier to read.

With the ‘marc_map’ Fix from the Catmandu::MARC package we can extract MARC (sub)fields from the record. Add these to the fixes.txt file:

marc_map('245','title');
remove_field('record');

When we run our previous catmandu command we get the following output:

$ catmandu convert MARC --fix fixes.txt to JSON --line_delimited 1 < camel.usmarc
{"_id":"fol05731351 ","title":"ActivePerl with ASP and ADO /Tobias Martinsson."}
{"_id":"fol05754809 ","title":"Programming the Perl DBI /Alligator Descartes and Tim Bunce."}
{"_id":"fol05843555 ","title":"Perl :programmer's reference /Martin C. Brown."}

We know that in the 650-a field of MARC we can find subjects. Let’s add them to fixes.txt:

marc_map('245','title');
marc_map('650a','subject');
remove_field('record');

and run the command again:

$ catmandu convert MARC --fix fixes.txt to JSON --line_delimited 1 < camel.usmarc
{"subject":"Perl (Computer program language)","_id":"fol05731351 ","title":"ActivePerl with ASP and ADO /Tobias Martinsson."}
{"subject":"Perl (Computer program language)Database management.","_id":"fol05754809 ","title":"Programming the Perl DBI /Alligator Descartes and Tim Bunce."}
{"subject":"Perl (Computer program language)","_id":"fol05843555 ","title":"Perl :programmer's reference /Martin C. Brown."}

The MARC 008 field from position 7 to 10 contains publication years. We can also add these to the ‘fixes.txt’ file:

marc_map('245','title');
marc_map('650a','subject');
marc_map('008/7-10','year');
remove_field('record');

and run the command:

$ catmandu convert MARC --fix fixes.txt to JSON --line_delimited 1 < camel.usmarc
{"subject":"Perl (Computer program language)","_id":"fol05731351 ","title":"ActivePerl with ASP and ADO /Tobias Martinsson.","year":"2000"}
{"subject":"Perl (Computer program language)Database management.","_id":"fol05754809 ","title":"Programming the Perl DBI /Alligator Descartes and Tim Bunce.","year":"2000"}
{"subject":"Perl (Computer program language)","_id":"fol05843555 ","title":"Perl :programmer's reference /Martin C. Brown.","year":"1999"}

You don’t need to write fixes into a file to use them. E.g. if we want some statistics on the publication years in the camel.usmarc file, we can do something like:

$ catmandu convert MARC --fix "marc_map('008/7-10','year'); retain('year')" to CSV < camel.usmarc
year
2000
2000
1999
.
.

With marc_map we extracted the year from the 008 field. With retain we deleted everything in the output except for the field ‘year’. We used the CSV Exporter to present the results in an easy format.

5 Fix language

Catmandu comes with a small domain-specific language for the manipulation of data items, called Fix. The Fix language consists of paths, functions, selectors, conditionals, binds and comments, each described in the sections below.

5.1 Paths

Almost any transformation on a Catmandu item contains a path to the part of the item that needs to be changed. To upcase the title field in an item the Fix upcase needs to be used:

upcase(title)

A field can be nested in key-value-pairs (objects). To access a field deep in a key-value-pair, the dot-notation should be used:

upcase(my.deep.nested.title)

If a part of an item contains a list of fields, then the index-notation should be used. Use index 0 to point to the first item in a list, index 1 to the second, index 2 to the third, etc.

upcase(my.data.2.title)  # upcase the title of the 3rd item in the my.data list

For example, given this YAML input:

---
title: My Little Pony
my:
 colors:
   - red
   - green
   - blue
 nested:
     a:
      b:
       c: Hoi!

The value ‘My Little Pony’ can be accessed using the path:

title

The value ‘green’ can be accessed using the path:

my.colors.1 

The value ‘Hoi!’ can be accessed using the path:

my.nested.a.b.c  
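
These paths can be used in any Fix function that takes a path argument; a small sketch using fixes shown elsewhere in this handbook:

copy_field(my.nested.a.b.c,greeting)   # greeting: 'Hoi!'
upcase(my.colors.1)                    # 'green' becomes 'GREEN'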

5.1.1 Wildcards

Wildcards are used to point to relative positions or many positions in a list.

To point to the first item in a list (e.g. the value ‘red’ in the example above) the wildcard $first can be used:

my.colors.$first 

To point to the last item in a list (e.g. the value ‘blue’ in the example above) the wildcard $last can be used:

my.colors.$last 

In some cases, one needs to point to a position before the first item in a list. For instance, to add a new field before the color ‘red’ in our example above, the wildcard $prepend should be used:

my.colors.$prepend

This wildcard can be used in functions like set_field:

set_field(my.colors.$prepend,'pink')

To add a new field at the end of a list (after the color ‘blue’), the wildcard $append should be used:

my.colors.$append

As in:

set_field(my.colors.$append,'yellow')

The star notation is used to point to all the items in a list:

my.colors.*

To upcase all the colors use:

upcase(my.colors.*)

When lists are nested inside lists, then wildcards can also be nested:

my.*.colors.*

The above trick can be used when the my field contains a list which contains a colors field which again contains a list of data. E.g.:

---
my:
 - colors:
     - red
     - blue
 - colors:
     - yellow
     - green

5.1.2 MARC, MAB, PICA paths

For some data formats it can be quite difficult to extract data by the exact position of a field. In data formats such as MARC, one is usually not interested in the field at the 17th position containing a subfield at position 3. MARC contains tags and subfields, which can be at any position in the MARC record.

Specialized Fix functions for MARC, MAB and PICA make it easier to access data by changing the Path syntax. For instance, to copy the 245a field in a MARC record to the title field one can write:

marc_map("245a",title)

In the context of a marc_map Fix the “245a” Path is a MARC Path that points to a part of the MARC record. These MARC Paths only work in MARC Fixes (marc_map, marc_add, marc_set, marc_remove). It is not possible to use these paths in other Catmandu fix functions:

marc_map("245a",title)            # This will work
copy_field("246a","other_title")  # This will NOT work

Consult the documentation of the different specialised packages for the Path syntax that can be used.

5.2 Functions

Fix functions manipulate fields in every item of a Catmandu Importer. For instance, using the command below the title field will be upcased for every item in the input list of JSON items.

$ catmandu convert JSON --fix 'upcase(title)' < data.json

Fix functions can have zero or more arguments separated by commas:

vacuum()              # Clean all empty fields in a record
upcase(title)         # Upcase the title value
append(title,"-123")  # Add -123 at the end of the title value 

The arguments to a Fix function can be a Fix path or a literal string. Literal strings can be quoted with double or single quotes:

append(title,"-123")
append(title,'foo bar')

In the case of single quotes, all the characters between the quotes are interpreted verbatim. When using double quotes, the values in quotes can be interpreted by some Fix functions:

replace_all(title,"My (.*) Pony","Our $1 Fish")   # Replace 'My Little Pony' by 'Our Little Fish'

Some Fix functions accept zero or more options which need to be specified as name: value pairs:

sort_field(tags, reverse:1)               # Sort the tags field in reverse order
lookup("title","dict.csv", sep_char:'|',default:'NONE')  # Lookup a title in a CSV file

Unless specified otherwise (such as in Binds), Fix functions are executed in the order given by the Fix script:

upcase(authors.*)
append(authors.*,"abc")
replace_all(authors.*,"a","AB")

In the example above all transformations on the field authors will be executed in the order given. For example, when the field authors contains this list:

---
authors:
  - John
  - Mary
  - Dave

The first fix will transform this list into:

---
authors:
  - JOHN
  - MARY
  - DAVE

The second fix will append “abc” to all authors

---
authors:
  - JOHNabc
  - MARYabc
  - DAVEabc

The third fix will replace all “a”-s by “AB”s

---
authors:
  - JOHNABbc
  - MARYABbc
  - DAVEABbc

In some cases the ordering of transformations of items in a list matters. For instance, you may want to first do a sequence of transformations on the first item in a list, then the same sequence on the second item, etc. To change this ordering of Fix functions, Binds need to be used.

For a nearly complete list of functions currently available in Catmandu, take a look at the Fixes Cheat Sheet.

5.3 Selectors

With Fix selectors one can select which Catmandu items end up in an output stream. Use a selector to throw away the records you are not interested in. For instance, to filter out all the records in an input, use the reject() selector:

$ catmandu convert MARC to YAML --fix "reject()" < data.mrc

The command above will generate no output: every record is rejected. The opposite of reject() is the select() selector which can be used to select all the Catmandu items you want to keep in an output:

$ catmandu convert MARC to YAML --fix "select()" < data.mrc

The command above will return all the MARC items in the input file.

Selectors are of little use when used in isolation. Most of the time they are combined with Conditionals. To select only the MARC records that have “Tsjechov” in the 100a field one can write:

$ catmandu convert MARC to YAML --fix "select marc_match(100a,'.*Tsjechov.*')" < data.mrc

There are two alternative ways to combine a selector with a conditional. Using the guard syntax, the conditional is written after the selector:

reject exists(error.field)
reject all_match(publisher,'xyz')
select any_match(years,2005)

Using the if/then/else syntax the conditional is written explicitly:

if exists(error.field)
   reject()
end

if all_match(publisher,'xyz')
   reject()
end

5.4 Conditionals

A Conditional is executed depending on a boolean condition that can be true or false. For instance, to skip a Catmandu item when the field error exists one would write the conditional exists:

if exists(error)
  reject()
end

A condition contains an if or unless statement with a Conditional (a Fix function which can be true or false), a body of zero or more Fix functions, and an optional elsif or else clause:

if exists(error)
   # Write here all the Fix functions when the field 'error' exists
end
unless exists(error)
  # Write here all the Fix functions when the field 'error' doesn't exist
end
if exists(error)
   # If error exists then do this
elsif exists(warning)
   # If warning exists then do this
else
   # otherwise do this
end

Catmandu also supports a limited number of boolean operators:

exists(foo) and add_field(ok,1)     # only execute add_field() when 'foo' exists
exists(foo) or  add_field(error,1)  # only execute add_field() when 'foo' doesn't exist

Below follow some basic Conditionals that are implemented in Catmandu. Check the manual pages of the individual Catmandu extensions for more elaborate Conditionals.

all_equal(path,value)

True, when the path exists and is exactly equal to a value. When the path points to a list, then all the list members need to be equal to the value. False otherwise.

if all_equal(year,"2018")
  set_field(published,"future")
end

if all_equal(animals.*,"cat")
  set_field(animal_types,"feline")
end

any_equal(path,value)

True, when the path exists and is exactly equal to a value. When the path points to a list, then at least one of the list members needs to be equal to the value. False otherwise.

if any_equal(year,"2018")
  set_field(published,"future")
end

if any_equal(animals.*,"cat")
  set_field(animal_types,"some feline")
end

all_match(path,regex)

True, when the path exists and the value matches the regex regular expression. When the path points to a list, then all the values have to match the regular expression. False otherwise.

if all_match(year,"^19.*$")
  set_field(period,"20th century")
end

if all_match(publishers.*,"Elsevier.*")
  set_field(is_elsevier,1)
end

any_match(path,regex)

True, when the path exists and the value matches the regex regular expression. When the path points to a list, then at least one of the values has to match the regular expression. False otherwise.

if any_match(year,"^19.*$")
  set_field(period,"20th century")
end

if any_match(publishers.*,"Elsevier.*")
  set_field(some_elsevier,1)
end

exists(path)

True, when the path exists in the Catmandu item. False otherwise.

if exists(my.deep.field)
end

if exists(my.list.0)
end

greater_than(path,number)

True, when the path exists and the value is greater than a number. When the path points to a list, then all the members need to be greater than the number. False otherwise.

less_than(path,number)

True, when the path exists and the value is less than a number. When the path points to a list, then all the members need to be less than the number. False otherwise.
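
A small sketch of both conditionals (the pages field and values are hypothetical):

if greater_than(pages,300)
  set_field(size,thick)
end

if less_than(pages,50)
  set_field(size,thin)
end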

in(path1,path2)

True, when the values of the first path1 are contained in the values at the second path2. False otherwise.

For instance, to check if two paths contain the same value, type:

if in(my.title,your.title)
  set_field(same,1)
end

To check if a value at one path is contained in a list at another path, type:

if in(my.author,your.authors.*)
   set_field(known_author,1)
end

is_true(path)

True, if the value at path can be evaluated to a boolean true. False otherwise.

is_false(path)

True, if the value at path can be evaluated to a boolean false. False otherwise.
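
A small sketch of both conditionals (the published field is hypothetical):

if is_true(published)
  set_field(status,public)
end

if is_false(published)
  set_field(status,draft)
end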

5.5 Binds

Binds change the execution context of a Fix script. In normal operation, all Fix functions are executed from the first to the last. For example, given the YAML input:

---
colors:
  - red
  - green
  - blue

every Fix function will be executed one by one on all the colors:

upcase(colors.*)
append(colors.*," is a nice color")
copy_field(colors.*,result.$append)

The first Fix upcase will uppercase all the colors, the second append will add ” is a nice color” to all the colors, the last copy_field will copy all the colors to a new field.

But what should you do when you want the three Fix functions to operate on each color separately? First upcase on the first color, append on the first color, copy_field on the first color, then again upcase on the second color, append on the second color, etc.

For this type of operation a Bind is needed using the do notation:

do list(path:colors, var:c)
  upcase(c)
  append(c," is a nice color")
  copy_field(c,result.$append)
end

In the example above the list Bind was introduced. The context of the execution of the Bind body is changed. Instead of operating on one Catmandu item as a whole, the Fix functions are executed for each element in the list.
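
For the colors input above, the result field would then be (a sketch of the expected output):

---
result:
  - RED is a nice color
  - GREEN is a nice color
  - BLUE is a nice color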

These Binds can also work on hash (object) inputs. An example is the each Bind. Given:

demo:
 nl: 'Tuin der lusten'
 en: 'The Garden of Earthly Delights'

When we want to have a titles field with all the values of demo collected together, we can’t use the list Bind (because it works on arrays) but need to use the each Bind:

do each(path: demo, var: t)
    copy_field(t.value, titles.$append)
end

The result will be:

titles:
   - 'Tuin der lusten'
   - 'The Garden of Earthly Delights'

Each Bind changes the execution context in some way. For instance, Fix functions could execute queries against a database, or fetch data from the internet. These operations can fail when the database is down or the website can’t be reached. What should happen in that case in a Fix script? Should the execution be stopped? Or should these errors be ignored?

my_fix1()
my_fix2()
download_from_internet() # <--- this one fails
process_results()

What should happen in the example above? Should the results be processed when the download_from_internet fails? Using the maybe Bind one can skip Fix functions that fail:

do maybe()
  my_fix1()
  my_fix2()
  download_from_internet() 
  process_results() # <--- this is skipped when download_from_internet fails
end

Binds are also used when creating Fix executables. These are Fix scripts that can be run directly from the command line. In the example below we’ll write a Fix script that downloads data from an OAI-PMH repository and prints all the record identifiers:

#!/usr/bin/env catmandu run
do importer(OAI,url: "http://lib.ugent.be/oai") 
  retain(_id)
  add_to_exporter(.,YAML)
end

If this script is stored on a file system as myscript.fix and made executable:

$ chmod 755 myscript.fix

then you can run this script as any other Unix command:

$ ./myscript.fix

5.6 Comments

Comments can be added to the Fix scripts to enhance the readability of your transformations. All lines that start with a hash sign (#) are ignored by Catmandu:

# This is a comment
  # This is also a comment
add_field(foo,bar)  # This is a comment at the end of a line, add_field will be executed
# remove_field(foo) this line is a comment, remove_field(foo) will not be executed by the script

6 Cheat sheets

6.1 Command line client Cheat Sheet

This cheat sheet summarizes the command line client capabilities.

$ catmandu help
$ catmandu help convert

6.1.1 Convert

Convert one data format to another; optionally provide a Fix script to transform the data:

$ catmandu convert MARC to JSON < records.mrc
$ catmandu convert MARC to YAML < records.mrc
$ catmandu convert MARC to JSON --pretty 1 < records.mrc
$ catmandu convert MARC to JSON --fix 'marc_map("245","title");remove_field("record")' < records.mrc
$ catmandu convert MARC to CSV --fix myfixes.fix < records.mrc
$ cat myfixes.fix
marc_map("245","title")
remove_field("record")
$ catmandu convert MARC to CSV --fix myfixes2.fix --var source="Springer" < records.mrc
$ cat myfixes2.fix
add_field("source","{{source}}")
marc_map("245","title")
remove_field("record")
$ catmandu convert OAI --url http://biblio.ugent.be/oai --set allFtxt to JSON
$ catmandu convert OAI --url http://biblio.ugent.be/oai --set allFtxt to JSON --fix 'retain("title")'
$ catmandu convert SRU --base http://www.unicat.be/sru --query dna  
$ catmandu convert ArXiv --query 'all:electron'
$ catmandu convert PubMed --term 'hochstenbach'
$ cat test.tt
[%- FOREACH f IN record %]
[% _id %] [% f.shift %][% f.shift %][% f.shift %][% f.join(":") %]
[%- END %]
$ catmandu convert MARC to Template --template `pwd`/test.tt < records.mrc 

6.1.2 Import/Export

Store data in a (NoSQL) database and export it out again:

$ catmandu import JSON to MongoDB --database_name mydb --bag data < records.json
$ catmandu import MARC to MongoDB --database_name mydb --bag data < records.mrc
$ catmandu import MARC to ElasticSearch --index_name mydb --bag data < records.mrc
$ catmandu import MARC to ElasticSearch --index_name mydb --bag data --fix 'marc_map("245a","title")' < records.mrc

$ catmandu export MongoDB --database_name mydb --bag data to JSON
$ catmandu export MongoDB --database_name mydb --bag data to JSON --fix 'retain("_id")'
$ catmandu export Solr --url http://localhost:8983/solr to JSON
$ catmandu export ElasticSearch --index_name mydb to JSON

6.1.3 Copy

Copy data from one database to another

$ catmandu copy MongoDB --database_name items --bag book to ElasticSearch --index_name items --bag book

6.1.4 Count

Count the number of items in a store

$ catmandu count ElasticSearch --index-name shop --bag products --query 'brand:Acme'

6.1.5 Delete

Delete data from a store

# delete items with matching _id
$ catmandu delete ElasticSearch --index-name items --bag book --id 1234 --id 2345

# delete items matching the query
$ catmandu delete ElasticSearch --index-name items --bag book --query 'title:"My Rabbit"'

# delete all items
$ catmandu delete ElasticSearch --index-name items --bag book

6.1.6 Configuration

$ cat catmandu.yml
---
store:
  test1:
   package: MongoDB
   options:
    database_name: mydb
  test2:
   package: ElasticSearch
   options:
    index_name: mydb
  test3:
   package: Solr
   options:
    url: http://localhost:8983/solr

$ catmandu import JSON to test1 < records.json # Mongo
$ catmandu import MARC to test2 < records.mrc  # ElasticSearch
$ catmandu import YAML to test3 < records.yaml # Solr
$ catmandu export test1 to JSON                # Mongo
$ catmandu export test2 to JSON                # ElasticSearch
$ catmandu export test3                        # Solr
$ cat fixes.txt
marc_map("245a","title");
marc_map("100","author.$append");
join_field("author",";");
marc_map("008_/10-13","language");
$ catmandu import MARC to test2 --fix fixes.txt

6.1.7 Stream

# Add a file to a FileStore
$ catmandu stream /tmp/myfile.txt to File::Simple --root t/data --bag 1234 --id myfile.txt

# Download a file from a FileStore
$ catmandu stream File::Simple --root t/data --bag 1234 --id myfile.txt to /tmp/output.txt

6.2 Fixes Cheat Sheet

This cheat sheet summarizes the fix language.

# Fixes clean your data. As input you get a Perl HASH. Each fix function is a command
# to transform the Perl HASH. Some fixes such as marc_map contain logic to transform
# complex data structures such as MARC.
set_field("my.name","patrick")             # { my => { name => 'patrick'} }
add_field("my.name2","nicolas")
move_field("my.name","your.name")
copy_field("your.name","your.name2")
remove_field("your.name")
# In all the field names in 'foo', replace all dots with underscores
rename(foo,"\.","_")

set_array("foo")                           # Create an empty array foo => []
set_array("foo","a","b","c")               # Create an array with three values foo => ['a','b','c']
set_hash("foo")                            # Create an empty hash foo => {}
set_hash("foo",a: b,c: d)                  # Create a hash with two values foo => { a => 'b' , c => 'd' }

array("foo")                               # Create an array from a hash :
                                           # foo => {"name":"value"} => [ "name" , "value" ]
hash("foo")                                # Create a hash from an array
                                           # foo => [ "name" , "value" ] => {"name":"value"}

assoc(fields, pairs.*.key, pairs.*.val)    # Associate two values as a hash key and value
                                           # {pairs => [{key => 'year', val => 2009}, {key => 'subject', val => 'Perl'}]}
                                           # {fields => {subject => 'Perl', year => 2009}, pairs => [...]}

upcase("title")                            # marc -> MARC
downcase("title")                          # MARC -> marc
capitalize("my.deeply.nested.field.0")     # marc -> Marc
trim("field_with_spaces")                  # "  marc  " -> marc
substring("title",0,1)                     # marc -> m
prepend("title","die ")                    # marc -> die marc
append("title"," must die")                # marc -> marc must die

# {author => "tom jones"}  -> {author => "senoj mot"}
reverse(author)
 
# {numbers => [1,14,2]} -> {numbers => [2,14,1]}
reverse(numbers)

# replace the value with a formatted (sprintf-like) version
# e.g. numbers: 
#         - 41
#         - 15
format(numbers,"%-10.10d %-5.5d") # numbers => "0000000041 00015"
# e.g. hash:
#        name: Albert
format(name,"%-10s: %s") # hash: "name      : Albert"

#  parses a text into an array or hash of values
# date: "2015-03-07"
parse_text(date, '(\d\d\d\d)-(\d\d)-(\d\d)')
# date: 
#    - 2015
#    - 03
#    - 07
 
# If your data record is:
#   a: eeny
#   b: meeny
#   c: miny
#   d: moe
paste(my.string,a,b,c,d)                 # my.string: eeny meeny miny moe
 
# Use a join character
paste(my.string,a,b,c,d,join_char:", ")  # my.string: eeny, meeny, miny, moe
 
# Paste literal strings with a tilde sign
paste(my.string,~Hi,a,~how are you?)     # my.string: Hi eeny how are you?

# date: "2015-03-07"
parse_text(date, '(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)')
# date:
#   year: "2015"
#   month: "03" 
#   day: "07"
 
# date: "abcd"
parse_text(date, '(\d\d\d\d)-(\d\d)-(\d\d)')
# date: "abcd"

lookup("title","dict.csv", sep_char:'|')  # lookup 'marc' in dict.csv and replace the value
lookup("title","dict.csv", default:test)  # lookup 'marc' in dict.csv and replace the value or set it to 'test'
lookup("title","dict.csv", delete:1)    # lookup 'marc' in dict.csv and replace the value, or delete the field if nothing is found

lookup_in_store('title', 'MongoDB', database_name:lookups)  # lookup the (id)-value of title in 'lookups' and
                                           # replace it with the data found
lookup_in_store('title', 'MongoDB', default:'default value' , delete:1) 

# Query a Solr index with the query stored in the 'query' field and overwrite it with all the results
search_in_store('query','Solr',url:"http://localhost:8983/solr",limit:10)

# Replace the data in foo.bar with an external file or url
import(foo.bar, JSON, file: "http://foo.com/bar.json", data_path: data.*)

add_to_store('authors.*', 'MongoDB', bag:authors, database_name:catalog)  # add matching values to a store as a side effect

add_to_exporter(data,CSV,header:1,file:/tmp/data.csv) # send the 'data' path to an alternative exporter
add_to_exporter(.,CSV,header:1,file:/tmp/data.csv)    # send the complete record to an alternative exporter

count("myarray")                           # count number of elements in an array or hash
sum("numbers")                             # replace an array element with the sum of its values
sort_field("tags")                         # sort the values of an array
sort_field("tags", uniq:1)                 # sort the values plus keep unique values
sort_field("tags", reverse:1)              # reverse sort
sort_field("tags", numeric:1)              # sort numerical values
uniq(tags)                                 # strip duplicate values from an array
filter("tags","[Cc]at")                    # filter array values tags = ["Cats","Dogs"] => ["Cats"]
flatten(deep)                              # {deep => [1, [2, 3], 4, [5, [6, 7]]]} => {deep => [1, 2, 3, 4, 5, 6, 7]}

cmd("java MyClass")                        # Use an external program that can read JSON 
                                           # from stdin and write JSON to stdout
perlcode("myscript.pl")                    # Execute Perl code as fix function
sleep(1,SECOND)                            # Do nothing for one second

split_field("foo",":")                     # marc:must:die -> ['marc','must','die']
join_field("foo",":")                      # ['marc','must','die'] -> marc:must:die
retain("id","id2","id3")                   # delete any field except 'id', 'id2', 'id3'
replace_all("title","a","x")               # marc -> mxrc

# Most functions can work also work on arrays. E.g.
replace_all("author.*","a","x")            # [ 'marc','jan'] => ['mxrc','jxn']
# Use:
#   authors.$last (last entry)
#   authors.$first (first entry)
#   authors.$append (last + 1)
#   authors.$prepend (first - 1)
#   authors.* (all authors)
#   authors.2 (3rd author)

collapse()                                 # collapse deep nested hash to a flat hash
expand()                                   # expand flat hash to deep nested hash
clone()                                    # clone the perl hash and work on the clone
reject()                                   # Reject (skip) a record
reject [condition]                         # Reject a record on some condition:
                                           #   reject all_match(...)
                                           #   reject any_match(...)
                                           #   reject exists(...)
select()                                   # Select a record
select [condition]                         # Select only those records that match a condition (see reject)

to_json('my.field')                        # convert a value of a field to json
from_json('my.field')                      # replace the json field with the parsed value
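# For example (hypothetical record):
#   my.field: {name: "patrick"}  =>  to_json  =>  my.field: '{"name":"patrick"}'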

export_to_string('my.field',CSV,sep_char:";")   # convert the value of a field into CSV
import_from_string('my.field',CSV,sep_char:";") # replace a CSV field with the parsed value

error("eek!")                              # abort the processing and say "eek!"
nothing()                                  # do nothing (used in benchmarking)

# Include fixes from another file
include('/path/to/myfixes.txt')

# Send debug messages to a logger
log('test123')
log('hello world' , level: 'DEBUG')

# Boolean AND and OR, need a Condition + 'and'/'or' + a Fix 
exists(foo) and log('foo exists' , level: INFO)
exists(foo) or log('foo doesnt exist' , level: INFO)
valid('', JSONSchema, schema: "my/schema.json") or log('this record is wrong', level: ERROR)

# 'caf%C3%A9' => 'café'
uri_decode(place)
# 'café' => 'caf%C3%A9'
uri_encode(place)

# Add a new field 'foo' with a random value between 0 and 9
random(foo, 10)

# Delete all the empty fields
vacuum()

# Copy all 245 subfields into the my.title hash
marc_map('245','my.title') 
# Copy the 245-$a$b$c subfields into the my.title hash in the order of the record
marc_map('245abc','my.title') 
# Copy the 245-$c$b$a subfields into the my.title hash in the order of the mapping
marc_map('245cba','my.title' , pluck:1) 
# Copy the 100 subfields into the my.authors array
marc_map('100','my.authors.$append') 
# Add the 710 subfields into the my.authors array
marc_map('710','my.authors.$append')
# Copy the 600-$x subfields into the my.subjects array while packing each into a genre.text hash
marc_map('600x','my.subjects.$append.genre.text')
# Copy character 35 of the 008 field into the my.language hash
marc_map('008_/35-35','my.language')
# Copy all the 600 fields into a my.stringy hash joining them by '; '
marc_map('600','my.stringy', join:'; ')
# When 024 field exists create the my.has024 hash with value 'found'
marc_map('024','my.has024', value:found)
# Do the same examples now with the marc fields in 'record2'
marc_map('245','my.title', record:record2)
# Remove the 900 fields
marc_remove('900')
# Add a marc field (in Catmandu::MARC 0.110)
marc_add('999', ind1, ' ' , ind2, '1' , a, 'test123')
# Add a marc field populated with data from your record
marc_add('245', a , $.my.title.field, c , $.my.author.field)
# Set a marc value of one (sub)field to a new value
marc_set('LDR/6','p')
marc_set('650p','test')
marc_set('100[3]a','Farquhar family.')

# Map all 650 subjects into an array 
marc_map('650','subject', join:'###') 
split_field('subject','###')
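# e.g. two 650 fields "Cats" and "Dogs" become subject: ["Cats","Dogs"]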

# uppercase the value of field 'foo' if all members of 'oogly' have the value 'doogly'
if all_match('oogly.*', 'doogly')
  upcase('foo') # foo => 'BAR'
else
  downcase('foo') # foo => 'bar'
end

# inverted
unless all_match('oogly.*', 'doogly')
  upcase('foo') # foo => 'BAR'
end;

# uppercase the value of field 'foo' if field 'oogly' has the value 'doogly'
if any_match('oogly', 'doogly')
  upcase('foo') # foo => 'BAR'
end

# inverted
unless any_match('oogly', 'doogly')
  upcase('foo') # foo => 'BAR'
end

# uppercase the value of field 'foo' if the field 'oogly' exists
if exists('oogly')
  upcase('foo') # foo => 'BAR'
end

# inverted
unless exists('oogly')
  upcase('foo') # foo => 'bar'
end

# add a new field when the 'year' field is equal to 2018
if all_equal('year','2018')
 add_field('my.funny.title','true')
end

# add a new field when at least one of the 'year'-s is equal to 2018
if any_equal('years.*','2018')
 add_field('my.funny.title','true')
end

# compare things (needs Catmandu 0.92 or better)
if greater_than('year',2000)
  add_field('recent','yes')
end

if less_than('year',1970)
  add_field('ancient','yes')
end

# execute fixes if one path is contained in another
# foo => 1 , bar => [3,2,1]  => in(foo,bar) -> true
if in(foo,bar)
   add_field(test,ok)
end

# only execute fixes if all path values are the boolean true, 1 or "true"
if is_true(data.*.has_error)
  add_field(error,yes)
end

# only execute fixes if all path values are the boolean false, 0 or "false"
if is_false(data.*.has_error)
  add_field(error,no)
end

# only execute the fixes if the path contains an array
if is_array(data)
  upcase(data.0)
end

# only execute the fixes if the path contains an object (a hash, nested field)
if is_object(data)
  add_field(data.ok,yes)
end

# only execute the fixes if the path contains a number
if is_number(data)
  append(data," : is a number")
end

# only execute the fixes if the path contains a string
if is_string(data)
  append(data," : is a string")
end

# only execute the fixes if the path contains 'null' values
if is_null(data)
  set_field(data,"I'm empty!")
end

# Evaluates to true when all marc (sub)fields match a regular expression
if marc_all_match('245','My funny title')
  add_field('funny.title','yes')
end
if marc_all_match('LDR/6','c')
  marc_set('LDR/6','p')
end

# Evaluates to true when at least one of the marc (sub)fields matches a regular expression
if marc_any_match('650','catmandu')
  add_field('important.books','yes')
end


# Evaluates to true when the JSON fragment is valid against a JSON Schema
if valid(data,JSONSchema,schema:myschema.json)
   ...
end

## Binds (needs Catmandu 0.92 or better)

# The identity binder doesn't embody any computational strategy. It simply 
# applies the bound fix functions sequentially to its input without any 
# modification.
do identity()
  add_field(foo,bar)
  add_field(foo2,bar2)
end

# The maybe binder applies the fix functions sequentially and skips the
# remaining fixes once one throws an error or returns undef.
do maybe()
  foo()
  return_undef() # rest will be ignored
  bar()
end

# List over all items in demo and add a foo => bar field
# { demo => [{},{},{}] } => { demo => [{foo=>bar},{foo=>bar},{foo=>bar}]}
do list(path: demo)
  add_field(foo,bar)
end

# Print statistical information on the processing speed of fixes to the standard error.
do benchmark(output:/dev/stderr)
  foo()
end

# Find all ISBN in a stream
do hashmap(exporter: JSON, join:',')
  # Need an identity binder to group all operations that calculate key/value pairs
  do identity()
   copy_field(isbn,key)
   copy_field(_id,value)
  end
end

# Count the number of ISBN occurrences in a stream
do hashmap(count: 1)
  copy_field(isbn,key)
end

# Filter out an array (needs Catmandu 0.9302 or better)
#    data:
#       - name: patrick
#       - name: nicolas
# to:
#    data:
#       - name: patrick
do with(path:data)
  reject all_match(name,nicolas)
  # Or:
  # if all_match(name,nicolas)
  #  reject()
  # end
end

# run fixes that must complete within a time limit
do timeout(time => 5, units => seconds)
  ...
end

# a binder that applies the fixes to every element in the record
do visitor()
   # upcase all the 'name' fields in the record
   if all_match(key,name)
     upcase(scalar)
   end
end

# a binder that runs fixes on records from an importer
do importer(OAI,url: "http://lib.ugent.be/oai") 
  retain(_id)
  add_to_exporter(.,YAML)
end

6.3 Example Fix Script

Here is an example Fix script, taken from a production system at Ghent University Library, that can be used for inspiration. This script is used to feed data from a MongoDB store of MARC records into a Blacklight Solr installation.
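
Such a script could be invoked with a command along these lines (the store options, Solr core and fix file name are hypothetical):

$ catmandu copy MongoDB --database_name catalog --bag data to Solr --url http://localhost:8983/solr/blacklight --fix lludss.fix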

#-
#- LLUDSS - Data cleaning fixes. Using MARC records as input
#-
#- 2013 Patrick.Hochstenbach@UGent.be
#-

copy_field('merge.source','source')
copy_field('merge.id','id')
set_field('is_deleted','false')

set_field('is_hidden','false')
copy_field('merge.hidden','is_hidden')

if exists('merge.related_desc')
    copy_field('merge.related_desc','json.merge_related_desc')
end

if exists('merge.deleted')
    set_field('is_deleted','true')
else
    #- Document Type
    unless exists('type')
        marc_map('920a','type')
        lookup("type", "/opt/lludss-import/etc/material_types.csv", default:"other")
    end

    #- ISBN/ISSN
    marc_map('020a','isbn.$append', join:'==')
    marc_map('022a','issn.$append', join:'==')
    join_field('isbn','==')
    split_field('isbn','==')
    join_field('issn','==')
    split_field('issn','==')
    replace_all('isbn.*','^([0-9xX-]+).*$','$1')
    replace_all('issn.*','^([0-9xX-]+).*','$1')

    #- Title
    marc_map('245ab','title', join:' ')
    replace_all('title','\[(.*)\]','$1')
    copy_field('title','title_sort')
    replace_all('title_sort','\W+','')
    substring('title_sort',0,50)
    downcase('title_sort')
    copy_field('title','json.title')
    marc_map('246','json.title_remainder', join:' ')
    marc_map('245a','title_short')

    #- Author
    marc_map('100ab','author.$append', join:' ')
    marc_map('700ab','author.$append', join:' ')
    unless all_match('type','phd|master|bachelor')
        marc_map('720ab','author.$append', join:' ')
    end
    author_names()
    copy_field('author','json.author')

    #- Imprint
    marc_map('008_/7-10','year')
    if all_match('year','[u^?-]{4}')
       remove_field('year')
    end
    replace_all('year','\D','0')

    if greater_than('year',2018)
        remove_field('year')
    end

    if marc_match('008_/6-6','b')
        prepend('year','-')
    end

    #- Edition
    marc_map('250a','json.edition')

    #- Description
    marc_map('300a','json.desc_extend')

    #- Summary
    marc_map('505a','json.summary.$append', join:"\n")
    marc_map('520a','json.summary.$append', join:"\n")

    #- If we have a dissertation then 502 is the summary, with 720 as promoter.
    #- It is then automatically also a UGent publication
    if all_match('type','phd|master')
        marc_map('502a','summary.$append')

        if exists('summary')
            join_field('summary','')
            move_field('summary','json.summary.$append')
        end

        add_field('only.$append','ugent')
    end

    unless exists('json.summary')
        weave_by_id('summary')
        if exists('_weave.summary.data.summary')
            copy_field('_weave.summary.data.summary','json.summary.$append')
        end
        remove_field('_weave')
    end

    #- Boost
    unless exists('_boost')
        weave_by_id('boost')
        if exists('_weave.boost.data.boost')
            copy_field('_weave.boost.data.boost','_boost')
        end
        remove_field('_weave')
    end

    #- Language
    marc_map('008_/35-37','lang')
    if all_match('lang','\W+')
        set_field('lang','und')
    end

    #- Subject
    marc_map('6**^0123456789','subject.$append', join:' ')
    replace_all('subject.*','\.$','')
    sort_field('subject', uniq:1)
    copy_field('subject','json.subject')

    #- Library, Faculty, Location
    marc_map('852c','library.$append')
    sort_field('library', uniq:1)
    marc_map('852x','faculty.$append')
    sort_field('faculty', uniq:1)
    marc_map('852j','location.$append')
    sort_field('location', uniq:1)

    #- Host publication
    host_publication()
    move_field('host_publication','json.host_publication.$append')

    #- Holding
    if exists('p_holding')
        copy_field('p_holding','year')
        replace_all('year',' .*','')
        move_field('p_holding','json.p_holding')
        move_field('p_holding_txt','json.host_publication.$append')
    end
    if exists('e_holding')
        copy_field('e_holding','year')
        replace_all('year',' .*','')
        move_field('e_holding','json.e_holding')
        move_field('e_holding_txt','json.host_publication.$append')
    end

    join_field('json.host_publication','<br>');

    #- Year cleanup
    replace_all('year','^(?<=-)?0+','')
    unless all_match('year','^-?([0-9]|[123456789][0-9]+)$')
        remove_field('year')
    end

    #- Wikipedia
    weave_by_id('wikipedia')
    copy_field('_weave.wikipedia.data.wikipedia_url','json.wikipedia_url')
    remove_field('_weave')

    #- Cover Image
    if all_match('merge.source','rug01|pug01|ebk01')
        weave_by_id('cover')
        copy_field('_weave.cover.data.cover_remote','json.cover_remote')
        remove_field('_weave')
    end

    #- Cover card-catalog
    if  exists(cid)
        add_field('json.cover_remote.$append','http://search.ugent.be/meercat/x/stream?source=rug02&id=')
        move_field('cid','json.cover_remote.$append')
        join_field('json.cover_remote','')
    end

    #- Fulltext
    fulltext()
    move_field('fulltext','json.fulltext')

    #- Remove record without items or fulltext
    unless exists('items')
        unless exists('json.fulltext')
            set_field('is_deleted','true')
        end
    end

    #- CATEGORY
    if exists('json.fulltext')
        add_field('only.$append','online')
    end
    if exists('items')
        add_field('only.$append','print')
    end

    if all_match('merge.source','pug01')
       add_field('only.$append','ugent')
    end

    sort_field("only", uniq:1, reverse:0)

    #- ALL Field
    all()

    #- Identifier indexes rug01, ser01, ...
    ids()

    #- Set
    marc_map('005','updated_at')
    #- Warning: Aleph doesn't do zulu-time...
    datetime_format('updated_at', time_zone:'Europe/Brussels', set_time_zone:'UTC', source_pattern: '%Y%m%d%H%M%S.%N', destination_pattern:'%Y-%m-%dT%H:%M:%SZ', delete:1)
    add_field('is_oai','false')
    if exists('updated_at')
        add_field('set.$append','all')
        set_field('is_oai','true')
    end
    sort_field('set', uniq:1)

    #- MARC Display
    marc_map('245','marc_display.$append.title', join:' ')
    marc_map('246','marc_display.$append.other-title', join:' ')
    marc_map('765','marc_display.$append.orig-title', join:' ')
    marc_map('210','marc_display.$append.abbrev-title', join:' ')
    marc_map('240','marc_display.$append.other-title', join:' ')
    marc_map('020','marc_display.$append.isbn', join:' ')
    marc_map('022','marc_display.$append.issn', join:' ')
    marc_map('028','marc_display.$append.publisher-no', join:' ')
    marc_map('048','marc_display.$append.voices-code', join:' ')
    marc_map('100','marc_display.$append.author', join:' ')
    marc_map('110','marc_display.$append.corp-author', join:' ')
    marc_map('700','marc_display.$append.author', join:' ')
    marc_map('720','marc_display.$append.other-name', join:' ')
    marc_map('111','marc_display.$append.conference', join:' ')
    marc_map('130','marc_display.$append.other-title', join:' ')
    marc_map('250','marc_display.$append.edition', join:' ')
    marc_map('255','marc_display.$append.scale', join:' ')
    marc_map('256','marc_display.$append.edition', join:' ')
    marc_map('260','marc_display.$append.publisher', join:' ')
    marc_map('261','marc_display.$append.publisher', join:' ')
    marc_map('263','marc_display.$append.publisher', join:' ')
    marc_map('300','marc_display.$append.description', join:' ')
    marc_map('310','marc_display.$append.frequency', join:' ')
    marc_map('321','marc_display.$append.prior-freq', join:' ')
    marc_map('340','marc_display.$append.description', join:' ')
    marc_map('362','marc_display.$append.pub-history', join:' ')
    marc_map('400','marc_display.$append.series', join:' ')
    marc_map('410','marc_display.$append.series', join:' ')
    marc_map('440','marc_display.$append.series', join:' ')
    marc_map('490','marc_display.$append.series', join:' ')
    marc_map('500','marc_display.$append.note', join:' ')
    marc_map('501','marc_display.$append.note', join:' ')
    marc_map('502','marc_display.$append.thesis', join:' ')
    marc_map('504','marc_display.$append.bibliography', join:' ')
    marc_map('505','marc_display.$append.content', join:' ')
    marc_map('508','marc_display.$append.credits', join:' ')
    marc_map('510','marc_display.$append.note', join:' ')
    marc_map('511','marc_display.$append.performers', join:' ')
    marc_map('515','marc_display.$append.note', join:' ')
    marc_map('518','marc_display.$append.note', join:' ')
    marc_map('520','marc_display.$append.summary', join:' ')
    marc_map('521','marc_display.$append.note', join:' ')
    marc_map('525','marc_display.$append.note', join:' ')
    marc_map('530','marc_display.$append.note', join:' ')
    marc_map('533','marc_display.$append.note', join:' ')
    marc_map('534','marc_display.$append.note', join:' ')
    marc_map('540','marc_display.$append.note', join:' ')
    marc_map('541','marc_display.$append.note', join:' ')
    marc_map('544','marc_display.$append.note', join:' ')
    marc_map('545','marc_display.$append.note', join:' ')
    marc_map('546','marc_display.$append.note', join:' ')
    marc_map('550','marc_display.$append.note', join:' ')
    marc_map('555','marc_display.$append.note', join:' ')
    marc_map('561','marc_display.$append.note', join:' ')
    marc_map('580','marc_display.$append.note', join:' ')
    marc_map('581','marc_display.$append.publication', join:' ')
    marc_map('583','marc_display.$append.note', join:' ')
    marc_map('586','marc_display.$append.note', join:' ')
    marc_map('591','marc_display.$append.note', join:' ')
    marc_map('598','marc_display.$append.classification', join:' ')
    marc_map('080','marc_display.$append.udc-no', join:' ')
    marc_map('082','marc_display.$append.dewey-no', join:' ')
    marc_map('084','marc_display.$append.other-call-no', join:' ')
    marc_map('600','marc_display.$append.subject', join:' ')
    marc_map('610','marc_display.$append.subject', join:' ')
    marc_map('611','marc_display.$append.subject', join:' ')
    marc_map('630','marc_display.$append.subject', join:' ')
    marc_map('650','marc_display.$append.subject', join:' ')
    marc_map('651','marc_display.$append.subject', join:' ')
    marc_map('653','marc_display.$append.subject', join:' ')
    marc_map('655','marc_display.$append.subject', join:' ')
    marc_map('662','marc_display.$append.subject', join:' ')
    marc_map('690','marc_display.$append.subject', join:' ')
    marc_map('692','marc_display.$append.subject', join:' ')
    marc_map('693','marc_display.$append.subject', join:' ')
    marc_map('710','marc_display.$append.corp-author', join:' ')
    marc_map('711','marc_display.$append.conference', join:' ')
    marc_map('730','marc_display.$append.other-title', join:' ')
    marc_map('749','marc_display.$append.title-local', join:' ')
    marc_map('752','marc_display.$append.other-info', join:' ')
    marc_map('753','marc_display.$append.other-info', join:' ')
    marc_map('772','marc_display.$append.parent-rec-ent', join:' ')
    marc_map('776','marc_display.$append.add-phys-form-e', join:' ')
    marc_map('777','marc_display.$append.issu-with-entry', join:' ')
    marc_map('780','marc_display.$append.preceding-entry', join:' ')
    marc_map('785','marc_display.$append.succeed-entry', join:' ')
    marc_map('LKR','marc_display.$append.note', join:' ')
    marc_map('024','marc_display.$append.object-id', join:' ')
    marc_map('856','marc_display.$append.e-location', join:' ')
    #-if_all_match('merge.source','ser01')
    #-    marc_map('852jhaz','marc_display.$append.location', join:' | ')
    #-end
    #-if_all_match('merge.source','rug01')
    #-    marc_map('Z303haz','marc_display.$append.location', join:' | ')
    #-end
    to_json('marc_display')

    #- Europeana Magic
    europeana()

    #- MARCXML
    marc_xml('record')
    move_field('record','fXML')
end

#- JSON
to_json('json')

add_field('_bag','data')

remove_field('record')
remove_field('merge')
remove_field('version')

6.4 Cookbook

Install Catmandu OAI processing on your computer

Make sure you have cpanm (hint: $ cpan App::cpanminus) installed.

$ cpanm Catmandu::OAI

Read Dublin Core records from an OAI repository from the command line

  1. Go to http://www.opendoar.org/
  2. Find a repository of choice
  3. Read the base URL of the repository from the ‘OAI-PMH’ field
  4. Execute in a terminal the catmandu convert command with the URL found in the OAI-PMH field

E.g.

$ catmandu convert OAI --url https://biblio.ugent.be/oai

Read Dublin Core records from an OAI repository in your Perl code

use Catmandu;
use Data::Dumper;

Catmandu->importer('OAI', url => 'https://biblio.ugent.be/oai')->each(sub {
   my $record = shift;   # $record is a hash reference
   print Dumper($record);
});

Convert Dublin Core records from an OAI repository into YAML from the command line

$ catmandu convert OAI --url https://biblio.ugent.be/oai to YAML

Convert Dublin Core records from an OAI repository into YAML in your Perl code

use Catmandu -all;

my $importer = importer('OAI',url => 'https://biblio.ugent.be/oai');
my $exporter = exporter('YAML');

$exporter->add_many($importer);
$exporter->commit;

Extract all identifiers from an OAI repository from the command line

$ catmandu convert OAI --url https://biblio.ugent.be/oai --fix 'retain("_id")'

or, if you prefer a CSV file:

$ catmandu convert OAI --url https://biblio.ugent.be/oai to CSV --fix 'retain("_id")'

Extract all identifiers from an OAI repository into CSV in your Perl code

use Catmandu;

my $importer = Catmandu->importer('OAI',url => 'https://biblio.ugent.be/oai');
my $fixer    = Catmandu->fixer('retain("_id")');
my $exporter = Catmandu->exporter('CSV');

$exporter->add_many(
     $fixer->fix($importer)
);

$exporter->commit;

Show the speed of importing records from the command line

Hint: use the -v option

$ catmandu convert -v OAI --url https://biblio.ugent.be/oai to CSV --fix 'retain("_id")' > /dev/null

Here we send the output to /dev/null so that only the verbose progress messages are shown.

Show the speed of importing records from your Perl program

use Catmandu;

my $importer = Catmandu->importer('OAI',url => 'https://biblio.ugent.be/oai');
my $fixer    = Catmandu->fixer('retain("_id")');
my $exporter = Catmandu->exporter('CSV');

$exporter->add_many(
     $fixer->fix($importer->benchmark)
);

$exporter->commit;

See some debug messages

Make sure you have Log::Log4perl installed (hint: $ cpan Log::Any::Adapter::Log4perl).

In your main program do:

use Catmandu;
use Log::Any::Adapter;
use Log::Log4perl;

Log::Any::Adapter->set('Log4perl');
Log::Log4perl::init('./log4perl.conf');

# The lines above should be enough to activate logging for Catmandu.
# Include the lines below to activate logging for your main program.
my $logger = Log::Log4perl->get_logger('myprog');

$logger->info("Starting main program");

...your code...

with log4perl.conf like:

# Send a copy of all logging messages to STDERR
log4perl.rootLogger=DEBUG,STDERR

# Logging specific for your main program
log4perl.category.myprog=INFO,STDERR

# Logging specific for one part of Catmandu
log4perl.category.Catmandu::Fix=DEBUG,STDERR

# Where to send the STDERR output
log4perl.appender.STDERR=Log::Log4perl::Appender::Screen
log4perl.appender.STDERR.stderr=1
log4perl.appender.STDERR.utf8=1

log4perl.appender.STDERR.layout=PatternLayout
log4perl.appender.STDERR.layout.ConversionPattern=%d [%P] - %p %l time=%r : %m%n

You will now see Catmandu log messages (e.g. for Fixes).

If you want to add logging functionality to your own Perl modules you have two options:

  1. Your package is a Catmandu::Importer or Catmandu::Exporter. In this case you are lucky because you have a logger as part of your instance:

    $self->log->debug(‘blablabla’); # where $self is an Importer,Fix or Exporter instance

  2. You need to create the logger yourself.

    package Foo::Bar;

    use Moo;

    with 'Catmandu::Logger';

    sub bar { my $self = shift; $self->log->debug('tadaah'); }

If you want to see the logging messages of your package only, then use this type of line in your log4perl.conf:

log4perl.category.Foo::Bar=DEBUG,STDOUT

or if you want to see all the log messages for Foo packages:

log4perl.category.Foo=DEBUG,STDOUT 

How to create a new Catmandu::Store

A Catmandu::Store is used to store items. Stores can have one or more compartments in which to store items; each such compartment is a Catmandu::Bag. You can compare a Store with a database and a Bag with a table in that database. Like tables, Bags have names. When no name is provided for a Bag, ‘data’ is used.
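
For example, here is a minimal session with a Store and its default Bag, using the in-memory Catmandu::Store::Hash that ships with Catmandu (a sketch):

use Catmandu;

my $store = Catmandu->store('Hash');            # an in-memory store
my $bag   = $store->bag;                        # the default bag, named 'data'
my $item  = $bag->add({ name => 'Catmandu' });  # add() returns the item with an _id set
print $item->{_id}, "\n";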

To implement a Catmandu store you need to create at least two packages:

  1. A ‘Catmandu::Store’, defining the general parameters, possible connection parameters and actions for the whole store.
  2. A ‘Catmandu::Bag’, which is used to list, add, fetch and delete items from a Bag.

As example, this is a skeleton for a ‘Foo’ Catmandu::Store which requires at least one ‘foo’ connection parameter:

package Catmandu::Store::Foo;
use Moo;

use Catmandu::Store::Foo::Bag;

with 'Catmandu::Store';

has 'foo' => (is => 'ro' , required => 1);

1;

For this Catmandu::Store::Foo we can define a module ‘Catmandu::Store::Foo::Bag’ to implement the Bag functions. Notice how in the generator the bag can access the Catmandu::Store instance:

package Catmandu::Store::Foo::Bag;
use Moo;

with 'Catmandu::Bag';

sub generator {
    my $self = shift;
    sub {
        # This subroutine is used to loop over all items
        # in a store and should return an item HASH for
        # every call
        return { 
             name => $self->name,
             foo => $self->store->foo 
       };
    };
}

sub get {
    my ($self,$id) = @_;
    # return an item HASH given an $id
    return {};
}

sub add {
    my ($self,$data) = @_;
    # add/update an item HASH to the bag and return the item with an _id field set
    return $data;
}

sub delete {
    my ($self,$id) = @_;
    # delete an item from the bag given an $id
    1;
}

sub delete_all {
    my ($self) = @_;
    # delete all items
    $self->each(sub {
        $self->delete($_[0]->{_id});
    });
}

1;

With this skeleton Store you have enough code to run basic tests. Save these packages in a lib directory:

lib/Catmandu/Store/Foo.pm
lib/Catmandu/Store/Foo/Bag.pm

and run a catmandu command to test your implementation:

$ catmandu -I lib export Foo --foo bar

{"foo":"bar","name":"data"}
{"foo":"bar","name":"data"}
{"foo":"bar","name":"data"}
...

Or create a test.pl script to access your new Store via Perl:

#!/usr/bin/env perl
use lib qw(./lib);
use Catmandu;

my $store = Catmandu->store('Foo', foo => 'bar');

$store->add({ test => 123});

7 API

This section provides an in-depth overview of how to extend Catmandu using the API.

7.1 Fix packages

Create a simple Fix

The easiest way to create a new ‘Fix’ is by creating a Perl package in the Catmandu::Fix namespace that has a ‘fix’ instance method. For example:

package Catmandu::Fix::foo;

use Moo;

sub fix {
    my ($self, $data) = @_;

    # modify your data here, for instance...
    $data->{foo} = 'bar';

    $data;
}

1;

When this code is available in your perl library path as Catmandu/Fix/foo.pm it can be used as the fix function foo(). To try it out, save the file as lib/Catmandu/Fix/foo.pm in your local directory and execute:

$ echo '{}' | catmandu -I lib convert JSON --fix "foo()"
{"foo":"bar"}

Fix creation with helper packages

The following instructions are incomplete; see the POD of Catmandu::Fix.

If you want to pass arguments to your fix, you can make use of Moo and Catmandu::Fix::Has to read in required and optional parameters.

package Catmandu::Fix::foo;

use Moo;
use Catmandu::Fix::Has;

has greeting => (fix_arg => 1);  # required first argument
has message  => (fix_arg => 1);  # required second argument
has eol      => (fix_opt => 1, default => sub { '!' });  # optional argument , default '!'

sub fix {
    my ($self,$data) = @_;

    $self->log->debug($self->greeting . ", " . $self->message .  $self->eol. "\n";

    # Fix your data here...

    $data;
}

1;

Now you can pass arguments and options to your fix:

$ echo '{}' | catmandu convert --fix 'foo(Hello,World)'
Hello, World!
{}
$ echo '{}' | catmandu convert --fix 'foo(Hello,World, eol: ?)'
Hello, World?
{}

See also Catmandu::Fix::SimpleGetValue.

Extended introduction

For an extended introduction into creating Fix packages read the two blog posts at:

8 Contribution

This guide has been written to help anyone interested in contributing to the development of Catmandu. Please read this guide before contributing to Catmandu or related projects, to avoid wasted effort and to maximize the chances of your contributions being used.

8.1 Ways to contribute

There are many ways to contribute to the project. Catmandu is a young yet active project and any kind of help is very much appreciated!

8.1.1 Publicity

You don’t have to start by hacking the code, spreading the word is very valuable as well!

If you have a blog, just feel free to speak about Catmandu.

Of course, it doesn’t have to be limited to blogs or Twitter. Feel free to spread the word in whatever way you consider fit and drop us a line on the Catmandu user mailing list noted below.

Also, if you’re using and enjoying Catmandu, rating us on cpanratings.perl.org, explaining what you like about Catmandu is another very valuable contribution that helps other new users find us!

8.1.2 Mailing list

Subscribing to the mailing list and providing assistance to new users is incredibly valuable.

8.1.3 Documentation

We value documentation very much, but it’s difficult to keep it up-to-date. If you find a typo or an error in the documentation please do let us know - ideally by submitting a patch (pull request) with your fix or suggestion (see Patch Submission).

8.1.4 Code

You can contribute to Catmandu’s core code or extend its functionality with new Importers, Exporters, Stores, Fix packages, Validators, Binds, or Plugins. Have a look at the list of missing modules for existing ideas and resources for new Catmandu modules. Feel free to add new ideas and links there as well.

For more detailed guidelines, see:

8.2 Quality Supervision and Reporting Bugs

We can measure our quality using the CPAN testers platform: http://www.cpantesters.org.

A good way to help the project is to find a failing build log on the CPAN testers: http://www.cpantesters.org/distro/D/Catmandu.html

If you find a failing test report or another kind of bug, feel free to report it as a GitHub issue: http://github.com/LibreCat/Catmandu/issues. Please make sure the bug you’re reporting does not yet exist.

8.3 RESOURCES FOR DEVELOPERS

8.3.1 Website

The official website is here: http://librecat.org/ A Wordpress blog with hints is available at: https://librecatproject.wordpress.com/

8.3.2 Mailing Lists

A mailing list is available here: librecat-dev@mail.librecat.org

8.3.3 Repositories

The official repository is hosted on GitHub at http://github.com/LibreCat/Catmandu.

Official developers have write access to this repository, contributors are invited to fork the dev branch (!) and submit a pull request, as described at patch submission.

8.3.4 Core Maintainers

  • LibreCat/Catmandu - @nics
  • LibreCat/Catmandu-AWS - @phochste
  • LibreCat/Catmandu-AlephX - @nicolasfranck
  • LibreCat/Catmandu-ArXiv - @pietsch, @vpeil
  • LibreCat/Catmandu-Atom - @phochste
  • LibreCat/Catmandu-BibTeX - @pietsch, @vpeil
  • LibreCat/Catmandu-Cmd-fuse - @nics
  • LibreCat/Catmandu-Cmd-repl - @pietsch
  • LibreCat/Catmandu-CrossRef - @pietsch, @vpeil
  • LibreCat/Catmandu-DBI - @nicolasfranck
  • LibreCat/Catmandu-DSpace - @nicolasfranck
  • LibreCat/Catmandu-EuropePMC - @vpeil
  • LibreCat/Catmandu-Exporter-ODS - @snorri
  • LibreCat/Catmandu-Exporter-RTF - @petrakohorst
  • LibreCat/Catmandu-Exporter-Template - @vpeil
  • LibreCat/Catmandu-FedoraCommons - @phochste
  • LibreCat/Catmandu-Fix-XML - @nichtich
  • LibreCat/Catmandu-Fix-cmd - @nichtich
  • LibreCat/Catmandu-Importer-CPAN - @nichtich @phochste
  • LibreCat/Catmandu-Importer-Parltrack - @jonas
  • LibreCat/Catmandu-Inspire - @vpeil
  • LibreCat/Catmandu-LDAP - @nics
  • LibreCat/Catmandu-MARC - @phochste
  • LibreCat/Catmandu-MediaMosa - @nicolasfranck
  • LibreCat/Catmandu-OAI - @pietsch, @phochste
  • LibreCat/Catmandu-ORCID - @pietsch
  • LibreCat/Catmandu-PLoS - @pietsch, @vpeil
  • LibreCat/Catmandu-Plack-REST - @phochste
  • LibreCat/Catmandu-PubMed - @pietsch, @vpeil
  • LibreCat/Catmandu-Pure - @snorri
  • LibreCat/Catmandu-RDF - @nichtich
  • LibreCat/Catmandu-SRU - @pietsch
  • LibreCat/Catmandu-Serializer-messagepack - @nicolasfranck
  • LibreCat/Catmandu-Serializer-storable - @nics
  • LibreCat/Catmandu-Store-CouchDB - @nics
  • LibreCat/Catmandu-Store-Elasticsearch - @nics
  • LibreCat/Catmandu-Store-Lucy - @nics
  • LibreCat/Catmandu-Store-MongoDB - @nics
  • LibreCat/Catmandu-Store-Solr - @nicolasfranck , @nics
  • LibreCat/Catmandu-Twitter - @pietsch
  • LibreCat/Catmandu-XLS - @jorol, @nics
  • LibreCat/Catmandu-Z3950 - @pietsch
  • LibreCat/Dancer-Plugin-Auth-RBAC-Credentials-Catmandu - @nicolasfranck
  • LibreCat/Dancer-Plugin-Catmandu-OAI - @nicolasfranck
  • LibreCat/Dancer-Plugin-Catmandu-SRU - @nics, @phochste
  • LibreCat/Dancer-Session-Catmandu - @nics
  • LibreCat/LibreCat-Sitemap - @phochste
  • LibreCat/MODS-Record - @phochste
  • LibreCat/Plack-Session-Store-Catmandu - @nics
  • LibreCat/Task-Catmandu - @nics
  • LibreCat/WWW-ORCID - @nics

8.4 Acknowledgement

This guide was based on .

8.5 Development Setup

The following guidelines describe how to set up a development environment for contributing code.

8.5.1 Set up a development environment

If you want to submit a patch for Catmandu, you need git and very likely also milla (Dist::Milla). We also recommend perlbrew (see below) to test and develop Catmandu on a recent version of perl, and cpanm to quickly and comfortably install perl modules under perlbrew.

In the following sections we provide tips for the installation of some of these tools together with Catmandu. Please also see the documentation that comes with these tools for more info.

Perlbrew tips (Optional)

Install perlbrew for example with

cpanm App::perlbrew

Check which perls are available

perlbrew available

At the time of writing it looks like this

perl-5.18.0
perl-5.16.3
perl-5.14.4
perl-5.12.5
perl-5.10.1
perl-5.8.9
perl-5.6.2
perl5.005_04
perl5.004_05
perl5.003_07

Then go on and install a version inside perlbrew. We recommend giving a name to the installation (--as option) and compiling without the tests (-n option) to speed it up.

perlbrew install -n perl-5.16.3 --as catmandu_dev -j 3

Wait a while, and it should be done. Switch to your new Perl with:

perlbrew switch catmandu_dev

Now you are using the fresh Perl; you can check it with:

which perl

Install cpanm on your brewed version of perl.

perlbrew install-cpanm

8.5.2 Install dependencies (required)

this section needs to be rewritten to reflect the change to Dist::Milla
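
In the meantime, the dependencies of a checked-out source tree can usually be installed with cpanm (a sketch, assuming a standard CPAN-style distribution layout):

$ cpanm Dist::Milla
$ cpanm --installdeps .   # install runtime and test dependencies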

8.5.3 Get Catmandu sources

Get the Catmandu sources from github (for a more complete git workflow see below):

Clone the repository to have a local copy using the following command:

$ git clone git@github.com:LibreCat/Catmandu.git

The installation is then straightforward:

$ cd Catmandu
$ perl Build.PL
$ ./Build
$ ./Build test
$ ./Build install

You can now start with hacking Catmandu and patch submission!

8.6 Coding guidelines

The following guidelines are not strict rules, but they should be considered best practice for contributions.

8.7 Compatibility

Catmandu should install on any Perl version since 5.10.1, on any platform for which Perl exists. We focus mainly on GNU/Linux (any distribution).

You should avoid regressions as much as possible and keep backwards compatibility in mind when refactoring. Stable releases should not break functionality and new releases should provide an upgrade path and upgrade tips such as warning the user about deprecated functionality.

8.8 Code documentation

Document your module with

  • a meaningful abstract
  • a SYNOPSIS with usage example
  • a short DESCRIPTION giving an introduction, including explicit links to other modules (e.g. roles)
  • a CONFIGURATION section listing all constructor arguments
  • a METHODS section listing all public methods. Methods derived from other modules should not be included but the modules should be mentioned explicitly.
  • a SEE ALSO section listing related modules

Names of other modules should be linked (e.g. L<Catmandu::Importer>)
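
A minimal POD skeleton following these guidelines might look like this (module name and content hypothetical):

=head1 NAME

Catmandu::Exporter::Foo - export items in the hypothetical Foo format

=head1 SYNOPSIS

    # from the command line
    $ catmandu convert JSON to Foo < data.json

=head1 DESCRIPTION

A short introduction, with explicit links such as L<Catmandu::Exporter>,
the role this module implements.

=head1 CONFIGURATION

=head1 METHODS

Methods derived from L<Catmandu::Exporter> are not repeated here.

=head1 SEE ALSO

L<Catmandu::Importer>

=cut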

8.9 Patch Submission

The Catmandu development team uses GitHub to collaborate. We greatly appreciate contributions submitted via GitHub, as it makes tracking these contributions and applying them much, much easier. This gives your contribution a much better chance of being integrated into Catmandu quickly!

To help us achieve high-quality, stable releases, the git-flow workflow is used to handle pull requests. This means contributors must work on the dev branch rather than on master. (master should be touched only by the core dev team when preparing a release to CPAN; all ongoing development happens in branches which are merged into the dev branch.)

Here is the workflow for submitting a patch:

  1. Fork the repository http://github.com/LibreCat/Catmandu (click “Fork”)

  2. Clone your fork to have a local copy using the following command:

     $ git clone git://github.com/$myname/Catmandu.git
  3. As a contributor, you should always work on the dev branch of your clone (master is used only for building releases).

     $ git remote add upstream https://github.com/LibreCat/Catmandu.git
     $ git fetch upstream
     $ git checkout -b dev upstream/dev

    This will create a local branch named dev in your clone that tracks the official dev branch. That way, if you have more or fewer commits than the upstream repo, you’ll be immediately notified by git.

  4. You want to isolate all your commits in a topic branch; this will make the reviewing much easier for the core team and will allow you to continue working on your clone without worrying about different commits mixing together.

    To do that, first create a local branch to build your pull request:

     # you should be in dev branch here
     git checkout -b pr/$name

    Now you have created a local branch named pr/$name where $name is the name you want (it should describe the purpose of the pull request you’re preparing).

  5. In that branch, do all the commits you need (the more the better) and when done, push the branch to your fork:

    # ... commits ...
    git push origin pr/$name

    You are now ready to send a pull request.

  6. Send a pull request via the GitHub interface. Make sure your pull request is based on the pr/$name branch you’ve just pushed, so that it incorporates the appropriate commits only.

    It’s also a good idea to summarize your work in a report sent to the users mailing list (see below), in order to make sure the team is aware of it.

    When the core team reviews your pull request, it will either accept (and then merge into dev) or refuse your request.

    If it’s refused, try to understand the reasons explained by the team for the denial. Most of the time, communicating with the core team is enough to understand what the mistake was. Above all, please don’t be offended.

  7. If your pull-request is merged into dev, then all you have to do is to remove your local and remote pr/$name branch:

     git checkout dev
     git branch -D pr/$name
     git push origin :pr/$name
  8. And then, of course, you need to sync your local dev branch with the upstream:

     git pull upstream dev
     git push origin dev

    You’re now ready to start working on a new pull request!