This handbook contains the aggregated content of the Catmandu documentation wiki. Feel free to improve the documentation there!
Catmandu is a command line tool to access and convert data from your digital library, research services or any other open data sets. The toolkit was originally developed as part of the LibreCat project and now attracts an international development team with many participating institutions.
Catmandu provides tools to convert, transform, store, index and export data.
Catmandu is used in the LibreCat project to build institutional repositories and search engines. Catmandu is used on the command line for quick and dirty reports but also as part of larger programming projects processing millions of records per day. For a short overview of use-cases, see our Homepage.
As of 15 Aug 2022, there are:

- 98 Catmandu-related repositories available at GitHub LibreCat
- 112 Catmandu-related modules on MetaCPAN
- 227 Catmandu-related repositories across all of GitHub
To get Catmandu running on your system you need to download and install at least the CPAN Catmandu module. Additional modules add support for more input and output formats, databases, and processing options.
To install Catmandu modules, select at least Catmandu (and probably Catmandu::MARC, Catmandu::OAI, Catmandu::RDF and Catmandu::XLS):
$ sudo cpanm Catmandu Catmandu::MARC
To install extra Catmandu modules at any point in time, use the cpanm command:
$ sudo cpanm Catmandu::OAI
$ sudo cpanm Catmandu::RDF
$ sudo cpanm Catmandu::Store::MongoDB
$ sudo cpanm Catmandu::XLS
To make full use of the capabilities of Catmandu, databases and search engines such as MongoDB, Elasticsearch, Solr, Postgres and MySQL can be installed on the system together with the corresponding Catmandu tools. How to install these databases on your local system falls outside the scope of this documentation. Please consult the installation guide of the database product for more information. For more information on the available Catmandu packages consult our Distributions list.
Here are some Catmandu installation hints for various platforms.
Several Catmandu packages are officially included in Debian but not all (see Voting Catmandu packages to be included in Debian and this search of currently available packages).
You can install all packages officially included in Debian:
sudo apt-get update
sudo apt-get install libcatmandu*-perl
Alternatively, you can build the newest Catmandu and its dependencies from source:
sudo apt-get update
sudo apt-get install cpanminus build-essential libexpat1-dev libssl-dev libxml2-dev libxslt1-dev libgdbm-dev libmodule-install-perl
cpanm Catmandu Catmandu::MARC
Alternatively, you can build the newest Catmandu as unofficial packages, using as many official packages as possible:
sudo apt update
sudo apt install dh-make-perl liblocal-lib-perl apt-file
sudo apt-file update
sudo apt install libtest-fatal-perl libmodule-build-tiny-perl libmoo-perl libmodule-pluggable-perl libcapture-tiny-perl libclass-load-perl libgetopt-long-descriptive-perl libio-tiecombine-perl libstring-rewriteprefix-perl libio-handle-util-perl
cpan2deb --vcs '' MooX::Aliases
cpan2deb --vcs '' Log::Any
cpan2deb --vcs '' App::Cmd
cpan2deb --vcs '' LaTeX::ToUnicode
cpan2deb --vcs '' PICA::Data
cpan2deb --vcs '' LV
cpan2deb --vcs '' MODS::Record
sudo dpkg -i lib*-perl_*.deb
cpan2deb --vcs '' BibTeX::Parser
sudo dpkg -i libbibtex-parser-perl_*.deb
sudo apt install libexporter-tiny-perl
cpan2deb --vcs '' JSON::Path
sudo dpkg -i libjson-path-perl_*.deb
cpan2deb --vcs '' JSON::Hyper
sudo dpkg -i libjson-hyper-perl_*.deb
sudo apt install libhttp-link-parser-perl libautovivification-perl libmatch-simple-perl
cpan2deb --vcs '' JSON::Schema
sudo dpkg -i libjson-schema-perl_*.deb
sudo apt install libjson-xs-perl libtest-exception-perl libtest-deep-perl libfile-slurp-tiny-perl liburi-template-perl libtry-tiny-byclass-perl libdata-util-perl libdata-compare-perl libhash-merge-simple-perl libthrowable-perl libclone-perl libdata-uuid-perl libmarpa-r2-perl libconfig-onion-perl libmodule-info-perl libtext-csv-perl libcgi-expand-perl
dh-make-perl --vcs '' --cpan Catmandu
perl -i -pe 's/libossp-uuid-perl[^,\n]*/libdata-uuid-perl/g' libcatmandu-perl/debian/control
( cd libcatmandu-perl && dpkg-buildpackage -b -us -uc -d )
sudo dpkg -i libcatmandu-perl_*.deb
dh-make-perl --vcs '' --cpan Catmandu::Twitter
perl -i -pe 's/liburi-perl\K[^,\n]*//g' libcatmandu-twitter-perl/debian/control
( cd libcatmandu-twitter-perl && dpkg-buildpackage -b -us -uc -d )
sudo apt install libchi-perl libnet-ldap-perl libdatetime-format-strptime-perl libxml-libxslt-perl libxml-struct-perl libnet-twitter-perl libxml-parser-perl libspreadsheet-xlsx-perl libexcel-writer-xlsx-perl libdevel-repl-perl libio-pty-easy-perl
cpan2deb --recursive --vcs '' Task::Catmandu
sudo apt install 'libcatmandu-*'
sudo dpkg -i libcatmandu-twitter-perl_*.deb
sudo dpkg -i ~/.cpan/build/libcatmandu-*-perl_*.deb
Alternatively, if you want to install as many packages as possible from the Debian repositories but also to have an additional package like Catmandu::OAI, you need to install packages and build just that module (with any dependency which would conflict if installed from the repositories):
sudo apt-get install build-essential libcatmandu*-perl libexpat1-dev libssl-dev libxml2-dev libxslt1-dev libgdbm-dev libmodule-install-perl dh-make-perl liblocal-lib-perl apt-file libtest-fatal-perl libmodule-build-tiny-perl libmoo-perl libmodule-pluggable-perl libcapture-tiny-perl libclass-load-perl libgetopt-long-descriptive-perl libio-tiecombine-perl libstring-rewriteprefix-perl libio-handle-util-perl libtest-simple-perl libtest-needsdisplay-perl libtest-lwp-useragent-perl cpanminus
sudo cpanm Catmandu::OAI
(Tested in Debian 8 / Jessie and Ubuntu 17.10. Compared to the advice above, we add libtest-simple-perl libtest-needsdisplay-perl libtest-lwp-useragent-perl and avoid libhttp-oai-perl, which produces: Installed version (3.27) of HTTP::OAI is not in range '4.03'.)
apt-get install make
apt-get install libmodule-install-perl
apt-get install libyaz-dev
apt-get install libwrap0-dev
apt-get install libxml2-dev zlib1g zlib1g-dev
apt-get install libexpat1-dev
apt-get install libxslt1-dev
apt-get install libssl-dev
apt-get install libgdbm-dev
apt-get install perl-doc
yes | cpan Test::More
yes | cpan YAML
yes | cpan App::cpanminus
/usr/local/bin/cpanm Catmandu Catmandu::MARC
yum groupinstall "Development Tools"
yum install perl-ExtUtils-MakeMaker
yum install perl-CPAN -y
yum install gcc -y
yum install gdbm gdbm-devel -y
yum install openssl-devel -y
yum install tcp_wrappers-devel -y
yum install expat expat-devel -y
yum install libxml2 libxml2-devel libxslt libxslt-devel -y
yes | cpan YAML
yes | cpan App::cpanminus
/usr/local/bin/cpanm Catmandu Catmandu::MARC
yum group install "Development Tools"
yum install perl-devel perl-YAML perl-CPAN perl-App-cpanminus -y
yum install openssl-devel tcp_wrappers-devel expat expat-devel libxml2 libxml2-devel libxslt libxslt-devel -y
cpanm autodie Catmandu Catmandu::MARC
sudo zypper install --type pattern devel_basis
sudo zypper install libxml2-devel libxslt-devel
curl -L http://cpanmin.us | perl - App::cpanminus ## unless you already have cpanm
cpanm Catmandu Catmandu::MARC
cpan App::cpanminus
cpanm Catmandu Catmandu::MARC
Install Xcode from the App Store first and Homebrew from https://brew.sh
brew install libxml++ libxml2 xml2 libxslt
# Install plenv from https://github.com/tokuhirom/plenv
git clone https://github.com/tokuhirom/plenv.git ~/.plenv
echo 'export PATH="$HOME/.plenv/bin:$PATH"' >> ~/.bash_profile
echo 'eval "$(plenv init -)"' >> ~/.bash_profile
exec $SHELL -l
git clone https://github.com/tokuhirom/Perl-Build.git ~/.plenv/plugins/perl-build/
# Install a modern Perl
plenv install 5.22.0
plenv rehash
plenv install-cpanm
plenv global 5.22.0
# Install catmandu
cpanm Catmandu Catmandu::MARC
plenv rehash
A Docker image of Catmandu is built with each release. After installing Docker, get and use the Catmandu image like this:
# Upgrade to the latest version
docker pull librecat/catmandu
# Run the docker command
docker run -it librecat/catmandu
Or, in case you want a native install, use Strawberry Perl. Catmandu installations have been tested up to version 5.24.1.1. After installation of the EXE, reboot your machine, start the cmd.exe command line and execute:
cpanm Catmandu Catmandu::MARC
Since Raspbian is based on Debian stable, you could follow the instructions there. Unfortunately, you will run into timeouts, so it is advisable to install some prerequisites via apt-get first:
sudo apt-get install libboolean-perl libdevel-repl-perl libnet-twitter-perl
sudo apt-get install libxml-easy-perl libxslt1-dev libgdbm-dev
Most of the Catmandu processing doesn’t require you to write any code. With our command line tools you can store data files into databases, index your data, export data in various formats and perform basic data cleanup operations.
The convert command is used to transform one format to another, or to download data from the Internet. For example, to extract all titles from a MARC record one can write:
$ catmandu convert MARC to CSV --fix 'marc_map(245a,title); retain(title)' < data.mrc
In the example above, we import MARC and export it again as CSV while extracting the 245a field from a record and deleting all the rest. With the convert command one can transform data from one format to another.
Transform JSON to YAML:
$ catmandu convert JSON to YAML < data.json
Transform YAML to JSON:
$ catmandu convert YAML to JSON < data.yaml
Convert Excel to CSV:
$ catmandu convert XLS to CSV < data.xls
The Fix language can be used to extract the fields you are interested in from an input:
Convert Excel to CSV and only keep the titles, authors, and year columns:
$ catmandu convert XLS to CSV --fix 'retain(titles,authors,year)' < data.xls
In formats such as JSON or YAML the data can be deeply nested. All these fields can be accessed and converted.
$ catmandu convert JSON --fix 'upcase(my.nested.field.1)' < data.json
In the example above a JSON input is converted by upcasing the second item (indicated by index 1) of the list found at the nested path my.nested.field.
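For instance, given this hypothetical input record:

{"my":{"nested":{"field":["one","two","three"]}}}

the fix above would produce:

{"my":{"nested":{"field":["one","TWO","three"]}}}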
The convert command can also be used to extract data from a database. For example to download the Dublin Core data from the UGent institutional repository type:
$ catmandu convert OAI --url http://biblio.ugent.be/oai
To get a CSV export of all identifiers in this OAI-PMH service type:
$ catmandu convert OAI --url http://biblio.ugent.be/oai to CSV --fix 'retain(_id)'
Or a YAML file with all titles:
$ catmandu convert OAI --url http://biblio.ugent.be/oai --set public to YAML --fix 'retain(title)'
The import command is used to import data into a database. Catmandu provides support for NoSQL databases such as MongoDB, Elasticsearch and CouchDB, which require no pre-configuration before they can be used. There is also support for relational databases such as Oracle, MySQL and Postgres via DBI, and for search engines like Solr, but these need to be configured first (databases, tables and schemas need to be created).
Importing a JSON document into MongoDB database can be as simple as:
$ catmandu import JSON to MongoDB --database_name bibliography < books.json
Importing into a database can be done for every format that is supported by Catmandu. For instance, MARC can be imported with this command:
$ catmandu import MARC to MongoDB --database_name marc_data < data.mrc
Or, XLS
$ catmandu import XLS to MongoDB --database_name my_xls_data < data.xls
Even a download from a website can be directly stored into a database.
$ catmandu import -v OAI --url http://biblio.ugent.be/oai to MongoDB --database_name oai_data
In the example above a copy of the institutional repository of Ghent University was loaded into a MongoDB database. Use the option -v to see a progress report.
Before the data is imported a Fix can be applied to extract fields or transform fields before they are stored into the database. For instance, we can extract the publication year from a MARC import and store this as a separate year field:
$ catmandu import MARC to MongoDB --database_name marc_data --fix 'marc_map("008/7-10",year)' < data.mrc
The export command is used to retrieve data from a database. See the import command above for a list of databases that are supported.
For instance we can export all the MARC records we have imported with this command:
$ catmandu export MongoDB --database_name marc_data
In case we only need the title field from the marc records and want the results in a CSV format we can add some fixes:
$ catmandu export MongoDB --database_name marc_data to CSV --fix 'marc_map(245a,title); retain(title)'
Some databases support a query syntax to select the records to be exported. For instance, in the example above we extracted the year field from the MARC import. This can be used to only export the records of a particular year:
$ catmandu export MongoDB --database_name marc_data --query '{"year": "1971"}'
It is often handy to store the configuration options of importers, exporters and stores in a file. This allows you to create shorter, easier commands. To do this, a file ‘catmandu.yml’ needs to be created in your working directory with content like:
---
importer:
  ghent:
    package: OAI
    options:
      url: http://biblio.ugent.be/oai
      set: public
      handler: marcxml
      metadataPrefix: marc21
store:
  ghentdb:
    package: MongoDB
    options:
      database_name: oai_data
      default_bag: data
When this file is available, an OAI-PMH harvest could be done with the shortened command:
$ catmandu convert ghent
To store the ghent OAI-PMH import into the MongoDB database, one could write:
$ catmandu import ghent to ghentdb
To extract the data from the database, one can write:
$ catmandu export ghentdb
See the Command line client Cheat Sheet for more examples of command line commands.
To make better use of Catmandu it helps to first understand its core concepts:
Items are the basic unit of data processing in Catmandu. Items can be read, stored, and accessed in many formats. An item can be a MARC record or a RDF triple or one row in an Excel file.
Importers are used to read items. There are importers for MARC, JSON, YAML, CSV, Excel, and many other input formats. One can also import from remote sources such as SPARQL, Atom and OAI-PMH endpoints.
Exporters are used to transform items back into JSON, YAML, CSV, Excel or any format you like.
Stores are databases to store your data. With databases such as MongoDB and Elasticsearch it becomes really easy to store quite complicated, deeply nested items.
Fixes transform items: they reshape the data into any form you like. See Fix language and Fix packages for details.
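These concepts combine on the command line: an Importer reads items, Fixes transform them, and an Exporter or Store receives the result. A sketch, using commands explained later in this handbook:

$ catmandu import MARC --fix 'marc_map(245a,title)' to MongoDB --database_name test < data.mrc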
An item is the basic unit of data processing in Catmandu. Items are data structures built of key-value pairs (aka objects), lists (aka arrays), strings, numbers, and null values. All items can be expressed in JSON and YAML, among other formats.
Internally, all data processing in Catmandu uses a generic data format not unlike JSON. Whether one imports MARC, XML, Excel, OAI-PMH, SPARQL, data from a database or any other format, everything can be expressed as JSON.
For example:
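An illustrative item, expressed here as JSON (an equivalent YAML rendering would work just as well):

{
  "_id": "123",
  "title": "Catmandu in action",
  "authors": ["Jane Doe", "John Doe"],
  "publisher": { "name": "Example Press", "year": 2016 }
}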
To transform items with the Fix language one points to the fields in items with a JSONPath expression (Catmandu uses an extension of JSONPath actually). The fixes provided to a catmandu command operate on all individual items.
For instance, the command below will upcase the publisher field for every item (row) in the data.xls file:
$ catmandu convert XLS --fix 'upcase(publisher)' < data.xls
This command will select only the JSON items that contain ‘Tsjechov’ in a nested authors field:
$ catmandu convert JSON --fix 'select any_match(authors.*,"Tsjechov.*")' < data.json
This command will delete all the uppercase A characters from a Text file:
$ catmandu convert Text to Text --fix 'replace_all(A,"")' < data.txt
To see the internal representation of a MARC file in Catmandu, transform it for instance to YAML:
$ catmandu convert MARC to YAML < data.mrc
One will see that a MARC record is treated as an array of arrays for each item.
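For instance, a single MARC item could look like this in YAML (an illustrative sketch; the exact values depend on your input):

---
_id: 'fol05882032 '
record:
- - LDR
  - ~
  - ~
  - _
  - '00755cam  22002414a 4500'
- - '245'
  - '1'
  - '0'
  - a
  - 'Cross-platform Perl /'
  - c
  - 'Eric F. Johnson.'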
Importers are Catmandu packages to read a specific data format. Catmandu provides importers for MARC, JSON, YAML, CSV, Excel, and many other input formats. One can also import from remote sources for instance via protocols such as SPARQL and OAI-PMH.
The name of a Catmandu importer should be provided as first argument to the convert command.
Read JSON input:
$ catmandu convert JSON
Read YAML input
$ catmandu convert YAML
Read MARC input
$ catmandu convert MARC
Importers accept configurable options. E.g. you can use the --type argument of the MARC importer, where the following types are currently supported:

- USMARC (use ISO as an alias)
- MicroLIF
- MARCMaker
- Line (for line-oriented MARC)
- MiJ (for MARC-in-JSON)
- XML (for MARCXML)
- RAW
- Lint (for importing ISO and checking validity)
- ALEPHSEQ (for Aleph Sequential)

Read MARC-XML input:
$ catmandu convert MARC --type XML < marc.xml
Read Aleph sequential input
$ catmandu convert MARC --type ALEPHSEQ < marc.txt
Read more about the configuration options of importers by reading their manual pages:
$ catmandu help import JSON
$ catmandu help import YAML
Exporters are Catmandu packages to export data in a specific format. See Importers for the opposite action.
Some exporters, such as JSON and YAML, can handle any type of input. It doesn’t matter how the input is structured: it is always possible to create a JSON or YAML file.
Exporters are given after the to argument of the convert command:
$ catmandu convert OAI --url http://biblio.ugent.be/oai to JSON
$ catmandu convert MARC to JSON
$ catmandu convert XLS to JSON
For most exporters, however, the input data needs to be structured in a specific format. For instance, tabular formats such as Excel, CSV and TSV don’t allow for nested fields. In the example below, catmandu tries to convert a list into a simple value, which fails:
$ echo '{"colors":["red","green","blue"]}' | catmandu convert JSON to CSV
colors
ARRAY(0x7f8885a16a50)
The ARRAY(...) in the output indicates that the colors field is nested. To fix this, a transformation needs to be provided:
$ echo '{"colors":["red","green","blue"]}' | catmandu convert JSON to CSV --fix 'join_field(colors,",")'
colors
"red,green,blue"
MARC output needs an input in the Catmandu MARC format, RDF exports need the aREF format, etc.
Exporters also accept options to configure the various kinds of exports. For instance, JSON can be exported as an array, line by line, or pretty-printed:
$ catmandu convert MARC to JSON --array 1 < data.mrc
$ catmandu convert MARC to JSON --line_delimited 1 < data.mrc
$ catmandu convert MARC to JSON --pretty 1 < data.mrc
The Catmandu::Template package can be used to generate any type of structured output given an input using the Template Toolkit language.
For instance, to create a JSON array of colors an echo command can be used on Linux:
$ echo '{"colors":["red","green","blue"]}'
To transform this JSON into XML, the Template exporter can be used with a template file as a command line argument:
$ echo '{"colors":["red","green","blue"]}' | catmandu convert JSON to Template --template `pwd`/xml.tt
with xml.tt containing:
<colors>
[% FOREACH c IN colors %]
<color>[% c %]</color>
[% END %]
</colors>
will produce:
<colors>
<color>red</color>
<color>green</color>
<color>blue</color>
</colors>
Consult the manual pages of catmandu to see the output options of the different Exporters:
$ catmandu help export JSON
$ catmandu help export YAML
$ catmandu help export CSV
Stores are Catmandu packages to store Catmandu items in a database. These databases need to be installed separately from Catmandu. Databases such as MongoDB, Elasticsearch and CouchDB work out-of-the-box with hardly any configuration. For other databases such as Solr, MySQL, Postgres and Oracle, extra configuration steps are needed to define the database schemas.
Catmandu stores such as MongoDB, ElasticSearch and CouchDB can accept any type of input. They are perfect tools to store the output of data conversions.
Without defining any database schema, JSON, YAML, MARC, Excel, CSV, OAI-PMH or any other Catmandu supported format can be stored.
$ catmandu import JSON to MongoDB --database_name test < data.json
$ catmandu import YAML to MongoDB --database_name test < data.yml
$ catmandu import MARC to MongoDB --database_name test < data.mrc
$ catmandu import XLS to MongoDB --database_name test < data.xls
Many Catmandu stores can be queried with their native query language:
$ catmandu export MongoDB --database_name test --query '{"my.deep.field":"abc"}'
To delete data from a store the delete command can be used.
# Delete everything
$ catmandu delete MongoDB --database_name test
# Delete record with _id = 1234 and _id = 1235
$ catmandu delete MongoDB --database_name test --id 1234 --id 1235
Use the count command to show the size of a database.
$ catmandu count MongoDB --database_name test
One important use-case for Catmandu is indexing data in search engines such as Solr. To do this, Solr needs to be configured with the fields you want to make searchable. Your data collection can then be indexed in Solr by mapping the fields in your data to the fields available in Solr.
$ catmandu import MARC to Solr --fix marc2solr.fix < data.mrc
where marc2solr.fix is a Fix script containing all the fixes required to transform your input data into the Solr format:
# marc2solr.fix
marc_map('008_/7-10','year')
marc_map('020a','isbn.$append')
marc_map('022a','issn.$append')
marc_map('245a','title_short')
...
In reality the Fix script will contain many mappings and data transformations to clean data. See Example Fix Script for a long example of such a data cleaning in action.
Stores are Catmandu packages to store Catmandu Items in a database. A FileStore is a Store where you can store binary content (unstructured data). Out of the box, one FileStore implementation is provided: File::Simple which stores files in a directory structure on the local file system.
The command below stores the /tmp/myfile.txt in the File::Simple FileStore in the “container” 1234 with the file identifier myfile.txt:
$ catmandu stream /tmp/myfile.txt to File::Simple --root t/data --bag 1234 --id myfile.txt
The root parameter is mandatory for the File::Simple FileStore. It defines the location where all stored files are written. The other two parameters, bag and id, are mandatory for every FileStore (see below).
To extract a file from a FileStore, the stream command can be used in the opposite direction:
$ catmandu stream File::Simple --root t/data --bag 1234 --id myfile.txt to /tmp/myfile.txt
From the File::Simple store the file myfile.txt is extracted from the container with identifier 1234.
Every FileStore inherits the functionality of a Store. In this way the drop and delete commands can be used to delete data from a FileStore:
# Delete a "file"
$ catmandu delete File::Simple --root t/data --bag 1234 --id myfile.txt
# Delete a "folder"
$ catmandu drop File::Simple --root t/data --bag 1234
A FileStore contains one or more Bags. These Bags are containers (or “folders”) that store zero or more files. The name of such a container, indicated with the bag option in the Catmandu commands, is an identifier. In the case of File::Simple this identifier needs to be a number, or, when setting the uuid option, a UUID identifier.
The binary data (files) stored in these Bags also needs an identifier, indicated with the id option. Usually the file name is a good choice to use.
Both the bag option and the id option are required when uploading or streaming data from a FileStore.
Within a FileStore Bag there is no deeper hierarchy possible. A Bag contains a flat list of files. To store deeply nested folders and files, mechanisms such as ZIP files need to be created and imported.
$ zip -r /tmp/files.zip /mnt/data/files
$ catmandu stream /tmp/files.zip to File::Simple --root t/data --bag 1234 --id files.zip
Every FileStore has a default Bag called index, which contains a list of all available Bags in the store (like the listing of all folders). Using the export command a listing of bags can be requested from the FileStore:
$ catmandu export File::Simple --root t/data to YAML
To retrieve a listing of all files stored in a bag, the bag option needs to be provided:
$ catmandu export File::Simple --root t/data --bag 1234 to YAML
Each Bag (“container”) in a FileStore contains at least the _id as metadata. Some FileStores may contain more metadata. To retrieve a listing of all containers, use the export command on the FileStore:
$ catmandu export File::Simple --root t/data
[{"_id":"1234"},{"_id":"1235"},{"_id":"1236"}]
Every “file” in a FileStore contains at least the following fields:

- _id: the name of the file
- _stream: a callback function to download the contents of the file (pass it an IO::Handle)
- created: the creation date and time of the file as a UNIX timestamp
- modified: the last modification date and time of the file as a UNIX timestamp
- content_type: the content type of the file
- size: the file size in bytes
- md5: an MD5 checksum if the FileStore supports it, or an empty string

NOTE: Not every exporter can serialise the code reference in the _stream field. For instance, when exporting to JSON this error message will show up:
$ catmandu export File::Simple --root t/data --bag 1234
Oops! encountered CODE(0x7f99685f4390), but JSON can only represent references to arrays or hashes at /Users/hochsten/.plenv/versions/5.24.0/lib/perl5/site_perl/5.24.0/Catmandu/Exporter/JSON.pm line 36.
This field can be removed from the output using the remove_field fix:
$ catmandu export File::Simple --root t/data --bag 1234 --fix 'remove_field(_stream)'
[{"_id":"files.pdf","content_type":"application/pdf","modified":1498122646,"md5":"","size":883202,"created":1498122646}]
Always use the stream command in Catmandu to extract files from a FileStore:
$ catmandu stream File::Simple --root t/data --bag 1234 --id 'files.pdf' > output.pdf
As for Stores, the configuration parameters for a FileStore can be written in a catmandu.yml configuration file. In this way the Catmandu commands can be shortened:
$ cat catmandu.yml
---
store:
  files:
    package: File::Simple
    options:
      root: t/data
# Get a "directory" listing
$ catmandu export files to YAML
# Get a "file" listing
$ catmandu export files --bag 1234 to YAML
# Add a file
$ catmandu stream /tmp/myfile.txt to files --bag 1234 --id myfile.txt
# Download a file
$ catmandu stream files --bag 1234 --id myfile.txt to /tmp/myfile.txt
Fixes are used for easy data transformations by non-programmers. Using the small Fix language, non-programmers can manipulate Catmandu items.
To introduce the capabilities of Fix, an example is provided below that extracts data from a MARC input.
First, make sure that Catmandu::MARC is installed on your system.
$ sudo cpanm Catmandu::MARC
We will use the Catmandu command line client to extract data from an example USMARC file that can be downloaded via this link: camel.usmarc.
With the convert command one can read items from a MARC Importer and convert them into a new format. By default, convert will output JSON:
$ catmandu convert MARC < camel.usmarc
{"record":[["LDR",null,null,"_","00755cam 22002414a 4500"],["001",null,null...
...
["650"," ","0","a","Cross-platform software development."]],"_id":"fol05882032 "}
You can make this conversion explicit:
$ catmandu convert MARC to JSON < camel.usmarc
To transform this MARC data we first will create a Fix file which contains all the Fix commands we will use. Create a text file ‘fixes.txt’ on your system with this input:
remove_field('record');
and execute the following command:
$ catmandu convert MARC --fix fixes.txt < camel.usmarc
{"_id":"fol05731351 "}
{"_id":"fol05754809 "}
{"_id":"fol05843555 "}
{"_id":"fol05843579 "}
We have removed the field ‘record’ (containing the MARC data) from the JSON record. This is what the ‘remove_field’ Fix does: remove one field from a JSON record. We will use this remove_field(‘record’) to make our output a bit more terse and easier to read.
With the ‘marc_map’ Fix from the Catmandu::MARC package we can extract MARC (sub)fields from the record. Add these to the fixes.txt file:
marc_map('245','title');
remove_field('record');
When we run our previous catmandu command we get the following output:
$ catmandu convert MARC --fix fixes.txt to JSON --line_delimited 1 < camel.usmarc
{"_id":"fol05731351 ","title":"ActivePerl with ASP and ADO /Tobias Martinsson."}
{"_id":"fol05754809 ","title":"Programming the Perl DBI /Alligator Descartes and Tim Bunce."}
{"_id":"fol05843555 ","title":"Perl :programmer's reference /Martin C. Brown."}
We know that in the 650-a field of MARC we can find subjects. Let’s add them to the fixes.txt:
marc_map('245','title');
marc_map('650a','subject');
remove_field('record');
and run the command again:
$ catmandu convert MARC --fix fixes.txt to JSON --line_delimited 1 < camel.usmarc
{"subject":"Perl (Computer program language)","_id":"fol05731351 ","title":"ActivePerl with ASP and ADO /Tobias Martinsson."}
{"subject":"Perl (Computer program language)Database management.","_id":"fol05754809 ","title":"Programming the Perl DBI /Alligator Descartes and Tim Bunce."}
{"subject":"Perl (Computer program language)","_id":"fol05843555 ","title":"Perl :programmer's reference /Martin C. Brown."}
The MARC 008 field from position 7 to 10 contains publication years. We can also add these to the ‘fixes.txt’ file:
marc_map('245','title');
marc_map('650a','subject');
marc_map('008/7-10','year');
remove_field('record');
and run the command:
$ catmandu convert MARC --fix fixes.txt to JSON --line_delimited 1 < camel.usmarc
{"subject":"Perl (Computer program language)","_id":"fol05731351 ","title":"ActivePerl with ASP and ADO /Tobias Martinsson.","year":"2000"}
{"subject":"Perl (Computer program language)Database management.","_id":"fol05754809 ","title":"Programming the Perl DBI /Alligator Descartes and Tim Bunce.","year":"2000"}
{"subject":"Perl (Computer program language)","_id":"fol05843555 ","title":"Perl :programmer's reference /Martin C. Brown.","year":"1999"}
You don’t need to write fixes into a file to use them. E.g. if we want to have some statistics on the publication years in the camel.usmarc file we can do something like:
$ catmandu convert MARC --fix "marc_map('008/7-10','year'); retain('year')" to CSV < camel.usmarc
year
2000
2000
1999
...
With marc_map we extracted the year from the 008 field. With retain we deleted everything in the output except the field ‘year’. We used the CSV Exporter to present the results in an easy format.
Catmandu comes with a small domain-specific language for the manipulation of data items, called Fix. The Fix language consists of paths, functions, selectors, conditionals and binds, which are described below.
Almost any transformation of a Catmandu item involves a path to the part of the item that needs to be changed. To upcase the title field in an item, the Fix upcase needs to be used:
upcase(title)
A field can be nested in key-value pairs (objects). To access a field deep in a key-value pair, the dot notation should be used:
upcase(my.deep.nested.title)
If a part of an item contains a list of fields, then the index notation should be used. Use index 0 to point to the first item in a list, index 1 to point to the second item, index 2 to the third, etc.
upcase(my.data.2.title) # upcase the title of the 3rd item in the my.data list
For example, given this YAML input:
---
title: My Little Pony
my:
  colors:
    - red
    - green
    - blue
  nested:
    a:
      b:
        c: Hoi!
The value ‘My Little Pony’ can be accessed using the path:
title
The value ‘green’ can be accessed using the path:
my.colors.1
The value ‘Hoi!’ can be accessed using the path:
my.nested.a.b.c
Wildcards are used to point to relative positions or many positions in a list.
To point to the first item in a list (e.g. the value ‘red’ in the example above) the wildcard $first can be used:
my.colors.$first
To point to the last item in a list (e.g. the value ‘blue’ in the example above) the wildcard $last can be used:
my.colors.$last
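These wildcards can be used in any Fix function that accepts a path. For instance:

upcase(my.colors.$first) # upcase 'red'
upcase(my.colors.$last)  # upcase 'blue'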
In some cases, one needs to point to a position before the first item in a list. For instance, to add a new field before the color ‘red’ in our example above, the wildcard ‘$prepend’ should be used:
my.colors.$prepend
This wildcard can be used in functions like set_field:
set_field(my.colors.$prepend,'pink')
To add a new field at the end of a list (after the color ‘blue’), the wildcard ‘$append’ should be used:
my.colors.$append
As in:
set_field(my.colors.$append,'yellow')
The star notation is used to point to all the items in a list:
my.colors.*
To upcase all the colors use:
upcase(my.colors.*)
When lists are nested inside lists, then wildcards can also be nested:
my.*.colors.*
The above trick can be used when the my field contains a list, where each entry contains a colors field that again contains a list of data. E.g.
---
my:
  - colors:
      - red
      - blue
  - colors:
      - yellow
      - green
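For instance, to upcase every color in both nested lists one can write:

upcase(my.*.colors.*)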
For some data formats it can be quite difficult to extract data by the exact position of a field. In data formats such as MARC, one is usually not interested in a field at the 17th position which contains a subfield in position 3. MARC contains tags and subfields, which can be at any position in the MARC record.
Specialized Fix functions for MARC, MAB and PICA make it easier to access data by changing the Path syntax. For instance, to copy the 245a field in a MARC record to the title field one can write:
marc_map("245a",title)
In the context of a marc_map Fix the “245a” Path is a MARC Path that points to a part of the MARC record. These MARC Paths only work in MARC Fixes (marc_map, marc_add, marc_set, marc_remove). It is not possible to use these paths in other Catmandu fix functions:
marc_map("245a",title) # This will work
copy_field("246a","other_title") # This will NOT work
Consult the documentation of the different specialised packages for the Path syntax that can be used.
Fix functions manipulate fields in every item read by a Catmandu Importer. For instance, using the command below, the title field will be upcased for every item in the input stream of JSON items.
$ catmandu convert JSON --fix 'upcase(title)' < data.json
Fix functions can have zero or more arguments separated by commas:
vacuum() # Clean all empty fields in a record
upcase(title) # Upcase the title value
append(title,"-123") # Add -123 at the end of the title value
The arguments to a Fix function can be a Fix path or a literal string. Literal strings can be quoted with double or single quotes.
append(title,"-123")
append(title,'foo bar')
In case of single quotes all the characters between quotes will be interpreted verbatim. When using double quotes, the values in quotes can be interpreted by some Fix functions.
replace_all(title,"My (.*) Pony","Our $1 Fish") # Replace 'My Little Pony' by 'Our Little Fish'
Some Fix functions accept zero or more options, which need to be specified as name: value pairs:
sort_field(tags, reverse:1) # Sort the tags field in reverse order
lookup("title","dict.csv", sep_char:'|',default:'NONE') # Lookup a title in a CSV file
Unless specified otherwise (such as in Binds), Fix functions are executed in the order given by the Fix script:
upcase(authors.*)
append(authors.*,"abc")
replace_all(authors.*,"a","AB")
In the example above all transformations on the field authors will be executed in the order given. For example when the field authors contains this list:
---
authors:
- John
- Mary
- Dave
The first fix will transform this list into:
---
authors:
- JOHN
- MARY
- DAVE
The second fix will append “abc” to all authors:
---
authors:
- JOHNabc
- MARYabc
- DAVEabc
The third fix will replace all “a”s with “AB”:
---
authors:
- JOHNABbc
- MARYABbc
- DAVEABbc
In some cases the ordering of transformations of items in a list matters. For instance, you may want to first run a sequence of transformations on all first items in a list, then a sequence of transformations on all second items, etc. To change this ordering of Fix functions, Binds need to be used.
For a nearly complete list of functions currently available in Catmandu, take a look at the Fixes Cheat Sheet.
With Fix selectors one can select which Catmandu items end up in an output stream. Use a selector to throw away the records you are not interested in. For instance, to filter out all the records in an input use the reject() selector:
$ catmandu convert MARC to YAML --fix "reject()" < data.mrc
The command above will generate no output: every record is rejected. The opposite of reject() is the select() selector which can be used to select all the Catmandu items you want to keep in an output:
$ catmandu MARC to YAML --fix "select()" < data.mrc
The command above will return all the MARC items in the input file.
Selectors are of little use when used in isolation. Most of the time they are combined with Conditionals. To select only the MARC records that have “Tsjechov” in the 100a field one can write:
$ catmandu MARC to YAML --fix "select marc_match(100a,'.*Tsjechov.*') " < data.mrc
There are two alternative ways to combine selector with a conditional. Using the guard syntax, the conditional is written after the selector:
reject exits(error.field)
reject all_match(publisher,'xyz')
select any_match(years,2005)
Using the if/then/else syntax the conditional is written explicitly:
if exists(error.field)
reject()
end
if all_match(publisher,'xyz')
reject()
end
A Conditional is executed depending on a boolean condition that can be true or false. For instance, to skip a Catmandu item when the field error exists one would write the conditional exists:
if exists(error)
reject()
end
A condition contains an if or unless statement with a Conditional (a Fix function which can be true or false), a body of zero or more Fix functions, and an optional elsif or else clause:
if exists(error)
# Write here all the Fix functions when the field 'error' exists
end
unless exists(error)
# Write here all the Fix functions when the field 'error' doesn't exist
end
if exists(error)
# If error exists then do this
elsif exists(warning)
# If warning exists then do this
else
# otherwise do this
end
Catmandu also supports a limited number of boolean operators:
exists(foo) and add_field(ok,1) # only execute add_field() when 'foo' exists
exists(foo) or add_field(error,1) # only execute add_field() when 'foo' doesn't exist
Below are some basic Conditionals that are implemented in Catmandu. Check the manual pages of the individual Catmandu extensions for more elaborate Conditionals.
True, when the path exists and is exactly equal to a value. When the path points to a list, then all the list members need to be equal to the value. False otherwise.
if all_equal(year,"2018")
set_field(published,"future")
end
if all_equal(animals.*,"cat")
set_field(animal_types,"feline")
end
True, when the path exists and is exactly equal to a value. When the path points to a list, then at least one of the list members needs to be equal to the value. False otherwise.
if any_equal(year,"2018")
set_field(published,"future")
end
if any_equal(animals.*,"cat")
set_field(animal_types,"some feline")
end
True, when the path exists and the value matches the regular expression regex. When the path points to a list, then all the values have to match the regular expression. False otherwise.
if all_match(year,"^19.*$")
set_field(period,"20th century")
end
if all_match(publishers.*,"Elsevier.*")
set_field(is_elsevier,1)
end
True, when the path exists and the value matches the regular expression regex. When the path points to a list, then at least one of the values has to match the regular expression. False otherwise.
if any_match(year,"^19.*$")
set_field(period,"20th century")
end
if any_match(publishers.*,"Elsevier.*")
set_field(some_elsevier,1)
end
True, when the path exists in the Catmandu item. False otherwise.
if exists(my.deep.field)
end
if exists(my.list.0)
end
True, when the path exists and the value is greater than a number (greater_than). When the path points to a list, then all the members need to be greater than the number. False otherwise.
True, when the path exists and the value is less than a number (less_than). When the path points to a list, then all the members need to be less than the number. False otherwise.
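For example (these examples mirror the Fix cheat sheet later in this handbook):

if greater_than('year',2000)
  add_field('recent','yes')
end

if less_than('year',1970)
  add_field('ancient','yes')
end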
True, when the values of the first path1 are contained in the values at the second path2. False otherwise.
For instance to check if two paths contain the same values type:
if in(my.title,your.title)
set_field(same,1)
end
To check if a value in one path is contained in a list at another path, type:
if in(my.author,your.authors.*)
set_field(known_author,1)
end
True, if the value at path can be evaluated as boolean true (is_true). False otherwise.
True, if the value at path can be evaluated as boolean false (is_false). False otherwise.
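For example (mirroring the Fix cheat sheet later in this handbook):

if is_true(data.*.has_error)
  add_field(error,yes)
end

if is_false(data.*.has_error)
  add_field(error,no)
end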
Binds change the execution context of a Fix script. In normal operation, all Fix functions are executed from the first to the last. For example given the YAML input:
---
colors:
- red
- green
- blue
every Fix function will be executed one by one on all the colors:
upcase(colors.*)
append(colors.*," is a nice color")
copy_field(colors.*,result.$append)
The first Fix upcase will uppercase all the colors, the second append will add ” is a nice color” to all the colors, the last copy_field will copy all the colors to a new field.
But what should you do when you want the three Fix functions to operate on each color separately? First upcase on the first color, append on the first color, copy_field on the first color, then again upcase on the second color, append on the second color, etc.
For this type of operation a Bind is needed using the do notation:
do list(path:colors, var:c)
upcase(c)
append(c," is a nice color")
copy_field(c,result.$append)
end
In the example above the list Bind was introduced. The context of the execution of the Bind body is changed. Instead of operating on one Catmandu item as a whole, the Fix functions are executed for each element in the list.
These Binds can also work on hash (object) inputs. An example is the each Bind. Given:

demo:
  nl: 'Tuin der lusten'
  en: 'The Garden of Earthly Delights'
When we want to have a titles field with all the values of demo collected, we can’t use the list Bind (because it works on arrays) but need to use the each Bind:
do each(path: demo, var: t)
copy_field(t.value, titles.$append)
end
The result will be:
titles:
- 'Tuin der lusten'
- 'The Garden of Earthly Delights'
Each Bind changes the execution context in some way. For instance, Fix functions could execute queries against a database, or fetch data from the internet. These operations can fail when the database is down or the website can’t be reached. What should happen in that case in a Fix script? Should the execution be stopped? Or should these errors be ignored?
my_fix1()
my_fix2()
download_from_internet() # <--- this one fails
process_results()
What should happen in the example above? Should the results be processed when the download_from_internet fails? Using the maybe Bind one can skip Fix functions that fail:
do maybe()
my_fix1()
my_fix2()
download_from_internet()
process_results() # <--- this is skipped when download_from_internet fails
end
Binds are also used when creating Fix executables. These are Fix scripts that can be run directly from the command line. In the example below we’ll write a Fix script that downloads data from an OAI-PMH repository and prints all the record identifiers:
#!/usr/bin/env catmandu run
do importer(OAI,url: "http://lib.ugent.be/oai")
retain(_id)
add_to_exporter(.,YAML)
end
If this script is stored on a file system as myscript.fix and made executable:
$ chmod 755 myscript.fix
then you can run this script as any other Unix command:
$ ./myscript.fix
This cheat sheet summarizes the command line client capabilities.
$ catmandu help
$ catmandu help convert
Convert one data format to another, optionally providing a Fix script to transform the data
$ catmandu convert MARC to JSON < records.mrc
$ catmandu convert MARC to YAML < records.mrc
$ catmandu convert MARC to JSON --pretty 1 < records.mrc
$ catmandu convert MARC to JSON --fix 'marc_map("245","title");remove_field("record")' < records.mrc
$ catmandu convert MARC to CSV --fix myfixes.fix < records.mrc
$ cat myfixes.fix
marc_map("245","title")
remove_field("record")
$ catmandu convert MARC to CSV --fix myfixes2.fix --var source="Springer" < records.mrc
$ cat myfixes2.fix
add_field("source","{{source}")
marc_map("245","title")
remove_field("record")
$ catmandu convert OAI --url http://biblio.ugent.be/oai --set allFtxt to JSON
$ catmandu convert OAI --url http://biblio.ugent.be/oai --set allFtxt to JSON --fix 'retain("title")'
$ catmandu convert SRU --base http://www.unicat.be/sru --query dna
$ catmandu convert ArXiv --query 'all:electron'
$ catmandu convert PubMed --term 'hochstenbach'
$ cat test.tt
[%- FOREACH f IN record %]
[% _id %] [% f.shift %][% f.shift %][% f.shift %][% f.join(":") %]
[%- END %]
$ catmandu convert MARC to Template --template `pwd`/test.tt < records.mrc
Store data in a (NoSQL) database and export it out again
$ catmandu import JSON to MongoDB --database_name mydb --bag data < records.json
$ catmandu import MARC to MongoDB --database_name mydb --bag data < records.mrc
$ catmandu import MARC to ElasticSearch --index_name mydb --bag data < records.mrc
$ catmandu import MARC to ElasticSearch --index_name mydb --bag data --fix 'marc_map("245a","title")' < records.mrc
$ catmandu export MongoDB --database_name mydb --bag data to JSON
$ catmandu export MongoDB --database_name mydb --bag data to JSON --fix 'retain("_id")'
$ catmandu export Solr --url http://localhost:8983/solr to JSON
$ catmandu export ElasticSearch --index_name mydb to JSON
Copy data from one database to another
$ catmandu copy MongoDB --database_name items --bag book to ElasticSearch --index_name items --bag book
Count the number of items in a store
$ catmandu count ElasticSearch --index-name shop --bag products --query 'brand:Acme'
Delete data from a store
# delete items with matching _id
$ catmandu delete ElasticSearch --index-name items --bag book --id 1234 --id 2345
# delete items matching the query
$ catmandu delete ElasticSearch --index-name items --bag book --query 'title:"My Rabbit"'
# delete all items
$ catmandu delete ElasticSearch --index-name items --bag book
$ cat catmandu.yml
---
store:
  test1:
    package: MongoDB
    options:
      database_name: mydb
  test2:
    package: ElasticSearch
    options:
      index_name: mydb
  test3:
    package: Solr
    options:
      url: http://localhost:8983/solr
$ catmandu import JSON to test1 < records.json # Mongo
$ catmandu import MARC to test2 < records.mrc # ElasticSearch
$ catmandu import YAML to test3 < records.yaml # Solr
$ catmandu export test1 to JSON # Mongo
$ catmandu export test2 to JSON # ElasticSearch
$ catmandu export test3 # Solr
$ cat fixes.txt
marc_map("245a","title");
marc_map("100","author.$append");
join_field("author",";");
marc_map("008_/10-13","language");
$ catmandu import MARC to test2 --fix fixes.txt
# Add a file to a FileStore
$ catmandu stream /tmp/myfile.txt to File::Simple --root t/data --bag 1234 --id myfile.txt
# Download a file from a FileStore
$ catmandu stream File::Simple --root t/data --bag 1234 --id myfile.txt to /tmp/output.txt
This cheat sheet summarizes the Fix language. For the marc_* methods, see the MARC mapping rules.
# Fixes clean your data. As input you get a Perl HASH. Each fix function is a command
# to transform the Perl HASH. Some fixes such as marc_map contain logic to transform
# complex data structures such as MARC.
set_field("my.name","patrick") # { my => { name => 'Patrick'} }
add_field("my.name2","nicolas")
move_field("my.name","your.name")
copy_field("your.name","your.name2")
remove_field("your.name")
# Replace in all the field names in 'foo' all dots into underscores
rename(foo,"\.","_")
set_array("foo") # Create an empty array foo => []
set_array("foo","a","b","c") # Create an array with three values foo => ['a','b','c']
set_hash("foo") # Create an empty hash foo => {}
set_hash("foo",a: b,c: d) # Create an hash with two values foo => { a => 'b' , c => 'd' }
array("foo") # Create an array from a hash :
# foo => {"name":"value"} => [ "name" , "value" ]
hash("foo") # Create a hash from an array
# foo => [ "name" , "value" ] => {"name":"value"}
assoc(fields, pairs.*.key, pairs.*.val) # Associate two values as a hash key and value
# {pairs => [{key => 'year', val => 2009}, {key => 'subject', val => 'Perl'}]}
# {fields => {subject => 'Perl', year => 2009}, pairs => [...]}
upcase("title") # marc -> MARC
downcase("title") # MARC -> marc
capitalize("my.deeply.nested.field.0") # marc -> Marc
trim("field_with_spaces") # " marc " -> marc
substring("title",0,1) # marc -> m
prepend("title","die ") # marc -> die marc
append("title"," must die") # marc -> marc must die
# {author => "tom jones"} -> {author => "senoj mot"}
reverse(author)
# {numbers => [1,14,2]} -> {numbers => [2,14,1]}
reverse(numbers)
# replace the value with a formatted (sprintf-like) version
# e.g. numbers:
# - 41
# - 15
format(numbers,"%-10.10d %-5.5d") # numbers => "0000000041 00015"
# e.g. hash:
# name: Albert
format(hash,"%-10s: %s") # hash: "name : Albert"
# parses a text into an array or hash of values
# date: "2015-03-07"
parse_text(date, '(\d\d\d\d)-(\d\d)-(\d\d)')
# date:
# - 2015
# - 03
# - 07
# If you data record is:
# a: eeny
# b: meeny
# c: miny
# d: moe
paste(my.string,a,b,c,d) # my.string: eeny meeny miny moe
# Use a join character
paste(my.string,a,b,c,d,join_char:", ") # my.string: eeny, meeny, miny, moe
# Paste literal strings with a tilde sign
paste(my.string,~Hi,a,~how are you?) # my.string: Hi eeny how are you?
# date: "2015-03-07"
parse_text(date, '(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)')
# date:
# year: "2015"
# month: "03"
# day: "07"
# date: "abcd"
parse_text(date, '(\d\d\d\d)-(\d\d)-(\d\d)')
# date: "abcd"
lookup("title","dict.csv", sep_char:'|') # lookup 'marc' in dict.csv and replace the value
lookup("title","dict.csv", default:test) # lookup 'marc' in dict.csv and replace the value or set it to 'test'
lookup("title","dict.csv", delete:1) # lookup 'marc' in dict.csv and replace the value or delete nothing found
lookup_in_store('title', 'MongoDB', database_name:lookups) # lookup the (id)-value of title in 'lookups' and
# replace it with the data found
lookup_in_store('title', 'MongoDB', default:'default value' , delete:1)
# Query a Solr index with the query stored in the 'query' field and overwrite it with all the results
search_in_store('query','Solr',url:"http://localhost:8983/solr",limit:10)
# Replace the data in foo.bar with an external file or url
import(foo.bar, JSON, file: "http://foo.com/bar.json", data_path: data.*)
add_to_store('authors.*', 'MongoDB', bag:authors, database_name:catalog) # add matching values to a store as a side effect
add_to_exporter(data,CSV,header:1,file:/tmp/data.csv) # send the 'data' path to an alternative exporter
add_to_exporter(.,CSV,header:1,file:/tmp/data.csv) # send the complete record to an alternative exporter
count("myarray") # count number of elements in an array or hash
sum("numbers") # replace an array element with the sum of its values
sort_field("tags") # sort the values of an array
sort_field("tags", uniq:1) # sort the values plus keep unique values
sort_field("tags", reverse:1) # revese sort
sort_field("tags", numeric:1) # sort numerical values
uniq(tags) # strip duplicate values from an array
filter("tags","[Cc]at") # filter array values tags = ["Cats","Dogs"] => ["Cats"]
flatten(deep) # {deep => [1, [2, 3], 4, [5, [6, 7]]]} => {deep => [1, 2, 3, 4, 5, 6, 7]}
cmd("java MyClass") # Use an external program that can read JSON
# from stdin and write JSON to stdout
perlcode("myscript.pl") # Execute Perl code as fix function
sleep(1,SECOND) # Do nothing for one second
split_field("foo",":") # marc:must:die -> ['marc','must','die']
join_field("foo",":") # ['marc','must','die'] -> marc:must:die
retain("id","id2","id3") # delete any field except 'id', 'id2', 'id3'
replace_all("title","a","x") # marc -> mxrc
# Most functions can also work on arrays. E.g.
replace_all("author.*","a","x") # [ 'marc','jan'] => ['mxrc','jxn']
# Use:
# authors.$last (last entry)
# authors.$first (first entry)
# authors.$append (last + 1)
# authors.$prepend (first - 1)
# authors.* (all authors)
# authors.2 (3rd author)
collapse() # collapse deep nested hash to a flat hash
expand() # expand flat hash to deep nested hash
clone() # clone the perl hash and work on the clone
reject() # Reject (skip) a record
reject [condition] # Reject a record on some condition:
# reject all_match(...)
# reject any_match(...)
# reject exists(...)
select() # Select a record
select [condition] # Select only those records that match a condition (see reject)
to_json('my.field') # convert a value of a field to json
from_json('my.field') # replace the json field with the parsed value
export_to_string('my.field',CSV,sep_char:";") # convert the value of a field into CSV
import_from_string('my.field',CSV,sep_char:";") # replace a CSV field with the parsed value
error("eek!") # abort the processing and say "eek!"
nothing() # do nothing (used in benchmarking)
# Include fixes from another file
include('/path/to/myfixes.txt')
# Send debug messages to a logger
log('test123')
log('hello world' , level: 'DEBUG')
# Boolean AND and OR, need a Condition + 'and'/'or' + a Fix
exists(foo) and log('foo exists' , level: INFO)
exists(foo) or log('foo doesnt exist' , level: INFO)
valid('', JSONSchema, schema: "my/schema.json") or log('this record is wrong', level: ERROR)
# 'caf%C3%A9' => 'café'
uri_decode(place)
# 'café' => 'caf%C3%A9'
uri_encode(place)
# Add a new field 'foo' with a random value between 0 and 9
random(foo, 10)
# Delete all the empty fields
vacuum()
# Copy all 245 subfields into the my.title hash
marc_map('245','my.title')
# Copy the 245-$a$b$c subfields into the my.title hash in the order of the record
marc_map('245abc','my.title')
# Copy the 245-$c$b$a subfields into the my.title hash in the order of the mapping
marc_map('245cba','my.title' , pluck:1)
# Copy the 100 subfields into the my.authors array
marc_map('100','my.authors.$append')
# Add the 710 subfields into the my.authors array
marc_map('710','my.authors.$append')
# Copy the 600-$x subfields into the my.subjects array while packing each into a genre.text hash
marc_map('600x','my.subjects.$append.genre.text')
# Copy the 008 characters 35-35 into the my.language hash
marc_map('008_/35-35','my.language')
# Copy all the 600 fields into a my.stringy hash joining them by '; '
marc_map('600','my.stringy', join:'; ')
# When 024 field exists create the my.has024 hash with value 'found'
marc_map('024','my.has024', value:found)
# Do the same examples now with the marc fields in 'record2'
marc_map('245','my.title', record:record2)
# Remove the 900 fields
marc_remove('900')
# Add a marc field (in Catmandu::MARC 0.110)
marc_add('999', ind1, ' ' , ind2, '1' , a, 'test123')
# Add a marc field populated with data from your record
marc_add('245', a , $.my.title.field, c , $.my.author.field)
# Set a marc value of one (sub)field to a new value
marc_set('LDR/6','p')
marc_set('650p','test')
marc_set('100[3]a','Farquhar family.')
# Map all 650 subjects into an array
marc_map('650','subject', join:'###')
split_field('subject','###')
# uppercase the value of field 'foo' if all members of 'oogly' have the value 'doogly'
if all_match('oogly.*', 'doogly')
upcase('foo') # foo => 'BAR'
else
downcase('foo') # foo => 'bar'
end
# inverted
unless all_match('oogly.*', 'doogly')
upcase('foo') # foo => 'BAR'
end;
# uppercase the value of field 'foo' if field 'oogly' has the value 'doogly'
if any_match('oogly', 'doogly')
upcase('foo') # foo => 'BAR'
end
# inverted
unless any_match('oogly', 'doogly')
upcase('foo') # foo => 'BAR'
end
# uppercase the value of field 'foo' if the field 'oogly' exists
if exists('oogly')
upcase('foo') # foo => 'BAR'
end
# inverted
unless exists('oogly')
upcase('foo') # foo => 'bar'
end
# add a new field when the 'year' field is equal to 2018
if all_equal('year','2018')
add_field('my.funny.title','true')
end
# add a new field when at least one of the 'year'-s is equal to 2018
if any_equal('years.*','2018')
add_field('my.funny.title','true')
end
# compare things (needs Catmandu 0.92 or better)
if greater_than('year',2000)
add_field('recent','yes')
end
if less_than('year',1970)
add_field('ancient','yes')
end
# execute fixes if one path is contained in another
# foo => 1 , bar => [3,2,1] => in(foo,bar) -> true
if in(foo,bar)
add_field(test,ok)
end
# only execute fixes if all path values are the boolean true, 1 or "true"
if is_true(data.*.has_error)
add_field(error,yes)
end
# only execute fixes if all path values are the boolean false, 0 or "false"
if is_false(data.*.has_error)
add_field(error,no)
end
# only execute the fixes if the path contains an array
if is_array(data)
upcase(data.0)
end
# only execute the fixes if the path contains an object (a hash, i.e. a nested field)
if is_object(data)
add_field(data.ok,yes)
end
# only execute the fixes if the path contains a number
if is_number(data)
append(data," : is a number")
end
# only execute the fixes if the path contains a string
if is_string(data)
append(data," : is a string")
end
# only execute the fixes if the path contains 'null' values
if is_null(data)
set_field(data,"I'm empty!")
end
# Evaluates true when all marc (sub)fields match a regular expression
if marc_all_match('245','My funny title')
add_field('funny.title','yes')
end
if marc_all_match('LDR/6','c')
marc_set('LDR/6','p')
end
# Evaluates to true when at least one of the marc (sub)fields match a regular expression
if marc_any_match('650','catmandu')
add_field('important.books','yes')
end
# Evaluates true when the JSON fragment is valid against a JSON Schema
if valid(data,JSONSchema,schema:myschema.json)
...
end
## Binds (needs Catmandu 0.92 or better)
# The identity binder doesn't embody any computational strategy. It simply
# applies the bound fix functions sequentially to its input without any
# modification.
do identity()
add_field(foo,bar)
add_field(foo2,bar2)
end
# The maybe binder computes all the fix functions but ignores the remaining fixes
# once one throws an error or returns undef.
do maybe()
foo()
return_undef() # rest will be ignored
bar()
end
# List over all items in demo and add a foo => bar field
# { demo => [{},{},{}] } => { demo => [{foo=>bar},{foo=>bar},{foo=>bar}]}
do list(path: demo)
add_field(foo,bar)
end
# Print statistical information on the processing speed of fixes to standard error.
do benchmark(output:/dev/stderr)
foo()
end
# Find all ISBN in a stream
do hashmap(exporter: JSON, join:',')
# Need an identity binder to group all operations that calculate key/value pairs
do identity()
copy_field(isbn,key)
copy_field(_id,value)
end
end
# Count the number of ISBN occurrences in a stream
do hashmap(count: 1)
copy_field(isbn,key)
end
# Filter out an array (needs Catmandu 0.9302 or better)
# data:
# - name: patrick
# - name: nicolas
# to:
# data:
# - name: patrick
do with(path:data)
reject all_match(name,nicolas)
# Or:
# if all_match(name,nicolas)
# reject()
# end
end
# run fixes that should run within a time limit
do timeout(time => 5, units => seconds)
...
end
# a binder that computes Fix-es for every element in record
do visitor()
# upcase all the 'name' fields in the record
if all_match(key,name)
upcase(scalar)
end
end
# a binder that runs fixes on records from an importer
do importer(OAI,url: "http://lib.ugent.be/oai")
retain(_id)
add_to_exporter(.,YAML)
end
Here is an example Fix script, taken from a production system at Ghent University Library, that can be used for inspiration. It feeds data from a MongoDB store of MARC records into a Blacklight Solr installation.
#-
#- LLUDSS - Data cleaning fixes. Using MARC records as input
#-
#- 2013 Patrick.Hochstenbach@UGent.be
#-
copy_field('merge.source','source')
copy_field('merge.id','id')
set_field('is_deleted','false')
set_field('is_hidden','false')
copy_field('merge.hidden','is_hidden')
if exists('merge.related_desc')
copy_field('merge.related_desc','json.merge_related_desc')
end
if exists('merge.deleted')
set_field('is_deleted','true')
else
#- Document Type
unless exists('type')
marc_map('920a','type')
lookup("type", "/opt/lludss-import/etc/material_types.csv", default:"other")
end
#- ISBN/ISSN
marc_map('020a','isbn.$append', join:'==')
marc_map('022a','issn.$append', join:'==')
join_field('isbn','==')
split_field('isbn','==')
join_field('issn','==')
split_field('issn','==')
replace_all('isbn.*','^([0-9xX-]+).*$','$1')
replace_all('issn.*','^([0-9xX-]+).*','$1')
#- Title
marc_map('245ab','title', join:' ')
replace_all('title','\[(.*)\]','$1')
copy_field('title','title_sort')
replace_all('title_sort','\W+','')
substring('title_sort',0,50)
downcase('title_sort')
copy_field('title','json.title')
marc_map('246','json.title_remainder', join:' ')
marc_map('245a','title_short')
#- Author
marc_map('100ab','author.$append', join:' ')
marc_map('700ab','author.$append', join:' ')
unless all_match('type','phd|master|bachelor')
marc_map('720ab','author.$append', join:' ')
end
author_names()
copy_field('author','json.author')
#- Imprint
marc_map('008_/7-10','year')
if all_match('year','[u^?-]{4}')
remove_field('year')
end
replace_all('year','\D','0')
if greater_than('year','2018')
remove_field('year')
end
if marc_match('008_/6-6','b')
prepend('year','-')
end
#- Edition
marc_map('250a','json.edition')
#- Description
marc_map('300a','json.desc_extend')
#- Summary
marc_map('505a','json.summary.$append', join:"\n")
marc_map('520a','json.summary.$append', join:"\n")
#- If we have a dissertation, then 502 is the summary and 720 the promoter.
#- Such a record is then automatically a UGent publication
if all_match('type','phd|master')
marc_map('502a','summary.$append')
if exists('summary')
join_field('summary','')
move_field('summary','json.summary.$append')
end
add_field('only.$append','ugent')
end
unless exists('json.summary')
weave_by_id('summary')
if exists('_weave.summary.data.summary')
copy_field('_weave.summary.data.summary','json.summary.$append')
end
remove_field('_weave')
end
#- Boost
unless exists('_boost')
weave_by_id('boost')
if exists('_weave.boost.data.boost')
copy_field('_weave.boost.data.boost','_boost')
end
remove_field('_weave')
end
#- Language
marc_map('008_/35-37','lang')
if all_match('lang','\W+')
set_field('lang','und')
end
#- Subject
marc_map('6**^0123456789','subject.$append', join:' ')
replace_all('subject.*','\.$','')
sort_field('subject', uniq:1)
copy_field('subject','json.subject')
#- Library, Faculty, Location
marc_map('852c','library.$append')
sort_field('library', uniq:1)
marc_map('852x','faculty.$append')
sort_field('faculty', uniq:1)
marc_map('852j','location.$append')
sort_field('location', uniq:1)
#- Host publication
host_publication()
move_field('host_publication','json.host_publication.$append')
#- Holding
if exists('p_holding')
copy_field('p_holding','year')
replace_all('year',' .*','')
move_field('p_holding','json.p_holding')
move_field('p_holding_txt','json.host_publication.$append')
end
if exists('e_holding')
copy_field('e_holding','year')
replace_all('year',' .*','')
move_field('e_holding','json.e_holding')
move_field('e_holding_txt','json.host_publication.$append')
end
join_field('json.host_publication','<br>');
#- Year cleanup
replace_all('year','^(?<=-)?0+','')
unless all_match('year','^-?([0-9]|[123456789][0-9]+)$')
remove_field('year')
end
#- Wikipedia
weave_by_id('wikipedia')
copy_field('_weave.wikipedia.data.wikipedia_url','json.wikipedia_url')
remove_field('_weave')
#- Cover Image
if all_match('merge.source','rug01|pug01|ebk01')
weave_by_id('cover')
copy_field('_weave.cover.data.cover_remote','json.cover_remote')
remove_field('_weave')
end
#- Cover card-catalog
if exists(cid)
add_field('json.cover_remote.$append','http://search.ugent.be/meercat/x/stream?source=rug02&id=')
move_field('cid','json.cover_remote.$append')
join_field('json.cover_remote','')
end
#- Fulltext
fulltext()
move_field('fulltext','json.fulltext')
#- Remove record without items or fulltext
unless exists('items')
unless exists('json.fulltext')
set_field('is_deleted','true')
end
end
#- CATEGORY
if exists('json.fulltext')
add_field('only.$append','online')
end
if exists('items')
add_field('only.$append','print')
end
if all_match('merge.source','pug01')
add_field('only.$append','ugent')
end
sort_field("only", uniq:1, reverse:0)
#- ALL Field
all()
#- Identifier indexes rug01, ser01, ...
ids()
#- Set
marc_map('005','updated_at')
#- Warning: Aleph doesn't do zulu-time...
datetime_format('updated_at', time_zone:'Europe/Brussels', set_time_zone:'UTC', source_pattern: '%Y%m%d%H%M%S.%N', destination_pattern:'%Y-%m-%dT%H:%M:%SZ', delete:1)
add_field('is_oai','false')
if exists('updated_at')
add_field('set.$append','all')
set_field('is_oai','true')
end
sort_field('set', uniq:1)
#- MARC Display
marc_map('245','marc_display.$append.title', join:' ')
marc_map('246','marc_display.$append.other-title', join:' ')
marc_map('765','marc_display.$append.orig-title', join:' ')
marc_map('210','marc_display.$append.abbrev-title', join:' ')
marc_map('240','marc_display.$append.other-title', join:' ')
marc_map('020','marc_display.$append.isbn', join:' ')
marc_map('022','marc_display.$append.issn', join:' ')
marc_map('028','marc_display.$append.publisher-no', join:' ')
marc_map('048','marc_display.$append.voices-code', join:' ')
marc_map('100','marc_display.$append.author', join:' ')
marc_map('110','marc_display.$append.corp-author', join:' ')
marc_map('700','marc_display.$append.author', join:' ')
marc_map('720','marc_display.$append.other-name', join:' ')
marc_map('111','marc_display.$append.conference', join:' ')
marc_map('130','marc_display.$append.other-title', join:' ')
marc_map('250','marc_display.$append.edition', join:' ')
marc_map('255','marc_display.$append.scale', join:' ')
marc_map('256','marc_display.$append.edition', join:' ')
marc_map('260','marc_display.$append.publisher', join:' ')
marc_map('261','marc_display.$append.publisher', join:' ')
marc_map('263','marc_display.$append.publisher', join:' ')
marc_map('300','marc_display.$append.description', join:' ')
marc_map('310','marc_display.$append.frequency', join:' ')
marc_map('321','marc_display.$append.prior-freq', join:' ')
marc_map('340','marc_display.$append.description', join:' ')
marc_map('362','marc_display.$append.pub-history', join:' ')
marc_map('400','marc_display.$append.series', join:' ')
marc_map('410','marc_display.$append.series', join:' ')
marc_map('440','marc_display.$append.series', join:' ')
marc_map('490','marc_display.$append.series', join:' ')
marc_map('500','marc_display.$append.note', join:' ')
marc_map('501','marc_display.$append.note', join:' ')
marc_map('502','marc_display.$append.thesis', join:' ')
marc_map('504','marc_display.$append.bibliography', join:' ')
marc_map('505','marc_display.$append.content', join:' ')
marc_map('508','marc_display.$append.credits', join:' ')
marc_map('510','marc_display.$append.note', join:' ')
marc_map('511','marc_display.$append.performers', join:' ')
marc_map('515','marc_display.$append.note', join:' ')
marc_map('518','marc_display.$append.note', join:' ')
marc_map('520','marc_display.$append.summary', join:' ')
marc_map('521','marc_display.$append.note', join:' ')
marc_map('525','marc_display.$append.note', join:' ')
marc_map('530','marc_display.$append.note', join:' ')
marc_map('533','marc_display.$append.note', join:' ')
marc_map('534','marc_display.$append.note', join:' ')
marc_map('540','marc_display.$append.note', join:' ')
marc_map('541','marc_display.$append.note', join:' ')
marc_map('544','marc_display.$append.note', join:' ')
marc_map('545','marc_display.$append.note', join:' ')
marc_map('546','marc_display.$append.note', join:' ')
marc_map('550','marc_display.$append.note', join:' ')
marc_map('555','marc_display.$append.note', join:' ')
marc_map('561','marc_display.$append.note', join:' ')
marc_map('580','marc_display.$append.note', join:' ')
marc_map('581','marc_display.$append.publication', join:' ')
marc_map('583','marc_display.$append.note', join:' ')
marc_map('586','marc_display.$append.note', join:' ')
marc_map('591','marc_display.$append.note', join:' ')
marc_map('598','marc_display.$append.classification', join:' ')
marc_map('080','marc_display.$append.udc-no', join:' ')
marc_map('082','marc_display.$append.dewey-no', join:' ')
marc_map('084','marc_display.$append.other-call-no', join:' ')
marc_map('600','marc_display.$append.subject', join:' ')
marc_map('610','marc_display.$append.subject', join:' ')
marc_map('611','marc_display.$append.subject', join:' ')
marc_map('630','marc_display.$append.subject', join:' ')
marc_map('650','marc_display.$append.subject', join:' ')
marc_map('651','marc_display.$append.subject', join:' ')
marc_map('653','marc_display.$append.subject', join:' ')
marc_map('655','marc_display.$append.subject', join:' ')
marc_map('662','marc_display.$append.subject', join:' ')
marc_map('690','marc_display.$append.subject', join:' ')
marc_map('692','marc_display.$append.subject', join:' ')
marc_map('693','marc_display.$append.subject', join:' ')
marc_map('710','marc_display.$append.corp-author', join:' ')
marc_map('711','marc_display.$append.conference', join:' ')
marc_map('730','marc_display.$append.other-title', join:' ')
marc_map('749','marc_display.$append.title-local', join:' ')
marc_map('752','marc_display.$append.other-info', join:' ')
marc_map('753','marc_display.$append.other-info', join:' ')
marc_map('772','marc_display.$append.parent-rec-ent', join:' ')
marc_map('776','marc_display.$append.add-phys-form-e', join:' ')
marc_map('777','marc_display.$append.issu-with-entry', join:' ')
marc_map('780','marc_display.$append.preceding-entry', join:' ')
marc_map('785','marc_display.$append.succeed-entry', join:' ')
marc_map('LKR','marc_display.$append.note', join:' ')
marc_map('024','marc_display.$append.object-id', join:' ')
marc_map('856','marc_display.$append.e-location', join:' ')
#-if_all_match('merge.source','ser01')
#- marc_map('852jhaz','marc_display.$append.location', join:' | ')
#-end
#-if_all_match('merge.source','rug01')
#- marc_map('Z303haz','marc_display.$append.location', join:' | ')
#-end
to_json('marc_display')
#- Europeana Magic
europeana()
#- MARCXML
marc_xml('record')
move_field('record','fXML')
end
#- JSON
to_json('json')
add_field('_bag','data')
remove_field('record')
remove_field('merge')
remove_field('version')
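A Fix script like this is applied with the catmandu command line tool. The invocation below is only a sketch: the file name lludss.fix and the store options are made up, not the actual production settings.
$ catmandu export MongoDB --database_name library --bag data --fix lludss.fix to JSON
The same script can also be compiled once in Perl with Catmandu->fixer('lludss.fix') and applied to any iterable of records.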
Make sure you have cpanm installed (hint: $ cpan App::cpanminus).
$ cpanm Catmandu::OAI
Use the catmandu convert command with the URL of the OAI-PMH endpoint. E.g.
$ catmandu convert OAI --url https://biblio.ugent.be/oai
use Catmandu;
Catmandu->importer('OAI',url => 'https://biblio.ugent.be/oai')->each(sub {
my $record = shift;
print "$record\n";
});
$ catmandu convert OAI --url https://biblio.ugent.be/oai to YAML
use Catmandu -all;
my $importer = importer('OAI',url => 'https://biblio.ugent.be/oai');
my $exporter = exporter('YAML');
$exporter->add_many($importer);
$exporter->commit;
$ catmandu convert OAI --url https://biblio.ugent.be/oai --fix 'retain("_id")'
Or, if you prefer a CSV file:
$ catmandu convert OAI --url https://biblio.ugent.be/oai to CSV --fix 'retain("_id")'
use Catmandu;
my $importer = Catmandu->importer('OAI',url => 'https://biblio.ugent.be/oai');
my $fixer = Catmandu->fixer('retain("_id")');
my $exporter = Catmandu->exporter('CSV');
$exporter->add_many(
$fixer->fix($importer)
);
$exporter->commit;
Hint: use the -v option.
$ catmandu convert -v OAI --url https://biblio.ugent.be/oai to CSV --fix 'retain("_id")' > /dev/null
Here we send the output to /dev/null so that only the verbose progress messages are shown.
use Catmandu;
my $importer = Catmandu->importer('OAI',url => 'https://biblio.ugent.be/oai');
my $fixer = Catmandu->fixer('retain("_id")');
my $exporter = Catmandu->exporter('CSV');
$exporter->add_many(
$fixer->fix($importer->benchmark)
);
$exporter->commit;
Make sure you have Log::Log4perl installed (hint: $ cpan Log::Any::Adapter::Log4perl).
In your main program do:
use Catmandu;
use Log::Any::Adapter;
use Log::Log4perl;
Log::Any::Adapter->set('Log4perl');
Log::Log4perl::init('./log4perl.conf');
# The lines above should be enough to activate logging for Catmandu.
# Include the lines below to activate logging for your main program.
my $logger = Log::Log4perl->get_logger('myprog');
$logger->info("Starting main program");
...your code...
with log4perl.conf like:
# Send a copy of all logging messages to STDERR
log4perl.rootLogger=DEBUG,STDERR
# Logging specific for your main program
log4perl.category.myprog=INFO,STDERR
# Logging specific for one part of Catmandu
log4perl.category.Catmandu::Fix=DEBUG,STDERR
# Where to send the STDERR output
log4perl.appender.STDERR=Log::Log4perl::Appender::Screen
log4perl.appender.STDERR.stderr=1
log4perl.appender.STDERR.utf8=1
log4perl.appender.STDERR.layout=PatternLayout
log4perl.appender.STDERR.layout.ConversionPattern=%d [%P] - %p %l time=%r : %m%n
You will now see Catmandu log messages (e.g. for Fixes).
If you want to add logging functionality to your own Perl modules you have two options:
Your package is a Catmandu::Importer or Catmandu::Exporter. In this case you are lucky because you have a logger as part of your instance:
$self->log->debug('blablabla'); # where $self is an Importer, Fix or Exporter instance
You need to create the logger yourself.
package Foo::Bar;
use Moo;
with 'Catmandu::Logger';
sub bar { my $self = shift; $self->log->debug('tadaah'); }
1;
If you want to see only the logging messages of your package, use this type of line in your log4perl.conf:
log4perl.category.Foo::Bar=DEBUG,STDOUT
or if you want to see all the log messages for Foo packages:
log4perl.category.Foo=DEBUG,STDOUT
A Catmandu::Store is used to store items. Stores have one or more compartments in which items are stored; each such compartment is a Catmandu::Bag. You can compare a Store with a database and a Bag with a table in that database. Like tables, Bags have names. When no name is provided for a Bag, 'data' is used.
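For instance, with the in-memory Catmandu::Store::Hash that ships with Catmandu, the Store/Bag API looks like this (a minimal sketch; the 'books' bag name and the record contents are just examples):
use Catmandu;
# An in-memory store; database-backed stores share the same API
my $store = Catmandu->store('Hash');
# The default bag is named 'data'
my $bag = $store->bag;
# A named bag acts like a separate table
my $books = $store->bag('books');
$bag->add({ _id => '001', title => 'Catmandu' });
my $item = $bag->get('001');                       # fetch by _id
$bag->each(sub { print $_[0]->{title} , "\n" });   # iterate over all items
$bag->delete('001');                               # remove by _id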
To implement a Catmandu store you need to create at least two packages: a Catmandu::Store package that holds the connection details, and a Catmandu::Bag package that implements the storage functions.
As an example, this is a skeleton for a 'Foo' Catmandu::Store which requires at least one 'foo' connection parameter:
package Catmandu::Store::Foo;
use Moo;
use Catmandu::Store::Foo::Bag;
with 'Catmandu::Store';
has 'foo' => (is => 'ro' , required => 1);
1;
For this Catmandu::Store::Foo we can define a module 'Catmandu::Store::Foo::Bag' to implement the Bag functions. Notice how in the generator the bag can access the Catmandu::Store instance:
package Catmandu::Store::Foo::Bag;
use Moo;
with 'Catmandu::Bag';
sub generator {
my $self = shift;
sub {
# This subroutine is used to loop over all items
# in a store and should return an item HASH for
# every call
return {
name => $self->name,
foo => $self->store->foo
};
};
}
sub get {
my ($self,$id) = @_;
# return an item HASH given an $id
return {};
}
sub add {
my ($self,$data) = @_;
# add/update an item HASH to the bag and return the item with an _id field set
return $data;
}
sub delete {
my ($self,$id) = @_;
# delete an item from the bag given an $id
1;
}
sub delete_all {
my ($self) = @_;
# delete all items
$self->each(sub {
$self->delete($_[0]->{_id});
});
}
1;
With this skeleton Store you have enough code to run basic tests. Save these packages in a lib directory:
lib/Catmandu/Store/Foo.pm
lib/Catmandu/Store/Foo/Bag.pm
and use a catmandu command to test your implementation:
$ catmandu -I lib export Foo --foo bar
{"foo":"bar","name":"data"}
{"foo":"bar","name":"data"}
{"foo":"bar","name":"data"}
...
Or create a test.pl script to access your new Store via Perl:
#!/usr/bin/env perl
use lib qw(./lib);
use Catmandu;
my $store = Catmandu->store('Foo', foo => 'bar');
$store->add({ test => 123});
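To exercise the remaining Bag methods the script can be extended like this, here going through the default bag explicitly (a sketch; the '123' identifier is only meaningful if your get() and delete() implementations expect it):
#!/usr/bin/env perl
use lib qw(./lib);
use Catmandu;
my $store = Catmandu->store('Foo', foo => 'bar');
$store->bag->add({ test => 123 });     # dispatches to Catmandu::Store::Foo::Bag::add
my $item = $store->bag->get('123');    # dispatches to the get method
$store->bag->each(sub {                # iterates using the generator
    print $_[0]->{name} , "\n";
});
$store->bag->delete('123');            # dispatches to the delete method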
This section provides an in-depth overview of how to extend Catmandu using its API.
The easiest way to create a new Fix is to create a Perl package in the Catmandu::Fix namespace that has a 'fix' instance method. For example:
package Catmandu::Fix::foo;
use Moo;
sub fix {
my ($self, $data) = @_;
# modify your data here, for instance...
$data->{foo} = 'bar';
$data;
}
1;
When this code is available in your Perl library path as Catmandu/Fix/foo.pm it can be used as the fix function foo(). To try it out, save the file as lib/Catmandu/Fix/foo.pm in your local directory and execute:
$ echo '{}' | catmandu -I lib convert JSON --fix "foo()"
{"foo":"bar"}
The following instructions are incomplete; see the POD of Catmandu::Fix for details.
If you want to pass arguments to your fix, you can use Moo and Catmandu::Fix::Has to read required and optional parameters.
package Catmandu::Fix::foo;
use Moo;
use Catmandu::Fix::Has;       # enables the fix_arg/fix_opt attribute options
with 'Catmandu::Logger';      # provides $self->log
has greeting => (fix_arg => 1); # required first argument
has message  => (fix_arg => 1); # required second argument
has eol      => (fix_opt => 1, default => sub { '!' }); # optional argument, default '!'
sub fix {
    my ($self, $data) = @_;
    $self->log->debug($self->greeting . ", " . $self->message . $self->eol . "\n");
    # Fix your data here...
    $data;
}
1;
Now you can pass arguments and options to your fix, and write log messages from it:
$ echo '{}' | catmandu convert --fix 'foo(Hello,World)'
Hello, World!
{}
$ echo '{}' | catmandu convert --fix 'foo(Hello,World, eol: ?)'
Hello, World?
{}
See also Catmandu::Fix::SimpleGetValue.
For an extended introduction into creating Fix packages read the two blog posts at:
This guide has been written to help anyone interested in contributing to the development of Catmandu. Please read it before contributing to Catmandu or related projects, to avoid wasted effort and to maximize the chances of your contributions being used.
There are many ways to contribute to the project. Catmandu is a young yet active project and any kind of help is very much appreciated!
You don’t have to start by hacking the code, spreading the word is very valuable as well!
If you have a blog, just feel free to speak about Catmandu.
Of course, it doesn’t have to be limited to blogs or Twitter. Feel free to spread the word in whatever way you consider fit and drop us a line on the Catmandu user mailing list noted below.
Also, if you’re using and enjoying Catmandu, rating us on cpanratings.perl.org and explaining what you like about Catmandu is another very valuable contribution that helps other new users find us!
Subscribing to the mailing list and providing assistance to new users is incredibly valuable.
librecat-dev@lists.uni-bielefeld.de
We value documentation very much, but it’s difficult to keep it up-to-date. If you find a typo or an error in the documentation please do let us know - ideally by submitting a patch (pull request) with your fix or suggestion (see Patch Submission).
You can contribute to Catmandu’s core code or extend its functionality with new Importers, Exporters, Stores, Fix packages, Validators, Binds, or Plugins. Have a look at the list of missing modules for existing ideas and resources for new Catmandu modules. Also feel free to add new ideas and links there.
For more detailed guidelines, see:
We can measure our quality using the CPAN testers platform: http://www.cpantesters.org.
A good way to help the project is to find a failing build log on the CPAN testers: http://www.cpantesters.org/distro/D/Catmandu.html
If you find a failing test report or another kind of bug, feel free to report it as a GitHub issue: http://github.com/LibreCat/Catmandu/issues. Please first make sure the bug has not already been reported.
The official website is here: http://librecat.org/
A Wordpress blog with hints is available at: https://librecatproject.wordpress.com/
A mailing list is available here:
librecat-dev@mail.librecat.org
The official repository is hosted on GitHub at http://github.com/LibreCat/Catmandu.
Official developers have write access to this repository, contributors are invited to fork the dev branch (!) and submit a pull request, as described at patch submission.
This guide was based on
The following guidelines describe how to set up a development environment for contribution of code.
If you want to submit a patch for Catmandu, you need git and very likely also milla (Dist::Milla).
In the following sections we provide tips for the installation of some of these tools together with Catmandu. Please also see the documentation that comes with these tools for more info.
Install perlbrew, for example with:
cpanm App::perlbrew
Check which Perl versions are available:
perlbrew available
At the time of writing the list looks like this:
perl-5.18.0
perl-5.16.3
perl-5.14.4
perl-5.12.5
perl-5.10.1
perl-5.8.9
perl-5.6.2
perl5.005_04
perl5.004_05
perl5.003_07
Then go on and install a version inside perlbrew. I recommend you give a name to the installation (the --as option) and skip the tests (the -n option) to speed things up.
perlbrew install -n perl-5.16.3 --as catmandu_dev -j 3
Wait a while, and it should be done. Switch to your new Perl with:
perlbrew switch catmandu_dev
Now you are using the fresh Perl; you can check it with:
which perl
Install cpanm on your brewed version of Perl:
perlbrew install-cpanm
(Note: this section needs to be rewritten to reflect the change to Dist::Milla.)
Get the Catmandu sources from github (for a more complete git workflow see below):
Clone the repository to have a local copy using the following command:
$ git clone git@github.com:LibreCat/Catmandu.git
The installation is then straight forward:
$ cd Catmandu
$ perl Build.PL
$ ./Build
$ ./Build test
$ ./Build install
You can now start hacking on Catmandu and submitting patches!
The following guidelines are not strict rules, but they should be considered best practice for contributions.
Catmandu should install on all Perl versions since 5.10.1, on any platform for which Perl exists. We focus mainly on GNU/Linux (any distribution).
You should avoid regressions as much as possible and keep backwards compatibility in mind when refactoring. Stable releases should not break functionality and new releases should provide an upgrade path and upgrade tips such as warning the user about deprecated functionality.
Document your module with POD. Names of other modules should be linked (e.g. L<Catmandu::Importer>).
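A minimal POD skeleton might look like this (the module name and the text are only an illustration):
=head1 NAME

Catmandu::Fix::foo - example fix that adds a foo field

=head1 SYNOPSIS

   # In a fix script
   foo()

=head1 SEE ALSO

L<Catmandu::Fix>, L<Catmandu::Importer>

=cut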
The Catmandu development team uses GitHub to collaborate. We greatly appreciate contributions submitted via GitHub, as it makes tracking these contributions and applying them much, much easier. This gives your contribution a much better chance of being integrated into Catmandu quickly!
To help us achieve high-quality, stable releases, the git-flow workflow is used to handle pull requests; that means contributors must work on their dev branch rather than on their master. (master should be touched only by the core dev team when preparing a release to CPAN; all ongoing development happens in branches which are merged to the dev branch.)
Here is the workflow for submitting a patch:
Fork the repository http://github.com/LibreCat/Catmandu (click “Fork”)
Clone your fork to have a local copy using the following command:
$ git clone git://github.com/$myname/Catmandu.git
As a contributor, you should always work on the dev branch of your clone (master is used only for building releases).
$ git remote add upstream https://github.com/LibreCat/Catmandu.git
$ git fetch upstream
$ git checkout -b dev upstream/dev
This will create a local branch in your clone named dev that will track the official dev branch. That way, if you have more or fewer commits than the upstream repo, git will notify you immediately.
You want to isolate all your commits in a topic branch: this will make reviewing much easier for the core team and will allow you to continue working on your clone without worrying about different commits mixing together.
To do that, first create a local branch to build your pull request:
# you should be in dev branch here
git checkout -b pr/$name
Now you have created a local branch named pr/$name, where $name is a name of your choosing (it should describe the purpose of the pull request you’re preparing).
In that branch, do all the commits you need (the more the better) and when done, push the branch to your fork:
# ... commits ...
git push origin pr/$name
You are now ready to send a pull request.
Send a pull request via the GitHub interface. Make sure your pull request is based on the pr/$name branch you’ve just pushed, so that it incorporates the appropriate commits only.
It’s also a good idea to summarize your work in a report sent to the users mailing list (see below), in order to make sure the team is aware of it.
When the core team reviews your pull request, it will either accept it (and then merge it into dev) or refuse it.
If it’s refused, try to understand the reasons explained by the team for the denial. Most of the time, communicating with the core team is enough to understand what the mistake was. Above all, please don’t be offended.
If your pull request is merged into dev, then all you have to do is remove your local and remote pr/$name branches:
git checkout dev
git branch -D pr/$name
git push origin :pr/$name
And then, of course, you need to sync your local dev branch with the upstream:
git pull upstream dev
git push origin dev
You’re now ready to start working on a new pull request!
5.6 Comments
Comments can be added to Fix scripts to enhance the readability of your transformations. All lines that start with a hash sign (#) are ignored by Catmandu.
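For example, in this illustrative snippet (the field name and values are made up) only the uncommented fix functions are executed:
# Normalize the title field
trim(title)        # strip leading and trailing whitespace
# The next fix is commented out and will not run:
# downcase(title)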