Importing, transforming, storing and indexing data should be easy
Catmandu provides a suite of Perl modules to ease the import, storage, retrieval, export and transformation of metadata records. Combine Catmandu modules with web application frameworks such as PSGI/Plack, document stores such as MongoDB and full-text indexes such as Solr to create a rapid development environment for digital library services such as institutional repositories and search engines.
In the LibreCat project our goal is to provide, as open source, a set of programming components for building digital library services suited to your local needs. Here is an example of the projects we are working on:
We have more than 40 Catmandu projects available at GitHub LibreCat.
When you create a search engine, one of your first tasks will be to import data from various sources, map the fields to a common data model and post the result to a full-text search engine. Perl modules such as WebService::Solr or ElasticSearch provide easy access to your favorite document stores, but you keep writing a lot of boilerplate code to create the connections, massage the incoming data into the correct format, and validate, upload and index the data in the database. The next morning you are asked to provide a quick dump of records into an Excel worksheet. After some fixes are applied you are asked to upload it into your database. Again you open Emacs or Vi and write an ad hoc script. In our LibreCat group we saw this workflow over and over. We tried to abstract this problem into a set of Perl tools that can work with library data formats such as MARC, Dublin Core and EndNote, protocols such as OAI-PMH and SRU, and repositories such as DSpace and Fedora. In data warehousing these processes are called ETL: Extract, Transform, Load. Many tools currently exist for ETL processing, but none address typical library data models and services.
As programmers, we would like to reuse our code and algorithms as easily as possible. In fast application development you typically want to copy and paste parts of existing code into a new project. In Catmandu we use a functional style of programming to keep our code tight and clean and suitable for copying and pasting. When working with library data models we use native Perl hashes and arrays to pass data around. In this way we adhere to the rationale of Alan J. Perlis: "It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures." Our functions are all based on a few primary data structures on which we define many functions (map, count, each, first, take, ...).
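The idea of many small functions over one plain data structure can be sketched in core Perl alone. This is not Catmandu code: the records and variable names are invented for illustration, and only built-ins plus List::Util are used.

```perl
use strict;
use warnings;
use List::Util qw(first sum);

# Illustrative records (invented): plain hashes in a plain array.
my @records = (
    { title => 'Record A', year => 1990, authors => ['Smith'] },
    { title => 'Record B', year => 2005, authors => ['Jones', 'Lee'] },
    { title => 'Record C', year => 2012, authors => [] },
);

# Many small functions, one data structure.
my @titles       = map { $_->{title} } @records;
my $recent       = grep { $_->{year} >= 2000 } @records;         # count in scalar context
my $first_multi  = first { @{ $_->{authors} } > 1 } @records;    # first record with 2+ authors
my $author_total = sum(0, map { scalar @{ $_->{authors} } } @records);

print join(', ', @titles), "\n";      # Record A, Record B, Record C
print "$recent\n";                    # 2
print "$first_multi->{title}\n";      # Record B
print "$author_total\n";              # 3
```

Because every step takes and returns ordinary hashes and arrays, each line can be copied into another script without dragging a class hierarchy along.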
Working with native Perl hashes and arrays, we would like an easy mechanism to store and index this data in a database of choice. In the past it was a nuisance to create database schemas and indexes to store and search your data. In institutional repositories in particular this can be an ongoing job for a programmer, because the metadata schemas are not fixed in time. Any new report will require you to add new data fields and new relations, for which you need to change your database schema. With the introduction of schemaless databases the storage of complex records has become really easy. Create a Perl hash, execute the function 'add' and your record is stored in the database. Execute 'get' to load a Perl hash from the database into memory. With our ElasticSearch plugin we can even provide a CQL-style query language for retrieval.
To get Catmandu running on your system you need to download and install the Catmandu module from CPAN. The Task::Catmandu meta-package bundles some additional modules commonly used with Catmandu. Install it like this:
Importers are Catmandu packages to read data into an application. We provide importers for MARC, JSON, YAML, CSV and Excel, but also for Atom and OAI-PMH endpoints.
As an example, let's create a Perl script to read a YAML file containing an array of values. We use the each function to loop through all the items.
Running this script against the test.yaml file, you should see output like this:
Here is an example script to read 10 records from an OAI-PMH endpoint into an application:
The Iterable package provides many list methods to process large streams of records. Most of the methods are lazy if the underlying data stream supports it. While all of the data in Catmandu are native Perl hashes and arrays, it can be impractical to load a result set of thousands of records into memory. Most Catmandu packages, such as Importer, Exporter and Store, therefore provide an Iterable implementation.
Using a 'Mock' importer we can generate some Perl hashes on-the-fly and show the functionality provided by Iterable:
With each you can loop over all the items in an iterator:
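A minimal sketch of such a loop, assuming a recent Catmandu release from CPAN in which the Mock importer yields hashes of the form { n => 0 }, { n => 1 }, and so on. The size of 5 and the variable names are illustrative choices.

```perl
use strict;
use warnings;
use Catmandu::Importer::Mock;

# Generate 5 small hashes on the fly: { n => 0 } .. { n => 4 }
my $it = Catmandu::Importer::Mock->new(size => 5);

# each() calls the callback once per item
my @seen;
$it->each(sub {
    my $item = shift;
    push @seen, $item->{n};
    print "n = $item->{n}\n";
});

# take() is lazy and returns a new iterator; to_array materialises it
my $first_three = $it->take(3)->to_array;
print scalar(@$first_three), " items taken\n";
```

Only to_array pulls the items into memory, so the same pattern should also work when the importer is far larger than the slice you take.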
Using any, many and all you can test for the existence of items in an iterator:
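A hedged sketch of the three tests, again assuming a recent Catmandu release and the Mock importer's { n => ... } hashes. The thresholds are arbitrary.

```perl
use strict;
use warnings;
use Catmandu::Importer::Mock;

# Ten hashes: { n => 0 } .. { n => 9 }
my $it = Catmandu::Importer::Mock->new(size => 10);

# any: true if at least one item matches
my $any_big = $it->any(sub { $_[0]->{n} > 8 });

# many: true only if at least two items match
my $many_big = $it->many(sub { $_[0]->{n} > 8 });

# all: true if every item matches
my $all_ok = $it->all(sub { $_[0]->{n} >= 0 });

print "any: ",  ($any_big  ? 'yes' : 'no'), "\n";
print "many: ", ($many_big ? 'yes' : 'no'), "\n";
print "all: ",  ($all_ok   ? 'yes' : 'no'), "\n";
```

Here only n = 9 exceeds 8, so any should succeed while many should not.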
Map and reduce are functions that evaluate a function on all the items in an iterator to produce a new iterator or a summary of the results:
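As a sketch, assuming a recent Catmandu release: map returns a new iterator of transformed items, and reduce folds all items into one value. The optional start value passed to reduce is used here; the doubling and summing are illustrative.

```perl
use strict;
use warnings;
use Catmandu::Importer::Mock;

# Five hashes: { n => 0 } .. { n => 4 }
my $it = Catmandu::Importer::Mock->new(size => 5);

# map: build a new iterator whose items carry the doubled value
my $doubled = $it->map(sub {
    my $item = shift;
    return { n => $item->{n} * 2 };
})->to_array;

# reduce: start at 0 and add up every n
my $sum = $it->reduce(0, sub {
    my ($memo, $item) = @_;
    return $memo + $item->{n};
});

print join(',', map { $_->{n} } @$doubled), "\n";
print "sum = $sum\n";
```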
The Iterable package provides many more functions such as: to_array, count, each, first, slice, take, group, tap, detect, select, reject, any, many, all, map, reduce and invoke.
Exporters are Catmandu packages to export data from an application. As input they can get native Perl hashes or arrays but also Iterators to stream huge data sets.
Here is an example using our Mock importer to stream 1 million Perl hashes through an Exporter:
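A minimal sketch of this streaming pattern, assuming a recent Catmandu release. The size is reduced to 1000 so the example finishes quickly, and the output goes to an in-memory string rather than STDOUT; both are choices made for illustration.

```perl
use strict;
use warnings;
use Catmandu::Importer::Mock;
use Catmandu::Exporter::JSON;

# Collect the JSON in a string; omit 'file' to write to STDOUT instead
my $out = '';

my $importer = Catmandu::Importer::Mock->new(size => 1000);
my $exporter = Catmandu::Exporter::JSON->new(file => \$out);

# add_many accepts an Iterable and pulls items one at a time
$exporter->add_many($importer);
$exporter->commit;

my $exported = $exporter->count;
print "exported $exported records\n";
```

Because add_many consumes the importer item by item, raising the size should not require holding all the hashes in memory at once.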
Catmandu provides exporters for BibTeX, CSV, JSON, RIS, XLS and YAML. If you need a special exporter for your own format you could use the Template exporter which uses Template Toolkit.
As an example, let's create an exporter for a Perl array of hashes $data using a template:
The template example.tt will be rendered for every hash in the array $data (or for every item in an Iterable $data).
Fixes can be used for easy data manipulation by non-programmers. Using a small Perl DSL, librarians can use Fix routines to manipulate data objects. A plain text file of fixes can be created to specify all the data manipulations needed to 'massage' the data into the desired format.
As an example we will import data from a MARC file and change some metadata fields using Fix routines. Here is the code to run the example:
The script should produce output like this:
We need two files as input: marc.txt is a file containing MARC records and marc.fix contains the fixes that need to be applied to each MARC record. Let's take a look at the contents of this marc.fix file:
The fixes in this file are specialized for MARC processing. In line 1 we map the contents of the MARC-100 field into a deeply nested Perl hash with key 'authors'. In line 3 we map the contents of the MARC-600 x-subfield into the 'subjects' field. In line 4 we read characters 35 to 37 from the MARC-008 control field into the 'language' key.
Catmandu Fix also provides many functions to manipulate Perl hashes. The remove_field function, as shown above in the fix file, removes a key from a Perl hash. Other fix functions are: add_field, capitalize, clone, collapse, copy_field, downcase, expand, join_field, move_field, replace_all, retain_field, set_field, split_field, substring, trim and upcase.
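A small sketch of these hash-manipulation fixes applied from Perl, assuming a recent Catmandu release in which fixes can be passed as inline strings. The record and its field names (title, internal_note, source) are invented for illustration and are not taken from the marc.fix file.

```perl
use strict;
use warnings;
use Catmandu::Fix;

# Three fixes given inline instead of in a fix file
my $fixer = Catmandu::Fix->new(fixes => [
    'add_field(source,demo)',
    'upcase(title)',
    'remove_field(internal_note)',
]);

# Illustrative record (invented)
my $record = {
    title         => 'my funny valentine',
    internal_note => 'drop me',
};

my $fixed = $fixer->fix($record);

print "$fixed->{title}\n";
print "$fixed->{source}\n";
print exists $fixed->{internal_note} ? "note kept\n" : "note removed\n";
```

The same three lines could be saved in a plain text file and handed to a librarian, which is the workflow the fix file is meant to support.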
As explained in the introduction, one of the rationales for creating Catmandu is to ease the serialization of records in our database of choice. The introduction of schemaless databases made the storage of complex records quite easy. Before we delve into this type of database we need to show you what syntax Catmandu is using to store data.
As an example, let's create the simplest storage mechanism possible: an in-memory hash. We use this mock 'database' to show some of the features that any Catmandu::Store has. First we will create a YAML importer as shown above to import records into an in-memory hash store:
Each Catmandu::Store has one or more compartments (e.g. tables), called 'bags', to store data. We use the 'add_many' method to store each item from the importer Iterable in the Store. The same method can also store an array of Perl hashes, and the 'add' method stores a single hash.
Each bag is an Iterator so you can apply any of the 'each','any','all',... methods shown above to read data from a bag.
When you store a Perl hash in a Catmandu::Store, an identifier field '_id' is added to the hash; it can be used to retrieve the item at a later stage. Let's take a look at the identifier and how it can be used.
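A minimal sketch of the identifier round trip with the in-memory store, assuming a recent Catmandu release. The records are placeholders, and the generated '_id' value will differ on every run.

```perl
use strict;
use warnings;
use Catmandu::Store::Hash;

my $store = Catmandu::Store::Hash->new;
my $bag   = $store->bag;    # the default bag

# add() returns the stored hash, now carrying a generated _id
my $stored = $bag->add({ name => 'Albert Einstein' });
my $id     = $stored->{_id};

# add_many() also accepts an array of hashes
$bag->add_many([
    { name => 'Example Person One' },
    { name => 'Example Person Two' },
]);

# get() loads the hash back by its identifier
my $loaded = $bag->get($id);
my $total  = $bag->count;

print "id: $id\n";
print "name: $loaded->{name}\n";
print "records in bag: $total\n";
```

If a record already contains an '_id' key, that value is kept rather than replaced, which is useful when identifiers come from an upstream system.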
And that is how it works. Catmandu::Store has some more functionality to delete items and query the store (if the backend supports it), but this is how you can store very complex Perl structures in memory or on disk with just a few lines of code. As a complete example we can show how easy it is to store data in a full-text search engine like ElasticSearch.
In this example we will download ElasticSearch version 0.19.3 from this website and install it on our system:
After running the last command 'bin/elasticsearch' we have started the search daemon. Now we can index some data with Catmandu:
All records in the file 'test.yaml' should now be available in the index. We can test this by executing a new script to read all records stored in the store:
If everything works correctly you should see something like this:
The ElasticSearch store even provides an implementation of the Lucene and CQL query languages:
This last example will print 'Albert Einstein' as the result. Clinton Gormley did some great work in providing a Perl client for ElasticSearch. Searching complex objects can be done by using a dot syntax, e.g. 'record.titles.0.subtitle:"My Funny Valentine"'. The beauty of ElasticSearch is that it is completely painless to set up and requires no schema: indexing data is simply done by using JSON over HTTP. All your fields are indexed automatically.
Most Catmandu processing doesn't require you to write any Perl code. With command line tools you can store data files in databases, index your data, export data in various formats and perform basic data cleanup operations.
Say, you have a YAML file 'test.yml' like:
and you are required to transform it into JSON. Using the 'catmandu' command you can do this with these options:
Basically you connect a YAML importer to a JSON exporter.
Need some fancy export? Then use the Template exporter which uses a template file like 'test.xml.tt' below to render the output.
To run the 'catmandu' command you need to provide 'Template' as the exporter to write into and the full path to the template file (without the .tt extension). Note that optional arguments for Importers and Exporters can be provided with the '--from-[NAME]' and '--into-[NAME]' syntax:
Which produces the output:
Using these command line tools, indexing data also becomes very easy. Start ElasticSearch and run the command below to index the test.yml file:
To show the results of your hard work we can export all the records from the ElasticSearch store:
We can be even lazier by creating a catmandu.yml file containing the connection parameters for ElasticSearch:
Using the configuration file above, YAML data can be indexed like this:
And exporting all data can be done like this:
For Catmandu stores that support a query language, exporting data can be very powerful using the '--query' option. For example, we can export all records about 'Einstein' from our ElasticSearch store using:
If you are interested in writing web applications, then please proceed to part 2 of this tutorial: Dancer & Catmandu.