Fix
Fixes can be used for easy data manipulation by non-programmers. Using a small Perl DSL, librarians can use Fix routines to manipulate data objects. A plain text file of fixes can be created to specify all the data manipulations that need to be executed to 'massage' the data into the desired format.
As an example we will import data from a MARC file and change some metadata fields using Fix routines. Here is the code to run the example:
use Catmandu::Fix;
use Catmandu::Importer::MARC;
use Data::Dumper;
my $fixer = Catmandu::Fix->new(fixes => ['marc.fix']);
my $it = Catmandu::Importer::MARC->new(file => 'marc.txt', type => 'ALEPHSEQ');
$fixer->fix($it)->each(sub {
    my $obj = shift;
    print Dumper($obj);
});
Running this script should produce output like this:
$VAR1 = {
'_id' => '000000043',
'my' => {
'authors' => [
'Patrick Hochstenbachhttp://ok',
'Patrick Hochstenbach2My bMy eMy codeGhent1971',
'Patrick Hochstenbach3',
'Stichting Ons Erfdeel'
],
'language' => 'dut',
'subjects' => [
'MyTopic1',
'MyTopic2',
'MyTopic3',
'MyTopic4'
],
'stringy' => 'MyTopic1; MyGenre1; MyTopic2; MyGenre2; MyTopic3; MyTopic4; MyGenre4'
}
};
We need two files as input: marc.txt is a file containing MARC records, and marc.fix contains the fixes that need to be applied to each MARC record. Let's take a look at the contents of this marc.fix file:
marc_map('100','my.authors.$append');
marc_map('710','my.authors.$append');
marc_map('600x','my.subjects.$append');
marc_map('008_/35-37','my.language');
marc_map('600','my.stringy', -join => "; ");
marc_map('199','my.condition', -value => 'ok');
remove_field('record');
The fixes in this file are specific to MARC processing. In line 1 we map the contents of the MARC-100 field into a deeply nested Perl hash with key 'authors'. In line 3 we map the contents of the MARC-600 x-subfield into the 'subjects' field. In line 4 we read characters 35 to 37 from the MARC-008 control field into the 'language' key.
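To make the mapping concrete, here is a plain-Perl sketch of roughly what the marc_map calls in lines 1, 3 and 4 compute. It does not use Catmandu. The record layout (one array per field holding the tag, two indicators and then code/value pairs, with '_' as the code of a control field), the helper names subfield_values, map_append and map_range, and the sample values are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Build a 40-character 008 control field piece by piece so that the
# language code lands on (zero-based) positions 35 to 37.
my $f008 = '850101' . 's' . '1985' . '    ' . 'ne ' . (' ' x 17) . 'dut' . ' ' . 'd';

# Hypothetical, simplified record: [tag, ind1, ind2, code, value, code, value, ...]
my $record = [
    ['008', ' ', ' ', '_', $f008],
    ['100', ' ', ' ', 'a', 'Patrick Hochstenbach', 'u', 'http://ok'],
    ['600', ' ', ' ', 'a', 'MyAuthor', 'x', 'MyTopic1'],
    ['600', ' ', ' ', 'x', 'MyTopic2'],
];

# Return the values of one field, optionally restricted to one subfield code.
sub subfield_values {
    my ($field, $want) = @_;
    my @pairs = @{$field}[3 .. $#$field];
    my @values;
    while (my ($code, $value) = splice(@pairs, 0, 2)) {
        push @values, $value if !defined $want || $code eq $want;
    }
    return @values;
}

# Rough analogue of marc_map(TAG or TAG+subfield, 'path.$append'):
# one concatenated string per matching field.
sub map_append {
    my ($record, $tag, $code) = @_;
    my @out;
    for my $field (grep { $_->[0] eq $tag } @$record) {
        my @values = subfield_values($field, $code);
        push @out, join('', @values) if @values;
    }
    return \@out;
}

# Rough analogue of marc_map('008_/35-37', ...): a character range of the field.
sub map_range {
    my ($record, $tag, $from, $to) = @_;
    my ($field) = grep { $_->[0] eq $tag } @$record;
    return undef unless $field;
    return substr(join('', subfield_values($field)), $from, $to - $from + 1);
}

my %my = (
    authors  => map_append($record, '100'),
    subjects => map_append($record, '600', 'x'),
    language => map_range($record, '008', 35, 37),
);

print "$my{language}\n";                       # prints "dut"
print join(', ', @{ $my{subjects} }), "\n";    # prints "MyTopic1, MyTopic2"
print "$my{authors}[0]\n";                     # prints "Patrick Hochstenbachhttp://ok"
```

Note how all subfields of a field are concatenated without a separator when no subfield is named, which is why the sample output above contains values such as 'Patrick Hochstenbachhttp://ok'.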
Catmandu Fix also provides many functions to manipulate Perl hashes. The remove_field function, as shown above in the fix file, removes a key from a Perl hash. Other fix functions are: add_field, capitalize, clone, collapse, copy_field, downcase, expand, join_field, move_field, replace_all, retain_field, set_field, split_field, substring, trim and upcase.
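As a rough illustration of what such functions do to a record, the plain-Perl sketch below mimics add_field, upcase and remove_field on a nested hash addressed by a dotted path. It is not Catmandu's implementation: the helper names locate, add_field_sketch, upcase_sketch and remove_field_sketch are hypothetical, and only hash keys (no array positions) are handled.

```perl
use strict;
use warnings;

# Walk a dotted path such as 'my.language'; return the parent hash and the
# final key, or an empty list when the path cannot be followed.
sub locate {
    my ($data, $path, $create) = @_;
    my @keys = split /\./, $path;
    my $last = pop @keys;
    for my $key (@keys) {
        if (!exists $data->{$key}) {
            return unless $create;
            $data->{$key} = {};
        }
        $data = $data->{$key};
        return unless ref($data) eq 'HASH';
    }
    return ($data, $last);
}

# Set a value at the path, creating intermediate hashes as needed.
sub add_field_sketch {
    my ($data, $path, $value) = @_;
    my ($parent, $key) = locate($data, $path, 1) or return $data;
    $parent->{$key} = $value;
    return $data;
}

# Uppercase a plain string value at the path, if there is one.
sub upcase_sketch {
    my ($data, $path) = @_;
    my ($parent, $key) = locate($data, $path, 0) or return $data;
    my $value = $parent->{$key};
    $parent->{$key} = uc($value) if defined($value) && !ref($value);
    return $data;
}

# Delete the key at the path, if it exists.
sub remove_field_sketch {
    my ($data, $path) = @_;
    my ($parent, $key) = locate($data, $path, 0) or return $data;
    delete $parent->{$key};
    return $data;
}

my $obj = { record => [], my => { language => 'dut' } };
add_field_sketch($obj, 'my.source', 'marc');
upcase_sketch($obj, 'my.language');
remove_field_sketch($obj, 'record');

print join(',', sort keys %$obj), "\n";               # prints "my"
print "$obj->{my}{language} $obj->{my}{source}\n";    # prints "DUT marc"
```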
Store
As explained in the introduction, one of the rationales for creating Catmandu is to ease the serialization of records in our database of choice.
The introduction of schemaless databases made the storage of complex records quite easy. Before we delve into this type of database we need to show you the syntax Catmandu uses to store data.
As an example, let's create the simplest possible storage mechanism: an in-memory hash. We use this mock 'database' to show some of the features that any Catmandu::Store has. First we will create a YAML importer as shown above to import records into an in-memory hash store:
use Catmandu::Importer::YAML;
use Catmandu::Store::Hash;
use Data::Dumper;
my $importer = Catmandu::Importer::YAML->new(file => "./test.yaml");
my $store = Catmandu::Store::Hash->new();
# Store an iterable
$store->bag->add_many($importer);
# Store an array of hashes
$store->bag->add_many([ { name => 'John' } , { name => 'Peter' }]);
# Store one hash
$store->bag->add( { name => 'Patrick' });
# Commit all changes
$store->bag->commit;
Each Catmandu::Store has one or more compartments (e.g. tables) for storing data, called 'bags'. We use the 'add_many' method to store each item in the importer Iterable into the Store. We can also store an array of Perl hashes with the same method, or store a single hash with the 'add' method.
Each bag is an Iterable, so you can apply any of the 'each', 'any', 'all', ... methods shown above to read data from a bag.
$store->bag->take(3)->each(sub {
    my $obj = shift;
    #.. your code
});
When you store a Perl hash in a Catmandu::Store, an identifier field '_id' is added to the hash. This identifier can be used to retrieve the item at a later stage. Let's take a look at the identifier and how it can be used.
# First store a Perl hash; add returns the stored item, including its identifier
my $item = $store->bag->add( { name => 'Patrick' });
# This will show you a UUID like '414003DC-9AD0-11E1-A3AD-D6BEE5345D14'...
print $item->{_id} , "\n";
# Now you can use this identifier to retrieve the object from the store
my $item2 = $store->bag->get($item->{_id});
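The idea behind the '_id' field can be sketched in plain Perl with a toy bag built on a hash. This is only a conceptual model, not how Catmandu::Store::Hash is written: the package name ToyBag is hypothetical, and a simple counter stands in for the UUIDs that a real store generates.

```perl
use strict;
use warnings;

package ToyBag;    # hypothetical name, not part of Catmandu

sub new {
    my ($class) = @_;
    return bless { items => {}, next_id => 1 }, $class;
}

# Store a hash; assign an '_id' only when the caller did not supply one.
sub add {
    my ($self, $item) = @_;
    $item->{_id} //= sprintf('toy-%04d', $self->{next_id}++);
    $self->{items}{ $item->{_id} } = $item;
    return $item;
}

# Look an item up by its identifier; returns undef when it is unknown.
sub get {
    my ($self, $id) = @_;
    return $self->{items}{$id};
}

package main;

my $bag   = ToyBag->new;
my $item  = $bag->add({ name => 'Patrick' });
my $item2 = $bag->get($item->{_id});

print "$item->{_id} $item2->{name}\n";    # prints "toy-0001 Patrick"

# An item that already carries an '_id' keeps it.
my $named = $bag->add({ _id => 'my-own-id', name => 'John' });
print "$named->{_id}\n";                  # prints "my-own-id"
```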
And that is how it works. Catmandu::Store has some more functionality to delete items and query the store (if the backend supports it), but this is how you can store very complex Perl structures in memory or on disk with just a few lines of code. As a complete example we can show how easy it is to store data in a full-text search engine like ElasticSearch.
In this example we will download ElasticSearch version 0.19.3 and install it on our system:
$ wget https://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.19.3.tar.gz
$ tar zxvf elasticsearch-0.19.3.tar.gz
$ cd elasticsearch-0.19.3
$ bin/elasticsearch
The last command, 'bin/elasticsearch', starts the search daemon. Now we can index some data with Catmandu:
use Catmandu::Importer::YAML;
use Catmandu::Store::ElasticSearch;
my $importer = Catmandu::Importer::YAML->new(file => './test.yaml');
my $store = Catmandu::Store::ElasticSearch->new(index_name => 'demo');
$store->bag->add_many($importer);
$store->bag->commit;
All records in the file 'test.yaml' should now be available in the index. We can test this by executing a new script to read all records stored in the store:
use Catmandu::Store::ElasticSearch;
use Data::Dumper;
my $store = Catmandu::Store::ElasticSearch->new(index_name => 'demo');
$store->bag->each(sub {
    my $obj = shift;
    print Dumper($obj);
});
If everything works correctly you should see something like this:
$VAR1 = {
'first' => 'Charly',
'_id' => '96CA6692-9AD2-11E1-8800-92A3DA44A36C',
'last' => 'Parker',
'job' => 'Artist'
};
$VAR1 = {
'first' => 'Joseph',
'_id' => '96CA87F8-9AD2-11E1-B760-84F8F47D3A65',
'last' => 'Ratzinger',
'job' => 'Pope'
};
$VAR1 = {
'first' => 'Albert',
'_id' => '96CA83AC-9AD2-11E1-B1CD-CC6B8E6A771E',
'last' => 'Einstein',
'job' => 'Physicist'
};
The ElasticSearch store even provides an implementation of the Lucene and CQL query languages:
my $hits = $store->bag->searcher(query => 'first:Albert');
$hits->each(sub {
    my $obj = shift;
    printf "%s %s\n" , $obj->{first} , $obj->{last};
});
This last example will print 'Albert Einstein' as its result. Clinton Gormley did some great work in providing a Perl client for ElasticSearch. Searching complex objects can be done by using a dot syntax, e.g. 'record.titles.0.subtitle:"My Funny Valentine"'.
The beauty of ElasticSearch is that it is completely painless to set up and requires no schema: indexing data is simply done by using JSON over HTTP. All your fields are indexed automatically.
Lazy
Most Catmandu processing doesn't require you to write any Perl code. With command line tools you can store data files into databases, index your data, export data in various formats and perform basic data cleanup operations.
Say, you have a YAML file 'test.yml' like:
---
first: Charly
last: Parker
job: Artist
---
first: Albert
last: Einstein
job: Physicist
---
first: Joseph
last: Ratzinger
job: Pope
...
and you are required to transform it into JSON. Using the 'catmandu' command you can do this with these options:
$ catmandu data --from-importer YAML --into-exporter JSON < test.yml
Basically you connect a YAML importer to a JSON exporter.
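The "connect an importer to an exporter" idea can be sketched in plain Perl as a generator feeding a serializer. This is a conceptual model only, not what the 'catmandu' command runs internally: make_importer and export_json are hypothetical names, the records are hard-coded instead of being parsed from YAML, and the core module JSON::PP with sorted keys stands in for the JSON exporter, so the key order may differ from real Catmandu output.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical stand-in for an importer: a generator yielding one record per call
# and undef when the input is exhausted.
sub make_importer {
    my @records = @_;
    return sub { return shift @records };
}

# Hypothetical stand-in for a JSON exporter: one record becomes one JSON string.
# canonical(1) sorts the keys so the output is reproducible.
my $json = JSON::PP->new->canonical(1);
sub export_json {
    my ($record) = @_;
    return $json->encode($record);
}

my $next = make_importer(
    { first => 'Charly', last => 'Parker',    job => 'Artist' },
    { first => 'Albert', last => 'Einstein',  job => 'Physicist' },
    { first => 'Joseph', last => 'Ratzinger', job => 'Pope' },
);

# The pipeline: pull records from the importer and push them to the exporter.
while (defined(my $record = $next->())) {
    print export_json($record), "\n";
}
# first line printed: {"first":"Charly","job":"Artist","last":"Parker"}
```

Because records flow through one at a time, the same loop works for inputs far larger than memory, which is the reason for modelling the importer as a generator here.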
Need some fancy export? Then use the Template exporter which uses a template file like 'test.xml.tt' below to render the output.
<foo>
<first>[% first %]</first>
<last>[% last %]</last>
<job>[% job %]</job>
</foo>
To run the 'catmandu' command you need to provide 'Template' as the exporter to write into and a full path to the template file (without the .tt extension). Note that optional arguments for Importers and Exporters can be provided with the '--from-[NAME]' , '--into-[NAME]' syntax:
$ catmandu data --from-importer YAML --into-exporter Template --into-template `pwd`/test.xml < test.yml
Which produces the output:
<foo>
<first>Charly</first>
<last>Parker</last>
<job>Artist</job>
</foo>
<foo>
<first>Albert</first>
<last>Einstein</last>
<job>Physicist</job>
</foo>
<foo>
<first>Joseph</first>
<last>Ratzinger</last>
<job>Pope</job>
</foo>
Using these command line tools, indexing data also becomes very easy. Start ElasticSearch and run the command below to index the test.yml file:
$ catmandu data -v --into-store ElasticSearch --into-index_name demo --into-bag data --from-importer YAML < test.yml
To show the results of your hard work we can export all the records from the ElasticSearch store:
$ catmandu data --from-store ElasticSearch --from-bag data --from-index_name demo
{"first":"Albert","_id":"3A07B0F8-0973-11E2-98F8-F84380C42756","last":"Einstein","job":"Physicist"}
{"first":"Charly","_id":"3A0792D0-0973-11E2-8724-A22A2812F5B2","last":"Parker","job":"Artist"}
{"first":"Joseph","_id":"3A07B5EE-0973-11E2-97BF-E053E6A92BE5","last":"Ratzinger","job":"Pope"}
We can be even lazier by creating a catmandu.yml file containing the connection parameters for ElasticSearch:
---
store:
  default:
    package: ElasticSearch
    options:
      index_name: demo
With the configuration file above, YAML data can be indexed like this:
$ catmandu data -v --into-bag data --from-importer YAML < ~/Desktop/test.yaml
And exporting all data can be done like this:
$ catmandu data --from-bag data
For Catmandu stores that support a query language, exporting data can be very powerful using the '--query' option. E.g. we can export all records about 'Einstein' from our ElasticSearch store using:
$ catmandu data --from-bag data --query "Einstein"