2015-09-14 Poznań

Catmandu

Created in 2012 by library professionals at Ghent University, Lund University and Bielefeld University.

Now an international community spanning all continents, with a dozen active contributors.

Used by university libraries, archives and commercial implementers to extract metadata records from various sources, transform them into new formats and load them into databases such as Solr, ElasticSearch, MySQL, PostgreSQL, Oracle, and many more.

Catmandu @ Bielefeld

Catmandu @ Ghent

Catmandu @ Catalogs

Catmandu @ OpenRefine

Catmandu @ LinkedDataFragments

ELAG presentation & Demo

Virtual Box

Exercise 1

Catmandu Tools

  • Command line tool: catmandu
  • Importers: Text, JSON, YAML, CSV, RDF, OAI-PMH, SRU, Z3950, DBI, LDAP, MARC, RIS, Twitter, Wikidata, XLS, …
  • Exporters: Text, JSON, YAML, CSV, RDF, MARC, RIS, Template, XLS, XML, …
  • Stores: MongoDB, CouchDB, ElasticSearch, Solr, DBI, FedoraCommons, Aleph
  • Transformation: Fix or any program that can read and write JSON
  • API: Perl
  • Web development: PSGI, Dancer

Catmandu command line tool

$ catmandu <COMMAND> <--OPTIONS>
$ catmandu help
$ catmandu convert <IMPORTER> to <EXPORTER> < file.yaml
$ catmandu convert YAML to JSON < file.yaml
$ catmandu convert YAML to JSON --pretty 1 < file.yaml
$ catmandu convert CSV --sep_char ';' to JSON < file.csv
$ catmandu convert YAML to JSON --fix 'retain(title)' < file.yaml

CSV, JSON, YAML

CSV

name,organization,country
Nicolas,Ghent University,Belgium
Patrick,Ghent University,Belgium
Snorri,Lund University,Sweden
Vitali,Bielefeld University,Germany
Johan,Staatsbibliothek zu Berlin,Germany
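
For readers who want to see what a CSV to JSON conversion amounts to, here is a minimal Python sketch. It is not how Catmandu is implemented; the function name csv_to_json_lines and its sep_char parameter are assumptions chosen to mirror the --sep_char option shown earlier.

```python
import csv
import io
import json

# Two rows copied from the CSV slide above.
CSV_TEXT = """name,organization,country
Nicolas,Ghent University,Belgium
Vitali,Bielefeld University,Germany
"""


def csv_to_json_lines(text, sep_char=","):
    """Turn each CSV row into one JSON object, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep_char)
    return [json.dumps(dict(row)) for row in reader]


for line in csv_to_json_lines(CSV_TEXT):
    print(line)
```

Each output line is one record, which is the shape the JSON slide below uses.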

CSV, JSON, YAML

JSON

{ "name":"Nicolas" , "organization":"Ghent University" , 
"country":"Belgium" , "hobbies": ["bicycles","milling","literature"],
"education":{ "type":"akkadian" , "place":"Ghent University" , 
"year":"2000"}}
{ "name":"Patrick" , "organization":"Ghent University" , 
"country":"Belgium" , "hobbies": ["drawing","music","literature"],
"education":{ "type":"physics" , "place":"Nijmegen University" ,
"year":"1995"}}

CSV, JSON, YAML

YAML

---
name: Patrick Hochstenbach
organization: Ghent University Library
country: Belgium
hobbies:
-   drawing
-   music
-   reading
education:
   type: physics
   place: Nijmegen, the Netherlands
   year: 1995

Exercise 2

JSON Path

Path: name

---
name: Patrick Hochstenbach
organization: Ghent University Library
country: Belgium
hobbies:
-   drawing
-   music
-   reading
education:
   type: physics
   place: Nijmegen, the Netherlands
   year: 1995

JSON Path

Path: hobbies.1

---
name: Patrick Hochstenbach
organization: Ghent University Library
country: Belgium
hobbies:
-   drawing
-   music
-   reading
education:
   type: physics
   place: Nijmegen, the Netherlands
   year: 1995

JSON Path

Path: education.year

---
name: Patrick Hochstenbach
organization: Ghent University Library
country: Belgium
hobbies:
-   drawing
-   music
-   reading
education:
   type: physics
   place: Nijmegen, the Netherlands
   year: 1995

JSON Path

Path: foo.bar.0.test.1.value
---
foo:
  bar:
    - test:
        - ???: ???
        - value: HERE
{"foo":{
   "bar":[
     { "test":[
        { "???": "???" } ,
        {"value": "HERE"}
        ]
     }
   ]
 }
}
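
The path notation on these slides can be read as a walk through nested hashes and arrays. The following Python sketch shows that reading; get_path is a hypothetical helper, not part of Catmandu, and it assumes numeric path parts are 0-based array indexes.

```python
def get_path(record, path):
    """Follow a dotted path; numeric parts index into lists (0-based)."""
    node = record
    for part in path.split("."):
        if isinstance(node, list):
            node = node[int(part)]
        else:
            node = node[part]
    return node


# The nested record from the slide above.
record = {"foo": {"bar": [{"test": [{"???": "???"}, {"value": "HERE"}]}]}}
print(get_path(record, "foo.bar.0.test.1.value"))  # prints HERE

# A shortened version of the YAML example record.
person = {"name": "Patrick Hochstenbach",
          "hobbies": ["drawing", "music", "reading"],
          "education": {"type": "physics", "year": 1995}}
print(get_path(person, "hobbies.1"))  # prints music
```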

JSON Path

input record:

---

fix: add_field(foo,test)

result record:

---
foo: test

JSON Path

input record:

---

fix: add_field(foo.bar,test)

result record:

---
foo:
  bar: test

JSON Path

input record:

---

fix: add_field(foo.$append,test)

or: add_field(foo.0,test)

result record:

---
foo:
  - test

JSON Path testing

$ echo {} | catmandu convert JSON to YAML --fix 'add_field(foo.1.ok,test)'
---
foo:
- ~
- ok: test
...
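
The add_field results above suggest that missing hashes and arrays are created along the path, and that skipped array positions are filled with nulls. A Python sketch of that behaviour, under the assumption that numeric parts and $append always mean "array" (the helper names are hypothetical):

```python
def _set(node, parts, value):
    """Return node with value stored at the remaining path parts."""
    if not parts:
        return value
    head, rest = parts[0], parts[1:]
    if head == "$append" or head.isdigit():
        items = node if isinstance(node, list) else []
        index = len(items) if head == "$append" else int(head)
        while len(items) <= index:
            items.append(None)  # pad skipped positions with nulls
        items[index] = _set(items[index], rest, value)
        return items
    fields = node if isinstance(node, dict) else {}
    fields[head] = _set(fields.get(head), rest, value)
    return fields


def add_field(record, path, value):
    """Sketch of add_field(path,value) applied to one record."""
    return _set(record, path.split("."), value)


print(add_field({}, "foo.1.ok", "test"))
# prints {'foo': [None, {'ok': 'test'}]}
```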

Exercise 3

Fix script

Execute fixes from the command line or a Fix script

$ catmandu convert YAML to JSON --fix 'fix()' < data.yaml

$ catmandu convert YAML to JSON --fix 'fix();fix();fix()' < data.yaml

$ catmandu convert YAML to JSON --fix 'fix_file.fix'  < data.yaml

$ catmandu convert YAML to JSON --fix 'fix_file.fix' \
                                --fix 'fix_file2.fix' < data.yaml

Fix commands

Fixes can have one or more arguments and possibly one or more options. The fixes are executed in the order they are written in a Fix script.

collapse()

upcase(titles.*)
upcase("titles.*")

replace_all(title,foo,bar)
replace_all(title,"foo foo","bar bar")

lookup(name, names.csv, sep_char:"|")
sort_field(tags, reverse:1, numeric:1)

# Comment lines: anything written after a # is ignored
fix() # The fix is executed; this trailing comment is ignored

Fix conditions

Execute fixes only if some condition is true:

if exists(my.funny.field)
   fix()
   fix()
end

unless all_match(my.status,true)
   fix()
   fix()
end

Conditions:

exists(field) , all_match(field,value) , any_match(field,value) , ...
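
As a rough illustration of how these conditions could be evaluated, here is a Python sketch. It assumes that the value argument is treated as a regular expression and that an array at the path is tested element by element; the real Fix language may differ in details such as wildcard paths.

```python
import re

_MISSING = object()


def _find(record, path):
    """Return the value at a dotted path, or _MISSING if any part is absent."""
    node = record
    for part in path.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            return _MISSING
    return node


def _values(record, path):
    found = _find(record, path)
    if found is _MISSING:
        return []
    return found if isinstance(found, list) else [found]


def exists(record, path):
    return _find(record, path) is not _MISSING


def all_match(record, path, pattern):
    vals = _values(record, path)
    return bool(vals) and all(re.search(pattern, str(v)) for v in vals)


def any_match(record, path, pattern):
    return any(re.search(pattern, str(v)) for v in _values(record, path))


record = {"my": {"status": "true"}, "tags": ["red", "green"]}
print(exists(record, "my.funny.field"))  # prints False
print(all_match(record, "my.status", "true"))  # prints True
print(any_match(record, "tags", "^r"))  # prints True
```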

Fix reject/select records

Fix functions can be used to filter out the records you don't need:

reject()  # reject the entire record

reject exists(bad_data)   # reject the record if the bad_data
                          # field exists
                          
select()  # select the entire record

select all_match(title,'blabla')   # select the record if the
                                   # title contains blabla
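
Seen from outside the Fix language, reject and select behave like filters over a stream of records. A minimal Python sketch of that idea, with hypothetical sample records:

```python
import re


def select(records, condition):
    """Keep only the records for which the condition holds."""
    return [r for r in records if condition(r)]


def reject(records, condition):
    """Drop the records for which the condition holds."""
    return [r for r in records if not condition(r)]


# Hypothetical sample records, for illustration only.
records = [
    {"title": "blabla part one"},
    {"title": "something else", "bad_data": 1},
]

kept = reject(records, lambda r: "bad_data" in r)
chosen = select(records, lambda r: re.search("blabla", r.get("title", "")) is not None)
print(kept)    # prints [{'title': 'blabla part one'}]
print(chosen)  # prints [{'title': 'blabla part one'}]
```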

Fix binds

Fix binds group fixes in one computational strategy. They can provide a new context in which the fixes are executed.

upcase(my.very.deep.field)
upcase(my.very.deep.field2)

do with(path => my.very.deep)
  upcase(field)
  upcase(field2)
end
do hashmap(count: 1, exporter: CSV)
   copy_field(FIELD_TO_COUNT,key)
end
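
The "new context" idea behind the with bind can be sketched in Python as running the inner fixes against a sub-record instead of the root. The helpers with_path and upcase below are hypothetical stand-ins, not Catmandu code.

```python
def upcase(node, field):
    """Uppercase a string field in place; ignore missing or non-string values."""
    if isinstance(node.get(field), str):
        node[field] = node[field].upper()


def with_path(record, path, fixes):
    """Run each fix against the sub-record found at path instead of the root."""
    node = record
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return record  # path absent: nothing to do
        node = node[part]
    if isinstance(node, dict):
        for fix in fixes:
            fix(node)
    return record


record = {"my": {"very": {"deep": {"field": "a", "field2": "b"}}}}
with_path(record, "my.very.deep", [
    lambda node: upcase(node, "field"),
    lambda node: upcase(node, "field2"),
])
print(record["my"]["very"]["deep"])  # prints {'field': 'A', 'field2': 'B'}
```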

Fixes

Exercise 3 1/2

MARC

MARC map

$ catmandu convert MARC --type USMARC to YAML < camel.usmarc
copy_field(record.17.5,title)  ???
marc_map("245a",title)  !!!

MARC map

marc_map("FIELD[IND]SUBFIELD(S)",PATH,--OPT...)
marc_map("245",title) # Cross-platform Perl /Eric F. Johnson.
marc_map("245",title,join:" ") # Cross-platform Perl / Eric F. Johnson.
marc_map("245ac",title,join:" ") # Cross-platform Perl / Eric F. Johnson.
marc_map("245ca",title,join:" ") # Cross-platform Perl / Eric F. Johnson.
marc_map("245ca",title,join:" ",pluck:1) 
      # Eric F. Johnson. Cross-platform Perl /
marc_map("245a",title) # Cross-platform Perl

marc_map("655a", subject.$append)
$ perldoc Catmandu::Fix::marc_map
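
To make the join and pluck options concrete, here is a simplified Python sketch of the mapping. It assumes each MARC field is stored as a flat list [tag, ind1, ind2, code, value, code, value, ...]; that layout and the indicator values are illustrative assumptions, and the real marc_map supports more syntax than this.

```python
def marc_map(fields, spec, join="", pluck=False):
    """Return one joined string per field matching a spec such as '245ac'."""
    tag, codes = spec[:3], spec[3:]
    results = []
    for field in fields:
        if field[0] != tag:
            continue
        pairs = list(zip(field[3::2], field[4::2]))  # (subfield code, value)
        if not codes:
            values = [v for _, v in pairs]
        elif pluck:
            # follow the order of the codes in the spec
            values = [v for c in codes for code, v in pairs if code == c]
        else:
            # follow the order of the subfields in the record
            values = [v for code, v in pairs if code in codes]
        if values:
            results.append(join.join(values))
    return results


FIELDS = [["245", "1", "0", "a", "Cross-platform Perl /", "c", "Eric F. Johnson."]]
print(marc_map(FIELDS, "245ca", join=" "))
# prints ['Cross-platform Perl / Eric F. Johnson.']
print(marc_map(FIELDS, "245ca", join=" ", pluck=True))
# prints ['Eric F. Johnson. Cross-platform Perl /']
```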

Exercise 4

RDF

SUBJECT PREDICATE OBJECT

http://en.wikipedia.org/wiki/Albert_Einstein
   http://purl.org/dc/elements/1.1/creator
      http://en.wikipedia.org/wiki/Theory_of_relativity
http://lib.ugent.be/record/012918219 dct:title "War and Peace"
http://lib.ugent.be/record/012918219 dct:creator "Leo Tolstoy"
http://lib.ugent.be/record/012918219 dct:identifier "1400079985"
http://lib.ugent.be/record/012918219 dct:identifier "978-1400079988"
SELECT ?subject WHERE { ?subject dct:creator "Leo Tolstoy" }
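
The query above is a single triple pattern: fix the predicate and object, leave the subject open. A minimal Python sketch of that matching, using the triples from this slide (the match function is a hypothetical helper, not a SPARQL engine):

```python
RECORD = "http://lib.ugent.be/record/012918219"

# Triples from the slide, as (subject, predicate, object) tuples.
TRIPLES = [
    (RECORD, "dct:title", "War and Peace"),
    (RECORD, "dct:creator", "Leo Tolstoy"),
    (RECORD, "dct:identifier", "1400079985"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that agree with every part that is not None."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


subjects = [t[0] for t in match(TRIPLES, p="dct:creator", o="Leo Tolstoy")]
print(subjects)  # prints ['http://lib.ugent.be/record/012918219']
```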

Catmandu RDF

---
_id: http://lib.ugent.be/record/012918219
dct_title: "War and Peace"
dct_creator: "Leo Tolstoy"
dct_identifier:
    - 1400079985
    - 978-1400079988
$ catmandu convert YAML to RDF --type NTriples --fix myfixes.fix < test.yaml

$ catmandu convert MARC to RDF --type NTriples --fix marc.fix < data.mrc
http://lib.ugent.be/record/012918219 dct:identifier "1400079985"

http://lib.ugent.be/record/012918219 dct:identifier urn:isbn:1400079985

Exercise 5

RDF Reconciliation

Reconciliation is identifying multiple representations of the same real-world object.

  • Dostoyevsky
  • Dostoevskij
  • Dostojevskij
  • Dostoievski

Who is the real Jan Jansen?

  • Jansen, Jan
  • Jansen, Jan
  • Jansen, Jan

VIAF http://viaf.org

VIAF LDF http://data.linkeddatafragments.org/viaf

Catmandu RDF

$ catmandu convert RDF --url http://data.linkeddatafragments.org/viaf \
   --sparql 'SELECT * {?s ?p "Einstein, Albert"}'
---
name: Einstein, Albert, 1879-1955
rdf_ldf_statements(name,
   url:"http://data.linkeddatafragments.org/viaf",
   predicate:"http://schema.org/alternateName")
---
name:
  - http://viaf.org/viaf/75121530

Exercise 6

LDF Server

LDF Server config

{
  "title": "My Linked Data Fragments server",
  "datasources": {
    "rug01": {
      "title": "Catalog",
      "type": "HdtDatasource",
      "description": "Catalog sample",
      "settings": { "file":
      "/home/catmandu/LinkedDataFragments/data/rug01.hdt" }
    }
  },

  "prefixes": {
    "rdf":         "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs":        "http://www.w3.org/2000/01/rdf-schema#",
    "xsd":         "http://www.w3.org/2001/XMLSchema#",
    "dc":          "http://purl.org/dc/terms/",
    "foaf":        "http://xmlns.com/foaf/0.1/",
    "dbpedia":     "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpprop":     "http://dbpedia.org/property/",
    "hydra":       "http://www.w3.org/ns/hydra/core#",
    "void":        "http://rdfs.org/ns/void#"
  }
}

LDF Server boot

$ ldf-server config.json

Exercise 7