sshoc-skosmapping/sshoc_31_skos-yarrrml.ipynb

Mapping SSHOC Multilingual Metadata to SKOS resources

This notebook implements a simple parser that transforms the SSHOC Multilingual Metadata, created in Task 3.1 of the SSHOC project and published as a spreadsheet, into a SKOS resource. The parser reads the spreadsheet and transforms its content following a set of mapping rules defined in YARRRML. The result is stored as Turtle and RDF/XML files and loaded into a Fuseki server.

In [1]:
import pandas as pd
import rdflib
import itertools
import yaml
import datetime
import json
from jsonpath_ng import jsonpath, parse
from rdflib.namespace import DC, DCAT, DCTERMS, OWL, \
                            RDF, RDFS, SKOS,  \
                           XMLNS, XSD
from rdflib import Namespace
from rdflib import URIRef, BNode, Literal

The file config.yaml contains the external information used in the parsing, including the path of the spreadsheets. Set the correct values before running the notebook. The file rules.yaml contains the YARRRML mapping rules.

In [2]:
try:
    with open("config.yaml", 'r') as stream:
        try:
            conf=yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            print(exc)
    with open("rules.yaml", 'r') as stream:
        try:
            rules=yaml.safe_load(stream)
        except yaml.YAMLError as exc:
            print(exc)
except FileNotFoundError as exc:
    # Either file may be the missing one, so report its actual name
    print(f'Warning: {exc.filename} not found! Please store it in the same directory as the notebook')
#print (conf)
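
The cells further down read `conf['Source']['METADATASOURCE']` and five entries under `conf['Texts']`. The following is a minimal sketch of a check for those keys, assuming they are the only ones the notebook needs. The helper name `missing_config_keys` and the sample values are hypothetical and not part of the original notebook.

```python
# Keys that the later cells read from config.yaml (taken from the notebook's own code).
REQUIRED_KEYS = {
    "Source": ["METADATASOURCE"],
    "Texts": [
        "METADATATITLE",
        "METADATADESCRIPTION",
        "METADATAID",
        "METADATACREATEDATE",
        "METADATAVERSION",
    ],
}


def missing_config_keys(conf):
    """Return dotted names of required keys absent from conf (hypothetical helper)."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        # A missing or non-dict section is treated as an empty one.
        values = conf.get(section) if isinstance(conf, dict) else None
        if not isinstance(values, dict):
            values = {}
        for key in keys:
            if key not in values:
                missing.append(f"{section}.{key}")
    return missing


# Placeholder configuration, for illustration only.
sample_conf = {
    "Source": {"METADATASOURCE": "metadata.csv"},
    "Texts": {"METADATATITLE": "Example title"},
}
print(missing_config_keys(sample_conf))
```

Calling `missing_config_keys(conf)` right after loading the file would surface an incomplete configuration before the download step fails with a `KeyError`.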

The following function implements a basic parser that processes the YARRRML rules and creates an RDF graph.

In [3]:
def jsonmapper (json_source, rules):
    """Apply a subset of YARRRML rules to a JSON document and return an rdflib Graph.

    Each mapping selects records with the JSONPath in its first source, builds the
    subject URI from 's' (prefix:local or prefix:$(path)), and adds one triple per
    'po' entry. An 'a' entry adds an rdf:type triple; every other entry adds a
    literal, with an optional third element such as 'en~lang' setting the language.
    """
    ccr_y = rdflib.Graph()
    prefixes=rules['prefixes']
    nss={}
    for key in prefixes:
        myns=Namespace(URIRef(prefixes[key]))
        nss[key]=myns
        ccr_y.bind(key, nss[key])
    mappings=rules['mappings']
    for mapi in mappings:
        jsonpath_expression = parse(mappings[mapi]['sources'][0][1])
        source_list = [match.value for match in jsonpath_expression.find(json_source)] 
        for source in source_list:
            labelst=mappings[mapi]['s'].split(':')
            if ('$' in labelst[1]):
                labelpath=labelst[1].replace('$(','').replace(')','')
                labelexpression=parse(labelpath)
                lb_ids_list = [match.value for match in labelexpression.find(source)]
                labelid=lb_ids_list[0]
            else:
                labelid=labelst[1]
            labns=nss[labelst[0]]
            urilabel=labns[labelid]
            propsobs=mappings[mapi]['po']
            for popob in propsobs:
                if (popob[0]=='a'):
                    myob=popob[1].split(':')
                    tpns=nss[myob[0]]
                    ccr_y.add((urilabel, RDF.type, tpns[myob[1]]))
                    continue

                myspath=(f"{popob[1].replace('$(','').replace(')','')}")
                po_expression = parse(myspath)
                po_ids_list = [match.value for match in po_expression.find(source)]
                lang=''
                if (len(popob) >2 and ('lang' in popob[2])):
                    lang=popob[2].replace('~lang','')
                for poval in po_ids_list:
                    # Build the literal once, attaching the language tag when one was given
                    ob=Literal(poval, lang=lang) if lang else Literal(poval)
                    prns=nss[popob[0].split(':')[0]]
                    ccr_y.add((urilabel, prns[popob[0].split(':')[1]], ob))
    return ccr_y
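
To make the expected input of `jsonmapper` concrete, here is a sketch of a rule set in the shape that `yaml.safe_load` would return, together with the way the subject template `s` is taken apart. The rule contents, the `ex` prefix, and the helpers `split_subject_template` and `resolve_subject` are hypothetical; they are not the project's rules.yaml. The notebook resolves `$(...)` references with jsonpath_ng, whereas this sketch swaps in a plain dictionary lookup that only covers top-level fields.

```python
# Hypothetical rule set; names and paths are placeholders for illustration.
rules_example = {
    "prefixes": {
        "skos": "http://www.w3.org/2004/02/skos/core#",
        "ex": "http://example.org/concept/",
    },
    "mappings": {
        "concept": {
            # jsonmapper reads the JSONPath from sources[0][1]
            "sources": [["data.json~jsonpath", "$.concepts[*]"]],
            "s": "ex:$(ConceptId)",
            "po": [
                ["a", "skos:Concept"],
                ["skos:prefLabel", "$(Englishterm)", "en~lang"],
            ],
        }
    },
}


def split_subject_template(template):
    """Split 'prefix:local' the way jsonmapper does (hypothetical helper)."""
    prefix, local = template.split(":")[:2]
    if "$" in local:
        # '$(ConceptId)' becomes 'ConceptId', a reference into each record
        return prefix, local.replace("$(", "").replace(")", ""), True
    return prefix, local, False


def resolve_subject(template, record, namespaces):
    """Build the subject URI string for one record (top-level fields only)."""
    prefix, local, is_reference = split_subject_template(template)
    local_id = record[local] if is_reference else local
    return namespaces[prefix] + str(local_id)


print(resolve_subject("ex:$(ConceptId)", {"ConceptId": "example-1"},
                      rules_example["prefixes"]))
```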
        

Download SSHOC Multilingual Metadata spreadsheet

In [4]:
mdurl=conf['Source']['METADATASOURCE']
df_metadata=pd.read_csv(mdurl)
In [5]:
df_metadata.rename(columns = {'English': 'Englishterm', 'Unnamed: 1':'Englishdefinition', 'Unnamed: 2':'source',
                             'Unnamed: 3':'URI', 'Dutch':'Dutchterm', 'Unnamed: 5':'Dutchdefinition', 
                             'French':'Frenchterm', 'Unnamed: 7':'Frenchdefinition',
                             'Greek':'Greekterm', 'Unnamed: 9':'Greekdefinition',
                             'Italian':'Italianterm', 'Unnamed: 11':'Italiandefinition'}, inplace = True)
df_metadata=df_metadata.drop(0)
df_metadata['ConceptId']=df_metadata['URI'].apply(lambda y: y.replace('http://hdl.handle.net/11459/',''))
df_metadata['source']=df_metadata['source'].apply(lambda y: y.replace('(source: ','').replace(')',''))
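
The clean-up of the `URI` and `source` columns can be illustrated on single values. This is a sketch in plain Python, assuming that cells may occasionally be empty; the function names and sample strings are hypothetical, and non-string values are passed through unchanged rather than raising an error.

```python
HANDLE_PREFIX = "http://hdl.handle.net/11459/"


def concept_id(uri):
    """Strip the handle prefix, leaving only the concept identifier."""
    if not isinstance(uri, str):
        return uri  # e.g. None or NaN from an empty spreadsheet cell
    return uri.replace(HANDLE_PREFIX, "")


def clean_source(text):
    """Remove the '(source: ...)' wrapper around a source note."""
    if not isinstance(text, str):
        return text
    return text.replace("(source: ", "").replace(")", "")


# Placeholder values, not taken from the real spreadsheet.
print(concept_id(HANDLE_PREFIX + "example-1"))
print(clean_source("(source: Example glossary)"))
```

Both functions can be passed to `Series.apply`, for example `df_metadata['URI'].apply(concept_id)`.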
In [6]:
df_metadata.to_json('data/file.json', orient='records')
In [7]:
myjson=df_metadata.to_json(orient='records')
concepts_metadata=json.loads(myjson)
json_metadata={'concepts': concepts_metadata}
json_metadata['title']=conf['Texts']['METADATATITLE']
json_metadata['description']=conf['Texts']['METADATADESCRIPTION']
json_metadata['id']=conf['Texts']['METADATAID']
json_metadata['createdate']=conf['Texts']['METADATACREATEDATE']
json_metadata['version']=conf['Texts']['METADATAVERSION']
In [10]:
skosgraph=jsonmapper(json_metadata, rules)
In [9]:
skosgraph.serialize(destination='data/skosccr_y.ttl', format="turtle")
skosgraph.serialize(destination='data/skosccr_y.rdf', format="pretty-xml")
Out[9]:
<Graph identifier=Nd89e673445d645f69476105bb18cad6a (<class 'rdflib.graph.Graph'>)>
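
The introduction mentions a Fuseki server, but none of the cells show that step. The following is a minimal sketch using only the standard library and the SPARQL 1.1 Graph Store HTTP Protocol. It assumes a Fuseki instance at `http://localhost:3030` with a dataset named `sshoc` that exposes a writable `data` service; the URL, the dataset name, and both function names are assumptions and not part of the original notebook.

```python
import urllib.request


def build_upload_request(dataset_url, turtle_bytes):
    """Build a Graph Store Protocol POST that adds triples to the default graph."""
    return urllib.request.Request(
        url=f"{dataset_url.rstrip('/')}/data?default",
        data=turtle_bytes,
        headers={"Content-Type": "text/turtle; charset=utf-8"},
        method="POST",
    )


def upload_turtle(dataset_url, path):
    """Send a Turtle file to the server and return the HTTP status code."""
    with open(path, "rb") as handle:
        request = build_upload_request(dataset_url, handle.read())
    with urllib.request.urlopen(request) as response:
        return response.status


# Build (but do not send) a request against the assumed local dataset.
request = build_upload_request("http://localhost:3030/sshoc", b"")
print(request.get_method(), request.full_url)
```

With a server running, `upload_turtle("http://localhost:3030/sshoc", "data/skosccr_y.ttl")` would send the file written above. POST merges the triples into the default graph; a PUT would replace its contents instead.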