QuickGraph#13 Using a SKOS taxonomy for semantic search on a document repository

The TESEO database is an online repository containing the details of all PhD theses from Spanish universities. It offers an HTML form-based search interface where you can look up theses by author, topic, university, etc. As a UI it is rather painful to use and quite limited, I must say, but that’s another story. While we wait for an open data version of this public content, we have to find workarounds to query and analyse it. This is what this QuickGraph is about.

Interestingly, one of the ways you can search the database is by field of study, and TESEO uses the UNESCO nomenclature for fields of science and technology (a standard proposed back in 1988). It is possible, for example, to search for PhD theses in the field of Linear Programming (UNESCO code 1207.09) or Ethnolinguistics (UNESCO code 5705.02).

To run some interesting (and automated) queries on the DB, I needed a machine-readable form of the UNESCO nomenclature, so I continued my research and found that a team at the University of Murcia had formalised the UNESCO nomenclature as a SKOS taxonomy. You can explore it online here.

My plan is to run semantic searches on the TESEO portal. For example, “find PhD theses in fields related to Isotopes” or “broader than Model theory”. This will require exploring the relationships between concepts in the SKOS taxonomy and using the results of this exploration to query the TESEO database. Easy, right?

I think I’ve got all I need:

  • A DB that can be queried (with some pain) via HTTP.
  • A public SKOS taxonomy accessible through a SPARQL endpoint.
  • n10s and APOC to integrate all the parts together on Neo4j.

Loading the UNESCO nomenclature into Neo4j

Neosemantics (n10s) includes methods to import SKOS concept schemes into Neo4j in a fully automated way. We saw in the previous QuickGraph how the n10s.skos.import.* procedures build a simplified model of the taxonomy in Neo4j. For today’s experiment, we will import the SKOS taxonomy using the generic n10s.rdf.import.* procedures. SKOS is just an RDF vocabulary, so we can import it as RDF and get a like-for-like representation of every statement (triple) in the UNESCO nomenclature as nodes, relationships and attributes in Neo4j.

Here’s what the process looks like:

//create the n10s configuration: multival & multilang
CALL n10s.graphconfig.init( 
           { handleMultival : "ARRAY", keepLangTag : true } );

//define a param with the sparql query 
:param sparql=> "CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <http://skos.um.es/unesco6> { ?s ?p ?o } } "

//import the skos taxonomy excluding irrelevant predicates
CALL n10s.rdf.import.fetch(
"https://skos.um.es/sparql/?query=" + apoc.text.urlencode($sparql) + "&output=turtle", "Turtle", 
{ predicateExclusionList : ["http://www.w3.org/2004/02/skos/core#topConceptOf", "http://www.w3.org/2004/02/skos/core#inScheme"]})

The call to the n10s.rdf.import.fetch procedure does three things:

  1. it issues a request to the SPARQL endpoint with a query that returns the whole UNESCO nomenclature as RDF triples.
  2. it filters out certain triples based on the exclusion list passed as parameter. In this case we are excluding some redundant/useless statements (feel free to modify the list).
  3. it persists the triples into Neo4j as nodes and relationships.
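
Two small additions around the import, as a sketch rather than part of the original recipe. First, n10s expects a uniqueness constraint on Resource.uri to exist before any RDF import; the syntax below assumes Neo4j 4.x (on Neo4j 5 the equivalent would be CREATE CONSTRAINT ... FOR (r:Resource) REQUIRE r.uri IS UNIQUE). Second, a quick count after the import helps confirm that something was actually loaded. The result column names are my own.

```cypher
// Run once, BEFORE the n10s.rdf.import.fetch call (Neo4j 4.x syntax assumed).
CREATE CONSTRAINT n10s_unique_uri ON (r:Resource) ASSERT r.uri IS UNIQUE;

// Run AFTER the import: count concepts and hierarchical links.
MATCH (c:skos__Concept)
OPTIONAL MATCH (c)-[n:skos__narrower]->()
RETURN count(DISTINCT c) AS concepts, count(n) AS narrowerLinks;
```

If both counts come back as zero, the SPARQL endpoint or the graph name in the query is the first thing I would check.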

Exploring the graph

Once imported into Neo4j, the taxonomy consists of a set of small clusters of concepts organised in trees using skos:narrower and its inverse, skos:broader. Here’s a bird’s eye view on Bloom.

From a Property Graph point of view, it is redundant and therefore pointless to have both relationships (skos:narrower and skos:broader) in the graph, because a relationship can be traversed in either direction. Feel free to remove one of them with a simple Cypher expression like this:

MATCH (:skos__Concept)-[br:skos__broader]->() 
DELETE br
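
Before running the delete, it may be worth checking that every skos:broader link really has a mirrored skos:narrower link, since that is the assumption that makes the removal lossless. A minimal sketch (the variable and column names are mine):

```cypher
// Count skos:broader links that lack the mirrored skos:narrower link.
// A result of 0 suggests skos:broader carries no extra information.
MATCH (child:skos__Concept)-[:skos__broader]->(parent)
WHERE NOT (parent)-[:skos__narrower]->(child)
RETURN count(*) AS broaderWithoutInverse
```

The rest of the queries in this post only traverse skos:narrower, so they work whether or not skos:broader is kept.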

If we zoom into one of the clusters, for example the one on Astronomy and astrophysics (UNESCO code 21), we can appreciate its hierarchical structure. This is even more obvious when applying the hierarchical layout.

This query will get the categories under Astronomy and astrophysics. Note that I’m looking it up by URI.

MATCH tree = (:Resource { uri: "http://skos.um.es/unesco6/21"})-[:skos__narrower*]->() RETURN tree
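
If you prefer a table to a graph visualisation, the same traversal can return one row per category. This is just a sketch reusing the properties that appear elsewhere in this post (skos__notation and skos__prefLabel); the column names are my own.

```cypher
// One row per category under Astronomy and astrophysics,
// with its depth in the tree, its UNESCO code and its English label.
MATCH path = (:Resource { uri: "http://skos.um.es/unesco6/21" })-[:skos__narrower*]->(c)
RETURN length(path) AS depth,
       c.skos__notation[0] AS code,
       n10s.rdf.getLangValue("en", c.skos__prefLabel) AS label
ORDER BY code
```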

In addition to the hierarchical relationships, there are skos:related relationships connecting transversally related concepts in different subtrees and acting as bridges across them. Here is an example of how Political Science (UNESCO code 59) relates to Philosophy (UNESCO code 72). Note the skos:related relationships displayed in orange.

The following query returns in a tabular form the bridging points (pairs) between the two trees:

MATCH (politics:Resource { uri: "http://skos.um.es/unesco6/59"})-[:skos__narrower*]->(bridge1)
      -[:skos__related]-(bridge2)
      <-[:skos__narrower*]-(philosophy:Resource { uri: "http://skos.um.es/unesco6/72"})
RETURN DISTINCT
  n10s.rdf.getLangValue("en", bridge1.skos__prefLabel) AS bridgeOnPolitics,
  n10s.rdf.getLangValue("en", bridge2.skos__prefLabel) AS bridgeOnPhilosophy

Producing as result:

Running semantic search on the repository of PhD Theses

In order to query the Teseo database, we need to generate HTTP requests with the following structure:

https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=17&cursoDesde2=18&
descriptor1.termino=[<CODE>]%20-%20<NAME>&descriptor1.idGen=<CODE_1>&descriptor1.idMed=<CODE_2>&descriptor1.idEsp=<CODE_3>

The server URL and the first set of request parameters are fixed, but the field of study is passed dynamically, and we will generate it on the fly from the results of exploring the SKOS taxonomy. Interestingly, the code of the nomenclature term is required twice: once prefixing the term name (which must be passed in Spanish and in upper case), and once broken down into three parts and passed as descriptor1.idGen, descriptor1.idMed and descriptor1.idEsp.
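
To make the code handling concrete, here is a small sketch that takes the Ethnolinguistics code mentioned earlier (5705.02) and derives the pieces the request needs. It runs without any data loaded; the column names are mine, chosen to match the request parameters.

```cypher
// Ethnolinguistics, UNESCO code 5705.02: strip the dot, then cut into 2-digit parts.
WITH replace("5705.02", ".", "") AS cat_code
RETURN "[" + cat_code + "]"      AS termPrefix,  // "[570502]"
       substring(cat_code, 0, 2) AS idGen,       // "57"
       substring(cat_code, 2, 2) AS idMed,       // "05"
       substring(cat_code, 4, 2) AS idEsp        // "02"
```

The queries in the next sections apply exactly this replace/substring logic to the skos__notation value of each concept.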

Multilingual searches

It is quite straightforward to run searches on the TESEO DB using fields of study in different languages. The next Cypher fragment shows how to return theses in the field of “Ocean-bottom processes”.

//multilingual semantic search
MATCH (c:skos__Concept) 
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) contains "Ocean-bottom processes"
WITH replace(c.skos__notation[0],".","") as cat_code , toUpper(n10s.rdf.getLangValue("es", c.skos__prefLabel)) as cat_name_in_url
CALL apoc.load.html("https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=15&cursoDesde2=16&descriptor1.termino=[" + cat_code +"]%20-%20" + apoc.text.urlencode(cat_name_in_url) + "&descriptor1.idGen=" + substring(cat_code,0,2) + "&descriptor1.idMed=" + substring(cat_code,2,2) + "&descriptor1.idEsp=" + substring(cat_code,4,2) + "&rpp=500",{metadata:"label"}) yield value
UNWIND value.metadata as result
RETURN result.text as thesisTitle

We can see that the query starts with a topic search by English name in the UNESCO nomenclature; the matching concept is then used to construct the HTTP request that returns the matching theses (titles). The previous query returns a list of 18 results. Note that the search is restricted to theses published from 2015 onwards.

Identical results could have been obtained by running a search with the French term “Processus des fonds océaniques”. More on multilingual thesaurus management with n10s in this previous post.

//multilingual semantic search
MATCH (c:skos__Concept) 
WHERE n10s.rdf.getLangValue("fr", c.skos__prefLabel) contains "Processus des fonds océaniques"
WITH replace ...
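
To see why the language of the search term does not matter, it can help to look at what is stored on the concept node. The sketch below assumes the import configuration used above (handleMultival ARRAY, keepLangTag true), under which I would expect skos__prefLabel to hold one language-tagged value per language; the column names are mine.

```cypher
// Show the raw multilingual label array next to the per-language values.
MATCH (c:skos__Concept)
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) CONTAINS "Ocean-bottom processes"
RETURN c.skos__prefLabel AS allLabels,
       n10s.rdf.getLangValue("es", c.skos__prefLabel) AS spanish,
       n10s.rdf.getLangValue("fr", c.skos__prefLabel) AS french
```

Whichever language is used to find the concept, the request to TESEO is always built from the Spanish label and the notation of that same node.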

Semantic query expansion

More interesting is the possibility of leveraging the different relationships in the SKOS taxonomy to enrich the search results with related ones. In this case, we are extending the previous query by navigating the skos:related relationship to get related fields. Here is how to get all theses in “Cosmochemistry” and related fields.

MATCH (c:skos__Concept)-[:skos__related]->(related) 
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) contains "Cosmochemistry"
WITH replace(related.skos__notation[0],".","") as cat_code , toUpper(n10s.rdf.getLangValue("es", related.skos__prefLabel)) as cat_name_in_url, n10s.rdf.getLangValue("en", related.skos__prefLabel) as relatedCat
CALL apoc.load.html("https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=15&cursoDesde2=16&descriptor1.termino=[" + cat_code +"]%20-%20" + apoc.text.urlencode(cat_name_in_url) + "&descriptor1.idGen=" + substring(cat_code,0,2) + "&descriptor1.idMed=" + substring(cat_code,2,2) + "&descriptor1.idEsp=" + substring(cat_code,4,2) + "&rpp=500",{metadata:"label"}) yield value
UNWIND value.metadata as result
RETURN cat_code as relatedCatCode, relatedCat, result.text as thesisTitle

The query starts from the node representing the “Cosmochemistry” field and explores other nodes connected to it via the skos:related relationship…

…and issues a request to the TESEO DB for each related topic. The result is a set of additional items that could be used to populate a “you may also find the following results interesting” section. The table below, with the results of the previous query, includes PhD theses in the topics represented by the related nodes in the graph (“Stellar composition”, “Planetary geology”, and “Interplanetary matter”). Note that the query returns only the additional (related) results, not the ones tagged explicitly with “Cosmochemistry”.
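
If you want the original field included alongside the related ones, one option is a variable-length pattern with a lower bound of zero, so that the starting concept also matches. This is a sketch of the idea only; it lists the fields that would be searched and leaves out the apoc.load.html call, which would be the same as in the query above. The names target, fieldName and isOriginal are mine.

```cypher
// *0..1 matches the concept itself (0 hops) plus its directly related concepts (1 hop).
MATCH (c:skos__Concept)-[:skos__related*0..1]->(target)
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) CONTAINS "Cosmochemistry"
RETURN DISTINCT replace(target.skos__notation[0], ".", "") AS cat_code,
       n10s.rdf.getLangValue("en", target.skos__prefLabel) AS fieldName,
       target = c AS isOriginal
```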

It’s easy to get the gist, I guess, so feel free to modify the code to fit your needs. For example, explore other taxonomy relationships like skos:narrower to get finer-grained results.
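
As a starting point for the skos:narrower variant, here is a sketch that collects every subfield under Astronomy and astrophysics (UNESCO code 21) and prepares the two values the TESEO request needs. I am assuming that only six-digit codes can be split into the three descriptor parts, hence the size filter; that restriction is my own guess, not something TESEO documents.

```cypher
// Walk down the tree from the top-level field and keep the most specific (six-digit) codes.
MATCH (:Resource { uri: "http://skos.um.es/unesco6/21" })-[:skos__narrower*]->(sub)
WITH replace(sub.skos__notation[0], ".", "") AS cat_code,
     toUpper(n10s.rdf.getLangValue("es", sub.skos__prefLabel)) AS cat_name_in_url
WHERE size(cat_code) = 6
RETURN cat_code, cat_name_in_url
ORDER BY cat_code
```

From here, the apoc.load.html call from the earlier queries can be appended after the WITH clause to issue one request per subfield. Keep in mind that a broad field can produce a lot of requests.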

What’s interesting about this QuickGraph?

This QG shows how straightforward it can be to enhance a document search service with semantic search capabilities. In this example we’ve used a public nomenclature (taxonomy), but we could also have created our own custom ontology, as we’ve shown in previous posts like this one.

Maybe this post can give whoever owns the TESEO portal (probably https://www.ciencia.gob.es/) some ideas when they decide to revamp it. But more generally, if you want to power your document search service with semantic capabilities, you have here some concepts for getting started.

Also, a couple of days ago I asked datos.gob.es (the entity that “manages the Public Sector Open Data Catalog, and promotes advanced services based on them”) why the TESEO data was not openly available instead of being kept hidden behind this not-great UI. I’ve had no response yet, unfortunately, but I’m patient 🙂

As usual, give this a try and give us your feedback! See you in the next post or at the Neo4j community site. Bye 2020!