QuickGraph#13 Using a SKOS taxonomy for semantic search on a document repository

The TESEO database is an online repository containing the details of all PhD theses from Spanish universities. It offers an HTML form-based search interface where you can look up theses by author, topic, university, etc. As a UI it is rather painful to use and quite limited, I must say, but that’s another story. While we wait for an open data version of this public content we have to find workarounds to query and analyse it. This is what this QuickGraph is about.

Interestingly, one of the ways you can search the database is by field of study, and TESEO uses the UNESCO nomenclature for fields of science and technology (a standard proposed back in 1988). It is possible, for example, to search for PhD theses in the field of Linear Programming (UNESCO code 1207.09) or Ethnolinguistics (UNESCO code 5705.02).

If we want to run some interesting (and automated) queries against the DB, we need a machine-readable form of the UNESCO nomenclature, so I continued my research and found that a team at the University of Murcia had formalised the UNESCO nomenclature as a SKOS taxonomy. You can explore it online here.

My plan is to run semantic searches on the TESEO portal. For example, “find PhD theses on fields related to Isotopes” or “broader than Model theory”. This will require exploring the relationships between concepts in the SKOS taxonomy and using the results of this exploration to query the TESEO database. Easy, right?

I think I’ve got all I need:

  • A DB that can be queried (with some pain) via HTTP.
  • A public SKOS taxonomy accessible through a SPARQL endpoint.
  • n10s and APOC to integrate all the parts together on Neo4j.

Loading the UNESCO nomenclature into Neo4j

Neosemantics (n10s) includes methods to import SKOS concept schemes into Neo4j in a fully automated way. We saw in the previous QuickGraph how the n10s.skos.import.* procedures build a simplified model of the taxonomy in Neo4j. For today’s experiment, we will import the SKOS taxonomy using the generic n10s.rdf.import.* procedures. SKOS is just an RDF vocabulary, so we can import it as RDF and we’ll get a like-for-like representation of every statement (triple) in the UNESCO nomenclature as nodes/relationships/attributes in Neo4j.

Here’s what the process looks like:

//create the n10s configuration: multival & multilang
CALL n10s.graphconfig.init( 
           { handleMultival : "ARRAY", keepLangTag : true } );

//define a param with the sparql query 
:param sparql=> "CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <http://skos.um.es/unesco6> { ?s ?p ?o } } "

//import the skos taxonomy excluding irrelevant predicates
CALL n10s.rdf.import.fetch(
"https://skos.um.es/sparql/?query=" + apoc.text.urlencode($sparql) + "&output=turtle", "Turtle", 
{ predicateExclusionList : ["http://www.w3.org/2004/02/skos/core#topConceptOf", "http://www.w3.org/2004/02/skos/core#inScheme"]})

The call to the n10s.rdf.import.fetch procedure does three things:

  1. it issues a request to the SPARQL endpoint with a query that returns the whole UNESCO nomenclature as RDF triples.
  2. it filters out certain triples based on the exclusion list passed as parameter. In this case we are excluding some redundant/useless statements (feel free to modify the list).
  3. it persists the triples into Neo4j as nodes and relationships.
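If you want to inspect the RDF before importing it, the request from step 1 can be reproduced outside Neo4j. The following Python sketch only builds the URL. It assumes the endpoint accepts the same query and output parameters used in the Cypher call above, and build_fetch_url is a hypothetical helper name, not part of n10s or APOC.

```python
from urllib.parse import quote_plus

# Endpoint and named graph are copied from the Cypher call in this post.
SPARQL_ENDPOINT = "https://skos.um.es/sparql/"
UNESCO_GRAPH = "http://skos.um.es/unesco6"


def build_fetch_url(graph_uri: str = UNESCO_GRAPH) -> str:
    """Return a URL asking the endpoint for every triple in graph_uri as Turtle.

    Hypothetical helper for illustration only.
    """
    sparql = (
        f"CONSTRUCT {{ ?s ?p ?o }} "
        f"WHERE {{ GRAPH <{graph_uri}> {{ ?s ?p ?o }} }}"
    )
    # quote_plus form-encodes the query text, which plays the role that
    # apoc.text.urlencode plays in the Cypher version.
    return f"{SPARQL_ENDPOINT}?query={quote_plus(sparql)}&output=turtle"


if __name__ == "__main__":
    print(build_fetch_url())
```

Pasting the printed URL into a browser or an HTTP client should return the Turtle document that n10s would parse, provided the endpoint behaves as assumed.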

Exploring the graph

Once imported into Neo4j, the taxonomy is formed of a set of small clusters of concepts organised in trees using the skos:narrower relationship and its inverse, skos:broader. Here’s a bird’s eye view on Bloom.

From a Property Graph point of view, it is redundant and therefore pointless to have both relationships (skos:narrower and skos:broader) in the graph, so feel free to remove one of them with a simple Cypher expression like this:

//delete the skos:broader relationships, keeping only skos:narrower
MATCH (:skos__Concept)-[br:skos__broader]->() 
DELETE br

If we zoom into one of the clusters, for example the one on Astronomy and astrophysics (UNESCO code 21), we can appreciate its hierarchical structure. This is even more obvious when applying the hierarchical layout.

This query will get the categories under Astronomy and astrophysics. Note that I’m looking it up by URI.

MATCH tree = (:Resource { uri: "http://skos.um.es/unesco6/21"})-[:skos__narrower*]->() RETURN tree

In addition to the hierarchical relationships, there are skos:related relationships connecting transversally related concepts in different taxonomies and acting as bridges across subtrees. Here is an example of how Political Science (UNESCO code 59) relates to Philosophy (UNESCO code 72). Note the skos:related relationships displayed in orange.

The following query returns in a tabular form the bridging points (pairs) between the two trees:

MATCH (politics:Resource { uri: "http://skos.um.es/unesco6/59"})-[:skos__narrower*]->(bridge1)-[:skos__related]-(bridge2)<-[:skos__narrower*]-(philosophy:Resource { uri: "http://skos.um.es/unesco6/72"})
RETURN DISTINCT n10s.rdf.getLangValue("en", bridge1.skos__prefLabel) as bridgeOnPolitics, n10s.rdf.getLangValue("en", bridge2.skos__prefLabel) as bridgeOnPhilosophy

Producing as result:

Running semantic search on the repository of PhD Theses

In order to query the Teseo database, we need to generate HTTP requests with the following structure:


The server URL and the first set of request parameters are fixed, but the field of study is passed dynamically and we will generate it on the fly from the results of exploring the SKOS taxonomy. Interestingly, the code of the nomenclature term is required twice: once as a prefix to the term name (which must be passed in Spanish and in upper case), and once broken down into three parts and passed as descriptor1.idGen, descriptor1.idMed and descriptor1.idEsp.
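To make the request structure concrete, here is a small Python sketch that assembles the same URL the Cypher queries in this post build by string concatenation. The function name build_teseo_url is hypothetical, and the Spanish label “Etnolingüística” used in the example is my assumption for the Ethnolinguistics term (code 5705.02). In practice the label would come from the skos:prefLabel in the taxonomy.

```python
from urllib.parse import quote_plus

TESEO_SEARCH = "https://www.educacion.gob.es/teseo/listarBusqueda.do"
# Fixed request parameters, copied from the Cypher queries in this post.
FIXED_PARAMS = "tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=15&cursoDesde2=16"


def build_teseo_url(notation: str, spanish_label: str, page_size: int = 500) -> str:
    """Build a TESEO search URL for one field of study (hypothetical helper).

    notation      -- UNESCO code as found in skos:notation, e.g. "5705.02"
    spanish_label -- Spanish preferred label of the concept
    """
    code = notation.replace(".", "")  # "5705.02" -> "570502"
    # The term is sent as "[code] - LABEL", with the label in upper case.
    term = f"[{code}]%20-%20{quote_plus(spanish_label.upper())}"
    return (
        f"{TESEO_SEARCH}?{FIXED_PARAMS}"
        f"&descriptor1.termino={term}"
        f"&descriptor1.idGen={code[0:2]}"  # first two digits
        f"&descriptor1.idMed={code[2:4]}"  # middle two digits
        f"&descriptor1.idEsp={code[4:6]}"  # last two digits
        f"&rpp={page_size}"
    )


if __name__ == "__main__":
    # The label below is an assumed translation, used only as an example.
    print(build_teseo_url("5705.02", "Etnolingüística"))
```

Codes with fewer than six digits simply produce empty idMed or idEsp parts in this sketch. Whether TESEO accepts such requests is not something I have verified.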

Multilingual searches

It is quite straightforward to run searches on the TESEO DB using fields of study in different languages. The next Cypher fragment shows how to return theses in the field of “Ocean-bottom processes”.

//multilingual semantic search
MATCH (c:skos__Concept) 
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) contains "Ocean-bottom processes"
WITH replace(c.skos__notation[0],".","") as cat_code , toUpper(n10s.rdf.getLangValue("es", c.skos__prefLabel)) as cat_name_in_url
CALL apoc.load.html("https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=15&cursoDesde2=16&descriptor1.termino=[" + cat_code +"]%20-%20" + apoc.text.urlencode(cat_name_in_url) + "&descriptor1.idGen=" + substring(cat_code,0,2) + "&descriptor1.idMed=" + substring(cat_code,2,2) + "&descriptor1.idEsp=" + substring(cat_code,4,2) + "&rpp=500",{metadata:"label"}) yield value
UNWIND value.metadata as result
RETURN result.text as thesisTitle

We can see that the query starts with a topic search by English name in the UNESCO nomenclature, and with the results, the HTTP request is constructed to return the matching theses (titles). The previous query returns a list of 18 results. Note that the search is restricted to the theses published from 2015.

Identical results could have been obtained by running a search with the French term “Processus des fonds océaniques”. More on multilingual thesaurus management with n10s in this previous post.

//multilingual semantic search
MATCH (c:skos__Concept) 
WHERE n10s.rdf.getLangValue("fr", c.skos__prefLabel) contains "Processus des fonds océaniques"
WITH replace ...
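The language lookup used in both queries can be pictured with a short Python sketch. It assumes that, with keepLangTag set to true, each label is stored as a string ending in “@” plus the language tag (for example “Ocean-bottom processes@en”). That storage format is an assumption for illustration, and get_lang_value is a simplified stand-in, not the actual implementation of n10s.rdf.getLangValue.

```python
from typing import Iterable, Optional


def get_lang_value(lang: str, values: Iterable[str]) -> Optional[str]:
    """Return the first value tagged with lang, without the tag, or None.

    Simplified stand-in: assumes every value looks like "text@lang".
    """
    suffix = "@" + lang
    for value in values:
        if value.endswith(suffix):
            return value[: -len(suffix)]
    return None


if __name__ == "__main__":
    # The two labels quoted in this post, in the assumed storage format.
    pref_labels = [
        "Ocean-bottom processes@en",
        "Processus des fonds océaniques@fr",
    ]
    print(get_lang_value("en", pref_labels))
    print(get_lang_value("fr", pref_labels))
```

Because every language variant hangs off the same concept node, the search can start from any of them and still reach the same notation code and Spanish label needed for the request.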

Semantic query expansion

More interesting is the possibility of leveraging the different relationships in the SKOS taxonomy to enrich the search results with related ones. In this case, we are extending the previous query by navigating the skos:related relationship to get related fields. Here is how to get all theses in “Cosmochemistry” and related fields.

MATCH (c:skos__Concept)-[:skos__related]->(related) 
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) contains "Cosmochemistry"
WITH replace(related.skos__notation[0],".","") as cat_code , toUpper(n10s.rdf.getLangValue("es", related.skos__prefLabel)) as cat_name_in_url, n10s.rdf.getLangValue("en", related.skos__prefLabel) as relatedCat
CALL apoc.load.html("https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=15&cursoDesde2=16&descriptor1.termino=[" + cat_code +"]%20-%20" + apoc.text.urlencode(cat_name_in_url) + "&descriptor1.idGen=" + substring(cat_code,0,2) + "&descriptor1.idMed=" + substring(cat_code,2,2) + "&descriptor1.idEsp=" + substring(cat_code,4,2) + "&rpp=500",{metadata:"label"}) yield value
UNWIND value.metadata as result
RETURN cat_code as relatedCatCode, relatedCat, result.text as thesisTitle

The query starts from the node representing the “Cosmochemistry” field and explores other nodes connected to it via the skos:related relationship…

…and issues a request to the TESEO DB for each related topic. The result is a set of additional items that could be used to populate a “you may also find the following results interesting” section. The table below, with the results of the previous query, includes PhD theses in the topics represented by the related nodes in the graph (“Stellar composition”, “Planetary geology”, and “Interplanetary matter”). Note that the query only returns the additional results (the related ones) but not the ones tagged explicitly with “Cosmochemistry”.

It should be easy to get the gist; feel free to modify the code to fit your needs. For example, explore other taxonomy relationships like skos:narrower to get finer-grained results.
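As a rough illustration of the expansion logic outside the database, the Python sketch below works on a toy in-memory copy of the relationships. The related fields for “Cosmochemistry” are the ones named in this post. The narrower entries (“Topic A” and so on) are invented placeholders, and all function names are hypothetical.

```python
from collections import deque
from typing import Dict, List

# Related fields for Cosmochemistry, as listed in this post.
RELATED: Dict[str, List[str]] = {
    "Cosmochemistry": [
        "Stellar composition",
        "Planetary geology",
        "Interplanetary matter",
    ],
}

# Placeholder hierarchy, invented purely for illustration.
NARROWER: Dict[str, List[str]] = {
    "Topic A": ["Topic A.1", "Topic A.2"],
    "Topic A.1": ["Topic A.1.1"],
}


def related_fields(concept: str, related: Dict[str, List[str]]) -> List[str]:
    """Fields one skos:related hop away from concept."""
    return list(related.get(concept, []))


def narrower_closure(concept: str, narrower: Dict[str, List[str]]) -> List[str]:
    """All descendants of concept, breadth first (like skos__narrower*)."""
    found: List[str] = []
    seen = {concept}
    queue = deque([concept])
    while queue:
        current = queue.popleft()
        for child in narrower.get(current, []):
            if child not in seen:  # guard against cycles in the data
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


if __name__ == "__main__":
    print(related_fields("Cosmochemistry", RELATED))
    print(narrower_closure("Topic A", NARROWER))
```

Each expanded field would then be turned into its own request to the repository, which is what the Cypher query does once per related node.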

What’s interesting about this QuickGraph?

This QG shows how straightforward it can be to enhance a document search service with semantic search capabilities. In this example we’ve used a public nomenclature (taxonomy) but we could also have created our own custom ontology like we’ve shown in previous posts like this one.

Maybe this post can give whoever owns the TESEO portal (probably https://www.ciencia.gob.es/) some ideas when they decide to revamp it. But more generally, if you want to power your document search service with semantic capabilities, you have here some concepts for getting started.

Also a couple of days ago I asked datos.gob.es (the entity that “manages the Public Sector Open Data Catalog, and promotes advanced services based on them”) why the TESEO data was not openly available instead of kept hidden behind this not-great UI. I’ve had no response yet unfortunately but I’m patient 🙂

As usual, give this a try and give us your feedback! See you in the next post or at the Neo4j community site. Bye 2020!

QuickGraph#11 The Christmas messages graph

It’s this time of the year… when heads of state address their nations with messages of hope and reflect on the past year and the challenges ahead. I was looking for a data set to do some text analysis and I thought this could be an interesting one. I collected a few Christmas messages from some of Europe’s heads of state (to be more precise, the English translations available).

QuickGraph#10 Enrich your Neo4j Knowledge Graph by querying Wikidata

Wikidata is a collaboratively edited knowledge base. It is a source of open data that you may want to use in your projects. Wikidata offers a query service for integrations. In this QuickGraph, I will show how to use the Wikidata Query Service to get data into Neo4j.

QuickGraph#9 The fashion Knowledge Graph. Inferencing with Ontologies in Neo4j

Last winter I had the opportunity to meet Katariina Kari at a Neo4j event in Helsinki. We had a conversation about graphs, RDF, LPG… we agreed on some things… and disagreed on others 🙂 but I remember telling her that I had found very interesting a post she had published on how they were using Ontologies to drive semantic searches on the Zalando web site.

I’ll use her example from that post and show how you can implement semantic searches/recommendations in Neo4j and leverage existing Ontologies (public standards or your own). That’s what this QuickGraph is about.

I assume you have some level of familiarity with RDF and semantic technologies.

QuickGraph#8 Cloning subgraphs between Neo4j instances with Cypher+RDF

I have two Neo4j instances: let’s call them instance-one and instance-two. My problem is simple: I want an easy way to copy fragments of the graph stored in instance-one to instance-two. In this post, I’ll explain how to use:

  • Cypher to define the subgraph to be cloned and
  • RDF as the model for data exchange (serialisation format)

All with the help of the neosemantics plugin.

Neo4j is your RDF store (part 3) : Thomson Reuters’ OpenPermID

If you’re new to RDF/LPG, here is a good introduction to the differences between both types of graphs.  
For the last post in this series, I will work with a larger public RDF dataset in Neo4j. We’ve already seen a few times that importing an RDF dataset into Neo4j is easy, so what I will focus on in this post is what I think is the more interesting part: what comes after the data import. Here are some highlights:

  1. Applying transformations to the imported RDF graph to make it benefit from the LPG modelling capabilities and enriching the graph with additional complementary data sources.
  2. Querying the graph to do complex path analysis and using graph patterns to detect data quality issues like data duplication and also to profile your dataset.
  3. Integrating Neo4j with standard BI tools to build nice charts on the output of Cypher queries on your graph.
  4. Building an RDF API on top of your Neo4j graph.

All the code I’ll use is available on GitHub. Enjoy!

QuickGraph#6 Building the Wikipedia Knowledge Graph in Neo4j (QG#2 revisited)

After last week’s Neo4j online meetup, I thought I’d revisit QuickGraph#2 and update it a bit to include a couple of new things:

  • How to load not only categories but also pages (as in Wikipedia articles) and enrich the graph by querying DBpedia. In doing this I’ll describe some advanced usage of APOC procedures.
  • How to batch load the whole Wikipedia hierarchy of categories into Neo4j

QuickGraph#5 Learning a taxonomy from your tagged data

The Objective

Say we have a dataset of multi-tagged items: books with multiple genres, articles with multiple topics, products with multiple categories… We want to organise these tags (the genres, the topics, the categories…) logically, in a descriptive but also actionable way. A typical organisation will be hierarchical, like a taxonomy.

But rather than building it manually, we are going to learn it from the data in an automated way. This means that the quality of the results will totally depend on the quality and distribution of the tagging in your data, so sometimes we’ll produce a rich taxonomy but sometimes the data will only yield a set of rules describing how tags relate to each other.

Finally, we’ll want to show how this taxonomy can be used and I’ll do it with an example on content recommendation / enhanced search.

Neo4j is your RDF store (part 2)

As in previous posts, for those of you less familiar with the differences and similarities between RDF and the Property Graph, I recommend you watch this talk I gave at Graph Connect San Francisco in October 2016.

In the previous post in this series, I showed the most basic way in which a portion of your graph can be exposed as RDF: identifying a node by ID, or by URI if your data was imported from an RDF dataset. In this one, I’ll explore a more interesting way: running Cypher queries and serialising the resulting subgraph as RDF.