QuickGraph#13 Using a SKOS taxonomy for semantic search on a document repository

The TESEO database is an online repository containing the details of all PhD theses from Spanish universities. It offers an HTML form-based search interface where you can look up theses by author, topic, university, etc. As a UI it is rather painful to use and quite limited, I must say, but that’s another story. While we wait for an open data version of this public content we have to find workarounds to query and analyse it. This is what this QuickGraph is about.

Interestingly, one of the ways you can search the database is by field of study, and TESEO uses the UNESCO nomenclature for fields of science and technology (a standard proposed back in 1988). It is possible, for example, to search for PhD theses in the field of Linear Programming (UNESCO code 1207.09) or Ethnolinguistics (UNESCO code 5705.02).

If we want to run some interesting (and automated) queries against the DB, we need a machine-readable form of the UNESCO nomenclature, so I continued my research and found that a team at the University of Murcia had formalised the UNESCO nomenclature as a SKOS taxonomy. You can explore it online here.

My plan is to run semantic searches on the TESEO portal. For example, “find PhD theses on fields related to Isotopes” or “broader than Model theory”. This will require exploring the relationships between concepts in the SKOS taxonomy and using the results of this exploration to query the TESEO database. Easy, right?

I think I’ve got all I need:

  • A DB that can be queried (with some pain) via http.
  • A public SKOS taxonomy accessible through a SPARQL endpoint.
  • n10s and APOC to integrate all the parts together on Neo4j.

Loading the UNESCO nomenclature into Neo4j

Neosemantics (n10s) includes methods to import SKOS concept schemes into Neo4j in a fully automated way. We saw in the previous quickgraph how the n10s.skos.import.* procedures build a simplified model of the taxonomy in Neo4j. For today’s experiment, we will import the SKOS taxonomy using the generic n10s.rdf.import.* procedures. SKOS is just an RDF vocabulary so we can import it as RDF and we’ll get a like for like representation of every statement (triple) in the UNESCO nomenclature as nodes/relationships/attributes in Neo4j.

Here’s what the process looks like:

//create the n10s configuration: multival & multilang
CALL n10s.graphconfig.init( 
           { handleMultival : "ARRAY", keepLangTag : true } );

//define a param with the sparql query 
:param sparql=> "CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <http://skos.um.es/unesco6> { ?s ?p ?o } } "

//import the skos taxonomy excluding irrelevant predicates
CALL n10s.rdf.import.fetch(
"https://skos.um.es/sparql/?query=" + apoc.text.urlencode($sparql) + "&output=turtle", "Turtle", 
{ predicateExclusionList : ["http://www.w3.org/2004/02/skos/core#topConceptOf", "http://www.w3.org/2004/02/skos/core#inScheme"]})

The call to the n10s.rdf.import.fetch procedure does three things:

  1. it issues a request to the SPARQL endpoint with a query that returns the whole UNESCO nomenclature as RDF triples.
  2. it filters out certain triples based on the exclusion list passed as parameter. In this case we are excluding some redundant/useless statements (feel free to modify the list).
  3. it persists the triples into Neo4j as nodes and relationships.
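To make the first step more concrete, here is a minimal Python sketch of how the request URL is assembled before it is handed to the import procedure. It only mirrors the string concatenation in the Cypher call above. The helper name build_fetch_url is made up for illustration, and it assumes that apoc.text.urlencode produces form-style encoding (spaces as "+"), which is what quote_plus does in Python.

```python
from urllib.parse import quote_plus

# SPARQL query copied from the :param definition in the post
SPARQL = "CONSTRUCT { ?s ?p ?o } WHERE { GRAPH <http://skos.um.es/unesco6> { ?s ?p ?o } } "
ENDPOINT = "https://skos.um.es/sparql/"


def build_fetch_url(endpoint: str, sparql: str, output: str = "turtle") -> str:
    """Return the GET URL carrying the URL-encoded SPARQL query (hypothetical helper)."""
    # Assumption: form-style encoding, i.e. spaces become '+' and '{' becomes '%7B'
    return f"{endpoint}?query={quote_plus(sparql)}&output={output}"


fetch_url = build_fetch_url(ENDPOINT, SPARQL)
print(fetch_url)
```

Pasting the printed URL into a browser is a quick way to check what the endpoint returns before importing anything into Neo4j.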

Exploring the graph

Once imported into Neo4j, the taxonomy is formed of a set of small clusters of concepts organised in trees using the skos:narrower relationship and its inverse, skos:broader. Here’s a bird’s eye view on Bloom.

From a Property Graph point of view, it is redundant and therefore pointless to have both relationships (skos:narrower and skos:broader) in the graph, so feel free to remove one of them with a simple Cypher expression like this:

MATCH (:skos__Concept)-[br:skos__broader]->() 
DELETE br

If we zoom into one of the clusters, for example the one on Astronomy and astrophysics (UNESCO code 21), we can appreciate its hierarchical structure. This is even more obvious when applying the hierarchical layout.

This query will get the categories under Astronomy and astrophysics. Note that I’m looking it up by URI.

MATCH tree = (:Resource { uri: "http://skos.um.es/unesco6/21"})-[:skos__narrower*]->() RETURN tree

In addition to the hierarchical relationships, there are skos:related relationships connecting transversally related concepts in different taxonomies and acting as bridges across subtrees. Here is an example of how Political Science (UNESCO code 59) relates to Philosophy (UNESCO code 72). Note the skos:related relationships displayed in orange.

The following query returns in a tabular form the bridging points (pairs) between the two trees:

MATCH (politics:Resource { uri: "http://skos.um.es/unesco6/59"})-[:skos__narrower*]->(bridge1)-[:skos__related]-(bridge2)<-[:skos__narrower*]-(philosophy:Resource { uri: "http://skos.um.es/unesco6/72"})
RETURN DISTINCT n10s.rdf.getLangValue("en", bridge1.skos__prefLabel) as bridgeOnPolitics, n10s.rdf.getLangValue("en", bridge2.skos__prefLabel) as bridgeOnPhilosophy

Producing as result:

Running semantic search on the repository of PhD Theses

In order to query the Teseo database, we need to generate HTTP requests with the following structure:

https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=17&cursoDesde2=18&
descriptor1.termino=[<CODE>]%20-%20<NAME>&descriptor1.idGen=<CODE_1>&descriptor1.idMed=<CODE_2>&descriptor1.idEsp=<CODE_3>

The server URL and the first set of request parameters are fixed, but the field of study is passed dynamically and we will generate it on the fly from the results of exploring the SKOS taxonomy. Interestingly, the code of the nomenclature term is required twice: once prefixing the term name (which must be passed in Spanish and in upper case), and once broken down into three parts and passed as descriptor1.idGen, descriptor1.idMed and descriptor1.idEsp.
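The following Python sketch shows how one of these requests can be put together from a nomenclature code and a Spanish label. It follows the URL template above and the same steps as the Cypher used later in the post (remove the dot, upper-case the label, split the code into two-character chunks). The function name teseo_url is hypothetical, and the Spanish label in the example is only illustrative; in practice it would come from the skos:prefLabel of the concept.

```python
from urllib.parse import quote_plus

BASE = "https://www.educacion.gob.es/teseo/listarBusqueda.do"
# Fixed part of the query string, copied from the template in the post
FIXED = "tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=17&cursoDesde2=18"


def teseo_url(notation: str, spanish_label: str) -> str:
    """Build a TESEO search URL for one field of study (hypothetical helper)."""
    code = notation.replace(".", "")                 # "1207.09" -> "120709"
    gen, med, esp = code[0:2], code[2:4], code[4:6]  # empty strings for shorter codes
    # The code appears once in front of the upper-cased Spanish name...
    term = f"[{code}]%20-%20{quote_plus(spanish_label.upper())}"
    # ...and once more, broken down into its three parts
    return (
        f"{BASE}?{FIXED}&descriptor1.termino={term}"
        f"&descriptor1.idGen={gen}&descriptor1.idMed={med}&descriptor1.idEsp={esp}"
    )


# Illustrative label, not taken from the taxonomy
url = teseo_url("1207.09", "Programación lineal")
print(url)
```

Codes with fewer than six digits simply produce empty idMed or idEsp values, which matches what substring does in the Cypher version.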

Multilingual searches

It is quite straightforward to run searches on the TESEO DB using fields of study in different languages. The next Cypher fragment shows how to return theses in the field of “Ocean-bottom processes”.

//multilingual semantic search
MATCH (c:skos__Concept) 
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) contains "Ocean-bottom processes"
WITH replace(c.skos__notation[0],".","") as cat_code , toUpper(n10s.rdf.getLangValue("es", c.skos__prefLabel)) as cat_name_in_url
CALL apoc.load.html("https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=15&cursoDesde2=16&descriptor1.termino=[" + cat_code +"]%20-%20" + apoc.text.urlencode(cat_name_in_url) + "&descriptor1.idGen=" + substring(cat_code,0,2) + "&descriptor1.idMed=" + substring(cat_code,2,2) + "&descriptor1.idEsp=" + substring(cat_code,4,2) + "&rpp=500",{metadata:"label"}) yield value
UNWIND value.metadata as result
RETURN result.text as thesisTitle

We can see that the query starts with a topic search by English name in the UNESCO nomenclature and then uses the result to construct the HTTP request that returns the matching theses (titles). The previous query returns a list of 18 results. Note that the search is restricted to theses published from 2015.

Identical results could have been obtained by running a search with the French term “Processus des fonds océaniques”. More on multilingual thesaurus management with n10s in this previous post.

//multilingual semantic search
MATCH (c:skos__Concept) 
WHERE n10s.rdf.getLangValue("fr", c.skos__prefLabel) contains "Processus des fonds océaniques"
WITH replace ...

Semantic query expansion

More interesting is the possibility of leveraging the different relationships in the SKOS taxonomy to enrich the search results with related ones. In this case, we are extending the previous query by navigating the skos:related relationship to get related fields. Here is how to get all theses in “Cosmochemistry” and related fields.

MATCH (c:skos__Concept)-[:skos__related]->(related) 
WHERE n10s.rdf.getLangValue("en", c.skos__prefLabel) contains "Cosmochemistry"
WITH replace(related.skos__notation[0],".","") as cat_code , toUpper(n10s.rdf.getLangValue("es", related.skos__prefLabel)) as cat_name_in_url, n10s.rdf.getLangValue("en", related.skos__prefLabel) as relatedCat
CALL apoc.load.html("https://www.educacion.gob.es/teseo/listarBusqueda.do?tipo=avanzada&idUni=0&idDepartamento=0&cursoDesde=15&cursoDesde2=16&descriptor1.termino=[" + cat_code +"]%20-%20" + apoc.text.urlencode(cat_name_in_url) + "&descriptor1.idGen=" + substring(cat_code,0,2) + "&descriptor1.idMed=" + substring(cat_code,2,2) + "&descriptor1.idEsp=" + substring(cat_code,4,2) + "&rpp=500",{metadata:"label"}) yield value
UNWIND value.metadata as result
RETURN cat_code as relatedCatCode, relatedCat, result.text as thesisTitle

The query starts from the node representing the “Cosmochemistry” field and explores other nodes connected to it via the skos:related relationship…

…and issues a request to the Teseo DB for each related topic. The result is some additional items that could be used to populate a “you may also find interesting the following results”. We can see in the table below with the results of the previous query that it includes PhD theses in the topics represented by the related nodes in the graph (“Stellar composition”, “Planetary geology”, and “Interplanetary matter”). Note that the query only returns the additional results (the related ones) but not the ones tagged explicitly with “Cosmochemistry”.
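The expansion step itself is tiny, as this Python sketch tries to show. It uses a toy dictionary in place of the graph: the three related fields are the ones named above, but the data structure and the function name expand are made up for illustration. The include_original flag is an optional extra to also search the starting field, which the Cypher query does not do.

```python
# Toy slice of the taxonomy: concept -> concepts linked via skos:related.
# The related fields are the ones named in the post; the dict is illustrative.
RELATED = {
    "Cosmochemistry": [
        "Stellar composition",
        "Planetary geology",
        "Interplanetary matter",
    ],
}


def expand(concept, related, include_original=False):
    """Return the fields to search for; each one becomes one repository request."""
    fields = [concept] if include_original else []
    for other in related.get(concept, []):
        if other not in fields:  # avoid issuing the same request twice
            fields.append(other)
    return fields


print(expand("Cosmochemistry", RELATED))
print(expand("Cosmochemistry", RELATED, include_original=True))
```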

It’s easy to get the gist, I guess. Feel free to modify the code to fit your needs, for example by exploring other taxonomy relationships like skos:narrower to get finer-grained results.

What’s interesting about this QuickGraph?

This QG shows how straightforward it can be to enhance a document search service with semantic search capabilities. In this example we’ve used a public nomenclature (taxonomy) but we could also have created our own custom ontology like we’ve shown in previous posts like this one.

Maybe this post can give whoever owns the TESEO portal (probably https://www.ciencia.gob.es/) some ideas when they decide to revamp it. But more generally, if you want to power your document search service with semantic capabilities, you have here some concepts for getting started.

Also a couple of days ago I asked datos.gob.es (the entity that “manages the Public Sector Open Data Catalog, and promotes advanced services based on them”) why the TESEO data was not openly available instead of kept hidden behind this not-great UI. I’ve had no response yet unfortunately but I’m patient 🙂

As usual give this a try and give us your feedback! See you in the next post or at the neo4j community site. Bye 2020!

QuickGraph#12 Working with a Multilingual Thesaurus

The UNESCO Thesaurus is a controlled and structured list of terms in the areas of education, culture, natural sciences, social and human sciences, communication and information. It’s used to annotate documents and publications like the ones in the UNESDOC digital library.

The Thesaurus is available as a multilingual SKOS concept scheme and at the time of writing, the available languages were English, Spanish, French, Russian and Arabic (download link).

SKOS (Simple Knowledge Organization System) is an RDF based model for expressing the basic structure and content of concept schemes such as thesauri, taxonomies and other types of controlled vocabulary. Learn more about it here.

Loading the UNESCO Thesaurus into Neo4j

Neosemantics (n10s) includes methods to import SKOS concept schemes into Neo4j in a fully automated way. Here’s what the process looks like.

Monolingual Thesaurus

Let’s say we want to load the Thesaurus in French only (replace French with your preferred language here).

Step 1: Setting the Graph Config

call n10s.graphconfig.init({ handleVocabUris: "IGNORE", 
    handleMultival: "ARRAY", 
    multivalPropList: ["http://www.w3.org/2004/02/skos/core#altLabel"] })

With handleVocabUris: "IGNORE" we are asking n10s to ignore the namespaces used by SKOS and keep only the local names. We will see when we import the Thesaurus into Neo4j that nodes representing categories will have properties called altLabel or prefLabel instead of the fully qualified skos:altLabel or skos:prefLabel. With handleMultival: "ARRAY" we are setting n10s to import multivalued properties into arrays in Neo4j. By using multivalPropList we can specify the list of properties we want this behaviour to be applied to (the rest will be stored as atomic values). In this case we expect concepts in the Thesaurus to have one single preferred label (prefLabel) and potentially multiple alternative labels (altLabel).
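As a rough mental model of what this configuration does to the literals of one concept, here is a small Python sketch. It is not how n10s is implemented, just an illustration of the behaviour described above. The function names are invented, the altLabel values are placeholders, and the "last value wins" rule for atomic properties is an assumption.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"


def local_name(uri: str) -> str:
    """Drop the namespace and keep what follows the last '#' or '/'."""
    return uri.replace("#", "/").rstrip("/").rsplit("/", 1)[-1]


def fold_properties(triples, multival_props):
    """Fold (predicate, literal) pairs into the property map of a single node."""
    props = {}
    for predicate, value in triples:
        key = local_name(predicate)  # effect of handleVocabUris: "IGNORE"
        if predicate in multival_props:
            props.setdefault(key, []).append(value)  # listed in multivalPropList: array
        else:
            props[key] = value  # atomic value; assumption: the last one wins
    return props


triples = [
    (SKOS + "prefLabel", "Service de santé"),
    (SKOS + "altLabel", "label A"),  # placeholder alternative labels
    (SKOS + "altLabel", "label B"),
]
node = fold_properties(triples, multival_props={SKOS + "altLabel"})
print(node)
```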

Step 2: Importing the Turtle serialisation of the Thesaurus directly from the UNESCO site.

call n10s.skos.import.fetch("http://vocabularies.unesco.org/browser/rest/v1/thesaurus/data?format=text/turtle",
   "Turtle", { languageFilter: "fr" })

With this procedure we are importing the Thesaurus directly from its public address in vocabularies.unesco.org. With param languageFilter we indicate that we want to filter literal triples and only keep the ones tagged as French ("fr"). It should not take more than a second or two to process the 90K triples in the current version of the Thesaurus.

The imported graph reflects the information in Thesaurus as we can see in the next capture for a specific concept Health Services (“Service de santé”):

The information about a specific concept and the ones related to it can be retrieved with this simple Cypher query:

MATCH p = (:Class { prefLabel : "Service de santé"})<--() 
RETURN p

Multilingual Thesaurus

Alternatively, we may want to load the Thesaurus and include all available languages for concepts and terms. The approach is identical and only requires a small change to the Graph Config. (I’ll assume that we’ll empty the graph and start again here).

Step 1: Setting the Graph Config

call n10s.graphconfig.init({ handleVocabUris: "IGNORE", handleMultival: "ARRAY", keepLangTag: true })

In this case we want both preferred and alternative labels to be stored in arrays because we expect several of them (at least one per language). That’s why we don’t need to specify multivalPropList as we did before: all properties will be stored in arrays. There is also a new param set (keepLangTag) to indicate that we want to keep the language tag with each value.

Step 2: Importing the Turtle serialisation of the Thesaurus directly from the UNESCO site.

call n10s.skos.import.fetch("http://vocabularies.unesco.org/browser/rest/v1/thesaurus/data?format=text/turtle",
   "Turtle")

Identical to the previous one but this time without the filter on French literals. We are keeping all languages now.

The resulting graph is much richer now, and for every concept we have a nice set of multivalued properties. Let’s see what the Health Services concept looks like now:

MATCH (c:Class { 
    uri: "http://vocabularies.unesco.org/thesaurus/concept412"}) 
RETURN c.prefLabel as preferredLabels, c.altLabel as alternativeLabels

This time we are looking up by URI (:concept412). Remember that URIs are globally unique identifiers for concepts in a vocabulary. A primary key for lookup in a DB so to speak.

We can also do searches in different languages, and similarly produce the result in the same language. Here’s an example of a simple lookup using the Arabic version of the Thesaurus.

MATCH (c:Class) WHERE "خدمات صحية@ar" IN c.prefLabel
RETURN [x IN c.prefLabel WHERE n10s.rdf.hasLangTag("ar",x)| n10s.rdf.getValue(x)] AS prefLabel,
       [x IN c.altLabel WHERE n10s.rdf.hasLangTag("ar",x) | n10s.rdf.getValue(x)] AS altLabel

Which would produce the following result:

Or the same concept, this time in Russian (lookup by URI again):

MATCH (c:Class { uri: "http://vocabularies.unesco.org/thesaurus/concept412"})
RETURN [x IN c.prefLabel WHERE n10s.rdf.hasLangTag("ru",x)| n10s.rdf.getValue(x)] AS prefLabel,
       [x IN c.altLabel WHERE n10s.rdf.hasLangTag("ru",x) | n10s.rdf.getValue(x)] AS altLabel

Producing rich output with Cypher

The following Cypher query makes extensive use of comprehensions and aggregations to build a rich JSON structure for a given Thesaurus concept.

The query takes two parameters:

  • the identifier of the concept (its URI in this case)
  • a list of languages

It returns the list of parent concepts in the Thesaurus with their ids, preferred and alternative labels, once for each language in the list passed as a parameter. Here is the Cypher in question:

MATCH taxonomy = (c:Class { uri: $uri})-[:SCO*]->(top)
WHERE NOT (top)-[:SCO]->() WITH taxonomy
UNWIND $langs as lang
RETURN collect({ lang: lang, taxonomy: [concept in nodes(taxonomy)| { pref: n10s.rdf.getLangValue(lang, concept.prefLabel), alt: [alt in concept.altLabel where n10s.rdf.hasLangTag(lang, alt) | n10s.rdf.getLangValue(lang, alt) ] }]}) as multilang

Notice how we are using the n10s.rdf.getLangValue function to get the value of a property in the required lang and the n10s.rdf.hasLangTag to check whether a property value has a particular lang tag.
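To illustrate what these functions do and what shape the result has, here is a Python sketch that mimics them on plain lists of "value@lang" strings, which is how the labels look once keepLangTag is set. The Python functions are stand-ins written for this example, not the n10s implementation; returning None when no value matches is an assumption. The preferred labels are the ones quoted in this post for the Health Services concept, and the alternative label is a placeholder.

```python
def has_lang_tag(lang, value):
    """True when a stored value such as 'Health Services@en' ends with the given tag."""
    _, sep, tag = value.rpartition("@")
    return bool(sep) and tag == lang


def get_value(value):
    """Strip the language tag, if there is one."""
    text, sep, _ = value.rpartition("@")
    return text if sep else value


def get_lang_value(lang, values):
    """First value tagged with lang, or None when there is none (assumption)."""
    for value in values or []:
        if has_lang_tag(lang, value):
            return get_value(value)
    return None


def multilang(path, langs):
    """One entry per language, listing the labels of every concept on the path."""
    return [
        {
            "lang": lang,
            "taxonomy": [
                {
                    "pref": get_lang_value(lang, concept.get("prefLabel")),
                    "alt": [
                        get_value(alt)
                        for alt in concept.get("altLabel", [])
                        if has_lang_tag(lang, alt)
                    ],
                }
                for concept in path
            ],
        }
        for lang in langs
    ]


# A one-node "path"; the altLabel is a placeholder value
path = [
    {
        "prefLabel": ["Health Services@en", "Service de santé@fr", "خدمات صحية@ar"],
        "altLabel": ["placeholder alt label@en"],
    }
]
result = multilang(path, ["en", "fr"])
print(result)
```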

We can test the query on the concept “Open Universities” (uri: concept8244) in English and Spanish. All we need to do is set the parameters using :param if we are running this in the Neo4j browser. Here’s how:

:param uri => "http://vocabularies.unesco.org/thesaurus/concept8244"
:param langs => ["en","es"]

When we run the query above we get a single result serialised in JSON like the one to the right (courtesy of JSON Formatter). To the left we can see the actual structure in the graph that we are serialising.

Analytics on the Thesaurus Concepts

Just to finish the post I thought I’d add a little section on analytics. Once your thesaurus is stored in Neo4j, you can leverage the GDS library to run graph analytics on your data. Here’s how you can determine the centrality of the Concepts in your graph with two lines of code.

First we create a graph projection where we use the Class nodes and the RELATED relationships (feel free to change that and also include the SCO relationships, or even customise it by defining a Cypher-based projection). We will call our projection ‘thesaurus-analytics’.

CALL gds.graph.create('thesaurus-analytics', ['Class'], ['RELATED'])

We can now run graph algos on the projected graph. I’ll select a couple of centrality ones in the alpha tier (experimental): articleRank and eigenvector centrality. Running them is pretty straightforward. In this example I’m streaming the results rather than persisting them in the graph.

call gds.alpha.articleRank.stream('thesaurus-analytics') YIELD nodeId, score RETURN n10s.rdf.getLangValue('en',gds.util.asNode(nodeId).prefLabel) AS concept, score ORDER BY score DESC

From the documentation, ArticleRank is a variant of the Page Rank algorithm, which measures the transitive influence or connectivity of nodes. Where ArticleRank differs to Page Rank is that Page Rank assumes that relationships from nodes that have a low out-degree are more important than relationships from nodes with a higher out-degree. ArticleRank weakens this assumption.
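A toy Python sketch can make the difference tangible. It is not the GDS implementation. It assumes the commonly cited formulation in which PageRank divides a neighbour's score by its out-degree, while ArticleRank divides it by the out-degree plus the average out-degree of the graph. The graph, the function name rank and the parameter values are all made up for illustration.

```python
def rank(edges, nodes, damping=0.85, iterations=20, article=False):
    """Simplified PageRank, or ArticleRank when article=True (illustrative only)."""
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    avg_out = sum(out_degree.values()) / len(nodes)
    # Assumption: ArticleRank adds the average out-degree to the denominator
    offset = avg_out if article else 0.0
    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += scores[src] / (out_degree[src] + offset)
        scores = {n: (1.0 - damping) + damping * incoming[n] for n in nodes}
    return scores


nodes = ["a", "b", "c", "d"]
# a has a single outgoing relationship, b has two
edges = [("a", "c"), ("b", "c"), ("b", "d")]

pr = rank(edges, nodes)
ar = rank(edges, nodes, article=True)
print(pr)
print(ar)
```

In this toy graph the relationship from the low out-degree node a gives c a bigger lead over d under PageRank than under ArticleRank, which is the weakening the documentation refers to.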

The same analysis, this time using eigenvector centrality:

call gds.alpha.eigenvector.stream('thesaurus-analytics') YIELD nodeId, score RETURN n10s.rdf.getLangValue('en',gds.util.asNode(nodeId).prefLabel) AS concept, score ORDER BY score DESC

From the documentation, relationships to high-scoring nodes contribute more to the score of a node than connections to low-scoring nodes. A high score means that a node is connected to other nodes that have high scores.

Interestingly, there is very little overlap in the top results of the two algorithms, but this is a topic for another post 🙂

What’s interesting about this QuickGraph?

I think it’s clear here how n10s helps in handling multilingual taxonomies, but equally interesting is the ease of import (and export) from/to standards like SKOS. More interesting stuff on this thesaurus, involving unstructured data, to come in the next posts.

As usual, give it a try and give us your feedback! See you at the neo4j community site.

QuickGraph#11 The Christmas messages graph

It’s this time of the year… when heads of state address their nations with messages of hope and reflect on the past year and the challenges ahead. I was looking for a data set to do some text analysis and I thought this could be an interesting one. I collected a few Christmas messages from some of Europe’s heads of state (to be more precise, the English translations available).


QuickGraph#10 Enrich your Neo4j Knowledge Graph by querying Wikidata

Wikidata is a collaboratively edited knowledge base. It is a source of open data that you may want to use in your projects. Wikidata offers a query service for integrations. In this QuickGraph, I will show how to use the Wikidata Query Service to get data into Neo4j.

QuickGraph#9 The fashion Knowledge Graph. Inferencing with Ontologies in Neo4j

Last winter I had the opportunity to meet Katariina Kari at a Neo4j event in Helsinki. We had a conversation about graphs, RDF, LPG… we agreed on some things… and disagreed on others 🙂 but I remember telling her that I had found very interesting a post she had published on how they were using Ontologies to drive semantic searches on the Zalando web site.

I’ll use her example from that post and show how you can implement semantic searches/recommendations in Neo4j and leverage existing Ontologies (public standards or your own). That’s what this QuickGraph is about.

I assume you have some level of familiarity with RDF and semantic technologies.

QuickGraph#8 Cloning subgraphs between Neo4j instances with Cypher+RDF

I have two Neo4j instances: let’s call them instance-one and instance-two. My problem is simple: I want an easy way to copy fragments of the graph stored in instance-one to instance-two. In this post, I’ll explain how to use:

  • Cypher to define the subgraph to be cloned and
  • RDF as the model for data exchange (serialisation format)

All with the help of the neosemantics plugin.

Neo4j is your RDF store (part 3) : Thomson Reuters’ OpenPermID

If you’re new to RDF/LPG, here is a good introduction to the differences between both types of graphs.
For the last post in this series, I will work with a larger public RDF dataset in Neo4j. We’ve already seen a few times that importing an RDF dataset into Neo4j is easy, so in this post I will focus on what I think is the more interesting part: what comes after the data import. Here are some highlights:

  1. Applying transformations to the imported RDF graph to make it benefit from the LPG modelling capabilities and enriching the graph with additional complementary data sources.
  2. Querying the graph to do complex path analysis and using graph patterns to detect data quality issues like data duplication and also to profile your dataset.
  3. Integrating Neo4j with standard BI tools to build nice charts on the output of Cypher queries on your graph.
  4. Building an RDF API on top of your Neo4j graph.

All the code I’ll use is available on GitHub. Enjoy!


QuickGraph#6 Building the Wikipedia Knowledge Graph in Neo4j (QG#2 revisited)

After last week’s Neo4j online meetup, I thought I’d revisit QuickGraph#2 and update it a bit to include a couple new things:

  • How to load not only categories but also pages (as in Wikipedia articles) and enrich the graph by querying DBpedia. In doing this I’ll describe some advanced usage of APOC procedures.
  • How to batch load the whole Wikipedia hierarchy of categories into Neo4j


QuickGraph#5 Learning a taxonomy from your tagged data

The Objective

Say we have a dataset of multi-tagged items: books with multiple genres, articles with multiple topics, products with multiple categories… We want to organise these tags (the genres, the topics, the categories…) logically, in a descriptive but also actionable way. A typical organisation will be hierarchical, like a taxonomy.

But rather than building it manually, we are going to learn it from the data in an automated way. This means that the quality of the results will totally depend on the quality and distribution of the tagging in your data, so sometimes we’ll produce a rich taxonomy but sometimes the data will only yield a set of rules describing how tags relate to each other.

Finally, we’ll want to show how this taxonomy can be used and I’ll do it with an example on content recommendation / enhanced search.