QuickGraph#7 Creating a schema.org linked data endpoint on Neo4j

In this instalment of the QuickGraph series, I’ll show how to map a graph stored in Neo4j to an ontology (or schema, or vocabulary…) using the neosemantics extension.

THE DATASET

For this experiment, I’ve used the public IMDb Datasets available at https://datasets.imdbws.com/. I’ll use only a subset of the data (people, titles and genres), since the interesting part of this post is the model mapping rather than the dataset itself. The interested reader can build a bigger and richer graph using other elements like ratings, film crew, etc. IMDb publishes a detailed description of the files available in the dataset.

LOADING THE DATA INTO NEO4J

The data load is pretty straightforward; you can find the Cypher scripts on GitHub. The resulting graph contains 9.4 million nodes and 45 million relationships. The average node degree is 4.82, and the average number of properties per node is 2.82.

Here’s a small excerpt with some of the elements around the actors who played two of the most memorable characters of all time: Inigo Montoya in The Princess Bride (Mandy Patinkin) and The Dude in The Big Lebowski (Jeff Bridges).

Screenshot 2018-10-18 at 15.46.51.png

QUERYING THE GRAPH

Just for fun, we can easily find some of the shortest paths between Patinkin and John Travolta (they’re 6 hops apart in terms of graph edges).

MATCH sp = allShortestPaths((n:Person { pid: "nm0001597" })-[:IN_TITLE*..10]-(:Person { pid: "nm0000237" }))
RETURN sp LIMIT 10

Screenshot 2018-10-18 at 16.19.05.png

Or, more interestingly, we can use the latest addition to the Neo4j Graph Algorithms library, the Overlap coefficient similarity algorithm, to find similarities between the genres used in the IMDb dataset. I don’t want to diverge from my objective for this post, so I’ll just show the Cypher you’d need to run to get the overlap coefficients:

MATCH (gen:Genre) 
MATCH (gen)<-[:HAS_GENRE]-(title)
WITH {item:id(gen), 
categories: collect(id(title))} as genreData
WITH collect(genreData) as data
CALL algo.similarity.overlap.stream(data)
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.getNodeById(item1).name AS from, 
algo.getNodeById(item2).name AS to,
count1, count2, intersection, similarity
ORDER BY similarity DESC

Screenshot 2018-10-18 at 16.54.01.png
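To make the similarity measure concrete, here is a minimal Python sketch of the overlap coefficient, |A ∩ B| / min(|A|, |B|), computed over sets of title ids per genre. The title ids and genre memberships below are invented for illustration and are not taken from the IMDb data; the function names are hypothetical too.

```python
def overlap_coefficient(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def pairwise_overlap(categories):
    """Return (name1, name2, similarity) rows, most similar first."""
    names = sorted(categories)
    rows = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = overlap_coefficient(categories[first], categories[second])
            rows.append((first, second, score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical title ids per genre (illustrative only, not real IMDb data).
titles_by_genre = {
    "Crime": {1, 2, 3, 4, 5},
    "Drama": {2, 3, 4, 5, 6, 7},
    "Film-Noir": {2, 3},
}

for first, second, score in pairwise_overlap(titles_by_genre):
    print(first, second, score)
```

A coefficient of 1.0 means the smaller set is fully contained in the larger one, which is the signal that suggests one tag may be a specialisation of another.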

You can use these results to apply the ‘Barrasa method’ (as my colleague Mark Needham baptised it) to learn a latent taxonomy in a tagged dataset. Mark’s post has more detail, and the original description of the method is worth reading too.

The set of genres in the IMDb dataset is quite small, so I don’t think it’ll be possible to learn much more than the fact that “Film-Noir” is a sub-genre of both “Crime” and “Drama”. Anyway, as I said, this could be the subject of another QuickGraph, so let’s stop here.

MAPPING THE GRAPH TO A PUBLIC SCHEMA

My objective is to create a linked data endpoint on my Neo4j DB that exposes the graph I’ve just built from the IMDb dataset. I want the RDF generated by this endpoint to use a public schema, so I thought I’d use the most popular and widely used one (or set of schemas): schema.org.

The process is hugely simplified by the new mapping capabilities in the neosemantics extension. Let’s give it a go!

Using neosemantics to produce RDF

As you probably already know, once you’ve installed neosemantics (see its installation instructions), you can query your graph and get the results serialised as RDF. This was described a while ago in earlier posts.

To try this, you can run an rdf/describe call to get all statements about a particular individual (a node in the graph) given its id. Let’s use id 14, which in my database is the one assigned to James Dean. You can get the unique id of a node by using the function id(node) in Neo4j. Here’s the request if you run it from the Neo4j browser:

:GET /rdf/describe/id?nodeid=14

Neo4j will output an RDF serialisation with all the info about James Dean. The browser will produce Turtle by default but content negotiation is possible through header params in your HTTP request.

Screenshot 2018-10-18 at 10.43.34
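Outside the Neo4j browser, content negotiation comes down to setting the Accept header on the request. The sketch below only builds the request and does not send it. The host and port are assumptions about a local installation, the function name is hypothetical, and whether a given media type is honoured depends on the serialisations your neosemantics version supports.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a local Neo4j instance exposing its HTTP port on 7474.
BASE_URL = "http://localhost:7474"


def describe_request(node_id, accept="text/turtle"):
    """Build (without sending) a GET request for /rdf/describe/id."""
    query = urlencode({"nodeid": node_id})
    return Request(
        f"{BASE_URL}/rdf/describe/id?{query}",
        headers={"Accept": accept},  # the requested RDF serialisation
        method="GET",
    )


# Ask for RDF/XML instead of the default Turtle.
req = describe_request(14, accept="application/rdf+xml")
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

To actually run it you would pass the request to urllib.request.urlopen, adding whatever authentication your server requires.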

As you know, node ids are assigned by the data store, and you’ll probably want to use more meaningful attributes to identify entities, like their IMDb id, the name of a person or the title of a movie. You can do that by passing a Cypher query to the /rdf/cypher endpoint. This approach gives you total control over the subgraph you want to return:

:POST /rdf/cypher 
{ "cypher" : "MATCH (s:Person { pid: 'nm0000015'})-[p]-(o) RETURN s,p,o" }

Screenshot 2018-10-18 at 10.59.09
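The same request can be prepared from a script. This is a minimal sketch that builds the POST body shown above from an IMDb person id; it does not send anything. The host, port and helper names are assumptions, and the id is validated before being placed in the query text because this sketch does not assume the endpoint accepts query parameters.

```python
import json
import re
from urllib.request import Request

# Assumption: a local Neo4j instance exposing its HTTP port on 7474.
BASE_URL = "http://localhost:7474"


def person_query(pid):
    """Cypher returning a person and everything directly connected to it."""
    # IMDb person ids look like "nm" followed by digits; reject anything else
    # so the value can be embedded in the query text safely.
    if not re.fullmatch(r"nm\d+", pid):
        raise ValueError(f"not an IMDb person id: {pid!r}")
    return f"MATCH (s:Person {{ pid: '{pid}' }})-[p]-(o) RETURN s,p,o"


def cypher_rdf_request(cypher, accept="text/turtle"):
    """Build (without sending) a POST request for the /rdf/cypher endpoint."""
    body = json.dumps({"cypher": cypher}).encode("utf-8")
    return Request(
        f"{BASE_URL}/rdf/cypher",
        data=body,
        headers={"Accept": accept, "Content-Type": "application/json"},
        method="POST",
    )


req = cypher_rdf_request(person_query("nm0000015"))
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```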

What we notice, though, is that the RDF is expressed in terms of a dynamically generated virtual vocabulary based on the Neo4j schema, with URIs of the form <neo4j://defaultvocabulary#XXX> for labels, properties and relationships. You can see these in the previous RDF fragment generated for James Dean. That’s kind of OK if all you care about is the RDF serialisation. If what you want is to enable interoperability, you’ll want to use a standard vocabulary. Enter schema.org.

Schema.org

Schema.org defines a series of types and properties that cover the movies domain quite nicely: you can check out the rich definitions of categories like Person, Movie or Occupation.

Creating the mappings with Neosemantics

To create a mapping we need to do two things: first, create a reference to a public schema, and then use that reference to create individual mappings from elements in the Neo4j schema to elements in the public schema (the documentation for the mapping procedures has the details):

  • STEP 1: create a reference to the schema.org vocabularies passing the base URI and a prefix to be used in the serialisation.
call semantics.mapping.addSchema("http://schema.org/","sch")
  • STEP 2: create actual mappings from individual elements in the Neo4j graph schema to the elements in the public schema. To do this, we first get an existing schema definition with listSchemas and pass it to the addMappingToSchema stored procedure. All it takes is the names of both items being mapped (in the case of the public schema, just the local name, not the whole URI, as it’s associated with the schema passed as the first parameter).
call semantics.mapping.listSchemas("http://schema.org/") yield node as sch
call semantics.mapping.addMappingToSchema(sch,"Profession","Occupation") yield node as mapping
return mapping

Notice that we could have done it all in a single step (addSchema + addMappingToSchema), as we’ll see later on.

If we now call the listMappings procedure we get a list of all the currently defined mappings as follows:

call semantics.mapping.listMappings()

Screenshot 2018-10-18 at 12.38.42.png

We could go on adding mappings manually one by one, but we’d rather load them all in one go. Before we do that, here’s how you can drop a schema definition with all of its mappings:

call semantics.mapping.dropSchema("http://schema.org/")

Let’s say we’ve come up with the following set of mappings and we want to load them all in one go.

Screenshot 2018-10-18 at 13.50.46

The following Cypher would do the job. I’m defining the mappings inline as a JSON-like list of maps, but you can easily transform this to pass them as a query parameter or read them from a CSV file, for example.

with [{ neoSchemaElem : "Person", publicSchemaElem: "Person" },
{ neoSchemaElem : "yob", publicSchemaElem: "birthDate" },
{ neoSchemaElem : "yod", publicSchemaElem: "deathDate" },
{ neoSchemaElem : "name", publicSchemaElem: "name" },
{ neoSchemaElem : "Movie", publicSchemaElem: "Movie" },
{ neoSchemaElem : "title", publicSchemaElem: "name" },
{ neoSchemaElem : "year", publicSchemaElem: "datePublished" },
{ neoSchemaElem : "runTimeMinutes", publicSchemaElem: "duration" },
{ neoSchemaElem : "tagline", publicSchemaElem: "headline" },
{ neoSchemaElem : "IN_TITLE", publicSchemaElem: "actor" }] as mappings
call semantics.mapping.addSchema("http://schema.org/","sch") yield node as sch
unwind mappings as m
call semantics.mapping.addMappingToSchema(sch,m.neoSchemaElem,m.publicSchemaElem) yield node 
return count(node) as mappingsDefined
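If the mappings live in a CSV file rather than in the query text, one option is to parse them in a client script and pass them as a parameter. The sketch below is a hedged illustration: the column names simply reuse the keys from the query above, the function name is hypothetical, and the `$mappings` parameter is an assumption about how you would run the statement from a driver (running it is left out).

```python
import csv
import io

# The ten mappings from the post, laid out as CSV (the header names are a
# choice made for this sketch, reusing the keys from the Cypher above).
MAPPINGS_CSV = """neoSchemaElem,publicSchemaElem
Person,Person
yob,birthDate
yod,deathDate
name,name
Movie,Movie
title,name
year,datePublished
runTimeMinutes,duration
tagline,headline
IN_TITLE,actor
"""

# Same statement as above, with the inline list replaced by a parameter.
LOAD_MAPPINGS = """
CALL semantics.mapping.addSchema("http://schema.org/","sch") YIELD node AS sch
UNWIND $mappings AS m
CALL semantics.mapping.addMappingToSchema(sch, m.neoSchemaElem, m.publicSchemaElem) YIELD node
RETURN count(node) AS mappingsDefined
"""


def read_mappings(csv_text):
    """Parse CSV text into the list of maps expected by the query."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "neoSchemaElem": row["neoSchemaElem"].strip(),
            "publicSchemaElem": row["publicSchemaElem"].strip(),
        }
        for row in reader
    ]


mappings = read_mappings(MAPPINGS_CSV)
print(len(mappings), "mappings ready to pass as $mappings")
```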

We can check that the mappings have been correctly defined by calling listMappings again:

Screenshot 2018-10-18 at 14.05.49

And that’s it! Simple, right? We can now re-run our query on James Dean’s movies and we get something quite different:

Screenshot 2018-10-18 at 14.07.07

The data is now described in terms of Schema.org!! We can see, however, that there are elements that have not been mapped and they are serialised with the default Neo4j schema.
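One way to picture what is happening is as a lookup: mapped elements resolve to a URI in the public schema, and everything else falls back to the default vocabulary. The following is a conceptual sketch only, not how neosemantics is implemented; the dictionaries hold a subset of the mappings defined above and the function name is hypothetical.

```python
# Default vocabulary prefix, as seen in the unmapped RDF output.
DEFAULT_VOCAB = "neo4j://defaultvocabulary#"

# Schema prefix -> base URI, as registered with addSchema in the post.
SCHEMAS = {"sch": "http://schema.org/"}

# Neo4j schema element -> (schema prefix, local name); a subset of the
# mappings from the post, kept short for illustration.
MAPPINGS = {
    "Person": ("sch", "Person"),
    "yob": ("sch", "birthDate"),
    "title": ("sch", "name"),
    "IN_TITLE": ("sch", "actor"),
}


def element_uri(neo_element):
    """Resolve a label, property or relationship type to the URI used in RDF."""
    if neo_element in MAPPINGS:
        prefix, local_name = MAPPINGS[neo_element]
        return SCHEMAS[prefix] + local_name
    # Unmapped elements keep the dynamically generated default vocabulary.
    return DEFAULT_VOCAB + neo_element


for element in ["Person", "yob", "IN_TITLE", "Profession"]:
    print(element, "->", element_uri(element))
```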

The Profession nodes can be nicely mapped to the Occupation type in Schema.org, so we can use the same structure as before to add three additional mappings. Notice that this time we use the listSchemas procedure instead of addSchema, since the schema already exists:

with [{ neoSchemaElem : "Profession", publicSchemaElem: "Occupation" },
{ neoSchemaElem : "pname", publicSchemaElem: "name" },
{ neoSchemaElem : "HAS_MAIN_PRO", publicSchemaElem: "hasOccupation" }] as mappings
call semantics.mapping.listSchemas("http://schema.org/") yield node as sch
unwind mappings as m
call semantics.mapping.addMappingToSchema(sch,m.neoSchemaElem,m.publicSchemaElem) yield node 
return count(node) as mappingsDefined

The unique identifiers used in the IMDb dataset don’t have an obvious mapping in Schema.org, but I found that they are included in the DBpedia schema. Perfect, let’s use this definition in our mapping. Nothing stops us from combining elements from multiple schemas.

Usual process: we create a new schema definition and then add the mappings for both the person id and the movie id in the Neo4j graph:

with [{ neoSchemaElem : "pid", publicSchemaElem: "imdbId" },
{ neoSchemaElem : "tid", publicSchemaElem: "imdbId" }] as mappings
call semantics.mapping.addSchema("http://dbpedia.org/ontology/","dbo") yield node as sch
unwind mappings as m
call semantics.mapping.addMappingToSchema(sch,m.neoSchemaElem,m.publicSchemaElem) yield node 
return count(node) as mappingsDefined

Neat! Here’s the complete list of mappings after adding the Occupation and the IMDb ids:

Screenshot 2018-10-18 at 17.37.56.png

Let’s run our request one last time, now on Jeff Bridges. We can get his IMDb id from the URL of his page.

Screenshot 2018-10-18 at 14.43.00.png

WHAT’S INTERESTING ABOUT THIS QUICKGRAPH?

One of the main uses of linked data and shared schemas (ontologies) is to enable interoperability. In this post, I’ve shown how easy it is to dynamically expose the data in Neo4j as a linked data endpoint according to the schema of your choice.

The fact that both RDF and Neo4j have underlying graph models makes it particularly simple to define mappings, as opposed to defining them on tabular or document-based schemas.

There are a couple of additions coming up, like the definition of multiple mappings and the selection of one on the fly (at query time). Give it a try and let me know what you think and how you would extend the current functionality, and please share your use case!
