QuickGraph#11 The Christmas messages graph

It’s that time of the year… when heads of state address their nations with messages of hope and reflect on the past year and the challenges ahead. I was looking for a dataset to do some text analysis on, and I thought this could be an interesting one. I collected a few Christmas messages from some of Europe’s heads of state (to be more precise, the English translations available).

Here’s the set that I’ll use:

Processing the source data and loading it into Neo4j

I’ve used this Python script to process the text in the documents. It uses NLTK to exclude punctuation and stopwords in order to produce a clean list of tokens…

from nltk.tokenize import word_tokenize  # requires NLTK's 'punkt' tokenizer models
# strip punctuation characters (non_words) and lowercase, then drop stopwords
cleantext = ''.join([c for c in text if c not in non_words]).lower()
tokens = [tk for tk in word_tokenize(cleantext) if tk not in stopwords]

… and from the tokens I get the stems, again using NLTK.
The resulting structure contains word stems, their frequency and the words that share the same stem. Here’s a fragment of the clean data ready to be imported into Neo4j:

 {'stem': 'presid', 'count': 1, 'words': ['president']}, 
 {'stem': 'ireland', 'count': 4, 'words': ['ireland']}, 
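The stemming step itself didn’t make it into the post, so here is a minimal sketch of how the structure above can be produced from the tokens. The choice of SnowballStemmer and the sample tokens are my assumptions, not from the original script:

```python
from collections import defaultdict
from nltk.stem import SnowballStemmer  # assumption: the post doesn't say which stemmer was used

stemmer = SnowballStemmer("english")

# sample tokens; in the real script these come from the tokenisation step above
tokens = ["president", "ireland", "challenges", "challenge", "ireland"]

# group the tokens by stem, counting occurrences and keeping the word variants
by_stem = defaultdict(lambda: {"count": 0, "words": set()})
for tk in tokens:
    entry = by_stem[stemmer.stem(tk)]
    entry["count"] += 1
    entry["words"].add(tk)

# flatten into the {stem, count, words} structure shown in the fragment above
data = [{"stem": s, "count": e["count"], "words": sorted(e["words"])}
        for s, e in by_stem.items()]
```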

This is passed as the $list parameter to a Cypher script that populates the Neo4j graph:

CREATE (sp:Speech { country: $country}) WITH sp
UNWIND $list as entry MERGE (st:Stem { id: entry.stem}) 
MERGE (st)-[:USED_IN { freq: entry.count }]->(sp)
WITH st, entry.words as words UNWIND words AS word 
MERGE (w:Word { id: word })
MERGE (w)-[:HAS_STEM]->(st)
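To run that script from Python, the parameters are passed with the Neo4j driver; the wiring below is a sketch (connection details and the function name are mine, not from the post):

```python
# the import query from above, as a parameterised string
IMPORT_QUERY = """
CREATE (sp:Speech { country: $country}) WITH sp
UNWIND $list as entry MERGE (st:Stem { id: entry.stem})
MERGE (st)-[:USED_IN { freq: entry.count }]->(sp)
WITH st, entry.words as words UNWIND words AS word
MERGE (w:Word { id: word })
MERGE (w)-[:HAS_STEM]->(st)
"""

def load_speech(session, country, entries):
    """Load one speech; `entries` is the [{stem, count, words}, ...] list."""
    return session.run(IMPORT_QUERY, country=country, list=entries)

# with the official neo4j Python driver this would be wired up roughly as:
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "<password>"))
#   with driver.session() as session:
#       load_speech(session, "ireland", data)
```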

The model in Neo4j is pretty simple. It contains three types of entities:

  • Speech nodes representing the message by a head of state
  • Stem nodes representing a word stem. They are connected through the USED_IN relationship to the Speech nodes where they are used. The relationship has a property called freq indicating the number of times the word stem is used in a particular speech.
  • Word nodes representing the actual words that were used in a speech. Words are linked to their stem through the HAS_STEM relationship.

Here is an excerpt of the graph showing the stem ‘challeng’ shared by the words ‘challenge’ and ‘challenges’ and used in the messages by the heads of state of Spain, Ireland, Sweden and the UK.

Screenshot 2019-12-29 at 02.11.23.png

It is worth mentioning that I’ve made an arguable modelling decision: linking the stems to the speeches and aggregating all the words that share a stem around the stem node, which loses the link between a word and the speech where it was used. This simplifies the model for the type of analysis we’ll run in this QuickGraph, but it could be an issue if we wanted to know which exact words (not just the stems) were used in a particular speech.

Querying the graph

Let’s now run some queries on this dataset.

Which words appear in all Christmas messages?

Some words, like ‘Christmas’ or ‘peace’, appear in all messages. Here’s how we can get the whole list of word stems used in all speeches, along with all word variants of each stem:

MATCH (s:Speech) WITH count(s) AS countryCount
MATCH (w:Word)-[:HAS_STEM]->(st:Stem)-[:USED_IN]->(speech)
WHERE size((st)-[:USED_IN]->()) = countryCount
RETURN st.id AS stem, collect(distinct w.id) AS words
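In plain Python terms, the query asks for stems whose USED_IN degree equals the number of speeches; over per-speech stem sets, that is just a set intersection. A toy illustration (the stem sets below are made up):

```python
from functools import reduce

# made-up per-speech stem sets, for illustration only
stems_by_speech = {
    "spain":   {"christma", "peac", "year", "challeng"},
    "uk":      {"christma", "peac", "year", "hope"},
    "ireland": {"christma", "peac", "challeng", "ireland"},
}

# a stem used in every speech is in the intersection of all the sets
shared = reduce(set.intersection, stems_by_speech.values())
print(sorted(shared))  # ['christma', 'peac']
```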

Screenshot 2019-12-29 at 02.26.28.png

If, instead of absolute word frequency, we look at the number of speeches in which a stem appears, we get a list of the most common words. And this is something that would look great in a tag cloud. Here is the Cypher query that returns the top 30, followed by a beautiful heart-shaped tag cloud built from that data (courtesy of WordClouds).

MATCH (st:Stem)
RETURN st.id AS stem, size((st)-[:USED_IN]->()) as freq ORDER BY freq DESC LIMIT 30

Screenshot 2019-12-29 at 02.29.34

How many times is the country’s name or demonym mentioned in a message?

All heads of state mention their country’s name (or its demonym) in their speeches, some more than others. Here’s the Cypher query that reveals who does it most and who does it least:

UNWIND [{ country: 'germany',     stems: ['german']},
        { country: 'netherlands', stems: ['netherland','dutch']},
        { country: 'sweden',      stems: ['swed']},
        { country: 'ireland',     stems: ['ireland','irish']},
        { country: 'spain',       stems: ['spa']},
        { country: 'uk',          stems: ['brit']}] AS item
MATCH (st:Stem)-[ui:USED_IN]->(speech:Speech { country: item.country})
WHERE any(word in item.stems where st.id contains word)
RETURN speech.country, sum(ui.freq) as freq ORDER BY freq DESC

producing the following results:

Screenshot 2019-12-29 at 02.57.53.png

Word frequencies

We can get the set of speeches where a word is used with the following graph pattern:

MATCH path = (:Word { id: "climate"})-[:HAS_STEM]->(st:Stem)-[ui:USED_IN]->(s:Speech)
RETURN path

which returns this subgraph:

Screenshot 2019-12-29 at 20.07.05.png

A simple variant of the previous query gives us the frequency of a specific word stem by speech. Here’s what the Cypher looks like:

MATCH (:Word { id: "climate"})-[:HAS_STEM]->(st:Stem) WITH st
MATCH (s:Speech) WITH s, st
OPTIONAL MATCH (st)-[ui:USED_IN]-(s)
RETURN s.country AS country, coalesce(ui.freq, 0) AS freq

Screenshot 2019-12-29 at 20.24.41.png

And the result is bar-chart material:

Screenshot 2019-12-29 at 20.23.16.png

Which are the top five words in each speech?

It’s sometimes interesting to look at the stems that are repeated the most in a message, as they reveal a bit of its spirit. The following Cypher query returns the top five stems (and their words) for each message:

UNWIND ["sweden", "germany", "netherlands", "uk", "ireland", "spain"] AS ctry
MATCH (:Speech { country: ctry })<-[ui:USED_IN]-(st:Stem)
WITH ctry, ui.freq AS freq, st ORDER BY freq DESC
WITH ctry, collect({freq: freq, stem: st}) AS stems
UNWIND stems[0..5] AS topStem
MATCH (x)<-[:HAS_STEM]-(w:Word) WHERE x = topStem.stem
RETURN ctry, topStem.stem.id, collect(w.id), topStem.freq

Producing the following results:

Screenshot 2019-12-29 at 03.01.52.png

Let’s see how they look on a map (courtesy of StepMap)

Screenshot 2019-12-29 at 03.04.19

Graph Algorithms

And to conclude, I’ll show how to use the overlap similarity algorithm to find the most similar speeches based on shared word stems.

To remove some noise, I’ll use only stems with a frequency higher than 3 and return pairs with an overlap similarity over 40%. Here’s how we can do it in a few lines of Cypher:

MATCH (st:Stem)-[ui:USED_IN]->(s:Speech) WHERE ui.freq > 3
WITH { item: id(s), categories: collect(distinct id(st))} AS dt
WITH collect(dt) AS data
CALL algo.similarity.overlap.stream(data)
YIELD item1, count1, item2, count2, intersection, similarity
WHERE similarity >= 0.4
RETURN algo.asNode(item1).country, count1, algo.asNode(item2).country, count2,
       intersection, similarity ORDER BY similarity DESC
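For reference, overlap similarity is the size of the intersection divided by the size of the smaller of the two sets; a quick pure-Python check of what the algorithm computes (the stem sets are made up):

```python
def overlap_similarity(a, b):
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

# made-up stem sets for two speeches, for illustration only
spain = {"christma", "peac", "year", "challeng", "famili"}
uk    = {"christma", "peac", "year", "challeng", "hope", "reflect"}

print(overlap_similarity(spain, uk))  # 4 shared stems / min(5, 6) = 0.8
```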

This returns a nearly 70% similarity between the speeches given by the Spanish and UK heads of state… the interested reader can interpret these results 🙂

Screenshot 2019-12-29 at 03.09.20.png

OK, that’s it for QG#11. My plan is to extend this analysis using more advanced NLTK features like part-of-speech tagging or named-entity recognition, and to combine it with WordNet or domain ontologies… but that will be next decade.

Enjoy playing with the code (on GitHub, as always), analyse other datasets, and share your experience.

Happy New Year everyone!


