Create a Data Marvel — Part 3: Hydrating the Model

Author : greensameblue
Publish Date : 2021-01-05 17:40:07


Over the last couple of weeks, we have shown the early steps of this project: the background, building the data model, and the start of the data import from an API to Neo4j. If you haven't read those yet and want to catch up, you can find the posts for Part 1 and Part 2.
Today, we will continue the process of getting as much data as possible from the Marvel Developer Portal API into Neo4j. We saw the initial import Cypher statement last week, but the trickiest piece is the next import statement and the issues we uncovered while running it.
To recap our limitations and foundations so far, I’ve included an image of our Neo4j graph data model for the Marvel data (yours could differ), as well as some highlights.
The Marvel API enforces limits such as 3,000 calls per day and 100 results per call, so our goal was to get as much data as possible within those limits.
Retrieving all comics (43,000+) was too much, so we decided to import characters first.
To do that, we used the APOC library to pull characters by name, one letter of the alphabet at a time.
Next, we need to retrieve the rest of the data (comics, series, events, stories, creators) to finish populating our graph!
(Image: Marvel comics data model in Neo4j)
More Importing — “Hydrating” the Model
One of the best things about APOC is the possibilities it opens for different aspects of data import. Remember that apoc.periodic.iterate procedure we used in our last query to loop through each letter of the alphabet and select the characters that start with that letter? Well, we are going to use that procedure again, but in a slightly different way.
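As a quick refresher, the general shape of the procedure looks something like the sketch below. The MATCH and SET statements here are just placeholders (the checked property is made up) to show where the two pieces go, not part of the actual import:
CALL apoc.periodic.iterate(
  'MATCH (c:Character) RETURN c',   //first statement: what to iterate over
  'SET c.checked = true',           //second statement: what to do for each row, in batches
  {batchSize: 20, retries: 2, params: {}}
);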
This time, we will use the first statement within that procedure to select the characters we added to the database in our previous query, and the second statement will call the API to retrieve the comics for each of those characters. The next query is below with a walkthrough of the syntax. It might look complicated, but don’t worry. It simply builds upon syntax we have already discussed above.


WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate('MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100',
'CALL apoc.util.sleep(2000)
CALL apoc.load.json(c.resourceURI+"/comics?format=comic&formatType=comic&limit=100"+$suffix)
YIELD value
WITH c, value.data.results AS results
 WHERE results IS NOT NULL
UNWIND results AS result
MERGE (comic:ComicIssue {id: result.id})
ON CREATE SET comic.name = result.title,
  comic.issueNumber = result.issueNumber,
  comic.pageCount = result.pageCount,
  comic.resourceURI = result.resourceURI,
  comic.thumbnail = result.thumbnail.path +
                    "." + result.thumbnail.extension
WITH c, comic, result
MERGE (comic)-[r:INCLUDES]->(c)
WITH c, comic, result WHERE result.series IS NOT NULL
UNWIND result.series AS comicSeries
MERGE (series:Series {id: toInteger(split(comicSeries.resourceURI,"/")[-1])})
ON CREATE SET series.name = comicSeries.name,
  series.resourceURI = comicSeries.resourceURI
WITH c, comic, series, result
MERGE (comic)-[r2:BELONGS_TO]->(series)
WITH c, comic, result, result.creators.items AS items
 WHERE items IS NOT NULL
UNWIND items AS item
MERGE (creator:Creator {id: toInteger(split(item.resourceURI,"/")[-1])})
ON CREATE SET creator.name = item.name,
  creator.resourceURI = item.resourceURI
WITH c, comic, result, creator
MERGE (comic)-[r3:CREATED_BY]->(creator)
WITH c, comic, result, result.stories.items AS items
 WHERE items IS NOT NULL
UNWIND items AS item
MERGE (story:Story {id: toInteger(split(item.resourceURI,"/")[-1])})
ON CREATE SET story.name = item.name,
  story.resourceURI = item.resourceURI,
  story.type = item.type
WITH c, comic, result, story
MERGE (comic)-[r4:MADE_OF]->(story)
WITH c, comic, result, result.events.items AS items
 WHERE items IS NOT NULL
UNWIND items AS item
MERGE (event:Event {id: toInteger(split(item.resourceURI,"/")[-1])})
ON CREATE SET event.name = item.name,
  event.resourceURI = item.resourceURI
MERGE (comic)-[r5:PART_OF]->(event)',
{batchSize: 20, iterateList:false, retries:2, params:{suffix:suffix}});
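One thing to note before running it: the $marvel_public and $marvel_private parameters need to already be set in your session. In Neo4j Browser, that setup might look something like this (the values here are placeholders standing in for your own Marvel developer keys):
:param marvel_public => 'your-public-key'
:param marvel_private => 'your-private-key'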
To help process this lengthy query, we will break it up into sections and explain each block. Let us start with the first two sections together.
//First section
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
CALL apoc.periodic.iterate('MATCH (c:Character) WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-()) RETURN c LIMIT 100',
'CALL apoc.util.sleep(2000)
CALL apoc.load.json(c.resourceURI+"/comics?format=comic&formatType=comic&limit=100"+$suffix)
YIELD value
Just as with our initial load query, we start by using the WITH clause to set up and pass the timestamp and URL suffix values that we will use further down. The next lines call the familiar apoc.periodic.iterate to pull the characters already in Neo4j (the MATCH statement). Notice the criteria starting with the WHERE clause. We check the Character nodes to see if the resourceURI field contains a value. Marvel puts the URL path for most entities in the resourceURI field, so this is a simple check to see whether the Character has a URL path for us to retrieve data from. If it doesn't, our call will fail, and we won't find any data.
* Hint: this is also a good way to trim the number of API calls. If we know a call will fail, then we should not waste precious resources on it. :)
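If you want to see how many calls that check actually saves, a quick ad hoc count like the one below (my own helper query, not part of the import itself) shows how many Character nodes are missing a resourceURI:
//How many characters have no resourceURI (and would be skipped)?
MATCH (c:Character)
WHERE c.resourceURI IS NULL
RETURN count(c) AS charactersWithoutUri;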
The next criterion checks whether a relationship of type INCLUDES already exists for the node. This tells us whether we have already retrieved and inserted comics for that character. If the relationship exists, then we do not pull the comic info for that character again, which avoids duplicate calls for entities where we have already added that information.
Finally, we add LIMIT 100 to that query to pull only 100 characters at a time from our Neo4j database. We ran into issues where queries would time out because the server stopped responding. Marvel's server instances probably have a timeout value to ensure users do not hog resources. Or, it could simply be that they need to bolster the architecture a bit to support heavier requests. ;)
Either way, we wanted to reduce the time taken to pull in batches of data, so my colleague suggested a LIMIT clause to create smaller units of work for each run. While this increases the number of times we run the query, it was better than having larger batches fail frequently.
* Note: at this point, we had made 26 calls (one per letter of the alphabet) to load the characters, which gave us around 1,000 characters in our Neo4j instance. If we pull 100 at a time, that means up to 11 runs of this query, each making up to 100 API calls (one for each character in the batch).
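Since each run only processes up to 100 characters, it helps to know how many are still waiting. A small progress check like the one below (again, just a helper sketch reusing the same criteria as the iterate statement, not part of the original import) estimates how many more runs are needed:
//How many characters still need their comics pulled, and roughly how many runs remain?
MATCH (c:Character)
WHERE c.resourceURI IS NOT NULL AND NOT exists((c)<-[:INCLUDES]-())
RETURN count(c) AS charactersRemaining,
       toInteger(ceil(count(c) / 100.0)) AS runsRemaining;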
The second statement within the apoc.periodic.iterate adds 2 seconds of sleep between the API calls for each character. This helps avoid Marvel's server timing out in the middle of one of our calls. After the wait, we use apoc.load.json to hit the API endpoint for the comics pertaining to that character, built from the resourceURI field on our Character nodes. Again, we yield back the JSON object (YIELD value).
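Before kicking off the full hydration, it can be worth testing the endpoint for a single character. The sketch below (a hand-rolled sanity check, assuming at least one character with a resourceURI is already in the graph) builds the same suffix and simply returns how many comics come back for that one character:
//Sanity check: call the comics endpoint for a single character
WITH apoc.date.format(timestamp(), "ms", 'yyyyMMddHHmmss') AS ts
WITH "&ts=" + ts + "&apikey=" + $marvel_public + "&hash=" + apoc.util.md5([ts,$marvel_private,$marvel_public]) AS suffix
MATCH (c:Character) WHERE c.resourceURI IS NOT NULL
WITH c, suffix LIMIT 1
CALL apoc.load.json(c.resourceURI + "/comics?format=comic&formatType=comic&limit=100" + suffix)
YIELD value
RETURN c.name AS character, size(value.data.results) AS comicsReturned;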
//Second section
WITH c, value.data.results AS results
 WHERE results IS NOT NULL
UNWIND results AS result
MERGE (comic:ComicIssue {id: result.id})
ON CREATE SET comic.name = result.title,
  comic.issueNumber = result.issueNumber,
  comic.pageCount = result.pageCount,
  comic.resourceURI = result.resourceURI,
  comic.thumbnail = result.thumbnail.path +
                    "." + result.thumbnail.extension


