The Blog of Ian Mercer.

A Semantic Web ontology / triple Store built on MongoDB

In a previous blog post I discussed building a Semantic Triple Store using SQL Server. That approach works fine, but I'm struck by how many joins are needed to get any results from the data, and as I look to storing much larger ontologies containing billions of triples there are many potential scalability issues with it. So over the past few evenings I tried a different approach and created a semantic store based on MongoDB.

In the MongoDB version of my semantic store I take a different approach to storing the basic building blocks of semantic knowledge representation. For starters, I decided that typical ABox and TBox knowledge has quite different storage requirements, and that smashing all the complex TBox assertions into simple triples and stringing them together with meta fields, only to immediately join them back up whenever needed, just seemed like a bad idea from the NoSQL / document-database perspective.

TBox/ABox: In the ABox you typically find simple triples of the form X-predicate-Y. These store simple assertions about individuals and classes. In the TBox you typically find complex sequents, that's to say complex logic statements having a head (or consequent) and a body (or antecedents). The head is 'entailed' by the body, which means that if you can satisfy all of the body statements then the head is true. In a traditional store all the ABox assertions can be represented as triples and all the complex TBox assertions use quads with a meta field that is used solely to rebuild the sequent with a head and a body. The ABox/TBox distinction is however arbitrary (see http://www.semanticoverflow.com/questions/1107/why-is-it-necessary-to-split-reasoning-into-t-box-and-a-box).
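
To make the head/body idea concrete, here is a minimal Python sketch (not the code behind this store) of a sequent whose head is entailed when every body statement is among the known facts. The `Sequent` class and the `IS_A` and `FOOD` names are hypothetical, made up for illustration, and the sketch ignores variables entirely.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class Sequent:
    head: Triple              # the consequent
    body: FrozenSet[Triple]   # the antecedents

    def entails_head(self, facts: set) -> bool:
        """The head holds when every body statement is a known fact."""
        return self.body <= facts


# ABox: simple assertions. IS_A and FOOD are illustrative names only.
abox = {
    ("GROCERYSTORE", "SELLS", "DAIRY"),
    ("DAIRY", "IS_A", "FOOD"),
}

# TBox-style sequent: the two body statements together entail the head.
rule = Sequent(
    head=("GROCERYSTORE", "SELLS", "FOOD"),
    body=frozenset({
        ("GROCERYSTORE", "SELLS", "DAIRY"),
        ("DAIRY", "IS_A", "FOOD"),
    }),
)

print(rule.entails_head(abox))
```

Removing either body statement from `abox` makes `entails_head` return `False`, which is all that "entailed by the body" means here.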

I also decided that I wanted to use ObjectIds as the primary way of referring to any Entity in the store. Using the full Uri for every Entity is of course possible and MongoDB could have used that as the index, but I wanted to make this efficient and easily shardable across multiple MongoDB servers. The MongoDB ObjectId is ideal for that purpose and will make queries and indexing more efficient.

The first step then was to create a collection that would hold Entities and would permit the mapping from Uri to ObjectId. That was easy: an Entity type inheriting from a Resource type produces a simple document like the one shown below. An index on Uri with a unique condition ensures that it's easy to look up any Entity by Uri and that there can only ever be one mapping to an Id for any Uri.

	RESOURCES COLLECTION - SAMPLE DOCUMENT
	{
		"_id": "4d243af69b1f26166cb7606b",
		"_t": "Entity",
		"Uri": "http://www.w3.org/1999/02/22-rdf-syntax-ns#first"
	}

Although I should use a proper Uri for every Entity I also decided to allow arbitrary strings to be used here so if you are building a simple ontology that never needs to go beyond the bounds of this one system you can forgo namespaces and http:// prefixes and just put a string there, e.g. "SELLS". Since every Entity reference is immediately mapped to an Id and that Id is used throughout the rest of the system it really doesn't matter much.
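
The Uri-to-Id mapping can be sketched in a few lines of Python. This is an in-memory stand-in, not the real collection: the `ResourceRegistry` class and its method names are hypothetical, a dictionary plays the role of the unique index on Uri, and `secrets.token_hex(12)` only imitates the 24-hex-character shape of an ObjectId rather than generating a real one.

```python
import secrets


class ResourceRegistry:
    """In-memory stand-in for the Resources collection (illustrative only)."""

    def __init__(self):
        self._by_uri = {}  # plays the role of the unique index on Uri
        self._by_id = {}

    def get_or_create(self, uri: str) -> str:
        """Return the Id for a Uri, creating the Entity document if needed."""
        existing = self._by_uri.get(uri)
        if existing is not None:
            return existing["_id"]
        # Not a real ObjectId: just 12 random bytes rendered as 24 hex chars.
        doc = {"_id": secrets.token_hex(12), "_t": "Entity", "Uri": uri}
        self._by_uri[uri] = doc
        self._by_id[doc["_id"]] = doc
        return doc["_id"]

    def uri_of(self, entity_id: str) -> str:
        """Map an Id back to the Uri (or plain string) it was created from."""
        return self._by_id[entity_id]["Uri"]


registry = ResourceRegistry()
sells_id = registry.get_or_create("SELLS")  # a plain string works as well as a Uri
first_id = registry.get_or_create(
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#first")

# Asking again for the same Uri returns the same Id, never a second mapping.
print(registry.get_or_create("SELLS") == sells_id)
```

Everything downstream then works with the Id alone, which is why it doesn't matter whether the original string was a full Uri or just "SELLS".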

The next step was to represent simple ABox assertions. Rather than storing each assertion as its own document I created a document that could hold several assertions all related to the same subject. Of course, if there are too many assertions you'll still need to split them up into separate documents but that's easy to do. This move was mainly a convenience for developing the system as it makes it easy to look at all the assertions made concerning a single Entity using MongoVue or the Mongo command line interface but I'm hoping it will also help performance as typical access patterns need to bring in all of the statements concerning a given Entity.

Where a statement requires a literal the literal is stored directly in the document and since literals don't have Uris there is no entry in the resources collection.

To make searches for statements easy and fast I added an array field "SPO" which stores the set of all Ids mentioned anywhere in any of the statements in the document. This array is indexed in MongoDB using the array indexing feature which makes it very efficient to find and fetch every document that mentions a particular Entity. If the Entity only ever appears in the subject position in statements that search will result in possibly just one document coming back which contains all of the assertions about that Entity. For example:

	STATEMENTGROUPS COLLECTION - SAMPLE DOCUMENT
	{
		"_id": "4d243af99b1f26166cb760c6",
		"SPO": [ "4d243af69b1f26166cb7606f", "4d243af69b1f26166cb76079", "4d243af69b1f26166cb7607c" ],
		"Statements": [ 
			{	
				"_id": "4d243af99b1f26166cb760c5",
				"Subject": { "_t": "Entity", "_id": "4d243af69b1f26166cb7606f", "Uri": "GROCERYSTORE" },
				"Predicate": { "_t": "Entity", "_id": "4d243af69b1f26166cb7607c", "Uri": "SELLS" },
				"Object": { "_t": "Entity", "_id": "4d243af69b1f26166cb76079", "Uri": "DAIRY" }
			} 
			... more statements here ... 
			] 
	}
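
The grouping and the "SPO" array can be modelled in plain Python. This is only a sketch under stated assumptions: `Literal`, `make_group` and `StatementGroups` are hypothetical names, the short `id:...` strings stand in for real ObjectIds, `OPENS_AT` is an invented predicate, and a dictionary imitates what MongoDB's index on an array field provides (one index entry per array element, so a lookup by a single Id finds every document whose array contains it).

```python
from collections import defaultdict, namedtuple

# A literal value stored inline in a statement; it has no Id of its own.
Literal = namedtuple("Literal", "value")


def make_group(statements):
    """Build one statement-group document from (subject, predicate, object) tuples.

    Ids are plain strings; Literal objects are kept inline and left out of SPO.
    """
    spo, docs = [], []
    for s, p, o in statements:
        for ref in (s, p, o):
            if isinstance(ref, str) and ref not in spo:
                spo.append(ref)
        docs.append({"Subject": s, "Predicate": p, "Object": o})
    return {"SPO": spo, "Statements": docs}


class StatementGroups:
    """In-memory stand-in for the collection plus its index on SPO."""

    def __init__(self):
        self._docs = []
        self._spo_index = defaultdict(list)  # Id -> positions of documents

    def insert(self, doc):
        position = len(self._docs)
        self._docs.append(doc)
        for ref in doc["SPO"]:
            self._spo_index[ref].append(position)

    def mentioning(self, entity_id):
        """Every document that mentions the Id in any position."""
        return [self._docs[i] for i in self._spo_index.get(entity_id, [])]


groups = StatementGroups()
groups.insert(make_group([
    ("id:GROCERYSTORE", "id:SELLS", "id:DAIRY"),
    ("id:GROCERYSTORE", "id:OPENS_AT", Literal("08:00")),  # invented example
]))

found = groups.mentioning("id:DAIRY")
print(len(found), found[0]["SPO"])
```

Because both statements share a subject, a lookup on `id:GROCERYSTORE` returns the same single document, which is the access pattern the grouping is meant to serve.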

The third and final collection I created is used to store TBox sequents consisting of a head (consequent) and a body (antecedents). Once again I added an array which indexes all of the Entities mentioned anywhere in any of the statements used in the sequent. Below that I have an array of Antecedent statements and then a single Consequent statement. Although the statements don't really need the full serialized version of an Entity (all they need is the _id), I include the Uri and type for each Entity for now. Variables also have Id values, but unlike Entities they are not stored in the Resources collection; they exist only in the Rule collection as part of the statements in a sequent. Variables have no meaning outside a sequent unless they are bound to some other value.

	RULE COLLECTION - SAMPLE DOCUMENT
	{
		"_id": "4d243af99b1f26166cb76102", 
		"References": [ "4d243af69b1f26166cb7607d", "4d243af99b1f26166cb760f8", "4d243af99b1f26166cb760fa", "4d243af99b1f26166cb760fc", "4d243af99b1f26166cb760fe" ],
		"Antecedents": [ 
		{ 
			"_id": "4d243af99b1f26166cb760ff", 
			"Subject": { "_t": "Variable", "_id": "4d243af99b1f26166cb760f8", "Uri": "V3-Subclass8" }, 
			"Predicate": { "_t": "Entity", "_id": "4d243af69b1f26166cb7607d", "Uri": "rdfs:subClassOf" }, 
			"Object": { "_t": "Variable", "_id": "4d243af99b1f26166cb760fa", "Uri": "V3-Class9" } 
		},
		{ 
			"_id": "4d243af99b1f26166cb76100", 
			"Subject": { "_t": "Variable", "_id": "4d243af99b1f26166cb760fa", "Uri": "V3-Class9" }, 
			"Predicate": { "_t": "Variable", "_id": "4d243af99b1f26166cb760fc", "Uri": "V3-Predicate10" }, 
			"Object": { "_t": "Variable", "_id": "4d243af99b1f26166cb760fe", "Uri": "V3-Something11" } 
		}],
		"Consequent": 
		{
			"_id": "4d243af99b1f26166cb76101",
			"Subject": { "_t": "Variable", "_id": "4d243af99b1f26166cb760f8", "Uri": "V3-Subclass8" },
			"Predicate": { "_t": "Variable", "_id": "4d243af99b1f26166cb760fc", "Uri": "V3-Predicate10" }, 
			"Object": { "_t": "Variable", "_id": "4d243af99b1f26166cb760fe", "Uri": "V3-Something11" } 
		} 
	}
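
The sample rule reads: if X is a subclass of C, and C has some statement (C, P, O), then (X, P, O) holds too. As a rough illustration of how variables get bound when such a rule is applied, here is a small Python sketch. It is not the reasoner used with this store: `is_var`, `unify` and `apply_rule` are hypothetical helpers, the `?` prefix is just this sketch's way of marking a variable (the store uses `"_t": "Variable"`), and the `CHEESE`, `STORED_IN` and `FRIDGE` facts are invented.

```python
def is_var(term):
    """Variables are marked with a leading '?' in this sketch."""
    return isinstance(term, str) and term.startswith("?")


def unify(pattern, fact, bindings):
    """Extend bindings so that pattern matches fact, or return None."""
    result = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            # Bind the variable, or check it agrees with an earlier binding.
            if result.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return result


def apply_rule(antecedents, consequent, facts):
    """Instantiate the consequent for every way the antecedents match facts."""
    solutions = [{}]
    for pattern in antecedents:
        next_solutions = []
        for bindings in solutions:
            for fact in facts:
                extended = unify(pattern, fact, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return {tuple(b.get(t, t) for t in consequent) for b in solutions}


# The sample rule, with variables named after those in the document above.
antecedents = [
    ("?Subclass", "rdfs:subClassOf", "?Class"),
    ("?Class", "?Predicate", "?Something"),
]
consequent = ("?Subclass", "?Predicate", "?Something")

facts = {
    ("CHEESE", "rdfs:subClassOf", "DAIRY"),
    ("DAIRY", "STORED_IN", "FRIDGE"),
}

derived = apply_rule(antecedents, consequent, facts)
print(derived)
```

The same variable appearing in two statements is what ties them together: `?Class` must take the same value in both antecedents, so only statements about `DAIRY` can satisfy the second one.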

That is essentially the whole semantic store. I connected it up to a reasoner and have successfully run a few test cases against it. Next time I get a chance to experiment with this technology I plan to try loading a larger ontology and will rework the reasoner so that it can work directly against the database instead of holding in-memory copies of the results of most queries that it performs.

At this point this is JUST AN EXPERIMENT but hopefully someone will find this blog entry useful. I hope later to connect this up to the home automation system so that it can begin reasoning across an ontology of the house and a set of ABox assertions about its current and past state.

Since I'm still relatively new to the semantic web, I'd welcome feedback from any experienced semanticists on this approach to storing ontologies in NoSQL databases.
