MongoDB
Neo4j Meetup in Seattle – some observations
Oct 23rd
I attended the Neo4j Meetup in Seattle this evening. It was an interesting tour around the internals of Neo4j and some of the design decisions behind how they store graphs in a database.
The most interesting thing about Neo4j is the Cypher query language, used to construct graph queries that follow relationships and evaluate conditions on the properties of relationships and nodes. Neo4j shows much promise in terms of being able to represent data in a very natural way and to query it using Cypher in ways that would bring SQL to its knees with join-upon-join-upon-join.
In an earlier blog post I lamented the lack of a single database solution that was the best of all worlds: relational + document + graph + semantic web. Tonight that feeling was compounded: Neo4j is a graph database but it’s missing several key features that could make it much more.
We were privileged to get a first-hand explanation of how Neo4j works internally, but what we saw looked like a work in progress: an unfinished implementation of something that could be so much better. Here are some of the things Neo4j needs to fix before I’ll give it a go:-
1) Stealing bits from one value to give to another to create odd word lengths like 23 bits is so 1980s. I cannot believe this is a worthwhile optimization to make in 2012. Neo should bite the bullet, upgrade their few existing customers and move to a more modern byte-aligned, 64-bit address space. I was equally amazed to see compression schemes implemented for text on disk while other obvious space-saving opportunities were omitted, like declaring some relationships to be one-way only (no reverse queries, thus no need to store the back link). It’s 2012: disk space is essentially limitless; I should never have to hit a file-size limit because someone decided to use 23, 28 or some other random number of bits instead of 64.
2) The extremely limited set of data types. If you want to store JSON you’d better support at least all the common JavaScript options including Dates. Frankly I don’t care if your database is written in Java; it exposes a web API using JSON so that’s what it should support. Also odd was the choice of a linked list, meandering its way through the file, as the way to store properties for a node. IMHO Neo4j should just switch to BSON and put a document size limit on nodes like MongoDB instead of carrying on down this bit-packing, linked-list approach to properties with a partial implementation of types.
3) The lack of file splitting at 2GB/4GB boundaries.
4) Putting nodes and relationships into separate files. Sure this simplifies the access pattern but it’s not going to give good locality to data on disk. An alignment based on disk block sizes with nodes and relationships packed into blocks seems likely to be a much better approach to minimizing disk seeks and reads.
5) Reliance on Lucene to provide indexing. Much as I appreciate Lucene, Neo4j needs built-in indexes; without them it’s impossible to optimize query plans across the graph and the indexes. MongoDB has a good selection of indexing options including 2D geo-spatial indexing; IMHO Neo4j should adopt the same set of options and offer queries that are both good relational database queries and good graph queries, rather than forcing users to pick one or the other whilst handling the interop between two different systems.
In fact, in my ideal world Neo4j and MongoDB would just become one database: a document database that also has great graph-querying capabilities!
I’ll keep monitoring Neo4j but in the meantime it’s full speed ahead with my own implementation of a graph database in MongoDB, with the added twist that in my implementation relationships are all modeled as triples (just like in a semantic web triple-store). My graph-query language isn’t likely to be as powerful as Cypher any time soon, but I have indexes, the ability to query by relationships easily and a robust implementation of properties on each node with support for all common data types. Through my interface-based approach to storing objects with multiple inheritance I also get strongly-typed result sets in C#.
MongoDB substring search with a difference
Nov 25th
It’s quite common to want to search a database for a key that starts with a given string. In SQL you have LIKE and in MongoDB you have regular expressions:
db.customers.find( { name : { $regex : '^acme', $options: 'i' } } );
But what if you want to do the inverse of this? i.e. to search the database for the keys that are themselves substrings of the search string? For example, suppose you are trying to parse a block of text and you want to find phrases in the database that match the start of the current block of text. In SQL you would be dead in the water but with MongoDB you can create a RegEx that matches either the first word, or the first two words, or the first three words, … and so on.
We can construct a regular expression to do this; it might look something like: ^word1($| word2($| word3$))
Here’s a C# method that can create the necessary regular expression:
/// <summary>
/// This generates a regular expression that matches as much of the given phrase as it can from a string
/// i.e. a reverse prefix search where you want the database to supply the prefix and match it against your query
/// useful for matching 'as much as possible from a given input'
/// </summary>
private string generatePrefixRegex(string phrase, bool atStart)
{
string[] bits = phrase.Split(' ');
string result = Regex.Escape(bits[0]); // escape the first word too, like the later ones
// At the start of a sentence, if the first character is upper cased, we should also be looking for a lowercased version of it
if (atStart && char.IsUpper(result[0]))
{
// string.Format takes {0}-style placeholders, not printf-style %0
result = string.Format("({0}|{1}){2}", char.ToLowerInvariant(result[0]), char.ToUpperInvariant(result[0]), result.Substring(1));
}
// Each additional word - either we end the string before it or we must include it
foreach (var bit in bits.Skip(1))
{
result = result + "($| " + Regex.Escape(bit);
}
result = result + "$"; // last word must end string
foreach (var bit in bits.Skip(1))
{
result = result + ")"; // close the expression
}
return "^" + result; // Must start at the start of a Name
}
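To see the generated pattern at work without a database, here is a minimal JavaScript sketch of the same construction, leaving out the sentence-start case handling. The helper names `escapeRegex` and `buildPrefixRegex` and the sample names are made up for this illustration; they are not part of the C# code above.

```javascript
// Hypothetical helper: escape regex metacharacters in a single word.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build ^w1($| w2($| w3$)) from a phrase: each extra word is either
// absent (the key ends here) or must be present in full.
function buildPrefixRegex(phrase) {
  const words = phrase.split(" ").map(escapeRegex);
  let pattern = words[0];
  for (const word of words.slice(1)) {
    pattern += "($| " + word;
  }
  pattern += "$"; // the last word must end the string
  pattern += ")".repeat(words.length - 1); // close each nested group
  return "^" + pattern;
}

// Invented sample keys standing in for values stored in the database.
const storedNames = ["acme", "acme rocket", "acme rocket skates", "acme anvil", "rocket"];

const pattern = buildPrefixRegex("acme rocket skates for sale");
const matches = storedNames.filter((name) => new RegExp(pattern).test(name));

console.log(pattern);
console.log(matches);
```

Every stored key that is a whole-word prefix of the input text matches, so the longest match tells you how much of the input the database recognizes.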
“Remember Everything” … a long-term project
Sep 15th
“Remember Everything” connects nearly all of my projects into one giant solution that, well, remembers everything and has a natural language interface over it.
As inputs it will take information from my home automation system, my whole-network storage crawler, Google calendar, email, Twitter, blog, web crawler, an address-monitoring browser add-on I plan to write, the weather and traffic feeds, and, of course my natural language engine.
All this data will be put into MongoDB and can then be queried. Relationships between entities will be created using a semantic-web triple store and reasoner.
Together these capabilities will allow queries like:-
* Copy all the photos I took last week onto c:\vacationPhotos
* Send img_0938.jpg to mum.
* Who called last Monday?
* Show pictures from last month taken on sunny days.
* What was happening two weeks ago when X called?
* Who called yesterday when I was in a meeting?
* What song was playing around 9pm last night?
* How long did I spend on the phone to my accountant last week?
* What web pages did I read last week about the Semantic Web?
* Send the web page I tweeted about last night to my Kindle.
* We need butter and olives.
* What do I need to buy from QFC? (a semantic shopping list concept, more on that later …)
In addition to the shopping lists concept (that’s already in my home automation system but lacks the semantic reasoning) the system will take any subject-verb-object phrase and remember it and then allow you to query it back later, e.g.
* My son read 20 pages tonight (making the weekly reading report easier)
* How many pages did he read this week?
* I took the red pill at 10AM
* I walked 2 miles this morning
* I ran 4 miles
* How much exercise did I do this week when it wasn’t raining? (summarizing values semantically and mathematically)
* The Audi was serviced this week (remembering schedules so you can check if an item is overdue)
* My BA frequent flyer number is #### (remembering numbers you need to look up often)
* I took the day off on friday (vacation reporting)
* I spent $12.95 on lunch (expense reporting)
…
Whenever you have anything you need to remember the system will be able to remember it, recall it, and where possible aggregate or summarize it using math and/or semantic reasoning (e.g. running subClassOf exercise, butter subClassOf dairy product, dairy products areSoldAt QFC, …).
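As a rough illustration of that kind of reasoning, here is a small JavaScript sketch that follows subClassOf links to answer the shopping-list question. It is only a sketch under assumptions: the `s`/`p`/`o` field names, the helper functions and the olives triple are invented for the example, and a real reasoner would do far more.

```javascript
// Illustrative triples only; in the planned system these would live in MongoDB.
const triples = [
  { s: "butter", p: "subClassOf", o: "dairy product" },
  { s: "dairy product", p: "areSoldAt", o: "QFC" },
  { s: "olives", p: "subClassOf", o: "deli item" }, // invented for the example
  { s: "running", p: "subClassOf", o: "exercise" },
];

// Return the thing itself plus every class reachable by following subClassOf.
function classesOf(thing) {
  const seen = new Set([thing]);
  const queue = [thing];
  while (queue.length > 0) {
    const current = queue.shift();
    for (const t of triples) {
      if (t.s === current && t.p === "subClassOf" && !seen.has(t.o)) {
        seen.add(t.o);
        queue.push(t.o);
      }
    }
  }
  return [...seen];
}

// An item is sold at a store if it, or any class it belongs to, has an areSoldAt triple.
function isSoldAt(item, store) {
  return classesOf(item).some((c) =>
    triples.some((t) => t.s === c && t.p === "areSoldAt" && t.o === store)
  );
}

// "We need butter and olives." followed by "What do I need to buy from QFC?"
const shoppingList = ["butter", "olives"];
const fromQfc = shoppingList.filter((item) => isSoldAt(item, "QFC"));
console.log(fromQfc);
```

Only butter comes back here, because nothing in these sample triples says where deli items are sold.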
By linking my natural language engine to a triple store I can even allow users to teach it new concepts:
By silently monitoring your email, Twitter stream, calendar, activity in the house, … it will be able to answer questions based on the context not just on the content in ways that we take for granted as humans but which are not possible for computers today.
Dynamic persistence with MongoDB – look, no classes! Multiple inheritance in C#!
Sep 6th
In an earlier post I explained a technique to create a class-free persistence layer using MongoDB. [Read that post first, then come back here.]
Since then I’ve refined the techniques involved and created a cleaner implementation that does away with the `.props` collection on each object. Now when you add an interface to an object you get exactly what you expected in the persisted data.
To use it you first need to register the serialization code somewhere in your startup code…
BsonSerializer.RegisterSerializationProvider(new MongoDynamicSerializationProvider());
The Serialization provider is quite simple:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using MongoDB.Bson.Serialization;
namespace MongoData.Dynamic
{
public class MongoDynamicSerializationProvider : IBsonSerializationProvider
{
public IBsonSerializer GetSerializer(Type type)
{
if (typeof(MongoDynamic).IsAssignableFrom(type))
return MongoDynamicBsonSerializer.Instance;
return null;
}
}
}
The serializer is a bit more involved. It uses an interface map to decide what type to return for each serialized object. This is critical because many different .NET types can map onto the same BSON serialized value and only by maintaining this map can we get back to the original type. It’s also critical for handling nested object graphs containing different types.
using System;
using System.Collections.Concurrent;
using System.Dynamic;
using System.Linq;
using System.Linq.Expressions;
using System.Runtime.CompilerServices;
using Microsoft.CSharp.RuntimeBinder;
using MongoDB.Bson.IO;
using MongoDB.Bson.Serialization;
using MongoDB.Bson.Serialization.Serializers;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.IdGenerators;
using System.Collections.Generic;
using ImpromptuInterface;
namespace MongoData.Dynamic
{
public class MongoDynamicBsonSerializer : BsonBaseSerializer
{
private static MongoDynamicBsonSerializer instance = new MongoDynamicBsonSerializer();
public static MongoDynamicBsonSerializer Instance
{
get { return instance; }
}
public override object Deserialize(BsonReader bsonReader, Type nominalType, IBsonSerializationOptions options)
{
var bsonType = bsonReader.CurrentBsonType;
if (bsonType == BsonType.Null)
{
bsonReader.ReadNull();
return null;
}
else if (bsonType == BsonType.Document)
{
var os = new ObjectSerializer();
MongoDynamic md = new MongoDynamic();
bsonReader.ReadStartDocument();
Dictionary<string, Type> typeMap = null;
// scan document first to find interfaces
{
var bookMark = bsonReader.GetBookmark();
if (bsonReader.FindElement(MongoDynamic.InterfacesField))
{
md[MongoDynamic.InterfacesField] = BsonValue.ReadFrom(bsonReader).AsBsonArray.Select(x => x.AsString);
typeMap = md.GetTypeMap();
}
else
{
throw new FormatException("No interfaces defined for this dynamic object - can't deserialize it");
}
bsonReader.ReturnToBookmark(bookMark);
}
while (bsonReader.ReadBsonType() != BsonType.EndOfDocument)
{
var name = bsonReader.ReadName();
if (name == "_id")
{
md[name] = BsonValue.ReadFrom(bsonReader).AsObjectId;
}
else if (name == MongoDynamic.InterfacesField)
{
// Read it and ignore it, we already have it
BsonValue.ReadFrom(bsonReader);
}
else
{
if (typeMap == null) throw new FormatException("No interfaces define for this dynamic object - can't deserialize");
// lookup the type for this element according to the interfaces
Type elementType;
if (typeMap.TryGetValue(name, out elementType))
{
var value = BsonSerializer.Deserialize(bsonReader, elementType);
md[name] = value;
}
else
{
// This is a value that is no longer in the interface, maybe a column you removed
// not really much we can do with it ... but we need to read it and move on
var value = BsonSerializer.Deserialize(bsonReader, typeof(object));
md[name] = value;
// As with all databases, removing elements from the schema is always going to cause problems ...
}
}
}
bsonReader.ReadEndDocument();
return md;
}
else
{
var message = string.Format("Can't deserialize a {0} from BsonType {1}.", nominalType.FullName, bsonType);
throw new FormatException(message);
}
}
public override bool GetDocumentId(object document, out object id, out Type idNominalType, out IIdGenerator idGenerator)
{
MongoDynamic x = (MongoDynamic)document;
id = x._id;
idNominalType = typeof(ObjectId);
idGenerator = new ObjectIdGenerator();
return true;
}
public override void SetDocumentId(object document, object id)
{
MongoDynamic x = (MongoDynamic)document;
x._id = (ObjectId)id;
}
public override void Serialize(BsonWriter bsonWriter, Type nominalType, object value, IBsonSerializationOptions options)
{
if (value == null)
{
bsonWriter.WriteNull();
return;
}
var metaObject = ((IDynamicMetaObjectProvider)value).GetMetaObject(Expression.Constant(value));
var memberNames = metaObject.GetDynamicMemberNames().ToList();
if (memberNames.Count == 0)
{
bsonWriter.WriteNull();
return;
}
bsonWriter.WriteStartDocument();
foreach (var memberName in memberNames)
{
// ToDo: handle all those _id Id id variants?
bsonWriter.WriteName(memberName);
object memberValue;
if (memberName == "_id") memberValue = ((MongoDynamic)value)._id;
else if (memberName == "int") memberValue = ((MongoDynamic)value).@int;
else memberValue = Impromptu.InvokeGet(value, memberName);
if (memberValue == null)
bsonWriter.WriteNull();
else
{
var memberType = memberValue.GetType();
var serializer = BsonSerializer.LookupSerializer(memberType);
serializer.Serialize(bsonWriter, memberType, memberValue, null);
}
}
bsonWriter.WriteEndDocument();
}
}
}
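The core idea in Deserialize, stripped of the BSON plumbing, is a lookup from field name to type built from the interface list stored on the document. The JavaScript sketch below shows only that idea; the `interfaceProperties` table, the `Nickname` field and the function names are assumptions made up for illustration, not part of the C# code.

```javascript
// Hypothetical interface descriptions: property name -> type name.
const interfaceProperties = {
  IId: { _id: "ObjectId" },
  IUser: { Name: "string", LastLogin: "DateTime" },
  IAddress: { City: "string" },
};

// Merge the properties of every interface listed on the document into one lookup.
function getTypeMap(interfaceNames) {
  const typeMap = {};
  for (const name of interfaceNames) {
    Object.assign(typeMap, interfaceProperties[name] || {});
  }
  return typeMap;
}

// Decide how to read each stored field: use the mapped type, or fall back to
// "object" for fields that no current interface declares (e.g. a removed column).
function planDeserialization(doc) {
  const typeMap = getTypeMap(doc.int);
  const plan = {};
  for (const field of Object.keys(doc)) {
    if (field === "_id" || field === "int") continue; // handled separately
    plan[field] = typeMap[field] || "object";
  }
  return plan;
}

const stored = {
  _id: "abc",
  int: ["IId", "IUser"],
  Name: "Smith",
  LastLogin: "2012-09-06",
  Nickname: "Smithy", // no longer declared by any interface
};
const plan = planDeserialization(stored);
console.log(plan);
```

The interface list has to be read before any other field, which is why the C# code scans for it first and then returns to its bookmark.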
And finally, the actual MongoDynamic class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Dynamic;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;
using ImpromptuInterface;
namespace MongoData.Dynamic
{
/// <summary>
/// All MongoDynamic objects support this interface because every object needs an _id in MongoDB
/// </summary>
public interface IId
{
ObjectId _id { get; set; }
}
/// <summary>
/// MongoDynamic is like an ExpandoObject that also understands document Ids and uses ImpromptuInterface
/// to act like any other collection of interfaces ...
/// It can be serialized and deserialized from BSon and thus stored in a MongoDB database.
/// </summary>
/// <remarks>
/// This simple class gives you the ability to define database objects using only .NET interfaces - no classes!
/// Those objects can be dynamically extended to support any interface you want to add to them - polymorphism!
/// When loaded back from the database the object will support all of the interfaces that were ever applied to it.
/// Adding a new field is easy. Removing one works too.
/// All fields must be nullable since they may not be present on earlier instances of an object type.
/// </remarks>
public class MongoDynamic : DynamicObject, IId
{
[BsonId(Order=1)]
public ObjectId _id { get; set; }
// Dumb name for a property - which is why I chose it - very unlikely it will ever conflict with a real property name
public const string InterfacesField = "int";
/// <summary>
/// Interfaces that have been added to this object
/// </summary>
/// <remarks>
/// We always begin by supporting the _id interface
/// Order is important, we need to see this field before we can deserialize any others
/// </remarks>
[BsonElement(InterfacesField, Order=2)]
internal HashSet<string> @int = new HashSet<string>(){ typeof(IId).FullName };
/// <summary>
/// A text version of all interfaces - mostly for debugging purposes, stored in alphabetical order
/// </summary>
[BsonIgnore]
public string InterfacesAsText
{
get { return string.Join(",", this.@int.OrderBy(i => i)); }
}
/// <summary>
/// Add support for an interface to this document if it doesn't already have it
/// </summary>
public T AddLike<T>()
where T : class
{
@int.Add(typeof(T).FullName);
// And also act like any interfaces that interface implements (which will include ones they represent too)
foreach (var @interface in typeof(T).GetInterfaces())
@int.Add(@interface.FullName);
return Impromptu.ActLike<T>(this, this.GetAllInterfaces());
}
/// <summary>
/// Add support for multiple interfaces
/// </summary>
public T AddLike<T>(Type[] otherInterfaces)
where T : class
{
var allInterfaces = otherInterfaces.Concat(new[] { typeof(T) });
var allInterfacesAndDescendants = allInterfaces.Concat(allInterfaces.SelectMany(x => x.GetInterfaces()));
foreach (var @interface in allInterfacesAndDescendants)
@int.Add(@interface.FullName);
return Impromptu.ActLike<T>(this, this.GetAllInterfaces());
}
/// <summary>
/// Cast this object to an interface only if it has previously been created as one of that kind
/// </summary>
public T AsLike<T>()
where T : class
{
if (!this.@int.Contains(typeof(T).FullName)) return null;
else return Impromptu.ActLike<T>(this, this.GetAllInterfaces());
}
// A rather large cache of all interface types loaded into the App Domain
private static List<Type> cacheOfTypes = null;
// A cache of the interface types corresponding to a given 'key' of interface names
private static Dictionary<string, Type[]> cacheOfInterfaces = new Dictionary<string, Type[]>();
public Type[] GetAllInterfaces()
{
// We always behave like an object with an Id plus any other interfaces we have
var key = string.Join(",", this.@int.OrderBy(i => i));
if (!cacheOfInterfaces.ContainsKey(key))
{
if (cacheOfTypes == null)
{
var assemblies = AppDomain.CurrentDomain.GetAssemblies();
cacheOfTypes = assemblies.SelectMany(ass => ass.GetTypes()).Where(t => t.IsInterface).ToList();
}
var interfaces = cacheOfTypes.Where(t => this.@int.Any(i => i == t.FullName));
// Could trim the interfaces to remove any that are inherited from others ...
cacheOfInterfaces.Add(key, interfaces.ToArray());
}
return cacheOfInterfaces[key];
}
/// <summary>
/// Get a mapping from a field name to a type according to the interfaces on this object
/// </summary>
/// <returns></returns>
public Dictionary<string, Type> GetTypeMap()
{
Dictionary<string, Type> typeMap = new Dictionary<string, Type>();
var interfaces = this.GetAllInterfaces();
foreach (var mi in interfaces.SelectMany(intf => intf.GetProperties()))
{
typeMap[mi.Name] = mi.PropertyType;
}
return typeMap;
}
/// <summary>
/// Becomes a Proxy object that acts like it implements all of the interfaces listed as being supported by this Entity
/// </summary>
/// <remarks>
/// Because the returned object supports ALL of the interfaces that have ever been added to this object
/// you can cast it to any of them. This enables a type of polymorphism.
/// </remarks>
public object ActLikeAllInterfacesPresent()
{
return Impromptu.DynamicActLike(this, this.GetAllInterfaces());
}
[BsonIgnore]
// BsonIgnore because Bson serialization will happen on the dynamic interface this class exposes not on this dictionary
private Dictionary<string, object> children = new Dictionary<string, object>();
/// <summary>
/// Fetch a property by name
/// </summary>
public override bool TryGetMember(GetMemberBinder binder, out object result)
{
if (binder.Name == "_id") { result = this._id; return true; }
else if (binder.Name == InterfacesField) { result = this.@int; return true; }
else
{
// TryGetValue leaves result as null when the key is missing; we hope the property is nullable! If not you have an issue
children.TryGetValue(binder.Name, out result);
return true; // when you do a database migration or query a nullable field it won't be in 'children'
}
}
/// <summary>
/// Set a property (e.g. person1.Name = "Smith")
/// </summary>
public override bool TrySetMember(SetMemberBinder binder, object value)
{
if (binder.Name == "_id") { this._id = (ObjectId)value; return true; } // you shouldn't need to use this
if (binder.Name == InterfacesField) throw new AccessViolationException("You cannot set the interfaces directly, use AddLike() instead");
if (!this.GetTypeMap().ContainsKey(binder.Name)) throw new ArgumentException("Property '" + binder.Name + "' not found. You need to call AddLike to specify the interfaces you want to support.");
children[binder.Name] = value;
return true;
}
public override IEnumerable<string> GetDynamicMemberNames()
{
return new[]{"_id", InterfacesField}.Concat(children.Keys);
}
/// <summary>
/// An indexer for use by serialization code
/// </summary>
internal object this[string key]
{
get
{
if (key == "_id") return this._id;
else if (key == InterfacesField) return this.@int;
else return children[key];
}
set
{
if (key == "_id" && value is BsonObjectId) this._id = ((BsonObjectId)value).Value;
else if (key == "_id") this._id = (ObjectId)value;
else if (key == InterfacesField) this.@int = new HashSet<string>((IEnumerable<string>)value);
else children[key] = value;
}
}
}
}
You’ll need ImpromptuInterface (from NuGet) to build this. To use it, you write code like this to save to MongoDB:
MongoDynamic entity = new MongoDynamic();
var user = entity.AddLike<IUser>(); // *** Add the IUser fields to it ...
user.Name = name; // Use it as if it were an IUser
// save it to the database as normal
And to retrieve an object you create a query as normal and then query for MongoDynamic objects like so …
var user = database.GetCollection<MongoDynamic>("***collectionName***").FindOne(query);
if (user == null) return null;
return user.AsLike<IUser>();
Typically you will want your query to reference the field called int (where all the interfaces are stored) so you can query for objects that support a specific type (if you do, you’ll want to add an index on that field). [NB the name was chosen to be one you were unlikely to ever use in .NET]
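To make the shape of that query concrete, here is a small JavaScript sketch that filters sample documents the way an equality match on the int array behaves: a document matches when any element of the array equals the interface name. The collection contents and the `MyApp.*` names are invented; in the mongo shell the equivalent would be along the lines of `db.things.find({ int: "MyApp.IUser" })`, with `things` standing in for your collection name.

```javascript
// Sample documents shaped like the ones MongoDynamic saves; all names are invented.
const collection = [
  { _id: 1, int: ["MongoData.Dynamic.IId", "MyApp.IUser"], Name: "Smith" },
  { _id: 2, int: ["MongoData.Dynamic.IId", "MyApp.IDevice"], Label: "Thermostat" },
  { _id: 3, int: ["MongoData.Dynamic.IId", "MyApp.IUser", "MyApp.IAdmin"], Name: "Jones" },
];

// Matching a single value against an array field succeeds when any element is equal.
function supportsInterface(doc, interfaceName) {
  return doc.int.includes(interfaceName);
}

const users = collection.filter((doc) => supportsInterface(doc, "MyApp.IUser"));
console.log(users.map((doc) => doc.Name));
```

Note that the stored values are the full interface names (namespace included), so the query has to use the same form.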
MongoDynamic objects are polymorphic – you can morph them to support any other interface at any time like so …
user.AddLike<ISomeOtherInterface>();
Home network crawler – cataloging every file on the home LAN with C# and MongoDB
Aug 22nd

Map-Reduce in action: The glaciers in Greenland 'map' the canyon walls into streams of rocks called lateral moraine. As the glaciers merge these rocks are 'reduced' into streams in the middle called 'medial' moraine. (A photo I took over Greenland this summer.)
I’m not a huge fan of RAID arrays – they mostly mean there’s another component to go wrong (the controller card) and when they do go wrong you can lose all your data just as easily as if it were all on one drive. I prefer a multiple copy strategy, an “Amazon S3 for the home” if you like. The downside of this is that there are multiple copies of each file across the home network and as I have several generations of hard drives the mapping from primary to secondary to tertiary is complex and hard to manage! It’s also really hard to find a single file when there are so many places to look and it’s nigh on impossible to be sure that I have the necessary three copies of every important file in the right places at all times.
So this weekend I embarked on a small project to catalog every file, directory and storage volume on the entire home network including drives that are only sometimes connected. The software has been running all weekend and is close to cataloging everything. It’s found 5 million files so far representing over 6TB of data!
The architecture I chose for this software was an agent that runs on each PC to catalog all of the attached volumes. This client uploads all the directories and files that it finds to a MongoDB database running on the same Atom server as the main storage array. The poor little Atom server’s 4GB of RAM has been in constant use but the server has remained responsive, in part because it boots from an SSD drive.
Each volume, directory and file is represented by a document in MongoDB in a single collection. The agent calculates an MD5 hash for each file and extracts metadata from MP3, WMA and JPG files. It also stores all of the key file dates (created, updated, accessed) and references to parent directories, volume identifiers and the currently connected PC. It does not assume that a volume is always connected to the same computer – you can unplug an external drive from one and put it somewhere else and it will all work just fine.
I implemented a re-startable tree scan that uses a couple of DateTime stamps to be able to determine which directories need to be scanned during the current pass and which ones have already been scanned. Any agent can be killed at any time and restarted and it will carry on walking the directory tree right where it left off. It will even continue correctly in the case where you move a volume from one PC to another.
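One way such a restartable pass can work is sketched below in JavaScript. This is a guess at the general technique, not the agent's actual code: the `lastScanned` field, the `runPass` function and the numeric timestamps are all assumptions for illustration. Each directory records when it was last scanned, and a directory is skipped if that stamp is not older than the start of the current pass.

```javascript
// In-memory stand-in for the directory documents; the field names are assumptions.
const directories = [
  { path: "/", lastScanned: 0 },
  { path: "/photos", lastScanned: 0 },
  { path: "/music", lastScanned: 0 },
];

// Scan directories not yet visited in this pass. maxToScan simulates the agent
// being killed part-way through.
function runPass(passStarted, now, maxToScan) {
  const scanned = [];
  for (const dir of directories) {
    if (scanned.length >= maxToScan) break; // agent killed here
    if (dir.lastScanned >= passStarted) continue; // already done in this pass
    dir.lastScanned = now; // record completion so a restart can skip it
    scanned.push(dir.path);
  }
  return scanned;
}

const passStarted = 100; // persisted once when the pass begins
const firstRun = runPass(passStarted, 101, 2); // killed after two directories
const secondRun = runPass(passStarted, 102, 10); // restart picks up the rest

console.log(firstRun);
console.log(secondRun);
```

Because the stamps live with the directory records rather than in the agent's memory, any agent that can reach the volume can resume the pass.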
Each agent uses the Task Parallel Library’s Parallel.ForEach to crawl each volume in parallel and to parse multiple files from each directory simultaneously.
By storing all of the file metadata in MongoDB it’s easy to use Map-Reduce to calculate some interesting statistics for the files on the network.
For example, to create a summary of file sizes I can use a Map function:
function Map() {
if (this.Size && this._t == "FileInformation")
{
var size = this.Size;
if (size < 1024)
emit ("kb", {count:1, size:this.Size});
else if (size < 1024*1024)
emit ("mb", {count:1, size:this.Size});
else if (size < 1024*1024*1024)
emit ("gb", {count:1, size:this.Size});
else if (size < 1024*1024*1024*1024)
emit ("tb", {count:1, size:this.Size});
else
emit ("tb+", {count:1, size:this.Size});
}
}
and a reduce function:
function Reduce(key, arr_values) {
var count = 0;
var size = 0;
for(var i in arr_values)
{
count = count + arr_values[i].count;
size = size + arr_values[i].size;
}
return {count:count, size:size};
}
Map-Reduce operations like this take about 20 minutes to run (on the Atom server with just 4GB of RAM) whereas any query serviced by one of the indexes on the MongoDB collection is almost instantaneous.
I’ve been using the excellent MongoVue to run simple map-reduce scripts like this and to keep track of how quickly the database is growing.
Map-reduce can also be used to find duplicate files – by emitting the MD5 hash as the key and some information about the file as the value I can find every copy of every file across every computer on the home network.
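A duplicate finder along those lines might look like the JavaScript sketch below. The `MD5` and `Path` field names are assumptions about the document shape, and `runMapReduce` is a tiny local stand-in written for this example so the functions can be run without a server; inside MongoDB, `emit` is a global rather than a parameter.

```javascript
// Local stand-in for map-reduce: group emitted values by key, then reduce
// every key that received more than one value.
function runMapReduce(docs, map, reduce) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) map.call(doc, emit);
  const results = {};
  for (const [key, values] of groups) {
    results[key] = values.length === 1 ? values[0] : reduce(key, values);
  }
  return results;
}

// Map: key each file by its hash. The value has the same shape reduce returns,
// so reduced results can safely be reduced again.
function mapByHash(emit) {
  if (this.MD5) emit(this.MD5, { count: 1, paths: [this.Path] });
}

// Reduce: add up the copies and collect where they live.
function reduceCopies(key, values) {
  let count = 0;
  let paths = [];
  for (const value of values) {
    count += value.count;
    paths = paths.concat(value.paths);
  }
  return { count: count, paths: paths };
}

// Invented sample file documents.
const files = [
  { Path: "\\\\pc1\\photos\\a.jpg", MD5: "aaa" },
  { Path: "\\\\nas\\backup\\a.jpg", MD5: "aaa" },
  { Path: "\\\\pc2\\music\\b.mp3", MD5: "bbb" },
];

const byHash = runMapReduce(files, mapByHash, reduceCopies);
const duplicates = Object.keys(byHash).filter((hash) => byHash[hash].count > 1);
console.log(duplicates, byHash);
```

Any hash with a count above one is a duplicated file, and the collected paths show where each copy sits.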
Since I have the file name and metadata for every file on the home network I can also easily find any file using MongoDB’s regex matching feature against the path.
The Hard Parts
For starters you’ll need a library that can handle long file names. Then you’ll need to fix it to provide at least the functionality that FileInfo and DirectoryInfo give you in .NET.
Next you’ll need to learn about reparse-points and hard-links and you’ll need to skip over them because with them in place the file system is not a tree; it’s a cyclical graph in which a simple crawler will quickly get confused or stuck.
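The effect of skipping them can be shown with a toy in-memory file system in JavaScript. Everything here is invented for illustration (the `nodes` table, the `isReparsePoint` flag and the `crawl` function); a real crawler would read that flag from the file attributes instead.

```javascript
// Toy file system: "link" is a junction that points back at the root,
// which turns the tree into a cyclical graph.
const nodes = {
  root: { isReparsePoint: false, children: ["docs", "link"] },
  docs: { isReparsePoint: false, children: ["reports"] },
  reports: { isReparsePoint: false, children: [] },
  link: { isReparsePoint: true, children: ["root"] },
};

// Depth-first crawl that refuses to descend into reparse points.
function crawl(name, visited = []) {
  const node = nodes[name];
  if (node.isReparsePoint) return visited; // skip it: following it would loop forever
  visited.push(name);
  for (const child of node.children) crawl(child, visited);
  return visited;
}

const visited = crawl("root");
console.log(visited);
```

Without the reparse-point check this crawl would recurse from root to link and back to root until the stack overflowed.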
You’ll also want to store the NTFS file Id and the unique Volume ID for every file so you can track it when the file is moved or the removable drive is connected to a different computer.
So how well does it work?
This all seems to work really well. Nearly every volume has now been cataloged. It’s located about 5M files occupying over 6TB of space. The worst case offender for the number of copies of the same file is 100+. I’ve used the find feature in MongoDB to find a file I was missing and I’m better able to plan how to arrange directories and file generations across the various hard drives I have.
What’s next
Well, of course this needs to be connected to the home automation system and my Natural Language engine so you can ask “send a copy of IMG_0228 from last week to X” or “where are all the spreadsheets I created last year?” That will be fairly easy.
After that I hope to incorporate backup features into the agents too so they can automatically keep the required number of copies of each file according to its importance. I’d also like to set up a rotating set of external drives that go in the fire safe when not connected and when they are connected they get updated with the latest copies of all the important files.
I’d also like to be able to get the agents to move whole groups of directories around between drives as juggling the directory layout each time a new hard drive is added to the system is always a time consuming process.
Comments or Questions?
Does everyone else have a hard time managing multiple computers, hard drives, directories and multiple copies of files? What tools do you use to do this? Is there anything commercially available that I could have used instead? Would a tool like this be useful to you? Should I publish the code somewhere? Comments and questions are always welcome here or on twitter.
Class-free persistence and multiple inheritance in C# with MongoDB
May 4th
Much as I appreciate Object Relational Mappers and the C# type system, there’s a lot of work to do if you just want to create and persist a few objects. MongoDB alleviates a lot of that work with its BSON serialization code that converts almost any object into a binary serialized object notation and provides easy round tripping with JSON.
But there’s no getting around the limitations of C# when it comes to multiple inheritance. You can use interfaces to get most of the benefits of multiple inheritance but implementing a tangled set of classes with multiple interfaces on them can lead to a lot of duplicate code.
What if there was a way to do multiple inheritance without ever having to write a class? What if we could simply declare a few interfaces and then ask for an object that implements all of them and a way to persist it to disk and get it back? What if we could later take one of those objects and add another interface to it? “Crazy talk” I hear you say!
Well, maybe not so crazy … take a look at the open source project impromptu-interface and you’ll see some of what you’ll need to make this reality. It can take a .NET dynamic object and turn it into an object that implements a specific interface.
Combine that with a simple MongoDB document store and some cunning logic to link the two together and voila, we have persistent objects that can implement any interface dynamically and there are absolutely no classes in sight anywhere!
Let’s take a look at it in use and then I’ll explain how it works. First, let’s define a few interfaces:
public interface ILegs
{
int Legs { get; set; }
}
public interface IMammal
{
double BodyTemperatureCelcius { get; set; }
}
// Interfaces can use multiple inheritance:
public interface IHuman: IMammal, ILegs
{
string Name { get; set; }
}
// We can have interfaces that apply to specific instances of a class: not all humans are carnivores
public interface ICarnivore
{
string Prey { get; set; }
}
Now let’s take a look at some code to create a few of these new dynamic documents and treat them as implementors of those interfaces. First we need a MongoDB connection:
MongoServer MongoServer = MongoServer.Create(ConnectionString);
MongoDatabase mongoDatabase = MongoServer.GetDatabase("Remember", credentials);
Next we grab a collection where we will persist our objects.
var sampleCollection = mongoDatabase.GetCollection<SimpleDocument>("Sample");
Now we can create some objects adding interfaces to them dynamically and we get to use those strongly typed interfaces to set properties on them.
var person1 = new SimpleDocument();
person1.AddLike<IHuman>().Name = "John";
person1.AddLike<ILegs>().Legs = 2;
person1.AddLike<ICarnivore>().Prey = "Cattle";
sampleCollection.Save(person1);
var monkey1 = new SimpleDocument();
monkey1.AddLike<IMammal>(); // mark as a mammal
monkey1.AddLike<ILegs>().Legs = 2;
monkey1.AddLike<ICarnivore>().Prey = "Bugs";
sampleCollection.Save(monkey1);
Yes, that’s it! That’s all we needed to do to create persisted objects that implement any collection of interfaces. Note that person1 is also recorded as an IMammal (and an ILegs) because AddLike registers every interface that IHuman inherits from, so inheritance amongst interfaces is supported too. We can load the objects back in from MongoDB and get strongly typed versions of them by using .AsLike<T>().
So next, let’s take a look at how we can query for objects that support a given interface and how we can get strongly typed objects back from MongoDB:
var query = Query.EQ("int", typeof(IHuman).Name);
var humans = sampleCollection.Find(query);
Console.WriteLine("Examine the raw documents");
foreach (var doc in humans)
{
Console.WriteLine(doc.ToJson());
}
Console.WriteLine("Use query results strongly typed");
foreach (IHuman human in humans.Select(m => m.AsLike<IHuman>()))
{
Console.WriteLine(human.Name);
}
Console.ReadKey();
So how does this ‘magic’ work? First we need a simple Document class. It can be any old object class, no special requirements. At the moment it does wrap these interface properties up in a document inside it called ‘prop’ making it just a little bit harder to query and index but still fairly easy.
/// <summary>
/// A very simple document object
/// </summary>
public class SimpleDocument : DynamicObject
{
public ObjectId Id { get; set; }
// All other properties are added dynamically and stored wrapped in another Document
[BsonElement("prop")]
protected BsonDocument properties = new BsonDocument();
/// <summary>
/// Interfaces that have been added to this object
/// </summary>
[BsonElement("int")]
protected HashSet<string> interfaces = new HashSet<string>();
/// <summary>
/// Add support for an interface to this document if it doesn't already have it
/// </summary>
public T AddLike<T>()
where T:class
{
interfaces.Add(typeof(T).Name);
foreach (var @interface in typeof(T).GetInterfaces())
interfaces.Add(@interface.Name);
return Impromptu.ActLike<T>(new Proxy(this.properties));
}
/// <summary>
/// Cast this object to an interface only if it has previously been created as one of that kind
/// </summary>
public T AsLike<T>()
where T : class
{
if (!this.interfaces.Contains(typeof(T).Name)) return null;
else return Impromptu.ActLike<T>(new Proxy(this.properties));
}
}
Then we need a simple proxy object to wrap up the properties as a dynamic object that we can feed to Impromptu:
public class Proxy : DynamicObject
{
public BsonDocument document { get; set; }
public Proxy(BsonDocument document)
{
this.document = document;
}
public override bool TryGetMember(GetMemberBinder binder, out object result)
{
BsonValue res;
// A member that has never been set reads back as null instead of throwing a NullReferenceException
result = this.document.TryGetValue(binder.Name, out res) ? res.RawValue : null;
return true; // We always support a member even if we don't have it in the dictionary
}
/// <summary>
/// Set a property (e.g. person1.Name = "Smith")
/// </summary>
public override bool TrySetMember(SetMemberBinder binder, object value)
{
// Use the indexer rather than Add so that setting the same property twice replaces the old value
this.document[binder.Name] = BsonValue.Create(value);
return true;
}
}
And that’s it! There is no other code required. Multiple-inheritance and code-free persistent objects are now a reality! All you need to do is design some interfaces and objects spring magically to life and get persisted easily.
[NOTE: This is experimental code: it's a prototype of an idea that's been bugging me for some time as I look at how to meld Semantic Web classes which have multiple inheritance relationships with C# classes (that don't) and with MongoDB's document-centric storage format. Does everything really have to be stored in a triple-store or is there some hybrid where objects can be stored with their properties and triple-store statements can be reserved for more complex relationships? Can we get semantic web objects back as meaningful C# objects with strongly typed properties on them? It's an interesting challenge and this approach appears to have some merit as a way to solve it.]
MongoDB – Map-Reduce coming from C#
Jan 20th
People coming from traditional relational database thinking and LINQ sometimes struggle to understand map-reduce. One way to understand it is to realize that it’s actually the simple composition of some LINQ operators with which you may already be familiar.
Map reduce is in effect a SelectMany() followed by a GroupBy() followed by an Aggregate() operation.
In a SelectMany() you are projecting a sequence but each element can become multiple elements. This is equivalent to using multiple emit statements in your map operation. The map operation can also choose not to call emit, which is like having a Where() clause inside your SelectMany() operation.
In a GroupBy() you are collecting elements with the same key which is what Map-Reduce does with the key value that you emit from the map operation.
In the Aggregate() or reduce step you are taking the collections associated with each group key and combining them in some way to produce one result for each key. Often this combination is simply adding up a single ’1′ value output with each key from the map step but sometimes it’s more complicated.
One thing you should be aware of with map-reduce in MongoDB is that the reduce operation must accept and output the same data type because it may be applied repeatedly to partial sets of the grouped data. In C# terms, your Aggregate() operation would be applied repeatedly to partial sequences, and then again to the partial results, to get to the final value for each key.
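To make the comparison concrete, here is a minimal sketch of a word count written as SelectMany() + GroupBy() + Aggregate(). It runs entirely in memory with LINQ and does not touch MongoDB; the class and method names (MapReduceSketch, Map, Reduce, MapReduce) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MapReduceSketch
{
    // "map": each document may emit zero, one or many key/value pairs.
    public static IEnumerable<KeyValuePair<string, int>> Map(string document)
    {
        return document
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => word.Length > 2) // like choosing not to call emit
            .Select(word => new KeyValuePair<string, int>(word.ToLowerInvariant(), 1));
    }

    // "reduce": takes ints and returns an int, so it can be re-applied to partial results.
    public static int Reduce(int left, int right)
    {
        return left + right;
    }

    public static Dictionary<string, int> MapReduce(IEnumerable<string> documents)
    {
        return documents
            .SelectMany(doc => Map(doc))                    // map + emit
            .GroupBy(pair => pair.Key, pair => pair.Value)  // collect values by key
            .ToDictionary(
                group => group.Key,
                group => group.Aggregate((a, b) => Reduce(a, b))); // reduce per key
    }

    public static void Main()
    {
        var documents = new[] { "the cat sat", "the dog sat on the cat" };
        foreach (var entry in MapReduce(documents))
            Console.WriteLine(entry.Key + ": " + entry.Value);
    }
}
```

Because Reduce() has the same input and output type, it does not matter whether the values for a key are combined in one pass or in several partial passes, which is the property MongoDB relies on.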
Custom Serialization for MongoDB – Hashset with IBsonSerializable
Jan 7th
The official C# driver for MongoDB does a great job serializing most objects without any extra work. Arrays, Lists and Hashsets all round trip nicely and share a common representation in the actual database as a simple array. This is great if you ever change the C# code from one collection type to another – there’s no migration work to do on the database – you can write a List and retrieve a Hashset and everything just works.
But there are cases where everything doesn’t work and one of these is a Hashset with a custom comparer. The MongoDB driver will instantiate a regular Hashset rather than one with the custom comparer when it materializes objects from the database.
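The effect of losing the comparer can be shown without MongoDB at all. The sketch below uses a hypothetical SampleEntity class with an int Id as a stand-in for a real entity type, and compares a set built with the default comparer (which is what you get back after materialization) with one built with an Id-based comparer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an entity type; the real one would carry a MongoDB Id.
public class SampleEntity
{
    public int Id { get; set; }
    public string Uri { get; set; }
}

// Two entities are considered the same if their Ids match.
public class SampleEntityIdComparer : IEqualityComparer<SampleEntity>
{
    public bool Equals(SampleEntity x, SampleEntity y) { return x.Id == y.Id; }
    public int GetHashCode(SampleEntity obj) { return obj.Id.GetHashCode(); }
}

public static class ComparerDemo
{
    // Passing null for the comparer makes HashSet fall back to the default comparer.
    public static int CountDistinct(IEnumerable<SampleEntity> items, IEqualityComparer<SampleEntity> comparer)
    {
        return new HashSet<SampleEntity>(items, comparer).Count;
    }

    public static void Main()
    {
        var items = new[]
        {
            new SampleEntity { Id = 1, Uri = "SELLS" },
            new SampleEntity { Id = 1, Uri = "SELLS" } // same Id, different instance
        };

        // The default comparer uses reference equality here, so both instances are kept.
        Console.WriteLine(CountDistinct(items, null));
        // The custom comparer treats them as duplicates.
        Console.WriteLine(CountDistinct(items, new SampleEntityIdComparer()));
    }
}
```

A set that silently reverts to the default comparer will start accepting duplicate Ids after a round trip, which is exactly the behaviour a custom serializer needs to prevent.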
Fortunately MongoDB provides several ways to override the default Bson serialization. Unfortunately the documentation doesn’t include an example showing how to do it. So here’s an example using the IBsonSerializable option. In this example I show a custom Hashset with a custom comparer to test for equality. It still serializes to an array in MongoDB but on deserialization it instantiates the correct Hashset with the custom comparer in place.
/// <summary>
/// A HashSet with a specific comparer that prevents duplicate Entity Ids
/// </summary>
public class EntityHashSet : HashSet<Entity>, IBsonSerializable
{
private class EntityComparer : IEqualityComparer<Entity>
{
public bool Equals(Entity x, Entity y) { return x.Id.Equals(y.Id); }
public int GetHashCode(Entity obj) { return obj.Id.GetHashCode(); }
}
public EntityHashSet()
: base(new EntityComparer())
{
}
public EntityHashSet(IEnumerable<Entity> values)
: base (values, new EntityComparer())
{
}
public void Serialize(MongoDB.Bson.IO.BsonWriter bsonWriter, Type nominalType, bool serializeIdFirst)
{
if (nominalType != typeof(EntityHashSet)) throw new ArgumentException("Cannot serialize anything but self");
ArraySerializer<Entity> ser = new ArraySerializer<Entity>();
ser.Serialize(bsonWriter, typeof(Entity[]), this.ToArray(), serializeIdFirst);
}
public object Deserialize(MongoDB.Bson.IO.BsonReader bsonReader, Type nominalType)
{
if (nominalType != typeof(EntityHashSet)) throw new ArgumentException("Cannot deserialize anything but self");
ArraySerializer<Entity> ser = new ArraySerializer<Entity>();
return new EntityHashSet((Entity[])ser.Deserialize(bsonReader, typeof(Entity[])));
}
public bool GetDocumentId(out object id, out IIdGenerator idGenerator)
{
id = null;
idGenerator = null;
return false;
}
public void SetDocumentId(object id)
{
return;
}
}
A Semantic Web ontology / triple Store built on MongoDB
Jan 5th
In a previous blog post I discussed building a Semantic Triple Store using SQL Server. That approach works fine but I’m struck by how many joins are needed to get any results from the data, and as I look to storing much larger ontologies containing billions of triples there are many potential scalability issues with it. So over the past few evenings I decided to try a different approach and created a semantic store based on MongoDB.
In the MongoDB version of my semantic store I take a different approach to storing the basic building blocks of semantic knowledge representation. For starters I decided that typical ABox and TBox knowledge has really quite different storage requirements, and that smashing all the complex TBox assertions into simple triples and stringing them together with meta fields, only to immediately join them back up whenever needed, just seemed like a bad idea from the NOSQL / document-database perspective.
TBox/ABox: In the ABox you typically find simple triples of the form X-predicate-Y. These store simple assertions about individuals and classes. In the TBox you typically find complex sequents, that’s to say complex logic statements having a head (or consequent) and a body (or antecedents). The head is ‘entailed’ by the body, which means that if you can satisfy all of the body statements then the head is true. In a traditional store all the ABox assertions can be represented as triples and all the complex TBox assertions use quads with a meta field that is used solely to rebuild the sequent with a head and a body. The ABox/TBox distinction is however arbitrary (see http://www.semanticoverflow.com/questions/1107/why-is-it-necessary-to-split-reasoning-into-t-box-and-a-box).
I also decided that I wanted to use ObjectIds as the primary way of referring to any Entity in the store. Using the full Uri for every Entity is of course possible and MongoDB could have used that as the index but I wanted to make this efficient and easily shardable across multiple MongoDB servers. The MongoDB ObjectId is ideal for that purpose and will make queries and indexing more efficient.
The first step then was to create a collection that would hold Entities and would permit the mapping from Uri to ObjectId. That was easy: an Entity type inheriting from a Resource type produces a simple document like the one shown below. An index on Uri with a unique condition ensures that it’s easy to look up any Entity by Uri and that there can only ever be one mapping to an Id for any Uri.
RESOURCES COLLECTION - SAMPLE DOCUMENT
{
"_id": "4d243af69b1f26166cb7606b",
"_t": "Entity",
"Uri": "http://www.w3.org/1999/02/22-rdf-syntax-ns#first"
}
Although I should use a proper Uri for every Entity I also decided to allow arbitrary strings to be used here so if you are building a simple ontology that never needs to go beyond the bounds of this one system you can forgo namespaces and http:// prefixes and just put a string there, e.g. “SELLS”. Since every Entity reference is immediately mapped to an Id and that Id is used throughout the rest of the system it really doesn’t matter much.
The next step was to represent simple ABox assertions. Rather than storing each assertion as its own document I created a document that could hold several assertions all related to the same subject. Of course, if there are too many assertions you’ll still need to split them up into separate documents but that’s easy to do. This move was mainly a convenience for developing the system as it makes it easy to look at all the assertions made concerning a single Entity using MongoVue or the Mongo command line interface but I’m hoping it will also help performance as typical access patterns need to bring in all of the statements concerning a given Entity.
Where a statement requires a literal the literal is stored directly in the document and since literals don’t have Uris there is no entry in the resources collection.
To make searches for statements easy and fast I added an array field “SPO” which stores the set of all Ids mentioned anywhere in any of the statements in the document. This array is indexed in MongoDB using the array indexing feature which makes it very efficient to find and fetch every document that mentions a particular Entity. If the Entity only ever appears in the subject position in statements that search will result in possibly just one document coming back which contains all of the assertions about that Entity. For example:
STATEMENTGROUPS COLLECTION - SAMPLE DOCUMENT
{
"_id": "4d243af99b1f26166cb760c6",
"SPO": [
"4d243af69b1f26166cb7606f",
"4d243af69b1f26166cb76079",
"4d243af69b1f26166cb7607c"
],
"Statements": [
{
"_id": "4d243af99b1f26166cb760c5",
"Subject": {
"_t": "Entity",
"_id": "4d243af69b1f26166cb7606f",
"Uri": "GROCERYSTORE"
},
"Predicate": {
"_t": "Entity",
"_id": "4d243af69b1f26166cb7607c",
"Uri": "SELLS"
},
"Object": {
"_t": "Entity",
"_id": "4d243af69b1f26166cb76079",
"Uri": "DAIRY"
}
}
... more statements here ...
]
}
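As a rough in-memory illustration of what the “SPO” array buys you, the sketch below derives the set of Ids from the statements in a group and uses it to find every group that mentions a given Id. It is not the code used by the store: the Statement, StatementGroup and SpoSketch types are hypothetical, plain strings stand in for ObjectIds, and a linear scan stands in for MongoDB's array index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Statement
{
    // Ids of the three parts of a triple; strings stand in for ObjectIds.
    public string Subject, Predicate, Object;
}

public class StatementGroup
{
    public List<Statement> Statements = new List<Statement>();

    // The set of every Id mentioned anywhere in the group (the "SPO" array).
    public HashSet<string> SPO
    {
        get
        {
            return new HashSet<string>(
                Statements.SelectMany(s => new[] { s.Subject, s.Predicate, s.Object }));
        }
    }
}

public static class SpoSketch
{
    // In MongoDB this would be an indexed query on the array field; here it is a scan.
    public static List<StatementGroup> FindMentioning(IEnumerable<StatementGroup> groups, string id)
    {
        return groups.Where(g => g.SPO.Contains(id)).ToList();
    }

    public static void Main()
    {
        var group = new StatementGroup();
        group.Statements.Add(new Statement
        {
            Subject = "4d243af69b1f26166cb7606f",   // GROCERYSTORE
            Predicate = "4d243af69b1f26166cb7607c", // SELLS
            Object = "4d243af69b1f26166cb76079"     // DAIRY
        });

        var hits = FindMentioning(new[] { group }, "4d243af69b1f26166cb76079");
        Console.WriteLine(hits.Count);
    }
}
```

The point of keeping one flat set per document is that a single lookup answers “which documents mention this Entity?” regardless of whether it appears as subject, predicate or object.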
The third and final collection I created is used to store TBox sequents consisting of a head (consequent) and a body (antecedents). Once again I added an array which indexes all of the Entities mentioned anywhere in any of the statements used in the sequent. Below that I have an array of Antecedent statements and then a single Consequent statement. Although the statements don’t really need the full serialized version of an Entity (all they need is the _id) I include the Uri and type for each Entity for now. Variables also have Id values but unlike Entities, variables are not stored in the Resources collection; they exist only in the Rule collection as part of the statements of a sequent. Variables have no meaning outside a sequent unless they are bound to some other value.
RULE COLLECTION - SAMPLE DOCUMENT
{
"_id": "4d243af99b1f26166cb76102",
"References": [
"4d243af69b1f26166cb7607d",
"4d243af99b1f26166cb760f8",
"4d243af99b1f26166cb760fa",
"4d243af99b1f26166cb760fc",
"4d243af99b1f26166cb760fe"
],
"Antecedents": [
{
"_id": "4d243af99b1f26166cb760ff",
"Subject": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760f8",
"Uri": "V3-Subclass8"
},
"Predicate": {
"_t": "Entity",
"_id": "4d243af69b1f26166cb7607d",
"Uri": "rdfs:subClassOf"
},
"Object": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760fa",
"Uri": "V3-Class9"
}
},
{
"_id": "4d243af99b1f26166cb76100",
"Subject": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760fa",
"Uri": "V3-Class9"
},
"Predicate": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760fc",
"Uri": "V3-Predicate10"
},
"Object": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760fe",
"Uri": "V3-Something11"
}
}
],
"Consequent": {
"_id": "4d243af99b1f26166cb76101",
"Subject": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760f8",
"Uri": "V3-Subclass8"
},
"Predicate": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760fc",
"Uri": "V3-Predicate10"
},
"Object": {
"_t": "Variable",
"_id": "4d243af99b1f26166cb760fe",
"Uri": "V3-Something11"
}
}
}
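To show what “the head is entailed by the body” means for a rule shaped like the sample above, here is a small in-memory sketch. It is not the reasoner I connected to the store: the Triple and SequentSketch types are hypothetical, terms are plain strings, and a leading '?' marks a variable purely as a convention for this example. It finds every consistent binding of variables that satisfies all antecedents against a list of facts, then substitutes those bindings into the consequent.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Triple
{
    public string Subject, Predicate, Object;
    public Triple(string s, string p, string o) { Subject = s; Predicate = p; Object = o; }
    public override string ToString() { return Subject + " " + Predicate + " " + Object; }
}

public static class SequentSketch
{
    // Convention for this sketch only: a term starting with '?' is a variable.
    static bool IsVariable(string term) { return term.Length > 0 && term[0] == '?'; }

    // Match one pattern term against one fact term, extending the bindings if needed.
    static bool Bind(string pattern, string value, Dictionary<string, string> bindings)
    {
        if (!IsVariable(pattern)) return pattern == value;
        string bound;
        if (bindings.TryGetValue(pattern, out bound)) return bound == value;
        bindings[pattern] = value;
        return true;
    }

    // Throws KeyNotFoundException if the consequent uses a variable no antecedent bound.
    static string Resolve(string term, Dictionary<string, string> bindings)
    {
        return IsVariable(term) ? bindings[term] : term;
    }

    // Enumerate every set of bindings that satisfies antecedents[index..] given the facts.
    static IEnumerable<Dictionary<string, string>> Match(
        IList<Triple> antecedents, int index, IList<Triple> facts, Dictionary<string, string> bindings)
    {
        if (index == antecedents.Count) { yield return bindings; yield break; }
        var pattern = antecedents[index];
        foreach (var fact in facts)
        {
            var attempt = new Dictionary<string, string>(bindings); // copy so failures don't leak
            if (Bind(pattern.Subject, fact.Subject, attempt) &&
                Bind(pattern.Predicate, fact.Predicate, attempt) &&
                Bind(pattern.Object, fact.Object, attempt))
            {
                foreach (var result in Match(antecedents, index + 1, facts, attempt))
                    yield return result;
            }
        }
    }

    // Produce one consequent triple for each way of satisfying the whole body.
    public static List<Triple> Entail(IList<Triple> antecedents, Triple consequent, IList<Triple> facts)
    {
        return Match(antecedents, 0, facts, new Dictionary<string, string>())
            .Select(b => new Triple(
                Resolve(consequent.Subject, b),
                Resolve(consequent.Predicate, b),
                Resolve(consequent.Object, b)))
            .ToList();
    }

    public static void Main()
    {
        var antecedents = new List<Triple>
        {
            new Triple("?sub", "rdfs:subClassOf", "?cls"),
            new Triple("?cls", "?p", "?o")
        };
        var consequent = new Triple("?sub", "?p", "?o");
        var facts = new List<Triple>
        {
            new Triple("DAIRY", "rdfs:subClassOf", "FOOD"),
            new Triple("FOOD", "SOLDBY", "GROCERYSTORE")
        };

        foreach (var derived in Entail(antecedents, consequent, facts))
            Console.WriteLine(derived);
    }
}
```

The facts here are invented for the example. A real reasoner would also need to feed derived statements back in as new facts and avoid re-deriving the same ones, which this sketch leaves out.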
That is essentially the whole semantic store. I connected it up to a reasoner and have successfully run a few test cases against it. Next time I get a chance to experiment with this technology I plan to try loading a larger ontology and will rework the reasoner so that it can work directly against the database instead of taking in-memory copies of most queries that it performs.
At this point this is JUST AN EXPERIMENT but hopefully someone will find this blog entry useful. I hope later to connect this up to the home automation system so that it can begin reasoning across an ontology of the house and a set of ABox assertions about its current and past state.
Since I’m still relatively new to the semantic web I’d welcome feedback on this approach to storing ontologies in NOSQL databases from any experienced semanticists.
MongoDB C# Driver – arrays, lists and hashsets
Dec 14th
Here’s a nice feature of the C# MongoDB driver: when you save .NET arrays, lists or HashSets (essentially any IEnumerable<T>) to MongoDB you can retrieve them as any other IEnumerable<T>. This means you can migrate your business objects between these different representations without having to migrate anything in your database. It also means that any other language can access the same MongoDB database without needing to know anything about .NET data types.
For example, the following will all serialize to the same BSON data and any of them can be used to read it back.
public class Test1
{
[BsonId]
public ObjectId Id { get; set; }
public List<string> array { get; set; }
}
public class Test2
{
[BsonId]
public ObjectId Id { get; set; }
public string[] array { get; set; }
}
public class Test3
{
[BsonId]
public ObjectId Id { get; set; }
public HashSet<string> array { get; set; }
}
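A quick way to see this without a running server is to round-trip through a BsonDocument in memory. This is a sketch that assumes the MongoDB.Bson assembly from the official C# driver is referenced; it redeclares Test1 and Test3 from above so that it stands alone, and the RoundTripSketch class is made up for the example.

```csharp
using System;
using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Bson.Serialization;
using MongoDB.Bson.Serialization.Attributes;

public class Test1
{
    [BsonId]
    public ObjectId Id { get; set; }
    public List<string> array { get; set; }
}

public class Test3
{
    [BsonId]
    public ObjectId Id { get; set; }
    public HashSet<string> array { get; set; }
}

public static class RoundTripSketch
{
    // Serialize a Test1 (List<string>) and read the same BSON back as a Test3 (HashSet<string>).
    public static Test3 ListToHashSet(Test1 original)
    {
        BsonDocument document = original.ToBsonDocument();
        return BsonSerializer.Deserialize<Test3>(document);
    }

    public static void Main()
    {
        var original = new Test1
        {
            Id = ObjectId.GenerateNewId(),
            array = new List<string> { "a", "b" }
        };

        Console.WriteLine(original.ToBsonDocument().ToJson()); // "array" is a plain BSON array
        Test3 copy = ListToHashSet(original);
        Console.WriteLine(copy.array.Count);
    }
}
```

The stored document carries no hint of which .NET collection type produced it, which is why the types can be swapped freely on the C# side.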

