Posts tagged MongoDB

VariableWithHistory – making persistence invisible, making history visible

In a typical .NET application variables have a short lifetime. When they go out of scope or the application ends their value is lost. Also, you cannot ask a variable what its value was 1 hour ago, or what its average, maximum or minimum value was yesterday.

Yet, such a variable would be extremely useful when writing a Home Automation System because you often need to make comparisons between a current value and some historical average, or between two ranges (e.g. was the kitchen more or less occupied than yesterday). Now, normally you wouldn’t want to mix persistence up with the representation of a value in your code (see ‘Separation of Concerns’), but in this case I decided that it was worth mixing the two concepts because the benefits of doing so were so great.

So I created a class called VariableWithHistory<T> which is the abstract base class for IntegerWithHistory, DoubleWithHistory, BoolWithHistory, StringWithHistory and a number of others.  The first property worth noting on these classes is the .Current property.  This always gives you the latest value that has been set.  Setting the .Current value stores both the value and the DateTime (Utc of course) at which the value became current.  A history of all past values is maintained in MongoDB up to some suitable limit per variable (each variable can have its own adjustable history size in bytes by using MongoDB’s capped collections).  If the new value is the same as the old one no update is made, the implicit behavior being that the value changed and stayed there until it changes again, so if you want to know what the value is now it is the same as the last change recorded.

With this new variable type in place any object in the house can have any number of persistent fields on it (bool occupied, double temperature, string triggeredBy, …).  Updating these values is as simple as assigning to their .Current property.  When the system loads, each value comes back with the value it had when the system was shut down.  To accomplish this every VariableWithHistory is given a unique id (based on the unique id of it’s parent, e.g. a room).

So far so good, shut down, restart and the house doesn’t need to query a device to know if it’s on or off and all the long running Sequential Logic Blocks I use for rules (e.g. .Delay(days:2)) carry on running as if nothing happened.  This is particularly useful since I typically deploy a new version almost every day and some logic blocks have long delays built into them.

But besides providing simple recovery from a reboot, these persistent variables allow me to do some much more interesting things.

int CountTransitions(DateTimeRange range, T direction);
Counts how many transitions there have been to the value T in a given time range, e.g. how many times did the driveway alarm go ‘true’ this evening?

Dictionary<T, double> Fractional(DateTimeRange range);
Builds a histogram of all the values seen in the time range, e.g. 50% hot, 20% cold, 30% warm for a string variable that tracks temperature

DateTimeOffset LastChangedState
e.g. when was this sensor last triggered?

TimedValue<T> ValueAtTime(DateTimeOffset dt)
What was the value at a given time in the past, e.g. what was the temperature at the same time yesterday?

Each specific type of VariableWithHistory<T> may also have additional methods relevant to the type T.  For example, on DoubleWithHistory there is a method double Average(DateTimeOffset minValue, DateTimeOffset maxValue) which gets the average value over the specified time range.  On BoolWithHistory there is a method double PercentageTrue(DateTimeRange range) which you could use to find the average occupancy for a room yesterday.

 

My initial implementation waited for the database to write each update before allowing any queries but now I simply cache the Current value and assume that queries will probably get executed after updates and that the average temperature yesterday is close enough with or without the last 100ms of updates.  I did try to keep this class isolated from MongoDB but in the end the benefit of some of the atomic update capabilities in MongoDB made it easier to just take the dependency.

My previous implementation of this feature used my own in-memory database, MongoDB has slowed it down a bit but I’ve gained the ability to archive terabytes of sensor data which should prove useful for my next project which is to add some machine learning to the system.

 

 

 

 

 

 

 

Neo4j Meetup in Seattle – some observations

I attended the Neo4j Meetup in Seattle this evening. It was an interesting tour around the internals of Neo4j and some of the design decisions behind how they store graphs in a database.

The most interesting thing about Neo4j is the Cypher query language used to construct graph queries that follow relationships, evaluate conditions on properties on relationships and nodes. Neo4j shows much promise in terms of being able to represent data in a very natural way and to query it using Cypher in ways that would bring SQL to its knees with join-upon-join-upon-join.

In an earlier blog post I lamented the lack of a single database solution that was the best of all worlds: relational + document + graph + semantic web. Tonight that feeling was compounded: Neo4j is a graph database but it’s missing several key features that could make it much more.

We were privileged to get a first hand explanation as to how Neo4j worked internally but what we saw looked like a work in progress: an unfinished implementation of something that could be so much better. Here’s some of the things Neo4j needs to fix before I’ll give it a go:-

1) Stealing bits from one value to give to another to create odd word lengths like 23 bits is so 1980′s. I cannot believe this is a worthwhile optimization to make in 2012. Neo should bite the bullet, upgrade their few existing customers and move to a more modern byte aligned, 64-bit address space. I was equally amazed at the implementation of compression schemes for text on disk but the omission of other obvious space-saving opportunities like declaring some relationships to be one-way only (no reverse queries, thus no need to store the back link). It’s 2012: disk space is essentially limitless; I should never have to hit a file-size limit because someone decided to use 23, 28 or some other random number of bits instead of 64.

2) The extremely limited set of data types. If you want to store json you’d better support at least all the common Javascript options including Dates. Frankly I don’t care if your database is written in Java, it exposes a web api using json so that’s what it should support. Also odd was the choice of a linked list, meandering its way through the file, as the way to store properties for a node. IMHO Neo4j should just switch to Bson and put a document size limit on nodes like MongoDB instead of carrying on down this bit-packing, linked-list approach to properties with a partial implementation of types.

3) The lack of file splitting at 2GB/4GB boundaries.

4) Putting nodes and relationships into separate files. Sure this simplifies the access pattern but it’s not going to give good locality to data on disk. An alignment based on disk block sizes with nodes and relationships packed into blocks seems likely to be a much better approach to minimizing disk seeks and reads.

3) Reliance on Lucene to provide indexing. Much as I appreciate Lucene, Neo4j needs built-in indexes; without them it’s impossible to optimize query plans across the graph and the indexes. MongoDB has a good selection of indexing options including 2D geo-spatial indexing; IMHO Neo4j should adopt the same set of options and offer queries that are both good relational database queries and good graph queries not force their users to pick one or the other whilst handling the interop between two different systems.

In fact, in my ideal world Neo4j and MongoDB would just become one database: a document database that also has great graph-querying capabilities!

I’ll keep monitoring Neo4j but in the meantime it’s full speed ahead with my own implementation of a graph database in MongoDB with the added twist that in my implementation, relationships are all modeled as triples (just like in a semantic web triple-store). My graph-query language isn’t likely to be as powerful as Cypher any time soon but I have indexes, the ability to query by relationships easily and a robust implementation of properties on each node with support for all common data-types and through my interface-based approach to storing objects with multiple-inheritance I get strongly-typed result sets in C#.

MongoDB substring search with a difference

It’s quite common to want to search a database for a key that starts with a given string. In SQL you have LIKE and in MongoDB you have regular expressions:

db.customers.find( { name : { $regex : '^acme', $options: 'i' } } );

But what if you want to do the inverse of this? i.e. to search the database for the keys that are themselves substrings of the search string? For example, suppose you are trying to parse a block of text and you want to find phrases in the database that match the start of the current block of text. In SQL you would be dead in the water but with MongoDB you can create a RegEx that matches either the first word, or the first two words, or the first three words, … and so on.

We can construct a regular expression to do this, it might look something like: ^word1($| word2($| word3$))

Here’s a C# method that can create the necessary regular expression:

        /// <summary>
        /// This generates a regular expression that matches as much of the given phrase as it can from a string
        /// i.e. a reverse prefix search where you want the database to supply the prefix and match it against your query
        /// useful for matching 'as much as possible from a given input'
        /// </summary>
        private string generatePrefixRegex(string phrase, bool atStart)
        {
            string[] bits = phrase.Split(' ');
            string result = bits[0];

            // At the start of a sentence, if the first character is upper cased, we should also be looking for a lowercased verson of it
            if (atStart && char.IsUpper(result[0]))
            {
                result = string.Format("(%0|%1)%2", char.ToLowerInvariant(result[0]), char.ToUpperInvariant(result[0]), result.Substring(1));
            }

            // Each additional word - either we end the string before it or we must include it

            foreach (var bit in bits.Skip(1))
            {
                result = result + "($| " + Regex.Escape(bit);
            }

            result = result + "$";                      // last word must end string

            foreach (var bit in bits.Skip(1))
            {
                result = result + ")";                 // close the expression
            }
            return "^" + result;                        // Must start at the start of a Name
        }

Dynamic persistence with MongoDB – look, no classes! Multiple inheritance in C#!

In an earlier post I explained a technique to create a class-free persistence layer using MongoDB. [Read that post first, then come back here.]

Since then I’ve refined the techniques involved and created a cleaner implementation that does away with the `.props` collection on each object. Now when you add an interface to an object you get exactly what you expected in the persisted data.

To use it you first need to register the serialization code somewhere in your startup code…

            BsonSerializer.RegisterSerializationProvider(new MongoDynamicSerializationProvider());

The Serialization provider is quite simple:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using MongoDB.Bson.Serialization;

namespace MongoData.Dynamic
{
    public class MongoDynamicSerializationProvider : IBsonSerializationProvider
    {

        public IBsonSerializer GetSerializer(Type type)
        {
            if (typeof(MongoDynamic).IsAssignableFrom(type))
                return MongoDynamicBsonSerializer.Instance;
            return null;
        }
    }
}

The serializer is a bit more involved. It uses an interface map to decide what type to return for each serialized object. This is critical because many different .NET types can map onto the same BSon serialized value and only by maintaining this map can we get back to the original type. It’s also
critical for handling nested object graphs containing different types.

using System;
using System.Collections.Concurrent;
using System.Dynamic;
using System.Linq;
using System.Linq.Expressions;
using System.Runtime.CompilerServices;
using Microsoft.CSharp.RuntimeBinder;
using MongoDB.Bson.IO;
using MongoDB.Bson.Serialization;
using MongoDB.Bson.Serialization.Serializers;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.IdGenerators;
using System.Collections.Generic;
using ImpromptuInterface;

namespace MongoData.Dynamic
{
    public class MongoDynamicBsonSerializer : BsonBaseSerializer
    {
        private static MongoDynamicBsonSerializer instance = new MongoDynamicBsonSerializer();

        public static MongoDynamicBsonSerializer Instance
        {
            get { return instance; }
        }

        public override object Deserialize(BsonReader bsonReader, Type nominalType, IBsonSerializationOptions options)
        {
            var bsonType = bsonReader.CurrentBsonType;
            if (bsonType == BsonType.Null)
            {
                bsonReader.ReadNull();
                return null;
            }
            else if (bsonType == BsonType.Document)
            {
                var os = new ObjectSerializer();
                MongoDynamic md = new MongoDynamic();
                bsonReader.ReadStartDocument();

                Dictionary<string, Type> typeMap = null;

                // scan document first to find interfaces
                {
                    var bookMark = bsonReader.GetBookmark();
                    if (bsonReader.FindElement(MongoDynamic.InterfacesField))
                    {
                        md[MongoDynamic.InterfacesField] = BsonValue.ReadFrom(bsonReader).AsBsonArray.Select(x => x.AsString);
                        typeMap = md.GetTypeMap();
                    }
                    else
                    {
                        throw new FormatException("No interfaces defined for this dynamic object - can't deserialize it");
                    }
                    bsonReader.ReturnToBookmark(bookMark);
                }

                while (bsonReader.ReadBsonType() != BsonType.EndOfDocument)
                {
                    var name = bsonReader.ReadName();

                    if (name == "_id")
                    {
                        md[name] = BsonValue.ReadFrom(bsonReader).AsObjectId;
                    }
                    else if (name == MongoDynamic.InterfacesField)
                    {
                        // Read it and ignore it, we already have it
                        BsonValue.ReadFrom(bsonReader);
                    }
                    else
                    {
                        if (typeMap == null) throw new FormatException("No interfaces define for this dynamic object - can't deserialize");
                        // lookup the type for this element according to the interfaces
                        Type elementType;
                        if (typeMap.TryGetValue(name, out elementType))
                        {
                            var value = BsonSerializer.Deserialize(bsonReader, elementType);
                            md[name] = value;
                        }
                        else
                        {
                            // This is a value that is no longer in the interface, maybe a column you removed
                            // not really much we can do with it ... but we need to read it and move on
                            var value = BsonSerializer.Deserialize(bsonReader, typeof(object));
                            md[name] = value;

                            // As with all databases, removing elements from the schema is always going to cause problems ... 
                        }
                    }
                }
                bsonReader.ReadEndDocument();
                return md;
            }
            else
            {
                var message = string.Format("Can't deserialize a {0} from BsonType {1}.", nominalType.FullName, bsonType);
                throw new FormatException(message);
            }
        } 
    

        public override bool GetDocumentId(object document, out object id, out Type idNominalType, out IIdGenerator idGenerator)
        {
            MongoDynamic x = (MongoDynamic)document;
            id = x._id;
            idNominalType = typeof(ObjectId);
            idGenerator = new ObjectIdGenerator();
            return true;
        }

        public override void SetDocumentId(object document, object id)
        {
            MongoDynamic x = (MongoDynamic)document;
            x._id = (ObjectId)id;
        }

        public override void Serialize(BsonWriter bsonWriter, Type nominalType, object value, IBsonSerializationOptions options)
        {
            if (value == null)
            {
                bsonWriter.WriteNull();
                return;
            }
            var metaObject = ((IDynamicMetaObjectProvider)value).GetMetaObject(Expression.Constant(value));
            var memberNames = metaObject.GetDynamicMemberNames().ToList();
            if (memberNames.Count == 0)
            {
                bsonWriter.WriteNull();
                return;
            }

            bsonWriter.WriteStartDocument();
            foreach (var memberName in memberNames)
            {
                // ToDo: handle all those _id Id id variants?
                bsonWriter.WriteName(memberName);

                object memberValue;
                if (memberName == "_id") memberValue = ((MongoDynamic)value)._id;
                else if (memberName == "int") memberValue = ((MongoDynamic)value).@int;
                else memberValue = Impromptu.InvokeGet(value, memberName);

                if (memberValue == null)
                    bsonWriter.WriteNull();
                else
                {
                    var memberType = memberValue.GetType();
                    var serializer = BsonSerializer.LookupSerializer(memberType);
                    serializer.Serialize(bsonWriter, memberType, memberValue, null);
                }
            }
            bsonWriter.WriteEndDocument();
        }
    }
}

And finally, the actual MongoDynamic class:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Dynamic;
using MongoDB.Bson;
using MongoDB.Bson.Serialization.Attributes;
using ImpromptuInterface;

namespace MongoData.Dynamic
{
    /// <summary>
    /// All MongoDynamic objects support this interface because every object needs an _id in MongoDB
    /// </summary>
    public interface IId
    {
        ObjectId _id { get; set; }
    }

    /// <summary>
    /// MongoDynamic is like an ExpandoObject that also understands document Ids and uses Improptu interface
    /// to act like any other collection of interfaces ...
    /// It can be serialized and deserialized from BSon and thus stored in a MongoDB database.
    /// </summary>
    /// <remarks>
    /// This simple class gives you the ability to define database objects using only .NET interfaces - no classes!
    /// Those objects can be dynamically extended to support any interface you want to add to them - polymorphism!
    /// When loaded back from the database the object will support all of the interfaces that were ever applied to it.
    /// Adding a new field is easy.  Removing one works too.
    /// All fields must be nullable since they may not be present on earlier instances of an object type.
    /// </remarks>
    public class MongoDynamic : DynamicObject, IId
    {
        [BsonId(Order=1)]
        public ObjectId _id { get; set; }

        // Dumb name for a property - which is why I chose it - very unlikely it will ever conflict with a real property name
        public const string InterfacesField = "int";

        /// <summary>
        /// Interfaces that have been added to this object
        /// </summary>
        /// <remarks>
        /// We always begin by supporting the _id interface
        /// Order is important, we need to see this field before we can deserialize any others
        /// </remarks>
        [BsonElement(InterfacesField, Order=2)]
        internal HashSet<string> @int = new HashSet<string>(){ typeof(IId).FullName };

        /// <summary>
        /// A text version of all interfaces - mostly for debugging purposes, stored in alphabetical order
        /// </summary>
        [BsonIgnore]
        public string InterfacesAsText
        {
            get { return string.Join(",", this.@int.OrderBy(i => i)); }
        }

        /// <summary>
        /// Add support for an interface to this document if it doesn't already have it
        /// </summary>
        public T AddLike<T>()
            where T : class
        {
            @int.Add(typeof(T).FullName);
            // And also act like any interfaces that interface implements (which will include ones they represent too)
            foreach (var @interface in typeof(T).GetInterfaces())
                @int.Add(@interface.FullName);
            return Impromptu.ActLike<T>(this, this.GetAllInterfaces());
        }

        /// <summary>
        /// Add support for multiple interfaces
        /// </summary>
        public T AddLike<T>(Type[] otherInterfaces)
            where T : class
        {
            var allInterfaces = otherInterfaces.Concat(new[] { typeof(T) });
            var allInterfacesAndDescendants = allInterfaces.Concat(allInterfaces.SelectMany(x => x.GetInterfaces()));
            foreach (var @interface in allInterfacesAndDescendants)
                @int.Add(@interface.FullName);
            return Impromptu.ActLike<T>(this, this.GetAllInterfaces());
        }

        /// <summary>
        /// Cast this object to an interface only if it has previously been created as one of that kind
        /// </summary>
        public T AsLike<T>()
            where T : class
        {
            if (!this.@int.Contains(typeof(T).FullName)) return null;
            else return Impromptu.ActLike<T>(this, this.GetAllInterfaces());
        }

        // A rather large cache of all interface types loaded into the App Domain
        private static List<Type> cacheOfTypes = null;

        // A cache of the interface types corresponding to a given 'key' of interface names
        private static Dictionary<string, Type[]> cacheOfInterfaces = new Dictionary<string, Type[]>();

        public Type[] GetAllInterfaces()
        {
            // We always behave like an object with an Id plus any other interfaces we have
            var key = string.Join(",", this.@int.OrderBy(i => i));
            if (!cacheOfInterfaces.ContainsKey(key))
            {
                if (cacheOfTypes == null)
                {
                    var assemblies = AppDomain.CurrentDomain.GetAssemblies();
                    cacheOfTypes = assemblies.SelectMany(ass => ass.GetTypes()).Where(t => t.IsInterface).ToList();
                }
                var interfaces = cacheOfTypes.Where(t => this.@int.Any(i => i == t.FullName));

                // Could trim the interfaces to remove any that are inherited from others ...
                cacheOfInterfaces.Add(key, interfaces.ToArray());
            }
            return cacheOfInterfaces[key];
        }

        /// <summary>
        /// Get a mapping from a field name to a type according to the interfaces on this object
        /// </summary>
        /// <returns></returns>
        public Dictionary<string, Type> GetTypeMap()
        {
            Dictionary<string, Type> typeMap = new Dictionary<string, Type>();
            var interfaces = this.GetAllInterfaces();
            foreach (var mi in interfaces.SelectMany(intf => intf.GetProperties()))
            {
                typeMap[mi.Name] = mi.PropertyType;
            }
            return typeMap;
        }


        /// <summary>
        /// Becomes a Proxy object that acts like it implements all of the interfaces listed as being supported by this Entity
        /// </summary>
        /// <remarks>
        /// Because the returned object supports ALL of the interfaces that have ever been added to this object
        /// you can cast it to any of them.  This enables a type of polymorphism.
        /// </remarks>
        public object ActLikeAllInterfacesPresent()
        {
            return Impromptu.DynamicActLike(this, this.GetAllInterfaces());
        }

        [BsonIgnore]
        // BsonIgnore because Bson serialization will happen on the dynamic interface this class exposes not on this dictionary
        private Dictionary<string, object> children = new Dictionary<string, object>();

        /// <summary>
        /// Fetch a property by name
        /// </summary>
        public override bool TryGetMember(GetMemberBinder binder, out object result)
        {
            if (binder.Name == "_id") { result = this._id; return true; }
            else if (binder.Name == InterfacesField) { result = this.@int; return true; }
            else 
            {
               children.TryGetValue(binder.Name, out result); 
               result = null;                         // we hope that it's nullable!  If not you have an issue 
               return true;                           // when you do a database migration or query a nullable field it won't be in 'children'
            }
        }

        /// <summary>
        /// Set a property (e.g. person1.Name = "Smith")
        /// </summary>     
        public override bool TrySetMember(SetMemberBinder binder, object value)
        {
            if (binder.Name == "_id") { this._id = (ObjectId)value; return true; }      // you shouldn't need to use this
            if (binder.Name == InterfacesField) throw new AccessViolationException("You cannot set the interfaces directly, use AddLike() instead");
            if (!this.GetTypeMap().ContainsKey(binder.Name)) throw new ArgumentException("Property '" + binder.Name + "' not found.  You need to call AddLike to specify the interfaces you want to support."); 
            children[binder.Name] = value;
            return true;
        }

        public override IEnumerable<string> GetDynamicMemberNames()
        {
            return new[]{"_id", InterfacesField}.Concat(children.Keys);
        }

        /// <summary>
        /// An indexer for use by serialization code
        /// </summary>
        internal object this[string key]
        {
            get
            {
                if (key == "_id") return this._id;
                else if (key == InterfacesField) return this.@int;
                else return children[key];
            }

            set
            {
                if (key == "_id" && value is BsonObjectId) this._id = ((BsonObjectId)value).Value;
                else if (key == "_id") this._id = (ObjectId)value;
                else if (key == InterfacesField) this.@int = new HashSet<string>((IEnumerable<string>)value);
                else children[key] = value;
            }
        }
    }
}

You’ll need Impromptu interface (from Nuget) to build this. To use it, you write code like this to save to MongoDB:

            MongoDynamic entity = new MongoDynamic();
            var user = entity.AddLike<IUser>();         // *** Add the IUser fields to it ...
            user.Name = name;                           // Use it as if it were an IUser
            // save it to the database as normal

And to retrieve an object you create a query as normal and then query for MongoDynamic objects like so …

            var user = database.GetCollection<MongoDynamic>("***collectionName***").FindOne(query);
            if (user == null) return null;
            return user.AsLike<IUser>();

Typically you will want your query to reference the field called int (where all the interfaces are stored) so you can query for objects that support a specific type (if you do, you’ll want to add an index on that field). [NB the name was chosen to be one you were unlikely to ever use in .NET]

MongoDynamic objects are polymorphic – you can morph them to support any other interface at any time like so …

            user.AddLike<ISomeOtherInterface>();

Home network crawler – cataloging every file on the home LAN with C# and MongoDB

Map-Reduce in operation in Greenland

Map-Reduce in action: The glaciers in Greenland 'map' the canyon walls into streams of rocks called lateral moraine. As the glaciers merge these rocks are 'reduced' into streams in the middle called 'medial' moraine. (A photo I took over Greenland this summer.)

With the addition of two more 3TB drives to the home network it’s becoming impossible to track files and to remember where each one is and whether it’s a backup of some other disk or not. There are 8 computers on the home network and over 10TB of storage distributed between them. Much of the storage is concentrated on a single machine running Windows Server 2008. It’s a low-powered Atom server connected to a Sans Digital 1U Rackmount Sans Digital disk array running in JBOD mode (just a bunch of disks).

I’m not a huge fan or RAID arrays – they mostly mean there’s another component to go wrong (the controller card) and when they do go wrong you can lose all your data just as easily as if it were all on one drive. I prefer a multiple copy strategy, an “Amazon S3 for the home” if you like. The downside of this is that there are multiple copies of each file across the home network and as I have several generations of hard drives the mapping from primary to secondary to tertiary is complex and hard to manage! It’s also really hard to find a single file when there are so many places to look and it’s nigh on impossible to be sure that I have the necessary three copies of every important file in the right places at all times.

So this weekend I embarked on a small project to catalog every file, directory and storage volume on the entire home network including drives that are only sometimes connected. The software has been running all weekend and is close to cataloging everything. It’s found 5 million files so far representing over 6TB of data!

The architecture I chose for this software was an agent that runs on each PC to catalog all of the attached volumes. This client uploads all the directories and files that it finds to a MongoDB database running on the same Atom server as the main storage array. The poor little Atom server’s 4GB of RAM has been in constant use but the server has remained responsive, in part because it boots from an SSD drive.

Each volume, directory and file is represented by a document in MongoDB in a single collection. The agent calculates an MD5 hash for each file and extracts metadata from MP3, WMA and JPG files. It also stores all of the key file dates (created, updated, accessed) and references to parent directories, volume identifiers and the currently connected PC. It does not assume that a volume is always connected to the same computer – you can unplug an external drive from one and put it somewhere else and it will all work just fine.

I implemented a re-startable tree scan that uses a couple of DateTime stamps to be able to determine which directories need to be scanned during the current pass and which ones have already been scanned. Any agent can be killed at any time and restarted and it will carry on walking the directory tree right where it left off. It will even continue correctly in the case where you move a volume from one PC to another.

Each agent uses the Parallel Task library’s Parallel.ForEach to crawl each volume in parallel and to parse multiple files from each directory simultaneously.

By storing all of the file metadata in Mongo DB it’s easy to use Map-Reduce to calculate some interesting statistics for the files on the network.

For example, to create a summary of file sizes I can use a Map function:

function Map() {
	if (this.Size && this._t == "FileInformation")
	{
		var size = this.Size;
		
		if (size < 1024)
			emit ("kb", {count:1, size:this.Size});
		else if (size < 1024*1024)
			emit ("mb", {count:1, size:this.Size});
		else if (size < 1024*1024*1024)
			emit ("gb", {count:1, size:this.Size});
		else if (size < 1024*1024*1024*1024)
			emit ("tb", {count:1, size:this.Size});
		else 
			emit ("tb+", {count:1, size:this.Size});
	}
}

and a reduce function:

function Reduce(key, arr_values) {

	var count = 0;
	var size = 0;
	
	for(var i in arr_values) 
	{
		count = count + arr_values[i].count;
		size = size + arr_values[i].size;
	}
	
	return {count:count, size:size};
}

Map-Reduce operations like this take about 20 minutes to run (on the Atom server with just 4GB of RAM) whereas any query serviced by one of the indexes on the MongoDB collection is almost instantaneous.

I’ve been using the excellent MongoVue to run simple map-reduce scripts like this and to keep track of how quickly the database is growing.

Map-reduce can also be used to find duplicate files – by emitting the MD5 hash as the key and some information about the file as the value I can find every copy of every file across every computer on the home network.

Since I have the file name and metadata for every file on the home network I can also easily find any file using MongoDB’s regex matching feature against the path.

The Hard Parts

For starters you’ll need a library that can handle long file names. Then you’ll need to fix it to provide at least the functionality that FileInfo and DirectoryInfo give you in .NET.

Next you’ll need to learn about reparse-points and hard-links and you’ll need to skip over them because with them in place the file system is not a tree; it’s a cyclical graph in which a simple crawler will quickly get confused or stuck.

You’ll also want to store the NTFS file Id and the unique Volume ID for every file so you can track it when the file is moved or the removable drive is connected to a different computer.

So how well does it work?

This all seems to work really well. Nearly every volume has now been cataloged. It’s located about 5M files occupying over 6TB of space. The worst case offender for the number of copies of the same file is 100+. I’ve used the find feature in MongoDB to find a file I was missing and I’m better able to plan how to arrange directories and file generations across the various hard drives I have.

What’s next

Well, of course this needs to be connected to the home automation system and my Natural Language engine so you can ask “send a copy of IMG_0228 from last week to X” or “where are all the spreadsheets I created last year?” That will be fairly easy.

After that I hope to incorporate backup features into the agents too so they can automatically keep the required number of copies of each file according to its importance. I’d also like to set up a rotating set of external drives that go in the fire safe when not connected and when they are connected they get updated with the latest copies of all the important files.

I’d also like to be able to get the agents to move whole groups of directories around between drives as juggling the directory layout each time a new hard drive is added to the system is always a time consuming process.

Comments or Questions?

Does everyone else have a hard time managing multiple computers, hard drives, directories and multiple copies of files? What tools do you use to do this? Is there anything commercially available that I could have used instead? Would a tool like this be useful to you? Should I publish the code somewhere? Comments and questions are always welcome here or on twitter.

MongoDB – Map-Reduce coming from C#

People coming from traditional relational database thinking and LINQ sometimes struggle to understand map-reduce. One way to understand it is to realize that it’s actually the simple composition of some LINQ operators with which you may already be familiar.

Map reduce is in effect a SelectMany() followed by a GroupBy() followed by an Aggregate() operation.

In a SelectMany() you are projecting a sequence but each element can become multiple elements. This is equivalent to using multiple emit statements in your map operation. The map operation can also chose not to call emit which is like having a Where() clause inside your SelectMany() operation.

In a GroupBy() you are collecting elements with the same key which is what Map-Reduce does with the key value that you emit from the map operation.

In the Aggregate() or reduce step you are taking the collections associated with each group key and combining them in some way to produce one result for each key. Often this combination is simply adding up a single ’1′ value output with each key from the map step but sometimes it’s more complicated.

One thing you should be aware of with map-reduce in MongoDB is that the reduce operation must accept and output the same data type because it may be applied repeatedly to partial sets of the grouped data. In C# your Aggregate() operation would be applied repeatedly on partial sequences to get to the final sequence.

A Semantic Web ontology / triple Store built on MongoDB

In a previous blog post I discussed building a Semantic Triple Store using SQL Server. That approach works fine but I’m struck by how many joins are needed to get any results from the data and as I look to storing much larger ontologies containing billions of triples there are many potential scalability issues with this approach. So over the past few evenings I decided to try a different approach and so I created a semantic store based on MongoDB. In the MongoDB version of my semantic store I take a different approach to storing the basic building blocks of semantic knowledge representation. For starters I decided that typical ABox and TBox knowledge has really quite different storage requirements and that smashing all the complex TBox assertions into simple triples and stringing them together with meta fields only to immediately join then back up whenever needed just seemed like a bad idea from the NOSQL / document-database perspective.

TBox/ABox: In the ABox you typically find simple triples of the form X-predicate-Y. These store simple assertions about individuals and classes. In the TBox you typically find complex sequents, that’s to say complex logic statements having a head (or consequent) and a body (or antecedents). The head is ‘entailed’ by the body, which means that if you can satisfy all of the body statements then the head is true. In a traditional store all the ABox assertions can be represented as triples and all the complex TBox assertions use quads with a meta field that is used solely to rebuild the sequent with a head and a body. The ABox/TBox distinction is however arbitrary (see http://www.semanticoverflow.com/questions/1107/why-is-it-necessary-to-split-reasoning-into-t-box-and-a-box).

I also decided that I wanted to be use ObjectIds as the primary way of referring to any Entity in the store. Using the full Uri for every Entity is of course possible and MongoDB couuld have used that as the index but I wanted to make this efficient and easily shardable across multiple MongoDB servers. The MongoDB ObjectID is ideal for that purpose and will make queries and indexing more efficient.

The first step then was to create a collection that would hold Entities and would permit the mapping from Uri to ObjectId. That was easy: an Entity type inheriting from a Resource type produces a simple document like the one shown below. An index on Uri with a unique condition ensures that it’s easy to look up any Entity by Uri and that there can only ever be one mapping to an Id for any Uri.


RESOURCES COLLECTION - SAMPLE DOCUMENT

{
  "_id": "4d243af69b1f26166cb7606b",
  "_t": "Entity",
  "Uri": "http://www.w3.org/1999/02/22-rdf-syntax-ns#first"
}

Although I should use a proper Uri for every Entity I also decided to allow arbitrary strings to be used here so if you are building a simple ontology that never needs to go beyond the bounds of this one system you can forgo namespaces and http:// prefixes and just put a string there, e.g. “SELLS”. Since every Entity reference is immediately mapped to an Id and that Id is used throughout the rest of the system it really doesn’t matter much.

The next step was to represent simple ABox assertions. Rather than storing each assertion as its own document I created a document that could hold several assertions all related to the same subject. Of course, if there are too many assertions you’ll still need to split them up into separate documents but that’s easy to do. This move was mainly a convenience for developing the system as it makes it easy to look at all the assertions made concerning a single Entity using MongoVue or the Mongo command line interface but I’m hoping it will also help performance as typical access patterns need to bring in all of the statements concerning a given Entity.

Where a statement requires a literal the literal is stored directly in the document and since literals don’t have Uris there is no entry in the resources collection.

To make searches for statements easy and fast I added an array field “SPO” which stores the set of all Ids mentioned anywhere in any of the statements in the document. This array is indexed in MongoDB using the array indexing feature which makes it very efficient to find and fetch every document that mentions a particular Entity. If the Entity only ever appears in the subject position in statements that search will result in possibly just one document coming back which contains all of the assertions about that Entity. For example:


STATEMENTGROUPS COLLECTION - SAMPLE DOCUMENT

{
  "_id": "4d243af99b1f26166cb760c6",
  "SPO": [
    "4d243af69b1f26166cb7606f",
    "4d243af69b1f26166cb76079",
    "4d243af69b1f26166cb7607c"
  ],
  "Statements": [
    {
      "_id": "4d243af99b1f26166cb760c5",
      "Subject": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb7606f",
        "Uri": "GROCERYSTORE"
      },
      "Predicate": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb7607c",
        "Uri": "SELLS"
      },
      "Object": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb76079",
        "Uri": "DAIRY"
      }
    }
	... more statements here ...
  ]
}

The third and final collection I created is used to store TBox sequents consisting of a head (consequent) and a body (antecedents). Once again I added an array which indexes all of the Entities mentioned anywhere in any of the statements used in the sequent. Below that I have an array of Antecedent statements and then a single Consequent statement. Although the statements don’t really need the full serialized version of an Entity (all they need is the _id) I include the Uri and type for each Entity for now. Variables also have Id values but unlike Entities, variables are not stored in the Resources collection, they exist only in the Rule collection as part of consequent statements. Variables have no meaning outside a consequent unless they are bound to some other value.


RULE COLLECTION - SAMPLE DOCUMENT

{
  "_id": "4d243af99b1f26166cb76102",
  "References": [
    "4d243af69b1f26166cb7607d",
    "4d243af99b1f26166cb760f8",
    "4d243af99b1f26166cb760fa",
    "4d243af99b1f26166cb760fc",
    "4d243af99b1f26166cb760fe"
  ],
  "Antecedents": [
    {
      "_id": "4d243af99b1f26166cb760ff",
      "Subject": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760f8",
        "Uri": "V3-Subclass8"
      },
      "Predicate": {
        "_t": "Entity",
        "_id": "4d243af69b1f26166cb7607d",
        "Uri": "rdfs:subClassOf"
      },
      "Object": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fa",
        "Uri": "V3-Class9"
      }
    },
    {
      "_id": "4d243af99b1f26166cb76100",
      "Subject": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fa",
        "Uri": "V3-Class9"
      },
      "Predicate": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fc",
        "Uri": "V3-Predicate10"
      },
      "Object": {
        "_t": "Variable",
        "_id": "4d243af99b1f26166cb760fe",
        "Uri": "V3-Something11"
      }
    }
  ],
  "Consequent": {
    "_id": "4d243af99b1f26166cb76101",
    "Subject": {
      "_t": "Variable",
      "_id": "4d243af99b1f26166cb760f8",
      "Uri": "V3-Subclass8"
    },
    "Predicate": {
      "_t": "Variable",
      "_id": "4d243af99b1f26166cb760fc",
      "Uri": "V3-Predicate10"
    },
    "Object": {
      "_t": "Variable",
      "_id": "4d243af99b1f26166cb760fe",
      "Uri": "V3-Something11"
    }
  }
}

That is essentially the whole semantic store. I connected it up to a reasoner and have successfully run a few test cases against it. Next time I get a chance to experiment with this technology I plan to try loading a larger ontology and will rework the reasoner so that it can work directly against the database instead of taking in-memory copies of most queries that it performs.

At this point this is JUST AN EXPERIMENT but hopefully someone will find this blog entry useful. I hope later to connect this up to the home automation system so that it can begin reasoning across an ontology of the house and a set of ABox assertions about its current and past state.

Since I’m still relatively new to the semantic web I’d welcome feedback on this approach to storing ontologies in NOSQL databases from any experienced semanticists.

MongoDB Map-Reduce – Hints and Tips

For anyone getting started with Map-Reduce on MongoDB here are a few pointers to get you started.

1. Guids are not a good choice for MongoDB identifiers: use the provided ObjectId instead.

Guids in javascript compare as binary objects and thus don’t work well as keys for Map-Reduce operations.

2. You can’t use an Array as the return type for the reduce operation.

This is actually documented clearly on the site for those of you that actually read the documentation before trying to get it to work but for everyone else it’s going to cause some frustration.

3. The output value emitted by the map function MUST be the same format as the value returned by reduce.

The documentation on this one says it ‘should’ be the same, but in practice anything but the same format is bound to cause problems. What you need to understand is that ‘map-reduce’ is somewhat of a misnomer, the reduce function may be called iteratively and that doesn’t work unless each reduce operation can have its results fed back in to another reduce operation.

4. Using .length on the values array passed into the reduce function is never the right thing to do

In your map operation you often output a value of ’1′ for each key. In the reduce operation you want to add up these ’1′s. It looks like you could use value.length to get the result. But, here too the iterative nature of the reduce operation means that you actually need to examine the values in the array passed in and accumulate them.

5. The print() function provides for some limited debugging assistance

When you need to see some intermediate results in your map or reduce functions the print() function can help.

6. The Reduce function automatically includes the key and the return value

When deserializing the results in C# you’ll want to deserialize them into a type that has an ‘_id’ property and a ‘value’ property. The following generic type can help:

    /// <summary>
    /// This is a useful type for dealing with MapReduce results.  
    /// It takes two type parameters, one for the key and one for the value.
    /// The simplest possible result would use type parameters ObjectId and int
    /// </summary>
    public class MapReduceResult2<Tid, Tvalue>
    {
        public Tid _id { get; set; }
        public Tvalue value { get; set; }

        public override string ToString()
        {
            return string.Format("{0} {1}", _id, value);
        }
    }

7. Chaining Map-Reduce operations to get the result you need is quite normal

If you can’t see how to get to what you want in a single Map-Reduce cycle don’t worry, it’s quite easy and normal to pass the results of one Map-Reduce operation to the next.