The Blog of Ian Mercer.

Home network crawler - cataloging every file on the home LAN with C# and MongoDB

Map-Reduce in action: The glaciers in Greenland 'map' the canyon walls into streams of rocks called lateral moraine. As the glaciers merge these rocks are 'reduced' into streams in the middle called 'medial' moraine. (A photo I took over Greenland this summer.)

With the addition of two more 3TB drives to the home network it's becoming impossible to keep track of files: where each one is, and whether it's a backup of some other disk or not. There are 8 computers on the home network and over 10TB of storage distributed between them. Much of the storage is concentrated on a single machine running Windows Server 2008. It's a low-powered Atom server connected to a Sans Digital 1U rackmount disk array running in JBOD mode (just a bunch of disks).

I'm not a huge fan of RAID arrays - they mostly mean there's another component to go wrong (the controller card) and when they do go wrong you can lose all your data just as easily as if it were all on one drive. I prefer a multiple copy strategy, an "Amazon S3 for the home" if you like. The downside of this is that there are multiple copies of each file across the home network and, as I have several generations of hard drives, the mapping from primary to secondary to tertiary is complex and hard to manage! It's also really hard to find a single file when there are so many places to look, and it's nigh on impossible to be sure that I have the necessary three copies of every important file in the right places at all times.

So this weekend I embarked on a small project to catalog every file, directory and storage volume on the entire home network including drives that are only sometimes connected. The software has been running all weekend and is close to cataloging everything. It's found 5 million files so far representing over 6TB of data!

The architecture I chose for this software was an agent that runs on each PC to catalog all of the attached volumes. This client uploads all the directories and files that it finds to a MongoDB database running on the same Atom server as the main storage array. The poor little Atom server's 4GB of RAM has been in constant use but the server has remained responsive, in part because it boots from an SSD drive.

Each volume, directory and file is represented by a document in MongoDB in a single collection. The agent calculates an MD5 hash for each file and extracts metadata from MP3, WMA and JPG files. It also stores all of the key file dates (created, updated, accessed) and references to parent directories, volume identifiers and the currently connected PC. It does not assume that a volume is always connected to the same computer - you can unplug an external drive from one and put it somewhere else and it will all work just fine.
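As a rough illustration of what one such document might contain, here is a sketch in JavaScript. Only `_t: "FileInformation"` and `Size` are taken from the map function in this post; every other field name and all of the sample values are guesses made for illustration, not the actual schema.

```javascript
// Hypothetical shape of one file document. Only `_t` and `Size` come from
// the map function in this post; all other field names are assumptions.
function makeFileDocument(info) {
  return {
    _t: "FileInformation",            // discriminator: volume, directory or file
    Name: info.name,
    Size: info.size,                  // bytes
    MD5: info.md5,                    // hex digest of the file contents
    Created: info.created,
    Updated: info.updated,
    Accessed: info.accessed,
    ParentDirectoryId: info.parentDirectoryId,
    VolumeId: info.volumeId,          // identifies the disk, not the PC
    ConnectedTo: info.connectedTo     // PC the volume is attached to right now
  };
}

// Made-up sample values.
const doc = makeFileDocument({
  name: "IMG_0228.JPG",
  size: 2345678,
  md5: "0123456789abcdef0123456789abcdef",
  created: new Date("2011-08-01T10:00:00Z"),
  updated: new Date("2011-08-01T10:00:00Z"),
  accessed: new Date("2011-09-15T18:30:00Z"),
  parentDirectoryId: "dir-42",
  volumeId: "vol-7",
  connectedTo: "ATOM-SERVER"
});

console.log(doc);
```

Keeping the volume identifier separate from the currently connected PC is what would let a drive move between machines without its files looking like new ones.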

I implemented a re-startable tree scan that uses a couple of DateTime stamps to be able to determine which directories need to be scanned during the current pass and which ones have already been scanned. Any agent can be killed at any time and restarted and it will carry on walking the directory tree right where it left off. It will even continue correctly in the case where you move a volume from one PC to another.
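The post does not show the actual implementation, so the following is only one way such a scan could work, sketched in JavaScript with an in-memory tree standing in for the database. It assumes a pass start time that is stored once, plus a per-directory `completed` stamp that is written only after the whole subtree has been processed. The names `dir`, `scan`, `completed` and `budget` are all hypothetical, and a logical counter is used in place of real DateTime values to keep the example deterministic.

```javascript
// In-memory stand-in for the directory records stored in the database.
// `completed` is the (logical) time at which the whole subtree was last scanned.
function dir(name, children = []) {
  return { name, children, completed: 0 };
}

let clock = 0;                 // logical clock standing in for a DateTime
const tick = () => ++clock;

// Scan `node` for the pass that began at `passStarted`.
// `budget.left` simulates the agent being killed after that many directories.
// Returns true if the subtree finished, false if the agent was "killed".
function scan(node, passStarted, visited, budget) {
  if (node.completed >= passStarted) return true;   // already done in this pass
  for (const child of node.children) {
    if (!scan(child, passStarted, visited, budget)) return false;
  }
  if (budget.left === 0) return false;              // simulated kill
  budget.left--;
  visited.push(node.name);                          // "catalog the files here"
  node.completed = tick();                          // stamp after the subtree is done
  return true;
}

const root = dir("C:", [dir("a", [dir("a1"), dir("a2")]), dir("b")]);
const passStarted = tick();    // stored once, when the pass begins
const visited = [];

// First run is killed after two directories; the second run resumes.
const finishedFirstRun = scan(root, passStarted, visited, { left: 2 });
const finishedSecondRun = scan(root, passStarted, visited, { left: Infinity });

console.log(finishedFirstRun, finishedSecondRun, visited);
```

Because a directory is stamped only once everything beneath it is finished, a restarted agent can skip any stamped directory wholesale. The cost in this sketch is that a directory interrupted halfway through has its files rescanned.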

Each agent uses the Task Parallel Library's Parallel.ForEach to crawl each volume in parallel and to parse multiple files from each directory simultaneously.

By storing all of the file metadata in MongoDB it's easy to use Map-Reduce to calculate some interesting statistics for the files on the network.

For example, to create a summary of file sizes I can use a Map function:

    function Map() {
        if (this.Size && this._t == "FileInformation") {
            var size = this.Size;
            if (size < 1024)
                emit("kb", {count: 1, size: this.Size});
            else if (size < 1024 * 1024)
                emit("mb", {count: 1, size: this.Size});
            else if (size < 1024 * 1024 * 1024)
                emit("gb", {count: 1, size: this.Size});
            else if (size < 1024 * 1024 * 1024 * 1024)
                emit("tb", {count: 1, size: this.Size});
            else
                emit("tb+", {count: 1, size: this.Size});
        }
    }

and a reduce function:

    function Reduce(key, arr_values) {
        var count = 0;
        var size = 0;
        for (var i in arr_values) {
            count = count + arr_values[i].count;
            size = size + arr_values[i].size;
        }
        return {count: count, size: size};
    }
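One detail worth checking with any reduce function is that MongoDB may call it more than once for the same key, feeding the results of earlier calls back in as values. The return value therefore needs the same shape as the emitted values, which is why the map function emits `{count, size}` objects rather than bare numbers. The snippet below is a small local check of that property using made-up sizes; it runs in plain JavaScript without a database.

```javascript
// Same reduce logic as above, repeated here so the snippet runs on its own.
function Reduce(key, arr_values) {
  var count = 0;
  var size = 0;
  for (var i in arr_values) {
    count = count + arr_values[i].count;
    size = size + arr_values[i].size;
  }
  return { count: count, size: size };
}

// Four values as the map function would emit them for the "kb" bucket.
// The sizes are invented for this example.
var emitted = [
  { count: 1, size: 100 },
  { count: 1, size: 200 },
  { count: 1, size: 300 },
  { count: 1, size: 400 }
];

// One pass over everything...
var allAtOnce = Reduce("kb", emitted);

// ...versus reducing two halves and then reducing the partial results.
var partialA = Reduce("kb", emitted.slice(0, 2));
var partialB = Reduce("kb", emitted.slice(2));
var inStages = Reduce("kb", [partialA, partialB]);

console.log(allAtOnce, inStages);
```

Both paths give the same totals, so the function is safe to re-apply to its own output.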

Map-Reduce operations like this take about 20 minutes to run (on the Atom server with just 4GB of RAM) whereas any query serviced by one of the indexes on the MongoDB collection is almost instantaneous.

I've been using the excellent MongoVue to run simple map-reduce scripts like this and to keep track of how quickly the database is growing.

Map-reduce can also be used to find duplicate files - by emitting the MD5 hash as the key and some information about the file as the value I can find every copy of every file across every computer on the home network.
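A minimal sketch of that idea follows, with a tiny in-memory harness standing in for MongoDB so it can be run directly. The field names `MD5` and `Path`, the sample documents and the `runMapReduce` helper are all assumptions for illustration. Inside MongoDB the map function would read `this` and call the global `emit` rather than taking them as parameters.

```javascript
// Map: key each file by its content hash. `MD5` and `Path` are assumed names.
function mapByHash(doc, emit) {
  if (doc._t === "FileInformation" && doc.MD5) {
    emit(doc.MD5, { count: 1, paths: [doc.Path] });
  }
}

// Reduce: merge counts and path lists. The result has the same shape as the
// emitted values so it can safely be reduced again.
function reduceByHash(key, values) {
  var result = { count: 0, paths: [] };
  values.forEach(function (v) {
    result.count += v.count;
    result.paths = result.paths.concat(v.paths);
  });
  return result;
}

// Hypothetical in-memory stand-in for the map-reduce engine.
function runMapReduce(docs, map, reduce) {
  var groups = {};
  docs.forEach(function (doc) {
    map(doc, function (key, value) {
      (groups[key] = groups[key] || []).push(value);
    });
  });
  var out = {};
  Object.keys(groups).forEach(function (key) {
    var values = groups[key];
    // Like MongoDB, skip reduce when a key has only one value.
    out[key] = values.length === 1 ? values[0] : reduce(key, values);
  });
  return out;
}

// Made-up sample data: two copies of one file, one unique file, one non-file.
var docs = [
  { _t: "FileInformation", MD5: "aaa", Path: "C:\\music\\song.mp3" },
  { _t: "FileInformation", MD5: "aaa", Path: "E:\\backup\\song.mp3" },
  { _t: "FileInformation", MD5: "bbb", Path: "C:\\docs\\notes.txt" },
  { _t: "SomethingElse", Path: "C:\\docs" }
];

var results = runMapReduce(docs, mapByHash, reduceByHash);
var duplicates = Object.keys(results).filter(function (hash) {
  return results[hash].count > 1;
});

console.log(duplicates, results);
```

Filtering the output for `count > 1` yields the duplicated files, and `count < 3` would instead flag files that fall short of a three-copy target.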

Since I have the file name and metadata for every file on the home network I can also easily find any file using MongoDB's regex matching feature against the path.
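For illustration, the snippet below mimics such a search in plain JavaScript against a few made-up documents. The field name `Path`, the collection name `files` in the comment and the sample paths are assumptions, not the real schema.

```javascript
// Stand-in documents. `Path` is an assumed field name.
var docs = [
  { _t: "FileInformation", Path: "\\\\atom\\photos\\2011\\IMG_0228.JPG" },
  { _t: "FileInformation", Path: "E:\\backup\\photos\\2011\\img_0228.jpg" },
  { _t: "FileInformation", Path: "E:\\backup\\docs\\budget.xlsx" }
];

// In the mongo shell the equivalent query would be something like
//   db.files.find({ Path: /IMG_0228/i })
// Here the same regular expression is applied to an in-memory array.
function findByPath(allDocs, pattern) {
  return allDocs.filter(function (d) {
    return pattern.test(d.Path);
  });
}

var matches = findByPath(docs, /IMG_0228/i);
console.log(matches.map(function (d) { return d.Path; }));
```

One caveat: an unanchored or case-insensitive pattern like this one cannot use an index efficiently, so on millions of documents it will be much slower than an indexed lookup. A case-sensitive pattern anchored at the start of the string (for example, one beginning with `^`) can use an index on the field.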

The Hard Parts

For starters you'll need a library that can handle long file names. Then you'll need to fix it to provide at least the functionality that FileInfo and DirectoryInfo give you in .NET.

Next you'll need to learn about reparse-points and hard-links and you'll need to skip over them because with them in place the file system is not a tree; it's a cyclical graph in which a simple crawler will quickly get confused or stuck.
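To make the cycle problem concrete, here is a small JavaScript sketch using an in-memory tree in which one entry is a junction pointing back at the root. The node shape (`isReparsePoint`, `target`, `children`) is invented for this example. In .NET the corresponding check would be for the `FileAttributes.ReparsePoint` flag on the entry's attributes.

```javascript
// A tiny fake file system. Field names are hypothetical.
const photos = { name: "photos", isReparsePoint: false, children: [] };
const root = { name: "D:", isReparsePoint: false, children: [photos] };

// A junction inside "photos" that points back at the root: a cycle.
photos.children.push({ name: "loop", isReparsePoint: true, target: root, children: [] });
photos.children.push({ name: "2011", isReparsePoint: false, children: [] });

// Walk the tree, recording each directory path, and never follow a
// reparse point. Following `target` here would recurse forever.
function crawl(node, parentPath, found) {
  if (node.isReparsePoint) return found;
  const fullPath = parentPath ? parentPath + "\\" + node.name : node.name;
  found.push(fullPath);
  for (const child of node.children) {
    crawl(child, fullPath, found);
  }
  return found;
}

const found = crawl(root, "", []);
console.log(found);
```

Skipping the junction keeps the walk a strict tree traversal, so every real directory is visited exactly once.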

You'll also want to store the NTFS file Id and the unique Volume ID for every file so you can track it when the file is moved or the removable drive is connected to a different computer.
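A sketch of why those two identifiers are useful: if records are keyed on the pair (volume ID, file ID) rather than on the path, then a moved file or a relocated drive updates an existing record instead of creating a new one. The `catalog` map, the `record` function and all field names below are hypothetical, with a JavaScript `Map` standing in for the database.

```javascript
// In-memory stand-in for the collection, keyed by volume ID + file ID.
const catalog = new Map();

function keyFor(volumeId, fileId) {
  return volumeId + ":" + fileId;
}

// Record one observation from an agent. Returns "inserted" or "updated".
function record(observation) {
  const key = keyFor(observation.volumeId, observation.fileId);
  const existing = catalog.get(key);
  if (existing) {
    // Same file as before: only its location details have changed.
    existing.path = observation.path;
    existing.connectedTo = observation.connectedTo;
    return "updated";
  }
  catalog.set(key, Object.assign({}, observation));
  return "inserted";
}

// First sighting: external drive plugged into one PC (made-up values).
const first = record({
  volumeId: "vol-7", fileId: 1001,
  path: "F:\\photos\\IMG_0228.JPG", connectedTo: "DESKTOP"
});

// Later: the drive is on another PC and the file has been moved to a new folder.
const second = record({
  volumeId: "vol-7", fileId: 1001,
  path: "G:\\archive\\IMG_0228.JPG", connectedTo: "LAPTOP"
});

console.log(first, second, catalog.size);
```

The same key also helps with hard links: two paths that report the same volume ID and file ID are one file on disk, not two copies.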

So how well does it work?

This all seems to work really well. Nearly every volume has now been cataloged. It's located about 5M files occupying over 6TB of space. The worst offender for duplication is a single file with more than 100 copies. I've used the find feature in MongoDB to find a file I was missing and I'm better able to plan how to arrange directories and file generations across the various hard drives I have.

What's next

Well, of course this needs to be connected to the home automation system and my Natural Language engine so you can ask "send a copy of IMG_0228 from last week to X" or "where are all the spreadsheets I created last year?" That will be fairly easy.

After that I hope to incorporate backup features into the agents too so they can automatically keep the required number of copies of each file according to its importance. I'd also like to set up a rotating set of external drives that go in the fire safe when not connected and when they are connected they get updated with the latest copies of all the important files.

I'd also like to be able to get the agents to move whole groups of directories around between drives as juggling the directory layout each time a new hard drive is added to the system is always a time consuming process.

Comments or Questions?

Does everyone else have a hard time managing multiple computers, hard drives, directories and multiple copies of files? What tools do you use to do this? Is there anything commercially available that I could have used instead? Would a tool like this be useful to you? Should I publish the code somewhere? Comments and questions are always welcome here or on Twitter.
