Posts tagged .NET
Dynamically building ‘Or’ Expressions in LINQ
Feb 12th
One common question on Stack Overflow concerns the creation of a LINQ expression that logically Ors together a set of predicates, where the expression has to be built dynamically. Creating the ‘And’ version is easy: you simply stack multiple ‘.Where’ clauses onto a query as you add each predicate. You can’t do the same for ‘Or’. The common responses are ‘use LINQKit’ or ‘use Dynamic LINQ’. LINQKit, however, adds the unfortunate ‘.AsExpandable()’ into the expression, which can cause problems in some circumstances, and Dynamic LINQ is not strongly typed, so it doesn’t survive renaming operations. Neither answer is ideal.
But there is another way: using a bit of expression tree manipulation you can build an ‘Or’ expression dynamically while staying strongly typed. The code below achieves this.
using System;
using System.Linq;
using System.Linq.Expressions;
using System.Collections.Generic;
public static class ExpressionBuilder
{
public static Expression<Func<T, bool>> True<T>() { return f => true; }
public static Expression<Func<T, bool>> False<T>() { return f => false; }
public static Expression<T> Compose<T>(this Expression<T> first,
Expression<T> second,
Func<Expression, Expression, Expression> merge)
{
// build parameter map (from parameters of second to parameters of first)
var map = first.Parameters
.Select((f, i) => new { f, s = second.Parameters[i] })
.ToDictionary(p => p.s, p => p.f);
// replace parameters in the second lambda expression with parameters from
// the first
var secondBody = ParameterRebinder.ReplaceParameters(map, second.Body);
// apply composition of lambda expression bodies to parameters from
// the first expression
return Expression.Lambda<T>(merge(first.Body, secondBody), first.Parameters);
}
public static Expression<Func<T, bool>> And<T>(
this Expression<Func<T, bool>> first,
Expression<Func<T, bool>> second)
{
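// AndAlso builds the short-circuiting logical && (Expression.And builds the non-short-circuiting & operator)
return first.Compose(second, Expression.AndAlso);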
}
public static Expression<Func<T, bool>> Or<T>(
this Expression<Func<T, bool>> first,
Expression<Func<T, bool>> second)
{
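// OrElse builds the short-circuiting logical || (Expression.Or builds the non-short-circuiting | operator)
return first.Compose(second, Expression.OrElse);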
}
public class ParameterRebinder : ExpressionVisitor
{
private readonly Dictionary<ParameterExpression, ParameterExpression> map;
public ParameterRebinder(
Dictionary<ParameterExpression,
ParameterExpression> map)
{
this.map = map??new Dictionary<ParameterExpression,ParameterExpression>();
}
public static Expression ReplaceParameters(
Dictionary<ParameterExpression,
ParameterExpression> map,
Expression exp)
{
return new ParameterRebinder(map).Visit(exp);
}
protected override Expression VisitParameter(ParameterExpression p)
{
ParameterExpression replacement;
if (map.TryGetValue(p, out replacement))
{
p = replacement;
}
return base.VisitParameter(p);
}
}
}
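As a usage sketch (not part of the code above), here is how a list of predicates that is only known at runtime might be folded into a single ‘Or’ expression. To keep the example self-contained it carries its own compact parameter-swapping visitor; the names OrSketch, Swap, OrElse and AnyOf are made up for illustration, and it assumes every predicate takes exactly one parameter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class OrSketch
{
    // Swaps one parameter for another throughout an expression tree.
    private sealed class Swap : ExpressionVisitor
    {
        private readonly ParameterExpression from;
        private readonly ParameterExpression to;

        public Swap(ParameterExpression from, ParameterExpression to)
        {
            this.from = from;
            this.to = to;
        }

        protected override Expression VisitParameter(ParameterExpression p)
        {
            return p == from ? to : base.VisitParameter(p);
        }
    }

    // Ors two predicates, rebinding the second onto the first's parameter.
    public static Expression<Func<T, bool>> OrElse<T>(
        Expression<Func<T, bool>> first,
        Expression<Func<T, bool>> second)
    {
        Expression secondBody =
            new Swap(second.Parameters[0], first.Parameters[0]).Visit(second.Body);
        return Expression.Lambda<Func<T, bool>>(
            Expression.OrElse(first.Body, secondBody), first.Parameters);
    }

    // Folds any number of predicates into one. Starting from 'false'
    // means an empty list produces a predicate that matches nothing.
    public static Expression<Func<T, bool>> AnyOf<T>(
        IEnumerable<Expression<Func<T, bool>>> predicates)
    {
        Expression<Func<T, bool>> seed = x => false;
        return predicates.Aggregate(seed, (acc, p) => OrElse(acc, p));
    }
}
```

The result is an ordinary Expression<Func<T, bool>>, so it can be passed to a single ‘.Where’ call on an IQueryable<T>, or compiled with .Compile() for in-memory use.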
NB: Some of the ideas in this came from other blog posts. I can’t find them right now, but if part of this was your idea I’d be happy to add a link to your blog.
VariableWithHistory – making persistence invisible, making history visible
Feb 3rd
In a typical .NET application variables have a short lifetime. When they go out of scope or the application ends their value is lost. Also, you cannot ask a variable what its value was 1 hour ago, or what its average, maximum or minimum value was yesterday.
Yet, such a variable would be extremely useful when writing a Home Automation System because you often need to make comparisons between a current value and some historical average, or between two ranges (e.g. was the kitchen more or less occupied than yesterday). Now, normally you wouldn’t want to mix persistence up with the representation of a value in your code (see ‘Separation of Concerns’), but in this case I decided that it was worth mixing the two concepts because the benefits of doing so were so great.
So I created a class called VariableWithHistory<T> which is the abstract base class for IntegerWithHistory, DoubleWithHistory, BoolWithHistory, StringWithHistory and a number of others. The first property worth noting on these classes is the .Current property. This always gives you the latest value that has been set. Setting the .Current value stores both the value and the DateTime (Utc of course) at which the value became current. A history of all past values is maintained in MongoDB up to some suitable limit per variable (each variable can have its own adjustable history size in bytes by using MongoDB’s capped collections). If the new value is the same as the old one no update is made, the implicit behavior being that the value changed and stayed there until it changes again, so if you want to know what the value is now it is the same as the last change recorded.
With this new variable type in place any object in the house can have any number of persistent fields on it (bool occupied, double temperature, string triggeredBy, …). Updating these values is as simple as assigning to their .Current property. When the system loads, each value comes back with the value it had when the system was shut down. To accomplish this every VariableWithHistory is given a unique id (based on the unique id of its parent, e.g. a room).
So far so good: after a shutdown and restart the house doesn’t need to query a device to know if it’s on or off, and all the long-running Sequential Logic Blocks I use for rules (e.g. .Delay(days:2)) carry on running as if nothing happened. This is particularly useful since I typically deploy a new version almost every day and some logic blocks have long delays built into them.
But besides providing simple recovery from a reboot, these persistent variables allow me to do some much more interesting things.
int CountTransitions(DateTimeRange range, T direction);
Counts how many transitions there have been to the value T in a given time range, e.g. how many times did the driveway alarm go ‘true’ this evening?
Dictionary<T, double> Fractional(DateTimeRange range);
Builds a histogram of all the values seen in the time range, e.g. 50% hot, 20% cold, 30% warm for a string variable that tracks temperature
DateTimeOffset LastChangedState
e.g. when was this sensor last triggered?
TimedValue<T> ValueAtTime(DateTimeOffset dt)
What was the value at a given time in the past, e.g. what was the temperature at the same time yesterday?
Each specific type of VariableWithHistory<T> may also have additional methods relevant to the type T. For example, on DoubleWithHistory there is a method double Average(DateTimeOffset minValue, DateTimeOffset maxValue) which gets the average value over the specified time range. On BoolWithHistory there is a method double PercentageTrue(DateTimeRange range) which you could use to find the average occupancy for a room yesterday.
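To make the change-log idea concrete, here is a minimal in-memory sketch. It is not the real class: the real one persists to MongoDB, sets the timestamp itself when .Current is assigned, and takes a DateTimeRange. HistorySketch and its Set method are hypothetical names, and the time is passed in explicitly only so the behavior is easy to test.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a variable with history.
public class HistorySketch<T>
{
    // Change log in time order: when the value became current, and the value.
    private readonly List<KeyValuePair<DateTimeOffset, T>> changes =
        new List<KeyValuePair<DateTimeOffset, T>>();

    // The latest value that has been set (default(T) before any value is set).
    public T Current
    {
        get { return changes.Count == 0 ? default(T) : changes[changes.Count - 1].Value; }
    }

    // Records a change. An unchanged value is not stored again, so every
    // entry in the log is a genuine transition.
    public void Set(T value, DateTimeOffset utcNow)
    {
        if (changes.Count > 0 && EqualityComparer<T>.Default.Equals(Current, value)) return;
        changes.Add(new KeyValuePair<DateTimeOffset, T>(utcNow, value));
    }

    // The value in force at 'time' is the last change at or before it.
    public T ValueAtTime(DateTimeOffset time)
    {
        return changes.LastOrDefault(c => c.Key <= time).Value;
    }

    // Counts transitions to 'value' with from <= time < to.
    public int CountTransitions(DateTimeOffset from, DateTimeOffset to, T value)
    {
        return changes.Count(c => c.Key >= from && c.Key < to
            && EqualityComparer<T>.Default.Equals(c.Value, value));
    }
}
```

Because repeated values are never stored, ValueAtTime only has to find the most recent entry, and CountTransitions is a plain filter over the log.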
My initial implementation waited for the database to write each update before allowing any queries but now I simply cache the Current value and assume that queries will probably get executed after updates and that the average temperature yesterday is close enough with or without the last 100ms of updates. I did try to keep this class isolated from MongoDB but in the end the benefit of some of the atomic update capabilities in MongoDB made it easier to just take the dependency.
My previous implementation of this feature used my own in-memory database. MongoDB has slowed it down a bit, but I’ve gained the ability to archive terabytes of sensor data, which should prove useful for my next project: adding some machine learning to the system.
Updated Release of the Abodit State Machine
Jul 10th
I published a new version of the Abodit State Machine to Nuget this evening. You can find it here.
One breaking change in this version is that the state machine is now specified using three Type parameters instead of two:
public class OccupancyStateMachine :
StateMachine<OccupancyStateMachine, Event, BuildingArea>
The third type parameter, TContext, is a context object that can be passed in with every event occurrence or tick. This means that you don’t need to store any extraneous data in the state machine itself and can keep it as a pure representation of the state of the system.
In the example above I have an OccupancyStateMachine and the context is a BuildingArea. Each call to EventHappens now takes the event that happened and a BuildingArea object.
When you define your state machine you will need to include 4 parameters in each lambda expression.
Here, for example, is the current state machine for a BuildingArea in my home automation. It uses a hierarchy of states with two base states: Not Occupied and Occupied. It has timers for activity within a room or for occupancy within rooms that are contained by a floor. Note how it also exposes an IObservable<State> so that other objects can subscribe to state machine changes. I didn’t want to take the Rx dependency in the state machine class itself but you can see how easy it is to hook it up.
Of interest also is the way I represent occupancy as three distinct states, the extra one ‘Asleep’ represents a room that is not-occupied in the sense that there is no motion there now but there was at some point during the evening before.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Abodit.StateMachine;
using log4net;
using Abodit.Units;
using AboditUnits.Units;
using System.Reactive.Subjects;
using System.Reactive.Linq;
namespace Abodit
{
/// <summary>
/// An Occupancy State machine handles not occupied, occupied, asleep
/// </summary>
[Serializable]
public class OccupancyStateMachine : StateMachine<OccupancyStateMachine, Event, BuildingArea>
{
private readonly Subject<State> watch = new Subject<State>();
public IObservable<State> Watch { get { return watch.AsObservable(); } }
public override void OnStateChanging(StateMachine<OccupancyStateMachine, Event, BuildingArea>.State newState, BuildingArea context)
{
watch.OnNext(newState);
}
public static readonly State Starting = AddState("Starting");
public static readonly State NotOccupied = AddState("Not occupied",
(m, e, s, c) => {
m.CancelScheduledEvent(eTick); // Stop the clock
m.IsTimerRunning = false;
m.IsRecentlyOccupied = false;
m.IsHeavilyOccupied = false;
m.After(new TimeSpan(hours:0, minutes:5, seconds:0), e5MinutesSinceOccupied);
m.After(new TimeSpan(hours:24, minutes:0, seconds:0), e24hoursSinceOccupied);
m.After(new TimeSpan(hours:48, minutes:0, seconds:0), e48hoursSinceOccupied);
},
(m, e, s, c) => { });
public static readonly State NotOccupiedIn5Minutes = AddState("Not occupied in over 5 minutes",
(m, e, s, c) => { },
(m, e, s, c) => { }, NotOccupied);
public static readonly State NotOccupiedInOver24Hours = AddState("Not occupied in over 24 hours",
(m, e, s, c) => { },
(m, e, s, c) => { }, NotOccupiedIn5Minutes);
public static readonly State NotOccupiedInOver48Hours = AddState("Not occupied in over 48 hours",
(m, e, s, c) => { },
(m, e, s, c) => { }, NotOccupiedInOver24Hours);
public static readonly State NotOccupiedInOver1Week = AddState("Not occupied in over 1 week",
(m, e, s, c) => { },
(m, e, s, c) => { }, NotOccupiedInOver48Hours);
public static readonly State Asleep = AddState("Asleep",
(m, e, s, c) =>
{
// Set a timer going for morning
var now = TimeProvider.Current.Now.LocalDateTime;
var morning = now.Hour < 8 ? now.AddHours(-now.Hour + 8) : now.AddHours(24 - now.Hour + 8);
m.At(morning.ToUniversalTime(), eMorning);
},
(m, e, s, c) => { },
parent:NotOccupied);
public static readonly State Occupied = AddState("Occupied",
(m, e, s, c) =>
{
m.IsRecentlyOccupied = true;
// Add a timer that runs while we are occupied
m.Every(new TimeSpan(hours:0, minutes:0, seconds:10), eTick);
// And set a timer going to mark 5 minutes since occupied
m.After(new TimeSpan(hours:0, minutes:5, seconds:0), e5MinutesAfterBecomingOccupied);
m.CancelScheduledEvent(e5MinutesSinceOccupied);
m.CancelScheduledEvent(e24hoursSinceOccupied);
m.CancelScheduledEvent(e48hoursSinceOccupied);
},
(m, e, s, c) => { });
public static readonly State HeavilyOccupied = AddState("Heavily occupied",
(m, e, s, c) => { },
(m, e, s, c) => { },
parent:Occupied);
private static readonly Event eStart = new Event("Starts");
private static readonly Event eUserActivity = new Event("User activity");
private static readonly Event eTick = new Event("Tick");
private static readonly Event eTimeout = new Event("Timeout");
private static readonly Event eMorning = new Event("Morning");
private static readonly Event e5MinutesAfterBecomingOccupied = new Event("5 minutes after becoming occupied");
private static readonly Event e5MinutesSinceOccupied = new Event("5 minutes since occupied");
private static readonly Event e24hoursSinceOccupied = new Event("24 hours since occupied");
private static readonly Event e48hoursSinceOccupied = new Event("48 hours since occupied");
private static readonly Event eAllChildrenNotOccupied = new Event("No child occupied");
private static readonly Event eAtLeastOneChildOccupied = new Event("At least one child occupied");
private double decliningActivity = 0.0; // Up 1000 on every user input, down x0.92 (rateOfDecline) on every tick
private const int ActivityPerUserInput = 1000;
private const double rateOfDecline = 0.92;
public bool IsTimerRunning { get; set; }
public bool IsRecentlyOccupied { get; set; }
public bool IsHeavilyOccupied { get; set; }
static OccupancyStateMachine()
{
// On startup we transition immediately to starting
// but we want an event call to do this so we aren't doing any work
// in the constructor, and so the initialization only happens when it's
// a true 'cold start' not a 'warm start' from some database state
Starting
.When(eStart, (m, s, e, c) => { return NotOccupied; });
// Note: This is a hierarchical state machine so NotOccupied includes Asleep
NotOccupied
.When(eAtLeastOneChildOccupied, (m, s, e, c) =>
{
return Occupied;
})
.When(e5MinutesSinceOccupied, (m, s, e, c) =>
{
// Could signal something??
return s;
})
.When(e24hoursSinceOccupied, (m, s, e, c) =>
{
// Could signal something??
return s;
})
.When(e48hoursSinceOccupied, (m, s, e, c) =>
{
// Could signal something??
return s;
})
.When(eUserActivity, (m, s, e, c) =>
{
m.After(c.OccupancyTimeout, eTimeout); // start a new timeout
m.IsTimerRunning = true;
return Occupied;
});
// Asleep is a substate of not occupied so no need for more logic on becoming occupied ...
Asleep
.When(eMorning, (m, s, e, c) =>
{
// Eliminate Asleep if appropriate
return NotOccupied;
});
// Occupied includes recently occupied and heavily occupied ...
Occupied
.When(e5MinutesAfterBecomingOccupied, (m, s, e, c) =>
{
m.IsRecentlyOccupied = false;
return s;
})
.When(eUserActivity, (m, s, e, c) =>
{
// Accumulate activity ...
m.decliningActivity += ActivityPerUserInput;
m.CancelScheduledEvent(eTimeout); // cancel the old timeout
m.After(c.OccupancyTimeout, eTimeout); // start a new timeout
m.IsTimerRunning = true;
if (m.decliningActivity > 20 * ActivityPerUserInput)
return HeavilyOccupied;
else
return s;
})
.When(eAllChildrenNotOccupied, (m, s, e, c) =>
{
if (m.IsTimerRunning)
{
// If the timer is running ... wait until it runs out
return s;
}
else
{
DateTime nowLocal = TimeProvider.Current.Now.LocalDateTime;
if (nowLocal.Hour > 17)
return Asleep;
else
return NotOccupied;
}
})
.When(eTick, (m, s, e, c) =>
{
m.decliningActivity *= rateOfDecline;
return s;
})
.When(eTimeout, (m, s, e, c) =>
{
DateTime nowLocal = TimeProvider.Current.Now.LocalDateTime;
if (nowLocal.Hour > 17)
return Asleep;
else
return NotOccupied;
});
HeavilyOccupied.When(eTick, (m, s, e, c) =>
{
// Same code as Occupied but this one will override if we are in HeavilyOccupied mode
m.decliningActivity *= rateOfDecline;
// Fall back to just occupied when ...
if (m.decliningActivity < 0.2 * ActivityPerUserInput)
return Occupied;
else
return s;
});
}
public OccupancyStateMachine()
: base(Starting)
{
}
public OccupancyStateMachine(State initialState)
: base(initialState)
{
}
public override void Start()
{
this.EventHappens(eStart, null);
}
public void UserActivity(BuildingArea ba)
{
this.EventHappens(eUserActivity, ba);
}
public void AllChildrenNotOccupied(BuildingArea ba)
{
this.EventHappens(eAllChildrenNotOccupied, ba);
}
public void AtLeastOneChildOccupied(BuildingArea ba)
{
this.EventHappens(eAtLeastOneChildOccupied, ba);
}
}
}
Building a better .NET State Machine
Apr 14th
[Note: Updated version on Nuget has slightly different API, see latest blog post.]
There are several state machine implementations for .NET out there but, sadly, none of them met all of the requirements I have for a state machine. These are:-
1) Well written using encapsulation and other good practices
2) Able to be easily serialized to disk
3) Able to handle temporal events easily (After … At … Every …)
4) Disk serialized form must expose a property saying when it next needs to be fetched from disk to run
5) Implements hierarchical states with entry and exit actions
So I built one, and have made the source code available on Nuget so you can add it to any project easily without any extra DLLs.
Look for “AboditStateMachine” on Nuget to download it. The download includes a sample state machine documented to show off some of its capabilities.
Defining states is easy, just give them a name and specify their parent state if any:-
public static readonly State UnVerified = AddState("UnVerified");
public static readonly State Verified = AddState("Verified");
// States are hierarchical. If you are in state VerifiedRecently you are also in its parent state Verified.
public static readonly State VerifiedRecently = AddState("Verified recently", parent: Verified);
public static readonly State VerifiedAWhileAgo = AddState("Verified a while ago", parent: Verified);
Events are defined in the same way. The examples here use the Event class, but you can use any other type that’s IEquatable:
private static Event eUserVerifiedEmail = new Event("User verified email");
private static Event eScheduledCheck = new Event("Scheduled Check");
private static Event eBeenHereAWhile = new Event("Been here a while");
The state machine itself is specified in a static constructor so it runs just once no matter how many instances of the state machine you create. Each method is provided with an instance of the state machine ‘m’ as well as the state ‘s’ and the event ‘e’ as appropriate:
static DemoStatemachine()
{
UnVerified
.OnEnter((m, s, e) =>
{
// States can execute code when they are entered or when they are left
// In this case we start a timer to bug the user until they confirm their email
m.Every(new TimeSpan(hours: 10, minutes:0, seconds:0), eScheduledCheck);
// You can also set a reminder to happen at a specific time, or after a given interval just once
m.At(new DateTime(DateTime.Now.Year+1, 1, 1), eScheduledCheck);
m.After(new TimeSpan(hours: 24, minutes: 0, seconds: 0), eScheduledCheck);
// All necessary timing information is serialized with the state machine
// The serialized state machine also exposes a property showing when it next needs to be woken up
// External code will need to call the Tick(utc) method at that time to trigger the next temporal event
})
.When(eScheduledCheck, (m, s, e) =>
{
Trace.WriteLine("Here is where we would send a message to the user asking them to verify their email");
// We return the current state 's' rather than 'UnVerified' in case we are in a child state of 'Unverified'
// This makes it easy to handle hierarchical states and to either change to a different state or stay in the same state
return s;
})
.When(eUserVerifiedEmail, (m, s, e) =>
{
Trace.WriteLine("The user has verified their email address, we are done (almost)");
// Kill the scheduled check event, we no longer need it
m.CancelScheduledEvent(eScheduledCheck);
// Start a timer for one last transition
m.After(new TimeSpan(hours:24, minutes:0, seconds:0), eBeenHereAWhile);
return VerifiedRecently;
});
VerifiedRecently
.When(eBeenHereAWhile, (m, s, e) =>
{
Trace.WriteLine("User has now been a member for over 24 hours - give them additional privileges for example");
// No need to cancel the eBeenHereAWhile event because it wasn't auto-repeating
//m.CancelScheduledEvent(eBeenHereAWhile);
return VerifiedAWhileAgo;
});
Verified.OnEnter((m, s, e) =>
{
Trace.WriteLine("The user is now fully verified");
});
VerifiedAWhileAgo.OnEnter((m, s, e) =>
{
Trace.WriteLine("The user has been verified for over 24 hours");
});
}
With your state machine defined you can now create instances of it, trigger events on them, serialize them to disk, fetch them back, carry on eventing on them, …
DemoStatemachine demoStateMachine = new DemoStatemachine(DemoStatemachine.UnVerified);
// At the time specified in demoStateMachine.NextTimedEventAt you reload the state machine from disk and call
demoStateMachine.Tick(DateTime.UtcNow);
// When the user verifies their email address you call ...
demoStateMachine.VerifiesEmail();
// At any other time you can examine the current state, act on the state changed event, ...
I hope you find this new state machine implementation useful, and if you have any feedback, do please send it my way.
MongoDB substring search with a difference
Nov 25th
It’s quite common to want to search a database for a key that starts with a given string. In SQL you have LIKE and in MongoDB you have regular expressions:
db.customers.find( { name : { $regex : '^acme', $options: 'i' } } );
But what if you want to do the inverse of this? i.e. to search the database for the keys that are themselves substrings of the search string? For example, suppose you are trying to parse a block of text and you want to find phrases in the database that match the start of the current block of text. In SQL you would be dead in the water but with MongoDB you can create a RegEx that matches either the first word, or the first two words, or the first three words, … and so on.
We can construct a regular expression to do this, it might look something like: ^word1($| word2($| word3$))
Here’s a C# method that can create the necessary regular expression:
/// <summary>
/// This generates a regular expression that matches as much of the given phrase as it can from a string
/// i.e. a reverse prefix search where you want the database to supply the prefix and match it against your query
/// useful for matching 'as much as possible from a given input'
/// </summary>
private string generatePrefixRegex(string phrase, bool atStart)
{
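// Requires: using System.Linq; using System.Text.RegularExpressions;
string[] bits = phrase.Split(' ');
// Escape the first word too, so any regex metacharacters in it are matched literally
string result = Regex.Escape(bits[0]);
// At the start of a sentence, if the first character is upper cased, we should also be looking for a lowercased version of it
if (atStart && bits[0].Length > 0 && char.IsUpper(bits[0][0]))
{
// string.Format placeholders are {0}, {1}, ... (not %0, %1, ...)
result = string.Format("({0}|{1}){2}", char.ToLowerInvariant(bits[0][0]), bits[0][0], Regex.Escape(bits[0].Substring(1)));
}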
// Each additional word - either we end the string before it or we must include it
foreach (var bit in bits.Skip(1))
{
result = result + "($| " + Regex.Escape(bit);
}
result = result + "$"; // last word must end string
foreach (var bit in bits.Skip(1))
{
result = result + ")"; // close the expression
}
return "^" + result; // Must start at the start of a Name
}
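To see the pattern at work without a database, here is a small self-contained sketch that builds the same kind of expression and tests it against a handful of candidate keys in memory. PrefixRegexSketch, Build and Matching are names invented for this example, it leaves out the upper/lower-case handling, and it assumes words are separated by single spaces.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PrefixRegexSketch
{
    // Builds ^w1($| w2($| w3$)) for the phrase "w1 w2 w3".
    public static string Build(string phrase)
    {
        string[] words = phrase.Split(' ').Select(Regex.Escape).ToArray();
        // Every word after the first opens a group: either the key ends
        // here, or it continues with a space and the next word.
        string nested = string.Join("($| ", words);
        return "^" + nested + "$" + new string(')', words.Length - 1);
    }

    // Returns the stored keys that are whole-word prefixes of the phrase.
    public static string[] Matching(string phrase, params string[] storedKeys)
    {
        var regex = new Regex(Build(phrase));
        return storedKeys.Where(k => regex.IsMatch(k)).ToArray();
    }
}
```

For the phrase "turn on the", the keys "turn", "turn on" and "turn on the" all match, while "turn off" and "on" do not.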
Home network crawler – cataloging every file on the home LAN with C# and MongoDB
Aug 22nd

Map-Reduce in action: The glaciers in Greenland 'map' the canyon walls into streams of rocks called lateral moraine. As the glaciers merge these rocks are 'reduced' into streams in the middle called 'medial' moraine. (A photo I took over Greenland this summer.)
I’m not a huge fan of RAID arrays – they mostly mean there’s another component to go wrong (the controller card) and when they do go wrong you can lose all your data just as easily as if it were all on one drive. I prefer a multiple copy strategy, an “Amazon S3 for the home” if you like. The downside of this is that there are multiple copies of each file across the home network and, as I have several generations of hard drives, the mapping from primary to secondary to tertiary is complex and hard to manage! It’s also really hard to find a single file when there are so many places to look, and it’s nigh on impossible to be sure that I have the necessary three copies of every important file in the right places at all times.
So this weekend I embarked on a small project to catalog every file, directory and storage volume on the entire home network including drives that are only sometimes connected. The software has been running all weekend and is close to cataloging everything. It’s found 5 million files so far representing over 6TB of data!
The architecture I chose for this software was an agent that runs on each PC to catalog all of the attached volumes. This client uploads all the directories and files that it finds to a MongoDB database running on the same Atom server as the main storage array. The poor little Atom server’s 4GB of RAM has been in constant use but the server has remained responsive, in part because it boots from an SSD drive.
Each volume, directory and file is represented by a document in MongoDB in a single collection. The agent calculates an MD5 hash for each file and extracts metadata from MP3, WMA and JPG files. It also stores all of the key file dates (created, updated, accessed) and references to parent directories, volume identifiers and the currently connected PC. It does not assume that a volume is always connected to the same computer – you can unplug an external drive from one and put it somewhere else and it will all work just fine.
I implemented a re-startable tree scan that uses a couple of DateTime stamps to be able to determine which directories need to be scanned during the current pass and which ones have already been scanned. Any agent can be killed at any time and restarted and it will carry on walking the directory tree right where it left off. It will even continue correctly in the case where you move a volume from one PC to another.
Each agent uses the Task Parallel Library’s Parallel.ForEach to crawl each volume in parallel and to parse multiple files from each directory simultaneously.
By storing all of the file metadata in MongoDB it’s easy to use Map-Reduce to calculate some interesting statistics for the files on the network.
For example, to create a summary of file sizes I can use a Map function:
function Map() {
if (this.Size && this._t == "FileInformation")
{
var size = this.Size;
if (size < 1024)
emit ("kb", {count:1, size:this.Size});
else if (size < 1024*1024)
emit ("mb", {count:1, size:this.Size});
else if (size < 1024*1024*1024)
emit ("gb", {count:1, size:this.Size});
else if (size < 1024*1024*1024*1024)
emit ("tb", {count:1, size:this.Size});
else
emit ("tb+", {count:1, size:this.Size});
}
}
and a reduce function:
function Reduce(key, arr_values) {
var count = 0;
var size = 0;
for(var i in arr_values)
{
count = count + arr_values[i].count;
size = size + arr_values[i].size;
}
return {count:count, size:size};
}
Map-Reduce operations like this take about 20 minutes to run (on the Atom server with just 4GB of RAM) whereas any query serviced by one of the indexes on the MongoDB collection is almost instantaneous.
I’ve been using the excellent MongoVue to run simple map-reduce scripts like this and to keep track of how quickly the database is growing.
Map-reduce can also be used to find duplicate files – by emitting the MD5 hash as the key and some information about the file as the value I can find every copy of every file across every computer on the home network.
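The same grouping can be sketched in C# on the agent side. This is an in-memory stand-in using LINQ’s GroupBy rather than the MongoDB map-reduce itself: the hash plays the role of the emitted key and the file path the emitted value. DuplicateSketch and its method names are made up for the example, and file contents are passed in as byte arrays to keep it self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

public static class DuplicateSketch
{
    // Upper-case hex MD5 of some file content.
    public static string Md5Hex(byte[] content)
    {
        using (MD5 md5 = MD5.Create())
        {
            return BitConverter.ToString(md5.ComputeHash(content)).Replace("-", "");
        }
    }

    // 'Map' each file to (hash, path), then 'reduce' by grouping on the hash
    // and keeping only hashes that occur more than once.
    public static Dictionary<string, List<string>> FindDuplicates(
        IEnumerable<KeyValuePair<string, byte[]>> filesByPath)
    {
        return filesByPath
            .GroupBy(f => Md5Hex(f.Value), f => f.Key)
            .Where(g => g.Count() > 1)
            .ToDictionary(g => g.Key, g => g.ToList());
    }
}
```

Each entry in the result is one distinct file content together with every path where a copy of it lives.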
Since I have the file name and metadata for every file on the home network I can also easily find any file using MongoDB’s regex matching feature against the path.
The Hard Parts
For starters you’ll need a library that can handle long file names. Then you’ll need to fix it to provide at least the functionality that FileInfo and DirectoryInfo give you in .NET.
Next you’ll need to learn about reparse-points and hard-links and you’ll need to skip over them because with them in place the file system is not a tree; it’s a cyclical graph in which a simple crawler will quickly get confused or stuck.
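As a rough sketch of the skipping step, the walker below refuses to descend into any directory that carries the ReparsePoint attribute. It is a simplification, not the crawler’s actual code: it uses the standard System.IO classes (so no long-path support), does nothing about hard links or access-denied errors, and the names TreeWalkSketch, IsReparsePoint and Files are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TreeWalkSketch
{
    // True when the attributes mark a junction, symbolic link or mount point.
    public static bool IsReparsePoint(FileAttributes attributes)
    {
        return (attributes & FileAttributes.ReparsePoint) != 0;
    }

    // Walks the tree under 'root' with an explicit stack, never following
    // reparse points, so a link back up the tree cannot cause a loop.
    public static IEnumerable<string> Files(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            foreach (string file in Directory.EnumerateFiles(dir))
            {
                yield return file;
            }
            foreach (string sub in Directory.EnumerateDirectories(dir))
            {
                if (!IsReparsePoint(File.GetAttributes(sub)))
                {
                    pending.Push(sub);
                }
            }
        }
    }
}
```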
You’ll also want to store the NTFS file Id and the unique Volume ID for every file so you can track it when the file is moved or the removable drive is connected to a different computer.
So how well does it work?
This all seems to work really well. Nearly every volume has now been cataloged. It’s located about 5M files occupying over 6TB of space. The worst case offender for the number of copies of the same file is 100+. I’ve used the find feature in MongoDB to find a file I was missing and I’m better able to plan how to arrange directories and file generations across the various hard drives I have.
What’s next
Well, of course this needs to be connected to the home automation system and my Natural Language engine so you can ask “send a copy of IMG_0228 from last week to X” or “where are all the spreadsheets I created last year?” That will be fairly easy.
After that I hope to incorporate backup features into the agents too so they can automatically keep the required number of copies of each file according to its importance. I’d also like to set up a rotating set of external drives that go in the fire safe when not connected and when they are connected they get updated with the latest copies of all the important files.
I’d also like to be able to get the agents to move whole groups of directories around between drives as juggling the directory layout each time a new hard drive is added to the system is always a time consuming process.
Comments or Questions?
Does everyone else have a hard time managing multiple computers, hard drives, directories and multiple copies of files? What tools do you use to do this? Is there anything commercially available that I could have used instead? Would a tool like this be useful to you? Should I publish the code somewhere? Comments and questions are always welcome here or on twitter.
Stop writing rude software! Use LASTINPUTINFO instead.
Aug 19th
Can you imagine what life would be like if people behaved like software programs do?
You’d be working away on something when someone would interrupt, steal your attention, and demand a response. You’d be interrupted in the middle of sentences all the time and while you were dealing with one interruption someone else could come up and interrupt you again.
You wouldn’t put up with people like that so why do you put up with software that behaves that way?
Windows itself is one of the worst offenders. The dreaded dialog that explains that updates have been installed and that it wants to reboot, right this instant, has caused me significant inconvenience in the past: it steals focus, grabs the next return character, and assumes I really did want to reboot right now, right in the middle of a blog post!
There really is no excuse for writing rude software. Windows includes an API function, GetLastInputInfo, that fills in a LASTINPUTINFO structure with the time of the last keyboard or mouse input. With it you can tell whether the user is busy typing or moving the mouse, and delay your annoying toast pop-up (or worse, that focus-stealing modal dialog) until you think the user is ready for it. The C# code below shows how to use this API call to get the number of seconds since the last user input. Simply delay your notification or dialog until an appropriate time has passed (e.g. 5 seconds) and only then interrupt the user.
Background processing
Similarly if your background processing is hammering the disk drive you can make it more polite and throttle it back when the user is active on their computer. (You did, of course do all that background processing on a lower priority thread, didn’t you!)
One other area you might want to consider is using BITS to download files instead of hammering the user’s internet connection to fetch files in the background.
The Code
So here’s the code you should use from today to make your software polite:
using System;
using System.Runtime.InteropServices;
public static class Input
{
[DllImport("User32.dll")]
private static extern bool
GetLastInputInfo(ref LASTINPUTINFO plii);
private struct LASTINPUTINFO
{
public uint cbSize;
public uint dwTime;
}
/// <summary>
/// How many seconds since last user input
/// </summary>
public static double SecondsSinceLastInput()
{
LASTINPUTINFO lastInPut = new LASTINPUTINFO();
lastInPut.cbSize = (uint)System.Runtime.InteropServices.Marshal.SizeOf(lastInPut);
GetLastInputInfo(ref lastInPut);
uint idle = (uint)Environment.TickCount - lastInPut.dwTime;
return idle/1000.0;
}
}
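To show how the idle time might be used, here is a minimal sketch of a wait-until-idle helper. PoliteNotifier, WaitForIdle and their parameters are hypothetical names of my own; the idle source is passed in as a delegate so that Input.SecondsSinceLastInput from the listing above can be plugged in, and so that the helper can be exercised without Windows.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper, not part of any library.
public static class PoliteNotifier
{
    // Blocks until the user has been idle for at least idleThresholdSeconds,
    // or until maxWait has elapsed, whichever happens first.
    // Returns true if the idle threshold was reached, false on timeout.
    public static bool WaitForIdle(Func<double> secondsSinceLastInput,
                                   double idleThresholdSeconds,
                                   TimeSpan maxWait,
                                   TimeSpan pollInterval)
    {
        Stopwatch waited = Stopwatch.StartNew();
        while (secondsSinceLastInput() < idleThresholdSeconds)
        {
            if (waited.Elapsed >= maxWait)
                return false; // still busy; the caller decides what to do next
            Thread.Sleep(pollInterval);
        }
        return true;
    }
}
```

A caller might write PoliteNotifier.WaitForIdle(Input.SecondsSinceLastInput, 5.0, TimeSpan.FromMinutes(2), TimeSpan.FromMilliseconds(250)) and only then show its notification. Because the helper blocks, it belongs on a background thread rather than the UI thread.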
C# Natural Language Engine connected to Microsoft Dynamics CRM 2011 Online
Jun 5th
In an earlier post I discussed some ideas around a Semantic CRM.
Recently I’ve been doing some clean up work on my C# Natural Language Engine and decided to do a quick test connecting it to a real CRM. As you may know from reading my blog, this natural language engine is already heavily used in my home automation system to control lights, sprinklers, HVAC, music and more and to query caller ID logs and other information.
I recently refactored it to use the Autofac dependency injection framework and in the process realized just how close my NLP engine is to ASP.NET MVC 3 in its basic structure and philosophy! To use it you create Controller classes and put action methods in them. Those controller classes use Autofac to get all of the dependencies they may need (services like an email service, a repository, a user service, an HTML email formatting service, …) and then each method in them represents a specific sentence parse using the various token types that the NLP engine supports. Unlike ASP.NET MVC 3 there is no Route registration; the method itself represents the route (i.e. sentence structure) that is used to decide which method to call. Internally my NLP engine has its own code to match incoming words and phrases to tokens and then on to the action methods. In a sense the engine itself is one big dependency injection framework working against the action methods. I sometimes wish ASP.NET MVC 3 had the same route-registration-free approach to designing web applications (but also appreciate all the reasons why it doesn’t).
Another improvement I made recently to the NLP Engine was to develop a connector for the Twilio SMS service. This means that my home automation system can now accept SMS messages as well as all the other communication formats it supports: email, web chat, XMPP chat and direct URL commands. My Twilio connector to NLP supports message splitting and batching so it will buffer up outgoing messages to reach the limit of a single SMS and will send that. This lowers SMS charges and also allows responses that are longer than a single SMS message.
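The batching idea can be sketched roughly as follows. This is not the connector's actual code and it does not call Twilio at all; SmsBatcher and Pack are hypothetical names, and the sketch assumes a limit of 160 characters per message with a newline as the separator between batched messages.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not the author's connector and not a Twilio API.
public static class SmsBatcher
{
    // Packs outgoing messages into as few segments as possible, each at most
    // maxLength characters. Short messages are joined with a newline;
    // a message longer than maxLength is split across several segments.
    public static List<string> Pack(IEnumerable<string> messages, int maxLength = 160)
    {
        if (maxLength < 1) throw new ArgumentOutOfRangeException("maxLength");
        var segments = new List<string>();
        var current = new StringBuilder();

        foreach (string message in messages)
        {
            string remaining = message;
            // Flush the buffer if this message plus a separator will not fit.
            if (current.Length > 0 && current.Length + 1 + remaining.Length > maxLength)
            {
                segments.Add(current.ToString());
                current.Clear();
            }
            // Split an oversized message into full-length segments.
            while (remaining.Length > maxLength)
            {
                segments.Add(remaining.Substring(0, maxLength));
                remaining = remaining.Substring(maxLength);
            }
            if (current.Length > 0) current.Append('\n');
            current.Append(remaining);
        }
        if (current.Length > 0) segments.Add(current.ToString());
        return segments;
    }
}
```

Two short replies such as "Lights on" and "Heating set to 70" end up in a single segment, while one long reply is cut into as many segments as it needs.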
Using this new, improved version of my Natural Language Engine I decided to try connecting it to a CRM. I chose Microsoft Dynamics CRM 2011 and elected to use the strongly-typed, early-bound objects that you can generate for any instance of the CRM service. I added some simple sentences in an NLPRules project that allow you to tell it who you met, and to input some of their details. Unlike a traditional forms-based approach the user can decide what information to enter and what order to enter it in. The Natural Language Engine supports the concept of a conversation and can remember what you were discussing, allowing a much more natural style of conversation than some simple rule-based engines and even allowing it to ask questions and get answers from the user.
Here’s a screenshot showing a sample conversation using Google Talk (XMPP/Jabber) and the resulting CRM record in Microsoft CRM 2011 Online. You could have the same conversation over SMS or email.
Based on my limited testing this looks like another promising area where a truly fluent, conversational-style natural language engine could play a significant role. Note how it understands email addresses, phone numbers and such like and in code these all become strongly typed objects. Where it really excels is in temporal expressions where it can understand things like “who called on a Saturday in May last year?” and can construct an efficient SQL query from that.
Constrained parallelism for the Task Parallel Library
Sep 1st
When developing .NET applications there is often the need to execute multiple background processes, for example, fetching and rendering different size thumbnails for images. Typically you queue actions like these onto the thread pool. But in the case of thumbnail generation you typically want to fetch a base image first and then perform the resize operations on it. If five web pages each request a different thumbnail size simultaneously you may end up fetching the same image five times before processing it. Of course, you can add file-based locking around this to ensure that only the first one gets to fetch the data but it would be much better if you could instead instruct the Task Parallel Library to execute co-dependent tasks sequentially.
The new Task Parallel Library has continuations that allow one task to chain onto the end of a previous task, but you still need a way to track all the currently active tasks so that you can find the one to chain onto. In a multi-threaded ASP.NET environment that’s not so easy.
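For readers who have not used continuations, here is a minimal sketch of chaining one task onto another with ContinueWith. ContinuationDemo and the "fetch" and "resize" labels are placeholders of mine rather than anything from a real image pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical demo class for illustration only.
public static class ContinuationDemo
{
    // Runs a "fetch" step and then a "resize" step that only starts
    // after the fetch task has completed. Returns the order of execution.
    public static List<string> Run()
    {
        var log = new List<string>();
        Task fetch = Task.Factory.StartNew(() => log.Add("fetch"));
        // ContinueWith schedules the resize only once 'fetch' has finished.
        Task resize = fetch.ContinueWith(previous => log.Add("resize"));
        resize.Wait();
        return log;
    }
}
```

Chaining is easy when you hold a reference to the earlier task, as here. The hard part in a web application is finding that earlier task from another request, which is where a lookup keyed by some identifier comes in.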
Below is a TaskFactory that gives you constrained parallelism allowing you to queue up tasks in such a way that no two tasks with the same key will execute in parallel. To use it you simply create a new TaskFactorySequentiallyByKey and then call StartNewChainByKey() with a suitable key, e.g. “RENDERimage12345.jpg”. This method returns a normal Task object that you can Wait on or add more continuations. All the usual TaskFactory constructor options are provided so you can have a different TaskScheduler, common cancellation token, and other options.
Note also that it expects an Action<CancellationToken> not just a plain Action. This is so your Action can be polite and monitor the cancellation token to know when to stop early. If you don’t need that you can always pass in a closure that tosses the CancellationToken, i.e. (token) => MyAction().
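A small sketch of both options follows: an action that checks the token between units of work, and an adapter that wraps a plain Action and ignores the token. CancellationExamples, ProcessSteps and IgnoreToken are illustrative names of my own, and doStep stands in for whatever real work you do.

```csharp
using System;
using System.Threading;

// Hypothetical helpers for illustration only.
public static class CancellationExamples
{
    // A polite action: checks the token before each unit of work and throws
    // OperationCanceledException as soon as cancellation has been requested.
    public static void ProcessSteps(int steps, CancellationToken token, Action<int> doStep)
    {
        for (int i = 0; i < steps; i++)
        {
            token.ThrowIfCancellationRequested();
            doStep(i); // stand-in for one unit of real work
        }
    }

    // Adapts a plain Action to Action<CancellationToken> by ignoring the token.
    public static Action<CancellationToken> IgnoreToken(Action action)
    {
        return token => action();
    }
}
```

A lambda such as token => CancellationExamples.ProcessSteps(100, token, RenderOneTile) has the Action<CancellationToken> shape the factory expects, where RenderOneTile is a hypothetical method of yours.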
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.Diagnostics;
namespace Utility
{
/// <summary>
/// The TaskFactorySequentiallyByKey factory limits concurrency when actions are passed with the same key. Those actions are executed sequentially
/// and never in parallel.
/// </summary>
/// <remarks>
/// For example, you have an action to fetch an image from the web to a local hard drive and then render a specific size of thumbnail for it.
/// The action includes code to check if the original image is already on disk, if not it fetches it.
/// It then checks if the correct size thumbnail has been rendered, if not it renders it.
/// You want to be able to fire off requests for thumbnails from multiple different asp.net web pages and ensure that any two requests for the
/// same original image are executed sequentially so that the image is only fetched once from the web before both thumbnail renders run.
/// </remarks>
public class TaskFactorySequentiallyByKey : TaskFactory
{
/// <summary>
/// Tasks currently queued based on key
/// </summary>
Dictionary<string, Task> inUse = new Dictionary<string, Task>();
public TaskFactorySequentiallyByKey()
: base()
{
}
public TaskFactorySequentiallyByKey(CancellationToken cancellationToken)
: base(cancellationToken)
{ }
public TaskFactorySequentiallyByKey(TaskScheduler scheduler)
: base(scheduler)
{ }
public TaskFactorySequentiallyByKey(TaskCreationOptions creationOptions, TaskContinuationOptions continuationOptions)
: base(creationOptions, continuationOptions)
{ }
public TaskFactorySequentiallyByKey(CancellationToken cancellationToken, TaskCreationOptions creationOptions, TaskContinuationOptions continuationOptions, TaskScheduler scheduler)
: base(cancellationToken, creationOptions, continuationOptions, scheduler)
{ }
protected virtual void FinishedUsing(string key, Task taskThatJustCompleted)
{
lock (this.inUse)
{
// If the key is present AND it points to the task that just finished THEN we are done
// and can clear the key for the next task that comes in ...
if (this.inUse.ContainsKey(key))
if (this.inUse[key] == taskThatJustCompleted)
{
this.inUse.Remove(key);
Debug.WriteLine("Finished using " + key + " completely");
}
else
{
Debug.WriteLine("Finished an item for " + key);
}
}
}
/// <summary>
/// Queue an action but prevent parallel execution of items having the same key. Instead, run them sequentially.
/// </summary>
/// <remarks>
/// This allows you to, for example, queue up tasks to fetch an image from the web to a cache and render a thumbnail for it at different sizes
/// while ensuring that the image is only fetched to the cache once before each different size thumbnail is generated
/// </remarks>
public Task StartNewChainByKey(string key, Action<CancellationToken> action)
{
return StartNewChainByKey(key, action, base.CancellationToken);
}
/// <summary>
/// Queue an action but prevent parallel execution of items having the same key. Instead, run them sequentially.
/// </summary>
/// <remarks>
/// This allows you to, for example, queue up tasks to fetch an image from the web to a cache and render a thumbnail for it at different sizes
/// while ensuring that the image is only fetched to the cache once before each different size thumbnail is generated
/// </remarks>
public Task StartNewChainByKey(string key, Action<CancellationToken> action, CancellationToken cancellationToken)
{
CancellationToken combined = cancellationToken == base.CancellationToken ? base.CancellationToken :
CancellationTokenSource.CreateLinkedTokenSource(cancellationToken, base.CancellationToken).Token;
lock (inUse)
{
Task result;
if (inUse.TryGetValue(key, out result))
{
// chain the supplied action after it ...
result = result.ContinueWith((task) => action(combined), combined);
// And then schedule a completion check after that
result.ContinueWith((task) => FinishedUsing(key, task));
// Update the dictionary so that it tracks the new LAST task in line, not any of the earlier ones
inUse[key] = result;
Debug.WriteLine("Chained onto " + key);
return result;
}
// otherwise simply create it and start it after remembering that the key is in use
result = new Task(() => action(combined), combined);
inUse.Add(key, result);
// queue up the check after it
result.ContinueWith((task) => FinishedUsing(key, task));
Debug.WriteLine("Starting a new action for " + key);
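// And finally start it (Scheduler is null when the factory was created without one,
// and Task.Start rejects a null scheduler, so fall back to the current scheduler)
result.Start(this.Scheduler ?? TaskScheduler.Current);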
return result;
}
}
}
}
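The listing above combines the caller's token with the factory's own token using CancellationTokenSource.CreateLinkedTokenSource. Here is a minimal sketch of what that linking does, with made-up names (LinkedTokenDemo, factoryWide, perCall):

```csharp
using System;
using System.Threading;

// Hypothetical demo class for illustration only.
public static class LinkedTokenDemo
{
    // Returns true if cancelling only the factory-wide source
    // also cancels a token that is linked to both sources.
    public static bool FactoryCancelPropagates()
    {
        using (var factoryWide = new CancellationTokenSource())
        using (var perCall = new CancellationTokenSource())
        using (var linked = CancellationTokenSource.CreateLinkedTokenSource(
                   perCall.Token, factoryWide.Token))
        {
            bool before = linked.Token.IsCancellationRequested; // nothing cancelled yet
            factoryWide.Cancel();
            return !before && linked.Token.IsCancellationRequested;
        }
    }
}
```

Cancelling either source cancels the linked token, so an action is stopped whether the individual request or the whole factory is cancelled. A linked source is disposable, and the listing never disposes the ones it creates; in a long-running process it may be worth disposing each one when its task completes.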
Singleton tasks: A TaskFactory for the Task Parallel Library with ‘run-only-one’ semantics
Sep 1st
When developing .NET applications there is often the need to execute some slow background process repeatedly, for example fetching a feed from a remote site or updating a user’s last-logged-in time. Typically you queue actions like these onto the thread pool. But under load that becomes problematic: requests may be coming in faster than you can service them, the queue builds up, and you are now executing multiple requests for the same action when you only really needed to do one. Even when not under load, if two users request a web page that requires the same image to be loaded and resized for display you only want to fetch it and resize it once. What you really want is an intelligent work queue that can coalesce multiple requests for the same action into a single action that gets executed just once.
The new Task Parallel Library doesn’t have anything that can handle these ‘run-only-one’ actions directly but it does have all the necessary building blocks to build one by creating a new TaskFactory and using Task continuations.
Below is a TaskFactory that gives you ‘run-only-one’ actions. To use it you simply create a new TaskFactoryLimitOneByKey and then call StartNewOrUseExisting() with a suitable key, e.g. “FETCH/cache/image12345.jpg”. This method returns a normal Task object that you can Wait on or add more continuations. All the usual TaskFactory constructor options are provided so you can have a different TaskScheduler, common cancellation token, and other options.
Note also that it expects an Action<CancellationToken> not just a plain Action. This is so your Action can be polite and monitor the cancellation token to know when to stop early. If you don’t need that you can always pass in a closure that tosses the CancellationToken, i.e. (token) => MyAction().
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.Diagnostics;
namespace Utility
{
/// <summary>
/// A task factory where Tasks are queued up with a key and only one of that key is allowed to exist either in the queue or executing
/// </summary>
/// <remarks>
/// This is useful for tasks like fetching a file from backing store, or updating local information from a remote service
/// You want to be able to queue up a Task to go do the work but you don't want it to happen 5 times in quick succession
/// NB: This does not absolve you from using file locking and other techniques in your method to handle simultaneous requests,
/// it just greatly reduces the chances of it happening. Another example would be updating a user's last logged in data in a
/// database. Under heavy load the queue to write to the database may be getting long and you don't want to update it for the same
/// user repeatedly if you can avoid it with a single write.
/// </remarks>
public class TaskFactoryLimitOneByKey : TaskFactory
{
/// <summary>
/// Tasks currently queued based on key
/// </summary>
Dictionary<string, Task> inUse = new Dictionary<string, Task>();
public TaskFactoryLimitOneByKey()
: base()
{
}
public TaskFactoryLimitOneByKey(CancellationToken cancellationToken)
: base(cancellationToken)
{ }
public TaskFactoryLimitOneByKey(TaskScheduler scheduler)
: base(scheduler)
{ }
public TaskFactoryLimitOneByKey(TaskCreationOptions creationOptions, TaskContinuationOptions continuationOptions)
: base(creationOptions, continuationOptions)
{ }
public TaskFactoryLimitOneByKey(CancellationToken cancellationToken, TaskCreationOptions creationOptions, TaskContinuationOptions continuationOptions, TaskScheduler scheduler)
: base(cancellationToken, creationOptions, continuationOptions, scheduler)
{ }
protected virtual void FinishedUsing(string key, Task taskThatJustCompleted)
{
lock (this.inUse)
{
// If the key is present AND it points to the task that just finished THEN we are done
// and can clear the key so that the next task coming in using it will get to execute ...
if (this.inUse.ContainsKey(key))
if (this.inUse[key] == taskThatJustCompleted)
{
this.inUse.Remove(key);
Debug.WriteLine("Finished using " + key + " completely");
}
else
{
Debug.WriteLine("Finished an item for " + key);
}
}
}
/// <summary>
/// Queue only one of a given action based on a key. A singleton pattern for Tasks with the same key.
/// </summary>
/// <remarks>
/// This allows you to queue up a request to, for example, render a file based on the file name
/// Even if multiple users all request the file at the same time, only one render will ever run
/// and they can all wait on that Task to complete.
/// </remarks>
public Task StartNewOrUseExisting(string key, Action<CancellationToken> action)
{
return StartNewOrUseExisting(key, action, base.CancellationToken);
}
/// <summary>
/// Queue only one of a given action based on a key. A singleton pattern for Tasks with the same key.
/// </summary>
/// <remarks>
/// This allows you to queue up a request to, for example, render a file based on the file name
/// Even if multiple users all request the file at the same time, only one render will ever run
/// and they can all wait on that Task to complete.
/// </remarks>
public Task StartNewOrUseExisting (string key, Action<CancellationToken> action, CancellationToken cancellationToken)
{
CancellationToken combined = cancellationToken == base.CancellationToken ? base.CancellationToken :
CancellationTokenSource.CreateLinkedTokenSource(cancellationToken, base.CancellationToken).Token;
lock (inUse)
{
if (inUse.ContainsKey(key))
{
Debug.WriteLine("Reusing existing action for " + key);
return inUse[key]; // and toss the new action away
}
// otherwise, make a new one and add it ... with a continuation on the end to pull it off ...
Task result = new Task(() => action(combined), combined);
inUse.Add(key, result);
// queue up the check after it
result.ContinueWith((finished) => this.FinishedUsing(key, result));
Debug.WriteLine("Starting a new action for " + key);
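// and finally start it (Scheduler is null when the factory was created without one,
// and Task.Start rejects a null scheduler, so fall back to the current scheduler)
result.Start(this.Scheduler ?? TaskScheduler.Current);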
return result;
}
}
}
}
