The Blog of Ian Mercer.

MongoDB substring search with a difference


It's quite common to want to search a database for a key that starts with a given string. In SQL you have LIKE and in MongoDB you have regular expressions:

[javascript] db.customers.find( { name : { $regex : '^acme', $options: 'i' } } ); [/javascript]

But what if you want the inverse: to find the keys in the database that are themselves prefixes of the search string? For example, suppose you are parsing a block of text and you want to find phrases in the database that match the start of the current block of text. In SQL this is hard to do cleanly, but with MongoDB you can create a regular expression that matches either the first word, or the first two words, or the first three words, and so on.

We can construct a regular expression to do this. For a three-word phrase it might look something like: ^word1($| word2($| word3$))
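To see how that nested pattern behaves, here is a minimal JavaScript sketch that builds it from a phrase and tries it against a few candidate names in memory. The helper names `escapeRegex` and `buildPrefixPattern` and the sample names are made up for illustration, and the sketch assumes words are separated by single spaces.

```javascript
// Hypothetical helper: escape regex metacharacters inside a single word.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build ^word1($| word2($| word3$)) from a phrase.
function buildPrefixPattern(phrase) {
  const words = phrase.split(' ').filter((w) => w.length > 0).map(escapeRegex);
  if (words.length === 0) {
    return '^$'; // nothing to match against
  }
  let result = words[0];
  // Each additional word: either the name ends before it, or it must follow.
  for (const word of words.slice(1)) {
    result += '($| ' + word;
  }
  result += '$'; // the last word must end the name
  result += ')'.repeat(words.length - 1); // close each group that was opened
  return '^' + result;
}

const phrasePattern = buildPrefixPattern('acme rocket skates');
// phrasePattern is '^acme($| rocket($| skates$))'

// Sample names standing in for keys stored in the database (made-up data).
const candidateNames = ['acme', 'acme rocket', 'acme rocket skates', 'acme anvils', 'rocket'];
const matchedNames = candidateNames.filter((name) => new RegExp(phrasePattern).test(name));
// matchedNames is ['acme', 'acme rocket', 'acme rocket skates']
```

Only names that are whole-word prefixes of the phrase survive: 'acme anvils' shares the first word but then diverges, so it is rejected.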

Here's a C# method that can create the necessary regular expression:

[csharp]
// Requires: using System.Linq; using System.Text.RegularExpressions;

/// <summary>
/// This generates a regular expression that matches as much of the given phrase as it can from a string
/// i.e. a reverse prefix search where you want the database to supply the prefix and match it against your query
/// useful for matching 'as much as possible from a given input'
/// </summary>
private string generatePrefixRegex(string phrase, bool atStart)
{
    string[] bits = phrase.Split(' ');
    string first = bits[0];

    // Escape the first word too, so regex metacharacters in the input are matched literally
    string result = Regex.Escape(first);

    // At the start of a sentence, if the first character is upper cased, we should also be looking for a lowercased version of it
    if (atStart && first.Length > 0 && char.IsUpper(first[0]))
    {
        result = string.Format("({0}|{1}){2}",
            char.ToLowerInvariant(first[0]),
            char.ToUpperInvariant(first[0]),
            Regex.Escape(first.Substring(1)));
    }

    // Each additional word - either we end the string before it or we must include it
    foreach (var bit in bits.Skip(1))
    {
        result = result + "($| " + Regex.Escape(bit);
    }

    result = result + "$"; // last word must end string

    foreach (var bit in bits.Skip(1))
    {
        result = result + ")"; // close the expression
    }

    return "^" + result; // Must start at the start of a Name
}
[/csharp]
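Once you have a pattern of this shape, using it is just another `$regex` query. The JavaScript sketch below is only an illustration: it reuses the `customers` collection and `name` field from the first example, and the `longestMatch` helper and the sample documents are made up. Since several stored names can match the same text, one reasonable choice is to keep the longest.

```javascript
// A pattern of the shape described above, for the text "acme rocket skates".
const prefixPattern = '^acme($| rocket($| skates$))';

// Query document you could pass to db.customers.find(...) in the mongo shell.
const prefixFilter = { name: { $regex: prefixPattern } };

// Stand-in for the documents in the collection (made-up data).
const storedDocs = [
  { name: 'acme' },
  { name: 'acme rocket' },
  { name: 'acme anvils' },
];

// Simulate the query locally by applying the same pattern to each name.
const returnedDocs = storedDocs.filter((doc) => new RegExp(prefixFilter.name.$regex).test(doc.name));

// Hypothetical helper: keep the document whose name covers the most text.
function longestMatch(docs) {
  return docs.reduce(
    (best, doc) => (best === null || doc.name.length > best.name.length ? doc : best),
    null
  );
}

const bestDoc = longestMatch(returnedDocs);
// bestDoc.name is 'acme rocket'
```

Picking the longest name means the parser consumes as much of the input as the database can account for before moving on to the rest of the text.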
