A strongly-typed natural language engine (C# NLP)
Please visit nlp.abodit.com for information about my natural language engine. The information below is out of date.
Here is an explanation of the natural language engine that powers my home automation system. It is a strongly-typed natural language engine in which tokens and sentences are defined in code. It currently understands sentences to control lights, heating, music, sprinklers, ... You can ask it who called, or tell it to play music in a particular room. It tells you when a car comes down the drive, when the traffic is bad on I-90, when there's fresh snow in the mountains, when it finds new podcasts from NPR, ... and much more.
The natural language engine itself is a separate component that I hope one day to use in other applications.
Existing Natural Language Engines
- Have a large, STATIC dictionary data file
- Can parse complex sentence structure
- Hand back a tree of tokens (strings)
- Don’t handle conversations
C# NLP Engine
- Defines strongly-typed tokens in code
- Uses type inheritance to model ‘is a’
- Defines sentences in code
- Rules engine executes sentences
- Understands context (conversation history)
Sample conversation
Goals
- Make it easy to define tokens and sentences (not XML)
- Safe, compile-time checked definition of the syntax and grammar (no XML)
- Model real-world inheritance with C# class inheritance: ‘a labrador’ is ‘a dog’ is ‘an animal’ is ‘a thing’
- Handle ambiguity, e.g.
  play something in the air tonight in the kitchen
  remind me at 4pm to call john at 5pm
C# NLP Engine Structure
Tokens - Token Definition
- A hierarchy of Token-derived classes
- Uses inheritance, e.g. TokenOn is a TokenOnOff is a TokenState is a Token. This allows a single sentence rule to handle multiple cases, e.g. On and Off
- Derived from base Token class
- Simple tokens are a set of words, e.g. « is | are »
- Complex tokens have a parser, e.g. TokenDouble
A Simple Token Definition
public class TokenPersonalPronoun : TokenGenericNoun
{
    internal static string wordz { get { return "he,him,she,her,them"; } }
}
- Recognizes any of the words specified
- Can use inheritance (as in this example)
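One way such word lists could be turned into a lookup table is sketched below. This is only an illustration under assumptions: the `Token` and `TokenGenericNoun` classes are empty stand-ins, and `SimpleTokenIndex` is a hypothetical helper, not part of the engine described here. It uses reflection to read the static `wordz` property from each token class.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Empty stand-ins for the engine's base classes (assumed, not the real ones).
public abstract class Token { }
public class TokenGenericNoun : Token { }

public class TokenPersonalPronoun : TokenGenericNoun
{
    internal static string wordz { get { return "he,him,she,her,them"; } }
}

// Hypothetical helper: maps each word to the token type that declares it.
public static class SimpleTokenIndex
{
    public static Dictionary<string, Type> Build(Assembly assembly)
    {
        var index = new Dictionary<string, Type>(StringComparer.OrdinalIgnoreCase);
        var flags = BindingFlags.Static | BindingFlags.Public |
                    BindingFlags.NonPublic | BindingFlags.DeclaredOnly;

        foreach (var type in assembly.GetTypes().Where(t => typeof(Token).IsAssignableFrom(t)))
        {
            // Only classes that declare their own "wordz" list are simple tokens.
            var prop = type.GetProperty("wordz", flags);
            if (prop == null) continue;

            var words = (string)prop.GetValue(null);
            foreach (var word in words.Split(','))
                index[word.Trim()] = type;
        }
        return index;
    }
}
```

Calling `SimpleTokenIndex.Build(typeof(TokenPersonalPronoun).Assembly)` and looking up "her" yields the `TokenPersonalPronoun` type, which is also a `TokenGenericNoun` and a `Token` through inheritance.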
A Complex Token
public abstract class TokenNumber : Token
{
    public static IEnumerable<TokenResult> Initialize(string input) { …
- Initialize method parses input and returns one or more possible parses.
TokenNumber is a good example:
- Parses any numeric value and returns one or more of TokenInt, TokenLong, TokenIntOrdinal, TokenDouble, or TokenPercentage results.
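The idea of returning more than one parse for the same input can be sketched as follows. The result classes and `NumberParser` are hypothetical names chosen for this illustration; the real engine returns `TokenResult` values whose shape is not shown in this post, and the ordinal check here is deliberately naive.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical result types standing in for TokenInt, TokenDouble, etc.
public abstract class NumberParse { }
public sealed class IntParse : NumberParse { public int Value; }
public sealed class DoubleParse : NumberParse { public double Value; }
public sealed class PercentageParse : NumberParse { public double Fraction; }
public sealed class OrdinalParse : NumberParse { public int Value; }

public static class NumberParser
{
    // Yields every plausible reading of the input; yields nothing if it is not numeric.
    public static IEnumerable<NumberParse> Parse(string input)
    {
        var text = input.Trim().ToLowerInvariant();
        var inv = CultureInfo.InvariantCulture;

        // "50%" is read only as a percentage.
        if (text.EndsWith("%", StringComparison.Ordinal) &&
            double.TryParse(text.Substring(0, text.Length - 1), NumberStyles.Float, inv, out var pct))
        {
            yield return new PercentageParse { Fraction = pct / 100.0 };
            yield break;
        }

        // "3rd", "4th": naive ordinal check that does not validate the suffix.
        foreach (var suffix in new[] { "st", "nd", "rd", "th" })
        {
            if (text.EndsWith(suffix, StringComparison.Ordinal) &&
                int.TryParse(text.Substring(0, text.Length - 2), NumberStyles.None, inv, out var ord))
            {
                yield return new OrdinalParse { Value = ord };
                yield break;
            }
        }

        // "5" is ambiguous: it is both a valid int and a valid double.
        if (int.TryParse(text, NumberStyles.AllowLeadingSign, inv, out var i))
            yield return new IntParse { Value = i };
        if (double.TryParse(text, NumberStyles.Float, inv, out var d))
            yield return new DoubleParse { Value = d };
    }
}
```

Returning both readings of "5" lets a later stage pick whichever one fits a sentence rule, instead of forcing the tokenizer to guess.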
The catch-all TokenPhrase
public class TokenPhrase : Token
TokenPhrase matches anything, especially anything in quote marks
e.g. add a reminder "call Bruno at 4pm"
The sentence signature to recognize this could be
(…, TokenAdd, TokenReminder, TokenPhrase, TokenExactTime)
This would match the rule too …
add a reminder discuss 6pm conference call with Bruno at 4pm
TemporalTokens
A complete set of tokens and related classes for representing time
- Point in time, e.g. today at 5pm
- Approximate time, e.g. who called at 5pm today
- Finite sequence, e.g. every Thursday in May 2009
- Infinite sequence, e.g. every Thursday
- Ambiguous time with context, e.g. remind me on Tuesday (context means it is next Tuesday)
- Null time
- Unknowable/incomprehensible time
Code to merge any sequence of temporal tokens to the smallest canonical representation, e.g.
the first thursday in may 2009
-> {TIMETHEFIRST the first} + {THURSDAY thursday} + {MAY in may} + {INT 2009 -> 2009}
-> [TEMPORALSETFINITESINGLEINTERVAL [Thursday 5/7/2009] ]
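The final step of that example, collapsing "first" + "Thursday" + "May" + "2009" into a single date, might look roughly like the sketch below. `TemporalMerge.NthWeekdayOfMonth` is a hypothetical helper written for this illustration; the engine's actual merge code is not shown in this post.

```csharp
using System;

// Hypothetical helper, not the engine's merge code.
public static class TemporalMerge
{
    // Returns the date of the n-th occurrence of a weekday in a month (n = 1 for "the first").
    public static DateTime NthWeekdayOfMonth(int n, DayOfWeek day, int month, int year)
    {
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        var first = new DateTime(year, month, 1);
        // Days from the 1st of the month to the first occurrence of the requested weekday.
        int offset = ((int)day - (int)first.DayOfWeek + 7) % 7;
        var result = first.AddDays(offset + 7 * (n - 1));

        if (result.Month != month)
            throw new ArgumentOutOfRangeException(nameof(n), "No such weekday in this month.");
        return result;
    }
}
```

For the example above, `TemporalMerge.NthWeekdayOfMonth(1, DayOfWeek.Thursday, 5, 2009)` gives 7 May 2009, matching the `[Thursday 5/7/2009]` result.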
Finite TemporalClasses provide
- A way to enumerate the DateTimeRanges they cover
All TemporalClasses provide
- A LINQ expression generator and an Entity-SQL expression generator, allowing them to be used to query a database
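As a rough illustration of both capabilities, the sketch below enumerates the ranges covered by "every Thursday in May 2009" and builds a LINQ expression tree that tests whether a date falls in any of them. `DateTimeRange` is modelled here as a simple half-open interval and `TemporalSets` is a hypothetical helper; neither reflects the engine's real classes, and the Entity-SQL generator is not covered.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

// Assumed shape: a half-open interval [Start, End).
public sealed class DateTimeRange
{
    public DateTimeRange(DateTime start, DateTime end) { Start = start; End = end; }
    public DateTime Start { get; }
    public DateTime End { get; }
}

// Hypothetical helper, not the engine's API.
public static class TemporalSets
{
    // One whole-day range per matching weekday in the given month.
    public static IEnumerable<DateTimeRange> EveryWeekdayIn(DayOfWeek day, int month, int year)
    {
        for (var d = new DateTime(year, month, 1); d.Month == month; d = d.AddDays(1))
            if (d.DayOfWeek == day)
                yield return new DateTimeRange(d, d.AddDays(1));
    }

    // Builds "x => (s1 <= f(x) && f(x) < e1) || (s2 <= f(x) && f(x) < e2) || ..."
    // where f is the supplied date selector.
    public static Expression<Func<T, bool>> ToPredicate<T>(
        IEnumerable<DateTimeRange> ranges, Expression<Func<T, DateTime>> selector)
    {
        Expression body = null;
        foreach (var r in ranges)
        {
            var inRange = Expression.AndAlso(
                Expression.GreaterThanOrEqual(selector.Body, Expression.Constant(r.Start)),
                Expression.LessThan(selector.Body, Expression.Constant(r.End)));
            body = body == null ? inRange : Expression.OrElse(body, inRange);
        }
        // An empty set of ranges matches nothing.
        return Expression.Lambda<Func<T, bool>>(body ?? Expression.Constant(false), selector.Parameters);
    }
}
```

Because the result is an expression tree and not a compiled delegate, it could in principle be handed to a LINQ provider for translation into a database query; whether a given provider accepts this exact shape is not something this sketch verifies.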
Existing Token Types
- Numbers (double, long, int, percentage, phone, temperature)
- File names, Directories
- URLs, Domain names
- Names, Companies, Addresses
- Rooms, Lights, Sensors, Sprinklers, …
- States (On, Off, Dim, Bright, Loud, Quiet, …)
- Units of Time, Weight, Distance
- Songs, albums, artists, genres, tags
- Temporal expressions
- Commands, verbs, nouns, pronouns, …
Rules - A simple rule
/// <summary>
/// Set a light to a given state
/// </summary>
private static void LightState(NLPState st, TokenLight tlight, TokenStateOnOff ts)
{
    if (ts.IsTrueState == true)
        tlight.ForceOn(st.Actor);
    if (ts.IsTrueState == false)
        tlight.ForceOff(st.Actor);
    st.Say("I turned it " + ts.LowerCased);
}
Any method whose signature has the form (NLPState, Token*) is a sentence rule.
Rule matching respects inheritance and allows variable repeats through array parameters, e.g. (NLPState st, TokenThing tt, TokenState tokenState, TokenTimeConstraint[] constraints)
Rules are discovered on startup using reflection, and an efficient parse graph is built, allowing rapid detection and rejection of incoming sentences.
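A much-simplified sketch of discovery and inheritance-aware matching is shown below. It is not the engine's implementation: it scans rules linearly instead of building a parse graph, it ignores array (repeat) parameters, and all the types here (`NLPState` with a `Said` list, `SampleRules`, `RuleTable`) are hypothetical stand-ins.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

// Minimal stand-ins for the engine's types (assumed shapes).
public class NLPState { public List<string> Said = new List<string>(); }
public abstract class Token { }
public class TokenLight : Token { }
public class TokenState : Token { }
public class TokenStateOnOff : TokenState { }
public class TokenOn : TokenStateOnOff { }

public static class SampleRules
{
    private static void LightState(NLPState st, TokenLight light, TokenStateOnOff state)
    {
        st.Said.Add("LightState:" + state.GetType().Name);
    }
}

// Hypothetical helper: linear scan, no parse graph, no array parameters.
public static class RuleTable
{
    // A rule is any static method shaped like (NLPState, Token-derived, ...).
    public static List<MethodInfo> Discover(Type container)
    {
        return container
            .GetMethods(BindingFlags.Static | BindingFlags.Public | BindingFlags.NonPublic)
            .Where(m =>
            {
                var p = m.GetParameters();
                return p.Length >= 2
                    && p[0].ParameterType == typeof(NLPState)
                    && p.Skip(1).All(x => typeof(Token).IsAssignableFrom(x.ParameterType));
            })
            .ToList();
    }

    // Invokes the first rule whose parameter types accept the parsed tokens.
    public static bool TryExecute(IEnumerable<MethodInfo> rules, NLPState state, params Token[] tokens)
    {
        foreach (var rule in rules)
        {
            var p = rule.GetParameters();
            if (p.Length != tokens.Length + 1) continue;

            bool fits = true;
            for (int i = 0; i < tokens.Length; i++)
                fits &= p[i + 1].ParameterType.IsInstanceOfType(tokens[i]);
            if (!fits) continue;

            rule.Invoke(null, new object[] { state }.Concat(tokens).ToArray());
            return true;
        }
        return false;
    }
}
```

Passing a `TokenLight` and a `TokenOn` matches the `LightState` rule even though it is declared with `TokenStateOnOff`, which is the single-rule-handles-On-and-Off behaviour described earlier.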
State - NLPState
- Every sentence method takes an NLPState first parameter
- State includes RememberedObject(s) allowing sentences to react to anything that happened earlier in a conversation
- Non-interactive uses can pass a dummy state
- State can be per-user or per-conversation for non-realtime conversations like email
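To make the role of remembered objects concrete, here is a minimal sketch of a state object that lets a later sentence refer back to something mentioned earlier. `ConversationState`, `Light`, and `Demo` are hypothetical names for this illustration and do not reflect the real `NLPState` API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for NLPState.
public class ConversationState
{
    private readonly List<object> remembered = new List<object>();
    public List<string> Replies { get; } = new List<string>();

    public void Remember(object item) { remembered.Add(item); }

    // Most recently remembered object of the requested type, or null if none.
    public T MostRecent<T>() where T : class
    {
        return remembered.OfType<T>().LastOrDefault();
    }

    public void Say(string text) { Replies.Add(text); }
}

public class Light { public string Name; public bool IsOn; }

public static class Demo
{
    // Models "turn on the kitchen light": acts, then remembers the light.
    public static void TurnOn(ConversationState st, Light light)
    {
        light.IsOn = true;
        st.Remember(light);
        st.Say("I turned on the " + light.Name);
    }

    // Models "turn it off": resolves "it" from the conversation history.
    public static void TurnItOff(ConversationState st)
    {
        var light = st.MostRecent<Light>();
        if (light == null) { st.Say("Turn what off?"); return; }
        light.IsOn = false;
        st.Say("I turned off the " + light.Name);
    }
}
```

Keeping one such state per user or per conversation is what allows a follow-up like "turn it off" to work across separate messages, for example in an email thread.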
- Chat (e.g. Jabber/Gtalk)
- Web chat
- Email
- Calendar (do X at time Y)
- Rich client application
- Strongly-typed natural language engine
- Compile time checking, inheritance, …
- Define tokens and sentences (rules) in C#
- Strongly-typed tokens: numbers, percentages, times, dates, file names, urls, people, business objects, …
- Builds an efficient parse graph
- Tracks conversation history
-
- Company names, locations, documents, …
- From TimeExpressions