Neo4j Meetup in Seattle - some observations
I attended the Neo4j Meetup in Seattle this evening. It was an interesting tour around the internals of Neo4j and some of the design decisions behind how they store graphs in a database.
The most interesting thing about Neo4j is the Cypher query language used to construct graph queries that follow relationships, evaluate conditions on properties on relationships and nodes. Neo4j shows much promise in terms of being able to represent data in a very natural way and to query it using Cypher in ways that would bring SQL to its knees with join-upon-join-upon-join.
In an earlier blog post I lamented the lack of a single database solution that was the best of all worlds: relational + document + graph + semantic web. Tonight that feeling was compounded: Neo4j is a graph database but it's missing several key features that could make it much more.
We were privileged to get a first hand explanation as to how Neo4j worked internally but what we saw looked like a work in progress: an unfinished implementation of something that could be so much better. Here's some of the things Neo4j needs to fix before I'll give it a go:-
1) Stealing bits from one value to give to another to create odd word lengths like 23 bits is so 1980's. I cannot believe this is a worthwhile optimization to make in 2012. Neo should bite the bullet, upgrade their few existing customers and move to a more modern byte aligned, 64-bit address space. I was equally amazed at the implementation of compression schemes for text on disk but the omission of other obvious space-saving opportunities like declaring some relationships to be one-way only (no reverse queries, thus no need to store the back link). It's 2012: disk space is essentially limitless; I should never have to hit a file-size limit because someone decided to use 23, 28 or some other random number of bits instead of 64.
3) The lack of file splitting at 2GB/4GB boundaries.
4) Putting nodes and relationships into separate files. Sure this simplifies the access pattern but it's not going to give good locality to data on disk. An alignment based on disk block sizes with nodes and relationships packed into blocks seems likely to be a much better approach to minimizing disk seeks and reads.
3) Reliance on Lucene to provide indexing. Much as I appreciate Lucene, Neo4j needs built-in indexes; without them it's impossible to optimize query plans across the graph and the indexes. MongoDB has a good selection of indexing options including 2D geo-spatial indexing; IMHO Neo4j should adopt the same set of options and offer queries that are both good relational database queries and good graph queries not force their users to pick one or the other whilst handling the interop between two different systems.
In fact, in my ideal world Neo4j and MongoDB would just become one database: a document database that also has great graph-querying capabilities!
I'll keep monitoring Neo4j but in the meantime it's full speed ahead with my own implementation of a graph database in MongoDB with the added twist that in my implementation, relationships are all modeled as triples (just like in a semantic web triple-store). My graph-query language isn't likely to be as powerful as Cypher any time soon but I have indexes, the ability to query by relationships easily and a robust implementation of properties on each node with support for all common data-types and through my interface-based approach to storing objects with multiple-inheritance I get strongly-typed result sets in C#.