ThruDB for Rails? ActiveDocument
Since Matt Knox talked about ThruDB at last Tuesday’s NYC.rb meeting, I’ve been thinking about document-oriented databases: about how tired I am of SQL, about how tired I am of trying to scale database servers, about how tempting it is to have more flexible models and data structures, and about how tempting it is to have a clear and simple scalability path.
The samples included in the ThruDB tutorial are, to be honest, ugly. But they are designed to show how Thrift provides language-agnostic data types and how ThruDB can be accessed from different languages.
However, I have several ideas about how to implement something I’m calling, for the time being, ActiveDocument. It won’t be a direct replacement for ActiveRecord, but it will have similar features (e.g. validations and callback hooks) and it will allow for very simple usage of ThruDB. I might later add support for CouchDB, SimpleDB and other similar technologies, but just as Rails doesn’t try to be a full database server abstraction, your ActiveDocument code will not work on different servers unless it’s limited to very simple operations. The world of document-oriented databases is even less standardized than that of relational database servers.
Here’s a quick sketch of how it might look:
class User < ActiveDocument::Model
attribute :login, :string, :indexed, :sortable
attribute :email, :string, :indexed
attribute :created_on, :datetime
attribute :password, :string
has_many :bookmarks
end
class Bookmark < ActiveDocument::Model
attribute :title, :string, :indexed
attribute :url, :string, :indexed
belongs_to :user
end
User.find_by_login("sd")
User.find(:all, :conditions => "login:'s*' AND created_on:[20071201 TO 20080115]")
As you can see, the two biggest differences from plain old ActiveRecord are that the model will have to define its own schema, and that queries will use the Lucene query syntax.
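To make the schema idea concrete, here is a minimal sketch of how the attribute class macro could record a schema and define accessors. None of this is working ActiveDocument code: the schema and indexed_attributes methods and the internal @attributes hash are names I made up for illustration.

```ruby
# Hypothetical sketch of the `attribute` class macro; not actual ActiveDocument code.
module ActiveDocument
  class Model
    class << self
      # Schema declared so far: { name => { :type => ..., :options => [...] } }
      def schema
        @schema ||= {}
      end

      # Records the attribute in the schema and defines a reader and a writer.
      def attribute(name, type, *options)
        schema[name] = { :type => type, :options => options }
        define_method(name) { @attributes[name] }
        define_method("#{name}=") { |value| @attributes[name] = value }
      end

      # Names of the attributes flagged :indexed (what would be sent to the index).
      def indexed_attributes
        schema.select { |_, spec| spec[:options].include?(:indexed) }.keys
      end
    end

    def initialize(values = {})
      @attributes = {}
      values.each { |name, value| send("#{name}=", value) }
    end
  end
end

class User < ActiveDocument::Model
  attribute :login, :string, :indexed, :sortable
  attribute :email, :string, :indexed
  attribute :password, :string
end

user = User.new(:login => "sd", :email => "sd@example.com")
```

Because the schema lives in the class itself, the library would know which fields to index without asking the server.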
Relationships would be defined using fields with lists of IDs, and queried using Lucene’s fast indexes. This might make models too big when they have a large number of related objects, but that’s a problem to be solved later.
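Here is a rough sketch of what a relationship stored as a list of IDs could look like. It uses a plain in-memory hash (STORE) as a stand-in for the server and does not touch ThruDB's real API; the Doc base class and the bookmark_ids and add_bookmark names are assumptions for illustration only.

```ruby
# In-memory stand-in for the document store; ThruDB's real API is not used here.
STORE = {}

class Doc
  attr_reader :id, :fields

  def initialize(id, fields = {})
    @id = id
    @fields = fields
    STORE[id] = self
  end
end

class Bookmark < Doc
  # belongs_to :user, stored as a single ID field
  def user
    STORE[fields[:user_id]]
  end
end

class User < Doc
  # has_many :bookmarks, stored as a list of IDs inside the user document
  def bookmark_ids
    fields[:bookmark_ids] ||= []
  end

  def add_bookmark(bookmark)
    bookmark.fields[:user_id] = id
    bookmark_ids << bookmark.id
  end

  # One lookup per related document, since there are no joins.
  def bookmarks
    bookmark_ids.map { |bookmark_id| STORE[bookmark_id] }
  end
end

sd = User.new("user-1", :login => "sd")
sd.add_bookmark(Bookmark.new("bm-1", :title => "ThruDB", :url => "http://example.com/thrudb"))
sd.add_bookmark(Bookmark.new("bm-2", :title => "Thrift", :url => "http://example.com/thrift"))
```

The ID list grows with every related object, which is exactly the model-size problem mentioned above.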
Since document-oriented databases have no concept of joins, some queries will definitely be slower than their SQL counterparts, having to make multiple calls to the server to retrieve individual objects. However, each one of those calls would be simpler and easier to cache, which I hope will reduce the performance impact. And as long as it’s not 100 times slower, I’m willing to trade off some performance for the promise of infinite scalability.
And since the models will be more flexible, you can probably skip a lot of traditional SQL tables and store the data directly in the model itself. For example, users can have preference arrays or hashes, which would be separate tables in SQL but are just additional attributes in ThruDB.
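As an illustration of why simple per-ID calls are easy to cache, here is a sketch of a read-through cache in front of a store. CountingBackend and CachedStore are hypothetical names; the backend is just a hash that counts round trips, standing in for real server calls.

```ruby
# Hypothetical read-through cache in front of a document store.
class CountingBackend
  attr_reader :calls

  def initialize(documents)
    @documents = documents
    @calls = 0
  end

  # Stands in for one round trip to the server.
  def fetch(id)
    @calls += 1
    @documents[id]
  end
end

class CachedStore
  def initialize(backend)
    @backend = backend
    @cache = {}
  end

  # Returns the cached document, hitting the backend only on a miss.
  def fetch(id)
    return @cache[id] if @cache.key?(id)
    @cache[id] = @backend.fetch(id)
  end

  # Replaces a join with one simple lookup per ID.
  def fetch_all(ids)
    ids.map { |id| fetch(id) }
  end

  # Must be called whenever a document is written or deleted.
  def expire(id)
    @cache.delete(id)
  end
end

backend = CountingBackend.new({ "bm-1" => { :title => "ThruDB" }, "bm-2" => { :title => "Thrift" } })
store = CachedStore.new(backend)
store.fetch_all(["bm-1", "bm-2"]) # two round trips
store.fetch_all(["bm-1", "bm-2"]) # served from the cache
```

In a real application the cache would more likely be something shared like memcached, but the key is the same: a lookup by ID has an obvious cache key, while an arbitrary SQL join does not.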
Speaking of attributes: ThruDB uses Thrift for its own API, and the tutorials suggest using it to encode the documents themselves, but the API doesn’t require that. I’ve been trying to figure out how to encode a Thrift object along with its own class name, to make it easier to decode afterwards, especially when performing polymorphic queries. Perhaps I’ll have to use double encoding, with an envelope Thrift object containing the class name and the encoded string. Or perhaps I’ll use YAML to encode an attribute hash. YAML is tempting because it will allow for more complex objects and for dynamic schemas (e.g. an attribute that’s a hash of hashes containing values of different types).
Anyway, I’m starting to write the code, and it looks like it might be possible to have a working prototype a lot sooner than I thought possible at first.
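A minimal sketch of the YAML variant of the envelope idea, assuming a hypothetical Document base class with encode and decode methods: the envelope stores the class name next to the attribute hash, so a decoder can rebuild the right class without knowing it in advance.

```ruby
require "yaml"

# Hypothetical envelope format: class name plus attribute hash, serialized as YAML.
class Document
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end

  # Wraps the attributes in an envelope that records the class name.
  def encode
    YAML.dump({ "class" => self.class.name, "attributes" => attributes })
  end

  # Reads the envelope and instantiates the class it names.
  # Only Document subclasses are accepted.
  def self.decode(string)
    envelope = YAML.safe_load(string)
    klass = Object.const_get(envelope["class"])
    unless klass.is_a?(Class) && klass <= Document
      raise ArgumentError, "not a Document class: #{envelope["class"]}"
    end
    klass.new(envelope["attributes"])
  end
end

class User < Document; end
class Bookmark < Document; end

encoded = User.new({ "login" => "sd", "preferences" => { "theme" => { "font_size" => 12 } } }).encode
decoded = Document.decode(encoded)
```

The sketch uses string keys because YAML.safe_load rejects symbols by default. The nested preferences hash shows the kind of dynamic schema that a fixed Thrift struct would not handle as easily.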
If you’re interested, just drop me a note, leave a comment, send me an email or look for me as ’sd’ on Freenode’s #nyc.rb.
January 11th, 2008 at 12:27 am
This is a great idea.
I’d be happy to discuss your integration with Thrudb. Drop me a note.
Also, I live close to NYC. I can try and be there at your next meeting…
Thanks to you and Matt for your interest in Thrudb.
January 11th, 2008 at 3:51 am
I started on this a while back from the CouchDB angle:
class Post
January 11th, 2008 at 2:46 pm
Yeah, I started work on a similar mapping (also called ActiveDocument, natch) today. May the best AD win! (and the loser buy beer for both!) :)
Thanks for your kind words about the talk. I’m a bit overwhelmed at how good the response has been.
January 12th, 2008 at 4:04 pm
Hey, I started work on one last night too. We should all compare notes and just write one lib.
Also, I’d suggest not bothering with validations and callbacks in ActiveDocument, and rather write them in ActiveModel, an extraction from ActiveRecord that I started on.
I’m gonna post mine to git and do a quick blog post. Hopefully we can meet later in irc/campfire/whatever and join forces :)
January 12th, 2008 at 4:24 pm
Sebastian, this looks great. I was reading your post over again when something occurred to me: part of the beauty of ActiveRecord is that it provides you, in most cases, with a database-agnostic way of querying data and structuring it. I realize that you’re not trying to provide a catch-all abstraction layer - but how cool would something like that be?
You could swap MySQL for PostgreSQL, CouchDB, or ThruDB with minimal changes to your models.
Of course there would be edge cases where you would have to write specific queries in Lucene syntax or SQL. How often do people use find_by_sql in Rails apps, though (actually, I really want to know this)?
Borrowing from your example above, do I really care about how the conditional should be structured here:
User.find(:all, :conditions => "login:'s*' AND created_at:[20071201 TO 20080115]")
I really just want to say find all users where the login starts with “s” that were created between dec 01 of last year and the 15th of this month.
Basically, I guess what I’m proposing is twofold:
1. something like ActiveRecord::Extensions for your ActiveDocument - this would be easier than a bigger abstraction layer and would help developers adopt DocDBs because it removes most of the need to learn Lucene syntax. Joins are of course a more difficult issue that I would need to give more serious thought to.
2. my pie in the sky (heh) dream is the bigger abstraction layer of course, which would satisfy my above-expressed desire to be able to swap out data storage backends with minimal code changes or restructuring. Imagine realizing that you are using an RDBMS but would get better scaling, at a way cheaper cost, by switching to a DocDB system. Then, oh I don’t know, some need changes in the project and you realize, crap, I need MySQL for this.
Maybe this wouldn’t require a totally agnostic abstraction layer (as in a replacement for AR and AD), but just an interface used in models that results in your application not really caring what your storage system is; a config parameter that specifies:
config.data_store = :active_record
config.data_store = :active_document
much like the current directive:
config.action_controller.session_store = :active_record_store
Something else just came to me - what if you continue using a DocDB system but switch from Lucene to something else, with a different query language syntax?
Anyway those are just my immediate thoughts. This is a really interesting endeavor and I’d love to hack away at it with you. We should figure out a good time to meet up with Paul Dix - he said next week after Wednesday was good for him I believe?
January 12th, 2008 at 8:51 pm
[…] http://www.notsostupid.com/blog/2008/01/10/thrudb-for-rails-activedocument/ […]
February 7th, 2008 at 7:11 pm
Jacqui, sounds like a job for ambition: http://ambition.rubyforge.org/
It’s a project to allow querying data sources using the Enumerable API.