1: We are here at RubyFringe with Damien Katz. How about you introduce yourself?
I am the creator and project leader of CouchDB. I currently work for IBM. Before that I worked for MySQL and before that I worked at IBM again on a project called Lotus Notes.
2: So what's this CouchDB thing that we see on your T-shirt?
So what is CouchDB? CouchDB is a document database; it's a replicated document database with a REST interface, which means that it is accessible over HTTP, using the standard GET, PUT and POST verbs. And it uses JavaScript as its query language to create views of your data. It's schemaless, unlike SQL databases where you have to define a bunch of tables with all the data types and sizes. With CouchDB each document is its own independent object and it can be any sort of JSON structure. That's kind of a technical description, but what is it good for? It's good for building lots of collaborative applications, lots of web applications which generally are centered around documents: contacts, to-dos, bug reports, things like that. And that's the sort of stuff CouchDB excels at.
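As a rough illustration of the "JSON documents over HTTP verbs" idea described above, here is a minimal Python sketch. The server address, the database name `todos`, the document IDs and all field names are assumptions made up for the example; the requests are only constructed, not sent.

```python
import json
from urllib.request import Request

# Hypothetical server address and database name, for illustration only.
BASE_URL = "http://localhost:5984/todos"

def put_document(doc_id, doc):
    """Build a PUT request that would store `doc` under `doc_id`."""
    body = json.dumps(doc).encode("utf-8")
    return Request(f"{BASE_URL}/{doc_id}", data=body, method="PUT",
                   headers={"Content-Type": "application/json"})

def get_document(doc_id):
    """Build a GET request that would fetch the document by its ID."""
    return Request(f"{BASE_URL}/{doc_id}", method="GET")

# Two documents with different shapes: no table definition is needed first.
todo = {"type": "todo", "title": "Buy milk", "done": False}
bug = {"type": "bug", "summary": "Crash on save",
       "steps": ["open", "save"], "severity": 2}

requests_to_send = [
    put_document("todo-1", todo),
    put_document("bug-7", bug),
    get_document("todo-1"),
]
for request in requests_to_send:
    print(request.get_method(), request.full_url)
```

Passing any of these objects to `urllib.request.urlopen` would actually send it, assuming a server is listening at that address.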
3: How would you compare it to other products like Lotus Notes which I think is also document based in a way. Or XML databases which are similar, I think. How would you put it in that area?
So, it's most like Lotus Notes. Because I worked so many years on Lotus Notes, I got a really good grasp of the whole Lotus Notes platform and what is actually good about it. There was a whole lot of crap piled on Lotus Notes and a lot of people really disliked it, but it's been successful for a reason. It's been around for a long time and it's still got like a hundred million users. So there is something there, and I felt like I had a pretty good idea of the core of Lotus Notes, what was actually powerful about it, so that's what I tried to extract and put into CouchDB. It was that document model. So it definitely works most like Lotus Notes. XML databases, where you have these very, very large single XML documents, aren't really quite the same model, so I don't know. I haven't used them that much but I know they are kind of different.
4: So basically CouchDB works by you inserting a lot of unstructured documents. How do you search in them? How do you index them? I think you use JavaScript for it?
Yes. Generally each document won't have a random structure. The documents will have some sort of predefined structure, but it's not enforced by the database; it's enforced by the application layer. Eventually we will have hooks in the database where you can enforce, when documents are saved, that they adhere to a specific format or schema. But generally speaking the documents don't have to follow any sort of schema; that's up to the application. So it gives you a lot of flexibility in how you want to display the data. If you have a discussion database, for example, and you want to display the main topics, or you want to display the comments, or you want to display all the documents by a certain user, it's really easy to do that with CouchDB.
5: How would you do a query in CouchDB? What do you use? What language?
You use JavaScript and you create these views. Each document has its own ID, which can be any sort of string or can just be randomly generated, so if you know the name of the document you can just ask for it: do a GET for the document. If you want to create a view of all the documents of a certain type, then you create a JavaScript function, and that function will be used by the view engine and it will be fed every document in the database, and your function then decides what information in that document it wants to emit into the view. And then it emits a key and a value, and the key sorts the value into the view. So for every document that the function gets run over, it can emit whatever it likes. And then those values in the view are collected and displayed. And then you do a GET on the view and you get back a JSON response of all these nicely formatted results.
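The map-and-emit idea described above can be simulated in a few lines. This is a Python sketch of the concept only, not CouchDB's engine (where the function would be written in JavaScript); the sample documents, the field names and the helper `build_view` are all hypothetical.

```python
import json

# Hypothetical sample documents; the field names are made up for illustration.
docs = [
    {"_id": "a1", "type": "comment", "author": "bob", "text": "Nice post"},
    {"_id": "b2", "type": "topic", "author": "alice", "title": "Welcome"},
    {"_id": "c3", "type": "comment", "author": "alice", "text": "Thanks"},
]

def comments_by_author(doc):
    """Map function: emit (key, value) pairs only for comment documents."""
    if doc.get("type") == "comment":
        yield doc["author"], doc["text"]

def build_view(map_fun, documents):
    """Feed every document to map_fun and sort the emitted rows by key."""
    rows = []
    for doc in documents:
        for key, value in map_fun(doc):
            rows.append({"id": doc["_id"], "key": key, "value": value})
    rows.sort(key=lambda row: (row["key"], row["id"]))
    return rows

view = build_view(comments_by_author, docs)
print(json.dumps(view, indent=2))
```

The topic document emits nothing, so it never appears in the view; the two comments come back ordered by their emitted key.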
6: Do these run for every query? Or do they run through all the documents, instead of caching?
It runs over all the documents, but it keeps a persisted index so that every time you run it, it just has to use the index. When documents are updated, say five documents are updated, it doesn't have to run over all the documents in the database to recreate this index. Instead, it just figures out which documents have changed, recomputes their results, eliminates the old results and adjusts the index, and then you can query it. All that happens automatically; all you do is create the view definition and access it, and CouchDB handles doing all these things for you.
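A small Python sketch of that incremental behaviour, under simplifying assumptions: the index lives in an in-memory dictionary rather than being persisted, deletions are ignored, and the class name `IncrementalView` and all document fields are hypothetical.

```python
class IncrementalView:
    """Keeps the rows emitted per document so only changed docs are re-mapped."""

    def __init__(self, map_fun):
        self.map_fun = map_fun
        self.rows_by_doc = {}   # document ID -> list of (key, value) rows
        self.map_calls = 0      # counts how often the map function ran

    def update(self, changed_docs):
        """Re-run the map function for the changed documents only."""
        for doc in changed_docs:
            self.map_calls += 1
            # Assigning replaces (eliminates) any old rows for this document.
            self.rows_by_doc[doc["_id"]] = list(self.map_fun(doc))

    def query(self):
        """Return all rows as (key, value, doc_id) tuples sorted by key."""
        rows = [(key, value, doc_id)
                for doc_id, emitted in self.rows_by_doc.items()
                for key, value in emitted]
        return sorted(rows)

def by_type(doc):
    yield doc["type"], doc["title"]

view = IncrementalView(by_type)
all_docs = [{"_id": str(i), "type": "todo", "title": f"task {i}"}
            for i in range(100)]
view.update(all_docs)                      # initial build touches every document
calls_after_build = view.map_calls

# One document changes: only that document is mapped again.
view.update([{"_id": "7", "type": "bug", "title": "crash on save"}])
calls_for_update = view.map_calls - calls_after_build
```

After the second `update`, the old "todo" row for document 7 is gone and the view still holds one row per document.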
7: You chose an interesting language for implementing CouchDB, Erlang. What's the main reason for that?
Yes, so the original versions of CouchDB were written in C++, and I kind of hit the wall. I had a storage engine, I had a view engine and I had a query language that I had written in C++, and I hit the wall with the concurrency issues. I always had to do conventional threading with locks and messaging and things like that. And I read about Erlang on Lambda the Ultimate or something like that, about it being a really good concurrent language, so I decided I was going to figure out how I could integrate that with my code base. So I downloaded it and played around with it, and it didn't take long before I decided that it was perfectly suited for writing a database engine and server, and I threw away all my C and C++ code and rewrote everything in Erlang, and it's been fantastically productive for that. It's excellent for infrastructure type stuff; it's designed for telecom. Telecom has a lot of the same issues that you have with databases: lots of input and output, it has to be reliable, it has to deal with failure gracefully. So it ended up being a perfect language for that.
8: How do you integrate JavaScript? Is the JavaScript runtime in Erlang? Is that the same process?
So we have the Mozilla SpiderMonkey engine, and what we have is a separate executable, a command line process that links in the Mozilla SpiderMonkey engine, and it talks to CouchDB over standard IO. So CouchDB actually spawns instances of this process, and then CouchDB sends JSON over standard out and gets the response via the pipe that it creates. Then, if the JavaScript process hangs or uses too much memory or whatever and it gets killed by the OS, the Erlang VM is still fine and we recover with no problem.
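The process-isolation idea described above can be sketched in Python. Everything here is a stand-in: the child is a tiny Python script rather than a JavaScript engine, and the helper names `spawn_worker` and `ask` are hypothetical. The point is only that the parent survives when the worker is killed and can simply start a new one.

```python
import json
import subprocess
import sys

# Stand-in child program: reads one JSON value per line and echoes it back.
CHILD_SOURCE = r"""
import json, sys
for line in sys.stdin:
    request = json.loads(line)
    print(json.dumps({"echo": request}), flush=True)
"""

def spawn_worker():
    """Start the child with pipes attached to its standard input and output."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )

def ask(worker, request):
    """Send one JSON line to the worker and read one JSON line back."""
    worker.stdin.write(json.dumps(request) + "\n")
    worker.stdin.flush()
    return json.loads(worker.stdout.readline())

worker = spawn_worker()
first = ask(worker, ["ping", 1])

# Simulate the OS killing a runaway worker; the parent process is unaffected.
worker.kill()
worker.wait()

# Recover by spawning a fresh worker and carrying on.
worker = spawn_worker()
second = ask(worker, ["ping", 2])
worker.stdin.close()   # end of input lets the child exit cleanly
worker.wait()
```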
9: This protocol that you use over the pipe, that's just JSON messages?
It's just JSON getting pushed back and forth. So you have a command, and the command acts like a verb; it will just be a string in an array, and then the arguments will be the subsequent elements in the array. It pushes that across, and then it gets parsed on the other side, which figures out what you are asking for, computes the result and sends it back. And the nice thing about having a simple line-based protocol like that is that we can swap out the language backend. So right now the standard engine is JavaScript, but we already have language backends for Ruby and Python, and maybe other languages; I think somebody wrote one for PHP too. So you can write your queries in whatever language you want. I know that it has actually become very popular with the Python community right now.
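To make the "command string first, arguments after" shape concrete, here is a minimal Python sketch of a handler for one request line. The command names `reset`, `add_key` and `map_doc` as used here, and the behaviour behind them, are illustrative assumptions and should not be read as CouchDB's actual wire format.

```python
import json

def handle_line(line, state):
    """Parse one JSON request line and return one JSON response line."""
    command, *args = json.loads(line)   # first element is the verb
    if command == "reset":
        state.clear()
        result = True
    elif command == "add_key":
        # Hypothetical command: register a field whose value becomes a view key.
        state.setdefault("fields", []).append(args[0])
        result = True
    elif command == "map_doc":
        doc = args[0]
        # One list of [key, value] rows per registered field.
        result = [[[doc[field], doc["_id"]]] if field in doc else []
                  for field in state.get("fields", [])]
    else:
        result = {"error": "unknown_command", "command": command}
    return json.dumps(result)

state = {}
for request in ['["reset"]',
                '["add_key", "author"]',
                '["map_doc", {"_id": "c3", "author": "alice"}]']:
    print(request, "->", handle_line(request, state))
```

Because each request and response is a single line of JSON, the same loop could be written in any language that can read standard input and parse JSON, which is what makes the backend swappable.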
10: You are using Erlang's features for reliability. I figure you liked that for CouchDB. Do you also use sort of scalability features?
Definitely for concurrency. Somebody did some early benchmarks on CouchDB and they were getting probably twenty thousand simultaneous connections. That was pretty impressive. And we haven't even done any profiling yet. Definitely Erlang helps us in that area. If I had written this using a conventional threading model, you would be lucky to get five hundred active connections, so Erlang definitely helps with single-machine scalability. Erlang will also help with multi-machine scalability, but we are not really using Erlang for that yet. It has a whole lot of tools and libraries to allow for multi-machine Erlang environments, for automated failover and efficient messaging and things like that. We just haven't taken advantage of it yet.
11: Did you see any benefits from Erlang and SMP versions? Or didn't that make any difference?
I haven't profiled it. I think the person who profiled it did so with the multiprocessor version running, so I don't think anybody has actually profiled it with the single-processor version of Erlang.
12: Recently, or some time ago, you wrote a blog post about what you didn't like about Erlang. So what's your current position on that? Did anything change?
Yes, most of those complaints are still there, but every language has things that you dislike about it. Some of the things in Erlang are old. If you were designing it today, you wouldn't make these decisions, and some of it is kind of inherent to the programming paradigm: you can't fix it without breaking other things that are very right with Erlang. So there are things that are frustrating about it. It has very poor string handling, which is something that I think could be improved. But if your problems don't fit well into the functional paradigm, then maybe you should just use a different language.
13: What are your top three complaints: the string handling, Unicode support, something like that?
I think string handling is the issue that is always there. It's really inefficient right now. Not only is it cumbersome doing a lot of the string stuff that would be much easier and much cleaner in a language like Ruby or Python, so the code is harder to write, it's also slower. So yes, that's an issue. But we are also trying to address that; we are trying to use a different style of strings. Erlang has list strings, where each character is an element in a list, and it actually ends up taking sixteen bytes just to store a single character. And then it has binary strings, which are more like conventional strings in other programming languages. The syntax for those is ugly of course, but I think we are going to switch over to them anyway because it's way more efficient.
14: Well, coming back to the document database. What are document databases, like Lotus Notes for instance, particularly good for? What is Lotus Notes, for instance, used for?
A good analogy I like to use, as an exercise for you to figure out what's a good application for a document database: if you weren't doing this application on a computer, if it weren't computerized, how would you do it in the real world? If it ends up being lots of pieces of paper that are filed away and passed around to different people, that's a really good indication that a document database is the right place. If it ends up being the kind of problem like a paint program or a single spreadsheet, it's all locked down. For an accountant, what do they call them? They are spreadsheets on paper. Those are the things that can't be split up; they have to be a single document. In those cases it probably should be a single application. But if you have a bunch of these documents that are constantly getting spread around, like to-do lists, bug lists, customer complaints, these are the kinds of things that in the real world would be generating stacks of paper. And that's when you should really start to consider that maybe a relational database isn't the ideal place and maybe a document database is. And with CouchDB's nature, where you can actually take the documents with you offline, edit them, and then later when you are online replicate the changes back, that's definitely something that is very difficult to do with a relational database. So any time you need that offline capability to access your data and edit your data, that's when a document database like CouchDB would really excel.
15: That's also what Lotus is used for in big companies.
Yes, Lotus is still extensively used. People a lot of the time think of Lotus as just an email platform, but it's actually an application development platform for documents.
16: Did it start out this way, or did it start as email?
It started off that way, and one of the first applications built on top of it, as a document-oriented database, was email. So it is email, and that's one of its top applications, but it's just another application in the Lotus Notes stack. Lotus also has all these other applications, so you can do bug tracking and customer reports and CRM type stuff, and it's used extensively by a lot of companies for that sort of thing.
17: So would you say that this document database concept, which has been around for some time with Lotus Notes, has been available in other products? Or does it just seem to be becoming popular with CouchDB?
Exchange wanted to be something like that a long time ago. They had the Exchange server, and they had this concept of shared folders where you were supposed to build the applications on that. And that never really worked out; nobody really used it for that. So there have been other attempts to build things like that, and then of course anything on the web like SharePoint works very much like Lotus Notes, but it's a single-instance web server version of Lotus Notes and you use your browser as the client. It's still doing a lot of the same things; it's still a very document-oriented type of environment. So there have been other things, but Lotus Notes is in my mind the only thing that really got it right, even though it got a whole lot wrong in addition to that. I always thought that Lotus Notes was still kind of unique in the marketplace, and that is why I really wanted to build CouchDB, because I thought that model had been underappreciated and underexplored.
18: You want to bring that document model to open source. There seem to be a lot of other database paradigms competing with relational databases, like Google's BigTable. What do you think about that?
BigTable - I don't really quite get the benefits of it, other than scalability. I definitely see the benefits in that way but for most applications that people want to build, they don't need that sort of scalability. It's kind of a limited platform but I haven't actually used it, so I don't know, I just read some of the complaints.
19: Do you see it as an alternative to a document based concept or do you see it as a different model do you think?
I do think it's a different model; they don't have the same view model. The real key to CouchDB is the view model, where you can create these views that are generated to index all your data. And BigTable is just this big key-value table store, and I'm not really sure that's powerful enough to build interesting applications.